[LJ Genie] Using Scrapy programmatically
Has anyone reading this used Scrapy, the Python HTML-scraping framework, programmatically as part of a larger system? I'm interested in using it to replace BeautifulSoup in a project I'm working on which involves extracting specific, XPath-targetable tags from the contents of a whole bunch of different URLs. BeautifulSoup can do it, but the CPU and memory load is really heavy and I'd like to find a lighter-weight solution. (Scrapy supports XPath out of the box, which was a great design decision on their part.)
The specific problem I'm having with Scrapy is that despite the fact that it supports writing custom scrapers, it's designed as a command-line-driven tool to the exclusion of anything else. I want to instantiate a scraper from within a routine, run it, and hand the contents of the tags it collects off to another routine all within the same process, without having to invoke a separate process or touch the disk -- this system has to consume a lot of network data and I can't afford for it to become I/O bound. (I can queue the inbound network data -- in fact, since my current architecture is completely synchronous, I already am -- but not having to do so is preferable. Scrapy is asynchronous and that's a plus.)
Since it's written in Python, I can trace the control flow and figure out what specific pieces I need to import and/or customise to get it to do what I want, but it's a pretty densely layered system and it would be nice to have some examples to use for guidance. The documentation is unfortunately useless in this regard -- all the examples are for command-line invocation -- and neither Google Code Search nor koders.com turn up anything useful.
N.B.: I'm reluctant to just use libxml2, because most of the pages I'm scraping are not XHTML-compliant. In fact, a surprisingly large number of them have HTML so malformed that BeautifulSoup chokes on them and I have to use an exponential-backoff approach to parse only a piece of the document at a time. (And in practice, that means I sometimes lose data anyway; this is annoying, but frustratingly necessary. Dear web developers who cannot be bothered to make their content machine-readable without lots of massaging: die in a fire.) It is my understanding that Scrapy is quite tolerant of bad markup, but if I'm wrong about that, please correct me.
no subject
(Anonymous) 2009-08-11 12:50 am (UTC)
There is currently a proposal for adding programmatic control of Scrapy: http://dev.scrapy.org/wiki/SEP-004
However, it's not going to happen soon, as we're currently focused on cleaning up the code and documenting the last bits and pieces for the first stable release, which we hope to ship in 2-4 weeks.
Btw, Scrapy doesn't perform any extra cleansing on markup; it just uses libxml2 directly, which is probably not as bad as you suspect. Recent versions of libxml2 (the 2.6 series, at least) are reasonably good at dealing with bad markup (as long as you're using the HTML parser, not the XML one), and even if it isn't as forgiving as BeautifulSoup, the performance gain probably outweighs the loss of parsing robustness.
I'd recommend trying libxml2 2.6.32 on a few ugly pages and judging for yourself.
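A quick way to try this is through lxml, which wraps libxml2's HTML parser (the same parser the commenter is describing) and exposes XPath directly. This is just a sketch with a made-up "ugly page" snippet, not anything from the project in question:

```python
# Feed deliberately broken HTML (unclosed tags, unquoted attributes) to
# libxml2's recovering HTML parser via lxml.html.
from lxml import html

broken = "<html><body><p>unclosed<div class=links><a href=/a>A<a href=/b>B"

# fromstring() uses the HTML parser, which repairs the markup instead of
# raising the way a strict XML parser would.
doc = html.fromstring(broken)

# XPath works on the recovered tree, so targetable tags survive the repair.
hrefs = [a.get("href") for a in doc.xpath("//a")]
```

Both anchors come back out of the repaired tree despite the missing quotes and close tags, which is roughly the tolerance level you'd be getting from Scrapy's selectors.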
Another good library to consider is html5lib: http://code.google.com/p/html5lib/
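For comparison, html5lib parses with the HTML5 error-recovery algorithm, so the tree it builds for bad markup matches what browsers would construct. A minimal sketch (the malformed snippet is invented; `namespaceHTMLElements=False` just keeps the ElementTree tags unprefixed for easier querying):

```python
# html5lib applies the HTML5 spec's error-handling rules, e.g. inserting
# the <tbody> a browser would insert here.
import html5lib

broken = "<table><tr><td>cell<p>stray paragraph</table>"
tree = html5lib.parse(broken, namespaceHTMLElements=False)

# The default treebuilder returns an xml.etree element, whose limited
# XPath subset is enough for simple tag targeting.
cells = [td.text for td in tree.findall(".//td")]
```

It's slower than libxml2, but for pages mangled enough to choke BeautifulSoup it may be the more predictable of the two.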
no subject
The API you propose looks sensible for the most part -- it looks like it'd do about 90% of what I need as-is. (I'm being self-centred, of course, but I can't possibly be the only person in the world with this use case. ;) ) The callback-chaining looks really cool too.
I do question whether the Crawler's .run() method should necessarily be a blocking call; in my use case, it would be nice to be able to populate scraped_items in such a way that I could iterate over it, with an option to either terminate when the end of the iterable is reached or sleep and wait for the iterable to be extended. (And while I'm asking for a car and a pony, I'll also ask for a villa in the south of France and note that this further suggests that it would be nice to be able to instantiate an asynchronous Crawler from any iterable, such as a generator or a long-lived stream.) Perhaps that's more appropriate for a middleware?
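The non-blocking iteration described above can be sketched without any Scrapy involvement at all: a producer pushes items onto a queue from a background thread, and the consumer iterates lazily, blocking only until the next item (or an end-of-stream sentinel) arrives. `fake_crawl` here is a hypothetical stand-in for whatever a future Crawler API would expose:

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream

def fake_crawl(out):
    """Stand-in for an asynchronous crawler producing scraped items."""
    for item in ({"tag": "h1"}, {"tag": "a"}):
        out.put(item)
    out.put(_DONE)

def iter_items(crawl):
    """Run crawl() in the background and yield items as they arrive."""
    out = queue.Queue()
    threading.Thread(target=crawl, args=(out,), daemon=True).start()
    while True:
        item = out.get()  # sleeps until the producer extends the stream
        if item is _DONE:
            return
        yield item

items = list(iter_items(fake_crawl))
```

The "terminate vs. sleep and wait" choice then falls out of whether the producer ever emits the sentinel, which is why this feels more like a thin consumer-side wrapper than something the Crawler itself would need to own.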
I'll definitely give libxml2 a stab -- I have plenty of (other people's) awful HTML to try it on. <:)
Best of luck with the release, and feel free to drop me a line when you've got time to start implementing that proposal -- I can at least act as a guinea pig, and can probably allocate some Copious Free Time toward hacking on it myself.