Someone wrote in [personal profile] maradydd 2009-08-11 12:50 am (UTC)

Hi Meredith, I'm Pablo Hoffman, a Scrapy developer.

There is currently a proposal for adding programmatic control of Scrapy: http://dev.scrapy.org/wiki/SEP-004

However, it's not gonna happen soon, as we're now focusing on cleaning up the code and documenting the last bits and pieces for the first stable release, which we hope to ship in 2-4 weeks.

Btw, Scrapy doesn't perform any extra cleansing on markup; it just uses libxml2 directly, which is probably not as bad as you suspect. Recent versions of libxml2 (the 2.6 series at least) are reasonably good at dealing with bad markup (as long as you're using the HTML parser, not the XML one), and even if it isn't as good as BeautifulSoup, the performance gain probably outweighs the loss in parsing robustness.

I'd recommend trying libxml2 2.6.32 on a few ugly pages and judging for yourself.
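A quick way to run that kind of test is a tiny harness that feeds a handful of broken pages to whatever parser you want to judge. What follows is only a rough sketch: the sample pages, `try_parser`, and `stdlib_parse` are made up for illustration, and the stand-in parser is Python's standard-library `html.parser`, not libxml2. To judge libxml2 itself you would pass in your own callable built on its HTML parser (for example via `lxml.html.fromstring`, assuming lxml is installed).

```python
from html.parser import HTMLParser
import time


class TagCollector(HTMLParser):
    """Records every start tag the parser manages to recover."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def stdlib_parse(markup):
    """Stand-in parser (NOT libxml2): returns the start tags it found."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def try_parser(parse, pages):
    """Run parse() on each page and record recovered tags, errors, and time."""
    results = []
    for name, markup in pages.items():
        start = time.perf_counter()
        try:
            tags, error = parse(markup), None
        except Exception as exc:  # a parser that chokes is a result too
            tags, error = [], repr(exc)
        elapsed = time.perf_counter() - start
        results.append(
            {"page": name, "tags": tags, "error": error, "seconds": elapsed}
        )
    return results


# Hypothetical "ugly pages": unclosed tags, misnesting, unquoted attributes.
UGLY_PAGES = {
    "unclosed": "<html><body><p>one<p>two<b>bold</body>",
    "misnested": "<div><span>text</div></span>",
    "bare_attr": "<a href=foo.html>link<br><img src=x.png>",
}

if __name__ == "__main__":
    for row in try_parser(stdlib_parse, UGLY_PAGES):
        print(row["page"], row["tags"], row["error"])
```

Passing a different `parse` callable to `try_parser` lets you compare two parsers on the same pages, both for what they recover and for how long they take.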

Another good library to consider is html5lib: http://code.google.com/p/html5lib/
