Meredith L. Patterson (maradydd.livejournal.com) wrote in maradydd, 2009-08-11 01:35 am (UTC)

Hi Pablo -- thanks for dropping by! (The LJ Genie must be working overtime.)

The API you propose looks sensible for the most part -- it looks like it'd do about 90% of what I need as-is. (I'm being self-centred, of course, but I can't possibly be the only person in the world with this use case. ;) ) The callback-chaining looks really cool too.

I do question whether the Crawler's .run() method should necessarily be a blocking call; in my use case, it would be nice to be able to populate scraped_items in such a way that I could iterate over it, with an option to either terminate when the end of the iterable is reached or sleep and wait for the iterable to be extended. (And while I'm asking for a car and a pony, I'll also ask for a villa in the south of France and note that this further suggests that it would be nice to be able to instantiate an asynchronous Crawler from any iterable, such as a generator or a long-lived stream.) Perhaps that's more appropriate for a middleware?
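To make the request concrete, here is a rough sketch of the non-blocking behaviour described above, using only the Python standard library. Everything in it is hypothetical: `ScrapedItems`, `crawl_in_background`, and the `scrape` callable are illustrative names, not part of the proposed API or of any existing library. It assumes the crawl runs in a background thread that fills a queue, and that the consumer chooses between stopping when the buffer is empty and waiting for more items.

```python
import queue
import threading

_DONE = object()  # sentinel: the producer has nothing more to add


class ScrapedItems:
    """Hypothetical buffer that a background crawl fills and a consumer iterates."""

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, item):
        self._queue.put(item)

    def close(self):
        # Signal consumers that no further items will arrive.
        self._queue.put(_DONE)

    def iterate(self, wait=True):
        """Yield items as they arrive.

        wait=True  -> block until the producer calls close().
        wait=False -> stop as soon as the buffer is empty.
        """
        while True:
            try:
                item = self._queue.get(block=wait)
            except queue.Empty:
                return  # only reachable when wait=False
            if item is _DONE:
                return
            yield item


def crawl_in_background(source, scrape=lambda entry: entry):
    """Start a non-blocking 'crawl' over any iterable (list, generator, stream).

    `scrape` stands in for whatever fetching and parsing a real crawler does.
    """
    items = ScrapedItems()

    def worker():
        try:
            for entry in source:
                items.put(scrape(entry))
        finally:
            items.close()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return items, thread


if __name__ == "__main__":
    urls = (u for u in ["http://example.com/a", "http://example.com/b"])
    items, _ = crawl_in_background(urls, scrape=str.upper)
    for item in items.iterate(wait=True):
        print(item)
```

Whether this belongs in the Crawler itself or in a middleware is exactly the open question; the sketch only shows that the "terminate or wait" choice can live on the consumer side of the iterable.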

I'll definitely give libxml2 a stab -- I have plenty of (other people's) awful HTML to try it on. <:)

Best of luck with the release, and feel free to drop me a line when you've got time to start implementing that proposal -- I can at least act as a guinea pig, and can probably allocate some Copious Free Time toward hacking on it myself.
