maradydd: (Default)
Google Analytics does some pretty cool stuff, but has one major drawback for mobile web application developers: it's JavaScript-based, meaning that hits from mobile devices that don't speak JavaScript silently go untracked. Recently, the Analytics team released some code that does server-side tracking; the linked ZIP file contains source and examples in ASP, JSP, PHP and Perl. Why not Python, you might wonder? I wondered too, particularly since an AppEngine project I'm working on is at least somewhat intended for phones (hey, you never know when you might be away from your desk but really want to know if a certain BioBrick exists), so I did a little poking around to see if it was possible to instrument an AppEngine application using server-side Mobile Analytics.
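For reference, what those server-side examples do is request the __utm.gif tracking beacon from Google on the visitor's behalf. Loosely translated to Python, the core looks something like this -- a paraphrase of the shipped ga.php, not a working port; the utm* parameter names come from that code, everything else here is my own invention:

# Loose Python paraphrase of the shipped ga.php/ga.pl; a sketch, not a port.
import random
import urllib
import urllib2

UTM_GIF = 'http://www.google-analytics.com/__utm.gif'

def track_page_view(account, path, client_ip, user_agent):
    params = urllib.urlencode({
        'utmac': account,    # your mobile Analytics account id
        'utmn': random.randint(0, 0x7fffffff),  # cache-busting request id
        'utmp': path,        # the page being tracked
        'utmip': client_ip,  # the real visitor's IP, since this hit is server-side
    })
    # Pass along the visitor's User-Agent so the hit gets attributed to
    # them rather than to our server.
    req = urllib2.Request('%s?%s' % (UTM_GIF, params),
                          headers={'User-Agent': user_agent})
    urllib2.urlopen(req).read()  # fire the beacon; Google answers with a 1x1 GIF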

The short answer is no.
maradydd: (Default)
So, I'm not sure if y'all are talking behind my back, if my Google-juice is better than I thought it was, or if Python developers are just telepathic, but I'd like to note that I'm now two for two when it comes to whining about some library not doing exactly what I want it to and having one of that library's developers pop up within 24 hours to address my question -- that is, as long as the library's written in Python.

No, I'm not going to push my luck (though, note this changeset for 3.0), but this is just one more reason why this community owns.
maradydd: (Default)
Has anyone reading this used Scrapy, the Python HTML-scraping framework, programmatically as part of a larger system? I'm interested in using it to replace BeautifulSoup in a project I'm working on which involves extracting specific, XPath-targetable tags from the contents of a whole bunch of different URLs. BeautifulSoup can do it, but the CPU and memory load is really heavy and I'd like to find a lighter-weight solution. (Scrapy supports XPath out of the box, which was a great design decision on their part.)

The specific problem I'm having with Scrapy is that despite the fact that it supports writing custom scrapers, it's designed as a command-line-driven tool to the exclusion of anything else. I want to instantiate a scraper from within a routine, run it, and hand the contents of the tags it collects off to another routine all within the same process, without having to invoke a separate process or touch the disk -- this system has to consume a lot of network data and I can't afford for it to become I/O bound. (I can queue the inbound network data -- in fact, since my current architecture is completely synchronous, I already am -- but not having to do so is preferable. Scrapy is asynchronous and that's a plus.)

Since it's written in Python, I can trace the control flow and figure out what specific pieces I need to import and/or customise to get it to do what I want, but it's a pretty densely layered system and it would be nice to have some examples to use for guidance. The documentation is unfortunately useless in this regard -- all the examples are for command-line invocation -- and neither Google Code Search nor koders.com turns up anything useful.
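For concreteness, this is roughly the shape of what I'm after -- untested, the class and method names are my best guesses from skimming the Scrapy source, and process_tags is a stand-in for whatever downstream routine gets the results:

# Untested sketch; HtmlXPathSelector and its methods are guesses from the
# Scrapy source, and process_tags is a stand-in for my downstream routine.
import urllib2
from scrapy.selector import HtmlXPathSelector

def extract(url, xpath):
    html = urllib2.urlopen(url).read()   # in the real system, this comes off a queue
    hxs = HtmlXPathSelector(text=html)   # assuming the selector will take raw text
    return hxs.select(xpath).extract()   # matching fragments, as strings

# Hand the results straight to the next stage: no subprocess, no disk.
process_tags(extract('http://example.com/', '//table[@class="parts"]//td'))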

N.B.: I'm reluctant to just use libxml2, because most of the pages I'm scraping are not XHTML-compliant. In fact, a surprisingly large number of them have HTML so malformed that BeautifulSoup chokes on them and I have to use an exponential-backoff approach to parse only a piece of the document at a time. (And in practice, that means I sometimes lose data anyway; this is annoying, but frustratingly necessary. Dear web developers who cannot be bothered to make their content machine-readable without lots of massaging: die in a fire.) It is my understanding that Scrapy is quite tolerant of bad markup, but if I'm wrong about that, please correct me.
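(For the curious, the backoff hack is essentially this, minus a pile of bookkeeping -- and the prefix-truncation is exactly where the data loss comes from:)

# Simplified sketch of the backoff hack; the real thing is uglier, and the
# exact exception BeautifulSoup throws depends on which version you run.
from HTMLParser import HTMLParseError
from BeautifulSoup import BeautifulSoup

def parse_with_backoff(html):
    size = len(html)
    while size > 0:
        try:
            return BeautifulSoup(html[:size])  # try an ever-smaller prefix
        except HTMLParseError:
            size //= 2  # back off exponentially; past here, data is lost
    return None  # nothing parseable at all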
maradydd: (Default)
So I've been working with Django lately, and I continue to be pleased with the preeminent saneness with which it handles the interaction between HTML and Python. Here is the latest example.

Suppose you have an HTML form that your Django backend will be processing. Give each input or select element in your form a name attribute, and whatever function you POST the form back to will receive a request.POST dictionary keyed by those names. (Do remember method="post" on the form element itself; the default is GET, and your data will turn up in request.GET instead.) For instance, if you have a form like this:

<form id="shopping">
  <p>
    <select name="fruit">
      <option value="apples">apples</option>
      <option value="bananas">bananas</option>
      <option value="cherries">cherries</option>
    </select>
  </p>
  <p>
    <select name="meat">
      <option value="buffalo">buffalo</option>
      <option value="moose">moose</option>
      <option value="quail">quail</option>
    </select>
  </p>
  <input type="submit" value="Submit" />
</form>


Then your receiving function will get a request.POST with the keys fruit and meat. Strictly speaking it's a QueryDict rather than a plain dictionary: indexing it gives you the last value submitted under a name, while its getlist() method gives you the whole list of values. And, yes, if you give those selects the multiple attribute, turning them into multi-valued choice sets, getlist() will hand back every value that was selected. Very handy.
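A minimal receiving view along those lines (the view name is mine; wire it up in urls.py however you like):

from django.http import HttpResponse

def shopping_list(request):
    # request.POST is a QueryDict: indexing gives the last value for each
    # name, getlist() gives all of them (which matters with multiple).
    fruit = request.POST.getlist('fruit')
    meat = request.POST.getlist('meat')
    return HttpResponse('fruit: %s / meat: %s'
                        % (', '.join(fruit), ', '.join(meat)))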

But wait, there's more!

Suppose that you want to give your hypothetical shopper the ability to select more than one type of fruit or meat at a time without using multiple, so you write some DOM-manipulating JavaScript to dynamically add more copies of the appropriate select element as needed, giving each element a unique name. (How to do this is left as an exercise for the reader. I did it, you can too.) Suppose further that you also want to give your users the ability to specify how many units of each item they want, so you add text inputs (with appropriate input validation, of course, also left as an exercise for the reader). Give each <input type="text"> the same name as its corresponding select, and the value lists in your request.POST will look like:

{ 'fruit_0': ['4', 'cherries'], 'fruit_1': ['3', 'apples'], 'meat_0': ['1', 'buffalo'] }

(In this case I'm using subscripts in my JavaScript to generate distinct names. There may actually be a simpler way to do this, though I haven't hit on it yet.)

This is especially useful in the case where you have some function that you want to pass each of your (amount, item) pairs to, because then you can use the handy *args syntax -- with one caveat: since request.POST is a QueryDict, its values() method only yields the last value for each name, so you want lists() instead, e.g. [doStuffTo(*pair) for name, pair in request.POST.lists()]. You could also use a dictionary comprehension if you're using Python 3, you bleeding-edge hacker, you. Though I don't know whether Django is compatible with Python 3 (I doubt it, given all the backward-compatibility stuff that Python 3 breaks). That, too, is left as an exercise for the reader.
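Spelled out, inside the receiving view (doStuffTo being whatever your per-pair function is):

def doStuffTo(amount, item):
    print 'ordering %s unit(s) of %s' % (amount, item)

def process_order(request):
    # QueryDict.lists() yields (name, value_list) pairs, so each pair here
    # is something like ['4', 'cherries'] -- exactly what *args wants.
    for name, pair in request.POST.lists():
        doStuffTo(*pair)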
