Foot, meet gun
Oct. 25th, 2009 04:58 am
Google Analytics does some pretty cool stuff, but has one major drawback for mobile web application developers: it's JavaScript-based, meaning that hits from mobile devices that don't speak JavaScript silently go untracked. Recently, the Analytics team released some code that does server-side tracking; the linked ZIP file contains source and examples in ASP, JSP, PHP and Perl. Why not Python, you might wonder? I wondered too, particularly since an AppEngine project I'm working on is at least somewhat intended for phones (hey, you never know when you might be away from your desk but really want to know if a certain BioBrick exists), so I did a little poking around to see if it was possible to instrument an AppEngine application using server-side Mobile Analytics.
So, the core of the Analytics server-side code is a routine that calls out to www.google-analytics.com/__utm.gif?[a bunch of parameters] and renders a 1x1 GIF as a web-bug. At first glance, this seems pretty straightforward: write a handler that makes the call to Google's web-bug and produces the $your-server web-bug. AppEngine even supports a subset of the Python Imaging Library, which means that you can build the web-bug programmatically in much the same way as the PHP/Perl/&c code does (although AppEngine's subset of PIL doesn't actually speak GIF, only JPG and PNG, so you have to render a PNG instead). Porting the samples from PHP was easy enough, with occasional references to the Perl to clarify things. (I still can't write Perl, but I can read it relatively well. So much for it being a write-only language!) However, I was surprised to find that the call out to Google's web-bug failed. What was going on there?
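To make the shape of that call concrete, here is a minimal sketch of building the `__utm.gif` URL in Python. The parameter names and the `4.4sp` version string are taken from the request captured later in this post; the function name `build_utm_url`, its argument names, and the placeholder cookie value are my own assumptions for illustration, not Google's published API.

```python
import random
from urllib.parse import urlencode

UTM_GIF_URL = "http://www.google-analytics.com/__utm.gif"


def build_utm_url(account, host, page, referer="-", visitor_id="", client_ip=""):
    """Build the web-bug URL (hypothetical helper, not part of any Google library)."""
    params = [
        ("utmwv", "4.4sp"),                            # tracker version seen in the capture
        ("utmn", str(random.randint(0, 0x7FFFFFFF))),  # random number, presumably a cache-buster
        ("utmhn", host),                               # host name of the tracked site
        ("utmr", referer),                             # referer, "-" when there is none
        ("utmp", page),                                # page being tracked
        ("utmac", account),                            # Analytics account ID (MO-... for mobile)
        ("utmcc", "__utma=999.999.999.999.999.1;"),    # placeholder cookie value (assumption)
        ("utmvid", visitor_id),                        # visitor ID
        ("utmip", client_ip),                          # client IP
    ]
    # urlencode percent-encodes each value exactly once.
    return UTM_GIF_URL + "?" + urlencode(params)
```

Calling `build_utm_url("MO-11263623-1", "localhost", "http://maradyddtestapp.appspot.com/")` yields a URL of the same general form as the one in the capture. The handler would then fetch that URL and serve its own 1x1 image to the phone.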
I dropped into the interpreter and tried a simple httplib call. No dice; it came up 404. Since Google registers a hit only when a request to this URL succeeds, that was no good -- it meant that requests from my web-bug were simply vanishing into the ether. However, placing the same request from a web browser retrieved the image just fine. So what was the browser doing differently?
To find out, I opened up wireshark, reissued the request from my browser, and beheld the details of a packet with many lovely headers:
    GET /__utm.gif?utmwv=utmac=MO-11263623-1&utmwv=4.4sp&utmip=&utmn=1277580264&utmhn=localhost&utmp=http%3A%2F%2Fmaradyddtestapp.appspot.com%2F&utmr=-&utmcc=__utma%253D999.999.999.999.999.1%253B&utmvid=0x01234567890abcdef HTTP/1.1
    Host: www.google-analytics.com
    Connection: keep-alive
    User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/532.2 (KHTML, like Gecko) Chrome/4.0.222.5 Safari/532.2
    Cache-Control: max-age=0
    Accept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
    Accept-Encoding: gzip,deflate
    Accept-Language: en-US,en;q=0.8
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
    If-Modified-Since: Wed, 21 Jan 2004 19:50:30 GMT

Replicating this request, headers and all, in the interpreter gave me back a 200 OK. Time for some differential diagnosis: what headers were optional, and which ones couldn't it live without?
As luck would have it, I struck gold on the first try -- passing every header except the Host header resulted in the 404 I'd gotten earlier. As long as the Host header was in the request, everything was fine. In fact, passing nothing but the request line and the Host header worked perfectly ... in the interpreter. In AppEngine, not so much.
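The drop-one-header-at-a-time approach can be sketched roughly like this. Everything here is illustrative: `find_required_headers` and `fake_probe` are hypothetical names, and the probe is a stand-in that mimics the behavior I saw (404 without Host, 200 with it) rather than a real network call.

```python
def find_required_headers(headers, probe, ok_status=200):
    """Return the names of headers whose removal makes the probe fail.

    `probe` is any callable that takes a dict of headers and returns an
    HTTP status code; a real one would send the request over the wire.
    """
    required = []
    for name in headers:
        # Resend the request with exactly one header left out.
        trimmed = {k: v for k, v in headers.items() if k != name}
        if probe(trimmed) != ok_status:
            required.append(name)
    return required


def fake_probe(headers):
    # Stand-in for the real request, modeled on the results described above.
    return 200 if "Host" in headers else 404


captured = {
    "Host": "www.google-analytics.com",
    "Connection": "keep-alive",
    "Accept-Encoding": "gzip,deflate",
}
print(find_required_headers(captured, fake_probe))  # ['Host']
```

One caveat if you swap in a real probe: Python's standard HTTP client adds a Host header on its own, so omitting it takes the lower-level `putrequest(..., skip_host=True)` call rather than the one-shot `request()` method.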
See, AppEngine does a lot of sandboxing, for security purposes. To this end, they've rolled their own versions of a lot of Python standard library modules, including httplib and urllib (and, I expect, anything that involves a network socket). Part of this sandboxing involves stripping "untrusted" headers from any network request generated by AppEngine code -- including Host. Ha ha!
Except ... Host is a mandatory header in HTTP 1.1. It can be empty, but it has to be there. A server is supposed to return 400 Bad Request, not 404 Not Found, if no Host header is present at all ... but surely AppEngine wouldn't have a URL fetcher so badly standards-nonconforming as to fail to send a required header?
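For reference, the rule a conforming server is supposed to apply looks something like the sketch below. This is my own toy reading of the HTTP/1.1 requirement, with a hypothetical function name; it is not how Google's servers (which answered 404) actually behave, and it ignores everything else a real parser has to handle.

```python
def status_for_host_rule(raw_request: bytes) -> int:
    """Return 400 for an HTTP/1.1 request with no Host header, else 200."""
    # Only the request line and headers matter; drop any body.
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    request_line, *header_lines = head.split("\r\n")
    version = request_line.rsplit(" ", 1)[-1]
    # Header names are case-insensitive; an empty value still counts as present.
    names = {
        line.split(":", 1)[0].strip().lower()
        for line in header_lines
        if ":" in line
    }
    if version == "HTTP/1.1" and "host" not in names:
        return 400
    return 200
```

Note that an empty `Host:` line passes this check, which matches the "it can be empty, but it has to be there" reading.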
Well, actually, they don't, which is always nice. url_fetch.py does in fact set the Host header, right there on line 170. And this is where things get weird.
See, the error that was showing up in the AppEngine logs wasn't a 404 Not Found -- it was a -2, 'Name or service not known'. If that doesn't sound like an HTTP error to you, you're right: it's a DNS error.
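To keep the two failure modes apart in logs, something like the following helps. It is a sketch against the plain Python standard library, not AppEngine's own urlfetch module (whose exception types I am not reproducing here), and `classify_fetch_error` is a hypothetical name.

```python
import socket
import urllib.error


def classify_fetch_error(exc):
    """Label an exception as an HTTP-level failure, a DNS failure, or other."""
    # An HTTPError means a server was reached and answered with a status code.
    if isinstance(exc, urllib.error.HTTPError):
        return "http %d" % exc.code
    # urllib wraps resolver failures in URLError; unwrap the reason if present.
    reason = getattr(exc, "reason", exc)
    if isinstance(reason, socket.gaierror):
        # Name resolution failed, so no HTTP request was ever sent.
        return "dns"
    return "other"
```

A "-2, 'Name or service not known'" pair is what a `socket.gaierror` typically carries on Linux, which is why it never looks like an HTTP status.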
Those of you scratching your heads may now scratch them harder: swap out 'www.google-analytics.com' for whatever IP address you get when you dig that hostname, and it still fails. There is absolutely no reason for DNS to be involved, because there is nothing for it to have to resolve, and yet we get back a DNS resolution error, on both the dev server and on appspot.com (so it's not an issue with my /etc/hosts, which was my first thought).
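The reason the IP-address experiment is so baffling can be shown in a few lines: an IP literal parses as an address directly, so a client has nothing to look up. This is a minimal sketch with a hypothetical helper name, and the address used below is a reserved documentation address, not one of Google's.

```python
import ipaddress


def needs_dns_lookup(host):
    """Return True if `host` is a name that must be resolved, False for an IP literal."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not a valid IPv4 or IPv6 literal, so it has to go through the resolver.
        return True
    return False
```

Here `needs_dns_lookup("www.google-analytics.com")` is true and `needs_dns_lookup("192.0.2.1")` is false, so a resolver error on the second kind of request suggests the failure is being raised (or mislabeled) somewhere other than an actual lookup.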
Oh, but I can fetch other stuff -- f'rinstance, the front page of this-here blog -- just fine. No trouble whatsoever. How ya like them apples?
So, for the time being, I'm not sure whether Analytics has shot AppEngine in the foot or vice versa, but someone sure is gimping around saying "ow" a lot.
Date: 2009-10-29 11:29 am (UTC)
http://code.google.com/p/googleappengine/issues/list