maradydd ([personal profile] maradydd) wrote 2008-01-02 07:52 pm

Small world

There's a post up on BoingBoing today (ok, yesterday for me) about open vs. closed search algorithms, suggesting that the search algorithms used by Google, Yahoo, et al. are bad because of their lack of transparency. It invokes a comparison to an important concept in computer security: "security through obscurity" is dangerous, because an effective encryption scheme should be equally hard to break whether or not the attacker knows the internals of the algorithm that generated the ciphertext.
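(A quick illustration of the crypto version of that principle, for anyone who hasn't run into it: all the secrecy lives in the key, never in the algorithm. Here's a toy sketch in Python using the standard library's hmac module; the key and message are obviously made up:

    import hmac, hashlib

    # The algorithm -- HMAC-SHA256 -- is completely public.
    secret_key = b"not a real key"            # hypothetical secret
    message = b"pay $100 to Mallory"

    tag = hmac.new(secret_key, message, hashlib.sha256).hexdigest()
    print(tag)

    # An attacker can read every line of this code, but without secret_key
    # they still can't forge a valid tag for a message of their choosing.

Whether that property has any sensible analogue for ranking algorithms is exactly what I'm skeptical about; see the comments.)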

I think comparing this to search is a bad (or at best misleading) idea, and expounded on this in the comments. But I'm far more entertained by the fact that the two best comments on the post so far come from two sources with whom I am tangentially familiar, albeit from totally different directions: [livejournal.com profile] jrtom and [livejournal.com profile] radtea. Small damn world!

[identity profile] maradydd.livejournal.com 2008-01-02 08:21 pm (UTC)
Making the algorithms open doesn't mean that any flaws will be fixed quickly; it just means that they'll be _found_ (more) quickly.

Sure, and I'm not pretending otherwise. (Note that I only said responses will be generated more quickly; it's anyone's guess as to how well those responses will work, particularly since the spammers certainly won't be opening up their source code!) I should have said above that I think the potential for an open-source search engine to implode in grand style due to sheer developer frustration is enormous. But I still think that if enough dedicated people were on board, cool things could happen; it's hard to say how many is "enough", though, or how dedicated they need to be. Startups tend to have fanatically dedicated people working for them because the people know that the volume and quality of their work has a direct influence on whether they're going to have a job in three months; this really can't be said for open-source projects. Even when the work sucks giant monkey balls, a sense of urgency can be a great source of inspiration.

random Wikia users won't have access to the data that informs the algorithm design

Do we know this is true? (I didn't see it indicated in the article, but it was a pretty short article.) I suppose the bandwidth costs would be kind of insane if just any random person could pull the data whenever they wanted ("hey, let's DDoS Wikia today!"), but perhaps developer keys and rate-limiting, or BitTorrent, or something.
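To make "developer keys and rate-limiting" concrete, here's a back-of-the-envelope token-bucket limiter in Python. The numbers, and the idea that Wikia would even issue per-developer keys, are pure speculation on my part:

    import time

    class TokenBucket:
        """Crude per-key rate limiter: refill at `rate` tokens/sec,
        hold at most `capacity` tokens."""
        def __init__(self, rate, capacity):
            self.rate, self.capacity = rate, capacity
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self, cost=1):
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

    # One bucket per developer key; a bulk data pull costs more tokens than
    # a single query, so "hey, let's DDoS Wikia today!" just gets throttled.
    buckets = {"dev-key-123": TokenBucket(rate=5, capacity=100)}  # made-up key
    print(buckets["dev-key-123"].allow(cost=10))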

Nor, I suspect, will Wikia be letting just anyone _edit_ their algorithms, unless they're complete idiots.

Sure. I took [livejournal.com profile] neoliminal's question to mean something like the setup that LiveJournal has, where the code is published and can be replicated elsewhere (e.g. DeadJournal), but users can't make changes to an instance of the system that they don't control.

From my experience, part of the problem with translating academic results to search engines in particular is that it's hard for an academic to demonstrate that their approach or improvement will work in actual practice.

Oh, absolutely, although the better conferences (e.g. SIGIR, KDD, &c) seem to at least pay lip service to scalability issues. But I totally agree that academics almost universally have blinders on when it comes to the notion of people using their systems in unintended or unexpected ways, and they don't write papers (and certainly don't implement code) with an eye toward this very real problem.

Still, I like the notion of J. Random Bored Hacker being able to read a paper, bang some code together, and see whether it works. J. Random Bored Hacker isn't going to have the hardware resources to put together his own private Google, but I know probably ten or twelve different people who have clusters running in their homes/warehouses/whatever just for shits and grins. There's got to be some guy out there with fifty Beowulfed dual-Xeons and a healthy curiosity about search...

I gather that you're in some completely other time zone these days

Yep, I'm in Belgium until late February, alas. If you'll be business-tripping later in the year, though, drop me a line in advance and we can grab dinner! (I am currently without car, and my day-to-day transportation needs are met well enough by SF public transit that I'm not especially motivated to fix my engine, shell out to get it fixed, or buy a new car, but Mountain View is fairly reachable by train.)

[identity profile] jrtom.livejournal.com 2008-01-02 08:51 pm (UTC)
random Wikia users won't have access to the data that informs the algorithm design

Do we know this is true? (I didn't see it indicated in the article, but it was a pretty short article.) I suppose the bandwidth costs would be kind of insane if just any random person could pull the data whenever they wanted ("hey, let's DDoS Wikia today!"), but perhaps developer keys and rate-limiting, or BitTorrent, or something.


"Kind of insane" is putting it mildly, I'd bet: I doubt that even someone with a T1 would even be able to keep up with mirroring the data that Google (or Microsoft, or Yahoo) uses as input. Consider that this data almost certainly involves every single search and page visit, not to mention data from web spiders giving you content and topology (i.e., link) updates.

That is: yes, lots of people doing this would place a strain on [Google|Yahoo|Microsoft], but it seems infeasible for almost any individual to even _receive_ the data.

the better conferences (e.g. SIGIR, KDD, &c) seem to at least pay lip service to scalability issues.

Of course. But search engines (or, as I corrected myself in my last post to BB, "search services") are big agglomerations of rapidly evolving code, huge inputs, and lots of infrastructure. So it's not enough for an academic to say "I've demonstrated that the scalability is good up to 10M pages" (which would be somewhat exceptional, I suspect) because they're still a few orders of magnitude low, and because they have no way of demonstrating that their approach won't screw something else up in the existing engine, or push them over a critical performance boundary (in the wrong direction).

As a side point: Google runs its stuff on a custom version of Linux, I believe. I don't know how tied their algorithms are to that...but do they have to provide the source for that, too, so that J. Random Hacker can have the appropriate substrate on which to run their experiments?

As for J. Random Bored Hacker, that's what Lucene et al. are for. Of course, it doesn't allow you to duplicate the precise environment that a specific production search service has...but at least you can use it to dink with the algorithms themselves.
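To give a flavor of what "dink with the algorithms" means in practice: ranking functions are mostly pluggable. Here's a toy TF-IDF scorer in Python -- not Lucene code, and the corpus is fake, but it's the kind of similarity computation you can subclass and swap out in Lucene itself:

    import math
    from collections import Counter

    docs = {  # stand-in for a real index
        "d1": "open search algorithms for the open web".split(),
        "d2": "closed search ranking is security through obscurity".split(),
        "d3": "open source ranking algorithms".split(),
    }

    def tfidf_score(query, doc_id):
        """Plain TF-IDF: term frequency in the document, weighted by
        how rare the term is across the whole collection."""
        words = docs[doc_id]
        tf = Counter(words)
        score = 0.0
        for term in query.split():
            df = sum(1 for d in docs.values() if term in d)
            if df:
                score += (tf[term] / len(words)) * math.log(len(docs) / df)
        return score

    ranked = sorted(docs, key=lambda d: tfidf_score("open algorithms", d),
                    reverse=True)
    print(ranked)  # d1 and d3 should beat d2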


Sorry to hear you won't be around. Don't know when I might be in the Bay again, but if and when I'll let you know so that we can grab dinner (quick, before it gets away).

[identity profile] maradydd.livejournal.com 2008-01-02 09:24 pm (UTC)
Wow, I really suck at conveying things I actually mean tonight. I feel like I'm coming across as utterly ignorant of the sheer size of the data. (It's actually more like "you may think it's a long way down the road to the chemist" levels of ignorant; I don't pretend to be an expert on this by any stretch of the imagination.)

I doubt that even someone with a T1 would even be able to keep up with mirroring the data that Google (or Microsoft, or Yahoo) uses as input.

Sure, I wouldn't expect anyone to try to mirror all the data. But I do think that independent developers could get useful data to work with through statistical sampling.
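Something like reservoir sampling over the query stream, say. A sketch, with the obvious caveat that whether you can get at the stream at all is the whole question:

    import random

    def reservoir_sample(stream, k):
        """Keep a uniform random sample of k items from a stream of
        unknown (enormous) length, using only O(k) memory."""
        sample = []
        for i, item in enumerate(stream):
            if i < k:
                sample.append(item)
            else:
                j = random.randint(0, i)
                if j < k:
                    sample[j] = item
        return sample

    # e.g. a 1,000-query sample out of a log you could never mirror whole
    queries = ("query %d" % i for i in range(1000000))
    print(len(reservoir_sample(queries, 1000)))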

...

Hey, here's a stupid thought. The web is big, so exhaustive spidering is costly, indexing content is costly, yada yada yada. But I wonder: how similar do you suppose the content that Google keeps on the backend is to the content that Microsoft or Yahoo keep on their backends? Or, rephrased: are the material differences between the major search services more a function of the data they maintain, the algorithms they use, or both?

Because, you know, it could be kind of cool if there were a large, world-readable index that somebody could access through a scalable service like Amazon EC2, so that they could modify their code on the fly and gauge the effects.

Lucene I am not familiar with; I'll have to look into it. Apache is doing all kinds of interesting stuff these days.

(I don't know what Google's using OS-wise, but I do know that they've modified the hell out of MySQL for a whole lot of different things -- though what those things are, I have no idea.)

[identity profile] jrtom.livejournal.com 2008-01-02 10:24 pm (UTC)
Sure, I wouldn't expect anyone to try and mirror all the data. But I do think that independent developers could get useful data to work with through statistical sampling.

Ah, got it. Hmm. Might be feasible to do sampling.


As to your proposal: my completely uninformed guess is that these services overlap in something like 90% of the _types_ of data that they collect/maintain.

...

Actually, I'm going to revise that figure downwards. I'd guess that Google uses data from GMail and their ad programs, for instance, to inform their search ranking algorithms.

The suggestion is an interesting one, but I suspect that it wouldn't fly:

(1) It would have to be a (partial) mirror; there's no way that the services would want to run their production systems off infrastructure that wasn't dedicated. So cost-wise it would be pure overhead for the sponsors.

(2) A lot of the data has privacy implications that are hard to deal with. Remember that flap over AOL's release of a bunch of search queries?

(3) I suspect that the various services don't really want attention drawn to the breadth and depth of the information that they use for these purposes.

Definitely an interesting idea, though, and it might be possible to do something like this even if the data sources that the major players use are widely disparate. It wouldn't help you to answer questions like "how would Google work if I tweaked this constant?" but for general search research, it could be useful.

YaCy might be an interesting resource in this context.