Entry tags: Small world
There's a post up on BoingBoing today (ok, yesterday for me) about open vs. closed search algorithms, suggesting that the search algorithms used by Google, Yahoo et al. are bad because of their lack of transparency. It invokes a comparison to an important concept in computer security: "security through obscurity" is dangerous, because an effective encryption scheme should be equally hard to break whether or not you know the internals of the algorithm that generated the ciphertext.
I think comparing this to search is a bad (or at best misleading) idea, and expounded on this in the comments. But I'm far more entertained by the fact that the two best comments on the post so far come from two sources with whom I am tangentially familiar, albeit from totally different directions:
jrtom and radtea. Small damn world!
no subject
I doubt that even someone with a T1 would be able to keep up with mirroring the data that Google (or Microsoft, or Yahoo) uses as input.
Sure, I wouldn't expect anyone to try and mirror all the data. But I do think that independent developers could get useful data to work with through statistical sampling.
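(What "statistical sampling" would look like here is left open. One hypothetical sketch, with made-up names, is to keep a fixed-size uniform sample of whatever URLs your own crawler encounters, using reservoir sampling, so a small research corpus stays representative of the crawl as a whole:)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.stream.Stream;

// Hypothetical illustration: keep a uniform random sample of k URLs from a
// crawl stream of unknown length (classic reservoir sampling, Algorithm R).
public class CrawlSampler {
    public static List<String> sample(Stream<String> urls, int k, long seed) {
        List<String> reservoir = new ArrayList<>(k);
        Random rng = new Random(seed);
        long[] seen = {0};
        urls.forEach(url -> {
            seen[0]++;
            if (reservoir.size() < k) {
                reservoir.add(url);            // fill the reservoir first
            } else {
                long j = (long) (rng.nextDouble() * seen[0]); // uniform in [0, seen)
                if (j < k) {
                    reservoir.set((int) j, url); // replace with probability k/seen
                }
            }
        });
        return reservoir;
    }
}
```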
...
Hey, here's a stupid thought. The web is big, so exhaustive spidering is costly, indexing content is costly, yada yada yada. But I wonder: how similar do you suppose the content that Google keeps on the backend is to the content that Microsoft or Yahoo keep on their backends? Or, rephrased: are the material differences between the major search services more a function of the data they maintain, the algorithms they use, or both?
Because, you know, it could be kind of cool if there were a large, world-readable index that somebody could access through a scalable service like Amazon EC2, so they could modify their code on the fly and gauge the effects.
Lucene I am not familiar with; I'll have to look into it. Apache is doing all kinds of interesting stuff these days.
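(For a sense of what Lucene does, here is a minimal index-and-search sketch. It assumes a Lucene 8.x/9.x-era API with the analyzers-common and queryparser modules on the classpath; the documents and field names are made up:)

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();   // throwaway in-memory index
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index a couple of toy documents.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            for (String text : new String[] {
                    "security through obscurity is dangerous",
                    "open versus closed search algorithms"}) {
                Document doc = new Document();
                doc.add(new TextField("body", text, Field.Store.YES));
                writer.addDocument(doc);
            }
        }

        // Parse a query and print the matching documents.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new QueryParser("body", analyzer).parse("search"), 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println(hit.score + "  " + searcher.doc(hit.doc).get("body"));
            }
        }
    }
}
```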
(I don't know what Google's using OS-wise, but I do know that they've modified the hell out of MySQL for a whole lot of different things -- though what those things are, I have no idea.)
no subject
Ah, got it. Hmm. Might be feasible to do sampling.
As to your proposal: my completely uninformed guess is that these services overlap in something like 90% of the _types_ of data that they collect/maintain.
...
Actually, I'm going to revise that figure downwards. I'd guess that Google uses data from GMail and their ad programs, for instance, to inform their search ranking algorithms.
The suggestion is an interesting one, but I suspect that it wouldn't fly:
(1) It would have to be a (partial) mirror; there's no way the services would want to base their own systems on something that wasn't dedicated to them. So cost-wise it would be pure overhead for the sponsors.
(2) A lot of the data has privacy implications that are hard to deal with. Remember that flap over the release of a bunch of search queries?
(3) I suspect that the various services don't really want attention drawn to the breadth and depth of the information that they use for these purposes.
Definitely an interesting idea, though, and it might be possible to do something like this even if the data sources that the major players use are widely disparate. It wouldn't help you to answer questions like "how would Google work if I tweaked this constant?" but for general search research, it could be useful.
YaCy might be an interesting resource in this context.
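(YaCy peers expose an HTTP search interface. Below is a rough sketch of querying one from Java; the local host/port 8090 and the OpenSearch-style yacysearch.rss endpoint and query parameter are assumptions based on YaCy's defaults, so check your peer's API documentation:)

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical query against a local YaCy peer; endpoint and parameter names
// are assumptions, not a verified API reference.
public class YaCyQuery {
    public static void main(String[] args) throws Exception {
        String q = URLEncoder.encode("security through obscurity", StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8090/yacysearch.rss?query=" + q))
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());  // raw OpenSearch/RSS results
    }
}
```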