Small world

Jan. 2nd, 2008 07:52 pm
[personal profile] maradydd
There's a post up on BoingBoing today (ok, yesterday for me) about open vs. closed search algorithms, suggesting that the search algorithms used by Google, Yahoo et al. are bad because of their lack of transparency. It invokes a comparison to an important concept in computer security: "security through obscurity" is dangerous because an effective encryption scheme should be equally hard to break whether or not you know the internals of the algorithm that generated the ciphertext.

I think comparing this to search is a bad (or at best misleading) idea, and expounded on this in the comments. But I'm far more entertained by the fact that the two best comments on the post so far come from two sources with whom I am tangentially familiar, albeit from totally different directions: jrtom and radtea. Small damn world!

(no subject)

Date: 2008-01-02 10:24 pm (UTC)
From: [identity profile] jrtom.livejournal.com
Sure, I wouldn't expect anyone to try and mirror all the data. But I do think that independent developers could get useful data to work with through statistical sampling.

Ah, got it. Hmm. Might be feasible to do sampling.


As to your proposal: my completely uninformed guess is that these services overlap in something like 90% of the _types_ of data that they collect/maintain.

...

Actually, I'm going to revise that figure downwards. I'd guess that Google uses data from GMail and their ad programs, for instance, to inform their search ranking algorithms.

The suggestion is an interesting one, but I suspect that it wouldn't fly:

(1) It would have to be a (partial) mirror; there's no way that the services would want to rely on something that wasn't dedicated. So cost-wise it would be pure overhead for the sponsors.

(2) A lot of the data has privacy implications that are hard to deal with. Remember that flap over the release of a bunch of search queries?

(3) I suspect that the various services don't really want attention drawn to the breadth and depth of the information that they use for these purposes.

Definitely an interesting idea, though, and it might be possible to do something like this even if the data sources that the major players use are widely disparate. It wouldn't help you to answer questions like "how would Google work if I tweaked this constant?" but for general search research, it could be useful.

YaCy might be an interesting resource in this context.
