Small world
Jan. 2nd, 2008 07:52 pm
There's a post up on BoingBoing today (ok, yesterday for me) about open vs. closed search algorithms, suggesting that the search algorithms used by Google, Yahoo, et al. are bad because of their lack of transparency. It invokes a comparison to an important concept in computer security: "security through obscurity" is dangerous because an effective encryption scheme should be equally hard to break whether or not you know the internals of the algorithm that generated the ciphertext.
I think comparing this to search is a bad (or at best misleading) idea, and expounded on this in the comments. But I'm far more entertained by the fact that the two best comments on the post so far come from two sources with whom I am tangentially familiar, albeit from totally different directions: jrtom and radtea. Small damn world!
(no subject)
Date: 2008-01-02 10:24 pm (UTC)
Ah, got it. Hmm. Might be feasible to do sampling.
As to your proposal: my completely uninformed guess is that these services overlap in something like 90% of the _types_ of data that they collect/maintain.
...
Actually, I'm going to revise that figure downwards. I'd guess that Google uses data from GMail and their ad programs, for instance, to inform their search ranking algorithms.
The suggestion is an interesting one, but I suspect that it wouldn't fly:
(1) It would have to be a (partial) mirror; there's no way that the services would want to base their work on something that wasn't dedicated. So cost-wise it would be pure overhead for the sponsors.
(2) A lot of the data has privacy implications that are hard to deal with. Remember that flap over the release of a bunch of search queries?
(3) I suspect that the various services don't really want attention drawn to the breadth and depth of the information that they use for these purposes.
Definitely an interesting idea, though, and it might be possible to do something like this even if the data sources that the major players use are widely disparate. It wouldn't help you to answer questions like "how would Google work if I tweaked this constant?" but for general search research, it could be useful.
YaCy might be an interesting resource in this context.