no subject

Sure, I wouldn't expect anyone to try and mirror all the data. But I do think that independent developers could get useful data to work with through statistical sampling.

Ah, got it. Hmm. Might be feasible to do sampling.

As to your proposal: my completely uninformed guess is that these services overlap in something like 90% of the _types_ of data that they collect/maintain.

...

Actually, I'm going to revise that figure downwards. I'd guess that Google uses data from Gmail and their ad programs, for instance, to inform their search ranking algorithms.

The suggestion is an interesting one, but I suspect that it wouldn't fly:

(1) It would have to be a (partial) mirror; there's no way that the services would want to run on top of a data store that wasn't dedicated to them. So cost-wise it would be pure overhead for the sponsors.

(2) A lot of the data has privacy implications that are hard to deal with. Remember that flap over the release of a bunch of search queries?

(3) I suspect that the various services don't really want attention drawn to the breadth and depth of the information that they use for these purposes.

Definitely an interesting idea, though, and it might be possible to do something like this even if the data sources that the major players use are widely disparate. It wouldn't help you to answer questions like "how would Google work if I tweaked this constant?" but for general search research, it could be useful.

YaCy might be an interesting resource in this context.
