Wow, I really suck at conveying what I actually mean tonight. I feel like I'm coming across as utterly ignorant of the sheer size of the data (it's actually more like "you may think it's a long way down the road to the chemist" levels of ignorant; I don't pretend to be an expert on this by any stretch of the imagination).
I doubt that even someone with a T1 would be able to keep up with mirroring the data that Google (or Microsoft, or Yahoo) uses as input.
Sure, I wouldn't expect anyone to try and mirror all the data. But I do think that independent developers could get useful data to work with through statistical sampling.
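(To be concrete about what I mean by sampling: instead of trying to crawl everything, you keep a fixed-size, uniformly random sample of whatever stream of URLs you can get at, and do your analysis on that. Here's a rough Java sketch of the standard reservoir-sampling trick; the UrlSampler class and its names are something I made up for illustration, not any real tool.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ThreadLocalRandom;

    // Reservoir sampling (Algorithm R): keep a fixed-size uniform sample of an
    // arbitrarily long stream -- e.g. URLs coming off a crawl feed -- without
    // ever holding the whole stream in memory.
    public class UrlSampler {
        private final int capacity;          // how many URLs to keep
        private final List<String> reservoir;
        private long seen = 0;               // how many URLs we've been offered

        public UrlSampler(int capacity) {
            this.capacity = capacity;
            this.reservoir = new ArrayList<>(capacity);
        }

        public void offer(String url) {
            seen++;
            if (reservoir.size() < capacity) {
                reservoir.add(url);
            } else {
                // Keep the newcomer with probability capacity / seen, evicting a
                // uniformly chosen current member; that keeps the sample uniform.
                long j = ThreadLocalRandom.current().nextLong(seen);
                if (j < capacity) {
                    reservoir.set((int) j, url);
                }
            }
        }

        public List<String> sample() {
            return reservoir;
        }
    }

Feed that a few hundred million URLs and you end up with, say, ten thousand of them chosen uniformly at random, which is plenty for a lot of back-of-the-envelope questions about the larger population.)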
...
Hey, here's a stupid thought. The web is big, so exhaustive spidering is costly, indexing content is costly, yada yada yada. But I wonder: how similar do you suppose the content that Google keeps on the backend is to the content that Microsoft or Yahoo keep on their backends? Or, rephrased: are the material differences between the major search services more a function of the data they maintain, the algorithms they use, or both?
Because, you know, it could be kind of cool if there were a large, world-readable index that somebody could access through a scalable service like Amazon EC2, so they could modify their code on the fly and gauge the effects.
Lucene I am not familiar with; I'll have to look into it. Apache is doing all kinds of interesting stuff these days.
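(From what I gather, Lucene is Apache's embeddable search library: you feed it documents, it builds an inverted index, and you run ranked queries against it. Here's a minimal Java sketch of that flow, in-memory only and with made-up field names; the exact API differs between Lucene releases, so treat it as an outline rather than gospel.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class TinyIndexDemo {
        public static void main(String[] args) throws Exception {
            // In-memory index for the demo; a real setup would use FSDirectory on disk.
            Directory dir = new ByteBuffersDirectory();
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // Index one "page": its URL plus the text pulled out of it.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new StringField("url", "http://example.com/", Field.Store.YES));
                doc.add(new TextField("body", "a tiny sample of crawled page text", Field.Store.YES));
                writer.addDocument(doc);
            }

            // Search the "body" field and print the URL of each hit.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new QueryParser("body", analyzer).parse("crawled");
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    // Deprecated in newer releases in favor of searcher.storedFields().
                    System.out.println(searcher.doc(hit.doc).get("url"));
                }
            }
        }
    }

The point being that the indexing-and-ranking machinery already exists as a library; the hard part for an independent developer is still getting hold of, and storing, the crawl data.)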
(I don't know what Google's using OS-wise, but I do know that they've modified the hell out of MySQL for a whole lot of different things -- though what those things are, I have no idea.)