(no subject)

Date: 2008-01-02 09:24 pm (UTC)
Wow, I really suck at conveying what I actually mean tonight. I feel like I'm coming across as utterly ignorant of the sheer size of the data. (It's actually more like "you may think it's a long way down the road to the chemist" levels of ignorant; I don't pretend to be an expert on this by any stretch of the imagination.)

I doubt that even someone with a T1 would even be able to keep up with mirroring the data that Google (or Microsoft, or Yahoo) uses as input.

Sure, I wouldn't expect anyone to try and mirror all the data. But I do think that independent developers could get useful data to work with through statistical sampling.
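
For what it's worth, here is a rough sketch of what that kind of sampling could look like, assuming the crawl shows up as a stream of URLs far too large to hold in memory. It uses reservoir sampling (Algorithm R), which is just one way to do this and not anything the search engines are known to use. The function name and the made-up URL stream are purely illustrative.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return a uniform random sample of up to k items from a stream.

    Only k items are held in memory at a time, so the stream can be
    arbitrarily long (hypothetically, every URL a crawler sees).
    """
    rng = rng or random.Random()
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives with
            # probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


if __name__ == "__main__":
    # Hypothetical stand-in for a crawl: a generator of fake URLs.
    urls = (f"http://example.com/page/{n}" for n in range(1_000_000))
    for url in reservoir_sample(urls, 5):
        print(url)
```

The nice property is that you never need to know how big the stream is ahead of time, which seems like the right fit when nobody outside the big three knows the true size of the index.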

...

Hey, here's a stupid thought. The web is big, so exhaustive spidering is costly, indexing content is costly, yada yada yada. But I wonder: how similar do you suppose the content that Google keeps on the backend is to the content that Microsoft or Yahoo keep on their backends? Or, rephrased: are the material differences between the major search services more a function of the data they maintain, the algorithms they use, or both?

Because, you know, it could be kind of cool if there were a large, world-readable index that somebody could access through a scalable service like Amazon EC2, thereby having the ability to modify their code on the fly and gauge the effects of that.

Lucene I am not familiar with; I'll have to look into it. Apache is doing all kinds of interesting stuff these days.

(I don't know what Google's using OS-wise, but I do know that they've modified the hell out of MySQL for a whole lot of different things -- though what those things are, I have no idea.)