maradydd | Small world

While the analogy of cryptography to corporate secrets is spurious, what did you think of the idea of an open source search engine?

I think it's a fine idea precisely for the reason that (given enough developers behind it) it has the potential to speed up the arms race. If spammers have access to the source code, they will be able to develop their attacks more quickly; as a result, we'll see those attacks in the wild more quickly, and more people will have the potential to develop responses to those attacks. Given enough eyes (and that's always the problem, isn't it?), those responses will be generated more quickly, and as a whole I think it'll increase the rate of advancement of the field.

In particular, there are loads of academic papers on search engines whose results just never get integrated into any real system because the closed-source developers are too busy and most academics don't really care about seeing their work actually implemented. Having an avenue for academic research to filter over into the real world is always a good thing, at least in CS.

If spammers have access to the source code, they will be able to develop their attacks more quickly; as a result, we'll see those attacks in the wild more quickly, and more people will have the potential to develop responses to those attacks.

Making the algorithms open doesn't mean that any flaws will be fixed quickly; it just means that they'll be _found_ (more) quickly. Part of the problem here is that it's a lot harder to make a good fix to an algorithm than it is to correct an historical article. For one thing, as I mentioned on BB, random Wikia users won't have access to the data that informs the algorithm design; all they'll be able to see is the algorithm and the resultant rankings. Nor, I suspect, will Wikia be letting just anyone _edit_ their algorithms, unless they're complete idiots.

there are loads of academic papers on search engines whose results just never get integrated into any real system because the closed-source developers are too busy and most academics don't really care about seeing their work actually implemented

From my experience, part of the problem with translating academic results to search engines in particular is that it's hard for an academic to demonstrate that their approach or improvement will work in actual practice. They don't have the data and they don't have the massive scale, and they don't know about all the rest of the inputs to (and demands on) the system. So the response of the search engine devs is likely to be something like "yeah, that might be cool, but...". Which is presumably one reason why Google wants to hire all those PhDs, so that they can work on these things in situ.

(Speaking of which, I'm going to be in Mountain View next week (for orientation at Google, actually). I gather that you're in some completely other time zone these days, but if you'll be in the area I'd like to get together and chat. My contact information is on my website, which is reachable from my LJ userinfo; let me know if you're interested and available.)

Making the algorithms open doesn't mean that any flaws will be fixed quickly; it just means that they'll be _found_ (more) quickly.

Sure, and I'm not pretending otherwise. (Note that I only said responses will be generated more quickly; it's anyone's guess as to how well those responses will work, particularly since the spammers certainly won't be opening up their source code!) I should have said above that I think the potential for an open-source search engine to implode in grand style due to sheer developer frustration is enormous. But I still think that if enough dedicated people were on board, cool things could happen; it's hard to say how many is "enough", though, or how dedicated they need to be. Startups tend to have fanatically dedicated people working for them because the people know that the volume and quality of their work has a direct influence on whether they're going to have a job in three months; this really can't be said for open-source projects. Even when the work sucks giant monkey balls, a sense of urgency can be a great source of inspiration.

random Wikia users won't have access to the data that informs the algorithm design

Do we know this is true? (I didn't see it indicated in the article, but it was a pretty short article.) I suppose the bandwidth costs would be kind of insane if just any random person could pull the data whenever they wanted ("hey, let's DDoS Wikia today!"), but perhaps developer keys and rate-limiting, or BitTorrent, or something.

Nor, I suspect, will Wikia be letting just anyone _edit_ their algorithms, unless they're complete idiots.

Sure. I took

neoliminal's question to mean something like the setup that LiveJournal has, where the code is published and can be replicated elsewhere (e.g. DeadJournal), but users can't make changes to an instance of the system that they don't control.

From my experience, part of the problem with translating academic results to search engines in particular is that it's hard for an academic to demonstrate that their approach or improvement will work in actual practice.

Oh, absolutely, although the better conferences (e.g. SIGIR, KDD, &c) seem to at least pay lip service to scalability issues. But I totally agree that academics almost universally have blinders on when it comes to the notion of people using their systems in unintended or unexpected ways, and they don't write papers (and certainly don't implement code) with an eye toward this very real problem.

Still, I like the notion of J. Random Bored Hacker being able to read a paper, bang some code together, and see whether it works. J. Random Bored Hacker isn't going to have the hardware resources to put together his own private Google, but I know probably ten or twelve different people who have clusters running in their homes/warehouses/whatever just for shits and grins. There's got to be some guy out there with fifty Beowulfed dual-Xeons and a healthy curiosity about search...

I gather that you're in some completely other time zone these days

Yep, I'm in Belgium until late February, alas. If you'll be business-tripping later in the year, though, drop me a line in advance and we can grab dinner! (I am currently without car, and my day-to-day transportation needs are met well enough by SF public transit that I'm not especially motivated to fix my engine, shell out to get it fixed or buy a new car, but Mountain View is fairly reachable by train.)

random Wikia users won't have access to the data that informs the algorithm design

Do we know this is true? (I didn't see it indicated in the article, but it was a pretty short article.) I suppose the bandwidth costs would be kind of insane if just any random person could pull the data whenever they wanted ("hey, let's DDoS Wikia today!"), but perhaps developer keys and rate-limiting, or BitTorrent, or something.

"Kind of insane" is putting it mildly, I'd bet: I doubt that even someone with a T1 would even be able to keep up with mirroring the data that Google (or Microsoft, or Yahoo) uses as input. Consider that this data almost certainly involves every single search and page visit, not to mention data from web spiders giving you content and topology (i.e., link) updates.

That is, while, yes, lots of people doing this would placed a strain on [Google|Yahoo|Microsoft], but it seems infeasible for almost any individual to even _receive_ the data.

the better conferences (e.g. SIGIR, KDD, &c) seem to at least pay lip service to scalability issues.

Of course. But search engines (or, as I corrected myself in my last post to BB, "search services") are big agglomerations of rapidly evolving code, huge inputs, and lots of infrastructure. So it's not enough for an academic to say "I've demonstrated that the scalability is good up to 10M pages" (which would be somewhat exceptional, I suspect) because they're still a few orders of magnitude low, and because they have no way of demonstrating that their approach won't screw something else up in the existing engine, or push them over a critical performance boundary (in the wrong direction).

As a side point: Google runs its stuff on a custom version of Linux, I believe. I don't know how tied their algorithms are to that...but do they have to provide the source for that, too, so that J. Random Hacker can have the appropriate substrate on which to run their experiments?

As for J. Random Bored Hacker, that's what Lucene et al. are for. Of course, it doesn't allow you to duplicate the precise environment that a specific production search service has...but at least you can use it to dink with the algorithms themselves.

Sorry to hear you won't be around. Don't know when I might be in the Bay again, but if and when I'll let you know so that we can grab dinner (quick, before it gets away).

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

Heh.

Re: Heh.

There are only 5000 people in the world

Re: There are only 5000 people in the world

Re: There are only 5000 people in the world