Small world
Jan. 2nd, 2008 07:52 pm
There's a post up on BoingBoing today (ok, yesterday for me) about open vs. closed search algorithms, suggesting that the search algorithms used by Google, Yahoo, et al. are bad because of their lack of transparency. It invokes a comparison to an important concept in computer security: "security through obscurity" is dangerous because an effective encryption scheme should be equally hard to break whether or not you know the internals of the algorithm that generated the ciphertext.
I think comparing this to search is a bad (or at best misleading) idea, and I expounded on this in the comments. But I'm far more entertained by the fact that the two best comments on the post so far come from two sources with whom I am tangentially familiar, albeit from totally different directions:
jrtom and
radtea. Small damn world!
(no subject)
Date: 2008-01-02 07:56 pm (UTC)
Making the algorithms open doesn't mean that any flaws will be fixed quickly; it just means that they'll be _found_ (more) quickly. Part of the problem here is that it's a lot harder to make a good fix to an algorithm than it is to correct a historical article. For one thing, as I mentioned on BB, random Wikia users won't have access to the data that informs the algorithm design; all they'll be able to see is the algorithm and the resultant rankings. Nor, I suspect, will Wikia be letting just anyone _edit_ their algorithms, unless they're complete idiots.
there are loads of academic papers on search engines whose results just never get integrated into any real system because the closed-source developers are too busy and most academics don't really care about seeing their work actually implemented
From my experience, part of the problem with translating academic results to search engines in particular is that it's hard for an academic to demonstrate that their approach or improvement will work in actual practice. They don't have the data and they don't have the massive scale, and they don't know about all the rest of the inputs to (and demands on) the system. So the response of the search engine devs is likely to be something like "yeah, that might be cool, but...". Which is presumably one reason why Google wants to hire all those PhDs, so that they can work on these things in situ.
(Speaking of which, I'm going to be in Mountain View next week (for orientation at Google, actually). I gather that you're in some completely other time zone these days, but if you'll be in the area I'd like to get together and chat. My contact information is on my website, which is reachable from my LJ userinfo; let me know if you're interested and available.)
(no subject)
Date: 2008-01-02 08:21 pm (UTC)
Sure, and I'm not pretending otherwise. (Note that I only said responses will be generated more quickly; it's anyone's guess as to how well those responses will work, particularly since the spammers certainly won't be opening up their source code!) I should have said above that I think the potential for an open-source search engine to implode in grand style due to sheer developer frustration is enormous. But I still think that if enough dedicated people were on board, cool things could happen; it's hard to say how many is "enough", though, or how dedicated they need to be. Startups tend to have fanatically dedicated people working for them because the people know that the volume and quality of their work has a direct influence on whether they're going to have a job in three months; this really can't be said for open-source projects. Even when the work sucks giant monkey balls, a sense of urgency can be a great source of inspiration.
random Wikia users won't have access to the data that informs the algorithm design
Do we know this is true? (I didn't see it indicated in the article, but it was a pretty short article.) I suppose the bandwidth costs would be kind of insane if just any random person could pull the data whenever they wanted ("hey, let's DDoS Wikia today!"), but perhaps developer keys and rate-limiting, or BitTorrent, or something.
Nor, I suspect, will Wikia be letting just anyone _edit_ their algorithms, unless they're complete idiots.
Sure. I took
From my experience, part of the problem with translating academic results to search engines in particular is that it's hard for an academic to demonstrate that their approach or improvement will work in actual practice.
Oh, absolutely, although the better conferences (e.g. SIGIR, KDD, &c) seem to at least pay lip service to scalability issues. But I totally agree that academics almost universally have blinders on when it comes to the notion of people using their systems in unintended or unexpected ways, and they don't write papers (and certainly don't implement code) with an eye toward this very real problem.
Still, I like the notion of J. Random Bored Hacker being able to read a paper, bang some code together, and see whether it works. J. Random Bored Hacker isn't going to have the hardware resources to put together his own private Google, but I know probably ten or twelve different people who have clusters running in their homes/warehouses/whatever just for shits and grins. There's got to be some guy out there with fifty Beowulfed dual-Xeons and a healthy curiosity about search...
I gather that you're in some completely other time zone these days
Yep, I'm in Belgium until late February, alas. If you'll be business-tripping later in the year, though, drop me a line in advance and we can grab dinner! (I am currently without car, and my day-to-day transportation needs are met well enough by SF public transit that I'm not especially motivated to fix my engine, shell out to get it fixed or buy a new car, but Mountain View is fairly reachable by train.)
(no subject)
Date: 2008-01-02 08:51 pm (UTC)
Do we know this is true? (I didn't see it indicated in the article, but it was a pretty short article.) I suppose the bandwidth costs would be kind of insane if just any random person could pull the data whenever they wanted ("hey, let's DDoS Wikia today!"), but perhaps developer keys and rate-limiting, or BitTorrent, or something.
"Kind of insane" is putting it mildly, I'd bet: I doubt that even someone with a T1 would be able to keep up with mirroring the data that Google (or Microsoft, or Yahoo) uses as input. Consider that this data almost certainly involves every single search and page visit, not to mention data from web spiders giving you content and topology (i.e., link) updates.
That is, while, yes, lots of people doing this would place a strain on [Google|Yahoo|Microsoft], it seems infeasible for almost any individual even to _receive_ the data.
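To put a rough number on the T1 point, here's a back-of-envelope calculation. The data volume is my own assumption (say a petabyte of crawl plus clickstream data; nobody outside the companies knows the real figure), but the T1 line rate is standard:

```python
# Back-of-envelope: can a T1 line keep up with a search engine's input data?
# Assumption (mine, not from the thread): the raw crawl + clickstream data
# runs to roughly a petabyte; the real figure is unknown.

T1_BPS = 1.544e6            # T1 line rate, bits per second
DATA_BYTES = 1e15           # assumed 1 PB of input data

bytes_per_day = T1_BPS / 8 * 86_400          # ~16.7 GB/day
days_to_mirror = DATA_BYTES / bytes_per_day  # ~60,000 days

print(f"~{bytes_per_day / 1e9:.1f} GB/day, "
      f"~{days_to_mirror / 365:.0f} years to mirror 1 PB")
```

Even if the petabyte guess is off by an order of magnitude in either direction, the conclusion holds: a single T1 can't come close to mirroring, let alone keeping up with the ongoing updates.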
the better conferences (e.g. SIGIR, KDD, &c) seem to at least pay lip service to scalability issues.
Of course. But search engines (or, as I corrected myself in my last post to BB, "search services") are big agglomerations of rapidly evolving code, huge inputs, and lots of infrastructure. So it's not enough for an academic to say "I've demonstrated that the scalability is good up to 10M pages" (which would be somewhat exceptional, I suspect) because they're still a few orders of magnitude low, and because they have no way of demonstrating that their approach won't screw something else up in the existing engine, or push them over a critical performance boundary (in the wrong direction).
As a side point: Google runs its stuff on a custom version of Linux, I believe. I don't know how tied their algorithms are to that...but do they have to provide the source for that, too, so that J. Random Hacker can have the appropriate substrate on which to run their experiments?
As for J. Random Bored Hacker, that's what Lucene et al. are for. Of course, it doesn't allow you to duplicate the precise environment that a specific production search service has...but at least you can use it to dink with the algorithms themselves.
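For a sense of the kind of thing Lucene hands you out of the box, here's a toy inverted index with TF-IDF ranking. This is purely illustrative (it is not Lucene's actual API, and the corpus is made up), but it's the core machinery J. Random Bored Hacker would be dinking with:

```python
import math
from collections import Counter, defaultdict

# Toy inverted index + TF-IDF ranking: a sketch of the machinery that
# Lucene provides. Illustrative only -- not Lucene's API, fake corpus.

docs = {
    "d1": "open search algorithms and ranking",
    "d2": "closed search engines guard their ranking data",
    "d3": "security through obscurity in encryption",
}

index = defaultdict(dict)               # term -> {doc_id: term frequency}
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        index[term][doc_id] = tf

def search(query):
    """Rank documents by summed TF-IDF scores of the query terms."""
    scores = Counter()
    for term in query.split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(len(docs) / len(postings))  # rarer term -> higher weight
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return [doc_id for doc_id, _ in scores.most_common()]

print(search("ranking data"))  # d2 outranks d1: it alone contains "data"
```

Swapping in a different scoring function (or a link-based signal) is a one-function change, which is exactly the "dink with the algorithms" appeal.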
Sorry to hear you won't be around. Don't know when I might be in the Bay again, but if and when I'll let you know so that we can grab dinner (quick, before it gets away).
(no subject)
Date: 2008-01-02 09:24 pm (UTC)
I doubt that even someone with a T1 would even be able to keep up with mirroring the data that Google (or Microsoft, or Yahoo) uses as input.
Sure, I wouldn't expect anyone to try and mirror all the data. But I do think that independent developers could get useful data to work with through statistical sampling.
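The sampling idea can be made concrete: even without the bandwidth to take the whole query log, a developer could keep a fixed-size uniform sample of whatever stream they can see, using reservoir sampling. A minimal sketch, with a fake stand-in stream:

```python
import random

# Reservoir sampling (Algorithm R): keep a uniform random sample of k
# items from a stream of unknown, possibly huge, length -- the sort of
# "statistical sampling" an independent developer could do instead of
# mirroring a full query log. The stream here is fake stand-in data.

def reservoir_sample(stream, k, rng=random.Random(42)):
    """Return a uniform random sample of k items from an iterable."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)           # item survives with prob. k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

queries = (f"query-{i}" for i in range(1_000_000))
sample = reservoir_sample(queries, k=1000)
print(len(sample))  # 1000
```

One pass, constant memory, and the sample stays uniform no matter how long the stream runs, which is what makes it workable at the receiving end of a rate-limited feed.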
...
Hey, here's a stupid thought. The web is big, so exhaustive spidering is costly, indexing content is costly, yada yada yada. But I wonder: how similar do you suppose the content that Google keeps on the backend is to the content that Microsoft or Yahoo keep on their backends? Or, rephrased: are the material differences between the major search services more a function of the data they maintain, the algorithms they use, or both?
Because, you know, it could be kind of cool if there were a large, world-readable index that somebody could access through a scalable service like Amazon EC2, thereby having the ability to modify their code on the fly and gauge the effects of that.
Lucene I am not familiar with; I'll have to look into it. Apache is doing all kinds of interesting stuff these days.
(I don't know what Google's using OS-wise, but I do know that they've modified the hell out of MySQL for a whole lot of different things -- though what those things are, I have no idea.)
(no subject)
Date: 2008-01-02 10:24 pm (UTC)
Ah, got it. Hmm. Might be feasible to do sampling.
As to your proposal: my completely uninformed guess is that these services overlap in something like 90% of the _types_ of data that they collect/maintain.
...
Actually, I'm going to revise that figure downwards. I'd guess that Google uses data from GMail and their ad programs, for instance, to inform their search ranking algorithms.
The suggestion is an interesting one, but I suspect that it wouldn't fly:
(1) It would have to be a (partial) mirror; there's no way that the services would want to base their production systems on infrastructure that wasn't dedicated. So cost-wise it would be pure overhead for the sponsors.
(2) A lot of the data has privacy implications that are hard to deal with. Remember that flap over the release of a bunch of search queries?
(3) I suspect that the various services don't really want attention drawn to the breadth and depth of the information that they use for these purposes.
Definitely an interesting idea, though, and it might be possible to do something like this even if the data sources that the major players use are widely disparate. It wouldn't help you to answer questions like "how would Google work if I tweaked this constant?" but for general search research, it could be useful.
YaCy might be an interesting resource in this context.
(no subject)
Date: 2008-01-03 03:12 am (UTC)
You're jumping over a step.
Those are practical problems, and in academia, it never gets that far. They don't care about practice, so they don't even attempt to address the concerns that they can reasonably make guesses about.
*If* academics cared about things working in practice, then what you're mentioning would be a problem. But, by and large, they don't.
(no subject)
Date: 2008-01-03 03:25 am (UTC)
(If nothing else, you might consider that quite a few academics have ties to industry.)
(no subject)
Date: 2008-01-03 06:48 am (UTC)
Yeah, me too. I can actually one-up you: I *am* one. But hell if I'm going to stay one, and I gather from your comments that you're in industry now yourself.
Real academics don't leave industry: they go on to get their PhDs, do a few postdocs, become a professor, find a tenure-track position, get tenure, and during this entire process train other baby academics to grow up and be paper-producing machines who don't care about how things work "in practice" because they're concerned about "the bigger picture".
Academics who care about making things feasible wind up not publishing a lot of "good" work, because they waste their time on mere implementation details. So they have to choose: worry about things working in practice and wash out of academia (perhaps go find a job at a company that is actually doing neat theory things in practice, and be happy as a clam, except for the whole not-publishing-anything business), or give up on the practical-consideration angle, at least most of the time, and fight the academia battle.
Yes, of course there are exceptions to this and any generalization. But I think I am characterizing things pretty well: in the end, you either have to pick, have to not care, or have to get really lucky when it comes to "papers vs. practical work."
(no subject)
Date: 2008-01-03 06:37 pm (UTC)
You are correct that I'm now in industry. I may go back and finish the PhD at some point, but for now it's on indefinite hold.
I do know other academics that are strongly concerned with practical considerations, however. Some fields promote this more than others. For example (while this is outside my personal experience), I suspect that you'll find very few civil engineering professors that aren't concerned with practice. Ditto for most engineering disciplines.
If you want to constrain this to fields in "computer science", I would still say that there are areas in which academics are likely to be deeply interested in applications. Data mining, for one. Machine learning. Compiler design (I conjecture). Software methodology. In any of these fields, of course you'll find the theoreticians, but the fields themselves put a fair amount of emphasis on practical demonstrations and meaningful experiments and metrics, and if your papers don't reflect that, then you may have trouble publishing them.
In lots of other areas, I completely agree that academics tend to be less concerned with practical details--which, IMO, is as it should be (someone needs to do basic research that isn't immediately concerned with practice, if for no other reason than that we don't always see applications immediately).
As a point of interest: the on-site reviewers for the government grant on which I worked the longest while a PhD student were consistently most impressed and happy with the software library (JUNG) that I co-wrote while on that grant, although they thought the theoretical/algorithms/methods stuff was good too. (My supervisor, who's done a good job of nailing a foot in both the theoretical and practical side of the fence, initially thought that I shouldn't be spending as much time on JUNG. He changed his mind once the reviews started coming back. :) )
(no subject)
Date: 2008-01-03 06:42 pm (UTC)
If you know of a university that has data mining professors who pay more than just lip service to giving a damn about applications, please let me know in case I ever decide to go back and finish my own PhD. (I left Iowa largely because of an advisor who flat-out ordered me not to work on OBELisQ because it "wasn't useful and couldn't be done anyway.")
(no subject)
Date: 2008-01-03 06:50 pm (UTC)
U of Washington has a few, I think, but I'm a bit out of touch there. I suspect that UC Berkeley and other SF-area departments also have some such.
What you want to do, I suspect, is find a department in which the professors have a tendency to have strong ties to industry--this may be manifested by consulting jobs, or part-time appointments, or oscillations between industry and academic positions, or students that go on to work in industry.
(no subject)
Date: 2008-01-04 06:28 pm (UTC)
(no subject)
Date: 2008-01-04 05:49 pm (UTC)
They won't last as academics. ;)
I suspect you're right about civil engineering and other engineering disciplines having a higher percentage of people interested in practice -- that said, handwaving away a problem for the sake of publishing a paper isn't beneath any academic in any field, and the problem continues to exist because the program committees for conferences put up with it.
As for experiments and metrics, sure, emphasis is placed on them for certain fields and certain conferences. That said, "meaningful" is up for debate. What's meaningful to an academic is not necessarily meaningful to an implementor.
[And as a side note -- you've never driven in Brussels, I assume, or you wouldn't have been so certain about civil engineers and practice. ;) ]
He changed his mind once the reviews started coming back.
Oh, definitely -- if "practicality = more grant money" then they change their minds in a heartbeat. Sometimes they even try to take credit for the whole idea in the first place. But I ask this: what percentage of students listen to their advisors, vs. the percentage that stick with the practical work? I'd like to think most of them stick to their convictions and plow ahead, and if I sample my friends in academia, that seems to be the case. But that could also tell me that the academics I am friends with are stubborn people who care more about practicality than finishing their PhDs.
I wonder if a sociologist has done any studies on this.
Nah, too practical.