Entry tags: Small world
There's a post up on BoingBoing today (ok, yesterday for me) about open vs. closed search algorithms, suggesting that the search algorithms used by Google, Yahoo, et al. are bad because of their lack of transparency. It invokes a comparison to an important concept in computer security: "security through obscurity" is dangerous because an effective encryption scheme should be equally hard to break whether or not you know the internals of the algorithm that generated the ciphertext.
I think comparing this to search is a bad (or at best misleading) idea, and I expounded on this in the comments. But I'm far more entertained by the fact that the two best comments on the post so far come from two sources with whom I am tangentially familiar, albeit from totally different directions:
jrtom and radtea. Small damn world!
no subject
In particular, there are loads of academic papers on search engines whose results just never get integrated into any real system because the closed-source developers are too busy and most academics don't really care about seeing their work actually implemented. Having an avenue for academic research to filter over into the real world is always a good thing, at least in CS.
no subject
Making the algorithms open doesn't mean that any flaws will be fixed quickly; it just means that they'll be _found_ (more) quickly. Part of the problem here is that it's a lot harder to make a good fix to an algorithm than it is to correct an historical article. For one thing, as I mentioned on BB, random Wikia users won't have access to the data that informs the algorithm design; all they'll be able to see is the algorithm and the resultant rankings. Nor, I suspect, will Wikia be letting just anyone _edit_ their algorithms, unless they're complete idiots.
there are loads of academic papers on search engines whose results just never get integrated into any real system because the closed-source developers are too busy and most academics don't really care about seeing their work actually implemented
From my experience, part of the problem with translating academic results to search engines in particular is that it's hard for an academic to demonstrate that their approach or improvement will work in actual practice. They don't have the data and they don't have the massive scale, and they don't know about all the rest of the inputs to (and demands on) the system. So the response of the search engine devs is likely to be something like "yeah, that might be cool, but...". Which is presumably one reason why Google wants to hire all those PhDs, so that they can work on these things in situ.
(Speaking of which, I'm going to be in Mountain View next week (for orientation at Google, actually). I gather that you're in some completely other time zone these days, but if you'll be in the area I'd like to get together and chat. My contact information is on my website, which is reachable from my LJ userinfo; let me know if you're interested and available.)
no subject
Sure, and I'm not pretending otherwise. (Note that I only said responses will be generated more quickly; it's anyone's guess as to how well those responses will work, particularly since the spammers certainly won't be opening up their source code!) I should have said above that I think the potential for an open-source search engine to implode in grand style due to sheer developer frustration is enormous. But I still think that if enough dedicated people were on board, cool things could happen; it's hard to say how many is "enough", though, or how dedicated they need to be. Startups tend to have fanatically dedicated people working for them because the people know that the volume and quality of their work has a direct influence on whether they're going to have a job in three months; this really can't be said for open-source projects. Even when the work sucks giant monkey balls, a sense of urgency can be a great source of inspiration.
random Wikia users won't have access to the data that informs the algorithm design
Do we know this is true? (I didn't see it indicated in the article, but it was a pretty short article.) I suppose the bandwidth costs would be kind of insane if just any random person could pull the data whenever they wanted ("hey, let's DDoS Wikia today!"), but perhaps developer keys and rate-limiting, or BitTorrent, or something.
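(For what it's worth, the developer-key-plus-rate-limiting idea wouldn't need anything exotic. Here's a minimal token-bucket sketch in Python -- all names and limits are hypothetical, just to show the shape of the thing:)

```python
import time

class TokenBucket:
    """Per-key rate limiter: refills at `rate` tokens/sec, caps at `burst`."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# One bucket per developer key; the limits here are made up for illustration.
buckets: dict[str, TokenBucket] = {}

def allow_request(dev_key: str) -> bool:
    bucket = buckets.setdefault(dev_key, TokenBucket(rate=10, burst=50))
    return bucket.allow()
```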
Nor, I suspect, will Wikia be letting just anyone _edit_ their algorithms, unless they're complete idiots.
Sure. I took that as a given.
From my experience, part of the problem with translating academic results to search engines in particular is that it's hard for an academic to demonstrate that their approach or improvement will work in actual practice.
Oh, absolutely, although the better conferences (e.g. SIGIR, KDD, &c) seem to at least pay lip service to scalability issues. But I totally agree that academics almost universally have blinders on when it comes to the notion of people using their systems in unintended or unexpected ways, and they don't write papers (and certainly don't implement code) with an eye toward this very real problem.
Still, I like the notion of J. Random Bored Hacker being able to read a paper, bang some code together, and see whether it works. J. Random Bored Hacker isn't going to have the hardware resources to put together his own private Google, but I know probably ten or twelve different people who have clusters running in their homes/warehouses/whatever just for shits and grins. There's got to be some guy out there with fifty Beowulfed dual-Xeons and a healthy curiosity about search...
I gather that you're in some completely other time zone these days
Yep, I'm in Belgium until late February, alas. If you'll be business-tripping later in the year, though, drop me a line in advance and we can grab dinner! (I am currently without a car, and my day-to-day transportation needs are met well enough by SF public transit that I'm not especially motivated to fix my engine, shell out to get it fixed, or buy a new car, but Mountain View is fairly reachable by train.)
no subject
Do we know this is true? (I didn't see it indicated in the article, but it was a pretty short article.) I suppose the bandwidth costs would be kind of insane if just any random person could pull the data whenever they wanted ("hey, let's DDoS Wikia today!"), but perhaps developer keys and rate-limiting, or BitTorrent, or something.
"Kind of insane" is putting it mildly, I'd bet: I doubt that even someone with a T1 would even be able to keep up with mirroring the data that Google (or Microsoft, or Yahoo) uses as input. Consider that this data almost certainly involves every single search and page visit, not to mention data from web spiders giving you content and topology (i.e., link) updates.
That is, while, yes, lots of people doing this would placed a strain on [Google|Yahoo|Microsoft], but it seems infeasible for almost any individual to even _receive_ the data.
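(To put rough numbers on that -- every figure below is an assumption for illustration, not an actual Google/Yahoo/Microsoft statistic -- a quick back-of-the-envelope:)

```python
# Back-of-the-envelope: can a T1 line keep up with mirroring a search
# engine's input data? All volume figures are guesses for illustration.

T1_BITS_PER_SEC = 1.544e6
t1_gb_per_day = T1_BITS_PER_SEC / 8 * 86_400 / 1e9
print(f"T1 capacity:      ~{t1_gb_per_day:.1f} GB/day")     # ~16.7 GB/day

# Assume a few hundred million queries/day and ~1 KB of log data per
# query-plus-click event (deliberately conservative).
log_tb_per_day = 3e8 * 1_000 / 1e12
print(f"Query/click logs: ~{log_tb_per_day:.1f} TB/day")    # ~0.3 TB/day

# Assume 1% of a 10-billion-page index re-crawled daily at ~20 KB/page.
crawl_tb_per_day = 10e9 * 0.01 * 20_000 / 1e12
print(f"Crawl refresh:    ~{crawl_tb_per_day:.1f} TB/day")  # ~2.0 TB/day
```

Even with deliberately conservative guesses, the daily feed comes out a couple of orders of magnitude beyond what a T1 can carry.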
the better conferences (e.g. SIGIR, KDD, &c) seem to at least pay lip service to scalability issues.
Of course. But search engines (or, as I corrected myself in my last post to BB, "search services") are big agglomerations of rapidly evolving code, huge inputs, and lots of infrastructure. So it's not enough for an academic to say "I've demonstrated that the scalability is good up to 10M pages" (which would be somewhat exceptional, I suspect) because they're still a few orders of magnitude low, and because they have no way of demonstrating that their approach won't screw something else up in the existing engine, or push them over a critical performance boundary (in the wrong direction).
As a side point: Google runs its stuff on a custom version of Linux, I believe. I don't know how tied their algorithms are to that...but do they have to provide the source for that, too, so that J. Random Hacker can have the appropriate substrate on which to run their experiments?
As for J. Random Bored Hacker, that's what Lucene et al. are for. Of course, it doesn't allow you to duplicate the precise environment that a specific production search service has...but at least you can use it to dink with the algorithms themselves.
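(To make the "dink with the algorithms" part concrete: you don't even need Lucene to start. A toy inverted index with a pluggable scoring function -- a minimal sketch, nothing like a production engine -- is enough to try out a ranking idea from a paper:)

```python
import math
from collections import defaultdict

class ToyIndex:
    """Minimal inverted index with a pluggable scoring function."""
    def __init__(self):
        self.postings = defaultdict(dict)   # term -> {doc_id: term frequency}
        self.doc_lengths = {}               # doc_id -> number of terms

    def add(self, doc_id, text):
        terms = text.lower().split()
        self.doc_lengths[doc_id] = len(terms)
        for t in terms:
            self.postings[t][doc_id] = self.postings[t].get(doc_id, 0) + 1

    def search(self, query, score_fn):
        scores = defaultdict(float)
        n_docs = len(self.doc_lengths)
        for t in query.lower().split():
            postings = self.postings.get(t, {})
            df = len(postings)
            for doc_id, tf in postings.items():
                scores[doc_id] += score_fn(tf, df, n_docs, self.doc_lengths[doc_id])
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# A classic tf-idf scorer; swap in whatever the paper you just read proposes.
def tfidf(tf, df, n_docs, doc_len):
    return (tf / doc_len) * math.log((n_docs + 1) / (df + 1))

idx = ToyIndex()
idx.add("a", "open search algorithms and open ranking")
idx.add("b", "closed search engines keep their ranking function secret")
print(idx.search("open ranking", tfidf))
```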
Sorry to hear you won't be around. Don't know when I might be in the Bay again, but if and when I'll let you know so that we can grab dinner (quick, before it gets away).
no subject
You're jumping over a step.
Those are practical problems, and in academia, it never gets that far. They don't care about practice, so they don't even attempt to address the concerns that they can reasonably make guesses about.
*If* academics cared about things working in practice, then what you're mentioning would be a problem. But, by and large, they don't.
no subject
(If nothing else, you might consider that quite a few academics have ties to industry.)
no subject
Publicly, at least. The NSA doesn't file bug reports.
no subject
(And as a side note: KML means "Keyhole Markup Language." You need a new name...)
no subject
For something like the code associated with a search service like Google's/Yahoo's/Microsoft's, the required level of obsession (or other motivation) would have to be pretty high, but I think that we all agree that there are many people who are deeply interested in finding flaws that can be exploited through cheap manipulation of the inputs.
no subject
(I.e., is it possible to build search algorithms that can't be subverted by malicious actors even if the algorithm is known to those malicious actors?)
I'm not seeing a way to do it, but then again, I'm not seeing a strong proof of impossibility, either.
no subject
I suppose the idea of security here is a bit obfuscated because one doesn't normally associate security with search rankings.
no subject
Well, the people designing these algorithms in practice do, as they have to deal with SEO/spammers.
I might be missing something, but it seems like what you're showing here is that yes, you can provide security through obscurity in search engine algorithms. We know this -- that's how the system currently works. What I want to know is "is this the only way?" Not "are these algorithms faster", as someone else brought up, but "is there a secure-against-cheating algorithm that can be publicly disclosed without compromising its security, and that still provides good enough results and performance?" Or, "If not, prove it."
Either side of that question is hard to answer, I think.
no subject
Well, let me amend this -- we know you can do this so that it takes the spammer/attacker some non-trivial amount of time to break your measure (i.e., do a blackbox analysis on your new algorithm and defeat it) such that you already have a fix in place, or can have a fix in place quickly enough that the attacker doesn't reduce your utility below that of your competitors'.
The existing system works "well enough". Would an open system work as well, worse, not at all, better, or perfectly? Those are going to be hard questions to answer empirically, and harder questions to answer with proofs.
no subject
The question of whether it's possible to build unsubvertible search algorithms with performance equivalent to the current ones (for whatever definitions of "performance" you find useful and interesting) is a more subtle question.
My feeling is that, even setting aside such ridiculous examples as the one I proffer above, removing features that can be subverted will lead to weaker, less interesting models. (I grant you that there are some interesting problems in adapting/constraining such features so as to make their subversion either difficult or expensive without adversely affecting good actors.)
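(A toy illustration of the tradeoff -- this is not anyone's actual ranking function, just a made-up score mixing text relevance with a subvertible popularity feature:)

```python
import math

def score(text_relevance: float, inbound_links: int, use_links: bool = True) -> float:
    """Made-up ranking score: text relevance plus an easily-manufactured link feature."""
    link_signal = math.log1p(inbound_links) if use_links else 0.0
    return 0.7 * text_relevance + 0.3 * link_signal

honest_page = score(text_relevance=0.9, inbound_links=12)
# An attacker who knows the formula just manufactures the feature (link farms):
spam_page = score(text_relevance=0.3, inbound_links=50_000)
print(honest_page, spam_page)   # the spam page wins (~1.40 vs ~3.46)

# Dropping the feature closes the hole but also throws away legitimate signal:
print(score(0.9, 12, use_links=False), score(0.3, 50_000, use_links=False))
```

The gamed feature and the genuinely useful feature are the same feature; that's the bind.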
no subject
Let's momentarily accept as a premise (and I'll be the first to say that I don't think you can take this as a premise, but it makes the math easier) that in the universe U of web pages, for every query Q there exists some ideal total ordering T which reflects relevance to Q. (I'm pretty sure this extends wlog to partial orderings as well.) Now assume we have some algorithm A(U,Q) which generates the corresponding T for any Q.
"Subverting A" is thus defined as modifying the contents of U such that A(U,Q) no longer generates an appropriate T, but rather some T' which contains results incorrectly ranked in terms of their relevance to Q. We could also say that A is subversion-resistant if changing some web page u in U in a way that does not increase u's relevance to Q cannot possibly increase u's position in T.
But the part where all this falls down and goes boom is, what's the standard for relevance? In crypto, we have a goal: "disguise signal in some reversible way such that the disguised signal is indistinguishable from random noise." We know what noise looks like; Shannon told us. We can thus make strong statements about the security of crypto algorithms, because we can make meaningful statements about the entropy of their output and thus how hard they are to distinguish from noise.
If we had a universal standard of relevance the way we have a universal standard of noise, we could say something equally meaningful about search-engine rankings. But I don't see how to give one.
Impossible things are hard, let's do math.
no subject
So, my question still stands. Your Algorithm A is broken, as you say (in fact, Yahoo worked that way in the Good Old Days, and I know of lots of people who grabbed websites like aardvark.com to exploit that attack). "Return pages in random order" is not a search algorithm; it's a random retrieval algorithm, and, furthermore, it doesn't prevent an attacker from analyzing your PRNG and composing pages that are "randomly" selected by the search engine.
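(On the PRNG point, a toy sketch -- the generator and the seeding scheme here are purely hypothetical -- of why "random" selection doesn't help once the algorithm is known:)

```python
# Toy sketch: if "random" selection uses a known, predictable PRNG (here a
# textbook LCG) with a guessable seed, an attacker who knows the algorithm
# can reproduce the engine's picks exactly.

def lcg(seed, n, a=1103515245, c=12345, m=2**31):
    x = seed
    for _ in range(n):
        x = (a * x + c) % m
        yield x

pages = [f"page-{i}" for i in range(1000)]
seed = 20080107                      # e.g. derived from the date -- guessable

engine_picks   = [pages[x % len(pages)] for x in lcg(seed, 5)]
attacker_picks = [pages[x % len(pages)] for x in lcg(seed, 5)]
assert engine_picks == attacker_picks    # attacker knows exactly what will surface
print(engine_picks)
```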
Not seeing the proof of impossibility here.
Heh.
I'm sort of wondering just how central this flawed comparison is to the book that Cory mentioned. I like Cory and he's a smart guy, but I think he's been pounding on the "free the IP!" campaign (with which I agree, in general) for long enough that he's in danger of starting to perceive everything as a nail. (So to speak.)
I should have clarified that I actually _do_ think that user feedback on both relevance and ranking could be useful, so I'll look forward to seeing what Wikia does in that space...but making those algorithms (or at least the ranking algorithm) publicly known is, I think, asking for trouble.
There are only 5000 people in the world
If there were more, we wouldn't keep running into each other all the time.
Re: There are only 5000 people in the world
Looking at your f-list I see many connections in common, although