maradydd ([personal profile] maradydd) wrote 2008-01-02 07:52 pm

Small world

There's a post up on BoingBoing today (ok, yesterday for me) about open vs. closed search algorithms, suggesting that the search algorithms used by Google, Yahoo et al. are bad because of their lack of transparency. It invokes a comparison to an important concept in computer security: "security through obscurity" is dangerous because an effective encryption scheme should be equally hard to break whether or not you know the internals of the algorithm that generated the ciphertext.

I think comparing this to search is a bad (or at best misleading) idea, and expounded on this in the comments. But I'm far more entertained by the fact that the two best comments on the post so far come from two sources with whom I am tangentially familiar, albeit from totally different directions: [livejournal.com profile] jrtom and [livejournal.com profile] radtea. Small damn world!

[identity profile] neoliminal.livejournal.com 2008-01-02 07:20 pm (UTC)(link)
While the analogy of cryptography to corporate secrets is spurious, what did you think of the idea of an open source search engine?

[identity profile] maradydd.livejournal.com 2008-01-02 07:37 pm (UTC)(link)
I think it's a fine idea precisely for the reason that (given enough developers behind it) it has the potential to speed up the arms race. If spammers have access to the source code, they will be able to develop their attacks more quickly; as a result, we'll see those attacks in the wild more quickly, and more people will have the potential to develop responses to those attacks. Given enough eyes (and that's always the problem, isn't it?), those responses will be generated more quickly, and as a whole I think it'll increase the rate of advancement of the field.

In particular, there are loads of academic papers on search engines whose results just never get integrated into any real system because the closed-source developers are too busy and most academics don't really care about seeing their work actually implemented. Having an avenue for academic research to filter over into the real world is always a good thing, at least in CS.

[identity profile] jrtom.livejournal.com 2008-01-02 07:56 pm (UTC)(link)
If spammers have access to the source code, they will be able to develop their attacks more quickly; as a result, we'll see those attacks in the wild more quickly, and more people will have the potential to develop responses to those attacks.

Making the algorithms open doesn't mean that any flaws will be fixed quickly; it just means that they'll be _found_ (more) quickly. Part of the problem here is that it's a lot harder to make a good fix to an algorithm than it is to correct an historical article. For one thing, as I mentioned on BB, random Wikia users won't have access to the data that informs the algorithm design; all they'll be able to see is the algorithm and the resultant rankings. Nor, I suspect, will Wikia be letting just anyone _edit_ their algorithms, unless they're complete idiots.

there are loads of academic papers on search engines whose results just never get integrated into any real system because the closed-source developers are too busy and most academics don't really care about seeing their work actually implemented

From my experience, part of the problem with translating academic results to search engines in particular is that it's hard for an academic to demonstrate that their approach or improvement will work in actual practice. They don't have the data and they don't have the massive scale, and they don't know about all the rest of the inputs to (and demands on) the system. So the response of the search engine devs is likely to be something like "yeah, that might be cool, but...". Which is presumably one reason why Google wants to hire all those PhDs, so that they can work on these things in situ.

(Speaking of which, I'm going to be in Mountain View next week (for orientation at Google, actually). I gather that you're in some completely other time zone these days, but if you'll be in the area I'd like to get together and chat. My contact information is on my website, which is reachable from my LJ userinfo; let me know if you're interested and available.)

[identity profile] maradydd.livejournal.com 2008-01-02 08:21 pm (UTC)(link)
Making the algorithms open doesn't mean that any flaws will be fixed quickly; it just means that they'll be _found_ (more) quickly.

Sure, and I'm not pretending otherwise. (Note that I only said responses will be generated more quickly; it's anyone's guess as to how well those responses will work, particularly since the spammers certainly won't be opening up their source code!) I should have said above that I think the potential for an open-source search engine to implode in grand style due to sheer developer frustration is enormous. But I still think that if enough dedicated people were on board, cool things could happen; it's hard to say how many is "enough", though, or how dedicated they need to be. Startups tend to have fanatically dedicated people working for them because the people know that the volume and quality of their work has a direct influence on whether they're going to have a job in three months; this really can't be said for open-source projects. Even when the work sucks giant monkey balls, a sense of urgency can be a great source of inspiration.

random Wikia users won't have access to the data that informs the algorithm design

Do we know this is true? (I didn't see it indicated in the article, but it was a pretty short article.) I suppose the bandwidth costs would be kind of insane if just any random person could pull the data whenever they wanted ("hey, let's DDoS Wikia today!"), but perhaps developer keys and rate-limiting, or BitTorrent, or something.
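A minimal sketch of what I mean by "developer keys and rate-limiting" -- a per-key token bucket; the class name and the numbers below are made up purely for illustration, not anybody's real API:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Per-developer-key token bucket: each key gets a burst allowance that refills
    // at a fixed rate, so no single key can hammer the data feed.
    public class PerKeyRateLimiter {
        private static final double CAPACITY = 100.0;      // max burst per key
        private static final double REFILL_PER_SEC = 10.0; // sustained requests/sec per key

        private static final class Bucket {
            double tokens = CAPACITY;
            long lastRefillNanos = System.nanoTime();
        }

        private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

        // Returns true if the request made under this developer key may proceed.
        public boolean allow(String developerKey) {
            Bucket b = buckets.computeIfAbsent(developerKey, k -> new Bucket());
            synchronized (b) {
                long now = System.nanoTime();
                double elapsedSec = (now - b.lastRefillNanos) / 1e9;
                b.tokens = Math.min(CAPACITY, b.tokens + elapsedSec * REFILL_PER_SEC);
                b.lastRefillNanos = now;
                if (b.tokens < 1.0) {
                    return false;
                }
                b.tokens -= 1.0;
                return true;
            }
        }
    }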

Nor, I suspect, will Wikia be letting just anyone _edit_ their algorithms, unless they're complete idiots.

Sure. I took [livejournal.com profile] neoliminal's question to mean something like the setup that LiveJournal has, where the code is published and can be replicated elsewhere (e.g. DeadJournal), but users can't make changes to an instance of the system that they don't control.

From my experience, part of the problem with translating academic results to search engines in particular is that it's hard for an academic to demonstrate that their approach or improvement will work in actual practice.

Oh, absolutely, although the better conferences (e.g. SIGIR, KDD, &c) seem to at least pay lip service to scalability issues. But I totally agree that academics almost universally have blinders on when it comes to the notion of people using their systems in unintended or unexpected ways, and they don't write papers (and certainly don't implement code) with an eye toward this very real problem.

Still, I like the notion of J. Random Bored Hacker being able to read a paper, bang some code together, and see whether it works. J. Random Bored Hacker isn't going to have the hardware resources to put together his own private Google, but I know probably ten or twelve different people who have clusters running in their homes/warehouses/whatever just for shits and grins. There's got to be some guy out there with fifty Beowulfed dual-Xeons and a healthy curiosity about search...

I gather that you're in some completely other time zone these days

Yep, I'm in Belgium until late February, alas. If you'll be business-tripping later in the year, though, drop me a line in advance and we can grab dinner! (I am currently without car, and my day-to-day transportation needs are met well enough by SF public transit that I'm not especially motivated to fix my engine, shell out to get it fixed or buy a new car, but Mountain View is fairly reachable by train.)

[identity profile] jrtom.livejournal.com 2008-01-02 08:51 pm (UTC)(link)
random Wikia users won't have access to the data that informs the algorithm design

Do we know this is true? (I didn't see it indicated in the article, but it was a pretty short article.) I suppose the bandwidth costs would be kind of insane if just any random person could pull the data whenever they wanted ("hey, let's DDoS Wikia today!"), but perhaps developer keys and rate-limiting, or BitTorrent, or something.


"Kind of insane" is putting it mildly, I'd bet: I doubt that even someone with a T1 would even be able to keep up with mirroring the data that Google (or Microsoft, or Yahoo) uses as input. Consider that this data almost certainly involves every single search and page visit, not to mention data from web spiders giving you content and topology (i.e., link) updates.

That is, while, yes, lots of people doing this would placed a strain on [Google|Yahoo|Microsoft], but it seems infeasible for almost any individual to even _receive_ the data.
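Rough numbers, just to put a scale on "kind of insane" (back of the envelope, assuming only the nominal T1 line rate and ignoring protocol overhead):

    \[
    1.544\ \text{Mbit/s} \div 8 \approx 0.193\ \text{MB/s}, \qquad
    0.193\ \text{MB/s} \times 86{,}400\ \text{s/day} \approx 16.7\ \text{GB/day} \approx 0.5\ \text{TB/month}.
    \]

A full crawl-plus-query-log feed from one of the big engines would plausibly run to terabytes per day, so the mismatch is orders of magnitude, not something you could tune your way around.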

the better conferences (e.g. SIGIR, KDD, &c) seem to at least pay lip service to scalability issues.

Of course. But search engines (or, as I corrected myself in my last post to BB, "search services") are big agglomerations of rapidly evolving code, huge inputs, and lots of infrastructure. So it's not enough for an academic to say "I've demonstrated that the scalability is good up to 10M pages" (which would be somewhat exceptional, I suspect) because they're still a few orders of magnitude low, and because they have no way of demonstrating that their approach won't screw something else up in the existing engine, or push them over a critical performance boundary (in the wrong direction).

As a side point: Google runs its stuff on a custom version of Linux, I believe. I don't know how tied their algorithms are to that...but do they have to provide the source for that, too, so that J. Random Hacker can have the appropriate substrate on which to run their experiments?

As for J. Random Bored Hacker, that's what Lucene et al. are for. Of course, it doesn't allow you to duplicate the precise environment that a specific production search service has...but at least you can use it to dink with the algorithms themselves.
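For the genuinely bored, a minimal Lucene sketch of the index-then-search loop (API names here follow a recent Lucene release and may not match whatever version you grab; treat it as an outline rather than a recipe):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class TinySearch {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("tiny-index"));
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // Index a couple of toy "pages".
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                for (String body : new String[] {
                        "open source search engines and ranking algorithms",
                        "security through obscurity in cryptography" }) {
                    Document doc = new Document();
                    doc.add(new TextField("body", body, Field.Store.YES));
                    writer.addDocument(doc);
                }
            }

            // Search the index; swapping in a custom Similarity on the searcher is
            // where the dinking-with-ranking experiments would start.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(new QueryParser("body", analyzer).parse("ranking"), 10);
                for (ScoreDoc hit : hits.scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("body") + "  score=" + hit.score);
                }
            }
        }
    }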


Sorry to hear you won't be around. Don't know when I might be in the Bay again, but if and when I am, I'll let you know so that we can grab dinner (quick, before it gets away).

[identity profile] enochsmiles.livejournal.com 2008-01-03 03:12 am (UTC)(link)
They don't have the data and they don't have the massive scale, and they don't know about all the rest of the inputs to (and demands on) the system.

You're jumping over a step.

Those are practical problems, and in academia, it never gets that far. They don't care about practice, so they don't even attempt to address the concerns that they can reasonably make guesses about.

*If* academics cared about things working in practice, then what you're mentioning would be a problem. But, by and large, they don't.

[identity profile] jrtom.livejournal.com 2008-01-03 03:25 am (UTC)(link)
I think you're overstating the case, or at least overgeneralizing. From personal experience I can state that _some_ academics, at least, do in fact care about practical considerations. I have been one of them. :)

(If nothing else, you might consider that quite a few academics have ties to industry.)

[identity profile] enochsmiles.livejournal.com 2008-01-03 03:10 am (UTC)(link)
I stopped believing in the "many eyes make code shallow" nonsense a long time ago. The average amount of time it takes for a bug in PGP (whose source code has always been public, and whose company and individual contributors have made a huge stir trying to get public auditing of the code) to be discovered by source-code analysis is about five years.

Publicly, at least. The NSA doesn't file bug reports.

[identity profile] jrtom.livejournal.com 2008-01-03 06:39 pm (UTC)(link)
Honestly I'm not surprised. As you say, the NSA doesn't file bug reports...and they, and other organizations like them, are both the most motivated to find flaws in it, and the least motivated to report them. (And among the most capable of finding flaws, of course.)

[identity profile] maradydd.livejournal.com 2008-01-03 06:44 pm (UTC)(link)
*shrug* KML has seen a sudden upswing in users over the last few weeks, and they're filing bug reports like there's no tomorrow -- and tracking them down to their origin in the source. I certainly appreciate it; it makes my job easier.

[identity profile] enochsmiles.livejournal.com 2008-01-04 04:45 pm (UTC)(link)
Functionality bugs and security bugs are different. How many of these bugs are ones that do not affect functionality, but only affect security? I suspect very few.

(And as a side note: KML means "Keyhole Markup Language." You need a new name...)

[identity profile] jrtom.livejournal.com 2008-01-03 06:54 pm (UTC)(link)
Following on to my earlier response, and addressing your point more directly: I don't believe that "many eyes make code shallow", per se. I do believe that the more people that can see (and are motivated to look at) your code, the more likely it is that some sufficiently obsessed person will take a close look and run across any bugs that do occur. (That is, you don't necessarily need _lots_ of people, just the right ones, but the more people you have looking, the higher the probability that the right ones will be included.)
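In crude terms: if each onlooker independently has some small probability p of being "the right one", the chance that at least one of n onlookers is works out to

    \[
    P(\text{at least one right reader}) = 1 - (1 - p)^n ,
    \]

which climbs toward 1 as n grows, no matter how small p is -- provided p isn't zero, which is really where the disagreement lies.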

For something like the code associated with a search service like Google's/Yahoo's/Microsoft's, the required level of obsession (or other motivation) would have to be pretty high, but I think that we all agree that there are many people who are deeply interested in finding flaws that can be exploited through cheap manipulation of the inputs.

[identity profile] enochsmiles.livejournal.com 2008-01-03 03:06 am (UTC)(link)
Hmm. Is the analogy of secure crypto algorithms to cheat-proof search algorithms spurious?

(I.e., is it possible to build search algorithms that can't be subverted by malicious actors even if the algorithm is known to those malicious actors?)

I'm not seeing a way to do it, but then again, I'm not seeing a strong proof of impossibility, either.

[identity profile] neoliminal.livejournal.com 2008-01-03 02:21 pm (UTC)(link)
Let's assume said search algorithm was known to someone even though it was supposedly hidden. Further assume this was the only protection in place against attacks to raise SEO scores. I assume that this would, indeed, be security through obscurity, bending slightly the idea that search engines were keeping said source private as an act of security.

I suppose the idea of security here is a bit obfuscated because one doesn't normally associate security with search rankings.

[identity profile] enochsmiles.livejournal.com 2008-01-04 05:04 pm (UTC)(link)
I suppose the idea of security here is a bit obfuscated because one doesn't normally associate security with search rankings.

Well, the people designing these algorithms in practice do, as they have to deal with SEO/spammers.

I might be missing something, but it seems like what you're showing here is that yes, you can provide security through obscurity in search engine algorithms. We know this -- that's how the system currently works. What I want to know is "is this the only way?" Not "are these algorithms faster" as someone else brought up, but "is there a secure-against-cheating algorithm that can be publicly disclosed without compromising its security that provides good enough results and performance?" Or, "If not, prove it."

Either side of that question is hard to answer, I think.

[identity profile] enochsmiles.livejournal.com 2008-01-04 06:56 pm (UTC)(link)
We know this -- that's how the system currently works.

Well, let me amend this -- we know you can do this so that it takes the spammer/attacker some non-trivial amount of time to break your measure (i.e., do a blackbox analysis on your new algorithm and defeat it) such that you already have a fix in place, or can have a fix in place quickly enough that the attacker doesn't reduce your utility below that of your competitors'.
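Put crudely, obscurity holds up as long as something like

    \[
    t_{\text{deploy next countermeasure}} \;\lesssim\; t_{\text{black-box reverse engineering}}
    \]

keeps being true: each tweak gets replaced before the spammers finish reverse-engineering the last one. That's an arms race being run fast enough, not security in the cryptographer's sense.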

The existing system works "well enough". Would an open system work as well, worse, not at all, better, or perfectly? Those are going to be hard questions to answer empirically, and harder questions to answer with proofs.

[identity profile] jrtom.livejournal.com 2008-01-03 06:21 pm (UTC)(link)
Of course it's possible to build unsubvertible search algorithms. I present Algorithm A, which returns all known web pages in alphabetical order. :)

The question of whether it's possible to build unsubvertible search algorithms with performance equivalent to the current ones (for whatever definitions of "performance" you find useful and interesting) is a more subtle question.

My feeling is that, even setting aside such ridiculous examples as the one I proffer above, removing features that can be subverted will lead to weaker, less interesting models. (I grant you that there are some interesting problems in adapting/constraining such features so as to make their subversion either difficult or expensive without adversely affecting good actors.)

[identity profile] jrtom.livejournal.com 2008-01-03 06:23 pm (UTC)(link)
On reflection, Algorithm A is subvertible, too: it will encourage people to create pages starting with 'A' (i.e., the phone book problem). We'll change that to "in random order" instead. :)
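If you want it concrete, amended Algorithm A fits in a few lines (purely illustrative; and as comes up further down the thread, a guessable seed just moves the attack surface to the PRNG):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class AlgorithmA {
        // Ignores the query entirely; "relevance" plays no part, so there is nothing
        // to game except the ordering itself.
        public static List<String> search(List<String> allKnownPages, String ignoredQuery, long seed) {
            List<String> results = new ArrayList<>(allKnownPages);
            Collections.shuffle(results, new Random(seed)); // original version: sort alphabetically instead
            return results;
        }
    }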

[identity profile] maradydd.livejournal.com 2008-01-03 07:01 pm (UTC)(link)
Ok, so this is the discussion I was hoping to provoke. :) It also gets back to my "these problems are inverses" characterisation that I made on BB.

Let's momentarily accept as a premise (and I'll be the first to say that I don't think you can take this as a premise, but it makes the math easier) that in the universe U of web pages, for every query Q there exists some ideal total ordering T which reflects relevance to Q. (I'm pretty sure this extends wlog to partial orderings as well.) Now assume we have some algorithm A(U,Q) which generates the corresponding T for any Q.

"Subverting A" is thus defined as modifying the contents of U such that A(U,Q) no longer generates an appropriate T, but rather some T' which contains results incorrectly ranked in terms of their relevance to Q. We could also say that A is subversion-resistant if changing some web page u in U in a way that does not increase u's relevance to Q cannot possibly increase u's position in T.

But the part where all this falls down and goes boom is, what's the standard for relevance? In crypto, we have a goal: "disguise signal in some reversible way such that the disguised signal is indistinguishable from random noise." We know what noise looks like; Shannon told us. We can thus make strong statements about the security of crypto algorithms, because we can make meaningful statements about the entropy of their output and thus how hard they are to distinguish from noise.

If we had a universal standard of relevance the same way we have a universal standard of noise, we could say something meaningful about search-engine rankings in the same way. But I don't see how to give [livejournal.com profile] enochsmiles a proof of impossibility wrt cheat-proof search algorithms unless I have something to compare orderings against. Random oracle anyone?

Impossible things are hard, let's do math.

[identity profile] enochsmiles.livejournal.com 2008-01-04 04:51 pm (UTC)(link)
Obviously there are optimization tricks being used now that would have to go out the window. If I wished to make a crypto algorithm more efficient than "secure" algorithms (i.e., ones that rely on key secrecy rather than algorithm secrecy, and against which the fastest attack is brute force) by using tricks that render it secure only if the algorithm is hidden, I could do so. But it would no longer be "secure" by the full definition provided in the field of cryptography.

So, my question still stands. Your Algorithm A is broken, as you say (in fact, Yahoo worked that way in the Good Old Days, and I know of lots of people who grabbed websites like aardvark.com to exploit that attack). "Return pages in random order" is not a search algorithm; it's a random retrieval algorithm, and furthermore, it doesn't prevent an attacker from analyzing your PRNG and composing pages that are "randomly" selected by the search engine.

Not seeing the proof of impossibility here.

Heh.

[identity profile] jrtom.livejournal.com 2008-01-02 07:40 pm (UTC)(link)
Hi. :) I liked your characterization of these problems as inverses of one another. (I hadn't connected "MLP" with you, though, so I was glad to see your note here.)

I'm sort of wondering just how central this flawed comparison is to the book that Cory mentioned. I like Cory and he's a smart guy, but I think he's been pounding on the "free the IP!" campaign (with which I agree, in general) for long enough that he's in danger of starting to perceive everything as a nail. (So to speak.)

I should have clarified that I actually _do_ think that user feedback on both relevance and ranking could be useful, so I'll look forward to seeing what Wikia does in that space...but making those algorithms (or at least the ranking algorithm) publicly known is, I think, asking for trouble.

Re: Heh.

[identity profile] maradydd.livejournal.com 2008-01-02 07:45 pm (UTC)(link)
From a practical standpoint I'll be curious to see how materially similar the results end up being to something like Digg, which is already pretty easy to spike.

There are only 5000 people in the world

[identity profile] radtea.livejournal.com 2008-01-02 08:02 pm (UTC)(link)

If there were more, we wouldn't keep running into each other all the time.

Re: There are only 5000 people in the world

[identity profile] maradydd.livejournal.com 2008-01-02 08:23 pm (UTC)(link)
The funny bit is that I can't actually remember how I know you, for whatever value of "know" is appropriate here. Comments on [livejournal.com profile] perspectivism's LJ or something?

Re: There are only 5000 people in the world

[identity profile] radtea.livejournal.com 2008-01-04 03:46 am (UTC)(link)
I am a leaf on the wind.

Looking at your f-list I see many connections in common, although [livejournal.com profile] perspectivism is our only mutual friend. I'm apt to comment in journals I haven't friended or been friended by. [livejournal.com profile] kirinqueen is one of several possibilities. I'm also fond of John Enright's poetry, although I've never commented in his journal, and your entry mentioning my comment on BoingBoing was pointed out to me by [livejournal.com profile] other in a comment in my journal.