maradydd: (Default)
[personal profile] maradydd
Lately, I've been helping a colleague of mine with some data-mining research on the Enron email database (yeah, the one the DoJ subpoenaed from them). We're not doing anything the media will find terribly exciting -- I mean, I don't think we'll be uncovering any exciting new conspiracies or anything like that -- but we're hoping to discover interesting and/or useful things about how large organisations use email as a tool. Anyway, one of the things my colleague wanted was a directed-graph representation of senders and recipients, where each edge represents an email from the source vertex to the target vertex; edges increase in weight as the sender sends more emails to the recipient.
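For anyone who wants the data structure spelled out, here is a minimal sketch of that weighted directed graph using only the standard library. It is not Klein's actual code (Klein is built on the Boost Graph Library); the names `Edge`, `EdgeWeights` and `count_edges` are made up for illustration.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// One directed edge: (sender, recipient). These type names are illustrative.
using Edge = std::pair<std::string, std::string>;
using EdgeWeights = std::map<Edge, int>;

// Count how many times each (sender, recipient) pair occurs.
// (a, b) and (b, a) are different keys, so direction is preserved.
EdgeWeights count_edges(const std::vector<Edge>& messages) {
    EdgeWeights weights;
    for (const Edge& m : messages) {
        ++weights[m];  // operator[] value-initialises a missing entry to 0
    }
    return weights;
}
```

Each email bumps the weight of exactly one directed edge, so two people who write to each other end up with two separate edges, one per direction.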

"This is a job for Graphviz!" I said, and since I'd been meaning to sit down and learn how to use the Boost Graph Library anyway, I sat down and got to it. It's been a somewhat bumpy ride -- the BGL does some weird, weird things with templates, and its documentation seems to be written for people who, erm, already know what they're doing -- but I am now the proud author of my very own topologiser, Klein.

Klein creates maps of networks, mainly communication networks. It reads "source" and "target" data as tuples from a PostgreSQL database; it takes as command-line parameters the database name, source field name, target field name, table name (I'm already planning to handle joins in a future version, but not now), any necessary database connexion parameters, and (optionally) a minimum edge-weight. (The last is in case you run into the same problem we did, where the graph is so large that Graphviz can't handle the resulting .dot file and you just want to look at a higher-activity subgraph.) It outputs a file in DOT, the graph description language which Graphviz uses. It requires libpqxx, which should come packaged with PostgreSQL and can also be found here, and libpopt, which is part of GNOME. It is pretty fast; analysing a graph of ~78,000 vertices and ~290,000 edges took about twenty minutes on a Pentium-4 M with 512MB of RAM, and that was with KDE running and reading the source data over 802.11b. (The resulting graph was way too big for Graphviz to handle, though.)
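To make the minimum edge-weight idea concrete, here is a rough sketch of turning a table of edge weights into DOT while skipping the low-activity edges. This is an assumption about how one might do it, not Klein's real output routine: `to_dot`, `dot_quote` and the choice of the `weight` and `label` edge attributes are mine. Vertex names are quoted because email addresses contain characters that bare DOT identifiers don't allow.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>

using Edge = std::pair<std::string, std::string>;  // (source, target)
using EdgeWeights = std::map<Edge, int>;           // edge -> message count

// Wrap an identifier in double quotes, escaping any embedded double quote.
std::string dot_quote(const std::string& id) {
    std::string out = "\"";
    for (char c : id) {
        if (c == '"') out += '\\';
        out += c;
    }
    out += '"';
    return out;
}

// Emit a DOT digraph containing only edges whose weight is at least min_weight.
std::string to_dot(const EdgeWeights& weights, int min_weight = 1) {
    std::ostringstream dot;
    dot << "digraph G {\n";
    for (const auto& entry : weights) {
        const Edge& edge = entry.first;
        const int w = entry.second;
        if (w < min_weight) continue;  // drop low-activity edges
        dot << "  " << dot_quote(edge.first) << " -> " << dot_quote(edge.second)
            << " [weight=" << w << ", label=" << w << "];\n";
    }
    dot << "}\n";
    return dot.str();
}
```

Vertices only appear when one of their edges survives the filter, so raising `min_weight` shrinks the vertex set along with the edge set.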

In the next few days, I'll throw GNU Autotools at it (which will be an adventure in and of itself) and package it up for release under the GPL. In later releases, I plan to include other data-source options, including reading from an ordinary istream; it would be terribly neat, I think, to figure out some way to read from an Ethereal or tcpdump session (preferably from a network adapter in promiscuous mode on a switch) in order to model network traffic density. I haven't thought very far ahead about other useful features, but if anyone has an "ooh, shiny" moment, I'll happily listen. :)

And, all that said, I'd be remiss if I didn't point out the role LJ has played in my evolution as a software developer. About two and a half years ago, [livejournal.com profile] other mentioned the existence of Graphviz, and the following semester, [livejournal.com profile] ernunnos pointed me at Boost; okay, maybe it took me two years to become a good enough C++ programmer to make use of it, but we all have to start somewhere. Thanks, guys. :)

(no subject)

Date: 2004-11-06 03:51 am (UTC)
From: [identity profile] enochsmiles.livejournal.com
Sweet. This data may actually be useful for something I'm working on. Will you be publishing your analysis?

(no subject)

Date: 2004-11-06 03:58 am (UTC)
From: [identity profile] maradydd.livejournal.com
We definitely want to get a paper out of it, though at the moment I'm not entirely sure which direction my colleague wants to take it (we're kind of at the "let's look at the data from several different angles and see if anything interesting falls out" stage of things).

Care to elaborate on what you're working on? (clonearmy@gmail.com, if you'd prefer to discuss via email.)

(no subject)

Date: 2004-11-06 04:03 am (UTC)

Cool

Date: 2004-11-06 04:28 am (UTC)
From: [identity profile] other.livejournal.com
No problem. I'm just glad that my mapping of Objectivists' sex habits has come to some good.

Re: Cool

Date: 2004-11-06 05:53 am (UTC)
From: [identity profile] enochsmiles.livejournal.com
Shudder.

Someone should re-work the irc sex chart.

Re: Cool

Date: 2005-11-22 07:44 pm (UTC)
From: [identity profile] enochsmiles.livejournal.com
... I said, not realizing I'd be partially responsible for providing a direct link between the two, less than a year later. So, where's this Objectivist sex chart to be found?

Question

Date: 2004-11-06 04:11 pm (UTC)
From: [identity profile] doissetep.livejournal.com
Does the line/vertex representation distinguish the direction the mail is sent, or does it draw only one segment between vertices?

Let's say employee A sends 5 messages to employee B. That would reflect a certain line thickness between vertices A and B. Do messages from employee B back to employee A further contribute to the thickness without differentiating the instigator of the message?

Re: Question

Date: 2004-11-06 11:08 pm (UTC)
From: [identity profile] maradydd.livejournal.com
If A sends B some messages, then there is an edge from A to B with an arrow pointing toward B. If B then sends A some messages, there is a separate edge from B to A, with the arrow pointing toward A, and the two lines will have different thicknesses (unless they send each other the same number of messages, natch).

Re: Question

Date: 2004-11-09 01:02 pm (UTC)
From: [identity profile] doissetep.livejournal.com
Excellent.

Do you have any idea as to the relative rank (i.e., executive, manager, or worker) of the mail senders and recipients?

Re: Question

Date: 2004-11-12 03:44 am (UTC)
From: [identity profile] maradydd.livejournal.com
Um, not really. It can be intuited from some of the emails, but I think we've got a pretty good mix of roles, all told.
