Content and the Scan Statistic for Enron

John Conroy
Institute for Defense Analyses

This talk will focus on the interplay of content and the scan statistic. The scan statistic can be used to detect changes in a time series of graphs. In particular, it can be used to detect when a subset of the nodes has higher than expected connectivity. As an example, we consider the Enron email dataset. The Enron dataset is a collection of emails from about 150 Enron executives that was released by the U.S. Justice Department. The scan statistic on this data detects a number of anomalous rises in communication among a subset of the Enron executives. Given a period with such a rise, we seek to identify its cause. To this end, we employ a technique that has been used successfully in document summarization and other information retrieval tasks. Specifically, given a collection of emails of interest, we compute the signature terms for the collection. The signature terms are those terms with a higher than expected likelihood of occurrence. These terms can then be used to generate a description of the topic or to selectively extract portions of the email messages to give a gist of the communication. Content can also be used to induce a new time series of graphs. For each time period, we consider two nodes to be connected if there is significant correlation in the signature terms computed from the content of the messages they have received or sent. Such correlation between the nodes gives rise to a time series of graphs. We explore how this time series relates to the communication patterns of the nodes. This talk is a follow-on to Carey Priebe's presentation on scan statistics.
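
The two ingredients described above can be illustrated with a minimal sketch. It is not the speaker's implementation, and the abstract does not give the exact definitions, so the following are assumptions: the scan statistic is taken to be the largest number of edges found within any node's order-1 neighbourhood (with no normalization against recent history), and signature terms are scored with a binomial log-likelihood ratio against a background collection, using a cutoff of 10.83 (the chi-square critical value for one degree of freedom at the 0.001 level). All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def locality_statistic(edges, v):
    """Count edges among v and its direct neighbours (order-1 neighbourhood).

    `edges` is a collection of undirected pairs, each listed once (assumption).
    """
    neighbourhood = {v}
    for a, b in edges:
        if a == v:
            neighbourhood.add(b)
        elif b == v:
            neighbourhood.add(a)
    return sum(1 for a, b in edges if a in neighbourhood and b in neighbourhood)


def scan_statistic(edges):
    """Maximum locality statistic over all vertices of one graph."""
    vertices = {x for edge in edges for x in edge}
    return max((locality_statistic(edges, v) for v in vertices), default=0)


def _log_l(k, n, p):
    """Binomial log-likelihood of k hits in n draws, treating 0 * log(0) as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1 - p)
    return total


def signature_terms(target_tokens, background_tokens, threshold=10.83):
    """Terms over-represented in the target collection, strongest first.

    Both token lists are assumed to be non-empty.
    """
    target, background = Counter(target_tokens), Counter(background_tokens)
    n1, n2 = sum(target.values()), sum(background.values())
    scores = {}
    for term, k1 in target.items():
        k2 = background[term]
        p1, p2, p = k1 / n1, k2 / n2, (k1 + k2) / (n1 + n2)
        if p1 <= p2:
            continue  # keep only terms that are more frequent in the target
        llr = 2 * (_log_l(k1, n1, p1) + _log_l(k2, n2, p2)
                   - _log_l(k1, n1, p) - _log_l(k2, n2, p))
        if llr > threshold:
            scores[term] = llr
    return sorted(scores, key=scores.get, reverse=True)


# Toy time series of two graphs: a quiet period, then a burst around node "c".
quiet = [("a", "b"), ("c", "d")]
busy = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
series = [scan_statistic(g) for g in (quiet, busy)]

# Toy message tokens from the busy period versus a background collection.
burst_tokens = ["merger"] * 20 + ["the"] * 80
background_tokens = ["the"] * 900 + ["merger"] * 2 + ["lunch"] * 98
terms = signature_terms(burst_tokens, background_tokens)
```

In this toy run the scan statistic rises in the second period, and the term that is frequent in the burst but rare in the background is the one flagged. A practical detector would also standardize each period's value against a window of recent periods before declaring an anomaly; that step is omitted here.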



Graduate Summer School: Intelligent Extraction of Information from Graphs and High Dimensional Data