Guest Post: E-Discovery and The Enron E-Mail Dataset Research

By Peter Vogel on October 21, 2009

GUEST BLOG FROM VICTORIA VANBUREN

Before Dave Grant joined Gardere as the Director of e-Discovery, he was responsible for e-Discovery at Enron in the last few years before its total meltdown, where he managed more than 1.25 million documents. While at Enron, Dave responded to more than 100 subpoenas from various state and federal agencies. The Enron database has become a focal point of eDiscovery research. This Guest Blog about the Enron database is part of a bigger picture regarding academic research for developing efficient tools to improve eDiscovery.

I welcome Victoria VanBuren as the first Guest Blogger with her blog concerning the Enron eMail database. Victoria runs the DISPUTING blog with Karl Bayer in Austin, and has a great knack for posting interesting blogs and finding blogs on important topics. She is also a co-founder of and an active participant in the LinkedIn Commercial and Industry Arbitration and Mediation Group. In addition to being a lawyer, Victoria is working on a degree in computer science, so I’m sure we will see Guest Blogs from her in the future.

GUEST POST: E-DISCOVERY AND THE ENRON E-MAIL DATASET RESEARCH

By Victoria VanBuren

The U.S. Supreme Court’s grant of certiorari to former Enron CEO Jeffrey Skilling dominated the news headlines last week. Interestingly, the Federal Energy Regulatory Commission (FERC), during its investigation into Enron’s involvement in the energy crisis of 2000-01, made available to the public a large database called the “Enron Corpus.” This dataset consists of about half a million e-mail communications from former Enron senior executives and energy traders.

Enron E-mail Dataset Research

Because of its size and public status, the Enron Corpus is a rare and valuable tool for experimenting with text classification methods. Since FERC posted it to the web, this dataset has been the subject of research by computer science departments at several universities, including the Massachusetts Institute of Technology and Stanford University. In the summer of 2009, the team at TREC Legal Track, an organization co-sponsored by the U.S. Department of Defense, started conducting research on the Enron Corpus with the purpose of improving large-scale search techniques.

Our Research – Bayesian Text Classifier

In the spring of 2009, Texas State University computer science students David Villarreal, Thomas McMillen, Andrew Minnick, and I, under the supervision of computer forensics expert Wilbon Davis, used the Enron Corpus to train a Bayes-based algorithm to classify the Enron e-mails as relevant or irrelevant to a given legal issue. This type of algorithm is commonly used by e-mail spam filters.
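For readers curious about how a Bayes-based classifier of this kind works, here is a minimal sketch in Python. It is not the team's actual code: it assumes a simple multinomial naive Bayes model with add-one smoothing, and the class name, the tokenizer, and the four training messages are hypothetical examples made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and split it into simple word tokens.
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesClassifier:
    """Illustrative multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training messages
        self.vocabulary = set()      # every word seen during training

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Start from the log prior: how common this label is overall.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(counts.values())
            for token in tokens:
                # Add the smoothed log likelihood of each word given the label.
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training messages, invented for this sketch.
classifier = NaiveBayesClassifier()
classifier.train("Please adjust the trading schedule for the California power market", "relevant")
classifier.train("We should move the power trades through the California desk today", "relevant")
classifier.train("Lunch on Friday? The new place downtown looks good", "irrelevant")
classifier.train("Reminder: the office holiday party is next week", "irrelevant")

print(classifier.classify("California power trading update"))
print(classifier.classify("holiday party lunch"))
```

A spam filter applies the same idea with the labels "spam" and "not spam": words that appear more often in one category push a new message toward that category.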

The Results

The team hoped that this mathematical approach would achieve better accuracy levels than the roughly 20% found using Boolean keyword searching, a method employed by many lawyers. Surprisingly, the Bayesian filter correctly identified e-mails known to be relevant at average rates ranging from 43% to 66%. And as expected, the accuracy results for irrelevant e-mails were even higher, with averages ranging from 44% to 77%. Texas State University published the Technical Report last week and it can be downloaded for free here.