Introducing the Enron Corpus (paper)

2004, Conference on Email and Anti-Spam. Cited by 543.
Hate Speech and Cyberbullying Detection; Spam and Phishing Detection; Misinformation and Its Impacts

Abstract

A large set of email messages, the Enron corpus, was made public during the legal investigation concerning the Enron corporation. This dataset, along with a thorough explanation of its origin, is available at http://www-2.cs.cmu.edu/~enron/. This paper provides a brief introduction and analysis of the dataset. The raw Enron corpus contains 619,446 messages belonging to 158 users. We cleaned the corpus before this analysis by removing certain folders from each user, such as “discussion_threads”. These folders were present for most users, and did not appear to be used directly by the users, but rather were computer generated. Many, such as “all_documents”, also contained large numbers of duplicate emails, which were already present in the users’ other folders. Our goal in this paper is to analyze the suitability of this corpus for exploring how to classify messages as organized by a human, so these folders would have likely been misleading.
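
The cleaning step described above could be sketched roughly as follows. This is an illustration, not the authors' actual procedure: it assumes a hypothetical on-disk layout of `root/<user>/<folder>/<message file>`, and the excluded set lists only the two folder names the abstract mentions (the abstract says "such as", so the real list was probably longer). The function names `iter_user_messages` and `messages_per_user` are invented for this example.

```python
import os
from collections import Counter

# Folder names the abstract gives as examples of computer-generated folders.
# Assumption: the full list used for the paper is not given, so extend as needed.
EXCLUDED_FOLDERS = {"discussion_threads", "all_documents"}


def iter_user_messages(root, excluded=EXCLUDED_FOLDERS):
    """Yield (user, folder, path) for each message outside the excluded folders.

    Assumes a hypothetical layout of root/<user>/<folder>/.../<message file>.
    """
    for user in sorted(os.listdir(root)):
        user_dir = os.path.join(root, user)
        if not os.path.isdir(user_dir):
            continue
        for dirpath, dirnames, filenames in os.walk(user_dir):
            # Prune excluded folders in place so os.walk never descends into
            # them (this applies at any depth below the user directory).
            dirnames[:] = [d for d in dirnames if d not in excluded]
            folder = os.path.relpath(dirpath, user_dir)
            if folder == ".":
                # Files directly under the user directory have no folder label.
                continue
            for name in sorted(filenames):
                yield user, folder, os.path.join(dirpath, name)


def messages_per_user(root, excluded=EXCLUDED_FOLDERS):
    """Count the messages that remain for each user after cleaning."""
    counts = Counter()
    for user, _folder, _path in iter_user_messages(root, excluded):
        counts[user] += 1
    return counts
```

Calling `messages_per_user("maildir")` on a local copy (the directory name is an assumption) returns a `Counter` keyed by user. Comparing it with a run that passes `excluded=set()` shows how many messages the excluded folders contributed.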