Introducing the Enron Corpus (paper)

2004, Conference on Email and Anti-Spam. Citations: 543

Topics: Hate Speech and Cyberbullying Detection; Spam and Phishing Detection; Misinformation and Its Impacts


Abstract

A large set of email messages, the Enron corpus, was made public during the legal investigation concerning the Enron corporation. This dataset, along with a thorough explanation of its origin, is available at http://www-2.cs.cmu.edu/~enron/. This paper provides a brief introduction and analysis of the dataset. The raw Enron corpus contains 619,446 messages belonging to 158 users. We cleaned the corpus before this analysis by removing certain folders from each user, such as “discussion_threads”. These folders were present for most users, and did not appear to be used directly by the users, but rather were computer generated. Many, such as “all_documents”, also contained large numbers of duplicate emails, which were already present in the users’ other folders. Our goal in this paper is to analyze the suitability of this corpus for exploring how to classify messages as organized by a human, so these folders would have likely been misleading.
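The folder-based cleaning step described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical on-disk layout of `root/<user>/<folder>/.../<message file>`, and it excludes only the two folder names the abstract mentions. The authors' full exclusion list and tooling are not given here, so the function name `iter_user_messages` and the `EXCLUDED_FOLDERS` set are illustrative assumptions rather than the paper's method.

```python
from pathlib import Path

# Folder names the abstract cites as computer generated. The complete list
# used by the authors is not stated, so this set is an assumption.
EXCLUDED_FOLDERS = {"discussion_threads", "all_documents"}


def iter_user_messages(root, excluded=EXCLUDED_FOLDERS):
    """Yield (user, folder, path) for each message outside excluded folders.

    Assumes a hypothetical layout: root/<user>/<folder>/.../<message file>.
    """
    root = Path(root)
    for user_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for folder_dir in sorted(p for p in user_dir.iterdir() if p.is_dir()):
            if folder_dir.name in excluded:
                continue  # skip folders that the user did not organize
            for path in sorted(folder_dir.rglob("*")):
                if path.is_file():
                    yield user_dir.name, folder_dir.name, path
```

Filtering on the top-level folder name keeps each user's own folder choices as the classification label while dropping the duplicate-heavy folders, which is the property the abstract says the analysis depends on.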

Authors

Yiming Yang

Bryan Klimt
