Distributional word clusters vs. words for text categorization (Paper)

2003 | Cited by 285

Text and Document Classification Technologies; Algorithms and Data Compression; Spam and Phishing Detection

Abstract

We study an approach to text categorization that combines distributional clustering of words and a Support Vector Machine (SVM) classifier. This word-cluster representation is computed using the recently introduced Information Bottleneck method, which generates a compact and efficient representation of documents. When combined with the classification power of the SVM, this method yields high performance in text categorization. This novel combination of SVM with word-cluster representation is compared with SVM-based categorization using the simpler bag-of-words (BOW) representation. The comparison is performed over three known datasets. On one of these datasets (the 20 Newsgroups) the method based on word clusters significantly outperforms the word-based representation in terms of categorization accuracy or representation efficiency. On the two other sets (Reuters-21578 and WebKB) the word-based representation slightly outperforms the word-cluster representation. We investigate the potential reasons for this behavior and relate it to structural differences between the datasets.
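The difference between the two document representations compared in the abstract can be illustrated with a minimal sketch. It assumes a word-to-cluster mapping is already available; in the paper that mapping comes from the Information Bottleneck method, which is not reproduced here, so the `word_to_cluster` dictionary below is a hypothetical hand-written stand-in. The function names are also illustrative, and the SVM training step is omitted.

```python
from collections import Counter


def bow_vector(tokens):
    """Bag-of-words: one count per distinct word in the document."""
    return Counter(tokens)


def cluster_vector(tokens, word_to_cluster):
    """Word-cluster representation: one count per cluster.

    Words that were not assigned to any cluster are skipped.
    """
    counts = Counter()
    for tok in tokens:
        cluster = word_to_cluster.get(tok)
        if cluster is not None:
            counts[cluster] += 1
    return counts


# Hypothetical clustering, hard-coded for illustration only.
# The paper derives clusters with the Information Bottleneck method.
word_to_cluster = {
    "goal": 0, "match": 0, "team": 0,   # sports-like words
    "cpu": 1, "disk": 1, "kernel": 1,   # computing-like words
}

doc = "the team won the match with a late goal".split()
bow = bow_vector(doc)
clusters = cluster_vector(doc, word_to_cluster)
```

Here the bag-of-words vector has one dimension per distinct word, while the cluster vector collapses `team`, `match`, and `goal` into a single feature. That reduction in dimensionality is the sense in which the cluster representation is more compact; either vector could then be passed to an SVM.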

Authors

Yoad Winter

Naftali Tishby

Ran El‐Yaniv

Ron Bekkerman
