Distributional word clusters vs. words for text categorization (paper)

2003, cited by 285
Text and Document Classification Technologies; Algorithms and Data Compression; Spam and Phishing Detection

Abstract

We study an approach to text categorization that combines distributional clustering of words and a Support Vector Machine (SVM) classifier. This word-cluster representation is computed using the recently introduced Information Bottleneck method, which generates a compact and efficient representation of documents. When combined with the classification power of the SVM, this method yields high performance in text categorization. This novel combination of SVM with word-cluster representation is compared with SVM-based categorization using the simpler bag-of-words (BOW) representation. The comparison is performed over three known datasets. On one of these datasets (the 20 Newsgroups) the method based on word clusters significantly outperforms the word-based representation in terms of categorization accuracy or representation efficiency. On the two other sets (Reuters-21578 and WebKB) the word-based representation slightly outperforms the word-cluster representation. We investigate the potential reasons for this behavior and relate it to structural differences between the datasets.
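
To make the contrast between the two representations concrete, the following is a minimal sketch of how a document might be mapped from word counts to word-cluster counts. It is not the paper's Information Bottleneck procedure: as a simplified stand-in, it greedily merges words whose estimated class distributions P(class | word) are closest under a weighted Jensen-Shannon divergence. All function names and the toy data are hypothetical, and the SVM step is omitted.

```python
import math
from collections import Counter, defaultdict


def word_class_distributions(docs, labels):
    """Estimate P(class | word) and word frequencies from labeled token lists."""
    counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for tok in tokens:
            counts[tok][label] += 1
    classes = sorted(set(labels))
    dists = {w: [c[k] / sum(c.values()) for k in classes] for w, c in counts.items()}
    freqs = {w: sum(c.values()) for w, c in counts.items()}
    return dists, freqs


def js_divergence(p, q, wp, wq):
    """Jensen-Shannon divergence of p and q with mixing weights proportional to wp, wq."""
    a, b = wp / (wp + wq), wq / (wp + wq)
    m = [a * x + b * y for x, y in zip(p, q)]

    def kl(u, v):
        return sum(x * math.log(x / y) for x, y in zip(u, v) if x > 0)

    return a * kl(p, m) + b * kl(q, m)


def cluster_words(dists, freqs, k):
    """Greedy agglomerative clustering (an assumed stand-in, not the paper's algorithm).

    Repeatedly merges the pair of clusters with the smallest frequency-weighted
    divergence until k clusters remain. Returns a word -> cluster index mapping.
    """
    clusters = [([w], list(dists[w]), freqs[w]) for w in sorted(dists)]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                _, p, wp = clusters[i]
                _, q, wq = clusters[j]
                cost = (wp + wq) * js_divergence(p, q, wp, wq)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        words_i, p, wp = clusters[i]
        words_j, q, wq = clusters[j]
        merged = [(wp * x + wq * y) / (wp + wq) for x, y in zip(p, q)]
        clusters[i] = (words_i + words_j, merged, wp + wq)
        del clusters[j]
    return {w: idx for idx, (words, _, _) in enumerate(clusters) for w in words}


def to_cluster_counts(tokens, word_to_cluster, k):
    """Represent a document as counts over k word clusters; unknown words are skipped."""
    vec = [0] * k
    for tok in tokens:
        if tok in word_to_cluster:
            vec[word_to_cluster[tok]] += 1
    return vec


# Toy data, invented purely for illustration.
docs = [["goal", "match", "team"], ["team", "goal", "score"],
        ["stock", "market", "price"], ["price", "stock", "trade"]]
labels = ["sport", "sport", "finance", "finance"]

dists, freqs = word_class_distributions(docs, labels)
mapping = cluster_words(dists, freqs, k=2)
features = to_cluster_counts(["goal", "stock", "goal", "unseen"], mapping, k=2)
```

In this sketch an eight-word vocabulary collapses to two features, which illustrates the kind of compactness the abstract attributes to the word-cluster representation. The resulting vectors could then be passed to any SVM implementation in place of the bag-of-words vectors.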