Distributional word clusters vs. words for text categorization (paper)
Abstract
We study an approach to text categorization that combines distributional clustering of words and a Support Vector Machine (SVM) classifier. This word-cluster representation is computed using the recently introduced Information Bottleneck method, which generates a compact and efficient representation of documents. When combined with the classification power of the SVM, this method yields high performance in text categorization. This novel combination of SVM with word-cluster representation is compared with SVM-based categorization using the simpler bag-of-words (BOW) representation. The comparison is performed over three known datasets. On one of these datasets (the 20 Newsgroups) the method based on word clusters significantly outperforms the word-based representation in terms of categorization accuracy or representation efficiency. On the two other sets (Reuters-21578 and WebKB) the word-based representation slightly outperforms the word-cluster representation. We investigate the potential reasons for this behavior and relate it to structural differences between the datasets.
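The difference between the two document representations compared in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the word-to-cluster assignment is a hand-written, hypothetical toy mapping (in the paper the clusters come from the Information Bottleneck method), and the names `bow_vector` and `cluster_vector` are made up for this example. The sketch only shows how a document is turned into a feature vector in each case; either vector could then be passed to an SVM classifier.

```python
from collections import Counter


def bow_vector(tokens, vocabulary):
    """Bag-of-words: one count per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cluster_vector(tokens, word_to_cluster, num_clusters):
    """Word-cluster representation: one count per cluster.

    Counts of all words assigned to the same cluster are summed,
    so the vector has num_clusters entries instead of one per word.
    """
    vec = [0] * num_clusters
    for tok in tokens:
        cluster = word_to_cluster.get(tok)
        if cluster is not None:  # ignore words outside the clustered vocabulary
            vec[cluster] += 1
    return vec


# Hypothetical toy vocabulary and cluster assignment (assumption, not from the paper).
vocabulary = ["goal", "striker", "penalty", "cpu", "kernel", "driver"]
word_to_cluster = {
    "goal": 0, "striker": 0, "penalty": 0,  # cluster 0: sports-like words
    "cpu": 1, "kernel": 1, "driver": 1,     # cluster 1: computing-like words
}

doc = "the striker scored a goal after the penalty".split()

bow = bow_vector(doc, vocabulary)
clusters = cluster_vector(doc, word_to_cluster, num_clusters=2)

print(bow)       # six features, one per word
print(clusters)  # two features, one per cluster
```

With six vocabulary words and two clusters, the cluster vector is a third of the length of the BOW vector. This is the kind of compactness the abstract refers to as representation efficiency.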