A divisive information theoretic feature clustering algorithm for text classification (paper)

2003 · Citations: 513

Text and Document Classification Technologies · Face and Expression Recognition · Advanced Clustering Algorithms Research

Abstract

High dimensionality of text can be a deterrent in applying complex learners such as Support Vector Machines to the task of text classification. Feature clustering is a powerful alternative to feature selection for reducing the dimensionality of text data. In this paper we propose a new information-theoretic divisive algorithm for feature/word clustering and apply it to text classification. Existing techniques for such "distributional clustering" of words are agglomerative in nature and result in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information-theoretic framework, we first derive a global criterion for feature clustering. We then present a fast, divisive algorithm that monotonically decreases this objective function value. We show that our algorithm minimizes the "within-cluster Jensen-Shannon divergence" while simultaneously maximizing the "between-cluster Jensen-Shannon divergence".
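The abstract names the objective but does not spell out its formula or the update rule. The following is a minimal Python sketch under stated assumptions, not the authors' implementation. It assumes each word is represented by a class distribution p(C|w) and a prior weight, that the "within-cluster Jensen-Shannon divergence" is the prior-weighted sum of KL divergences from each word to its cluster's weighted mean distribution, and that one step of the algorithm is a k-means-style reassignment of each word to the cluster with the nearest mean in KL divergence. All function names (`kl`, `cluster_mean`, `within_cluster_js`, `reassign`) and the toy data are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) in nats; infinite if q is zero where p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def cluster_mean(dists, priors, members):
    """Prior-weighted mean of the class distributions of the member words."""
    weight = sum(priors[t] for t in members)
    n_classes = len(dists[0])
    return [sum(priors[t] * dists[t][c] for t in members) / weight
            for c in range(n_classes)]

def cluster_means(dists, priors, labels, k):
    """Mean distribution of every non-empty cluster, keyed by cluster index."""
    means = {}
    for j in range(k):
        members = [t for t, lab in enumerate(labels) if lab == j]
        if members:
            means[j] = cluster_mean(dists, priors, members)
    return means

def within_cluster_js(dists, priors, labels, k):
    """Assumed objective: sum over words of prior * KL(word || its cluster mean)."""
    means = cluster_means(dists, priors, labels, k)
    return sum(priors[t] * kl(dists[t], means[labels[t]])
               for t in range(len(dists)))

def reassign(dists, priors, labels, k):
    """One assumed update step: move each word to the cluster with the closest mean."""
    means = cluster_means(dists, priors, labels, k)
    return [min(means, key=lambda j: kl(dists[t], means[j]))
            for t in range(len(dists))]

# Toy data (hypothetical): four words, two classes, a poor starting partition.
dists = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
priors = [0.25, 0.25, 0.25, 0.25]
labels = [0, 0, 0, 1]

before = within_cluster_js(dists, priors, labels, 2)
labels = reassign(dists, priors, labels, 2)
after = within_cluster_js(dists, priors, labels, 2)
print(labels, before, after)
```

Under these assumptions, a reassignment step followed by recomputing the cluster means cannot increase the objective, which is the monotonic-decrease behaviour the abstract describes. Repeating `reassign` until the labels stop changing gives a simple stopping rule.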

Authors

Rahul Kumar

Subramanyam Mallela

Inderjit S. Dhillon
