Beyond TFIDF weighting for text categorization in the vector space model (paper)

2005. Citations: 232

Text and Document Classification Technologies; Image Retrieval and Classification Techniques; Face and Expression Recognition

Abstract

KNN and SVM are two machine learning approaches to Text Categorization (TC) based on the Vector Space Model. In this model, borrowed from Information Retrieval (IR), each document is represented as a vector in which each component is associated with a particular word from the vocabulary. Traditionally, each component value is assigned using the information retrieval TFIDF measure. While this weighting method seems very appropriate for IR, it is not clear that it is the best choice for TC problems. Indeed, this weighting method does not leverage the information implicitly contained in the categorization task to represent documents. In this paper, we introduce a new weighting method based on statistical estimation of the importance of a word for a specific categorization problem. This method also has the benefit of making feature selection implicit, since features that are useless for the categorization problem at hand receive a very small weight. Extensive experiments reported in the paper show that this new weighting method significantly improves classification accuracy as measured on many categorization tasks.
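For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of TFIDF weighting in the vector space model. It is not the paper's new weighting method, which the abstract does not specify. The variant used here (raw term frequency multiplied by log(N / df)) is one common formulation and is an assumption, as are the function name `tfidf_vectors` and the toy documents.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents.

    Assumed variant: weight = raw term frequency * log(N / document frequency).
    Other TFIDF formulations (smoothing, normalization) exist.
    """
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    idf = {word: math.log(n_docs / df[word]) for word in vocab}

    vectors = []
    for doc in docs:
        tf = Counter(doc)  # missing words count as 0
        vectors.append([tf[word] * idf[word] for word in vocab])
    return vocab, vectors


# Hypothetical toy corpus, for illustration only.
docs = [
    "the match ended in a draw".split(),
    "the market ended lower".split(),
    "the team won the match".split(),
]
vocab, vectors = tfidf_vectors(docs)
```

Note that the weights depend only on word counts across the collection. Category labels never enter the computation, which is the limitation the abstract raises for TC problems.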

Authors (2)

Guy W. Mineau

Pascal Soucy
