Toward Optimal Feature Selection in Naive Bayes for Text Categorization (Paper)

2016, IEEE Transactions on Knowledge and Data Engineering, cited by 255

Text and Document Classification Technologies; Spam and Phishing Detection; Advanced Text Analysis Techniques

Abstract

Automated feature selection is important for text categorization: it reduces the feature size and speeds up the learning process of classifiers. In this paper, we present a novel and efficient feature selection framework based on information theory, which aims to rank features by their discriminative capacity for classification. We first revisit two information measures, Kullback-Leibler divergence and Jeffreys divergence, for binary hypothesis testing, and analyze their asymptotic properties relating to the type I and type II errors of a Bayesian classifier. We then introduce a new divergence measure, called Jeffreys-Multi-Hypothesis (JMH) divergence, to measure multi-distribution divergence for multi-class classification. Based on the JMH-divergence, we develop two efficient feature selection methods for text categorization, termed the maximum discrimination ($MD$) and $MD-\chi^2$ methods. The promising results of extensive experiments demonstrate the effectiveness of the proposed approaches.
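
To make the idea of ranking features by a divergence measure concrete, here is a minimal Python sketch for the binary case only. It is not the paper's MD or JMH method. It assumes each term is modeled as a Bernoulli "present in document" variable per class, with add-alpha smoothing, and scores each term by the Jeffreys divergence (the symmetrized Kullback-Leibler divergence) between the two class-conditional distributions. The function names, the smoothing scheme, and the toy documents are all illustrative assumptions.

```python
import math


def kl_bernoulli(p, q):
    """KL(Bern(p) || Bern(q)) in nats; p and q must lie strictly in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def jeffreys_bernoulli(p, q):
    """Jeffreys divergence: KL(p || q) + KL(q || p), symmetric in p and q."""
    return kl_bernoulli(p, q) + kl_bernoulli(q, p)


def rank_features(docs_a, docs_b, vocabulary, alpha=1.0):
    """Rank terms from most to least discriminative between two classes.

    docs_a, docs_b: lists of sets of terms, one set per document.
    alpha: add-alpha smoothing (an assumption) that keeps probabilities
           away from 0 and 1 so the logarithms stay finite.
    """
    def presence_prob(term, docs):
        hits = sum(1 for doc in docs if term in doc)
        return (hits + alpha) / (len(docs) + 2 * alpha)

    scores = {
        term: jeffreys_bernoulli(presence_prob(term, docs_a),
                                 presence_prob(term, docs_b))
        for term in vocabulary
    }
    return sorted(vocabulary, key=lambda term: scores[term], reverse=True)


# Toy corpus (hypothetical data, for illustration only).
spam = [{"free", "offer", "now"}, {"free", "win"}, {"offer", "now"}]
ham = [{"meeting", "now"}, {"report", "meeting"}, {"now", "meeting"}]
vocab = ["free", "now", "meeting"]

ranking = rank_features(spam, ham, vocab)
print(ranking)
```

In this toy corpus, "now" appears equally often in both classes, so its divergence is zero and it ranks last. A selector would keep the top-ranked terms and discard the rest before training the classifier.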

Authors

Bo Tang

Haibo He

Steven Kay
