Name Tagging with Word Clusters and Discriminative Training (paper)

2004, cited by 270
Topic Modeling; Natural Language Processing Techniques; Data Quality and Management

Abstract

We present a technique for augmenting annotated training data with hierarchical word clusters that are automatically derived from a large unannotated corpus. Cluster membership is encoded in features that are incorporated in a discriminatively trained tagging model. Active learning is used to select training examples. We evaluate the technique for named-entity tagging. Compared with a state-of-the-art HMM-based name finder, the presented technique requires only 13% as much annotated data to achieve the same level of performance. Given a large annotated training set of 1,000,000 words, the technique achieves a 25% reduction in error over the state-of-the-art HMM trained on the same material.
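
The abstract does not spell out how cluster membership becomes features, so the following is only a minimal sketch of one common encoding, not the authors' implementation. It assumes each word's position in a binary cluster hierarchy is stored as a bit string (its path from the root), and that prefixes of that string serve as features at several granularities. The table `CLUSTER_PATHS`, the helper names, the bit strings, and the prefix lengths are all hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical cluster table: word -> bit string giving its path from the
# root of a binary cluster hierarchy. The paths are invented for illustration.
CLUSTER_PATHS: Dict[str, str] = {
    "boston": "110100",
    "chicago": "110101",
    "monday": "011010",
    "acme": "101100",
}


def cluster_prefix_features(
    word: str, prefix_lengths: Sequence[int] = (2, 4, 6)
) -> List[str]:
    """Return one feature per prefix length of the word's cluster path.

    Short prefixes name coarse clusters near the root; long prefixes name
    fine-grained clusters near the leaves. The lengths are an assumption.
    """
    path = CLUSTER_PATHS.get(word.lower())
    if path is None:
        # Words missing from the cluster table get a single fallback feature.
        return ["cluster=UNK"]
    return [
        f"cluster[:{n}]={path[:n]}" for n in prefix_lengths if n <= len(path)
    ]


def token_features(tokens: Sequence[str], i: int) -> List[str]:
    """Combine lexical features with cluster features for token i."""
    word = tokens[i]
    feats = [f"word={word.lower()}"]
    if word[:1].isupper():
        feats.append("is_capitalized")
    feats.extend(cluster_prefix_features(word))
    return feats


if __name__ == "__main__":
    sentence = ["Flights", "to", "Chicago", "leave", "Monday"]
    for idx, tok in enumerate(sentence):
        print(tok, token_features(sentence, idx))
```

In this toy table, "boston" and "chicago" share the 2-bit and 4-bit prefix features and differ only at 6 bits, so a discriminatively trained tagger that has seen one of them in annotated data can generalize to the other through the shared coarse features. This is the intuition behind needing less annotated data.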