Revisiting Lexicon Evaluation in Unsupervised Word Discovery 文章

ArXiv CS.CL2026-06-05NEWSen作者: Simon Malan, Danel Slabbert, Herman Kamper

摘要

arXiv:2606.06183v1 Announce Type: cross Abstract: Building a lexicon from discovered word-like units is a central goal in zero-resource speech processing. But do our evaluations provide a trustworthy indication of lexicon quality? A common metric, normalized edit distance, averages the phoneme edit distances between discovered units in each cluster. We show that this metric has an inherent bias toward the quality of large clusters, inhibiting fair evaluation. Moreover, it ignores how well true classes are distributed across clusters. Based on established theory in clustering literature, we propose two metrics that address these shortcomings: a modified metric that weighs cluster size when assessing within-cluster consistency, and an inverse metric that assesses how true words are spread across clusters.

相关事件查看全部 (1)

相关公司

暂无数据

相关人物

暂无数据

相关产品

暂无数据

相关技术

暂无数据