CHALIS: A Challenge Dataset for Language Identification in Difficult Scenarios

Event: PRODUCT_LAUNCH | Date: 2026-06-05 | Impact: MEDIUM

arXiv:2606.06088v1 (Announce Type: new)

Abstract: We present CHALIS (Challenging Language Identification Samples), a new benchmark dataset explicitly designed to address difficult cases in language identification: cousin languages and orthographic noise. Our dataset has two parts. First, we collected sentences shared across mutually intelligible language pairs (Czech/Slovak, Spanish/Catalan, Portuguese/Galician, Danish