Goldfish: Monolingual Language Models for 350 Languages

ArXiv cs.CL | 2026-06-01 | Authors: Tyler A. Chang, Catherine Arnett, Zhuowen Tu, Benjamin K. Bergen

Abstract

arXiv:2408.10441v3

For many low-resource languages, the only available language models are large multilingual models trained on many languages simultaneously. Despite state-of-the-art performance on reasoning tasks, we find that these models still struggle with basic grammatical text generation in many languages. First, large multilingual models perform worse than bigram models for many languages (e.g. 24% of languages in XGLM 4.5B; 43% in BLOOM 7.1B) using FLORES perplexity as an evaluation metric. Second, when we train small monolingual models with only 125M parameters on 1GB or less of data for 350 languages, these small models outperform large multilingual models both in perplexity and on a massively multilingual grammaticality benchmark. To facilitate future work on low-resource language modeling, we release Goldfish, a suite of over 1,000 small monolingual language models trained comparably for 350 languages.
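To make the bigram comparison concrete, here is a minimal sketch of how a bigram baseline's perplexity can be computed on held-out text. The abstract does not describe the tokenization or smoothing used in the paper, so the pre-tokenized input and add-one smoothing below are assumptions for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Count contexts and bigrams over a list of training tokens."""
    # Context counts exclude the final token, which has no successor.
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    return contexts, bigrams


def bigram_perplexity(tokens, contexts, bigrams, vocab_size):
    """Perplexity of each token given its predecessor.

    Uses add-one (Laplace) smoothing, an assumption made here so that
    unseen bigrams get non-zero probability.
    """
    log_prob = 0.0
    n = 0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)


# Toy usage: train and evaluate on tiny token lists.
train_tokens = "a b a b a b".split()
eval_tokens = "a b a b".split()
contexts, bigrams = train_bigram(train_tokens)
ppl = bigram_perplexity(eval_tokens, contexts, bigrams, vocab_size=2)
```

A language model "performs worse than bigrams" in this sense when its perplexity on the same evaluation text is higher than the value such a baseline produces. For the two numbers to be comparable, both must be measured over the same text and normalized over the same units.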
