Discovering Lexical Gaps Using Embeddings from Multilingual LLMs

arXiv cs.CL, 2026-05-26. Authors: Yoonwon Jung, Aaron S. Cohen, Benjamin K. Bergen

Abstract

arXiv:2605.24310v1. Announce type: new.

Lexical gaps are words that do not exist in certain languages. They pose challenges for building multilingual lexical resources, for machine translation, and for cross-lingual transfer. Existing lexical gap detection relies on human judgments or fixed conceptual taxonomies. We propose a data-driven framework for identifying cross-lingual lexical gaps. We extracted contextualized embeddings from Korean-English bilingual LLMs for Korean-to-English and English-to-Korean translation pairs. Combinations of LLMs, embedding types, dimensionality, and orthogonal transformations across 100 train-test splits yielded 4000 distinct embedding spaces in each source language. In each space, we computed the semantic similarity between each source word and its nearest neighbor in the target language, and compared their distribution for gap words versus non-gap words.
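The nearest-neighbor comparison described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not name the similarity measure or the alignment procedure, so cosine similarity and an orthogonal Procrustes mapping are assumed here, and the embeddings are synthetic random vectors standing in for LLM embeddings. The function names `orthogonal_map` and `nearest_neighbor_similarity` are hypothetical.

```python
import numpy as np


def orthogonal_map(src, tgt):
    """Orthogonal Procrustes: the orthogonal W minimising ||src @ W - tgt||_F.

    Assumption: the abstract only says "orthogonal transformations";
    Procrustes is one common choice, not necessarily the paper's.
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def nearest_neighbor_similarity(src, tgt):
    """Cosine similarity of each source row to its closest target row.

    Assumption: cosine is used as the semantic similarity measure.
    """
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return (src_n @ tgt_n.T).max(axis=1)


# Synthetic stand-ins for embeddings (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
dim, n_pairs, n_train, n_gap = 16, 50, 30, 20

# Target-language vocabulary and a hidden rotation relating the two spaces.
tgt_vocab = rng.normal(size=(n_pairs, dim))
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

# Non-gap source words: rotated, slightly noisy copies of their translations.
src_pairs = tgt_vocab @ rotation.T + 0.05 * rng.normal(size=(n_pairs, dim))

# Gap source words: vectors with no counterpart in the target vocabulary.
src_gap = rng.normal(size=(n_gap, dim))

# Learn the mapping on a training split of translation pairs only.
W = orthogonal_map(src_pairs[:n_train], tgt_vocab[:n_train])

# Nearest-neighbor similarity for held-out non-gap words and for gap words.
sim_nongap = nearest_neighbor_similarity(src_pairs[n_train:] @ W, tgt_vocab)
sim_gap = nearest_neighbor_similarity(src_gap @ W, tgt_vocab)

print("mean nearest-neighbor similarity, non-gap:", sim_nongap.mean())
print("mean nearest-neighbor similarity, gap:    ", sim_gap.mean())
```

In this toy setup the gap vectors have no close neighbor in the target space by construction, so their nearest-neighbor similarities tend to sit lower than those of the held-out translation pairs. Whether real embeddings separate as cleanly is the empirical question the paper addresses.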
