The Latin Substrate: How Language Models Represent and Mediate Script Choice

arXiv cs.CL | 2026-06-01 | Authors: Daniil Gurgurov, Alan Saji, Katharina Trinley, Josef van Genabith, Simon Ostermann

Abstract

arXiv:2605.31363v1 (announce type: new)

Many languages are written in multiple scripts, requiring large language models (LLMs) to generate equivalent linguistic content in distinct orthographic forms. While prior work suggests that LLMs route information through shared latent representations, how they internally mediate script variation remains poorly understood. We study this question by first examining per-layer output distributions with the logit lens, which reveals consistent latent romanization during transliteration, and then by conducting representational and mechanistic analyses of script generation. At the representational level, we show that scripts of the same language become increasingly separable across layers and that a simple linear steering direction can flip a model's output script while largely maintaining semantic content.
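As a rough illustration of what a linear steering direction can look like, the sketch below builds one as a normalized difference of class means and adds it to a hidden state. This is an assumption made for illustration only: the abstract does not say how the direction is computed, and the function names, the three-dimensional toy activations, and the strength parameter `alpha` are all hypothetical.

```python
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_direction(source_acts: List[Vector], target_acts: List[Vector]) -> Vector:
    """Unit vector pointing from the source-script mean to the target-script mean.

    Difference of means is an assumed construction, not necessarily the paper's.
    """
    mu_source = mean(source_acts)
    mu_target = mean(target_acts)
    diff = [t - s for s, t in zip(mu_source, mu_target)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the steering direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def project(hidden: Vector, direction: Vector) -> float:
    """Scalar projection of a hidden state onto the steering direction."""
    return sum(h * d for h, d in zip(hidden, direction))


# Hypothetical toy activations: the first coordinate separates the two scripts,
# while the remaining coordinates stand in for shared semantic content.
latin_acts = [[1.0, 0.0, 0.5], [1.2, 0.1, 0.4], [0.8, -0.1, 0.6]]
native_acts = [[-1.0, 0.0, 0.5], [-0.8, 0.1, 0.4], [-1.2, -0.1, 0.6]]

direction = steering_direction(latin_acts, native_acts)
steered = steer(latin_acts[0], direction, alpha=2.0)

print("direction:", direction)
print("projection before:", project(latin_acts[0], direction))
print("projection after:", project(steered, direction))
```

In this toy setup the steered vector moves toward the native-script cluster along the first coordinate while the other coordinates stay unchanged, which loosely mirrors the idea of flipping script while largely preserving content. In a real model the activations would come from a chosen layer, and the layer and `alpha` would need to be tuned empirically.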
