The Latin Substrate: How Language Models Represent and Mediate Script Choice

arXiv cs.CL, 2026-06-01. Authors: Daniil Gurgurov, Alan Saji, Katharina Trinley, Josef van Genabith, Simon Ostermann

Abstract

arXiv:2605.31363v1. Many languages are written in multiple scripts, requiring large language models (LLMs) to generate equivalent linguistic content in distinct orthographic forms. While prior work suggests that LLMs route information through shared latent representations, how they internally mediate script variation remains poorly understood. We study this question by first examining per-layer output distributions with the logit lens, which reveals consistent latent romanization during transliteration, and then through representational and mechanistic analyses of script generation. At the representational level, we show that scripts of the same language become increasingly separable across layers and that a simple linear steering direction can flip a model's output script while largely maintaining semantic content.
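The two techniques named in the abstract, the logit lens and a linear steering direction, can be illustrated with a toy NumPy sketch. This is not the authors' code: the data are synthetic, the function names `logit_lens`, `steering_direction`, and `steer` are hypothetical, and computing the direction as a difference of class means is an assumption (the abstract does not say how the direction is obtained). The lens here also skips the final normalization layer that real models typically apply before unembedding.

```python
import numpy as np


def logit_lens(hidden_states, unembed):
    """Project each layer's hidden state onto the vocabulary.

    hidden_states: (n_layers, d_model) array, one vector per layer.
    unembed:       (vocab_size, d_model) unembedding matrix.
    Returns the index of the top-scoring vocabulary item per layer.
    Simplification: no final normalization is applied before projecting.
    """
    logits = hidden_states @ unembed.T  # (n_layers, vocab_size)
    return logits.argmax(axis=-1)


def steering_direction(acts_source, acts_target):
    """Unit-norm difference of mean activations (an assumed, common choice)."""
    diff = acts_target.mean(axis=0) - acts_source.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(hidden, direction, alpha):
    """Shift a hidden state along the steering direction by strength alpha."""
    return hidden + alpha * direction


# Synthetic stand-ins for activations collected on two scripts of one language.
rng = np.random.default_rng(0)
d_model = 16
offset = np.zeros(d_model)
offset[0] = 4.0  # the two "scripts" differ mainly along axis 0
acts_latin = rng.normal(size=(200, d_model))
acts_native = rng.normal(size=(200, d_model)) + offset

direction = steering_direction(acts_latin, acts_native)
h = acts_latin[0]
h_steered = steer(h, direction, alpha=8.0)

# Toy logit lens: three "layers" whose hidden states align with tokens 0, 1, 2.
top_tokens = logit_lens(np.eye(3, 4), np.eye(4))
```

In a real model, `hidden` would be a residual-stream activation at a chosen layer and `steer` would be applied through a forward hook; the layer and the strength `alpha` would need to be tuned empirically.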
