Reconsidering Positional Supervision in Masked Diffusion Language Model Training 文章

ArXiv CS.CL2026-06-02NEWSen作者: Mengyu Ye, Keito Kudo, Ryosuke Takahashi, Jun Suzuki

摘要

arXiv:2601.22947v2 Announce Type: replace Abstract: Masked diffusion language models (MDLMs) generate text by unmasking tokens in parallel and have recently emerged as alternatives to autoregressive language models. They can be viewed as parallel decoders trained with a position-wise cross-entropy (CE) loss, the same setup as non-autoregressive translation (NAT). In NAT, CE-trained parallel decoders have been argued to be sensitive to small positional shifts, since CE penalizes them harshly. We ask whether CE-trained MDLMs are similarly sensitive to such shifts under iterative decoding. To probe this, we apply a controlled intervention that introduces them during decoding. On LLaDA-8B-Instruct with Arena-Hard, displacing as little as 1% of generated tokens by one position substantially reduces win rates against the unintervened model, showing that MDLMs are sensitive to such small shifts under iterative parallel decoding.

Reconsidering Positional Supervision in Masked Diffusion Language Model Training 文章

摘要

相关事件查看全部 (1)

相关公司

相关人物

相关产品查看全部 (6)

相关技术查看全部 (4)