Measuring, Localizing, and Ablating Alignment Signatures in LLMs

arXiv cs.CL, 2026-06-01. Authors: Aniket Anand, Janvijay Singh, Zhewei Sun, Dilek Hakkani-Tür, Nick Feamster

Abstract

arXiv:2605.30526v1 (announce type: cross)

Aligned language models often exhibit a recognizable AI-like style, yet its connection to post-training and to internal representations remains poorly understood. In this work, we study whether post-training introduces or amplifies AI-like stylistic regularities and whether these regularities have a localized internal signature. To this end, we compare human text, base-model generations, and aligned-model generations under matched human-source prefixes. Aligned generations show lower human-corpus affinity and higher AI-detection rates than base generations, suggesting that post-training shifts generated text away from human-corpus style and toward detector-visible AI-like text. We then introduce PASTA (Post-training Alignment Signature Targeted Ablation), a training-free method that estimates a post-training alignment signature from aligned-base residual contrasts and ablates the corresponding direction during decoding.
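The abstract does not say how the aligned-base residual contrast is computed or at which layers the ablation is applied. As a rough illustration only, the sketch below assumes a difference-of-means estimator over residual activations and a simple orthogonal projection to remove the resulting direction. The function names (`signature_direction`, `ablate`) and the toy activations are hypothetical and are not taken from the paper.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def signature_direction(aligned_acts, base_acts):
    """Unit vector along (mean aligned activation - mean base activation).

    Assumption: a difference-of-means contrast. The abstract only says the
    signature is estimated from aligned-base residual contrasts.
    """
    aligned_mean = mean_vector(aligned_acts)
    base_mean = mean_vector(base_acts)
    diff = [a - b for a, b in zip(aligned_mean, base_mean)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("aligned and base means coincide; no direction to ablate")
    return [x / norm for x in diff]


def ablate(hidden, direction):
    """Remove the component of `hidden` along the unit vector `direction`."""
    coeff = sum(h * d for h, d in zip(hidden, direction))
    return [h - coeff * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional "residual activations" (made-up numbers for illustration).
aligned_acts = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
base_acts = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

direction = signature_direction(aligned_acts, base_acts)

# In a real model this step would run on the residual stream at each decoding
# step; here it is applied to a single hand-written vector.
hidden = [5.0, -2.0, 0.5]
cleaned = ablate(hidden, direction)
```

After ablation, `cleaned` has no component along `direction`, while the components orthogonal to it are unchanged. Which layers and token positions to intervene on is left open here, since the abstract does not specify them.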
