MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models

ArXiv cs.CL | 2026-06-02 | Authors: Partha Pratim Saha, Samarth Raina, Mayur Parvatikar, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das

Abstract

arXiv:2606.01060v1

Preference alignment has substantially improved the observable behavior of large language models, yet it remains unclear what alignment changes internally. Aligned systems still fail under jailbreaks, prompt injection, and retrieval-time corruption, suggesting that behavior-level evaluation alone is incomplete. Post-training should leave measurable traces in internal computation. We ask: when an instruction-tuned (IT) model becomes a preference-aligned (PA) model, what geometric structure changes, where do those changes concentrate, and how selectively do they vary across concepts, prompts, and model families? We introduce MENTIS, a geometry-first framework for measuring alignment-induced internal reorganization in paired checkpoints.
