Consistency Training while Mitigating Obfuscation via Rate Matching

arXiv cs.CL | 2026-06-02 | Authors: Sohaib Imran, Prakhar Gupta, Jannes Elstner, David Demitri Africa

Abstract

arXiv:2606.02211v1 (announce type: new)

Large language models are often influenced by extraneous input features, such as cues revealing a user's preferred answer. Consistency training reduces this influence by training models to behave similarly across inputs with and without the extraneous feature. However, existing methods train for consistency over entire responses or internal activations, which also constrains whether the model verbalises said extraneous features. We show this leads to obfuscation, where the model learns not to mention a cue while remaining influenced by it, which may undermine monitorability. To address this, we introduce Rate Matching Consistency Training (RMCT), which trains for consistency over selected behavioural properties without constraining how this behaviour is expressed. RMCT matches the rate at which the model exhibits a target behaviour (e.g.
