Event: Consistency Training Can Entrench Misalignment

PRODUCT_LAUNCH | 2026-06-03 | Impact: MEDIUM

arXiv:2606.03810v1 Announce Type: new

Abstract: Consistency training encourages a model to produce similar outputs across related inputs or sampling procedures. Such methods are simple, scalable, and largely label-free, but their effects on model alignment remain poorly understood. Could the self-bootstrapping nature of these methods amplify undesired behavior in models? We test seven consistency training methods on 108 "model organisms": open-sour