Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

arXiv cs.AI, 2026-05-28. Authors: Max Lamparth, Daniel Fein, Andreas Haupt, Marcel Hussing, Mykel J. Kochenderfer

Abstract

arXiv:2605.27996v1 (new submission)

Single-axis mitigations of reward-model biases (e.g., reducing proxy reliance on length, sycophancy, or style) can rotate optimization pressure onto correlated proxies rather than eliminate it, a failure mode we call reward bias substitution. The failure is enabled by a measurement-versus-optimization gap between audit and policy-induced distributions during mitigation evaluation and policy training. We formalize mitigation outcomes into a regime taxonomy and prove that successful mitigation, bias substitution, and overcorrection produce identical observables under any audit-distribution scoring, including ranking accuracy and win-rate, even when granted oracle access to the true reward. Across published preference-learning mitigation work, no method we survey reports the evidence needed to certify successful mitigation.
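The indistinguishability claim can be made concrete with a toy sketch. This is not the authors' formalization: the three features (quality, length, style), the reward weights, and the data are all hypothetical, chosen only so that the audit set has no variation along the style axis while the policy's candidate pool does. Under that assumption, a reward model that has merely moved its weight from length to style scores the same on the audit as one that was genuinely fixed, even though accuracy is computed against the true reward.

```python
from typing import Callable, List, Tuple

# Hypothetical response features: (quality, length, style).
Response = Tuple[float, float, float]
Reward = Callable[[Response], float]

def true_reward(x: Response) -> float:
    return x[0]                    # only quality matters (oracle)

def r_biased(x: Response) -> float:
    return x[0] + 2.0 * x[1]       # original model: leans on length

def r_clean(x: Response) -> float:
    return x[0]                    # successful mitigation

def r_substituted(x: Response) -> float:
    return x[0] + 2.0 * x[2]       # length weight gone, pressure moved to style

def ranking_accuracy(reward: Reward,
                     pairs: List[Tuple[Response, Response]]) -> float:
    """Fraction of pairs where `reward` picks the same winner as the true reward."""
    hits = sum(
        (reward(a) > reward(b)) == (true_reward(a) > true_reward(b))
        for a, b in pairs
    )
    return hits / len(pairs)

def best_of_n(reward: Reward, candidates: List[Response]) -> Response:
    """Crude stand-in for policy optimization: pick the top-scoring candidate."""
    return max(candidates, key=reward)

# Audit distribution: length varies, style is constant (assumption).
audit_pairs = [
    ((0.9, 1.0, 0.0), (0.2, 3.0, 0.0)),
    ((0.4, 2.0, 0.0), (0.7, 1.0, 0.0)),
    ((0.6, 1.0, 0.0), (0.1, 1.0, 0.0)),
]

# Policy-induced distribution: the policy can vary style freely (assumption).
policy_candidates = [
    (0.9, 1.0, 0.0),
    (0.5, 1.0, 1.0),
    (0.1, 1.0, 3.0),
]

for name, r in [("biased", r_biased), ("clean", r_clean),
                ("substituted", r_substituted)]:
    acc = ranking_accuracy(r, audit_pairs)
    chosen = best_of_n(r, policy_candidates)
    print(f"{name:12s} audit accuracy={acc:.2f}  "
          f"true reward of selected response={true_reward(chosen):.1f}")
```

In this sketch the audit separates the biased model from the other two, but not the clean model from the substituted one. Only the selection step over the shifted candidate pool reveals the difference.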
