Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure · Event

PRODUCT_LAUNCH · 2026-05-28 · Impact: MEDIUM

arXiv:2605.27996v1 · Announce Type: new

Abstract: Single-axis mitigations of reward-model biases (e.g., reducing proxy reliance on length, sycophancy, or style) can rotate optimization pressure onto correlated proxies rather than eliminate it, a failure mode we call reward bias substitution. The failure is enabled by a measurement-versus-optimization gap between audit and policy-induced distributions during mitigation…
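
The substitution mechanism can be illustrated with a toy linear model. This is a minimal sketch, not the paper's method: every name and number in it (the three feature axes, the reward weights `w`, the per-axis `cost`, the audit correlation `rho`, and the helpers `audit_sample`, `audit_length_slope`, `budgeted_shift`) is a hypothetical assumption. It calibrates a single-axis length penalty on an audit distribution where length and style are correlated, so the measured length reliance drops to zero. It then lets a policy spend a fixed optimization budget (a quadratic constraint loosely standing in for a KL budget) under the mitigated reward, where the policy is free to move length and style independently.

```python
import math
import random

# Hypothetical feature axes: true quality, response length, and a stylistic
# proxy (e.g. formatting) that is correlated with length on the audit set.
AXES = ("quality", "length", "style")


def audit_sample(n, rho, seed=0):
    """Draw (quality, length, style) triples with unit variances,
    corr(length, style) = rho, and quality independent of both."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        q = rng.gauss(0.0, 1.0)
        length = rng.gauss(0.0, 1.0)
        style = rho * length + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0)
        samples.append((q, length, style))
    return samples


def reward(w, x):
    """Linear toy reward model: weighted sum of the feature values."""
    return sum(wi * xi for wi, xi in zip(w, x))


def audit_length_slope(w, samples):
    """OLS slope of reward on length over the audit samples, i.e. the
    'length reliance' an auditor would measure on this distribution."""
    lengths = [x[1] for x in samples]
    rewards = [reward(w, x) for x in samples]
    m_len = sum(lengths) / len(lengths)
    m_rew = sum(rewards) / len(rewards)
    cov = sum((ln - m_len) * (r - m_rew) for ln, r in zip(lengths, rewards))
    var = sum((ln - m_len) ** 2 for ln in lengths)
    return cov / var


def budgeted_shift(w, cost, budget=1.0):
    """Mean shift x maximizing w.x subject to sum(cost_i * x_i**2) <= budget**2.
    Closed form: x_i = budget * (w_i / cost_i) / sqrt(sum_j w_j**2 / cost_j)."""
    norm = math.sqrt(sum(wi**2 / ci for wi, ci in zip(w, cost)))
    return [budget * (wi / ci) / norm for wi, ci in zip(w, cost)]


w = (1.0, 1.0, 0.5)     # assumed reward weights on (quality, length, style)
cost = (4.0, 1.0, 1.0)  # assumed: quality is harder to move than surface features
audit = audit_sample(20_000, rho=0.6)

# Single-axis mitigation: subtract lam * length, with lam set so the audited
# length slope becomes zero. lam lands near w[1] + w[2] * rho = 1.3.
lam = audit_length_slope(w, audit)
w_mit = (w[0], w[1] - lam, w[2])

before = budgeted_shift(w, cost)
after = budgeted_shift(w_mit, cost)
print(f"penalty lam = {lam:.3f}")
print(f"audited length slope after mitigation = {abs(audit_length_slope(w_mit, audit)):.1e}")
for name, b, a in zip(AXES, before, after):
    print(f"{name:8s} shift before={b:+.2f} after={a:+.2f}")
```

Under these assumed numbers the audit-calibrated penalty comes out near 1.3, larger than the raw length weight of 1.0, because the audit slope also absorbs style's contribution through `rho`. The optimized policy therefore shortens its outputs (the length shift goes from about +0.82 to about -0.39) while its style shift rises from about 0.41 to about 0.65. The audit reports zero length reliance, yet optimization pressure has moved onto the correlated proxy, because the policy-induced distribution no longer preserves the length-style correlation the audit relied on.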

Related companies

arXiv · NONPROFIT
GLE · NONPROFIT
IREC · NONPROFIT
EARN · NONPROFIT
ACT · NONPROFIT
EGI · NONPROFIT
ITU · NONPROFIT