One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models

ArXiv CS.CL · 2026-06-02 · NEWS · en · Authors: Daniel Fein, Max Lamparth, Violet Xiang, Mykel J. Kochenderfer, Nick Haber

Related Technologies