Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling: Event

Name: Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling
Start: 2026-06-02

PRODUCT_LAUNCH | 2026-06-02 | Impact: MEDIUM

arXiv:2602.10623v2 (Announce Type: replace-cross)

Abstract: Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback, yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose Bayesian Non-Negative Reward Model (BNRM), a principled reward modeling framewo

Artificial Intelligence