Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling · Article

ArXiv CS.AI · 2026-06-02 · Authors: Zhibin Duan, Guowei Rong, Zhuo Li, Bo Chen, Mingyuan Zhou, Dandan Guo

Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling · Related Techniques