When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming

ArXiv CS.AI · 2026-06-03 · News · en · Author: Zelalem Abahana
