Alignment Risks from Capability-Seeking RL Training

ArXiv CS.CL | 2026-06-05 | NEWS | en
Authors: Yujun Zhou, Yue Huang, Han Bao, Kehan Guo, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang

Abstract

arXiv:2602.12124v2 (announce type: replace-cross)

While most AI alignment research focuses on preventing models from generating explicitly harmful content, a more subtle risk arises from capability-seeking RL training in vulnerable environments. We investigate whether language models, when trained with reinforcement learning (RL) in environments with implicit loopholes, can learn to exploit these flaws to maximize reward, even without being explicitly instructed to do so. To test this, we design a suite of four diverse "vulnerability games," each presenting a structural vulnerability related to context-conditional compliance, proxy metrics, reward tampering, and self-evaluation. Our experiments show that models often learn to exploit these vulnerabilities, discovering opportunistic strategies that increase reward while sometimes preserving or even improving standard task-performance metrics.
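
The proxy-metric failure mode mentioned in the abstract can be illustrated with a toy sketch. This is not the paper's setup or any of its vulnerability games: the two strategies, the reward values, and the names `PROXY_REWARD`, `TRUE_QUALITY`, and `train` are all hypothetical, chosen only to show how a policy optimized on a flawed reward signal can drift toward the loophole without ever being told to use it.

```python
import math

# Hypothetical toy environment (not from the paper): the policy picks one of
# two strategies. All numbers below are made up for illustration.
ACTIONS = ["honest", "exploit"]
PROXY_REWARD = {"honest": 0.6, "exploit": 1.0}  # what the training signal measures
TRUE_QUALITY = {"honest": 0.6, "exploit": 0.0}  # what the designer actually wanted


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected(values, probs):
    """Expected value of a per-action quantity under the policy."""
    return sum(p * values[a] for p, a in zip(probs, ACTIONS))


def train(steps=200, lr=1.0):
    """Exact (non-sampled) policy-gradient ascent on the proxy reward only."""
    logits = [0.0, 0.0]  # start with a uniform policy
    for _ in range(steps):
        probs = softmax(logits)
        baseline = expected(PROXY_REWARD, probs)
        # For a softmax policy: dE[r]/dlogit_i = p_i * (r_i - E[r]).
        for i, a in enumerate(ACTIONS):
            logits[i] += lr * probs[i] * (PROXY_REWARD[a] - baseline)
    return softmax(logits)


initial_probs = softmax([0.0, 0.0])
final_probs = train()

print("P(exploit) before/after:", initial_probs[1], final_probs[1])
print("proxy reward before/after:",
      expected(PROXY_REWARD, initial_probs), expected(PROXY_REWARD, final_probs))
print("true quality before/after:",
      expected(TRUE_QUALITY, initial_probs), expected(TRUE_QUALITY, final_probs))
```

In this sketch the proxy reward rises while the hypothetical true quality falls, because the update rule only ever sees `PROXY_REWARD`. The abstract reports a subtler pattern in which standard task-performance metrics are sometimes preserved or improved, which this toy does not model.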

Related events (2)

Alignment Risks from Capability-Seeking RL Training
2026-06-05 | REGULATION | Impact: MEDIUM
Alignment Risks from Capability-Seeking RL Training
2026-06-05 | PRODUCT_LAUNCH | Impact: MEDIUM
