Alignment Risks from Capability-Seeking RL Training · Event

REGULATION · 2026-06-05 · Impact: MEDIUM

arXiv:2602.12124v2 (announce type: replace-cross)

Abstract: While most AI alignment research focuses on preventing models from generating explicitly harmful content, a more subtle risk arises from capability-seeking RL training in vulnerable environments. We investigate whether language models, when trained with reinforcement learning (RL) in environments with implicit loopholes, can learn to exploit these flaws to maximize reward, even without

Alignment Risks from Capability-Seeking RL Training · Related coverage