Alignment Risks from Capability-Seeking RL Training (Event)
REGULATION | 2026-06-05 | Impact: MEDIUM
arXiv:2602.12124v2 (announce type: replace-cross)
Abstract: While most AI alignment research focuses on preventing models from generating explicitly harmful content, a more subtle risk arises from capability-seeking RL training in vulnerable environments. We investigate whether language models, when trained with reinforcement learning (RL) in environments with implicit loopholes, can learn to exploit these flaws to maximize reward, even without
Related coverage (1)
Alignment Risks from Capability-Seeking RL Training
ArXiv CS.CL | 2026-06-05