Alignment Risks from Capability-Seeking RL Training · Event
REGULATION · 2026-06-05 · Impact: MEDIUM
arXiv:2602.12124v2 · Announce type: replace-cross
Abstract: While most AI alignment research focuses on preventing models from generating explicitly harmful content, a more subtle risk arises from capability-seeking RL training in vulnerable environments. We investigate whether language models, when trained with reinforcement learning (RL) in environments with implicit loopholes, can learn to exploit these flaws to maximize reward, even without
Related coverage
Alignment Risks from Capability-Seeking RL Training
ArXiv CS.CL · 2026-06-05