Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards 文章

ArXiv CS.CL2026-06-01NEWSen作者: Magnus J{\o}rgenv{\aa}g, David Kacz\'er, Lasse Ruttert, Marvin G\"ulhan, Lucie Flek, Florian Mai

查看原文 →

关系图谱

摘要

arXiv:2605.31328v1 Announce Type: new Abstract: Emergent misalignment (EM) is the surprising tendency of language models to become broadly misaligned after fine-tuning on narrowly misaligned examples. While EM has been extensively studied in the supervised fine-tuning (SFT) setting, evidence that it also arises from reinforcement learning (RL) is limited to large, closed-source models, leaving the phenomenon expensive to study and difficult to reproduce. We characterize EM from RL in small, off-the-shelf open-weight models along three axes. First, we show that rewarding narrow, overtly misaligned behavior produces substantially higher general-domain misalignment than sample-matched SFT. Second, we show that EM from RL can be induced by reward signals that could plausibly arise naturally, such as unpopular aesthetic preferences or poor rhetorical appeals.

Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards 文章

摘要

相关事件查看全部 (1)

相关公司查看全部 (3)

相关人物

相关产品查看全部 (5)

相关技术查看全部 (31)