Persona-Model Collapse in Emergent Misalignment · Event

PRODUCT_LAUNCH · 2026-05-26 · Impact: MEDIUM

arXiv:2605.12850v2 · Announce type: replace

Abstract: Fine-tuning large language models on narrow data with harmful content produces broadly misaligned behavior on unrelated prompts, a phenomenon known as emergent misalignment. We propose that emergent misalignment involves persona-model collapse: deterioration of the model's internal capacity to simulate, differentiate, and maintain consistent characters. We test this hypothesis behaviorally using

Persona-Model Collapse in Emergent Misalignment · Related companies

SEC · GOVERNMENT
Ron · COMPANY
INVOLV · NONPROFIT
arXiv · NONPROFIT
TERI · NONPROFIT
ACT · NONPROFIT
Character · NONPROFIT
Ratio · RESEARCH_INSTITUTE