Persona-Model Collapse in Emergent Misalignment (Event)

PRODUCT_LAUNCH | 2026-05-26 | Impact: MEDIUM

arXiv:2605.12850v2 (Announce Type: replace)

Abstract: Fine-tuning large language models on narrow data with harmful content produces broadly misaligned behavior on unrelated prompts, a phenomenon known as emergent misalignment. We propose that emergent misalignment involves persona-model collapse: deterioration of the model's internal capacity to simulate, differentiate, and maintain consistent characters. We test this hypothesis behaviorally using