Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning [Event]
OPEN_SOURCE | 2026-06-09 | Impact: MEDIUM
arXiv:2606.07631v1 | Announce Type: cross
Abstract: Emergent misalignment (EM) occurs when narrow finetuning causes a model to behave dangerously outside the finetuning task. Standard training signals can miss this shift, so reliable detection becomes costly if it depends on repeated behavioral evaluation. We ask whether emergent misalignment can instead be detected from internal representations during finetuning. Using sev