Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning · Event

OPEN_SOURCE · 2026-06-09 · Impact: MEDIUM

arXiv:2606.07631v1 · Announce Type: cross

Abstract: Emergent misalignment (EM) occurs when narrow finetuning causes a model to behave dangerously outside the finetuning task. Standard training signals can miss this shift, making reliable detection costly if it depends on repeated behavioral evaluation. We ask whether emergent misalignment can instead be detected from internal representations during finetuning. Using sev
