Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning
Event | OPEN_SOURCE | 2026-06-09 | Impact: MEDIUM
arXiv:2606.07631v1 Announce Type: cross
Abstract: Emergent misalignment (EM) occurs when narrow finetuning causes a model to behave dangerously outside the finetuning task. Standard training signals can miss this shift, making reliable detection costly if it depends on repeated behavioral evaluation. We ask whether emergent misalignment can instead be detected from internal representations during finetuning. Using sev
Source: ArXiv CS.AI, 2026-06-09