In-Training Defenses against Emergent Misalignment in Language Models
Event: PRODUCT_LAUNCH; Date: 2026-06-06; Impact: MEDIUM
arXiv:2508.06249v3 Announce Type: replace-cross
Abstract: Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EM): even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even when model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broad