Event: Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

PRODUCT_LAUNCH | 2026-05-27 | Impact: MEDIUM

arXiv:2604.27019v3 | Announce Type: replace-cross

Abstract: Safety-aligned language models must refuse harmful requests without broad over-refusal, but it remains unclear how dynamic adversarial fine-tuning changes refusal-control carriers: Kullback-Leibler (KL)-constrained directions or small subspaces that causally modulate refusal without large safe-prompt distribution shifts. We study a 7B backbone under supervised fine-tuning (SFT