Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry · Event
PRODUCT_LAUNCH · 2026-05-27 · Impact: MEDIUM
arXiv:2604.27019v3 · Announce Type: replace-cross
Abstract: Safety-aligned language models must refuse harmful requests without broad over-refusal, but it remains unclear how dynamic adversarial fine-tuning changes refusal-control carriers: Kullback-Leibler (KL)-constrained directions or small subspaces that causally modulate refusal without large safe-prompt distribution shifts. We study a 7B backbone under supervised fine-tuning (SFT
Related coverage
Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry
ArXiv CS.CL · 2026-05-27