One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries (Event)

PRODUCT_LAUNCH | 2026-05-26 | Impact: MEDIUM

arXiv:2605.14605v2. Announce Type: replace-cross. Abstract: Model providers increasingly release open weights or allow users to fine-tune foundation models through APIs. Although these models are safety-aligned before release, their safeguards can often be removed by fine-tuning on harmful data. Recent defenses aim to make models robust to such malicious fine-tuning, but they are largely evaluated only