Latent-space Attacks for Refusal Evasion in Language Models 文章

ArXiv CS.AI2026-06-08NEWSen作者: Giorgio Piras, Raffaele Mura, Fabio Brau, Maura Pintor, Luca Oneto, Fabio Roli, Battista Biggio

查看原文 →

关系图谱

摘要

arXiv:2605.21706v2 Announce Type: replace Abstract: Safety-aligned language models are trained to refuse harmful requests, yet refusal behavior can be suppressed by steering their internal representations. Existing methods do so by ablating a refusal direction from model activations, aiming to remove refusal from the model's residual stream. Despite their empirical success, these methods lack a principled account of the latent-space transformation they induce and why it suppresses refusal. In this work, we recast refusal suppression as a latent-space evasion attack against linear probes trained to separate refused from answered prompts. Under this view, prior work's difference-in-means direction naturally defines such a probe, and its ablation is exactly a projection onto its decision boundary, i.e., a minimum-confidence evasion attack.

Latent-space Attacks for Refusal Evasion in Language Models 文章

摘要

相关事件查看全部 (1)

相关公司

相关人物

相关产品

相关技术