Latent-space Attacks for Refusal Evasion in Language Models · Event

PRODUCT_LAUNCH · 2026-06-08 · Impact: MEDIUM

arXiv:2605.21706v2 · Announce Type: replace

Abstract: Safety-aligned language models are trained to refuse harmful requests, yet refusal behavior can be suppressed by steering their internal representations. Existing methods do so by ablating a refusal direction from model activations, aiming to remove refusal from the model's residual stream. Despite their empirical success, these methods lack a principled account of the latent-space tr

Latent-space Attacks for Refusal Evasion in Language Models · Related coverage