Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection 文章

ArXiv CS.CL2026-05-27NEWSen作者: Xulin Hu, Che Wang, Wei Yang Bryan Lim, Jianbo Gao, Zhong Chen

摘要

arXiv:2605.02958v2 Announce Type: replace-cross Abstract: Representation Engineering analyses often characterize refusal using static directions extracted from terminal or pooled representations. We ask whether this view misses how refusal is constructed across layer-token positions. Using causal tracing, we identify a \textit{Refusal Trajectory}: a sparse upstream activation pattern that often persists even when attacks such as GCG suppress terminal refusal signals. Based on this observation, we propose SALO (Sparse Activation Localization Operator), a lightweight white-box detector that operates on raw hidden-state volumes from a selected layer window. Across Qwen, Llama, and Mistral models, SALO improves jailbreak detection on several attack families under a fixed XSTest-calibrated operating point.