Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection 事件
PRODUCT_LAUNCH2026-05-27影响: MEDIUM
Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection arXiv:2605.02958v2 Announce Type: replace-cross Abstract: Representation Engineering analyses often characterize refusal using static directions extracted from terminal or pooled representations. We ask whether this view misses how refusal is constructed across layer-token positions. Using causal tracing, we identify a \textit{Refusal Trajectory}: a sparse upstream activation pattern that ofte
相关人物
暂无数据