Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations
Event: PRODUCT_LAUNCH | 2026-05-28 | Impact: MEDIUM
arXiv:2605.28553v1 Announce Type: new
Abstract: In this paper, we investigate whether refusal behavior can be predicted from LLM intermediate activations before decoding, using linear probes trained on residual stream activations at each transformer block. We find that refusal is linearly decodable well before the final layer, indicating that safety-relevant behavior is represented in intermediate a
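
The per-block linear probing described in the abstract can be illustrated with a minimal sketch. This is not the paper's code: the activations below are synthetic stand-ins drawn from a Gaussian, the "refusal direction" and the per-block signal strengths are hypothetical, and the probe is a plain logistic regression trained by gradient descent using only the Python standard library. In a real experiment the inputs would be residual stream activations captured from a model, which this sketch does not attempt.

```python
import math
import random


def make_layer_data(n, dim, strength, rng):
    """Synthetic stand-in for residual-stream activations at one block.

    Assumption: the refusal label shifts activations along a single
    hypothetical unit direction, scaled by `strength`.
    """
    direction = [1.0 / math.sqrt(dim)] * dim
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)  # 1 = refuses, 0 = complies (synthetic label)
        sign = 2 * y - 1
        xs.append([rng.gauss(0.0, 1.0) + strength * sign * d for d in direction])
        ys.append(y)
    return xs, ys


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(xs, ys, lr=0.1, epochs=200):
    """Fit a linear probe (logistic regression) with full-batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def accuracy(w, b, xs, ys):
    """Fraction of examples where the probe's sign matches the label."""
    correct = 0
    for x, y in zip(xs, ys):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += pred == y
    return correct / len(xs)


def probe_all_layers(strengths, dim=8, n_train=200, n_test=100, seed=0):
    """Train one probe per block and report held-out accuracy for each."""
    rng = random.Random(seed)
    accs = []
    for s in strengths:
        train_x, train_y = make_layer_data(n_train, dim, s, rng)
        test_x, test_y = make_layer_data(n_test, dim, s, rng)
        w, b = train_probe(train_x, train_y)
        accs.append(accuracy(w, b, test_x, test_y))
    return accs


# Hypothetical signal strengths: no refusal signal at block 0, strong by block 3.
accs = probe_all_layers([0.0, 0.5, 1.5, 3.0])
for layer, acc in enumerate(accs):
    print(f"block {layer}: held-out probe accuracy = {acc:.2f}")
```

Reading the per-block accuracies from first to last block is the kind of evidence the abstract refers to: a probe that already separates the classes at an intermediate block suggests the behavior is linearly decodable there. Because the signal strengths here are chosen by hand, the sketch shows only the procedure, not any empirical result.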