CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders

Source: arXiv cs.AI | Date: 2026-05-28 | Authors: Su-Hyeon Kim, Hyundong Jin, Yejin Lee, Yo-Sub Han

Abstract

arXiv:2604.01604v2 (announce type: replace)

While modern LLMs are aligned to refuse harmful requests, understanding the mechanistic basis of this refusal behavior is essential for model safety analysis. For example, steering-based jailbreak attacks exploit this mechanism by identifying and manipulating sparse, neuron-like refusal features to bypass safety guardrails. Current feature selection methods primarily rely on how strongly features activate on harmful prompts. However, activation strength alone often captures superficial heuristics such as topic or lexical cues, rather than the true causal mechanisms. Thus, selecting refusal features requires measuring inter-feature relationships, rather than treating each feature as an isolated activation signal. Based on this insight, we propose CRaFT, a circuit-guided framework for identifying critical refusal features that directly govern the refusal decision.
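To make the contrast in the abstract concrete, the following is a minimal sketch of the activation-strength baseline that the abstract argues is insufficient. It is not CRaFT and is not taken from the paper: the function name `rank_by_activation_gap`, the scoring rule (mean activation on harmful prompts minus mean activation on harmless prompts), and the toy numbers are all assumptions made for illustration.

```python
from statistics import mean


def rank_by_activation_gap(harmful, harmless, top_k=2):
    """Return indices of the top_k features ranked by activation gap.

    harmful / harmless: lists of per-prompt feature activation vectors
    (hypothetical toy data; rows are prompts, columns are features).
    The score is an assumed baseline: mean activation on harmful prompts
    minus mean activation on harmless prompts. Each feature is scored in
    isolation, with no inter-feature relationships considered.
    """
    n_features = len(harmful[0])
    scored = []
    for j in range(n_features):
        gap = mean(row[j] for row in harmful) - mean(row[j] for row in harmless)
        scored.append((gap, j))
    # Largest gap first; ties broken by the lower feature index.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scored[:top_k]]


# Toy activations for three features across two prompts per class.
harmful_acts = [[0.9, 0.1, 0.6], [0.8, 0.2, 0.7]]
harmless_acts = [[0.1, 0.1, 0.5], [0.2, 0.3, 0.6]]

selected = rank_by_activation_gap(harmful_acts, harmless_acts, top_k=2)
print(selected)
```

A ranking like this can favor a feature that merely tracks a harmful topic or keyword, which is the weakness the abstract points to. According to the abstract, CRaFT instead selects features using inter-feature relationships; the details of that procedure are not given in this listing and are not sketched here.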
