CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders

Source: arXiv cs.AI, 2026-05-28
Authors: Su-Hyeon Kim, Hyundong Jin, Yejin Lee, Yo-Sub Han

Abstract

arXiv:2604.01604v2 (announce type: replace)

Modern LLMs are aligned to refuse harmful requests, and understanding the mechanistic basis of this refusal behavior is essential for model safety analysis. Steering-based jailbreak attacks, for example, exploit that mechanism by identifying and manipulating sparse, neuron-like refusal features to bypass safety guardrails. Current feature selection methods rely primarily on how strongly features activate on harmful prompts. However, activation strength alone often captures superficial heuristics such as topic or lexical cues rather than the true causal mechanisms. Selecting refusal features therefore requires measuring inter-feature relationships instead of treating each feature as an isolated activation signal. Based on this insight, we propose CRaFT, a circuit-guided framework for identifying the critical refusal features that directly govern the refusal decision.
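The contrast the abstract draws between activation-based and influence-aware selection can be illustrated with a toy sketch. Everything here is a hypothetical assumption for illustration: the feature names, the activation values, and the `effect_on_refusal` weights are invented, and the influence score (mean activation times an assumed effect on the refusal output) is a simple attribution-style stand-in, not CRaFT's actual scoring rule.

```python
from statistics import mean

# Hypothetical per-prompt activations for three made-up features.
harmful = {
    "topic_weapons": [0.9, 0.8, 0.95],
    "refusal_gate": [0.4, 0.5, 0.45],
    "politeness": [0.1, 0.2, 0.1],
}
harmless = {
    "topic_weapons": [0.1, 0.0, 0.05],
    "refusal_gate": [0.05, 0.0, 0.1],
    "politeness": [0.1, 0.2, 0.15],
}

# Assumed downstream effect of each feature on the refusal output.
# In a real analysis this would come from a circuit or attribution graph;
# here the numbers are invented for illustration.
effect_on_refusal = {"topic_weapons": 0.0, "refusal_gate": 2.0, "politeness": 0.1}


def rank_by_activation(harmful_acts, harmless_acts):
    """Rank features by mean activation gap between harmful and harmless prompts."""
    scores = {f: mean(harmful_acts[f]) - mean(harmless_acts[f]) for f in harmful_acts}
    return sorted(scores, key=scores.get, reverse=True)


def rank_by_influence(harmful_acts, effect):
    """Rank features by mean activation weighted by assumed effect on refusal."""
    scores = {f: mean(harmful_acts[f]) * effect[f] for f in harmful_acts}
    return sorted(scores, key=scores.get, reverse=True)


activation_ranking = rank_by_activation(harmful, harmless)
influence_ranking = rank_by_influence(harmful, effect_on_refusal)
print("by activation:", activation_ranking)
print("by influence: ", influence_ranking)
```

In this toy setup the topic feature tops the activation ranking even though it has no assumed effect on the refusal output, while the influence-weighted ranking puts the gating feature first. That is the kind of mismatch the abstract attributes to activation-only selection.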
