CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders (Event)

PRODUCT_LAUNCH | 2026-05-28 | Impact: MEDIUM

arXiv:2604.01604v2 | Announce Type: replace

Abstract: While modern LLMs are aligned to refuse harmful requests, it is essential to understand the underlying mechanistic basis of this refusal behavior for model safety analysis. For example, steering-based jailbreak attacks exploit this by identifying and manipulating sparse, neuron-like refusal features to bypass safety guardrails. Current feature selection methods primari