CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders
Event: PRODUCT_LAUNCH | 2026-05-28 | Impact: MEDIUM
arXiv:2604.01604v2 | Announce Type: replace
Abstract: While modern LLMs are aligned to refuse harmful requests, it is essential to understand the underlying mechanistic basis of this refusal behavior for model safety analysis. For example, steering-based jailbreak attacks exploit this by identifying and manipulating sparse, neuron-like refusal features to bypass safety guardrails. Current feature selection methods primari
Related coverage (1)
CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders
ArXiv CS.AI | 2026-05-28