Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

ArXiv CS.AI | 2026-05-27 | NEWS | en | Authors: Kia-Jüng Yang, Dominik Meier, Jiachen Zhao, Terry Ruas, Bela Gipp

Details

Source site: ArXiv CS.AI
Authors: Kia-Jüng Yang, Dominik Meier, Jiachen Zhao, Terry Ruas, Bela Gipp
Article type: NEWS
Language: en
Publication date: 2026-05-27

Abstract

arXiv:2605.26772v1 (announce type: new)

Large reasoning models (LRMs) generate chain-of-thought (CoT) traces before producing final outputs, introducing a dynamic internal state that may complicate control mechanisms such as refusal. Unlike in instruction-tuned LLMs, where refusal is mediated by a single directional subspace, refusal in LRMs additionally depends on the CoT. In DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in only 39% of cases when the CoT is kept fixed, but removing the CoT entirely increases this to 70%, indicating that the CoT actively reinforces refusal. In a two-stage intervention where the model regenerates its CoT under activation steering, refusal is reversed in 94% of cases, while the resulting CoT alone retains 48% of this effect even after steering is removed. This suggests that the CoT can carry and reconstruct the compliance signal independently.
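The abstract does not say how the steering direction is obtained or applied. As a rough illustration only, the sketch below assumes the difference-of-means construction that is common in work on refusal directions: the direction is the normalized gap between mean activations on harmful and harmless prompts, and steering adds a multiple of it to an activation (ablation being the special case that cancels the component along it). The function names and the toy 3-dimensional vectors are hypothetical and stand in for real hidden states; none of this is taken from the paper.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def refusal_direction(harmful, harmless):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Assumption: a difference-of-means direction; the abstract does not
    state which construction the authors use.
    """
    diff = [a - b for a, b in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = math.sqrt(dot(diff, diff))
    return [d / norm for d in diff]

def steer(activation, direction, alpha):
    """Shift an activation by alpha along the (unit) direction."""
    return [x + alpha * d for x, d in zip(activation, direction)]

def ablate(activation, direction):
    """Remove the component of the activation along the (unit) direction."""
    return steer(activation, direction, -dot(activation, direction))

# Toy data standing in for hidden states (hypothetical values).
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

r = refusal_direction(harmful, harmless)
x = [5.0, 2.0, -1.0]
x_ablated = ablate(x, r)
```

In a real model this operation would be applied to hidden states at chosen layers during generation. The abstract's finding is that such a single-direction edit is less effective when a fixed CoT is present in the context.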
