Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction 文章

ArXiv CS.AI2026-05-27NEWSen作者: Changyue Jiang, Wenqi Zhang, Xudong Pan, Geng Hong, Min Yang

摘要

arXiv:2505.11063v3 Announce Type: replace Abstract: LLM-based agents solve complex tasks through iterative reasoning, tool use, and environment interaction, where each intermediate thought directly shapes subsequent actions. Small deviations in these thoughts can therefore propagate into unsafe behaviors, yet existing guardrails typically operate only on final outputs or require intrusive model modifications. We introduce Thought-Aligner, a lightweight plug-in safety model that performs causal correction on unsafe thoughts before action execution, without altering the underlying agent. The corrected thoughts are fed back into the agent, steering its decision process and tool use toward safer trajectories. Because it operates solely at the thought level, Thought-Aligner is model-agnostic and can be integrated into diverse agent frameworks. We train Thought-Aligner via two-stage contrastive learning on paired safe and unsafe thoughts generated across ten risk scenarios.

Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction 文章

摘要

相关事件查看全部 (1)

相关公司查看全部 (4)

相关人物

相关产品查看全部 (8)

相关技术查看全部 (27)