Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

ArXiv CS.CL | 2026-06-04 | NEWS | en | Authors: Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah

Details

Source site: ArXiv CS.CL
Authors: Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah
Article type: NEWS
Language: en
Publication date: 2026-06-04

Original text

Abstract

arXiv:2603.03205v2

Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions.
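The abstract describes the plan, check, then act or refuse loop only at a high level, so the following is a minimal sketch of that control flow rather than MOSAIC's actual implementation. Every name here (`ToolCall`, `Refusal`, `check`, `run_agent`, the toy tools) is hypothetical, the plan is a fixed list rather than model output, and the safety check is a simple blocklist standing in for the learned safety reasoning the paper refers to.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Union


@dataclass
class ToolCall:
    """A planned action: which tool to call and with what argument."""
    tool: str
    argument: str


@dataclass
class Refusal:
    """Refusal modeled as a first-class action, alongside ToolCall."""
    reason: str


Action = Union[ToolCall, Refusal]


def check(step: ToolCall, blocked_tools: Set[str]) -> Action:
    # Assumption: a blocklist lookup stands in for learned safety reasoning.
    if step.tool in blocked_tools:
        return Refusal(reason=f"tool '{step.tool}' is not permitted")
    return step


def run_agent(
    plan: List[ToolCall],
    tools: Dict[str, Callable[[str], str]],
    blocked_tools: Set[str],
) -> List[str]:
    """Check each planned step, then either act on it or refuse and stop."""
    trace: List[str] = []
    for step in plan:
        decision = check(step, blocked_tools)
        if isinstance(decision, Refusal):
            trace.append(f"REFUSE: {decision.reason}")
            break  # stop before any unsafe tool call is executed
        result = tools[decision.tool](decision.argument)
        trace.append(f"ACT: {decision.tool} -> {result}")
    return trace


# Toy tools and a fixed plan, purely for illustration.
tools = {
    "search": lambda query: f"results for {query}",
    "enter_credentials": lambda site: "submitted",
}
plan = [
    ToolCall("search", "weather"),
    ToolCall("enter_credentials", "bank.example"),
    ToolCall("search", "news"),
]
trace = run_agent(plan, tools, blocked_tools={"enter_credentials"})
for line in trace:
    print(line)
```

In this sketch the first step is executed, the second is refused before the tool is ever called, and the third is never reached. Halting on refusal is one possible design choice; an agent could instead replan after refusing a step.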
