TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents

arXiv cs.AI | 2026-06-06 | Authors: Chengqi Dong, Chuhuai Yue, Hang He, yandong liu, Fenghe Tang, S Kevin Zhou, Xiaohan Wang, Jiajun Chai, Guojun Yin


Abstract

arXiv:2606.05784v1

We identify and formally characterize credit misassignment as a systematic failure mode of GRPO in tool-augmented multimodal search agents: because GRPO broadcasts the trajectory-level advantage uniformly to all tokens, valuable tool-use steps in failing trajectories are penalized no differently from valueless ones. We further quantify the scale of this phenomenon empirically: over half of failing trajectories and failing tool-use actions exhibit correctable credit misassignment, demonstrating that the wasted training signal is both substantial and structurally exploitable. Building on this insight, we propose Tool-Aware Policy Optimization (TAPO), which exploits the parameter-determinism property of information-acquisition tools: similar call parameters define equivalent information-acquisition actions and should therefore share comparable action credit.
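The contrast between uniform broadcast and parameter-based credit sharing can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not specify TAPO's similarity measure or transfer rule, so the Jaccard similarity, the `threshold` value, the "take the best donor advantage" rule, and the names `grpo_advantages`, `jaccard`, and `step_credits` are all assumptions made for illustration. Population standard deviation is also a simplifying choice here.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantage per trajectory: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def jaccard(a, b):
    """Token-set similarity of two tool-call parameter strings (assumed metric)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def step_credits(trajectories, rewards, threshold=0.5):
    """Return one credit value per tool-use step of every trajectory.

    Baseline (GRPO-style): each step inherits its trajectory's advantage.
    Hypothetical transfer rule: a step in a negative-advantage trajectory
    whose call parameters resemble a call made in a positive-advantage
    trajectory takes that trajectory's advantage instead.
    """
    adv = grpo_advantages(rewards)
    credits = []
    for i, calls in enumerate(trajectories):
        row = []
        for call in calls:
            credit = adv[i]  # uniform broadcast of the trajectory advantage
            if adv[i] < 0:
                donors = [
                    adv[j]
                    for j, other in enumerate(trajectories)
                    if adv[j] > 0
                    and any(jaccard(call, c) >= threshold for c in other)
                ]
                if donors:
                    credit = max(donors)  # assumed rule: best matching donor
            row.append(credit)
        credits.append(row)
    return credits


# Toy group of two rollouts; each tool call is written as "tool_name parameters".
trajectories = [
    ["image_search eiffel tower night", "web_search eiffel tower height"],  # correct answer
    ["web_search eiffel tower height", "web_search paris weather"],         # wrong answer
]
rewards = [1.0, 0.0]
print(step_credits(trajectories, rewards))
```

In this toy group, the failing rollout's first call repeats a call made by the successful rollout, so it receives positive credit, while its unrelated second call keeps the negative trajectory-level advantage. Under plain uniform broadcast both steps would have been penalized equally.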
