Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains

arXiv cs.CV, 2026-06-02. Authors: Garvin Guo, Donglei Yu, Yu Chen, Xiang Wang, Shuai Li, Xinpei Zhao, Huaxing Liu, Qinghao Wang, Minpeng Liao

Abstract

arXiv:2606.02357v1 (announce type: new)

Tool-augmented multimodal agents show strong benchmark gains, often taken as evidence that agents have learned to use tools. We argue that this interpretation can be premature: a tool-call trace alone does not show whether the tool supplied answer-critical information. We study two representative "thinking with images" agents, Thyme and DeepEyesV2, across real-world understanding, OCR, chart understanding, and mathematical reasoning. Each agent is compared with its Tool-Free counterpart and with a Pure-Text Reasoner trained from the same source pool without tool-calling trajectories. Tool access yields little consistent aggregate improvement, does not reliably reduce generated-token cost, and leaves only a small tool-only solved set: 93% of DeepEyesV2's tool-solved problems and 96% of Thyme's are also solved by at least one non-tool setting.
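
The "tool-only solved set" statistic in the abstract can be read as a set computation over solved problem IDs. The sketch below is one plausible reading, not the authors' evaluation code: the function name `tool_only_breakdown` and the toy problem IDs are hypothetical, and the abstract does not say how "solved" is judged (single sample, pass@k, and so on).

```python
def tool_only_breakdown(tool_solved, non_tool_solved_sets):
    """Split tool-solved problem IDs by whether any non-tool setting also solves them.

    tool_solved: IDs solved by the tool-using agent.
    non_tool_solved_sets: one collection of solved IDs per non-tool setting
        (for example a Tool-Free run and a Pure-Text Reasoner run).
    Returns (overlap_rate, tool_only), where overlap_rate is the fraction of
    tool-solved problems also solved by at least one non-tool setting.
    """
    tool_solved = set(tool_solved)
    # Union over all non-tool settings: solved by at least one of them.
    solved_without_tools = set().union(*non_tool_solved_sets)
    overlap = tool_solved & solved_without_tools
    tool_only = tool_solved - solved_without_tools
    overlap_rate = len(overlap) / len(tool_solved) if tool_solved else 0.0
    return overlap_rate, tool_only


# Toy data (hypothetical problem IDs, not results from the paper).
tool_agent = {1, 2, 3, 4, 5}
tool_free = {1, 2, 3, 9}
pure_text = {3, 4, 8}

rate, tool_only = tool_only_breakdown(tool_agent, [tool_free, pure_text])
print(f"overlap rate: {rate:.0%}, tool-only problems: {sorted(tool_only)}")
```

Under this reading, a reported 93% overlap means only about 7% of the problems the tool-using agent solves are out of reach for every non-tool setting considered.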
