iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

arXiv cs.CV | 2026-06-01 | Authors: Chang-Bin Zhang, Yujie Zhong, Qiang Zhang, Kai Han

Abstract

arXiv:2605.31096v1 | Announce Type: new

While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language models (MLLMs), its efficacy during the inference phase remains underexplored. In this work, we empirically find that mandating explicit object boxes in visually grounded CoT during inference often degrades performance compared to standard textual CoT, which reasons without explicit visual grounding. We hypothesize that the visual localization capability can be internalized into the textual CoT and that the mandatory explicit grounding introduces unnecessary interference with the model's primary objective of answer prediction. To address this problem, we propose Internalizing Visually Grounded Reasoning (iVGR), a novel reinforcement learning framework that transfers localization capabilities into the textual reasoning process.