ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

ArXiv CS.CV | 2026-05-29 | Authors: Guannan Lv, Ren Nie, Hongjian Dou, Tingting Gao

Abstract

arXiv:2605.27959v2

Multimodal Large Language Models (MLLMs) have increasingly localized and interleaved visual evidence for deliberative reasoning. Grounding-based approaches typically focus on regions of interest (RoIs) by injecting cropped image patches or RoI-specific features into the reasoning context. However, such designs can weaken holistic scene understanding and inter-object relations, while incurring decoding costs that scale with the number and size of RoIs. Alternatively, adaptive visual feature selection often requires fine-grained supervision or complex heuristics. To address these limitations, we propose ROVER (Routing Object-centric Visual Evidence for grounded multi-image Reasoning), a lightweight, learnable plugin for efficient global visual evidence routing.
