ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

Event: PRODUCT_LAUNCH · 2026-05-28 · Impact: MEDIUM

arXiv:2605.27959v1 · Announce Type: new

Abstract: Multimodal Large Language Models (MLLMs) have increasingly localized and interleaved visual evidence for deliberative reasoning. Grounding-based approaches typically focus on regions of interest (RoIs) by injecting cropped image patches or RoI-specific features into the reasoning context. However, such designs can weaken holistic scene understanding and inter-object r
