Improving Visual Grounding in Remote Sensing via Cluster-Guided Refinement and Model Ensemble Voting

arXiv cs.CV | 2026-06-02 | Authors: Panav Shah, Geet Sethi, Ashutosh Gandhe

Abstract

arXiv:2606.00556v1 (announce type: new). Visual grounding aims to locate image regions that correspond to natural language descriptions and is a key component of interpretable vision systems. In remote sensing imagery, grounding is particularly challenging due to complex scenes, small objects, and large variations in scale. Relying on a single model is often insufficient to address these diverse challenges. In this work, we propose two grounding pipelines, Sequential Grounding Refinement (SGR) and Cluster-Aware Grounding Refinement (CGR), that combine the complementary strengths of RemoteSAM, a visual grounding model specialized for remote sensing, and SAM3, a powerful general-purpose segmentation model. Our approach first uses RemoteSAM to obtain an initial estimate of object location, which is then refined using SAM3 to produce more accurate and spatially consistent segmentations.
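
The two-stage idea in the abstract (a coarse location estimate followed by a refinement pass) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the initial estimate is handed to the second model, so passing a bounding box as the prompt is an assumption made here for illustration. `coarse_grounder` and `refiner` are hypothetical callables standing in for RemoteSAM and SAM3; neither library's real API is used or implied.

```python
from typing import Callable, Optional, Tuple

import numpy as np

# (x_min, y_min, x_max, y_max) in inclusive pixel coordinates
Box = Tuple[int, int, int, int]


def mask_to_box(mask: np.ndarray) -> Optional[Box]:
    """Return the tight bounding box of a boolean mask, or None if it is empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def sequential_refine(
    image: np.ndarray,
    query: str,
    coarse_grounder: Callable[[np.ndarray, str], np.ndarray],
    refiner: Callable[[np.ndarray, Box], np.ndarray],
) -> np.ndarray:
    """Two-stage grounding sketch: coarse mask -> box prompt -> refined mask.

    Both callables are hypothetical placeholders:
      coarse_grounder(image, query) -> HxW mask (stand-in for a grounding model)
      refiner(image, box)           -> HxW mask (stand-in for a segmentation model)
    """
    coarse = np.asarray(coarse_grounder(image, query)).astype(bool)
    box = mask_to_box(coarse)
    if box is None:
        # Nothing was grounded, so there is no location to refine.
        return coarse
    return np.asarray(refiner(image, box)).astype(bool)
```

Any pair of functions with these signatures can be plugged in, which makes the sketch easy to exercise with toy stand-ins before wiring in real models. Returning the empty coarse mask when nothing is grounded is a design choice of this sketch, not something the abstract specifies.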