Training-Free Open-Vocabulary Visual Grounding for Remote Sensing Images and Videos

Details

Source site: ArXiv CS.CV
Authors: Ke Li, Di Wang, Yongshan Zhu, Ting Wang, Weiping Ni, Tao Lei, Quan Wang, Xinbo Gao
Article type: NEWS
Language: en
Publication date: 2026-06-16

Abstract

arXiv:2606.16124v1 (announce type: new)

Remote sensing visual grounding (RSVG) aims to localize a referred target in a remote sensing image or video according to a natural language expression. Existing RSVG methods usually rely on task-specific manual annotations, which are costly to collect and inevitably limited in covering the diversity of real-world geospatial scenarios. As a result, they often struggle to generalize to open-vocabulary queries involving novel objects, fine-grained attributes, complex spatial relationships, and functional semantics. In this paper, we propose RSVG-ZeroOV, a training-free framework that leverages frozen generic foundation models for zero-shot open-vocabulary RSVG. RSVG-ZeroOV follows an Overview-Focus-Evolve paradigm, which exploits the distinct yet complementary attention patterns of vision-language models (VLMs) and diffusion models (DMs) to progressively generate precise grounding results.
