Training-Free Open-Vocabulary Visual Grounding for Remote Sensing Images and Videos

Source: ArXiv CS.CV | Published: 2026-06-16 | Type: NEWS | Language: en | Authors: Ke Li, Di Wang, Yongshan Zhu, Ting Wang, Weiping Ni, Tao Lei, Quan Wang, Xinbo Gao

Abstract

arXiv:2606.16124v1 (announce type: new)

Remote sensing visual grounding (RSVG) aims to localize a referred target in a remote sensing image or video according to a natural language expression. Existing RSVG methods usually rely on task-specific manual annotations, which are costly to collect and inevitably limited in covering the diversity of real-world geospatial scenarios. As a result, they often struggle to generalize to open-vocabulary queries involving novel objects, fine-grained attributes, complex spatial relationships, and functional semantics. In this paper, we propose RSVG-ZeroOV, a training-free framework that leverages frozen generic foundation models for zero-shot open-vocabulary RSVG. RSVG-ZeroOV follows an Overview-Focus-Evolve paradigm, which exploits the distinct yet complementary attention patterns of vision-language models (VLMs) and diffusion models (DMs) to progressively generate precise grounding results.
