An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

arXiv cs.CV, 2026-06-02
Authors: Bingyu Li, Da Zhang, Tao Huo, Zhiyuan Zhao, Junyu Gao, Xuelong Li

Abstract

arXiv:2606.00987v1 (announce type: new)

Large Vision-Language Models (LVLMs) have shown strong visual understanding and language-guided grounding abilities, yet their capacity for multi-temporal visual reasoning remains underexplored. To bridge this gap, we introduce Multi-temporal Referring Segmentation (MTRS), a new task that aims to segment language-described temporal changes from multi-temporal images. MTRS extends conventional referring segmentation and change detection by jointly requiring temporal correspondence reasoning, language grounding, and pixel-level mask prediction. We propose CRAFT-Agent, an automated data construction pipeline with human auditing, and build MTRefSeg-21K, the first MTRS benchmark, containing 21K high-quality multi-temporal image-text-mask triplets across diverse scenes, viewpoints, and domains.
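To make the task's inputs and outputs concrete, the following is a minimal sketch of what one image-text-mask triplet might look like, together with a mask-overlap score. Everything here is an assumption for illustration: the `MTRSSample` class, its field names, and the `mask_iou` helper are hypothetical and are not taken from the MTRefSeg-21K release, and the abstract does not state which evaluation metric the benchmark uses. Intersection-over-union is shown only because it is a common way to compare segmentation masks.

```python
from dataclasses import dataclass
from typing import List

# A binary mask stored as rows of 0/1 values (hypothetical representation).
Mask = List[List[int]]


@dataclass
class MTRSSample:
    """Hypothetical container for one multi-temporal image-text-mask triplet."""
    image_paths: List[str]  # images of the same scene at different times
    expression: str         # language description of the temporal change
    mask: Mask              # pixel-level mask of the described change

    def __post_init__(self) -> None:
        # "Multi-temporal" implies at least two time points (assumption).
        if len(self.image_paths) < 2:
            raise ValueError("a multi-temporal sample needs at least two images")
        if not self.expression.strip():
            raise ValueError("the referring expression must not be empty")


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two binary masks of identical shape."""
    same_rows = len(pred) == len(target)
    if not same_rows or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("masks must have the same shape")
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; treat that case as IoU = 1.0.
    return inter / union if union else 1.0


# Illustrative values only; the file names and expression are made up.
sample = MTRSSample(
    image_paths=["scene_t1.png", "scene_t2.png"],
    expression="the building that appeared on the empty lot",
    mask=[[0, 1], [1, 1]],
)
prediction = [[0, 1], [0, 1]]
score = mask_iou(prediction, sample.mask)
```

A model for this task would take `image_paths` and `expression` as input and produce a mask like `prediction`. The score then measures how closely that mask matches the annotated change region.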
