Training-Free Composed Video Retrieval via Visual Representation-Guided Video-LLM Reasoning

arXiv cs.CV | 2026-06-02 | Authors: Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Qingming Huang

Abstract

arXiv:2606.02321v1. Recent advances in large vision-language models have expanded video retrieval from simple text-based search to more flexible scenarios, where users may specify the desired result through both visual examples and textual instructions. In the CVPR 2026 Reason-Aware Composed Video Retrieval Challenge, the system is required to retrieve a target video according to a reference video and a modification instruction. To address this task, we develop Visual Representation-Guided Video-LLM Reasoning for Training-Free Composed Video Retrieval. Our framework first uses frozen DINOv3 models to obtain a compact set of visually relevant candidates, and then applies large vision-language models to evaluate whether each candidate satisfies the modification instruction. A final reasoning-based refinement is further performed on the top candidates to improve the first-ranked prediction. Without training, our system achieves 48.78 Recall@1 and 51.
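The retrieve-then-rerank flow described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the embeddings are toy vectors standing in for frozen visual features, cosine similarity is assumed as the candidate-selection metric, and `score_fn` is a hypothetical stand-in for a vision-language model that rates how well a candidate satisfies the modification instruction. The final reasoning-based refinement step is omitted.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shortlist(ref_emb: Sequence[float],
              gallery: Dict[str, Sequence[float]],
              k: int) -> List[str]:
    """Stage 1 (assumed): keep the k gallery videos most similar to the reference."""
    ranked = sorted(gallery,
                    key=lambda vid: cosine(ref_emb, gallery[vid]),
                    reverse=True)
    return ranked[:k]


def rerank(candidates: List[str],
           instruction: str,
           score_fn: Callable[[str, str], float]) -> List[str]:
    """Stage 2 (assumed): order candidates by a hypothetical instruction-match score."""
    return sorted(candidates,
                  key=lambda vid: score_fn(vid, instruction),
                  reverse=True)


def retrieve(ref_emb: Sequence[float],
             instruction: str,
             gallery: Dict[str, Sequence[float]],
             score_fn: Callable[[str, str], float],
             k: int = 3) -> List[str]:
    """Visual shortlist first, then instruction-aware reranking of that shortlist."""
    return rerank(shortlist(ref_emb, gallery, k), instruction, score_fn)


# Toy data: hypothetical 2-D embeddings and a stub scorer with fixed scores.
toy_gallery = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.8, 0.2]}
stub_scores = {"a": 0.1, "b": 0.7, "c": 1.0, "d": 0.9}


def stub_score_fn(video_id: str, instruction: str) -> float:
    # A real system would query a vision-language model here; this stub ignores the text.
    return stub_scores[video_id]


result = retrieve([1.0, 0.0], "make it night time", toy_gallery, stub_score_fn, k=3)
print(result)
```

In this toy run, video `c` has the highest stub score but never reaches the reranker because it is visually dissimilar to the reference. That is the trade-off of a visual shortlist: the expensive scorer sees fewer candidates, at the cost of missing targets that fall outside the top k.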
