Rethinking Video-Language Model from the Language Input Perspective

arXiv cs.CV, 2026-05-28
Authors: Xiang Fang, Wanlong Fang, Changshuo Wang, Xiaoye Qu, Daizong Liu

Abstract

arXiv:2605.27920v1

Driven by the wave of large language models, Video-Language Models (VLMs) have become a significant yet challenging technology for bridging the gap between videos and texts. Although previous VLM works have made significant progress, almost all of them implicitly assume that all text inputs are predefined according to a specific template. In real-world applications, such a strict assumption is impossible to satisfy because 1) predefining all the texts is extremely time-consuming and labor-intensive, and 2) these predefined text inputs are too restrictive and user-unfriendly, which limits their applications. We observe that, given a video input, texts with similar semantics but different templates lead to varying performance. To this end, we propose a novel plug-and-play framework for various VLM-based methods to fully bridge videos and texts.
