Rethinking Video-Language Model from the Language Input Perspective (Article)

ArXiv CS.CV | 2026-05-28 | NEWS | en
Authors: Xiang Fang, Wanlong Fang, Changshuo Wang, Xiaoye Qu, Daizong Liu

Abstract

arXiv:2605.27920v1 (announce type: new)

Driven by the wave of large language models, Video-Language Models (VLMs) have become a significant yet challenging technology for bridging the gap between videos and texts. Although previous VLM works have made significant progress, almost all of them implicitly assume that every text input is predefined by a specific template. In real-world applications, this strict assumption cannot be satisfied, because 1) predefining all the texts is extremely time-consuming and labor-intensive, and 2) such predefined text inputs are too restrictive and user-unfriendly, which limits their applications. We observe that, for a given video input, texts with similar semantics but different templates lead to varying performance. To this end, we propose a novel plug-and-play framework for various VLM-based methods to fully bridge videos and texts.

Related events

arXiv:2605.27920v1 released
BREAKTHROUGH | Impact: low
