SRL-CLIP: Efficient CLIP Video Adaptation via Structured Semantic Role Labels

ArXiv CS.CV · 2026-05-27 · Authors: Darshan Singh, Zeeshan Khan, Makarand Tapaswi

Abstract

arXiv:2401.07669v3 (announce type: replace)

Adapting CLIP for videos has gained popularity due to its semantically rich representations. While CLIP is a good starting point, it typically undergoes post-pretraining (contrastive finetuning) on large video narration or caption datasets (e.g., HowTo100M, WebVid2.5M). However, such narrations or captions often lack the comprehensive information needed to represent a video holistically. Because the learning signal from text is sparse, visual learning is inefficient, and post-pretraining requires millions of samples. In this work, we ask: is it possible to efficiently adapt CLIP for general and holistic video understanding? We use videos labeled with structured and dense Semantic Role Labels (SRLs) that capture actions, people or objects, their attributes, adverbs (manner), and location in a structured format, representing the entire video holistically.
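To make the idea of a structured label concrete, here is a minimal sketch of how one SRL-annotated event might be held in memory and flattened into a text string for a text encoder. The class `SRLEvent`, its field names, and the `to_text` helper are all hypothetical and chosen only to mirror the roles listed in the abstract (action, people or objects, attributes, manner, location). The abstract does not specify the authors' actual annotation schema or serialization, so this should not be read as their format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SRLEvent:
    """One video event described by a verb and its semantic roles.

    Hypothetical schema: the field names are illustrative assumptions,
    not the annotation format used in the paper.
    """
    verb: str                                             # the action
    agent: str = ""                                       # who or what performs it
    attributes: List[str] = field(default_factory=list)   # attributes of the agent
    patient: str = ""                                     # who or what is acted upon
    manner: str = ""                                      # adverb: how the action is done
    location: str = ""                                    # where the action takes place


def to_text(event: SRLEvent) -> str:
    """Flatten an event into one string, skipping roles that are empty."""
    parts = []
    if event.agent:
        # Attributes are placed directly before the agent they describe.
        parts.append(" ".join(event.attributes + [event.agent]))
    parts.append(event.verb)
    for role in (event.patient, event.manner, event.location):
        if role:
            parts.append(role)
    return " ".join(parts)


example = SRLEvent(
    verb="opens",
    agent="woman",
    attributes=["tall"],
    patient="door",
    manner="slowly",
    location="in a kitchen",
)
print(to_text(example))
```

Compared with a short free-form caption, a record like this makes every role explicit, which is one way to read the abstract's claim that SRLs provide a denser text signal per video.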
