PEEK: Picking Essential frames via Efficient Knowledge distillation

ArXiv CS.CV | 2026-06-01 | Authors: Killian Steunou, Anas Filali Razzouki, Khalil Guetari, Mounîm A. El-Yacoubi, Yannis Tevissen

Abstract

arXiv:2605.31029v1. Announce type: new.

Video-language models can process only a limited number of frames, which makes frame selection a key bottleneck for efficient video captioning. Most captioning pipelines still rely on uniform sampling, which is computationally cheap but agnostic to visual content. Adaptive frame sampling has recently emerged as a promising way to select the most informative frames from a video; however, existing methods remain computationally expensive. We introduce PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a stronger teacher model into a lightweight temporal model that operates only on visual content. On ActivityNet Captions and MSR-VTT, our method overall outperforms state-of-the-art methods across all evaluated downstream vision-language models, especially when only one or two frames are selected for captioning, and it obtains the best CIDEr for most frame budgets.
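To make the contrast in the abstract concrete, here is a minimal Python sketch of content-agnostic uniform sampling next to score-based top-k frame selection, together with one possible way to distill a teacher's frame ranking into a student scorer. The abstract does not specify PEEK's loss, architecture, or selection rule, so everything here is an assumption for illustration: the function names are hypothetical, and the ListNet-style listwise cross-entropy is a stand-in technique, not the authors' method.

```python
import math


def uniform_indices(num_frames, budget):
    """Content-agnostic baseline: one frame from the middle of each equal segment."""
    if num_frames <= 0 or budget <= 0:
        return []
    budget = min(budget, num_frames)
    return [int((i + 0.5) * num_frames / budget) for i in range(budget)]


def topk_indices(scores, budget):
    """Pick the `budget` highest-scoring frames, returned in temporal order."""
    budget = max(0, min(budget, len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]


def listwise_distill_loss(teacher_scores, student_scores):
    """Assumed ListNet-style loss: cross-entropy between the softmax of the
    teacher's per-frame relevance scores and the softmax of the student's."""
    p = [math.exp(lp) for lp in log_softmax(teacher_scores)]
    log_q = log_softmax(student_scores)
    return -sum(pi * lqi for pi, lqi in zip(p, log_q))


if __name__ == "__main__":
    # Hypothetical per-frame relevance scores for an 8-frame clip.
    teacher = [0.1, 0.2, 2.5, 0.3, 0.1, 1.9, 0.2, 0.1]
    print("uniform:", uniform_indices(len(teacher), 2))
    print("top-k:  ", topk_indices(teacher, 2))
    print("loss vs. flat student:", listwise_distill_loss(teacher, [0.0] * 8))
```

In this toy setup the teacher scores would come from a caption-conditioned model at training time, while the student would see only visual features; at inference only `topk_indices` over student scores would be needed. The loss is smallest when the student's score distribution matches the teacher's, which is the sense in which a ranking is "distilled" here.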
