PEEK: Picking Essential frames via Efficient Knowledge distillation

arXiv cs.CV, 2026-06-01
Authors: Killian Steunou, Anas Filali Razzouki, Khalil Guetari, Mounîm A. El-Yacoubi, Yannis Tevissen

Abstract

arXiv:2605.31029v1

Video-language models can process only a limited number of frames, which makes frame selection a key bottleneck for efficient video captioning. Most captioning pipelines still rely on uniform sampling, which is computationally cheap but agnostic to visual content. Adaptive frame sampling has recently emerged as a promising way to select the most informative frames from a video; however, existing methods remain computationally expensive. We introduce PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a stronger teacher model into a lightweight temporal model that operates only on visual content. On ActivityNet Captions and MSR-VTT, PEEK overall outperforms state-of-the-art methods across all evaluated downstream vision-language models, especially when only one or two frames are selected for captioning, and obtains the best CIDEr for most frame budgets.
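To make the contrast between uniform sampling and score-based selection concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the student architecture or the distillation objective, so everything here is an assumption made for illustration: the function names (`uniform_indices`, `top_k_indices`, `listwise_distillation_loss`) are hypothetical, and the loss is a ListNet-style listwise cross-entropy swapped in as one common way to match a student's ranking to a teacher's.

```python
import math


def uniform_indices(num_frames, k):
    """Content-agnostic baseline: one index at the center of each of k equal segments."""
    if not 0 < k <= num_frames:
        raise ValueError("k must be between 1 and num_frames")
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def top_k_indices(scores, k):
    """Content-aware selection: the k highest-scoring frames, returned in temporal order."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


def listwise_distillation_loss(teacher_scores, student_scores):
    """ASSUMPTION: ListNet-style cross-entropy between the teacher's and the
    student's softmax distributions over frames. The abstract does not state
    which objective PEEK uses."""
    if len(teacher_scores) != len(student_scores):
        raise ValueError("teacher and student must score the same frames")
    p = softmax(teacher_scores)
    log_q = log_softmax(student_scores)
    return -sum(pi * lqi for pi, lqi in zip(p, log_q))


# Hypothetical per-frame relevance scores for an 8-frame clip.
teacher = [0.1, 0.2, 2.5, 0.3, 0.1, 1.8, 0.2, 0.1]
print(uniform_indices(len(teacher), 2))  # [2, 6]
print(top_k_indices(teacher, 2))         # [2, 5]
```

In this toy setup the teacher scores would come from a model that sees the caption, while the student would have to predict similar scores from visual features alone. The loss is smallest when the student's distribution over frames matches the teacher's, which is the property a ranking distillation objective needs.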