VideoBrain: Learning Adaptive Frame Sampling for Long Video Understanding

ArXiv CS.CV | 2026-06-02 | Authors: Junbo Zou, Ziheng Huang, Shengjie Zhang, Liwen Zhang, Weining Shen

Abstract

arXiv:2602.04094v2 (announce type: replace)

Long-form video understanding remains challenging for Vision-Language Models (VLMs) due to the inherent tension between computational constraints and the need to capture information distributed across thousands of frames. Existing approaches either sample frames uniformly (risking information loss) or select keyframes in a single pass (with no recovery from poor choices). We propose VideoBrain, an end-to-end framework that enables VLMs to adaptively acquire visual information through learned sampling policies. Our approach features dual complementary agents: a CLIP-based agent for semantic retrieval across the video and a Uniform agent for dense temporal sampling within intervals. Unlike prior agent-based methods that rely on text-only LLMs orchestrating visual tools, our VLM directly perceives frames and reasons about information sufficiency.
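The two sampling agents named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no interfaces, so the function names (`uniform_sample`, `semantic_retrieve`), the use of cosine similarity with top-k selection, and the assumption that frame and query embeddings are precomputed by a CLIP-style encoder are all hypothetical. The learned policy that decides which agent to call, and when enough frames have been gathered, is not modeled here.

```python
import math
from typing import List, Sequence


def uniform_sample(start: int, end: int, n: int) -> List[int]:
    """Pick up to n evenly spaced frame indices from the inclusive interval [start, end]."""
    if n <= 0 or end < start:
        return []
    if n == 1:
        return [(start + end) // 2]
    step = (end - start) / (n - 1)
    # dict.fromkeys removes duplicates (possible on short intervals) and keeps order.
    return list(dict.fromkeys(round(start + i * step) for i in range(n)))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_retrieve(
    query_emb: Sequence[float],
    frame_embs: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return the indices of the k frames most similar to the query, in temporal order.

    Assumption: embeddings come from a CLIP-style encoder, computed elsewhere.
    """
    scores = [cosine(query_emb, f) for f in frame_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real CLIP features.
    frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    print(semantic_retrieve([1.0, 0.0], frames, k=2))
    print(uniform_sample(0, 100, 5))
```

In a loop of the kind the abstract describes, a VLM might first call the semantic agent to locate candidate moments across the whole video, then call the uniform agent to sample densely inside an interval around them, repeating until it judges the gathered frames sufficient.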
