Watch, Remember, Reason: Human-View Video Understanding with MLLMs 文章

ArXiv CS.CV2026-06-08NEWSen作者: Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao, Weisong Liu, Yanwei Li, Jason Li, Lingdong Kong, Haochen Wang, Qianyu Zhou, Jiangning Zhang, Guangliang Cheng, Yunhai Tong, Lu Qi, Minghsuan Yang

查看原文 →

关系图谱

详细信息

来源站点: ArXiv CS.CV
作者: Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao, Weisong Liu, Yanwei Li, Jason Li, Lingdong Kong, Haochen Wang, Qianyu Zhou, Jiangning Zhang, Guangliang Cheng, Yunhai Tong, Lu Qi, Minghsuan Yang
文章类型: NEWS
语言: en
发布日期: 2026-06-08

原文

摘要

arXiv:2606.07433v1 Announce Type: new Abstract: Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions.

Watch, Remember, Reason: Human-View Video Understanding with MLLMs 文章

详细信息

摘要

相关事件

相关公司

相关人物

相关产品

相关技术查看全部 (2)