Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining

arXiv cs.CV | 2026-06-01 | Authors: Bo Peng, YuanJie Lyu, PengGang Qin, Tong Xu

Abstract

arXiv:2605.31069v1

Accurately predicting future events is fundamental to content understanding and decision-making across various domains. While prior research has primarily focused on text or short-video scenarios, long-video event prediction, characterized by vast multimodal context and more complex narratives, remains underexplored. Meanwhile, although recent Long-Video Language Models (LVLMs), built on Large Language Models (LLMs) and Vision-Language Models (VLMs), have shown promise in long-video question answering and summarization, they struggle to generalize to event prediction, as they can neither precisely extract event-related details nor perform fine-grained analysis of event development. To address this gap, we propose VISTA, a multi-level event semantics mining framework for long-video event prediction.
