Perception First: A Frontier Native-Video Model with Self-Consistency for Implicit Video Question Answering

arXiv cs.CV | 2026-06-02 | Author: Ali Alavi

Abstract

arXiv:2606.01485v1 (announce type: new)

We describe our submission to the VRR Challenge @ CVPR 2026, built on the \emph{ImplicitQA} / \emph{VRR-QA} benchmark~\cite{implicitqa}: multiple-choice video question answering in which answers are deliberately \emph{not} observable in any single frame and must be inferred from spatial layout, motion, depth, viewpoint, causality, and social context across discontinuous frames of creative video. We conduct a systematic, training-free study spanning open-source Video-LMMs (Qwen2.5-VL~\cite{qwen25vl}, Qwen3-VL~\cite{qwen3vl}, InternVL3, Gemma-3, and the RL-tuned video reasoners Video-R1~\cite{videor1} and VideoChat-R1.5~\cite{videochatr15}) and a battery of inference-time strategies (chain-of-thought, question decomposition, describe-then-reason cascades, audio transcripts, spatial state prompting, self-consistency~\cite{selfconsistency}, multi-model ensembling, and category routing).
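Self-consistency, one of the inference-time strategies listed above, is usually implemented for multiple-choice questions by sampling several answers and taking a majority vote. The sketch below illustrates that general idea only; it is not the authors' pipeline. The function name `self_consistent_answer`, the `sample_answer` callable (a stand-in for one stochastic model call), the default of five samples, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Sequence


def self_consistent_answer(
    sample_answer: Callable[[], str],
    n_samples: int = 5,
    choices: Sequence[str] = ("A", "B", "C", "D"),
) -> str:
    """Return the majority answer over n_samples sampled answers.

    sample_answer is a hypothetical stand-in for one stochastic call to a
    video-language model that returns an option letter.
    """
    votes = Counter()
    for _ in range(n_samples):
        # Normalize so that "b " and "B" count as the same vote.
        answer = sample_answer().strip().upper()
        if answer in choices:
            votes[answer] += 1  # answers outside the option set are ignored

    if not votes:
        raise ValueError("no valid answer was sampled")

    # Assumed tie-break: the earliest option in `choices` wins.
    best = max(votes.values())
    return next(c for c in choices if votes[c] == best)
```

In practice `sample_answer` would wrap a model call at nonzero temperature, so that repeated calls can disagree; with deterministic decoding every vote would be identical and the vote would add nothing.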
