Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction
Event: PRODUCT_LAUNCH | 2026-06-05 | Impact: MEDIUM
arXiv:2606.05769v1
Announce Type: new
Abstract: Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video MLLMs usually verbalize intermediate future reasoning in text space: once visual evidence is verbalized, fine-grained motion, geometry, and interaction cues can be lost, leading to plausible but visually ungrounded hallucinations. We int