VRAG: Learning World Models for Interactive Video Generation

ArXiv CS.CV | 2026-05-29 | Authors: Taiye Chen, Xun Hu, Zihan Ding, Chi Jin

Abstract

arXiv:2505.21996v4 (announce type: replace)

Foundational world models must be interactive and must preserve spatiotemporal coherence to support effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities because of two main challenges: compounding errors and insufficient memory mechanisms. We enhance image-to-video models with interactive capabilities through additional action conditioning and an autoregressive framework, and reveal that compounding error is inherently irreducible in autoregressive video generation, while an insufficient memory mechanism leads to incoherence in world models. We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors and increases the spatiotemporal consistency of world models.
