VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning

ArXiv CS.CV | 2026-05-28 | Authors: Minghong Cai, Qiulin Wang, Zongli Ye, Wenze Liu, Quande Liu, Weicai Ye, Xintao Wang, Pengfei Wan, Kun Gai, Xiangyu Yue

Abstract

arXiv:2510.08555v2 (announce type: replace)

Existing controllable video generation methods are typically designed for rigid, task-specific settings, such as first-frame image-to-video, inpainting, or interpolation, treating spatio-temporal control as a set of isolated problems. We formalize a unified task, arbitrary spatio-temporal video completion, where a model generates a coherent video from user-specified patches placed at any spatial location and timestamp. However, realizing such a unified framework within modern latent video diffusion models is non-trivial for two reasons. First, causal video VAEs compress multiple frames into a single latent slot, making frame-level conditioning fundamentally ill-posed. Second, directly feeding sparsely populated, zero-padded video inputs into the VAE leads to severe out-of-distribution artifacts.
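
To make the temporal ambiguity concrete, the following is a minimal sketch of how pixel frames might map to latent slots in a causal video VAE. The layout is an assumption for illustration only and is not taken from the abstract: the first frame is encoded alone, and every subsequent group of `stride` frames shares one latent slot, with `stride = 4` as a hypothetical default. The function names `latent_slot` and `frames_in_slot` are likewise hypothetical.

```python
def latent_slot(frame_idx: int, stride: int = 4) -> int:
    """Return the latent slot holding a given pixel frame.

    Assumed layout (illustrative, not from the paper): frame 0 gets its
    own slot, then each block of `stride` frames shares one slot.
    """
    if frame_idx < 0:
        raise ValueError("frame_idx must be non-negative")
    if frame_idx == 0:
        return 0
    return (frame_idx - 1) // stride + 1


def frames_in_slot(slot: int, stride: int = 4) -> list[int]:
    """Return all pixel-frame indices compressed into one latent slot."""
    if slot < 0:
        raise ValueError("slot must be non-negative")
    if slot == 0:
        return [0]
    start = (slot - 1) * stride + 1
    return list(range(start, start + stride))


if __name__ == "__main__":
    # A condition placed on frame 6 lands in a slot shared with its neighbours,
    # so the slot index alone cannot say which exact frame was meant.
    slot = latent_slot(6)
    print(slot, frames_in_slot(slot))
```

Under this assumed layout, a patch pinned to one frame shares its latent slot with up to `stride - 1` neighbouring frames, which is the sense in which conditioning on a single frame becomes ambiguous at the latent level.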