LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

ArXiv cs.CV, 2026-06-02
Authors: Qixin Hu, Shuai Yang, Wei Huang, Song Han, Yukang Chen

Abstract

arXiv:2606.02553v1 (announce type: new)

Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This makes the generation trajectory irreversible: once the active window accumulates appearance errors, subsequent blocks can only condition on the degraded trajectory and drift further. We address this limitation by formulating long video generation as a retrieval-augmented generation (RAG) problem. Rather than relying solely on the recent window, we treat previously generated latents as a dynamic, searchable history. We propose LongLive-RAG, a general retrieval framework for AR video generation. At each new block, LongLive-RAG uses a query embedding to retrieve relevant historical latents.
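The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the query embedding is computed, which similarity measure is used, or how many latents are retrieved. The `LatentHistory` class, the cosine similarity, and the `k` and `window` parameters below are all hypothetical choices made only for illustration.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class LatentHistory:
    """Hypothetical store of (block index, embedding) pairs for generated blocks."""
    entries: list = field(default_factory=list)

    def add(self, block_idx, embedding):
        self.entries.append((block_idx, list(embedding)))

    def retrieve(self, query, k=2, window=1):
        """Return indices of the k most similar blocks outside the recent window."""
        if not self.entries:
            return []
        newest = self.entries[-1][0]
        # Assumption: the last `window` blocks are already visible to
        # sliding-window attention, so only older blocks are searched.
        candidates = [(i, e) for i, e in self.entries if i <= newest - window]
        ranked = sorted(candidates, key=lambda ie: cosine(query, ie[1]), reverse=True)
        return [i for i, _ in ranked[:k]]


# Toy embeddings standing in for per-block latent summaries.
history = LatentHistory()
history.add(0, [1.0, 0.0, 0.0])
history.add(1, [0.9, 0.1, 0.0])
history.add(2, [0.0, 1.0, 0.0])
history.add(3, [0.0, 0.9, 0.4])  # most recent block, inside the window

retrieved = history.retrieve(query=[1.0, 0.05, 0.0], k=2, window=1)
print(retrieved)  # [0, 1]
```

In this toy setup the query resembles the earliest blocks, so they are retrieved even though they have long left the attention window. That is the property the abstract motivates: the model can condition on history from before errors accumulated, rather than only on the recent, possibly degraded, window.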