LVSA: Training-Free Sparse Attention for Long Video Diffusion 文章

ArXiv CS.CV2026-06-01NEWSen作者: Gael Glorian, Ioannis Lamprou, Zhen Zhang, Yujie Yuan, Hongsheng Liu

摘要

arXiv:2605.31057v1 Announce Type: new Abstract: Dense self-attention is the compute and quality bottleneck of long-video diffusion inference: cost grows quadratically with the sequence length, and beyond the training horizon the model converges to near-static output, that is, "frozen" repetitive video. State of the art approaches are either too costly, e.g., they require retraining, or fail to satisfy both performance and quality objectives in a scalable manner. To this end, we introduce Long Video Sparse Attention (LVSA), a training-free model-agnostic block-sparse attention for video diffusion transformers that combines a structured window pattern with rotating global anchors, thus removing the fixed-grid bias which causes long-range temporal artifacts. LVSA, combined with a FlashInfer kernel, reduces compute up to 3.17x on Wan 2.1 1.3B at a 6x horizon, 2.98x on Wan 2.1 14B at a 6x horizon, and 3.33x on HunyuanVideo 1.5 at a 1.5x horizon, compared to dense attention.

LVSA: Training-Free Sparse Attention for Long Video Diffusion 文章

摘要

相关事件查看全部 (2)

相关公司

相关人物

相关产品查看全部 (6)

相关技术查看全部 (4)