Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion

arXiv cs.CV | 2026-05-27 | Authors: Tuna Tuncer, Felix Becker, Thomas Pfeil

Abstract

arXiv:2605.26266v1 (cross-listed). Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. Methods that quantize the KV cache to low bitwidths reduce memory pressure but degrade video quality. We show that a key driver of this degradation is a systematic bias in attention weights: due to the convexity of the exponential in softmax attention, quantization noise inflates the contribution of cached keys, a phenomenon we call the Jensen bias. This effect causes quantized keys to steal attention mass from the unquantized current chunk. We derive a per-attention-score correction that removes this bias in expectation, computed on the fly from the quantization step sizes of the cached keys and the query norm.
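The Jensen bias described above can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes uniform round-to-nearest quantization with a single step size, models the per-component quantization error as roughly uniform with variance step^2/12, and uses a second-order (Gaussian-like) approximation in which the expected inflation of exp(score) is exp(var/2), where var = ||q||^2 * step^2 / (12 * d) is the variance of the score noise. The helper names `quantize` and `jensen_logit_correction` are hypothetical, and the paper's exact correction may differ.

```python
import numpy as np


def quantize(x, step):
    """Uniform round-to-nearest quantization with a fixed step size (assumed scheme)."""
    return np.round(x / step) * step


def jensen_logit_correction(q, step):
    """Half the variance of the score noise q.e / sqrt(d).

    Assumes each error component e_i is roughly uniform on [-step/2, step/2],
    so Var(e_i) = step**2 / 12 and Var(q.e / sqrt(d)) = ||q||^2 * step**2 / (12 * d).
    """
    d = q.shape[-1]
    return np.dot(q, q) * step**2 / (24.0 * d)


rng = np.random.default_rng(0)
d, n_keys, step = 64, 20000, 0.5

q = 4.0 * rng.standard_normal(d)          # one query vector
keys = rng.standard_normal((n_keys, d))   # stand-in for cached keys

# Scaled dot-product scores with exact and with quantized keys.
scores = keys @ q / np.sqrt(d)
scores_quant = quantize(keys, step) @ q / np.sqrt(d)

# Ratio of unnormalized attention weights: exp(noisy score) / exp(exact score).
inflation = np.exp(scores_quant - scores)

# Subtracting the correction from each quantized logit removes the bias in expectation.
corrected = np.exp(scores_quant - jensen_logit_correction(q, step) - scores)

print(f"mean inflation of exp(score): {inflation.mean():.3f}")
print(f"mean after correction:        {corrected.mean():.3f}")
```

The quantization error itself is close to zero-mean, yet the mean of exp(score) over many keys comes out above 1 because the exponential is convex. Subtracting a term that depends only on the step size and the query norm brings it back toward 1. In a real attention layer, the correction would be applied only to the logits of cached (quantized) keys, leaving the current chunk's logits untouched.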