Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation

arXiv cs.AI | 2026-05-29 | Authors: Soumyadeep Jana, Sagar Nishad, Sanasam Ranbir Singh

Abstract

arXiv:2605.29873v1 (announce type: new)

The Key-Value (KV) cache remains a major bottleneck for deploying Large Language Models (LLMs) in long-generation tasks. Prior work often applies uniform compression to both the prefill and decoding caches, but compressing the prefill cache degrades performance by corrupting critical context. While preserving the prefill cache is essential, decoding-phase compression remains underexplored, and existing methods rely on rigid recency windows or instantaneous attention scores. Our analysis of attention dynamics reveals strong temporal patterns: critical tokens receive sustained attention over long horizons, while local reasoning involves short-lived bursts. Static heuristics fail to capture this behavior, leading to premature eviction of important tokens or retention of stale ones. We propose Moment-KV, a decoding-time KV cache compression method based on momentum-driven temporal attention aggregation.
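The abstract does not give the scoring rule, so the following is only an illustrative sketch of what "momentum-driven temporal attention aggregation" could look like, not the authors' algorithm. It assumes an exponential moving average over per-token attention weights, a zero initial score, and a top-k retention rule applied to decode-phase tokens only. The names `update_scores`, `select_decode_tokens`, `beta`, and `budget` are hypothetical, and the attention values are made up for the demonstration.

```python
from typing import Dict, List


def update_scores(scores: Dict[int, float],
                  attn: Dict[int, float],
                  beta: float = 0.5) -> Dict[int, float]:
    """Assumed momentum rule: s_t = beta * s_{t-1} + (1 - beta) * a_t.

    `scores` maps token position -> running score, `attn` maps token
    position -> attention weight at the current decode step.
    Positions missing from `attn` are treated as receiving zero attention.
    """
    updated = {}
    for pos in set(scores) | set(attn):
        prev = scores.get(pos, 0.0)      # assumption: new tokens start at 0
        updated[pos] = beta * prev + (1.0 - beta) * attn.get(pos, 0.0)
    return updated


def select_decode_tokens(scores: Dict[int, float],
                         prefill_len: int,
                         budget: int) -> List[int]:
    """Return the decode-phase positions to keep, ranked by running score.

    Positions below `prefill_len` are never candidates for eviction,
    mirroring the abstract's point that the prefill cache is preserved.
    """
    decode = [p for p in scores if p >= prefill_len]
    kept = sorted(decode, key=lambda p: scores[p], reverse=True)[:budget]
    return sorted(kept)


# Toy trace: position 2 gets sustained attention, position 3 a single burst.
PREFILL_LEN = 2
steps = [{2: 0.3, 3: 0.9}] + [{2: 0.3, 3: 0.0}] * 4

scores: Dict[int, float] = {}
for attn in steps:
    scores = update_scores(scores, attn)

kept = select_decode_tokens(scores, PREFILL_LEN, budget=1)
print(kept)
```

In this toy trace, a rule based on instantaneous attention at the first step would prefer position 3 (the burst), whereas the running score ends up favoring position 2 (the sustained token). This is the kind of behavior the abstract contrasts with static heuristics.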
