Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving

ArXiv CS.CL | 2026-06-03 | Authors: Chunan Shi, Yilei Chen, Yilin Chen, Xupeng Miao, Bin Cui

Abstract

arXiv:2606.02964v1. Announce type: cross.

Large Language Model (LLM) inference relies on key-value (KV) caches to avoid redundant attention computation. Approximate KV cache retention techniques reduce memory usage at the cost of model accuracy, whereas lossless approaches evict KV cache blocks from GPU memory and reconstruct them on demand to preserve exact outputs. Existing lossless KV cache management systems base eviction decisions primarily on access frequency or positional heuristics, without considering how different KV cache blocks affect the execution efficiency of GPU attention kernels. In this paper, we propose AsymCache, a computation-latency-aware KV cache management system for LLM inference that explicitly aligns cache residency decisions with GPU attention kernel performance. It includes three key components: Multi-Segment Attention (MSA) for efficient non-contiguous KV context processing, a cache eviction policy that jointly optimizes hit rate and…

The abstract may be incomplete; see the original article for the full text.
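
The abstract names Multi-Segment Attention as a way to process a non-contiguous KV context but, being truncated, does not say how it works. As a rough illustration of why attending over scattered KV segments can still give exact outputs, the sketch below uses the standard log-sum-exp merge of per-segment softmax partials. It is not AsymCache's kernel: the function names (`segment_attention`, `merge_segments`, `multi_segment_attention`) and the single-query, pure-Python setup are assumptions made for this example.

```python
import math

def segment_attention(q, keys, values):
    """Attend one query over one KV segment.

    Returns the partial state (max_score, exp_sum, weighted_value_sum),
    which is enough to merge this segment with others exactly.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)                                  # segment-local max for stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, denom, acc

def merge_segments(partials):
    """Combine per-segment partials into the exact softmax-attention output."""
    m_all = max(m for m, _, _ in partials)
    dim = len(partials[0][2])
    denom = 0.0
    acc = [0.0] * dim
    for m, d, a in partials:
        r = math.exp(m - m_all)                      # rescale to the global max
        denom += r * d
        for i in range(dim):
            acc[i] += r * a[i]
    return [x / denom for x in acc]

def multi_segment_attention(q, segments):
    """segments: list of (keys, values) pairs that need not be contiguous in memory."""
    return merge_segments([segment_attention(q, k, v) for k, v in segments])
```

Because the merge is exact up to floating-point rounding, splitting one context into several segments should give the same result as attending over it in one piece, which is the property a lossless cache manager relies on.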
