CriticalKV: Optimizing KV Cache Eviction from an Output Perturbation Perspective

Event: PRODUCT_LAUNCH
Date: 2026-05-29
Impact: MEDIUM

arXiv:2502.03805v2
Announce Type: replace

Abstract: Large language models have revolutionized natural language processing but face significant challenges of high storage and runtime costs, due to the transformer architecture's reliance on self-attention, particularly the large KV cache for long-sequence inference. Recent efforts to reduce KV cache size by pruning less critical entries based on attention weights rem
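
The attention-weight pruning that the abstract refers to can be pictured roughly as follows. This is a toy sketch of that baseline idea only, not the CriticalKV method itself; the function name `evict_by_attention`, the `budget` parameter, and the sample weights are hypothetical and chosen purely for illustration.

```python
def evict_by_attention(keys, values, attn_weights, budget):
    """Keep the `budget` cache entries with the largest attention weight.

    Hypothetical helper: a real KV cache holds per-head tensors, and
    eviction policies usually aggregate weights over recent queries.
    """
    if budget >= len(keys):
        # Nothing to evict: the cache already fits the budget.
        return list(keys), list(values), list(range(len(keys)))
    # Rank cache positions by attention weight, highest first.
    ranked = sorted(range(len(keys)), key=lambda i: attn_weights[i], reverse=True)
    # Keep the top `budget` positions, restored to original token order.
    kept = sorted(ranked[:budget])
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Illustrative data: five cached tokens and made-up attention weights.
keys = ["k0", "k1", "k2", "k3", "k4"]
values = ["v0", "v1", "v2", "v3", "v4"]
attn = [0.40, 0.05, 0.30, 0.05, 0.20]

kept_keys, kept_values, kept_idx = evict_by_attention(keys, values, attn, budget=3)
print(kept_idx)  # positions that survive eviction
```

With a budget of three, the two lowest-weight entries are dropped and the survivors keep their original order, which is the behavior a cache needs so that positional information stays consistent.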