OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference

Event: PRODUCT_LAUNCH
Date: 2026-06-01
Impact: MEDIUM

arXiv:2510.07651v2
Announce Type: replace

Abstract: Large language models (LLMs) with extended context windows enable powerful applications but impose significant memory overhead, as caching all key-value (KV) states scales linearly with sequence length and batch size. Existing cache eviction methods address this by exploiting attention sparsity, yet they typically rank tokens heuristically using accumulated attent