ART: Attention Run-time Termination for Efficient Large Language Model Decoding

Event: PRODUCT_LAUNCH | 2026-06-02 | Impact: MEDIUM

arXiv:2606.00024v1
Announce Type: new

Abstract: Long-context decoding in Large Language Models (LLMs) is severely constrained by the memory bandwidth required to fetch the extensive Key-Value (KV) cache. Most existing KV management methods rely on key-only pruning before decoding, despite the evidence that attention outputs depend jointly on keys and values, as incorporating values in their methods incurs prohibitiv
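
The abstract's point that attention outputs depend jointly on keys and values can be illustrated with a toy example. The sketch below is not the paper's ART algorithm, which the truncated abstract does not describe. It only compares a key-only selection rule with a hypothetical value-aware rule (attention weight times value norm) on a hand-made three-token cache. All names, numbers, and the scoring rule are assumptions made for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(q, keys, idx):
    # Scaled dot-product attention weights over the cached tokens in idx.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, keys[i])) for i in idx]
    return softmax(scores)

def attend(q, keys, values, idx):
    # Attention output restricted to the cached tokens in idx.
    weights = attention_weights(q, keys, idx)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) for d in range(dim)]

def top_k(scores, k):
    # Indices of the k largest scores, returned in ascending index order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def l2_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy cache (assumed values): token 2 has the weakest key match
# but by far the largest value vector.
q = [1.0, 0.0]
keys = [[3.0, 0.0], [1.0, 0.0], [0.5, 0.0]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 20.0]]

all_idx = [0, 1, 2]
budget = 2  # number of cached tokens we can afford to fetch

full_output = attend(q, keys, values, all_idx)
full_weights = attention_weights(q, keys, all_idx)

# Key-only rule: keep the tokens with the largest attention weights.
keep_key_only = top_k(full_weights, budget)

# Hypothetical value-aware rule: weight times the L2 norm of the value.
value_norms = [math.sqrt(sum(x * x for x in v)) for v in values]
joint_scores = [w * n for w, n in zip(full_weights, value_norms)]
keep_value_aware = top_k(joint_scores, budget)

err_key_only = l2_error(attend(q, keys, values, keep_key_only), full_output)
err_value_aware = l2_error(attend(q, keys, values, keep_value_aware), full_output)

print("key-only keeps", keep_key_only, "error", round(err_key_only, 3))
print("value-aware keeps", keep_value_aware, "error", round(err_value_aware, 3))
```

On this cache the key-only rule discards the token with the large value vector, so its output drifts further from the full-attention result than the value-aware selection does. The value-aware rule here reads every value to compute its norm, which is the kind of extra cost the abstract says discourages existing methods from incorporating values.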