ART: Attention Run-time Termination for Efficient Large Language Model Decoding
Event
PRODUCT_LAUNCH · 2026-06-02 · Impact: MEDIUM
arXiv:2606.00024v1 · Announce Type: new
Abstract: Long-context decoding in Large Language Models (LLMs) is severely constrained by the memory bandwidth required to fetch the extensive Key-Value (KV) cache. Most existing KV management methods rely on key-only pruning before decoding, despite the evidence that attention outputs depend jointly on keys and values, as incorporating values in their methods incurs prohibitiv
Related coverage
ART: Attention Run-time Termination for Efficient Large Language Model Decoding
ArXiv CS.CL · 2026-06-02