UNIQUE: Universal Top-k Sparse Attention for Training-free Inference and Sparsity-aware Training

ArXiv CS.CL | 2026-05-28 | Authors: Keqi Deng, Shaoshi Ling, Ruchao Fan, Jinyu Li

Abstract

arXiv:2605.27740v1

Long-context inference in large language models (LLMs) is bottlenecked by the linear growth of the self-attention key-value (KV) cache. Top-k sparse attention alleviates this by loading only a small fraction of the KV cache, but accurately and cheaply estimating cache importance, for both training-free use and sparsity-aware training, remains challenging. This paper proposes UNIQUE, a universal top-k sparse attention framework that addresses both requirements and stays consistently effective across LLM modalities. UNIQUE operates at the granularity of KV pages and estimates per-page importance with a simple yet accurate score combining the mean of the page's keys as a representative vector with their standard deviation as an offset term.
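The page-level scoring described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the combination used here (the query's dot product with the page's mean key, plus the per-dimension standard deviation weighted by the query's absolute values) is an assumption made for illustration, not the authors' definition. The function names, the `alpha` weight, and the fixed `page_size` are likewise hypothetical.

```python
import math


def page_stats(keys):
    """Return the per-dimension mean and population std of one page of keys."""
    n = len(keys)
    dim = len(keys[0])
    mean = [sum(k[d] for k in keys) / n for d in range(dim)]
    std = [
        math.sqrt(sum((k[d] - mean[d]) ** 2 for k in keys) / n)
        for d in range(dim)
    ]
    return mean, std


def page_score(query, mean, std, alpha=1.0):
    """Score one page for a query.

    ASSUMPTION: score = q . mean + alpha * sum_d |q_d| * std_d.
    The abstract only says the mean acts as a representative vector and
    the standard deviation as an offset term; this form is illustrative.
    """
    representative = sum(q * m for q, m in zip(query, mean))
    offset = alpha * sum(abs(q) * s for q, s in zip(query, std))
    return representative + offset


def select_top_k_pages(query, keys, page_size, k, alpha=1.0):
    """Split keys into fixed-size pages and return the k best page indices."""
    pages = [keys[i:i + page_size] for i in range(0, len(keys), page_size)]
    scores = [page_score(query, *page_stats(p), alpha) for p in pages]
    ranked = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    # Return selected page indices in cache order, plus all scores.
    return sorted(ranked[:k]), scores


if __name__ == "__main__":
    keys = [
        [1.0, 0.0], [1.0, 0.0],    # page 0: aligned with the query
        [0.0, 1.0], [0.0, 1.0],    # page 1: orthogonal to the query
        [-1.0, 0.0], [-1.0, 0.0],  # page 2: opposed to the query
    ]
    selected, scores = select_top_k_pages([1.0, 0.0], keys, page_size=2, k=1)
    print(selected, scores)
```

Under this assumed form, only the mean and standard deviation of each page need to be read to rank pages, and full keys and values are loaded only for the selected pages. The offset term raises the score of pages whose keys are spread out, since such a page may contain a key that matches the query well even when its mean does not.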
