SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

arXiv cs.CL, 2026-06-04. Authors: Yaosheng Fu, Guangxuan Xiao, Xin Dong, Song Han, Oreste Villa

Abstract

arXiv:2606.04511v1. Announce type: new.

Sparse attention reduces compute and memory bandwidth for long-context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe transfer bottleneck; (2) the sparse selection step itself retains $O(T^2)$ complexity and can dominate attention cost at long contexts. We propose SparDA, a decoupled sparse attention architecture that introduces a fourth per-layer projection, the Forecast, alongside Query, Key, and Value. The Forecast predicts the KV blocks needed by the next layer, enabling lookahead selection that overlaps CPU-to-GPU prefetch with current-layer execution. Because Forecast is decoupled from the attention query, our GQA implementation uses one Forecast head per GQA group, reducing selection overhead versus the original multi-head selector. SparDA adds $<$0.
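The lookahead selection described in the abstract can be illustrated with a small sketch. The abstract does not state how the Forecast scores KV blocks, so everything below is an assumption made for illustration: the names `matvec`, `block_summaries`, and `forecast_select` are hypothetical, blocks are summarized by mean-pooling their keys, and scoring is a plain dot product followed by top-k. The function stands in for a single Forecast head (for example, one GQA group); the CPU-to-GPU prefetch overlap itself is not modeled.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def block_summaries(keys, block_size):
    """Mean-pool consecutive keys into one summary vector per KV block.

    Assumption: mean pooling is only one possible block summary; the
    source does not say which summary SparDA uses.
    """
    summaries = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        dim = len(block[0])
        summaries.append(
            [sum(k[d] for k in block) / len(block) for d in range(dim)]
        )
    return summaries


def forecast_select(hidden, W_forecast, next_layer_summaries, top_k):
    """Return the ids of the top_k next-layer KV blocks for one Forecast head.

    hidden:               current-layer hidden state for the newest token
    W_forecast:           hypothetical Forecast projection matrix
    next_layer_summaries: one summary vector per KV block of the NEXT layer
    """
    forecast = matvec(W_forecast, hidden)
    scores = [
        sum(f * s for f, s in zip(forecast, summary))
        for summary in next_layer_summaries
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Sorted block ids are convenient for issuing a contiguous prefetch list.
    return sorted(ranked[:top_k])


# Toy example: 8 cached keys of dimension 2, grouped into 4 blocks of 2.
keys = [[0, 0], [0, 0], [5, 0], [5, 0], [1, 0], [1, 0], [3, 0], [3, 0]]
summaries = block_summaries(keys, block_size=2)
identity = [[1, 0], [0, 1]]
selected = forecast_select([1, 0], identity, summaries, top_k=2)
print(selected)
```

In a real system the returned block ids would drive an asynchronous copy of those blocks from CPU memory while the current layer's attention is still running, which is the overlap the abstract refers to.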
