SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference · Event

PRODUCT_LAUNCH · 2026-06-04 · Impact: MEDIUM

arXiv:2606.04511v1 · Announce Type: new

Abstract: Sparse attention reduces compute and memory bandwidth for long-context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe transfer bottleneck; (2) the sparse selection step itself retains $O(T^2)$ complexity and can dominate attention cost at long contexts. We propose Spa
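To make the two challenges in the abstract concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the paper: the function names and every number in the example configuration (layer count, KV heads, head dimension, dtype size, link bandwidth) are hypothetical placeholders chosen only to show how the quantities scale with the sequence length T.

```python
def kv_cache_bytes(seq_len: int, layers: int, kv_heads: int,
                   head_dim: int, dtype_bytes: int) -> int:
    """Bytes needed to hold keys and values for one sequence.

    The leading factor of 2 accounts for storing both K and V.
    Growth is linear in seq_len.
    """
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len


def transfer_seconds(num_bytes: float, bytes_per_second: float) -> float:
    """Idealized time to move num_bytes over a link (ignores latency)."""
    return num_bytes / bytes_per_second


def selection_pairs(seq_len: int) -> int:
    """Query-key pairs scored if every token is compared against itself
    and all earlier tokens (causal). This is T(T+1)/2, i.e. O(T^2)."""
    return seq_len * (seq_len + 1) // 2


if __name__ == "__main__":
    # Hypothetical model configuration, for illustration only.
    cfg = dict(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)
    # Assumed effective host-to-device bandwidth (placeholder value).
    link_bandwidth = 25e9  # bytes per second

    for T in (8_192, 32_768, 131_072):
        size = kv_cache_bytes(T, **cfg)
        print(
            f"T={T:>7}: KV cache {size / 2**30:.2f} GiB, "
            f"full transfer {transfer_seconds(size, link_bandwidth):.3f} s, "
            f"selection pairs {selection_pairs(T):.3e}"
        )
```

Under these assumptions the cache size and transfer time grow linearly with T, while the number of pairs scored by a dense selection pass grows quadratically, which is why the selection step can come to dominate at long contexts.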
