vAttention: Verified Sparse Attention

ArXiv cs.AI | 2026-05-26 | Authors: Aditya Desai, Kumar Krishna Agrawal, Shuo Yang, Alejandro Cuadron, Luis Gaspar Schroeder, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica

Abstract

arXiv:2510.05688v2 (announce type: replace-cross)

State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based estimation. However, these approaches are fundamentally limited in their ability to approximate full attention: they fail to provide consistent approximations across heads and query vectors and, most critically, lack guarantees on approximation quality, limiting their practical deployment. We observe that top-$k$ and random sampling are complementary: top-$k$ performs well when attention scores are dominated by a few tokens, whereas random sampling provides better estimates when attention scores are relatively uniform.
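The complementarity described in the abstract can be illustrated with a toy experiment. The sketch below is not the vAttention algorithm; it is a minimal illustration under assumptions that are not in the source: a single query, scalar values, and a synthetic setup in which values correlate with scores. All names (`attend`, `topk_error`, `sampling_error`) and all constants are hypothetical. It compares how far exact top-$k$ and uniform random sampling (with a self-normalized softmax over the sampled tokens) land from full attention in a near-uniform regime and in a regime dominated by one token.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, values, idx):
    """Attention output (scalar values) restricted to the token subset idx."""
    weights = softmax([scores[i] for i in idx])
    return sum(w * values[i] for w, i in zip(weights, idx))


def topk_error(scores, values, k):
    """Absolute error of attending only to the k highest-scoring tokens."""
    full = attend(scores, values, range(len(scores)))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return abs(attend(scores, values, top) - full)


def sampling_error(scores, values, k, rng, trials=200):
    """Mean absolute error of attending to k uniformly sampled tokens."""
    full = attend(scores, values, range(len(scores)))
    errs = []
    for _ in range(trials):
        idx = rng.sample(range(len(scores)), k)
        errs.append(abs(attend(scores, values, idx) - full))
    return sum(errs) / trials


rng = random.Random(0)
n, k = 1024, 64  # hypothetical context length and token budget

# Near-uniform regime: scores differ only slightly. Toy assumption: values
# are proportional to scores, so the top-k subset is not representative.
flat_scores = [rng.uniform(0.0, 0.1) for _ in range(n)]
flat_values = [10.0 * s for s in flat_scores]

# Peaked regime: one token (index chosen arbitrarily) dominates the softmax.
peaked_scores = list(flat_scores)
peaked_values = list(flat_values)
peaked_scores[100] = 12.0
peaked_values[100] = 5.0

flat_topk = topk_error(flat_scores, flat_values, k)
flat_sampled = sampling_error(flat_scores, flat_values, k, rng)
peaked_topk = topk_error(peaked_scores, peaked_values, k)
peaked_sampled = sampling_error(peaked_scores, peaked_values, k, rng)

print(f"near-uniform: top-k error={flat_topk:.4f}, sampling error={flat_sampled:.4f}")
print(f"peaked:       top-k error={peaked_topk:.4f}, sampling error={peaked_sampled:.4f}")
```

In this toy setup, sampling has the smaller error in the near-uniform regime, because a random subset is representative of the whole context. Top-$k$ has the smaller error in the peaked regime, because a random subset usually misses the dominant token. The size of the gap depends on the synthetic data and should not be read as a result from the paper.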
