Long-Context Modeling with Dynamic Hierarchical Sparse Attention for Memory-Constrained LLM Inference

ArXiv CS.CL | 2026-05-29 | Authors: Siheng Xiong, Joe Zou, Faramarz Fekri, Yae Jee Cho

Abstract

arXiv:2510.24606v2

The quadratic cost of attention limits the scalability of long-context LLMs, especially under limited hardware memory budgets. Although attention is often sparse, existing static sparse methods cannot adapt to task- or input-dependent variations, and recent dynamic approaches rely on predefined templates or heuristics that may sacrifice generality. We propose Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that predicts attention sparsity online while keeping the LLM backbone frozen. DHSA performs hierarchical routing: it estimates importance at the chunk level and propagates it to token-level interactions, preserving causally important dependencies while enabling efficient sparsification. Across the Needle-in-a-Haystack test, LongBench, and RULER, DHSA maintains near-dense accuracy in highly sparse regimes, achieving 12-20% relative accuracy gains over Block Sparse Attention at comparable prefill cost.
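The chunk-to-token routing idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how chunk importance is predicted, so the sketch assumes mean-pooled chunk representations, scaled dot-product scoring, and a fixed `top_k` budget per query chunk. The function name `hierarchical_sparse_mask` and all of its parameters are hypothetical.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors (assumed chunk summary)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hierarchical_sparse_mask(queries, keys, chunk_size, top_k):
    """Build a causal token-level mask from chunk-level importance scores.

    mask[t][s] is True when query token t may attend to key token s.
    """
    n = len(queries)
    dim = len(queries[0])
    n_chunks = math.ceil(n / chunk_size)
    chunk_of = [t // chunk_size for t in range(n)]

    # Step 1: summarize each chunk (assumption: plain mean pooling).
    q_chunks = [mean_pool(queries[c * chunk_size:(c + 1) * chunk_size])
                for c in range(n_chunks)]
    k_chunks = [mean_pool(keys[c * chunk_size:(c + 1) * chunk_size])
                for c in range(n_chunks)]

    # Step 2: for each query chunk, score only past and current key chunks
    # and keep the top_k highest-scoring ones plus the chunk itself.
    kept = []
    for i in range(n_chunks):
        ranked = sorted(
            range(i + 1),
            key=lambda j: dot(q_chunks[i], k_chunks[j]) / math.sqrt(dim),
            reverse=True,
        )
        kept.append(set(ranked[:top_k]) | {i})

    # Step 3: propagate chunk decisions to tokens, enforcing causality.
    return [
        [s <= t and chunk_of[s] in kept[chunk_of[t]] for s in range(n)]
        for t in range(n)
    ]
```

With `top_k` equal to the number of chunks, the mask reduces to the ordinary dense causal mask. Smaller values limit each query token to at most `(top_k + 1) * chunk_size` keys, which is where the memory savings in such a scheme would come from.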