GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

ArXiv CS.AI | 2026-05-28 | NEWS | en | Author: Fanxu Meng

Abstract

arXiv:2605.15250v2 (announce type: replace-cross). Multi-head Latent Attention (MLA), the attention mechanism used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path, an absorbed MQA form. That single path ties efficient inference to H100-class compute-to-bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA's, and a GQA path with a per-group expanded cache.
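The claim that two decoding paths can be algebraically equivalent over the same parameters can be illustrated with a small NumPy sketch. This is not the paper's implementation: it assumes, for illustration only, that each query group has its own key and value up-projection from a shared latent (the hypothetical `w_uk` and `w_uv`), and it ignores positional encodings such as RoPE. All sizes and names are made up. Under those assumptions, expanding the latent into a per-group cache and folding the up-projections into the queries and outputs give the same result.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_tokens, d_latent, d_head = 5, 8, 4
n_groups, heads_per_group = 2, 3

# Shared latent cache: one d_latent vector per cached token.
latent = rng.standard_normal((n_tokens, d_latent))
# Assumed per-group up-projections from the latent to keys and values.
w_uk = rng.standard_normal((n_groups, d_latent, d_head))
w_uv = rng.standard_normal((n_groups, d_latent, d_head))
# One query per head for the token being decoded.
q = rng.standard_normal((n_groups, heads_per_group, d_head))


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gqa_path(latent, w_uk, w_uv, q):
    # Expand the latent into a per-group key/value cache,
    # then run ordinary grouped-query attention over it.
    k = np.einsum("tc,gcd->gtd", latent, w_uk)
    v = np.einsum("tc,gcd->gtd", latent, w_uv)
    scores = np.einsum("ghd,gtd->ght", q, k) / np.sqrt(d_head)
    return np.einsum("ght,gtd->ghd", softmax(scores), v)


def mqa_absorb_path(latent, w_uk, w_uv, q):
    # Fold the key up-projection into the queries so that every head
    # attends over the single shared latent cache (MQA-style).
    q_abs = np.einsum("ghd,gcd->ghc", q, w_uk)
    scores = np.einsum("ghc,tc->ght", q_abs, latent) / np.sqrt(d_head)
    # Attend in latent space, then apply the value up-projection once.
    ctx = np.einsum("ght,tc->ghc", softmax(scores), latent)
    return np.einsum("ghc,gcd->ghd", ctx, w_uv)


out_gqa = gqa_path(latent, w_uk, w_uv, q)
out_mqa = mqa_absorb_path(latent, w_uk, w_uv, q)
max_diff = float(np.abs(out_gqa - out_mqa).max())
```

In this toy setting the two outputs agree up to floating-point rounding. The trade-off differs between the paths: the GQA path stores a larger per-group cache but does less arithmetic per cached token, while the absorb path keeps only the latent and does more arithmetic per cached token.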

