GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding
Event: PRODUCT_LAUNCH | 2026-05-28 | Impact: MEDIUM
arXiv:2605.15250v2 | Announce Type: replace-cross
Abstract: Multi-head Latent Attention (MLA), the attention mechanism used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path, an absorbed MQA form, which ties efficient inference to H100-class compute-bandwidth ratios, forfeits tensor par
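
The "absorbed MQA form" mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's or DeepSeek's implementation: it assumes a single decoding query token, omits the decoupled rotary-position component, and uses hypothetical names (`mla_explicit`, `mla_absorbed`, `w_uk`, `w_uv`) chosen for illustration. It only shows that folding the key up-projection into the query, and applying the value up-projection after attention, yields the same output while every head attends over one shared latent cache.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mla_explicit(q, c, w_uk, w_uv):
    """Expand the latent into per-head keys and values, then attend.

    q:    (H, d_h)    one query vector per head
    c:    (T, r)      shared low-rank latent, one row per cached token
    w_uk: (H, r, d_h) per-head key up-projection (hypothetical name)
    w_uv: (H, r, d_h) per-head value up-projection (hypothetical name)
    """
    k = np.einsum("tr,hrd->htd", c, w_uk)  # (H, T, d_h)
    v = np.einsum("tr,hrd->htd", c, w_uv)  # (H, T, d_h)
    scores = np.einsum("hd,htd->ht", q, k) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    return np.einsum("ht,htd->hd", attn, v)


def mla_absorbed(q, c, w_uk, w_uv):
    """Absorbed form: all heads attend directly over the shared latent c."""
    # Fold the key up-projection into the query: (H, d_h) -> (H, r).
    q_lat = np.einsum("hd,hrd->hr", q, w_uk)
    scores = np.einsum("hr,tr->ht", q_lat, c) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    # Weighted sum in latent space, then a single value up-projection per head.
    ctx = np.einsum("ht,tr->hr", attn, c)
    return np.einsum("hr,hrd->hd", ctx, w_uv)


# Illustrative sizes only; they are assumptions, not values from the paper.
rng = np.random.default_rng(0)
H, T, r, d_h = 4, 6, 8, 16
q = rng.standard_normal((H, d_h))
c = rng.standard_normal((T, r))
w_uk = rng.standard_normal((H, r, d_h))
w_uv = rng.standard_normal((H, r, d_h))

out_explicit = mla_explicit(q, c, w_uk, w_uv)
out_absorbed = mla_absorbed(q, c, w_uk, w_uv)
print("max abs difference:", np.abs(out_explicit - out_absorbed).max())
```

In the absorbed variant the cache holds only the `(T, r)` latent and is shared by all heads, which is why it behaves like multi-query attention with a wider effective head dimension.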