Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

ArXiv CS.CL | 2026-05-28 | Authors: Zizhuo Fu, Wenxuan Zeng, Runsheng Wang, Meng Li

Abstract

arXiv:2602.01203v3

Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of how these attention mechanisms relate to one another is lacking. In this work, we provide both theoretical and empirical evidence that the sink in Vanilla Attention and in Sink Attention naturally constructs a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work, where only a fixed subset of attention heads contributes to generation. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers.
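One way to read the MoE claim is that the probability mass a head gives to its sink acts like a per-head gate on that head's output. The sketch below illustrates only that arithmetic identity. It assumes the sink is modeled as one extra logit with no value vector, and all names (`sink_attention_head`, `plain_attention_head`, `sink_logit`) are hypothetical; it does not reproduce the paper's formulation or its load balancing loss.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a plain list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def plain_attention_head(scores, values):
    # Standard attention for one query: softmax over key scores, then mix values.
    probs = softmax(list(scores))
    dim = len(values[0])
    return [sum(p * v[d] for p, v in zip(probs, values)) for d in range(dim)]

def sink_attention_head(scores, values, sink_logit):
    # Assumption: the sink is an extra logit in the softmax with no value vector,
    # so its probability mass is simply dropped from the output.
    probs = softmax(list(scores) + [sink_logit])
    sink_prob = probs[-1]
    dim = len(values[0])
    output = [sum(p * v[d] for p, v in zip(probs[:-1], values)) for d in range(dim)]
    return output, sink_prob

# Toy inputs (illustrative values, not from the paper).
scores = [0.3, -1.2, 0.8]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]

out, sink_prob = sink_attention_head(scores, values, sink_logit=2.0)
gate = 1.0 - sink_prob                      # mass left for the real tokens
plain = plain_attention_head(scores, values)

# The sink head's output equals the plain head's output scaled by the gate.
assert all(abs(o - gate * p) < 1e-12 for o, p in zip(out, plain))
```

Under this reading, a head whose sink absorbs almost all of the mass has a gate near zero and contributes little to the layer output, which is consistent with the head collapse the abstract describes.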
