Spectral Scaling Laws of Muon

ArXiv CS.AI | 2026-06-04 | Authors: Gagik Magakyan, Pablo Parrilo, Asuman Ozdaglar

Abstract

arXiv:2606.04058v1 (announce type: cross)

Orthonormalized update rules have rapidly become a leading choice of optimizer for training large language models, with recent open-source state-of-the-art models adopting Muon. To keep these updates tractable, Muon performs the orthonormalization with the Newton-Schulz (NS) iteration. Since NS is only approximate, directions with small singular values fail to be orthonormalized. In Muon, NS is applied to the momentum matrix at every step, yet little is known about how the singular value spectrum of these momentum matrices behaves during training, or how that behavior changes with model size. We present the first systematic study of this question. Tracking singular value quantiles of the momentum buffer across layers in models ranging from 77M to 2.8B parameters, we observe a consistent picture: after a short burn-in, the quantiles stabilize at a value determined by the layer type and model size.
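The claim that a truncated NS iteration leaves small singular directions un-orthonormalized can be illustrated with a minimal sketch. Note the assumptions: this uses the classic cubic Newton-Schulz update X <- 1.5 X - 0.5 X X^T X, not Muon's own tuned polynomial; the 8x8 matrix with hand-picked singular values is a hypothetical stand-in for a momentum buffer; and the names `newton_schulz` and `singular_value_quantiles` are illustrative, not taken from the paper.

```python
import numpy as np

def newton_schulz(m, steps=5):
    """Classic cubic Newton-Schulz iteration (illustrative; not Muon's tuned variant)."""
    # Frobenius scaling puts every singular value in (0, 1], where the iteration converges.
    x = m / np.linalg.norm(m)
    for _ in range(steps):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3.
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def singular_value_quantiles(m, qs=(0.1, 0.5, 0.9)):
    """Quantiles of the singular value spectrum of m."""
    s = np.linalg.svd(m, compute_uv=False)
    return np.quantile(s, qs)

rng = np.random.default_rng(0)
# Hypothetical "momentum buffer" with a prescribed spectrum, including two tiny values.
u, _ = np.linalg.qr(rng.standard_normal((8, 8)))
v, _ = np.linalg.qr(rng.standard_normal((8, 8)))
sigma = np.array([1.0, 0.8, 0.6, 0.4, 0.2, 0.1, 1e-3, 1e-4])
momentum = u @ np.diag(sigma) @ v.T

out = newton_schulz(momentum, steps=5)
s_out = np.linalg.svd(out, compute_uv=False)  # sorted in descending order

print("input quantiles: ", singular_value_quantiles(momentum))
print("output quantiles:", singular_value_quantiles(out))
print("output spectrum: ", s_out)
```

In this sketch the largest singular values are pushed close to 1 after five steps, while the smallest ones grow by only about a factor of 1.5 per step and stay far below 1. That is the failure mode the abstract describes, and it is why the position of the spectrum's quantiles matters for how much of the update is actually orthonormalized.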
