MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training

ArXiv CS.CL | 2026-05-27 | Authors: Jiacheng Li, Jianchao Tan, Hongtao Xu, Jiaqi Zhang, Yifan Lu, Yerui Sun, Yuchen Xie, Xunliang Cai


Abstract

arXiv:2605.26842v1 (announce type: cross)

The Muon optimizer has recently offered a promising alternative to AdamW for large language model training, leveraging matrix orthogonalization to produce geometry-aware updates. However, like all first-order methods, Muon can become trapped in sharp local minima. In this work, we present MONA, an optimizer that bridges Muon's orthogonalization framework with curvature-aware acceleration. MONA adds an acceleration term, calculated from the exponential moving average of gradient differences, directly into Muon's gradient processing pipeline. We provide a detailed convergence analysis for MONA, showing that the acceleration term enables escape from sharp minima while preserving Muon's spectral-norm regularization.
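The abstract does not spell out the update rule, so the following is only a rough sketch of the idea it describes, not the authors' algorithm. It assumes the acceleration term is added to a Muon-style momentum buffer before orthogonalization. The function names (`orthogonalize`, `mona_like_step`), the hyperparameters (`beta`, `beta_diff`, `gamma`), and the way the terms are combined are all hypothetical. For simplicity it orthogonalizes with an exact SVD, whereas Muon in practice uses a Newton-Schulz iteration to approximate the same result.

```python
import numpy as np

def orthogonalize(m):
    """Return the semi-orthogonal polar factor U @ Vt of m.

    Exact SVD is used here for clarity; this is a stand-in for the
    Newton-Schulz iteration that Muon uses in practice.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def mona_like_step(w, grad, state, lr=0.02, beta=0.95, beta_diff=0.9, gamma=1.0):
    """One hypothetical MONA-like update for a 2-D weight matrix.

    All hyperparameter names and the combination rule are assumptions
    made for illustration; the abstract does not specify them.
    """
    # On the first step there is no previous gradient, so the difference is zero.
    prev_grad = state.get("prev_grad", grad)

    # Exponential moving average of gradient differences (the acceleration term).
    diff_ema = state.get("diff_ema", np.zeros_like(grad))
    state["diff_ema"] = beta_diff * diff_ema + (1.0 - beta_diff) * (grad - prev_grad)

    # Muon-style momentum buffer.
    momentum = state.get("momentum", np.zeros_like(grad))
    state["momentum"] = beta * momentum + grad

    state["prev_grad"] = grad.copy()

    # Assumed combination: add the acceleration term, then orthogonalize.
    combined = state["momentum"] + gamma * state["diff_ema"]
    return w - lr * orthogonalize(combined)
```

In this sketch the applied update always has all singular values equal to 1 (for a full-rank input), regardless of the acceleration term, which is one way to read the abstract's claim that the spectral-norm behavior of Muon is preserved.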
