Muon in Vision Transformers: Optimizer-Recipe Interactions and Gradient Spectra

ArXiv CS.CV, 2026-05-26. Authors: Ben S. Southworth, Shuai Jiang, Daniel McBride, Eric C. Cyr, Stephen Thomas

Abstract

arXiv:2605.24770v1 (cross-listed). Muon is a recently developed matrix-aware optimizer that has shown strong results in transformer training, but its behavior in vision transformers (ViTs) is not yet well understood. We study Muon for ViT training, largely on ImageNet-100 and Pl@ntNet-300K, comparing against AdamW under standard vision recipes involving mixup, cutmix, label smoothing, random augmentation, and random erasing. Muon consistently outperforms AdamW, with especially large gains in macro top-1 accuracy on the long-tailed Pl@ntNet dataset. These gains are also recipe-dependent: Muon benefits much more than AdamW from heavy data augmentation. To understand this interaction, we analyze the singular-value structure of matrix gradients throughout the ViT. Within Muon training runs, removing heavy data augmentation induces a late-training spectral concentration and mode collapse in gradient matrices, primarily in deep MLP-down blocks.
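To make "spectral concentration" concrete, the sketch below computes the stable rank of a gradient matrix, $\|G\|_F^2 / \|G\|_2^2$, which falls toward 1 as the singular-value mass collects in a single mode. This is an illustrative assumption, not the metric used in the paper: the abstract does not say which summary of the singular values the authors track, and the function name `stable_rank` is hypothetical. The code is dependency-free for clarity; a real analysis would more likely call a library SVD on each layer's gradient.

```python
import math
import random


def stable_rank(matrix, iters=200, seed=0):
    """Return ||G||_F^2 / ||G||_2^2 for a matrix given as a list of rows.

    Hypothetical helper (not from the paper). The squared spectral norm is
    estimated by power iteration on G^T G, so the result is approximate.
    """
    rows, cols = len(matrix), len(matrix[0])
    frob_sq = sum(x * x for row in matrix for x in row)
    if frob_sq == 0.0:
        return 0.0  # an all-zero gradient has no spectrum to summarize

    # Seeded random start: almost surely not orthogonal to the top mode.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(cols)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]

    sigma_sq = 0.0
    for _ in range(iters):
        # u = G v; with ||v|| = 1, ||u||^2 is a Rayleigh-quotient
        # estimate of the largest squared singular value.
        u = [sum(row[j] * v[j] for j in range(cols)) for row in matrix]
        sigma_sq = sum(x * x for x in u)
        # w = G^T u, then renormalize to get the next iterate.
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]

    if sigma_sq == 0.0:
        raise ValueError("power iteration started orthogonal to the row space")
    return frob_sq / sigma_sq


if __name__ == "__main__":
    flat_spectrum = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    collapsed = [[1.0, 2.0], [2.0, 4.0]]  # rank one: a single dominant mode
    print(stable_rank(flat_spectrum), stable_rank(collapsed))
```

Under this measure, a matrix with a flat spectrum (such as the 3x3 identity) has stable rank equal to its dimension, while a rank-one matrix has stable rank 1. Tracking the value per layer over training steps is one way a late-training collapse of the kind described above would show up.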