One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

ArXiv CS.AI | 2026-05-27 | NEWS | en | Authors: Di He, Songjun Tu, Keyu Wang, Lu Yin, Shiwei Liu

Details

Source site: ArXiv CS.AI
Authors: Di He, Songjun Tu, Keyu Wang, Lu Yin, Shiwei Liu
Article type: NEWS
Language: en
Published: 2026-05-27

Abstract

arXiv:2605.22297v2 (announce type: replace-cross)

Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate across all layers overlooks the structural heterogeneity of Transformers, potentially limiting their effectiveness as the backbone of Large Language Models (LLMs). In this paper, we introduce Layerwise Learning Rate (LLR), an adaptive scheme that assigns distinct learning rates to individual Transformer layers. Our method is grounded in Heavy-Tailed Self-Regularization (HT-SR) theory, which characterizes the empirical spectral density (ESD) of weight correlation matrices to quantify heavy-tailedness. Layers with weaker heavy-tailedness are assigned larger learning rates to accelerate training, while layers with stronger heavy-tailedness receive smaller learning rates.
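The idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how heavy-tailedness is measured or how it is mapped to a learning rate. The sketch assumes a Hill estimator of the power-law exponent alpha of the ESD tail (in HT-SR, a smaller alpha indicates a heavier tail) and a linear min-max mapping from alpha to a learning-rate multiplier. The function names `hill_alpha` and `layerwise_lrs` and the parameters `k_frac`, `s_min`, and `s_max` are hypothetical.

```python
import numpy as np


def hill_alpha(weight, k_frac=0.5):
    """Hill estimate of the power-law exponent of the ESD tail of W^T W.

    Assumption: the top `k_frac` fraction of eigenvalues forms the tail.
    Requires a weight matrix with at least two nonzero singular values.
    """
    # Eigenvalues of W^T W are the squared singular values of W.
    eigs = np.sort(np.linalg.svd(weight, compute_uv=False) ** 2)  # ascending
    n = len(eigs)
    k = min(max(2, int(n * k_frac)), n - 1)  # number of tail eigenvalues
    tail = eigs[n - k:]
    threshold = eigs[n - k - 1]  # largest eigenvalue outside the tail
    return 1.0 + k / np.sum(np.log(tail / threshold))


def layerwise_lrs(weights, base_lr, s_min=0.5, s_max=1.5):
    """Map per-layer alpha to a learning rate in [s_min, s_max] * base_lr.

    Larger alpha (weaker heavy-tailedness) gets a larger learning rate.
    The linear min-max mapping is an illustrative assumption.
    """
    alphas = np.array([hill_alpha(w) for w in weights])
    span = alphas.max() - alphas.min()
    if span == 0:
        # All layers look alike: fall back to the uniform learning rate.
        return alphas, np.full(len(weights), base_lr)
    scale = s_min + (s_max - s_min) * (alphas - alphas.min()) / span
    return alphas, base_lr * scale


# Toy "layers": Gaussian weights and heavier-tailed Student-t weights.
rng = np.random.default_rng(0)
layers = [
    rng.standard_normal((64, 32)),
    rng.standard_t(df=2, size=(64, 32)),
    rng.standard_normal((64, 32)),
]
alphas, lrs = layerwise_lrs(layers, base_lr=1e-3)
for i, (a, lr) in enumerate(zip(alphas, lrs)):
    print(f"layer {i}: alpha={a:.2f}  lr={lr:.2e}")
```

In a training loop, the resulting values could be passed as per-group learning rates to an optimizer that supports parameter groups, and recomputed periodically as the weight spectra change. How often to recompute is another choice the abstract leaves open.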
