Weight Decay Improves Language Model Plasticity

ArXiv cs.CL | 2026-06-01 | Authors: Tessa Han, Sebastian Bordt, Hanlin Zhang, Sham Kakade

Abstract

arXiv:2602.11137v2

Large language models are typically trained in two broad phases: pretraining to produce a base model, followed by further training to improve downstream performance. However, hyperparameter optimization and scaling laws are studied primarily from the perspective of the base model's validation loss, overlooking a crucial model property: downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks upon additional training. We focus on the role of weight decay, a key regularization parameter during pretraining, and show through systematic experiments that larger weight decay increases the plasticity of the pretrained model, resulting in greater performance gains downstream after fine-tuning.
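For readers unfamiliar with the parameter in question, the following is a minimal sketch of what a weight decay coefficient does in a single optimizer step. It assumes plain SGD with decoupled weight decay; the abstract does not say which optimizer or decay formulation the authors use, and the function name `sgd_step_with_weight_decay` and all numeric values are hypothetical, chosen only for illustration.

```python
def sgd_step_with_weight_decay(weights, grads, lr, weight_decay):
    """One SGD step with decoupled weight decay (illustrative assumption,
    not the paper's training setup).

    Each weight is shrunk toward zero by the factor (1 - lr * weight_decay)
    and then moved along the negative gradient.
    """
    return [
        w * (1.0 - lr * weight_decay) - lr * g
        for w, g in zip(weights, grads)
    ]


# With zero gradients only the decay term acts, so the weights
# shrink geometrically toward zero from step to step.
weights = [1.0, -2.0]
for _ in range(3):
    weights = sgd_step_with_weight_decay(
        weights, grads=[0.0, 0.0], lr=0.1, weight_decay=0.5
    )
```

A larger `weight_decay` value makes the shrink factor smaller and so pulls the weights toward zero more strongly at every step; this coefficient is the pretraining hyperparameter whose effect on downstream adaptability the abstract describes.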
