Weight Decay Improves Language Model Plasticity

ArXiv CS.CL | 2026-06-01 | News | en | Authors: Tessa Han, Sebastian Bordt, Hanlin Zhang, Sham Kakade

Abstract

arXiv:2602.11137v2

Large language models are typically trained in two broad phases: pretraining to produce a base model, followed by further training to improve downstream performance. However, hyperparameter optimization and scaling laws are studied primarily from the perspective of the base model's validation loss, overlooking a crucial model property: downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks upon additional training. We focus on the role of weight decay, a key regularization parameter during pretraining, and show through systematic experiments that larger weight decay increases the plasticity of the pretrained model, resulting in greater performance gains downstream after fine-tuning.
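
For readers unfamiliar with the term, the snippet below is a minimal sketch of what weight decay does during an optimizer step. It is not taken from the paper: the abstract does not state which optimizer or hyperparameter values the authors use, so the plain SGD update with decoupled weight decay, the helper name `sgd_step_decoupled_wd`, and the values of `lr` and `weight_decay` are all assumptions chosen for illustration.

```python
def sgd_step_decoupled_wd(weights, grads, lr, weight_decay):
    """One SGD step with decoupled weight decay (hypothetical helper).

    Each weight moves against its gradient and is additionally shrunk
    toward zero by a factor proportional to lr * weight_decay.
    """
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]


# Illustrative values only (assumptions, not from the paper).
weights = [1.0, -2.0]
zero_grads = [0.0, 0.0]  # zero gradients isolate the decay term

for _ in range(3):
    weights = sgd_step_decoupled_wd(weights, zero_grads, lr=0.1, weight_decay=0.5)

# With zero gradients, each step multiplies every weight by (1 - lr * weight_decay).
print(weights)
```

With the gradient set to zero, the only effect left is the shrinkage term, so a larger `weight_decay` pulls the weights toward zero faster. The abstract's claim concerns how the strength of this term during pretraining relates to later fine-tuning gains; the sketch only shows the mechanism, not that result.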
