Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models

ArXiv CS.AI, 2026-05-29
Authors: Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng, Tong Xu, Yi Zheng, Zhefeng Wang, Enhong Chen

Abstract

arXiv:2605.29303v1

Supervised fine-tuning (SFT) followed by reinforcement learning (RL) has become a standard post-training paradigm for large language models. SFT provides a cold start for RL exploration, avoiding the inefficiency of pure RL, where on-policy sampling yields too few positive samples. In practice, however, existing approaches often use far less data for SFT initialization than for the RL phase, which can cause the model to fit the limited samples and shift away from its pre-trained distribution. This distribution shift impedes the model's ability to explore effectively during subsequent RL training. To address this challenge, we propose that in low-data regimes, SFT should prioritize activating task-relevant capabilities rather than memorizing specific content.
