Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning
Event: PRODUCT_LAUNCH | 2026-06-02 | Impact: MEDIUM
arXiv:2606.00755v1 Announce Type: new
Abstract: Reinforcement learning from verifiable rewards improves the reasoning ability of large language models, but often suffers from entropy collapse, in which increasingly concentrated policies reduce rollout diversity and useful learning signals. Existing remedies either constrain the RL objective (e.g., entropy regularization) or adjust sampling tem