Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning · Event

PRODUCT_LAUNCH · 2026-06-02 · Impact: MEDIUM

arXiv:2606.00755v1 · Announce Type: new

Abstract: Reinforcement learning from verifiable rewards improves the reasoning ability of large language models, but often suffers from entropy collapse, in which increasingly concentrated policies reduce rollout diversity and useful learning signals. Existing remedies either constrain the RL objective (e.g., entropy regularization) or adjust sampling tem
