Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning (Event)

Name: Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning
Start: 2026-06-02

PRODUCT_LAUNCH · 2026-06-02 · Impact: MEDIUM

arXiv:2606.00755v1
Announce Type: new
Abstract: Reinforcement learning from verifiable rewards improves the reasoning ability of large language models, but often suffers from entropy collapse, in which increasingly concentrated policies reduce rollout diversity and useful learning signals. Existing remedies either constrain the RL objective (e.g., entropy regularization) or adjust sampling tem

Artificial Intelligence
