Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning

arXiv cs.AI, 2026-05-28. Authors: Zehao Liu, Yuanpu Cao, Jinghui Chen, Vasant G. Honavar

Abstract

arXiv:2605.27765v1 (cross-listed). Self-Distillation Policy Optimization (SDPO) provides dense token-level credit assignment for reinforcement learning with large language models by using the model's own feedback-conditioned predictions as a self-teacher. GRPO's group-relative advantage naturally concentrates learning on a sweet spot of intermediate-difficulty questions; SDPO's KL-based advantage, by contrast, has no implicit notion of difficulty. We analyze this gap through the lens of GRPO's advantage normalization. Extending the learnability framework to normalized rewards, we show that normalization absorbs the variance term $p(1-p)$, where $p$ is the per-question pass rate, equalizing leading-order learnability across questions and leaving $\sqrt{p(1-p)}$ as the sole residual scaling factor in the per-question gradient.
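
The role of $\sqrt{p(1-p)}$ can be illustrated with a minimal Python sketch. It assumes binary (0/1) rewards, for which the population standard deviation of a group equals $\sqrt{p(1-p)}$ exactly, and it uses the standard GRPO-style normalization (reward minus group mean, divided by group standard deviation). The function names, the `eps` stabilizer, and the idea of using $\sqrt{p(1-p)}$ directly as a per-question weight are assumptions made for illustration; the abstract does not specify the weighting scheme the authors actually use.

```python
import math

def pass_rate(rewards):
    """Empirical pass rate p of a group of binary (0/1) rewards."""
    return sum(rewards) / len(rewards)

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r - mean) / (std + eps).

    Uses the population standard deviation (an assumption; implementations
    differ). For binary rewards this std equals sqrt(p * (1 - p)).
    """
    p = pass_rate(rewards)
    std = math.sqrt(sum((r - p) ** 2 for r in rewards) / len(rewards))
    return [(r - p) / (std + eps) for r in rewards]

def sweet_spot_weight(p):
    """Hypothetical difficulty weight sqrt(p * (1 - p)).

    It vanishes at p = 0 and p = 1 and is largest at p = 0.5, so
    intermediate-difficulty questions receive the most weight.
    """
    return math.sqrt(p * (1.0 - p))

if __name__ == "__main__":
    group = [1, 0, 0, 0]  # one correct rollout out of four
    p = pass_rate(group)
    print("pass rate:", p)
    print("advantages:", grpo_advantages(group))
    print("weight:", sweet_spot_weight(p))
```

Under these assumptions, a question that every rollout solves or every rollout fails gets zero weight, while a question solved about half the time gets the maximum weight of 0.5. A self-distillation loss with no such factor would treat all three cases alike, which is the gap the abstract describes.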
