Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning

ArXiv CS.AI | 2026-05-28 | News | en | Authors: Zehao Liu, Yuanpu Cao, Jinghui Chen, Vasant G. Honavar

Abstract

arXiv:2605.27765v1 (cross-listed). Self-Distillation Policy Optimization (SDPO) provides dense token-level credit assignment for reinforcement learning with large language models by using the model's own feedback-conditioned predictions as a self-teacher. GRPO's group-relative advantage naturally concentrates learning on a sweet spot of intermediate-difficulty questions; SDPO's KL-based advantage, by contrast, has no implicit notion of difficulty. We analyze this gap through the lens of GRPO's advantage normalization. Extending the learnability framework to normalized rewards, we show that normalization absorbs the variance term $p(1-p)$, equalizing leading-order learnability across questions and leaving $\sqrt{p(1-p)}$ as the sole residual scaling factor in the per-question gradient.
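To illustrate where a $\sqrt{p(1-p)}$ factor can come from, here is a minimal Python sketch, not the paper's derivation. It assumes binary (0/1) rewards, reads $p$ as a question's pass rate, and uses the population standard deviation with no epsilon term; the helper names `grpo_advantages` and `mean_abs_advantage` are hypothetical. Under those assumptions, the mean absolute group-normalized advantage works out to $2\sqrt{p(1-p)}$, which peaks at intermediate difficulty and vanishes as $p$ approaches 0 or 1.

```python
import math


def grpo_advantages(rewards):
    """Group-relative advantages: (r - mean) / std over one question's rollouts.

    Assumption: population std, no epsilon; a zero-variance group gets zero advantage.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    if std == 0.0:
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


def mean_abs_advantage(p, group_size=1000):
    """Mean |advantage| for a group whose empirical pass rate is p (binary rewards)."""
    k = round(p * group_size)  # number of correct rollouts
    rewards = [1.0] * k + [0.0] * (group_size - k)
    adv = grpo_advantages(rewards)
    return sum(abs(a) for a in adv) / group_size


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        closed_form = 2 * math.sqrt(p * (1 - p))
        print(f"p={p:.1f}  empirical={mean_abs_advantage(p):.4f}  2*sqrt(p(1-p))={closed_form:.4f}")
```

Running the script prints the empirical value next to the closed form for a few pass rates, so the agreement can be checked directly under the stated assumptions.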
