Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning

Event: PRODUCT_LAUNCH · 2026-05-28 · Impact: MEDIUM

arXiv:2605.27765v1 · Announce Type: cross

Abstract: Self-Distillation Policy Optimization (SDPO) provides dense token-level credit assignment for reinforcement learning with large language models by leveraging the model's own feedback-conditioned predictions as a self-teacher. Unlike GRPO, however, whose group-relative advantage naturally concentrates learning on a sweet spot of intermediate-difficulty questions, SDP
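The "sweet spot" attributed to GRPO can be illustrated with a small sketch. It assumes binary (pass/fail) rewards, a group of 8 sampled answers per question, and the common group-normalized advantage (reward minus group mean, divided by group standard deviation). The function names, the group size, the `eps` constant, and the use of the population standard deviation are illustrative assumptions, not details from the abstract. The sketch covers only the GRPO side of the comparison; it does not reproduce the paper's pass-rate weighting, which the truncated abstract does not specify.

```python
import math


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps).

    Uses the population standard deviation; eps is an assumed
    stabilizer so that a zero-variance group does not divide by zero.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def mean_abs_advantage(num_correct, group_size):
    """Average |advantage| for a group with `num_correct` passing samples."""
    rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
    advantages = grpo_advantages(rewards)
    return sum(abs(a) for a in advantages) / group_size


GROUP_SIZE = 8  # hypothetical number of sampled answers per question

# Learning signal as a function of how many samples pass (0..8).
signal = {k: mean_abs_advantage(k, GROUP_SIZE) for k in range(GROUP_SIZE + 1)}

for k, value in signal.items():
    print(f"pass rate {k / GROUP_SIZE:.3f}: mean |advantage| = {value:.3f}")
```

Under these assumptions the mean absolute advantage is zero when every sample fails or every sample passes, and it is largest at a pass rate of 0.5. For binary rewards it is approximately 2·sqrt(p(1 − p)) at pass rate p, which is the sense in which the group-relative advantage concentrates updates on intermediate-difficulty questions.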
