Distilling LLM Feedback for Lean Theorem Proving (Event)
PRODUCT_LAUNCH | 2026-06-01 | Impact: MEDIUM
arXiv:2605.30861v1. Announce Type: new.
Abstract: Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building on recent work on self-distillation, we propose Feedback Distillation, a training method in which the model is trained to match, at the token lev