Distilling LLM Feedback for Lean Theorem Proving · Event

PRODUCT_LAUNCH · 2026-06-01 · Impact: MEDIUM

arXiv:2605.30861v1 · Announce Type: new

Abstract: Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token lev

Distilling LLM Feedback for Lean Theorem Proving · Related Companies

Pon · COMPANY
Vise · COMPANY
arXiv · NONPROFIT
EARN · NONPROFIT
EAT · NONPROFIT
ACT · NONPROFIT
FIND · NONPROFIT
Ratio · RESEARCH_INSTITUTE