Reinforcement Learning from Rich Feedback with Distributional DAgger

arXiv cs.CL | 2026-06-08 | Authors: Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad

Abstract

arXiv:2606.05152v2 (announce type: replace-cross). Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a black-box expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions.
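To make the objective in the abstract concrete, here is a minimal sketch of a forward cross-entropy loss evaluated on states the student itself visits. It is an illustration under stated assumptions, not the authors' implementation. The Markov state (the previous token), the tabular `student_logits`, the `toy_expert` function, and all hyperparameters are hypothetical. The gradient also treats the visited states as fixed, so it does not reproduce the sequence-level gradient the abstract describes.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def toy_expert(state, vocab_size=4):
    """Hypothetical black-box expert: returns only a next-token distribution.

    It puts 0.9 mass on token (state + 1) % vocab_size and spreads the rest.
    """
    probs = np.full(vocab_size, 0.1 / (vocab_size - 1))
    probs[(state + 1) % vocab_size] = 0.9
    return probs


def sample_states(student_logits, horizon, rng, start_state=0):
    """Roll out the *student* and record the states it visits (DAgger-style)."""
    states, state = [], start_state
    for _ in range(horizon):
        states.append(state)
        probs = softmax(student_logits[state])
        state = int(rng.choice(len(probs), p=probs))  # toy: next state = token
    return states


def forward_ce(student_logits, expert_fn, states):
    """Mean over visited states of -sum_a p_E(a|s) * log pi(a|s)."""
    log_pi = np.log(softmax(student_logits[states]))
    p_expert = np.stack([expert_fn(s) for s in states])
    return float(-(p_expert * log_pi).sum(axis=-1).mean())


def forward_ce_grad(student_logits, expert_fn, states):
    """Gradient of forward_ce w.r.t. the logits, with visited states held fixed."""
    grad = np.zeros_like(student_logits)
    for s in states:
        grad[s] += softmax(student_logits[s]) - expert_fn(s)
    return grad / len(states)


def train(steps=200, lr=1.0, horizon=8, vocab_size=4, seed=0):
    """Alternate student rollouts with gradient steps on the expert-labelled states."""
    rng = np.random.default_rng(seed)
    logits = np.zeros((vocab_size, vocab_size))
    for _ in range(steps):
        states = sample_states(logits, horizon, rng)
        logits -= lr * forward_ce_grad(logits, toy_expert, states)
    return logits


if __name__ == "__main__":
    all_states = [0, 1, 2, 3]
    before = forward_ce(np.zeros((4, 4)), toy_expert, all_states)
    after = forward_ce(train(), toy_expert, all_states)
    print(f"forward CE before: {before:.3f}, after: {after:.3f}")
```

The expert is queried only for its distribution at states the student reaches, which is the "local access" the abstract refers to. Nothing here requires the expert's parameters or gradients, which is why a black-box expert suffices in this sketch.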
