MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop

ArXiv CS.AI, 2026-06-02. Authors: Xuancheng Li, Haitao Li, Yujia Zhou, Yiqun Liu, Qingyao Ai

Abstract

arXiv:2601.22900v2. Announce type: replace.

Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning across domains, but outcome-only scalar rewards are often sparse and uninformative. This limitation is especially severe for failed samples, where scalar rewards indicate only that a solution is incorrect without explaining why the reasoning breaks down. In this paper, we leverage richer verbal feedback to guide RLVR on failed samples and convert feedback-induced progress into trainable learning signals. We propose MulFeRL (Multi-turn Feedback-guided Reinforcement Learning), a multi-turn, event-triggered RLVR framework that combines progress induction for feedback-guided regeneration of failed samples, progress credit assignment for learning from verifier-confirmed progress, and structured feedback injection for integrating feedback into the model's reasoning process.
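
The abstract names three components but gives no implementation details, so the following is only a rough sketch of how a multi-turn, event-triggered feedback loop could be wired together. Every name here (feedback_loop, progress_credit, Turn, the prompt template, and the reward-difference credit rule) is a hypothetical illustration, not the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    prompt: str
    response: str
    reward: float  # verifier outcome: 1.0 if correct, 0.0 otherwise


def feedback_loop(
    question: str,
    generate: Callable[[str], str],            # policy model (assumed interface)
    verify: Callable[[str], float],            # verifiable reward (assumed interface)
    give_feedback: Callable[[str, str], str],  # verbal feedback source (assumed)
    max_turns: int = 3,
) -> List[Turn]:
    """Regenerate a failed sample under verbal feedback, up to max_turns attempts."""
    prompt = question
    turns: List[Turn] = []
    for _ in range(max_turns):
        response = generate(prompt)
        reward = verify(response)
        turns.append(Turn(prompt, response, reward))
        if reward > 0:
            # Event-triggered: feedback is requested only after a failure.
            break
        feedback = give_feedback(question, response)
        # Hypothetical injection template: append feedback to the question.
        prompt = f"{question}\n\n[Feedback on previous attempt]\n{feedback}"
    return turns


def progress_credit(turns: List[Turn]) -> List[float]:
    """Assumed credit rule: each turn earns its reward gain over the previous turn."""
    credits: List[float] = []
    previous = 0.0
    for turn in turns:
        credits.append(turn.reward - previous)
        previous = turn.reward
    return credits
```

In this sketch a sample that succeeds on the first attempt never triggers feedback, and a regenerated attempt receives credit only when the verifier confirms an improvement over the prior turn.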
