FABSVer: Faster Training and Better Self-Verification for LLM Mathematical Reasoning

arXiv cs.CL, 2026-05-28. Authors: Haihui Pan, Junwei Bao, Hongfei Jiang, Yang Song

Abstract

arXiv:2605.28389v1

While large language models have made significant progress in mathematical reasoning, they remain unreliable at judging the correctness of their own solutions. Existing approaches that equip models with self-verification typically treat solution generation and verification as two separate tasks, leading to substantially increased training time. In this paper, we propose FABSVer, which fuses these two tasks into a single generation pass, dramatically reducing training overhead while jointly optimizing both capabilities. We further identify a convergence bottleneck both theoretically and empirically: as training progresses, the reward reaches a plateau because the policy is constrained by a fixed reference model. To overcome this, we introduce Dynamic Reference Model Update (DRMU), which raises the reward ceiling and enables sustained reward growth.
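The plateau described in the abstract can be illustrated with a toy example. The sketch below is not the paper's algorithm: the abstract does not give the training objective or the DRMU update rule, so the code assumes a standard KL-penalized objective, E_pi[r] - beta * KL(pi || ref), on a three-action bandit, and assumes the reference is simply replaced by the current policy after each round. All names (`kl_regularized_optimum`, `run`, `update_reference`) and the reward values are hypothetical. Under these assumptions, a fixed reference pins the expected reward at a single value, while refreshing the reference lets it keep rising toward the maximum reward.

```python
import math


def kl_regularized_optimum(ref, rewards, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || ref) on a finite action set.

    The optimum is the reference distribution tilted by exp(r / beta).
    """
    weights = [p * math.exp(r / beta) for p, r in zip(ref, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_reward(policy, rewards):
    return sum(p * r for p, r in zip(policy, rewards))


def run(rewards, beta, rounds, update_reference):
    """Solve the KL-regularized problem repeatedly and record the expected reward.

    update_reference=False keeps the initial uniform reference (fixed reference).
    update_reference=True replaces the reference with the latest policy each
    round (an assumed stand-in for a dynamic reference update).
    """
    ref = [1.0 / len(rewards)] * len(rewards)
    history = []
    for _ in range(rounds):
        policy = kl_regularized_optimum(ref, rewards, beta)
        history.append(expected_reward(policy, rewards))
        if update_reference:
            ref = policy
    return history


# Hypothetical rewards for three candidate answers; the best possible reward is 1.0.
rewards = [0.0, 0.5, 1.0]
fixed = run(rewards, beta=1.0, rounds=20, update_reference=False)
dynamic = run(rewards, beta=1.0, rounds=20, update_reference=True)

print(f"fixed reference:   first={fixed[0]:.4f} last={fixed[-1]:.4f}")
print(f"updated reference: first={dynamic[0]:.4f} last={dynamic[-1]:.4f}")
```

With a fixed reference every round solves the same problem, so the reward curve is flat. With the reference refreshed, round k yields a policy proportional to exp(k * r / beta), which concentrates on the highest-reward action as k grows.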
