Self-Trained Verification for Training- and Test-Time Self-Improvement

ArXiv CS.CL, 2026-06-02. Authors: Chen Henry Wu, Aditi Raghunathan

Abstract

arXiv:2605.30290v2

Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are gated by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self-training fails similarly when bad self-generated data are added to training. Better verification would unlock both, but the capability we want to train, i.e., catching self-generated errors, lacks training signal. To address this challenge, we propose self-trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself.
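The asymmetry described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `VerifierExample` type, the `build_stv_examples` helper, the prompt template, and the `toy_verifier` stand-in are all hypothetical names introduced here for illustration. The sketch assumes only what the abstract states: the critique is produced with the reference solution in view, and the resulting training example pairs that critique with a prompt that omits the reference.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class VerifierExample:
    prompt: str  # what the trained verifier sees (no reference solution)
    target: str  # critique written by the reference-informed verifier


# Hypothetical signature: (problem, candidate, reference or None) -> critique text.
Verifier = Callable[[str, str, Optional[str]], str]


def build_stv_examples(
    problems: List[str],
    candidates: List[str],
    references: List[str],
    verifier: Verifier,
) -> List[VerifierExample]:
    """Pair reference-free prompts with reference-informed critiques."""
    examples = []
    for problem, candidate, reference in zip(problems, candidates, references):
        # The informed pass sees the reference solution.
        informed_critique = verifier(problem, candidate, reference)
        # The training prompt deliberately leaves the reference out.
        prompt = f"Problem: {problem}\nCandidate solution: {candidate}\nCritique:"
        examples.append(VerifierExample(prompt=prompt, target=informed_critique))
    return examples


def toy_verifier(problem: str, candidate: str, reference: Optional[str] = None) -> str:
    """Toy stand-in for a model: it only catches errors when given the reference."""
    if reference is None:
        return "Looks plausible."
    if candidate.strip() == reference.strip():
        return "Correct."
    return f"Incorrect: expected {reference}, got {candidate}."


examples = build_stv_examples(["2 + 2 = ?"], ["5"], ["4"], toy_verifier)
print(examples[0].prompt)
print(examples[0].target)
```

In this toy setup the uninformed call `toy_verifier("2 + 2 = ?", "5")` misses the error, while the informed critique stored as the target catches it. Fine-tuning on such (prompt, target) pairs is one way to read "imitate a more informed version of itself"; the actual training objective and data filtering are not specified in the abstract.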
