Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization

arXiv cs.CL | 2026-05-29 | Authors: Shiping Gao, Hongzhan Chen, Xiaojun Quan, Qifan Wang, Lifu Huang

Abstract

arXiv:2604.13197v2

Process reward models (PRMs) provide fine-grained supervision for reasoning, but reliable PRMs often require step annotations or heavy verification pipelines, making them costly to scale and refresh during online RL. Implicit PRMs reduce this cost by training log-likelihood-ratio rewards from trajectory-level outcome labels. However, the log-ratio is constrained only as a sequence-level aggregate during training, while inference decomposes it into token- or step-level scores for partial prefixes. This train-inference mismatch leaves local credits weakly identified, so distribution-wide scoring can amplify misleading advantages. We propose the Implicit Prefix-Value Reward Model (IPVRM), which directly learns the probability of eventual correctness for each prefix from outcome labels.
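The train-inference mismatch described in the abstract can be illustrated with a toy sketch. It assumes the common implicit-PRM form in which the reward is a scaled sum of per-token log-likelihood ratios between a policy and a reference model; the abstract does not spell out this formula, so the scaling constant `BETA`, the function names, and all log-probability values below are hypothetical and chosen only for illustration. This is not the IPVRM method itself.

```python
from itertools import accumulate

BETA = 1.0  # hypothetical scaling coefficient, not taken from the paper


def token_log_ratios(policy_logps, ref_logps):
    """Per-token log-likelihood ratio: log pi(y_t | y_<t) - log pi_ref(y_t | y_<t)."""
    return [p - r for p, r in zip(policy_logps, ref_logps)]


def sequence_reward(log_ratios, beta=BETA):
    """Sequence-level aggregate: the only quantity tied to the outcome label in training."""
    return beta * sum(log_ratios)


def prefix_scores(log_ratios, beta=BETA):
    """Running sums over prefixes: the partial-prefix scores read off at inference."""
    return [beta * s for s in accumulate(log_ratios)]


# Two made-up trajectories (assumed values) whose per-token ratios differ
# but whose totals agree.
ratios_a = token_log_ratios([-1.0, -0.5, -0.25], [-1.5, -0.75, -0.5])
ratios_b = token_log_ratios([-2.0, -1.0, -0.5], [-1.0, -1.0, -2.5])

print(sequence_reward(ratios_a), sequence_reward(ratios_b))
print(prefix_scores(ratios_a))
print(prefix_scores(ratios_b))
```

Both toy trajectories receive the same sequence-level reward, so an outcome-label loss on the aggregate cannot tell them apart, yet their first-prefix scores have opposite signs. That is the sense in which local credits are only weakly identified under this assumed decomposition.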