Where Hindsight Credit Can Reside: A Signed-Capacity View of Token Updates in RLVR 文章

ArXiv CS.AI2026-05-27NEWSen作者: Yuhang He, Haodong Wu, Siyi Liu, Hongyu Ge, Hange Zhou, Keyi Wu, Zhuo Zheng, Qihong Lin, Zixin Zhong, Yongqi Zhang

摘要

arXiv:2604.11056v2 Announce Type: replace-cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves the reasoning ability of Large Language Models (LLMs), but sparse outcome rewards make token-level credit assignment difficult. We study token-level credit as a reward-conditioned shift from the behavior policy to a hindsight posterior. In autoregressive RLVR, this shift can be expressed through Conditional Mutual Information (CMI), which shows that token entropy upper-bounds possible hindsight credit. Entropy, however, indicates capacity rather than update direction, so we introduce the Four Quadrant Decomposition to separate updates by reward polarity and token entropy. Controlled interventions show that these two factors jointly shape token updates. Sustained reasoning gains concentrate in signed high-entropy quadrants, whereas low-entropy updates saturate quickly.