Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR

ArXiv CS.CL | 2026-05-29 | Authors: Dohyung Kim, Minbeom Kim, Jeonghye Kim, Sangmook Lee, Sojeong Rhee, Kyomin Jung

Abstract

arXiv:2602.12642v2

Reward-maximizing RL methods have been shown to enhance the reasoning performance of LLMs, but they often reduce generation diversity. Recent works address this issue by adopting GFlowNets, training LLMs to match a target distribution while jointly learning its partition function. In contrast to prior works that treat this partition function solely as a normalizer, we reinterpret it as a per-prompt expected-reward (i.e., online accuracy) signal and leverage this otherwise unused information to improve sample efficiency. Specifically, we first establish a theoretical relationship between the partition function and per-prompt accuracy estimates.
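To make the link between a partition function and per-prompt accuracy concrete, here is a minimal sketch under assumptions the abstract does not state: the target distribution is taken to be proportional to pi_ref(y|x) * exp(r(x, y) / beta) with a binary verifiable reward r in {0, 1}. In that setting Z(x) = 1 + (exp(1/beta) - 1) * acc(x), so an accuracy estimate can be read off Z(x). The function names, the temperature `beta`, and the choice of target are all hypothetical; the relationship derived in the paper may take a different form.

```python
import math

def partition_from_accuracy(acc: float, beta: float) -> float:
    """Z(x) = E_{y ~ pi_ref}[exp(r / beta)] for a binary reward r in {0, 1}.

    Assumed target: p*(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta).
    A fraction `acc` of samples contributes exp(1 / beta); the rest contribute 1.
    """
    return 1.0 + (math.exp(1.0 / beta) - 1.0) * acc

def accuracy_from_partition(z: float, beta: float) -> float:
    """Invert the relation above; clamp to [0, 1] because a learned Z is noisy."""
    acc = (z - 1.0) / (math.exp(1.0 / beta) - 1.0)
    return min(1.0, max(0.0, acc))

def monte_carlo_partition(rewards: list[int], beta: float) -> float:
    """Monte Carlo estimate of Z(x) from sampled binary rewards for one prompt."""
    return sum(math.exp(r / beta) for r in rewards) / len(rewards)

# Hypothetical rollouts for a single prompt: 3 of 8 responses are correct.
rewards = [1, 0, 0, 1, 0, 0, 0, 1]
beta = 0.5
z_hat = monte_carlo_partition(rewards, beta)
acc_hat = accuracy_from_partition(z_hat, beta)
```

Under this assumed target, the recovered `acc_hat` matches the empirical fraction of correct rollouts, which is why a learned Z(x) could double as a per-prompt difficulty signal.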
