Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR

ArXiv CS.CL, 2026-05-29. Authors: Dohyung Kim, Minbeom Kim, Jeonghye Kim, Sangmook Lee, Sojeong Rhee, Kyomin Jung

Abstract

arXiv:2602.12642v2 (replacement announcement). Reward-maximizing RL methods have been shown to enhance the reasoning performance of LLMs, but they often reduce generation diversity. Recent works address this issue by adopting GFlowNets, which train an LLM to match a target distribution while jointly learning that distribution's partition function. In contrast to prior works that treat the partition function solely as a normalizer, we reinterpret it as a per-prompt expected-reward (i.e., online accuracy) signal and use this otherwise unused information to improve sample efficiency. Specifically, we first establish a theoretical relationship between the partition function and per-prompt accuracy estimates.
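The abstract does not state the exact form of the relationship, so the following is only an illustrative sketch of how a partition function can carry accuracy information. It assumes (these are assumptions, not details from the abstract) a binary verifiable reward r in {0, 1}, a target distribution proportional to pi_ref(y|x) * exp(beta * r(x, y)), and a temperature beta > 0. Under those assumptions Z(x) = 1 + (exp(beta) - 1) * p(x), where p(x) is the per-prompt accuracy under pi_ref, so log Z can be inverted to an accuracy estimate. All function names are hypothetical.

```python
import math


def monte_carlo_partition(rewards, beta):
    """Estimate Z(x) = E_y[exp(beta * r(x, y))] from sampled rewards.

    Assumption: rewards are binary (0 or 1) and come from samples of the
    reference policy for a single prompt x.
    """
    return sum(math.exp(beta * r) for r in rewards) / len(rewards)


def partition_from_accuracy(p, beta):
    """Closed form under the binary-reward assumption: Z = 1 + (e^beta - 1) * p."""
    return 1.0 + math.expm1(beta) * p


def accuracy_from_log_partition(log_z, beta):
    """Invert the closed form to recover a per-prompt accuracy estimate.

    The result is clamped to [0, 1] because a learned log Z may be noisy.
    Requires beta > 0.
    """
    p = math.expm1(log_z) / math.expm1(beta)
    return min(1.0, max(0.0, p))


# Usage: four sampled answers to one prompt, two of them correct.
beta = 2.0
rewards = [1, 0, 0, 1]
z_hat = monte_carlo_partition(rewards, beta)
p_hat = accuracy_from_log_partition(math.log(z_hat), beta)
```

In this toy setting the recovered estimate equals the empirical accuracy of the samples, which is the sense in which a learned partition function could double as a per-prompt difficulty signal.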
