GRPO is Secretly a Process Reward Model 文章

ArXiv CS.AI2026-05-29NEWSen作者: Michael Sullivan, Alexander Koller

摘要

arXiv:2509.21154v4 Announce Type: replace-cross Abstract: Process reward models (PRMs) allow for fine-grained credit assignment in reinforcement learning (RL), and seemingly contrast with outcome reward models (ORMs), which assign a single reward to an entire trajectory. However, we provide theoretical proof in this work that the Group Relative Policy Optimization (GRPO) RL algorithm equipped with an ORM is in fact equivalent to a PRM-aware RL objective equipped with a non-trivial, Monte-Carlo-based PRM (given mild assumptions). Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective that interacts with imbalanced process steps and rewards to hinder both exploration and exploitation (under different conditions).

相关事件查看全部 (1)

GRPO is Secretly a Process Reward Model
2026-05-29PRODUCT_LAUNCH影响: MEDIUM

相关公司

暂无数据

相关人物

暂无数据

相关产品

暂无数据