Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

arXiv cs.CV, 2026-05-29. Authors: Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Aditya Grover

Abstract

arXiv:2605.29198v1 (announce type: new)

Group-advantage-based reinforcement learning methods, such as GRPO and DAPO, have demonstrated strong performance across diverse domains, including mathematical reasoning and text-to-image generation. However, their reliance on sample-level rewards introduces a key limitation: uniform credit assignment across all tokens fails to capture fine-grained, token-level contributions. To address this issue, we propose Guidance Contrastive Policy Optimization (GCPO), a novel algorithm that enables per-token credit assignment by contrasting model predictions under positive and negative prompts. Rather than uniformly broadcasting sample-level advantages, GCPO assigns token-level advantages proportional to the difference between these contrastive predictions, yielding more precise and informative learning signals.
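The contrast between broadcasting a sample-level advantage and assigning token-level advantages can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that token-level advantages are proportional to the difference between predictions under positive and negative prompts. The function names (`group_advantages`, `token_advantages`), the use of per-token log-probabilities as the "predictions", and the normalization by mean absolute difference are all assumptions made here for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Sample-level advantages in the style of group-based methods:
    each reward is centered and scaled by the group's statistics."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(sample_adv, logp_pos, logp_neg, eps=1e-8):
    """Hypothetical per-token credit assignment.

    logp_pos[t] / logp_neg[t]: log-probability of token t under the
    positive / negative prompt (assumed inputs, not from the paper).
    Each token's advantage is the sample advantage scaled by the
    contrastive difference, normalized by the mean absolute difference
    (an assumed normalization).
    """
    if len(logp_pos) != len(logp_neg):
        raise ValueError("log-prob sequences must have equal length")
    diffs = [p - n for p, n in zip(logp_pos, logp_neg)]
    scale = mean([abs(d) for d in diffs]) if diffs else 0.0
    if scale < eps:
        # No contrastive signal: fall back to uniform broadcasting.
        return [sample_adv] * len(diffs)
    return [sample_adv * d / scale for d in diffs]


# Two samples in a group, with rewards 1.0 and 0.0.
advs = group_advantages([1.0, 0.0])

# Token log-probs for the first sample under each prompt (made-up numbers).
per_token = token_advantages(advs[0], [-1.0, -2.0], [-2.0, -2.5])
```

In this sketch the first token, whose probability changes more between the two prompts, receives a larger share of the sample's advantage than the second, whereas uniform broadcasting would give both tokens the same value.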
