Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

ArXiv CS.AI, 2026-05-27. Authors: Zhongyi Li, Wan Tian, Yikun Ban, Jinju Chen, Huiming Zhang, Yang Liu, Fuzhen Zhuang

Abstract

arXiv:2603.21563v2 (announce type: replace)

Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles, but reinforcement learning for such systems is limited by credit assignment: shared terminal rewards obscure individual contributions and can encourage free-riding. We introduce Collaborative Credit Policy Optimization (CCPO), an optimizer-agnostic credit assignment layer that converts team-level outcomes into agent-specific learning signals. CCPO provides two complementary allocators. Counterfactual credit estimates an agent's marginal contribution by comparing the realized team outcome with a counterfactual outcome where that agent is removed. Verifier-anchored LLM self-evaluation is an exploratory allocator that uses constrained self- and peer-evaluations to redistribute credit while keeping the external verifier outcome dominant.
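The counterfactual allocator described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the counterfactual outcome is obtained, so this example assumes a hypothetical `team_outcome` callable that re-scores the team with one agent left out; the function names, agent names, and toy reward values are all illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, Dict, Sequence


def counterfactual_credit(
    agents: Sequence[str],
    team_outcome: Callable[[Sequence[str]], float],
) -> Dict[str, float]:
    """Credit each agent with (realized outcome) - (outcome without that agent)."""
    realized = team_outcome(agents)
    credits: Dict[str, float] = {}
    for agent in agents:
        # Counterfactual team: the same agents with this one removed.
        remaining = [a for a in agents if a != agent]
        credits[agent] = realized - team_outcome(remaining)
    return credits


def toy_outcome(active: Sequence[str]) -> float:
    """Hypothetical verifier score (assumption, not from the paper).

    The task needs the solver; the planner improves the result;
    the "echo" agent never changes the outcome (a free-rider).
    """
    if "solver" not in active:
        return 0.0
    return 1.0 if "planner" in active else 0.5


credits = counterfactual_credit(["planner", "solver", "echo"], toy_outcome)
print(credits)
```

In this toy setting the free-riding agent receives zero credit even though the shared terminal reward is positive, which is the behavior the abstract attributes to agent-specific learning signals. A real system would presumably need to re-run or estimate the team rollout without the agent, which this sketch abstracts away behind `team_outcome`.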