Negative Advantages Is a Double-Edged Sword: Calibrating advantages in GRPO for Search Agents · Event

PRODUCT_LAUNCH · 2026-05-28 · Impact: MEDIUM

arXiv:2604.18235v2
Announce Type: replace

Abstract: Search agents achieve strong question-answering performance through multi-turn interactions with search engines, with Group Relative Policy Optimization (GRPO) being a widely used training algorithm. However, GRPO-style algorithms still face several challenges in multi-hop search settings. First, correct intermediate steps are often penalized when the
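
For context, here is a minimal sketch of the group-relative advantage as it is commonly described for GRPO: each rollout's reward is normalized against the other rollouts sampled for the same question, and the resulting scalar is applied to every token of that rollout. This is not the calibration method proposed in the paper, which the truncated abstract does not describe. The function names, the reward values, the use of the population standard deviation, and the `eps` constant are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r_i - group mean) / (group std + eps).

    Assumption: population std is used here; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage, num_tokens):
    """Every token in a rollout receives the same rollout-level advantage."""
    return [advantage] * num_tokens


# Hypothetical outcome rewards for four rollouts of the same question
# (1.0 = final answer correct, 0.0 = final answer wrong).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(rewards)

# A rollout with a wrong final answer gets a negative advantage on all of
# its tokens, including any intermediate search steps that were correct.
token_advantages = broadcast_to_tokens(advantages[1], num_tokens=5)
```

Because the advantage is a single scalar per rollout, a failed rollout pushes down the probability of every step it contains, which is one way correct intermediate steps can end up penalized.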
