BranPO: Scalable Contrastive Branch Sampling for Long-Horizon Agentic Reinforcement Learning
Event: PRODUCT_LAUNCH | Date: 2026-06-02 | Impact: MEDIUM
arXiv:2602.03719v2 | Announce Type: replace
Abstract: Agentic reinforcement learning enables large language models to perform multi-turn planning and tool use, but long-horizon training remains challenging under sparse trajectory-level rewards, where a single outcome is uniformly assigned to all decisions. Prior methods introduce finer-grained supervision via tree-based exploration or process-level evalua