Rollout-Level Advantage-Prioritized Experience Replay for GRPO (Event)

PRODUCT_LAUNCH | 2026-06-04 | Impact: MEDIUM

arXiv:2606.04560v1 | Announce Type: cross

Abstract: Reinforcement learning from verifiable rewards with GRPO is a standard approach for post-training reasoning LLMs, but it remains sample-inefficient: each rollout is used for a single gradient update and then discarded. Naive replay is poorly suited to this setting because LLM policies drift quickly with each gradient step, so stored rollouts become stale and can destabilize training.
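To make the staleness problem concrete, here is a minimal Python sketch of a rollout buffer that ranks stored rollouts by the magnitude of their group-relative advantage and drops any that are too many policy steps old. It is an illustration only, not the paper's method, which the excerpt above does not describe. The names `group_advantages`, `Rollout`, `ReplayBuffer`, `capacity`, and `max_age` are all hypothetical. The advantage normalization (reward minus group mean, divided by group standard deviation) is one common form of the GRPO advantage, and the population standard deviation is used here as an assumption.

```python
import heapq
from dataclasses import dataclass
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt: (r - mean) / (std + eps).

    Assumption: population std; some implementations use the sample std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


@dataclass
class Rollout:
    prompt_id: int
    reward: float
    advantage: float
    step: int  # policy update step at which the rollout was sampled


class ReplayBuffer:
    """Hypothetical buffer: keeps high-|advantage| rollouts, evicts stale ones."""

    def __init__(self, capacity, max_age):
        self.capacity = capacity
        self.max_age = max_age
        self.items = []

    def add(self, rollouts):
        # Keep only the `capacity` rollouts with the largest |advantage|.
        self.items = heapq.nlargest(
            self.capacity,
            self.items + list(rollouts),
            key=lambda r: abs(r.advantage),
        )

    def sample(self, k, current_step):
        # Evict rollouts sampled more than `max_age` policy steps ago,
        # then return the k remaining rollouts with the largest |advantage|.
        self.items = [
            r for r in self.items if current_step - r.step <= self.max_age
        ]
        return heapq.nlargest(k, self.items, key=lambda r: abs(r.advantage))


# Usage: one prompt, four rollouts, only the first one is rewarded.
rewards = [1.0, 0.0, 0.0, 0.0]
advs = group_advantages(rewards)
buf = ReplayBuffer(capacity=2, max_age=2)
buf.add(Rollout(0, r, a, step=0) for r, a in zip(rewards, advs))
fresh = buf.sample(1, current_step=1)   # still within max_age
stale = buf.sample(1, current_step=10)  # everything is now too old
```

The age cutoff is the crudest possible guard against policy drift. A real system would more likely correct for drift with importance weights or a clipped ratio rather than discard rollouts outright, so treat `max_age` as a placeholder for whatever staleness control is actually used.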