OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning 文章

ArXiv CS.AI2026-05-28NEWSen作者: Xinyu Ma, Mingzhou Xu, Xuebo Liu, Chang Jin, Qiang Wang, Derek F. Wong, Min Zhang

摘要

arXiv:2604.18530v2 Announce Type: replace Abstract: Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have significantly improved Large Language Model (LLM) reasoning, yet models often struggle to explore novel trajectories beyond their initial policy distribution. While offline teacher guidance and entropy-driven strategies have been proposed to address this, they often lack deep integration or are constrained by the model's inherent capacity. In this paper, we propose OGER (Offline-Guided Exploration Reward), a novel framework that unifies offline teacher guidance and online reinforcement learning through a specialized reward modeling lens. OGER employs multi-teacher collaborative training and constructs an auxiliary exploration reward that leverages both offline trajectories and the model's own entropy to incentivize autonomous exploration.

相关公司

暂无数据

相关人物

暂无数据

相关产品

暂无数据