Smart Picks in the Dark: Towards Efficient RLVR for Reasoning via Tracing Metacognitive Pivots 文章

ArXiv CS.AI2026-06-04NEWSen作者: Guangcheng Zhu, Shenzhi Yang, Haobo Wang, Xing Zheng, Yingfan MA, Xuening Feng, Zhongqi Chen, Bowen Song, Weiqiang Wang, Gang Chen

查看原文 →

关系图谱

摘要

arXiv:2606.04503v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has greatly advanced large reasoning models (LRMs), but it requires timely training on a huge fully-annotated dataset. To this end, data-efficient RLVR methods have been widely studied from two perspectives: (i) data selection methods identify a small subset of "golden" samples that yield near-full-data performance, but they rely on a pre-existing pool of labeled data. (ii) unsupervised RLVR methods train the model using its own internal supervision signals on large-scale unlabeled data, yet they exhibit suboptimal performance. Accordingly, we investigate the "pick in the dark" setup for RLVR, which aims to select, without prior supervision, unlabeled samples that are most beneficial for training and worthy of annotation.

Smart Picks in the Dark: Towards Efficient RLVR for Reasoning via Tracing Metacognitive Pivots 文章

摘要

相关事件查看全部 (1)

相关公司

相关人物

相关产品

相关技术查看全部 (2)