Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling

arXiv cs.CL, 2026-06-03. Authors: Runpeng Dai, Tong Zheng, Rui Liu, Chengsong Huang, Hongtu Zhu

Abstract

arXiv:2606.03102v1 (announce type: new). Test-time scaling improves the reasoning performance of large language models but incurs substantial cost in both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules or distributional assumptions. In this work, we formulate adaptive sampling as a Markov decision process (MDP) and train a lightweight sampling controller with reinforcement learning (RL) to jointly balance answer correctness, latency, and computation cost. At each round, the controller decides whether to stop sampling or to acquire additional samples. The method is lightweight: it relies only on statistics of the final answers and can be trained and deployed on a CPU. We further show that the resulting framework admits an interpretation as the Lagrangian relaxation of a constrained optimization problem with explicit budget constraints.
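The stop-or-sample loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the state features (vote share of the leading answer, sample count), the batch size, the round cap, and the multipliers `lam_compute` and `lam_latency` are all assumptions, and `stand_in_policy` is a hypothetical fixed-threshold rule standing in for the RL-trained controller. The penalized reward shows one generic form a Lagrangian relaxation of budget constraints could take.

```python
from collections import Counter


def answer_stats(answers):
    """Return (leading answer, its vote share, number of samples so far)."""
    counts = Counter(answers)
    top, top_count = counts.most_common(1)[0]
    return top, top_count / len(answers), len(answers)


def stand_in_policy(top_share, n_samples, stop_share=0.8):
    """Hypothetical placeholder for the learned controller.

    A trained controller would map answer statistics to an action; here a
    fixed agreement threshold is used purely for illustration.
    """
    return "stop" if top_share >= stop_share else "sample"


def adaptive_sample(draw_batch, policy=stand_in_policy, batch_size=4, max_rounds=8):
    """Draw batches of final answers until the policy stops or rounds run out.

    draw_batch(k) is assumed to return a list of k final answers from the LLM.
    Returns (chosen answer, total samples drawn, rounds used).
    """
    answers = []
    rounds = 0
    while rounds < max_rounds:
        answers.extend(draw_batch(batch_size))
        rounds += 1
        _, top_share, n_samples = answer_stats(answers)
        if policy(top_share, n_samples) == "stop":
            break
    top, _, n_samples = answer_stats(answers)
    return top, n_samples, rounds


def penalized_reward(correct, n_samples, rounds, lam_compute=0.01, lam_latency=0.05):
    """Correctness minus weighted compute and latency costs (assumed form).

    The multipliers play the role of Lagrange multipliers on the budgets.
    """
    return float(correct) - lam_compute * n_samples - lam_latency * rounds
```

For example, passing a `draw_batch` that always returns the same answer makes the loop stop after the first round, since the vote share is already 1.0. Raising `lam_latency` relative to `lam_compute` would favor fewer, larger rounds under this assumed reward.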
