Annealed Softmax Greedy in Many-Armed Bayesian Bandits

ArXiv CS.AI | 2026-06-01 | Authors: William Overman, Mohsen Bayati

Abstract

arXiv:2605.31034v1 (announce type: cross)

Reinforcement learning with verifiable rewards (RLVR) and group-based policy optimization methods such as GRPO update a stochastic policy by sampling multiple completions per prompt and increasing the policy's probability on those with higher reward, regularized by a KL penalty toward a reference policy. These updates do not include explicit mechanisms that track epistemic uncertainty. This paper studies a stylized explanation for why such uncertainty-agnostic updates can nevertheless be effective. We analyze an annealed softmax (Boltzmann) policy that selects actions according to a softmax of empirical mean rewards in a many-armed Bayesian Bernoulli bandit.
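To make the policy in the abstract concrete, here is a minimal simulation sketch of a softmax (Boltzmann) policy over empirical mean rewards in a Bernoulli bandit. The abstract does not give the annealing schedule, the prior, or how unpulled arms are treated, so everything specific here is an assumption made for illustration: the `tau0 / sqrt(t)` temperature schedule, the Uniform(0, 1) prior on arm means, the default estimate of 0.5 for arms not yet pulled, and the function names `softmax_probs` and `annealed_softmax_bandit`. It should not be read as the authors' algorithm or analysis setup.

```python
import numpy as np


def softmax_probs(emp_means, tau):
    """Boltzmann distribution over arms at temperature tau."""
    logits = emp_means / tau
    logits = logits - logits.max()  # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def annealed_softmax_bandit(num_arms=200, horizon=5000, tau0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Assumption: arm means are drawn i.i.d. from a Uniform(0, 1) prior.
    true_means = rng.uniform(0.0, 1.0, size=num_arms)
    pulls = np.zeros(num_arms)
    reward_sums = np.zeros(num_arms)
    total_reward = 0.0

    for t in range(1, horizon + 1):
        # Empirical mean per arm; arms never pulled get 0.5 (assumed default).
        emp_means = np.where(pulls > 0, reward_sums / np.maximum(pulls, 1), 0.5)
        # Assumption: hypothetical annealing schedule, temperature shrinks with t.
        tau = tau0 / np.sqrt(t)
        probs = softmax_probs(emp_means, tau)
        arm = rng.choice(num_arms, p=probs)
        # Bernoulli reward with the chosen arm's true success probability.
        reward = float(rng.random() < true_means[arm])
        pulls[arm] += 1
        reward_sums[arm] += reward
        total_reward += reward

    return true_means, pulls, total_reward


if __name__ == "__main__":
    true_means, pulls, total_reward = annealed_softmax_bandit()
    print("average reward:", total_reward / pulls.sum())
    print("true mean of most-pulled arm:", true_means[pulls.argmax()])
```

Note that the policy uses only the empirical means and the current temperature; it keeps no posterior or confidence bound, which is the "uncertainty-agnostic" property the abstract refers to. As the temperature falls, the sampling distribution concentrates on the arms with the highest empirical means, approaching greedy selection.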
