Near-optimal Regret Bounds for Reinforcement Learning 论文

2010引用 711

Advanced Bandit Algorithms ResearchReinforcement Learning in RoboticsMachine Learning and Algorithms

机器人 Machine Learning and Algorithms Reinforcement Learning in Robotics Advanced Bandit Algorithms Research

作者

摘要

For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s,s ′ there is a policy which moves from s to s ′ in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS √ AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of Ω ( √ DSAT) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T. Finally, we also consider a setting where the MDP is allowed to change a fixed number of ℓ times. We present a modification of our algorithm that is able to deal with this setting and show a regret bound of Õ(ℓ1/3T 2/3DS √ A).

作者查看全部 (3)

Peter Auer

Ronald Ortner

Thomas Jaksch

Near-optimal Regret Bounds for Reinforcement Learning 论文

摘要

作者查看全部 (3)

相关技术查看全部 (3)

相关事件

相关文章