Near-optimal Regret Bounds for Reinforcement Learning 论文

2010引用 711
Advanced Bandit Algorithms ResearchReinforcement Learning in RoboticsMachine Learning and Algorithms

摘要

For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s,s ′ there is a policy which moves from s to s ′ in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS √ AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of Ω ( √ DSAT) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T. Finally, we also consider a setting where the MDP is allowed to change a fixed number of ℓ times. We present a modification of our algorithm that is able to deal with this setting and show a regret bound of Õ(ℓ1/3T 2/3DS √ A).