Bandits With Heavy Tail 论文

2013IEEE Transactions on Information Theory引用 249
Advanced Bandit Algorithms ResearchMachine Learning and AlgorithmsOptimization and Search Problems

摘要

The stochastic multiarmed bandit problem is well understood when the reward distributions are sub-Gaussian. In this paper, we examine the bandit problem under the weaker assumption that the distributions have moments of order <formula formulatype="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><tex Notation="TeX">$1 + \varepsilon $</tex></formula> , for some <formula formulatype="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><tex Notation="TeX">$ \varepsilon \in (0,1]$</tex></formula> . Surprisingly, moments of order 2 (i.e., finite variance) are sufficient to obtain regret bounds of the same order as under sub-Gaussian reward distributions. In order to achieve such regret, we define sampling strategies based on refined estimators of the mean such as the truncated empirical mean, Catoni's <formula formulatype="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><tex Notation="TeX">$M$</tex> </formula> -estimator, and the median-of-means estimator. We also derive matching lower bounds that also show that the best achievable regret deteriorates when <formula formulatype="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex Notation="TeX">$ \varepsilon &lt; 1$</tex></formula> .