Contextual bandits with linear Payoff functions 论文
2011引用 577
Advanced Bandit Algorithms ResearchOptimization and Search ProblemsMachine Learning and Algorithms
摘要
In this paper, we study the contextual bandit problem (also known as the multi-armed bandit problem with expert advice) for linear payoff functions. For T rounds, K actions, and d(√ dimensional feature vectors, we prove an O Td ln 3) (KT ln(T)/δ) regret bound that holds with probability 1 − δ for the simplest known (both conceptually and computationally) efficient upper confidence bound algorithm for this problem. We also prove a lower bound of Ω ( √ Td) for this setting, matching the upper bound up to logarithmic factors. 1