Contextual bandits with linear Payoff functions 论文

2011引用 577
Advanced Bandit Algorithms ResearchOptimization and Search ProblemsMachine Learning and Algorithms

摘要

In this paper, we study the contextual bandit problem (also known as the multi-armed bandit problem with expert advice) for linear payoff functions. For T rounds, K actions, and d(√ dimensional feature vectors, we prove an O Td ln 3) (KT ln(T)/δ) regret bound that holds with probability 1 − δ for the simplest known (both conceptually and computationally) efficient upper confidence bound algorithm for this problem. We also prove a lower bound of Ω ( √ Td) for this setting, matching the upper bound up to logarithmic factors. 1