MULTIARMED BANDITS IN THE WORST CASE

We present a survey of results on a recently formulated variant of the classical (stochastic) multiarmed bandit problem in which no assumption is made on the mechanism generating the rewards. We describe randomized allocation policies for this variant and prove bounds on their regret as a function of the time horizon and the number of arms. These bounds hold for any assignment of rewards to the arms and are tight to within logarithmic factors.
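
As an illustration of what a randomized allocation policy for this setting can look like, the following is a minimal sketch of an exponential-weights policy in the style of Exp3. The abstract does not name this algorithm, so its use here is an assumption, and it should not be read as the survey's own method. The names `exp3`, `reward_fn`, `num_arms`, `horizon`, and `gamma` are hypothetical. Rewards are assumed to lie in [0, 1], and `reward_fn` may be an arbitrary fixed assignment of rewards to arms, with no stochastic assumption.

```python
import math
import random


def exp3(reward_fn, num_arms, horizon, gamma, rng=None):
    """Sketch of an Exp3-style policy (illustrative, not the survey's code).

    reward_fn(t, arm) returns the reward in [0, 1] of `arm` at round t.
    Only the reward of the arm actually played is observed.
    gamma in (0, 1] is the fraction of uniform exploration mixed in.
    Returns the total reward collected over `horizon` rounds.
    """
    rng = rng or random.Random()
    weights = [1.0] * num_arms
    total_reward = 0.0
    for t in range(horizon):
        total_weight = sum(weights)
        # Mix the exponential-weights distribution with uniform exploration.
        probs = [
            (1.0 - gamma) * w / total_weight + gamma / num_arms
            for w in weights
        ]
        arm = rng.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(t, arm)
        total_reward += reward
        # Importance-weighted estimate: unbiased for the played arm's reward,
        # implicitly zero for the arms that were not played.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / num_arms)
        # Rescale so the largest weight is 1; this leaves probs unchanged
        # and avoids floating-point overflow on long horizons.
        max_weight = max(weights)
        weights = [w / max_weight for w in weights]
    return total_reward
```

Regret here means the gap between the total reward of the best single arm in hindsight and the total reward the policy collects. Calling `exp3` with a `reward_fn` that ignores the policy's past choices corresponds to the "any assignment of rewards to the arms" setting described above; how `gamma` is tuned against the horizon and the number of arms is what determines the form of the regret bound.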