• Comment: Ordinary spelling and grammar mistakes aside (please copyedit), it is unclear to me 1) what is the formulation of the problem the algorithm is trying to solve and 2) what is the final computed quantity of the algorithm. You might also want to spend some effort explaining what regret minimisation is, and as with all optimisation algorithms, give a full formulation in terms of the objective function, optimising variable and the form of any constraints. Also, the title of the draft should be the full name of the algorithm and not an acronym. Fermiboson (talk) 13:01, 26 December 2025 (UTC)


In multi-armed bandit problems, KL-UCB (Kullback-Leibler Upper Confidence Bound)[1] is a UCB-type algorithm for regret minimization whose regret matches the asymptotic problem-dependent lower bound of Lai and Robbins.[2]

History


The algorithm was first introduced in 2011 for Bernoulli distributions.[1] It was then extended to one-dimensional exponential families and to bounded distributions in 2013.[3] An adaptation called KL-UCB-Switch, which combines MOSS and KL-UCB, was introduced in 2022 to match both the problem-dependent and the problem-independent asymptotic lower bounds.[4] The algorithm was also extended to Lipschitz bandits in 2014.[5]

Algorithm


It is a UCB-type algorithm based on the principle of optimism in the face of uncertainty, which means that at each turn we pull the arm with the highest upper confidence bound (UCB). The algorithm computes this UCB using the Kullback-Leibler divergence.

In a multi-armed bandit problem where the distribution of each arm belongs to a known set $\mathcal{D}$, we compute at each turn $t$, for each arm $a$, the index[3]

$$U_a(t) = \sup\Big\{ E(\nu) : \nu \in \mathcal{D} \text{ and } \mathrm{KL}\big(\hat{\nu}_a(t), \nu\big) \leq \frac{f(t)}{N_a(t)} \Big\},$$

where

  • $N_a(t)$ is the number of pulls of arm $a$ up to turn $t$,
  • $\hat{\nu}_a(t)$ is the empirical distribution of arm $a$ at turn $t$,
  • $E(\nu)$ denotes the mean of a distribution $\nu$,
  • $f(t)$ is a well-chosen non-decreasing sequence of positive numbers, often equal to $\log(t) + c\log\log(t)$ with $c \geq 3$.[3]

We can note that the algorithm does not require knowledge of the time horizon $T$: it can be run as an anytime algorithm.
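To make the procedure concrete, here is a minimal simulation sketch for the Bernoulli case, where $\mathrm{KL}$ reduces to the divergence $d(p,q)$ between Bernoulli distributions and the index is the largest $q$ with $N_a(t)\, d(\hat{\mu}_a(t), q) \leq f(t)$. The function names, the bisection tolerance, and the simple choice $f(t) = \log(t)$ are illustrative assumptions, not part of the referenced specification; the bisection exploits the fact that $q \mapsto d(\hat{\mu}, q)$ is increasing for $q \geq \hat{\mu}$.

import math
import random

def bernoulli_kl(p, q, eps=1e-12):
    # Kullback-Leibler divergence d(p, q) between Bernoulli(p) and Bernoulli(q).
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, threshold, tol=1e-6):
    # Largest q >= mean with pulls * d(mean, q) <= threshold, found by bisection.
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if pulls * bernoulli_kl(mean, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

def kl_ucb(arm_means, horizon):
    # Simulate KL-UCB on Bernoulli arms with the given (unknown to the player) means.
    k = len(arm_means)
    pulls, sums = [0] * k, [0.0] * k
    for a in range(k):                      # initialization: pull each arm once
        sums[a] += float(random.random() < arm_means[a])
        pulls[a] += 1
    for t in range(k + 1, horizon + 1):
        f_t = math.log(t)                   # illustrative exploration sequence f(t)
        a = max(range(k),
                key=lambda i: kl_ucb_index(sums[i] / pulls[i], pulls[i], f_t))
        sums[a] += float(random.random() < arm_means[a])
        pulls[a] += 1
    return pulls

# Example run: the optimal arm (mean 0.9) should receive almost all pulls.
print(kl_ucb([0.3, 0.5, 0.9], horizon=2000))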

In the special case of Gaussian distributions with fixed variance $\sigma^2$, the index has the closed form

$$U_a(t) = \hat{\mu}_a(t) + \sqrt{\frac{2\sigma^2 f(t)}{N_a(t)}},$$

with $\hat{\mu}_a(t)$ being the empirical mean of arm $a$ at turn $t$.
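The closed form can be recovered from the expression of the divergence between two Gaussian distributions with the same variance (a short derivation using the notation above):

$$\mathrm{KL}\big(\mathcal{N}(\hat{\mu}_a(t), \sigma^2),\, \mathcal{N}(\mu, \sigma^2)\big) = \frac{(\mu - \hat{\mu}_a(t))^2}{2\sigma^2},$$

so the constraint $\mathrm{KL}(\hat{\nu}_a(t), \nu) \leq f(t)/N_a(t)$ becomes $|\mu - \hat{\mu}_a(t)| \leq \sqrt{2\sigma^2 f(t)/N_a(t)}$, and taking the largest admissible mean $\mu$ gives the index above.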

Regret bound


For $\mathcal{D}$ a one-dimensional exponential family, with $f$ chosen as above, we have for each sub-optimal arm $a$[3]

$$\limsup_{T \to \infty} \frac{\mathbb{E}[N_a(T)]}{\log T} \leq \frac{1}{d(\mu_a, \mu^\star)},$$

where $\mu_a$ is the mean of arm $a$, $\mu^\star$ is the highest mean, and $d(\mu_a, \mu^\star)$ denotes the Kullback-Leibler divergence between the two distributions of the family with means $\mu_a$ and $\mu^\star$. We can see that the algorithm matches the optimal problem-dependent lower bound of Lai and Robbins.[2]
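For instance, for Bernoulli distributions (a one-dimensional exponential family) the divergence in the bound is

$$d(p, q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q},$$

so for an instance with a sub-optimal arm of mean $\mu_a = 0.5$ and an optimal arm of mean $\mu^\star = 0.9$, the sub-optimal arm is asymptotically pulled at most about $\log(T)/d(0.5, 0.9) \approx 1.96\,\log T$ times (the numerical instance is an illustration added here).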

For $\mathcal{D}$ the set of distributions bounded in $[0,1]$, for the same choice of $f$ we have for each sub-optimal arm $a$[3]

$$\limsup_{T \to \infty} \frac{\mathbb{E}[N_a(T)]}{\log T} \leq \frac{1}{\mathcal{K}_{\inf}(\nu_a, \mu^\star)},
\qquad\text{where}\quad
\mathcal{K}_{\inf}(\nu_a, \mu^\star) = \inf\big\{ \mathrm{KL}(\nu_a, \nu) : \nu \in \mathcal{D},\ E(\nu) > \mu^\star \big\}.$$

The algorithm also matches the problem-dependent lower bound in this setting.
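As a consistency check (a remark added here for illustration), when $\nu_a = \mathrm{Ber}(p)$ is a Bernoulli distribution one can verify that $\mathcal{K}_{\inf}(\mathrm{Ber}(p), \mu^\star) = d(p, \mu^\star)$, the Bernoulli divergence of the previous example, so this result recovers the Bernoulli case of the exponential-family bound.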

Runtime


For distributions bounded in $[0,1]$, computing the index of an arm with $n$ observations requires solving an optimization problem over that arm's empirical distribution at every step[6], so the per-step runtime of KL-UCB is higher than that of other asymptotically optimal algorithms such as NPTS[7], MED[8], and IMED[9].[6]
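To illustrate where this cost comes from, the following sketch evaluates $\mathcal{K}_{\inf}$ for an empirical distribution of rewards in $[0,1]$ through its dual formulation $\mathcal{K}_{\inf}(\hat{\nu}, \mu) = \max_{0 \leq \lambda \leq 1/(1-\mu)} \mathbb{E}_{\hat{\nu}}\big[\log(1 - \lambda(X - \mu))\big]$. The grid-search solver and the function name are illustrative assumptions, not the implementation analysed in the cited papers.

import math

def kinf_dual(samples, mu, grid=1000):
    # Approximate K_inf(empirical(samples), mu) for rewards in [0, 1] via its dual:
    # a one-dimensional maximization over lambda in [0, 1/(1 - mu)].
    # Each evaluation of the dual objective scans all n observations, so the cost
    # of one index computation grows with the number of pulls of the arm.
    n = len(samples)
    lam_max = (1.0 - 1e-9) / (1.0 - mu)     # stay strictly inside the domain
    best = 0.0                               # lambda = 0 gives objective 0
    for i in range(grid + 1):
        lam = lam_max * i / grid
        val = sum(math.log(max(1.0 - lam * (x - mu), 1e-12)) for x in samples) / n
        best = max(best, val)
    return best

# The KL-UCB index for bounded rewards is then the largest mu with
# K_inf(empirical distribution, mu) <= f(t) / N_a(t), typically found by an outer
# bisection over mu, so each step performs several such evaluations.
print(kinf_dual([0.1, 0.4, 0.8, 0.9, 1.0], mu=0.9))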

See also


References

  1. ^ a b Maillard, Odalric-Ambrym; Munos, Rémi; Stoltz, Gilles (2011). "A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences". In Kakade, Sham M.; von Luxburg, Ulrike (eds.). Proceedings of the 24th Annual Conference on Learning Theory. Proceedings of Machine Learning Research. Vol. 19. Budapest, Hungary: PMLR. pp. 497–514.
  2. ^ a b Lai, T.L.; Robbins, Herbert (1985). "Asymptotically Efficient Adaptive Allocation Rules". Advances in Applied Mathematics. 6 (1): 4–22. doi:10.1016/0196-8858(85)90002-8.
  3. ^ a b c d e Cappé, Olivier; Garivier, Aurélien; Maillard, Odalric-Ambrym; Munos, Rémi; Stoltz, Gilles (2013). "Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation". The Annals of Statistics. 41 (3): 1516–1541.
  4. ^ Garivier, Aurélien; Hadiji, Hédi; Ménard, Pierre; Stoltz, Gilles (2022). "KL-UCB-switch: Optimal Regret Bounds for Stochastic Bandits from Both a Distribution-Dependent and a Distribution-Free Viewpoints". Journal of Machine Learning Research. 23 (179): 1–66.
  5. ^ Magureanu, Stefan; Combes, Richard; Proutière, Alexandre (2014). "Lipschitz Bandits: Regret Lower Bounds and Optimal Algorithms". arXiv:1405.4758 [cs.LG].
  6. ^ a b c d Baudry, Dorian; Pesquerel, Fabien; Degenne, Rémy; Maillard, Odalric-Ambrym (2023). "Fast Asymptotically Optimal Algorithms for Non-Parametric Stochastic Bandits". Advances in Neural Information Processing Systems. 36: 11469–11514.
  7. ^ Riou, Charles; Honda, Junya (2020). "Bandit Algorithms Based on Thompson Sampling for Bounded Reward Distributions". In Kontorovich, Aryeh; Neu, Gergely (eds.). Proceedings of the 31st International Conference on Algorithmic Learning Theory. Proceedings of Machine Learning Research. Vol. 117. PMLR. pp. 777–826.
  8. ^ Honda, Junya; Takemura, Akimichi (2010). "An Asymptotically Optimal Bandit Algorithm for Bounded Support Models". COLT. pp. 67–79.
  9. ^ Honda, Junya; Takemura, Akimichi (2015). "Non-Asymptotic Analysis of a New Bandit Algorithm for Semi-Bounded Rewards". Journal of Machine Learning Research. 16 (113): 3721–3756.