• Comment: Ordinary spelling and grammar mistakes aside (please copyedit), it is unclear to me 1) what is the formulation of the problem the algorithm is trying to solve and 2) what is the final computed quantity of the algorithm. You might also want to spend some effort explaining what regret minimisation is, and as with all optimisation algorithms, give a full formulation in terms of the objective function, optimising variable and the form of any constraints. Also, the title of the draft should be the full name of the algorithm and not an acronym. Fermiboson (talk) 13:01, 26 December 2025 (UTC)


In multi-armed bandit problems, KL-UCB (Kullback-Leibler Upper Confidence Bound)[1] is a UCB-type algorithm for regret minimization whose regret matches the asymptotic problem-dependent lower bound of Lai and Robbins.[2]

History


The algorithm was first introduced in 2011 for Bernoulli distributions.[1] It was then extended to one-dimensional exponential families and to bounded distributions in 2013.[3] An adaptation called KL-UCB-Switch, which combines MOSS and KL-UCB, was introduced in 2022 to match both the problem-dependent and the problem-independent asymptotic lower bounds.[4] The algorithm was also extended to Lipschitz bandits in 2014.[5]

Algorithm


It is a UCB-type algorithm based on the principle of optimism in the face of uncertainty, which means that at each turn we pull the arm with the highest upper confidence bound (UCB). The algorithm computes this UCB using the Kullback-Leibler divergence.

In a multi-armed bandit problem where the distribution of each arm belongs to a known set $\mathcal{D}$, we compute at each turn $t$, for each arm $a$, the index[3]

$$U_a(t) = \sup\Big\{ E(\nu) : \nu \in \mathcal{D} \text{ and } \mathrm{KL}\big(\hat{\nu}_a(t), \nu\big) \leq \frac{f(t)}{N_a(t)} \Big\},$$

where

  • $N_a(t)$ is the number of pulls of arm $a$ up to turn $t$,
  • $\hat{\nu}_a(t)$ is the empirical distribution of arm $a$ at turn $t$,
  • $E(\nu)$ denotes the mean of a distribution $\nu$,
  • $f(t)$ is a well-chosen non-decreasing sequence of positive numbers, often equal to $\log(t) + c\log\log(t)$ with $c \geq 3$.[3]

We can note that the algorithm does not require knowledge of the time horizon $T$: it can be run as an anytime algorithm.
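To make the procedure concrete, here is a minimal simulation sketch for the Bernoulli case, where $\mathrm{KL}$ reduces to the divergence $d(p,q)$ between Bernoulli distributions and the index is the largest $q$ with $N_a(t)\, d(\hat{\mu}_a(t), q) \leq f(t)$. The function names, the bisection tolerance, and the simple choice $f(t) = \log(t)$ are illustrative assumptions, not part of the referenced specification; the bisection exploits the fact that $q \mapsto d(\hat{\mu}, q)$ is increasing for $q \geq \hat{\mu}$.

import math
import random

def bernoulli_kl(p, q, eps=1e-12):
    # Kullback-Leibler divergence d(p, q) between Bernoulli(p) and Bernoulli(q).
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, threshold, tol=1e-6):
    # Largest q >= mean with pulls * d(mean, q) <= threshold, found by bisection.
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if pulls * bernoulli_kl(mean, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

def kl_ucb(arm_means, horizon):
    # Simulate KL-UCB on Bernoulli arms with the given (unknown to the player) means.
    k = len(arm_means)
    pulls, sums = [0] * k, [0.0] * k
    for a in range(k):                      # initialization: pull each arm once
        sums[a] += float(random.random() < arm_means[a])
        pulls[a] += 1
    for t in range(k + 1, horizon + 1):
        f_t = math.log(t)                   # illustrative exploration sequence f(t)
        a = max(range(k),
                key=lambda i: kl_ucb_index(sums[i] / pulls[i], pulls[i], f_t))
        sums[a] += float(random.random() < arm_means[a])
        pulls[a] += 1
    return pulls

# Example run: the optimal arm (mean 0.9) should receive almost all pulls.
print(kl_ucb([0.3, 0.5, 0.9], horizon=2000))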

In the special case of Gaussian distributions with fixed variance $\sigma^2$, the index has the closed form

$$U_a(t) = \hat{\mu}_a(t) + \sqrt{\frac{2\sigma^2 f(t)}{N_a(t)}},$$

with $\hat{\mu}_a(t)$ being the empirical mean of arm $a$ at turn $t$.
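The closed form can be recovered from the expression of the divergence between two Gaussian distributions with the same variance (a short derivation using the notation above):

$$\mathrm{KL}\big(\mathcal{N}(\hat{\mu}_a(t), \sigma^2),\, \mathcal{N}(\mu, \sigma^2)\big) = \frac{(\mu - \hat{\mu}_a(t))^2}{2\sigma^2},$$

so the constraint $\mathrm{KL}(\hat{\nu}_a(t), \nu) \leq f(t)/N_a(t)$ becomes $|\mu - \hat{\mu}_a(t)| \leq \sqrt{2\sigma^2 f(t)/N_a(t)}$, and taking the largest admissible mean $\mu$ gives the index above.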

Regret bound


For $\mathcal{D}$ a one-dimensional exponential family, with $f$ chosen as above, we have for each sub-optimal arm $a$[3]

$$\limsup_{T \to \infty} \frac{\mathbb{E}[N_a(T)]}{\log T} \leq \frac{1}{d(\mu_a, \mu^\star)},$$

where $\mu_a$ is the mean of arm $a$, $\mu^\star$ is the highest mean, and $d(\mu_a, \mu^\star)$ denotes the Kullback-Leibler divergence between the two distributions of the family with means $\mu_a$ and $\mu^\star$. We can see that the algorithm matches the optimal problem-dependent lower bound of Lai and Robbins.[2]
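For instance, for Bernoulli distributions (a one-dimensional exponential family) the divergence in the bound is

$$d(p, q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q},$$

so for an instance with a sub-optimal arm of mean $\mu_a = 0.5$ and an optimal arm of mean $\mu^\star = 0.9$, the sub-optimal arm is asymptotically pulled at most about $\log(T)/d(0.5, 0.9) \approx 1.96\,\log T$ times (the numerical instance is an illustration added here).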

For $\mathcal{D}$ the set of distributions bounded in $[0,1]$, for the same choice of $f$ we have for each sub-optimal arm $a$[3]

$$\limsup_{T \to \infty} \frac{\mathbb{E}[N_a(T)]}{\log T} \leq \frac{1}{\mathcal{K}_{\inf}(\nu_a, \mu^\star)},
\qquad\text{where}\quad
\mathcal{K}_{\inf}(\nu_a, \mu^\star) = \inf\big\{ \mathrm{KL}(\nu_a, \nu) : \nu \in \mathcal{D},\ E(\nu) > \mu^\star \big\}.$$

The algorithm also matches the problem-dependent lower bound in this setting.
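As a consistency check (a remark added here for illustration), when $\nu_a = \mathrm{Ber}(p)$ is a Bernoulli distribution one can verify that $\mathcal{K}_{\inf}(\mathrm{Ber}(p), \mu^\star) = d(p, \mu^\star)$, the Bernoulli divergence of the previous example, so this result recovers the Bernoulli case of the exponential-family bound.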

Runtime


For distributions bounded in $[0,1]$, computing the index of an arm with $n$ observations requires solving an optimization problem over that arm's empirical distribution at every step[6], so the per-step runtime of KL-UCB is higher than that of other asymptotically optimal algorithms such as NPTS[7], MED[8], and IMED[9].[6]
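To illustrate where this cost comes from, the following sketch evaluates $\mathcal{K}_{\inf}$ for an empirical distribution of rewards in $[0,1]$ through its dual formulation $\mathcal{K}_{\inf}(\hat{\nu}, \mu) = \max_{0 \leq \lambda \leq 1/(1-\mu)} \mathbb{E}_{\hat{\nu}}\big[\log(1 - \lambda(X - \mu))\big]$. The grid-search solver and the function name are illustrative assumptions, not the implementation analysed in the cited papers.

import math

def kinf_dual(samples, mu, grid=1000):
    # Approximate K_inf(empirical(samples), mu) for rewards in [0, 1] via its dual:
    # a one-dimensional maximization over lambda in [0, 1/(1 - mu)].
    # Each evaluation of the dual objective scans all n observations, so the cost
    # of one index computation grows with the number of pulls of the arm.
    n = len(samples)
    lam_max = (1.0 - 1e-9) / (1.0 - mu)     # stay strictly inside the domain
    best = 0.0                               # lambda = 0 gives objective 0
    for i in range(grid + 1):
        lam = lam_max * i / grid
        val = sum(math.log(max(1.0 - lam * (x - mu), 1e-12)) for x in samples) / n
        best = max(best, val)
    return best

# The KL-UCB index for bounded rewards is then the largest mu with
# K_inf(empirical distribution, mu) <= f(t) / N_a(t), typically found by an outer
# bisection over mu, so each step performs several such evaluations.
print(kinf_dual([0.1, 0.4, 0.8, 0.9, 1.0], mu=0.9))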

See also


References

  1. ^ a b Maillard, Odalric-Ambrym; Munos, Rémi; Stoltz, Gilles (2011). "A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences". In Kakade, Sham M.; von Luxburg, Ulrike (eds.). Proceedings of the 24th Annual Conference on Learning Theory. Proceedings of Machine Learning Research. Vol. 19. Budapest, Hungary: PMLR. pp. 497–514.
  2. ^ a b Lai, T.L.; Robbins, Herbert (1985). "Asymptotically Efficient Adaptive Allocation Rules". Advances in Applied Mathematics. 6 (1): 4–22. doi:10.1016/0196-8858(85)90002-8.
  3. ^ a b c d e Cappé, Olivier; Garivier, Aurélien; Maillard, Odalric-Ambrym; Munos, Rémi; Stoltz, Gilles (2013). "Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation". The Annals of Statistics. 41 (3): 1516–1541.
  4. ^ Garivier, Aurélien; Hadiji, Hédi; Ménard, Pierre; Stoltz, Gilles (2022). "KL-UCB-switch: Optimal Regret Bounds for Stochastic Bandits from Both a Distribution-Dependent and a Distribution-Free Viewpoints". Journal of Machine Learning Research. 23 (179): 1–66.
  5. ^ Magureanu, Stefan; Combes, Richard; Proutière, Alexandre (2014). "Lipschitz Bandits: Regret Lower Bounds and Optimal Algorithms". arXiv:1405.4758 [cs.LG].
  6. ^ a b c d Baudry, Dorian; Pesquerel, Fabien; Degenne, Rémy; Maillard, Odalric-Ambrym (2023). "Fast Asymptotically Optimal Algorithms for Non-Parametric Stochastic Bandits". Advances in Neural Information Processing Systems. 36: 11469–11514.
  7. ^ Riou, Charles; Honda, Junya (2020). "Bandit Algorithms Based on Thompson Sampling for Bounded Reward Distributions". In Kontorovich, Aryeh; Neu, Gergely (eds.). Proceedings of the 31st International Conference on Algorithmic Learning Theory. Proceedings of Machine Learning Research. Vol. 117. PMLR. pp. 777–826.
  8. ^ Honda, Junya; Takemura, Akimichi (2010). "An Asymptotically Optimal Bandit Algorithm for Bounded Support Models". COLT. pp. 67–79.
  9. ^ Honda, Junya; Takemura, Akimichi (2015). "Non-Asymptotic Analysis of a New Bandit Algorithm for Semi-Bounded Rewards". Journal of Machine Learning Research. 16 (113): 3721–3756.