Learning in three classic bandit strategies - ORA - Oxford University Research Archive

Abstract:: In a Bandit Problem one is repeatedly asked to implement one of a set of actions, each of which produces some observable long-term behaviour, and one aims to achieve some objective such as maximisation of a reward stream, or consistency. The fundamental dilemma one must address is when to explore the behaviour of the actions and when to exploit the information one has gained about them.

In this thesis we provide analyses of three classic Bandit Problem strategies: the Gittins Index Strategy, the Narendra Strategy, and the Thompson Strategy. We are motivated by questions of how reward optimisation in the Bandit Problem setting can be related to learning.

In particular, we prove that the Gittins Index Strategy is inconsistent and give a lower bound on the rate at which the probability that it does not learn goes to zero as the discount factor tends to one. In doing so we provide new explicit asymptotic upper and lower bounds for the Gittins Indices.

We also provide an analysis of the Narendra Strategy in a quenched context, giving a characterisation of consistency and a study of the rates of convergence that can be observed.

Finally we perform a regret analysis of the Thompson Strategy, showing in particular that for Bernoulli Bandits it achieves logarithmic cumulative regret.
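As an illustration of the last point, the following is a minimal sketch of a Thompson-style strategy on a Bernoulli bandit. It is not taken from the thesis: the uniform Beta(1, 1) prior, the function name `thompson_bernoulli`, and the arm means used in the usage lines are all assumptions made for the example. The sketch tracks the cumulative (pseudo-)regret, that is, the sum over rounds of the gap between the best arm's mean and the mean of the arm actually played.

```python
import random


def thompson_bernoulli(true_means, horizon, seed=0):
    """Simulate a Thompson-style strategy on a Bernoulli bandit.

    Assumptions (illustrative, not from the thesis): each arm starts with
    a uniform Beta(1, 1) prior, and rewards are 0 or 1.
    Returns (pulls per arm, cumulative pseudo-regret).
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k  # observed rewards of 1, per arm
    failures = [0] * k   # observed rewards of 0, per arm
    pulls = [0] * k
    best_mean = max(true_means)
    regret = 0.0

    for _ in range(horizon):
        # Draw one sample from each arm's Beta posterior ...
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(k)
        ]
        # ... and play the arm whose sampled mean is largest.
        arm = max(range(k), key=lambda i: samples[i])

        # Observe a Bernoulli reward and update that arm's posterior.
        reward = 1 if rng.random() < true_means[arm] else 0
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1

        pulls[arm] += 1
        regret += best_mean - true_means[arm]

    return pulls, regret


# Hypothetical two-armed example; the means 0.2 and 0.8 are made up.
pulls, regret = thompson_bernoulli([0.2, 0.8], horizon=2000)
print("pulls per arm:", pulls)
print("cumulative regret:", regret)
```

Running the simulation for increasing horizons and plotting the regret against the logarithm of the horizon is one informal way to see the logarithmic growth described above, though a simulation is of course no substitute for the analysis.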

Cite this record

APA Style

Korda, N. V. (2011). Learning in three classic bandit strategies [PhD thesis]. University of Oxford.

MLA Style

Korda, NV. Learning in Three Classic Bandit Strategies. 2011. University of Oxford, PhD thesis.

Chicago Style

Korda, NV. 2011. “Learning in Three Classic Bandit Strategies.” PhD thesis, University of Oxford.

Access Document

Files:: Korda_2011_Learning_in_three.pdf

(Dissemination version, pdf, 2.6MB)

Authors

Korda, NV

Institution:: University of Oxford
Division:: MPLS
Department:: Mathematical Institute
Oxford college:: St Anne's College
Role:: Author

Contributors

Role:: Supervisor

Engineering and Physical Sciences Research Council

Funder identifier:: https://ror.org/0439y7842
Funding agency for:: Korda, NV
Grant:: MATH0725

Man Group

Funding agency for:: Korda, NV

DOI:: 10.5287/ora-xmjr6g2b8
Type of award:: DPhil
Level of award:: Doctoral
Awarding institution:: University of Oxford

Language:: English
Keywords:: learning automata; bandit problems; Markov decision processes; Bayesian analysis; limit analysis; experimental design; regret learning; reinforcement learning; active learning
Subjects:: Markov processes; Probability learning; Machine learning; Experimental design; Active learning
Deposit date:: 2025-11-17
ARK identifier:: ark:/29072/ora_bf2635db58d34694a6f9a695cb7d5b8f

Terms of use

Copyright holder:: Nathaniel V. Korda
Copyright date:: 2011

Licence:: Terms and Conditions of Use for Oxford University Research Archive
