Thesis icon

Thesis

Learning in three classic bandit strategies

Abstract:
In a Bandit Problem one is repeatedly asked to implement one of a set of actions each of which produce some observable long term behaviour, and one aims to achieve some objective such as maximisation of a reward stream, or consistency. The fundamental dilema one must address is when to choose to explore the behaviour of the actions and when to choose to exploit the information one has gained about them.

In this thesis we provide analyses of three classic Bandit Problem strategies, the Gittins Index Strategy, the Narendra Strategy, and the Thompson Strategy. We are motivated by questions of how reward optimisation in the Bandit Problem setting can be related to learning.

In particular we prove that the Gittins Index Strategy is inconsistent and give a lower bound on rate at which the probability that it does not learn goes to zero as the discount factor tends to one. In doing so we provide new asymptotic explicit upper and lower bounds for the Gittins Indices.

We also provide an analysis of the Narendra Strategy in a quenched context, giving a characterisation of consistency and a study of the rates of convergence that can be observed.

Finally we perform a regret analysis of the Thompson Strategy, showing in particular that for Bernoulli Bandits it achieves logarithmic cumulative regret.

Actions

Access Document

Files:

Authors

More by this author
Institution:
University of Oxford
Division:
MPLS
Department:
Mathematical Institute
Oxford college:
St Anne's College
Role:
Author

Contributors

Role:
Supervisor


More from this funder
Funder identifier:
https://ror.org/0439y7842
Funding agency for:
Korda, NV
Grant:
MATH0725
More from this funder
Funding agency for:
Korda, NV


DOI:
Type of award:
DPhil
Level of award:
Doctoral
Awarding institution:
University of Oxford

Terms of use


Views and Downloads






If you are the owner of this record, you can report an update to it here: Report update to this record

TO TOP