Thesis
Learning in three classic bandit strategies
- Abstract:
-
In a Bandit Problem one is repeatedly asked to implement one of a set of actions each of which produce some observable long term behaviour, and one aims to achieve some objective such as maximisation of a reward stream, or consistency. The fundamental dilema one must address is when to choose to explore the behaviour of the actions and when to choose to exploit the information one has gained about them.
In this thesis we provide analyses of three classic Bandit Problem strategies, the Gittins Index Strategy, the Narendra Strategy, and the Thompson Strategy. We are motivated by questions of how reward optimisation in the Bandit Problem setting can be related to learning.
In particular we prove that the Gittins Index Strategy is inconsistent and give a lower bound on rate at which the probability that it does not learn goes to zero as the discount factor tends to one. In doing so we provide new asymptotic explicit upper and lower bounds for the Gittins Indices.
We also provide an analysis of the Narendra Strategy in a quenched context, giving a characterisation of consistency and a study of the rates of convergence that can be observed.
Finally we perform a regret analysis of the Thompson Strategy, showing in particular that for Bernoulli Bandits it achieves logarithmic cumulative regret.
Actions
Access Document
- Files:
-
-
(Preview, Dissemination version, pdf, 2.6MB, Terms of use)
-
Authors
+ Engineering and Physical Sciences Research Council
More from this funder
- Funder identifier:
- https://ror.org/0439y7842
- Funding agency for:
- Korda, NV
- Grant:
- MATH0725
- DOI:
- Type of award:
- DPhil
- Level of award:
- Doctoral
- Awarding institution:
- University of Oxford
- Language:
-
English
- Keywords:
- Subjects:
- Deposit date:
-
2025-11-17
- ARK identifier:
Terms of use
- Copyright holder:
- Nathaniel V. Korda
- Copyright date:
- 2011
If you are the owner of this record, you can report an update to it here: Report update to this record