Thesis

Optimization methods for reinforcement learning: theory and applications

Abstract:
Reinforcement learning (RL) is a powerful paradigm for training agents to make optimal decisions in sequential environments through interaction and feedback. A central aspect of RL research lies in the development and application of efficient optimization methods that enable agents to learn complex behaviors. This thesis contributes to this area by introducing novel optimization techniques and by analyzing their applications in various RL settings, encompassing cooperative multi-agent systems, policy optimization with general parameterizations, and preference-based learning.

In the context of cooperative multi-agent reinforcement learning, where multiple agents collaborate to achieve a common goal, a significant challenge arises from the exponential increase in complexity with the number of agents. To address this, we introduce a scalable algorithm based on Natural Policy Gradient, which leverages local information exchange between neighboring agents within a defined range. We theoretically demonstrate that, under standard assumptions on spatial decay of correlations, our algorithm converges to the globally optimal policy with a statistical and computational complexity that remains independent of the number of agents.
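
To make "neighboring agents within a defined range" concrete, here is a minimal sketch. It assumes the agents sit on a graph and that the range is a hop count; the abstract specifies neither. The names `k_hop_neighborhood`, `adjacency`, and `kappa` are hypothetical and not taken from the thesis.

```python
from collections import deque


def k_hop_neighborhood(adjacency, agent, kappa):
    """Return the set of agents within `kappa` hops of `agent` (inclusive).

    `adjacency` maps each agent to a list of its direct neighbors.
    This is an illustrative breadth-first search, not the thesis's algorithm.
    """
    seen = {agent}
    frontier = deque([(agent, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == kappa:
            continue  # do not expand beyond the allowed range
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return seen


# Hypothetical example: six agents arranged on a line, 0 - 1 - 2 - 3 - 4 - 5.
line_graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
local_agents = k_hop_neighborhood(line_graph, agent=2, kappa=1)
```

On a line graph such a neighborhood holds at most `2 * kappa + 1` agents however long the line is, which illustrates how a local scheme can have per-agent cost that does not grow with the total number of agents.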

This thesis further investigates the use of mirror descent as a versatile framework for policy optimization in reinforcement learning. We develop Approximate Mirror Policy Optimization (AMPO) and establish for it the first linear convergence guarantee for a policy-gradient-based method that accommodates general policy parameterizations. Furthermore, we empirically examine the impact of different mirror maps within the Policy Mirror Descent (PMD) and AMPO frameworks, revealing that the commonly used negative entropy is not always the best choice.
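
The role of the mirror map can be illustrated in the simplest setting. This is a minimal sketch and not the AMPO algorithm: it assumes a tabular policy and exact action values, and compares one mirror descent step under the negative entropy mirror map with one under the squared Euclidean norm. The function names, the step size `eta`, and the toy values in `q` are illustrative assumptions.

```python
import numpy as np


def pmd_step_neg_entropy(pi, q, eta):
    """One tabular mirror descent step with the negative entropy mirror map.

    The update is multiplicative: pi_new(a|s) is proportional to
    pi(a|s) * exp(eta * q(s, a)). Assumes every entry of `pi` is positive.
    """
    logits = np.log(pi) + eta * q
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)


def project_simplex(v):
    """Euclidean projection of a 1-D vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)


def pmd_step_euclidean(pi, q, eta):
    """One tabular mirror descent step with the squared Euclidean mirror map.

    This reduces to a gradient step followed by projection onto the simplex,
    applied independently in each state.
    """
    return np.vstack([project_simplex(p + eta * qs) for p, qs in zip(pi, q)])


# Toy example: 2 states, 3 actions, uniform initial policy (assumed values).
pi0 = np.full((2, 3), 1.0 / 3.0)
q = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0]])
pi_entropy = pmd_step_neg_entropy(pi0, q, eta=0.5)
pi_euclid = pmd_step_euclidean(pi0, q, eta=0.5)
```

Both steps shift probability toward the higher-valued action, but by different amounts, and the Euclidean step can set probabilities exactly to zero. This is one way to see why the choice of mirror map can matter in practice.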

Finally, we extend the application of mirror descent to the domain of preference optimization (PO), a crucial technique for aligning agents with human preferences. In particular, we propose a novel family of algorithms called Mirror Preference Optimization (MPO). Through evolutionary strategies, we identify specialized MPO algorithms tailored to specific characteristics of preference datasets, such as mixed-quality or noisy data. Our findings demonstrate that these discovered algorithms outperform existing state-of-the-art PO methods in both simulated robotic tasks and a large language model alignment task, highlighting the effectiveness of our mirror descent-based approach for preference learning.

Authors

Institution:
University of Oxford
Division:
MPLS
Department:
Statistics
Sub department:
Statistics
Oxford college:
Linacre College
Role:
Author

Contributors

Institution:
University of Oxford
Division:
MPLS
Department:
Statistics
Role:
Supervisor
ORCID:
0000-0001-7772-4160

Funder identifier:
https://ror.org/0439y7842
Grant:
STAT2021_EPSRCDTP_ 1236656

Type of award:
DPhil
Level of award:
Doctoral
Awarding institution:
University of Oxford


Language:
English
Deposit date:
2026-05-06