Thesis
Optimization methods for reinforcement learning: theory and applications
- Abstract:
Reinforcement learning (RL) is a powerful paradigm for training agents to make optimal decisions in sequential environments through interaction and feedback. A central aspect of RL research lies in the development and application of efficient optimization methods that enable agents to learn complex behaviors. This thesis contributes to this area by introducing novel optimization techniques and by analyzing their applications in various RL settings, encompassing cooperative multi-agent systems, policy optimization with general parameterizations, and preference-based learning.
In the context of cooperative multi-agent reinforcement learning, where multiple agents collaborate to achieve a common goal, a significant challenge arises from the exponential increase in complexity with the number of agents. To address this, we introduce a scalable algorithm based on Natural Policy Gradient, which leverages local information exchange between neighboring agents within a defined range. We theoretically demonstrate that, under standard assumptions on spatial decay of correlations, our algorithm converges to the globally optimal policy with a statistical and computational complexity that remains independent of the number of agents.
This thesis further investigates the use of mirror descent as a versatile framework for policy optimization in reinforcement learning. We develop Approximate Mirror Policy Optimization (AMPO) and prove that it converges linearly, the first such guarantee for a policy-gradient-based method that accommodates general policy parameterizations. Furthermore, we empirically examine the impact of different mirror maps within the Policy Mirror Descent (PMD) and AMPO frameworks, revealing that the commonly used negative entropy is not always the best choice.
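As background for the paragraph above, the following is a minimal sketch of tabular Policy Mirror Descent with the negative-entropy mirror map. It is not the thesis's AMPO algorithm. It assumes a small finite MDP with known transitions and exact policy evaluation, in which case the PMD update takes the well-known closed form pi_new(a|s) proportional to pi(a|s) * exp(eta * Q(s, a)). All function names, the random MDP, and the values of `eta`, `gamma`, and `iters` are illustrative assumptions, not taken from the thesis.

```python
import numpy as np


def random_mdp(num_states, num_actions, seed=0):
    """Hypothetical helper: a random finite MDP used only for illustration."""
    rng = np.random.default_rng(seed)
    P = rng.random((num_states, num_actions, num_states))
    P /= P.sum(axis=2, keepdims=True)  # each P[s, a, :] is a distribution
    R = rng.random((num_states, num_actions))  # rewards in [0, 1)
    return P, R


def policy_evaluation(P, R, pi, gamma):
    """Exact Q and V of policy pi. Shapes: P (S, A, S), R (S, A), pi (S, A)."""
    num_states = R.shape[0]
    # State-to-state transitions and expected one-step reward under pi.
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.sum(pi * R, axis=1)
    # Solve the Bellman equation V = r_pi + gamma * P_pi V.
    V = np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)
    # Q(s, a) = R(s, a) + gamma * sum_t P(s, a, t) V(t).
    Q = R + gamma * (P @ V)
    return Q, V


def pmd_step(log_pi, Q, eta):
    """One PMD step with the negative-entropy mirror map, in log space.

    Equivalent to pi_new(a|s) proportional to pi(a|s) * exp(eta * Q(s, a)).
    """
    logits = log_pi + eta * Q
    # Normalize each row with a numerically stable log-sum-exp.
    m = logits.max(axis=1, keepdims=True)
    log_norm = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return logits - log_norm


def run_pmd(P, R, gamma=0.9, eta=1.0, iters=200):
    """Run PMD from the uniform policy; return the final policy and its values."""
    num_states, num_actions = R.shape
    log_pi = np.full((num_states, num_actions), -np.log(num_actions))
    for _ in range(iters):
        Q, _ = policy_evaluation(P, R, np.exp(log_pi), gamma)
        log_pi = pmd_step(log_pi, Q, eta)
    pi = np.exp(log_pi)
    _, V = policy_evaluation(P, R, pi, gamma)
    return pi, V


if __name__ == "__main__":
    P, R = random_mdp(num_states=5, num_actions=3, seed=0)
    pi, V = run_pmd(P, R, gamma=0.9, eta=10.0, iters=300)
    print("state values under the final policy:", V)
```

Other mirror maps would replace the multiplicative update in `pmd_step` with a different projection step. This sketch covers only the negative-entropy case because that is the one with a simple closed form.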
Finally, we extend the application of mirror descent to the domain of preference optimization (PO), a crucial technique for aligning agents with human preferences. In particular, we propose a novel family of algorithms called Mirror Preference Optimization (MPO). Through evolutionary strategies, we identify specialized MPO algorithms tailored to specific characteristics of preference datasets, such as mixed-quality or noisy data. Our findings demonstrate that these discovered algorithms outperform existing state-of-the-art PO methods in both simulated robotic tasks and a large language model alignment task, highlighting the effectiveness of our mirror descent-based approach for preference learning.
Access Document
- Files:
- Dissemination version, pdf, 2.6MB
Authors
Contributors
+ Rebeschini, P
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Statistics
- Role:
- Supervisor
- ORCID:
- 0000-0001-7772-4160
+ Engineering and Physical Sciences Research Council
- Funder identifier:
- https://ror.org/0439y7842
- Grant:
- STAT2021_EPSRCDTP_ 1236656
- DOI:
- Type of award:
- DPhil
- Level of award:
- Doctoral
- Awarding institution:
- University of Oxford
- Language:
- English
- Keywords:
- Subjects:
- Deposit date:
- 2026-05-06
- ARK identifier:
Terms of use
- Copyright holder:
- Carlo Alfano
- Copyright date:
- 2025