Optimization methods for reinforcement learning: theory and applications

Alfano, C

Thesis

Abstract: Reinforcement learning (RL) is a powerful paradigm for training agents to make optimal decisions in sequential environments through interaction and feedback. A central aspect of RL research lies in the development and application of efficient optimization methods that enable agents to learn complex behaviors. This thesis contributes to this area by introducing novel optimization techniques and by analyzing their applications in various RL settings, encompassing cooperative multi-agent systems, policy optimization with general parameterizations, and preference-based learning.

In the context of cooperative multi-agent reinforcement learning, where multiple agents collaborate to achieve a common goal, a significant challenge arises from the exponential increase in complexity with the number of agents. To address this, we introduce a scalable algorithm based on Natural Policy Gradient, which leverages local information exchange between neighboring agents within a defined range. We theoretically demonstrate that, under standard assumptions on spatial decay of correlations, our algorithm converges to the globally optimal policy with a statistical and computational complexity that remains independent of the number of agents.

This thesis further investigates the use of mirror descent as a versatile framework for policy optimization in reinforcement learning. We develop Approximate Mirror Policy Optimization (AMPO) and establish for it the first linear convergence guarantee for a policy-gradient-based method that accommodates general policy parameterizations. Furthermore, we empirically examine the impact of different mirror maps within the Policy Mirror Descent (PMD) and AMPO frameworks, revealing that the commonly used negative entropy is not always the best choice.
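For readers unfamiliar with the role of the mirror map, the following is a minimal sketch of a single tabular mirror descent policy update at one state, assuming the negative entropy mirror map, for which the update reduces to the well-known multiplicative-weights form: the new policy is proportional to the old policy times exp(step size × action value). It is not the thesis's AMPO algorithm, which handles general parameterizations. The function name, the example policy, the action values, and the step size are all hypothetical choices made for illustration.

```python
import math


def pmd_step_negative_entropy(policy, q_values, step_size):
    """One tabular mirror descent update at a single state.

    Assumes the negative entropy mirror map, under which the update is
    new_policy[a] proportional to policy[a] * exp(step_size * q_values[a]).
    All probabilities in `policy` must be strictly positive.
    """
    # Work in log space and subtract the maximum for numerical stability.
    logits = [math.log(p) + step_size * q for p, q in zip(policy, q_values)]
    max_logit = max(logits)
    weights = [math.exp(logit - max_logit) for logit in logits]
    total = sum(weights)
    # Normalize so the result is a probability distribution over actions.
    return [w / total for w in weights]


# Hypothetical three-action example (values chosen only for illustration).
policy = [0.25, 0.25, 0.5]
q_values = [1.0, 0.0, 0.5]
new_policy = pmd_step_negative_entropy(policy, q_values, step_size=1.0)
print(new_policy)
```

Swapping in a different mirror map changes this update rule (for example, the squared Euclidean norm leads to a projection onto the probability simplex instead of an exponential reweighting), which is the kind of design choice the empirical comparison above concerns.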

Finally, we extend the application of mirror descent to the domain of preference optimization (PO), a crucial technique for aligning agents with human preferences. In particular, we propose a novel family of algorithms called Mirror Preference Optimization (MPO). Through evolutionary strategies, we identify specialized MPO algorithms tailored to specific characteristics of preference datasets, such as mixed-quality or noisy data. Our findings demonstrate that these discovered algorithms outperform existing state-of-the-art PO methods in both simulated robotic tasks and a large language model alignment task, highlighting the effectiveness of our mirror descent-based approach for preference learning.

Cite this record

APA Style

Alfano, C. (2025). Optimization methods for reinforcement learning: theory and applications [PhD thesis]. University of Oxford.

MLA Style

Alfano, C. Optimization Methods for Reinforcement Learning: Theory and Applications. 2025. University of Oxford, PhD thesis.

Chicago Style

Alfano, C. 2025. “Optimization Methods for Reinforcement Learning: Theory and Applications.” PhD thesis, University of Oxford.