
Thesis

Efficient and scalable methods for deep reinforcement learning

Abstract:

This thesis proposes some new answers to an old question: how can artificially intelligent agents efficiently learn from their experiences to make optimal decisions? We adopt the framework of reinforcement learning (RL), in which agents are trained to maximise their expected long-term cumulative rewards, and build on the recent successes of deep RL by using deep neural network function approximators for policies, value functions, and other model components. Deep RL often learns very inefficiently from experience, and can struggle to scale to very complex problems with large action spaces or sparse feedback. We address these challenges in several ways.
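For concreteness, the objective described above is standardly written as the expected (discounted) return; the policy parameters \theta, discount factor \gamma, and trajectory \tau below are textbook notation introduced for this sketch rather than taken from the thesis:

\[ J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \right], \]

where \pi_\theta is the agent's policy and r_t the reward received at time t; deep RL approximates \pi_\theta, value functions, and other model components with deep neural networks.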

In Part I, we dive deep into a subfield of RL concerning multiple agents that must cooperate to achieve a common goal. These multi-agent systems test the limits of our algorithms due to their complex dynamics, large joint action spaces, and decentralisation constraints. We develop methods to address partial observability and multi-agent credit assignment (Chapter 3), nonstationarity induced by co-learning agents (Chapter 4), and efficient representation and learning of joint action values (Chapter 5).
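To illustrate why joint action values are difficult to represent, and one common remedy (a standard factorisation used in cooperative multi-agent RL, not necessarily the construction developed in Chapter 5): with n agents each choosing from an action set U, a joint action value function must cover |U|^n joint actions, so scalable methods often decompose it into per-agent utilities combined by a mixing function,

\[ Q_{tot}(s, \mathbf{u}) \approx f\big( Q_1(o_1, u_1), \ldots, Q_n(o_n, u_n); s \big), \]

where each Q_a depends only on agent a's local observation o_a and action u_a, so that decentralised agents can act greedily on their own utilities while Q_tot is trained centrally. All symbols here are notation introduced for this illustration.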

In Part II, we leave the specific setting of multi-agent RL to build more general inductive biases into algorithms and architectures. In Chapter 6, we accelerate learning by leveraging the inductive bias that tree-search planning is an effective representation of value functions or policies. In Chapter 7, we use a curriculum of progressively growing action spaces to enable efficient exploration without compromising long-term optimality.
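As a rough sketch of the tree-search inductive bias (a generic depth-limited backup, not the specific architecture of Chapter 6), a learned transition model T and reward model r can be unrolled inside the network so that its value output has the structure of a small planning computation:

\[ Q^{(d)}(s, a) = r(s, a) + \gamma \max_{a'} Q^{(d-1)}\big( T(s, a), a' \big), \qquad Q^{(0)}(s, a) = \hat{q}_\phi(s, a), \]

with \hat{q}_\phi a learned leaf evaluation; all symbols are introduced here for illustration only.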

In Part III, we focus on estimators of higher-order derivatives in the context of RL. Among other applications, these estimators can be used in meta-learning, where we attempt to learn algorithms or inductive biases from data rather than hand-designing them as in Parts I and II. In Chapter 8, we propose an objective that may be (automatically) differentiated any number of times to produce unbiased estimates of higher-order derivatives. In Chapter 9, we extend this objective to reduce its variance and to allow a flexible trade-off between bias and variance in estimators of any-order derivatives for RL.
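To make concrete what an objective that can be differentiated any number of times might look like, here is a minimal PyTorch sketch in the style of "magic box" surrogate objectives; the exact construction in Chapters 8 and 9 may differ, and the names magic_box, logp, and rewards are introduced only for this illustration.

    import torch

    def magic_box(x):
        # Evaluates to exactly 1 in the forward pass, but each differentiation
        # multiplies in another copy of dx/dtheta, which is what makes repeated
        # (automatic) differentiation yield unbiased higher-order derivative
        # estimates.
        return torch.exp(x - x.detach())

    def surrogate_objective(logp, rewards):
        # logp[t]: log-probability of the action taken at step t (a function of theta)
        # rewards[t]: reward received at step t (treated as a constant here)
        total = 0.0
        for t in range(len(rewards)):
            # Weight each reward by the magic-box of the log-probabilities of all
            # actions that could have influenced it (those taken at steps <= t).
            deps = torch.stack(logp[: t + 1]).sum()
            total = total + magic_box(deps) * rewards[t]
        return total

Differentiating this surrogate once with respect to the policy parameters recovers a standard score-function (policy-gradient) estimator; differentiating it again gives estimates of second-order derivatives without deriving a new estimator by hand, which is what makes such objectives convenient for meta-learning.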

Together, these contributions make valuable strides towards realising efficient and scalable solutions to challenging RL problems, as well as opening up exciting directions for future work building on the algorithms, architectures, and estimators proposed here.

Authors


Division: MPLS
Department: Computer Science
Role: Author

Contributors

Role: Supervisor
Role: Examiner


Type of award: DPhil
Level of award: Doctoral
Awarding institution: University of Oxford

