Thesis

Training efficient agents for long-term decision making

Abstract:
Reinforcement learning has moved from tabletop simulators to real robots and open-world games, but today’s agents still learn with prohibitively low sample efficiency, ignore the priors encoded in foundation models, and forget most of what they have seen after a few hundred steps. This thesis pursues a unifying agenda, efficiently training efficient decision-making agents, through three successive contributions.

Chapter 1 demonstrates that sample efficiency can be substantially improved by re-weighting experience toward the transitions that are most informative. An ensemble-based uncertainty criterion selectively upsamples those rare interactions that clarify causal structure, enabling offline reinforcement learning to achieve safe, performant policies with far fewer gradient updates than uniform replay.
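As a rough illustration of uncertainty-weighted replay, the sketch below turns ensemble disagreement into sampling probabilities over a fixed dataset. It is a minimal sketch under stated assumptions, not the thesis's implementation: the function names, the use of the standard deviation of ensemble predictions as the uncertainty score, and the `alpha` and `floor` parameters are all hypothetical choices made for this example.

```python
import random
import statistics


def uncertainty_weights(ensemble_predictions, alpha=1.0, floor=1e-6):
    """Map per-transition ensemble predictions to sampling probabilities.

    ensemble_predictions[i] holds one scalar prediction per ensemble member
    for transition i (an assumed data layout, chosen for illustration).
    """
    # Disagreement between ensemble members serves as the uncertainty proxy.
    scores = [statistics.pstdev(preds) for preds in ensemble_predictions]
    # alpha controls how sharply sampling favours uncertain transitions
    # (alpha=0 recovers uniform replay); floor keeps every weight positive.
    raw = [(s + floor) ** alpha for s in scores]
    total = sum(raw)
    return [r / total for r in raw]


def sample_batch(transitions, weights, batch_size, rng):
    """Draw a training batch with replacement, upsampling uncertain transitions."""
    return rng.choices(transitions, weights=weights, k=batch_size)


# Three transitions; the ensemble disagrees most about the second one.
preds = [[1.0, 1.0, 1.0], [0.0, 2.0, 4.0], [1.0, 1.1, 0.9]]
weights = uncertainty_weights(preds)
batch = sample_batch(["t0", "t1", "t2"], weights, batch_size=5, rng=random.Random(0))
```

With these toy inputs the second transition receives the largest weight, so it appears in training batches far more often than it would under uniform replay.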

Stronger supervision is possible even when no new interaction data are collected, provided we can import structure learned elsewhere. Chapter 2 investigates this idea by tapping the internal representations of large generative vision models. Text-to-image diffusion backbones, although trained for synthesis rather than control, accumulate multi-scale spatial and semantic cues that are difficult to rediscover from scratch in a robotics dataset. By freezing these backbones and projecting their multi-layer activations into a control-friendly embedding, which we term Stable Control Representations (SCRs), an agent starts with a rich inductive prior over object geometry and language grounding. In manipulation and open-vocabulary navigation tasks, SCRs cut the number of gradient steps needed to reach a given return by up to an order of magnitude and consistently outperform contrastively trained encoders, all without generating a single additional pixel. This result shows that reusing pretrained knowledge can convert computationally expensive exploration into cheap representation reuse, markedly improving sample efficiency.

While these chapters focus on learning efficiently, deployed agents must also act efficiently by leveraging context that spans hours or days. Chapter 3 introduces Memo, a transformer policy that interleaves periodic summary tokens with streaming observations so memory capacity grows gently with task length. To measure such long-term reasoning, Chapter 4 contributes FindingDory, a procedurally extendable benchmark family whose 60 tasks probe how well embodied agents store and retrieve experience.
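The idea of interleaving periodic summaries with streaming observations can be pictured with a toy sketch. It assumes, purely for illustration, that the raw observations of a segment are dropped once their summary has been written and that a summary may depend on earlier summaries; the actual Memo mechanism may differ. The names `build_context`, `segment_length`, and `summarize` are hypothetical, and `summarize` stands in for what would be a learned summary token in a transformer policy.

```python
def build_context(observations, segment_length, summarize):
    """Return the items a policy would keep in context after the whole stream.

    Every `segment_length` observations are replaced by a single summary, so
    the retained context holds one entry per completed segment plus the
    observations of the current, unfinished segment.
    """
    summaries = []
    current = []
    for obs in observations:
        current.append(obs)
        if len(current) == segment_length:
            # The summary may condition on earlier summaries as well as on
            # the segment that has just been completed.
            summaries.append(summarize(summaries, current))
            current = []
    return summaries + current


def span_summary(previous_summaries, segment):
    """Toy stand-in for a learned summary: records the segment's first and last item."""
    return ("SUM", segment[0], segment[-1])


context = build_context(list(range(10)), segment_length=4, summarize=span_summary)
# context == [("SUM", 0, 3), ("SUM", 4, 7), 8, 9]
```

Under these assumptions a stream of T observations with segment length k leaves T // k summaries plus T % k raw observations in context, which is one way a memory can grow slowly relative to task length.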

Together, these works chart a coherent path toward agents that learn quickly, inherit rich priors, and remember what matters, moving a step closer to truly lifelong, self-improving intelligence.

Authors

Institution: University of Oxford
Division: MPLS
Department: Computer Science
Role: Author

Contributors

Institution: University of Oxford
Division: MPLS
Department: Computer Science
Role: Supervisor
ORCID: 0000-0002-2733-2078


DOI:
Type of award: DPhil
Level of award: Doctoral
Awarding institution: University of Oxford
