Thesis
Training efficient agents for long-term decision making
- Abstract:
Reinforcement learning has ventured from tabletop simulators to real robots and open-world games, but today’s agents still learn with prohibitively low sample efficiency, ignore the priors encoded in foundation models, and forget most of what they have seen after a few hundred steps. This thesis pursues a unifying agenda—efficiently training efficient decision-making agents—through three successive contributions.
Chapter 1 demonstrates that sample efficiency can be substantially improved by re-weighting experience toward the transitions that are most informative. An ensemble-based uncertainty criterion selectively upsamples those rare interactions that clarify causal structure, enabling offline reinforcement learning to achieve safe, performant policies with far fewer gradient updates than uniform replay.
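The re-weighting idea in Chapter 1 can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not specify the uncertainty criterion, so the sketch assumes that uncertainty is measured as the variance of predictions across ensemble members, and that sampling probability is proportional to that variance. The function names, the `temperature` and `floor` parameters, and the toy data are all hypothetical.

```python
import random
import statistics


def ensemble_disagreement(predictions):
    """Population variance of the ensemble members' predictions for one transition."""
    return statistics.pvariance(predictions)


def sampling_weights(ensemble_predictions, temperature=1.0, floor=1e-6):
    """Map per-transition disagreement to normalized sampling probabilities.

    `floor` keeps zero-disagreement transitions sampleable; `temperature`
    sharpens (>1) or flattens (<1) the preference. Both are assumptions.
    """
    scores = [
        (ensemble_disagreement(p) + floor) ** temperature
        for p in ensemble_predictions
    ]
    total = sum(scores)
    return [s / total for s in scores]


def sample_batch(transitions, weights, batch_size, seed=0):
    """Draw a replay batch with replacement, favoring uncertain transitions."""
    rng = random.Random(seed)
    return rng.choices(transitions, weights=weights, k=batch_size)


# Hypothetical toy data: 4 transitions, each scored by a 3-member ensemble.
ensemble_predictions = [
    [1.0, 1.0, 1.0],  # members agree: low uncertainty
    [0.9, 1.0, 1.1],
    [0.0, 1.0, 2.0],  # members disagree: high uncertainty
    [0.5, 0.5, 0.6],
]
transitions = ["t0", "t1", "t2", "t3"]

weights = sampling_weights(ensemble_predictions)
batch = sample_batch(transitions, weights, batch_size=8)
```

In this toy setup the third transition, where the ensemble members disagree most, receives the largest sampling weight, while the first, where they agree exactly, receives the smallest.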
Stronger supervision is possible even when no new interaction data are collected, provided we can import structure learned elsewhere. Chapter 2 investigates this idea by tapping the internal representations of large generative vision models. Text-to-image diffusion backbones, although trained for synthesis rather than control, accumulate multi-scale spatial and semantic cues that are difficult to rediscover from scratch in a robotics dataset. By freezing these backbones and projecting their multi-layer activations into a control-friendly embedding—what we term Stable Control Representations (SCRs)—an agent starts with a rich inductive prior over object geometry and language grounding. In manipulation and open-vocabulary navigation tasks, SCRs cut the number of gradient steps needed to reach a given return by up to an order of magnitude and consistently outperform contrastively trained encoders, all without generating a single additional pixel. This result shows that re-using pretrained knowledge can convert computationally expensive exploration into cheap representation reuse, markedly improving sample efficiency.
While these chapters focus on learning efficiently, deployed agents must also act efficiently by leveraging context that spans hours or days. Chapter 3 introduces Memo, a transformer policy that interleaves periodic summary tokens with streaming observations so memory capacity grows gently with task length. To measure such long-term reasoning, Chapter 4 contributes FindingDory, a procedurally extendable benchmark family whose 60 tasks probe how well embodied agents store and retrieve experience.
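The interleaving of summary tokens with streaming observations can be sketched as follows. This is an illustration under stated assumptions, not Memo's architecture: the abstract does not say how often summaries are emitted or what the policy keeps in context afterwards. The sketch assumes a fixed segment length, and assumes that once a segment is summarized only its summary token is retained. The `<SUM>` placeholder and both function names are hypothetical.

```python
def interleave_summaries(observations, segment_len, summary_token="<SUM>"):
    """Insert a summary placeholder after every full segment of observations."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    sequence = []
    for i, obs in enumerate(observations, start=1):
        sequence.append(obs)
        if i % segment_len == 0:
            sequence.append(summary_token)
    return sequence


def retained_context(sequence, summary_token="<SUM>"):
    """Keep every summary token plus the raw observations after the last one.

    Assumption: observations that have already been summarized are dropped.
    """
    if summary_token not in sequence:
        return list(sequence)
    last = len(sequence) - 1 - sequence[::-1].index(summary_token)
    summaries = [t for t in sequence[: last + 1] if t == summary_token]
    return summaries + sequence[last + 1 :]


stream = [f"o{i}" for i in range(1, 8)]  # 7 hypothetical observations
seq = interleave_summaries(stream, segment_len=3)
ctx = retained_context(seq)
```

Under these assumptions the retained context grows by one token per segment rather than one token per observation, which is one way memory could grow slowly with task length.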
Together, these works chart a coherent path toward agents that learn quickly, inherit rich priors, and remember what matters, moving a step closer to truly lifelong, self-improving intelligence.
- Files:
- Dissemination version (pdf, 35.8MB)
Authors
Contributors
Gal, Y
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Computer Science
- Role:
- Supervisor
- ORCID:
- 0000-0002-2733-2078
- DOI:
- Type of award:
- DPhil
- Level of award:
- Doctoral
- Awarding institution:
- University of Oxford
- Language:
- English
- Keywords:
- Subjects:
- Deposit date:
- 2026-03-07
- ARK identifier:
Terms of use
- Copyright holder:
- Gunshi Gupta
- Copyright date:
- 2025