Thesis
Data-efficient generative models for drug discovery and protein design
- Abstract:
Developing novel and more effective therapeutics is one of the most challenging and resource-intensive endeavours in biomedical research. Despite significant advances in our understanding of biological systems and disease mechanisms, the productivity of drug discovery efforts has been declining steadily. Recent years have witnessed the resurgence of deep learning algorithms as the state of the art across a range of complex modelling domains, positioning them as a powerful means of relieving key bottlenecks throughout the drug discovery pipeline. However, their successful application to molecular design faces several fundamental challenges: data is scarce, search spaces are large, and medicinal chemists often care most about novel compounds that are meaningfully different from those that have been explored in the past.
This thesis aims to improve the performance of deep learning algorithms in this setting, presenting several methodological contributions designed to make predictive and generative models more reliable and robust in low-data, out-of-distribution regimes. We begin by focusing on classifier-guided diffusion in Chapter 3, introducing a tractable and easy-to-optimise regularisation term that improves conditional sampling and enables the more reliable generation of novel molecules with desirable functional properties. Chapter 4 considers constrained diffusion, presenting two generative modelling frameworks that facilitate the direct integration of geometric and physical constraints into standard Euclidean and Riemannian diffusion processes, resulting in samples that are guaranteed to satisfy relevant feasibility and safety criteria. Chapter 5 builds on this work to develop a novel Metropolis-based discretisation scheme, resulting in significantly faster sampling, improved empirical performance and the ability to handle arbitrary non-convex constraints. Finally, Chapter 6 investigates the out-of-distribution generalisation of molecular property prediction models, introducing a semi-supervised probabilistic framework that can leverage relevant unlabelled data to improve predictive performance in a number of challenging evaluation settings.
Collectively, these research efforts demonstrate that integrating relevant prior knowledge and constraints into the modelling process can lead to more accurate and data-efficient deep learning methods that have the potential to accelerate key steps in the discovery of safer and more effective therapeutics. Concluding thoughts and promising directions for future work are presented in Chapter 7.
Authors
Contributors
+ Deane, C
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Statistics
- Role:
- Supervisor
- ORCID:
- 0000-0003-1388-2252
+ Morris, G
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Statistics
- Role:
- Supervisor
- ORCID:
- 0000-0003-1731-8405
+ Teh, Y
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Statistics
- Role:
- Supervisor
- ORCID:
- 0000-0001-5365-6933
+ Clarendon Fund
- Funding agency for:
- Klarner, LJ
- Programme:
- Clarendon Scholarship
- Type of award:
- DPhil
- Level of award:
- Doctoral
- Awarding institution:
- University of Oxford
- Language:
- English
- Deposit date:
- 2026-04-13
Terms of use
- Copyright holder:
- Leo Jannis Klarner
- Copyright date:
- 2025