Data-efficient generative models for drug discovery and protein design

Klarner, LJ

Thesis

Abstract: Developing novel and more effective therapeutics is one of the most challenging and resource-intensive endeavours in biomedical research. Despite significant advances in our understanding of biological systems and disease mechanisms, the productivity of drug discovery efforts has been declining steadily. Recent years have witnessed the resurgence of deep learning algorithms as the state-of-the-art across a range of complex modelling domains, positioning them as a powerful means of accelerating key bottlenecks throughout the drug discovery pipeline. However, their successful application to molecular design faces several fundamental challenges: data is scarce, search spaces are large, and medicinal chemists often care most about novel compounds that are meaningfully different from those that have been explored in the past.

This thesis aims to improve the performance of deep learning algorithms in this setting, presenting several methodological contributions designed to make predictive and generative models more reliable and robust in low-data, out-of-distribution regimes. We begin by focusing on classifier-guided diffusion in Chapter 3, introducing a tractable and easy-to-optimise regularisation term that improves conditional sampling and enables the more reliable generation of novel molecules with desirable functional properties. Chapter 4 considers constrained diffusion, presenting two generative modelling frameworks that facilitate the direct integration of geometric and physical constraints into standard Euclidean and Riemannian diffusion processes, resulting in samples that are guaranteed to satisfy relevant feasibility and safety criteria. Chapter 5 builds on this work to develop a novel Metropolis-based discretisation scheme, resulting in significantly faster sampling speeds, improved empirical performance and the ability to handle arbitrary non-convex constraints. Finally, Chapter 6 investigates the out-of-distribution generalisation of molecular property prediction models, introducing a semi-supervised probabilistic framework that is able to leverage relevant unlabeled data to improve predictive performance in a number of challenging evaluation settings.

Collectively, these research efforts demonstrate that integrating relevant prior knowledge and constraints into the modelling process can lead to more accurate and data-efficient deep learning methods that have the potential to accelerate key steps in the discovery of safer and more effective therapeutics. Concluding thoughts and promising directions for future work are presented in Chapter 7.

Cite this record

APA Style

Klarner, L. J. (2025). Data-efficient generative models for drug discovery and protein design [PhD thesis]. University of Oxford.

MLA Style

Klarner, LJ. Data-Efficient Generative Models for Drug Discovery and Protein Design. 2025. University of Oxford, PhD thesis.

Chicago Style

Klarner, LJ. 2025. “Data-Efficient Generative Models for Drug Discovery and Protein Design.” PhD thesis, University of Oxford.