Thesis
Machine learning for retrosynthesis and synthesisable molecule generation in drug discovery
- Abstract:
Drug discovery is a notoriously difficult and slow process, with high research and development costs and a decreasing success rate. Computer-aided drug design methods show promise in improving the efficiency of early-stage drug discovery by increasing the number of compounds that can be evaluated per design cycle and by allowing molecules to be pre-filtered with fast computational methods before they are synthesised. However, many compounds designed in silico are not synthesisable in practice, or the synthesis routes towards them are not obvious. As a result, computational resources are wasted on designing molecules that can never be tested experimentally. This thesis explores new methods for two approaches to assessing and improving synthesisability in drug discovery: retrosynthesis prediction and synthesisability-constrained molecule generation.
First, the problem of retrosynthesis prediction for molecules containing heterocyclic scaffolds is considered. Four domain adaptation approaches are benchmarked to develop a single-step retrosynthesis prediction model with improved performance for ring disconnections. The approaches are compared on accuracy, both for heterocycle formations and across all reaction classes, and on computational cost. A fine-tuning workflow for continual retraining of the model with newly published data is also introduced. The most versatile model, trained with a mixed fine-tuning strategy, is then applied to multi-step retrosynthesis in a retrospective analysis of two drug-like compounds.
Next, the development of retro-active, a method for synthesisable molecule generation and optimisation, is described. Retro-active generates molecules from a known synthesis route and a provided starting material pool. Using active learning to select starting materials allows the resulting product molecules to be optimised for user-defined scoring functions. A benchmark of starting material acquisition and product enumeration methods is included, as well as a comparison with alternative starting material selection approaches that do not use machine learning. The applicability of retro-active to both ligand-based and structure-based drug discovery is demonstrated.
The use case of retro-active is then extended to multi-parameter optimisation, to simulate a real-life drug discovery scenario. The compounds are optimised for their structural, physicochemical, and ADMET properties, with a scoring function that combines physics-based and machine learning-based scores. The robustness of the method is demonstrated with both convergent and linear synthesis route topologies and ligands for different target proteins.
The thesis concludes with final remarks regarding retrosynthesis prediction and synthesisable molecule generation with retro-active, including future research directions and challenges in the field.
Access Document
- Files:
- Dissemination version (pdf, 22.0MB)
Authors
Contributors
+ Duarte Gonzalez, F
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Chemistry
- Sub department:
- Organic Chemistry
- Role:
- Supervisor
- ORCID:
- 0000-0002-6062-8209
+ Brennan, P
- Institution:
- University of Oxford
- Division:
- MSD
- Department:
- NDM
- Role:
- Supervisor
+ Engineering and Physical Sciences Research Council
- Funder identifier:
- https://ror.org/0439y7842
- Grant:
- EP/S024093/1
- DOI:
- Type of award:
- DPhil
- Level of award:
- Doctoral
- Awarding institution:
- University of Oxford
- Language:
- English
- Keywords:
- Subjects:
- Deposit date:
- 2026-02-14
- ARK identifier:
Terms of use
- Copyright holder:
- Ewa Wieczorek
- Copyright date:
- 2024