Journal article
An exploration of dataset bias in single-step retrosynthesis prediction
- Abstract:
- Single-step retrosynthesis models are integral to the development of computer-aided synthesis planning (CASP) tools, leveraging past reaction data to generate new synthetic pathways. However, it remains unclear how the diversity of reactions within a training set impacts model performance. Here, we assess how dataset size and diversity, as defined using automatically extracted reaction templates, affect accuracy and reaction feasibility of three state-of-the-art architectures – template-based LocalRetro and template-free MEGAN and RootAligned. We show that increasing the diversity of the training set (from 1k to 10k templates) significantly increases top-5 round-trip accuracy while reducing top-10 accuracy, impacting prediction feasibility and recall, respectively. In contrast, increasing dataset size without increasing template diversity yields minimal performance gains for LocalRetro and MEGAN, showing that these architectures are robust even with smaller datasets. Moreover, reaction templates that are less common in the training dataset have significantly lower top-k accuracy than more common ones, regardless of the model architecture. Finally, we use an external data source to validate the drastic difference between top-k accuracies on seen and unseen templates, showing that there is limited capability for generalisation to novel disconnections. Our findings suggest that reaction templates can be used to describe the underlying diversity of reaction datasets and the scope of trained models, and that the task of single-step retrosynthesis suffers from a class imbalance problem.
- Publication status:
- Published
- Peer review status:
- Peer reviewed
Actions
Access Document
- Files:
-
-
(Preview, Version of record, pdf, 1.3MB, Terms of use)
-
- Publisher copy:
- 10.1039/d5dd00358j
Authors
+ Engineering and Physical Sciences Research Council
More from this funder
- Funder identifier:
- https://ror.org/0439y7842
- Publisher:
- Royal Society of Chemistry
- Journal:
- Digital Discovery More from this journal
- Publication date:
- 2025-12-29
- Acceptance date:
- 2025-12-22
- DOI:
- EISSN:
-
2635-098X
- ISSN:
-
2635-098X
- Language:
-
English
- Keywords:
- Pubs id:
-
2364440
- UUID:
-
uuid_0c642764-7949-4c00-983b-c0c8170bd91d
- Local pid:
-
pubs:2364440
- Source identifiers:
-
3678439
- Deposit date:
-
2026-01-21
- ARK identifier:
This ORA record was generated from metadata provided by an external service. It has not been edited by the ORA Team.
Terms of use
- Copyright date:
- 2025
- Licence:
- CC Attribution (CC BY) 3.0
If you are the owner of this record, you can report an update to it here: Report update to this record