Journal article

An exploration of dataset bias in single-step retrosynthesis prediction

Abstract:
Single-step retrosynthesis models are integral to the development of computer-aided synthesis planning (CASP) tools, leveraging past reaction data to generate new synthetic pathways. However, it remains unclear how the diversity of reactions within a training set impacts model performance. Here, we assess how dataset size and diversity, as defined using automatically extracted reaction templates, affect accuracy and reaction feasibility of three state-of-the-art architectures – template-based LocalRetro and template-free MEGAN and RootAligned. We show that increasing the diversity of the training set (from 1k to 10k templates) significantly increases top-5 round-trip accuracy while reducing top-10 accuracy, impacting prediction feasibility and recall, respectively. In contrast, increasing dataset size without increasing template diversity yields minimal performance gains for LocalRetro and MEGAN, showing that these architectures are robust even with smaller datasets. Moreover, reaction templates that are less common in the training dataset have significantly lower top-k accuracy than more common ones, regardless of the model architecture. Finally, we use an external data source to validate the drastic difference between top-k accuracies on seen and unseen templates, showing that there is limited capability for generalisation to novel disconnections. Our findings suggest that reaction templates can be used to describe the underlying diversity of reaction datasets and the scope of trained models, and that the task of single-step retrosynthesis suffers from a class imbalance problem.
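The comparison of top-k accuracy across common, rare and unseen templates can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the `rare_threshold` cut-off and the three bin labels are assumptions made for illustration, and predictions are assumed to be comparable by exact string match (for example, canonical SMILES).

```python
from collections import Counter


def top_k_hit(predictions, truth, k):
    """Return True if the ground truth appears among the first k ranked predictions."""
    return truth in predictions[:k]


def top_k_accuracy_by_frequency(train_templates, test_records, k, rare_threshold):
    """Split top-k accuracy by how often each test template occurs in training.

    train_templates: list of template ids, one per training reaction.
    test_records: list of (template_id, ranked_predictions, truth) tuples.
    rare_threshold: hypothetical cut-off; templates seen fewer times are "rare".
    Returns a dict mapping bin name to accuracy, or None for an empty bin.
    """
    counts = Counter(train_templates)
    hits = {"unseen": [], "rare": [], "common": []}
    for template_id, predictions, truth in test_records:
        n = counts[template_id]  # Counter returns 0 for templates never seen
        if n == 0:
            bin_name = "unseen"
        elif n < rare_threshold:
            bin_name = "rare"
        else:
            bin_name = "common"
        hits[bin_name].append(top_k_hit(predictions, truth, k))
    return {name: (sum(v) / len(v) if v else None) for name, v in hits.items()}
```

Reporting accuracy per bin rather than as a single average is one way to expose the class imbalance the abstract describes, since a single figure would be dominated by the most frequent templates.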
Publication status:
Published
Peer review status:
Peer reviewed

Files:
Publisher copy:
10.1039/d5dd00358j

Authors

Role:
Author
ORCID:
0000-0002-6528-7757
Institution:
University of Oxford
Division:
MSD
Department:
NDM
Sub department:
CMD
Role:
Author
ORCID:
0009-0007-7935-298X
Role:
Author
ORCID:
0000-0002-6062-8209



Publisher:
Royal Society of Chemistry
Journal:
Digital Discovery
Publication date:
2025-12-29
Acceptance date:
2025-12-22
DOI:
10.1039/d5dd00358j
EISSN:
2635-098X
ISSN:
2635-098X


Language:
English
Pubs id:
2364440
UUID:
uuid_0c642764-7949-4c00-983b-c0c8170bd91d
Local pid:
pubs:2364440
Source identifiers:
3678439
Deposit date:
2026-01-21
This ORA record was generated from metadata provided by an external service. It has not been edited by the ORA Team.
