Journal article icon

Journal article

Are we fitting data or noise? Analysing the predictive power of commonly used datasets in drug-, materials-, and molecular-discovery

Abstract:
Data-driven techniques for establishing quantitative structure property relations are a pillar of modern materials and molecular discovery. Fuelled by the recent progress in deep learning methodology and the abundance of new algorithms, it is tempting to chase benchmarks and incrementally build ever more capable machine learning (ML) models. While model evaluation has made significant progress, the intrinsic limitations arising from the underlying experimental data are often overlooked. In the chemical sciences data collection is costly, thus datasets are small and experimental errors can be significant. These limitations of such datasets affect their predictive power, a fact that is rarely considered in a quantitative way. In this study, we analyse commonly used ML datasets for regression and classification from drug discovery, molecular discovery, and materials discovery. We derived maximum and realistic performance bounds for nine such datasets by introducing noise based on estimated or actual experimental errors. We then compared the estimated performance bounds to the reported performance of leading ML models in the literature. Out of the nine datasets and corresponding ML models considered, four were identified to have reached or surpassed dataset performance limitations and thus, they may potentially be fitting noise. More generally, we systematically examine how data range, the magnitude of experimental error, and the number of data points influence dataset performance bounds. Alongside this paper, we release the Python package NoiseEstimator and provide a web-based application for computing realistic performance bounds. This study and the resulting tools will help practitioners in the field understand the limitations of datasets and set realistic expectations for ML model performance. This work stands as a reference point, offering analysis and tools to guide development of future ML models in the chemical sciences.
Publication status:
Published
Peer review status:
Peer reviewed

Actions


Access Document


Files:
Publisher copy:
10.1039/d4fd00091a

Authors


More by this author
Institution:
University of Oxford
Role:
Author
ORCID:
0000-0002-1305-9517
More by this author
Role:
Author
ORCID:
0000-0002-5015-1443
More by this author
Institution:
University of Oxford
Role:
Author
ORCID:
0000-0001-5100-8836



Publisher:
Royal Society of Chemistry
Journal:
Faraday Discussions More from this journal
Publication date:
2024-06-04
Acceptance date:
2024-06-04
DOI:
EISSN:
1364-5498
ISSN:
1359-6640


Language:
English
Source identifiers:
2278588
Deposit date:
2024-09-23
This ORA record was generated from metadata provided by an external service. It has not been edited by the ORA Team.

Terms of use



Views and Downloads






If you are the owner of this record, you can report an update to it here: Report update to this record

TO TOP