Thesis
Improving the understanding of cancer and cancer care by applying data science and machine learning methods to electronic patient records
- Abstract:
-
Electronic health records (EHRs) hold great potential for improving the understanding of cancer care because they contain high-resolution real-world data for large numbers of patients. This dissertation explores the application of data science and machine learning (ML) methods to EHRs for the purposes of translational colorectal cancer (CRC) research.
I first explore the challenges in using EHRs throughout the data life cycle. I present a lightweight information extraction pipeline that retrieves TNM staging scores (common descriptors of cancer severity) from free-text clinical reports with high sensitivity and precision, and also retrieves information about the presence and recurrence of CRC. These data items are essential to CRC research: for identifying cases, studying treatment variation, and comparing treatment outcomes. The pipeline was developed using data from Oxford University Hospitals (OUH) and Royal Marsden (RMH) NHS Foundation Trusts (FT), and supported the establishment of the National Institute for Health Research (NIHR) Health Informatics Collaborative (HIC) CRC database.
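As a rough illustration of what rule-based staging extraction from free text can look like, the sketch below pulls T, N, and M categories out of a report sentence with a regular expression. It is not the pipeline described in this thesis: the pattern `TNM_PATTERN`, the function `extract_tnm`, and the sample sentence are all hypothetical, and real reports need far more handling (negation, ranges, multiple tumours, spelling variants).

```python
import re

# Hypothetical pattern (an assumption, not the thesis algorithm):
# optional lowercase prefix (e.g. p, c, y, r), then T and N categories,
# then an optional M category. Upper-case T/N/M are required on purpose
# so that ordinary lowercase words are not matched.
TNM_PATTERN = re.compile(
    r"\b(?P<prefix>[ycpra]{0,3})"
    r"T(?P<T>is|[0-4][a-d]?|X)\s*"
    r"N(?P<N>[0-3][a-c]?|X)\s*"
    r"(?:M(?P<M>[01][a-c]?|X))?"
)


def extract_tnm(text):
    """Return one dict per TNM-like mention found in the text."""
    mentions = []
    for match in TNM_PATTERN.finditer(text):
        mentions.append(
            {
                "prefix": match.group("prefix") or None,
                "T": match.group("T"),
                "N": match.group("N"),
                "M": match.group("M"),  # None when no M category is given
            }
        )
    return mentions


if __name__ == "__main__":
    # Invented example sentence, for illustration only.
    print(extract_tnm("Histology: pT3 N1b M0, margins clear."))
```

A pattern like this is easy to audit, which suits a lightweight pipeline, but its sensitivity and precision would have to be measured against manually annotated reports before any use.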
I then focus on a specific application: combining faecal immunochemical test (FIT) results with routinely collected data to predict CRC in symptomatic patients. The current practice is to refer patients with FIT above 10 μg/g for invasive endoscopic investigations, but only one in six of those investigated have CRC, motivating prediction model development. I demonstrate that an externally derived model does not outperform FIT in the Oxford University Hospitals FIT dataset (OUH-FIT), and highlight the importance of clinically relevant performance measures. I then show that employing more predictors, a spectrum of ML models, and novel training methods was not sufficient to outperform FIT on OUH-FIT data. Finally, I build on an existing sequence analysis method and incorporate it into an interactive app that allows users to explore and cluster thousands of medical event sequences, for example to visualise treatment patterns of CRC patients.
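To make the idea of clinically relevant performance measures concrete, the sketch below computes sensitivity and positive predictive value for a simple "refer if FIT is above a threshold" rule. The function `referral_metrics` and the toy values are assumptions made for illustration; they are not taken from OUH-FIT and do not reproduce any result in the thesis.

```python
def referral_metrics(fit_values, has_crc, threshold=10.0):
    """Evaluate the rule 'refer if FIT > threshold' against known outcomes.

    fit_values: FIT results in micrograms per gram.
    has_crc:    booleans, True if the patient was diagnosed with CRC.
    """
    tp = fp = fn = 0
    for value, crc in zip(fit_values, has_crc):
        referred = value > threshold
        if referred and crc:
            tp += 1  # referred and has cancer
        elif referred and not crc:
            fp += 1  # referred but no cancer found
        elif crc:
            fn += 1  # cancer missed by the rule

    referred_total = tp + fp
    cancers_total = tp + fn
    return {
        "referred": referred_total,
        # Share of cancers that the rule sends for investigation.
        "sensitivity": tp / cancers_total if cancers_total else float("nan"),
        # Share of referred patients who actually have cancer.
        "ppv": tp / referred_total if referred_total else float("nan"),
    }


if __name__ == "__main__":
    # Synthetic toy data, invented for this example only.
    fit = [2, 15, 120, 8, 40, 11]
    crc = [False, False, True, True, False, False]
    print(referral_metrics(fit, crc))
```

The same function can be applied to a model's predicted risk by passing the risk scores and a risk cut-off in place of the FIT values, so that the model and the FIT rule are compared at a matched number of referrals or a matched sensitivity.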
The principal contributions are: a holistic discussion of EHR data quality; a staging extraction algorithm that facilitates further research and audits; a comprehensive pipeline for developing and evaluating FIT-based CRC prediction models; and a fast medical sequence exploration app that can help check data quality and identify treatment variations. There is considerable potential to use these tools on larger datasets to understand whether FIT-based models are bound to fail (or whether they may work on subgroups with more severe disease), and to contrast different treatment patterns employed for subgroups of CRC patients with complex disease, such as those with liver metastases.
Authors
Contributors
- Institution:
- University of Oxford
- Role:
- Supervisor
- Institution:
- University of Oxford
- Division:
- MSD
- Department:
- Primary Care Health Sciences
- Role:
- Supervisor
- Institution:
- University of Oxford
- Division:
- MSD
- Department:
- Nuffield Department of Population Health
- Role:
- Supervisor
- Funder identifier:
- https://ror.org/0439y7842
- Grant:
- EP/S02428X/1
- Programme:
- Centre for Doctoral Training in Health Data Science
- DOI:
- Type of award:
- DPhil
- Level of award:
- Doctoral
- Awarding institution:
- University of Oxford
- Language:
-
English
- Keywords:
- Subjects:
- Deposit date:
-
2025-04-14
Terms of use
- Copyright holder:
- Andres Tamm
- Copyright date:
- 2023