Thesis
Novel machine learning for applications in cancer genomics
- Abstract:
Genomics has advanced rapidly in the past decade, with whole-genome sequencing and single-cell RNA sequencing now routine in cancer research. Machine and deep learning have also taken off, but applying them to biology remains challenging due to confounding factors, irregularly distributed data, and a desire for causal insight rather than prediction. In genomics, the lack of ground-truth labels further limits supervised learning. This work develops methods bridging both domains.
The first part of my work revisits copy number alteration calling in cancer, introducing araCNA, a deep learning model trained via simulation rather than to emulate the outputs of other models. Using novel long-range sequence models such as Mamba, araCNA predicts copy number profiles for whole-genome sequenced cancer samples. araCNA presents a different paradigm for applying deep learning models in genomics: for amortised inference rather than as emulators.
The second part of my work focuses on unsupervised discovery in single-cell RNA sequencing (scRNA-seq). I investigate the assumptions of the standard scRNA-seq pipeline and show that most approaches overlook the sparse, near-binary nature of scRNA-seq data. To address this, I develop bfact, a Boolean matrix factorisation (BMF) method combining combinatorial optimisation with heuristic post-processing. bfact outperforms existing BMF methods and, when applied to scRNA-seq, finds biologically relevant gene programs beyond those found by current approaches.
Authors
Contributors
+ Yau, C
- Institution:
- University of Oxford
- Division:
- MSD
- Department:
- Women's & Reproductive Health
- Role:
- Supervisor
- ORCID:
- 0000-0001-7615-8523
+ Koohy, H
- Institution:
- University of Oxford
- Division:
- MSD
- Department:
- Radcliffe Department of Medicine
- Role:
- Examiner
+ Rattray, M
- Institution:
- University of Manchester
- Role:
- Examiner
+ Engineering and Physical Sciences Research Council
- Funder identifier:
- https://ror.org/0439y7842
- Funding agency for:
- Visscher, E
- Grant:
- EP/S02428X/1
- Programme:
- Oxford EPSRC Centre for Doctoral Training in Health Data Science
- Type of award:
- DPhil
- Level of award:
- Doctoral
- Awarding institution:
- University of Oxford
- Language:
- English
- Deposit date:
- 2026-05-09
Terms of use
- Copyright holder:
- Ellen Visscher
- Copyright date:
- 2025
- Notes:
- "araCNA: somatic copy number profiling using long-range sequence models" is derived from this thesis.