Conference item
Navigating severe class imbalance in population cohort data
- Abstract:
- Class imbalance is a major challenge in predictive modelling for rare disease outcomes, particularly in large-scale population cohorts. Traditional machine learning models often struggle with imbalanced datasets, leading to biased performance metrics and poor generalisability. This study systematically evaluates multiple approaches to mitigating class imbalance when predicting multiple myeloma from proteomic and clinical data in UK Biobank. We compare standard classification models (XGBoost and logistic regression) with synthetic resampling (SMOTE), anomaly detection techniques (isolation forests, local outlier factors, one-class SVM, and autoencoders), and a transformer-based foundation model (TabPFN), using standard classification performance metrics. Our results indicate that anomaly detection models generalise better than the conventional classifiers, while SMOTE fails to improve, and may actively worsen, predictive performance. To address the precision-sensitivity trade-off, we introduce a sequential XGBoost ensemble classifier (SeqXGB) that prioritises high precision over sensitivity to minimise false positive predictions. Compared with a single XGBoost model, SeqXGB reduces false positives in held-out test data from 420 to 9, but at the cost of substantially lower sensitivity (0.15 vs 0.70 for the single model). Our findings highlight that no single method is universally optimal for addressing class imbalance; rather, model selection should be guided by clinical application, balancing the risks of false positives and false negatives.
- Publication status:
- Published
- Peer review status:
- Peer reviewed
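The precision-sensitivity trade-off described in the abstract can be illustrated with a minimal sketch. It assumes a sequential ensemble flags a sample only when every stage's score clears its threshold; the abstract does not describe how SeqXGB is actually built, so this agreement rule, the function names, and the toy scores below are all hypothetical and are not the paper's data or method.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 marks a case."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision(tp, fp):
    """Fraction of flagged samples that are true cases."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity(tp, fn):
    """Fraction of true cases that are flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def sequential_predict(stage_scores, thresholds):
    """Flag a sample only if it clears the threshold at every stage.

    Assumption: a simple all-stages-agree rule, used here for illustration.
    """
    n_samples = len(stage_scores[0])
    return [
        int(all(scores[i] >= thr for scores, thr in zip(stage_scores, thresholds)))
        for i in range(n_samples)
    ]


# Toy, made-up data: 4 cases and 8 controls.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
stage1 = [0.9, 0.8, 0.7, 0.6, 0.7, 0.6, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4]
stage2 = [0.9, 0.4, 0.8, 0.3, 0.2, 0.3, 0.9, 0.1, 0.1, 0.2, 0.1, 0.1]

# A single model uses one stage; the sequential rule requires both stages.
single_pred = sequential_predict([stage1], [0.5])
seq_pred = sequential_predict([stage1, stage2], [0.5, 0.5])

single_counts = confusion_counts(y_true, single_pred)
seq_counts = confusion_counts(y_true, seq_pred)

for name, (tp, fp, fn, tn) in [("single", single_counts), ("sequential", seq_counts)]:
    print(name, "FP:", fp, "precision:", precision(tp, fp), "sensitivity:", sensitivity(tp, fn))
```

In this toy example, requiring agreement removes both false positives but also misses two of the four cases, the same direction of trade-off the abstract reports for SeqXGB against a single XGBoost model.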
- Publisher copy:
- 10.1109/EMBC58623.2025.11254293
Authors
+ National Institute for Health Research
- Funder identifier:
- https://ror.org/0187kwz08
- Grant:
- NIHR302440
- Publisher:
- IEEE
- Publication date:
- 2025-12-03
- Acceptance date:
- 2025-04-08
- Event title:
- 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2025)
- Event location:
- Copenhagen, Denmark
- Event website:
- https://embc.embs.org/2025/
- Event start date:
- 2025-07-14
- Event end date:
- 2025-07-17
- DOI:
- EISSN:
- 2694-0604
- ISSN:
- 2375-7477
- Language:
- English
- Keywords:
- Pubs id:
- 2121412
- Local pid:
- pubs:2121412
- Deposit date:
- 2025-05-02
- ARK identifier:
Terms of use
- Copyright holder:
- IEEE
- Copyright date:
- 2025
- Rights statement:
- © IEEE 2025
- Notes:
- This paper was presented at the 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2025), 14th-17th July 2025, Copenhagen, Denmark. The author accepted manuscript (AAM) of this paper has been made available under the University of Oxford's Open Access Publications Policy, and a CC BY public copyright licence has been applied.
- Licence:
- CC Attribution (CC BY)