Journal article

Can general purpose large language models assist pediatricians in predicting infants with serious bacterial infection?

Abstract:
Background: Serious bacterial infection (SBI) in neonates and young infants often presents with nonspecific symptoms and clinical signs in the early stages of illness, making early diagnosis challenging. Timely recognition and appropriate treatment are essential to prevent adverse outcomes. Several clinical algorithms are widely used for SBI risk stratification, but these tools have limitations, particularly low positive predictive value (PPV). This study evaluates the diagnostic accuracy of general-purpose large language models (LLMs) in detecting SBI in neonates and infants under 90 days of age admitted to the emergency department. Our objective is to improve diagnostic precision, reduce unnecessary interventions, and enhance patient outcomes. LLM performance was compared against traditional machine learning models, state-of-the-art rule-based methods, and an ensemble of physicians to assess the LLMs' potential as clinical decision-support tools in scenarios of diagnostic uncertainty.
Results: On a dataset of 742 patients, LLMs demonstrated diagnostic accuracy comparable to traditional machine learning models and state-of-the-art rule-based methods. The optimized (class-weighted) CatBoost model achieved the best overall performance, with a PPV of 0.70, negative predictive value (NPV) of 0.90, sensitivity of 0.54, specificity of 0.95, F1-score of 0.60, and Matthews correlation coefficient (MCC) of 0.54, outperforming the baseline CatBoost model and achieving results on par with LLMs and physicians. When optimally prompted, LLMs performed on par with ensembles of experienced clinicians. Additionally, LLMs exhibited effective medical reasoning and provided credible diagnostic predictions, particularly valuable in cases of clinician uncertainty. The models achieved balanced performance across multiple evaluation metrics, including PPV, NPV, sensitivity, specificity, F1-score, and MCC. ChatGPT-4o achieved a sensitivity of 0.65 and specificity of 0.83, with an MCC of 0.41. Claude Sonnet 3.5 reached a sensitivity of 0.60 and specificity of 0.86, with an MCC of 0.42. Google Gemini 2.0 Flash had lower sensitivity (0.43) but the highest specificity among the LLMs (0.94), with an MCC of 0.43. In comparison, the best-performing individual pediatrician achieved a higher sensitivity (0.74) but lower specificity (0.68), with an MCC of 0.33, while the pediatricians' majority vote yielded a sensitivity of 0.69, specificity of 0.81, and MCC of 0.43, comparable to the top-performing LLMs.
Conclusions: These artificial intelligence tools offer a promising direction for SBI risk prediction, achieving performance comparable to that of experienced pediatric specialists while keeping use and data preprocessing simple for potential real-world applications.
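For readers unfamiliar with the evaluation metrics named in the abstract, the following is a minimal sketch of how PPV, NPV, sensitivity, specificity, F1-score, and MCC are commonly computed from a binary confusion matrix, and how a majority vote over several raters can be formed. The function names (`binary_metrics`, `majority_vote`) and the counts in the usage lines are illustrative assumptions; they are not taken from the study and do not reproduce its results or its code.

```python
import math


def _safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives, with "positive" meaning SBI predicted or present.
    """
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ppv": _safe_div(tp, tp + fp),            # positive predictive value
        "npv": _safe_div(tn, tn + fn),            # negative predictive value
        "sensitivity": _safe_div(tp, tp + fn),    # true positive rate
        "specificity": _safe_div(tn, tn + fp),    # true negative rate
        "f1": _safe_div(2 * tp, 2 * tp + fp + fn),
        "mcc": _safe_div(tp * tn - fp * fn, mcc_den),
    }


def majority_vote(votes):
    """Return True when strictly more than half of the binary votes are positive."""
    return 2 * sum(bool(v) for v in votes) > len(votes)


# Hypothetical counts for illustration only (not data from the article).
example = binary_metrics(tp=40, fp=10, tn=90, fn=20)
# Hypothetical panel of three raters, two of whom predict SBI.
panel_decision = majority_vote([True, True, False])
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside sensitivity and specificity when the classes are imbalanced.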
Publication status:
Published
Peer review status:
Peer reviewed

Publisher copy:
10.1186/s12911-025-03258-3

Authors

Institution:
University of Oxford
Division:
MPLS
Department:
Computer Science
Sub department:
Computer Science
Role:
Author


Publisher:
BioMed Central
Journal:
BMC Medical Informatics and Decision Making
Volume:
25
Issue:
1
Article number:
423
Publication date:
2025-11-14
Acceptance date:
2025-10-22
DOI:
10.1186/s12911-025-03258-3
EISSN:
1472-6947
ISSN:
1472-6947


Language:
English
Keywords:
Pubs id:
2350293
UUID:
uuid_fc425c13-fb5e-4008-ab69-157c70ff56cb
Local pid:
pubs:2350293
Source identifiers:
3475850
Deposit date:
2025-11-15
ARK identifier:
This ORA record was generated from metadata provided by an external service. It has not been edited by the ORA Team.
