Journal article
Can general purpose large language models assist pediatricians in predicting infants with serious bacterial infection?
- Abstract:
- Background: Serious bacterial infection (SBI) in neonates and young infants often presents with nonspecific symptoms and clinical signs in the early stages of illness, making early diagnosis challenging. Timely recognition and appropriate treatment are essential to prevent adverse outcomes. Several clinical algorithms are widely used for SBI risk stratification, but these tools have limitations, particularly low positive predictive value (PPV). This study evaluates the diagnostic accuracy of general-purpose large language models (LLMs) in detecting SBI in neonates and infants under 90 days of age admitted to the emergency department. Our objective is to improve diagnostic precision, reduce unnecessary interventions, and enhance patient outcomes. LLM performance was compared against traditional machine learning models, state-of-the-art rule-based methods, and an ensemble of physicians to assess the potential of LLMs as clinical decision-support tools in scenarios of diagnostic uncertainty.
- Results: On a dataset of 742 patients, LLMs demonstrated diagnostic accuracy comparable to traditional machine learning models and state-of-the-art rule-based methods. The optimized, class-weighted CatBoost model achieved the best overall performance, with a PPV of 0.70, negative predictive value (NPV) of 0.90, sensitivity of 0.54, specificity of 0.95, F1-score of 0.60, and Matthews correlation coefficient (MCC) of 0.54, outperforming the baseline CatBoost model and achieving results on par with the LLMs and physicians. When optimally prompted, LLMs performed on par with ensembles of experienced clinicians. Additionally, LLMs exhibited effective medical reasoning and provided credible diagnostic predictions, which is particularly valuable in cases of clinician uncertainty. The LLMs achieved balanced performance across multiple evaluation metrics, including PPV, NPV, sensitivity, specificity, F1-score, and MCC. ChatGPT-4o achieved a sensitivity of 0.65 and a specificity of 0.83, with an MCC of 0.41. Claude Sonnet 3.5 reached a sensitivity of 0.60 and a specificity of 0.86, with an MCC of 0.42. Google Gemini 2.0 Flash had lower sensitivity (0.43) but the highest specificity (0.94), with an MCC of 0.43. In comparison, the best-performing individual pediatrician achieved a higher sensitivity (0.74) but lower specificity (0.68), with an MCC of 0.33, while the pediatricians' majority vote yielded a sensitivity of 0.69, a specificity of 0.81, and an MCC of 0.43, comparable to the top-performing LLMs.
- Conclusions: These artificial intelligence tools offer a promising direction for SBI risk prediction, achieving performance comparable to that of experienced pediatric specialists while remaining simple to use and requiring little data preprocessing, which matters for potential real-world applications.
- Publication status:
- Published
- Peer review status:
- Peer reviewed
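The evaluation metrics named in the abstract (PPV, NPV, sensitivity, specificity, F1-score, and MCC) can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch of those standard definitions, not the authors' evaluation code. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
from math import sqrt


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp: SBI cases predicted as SBI        fp: non-SBI cases predicted as SBI
    tn: non-SBI cases predicted non-SBI   fn: SBI cases predicted as non-SBI
    """
    sensitivity = _ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = _ratio(tn, tn + fp)   # true negative rate
    ppv = _ratio(tp, tp + fp)           # positive predictive value (precision)
    npv = _ratio(tn, tn + fn)           # negative predictive value
    f1 = _ratio(2 * tp, 2 * tp + fp + fn)
    # Matthews correlation coefficient: uses all four cells, so it stays
    # informative when classes are imbalanced.
    mcc = _ratio(
        tp * tn - fp * fn,
        sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only (NOT data from the study).
example = binary_metrics(tp=40, fp=10, tn=90, fn=20)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

Reporting MCC alongside sensitivity and specificity is useful here because a model can trade one for the other, as the figures in the abstract illustrate, while MCC summarizes agreement across both classes in a single value.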
- Publisher copy:
- 10.1186/s12911-025-03258-3
Authors
- Publisher:
- BioMed Central
- Journal:
- BMC Medical Informatics and Decision Making
- Volume:
- 25
- Issue:
- 1
- Article number:
- 423
- Publication date:
- 2025-11-14
- Acceptance date:
- 2025-10-22
- DOI:
- 10.1186/s12911-025-03258-3
- EISSN:
- 1472-6947
- ISSN:
- 1472-6947
- Language:
- English
- Keywords:
- Pubs id:
- 2350293
- UUID:
- uuid_fc425c13-fb5e-4008-ab69-157c70ff56cb
- Local pid:
- pubs:2350293
- Source identifiers:
- 3475850
- Deposit date:
- 2025-11-15
- ARK identifier:
This ORA record was generated from metadata provided by an external service. It has not been edited by the ORA Team.
Terms of use
- Copyright date:
- 2025