Journal article

Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care

Abstract:

Background

Large language models exhibiting human-level performance in specialized tasks are emerging; examples include Generative Pretrained Transformer 3.5, which underlies the processing of ChatGPT. Rigorous trials are required to understand the capabilities of emerging technology, so that innovation can be directed to benefit patients and practitioners.

Objective

Here, we evaluated the strengths and weaknesses of ChatGPT in primary care using the Membership of the Royal College of General Practitioners Applied Knowledge Test (AKT) as a medium.

Methods

AKT questions were sourced from a web-based question bank and 2 AKT practice papers. In total, 674 unique AKT questions were inputted to ChatGPT, with the model's answers recorded and compared to correct answers provided by the Royal College of General Practitioners. Each question was inputted twice in separate ChatGPT sessions, with answers on repeated trials compared to gauge consistency. Subject difficulty was gauged by referring to examiners' reports from 2018 to 2022. Novel explanations from ChatGPT, defined as information provided that was not inputted within the question or multiple answer choices, were recorded. Performance was analyzed with respect to subject, difficulty, question source, and novel model outputs to explore ChatGPT's strengths and weaknesses.
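The two-trial design described above can be illustrated with a minimal sketch. This is not the authors' analysis code: the function name `score_trials`, the dictionary keys, and the toy answers are all hypothetical, and the abstract does not state exactly how the two trials were combined or how consistency was quantified. The sketch assumes accuracy is the fraction of answers matching the key and consistency is the fraction of questions answered identically in both sessions.

```python
from typing import Dict, Sequence


def score_trials(answer_key: Sequence[str],
                 trial_1: Sequence[str],
                 trial_2: Sequence[str]) -> Dict[str, float]:
    """Return per-trial accuracy and between-trial consistency as fractions.

    Assumption (not from the source): one single-best-answer label per
    question, listed in the same order in all three sequences.
    """
    if not (len(answer_key) == len(trial_1) == len(trial_2)):
        raise ValueError("all three sequences must have the same length")
    if not answer_key:
        raise ValueError("at least one question is required")

    n = len(answer_key)
    # Count answers that match the official key in each session.
    correct_1 = sum(a == k for a, k in zip(trial_1, answer_key))
    correct_2 = sum(a == k for a, k in zip(trial_2, answer_key))
    # Count questions where both sessions gave the same answer,
    # whether or not that answer was correct.
    same = sum(a == b for a, b in zip(trial_1, trial_2))

    return {
        "accuracy_trial_1": correct_1 / n,
        "accuracy_trial_2": correct_2 / n,
        "mean_accuracy": (correct_1 + correct_2) / (2 * n),
        "consistency": same / n,
    }


# Hypothetical toy data, not taken from the study.
key = ["A", "C", "B", "D"]
first_session = ["A", "C", "D", "D"]
second_session = ["A", "B", "D", "D"]
result = score_trials(key, first_session, second_session)
```

Note that consistency here is independent of correctness: a model that gives the same wrong answer twice is counted as consistent.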

Results

Average overall performance of ChatGPT was 60.17%, which is below the mean passing mark in the last 2 years (70.42%). Accuracy differed between sources (P=.04 and .06). ChatGPT's performance varied with subject category (P=.02 and .02), but variation did not correlate with difficulty (Spearman ρ=-0.241 and -0.238; P=.19 and .20). The proclivity of ChatGPT to provide novel explanations did not affect accuracy (P>.99 and .23).
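For readers unfamiliar with the Spearman coefficient reported above, the following is a minimal pure-Python sketch of how such a rank correlation can be computed. It is an illustration only, not the authors' method: the helper names `average_ranks` and `spearman_rho` and the toy numbers are hypothetical, and the sketch assumes tied values share the mean of their ranks. It does not compute a P value.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-subject values, not taken from the study.
difficulty = [1.0, 2.0, 3.0, 4.0, 5.0]
accuracy = [0.70, 0.55, 0.65, 0.50, 0.60]
rho = spearman_rho(difficulty, accuracy)
```

A value of rho near 0 indicates little monotonic relationship between the two variables; values near 1 or -1 indicate a strong increasing or decreasing relationship.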

Conclusions

Large language models are approaching human expert-level performance, although further development is required to match the performance of qualified primary care physicians in the AKT. Validated high-performance models may serve as assistants or autonomous clinical tools to ameliorate the general practice workforce crisis.
Publication status:
Published
Peer review status:
Peer reviewed

Files:
Publisher copy:
10.2196/46599

Authors

Institution:
University of Oxford
Role:
Author
ORCID:
0000-0001-8968-4768

Role:
Author
ORCID:
0000-0002-3054-1161

Role:
Author
ORCID:
0009-0008-4209-1306

Role:
Author
ORCID:
0000-0001-6370-8426

Role:
Author
ORCID:
0009-0009-0327-1221


Publisher:
JMIR Publications
Journal:
JMIR Medical Education
Volume:
9
Pages:
e46599
Publication date:
2023-04-11
EISSN:
2369-3762
ISSN:
2369-3762


Language:
English
Pubs id:
2341973
UUID:
uuid_a6a31fa6-0cc0-49e1-b9f7-65ee7decff79
Local pid:
pubs:2341973
Source identifiers:
W4364378939
Deposit date:
2025-12-03
This ORA record was generated from metadata provided by an external service. It has not been edited by the ORA Team.
