Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care

Thirunavukarasu, AJ; Hassan, R; Mahmood, S; Sanghera, R; Barzangi, K; El Mukashfi, M; Shah, S

AI Collection

Journal article


Abstract::
Background
Large language models exhibiting human-level performance in specialized tasks are emerging; examples include Generative Pretrained Transformer 3.5, which underlies the processing of ChatGPT. Rigorous trials are required to understand the capabilities of emerging technology, so that innovation can be directed to benefit patients and practitioners.
Objective
Here, we evaluated the strengths and weaknesses of ChatGPT in primary care using the Membership of the Royal College of General Practitioners Applied Knowledge Test (AKT) as a medium.
Methods
AKT questions were sourced from a web-based question bank and 2 AKT practice papers. In total, 674 unique AKT questions were inputted to ChatGPT, with the model's answers recorded and compared to correct answers provided by the Royal College of General Practitioners. Each question was inputted twice in separate ChatGPT sessions, with answers on repeated trials compared to gauge consistency. Subject difficulty was gauged by referring to examiners' reports from 2018 to 2022. Novel explanations from ChatGPT (defined as information provided that was not inputted within the question or multiple answer choices) were recorded. Performance was analyzed with respect to subject, difficulty, question source, and novel model outputs to explore ChatGPT's strengths and weaknesses.
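As a rough illustration of the scoring described above, the following Python sketch computes per-run accuracy and between-run agreement for a question set answered twice. It is not the authors' analysis code: the `Trial` record, the function names, and the four toy questions are hypothetical and chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One question answered in two separate sessions (hypothetical layout)."""
    question_id: str
    correct: str  # reference answer from the answer key
    run1: str     # model answer in the first session
    run2: str     # model answer in the second session


def accuracy(trials, run):
    """Fraction of questions answered correctly in the named run."""
    hits = sum(1 for t in trials if getattr(t, run) == t.correct)
    return hits / len(trials)


def consistency(trials):
    """Fraction of questions given the same answer in both runs."""
    same = sum(1 for t in trials if t.run1 == t.run2)
    return same / len(trials)


# Toy data for illustration only; these are not results from the study.
trials = [
    Trial("q1", "A", "A", "A"),
    Trial("q2", "C", "B", "C"),
    Trial("q3", "D", "D", "D"),
    Trial("q4", "B", "E", "A"),
]

acc_run1 = accuracy(trials, "run1")
acc_run2 = accuracy(trials, "run2")
agreement = consistency(trials)
```

Note that agreement between runs is independent of correctness: in the toy data, question q4 counts as inconsistent even though both answers are wrong.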
Results
Average overall performance of ChatGPT was 60.17%, which is below the mean passing mark in the last 2 years (70.42%). Accuracy differed between sources (P=.04 and .06). ChatGPT's performance varied with subject category (P=.02 and .02), but variation did not correlate with difficulty (Spearman ρ=-0.241 and -0.238; P=.19 and .20). The proclivity of ChatGPT to provide novel explanations did not affect accuracy (P>.99 and .23).
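The Spearman ρ values above relate subject-level performance to subject difficulty. The abstract does not state which software was used, so the following is only a minimal standard-library sketch of a Spearman rank correlation (Pearson correlation of average ranks, which handles ties). The function names and the example numbers are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-subject values, for illustration only.
subject_accuracy = [0.55, 0.70, 0.62, 0.48]
subject_difficulty = [3, 1, 2, 4]  # higher means harder (assumed coding)
rho = spearman_rho(subject_accuracy, subject_difficulty)
```

In this toy example accuracy falls strictly as difficulty rises, so ρ is -1. The study reports much weaker, non-significant correlations.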
Conclusions
Large language models are approaching human expert-level performance, although further development is required to match the performance of qualified primary care physicians in the AKT. Validated high-performance models may serve as assistants or autonomous clinical tools to ameliorate the general practice workforce crisis.

Publication status:: Published

Peer review status:: Peer reviewed


Cite this record

APA Style

Thirunavukarasu, A. J., Hassan, R., Mahmood, S., Sanghera, R., Barzangi, K., El Mukashfi, M., & Shah, S. (2023). Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care. JMIR Medical Education, 9, e46599–e46599.

MLA Style

Thirunavukarasu, AJ, et al. “Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care.” JMIR Medical Education, vol. 9, 2023, pp. e46599–e46599.

Chicago Style

Thirunavukarasu, AJ, R Hassan, S Mahmood, et al. 2023. “Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care.” JMIR Medical Education 9: e46599–e46599.

Access Document

Files:: Thirunavukarasu_et_al_2023_Trialling_a_Large.pdf

(Version of record, pdf, 683.5KB)

Publisher copy:: 10.2196/46599

Authors

Thirunavukarasu, AJ

Institution:: University of Oxford
Role:: Author
ORCID:: 0000-0001-8968-4768

Hassan, R

Role:: Author
ORCID:: 0000-0002-3054-1161

Mahmood, S

Role:: Author
ORCID:: 0009-0008-4209-1306

Sanghera, R

Role:: Author
ORCID:: 0000-0001-6370-8426

Barzangi, K

Role:: Author
ORCID:: 0009-0009-0327-1221


Publisher:: JMIR Publications
Journal:: JMIR Medical Education
Volume:: 9
Pages:: e46599-e46599
Publication date:: 2023-04-11
DOI:: 10.2196/46599
EISSN:: 2369-3762
ISSN:: 2369-3762

Language:: English
Keywords:: Medicine; Test (biology); Computer science; Transformer; Pathology; Observational study; Generative grammar; Artificial intelligence; Engineering; Knowledge management
Pubs id:: 2341973
UUID:: uuid_a6a31fa6-0cc0-49e1-b9f7-65ee7decff79
Local pid:: pubs:2341973
Source identifiers:: W4364378939
Deposit date:: 2025-12-03
ARK identifier:: ark:/29072/ora_a6a31fa60cc049e1b9f765ee7decff79

Terms of use

Licence:: CC Attribution (CC BY)
