Journal article icon

Journal article

Benchmarking transformer-based models for medical record de-identification in a single center multi-specialty evaluation

Abstract:
Protecting patient confidentiality is central to enabling research using electronic health records. Automated text de-identification offers a scalable alternative to manual redaction. However, different approaches vary in accuracy and adaptability. We evaluated four transformer-based, task-specific models and five large language models on 3,650 clinical records spanning general and specialty datasets from a UK hospital group. Records were dual-annotated by clinicians, allowing precise comparison of performance. The Microsoft Azure de-identification service achieved the highest F1 score, approaching clinician performance, while fine-tuned AnonCAT and GPT-4-0125 with few-shot prompting also performed strongly. Smaller LLMs frequently over-redacted or produced hallucinatory content, limiting interpretability. Task-specific models demonstrated greater stability across datasets, while low-level adaptation improved performance in both model classes. These findings highlight that automated de-identification systems can provide effective support for large-scale sharing of clinical records, but success depends on careful model choice, adaptation strategies, and safeguards to ensure robust data utility and privacy.
Publication status:
Published
Peer review status:
Peer reviewed

Actions

Access Document

Publisher copy:
10.1016/j.isci.2025.113732

Authors

More by this author
Institution:
University of Oxford
Role:
Author
More by this author
Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Sub department:
Engineering Science
Role:
Author
More by this author
Institution:
University of Oxford
Role:
Author
More by this author
Institution:
University of Oxford
Role:
Author
More by this author
Institution:
University of Oxford
Role:
Author


More from this funder
Funder identifier:
https://ror.org/029chgv08


Publisher:
Cell Press
Journal:
iScience More from this journal
Volume:
28
Issue:
12
Pages:
113732
Publication date:
2025-10-08
DOI:
EISSN:
2589-0042
ISSN:
2589-0042
Pmid:
41438050


Language:
English
Keywords:
UUID:
uuid_cfc38bbf-f5f5-40ac-a16c-e5be0cf043be
Source identifiers:
3619639
Deposit date:
2026-01-01
ARK identifier:
This ORA record was generated from metadata provided by an external service. It has not been edited by the ORA Team.

Terms of use


Views and Downloads






If you are the owner of this record, you can report an update to it here: Report update to this record

TO TOP