Journal article
Benchmarking transformer-based models for medical record de-identification in a single center multi-specialty evaluation
- Abstract:
- Protecting patient confidentiality is central to enabling research using electronic health records. Automated text de-identification offers a scalable alternative to manual redaction. However, different approaches vary in accuracy and adaptability. We evaluated four transformer-based, task-specific models and five large language models on 3,650 clinical records spanning general and specialty datasets from a UK hospital group. Records were dual-annotated by clinicians, allowing precise comparison of performance. The Microsoft Azure de-identification service achieved the highest F1 score, approaching clinician performance, while fine-tuned AnonCAT and GPT-4-0125 with few-shot prompting also performed strongly. Smaller LLMs frequently over-redacted or produced hallucinatory content, limiting interpretability. Task-specific models demonstrated greater stability across datasets, while low-level adaptation improved performance in both model classes. These findings highlight that automated de-identification systems can provide effective support for large-scale sharing of clinical records, but success depends on careful model choice, adaptation strategies, and safeguards to ensure robust data utility and privacy.
- Publication status:
- Published
- Peer review status:
- Peer reviewed
- Publisher copy:
- 10.1016/j.isci.2025.113732
- Publisher:
- Cell Press
- Journal:
- iScience
- Volume:
- 28
- Issue:
- 12
- Pages:
- 113732
- Publication date:
- 2025-10-08
- DOI:
- 10.1016/j.isci.2025.113732
- EISSN:
- 2589-0042
- ISSN:
- 2589-0042
- Pmid:
- 41438050
- Language:
- English
- UUID:
- uuid_cfc38bbf-f5f5-40ac-a16c-e5be0cf043be
- Source identifiers:
- 3619639
- Deposit date:
- 2026-01-01
This ORA record was generated from metadata provided by an external service. It has not been edited by the ORA Team.
Terms of use
- Copyright date:
- 2025
- Licence:
- CC Attribution (CC BY)