Conference item
Evaluating histopathology foundation models for few-shot tissue clustering: an application to LC25000 augmented dataset cleaning
- Abstract:
- Recent digital histopathology datasets have significantly advanced the development of deep learning-based histopathology frameworks. However, data leakage in model training can lead to artificially high metrics that do not genuinely reflect the strength of the approach. The LC25000 dataset, consisting of tissue image tiles extracted from lung and colon samples, is a popular benchmark dataset. In the released version, tissue tiles were augmented randomly and mixed. Nevertheless, many studies report near-perfect accuracy scores, often due to data leakage, where augmented images of the same tissue tile are split into both training and test sets. To improve the quality of performance reports, we develop a semi-automatic pipeline to clean LC25000. By clustering and separating all augmented images of the same tiles, using recently proposed histopathology foundation models and manual correction, we create a clean version of LC25000. We then evaluate the quality of features extracted by these foundation models, using the clustering task as a benchmark. Our contributions are: 1) We publicly release our semi-automatic annotation pipeline along with the LC25000-clean dataset to facilitate appropriate utilization of this dataset, reducing the risk of overestimating models’ performance; 2) We profile various combinations of feature extraction and clustering methods for identifying duplicates of the same image generated by basic image transformations; 3) We propose the clustering task as a minimal-setup benchmark to evaluate the quality of tissue image features learned by histopathology foundation models. Clustering labels, annotation pipeline, and evaluation code: https://github.com/GeorgeBatch/LC25000-clean.
- Publication status:
- Published
- Peer review status:
- Peer reviewed
Access Document
- Publisher copy:
- 10.1007/978-3-031-73748-0_2
Funding
- Funder:
- Engineering and Physical Sciences Research Council
- Funder identifier:
- https://ror.org/0439y7842
- Grant:
- EP/S02428X/1
- Publisher:
- Springer
- Host title:
- Data Engineering in Medical Imaging: Second MICCAI Workshop, DEMI 2024, Held in Conjunction with MICCAI 2024, Marrakesh, Morocco, October 10, 2024, Proceedings
- Pages:
- 11-21
- Series:
- Lecture Notes in Computer Science
- Series number:
- 15265
- Place of publication:
- Cham, Switzerland
- Publication date:
- 2024-10-25
- Acceptance date:
- 2024-07-16
- Event title:
- 2nd Workshop in Data Engineering in Medical Imaging (DEMI) at MICCAI 2024
- Event location:
- Marrakesh, Morocco
- Event website:
- https://demi-workshop.github.io/
- Event start date:
- 2024-10-10
- Event end date:
- 2024-10-10
- DOI:
- 10.1007/978-3-031-73748-0_2
- EISSN:
- 1611-3349
- ISSN:
- 0302-9743
- EISBN:
- 9783031737480
- ISBN:
- 9783031737473
- Language:
- English
- Pubs id:
- 2016327
- Local pid:
- pubs:2016327
- Deposit date:
- 2024-07-17
Terms of use
- Copyright holder:
- Batchkala et al.
- Copyright date:
- 2025
- Rights statement:
- © 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG.
- Notes:
- This is the accepted manuscript version of the article. The final version is available online from Springer at https://dx.doi.org/10.1007/978-3-031-73748-0_2