Journal article

Evaluating the utility of amino acid similarity-aware kmers to represent TCR repertoires for classification

Abstract:
Insights gained through interpretation of models trained on the T-cell receptor (TCR) repertoire contribute to advances in understanding of immune-mediated disease. This has the potential to improve diagnostic tests and treatments, particularly for autoimmune diseases. However, TCR repertoire datasets with samples from donors of known autoimmune disease status generally include orders of magnitude fewer samples than TCR sequences. Promising TCR repertoire classification approaches consider relationships between non-identical TCR sequences. In particular, kmer methods demonstrate strong and stable performance for small datasets. We propose a TCR repertoire representation that considers the relationships between amino acids within kmers flexibly and efficiently. XGBoost and logistic regression models are trained and tested on kmer representations of TCR repertoire datasets including samples from patients with coeliac disease as well as donors with previous cytomegalovirus infection. XGBoost models outperform logistic regression, indicating that interactions may be crucial for discriminative ability. We find that a reduced alphabet based on BLOSUM62 can lead to a model with slightly stronger XGBoost testing performance than other kmer features. Though it remains unclear whether there is an amino acid encoding that can substantially improve TCR repertoire classification with reduced alphabet kmers, evidence that this representation enables faster training of XGBoost models in comparison to kmer clusters suggests that our reduced alphabet approach permits wider exploration of amino acid similarity in practice. Finally, we detail motifs which are important in each top-performing XGBoost model and compare them to TCR sequences previously associated with each immune status. We highlight the challenge of interpreting non-linear TCR repertoire classification models trained on kmers which, if overcome, could lead to biomarker discovery for autoimmune diseases.
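As a rough illustration of the reduced alphabet kmer idea described in the abstract, the sketch below recodes each sequence with a smaller alphabet before counting kmers, so that non-identical sequences with similar amino acids can share features. The grouping `EXAMPLE_GROUPS`, the function names, and the toy sequences are assumptions made for illustration only; they are not the BLOSUM62-derived alphabet, code, or data from the article.

```python
from collections import Counter

# Hypothetical grouping of the 20 amino acids into similarity classes.
# Illustrative assumption only, NOT the reduced alphabet used in the article.
EXAMPLE_GROUPS = ["AST", "C", "DE", "FWY", "G", "H", "ILMV", "KR", "NQ", "P"]


def build_mapping(groups):
    """Map each amino acid to the first letter of its group."""
    return {aa: group[0] for group in groups for aa in group}


def reduced_kmer_counts(sequences, k, mapping):
    """Count kmers across a repertoire after recoding with the reduced alphabet."""
    counts = Counter()
    for seq in sequences:
        recoded = "".join(mapping[aa] for aa in seq)
        for i in range(len(recoded) - k + 1):
            counts[recoded[i:i + k]] += 1
    return counts


mapping = build_mapping(EXAMPLE_GROUPS)
repertoire = ["CASSLG", "CASTIG"]  # toy sequences, not real TCR data
counts = reduced_kmer_counts(repertoire, 3, mapping)
```

Under this assumed grouping, both toy sequences recode to the same string, so they contribute identical kmer counts. A per-sample count vector of this kind could then serve as the feature input to a classifier such as XGBoost or logistic regression.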
Publication status:
Published
Peer review status:
Peer reviewed

Publisher copy:
10.1371/journal.pcbi.1014211

Authors

Institution:
University of Oxford
Role:
Author
ORCID:
0000-0002-1196-1195
Role:
Author
ORCID:
0000-0003-3242-6017
Role:
Author
ORCID:
0000-0002-3026-4723


Funder identifier:
https://ror.org/05ar5fy68


Publisher:
Public Library of Science
Journal:
PLoS Computational Biology
Volume:
22
Issue:
4
Article number:
e1014211
Publication date:
2026-04-30
Acceptance date:
2026-04-07
DOI:
10.1371/journal.pcbi.1014211
EISSN:
1553-7358
ISSN:
1553-734X


Language:
English
Source identifiers:
4004693
Deposit date:
2026-04-30
ARK identifier:

This ORA record was generated from metadata provided by an external service. It has not been edited by the ORA Team.
