Journal article

PLMC: language model of protein sequences enhances protein crystallization prediction

Abstract:
X-ray diffraction crystallography has been the most widely used technique for determining protein three-dimensional (3D) structures, and it requires that the target protein be crystallizable. Protein crystallization involves several successive procedures, including protein material production, purification, and crystal production, each of which affects the crystallization outcome. Because this multi-stage process is expensive and laborious, various computational tools have been developed to predict protein crystallization propensity, and these predictions are then used to guide experimental structure determination. In this study, we present PLMC, a novel deep learning framework that improves multi-stage protein crystallization propensity prediction by leveraging a pre-trained protein language model. To train PLMC effectively, two groups of features for each protein are integrated into a more comprehensive representation: protein language embeddings derived from a large-scale protein sequence database, and a handcrafted feature set consisting of physicochemical, sequence-based, and disorder-related information. The two feature groups are first embedded separately for refinement and then concatenated for the final prediction. Our extensive benchmarking tests demonstrate that PLMC greatly outperforms other state-of-the-art methods, achieving AUC scores of 0.773, 0.893, and 0.913, respectively, at the individual stages listed above, and 0.982 at the final crystallization stage. Furthermore, PLMC is shown to be superior for predicting the crystallization of both globular and membrane proteins, with an AUC score of 0.991 for the latter. These results suggest that PLMC has significant potential to assist researchers in the experimental design of crystallizable protein variants.
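The feature-fusion step described in the abstract (two feature groups embedded separately, then concatenated for the final prediction) can be illustrated with a minimal sketch. This is not the authors' implementation: the layer sizes, the single dense layer per branch, the ReLU and sigmoid activations, the random untrained weights, and every name below (`fuse_and_predict`, `PLM_DIM`, `HAND_DIM`, `HIDDEN`) are assumptions chosen only to show the data flow.

```python
import math
import random


def dense(x, weights, bias):
    """Apply one fully connected layer: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (illustrative, untrained)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse_and_predict(plm_embedding, handcrafted, params):
    # Branch 1: refine the language-model embedding on its own.
    h_plm = relu(dense(plm_embedding, *params["plm"]))
    # Branch 2: refine the handcrafted features on their own.
    h_hand = relu(dense(handcrafted, *params["hand"]))
    # Concatenate the two refined vectors into one representation.
    fused = h_plm + h_hand
    # Single sigmoid output, read here as a crystallization propensity.
    logit = dense(fused, *params["head"])[0]
    return fused, 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)
PLM_DIM, HAND_DIM, HIDDEN = 8, 5, 4  # hypothetical sizes, not from the paper
params = {
    "plm": init_layer(HIDDEN, PLM_DIM, rng),
    "hand": init_layer(HIDDEN, HAND_DIM, rng),
    "head": init_layer(1, 2 * HIDDEN, rng),
}
# Random stand-ins for one protein's two feature groups.
plm_embedding = [rng.gauss(0.0, 1.0) for _ in range(PLM_DIM)]
handcrafted = [rng.gauss(0.0, 1.0) for _ in range(HAND_DIM)]
fused, propensity = fuse_and_predict(plm_embedding, handcrafted, params)
print(len(fused), 0.0 < propensity < 1.0)
```

Keeping the two branches separate until the concatenation lets each feature group be projected to a comparable size before the prediction head sees them, which is one plausible reading of "separately embedded for refinement".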
Publication status:
Published
Peer review status:
Peer reviewed

Access Document


Files:
Publisher copy:
10.1007/s12539-024-00639-6

Authors


Role:
Author
ORCID:
0000-0002-9337-3839
Role:
Author
ORCID:
0000-0002-5982-0754
Institution:
University of Oxford
Division:
MSD
Department:
NDORMS
Sub department:
Botnar Research Centre
Role:
Author
ORCID:
0000-0002-1274-5080
Institution:
University of Oxford
Division:
MSD
Department:
NDORMS
Sub department:
Botnar Research Centre
Role:
Author
ORCID:
0000-0001-5288-3077


Publisher:
Springer
Journal:
Interdisciplinary Sciences: Computational Life Sciences
Volume:
16
Issue:
4
Pages:
802-813
Publication date:
2024-08-19
Acceptance date:
2024-05-21
DOI:
10.1007/s12539-024-00639-6
EISSN:
1867-1462
ISSN:
1913-2751
Pmid:
39155325


Language:
English
Pubs id:
2023355
Local pid:
pubs:2023355
Deposit date:
2024-12-20
