Thesis
Protein language representation learning to predict SARS-CoV-2 mutational landscape
- Abstract:
-
With the global spread of the SARS-CoV-2 pandemic, numerous variants have been emerging on a daily basis, with distinct transmission and infection rates, risks, and capacity to evade antibody neutralisation. Early discovery of high-risk mutations is critical for data-informed therapeutic design decisions and effective pandemic management. This dissertation explores the application of language models, commonly used for text processing, to decipher SARS-CoV-2 spike protein sequences, which are chains of amino acids represented as letters of an alphabet. Deep protein language models are revolutionising protein biology, and this work introduces two novel models: CoVBERT, a sequence-only transformer encoder for predicting point mutations, and MuFormer, which leverages both the sequence and the structural space to design mutational protein sequences iteratively. CoVBERT predicts highly transmissible mutations, including D614G, with a masked marginal log likelihood of 0.95, surpassing state-of-the-art large protein language models. This reflects the ability of large language models to capture in vitro mutagenesis by learning the language of evolution.
MuFormer generates de novo protein sequences using AlphaFold2 for fixed backbone design, and curates evolutionarily novel mutational sequences by injecting representations derived from state-of-the-art protein language models. The generated mutational sequences were validated against historical data, which demonstrated MuFormer's ability to capture phylogenetic properties when generating mutations such as those of the Omicron and Delta variants, given the Alpha variant as input. MuFormer conditions not only on the sequence but also on the structure to generate protein sequences and structures end to end, optimising with two strategies: fixed backbone design (MuFormer-fixbb) and backbone atom optimisation (MuFormer-bba). Both variants of MuFormer outperformed AlphaFold2 on the mutational sequence generation task across several structure and sequence likelihood metrics. These models demonstrate the potential of large language models, termed foundational models, for learning the representational language of biology, which can assist in controlling pandemics by predicting mutations with higher infectivity in advance.
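As an illustration of the masked marginal scoring mentioned in the abstract, the sketch below shows one formulation commonly used with protein language models: mask a position, then compare the model's log probability of the mutant residue with that of the wild-type residue. This is not taken from the thesis, whose exact definition is not given in this abstract. `ToyProfileModel`, the `#` mask symbol, and the three-residue sequences are hypothetical stand-ins invented for the example; a real study would use a trained model such as CoVBERT in place of the toy class.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"  # hypothetical mask symbol, chosen for this sketch only


class ToyProfileModel:
    """Hypothetical stand-in for a protein language model.

    It estimates per-position residue probabilities from a few aligned,
    made-up sequences with add-one smoothing. Unlike a real language
    model, it ignores the unmasked context entirely.
    """

    def __init__(self, aligned_sequences):
        length = len(aligned_sequences[0])
        self.n = len(aligned_sequences)
        self.columns = [
            Counter(seq[i] for seq in aligned_sequences) for i in range(length)
        ]

    def log_probs(self, masked_sequence, position):
        # The caller must have masked the position being scored.
        assert masked_sequence[position] == MASK
        counts = self.columns[position]
        total = self.n + len(AMINO_ACIDS)  # add-one smoothing denominator
        return {aa: math.log((counts[aa] + 1) / total) for aa in AMINO_ACIDS}


def masked_marginal_score(model, sequence, position, mutant):
    """Return log p(mutant) - log p(wild type) at a masked position."""
    wild_type = sequence[position]
    masked = sequence[:position] + MASK + sequence[position + 1:]
    lp = model.log_probs(masked, position)
    return lp[mutant] - lp[wild_type]


if __name__ == "__main__":
    # Made-up toy fragments, not real spike protein data.
    model = ToyProfileModel(["ADK", "AGK", "AGK", "AGR"])
    # Score the substitution D -> G at index 1 of the fragment "ADK".
    print(masked_marginal_score(model, "ADK", 1, "G"))
```

A positive score means the model considers the mutant residue more likely than the wild-type residue at that position, and a negative score means the opposite.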
Authors
Contributors
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Computer Science
- Sub department:
- Computer Science
- Role:
- Supervisor
- ORCID:
- 0000-0002-1779-6741
- DOI:
- Type of award:
- MSc
- Level of award:
- Masters
- Awarding institution:
- University of Oxford
- Language:
-
English
- Keywords:
- Subjects:
- Deposit date:
-
2024-06-24
- ARK identifier:
Terms of use
- Copyright holder:
- Batra, H
- Copyright date:
- 2022