Journal article

Codon language embeddings provide strong signals for use in protein engineering

Abstract:
The task of understanding and interpreting the complex information encoded within genomic sequences remains a grand challenge in biological research and clinical applications. In this context, recent advancements in large language model research have led to the development of both encoder-only and decoder-only foundation models designed to decode intricate information in DNA sequences. However, several issues persist, particularly regarding the efficient management of long-range dependencies inherent in genomic sequences, the effective representation of nucleotide variations, and the considerable computational costs associated with large model architectures and extensive pretraining datasets. Current genomic foundation models often face a critical tradeoff: smaller models with mediocre performance versus large models with improved performance. To address these challenges, we introduce dnaGrinder, a unique and efficient genomic foundation model. dnaGrinder excels at managing long-range dependencies within genomic sequences while minimizing computational costs without compromising performance. It achieves results that are not just comparable but often superior to leading DNA models such as Nucleotide Transformer and DNABERT-2. Furthermore, dnaGrinder is designed for easy fine-tuning on workstation-grade GPUs, accommodating input lengths exceeding 17,000 tokens. On a single high-performance GPU, it supports sequences longer than 140,000 tokens, making it a highly efficient and accessible tool for both basic biological research and clinical applications.
Publication status:
Published
Peer review status:
Peer reviewed

Access Document

Publisher copy:
10.1038/s42256-024-00791-0

Authors

Institution:
University of Oxford
Role:
Author
ORCID:
0000-0003-1408-5554

Institution:
University of Oxford
Role:
Author
ORCID:
0000-0003-1388-2252


Funder identifier:
10.13039/501100000266
Grant:
EP/T517811/1


Publisher:
Nature Research
Journal:
Nature Machine Intelligence
Volume:
6
Issue:
2
Pages:
170-179
Publication date:
2024-02-23
DOI:
10.1038/s42256-024-00791-0
EISSN:
2522-5839
ISSN:
2522-5839


Language:
English
Keywords:
Pubs id:
1679351
Local pid:
pubs:1679351
Source identifiers:
W4392095606
Deposit date:
2026-06-08
ARK identifier:
This ORA record was generated from metadata provided by an external service. It has not been edited by the ORA Team.
