Large Language Models in Biology

April 10, 2024
Silvin Golumbeanu, Naba Naufer, Ph.D. & Stefanie Morgan, Ph.D.

If you’ve played around with ChatGPT, you’ve encountered a Large Language Model (LLM) at work. An LLM is a type of machine learning model trained on large amounts of natural language data – i.e. data written or spoken by humans – in order to develop an “understanding” of its nuances, generate new data, and interact with it at scale. While chatbot interfaces are among the more popular, hypervisible uses of LLMs, these models can also be used to extract meaning from vast, complex biological datasets.

How can LLMs be applied to biological data?

Language can be described as a system of symbols that, when combined in certain ways, give rise to context-dependent meaning. DNA, amino acid sequences, gene expression profiles, and other forms of biological data can all be interpreted as languages. Each has its own ‘vocabulary’ and ‘grammar’ reflecting underlying biological processes and interactions. Our bodies are fluent in them, constantly processing instructions and carrying on cellular conversations without our conscious input. By treating these different forms of biological data as languages, researchers are harnessing LLMs to discover important signals and patterns.
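To make the "vocabulary" idea concrete, here is a minimal sketch of how a nucleotide sequence might be split into word-like tokens before a language model sees it. The k-mer scheme, the function names, and the example sequence are illustrative assumptions, not the tokenizer of any specific model discussed here.

```python
def kmer_tokens(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a nucleotide string into k-mer 'words' (overlapping when stride < k)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Map each distinct token to an integer ID, in alphabetical order."""
    return {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}


dna = "ATGGCGTACGCT"  # made-up sequence, for illustration only
tokens = kmer_tokens(dna, k=3, stride=3)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]

print(tokens)  # ['ATG', 'GCG', 'TAC', 'GCT']
print(ids)     # [0, 1, 3, 2]
```

The integer IDs are what a model would actually consume. Real systems differ in token size and vocabulary, but the principle of mapping sequence fragments to a fixed vocabulary is the same.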

What are some examples of LLM applications in biology?

Genetics and genomics

DNA sequences, composed of the nucleotides adenine (A), guanine (G), cytosine (C), and thymine (T), encode the genetic instructions of all living organisms. Variations in these sequences across the genome contribute to complex traits and disease risk. A clear application of LLMs in biology is therefore to examine how alterations in the DNA sequence correspond to functional outcomes. Building upon the ESM-1b model from Rives et al., Brandes et al. used a 650-million-parameter protein language model to predict the effects of all ∼450 million possible missense variants in the human genome. Because missense variants alter protein sequences, and can thereby point to disease mechanisms and possible therapeutic targets, this type of exhaustive profiling of protein-disrupting variants across the genome holds enormous potential for improving human health.

ESM-1b predicted effects of variants on protein function. Figure 5c from Brandes et al. demonstrates how a small splicing effect (excision of five amino acids from the primary isoform of the MEN1 protein) can result in dramatic changes in the predicted effects of variants in a much larger region.
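A common way to turn a protein language model's output into a variant effect score is a log-likelihood ratio: the log probability the model assigns to the mutant residue minus the log probability it assigns to the wild-type residue at the same position. The sketch below illustrates that arithmetic only. The probabilities are made up (the hypothetical `toy_distribution` helper stands in for real model output), and no actual model is loaded.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_distribution(favoured: str, p: float) -> dict[str, float]:
    """Made-up model output: probability p for one residue, the rest shared equally."""
    rest = (1.0 - p) / (len(AMINO_ACIDS) - 1)
    return {aa: (p if aa == favoured else rest) for aa in AMINO_ACIDS}


def llr_scores(wild_type: str, position_probs: list[dict[str, float]]) -> dict[str, float]:
    """Score every possible substitution as log P(mutant) - log P(wild type)."""
    scores = {}
    for pos, (wt, probs) in enumerate(zip(wild_type, position_probs), start=1):
        for mutant in AMINO_ACIDS:
            if mutant != wt:
                scores[f"{wt}{pos}{mutant}"] = math.log(probs[mutant]) - math.log(probs[wt])
    return scores


wild_type = "MK"  # two-residue toy protein
probs = [toy_distribution("M", 0.81), toy_distribution("R", 0.62)]
scores = llr_scores(wild_type, probs)

print(scores["M1A"] < 0)  # True: the model strongly prefers the wild-type M
print(scores["K2R"] > 0)  # True: the model prefers R over the wild-type K
```

Strongly negative scores flag substitutions the model considers unlikely, which is the signal typically read as potentially damaging.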

Using LLMs to understand patterns in genomic sequences can also be crucial in studying pathogens. For example, Maxim Zvyagin et al. published a study in 2023 introducing Genome-Scale Language Models (GenSLMs) to predict the evolution of the SARS-CoV-2 viral genome, accurately identifying variants of concern. Among the first LLMs trained on genome-scale nucleotide sequences, GenSLMs provide a key foundation for predicting evolutionary dynamics and serve as foundation models for biological sequence data. This is expected to pave the way for building hierarchical AI models for many other biological applications, such as protein annotation workflows, metagenome reconstruction, protein engineering, and biological pathway design.

While caveats to both approaches exist, integrating the outputs of multiple approaches like these can yield greater insight into druggable variants in the genome than was previously possible.


Transcriptomics

Rigorous analysis of transcriptomic data has provided a wealth of insights on the roles of RNA-mediated processes in development and disease, leading to advances in personalized medicine. LLM-powered analysis of single-cell RNA sequencing (scRNA-seq) data is now emerging as a formidable tool in this field as well, enabling biological processes to be understood at cellular resolution and revealing cell- and tissue-specific changes and contributions to disease. For example, in a 2022 study, Dr. Fan Yang et al. developed a transformer-based LLM, single-cell Bidirectional Encoder Representations from Transformers (scBERT), to accurately annotate cell types from scRNA-seq data. Subsequently, Geneformer, developed by Dr. Christina Theodoris et al. in 2023, emerged as an alternative transformer-based LLM trained on scRNA-seq data to predict tissue-specific gene network dynamics in data-limited settings, with the aim of accelerating discovery of key network regulators and candidate therapeutic targets.

In both instances, the LLMs were pretrained on a massive dataset and then further refined, after which they could draw accurate inferences from real-world datasets. While their use in the biotech space for drug development has only just begun, they are powerful tools for helping researchers rapidly identify and prioritize targets for reversing the molecular phenotype of a disease.

Cell types annotated by experts (left) and scBERT predictions (right) demonstrate remarkable similarity. Figure 2b from Yang et al., “scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data,” Nature Machine Intelligence (2022).
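A gene expression profile is a table of numbers rather than a string, so it has to be turned into a token sequence before a transformer can read it. One simple option, sketched below, is to order genes from most to least distinctive in a given cell. This is a simplified illustration loosely inspired by rank-based encodings, not the preprocessing pipeline of either paper. The `rank_encode` function, the expression values, and the "typical" scaling values are all hypothetical.

```python
def rank_encode(expression: dict[str, float],
                typical: dict[str, float],
                max_len: int = 4) -> list[str]:
    """Order expressed genes by expression relative to a typical level, highest first."""
    scaled = {
        gene: count / typical[gene]
        for gene, count in expression.items()
        if count > 0  # genes with no reads are left out of the 'sentence'
    }
    # Sort by descending scaled value; break ties alphabetically for determinism.
    ranked = sorted(scaled, key=lambda g: (-scaled[g], g))
    return ranked[:max_len]


# Made-up counts for one cell and made-up typical levels across many cells.
cell = {"ACTB": 120.0, "GAPDH": 80.0, "CD3E": 12.0, "MS4A1": 0.0, "IL7R": 6.0}
typical = {"ACTB": 100.0, "GAPDH": 100.0, "CD3E": 2.0, "MS4A1": 2.0, "IL7R": 3.0}

print(rank_encode(cell, typical))  # ['CD3E', 'IL7R', 'ACTB', 'GAPDH']
```

Scaling by a typical level pushes ubiquitously expressed housekeeping genes down the list, so the genes that distinguish this cell appear first in its token sequence.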

Proteomics and protein engineering

Understanding how the complexities of 3D protein structures impact specific functions and drug sensitivities is an ongoing challenge that LLMs are particularly well poised to investigate. Groups led by Drs. Ali Madani and Noelia Ferruz are developing LLMs called ProGen and ProtGPT2, respectively, which can generate de novo protein sequences with predictable functions. ProGen and ProtGPT2 are intended to generate novel protein sequences rather than structures, though their outputs can be further explored for structural insights using tools like AlphaFold. These LLMs can help generate biomedically relevant insights from the exponentially growing body of protein sequence data, which far outpaces the availability of structural data.

LLMs can also be employed to predict protein structures from protein sequences, as demonstrated by Lin et al.’s ESMFold algorithm. Published in 2023 by a team from Meta AI, ESMFold is a transformer-based protein language model that can rapidly and accurately predict atomic-level structures from primary protein sequences. This kind of model is especially useful for metagenomic sequencing data, for which researchers have little if any information beyond primary protein sequence. The team used the model to generate the ESM Metagenomic Atlas, a database of over 700 million predicted protein structures. Models like ESMFold and their resulting databases can help us characterize newly identified or poorly understood proteins that are difficult, if not impossible, to study beyond their primary sequence.

Examples of different ProtGPT2-generated proteins. Figure 4 from Ferruz et al., “ProtGPT2 is a deep unsupervised language model for protein design,” Nature Communications (2022).

Small molecule drug discovery and biochemistry

While LLMs can be used to decode the languages of living organisms – specifically DNA, RNA, and protein sequences – they can also be used to discover and optimize new drugs. By translating libraries of chemical compounds into text-based training datasets, researchers are developing Chemical Language Models (CLMs) to predict small molecule drugs that could target specific proteins in diseases. In a 2023 Nature Communications study, Dr. Michael Moret et al. leveraged a CLM to design a molecule that effectively repressed the PI3K/Akt pathway – the dysregulation of which is commonly associated with many types of cancer – in a brain tumor model. Successfully integrating CLMs and similar models in drug discovery pipelines could accelerate compound screening and hasten experimental validation, particularly when combined with other powerful predictive modeling tools described above.

Docking positions of novel PI3Kγ inhibitors identified by a CLM. Figure 7 from Moret et al., “Leveraging molecular structure and bioactivity with chemical language models for de novo drug design,” Nature Communications (2023).
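Translating compounds into text usually means writing each molecule as a SMILES string, a line notation in which atoms, bonds, branches, and ring closures are spelled out as characters. The sketch below shows one way such a string might be split into tokens for a chemical language model. The regular expression is a deliberately simplified assumption (it accepts any single letter as an atom, for instance), not the tokenizer used in the study above.

```python
import re

# Bracket atoms, two-letter halogens, two-digit ring closures, single-letter
# atoms, ring-closure digits, then bond, branch, and other punctuation symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:@~*$]")


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens, rejecting characters the pattern skips."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in SMILES: {smiles!r}")
    return tokens


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
print(tokenize_smiles(aspirin))
print(tokenize_smiles("ClCCBr"))  # ['Cl', 'C', 'C', 'Br']
```

Keeping "Cl" and "Br" as single tokens matters: split character by character, "Cl" would read as a carbon followed by an unknown atom.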

Antibody evolution and biologics

Beyond modeling for small-molecule oriented drug discovery or predicting druggable targets, LLMs can also be utilized to assist the development of antibodies against diseases. For example, in a 2024 Nature Biotechnology study, Dr. Brian Hie et al. used an LLM to guide lab evolution of antibody variants, successfully generating candidates with neutralizing activity against Ebola and SARS-CoV-2 viruses. The authors add that these models could also be used to optimize therapeutically relevant proteins for other purposes, such as overcoming antibiotic resistance. Similar guided antibody development may prove useful for more therapeutically focused applications in the future, particularly in combination with predictive modeling of disease-specific antibody targets.
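One plausible way to let language models guide which variants to test in the lab is to ask several models for their per-position residue probabilities and keep only the substitutions that a minimum number of models rate as more likely than the wild-type residue. The sketch below illustrates that voting idea with made-up probabilities. The function name, the `min_votes` threshold, and the three toy "models" are assumptions for illustration, not the published procedure.

```python
def consensus_substitutions(wild_type: str,
                            model_probs: list[list[dict[str, float]]],
                            min_votes: int = 2) -> list[str]:
    """Return substitutions that at least `min_votes` models prefer over wild type."""
    votes: dict[str, int] = {}
    for probs_per_position in model_probs:  # one list of distributions per model
        for pos, (wt, probs) in enumerate(zip(wild_type, probs_per_position), start=1):
            for residue, p in probs.items():
                if residue != wt and p > probs.get(wt, 0.0):
                    key = f"{wt}{pos}{residue}"
                    votes[key] = votes.get(key, 0) + 1
    return sorted(name for name, n in votes.items() if n >= min_votes)


wild_type = "QV"  # two-residue toy sequence
model_a = [{"Q": 0.5, "E": 0.3, "K": 0.2}, {"V": 0.2, "I": 0.6, "L": 0.2}]
model_b = [{"Q": 0.3, "E": 0.6, "K": 0.1}, {"V": 0.3, "I": 0.5, "L": 0.2}]
model_c = [{"Q": 0.6, "E": 0.3, "K": 0.1}, {"V": 0.5, "I": 0.3, "L": 0.2}]

print(consensus_substitutions(wild_type, [model_a, model_b, model_c]))  # ['V2I']
```

Requiring agreement across models trades recall for precision, which suits settings where each candidate has to be synthesized and measured experimentally.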

How are LLMs changing biological data exploration?

Collectively, Large Language Models have demonstrated great potential as a tool for deciphering and recapitulating nuances and complex relationships from a variety of biochemical data forms, from DNA sequences and RNA transcripts to protein sequences and libraries of chemical compounds. While they initially require a large dataset from which to learn, they can be effectively fine-tuned for specific tasks with much smaller datasets, and perform remarkably well with limited inputs thereafter. As LLMs become increasingly able to accurately predict the effects of genetic variants, novel therapeutic compounds, and more, scientists will be able to derive actionable insights from their data with fewer samples and iterations, rapidly testing more targeted hypotheses. Ultimately, this frees up researchers to focus on what human minds do best: ask new questions and imagine new solutions.

How can Watershed help teams leverage LLMs for biological data?

While Large Language Models open up exciting new avenues for research, there are significant barriers to effectively setting up, optimizing, and implementing them at scale. Watershed can help your team overcome these challenges with powerful computational resources, customized expertise, our advanced Workflow Engine, and beyond. Get in touch with our team of bioinformaticians and engineers to learn more.