Foundation Models for Biomedical Research

December 18, 2024
Silvin Gol & Evelien Schaafsma, Ph.D.

Foundation models are machine learning models trained on huge datasets so that they can be applied broadly across a variety of contexts. Because most foundation model training data are unlabelled, these models learn in a self-supervised manner – rather than being told what to look for, they infer patterns and relationships in the data on their own. This contextual learning from massive datasets allows foundation models to make predictions about many kinds of inputs, even ones they were not specifically trained on.

Biology comprises many different components – like genes, amino acids, proteins, chromatin, etc. – working closely together to orchestrate different processes. Foundation models allow researchers to integrate diverse data types from all of these different sources into a unified framework. By highlighting novel connections, these models can provide direction for unresolved research questions and form the basis for new hypotheses.

What are some examples of breakthrough foundation models in biology?

The number of papers describing or using a foundation model exploded in 2024. Searching the term “foundation model” in PubMed returns fewer than 10 papers per year before 2023, while 2024 alone returns 150 (including preprints), with some still trickling in before the end of the year. Below is a selection of foundation models developed over the last decade for different areas of biomedical research.

PubMed search results for the keyword “foundation model” between 2010 and 2024.

Transcriptomics

  • Geneformer (Theodoris et al. 2023) employs a deep learning technique called transfer learning to make predictions about tissue-specific gene network dynamics from single-cell RNA sequencing (scRNA-seq) data. This context-aware model allows researchers to glean insights from settings with limited data, like rare diseases and difficult-to-access tissues. Geneformer was originally trained on 30 million single-cell transcriptomes; the authors recently introduced an updated version with several key changes, including an expanded pre-training dataset of 95 million single-cell transcriptomes.

  • scGPT (single-cell GPT) is a generative AI tool introduced by Cui et al. in 2024 to distill genetic and cellular insights from single-cell data. Like Geneformer, scGPT also uses a transformer model pre-trained on ~30 million cells to make its predictions. scGPT can be used to annotate cell types, infer gene networks, cluster groups of cells, integrate multi-omic datasets, and more.

  • scVI (single-cell variational inference) uses stochastic optimization and deep learning to both approximate distributions of gene expression data across single-cell datasets and perform key analyses like visualization, clustering, and differential expression. Published by Lopez et al. in 2018, scVI and related tools are maintained by Dr. Nir Yosef’s lab on the scvi-tools website.

Geneformer pre-training and fine-tuning architecture. From Theodoris et al., “Transfer learning enables predictions in network biology,” Nature (2023).
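The pre-train/fine-tune recipe shared by Geneformer and scGPT can be illustrated with a deliberately tiny sketch. The example below is purely conceptual – a one-parameter model and made-up numbers, not a real transformer or transcriptomic data – but it shows why starting from a pre-trained model helps when task-specific data are scarce:

```python
# Conceptual sketch of transfer learning: a one-parameter "model" is
# pre-trained on abundant data, then fine-tuned on a small dataset.
# Purely illustrative -- models like Geneformer have millions of
# parameters and train on transcriptomes, not toy numbers.

def train(w, data, lr=0.01, steps=50):
    """Fit y = w * x by gradient descent on squared error."""
    for _ in range(steps):
        for x, y in data:
            grad = 2 * (w * x - y) * x
            w -= lr * grad
    return w

# Large "pre-training" dataset: y = 2x
pretrain_data = [(x, 2.0 * x) for x in range(1, 11)]
# Tiny "fine-tuning" dataset for a closely related task: y = 2.1x
finetune_data = [(1.0, 2.1), (2.0, 4.2)]

w_pretrained = train(0.0, pretrain_data)                  # learn the general task
w_transfer = train(w_pretrained, finetune_data, steps=5)  # brief fine-tune
w_scratch = train(0.0, finetune_data, steps=5)            # same budget, no pre-training

print(abs(w_transfer - 2.1) < abs(w_scratch - 2.1))  # prints True
```

With the same small fine-tuning budget, the pre-trained model ends up much closer to the new target than training from scratch – the same intuition behind fine-tuning a transcriptome foundation model for a rare-disease task with limited cells.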

Genomics

  • DeepSEA (Zhou & Troyanskaya 2015) uses deep learning to predict the effects of noncoding genomic variants on chromatin and epigenetic regulatory mechanisms. Much of the human genome is considered “noncoding”, i.e. the sequence does not directly encode a gene. However, many of these noncoding sequences have important regulatory or other functions, which are difficult to discern without effective computational tools like DeepSEA.

  • Enformer (Avsec et al. 2021) also uses deep learning to predict the effects of noncoding DNA on gene expression; however, it is specifically optimized to include long-range interactions (up to 100 kb) as well. As many regulatory elements affect genes that are quite far away (>20 kb), integrating these long-range interactions improves the accuracy of variant effect predictions.

  • DNABERT (Bidirectional Encoder Representations from Transformers, pre-trained on DNA) uses a different approach to predict the effects of noncoding DNA on gene expression. In a 2021 paper, Ji et al. adapted the BERT (Devlin et al. 2018) large language model to contextually “understand” DNA sequences and predict important regulatory regions like promoters, transcription factor binding sites, and splice sites.

How DNABERT training differs from that of “traditional” transformer models. DNABERT is pre-trained on general purpose genomic data and then fine-tuned for specific tasks. From Ji et al., “DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome,” Bioinformatics (2021).
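One concrete piece of the DNABERT approach is its input representation: a DNA sequence is split into overlapping k-mers (k = 3–6 in the paper), which serve as the model’s “words” in place of natural-language tokens. A minimal sketch of that tokenization step:

```python
def kmer_tokenize(sequence, k=6):
    """Split a DNA sequence into overlapping k-mer tokens (stride 1),
    the input representation DNABERT uses in place of words."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

tokens = kmer_tokenize("ATGGCTTATCG", k=6)
print(tokens)
# ['ATGGCT', 'TGGCTT', 'GGCTTA', 'GCTTAT', 'CTTATC', 'TTATCG']
```

The masked-language-model pre-training itself (hiding some k-mers and predicting them from context) is omitted here; this only shows how raw sequence becomes a token stream the transformer can consume.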

Structural Biology

  • AlphaFold uses neural networks to predict 3D protein structures from amino acid sequences with near-experimental accuracy, winning its developers the 2024 Nobel Prize in Chemistry. Determining protein structures purely through experimental approaches is an incredibly time-consuming process, making AlphaFold a promising tool in a number of research areas from drug development to basic biology. Abramson et al. recently introduced AlphaFold3 – the latest version of AlphaFold.

AlphaFold3-predicted structure examples. From Abramson et al., “Accurate structure prediction of biomolecular interactions with AlphaFold 3,” Nature (2024).

Spatial Biology

  • Nicheformer is trained on both dissociated single-cell transcriptomics data (>57 million cells) and spatially-resolved transcriptomics data (>53 million cells) to make context-specific predictions about the spatial microenvironment of cells. Introduced in a recent preprint by Schaar et al., Nicheformer uses transfer learning to bridge the gap between dissociated and spatially-resolved data, successfully performing spatial tasks regardless of data origin, and allowing for contextualization of dissociated cell data.

  • Novae (Blampey et al. 2024) uses graph-based learning to enable comparison of spatial domains across different samples and experiments. Tissue samples and experimental iterations naturally exhibit variations in spatial gene expression, which can result in a batch effect. In order to accurately interpret results, researchers must distinguish between variations due to batch effect versus biological processes. Novae, which is pre-trained on ~30 million cells representing 18 tissue types, corrects for this effect and allows researchers to make more informative comparisons between tissue samples, gene panels, and other groups of data.

Overview of Nicheformer pre-training data composition and fine-tuning for specific spatial tasks. From Schaar et al., “Nicheformer: a foundation model for single-cell and spatial omics,” bioRxiv (2024).
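Graph-based spatial models like Novae operate on neighborhood graphs in which each cell is linked to its nearest neighbors in tissue coordinates. The sketch below (hypothetical cells and coordinates, plain Python) illustrates that underlying data structure; real models build such graphs over millions of cells:

```python
import math

# Toy spatial neighborhood graph: connect each cell to its k nearest
# neighbors in tissue (x, y) coordinates. Cell names and positions are
# made up for illustration.
cells = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.0, 1.0), "d": (5.0, 5.0)}

def knn_graph(cells, k=2):
    graph = {}
    for name, pos in cells.items():
        # Sort all other cells by Euclidean distance, keep the k closest.
        dists = sorted(
            (math.dist(pos, other_pos), other)
            for other, other_pos in cells.items() if other != name
        )
        graph[name] = [other for _, other in dists[:k]]
    return graph

print(knn_graph(cells))
```

A graph neural network then learns each cell’s representation from its neighbors’ expression profiles, which is what makes the learned spatial domains comparable across samples.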

Biomedical Language

  • BioBERT stands for Bidirectional Encoder Representations from Transformers for Biomedical Text Mining. Introduced in a preprint by Lee et al. in 2019, BioBERT uses transfer learning to understand complex biomedical texts. The authors adapted the BERT large language model to improve performance on text mining tasks specific to biomedical language.

  • BioELECTRA is a language encoder model adapted from ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) for biomedical texts. Introduced in a 2021 preprint by Kanakarajan et al., BioELECTRA differs from BERT-based models in that it uses replaced token detection to make predictions rather than masked language modeling.

Overview of BioBERT pre-training and fine-tuning. From Lee et al., “BioBERT: a pre-trained biomedical language representation model for biomedical text mining,” arXiv (2019).
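The difference between the two training objectives can be made concrete with a toy example. In masked language modeling (BERT/BioBERT), the model learns only from the masked positions; in replaced token detection (ELECTRA/BioELECTRA), every position yields a training signal, which is part of ELECTRA’s efficiency argument. The sketch below mimics the two corruption schemes on a made-up sentence – no actual model is trained:

```python
import random

random.seed(0)

sentence = ["the", "protein", "binds", "the", "receptor"]

def mask_tokens(tokens, mask_rate=0.4):
    """BERT-style masked language modeling: hide some tokens;
    the model's objective is to predict them from context."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok  # supervision exists only at masked positions
        else:
            masked.append(tok)
    return masked, targets

def replace_tokens(tokens, vocab, replace_rate=0.4):
    """ELECTRA-style replaced token detection: swap in plausible fakes;
    the model must label EVERY position as original vs replaced."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < replace_rate:
            corrupted.append(random.choice(vocab))
            labels.append("replaced")
        else:
            corrupted.append(tok)
            labels.append("original")
    return corrupted, labels

masked, targets = mask_tokens(sentence)
corrupted, labels = replace_tokens(sentence, vocab=["enzyme", "cell", "gene"])
print(masked)   # some positions become [MASK]
print(labels)   # every position gets a label
```

(In real ELECTRA, the replacements come from a small generator network rather than a random vocabulary draw; the point here is only the shape of the two objectives.)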

Clinical Practice

  • CONCH (Lu et al. 2024) stands for Contrastive learning from Captions for Histopathology. Trained on over 1.17 million image-caption pairs, CONCH combines visual and language-based learning to perform analysis tasks in histopathology such as image classification, captioning, and segmentation. Unlike other models in digital pathology, CONCH is not limited by missing captions, is widely applicable to many tissue and disease areas, and demonstrates potential to outperform visual-only models.

  • TxGNN (Huang et al. 2024) uses a neural network trained on a medical knowledge graph to identify potential new purposes for existing drugs. Many medications are found to be effective for treating multiple conditions; however, there has historically been no systematic way to identify and rank these off-label uses. TxGNN’s predictions align with current off-label practice, offering a promising exploratory tool for expanding drug indications.

CONCH-enabled classification of: (a) regions of interest (ROIs) and (b) whole histopathological slides (WSIs). From Lu et al., “A visual-language foundation model for computational pathology,” Nature Medicine (2024).

How are these foundation models being used in biomedical research?

While many models have been introduced too recently to have extensive citations, several have already been quite impactful across application areas. From cancer to cardiovascular health, researchers are combining foundation models with novel analytical approaches to explore the pathology, progression, and treatment of various diseases in unprecedented detail. The following studies, most of which have been published in the last year, highlight some promising applications of biological foundation models.

Cancer

Cancer is a complex family of diseases with an incredibly diverse array of causes and prognoses for patients. Cancer researchers are using foundation models to elucidate the unique mechanisms underlying disease progression and therapeutic resistance in different patient populations, identifying novel therapeutic targets and diagnostic biomarkers.

  • In a 2024 study, Li et al. used single-cell RNA analysis, bolstered by scGPT-enabled cell annotation, to pinpoint factors driving therapeutic resistance in a mouse model of breast cancer. The authors identified subsets of tumor-associated macrophages (TAMs) with different roles in modulating resistance to PARPi (Poly ADP-ribose polymerase inhibitor) therapy. To isolate highly variable genes among these macrophage subtypes, the authors trained scGPT on mouse tumor models and tested the algorithm on a large human pan-cancer myeloid dataset. In one of these TAM subsets, populations of cells expressing the C5aR1 gene were significantly associated with PARPi resistance; importantly, inhibiting expression of this gene was found to re-sensitize the tumors to this specific therapy.

  • Segmenting cancer cases by transcriptional signatures is not only helpful for predicting therapeutic responses, but also for predicting metastasis and disease progression. In another study from this year, Pakula et al. used scRNA sequencing and scVI modeling to identify changes in the tumor microenvironment – specifically the stroma – that influence the progression of prostate cancer in mice and humans. The authors found that different clusters of stromal cells were associated with distinct disease states, deriving a transcriptional signature that effectively predicts local metastasis.

(f) scGPT-enabled identification of cell types by C5aR1 and CD86 expression in tumor-associated macrophages from human triple-negative breast cancer samples and (g) expression of lineage-specific markers in these cell populations. From Li et al., “C5aR1 inhibition reprograms tumor associated macrophages and reverses PARP inhibitor resistance in breast cancer,” Nature Communications (2024).

Neurodegeneration

The central nervous system (CNS) comprises many different cell types and environmental components interacting to regulate critical functions like musculoskeletal coordination, motor control, cognition, emotional regulation, and much more. Variations in these processes can lead to debilitating neurodegenerative diseases like Alzheimer’s, Parkinson’s, and multiple sclerosis. Foundation models are enabling increasingly interdisciplinary research groups to untangle the complicated web of causes and contributing factors behind neurodegeneration.

  • Blood tests are used to diagnose and assess progression of many diseases; however, in diseases of the CNS, much can also be learned from the cerebrospinal fluid (CSF) that surrounds and protects it. In a 2020 study of a mouse model of multiple sclerosis (MS), Schafflick et al. used single-cell transcriptomic analyses powered by scVI to identify a subset of T-cells that potentially drive expansion of B-cells into the CSF, promoting MS progression. The authors used scVI to harmonize the single-cell data, generate a latent space for cell cluster identification, and analyze differential expression of genes across cell clusters.

  • In a 2024 study, Quan et al. used Geneformer and additional single-cell analyses to explore the role of type-1 interferons (IFN-I), a family of cytokines involved in neuroinflammation and immune response, in Parkinson’s disease (PD). Using PD patient data from the Gene Expression Omnibus database, Geneformer predicted a role for the NFATc2 transcription factor in regulating IFN-I response and activating microglia-mediated inflammation, which was subsequently validated in a mouse model.

Proposed IFN-I signaling pathway involved in Parkinson’s disease, from Quan et al., “Single cell analysis reveals the roles and regulatory mechanisms of type-I interferons in Parkinson’s disease,” Cell Communication & Signaling (2024). Authors used Geneformer to identify a potential role for NFATc2 transcription factor in regulating this pathway.

Infectious Disease

Pathogens are constantly evolving novel ways to evade immune responses and resist treatments. Researchers are beginning to leverage foundation models to better understand host-pathogen interactions, the mechanisms of infection and proliferation, how to overcome treatment resistance, and address other key questions across infectious diseases.

  • M. tuberculosis, the bacterial pathogen responsible for the ongoing tuberculosis (TB) epidemic, has developed many drug-resistant strains over the years. This year, Wang et al. used AlphaFold prediction combined with cryo-electron microscopy to describe transport and inhibition mechanisms for EfpA, a multidrug efflux pump in M. tuberculosis. EfpA plays a key role in multidrug resistance, helping to transport antibiotics out of bacterial cells, making it a promising therapeutic target for resistant TB.

  • Looking to another epidemic-causing pathogen, there is still much to be learned about the impacts of the novel SARS-CoV-2 virus on various organ systems. In 2021, Wendisch et al. published a study examining how SARS-CoV-2 infection leads to lung fibrosis in patients with acute respiratory distress. By integrating multiple single-cell transcriptomic datasets with scVI, in addition to imaging and functional genomics approaches, the authors show how certain macrophage populations become profibrotic in these patients and highlight similarities with other cases of pulmonary fibrosis.

AlphaFold-predicted structure and mechanism of a channel in the M. tuberculosis EfpA protein involved in antibiotic resistance, showing (e) lateral gates, (f) binding of three lipids, and (g) cross-sections of lipid binding sites. From Wang et al., “Structures of the Mycobacterium tuberculosis efflux pump EfpA reveal the mechanisms of transport and inhibition,” Nature Communications (2024).

Stem Cells & Regenerative Medicine

The human body comprises up to ~2000 distinct cell types (depending on the resource and analysis strategy used), each following its own journey from stem to fully differentiated cell. By illuminating the processes governing stem cell renewal and differentiation into unique cell types, researchers can better understand the limits and therapeutic potential of tissue regeneration in biomedical applications. Foundation models can offer insights into these processes by revealing patterns in transcriptional profiles, protein function, and more.

For example, Aguadé-Gorgorió et al. recently used AlphaFold protein structure prediction to understand how the MYCT1 protein modulates critical processes in the self-renewal and engraftment of human haematopoietic stem cells (HSCs). Gene perturbation studies demonstrated that MYCT1 is important for maintaining HSC “stemness”, specifically through regulating environmental sensing and endocytosis. The AlphaFold-predicted structure helped pinpoint MYCT1 localization and mechanism of action in endosomal membranes.

AlphaFold-predicted structure of MYCT1 protein (left) and proposed localization in endosomal membrane (right). From Aguadé-Gorgorió et al., “MYCT1 controls environmental sensing in human haematopoietic stem cells,” Nature (2024).

Rare Disease

As genomic data for rare diseases are inherently scarce, the genetic basis for many of these diseases is still poorly understood. Foundation models are opening up new avenues of discovery for rare disease research by enabling the prediction of genomic variants – both in coding and noncoding DNA – and their effects from limited data.

In one example from earlier this year, Soriano et al. used DeepSEA, among other predictive tools, to reveal ultraconserved non-coding elements (UCNEs) involved in retinal development and rare eye diseases. The authors integrated multi-omics data to uncover 45 genes related to rare eye disease that are potentially cis-regulated by UCNEs. Whole-genome sequencing (WGS) data mining revealed that 29 of these genes are associated with 84 UCNEs, comprising 178 rare variants that may contribute to rare eye disease. Tools like DeepSEA enabled the authors to narrow down their candidate variants for subsequent clinical exploration and validation, and can be leveraged by researchers to home in on variants in other rare diseases.

Overview of workflow for identifying ultraconserved non-coding elements (UCNEs) with potential cis-regulatory roles in rare eye disease. From Soriano et al., “Multi-omics analysis in human retina uncovers ultraconserved cis-regulatory elements at rare eye disease loci,” Nature Communications (2024).

Cardiovascular Disease

Heart disease is the leading cause of death in the United States, surpassing cancer. However, like the term “cancer”, “heart disease” represents a large range of cardiovascular issues with varying causes and implications for long-term health and mortality. Cardiovascular diseases are also intricately intertwined with conditions affecting other organ systems, like diabetes and psychiatric illnesses. Unraveling the mechanisms underlying the development and progression of cardiovascular diseases is key to improving diagnostics and treatments, as well as preventing the onset of disease altogether.

In a recent study, Hill et al. demonstrate how a foundation model can be used to dissect the molecular basis of one of the most common cardiovascular conditions – an arrhythmia called atrial fibrillation (AF). The authors used single-nucleus RNA-seq (snRNA-seq) and scVI modeling to identify transcriptional changes by cell type between patients with and without AF. These analyses and validation experiments revealed cardiomyocyte-expressed ATRNL1 as a key player in modulating cell stress response and cardiac action potential. Results also showed that KCNN3, an AF candidate gene that was previously thought to be expressed in cardiomyocytes, is actually expressed by lymphatic endothelial cells.

(e) Transcriptional differences in cell populations between healthy (CTRL) and atrial fibrillation (AF) patients, with (f) showing differential abundance of different cell types between patient groups. From Hill et al., “Large-scale single-nuclei profiling identifies role for ATRNL1 in atrial fibrillation,” Nature Communications (2024).

Gene Therapy/Drug Delivery

Beyond improving our understanding of disease mechanisms, foundation models can also be used in the design of drug delivery tools. Viral vectors, such as adeno-associated viruses (AAVs), are commonly used to deliver gene therapies to treat a variety of diseases. Researchers are continuously altering these vectors and how they are delivered in order to improve treatment efficacy and safety. Foundation models can be used to guide vector engineering and streamline the trial-and-error process of optimizing drug delivery.

One realm of applications is AAV-mediated drug delivery to the CNS, which necessitates crossing the blood-brain barrier (BBB). Getting past the BBB is inherently difficult, as this barrier has evolved to keep potentially harmful substances out of the CNS. In a 2024 study, Shay et al. combined AlphaFold modelling with cryo-electron tomography and human cell microarray assays to identify interactions between lab-evolved AAV capsids and proteins involved in immune response (IL3) and BBB crossing (LRP6). These approaches reveal binding sites for both proteins, as well as off-target tissue binding interactions, informing how these vectors can be improved for noninvasive delivery of therapeutics to the CNS.

AlphaFold-predicted interaction of human LRP6 protein domains (E1:E4) with AAV peptides (X1, CAP-Mac). From Shay et al., “Human cell surface-AAV interactomes identify LRP6 as blood-brain barrier transcytosis receptor and immune cytokine IL3 as AAV9 binder,” Nature Communications (2024).

How can I use foundation models in my research?

While these models can be an exceptionally powerful research tool, they also require significant compute resources to run on any feasible timescale. Specifically, training a foundation model is incredibly computationally intensive, as training datasets can contain many terabytes of data.

Due to these constraints, standard computing with CPUs (central processing units) alone is not amenable to foundation models – acceleration with GPUs (graphics processing units) is necessary. All computing involves CPUs, but GPUs are specialized processors that complement them by performing many operations in parallel, enabling demanding tasks like running massive datasets through foundation models. GPUs themselves differ in the kinds of processes for which they’re best suited; some tasks require a heavyweight GPU, while others run well on more lightweight resources.
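As a rough mental model (not actual GPU code), the difference is between looping over elements one at a time and applying one operation across many elements at once. The Python sketch below only mimics the data-parallel programming style – the chunked version still runs sequentially in Python; real speedups come from frameworks like CUDA or RAPIDS executing those lanes in hardware:

```python
def cpu_style(values):
    """Sequential flavor: one element handled per 'instruction'."""
    out = []
    for v in values:
        out.append(v * 2.0 + 1.0)
    return out

def gpu_style(values, width=4):
    """Data-parallel flavor: the same operation applied across a
    fixed-width chunk of elements at once. (A real GPU runs thousands
    of such lanes simultaneously in hardware.)"""
    out = []
    for start in range(0, len(values), width):
        chunk = values[start:start + width]
        out.extend(v * 2.0 + 1.0 for v in chunk)  # one 'vector instruction'
    return out

data = list(range(10))
assert cpu_style(data) == gpu_style(data)  # same result, different execution model
```

The results are identical; what changes on real hardware is throughput, which is why large matrix operations inside foundation models map so well onto GPUs.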

Nowadays, there are resources available that translate analytical tools into GPU-accelerated versions, such as RAPIDS single-cell analysis, Parabricks variant calling, and RELION 3D modeling. The Watershed operating system gives you all of the tools you need to access these workflows and successfully implement foundation models with GPU acceleration, including:

  • Install-and-go integration of GPU-accelerated pipelines
  • Easy access to multiple types of GPUs for various applications
  • Seamless switching between GPUs during testing and production
  • GPU parallelization to further improve model training speed

To learn more about how Watershed can empower your entire team, reach out to us at contact@watershed.bio or schedule a demo.