Synthetic biology is - quite literally - our future. A groundbreaking new biological foundation model, Evo 2, achieves state-of-the-art prediction of genetic variation impacts and generates coherent genome sequences spanning all domains of life. A diverse team from leading research institutions, including the Arc Institute, Stanford University, NVIDIA, and the University of California, Berkeley, trained the model on 9.3 trillion DNA base pairs and has fully shared all code, parameters, and data. A few highlights from the paper (link in comments):

🔬 Zero-shot prediction achieves state-of-the-art accuracy in genetic variant interpretation. Evo 2 can predict the functional consequences of genetic mutations across all domains of life without specialized training. It surpasses existing models in assessing the pathogenicity of both coding and noncoding variants, including BRCA1 cancer-linked mutations. This generalist capability suggests Evo 2 could revolutionize genetic disease research, reducing reliance on expensive, manually curated datasets.

🛠 Genome-scale generation paves the way for synthetic life design. Evo 2 can generate full-length genome sequences with realistic structure and function, including mitochondrial genomes, bacterial chromosomes, and yeast DNA. Unlike prior models, Evo 2 maintains natural sequence coherence, improving synthetic biology applications like engineered microbes or artificial organelles. This sets the stage for programmable biology at an unprecedented scale.

🧬 Unprecedented long-context understanding revolutionizes genomic analysis. Evo 2 operates with a context window of up to 1 million nucleotides - far beyond the capabilities of previous models - allowing it to analyze genomic features across vast distances. This ability enables it to accurately identify regulatory elements, exon-intron boundaries, and structural components critical for understanding genome function. Its long-context recall is a major breakthrough for interpreting complex biological sequences.
🎛 Inference-time search enables controllable epigenomic design. Evo 2's generative abilities extend beyond raw DNA sequence to epigenomic features, allowing researchers to design sequences with specific chromatin accessibility patterns. This approach successfully encoded Morse code messages into synthetic epigenomes, demonstrating a new method for controlling gene regulation via AI. This could lead to breakthroughs in gene therapy and epigenetic engineering.

🔮 Future potential: Toward AI-driven biological design and virtual cell modeling. Evo 2 represents a major leap toward AI-powered genomic engineering. Future iterations could integrate additional biological layers - such as transcriptomics and proteomics - to create virtual cell models that simulate complex cellular behaviors. This could revolutionize drug discovery, genetic therapy, and even synthetic life creation.
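The zero-shot variant interpretation highlighted above comes down to comparing model likelihoods for the reference and mutated sequence. A minimal sketch of that idea, assuming any DNA language model that exposes a sequence log-likelihood; the `toy_log_prob` function below is a made-up stand-in, not Evo 2's actual API:

```python
from typing import Callable

def zero_shot_variant_score(seq: str, pos: int, alt: str,
                            log_prob: Callable[[str], float]) -> float:
    """Score a single-nucleotide variant as the change in sequence
    log-likelihood (alt minus ref). More negative suggests the model
    finds the mutated sequence less plausible."""
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    return log_prob(alt_seq) - log_prob(seq)

# Hypothetical stand-in "model": rewards GC content (illustration only).
def toy_log_prob(seq: str) -> float:
    return sum(0.5 if base in "GC" else 0.1 for base in seq)

score = zero_shot_variant_score("ATGCGT", pos=0, alt="G", log_prob=toy_log_prob)
```

Swapping the same base back in yields a score of exactly zero, which makes the score a natural baseline-anchored measure of variant impact.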
Predictive Genetic Modeling
-
Computational modeling of gene regulatory networks has become increasingly important for understanding the biological complexity underlying disease progression in diverse cell types and for the identification of potential therapeutic targets in drug discovery. To address this inherent complexity, a team led by Christina Theodoris and Patrick Ellinor developed Geneformer, a context-aware deep learning model pretrained on large-scale transcriptomic data to enable predictions in network biology with limited data.

I included the link to the full paper and a brief summary below. Transfer learning enables predictions in network biology: https://lnkd.in/dXiKxTga. The model and data for training are available on Hugging Face at https://lnkd.in/d3Esd6eK and https://lnkd.in/dGxB2QaZ.

Methods overview: The authors assembled a large-scale pretraining corpus called Genecorpus-30M, comprising 29.9 million human single-cell transcriptomes from various tissues. They developed a rank value encoding method to represent the transcriptome of each single cell, ranking genes by their expression within that cell normalized by their expression across the entire corpus. The researchers designed Geneformer's architecture with six transformer encoder units, each composed of a self-attention layer and a feed-forward neural network layer. They implemented a masked learning objective for pretraining, where 15% of genes within each transcriptome were masked, and the model was trained to predict the masked genes. The authors optimized the pretraining process using dynamic length-grouped padding and distributed GPU training to handle the large-scale dataset efficiently.

Results overview: The authors showed that Geneformer boosted cell-type predictions compared to alternative methods, especially in complex multiclass prediction applications.
The researchers fine-tuned Geneformer to predict gene dosage sensitivity, achieving high accuracy with limited data and generalizing well to newly reported disease genes. They applied Geneformer to predict chromatin dynamics, including bivalent domains and transcription factor regulatory range, outperforming alternative methods. The authors used Geneformer to predict network hierarchy and distinguish central versus peripheral factors within gene networks. They developed an in silico deletion approach to model gene network connections and identify dosage-sensitive genes. The researchers applied Geneformer to disease modeling of cardiomyopathy, identifying candidate therapeutic targets that were experimentally validated in an iPSC-based model of the disease.
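The rank value encoding from the methods overview can be sketched in a few lines: divide each expressed gene's count by its typical expression across the corpus, then order genes by the result. A minimal illustration only; the gene names, counts, and medians below are hypothetical:

```python
def rank_value_encode(cell_counts, corpus_median):
    """Sketch of rank value encoding: normalize each expressed gene's
    count by its median expression across the corpus, then list genes
    from highest to lowest normalized value."""
    normalized = {
        gene: count / corpus_median[gene]
        for gene, count in cell_counts.items()
        if count > 0 and corpus_median.get(gene, 0) > 0
    }
    return sorted(normalized, key=normalized.get, reverse=True)

# Hypothetical counts: ACTB is a high-count housekeeping gene.
cell = {"GATA4": 10, "ACTB": 500, "TBX5": 4, "MYH7": 0}
median = {"GATA4": 2.0, "ACTB": 1000.0, "TBX5": 1.0, "MYH7": 3.0}
encoding = rank_value_encode(cell, median)  # ["GATA4", "TBX5", "ACTB"]
```

Note how ACTB, despite the highest raw count, ranks last after normalization: deprioritizing ubiquitously expressed housekeeping genes in favor of cell-state-specific ones is the intended effect of the encoding.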
-
A new DNA model can now predict how genetic variants affect splicing, gene expression, and chromatin, with single-base precision and across tissues.

1️⃣ AlphaGenome looks at 1 million letters of DNA at a time and predicts over 5,900 functional features, like gene activity, splicing, and chromatin structure.
2️⃣ It works across 11 types of genomic signals, including RNA expression, transcription start sites, splicing patterns, DNA accessibility, and 3D genome contacts.
3️⃣ Most models either look at small regions in high detail or large regions with blurry output. AlphaGenome does both: wide context and sharp, base-level predictions.
4️⃣ It identifies how variants change gene regulation across tissues and data types in under 1 second, making it useful for large-scale variant screening.
5️⃣ It is especially strong for splicing: it predicts not just splice sites, but also which sites get used and how exons are joined. It outperforms current tools like SpliceAI and Pangolin.
6️⃣ For gene expression, it predicts how variants affect RNA levels, even those far from the gene, more accurately than other models like Enformer or Borzoi.
7️⃣ It explains known disease variants, such as non-coding mutations near the TAL1 oncogene, showing how they disrupt multiple layers of regulation.
8️⃣ Tests show its best performance comes from combining wide DNA context, fine resolution, and multiple regulatory signals in one model.
9️⃣ AlphaGenome is freely available, with tools to predict gene regulation and variant effects directly from DNA sequence.
🔟 Clinicians and researchers can use it to explore how non-coding variants may contribute to disease, without needing multiple separate tools.
✍🏻 Žiga Avsec, Natasha Latysheva, Jun Cheng, Guido Novati, Kyle Taylor, Tom Ward, Clare Bycroft, Lauren Nicolaisen, Eirini Arvaniti, Joshua Pan, Raina Thomas, Vincent Dutordoir, Matteo Perino, PhD, Soham De, Alexander Karollus, Adam Gayoso, Toby Sargeant, Anne Mottram, Lai Hong Wong, Pavol Drotár, Adam Kosiorek, Andrew Senior, Richard Tanburn, Taylor Applebaum, Souradeep Basu, Demis Hassabis, Pushmeet Kohli. Advancing regulatory variant effect prediction with AlphaGenome. Nature. 2025. DOI: 10.1038/s41586-025-10014-0
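The variant screening described in point 4️⃣ follows a general ref-vs-alt pattern: predict functional tracks for both alleles and take the difference. This is a schematic sketch of that pattern, not AlphaGenome's actual API; `toy_tracks` is a placeholder predictor:

```python
from typing import Callable, List

def variant_effect(ref_seq: str, pos: int, alt: str,
                   predict_tracks: Callable[[str], List[float]]) -> List[float]:
    """Generic ref-vs-alt variant scoring: run a sequence-to-tracks model
    on both alleles and return the per-track difference (alt minus ref)."""
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    ref_pred = predict_tracks(ref_seq)
    alt_pred = predict_tracks(alt_seq)
    return [a - r for a, r in zip(alt_pred, ref_pred)]

# Placeholder predictor: two "tracks" that simply scale with GC content.
def toy_tracks(seq: str) -> List[float]:
    gc = sum(base in "GC" for base in seq) / len(seq)
    return [gc, 2 * gc]

delta = variant_effect("ATAT", pos=1, alt="G", predict_tracks=toy_tracks)
```

Scoring all modalities in one pass, as the model does, amounts to this delta computed over thousands of tracks simultaneously.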
-
AI provides more accurate predictions of how rare genetic variants affect health.

The recent advancements in understanding rare genetic variants and their impact on health have taken a significant leap forward with the introduction of a novel algorithm by researchers from the German Cancer Research Center, the European Molecular Biology Laboratory, and the Technical University of Munich. Their study, published in Nature Medicine, presents DeepRVAT (deep rare variant association testing), a deep learning-based tool that enhances the assessment of rare genetic variants. These genetic variants, occurring at frequencies of 0.1% or lower, have often been overlooked in traditional genome-wide association studies. However, they can play a crucial role in the manifestation of diseases.

The new algorithm utilizes data from 161,000 individuals in the UK Biobank, integrating insights about biological traits and genes. The model was trained on around 13 million variants, employing detailed annotations that describe the potential effects of each variant on cellular processes. The results from DeepRVAT are remarkable: it identified 352 associations with disease-related genes across 34 traits, significantly surpassing previous models in performance and reliability.

This innovative approach not only improves the accuracy of predicting genetic predispositions, especially for high-risk variants, but also uncovers links to various diseases, including cardiovascular conditions, cancers, and metabolic disorders. With the potential to transform personalized medicine, DeepRVAT can be flexibly combined with other testing methods and requires less computing power than its counterparts. The researchers are keen to apply this tool in clinical settings, particularly in identifying tailored treatments for pediatric cancer patients.
As the integration of DeepRVAT into diagnostic frameworks like the German Human Genome-Phenome Archive progresses, it stands to revolutionize our understanding and treatment of rare diseases, marking a significant advancement in genomic research and personalized healthcare. What are your thoughts on this? #ai #medical #healthcare #aiInnovation
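The core idea of aggregating annotated rare variants into a gene-level score can be caricatured as a weighted burden test. This is a toy sketch only: DeepRVAT learns the aggregation with a neural network, whereas the frequency cutoff and weights here are fixed, hypothetical numbers:

```python
def gene_impairment_score(variants, maf_cutoff=0.001):
    """Toy gene-level burden score: keep only rare variants (minor allele
    frequency at or below the cutoff) and sum their annotation-derived
    deleteriousness weights."""
    return sum(v["weight"] for v in variants if v["maf"] <= maf_cutoff)

# Hypothetical variants observed in one gene for one individual.
variants = [
    {"maf": 0.0005, "weight": 0.9},  # rare, annotated as likely damaging
    {"maf": 0.05,   "weight": 0.8},  # too common -> excluded
    {"maf": 0.0001, "weight": 0.3},  # rare, milder annotation
]
score = gene_impairment_score(variants)
```

In an association test, such per-gene scores would then be regressed against a trait across many individuals; the learned (rather than fixed) aggregation is what gives DeepRVAT its edge.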
-
AI for predicting gene expression with experimental accuracy 🧬

I've been digging into the details of this new GET (General Expression Transformer) since it was published earlier this month. The model accepts a matrix of two inputs (Figure 1):
1. A peak from ATAC-seq data. This is a region of the genome that should be accessible for binding by transcriptional regulators like transcription factors (TFs).
2. TF motifs. These are the stretches of DNA sequence that these regulators bind to.

By masking regions of this matrix, the model learns regulatory syntax. Matching the accessibility data with RNA sequencing data, the regulatory syntax can then be used to predict gene expression - even in cell types that weren't included in training.

It's an elegant formulation that showed strong performance and generalization across multiple tasks (hence the use of "foundation model"). It seems to outperform other expression prediction models (Enformer shown) at predicting experimental data (Figure 2).

One interesting part of this study: only a hand-picked set of ~1M cells from specific data sets was used for training, rather than training on the bulk ENCODE data set. This is where BioML can differ from ML more broadly: carefully curated data + specific biological priors can deliver great performance. Super interesting.

Link to study: https://lnkd.in/gtsMJmPi
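The masked-matrix training described above can be sketched as follows: hide a fraction of the peak-by-motif entries and ask the model to recover them. A minimal, framework-free illustration of the masking step only; the fraction, sentinel, and matrix values are illustrative, not GET's actual choices:

```python
import random

def mask_matrix(matrix, mask_fraction=0.15, seed=0):
    """Randomly hide a fraction of (peak, motif) entries. Returns the
    corrupted copy plus the positions and true values a model would be
    trained to recover."""
    rng = random.Random(seed)
    masked = [row[:] for row in matrix]  # leave the input intact
    targets = {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if rng.random() < mask_fraction:
                targets[(i, j)] = value
                masked[i][j] = None      # sentinel for "masked"
    return masked, targets

# Hypothetical 3x3 peak-by-motif occupancy matrix.
peaks_by_motifs = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
masked, targets = mask_matrix(peaks_by_motifs, mask_fraction=0.5, seed=0)
```

Training a transformer to fill in `targets` from `masked` is what forces it to internalize the co-occurrence structure, the "regulatory syntax", of accessible regions and motifs.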
-
Is "AlphaGenome" the key to understanding the mechanisms of human ageing? "Deep learning models that predict functional genomic measurements from DNA sequence are powerful tools for deciphering the genetic regulatory code. Existing methods trade off between input sequence length and prediction resolution, thereby limiting their modality scope and performance. We present AlphaGenome, which takes as input 1 megabase of DNA sequence and predicts thousands of functional genomic tracks up to single base pair resolution across diverse modalities – including gene expression, transcription initiation, chromatin accessibility, histone modifications, transcription factor binding, chromatin contact maps, splice site usage, and splice junction coordinates and strength. Trained on human and mouse genomes, AlphaGenome matches or exceeds the strongest respective available external models on 24 out of 26 evaluations on variant effect prediction. AlphaGenome’s ability to simultaneously score variant effects across all modalities accurately recapitulates the mechanisms of clinically-relevant variants near the TAL1 oncogene. To facilitate broader use, we provide tools for making genome track and variant effect predictions from sequence." "AlphaGenome advances efforts to decipher the genome’s regulatory code, offering a unified deep learning model that simultaneously predicts diverse functional genomic signals from megabase-scale DNA sequences. It matches or surpasses specialized SOTA models in their respective domains, which highlights the model’s robust grasp of fundamental regulatory principles encoded in DNA and its utility for mechanistic interpretation of non-coding variation. A core strength is AlphaGenome’s efficient multimodal variant effect prediction, which simultaneously scores variant impacts across all predicted modalities in a single inference pass. 
This integrated capability is crucial for understanding variants with complex mechanisms, as illustrated by the recapitulation of oncogenic TAL1 variant effects, and could power large scale analyses that dissect regulatory sequence elements genome-wide. Furthermore, AlphaGenome’s novel capability to directly model splice junctions enables a more holistic view of splice-disrupting variants." https://lnkd.in/erZJ2nM7
-
Another interesting paper from Arc Institute. This one combines protein language models with directed evolution to rapidly engineer proteins. Here is how it works, step-by-step:

1. Select the protein you want to engineer. Give the protein sequence to four different protein language models which, together, "score" the likely fitness of each amino acid mutation. Get a ranked list of 50-100 protein mutations that these models *predict* might improve function. (This is all done on the computer.)

2. Take the top 15 predicted "hits" and then synthesize the pairwise combination for each of them. For example, if a mutation at amino acid #100 boosted activity, but so did mutations at positions #120, #135, and so on, then you'd make a protein carrying each of these "double" mutations. With the top 15 hits, that is only 105 total proteins to synthesize.

3. Make and test all the double mutants in the wet lab, measuring the activity of each. For example, if you wanted to engineer a protein to be "brighter," you would put each double mutant in a microplate well and measure this directly. (This step captures epistatic relationships; it helps the models figure out which mutations are beneficial or damaging.) Feed the single + double mutant activity data to a neural network, called MULTI-evolve. The model extrapolates these data to infer *additional* mutations that might be synergistic, like combinations of 5-7 amino acid swaps.

4. Take the top three predictions for proteins with 5, 6, or 7 amino acid mutations, based on the neural network. Synthesize these proteins using a new DNA assembly approach, also reported in this paper, called MULTI-assembly. (The gist is that you anneal together a bunch of short oligos, each carrying one of the mutations, in a tube to reconstruct each of the full genes. This yields correctly built sequences 40-70% of the time.)

5. Finally, express the proteins in cells, measure their activities, and benchmark them against the wild-type protein.

The researchers used this method for various proteins. For one protein, called dCasRx (a CRISPR protein that targets RNA instead of DNA), they created a variant with 9.8-fold better activity and validated it across three different human genes. You can also optimize proteins for two different properties at the same time: the authors used their method, for example, to engineer an antibody targeting CD122 for both binding affinity AND its expression yield in cells.

TL;DR This is a new way to speed up directed evolution. Instead of using random mutations to search through a huge amount of "biological space" (remember that a single protein of just 100 amino acids has 20^100 possible sequences), these researchers use AI models to navigate this search space more quickly. Can we use the same approach to make entire gene circuits?
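The arithmetic in step 2 - 15 hits giving 105 double mutants - is just the number of unordered pairs, C(15, 2). A small sketch of the enumeration, with hypothetical mutation labels:

```python
from itertools import combinations

def double_mutants(top_hits):
    """All unordered pairs of single-mutation hits: 15 hits give
    C(15, 2) = 15 * 14 / 2 = 105 double mutants to synthesize."""
    return list(combinations(top_hits, 2))

hits = [f"M{i}" for i in range(1, 16)]  # 15 hypothetical single mutations
pairs = double_mutants(hits)
```

The pairwise design is what keeps the wet-lab burden manageable: 105 constructs instead of the astronomically many higher-order combinations, which the MULTI-evolve network then extrapolates over.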
-
🚀 Genomic Language Models: Transformers for the Language of Life

Large language models have reshaped natural language processing - now, the same architectures are being applied to the genome. 📈 I've noticed that genomic language models are increasingly featured in the latest journal issues, reflecting the rapid momentum and growing interest in this technology.

This new Nature Machine Intelligence review (March 2025) explores genome language models (gLMs), transformer-based deep learning systems trained on DNA sequences. Just as LLMs learn grammar and meaning in text, gLMs are beginning to uncover the regulatory grammar of the genome.

🧬 Applications of gLMs in genomics:
- Identifying regulatory elements (promoters, enhancers, silencers)
- Predicting gene expression and chromatin accessibility
- Assessing the functional impact of genetic variants
- Mapping 3D genome architecture and transcription factor binding
- Discovering functions of non-coding RNAs

🔑 Highlights from the review:
- Why transformers? Their attention mechanism captures long-range DNA dependencies, crucial for understanding regulation.
- Pretraining power: gLMs learn from massive unlabelled DNA, enabling zero- and few-shot predictions in biology.
- Model families: From hybrid models like Enformer and Borzoi to transformer gLMs (DNABERT, Nucleotide Transformer, GENA-LM) and beyond (HyenaDNA, Evo).
- Challenges ahead: Whole-chromosome modelling, curated datasets for long-range regulation, and better interpretability.

💡 Takeaway: Genomic language models are not just technical breakthroughs - they're powerful tools for decoding how genes are regulated, why mutations matter, and how the genome shapes health and disease.

📄 See a link to the review in the comments below, or DM me for the full text.

#Genomics #ArtificialIntelligence #MachineLearning #Transformers #GenomicLanguageModels #LargeLanguageModels #Bioinformatics #ComputationalBiology
-
My team, together with American colleagues, just published a predictive model based on genetics and artificial intelligence for detecting the primary site of metastatic malignant tumors.

Zagreb, March 14, 2025. Physicians and scientists from the St. Catherine Specialty Hospital (Croatia), in collaboration with colleagues from the prestigious Dartmouth Health (USA), successfully conducted the first whole-genome sequencing (WGS) with clinical interpretation in Croatia last September. A few months later, they developed a model with tremendous potential for detecting the primary site of malignant tumors of unknown origin. The results of their in silico study, conducted on a sample of more than 20,000 metastatic tumor tissues and an analysis of over 600 genes, are based on computational simulations and were recently published in the International Journal of Molecular Sciences (https://lnkd.in/deUvnZix). The study authors are Dragan Primorac, Petar Brlek, and Luka Bulić from St. Catherine Specialty Hospital (Croatia), as well as Nidhi Shah and Parth Shah from Dartmouth Health (USA).

In the published study, which applies artificial intelligence-driven computational simulations, more than 20,000 metastatic tumor samples were analyzed, including data on the patient's sex, age, and the presence of genetic variants across more than 600 different genes. The model's quality was assessed through cross-validation on the training set and evaluation on a separate test set. Finally, the optimal model was integrated with a graphical user interface in the OncoOrigin software. The significance of specific features for distinguishing between different primary tumor sites was also determined. Among the four machine learning models tested, the XGBoost classifier demonstrated the best performance, achieving a ROC-AUC (Receiver Operating Characteristic - Area Under the Curve) value of 0.97, where the maximum possible value is 1.
The ROC-AUC value is a measure of model quality, indicating how well the model identifies a given class, such as a specific type of tumor. A value of 1 represents a perfect model that correctly classifies every tumor type without error, while a value of 0.5 indicates a model no better than random guessing. The closer the ROC-AUC value is to 1, the more accurately the model identifies the correct primary tumor site. This result highlights the exceptional predictive power of the model, making it a highly useful tool for oncologists in making treatment decisions. Unlike other machine learning models described in the literature, OncoOrigin stands out due to its integration with an intuitive graphical interface, facilitating easier application in clinical practice. This makes the tool more accessible to oncology specialists, and its implementation in routine diagnostics could significantly improve the identification of primary tumor sites and enable more precise treatment.
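The ROC-AUC described above has a simple probabilistic reading: it is the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A from-scratch sketch of that rank interpretation, using toy labels and scores rather than the study's data:

```python
def roc_auc(labels, scores):
    """ROC-AUC via its rank interpretation: the probability that a
    random positive scores above a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfectly separated scores give the maximum AUC of 1.0; an
# uninformative scorer hovers around 0.5 (random guessing).
auc = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
```

On this scale, the reported 0.97 means a tumor of the correct primary site outranks one of an incorrect site about 97% of the time.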
-
In a study published in Cancer Discovery, scientists at University of California San Diego School of Medicine leveraged a machine learning algorithm to tackle one of the biggest challenges facing cancer researchers: predicting when cancer will resist chemotherapy. All cells, including cancer cells, rely on complex molecular machinery to replicate DNA as part of normal cell division. Most chemotherapies work by disrupting this DNA replication machinery in rapidly dividing tumor cells. While scientists recognize that a tumor's genetic composition heavily influences its specific drug response, the vast multitude of mutations found within tumors has made prediction of drug resistance a challenging prospect. The new algorithm overcomes this barrier by exploring how numerous genetic mutations collectively influence a tumor's reaction to drugs that impede DNA replication. After training their model, the researchers put it to the test in cervical cancer, in which roughly 35% of tumors persist after treatment. The model was able to accurately identify tumors that were susceptible to therapy, which were associated with improved patient outcomes. The model also effectively pinpointed tumors likely to resist treatment. https://lnkd.in/d7PsR3Ev