Optimizing Noncoding Variant Effect Discovery

Figuring out exactly what the "dark matter" of the genome does

Aug 15, 2025

📄 Paper: “AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model” 🔗

📅 2025, 📍bioRxiv, 👩‍🔬 Avsec et al.

🧠 The Big Idea

The “dark matter” of the genome (noncoding regions that regulate gene expression) has long been a scientific mystery. Noncoding variants are extremely difficult to study because a single variant can produce diverse effects, some of them specific to a cell type or tissue. Cataloguing every effect a variant can have is nearly impossible without computational methods. The researchers behind AlphaGenome set out to build an AI model that advances noncoding variant effect prediction, helping us learn more about how our genes really work.

🔬 What They Did

The challenge: Most deep learning models that predict the effects of noncoding variants in DNA face a trade-off between how long their input sequence can be and how fine their prediction resolution is, which has limited both their scope and their performance.
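
To get a feel for that trade-off, here is a rough back-of-the-envelope sketch. The three configurations are made-up illustrations, not the settings of any real model; the only number taken from the paper summary is the 1-megabase, base-pair-resolution case. It simply counts how many output positions a model has to predict per track.

```python
def output_positions(context_bp: int, resolution_bp: int) -> int:
    """Number of prediction bins along the sequence for a single output track."""
    if context_bp <= 0 or resolution_bp <= 0:
        raise ValueError("context_bp and resolution_bp must be positive")
    return context_bp // resolution_bp


# Hypothetical configurations, chosen only to illustrate the trade-off.
configs = {
    "short context, fine resolution": (10_000, 1),
    "long context, coarse resolution": (200_000, 128),
    "long context, fine resolution": (1_000_000, 1),
}

for name, (context_bp, resolution_bp) in configs.items():
    bins = output_positions(context_bp, resolution_bp)
    print(f"{name}: {context_bp:,} bp at {resolution_bp} bp/bin -> {bins:,} bins per track")
```

The last row is the expensive one: every extra track multiplies a million positions, which is one way to see why earlier models tended to pick either long context or fine resolution.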

The method: The researchers designed AlphaGenome, a model that combines multimodal prediction, long sequence context, and base-pair resolution. In other words, it takes the ingredients that made leading effect prediction models work so well and puts them into one optimized model.

The data: AlphaGenome was trained on human and mouse genomes, using data from sources common in variant prediction work, such as GTEx and Roadmap Epigenomics, along with other public datasets (typically backed by research papers describing a variant’s discovered effects).

The results: AlphaGenome takes a megabase of DNA as input and produces thousands of genomic predictions across a diverse set of modalities, ranging from transcription initiation to histone modifications to splice junction coordinates and strength. It also matched or exceeded the strongest previous models on 24 of 26 variant effect prediction evaluations.
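
For readers wondering what "variant effect prediction" looks like mechanically, the usual recipe with sequence models is to predict once on the reference sequence, once on the sequence carrying the variant, and compare. The sketch below shows that idea with a toy stand-in predictor. To be clear, `apply_snv`, `toy_track`, and `variant_effect` are hypothetical names made up for this example, and none of this is AlphaGenome's actual API or scoring method.

```python
from typing import Callable, List


def apply_snv(sequence: str, position: int, alt_base: str) -> str:
    """Return a copy of `sequence` with a single base swapped at a 0-based position."""
    if alt_base not in ("A", "C", "G", "T"):
        raise ValueError("alt_base must be one of A, C, G, T")
    if not 0 <= position < len(sequence):
        raise IndexError("position is outside the sequence")
    return sequence[:position] + alt_base + sequence[position + 1:]


def toy_track(sequence: str, motif: str = "TATAAA") -> List[float]:
    """Toy stand-in for a model: 1.0 wherever the motif starts, 0.0 elsewhere."""
    return [1.0 if sequence.startswith(motif, i) else 0.0 for i in range(len(sequence))]


def variant_effect(
    ref_seq: str,
    position: int,
    alt_base: str,
    predict: Callable[[str], List[float]] = toy_track,
) -> float:
    """Score a variant as the summed alternate prediction minus the summed reference prediction."""
    alt_seq = apply_snv(ref_seq, position, alt_base)
    return sum(predict(alt_seq)) - sum(predict(ref_seq))


# A T>C change inside the toy motif removes the predicted signal.
reference = "GGGTATAAAGGG"
print(variant_effect(reference, position=5, alt_base="C"))
```

A real model would replace `toy_track` with predictions for thousands of tracks, and the comparison could be done per track and per cell type rather than as a single summed number.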

The breakthrough: This model takes in far more sequence context than most published models and produces far more predictions as well. That means more accurate outputs along with more tracks to look over. AlphaGenome is also multimodal, meaning it predicts many different kinds of functional readouts at once, so it can capture multiple effects of whichever noncoding variant is fed into it.

🧭 Why It Matters

AlphaGenome really propels research on noncoding variant effects forward because of the sheer amount of sequence it can process and the number of results it can produce compared with other models, while still working faster and performing as accurately as, or more accurately than, nearly all of them. This is amazing: it means scientists will gain a much greater understanding of exactly how the genome works, and what each individual gene does, that much faster. We’ll also be able to pursue further research for medical or commercial purposes and do even more good along the way. Seriously, hats off to Avsec et al.

🧪 Quick Stats

Input size: up to 1 million base pairs per run (among the largest context windows in genomics AI so far)
Genome coverage: targets the noncoding regions that make up roughly 98% of the human genome
Accuracy: surpasses prior state-of-the-art in 50+ functional genomics benchmarks
Training data: integrated from ENCODE, Roadmap Epigenomics, GTEx, and other public multi-omics datasets
Scale: Trained on petabyte-scale genomic data

📚 Further Reading

AlphaGenome’s website
Open-source GitHub code
Signal2030’s podcast episode
