AI for omics data

A focus on a few contributions

Mahendra Mariadassou

MaIAGE

June 25, 2026

Disclaimer

I’m not an expert in any of those domains;
But AI¹ has disrupted a few domains and has an impressive track record;
So I’m trying to be aware of what’s going on.

Don’t expect a deep dive on technical details / architecture details.
Plenty of online resources for introductory material (e.g. Fidle).

Introduction

A very hot topic
Variant calling
Protein Folding
Foundation models

An explosion of AI models…

ChatGPT and LLMs (Large Language Models) took the world by storm in 2022
In parallel, an explosion of foundation models (FMs) trained at scale on biological data:
- 2018: DeepVariant [1]: CV-based variant calling
- 2021: AlphaFold2 [2]: 3D structures from AA sequences (precursor to FM)
- 2023: DNABERT* [3–5] (genome FMs), Geneformer [6] and scGPT [7] (single-cell FMs)
- 2024: AlphaFold3 [8]: 3D structure of complexes (proteins / DNA / RNA / small molecules and ligands), Evo [9]: long-context genome FM
- 2025: Evo 2 [10]: improvement over Evo, perspective on multimodal single-cell foundation models [11], virtual tissues [12]
- 2026: Virtual cell initiatives [13], protein world models [14] (ESM Atlas and ESMC)
- etc.

Variant calling

Goal of variant calling

Images from [DeepVariant blog](https://ai.googleblog.com/2017/12/deepvariant-highly-accurate-genomes.html)

Principle of DeepVariant

Collaboration Google Brain / Verily Life Sciences, images from DeepVariant blog post

Impressive Results (PrecisionFDA)

Confirmed in several studies

Accuracy of SNV (A) and Indel (B) calling. Fig 1 from [15]

Precision-recall curves of several callers on high (top) and low (bottom) datasets. DeepVariant is in blue and red. Adapted from Fig 2 of [16]

Why it matters

Higher accuracy \(\rightarrow\) lower coverage \(\rightarrow\) less money 💰 for GBS

AI for variant calling

Deep neural networks can learn to call variants in pileup images with high accuracy and capture signals that are difficult for humans to identify.

Protein structure

A very short primer on proteins

Proteins are responsible for functions essential to life.
Biological function depends on tertiary / quaternary structure
Understanding protein folding is essential
- Incorrect folding can lead to diseases
- Protein structure useful to develop drugs

Protein structure

Resolving protein structure is (or used to be) hard.

The first protein structure (that of myoglobin) was solved in 1958 (by X-ray crystallography) and refined in 1960 by John Kendrew (who received the Nobel Prize in 1962). He started working on this structure in 1949.
By 2021, there were 200 000 experimentally resolved proteins (0.1% of the >200M known proteins)
Huge need for structure prediction methods: see timeline in [17]

John Kendrew and the myoglobin structure in the PDB.

Main events in structure prediction. Fig. 1 from [17]

AlphaFold2 enters the ring

CASP13 (2018): Warm-up round
CASP14 (2020): Win by KO with “transformational” results.
AlphaFoldDB (2022): 200M+ protein structure predictions
AlphaFold3 [8]: Multimer structure prediction

Results from CASP14 (2020). Higher is better.

Large protein language models (pLM)

Drawbacks: AlphaFold2 and AlphaFold3 cannot produce novel structures and require a multiple sequence alignment (MSA).

pLMs have been proposed to tackle these problems:

ESM2 [18], based on masked transformers for proteins.
- attention patterns \(\leftrightarrow\) residue contact map
- lower TM-scores (template modeling scores) than AlphaFold2 overall, but higher scores than AlphaFold2 when both use only the AA sequence and no MSA
- Similar accuracy (median all-atom RMSD of 1.91Å and a backbone RMSD of 1.33Å) for high confidence proteins

EMBER2 [19, 20], which uses embeddings from a pLM (ProtT5) trained on AA sequences rather than MSAs to predict 2D (inter-residue distance) or 3D structure.
- Less accurate than AlphaFold2 but orders of magnitude faster
- Outperforms AlphaFold2 for deep mutational scanning (impact of single AA change on structure)
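
The two accuracy metrics quoted above (RMSD and TM-score) can be made concrete with a minimal sketch. It assumes two structures given as matched N × 3 arrays of Cα coordinates; the function names are hypothetical. Note that the real TM-score is maximized over superpositions, whereas this sketch evaluates it on a single Kabsch superposition, so it only approximates what dedicated tools report.

```python
import numpy as np

def kabsch_superpose(mobile, target):
    """Rigidly superpose `mobile` onto `target` (both N x 3) with the Kabsch algorithm."""
    mob_c = mobile - mobile.mean(axis=0)
    tgt_c = target - target.mean(axis=0)
    u, _, vt = np.linalg.svd(mob_c.T @ tgt_c)      # SVD of the 3 x 3 covariance matrix
    d = np.sign(np.linalg.det(u @ vt))             # -1 would mean a reflection
    rot = u @ np.diag([1.0, 1.0, d]) @ vt          # proper rotation for row vectors
    return mob_c @ rot + target.mean(axis=0)

def rmsd(aligned, target):
    """Root-mean-square deviation (same unit as the coordinates, e.g. angstroms)."""
    return float(np.sqrt(np.mean(np.sum((aligned - target) ** 2, axis=1))))

def tm_score(aligned, target):
    """TM-score of one given superposition, normalized by the target length."""
    n = len(target)
    d0 = max(1.24 * np.cbrt(n - 15) - 1.8, 0.5)    # length-dependent distance scale
    dist = np.linalg.norm(aligned - target, axis=1)
    return float(np.mean(1.0 / (1.0 + (dist / d0) ** 2)))

# Toy check: a rotated and translated copy of a random "structure"
rng = np.random.default_rng(0)
target = rng.normal(scale=10.0, size=(50, 3))
theta = np.pi / 6
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
mobile = target @ rot_z.T + np.array([5.0, -3.0, 2.0])
aligned = kabsch_superpose(mobile, target)
print(rmsd(aligned, target), tm_score(aligned, target))
```

RMSD is dominated by the worst-placed residues, while the TM-score saturates for large deviations and lies in (0, 1], which is why both are usually reported together.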

Scaling up to the “Protein World”

Move over, AlphaFold: open-source model predicts shape of 1 billion proteins [21]

ESM Cambrian (ESMC) [14] trained on 2.8 billion protein sequences
- use the latent space to predict folding (ESMFold2)
- build an Atlas of 6.8 billion sequences and 1.1 billion predicted structures (ESM Atlas)
- use sparse autoencoders (SAE) to extract \(\sim 16 000\) interpretable features from representations
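
A minimal sketch of what a sparse autoencoder does to model representations, assuming a single hidden layer with ReLU activations and an L1 sparsity penalty. The dimensions, weights and function names are hypothetical and untrained; this is not the ESMC setup, only the generic technique.

```python
import numpy as np

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode representations into non-negative features, then reconstruct them."""
    features = np.maximum(0.0, (x - b_dec) @ w_enc + b_enc)   # ReLU feature activations
    recon = features @ w_dec + b_dec
    return features, recon

def sae_loss(x, features, recon, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features towards zero."""
    mse = np.mean(np.sum((x - recon) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(features), axis=1))
    return float(mse + l1_coeff * sparsity)

# Hypothetical sizes: 64-dim representations expanded into 256 candidate features
rng = np.random.default_rng(0)
d_model, n_features = 64, 256
x = rng.normal(size=(8, d_model))                 # stand-in for per-residue embeddings
w_enc = rng.normal(scale=0.1, size=(d_model, n_features))
w_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

features, recon = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, recon)
```

After training, each column of the feature matrix is inspected for the inputs that activate it most, which is how a feature ends up with a human-readable label.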

Protein world and atlas (bi[o]hub)

Compositional grammar for protein biology

ESM Atlas

Use case: designing antibodies

Fraction of designed proteins that successfully bound their targets in the lab

Foundation Models

Motivations

Applications

Genomics: Identify genetic variants associated with diseases.
Transcriptomics: Analyze gene expression patterns to understand cellular responses.
Metabolomics: Investigate metabolic profiles to understand disease mechanisms.

Expected Benefits:

Improved Accuracy: high accuracy in prediction of biological outcomes.
Efficiency: Automation of data analysis processes.
Scalability: Analysis of large-scale datasets (if inference is optimized).
Transfer: Transfer knowledge learned from large datasets to small datasets analyses.

Evo 2: the new kid on the block

Evo 2 [10]

Key points:

A biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life.
Trained to predict the next nucleotide with a context window of up to one million tokens
2 versions: 7B and 40B parameters
Fully open: model architecture, model weights, training code, inference code, and the OpenGenome2 training dataset

Rationale

Next-word prediction encompasses many tasks (grammar, knowledge, translation, arithmetic, etc.)
Likewise, biological mechanisms leave footprints in sequence variation (think co-evolution).
Next-base prediction could be used to predict genetic disease, RNA structure, etc.
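
To make the rationale concrete, here is a minimal sketch of scoring a variant with a next-base model. It assumes a toy fixed-order Markov model as a stand-in for the language model (Evo 2 itself is a much larger neural network); all function names are hypothetical. The variant score is the change in sequence log-likelihood between the alternate and the reference sequence.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"

def fit_markov(sequences, order=3):
    """Count which base follows each context of length `order`."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts

def log_likelihood(seq, counts, order=3):
    """Sum of log P(next base | previous `order` bases), with add-one smoothing."""
    total = 0.0
    for i in range(order, len(seq)):
        ctx = counts.get(seq[i - order:i], Counter())
        prob = (ctx[seq[i]] + 1) / (sum(ctx.values()) + len(ALPHABET))
        total += math.log(prob)
    return total

def variant_score(ref_seq, pos, alt, counts, order=3):
    """Delta log-likelihood of a single-base substitution (negative = less likely)."""
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    return log_likelihood(alt_seq, counts, order) - log_likelihood(ref_seq, counts, order)

# Toy data: the model only ever saw a perfectly periodic sequence
counts = fit_markov(["ATGGCC" * 20])
ref = "ATGGCC" * 3
score = variant_score(ref, pos=7, alt="A", counts=counts)   # T -> A breaks the pattern
neutral = variant_score(ref, pos=7, alt=ref[7], counts=counts)
```

A substitution that breaks a pattern the model has learned gets a negative score, while a substitution identical to the reference scores zero.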

Goal: Develop long context models for sequences

Long contexts are required to capture the complexity of genomes (long repeats) and the length of operational units (genes, operons, plasmids, etc.)

Evo 2 dataset

OpenGenome2: 9.3 trillion nucleotides from all domains of life.

113,379 representative prokaryotic genomes (357B nt)
6.98T nt from eukaryotic genomes
854B nt of non-redundant metagenomic sequencing data
2.82B nt of organelle genomes
602B nt of eukaryotic sequence (coding regions + surrounding windows).
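
A quick arithmetic check of the composition listed above, using only the figures on this slide (the dictionary labels are shorthand). The listed components add up to roughly 8.8T nt, which is below the 9.3T headline figure. The slide does not explain the gap; one possible reading, stated here purely as an assumption, is that the headline counts bases seen during training rather than unique nucleotides in the dataset.

```python
# Component sizes in nucleotides, copied from the list above
components_nt = {
    "prokaryotic genomes": 357e9,
    "eukaryotic genomes": 6.98e12,
    "metagenomic data": 854e9,
    "organelle genomes": 2.82e9,
    "eukaryotic coding windows": 602e9,
}

total_nt = sum(components_nt.values())
for name, n in components_nt.items():
    # Share of each component in the listed total
    print(f"{name:28s} {n / 1e12:6.3f}T  ({100 * n / total_nt:5.2f}%)")
print(f"listed total: {total_nt / 1e12:.2f}T")
```

Eukaryotic genomes account for close to four fifths of the listed nucleotides, so the training mix is dominated by eukaryotic sequence.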

Evo 2 training

Two-phase training strategy:

Short-context pretraining (8,192 bp): learn functional genetic elements
Midtraining with a long context window (1 Mbp): learn the composition of elements

Higher throughput than transformers
Perplexity improves with model size and context length
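
For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below assumes we already have the probability the model assigned to each observed base; the probabilities are made up for illustration.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

uniform = perplexity([0.25] * 100)     # model guessing uniformly over A/C/G/T
confident = perplexity([0.9] * 100)    # model that is usually right
```

A uniform guess over four bases gives a perplexity of about 4, the worst useful baseline for DNA; lower values mean the model is less "surprised" by the next base.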

Evo 2 learns biological functions

Likelihoods reflect biology: codons…

and exons

Predicts gene essentiality in prokaryotes…

and humans

Evo 2 predicts the effect of variants

Evo 2 is competitive with the state of the art (SOTA) on coding regions in ClinVar

Outperforms SOTA on non-coding regions

Interpreting features

Interpret layer 26 features

Prophage detector

Secondary structure detector

Use case: chromatin accessibility

Design and validate regions with different chromatin accessibility

Experimental validation of the designed sequences

Next-generation Foundation models

Moving beyond DNA sequences

1st gen: DNA, AA

gLM: DNABERTx, Evox, Nucleotide Transformer
pLM: ESMx

2nd gen:

Single-cell transcriptomics
Spatial omics
Multimodal data

Credits to OmicsML/awesome-foundation-model-single-cell-papers

Key evolutions

Biological entities become tokens:

nucleotides 🧬
genes
cells 🦠
tissues 🫀
pangenome
- genes = tokens
- genomes = sentences

microbiome:
- ASV/OTU/taxa = tokens
- communities = sentences
- ecosystems = corpus
  - MGM [22]
  - BiomeGPT [23]
  - Compass [24]
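
A minimal sketch of the "taxa = tokens, communities = sentences" analogy, assuming samples are given as taxon-to-count dictionaries. Ordering taxa by decreasing abundance is an assumption made for illustration and is not necessarily what MGM, BiomeGPT or Compass do; the function names and special tokens are hypothetical.

```python
def community_to_sentence(abundances, max_len=512, min_count=1):
    """Turn one sample {taxon: count} into a list of taxon tokens, most abundant first."""
    kept = [(taxon, n) for taxon, n in abundances.items() if n >= min_count]
    kept.sort(key=lambda item: (-item[1], item[0]))   # abundance desc, name breaks ties
    return [taxon for taxon, _ in kept[:max_len]]

def build_vocab(corpus):
    """Map every token seen in the corpus to an integer id."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for sentence in corpus:
        for token in sentence:
            vocab.setdefault(token, len(vocab))
    return vocab

# Toy corpus of two communities (counts are made up)
samples = [
    {"Bacteroides": 120, "Prevotella": 30, "Escherichia": 0},
    {"Prevotella": 80, "Faecalibacterium": 80, "Bacteroides": 5},
]
corpus = [community_to_sentence(s) for s in samples]
vocab = build_vocab(corpus)
encoded = [[vocab.get(tok, vocab["<unk>"]) for tok in sent] for sent in corpus]
```

Once communities are encoded as integer sequences, the same transformer machinery used for text or DNA can be trained on them.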

Foundation models are not magic 🪄

Challenges 💪

experimental validation
data leakage
interpretability
causal inference
reproducibility
cost and availability

New actors 🏦

Alphabet DeepMind
CZ Bi[o]hub
Arc Institute

Conclusion

Main messages

AI methods have already changed some fields
- variant calling
- protein folding

Foundation models for sequences 🧬 come out weekly
- No “killer app” yet
- New tool 🔧 in the toolbox 🧰
- Huge potential for our applications
- Heavy dose of skepticism is required

Foundation models are
- moving from sequences to cells, tissues (and other biological entities)
- becoming multimodal

Additional slides

Principle (I)

Consider pileup as an image

Principle (II)

Works on images of size 100 rows \(\times\) 221 bases
- 5 rows for the reference
- up to 95 rows for reads
Uses 6 “channels” (think RGB)
- read-level features (mapping quality, strand, supports candidate)
- base-level features
2 additional channels (maps to alt. allele and haplotype)
In the original publication: ConvNetJuly2015v2 CNN with 9 partitions
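
A minimal sketch of the "pileup as an image" encoding, using the sizes quoted above (100 rows, 221 bases, 6 channels). The channel order, the numeric codes and the `encode_pileup` function are hypothetical choices for illustration and do not reproduce DeepVariant's actual encoding.

```python
import numpy as np

HEIGHT, WIDTH, N_CHANNELS = 100, 221, 6    # sizes quoted on the slide
REF_ROWS = 5                                # rows reserved for the reference
BASE_CODE = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}   # hypothetical intensities

def encode_pileup(reference, reads, candidate):
    """Encode a reference window and aligned reads as a (100, 221, 6) tensor.

    Illustrative channels: 0 base, 1 base quality, 2 mapping quality,
    3 strand, 4 differs from reference, 5 read supports the candidate allele.
    """
    assert len(reference) == WIDTH
    cand_pos, cand_alt = candidate
    image = np.zeros((HEIGHT, WIDTH, N_CHANNELS), dtype=np.float32)
    image[:REF_ROWS, :, 0] = [BASE_CODE[b] for b in reference]   # reference on top rows
    for row, read in enumerate(reads[: HEIGHT - REF_ROWS], start=REF_ROWS):
        idx = cand_pos - read["start"]
        supports = 0 <= idx < len(read["seq"]) and read["seq"][idx] == cand_alt
        for offset, base in enumerate(read["seq"]):
            col = read["start"] + offset
            if not 0 <= col < WIDTH:
                continue                                          # clip to the window
            image[row, col, 0] = BASE_CODE[base]
            image[row, col, 1] = min(read["base_quals"][offset], 40) / 40.0
            image[row, col, 2] = min(read["mapq"], 60) / 60.0
            image[row, col, 3] = 1.0 if read["reverse"] else 0.5
            image[row, col, 4] = 1.0 if base != reference[col] else 0.0
            image[row, col, 5] = 1.0 if supports else 0.5
    return image

# Toy example: one read carrying a T at position 4 where the reference has an A
reference = "ACGT" * 55 + "A"                                     # 221 bases
reads = [{"seq": "ACGTTCGT", "start": 0, "base_quals": [30] * 8,
          "mapq": 60, "reverse": False}]
image = encode_pileup(reference, reads, candidate=(4, "T"))
```

The resulting tensor has the same layout as a small multi-channel picture, so a standard image-classification CNN can take it as input and output genotype probabilities.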

Test it for yourself

Images from DeepVariant blog post

Homozygous alternate allele

Heterozygous

Harder, isn’t it? 🤔

References

1. Poplin R, Chang P-C, Alexander D, Schwartz S, Colthurst T, Ku A, et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology. 2018;36:983–7. doi:10.1038/nbt.4235.

2. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–9. doi:10.1038/s41586-021-03819-2.

3. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021;37:2112–20. doi:10.1093/bioinformatics/btab083.

4. Zhou Z, Ji Y, Li W, Dutta P, Davuluri R, Liu H. DNABERT-2: Efficient foundation model and benchmark for multi-species genome. 2023. doi:10.48550/ARXIV.2306.15006.

5. Zhou Z, Wu W, Ho H, Wang J, Shi L, Davuluri RV, et al. DNABERT-s: Pioneering species differentiation with species-aware DNA embeddings. 2024. doi:10.48550/ARXIV.2402.08777.

6. Theodoris CV, Xiao L, Chopra A, Chaffin MD, Al Sayed ZR, Hill MC, et al. Transfer learning enables predictions in network biology. Nature. 2023;618:616–24. doi:10.1038/s41586-023-06139-9.

7. Cui H, Wang C, Maan H, Pang K, Luo F, Duan N, et al. scGPT: Toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods. 2024;21:1470–80. doi:10.1038/s41592-024-02201-0.

8. Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630:493–500. doi:10.1038/s41586-024-07487-w.

9. Nguyen E, Poli M, Durrant MG, Kang B, Katrekar D, Li DB, et al. Sequence modeling and design from molecular to genome scale with Evo. Science. 2024;386. doi:10.1126/science.ado9336.

10. Brixi G, Durrant MG, Ku J, Naghipourfar M, Poli M, Sun G, et al. Genome modelling and design across all domains of life with Evo 2. Nature. 2026;652:1349–61. doi:10.1038/s41586-026-10176-5.

11. Cui H, Tejada-Lapuerta A, Brbić M, Saez-Rodriguez J, Cristea S, Goodarzi H, et al. Towards multimodal foundation models in molecular cell biology. Nature. 2025;640:623–33. doi:10.1038/s41586-025-08710-y.

12. Wenckstern J, Jain E, Vasilev K, Pariset M, Wicki A, Gut G, et al. AI-powered virtual tissues from spatial proteomics for clinical diagnostics and biomedical discovery. 2025. doi:10.48550/ARXIV.2501.06039.

13. Bunne C, Roohani Y, Rosen Y, Gupta A, Zhang X, Roed M, et al. How to build the virtual cell with artificial intelligence: Priorities and opportunities. Cell. 2024;187:7045–63. doi:10.1016/j.cell.2024.11.015.

14. Candido S, Hayes T, Derry A, Rao R, Lin Z, Verkuil R, et al. Language modeling materializes a world model of protein biology. 2026. doi:10.64898/2026.06.03.729735.

15. Barbitoff YA, Abasov R, Tvorogova VE, Glotov AS, Predeus AV. Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery. BMC Genomics. 2022;23. doi:10.1186/s12864-022-08365-3.

16. Chen N-C, Kolesnikov A, Goel S, Yun T, Chang P-C, Carroll A. Improving variant calling using population data and deep learning. BMC Bioinformatics. 2023;24. doi:10.1186/s12859-023-05294-0.

17. Bertoline LMF, Lima AN, Krieger JE, Teixeira SK. Before and after AlphaFold2: An overview of protein structure prediction. Frontiers in Bioinformatics. 2023;3. doi:10.3389/fbinf.2023.1120370.

18. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379:1123–30. doi:10.1126/science.ade2574.

19. Weissenow K, Heinzinger M, Rost B. Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction. Structure. 2022;30:1169–1177.e4. doi:10.1016/j.str.2022.05.001.

20. Weissenow K, Heinzinger M, Steinegger M, Rost B. Ultra-fast protein structure prediction to capture effects of sequence variation in mutation movies. 2022. doi:10.1101/2022.11.14.516473.

21. Callaway E, Naddaf M. Move over, AlphaFold: open-source model predicts shape of 1 billion proteins. Nature. 2026;654:13–4. doi:10.1038/d41586-026-01686-3.

22. Zhang H, Zhang Y, Kang Z, Xiong J, Yang R, Ning K. MGM as a Large-Scale Pretrained Foundation Model for Microbiome Analyses in Diverse Contexts. Advanced Science. 2026;13. doi:10.1002/advs.202513333.

23. Medearis NA, Zhu S, Zomorrodi AR. BiomeGPT: A foundation model for the human gut microbiome. 2026. doi:10.64898/2026.01.05.697599.

24. Treloar NJ, Ur-Rehman S, Yang J. Learning the language of the microbiome with transformers. 2026. doi:10.64898/2026.05.02.722381.