AI for omics data

A focus on a few contributions

June 25, 2026

Disclaimer

  • I’m not an expert in any of those domains;
  • But AI has disrupted a few domains and has an impressive track record;
  • So I’m trying to be aware of what’s going on.
  • Don’t expect a deep dive on technical details / architecture details.

  • Plenty of online resources for introductory material (e.g. Fidle).

Introduction

  • A very hot topic
  • Variant calling
  • Protein Folding
  • Foundation models

An explosion of AI models…

  • ChatGPT and LLMs (Large Language Models) took the world by storm in 2022

  • In parallel, an explosion of foundation models (FM) trained at scale on biological data:

    • 2018: DeepVariant [1]: CV-based variant calling

    • 2021: AlphaFold2 [2]: 3D structures from AA sequences (precursor to FM)

    • 2023: DNABERT* [3–5] genome FMs, Geneformer [6] and scGPT [7] single-cell FMs

    • 2024: AlphaFold3 [8]: 3D structure of complexes (proteins / DNA / RNA / small molecules and ligands); Evo [9]: long-context genome FM

    • 2025: Evo 2 [10]: improvement over Evo; perspective on multimodal single-cell foundation models [11]; virtual tissues [12]

    • 2026: virtual cell initiatives [13]; protein world models (ESM Atlas and ESMC) [14]

    • etc.

Variant calling

Goal of variant calling

Images from [DeepVariant blog](https://ai.googleblog.com/2017/12/deepvariant-highly-accurate-genomes.html)

Principle of DeepVariant

Collaboration Google Brain / Verily Life Sciences, images from DeepVariant blog post

Impressive Results (PrecisionFDA)

Confirmed in several studies

Accuracy of SNV (A) and Indel (B) calling. Fig 1 from [15]

Precision-recall curves of several callers on high (top) and low (bottom) datasets. DeepVariant is in blue and red. Adapted from Fig 2 of [16]
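
Precision and recall, the two axes of these curves, can be computed from variant-level counts. A minimal sketch, assuming calls have already been compared with a truth set; the counts below are hypothetical and are not taken from [15] or [16].

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
precision, recall, f1 = precision_recall_f1(tp=990, fp=10, fn=20)
print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
```

Each point of a precision-recall curve corresponds to one such triple, obtained at a given quality threshold of the caller.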

Why it matters

Higher accuracy \(\rightarrow\) lower coverage \(\rightarrow\) less money 💰 for GBS

Image from DeepVariant blog

AI for variant calling

Deep neural networks can learn to call variants in pileup images with high accuracy and capture signals that are difficult for humans to identify.

Protein structure

A very short primer on proteins

  • Proteins are responsible for functions essential to life.

  • Biological function depends on tertiary / quaternary structure

  • Understanding protein folding is essential

    • Incorrect folding can lead to diseases

    • Protein structure useful to develop drugs

From Wikipedia, created by user Kep17

Protein structure

Resolving protein structure is (or used to be) hard.

  • The first protein structure (that of myoglobin) was solved in 1958 (by X-ray crystallography) and refined in 1960 by John Kendrew (who received the Nobel Prize in 1962). He started working on this structure in 1949.

  • By 2021, there were 200 000 experimentally resolved proteins (0.1% of the >200M known proteins)

  • Huge need for structure prediction methods: see timeline in [17]

John Kendrew and the myoglobin structure in the PDB.

Main events in structure prediction. Fig. 1 from [17]

AlphaFold2 enters the ring

  • CASP13 (2018): Warm-up round

  • CASP14 (2020): Win by KO with “transformational” results.

  • AlphaFoldDB (2022): 200M+ protein structure predictions

  • AlphaFold3 [8]: Multimer structure prediction

Results from CASP14 (2020). Higher is better.

Large protein language models (pLM)

Drawbacks: AlphaFold[2-3] can’t produce novel structures and require multiple sequence alignments (MSA).

pLMs have been proposed to address these limitations:

  • ESM2 [18], based on masked transformers for proteins.

    • attention patterns \(\leftrightarrow\) residue contact map
    • lower TM (template modeling) scores than AlphaFold2 overall, but better scores when AlphaFold2 is given only the AA sequence and no MSA
    • Similar accuracy (median all-atom RMSD of 1.91Å and a backbone RMSD of 1.33Å) for high confidence proteins
  • EMBER2 [19, 20] predicts 2D (inter-residue distances) or 3D structure from the embeddings of a pLM (ProtT5) trained on AA sequences rather than MSA.

    • Less accurate than AlphaFold2 but orders of magnitude faster
    • Outperforms AlphaFold2 for deep mutational scanning (impact of single AA change on structure)
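
Two quantities used above, the residue contact map and the RMSD, can be sketched from 3D coordinates. This is a minimal illustration, assuming one coordinate per residue, a contact threshold of 8 Å (a common convention, not a value from [18]), and structures that are already superposed for the RMSD. The coordinates are toy values.

```python
import math

Coord = tuple[float, float, float]


def contact_map(coords: list[Coord], threshold: float = 8.0) -> list[list[bool]]:
    """Boolean matrix: True when two residues are closer than `threshold` (in Å)."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) < threshold for j in range(n)]
            for i in range(n)]


def rmsd(a: list[Coord], b: list[Coord]) -> float:
    """Root-mean-square deviation between two already superposed structures."""
    if len(a) != len(b):
        raise ValueError("structures must have the same number of atoms")
    squared = [math.dist(p, q) ** 2 for p, q in zip(a, b)]
    return math.sqrt(sum(squared) / len(squared))


# Toy chain: 4 residues on a line, 3.8 Å apart (hypothetical coordinates).
chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
shifted = [(x, y + 1.0, z) for x, y, z in chain]

cmap = contact_map(chain)
deviation = rmsd(chain, shifted)
```

In practice the RMSD is reported after optimal superposition of the two structures, a step left out here.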

Scaling up to the “Protein World”

  • Move over, AlphaFold: open-source model predicts shape of 1 billion proteins [21]
  • ESM Cambrian (ESMC) [14] trained on 2.8 billion protein sequences
    • use the latent space to predict folding (ESMFold2)
    • build an atlas of 6.8 billion sequences and 1.1 billion predicted structures (ESM Atlas)
    • use sparse autoencoders (SAE) to extract \(\sim 16 000\) interpretable features from representations
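
A sparse autoencoder maps a model representation to a wider, mostly-zero feature vector and back. The sketch below shows only the generic objective (reconstruction error plus an L1 penalty) on a random, untrained toy model; the class name, sizes, and penalty weight are assumptions, and the SAE variant actually used in [14] is not described here.

```python
import random


def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SparseAutoencoder:
    """Generic SAE: ReLU encoder, linear decoder (illustrative, untrained)."""

    def __init__(self, d_model: int, d_features: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_features)]
        self.b_enc = [0.0] * d_features
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(d_features)] for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x: list[float]) -> list[float]:
        pre = matvec(self.w_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]  # ReLU keeps features >= 0

    def decode(self, features: list[float]) -> list[float]:
        return [r + b for r, b in zip(matvec(self.w_dec, features), self.b_dec)]

    def loss(self, x: list[float], l1_weight: float = 1e-3) -> float:
        features = self.encode(x)
        x_hat = self.decode(features)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        # Features are non-negative, so their sum is the L1 norm.
        return mse + l1_weight * sum(features)


sae = SparseAutoencoder(d_model=4, d_features=8)
embedding = [0.5, -1.0, 0.25, 2.0]  # hypothetical per-residue representation
features = sae.encode(embedding)
objective = sae.loss(embedding)
```

After training, each feature dimension is inspected to see which residues or proteins activate it, which is how "interpretable features" are read off.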

Protein world and atlas (bi[o]hub)

Compositional grammar for protein biology

Highlighting one feature of the protein

Highlighting another feature

Filter using fine-scale features

ESM Atlas

Use case: designing antibodies

Fraction of designed proteins that successfully bound their targets in the lab

Foundation Models

Motivations

Applications

  • Genomics: Identify genetic variants associated with diseases.

  • Transcriptomics: Analyze gene expression patterns to understand cellular responses.

  • Metabolomics: Investigate metabolic profiles to understand disease mechanisms.

Expected Benefits:

  • Improved Accuracy: high accuracy in prediction of biological outcomes.

  • Efficiency: Automation of data analysis processes.

  • Scalability: Analysis of large-scale datasets (if inference is optimized).

  • Transfer: Transfer knowledge learned from large datasets to small datasets analyses.

Evo 2: the new kid on the block

EVO 2 [10]

Key points:

  • A biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life.

  • Trained to predict the next nucleotide using a one-million-token context window

  • 2 versions: 7B and 40B parameters

  • Fully open: model architecture, model weights, training code, inference code, and the OpenGenome2 training dataset

Rationale

  • Next word prediction encompasses many tasks (grammar, knowledge, translation, arithmetic, etc.)

  • Likewise, biological mechanisms leave footprints in sequence variation (think co-evolution).

  • Next base prediction could be used to predict genetic disease, RNA structure, etc

  • Goal: Develop long context models for sequences

Long context is required to capture the complexity of genomes (long repeats) and the length of operational units (genes, operons, plasmids, etc.)

EVO2 dataset

OpenGenome2: 9.3 trillion nucleotides from all domains of life.

  • 113,379 representative prokaryotic genomes (357B nt)

  • 6.98T nt from eukaryotic genomes

  • 854B nt of non-redundant metagenomic sequencing data

  • 2.82B nucleotides of organelle genomes

  • 602B nt eukaryotic sequence (coding regions + surrounding windows).
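
A quick arithmetic check of the listed components. Assuming the five categories are disjoint (an assumption, the slide does not say), they sum to about 8.8 trillion nucleotides, somewhat below the 9.3 trillion headline figure; the slide does not explain the difference.

```python
# Sizes in nucleotides, copied from the list above.
components_nt = {
    "prokaryotic genomes": 357e9,
    "eukaryotic genomes": 6.98e12,
    "metagenomic data": 854e9,
    "organelle genomes": 2.82e9,
    "eukaryotic coding windows": 602e9,
}

total_nt = sum(components_nt.values())

for name, size in components_nt.items():
    print(f"{name:28s} {size / total_nt:6.1%}")
print(f"total: {total_nt / 1e12:.2f} T nt")
```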

EVO2 training

Two-phase training strategy:

  • Pretraining with a short context (8,192 bp): learn functional genetic elements

  • Midtraining with a long context (1 Mbp): learn composition of elements

  • Higher throughput than transformers

  • Perplexity improves with model size and context length
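
Perplexity is the exponential of the average negative log-probability that the model assigns to the true next token. A minimal sketch with made-up probabilities; for a 4-letter DNA alphabet, a model that guesses uniformly has perplexity 4, and lower is better.

```python
import math


def perplexity(true_token_probs: list[float]) -> float:
    """exp(mean negative log-likelihood) over the probabilities of the observed tokens."""
    nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(nll)


# Hypothetical per-nucleotide probabilities, for illustration only.
uniform_guess = perplexity([0.25] * 8)   # no better than chance on A/C/G/T
confident_model = perplexity([0.9] * 8)  # assigns 0.9 to each true base

print(f"uniform: {uniform_guess:.2f}, confident: {confident_model:.2f}")
```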

EVO 2 learns biological functions

Likelihoods reflect biology: codons…

and exons

Predicts gene essentiality in prokaryotes…

and humans

EVO2 predicts the effect of variants

Evo 2 is competitive with SOTA on coding regions in ClinVar

Outperforms SOTA on non-coding regions
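
The principle of likelihood-based variant scoring can be sketched as a log-likelihood difference between the alternate and the reference sequence. The model below is a toy order-2 Markov chain standing in for a genomic language model; it is NOT Evo 2 and does not use its API, and all names and sequences are hypothetical.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"


class ToyNextBaseModel:
    """Order-k Markov model with add-one smoothing (stand-in for a genomic LM)."""

    def __init__(self, k: int = 2) -> None:
        self.k = k
        self.counts: dict[str, Counter] = defaultdict(Counter)

    def fit(self, sequence: str) -> "ToyNextBaseModel":
        for i in range(self.k, len(sequence)):
            self.counts[sequence[i - self.k:i]][sequence[i]] += 1
        return self

    def log_prob(self, context: str, base: str) -> float:
        seen = self.counts.get(context, Counter())
        return math.log((seen[base] + 1) / (sum(seen.values()) + len(ALPHABET)))

    def log_likelihood(self, sequence: str) -> float:
        return sum(self.log_prob(sequence[i - self.k:i], sequence[i])
                   for i in range(self.k, len(sequence)))


def variant_score(model: ToyNextBaseModel, ref_seq: str, pos: int, alt: str) -> float:
    """log P(alt sequence) - log P(ref sequence); negative means the variant is less likely."""
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    return model.log_likelihood(alt_seq) - model.log_likelihood(ref_seq)


model = ToyNextBaseModel(k=2).fit("ATG" * 50)       # toy training sequence
reference = "ATGATGATGATG"
score = variant_score(model, reference, pos=5, alt="C")  # G>C at 0-based position 5
neutral = variant_score(model, reference, pos=5, alt="G")  # same base as the reference
```

A strongly negative score flags a substitution that breaks a pattern the model has learned; how such scores are calibrated against ClinVar labels is not covered by this sketch.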

Interpreting features

Interpret layer 26 features

Prophage detector

Secondary structure detector

Use case: chromatin accessibility

Design and validate regions with different chromatin accessibility

Extracted from Fig 6 of [10]

Experimental validation of the designed sequences

Next-generation Foundation models

Moving beyond DNA sequences

1st gen: DNA, AA

  • gLM: DNABERTx, Evox, Nucleotide Transformer
  • pLM: ESMx

2nd gen:

  • Single-cell transcriptomics
  • Spatial omics
  • Multimodal data

Credits to OmicsML/awesome-foundation-model-single-cell-papers

Key evolutions

Biological entities become tokens:

  • nucleotides 🧬
  • genes
  • cells 🦠
  • tissues 🫀
  • pangenome
    • genes = tokens
    • genomes = sentences
  • microbiome:
    • ASV/OTU/taxa = tokens
    • communities = sentences
    • ecosystems = corpus
      • MGM [22]
      • BiomeGPT [23]
      • Compass [24]
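
The "taxa = tokens, communities = sentences" analogy can be sketched as follows. Ordering taxa by decreasing abundance is one possible convention, assumed here for illustration; it is not claimed to be what MGM, BiomeGPT, or Compass do. Function names and counts are hypothetical.

```python
def sample_to_sentence(abundances: dict[str, int], max_tokens: int = 5) -> list[str]:
    """Turn one community profile into a 'sentence': present taxa, most abundant first."""
    present = [(taxon, count) for taxon, count in abundances.items() if count > 0]
    ranked = sorted(present, key=lambda tc: (-tc[1], tc[0]))  # ties broken by name
    return [taxon for taxon, _ in ranked[:max_tokens]]


def build_vocab(sentences: list[list[str]]) -> dict[str, int]:
    """Assign an integer id to every taxon seen in the corpus."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for sentence in sentences:
        for token in sentence:
            vocab.setdefault(token, len(vocab))
    return vocab


# Hypothetical read counts for one sample.
sample = {"Bacteroides": 120, "Faecalibacterium": 80, "Escherichia": 0, "Prevotella": 80}

sentence = sample_to_sentence(sample)
vocab = build_vocab([sentence])
token_ids = [vocab.get(token, vocab["<unk>"]) for token in sentence]
```

The resulting id sequences are what a transformer would consume, with many samples forming the corpus.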

Foundation models are not magic 🪄

Challenges 💪

  • experimental validation
  • data leakage
  • interpretability
  • causal inference
  • reproducibility
  • cost and availability

New actors 🏦

  • Alphabet DeepMind
  • CZ Bi[o]hub
  • Arc Institute

Conclusion

Main messages

  • AI methods have already changed some fields

    • variant calling
    • protein folding
  • Foundation models for sequences 🧬 come out weekly

    • No “killer app” yet
    • New tool 🔧 in the toolbox 🧰
    • Huge potential for our applications
    • Heavy dose of skepticism is required
  • Foundation models are
    • moving from sequences to cells, tissues (and other biological entities)
    • becoming multimodal

Additional slides

Principle (I)

  • Consider pileup as an image

Principle (II)

  • Works on images of size 100 \(\times\) 221 (rows \(\times\) bases)
    • 5 rows for the reference
    • up to 95 rows for reads
  • Use 6 “channels” (think RGB)
    • read-level features (MQ, strand, supports candidate)
    • base-level features
  • 2 additional channels (maps to alt. allele and haplotype)
  • In the original publication: ConvNetJuly2015v2 CNN with 9 partitions
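
The encoding described above can be sketched as a 100 × 221 × 6 tensor. This is a simplified illustration, not DeepVariant's code: the channel order, the value scaling, and the `Read` class are assumptions, and the two additional channels are left out.

```python
from dataclasses import dataclass

HEIGHT, WIDTH, N_CHANNELS = 100, 221, 6  # sizes quoted above
REF_ROWS = 5                             # rows reserved for the reference
BASE_VALUE = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}  # arbitrary encoding


@dataclass
class Read:
    start: int              # window column of the first base
    bases: str
    base_quals: list[int]
    mapq: int
    reverse: bool
    supports_alt: bool


def encode_pileup(reference: str, reads: list[Read]) -> list[list[list[float]]]:
    """Return a HEIGHT x WIDTH x N_CHANNELS nested list for one candidate site."""
    if len(reference) != WIDTH:
        raise ValueError("reference window must span WIDTH bases")
    image = [[[0.0] * N_CHANNELS for _ in range(WIDTH)] for _ in range(HEIGHT)]
    for row in range(REF_ROWS):
        for col, base in enumerate(reference):
            image[row][col][0] = BASE_VALUE.get(base, 0.0)
    # Reads beyond the 95 available rows are dropped.
    for row, read in enumerate(reads[:HEIGHT - REF_ROWS], start=REF_ROWS):
        for offset, (base, qual) in enumerate(zip(read.bases, read.base_quals)):
            col = read.start + offset
            if not 0 <= col < WIDTH:
                continue
            pixel = image[row][col]
            pixel[0] = BASE_VALUE.get(base, 0.0)            # base identity
            pixel[1] = min(qual, 40) / 40                   # base quality
            pixel[2] = min(read.mapq, 60) / 60              # mapping quality (MQ)
            pixel[3] = 1.0 if read.reverse else 0.0         # strand
            pixel[4] = 1.0 if read.supports_alt else 0.0    # supports candidate
            pixel[5] = 1.0 if base != reference[col] else 0.0  # differs from reference
    return image


reference = "ACGT" * 55 + "A"  # toy 221-base window
reads = [Read(start=108, bases="ACGTT", base_quals=[30] * 5, mapq=60,
              reverse=False, supports_alt=True)]
image = encode_pileup(reference, reads)
```

A CNN then classifies each such image into a genotype (homozygous reference, heterozygous, homozygous alternate).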

Test it for yourself

Images from DeepVariant blog post

Homozygous alternate allele

Heterozygous

Harder, isn’t it? 🤔

References

1. Poplin R, Chang P-C, Alexander D, Schwartz S, Colthurst T, Ku A, et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology. 2018;36:983–7. doi:10.1038/nbt.4235.
2. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–9. doi:10.1038/s41586-021-03819-2.
3. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021;37:2112–20. doi:10.1093/bioinformatics/btab083.
4. Zhou Z, Ji Y, Li W, Dutta P, Davuluri R, Liu H. DNABERT-2: Efficient foundation model and benchmark for multi-species genome. 2023. doi:10.48550/ARXIV.2306.15006.
5. Zhou Z, Wu W, Ho H, Wang J, Shi L, Davuluri RV, et al. DNABERT-S: Pioneering species differentiation with species-aware DNA embeddings. 2024. doi:10.48550/ARXIV.2402.08777.
6. Theodoris CV, Xiao L, Chopra A, Chaffin MD, Al Sayed ZR, Hill MC, et al. Transfer learning enables predictions in network biology. Nature. 2023;618:616–24. doi:10.1038/s41586-023-06139-9.
7. Cui H, Wang C, Maan H, Pang K, Luo F, Duan N, et al. scGPT: Toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods. 2024;21:1470–80. doi:10.1038/s41592-024-02201-0.
8. Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630:493–500. doi:10.1038/s41586-024-07487-w.
9. Nguyen E, Poli M, Durrant MG, Kang B, Katrekar D, Li DB, et al. Sequence modeling and design from molecular to genome scale with Evo. Science. 2024;386. doi:10.1126/science.ado9336.
10. Brixi G, Durrant MG, Ku J, Naghipourfar M, Poli M, Sun G, et al. Genome modelling and design across all domains of life with Evo 2. Nature. 2026;652:1349–61. doi:10.1038/s41586-026-10176-5.
11. Cui H, Tejada-Lapuerta A, Brbić M, Saez-Rodriguez J, Cristea S, Goodarzi H, et al. Towards multimodal foundation models in molecular cell biology. Nature. 2025;640:623–33. doi:10.1038/s41586-025-08710-y.
12. Wenckstern J, Jain E, Vasilev K, Pariset M, Wicki A, Gut G, et al. AI-powered virtual tissues from spatial proteomics for clinical diagnostics and biomedical discovery. 2025. doi:10.48550/ARXIV.2501.06039.
13. Bunne C, Roohani Y, Rosen Y, Gupta A, Zhang X, Roed M, et al. How to build the virtual cell with artificial intelligence: Priorities and opportunities. Cell. 2024;187:7045–63. doi:10.1016/j.cell.2024.11.015.
14. Candido S, Hayes T, Derry A, Rao R, Lin Z, Verkuil R, et al. Language modeling materializes a world model of protein biology. 2026. doi:10.64898/2026.06.03.729735.
15. Barbitoff YA, Abasov R, Tvorogova VE, Glotov AS, Predeus AV. Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery. BMC Genomics. 2022;23. doi:10.1186/s12864-022-08365-3.
16. Chen N-C, Kolesnikov A, Goel S, Yun T, Chang P-C, Carroll A. Improving variant calling using population data and deep learning. BMC Bioinformatics. 2023;24. doi:10.1186/s12859-023-05294-0.
17. Bertoline LMF, Lima AN, Krieger JE, Teixeira SK. Before and after AlphaFold2: An overview of protein structure prediction. Frontiers in Bioinformatics. 2023;3. doi:10.3389/fbinf.2023.1120370.
18. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379:1123–30. doi:10.1126/science.ade2574.
19. Weissenow K, Heinzinger M, Rost B. Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction. Structure. 2022;30:1169–1177.e4. doi:10.1016/j.str.2022.05.001.
20. Weissenow K, Heinzinger M, Steinegger M, Rost B. Ultra-fast protein structure prediction to capture effects of sequence variation in mutation movies. 2022. doi:10.1101/2022.11.14.516473.
21. Callaway E, Naddaf M. Move over, AlphaFold: open-source model predicts shape of 1 billion proteins. Nature. 2026;654:13–4. doi:10.1038/d41586-026-01686-3.
22. Zhang H, Zhang Y, Kang Z, Xiong J, Yang R, Ning K. MGM as a Large-Scale Pretrained Foundation Model for Microbiome Analyses in Diverse Contexts. Advanced Science. 2026;13. doi:10.1002/advs.202513333.
23. Medearis NA, Zhu S, Zomorrodi AR. BiomeGPT: A foundation model for the human gut microbiome. 2026. doi:10.64898/2026.01.05.697599.
24. Treloar NJ, Ur-Rehman S, Yang J. Learning the language of the microbiome with transformers. 2026. doi:10.64898/2026.05.02.722381.