Analysis of 16S metagenomic data

Module 20

Olivier Rué

Migale

Cédric Midoux

PROSE & MaIAGE

September 11, 2023

Introduction

Practical information

  • 9h00 - 17h00
  • 2 breaks (morning and afternoon)
  • Lunch at INRAE restaurant (not mandatory)
  • Questions are strongly encouraged
  • Everyone has something to learn from each other

Getting to know you

Who are you?

  • Institution / Laboratory / position

What is your scientific topic?

  • Studied ecosystem
  • Scientific question
  • Experimental design

What is your background?

  • Already processed shotgun data?
  • Background in bioinformatics?

Getting to know us

  • Open infrastructure dedicated to life sciences
    • Computing resources, tools, databanks…
  • Dissemination of expertise in bioinformatics
  • Design and development of applications
  • Data analysis

Data analysis service

  • We are specialized in genomics/metagenomics
  • 3 Bioinformaticians and 2 Statisticians
  • More than 140 projects since 2016
  • 2 types of partnership
    • Classical collaboration (we perform the analyses)
    • Guided support (we help you do the analysis yourself)

Our expectations

Aim of this training

After this 4-day training, you will:

  • Know the outlines, advantages and limits of amplicon sequencing data analysis
  • Be able to use FROGS (through Galaxy) and phyloseq (through easy16S) tools on the training data set
  • Be able to identify tools and parameters adapted to your own analyses

Program

DAY 1

  • Introduction
  • Data import on Easy16S
  • \(\alpha\) and \(\beta\) diversities
  • Ordination

DAY 2

  • PERMANOVA and hypothesis tests
  • Differential abundance
  • Analysis of Ravel and Mach data
  • Introduction to amplicon analysis (1)

DAY 3

  • Introduction to amplicon analysis (2)
  • Introduction to Galaxy
  • Quality control
  • FROGS (1)

DAY 4

  • FROGS (2)
  • FROGSfunc
  • Analysis of your data

Introduction to amplicon analyses

Meta-omics using next-generation sequencing (NGS)

Strengths and weaknesses of amplicon analyses?

Strengths

  • Detect subdominant microorganisms present in complex samples → microbial inventories
  • Get (approximate) relative abundances of different taxa in samples
  • Analyze and compare many taxa (hundreds) at the same time
  • Taxonomic profiles of the communities (usually down to genus level, and sometimes down to species or strain)
  • Low cost

Weaknesses

  • Compositional data, many biases → no absolute quantification
  • Exact identification of the organisms is difficult
  • Hard to distinguish live and dead fractions of the communities
  • No functional view of the ecosystem
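
To make the compositionality point concrete, here is a minimal sketch with invented numbers (the taxon names and loads are hypothetical, not taken from the training data). Taxon A has the same absolute load in both samples, yet its relative abundance halves simply because taxon B increases.

```python
def relative_abundances(counts):
    """Convert a dict of counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical absolute loads, invented for illustration only.
sample_1 = {"taxon_A": 1000, "taxon_B": 1000}
sample_2 = {"taxon_A": 1000, "taxon_B": 3000}  # only taxon_B has changed

rel_1 = relative_abundances(sample_1)
rel_2 = relative_abundances(sample_2)

print(rel_1["taxon_A"])  # 0.5
print(rel_2["taxon_A"])  # 0.25
```

Sequencing only gives access to the proportions, so a drop from 0.5 to 0.25 cannot by itself be read as a decrease of taxon A.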

The power of marker genes

Microbial tree of life

History of barcoding

  • Early 2000s: beginning of barcoding
  • 1st DNA barcode: about 650 bases of the mitochondrial gene of Cytochrome Oxidase I (COI) dedicated to the identification of vertebrates
  • 2007: 1st international published database (BOLD)
  • 2009: chloroplast markers rbcL (ribulose-bisphosphate carboxylase; 553 base pairs) and matK (maturase K; 879 base pairs) → standard markers for plants
  • 2012: ITS, standard marker for fungi (length from 361 to 1475 bases in UNITE 7.1)
  • 16S: the marker mainly used for bacteria, although no standard has been formally designated

Choice of a marker gene

The ideal marker gene:

  • is ubiquitous
  • is conserved among taxa
  • is divergent enough to distinguish strains
  • is not subject to lateral transfer
  • has only one copy in the genome
  • has conserved regions to design specific primers
  • is well enough characterized to be present in databases for taxonomic affiliation

Bacterial targets

The genes that have been proposed for this task include those encoding:

  • 16S / 23S rRNA
  • DNA gyrase subunit B (gyrB)
  • RNA polymerase subunit B (rpoB)
  • elongation factor Tu (tuf)
  • DNA recombinase protein (recA)
  • protein synthesis elongation factor-G (fusA)
  • dinitrogenase protein subunit D (nifD) …

Bacterial lineages vary in their genomic contents, which suggests that different genes might be needed to resolve the diversity within certain taxonomic groups.

The gene encoding the small subunit of the ribosomal RNA

  • The most widely used gene in molecular phylogenetic studies
  • Ubiquitous gene: 16S rDNA in prokaryotes; 18S rDNA in eukaryotes
  • Gene encoding a ribosomal RNA: a non-coding RNA (not translated) that is part of the small subunit of the ribosome, which is responsible for translating mRNA into proteins
  • Not subject to lateral gene transfer
  • Availability of databases facilitating comparison
    • SILVA v138.1 (2021): over 10,700,000 SSU/LSU sequences available

The 16S resolution

16S rRNA copy number

Median number of 16S rRNA gene copies in 3,070 bacterial species, according to data reported in the rrnDB database (2018)

16S rRNA copy variation

[B] The positions of sequence variation within 16S and 23S rRNA are shown along the gene organization of rrn operons. A total of 33 and 77 differences were identified in 16S rRNA and 23S rRNA, respectively.

[C] The number of bases that differ from the conserved sequence is shown for 16S and 23S rRNA for each rrn operon.

16S rRNA copy variation

  • Only a minority of bacterial genomes harbors identical 16S rRNA gene copies
  • Sequence diversity increases with increasing copy numbers
  • While certain taxa harbor dissimilar 16S rRNA genes, others contain sequences common to multiple species
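
The effect of copy number on apparent abundance can be sketched as follows. This is an illustration under strong assumptions: the species names, read counts and copy numbers are invented, and dividing by copy number is shown only to expose the bias, not as a recommended correction procedure.

```python
def copy_number_adjusted(read_counts, copy_numbers):
    """Divide read counts by 16S copy number, then renormalize to proportions."""
    adjusted = {taxon: read_counts[taxon] / copy_numbers[taxon] for taxon in read_counts}
    total = sum(adjusted.values())
    return {taxon: value / total for taxon, value in adjusted.items()}


# Hypothetical case: two species present at equal cell numbers,
# but species_X carries 7 copies of the 16S gene and species_Y only 1.
reads = {"species_X": 700, "species_Y": 100}
copies = {"species_X": 7, "species_Y": 1}

raw_share_X = reads["species_X"] / sum(reads.values())
adjusted = copy_number_adjusted(reads, copies)

print(raw_share_X)            # 0.875
print(adjusted["species_X"])  # 0.5
```

In practice copy numbers are unknown for many taxa and vary within a species, so any such adjustment remains approximate.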

gyrB: an alternative to 16S

  • A single-copy housekeeping gene that encodes the subunit B of DNA gyrase, a type II DNA topoisomerase, and therefore plays an essential role in DNA replication.
  • Essential and ubiquitous in bacteria
  • Higher rate of base substitution than 16S rDNA
  • Sufficiently large in size for use in analysis of microbial communities.
  • Also present in Eukarya and sometimes in Archaea but it shows enough sequence dissimilarity between the three domains of life to be used selectively for Bacteria.

Fungal ITS

  • ITS: Internal Transcribed Spacer

  • Size polymorphism of ITS (from 361 to 1475 bases in UNITE 7.1)

  • Highly conserved regions flanking ITS1 and ITS2

  • Lack of a generalist and abundant ITS databank (several small specialized databanks)

  • Multiple copies (14 to 1,400 copies; mean 113, median 80)

  • FROGS handles ITS very well [8]

    • both short and long fragments, unlike many tools

Planning an experiment

Challenges

Experimental design: challenges and solutions

  • In general, any hypothesis should first be supported by a careful review of the literature and by preliminary testing in small-scale pilot studies, to reduce uncertainty in biological signals and avoid trial and error
    • Number of samples: similar samples vary; choosing appropriate sample sizes based on statistical principles helps avoid biases and spurious interpretations
    • Controls: needed to determine whether a signal is real and not just a stochastic or spurious result
    • Cross-sectional or longitudinal studies: plan identical sample collection times for each replicate to avoid biases
    • Metadata: help avoid misinterpretation of results and highlight the effect size of individual factors

Sample collection and handling

  • Contamination: changes in temperature, humidity, or other factors could alter or contaminate samples. Minimizing the time of sample collection and using aseptic laboratory resources, including gloves, masks and head covers, help to reduce contamination
  • Transportation: Transit conditions and duration can influence the quality and quantity of extracted nucleic acids
  • Storage and safety: Several studies have assessed the effect of storage conditions on compositional changes in microbial samples

DNA extraction and preparation

  • mechanical lysis/bead beating or chemical lysis
  • amplification using barcode primer pairs, purification, and preparation of purified DNA libraries are done before sequencing
    • universal primers are not so universal [11]
    • amplification bias
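
One way to see why "universal" primers miss some taxa is to count mismatches between a degenerate primer and the binding site of a given sequence. The sketch below is illustrative only: the primer and the two binding sites are invented, and real tools also account for mismatch position and annealing conditions.

```python
# IUPAC nucleotide codes: each code maps to the bases it can match.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(primer, site):
    """Count positions where the site base is not allowed by the primer code."""
    if len(primer) != len(site):
        raise ValueError("primer and binding site must have the same length")
    return sum(base not in IUPAC[code] for code, base in zip(primer, site))


primer = "ACGWNGGC"  # hypothetical degenerate primer, not a real 16S primer

print(count_mismatches(primer, "ACGATGGC"))  # 0
print(count_mismatches(primer, "ACGCTGGA"))  # 2
```

A sequence with two mismatches, like the second site here, is the kind of target that may fail to amplify to detectable levels.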

Amplification bias

  • PCR amplification efficiency is sequence-dependent, especially for the region that binds the primers.
  • If one sequence is amplified 10% more than another in one round, it will be \(1.1^{30} \approx 17.4\) times more abundant after 30 rounds.
  • This effect is most important when the sequence has one or more mismatches with the primer.
  • With one mismatch, amplification efficiency is usually significantly lower, and with two or more mismatches the sequence may not be amplified to detectable levels.
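
The compounding effect in the 10% example can be checked with a few lines of code. This is a simplified sketch that assumes a constant per-cycle advantage, which real PCR does not guarantee (efficiency drops in late cycles); the function name is ours.

```python
def fold_enrichment(efficiency_ratio, cycles):
    """Relative over-representation after `cycles` rounds of PCR,
    given a constant per-cycle amplification advantage."""
    return efficiency_ratio ** cycles


for cycles in (10, 20, 30):
    print(cycles, round(fold_enrichment(1.10, cycles), 1))
# 10 2.6
# 20 6.7
# 30 17.4
```

Even a small per-cycle advantage grows exponentially, which is one reason to keep the number of PCR cycles as low as practical.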

Amplification bias

  • C and D impact the abundance without adding new sequences
  • E and F add new sequences

Sequencing technologies

Illumina technology

Effect of sequencing technology

Sequencing biases

  • Contamination between samples during the same run
  • Contamination during successive runs (residual contaminants)
  • Variability between runs: take it into account in the experimental design
  • Variability within a run: add controls

Value of controls

Illustration

Here, we showed that contaminant OTUs from extraction and amplification steps can represent more than half the total sequence yield in sequencing runs, and lead to unreliable results when characterizing tick microbial communities. We thus strongly advise the routine use of negative controls in tick microbiota studies, and more generally in studies involving low biomass samples. [16]
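
A negative control makes this kind of check possible. The sketch below computes the share of a sample's reads assigned to OTUs that also appear in a negative control. The OTU names and counts are invented, and the presence-based rule is a deliberate simplification, not the method used in the study quoted above.

```python
def contaminant_read_fraction(sample_counts, control_counts):
    """Fraction of a sample's reads assigned to OTUs also seen in a negative control."""
    contaminant_otus = {otu for otu, n in control_counts.items() if n > 0}
    total = sum(sample_counts.values())
    flagged = sum(n for otu, n in sample_counts.items() if otu in contaminant_otus)
    return flagged / total


# Hypothetical read counts, invented for illustration only.
sample = {"OTU_1": 400, "OTU_2": 350, "OTU_3": 250}
negative_control = {"OTU_2": 40, "OTU_3": 10}

print(contaminant_read_fraction(sample, negative_control))  # 0.6
```

Flagging every OTU seen in a control is too strict in practice, because abundant genuine taxa can leak into controls through cross-contamination; dedicated statistical approaches weigh abundance and prevalence instead.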

Synthesis of biases

Synthesis

Bioinformatics

A pile of pipelines

Benchmarking

Compositions at the phylum level for Human gut and, using a range of different methods (separate subpanels within each group).

Benchmarking

Quality parameters obtained with the seven bioinformatics pipelines. A) Recall rate (TP/(TP+FN)) reflects the capacity of the tools to detect expected species. B) Precision (TP/(TP+FP)) shows the fraction of relevant species among the retrieved species. C) Divergence rate is the Bray-Curtis distance between expected and observed species abundance. D) Percentage of perfectly reconstructed sequences is the fraction of predicted sequences with 100% identity to the expected ones.
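
The first three metrics in this caption can be written out as a short sketch. The species names and abundances are invented for illustration, and the function names are ours rather than those of any benchmarked pipeline.

```python
def recall(expected, observed):
    """TP / (TP + FN): share of expected species that were detected."""
    tp = len(expected & observed)
    fn = len(expected - observed)
    return tp / (tp + fn)


def precision(expected, observed):
    """TP / (TP + FP): share of detected species that were expected."""
    tp = len(expected & observed)
    fp = len(observed - expected)
    return tp / (tp + fp)


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two abundance dicts."""
    taxa = set(p) | set(q)
    numerator = sum(abs(p.get(t, 0) - q.get(t, 0)) for t in taxa)
    denominator = sum(p.get(t, 0) + q.get(t, 0) for t in taxa)
    return numerator / denominator


# Hypothetical mock community (expected) and pipeline output (observed).
expected = {"sp_A": 50, "sp_B": 30, "sp_C": 20}
observed = {"sp_A": 60, "sp_B": 30, "sp_D": 10}

print(round(recall(set(expected), set(observed)), 3))     # 0.667
print(round(precision(set(expected), set(observed)), 3))  # 0.667
print(bray_curtis(expected, observed))                    # 0.2
```

Here one expected species is missed (sp_C) and one unexpected species is reported (sp_D), so recall and precision are both 2/3, while the divergence of 0.2 reflects the abundance shifts.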

Conclusion 1: sequencing data do not contain exactly what you sampled…

Summary

Conclusion 2: … but you now know how to deal with it

Key advice

  • Discuss with all partners (bioinformaticians & statisticians) involved in the project
    • scientific aspects
    • financial aspects
  • Use controls!
  • If possible, perform a preliminary analysis

References

1. Liu Y-X, Qin Y, Chen T, Lu M, Qian X, Guo X, et al. A practical guide to amplicon and metagenomic analysis of microbiome data. Protein & Cell. 2020;12:315–30. doi:10.1007/s13238-020-00724-8.
2. Kim O-S, Cho Y-J, Lee K, Yoon S-H, Kim M, Na H, et al. Introducing EzTaxon-e: A prokaryotic 16S rRNA gene sequence database with phylotypes that represent uncultured species. International Journal of Systematic and Evolutionary Microbiology. 2012;62:716–21. doi:10.1099/ijs.0.038075-0.
3. Hug LA, Baker BJ, Anantharaman K, Brown CT, Probst AJ, Castelle CJ, et al. A new view of the tree of life. Nature Microbiology. 2016;1:1–6.
4. Yarza P, Yilmaz P, Pruesse E, Glöckner FO, Ludwig W, Schleifer K-H, et al. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nature Reviews Microbiology. 2014;12:635–45.
5. Espejo RT, Plaza N. Multiple ribosomal RNA operons in bacteria; their concerted evolution and potential consequences on the rate of evolution of their 16S rRNA. Frontiers in Microbiology. 2018;9:1232.
6. Maeda M, Shimada T, Ishihama A. Strength and regulation of seven rRNA promoters in Escherichia coli. PLoS One. 2015;10:e0144697.
7. Poirier S, Rué O, et al. Deciphering intra-species bacterial diversity of meat and seafood spoilage microbiota using gyrB amplicon sequencing: A comparative analysis with 16S rDNA V3-V4 amplicon sequencing. PLOS ONE. 2018;13:1–26. doi:10.1371/journal.pone.0204629.
8. Bernard M, Rué O, Mariadassou M, Pascal G. FROGS: a powerful tool to analyse the diversity of fungi with special management of internal transcribed spacers. Briefings in Bioinformatics. 2021;22. doi:10.1093/bib/bbab318.
9. Lofgren LA, Uehling JK, Branco S, Bruns TD, Martin F, Kennedy PG. Genome-based estimates of fungal rDNA copy number variation across phylogenetic scales and ecological lifestyles. Molecular Ecology. 2019;28:721–30. doi:10.1111/mec.14995.
10. Bharti R, Grimm DG. Current challenges and best-practice protocols for microbiome analysis. Briefings in Bioinformatics. 2019;22:178–93. doi:10.1093/bib/bbz155.
11. Alard J, Lehrter V, Rhimi M, Mangin I, Peucelle V, Abraham A-L, et al. Beneficial metabolic effects of selected probiotics on diet-induced obesity and insulin resistance in mice are associated with improvement of dysbiotic gut microbiota. Environmental Microbiology. 2016;18:1484–97. doi:10.1111/1462-2920.13181.
12. Tan YC, Kumar AU, Wong YP, Ling APK. Bioinformatics approaches and applications in plant biotechnology. Journal of Genetic Engineering and Biotechnology. 2022;20:1–13.
13. Cruaud P, Rasplus J-Y, Rodriguez LJ, Cruaud A. High-throughput sequencing of multiple amplicons for barcoding and integrative taxonomy. Scientific Reports. 2017;7:41948.
14. Whon TW, Chung W-H, Lim MY, Song E-J, Kim PS, Hyun D-W, et al. The effects of sequencing platforms on phylogenetic resolution in 16S rRNA gene profiling of human feces. Scientific Data. 2018;5:1–15.
15. Salter SJ, Cox MJ, Turek EM, Calus ST, Cookson WO, Moffatt MF, et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biology. 2014;12. doi:10.1186/s12915-014-0087-z.
16. Lejal E, Estrada-Peña A, Marsot M, Cosson J-F, Rué O, Mariadassou M, et al. Taxon appearance from extraction and amplification steps demonstrates the value of multiple controls in tick microbiota analysis. Frontiers in Microbiology. 2020;11:1093.
17. Brooks JP, Edwards DJ, Harwich MD, Rivera MC, Fettweis JM, Serrano MG, et al. The truth about metagenomics: Quantifying and counteracting bias in 16S rRNA studies. BMC Microbiology. 2015;15:1–14.
18. Hakimzadeh A, Abdala Asbun A, Albanese D, Bernard M, Buchner D, Callahan B, et al. A pile of pipelines: An overview of the bioinformatics software for metabarcoding data analyses. Molecular Ecology Resources. 2023.
19. Liu Z, DeSantis TZ, Andersen GL, Knight R. Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Research. 2008;36:e120.
20. Rué O, Coton M, Dugat-Bony E, Howell K, Irlinger F, Legras J-L, et al. Comparison of metabarcoding taxonomic markers to describe fungal communities in fermented foods. bioRxiv. 2023. doi:10.1101/2023.01.13.523754.