Module 20
June 8, 2026
https://documents.migale.inrae.fr/data-analysis.html

After this 4 days training, you will:
Liu et al., 2020: A practical guide to amplicon and metagenomic analysis of microbiome data [1]
DAY 1
DAY 2
DAY 3
DAY 4
Training with Easy16S
DAY 3
DAY 4
Become familiar with {phyloseq} [2] R package and {Easy16S} [3] Shiny Web Application for the analysis of microbiome datasets.
Exploratory Data Analysis
\(\alpha\)-diversity: how diverse is my community?
\(\beta\)-diversity: how different are two communities?
Use a distance matrix to study structures:
Become familiar with {phyloseq} [2] R package and {Easy16S} [3] Shiny Web Application for the analysis of microbiome datasets.
Visual assessment of the data
phyloseq R package [2] from Bioconductor
Companion tools
{phyloseq-extended} [4]{vegan}, {ade4}, {picante}{ape}{mia} Shiny Web Application [3]
Strengths
{phyloseq} data structure [2]{ggplot2}Shiny Web Application [3]
Content
A phyloseq object is made of up to 5 components (or slots):
The biom format natively supports
The other components are optional and must be stored in separate files
The import functions create consistent objects with
Check the import phyloseq data
Projet MetaPDOcheese [5]
It’s useful to explore the data and prepare datasets to be analysed depending on the biological questions.
Samples filter
Agglomerate taxa at higher taxonomic rank allows
Spread taxonomy to remove unknown and multi-affiliations by spreading the last known rank to further ranks
Nature of microbiome data
Raw abundances of taxa per sample are not comparable between samples
Rarefying = subsampling sequence reads without replacement based on the smallest total number of reads
All samples have the same depth, set as the smallest number of reads. Rarefied abundances of taxa per sample are recommanded for diversity analyses.
Rarefaction curve:
Preprocess data
Some slides were adapted from Mahendra Mariadassou’s supports
How many species in each sample/community ? Distribution of species ?
4 categories, the most popular indices
Note \(c_i\) the number of species observed \(i\) times (\(i = 1, 2, \dots\)) and \(p_s\) the proportion of species \(s\) (\(s = 1, \dots, S\)).
Note \(p_s = n_{s}/N\) the proportion of species \(s\) (\(s = 1, \dots, S\)), \(n_{s}\) number of species \(s\) et \(N\) total number of species.
Take into account the relative abundance of each taxon \(p_s\)
Take into account the relative abundance of each taxon \(p_s\)
Capture how dominated a community is by its most abundant taxa, asking “What is the probability that two randomly chosen sequences belong to the same taxon?”
Interpretation: - If one specie dominates completely : D = 1 and 1/D = 1 (minimal value) - 1/D increases with diversity (as would Shannon, richness and others)
| Even | Uneven | |
|---|---|---|
| Observed | 15 | 15 |
| Shannon | 2.71 | 2.06 |
| invSimpson | 15 | 5.45 |
Low Shannon and invSimpson, communities are dominated by a few abundant taxa.
Many \(\alpha\) diversities (Observed, Chao) depend a lot on rare ASV. Do not trim rare ASV before computing them as it can drastically alter the result.
Consider working with rarefied data
Perform an analysis of variance (ANOVA) to test the impact of some covariates in the experimental design (in the sample_data table).
Test the null hypothesis \(H_O\), there is no difference between the biological conditions (groups) versus the mean is different between groups.
Assumptions
Reject \(H_0\) when \(Pr(F > f) \leq \alpha\) (significant level, 5% usually)
To understand group differences in ANOVA, conduct post hoc tests also called “multiple comparison analysis” tests.
Test and interpret
best approach to control for uneven sequencing effort
Schloss PD. 2024. Rarefaction is currently the best approach to control for uneven sequencing effort in amplicon sequence analyses. mSphere 9:e00354-23. https://doi.org/10.1128/msphere.00354-23
measures the fraction of species specific to either A or B
\({\text{Jaccard}} = \frac{\sum_{s} 1_{\{n^A_s > 0, n^B_s = 0 \}} + 1_{\{n^B_s > 0, n^A_s = 0 \}}}{\sum_{s} 1_{\{n^A_s + n^B_s > 0 \}}}\)
Note \(n^A_s\) the count of species \(s\) (\(s = 1, \dots, S\)) in community \(A\) and \(n^B_s\) the count in community \(B\). We focus on shared taxa.
| Bacteria | Sample A | Sample B |
|---|---|---|
| 🌹 | 50 | 10 |
| 🌻 | 10 | 50 |
| 🌸 | 20 | 5 |
| 🌼 | 5 | 20 |
| Total | 85 | 85 |
\[ \frac{0}{4 + 4} = 0 \]
measures the fraction of species specific to either A or B
\({\text{Jaccard}} = \frac{\sum_{s} 1_{\{n^A_s > 0, n^B_s = 0 \}} + 1_{\{n^B_s > 0, n^A_s = 0 \}}}{\sum_{s} 1_{\{n^A_s + n^B_s > 0 \}}}\)
Note \(n^A_s\) the count of species \(s\) (\(s = 1, \dots, S\)) in community \(A\) and \(n^B_s\) the count in community \(B\).
Focus on shared taxa.
The Bray Curtis distance mixes which species are present in each sample and how abundant they are.
\(\displaystyle {\text{BC}} = \sum_{s} |n^A_s - n^B_s| / \sum_{s} |n^A_s + n^B_s|\)
Note \(n^A_s\) the count of species \(s\) (\(s = 1, \dots, S\)) in community \(A\) and \(n^B_s\) the count in community \(B\).
| Bacteria | Sample A | Sample B |
|---|---|---|
| 🌹 | 50 | 10 |
| 🌻 | 10 | 50 |
| 🌸 | 20 | 5 |
| 🌼 | 5 | 20 |
| Total | 85 | 85 |
| Bacteria | Diff. |A - B| |
|---|---|
| 🌹 | |50 - 10| = 40 |
| 🌻 | |10 - 50| = 40 |
| 🌸 | |20 - 5| = 15 |
| 🌼 | |5 - 20| = 15 |
| Total | 110 |
\[ \frac{110}{85 + 85} = 0.647 \]
The Bray Curtis distance mixes which species are present in each sample and how abundant they are.
\(\displaystyle {\text{BC}} = \sum_{s} |n^A_s - n^B_s| / \sum_{s} |n^A_s + n^B_s|\)
Note \(n^A_s\) the count of species \(s\) (\(s = 1, \dots, S\)) in community \(A\) and \(n^B_s\) the count in community \(B\).
Focus on shared taxa
measures the proportion of the length of the phylogenetic tree specific to either a sample or the other.
\({\text{UF}} = \frac{\sum_{e} l_e \left[ 1_{\{p_e > 0, q_e = 0 \}} + 1_{\{q_e > 0, p_e = 0 \}} \right] }{\sum_{e} l_e \times 1_{\{p_e + q_e > 0 \}}}\)
For each branch \(e\), note \(l_e\) its length and \(p_e\) (resp. \(q_e\)) the fraction of community \(A\) (resp. community \(B\)) below branch \(e\).
Lozupone C, Knight R2005.UniFrac: a New Phylogenetic Method for Comparing Microbial Communities. Appl Environ Microbiol71:.https://doi.org/10.1128/AEM.71.12.8228-8235.2005
measures the proportion of the length of the phylogenetic tree specific to either a sample or the other.
\(d_{\text{UF}} = \frac{\sum_{e} l_e \left[ 1_{\{p_e > 0, q_e = 0 \}} + 1_{\{q_e > 0, p_e = 0 \}} \right] }{\sum_{e} l_e \times 1_{\{p_e + q_e > 0 \}}}\)
For each branch \(e\), note \(l_e\) its length and \(p_e\) (resp. \(q_e\)) the fraction of community \(A\) (resp. community \(B\)) below branch \(e\).
Focus on shared taxa
measures the proportion of the length of the phylogenetic tree specific to a sample, weighted by the abundance differences.
\({\text{wUF}} = \frac{\sum_{e} l_e | p_e - q_e| }{\sum_{e} l_e (p_e + q_e)}\)
For each branch \(e\), note \(l_e\) its length and \(p_e\) (resp. \(q_e\)) the fraction of community \(A\) (resp. community \(B\)) below branch \(e\).
Focus on shared taxa
| Qualitative | Quantitative | |
|---|---|---|
| No phylogeny | Jaccard | Bray-Curtis |
| With phylogeny | Unifrac | Weigthed-Unifrac |
In general, qualitative diversities are most sensitive to factors that affect presence/absence of organisms (such as pH, salinity, depth, etc) and therefore useful to study and define bioregions (regions with little of no flow between them)…
… whereas quantitative distances focus on factors that affect relative changes (seasonal changes, nutrient availability, concentration of oxygen, depth, etc) and therefore useful to monitor communities over time or along an environmental gradient.
best approach to control for uneven sequencing effort
Schloss PD. 2024. Rarefaction is currently the best approach to control for uneven sequencing effort in amplicon sequence analyses. mSphere 9:e00354-23. https://doi.org/10.1128/msphere.00354-23
Consider working with rarefied data
Multi-Dimensional Scaling (MDS), also named Principal Coordinate Analysis (PCoA)
Aim: find a lower dimensional representation of the data such that the most information is retained.
PCoA focuses on inter-sample dissimilarities
Test the differences in the community composition of communities (beta diversity) from different groups using a distance matrix,
with the assumption of homogeneous dispersions.
Permutational Multivariate ANOVA (PERMANOVA) compares the variation between groups to the variation within groups.
Three partitions of the variation
Limitations
Some slides were adapted from RNASeq formation, module 16 and 23’s Migale formation supports
Aim: Identify ASV (or any taxonomic rank) with differential abundance between biological conditions ?
Challenges of microbiome data to be addressed in DAA:
There are many methods, and this remains an active area of research. There is currently no consensus.
DESeq2 is based on negative binomial generalized linear model and implemented in Easy16S to perform DAA.
The model is defined as follows:
\(K_{ij} \sim\) NB(mean = \(\mu_{ij}\), dispersion = \(\phi_i\))
\(\log \mu_{ij} = x_j^T \beta_i + \log s_{ij}\)
where \(K_{ij}\) is the count for ASV i in sample j, \(\phi_i\) is is the ASV-specific dispersion, \(s_{ij}\) effective library size (e.g. sequencing depth), \(\mathbf{X}=\left[x_{j}\right]\) is the design matrix and \(\beta_i\) is the vector of coefficients.
A Generalized Linear Model (GLM) allows to decompose the effects on the mean of different factors and their interactions.
Differential abundance analysis is performed taxon-by-taxon with replicates, using directly raw counts.
Normalization is included in the model. Adapted to compare samples with similar community compositions.
Assumptions:
Some slides were adapted from RNASeq formation, module 16 and 23’s Migale formation supports
False positive (FP) (type I error: \(\alpha\)) : A non differentially adundant (DA) taxon which is declared DA.
For all ‘taxa (ASV)’, we test \(H_0\) = {taxon \(i\) is not DA} vs \(H_1\) = {the taxon is DA} using a statistical test (calcul of a score)
Let assume all the \(N\) taxa are not DA. Each test is realized at \(\alpha\) level
Ex: \(N=10 000\) taxa and \(\alpha= 0.05\) \(\rightarrow\) \(\mathbb{E}(FP)=500\) taxa.
Probability of having at least one Type I error (false positive), of declaring DA at least one non DA taxon.
\[FWER = \mathbb{P}(FP \ge 1)\]
The Bonferroni procedure
For \(N=2 000\), FWER \(\leq\alpha^*=0.05\), \(\alpha= 2.5 10^{-5}\).
Easy but conservative and not powerful.
When the number of tests increases, the FWER \(\rightarrow\) 1 with constant FP.
Idea: Do not control the error rate but the proportion of error
\(\Rightarrow\) less conservative than control of the FWER.
The false discovery rate of FDR’s Benjamini-Hochberg is the expected proportion of Type I errors among the rejected hypotheses
\(FDR = \mathbb{E}(FP / P)\) if \(P > 0\) and \(0\) if \(P=0\)
Principle: The number of declared positive elements \(P\) is given by the greater rank \(i\)
\[p_{(i)} \leq i \alpha^* / N\] with N the total number of taxa (ASV).
FDR \(\leq\) FWER
Some slides were adapted from RNASeq formation, module 16 and 23’s Migale formation supports
A good design is a list of experiments to conduct in order to answer to the asked question which maximize collected information and minimize experiments cost with respect to constraints.
A microbiome analysis involves…

Module 20 - Metabarcoding