16S metagenomic data analysis - FROGS

Module 20

Olivier Rué

MaIAGE - Migale

June 8, 2026

FROGS

FROGS team

  • FROGS has been an INRAE development project since 2015

FROGS evolution

FROGS deals with long and short reads

FROGS offers numerous databanks

frogs.inrae.fr

FROGS articles

 

Understand FROGS

How to use FROGS

  • Command line
remove_chimera.py \
  --input-biom clustering.biom \
  --input-fasta clustering.fasta \
  --non-chimera remove_chimera.fasta \
  --out-abundance remove_chimera.biom \
  --summary remove_chimera.html
  • Galaxy instances via web

FROGS Core

FROGS Stat

FROGS Func

FROGS docs and help

  • Website: https://frogs.inrae.fr
  • Github: https://github.com/geraldinepascal/FROGS.git
  • Newsletter: subscription request at frogs-support@inrae.fr
  • Need help
    • frogs-support@inrae.fr for generic questions

FROGS companion tools

TP1: Introduction to Galaxy

Sequencing data

FASTQ format

@ST-E00114:1342:HHMGVCCX2:1:1101:3123:2012 1:N:0:TCCGGAGA+TCAGAGCC
CTTGGTCATTTAGAG
+
***<<*AEF???***
@ST-E00114:1342:HHMGVCCX2:1:1101:11556:2030 1:N:0:TCCGGAGA+TCAGAGCC
CATTGGCCATATCAT
+
AAAE??<<*???***

Meaning

@Identifier1 (comment)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+
QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
@Identifier2 (comment)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+
QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
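
The four-line layout above can be read with a few lines of Python. This is only a minimal sketch: `FastqRecord` and `parse_fastq` are illustrative names, and it assumes strict four-line records (no wrapped sequences).

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # text after '@', up to the first space
    comment: str     # rest of the header line (may be empty)
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines: header, sequence, '+', quality."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(stripped), 4):
        header, sequence, separator, quality = stripped[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed record: {header}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header}")
        identifier, _, comment = header[1:].partition(" ")
        yield FastqRecord(identifier, comment, sequence, quality)


EXAMPLE = """\
@ST-E00114:1342:HHMGVCCX2:1:1101:3123:2012 1:N:0:TCCGGAGA+TCAGAGCC
CTTGGTCATTTAGAG
+
***<<*AEF???***
@ST-E00114:1342:HHMGVCCX2:1:1101:11556:2030 1:N:0:TCCGGAGA+TCAGAGCC
CATTGGCCATATCAT
+
AAAE??<<*???***
"""

records = list(parse_fastq(EXAMPLE.splitlines()))
for record in records:
    print(record.identifier, len(record.sequence))
```

Each sequence line must have exactly as many characters as its quality line, which is why the sketch rejects records where the two lengths differ.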

Quality score encoding

Quality score

Measure of the quality of the identification of the nucleobases generated by automated DNA sequencing
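
As an illustration, quality characters can be turned into scores and error probabilities. The sketch below assumes the Phred+33 encoding (ASCII code minus 33), which is the usual one for recent Illumina data; the function names are illustrative.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string to Phred scores (assumed offset: 33)."""
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Probability that the base call is wrong: p = 10^(-Q/10)."""
    return 10 ** (-score / 10)


scores = phred_scores("***<<*AEF???***")
print(scores)
print(error_probability(30))  # Q30: one expected error in 1000 base calls
```

With this encoding, `*` is Q9 (about one error in eight calls) and `F` is Q37, so the first and last bases of this read are far less reliable than the middle ones.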

FASTQ compression

  • Compression is essential to deal with FASTQ files (reduce disk storage)
  • Extension: file.fastq.gz
  • Almost all tools can read compressed files directly
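
A short sketch of reading a compressed file without decompressing it on disk, using Python's standard `gzip` module. The helper `count_reads` and the temporary file are illustrative, and strict four-line records without blank lines are assumed.

```python
import gzip
import os
import tempfile


def count_reads(path: str) -> int:
    """Count FASTQ records, reading .gz files transparently."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


fastq_text = (
    "@read1\nCTTGGTCATTTAGAG\n+\n***<<*AEF???***\n"
    "@read2\nCATTGGCCATATCAT\n+\nAAAE??<<*???***\n"
)

# Write a small compressed file in a temporary directory, then read it back
with tempfile.TemporaryDirectory() as workdir:
    compressed_path = os.path.join(workdir, "file.fastq.gz")
    with gzip.open(compressed_path, "wt") as handle:
        handle.write(fastq_text)
    n_reads = count_reads(compressed_path)

print(n_reads)
```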

Quality control

Quality control

  • One of the easiest steps in bioinformatics …
  • … but one of the most important
  • Checks that everything is OK
  • Indicates if/how to clean reads
  • Shows possible sequencing problems
  • The results must be interpreted in relation to what has been sequenced

Reads are not perfect

Why QC your reads?

Try to answer (not always) simple questions:

  • Do the data conform to the expected level of performance?
    • Size / Number of reads / Quality
  • Residual presence of adapters or indexes?
  • (Un)expected technical biases?
  • (Un)expected biological biases?

Warning

QC without context leads to misinterpretation!

TP2: Quality control

FROGS workflow

FROGS workflow

FROGS Read processing

FROGS Read processing

  • Preprocessing
    • Paired-end merging of R1 and R2 reads with vsearch [4], flash [5] or pear [6] (only in command line)
    • Find and remove primer sequences with cutadapt [7]
    • Delete sequences with unexpected lengths
    • Delete sequences with ambiguous bases (N)
    • Dereplication
  • Clustering sequences with swarm [8] / or / Denoising with DADA2 [9]
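
The length filter, N filter and dereplication can be sketched as follows. This is not FROGS code: the function names and the length bounds are illustrative assumptions, and the reads are supposed to be already merged and primer-trimmed.

```python
from collections import Counter


def keep_sequence(sequence: str, min_length: int, max_length: int) -> bool:
    """Keep sequences of expected length that contain no ambiguous base (N)."""
    return min_length <= len(sequence) <= max_length and "N" not in sequence.upper()


def dereplicate(sequences: list[str]) -> Counter:
    """Collapse identical sequences and keep their abundance."""
    return Counter(sequences)


reads = ["ACGTACGT", "ACGTACGT", "ACGTNCGT", "ACG", "ACGTACGA"]
kept = [seq for seq in reads if keep_sequence(seq, min_length=6, max_length=10)]
unique = dereplicate(kept)
print(unique.most_common())
```

Dereplication matters for the next step: clustering and denoising then work on unique sequences with an abundance rather than on every read.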

FROGS Read processing with swarm

  • R1 and R2 reads are together
  • Find 5’ primer and remove it
  • Find 3’ primer and remove it
  • Remove sequences with N’s inside

FROGS Read processing with swarm

  • Remove shortest and longest sequences
  • Remove sequences with N’s inside
  • Dereplication
  • Clustering with swarm

FROGS Read processing with DADA2

  • R1 and R2 reads are together
  • Find 5’ primer and remove it
  • Find 3’ primer and remove it
  • Remove reads with N’s inside

FROGS Read processing with DADA2

  • Correct sequencing errors
  • Merge R1 and R2
  • Remove shortest and longest sequences
  • Dereplication

Clustering / Denoising

Sequencing data are noisy

How to deal with these noisy sequences?

  • Comparison all against all
    • Very accurate
    • Requires a lot of memory and/or time
  • Clustering
    • closed-reference / open-reference
    • de novo
  • Denoising

Vocabulary

  • Many terms exist for the features built by software
    • OTUs, zOTUs, ASVs, ESVs…
  • A recent review establishes the vocabulary [10]
    • OTUs / ASVs / swarm clusters
  • ASVs are identical denoised reads with as few as 1 base pair difference between variants, representing an inference of the biological sequences prior to amplification and sequencing errors
  • OTUs are formed with a % threshold clustering
  • Swarm clusters are a third feature type

OTU paradigm

  • Operational Taxonomic Unit

Operational Taxonomic Units

ASV paradigm

  • Amplicon Sequence Variants

ASVs are inferred by a de novo process in which biological sequences are discriminated from errors on the basis of the expectation that biological sequences are more likely to be repeatedly observed than are error-containing sequences

ASV resolution

  • ASV resolution changes the composition for these samples

Swarm

Swarm [8] is a notably different sequence clustering approach, which, while technically a clustering algorithm, may also be considered a denoising method when using the fastidious method with d=1. It relies on the maximum number of differences between reads (local linking threshold) and forms clusters that are resilient to input-order changes, thus creating stable, high-resolution features (herein referred to as swarm-clusters). When using the fastidious method with d=1, swarm aims to produce clusters centered around real biological sequences, where clusters represent sequence variants.

Since FROGS uses swarm (with the fastidious method with d=1) and strongly promotes denoising by chimera removal and cluster filtering, FROGS produces ASVs.

Why Swarm?

  • Fixed clustering threshold is a real problem
  • OTU construction is input-order dependent

Swarm: A smart idea

Swarm

  • A robust and fast clustering method for amplicon-based studies
  • The purpose of swarm is to provide a novel clustering algorithm to handle large sets of amplicons
  • swarm results are resilient to input-order changes and rely on a small local linking threshold d, the maximum number of differences between two amplicons
  • swarm forms stable high-resolution clusters, with a high yield of biological information
  • Default: forms many low-abundance OTUs that are in fact artifacts and need to be removed
  • Swarm (fastidious method + d=1) clusters + filters → ASVs
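
The local linking idea can be illustrated with a deliberately simplified sketch. This is not swarm's implementation: it leaves out swarm's refinements (such as the fastidious option) and its optimisations, and `differences` and `local_linking_clusters` are illustrative names. Amplicons are taken as seeds by decreasing abundance, and a cluster grows by repeatedly adding any amplicon within d differences of a sequence already in the cluster.

```python
from collections import deque


def differences(a: str, b: str) -> int:
    """Edit distance: number of substitutions, insertions and deletions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution or match
            ))
        previous = current
    return previous[-1]


def local_linking_clusters(abundances: dict[str, int], d: int = 1) -> list[list[str]]:
    """Grow clusters from abundant seeds by iterative linking at distance <= d."""
    pool = sorted(abundances, key=lambda seq: (-abundances[seq], seq))
    assigned: set[str] = set()
    clusters: list[list[str]] = []
    for seed in pool:
        if seed in assigned:
            continue
        cluster = [seed]
        assigned.add(seed)
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            for candidate in pool:
                if candidate not in assigned and differences(current, candidate) <= d:
                    assigned.add(candidate)
                    cluster.append(candidate)
                    queue.append(candidate)
        clusters.append(cluster)
    return clusters


amplicons = {"AAAA": 100, "AAAT": 10, "AATT": 3, "GGGG": 50, "GGGC": 2}
clusters = local_linking_clusters(amplicons, d=1)
print(clusters)
```

In this toy dataset, AATT differs from the seed AAAA by two bases but still joins its cluster through AAAT: the threshold applies between neighbours, not between each sequence and the cluster centre, so no global similarity cutoff is needed.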

d: the small local linking threshold

Swarm steps

DADA2

  • It is a complete workflow with a new, original approach to grouping sequences.
  • It uses an error model that incorporates read quality information, and estimates the probability that a low-abundance read is an error derived from a more abundant read (incorporating specific base-transition probabilities, computed on the dataset).
  • In FROGS, only the denoising algorithm is included.

DADA2 options in FROGS

  • Independent pooling
    • By construction, it is impossible to observe an abundance of 1 for any ASV.
    • Singleton reads are classified in the ‘least likely’ partition, with potentially significant sequence differences, even though, in some cases, the same sequence exists (and is abundant) in other samples
  • Full pooling
    • One solution to better reconstruct rare sequences in samples using information from other samples is to pool the reads from all samples (as is done with Swarm).
    • Partitioning is therefore performed on all samples’ reads, and reads that were singletons in one sample are no longer considered as such if they were also present in another sample.
    • However, full pooling does not scale linearly with the number of samples, and computation time grows steeply for large datasets
  • Pseudo-pooling
    • Samples are first treated independently before a second treatment using all ASVs identified in the first round as ‘priors’ (the initial centres of partitions with an abundance of 1)

Advantages and drawbacks

  • Which type of features to prefer may be context-dependent, and both may even be used in the same study
  • ASVs demonstrate a biologically informative fine-scale resolution [11]
  • But it is difficult to separate noise from real signal among low-abundance reads [12]
  • ASVs represent stable and reproducible units across studies whereas OTUs are dataset-specific features (swarm clusters are not)
    • problematic for longitudinal and very big studies

TP FROGS read processing

Remove chimera

Chimera removal

Chimera detection strategies

  • Reference based: against a database of «genuine» sequences

    • dependent on the references used
  • De novo: against abundant sequences in the samples

  • FROGS uses vsearch [4] as chimera removal tool

FROGS remove chimera

A little extra: the sample-cross validation

  • FROGS adds a sample-cross validation
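
A possible reading of this cross-validation, written as a sketch. The rule used here is an assumption, not taken from this slide: a sequence is discarded only if it is flagged as a chimera in every sample where it is present. The function and variable names are illustrative.

```python
def cross_validated_chimeras(
    presence: dict[str, set[str]],
    flagged: dict[str, set[str]],
) -> set[str]:
    """Return clusters flagged as chimera in ALL samples where they occur.

    presence: cluster -> samples in which the cluster is observed
    flagged:  cluster -> samples in which it was detected as a chimera
    Assumed rule (see lead-in): one sample without a chimera call rescues the cluster.
    """
    return {
        cluster
        for cluster, samples in presence.items()
        if samples and flagged.get(cluster, set()) >= samples
    }


presence = {"c1": {"A", "B"}, "c2": {"A", "B"}, "c3": {"B"}}
flagged = {"c1": {"A", "B"}, "c2": {"A"}, "c3": set()}
chimeras = cross_validated_chimeras(presence, flagged)
print(chimeras)
```

Under this assumed rule, c2 is kept because sample B does not support the chimera call made in sample A.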

Chimera rates in samples

  • From 5 to 40% in 16S data

  • Fewer with ITS (<10%)

TP FROGS Remove chimera

Abundance/Prevalence filters

How to filter clusters?

  • Filters on abundance
    • absolute or relative
  • Filters on prevalence
    • global or by group
  • Filters on most abundant
  • Filters on contaminants
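
A minimal sketch of an abundance filter combined with a prevalence filter, on a toy count table. The thresholds, the names and the table are illustrative assumptions, not FROGS defaults.

```python
def filter_clusters(
    counts: dict[str, dict[str, int]],
    min_relative_abundance: float,
    min_samples: int,
) -> dict[str, dict[str, int]]:
    """Keep clusters that are abundant enough AND present in enough samples."""
    total = sum(sum(per_sample.values()) for per_sample in counts.values())
    if total == 0:
        return {}
    kept = {}
    for cluster, per_sample in counts.items():
        abundance = sum(per_sample.values())
        prevalence = sum(1 for count in per_sample.values() if count > 0)
        if abundance / total >= min_relative_abundance and prevalence >= min_samples:
            kept[cluster] = per_sample
    return kept


counts = {
    "Cluster_1": {"S1": 500, "S2": 450, "S3": 30},
    "Cluster_2": {"S1": 0, "S2": 12, "S3": 7},
    "Cluster_3": {"S1": 1, "S2": 0, "S3": 0},
}
kept = filter_clusters(counts, min_relative_abundance=0.005, min_samples=2)
print(sorted(kept))
```

Cluster_3 is a singleton seen in one sample only (0.1% of the 1000 reads), so it fails both filters.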

TP FROGS Cluster/ASV filters

Taxonomic affiliation

Taxonomic affiliation

  • Taxonomic affiliation assigns a taxonomic identification to ASVs (if possible).
  • Taxonomic affiliations are determined by comparing ASVs to sequences contained within databanks

A lot of solutions

  • FROGS: Uses an alignment-based consensus approach (primarily BLAST).

  • RDP Classifier: Implements a Naive Bayesian classifier that cuts sequences into 8-mer words to calculate the probability of a sequence belonging to a specific taxonomic node, providing a bootstrap confidence score for each rank.

  • Sintax: Operates as a fast, non-Bayesian classifier that uses k-mer similarity to find the top matching sequences in a reference database and calculates taxonomic confidence via bootstrap sampling of those k-mers.

  • IDTaxa (DECIPHER): Employs a machine learning approach based on a novel classification algorithm that reduces over-classification errors by inherently learning when to stop assigning taxonomy if the evidence is insufficient.

  • DADA2: Uses a Naive Bayesian classifier implementation (adapted from the RDP Classifier algorithm) that breaks sequences into 8-mers to assign taxonomy against reference databases, while uniquely allowing for an optional, exact-string-matching step to resolve assignments down to the strict 100% species level.

Comparison of approaches

The FROGS databanks

  • FROGS gives access to numerous databanks (~130!)
  • Command line: you can use your own databank
  • Galaxy
    • Admins have to format and add your databank
  • The file must be well formatted; we can do it for you
  • For private databanks, contact us!

LEAP helps you to choose the appropriate databank

LEAP

The FROGS extra: the multi-affiliations

  • FROGS gives all identical hits
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus xylosus
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus saprophyticus

Strictly identical (V1-V3 amplification) on 499 nucleotides

  • FROGS cannot decide between one and the other
  • You have to check if you can choose between multi-affiliations
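
One way to summarise such equally good hits is to keep the ranks on which they agree and mark the others. The sketch below is illustrative, not FROGS code; the placeholder label is an assumption.

```python
def consensus_taxonomy(hits: list[str], placeholder: str = "Multi-affiliation") -> str:
    """Keep shared ranks; replace the first diverging rank and all lower ones."""
    split_hits = [hit.split(";") for hit in hits]
    consensus = []
    diverged = False
    for names in zip(*split_hits):
        if not diverged and len(set(names)) == 1:
            consensus.append(names[0])
        else:
            diverged = True
            consensus.append(placeholder)
    return ";".join(consensus)


hits = [
    "Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus xylosus",
    "Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus saprophyticus",
]
consensus = consensus_taxonomy(hits)
print(consensus)
```

Here the two hits agree down to the genus, so only the species rank is marked as ambiguous.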

To help you

TP FROGS Taxonomic Affiliation

TP FROGS Biom to TSV

Filter ASVs based on their affiliations

Affiliation filters

  • Remaining contamination?

    • Chloroplast, Mitochondria…
  • Want to analyse only the Firmicutes?

  • Want to remove ASVs without affiliation?

  • Want to hide affiliations when metrics are too poor?

  • Want to ignore taxonomies with unknown species?

  • 2 modes

    • Deleting: remove ASVs
    • Hiding: only the affiliation is modified, not the abundance
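
The difference between the two modes can be sketched as follows. The record layout, the function name and the Chloroplast lineage string are illustrative assumptions.

```python
from typing import Callable


def filter_affiliations(
    asvs: list[dict],
    is_rejected: Callable[[dict], bool],
    mode: str,
) -> list[dict]:
    """'delete' drops rejected ASVs; 'hide' keeps them but blanks their taxonomy."""
    if mode not in ("delete", "hide"):
        raise ValueError("mode must be 'delete' or 'hide'")
    filtered = []
    for asv in asvs:
        if not is_rejected(asv):
            filtered.append(dict(asv))
        elif mode == "hide":
            # Abundance is kept; only the affiliation is removed
            filtered.append({**asv, "taxonomy": None})
    return filtered


def is_chloroplast(asv: dict) -> bool:
    return "Chloroplast" in (asv["taxonomy"] or "")


asvs = [
    {"name": "ASV_1", "abundance": 120, "taxonomy": "Bacteria;Firmicutes;Bacilli"},
    {"name": "ASV_2", "abundance": 40, "taxonomy": "Bacteria;Cyanobacteria;Chloroplast"},
]
deleted = filter_affiliations(asvs, is_chloroplast, mode="delete")
hidden = filter_affiliations(asvs, is_chloroplast, mode="hide")
print(len(deleted), len(hidden))
```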

Phylogenetic tree

FROGS tree

  • This tool builds a phylogenetic tree thanks to affiliations of ASVs contained in the BIOM file
  • Needed to compute beta-diversity indices based on phylogenetic distances
  • Useful for exploring poorly characterized environments

TP FROGS Tree

FROGSfunc: function inference

Concepts

Based on PICRUSt2

  • PICRUSt [15] (Phylogenetic investigation of communities by reconstruction of unobserved states) is an open-source tool.
  • It is software for predicting functional abundances based only on marker gene sequences
  • PICRUSt2 is composed of 4 python applications.
  • No graphic interface exists to run PICRUSt2 for non-expert users.

How it works

  1. Places the ASVs into a reference phylogenetic tree and predicts the marker copy number in each ASV.
  2. Predicts the copy number of each function in each ASV, then calculates function abundances in each sample and ASV abundances normalized by marker copy number.
  3. Calculates pathway abundances in each sample.
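
The arithmetic of step 2 can be illustrated for one sample. This is a simplified sketch, not PICRUSt2 code; the ASV names, function names and copy numbers are hypothetical.

```python
def predict_function_abundances(
    asv_counts: dict[str, int],
    marker_copies: dict[str, float],
    function_copies: dict[str, dict[str, float]],
) -> tuple[dict[str, float], dict[str, float]]:
    """Normalize ASV counts by marker copy number, then sum function copies."""
    normalized = {
        asv: count / marker_copies[asv] for asv, count in asv_counts.items()
    }
    function_abundances: dict[str, float] = {}
    for asv, abundance in normalized.items():
        for function, copies in function_copies[asv].items():
            function_abundances[function] = (
                function_abundances.get(function, 0.0) + abundance * copies
            )
    return normalized, function_abundances


asv_counts = {"ASV_1": 100, "ASV_2": 30}          # reads in one sample
marker_copies = {"ASV_1": 4, "ASV_2": 1}          # predicted 16S copies per genome
function_copies = {
    "ASV_1": {"function_A": 2, "function_B": 1},  # predicted gene copies per genome
    "ASV_2": {"function_A": 1},
}
normalized, functions = predict_function_abundances(
    asv_counts, marker_copies, function_copies
)
print(normalized, functions)
```

ASV_1 has more reads than ASV_2, but once its four marker copies are accounted for it represents fewer genomes (25 against 30).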

FROGSfunc placeseqs and copynumber

NSTI

  • NSTI scores are simply the average branch length that separates each ASV in your sample from a reference bacterial genome, weighted by the abundance of that ASV in the sample.
  • PICRUSt2 sets the NSTI threshold to 2 by default. Some studies have shown that this threshold is permissive. It is therefore important to check whether the PICRUSt2 and FROGS taxonomies agree, so that a more stringent threshold can be chosen afterwards if needed.
    • 0 < Good < 0.5
    • 0.5 <= Medium < 1
    • 1 <= Bad < 2
    • To exclude >= 2
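
The categories above and the abundance-weighted score can be written as a short sketch. The function names and values are illustrative; treating an NSTI of exactly 0 as "good" is an assumption.

```python
def nsti_category(nsti: float) -> str:
    """Map an NSTI value to the categories listed above."""
    if nsti < 0:
        raise ValueError("NSTI cannot be negative")
    if nsti < 0.5:
        return "good"
    if nsti < 1:
        return "medium"
    if nsti < 2:
        return "bad"
    return "exclude"


def weighted_nsti(nsti_by_asv: dict[str, float], abundances: dict[str, int]) -> float:
    """Average NSTI of a sample, weighted by ASV abundance."""
    total = sum(abundances.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return sum(nsti_by_asv[asv] * count for asv, count in abundances.items()) / total


nsti_by_asv = {"ASV_1": 0.1, "ASV_2": 1.5}
abundances = {"ASV_1": 90, "ASV_2": 10}
sample_nsti = weighted_nsti(nsti_by_asv, abundances)
print(nsti_category(sample_nsti), [nsti_category(v) for v in nsti_by_asv.values()])
```

A sample can look good overall while containing poorly placed ASVs, as with ASV_2 here, which is why the per-ASV values are worth checking as well.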

FROGSfunc functions

FROGSfunc pathways

FROGS Stat: Exploring diversity

References

1. Escudié F, Auer L, Bernard M, Mariadassou M, Cauquil L, Vidal K, et al. FROGS: Find, Rapidly, OTUs with Galaxy Solution. Bioinformatics. 2018;34:1287–94. doi:10.1093/bioinformatics/btx791.
2. Bernard M, Rué O, Mariadassou M, Pascal G. FROGS: a powerful tool to analyse the diversity of fungi with special management of internal transcribed spacers. Briefings in Bioinformatics. 2021;22. doi:10.1093/bib/bbab318.
3. Stoler N, Nekrutenko A. Sequencing error profiles of Illumina sequencing instruments. NAR Genomics and Bioinformatics. 2021;3. doi:10.1093/nargab/lqab019.
4. Rognes T, Flouri T, Nichols B, Quince C, Mahé F. VSEARCH: A versatile open source tool for metagenomics. PeerJ. 2016;4:e2584.
5. Magoč T, Salzberg SL. FLASH: Fast length adjustment of short reads to improve genome assemblies. Bioinformatics. 2011;27:2957–63.
6. Zhang J, Kobert K, Flouri T, Stamatakis A. PEAR: A fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics. 2013;30:614–20.
7. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011;17:10–2. doi:10.14806/ej.17.1.200.
8. Mahé F, Rognes T, Quince C, Vargas C de, Dunthorn M. Swarm v2: Highly-scalable and high-resolution amplicon clustering. PeerJ. 2015;3:e1420.
9. Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP. DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods. 2016;13:581.
10. Hakimzadeh A, Abdala Asbun A, Albanese D, Bernard M, Buchner D, Callahan B, et al. A pile of pipelines: An overview of the bioinformatics software for metabarcoding data analyses. Molecular Ecology Resources. 2023.
11. Couton M, Baud A, Daguin-Thiébaut C, Corre E, Comtet T, Viard F. High-throughput sequencing on preservative ethanol is effective at jointly examining infraspecific and taxonomic diversity, although bioinformatics pipelines do not perform equally. Ecology and Evolution. 2021;11:5533–46.
12. De Santiago A, Pereira TJ, Mincks SL, Bik HM. Dataset complexity impacts both MOTU delimitation and biodiversity estimates in eukaryotic 18S rRNA metabarcoding studies. Environmental DNA. 2022;4:363–84.
13. Jumpstart Consortium Human Microbiome Project Data Generation Working Group. Evaluation of 16S rDNA-based community profiling for human microbiome research. PLoS ONE. 2012;7:e39315.
14. Murali A, Bhargava A, Wright ES. IDTAXA: A novel approach for accurate taxonomic classification of microbiome sequences. Microbiome. 2018;6:1–14.
15. Douglas GM, Maffei VJ, Zaneveld JR, Yurgel SN, Brown JR, Taylor CM, et al. PICRUSt2 for prediction of metagenome functions. Nature Biotechnology. 2020;38:685–8.