Module 24
March 18, 2024
Distribution of documents
All documents presented during this training course are intended for distribution, and are available on https://documents.migale.inrae.fr
Collaboration
We take care of the analyses
Support
We help you perform the analyses
conda activate insilicoseq-1.5.4
# after a first random draw retrieving 50 bacterial genomes, 4 viral genomes and 10 archaeal genomes
iss generate --seed 13062022 -k bacteria viruses archaea -U 50 4 10 --model hiseq --output s1 -n 2000000 --abundance_file hiseq_ncbi_abundance.txt
iss generate --seed 14062022 --genomes s1_genomes.fasta --model hiseq --output s1 -n 2000000 --abundance_file s1_abundance.txt -z
iss generate --seed 14062022 --genomes s2_genomes.fasta --model hiseq --output s2 -n 2000000 --abundance_file s2_abundance.txt -z
iss generate --seed 14062022 --genomes s1_genomes.fasta --model hiseq --abundance uniform --output s3 -n 2000000 -z
iss generate --seed 14062022 --genomes s2_genomes.fasta --model hiseq --abundance uniform --output s4 -n 1000000 -z
iss generate --seed 14062022 --genomes s1_genomes.fasta --model hiseq --output s5 -n 2000000 --abundance_file s1_abundance.txt -z
save
/work
/home
qsub
qstat
conda activate
We will only deal with short reads during this training.
@ST-E00114:1342:HHMGVCCX2:1:1101:3123:2012 1:N:0:TCCGGAGA+TCAGAGCC
CTTGGTCATTTAGAG
+
***<<*AEF???***
@ST-E00114:1342:HHMGVCCX2:1:1101:11556:2030 1:N:0:TCCGGAGA+TCAGAGCC
CATTGGCCATATCAT
+
AAAE??<<*???***
Meaning
@Identifier1 (comment)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+
QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
@Identifier2 (comment)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+
QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
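The four-line record layout above can be read with a few lines of code. The following is a minimal sketch, not a replacement for a dedicated library: the names `FastqRecord` and `read_fastq` are hypothetical, and it assumes each record occupies exactly four lines (no wrapped sequences).

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # text after "@" up to the first space
    comment: str     # optional text after the first space
    sequence: str    # nucleotides
    quality: str     # one quality character per nucleotide


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ text handle (assumes 4 lines per record)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch: {header!r}")
        identifier, _, comment = header[1:].partition(" ")
        yield FastqRecord(identifier, comment, sequence, quality)


# The two reads shown earlier in this section.
example = """@ST-E00114:1342:HHMGVCCX2:1:1101:3123:2012 1:N:0:TCCGGAGA+TCAGAGCC
CTTGGTCATTTAGAG
+
***<<*AEF???***
@ST-E00114:1342:HHMGVCCX2:1:1101:11556:2030 1:N:0:TCCGGAGA+TCAGAGCC
CATTGGCCATATCAT
+
AAAE??<<*???***
"""
records = list(read_fastq(io.StringIO(example)))
for record in records:
    print(record.identifier, len(record.sequence))
```

For a compressed file, `gzip.open(path, "rt")` from the Python standard library returns a text handle that can be passed to the same function.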
Measure of the quality of the identification of the nucleobases generated by automated DNA sequencing
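As an illustration, a quality string can be decoded into Phred scores Q and error probabilities P using Q = -10 log10(P). This sketch assumes the Phred+33 encoding (ASCII code minus 33), which is an assumption about the example reads, not something stated in the course material; the function names are hypothetical.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert quality characters to Phred scores (assumes Phred+33 by default)."""
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Probability that the base call is wrong, from Q = -10 * log10(P)."""
    return 10 ** (-score / 10)


# Quality line of the first example read.
scores = phred_scores("***<<*AEF???***")
print(scores)
print([round(error_probability(q), 4) for q in scores])
```

Under this encoding, a score of 30 corresponds to one expected error in 1,000 base calls, and a score of 9 to roughly one in eight.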
file.fastq.gz
Try to answer (not always) simple questions:
Definition
Taxonomic assignment in the context of bioinformatics involves the computational identification and classification of organisms into their taxonomic groups using various data sources, such as DNA sequences, protein sequences, or other molecular markers. This process typically utilizes algorithms and computational tools to compare sequences against reference databases or phylogenetic trees, allowing for accurate identification and classification of organisms at different taxonomic levels.
Kraken [5] is a very popular taxonomic affiliation tool, known for its speed and accuracy. Kraken splits the query sequence into k-mers (~35 bp) and searches for them in its database, where each k-mer is mapped to the lowest common ancestor (LCA) of all genomes known to contain it. The matching taxa are then located in the taxonomy tree, and the sequence is classified at the most probable position.
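The k-mer and LCA idea can be sketched on a toy example. This is a simplified illustration, not Kraken's implementation: the taxonomy, the genomes and all function names are hypothetical, k is set to 4 instead of ~35, reverse complements are ignored, and the "most probable position" is approximated by the root-to-node path collecting the most k-mer hits.

```python
from collections import Counter
from functools import reduce
from typing import Dict, List, Optional

# Hypothetical toy taxonomy: each taxon points to its parent (root has none).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Bacteria": "root",
    "Escherichia": "Bacteria",
    "Bacillus": "Bacteria",
}


def lineage(taxon: str) -> List[str]:
    """Path from the root down to the given taxon."""
    path = []
    current: Optional[str] = taxon
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path[::-1]


def lca(first: str, second: str) -> str:
    """Lowest common ancestor of two taxa."""
    common = "root"
    for a, b in zip(lineage(first), lineage(second)):
        if a != b:
            break
        common = a
    return common


def kmers(sequence: str, k: int) -> List[str]:
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_database(genomes: Dict[str, str], k: int) -> Dict[str, str]:
    """Map each k-mer to the LCA of all genomes that contain it."""
    database: Dict[str, str] = {}
    for taxon, sequence in genomes.items():
        for kmer in set(kmers(sequence, k)):
            database[kmer] = lca(database[kmer], taxon) if kmer in database else taxon
    return database


def classify(read: str, database: Dict[str, str], k: int) -> Optional[str]:
    """Assign a read to the taxon whose root-to-node path has the most k-mer hits."""
    hits = Counter(database[kmer] for kmer in kmers(read, k) if kmer in database)
    if not hits:
        return None  # unclassified

    def path_score(taxon: str) -> int:
        return sum(hits.get(node, 0) for node in lineage(taxon))

    best = max(path_score(taxon) for taxon in hits)
    winners = [taxon for taxon in hits if path_score(taxon) == best]
    return reduce(lca, winners)  # ties fall back to the common ancestor


K = 4
db = build_database({"Escherichia": "ACGTACGGA", "Bacillus": "ACGTTTGCA"}, K)
print(classify("TACGGA", db, K))   # k-mers found only in the Escherichia genome
print(classify("ACGT", db, K))     # k-mer shared by both genomes
print(classify("GGGGGG", db, K))   # no k-mer in the database
```

A read made only of k-mers shared by both toy genomes is assigned to their common ancestor, which is why classifications can end up at higher taxonomic ranks.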
Method:
Kaiju [6] is comparable to Kraken, with a few specific features:
tool | migale | comments |
---|---|---|
Kraken2 | ✔️ | the reference, fast and efficient |
Kaiju | ✔️ | protein level |
Bracken | ✔️ | Bayesian Reestimation of Abundance with Kraken |
Centrifuge | ✔️ | indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem |
MetaPhlAn3 | ✔️ | MetaPhlAn relies on unique clade-specific marker genes identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic) |
Definition
Contamination corresponds to the presence of DNA that does not come from the sample studied.
tool | migale | comments |
---|---|---|
fastp [10] | ✔️ | all in one |
pear [11] | ✔️ | for merging reads |
sickle [12] | ✔️ | adaptive trimming |
tool | migale | comments |
---|---|---|
kneaddata [13] | ✔️ | remove rRNA reads |
sortmerna [14] | ✔️ | remove rRNA reads, slow… |
HoCoRT [15] | ✔️ | choice between several aligners, easy to use |
DeconSeq | ❌ | |
GenCoF | ❌ | |
Pros of co-assembly | Cons of co-assembly |
---|---|
More data | Higher computational overhead |
Better/longer assemblies | Risk of shattering the assembly |
Access to lower abundant organisms | Risk of increased contamination |
In these cases, co-assembly is reasonable if:
If this is not the case, individual assembly should be preferred, followed by an extra de-replication step.
After assembly, we use MetaQUAST [19] to evaluate and compare metagenome assemblies.
What MetaQUAST does:
Evaluation of the assembly based on:
For each given reference genome, based on an alignment of all contigs against it:
Binning is a good compromise when the assembly of whole genomes is not feasible.
Similar contigs are grouped together.
To evaluate bins, we will use the completeness and contamination estimated by CheckM [24].
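To give an intuition for these two metrics, here is a deliberately simplified sketch based on single-copy marker genes: completeness is the share of expected markers found at least once, and contamination is the share of extra copies. This is an assumption-laden illustration, not how CheckM is implemented (CheckM relies on lineage-specific marker sets); the function name and marker names are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def bin_quality(marker_counts: Dict[str, int],
                expected_markers: Iterable[str]) -> Tuple[float, float]:
    """Return (completeness %, contamination %) for one bin.

    marker_counts: number of copies of each marker gene found in the bin.
    expected_markers: single-copy markers expected in a complete genome.
    """
    expected = list(expected_markers)
    if not expected:
        raise ValueError("At least one expected marker is required")
    # A marker counts towards completeness if it is found at least once.
    present = sum(1 for marker in expected if marker_counts.get(marker, 0) >= 1)
    # Every copy beyond the first suggests sequence from another genome.
    extra = sum(max(0, marker_counts.get(marker, 0) - 1) for marker in expected)
    return 100 * present / len(expected), 100 * extra / len(expected)


# Hypothetical bin: marker m2 is duplicated and m4 is missing.
completeness, contamination = bin_quality(
    {"m1": 1, "m2": 2, "m3": 1},
    ["m1", "m2", "m3", "m4"],
)
print(f"completeness={completeness:.1f}% contamination={contamination:.1f}%")
```

In this toy bin, three of the four markers are present and one marker has an extra copy, so the sketch reports 75% completeness and 25% contamination.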
checkm2 predict workflow, which only requires a directory of genome bins.
-meta parameter) and fraggenescan give good enough results on metagenomic contigs.
--proteins parameter)
snakemake.
Module 24 - Shotgun metagenomics