Migale data analysis service

Total projects submitted

170

Collaborations / Accompaniements

134 / 36

In progress

Total remaining work (months)

Distinct collaborators

120

Thematic overview

Projects keywords

Geographic origin

Members

Not all members work full time on this activity

User feedback

Collaboration

In Collaboration mode, we handle the analysis of the data. If possible, we would like to be involved very early in the project design to give advice concerning the experimental design, the sequencing strategy, the number of samples/replicates, the randomization, etc.

We take care of storing sequencing data on our infrastructure and we provide a web report that tracks all performed analyses to facilitate reproducibility.

Accompaniement

The Accompaniement mode is quite different. Our role is to help you analyze and interpret the results you obtain. We assist with debugging and with choosing tools and parameters, but we do not perform the analyses ourselves. With this service, you have a contact person available to answer your questions, review your processing steps and help you interpret your results. It helps you become progressively autonomous and confident in your bioinformatics skills.

This mode is appreciated by people who have attended one of our training courses and by doctoral students.

Tools frequently used

Autocycler [1] is a tool for generating consensus long-read assemblies for bacterial genomes. It is the successor to Trycycler. Autocycler combines multiple alternative assemblies of the same genome (e.g. from different assemblers and/or different read subsets) into a high-confidence consensus assembly. It achieves this by compressing input assemblies into a compacted De Bruijn graph, clustering similar sequences, trimming overlaps and resolving ambiguities.

Bakta [2] is a tool for the rapid & standardized annotation of bacterial genomes and plasmids from both isolates and MAGs. It provides dbxref-rich, sORF-including and taxon-independent annotations in machine-readable JSON & bioinformatics standard file formats for automated downstream analysis.

BamToCov [3] is a toolkit for rapid and flexible coverage computation that relies on the most memory efficient algorithm and is designed for integration in pipelines, given its ability to read alignment files from streams. The tools in the suite can process sorted BAM or CRAM files, allowing the user to extract coverage information via different filtering approaches and to save the output in different formats (BED, Wig or counts). The BamToCov algorithm can also handle strand-specific and/or physical coverage analyses.
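To make the notion of coverage output concrete, here is a minimal Python sketch that turns alignment intervals into BED-like runs of constant depth. It is a generic illustration and not the BamToCov algorithm: the function name `coverage_intervals`, the 0-based half-open coordinates and the in-memory difference array are all assumptions made for this example.

```python
def coverage_intervals(alignments, ref_length):
    """Return (start, end, depth) runs covering [0, ref_length).

    alignments: iterable of (start, end) pairs, 0-based, half-open,
    with start >= 0. Coordinates past the reference end are clipped.
    """
    # Difference array: +1 where an alignment starts, -1 where it ends.
    delta = [0] * (ref_length + 1)
    for start, end in alignments:
        delta[start] += 1
        delta[min(end, ref_length)] -= 1

    intervals = []
    depth = delta[0]
    run_start = 0
    for pos in range(1, ref_length):
        new_depth = depth + delta[pos]
        if new_depth != depth:
            # Depth changes here: close the current run and open a new one.
            intervals.append((run_start, pos, depth))
            run_start = pos
            depth = new_depth
    intervals.append((run_start, ref_length, depth))
    return intervals


if __name__ == "__main__":
    for start, end, depth in coverage_intervals([(0, 5), (3, 8)], 10):
        print(start, end, depth, sep="\t")
```

Each output row follows the BED convention of a start, an end and a value, here the depth. A streaming tool would avoid holding an array as long as the reference in memory.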

BLASTN [4] stands for Basic Local Alignment Search Tool for Nucleotides. It is a bioinformatics tool used to compare a nucleotide query sequence (such as DNA or RNA) against a nucleotide sequence database to identify regions of similarity.

Bowtie2 [5] is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.

BUSCO [6] provides a quantitative assessment of the completeness, in terms of expected gene content, of a genome assembly, transcriptome, or annotated gene set.

BWA [7] is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome.

bwa-mem2 [8] is the next version of the bwa-mem algorithm in bwa [7]. It produces alignments identical to those of bwa and is ~1.3-3.1x faster, depending on the use case, the dataset and the machine it runs on.

CheckM [9] provides a set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes. It provides robust estimates of genome completeness and contamination by using collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage.

Chopper [10], intended for long read sequencing such as PacBio or ONT, filters and trims a fastq file.

run_dbcan [11] is the standalone version of the dbCAN3 annotation tool for automated CAZyme annotation. It incorporates HMMER, Diamond, and dbCAN_sub for annotating CAZyme families, and integrates CAZyme Gene Clusters (CGCs) and substrate predictions.

Delly [12] is an integrated structural variant (SV) prediction method that can discover, genotype and visualize deletions, tandem duplications, inversions and translocations at single-nucleotide resolution in short-read and long-read massively parallel sequencing data. It uses paired-ends, split-reads and read-depth to sensitively and accurately delineate genomic rearrangements throughout the genome.

EggNOG-mapper [13] is a tool for fast functional annotation of novel sequences. It uses precomputed orthologous groups and phylogenies from the eggNOG database (http://eggnog5.embl.de) to transfer functional information from fine-grained orthologs only.

fastp [14] is a tool designed to provide fast all-in-one preprocessing for FASTQ files. By default, this tool corrects bases in overlapping regions (at least 30 bases of overlap between R1 and R2), removes reads with more than 5 ambiguous bases (N), and trims low-quality tails: it moves a sliding window from the tail (3’) to the front, drops the bases in the window if their mean quality is below 30, and stops otherwise. Reads shorter than 75 nucleotides are also removed.
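The tail-trimming and length rules described above can be illustrated with a minimal Python sketch. This is not fastp's implementation: the function names, the window size of 4, the one-base step and the Phred+33 quality encoding are assumptions made for the example. Only the quality threshold (30) and the minimum length (75) come from the text.

```python
def trim_tail(seq, quality, window=4, min_mean_q=30.0, offset=33):
    """Trim the 3' end of a read with a sliding window.

    The window starts at the tail and moves one base toward the front
    as long as its mean Phred quality is below min_mean_q. Trimming
    stops at the first window that passes.
    """
    scores = [ord(char) - offset for char in quality]
    end = len(seq)
    while end >= window:
        mean_q = sum(scores[end - window:end]) / window
        if mean_q >= min_mean_q:
            return seq[:end], quality[:end]
        end -= 1  # drop the last base and test the next window
    # No window passed: nothing is kept.
    return "", ""


def passes_length_filter(seq, min_length=75):
    """Keep only reads that are long enough after trimming."""
    return len(seq) >= min_length


if __name__ == "__main__":
    read = "A" * 80 + "C" * 10
    qual = "I" * 80 + "#" * 10  # 'I' is Q40, '#' is Q2 in Phred+33
    trimmed, _ = trim_tail(read, qual)
    print(len(trimmed), passes_length_filter(trimmed))
```

Because the decision is made on a window mean, a single low-quality base can survive at the new 3’ end when its neighbours are good enough to lift the mean above the threshold.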

fastplong [15] is an ultrafast preprocessing and quality control tool for long reads (Nanopore, PacBio, Cyclone, etc.).

FastQC [16] is a program designed to spot potential problems in high-throughput sequencing datasets. It runs a set of analyses on one or more raw sequence files in fastq or bam format and produces a report which summarises the results. MultiQC [17] aggregates results from bioinformatics analyses across many samples into a single report.

freebayes [18] is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment.

FROGS [19], [20] is a workflow dedicated to the analysis of amplicon sequencing data. Available both on the command line and in Galaxy, it can process any type of amplicon (16S, ITS…) sequenced on Illumina, IonTorrent or PacBio platforms.

The cluster_filters.py tool from the FROGS suite [19], [20] filters clusters on abundance, prevalence and/or contamination.
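As an illustration of abundance and prevalence filtering, the following Python sketch keeps only clusters that pass both criteria. It is not the FROGS code: the function `filter_clusters`, its parameters and the nested-dictionary input format are hypothetical, and contamination filtering is left out.

```python
def filter_clusters(counts, *, min_abundance, min_samples):
    """Keep clusters that are abundant and prevalent enough.

    counts: dict mapping cluster name -> {sample name: read count}.
    min_abundance: minimum fraction of all reads held by the cluster.
    min_samples: minimum number of samples where the cluster is seen.
    """
    total = sum(sum(per_sample.values()) for per_sample in counts.values())
    kept = {}
    for cluster, per_sample in counts.items():
        abundance = sum(per_sample.values()) / total if total else 0.0
        prevalence = sum(1 for n in per_sample.values() if n > 0)
        if abundance >= min_abundance and prevalence >= min_samples:
            kept[cluster] = per_sample
    return kept


if __name__ == "__main__":
    table = {
        "Cluster_1": {"s1": 900, "s2": 90},
        "Cluster_2": {"s1": 9, "s2": 0},
        "Cluster_3": {"s1": 1, "s2": 0},
    }
    print(sorted(filter_clusters(table, min_abundance=0.005, min_samples=1)))
```

The thresholds in the usage example are arbitrary and would need to be chosen per project.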

The denoising.py tool from the FROGS suite [19], [20] begins by searching for and removing primers with Cutadapt [21]. Reads containing Ns are then discarded and the remaining reads are passed to DADA2 [22] for denoising. Finally, paired-end reads are merged with PEAR [23], filtered on minimum and maximum length, dereplicated and stored in a BIOM file.

The remove_chimera.py tool from the FROGS suite [19], [20] uses VSEARCH [24] to detect chimeras sample by sample. A cross-validation is then performed so that only chimeras identified in all samples where they are present are removed.
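The cross-validation rule can be sketched in a few lines of Python. This is an illustration of the rule as stated above, not the FROGS implementation: the function name and the two input dictionaries are hypothetical, and the per-sample chimera calls are assumed to have been produced beforehand.

```python
def cross_validate_chimeras(presence, chimera_calls):
    """Return the clusters to remove after cross-validation.

    presence: dict mapping cluster -> set of samples where it is present.
    chimera_calls: dict mapping sample -> set of clusters flagged as
    chimeric in that sample.
    A cluster is removed only if it is flagged in every sample where
    it is present.
    """
    removed = set()
    for cluster, samples in presence.items():
        if samples and all(
            cluster in chimera_calls.get(sample, set()) for sample in samples
        ):
            removed.add(cluster)
    return removed


if __name__ == "__main__":
    presence = {"c1": {"s1", "s2"}, "c2": {"s1", "s2"}, "c3": {"s2"}}
    calls = {"s1": {"c1", "c2"}, "s2": {"c1", "c3"}}
    print(sorted(cross_validate_chimeras(presence, calls)))
```

In the usage example, c2 is flagged in s1 but not in s2, so it is kept: a sequence that looks genuine in at least one sample is not treated as a chimera.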

The taxonomic_affiliation.py tool from the FROGS suite [19], [20] assigns taxonomy to ASVs by performing BLASTn searches [4] against a dedicated reference databank.

The tree.py tool from the FROGS suite [19], [20] creates a multiple alignment of ASVs with MAFFT [25] and a rooted phylogenetic tree with FastTree [26] and the phangorn R package [27].

HoCoRT [28] stands for Host Contamination Removal Tool. Its purpose is to simplify and improve the process of host contamination removal from sequencing reads. It does not do any quality checking or low complexity region masking, just host contamination removal. HoCoRT wraps already existing aligners and classifiers such as Bowtie2 and Kraken2 to remove host contamination.

The sequencing data is mapped to the reference genome.
The sequences which map well are removed and the remaining sequences are written to output files.

Some of the pipelines combine multiple mappers/classifiers in an attempt to improve the results.

Kaiju [29] is a program for fast and sensitive taxonomic classification of high-throughput sequencing reads from metagenomic whole genome sequencing or metatranscriptomics experiments.

Each sequencing read is assigned to a taxon in the NCBI taxonomy by comparing it to a reference protein database containing microbial and viral protein sequences. By using protein-level classification, Kaiju achieves a higher sensitivity compared with methods based on nucleotide comparison.

KneadData [30] is a tool designed to perform quality control on metagenomic sequencing data, especially data from microbiome experiments. This tool aims to perform principled in silico separation of bacterial reads from the “contaminant” reads, be they from the host, from bacterial 16S sequences, or other user-defined sources.

Kraken2 [31] is the newest version of Kraken, a taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds. This classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing the given k-mer. The k-mer assignments inform the classification algorithm.
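The k-mer to LCA mapping can be illustrated with a toy Python sketch. It is a simplification and not Kraken2's implementation: the taxonomy, the genome sequences, the function names and the use of a plain dictionary are assumptions made for the example. The final step that turns k-mer hits into a read classification is not reproduced.

```python
from collections import Counter

# Hypothetical toy taxonomy: child -> parent (the root has no parent).
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Escherichia": "Bacteria",
    "E. coli": "Escherichia",
    "E. albertii": "Escherichia",
    "Bacillus": "Bacteria",
    "B. subtilis": "Bacillus",
}


def lineage(taxon):
    """Path from a taxon up to the root, taxon first."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(taxa):
    """Lowest common ancestor of one or more taxa."""
    taxa = list(taxa)
    common = set(lineage(taxa[0]))
    for taxon in taxa[1:]:
        common &= set(lineage(taxon))
    # The first shared node on the way up is the deepest one.
    return next(node for node in lineage(taxa[0]) if node in common)


def build_index(genomes, k):
    """Map each k-mer to the LCA of all genomes that contain it."""
    kmer_taxa = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer_taxa.setdefault(seq[i:i + k], set()).add(taxon)
    return {kmer: lca(taxa) for kmer, taxa in kmer_taxa.items()}


def kmer_hits(read, index, k):
    """Count, per taxon, the k-mers of a read found in the index."""
    kmers = (read[i:i + k] for i in range(len(read) - k + 1))
    return Counter(index[kmer] for kmer in kmers if kmer in index)


if __name__ == "__main__":
    genomes = {
        "E. coli": "ACGTACGGA",
        "E. albertii": "ACGTACCTT",
        "B. subtilis": "TTGACCAGT",
    }
    index = build_index(genomes, k=5)
    print(kmer_hits("ACGTACGG", index, k=5))
```

k-mers shared by both Escherichia genomes are assigned to the genus, while k-mers unique to one genome keep species-level resolution.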

MEGAHIT [32] is an ultra-fast and memory-efficient NGS assembler. It is optimized for metagenomes, but also works well on generic single genome assembly (small or mammalian size) and single-cell assembly.

Minimap2 [33] is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%.

NanoCaller [34] is a computational method that uses a deep convolutional neural network to detect SNPs and indels from long-read sequencing data. NanoCaller uses long-range haplotype structure to generate predictions for each SNP candidate variant site by considering pileup information of other candidate sites sharing reads. Subsequently, it performs read phasing, carries out local realignment of each set of phased reads and of the set of all reads at each indel candidate variant site to call indels, and then creates consensus sequences to predict the indel sequence.

NanoPlot [10] is a plotting tool for long read sequencing data and alignments.

NanoStat [10] calculates various statistics from a long read sequencing dataset in fastq, bam or albacore sequencing summary format.

PEAR [23] is an ultrafast, memory-efficient and highly accurate paired-end read merger. It is fully parallelized and can run with as little as a few kilobytes of memory. PEAR evaluates all possible paired-end read overlaps without requiring the target fragment size as input. In addition, it implements a statistical test for minimizing false-positive results. Together with a highly optimized implementation, this allows it to merge millions of paired-end reads within a couple of minutes on a standard desktop computer.

The phyloseq package [35] is a tool to import, store, analyze, and graphically display complex metabarcoding data, especially when there is associated sample data, a phylogenetic tree, and/or a taxonomic assignment of the OTUs or ASVs. Various custom functions written to enhance the base functions of phyloseq are available in the phyloseq-extended package [36].

Prokka [37] is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.

Qualimap [38] is a platform-independent application written in Java and R that provides both a Graphical User Interface (GUI) and a command-line interface to facilitate the quality control of alignment sequencing data. In short, Qualimap:

Examines sequencing alignment data according to the features of the mapped reads and their genomic properties.
Provides an overall view of the data that helps to detect biases in the sequencing and/or mapping of the data and eases decision-making for further analysis.

Quast [39] stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics.
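As an example of the kind of metric involved, the short Python sketch below computes a few common contiguity statistics from contig lengths, including N50. It is a generic illustration, not QUAST code, and the function names are hypothetical.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half
    of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def assembly_metrics(lengths):
    """Summarise an assembly from its list of contig lengths."""
    return {
        "contigs": len(lengths),
        "total_length": sum(lengths),
        "largest_contig": max(lengths, default=0),
        "N50": n50(lengths),
    }


if __name__ == "__main__":
    print(assembly_metrics([100, 80, 50, 30, 20, 10]))
```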

Resistance Gene Identifier (RGI) [40] is used to predict antibiotic resistome(s) from protein or nucleotide data based on homology and SNP models. The application uses reference data from the Comprehensive Antibiotic Resistance Database (CARD).

seqkit [41] is used to get information from FASTQ files.

Sequali [42] is a quality control tool designed for both short- and long-read sequencing data. It offers features like adapter searching, overrepresented sequence analysis, and duplication analysis, supporting inputs in FASTQ and uBAM formats. MultiQC [17] aggregates results from bioinformatics analyses across many samples into a single report.

Simka [43] is a de novo comparative metagenomics tool. Simka represents each dataset as a k-mer spectrum and computes several classical ecological distances between them.
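The idea can be sketched in Python on toy data. This is not Simka's implementation, which is designed to count k-mers across many large datasets: the function names are hypothetical, and the Bray-Curtis dissimilarity is taken here as one example of a classical ecological distance.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    spectrum = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            spectrum[read[i:i + k]] += 1
    return spectrum


def bray_curtis(spectrum_a, spectrum_b):
    """Bray-Curtis dissimilarity between two k-mer spectra.

    0 means identical k-mer abundances, 1 means no shared k-mer.
    """
    shared = sum(
        min(spectrum_a[kmer], spectrum_b[kmer])
        for kmer in spectrum_a.keys() & spectrum_b.keys()
    )
    total = sum(spectrum_a.values()) + sum(spectrum_b.values())
    return 1.0 - 2.0 * shared / total if total else 0.0


if __name__ == "__main__":
    sample_a = kmer_spectrum(["ACGT"], k=2)
    sample_b = kmer_spectrum(["ACGA"], k=2)
    print(bray_curtis(sample_a, sample_b))
```

Here k-mers play the role that species play in ecology, so the distance can be computed without any reference database.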

Sniffles2 [44] is a fast structural variant caller for long-read sequencing. Sniffles2 accurately detects SVs at the germline, somatic and population levels in PacBio and Oxford Nanopore read data.

SPAdes [45] is a versatile toolkit designed for assembly and analysis of sequencing data. SPAdes is primarily developed for Illumina sequencing data, but can be used for IonTorrent as well. Most SPAdes pipelines support hybrid mode, i.e. they allow using long reads (PacBio and Oxford Nanopore) as supplementary data.

The SPAdes package contains assembly pipelines for isolate and single-cell bacterial data, as well as for metagenomic and transcriptomic data. Additional modes allow the discovery of bacterial plasmids and RNA viruses, as well as HMM-guided assembly. The SPAdes package also includes supplementary tools for efficient k-mer counting and k-mer-based read filtering, assembly graph construction and simplification, sequence-to-graph alignment and metagenomic binning refinement.

StringTie [46] employs efficient algorithms for transcript structure recovery and abundance estimation from bulk RNA-Seq reads aligned to a reference genome. It takes as input spliced alignments in coordinate-sorted SAM/BAM/CRAM format and produces a GTF output which consists of assembled transcript structures and their estimated expression levels (FPKM/TPM and base coverage values).

ToulligQC is dedicated to the QC analyses of Oxford Nanopore runs. This software is written in Python and developed by the GenomiqueENS core facility of the Institute of Biology of the Ecole Normale Superieure (IBENS).

Trycycler [47] is a tool for generating consensus long-read assemblies for bacterial genomes. That is, if you have multiple long-read assemblies for the same isolate, Trycycler can combine them into a single assembly that is better than any of your inputs.

Unicycler [48] is an assembly pipeline for bacterial genomes. It can assemble Illumina-only read sets, where it functions as a SPAdes optimiser.

VCFtools [49] is a set of tools written in Perl and C++ for working with VCF files.

Promoting the FAIR principles

One of our missions is to promote the FAIR principles in research. To this end, we can handle the transfer of data to public repositories, and we provide you with a companion web report that lists all of the analyses done (tool, version, interpretation).

This website is provided to you at the start of the project and is updated as the analyses progress.

References

1. Wick RR, Howden BP, Stinear TP. Autocycler: Long-read consensus assembly for bacterial genomes. bioRxiv. 2025. doi:10.1101/2025.05.12.653612.

2. Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A. Bakta: Rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics. 2021;7. doi:10.1099/mgen.0.000685.

3. Birolo G, Telatin A. BamToCov: An efficient toolkit for sequence coverage calculations. Bioinformatics. 2022;38:2617–8. doi:10.1093/bioinformatics/btac125.

4. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of molecular biology. 1990;215:403–10.

5. Langmead B, Salzberg SL. Fast gapped-read alignment with bowtie 2. Nature methods. 2012;9:357.

6. Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Molecular Biology and Evolution. 2021;38:4647–54. doi:10.1093/molbev/msab199.

7. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997. 2013.

8. Vasimuddin M, Misra S, Li H, Aluru S. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In: 2019 IEEE international parallel and distributed processing symposium (IPDPS). IEEE; 2019. p. 314–24.

9. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research. 2015;25:1043–55. doi:10.1101/gr.186072.114.

10. De Coster W, Rademakers R. NanoPack2: population-scale evaluation of long-read sequencing data. Bioinformatics. 2023;39:btad311. doi:10.1093/bioinformatics/btad311.

11. Zheng J, Ge Q, Yan Y, Zhang X, Huang L, Yin Y. dbCAN3: automated carbohydrate-active enzyme and substrate annotation. Nucleic Acids Research. 2023;51:W115–21. doi:10.1093/nar/gkad328.

12. Rausch T, Zichner T, Schlattl A, Stütz AM, Benes V, Korbel JO. DELLY: Structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012;28:i333–9. doi:10.1093/bioinformatics/bts378.

13. Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. eggNOG-mapper v2: Functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Molecular Biology and Evolution. 2021;38:5825–9. doi:10.1093/molbev/msab293.

14. Zhou Y, Chen Y, Chen S, Gu J. Fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884–90. doi:10.1093/bioinformatics/bty560.

15. Chen S. Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta. 2023;2:e107. doi:10.1002/imt2.107.

16. Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

17. Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32:3047–8.

18. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907. 2012.

19. Escudié F, Auer L, Bernard M, Mariadassou M, Cauquil L, Vidal K, et al. FROGS: Find, Rapidly, OTUs with Galaxy Solution. Bioinformatics. 2018;34:1287–94. doi:10.1093/bioinformatics/btx791.

20. Bernard M, Rué O, Mariadassou M, Pascal G. FROGS: a powerful tool to analyse the diversity of fungi with special management of internal transcribed spacers. Briefings in Bioinformatics. 2021;22. doi:10.1093/bib/bbab318.

21. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet journal. 2011;17:10–2.

22. Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP. DADA2: High-resolution sample inference from illumina amplicon data. Nature methods. 2016;13:581.

23. Zhang J, Kobert K, Flouri T, Stamatakis A. PEAR: A fast and accurate illumina paired-end reAd mergeR. Bioinformatics. 2013;30:614–20.

24. Rognes T, Flouri T, Nichols B, Quince C, Mahé F. VSEARCH: A versatile open source tool for metagenomics. PeerJ. 2016;4:e2584.

25. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Molecular biology and evolution. 2013;30:772–80.

26. Price MN, Dehal PS, Arkin AP. FastTree 2–approximately maximum-likelihood trees for large alignments. PloS one. 2010;5:e9490.

27. Schliep KP. Phangorn: Phylogenetic analysis in r. Bioinformatics. 2011;27:592–3.

28. Rumbavicius I, Rounge TB, Rognes T. HoCoRT: Host contamination removal tool. BMC bioinformatics. 2023;24:371.

29. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with kaiju. Nature communications. 2016;7:11257.

30. Lab H. KneadData is a tool designed to perform quality control on metagenomic and metatranscriptomic sequencing data, especially data from microbiome experiments. 2022. https://github.com/biobakery/kneaddata.

31. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with kraken 2. Genome biology. 2019;20:1–13.

32. Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de bruijn graph. Bioinformatics. 2015;31:1674–6.

33. Li H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–100.

34. Ahsan MU, Liu Q, Fang L, Wang K. NanoCaller for accurate detection of SNPs and indels in difficult-to-map regions from long-read sequencing by haplotype-aware deep neural networks. Genome biology. 2021;22:261.

35. McMurdie PJ, Holmes S. Phyloseq: An r package for reproducible interactive analysis and graphics of microbiome census data. PloS one. 2013;8:e61217.

36. Mariadassou M. Phyloseq-extended: Various customs functions written to enhance the base functions of phyloseq. 2018. https://github.com/mahendra-mariadassou/phyloseq-extended.

37. Seemann T. Prokka: Rapid prokaryotic genome annotation. Bioinformatics. 2014;30:2068–9. doi:10.1093/bioinformatics/btu153.

38. Okonechnikov K, Conesa A, García-Alcalde F. Qualimap 2: Advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics. 2015;32:292–4.

39. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29:1072–5. doi:10.1093/bioinformatics/btt086.

40. Alcock BP, Huynh W, Chalil R, Smith KW, Raphenya AR, Wlodarski MA, et al. CARD 2023: Expanded curation, support for machine learning, and resistome prediction at the comprehensive antibiotic resistance database. Nucleic acids research. 2023;51:D690–9.

41. Shen W, Le S, Li Y, Hu F. SeqKit: A cross-platform and ultrafast toolkit for FASTA/q file manipulation. PloS one. 2016;11:e0163962.

42. Vorderman RHP. Sequali: Efficient and comprehensive quality control of short- and long-read sequencing data. Bioinformatics Advances. 2025;5:vbaf010. doi:10.1093/bioadv/vbaf010.

43. Benoit G, Peterlongo P, Mariadassou M, Drezen E, Schbath S, Lavenier D, et al. Multiple comparative metagenomics using multiset k-mer counting. PeerJ Computer Science. 2016;2:e94.

44. Smolka M, Paulin LF, Grochowski CM, Horner DW, Mahmoud M, Behera S, et al. Detection of mosaic and population-level structural variants with Sniffles2. Nature biotechnology. 2024;42:1571–80.

45. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology. 2012;19:455–77. doi:10.1089/cmb.2012.0021.

46. Pertea M, Pertea GM, Antonescu CM, Chang T-C, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature biotechnology. 2015;33:290–5.

47. Wick RR, Judd LM, Cerdeira LT, Hawkey J, Méric G, Vezina B, et al. Trycycler: Consensus long-read assemblies for bacterial genomes. Genome biology. 2021;22:266.

48. Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLOS Computational Biology. 2017;13:1–22. doi:10.1371/journal.pcbi.1005595.

49. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–8. doi:10.1093/bioinformatics/btr330.

Platform	Migale
Unit	MaIAGE