Analyse de données NGS sous Galaxy

Module 8 bis

Valentin Loux

MaIAGE

Cédric Midoux

PROSE & MaIAGE

April 23, 2024

Introduction

Practical informations

9h00 - 17h00
2 breaks morning and afternoon
Lunch at INRAE restaurant (not mandatory)
Questions allowed
Everyone has something to learn from each other

Better know you

Who are you?

Institution / Laboratory / position

What is your scientific topic?

Studied ecosystem
Scientific question
Experimental design

What is your background?

What are your needs in NGS data analysis?
Do you have already dealt with NGS data?
Have you ever used Galaxy ?

Better know us

Open infrastructure dedicated to life sciences
- Computing resources, tools, databanks…
Dissemination of expertise in bioinformatics
Design and development of applications
Data analysis

Data analysis service

We are specialized in genomics/metagenomics
Bioinformaticians and Statisticians
More than 110 projects since 2016
2 types of services
- Collaboration or Accompaniement

Training objectives

After this training day, you will know:

the characteristics of the main types of sequencers
how to do a quality control of the raw sequences
how to assemble a (small) genome
how to align reads to a reference genome
how to explore graphically an alignment
how to compare assemblies

Program

Morning

Introduction & Round table
Sequencing technologies

Break

Quality Control
Data cleaning
Assembly

Afternoon

Assembly evaluation and comparison

Break

Mapping
Visualisation

The truth about bioinformatics

Next generation Sequencing in a few slides

Sequencing Cost per Megabase

Genome Sequencing, why ?

Interest in a genome that has not yet been sequenced

Assembly and annotation
- de novo sequencing
- chromosomal rearrangements
- metagenomics

Genome Sequencing, why ?

A reference genome is available

Alignment (mapping) of reads on the genome
- Detection of genomic variants (SNPs)
- RNA-seq (gene expression)
- ChIP-seq (regulation of gene expression)
- Chromosomal rearrangements, variation in gene copy number
- Detection of small non-coding RNAs
- metagenomics

Sequencing challenges

Smallest known (non viral) genome:
- Carsonella ruddii = 0.16 Mbp
Largest known genome:
- Paris japonica = 150 Gbp
- Amoeba dubia = 670 Gbp

Sequencing challenges

Maximum Reads Size :
- 1st generation (Sanger): up to 900 bp
- 2nd generation: up to 500 bp
- 3rd generation: up to 100 - 1000 Kbp

Need to cut the genome into millions of fragments (shotgun sequencing) from the 2 DNA strands.

The operation to reconstruct the genetic elements from the raw reads is called assembly.

Sequencing technologies

First generation :
- Sanger sequencing
  - First step : fragment cloning
  - Reads up to 900 bp
  - Expensive
  - low throughput

Next generation Sequencing technologies

Second generation (since 2007)

454 - Sequencing by Synthesis - PCR Amplification
SOLiD : Sequencing by Ligation - PCR Amplification
Ion Torrent : Sequencing by Synthesis - PCR Amplification
Illumina : Sequencing by Synthesis - PCR Amplification

Illumina : principles

Based on “reversible terminated chemistry” : reversible terminators that enable the identification of single nucleotides as they are washed over DNA strands.

Three steps :

Amplification of DNA fragments
Sequencing
Analysis

Prepare genomic DNA samples

Randomly fragment genomic DNA and ligate adapters to both ends of the fragments

Attach DNA to Flow Cell Surface

Bind single-stranded fragments randomly to the inside surface of the flow cell channels.

Bridge Amplification

Add unlabelled nucleotides and enzyme to initiate solid-phase bridge amplification.

Fragments Become Double Stranded

The enzyme incorporates nucleotides to build double-stranded bridges on the solid-phase substrate.

Denature the Double-Stranded Molecule

Denaturation leaves single-stranded templates anchored to the substrate.

Complete Amplification

Several millions dense clusters of double-stranded DNA are grated in in channel of the flow cell.

Determine First Base

The first sequencing cycle begins by adding four labelled reversible terminators, primers, and DNA polymerase.

Image First Base

After laser excitation, the emitted fluorescence from each cluster is captured and the first base is identified.

The blocked 3’ terminus and florphore are removed,flow cell washed, leaving the terminator free for a second cycle.

Determine Second Base

The next cycle repeats the incorporation of four labelled reversible terminators, primers, and DNA polymerase.

Image Second Chemistry Cycle

After laser excitation, the image is captured as before, and the identity of the second base is recorded.

Sequencing Over Multiple Chemistry Cycles

The sequencing cycles are repeated to determine the sequence of bases in a fragment, one base at a time.

Millions of clusters are processed in parallel, allowing high-throughput sequencing.

Illumina : summary

High precision >99.5% (main type or errors : substitutions)
Short reads (maximum 2 x 300)
Huge throughput (up to 6 Tbp per run on NovaSeq)
Some under-representation of rich AT- and GC- regions.

Sequencing - Vocabulary

Read: piece of sequenced DNA
DNA fragment: 1 or more reads depending on whether the sequencing is single- or paired-end
Insert: Fragment size
Depth: \(\frac{N * L}{G}\)
\(N =\) number of reads
\(L =\) reads size
\(G =\) genome size
Coverage: % of genome covered

3d generation

Target the weaknesses of the 2nd generation :

PCR amplification
Short reads

Two main competitors (in production ) :

Pacific Bioscience (PacBio)
Oxford Nanopore Technologies (ONT)

PacBio

A polymerase is immobilized at the bottom of a sequencing unit called zero-mode waveguide (ZMW) .Four fluorescent-labelled nucleotides, which generate distinct emission spectrums, are added to the SMRT cell. As a base is held by the polymerase, a light pulse is produced that identifies the base. The replication processes in all ZMWs of a SMRT cell are recorded by a “movie” of light pulses, and the pulses corresponding to each ZMW can be interpreted to be a sequence of bases.

Reference

PacBio : summary

Long reads (up to Kbs with SequelII)
Depends on DNA quality
High error rate. Tend to lower with depth
Medium throughput

Applications :

IsoSeq (RNA Isoform full length sequencing)
Detection of DNA modification
Assembly

Oxford Nanopore

Oxford Nanopore

MinION, GridION, PromethION

Sequencing on The ISS

ONT Summary

Ultra long reads ( up to 1 Mb (!) )
Length of the reads depends on DNA quality
Low to high throughput
“On field” sequencing
Direct RNA sequencing, peptide sequencing
High error rate (5-10%), tends to lower with new chemistry, base calling algorithms and depth

ONT Applications

Full length isofrom sequencing, direct RNA sequencing
Detection of DNA modification
Assembly

An other view on sequencing technologies (probably out of date)

Global Summary (probably out of date)

An interesting review [1]

Nature review : Milestones in Genomic Sequencing

Connect to Galaxy

Practical session :

Escherichia coli genome (re)sequencing
- Illumina MiSeq
- Paired-end sequencing (2*150bp , insert size ~300bp)
- Sub-sampled

Connect to Galaxy

https://galaxy.migale.inrae.fr
Login : stageXX
Data in Shared Data / Data Libraries / formation NGS / Reads
References in Shared Data / Data Libraries / formation NGS / Refs

FASTQ format

FASTQ syntax

The FASTQ format is the de facto standard by which all sequencing instruments represent data. It may be thought of as a variant of the FASTA format that allows it to associate a quality measure to each sequence base: FASTA with QUALITIES.

FASTQ format consists of 4 sections:

A FASTA-like header, but instead of the > symbol it uses the @ symbol. This is followed by an ID and more optional text, similar to the FASTA headers.
The second section contains the measured sequence (typically on a single line), but it may be wrapped until the + sign starts the next section.
The third section is marked by the + sign and may be optionally followed by the same sequence id and header as the first section
The last line encodes the quality values for the sequence in section 2, and must be of the same length as section 2.

FASTQ syntax

@ST-E00114:1342:HHMGVCCX2:1:1101:3123:2012 1:N:0:TCCGGAGA+TCAGAGCC
CTTGGTCATTTAGAG
+
***<<*AEF???***
@ST-E00114:1342:HHMGVCCX2:1:1101:11556:2030 1:N:0:TCCGGAGA+TCAGAGCC
CATTGGCCATATCAT
+
AAAE??<<*???***

FASTQ quality

Each character represents a numerical value: a so-called Phred score, encoded via a single letter encoding.

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
|    |    |    |    |    |    |    |    |
0....5...10...15...20...25...30...35...40
|    |    |    |    |    |    |    |    |
worst................................best

The numbers represent the error probabilities via the formula: \(Error=10^{-P/10}\)

It is basically summarized as:

P=0 means 1/1 (100% probability of error)
P=10 means 1/10 (10% probability of error)
P=20 means 1/100 (1% probability of error)
P=30 means 1/1000 (0.1% probability of error)
P=40 means 1/10000 (0.01% probability of error)

FASTQ quality encoding specificities

There was a time when instrumentation makers could not decide at what character to start the scale. The current standard shown above is the so-called Sanger (+33) format where the ASCII codes are shifted by 33. There is the so-called +64 format that starts close to where the other scale ends.

FASTQ Header informations

Information is often encoded in the “free” text section of a FASTQ file.

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

EAS139: the unique instrument name
136: the run id
FC706VJ: the flowcell id
2: flowcell lane
2104: tile number within the flowcell lane
15343: ‘x’-coordinate of the cluster within the tile
197393: ‘y’-coordinate of the cluster within the tile
1: the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y: Y if the read is filtered, N otherwise
18: 0 when none of the control bits are on, otherwise it is an even number
ATCACG: index sequence

This information is specific to a particular instrument/vendor and may change with different versions or releases of that instrument.

Fastq import & visualisation

Quality control

Why QC’ing your reads ?

What are the information you want to know about the sequencing when you perform Quality Control ?

Collective Answer on this collaborative whiteboard

Why QC’ing your reads ?

Try to answer to (not always) simple questions:

Quality control

Are data conform to the expected level of performance?
- Size
- Number of reads
- Quality
Residual presence of adapters or indexes ?
Are there (un)expected technical biases
Are there (un)expected biological biases

Warning

Quality control without context leads to misinterpretation

Quality control for FASTQ files : FastQC [2]

QC for (Illumina) FastQ files
Command line fastqc or graphical interface
Complete HTML report to spot problem originating from sequencer, library preparation, contamination
Summary graphs and tables to quickly assess your data

Quality Control with FastQC

Reads cleaning

Objectives

Detect and remove sequencing adapters (still) present in the FastQ files
Filter / trim reads according to quality (as plotted in FastQC)

Tools

Simple & fast : Sickle [3] (quality)
dedicated to adapters cutadapt [4] (adpater removal)
Ultra-configurable : Trimmomatic
All in one & ultra-fast : fastp [5]

Clean your data with Fastp

Assembly

Assembly : principles

Similar to a puzzle :

millions of pieces
without the original image
with pieces in both sense
the pieces do not necessarily fit together (sequencing errors)
parts of the puzzle are missing (cover + sequencing bias)

Assembly

All assembly algorithms are based on read overlap.

Different ways of calculating overlap :
“All vs All” comparison :
- “old” assemblers based on this approach
- Graph representing overlap between reads
- Quadratic number of comparison (number of reads^2 )
- do not scale with billion of reads
de Bruijn Graph
- Named after Nicolaas Govert de Bruijn
- Directed graph representing overlaps between sequences of symbols
- Sequences can be reconstructed by moving between nodes in graph

Slide Credits

De Bruijn Graph

A directed graph of sequences of symbols
Nodes in the graph are k-mers
Edges represent consecutive k-mers (k-1 symbols overlap)

Consider the 2 symbol alphabet (0 & 1) de Bruijn Graph for k=3

Producing sequences

Sequences of symbols are produced by moving through the graph

K-mers ?

To be able to use de Bruijn graphs, we need reads of length L to overlap by L-1 bases.
Not all reads will overlap another read perfectly.
- Read errors
- Coverage “holes”
Not all reads are the same length (depending on technology and quality clean-up)

To help us get around these problems, we use all k-length subsequences of the reads, these are the k-mers.

What are K-mers ?

K-mers de Bruijn graph

K-mers de Bruijn graph

K-mers de Bruijn graph

The problem of repeats

The problem of repeats

The problem of repeats

Different k

Different k

2 contigs : MISSISSIS & SSIPPI

Choose k wisely

Lower k
- More connections
- Less chance of resolving small repeats
- Higher k-mer coverage
Higher k
- Less connections
- More chance of resolving small repeats
- Lower k-mer coverage

Optimum value for k will balance these effects.

Sequencing errors

Sequencing errors

Sequencing errors

More coverage

Errors won’t be duplicated in every read
Most reads will be error free
We can count the frequency of each k-mer
Annotate the graph with the frequencies
Use the frequency data to clean the de Bruijn graph

More coverage depth will help overcome errors!

Sequencing errors - coverage

Which path looks most valid ? Why ?

An important parameter : coverage cutoff

At what point is a low coverage indicative of an error?
Can we ignore low coverage nodes and paths?
This is a new assembly parameter

Coverage cut-off is an important parameter to differentiate error from real variations

de Bruijn Graph Assembly process

Select a value for k
“Hash” the reads (make the kmers)
Count the kmers
Make the de Bruijn graph
Perform graph simplification steps - use cov cutoff
Read off contigs from simplified graph

Graph simplification : Chain Merging

Graph simplification : Tip Clipping

Graph simplification : Bubble Collapsing

Make contigs

Find an unbalanced node in the graph
Follow the chain of nodes and “read off” the bases to produce the contigs
Where there is an ambiguous divergence/convergence, stop the current contig and start a new one.
Re-trace the reads through the contigs to help with repeat resolution

Graph simplification : Remove low coverage nodes

remove erroneous nodes and edges using the “coverage cutoff”
genuine short nodes will be kept because of their high coverage

Assemble with SPADES

SPADES [6] is the de Bruijn graph assembler by Pavel Pevzner’s group out of St. Petersburg

Uses multiple k-mers to build the graph
- Graph has connectivity and specificity
- Usually use a low, medium and high k-mer size together.
Performs error correction on the reads first
Maps reads back to the contigs and scaffolds as a check
Under active development

Assembly your data with SPADES

Assembly quality

QUAST

After assembly, we use QUAST [7] to evaluate and compare genome assemblies.

What QUAST does :

De novo genome assembly evaluation
Reference-based evaluation
Evaluating so-called misassemblies
Report and visualisation

De novo metrics

Evaluation of the assembly based on:

Number of contigs greater than a given threshold (0, 1kb, …)
Total / thresholded assembly size
Largest contig size
N50 : the sequence length of the shortest contig at 50% of the total assembly length, equivalent to a median of contig lengths. (N75 idem, for 75%)
L50 : the number of contigs at 50% of the total assembly length. (L75 idem, for 75%)

Reference-based metrics

Metrics based on based on an alignment of all contigs on a reference genome. :
- duplication rate
- percent genome complete
- NGA50 : equivalent of N50 but with the aligned block of the contigs
- “Misassemblies” : breakpoint of alignment in a contigs. ”
- Visualisation available

Assembly QC with Quast

Alignment

Alignment strategies

GAAGCTCTAGGATTACGATCTTGATCGCCGGGAAATTATGATCCTGACCTGAGTTTAAGGCATGGACCCATAA
                 ATCTTGATCGCCGAC----ATT       # GLOBAL
                 ATCTTGATCGCCGACATT           # LOCAL, with soft clipping

Global alignment

Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot start and/or end in gaps.) A general global alignment technique is the Needleman–Wunsch algorithm, which is based on dynamic programming.

Alignment strategies

GAAGCTCTAGGATTACGATCTTGATCGCCGGGAAATTATGATCCTGACCTGAGTTTAAGGCATGGACCCATAA
                 ATCTTGATCGCCGAC----ATT       # GLOBAL
                 ATCTTGATCGCCGACATT           # LOCAL, with soft clipping

Local alignment

Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The Smith–Waterman algorithm is a general local alignment method based on the same dynamic programming scheme but with additional choices to start and end at any place.

Seed-and-extend for NGS data

Seed-and-extend for NGS data

Seed-and-extend mappers are a class of read mappers that break down each read sequence into seeds (i.e., smaller segments) to find locations in the reference genome that closely match the read.

The mapper obtains a read
The mapper selects smaller DNA segments from the read to serve as seeds
The mapper indexes a data structure with each seed to obtain a list of possible locations within the reference genome that could result in a match
For each possible location in the list, the mapper obtains the corresponding DNA sequence from the reference genome
The mapper aligns the read sequence to the reference sequence, using an expensive sequence alignment algorithm to determine the similarity between the read and the reference.

Mapping

For further analysis it is necessary to map all the reads on the contigs.

Mapping

We will use bowtie2 [8]
- Firstly, we build an index.
- Secondly, reads are aligned.
- We can use samtools [9] and bedtools [10] to manipulate SAM/BAM files.

BAM/SAM

SAM = Sequence Alignment Map
BAM = Binary Alignment Map

These files represent an alignment of FASTQ reads against a reference like a FASTA.

After a header section (for reference), each line represents the alignment of one read.

BAM/SAM

Mapping

Vizualisation

Visualization

Some tools for visualization and browsing
- IGV (alignments and reference)
- Artemis (genome and annotations)

Visualization

Long reads

Tools for long reads

Long read data can be used to improve assembly
Bottlenecks :
- DNA extraction (?)
- cost of data generation
- sequencing errors
State of the art pipeline for assembly :
- standalone long read assembly
  - FLYE [11]
  - canu [12]
Long read assembly first, correction. with short reads
- Unicycler [13] & Trycycler

Take home message

→ You have in your hands the first tools to analyze your NGS data

→ Data quality control is a crucial step

→ It is essential to define your plan analyses upstream of your project.

→ NGS are still an ongoing active bioinformatics research field

→ Biostatistics …

References

1. Goodwin S, McPherson JD, McCombie WR. Coming of age: Ten years of next-generation sequencing technologies. Nature Reviews Genetics. 2016;17:333–51. doi:10.1038/nrg.2016.49.

2. Andrews S. FastQC a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. 2010. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

3. Joshi N, Fass J. Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files. 2011.

4. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet journal. 2011;17:10–2.

5. Zhou Y, Chen Y, Chen S, Gu J. Fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884–90. doi:10.1093/bioinformatics/bty560.

6. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology. 2012;19:455–77. doi:10.1089/cmb.2012.0021.

7. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29:1072–5. doi:10.1093/bioinformatics/btt086.

8. Langmead B, Salzberg SL. Fast gapped-read alignment with bowtie 2. Nature methods. 2012;9:357.

9. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–9.

10. Quinlan AR, Hall IM. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2.

11. Lin Y, Yuan J, Kolmogorov M, Shen MW, Chaisson M, Pevzner PA. Assembly of long error-prone reads using de bruijn graphs. Proceedings of the National Academy of Sciences. 2016;113:E8396–405. doi:10.1073/pnas.1604560113.

12. Berlin K, Koren S, Chin C-S, Drake JP, Landolin JM, Phillippy AM. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nature biotechnology. 2015;33:623–30.

13. Wick LMAG Ryan R. AND Judd. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLOS Computational Biology. 2017;13:1–22. doi:10.1371/journal.pcbi.1005595.