Module 8 bis
June 15, 2023
After this training day, you will know:
Break
Break
Interest in a genome that has not yet been sequenced
A reference genome is available
The genome must be cut into millions of fragments (shotgun sequencing), drawn from both DNA strands.
The operation to reconstruct the genetic elements from the raw reads is called assembly.
Second generation (since 2007)
Three steps:
Randomly fragment genomic DNA and ligate adapters to both ends of the fragments
Bind single-stranded fragments randomly to the inside surface of the flow cell channels.
Add unlabelled nucleotides and enzyme to initiate solid-phase bridge amplification.
The enzyme incorporates nucleotides to build double-stranded bridges on the solid-phase substrate.
Denaturation leaves single-stranded templates anchored to the substrate.
Several million dense clusters of double-stranded DNA are generated in each channel of the flow cell.
The first sequencing cycle begins by adding four labelled reversible terminators, primers, and DNA polymerase.
After laser excitation, the emitted fluorescence from each cluster is captured and the first base is identified.
The blocked 3’ terminus and fluorophore are removed and the flow cell is washed, leaving the 3’ end free for a second cycle.
The next cycle repeats the incorporation of four labelled reversible terminators, primers, and DNA polymerase.
After laser excitation, the image is captured as before, and the identity of the second base is recorded.
The sequencing cycles are repeated to determine the sequence of bases in a fragment, one base at a time.
Millions of clusters are processed in parallel, allowing high-throughput sequencing.
Target the weaknesses of the 2nd generation:
Two main competitors (in production):
A polymerase is immobilized at the bottom of a sequencing unit called a zero-mode waveguide (ZMW). Four fluorescently labelled nucleotides, which generate distinct emission spectra, are added to the SMRT cell. As a base is held by the polymerase, a light pulse is produced that identifies the base. The replication processes in all ZMWs of a SMRT cell are recorded as a “movie” of light pulses, and the pulses corresponding to each ZMW can be interpreted as a sequence of bases.
Applications:
An interesting review [1]
Nature review: Milestones in Genomic Sequencing
https://galaxy.migale.inrae.fr
Login: stageXX
Data in Shared Data / Data Libraries / formation NGS / Reads
References in Shared Data / Data Libraries / formation NGS / Refs
The FASTQ format is the de facto standard by which all sequencing instruments represent data. It may be thought of as a variant of the FASTA format that associates a quality measure with each sequence base: FASTA with QUALITIES.
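The FASTQ format consists of 4 sections, usually written on a single line each:
1. A FASTA-like header, but instead of the > symbol it uses the @ symbol. This is followed by an ID and more optional text, similar to the FASTA headers.
2. The measured sequence, typically on a single line, although it may be wrapped until the + sign starts the next section.
3. A section marked by the + sign, which may be optionally followed by the same sequence ID and header as the first section.
4. The quality values for the sequence in section 2, which must have the same length as that sequence.
Each character represents a numerical value: a so-called Phred score, encoded as a single character.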
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
|    |    |    |    |    |    |    |    |
0....5...10...15...20...25...30...35...40
|    |    |    |    |    |    |    |    |
worst................................best
The numbers represent the error probabilities via the formula: \(Error=10^{-P/10}\)
It is basically summarized as:
There was a time when instrumentation makers could not decide at what character to start the scale. The current standard shown above is the so-called Sanger (+33) format where the ASCII codes are shifted by 33. There is the so-called +64 format that starts close to where the other scale ends.
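The decoding described above can be sketched in a few lines of Python. This is a minimal illustration, assuming an unwrapped FASTQ file (4 lines per record) and the Sanger (+33) offset; the function names and the example record are hypothetical and not part of the training material.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from an unwrapped 4-line FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is skipped
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert a quality string to Phred scores (offset 33 = Sanger format)."""
    return [ord(char) - offset for char in quality]


def error_probability(score):
    """Error probability for a Phred score P: 10^(-P/10)."""
    return 10 ** (-score / 10)


# Hypothetical record, made up for the illustration.
example = io.StringIO("@read1 demo\nACGTA\n+\n!+5?I\n")
for name, sequence, quality in read_fastq(example):
    for base, score in zip(sequence, phred_scores(quality)):
        print(name, base, score, error_probability(score))
```

With the +33 offset, the characters `!`, `+`, `5`, `?` and `I` decode to Phred scores 0, 10, 20, 30 and 40, that is, error probabilities from 1 down to 1 in 10,000. For a +64 file, the same function could be called with `offset=64`.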
Information is often encoded in the “free” text section of a FASTQ file.
EAS139: the unique instrument name
136: the run id
FC706VJ: the flowcell id
2: flowcell lane
2104: tile number within the flowcell lane
15343: ‘x’-coordinate of the cluster within the tile
197393: ‘y’-coordinate of the cluster within the tile
1: the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y: Y if the read is filtered, N otherwise
18: 0 when none of the control bits are on, otherwise it is an even number
ATCACG: index sequence
This information is specific to a particular instrument/vendor and may change with different versions or releases of that instrument.
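As an illustration, the fields listed above can be extracted with a short Python sketch. The header string below is reassembled from those field values, assuming the colon-separated Illumina layout with a single space between the two groups of fields; the class and function names are hypothetical, and other instruments or software versions may use a different layout.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    pair_member: int
    filtered: bool
    control_bits: int
    index_sequence: str


def parse_illumina_header(header):
    """Split an Illumina-style FASTQ header into named fields (assumed layout)."""
    left, right = header.lstrip("@").split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = left.split(":")
    member, filtered, control, index = right.split(":")
    return IlluminaHeader(
        instrument, int(run_id), flowcell, int(lane), int(tile),
        int(x), int(y), int(member), filtered == "Y", int(control), index,
    )


# Header reassembled from the field values listed above (assumed layout).
example = "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
print(parse_illumina_header(example))
```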
What information do you want to know about the sequencing run when you perform quality control?
Collective Answer on this collaborative whiteboard
Try to answer some (not always) simple questions:
Warning
Quality control without context leads to misinterpretation
Sickle [3] (quality)
cutadapt [4] (adapter removal)
Trimmomatic
fastp [5]
Similar to a puzzle:
All assembly algorithms are based on read overlap.
Different ways of calculating overlap:
“All vs All” comparison:
de Bruijn Graph
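The “All vs All” approach can be sketched as follows: every read is compared with every other read, and the longest suffix of one that equals a prefix of the other is recorded. This is a naive illustration whose cost grows quadratically with the number of reads; the function names, the three example reads and the minimum overlap of 3 are assumptions made for the example.

```python
from itertools import permutations


def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if too short)."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def all_vs_all(reads, min_length=3):
    """Compare every ordered pair of reads and keep the overlapping ones."""
    overlaps = {}
    for i, j in permutations(range(len(reads)), 2):
        length = overlap(reads[i], reads[j], min_length)
        if length:
            overlaps[(i, j)] = length
    return overlaps


# Hypothetical reads, chosen so that consecutive reads overlap.
reads = ["ATCTTGATC", "GATCGCCGG", "CCGGGAAAT"]
print(all_vs_all(reads))
```

The de Bruijn graph approach avoids this pairwise comparison by working on k-mers instead.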
Consider the 2 symbol alphabet (0 & 1) de Bruijn Graph for k=3
To help us get around these problems, we use all subsequences of length k of the reads: these are the k-mers.
2 contigs: MISSISSIS & SSIPPI
Optimum value for k will balance these effects.
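To make the effect of k concrete, here is a minimal Python sketch that builds a de Bruijn graph from k-mers and lists the nodes with more than one outgoing edge. It assumes the word MISSISSIPPI as the single "read", with nodes defined as (k-1)-mers and edges as k-mers; the function names are hypothetical.

```python
from collections import defaultdict


def kmers(sequence, k):
    """All subsequences of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, where a contig would have to stop."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


for k in (3, 6):
    graph = de_bruijn_graph(["MISSISSIPPI"], k)
    print(k, branching_nodes(graph))
```

With k=3 the repeated "SI" node branches towards both "IS" and "IP", whereas with k=6 no node branches. On real data, a larger k also means fewer k-mers per read and more sensitivity to errors, which is why a balance is needed.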
More coverage depth will help overcome errors!
Which path looks most valid ? Why ?
The coverage cut-off is an important parameter for distinguishing sequencing errors from real variation.
Find an unbalanced node in the graph
Follow the chain of nodes and “read off” the bases to produce the contigs
Where there is an ambiguous divergence/convergence, stop the current contig and start a new one.
Re-trace the reads through the contigs to help with repeat resolution
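The contig read-off steps above can be sketched as a walk over non-branching paths. This is a simplified illustration, not the procedure of any particular assembler: it starts a contig at every node that is not "one in, one out", stops at the next such node, ignores isolated cycles, and does not re-trace reads to resolve repeats. The function names and the choice of k=4 on the word MISSISSIPPI are assumptions.

```python
from collections import defaultdict


def distinct_kmers(sequence, k):
    """Set of distinct k-mers of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def read_off_contigs(kmers):
    """Read contigs off the de Bruijn graph, stopping at ambiguous nodes."""
    successors = defaultdict(list)
    in_degree = defaultdict(int)
    for kmer in sorted(kmers):
        successors[kmer[:-1]].append(kmer[1:])
        in_degree[kmer[1:]] += 1
    nodes = sorted(set(successors) | set(in_degree))

    def is_simple(node):
        # Exactly one incoming and one outgoing edge: no ambiguity here.
        return in_degree.get(node, 0) == 1 and len(successors.get(node, [])) == 1

    contigs = []
    for node in nodes:
        if is_simple(node):
            continue  # contigs only start at unbalanced or branching nodes
        for nxt in successors.get(node, []):
            contig = node + nxt[-1]
            while is_simple(nxt):
                nxt = successors[nxt][0]
                contig += nxt[-1]
            contigs.append(contig)
    return contigs


print(read_off_contigs(distinct_kmers("MISSISSIPPI", 4)))
```

On this example the sketch returns four short contigs (ISSI, MISS, SSIPPI, SSISS). Without the read re-tracing step, the repeat around ISS/SSI cannot be resolved into longer contigs.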
Remove erroneous nodes and edges using the “coverage cutoff”
Genuine short nodes will be kept because of their high coverage
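The idea of a coverage cutoff can be illustrated on k-mer counts. In this minimal sketch, k-mers seen fewer times than the cutoff are treated as probable errors and dropped before the graph is built. The reads, k=3 and the cutoff of 2 are made up for the example; real assemblers apply the cutoff to graph nodes and edges in more elaborate ways.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def apply_coverage_cutoff(counts, cutoff):
    """Keep only the k-mers observed at least `cutoff` times."""
    return {kmer: n for kmer, n in counts.items() if n >= cutoff}


# Three identical reads plus one read carrying a single error (T -> A).
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
counts = count_kmers(reads, 3)
print(apply_coverage_cutoff(counts, 2))
```

The k-mers created by the error (CGA, GAA, AAC) occur only once and are removed, while the genuine k-mers are kept because they are supported by several reads.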
SPAdes [6] is a de Bruijn graph assembler developed by Pavel Pevzner’s group in St. Petersburg
After assembly, we use QUAST [7] to evaluate and compare genome assemblies.
What QUAST does:
Evaluation of the assembly based on:
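One contiguity statistic commonly reported by QUAST is N50: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The sketch below shows how such a value can be computed; the function name and the contig lengths are hypothetical, and this is not QUAST's own code.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths (total 290): 80 + 70 = 150 covers half of it.
print(n50([80, 70, 50, 40, 30, 20]))
```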
GAAGCTCTAGGATTACGATCTTGATCGCCGGGAAATTATGATCCTGACCTGAGTTTAAGGCATGGACCCATAA
                 ATCTTGATCGCCGAC----ATT # GLOBAL
                 ATCTTGATCGCCGACATT # LOCAL, with soft clipping
Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot start and/or end in gaps.) A general global alignment technique is the Needleman–Wunsch algorithm, which is based on dynamic programming.
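The dynamic programming scheme can be sketched as follows. This is a minimal Needleman–Wunsch illustration with a linear gap penalty; the scoring values (match +1, mismatch -1, gap -1) and the two short sequences are assumptions chosen for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a prefix of a aligned only to gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap  # a prefix of b aligned only to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the bottom-right corner to the top-left corner.
    i, j = n, m
    top, bottom = [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "AGT"))
```

Every residue of both sequences appears in the result: here the missing C is compensated by a gap, giving ACGT over A-GT.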
Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The Smith–Waterman algorithm is a general local alignment method based on the same dynamic programming scheme but with additional choices to start and end at any place.
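The "start and end at any place" idea translates into two changes to the global scheme: cell scores are never allowed to drop below zero, and the traceback starts at the best-scoring cell and stops at the first zero. A minimal Smith–Waterman sketch, assuming a linear gap penalty with match +2, mismatch -1, gap -1 and two made-up sequences:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,  # a local alignment may start anywhere
                              score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    # Traceback from the best cell until a zero score is reached.
    i, j = best_cell
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("CCCACGTCCC", "GGACGTGG"))
```

Only the shared ACGT region is reported; the dissimilar flanks are left out of the alignment, which is comparable to what soft clipping expresses for a mapped read.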
Seed-and-extend mappers are a class of read mappers that break down each read sequence into seeds (i.e., smaller segments) to find locations in the reference genome that closely match the read.
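A toy version of the seed-and-extend idea is sketched below. It assumes exact-match seeds of length k taken at non-overlapping offsets of the read, a k-mer index of the reference, and an ungapped extension that simply counts mismatches. Real mappers use far more compact indexes and gapped extension. The function names and k=6 are hypothetical; the reference and read are the ones from the alignment illustration.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k):
    """Return (start, mismatches) of the best ungapped placement, or None."""
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):  # non-overlapping seeds
        for position in index.get(read[offset:offset + k], []):
            start = position - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    best = None
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(r != g for r, g in zip(read, window))  # extension step
        if best is None or mismatches < best[1]:
            best = (start, mismatches)
    return best


reference = ("GAAGCTCTAGGATTACGATCTTGATCGCCGGGAAATTATGATCC"
             "TGACCTGAGTTTAAGGCATGGACCCATAA")
read = "ATCTTGATCGCCGACATT"
index = build_index(reference, 6)
print(map_read(read, reference, index, 6))
```

Two of the three seeds hit the reference and agree on the same starting position (0-based offset 17), so only one candidate location has to be examined instead of the whole reference.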
SAM = Sequence Alignment/Map
BAM = Binary Alignment/Map
These files represent the alignment of FASTQ reads against a reference sequence, typically provided as a FASTA file.
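A SAM alignment line is tab-separated, with 11 mandatory fields followed by optional tags. The sketch below parses one line and splits its CIGAR string. The example line is hypothetical: it was written by hand to describe the local alignment of the read shown earlier (1-based position 18, 13 aligned bases, 5 soft-clipped bases) and does not come from a real mapper.

```python
import re

# The 11 mandatory SAM columns, in order.
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")


def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line starting with '@')."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["tags"] = columns[11:]  # optional TAG:TYPE:VALUE fields
    return record


def parse_cigar(cigar):
    """Split a CIGAR string such as '13M5S' into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


# Hypothetical line: POS is 1-based, '*' means quality not stored.
line = "read1\t0\tref\t18\t60\t13M5S\t*\t0\t0\tATCTTGATCGCCGACATT\t*"
record = parse_sam_line(line)
print(record["pos"], parse_cigar(record["cigar"]))
print("unmapped:", bool(record["flag"] & 4), "reverse strand:", bool(record["flag"] & 16))
```

BAM files carry the same information in a compressed binary form, so they are normally read through a dedicated library or tool rather than parsed as text.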
Module 8bis - NGS data analysis with Galaxy