by Rozaimi Razali, 2 years ago

Application of Next Generation Sequencing (NGS)

Viewed from above

The probability that the intensity represents an incorrect base is stored as a Phred score
e.g. if Phred assigns a quality score of 30 to a base, the chance that this base is called incorrectly is 1 in 1000.
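
The relationship between a Phred score Q and the error probability P is Q = -10 * log10(P). A minimal Python sketch of the conversion follows; the function names are illustrative and not taken from any particular tool.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Q30 corresponds to a 1 in 1000 chance of an incorrect call.
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_probability(q))
```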

How does Oxford Nanopore sequencing work?

How does PacBio sequencing work?

How does it work?

Sample Prep

During sample prep, the genome is fragmented into short random fragments.
Then sequencing adapters are added to these fragments

Cluster Generation

The fragments hybridize to the surface of the flow cell

DNA polymerase binds to the hybridized fragment and creates a complementary strand

The original template/fragment is washed away

The newly created fragment then hybridizes with a neighbouring oligonucleotide attached to the flow cell

This amplification process is repeated

Sequencing

The fragments are then sequenced using fluorescently tagged nucleotides

What are fluorescent probes?

molecules that absorb light of a specific wavelength and emit light of a different wavelength

ONLY the fragments with the adapters attached are amplified

Remember there are 2 types of oligonucleotides on the flow cell

one is complementary to the start adapter, the other to the end adapter

What is the purpose of these adapters?

save $$$

Allow the fragments to bind to the oligonucleotides on the flow cell

low to medium

~5Gb

5-10Gb

~$1800-2000 per human sample
~$3000 per human sample
ONT
5-15%
1-10%
very long
ONT 5kb-10kb
PacBio 10kb-15kb

characteristics

cost

cheaper than long-reads

read length

shorter, 100bp - 250bp

error rate

error rate <=1%

throughput

very high, 25-100Gb

69,830,209 SNPs

6,216 samples

Application of Next Generation Sequencing (NGS)

Common pipeline

additional pipeline
effect on protein structure & interactions

SAAPdap

SuSpect

Missense3D

LS-SNP/PDB

Functional annotation

Annotate

SNPeffect

Variant Effect Predictor

Annotate & Rank

AnnotSV

for SVs

PVP

Random Forest Classifier

Annovar

Exomiser

Rank

DANN

CADD

POLYPHEN

SIFT

Other resources

Gene Set Enrichment Analysis

ClusterProfiler

Cytoscape

Enrichment Map

given a set of genes, expression data and list of phenotypes

identify statistically significant, concordant differences between two phenotypic states

other DB

dbSNP

all known SNPs as reported by GRCh, NCBI, HapMap and the 1000 Genomes Project

Good for looking up the allele frequency (AF) of SNPs

gnomAD

aggregates publicly available WGS and WES data

Good for looking up the AF of a SNP or SV

eQTL/sQTL database

eQTL catalogue

GTEx

sQTL

variant affecting splicing

eQTL

locus affecting expression

co-localization GWAS and eQTL

COLOC

QTLtools

Disease specific DB

Cancer

COSMIC

TCGA

Human Genetics Reference

OMIM

Raredisease.gov

Genetics Home Reference

e.g. associated genes and known pathogenic variants

e.g. identify mode of inheritance (MOI) of the disease/phenotype

Pathway

Once we have identified the variants, we want to look at the pathways they are involved in

Paid

Pathway Studio

Ingenuity Pathway Analysis

Free/Open-source

KEGG

REACTOME

HumanCyc

BioCyc

ALIGNMENT
Koboldt, D.C. Best practices for variant calling in clinical sequencing. Genome Med 12, 91 (2020)

Alignment QC

Identify PCR duplicates

Samblaster

Picard MarkDuplicates

What are PCR duplicates?

If a small error is introduced during the PCR process

this error will be amplified

leaving many reads that contain the same error

PCR duplicates are multiple reads copied from the same original fragment during PCR, so they are not independent evidence for a variant
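
The idea behind duplicate marking can be sketched in Python as below. This is a deliberately simplified illustration, not how Samblaster or Picard MarkDuplicates are implemented: it assumes each read is a dictionary with hypothetical keys `name`, `chrom`, `pos`, `strand` and `qual_sum`, groups reads that start at the same position on the same strand, keeps the one with the highest summed base quality, and flags the rest.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as likely PCR duplicates.

    Simplifying assumption: reads sharing chromosome, start position and
    strand are treated as copies of the same original fragment.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the read with the highest summed base quality as representative.
        best = max(group, key=lambda r: r["qual_sum"])
        for read in group:
            if read is not best:
                duplicates.add(read["name"])
    return duplicates


if __name__ == "__main__":
    # Made-up reads for illustration only.
    example = [
        {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 900},
        {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 850},
        {"name": "r3", "chrom": "chr1", "pos": 250, "strand": "-", "qual_sum": 700},
    ]
    print(mark_duplicates(example))
```

Real tools typically also consider mate positions and soft-clipping, so this sketch only conveys the principle.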

Tools

Minimap2

for long-reads

BWA-MEM

for short-reads

Critical phase

Raw sequence data are aligned to the reference sequence

The output is a mapping file in SAM/BAM format

recording which reads mapped where in the reference genome
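
To make the mapping file concrete, here is a minimal Python sketch that parses one SAM alignment line into its 11 mandatory tab-separated fields and decodes two flag bits (0x4 for unmapped, 0x400 for duplicate). The example line is made up for illustration; in practice a library such as pysam or the samtools command line would normally be used instead of hand parsing.

```python
SAM_FIELDS = [
    "qname", "flag", "rname", "pos", "mapq", "cigar",
    "rnext", "pnext", "tlen", "seq", "qual",
]
INTEGER_FIELDS = ("flag", "pos", "mapq", "pnext", "tlen")


def parse_sam_line(line):
    """Parse the mandatory fields of a single SAM alignment line."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 fields")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in INTEGER_FIELDS:
        record[key] = int(record[key])
    # Decode two commonly used bits of the FLAG field.
    record["is_unmapped"] = bool(record["flag"] & 0x4)
    record["is_duplicate"] = bool(record["flag"] & 0x400)
    return record


if __name__ == "__main__":
    # Hypothetical read aligned to chr1 at position 100 (1-based).
    line = "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII"
    rec = parse_sam_line(line)
    print(rec["qname"], rec["rname"], rec["pos"], rec["mapq"], rec["cigar"])
```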

Examples of reference genomes

CHM13

GRCh37/38

All downstream analyses and interpretation depend on the quality of the alignment

PRE-ALIGNMENT QC
Common questions

Did the sequencing work?

Post-sequencing level

Sample relatedness

KING

check how different samples are related

especially important for Trio-based analysis

verify that the proband dataset is indeed the child of the parental dataset

Sample contamination/swap

tools

Picard CrosscheckFingerprints

verifyBamID

What are the effects?

e.g. in Cancer analysis

calling contaminant germline variants as somatic

e.g. in Trio analysis

mistakenly identify variants in VCF as de novo mutations, when the variants actually came from someone else

How could this happen?

many reasons

rotated sample plate

mislabeling of sample sheets

swap

reads contain DNA from another sample

contamination

reads contain mixture of DNA from different samples

Reads quality assessment

MultiQC

Can also be used for QC after alignment

Can view QC of multiple samples at the same time

an "advanced" version of FastQC

FastQC

https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

All are important but the most important ones are

Per base GC content

check whether there is a problem with the library

Per base sequence quality

check the quality of your sequence
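
As an illustration of what a per base sequence quality check computes, the Python sketch below averages Phred scores at each read position. It assumes FASTQ quality strings in the common Phred+33 encoding and is a simplification; FastQC itself reports a fuller distribution per position.

```python
def decode_phred33(quality_string):
    """Decode a Phred+33 FASTQ quality string into integer scores."""
    return [ord(char) - 33 for char in quality_string]


def mean_quality_per_position(quality_strings):
    """Return the mean Phred score at each read position.

    Reads may have different lengths; each position is averaged over
    the reads that are long enough to cover it.
    """
    totals = []
    counts = []
    for quality_string in quality_strings:
        for i, score in enumerate(decode_phred33(quality_string)):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += score
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


if __name__ == "__main__":
    # Made-up quality strings: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
    print(mean_quality_per_position(["III#", "5II", "II5#"]))
```

A drop in the mean towards the end of the read is the typical pattern this check is used to spot.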

During sequencing

e.g. Sequencing Real Time Analysis

https://supportassets.illumina.com/content/dam/illumina-support/images/featured-training/sav-overview.png

Demultiplexing

all libraries should be well-balanced
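
A rough way to check that libraries are well-balanced after demultiplexing is to count the reads assigned to each sample and compare their shares. The Python sketch below assumes exact barcode matching and uses made-up barcodes and sample names; real demultiplexers usually also tolerate mismatches.

```python
from collections import Counter


def demultiplex_counts(read_barcodes, barcode_to_sample):
    """Count reads per sample; unknown barcodes go to 'undetermined'."""
    counts = Counter()
    for barcode in read_barcodes:
        counts[barcode_to_sample.get(barcode, "undetermined")] += 1
    return counts


def library_fractions(counts):
    """Return each sample's share of the assigned (non-undetermined) reads."""
    assigned = {s: n for s, n in counts.items() if s != "undetermined"}
    total = sum(assigned.values())
    if total == 0:
        return {}
    return {sample: n / total for sample, n in assigned.items()}


if __name__ == "__main__":
    # Hypothetical sample sheet and observed index reads.
    sample_sheet = {"ACGTAC": "sample_A", "TGCATG": "sample_B"}
    observed = ["ACGTAC", "TGCATG", "ACGTAC", "NNNNNN", "TGCATG", "ACGTAC"]
    counts = demultiplex_counts(observed, sample_sheet)
    print(dict(counts))
    print(library_fractions(counts))
```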

Error rate

PacBio

Sequel/RSII

less than 15%

HiFi/Sequel II

less than 1%

Illumina

less than 0.5%

Percentage >Q30

in 70% of reads
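
One way to compute a Q30 percentage is sketched below in Python. It assumes Phred+33 quality strings and counts the fraction of base calls with a score of at least 30; whether a given report counts bases or reads, and whether the cut-off is strict, is an assumption to check against the platform's own definition.

```python
def fraction_bases_at_or_above(quality_strings, threshold=30, offset=33):
    """Fraction of base calls whose Phred score is >= threshold."""
    total = 0
    passing = 0
    for quality_string in quality_strings:
        for char in quality_string:
            total += 1
            if ord(char) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0


if __name__ == "__main__":
    # Made-up quality strings: 'I' is Q40, '?' is Q30, '5' is Q20.
    reads = ["IIII", "II?5", "I555"]
    print(f"{100 * fraction_bases_at_or_above(reads):.1f}% of bases >= Q30")
```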

Number of reads

platform specific

MiSeq

>25M

> 100M

> 300M

Do I have enough sequencing reads?

Depth = (Read length x Number of reads) / Genome size
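
The depth formula can be worked through in a few lines of Python. The numbers in the example (150 bp reads, 600 million reads, a 3 Gb genome) are illustrative assumptions, not values from a specific run.

```python
import math


def sequencing_depth(read_length_bp, number_of_reads, genome_size_bp):
    """Depth = (read length x number of reads) / genome size."""
    return read_length_bp * number_of_reads / genome_size_bp


def reads_needed(target_depth, read_length_bp, genome_size_bp):
    """Rearranged formula: number of reads required for a target depth."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)


if __name__ == "__main__":
    depth = sequencing_depth(150, 600_000_000, 3_000_000_000)
    print(f"Estimated depth: {depth:.1f}x")
    print("Reads needed for 30x:", reads_needed(30, 150, 3_000_000_000))
```

This is an average over the whole genome; the depth at any single position varies around it.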

The most important step
Cost

Gene Panel ~ WES > WGS

Turnaround time

Gene Panel > WES > WGS

Type of variants

SNPs/short InDels

Gene Panel = WES = WGS

SV

WGS LR

Comprehensiveness

WGS > WES > Gene Panel

Read depth coverage

What is it and why this is important?

Depth plays an important role for heterogeneous samples

e.g. cancer

In a tumor sample, normal cells tend to be observed together with tumor cells

2 populations: Normal and Tumor

We do not know the ratio of each

With high depth, modern bioinformatic tools are able to detect differences in read coverage

more reads

might be duplication

fewer reads

might be deletion
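
The coverage reasoning above can be sketched in Python as a comparison of each window's depth against the sample's median depth. The ratio cut-offs (1.4 and 0.6) are arbitrary illustrative assumptions, and real copy-number callers use far more careful normalisation and statistics.

```python
from statistics import median


def classify_coverage(window_depths, gain_ratio=1.4, loss_ratio=0.6):
    """Label each window by comparing its depth to the median depth."""
    if not window_depths:
        return []
    baseline = median(window_depths)
    calls = []
    for depth in window_depths:
        ratio = depth / baseline if baseline else 0.0
        if ratio >= gain_ratio:
            calls.append("possible duplication")
        elif ratio <= loss_ratio:
            calls.append("possible deletion")
        else:
            calls.append("normal")
    return calls


if __name__ == "__main__":
    # Made-up mean depths for six consecutive genomic windows.
    depths = [30, 31, 29, 60, 14, 30]
    for depth, call in zip(depths, classify_coverage(depths)):
        print(depth, call)
```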

During library prep, the genome is fragmented into short random fragments.

Sequencing adapters are added to these fragments

PCR amplify the libraries. ONLY the fragments with the adapters attached are amplified

These random fragments are then sequenced

The reads are then aligned to a reference genome to create longer stretches of sequence

example

To make the tiling process a success

need to have many fragments that overlap between each other

the more overlaps, the higher the alignment confidence

Sequencing errors will always occur

"low diversity" error

occurs at the beginning or end of a sequencing read

"low confidence" error

e.g. sometimes adapters attach to each other instead of to a fragment

results in the tiling process failing to work properly

results in erroneous variant calling

but if we have many copies, correct reads will outweigh bad reads

results in high confidence variant calling
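
The point that correct reads outweigh bad reads at sufficient depth can be illustrated with a simple majority vote over the bases observed at one position. This Python sketch is only an illustration of the principle; actual variant callers use probabilistic models that also weigh base and mapping quality.

```python
from collections import Counter


def consensus_base(pileup_bases):
    """Return the most frequent base at a position and its supporting fraction."""
    if not pileup_bases:
        raise ValueError("no reads cover this position")
    counts = Counter(pileup_bases)
    base, support = counts.most_common(1)[0]
    return base, support / len(pileup_bases)


if __name__ == "__main__":
    # Made-up pileups: one erroneous 'T' among many reads versus among two reads.
    print(consensus_base("AAAAAAAAAT"))
    print(consensus_base("AT"))
```

With ten reads a single erroneous base barely affects the call, whereas with two reads the same error leaves the position ambiguous.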

purpose

Multiplexing sequencing

allow sequencing multiple samples in one run

Allow sequencer to recognize the fragments

Background

What kind of experiments can you do?
Many more
Epigenomics

ChIPSeq

Methylation

Transcriptomics

non-coding RNA

Targeted

mRNA

Total RNA

Genomics

Targeted Gene

Whole Exome Sequencing (WES)

Whole Genome Sequencing (WGS)

How is it commonly used in clinical settings?
Therapy selection
Prognosis
Diagnosis

Diagnosis of infectious diseases

e.g. Initial whole-genome sequencing and analysis of the host genetic contribution to COVID-19 severity and susceptibility

e.g. SARS-CoV-2

Diagnosis of specific clinical presentations of suspected genetic diseases

e.g. Neuromuscular disorder

Risk assessment and screening

e.g. Germline cancer risk testing

e.g. Carrier screening for recessive genetic disorders

cystic fibrosis, Mendelian inherited disorders

e.g. Non-Invasive Prenatal Testing (NIPT)

common trisomy syndromes in fetuses

What is NGS?
A major advantage of NGS compared with PCR is that prior knowledge of the target organism (target-specific primers) is not required.
A broad term that encompasses several modern sequencing technologies

Can be broadly divided into two groups based on the length of the sequencing output

long-reads

Oxford Nanopore

PromethION

GridION

MinION

PacBio

Sequel II

Sequel

RS II

short-reads

Ion Torrent

Genexus System

Gene Studio S5

MGIseq

T7

G400

G50

Illumina-based platform

NovaSeq

HiSeq

NextSeq

MiSeq

MiniSeq

Objectives

Introduce the Linux environment
Sequencing strategies
Background on NGS