Dr. rer. nat. Diego Mauricio Riaño Pachón
Laboratório de Biologia Computacional, Evolutiva e de Sistemas
Centro de Energia Nuclear na Agricultura
Universidade de São Paulo
diego.riano@cena.usp.br
http://diriano.github.io/
Genomics

Use long reads
Two (three) technologies readily available:
PacBio CLR (Continuous Long Reads)
PacBio CCS (HiFi) (Circular Consensus Sequences)
Oxford Nanopore Technologies
Key points for long read sequencing
Use fresh material, or collect and freeze it.
Extract HMW (high-molecular-weight) DNA.
For ONT, eliminate short reads with Circulomics.
If doing the sequencing yourself with ONT, get used to nanopore problems.
A server with plenty of RAM and CPU. For nanopore, a GPU is required for basecalling.
Try different assemblers.
Always polish your assemblies, especially if using only long reads with a high error rate.
Scaffold the genome using Hi-C or genetic/physical (Bionano) maps.
Some problems when assembling a genome (plants)
Genome size (get an idea of the size of your genome!)
Deposit a specimen sample in a museum!
Ploidy
Heterozygosity (out/in-breeder), phasing
Repeats! Plants usually have lots of TEs; read length is important to resolve repeats
Assemblers: no one size fits all!
Everything on the market: OLC, de Bruijn, greedy
minimap2/miniasm
hifiasm
wtdbg2
Flye
Shasta
Canu
FALCON
MaSuRCA
Advances in sequencing technology
2020, 2022, 2023
https://nanoporetech.com/accuracy
https://training.galaxyproject.org/training-material/topics/assembly/tutorials/get-started-genome-assembly/slides-plain.html
https://nanoporetech.com/about-us/news/22-highlights-remember-2022
https://twitter.com/nanopore/status/1480996225029652483
(a) Sanger and Illumina sequencing technologies have short read lengths but high per base quality. Long read sequencing technologies (PacBio and NanoPore) have reads that can exceed 1 Mb but have a much lower per base quality. (b) Long read sequencing technologies have driven the largest improvements in genome contiguity, or completeness, over the last ∼5 years. These are schematic depictions of sequencing technologies and not actual data.
Genome size & heterozygosity et al.: GenomeScope
https://www.nature.com/articles/s41467-020-14998-3
Counting frequency of all k-mers in all sequencing reads
K-mer spectra and fitted models for (a) diploid Arabidopsis thaliana and (b) triploid Meloidogyne enterolobii. Note that the diploid plot has two major peaks, while the triploid plot has three major peaks. Both also have high frequency putative error k-mers with coverage near 1.
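The k-mer counting behind these spectra can be sketched as follows. This is only a minimal illustration in plain Python, assuming a handful of short in-memory reads and ignoring reverse complements; the names `count_kmers` and `kmer_spectrum` are hypothetical, and real data sets would go through a dedicated k-mer counter (for example Jellyfish or KMC) before model fitting.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmer_spectrum(counts):
    """Histogram: coverage -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())

# Toy reads (made up for illustration only).
reads = ["ACGTACGT", "CGTACGTA", "ACGTTCGT"]
spectrum = kmer_spectrum(count_kmers(reads, k=4))
for coverage in sorted(spectrum):
    print(coverage, spectrum[coverage])
```

The peaks discussed in the figure are peaks of this histogram: k-mers seen about once are mostly errors, and the main peaks sit near multiples of the per-haplotype coverage.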
Highly homozygous diploid: you can think of it as behaving like a haploid
Significantly heterozygous diploid
GenomeScope2.0 can deal with different values of ploidy (up to 6), but you must know the ploidy in advance! 
About Ploidy
Haploid (N)
Diploid (2N)
Triploid (3N)
Tetraploid (4N)
Polyploid
Diploid
Haploid
Monoploid
We have NGS genomic data, so let the data speak!
If you have a monoploid reference genome you can use ploidyNGS
ploidyNGS: Let the data tell you about the ploidy of your organism
 
Results
We need the alignment of the genomic NGS reads to the genome sequence in BAM format
Reference Genome
NGS Reads
Then we count the fraction of reads supporting each base (“allele”) at every position in the alignment file (BAM)
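The counting step just described can be sketched as below. This is a minimal illustration, not the ploidyNGS implementation: it assumes the bases observed at each position have already been extracted from the BAM file into a string per position (the `pileup` dictionary is a hypothetical stand-in for that), and the function name `allele_fractions` is made up.

```python
from collections import Counter

def allele_fractions(column):
    """Return base percentages at one position, most frequent base first."""
    counts = Counter(column)
    depth = sum(counts.values())
    return [(base, 100.0 * n / depth) for base, n in counts.most_common()]

# Hypothetical pileup: position -> bases observed in the reads covering it.
pileup = {
    101: "AAAAAAAAAA",  # monomorphic position
    102: "CCCCCTTTTT",  # two "alleles", each in half of the reads
    103: "GGGGGGGGGA",  # one discordant base, possibly a sequencing error
}
for position, column in pileup.items():
    print(position, allele_fractions(column))
```

Collecting the first, second, third and fourth percentages over all positions gives the histograms that are interpreted in the haploid, diploid and triploid cases.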
ploidyNGS: Haploid case
Results
Monomorphic positions
Sequencing errors
Abundance of the most frequent “allele”
Abundance of the second most frequent “allele”
In the haploid case we expect most positions in the alignment to be monomorphic or to have a low frequency of other bases (“alleles”), which mostly arise from sequencing errors or mapping errors.
ploidyNGS: Diploid case
Results
Polymorphic positions
In the diploid case we expect that, at polymorphic positions, the most frequent and the second most frequent “alleles” each appear in 50% of the reads at that position.
ploidyNGS: Triploid case
Results
Polymorphic positions
In the triploid case we have two possible options for the polymorphic positions:
3 “alleles” present, each with 33% of the reads
2 “alleles” present, one with 66% and the other with 33% of the reads
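The expected percentages for any ploidy follow from the ways the chromosome copies can be split among alleles. The sketch below enumerates them under idealised assumptions (no sequencing error, no mapping bias, at most four alleles because there are four bases); the function names are hypothetical.

```python
def dosage_patterns(ploidy, max_alleles=4):
    """Ways to split `ploidy` chromosome copies among 2..max_alleles alleles.

    Each pattern is a tuple of copy numbers in non-increasing order.
    """
    def partitions(total, largest, parts):
        if total == 0:
            yield ()
            return
        if parts == 0:
            return
        for first in range(min(total, largest), 0, -1):
            for rest in partitions(total - first, first, parts - 1):
                yield (first,) + rest

    return [p for p in partitions(ploidy, ploidy, max_alleles) if len(p) >= 2]

def expected_percentages(ploidy):
    """Expected read percentage of each allele for every polymorphic pattern."""
    return [tuple(round(100 * copies / ploidy, 1) for copies in pattern)
            for pattern in dosage_patterns(ploidy)]

for ploidy in (2, 3, 4):
    print(ploidy, expected_percentages(ploidy))
```

For ploidy 2 this gives the single 50/50 pattern, and for ploidy 3 the two options listed above; a tetraploid adds patterns such as 75/25 and 50/25/25.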
If you DO NOT have a reference genome, try to infer ploidy directly from your reads using kmer statistics with SmudgePlots
Ploidy: Smudgeplots
Slides from Kamil Jaron
Counting frequency of all k-mers in all sequencing reads
https://www.nature.com/articles/s41467-020-14998-3
Smudgeplots for (a) the triploid root-knot nematode Meloidogyne floridensis and (b) the octaploid strawberry Fragaria × ananassa.
Take pairs of kmers (above a minimum kmer coverage threshold) that differ by exactly one nucleotide. These kmers should be homologous and represent different alleles or different paralogues. If they are alleles, they can be used to estimate ploidy.
The two kmers in the pair are called A and B, where covA is greater than or equal to covB
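The pairing step can be sketched as follows. This is a simplified illustration and not the Smudgeplot implementation: it assumes a small in-memory dictionary of k-mer coverages, ignores reverse complements, and uses hypothetical names (`smudge_pairs`, `kmer_cov`, `min_cov`). For each pair it reports the total coverage covA + covB and the minor fraction covB / (covA + covB), the two quantities a smudgeplot displays.

```python
def smudge_pairs(kmer_cov, min_cov=1):
    """Find k-mer pairs that differ at exactly one position.

    Returns tuples (A, B, covA + covB, covB / (covA + covB)) with covA >= covB.
    """
    kept = {kmer: cov for kmer, cov in kmer_cov.items() if cov >= min_cov}
    pairs = []
    for kmer in kept:
        for i, base in enumerate(kmer):
            for alt in "ACGT":
                if alt <= base:
                    continue  # visit each unordered pair only once
                other = kmer[:i] + alt + kmer[i + 1:]
                if other in kept:
                    a, b = sorted((kmer, other), key=kept.get, reverse=True)
                    total = kept[a] + kept[b]
                    pairs.append((a, b, total, kept[b] / total))
    return pairs

# Hypothetical coverages; "ACGAA" falls below the threshold and is ignored.
kmer_cov = {"ACGTA": 40, "ACGTC": 20, "TTTTT": 30, "ACGAA": 2}
pairs = smudge_pairs(kmer_cov, min_cov=5)
print(pairs)
```

In this toy case the single pair has a minor fraction of one third, which is the signal expected for an AAB arrangement in a triploid.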
Strategy for near telomere-to-telomere assembly
https://arxiv.org/abs/2308.07877v1 
Chromosome-level phasing requires more than long reads. Long-range data is very important!
As good as they are, modern TGS reads still need error correction
https://www.nature.com/articles/s41592-020-01056-5
Assembly with overlap graphs
https://arxiv.org/abs/2308.07877v1 
Graphs need to be simplified
High-quality (near-perfect) reads allow us to look “just” for perfect overlaps, which greatly simplifies the overlap graph and permits identifying repeat copies and phasing the genome.
It is common to remove contained reads (yellow), i.e., reads fully contained in another read. However, this can lead to assembly gaps, particularly when phasing. It is one of the main problems for overlap/string graph assemblers.
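Perfect overlaps and contained reads can be illustrated with a small sketch. It assumes error-free reads on a single strand and an all-against-all comparison, which is only practical for toy data; the function names and the `min_len` threshold are hypothetical, and no particular assembler works exactly this way.

```python
def exact_overlap(a, b, min_len):
    """Longest proper suffix of `a` equal to a prefix of `b` (0 if shorter than min_len)."""
    for length in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def build_overlap_graph(reads, min_len=3):
    """Return (edges, contained): exact overlaps between non-contained reads."""
    contained = {r for r in reads for s in reads if r != s and r in s}
    kept = [r for r in reads if r not in contained]
    edges = {}
    for a in kept:
        for b in kept:
            if a != b:
                length = exact_overlap(a, b, min_len)
                if length:
                    edges[(a, b)] = length
    return edges, contained

# Toy reads; "GCGTA" lies entirely inside the first read.
reads = ["ATGCGTAC", "CGTACCTG", "GCGTA", "ACCTGAAT"]
edges, contained = build_overlap_graph(reads)
print(edges)
print(contained)
```

Dropping the contained read is harmless here, but if it were the only read carrying a haplotype-specific variant, that information would be lost from the graph.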
Assembly with de Bruijn graphs
https://arxiv.org/abs/2308.07877v1 
The value of k is usually much smaller than the read length, which can lead to the loss of some information.
Smaller values of k lead to more ambiguities, as shown in the figure. But much larger values of k can lead to contig breakpoints in low-coverage regions.
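The effect of k on ambiguity can be seen in a toy de Bruijn graph. The sketch below is a simplified illustration (single strand, no error handling, one made-up sequence containing the repeat GATTA twice); the function names are hypothetical.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branching_nodes(graph):
    """Nodes with more than one successor, i.e. ambiguous extensions."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)

# Toy sequence with the repeat "GATTA" in two different contexts.
genome = "CCGATTACAGGATTATT"
for k in (4, 7):
    print(k, branching_nodes(de_bruijn([genome], k)))
```

With k = 4 the node TTA has two possible successors, so the path through the repeat is ambiguous; with k = 7 each node spans the repeat plus flanking sequence and no node branches. The price of a larger k is that every k-mer must be fully covered by reads, which is harder in low-coverage regions.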
There is no single best k for all situations; modern assemblers can use a mixture of k values for different regions of the genome, based on read coverage.
DBG assemblers were very common for short-read technologies. Now, with near-perfect long reads, they are coming back.
Scaffolding
https://www.sciencedirect.com/science/article/pii/S1369526619301244
Figure 2. Leading strategies for scaffolding long read assemblies. (a) High throughput chromatin conformation capture (Hi-C) relies on the proximity of interactions from cross-linked chromatin to order contigs. Chromatin is cross-linked, digested with restriction enzymes and biotinylated, and the two chromatin ends are ligated and purified using streptavidin beads. The resulting library is sequenced and aligned to the genome to build a Hi-C interaction matrix for scaffolding contigs into pseudomolecules. (b) Optical maps utilize restriction enzymes and single molecule imaging to create a physical map of the genome. Long fragments of DNA are nicked using a restriction enzyme and labeled. DNA molecules are linearized and imaged, and fingerprints for each molecule are combined to create a consensus genome map. Contigs are overlaid on the genome map based on in silico digestion and anchored into scaffolds or pseudomolecules.
Assembly of haplotypes
https://www.sciencedirect.com/science/article/pii/S1369526619301244
Figure 3. Assembly approaches for sequencing and phasing heterozygous genomes. Long read assemblies allow assembly of multiple haplotypes from homologous chromosomes in heterozygous regions. The primary and alternative haplotypes can be collapsed into a single, non-redundant but chimeric pseudomolecule for simplicity of downstream analyses (top). Raw reads can be mapped to the contigs to resolve missing haplotype regions to create a phased, diploid assembly (middle). Partial haplotypes can be retained and labeled in a graph-based assembly (bottom).
Repeats and heterozygosity
Why do repetitive sequences make assembly difficult?
How difficult depends on the read length
Oftentimes repeats are collapsed, and the assembly is fragmented
Luckily . . .
Given long error-free reads, we can distinguish different repeat copies and successfully assemble them. Reads are never all entirely error-free, but when the read error rate is low enough and sequencing errors are sufficiently independent, we can correct most errors and achieve high-quality assembly (Li & Durbin, 2023).
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-S5-S18
Assembly quality assessment
Assess contamination of the assembly: BlobTools (https://blobtools.readme.io/docs)
Contiguity metrics: QUAST (http://quast.sourceforge.net/)
Reference-free completeness and quality: Merqury (https://github.com/marbl/merqury)
Reference-free completeness: LAI (https://github.com/oushujun/LTR_retriever)
Gene space completeness: BUSCO (https://busco.ezlab.org/)
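Among the contiguity metrics, N50 is the one most often quoted. A minimal sketch of its usual definition is shown below (the function name and the contig lengths are hypothetical): sort the contigs from longest to shortest and report the length at which the running sum first reaches half of the total assembly size.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly

# Hypothetical contig lengths (total 290, half is 145).
contigs = [100, 80, 50, 30, 20, 10]
print(n50(contigs))
```

N50 only describes contiguity: a misassembled genome can have a high N50, which is why it is listed here alongside completeness and contamination checks.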
https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13312
Merqury: Reference-free quality assessment
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02134-9
The catalog of kmers and their frequencies should be similar between the assembly and the reads; deviations suggest problems to be looked at:
A kmer frequent in the reads but absent from the assembly suggests that a part of the genome is missing from the assembly
On the other hand, a kmer more frequent in the assembly than in the reads suggests a false duplication in the assembly
The phasing accuracy can be measured when trio data is available.
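The k-mer comparison described above can be sketched as follows. This is a simplified illustration and not Merqury's implementation: it works on small in-memory strings, ignores reverse complements, and assumes the per-copy read depth (`read_depth`) is already known; all function and parameter names are hypothetical.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every k-mer of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def compare_kmers(reads, assembly, k, read_depth, min_read_cov=2):
    """Return (completeness, missing k-mers, over-represented k-mers)."""
    read_counts = Counter(km for r in reads for km in kmers(r, k))
    asm_counts = Counter(km for c in assembly for km in kmers(c, k))
    # "Solid" k-mers are seen often enough in the reads to be trusted.
    solid = {km for km, n in read_counts.items() if n >= min_read_cov}
    found = solid & set(asm_counts)
    missing = solid - found
    completeness = len(found) / len(solid) if solid else 0.0
    # Copies in the assembly beyond what the read coverage supports.
    duplicated = {km for km, n in asm_counts.items()
                  if km in read_counts
                  and n > max(1, round(read_counts[km] / read_depth))}
    return completeness, missing, duplicated

# Toy data: the genome was sequenced twice (read_depth = 2).
reads = ["ACGTTGCA", "ACGTTGCA"]
print(compare_kmers(reads, ["ACGTT"], k=4, read_depth=2))
print(compare_kmers(reads, ["ACGTTGCA", "ACGTT"], k=4, read_depth=2))
```

The first call models an assembly that lacks part of the genome, so completeness drops; the second models an assembly with an extra copy of a region, so those k-mers are flagged as over-represented.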
Contamination Assessment in assemblies
https://www.pnas.org/doi/full/10.1073/pnas.1600338113
https://f1000research.com/articles/6-1287
Blobtools
Contamination Assessment in reads
https://link.springer.com/chapter/10.1007/978-3-030-91814-9_6
ContFree-NGS
Evaluating sequence assemblies: Gene content
Look for sets of conserved genes: the more you find, the more complete your assembly is. Some software:
BUSCO
Compleasm
asmgene
https://www.biorxiv.org/content/10.1101/2023.06.03.543588v1
https://busco.ezlab.org/
Species                Ploidy (chromosome number)
Fragaria × ananassa    8x = 56
F. iinumae             2x = 14
F. nilgerrensis        2x = 14
F. nipponica           2x = 14
F. nubicola            2x = 14
F. orientalis          4x = 28
F. vesca               2x = 14
F. viridis             2x = 14
Re-sequencing vs de novo
In most cases re-sequencing is not worth it: it loses information and can lead to incorrect conclusions (due to re-arrangements).
As costs continue to decrease and algorithms for de novo assembly improve, de novo assembly is the best path in most cases. By combining high-quality third-generation sequencing (PacBio HiFi or ONT Q20+) with long-range information (Bionano/Hi-C), chromosome-scale assembly can be achieved.
That's all for today, folks!
[Figure residue: ploidyNGS histograms. x axis: Allele Freq; y axis: Count positions; legend: Type (First, Second, Third, Fourth). Panels: Haploid genome (First only; First and Second; First to Fourth), Diploid genome (First, Second), Triploid genome (First, Second).]