Dr. rer. nat. Diego Mauricio Riaño Pachón
Laboratório de Biologia Computacional, Evolutiva e de Sistemas
Centro de Energia Nuclear na Agricultura, Universidade de São Paulo
diego.riano@cena.usp.br
http://diriano.github.io/

Genomics

Use long reads
Two (three) technologies are readily available:
- PacBio CLR (Continuous Long Reads)
- PacBio CCS (HiFi) (Circular Consensus Sequences)
- Oxford Nanopore Technologies (ONT)

Key points for long-read sequencing
- Use fresh material, or collect it and freeze it.
- Extract high-molecular-weight (HMW) DNA.
- For ONT, eliminate short reads with a Circulomics kit.
- If you do the sequencing yourself with ONT, get used to nanopore problems.
- You need a server with plenty of RAM and CPU. For nanopore, a GPU is required for basecalling.
- Try different assemblers.
- Polish your assemblies. Always. Especially if you only use long reads with a high error rate.
- Scaffold the genome using Hi-C or genetic/physical (Bionano) maps.

Some problems when assembling a genome (plants)
- Genome size (get an idea of the size of your genome!)
- Deposit a specimen sample in a museum!
- Ploidy
- Heterozygosity (out-/in-breeder), phasing
- Repeats! Plants usually have lots of TEs; read length is important to resolve repeats.

Assembler: no one size fits all!
Everything on the market: OLC, de Bruijn, greedy
- Minimap2/miniasm
- hifiasm
- wtdbg2
- Flye
- Shasta
- Canu
- FALCON
- MaSuRCA

Advances in sequencing technology
Years shown: 2020, 2022, 2023
https://nanoporetech.com/accuracy
https://training.galaxyproject.org/training-material/topics/assembly/tutorials/get-started-genome-assembly/slides-plain.html
https://nanoporetech.com/about-us/news/22-highlights-remember-2022
https://twitter.com/nanopore/status/1480996225029652483
(a) Sanger and Illumina sequencing technologies have short read lengths but high per-base quality. Long-read sequencing technologies (PacBio and Nanopore) have reads that can exceed 1 Mb but have a much lower per-base quality. (b) Long-read sequencing technologies have driven the largest improvements in genome contiguity, or completeness, over the last ∼5 years.
These are schematic depictions of sequencing technologies and not actual data.

Genome size & heterozygosity: GenomeScope
https://www.nature.com/articles/s41467-020-14998-3
Counting the frequency of all k-mers in all sequencing reads.
K-mer spectra and fitted models for (a) diploid Arabidopsis thaliana and (b) triploid Meloidogyne enterolobii. Note that the diploid plot has two major peaks, while the triploid plot has three major peaks. Both also have high-frequency putative error k-mers with coverage near 1.
- Highly homozygous diploid: you can think of it as behaving like a haploid.
- Significantly heterozygous diploid.
- GenomeScope 2.0 can deal with different ploidy values (up to 6), but you must know the ploidy in advance!
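The k-mer counting behind these spectra can be sketched in a few lines. This is a minimal illustration, not GenomeScope's method: the function names, the `min_cov` cutoff and the toy read are assumptions made up here, k-mers are counted as written (not canonical), and the size estimate simply divides the number of non-error k-mers by the peak coverage instead of fitting a model.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (as written, not canonical) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_spectrum(kmer_counts):
    """Map each coverage value to the number of distinct k-mers seen that often."""
    return Counter(kmer_counts.values())


def naive_genome_size(spectrum, min_cov):
    """Rough estimate: k-mers at or above min_cov divided by the peak coverage.

    min_cov is a hypothetical cutoff used to skip low-coverage error k-mers.
    """
    solid = {cov: n for cov, n in spectrum.items() if cov >= min_cov}
    peak_cov = max(solid, key=solid.get)
    total_kmers = sum(cov * n for cov, n in solid.items())
    return total_kmers / peak_cov


# Toy input (made up): a single read in which ACGT occurs twice.
toy_reads = ["ACGTACGT"]
toy_spectrum = kmer_spectrum(count_kmers(toy_reads, 4))
print(dict(toy_spectrum))
```

A single-peak estimate like this is only reasonable for a spectrum that behaves like a haploid; heterozygous or polyploid spectra have several peaks and need a fitted model.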
About ploidy
- Haploid (N)
- Diploid (2N)
- Triploid (3N)
- Tetraploid (4N)
Terms shown: polyploid, diploid, haploid, monoploid
We have NGS genomic data, let the data speak!

If you have a monoploid reference genome you can use ploidyNGS.

ploidyNGS: Let the data tell you about the ploidy of your organism
We need the alignment of the genomic NGS reads to the genome sequence (reference genome + NGS reads) in BAM format.
Then we count the fraction of reads supporting each base ("allele") at every position in the alignment file (BAM).

ploidyNGS: Haploid case
In the haploid case we expect most positions in the alignment to be monomorphic or to have a low frequency of other bases ("alleles"), which mostly arise from sequencing errors or mapping errors.
Plot annotations: the abundance of the most frequent "allele" marks the monomorphic positions; the abundance of the second most frequent "allele" marks the sequencing errors.

ploidyNGS: Diploid case
In the diploid case we expect that, at polymorphic positions, the most frequent and the second most frequent "alleles" each appear in 50% of the reads at that position.
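The counting step described for ploidyNGS can be sketched as follows. This is a simplified illustration under stated assumptions, not the tool's actual code: the per-position base counts are hypothetical values written by hand (in practice they would be tallied from the BAM file), and both function names are invented for this example.

```python
from collections import Counter


def allele_percentages(base_counts):
    """Return the percentage of reads supporting each observed base, largest first."""
    depth = sum(base_counts.values())
    if depth == 0:
        return []
    return sorted(
        (100.0 * n / depth for n in base_counts.values() if n > 0),
        reverse=True,
    )


def rank_histograms(pileup, n_ranks=2):
    """For each allele rank (0 = most frequent), count positions per rounded percentage."""
    histograms = [Counter() for _ in range(n_ranks)]
    for base_counts in pileup:
        for rank, pct in enumerate(allele_percentages(base_counts)[:n_ranks]):
            histograms[rank][round(pct)] += 1
    return histograms


# Hypothetical per-position base counts (not real data).
toy_pileup = [
    {"A": 40},           # monomorphic position
    {"A": 38, "C": 2},   # minor base, likely a sequencing or mapping error
    {"G": 20, "T": 20},  # polymorphic position with two "alleles" at 50% each
]
first, second = rank_histograms(toy_pileup)
print(dict(first), dict(second))
```

Plotting `first` and `second` as histograms gives the kind of picture the slides describe: a pile-up near 100% for monomorphic positions, a small tail near 0% for errors, and a peak at 50% for a diploid.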
ploidyNGS: Triploid case
In the triploid case there are two possible options for the polymorphic positions:
- 3 "alleles" present, each with 33% of the reads
- 2 "alleles" present, one with 66% and the other with 33% of the reads

If you DO NOT have a reference genome, try to infer ploidy directly from your reads using k-mer statistics with Smudgeplots.

Ploidy: Smudgeplots
Slides from Kamil Jaron
https://www.nature.com/articles/s41467-020-14998-3
Counting the frequency of all k-mers in all sequencing reads.
Smudgeplots for (a) the triploid root-knot nematode Meloidogyne floridensis and (b) the octoploid strawberry Fragaria × ananassa.
Take pairs of k-mers (above a minimum k-mer coverage threshold) that differ by exactly one nucleotide. These k-mers should be homologous and represent different alleles or different paralogues. If they are alleles, they can be used to estimate ploidy. The two k-mers in a pair are called A and B, where covA is greater than or equal to covB.

Strategy for near telomere-to-telomere assembly
https://arxiv.org/abs/2308.07877v1

Chromosome-level phasing requires more than long reads. Long-range data is very important!
https://arxiv.org/abs/2308.07877v1

As good as they are, modern TGS reads still need error correction
https://www.nature.com/articles/s41592-020-01056-5

Assembly with overlap graphs
https://arxiv.org/abs/2308.07877v1
Graphs need to be simplified.
High-quality (near-perfect) reads allow us to look "just" for perfect overlaps, which greatly simplifies the overlap graph and permits identifying repeat copies and . . .
to phase the genome.

Assembly with overlap graphs
https://arxiv.org/abs/2308.07877v1
It is common to remove contained reads (yellow), i.e., reads contained in another read. However, this can lead to assembly gaps, particularly when phasing. It is one of the main problems for overlap/string graph assemblers.

Assembly with de Bruijn graphs
https://arxiv.org/abs/2308.07877v1
The value of k is usually much smaller than the read length, which can lead to the loss of some information. Smaller values of k lead to more ambiguities, as shown in the figure, but much larger values of k can lead to contig breakpoints in low-coverage regions.
There is no single best k for all situations; modern assemblers can use a mixture of k values for different regions of the genome, based on read coverage. DBG assemblers were very common for short-read technologies. Now, with near-perfect long reads, they are coming back.

Scaffolding
https://www.sciencedirect.com/science/article/pii/S1369526619301244
Figure 2. Leading strategies for scaffolding long read assemblies. (a) High throughput chromatin conformation capture (Hi-C) relies on the proximity of interactions from cross-linked chromatin to order contigs. Chromatin is cross-linked, digested with restriction enzymes and biotinylated, and the two chromatin ends are ligated and purified using streptavidin beads. The resulting library is sequenced and aligned to the genome to build a Hi-C interaction matrix for scaffolding contigs into pseudomolecules. (b) Optical maps utilize restriction enzymes and single molecule imaging to create a physical map of the genome. Long fragments of DNA are nicked using a restriction enzyme and labeled. DNA molecules are linearized and imaged, and fingerprints for each molecule are combined to create a consensus genome map. Contigs are overlaid on the genome map based on in silico digestion and anchored into scaffolds or pseudomolecules.
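Returning to the effect of k on de Bruijn graphs noted above, the following toy sketch shows how a smaller k produces more branching (ambiguous) nodes. It is an illustration only: the function names and the toy sequence are made up here, and real assemblers add error handling, canonical k-mers and coverage-based graph cleaning that are left out.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer in a read adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def count_branching_nodes(graph):
    """Number of nodes with more than one distinct successor (ambiguous extensions)."""
    return sum(1 for successors in graph.values() if len(successors) > 1)


# Toy sequence (made up) in which the 4-base repeat ACGT occurs twice.
toy_reads = ["TTACGTGGACGTCC"]
for k in (2, 3, 6):
    print(k, count_branching_nodes(build_de_bruijn(toy_reads, k)))
```

In this toy case the branching disappears once k - 1 exceeds the repeat length. The cost of a large k does not show up here because the single "read" covers everything; with real reads, neighbouring k-mers in low-coverage regions may never be observed together, which breaks contigs.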
Assembly of haplotypes
https://www.sciencedirect.com/science/article/pii/S1369526619301244
Figure 3. Assembly approaches for sequencing and phasing heterozygous genomes. Long read assemblies allow assembly of multiple haplotypes from homologous chromosomes in heterozygous regions. The primary and alternative haplotypes can be collapsed into a single, non-redundant but chimeric pseudomolecule for simplicity of downstream analyses (top). Raw reads can be mapped to the contigs to resolve missing haplotype regions to create a phased, diploid assembly (middle). Partial haplotypes can be retained and labeled in a graph-based assembly (bottom).

Repeats and heterozygosity
https://www.sciencedirect.com/science/article/pii/S1369526619301244

Why do repetitive sequences make assembly difficult?
How difficult depends on the read length. Oftentimes repeats are collapsed, and the assembly is fragmented.
Luckily . . .
Given long error-free reads, we can distinguish different repeat copies and successfully assemble them. Reads are never all entirely error-free, but when the read error rate is low enough and sequencing errors are sufficiently independent, we can correct most errors and achieve high-quality assembly (Li & Durbin, 2023).
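A toy way to see why the difficulty depends on read length: count how many error-free reads of a given length could have come from more than one place in the genome. The function name and the toy genome below are assumptions made up for illustration; real reads also carry errors, which this sketch ignores.

```python
from collections import Counter


def ambiguous_read_starts(genome, read_len):
    """Count start positions whose error-free read of length read_len also occurs elsewhere."""
    windows = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    occurrences = Counter(windows)
    return sum(1 for window in windows if occurrences[window] > 1)


# Toy genome (made up) in which the 4-base repeat ACGT occurs twice.
toy_genome = "TTACGTGGACGTCC"
for read_len in (3, 4, 5):
    print(read_len, ambiguous_read_starts(toy_genome, read_len))
```

Reads no longer than the repeat can be placed in either copy, while reads that extend past the repeat into unique flanking sequence can only be placed once.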
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-S5-S18

Assembly quality assessment
https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13312
- Contamination of the assembly: Blobtools (https://blobtools.readme.io/docs)
- Contiguity metrics: Quast (http://quast.sourceforge.net/)
- Reference-free completeness and quality: Merqury (https://github.com/marbl/merqury)
- Reference-free completeness: LAI (https://github.com/oushujun/LTR_retriever)
- Gene space completeness: BUSCO (https://busco.ezlab.org/)

Merqury: Reference-free quality assessment
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02134-9
The catalog of k-mers and their frequencies should be similar between the assembly and the reads. Deviations from this suggest problems to look at:
- A k-mer that is frequent in the reads but absent from the assembly suggests that part of the genome is missing from the assembly.
- Conversely, a k-mer that is more frequent in the assembly than in the reads suggests a false duplication in the assembly.
The phasing accuracy can be measured when trio data is available.

Contamination assessment in assemblies
https://www.pnas.org/doi/full/10.1073/pnas.1600338113
https://f1000research.com/articles/6-1287
Blobtools

Contamination assessment in reads
https://link.springer.com/chapter/10.1007/978-3-030-91814-9_6
Cont-Free NGS

Evaluating sequence assemblies: Gene content
Look for sets of conserved genes: the more you find, the more complete your assembly is. Some software:
- BUSCO
- Compleasm
- asmgene
https://www.biorxiv.org/content/10.1101/2023.06.03.543588v1
https://busco.ezlab.org/

Evaluating sequence assemblies: Gene content
Species                Ploidy
Fragaria x ananassa    8x = 56
F. iinumae             2x = 14
F. nilgerrensis        2x = 14
F. nipponica           2x = 14
F. nubicola            2x = 14
F. orientalis          4x = 28
F. vesca               2x = 14
F. viridis             2x = 14

Re-sequencing vs de novo
In most cases re-sequencing is not worth the loss of information and the incorrect conclusions (due to rearrangements).
As costs continue to decrease and algorithms for de novo assembly improve, de novo assembly is the best path in most cases.
With a combination of high-quality third-generation sequencing technologies (PacBio HiFi or ONT Q20+) and long-range information (Bionano/Hi-C), chromosome-scale assembly can be achieved.

That's all for today, folks

[Figures: ploidyNGS histograms of allele frequency (x-axis, "Allele Freq") against count of positions (y-axis), by allele type. Haploid genome: first; first and second; first to fourth most frequent alleles. Diploid genome: first and second. Triploid genome: first and second.]