Baixe o app para aproveitar ainda mais
Prévia do material em texto
BMMB#852:"Applied"Bioinforma0cs" " "Week"3,"Lecture"5" István#Albert# # Bioinforma0cs"Consul0ng"Center" Penn"State,"2014" One"slide"biology"summary"" • Cells"contain"gene0c"material"!"DNA"!"large"molecules"called"nucleic#acids#4"of" them:"Adenine,#Thymine,#Guanine,#Cytosine" # • Stored"in"long"segments"!"chromosomes"!"sum"of"chromosomes"!"genome" • Parts"of"the"genome"get"transcribed"into"RNA"some"RNA"(mRNA)"gets"translated" into"proteins"" • Proteins"are"made"of"amino=acids#(23"of"them):"Alanine,"Threonine,"Glycine," Cysteine,"…" • Sequencing"technologies"can"be"used"to"iden0fy"nucleic"acids"(DNA)" • Important:"we"cannot"directly"sequence"DNA"of"a"cell"!"there"is"always"a" laboratory"protocol"of"substan0al"complexity"!"library#prepara@on# • Sequencing"instruments"sequence"libraries." " BioStar:"ques0on"of"the"day" Sequencing"Technology"Reviews" Sequencing"Technologies"T"perspec0ve" 1st##genera@on:"Frederic"Sanger"develops"DNA" sequencing"technology."Latest"versions"3"million"bases/ day,"1500bp"long"reads" " 2nd#genera@on:"(nextTgen)"sequencing"started"2005"with" the"release"of"the"454"sequencing"plaWorm."600"billion" bases/week,"150bp"long"reads" " 3rd#genera@on:"single"molecule"(no"DNA"amplifica0on" required),"these"are"not"replacing"but"augumen0ng"2nd" genera0on"systems,"longer"reads,"shorter"turnarounds" " " " Sequencing"Instruments" Sequence"representa0on"FASTA:"" One"of"the"most"“dangerously”"simple"formats" • Seemingly"trivial"but"it"is"also"“underTspecified”," there"are"many"“custom”"extensions"" • Tools"may"make"assump0ons"on"the"structure" of"a"FASTA"file" • Surprising"number"of""problems"can"arise" FASTA"format"iden0fier" sequence"(in"a"certain"alphabet)" The"alphabet"is"similar"to"a"specifica0on:"we"need"to"know"what"are"the"valid" characters"to"describe"the"sequence" Alphabets" • Interna0onal"Union"of"Pure"and"Applied" Chemistry"(IUPAC)"codes" • Nucleic"acid"sequences" • Pep0de"sequences"!"polypep0des"could"be" proteins" A"mul0"record"FASTA" It"is"not"clear"what"the"sequence"above"contains"nucleic"acids"or"aminoacids" " (feels"like"a"nucleic"acids"because"of"having"so"many"ACTG"both"those"are"also"valid"amino" acids)" iden0fier" extra"info" More"considera0ons" • Many"tools"will"embed"extra"informa0on"into"either"the"iden0fier"or"the"“free" zone”"the"descrip0on"sec0on" • See"the"FASTA"format"wiki"page" • Accession"!"unique"(o`en"numerical)"iden0fier"for"each"sequence"that"is"entered" into"the"database" • Locus#=>#an"inden0fier"that"represents"a"posi0on"in"the"genome,"mul0ple" accessions"may"point"to"the"same"locus"" • Loci"may"have"versions"like:"ABCD.1"ABCD.2# First"step"of"any"sequence"processing"step" understand"your"FASTA"file" • What’s"is"what"in"this"file,"what"does"the"fasta" header"look"like?" 1. How"many"sequences"do"we"have" " 2. Are"sequences"all"on"a"single"line"or"over" mul0ple"lines" " 3. What"is"the"iden0fier,"what"is"embedded"in"the" descrip0on" " Shell"scripts" • Used"to"collect"mul0ple"commands"into"a" single"program" • Run"the"same"commands"again"or"on"other" data" • Document"the"steps"and"describe"and"analysis"" thought"process!" " Crea0ng"scripts" You"can"run"it"with"bash#script=name.sh# # Special"variable"names:"$1,#$2,#$3#…#command"line"parameters# Comments," Why,"what,"where." Make"them"full" coherent"sentences." A"more"advanced"version"of"the"script" It"uses"func0ons"to"isolate"different"tasks,"then"it"is"easier"to"create" custom"tasks"by"chaining"together"various"building"blocks." " The"art"of"data"analysis:"crea0ng"the"right"building"blocks." Bash"has"lots"of"features" We"will"slowly"introduce"some"features"along"the"way."It"is"not"a"" par0cularly"good"programming"language."Don’t"try"to"write"overly"complex" programs."There"are"beger"solu0ons"for"that." Homework"5" Write"a"script"that"selects"the"first"10"bases"of"every" ebola"virus"assembly"submiged"to"the"PRJNA257197" project"" 1. How"many"sequences"have"you"obtained?" " 2. How"many"unique"sequence"starts"are"there?" " 3. Are"there"small"or"big"differences"between"the"genome"starts?" "" 4. What"is"the"most"common"sequence"start?"" " 5. Which"one"do"you"think"is"the"most"dissimilar"sequence"start" compared"to"the"most"common"one?"
Compartilhar