
lecture 5


BMMB 852: Applied Bioinformatics
Week 3, Lecture 5
István Albert
Bioinformatics Consulting Center
Penn State, 2014
One slide biology summary
•  Cells contain genetic material → DNA → large molecules called nucleic acids, built from 4 bases: Adenine, Thymine, Guanine, Cytosine
•  Stored in long segments → chromosomes → sum of chromosomes → genome
•  Parts of the genome get transcribed into RNA; some RNA (mRNA) gets translated into proteins
•  Proteins are made of amino acids (20 standard ones): Alanine, Threonine, Glycine, Cysteine, …
•  Sequencing technologies can be used to identify the bases of nucleic acids (DNA)
•  Important: we cannot directly sequence the DNA of a cell → there is always a laboratory protocol of substantial complexity → library preparation
•  Sequencing instruments sequence libraries.
BioStar: question of the day
Sequencing Technology Reviews
Sequencing Technologies: perspective
1st generation: Frederick Sanger develops DNA sequencing technology. Latest versions: 3 million bases/day, 1500bp long reads

2nd generation: (next-gen) sequencing started in 2005 with the release of the 454 sequencing platform. 600 billion bases/week, 150bp long reads

3rd generation: single molecule (no DNA amplification required); these are not replacing but augmenting 2nd generation systems: longer reads, shorter turnarounds

Sequencing Instruments
Sequence representation FASTA:
One of the most “dangerously” simple formats
•  Seemingly trivial but it is also “under-specified”; there are many “custom” extensions
•  Tools may make assumptions on the structure of a FASTA file
•  A surprising number of problems can arise
FASTA format
[Figure: a FASTA record with two labelled parts: the identifier, and the sequence (in a certain alphabet)]
The alphabet is similar to a specification: we need to know which characters are valid for describing the sequence
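One way to see which alphabet a file actually uses is to tally the characters on its sequence lines. The sketch below is only an illustration: the record, the temporary file and the function name `alphabet_counts` are all made up for this example.

```shell
# Create a tiny example file; the record below is made up for illustration.
example_fa=$(mktemp)
cat > "$example_fa" <<'EOF'
>seq1 hypothetical description
ACGTTGCA
NNACGT
EOF

# Print each distinct character found on the sequence lines with its count.
# Header lines (those starting with ">") are skipped.
alphabet_counts() {
    grep -v '^>' "$1" | fold -w 1 | sort | uniq -c
}

alphabet_counts "$example_fa"
```

Any character outside the expected alphabet (for DNA, typically A, C, G, T and N) is a hint that the file needs a closer look.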
Alphabets
•  International Union of Pure and Applied Chemistry (IUPAC) codes
•  Nucleic acid sequences
•  Peptide sequences → polypeptides could be proteins
A multi record FASTA
[Figure: a multi-record FASTA file; the header is labelled with its identifier and extra info]
It is not clear whether the sequence above contains nucleic acids or amino acids
(it feels like a nucleic acid because it has so many A, C, T and G characters, but those are also valid amino acid codes)
More considerations
•  Many tools will embed extra information into either the identifier or the “free zone”, the description section
•  See the FASTA format wiki page
•  Accession → unique (often numerical) identifier for each sequence that is entered into the database
•  Locus → an identifier that represents a position in the genome; multiple accessions may point to the same locus
•  Loci may have versions like: ABCD.1 ABCD.2
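As a small illustration of these parts, a header line can be split with standard shell tools. The header below is invented, and the split assumes the common convention that the identifier ends at the first space; some tools follow other conventions.

```shell
# A made-up header line: identifier first, free-form description after the first space.
header='>ABCD.2 hypothetical sample description'

# The identifier is everything between ">" and the first space.
identifier=$(printf '%s\n' "$header" | sed 's/^>//' | cut -d ' ' -f 1)

# The description is whatever follows the first space
# (this assumes the header contains at least one space).
description=$(printf '%s\n' "$header" | cut -d ' ' -f 2-)

# Remove a trailing ".N" version suffix, if there is one.
unversioned=${identifier%.*}

echo "identifier:  $identifier"
echo "description: $description"
echo "unversioned: $unversioned"
```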
First step of any sequence processing: understand your FASTA file
•  What is what in this file? What does the FASTA header look like?
1.  How many sequences do we have?

2.  Are sequences all on a single line or over multiple lines?

3.  What is the identifier, and what is embedded in the description?
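The three questions above can be answered with a few `grep` calls. This is a minimal sketch on a made-up file; the function names are hypothetical, and the line count assumes the file has no blank lines.

```shell
# A small made-up FASTA file to experiment on.
fa=$(mktemp)
cat > "$fa" <<'EOF'
>seq1 first made-up record
ACGTACGT
ACGT
>seq2 second made-up record
TTGACA
EOF

# 1. The number of sequences equals the number of header lines.
count_records() { grep -c '^>' "$1"; }

# 2. The number of sequence lines; if it is larger than the number of
#    records, at least one sequence is wrapped over multiple lines.
count_sequence_lines() { grep -vc '^>' "$1"; }

# 3. Look at the first few headers to see the identifier and description.
show_headers() { grep '^>' "$1" | head -n 5; }

count_records "$fa"
count_sequence_lines "$fa"
show_headers "$fa"
```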
Shell scripts
•  Used to collect multiple commands into a single program
•  Run the same commands again or on other data
•  Document the steps and describe an analysis thought process!
Creating scripts
You can run it with bash script-name.sh

Special variable names: $1, $2, $3 … command line parameters
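A minimal sketch of a script that reads its command line parameters follows. It is not the script shown in the lecture; the file contents and the label parameter are invented here, and the script is written to a temporary file only so the example is self-contained.

```shell
# Write a tiny script to a temporary file (normally you would use an editor).
script=$(mktemp)
cat > "$script" <<'EOF'
# Count the header lines in the FASTA file named by the first parameter.
# $1 is the path to a FASTA file.
# $2 is an optional label printed before the count.
input=$1
label=${2:-records}
echo "$label: $(grep -c '^>' "$input")"
EOF

# Made-up input data for the script.
data=$(mktemp)
printf '>a\nACGT\n>b\nTTGA\n' > "$data"

# Run the script with bash, passing the parameters after the script name.
bash "$script" "$data" sequences
```

Inside the script, `$1` is the first word after the script name and `$2` the second; `${2:-records}` falls back to a default when the second parameter is missing.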
Comments: why, what, where. Make them full, coherent sentences.
A more advanced version of the script
It uses functions to isolate different tasks; then it is easier to create custom tasks by chaining together various building blocks.

The art of data analysis: creating the right building blocks.
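The idea of chaining building blocks can be sketched as below. The lecture's own script is not reproduced in this text, so the functions `headers`, `identifiers` and `list_ids` are hypothetical stand-ins, run on made-up data.

```shell
# Made-up input data.
fa=$(mktemp)
printf '>seqB second record\nACGT\n>seqA first record\nTTGA\n' > "$fa"

# Building block: print only the header lines of a FASTA file.
headers() { grep '^>' "$1"; }

# Building block: read header lines from standard input and keep
# only the identifier, the first word without the leading ">".
identifiers() { sed 's/^>//' | cut -d ' ' -f 1; }

# Custom task: a sorted list of identifiers, made by chaining the blocks.
list_ids() { headers "$1" | identifiers | sort; }

list_ids "$fa"
```

Each function does one small job, so a new task is usually a new pipeline rather than new logic.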
Bash has lots of features
We will slowly introduce some features along the way. It is not a particularly good programming language. Don’t try to write overly complex programs. There are better solutions for that.
Homework 5
Write a script that selects the first 10 bases of every ebola virus assembly submitted to the PRJNA257197 project
1.  How many sequences have you obtained?

2.  How many unique sequence starts are there?

3.  Are there small or big differences between the genome starts?

4.  What is the most common sequence start?

5.  Which one do you think is the most dissimilar sequence start compared to the most common one?
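One possible starting point, assuming the assemblies have already been downloaded into a single local multi-FASTA file (the download step is not shown). The records below are invented, the function names are hypothetical, and the sketch assumes the first sequence line of each record has at least 10 characters.

```shell
# Made-up stand-in for the downloaded assemblies.
fa=$(mktemp)
cat > "$fa" <<'EOF'
>rec1 made-up record
CGGACACACAAAAAGAAAGAAG
AATTT
>rec2 made-up record
CGGACACACAAAAAGAAAGAAG
>rec3 made-up record
GGACACACAAAAAGAAAGAAGA
EOF

# Print the first 10 characters of the line that follows each header.
sequence_starts() {
    awk '/^>/ { want = 1; next } want { print substr($0, 1, 10); want = 0 }' "$1"
}

# Count how often each start occurs, most common first.
count_starts() { sequence_starts "$1" | sort | uniq -c | sort -rn; }

count_starts "$fa"
```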
