Bioinformatics analysis of disordered proteins in prokaryotes
A Practical Overview of Bioinformatics - BioinfoHelpDesk.org
-
Upload
khangminh22 -
Category
Documents
-
view
1 -
download
0
Transcript of A Practical Overview of Bioinformatics - BioinfoHelpDesk.org
A Practical Overview of A Practical Overview of A Practical Overview of A Practical Overview of BioinformaticsBioinformaticsBioinformaticsBioinformatics
Chuong Huynh (Chuong Huynh (HuHuỳỳnhnh ChươngChương))
[email protected]@bioinfohelpdesk.org
Lac Hong UniversitySeptember 9, 2008
What is bioinformatics? - Definition
• My definition – bringing biological themes to computers
• Peter Elkin: Primer on Medical Genomics: Part V: Bioinformatics– “Bioinformatics is the discipline that develops and applies informatics to the
field of molecular biology.”
• BISTIC Bioinformatics Definition– “Research, development, or application of computational tools and
approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data”
• BISTIC Computational Biology Definition– “Computational Biology: the development and application of data-
analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.”
• http://www.bisti.nih.gov/
Useful/Necessary Bioinformatics Skills
• Strong background in some aspects of molecular biology!!!• Ability to communicate biological questions comprehensibly to
computer scientists• Thorough comprehension of the problem in the bioinformatics
field• Statistics (association studies, clustering, sampling)• Ability to filter, parse, and munge data and determine the
relationships between the data sets• Mathematics (e.g. algorithm development)• Engineering (e.g. robotics)• Good knowledge of a few molecular biology software packages
(molecular modeling / sequence analysis)• Command line computing environment (Linux/Unix knowledge)• Data administration (esp. relational database concept) and
Computer Programming Skills/Experience (C/C++, Sybase, Java, Oracle) and Scripting Language Knowledge (Perl and perhaps Phython)
High throughput DNA sequencing Centre
gggtctctcttgttagaccagatctgagcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgccttgagtgcttcaagtagtgtgtgcccgtctgttgtgtgactctgatagctagagatcccttcagaccaaatttagtcagtgtgaaa
aatctctagcagtggcgcctgaacagggacttgaaagcgaaagagaaaccagagaagctctctcgacgcaggactcggcttgctgaagcgcgcacggcaagaggcgaggggacggcgactggtgagtacgccaaaattttgactagcggaggctagaaggagagagatgggt
gcgagagcgtcgatattaagcgggggaggattagatagatgggaaaaaattcggttaaggccagggggaaagaaaaaatatagattaaaacatttagtatgggcaagcagggagctagaacgattcgcagtcaatcctggcctattagaaacatcagaaggttgtagacaaa
tactgggacaactacaaccagcccttcagacaggatcagaagaacttagatcattatataatacagtagcaaccctctattgtgtgcatcaaaagatagatgtaaaagacaccaaggaagctttagataagatagaggaagagcaaaacaaaagtaagaaaaaagcacagca
agcagcagctgacacaggaaatagcagccaggtcagccaaaattaccccatagtgcagaacatccaggggcaaatggtacatcaggccatatcacctagaactttaaatgcatgggtaaaagtagtagaagagaaggctttcagcccagaagtaatacccatgttttcagca
ttatcagaaggagccaccccacaagatttaaacaccatgctaaacacagtggggggacatcaagcagccatgcaaatgttaaaagagaccatcaatgaggaagctgcagaatgggatagattgcatccagtgcatgcagggcctcatccaccaggccagatgagagaaccaa
ggggaagtgacatagcaggaactactagtacccttcaggaacaaatagcatggatgacaaataatccacctatcccagtaggagaaatctataagagatggataatcctgggattaaataaaatagtaaggatgtatagccctaccagcattctggacataaaacaaggacc
aaaggaaccctttagagactatgtagaccggttctataagactctaagagccgagcaagcttcacaggaggtaaaaaattggatgacagaaaccttgttggtccaaaatgcgaacccagattgtaagactattttaaaagcattgggaccagcagctacactagaagaaatg
atgacagcatgtcagggagtgggaggacccggccataaagcaagagttttggcagaagcaatgagccaagtaacaaattcagctaccataatgatgcagaaaggcaattttaggaaccaaagaaaaattgttaagtgtttcaattgtggcaaagaagggcacatagccaaaa
attgcagggcccctaggaaaaggggctgttggaaatgtggaaaggagggacaccaaatgaaagattgtactgagagacaggctaattttttagggaaaatctggccttcccacaggggaaggccagggaattttcctcagaacagactagagccaacagccccaccagcccc
accagaagagagcttcaggtttggggaagagacaacaactccctctcagaagcaggagctgatagacaaggaactgtatccttcagcttccctcaaatcactctttggcaacgaccccttgtcacaataaagataggggggcaactaaaggaagctctattagatacaggag
cagatgatacagtattagaagaaataaatttgccaggaagatggaaaccaaaaatgatagggggaattggaggttttatcaaagtaagacagtatgatcaaatactcgtagaaatctgtggacataaagctataggtacagtattagtaggacctacacctgtcaacataat
tggaagaaatctgttgactcagattggttgcactttaaattttcccattagtcctattgaaactgtaccagtaaaattaaagccaggaatggatggcccaaaagttaaacaatggccattgacagaagaaaaaataaaagcattagtagaaatctgtacagaaatggaaaag
gaaggaaaaatttcaaaaatcgggcctgaaaatccatataatactccagtatttgccataaagaaaaaagacagtactaaatggagaaaattagtagatttcagagaacttaataagaaaactcaagacttctgggaagttcaattaggaataccacatcccgcagggttaa
aaaagaaaaaatcagtaacagtactggatgtgggtgatgcatatttttcagttcccttagataaagaattcaggaagtacactgcatttaccatacctagtataaacaatgagacaccagggattagatatcagtacaatgtgcttccacagggatggaaaggatcaccagc
aatattccaaagcagcatgacaaaaatcttagagccttttagaaaacaaaatccagacatagttatctatcaatacatggacgatttgtatgtaggatctgacttagaaatagggcagcatagaacaaaaatagaggaactgagacaacatctgttgaagtggggatttacc
acaccagacaaaaaacatcagaaagaacctccattcctttggatgggttatgaactccatcctgataaatggacagtacagcctatagtgctgccagaaaaggacagctggactgtcaatgacatacagaagttagtgggaaaattgaattgggcaagtcagatttacccag
ggattaaagtaaagcaattatgtagactccttaggggaaccaaggcactaacagaagtaataccactaacaaaagaagcagagctagaactggcagaaaacagggaaattctaaaagaaccagtacatggagtgtattatgacccatcaaaagacttaatagcggaaataca
gaagcaggggcaaggtcaatggacatatcaaatttatcaagagccatttaaaaatctgaaaacaggaaaatatgcaagaatgaggggtgcccacactaatgatgtaaaacaattaacagaggcagtgcaaaaaataaccacagaaagcatagtaatatggggaaagactcct
aaatttaaactacccatacaaaaagaaacatgggaaacatggtggacagagtattggcaagccacctggattcctgagtgggagtttgtcaatacccctcccttagtaaaattatggtaccagttagagaaagaacccataataggagcagaaactttctatgtagatgggg
cagctaacagggagactaaattaggaaaagcaggatatgttactaacaaagggagacaaaaagttgtctccataactgacacaacaaatcagaagactgagttacaagcaattcttctagcattacaggattctggattagaagtaaacatagtaacagactcacaatatgc
attaggaatcattcaagcacaaccagataaaagtgaatcagagatagtcagtcaaataatagagcagttaataaaaaaagaaaaggtctacctgacatgggtaccagcgcacaaaggaattggaggaaatgaacaagtagataaattagtcagtactggaatcaggaaagta
ctctttttagatggaatagataaagcccaagaagaacatgaaaaatatcacagtaattggagggcaatggctagtgattttaacctgccacctgtggtagcaaaagagatagtagccagctgtgataaatgtcagctaaaaggagaagccatgcatggacaagtagactgta
gtccaggaatatggcaactagattgtacacatttagaaggaaaaattatcctggtagcagttcatgtagccagtggatatatagaagcagaagttattccagcagaaacagggcaggaaacagcatactttctcttaaaattagcaggaagatggccagtaaaaacagtaca
tacagacaatggcagcaatttcaccagtactacagttaaggccgcctgttggtgggcaggaatcaagcaggaatttggcattccctacaatccccaaagtcaaggagtagtagaatctataaataaagaattaaagaaagttataggacagataagagatcaggctgaacat
cttaagacagcagtacaaatggcagtattcatccacaattttaaaagaaaaggggggattggggggtacagtgcaggggaaagaatagtagacataatagcaacagacatacaaactaaagaactacaaaaacaaattacaaaaattcaaaattttcgggtttattacaggg
acagcagagatccactttggaaaggaccagcaaagcttctctggaaaggtgaaggggcagtagtaatacaagataatagtgacataaaagtagtgccaagaagaaaagcaaagatcattagggattatggaaaacagatggcaggtgatgattgtgtggcaagtagacagga
tgaggattagaacatggaaaagtttagtaaaacaccatatgtatgtttcaaggaaagctaagggatggttttatagacatcactatgaaagtactcatccgagaataagttcagaagtacacatcccactagggaatgcaaaattggtaataacaacatattggggtctaca
tacaggagaaagagactggcatttgggtcaaggagtctccatagaattgaggaaaaggagatatagcacacaattagaccctaacctagcagaccaactaattcatctgcattactttgattgtttttcagaatctgctataagaaatgccatattaggacatatagttagc
cctaggtgtgaatatcaagcaggacataacaaggtaggatctctacagtacttggcactaacagcattagtaagaccaagaaaaaagataaagccacctttgcctagtgttacaaaactgacagaggatagatggaacaagccccagaagaccaagggccacaaagggaacc
atacaatgaatggacactagaacttttagaggagctcaagaatgaagctgttagacattttcctaggatatggctccatagcttagggcaacatatctatgaaacttatggagatacttgggcaggagtggaagccataataagaattctgcaacaactgctgtttattcat
ttcagaattgggtgtcaacatagcagaatagacattcttcgacgaaggagagcaagaaatggagccagtagatcctagactagagccctggaagcatccaggaagtcagcctaggactgcttgtaccaattgctattgtaaaaagtgttgctttcattgccaagtttgtttc
ataacaaaaggcttaggcatctcctatggcaggaagaagcggagacagcgacgaagagctcctcaagacagtcagactcatcaagtttctctatcaaagcagtaagtagtacatgtaatgcaatctttacaaatattagcagtagtagcattagtagtagcagcaataatag
caatagttgtgtggtccatagtattcatagaatataggaaaataagaagacaaaacaaaatagaaaggttgattgatagaataatagaaagagcagaagacagtggcaatgagagtgacggagatcaggaagaattatcagcacttgtggaaatggggcacgatgctccttg
ggatgttaatgatctgtaaagctgcagaaaatttgtgggtcacagtttattatggggtacctgtgtggaaagaagcaaccaccactctattttgtgcctcagatgctaaagcgtatgatacagaggtacataatgtttgggccacacatgcctgtgtacccacagaccccaa
cccacaagaagtagaactgaagaatgtgacagaaaattttaacatgtggaaaaataacatggtagaccaaatgcatgaggatataattagtttatgggatcaaagcctaaagccatgtgtaaaattaaccccactctgtgttactttaaattgcactgattatgggaatgat
actaacaccaataatagtagtgctactaaccccactagtagtagcgggggaatggaggggagaggagaaataaaaaattgctctttcaatatcaccagaagcataagagataaagtgaagaaagaatatgcacttttttatagtcttgatgtaataccaataaaagatgata
atactagctataggttgagaagttgtaacacctcagtcattacacaggcctgtccaaaggtatcctttgaaccaattcccatacattattgtgccccggctggttttgcgattctaaagtgtaatgataaaaagttcaatggaaaaggaccatgtacaaatgtcagcacagt
acaatgtacacatggaattaggccagtagtatcaactcaactgctgttaaatggcagtctagcagaagaagaggtagtaattagatcagacaatttctcggacaatgctaaagtcataatagtacatctgaatgaatctgtagaaattaattgtacaagactcaacaacatt
acaaggagaagtatacatgtaggacatgtaggaccaggcagagcaatttatacaacaggaataataggaaaaataagacaagcacattgtaacattagtagagcaaaatggaataacactttaaaacagatagttacaaaattaagagaacaatttaagaataaaacaatag
tctttaatcaatcctcaggaggggacccagaaattgtaatgcacagttttaattgtggaggggaatttttctactgtaattcaacacaactgtttaacagtacttggaatggtactgcatggtcaaataacactgaaggaaatgaaaatgacacaatcacactcccatgcag
aataaaacaaattataaacatgtggcaggaagtaggaaaagcaatgtatgcacctcccatcagaggacaaattagatgttcatcaaatattacagggctgatattaacaagagatggtggtattaaccagaccaacaccaccgagattttcaggcctggaggaggagatatg
aaggacaattggagaagtgaattatataaatataaagtagtaaaaattgaaccattaggagtagcacccaccaaggcaaagagaagagtggtgcaaagagaaaaaagagcagtgggaataataggagctatgctccttgggttcttgggagcagcaggaagcactatgggcg
cagcgtcaatgacgctgacggtacaggccagacaattattgtctggtatagtgcaacagcagaacaatttgctgagggctattgaggcgcaacagcatctgttgcacctcacagtctggggcatcaagcagctccaagcaagagtcctggctgtggaaagatacctaaggga
tcaacagctcctggggttttggggttgctctggaaaactcatttgcaccactgctgtgccttggaatactagttggagtaataaatctctgagtcagatttgggataacatgacctggatgcagtgggaaagggaaattgataattacacaagcttaatatacaacttaatt
gaagaatcgcaaaaccaacaagaaaagaatgaacaagagttattggaattagataactgggcaagtttgtggaattggtttagcataacaaattggctgtggtatataaaaatattcataatgatagtaggaggcttggtaggtttaagaatagtttttactgtactttcta
tagtaaatagagttaggcagggatactcaccattgtcgtttcagacgcgcctcccagccaggaggggacccgacaggcccgaaggaatcgaagaagaaggtggagagagagacagagacagatccggtcaattagtggatggattcttagcaattatctgggtcgacctgcg
gagcctgtgcctcttcagctaccaccgcttgagagacttactcttgattgtaacgaggattgtggaacttctgggacgcagggggtgggaagccctcaaatattggtggaatctcctacaatattggattcaggaactaaagaatagtgctgttagcttgctcaacgccaca
gccatagcagtagctgagggaactgatagggttatagaagtattacaaagagcttgtagagctattctccacatacctagaagaataagacagggcttagaaagggctttgcaataagatgggtggtaagtggtcaaaaagtagtaaaattggatggcctactgtaagggaa
agaatgagaagagctgagccagcagcagatggggtgggagcagtatctcgagacctggaaaaacatggagcaatcacaagtagtaatacagcaactaacaatgctgattgtgcctggctagaagcacaagaggaggaggaggtgggttttccagtcagacctcaggtacctt
taagaccaatgacttacaagggagcgttagatcttagccactttttaaaagaaaaggggggactggaagggctaatttggtcccagaaaagacaagacatccttgatttgtgggtccaccacacacaaggctacttccctgattggcagaactacacaccagggccagggat
cagatatccactgacctttggttggtgcttcaagctagtaccagttgagccagagaaggtagaagaggccaatgaaggagagaacaacagattgttacaccctgtgagcctgcatgggatggaggacccggagaaagaagtgttagtatggaggtttgacagccgcctagta
ctccgtcacatggcccgagagctgcatccggagtactacaaggactgctgacactgagctttctacaagggactttccgctggggactttccagggaggcgtggcctgggcgggactggggagtggcgagccctcagatgctgcatataagcagctgctttttgcctgtact
gggtctctcttgttagaccagatctgagcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgccttgagtgcttca
DNA sequences are meaningless!DNA sequences are meaningless!DNA sequences are meaningless!DNA sequences are meaningless!Interpreting data from many
sources
Problems and ChallengesProblems and ChallengesProblems and ChallengesProblems and Challenges
• Know the sequence of every possible transcript but not understand the functions of these transcripts and their corresponding proteins!
• How to make sense of all of the gene and protein data in order to assign functions to these genes and proteins and to understand biological processes at the molecular level?
From gene to protein and its From gene to protein and its From gene to protein and its From gene to protein and its function(sfunction(sfunction(sfunction(s))))
> DNA sequence> DNA sequenceAATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACACAATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACACTGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAATGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAATCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTATCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTAACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGGACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGGTTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAATTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAATTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTGTTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTGGTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGAGTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGACGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGCCGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGCTACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGATACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGAACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGGACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGGTAAGAAGATCGCGAACATCTAGTAGATAAGAAGATCGCGAACATCTAGTAGAGeneGeneGeneGene
> Protein sequence> Protein sequenceMKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSMKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNIDELLNEDILILGCSAMGDEVLEESEFEPFIEEISDVNIDELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISGKKVALFGSYGWGDGKWMRDFEERMNGYGTKISGKKVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNEPDEAEQDCIEFGKKIANICVVVETPLIVQNEPDEAEQDCIEFGKKIANIFunctionFunctionFunctionFunction
What is the function of these structures?What is the function of these structures?What is the function of this sequence?What is the function of this sequence?What is the function of this motif?What is the function of this motif?–– the fold provides a scaffold, which can be decorated the fold provides a scaffold, which can be decorated in different ways by different sequences to confer in different ways by different sequences to confer different functionsdifferent functions–– knowing the fold & function allows us to knowing the fold & function allows us to rationaliserationalisehow the structure effects its function at the molecular how the structure effects its function at the molecular levellevelGoals of Functional GenomicsGoals of Functional GenomicsGoals of Functional GenomicsGoals of Functional Genomics Annotation of eukaryotic genomes
transcription
RNA processing
translation
AAAAAAA
Genomic DNA
Unprocessed RNA
Mature mRNA
Nascent polypeptide
folding
Reactant A Product BFunction
Active enzyme
ab initio gene
prediction
Comparative gene
prediction
Functional
identification
Gm3
TranscriptionTranscription
DNADNA55’’ 33’’
mRNAmRNA SplicingSplicing
TranslationTranslation
PolyPoly--peptidepeptideFoldingFolding
ProteinProtein
•• Transport / LocalizationTransport / Localization
•• OligomerizationOligomerization•• PostPost--Translational ModificationTranslational Modification
FunctionFunction FunctionFunction
Bioinformatics is needed in all levels
TranscriptionTranscription
DNADNA55’’ 33’’
mRNAmRNA SplicingSplicing
TranslationTranslation
PolyPoly--peptidepeptideFoldingFolding
ProteinProtein
•• Transport / LocalizationTransport / Localization
•• OligomerizationOligomerization•• PostPost--Translational ModificationTranslational Modification
FunctionFunction FunctionFunction
At Genome Level Genome Projects need to store and organize DNA sequences
TranscriptionTranscription
DNADNA55’’ 33’’
mRNAmRNA SplicingSplicing
TranslationTranslation
PolyPoly--peptidepeptideFoldingFolding
ProteinProtein
•• Transport / LocalizationTransport / Localization
•• OligomerizationOligomerization•• PostPost--Translational ModificationTranslational Modification
FunctionFunction FunctionFunction
At Transcription Level How do we find protein How do we find protein How do we find protein How do we find protein coding regions, coding regions, coding regions, coding regions, intronsintronsintronsintronsand and and and exonsexonsexonsexons in genomic in genomic in genomic in genomic DNA sequences?DNA sequences?DNA sequences?DNA sequences?TranscriptionTranscription
DNADNA55’’ 33’’
mRNAmRNA SplicingSplicing
TranslationTranslation
PolyPoly--peptidepeptideFoldingFolding
ProteinProtein
•• Transport / LocalizationTransport / Localization
•• OligomerizationOligomerization•• PostPost--Translational ModificationTranslational Modification
FunctionFunction FunctionFunction
At Transcription Level Under which condition is a certain gene transcribed?
TranscriptionTranscription
DNADNA55’’ 33’’
mRNAmRNA SplicingSplicing
TranslationTranslation
PolyPoly--peptidepeptideFoldingFolding
ProteinProtein
•• Transport / LocalizationTransport / Localization
•• OligomerizationOligomerization•• PostPost--Translational ModificationTranslational Modification
FunctionFunction FunctionFunction
At Translation Level
What do we know about a specific protein?TranscriptionTranscription
DNADNA55’’ 33’’
mRNAmRNA SplicingSplicing
TranslationTranslation
PolyPoly--peptidepeptideFoldingFolding
ProteinProtein
•• Transport / LocalizationTransport / Localization
•• OligomerizationOligomerization•• PostPost--Translational ModificationTranslational Modification
FunctionFunction FunctionFunction
At Translation Level
How can we compare protein sequences?
TranscriptionTranscription
DNADNA55’’ 33’’
mRNAmRNA SplicingSplicing
TranslationTranslation
PolyPoly--peptidepeptideFoldingFolding
ProteinProtein
•• Transport / LocalizationTransport / Localization
•• OligomerizationOligomerization•• PostPost--Translational ModificationTranslational Modification
FunctionFunction FunctionFunction
At Structure Level
Can we predict Can we predict Can we predict Can we predict protein structures?protein structures?protein structures?protein structures? What are some basic bioinformatics tools that can assist biologists in
answering these biological questions?
Web Access: http://Web Access: http://Web Access: http://Web Access: http://www.ncbi.nlm.nih.govwww.ncbi.nlm.nih.govwww.ncbi.nlm.nih.govwww.ncbi.nlm.nih.gov The Entrez System: Text Searches
BLAST: Sequence Similarity Searches VAST: Structure Similarity Searches
SWISSSWISSSWISSSWISS----PROT/PROT/PROT/PROT/TrEMBLTrEMBLTrEMBLTrEMBL• Collaboration between the SIB
(CH) and EMBL/EBI (UK)
• SWISS-PROT: Fully annotated(manually), non-redundant, cross-referenced, documentedprotein sequence database
• TrEMBL: is automatically generated (from annotated EMBL coding sequences (CDS)) and annotated using software tools
ProteinProteinProteinProtein Data BankData BankData BankData Bank
• Sequence similarity: BLAST and others
• Beyond sequence similarity:matching sequences and shapes (threading)
Assigning function by similarity
MNIFEMLRID EGLRLKIYKD TEGYYTIGIG
HLLTKSPSLN AAKSELDKAI GRNCNGVITK
DEAEKLFNQD VDAAVRGILR NAKLKPVYDS
LDAVRRCALI NMVFQMGETG VAGFTNSLRM
LQQKRWDEAA VNLAKSRWYN QTPNRAKRVI
TTFRTGTWDA YKNL
• Ab initio modeling
• Threading & Fold Recognition
• Homology Modeling
?
Protein Structure Modeling
Rational Drug Design
• Understanding how structures bind other molecule (function)
• Designing inhibitors
• Docking, structure modeling
Drug Lead Screening & Docking
Complementarity- Shape- Chemical- Electrostatic
?
Drug Development Life CycleDrug Development Life CycleDrug Development Life CycleDrug Development Life Cycle
YearsYears
0 2 4 6 0 2 4 6 8 10 12 14 168 10 12 14 16
Discovery (2 to 10 Years)
Preclinical Testing(Lab and Animal Testing)
Phase I(20-30 Healthy Volunteers used to check for safety and dosage)
Phase II(100-300 Patient Volunteers used to check for efficacy and side effects)
Phase III(1000-5000 Patient Volunteersused to monitor reactions to
long-term drug use)
FDA Review& Approval
Post-Marketing Testing
$600-700 Million!
7 – 15 Years!
With the aid of Bioinformatics
Drug lead screeningDrug lead screeningDrug lead screeningDrug lead screening
5,000 to 10,000 compounds screened
250 lead candidates inPreclinical
Testing5 drug candidatesenter Clinical Testing;80% pass Phase I
One drug approved by the FDA
30% pass Phase II
80% pass Phase III
Applications of BioinformaticsApplications of BioinformaticsApplications of BioinformaticsApplications of Bioinformatics
Molecular Interactions Structure Prediction
NH
O COO-H
NN
N
OH
NH2
N
CH2
NH
N
NH
O
COO-
COO-H
N
N
NH
N
OH
NH2
Search for new drugs
NH2
NH2
N
N
CH3
Cl
N
CH3
NH2
NH2
N
N CH2
OCH3
OCH3
OCH3
NH2
NH2
N
N CH2
OCH3
OCH3
OCH3
HC
NH
NH2
N
NH
CH3
Cl
NH
CH3
HC
NH
NH2
N
NH
CH 3
Cl
NH
CH3
Cl
data analysis, algorithms, data analysis, algorithms, data analysis, algorithms, data analysis, algorithms, visualization, statistics, etc.visualization, statistics, etc.visualization, statistics, etc.visualization, statistics, etc.DNA chips
Biochemical Networks
Genetic Variations
Optimizing therapies
Sequence Analysis
Genomes
Proteins d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ- NLVIMGKKTWFSI d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ- NAVIMGKKTWFSI d4dfra_ ISLIAALAVDRVIGMENAMPWN- LPADLAWFKRNTL-------- NKPVIMGRHTWESI d3dfr__ TAFLWAQDRDGLIGKDGHLPWH- LPDDLHYFRAQTV-------- GKIMVVGRRTYESF
d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ- NLVIMGKKTWFSId8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ- NAVIMGKKTWFSId4dfra_ ISLIAALAVDRVIGMENAMPW- NLPADLAWFKRNTLD-------- KPVIMGRHTWESId3dfr__ TAFLWAQDRNGLIGKDGHLPW- HLPDDLHYFRAQTVG-------- KIMVVGRRTYESF
caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaacaacaagcc
aaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatct
cttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatat
tgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaa
actttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgaca
ttgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagca
agcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaa
tacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatga
aatccaccattcaattacaacaagatcctctttcttgcacttgg
Access to Biomedical Literature
• HINARI: http://www.who.int/hinari/ - get full text biomedical and health literature
• OARE: http://www.oaresciences.org/ - get full text environmental sciences
• AGORA: http://www.aginternetwork.org/ - get full text agriculture
• PubMed Central: http://www.pubmedcentral.nih.gov/- get full text biomedical and life sciences journal literature
• PubMed: http://www.pubmed.gov/ - search biomedical literature
Logging in to HINARI 1
Log-in to the HINARI website by clicking “HINARI LOGIN”
Logging into HINARI 2
We will need to insert your HINARI User ID and Password in the Login box and click on the “Sign On” button.
Accessing journals via PubMed
Click on the link to find articles through PubMed.
Accessing PubMed for HINARI Users
http://hinari-gw.who.int/whalecomwww.ncbi.nlm.nih.gov/whalecom0/sites/entrez?myncbishare=hinari_who
Accessing PubMed for HINARI Users Accessing PubMed for HINARI Users
Accessing PubMed for HINARI Users
Another window will open at the journal publishers’ website.
Accessing PubMed for HINARI Users
Training Material : Viet Nam
• HINARI: Training: Vietnamese Translation -Dịch Tiếng Việt
http://www.bioinfohelpdesk.org/PubMedSep2005/vn_translation/
http://www.bioinfohelpdesk.org/PubMedSep2006/���� Reading Packet, including Vietnamese Translations
Most international scientific research is written in English.
Literature DatabasesLiterature DatabasesLiterature DatabasesLiterature Databases
Databases & Visualization Tools– Primarily based on Nucleotide Sequences!
Examples: GenBank (Sequence Formatter) Genome (MapViewer)UniGene (Expression Profile Viewer) SNP (GeneView & Genotype Viewer)
— We also have some other types of data:Examples: Protein (Translations & Curations seen by Sequence Formatter)
HomoloGene (Protein Sequence Clusters)Genome Project (Information and links for Organisms’ Sequencing Projects)GEO (Arrays & Spectras seen with Heat Maps) Structure (Coordinates seen with Cn3D)BioAssay (Activities seen in Reports)
Search Methods– Entrez: word/text searches
– BLAST: sequence searches– in-lieu-of-BLAST: pre-computed BLAST searches
– VAST: macromolecular structure searches
– PubChem Structure Search: molecular structure searches
Linking– Pre-computed and Hand-curated Associations
• Gene Database: A central source for gene information & links
• Hemochromatosis Example: Relating data to find a biomedical answer
“What is the Entrez System?”
● A text search and retrieval engine
● A virtual workspace for manipulating large datasets
● An organized system for linking biological data
● A network of 29 linked databases
Types of Databases
• Primary Databases– Raw and redundant Data…..submitted, “owned” and updated
by experimentalists
• Examples: GenBank, SNP, GEO, PubChem Substance & BioAssay
• Derivative Databases– Human-curated (compilation and curation of data)
• Examples: GEO Datasets, Structure & Literature databases
– Computationally-Derived
• Example: UniGene, HomoloGene, PubChem Compound
– Combination
• Examples: RefSeq, Genome Assembly, Conserved Domain and Structure databases
11111111ºººººººº Sequence DatabaseSequence DatabaseSequence DatabaseSequence DatabaseSequence DatabaseSequence DatabaseSequence DatabaseSequence Database
GenBank
• Nucleotide only sequence database
• Archival in nature
• Each record is assigned a stable accession number
• Submission of GenBank Data to NCBI
– Direct submissions of individual records via Web(BankIt, Sequin)
– Batch submissions of bulk sequences via Email(EST, GSS, STS)
– FTP accounts for Sequencing Centers
• Three collaborating databases and other sources of data
The International SequenceDatabase Collaboration
EBI
GenBank
DDBJEMBL
EMBLEMBL
Entrez
SRS
getentry
NIGNIGCIB
NCBI
NIHNIH
•Submissions•Updates
•Submissions•Updates
•Submissions•Updates
GenBank
• full release every two months• incremental and cumulative updates daily• available only through internet
ftp://ftp.ncbi.nih.gov/genbank/
Release 161 December 200780,388,382 Records
83,874,179,730 Nucleotides
>140,000 Species
335 Gigabytes 1449 files
GenBank Divisions
“Organismal”(Traditional)
PRI (34) PrimateROD (26) RodentPLN (26) Plant and FungalBCT (24) Bacterial/ArchealINV (11) InvertebrateVRT (14) Other VertebrateVRL (8) ViralMAM (3) MammalianPHG (1) PhageSYN (5) SyntheticENV (6) Envir. samplesUNA (1) Unannotated
“Functional”(Bulk)
EST (635) Expressed Sequence TagGSS (264) Genome Survey SequenceHTG (98) High Throughput GenomicHTC (9) High Throughput ContigPAT (30) PatentSTS (14) Sequence Tagged SiteCON (82) Contigs, virtual
• Organized by taxonomy (sort of)• Direct submissions (Sequin/Bankit)• Accurate (~1 error per 10,000 bp)• Well characterized
• Organized by sequence type• Batch submissions (ftp/email) • Less accurate• Poorly characterized
EST (Expressed Sequence Tags)48,992,175
GSS (Genome Survey Sequence)21,636,089
Shredding
Draft Sequence (HTG/HTC divisions)
whole BAC genomic insert or genome
Cloning &Isolating
Assembling
Sequencing
GSS databaseor Trace Archive
NEW Re-Organization of theNucleotide Database
CoreNucleotide44,068,877
114,679,141 Total Nucleotide Records
RNARNA
nucleus~30,000 genes
Isolate RNA &make cDNA library
3’
5’
3’3’
5’5’
isolate unique clones sequence once from
each end
ESTdatabase
CoreNuc38%
EST43%
GSS19%
Sequence Databases: File Formats
ASN.1 – The Raw Data
XML
FASTA
GenBankflat file
A Traditional GenBank Record
LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004
DEFINITION Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
complete cds.
ACCESSION AY182241
VERSION AY182241.2 GI:32265057
KEYWORDS .
SOURCE Malus x domestica (cultivated apple)
ORGANISM Malus x domestica
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE 1 (bases 1 to 1931)
AUTHORS Pechous,S.W. and Whitaker,B.D.
TITLE Cloning and functional expression of an (E,E)-alpha-farnesene
synthase cDNA from peel tissue of apple fruit
JOURNAL Planta 219, 84-94 (2004)
REFERENCE 2 (bases 1 to 1931)
AUTHORS Pechous,S.W. and Whitaker,B.D.
TITLE Direct Submission
JOURNAL Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
20705, USA
REFERENCE 3 (bases 1 to 1931)
AUTHORS Pechous,S.W. and Whitaker,B.D.
TITLE Direct Submission
JOURNAL Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
20705, USA
REMARK Sequence update by submitter
COMMENT On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES Location/Qualifiers
source 1..1931
/organism="Malus x domestica"
/mol_type="mRNA"
/cultivar="'Law Rome'"
/db_xref="taxon:3750"
/tissue_type="peel"
gene 1..1931
/gene="AFS1"
CDS 54..1784
/gene="AFS1"
/note="terpene synthase"
/codon_start=1
/product="(E,E)-alpha-farnesene synthase"
/protein_id="AAO22848.2"
/db_xref="GI:32265058"
/translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
LSLLFQPLVN"
ORIGIN
1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
1921 aaaaaaaaaa a
//
Header
Feature Table
Sequence
Field Indexed Terms
[primary accession] M17755[title] Homo sapiens thyroid peroxidase (TPO) mRNA…[organism] Homo sapiens[sequence length] 3060[modification date] 1999/04/26[properties] biomol_mrna
gbdiv_prisrcdb_genbank
Indexing for Nucleotide UID:4680720
A Traditional GenBank Record
LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004
DEFINITION Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
complete cds.
ACCESSION AY182241
VERSION AY182241.2 GI:32265057
KEYWORDS .
SOURCE Malus x domestica (cultivated apple)
ORGANISM Malus x domestica
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE 1 (bases 1 to 1931)
AUTHORS Pechous,S.W. and Whitaker,B.D.
TITLE Cloning and functional expression of an (E,E)-alpha-farnesene
synthase cDNA from peel tissue of apple fruit
JOURNAL Planta 219, 84-94 (2004)
REFERENCE 2 (bases 1 to 1931)
AUTHORS Pechous,S.W. and Whitaker,B.D.
TITLE Direct Submission
JOURNAL Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
20705, USA
REFERENCE 3 (bases 1 to 1931)
AUTHORS Pechous,S.W. and Whitaker,B.D.
TITLE Direct Submission
JOURNAL Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
20705, USA
REMARK Sequence update by submitter
COMMENT On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES Location/Qualifiers
source 1..1931
/organism="Malus x domestica"
/mol_type="mRNA"
/cultivar="'Law Rome'"
/db_xref="taxon:3750"
/tissue_type="peel"
gene 1..1931
/gene="AFS1"
CDS 54..1784
/gene="AFS1"
/note="terpene synthase"
/codon_start=1
/product="(E,E)-alpha-farnesene synthase"
/protein_id="AAO22848.2"
/db_xref="GI:32265058"
/translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
LSLLFQPLVN"
ORIGIN
1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
1921 aaaaaaaaaa a
//
Header
Feature Table
Sequence
Field Indexed Terms
[gene name] TPO[text word] thyroiditis[protein name] thyroid peroxidase
Indexing for Nucleotide UID:4680720
The sequence itselfis not indexed…
Use BLAST for that!
Which one is the best sequence?
Primary vs. DerivativeSequence Databases
GenBankGenBank
SequencingSequencingCentersCenters
GA
GAGA
ATT
ATTC
CGAGA
ATT
ATTC
C
AT
GAGA
ATTC
C GAGA
ATTC
C
TTGACA
ATTGACTA
ACGTGC
TTGACA
CGTGAAT
TGACTA
TATAGCCG
ACGTGC
ACGTGCACGTGC
TTGACA
TTGACA
CGTGA
CGTGA
CGTGA
ATTGACTA
ATTGACTAATTGACTA
ATTGACTA
TATAGCCG
TATAGCCG
TATAGCCG
TATAGCCGTATAGCCG TATAGCCGTATAGCCG TATAGCCG
CATT
GAGA
ATTC
C
GAGA
ATTC
C LabsLabs
AlgorithmsAlgorithms
UniGene
CuratorsCurators
RefSeq
GenomeAssembly
TATAGCCGAGCTCCGATACCGATGACAA
Updated continually by NCBI
Updated ONLY by submitters
genomes transcripts proteins
GenBank
Derivative Sequence DatabaseDerivative Sequence DatabaseDerivative Sequence DatabaseDerivative Sequence Database
• The curated “best representative” sequences“non-redundant”
• Standardized nomenclature and record structure• Added annotation (references, sequence features)
Explicitly linked nucleotide and protein sequences
Validated by hand
• Updated to reflect current sequence data and biology• Stewardship by NCBI staff and collaborators
RELEASE 30 (July 7, 2008) IS NOW AVAILABLE ON THE FTP SITE!
RefSeqRefSeqRefSeqRefSeqRefSeqRefSeqRefSeqRefSeq Sequence DatabaseSequence DatabaseSequence DatabaseSequence DatabaseSequence DatabaseSequence DatabaseSequence DatabaseSequence Database
Curated genomic DNACurated genomic DNA(NC, NT, NW)(NC, NT, NW)
Curated Model mRNACurated Model mRNA (XM)(XM)(XR)(XR)
Curated mRNACurated mRNA (NM)(NM)(NR)(NR)
Model protein Model protein (XP)(XP)
RefSeq Curation Processes
ProteinProtein (NP)(NP)
Scanning....
Curated RefSeq Records
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI
staff. The reference sequence was derived from X66503.1.
Summary: Adenylosuccinate synthetase catalyzes the first
committed step in the conversion of IMP to AMP.
LOCUS ADSS 1368 bp mRNA linear PRI 27-AUG-2002
DEFINITION Homo sapiens adenylosuccinate synthase (ADSS), mRNA.
ACCESSION NM_001126
VERSION NM_001126.1 GI:4557270 RefSeq NucleotideRefSeq Nucleotide
LOCUS ADSS 455 aa linear PRI 27-AUG-2002
DEFINITION adenylosuccinate synthase; Adenylosuccinate synthetase
(Ade(-)H-complementing) Homo sapiens .
ACCESSION NP_001117
VERSION NP_001117.1 GI:4557271
DBSOURCE REFSEQ: accession NM_001126.1 RefSeq ProteinRefSeq Protein
X records:X records: Genome Annotation & Inferred or PredictedGenome Annotation & Inferred or Predictedvsvs
N records:N records: Provisional, Reviewed or ValidatedProvisional, Reviewed or Validated
The Perils of the XM
XM records are models based only on genomic sequenceand are subject to revision or removal with each new build of that genome.
Query= gi|20850420|ref|XM_124429.1|
Mus musculus expressed sequence AA553001 (AA553001), mRNA
gi|19527087|ref|NM_133873.1|
Mus musculus DNA segment, Chr 4, Wayne State University 114,
expressed (D4Wsu114e), mRNA Length=1898
Score = 3701.55 bits (1867), Expect = 0
Identities = 1870/1871 (99%), Gaps = 0/1871 (0%) Strand=Plus/Plus
BLAST the XM against the RefSeq database to look for a replacement:
Sequence DatabasesSequence DatabasesSequence DatabasesSequence Databases
GSS -EST -CoreNucleotide -
genome surveysequences
expressedsequence tags
the restMapping Genome Data on an Assembly:Mapping Genome Data on an Assembly:Genome SequenceGenome Sequence
(RefSeq: NC, NT, NW)(RefSeq: NC, NT, NW)Transcript regions & Transcript regions & ORFsORFs
(RefSeq: NM/NP, XM/XP)(RefSeq: NM/NP, XM/XP)Markers (STS)Markers (STS)Polymorphisms (SNP)Polymorphisms (SNP)ESTs/ExonsESTs/Exons ((UniGeneUniGene))
• “Best representative” (reference) sequences
• Standardized nomenclature and record structure
• Added annotation (references, sequence features)
• Assembled and annotated genomes
Complete Genomesas of November 2007
• Organelles:– Mitochondria (1214)
– Plastids (115)
– Plasmids (1148)
• Viruses (2854)
•• ArchaebacteriaArchaebacteria(46(46 completecomplete))
•• EubacteriaEubacteria(521(521completecomplete))
•• Eukaryotes Eukaryotes (25(25completecomplete/162 /162 assembliesassemblies))
Simple Genomes
• Full chromosomal sequences are provided
• Genes are annotated
• The annotation can be shown graphically and linked to sequence records
LOCUS NC_000913 4639221 bp DNA circular BCT 30-JUL-2003
DEFINITION Escherichia coli K12, complete genome.
ACCESSION NC_000913
VERSION NC_000913.1 GI:16127994
KEYWORDS .
SOURCE Escherichia coli K12.
ORGANISM Escherichia coli K12
Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
Enterobacteriaceae; Escherichia.
REFERENCE 1 (bases 1 to 4639221)
AUTHORS Blattner,F.R., Plunkett,G. III, Bloch, C.A., Perna, N.T., Burland,V.,
Riley,M., Collado-Vides,J., Glasner,J.D., Rode, C.K., Mayhew,G.F.,
Gregor,J., Davis,N.W., Kirkpatrick,H.A., Goeden,M.A., Rose,D.J.,
Mau,R. and Shao,Y.
TITLE The complete genome sequence of Esherichia coli K12.
JOURNAL Science 277 (5331), 1453-1474 (1997)
MEDLINE 97426617
PUBMED 9278503
REFERENCE 2 (bases 1 to 4639221)
AUTHORS Blattner,F.R.
TITLE Direct submission
JOURNAL Sumbitted (16-JAN-1997) Guy Plunkett III, Laboratory of Genetics,
University of Wisconsin, 445 Henry Mall, Madison, WI 53706, USA.
E-mail [email protected] Phone: 608-262-2543 Fax:
RefSeq Chromosomes: NC_
gene 3954631..3956478
/gene="mutL"
/locus_tag="b4170"
/note="synonym: mut-25"
CDS 3954631..3956478
/gene="mutL"
/locus_tag="b4170"
/function="methyl-directed mismatch repair"
/codon_start=1
/transl_table=11
/product="MutL"
/protein_id="NP_418591.1"
/db_xref="GI:16131992"
/translation="MPIQVLPPQLANQIAAGEVVERPASVVKELVENSLDAGATRIDI
DIERGGAKLIRIRDNGCGIKKDELALALARHATSKIASLDDLEAIISLGFRGEALASI
SSVSRLTLTSRTAEQQEAWQAYAEGRDMNVTVKPAAHPVGTTLEVLDLFYNTPARRKF
LRTEKTEFNHIDEIIRRIALARFDVTINLSHNGKIVRQYRAVPEGGQKERRLGAICGT
AFLEQALAIEWQHGDLTLRGWVADPNHTTPALAEIQYCYVNGRMMRDRLINHAIRQAC
EDKLGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQL
ETPLPLDDEPQPAPRSIPENRVAAGRNHFAEPAAREPVAPRYTPAPASGSRPAAPWPN
AQPGYQKQQGEVYRQLLQTPAPMQKLKAPEPQEPALAANSQSFGRVLTIVHSDCALLE
RDGNISLLSLPVAERWLRQAQLTPGEAPVCAQPLLIPLRLKVSAEEKSALEKAQSALA
ELGIDFQSDAQHVTIRAVPLPLRQQNLQILIPELIGYLAKQSVFEPGNIAQWIARNLM
SEHAQWSMAQAITLLADVERLCPQLVKTPPGGLLQSVDLHPAIKALKDE"
Annotation ofGene, CDS,
and other features
BASE COUNT 978672 a1011074 c 997153 g 974742 t
ORIGIN
1 cgtcttcatt gtcagacagc agaatttgta cgcgctgttc ggcttgttgt aatttggcct
61 gcccctgacg tgccagctgc acgccgcgtt cgaactcgtt cagcgcctct tccagcggca
121 ggtcgccact ttccagacgg gttacaatct gttccagctc gctcagcgcc ttttcaaagc
181 tggcgggcgc ctcatttttc ttcggcataa tgaatgtctg actctcaata tttttcgccc
241 cgtcatggta acggactcag ggcaaatagc aaataacgcg caatggtaag gtgatgtgca
301 cagcaaagcg atgttagtgg tatacttccg cgcctggatg cagccgcagg tgtgggctgc
361 tgtatttttc cctatacaag tcgcttaagg cttgccaacg aaccattgcc gccatgaagt
421 ttatcattaa attgttcccg gaaatcacca tcaaaagcca atctgtgcgc ttgcgcttta
481 taaaaatcct taccgggaac attcgtaacg ttttaaagca ctatgatgag acgctcgctg
541 tcgtccgcca ctgggataac atcgaagttc gcgcaaaaga tgaaaaccag cgtctggcta
601 ttcgcgacgc tctgacccgt attccgggta tccaccatat tctcgaagtc gaagacgtgc
661 cgtttaccga catgcacgat attttcgaga aagcgttggt tcagtatcgc gatcagctgg
721 aaggcaaaac cttctgcgta cgcgtgaagc gccgtggcaa acatgatttt agctcgattg
781 atgtggaacg ttacgtcggc ggcggtttaa atcagcatat tgaatccgcg cgcgtgaagc
841 tgaccaatcc ggatgtgact gtccatctgg aagtggaaga cgatcgtctc ctgctgatta
901 aaggccgcta cgaaggtatt ggcggtttcc cgatcggcac ccaggaagat gtgctgtcgc
961 tcatttccgg tggtttcgac tccggtgttt ccagttatat gttgatgcgt cgcggctgcc
Genome sequence
mutL
Complex Genomes
• Sequences are provided complete or we help assemble
• Heavy annotation: Genes, transcript regions & ORFs, sequence variations & markers, clones, ESTs, etc.
• The annotation can be shown graphically and linked to other
databases using the MapViewerMapViewer
�������� 1: NT_034400. Homo sapiens chro...[gi:51458694] Links
Click here to see all features and the sequence of this contig record.
LOCUS NT_034400 1065823 bp DNA linear CON 19-AUG-2004
DEFINITION Homo sapiens chromosome 1 genomic contig.
ACCESSION NT_034400
VERSION NT_034400.3 GI:51458694
KEYWORDS .
SOURCE Homo sapiens
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini;
Hominidae; Homo.
REFERENCE 1 (bases 1 to 1065823)
AUTHORS International Human Genome Sequencing Consortium.
TITLE The DNA sequence of Homo sapiens
JOURNAL Unpublished (2003)
COMMENT GENOME ANNOTATION REFSEQ: Features on this sequence have
been produced for build 35 version 1 of the NCBI's genome
annotation [see documentation].
On Aug 19, 2004 this sequence version replaced gi:27478327.
The DNA sequence is part of the third release of the
finished human reference genome. It was assembled from
individual clone sequences by the Human Genome Sequencing
Consortium in consultation with NCBI staff.
COMPLETENESS: not full length.
RefSeq Contig: NT_
gene complement(2548206..2591802)
/gene="ADSS"
/db_xref="LocusID:159"
/db_xref="MIM:103060"
mRNA complement(join(2548206..2549349,2550998..2551147,
2555692..2555789,2557339..2557463,2558471..2558625,
2559881..2560007,2562526..2562607,2563644..2563751,
2564012..2564078,2572236..2572286,2576516..2576584,
2577357..2577459,2591326..2591802))
/gene="ADSS"
/product="adenylosuccinate synthase"
/note="Derived by automated computational analysis using gene
prediction method: BLAST. Supporting evidence includes
similarity to: 3 mRNAs"
/transcript_id="XM_049992.8"
/db_xref="GI:22045950"
/db_xref="LocusID:159"
/db_xref="MIM:103060"
Annotation ofGene, mRNA, CDS,and other features
CONTIG join(AL139152.7:1..55543,AL596177.4:1998..91084,
AL356378.17:1999..202955,AL391904.14:2001..68222,
AL590667.7:2001..175494,AL359207.7:2001..112707,
AL365260.11:2001..114412,complement(AL445591.10:1..138092),
BX537254.7:2001..121309)
// Ordering of draft sequences
Click here to see all features and the sequence of this contig
record.
Higher Genome MapViews
adss
Genomic BLAST
Search the maps
Species-specific help!
Higher Genome MapViewsMap Viewer Help
Human Maps Help
Maps&Options
Examples of Maps & Mapped Data
--Sequence maps---Ab initio (model)AssemblyRepeatsBES_CloneCloneNCI_CloneContigComponentCpG islanddbSNP haplotypeFosmidGenBank_DNAGenePhenotypeSAGE_TagSTSTCAG_RNATranscript (RNA)UniGene ESTVariation
--Cytogenetic maps--IdeogramFISH CloneNCI FISH CloneGene_CytogeneticMitelman BreakpointMorbid/Disease--Genetic Maps--deCODEGenethonMarshfield--RH maps---GeneMap99-G3GeneMap99-GB4NCBI RHStandford-G3TNGWhitehead-RHWhitehead-YAC
Sequence DatabasesSequence DatabasesSequence DatabasesSequence Databases
GSS -EST -CoreNucleotide -
genome surveysequences
expressedsequence tags
the rest
GSS -EST -CoreNucleotide -
genome surveysequences
expressedsequence tags
the rest
BLASTBLAST
SequenceSequenceVASTVAST
StructureStructure
EntrezEntrez
TextText
Searching the NCBI Databases
PubChemPubChem
StructureStructure
SearchSearch
Small MoleculeSmall MoleculeStructureStructure
TextClinical FeaturesOther FeaturesDiagnosisInheritanceClinical ManagementMappingMolecular GeneticsGene TherapyMolecular GeneticsPopulation GeneticsAnimal ModelAllelic Variants HistoryReferences
EXAMPLE: Searching with Entrez
“How do I retrieve the mRNA record NM_000311?”
“Finding information on Sickle Cell Anemia…”
1
3
2
How to Query a Database
(term1[tag delimiter] op term2[tag delimiter] op …)
tag delimiter = Entrez indexing field
op = AND, OR, NOT
OrganismJournalUser compoundsAuthor
�Boolean operators MUST be in ALL CAPS!
Examples oftag delimiters
1
3
term1 term22
Brauninger a c-src kinase
OrganismJournalUser compoundsAuthor
EXAMPLE: Searching with Entrez
“How do I download the human proteome?”
1. Choose the ProteinProtein database to search 2. Retrieve HumanHuman records
human[organismhuman[organism]]
3. Narrow to only RefSeqRefSeq recordsAND AND srcdb_refseq[propertiessrcdb_refseq[properties]]
(or use the new (or use the new ““RefSeqRefSeq”” Tab!)Tab!)
4. Choose your “DisplayDisplay” format
5. “SendSend” the data to a “FileFile”
human
Using Fields to Sequence RecordsAccessionAll FieldsAuthor EC/RN NumberFeature KeyFilterGene NameIssueJournalKeywordModification DateMolecular WeightOrganismPage NumberPrimary AccessionPropertiesProtein NamePublication DateSeqID StringSequence LengthSubstance NameText WordTitleVolume
Most useful search field: [Organism]human[orgn] …or… bacteria[orgn]
Useful search terms in [Properties]: srcdb: “source database” srcdb_refseq[prop]
gbdiv: “genbank division” gbdiv_est[prop]
biomol: “biomolecular type” biomol_mrna[prop]
Downloading Records Downloading a Record
EXAMPLE: Searching with Entrez
“How do I find information on genesexpressed in mouse pancreas?”
“How do I get a structure of theHIV-1 Reverse transcriptase?”
“How do I find shigella proteins between24 and 36kDa?”
http://www.ncbi.nlm.nih.gov/entrez/cubby.fcgi
Be lazy – Get the new query results Emailed to you. Precomputed BLAST Services
Some Entrez Links:A List of Similar Sequences: Related SequencesStructures with Similar Sequences: Related StructuresThe Multifunctional Blast Link: Blink
Some BLAST-generated Databases & Links:SNP:GeneView SNP:GeneViewUniGene (Transcript Clusters) UniGeneHomoloGene (Protein Homologs)
HomoloGeneCDD (Functional Domains) Conserved Domains
& CDART (modular domains) Domain Relatives
You may not have to run a BLAST searchor you may have already used BLAST
and did not even know it!
Most similar
Least similar
Links
Related Sequences:Precomputed BLASTn & BLASTp Lists
The “Related Sequences” link retrievesGenBank/GenPept sequences sorted by BLAST score,
but with no alignment details.
Nucleotide
Protein
Links
BLink: Precomputed BLASTp
• Lists only 200 hits
• List is nonredundant
NewNewNewNew
Summary pages of sequence and expression information
for sets of expressed sequences clustered by BLAST
UniGene Cluster Hs.XXXXX Homo sapiens
SEE ALSO: LocusLink | OMIM | HomoloGene
MODEL ORGANISM PROTEIN SIMILARITIES
MAPPING INFORMATION
EXPRESSION INFORMATION
mRNA SEQUENCES
EST SEQUENCES
EXAMPLE: Finding Transcripts & Tissues
“How do I find information on expression of my gene?”
1. Choose the UniGene database 2. Search with a gene name or UID
(g6pdg6pd: Glucose-6-phosphate dehydrogenase)
3. Look at the General Information page
4. Click on the Expression Profile to see in which tissues the transcripts are found.
5. You can download all of the sequences.
Clusters of homologous protein sequences based on BLASTp
(also guided by taxonomic tree information)
- Includes orthologs and paralogs -
early globin gene
A-chain gene B-chain gene
frog A chick A mouse A mouse B chick B frog Bfrog A chick A mouse A mouse B chick B frog B
paralogsorthologs orthologs
gene duplication
EXAMPLE: Finding Homologous Proteins
“How do I findhomologs of my protein?”
1. Choose the HomoloGene database 2. Search with a gene name or UID
(g6pdg6pd: Glucose-6-phosphate dehydrogenase)
3. Look at the General Information page
4. Display Multiple Alignment format
5. Find all of the comparison statistics and BLAST2 alignment already calculated!
g6pd
aa%IDamino acid % identity
nt%IDnucleotide % identity
Devolutionary distance(Jukes-Cantor Model)
Ka/Ksnon-synonymous/synonymous
codon change ratio
Knr/Kncradical/conservative
amino acid change ratio
“Reverse-Position Specific” Sequence Comparisons (RPS-BLAST)a.k.a. “Conserved Domain Database” (CDD) Search
Conserved sequence elements that perform common functions curated from multiple sequence alignments with similar function
(Position-Specific Scoring Matrices)
10 10 10 10 20 30 40 20 30 40 20 30 40 20 30 40 50 6050 6050 6050 60. . . . * . . . . | . . . . * . . . . | . . . . * . . . . | . . . . . . * . . . . | . . . . * . . . . | . . . . * . . . . | . . . . . . * . . . . | . . . . * . . . . | . . . . * . . . . | . . . . . . * . . . . | . . . . * . . . . | . . . . * . . . . | . . . . * . . . . | . . . . * . . . . | . . . . * . . . . |. . * . . . . | . . . . * . . . . | . . . . * . . . . |. . * . . . . | . . . . * . . . . | . . . . * . . . . |. . * . . . . | . . . . * . . . . | . . . . * . . . . |consensus 1 KWEIPREDLTLGKKLGEGAFGEVYKGTLKGkgd---nkSIDVAVKTLKEDASEeqIKEFL 57
1FGI A 1 aWEIPRESLRLEVKLGQGCFGEVWMGTWNG--------TTRVAIKTLKPGTMS--PEAFL 311
1BYG A 1 RWELPRDRLVLgkPLGEGAFGQVYLAEAIglgkdkpnrvTKVAVKMLKSDAtedkLSLDI 74
gi 125135 1 GWALNMKELKLlqTIGKGEFGDVMLGDYRg---------NKVAVKCIKNDAt---AQAFL 62
gi 125702 1 KYEIPRTDLTLkhKLGGGQYGEVYEGVWKky-------sLTVAVKTLKEDTm--eVEEFL 284
gi 1174437 1 KWEIPRSELTIlrKLGRGNFGEVFYGKWRn--------sIDVAVKTLREGTm--sTAAFL 325NCBIProtein
Clusters
NCBICOG
EMBLSMART
SangerPfam
NCBICD
Finding CDs for a Query Protein
MGNAAAAKKGSEQESVKEFLAKAKEDFLKKWESPAQNTAHLDQFERIKTLG
TGSFGRVMLVKHKETGNHYAMKILDKQKVVKLKQIEHTLNEKRILQAVNFP
FLVKLEFSFKDNSNLYMVMEYVPGGEMFSHLRRIGRFSEP…………………………
CDART:CDART:CDART:CDART:Conserved Domain Architecture Retrieval Tool
Modular Architecture of Domains
Cartoon descriptions of relative protein domain organization
Allows for comparison with other proteins with the same domain
Finding CDs for a Query Protein CDD Record – SMART S_TKc
aligned query
red = high conservation blue = low conservation
View in Cn3D
View PSSM MatrixLinks to proteins
containing this domain
Annotated features
Links to Domain Families/CDTree Data
Curated Domain
Entrez Structure: MMMMolecular MMMModeling DDDDatabbbbase
• Searching the Structure Databases:
• Keyword search by Entrez
• Sequence search by BLAST or BLink
• Domain search by CDD/RPS-BLAST
• Structure search by VAST
• Derived from experimentally determined PDB records
• Data is added to PDB records including:– Addition of explicit chemical bonding information
– Validation and indexing of sequence
– Inclusion of Taxonomy, Citation, and other information
– Conversion to ASN.1 data description language
Molecular components in the MMDB structure are listed below. Theicons indicate macromolecular chains, 3D domains, protein classifications and ligands. Please hold the mouse over each icon for more information on the component. You may also click the thumbnails below to view corresponding chains and domains in Cn3D.
to get the Cn3D viewer
Sequence-based Neighbors:Conserved Domains (CDD/RPS-BLAST)
Structure-based Domains:(3D Domains)
PubChem Link
Searching the NCBI Databases
BLASTBLAST
SequenceSequenceVASTVAST
StructureStructure
EntrezEntrez
TextTextPubChemPubChem
StructureStructure
SearchSearch
Small MoleculeSmall MoleculeStructureStructure
Structure-based Neighbors:Vector Alignment Search Tool
For each protein chain:
locate secondarystructure elements,
represent them asindividual vectors,
1
2
3
4
5 6
Human IL-4
and compare these withprecomputed vectorsof database structures.
Molecular components in the MMDB structure are listed below. Theicons indicate macromolecular chains, 3D domains, protein classifications and ligands. Please hold the mouse over each icon for more information on the component. You may also click the thumbnails below to view corresponding chains and domains in Cn3D.
Structure-based Neighbors:Vector Alignment Search Tool (VAST)
����
EXAMPLE: Using VAST to find binding sites“I have a methyltransferase with a weird fold.
Are there any other proteins that have this fold?”
The Crystal Structure Of E.coli
m1G37 tRNAMethyltransferase
(1P9P)
&
A protein of unknown function from
Thermotoga maritima(1O6D)
2.4Å rmsd11% sequence identity
BLASTBLAST
SequenceSequenceVASTVAST
StructureStructure
EntrezEntrez
TextText
Searching the NCBI Databases
PubChemPubChem
StructureStructure
SearchSearch
Small MoleculeSmall MoleculeStructureStructure
Derivative Database:information is provided, updated
and “owned” by NCBI.
PubChem Databases
�Composed of Experimental data with Background, Protocols and Results for bioactivity screens of chemical substances described in PubChem Substance
� Submitters add “Hard” Links to PubChemSubstance records and outside sources.
Cell-line Growth Effectors HIV InfectivityPyruvate Kinase Estrogen Receptor Binding
�Composed of Substances which may be of known or unknown composition and also may contain a discrete compound or mixtures of compounds.
� Submitters add “Hard” Links to PubChemBioAssay records and outside sources.
Hydrolyzed feathers Diphenhydramine CitrateAspartame
Primary Databases:information is provided, updated
and “owned” by Submitters.
�Composed of discrete compounds with known chemical structure.
� Summary reports about the known chemical compounds described in PubChem Substance.
� Addition of Automated Links which are combined from information provided on PubChem Substance & BioAssayrecords.
EXAMPLE: Using PubChem Structure Search“What is this compound we isolated from grape skins?
Has it been found to have any biological activity?”
“What is this compound we isolated from grape skins? Has it been found to have any biological activity?”
YES
Searching the NCBI Databases
Linking!
Genomes
Taxonomy
PubMed
Nucleotide sequences
Protein sequences
3-D Structuremmdb
(3D structure)
Term Frequency Statistics
Structure Similarity
(VAST)
SequenceSimilarity
(BLASTp)
SequenceSimilarity
(BLASTn)
Phylogeny
Examples of Soft Linking in Entrez
Following Links
“Hard” Links: Curated links based on biology
for example:
nucleotide � taxonomy (based on organism identifier)protein � domain relatives (based on domain assignment)domains � pubmed (based on supporting literature)
“Soft” Links: Pre-computed analyses
for example:nucleotide � related sequences (BLASTn neighbors)protein � conserved domains (RPS-BLAST search) gene � map viewer (map position of annotated gene)
Follow links to related datain the same database
or in others!
Links
Summary pages of curated
information (LINKS!)
for genetic loci of RefSeq
organisms.
�Summary�Genomic regions, transcripts and products�Genomic context�Bibliography�HIV-1 protein interactions�Interactions�General gene information
�Markers�Genotypes�Phenotypes�Pathways�Homology�GeneOntology (function, process, component)
�General protein information�Reference Sequences
�mRNA and Protein(s)�Reference and Alternate Genomic assemblies
�Related Sequences�Additional Links
EXAMPLE: Finding Information on your Gene
“How do I find information about my gene?”
1. Choose the GeneGene database to search
2. Find your Gene recordvia traditional Entrez search!
3. Examine the record and all of those links!
What does my gene look like?
Are there SNPs in my gene? SNP:GeneView
Is there Genotype data on my gene? SNP:Genotype
Where is my gene in the genome? MapViewer& what other genome information can I find in this region?
Are there any known phenotypes for my gene? OMIM
Where is my gene expressed? UniGene
What are homologs to my gene in other organisms? HomoloGene
Genomic regions, transcripts & products
NEW!
Sequence Polymorphisms found in Patients withAge-Related Macular Degeneration (AMD)
Mapping the Polymorphisms to a Candidate Gene.
A clinical databaseof Genotypes & Phenotypes
Data is either open access or controlled access
FTP Downloads
NCBI Toolbox: In-house source code useful for incorporating NCBI-like functionality into their programs. Three main parts: Data Model, Data Encoding
and Programming Libraries.
• Examples: BLAST, Cn3D, Sequin, Data format conversion scripts
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/index.cgi
Help for Programmers
http://www.ncbi.nih.gov/entrez/query/static/eutils_help.html
E-Utilities: Guidelines for Entrez “URL calls” used to access data. Designed for use in scripts.
• Examples: ESearch, EPost, ESummary, EFetch and ELink
Caution: Overuse may result in blocked IPs!
3000 3000 MyrMyr
1000 1000 MyrMyr
540 540 MyrMyr
Alzheimer’sDisease
Ataxiatelangiectasia
Colon cancer
Pancreaticcarcinoma
Yeast BacteriaWormFlyHuman
BLAST and Molecular Evolution
MLH1 MutL
Common ancestry allows us to infer similar function
Calculates similarity for biological sequences
Finds best local alignments
A Heuristic approach based on the Smith-Waterman algorithm
− Searches for matching “words” (W) rather than individual residues
− Uses statistical theory to determine if a match might have occurred by chance
Basic Local Alignment Search Tool
Seq 1
Seq 2Seq 1
Seq 2
Global Alignment Local Alignment
Protein BLAST Page Limiting Database: Organism
Organism autocomplete
Limiting Database: Entrez Query
all[filter] NOT mammals[organism]
gene_in_mitochondrion[Properties]2006:2007 [Modification Date]
Nucleotidebiomol_mrna[Properties]biomol_genomic[Properties]
Run Search
BLAST Formatting Page
Conserved Domain Results
BLAST Output: Graphical Overview
mouse over
Sort by taxonomy
BLAST Output: Descriptions
Link to entrez
Sorted by e values
5 X 10-14
Default e value cutoff 10
Gene Linkout
TaxBLAST: Taxonomy Reports
BLAST Output: Alignments
Identical match
positive score(conservative)
Negative or zero
gap
Position Specific Iterative BLAST
MLH1 and ETR1
>gi|4557757|ref|NP_000240.1| MutL protein homolog 1 [Homo sapiens]
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIVKEGGLKLIQIQDNGTGIRK
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPK
PCAGNQGTQITVEDLFYNIATRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNA
STVDNIRSIFGNAVSRELIEIGCEDKTLAFKMNGYISNANYSVKKCIFLLFINHRLVESTSLRKAIETVY
AAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLLGSNSSRMYFTQTLLP
GLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLSSQPQAIVTEDKTDIS
SGRARQQDEEMLELPAPAEVAAKNQSLEGDTTKGTSEMSEKRGPTSSNPRKRHREDSDVEMVEDDSRKEM
TAACTPRRRIINLTSVLSLQEEINEQGHEVLREMLHNHSFVGCVNPQWALAQHQTKLYLLNTTKLSEELF
YQILIYDFANFGVLRLSEPAPLFDLAMLALDSPESGWTEEDGPKEGLAEYIVEFLKKKAEMLADYFSLEI
DEEGNLIGLPLLIDNYVPPLEGLPIFILRLATEVNWDEEKECFESLSKECAMFYSIRKQYISEESTLSGQ
QSEVPGSIPNSWKWTVEHIVYKALRSHILPPKHFTEDGNILQLANLPDLYKVFERC
>gi|22095656|sp|O81122.1|ETR1_MALDO Ethylene receptor
MLACNCIEPQWPADELLMKYQYISDFFIALAYFSIPLELIYFVKKSAVFPYRWVLVQFGAFIVLCGATHL
INLWTFSIHSRTVAMVMTTAKVLTAVVSCATALMLVHIIPDLLSVKTRELFLKNKAAELDREMGLIRTQE
ETGRHVRMLTHEIRSTLDRHTILKTTLVELGRTLALEECALWMPTRTGLELQLSYTLRQQNPVGYTVPIH
LPVINQVFSSNRAVKISANSPVAKLRQLAGRHIPGEVVAVRVPLLHLSNFQINDWPELSTKRYALMVLML
PSDSARQWHVHELELVEVVADQVAVALSHAAILEESMRARDLLMEQNIALDLARREAETAIRARNDFLAV
MNHEMRTPMHAIIALSSLLQETELTAEQRLMVETILRSSNLLATLINDVLDLSRLEDGSLQLEIATFNLH
SVFREVHNMIKPVASIKRLSVTLNIAADLPMYAIGDEKRLMQTILNVVGNAVKFSKEGSISITAFVAKSE
SLRDFRAPDFFPVQSDNHFYLRVQVKDSGSGINPQDIPKLFTKFAQTQALATRNSGGSGLGLAICKRFVN
LMEGHIWIESEGLGKGCTATFIVKLGFPERSNESKLPFAPKLQANHVQTNFPGLKVLVMDDNGVSRSVTK
GLLAHLGCDVTAVSLIDELLHVISQEHKVVFMDVSMPGIDGYELAVRIHEKFTKRHERPVLVALTGSIDK
ITKENCMRVGVDGVILKPVSVDKMRSVLSELLEHRVLFEAM
Human Mismatch Repair Protein
Apple ethylene receptor
PSI-BLAST: Iteration 1
PSI-BLAST:Iteration 4
Plant ethylene receptors, bacterial two-component regulatory system kinases
RPS-BLAST: Conserved Domains
Histidine Kinase-like ATPase Domain
Algorithm parameters: Protein
Adjust to set stringency
May limit results
Default statistics adjustmentfor compositional bias
Off now by default. Conflicts withcomp-based stats
Expand
Automatic Short Sequence Adjustment
e-value 20000Word Size 2Matrix PAM30Comp Stats OffLow Comp Filter Off
Nucleotide and Protein
Managing Searches
Recent Results
Saved Strategies
Recent Results
Login to My NCBI to save search strategies
Results available for 36 hours
Saved Strategies
Re-run searchesto keep up to date
Databases & Visualization Tools– Primarily based on Nucleotide Sequences!
Examples: GenBank (Sequence Formatter) Genome (MapViewer)UniGene (Expression Profile Viewer) SNP (GeneView & Genotype Viewer)
— We also have some other types of data:Examples: Protein (Translations & Curations seen by Sequence Formatter)
HomoloGene (Protein Sequence Clusters)Genome Project (Information and links for Organisms’ Sequencing Projects)GEO (Arrays & Spectras seen with Heat Maps) Structure (Coordinates seen with Cn3D)BioAssay (Activities seen in Reports)
Search Methods– Entrez: word/text searches
– BLAST: sequence searches– in-lieu-of-BLAST: pre-computed BLAST searches
– VAST: macromolecular structure searches
– PubChem Structure Search: molecular structure searches
Linking– Pre-computed and Hand-curated Associations
• Gene Database: A central source for gene information & links
• Hemochromatosis Example: Relating data to find a biomedical answer
EXAMPLE: Finding information onthe Etiology of Hemochromatosis
To answer a question/solve a puzzlewith pre-computed links!
-the data is already there-
What causes Hemochromatosis?Classic hemochromatosis (HFE),
an autosomal recessive disorder,is an iron overload disease.
Figure 17-48. The transferrin cycle, which operates in all growing mammalian cells.
Molecular Cell Biology by Lodish, Berk, Zipursky, Matsudaira, Baltimore & Darnell ©2000 by W.H. Freeman & Co.
A Bookshelf member!
Iron is an essential nutrient required for the synthesis of hemoglobin, cytochromes, and many other proteins
Iron is transported around the body by Transferrin which binds to the extracellular Transferrin Receptor, causing internalization and cellular uptake of the complex.
High intracellular iron levels causes free-radical damage to intracellular proteins, lipids, and nucleic acids.
HFE protein is a competitive inhibitor with Transferrin for binding to the Transferrin Receptor, thereby inhibiting cellular uptake of iron and preventing intracellular damage.
Medical complications of Hemochromatosis: hypermelanoticpigmentation of the skin, arthritis, cirrhosis of the liver, diabetes, heart failure and primary hepatocellular carcinoma
Fe+3
Fe+3
Fe+3
Fe+3
Fe+3
hemochromatosis
OMIM Record � Link to Gene
Coriell Cell RepositoriesHuman Gene Mutation Database
Gene � Links to Everywhere (almost) � Gene
NTNM
records GeneUniGeneGene Model
HFE Genome Maps
Gene � Links to Everywhere (almost) � ProteinGeneView in SNP