A Practical Overview of Bioinformatics - BioinfoHelpDesk.org

27
A Practical Overview of A Practical Overview of A Practical Overview of A Practical Overview of Bioinformatics Bioinformatics Bioinformatics Bioinformatics Chuong Huynh ( Chuong Huynh ( Hu Hunh nh Chương Chương) [email protected] [email protected] Lac Hong University September 9, 2008 What is bioinformatics? - Definition My definition – bringing biological themes to computers Peter Elkin: Primer on Medical Genomics: Part V: Bioinformatics “Bioinformatics is the discipline that develops and applies informatics to the field of molecular biology.” BISTIC Bioinformatics Definition “Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data” BISTIC Computational Biology Definition “Computational Biology: the development and application of data- analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.” http://www.bisti.nih.gov/ Useful/Necessary Bioinformatics Skills Strong background in some aspects of molecular biology!!! Ability to communicate biological questions comprehensibly to computer scientists Thorough comprehension of the problem in the bioinformatics field Statistics (association studies, clustering, sampling) Ability to filter, parse, and munge data and determine the relationships between the data sets Mathematics (e.g. algorithm development) Engineering (e.g. robotics) Good knowledge of a few molecular biology software packages (molecular modeling / sequence analysis) Command line computing environment (Linux/Unix knowledge) Data administration (esp. relational database concept) and Computer Programming Skills/Experience (C/C++, Sybase, Java, Oracle) and Scripting Language Knowledge (Perl and perhaps Phython) High throughput DNA sequencing Centre gggtctctcttgttagaccagatctgagcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgccttgagtgcttcaagtagtgtgtgcccgtctgttgtgtgactctgatagctagagatcccttcagaccaaatttagtcagtgtgaaa aatctctagcagtggcgcctgaacagggacttgaaagcgaaagagaaaccagagaagctctctcgacgcaggactcggcttgctgaagcgcgcacggcaagaggcgaggggacggcgactggtgagtacgccaaaattttgactagcggaggctagaaggagagagatgggt gcgagagcgtcgatattaagcgggggaggattagatagatgggaaaaaattcggttaaggccagggggaaagaaaaaatatagattaaaacatttagtatgggcaagcagggagctagaacgattcgcagtcaatcctggcctattagaaacatcagaaggttgtagacaaa tactgggacaactacaaccagcccttcagacaggatcagaagaacttagatcattatataatacagtagcaaccctctattgtgtgcatcaaaagatagatgtaaaagacaccaaggaagctttagataagatagaggaagagcaaaacaaaagtaagaaaaaagcacagca agcagcagctgacacaggaaatagcagccaggtcagccaaaattaccccatagtgcagaacatccaggggcaaatggtacatcaggccatatcacctagaactttaaatgcatgggtaaaagtagtagaagagaaggctttcagcccagaagtaatacccatgttttcagca ttatcagaaggagccaccccacaagatttaaacaccatgctaaacacagtggggggacatcaagcagccatgcaaatgttaaaagagaccatcaatgaggaagctgcagaatgggatagattgcatccagtgcatgcagggcctcatccaccaggccagatgagagaaccaa ggggaagtgacatagcaggaactactagtacccttcaggaacaaatagcatggatgacaaataatccacctatcccagtaggagaaatctataagagatggataatcctgggattaaataaaatagtaaggatgtatagccctaccagcattctggacataaaacaaggacc aaaggaaccctttagagactatgtagaccggttctataagactctaagagccgagcaagcttcacaggaggtaaaaaattggatgacagaaaccttgttggtccaaaatgcgaacccagattgtaagactattttaaaagcattgggaccagcagctacactagaagaaatg atgacagcatgtcagggagtgggaggacccggccataaagcaagagttttggcagaagcaatgagccaagtaacaaattcagctaccataatgatgcagaaaggcaattttaggaaccaaagaaaaattgttaagtgtttcaattgtggcaaagaagggcacatagccaaaa attgcagggcccctaggaaaaggggctgttggaaatgtggaaaggagggacaccaaatgaaagattgtactgagagacaggctaattttttagggaaaatctggccttcccacaggggaaggccagggaattttcctcagaacagactagagccaacagccccaccagcccc accagaagagagcttcaggtttggggaagagacaacaactccctctcagaagcaggagctgatagacaaggaactgtatccttcagcttccctcaaatcactctttggcaacgaccccttgtcacaataaagataggggggcaactaaaggaagctctattagatacaggag cagatgatacagtattagaagaaataaatttgccaggaagatggaaaccaaaaatgatagggggaattggaggttttatcaaagtaagacagtatgatcaaatactcgtagaaatctgtggacataaagctataggtacagtattagtaggacctacacctgtcaacataat tggaagaaatctgttgactcagattggttgcactttaaattttcccattagtcctattgaaactgtaccagtaaaattaaagccaggaatggatggcccaaaagttaaacaatggccattgacagaagaaaaaataaaagcattagtagaaatctgtacagaaatggaaaag gaaggaaaaatttcaaaaatcgggcctgaaaatccatataatactccagtatttgccataaagaaaaaagacagtactaaatggagaaaattagtagatttcagagaacttaataagaaaactcaagacttctgggaagttcaattaggaataccacatcccgcagggttaa aaaagaaaaaatcagtaacagtactggatgtgggtgatgcatatttttcagttcccttagataaagaattcaggaagtacactgcatttaccatacctagtataaacaatgagacaccagggattagatatcagtacaatgtgcttccacagggatggaaaggatcaccagc aatattccaaagcagcatgacaaaaatcttagagccttttagaaaacaaaatccagacatagttatctatcaatacatggacgatttgtatgtaggatctgacttagaaatagggcagcatagaacaaaaatagaggaactgagacaacatctgttgaagtggggatttacc acaccagacaaaaaacatcagaaagaacctccattcctttggatgggttatgaactccatcctgataaatggacagtacagcctatagtgctgccagaaaaggacagctggactgtcaatgacatacagaagttagtgggaaaattgaattgggcaagtcagatttacccag ggattaaagtaaagcaattatgtagactccttaggggaaccaaggcactaacagaagtaataccactaacaaaagaagcagagctagaactggcagaaaacagggaaattctaaaagaaccagtacatggagtgtattatgacccatcaaaagacttaatagcggaaataca gaagcaggggcaaggtcaatggacatatcaaatttatcaagagccatttaaaaatctgaaaacaggaaaatatgcaagaatgaggggtgcccacactaatgatgtaaaacaattaacagaggcagtgcaaaaaataaccacagaaagcatagtaatatggggaaagactcct aaatttaaactacccatacaaaaagaaacatgggaaacatggtggacagagtattggcaagccacctggattcctgagtgggagtttgtcaatacccctcccttagtaaaattatggtaccagttagagaaagaacccataataggagcagaaactttctatgtagatgggg cagctaacagggagactaaattaggaaaagcaggatatgttactaacaaagggagacaaaaagttgtctccataactgacacaacaaatcagaagactgagttacaagcaattcttctagcattacaggattctggattagaagtaaacatagtaacagactcacaatatgc attaggaatcattcaagcacaaccagataaaagtgaatcagagatagtcagtcaaataatagagcagttaataaaaaaagaaaaggtctacctgacatgggtaccagcgcacaaaggaattggaggaaatgaacaagtagataaattagtcagtactggaatcaggaaagta ctctttttagatggaatagataaagcccaagaagaacatgaaaaatatcacagtaattggagggcaatggctagtgattttaacctgccacctgtggtagcaaaagagatagtagccagctgtgataaatgtcagctaaaaggagaagccatgcatggacaagtagactgta gtccaggaatatggcaactagattgtacacatttagaaggaaaaattatcctggtagcagttcatgtagccagtggatatatagaagcagaagttattccagcagaaacagggcaggaaacagcatactttctcttaaaattagcaggaagatggccagtaaaaacagtaca tacagacaatggcagcaatttcaccagtactacagttaaggccgcctgttggtgggcaggaatcaagcaggaatttggcattccctacaatccccaaagtcaaggagtagtagaatctataaataaagaattaaagaaagttataggacagataagagatcaggctgaacat cttaagacagcagtacaaatggcagtattcatccacaattttaaaagaaaaggggggattggggggtacagtgcaggggaaagaatagtagacataatagcaacagacatacaaactaaagaactacaaaaacaaattacaaaaattcaaaattttcgggtttattacaggg acagcagagatccactttggaaaggaccagcaaagcttctctggaaaggtgaaggggcagtagtaatacaagataatagtgacataaaagtagtgccaagaagaaaagcaaagatcattagggattatggaaaacagatggcaggtgatgattgtgtggcaagtagacagga tgaggattagaacatggaaaagtttagtaaaacaccatatgtatgtttcaaggaaagctaagggatggttttatagacatcactatgaaagtactcatccgagaataagttcagaagtacacatcccactagggaatgcaaaattggtaataacaacatattggggtctaca tacaggagaaagagactggcatttgggtcaaggagtctccatagaattgaggaaaaggagatatagcacacaattagaccctaacctagcagaccaactaattcatctgcattactttgattgtttttcagaatctgctataagaaatgccatattaggacatatagttagc cctaggtgtgaatatcaagcaggacataacaaggtaggatctctacagtacttggcactaacagcattagtaagaccaagaaaaaagataaagccacctttgcctagtgttacaaaactgacagaggatagatggaacaagccccagaagaccaagggccacaaagggaacc atacaatgaatggacactagaacttttagaggagctcaagaatgaagctgttagacattttcctaggatatggctccatagcttagggcaacatatctatgaaacttatggagatacttgggcaggagtggaagccataataagaattctgcaacaactgctgtttattcat ttcagaattgggtgtcaacatagcagaatagacattcttcgacgaaggagagcaagaaatggagccagtagatcctagactagagccctggaagcatccaggaagtcagcctaggactgcttgtaccaattgctattgtaaaaagtgttgctttcattgccaagtttgtttc ataacaaaaggcttaggcatctcctatggcaggaagaagcggagacagcgacgaagagctcctcaagacagtcagactcatcaagtttctctatcaaagcagtaagtagtacatgtaatgcaatctttacaaatattagcagtagtagcattagtagtagcagcaataatag caatagttgtgtggtccatagtattcatagaatataggaaaataagaagacaaaacaaaatagaaaggttgattgatagaataatagaaagagcagaagacagtggcaatgagagtgacggagatcaggaagaattatcagcacttgtggaaatggggcacgatgctccttg ggatgttaatgatctgtaaagctgcagaaaatttgtgggtcacagtttattatggggtacctgtgtggaaagaagcaaccaccactctattttgtgcctcagatgctaaagcgtatgatacagaggtacataatgtttgggccacacatgcctgtgtacccacagaccccaa cccacaagaagtagaactgaagaatgtgacagaaaattttaacatgtggaaaaataacatggtagaccaaatgcatgaggatataattagtttatgggatcaaagcctaaagccatgtgtaaaattaaccccactctgtgttactttaaattgcactgattatgggaatgat actaacaccaataatagtagtgctactaaccccactagtagtagcgggggaatggaggggagaggagaaataaaaaattgctctttcaatatcaccagaagcataagagataaagtgaagaaagaatatgcacttttttatagtcttgatgtaataccaataaaagatgata atactagctataggttgagaagttgtaacacctcagtcattacacaggcctgtccaaaggtatcctttgaaccaattcccatacattattgtgccccggctggttttgcgattctaaagtgtaatgataaaaagttcaatggaaaaggaccatgtacaaatgtcagcacagt acaatgtacacatggaattaggccagtagtatcaactcaactgctgttaaatggcagtctagcagaagaagaggtagtaattagatcagacaatttctcggacaatgctaaagtcataatagtacatctgaatgaatctgtagaaattaattgtacaagactcaacaacatt acaaggagaagtatacatgtaggacatgtaggaccaggcagagcaatttatacaacaggaataataggaaaaataagacaagcacattgtaacattagtagagcaaaatggaataacactttaaaacagatagttacaaaattaagagaacaatttaagaataaaacaatag tctttaatcaatcctcaggaggggacccagaaattgtaatgcacagttttaattgtggaggggaatttttctactgtaattcaacacaactgtttaacagtacttggaatggtactgcatggtcaaataacactgaaggaaatgaaaatgacacaatcacactcccatgcag aataaaacaaattataaacatgtggcaggaagtaggaaaagcaatgtatgcacctcccatcagaggacaaattagatgttcatcaaatattacagggctgatattaacaagagatggtggtattaaccagaccaacaccaccgagattttcaggcctggaggaggagatatg aaggacaattggagaagtgaattatataaatataaagtagtaaaaattgaaccattaggagtagcacccaccaaggcaaagagaagagtggtgcaaagagaaaaaagagcagtgggaataataggagctatgctccttgggttcttgggagcagcaggaagcactatgggcg cagcgtcaatgacgctgacggtacaggccagacaattattgtctggtatagtgcaacagcagaacaatttgctgagggctattgaggcgcaacagcatctgttgcacctcacagtctggggcatcaagcagctccaagcaagagtcctggctgtggaaagatacctaaggga tcaacagctcctggggttttggggttgctctggaaaactcatttgcaccactgctgtgccttggaatactagttggagtaataaatctctgagtcagatttgggataacatgacctggatgcagtgggaaagggaaattgataattacacaagcttaatatacaacttaatt gaagaatcgcaaaaccaacaagaaaagaatgaacaagagttattggaattagataactgggcaagtttgtggaattggtttagcataacaaattggctgtggtatataaaaatattcataatgatagtaggaggcttggtaggtttaagaatagtttttactgtactttcta tagtaaatagagttaggcagggatactcaccattgtcgtttcagacgcgcctcccagccaggaggggacccgacaggcccgaaggaatcgaagaagaaggtggagagagagacagagacagatccggtcaattagtggatggattcttagcaattatctgggtcgacctgcg gagcctgtgcctcttcagctaccaccgcttgagagacttactcttgattgtaacgaggattgtggaacttctgggacgcagggggtgggaagccctcaaatattggtggaatctcctacaatattggattcaggaactaaagaatagtgctgttagcttgctcaacgccaca gccatagcagtagctgagggaactgatagggttatagaagtattacaaagagcttgtagagctattctccacatacctagaagaataagacagggcttagaaagggctttgcaataagatgggtggtaagtggtcaaaaagtagtaaaattggatggcctactgtaagggaa agaatgagaagagctgagccagcagcagatggggtgggagcagtatctcgagacctggaaaaacatggagcaatcacaagtagtaatacagcaactaacaatgctgattgtgcctggctagaagcacaagaggaggaggaggtgggttttccagtcagacctcaggtacctt taagaccaatgacttacaagggagcgttagatcttagccactttttaaaagaaaaggggggactggaagggctaatttggtcccagaaaagacaagacatccttgatttgtgggtccaccacacacaaggctacttccctgattggcagaactacacaccagggccagggat cagatatccactgacctttggttggtgcttcaagctagtaccagttgagccagagaaggtagaagaggccaatgaaggagagaacaacagattgttacaccctgtgagcctgcatgggatggaggacccggagaaagaagtgttagtatggaggtttgacagccgcctagta ctccgtcacatggcccgagagctgcatccggagtactacaaggactgctgacactgagctttctacaagggactttccgctggggactttccagggaggcgtggcctgggcgggactggggagtggcgagccctcagatgctgcatataagcagctgctttttgcctgtact gggtctctcttgttagaccagatctgagcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgccttgagtgcttca DNA sequences are meaningless! DNA sequences are meaningless! DNA sequences are meaningless! DNA sequences are meaningless! Interpreting data from many sources

Transcript of A Practical Overview of Bioinformatics - BioinfoHelpDesk.org

A Practical Overview of A Practical Overview of A Practical Overview of A Practical Overview of BioinformaticsBioinformaticsBioinformaticsBioinformatics

Chuong Huynh (Chuong Huynh (HuHuỳỳnhnh ChươngChương))

[email protected]@bioinfohelpdesk.org

Lac Hong UniversitySeptember 9, 2008

What is bioinformatics? - Definition

• My definition – bringing biological themes to computers

• Peter Elkin: Primer on Medical Genomics: Part V: Bioinformatics– “Bioinformatics is the discipline that develops and applies informatics to the

field of molecular biology.”

• BISTIC Bioinformatics Definition– “Research, development, or application of computational tools and

approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data”

• BISTIC Computational Biology Definition– “Computational Biology: the development and application of data-

analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.”

• http://www.bisti.nih.gov/

Useful/Necessary Bioinformatics Skills

• Strong background in some aspects of molecular biology!!!• Ability to communicate biological questions comprehensibly to

computer scientists• Thorough comprehension of the problem in the bioinformatics

field• Statistics (association studies, clustering, sampling)• Ability to filter, parse, and munge data and determine the

relationships between the data sets• Mathematics (e.g. algorithm development)• Engineering (e.g. robotics)• Good knowledge of a few molecular biology software packages

(molecular modeling / sequence analysis)• Command line computing environment (Linux/Unix knowledge)• Data administration (esp. relational database concept) and

Computer Programming Skills/Experience (C/C++, Sybase, Java, Oracle) and Scripting Language Knowledge (Perl and perhaps Phython)

High throughput DNA sequencing Centre

gggtctctcttgttagaccagatctgagcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgccttgagtgcttcaagtagtgtgtgcccgtctgttgtgtgactctgatagctagagatcccttcagaccaaatttagtcagtgtgaaa

aatctctagcagtggcgcctgaacagggacttgaaagcgaaagagaaaccagagaagctctctcgacgcaggactcggcttgctgaagcgcgcacggcaagaggcgaggggacggcgactggtgagtacgccaaaattttgactagcggaggctagaaggagagagatgggt

gcgagagcgtcgatattaagcgggggaggattagatagatgggaaaaaattcggttaaggccagggggaaagaaaaaatatagattaaaacatttagtatgggcaagcagggagctagaacgattcgcagtcaatcctggcctattagaaacatcagaaggttgtagacaaa

tactgggacaactacaaccagcccttcagacaggatcagaagaacttagatcattatataatacagtagcaaccctctattgtgtgcatcaaaagatagatgtaaaagacaccaaggaagctttagataagatagaggaagagcaaaacaaaagtaagaaaaaagcacagca

agcagcagctgacacaggaaatagcagccaggtcagccaaaattaccccatagtgcagaacatccaggggcaaatggtacatcaggccatatcacctagaactttaaatgcatgggtaaaagtagtagaagagaaggctttcagcccagaagtaatacccatgttttcagca

ttatcagaaggagccaccccacaagatttaaacaccatgctaaacacagtggggggacatcaagcagccatgcaaatgttaaaagagaccatcaatgaggaagctgcagaatgggatagattgcatccagtgcatgcagggcctcatccaccaggccagatgagagaaccaa

ggggaagtgacatagcaggaactactagtacccttcaggaacaaatagcatggatgacaaataatccacctatcccagtaggagaaatctataagagatggataatcctgggattaaataaaatagtaaggatgtatagccctaccagcattctggacataaaacaaggacc

aaaggaaccctttagagactatgtagaccggttctataagactctaagagccgagcaagcttcacaggaggtaaaaaattggatgacagaaaccttgttggtccaaaatgcgaacccagattgtaagactattttaaaagcattgggaccagcagctacactagaagaaatg

atgacagcatgtcagggagtgggaggacccggccataaagcaagagttttggcagaagcaatgagccaagtaacaaattcagctaccataatgatgcagaaaggcaattttaggaaccaaagaaaaattgttaagtgtttcaattgtggcaaagaagggcacatagccaaaa

attgcagggcccctaggaaaaggggctgttggaaatgtggaaaggagggacaccaaatgaaagattgtactgagagacaggctaattttttagggaaaatctggccttcccacaggggaaggccagggaattttcctcagaacagactagagccaacagccccaccagcccc

accagaagagagcttcaggtttggggaagagacaacaactccctctcagaagcaggagctgatagacaaggaactgtatccttcagcttccctcaaatcactctttggcaacgaccccttgtcacaataaagataggggggcaactaaaggaagctctattagatacaggag

cagatgatacagtattagaagaaataaatttgccaggaagatggaaaccaaaaatgatagggggaattggaggttttatcaaagtaagacagtatgatcaaatactcgtagaaatctgtggacataaagctataggtacagtattagtaggacctacacctgtcaacataat

tggaagaaatctgttgactcagattggttgcactttaaattttcccattagtcctattgaaactgtaccagtaaaattaaagccaggaatggatggcccaaaagttaaacaatggccattgacagaagaaaaaataaaagcattagtagaaatctgtacagaaatggaaaag

gaaggaaaaatttcaaaaatcgggcctgaaaatccatataatactccagtatttgccataaagaaaaaagacagtactaaatggagaaaattagtagatttcagagaacttaataagaaaactcaagacttctgggaagttcaattaggaataccacatcccgcagggttaa

aaaagaaaaaatcagtaacagtactggatgtgggtgatgcatatttttcagttcccttagataaagaattcaggaagtacactgcatttaccatacctagtataaacaatgagacaccagggattagatatcagtacaatgtgcttccacagggatggaaaggatcaccagc

aatattccaaagcagcatgacaaaaatcttagagccttttagaaaacaaaatccagacatagttatctatcaatacatggacgatttgtatgtaggatctgacttagaaatagggcagcatagaacaaaaatagaggaactgagacaacatctgttgaagtggggatttacc

acaccagacaaaaaacatcagaaagaacctccattcctttggatgggttatgaactccatcctgataaatggacagtacagcctatagtgctgccagaaaaggacagctggactgtcaatgacatacagaagttagtgggaaaattgaattgggcaagtcagatttacccag

ggattaaagtaaagcaattatgtagactccttaggggaaccaaggcactaacagaagtaataccactaacaaaagaagcagagctagaactggcagaaaacagggaaattctaaaagaaccagtacatggagtgtattatgacccatcaaaagacttaatagcggaaataca

gaagcaggggcaaggtcaatggacatatcaaatttatcaagagccatttaaaaatctgaaaacaggaaaatatgcaagaatgaggggtgcccacactaatgatgtaaaacaattaacagaggcagtgcaaaaaataaccacagaaagcatagtaatatggggaaagactcct

aaatttaaactacccatacaaaaagaaacatgggaaacatggtggacagagtattggcaagccacctggattcctgagtgggagtttgtcaatacccctcccttagtaaaattatggtaccagttagagaaagaacccataataggagcagaaactttctatgtagatgggg

cagctaacagggagactaaattaggaaaagcaggatatgttactaacaaagggagacaaaaagttgtctccataactgacacaacaaatcagaagactgagttacaagcaattcttctagcattacaggattctggattagaagtaaacatagtaacagactcacaatatgc

attaggaatcattcaagcacaaccagataaaagtgaatcagagatagtcagtcaaataatagagcagttaataaaaaaagaaaaggtctacctgacatgggtaccagcgcacaaaggaattggaggaaatgaacaagtagataaattagtcagtactggaatcaggaaagta

ctctttttagatggaatagataaagcccaagaagaacatgaaaaatatcacagtaattggagggcaatggctagtgattttaacctgccacctgtggtagcaaaagagatagtagccagctgtgataaatgtcagctaaaaggagaagccatgcatggacaagtagactgta

gtccaggaatatggcaactagattgtacacatttagaaggaaaaattatcctggtagcagttcatgtagccagtggatatatagaagcagaagttattccagcagaaacagggcaggaaacagcatactttctcttaaaattagcaggaagatggccagtaaaaacagtaca

tacagacaatggcagcaatttcaccagtactacagttaaggccgcctgttggtgggcaggaatcaagcaggaatttggcattccctacaatccccaaagtcaaggagtagtagaatctataaataaagaattaaagaaagttataggacagataagagatcaggctgaacat

cttaagacagcagtacaaatggcagtattcatccacaattttaaaagaaaaggggggattggggggtacagtgcaggggaaagaatagtagacataatagcaacagacatacaaactaaagaactacaaaaacaaattacaaaaattcaaaattttcgggtttattacaggg

acagcagagatccactttggaaaggaccagcaaagcttctctggaaaggtgaaggggcagtagtaatacaagataatagtgacataaaagtagtgccaagaagaaaagcaaagatcattagggattatggaaaacagatggcaggtgatgattgtgtggcaagtagacagga

tgaggattagaacatggaaaagtttagtaaaacaccatatgtatgtttcaaggaaagctaagggatggttttatagacatcactatgaaagtactcatccgagaataagttcagaagtacacatcccactagggaatgcaaaattggtaataacaacatattggggtctaca

tacaggagaaagagactggcatttgggtcaaggagtctccatagaattgaggaaaaggagatatagcacacaattagaccctaacctagcagaccaactaattcatctgcattactttgattgtttttcagaatctgctataagaaatgccatattaggacatatagttagc

cctaggtgtgaatatcaagcaggacataacaaggtaggatctctacagtacttggcactaacagcattagtaagaccaagaaaaaagataaagccacctttgcctagtgttacaaaactgacagaggatagatggaacaagccccagaagaccaagggccacaaagggaacc

atacaatgaatggacactagaacttttagaggagctcaagaatgaagctgttagacattttcctaggatatggctccatagcttagggcaacatatctatgaaacttatggagatacttgggcaggagtggaagccataataagaattctgcaacaactgctgtttattcat

ttcagaattgggtgtcaacatagcagaatagacattcttcgacgaaggagagcaagaaatggagccagtagatcctagactagagccctggaagcatccaggaagtcagcctaggactgcttgtaccaattgctattgtaaaaagtgttgctttcattgccaagtttgtttc

ataacaaaaggcttaggcatctcctatggcaggaagaagcggagacagcgacgaagagctcctcaagacagtcagactcatcaagtttctctatcaaagcagtaagtagtacatgtaatgcaatctttacaaatattagcagtagtagcattagtagtagcagcaataatag

caatagttgtgtggtccatagtattcatagaatataggaaaataagaagacaaaacaaaatagaaaggttgattgatagaataatagaaagagcagaagacagtggcaatgagagtgacggagatcaggaagaattatcagcacttgtggaaatggggcacgatgctccttg

ggatgttaatgatctgtaaagctgcagaaaatttgtgggtcacagtttattatggggtacctgtgtggaaagaagcaaccaccactctattttgtgcctcagatgctaaagcgtatgatacagaggtacataatgtttgggccacacatgcctgtgtacccacagaccccaa

cccacaagaagtagaactgaagaatgtgacagaaaattttaacatgtggaaaaataacatggtagaccaaatgcatgaggatataattagtttatgggatcaaagcctaaagccatgtgtaaaattaaccccactctgtgttactttaaattgcactgattatgggaatgat

actaacaccaataatagtagtgctactaaccccactagtagtagcgggggaatggaggggagaggagaaataaaaaattgctctttcaatatcaccagaagcataagagataaagtgaagaaagaatatgcacttttttatagtcttgatgtaataccaataaaagatgata

atactagctataggttgagaagttgtaacacctcagtcattacacaggcctgtccaaaggtatcctttgaaccaattcccatacattattgtgccccggctggttttgcgattctaaagtgtaatgataaaaagttcaatggaaaaggaccatgtacaaatgtcagcacagt

acaatgtacacatggaattaggccagtagtatcaactcaactgctgttaaatggcagtctagcagaagaagaggtagtaattagatcagacaatttctcggacaatgctaaagtcataatagtacatctgaatgaatctgtagaaattaattgtacaagactcaacaacatt

acaaggagaagtatacatgtaggacatgtaggaccaggcagagcaatttatacaacaggaataataggaaaaataagacaagcacattgtaacattagtagagcaaaatggaataacactttaaaacagatagttacaaaattaagagaacaatttaagaataaaacaatag

tctttaatcaatcctcaggaggggacccagaaattgtaatgcacagttttaattgtggaggggaatttttctactgtaattcaacacaactgtttaacagtacttggaatggtactgcatggtcaaataacactgaaggaaatgaaaatgacacaatcacactcccatgcag

aataaaacaaattataaacatgtggcaggaagtaggaaaagcaatgtatgcacctcccatcagaggacaaattagatgttcatcaaatattacagggctgatattaacaagagatggtggtattaaccagaccaacaccaccgagattttcaggcctggaggaggagatatg

aaggacaattggagaagtgaattatataaatataaagtagtaaaaattgaaccattaggagtagcacccaccaaggcaaagagaagagtggtgcaaagagaaaaaagagcagtgggaataataggagctatgctccttgggttcttgggagcagcaggaagcactatgggcg

cagcgtcaatgacgctgacggtacaggccagacaattattgtctggtatagtgcaacagcagaacaatttgctgagggctattgaggcgcaacagcatctgttgcacctcacagtctggggcatcaagcagctccaagcaagagtcctggctgtggaaagatacctaaggga

tcaacagctcctggggttttggggttgctctggaaaactcatttgcaccactgctgtgccttggaatactagttggagtaataaatctctgagtcagatttgggataacatgacctggatgcagtgggaaagggaaattgataattacacaagcttaatatacaacttaatt

gaagaatcgcaaaaccaacaagaaaagaatgaacaagagttattggaattagataactgggcaagtttgtggaattggtttagcataacaaattggctgtggtatataaaaatattcataatgatagtaggaggcttggtaggtttaagaatagtttttactgtactttcta

tagtaaatagagttaggcagggatactcaccattgtcgtttcagacgcgcctcccagccaggaggggacccgacaggcccgaaggaatcgaagaagaaggtggagagagagacagagacagatccggtcaattagtggatggattcttagcaattatctgggtcgacctgcg

gagcctgtgcctcttcagctaccaccgcttgagagacttactcttgattgtaacgaggattgtggaacttctgggacgcagggggtgggaagccctcaaatattggtggaatctcctacaatattggattcaggaactaaagaatagtgctgttagcttgctcaacgccaca

gccatagcagtagctgagggaactgatagggttatagaagtattacaaagagcttgtagagctattctccacatacctagaagaataagacagggcttagaaagggctttgcaataagatgggtggtaagtggtcaaaaagtagtaaaattggatggcctactgtaagggaa

agaatgagaagagctgagccagcagcagatggggtgggagcagtatctcgagacctggaaaaacatggagcaatcacaagtagtaatacagcaactaacaatgctgattgtgcctggctagaagcacaagaggaggaggaggtgggttttccagtcagacctcaggtacctt

taagaccaatgacttacaagggagcgttagatcttagccactttttaaaagaaaaggggggactggaagggctaatttggtcccagaaaagacaagacatccttgatttgtgggtccaccacacacaaggctacttccctgattggcagaactacacaccagggccagggat

cagatatccactgacctttggttggtgcttcaagctagtaccagttgagccagagaaggtagaagaggccaatgaaggagagaacaacagattgttacaccctgtgagcctgcatgggatggaggacccggagaaagaagtgttagtatggaggtttgacagccgcctagta

ctccgtcacatggcccgagagctgcatccggagtactacaaggactgctgacactgagctttctacaagggactttccgctggggactttccagggaggcgtggcctgggcgggactggggagtggcgagccctcagatgctgcatataagcagctgctttttgcctgtact

gggtctctcttgttagaccagatctgagcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgccttgagtgcttca

DNA sequences are meaningless!DNA sequences are meaningless!DNA sequences are meaningless!DNA sequences are meaningless!Interpreting data from many

sources

Problems and ChallengesProblems and ChallengesProblems and ChallengesProblems and Challenges

• Know the sequence of every possible transcript but not understand the functions of these transcripts and their corresponding proteins!

• How to make sense of all of the gene and protein data in order to assign functions to these genes and proteins and to understand biological processes at the molecular level?

From gene to protein and its From gene to protein and its From gene to protein and its From gene to protein and its function(sfunction(sfunction(sfunction(s))))

> DNA sequence> DNA sequenceAATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACACAATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACACTGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAATGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAATCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTATCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTAACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGGACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGGTTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAATTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAATTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTGTTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTGGTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGAGTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGACGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGCCGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGCTACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGATACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGAACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGGACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGGTAAGAAGATCGCGAACATCTAGTAGATAAGAAGATCGCGAACATCTAGTAGAGeneGeneGeneGene

> Protein sequence> Protein sequenceMKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSMKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNIDELLNEDILILGCSAMGDEVLEESEFEPFIEEISDVNIDELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISGKKVALFGSYGWGDGKWMRDFEERMNGYGTKISGKKVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNEPDEAEQDCIEFGKKIANICVVVETPLIVQNEPDEAEQDCIEFGKKIANIFunctionFunctionFunctionFunction

What is the function of these structures?What is the function of these structures?What is the function of this sequence?What is the function of this sequence?What is the function of this motif?What is the function of this motif?–– the fold provides a scaffold, which can be decorated the fold provides a scaffold, which can be decorated in different ways by different sequences to confer in different ways by different sequences to confer different functionsdifferent functions–– knowing the fold & function allows us to knowing the fold & function allows us to rationaliserationalisehow the structure effects its function at the molecular how the structure effects its function at the molecular levellevelGoals of Functional GenomicsGoals of Functional GenomicsGoals of Functional GenomicsGoals of Functional Genomics Annotation of eukaryotic genomes

transcription

RNA processing

translation

AAAAAAA

Genomic DNA

Unprocessed RNA

Mature mRNA

Nascent polypeptide

folding

Reactant A Product BFunction

Active enzyme

ab initio gene

prediction

Comparative gene

prediction

Functional

identification

Gm3

TranscriptionTranscription

DNADNA55’’ 33’’

mRNAmRNA SplicingSplicing

TranslationTranslation

PolyPoly--peptidepeptideFoldingFolding

ProteinProtein

•• Transport / LocalizationTransport / Localization

•• OligomerizationOligomerization•• PostPost--Translational ModificationTranslational Modification

FunctionFunction FunctionFunction

Bioinformatics is needed in all levels

TranscriptionTranscription

DNADNA55’’ 33’’

mRNAmRNA SplicingSplicing

TranslationTranslation

PolyPoly--peptidepeptideFoldingFolding

ProteinProtein

•• Transport / LocalizationTransport / Localization

•• OligomerizationOligomerization•• PostPost--Translational ModificationTranslational Modification

FunctionFunction FunctionFunction

At Genome Level Genome Projects need to store and organize DNA sequences

TranscriptionTranscription

DNADNA55’’ 33’’

mRNAmRNA SplicingSplicing

TranslationTranslation

PolyPoly--peptidepeptideFoldingFolding

ProteinProtein

•• Transport / LocalizationTransport / Localization

•• OligomerizationOligomerization•• PostPost--Translational ModificationTranslational Modification

FunctionFunction FunctionFunction

At Transcription Level How do we find protein How do we find protein How do we find protein How do we find protein coding regions, coding regions, coding regions, coding regions, intronsintronsintronsintronsand and and and exonsexonsexonsexons in genomic in genomic in genomic in genomic DNA sequences?DNA sequences?DNA sequences?DNA sequences?TranscriptionTranscription

DNADNA55’’ 33’’

mRNAmRNA SplicingSplicing

TranslationTranslation

PolyPoly--peptidepeptideFoldingFolding

ProteinProtein

•• Transport / LocalizationTransport / Localization

•• OligomerizationOligomerization•• PostPost--Translational ModificationTranslational Modification

FunctionFunction FunctionFunction

At Transcription Level Under which condition is a certain gene transcribed?

TranscriptionTranscription

DNADNA55’’ 33’’

mRNAmRNA SplicingSplicing

TranslationTranslation

PolyPoly--peptidepeptideFoldingFolding

ProteinProtein

•• Transport / LocalizationTransport / Localization

•• OligomerizationOligomerization•• PostPost--Translational ModificationTranslational Modification

FunctionFunction FunctionFunction

At Translation Level

What do we know about a specific protein?TranscriptionTranscription

DNADNA55’’ 33’’

mRNAmRNA SplicingSplicing

TranslationTranslation

PolyPoly--peptidepeptideFoldingFolding

ProteinProtein

•• Transport / LocalizationTransport / Localization

•• OligomerizationOligomerization•• PostPost--Translational ModificationTranslational Modification

FunctionFunction FunctionFunction

At Translation Level

How can we compare protein sequences?

TranscriptionTranscription

DNADNA55’’ 33’’

mRNAmRNA SplicingSplicing

TranslationTranslation

PolyPoly--peptidepeptideFoldingFolding

ProteinProtein

•• Transport / LocalizationTransport / Localization

•• OligomerizationOligomerization•• PostPost--Translational ModificationTranslational Modification

FunctionFunction FunctionFunction

At Structure Level

Can we predict Can we predict Can we predict Can we predict protein structures?protein structures?protein structures?protein structures? What are some basic bioinformatics tools that can assist biologists in

answering these biological questions?

Web Access: http://Web Access: http://Web Access: http://Web Access: http://www.ncbi.nlm.nih.govwww.ncbi.nlm.nih.govwww.ncbi.nlm.nih.govwww.ncbi.nlm.nih.gov The Entrez System: Text Searches

BLAST: Sequence Similarity Searches VAST: Structure Similarity Searches

SWISSSWISSSWISSSWISS----PROT/PROT/PROT/PROT/TrEMBLTrEMBLTrEMBLTrEMBL• Collaboration between the SIB

(CH) and EMBL/EBI (UK)

• SWISS-PROT: Fully annotated(manually), non-redundant, cross-referenced, documentedprotein sequence database

• TrEMBL: is automatically generated (from annotated EMBL coding sequences (CDS)) and annotated using software tools

ProteinProteinProteinProtein Data BankData BankData BankData Bank

• Sequence similarity: BLAST and others

• Beyond sequence similarity:matching sequences and shapes (threading)

Assigning function by similarity

MNIFEMLRID EGLRLKIYKD TEGYYTIGIG

HLLTKSPSLN AAKSELDKAI GRNCNGVITK

DEAEKLFNQD VDAAVRGILR NAKLKPVYDS

LDAVRRCALI NMVFQMGETG VAGFTNSLRM

LQQKRWDEAA VNLAKSRWYN QTPNRAKRVI

TTFRTGTWDA YKNL

• Ab initio modeling

• Threading & Fold Recognition

• Homology Modeling

?

Protein Structure Modeling

Rational Drug Design

• Understanding how structures bind other molecule (function)

• Designing inhibitors

• Docking, structure modeling

Drug Lead Screening & Docking

Complementarity- Shape- Chemical- Electrostatic

?

Drug Development Life CycleDrug Development Life CycleDrug Development Life CycleDrug Development Life Cycle

YearsYears

0 2 4 6 0 2 4 6 8 10 12 14 168 10 12 14 16

Discovery (2 to 10 Years)

Preclinical Testing(Lab and Animal Testing)

Phase I(20-30 Healthy Volunteers used to check for safety and dosage)

Phase II(100-300 Patient Volunteers used to check for efficacy and side effects)

Phase III(1000-5000 Patient Volunteersused to monitor reactions to

long-term drug use)

FDA Review& Approval

Post-Marketing Testing

$600-700 Million!

7 – 15 Years!

With the aid of Bioinformatics

Drug lead screeningDrug lead screeningDrug lead screeningDrug lead screening

5,000 to 10,000 compounds screened

250 lead candidates inPreclinical

Testing5 drug candidatesenter Clinical Testing;80% pass Phase I

One drug approved by the FDA

30% pass Phase II

80% pass Phase III

Applications of BioinformaticsApplications of BioinformaticsApplications of BioinformaticsApplications of Bioinformatics

Molecular Interactions Structure Prediction

NH

O COO-H

NN

N

OH

NH2

N

CH2

NH

N

NH

O

COO-

COO-H

N

N

NH

N

OH

NH2

Search for new drugs

NH2

NH2

N

N

CH3

Cl

N

CH3

NH2

NH2

N

N CH2

OCH3

OCH3

OCH3

NH2

NH2

N

N CH2

OCH3

OCH3

OCH3

HC

NH

NH2

N

NH

CH3

Cl

NH

CH3

HC

NH

NH2

N

NH

CH 3

Cl

NH

CH3

Cl

data analysis, algorithms, data analysis, algorithms, data analysis, algorithms, data analysis, algorithms, visualization, statistics, etc.visualization, statistics, etc.visualization, statistics, etc.visualization, statistics, etc.DNA chips

Biochemical Networks

Genetic Variations

Optimizing therapies

Sequence Analysis

Genomes

Proteins d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ- NLVIMGKKTWFSI d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ- NAVIMGKKTWFSI d4dfra_ ISLIAALAVDRVIGMENAMPWN- LPADLAWFKRNTL-------- NKPVIMGRHTWESI d3dfr__ TAFLWAQDRDGLIGKDGHLPWH- LPDDLHYFRAQTV-------- GKIMVVGRRTYESF

d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ- NLVIMGKKTWFSId8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ- NAVIMGKKTWFSId4dfra_ ISLIAALAVDRVIGMENAMPW- NLPADLAWFKRNTLD-------- KPVIMGRHTWESId3dfr__ TAFLWAQDRNGLIGKDGHLPW- HLPDDLHYFRAQTVG-------- KIMVVGRRTYESF

caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaacaacaagcc

aaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatct

cttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatat

tgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaa

actttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgaca

ttgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagca

agcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaa

tacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatga

aatccaccattcaattacaacaagatcctctttcttgcacttgg

Access to Biomedical Literature

• HINARI: http://www.who.int/hinari/ - get full text biomedical and health literature

• OARE: http://www.oaresciences.org/ - get full text environmental sciences

• AGORA: http://www.aginternetwork.org/ - get full text agriculture

• PubMed Central: http://www.pubmedcentral.nih.gov/- get full text biomedical and life sciences journal literature

• PubMed: http://www.pubmed.gov/ - search biomedical literature

Logging in to HINARI 1

Log-in to the HINARI website by clicking “HINARI LOGIN”

Logging into HINARI 2

We will need to insert your HINARI User ID and Password in the Login box and click on the “Sign On” button.

Accessing journals via PubMed

Click on the link to find articles through PubMed.

Accessing PubMed for HINARI Users

http://hinari-gw.who.int/whalecomwww.ncbi.nlm.nih.gov/whalecom0/sites/entrez?myncbishare=hinari_who

Accessing PubMed for HINARI Users Accessing PubMed for HINARI Users

Accessing PubMed for HINARI Users

Another window will open at the journal publishers’ website.

Accessing PubMed for HINARI Users

Training Material : Viet Nam

• HINARI: Training: Vietnamese Translation -Dịch Tiếng Việt

http://www.bioinfohelpdesk.org/PubMedSep2005/vn_translation/

http://www.bioinfohelpdesk.org/PubMedSep2006/���� Reading Packet, including Vietnamese Translations

Most international scientific research is written in English.

Literature DatabasesLiterature DatabasesLiterature DatabasesLiterature Databases

Databases & Visualization Tools– Primarily based on Nucleotide Sequences!

Examples: GenBank (Sequence Formatter) Genome (MapViewer)UniGene (Expression Profile Viewer) SNP (GeneView & Genotype Viewer)

— We also have some other types of data:Examples: Protein (Translations & Curations seen by Sequence Formatter)

HomoloGene (Protein Sequence Clusters)Genome Project (Information and links for Organisms’ Sequencing Projects)GEO (Arrays & Spectras seen with Heat Maps) Structure (Coordinates seen with Cn3D)BioAssay (Activities seen in Reports)

Search Methods– Entrez: word/text searches

– BLAST: sequence searches– in-lieu-of-BLAST: pre-computed BLAST searches

– VAST: macromolecular structure searches

– PubChem Structure Search: molecular structure searches

Linking– Pre-computed and Hand-curated Associations

• Gene Database: A central source for gene information & links

• Hemochromatosis Example: Relating data to find a biomedical answer

“What is the Entrez System?”

● A text search and retrieval engine

● A virtual workspace for manipulating large datasets

● An organized system for linking biological data

● A network of 29 linked databases

Types of Databases

• Primary Databases– Raw and redundant Data…..submitted, “owned” and updated

by experimentalists

• Examples: GenBank, SNP, GEO, PubChem Substance & BioAssay

• Derivative Databases– Human-curated (compilation and curation of data)

• Examples: GEO Datasets, Structure & Literature databases

– Computationally-Derived

• Example: UniGene, HomoloGene, PubChem Compound

– Combination

• Examples: RefSeq, Genome Assembly, Conserved Domain and Structure databases

11111111ºººººººº Sequence DatabaseSequence DatabaseSequence DatabaseSequence DatabaseSequence DatabaseSequence DatabaseSequence DatabaseSequence Database

GenBank

• Nucleotide only sequence database

• Archival in nature

• Each record is assigned a stable accession number

• Submission of GenBank Data to NCBI

– Direct submissions of individual records via Web(BankIt, Sequin)

– Batch submissions of bulk sequences via Email(EST, GSS, STS)

– FTP accounts for Sequencing Centers

• Three collaborating databases and other sources of data

The International SequenceDatabase Collaboration

EBI

GenBank

DDBJEMBL

EMBLEMBL

Entrez

SRS

getentry

NIGNIGCIB

NCBI

NIHNIH

•Submissions•Updates

•Submissions•Updates

•Submissions•Updates

GenBank

• full release every two months• incremental and cumulative updates daily• available only through internet

ftp://ftp.ncbi.nih.gov/genbank/

Release 161 December 200780,388,382 Records

83,874,179,730 Nucleotides

>140,000 Species

335 Gigabytes 1449 files

GenBank Divisions

“Organismal”(Traditional)

PRI (34) PrimateROD (26) RodentPLN (26) Plant and FungalBCT (24) Bacterial/ArchealINV (11) InvertebrateVRT (14) Other VertebrateVRL (8) ViralMAM (3) MammalianPHG (1) PhageSYN (5) SyntheticENV (6) Envir. samplesUNA (1) Unannotated

“Functional”(Bulk)

EST (635) Expressed Sequence TagGSS (264) Genome Survey SequenceHTG (98) High Throughput GenomicHTC (9) High Throughput ContigPAT (30) PatentSTS (14) Sequence Tagged SiteCON (82) Contigs, virtual

• Organized by taxonomy (sort of)• Direct submissions (Sequin/Bankit)• Accurate (~1 error per 10,000 bp)• Well characterized

• Organized by sequence type• Batch submissions (ftp/email) • Less accurate• Poorly characterized

EST (Expressed Sequence Tags)48,992,175

GSS (Genome Survey Sequence)21,636,089

Shredding

Draft Sequence (HTG/HTC divisions)

whole BAC genomic insert or genome

Cloning &Isolating

Assembling

Sequencing

GSS databaseor Trace Archive

NEW Re-Organization of theNucleotide Database

CoreNucleotide44,068,877

114,679,141 Total Nucleotide Records

RNARNA

nucleus~30,000 genes

Isolate RNA &make cDNA library

3’

5’

3’3’

5’5’

isolate unique clones sequence once from

each end

ESTdatabase

CoreNuc38%

EST43%

GSS19%

Sequence Databases: File Formats

ASN.1 – The Raw Data

XML

FASTA

GenBankflat file

A Traditional GenBank Record

LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004

DEFINITION Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,

complete cds.

ACCESSION AY182241

VERSION AY182241.2 GI:32265057

KEYWORDS .

SOURCE Malus x domestica (cultivated apple)

ORGANISM Malus x domestica

Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;

Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;

rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.

REFERENCE 1 (bases 1 to 1931)

AUTHORS Pechous,S.W. and Whitaker,B.D.

TITLE Cloning and functional expression of an (E,E)-alpha-farnesene

synthase cDNA from peel tissue of apple fruit

JOURNAL Planta 219, 84-94 (2004)

REFERENCE 2 (bases 1 to 1931)

AUTHORS Pechous,S.W. and Whitaker,B.D.

TITLE Direct Submission

JOURNAL Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,

USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD

20705, USA

REFERENCE 3 (bases 1 to 1931)

AUTHORS Pechous,S.W. and Whitaker,B.D.

TITLE Direct Submission

JOURNAL Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,

USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD

20705, USA

REMARK Sequence update by submitter

COMMENT On Jun 26, 2003 this sequence version replaced gi:27804758.

FEATURES Location/Qualifiers

source 1..1931

/organism="Malus x domestica"

/mol_type="mRNA"

/cultivar="'Law Rome'"

/db_xref="taxon:3750"

/tissue_type="peel"

gene 1..1931

/gene="AFS1"

CDS 54..1784

/gene="AFS1"

/note="terpene synthase"

/codon_start=1

/product="(E,E)-alpha-farnesene synthase"

/protein_id="AAO22848.2"

/db_xref="GI:32265058"

/translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK

NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF

EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE

NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS

LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW

ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS

EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT

KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA

DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK

GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI

LSLLFQPLVN"

ORIGIN

1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat

61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg

121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt

181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga

241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt

1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt

1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa

1921 aaaaaaaaaa a

//

Header

Feature Table

Sequence

Field Indexed Terms

[primary accession] M17755[title] Homo sapiens thyroid peroxidase (TPO) mRNA…[organism] Homo sapiens[sequence length] 3060[modification date] 1999/04/26[properties] biomol_mrna

gbdiv_prisrcdb_genbank

Indexing for Nucleotide UID:4680720

A Traditional GenBank Record

LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004

DEFINITION Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,

complete cds.

ACCESSION AY182241

VERSION AY182241.2 GI:32265057

KEYWORDS .

SOURCE Malus x domestica (cultivated apple)

ORGANISM Malus x domestica

Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;

Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;

rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.

REFERENCE 1 (bases 1 to 1931)

AUTHORS Pechous,S.W. and Whitaker,B.D.

TITLE Cloning and functional expression of an (E,E)-alpha-farnesene

synthase cDNA from peel tissue of apple fruit

JOURNAL Planta 219, 84-94 (2004)

REFERENCE 2 (bases 1 to 1931)

AUTHORS Pechous,S.W. and Whitaker,B.D.

TITLE Direct Submission

JOURNAL Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,

USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD

20705, USA

REFERENCE 3 (bases 1 to 1931)

AUTHORS Pechous,S.W. and Whitaker,B.D.

TITLE Direct Submission

JOURNAL Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,

USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD

20705, USA

REMARK Sequence update by submitter

COMMENT On Jun 26, 2003 this sequence version replaced gi:27804758.

FEATURES Location/Qualifiers

source 1..1931

/organism="Malus x domestica"

/mol_type="mRNA"

/cultivar="'Law Rome'"

/db_xref="taxon:3750"

/tissue_type="peel"

gene 1..1931

/gene="AFS1"

CDS 54..1784

/gene="AFS1"

/note="terpene synthase"

/codon_start=1

/product="(E,E)-alpha-farnesene synthase"

/protein_id="AAO22848.2"

/db_xref="GI:32265058"

/translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK

NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF

EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE

NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS

LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW

ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS

EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT

KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA

DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK

GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI

LSLLFQPLVN"

ORIGIN

1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat

61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg

121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt

181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga

241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt

1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt

1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa

1921 aaaaaaaaaa a

//

Header

Feature Table

Sequence

Field Indexed Terms

[gene name] TPO[text word] thyroiditis[protein name] thyroid peroxidase

Indexing for Nucleotide UID:4680720

The sequence itselfis not indexed…

Use BLAST for that!

Which one is the best sequence?

Primary vs. DerivativeSequence Databases

GenBankGenBank

SequencingSequencingCentersCenters

GA

GAGA

ATT

ATTC

CGAGA

ATT

ATTC

C

AT

GAGA

ATTC

C GAGA

ATTC

C

TTGACA

ATTGACTA

ACGTGC

TTGACA

CGTGAAT

TGACTA

TATAGCCG

ACGTGC

ACGTGCACGTGC

TTGACA

TTGACA

CGTGA

CGTGA

CGTGA

ATTGACTA

ATTGACTAATTGACTA

ATTGACTA

TATAGCCG

TATAGCCG

TATAGCCG

TATAGCCGTATAGCCG TATAGCCGTATAGCCG TATAGCCG

CATT

GAGA

ATTC

C

GAGA

ATTC

C LabsLabs

AlgorithmsAlgorithms

UniGene

CuratorsCurators

RefSeq

GenomeAssembly

TATAGCCGAGCTCCGATACCGATGACAA

Updated continually by NCBI

Updated ONLY by submitters

genomes transcripts proteins

GenBank

Derivative Sequence DatabaseDerivative Sequence DatabaseDerivative Sequence DatabaseDerivative Sequence Database

• The curated “best representative” sequences“non-redundant”

• Standardized nomenclature and record structure• Added annotation (references, sequence features)

Explicitly linked nucleotide and protein sequences

Validated by hand

• Updated to reflect current sequence data and biology• Stewardship by NCBI staff and collaborators

RELEASE 30 (July 7, 2008) IS NOW AVAILABLE ON THE FTP SITE!

RefSeqRefSeqRefSeqRefSeqRefSeqRefSeqRefSeqRefSeq Sequence DatabaseSequence DatabaseSequence DatabaseSequence DatabaseSequence DatabaseSequence DatabaseSequence DatabaseSequence Database

Curated genomic DNACurated genomic DNA(NC, NT, NW)(NC, NT, NW)

Curated Model mRNACurated Model mRNA (XM)(XM)(XR)(XR)

Curated mRNACurated mRNA (NM)(NM)(NR)(NR)

Model protein Model protein (XP)(XP)

RefSeq Curation Processes

ProteinProtein (NP)(NP)

Scanning....

Curated RefSeq Records

COMMENT REVIEWED REFSEQ: This record has been curated by NCBI

staff. The reference sequence was derived from X66503.1.

Summary: Adenylosuccinate synthetase catalyzes the first

committed step in the conversion of IMP to AMP.

LOCUS ADSS 1368 bp mRNA linear PRI 27-AUG-2002

DEFINITION Homo sapiens adenylosuccinate synthase (ADSS), mRNA.

ACCESSION NM_001126

VERSION NM_001126.1 GI:4557270 RefSeq NucleotideRefSeq Nucleotide

LOCUS ADSS 455 aa linear PRI 27-AUG-2002

DEFINITION adenylosuccinate synthase; Adenylosuccinate synthetase

(Ade(-)H-complementing) Homo sapiens .

ACCESSION NP_001117

VERSION NP_001117.1 GI:4557271

DBSOURCE REFSEQ: accession NM_001126.1 RefSeq ProteinRefSeq Protein

X records:X records: Genome Annotation & Inferred or PredictedGenome Annotation & Inferred or Predictedvsvs

N records:N records: Provisional, Reviewed or ValidatedProvisional, Reviewed or Validated

The Perils of the XM

XM records are models based only on genomic sequenceand are subject to revision or removal with each new build of that genome.

Query= gi|20850420|ref|XM_124429.1|

Mus musculus expressed sequence AA553001 (AA553001), mRNA

gi|19527087|ref|NM_133873.1|

Mus musculus DNA segment, Chr 4, Wayne State University 114,

expressed (D4Wsu114e), mRNA Length=1898

Score = 3701.55 bits (1867), Expect = 0

Identities = 1870/1871 (99%), Gaps = 0/1871 (0%) Strand=Plus/Plus

BLAST the XM against the RefSeq database to look for a replacement:

Sequence DatabasesSequence DatabasesSequence DatabasesSequence Databases

GSS -EST -CoreNucleotide -

genome surveysequences

expressedsequence tags

the restMapping Genome Data on an Assembly:Mapping Genome Data on an Assembly:Genome SequenceGenome Sequence

(RefSeq: NC, NT, NW)(RefSeq: NC, NT, NW)Transcript regions & Transcript regions & ORFsORFs

(RefSeq: NM/NP, XM/XP)(RefSeq: NM/NP, XM/XP)Markers (STS)Markers (STS)Polymorphisms (SNP)Polymorphisms (SNP)ESTs/ExonsESTs/Exons ((UniGeneUniGene))

• “Best representative” (reference) sequences

• Standardized nomenclature and record structure

• Added annotation (references, sequence features)

• Assembled and annotated genomes

Complete Genomesas of November 2007

• Organelles:– Mitochondria (1214)

– Plastids (115)

– Plasmids (1148)

• Viruses (2854)

•• ArchaebacteriaArchaebacteria(46(46 completecomplete))

•• EubacteriaEubacteria(521(521completecomplete))

•• Eukaryotes Eukaryotes (25(25completecomplete/162 /162 assembliesassemblies))

Simple Genomes

• Full chromosomal sequences are provided

• Genes are annotated

• The annotation can be shown graphically and linked to sequence records

LOCUS NC_000913 4639221 bp DNA circular BCT 30-JUL-2003

DEFINITION Escherichia coli K12, complete genome.

ACCESSION NC_000913

VERSION NC_000913.1 GI:16127994

KEYWORDS .

SOURCE Escherichia coli K12.

ORGANISM Escherichia coli K12

Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;

Enterobacteriaceae; Escherichia.

REFERENCE 1 (bases 1 to 4639221)

AUTHORS Blattner,F.R., Plunkett,G. III, Bloch, C.A., Perna, N.T., Burland,V.,

Riley,M., Collado-Vides,J., Glasner,J.D., Rode, C.K., Mayhew,G.F.,

Gregor,J., Davis,N.W., Kirkpatrick,H.A., Goeden,M.A., Rose,D.J.,

Mau,R. and Shao,Y.

TITLE The complete genome sequence of Esherichia coli K12.

JOURNAL Science 277 (5331), 1453-1474 (1997)

MEDLINE 97426617

PUBMED 9278503

REFERENCE 2 (bases 1 to 4639221)

AUTHORS Blattner,F.R.

TITLE Direct submission

JOURNAL Sumbitted (16-JAN-1997) Guy Plunkett III, Laboratory of Genetics,

University of Wisconsin, 445 Henry Mall, Madison, WI 53706, USA.

E-mail [email protected] Phone: 608-262-2543 Fax:

RefSeq Chromosomes: NC_

gene 3954631..3956478

/gene="mutL"

/locus_tag="b4170"

/note="synonym: mut-25"

CDS 3954631..3956478

/gene="mutL"

/locus_tag="b4170"

/function="methyl-directed mismatch repair"

/codon_start=1

/transl_table=11

/product="MutL"

/protein_id="NP_418591.1"

/db_xref="GI:16131992"

/translation="MPIQVLPPQLANQIAAGEVVERPASVVKELVENSLDAGATRIDI

DIERGGAKLIRIRDNGCGIKKDELALALARHATSKIASLDDLEAIISLGFRGEALASI

SSVSRLTLTSRTAEQQEAWQAYAEGRDMNVTVKPAAHPVGTTLEVLDLFYNTPARRKF

LRTEKTEFNHIDEIIRRIALARFDVTINLSHNGKIVRQYRAVPEGGQKERRLGAICGT

AFLEQALAIEWQHGDLTLRGWVADPNHTTPALAEIQYCYVNGRMMRDRLINHAIRQAC

EDKLGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQL

ETPLPLDDEPQPAPRSIPENRVAAGRNHFAEPAAREPVAPRYTPAPASGSRPAAPWPN

AQPGYQKQQGEVYRQLLQTPAPMQKLKAPEPQEPALAANSQSFGRVLTIVHSDCALLE

RDGNISLLSLPVAERWLRQAQLTPGEAPVCAQPLLIPLRLKVSAEEKSALEKAQSALA

ELGIDFQSDAQHVTIRAVPLPLRQQNLQILIPELIGYLAKQSVFEPGNIAQWIARNLM

SEHAQWSMAQAITLLADVERLCPQLVKTPPGGLLQSVDLHPAIKALKDE"

Annotation ofGene, CDS,

and other features

BASE COUNT 978672 a1011074 c 997153 g 974742 t

ORIGIN

1 cgtcttcatt gtcagacagc agaatttgta cgcgctgttc ggcttgttgt aatttggcct

61 gcccctgacg tgccagctgc acgccgcgtt cgaactcgtt cagcgcctct tccagcggca

121 ggtcgccact ttccagacgg gttacaatct gttccagctc gctcagcgcc ttttcaaagc

181 tggcgggcgc ctcatttttc ttcggcataa tgaatgtctg actctcaata tttttcgccc

241 cgtcatggta acggactcag ggcaaatagc aaataacgcg caatggtaag gtgatgtgca

301 cagcaaagcg atgttagtgg tatacttccg cgcctggatg cagccgcagg tgtgggctgc

361 tgtatttttc cctatacaag tcgcttaagg cttgccaacg aaccattgcc gccatgaagt

421 ttatcattaa attgttcccg gaaatcacca tcaaaagcca atctgtgcgc ttgcgcttta

481 taaaaatcct taccgggaac attcgtaacg ttttaaagca ctatgatgag acgctcgctg

541 tcgtccgcca ctgggataac atcgaagttc gcgcaaaaga tgaaaaccag cgtctggcta

601 ttcgcgacgc tctgacccgt attccgggta tccaccatat tctcgaagtc gaagacgtgc

661 cgtttaccga catgcacgat attttcgaga aagcgttggt tcagtatcgc gatcagctgg

721 aaggcaaaac cttctgcgta cgcgtgaagc gccgtggcaa acatgatttt agctcgattg

781 atgtggaacg ttacgtcggc ggcggtttaa atcagcatat tgaatccgcg cgcgtgaagc

841 tgaccaatcc ggatgtgact gtccatctgg aagtggaaga cgatcgtctc ctgctgatta

901 aaggccgcta cgaaggtatt ggcggtttcc cgatcggcac ccaggaagat gtgctgtcgc

961 tcatttccgg tggtttcgac tccggtgttt ccagttatat gttgatgcgt cgcggctgcc

Genome sequence

mutL

Complex Genomes

• Sequences are provided complete or we help assemble

• Heavy annotation: Genes, transcript regions & ORFs, sequence variations & markers, clones, ESTs, etc.

• The annotation can be shown graphically and linked to other

databases using the MapViewerMapViewer

�������� 1: NT_034400. Homo sapiens chro...[gi:51458694] Links

Click here to see all features and the sequence of this contig record.

LOCUS NT_034400 1065823 bp DNA linear CON 19-AUG-2004

DEFINITION Homo sapiens chromosome 1 genomic contig.

ACCESSION NT_034400

VERSION NT_034400.3 GI:51458694

KEYWORDS .

SOURCE Homo sapiens

ORGANISM Homo sapiens

Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;

Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini;

Hominidae; Homo.

REFERENCE 1 (bases 1 to 1065823)

AUTHORS International Human Genome Sequencing Consortium.

TITLE The DNA sequence of Homo sapiens

JOURNAL Unpublished (2003)

COMMENT GENOME ANNOTATION REFSEQ: Features on this sequence have

been produced for build 35 version 1 of the NCBI's genome

annotation [see documentation].

On Aug 19, 2004 this sequence version replaced gi:27478327.

The DNA sequence is part of the third release of the

finished human reference genome. It was assembled from

individual clone sequences by the Human Genome Sequencing

Consortium in consultation with NCBI staff.

COMPLETENESS: not full length.

RefSeq Contig: NT_

gene complement(2548206..2591802)

/gene="ADSS"

/db_xref="LocusID:159"

/db_xref="MIM:103060"

mRNA complement(join(2548206..2549349,2550998..2551147,

2555692..2555789,2557339..2557463,2558471..2558625,

2559881..2560007,2562526..2562607,2563644..2563751,

2564012..2564078,2572236..2572286,2576516..2576584,

2577357..2577459,2591326..2591802))

/gene="ADSS"

/product="adenylosuccinate synthase"

/note="Derived by automated computational analysis using gene

prediction method: BLAST. Supporting evidence includes

similarity to: 3 mRNAs"

/transcript_id="XM_049992.8"

/db_xref="GI:22045950"

/db_xref="LocusID:159"

/db_xref="MIM:103060"

Annotation ofGene, mRNA, CDS,and other features

CONTIG join(AL139152.7:1..55543,AL596177.4:1998..91084,

AL356378.17:1999..202955,AL391904.14:2001..68222,

AL590667.7:2001..175494,AL359207.7:2001..112707,

AL365260.11:2001..114412,complement(AL445591.10:1..138092),

BX537254.7:2001..121309)

// Ordering of draft sequences

Click here to see all features and the sequence of this contig

record.

Higher Genome MapViews

adss

Genomic BLAST

Search the maps

Species-specific help!

Higher Genome MapViewsMap Viewer Help

Human Maps Help

Maps&Options

Examples of Maps & Mapped Data

--Sequence maps---Ab initio (model)AssemblyRepeatsBES_CloneCloneNCI_CloneContigComponentCpG islanddbSNP haplotypeFosmidGenBank_DNAGenePhenotypeSAGE_TagSTSTCAG_RNATranscript (RNA)UniGene ESTVariation

--Cytogenetic maps--IdeogramFISH CloneNCI FISH CloneGene_CytogeneticMitelman BreakpointMorbid/Disease--Genetic Maps--deCODEGenethonMarshfield--RH maps---GeneMap99-G3GeneMap99-GB4NCBI RHStandford-G3TNGWhitehead-RHWhitehead-YAC

Sequence DatabasesSequence DatabasesSequence DatabasesSequence Databases

GSS -EST -CoreNucleotide -

genome surveysequences

expressedsequence tags

the rest

GSS -EST -CoreNucleotide -

genome surveysequences

expressedsequence tags

the rest

BLASTBLAST

SequenceSequenceVASTVAST

StructureStructure

EntrezEntrez

TextText

Searching the NCBI Databases

PubChemPubChem

StructureStructure

SearchSearch

Small MoleculeSmall MoleculeStructureStructure

TextClinical FeaturesOther FeaturesDiagnosisInheritanceClinical ManagementMappingMolecular GeneticsGene TherapyMolecular GeneticsPopulation GeneticsAnimal ModelAllelic Variants HistoryReferences

EXAMPLE: Searching with Entrez

“How do I retrieve the mRNA record NM_000311?”

“Finding information on Sickle Cell Anemia…”

1

3

2

How to Query a Database

(term1[tag delimiter] op term2[tag delimiter] op …)

tag delimiter = Entrez indexing field

op = AND, OR, NOT

OrganismJournalUser compoundsAuthor

�Boolean operators MUST be in ALL CAPS!

Examples oftag delimiters

1

3

term1 term22

Brauninger a c-src kinase

OrganismJournalUser compoundsAuthor

EXAMPLE: Searching with Entrez

“How do I download the human proteome?”

1. Choose the ProteinProtein database to search 2. Retrieve HumanHuman records

human[organismhuman[organism]]

3. Narrow to only RefSeqRefSeq recordsAND AND srcdb_refseq[propertiessrcdb_refseq[properties]]

(or use the new (or use the new ““RefSeqRefSeq”” Tab!)Tab!)

4. Choose your “DisplayDisplay” format

5. “SendSend” the data to a “FileFile”

human

Using Fields to Sequence RecordsAccessionAll FieldsAuthor EC/RN NumberFeature KeyFilterGene NameIssueJournalKeywordModification DateMolecular WeightOrganismPage NumberPrimary AccessionPropertiesProtein NamePublication DateSeqID StringSequence LengthSubstance NameText WordTitleVolume

Most useful search field: [Organism]human[orgn] …or… bacteria[orgn]

Useful search terms in [Properties]: srcdb: “source database” srcdb_refseq[prop]

gbdiv: “genbank division” gbdiv_est[prop]

biomol: “biomolecular type” biomol_mrna[prop]

Downloading Records Downloading a Record

EXAMPLE: Searching with Entrez

“How do I find information on genesexpressed in mouse pancreas?”

“How do I get a structure of theHIV-1 Reverse transcriptase?”

“How do I find shigella proteins between24 and 36kDa?”

http://www.ncbi.nlm.nih.gov/entrez/cubby.fcgi

Be lazy – Get the new query results Emailed to you. Precomputed BLAST Services

Some Entrez Links:A List of Similar Sequences: Related SequencesStructures with Similar Sequences: Related StructuresThe Multifunctional Blast Link: Blink

Some BLAST-generated Databases & Links:SNP:GeneView SNP:GeneViewUniGene (Transcript Clusters) UniGeneHomoloGene (Protein Homologs)

HomoloGeneCDD (Functional Domains) Conserved Domains

& CDART (modular domains) Domain Relatives

You may not have to run a BLAST searchor you may have already used BLAST

and did not even know it!

Most similar

Least similar

Links

Related Sequences:Precomputed BLASTn & BLASTp Lists

The “Related Sequences” link retrievesGenBank/GenPept sequences sorted by BLAST score,

but with no alignment details.

Nucleotide

Protein

Links

BLink: Precomputed BLASTp

• Lists only 200 hits

• List is nonredundant

NewNewNewNew

Summary pages of sequence and expression information

for sets of expressed sequences clustered by BLAST

UniGene Cluster Hs.XXXXX Homo sapiens

SEE ALSO: LocusLink | OMIM | HomoloGene

MODEL ORGANISM PROTEIN SIMILARITIES

MAPPING INFORMATION

EXPRESSION INFORMATION

mRNA SEQUENCES

EST SEQUENCES

EXAMPLE: Finding Transcripts & Tissues

“How do I find information on expression of my gene?”

1. Choose the UniGene database 2. Search with a gene name or UID

(g6pdg6pd: Glucose-6-phosphate dehydrogenase)

3. Look at the General Information page

4. Click on the Expression Profile to see in which tissues the transcripts are found.

5. You can download all of the sequences.

Clusters of homologous protein sequences based on BLASTp

(also guided by taxonomic tree information)

- Includes orthologs and paralogs -

early globin gene

A-chain gene B-chain gene

frog A chick A mouse A mouse B chick B frog Bfrog A chick A mouse A mouse B chick B frog B

paralogsorthologs orthologs

gene duplication

EXAMPLE: Finding Homologous Proteins

“How do I findhomologs of my protein?”

1. Choose the HomoloGene database 2. Search with a gene name or UID

(g6pdg6pd: Glucose-6-phosphate dehydrogenase)

3. Look at the General Information page

4. Display Multiple Alignment format

5. Find all of the comparison statistics and BLAST2 alignment already calculated!

g6pd

aa%IDamino acid % identity

nt%IDnucleotide % identity

Devolutionary distance(Jukes-Cantor Model)

Ka/Ksnon-synonymous/synonymous

codon change ratio

Knr/Kncradical/conservative

amino acid change ratio

“Reverse-Position Specific” Sequence Comparisons (RPS-BLAST)a.k.a. “Conserved Domain Database” (CDD) Search

Conserved sequence elements that perform common functions curated from multiple sequence alignments with similar function

(Position-Specific Scoring Matrices)

10 10 10 10 20 30 40 20 30 40 20 30 40 20 30 40 50 6050 6050 6050 60. . . . * . . . . | . . . . * . . . . | . . . . * . . . . | . . . . . . * . . . . | . . . . * . . . . | . . . . * . . . . | . . . . . . * . . . . | . . . . * . . . . | . . . . * . . . . | . . . . . . * . . . . | . . . . * . . . . | . . . . * . . . . | . . . . * . . . . | . . . . * . . . . | . . . . * . . . . |. . * . . . . | . . . . * . . . . | . . . . * . . . . |. . * . . . . | . . . . * . . . . | . . . . * . . . . |. . * . . . . | . . . . * . . . . | . . . . * . . . . |consensus 1 KWEIPREDLTLGKKLGEGAFGEVYKGTLKGkgd---nkSIDVAVKTLKEDASEeqIKEFL 57

1FGI A 1 aWEIPRESLRLEVKLGQGCFGEVWMGTWNG--------TTRVAIKTLKPGTMS--PEAFL 311

1BYG A 1 RWELPRDRLVLgkPLGEGAFGQVYLAEAIglgkdkpnrvTKVAVKMLKSDAtedkLSLDI 74

gi 125135 1 GWALNMKELKLlqTIGKGEFGDVMLGDYRg---------NKVAVKCIKNDAt---AQAFL 62

gi 125702 1 KYEIPRTDLTLkhKLGGGQYGEVYEGVWKky-------sLTVAVKTLKEDTm--eVEEFL 284

gi 1174437 1 KWEIPRSELTIlrKLGRGNFGEVFYGKWRn--------sIDVAVKTLREGTm--sTAAFL 325NCBIProtein

Clusters

NCBICOG

EMBLSMART

SangerPfam

NCBICD

Finding CDs for a Query Protein

MGNAAAAKKGSEQESVKEFLAKAKEDFLKKWESPAQNTAHLDQFERIKTLG

TGSFGRVMLVKHKETGNHYAMKILDKQKVVKLKQIEHTLNEKRILQAVNFP

FLVKLEFSFKDNSNLYMVMEYVPGGEMFSHLRRIGRFSEP…………………………

CDART:CDART:CDART:CDART:Conserved Domain Architecture Retrieval Tool

Modular Architecture of Domains

Cartoon descriptions of relative protein domain organization

Allows for comparison with other proteins with the same domain

Finding CDs for a Query Protein CDD Record – SMART S_TKc

aligned query

red = high conservation blue = low conservation

View in Cn3D

View PSSM MatrixLinks to proteins

containing this domain

Annotated features

Links to Domain Families/CDTree Data

Curated Domain

Entrez Structure: MMMMolecular MMMModeling DDDDatabbbbase

• Searching the Structure Databases:

• Keyword search by Entrez

• Sequence search by BLAST or BLink

• Domain search by CDD/RPS-BLAST

• Structure search by VAST

• Derived from experimentally determined PDB records

• Data is added to PDB records including:– Addition of explicit chemical bonding information

– Validation and indexing of sequence

– Inclusion of Taxonomy, Citation, and other information

– Conversion to ASN.1 data description language

Molecular components in the MMDB structure are listed below. Theicons indicate macromolecular chains, 3D domains, protein classifications and ligands. Please hold the mouse over each icon for more information on the component. You may also click the thumbnails below to view corresponding chains and domains in Cn3D.

to get the Cn3D viewer

Sequence-based Neighbors:Conserved Domains (CDD/RPS-BLAST)

Structure-based Domains:(3D Domains)

PubChem Link

Searching the NCBI Databases

BLASTBLAST

SequenceSequenceVASTVAST

StructureStructure

EntrezEntrez

TextTextPubChemPubChem

StructureStructure

SearchSearch

Small MoleculeSmall MoleculeStructureStructure

Structure-based Neighbors:Vector Alignment Search Tool

For each protein chain:

locate secondarystructure elements,

represent them asindividual vectors,

1

2

3

4

5 6

Human IL-4

and compare these withprecomputed vectorsof database structures.

Molecular components in the MMDB structure are listed below. Theicons indicate macromolecular chains, 3D domains, protein classifications and ligands. Please hold the mouse over each icon for more information on the component. You may also click the thumbnails below to view corresponding chains and domains in Cn3D.

Structure-based Neighbors:Vector Alignment Search Tool (VAST)

����

EXAMPLE: Using VAST to find binding sites“I have a methyltransferase with a weird fold.

Are there any other proteins that have this fold?”

The Crystal Structure Of E.coli

m1G37 tRNAMethyltransferase

(1P9P)

&

A protein of unknown function from

Thermotoga maritima(1O6D)

2.4Å rmsd11% sequence identity

BLASTBLAST

SequenceSequenceVASTVAST

StructureStructure

EntrezEntrez

TextText

Searching the NCBI Databases

PubChemPubChem

StructureStructure

SearchSearch

Small MoleculeSmall MoleculeStructureStructure

Derivative Database:information is provided, updated

and “owned” by NCBI.

PubChem Databases

�Composed of Experimental data with Background, Protocols and Results for bioactivity screens of chemical substances described in PubChem Substance

� Submitters add “Hard” Links to PubChemSubstance records and outside sources.

Cell-line Growth Effectors HIV InfectivityPyruvate Kinase Estrogen Receptor Binding

�Composed of Substances which may be of known or unknown composition and also may contain a discrete compound or mixtures of compounds.

� Submitters add “Hard” Links to PubChemBioAssay records and outside sources.

Hydrolyzed feathers Diphenhydramine CitrateAspartame

Primary Databases:information is provided, updated

and “owned” by Submitters.

�Composed of discrete compounds with known chemical structure.

� Summary reports about the known chemical compounds described in PubChem Substance.

� Addition of Automated Links which are combined from information provided on PubChem Substance & BioAssayrecords.

EXAMPLE: Using PubChem Structure Search“What is this compound we isolated from grape skins?

Has it been found to have any biological activity?”

“What is this compound we isolated from grape skins? Has it been found to have any biological activity?”

YES

Searching the NCBI Databases

Linking!

Genomes

Taxonomy

PubMed

Nucleotide sequences

Protein sequences

3-D Structuremmdb

(3D structure)

Term Frequency Statistics

Structure Similarity

(VAST)

SequenceSimilarity

(BLASTp)

SequenceSimilarity

(BLASTn)

Phylogeny

Examples of Soft Linking in Entrez

Following Links

“Hard” Links: Curated links based on biology

for example:

nucleotide � taxonomy (based on organism identifier)protein � domain relatives (based on domain assignment)domains � pubmed (based on supporting literature)

“Soft” Links: Pre-computed analyses

for example:nucleotide � related sequences (BLASTn neighbors)protein � conserved domains (RPS-BLAST search) gene � map viewer (map position of annotated gene)

Follow links to related datain the same database

or in others!

Links

Summary pages of curated

information (LINKS!)

for genetic loci of RefSeq

organisms.

�Summary�Genomic regions, transcripts and products�Genomic context�Bibliography�HIV-1 protein interactions�Interactions�General gene information

�Markers�Genotypes�Phenotypes�Pathways�Homology�GeneOntology (function, process, component)

�General protein information�Reference Sequences

�mRNA and Protein(s)�Reference and Alternate Genomic assemblies

�Related Sequences�Additional Links

EXAMPLE: Finding Information on your Gene

“How do I find information about my gene?”

1. Choose the GeneGene database to search

2. Find your Gene recordvia traditional Entrez search!

3. Examine the record and all of those links!

What does my gene look like?

Are there SNPs in my gene? SNP:GeneView

Is there Genotype data on my gene? SNP:Genotype

Where is my gene in the genome? MapViewer& what other genome information can I find in this region?

Are there any known phenotypes for my gene? OMIM

Where is my gene expressed? UniGene

What are homologs to my gene in other organisms? HomoloGene

Genomic regions, transcripts & products

NEW!

Sequence Polymorphisms found in Patients withAge-Related Macular Degeneration (AMD)

Mapping the Polymorphisms to a Candidate Gene.

A clinical databaseof Genotypes & Phenotypes

Data is either open access or controlled access

FTP Downloads

NCBI Toolbox: In-house source code useful for incorporating NCBI-like functionality into their programs. Three main parts: Data Model, Data Encoding

and Programming Libraries.

• Examples: BLAST, Cn3D, Sequin, Data format conversion scripts

http://www.ncbi.nlm.nih.gov/IEB/ToolBox/index.cgi

Help for Programmers

http://www.ncbi.nih.gov/entrez/query/static/eutils_help.html

E-Utilities: Guidelines for Entrez “URL calls” used to access data. Designed for use in scripts.

• Examples: ESearch, EPost, ESummary, EFetch and ELink

Caution: Overuse may result in blocked IPs!

3000 3000 MyrMyr

1000 1000 MyrMyr

540 540 MyrMyr

Alzheimer’sDisease

Ataxiatelangiectasia

Colon cancer

Pancreaticcarcinoma

Yeast BacteriaWormFlyHuman

BLAST and Molecular Evolution

MLH1 MutL

Common ancestry allows us to infer similar function

Calculates similarity for biological sequences

Finds best local alignments

A Heuristic approach based on the Smith-Waterman algorithm

− Searches for matching “words” (W) rather than individual residues

− Uses statistical theory to determine if a match might have occurred by chance

Basic Local Alignment Search Tool

Seq 1

Seq 2Seq 1

Seq 2

Global Alignment Local Alignment

Protein BLAST Page Limiting Database: Organism

Organism autocomplete

Limiting Database: Entrez Query

all[filter] NOT mammals[organism]

gene_in_mitochondrion[Properties]2006:2007 [Modification Date]

Nucleotidebiomol_mrna[Properties]biomol_genomic[Properties]

Run Search

BLAST Formatting Page

Conserved Domain Results

BLAST Output: Graphical Overview

mouse over

Sort by taxonomy

BLAST Output: Descriptions

Link to entrez

Sorted by e values

5 X 10-14

Default e value cutoff 10

Gene Linkout

TaxBLAST: Taxonomy Reports

BLAST Output: Alignments

Identical match

positive score(conservative)

Negative or zero

gap

Position Specific Iterative BLAST

MLH1 and ETR1

>gi|4557757|ref|NP_000240.1| MutL protein homolog 1 [Homo sapiens]

MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIVKEGGLKLIQIQDNGTGIRK

EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPK

PCAGNQGTQITVEDLFYNIATRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNA

STVDNIRSIFGNAVSRELIEIGCEDKTLAFKMNGYISNANYSVKKCIFLLFINHRLVESTSLRKAIETVY

AAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLLGSNSSRMYFTQTLLP

GLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLSSQPQAIVTEDKTDIS

SGRARQQDEEMLELPAPAEVAAKNQSLEGDTTKGTSEMSEKRGPTSSNPRKRHREDSDVEMVEDDSRKEM

TAACTPRRRIINLTSVLSLQEEINEQGHEVLREMLHNHSFVGCVNPQWALAQHQTKLYLLNTTKLSEELF

YQILIYDFANFGVLRLSEPAPLFDLAMLALDSPESGWTEEDGPKEGLAEYIVEFLKKKAEMLADYFSLEI

DEEGNLIGLPLLIDNYVPPLEGLPIFILRLATEVNWDEEKECFESLSKECAMFYSIRKQYISEESTLSGQ

QSEVPGSIPNSWKWTVEHIVYKALRSHILPPKHFTEDGNILQLANLPDLYKVFERC

>gi|22095656|sp|O81122.1|ETR1_MALDO Ethylene receptor

MLACNCIEPQWPADELLMKYQYISDFFIALAYFSIPLELIYFVKKSAVFPYRWVLVQFGAFIVLCGATHL

INLWTFSIHSRTVAMVMTTAKVLTAVVSCATALMLVHIIPDLLSVKTRELFLKNKAAELDREMGLIRTQE

ETGRHVRMLTHEIRSTLDRHTILKTTLVELGRTLALEECALWMPTRTGLELQLSYTLRQQNPVGYTVPIH

LPVINQVFSSNRAVKISANSPVAKLRQLAGRHIPGEVVAVRVPLLHLSNFQINDWPELSTKRYALMVLML

PSDSARQWHVHELELVEVVADQVAVALSHAAILEESMRARDLLMEQNIALDLARREAETAIRARNDFLAV

MNHEMRTPMHAIIALSSLLQETELTAEQRLMVETILRSSNLLATLINDVLDLSRLEDGSLQLEIATFNLH

SVFREVHNMIKPVASIKRLSVTLNIAADLPMYAIGDEKRLMQTILNVVGNAVKFSKEGSISITAFVAKSE

SLRDFRAPDFFPVQSDNHFYLRVQVKDSGSGINPQDIPKLFTKFAQTQALATRNSGGSGLGLAICKRFVN

LMEGHIWIESEGLGKGCTATFIVKLGFPERSNESKLPFAPKLQANHVQTNFPGLKVLVMDDNGVSRSVTK

GLLAHLGCDVTAVSLIDELLHVISQEHKVVFMDVSMPGIDGYELAVRIHEKFTKRHERPVLVALTGSIDK

ITKENCMRVGVDGVILKPVSVDKMRSVLSELLEHRVLFEAM

Human Mismatch Repair Protein

Apple ethylene receptor

PSI-BLAST: Iteration 1

PSI-BLAST:Iteration 4

Plant ethylene receptors, bacterial two-component regulatory system kinases

RPS-BLAST: Conserved Domains

Histidine Kinase-like ATPase Domain

Algorithm parameters: Protein

Adjust to set stringency

May limit results

Default statistics adjustmentfor compositional bias

Off now by default. Conflicts withcomp-based stats

Expand

Automatic Short Sequence Adjustment

e-value 20000Word Size 2Matrix PAM30Comp Stats OffLow Comp Filter Off

Nucleotide and Protein

Managing Searches

Recent Results

Saved Strategies

Recent Results

Login to My NCBI to save search strategies

Results available for 36 hours

Saved Strategies

Re-run searchesto keep up to date

Databases & Visualization Tools– Primarily based on Nucleotide Sequences!

Examples: GenBank (Sequence Formatter) Genome (MapViewer)UniGene (Expression Profile Viewer) SNP (GeneView & Genotype Viewer)

— We also have some other types of data:Examples: Protein (Translations & Curations seen by Sequence Formatter)

HomoloGene (Protein Sequence Clusters)Genome Project (Information and links for Organisms’ Sequencing Projects)GEO (Arrays & Spectras seen with Heat Maps) Structure (Coordinates seen with Cn3D)BioAssay (Activities seen in Reports)

Search Methods– Entrez: word/text searches

– BLAST: sequence searches– in-lieu-of-BLAST: pre-computed BLAST searches

– VAST: macromolecular structure searches

– PubChem Structure Search: molecular structure searches

Linking– Pre-computed and Hand-curated Associations

• Gene Database: A central source for gene information & links

• Hemochromatosis Example: Relating data to find a biomedical answer

EXAMPLE: Finding information onthe Etiology of Hemochromatosis

To answer a question/solve a puzzlewith pre-computed links!

-the data is already there-

What causes Hemochromatosis?Classic hemochromatosis (HFE),

an autosomal recessive disorder,is an iron overload disease.

Figure 17-48. The transferrin cycle, which operates in all growing mammalian cells.

Molecular Cell Biology by Lodish, Berk, Zipursky, Matsudaira, Baltimore & Darnell ©2000 by W.H. Freeman & Co.

A Bookshelf member!

Iron is an essential nutrient required for the synthesis of hemoglobin, cytochromes, and many other proteins

Iron is transported around the body by Transferrin which binds to the extracellular Transferrin Receptor, causing internalization and cellular uptake of the complex.

High intracellular iron levels causes free-radical damage to intracellular proteins, lipids, and nucleic acids.

HFE protein is a competitive inhibitor with Transferrin for binding to the Transferrin Receptor, thereby inhibiting cellular uptake of iron and preventing intracellular damage.

Medical complications of Hemochromatosis: hypermelanoticpigmentation of the skin, arthritis, cirrhosis of the liver, diabetes, heart failure and primary hepatocellular carcinoma

Fe+3

Fe+3

Fe+3

Fe+3

Fe+3

hemochromatosis

OMIM Record � Link to Gene

Coriell Cell RepositoriesHuman Gene Mutation Database

Gene � Links to Everywhere (almost) � Gene

NTNM

records GeneUniGeneGene Model

HFE Genome Maps

Gene � Links to Everywhere (almost) � ProteinGeneView in SNP

Protein Record

282

�Domain (CDD) Search CDD Record: IGc1 Domain Sequence � Structure!

..disulfide bridge connecting 2 beta-sheets with a Trp packing against the disulfide bond.

C

C

C

C

C

C

C

C

C

C

C

C

C

C

C

Human Transferrin Receptor Complexes

HFE dimer Transferrin

1SUV1DE4

HFE dimer Transferrin