The UniProtKB/Swiss-Prot knowledgebase and its Plant Proteome Annotation Program


Journal of Proteomics 72 (2009) 567–573

available at www.sciencedirect.com

www.elsevier.com/locate/jprot


Michel Schneider a,⁎, Lydie Lane a, Emmanuel Boutet a, Damien Lieberherr a, Michael Tognolli a, Lydie Bougueleret a, Amos Bairoch a,b

a Swiss Institute of Bioinformatics, Centre Médical Universitaire, 1, rue Michel-Servet, 1211 Genève 4, Switzerland
b Department of Structural Biology and Bioinformatics, Centre Médical Universitaire, 1, rue Michel-Servet, 1211 Genève 4, Switzerland

ARTICLE DATA

Keywords: Database; UniProt; Manual annotation; Plant; Proteomics; PTM

ABSTRACT

The UniProt knowledgebase, UniProtKB, is the main product of the UniProt consortium. It consists of two sections, UniProtKB/Swiss-Prot, the manually curated section, and UniProtKB/TrEMBL, the computer translation of the EMBL/GenBank/DDBJ nucleotide sequence database. Taken together, these two sections cover all the proteins characterized or inferred from all publicly available nucleotide sequences. The Plant Proteome Annotation Program (PPAP) of UniProtKB/Swiss-Prot focuses on the manual annotation of plant-specific proteins and protein families. Our major effort is currently directed towards the two model plants Arabidopsis thaliana and Oryza sativa. In UniProtKB/Swiss-Prot, redundancy is minimized by merging all data from different sources in a single entry. The proposed protein sequence is frequently modified after comparison with ESTs, full length transcripts or homologous proteins from other species. The information present in manually curated entries allows the reconstruction of all described isoforms. The annotation also includes proteomics data such as PTM and protein identification MS experimental results. UniProtKB and the other products of the UniProt consortium are accessible online at www.uniprot.org.

© 2008 Elsevier B.V. All rights reserved.

⁎ Corresponding author. Tel.: +41 22 379 50 50; fax: +41 22 379 58 58. E-mail address: [email protected] (M. Schneider).

1874-3919/$ – see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.jprot.2008.11.010

1. Introduction

The UniProt consortium was created in 2002 by the joining of forces between the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI) and the Protein Information Resource (PIR) group at the Georgetown University Medical Center and National Biomedical Research Foundation.

The main goal of the consortium is to provide the scientific community with a single, stable, high quality, comprehensive and authoritative protein knowledgebase, UniProtKB (www.uniprot.org). This knowledgebase consists of two sections: UniProtKB/Swiss-Prot, which contains all the fully manually annotated, non-redundant records, and UniProtKB/TrEMBL, the computer-annotated section that contains the translation of all the coding sequences (CDS) deposited in the EMBL/GenBank/DDBJ nucleotide sequence database. Taken together, the two sections cover all the proteins characterized or inferred from all publicly available nucleotide sequences.

Besides this centerpiece, the UniProt consortium also produces and maintains several other products such as UniRef, which consists of clusters of sequences sharing 100%, 90% or 50% identity, UniParc, a highly redundant archive that contains original protein sequences retrieved from several different sources, or UniMES, a collection of metagenomic and environmental sequences (Fig. 1). For a detailed description of UniProt and its various products, see [1].

2. The Plant Proteome Annotation Program

Shortly after the publication of the first complete plant genome sequence in 2000 [2], the Swiss-Prot group initiated the Plant Proteome Annotation Program (PPAP). The main goal of this program is the manual annotation of plant-specific proteins or protein families, with a specific emphasis on the proteomes of two fully sequenced model organisms, Arabidopsis thaliana [2] and Oryza sativa [3].

Fig. 1 – Sources and flow of data for UniProt component databases.

We are currently working on the establishment and annotation of a comprehensive, non-redundant complete proteome of Arabidopsis. As a first step towards achieving this goal we have compared the content of our database with the list of proteins produced by alternative splicing published by The Arabidopsis Information Resource (TAIR) [4]. In several cases this has led us to complement the sequence information that was already present in UniProtKB with data available at TAIR.

3. Current status of the Plant Proteome Annotation

By mid-October 2008, UniProtKB/Swiss-Prot (Rel. 14.3) contained 399,749 manually curated entries, including 23,951 plant proteins (Table 1). Of these, 7064 are from A. thaliana, with 999 having one or more splice variants, while 1865 originate from O. sativa subsp. japonica, with 124 having one or more splice variants.

A total of 1894 different plant species are currently represented in the manually annotated section of UniProtKB, and 12,205 proteins, about half of all the entries from Viridiplantae, originate from the 10 most highly represented species (Table 2).

Table 1 – Content of UniProtKB release 14.3 (14-Oct-2008).

Section                                                                      All species combined   Viridiplantae
UniProtKB/TrEMBL (computer annotated entries, waiting for manual curation)   6,678,831              503,867
UniProtKB/Swiss-Prot (manually annotated proteins)                           399,749                23,951
Total in UniProtKB                                                           7,078,580              527,818

Table 2 – The 10 most highly represented plant species in UniProtKB/Swiss-Prot (Rel. 14.3).

Number   Frequency   Species
1        7064        ARATH  Arabidopsis thaliana (Mouse-ear cress)
2        1865        ORYSJ  Oryza sativa subsp. japonica (Rice)
3        611         MAIZE  Zea mays (Maize)
4        442         ORYSI  Oryza sativa subsp. indica (Rice)
5        431         TOBAC  Nicotiana tabacum (Common tobacco)
6        400         SOLLC  Solanum lycopersicum (Tomato) (Lycopersicon esculentum)
7        376         SOLTU  Solanum tuberosum (Potato)
8        352         PEA    Pisum sativum (Garden pea)
9        344         SOYBN  Glycine max (Soybean)
10       320         WHEAT  Triticum aestivum (Wheat)

4. Structure and content of a UniProtKB/Swiss-Prot entry

Database redundancy is minimized by merging all submitted sequence data from different sources about a given protein in a given organism into a single entry. This implies the detection and correction of potential frameshifts, sequencing errors and erroneous gene model predictions. The sequence displayed in the entry is the most correct sequence version according to annotator judgment. If protein sequences are only derived from a computer gene prediction program running on a genomic sequence, the proposed gene model is validated, whenever possible, by multiple alignments with paralogs (other members of the same protein family) or orthologs (proteins having the same function in related species). These comparisons allow not only the correction of a great number of predicted gene models, but may also permit the inference of certain biological properties for as yet uncharacterized proteins.

4.1. Core structure

The minimal information contained in each entry, be it UniProtKB/TrEMBL or UniProtKB/Swiss-Prot, consists of an entry identifier, an accession number, a description including a recommended name, taxonomic classification of the organism in which the protein is present, bibliographical reference(s) and the protein sequence. In fully annotated UniProtKB/Swiss-Prot entries additional information is found in 11 different sections:

• Names and origin
• Protein attributes
• General annotation (Comments)
• Ontologies
• Alternative products
• Sequence annotation (Features)
• Sequences
• References
• Cross-references
• Entry information
• Relevant documents

The level of confidence associated with each item of information is indicated by the presence or otherwise of one or more non-experimental qualifiers. “Potential” indicates that the feature is predicted by computer analysis. “Probable” means that the information is not explicitly given in the literature, but can be inferred from other sources (other proteins, obvious targeting signals, common knowledge, etc.). “By similarity” indicates that the information is not proven for this particular protein, but that it has been demonstrated for a homologous protein.

Fig. 2 – Part of the “Sequence annotation” section of an entry (screenshot of O82804).

The content of each entry is continuously evolving as the information is regularly updated and completed. While for reasons of consistency it is sometimes necessary to change the entry name (identifier), the primary accession number is never altered and therefore should always be used to unambiguously identify and cite UniProtKB entries.

Differences between the displayed sequence and those inferred from various sequencing reports, including differences due to splice variants, polymorphisms, mutagenesis or sequencing errors, are listed in the “Sequence annotation (Features)” section of the entry (Fig. 2).

This allows all alternative sequences described in UniProtKB/Swiss-Prot to be easily recreated, and UniProtKB provides several tools which utilize this feature. For example, a BLAST search launched from the UniProt web site against UniProtKB or a subsection thereof will automatically include all described isoforms. It should be noted that a file containing the sequence of all the splice variants present in UniProtKB/Swiss-Prot is available for download in FASTA format from our FTP server (ftp.uniprot.org/pub/databases/uniprot/knowledgebase/uniprot_sprot_varsplic.fasta.gz).
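As an illustration of how an alternative sequence can be recreated from positional annotation, the following minimal Python sketch applies a list of replacements to a displayed sequence. The SequenceEdit record, the build_isoform function and the toy sequence are hypothetical names and data introduced for this example only; they do not correspond to the UniProtKB entry format or to any UniProt tool.

```python
from typing import List, NamedTuple


class SequenceEdit(NamedTuple):
    """Hypothetical record for one annotated difference.

    Positions are 1-based and inclusive, and refer to the displayed sequence.
    An empty replacement means the region is missing in the isoform.
    """
    start: int
    end: int
    replacement: str


def build_isoform(displayed: str, edits: List[SequenceEdit]) -> str:
    """Recreate an alternative sequence from the displayed one.

    Edits are applied from the C-terminus towards the N-terminus so that
    the coordinates of the remaining edits stay valid. Edits are assumed
    not to overlap.
    """
    isoform = displayed
    for edit in sorted(edits, key=lambda e: e.start, reverse=True):
        isoform = isoform[:edit.start - 1] + edit.replacement + isoform[edit.end:]
    return isoform


if __name__ == "__main__":
    displayed = "MKTAYIAKQR"  # toy sequence, not a real entry
    edits = [
        SequenceEdit(3, 5, "GG"),  # residues 3-5 replaced by GG
        SequenceEdit(9, 10, ""),   # residues 9-10 missing
    ]
    print(build_isoform(displayed, edits))
```

Applying the edits in descending order of position avoids having to recompute offsets after each insertion or deletion.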

When the submitted sequence differs extensively from the displayed sequence, as in the case of an erroneous gene prediction, individual conflicts are not described in the “Sequence annotation (Features)” section; instead, this fact is indicated in the “Sequence Caution” part of the “General annotation (Comments)” section (Fig. 3).

Fig. 3 – Example of a “Sequence caution” warning (screenshot of Q3E6Y4).

4.2. Scientific literature as a source of validated information

Most of the data included in a UniProtKB/Swiss-Prot entry are extracted from the scientific literature by specially trained life scientists. If the information relates to defined regions of the protein then the precise positions of these regions are indicated in the “Sequence annotation (Features)” section. Otherwise, data are stored as “General annotation (Comments)”. This section is divided into subsections, each containing information pertinent to particular properties of the protein such as its function, catalytic activity, subcellular location, subunit structure, mass spectrometry or biophysicochemical properties. Whenever possible, individual fields are structured and controlled vocabularies are used in order to facilitate text searches and database interoperability. Currently, Gene Ontology (GO) terms are manually mapped to the UniProtKB keyword and subcellular location controlled vocabularies, to EC numbers, to InterPro [5] matches or to HAMAP family rules [6] and then transferred automatically to the matching entries. Mapping to Plant Ontology terms or between UniProtKB pathways and GO terms is underway. UniProt will join the GO consortium in the forthcoming months and active GO annotation will be implemented in UniProtKB by mid 2009. Keywords and GO terms are both stored in the “Ontologies” section of the entries.

A description of the various line types and their format is available at http://www.uniprot.org/manual.

Fig. 4 – Example of bibliographical references (screenshot of Q9M7T0).


The sources from which the data have been extracted are listed in the “References” section, with an indication of the nature of the information retrieved, and its source or scope (for example the tissue or cultivar used in the experiment). When available, direct links to the abstract in PubMed and to the full text version through its Digital Object Identifier (DOI) are also provided (Fig. 4). The bibliographic list presented in an entry is not exhaustive but consists of a selection of the most relevant articles used for the current annotation.

4.3. Evidence of the existence of a protein

Some protein sequences may be mere predictions derived from the hypothetical translation of a nucleotide sequence, while others may be from well characterized proteins whose existence is proven, for example by mass spectrometry. We therefore provide an indication of the available evidence for the existence of each protein in the form of the Protein Existence (PE) line included in the “Protein attributes” section of every UniProtKB entry.

The PE line may take one of the following values:

1. Evidence at protein level;
2. Evidence at transcript level;
3. Inferred from homology;
4. Predicted;
5. Uncertain.

– Level 1 (Evidence at protein level) is attributed to any protein whose existence is supported by clear experimental evidence, such as Edman sequencing, unambiguous identification by mass spectrometry, X-ray or NMR structure, detection by antibodies, etc.

– Level 2 (Evidence at transcript level) is used to indicate that the existence of a protein has not been strictly proven, but is supported by transcription data such as cDNAs, RT-PCR, Northern blots or micro-array data extracted from the ArrayExpress or CleanEx databases.

– Level 3 (Inferred from homology) is used to indicate that the existence of the protein is probable since clear orthologs exist in closely related species or because the protein is the member of a family including multiple paralogs in the same species.

– Level 4 (Predicted) is attributed to entries without evidence at protein, transcript, or homology levels.

– Level 5 (Uncertain) indicates that the existence of the protein is unsure and that we consider that the proposed protein sequence may represent the translation of a pseudogene or an erroneously assigned ORF (to a non-coding RNA for example).

Fig. 5 – Distribution of the manually annotated plant entries according to their PE level. For each type of evidence supporting the existence of a protein, the absolute number and proportion of the 23,951 plant entries present in UniProtKB/Swiss-Prot (Rel. 14.3) is indicated.

Fig. 6 – Example of annotation of mass spectrometry data (screenshot of P0AE67).

Criteria used to assign a PE level to entries are described in a document file available on the UniProt web site (www.uniprot.org/docs/pe_criteria). The distribution of the UniProtKB/Swiss-Prot plant entries according to their protein existence level is shown in Fig. 5.

It should be pointed out that the PE line does not describe the accuracy or correctness of the sequence displayed but only the evidence for the existence of the protein. Sequences derived from gene predictions from genomic sequences in particular are prone to errors and may well not be entirely accurate.
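The five PE values described in this section lend themselves to simple programmatic filtering. The sketch below is a minimal Python example that reads a PE line and reports its level. It assumes that the line appears in the text format of an entry as "PE   1: Evidence at protein level;"; this layout, the PE_LINE pattern and both function names are assumptions made for illustration and should be checked against the UniProtKB manual.

```python
import re

# The five values of the PE line, as listed in this section.
PE_LEVELS = {
    1: "Evidence at protein level",
    2: "Evidence at transcript level",
    3: "Inferred from homology",
    4: "Predicted",
    5: "Uncertain",
}

# Assumed text layout of the line: "PE   1: Evidence at protein level;"
PE_LINE = re.compile(r"^PE\s+([1-5]):")


def parse_pe_line(line):
    """Return (level, label) for a PE line, or raise ValueError."""
    match = PE_LINE.match(line)
    if match is None:
        raise ValueError("not a recognizable PE line: %r" % line)
    level = int(match.group(1))
    return level, PE_LEVELS[level]


def has_protein_level_evidence(line):
    """True only for level 1, i.e. clear experimental evidence."""
    return parse_pe_line(line)[0] == 1


if __name__ == "__main__":
    print(parse_pe_line("PE   1: Evidence at protein level;"))
    print(has_protein_level_evidence("PE   4: Predicted;"))
```

As stressed above, such a filter selects entries by the evidence for the existence of the protein and says nothing about the correctness of the displayed sequence.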

4.4. Proteomics information

Two specific topics of the “General annotation” section about mass spectrometry and PTM might be of special interest for scientists working in the field of proteomics.

4.4.1. Mass spectrometry data

A comment “Mass spectrometry” is added when the entire protein or a biologically active peptide has been specifically studied by MS. It includes the range of the peptide submitted to MS, its determined molecular weight in Daltons with the error range of the machine if available, and the MS method used. When the protein studied carries a PTM, this fact is indicated in a note (Fig. 6).

When proteomic identification has been made by MS or MS/MS on tryptic fragments instead of biologically relevant peptides, the comment “Mass spectrometry” is not used, but the corresponding bibliographic reference is tagged as having provided identification of the protein concerned (cf. ref. 5 in Fig. 4).

When MS/MS spectra are studied manually and provided along with each amino acid and ion series, the results are considered as direct protein sequencing and annotated as such.

4.4.2. PTM

When a protein is modified post-translationally but the precise site or sites of modification are not known, this fact is indicated in the “Post-translational modification” subsection of the “General annotation” section. The same subsection is also used to indicate proteolytic cleavage or N-terminal blockage when the precise identity of the N-terminal residue is unknown. If the biological role of a particular PTM is known, the corresponding detailed information is also given in the same “General annotation” section (Fig. 7).

Fig. 7 – Example of annotation of PTMs in the “General annotation” section (screenshot of O24591).

When the precise positions of individual modifications are known these are given in the “Sequence annotation (Features)” section of the UniProtKB entry (Fig. 8).

Fig. 8 – Example of annotation of PTMs in the “Sequence annotation” section, with a precise indication of the position of the modifications (screenshot of Q9M0S4).

Currently, 336 different forms of PTM (not including the various types of glycosylation) are annotated in UniProtKB, and a full list is available at www.uniprot.org/docs/ptmlist. Included in this list are also the average mass differences caused by the various modifications.

In release 14.3 of UniProtKB, 4225 manually annotated plant proteins contained one or more defined PTM sites. The frequency of occurrence of some of the most commonly encountered PTMs in plants is given in Table 3.

4.4.3. Large scale proteomic experiments

We also incorporate information extracted from large scale proteomic experiments, mostly dealing with PTM identification or organelle profiling. Due to a high level of genome duplication, proteomic studies in plants are particularly difficult. Often fragments cannot be assigned unequivocally to a single protein as they match several members of large protein families. We pay particular attention to the quality of the data to be integrated, first at the level of sample preparation and purity, then at the level of MS protocols and apparatus, and finally at the level of data analysis and filtering. We often change the threshold of acceptable identification scores, or manually remove potential chemical artifacts from biologically relevant PTM.

Curation of proteomics experiments, performed on a variety of biological materials, using different apparatus, software and scoring systems, is not an easy task. Some initiatives such as PRIDE [7] have been launched in order to create repositories displaying all the technical details needed for evaluating and comparing proteomic experiments. We are working in close collaboration with these groups to find adequate selection criteria to facilitate the automatic retrieval of high quality data from public repositories.

Therefore raw proteomic identification data should ideally be submitted first to specialized databases such as PRIDE. Specific cross-references to those databases could then be inserted in the corresponding UniProtKB entries.

4.5. Cross-references

Information related to the protein described in an entry but stored in a database external to UniProt can be accessed by following the links provided in the “Cross-references” section. Among many, this concerns the underlying DNA sequences stored in the EMBL/GenBank/DDBJ nucleotide sequence database, the 3D-protein structures from PDB [8], HSSP [9] or ModBase [10], the protein domains and family characterizations of PROSITE [11], Pfam [12], InterPro [5], ProDom [13], etc., the proteomic data from PeptideAtlas [14] and ProMEX [15], or all the various information stored in Model Organism Databases (MOD) such as Gramene [16], MaizeGDB [17], and TAIR [4] to cite only a few.

Currently UniProtKB is cross-linked to more than 100 external databases and a complete list can be downloaded from our web site (www.uniprot.org/docs/dbxref).

UniProtKB is synchronized with several of those external databases. For example, all the gene models proposed by TAIR are also included in UniProtKB and all the 806 different plant proteins with a determined 3D-structure are cross-linked to the corresponding PDB entries, while 672 of them (83%) are manually annotated in UniProtKB/Swiss-Prot.

Table 3 – Number of plant entries containing some of the most frequent PTM sites.

PTM                           Number of occurrences
Disulfide bonds               1993
Modified residues             1809
Phosphorylations              656
Acetylations                  633
Methylations                  602
Pyrrolidone carboxylic acid   147
Glycosylations                1319
Cross-links                   290
Lipidation                    268
GPI-anchor                    109
Geranylation                  76
Myristoylation                47
Palmitoylation                19
Farnesylation                 13

5. Concluding remarks

UniProtKB provides the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information. It contains all the protein sequences inferred from publicly available nucleotide sequences, but only UniProtKB/Swiss-Prot, the fully manually annotated section, is non-redundant. UniProtKB provides access to more than one hundred external resources and as such acts as a central hub for biomolecular information.

The “Protein Existence” tag allows the discrimination between proteins whose existence has been experimentally proven and those whose existence has been inferred computationally. This is important since computational gene predictions can be prone to error: approximately one third of the initial gene predictions in Arabidopsis were revised following incorporation of information from expressed transcripts. However, it is important to bear in mind that the “Protein Existence” tag does not constitute a measure of the correctness of the protein sequence displayed.

As protein isoforms may differ considerably from the displayed protein sequence, with potentially less than 50% similarity, it may be important to include all the splice variants when the database is used to identify new proteins. All the information needed to recreate the various isoforms is provided in the original UniProtKB entries. In this way, an increased number of theoretical peptides can be produced, potentially allowing better identification of proteins after an MS experiment, for example. By the same token the mass of the theoretical peptides can easily be corrected according to the post-translational modifications described in the entries. Tools such as Phenyx (www.genebio.com/products/phenyx) take full advantage of the annotated sequence information found in the UniProtKB databases such as PTMs, binding sites, variants, alternative splicing events, etc. During a search for peptides Phenyx examines the difference between an experimental peptide mass and a theoretical peptide mass in order to determine modifications to the protein sequence. If a mass difference corresponds to a known PTM that is annotated in UniProtKB/Swiss-Prot (even if the modification was not selected by the user), then the peptide sequence is considered modified and reported in the results.
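The principle of comparing a mass difference with known modifications can be sketched in a few lines of Python. This is not the Phenyx algorithm, whose internals are not described here; it is a simplified illustration. The three average mass differences are rounded values supplied for this example and should be replaced by the figures given in the PTM list; the tolerance and the function name match_ptm are likewise assumptions.

```python
# Rounded average mass differences in Da, supplied for illustration only.
PTM_AVERAGE_MASS_DELTA = {
    "Phosphorylation": 79.98,
    "Acetylation": 42.04,
    "Methylation": 14.03,
}


def match_ptm(experimental_mass, theoretical_mass, tolerance=0.05):
    """Return the names of modifications whose mass difference explains
    the gap between an experimental and a theoretical peptide mass.

    tolerance is the accepted absolute error in Da (an assumed default).
    """
    delta = experimental_mass - theoretical_mass
    return [
        name
        for name, expected in PTM_AVERAGE_MASS_DELTA.items()
        if abs(delta - expected) <= tolerance
    ]


if __name__ == "__main__":
    # Toy masses: the observed peptide is about 79.98 Da heavier than expected.
    print(match_ptm(1080.00, 1000.02))
    # No mass difference: no modification is proposed.
    print(match_ptm(1000.00, 1000.00))
```

A real search engine would additionally check that the peptide contains a residue able to carry the proposed modification and would consider combinations of modifications.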

Measuring the false discovery rate of an MS/MS experiment through the use of decoy databases is now a requirement for most proteomic studies. To comply with these new standards, the UniProt consortium now provides decoy versions of UniProtKB and UniRef100, which ensure that no tryptic peptide is shared between the decoy and the initial databases. The various decoy databases can be retrieved in FASTA format from our public FTP site (ftp.uniprot.org/pub/databases/uniprot/current_release/decoy).
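The property that no tryptic peptide is shared between decoy and initial databases can be checked with a short script. The Python sketch below performs an in silico digestion and intersects the resulting peptide sets. It uses a common simplification of trypsin specificity (cleavage after K or R, except before P, with no missed cleavages); this rule and the function names are assumptions made for the example and do not describe how the UniProt decoy databases are generated.

```python
import re


def tryptic_peptides(sequence, min_length=1):
    """Return the set of fully cleaved tryptic peptides of a sequence.

    Simplified rule: cleave after K or R unless the next residue is P.
    Requires Python 3.7 or later (re.split on zero-width matches).
    """
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return {piece for piece in pieces if len(piece) >= min_length}


def shared_tryptic_peptides(targets, decoys, min_length=1):
    """Return the tryptic peptides found in both sequence collections."""
    target_peptides = set()
    for sequence in targets:
        target_peptides |= tryptic_peptides(sequence, min_length)
    decoy_peptides = set()
    for sequence in decoys:
        decoy_peptides |= tryptic_peptides(sequence, min_length)
    return target_peptides & decoy_peptides


if __name__ == "__main__":
    target = "MAAKGGRPLLKTT"  # toy sequence, not a real entry
    decoy = target[::-1]      # simple reversal, used here only as a toy decoy
    print(sorted(tryptic_peptides(target)))
    print(shared_tryptic_peptides([target], [decoy]))
```

An empty intersection indicates that a peptide matched in the decoy database cannot also originate from the initial database, which is what makes decoy hits usable for estimating false identifications.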

The major effort of the plant group is currently focused on the extensive annotation of the proteomes of both a monocot (O. sativa) and a dicot (A. thaliana). At a later stage, we will propagate the relevant annotation to orthologous proteins from other plant species. Once the sequences and the gene predictions are reliable enough, we will also annotate all the proteins involved in specific pathways not present in our two model plants, such as the nitrogen fixation pathway of Medicago truncatula.

As we wish to keep up to date with the most recent research results, we are seeking active collaboration from the scientific community. We urge researchers to deposit their results in public databases such as the EMBL/GenBank/DDBJ database for nucleotide sequences (even if it is only a sequence produced by PCR) or PRIDE for proteomics data. Researchers should also favor the use of clear, unique and non-ambiguous identifiers in their publications. Usage of well recognized ordered locus names, such as the AGI numbers in Arabidopsis, is strongly recommended. Following those simple guidelines will considerably facilitate the daily work of curators. Researchers are welcome to help us maintain a high-quality database by sending us feedback and corrections, or submitting relevant findings.

Acknowledgments

The authors would like to thank Alan Bridge and Sylvain Poux for critical reading and correction of the manuscript. The Swiss-Prot group is part of the Swiss Institute of Bioinformatics (SIB) and of the UniProt consortium. Its activities are supported by the Swiss Federal Government through the Federal Office of Education and Science and by the National Institutes of Health (NIH) grant 2 U01 HG02712-04. Additional support comes from the European Commission contract FELICS (021902RII3).

REFERENCES

[1] The UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res 2008;36:D190–5.

[2] Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000;408:796–815.

[3] International Rice Genome Sequencing Project (IRGSP). The map-based sequence of the rice genome. Nature 2005;436:793–800.

[4] Swarbreck D, Wilks C, Lamesch P, Berardini TZ, Garcia-Hernandez M, Foerster H, et al. The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res 2008;36:D1009–14.

[5] Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, et al. New developments in the InterPro database. Nucleic Acids Res 2007;35:D224–8.

[6] Lima T, Auchincloss AH, Coudert E, Keller G, Michoud K, Rivoire C, Bulliard V, de Castro E, Lachaize C, Baratin D, Phan I, Bougueleret L, Bairoch A. HAMAP: a database of completely sequenced microbial proteome sets and manually curated microbial protein families in UniProtKB/Swiss-Prot. Nucleic Acids Res 2009;37:D471–8.

[7] Jones P, Cote RG, Cho SY, Klie S, Martens L, Quinn AF, et al. PRIDE: new developments and new datasets. Nucleic Acids Res 2008;36:D878–83.

[8] Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The Protein Data Bank. Nucleic Acids Res 2000;28:235–42.

[9] Sander C, Schneider R. The HSSP data base of protein structure-sequence alignments. Nucleic Acids Res 1993;21:3105–9.

[10] Pieper U, Eswar N, Davis F, Madhusudhan MS, Rossi A, Marti-Renom MA, et al. MODBASE, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Res 2006;34:D291–5.

[11] Hulo N, Bairoch A, Bulliard V, Cerutti L, Cuche BA, de Castro E, et al. The 20 years of PROSITE. Nucleic Acids Res 2008;36:D245–9.

[12] Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, et al. The Pfam protein families database. Nucleic Acids Res 2008;36:D281–8.

[13] Bru C, Courcelle E, Carrere S, Beausse Y, Dalmar S, Kahn D. The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res 2005;33:D212–5.

[14] Deutsch EW, Lam H, Aebersold R. PeptideAtlas: a resource for target selection for emerging targeted proteomics workflows. EMBO Rep 2008;9:429–34.

[15] Hummel J, Niemann M, Wienkoop S, Schulze W, Steinhauser D, Selbig J, et al. ProMEX: a mass spectral reference database for proteins and protein phosphorylation sites. BMC Bioinformatics 2007;8:216.

[16] Liang C, Jaiswal P, Hebbard C, Avraham S, Buckler ES, Casstevens T, et al. Gramene: a growing plant comparative genomics resource. Nucleic Acids Res 2008;36:D947–53.

[17] Lawrence CJ, Schaeffer ML, Seigfried TE, Campbell DA, Harper LC. MaizeGDB's new data types, resources and activities. Nucleic Acids Res 2007;35:D895–900.