Survey of genome organization and gene content of Corynebacterium pseudotuberculosis

Microbiological Research 165 (2010) 312–320
www.elsevier.de/micres

Vívian D'Afonseca a, Francisco Prosdocimi b, Fernanda A. Dorella a, Luis Gustavo C. Pacheco a, Pablo M. Moraes a, Izabela Pena b, José Miguel Ortega b, Santuza Teixeira c, Sérgio Costa Oliveira d, Elisângela Monteiro Coser e, Luciana Márcia Oliveira e, Guilherme Corrêa de Oliveira e, Roberto Meyer f, Anderson Miyoshi a,1, Vasco Azevedo a,*,1

a Laboratório de Genética Celular e Molecular, Departamento Biologia Geral, ICB-UFMG, Brazil
b Laboratório de Biodados, Departamento Bioquímica e Imunologia, ICB-UFMG, Brazil
c Laboratório de Imunologia e Bioquímica Celular de Parasitas, Departamento Bioquímica e Imunologia, ICB-UFMG, Brazil
d Laboratório de Imunologia de Doenças Infecciosas, Departamento Bioquímica e Imunologia, ICB-UFMG, Brazil
e Laboratório de Parasitologia Celular e Molecular, Centro de Pesquisa René Rachou, FIOCRUZ, Brazil
f Laboratório de Biointeração, Instituto de Ciências da Saúde, ICB-UFBA, Brazil

* Corresponding author. Tel.: +55 31 3499 2778; fax: +55 31 3499 2610. E-mail addresses: [email protected] (A. Miyoshi), [email protected] (V. Azevedo).
1 Vasco Azevedo and Anderson Miyoshi share credit in this work for senior authorship.

Received 23 September 2008; received in revised form 27 May 2009; accepted 30 May 2009

KEYWORDS: Corynebacterium genus; Corynebacterium pseudotuberculosis; Genome survey sequence (GSS); Microbial genome sequencing; Comparative genomic

Abbreviations: CLA, caseous lymphadenitis; GSS, genomic survey sequences; BLAST, basic local alignment search tool; COG, clusters of orthologous groups; Cd, Corynebacterium diphtheriae; Ce, Corynebacterium efficiens; Cg, Corynebacterium glutamicum; Cj, Corynebacterium jeikeium; Cp, Corynebacterium pseudotuberculosis.

0944-5013/$ - see front matter © 2009 Elsevier GmbH. All rights reserved. doi:10.1016/j.micres.2009.05.009


Summary

Corynebacterium pseudotuberculosis is an intracellular pathogen that causes caseous lymphadenitis (CLA) in sheep and goats. The widespread occurrence and the economic importance of this pathogen have prompted investigation of its pathogenesis. We used a genomic library of C. pseudotuberculosis to generate 1440 genomic survey sequences (GSSs); these were analyzed in silico with bioinformatics tools, using public databases for comparative analyses. We employed non-redundant unique sequences as a query for BLAST searches against the genome, the translated genome and the proteome of four other Corynebacterium species that have been completely sequenced. We were able to characterize approximately 8% of the genome of C. pseudotuberculosis, including previously undescribed functional group genes, based on the COG database; the classification of the GSSs into categories gave 13% information storage and processing, 14% cellular processes and 23% metabolism. We found a close relation between C. pseudotuberculosis and C. diphtheriae, and conserved gene synteny in Corynebacteria species.

Introduction

The genomic era began in the 1990s with the sequencing of the first microbial genome, that of the bacterium Haemophilus influenzae, a free-living organism (Fleischmann et al. 1995; Mora et al. 2006). Since then, genomic research has generated considerable information that is available in public databases (Dorella et al. 2006a). This information has helped scientists discover genes and their functionality, and consequently the proteins that they encode, by comparing genomes of evolutionarily related species (Celestino et al. 2004). Although prokaryotes account for most of the planet's total biomass (Rodríguez-Valera 2004), much more still needs to be understood about them. The massive accumulation of prokaryotic DNA sequences generated by microbial genome projects gives potential for great advances in our knowledge in various areas, including bacterial diversity, mobile genetic elements (MGE) and horizontal gene transfer (HGT) (Roe et al. 1996; Binnewies et al. 2006). Genome sequencing has provided important insights concerning deduced genes and proteins of pathogens, including biological features, putative virulence factors and potential targets for the development of immunological or chemotherapeutic reagents against specific organisms (Celestino et al. 2004; Tauch et al. 2006). To date, approximately 570 microbial genome projects have been completed and published, and there are more than 1100 active sequencing projects (Genomes OnLine Database, http://www.genomesonline.org/).

The Corynebacterium genus consists of a large number of Gram-positive bacteria that are pleiomorphic, asporogenous and have a high G+C content in their genomes (Deb and Nath 1999). Originally, Corynebacterium included only diphtheria bacilli and some other animal pathogens (Barksdale 1970; Collins and Cummins 1986; Pascual et al. 1995). Later, other species were discovered, including plant and animal pathogens, nonpathogenic soil bacteria and saprophytic species (Collins and Bradbury 1986; Collins and Cummins 1986; Pascual et al. 1995). Corynebacterium species are closely related to Mycobacterium species, and both are classified in the Actinomycetales (Nakamura et al. 2003). The National Center for Biotechnology Information (NCBI) genomes database contains the complete genomes of four Corynebacterium species: Corynebacterium diphtheriae, Corynebacterium efficiens, Corynebacterium glutamicum and Corynebacterium jeikeium (Cerdeño-Tárraga et al. 2003; Ikeda and Nakagawa 2003; Matsui et al. 2003; Tauch et al. 2005).

One of the most important members of this genus, and the focus of this paper, is the bacterium C. pseudotuberculosis, a facultative intracellular pathogen that infects sheep and goats, causing caseous lymphadenitis (CLA). Occasionally, human infections have also been diagnosed. This disease is found worldwide, and its considerable economic importance has prompted investigation of its pathogenesis. However, the genetic determinants of C. pseudotuberculosis virulence are still poorly characterized (Dorella et al. 2006b). Furthermore, this species has only 19 proteins identified in the GenPept database.

The partial C. pseudotuberculosis genome sequence analysis that we provide here adds to our understanding of the molecular and genetic basis of this bacterium's virulence and may also be useful for the development of new diagnostic methods and vaccines, contributing to the control of CLA.

Methodology

Bacterial strains and growth conditions

Corynebacterium pseudotuberculosis wild-type strain T2 was isolated from a caseous granuloma found in a CLA-affected goat in Bahia state (Brazil); it was identified with an API CORYNE battery bacterial identification diagnostic kit (bioMérieux, France). It was cultured aerobically in brain heart infusion (BHI, Acumedia) broth at 37 °C, under agitation (Ikeda and Nakagawa 2003). We used the DH5α strain of Escherichia coli, aerobically grown at 37 °C in Luria Bertani (LB) medium (Difco Laboratories, Detroit, USA) supplemented with ampicillin (100 μg mL⁻¹) and X-Gal (40 μg mL⁻¹), to obtain recombinant clones.


Genomic DNA extraction and construction of a genomic library

Genomic DNA extraction was performed according to a previously described protocol (Pacheco et al. 2007). DNA integrity was confirmed through electrophoresis on 0.8% agarose gels and visualization with ethidium bromide staining. We used spectrophotometric analysis to evaluate the DNA and test its quality. Total genomic DNA was nebulized for 2 min, in order to obtain C. pseudotuberculosis DNA fragments of various sizes (Roe et al. 1996). Fragmentation was confirmed by electrophoresis in 0.8% agarose gels; DNA fragments ranging from 500 to 1000 bp were obtained. These fragments were cloned with the pCR®4Blunt-TOPO vector, using the TOPO® Shotgun Subcloning Kit (Invitrogen), and transformed into E. coli DH5α by electroporation (Dower et al. 1998). Blue-white screening was used to select positive clones, and PCR with M13 universal primers was used to confirm inclusion of the insert (TOPO® Shotgun Subcloning Kit, Invitrogen).

Genomic survey sequence (GSS) sequencing

Plasmid extraction was performed directly on plates, using a standardized protocol (Sambrook et al. 1989). The plasmids that were obtained were analyzed in 0.8% agarose gels and the concentrations were determined spectrophotometrically. Approximately 200 ng mL⁻¹ of plasmid DNA was processed in a Mastercycler gradient thermocycler (Eppendorf), using a DYEnamic ET Dye Terminator Cycle Sequencing Kit (GE Healthcare) and universal primers M13 F and M13 R, according to the manufacturer's instructions. The sequences were generated in a MegaBACE™ 1000 apparatus (GE Healthcare).

GSS amplification by PCR

To confirm putative exclusive genes from C. pseudotuberculosis, we designed forward and reverse primers for each of the genes that had no similarities in the search databases of other sequenced Corynebacterium spp.:

Putative oligopeptide/dipeptide ABC transporter primers:
Forward: 5′-CCTTACCGAGACAACGTCAT-3′
Reverse: 5′-GCCTGGTGCTTATCATTGAT-3′

NADP oxidoreductase, coenzyme F420-dependent primers:
Forward: 5′-CTGCGACATAGCTAGGCACT-3′
Reverse: 5′-CCGCCAGACTTTTCTCTACA-3′

Proline iminopeptidase (PIP) primers:
Forward: 5′-AACTGCGGCTTTCTTTATTC-3′
Reverse: 5′-GACAAGTGGGAACGGTATCT-3′

These primer pairs generated amplicons with lengths of 285, 382 and 551 bp.
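As a quick sanity check on the primer list above, the following minimal Python sketch computes the length, GC content and a rough melting temperature for each primer. The melting temperature uses the Wallace rule, which is a generic rule of thumb and not a criterion stated in the paper; the dictionary labels and function names are illustrative only.

```python
# Primer sequences copied from the text; the dictionary keys are illustrative labels.
PRIMERS = {
    "ABC transporter": ("CCTTACCGAGACAACGTCAT", "GCCTGGTGCTTATCATTGAT"),
    "NADP oxidoreductase": ("CTGCGACATAGCTAGGCACT", "CCGCCAGACTTTTCTCTACA"),
    "Proline iminopeptidase": ("AACTGCGGCTTTCTTTATTC", "GACAAGTGGGAACGGTATCT"),
}


def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq: str) -> int:
    """Rough melting temperature in deg C by the Wallace rule: 2(A+T) + 4(G+C).

    Assumption: this rule of thumb is used for illustration; the paper does
    not say how the primers were evaluated.
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


if __name__ == "__main__":
    for gene, pair in PRIMERS.items():
        for direction, seq in zip(("forward", "reverse"), pair):
            print(f"{gene:24s} {direction:8s} len={len(seq)} "
                  f"GC={gc_fraction(seq):.0%} Tm~{wallace_tm(seq)} C")
```

All six primers are 20-mers, so differences in the estimated melting temperature come from GC content alone.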

Base-calling, sequence filtering and clustering

Base-calling was performed using the PHRED algorithm (Ewing and Green 1998; Ewing et al. 1998), with trim_alt and trim_cut-off parameters set at 0.16, as suggested by Prosdocimi et al. (2007).
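To put the 0.16 trimming value in context, the sketch below converts between a per-base error probability and a Phred quality score using the standard definition Q = -10 log10(P). It assumes that the cut-off is read as an error probability; the function names are illustrative and are not part of PHRED itself.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Convert a per-base error probability into a Phred quality score."""
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Convert a Phred quality score back into an error probability."""
    return 10.0 ** (-quality / 10.0)


if __name__ == "__main__":
    cutoff = 0.16  # trimming value quoted in the text
    print(f"P = {cutoff} corresponds to Q = {phred_quality(cutoff):.2f}")
```

Under that assumption, an error probability of 0.16 corresponds to a quality score of roughly 8, which is a permissive threshold compared with the commonly used Q20 (P = 0.01).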

In order to filter sequences for low quality and vector contamination, we used the algorithms SeqClean (http://compbio.dfci.harvard.edu/tgi/software/) and PHRAP's package cross_match (http://www.phrap.org). SeqClean was first used to remove low-quality portions of PHRED base-called sequences. The cross_match algorithm was then run against the sequence of the cloning vector used in this analysis; finally, SeqClean was rerun to remove the cross_matches. Sequence clustering was analyzed with TIGR software TGICL (Pertea et al. 2003), using default parameters.
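The length-filtering step of such a pipeline can be sketched in plain Python. This is not SeqClean or cross_match; it only illustrates discarding reads that are too short after trimming, using the 100 bp minimum reported in the Results. The function names and the toy records are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the first word of the header as the identifier.
            header, chunks = (line[1:].split() or [""])[0], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_short(records: Iterable[Tuple[str, str]],
                 min_length: int = 100) -> Dict[str, str]:
    """Keep only sequences of at least min_length bases."""
    return {name: seq for name, seq in records if len(seq) >= min_length}


if __name__ == "__main__":
    toy = ">read_a\n" + "A" * 120 + "\n>read_b\n" + "C" * 40 + "\n"
    kept = filter_short(parse_fasta(toy))
    print(sorted(kept))  # identifiers of reads that pass the length filter
```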

Comparative BLAST analysis against Corynebacterium complete genomes

Non-redundant unique sequences clustered with TGICL were used as a query for BLAST searches (Altschul et al. 1997) against other species of the Corynebacterium genus, to look for similarities between these species and C. pseudotuberculosis. We ran BLAST searches at the DNA level (BLASTn) and at the protein level (BLASTx, tBLASTx) to search for nucleotide and amino acid conservation among species.

These complete genomes are available on the NCBI site; their accession numbers are: C. diphtheriae (NC_002935); C. efficiens (NC_004369); C. glutamicum (NC_006958); and C. jeikeium (NC_007164).
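The searches described above could be scripted along the following lines. This sketch only assembles command lines for the current NCBI BLAST+ tools (`blastn`, `blastx`, `tblastx` with `-query`, `-db`, `-evalue` and `-outfmt 6`); the study does not state which BLAST release or options were used, so the choice of BLAST+ is an assumption, and the file and database names are hypothetical placeholders.

```python
from typing import List

# Accession numbers as listed in the text.
SUBJECT_GENOMES = {
    "C. diphtheriae": "NC_002935",
    "C. efficiens": "NC_004369",
    "C. glutamicum": "NC_006958",
    "C. jeikeium": "NC_007164",
}


def blast_command(program: str, query_fasta: str, database: str,
                  evalue: float = 1e-10) -> List[str]:
    """Build (but do not run) a BLAST+ command line with tabular output."""
    if program not in {"blastn", "blastx", "tblastx"}:
        raise ValueError(f"unsupported program: {program}")
    return [
        program,
        "-query", query_fasta,
        "-db", database,          # assumes a pre-built BLAST database of this name
        "-evalue", str(evalue),
        "-outfmt", "6",           # tab-separated hit table
    ]


if __name__ == "__main__":
    for species, accession in SUBJECT_GENOMES.items():
        cmd = blast_command("tblastx", "cp_uniques.fasta", accession)
        print(species, "->", " ".join(cmd))
```

Building the argument list separately from execution keeps the parameters easy to inspect; the list can be passed to `subprocess.run` once the databases exist.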

COG-functional categories inference through BLAST best-hit analysis

Unique sequences from C. pseudotuberculosis were also used as a query for BLASTx searches in the COG database (Tatusov et al. 1997). The NCBI COG database clusters genes from 66 microbial species into orthologous groups and classifies the corresponding protein clusters into biological process categories. We used a best-hit inference approach to classify this organism's sequences into COG-functional categories (Tatusov et al. 1997). The BLASTx algorithm was run using an E-value cut-off of 10⁻⁵, and C. pseudotuberculosis partial genome sequences were classified into COG categories based on the best-hit match of each sequence against COG-classified proteins.
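The best-hit approach described above can be illustrated with a small Python sketch. It assumes hits are available as (query, subject, E-value, bit score) tuples and that a lookup from COG protein identifiers to category letters exists; both the data layout and the identifiers in the example are hypothetical, and the authors' actual implementation is not described.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, str, float, float]  # (query id, subject id, E-value, bit score)


def best_hits(hits: Iterable[Hit], max_evalue: float = 1e-5) -> Dict[str, Hit]:
    """Keep, for each query, the passing hit with the highest bit score."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        query, _subject, evalue, bitscore = hit
        if evalue > max_evalue:
            continue  # hit does not pass the E-value cut-off
        current = best.get(query)
        if current is None or bitscore > current[3]:
            best[query] = hit
    return best


def classify(hits: Iterable[Hit], protein_to_cog: Dict[str, str],
             max_evalue: float = 1e-5) -> Counter:
    """Count queries per COG category, based on the category of each best hit."""
    counts: Counter = Counter()
    for _query, subject, _evalue, _bits in best_hits(hits, max_evalue).values():
        counts[protein_to_cog.get(subject, "unassigned")] += 1
    return counts


if __name__ == "__main__":
    toy_hits = [
        ("unique_1", "prot_A", 1e-20, 90.0),
        ("unique_1", "prot_B", 1e-30, 120.0),
        ("unique_2", "prot_C", 1e-3, 30.0),   # fails the cut-off
        ("unique_3", "prot_D", 1e-8, 55.0),
    ]
    toy_map = {"prot_A": "K", "prot_B": "J", "prot_D": "E"}
    print(classify(toy_hits, toy_map))
```

Queries with no passing hit do not appear in the counts at all; in a full analysis they would be tallied separately as sequences without a COG match.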

Comparative BLAST analyses of the UNIPROT database

The most recent version of the UNIPROT curated protein database available at the time (February 13, 2007) (Apweiler et al. 2004; UniProt Consortium 2007) was downloaded from the EBI website (ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.fasta.gz). UNIPROT was used as a database subject for BLASTx searches, using C. pseudotuberculosis unique sequences as a query. The number of protein hits was recorded, as well as the number of unique sequences matching UNIPROT proteins and not matching Corynebacterium genus or COG proteins.

Gene synteny analysis with Corynebacterium genomes

Synteny analysis of C. pseudotuberculosis non-redundant GSSs was also performed to determine whether gene order is conserved in C. pseudotuberculosis compared to other Corynebacterium species. Using tBLASTx best-hits data of C. pseudotuberculosis unique sequences against C. diphtheriae, C. efficiens, C. glutamicum and C. jeikeium genomes, we calculated the putative average position of each protein-coding gene (subject end minus subject init), compared to other Corynebacterium spp. complete genomes. The putative gene position results were plotted and ordered based on C. diphtheriae hits.
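A minimal sketch of the mapping step follows. It assumes that the position of each best hit is summarized as the midpoint of its subject start and end coordinates, which is one reading of the "average position" mentioned above and not necessarily the authors' exact calculation; the identifiers, coordinates and species keys are hypothetical.

```python
from typing import Dict, List, Tuple


def hit_midpoint(subject_start: int, subject_end: int) -> float:
    """Midpoint of a hit on the subject genome (works for either strand)."""
    return (subject_start + subject_end) / 2.0


def order_by_reference(
    positions: Dict[str, Dict[str, float]], reference: str = "cd"
) -> List[Tuple[str, Dict[str, float]]]:
    """Sort unique sequences by their position in the reference genome.

    positions maps each unique-sequence id to {species key: position}.
    Sequences without a hit in the reference genome are left out.
    """
    with_reference = [(uid, pos) for uid, pos in positions.items()
                      if reference in pos]
    return sorted(with_reference, key=lambda item: item[1][reference])


if __name__ == "__main__":
    toy = {
        "unique_1": {"cd": hit_midpoint(2000, 2300), "cg": 5150.0},
        "unique_2": {"cd": hit_midpoint(900, 100), "cg": 400.0},
        "unique_3": {"cg": 10.0},  # no C. diphtheriae hit
    }
    for rank, (uid, pos) in enumerate(order_by_reference(toy), start=1):
        print(rank, uid, pos)
```

Plotting the rank in the reference genome against the position in each other genome gives a dot plot in which conserved gene order appears as a diagonal.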

Results

We obtained approximately 1440 GSSs; after sequence processing, we had about 1000 sequences from the 500 plasmids (clones). Including all GSSs larger than 100 bp, the average sequence length was 360 bp, while the longest was 792 bp; 346,869 residues were generated.

Raw data filtering, processing and clustering analysis

Base-called sequences were first filtered to remove vector regions, as well as short (<100 bp) and low-quality regions. Filtering removed more than 30% of the bases and sequences. Longer sequences were also shortened because of vector contamination and/or low-quality regions.

C. pseudotuberculosis GSSs were clustered using TIGR TGICL software. This software uses Megablast (Zhang et al. 2000) to cluster sequences and CAP3 (Huang and Madan 1999) to further assemble the sequences from each cluster. As expected, we found a large number of singlets and clusters with a small number of sequences. The 466 non-redundant unique sequences spanned approximately 184,000 bases (Table 1).

Table 1. Clustering results.

| Sequence class | Number of bases | Number of sequences | Average size (bp) | Longest sequence | Shortest sequence |
|---|---|---|---|---|---|
| Singlets | 91,047 | 276 | 331 | 792 | 100 |
| Sequences clustered | 255,813 | 687 | 372 | 786 | 100 |
| Clusters | 92,645 | 190 | 490 | 1248 | 121 |
| GSS uniques^a (non-redundant set) | 183,692 | 466 | 395 | 1248 | 100 |

a Unique data (except for longest and shortest sequences) are the sum of singlets plus clusters data.

BLAST against the complete genome of Corynebacterium species

All the analyses were made using the non-redundant set of C. pseudotuberculosis sequences (uniques). C. pseudotuberculosis GSS unique sequences were then used as a query for BLAST searches against the genome (BLASTn), the translated genome (tBLASTx) and the proteome (BLASTx) of the Corynebacterium species that have been completely sequenced (Table 2).

Table 2. BLAST analysis of Corynebacterium pseudotuberculosis uniques against other Corynebacterium species.

| BLAST program | Subject | E-value cutoff | cd^a,b | ce^a,b | cg^a,b | cj^a,b |
|---|---|---|---|---|---|---|
| tBLASTx | Genome | 10⁻¹⁰ | 300 (100%) | 244 (100%) | 247 (100%) | 198 (100%) |
| BLASTn | Genome | 10⁻¹⁰ | 106 (35%) | 42 (17%) | 46 (19%) | 31 (16%) |
| BLASTx | Proteome | 10⁻¹⁰ | 268 (89%) | 224 (92%) | 221 (90%) | 182 (92%) |
| BLASTx | Proteome | 10⁻⁵ | 305 | 274 | 275 | 235 |

a Number of C. pseudotuberculosis uniques hits against other species' genomes.
b Percentage of searches with Genome and tBLASTx; cd: Corynebacterium diphtheriae; ce: Corynebacterium efficiens; cg: Corynebacterium glutamicum; cj: Corynebacterium jeikeium.

As expected, we found C. pseudotuberculosis to be more similar to other Corynebacterium spp. at the protein level than at the DNA level (Nakamura et al. 2003). Based on BLAST analysis of C. pseudotuberculosis uniques against other Corynebacterium species (Table 2), C. pseudotuberculosis is most similar to C. diphtheriae, followed by C. glutamicum, C. efficiens and C. jeikeium. Similar results were previously reported by Khamis et al. (2004), who suggested these phylogenetic relationships using rpoB and 16S rRNA analysis.
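The percentages in Table 2 are expressed relative to the number of tBLASTx genome hits for each species. The short sketch below recomputes them from the hit counts in the table; the variable and function names are illustrative.

```python
from typing import Dict

# Hit counts copied from Table 2 (E-value cut-off 10^-10).
TBLASTX_GENOME = {"cd": 300, "ce": 244, "cg": 247, "cj": 198}
BLASTN_GENOME = {"cd": 106, "ce": 42, "cg": 46, "cj": 31}
BLASTX_PROTEOME = {"cd": 268, "ce": 224, "cg": 221, "cj": 182}


def percent_of_reference(hits: Dict[str, int],
                         reference: Dict[str, int]) -> Dict[str, float]:
    """Express hit counts as a percentage of the reference counts per species."""
    return {species: 100.0 * hits[species] / reference[species]
            for species in reference}


if __name__ == "__main__":
    for label, hits in (("BLASTn", BLASTN_GENOME), ("BLASTx", BLASTX_PROTEOME)):
        pct = percent_of_reference(hits, TBLASTX_GENOME)
        print(label, {sp: round(value, 1) for sp, value in pct.items()})
```

The recomputed values agree with the published percentages to within about one percentage point, and C. diphtheriae has the most hits with every program.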

All GSS unique sequences were used as a query for BLASTx searches of the COG database. We used a BLAST e-value cut-off of 10⁻⁵, and the C. pseudotuberculosis translated partial protein sequences were classified into functional categories based on the category of their best hit in the COG database. The C. pseudotuberculosis proteins were classified into major and minor COG-functional categories (Table 3).

Almost half of the C. pseudotuberculosis GSS uniques did not correlate with any protein in the COG database, even with the low, weakly stringent BLAST cut-off that we used. Most of the genes were classified in the metabolism category, while approximately 10% of the genes were classified into each of three categories: information storage and processing, cellular processes and poorly characterized. For comparison, we calculated the expected percentage of genes in each category for the genus Corynebacterium, based on the COG classification of the four bacteria of this genus whose complete genomes had already been sequenced (Table 3).

Most of the classified GSS uniques fit the expected percentages for the Corynebacterium genus when we consider that this is only a partial analysis of the C. pseudotuberculosis genome and proteome (Table 3). We found an excess of unclassified sequences.
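The major-category totals can be recomputed from the per-category counts in Table 3. The sketch below sums the counts by major category and expresses them as percentages of all classified and unclassified entries; using the overall total as the denominator is an assumption, made because it reproduces the percentages printed in the table.

```python
from typing import Dict, Tuple

# (major category, count) per COG letter, copied from Table 3.
COG_COUNTS: Dict[str, Tuple[str, int]] = {
    "J": ("Inf", 21), "K": ("Inf", 23), "L": ("Inf", 23),
    "D": ("pCel", 1), "O": ("pCel", 8), "M": ("pCel", 19), "N": ("pCel", 0),
    "U": ("pCel", 1), "V": ("pCel", 8), "P": ("pCel", 21), "T": ("pCel", 11),
    "C": ("Met", 14), "H": ("Met", 19), "I": ("Met", 14), "G": ("Met", 21),
    "E": ("Met", 28), "F": ("Met", 18), "Q": ("Met", 3),
    "R": ("Poor", 29), "S": ("Poor", 17),
}
NOT_IN_COG = 204


def major_category_totals(counts: Dict[str, Tuple[str, int]]) -> Dict[str, int]:
    """Sum the per-letter counts into the four major COG groupings."""
    totals: Dict[str, int] = {}
    for major, n in counts.values():
        totals[major] = totals.get(major, 0) + n
    return totals


if __name__ == "__main__":
    totals = major_category_totals(COG_COUNTS)
    grand_total = sum(totals.values()) + NOT_IN_COG
    for major, n in totals.items():
        print(f"{major:5s} {n:4d} {100.0 * n / grand_total:5.1f}%")
    print(f"Not in COG {NOT_IN_COG} {100.0 * NOT_IN_COG / grand_total:.1f}%")
```

With these counts the three main groupings come to roughly 13%, 14% and 23%, in line with the figures quoted in the Summary.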

Shared hits against different databases

In order to search for putative C. pseudotuberculosis genes originated by lateral transfer, we compared BLAST results against various databases. It was expected that most of the C. pseudotuberculosis genes derived from GSSs would be found in other Corynebacterium species if they had been present in the common ancestor of all the species in this genus. This expectation was confirmed (Figure 1). However, we observed three putative C. pseudotuberculosis proteins (translated uniques) that had similarities to COG and UNIPROT proteins but no similarities to other Corynebacterium species already sequenced and deposited in databases. Therefore, we consider these three proteins putative candidates for analysis of lateral transfer into the genome of C. pseudotuberculosis and of other Corynebacterium species not previously sequenced.

The putative genes that might have been incorporated into the C. pseudotuberculosis genome by lateral transfer were identified as follows: (1) an oligopeptide/dipeptide ABC transporter, ATPase subunit, with its best hit in the NR database against another bacterium (Arthrobacter sp. FB24) from the same order as C. pseudotuberculosis (Actinomycetales) (e-value 2e-19); (2) a coenzyme F420-dependent NADP oxidoreductase from Chloroflexus aurantiacus, a bacterium in the Chloroflexi phylum (e-value 4e-10); and (3) a proline iminopeptidase (PIP) from Xanthomonas axonopodis of the Proteobacteria phylum (e-value 1e-55).
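The selection logic behind these candidates amounts to a set operation: sequences with hits in both curated databases but none in the sequenced Corynebacterium proteomes. A minimal sketch with hypothetical sequence identifiers follows.

```python
from typing import Iterable, Set


def lateral_transfer_candidates(cog_hits: Iterable[str],
                                uniprot_hits: Iterable[str],
                                genus_hits: Iterable[str]) -> Set[str]:
    """Return ids with COG and UNIPROT hits but no hit within the genus."""
    return (set(cog_hits) & set(uniprot_hits)) - set(genus_hits)


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    cog = {"unique_1", "unique_2", "unique_3", "unique_7"}
    uniprot = {"unique_2", "unique_3", "unique_7", "unique_9"}
    genus = {"unique_1", "unique_3", "unique_9"}
    print(sorted(lateral_transfer_candidates(cog, uniprot, genus)))
```

Candidates selected this way still need experimental confirmation, since the absence of a hit may simply reflect the limited number of sequenced genomes in the genus.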

Confirmation of the putative genes of C. pseudotuberculosis

Using PCR analysis, we found three putative exclusive genes of C. pseudotuberculosis. For this amplification, we used genomic DNA of the 1002, C231, T1 and T2 strains of C. pseudotuberculosis, and of C. renale and C. ulcerans, which have not yet been sequenced, as well as of C. diphtheriae, which has been completely sequenced. As a negative control, we used the TOPO vector alone, and as a positive control we used the clone that contained the fragment from which the GSS originated.

Table 3. COG functional classification of Corynebacterium pseudotuberculosis putative partial proteins.

| COG category | Cat^a | Cp^b | % in genus^c |
|---|---|---|---|
| (J) Translation, ribosomal structure and biogenesis | Inf | 21 (4.2%) | 3.9 |
| (K) Transcription | Inf | 23 (4.6%) | 5.4 |
| (L) DNA replication, recombination and repair | Inf | 23 (4.6%) | 5.2 |
| (D) Cell division and chromosome partitioning | pCel | 1 (0.2%) | 0.5 |
| (O) Posttranslational modification, protein turnover, chaperones | pCel | 8 (1.6%) | 2.2 |
| (M) Cell envelope biogenesis, outer membrane | pCel | 19 (3.8%) | 3.1 |
| (N) Cell motility and secretion | pCel | 0 | 0.0 |
| (U) Intracellular trafficking and secretion | pCel | 1 (0.2%) | 0.7 |
| (V) Defense mechanisms | pCel | 8 (1.6%) | 1.3 |
| (P) Inorganic ion transport and metabolism | pCel | 21 (4.2%) | 4.9 |
| (T) Signal transduction mechanisms | pCel | 11 (2.2%) | 2.3 |
| (C) Energy production and conversion | Met | 14 (2.8%) | 3.9 |
| (H) Coenzyme metabolism | Met | 19 (3.8%) | 3.4 |
| (I) Lipid metabolism | Met | 14 (2.8%) | 2.4 |
| (G) Carbohydrate transport and metabolism | Met | 21 (4.2%) | 3.6 |
| (E) Amino acid transport and metabolism | Met | 28 (5.6%) | 5.8 |
| (F) Nucleotide transport and metabolism | Met | 18 (3.6%) | 2.1 |
| (Q) Secondary metabolites biosynthesis, transport and catabolism | Met | 3 (0.6%) | 1.7 |
| (R) General function prediction only | Poor | 29 (5.8%) | 8.0 |
| (S) Function unknown | Poor | 17 (3.4%) | 5.0 |
| Not in COG | – | 204 (40.6%) | 33.3 |

Cat: Categories; Cp: Corynebacterium pseudotuberculosis.
a Description of column data: Inf (information storage and processing), pCel (cellular processes), Met (metabolism), Poor (poorly characterized).
b Number of genes found (and partial percentages) in the cp genome. The absolute number is larger than 514, since some genes were classified into more than one category.
c Percentage in the genus Corynebacterium according to COG data (http://www.ncbi.nlm.nih.gov/sutils/coxik.cgi?gi=375).

Figure 1. Distribution of Corynebacterium pseudotuberculosis GSS uniques' BLAST hits against three databases: COG, UNIPROT and a built-in Corynebacterium protein database containing the proteomes.

Gene order analysis

In order to determine whether gene order is conserved in Corynebacterium species, based on the partial genome analysis that we made of C. pseudotuberculosis, we used tBLASTx best-hit results against the C. diphtheriae, C. glutamicum, C. efficiens and C. jeikeium genomes to define the positions where all the C. pseudotuberculosis GSS putative genes mapped in those genomes.

All of the C. pseudotuberculosis GSS putative genes were mapped against other Corynebacterium genomes, ordered by the C. diphtheriae genome (Figure 2). All the genomes were found to have a high degree of gene order conservation when the C. pseudotuberculosis GSSs were used for mapping. C. jeikeium had a clear inversion in the exact center of its genome when compared to the other species. The symbols representing the C. efficiens and C. glutamicum genomes were often seen together, showing the similarity between them. We did not observe any evidence of similarity between C. pseudotuberculosis and C. diphtheriae genes outside of the main diagonal line, suggesting considerable conservation of gene order in these two species.

Figure 2. Gene synteny analysis of Corynebacterium pseudotuberculosis GSS uniques based on tBLASTx genome mapping against other Corynebacterium genomes.

Discussion

We analyzed the C. pseudotuberculosis genome based on ~1000 genome survey sequences that were deposited in GenBank (ER770684 to ER771646). DNA clustering of these data produced 466 sequences, accounting for 183,692 base pairs in this organism's genome. Another four Corynebacterium species had already been sequenced; their genomes varied from 2.4 to 3.2 megabases.

We analyzed an estimated 5.6–7.5% of the C. pseudotuberculosis genome. We found C. pseudotuberculosis to be most similar to C. diphtheriae, a species with a 2.4 megabase genome, at both the nucleotide and the amino acid levels (Table 2).
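The coverage estimate follows from dividing the 183,692 sampled bases by a plausible genome size. The sketch below uses the rounded 2.4 and 3.2 megabase figures quoted in the text, so it gives about 5.7% and 7.7%, close to but not identical with the reported 5.6–7.5% range, which presumably relied on exact genome lengths (an assumption, since the calculation is not shown).

```python
SAMPLED_BASES = 183_692  # bases covered by the 466 non-redundant uniques


def coverage_percent(sampled_bases: int, genome_size_bp: int) -> float:
    """Percentage of a genome of the given size covered by the sampled bases."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return 100.0 * sampled_bases / genome_size_bp


if __name__ == "__main__":
    # Rounded genome sizes from the text; exact lengths would shift the result slightly.
    for size_bp in (3_200_000, 2_400_000):
        pct = coverage_percent(SAMPLED_BASES, size_bp)
        print(f"{size_bp / 1e6:.1f} Mb genome: {pct:.1f}% covered")
```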

We also performed many similarity analyses, comparing C. pseudotuberculosis GSS genome data with the genomes and proteomes of four other Corynebacterium species that had already been completely sequenced. Similarity was greatest at the protein level. This indicates that these species might have diverged in the distant past; it may also be due to random, neutral mutations that did not affect the phenotype. We found C. pseudotuberculosis to have more nucleotide and amino acid similarities with C. diphtheriae than with other species (Table 2). After C. diphtheriae, C. pseudotuberculosis was most similar to C. efficiens and C. glutamicum. In contrast, C. jeikeium was most distant from C. efficiens and C. glutamicum. Khamis et al. (2004) found the same species relationships based on 16S rRNA and rpoB data.

Among the clusters formed from the GSSs, we found one cluster with 25 sequences. Similarity analyses showed this contig to be similar to the 23S rRNA subunit of C. diphtheriae gravis NCTC13129, with 96% identity and an e-value of 0.0.

The very high number of "no hits" can be almost totally explained by considering the size of the non-coding regions in the Corynebacterium genomes. Based on the other species in the Corynebacterium genus, a third of the genes of C. pseudotuberculosis would be expected not to be classified into COG categories (Table 3). We also found that 40.6% of the non-redundant unique sequences of C. pseudotuberculosis fell into this unclassified category. However, this calculation of one-third of the genes (Table 3) only includes the percentage of genes not classified as COGs, while the unique sequences include both genes and intergenic regions. We found that the proportions of the genomes of C. diphtheriae, C. glutamicum, C. efficiens and C. jeikeium that were composed of non-protein-coding regions were 11.7%, 6.7%, 12.7% and 8.4%, respectively. Therefore, if we consider an average putative value of 9.9% non-coding regions in the C. pseudotuberculosis genome, plus 33.3% putative C. pseudotuberculosis proteins not classified as COGs, we obtain an estimate of 43.2% of uniques that are expected not to be classified as COGs; this is close to the observed value in Table 3 (40.6%).
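The arithmetic behind this estimate can be reproduced directly from the quoted percentages; the sketch below simply follows the additive reasoning of the text, and the names are illustrative.

```python
from typing import Dict, Tuple

# Non-protein-coding fraction of each sequenced genome, as quoted in the text (%).
NONCODING_PERCENT = {
    "C. diphtheriae": 11.7,
    "C. glutamicum": 6.7,
    "C. efficiens": 12.7,
    "C. jeikeium": 8.4,
}
NOT_IN_COG_GENUS = 33.3  # % of genus genes not classified in COG (Table 3)


def expected_unclassified(noncoding: Dict[str, float],
                          not_in_cog: float) -> Tuple[float, float]:
    """Return (mean non-coding %, expected % of uniques without a COG class)."""
    mean_noncoding = round(sum(noncoding.values()) / len(noncoding), 1)
    return mean_noncoding, round(mean_noncoding + not_in_cog, 1)


if __name__ == "__main__":
    mean_nc, expected = expected_unclassified(NONCODING_PERCENT, NOT_IN_COG_GENUS)
    print(f"mean non-coding: {mean_nc}%  expected unclassified: {expected}%")
```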

The GSSs used in the analysis cited above ranged from 350 to 1280 bp, as shown in Table 3; this size range is relevant to similarity searches. The percentage of unclassified proteins found in the COG database was also similar between C. pseudotuberculosis and the other, already sequenced species of the genus; this probably reflects the characterization of new proteins not previously described.

Comparing the number of uniques found as BLAST hits against Corynebacterium proteins that have significant similarities with proteins in two different curated protein databases (COG and UNIPROT), we found three C. pseudotuberculosis genes that could have been incorporated through lateral gene transfer. These three GSSs were confirmed by PCR in C. pseudotuberculosis, suggesting that PIP, the putative oligopeptide/dipeptide ABC transporter and the NADP oxidoreductase are actually present in the genome of C. pseudotuberculosis. We also found all three GSSs in C. renale, but not in C. diphtheriae or C. ulcerans. A close relationship of C. pseudotuberculosis with C. renale was also found by Dorella et al. (2006c), based on analysis of the rpoB gene. We used amplification of the rpoB gene as a positive control for confirmation of Corynebacterium species, since this gene only amplifies in species of this genus.

The synteny analysis, based on the genomic region covered by the ~1000 GSSs described in this work, showed great similarity between C. diphtheriae and C. pseudotuberculosis, as described in previous studies. Because of this similarity between the two species, the genome of C. diphtheriae was used for anchoring the GSSs in the genome of C. pseudotuberculosis. Other characteristics of the Corynebacterium species whose whole genomes have already been sequenced were demonstrated for this genomic region (Figure 2).

An almost straight line was seen when C. pseudotuberculosis putative gene positions were mapped against the C. diphtheriae genome; this finding demonstrated that our analysis was efficient in this sample of randomly located genes in the C. pseudotuberculosis genome. A broken line would indicate that some regions had been preferentially sampled. When we examine the main diagonals in Figure 2, we can conclude that all of the C. pseudotuberculosis GSS uniques would likely be useful as anchors to produce a final version of a future C. pseudotuberculosis genome initiative.

Many C. pseudotuberculosis genes mapped in different positions when compared to the C. efficiens and C. glutamicum genomes. Nevertheless, we found many filled and unfilled triangles in similar positions (off the diagonal), also showing similarity between the C. efficiens and C. glutamicum genomes and differences compared to C. diphtheriae and C. pseudotuberculosis. As expected, C. jeikeium was the most divergent species based on gene order. Synteny analysis suggests that Corynebacteria have rarely undergone genome rearrangements and have maintained ancestral genome structures, even after the divergence of Corynebacteria and Mycobacteria (Nakamura et al. 2003).

Thus, based on our analysis of ~1000 C. pseudotuberculosis GSS sequences, we produced an overview of this bacterium's genome. The sequence of ~8% of its genome allowed us to determine its evolutionary position among other Corynebacterium spp. that have been completely sequenced, and to classify putative proteins into COG-functional categories.

Acknowledgement

This work received funding from FAPEMIG REDE-2829/05.

References

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;1:3389–402.

Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, et al. UniProt: the universal protein knowledgebase. Nucleic Acids Res 2004;1:115–9.

Barksdale L. Corynebacterium diphtheriae and its relatives. Bacteriol Rev 1970;34:378–422.

Binnewies TT, Motro Y, Hallin PF, Lund O, Dunn D, David TL, et al. Ten years of bacterial genome sequencing: comparative-genomics-based discoveries. Funct Integr Genomic 2006;6:165–85.

Celestino PBS, Carvalho LR, Freitas LM, Dorella FA, Martins AF, Pacheco LGC, et al. Update of microbial genome programs for bacteria and archaea. Genet Mol Res 2004;3:421–31.

Cerdeno-Tarraga AM, Efstratiou A, Dover LG, Holden MTG, Pallen M, Bentley SD, et al. The complete genome sequence and analysis of Corynebacterium diphtheriae NCTC13129. Nucleic Acids Res 2003;31:6516–23.

Collins MD, Bradbury JF. Plant pathogenic species of Corynebacterium. Bergey’s manual of systematic bacteriology 1986;2:276–1283.

Collins MD, Cummins CS. Genus Corynebacterium. Bergey’s manual of systematic bacteriology 1986;2:1266–76.

Deb JK, Nath N. Plasmids of corynebacteria. FEMS Microbiol Lett 1999;175:11–20.

Dorella FA, Fachin MS, Billault A, Dias Neto E, Soravito C, Oliveira SC, et al. Construction and partial characterization of a Corynebacterium pseudotuberculosis bacterial artificial chromosome library through genomic survey sequencing. Genet Mol Res 2006a;5:653–63.

Dorella FA, Pacheco LGC, Oliveira SC, Miyoshi A, Azevedo V. Corynebacterium pseudotuberculosis: microbiology, biochemical properties, pathogenesis and molecular studies of virulence. Vet Res 2006b;37:1–18.

Dorella FA, Estevam EM, Cardoso PG, Savassi BM, Oliveira SC, Azevedo V, et al. An improved protocol for electrotransformation of Corynebacterium pseudotuberculosis. Vet Microbiol 2006c;114:298–303.

Dower WJ, Miller JF, Ragsdale CW. High efficiency transformation of E. coli by high voltage electroporation. Molecular biology group, Bio-Rad Laboratories, Richmond. Nucleic Acids Res 1998;16:6127–45.

Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 1998;8:186–94.

Ewing B, Hillier L, Wen MC, Green P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 1998;8:175–85.

Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 1995;5223(269):496–8(507–512).

Huang X, Madan A. CAP3: a DNA sequence assembly program. Genome Res 1999;9:868–77.

Ikeda M, Nakagawa S. The Corynebacterium glutamicum genome: features and impacts on biotechnological processes. Appl Microbiol Biotechnol 2003;62:99–109.

Khamis A, Raoult D, La Scola B. rpoB gene sequencing for identification of Corynebacterium species. J Clin Microbiol 2004;42:3925–31.

Matsui K, Nishio Y, Nakamura Y, Kawarabayasi Y, Usuda Y, Kimura E, et al. Comparative complete genome sequence analysis of the amino acid replacements responsible for the thermostability of Corynebacterium efficiens. Genome Res 2003;13:1572–9.

Mora M, Donati C, Medini D, Covacci A, Rappuoli R. Microbial genomes and vaccine design: refinements to the classical reverse vaccinology approach. Curr Opin Microbiol 2006;9:532–6.

Nakamura Y, Nishio Y, Ikeo K, Gojobori T. The genome stability in Corynebacterium species due to lack of the recombinational repair system. Gene 2003;317:149–55.

Pacheco LGC, Pena RR, Castro TLP, Dorella FA, Bahia RC, Carminati R, et al. Multiplex PCR assay for identification of Corynebacterium pseudotuberculosis from pure cultures and for rapid detection of this pathogen in clinical samples. J Med Microbiol 2007;56:1–7.

Pascual C, Lawson PA, Farrow JAE, Gimenez MN, Collins M. Phylogenetic analysis of the genus Corynebacterium based on 16S rRNA gene sequences. Int J Syst Bacteriol 1995;1:724–8.

Pertea G, Huang X, Liang F, Antonescu V, Sultana R, Karamycheva S, et al. TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics 2003;19:651–2.

Prosdocimi F, Lopes DAO, Peixoto FC, Ortega JM. Effects of sample re-sequencing and trimming on the quality and size of assembled consensus. Genet Mol Res 2007;6(4):756–65.

Rodríguez-Valera F. Environmental genomics, the big picture? FEMS Microbiol Lett 2004;231:153–8.

Roe BA, Crabtree JS, Khan AS. DNA isolation and sequencing. New York, NY: Wiley; 1996.

Sambrook J, Fritsch EF, Maniatis T. Molecular cloning: a laboratory manual, 2nd ed. New York: Cold Spring Harbor Laboratory Press; 1989.

Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science 1997;278:631–7.

Tauch A, Kaiser O, Hain T, Goesmann A, Weisshaar B, Albersmeier A, et al. Complete genome sequence and analysis of the multiresistant nosocomial pathogen Corynebacterium jeikeium K411, a lipid-requiring bacterium of the human skin flora. J Bacteriol 2005;13:4671–82.

Tauch A, Trost E, Bekel T, Goesmann A, Ludewig U, Puhler A. Ultrafast de novo sequencing of Corynebacterium urealyticum using the genome sequencer 20 system. Biochemica 2006;4:1–4.

UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res 2007;35:193–7.

Zhang Z, Schwartz S, Wagner L, Miller W. A greedy algorithm for aligning DNA sequences. J Comput Biol 2000;1:203–14.