The gene cassette metagenome is a basic resource for bacterial genome evolution


Environmental Microbiology (2003) 5(5), 383–394

© 2003 Society for Applied Microbiology and Blackwell Publishing Ltd

Original Article

Received 17 October, 2002; accepted 15 January, 2003. *For correspondence. E-mail [email protected]; Tel. (+612) 9850 8164; Fax (+612) 9850 8245. †Present address: School of Molecular and Microbial Biosciences, The University of Sydney, New South Wales, 2006.

Andrew J. Holmes,1† Michael R. Gillings,1 Blair S. Nield,2 Bridget C. Mabbutt,3 K. M. Helena Nevalainen2 and H. W. Stokes2*

1 Key Centre for Biodiversity and Bioresources, Macquarie University, Sydney NSW 2109, Australia.
2 Department of Biological Sciences, Macquarie University, Sydney NSW 2109, Australia.
3 Department of Chemistry, Macquarie University, Sydney NSW 2109, Australia.

Summary

Lateral gene transfer has been proposed as a fundamental process underlying bacterial diversity. Transposons, plasmids and phage are widespread and have been shown to significantly contribute to lateral gene transfer. However, the processes by which disparate genes are assembled and integrated into the host regulatory network to yield new phenotypes are poorly known. Recent discoveries about the integron/gene cassette system indicate it has the potential to play a role in this process. Gene cassettes are small mobile elements typically consisting of a promoterless orf and a recombination site. Integrons are capable of acquisition and re-arrangement of gene cassettes and of the expression of their associated genes. The potential of the integron/gene cassette system is thus largely determined by the diversity contained within the cassette pool and the rate at which integrons sample this pool. Using a polymerase chain reaction (PCR) approach by which the environmental gene cassette (EGC) metagenome can be directly sampled, we show here that this metagenome contains both protein-coding and non-protein-coding genes. Environmental gene cassette-associated recombination sites showed greater diversity than previously seen in integron arrays. Class 1 integrons were shown to be capable of accessing this gene pool through tests of recombinational activity with a representative range of EGCs. We propose that gene cassettes represent a vast, prepackaged genetic resource that could be thought of as a metagenomic template for bacterial evolution.

Introduction

The Bacteria are the most physiologically diverse group known. Such physiological diversity is underpinned by corresponding genomic diversity. Given that the typical bacterial genome size is <10 Mbp, this diversity has arisen from a remarkably small genomic template. These contrasting observations can be reconciled through the proposition that horizontal gene transfer (HGT) is the major factor in the evolution of bacteria (Ochman et al., 2000). There is a large body of evidence supporting this general thesis. Mechanisms for transfer of genes between cells are well known and virtually ubiquitous in both a phylogenetic and ecological sense. Genome sequence analyses have shown that a large proportion of any one bacterial genome is likely to have been acquired from ‘foreign’ sources. For several complex phenotypes there is strong evidence that sets of genes were separately acquired by HGT. A major gap in our knowledge, however, is how transferred genes are integrated into the metabolism of the recipient cell.

In recent years the integron/gene cassette system has emerged as one of the best examples of capture and expression of new genes (Hall et al., 1999). Integrons include a site-specific recombination system and were first identified as the sites of antibiotic resistance gene capture in mobile elements from clinical isolates (Stokes and Hall, 1989; Martinez and de la Cruz, 1990; Collis et al., 1993). The integron is a recombination and expression system that captures genes as part of a genetic element called a gene cassette (Recchia and Hall, 1995). Gene cassettes are very simple genetic elements that typically consist of a single promoterless gene and a recombination site termed a 59-base element (59-be). In the well-studied class 1 integrons, the gene capture system consists of a site-specific recombinase (IntI1) and a recombination site (attI1). IntI1 reversibly catalyses two types of site-specific recombination reaction. These are recombination between attI1 and a 59-be, or recombination between two 59-be sites. Collectively these reactions result in the assembly of new genes downstream of an integron-associated promoter, Pc, that directs transcription of the cassette-associated genes (Stokes and Hall, 1989; Hall et al., 1991; Collis and Hall, 1992; Collis and Hall, 1995). The arrangement of these features is shown in Fig. 1. Recently, a similar organization has also been demonstrated for the class 3 integron (Collis et al., 2002). Class 1 and 3 integrons thus fulfill the basic requirements for gene acquisition in the HGT model for evolution of bacterial physiological diversity. Disparate genes can be assembled at a specific locus where they are amenable to regulatory control by the cell.

Recent discoveries have shown that integrons are not simply part of mobile elements carrying antibiotic resistance genes, but are a distinct type of genetic element found in a variety of genomic contexts. Integrons and gene cassette arrays have been sequenced in the chromosomes of Pseudomonas, Vibrio, Xanthomonas and Shewanella spp. (Heidelberg et al., 2000; Rowe-Magnus et al., 2001; Vaisvila et al., 2001; da Silva et al., 2002). Furthermore, intI homologues are present in the unfinished genomes of Treponema denticola, Geobacter sulfurreducens, Acidithiobacillus ferrooxidans, and Nitrosomonas europaea, implying that integrons are also present in these genomes (Nield et al., 2001). Given the gene acquisition and expression properties of class 1 and 3 integrons, the discovery that integrons are widespread raises the proposition that they may play a general role in the acquisition of new genes in bacterial genomes. However, the integron platform is essentially a simple structure, entirely dependent on gene cassettes as its substrates. The crucial questions therefore revolve around the nature of the gene cassette pool and how it interacts with integrons. Difficulties in sampling or recognizing gene cassettes from outside an integron context have restricted our capacity to address these questions.

Recognition of gene cassettes independently of integron features requires an objective definition of the 59-be sequence family. Amongst characterized 59-be sites, length ranges from 57 bp to 145 bp and the pairwise sequence difference may exceed 70%. A detailed comparison of all cassette-associated recombination sites available at the time was reported by Stokes et al. (1997) and they noted a number of conserved features. These include an overall imperfect inverted repeat structure, each half of which includes a simple site of the type commonly associated with the tyrosine family of recombinases (Grainge and Jayaram, 1999). 59-be sites include a core site with the consensus GTTRRRY (designated 1R, Fig. 2) and, for recombination events mediated by IntI1, the recombination crossover point is between the G and first T of this site (Stokes et al., 1997). Within the 59-be conserved features there is moderate sequence conservation, including eight near-invariant positions (Fig. 2).
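As an illustration of how the 1R core consensus could be used to flag candidate sites in sequence data, the following Python sketch expands the IUPAC codes in GTTRRRY (R = A or G; Y = C or T) into a regular expression and reports where the predicted crossover, between the G and the first T, would fall. The function names and the test sequence are hypothetical and are not taken from the study. The sketch searches only the strand it is given, and a core-site match alone would not be sufficient to call a 59-be, since the inverted repeat structure described above is also required.

```python
import re

# IUPAC codes needed for the GTTRRRY consensus (R = purine, Y = pyrimidine).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]"}


def consensus_to_pattern(consensus):
    """Expand an IUPAC consensus string into a regular-expression pattern."""
    return "".join(IUPAC[base] for base in consensus.upper())


CORE_1R_PATTERN = consensus_to_pattern("GTTRRRY")


def find_core_sites(seq):
    """Return (start, site, crossover) for every GTTRRRY match in seq.

    start is the 0-based index of the G; crossover is the index of the
    first T, so the predicted crossover lies between seq[start] and
    seq[crossover]. A lookahead is used so overlapping matches are kept.
    """
    seq = seq.upper()
    lookahead = re.compile("(?=(" + CORE_1R_PATTERN + "))")
    return [(m.start(), m.group(1), m.start() + 1)
            for m in lookahead.finditer(seq)]


# Hypothetical test sequence, for illustration only.
print(find_core_sites("ccGTTAGGCaa"))  # [(2, 'GTTAGGC', 3)]
```

The crossover index is returned separately from the match start so that downstream code could split a cassette sequence at the predicted recombination point without recomputing offsets.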

Fig. 1. Structure of In3 and of co-integrates formed with test 59-be sites. The two most common insertion sites into In3 for a test 59-be are indicated. LHS, left-hand side. RHS, right-hand side. Restriction sites are: B, BamHI; H, HindIII; and S, SalI. 3′-CS, 3′-conserved segment. Pc, promoter for cassette-associated genes. The horizontal arrows indicate the binding sites and direction of synthesis for each of the primers used in mapping co-integrate junctions. Cm, chloramphenicol


We have recently demonstrated that use of degenerate primers targeting the conserved regions of 59-be sites in PCR with environmental DNA samples results in recovery of diverse sequences showing characteristics of gene cassettes (Stokes et al., 2001). In total, 123 predicted gene cassettes were recovered in that study. Here we have confirmed these sequences as gene cassettes. Thus the ‘cassette PCR’ technique allows us to address a number of questions regarding the nature of the gene cassette pool for the first time. Here we describe the characterization of a further 41 environmental gene cassettes and the sequence relationships of environmental gene cassettes to gene cassettes from other sources, and demonstrate their ability to be recruited by class 1 integrons.

Results

PCR recovery of gene cassettes

A total of 57 cloned amplicons, derived by PCR with the primers HS286 and HS287, which predominantly target cassettes contained within linear arrays, were analysed from soil microcosm samples. All showed the characteristics expected of cassette PCR products (Stokes et al., 2001). Of these, 38 represented amplification of a single gene cassette, and 19 the amplification of arrays of two, three or four cassettes in tandem. Many gene cassettes were sampled more than once (either singly or as part of an array), thus this dataset resulted in a total of 41 distinct cassettes (EGC111–EGC151, see Experimental procedures). When pooled with our previous study the total number of gene cassettes directly sampled from natural environments via cassette PCR is 164.

Fig. 2. Alignment of the conserved domains of 59-be sites from environmental cassettes. 59-be recombination site sequence is shown in bold, with five bases of flanking sequence at each end also shown. The highly variable central regions are not shown, with numbers indicating the length of the omitted sequence. Sequence depicted in upper case is that predicted for the free, circular form of the cassette. For most sequences shown here, the last six bases of the 59-be (which also represent the first six bases of the integrated, linear form of the cassette) are derived from a PCR primer and consequently shown in lower case. Putative IntI binding domains in the left (1L and 2L) and right (1R and 2R) halves of the elements are indicated by shading and arrows (top). The filled triangles identify locations at which extra bases are not shown (see also Fig. 4). The alignment is separated into three groups to accommodate length variation. The ‘position’ lines allow alignment between groups. Upper and lower case letters in the ‘position’ lines indicate bases that are generally complementary when left and right halves of the element are compared. The asterisk indicates the extra base in 2L compared to 2R. The ‘+’ symbol indicates positions that are found in all 59-be sequences but disrupt the repeat structure (see Experimental procedures). Positions that are not common to all 59-be are left blank. Numbers on the left distinguish each of the 11 identified subfamilies.

Gene cassette content: protein-coding orfs

The vast majority of experimentally characterized gene cassettes contain promoterless protein-coding orfs. Our Environmental Gene Cassette (EGC) dataset and gene cassettes recovered as part of large-scale sequencing projects contain a disproportionately high fraction of novel sequences. This complicates the task of predicting coding sequences. Where alternative reading frames are found by software programs the cassette boundaries can assist in predicting the ‘true’ orf. Regardless of the difficulties in prediction of the coding content of the gene cassettes it is evident that the spectrum of proteins harboured in the total gene cassette pool is extraordinarily diverse. Of the 142 hypothetical proteins currently in the EGC data set only 24 (17%) show sequence relationship to any previously described protein and, of these, 17 are similar only to ‘hypothetical proteins’ (Table 1). This same pattern is also true of other bacterial gene cassette pools outside of the antibiotic resistance context, including the Xanthomonas campestris pv. campestris, Pseudomonas alcaligenes and Vibrio cholerae chromosomes (Heidelberg et al., 2000; Vaisvila et al., 2001; da Silva et al., 2002).

No obvious hints of a general role for cassette-associated proteins could be predicted. Predicted protein sizes in the EGC pool range from 36 to 346 amino acids. The distribution of this range is skewed, with 60% of all hypothetical proteins in the size range of 70–140 amino acids. Only 10% were greater than 200 amino acids. However, in general, the size range of predicted proteins matches that of biologically active peptides that are ribosomally synthesized. Similarly, calculations of hydrophobicity values show an essentially normal distribution, indicating the pool of proteins is unlikely to show any marked bias towards membrane or cytoplasmic location.

The lack of pattern in size and physicochemical properties of hypothetical proteins is matched by the diversity of biological activities in those cassettes that have been characterized or which show homology to characterized proteins. Most genes within cassettes from class 1 integrons encode antibiotic resistance. However, the mode of resistance varies tremendously (Recchia and Hall, 1995). Gene cassettes identified from genome sequencing or environmental contexts encode diverse properties including lipases, restriction endonucleases, transport proteins, toxins and surface antigens (Clark et al., 2000; Vaisvila et al., 2001; Rowe-Magnus et al., 2001; Stokes et al., 2001).

Gene cassette content: non-protein-coding sequences

Some gene cassettes appear to have a biological role other than protein-coding. EGC104 is the first cassette of an array in the clone Bal48 (AF349111). This cassette contains 178 bp that are not part of its 59-be. Several observations indicate that this sequence is non-protein-coding. First, although in both the forward and reverse orientations there is one reading frame comprising an uninterrupted stretch of sense codons, neither of these frames has plausible start or stop codons within the cassette. Second, the EGC104 cassette content shows significant sequence relationship to three other cassettes (EGC048, EGC050, EGC051). Collectively the contents of these four cassettes comprise a sequence family sharing 61–94% identity (data not shown) yet none contain obvious protein-coding orfs. Third, despite the DNA sequence conservation across the family, neither of the possible orfs in EGC104 is conserved in other family members. Finally, sequence conservation is particularly strong through the central 70 bp of the cassette content and this region forms an imperfect inverted repeat. The pattern of family sequence conservation indicates that sequence structure, rather than coding potential, is more biologically relevant. We conclude that members of this sequence family have a role other than encoding a protein.

Table 1. Cassette gene products with database matches.

Gene product    Top database hit                        %Identity/%similarity   Predicted function
orf297_EGC010   Bacillus subtilis (CAB15191)            28/48                   Hypothetical protein
orf101_EGC017   Caulobacter crescentus (AAK23946)       61/74                   Hypothetical protein
orf117_EGC020   Bacteriophage 933W (AAD25429)           41/56                   Hypothetical protein
orf346_EGC034   Pseudomonas syringae (ZP_00124941)      27/49                   Aminoglycoside phosphotransferase
orf271_EGC035   Wolinella succinogenes (CAC50085)       31/52                   Sulphur transferase
orf81_EGC044    Nitrosomonas europaea (ZP_00003253)     91/96                   Possible toxin antidote protein
orf113_EGC044   Clostridium thermocellum (ZP_00061925)  45/60                   PemK family
orf132_EGC064   Pseudomonas aeruginosa (AAG07752)       49/61                   Hypothetical protein
orf208_EGC067   Thermus thermophilus (BAB17605)         30/53                   RNA methyl transferase
orf133_EGC068   Mycobacterium tuberculosis (AAK44262)   32/50                   Hypothetical protein
orf90_EGC101    Nostoc punctiforme (ZP_00108370)        71/78                   Hypothetical protein
orf105_EGC103   Mycobacterium tuberculosis (AAK46615)   33/52                   Hypothetical protein
orf147_EGC162   Nitrosomonas europaea (ZP_00003997)     68/80                   Pyrimidine dimer DNA glycosylase
orf168_EGC027   Pasteurella multocida (AAK03695)        27/52                   Hypothetical protein
orf209_EGC029   Xanthomonas campestris (AAM39391)       27/46                   Hypothetical protein
orf154_EGC030   Shewanella oneidensis (AAN56682)        50/61                   Hypothetical protein
orf135_EGC079   Agrobacterium tumefaciens (AAK86432)    34/60                   Hypothetical protein
orf139_EGC084   Oceanobacillus iheyensis (BAC15208)     29/54                   Hypothetical protein
orf159_EGC115   Bacillus anthracis (NP_656837)          36/58                   Hypothetical protein
orf174_EGC125   Bacillus anthracis (NP_656524)          36/54                   Hypothetical protein
orf125_EGC139   Caulobacter crescentus (AAK24858)       42/63                   Bleomycin resistance
orf161_EGC159   Brucella suis (AAN30397)                27/57                   Hypothetical protein
Orf110_EGC088   Xylella fastidiosa (AAF85305)           17/41                   Hypothetical protein
Orf117_EGC148   Synechocystis sp. PCC6803 (BAA18636)    34/62                   Hypothetical protein

Other notable hits: Orf360_EGC124 is in the reverse orientation. This orf has strong similarity (42% identity, 65% similarity) to cheA (AAK78103). It is likely that this cassette is a non-specific product. Orf196_EGC129 shows strong identity over the first 36 residues only to a diverse range of peptides including HI1126 (AAC22780). It is possible this region is a common leader sequence.

A similar situation is seen for the cassette content of EGC091. The complete cassette was recovered as part of an array in clone Bal33 (AF349108). It shows significant DNA identity to other sequences from various environments (two soils and a hot spring). These comprise at least nine distinct EGC types, including EGC049, EGC052, EGC053, EGC054, EGC055, EGC056, EGC057, EGC058 and EGC091. For brevity, only three of these, representing the range of sequence divergence, are shown in Fig. 3. Members of the family are 308–330 bp in length and in all cases stop codons are prevalent, precluding the presence of protein-coding orfs across the sequence family. Pairwise sequence identity ranges from 62 to 91%. Noteworthy features of the EGC091 family are that poly A/poly T tracts are prominent (16 occurrences) and that the predicted RNA (if transcribed) contains a number of sequence domains likely to have a stable secondary structure.
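The reading-frame argument used for these non-protein-coding families can be sketched in a few lines of Python. The example below counts stop codons in all six reading frames of a cassette sequence, so that frames interrupted by stops can be distinguished from open ones. It is a minimal illustration under stated assumptions, not the procedure used in the study: the function names and the toy sequence are hypothetical, only the standard stop codons TAA, TAG and TGA are considered, and the presence of a plausible start codon is not checked.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def stop_counts_by_frame(seq):
    """Count stop codons in each of the six reading frames.

    Keys are (strand, frame) pairs, with strand "+" or "-" and frame
    0, 1 or 2; trailing bases that do not fill a codon are ignored.
    """
    counts = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            codons = [s[i:i + 3] for i in range(frame, len(s) - 2, 3)]
            counts[(strand, frame)] = sum(c in STOP_CODONS for c in codons)
    return counts


def open_frames(seq):
    """Return the frames that contain no stop codon at all."""
    return [key for key, n in stop_counts_by_frame(seq).items() if n == 0]


# Hypothetical toy sequence, for illustration only.
counts = stop_counts_by_frame("ATGAAATAGCCC")
print(counts[("+", 0)], counts[("+", 1)], counts[("+", 2)])  # 1 1 0
```

For a family such as EGC091, where stop codons are reported to be prevalent, one would expect `open_frames` to return an empty list for every member; for EGC104, one open frame per orientation would remain, and the further arguments given in the text would then be needed.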

Fig. 3. Alignment of three representative members of the EGC091 sequence family. Positions that are universal in all nine members of the sequence family are shown in the consensus as upper case. Those that are strongly conserved (> 75% identity) across all members of the sequence family are shown in lower case. Where no data is given in the consensus sequence there is not significant sequence conservation across all members of the sequence family. Members of the sequence family are EGC049, EGC052, EGC053, EGC054, EGC055, EGC056, EGC057, EGC058 and EGC091.
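The consensus convention described in the Fig. 3 legend can be expressed as a short Python sketch. Given pre-aligned sequences of equal length, it writes universal positions in upper case, positions above a 75% threshold in lower case, and a dot where no consensus is given. A helper for pairwise percent identity is included for comparison with the ranges quoted in the text. The function names, the dot placeholder, the gap character and the choice to compute identity over ungapped columns only are assumptions made for illustration; the aligned toy sequences are hypothetical.

```python
from collections import Counter


def consensus(aligned, threshold=0.75, gap="-"):
    """Build a consensus line from equal-length aligned sequences.

    Upper case: base present in every sequence. Lower case: most common
    base present in more than `threshold` of sequences. Dot: otherwise.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    out = []
    for column in zip(*(s.upper() for s in aligned)):
        base, count = Counter(column).most_common(1)[0]
        fraction = count / len(column)
        if base == gap:
            out.append(".")
        elif fraction == 1.0:
            out.append(base)
        elif fraction > threshold:
            out.append(base.lower())
        else:
            out.append(".")
    return "".join(out)


def percent_identity(a, b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x != gap and y != gap]
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


# Hypothetical aligned sequences, for illustration only.
toy = ["ACGTA", "ACGTC", "ACGAG", "ACGTT", "ACTTA"]
print(consensus(toy))                   # ACgt.
print(percent_identity(toy[0], toy[1]))  # 80.0
```

Percent identity depends on how gapped columns are counted, so values computed this way would only be comparable with the published ranges if the same convention were used.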


Sequence relationships of ‘environmental’ 59-be sites

In total, 37 inverted repeat structures were identified within the recovered environmental clones that were consistent with the generalized structure for 59-be sites as outlined in Experimental procedures (Fig. 2). There was nonetheless considerable sequence and structural variation between elements.
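To make the notion of an imperfect inverted repeat concrete, the Python sketch below splits a candidate site into a left arm, a central region and a right arm, and scores how well the left arm pairs with the reverse complement of the right arm. This is only a simplified illustration: the function names, the fixed arm length and the toy sequence are hypothetical, and the sketch does not model the asymmetric insertions or the generalized 59-be structure defined in Experimental procedures.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def split_halves(site, arm):
    """Split a site into (left arm, central region, right arm)."""
    return site[:arm], site[arm:len(site) - arm], site[-arm:]


def inverted_repeat_score(left, right):
    """Fraction of positions where left pairs with the reverse
    complement of right (1.0 means a perfect inverted repeat)."""
    rc = reverse_complement(right)
    n = min(len(left), len(rc))
    matches = sum(a == b for a, b in zip(left.upper(), rc))
    return matches / n


# Hypothetical toy site: 7-base arms around a 4-base central region.
left, centre, right = split_halves("GCCTAACAAAAGTTAGGC", 7)
print(left, centre, right)                 # GCCTAAC AAAA GTTAGGC
print(inverted_repeat_score(left, right))  # 1.0
```

A score below 1.0 corresponds to an imperfect repeat; a threshold for accepting a candidate would have to be chosen empirically and is not specified here.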

The most notable variable feature of 59-be sites is the length and sequence of the region separating the two halves of the repeat. This central region ranges in length from three to 79 bases occurring between positions ‘s’ and ‘S’ in the generalized structure (Fig. 2). None, part, or all of this region may be part of the inverted repeat structure. Another noteworthy variation is the presence of ‘asymmetric’ sequence insertions in some 59-be sites (triangles in Fig. 2). These appear to only occur at specific loci that are internal to the conserved ‘paired’ regions, making the inverted repeat structure asymmetric. It is this sequence variation that makes evaluation of relationships between members of the 59-be family difficult. Estimation of evolutionary distances from sequence analyses requires the comparison of orthologous positions. The conservation of structure indicates 59-be sites constitute an orthologous sequence family. However, it is unlikely that positions outside the core 51 positions are orthologous across the sequence family. Furthermore, within the 51 core positions the structural constraints are such that this short sequence contains no effective ‘phylogenetic signal’.

A consequence of this heterogeneity is that inferring evolutionary relationships across the whole 59-be family is not possible. Nevertheless, evolutionarily distinct subfamilies can be recognized on the basis of heterologous sequence features. That is, any group of sequences containing a sequence insertion that is heterologous with respect to all other members of the family represents an evolutionarily distinct group (such groups are not necessarily monophyletic). The ‘PAR signature’ described for some Ps. alcaligenes gene cassettes is such an example (Vaisvila et al., 2001). On the basis of ‘heterologous insertions’ our EGC dataset includes at least 11 distinct subfamilies of 59-be sites (Fig. 2). In each of these families the inserted sequence (with respect to the core structure) is either at a different locus to all other examples or is differentiated by length and structure. Only three of these subfamilies are presently found in class 1 integrons.

Examples of 59-be sequence variation, representing the range observed in the EGC dataset, are shown in Fig. 2. Of note are the 59-be sites associated with EGC099 (shorter than previously assayed elements), EGC068 (greatest divergence from the canonical structure), EGC102 (large insertion between halves), and BGC001 (containing an insertion introducing asymmetry). These 59-be sites are ‘extremes’ of the diversity observed and to date no members of these subfamilies have been found in gene cassettes carrying antibiotic resistance genes.

59-be sites from environmental DNA samples are active recombination sites

59-be sites from clinical isolates are active recombination sites recognizable by integron integrases. Of the considerable number of 59-be sites tested for activity with IntI1, all have been found, with varying levels of efficiency, to be functional (Martinez and de la Cruz, 1990; Hall et al., 1991; Collis et al., 1993; 2001; Stokes et al., 1997). In addition, activity with the class 3 integrase, IntI3, has also been demonstrated (Collis et al., 2002). To determine whether the 59-be sites identified in environmental clones are active recombination sites, six of them (Fig. 4), EGC099, EGC082, EGC102, EGC140, EGC068 and BGC001, were tested in conduction assays to determine if they could be recognized by the class 1 integrase, IntI1. Tested elements were selected to represent the diversity of elements recovered from the environments examined. The elements associated with EGC082 and EGC140 are similar to the well-studied and highly active aadB 59-be element in terms of both length and sequence. In contrast, the EGC099 59-be, at 56 bases, is the shortest element of the 59-be family seen to date, while EGC102 is an example of an element that groups with members that are of greater length. EGC068 is similar in sequence to other elements recovered from Balmain but is noteworthy in that it has a nine-base right-hand simple site spacer (Fig. 2). This is the only element known to have a spacer of this length and contrasts with the seven or eight bases for all other elements (Stokes et al., 1997). The sixth element, BGC001 (Fig. 4), was from a cassette within an array in a strain of Pseudomonas stutzeri from a soil enrichment culture and is noteworthy in that it possesses the PAR signature (Vaisvila et al., 2001 and Fig. 4) associated with elements from Pseudomonas species.

All tested elements were functional (Table 2). Activity levels varied, however. The three shortest elements, EGC099, EGC082 and EGC140, were included in a group of five that were the most active, with activity comparable to that of the highly active aadB 59-be. A fourth, longer element of 102 bases, EGC102, also fell within this highly active group, as did the element from Ps. stutzeri. EGC068, the element with a nine-base right-hand spacer, was the least active, at a level about 10- to 100-fold below the others. It nonetheless had a level of activity 50-fold above a no-element control.
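The approximate fold differences quoted above can be recomputed from the average conduction frequencies reported in Table 2. The short Python sketch below does this; the dictionary keys and function names are chosen here for illustration, and the only data used are the Table 2 averages.

```python
# Average conduction frequencies as reported in Table 2.
AVERAGE_FREQUENCY = {
    "aadB/qacE": 1.1e-2,
    "EGC099": 1.9e-2,
    "EGC082": 8.9e-3,
    "EGC102": 6.1e-3,
    "EGC140": 6.1e-3,
    "BGC001": 1.3e-3,
    "EGC068": 1.5e-4,
    "none": 2.9e-6,  # pACYC184, no-element control
}


def fold_difference(numerator, denominator):
    """Ratio of two average conduction frequencies from Table 2."""
    return AVERAGE_FREQUENCY[numerator] / AVERAGE_FREQUENCY[denominator]


def fold_above(reference):
    """Fold by which each other test element exceeds `reference`
    (the no-element control is excluded)."""
    return {name: fold_difference(name, reference)
            for name in AVERAGE_FREQUENCY
            if name not in (reference, "none")}


print(round(fold_difference("EGC068", "none")))  # 52
ratios = fold_above("EGC068")
print(round(min(ratios.values()), 1), round(max(ratios.values()), 1))  # 8.7 126.7
```

On these averages EGC068 is roughly 52-fold above the no-element control and between about 9- and 127-fold below the other elements, in line with the rounded figures given in the text.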


Analysis of recombination events

To investigate the recombination events involving each of the environmental test elements, the sensitivity of co-integrates to trimethoprim (Tp) was determined (Table 3). Tp sensitivity implies insertion at attI1, since such an insertion separates the dfrB2 gene from the Pc promoter on which it depends for expression (Fig. 1). The percentage of Tp-sensitive co-integrates was between 81 and 94, indicating a strong preference for insertion at attI1. These values are consistent with those seen previously for 59-be sites from antibiotic resistance cassettes when cloned into pACYC184 in orientation 2 (Collis et al., 2001).

The insertion site of several co-integrates was further analysed by PCR mapping (Experimental procedures). In total, 69 co-integrates were mapped and in all cases the length of the PCR product was consistent with insertion at either attI1 or orfA (Table 3). This mapping was also consistent with the Tp phenotype in that co-integrates mapping to attI1 were Tp sensitive and those mapping to orfA were Tp resistant. No insertion events were found at dfrB2, a result also seen previously for 59-be sites from antibiotic resistance cassettes where non-attI1 insertion events favour orfA (Collis et al., 2001).

To confirm the PCR mapping data and further investigate the nature of the recombination events involving the environmental 59-be sites, the junctions of several of these 69 co-integrates were sequenced. In total, 10 independent co-integrates were sequenced at both the left and right junctions (Fig. 1) and a further 12 independent co-integrates were sequenced at one junction. In all cases (Table 3), the recombination crossover point could be localized to a region of between four (BGC001 versus attI1) and seven (EGC099 or EGC068 versus attI1) bases that included the invariant GTT of the 1R core site (Fig. 2). Consequently it is likely that the IntI1-mediated recombination events involving these environmental 59-be sites are the same as those previously described for 59-be sites from antibiotic resistance cassettes, where the recombination crossover has been shown to occur between the G and first T of the 1R core site (Stokes et al., 1997).

Fig. 4. 59-be sites tested for recombination activity shown as foldbacks to highlight their inverted repeat structure. Colons indicate complementary bases. Sequences shown are as they appear in the linear array from which they are derived and as tested in conduction assays. Both EGC068 and BGC001 contain an insert in the left side of the element compared to the right side. The positions of these extra bases are indicated by the vertical arrows. In EGC068 the insertion is TAG and in BGC001 it is TCGCTCGCCTCGCTCACT.

Table 2. Conduction frequency of 59-base elements from environmental samples.

Plasmid | Test element | Fragment lengthᵃ | Element length (bases) | Range | Average frequencyᵇ
pMAQ28 | aadB/qacE | 202/198 | 60 | 4.5 × 10⁻³ to 1.6 × 10⁻²ᶜ | 1.1 × 10⁻² (5)ᶜ
pMAQ701 | EGC099 | 101/197 | 56 | 3.2 × 10⁻³ to 5.6 × 10⁻² | 1.9 × 10⁻² (12)
pMAQ653 | EGC082 | 486/383 | 60 | 4.1 × 10⁻³ to 1.3 × 10⁻² | 8.9 × 10⁻³ (6)
pMAQ713 | EGC102 | 164/95 | 102 | 1.7 × 10⁻³ to 9.6 × 10⁻³ | 6.1 × 10⁻³ (7)
pMAQ714 | EGC140 | 84/110 | 60 | 7.2 × 10⁻⁴ to 1.4 × 10⁻² | 6.1 × 10⁻³ (5)
pMAQ707 | BGC001 | 124/114 | 77 | 2.1 × 10⁻⁴ to 2.5 × 10⁻³ᶜ | 1.3 × 10⁻³ (6)ᶜ
pMAQ710 | EGC068 | 142/87 | 73 | 4.1 × 10⁻⁵ to 2.6 × 10⁻⁴ | 1.5 × 10⁻⁴ (7)
pACYC184 | none | N/A | N/A | 8.8 × 10⁻⁷ to 4.5 × 10⁻⁶ | 2.9 × 10⁻⁶ (4)

a. Numbers refer to nucleotides in the cloned fragment to the left and right of the recombination crossover point.
b. Values for test elements are derived from at least three independent donor constructs with the number of assays shown in brackets.
c. Values for pMAQ28 and pMAQ707 are from Collis et al. (2001) and Holmes et al. (2003) respectively.

Discussion

The properties of integrons and gene cassettes indicate that these elements have the potential to play a broader role in bacterial evolution. Given that the integron is a relatively simple structure, the significance of the integron/gene cassette system is inextricably linked to the nature of the mobile gene cassette pool. Of particular importance here are the diversity of cassette-associated genes, the distribution of gene cassettes, and the ability of different integrons to exploit gene cassettes. We have previously reported that primers can be used in ‘cassette PCR’ to recover intact genes from environmental DNA and that this technique taps a very large genetic resource (Stokes et al., 2001). In this paper we confirm that cassette PCR samples an environmental gene cassette ‘metagenome’ which is accessible to class 1 integrons.

Even on the basis of the present, limited dataset it is evident that the EGC metagenome sampled by cassette PCR shows remarkable diversity in both 59-be sites and cassette content. The majority of EGC include protein-coding orfs. It appears that the nature of the cassette-encoded proteins is different from that of the products of typical protein-coding genes found in bacteria. The majority of cassette-encoded proteins represent novel families and no genes encoding enzymes of central metabolic pathways were found. In this respect the EGC metagenome is markedly different from any bacterial genome characterized to date. However, the lack of genes of central metabolism could simply reflect that, in terms of overall bacterial genetic diversity, such genes are a minor component. Indeed, genome sequencing projects have indicated that two strains of the same species may diverge considerably and that this primarily reflects genes outside of central metabolism. Given the very large size of bacterial populations it is not unreasonable to expect a high proportion of novelty in the ‘species genome’ (Lan and Reeves, 2000). Consideration of the diversity of orf sizes, inferred physicochemical properties and predicted functions suggests that any protein may be encoded within a gene cassette. Evaluation of this possibility will require large scale sequencing of the EGC metagenome.

It is clear that non-protein coding DNA, including features such as binding sites for regulatory proteins and small RNAs, is an important part of bacterial genomes. If the EGC pool is a fundamental resource for bacteria, whereby mobilised genes facilitate genome evolution, we might expect it to contain a significant proportion of such features. Gene cassettes that do not contain obvious protein-coding orfs occur in a V. cholerae chromosomal integron (Heidelberg et al., 2000) and are present in this EGC dataset. One noteworthy observation here is that the cassette content of the EGC105 and EGC091 families is characteristically not protein-coding. Two factors suggest that these sequences encode some biological activity. First, non-identical members of the families were repeatedly isolated from several separate environmental DNA samples (four for the EGC091 family). Second, the strong conservation of both sequence families implies selective constraints on these sequences, perhaps suggesting that they represent a family of transcribed RNAs. These observations raise the possibility that essentially any DNA-encoded function may be contained within a gene cassette.

Present data indicate that gene cassettes, when classified by their 59-be sequence, show at least some partitioning across bacterial species. The first evidence for distinctive patterns of relationship among gene cassettes emerged from studies of integron arrays in Vibrio species. In these examples the 59-be sites were found to be characteristically long (∼130 bp) and showed unexpectedly close sequence relationship (Mazel et al., 1998; Clark et al., 2000). The demonstrations that the integrons hosting these arrays are fixed in the chromosomes of most Vibrio species, and that there is at least some correlation between 59-be sequence relationships and the species of origin, led Rowe-Magnus et al. (2001) to conclude that certain groups of 59-be sites are characteristic of particular bacterial species. Subsequent data from other bacterial genera have supported the view that cassettes found within the same chromosomal integron show significant similarity of their associated 59-be sites. In this ‘integron/gene cassette relationship’ model, atypical 59-be sequences found in chromosomal integrons are inferred to reflect acquisition of cassettes via HGT.

Table 3. Characteristics of co-integrates formed with environmental 59-be sites and In3.

Test element | Percentage TpSᵃ | PCR mappingᵇ | Junction sequencingᶜ: attI | Junction sequencingᶜ: orfA
EGC082 | 94 | 9/11 attI; 2/11 orfA | GTTAG (1/1) | GTTAGA (0/1)
BGC001 | 88 | 6/9 attI; 3/9 orfA | GTTA (3/3) | NE
EGC099 | 81 | 5/9 attI; 4/9 orfA | GTTAGGC (2/0) | CGTTAG (0/1)
EGC068 | 88 | 8/16 attI; 8/16 orfA | GTTAGGC (3/0) | CGTTAG (0/1)
EGC102 | 90 | 11/12 attI; 1/12 orfA | GTTAG (0/2) | NE
EGC140 | 94 | 11/12 attI; 1/12 orfA | GTTAG (1/2) | CGTTAG (0/1)

a. At least 100 co-integrates were tested from a minimum of four independent crosses.
b. Mapped co-integrates are derived from at least three independent crosses. Co-integrates were otherwise selected randomly except for BGC001 and EGC068, for which co-integrates were selected on the basis of their Tp phenotype.
c. Co-integrates sequenced were selected from those identified as being either attI1 or orfA insertions by PCR mapping. Where more than one co-integrate was sequenced for a particular element at the same insertion point (attI1 or orfA), replicates are derived from independent crosses. Sequence shown indicates the region at and around the core site to which the recombination crossover point can be defined (i.e. the two recombining molecules are identical in the core site region indicated). Numbers in brackets before and after the slash (/) refer to the number of co-integrates sequenced at both junctions and one junction respectively. NE = not examined.

One of the most significant features of the gene cassettes recovered directly from soils is that the diversity of 59-be sites observed is greater than that observed in any one integron array. The 37 cassette-associated recombination sites recovered here include at least 11 distinct subfamilies (Fig. 2). This is greater than in all three completely sequenced chromosomal integrons. There are four subfamilies from 22 cassettes in X. campestris (one of which has 19 members, the others being unique), three subfamilies from 33 cassettes in Ps. alcaligenes (Vaisvila et al., 2001) and one from 179 in V. cholerae (Heidelberg et al., 2000). It is also much greater than the cumulative total for any single integron class with the exception of class 1. Together these observations demonstrate that the EGC metagenome sampled by cassette PCR is likely to be partitioned across multiple bacterial species and/or multiple integrons. In support of this we have directly recovered diverse integrons from soil by PCR (Nield et al., 2001) and have recently isolated several different species that contain integrons (unpublished). A 59-be site from one such isolate (Ps. stutzeri strain Q) was included in this analysis.

We tested six EGC 59-be sites for recombinational activity with the class 1 integron integrase and recombination sites found in class 1 integrons. Although a relatively limited number, these elements were deliberately selected to represent as diverse a range of element types as possible. These included elements of a previously undescribed total length (56 bases) and previously undescribed right hand spacer length (nine bases). Despite this, all elements were found to be active. Mapping of co-integrate junctions showed that in all cases recombination was site-specific, preserving the orientation of the gene cassette and therefore its compatibility with the integron-associated promoter Pc. These data indicate that class 1 integrons are inherently capable of acquiring the tested elements and orienting them in such a way that any associated gene could be expressed. It also follows that the bias towards the capture of cassettes containing antibiotic resistance genes is certainly a result of natural selection and that such cassettes are being acquired, by class 1 integrons, from a much larger pool of cassettes.

Our data set represents the closest currently available to a random sampling of natural gene cassette diversity. When viewed together with available data on gene cassettes from clinical environments, or large-scale organism sequencing projects, a number of points become clear. Specific environmental pressures may show correlation with specific cassette-associated genes, as witnessed by the abundance of antibiotic resistance gene cassettes in clinical or animal production environments. Specific organisms (or integrons) may show correlation with specific subfamilies of recombination sites, as shown by various Vibrio, Pseudomonas and Xanthomonas species. Natural communities contain very high diversity of both recombination sites and cassette-associated genes. The abundance of gene cassettes, and their capacity to encode diverse types of DNA-related function, indicate that the EGC metagenome represents a fundamental resource for bacteria. Integrons provide a means for bacteria to perform ‘combinatorial genetics’ upon this pool. The association of integrons with other genetic elements such as transposons, plasmids and chromosomes provides both gene cassettes and integrons routes by which they may travel both within cells and between cells.

A number of genetic elements and recombination processes are now known to contribute to the mobilization, transfer and eventual capture of DNA by the receiving cells. To what extent the integron/gene cassette system contributes to the different stages within the total gene flux in proportion to other systems is not yet clear. In part, this will depend on the proportion of cells that possess this gene capture system. The abundance of gene cassettes, however, would appear to indicate that the impact of this system on bacterial genome evolution will be substantial.

Experimental procedures

Bacterial strains, plasmids and primers

UB5201 is F– pro met recA56 gyrA; UB1637 is F– his lys trp recA56 rpsL (de la Cruz and Grinsted, 1982). Plasmids and primers used are shown in Table 4 and Table 5 respectively.

DNA manipulations

Recovery of gene cassettes from natural environments by PCR, their cloning and sequencing, has been described (Stokes et al., 2001). The cassette PCR technique may recover partial gene cassettes or gene cassette arrays that include recombination sites. In this study we have considerably expanded the number of 59-be recombination sites recovered from environmental samples through sampling microcosms established from the previously described Balmain, Homebush and Lidsdale soil samples. This expanded dataset enables us to address the recombination activity of environmental gene cassettes for the first time. Details of the microcosm conditions and enrichment are not pertinent to the present data and will be reported elsewhere or can be obtained upon request. Polymerase chain reaction conditions for co-integrate mapping were: [(94°C × 3 min)] × 1, [(94°C × 30 s)(65°C × 30 s)(72°C × 90 s)] × 35, [(72°C × 5 min)] × 1. DNA sequencing was performed at the Macquarie Sequencing Facility (Macquarie University, Australia) using an ABI Prism 377 (PE Biosystems).

Conduction assays

The conduction assay was performed as described previously (Collis et al., 2002). Briefly, a donor cell contains three plasmids. One of these is a conjugative plasmid, pMAQ495, that contains the integron In3 (Fig. 1) but with an insertionally inactivated intI1 gene (Table 4). The second plasmid is a derivative of the cloning vector pACYC184 and includes a test recombination site. The third plasmid, pSU2056, supplies highly expressed IntI1 protein in trans. Recombination efficiencies are determined by the frequency with which the test recombination site recombines with one of the three partner recombination sites in In3 (see below). This efficiency is measured as the ratio of the number of co-integrates conducted to a recipient cell, as measured by transfer of chloramphenicol resistance, divided by the total number of pMAQ495 transconjugants as measured by transfer of trimethoprim resistance.
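The ratio that defines conduction frequency can be written out as a short calculation. The sketch below is illustrative only: the function names, the plate counts, the dilution factors and the plated volume are all hypothetical values chosen for the example and are not taken from this study.

```python
def cfu_per_ml(colonies, dilution_factor, plated_volume_ml):
    """Convert a plate count into colony-forming units per ml of mating mix.

    All three arguments are hypothetical inputs for this sketch.
    """
    if plated_volume_ml <= 0:
        raise ValueError("plated volume must be positive")
    return colonies * dilution_factor / plated_volume_ml


def conduction_frequency(cm_resistant_cfu, tp_resistant_cfu):
    """Ratio of co-integrate transfer (chloramphenicol-resistant
    transconjugants) to total pMAQ495 transfer (trimethoprim-resistant
    transconjugants)."""
    if tp_resistant_cfu <= 0:
        raise ValueError("no trimethoprim-resistant transconjugants counted")
    return cm_resistant_cfu / tp_resistant_cfu


if __name__ == "__main__":
    # Hypothetical plate counts, for illustration only.
    cm = cfu_per_ml(colonies=61, dilution_factor=10, plated_volume_ml=0.1)
    tp = cfu_per_ml(colonies=100, dilution_factor=10_000, plated_volume_ml=0.1)
    print(f"conduction frequency = {conduction_frequency(cm, tp):.2e}")
```

Expressing both counts per ml before dividing keeps the ratio independent of the different dilutions that the two selections typically require.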

Analysis of co-integrates

In3 of pMAQ495 (R388) contains three recombination sites. These are attI1, and the 59-be sites of dfrB2 and orfA (Fig. 1). Insertion of a test element at attI1 separates the dfrB2 gene from the Pc promoter leading to a TpS phenotype. Consequently the Tp phenotype was used as an indicator of insertion at attI1. However, to accurately and rapidly map co-integrates, a PCR-based strategy was used (Fig. 1). Two primers, one specific for a sequence within pACYC184 of the test element and a second specific for a sequence within In3, were used to directly amplify co-integrate template DNA. The derived PCR product was of a length dependent on the insertion site. The primer pair commonly used was HS457 and HS459. As an example, PCR product lengths involving pMAQ701 were 578 bp for insertion at orfA compared to 1463 bp for insertion at attI1. Polymerase chain reaction products were also used as sequencing templates to identify the recombination crossover point. For HS457/459 products and insertion at orfA, right hand junctions were sequenced by priming with HS320. For insertion at attI1, the sequencing primer used was HS318. For some co-integrates the left hand junction was also amplified and sequenced and this was achieved with HS458 and HS460 as primers. Sequencing primers for these were HS458 for insertion events at attI1 and HS319 for insertion events at orfA.

Table 4. Plasmids.

Plasmid | Description | Cloned recombination siteᵃ | Position in pACYC184ᵇ | Relevant phenotype | Reference
R388 | 33 kb IncW plasmid containing class 1 integron In3 | N/A | N/A | TpR SuR Tra+ IntI1+ | Avila and de la Cruz (1988)
pMAQ495 | R388 with aphA inserted into intI1 gene | N/A | N/A | TpR SuR KmR Tra+ IntI1– | Collis et al. (1998)
pACYC184 | Cloning vector | N/A | N/A | CmR TcR | Chang and Cohen (1978)
pSU2056 | 1176 bp RsaI-BamH1 fragment of In2 in pUC9 | N/A | N/A | ApR IntI1+ | Martinez and de la Cruz (1990)
pMAQ28 | 400 bp Sau3A-HindIII fragment | aadB/qacE 59-be | BamH1-HindIII (1) | CmR | Hall et al. (1991)
pMAQ653 | 882 bp BamH1 fragment | EGC082 59-be | BamH1 (2) | CmR | This study
pMAQ701 | 300 bp HindIII-BamH1 fragment | EGC099 59-be | BamH1-HindIII (2) | CmR | This study
pMAQ707 | 238 bp HindIII-BamH1 fragment | BGC001 59-be (AY129391) | BamH1-HindIII (2) | CmR | Holmes et al. (2003)
pMAQ710 | 241 bp HindIII-BamH1 fragment | EGC140 59-be (AF421329) | BamH1-HindIII (2) | CmR | This study
pMAQ713 | 271 bp HindIII-BamH1 fragment | EGC068 59-be (AF349098) | BamH1-HindIII (2) | CmR | This study
pMAQ714 | 206 bp HindIII-BamH1 fragment | EGC102 59-be (AF265275) | BamH1-HindIII (2) | CmR | This study

a. Accession numbers for environmental and bacterial gene cassettes from which cloned elements are derived are shown in brackets.
b. Numbers in parentheses represent the orientation of the cloned fragment with respect to pACYC184 as previously defined (Collis et al., 2001).

Table 5. Sequencing and PCR primers.

Primer | Sequence | Positionᵃ/comment | Accession number/reference
HS286ᵇ | 5′GGGATCCTCSGCTKGARCGAMTTGTTAGVC3′ | For cassette PCR. Targets left half of 59-be | Stokes et al. (2001)
HS287ᵇ | 5′GGGATCCGCSGCTKANCTCVRRCGTTAGSC3′ | For cassette PCR. Targets right half of 59-be | Stokes et al. (2001)
HS318 | 5′GCTTCATCGCTACTTTG3′ | 815–831 (C). Within dfrB2 gene cassette | J01773
HS319 | 5′GTATGAAGTCTTTGGCG3′ | 282–298. Within orfA cassette | X12869
HS320 | 5′AGTAAAGCCCTCGCTAG3′ | 606–622 (C). Within 3′-conserved segment | X12869
HS457 | 5′CAAATGTAGCACCTGAAGTCAGCCC3′ | 1452–1476. Adjacent to unique HindIII site of pACYC184 | X06403
HS458 | 5′GTTTGATGTTATGGAGCAGCAACG3′ | 648–671. Within 5′-conserved segment | J01773
HS459 | 5′GCAAAAAGGCAGCAATTATGAGCC3′ | 813–836 (C). Within 3′-conserved segment | X12869
HS460 | 5′GGAAGGAGCTGACTGGGTTGAAGG3′ | 2167–2190 (C). Adjacent to unique SalI site of pACYC184 | X06403

a. Numbers refer to location in the cited database entry. (C) indicates sequence is the complementary strand.
b. The first eight bases include a BamH1 linker that is not complementary to targeted sequences.
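The PCR mapping strategy described under ‘Analysis of co-integrates’ assigns an insertion site from the length of the amplified product. A minimal sketch of that decision is given below, using the two product lengths quoted for pMAQ701 with primers HS457 and HS459. The function name and the 50 bp tolerance are assumptions made for illustration; the tolerance in particular is not a value reported in this study and would depend on gel resolution.

```python
# Expected HS457/HS459 product lengths for pMAQ701 co-integrates,
# as quoted in the text.
EXPECTED_PRODUCT_BP = {"orfA": 578, "attI1": 1463}


def classify_insertion(observed_bp, expected=EXPECTED_PRODUCT_BP, tolerance_bp=50):
    """Return the insertion site whose expected product length is closest
    to the observed length, or None if nothing lies within the tolerance.

    tolerance_bp is a hypothetical allowance for sizing error on a gel.
    """
    site, size = min(expected.items(), key=lambda item: abs(item[1] - observed_bp))
    return site if abs(size - observed_bp) <= tolerance_bp else None


if __name__ == "__main__":
    for band in (580, 1450, 1000):
        print(band, "bp ->", classify_insertion(band))
```

A band that matches neither expected length returns None rather than being forced into a category, which mirrors the need to check such co-integrates by sequencing.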

Nomenclature of gene cassettes and 59-be sites

The recovery of mobile gene cassettes directly from the environment by PCR means that the source organism cannot be identified. Consequently, for such cassettes, we have adopted a nomenclature whereby each cassette is assigned the descriptor ‘EGC’ (environmental gene cassette) followed by a unique numerical code. For PCR products with more than one cassette it is also possible to identify the sequence of a cassette's 59-be as it appears in the linearized form, in all but the last cassette of the recovered array. Consequently, identified 59-be sites are assigned the same descriptor (i.e. EGCxxx) as their cognate cassette. In the linear, integrated form of a gene cassette, the last six bases of the 59-be are derived from the following cassette. Consequently these bases may differ depending on the context of the gene cassette. For consistency all experimental data presented here refer to the ‘linear’ sequence of the element as observed in the cloned fragment unless indicated otherwise. If the sequence of the last six bases of the element in the circular form of the cassette is known, and if they are different to that seen in the linear form of the cassette in a particular array, these differences are noted.

In our studies we are also recovering cassette arrays from bacterial strains that have been cultured directly from the environment. As gene cassettes are mobile and it is not yet clear if any are truly specific to certain bacterial species, we use the descriptor ‘BGC’ (bacterial gene cassette) followed by a unique numerical code (e.g. BGCxxx) to describe such cassettes. For completeness of this study we have included a representative 59-be from a gene cassette in an integron in a Ps. stutzeri strain (Holmes et al., 2003) recovered from the Balmain soil (Stokes et al., 2001).

Sequence accession numbers

Cassettes recovered from Balmain: EGC086 (Accession number AF349106), EGC090 (AF349108), EGC104/EGC105 (AF349111), EGC068 (AF349098), EGC070 (AF349099), EGC072 (AF349100), EGC074 (AF349101), EGC076 (AF349102), EGC078 (AF349103), EGC080 (AF349104), EGC066 (AF349097), EGC064 (AF265272), EGC095/EGC096 (AF349109), EGC092/EGC093 (AF265270); Homebush Bay cassettes: EGC082 (AF265263); Cape Denison cassettes: EGC084 (AF349105), EGC099 (AF349110); Sturt National Park cassettes: EGC101/EGC102 (AF265275). No-orf containing cassettes: EGC049 (AF349081), EGC052 (AF349085), EGC053 (AF349086), EGC054 (AF349087), EGC055 (AF349088), EGC056 (AF349089), EGC057 (AF349090), EGC058 (AF349091), and EGC091 (AF349108). Environmental locations are described in Stokes et al. (2001). Other sequences reported but not specifically discussed can be found in Accession numbers AF421312-AF421335.

Recognition of gene cassettes

Bioinformatics analyses were conducted on BioManager.com provided by ANGIS (http://www.angis.org.au). Gene cassettes in public databases were identified by searches through the NCBI web site (http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/genom_table_cgi) using various sequences obtained in this study as the query sequence.

Sequences were analysed for features associated with gene cassettes, principally an open reading frame and a 59-be. All 59-be sites are imperfect, inverted repeat structures. This common structure is achieved through conservation of a number of positions and features (Fig. 2). Within each half of the element there are 19 positions that covary (i.e. they base pair in the fold-back structure). For ease of description we designate these positions as ‘a’ to ‘s’ on the left and ‘S’ to ‘A’ for the co-varying positions on the right (Fig. 2). The repeat is imperfect as a result of ‘disruption’ (i.e. mismatch or insertion) and these commonly occur at certain points on both sides. In total there are 13 of these and they are indicated by an asterisk or ‘+’ symbol in Fig. 2. Seven of these are located in the left half of the element and six in the right half. We define a DNA sequence as a 59-be site if it contains these 51 positions in the relative positions shown in Fig. 2. Known 59-be sites show considerable variation on this model structure. From comparisons of previously described elements as well as for the sequences obtained here it is apparent that variation is largely restricted to a few specific loci within the 59-be sequence and results in retention of the model structure (see Results).
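The model structure accounts for 51 positions as 2 × 19 co-varying positions plus 13 disruption points. The inverted repeat property itself can be illustrated with a simple fold-back check that pairs each base with its mirror partner. The sketch below is a simplified, ungapped illustration and is not the procedure used in this study: it does not model insertions or the specific positions shown in Fig. 2, and the test sequence is a toy example built around a GTTAGGC motif rather than one of the EGC elements.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Position bookkeeping from the model structure described in the text.
COVARYING_PER_HALF = 19
DISRUPTION_POINTS = 13
MODEL_POSITIONS = 2 * COVARYING_PER_HALF + DISRUPTION_POINTS  # 51


def foldback_pairs(sequence):
    """Pair base i with base n-1-i and report, for each pair, whether the
    two bases are complementary. Ungapped: insertions are not handled."""
    seq = sequence.upper()
    half = len(seq) // 2
    return [COMPLEMENT.get(seq[i]) == seq[-1 - i] for i in range(half)]


def paired_fraction(sequence):
    """Fraction of mirror-image pairs that are complementary."""
    pairs = foldback_pairs(sequence)
    return sum(pairs) / len(pairs) if pairs else 0.0


if __name__ == "__main__":
    toy = "GCCTAAC" + "TTT" + "GTTAGGC"  # toy inverted repeat, not an EGC site
    print(MODEL_POSITIONS)
    print(foldback_pairs(toy))
    print(paired_fraction(toy))
```

In the toy sequence the seven outer pairs are complementary and the innermost pair falls in the unpaired spacer, so the imperfection of the repeat shows up as a paired fraction below 1.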

Acknowledgements

Supported by a Research Innovation Fund Grant from Macquarie University. We thank Roberto Anitori and Malcom Walter for providing the DNA from the Flinders Ranges Hot Springs, Clare McInnes for sampling the Yerranderie mine site and Alexandra Kirsten for collection of material from Cape Denison, Antarctica. We thank Ruth Hall for helpful discussions. Didier Mazel provided sequence information from XCR 59-be sites.

References

Avila, P., and de la Cruz, F. (1988) Physical and genetic map of the IncW plasmid R388. Plasmid 20: 155–157.

Chang, A.C.Y., and Cohen, S.N. (1978) Construction and characterization of amplifiable multicopy DNA cloning vehicles derived from the p15A cryptic miniplasmid. J Bacteriol 134: 1141–1156.

Clark, C.A., Purins, L., Kaewrakon, P., Focareta, T., and Manning, P.A. (2000) The Vibrio cholerae O1 chromosomal integron. Microbiology 146: 2605–2612.

Collis, C.M., and Hall, R.M. (1992) Gene cassettes from the insert region of integrons are excised as covalently closed circles. Mol Microbiol 6: 2875–2885.

Collis, C.M., and Hall, R.M. (1995) Expression of antibiotic resistance genes in the integrated cassettes of integrons. Antimicrob Agents Chemother 39: 155–162.

Collis, C.M., Grammaticopoulos, G., Briton, J., Stokes, H.W., and Hall, R.M. (1993) Site-specific insertion of gene cassettes into integrons. Mol Microbiol 9: 41–52.

Collis, C.M., Kim, M.-J., Stokes, H.W., and Hall, R.M. (1998) Binding of the purified integron DNA integrase IntI1 to integron- and cassette-associated recombination sites. Mol Microbiol 29: 477–490.

Collis, C.M., Recchia, G.D., Kim, M.-J., Stokes, H.W., and Hall, R.M. (2001) Efficiency of recombination reactions catalysed by the Class 1 integron integrase IntI1. J Bacteriol 183: 2535–2542.

Collis, C.M., Kim, M.-J., Partridge, S.R., Stokes, H.W., and Hall, R.M. (2002) Characterization of the class 3 integron and the site-specific recombination system it determines. J Bacteriol 184: 3017–3026.

de la Cruz, F., and Grinsted, J. (1982) Genetic and molecular characterization of Tn21, a multiple resistance transposon from R100.1. J Bacteriol 151: 222–228.

Grainge, I., and Jayaram, M. (1999) The integrase family of recombinase: organization and function of the active site. Mol Microbiol 33: 449–456.

Hall, R.M., Brookes, D.E., and Stokes, H.W. (1991) Site-specific insertion of genes into integrons: role of the 59-base element and determination of the recombination cross-over point. Mol Microbiol 5: 1941–1959.

Hall, R.M., Collis, C.M., Kim, M.-J., Partridge, S.R., Recchia, G.D., and Stokes, H.W. (1999) Mobile gene cassettes and integrons in evolution. Ann New York Acad Sci 870: 68–80.

Heidelberg, J.F., Eisen, J.A., Nelson, W.C., Clayton, R.A., Gwinn, M.L., Dodson, R.J. et al. (2000) DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae. Nature 406: 477–483.

Holmes, A.J., Holley, M.P., Mahon, A., Nield, B.S., Gillings, M.R., and Stokes, H.W. (2003) A distinctive and functional integron/gene cassette system present in soil bacterial communities associated with genomic diversity in Pseudomonas stutzeri. J Bacteriol 185: 918–928.

Lan, R., and Reeves, P.R. (2000) Intraspecies variation in bacterial genomes: the need for a species genome concept. Trends Microbiol 8: 396–401.

Martinez, E., and de la Cruz, F. (1990) Genetic elements involved in Tn21 site-specific integration, a novel mechanism for the dissemination of antibiotic resistance genes. EMBO J 9: 1275–1281.

Mazel, D., Dychinco, B., Webb, B., and Davies, J. (1998) A distinctive class of integron in the Vibrio cholerae genome. Science 280: 605–608.

Nield, B.S., Holmes, A.J., Gillings, M.R., Recchia, G.D., Mabbutt, B.C., Nevalainen, K.M.H., and Stokes, H.W. (2001) Recovery of new integron classes from environmental DNA. FEMS Microbiol Letts 195: 59–65.

Ochman, H., Lawrence, J.G., and Groisman, E.A. (2000) Lateral gene transfer and the nature of bacterial innovation. Nature 405: 299–304.

Recchia, G.D., and Hall, R.M. (1995) Gene cassettes: a new class of mobile element. Microbiology 141: 3015–3027.

Rowe-Magnus, D.A., Guerot, A.M., Ploncard, P., Dychinco, B., Davies, J., and Mazel, D. (2001) The evolutionary history of chromosomal super-integrons provides an ancestry for multiresistant integrons. Proc Natl Acad Sci USA 98: 652–657.

da Silva, A.C.R., Ferro, J.A., Reinach, F.C., Farah, C.S., Furlan, L.R., Quaggio, R.B. et al. (2002) Comparison of the genomes of two Xanthomonas pathogens with differing host specificities. Nature 417: 459–463.

Stokes, H.W., and Hall, R.M. (1989) A novel family of potentially mobile DNA elements encoding site-specific gene integration functions: integrons. Mol Microbiol 3: 1669–1683.

Stokes, H.W., O'Gorman, D.B., Recchia, G.D., Parsekhian, M., and Hall, R.M. (1997) Structure and function of 59-base element recombination sites associated with mobile gene cassettes. Mol Microbiol 26: 731–745.

Stokes, H.W., Holmes, A.J., Nield, B.S., Holley, M.P., Nevalainen, K.M.H., Mabbutt, B.C., and Gillings, M.R. (2001) Gene cassette PCR: sequence-independent recovery of entire genes from environmental DNA. Appl Environ Microbiol 67: 5240–5246.

Vaisvila, R., Morgan, R.D., Posfai, J., and Raleigh, E.A. (2001) Discovery and distribution of super-integrons among Pseudomonads. Mol Microbiol 42: 587–601.