Reorganization of Adjacent Gene Relationships in Yeast Genomes by Whole-Genome Duplication and Gene...

Research Article

Reorganization of adjacent gene relationships in yeast genomes by whole genome duplication and gene deletion

Jake K. Byrnes*, Geoffrey P. Morris*, and Wen-Hsiung Li

* These authors contributed equally.

Department of Ecology and Evolution, University of Chicago

Address for correspondence and reprints: Wen-Hsiung Li, Department of Ecology & Evolution, University of Chicago, 1101 East 57th St., Chicago, Illinois 60637. Phone: 773-702-3104. Fax: 773-702-9740. Email: [email protected]

Running head: Reorganization of adjacent genes in yeast

Keywords: Whole genome duplication, gene deletion, Saccharomyces, adjacent gene orientation, intergenic spacing, gene expression.

Abbreviations: whole genome duplication followed by deletion (WGD-D), percentage of divergently and convergently transcribed adjacent gene pairs (%DC)

MBE Advance Access published March 9, 2006. Downloaded from http://mbe.oxfordjournals.org/ by guest on May 30, 2013.

The Author 2006. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]

Abstract

In Saccharomyces, an ancient whole genome duplication (WGD) and widespread

duplicate gene deletion resulted in extensive reorganization of adjacent gene

relationships. We have studied the evolution of adjacent gene pairs’ identity, orientation,

and spacing following WGD and deletion (WGD-D) using comparative genomic analyses

and simulations. Surveying adjacent gene organization across the Saccharomyces species

complex, we find a genome-wide bias towards divergently and convergently transcribed

gene pairs in all species, but a reduction in this bias in the species that underwent WGD-

D. Among neutral models of WGD-D, only single-gene deletion can produce the

appropriate reduction in orientation bias and recapitulate the pattern of short, highly-

dispersed deletions we observe in S. cerevisiae. To characterize the dynamics of WGD-

D, we trace the conservation and creation of adjacent gene pairs along the S. cerevisiae

lineage. We find that newly created adjacencies have a tandem orientation bias, while

adjacencies conserved from prior to WGD-D have the same divergent-convergent bias as

found in the species that diverged before WGD. We also find that adjacent gene pairs

produced by WGD-D gained greater intergenic spacing, but that this spacing is reduced in the

older adjacencies. Given this, and the preponderance of short deleted blocks, we argue

that the deletion phase of WGD-D occurred primarily by small inactivating mutations

followed by numerous small deletions. Newly created adjacent gene pairs also have an

initial increase in mean log2 expression ratios and maximal expression levels, suggesting

that increased intergenic spacing caused a genome-wide reduction in transcriptional

interference.

Introduction

Based on analyses of syntenic blocks of duplicate genes in the Saccharomyces

cerevisiae genome, Wolfe and Shields (1997) proposed an ancient whole genome

duplication (WGD) in the budding yeast. This hypothesis has recently been confirmed by

comparative genomic analyses of two yeast species that diverged from S. cerevisiae prior

to WGD, Ashbya gossypii (Dietrich et al. 2004) and Kluyveromyces waltii (Kellis, Birren,

and Lander 2004). There is now an excellent opportunity to study genome evolution

following WGD using the many species in the Saccharomyces species complex

(Kurtzman 2003) for which whole genome sequence is available (Kellis et al. 2003;

Cliften et al. 2003; Dujon et al. 2004). Those which diverged following WGD (“post-

WGD”) include the Saccharomyces sensu stricto species S. mikatae, S. kudriavzevii, S.

bayanus, and the more distantly related S. castellii and Candida glabrata. Those that

diverged from the S. cerevisiae lineage prior to WGD (“non-WGD”) include S. kluyveri

and K. lactis, in addition to A. gossypii and K. waltii.

Despite the initial doubling of genome content due to WGD there are now only

modest differences in genome size and gene number between non-WGD and post-WGD

species. For example, the K. waltii genome has 10.7 million base pairs (Mbp) and ~5200

genes, while the S. cerevisiae genome contains 12.5 Mbp and ~5700 genes. Was the

reduction of genome size concurrent with the reduction in gene number, with large

deletions directly responsible for gene loss, or did much of the deletion follow prior

pseudogenization? We may gain insight into the dynamics of this process by studying

the organization of the remaining genes.

In fact, the process of WGD followed by deletion (WGD-D) left a complex

pattern of large, interleaved syntenic blocks in S. cerevisiae. Since few duplicates remain

(~10%; Kellis, Birren, and Lander 2004), these syntenic blocks only become obvious in a

2:1 alignment of S. cerevisiae syntenic blocks to the corresponding block in a non-WGD

species (Kellis, Birren, and Lander 2004; Dietrich et al. 2004). Visual inspection of the

interleaving pattern in these 2:1 alignments seems to suggest a preponderance of short,

highly-dispersed deletions, but the underlying deletion process has never been modeled.

It is not known whether a simple model of short random deletions could produce this

pattern, or if deletions are more or less interleaved than would be expected by chance.

This interleaving is important because it may represent a rare opportunity for

genome reorganization in yeast, given the apparent paucity of inversions and

translocations (Fisher et al. 2000). Following WGD-D, genome location was mostly

retained (albeit across two duplicate chromosomes) while adjacent gene relationships

were largely altered. Therefore, we focused on the dynamics of genomic reorganization

from the perspective of adjacent gene organization. A deletion may affect one or more of

these aspects of adjacent gene organization: identity, orientation, and spacing. First, the

identities of the genes in an adjacent pair change when one or more genes are lost and the

flanking genes form a new adjacent pair. Second, after a deletion, the newly adjacent

pair may have a different transcriptional orientation (tandem, convergent, or divergent;

Cohen et al. 2000) than the old adjacent pairs. Finally, the spacing of adjacent genes may

be reduced or increased by a deletion, depending on its boundaries.

There are a number of reasons to consider genome reorganization in terms of

adjacent genes as opposed to chromosomal location. First, a preliminary examination of

the S. cerevisiae synteny map suggests that most deletions are small, altering the local

gene organization. In this case, adjacency may be the aspect of genome structure most

affected by WGD-D. Second, there is mounting evidence for functional interactions

between adjacent genes in eukaryotes. There are many well-documented instances of

transcriptional interference for adjacent genes (Shearwin, Callen, and Egan 2005) and

divergent transcription from bidirectional promoters, including GAL1-GAL10 in S.

cerevisiae and prnD-prnB in Aspergillus nidulans (Lohr, Venkov, and Zlatanova 1995;

Garcia et al. 2004). Genome-wide analyses have shown differing mean expression

correlations for divergent, convergent, and tandem adjacent gene pairs in yeast (Cohen et

al. 2000) and greater intergenic spacing for genes with higher expression in humans

(Chiaromonte, Miller, and Bouhassira 2003). Adjacent gene orientation has also been

linked to the localization of cohesin domains (Filipski and Mucha 2002) and hotspots for

recombination (Gerton et al. 2000) in yeast. Finally, we study adjacent genes because

several of the sequenced yeast genomes are currently available only as fragments such as

contigs or supercontigs. By analyzing adjacent genes we can investigate genome

reorganization without knowing the entire genome structure.

In this study, we analyze genome sequences of post-WGD and non-WGD species

and implement simulations of the WGD-D process to address the following questions.

What is the typical scale of deletion and how does this inform our model of gene loss?

Does the genome organization of post-WGD species (i.e. interleaving and adjacent gene

relationships) provide evidence for selection shaping WGD-D or for functional

consequences of WGD-D on expression patterns? We find that gene loss occurred by

inactivating mutation, followed by numerous small deletions, and that the resulting

increase in intergenic spacing led to a widespread, but largely neutral, reduction in

transcriptional interference across the yeast genome.

Materials and Methods

Genomic analyses

The Saccharomyces Genome Database (SGD) annotation for S. cerevisiae, and

the Washington University annotations for S. mikatae, S. kudriavzevii, S. bayanus, S.

castellii, and S. kluyveri were downloaded from SGD (http://www.yeastgenome.org).

The K. waltii genome annotation was downloaded from the supplemental website for

Kellis, Birren, and Lander (2004; http://www.broad.mit.edu/seq/YeastDuplication/). To

avoid spurious ORFs, we only use K. waltii ORFs with homology to S. cerevisiae ORFs.

The A. gossypii genome annotation was downloaded from the Ashbya Genome Database

(http://agd.unibas.ch/). The genome annotations for C. glabrata and K. lactis were

downloaded from Génolevures (http://cbi.labri.fr/Genolevures/). Orthology to S.

cerevisiae was provided in the respective genome annotations. The lengths of post-WGD

deletions were collected from the S. cerevisiae-A. gossypii alignment, provided in the

supplemental materials from Dietrich et al. (2004). Orthologous genes with no

connection to a syntenic block were not included in counts of deleted block lengths.

We define the quantity %DC, the percentage of adjacent pairs that are in

divergent or convergent orientation. We use this summary statistic because of structural

dependency between divergent and convergent adjacency. For any contiguous block of

genes, the divergent and convergent counts can differ by at most one, since divergent and

convergent adjacencies are switching points between tracts of genes on opposite strands.

Genome-wide %DC and intergenic length data were collected using PERL scripts and

MySQL queries.
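
As an illustration of the %DC statistic, the following minimal Python sketch classifies adjacent pairs from a list of strand symbols and computes %DC for one contiguous block. It is not the original PERL/MySQL code; the function names and the '+'/'-' strand encoding are assumptions made for this example.

```python
from collections import Counter

def classify_pair(left_strand, right_strand):
    """Classify an adjacent gene pair from the strands ('+' or '-') of the
    left and right gene, taken in chromosomal order."""
    if left_strand == right_strand:
        return "tandem"
    if left_strand == "+" and right_strand == "-":
        return "convergent"   # transcribed toward each other
    return "divergent"        # '-' then '+': transcribed away from each other

def orientation_counts(strands):
    """Count tandem, convergent, and divergent adjacencies in one block."""
    return Counter(classify_pair(a, b) for a, b in zip(strands, strands[1:]))

def percent_dc(strands):
    """Percentage of adjacent pairs that are divergent or convergent."""
    counts = orientation_counts(strands)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least two genes to form an adjacency")
    return 100.0 * (counts["divergent"] + counts["convergent"]) / total

# Hypothetical block of seven genes (not real data).
block = ["+", "+", "-", "-", "+", "-", "-"]
counts = orientation_counts(block)
print(percent_dc(block), dict(counts))
```

In the toy block, divergent and convergent adjacencies alternate as the strand switches, so their counts differ by at most one, which is the structural dependency that motivates pooling them into a single statistic.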

Evolutionary analyses

We traced the origin of each pair of adjacent genes in S. cerevisiae, looking for

conservation in successive outgroups along the S. cerevisiae lineage (including S.

mikatae, S. kudriavzevii, S. bayanus, S. castellii, S. kluyveri, and A. gossypii). Since

homoplasy is unlikely, we assume that any S. cerevisiae adjacency also found in an

outgroup was present in their common ancestor regardless of its absence in intervening

nodes. Tandem gene duplication also creates new adjacencies, skewing the distribution

of orientations in younger adjacency classes. Indeed, we found that 14 out of 52 tandem

duplicates in the full datasets are in the youngest class, so we removed all adjacent

duplicate genes for subsequent analyses. We present results from analysis of all S.

cerevisiae ORFs included in Harbison et al. (2004; 5546 ORFs), though the patterns hold

if we use only adjacencies where both ORFs are classified by SGD as “verified”. We

investigated whether other possible origins for S. cerevisiae-specific adjacencies

(inversion, mis-annotation) could have biased our results, but found no evidence for this.

If we use only the S. cerevisiae-specific adjacencies that have unambiguous evidence of

WGD-D origin (i.e., orthologs in non-WGD species are less than 10 genes apart) the

results are the same.
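
The age assignment described above can be sketched as follows. This is a simplified Python illustration, not the original implementation: it assumes each genome is given as lists of S. cerevisiae ortholog names in genomic order, and the function names and toy gene orders are hypothetical.

```python
def adjacency_set(contigs):
    """Collect unordered adjacent gene pairs from a list of contigs, where
    each contig is a list of S. cerevisiae ortholog names in genomic order."""
    pairs = set()
    for contig in contigs:
        for left, right in zip(contig, contig[1:]):
            pairs.add(frozenset((left, right)))
    return pairs

def age_class(pair, outgroups):
    """Return the most distant outgroup that shares the adjacency.

    outgroups is a list of (name, adjacency_set) tuples ordered from the
    closest relative of S. cerevisiae to the most distant one."""
    oldest = "S. cerevisiae-specific"
    for name, adjacencies in outgroups:
        if pair in adjacencies:
            # Keep scanning: absence in a nearer outgroup does not reset
            # the age, because homoplasy is assumed to be unlikely.
            oldest = name
    return oldest

# Toy gene orders (hypothetical, not real data).
cerevisiae = [["A", "B", "C", "D"]]
outgroups = [
    ("S. mikatae", adjacency_set([["A", "B", "C", "D"]])),
    ("S. castellii", adjacency_set([["A", "B", "D"]])),
    ("A. gossypii", adjacency_set([["A", "B", "X", "C", "D"]])),
]
classes = {
    tuple(sorted(pair)): age_class(pair, outgroups)
    for pair in adjacency_set(cerevisiae)
}
print(classes)
```

In the toy data, the C-D adjacency is missing from the S. castellii gene order but present in A. gossypii, so it is still assigned to the oldest class, mirroring the assumption that an adjacency found in an outgroup was present in the common ancestor.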

Simulations

To examine the potential for reorganization of adjacent gene relationships by

WGD-D we used a simulation coded in PERL. Each chromosome was represented as an

array of genes that recorded presence/absence and orientation. The input genomes had

the chromosome number and gene number of A. gossypii, a non-WGD species. Gene

orientations were either modeled after A. gossypii or constructed randomly within a set

range of %DC. The genomes were then duplicated and the deletion process occurred in

three stages. First, we draw “attempted” deletions, which represent the underlying

mutational process. For each attempted deletion event, the locus was randomly selected

and the length (in number of genes) was drawn from a uniform distribution (maximum

block size of one or two genes) or a Poisson distribution (mean block size of one or two

genes). For the neutral simulation, the duplicate copy was chosen for deletion randomly.

For the selective simulation, we weighted the probability that a duplicate copy was

chosen based on the net gain of tandem adjacencies for a deletion of one copy versus the

other. When defining the deletion boundaries, previously deleted genes were assumed to

have zero length. Next, attempted deletions that remove only redundant gene copies

become “accepted” deletions. This assumes that any deletion that removes a single-copy

gene would be effectively lethal. Finally, overlapping and adjacent deletions were

combined to form “apparent” deletions, equivalent to the blocks of deleted genes that can

be observed in yeast. Therefore, it is the apparent deleted block length distributions from

our simulations that we compare to the observed deleted block length distribution from S.

cerevisiae. When we discuss “deleted blocks” from either simulation or data analysis, we

will always be referring to apparent deleted blocks. In S. cerevisiae, Kellis, Birren and

Lander (2004) found that ~10% of duplicates remained from the WGD, so in our

simulations we allowed deletions to continue until 10% of duplicate genes remained. All

simulation results were generated using 10,000 iterations.
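
To make the three deletion stages concrete, here is a minimal Python sketch of the neutral single-gene case only. It is a simplified stand-in for the PERL simulation, under stated assumptions: one chromosome, attempted deletions of exactly one gene, and hypothetical function names. Multi-gene (uniform or Poisson) deletions and the selective weighting are not modeled.

```python
import random

def random_strands(n, dc_fraction, rng):
    """Build n gene orientations where each adjacency switches strand
    (divergent or convergent) with probability dc_fraction."""
    strands = [rng.choice("+-")]
    for _ in range(n - 1):
        prev = strands[-1]
        if rng.random() < dc_fraction:
            strands.append("-" if prev == "+" else "+")
        else:
            strands.append(prev)
    return strands

def simulate_wgd_d(n_genes, retained_fraction=0.10, rng=None):
    """Duplicate a chromosome of n_genes loci, then delete single genes at
    random until only retained_fraction of loci keep both copies.
    Returns presence flags for the two duplicate copies."""
    rng = rng or random.Random()
    present = [[True] * n_genes, [True] * n_genes]
    duplicated = n_genes
    target = round(retained_fraction * n_genes)
    while duplicated > target:
        locus = rng.randrange(n_genes)  # attempted deletion: random locus
        copy = rng.randrange(2)         # neutral model: random copy
        # Accepted only if the gene is present and its paralog survives;
        # removing a single-copy gene is treated as lethal.
        if present[copy][locus] and present[1 - copy][locus]:
            present[copy][locus] = False
            duplicated -= 1
    return present

def percent_dc(strands, present_copy):
    """%DC among the genes that survive on one duplicate copy."""
    kept = [s for s, p in zip(strands, present_copy) if p]
    if len(kept) < 2:
        raise ValueError("need at least two surviving genes")
    switches = sum(a != b for a, b in zip(kept, kept[1:]))
    return 100.0 * switches / (len(kept) - 1)

def deleted_block_lengths(present_copy):
    """Lengths of apparent deleted blocks: runs of adjacent missing genes."""
    lengths, run = [], 0
    for is_present in present_copy:
        if is_present:
            if run:
                lengths.append(run)
            run = 0
        else:
            run += 1
    if run:
        lengths.append(run)
    return lengths

if __name__ == "__main__":
    rng = random.Random(2006)
    strands = random_strands(2000, 0.56, rng)
    present = simulate_wgd_d(len(strands), 0.10, rng)
    print(percent_dc(strands, present[0]), percent_dc(strands, present[1]))
    print(max(deleted_block_lengths(present[0])))
```

Even though every accepted deletion removes a single gene, neighboring deletions merge into longer apparent blocks, which is why the apparent block length distribution, rather than the attempted one, is the quantity compared with the data.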

Gene expression analyses

S. cerevisiae expression data from Affymetrix GeneChip (101 microarray

experiments) were obtained from NCBI’s Gene Expression Omnibus

(http://www.ncbi.nlm.nih.gov/geo/). Only the hybridization intensities for the perfect

match probes were used for further analyses. These data were background-corrected and

quantile normalized in R (http://www.r-project.org) using the Affy package from

Bioconductor (http://www.bioconductor.org; Gautier et al. 2004). The probe-to-gene

annotation was created from a MEGABLAST (Zhang et al. 2000) similarity search of the

probe sequences against the most recent version of S. cerevisiae coding sequences. To

avoid spurious signal due to cross-hybridization, probes that matched more than one gene

with an E-value < 10^-2 (at least 12 consecutive base pairs and 16/25 base pairs matching

overall) were dropped from the analysis. The relative expression values used in

subsequent analyses are mean intensities from the set of unique probes. To quantify

expression coupling of the genes in an adjacent pair, we calculated the mean of the

absolute value of log2 expression ratios for adjacent genes (abs[log2[adjacent gene #1

expression/adjacent gene #2 expression]]) across conditions and Pearson and Spearman

correlations across conditions. To quantify maximal expression for an adjacent gene pair,

we summed expression for pairs of adjacent genes within each condition, then took the

maximum of this value across conditions (max[adjacent gene #1 expression + adjacent

gene #2 expression]).
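
The two pair-level summaries defined above can be written compactly. The sketch below is a plain-Python illustration with hypothetical function names and toy intensities; it assumes strictly positive expression values measured for both genes in the same ordered set of conditions.

```python
import math

def mean_abs_log2_ratio(expr_gene1, expr_gene2):
    """Mean over conditions of abs(log2(gene1 / gene2)); larger values
    indicate weaker expression coupling within the adjacent pair."""
    ratios = [abs(math.log2(a / b)) for a, b in zip(expr_gene1, expr_gene2)]
    return sum(ratios) / len(ratios)

def maximal_expression(expr_gene1, expr_gene2):
    """Maximum over conditions of the summed expression of the pair."""
    return max(a + b for a, b in zip(expr_gene1, expr_gene2))

# Toy intensities for one adjacent pair across three conditions
# (hypothetical values, not real microarray data).
gene1 = [100.0, 200.0, 400.0]
gene2 = [100.0, 100.0, 100.0]
print(mean_abs_log2_ratio(gene1, gene2))  # log2 ratios 0, 1, 2 -> mean 1.0
print(maximal_expression(gene1, gene2))   # largest sum is 400 + 100 = 500.0
```

Taking the absolute value makes the ratio symmetric in the two genes, so the statistic does not depend on which gene of the pair is labeled #1.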

Results

Adjacent gene relationships in extant yeast genomes

To investigate the evolution of adjacent gene relationships in yeast, we first

surveyed the variation in adjacent gene organization in the genome sequences of ten

members of the Saccharomyces species complex (Table 1). All species examined have a

bias towards divergent and convergent adjacencies; however, the %DC values for

genomes of post-WGD species are consistently lower (51.0 to 53.0%) than the %DC

found in the genomes of non-WGD species (54.0 to 56.0%). This suggests that one effect

of WGD-D may be to reduce genome-wide bias in adjacent gene orientation.

The intergenic lengths of the non-WGD genomes are generally shorter than those of

post-WGD species (Table 1). Since intergenic length estimates are inflated by missing

genes, it is most informative to compare genomes of similar annotation quality.

Comparing the well-annotated non-WGD species A. gossypii (4711 genes) with the post-

WGD species S. cerevisiae (5714 genes), we see that the intergenic spacing has increased

regardless of orientation. The same is true in the comparison of the non-WGD species K.

waltii (5230 genes) and K. lactis (5331 genes) with the post-WGD species C. glabrata

(5272 genes) and S. bayanus (4716 genes).

Modeling adjacent gene reorganization due to WGD-D

We developed a simulation to test the ability of WGD-D to reorganize adjacent

gene relationships (see Materials and Methods). Under random single-gene deletion, our

model predicts that WGD-D will lead to extensive reorganization of adjacent gene

orientations for a wide range of starting genome structures (fig. 1). In particular, the

mean genome-wide orientation bias was always reduced in our simulations of neutral

WGD-D. We also find that selection has the potential to shape the extent of

reorganization, either in terms of %DC (fig. 1) or deleted block lengths (unpublished

data). We present the results for strong selection on adjacency (i.e. duplicate copies are

deterministically selected for deletions based on net differences in the adjacencies

created) and moderate selection on adjacency (i.e. 40% adjustment in the probability of

choosing the copy for deletion), either favoring or disfavoring tandem adjacencies. We

can clearly distinguish the change in %DC expected under these selective models from

that expected under the neutral model for a wide range of starting %DC (starting %DC >

20%).

From our simulations we can make quantitative predictions about the effect of

WGD-D to compare with the pattern in extant yeast. We ran simulations using the A.

gossypii genome organization as a proxy for the ancestral organization, and used a range

of attempted deletion distributions to account for the possibility of longer deletion tracts

or clustering of deletions. Of the neutral deletion scenarios we examined, only single-

gene deletion could reduce the %DC from 56.0% to 52.0%, the inferred reduction in the

S. cerevisiae lineage (fig. 2a; p = 0.21 for single-gene deletion and p < 10^-4 for models

with longer deletions).

We can also use our simulations to derive the expected distribution of deleted

block lengths under a variety of deletion models. We compared the distribution of

deleted block lengths (in number of genes) from S. cerevisiae to the distribution from our

simulations of uniform and Poisson-distributed attempted deletions. The distribution of

deletions for S. cerevisiae is skewed towards small deletions relative to most neutral

deletion models (fig. 2b; p < 0.0001), but a single-gene deletion model can approximate it

well (p = 0.11).

Evolutionary analysis of orientation and intergenic spacing

To gain insight into the dynamics of the adjacent gene reorganization, we traced

the origin of each pair of adjacent genes in S. cerevisiae, looking for conservation in

successive outgroups along the S. cerevisiae lineage. The oldest adjacencies, which

predate WGD, have a %DC of 56.7% (fig. 3a), statistically indistinguishable from the

ancestral %DC (p = 0.59). Therefore, the orientation bias in conserved adjacencies is a

reflection of the ancestral bias, not a bias in the retention rate. Given that the ancestral

genome has a divergent-convergent bias, we expect that random deletion will initially

create more new tandem adjacencies and reduce the divergent-convergent bias (see fig.

1). As deletion proceeds, and the genome-wide divergent-convergent bias is reduced, the

bias in newly created adjacencies should approach this genome-wide value (see fig. 1).

Indeed, the adjacencies created immediately following WGD have a tandem bias (%DC =

46.5%) and the tandem bias is reduced for those created more recently (%DC = 49.3% -

49.9%), though they are not significantly different from one another.
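
The expected tandem bias of the first new adjacencies can be motivated with a short calculation. This is a sketch under an assumption that is not part of the original analysis: the orientations of neighboring ancestral adjacencies are treated as independent, each being divergent or convergent (DC) with probability p. When a single gene B is deleted from between genes A and C, the new pair A-C is DC only if exactly one of the two old adjacencies was DC.

```latex
% Three consecutive ancestral genes A, B, C with strands s_A, s_B, s_C.
% An adjacency is DC exactly when its two genes lie on opposite strands.
% After B is lost, A and C are on opposite strands if and only if
% exactly one of the pairs (A,B) and (B,C) was DC.
\[
  P(\text{A--C is DC}) = P(s_A \neq s_C) = 2p(1-p) \le \tfrac{1}{2}
\]
% With the ancestral proxy p = 0.56:
\[
  2 \times 0.56 \times 0.44 = 0.4928 \approx 49.3\%
\]
```

Under this independence assumption, single-gene deletion from a DC-biased genome yields new adjacencies with %DC below 50% whenever p differs from 0.5, which agrees in direction with the tandem bias described above.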

The evolutionary analysis also shows a pattern of greater intergenic spacing in

more recently created adjacencies (fig. 3b). Furthermore, the mean and variance of

intergenic length are monotonically decreasing with age, as would be expected if genes

are lost by small, inactivating mutations followed by successive small deletions. This

effect is not due to differences in %DC across the adjacency age classes because the

pattern holds when the data are partitioned into convergent, divergent, and tandem

adjacencies (unpublished data). Even the adjacencies created immediately following

WGD have greater mean intergenic length than the adjacencies that remained from before

WGD, suggesting that WGD-D has had a lasting effect on intergenic spacing.

Expression evolution in adjacencies

Given that Cohen et al. (2000) found that adjacent gene coexpression was less

likely for pairs with greater spacing, we asked whether the evolution of longer intergenic

regions was associated with the decoupling of expression in adjacent genes. If new

adjacencies are as strongly coupled in expression as conserved adjacencies, then the

increase in intergenic spacing provides no overall expression decoupling. To characterize

the extent of expression coupling, we determined the log2 expression ratio for pairs of

adjacent genes in each age class, averaged across 101 published microarray experiments

(fig. 3c). There is a significant increase in the mean expression ratio for younger

adjacencies, suggesting an overall expression decoupling due to WGD-D. Pearson and

Spearman correlation coefficients for expression of adjacent genes, which would detect

finer scale coexpression patterns, show no relationship with age of adjacency

(unpublished data). The higher mean log2 expression ratio for new adjacencies may not

be due to decoupling at all levels of expression, but could be driven by decoupling at high

expression levels. Since the mean log expression ratio for adjacent genes is more

sensitive to extreme values than correlation coefficients, which are bounded, the pattern

in log expression ratio is likely driven by the increase in maximal expression. Therefore,

we asked whether the evolution of longer intergenic regions was associated with greater

maximal expression (i.e. maximum of the summed expression for an adjacent gene pair

across conditions). In this case, the average maximal expression should be greater for the

younger adjacency age classes. As predicted, greater maximal expression is observed in

the adjacency age classes with greater intergenic spacing, that is, the S. cerevisiae and the

Saccharomyces sensu stricto specific adjacencies (fig. 3d).

Expression values derived from microarray hybridization signal may potentially

be influenced by differences in probe affinities due to base composition, but this is

unlikely to affect our results given the large number (~10^4-10^5) of probes per adjacency

age class. Indeed, when we performed a multiple regression accounting for any effect of

GC-content on the expression measures, the relationship between maximal expression

and adjacency age class remained highly significant (p < 10^-6). Furthermore, this

relationship is not due to a specific functional category of genes. While many of the

highly expressed genes are ribosomal, the trend remains when we remove the 438 genes

in our dataset annotated by Gene Ontology to the “protein biosynthesis” biological

process category.

Discussion

Deletion and the mechanism of gene loss

In the 2:1 alignments of the S. cerevisiae genome to the K. waltii or A. gossypii

genome, there appear to be many multiple-gene deletion blocks. Kellis, Birren and

Lander (2004) note that deleted blocks are small, with an average length of two genes,

but they do not speculate whether these blocks represent individual multiple-gene

deletion events or several single-gene losses beside one another. Our simulation

demonstrates that long tracts of adjacent deletions are common under a random single-

gene deletion model, and S. cerevisiae has no more long deleted blocks than expected

under a model of random single-gene deletion (fig. 2b). Therefore, we find no evidence

that multiple-gene deletion events played a significant role in WGD-D in yeast.

Similarly, there is no evidence for clustering of deletions, as this would also lead to the

appearance of an excess of multiple-gene deleted blocks. Since multiple-gene deleted

blocks imply tracts of conserved adjacency on the paralogous chromosome, there is also

no genome-wide tendency for conservation of gene clusters at the level of adjacencies.

Since we only consider conservation of adjacency, this does not contradict window-based

identification of conserved clusters of metabolic (Wong and Wolfe 2005) or essential

genes (Pal and Hurst 2003).

While a single-gene deletion model was best able to approximate the effect of

WGD-D in yeast, our gene-based simulation does not distinguish whether the loss of

single genes occurs by gene-length deletion, smaller deletion, or inactivating mutation.

However, the evolutionary analysis of intergenic lengths does provide information about

the interplay between deletion and pseudogenization during the process of gene loss. For

instance, if a functional duplicate gene is lost by a gene-length deletion event without an

intermediate step of pseudogene formation, the intergenic length of the new adjacency

may not increase significantly. This mode of gene loss is not likely because our analysis

shows that adjacencies newly created by WGD-D have greater intergenic spacing (fig.

3b). This suggests that gene loss is initiated by inactivating mutations, such as small

indels that cause frameshifts or substitutions that cause premature stop codons.

The monotonic decrease of intergenic spacing with increasing age of adjacency

suggests that after pseudogenization the intergenic spacing for the newly formed

adjacency is gradually reduced by many small deletions. This mean decrease of

intergenic spacing cannot be due to rare large deletions since the variance is reduced

along with the mean (fig. 3b). These data support a model for gene loss during the yeast

WGD-D where small inactivating mutations (deletion or otherwise) are followed by a

whittling down of the pseudogenic and intergenic sequence. Previous research has

shown that small indels are biased towards deletion in a wide range of eukaryotes, from

mammals and fish to insects and plants (Gregory 2004), but this bias has never been

investigated in yeast. While it is possible that selection for genome compactness led to

the reduction of intergenic space following WGD-D, it is likely that the yeast deletion

pattern is due to the same mutational bias found across the eukaryotes.

In keeping with the idea of gradual deletion, it seems that the deletion phase was

not complete in the common ancestor of the post-WGD species. If the deletion phase of

WGD-D was very short, most new adjacencies would have been created by the time of

the S. castellii-S. cerevisiae split, but we find that the majority (73%; fig. 3) of new

adjacencies were created afterwards. An incomplete annotation for the S. castellii

genome could potentially lead us to underestimate the age of some adjacencies,

spuriously dating them to the S. cerevisiae-Saccharomyces sensu stricto split. However,

there are also many (721) S. cerevisiae specific adjacencies, which are unlikely to be

missed in the genome sequencing of all Saccharomyces sensu stricto species and S.

castellii. Therefore, we conclude that the process of gene loss and deletion has continued

during the radiation of the Saccharomyces species complex. Since there are only a

handful of pseudogenes in yeast (Harrison and Gerstein 2002), and none of these appear

to be of WGD-D origin (unpublished data), little, if any, trace remains of the

pseudogenes created by WGD-D.

Functional consequences of WGD-D

A gradual process of gene loss may suggest that the deletion phase of WGD-D is

neutral or even subject to negative selection. Unfortunately, the current annotation

quality does not allow a reliable estimate of the gene loss rate, so we cannot determine

whether the rate of gene loss is reduced due to negative selection conserving adjacencies

or even increased by positive selection favoring the creation of new adjacencies.

However, the signature of selection may appear as a bias in the conservation or creation

of adjacent gene relationships, or in the spatial patterning of deletions.

We find that random single-gene deletion can recapitulate the reorganization of

adjacencies due to WGD-D in S. cerevisiae. Since random single-gene deletion is

sufficient to produce the same genome-wide reduction in %DC (from ~56% to ~52%) as

observed in yeast, WGD-D was largely neutral with respect to orientation (fig. 2a). The

conservation and creation of adjacencies along the S. cerevisiae lineage is also consistent

with a neutral reorganization of gene orientations (fig. 3a). Finally, the random single-

gene deletion model was also able to produce the distribution of deleted block lengths

observed in S. cerevisiae, with deletions neither excessively interleaved nor clustered (fig.

2b). Since the random single-gene deletion model of WGD-D reproduced the patterns of

retention and turnover of adjacent gene relationships, there is no evidence that negative

selection or positive selection shaped the WGD-D genome reorganization in yeast.
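
The logic of the random single-gene deletion model can be illustrated with a minimal sketch. It is not the simulation used for fig. 2, and it rests on simplifying assumptions that are ours: the genome is a single linear chromosome, strand flips between neighbors occur independently with a fixed probability, and every gene returns to single copy by losing one randomly chosen duplicate. The function names (`make_genome`, `pct_dc`, `wgd_d`) are hypothetical.

```python
import random


def make_genome(n_genes, p_dc, rng):
    """Build a strand list in which each adjacent pair lies on opposite
    strands (divergent or convergent) with probability p_dc."""
    strands = [rng.choice("+-")]
    for _ in range(n_genes - 1):
        prev = strands[-1]
        flip = rng.random() < p_dc
        strands.append(("-" if prev == "+" else "+") if flip else prev)
    return strands


def pct_dc(strands):
    """Percentage of adjacent gene pairs on opposite strands (%DC)."""
    pairs = list(zip(strands, strands[1:]))
    return 100.0 * sum(a != b for a, b in pairs) / len(pairs)


def wgd_d(strands, rng):
    """Duplicate the genome, then keep each gene on exactly one copy,
    chosen at random (assumption: no duplicate pair is retained)."""
    copy_a, copy_b = [], []
    for s in strands:
        (copy_a if rng.random() < 0.5 else copy_b).append(s)
    return copy_a, copy_b


rng = random.Random(0)
ancestor = make_genome(50_000, 0.56, rng)  # ~56% DC, as in the pre-WGD-like genomes
copy_a, copy_b = wgd_d(ancestor, rng)

before = pct_dc(ancestor)
n_a, n_b = len(copy_a) - 1, len(copy_b) - 1
# Pool adjacencies from both post-deletion copies.
after = (pct_dc(copy_a) * n_a + pct_dc(copy_b) * n_b) / (n_a + n_b)
print(f"%DC before: {before:.1f}, after: {after:.1f}")
```

Under these assumptions, new adjacencies join genes that were not neighbors before duplication, so their relative orientation is close to random and the genome-wide %DC drifts from its starting value toward 50%.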

Even though there is no evidence for selection shaping WGD-D in yeast, there is

evidence that this reorganization had functional consequences for the genome. A major

effect of the interleaved gene loss was to provide increased intergenic spacing for many

genes (fig. 3b). We suggest that this increased intergenic spacing was responsible for the

uncoupling of expression and increased maximal expression for adjacent genes by

relieving transcriptional interference. Given that the ancestral genome was highly

compact, the concurrent expression of adjacent genes may have been limited by

transcriptional interference (Shearwin, Callen, and Egan 2005). Even in the relatively

spacious genome of humans, there is evidence that transcriptional interference can

explain an association between higher expression and greater intergenic spacing

(Chiaromonte, Miller, and Bouhassira 2003).

How can we reconcile the gradual neutral reorganization of adjacent gene

relationships with the evidence for widespread functional changes? For instance, while

WGD maintains relative copy numbers of interacting partners, the dosage balance

hypothesis (Veitia 2004) would predict that an asynchronous deletion phase would be

deleterious. It may be that downstream regulatory mechanisms (e.g., feedback or

translational regulation) compensate for most expression changes or that expression

changes of the magnitude we see do not affect the function of most genes. Future

improvements to the genome sequences of the species in the Saccharomyces complex

will clarify the dynamics of WGD-D, but a full understanding of the functional

consequences will require genome-wide expression data from more species, particularly

the non-WGD outgroups.

Acknowledgements

This research was supported by the Natural Sciences and Engineering Research

Council of Canada (G.P.M.), National Science Foundation (G.P.M.), the Department of

Education’s Graduate Assistance in Areas of National Needs Program (J.K.B. and

G.P.M.) and NIH grants (W.H.L.). We thank K. Wolfe and the reviewers for helpful

suggestions.

Literature Cited

Chiaromonte, F., W. Miller, and E.E. Bouhassira. 2003. Gene length and proximity to

neighbors affect genome-wide expression levels. Genome Res. 13:2602-2608.

Cliften, P., P. Sudarsanam, A. Desikan, L. Fulton, B. Fulton, J. Majors, R. Waterston,

B.A. Cohen, and M. Johnston. 2003. Finding functional features in

Saccharomyces genomes by phylogenetic footprinting. Science. 301:71-76.

Cohen, B.A., R.D. Mitra, J.D. Hughes, and G.M. Church. 2000. A computational analysis

of whole-genome expression data reveals chromosomal domains of gene

expression. Nat. Genet. 26:183-186.

Dietrich, F.S., S. Voegeli, S. Brachat et al. (14 co-authors). 2004. The Ashbya gossypii

genome as a tool for mapping the ancient Saccharomyces cerevisiae genome.

Science. 304:304-307.

Dujon, B., D. Sherman, G. Fischer et al. (67 co-authors). 2004. Genome evolution in

yeasts. Nature. 430:35-44.

Filipski, J., and M. Mucha. 2002. Structure, function and DNA composition of

Saccharomyces cerevisiae chromatin loops. Gene. 300:63-68.

Fischer, G., S.A. James, I.N. Roberts, S.G. Oliver, and E.J. Louis. 2000. Chromosomal

evolution in Saccharomyces. Nature. 405:451-454.

Garcia, I., R. Gonzalez, D. Gomez, and C. Scazzocchio. 2004. Chromatin rearrangements

in the prnD-prnB bidirectional promoter: dependence on transcription factors.

Eukaryot. Cell. 3:144-156.

Gautier, L., L. Cope, B.M. Bolstad, and R.A. Irizarry. 2004. affy--analysis of Affymetrix

GeneChip data at the probe level. Bioinformatics. 20:307-315.

Gerton, J.L., J. DeRisi, R. Shroff, M. Lichten, P.O. Brown, and T.D. Petes. 2000. Global

mapping of meiotic recombination hotspots and coldspots in the yeast

Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. USA. 97:11383-11390.

Gregory, T.R. 2004. Insertion-deletion biases and the evolution of genome size. Gene.

324:15-34.

Harbison, C.T., D.B. Gordon, T.I. Lee et al. (20 co-authors). 2004. Transcriptional

regulatory code of a eukaryotic genome. Nature. 431:99-104.

Harrison, P.M., and M. Gerstein. 2002. Studying genomes through the aeons: protein

families, pseudogenes and proteome evolution. J. Mol. Biol. 318:1155-1174.

Kellis, M., N. Patterson, M. Endrizzi, B. Birren, and E.S. Lander. 2003. Sequencing and

comparison of yeast species to identify genes and regulatory elements. Nature.

423:241-254.

Kellis, M., B.W. Birren, and E.S. Lander. 2004. Proof and evolutionary analysis of

ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature.

428:617-624.

Kurtzman, C.P. 2003. Phylogenetic circumscription of Saccharomyces, Kluyveromyces

and other members of the Saccharomycetaceae, and the proposal of the new

genera Lachancea, Nakaseomyces, Naumovia, Vanderwaltozyma and

Zygotorulaspora. FEMS Yeast Res. 4:233-245.

Lohr, D., P. Venkov, and J. Zlatanova. 1995. Transcriptional regulation in the yeast GAL

gene family: a complex genetic network. FASEB J. 9:777-787.

Pal, C., and L.D. Hurst. 2003. Evidence for co-evolution of gene order and recombination

rate. Nat. Genet. 33:392-395.

Shearwin, K.E., B.P. Callen, and J.B. Egan. 2005. Transcriptional interference - a crash

course. Trends Genet. 21:339-345.

Veitia, R.A. 2004. Gene dosage balance in cellular pathways: implications for dominance

and gene duplicability. Genetics. 168:569-574.

Wolfe, K.H., and D.C. Shields. 1997. Molecular evidence for an ancient duplication of

the entire yeast genome. Nature. 387:708-713.

Wong, S., and K.H. Wolfe. 2005. Birth of a metabolic gene cluster in yeast by adaptive

gene relocation. Nat. Genet. 37:777-782.

Zhang, Z., S. Schwartz, L. Wagner, and W. Miller. 2000. A greedy algorithm for aligning

DNA sequences. J. Comput. Biol. 7:203-214.

Figure 1: The extent of reorganization for adjacent gene orientation depends on starting

bias in orientation and selection on orientation. We created 51 starting genomes with

%DC ranging from 0% to 100%, separated by increments of ~2%, based on the gene

counts of A. gossypii. We simulated WGD-D under a single-gene deletion model 10,000

times for each genome and here plot the mean change in %DC (+/- one standard

deviation) for a neutral deletion (solid), or with strong (black) or moderate (grey)

selection favoring (dashed) or disfavoring (dotted) deletions that result in a net gain of

tandem adjacencies.
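
One way to build such a series of starting genomes is sketched below. This is an illustration under our own assumptions, not the procedure used for the figure: the genome is treated as one linear chromosome of 4711 genes (the A. gossypii ORF count in Table 1), the targets are spaced at exactly 2%, and opposite-strand adjacencies are placed at random positions. The names `genome_with_pct_dc` and `pct_dc` are hypothetical.

```python
import random


def genome_with_pct_dc(n_genes, target_pct, rng):
    """Return a strand list whose %DC is as close as possible to target_pct."""
    n_adj = n_genes - 1
    n_flip = round(target_pct / 100 * n_adj)  # opposite-strand adjacencies
    flips = [True] * n_flip + [False] * (n_adj - n_flip)
    rng.shuffle(flips)  # scatter them along the chromosome
    strands = [rng.choice("+-")]
    for flip in flips:
        prev = strands[-1]
        strands.append(("-" if prev == "+" else "+") if flip else prev)
    return strands


def pct_dc(strands):
    """Percentage of adjacent gene pairs on opposite strands (%DC)."""
    pairs = list(zip(strands, strands[1:]))
    return 100.0 * sum(a != b for a, b in pairs) / len(pairs)


N_GENES = 4711  # assumption: A. gossypii annotated ORFs, single chromosome
targets = [2 * i for i in range(51)]  # 0, 2, ..., 100
rng = random.Random(1)
genomes = {t: genome_with_pct_dc(N_GENES, t, rng) for t in targets}
```

Each genome in `genomes` can then be passed to a deletion simulation, and the change in %DC recorded per starting value.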

Figure 2: Shorter deletions in WGD-D result in more reorganization of gene orientation

and in shorter deleted blocks. (a) We plot the frequency distribution of %DC for

10,000 simulations of WGD-D using the A. gossypii genome structure as the starting

point, with deletion lengths drawn from a uniform or Poisson distribution. Vertical lines

indicate the genome-wide %DC values observed for S. cerevisiae (solid) and A. gossypii

(dotted). Smaller deletions lead to significantly greater mean change and variance in

final %DC (p < 10^-15). The single-gene deletion model (i.e., uniform with max = 1; short

dashed line) has a reduction in %DC from 56% to 52.5% ± 0.57% (mean ± standard

deviation), which is statistically indistinguishable from the S. cerevisiae value (p = 0.21).

(b) We plot the distribution of deleted block lengths from the same set of simulations,

with the means (± standard deviation) for each deletion model, against the distribution

from S. cerevisiae. Again, only the single-gene deletion model (short dashed line)

approximates the pattern observed in S. cerevisiae (solid line).
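
As a rough cross-check of the comparison in (a), the observed S. cerevisiae value can be placed on the simulated distribution using the reported mean and standard deviation. The sketch below assumes the simulated %DC values are approximately normal, which is our assumption; the reported p = 0.21 may instead have been taken from the empirical distribution of simulations, so exact agreement is not expected.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


sim_mean, sim_sd = 52.5, 0.57  # single-gene deletion model, from the caption
observed = 52.0                # S. cerevisiae genome-wide %DC (Table 1)

z = (observed - sim_mean) / sim_sd
p_lower = normal_cdf(z)                          # one-sided (lower tail)
p_two_sided = 2 * min(p_lower, 1.0 - p_lower)
print(f"z = {z:.2f}, one-sided p = {p_lower:.2f}, two-sided p = {p_two_sided:.2f}")
```

The observed value lies less than one standard deviation below the simulated mean, so under either tail convention it is not significantly different from the single-gene deletion model.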

Figure 3: Origin of adjacencies in the S. cerevisiae lineage. (a) %DC, (b) mean

intergenic length, (c) mean log2 expression ratio of adjacent genes, and (d) maximal

expression for adjacent gene pairs in each age class (Scer = S. cerevisiae specific, stricto

= Saccharomyces sensu stricto specific, Scas = present in S. cerevisiae-S. castellii

common ancestor, preWGD = conserved from before WGD). Bars with the same letter

are not significantly different (p > 0.05) and error bars represent 95% confidence intervals.

The number of genes in each adjacency age class is as follows (with the subset of genes

with expression data [c,d] given in parentheses): Scer, 1392 (312); stricto, 1530 (1454);

Scas, 810 (774); and preWGD, 2156 (1984). The pattern for %DC and mean intergenic

length is the same whether we use the full set of genes (a,b) or the subset of genes with

expression data (unpublished data). The %DC is lowest for adjacencies created

immediately following WGD-D and increases slightly in more recently created

adjacencies. Intergenic lengths, log2 expression ratios, and maximal expression are

greater for younger adjacencies.

Table 1: Summary of adjacent gene relationships for 10 species of the Saccharomyces complex

Div, Con, and Tan are orientation counts (divergent, convergent, tandem). Mean intergenic lengths are given with the standard deviation in parentheses. Bracketed letters give the source of the genome size.

| Species | Genome size (Mb) | Annotated ORFs | %DC | Div | Con | Tan | Divergent mean intergenic length (SD) | Convergent mean intergenic length (SD) | Tandem mean intergenic length (SD) |
|---|---|---|---|---|---|---|---|---|---|
| S. cerevisiae | 12.5 [a] | 5714 | 52.0 | 1495 | 1493 | 2763 | 799 (1039) | 379 (670) | 650 (995) |
| S. mikatae | 10 [c], 12.12 [b] | 1902 | 53.0 | 514 | 494 | 894 | 1684 (1611) | 1372 (1649) | 1528 (1639) |
| S. kudriavzevii | 10.6 [c] | 2793 | 53.0 | 749 | 732 | 1312 | 1410 (1495) | 1132 (1557) | 1381 (1607) |
| S. bayanus | 10.2 [c], 11.54 [b] | 4716 | 51.0 | 1215 | 1217 | 2284 | 1045 (1372) | 622 (1023) | 793 (1154) |
| S. castellii | 11.1 [c] | 4378 | 52.4 | 1143 | 1149 | 2086 | 996 (1289) | 656 (1221) | 819 (1232) |
| C. glabrata | 12.3 [d] | 5272 | 52.9 | 1360 | 1363 | 2427 | 1031 (1214) | 393 (420) | 861 (837) |
| S. kluyveri | 10.2 [c] | 1985 | 57.3 | 570 | 563 | 852 | 1830 (2108) | 1235 (1710) | 1792 (2050) |
| K. waltii | 10.7 [e] | 5230 | 54.0 | 1234 | 1264 | 2131 | 946 (984) | 310 (554) | 736 (899) |
| K. lactis | 10.6 [d] | 5331 | 55.5 | 1476 | 1480 | 2361 | 650 (708) | 318 (453) | 668 (653) |
| A. gossypii | 9.2 [f] | 4711 | 56.0 | 1317 | 1319 | 2075 | 466 (493) | 203 (313) | 415 (502) |

[a] Saccharomyces Genome Database; [b] Kellis et al. 2003; [c] Cliften et al. 2003; [d] Dujon et al. 2004; [e] Kellis, Birren, and Lander 2004; [f] Dietrich et al. 2004
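
For orientation, %DC can be recomputed from the orientation counts as the share of divergent plus convergent pairs among all adjacent pairs, following the definition given in the abbreviations. The sketch below is illustrative only; the function name `pct_dc_from_counts` is hypothetical, and only two rows of the table are used.

```python
def pct_dc_from_counts(div, con, tan):
    """%DC = 100 * (divergent + convergent) / all adjacent pairs."""
    return 100.0 * (div + con) / (div + con + tan)


# (Div, Con, Tan) counts copied from Table 1
counts = {
    "S. cerevisiae": (1495, 1493, 2763),
    "A. gossypii": (1317, 1319, 2075),
}

for species, (div, con, tan) in counts.items():
    print(species, round(pct_dc_from_counts(div, con, tan), 1))
```

This gives 52.0 for S. cerevisiae and 56.0 for A. gossypii, matching the tabulated %DC. A few other rows differ from the tabulated value by a few tenths of a percent when recomputed this way, which this sketch does not attempt to explain.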

Figure 1

Figure 2 (panels A and B)

Figure 3 (panels A, B, C, and D)
