Methods for the detection of non-random base substitution in virus genes: models of synonymous...

14
Virus Genes 16:3, 253–266, 1998 # 1998 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands. Methods for the Detection of Non-Random Base Substitution in Virus Genes: Models of Synonymous Nucleotide Substitution in Picornavirus Genes DANIEL HAYDON,* NICK KNOWLES 1 & JOHN MCCAULEY 2 1 Institute for Animal Health, Pirbright Laboratory, Ash Road, Pirbright, Woking GU24 0NF, UK E-mail: [email protected] 2 Institute for Animal Health, Compton Laboratory, Newbury, Berkshire RG20 7NN, UK E-mail: [email protected] Received September 2, 1997; Accepted December 1, 1997 Abstract. A substantial fraction of phylogenetic divergence between closely related RNA virus genes is generally accounted for by synonymous (non-amino acid changing) point mutation. Viral evolution may be a complicated phenomena, governed by many different processes. However in this study we ask whether there are any properties in the patterns of synonymous nucleotide substitutions in three different Picornavirus genes that permit the process of accumulation of synonymous point mutation in these genes to be distinguished from some of the simplest most basic evolutionary models. We conclude that while the observed patterns in the occurrence of synonymous point substitution are consistent with those predicted by a model in which base mutation is equi-probable along a gene, and the probability of synonymous substitution determined only by local codon usage, some patterns in the actual nucleotides exchanged remain to be explained. Key words: synonymous substitution, picornavirus genes, neutral, models of evolution Introduction The high rate of base mis-match exhibited by RNA viruses during gene replication is estimated to be between 10 4 to 10 3 per nucleotide site per replication (1,2) and suggests that point mutation may be a major component of RNA virus evolution. We conjecture that the simplest form of this evolutionary change would be when point mutations occur with equal probability along the gene, and are subsequently subjected to no positive or negative selection pressures. There are many possible reasons to reject such a simplistic notion of evolution (as reviewed in 3). The first part of this conjecture may be unrealistic: there are a number of reasons to suppose that the probability of single nucleotide point mutations may not occur with equal probability along viral genes. For example, certain particular sequences of nucleotides have been shown to incur unusually high rates of point mutation within HIV-1 structural genes (4), and the local purine/pyrimidine bias has also been related to observed mutation rate (5,6). Biased hypermutation has been observed within GpA dinucleotides resulting in a preponderance of G ! A mutations in human immunodeficiency virus-1 (7), simian immunodeficiency virus (8), and necrosis provirus (9), possibly as a result of dislocation mutagenesis (10). Similarly, biased hypermutation has been observed from U ! C in vesicular stomatitis virus (11), and parainfluenza virus (12); and from A ! G in viral genes associated with brain tissue infected with measles virus (13). The second part of the conjecture pre-supposes that base substitutions are likely to be neutral. Single-base nucleotide substitutions may be classified as synony- mous (resulting in no amino acid change) or non- *Corresponding author: Dept of Zoology, University of British Columbia, 6270 University Blvd., Vancouver, BC, Canada, V6T 1Z4 (e-mail: [email protected]). The nucleotide sequence data used in this paper has been submitted to GenBank.

Transcript of Methods for the detection of non-random base substitution in virus genes: models of synonymous...

Virus Genes 16:3, 253±266, 1998

# 1998 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Methods for the Detection of Non-Random Base Substitution in Virus Genes:Models of Synonymous Nucleotide Substitution in Picornavirus Genes

DANIEL HAYDON,* NICK KNOWLES1 & JOHN MCCAULEY2

1Institute for Animal Health, Pirbright Laboratory, Ash Road, Pirbright, Woking GU24 0NF, UK

E-mail: [email protected] for Animal Health, Compton Laboratory, Newbury, Berkshire RG20 7NN, UK

E-mail: [email protected]

Received September 2, 1997; Accepted December 1, 1997

Abstract. A substantial fraction of phylogenetic divergence between closely related RNAvirus genes is generally

accounted for by synonymous (non-amino acid changing) point mutation. Viral evolution may be a complicated

phenomena, governed by many different processes. However in this study we ask whether there are any properties

in the patterns of synonymous nucleotide substitutions in three different Picornavirus genes that permit the process

of accumulation of synonymous point mutation in these genes to be distinguished from some of the simplest most

basic evolutionary models. We conclude that while the observed patterns in the occurrence of synonymous point

substitution are consistent with those predicted by a model in which base mutation is equi-probable along a gene,

and the probability of synonymous substitution determined only by local codon usage, some patterns in the actual

nucleotides exchanged remain to be explained.

Key words: synonymous substitution, picornavirus genes, neutral, models of evolution

Introduction

The high rate of base mis-match exhibited by RNA

viruses during gene replication is estimated to be

between 10ÿ 4 to 10ÿ 3 per nucleotide site per

replication (1,2) and suggests that point mutation

may be a major component of RNA virus evolution.

We conjecture that the simplest form of this

evolutionary change would be when point mutations

occur with equal probability along the gene, and are

subsequently subjected to no positive or negative

selection pressures. There are many possible reasons

to reject such a simplistic notion of evolution (as

reviewed in 3). The ®rst part of this conjecture may be

unrealistic: there are a number of reasons to suppose

that the probability of single nucleotide point

mutations may not occur with equal probability

along viral genes. For example, certain particular

sequences of nucleotides have been shown to incur

unusually high rates of point mutation within HIV-1

structural genes (4), and the local purine/pyrimidine

bias has also been related to observed mutation rate

(5,6). Biased hypermutation has been observed within

GpA dinucleotides resulting in a preponderance of

G!A mutations in human immunode®ciency virus-1

(7), simian immunode®ciency virus (8), and necrosis

provirus (9), possibly as a result of dislocation

mutagenesis (10). Similarly, biased hypermutation

has been observed from U!C in vesicular stomatitis

virus (11), and parain¯uenza virus (12); and from

A!G in viral genes associated with brain tissue

infected with measles virus (13).

The second part of the conjecture pre-supposes that

base substitutions are likely to be neutral. Single-base

nucleotide substitutions may be classi®ed as synony-

mous (resulting in no amino acid change) or non-

*Corresponding author: Dept of Zoology, University of British

Columbia, 6270 University Blvd., Vancouver, BC, Canada, V6T

1Z4 (e-mail: [email protected]).

The nucleotide sequence data used in this paper has been submitted

to GenBank.

synonymous (resulting in amino acid change). Non-

synonymous mutations are likely to be subjected to

purifying selection (as demonstrated with Foot and

Mouth Disease Virus, see 14), although there are

examples of point mutations in RNA viruses that are

subject to positive selection resulting from immune

surveillance (as demonstrated with HIV-1, see 15).

The most likely forms of nucleotide substitutions in

which neutral evolution may proceed are synonymous

point mutations. The occurrence of synonymous

mutations is subject to the in¯uence of codon usage.

As a result of the differential redundancy in the

genetic code some codons (e.g. those for serine,

leucine or argenine) are far more likely to incur

synonymous mutations than others (e.g. tyrosine or

cysteine). Despite the fact that synonymous base

mutations do not result in a change in the amino acid

sequence of protein coding genes, there are reasons to

suppose that they too may not be free from selection

pressures. The adopted secondary structure of the

RNA may impose constraints on permissible synony-

mous base substitutions, as suggested for some genes

of Caulimoviruses (16), and Hepatitis C virus (17).

Alternatively gross base composition of the genome

may itself be under the in¯uence of selective forces,

perhaps for reasons of genome packing, coding

requirements, tRNA availability or genome stability.

A further possible reason would be the existence of

overlapping reading frames, as has been suggested for

the unusually low rate of synonymous substitution

observed in the C gene of Hepatitus C (18).

Many of these complicating factors have been

proposed as explanations for the non-random patterns

of synonymous substitution observed within some

enterobacterial genes (e.g. 19±21).

Non-randomness in the occurrence of synonymous

nucleotide substitutions within genes may be identi-

®ed from related gene sequence data in two broadly

different ways: First, the frequency distribution of

multiply ``hit'' sites (i.e. how many homologous base

sites within the dataset were free from change, or

incurred one, two, three . . . N changes) may differ

from that expected from a simply de®ned mutational

model exclusive of mutation bias, hotspots and

subsequent selection. Second, given the observed

frequency distribution of multiply hit sites, it is

possible that these sites may be distributed along the

gene in a non-random pattern, i.e. the incidence of

observed substitutions may be unusually clustered, or

spaced out. In this study we conducted 3 different

statistically based tests to determine whether or not

the pattern of synonymous point mutations in the

RNA sequences of 3 different Picornavirus genes was

distinguishable from patterns predicted by very

simple neutral models of base substitution. We

conclude that within the genes we examined, there

is no evidence for any unexpected spatial clustering of

synonymous change and that the observed frequency

distributions of multiply hit sites in these genes can be

explained by a simple model of synonymous base

substitution that includes the observed codon usage

and transition transversion bias. However, these

simple models are unable to replicate some of the

patterns of biased hypermutation encountered in 2 of

the 3 datasets.

Materials and Methods

The Sequences

The analyses were conducted on 3 sets of sequences.

One data set contained sequences of the complete

(n� 630 nucleotides) VP1 gene from 55 isolates of

Foot and Mouth Disease virus serotype O (FMDV

VP1) obtained from around the world, over the last 70

years (EMBL accession numbers: AJ004645-86,

M89900, Z21862, M16088, K01201, M15981,

M21695, M15980, Z21860, M58601, M55287,

M15974, J02185, X00871). The other two data sets

contained sequences from the VP1 (n� 849 nucleo-

tides) and 3B/C (n� 666 nucleotides) genes of 45

isolates of Swine Vesicular Disease Virus (SVDV

VP1 and SVDV 3BC), obtained from around the

world over the last 30 years (VP1 gene EMBL

accession numbers: AJ004611-40, Y14464-76,

Y14481, D16364, D00435, and X54521; 3BC gene

EMBL accession numbers: AJ004564-604, D16364,

D00435, and X54521). Accession numbers and strain

names are collated in 22 and 23.

Analysis of Sequences

We applied 3 different tests to each of the 3 datasets.

Test I. This test was designed to determine whether

or not the observed frequency distribution of multiply

hit sites was consistent with that expected from a

simple Poisson model of synonymous substitution.

For each dataset we constructed the consensus tree of

254 Haydon, Knowles, and McCauley

all most parsimonious phylogenetic trees using the

program DNAPARS in the package PHYLIP 3.5c (24)

and determined the ancestral sequences at the internal

nodes of this consensus tree by supplying it as a user

de®ned tree to the same program. The output from this

procedure can be used to identify the predicted base

substitutions at each site along each branch of the tree.

This information can in turn be used to estimate how

many 1st and 3rd codon position sites received 0, 1, 2,

3, . . . N unambiguously identi®able synonymous

substitutions. This information was used to calculate

average synonymous hit rates for 1st and 3rd codon

positions combined. These per site averages were

used to construct the expected frequency distribution

of multiply hit sites if the same number of substitution

events was distributed over the appropriate number of

sites according to a Poisson distribution. The two

distributions can then be compared using a Chi-

squared test. If the expected multiple hit distribution is

found to differ from the observed distribution, it may

be concluded that the actual process of synonymous

substitution is not homogenous and independent over

all sites, as assumed by a Poisson process.

Test II. This test is technically complex, but the

underlying idea is simple. The aim is to simulate

datasets that are identical in many ways to the original

real datasets. They are identical in that real and

simulated datasets share almost identical consensus

sequences, levels of synonymous divergence and

phylogeny. They only differ in that patterns of

synonymous point substitution in the simulated

datasets are determined by a single speci®ed

evolutionary model, that makes explicit assumptions

about the process of synonymous evolution. The

adequacy of these assumptions may be examined

statistically by comparing the patterns of point

mutation in the real and simulated datasets.

Phylogeny Reconstruction

The number of synonymous substitutions per synony-

mous site at the ith nucleotide position, (pi, i� 1, . . . ,

n) was estimated by a site by site comparison of

sequence pairs using the method of Nei and Gojobori

(25), and then averaged over all possible pairwise

sequence comparisons at each nucleotide site to arrive

at an observed average number of synonymous

substitutions per synonymous site for each nucleotide

position, O(�pi), i� 1, . . . , n.

The average transition:transversion ratio (Ts) for

synonymous base substitutions was estimated from

the ratio of observed unambiguously synonymous

transitional and transversional differences between

codons that differ at only one position from all

possible pairwise sequence comparisons. This ratio

was then used to calculate a synonymous distance

matrix, containing the number of synonymous

substitutions per synonymous site by which the xth

and yth sequences differ from one another (ds,xy).

These distances were corrected for multiple hits, using

a modi®ed formulation of the Kimura two-parameter

model (26):

ds;xy �1

2ln� 1

1ÿ 2Fs ÿ Gs

�� 1

4ln� 1

1ÿ 2Gs

�where Fs is the estimated number of transitions per

synonymous site, and Gs the estimated number of

transversions per synonymous site. A phylogeny

based on these synonymous distances was then

constructed using the program FITCH in the

PHYLIP package (24).

Simulation of Sequences

A model of synonymous nucleotide change was

developed that is essentially an extension of a model

of nucleotide substitution proposed by Hasegawa et

al. (27) that permits different rates for transitions and

transversions with unequal base frequencies. We

extended this model to one that permits synonymous

codon substitution with unequal codon frequencies.

The model assumes that the probability of a mutation

occurring is identical and independent across all sites,

but that the probability that such a mutation is

synonymous is determined by the particular codon

to which the nucleotide belongs, and the transition

transversion ratio (as this obviously determines to

what alternative codon it may mutate). The model

further assumes that all synonymous substitutions are

neutral and independent of each other. The theoretical

underpinnings of models of this type are clearly

reviewed in (28±30) and the notation used here is

directly analogous whenever possible. Let fi (i�1, . . . , 20) be the frequency of the ith amino acid in the

consensus sequence, and mi be the number of codons

that can represent the ith amino acid de®ned by the

Testing Models of Synonymous Substitution 255

universal genetic code. Let Q be a symmetric 61� 61

instantaneous per site rate substitution matrix, whose

rows and columns correspond to the 61 sense codons,

and whose elements qij are calculated from 1 of 4

different equations:

If i� j then

qii � ÿX61

j�1j 6�i

qij

If i and j are synonymous codons that are related

by a single transitional nucleotide substitution then

qij� kbfk/mk, . . . , (k� 2Ts, and both codons code for

amino acid k and kb is the instantaneous transition

rate). If i and j are synonymous codons that are

related by a single transversional nucleotide substitu-

tion then qij� bfk/mk, . . . , (both codons code for

amino acid k and b is the instantaneous transversion

rate). If codons i and j code for different amino acids,

or are synonymous but are only interchangeable by

the exchange of more than one nucleotide then

qij� 0. The value of b is obtained by solving the

equationX61

i�1

X61

j�1j 6�i

qij ÿ 1 � 0

The matrix P(t), containing the per codon probability

of change from codon i to codon j in time t is given

by:

P�t� � eQt � UEUÿ1

where U is the matrix of eigenvectors of Q, and Eis the diagonal matrix with elements exp(lit), (i�1, . . . , 61) and li are the eigenvalues of Q. The

branch lengths, t, are obtained from the synonymous

distance tree. The ijth element of the P(t) matrix

determines the probability that codon i will incur

point mutation(s) resulting in its change to codon jover a branch of speci®ed length t. Use of this model

in combination with the synonymous distance tree

permits the generation of sets of sequences that have

been exposed to the same degree of synonymous

mutation and are related to each other in the same

way as the original dataset. In order to approximate

the observed codon usage, it was assumed that the

consensus sequence was the common ancestor to all

the samples, and it was placed at the root of the tree.

The expected average number of synonymous

substitutions per synonymous site for each nucleotide

position, E(�pi) (i� 1, . . . , n), can be estimated from

the simulated datasets in precisely the same way as

O(�pi) is estimated from the real dataset. Multiple data

sets of sequences simulated in this way can be used to

test the adequacy of the adopted model to account for

patterns in the real dataset as explained below.

Statistical Analysis

In order to test for differences between the observed

and expected multiple hit frequency distributions,

nucleotide sites were ranked in descending order of

their tendency to incur synonymous substitution. Thus

the one real observed such ranking (denoted obs~p1,

obs~p2, obs~p

3 � � � obs~pn, where obs~p

1 corresponds to the

number of synonymous substitutions per synonymous

site at the most synonymously variable site) was

compared to the mean distribution of rankings derived

from 200 simulated sequence datasets. The simulated

analyses yielded 200 values for the most synony-

mously variable nucleotide sites (denoted 1~p1, 2~p

1,

3~p1 � � � 200~p

1, the average of which is denoted exp~p1),

for the second most synonymously variable nucleo-

tide site (1~p2, 2~p

2, 3~p2 � � � 200~p

2, the average of which is

denoted exp~p2), the third, the ith (1~p

i, 2~pi, 3~p

i � � � 200~pi,

denoted exp~pi), and so on. By plotting E(2~p

i) and

E(199~pi), for each ranked position (i� 1, . . . , n), 99%

con®dence intervals for exp~pi may be obtained.

Similarly plotting E(5~pi) and E(195~p

i) for each

ranked position (i� 1, . . . , n) indicates the 95%

con®dence interval of exp~pi. These plots are made

easier to inspect if the y-axis is transformed so as to

represent the difference between the con®dence

intervals and the simulated mean rankings (i.e.

obs~pi ÿ2~p

i, obs~pi ÿ199~p

i; obs~pi ÿ5~p

i, obs~pi ÿ195~p

i)

for each site (for i� 1, . . . , n), and the real mean

rankings and the simulated mean rankings

(obs~piÿ exp~p

i) for each site (for i� 1, . . . , n). If the

difference between the observed and expected

rankings deviate beyond those indicated by the

con®dence intervals, the model by which the process

of synonymous substitution was emulated may be

rejected.

Using the simulated datasets it is possible to

calculate the mean and standard deviation of the

proportion of all differences between the simulated

sequences that were observed to be A,C, A,G,

A,U, C,G, C,U, or G,U and to determine by

use of a simple two-tailed t-test whether or not the

observed proportion of such differences in the real

256 Haydon, Knowles, and McCauley

dataset (deduced from synonymous codons differing

by only one base) are signi®cantly different to these

simulated means. This test provides an additional

opportunity to test the adequacy of the null model.

Test III. The ®nal test was constructed so as to

identify whether the observed pattern of synonymous

change was consistent with a model in which the

intensity of synonymous substitution was uniform

along the gene, or whether there was evidence for

``hot'' or ``cold'' regions of synonymous change. The

test can be used to identify such hot or cold spots at a

variety of spatial scales, depending on the size of a

sliding window that is run along the gene, within

which the number of synonymous substitutions is

summed. Let S be a vector of length n, composed of

the elements si�O(�pi). We calculated summed totals

of substitution activity within sliding windows of

varying length, L, as measured by the quantities:

VL � max�XL

j�1

si�j

��i � 0 to nÿ L�

and

UL � min�XL

j�1

si�j

��i � 0 to nÿ L�

for

L � 3; 6; 9; 12; . . . ; n=2

The test is conducted by sampling the elements of

S (without replacement) and reassigning them in

random order to a new vector S0 (with elements s0i).For each S0 we calculated the equivalent quantities:

V0L � max�XL

j�1

si�j

��i � 0 to nÿ L�

and

U0L � min�XL

j�1

si�j

��i � 0 to nÿ L�

for

L � 3; 6; 9; 12; . . . ; n=2

We computed the distributions of V0L and U0L from

1000 constructions of S0, and then asked whether the

observed values of VL and UL fell for suf®ciently far

into the tails of these distributions to indicate a sig-

ni®cant deviation from the randomized expectation.

Results

Test I. The comparison of ancestral/derived codons

across the entire consensus tree revealed that the

fraction of pairwise codon comparisons differing at

more than a single site was very low (1.4% for SVDV

3B/C, 2.3% for SVDV VP1, and 7.1% for FMDV

VP1). Therefore, the identi®cation of synonymous

substitutions was unambiguous for the most part. The

observed and expected multiple hit frequency dis-

tributions for the 3 genes for the 1st and 3rd positions

are graphed in Fig. 1, and the result of a chi-squared

goodness-of-®t test indicated on each ®gure. Thus of

the 222 codons constituting the 3B/C gene, 65 were

deduced to have remained unchanged, 62 received a

single synonymous substitution, 44 received 2, 23

received 3, 14 received 4, 9 received 5, 3 received 6,

and 2 received 7. The greater divergence of the

FMDV sequences is re¯ected by the observation that

only 26 out of 210 codons remained unchanged while

one codon was estimated to have incurred 12

independent synonymous substitutions. The null

model does not predict a signi®cantly different pattern

of multiple hits for the SVDV VP1 gene, but the

simple Poisson model can be rejected for the SVDV

3B/C and FMDV VP1 genes. For these two genes the

Poisson model consistently and signi®cantly under-

estimates the number of codons in both tails of the

multiple hit frequency distributions (i.e. those

receiving no substitutions and many substitutions).

This Poisson model fails because, given the overall

divergence within these genes, it predicts too few

codon positions incurring no synonymous change and

too few codon positions incurring many changes.

Test II. The transition:transversion ratio observed

from strictly synonymous substitutions was estimated

as 2.44:1 in FMDV VP1, 9.50:1 in SVDV VP1 and

11.58:1 in SVDV 3B/C. Fig. 2 illustrates the

distribution of observed synonymous substitutions

per synonymous site, O(�pi), along the 3 genes,

deduced from an all possible pairwise comparison of

sequences. There are no visually apparent indications

of non-randomness in these distributions. Figs. 3a,b,

and c show how the observed ranked values of O(�pi)

differ from the mean and con®dence limits expected

from the described simulation procedure. It is

apparent from these graphs that the observed

deviation from the expected ranked distribution

falls, for the most part, well within the predicted

Testing Models of Synonymous Substitution 257

Fig. 1. Graphical illustration of test I. The y-axis is the count of the number of codons that have incurred the number of synonymous

changes indicated on the x-axis, at 1st and 3rd position sites for the 3 different genes. The count is estimated from ancestral

reconstruction's inferred from a most parsimonious consensus tree. Results of chi-squared goodness of ®t tests are indicated.

258 Haydon, Knowles, and McCauley

Fig. 2. The graphs illustrate the distribution of observed synonymous substitutions per synonymous site, O(�pi), along the 3 genes, deduced

from an all possible pairwise comparison of sequences. All 2nd position codon sites are excluded (as no synonymous changes are possible

at 2nd position sites).

Testing Models of Synonymous Substitution 259

con®dence intervals and the simple evolutionary

model proposed is able to account for much of the

observed pattern of synonymous substitution in these

genes. The only exception is in the SVDV VP1 gene

(Fig. 3b), where a tendency for slightly fewer than

expected ``often changed'' sites results in a signi®cant

excess in the number of ``rarely hit'' sites. Test II

shows that the observed pattern of synonymous

substitution along this gene is a little more even

than is predicted by this model.

Table 1 contains the comparison of the observed

patterns of base exchange (corresponding to un-

ambiguously identi®ed synonymous differences

resulting from the pairwise comparison of all

Fig. 3. Graphical illustration of the results from test II. The y-axis represents the difference between the normalized con®dence intervals

and the normalized simulated mean rankings (i.e. obs~pi ÿ 2~p

i; obs~pi ÿ 199~p

i; obs~pi ÿ 5~p

i; obs~pi ÿ195~p

i) for each site (for i� 1, . . . , n), and

the normalized real mean rankings and the normalized simulated mean rankings (obs~pi ÿexp~p

i) for each site (for i� 1, . . . , n). Sites on the

left hand side of these ®gures incurred the most synonymous substitutions, those on the right received the least. If the difference between

the observed and expected rankings (dotted line) deviate beyond those differences indicated by the con®dence limits the model by which

the process of synonymous substitution was emulated may be rejected. a) FMDV VP1; b) SVDV VP1; c) SVDV 3B/C. All 2nd position

codon sites are excluded from the rankings (as no synonymous changes are possible at 2nd position sites).

a)

260 Haydon, Knowles, and McCauley

sequences) with the mean and standard deviations

expected from the patterns of base exchange that

occurred in the computer simulations. The overall

transition:transversion ratio was an input variable in

the computer simulation, but differences between the

frequencies of the two transitions, and the four

transversions may re¯ect non-neutral processes. The

FMDV VP1 gene is the only one in which there are no

signi®cant differences between the observed and

expected numbers of the different types of possible

base exchange. In the SVDV genes, the A,G

transitions are signi®cantly less common than

expected in both genes, and the C,U transitions

more common (but only signi®cantly so in the case of

VP1). Also signi®cantly more frequent than expected

is the A,U transversion in both SVDV genes.

Test III. The results from test III indicated that no

signi®cant spatial clumping (hotspots) or gaps (cold

spots) existed at any spatial scale for either of the two

SVDV genes. For the FMDV VP1 gene there was one

value of VL (for L� 15) which was signi®cant for

Fig. 3. (Continued)

b)

Testing Models of Synonymous Substitution 261

p5 0.03, that indicated a slight clumping of

synonymous substitution activity between nucleotide

sites 567 and 582. This clumping would not be

signi®cant if the appropriate Bonferroni corrections

were made (since 2 tests were carried out at each

spatial scale).

Discussion

There is considerable variation in the extent to which

codons for different amino acids may incur synony-

mous mutation (see Fig. 4). The most synonymously

mutable codon is the codon CGA (arg.); there are 8

possible sense mutations that can be made, and 4 are

synonymous (so the number of synonymous sites is 4/

8� 3). The least synonymously variable codons are

obviously TGG (trp.) and ATG (met.) which have zero

synonymous sites. Interestingly, it has been noted that

in some genes in which non-synonymous changes

might be of selective value, codon usage is biased

toward those with particularly low numbers of

synonymous sites (22,31). Test I models synonymous

substitution as a homogenous Poisson process,

Fig. 3. (Continued)

c)

262 Haydon, Knowles, and McCauley

thereby assuming that mutations occur independently

of each other at a constant rate and that all codons in a

gene are equally likely to incur synonymous

substitution. This second simpli®cation could easily

account for the failure of this model to explain the

multiple hit frequency distribution for the FMDV VP1

and SVDV 3B/C genes for 1st and 3rd position sites

combined. This test is somewhat dependent on the

exact topology of the ``user de®ned tree'' in the

DNAPARS program. However, substitution of the

most parsimonious consensus tree for various distance

based trees made no difference to the overall

conclusions drawn from the application of this test.

Test II permits codon usage to be taken somewhat

into account by using the consensus sequence of each

of the datasets as the ancestral sequence to the

simulated evolutionary process. However, deviations

from the consensus sequence are inevitable in

localized parts of the phylogenies of these viruses

and cannot be taken into account simply. For example,

a tendency to replace codons with a low propensity to

incur synonymous mutation ( perhaps ATG-methio-

nine) with those with high propensity to incur

synonymous mutation ( perhaps CTG-leucine) within

a clade in the phylogeny may explain why more sites

incur synonymous substitution in the SVDP VP1 gene

(see Fig. 3b) than expected. Test II permits the realistic

use of observed transition:transversion ratios, which

may also help to explain the larger degree of the

observed pattern in frequency distribution of multiply

hit sites accounted for by this test. There are no

generally recognized implications of transition bias

for viral evolution, although it is interesting to note

that of the 380 possible amino acid substitutions, 98

may be achieved by a single transversion, 44 by a

single transition, but only 8 by either a single transition

or transversion. Therefore, the set of amino acid

changes induced by a single transversion is almost

Table 1.

FMDV VP1

From/to A C G U

A 0.0647 (0.0109) 0.3049 (0.0213) 0.0637 (0.0106)

C 0.0833NS 0.0663 (0.0102) 0.4315 (0.0219)

G 0.3076NS 0.0653NS 0.0689 (0.0106)

U 0.0667NS 0.4000NS 0.0772NS

SVDV VP1

From/to A C G U

A 0.0244 (0.0074) 0.4231 (0.0209) 0.0219 (0.0071)

C 0.0287NS 0.0198 (0.0066) 0.4917 (0.0200)

G 0.3567p5 0.005 0.0066p5 0.05 0.0191 (0.0070)

U 0.0384p5 0.01 0.5494p5 0.005 0.0201NS

SVDV 3B/C

From/to A C G U

A 0.0149 (0.0084) 0.4826 (0.0277) 0.0112 (0.0064)

C 0.0091NS 0.0137 (0.0078) 0.4670 (0.0289)

G 0.3968p5 0.005 0.0115NS 0.0106 (0.0068)

U 0.0423p5 0.005 0.5235NS 0.0167NS

Observed proportion of homologous sites that differ by a synonymous point mutation of the type indicated in the table (lower triangle, below the

diagonal elements) compared to the mean and standard deviations (in parentheses) of those occurring in the simulated synonymous evolution

procedure developed as part of test II (upper triangle, above the diagonal elements). Direction of change cannot be inferred from a pairwise

comparison of sequences. Signi®cance determined by two tailed t tests. Proportions refer to the proportion of differences resulting from all

pairwise comparisons.

Testing Models of Synonymous Substitution 263

non-overlapping to the set induced by a single

transition. Furthermore, multiple transversions allow

a much more complete and direct way of accessing

different amino acids than multiple transitions.

The comparison of observed and predicted base

exchanges in Test II suggests that bulk nucleotide

ratios play no role in the evolution of the FMDV VP1

gene, but may be important in the evolution of SVDV

genes. Both SVDV genes exhibit a tendency for

increasing G-richness with time (see Fig. 5), thus the

lower than expected A,G transition frequency could

be accounted for by a selection pressure against

G!A. Biased hypermutation of this sort appears to

be quite common in RNAviruses (4±13), although the

nature of the biases and proposed explanations are

varied. One possible explanation is that such

hypermutation is catalyzed by cellular double

stranded RNA-dependent adenosine deaminase

activity (32). The conversion of A:U to I:U base

pairs in dsRNA results in the destabilization of the

substrate dsRNA. Conversion of A:U to A:I may lead

to degradation by the ribonuclease speci®c for

inosine-containing RNA which has recently been

described (33,34). The nature of selective pressures

on gross base composition of the genome remain

speculative, considerations of genome packing,

coding requirements, tRNA availability and genome

stability are logical possibilities. The use of this test

suggests that a substantial fraction of the observed

patterns of synonymous evolution in these virus genes

may be explained by simple models that take into

account codon usage, transition:transversion ratios

and the structure of the genetic code (as represented

by the structure of the Q matrix).

The existence of hot or cold spots would

be consistent with the idea that RNA secondary

structure imposes constraints on the synonymous

substitution process in stem regions more than in loop

regions. Test III indicates exceedingly little evidence

for the existence of such hot or cold spots along these

parts of the genome at any of the spatial scales

examined.

Overall, these results suggest that while synony-

mous point mutation in these virus genes may not be a

random and neutral process (as indicated by the

patterns of biased hypermutation in the SVDV genes),

the patterns that emerge are statistically hard to

distinguish from a process that is random and neutral.

The spatial occurrence of point mutation in these

genes appears to require no particularly complicated

explanations beyond those of differential codon usage

and transition:transversion bias.

Fig. 4. The number of synonymous sites for each of the 61 sense codons, as calculated by the method of Nei and Gojobori (25).

264 Haydon, Knowles, and McCauley

Fig. 5. The proportion of each of the SVDV genes that is encoded with guanine, plotted as a function of the year in which the samples

were taken. There is good evidence to suppose that this group of viruses forms a monophyletic group (see 23), the isolates clearly do not

represent independent samples.

Testing Models of Synonymous Substitution 265

References

1. Holland J.J., Spindler K., Horodyski F., Grabeu E., and Nichol

S., Vande Pol S Science 215, 1577±1585, 1982.

2. Domingo E. and Holland J.J. In Domingo E., Holland J.J., and

Ahlquist P. (eds). RNA genetics, Vol III. Variability of RNAgenomes. Boca Raton, FL: CRC Press, 1988, pp. 3±36.

3. Ramirez, B.-C., Barbier P., Seron K., Haenni A.-L., and

Bernardi F. In Gibbs A., Calisher C.H., and Garcia-Arenal F.

(eds) Molecular basis of virus evolution. Cambridge University

Press, 1995.

4. Doi H., Proc Natl Acad Sci USA 88, 9282±9286, 1991.

5. Kunkel T.A., Biochemistry 29, 8003±8011, 1990.

6. Burns D.P.W. and Temin H.M., J Virol 68, 4196±4203, 1994.

7. Vartanian J.P., Meyerhans A., Asjo B., and Wain-Hobson S.,

J Virol 65, 1779±1788, 1991.

8. Johnson P.R., Hamm T.E., Goldstein S., Kitov S., and Hirsch

V.M., Virology, 185, 217±28, 1991.

9. Pathak, V.K. and Temin H.M., Proc Natl Acad Sci USA 87,

6019±23, 1990.

10. Kunkel T.A. J Biol Chem 267, 18251±18254, 1992.

11. O'Hara P.J., Nichol S.T., Horodyski F.M., and Holland J.J., Cell

36, 915±924, 1984.

12. Murphy D.G., Dimock K., and Kang C.Y. Virology 181, 760±

763, 1991.

13. Cattaneo R., Schmid A., Spielhofer R., Kaelin K., Baczko K.,

ter Meulen V., Pardowitz J., Flanagan S., Rimna B.K., Udem

S.A., and Billeter M.A., Virology 173, 415±425, 1989.

14. Escarmis C., Davila M., Charpentier N., Bracho A., Moya A.,

and Domingo E., J Mol Biol 264, 255±267, 1996.

15. Seibert S.A., Howell C.Y., Hughes K.A., and Hughes A.L., Mol

Biol Evol 12, 803±813, 1995.

16. Chenault K.D. and Melcher U., Biochimie 76, 3±8, 1994.

17. Smith D.B. and Simmonds P. (1997). Submitted to J Mol Evol.

18. Ina Y., Mizokami M., Ohba K., and Gojobori T., J Mol Evol 38,

50±56, 1996.

19. Eyre-Walker A. and Bulmer M., Nucleic Acid Res 21, 4599±

4603, 1993.

20. Eyre-Walker A., J Mol Evol 39, 448±451, 1994.

21. Lawrence J.G., Hartl D.L., and Ocman H., J Mol Evol 33, 241±

250, 1991.

22. Haydon D.T., Lea S., Fry E., Knowles N., Samuel A.R., Stuart

D., and Woolhouse M.E.J. To appear in J Mol Evol.

23. Zhang G., Haydon D.T., Knowles N., and McCauley J.,

Molecular Evolution of Swine Vesicular Disease Virus.

Manuscript in preparation.

24. Felsenstein J., PHYLIP (Phylogeny inference package), version

3.5c. Department of Genetics, University of Washington,

Seattle, 1993.

25. Nei M. and Gojobori T., Mol Biol Evol 3, 418±426, 1986.

26. Kimura M., J Mol Evol 16, 111±120, 1980.

27. Hasegawa M., Kishino H., and Yano T., J Mol Evol 21, 160±

174, 1985.

28. Rambaut A. and Grassly N.C., Comput Appl Biosci 13, 235±

238, 1997.

29. Swofford D.L., Olsen G.J., Waddell P.J., and Hillis D.M. In

Hillis D.M., Moritz C., and Mable B. (eds). MolecularSystematics, 2nd edition. Sinuaur, 1996, pp. 432±438.

30. Lewis et al. Maximum Likelihood. Paupstar manual. Sinuaur.

In press.

31. Wagner S.D., Milstein C., and Neuberger M.S., Nature 376,

732, 1995.

32. Bass B.L., Weintraub H., Cattaneo R., and Billiter M.A., Cell

56, 331, 1989.

33. Scadden A.D.J. and Smith C.W.J. EMBO J 14, 3236±3246,

1995.

34. Scadden A.D.J. and Smith C.W.J. EMBO J 16, 2140±2149,

1997.

266 Haydon, Knowles, and McCauley