Methods for the detection of non-random base substitution in virus genes: models of synonymous...
-
Upload
independent -
Category
Documents
-
view
3 -
download
0
Transcript of Methods for the detection of non-random base substitution in virus genes: models of synonymous...
Virus Genes 16:3, 253±266, 1998
# 1998 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Methods for the Detection of Non-Random Base Substitution in Virus Genes:Models of Synonymous Nucleotide Substitution in Picornavirus Genes
DANIEL HAYDON,* NICK KNOWLES1 & JOHN MCCAULEY2
1Institute for Animal Health, Pirbright Laboratory, Ash Road, Pirbright, Woking GU24 0NF, UK
E-mail: [email protected] for Animal Health, Compton Laboratory, Newbury, Berkshire RG20 7NN, UK
E-mail: [email protected]
Received September 2, 1997; Accepted December 1, 1997
Abstract. A substantial fraction of phylogenetic divergence between closely related RNAvirus genes is generally
accounted for by synonymous (non-amino acid changing) point mutation. Viral evolution may be a complicated
phenomena, governed by many different processes. However in this study we ask whether there are any properties
in the patterns of synonymous nucleotide substitutions in three different Picornavirus genes that permit the process
of accumulation of synonymous point mutation in these genes to be distinguished from some of the simplest most
basic evolutionary models. We conclude that while the observed patterns in the occurrence of synonymous point
substitution are consistent with those predicted by a model in which base mutation is equi-probable along a gene,
and the probability of synonymous substitution determined only by local codon usage, some patterns in the actual
nucleotides exchanged remain to be explained.
Key words: synonymous substitution, picornavirus genes, neutral, models of evolution
Introduction
The high rate of base mis-match exhibited by RNA
viruses during gene replication is estimated to be
between 10ÿ 4 to 10ÿ 3 per nucleotide site per
replication (1,2) and suggests that point mutation
may be a major component of RNA virus evolution.
We conjecture that the simplest form of this
evolutionary change would be when point mutations
occur with equal probability along the gene, and are
subsequently subjected to no positive or negative
selection pressures. There are many possible reasons
to reject such a simplistic notion of evolution (as
reviewed in 3). The ®rst part of this conjecture may be
unrealistic: there are a number of reasons to suppose
that the probability of single nucleotide point
mutations may not occur with equal probability
along viral genes. For example, certain particular
sequences of nucleotides have been shown to incur
unusually high rates of point mutation within HIV-1
structural genes (4), and the local purine/pyrimidine
bias has also been related to observed mutation rate
(5,6). Biased hypermutation has been observed within
GpA dinucleotides resulting in a preponderance of
G!A mutations in human immunode®ciency virus-1
(7), simian immunode®ciency virus (8), and necrosis
provirus (9), possibly as a result of dislocation
mutagenesis (10). Similarly, biased hypermutation
has been observed from U!C in vesicular stomatitis
virus (11), and parain¯uenza virus (12); and from
A!G in viral genes associated with brain tissue
infected with measles virus (13).
The second part of the conjecture pre-supposes that
base substitutions are likely to be neutral. Single-base
nucleotide substitutions may be classi®ed as synony-
mous (resulting in no amino acid change) or non-
*Corresponding author: Dept of Zoology, University of British
Columbia, 6270 University Blvd., Vancouver, BC, Canada, V6T
1Z4 (e-mail: [email protected]).
The nucleotide sequence data used in this paper has been submitted
to GenBank.
synonymous (resulting in amino acid change). Non-
synonymous mutations are likely to be subjected to
purifying selection (as demonstrated with Foot and
Mouth Disease Virus, see 14), although there are
examples of point mutations in RNA viruses that are
subject to positive selection resulting from immune
surveillance (as demonstrated with HIV-1, see 15).
The most likely forms of nucleotide substitutions in
which neutral evolution may proceed are synonymous
point mutations. The occurrence of synonymous
mutations is subject to the in¯uence of codon usage.
As a result of the differential redundancy in the
genetic code some codons (e.g. those for serine,
leucine or argenine) are far more likely to incur
synonymous mutations than others (e.g. tyrosine or
cysteine). Despite the fact that synonymous base
mutations do not result in a change in the amino acid
sequence of protein coding genes, there are reasons to
suppose that they too may not be free from selection
pressures. The adopted secondary structure of the
RNA may impose constraints on permissible synony-
mous base substitutions, as suggested for some genes
of Caulimoviruses (16), and Hepatitis C virus (17).
Alternatively gross base composition of the genome
may itself be under the in¯uence of selective forces,
perhaps for reasons of genome packing, coding
requirements, tRNA availability or genome stability.
A further possible reason would be the existence of
overlapping reading frames, as has been suggested for
the unusually low rate of synonymous substitution
observed in the C gene of Hepatitus C (18).
Many of these complicating factors have been
proposed as explanations for the non-random patterns
of synonymous substitution observed within some
enterobacterial genes (e.g. 19±21).
Non-randomness in the occurrence of synonymous
nucleotide substitutions within genes may be identi-
®ed from related gene sequence data in two broadly
different ways: First, the frequency distribution of
multiply ``hit'' sites (i.e. how many homologous base
sites within the dataset were free from change, or
incurred one, two, three . . . N changes) may differ
from that expected from a simply de®ned mutational
model exclusive of mutation bias, hotspots and
subsequent selection. Second, given the observed
frequency distribution of multiply hit sites, it is
possible that these sites may be distributed along the
gene in a non-random pattern, i.e. the incidence of
observed substitutions may be unusually clustered, or
spaced out. In this study we conducted 3 different
statistically based tests to determine whether or not
the pattern of synonymous point mutations in the
RNA sequences of 3 different Picornavirus genes was
distinguishable from patterns predicted by very
simple neutral models of base substitution. We
conclude that within the genes we examined, there
is no evidence for any unexpected spatial clustering of
synonymous change and that the observed frequency
distributions of multiply hit sites in these genes can be
explained by a simple model of synonymous base
substitution that includes the observed codon usage
and transition transversion bias. However, these
simple models are unable to replicate some of the
patterns of biased hypermutation encountered in 2 of
the 3 datasets.
Materials and Methods
The Sequences
The analyses were conducted on 3 sets of sequences.
One data set contained sequences of the complete
(n� 630 nucleotides) VP1 gene from 55 isolates of
Foot and Mouth Disease virus serotype O (FMDV
VP1) obtained from around the world, over the last 70
years (EMBL accession numbers: AJ004645-86,
M89900, Z21862, M16088, K01201, M15981,
M21695, M15980, Z21860, M58601, M55287,
M15974, J02185, X00871). The other two data sets
contained sequences from the VP1 (n� 849 nucleo-
tides) and 3B/C (n� 666 nucleotides) genes of 45
isolates of Swine Vesicular Disease Virus (SVDV
VP1 and SVDV 3BC), obtained from around the
world over the last 30 years (VP1 gene EMBL
accession numbers: AJ004611-40, Y14464-76,
Y14481, D16364, D00435, and X54521; 3BC gene
EMBL accession numbers: AJ004564-604, D16364,
D00435, and X54521). Accession numbers and strain
names are collated in 22 and 23.
Analysis of Sequences
We applied 3 different tests to each of the 3 datasets.
Test I. This test was designed to determine whether
or not the observed frequency distribution of multiply
hit sites was consistent with that expected from a
simple Poisson model of synonymous substitution.
For each dataset we constructed the consensus tree of
254 Haydon, Knowles, and McCauley
all most parsimonious phylogenetic trees using the
program DNAPARS in the package PHYLIP 3.5c (24)
and determined the ancestral sequences at the internal
nodes of this consensus tree by supplying it as a user
de®ned tree to the same program. The output from this
procedure can be used to identify the predicted base
substitutions at each site along each branch of the tree.
This information can in turn be used to estimate how
many 1st and 3rd codon position sites received 0, 1, 2,
3, . . . N unambiguously identi®able synonymous
substitutions. This information was used to calculate
average synonymous hit rates for 1st and 3rd codon
positions combined. These per site averages were
used to construct the expected frequency distribution
of multiply hit sites if the same number of substitution
events was distributed over the appropriate number of
sites according to a Poisson distribution. The two
distributions can then be compared using a Chi-
squared test. If the expected multiple hit distribution is
found to differ from the observed distribution, it may
be concluded that the actual process of synonymous
substitution is not homogenous and independent over
all sites, as assumed by a Poisson process.
Test II. This test is technically complex, but the
underlying idea is simple. The aim is to simulate
datasets that are identical in many ways to the original
real datasets. They are identical in that real and
simulated datasets share almost identical consensus
sequences, levels of synonymous divergence and
phylogeny. They only differ in that patterns of
synonymous point substitution in the simulated
datasets are determined by a single speci®ed
evolutionary model, that makes explicit assumptions
about the process of synonymous evolution. The
adequacy of these assumptions may be examined
statistically by comparing the patterns of point
mutation in the real and simulated datasets.
Phylogeny Reconstruction
The number of synonymous substitutions per synony-
mous site at the ith nucleotide position, (pi, i� 1, . . . ,
n) was estimated by a site by site comparison of
sequence pairs using the method of Nei and Gojobori
(25), and then averaged over all possible pairwise
sequence comparisons at each nucleotide site to arrive
at an observed average number of synonymous
substitutions per synonymous site for each nucleotide
position, O(�pi), i� 1, . . . , n.
The average transition:transversion ratio (Ts) for
synonymous base substitutions was estimated from
the ratio of observed unambiguously synonymous
transitional and transversional differences between
codons that differ at only one position from all
possible pairwise sequence comparisons. This ratio
was then used to calculate a synonymous distance
matrix, containing the number of synonymous
substitutions per synonymous site by which the xth
and yth sequences differ from one another (ds,xy).
These distances were corrected for multiple hits, using
a modi®ed formulation of the Kimura two-parameter
model (26):
ds;xy �1
2ln� 1
1ÿ 2Fs ÿ Gs
�� 1
4ln� 1
1ÿ 2Gs
�where Fs is the estimated number of transitions per
synonymous site, and Gs the estimated number of
transversions per synonymous site. A phylogeny
based on these synonymous distances was then
constructed using the program FITCH in the
PHYLIP package (24).
Simulation of Sequences
A model of synonymous nucleotide change was
developed that is essentially an extension of a model
of nucleotide substitution proposed by Hasegawa et
al. (27) that permits different rates for transitions and
transversions with unequal base frequencies. We
extended this model to one that permits synonymous
codon substitution with unequal codon frequencies.
The model assumes that the probability of a mutation
occurring is identical and independent across all sites,
but that the probability that such a mutation is
synonymous is determined by the particular codon
to which the nucleotide belongs, and the transition
transversion ratio (as this obviously determines to
what alternative codon it may mutate). The model
further assumes that all synonymous substitutions are
neutral and independent of each other. The theoretical
underpinnings of models of this type are clearly
reviewed in (28±30) and the notation used here is
directly analogous whenever possible. Let fi (i�1, . . . , 20) be the frequency of the ith amino acid in the
consensus sequence, and mi be the number of codons
that can represent the ith amino acid de®ned by the
Testing Models of Synonymous Substitution 255
universal genetic code. Let Q be a symmetric 61� 61
instantaneous per site rate substitution matrix, whose
rows and columns correspond to the 61 sense codons,
and whose elements qij are calculated from 1 of 4
different equations:
If i� j then
qii � ÿX61
j�1j 6�i
qij
If i and j are synonymous codons that are related
by a single transitional nucleotide substitution then
qij� kbfk/mk, . . . , (k� 2Ts, and both codons code for
amino acid k and kb is the instantaneous transition
rate). If i and j are synonymous codons that are
related by a single transversional nucleotide substitu-
tion then qij� bfk/mk, . . . , (both codons code for
amino acid k and b is the instantaneous transversion
rate). If codons i and j code for different amino acids,
or are synonymous but are only interchangeable by
the exchange of more than one nucleotide then
qij� 0. The value of b is obtained by solving the
equationX61
i�1
X61
j�1j 6�i
qij ÿ 1 � 0
The matrix P(t), containing the per codon probability
of change from codon i to codon j in time t is given
by:
P�t� � eQt � UEUÿ1
where U is the matrix of eigenvectors of Q, and Eis the diagonal matrix with elements exp(lit), (i�1, . . . , 61) and li are the eigenvalues of Q. The
branch lengths, t, are obtained from the synonymous
distance tree. The ijth element of the P(t) matrix
determines the probability that codon i will incur
point mutation(s) resulting in its change to codon jover a branch of speci®ed length t. Use of this model
in combination with the synonymous distance tree
permits the generation of sets of sequences that have
been exposed to the same degree of synonymous
mutation and are related to each other in the same
way as the original dataset. In order to approximate
the observed codon usage, it was assumed that the
consensus sequence was the common ancestor to all
the samples, and it was placed at the root of the tree.
The expected average number of synonymous
substitutions per synonymous site for each nucleotide
position, E(�pi) (i� 1, . . . , n), can be estimated from
the simulated datasets in precisely the same way as
O(�pi) is estimated from the real dataset. Multiple data
sets of sequences simulated in this way can be used to
test the adequacy of the adopted model to account for
patterns in the real dataset as explained below.
Statistical Analysis
In order to test for differences between the observed
and expected multiple hit frequency distributions,
nucleotide sites were ranked in descending order of
their tendency to incur synonymous substitution. Thus
the one real observed such ranking (denoted obs~p1,
obs~p2, obs~p
3 � � � obs~pn, where obs~p
1 corresponds to the
number of synonymous substitutions per synonymous
site at the most synonymously variable site) was
compared to the mean distribution of rankings derived
from 200 simulated sequence datasets. The simulated
analyses yielded 200 values for the most synony-
mously variable nucleotide sites (denoted 1~p1, 2~p
1,
3~p1 � � � 200~p
1, the average of which is denoted exp~p1),
for the second most synonymously variable nucleo-
tide site (1~p2, 2~p
2, 3~p2 � � � 200~p
2, the average of which is
denoted exp~p2), the third, the ith (1~p
i, 2~pi, 3~p
i � � � 200~pi,
denoted exp~pi), and so on. By plotting E(2~p
i) and
E(199~pi), for each ranked position (i� 1, . . . , n), 99%
con®dence intervals for exp~pi may be obtained.
Similarly plotting E(5~pi) and E(195~p
i) for each
ranked position (i� 1, . . . , n) indicates the 95%
con®dence interval of exp~pi. These plots are made
easier to inspect if the y-axis is transformed so as to
represent the difference between the con®dence
intervals and the simulated mean rankings (i.e.
obs~pi ÿ2~p
i, obs~pi ÿ199~p
i; obs~pi ÿ5~p
i, obs~pi ÿ195~p
i)
for each site (for i� 1, . . . , n), and the real mean
rankings and the simulated mean rankings
(obs~piÿ exp~p
i) for each site (for i� 1, . . . , n). If the
difference between the observed and expected
rankings deviate beyond those indicated by the
con®dence intervals, the model by which the process
of synonymous substitution was emulated may be
rejected.
Using the simulated datasets it is possible to
calculate the mean and standard deviation of the
proportion of all differences between the simulated
sequences that were observed to be A,C, A,G,
A,U, C,G, C,U, or G,U and to determine by
use of a simple two-tailed t-test whether or not the
observed proportion of such differences in the real
256 Haydon, Knowles, and McCauley
dataset (deduced from synonymous codons differing
by only one base) are signi®cantly different to these
simulated means. This test provides an additional
opportunity to test the adequacy of the null model.
Test III. The ®nal test was constructed so as to
identify whether the observed pattern of synonymous
change was consistent with a model in which the
intensity of synonymous substitution was uniform
along the gene, or whether there was evidence for
``hot'' or ``cold'' regions of synonymous change. The
test can be used to identify such hot or cold spots at a
variety of spatial scales, depending on the size of a
sliding window that is run along the gene, within
which the number of synonymous substitutions is
summed. Let S be a vector of length n, composed of
the elements si�O(�pi). We calculated summed totals
of substitution activity within sliding windows of
varying length, L, as measured by the quantities:
VL � max�XL
j�1
si�j
��i � 0 to nÿ L�
and
UL � min�XL
j�1
si�j
��i � 0 to nÿ L�
for
L � 3; 6; 9; 12; . . . ; n=2
The test is conducted by sampling the elements of
S (without replacement) and reassigning them in
random order to a new vector S0 (with elements s0i).For each S0 we calculated the equivalent quantities:
V0L � max�XL
j�1
si�j
��i � 0 to nÿ L�
and
U0L � min�XL
j�1
si�j
��i � 0 to nÿ L�
for
L � 3; 6; 9; 12; . . . ; n=2
We computed the distributions of V0L and U0L from
1000 constructions of S0, and then asked whether the
observed values of VL and UL fell for suf®ciently far
into the tails of these distributions to indicate a sig-
ni®cant deviation from the randomized expectation.
Results
Test I. The comparison of ancestral/derived codons
across the entire consensus tree revealed that the
fraction of pairwise codon comparisons differing at
more than a single site was very low (1.4% for SVDV
3B/C, 2.3% for SVDV VP1, and 7.1% for FMDV
VP1). Therefore, the identi®cation of synonymous
substitutions was unambiguous for the most part. The
observed and expected multiple hit frequency dis-
tributions for the 3 genes for the 1st and 3rd positions
are graphed in Fig. 1, and the result of a chi-squared
goodness-of-®t test indicated on each ®gure. Thus of
the 222 codons constituting the 3B/C gene, 65 were
deduced to have remained unchanged, 62 received a
single synonymous substitution, 44 received 2, 23
received 3, 14 received 4, 9 received 5, 3 received 6,
and 2 received 7. The greater divergence of the
FMDV sequences is re¯ected by the observation that
only 26 out of 210 codons remained unchanged while
one codon was estimated to have incurred 12
independent synonymous substitutions. The null
model does not predict a signi®cantly different pattern
of multiple hits for the SVDV VP1 gene, but the
simple Poisson model can be rejected for the SVDV
3B/C and FMDV VP1 genes. For these two genes the
Poisson model consistently and signi®cantly under-
estimates the number of codons in both tails of the
multiple hit frequency distributions (i.e. those
receiving no substitutions and many substitutions).
This Poisson model fails because, given the overall
divergence within these genes, it predicts too few
codon positions incurring no synonymous change and
too few codon positions incurring many changes.
Test II. The transition:transversion ratio observed
from strictly synonymous substitutions was estimated
as 2.44:1 in FMDV VP1, 9.50:1 in SVDV VP1 and
11.58:1 in SVDV 3B/C. Fig. 2 illustrates the
distribution of observed synonymous substitutions
per synonymous site, O(�pi), along the 3 genes,
deduced from an all possible pairwise comparison of
sequences. There are no visually apparent indications
of non-randomness in these distributions. Figs. 3a,b,
and c show how the observed ranked values of O(�pi)
differ from the mean and con®dence limits expected
from the described simulation procedure. It is
apparent from these graphs that the observed
deviation from the expected ranked distribution
falls, for the most part, well within the predicted
Testing Models of Synonymous Substitution 257
Fig. 1. Graphical illustration of test I. The y-axis is the count of the number of codons that have incurred the number of synonymous
changes indicated on the x-axis, at 1st and 3rd position sites for the 3 different genes. The count is estimated from ancestral
reconstruction's inferred from a most parsimonious consensus tree. Results of chi-squared goodness of ®t tests are indicated.
258 Haydon, Knowles, and McCauley
Fig. 2. The graphs illustrate the distribution of observed synonymous substitutions per synonymous site, O(�pi), along the 3 genes, deduced
from an all possible pairwise comparison of sequences. All 2nd position codon sites are excluded (as no synonymous changes are possible
at 2nd position sites).
Testing Models of Synonymous Substitution 259
con®dence intervals and the simple evolutionary
model proposed is able to account for much of the
observed pattern of synonymous substitution in these
genes. The only exception is in the SVDV VP1 gene
(Fig. 3b), where a tendency for slightly fewer than
expected ``often changed'' sites results in a signi®cant
excess in the number of ``rarely hit'' sites. Test II
shows that the observed pattern of synonymous
substitution along this gene is a little more even
than is predicted by this model.
Table 1 contains the comparison of the observed
patterns of base exchange (corresponding to un-
ambiguously identi®ed synonymous differences
resulting from the pairwise comparison of all
Fig. 3. Graphical illustration of the results from test II. The y-axis represents the difference between the normalized con®dence intervals
and the normalized simulated mean rankings (i.e. obs~pi ÿ 2~p
i; obs~pi ÿ 199~p
i; obs~pi ÿ 5~p
i; obs~pi ÿ195~p
i) for each site (for i� 1, . . . , n), and
the normalized real mean rankings and the normalized simulated mean rankings (obs~pi ÿexp~p
i) for each site (for i� 1, . . . , n). Sites on the
left hand side of these ®gures incurred the most synonymous substitutions, those on the right received the least. If the difference between
the observed and expected rankings (dotted line) deviate beyond those differences indicated by the con®dence limits the model by which
the process of synonymous substitution was emulated may be rejected. a) FMDV VP1; b) SVDV VP1; c) SVDV 3B/C. All 2nd position
codon sites are excluded from the rankings (as no synonymous changes are possible at 2nd position sites).
a)
260 Haydon, Knowles, and McCauley
sequences) with the mean and standard deviations
expected from the patterns of base exchange that
occurred in the computer simulations. The overall
transition:transversion ratio was an input variable in
the computer simulation, but differences between the
frequencies of the two transitions, and the four
transversions may re¯ect non-neutral processes. The
FMDV VP1 gene is the only one in which there are no
signi®cant differences between the observed and
expected numbers of the different types of possible
base exchange. In the SVDV genes, the A,G
transitions are signi®cantly less common than
expected in both genes, and the C,U transitions
more common (but only signi®cantly so in the case of
VP1). Also signi®cantly more frequent than expected
is the A,U transversion in both SVDV genes.
Test III. The results from test III indicated that no
signi®cant spatial clumping (hotspots) or gaps (cold
spots) existed at any spatial scale for either of the two
SVDV genes. For the FMDV VP1 gene there was one
value of VL (for L� 15) which was signi®cant for
Fig. 3. (Continued)
b)
Testing Models of Synonymous Substitution 261
p5 0.03, that indicated a slight clumping of
synonymous substitution activity between nucleotide
sites 567 and 582. This clumping would not be
signi®cant if the appropriate Bonferroni corrections
were made (since 2 tests were carried out at each
spatial scale).
Discussion
There is considerable variation in the extent to which
codons for different amino acids may incur synony-
mous mutation (see Fig. 4). The most synonymously
mutable codon is the codon CGA (arg.); there are 8
possible sense mutations that can be made, and 4 are
synonymous (so the number of synonymous sites is 4/
8� 3). The least synonymously variable codons are
obviously TGG (trp.) and ATG (met.) which have zero
synonymous sites. Interestingly, it has been noted that
in some genes in which non-synonymous changes
might be of selective value, codon usage is biased
toward those with particularly low numbers of
synonymous sites (22,31). Test I models synonymous
substitution as a homogenous Poisson process,
Fig. 3. (Continued)
c)
262 Haydon, Knowles, and McCauley
thereby assuming that mutations occur independently
of each other at a constant rate and that all codons in a
gene are equally likely to incur synonymous
substitution. This second simpli®cation could easily
account for the failure of this model to explain the
multiple hit frequency distribution for the FMDV VP1
and SVDV 3B/C genes for 1st and 3rd position sites
combined. This test is somewhat dependent on the
exact topology of the ``user de®ned tree'' in the
DNAPARS program. However, substitution of the
most parsimonious consensus tree for various distance
based trees made no difference to the overall
conclusions drawn from the application of this test.
Test II permits codon usage to be taken somewhat
into account by using the consensus sequence of each
of the datasets as the ancestral sequence to the
simulated evolutionary process. However, deviations
from the consensus sequence are inevitable in
localized parts of the phylogenies of these viruses
and cannot be taken into account simply. For example,
a tendency to replace codons with a low propensity to
incur synonymous mutation ( perhaps ATG-methio-
nine) with those with high propensity to incur
synonymous mutation ( perhaps CTG-leucine) within
a clade in the phylogeny may explain why more sites
incur synonymous substitution in the SVDP VP1 gene
(see Fig. 3b) than expected. Test II permits the realistic
use of observed transition:transversion ratios, which
may also help to explain the larger degree of the
observed pattern in frequency distribution of multiply
hit sites accounted for by this test. There are no
generally recognized implications of transition bias
for viral evolution, although it is interesting to note
that of the 380 possible amino acid substitutions, 98
may be achieved by a single transversion, 44 by a
single transition, but only 8 by either a single transition
or transversion. Therefore, the set of amino acid
changes induced by a single transversion is almost
Table 1.
FMDV VP1
From/to A C G U
A 0.0647 (0.0109) 0.3049 (0.0213) 0.0637 (0.0106)
C 0.0833NS 0.0663 (0.0102) 0.4315 (0.0219)
G 0.3076NS 0.0653NS 0.0689 (0.0106)
U 0.0667NS 0.4000NS 0.0772NS
SVDV VP1
From/to A C G U
A 0.0244 (0.0074) 0.4231 (0.0209) 0.0219 (0.0071)
C 0.0287NS 0.0198 (0.0066) 0.4917 (0.0200)
G 0.3567p5 0.005 0.0066p5 0.05 0.0191 (0.0070)
U 0.0384p5 0.01 0.5494p5 0.005 0.0201NS
SVDV 3B/C
From/to A C G U
A 0.0149 (0.0084) 0.4826 (0.0277) 0.0112 (0.0064)
C 0.0091NS 0.0137 (0.0078) 0.4670 (0.0289)
G 0.3968p5 0.005 0.0115NS 0.0106 (0.0068)
U 0.0423p5 0.005 0.5235NS 0.0167NS
Observed proportion of homologous sites that differ by a synonymous point mutation of the type indicated in the table (lower triangle, below the
diagonal elements) compared to the mean and standard deviations (in parentheses) of those occurring in the simulated synonymous evolution
procedure developed as part of test II (upper triangle, above the diagonal elements). Direction of change cannot be inferred from a pairwise
comparison of sequences. Signi®cance determined by two tailed t tests. Proportions refer to the proportion of differences resulting from all
pairwise comparisons.
Testing Models of Synonymous Substitution 263
non-overlapping to the set induced by a single
transition. Furthermore, multiple transversions allow
a much more complete and direct way of accessing
different amino acids than multiple transitions.
The comparison of observed and predicted base
exchanges in Test II suggests that bulk nucleotide
ratios play no role in the evolution of the FMDV VP1
gene, but may be important in the evolution of SVDV
genes. Both SVDV genes exhibit a tendency for
increasing G-richness with time (see Fig. 5), thus the
lower than expected A,G transition frequency could
be accounted for by a selection pressure against
G!A. Biased hypermutation of this sort appears to
be quite common in RNAviruses (4±13), although the
nature of the biases and proposed explanations are
varied. One possible explanation is that such
hypermutation is catalyzed by cellular double
stranded RNA-dependent adenosine deaminase
activity (32). The conversion of A:U to I:U base
pairs in dsRNA results in the destabilization of the
substrate dsRNA. Conversion of A:U to A:I may lead
to degradation by the ribonuclease speci®c for
inosine-containing RNA which has recently been
described (33,34). The nature of selective pressures
on gross base composition of the genome remain
speculative, considerations of genome packing,
coding requirements, tRNA availability and genome
stability are logical possibilities. The use of this test
suggests that a substantial fraction of the observed
patterns of synonymous evolution in these virus genes
may be explained by simple models that take into
account codon usage, transition:transversion ratios
and the structure of the genetic code (as represented
by the structure of the Q matrix).
The existence of hot or cold spots would
be consistent with the idea that RNA secondary
structure imposes constraints on the synonymous
substitution process in stem regions more than in loop
regions. Test III indicates exceedingly little evidence
for the existence of such hot or cold spots along these
parts of the genome at any of the spatial scales
examined.
Overall, these results suggest that while synony-
mous point mutation in these virus genes may not be a
random and neutral process (as indicated by the
patterns of biased hypermutation in the SVDV genes),
the patterns that emerge are statistically hard to
distinguish from a process that is random and neutral.
The spatial occurrence of point mutation in these
genes appears to require no particularly complicated
explanations beyond those of differential codon usage
and transition:transversion bias.
Fig. 4. The number of synonymous sites for each of the 61 sense codons, as calculated by the method of Nei and Gojobori (25).
264 Haydon, Knowles, and McCauley
Fig. 5. The proportion of each of the SVDV genes that is encoded with guanine, plotted as a function of the year in which the samples
were taken. There is good evidence to suppose that this group of viruses forms a monophyletic group (see 23), the isolates clearly do not
represent independent samples.
Testing Models of Synonymous Substitution 265
References
1. Holland J.J., Spindler K., Horodyski F., Grabeu E., and Nichol
S., Vande Pol S Science 215, 1577±1585, 1982.
2. Domingo E. and Holland J.J. In Domingo E., Holland J.J., and
Ahlquist P. (eds). RNA genetics, Vol III. Variability of RNAgenomes. Boca Raton, FL: CRC Press, 1988, pp. 3±36.
3. Ramirez, B.-C., Barbier P., Seron K., Haenni A.-L., and
Bernardi F. In Gibbs A., Calisher C.H., and Garcia-Arenal F.
(eds) Molecular basis of virus evolution. Cambridge University
Press, 1995.
4. Doi H., Proc Natl Acad Sci USA 88, 9282±9286, 1991.
5. Kunkel T.A., Biochemistry 29, 8003±8011, 1990.
6. Burns D.P.W. and Temin H.M., J Virol 68, 4196±4203, 1994.
7. Vartanian J.P., Meyerhans A., Asjo B., and Wain-Hobson S.,
J Virol 65, 1779±1788, 1991.
8. Johnson P.R., Hamm T.E., Goldstein S., Kitov S., and Hirsch
V.M., Virology, 185, 217±28, 1991.
9. Pathak, V.K. and Temin H.M., Proc Natl Acad Sci USA 87,
6019±23, 1990.
10. Kunkel T.A. J Biol Chem 267, 18251±18254, 1992.
11. O'Hara P.J., Nichol S.T., Horodyski F.M., and Holland J.J., Cell
36, 915±924, 1984.
12. Murphy D.G., Dimock K., and Kang C.Y. Virology 181, 760±
763, 1991.
13. Cattaneo R., Schmid A., Spielhofer R., Kaelin K., Baczko K.,
ter Meulen V., Pardowitz J., Flanagan S., Rimna B.K., Udem
S.A., and Billeter M.A., Virology 173, 415±425, 1989.
14. Escarmis C., Davila M., Charpentier N., Bracho A., Moya A.,
and Domingo E., J Mol Biol 264, 255±267, 1996.
15. Seibert S.A., Howell C.Y., Hughes K.A., and Hughes A.L., Mol
Biol Evol 12, 803±813, 1995.
16. Chenault K.D. and Melcher U., Biochimie 76, 3±8, 1994.
17. Smith D.B. and Simmonds P. (1997). Submitted to J Mol Evol.
18. Ina Y., Mizokami M., Ohba K., and Gojobori T., J Mol Evol 38,
50±56, 1996.
19. Eyre-Walker A. and Bulmer M., Nucleic Acid Res 21, 4599±
4603, 1993.
20. Eyre-Walker A., J Mol Evol 39, 448±451, 1994.
21. Lawrence J.G., Hartl D.L., and Ocman H., J Mol Evol 33, 241±
250, 1991.
22. Haydon D.T., Lea S., Fry E., Knowles N., Samuel A.R., Stuart
D., and Woolhouse M.E.J. To appear in J Mol Evol.
23. Zhang G., Haydon D.T., Knowles N., and McCauley J.,
Molecular Evolution of Swine Vesicular Disease Virus.
Manuscript in preparation.
24. Felsenstein J., PHYLIP (Phylogeny inference package), version
3.5c. Department of Genetics, University of Washington,
Seattle, 1993.
25. Nei M. and Gojobori T., Mol Biol Evol 3, 418±426, 1986.
26. Kimura M., J Mol Evol 16, 111±120, 1980.
27. Hasegawa M., Kishino H., and Yano T., J Mol Evol 21, 160±
174, 1985.
28. Rambaut A. and Grassly N.C., Comput Appl Biosci 13, 235±
238, 1997.
29. Swofford D.L., Olsen G.J., Waddell P.J., and Hillis D.M. In
Hillis D.M., Moritz C., and Mable B. (eds). MolecularSystematics, 2nd edition. Sinuaur, 1996, pp. 432±438.
30. Lewis et al. Maximum Likelihood. Paupstar manual. Sinuaur.
In press.
31. Wagner S.D., Milstein C., and Neuberger M.S., Nature 376,
732, 1995.
32. Bass B.L., Weintraub H., Cattaneo R., and Billiter M.A., Cell
56, 331, 1989.
33. Scadden A.D.J. and Smith C.W.J. EMBO J 14, 3236±3246,
1995.
34. Scadden A.D.J. and Smith C.W.J. EMBO J 16, 2140±2149,
1997.
266 Haydon, Knowles, and McCauley