Symmetry: Founding editors: G. Darvas and D. Nagy The journal of the Symmetrion
Editor: György Darvas
Volume 23, Numbers 3-4,225-448, 2012
SYMMETRIES IN GENETIC INFORMATION AND ALGEBRAIC BIOLOGY
CONTENTS
ANNOUNCEMENT Symmetry Festival 2013, 2-7 August, Delft, The Netherlands 228
EDITORIAL, Sergey Petoukhov 229
SYMMETRY IN SCIENCE AND ART
Genome symmetries, Paul Dan Cristea 233
Symmetry of mitochondrial DNA. The case of COXn genes in primates and carnivores, Teodora Popovici and Paul Dan Cristea 255
Symmetries of the genetic code, hypercomplex numbers and genetic matrices with internal complementarities, Sergey V. Petoukhov 275
Fractal genetic nets and symmetry principles in long nucleotide sequences, S.V. Petoukhov, V.I. Svirin 303
A Markov information source for the syntactic characterization of amino acid substitutions in protein evolution, Miguel A. Jiménez-Montaño 323
Symmetries in molecular-genetic systems and musical harmony, G. Darvas, A.A. Koblyakov, S.V. Petoukhov, I.V. Stepanian 343
Modeling “cognition” with nonlinear dynamic systems, Yuri V. Andreyev, Alexander S. Dmitriev 377
The irregular (integer) tetrahedron as a warehouse of biological information, Tidjani Négadi 403
Theory of topological coding of proteins and nature of antisymmetry of the amino acids canonical set, Vladimir A. Karasev 427
SYMMETRY: CULTURE AND SCIENCE is the journal of and is published by the Symmetrion, http://symmetry.hu/. Edition is backed by the Executive Board and the Advisory Board (http://symmetry.hu/isa_leadership.html) of the International Symmetry Association. The views expressed are those of individual authors, and not necessarily shared by the boards and the editor.
© Symmetrion. No part of this publication may be reproduced without written permission from the publisher.
ISSN 0865-4824 – print version
ISSN 2226-1877 – electronic version
Cover layout: Günter Schmitz; Image on the front cover: Matjuska Teja Krasek: Star(s) for Donald, 2000, (tribute to H.S.M. Coxeter);
Images on the back cover: Matjuska Teja Krasek: Twinstar and Octapent; Ambigram on the back cover: Douglas R. Hofstadter.
Symmetry: Culture and Science Vol. 23, Nos. 3-4, 323-342, 2012
A MARKOV INFORMATION SOURCE FOR THE SYNTACTIC CHARACTERIZATION OF AMINO ACID
SUBSTITUTIONS IN PROTEIN EVOLUTION
Miguel A. Jiménez-Montaño
BioPhysicist, (b. México, D.F., MEXICO, 1941).
Address: Faculty of Physics and Artificial Intelligence, University of Veracruz, Sebastián Camacho # 5, Col. Centro, C.P. 91000, Xalapa, Ver., México. E-mail: ajimenez@uv.mx.
Fields of interest: The structure of the genetic code, technological evolution and informational measures and algorithmic complexity of sequences of symbols and nerve signals.
Awards: Research Award, 1989; Dean Award, 2004; both from Universidad Veracruzana. Fulbright Fellow, 1982. First Prize National Contest on Scientific Non-technical Essay, 1990.
Publications: Ebeling W., Jiménez-Montaño M. A. (1980)*. On Grammars, Complexity, and Information Measures of Biological Macromolecules. Mathematical Biosciences, Vol. 52: 53-71. Jiménez-Montaño M. A. (1984). On the Syntactic Structure of Protein Sequences, and the Concept of Grammar Complexity. Bulletin of Mathematical Biology, Vol. 46: 641-660. Jiménez-Montaño M. A., de la Mora-Basáñez R., Pöschel T. (1996)*. The Hypercube Structure of the Genetic Code Explains Conservative and Non-Conservative Aminoacid Substitutions in Vivo and in Vitro. BioSystems, Vol. 39: 117-125. Jiménez-Montaño M. A. (1999)*. Protein Evolution Drives the Evolution of the Genetic Code and Vice Versa. BioSystems, Vol. 54: 47-64. Weiss O., Jiménez-Montaño M. A., Herzel H. (2000). Information content of protein sequences. J. Theor. Biol., Vol. 206: 379-386. Jiménez-Montaño M. A. (2004). Applications of Hyper Genetic Code to Bioinformatics. Journal of Biological Systems, Vol. 12: 5-20. Jiménez-Montaño M. A. (2009)*. The fourfold way of the genetic code. BioSystems, Vol. 98 (2): 105-114.
Abstract: We introduce a theoretical model, which consists of a Markov Information
Source that generates codon sequences, and from them amino acid sequences, that
maintain the same or very similar functions and structures, as a direct consequence of
the structure of the genetic code, and general physical chemical constraints. With the
help of the model, we propose a codon dendrogram to describe a hierarchy of codon
categorizations, which explain the pattern of frequent amino acid substitutions in short-
term evolution.
Keywords: Markov source, genetic code, codon, amino acid, protein evolution.
M. A. JIMÉNEZ-MONTAÑO 324
1. INTRODUCTION
Understanding protein evolution remains as much a major challenge in molecular biology
today as it was a decade ago (Dokholyan and Shakhnovich, 2001 and references therein),
despite the huge amount of data on genes, protein sequences and structures
presently available. Our knowledge of the relation between the genotype (DNA coding
for a protein) and the phenotype (a protein’s structure and its pattern of specific traits
related to its biological function), which is central to the Theory of Evolution and all
biology, is still at a very primitive stage (Thorne and Goldman, 2001; Wagner, 2012).
While the mechanisms of mutations in DNA sequences that code for proteins are known
(Parkhomchuk et al., 2009; Skipper et al., 2012), the contribution of the genetic code to
creating new information, as against the part played by natural selection in fixing it in
the population, is not fully appreciated. According to Abel and Trevors (2006),
“Genetic prescription of computation precedes and produces phenotypic realization.
And this prescription is ‘written in stone’.” Only recently has the full complexity of
the genotype/phenotype map been recognized in the literature (Crutchfield and Schuster,
2003).
According to DePristo et al. (2005), “Taken as a whole, recent findings from
biochemistry and evolutionary biology indicate that our understanding of protein
evolution is incomplete, if not fundamentally flawed”. They suggest joining the fields of
protein biophysics and molecular evolution by highlighting the shared questions. In the
same line of thought, Pál et al. (2006) argue that an integrated view of this field should
embrace genomic, structural and population levels of description. These different levels
are displayed graphically in Fig. 1. However, two difficulties stand in the way of this aim:
on the one hand, these levels belong to fields of knowledge with radically different
conceptual frameworks; on the other, the relationship between physics and biology is
degenerate. It is well known that many proteins with no apparent sequence similarity
display the same folds (Kleiger et al., 2000). Thus, the many-to-one map M(Ai/S),
depicted in Fig. 1, gives the amino acid at position i of any of the sequences
corresponding to the given structure (fold) S. In the concise statement by Hietpas et al.
(2011): “Biology is governed by physical interactions, but biological requirements can
have multiple physical solutions”.
AMINO ACID SUBSTITUTIONS IN PROTEIN EVOLUTION 325
Figure 1: The conceptual scheme for protein evolution. The connection between the domain of population genetics (where Darwinian selection operates and true evolution occurs) and the genome of an organism (where mutations, an essential ‘raw material’ of evolution, occur) is mediated by the physics and chemistry of proteins. The complexity implicit in this diagram comes from the degenerate relationship between physics and biology. Given a population, e.g. of virus quasi-species with frequencies {fi, i = 1, 2, …, n} and corresponding phenotypes {ψi, i = 1, 2, …, n}, M(S/ψi) specifies the common structure S of a protein family (the set π of orthologous proteins associated with phenotype ψi). Then M(Ai/S) is the amino acid at position i of any of the sequences corresponding to the structure S. In the same way, M(Ci/Ai) gives one of the codons coding for the amino acid, according to the genetic code. Let Cm be the chosen codon; then M(C’m/Cm) represents the single-nucleotide mutation from codon Cm to codon C’m, obeying the structure of the genetic code. In turn, M(A’j/C’m) gives the mutated amino acid A’j, coded by C’m, and M(S’/A’j) maps A’j to the corresponding new structure S’. This modified protein structure is mapped to a new phenotype ψ+ through M(ψ+/S’), which through Darwinian selection starts the evolutionary cycle again. Note that there is no direct feedback from the old to the new structure: proteins are concrete molecules that do not evolve.
[Figure 1 diagram labels: Environment; Natural selection; Neutral evolution; Virus quasi-species f1, f2, …, fn and ψ1, ψ2, …, ψn (concentration space); Population genetics (phenotype); Codon usage; Molecular biology and bioinformatics (genotype); DNA sequence space (genotypes); Translation apparatus; Genetic code space (hypercube), mutations; Protein sequence space (coding of structure); Protein structure (form space); Biophysics and biochemistry (thermodynamics). Legend: S = 3-d structure of the protein; Ai = amino acid at site i of the protein sequence; Ci = codon at site i of the protein; fi = frequency of species i; ψi = phenotype of species i.]
The main purpose of the present work is to provide a theoretical model, built upon an
empirical codon substitution matrix (Schneider et al., 2005), which explains the pattern
of amino acid substitutions in proteins that maintain the same or very similar functions
and structures, as a direct consequence of the structure of the genetic code
(Jiménez-Montaño, 1994), which controls the possible amino acid changes from
single-nucleotide mutations, and of general physical-chemical constraints, which are
responsible for the stability of the protein.
2. MODELS OF PROTEIN EVOLUTION
2.1 Amino acid models of protein evolution
Notwithstanding the complexity of protein evolution, the phenomenological approach to
amino acid substitutions in protein families, which started with the empirical work of
Margaret Dayhoff and her colleagues (1978), is still widely used for a wide range of
applications such as database search, sequence alignment, protein family classification
and phylogenetic inference, among many others. Following in Dayhoff’s footsteps, and
with the help of the large databases that became available in subsequent years, various
authors built amino acid substitution matrices based on observed mutation counts in
protein alignments (e.g. the updated Dayhoff matrices by Gonnet et al., 1992 or Jones et
al., 1992). This formalism operates in protein space (see below) and thus completely
ignores the underlying mutational process that occurs at the DNA level. Dayhoff’s PAM
matrices describe the probabilities of amino acid substitutions for a given period of
evolution. They are derived from a model in which amino acids mutate randomly and
independently of one another. Each substitution probability during some time interval
depends only on the identities of the initial and replacement residues. Mathematically
speaking, the dynamics of amino acid substitution resembles a time-homogeneous
first-order reversible Markov chain (Dayhoff et al., 1972, 1978; Gonnet et al., 1992;
Jones et al., 1992; Müller and Vingron, 2000). Of course, the above assumptions are not
strictly true, and various authors have pointed out that the dynamics of amino acid
substitutions is neither Markovian, nor stationary, nor homogeneous (Crooks and Brenner, 2005).
Sequence space, the abstract space of all sequences of length n drawn from an alphabet
of k letters, was first introduced in coding theory by Hamming (1950). It is a
metric space with respect to the Hamming distance, dH, which is the minimum number of
changes required to convert one sequence into another. Maynard Smith (1970) applied
this idea to amino acid sequences, defining the concept of protein space (see also
Kauffman, 1989 and references therein). As recently described in a delightful paper by
Frances Arnold (2011), in protein space each sequence is surrounded by its one-mutant
neighbors, that is, by all the proteins that differ from it by a change in a single amino
acid letter. As described in (Kauffman, 1989), “The concept of protein space is a
high-dimensional space in which each point represents one protein, and is next to 19N
points representing all the 1-mutant neighbors of that protein. The protein space
therefore simultaneously represents the entire ensemble of 20^N proteins and keeps
track of which proteins are 1-mutant neighbors of each other”.
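The neighborhood structure of protein space described above can be made concrete with a small sketch. The function names and the three-residue example peptide are illustrative assumptions, not part of the original work; the sketch only shows that a sequence of length N has 19N one-mutant neighbors, each at Hamming distance 1.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def one_mutant_neighbors(protein):
    """All sequences that differ from `protein` in exactly one residue."""
    neighbors = []
    for i, residue in enumerate(protein):
        for aa in AMINO_ACIDS:
            if aa != residue:
                neighbors.append(protein[:i] + aa + protein[i + 1:])
    return neighbors


if __name__ == "__main__":
    seq = "MKV"  # hypothetical three-residue peptide, for illustration only
    nbrs = one_mutant_neighbors(seq)
    print(len(nbrs), 19 * len(seq))
```

The count grows linearly with sequence length, while the whole space grows as 20^N, which is why the neighborhood picture is a convenient way to keep track of local moves.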
However, the autonomous description of protein evolution in protein (sequence) space
is misleading, because it violates the Central Dogma of molecular biology. In nature,
amino acids do not interchange among themselves during evolution. The space of
possibilities is at the DNA level, where the meaningful unit is a base triplet, or codon.
Therefore, the relevant space to describe codon substitutions is a genetic code space
(Fig. 1), where codon mutations occur (Swanson, 1984; Jiménez-Montaño et al., 1996;
Petoukhov, 1999; Stambuk, 2000; Jiménez-Montaño, 2004; Jiménez-Montaño and He,
2009). With single-nucleotide mutations, many of the other 19 amino acids are out of
reach from the original amino acid, and so they have zero probability of appearing.
This is the first place where the symmetry associated with the concept of random
mutations is broken. The single-nucleotide mutations among the four bases are indeed
random (although not necessarily equally probable), but the corresponding amino acid
substitution probabilities cannot be equal, owing to the structure of the genetic code. The
dynamics of amino acid substitutions refers to an aggregate level (see below).
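To illustrate how the genetic code restricts which amino acids are reachable, the following sketch enumerates the single-nucleotide neighbors of a codon under the standard genetic code. The helper names are assumptions made for this illustration; the code table is the standard one written in TCAG order, with `*` marking stop codons.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def single_step_codons(codon):
    """The nine codons that differ from `codon` at exactly one position."""
    return [
        codon[:pos] + base + codon[pos + 1:]
        for pos in range(3)
        for base in BASES
        if base != codon[pos]
    ]


def reachable_amino_acids(codon):
    """Amino acids other than the original one reachable in one nucleotide change."""
    return {CODE[c] for c in single_step_codons(codon)} - {CODE[codon], "*"}


def single_step_pairs():
    """Unordered amino acid pairs connected by at least one single-base change."""
    pairs = set()
    for codon, aa in CODE.items():
        if aa != "*":
            for other in reachable_amino_acids(codon):
                pairs.add(frozenset((aa, other)))
    return pairs


if __name__ == "__main__":
    print(sorted(reachable_amino_acids("TGG")))  # neighbors of tryptophan
    print(len(single_step_pairs()))              # pairs out of the 190 possible
```

For the single tryptophan codon TGG, for instance, only five other amino acids are one step away, so the remaining fourteen have zero probability in a single-nucleotide model.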
2.2 Codon models of protein evolution
Goldman and Yang (1994) and, independently, Muse and Gaut (1994) introduced the
first models of a Markovian dynamics at the DNA (codon) level. In these models all
substitution rates are derived from parameters. We will not discuss parametric codon
models here; instead, we are going to employ an adaptation for short-term evolution of
the empirical codon substitution model proposed by Gaston Gonnet and his group
(Schneider et al., 2005). In this case, all substitution rates were estimated from a large
data set of aligned vertebrate coding sequences and then fixed.
Assuming Markovian dynamics at the DNA (codon) level, the dynamics of amino acid
substitutions is defined by an aggregation (grouping) of codon states. However,
Görnerup and Jacobi (2010) pointed out that in general the dynamics on the aggregated
level is not closed, since the partition of the original space introduces memory on the
aggregated level. Only in the special case in which the aggregated dynamics is indeed
closed does the stochastic process over the partitions constitute a Markov chain of the
same order as the original process. Employing the same empirical codon substitution
matrix (Schneider et al., 2005) as we do, they showed that the substitution process
operates hierarchically on multiple levels, from nucleotides to codons, to groups of
codons associated with amino acids, and to amino acid groups that form “reduced
alphabets”. Since each level has approximately its own closed dynamics, the original
dynamics and the partition of the state space then define a new stochastic process on the
coarser level. These theoretical aspects of molecular evolution were corroborated by our
computer simulations.
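One standard way to test whether an aggregation of Markov states is closed is the classical strong lumpability condition: all states in a block must have the same total probability of jumping into each block. The sketch below checks that condition on a toy three-state chain. It is offered only as an illustration of the idea; it is not claimed to be the procedure used by Görnerup and Jacobi, and the matrices are invented for the example.

```python
def block_probabilities(P, partition):
    """For each state, the total probability of jumping into each block."""
    return [[sum(row[j] for j in block) for block in partition] for row in P]


def is_strongly_lumpable(P, partition, tol=1e-12):
    """True if every state of a block has the same block-level jump probabilities."""
    rows = block_probabilities(P, partition)
    for block in partition:
        reference = rows[block[0]]
        for state in block[1:]:
            if any(abs(x - y) > tol for x, y in zip(rows[state], reference)):
                return False
    return True


# Toy transition matrices (rows sum to 1); values are illustrative only.
CLOSED = [
    [0.5, 0.25, 0.25],
    [0.2, 0.30, 0.50],
    [0.2, 0.60, 0.20],
]
NOT_CLOSED = [
    [0.5, 0.25, 0.25],
    [0.2, 0.30, 0.50],
    [0.4, 0.40, 0.20],
]
PARTITION = [[0], [1, 2]]  # states 1 and 2 are grouped together

if __name__ == "__main__":
    print(is_strongly_lumpable(CLOSED, PARTITION))
    print(is_strongly_lumpable(NOT_CLOSED, PARTITION))
```

When the condition fails, as in the second matrix, the grouped process depends on which state inside the block the chain occupies, which is the memory effect mentioned above.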
Recently, Kosiol and Goldman (2011) proposed a closely related approach in terms of
aggregated Markov processes (AMPs), to model protein evolution as time-
homogeneous Markovian at the DNA (codon) level but observed (via the genetic code)
only at the amino acid level. They showed that this approach leads to time-dependent
and non-Markovian observations of amino acid sequence evolution. The main
difference between their work and the paper by Görnerup and Jacobi (2010) and our
model is that Kosiol and Goldman employed a parametric codon substitution matrix.
Nonetheless, our model is consistent with their assertion that the genetic code and
amino acids' physiochemical properties “influence the average substitution patterns
observed over collections of proteins at all evolutionary distances in the same way”.
That is, we assert that it is not accurate to say that the influence of the genetic code
dominates in the short term and physiochemical properties in the long term, as supposed
by Benner et al. (1994).
3. THE MARKOVIAN CODON-SUBSTITUTION MODEL
Markov processes, chains and models were first developed by Andrei A. Markov. Their
first use was for a linguistic purpose: modeling the letter sequences in works of Russian
literature (Markov, 1913). Later on, Markov models were developed as a general
statistical tool and applied to problems in natural language processing
(Christopher and Schutze, 2003) and in computational biology (Nielsen, 2005; Ewens
et al., 2001; Yang, 2006), among many other applications.
3.1 The Markov Information Source
Probabilistic finite-state automata (PFA), like hidden Markov models (HMM), are widely
used in computational linguistics, machine learning, time series analysis, computational
biology and speech recognition, among other fields of research. Their definition, given
in (Vidal et al., 2005), is equivalent to the definition of a stochastic regular grammar.
PFA are built to deal with the problem of probabilizing a structured space by adding
probabilities to structure. This is precisely what we want to do: to bring codon
transition probabilities into the structure of the genetic code.
This is necessary because in our model, as in the parametric models by Goldman and
Yang (1994) and by Halpern and Bruno (1998), the state space of the Markov process
corresponds to the standard genetic code (or its variants). In the model of Goldman and
Yang, “The states of the Markov process are the 61 sense codons. The three nonsense
(stop) codons are not considered in the model, as mutations to or from stop codons can
be assumed to affect drastically the structure and function of the protein and therefore
will rarely survive”. But, except for sharing the same abstract space, our approach is not
related to the models mentioned. Rather, its formulation and interpretation are closer to
those of informational and linguistic models. Therefore, we interpret the genetic code as a
Markov information source, exactly as this expression is understood in information
theory (Ash, 1965, p. 172): that is, a finite Markov chain, together with a function f
whose domain is the set of states S and whose range is a finite set Γ called the alphabet
of the source. In our case, Γ = {A, G, S, T, …, Y, W} is the amino acid alphabet. The
PFA can be displayed graphically as a six-dimensional Boolean hypercube
(Jiménez-Montaño et al., 1996; Petoukhov, 1999; Stambuk, 2000; Jiménez-Montaño, 2004;
Sánchez et al., 2004; Karasev and Soronkin, 1997). In a forthcoming paper
(Jiménez-Montaño and Ramos-Fernández, 2013) we describe an implementation of the PFA
with the help of the software tool GSEQUENCE, which we developed specifically to
simulate the generation of codon sequences, and from them amino acid sequences, in
protein evolution.
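The definition of a Markov information source (a finite Markov chain on states S plus an output function f into the alphabet Γ) can be sketched in a few lines. This is not the GSEQUENCE tool, whose interface is not described here. As a loudly labeled assumption, the chain below moves with equal probability to any sense codon one nucleotide away, instead of using empirical transition probabilities, and it never stays in place.

```python
import random

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
SENSE = [c for c in CODE if CODE[c] != "*"]  # the 61 states of the chain


def sense_neighbors(codon):
    """Sense codons that differ from `codon` at exactly one position."""
    candidates = (
        codon[:pos] + base + codon[pos + 1:]
        for pos in range(3)
        for base in BASES
        if base != codon[pos]
    )
    return [c for c in candidates if CODE[c] != "*"]


def emit(start, n_steps, seed=0):
    """Run the chain from `start`; return the codon path and f applied to it."""
    rng = random.Random(seed)
    codons = [start]
    for _ in range(n_steps):
        # ASSUMPTION: uniform choice among neighbors (toy transition rule).
        codons.append(rng.choice(sense_neighbors(codons[-1])))
    amino_acids = "".join(CODE[c] for c in codons)  # the output function f
    return codons, amino_acids


if __name__ == "__main__":
    path, letters = emit("GAT", 10)
    print(path)
    print(letters)
```

Replacing the uniform rule with rows of an empirical codon matrix would change only the line that draws the next state; the output function f, which is simply translation, stays the same.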
Just as Shannon (1948) did not mean that his statistical description of human language,
with a Markov information source, is the actual manner in which human discourse is
generated, we are not suggesting that Nature really produces proteins with
the help of a Markov information source at the codon level. This is only a mathematical
device to describe the correlations among the amino acid substitutions along the
evolutionary process.
As mentioned above, for our model we have adapted for short-term evolution (i.e., for
single-nucleotide changes) the 61 × 61 codon matrix introduced in (Schneider et al.,
2005), for which all substitution rates were estimated from a set of 17,502
alignments of orthologous genetic sequences from five vertebrate genomes. The codon
transition probabilities are fixed, and correspond to protein divergence between 25 and
60 accepted point mutations per 100 amino acids (PAMs). Besides the influence of the
genetic code, this matrix, like any other empirical codon substitution matrix, includes
variable factors such as codon usage, transition/transversion bias and selective pressures.
Inversions and duplications are not considered in this paper.
Out of the 190 possible interchanges among the 20 amino acids, we consider only the 75
that can be obtained by single-base substitutions. Therefore, we employ a reduced
empirical matrix (REM), obtained by setting to zero all entries of the original matrix
that correspond to more than one nucleotide change and normalizing the resulting matrix; see
(Jiménez-Montaño and He, 2009) for more details. In this way, we take into account the
local structure of the genetic code around each codon. In one step, a codon can change
in nine different ways, and generally it has from zero to three synonymous changes
and from six to nine non-synonymous changes, except in the cases of six-fold
degeneracy (serine, leucine and arginine). We consider all possible one-step
changes for all 61 codons, disregarding the three stop codons.
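The construction of a reduced matrix can be sketched as follows. Two assumptions are made here and should be read as such: the normalization is taken to be row-wise (each row rescaled to sum to one), and the input used in the example is a uniform placeholder, not the empirical matrix of Schneider et al. The second helper classifies the nine one-step changes of a codon as synonymous, non-synonymous or stop.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
SENSE = [c for c in CODE if CODE[c] != "*"]  # 61 sense codons, fixed order


def hamming(a, b):
    """Number of nucleotide positions at which two codons differ."""
    return sum(x != y for x, y in zip(a, b))


def reduce_matrix(matrix, codons=SENSE):
    """Zero every entry needing more than one nucleotide change, then
    renormalize each row (ASSUMPTION: row-wise normalization)."""
    reduced = []
    for i, ci in enumerate(codons):
        row = [p if hamming(ci, cj) <= 1 else 0.0
               for p, cj in zip(matrix[i], codons)]
        total = sum(row)
        reduced.append([p / total for p in row])
    return reduced


def one_step_changes(codon):
    """Counts of (synonymous, non-synonymous, stop) one-nucleotide changes."""
    syn = nonsyn = stop = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            target = CODE[codon[:pos] + base + codon[pos + 1:]]
            if target == "*":
                stop += 1
            elif target == CODE[codon]:
                syn += 1
            else:
                nonsyn += 1
    return syn, nonsyn, stop


if __name__ == "__main__":
    placeholder = [[1.0 / 61] * 61 for _ in range(61)]  # NOT empirical data
    rem = reduce_matrix(placeholder)
    print(sum(1 for p in rem[SENSE.index("TGG")] if p > 0))
    print(one_step_changes("GAT"), one_step_changes("CTA"))
```

A leucine codon such as CTA has four synonymous one-step changes, which is the kind of exception for six-fold degenerate amino acids that the text mentions.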
The important contribution of the genetic code to protein evolution has recently been
underlined by Hietpas et al., (2011), who found that the genetic code is highly
optimized (+2.4σ) to favor single-base substitutions between codons with WT-like
fitness compared with randomly generated codes. Thus, the genetic code generally
permits single-base substitution pathways between codons with WT-like fitness.
4. SELECTIVE CONSTRAINTS
The functions and structure of individual proteins impose different constraints on their
evolution. Irrespective of their dispensability, most proteins require a suitable
three-dimensional structure to function. Therefore, any polypeptide having a well-defined
globular structure must be subject to strong selection, and its sequence is, from this
point of view, nearly optimal in terms of stability (Sánchez et al., 2006). Consequently, a
majority of positions in a protein globular domain are selected for stability. The
fundamental role of selection for thermodynamic stability in shaping molecular
evolution has been demonstrated by studies that simulated sequence evolution under
structural constraints (Parisi and Echave, 2001). The amino acid substitution
probabilities derived from the REM matrix are highly anti-correlated with the values
taken from the amino acid substitution matrix suggested long ago by Miyata et al.
(1979) (Table 2). Therefore, the selective constraints are approximately captured by the
amino acid properties of hydrophobicity and volume.
More than thirty years ago, while analyzing the globin fold, Lesk and Chothia (1980)
had already reached a similar conclusion: so long as some basic physical-chemical
constraints are satisfied, there is considerable latitude in primary structure (Axe, 2004).
About the same time, a related conclusion was reached by Sander and Schulz (1979) in
their study of the degeneracy of the information content of amino acid sequences from
overlaid genes. From the families of homologous proteins known at that time, they
concluded that the information contained in a sequence is degenerate with respect to
function. They quantified this degeneracy on the basis of viral overlays and found that
five is the largest number of amino acid groups for which the assumption is
tenable that there exists one group sequence per protein function. Ever since, a great
number of reduced alphabets have been proposed, based on different amino acid
properties or observed substitutions (Miyata et al., 1979; Jiménez-Montaño, 1984;
Murphy et al., 2000; Solis and Rackovsky, 2000; Cannata et al., 2002; Li et al., 2003,
among many others). Here, following (Görnerup and Jacobi, 2010), we interpret reduced
amino acid alphabets simply as a result of the various codon subdynamics among
different groups of codons that are neighbors according to the topology of the genetic
code space.
Recently, Chothia (Sasidharan and Chothia, 2007) returned to the problem from a
different perspective. For the divergence process in proteins that maintain the same or
very similar functions and structures, Sasidharan and Chothia reported very similar
overall patterns of divergence by counting observed amino acid substitutions in three
very different groups of orthologs. They interpret this result to mean that individual
responses of most proteins are variations on a common set of selective constraints
which govern the types of frequent mutations that are acceptable. In RESULTS we
show that the frequencies of amino acid pair substitutions deduced from our computer
simulations are in very good agreement with the mutation profile obtained in their
paper.
5. PROTEIN SYNTAX
Paraphrasing Prince and Smolensky (1997) in their attempt to relate the sciences of the
brain with the sciences of the mind, we can say in the present context that: “It is evident
that statistical thermodynamics and molecular biology are separated by many gulfs, not
the least of which lies between the formal methods appropriate for continuous
dynamical systems and those for discrete symbol structures”.
For an amino acid substitution to be acceptable it is necessary, first of all, that the
alteration it produces in the protein structure be as small as possible. Therefore, the
general Darwinian principle of “gradual change” is interpreted in the sense that the
destabilization of the structure should be as small as possible. Thus, thermodynamics
requires minimization of the Gibbs free energy change. However, this continuous
optimization is hampered by the discrete nature of the amino acid change. A rough
estimate of the effect produced by the substitution consists, for example, in
calculating the distance of Miyata et al. (1979) between the original and the new amino
acid. However, this distance should not be calculated between the original and any of the
other 19 amino acids, but only between the original amino acid and the amino acids
accessible after a single-nucleotide mutation (that is, at most nine amino acids). In this way
the genetic code modulates acceptable mutations.
Following the parallelism between linguistics and a formal protein language
(Jiménez-Montaño, 1984), we recall that an important challenge of the former discipline is to
discover an architecture for grammars that both allows variation and limits its range to
what is actually possible in human language (Prince and Smolensky, 1997).
Furthermore, these authors remark that “… a central element in the architecture of
grammar is a formal means for managing the pervasive conflict between grammatical
constraints”. “The key observation is this: In a variety of clear cases where there is a
strength asymmetry between two conflicting constraints, no amount of success on the
weaker constraint can compensate for failure on the stronger one”. Finally, “… a
grammar consists entirely of constraints arranged in a strict domination hierarchy, in
which each constraint is strictly more important than - takes absolute priority over - all
constraints lower-ranked in the hierarchy. With this type of constraint interaction, it is
only the ranking of constraints in the hierarchy that matters for the determination of
optimality; no particular numerical strengths, for example, are necessary”.
Below we are going to show in which sense these concepts can be applied to the
characterization of amino acid substitutions. First, we need to say some words about
amino acid categorizations and the syntactic structure of proteins at the letter-unit level
which we discussed in detail in (Jiménez-Montaño, 1984).
We will call any classification of the 20 amino acid types into r groups (under different
criteria) an amino acid categorization. The set of symbols denoting the group names
will be called a reduced alphabet. When the reduced alphabet corresponds to a pattern
of substitutions according to an empirical matrix (Dayhoff et al., 1972, 1978; Gonnet et
al., 1992; Jones et al., 1992; etc.), the pattern is called a pattern of substitution classes.
If the categorization is based on physical-chemical properties (Grantham, 1964; Miyata
et al., 1979; etc.), the resulting patterns are called amino acid property sequences. In the
paper mentioned above, we approached the question of how to select a fixed
number of categories that best mirror amino acid replacements. The mutational
categories should reflect the most frequent amino acid substitutions observed. In the
same way, if the selected physical-chemical properties truly determine the protein’s
architecture, both patterns will be consistent. In other words, the reduced alphabets
obtained under different criteria will be almost equivalent. It is on this basis that it is
reasonable to assume that the constraints responsible for a given fold are somehow
encoded in the pattern of substitution classes.
A hierarchy (inverted tree) of amino acid categorizations represents a syntactic structure
of proteins at the letter-unit level. We shall employ for our discussion the hierarchy
introduced in Fig. 1 of (Jiménez-Montaño, 1984). In the same paper we discussed two
more hierarchies, one from Sneath (1966) and the other from Lim (1974). Twelve more
hierarchies (also called dendrograms), associated with the same number of popular
amino acid substitution matrices are displayed in Fig. 4 of (Johnson and Overington,
1993). See also Fig. 1 in (Fan and Wang, 2003), and (Venkatarajan and Braun, 2001),
among many other proposals in the literature.
With this background, the interpretation of the above quotations from Optimality
Theory (Prince and Smolensky, 1997) in the context of the present article is
straightforward. A hierarchy of amino acid categorizations encodes physical-chemical
constraints arranged in a strict domination hierarchy. Thus, the dominant partition in the
dendrogram in Fig. 1 of (Jiménez-Montaño, 1984) separates amino acids into
non-hydrophobic, represented by the group symbol a, and hydrophobic, represented by the
group symbol b. This constraint dominates over the lower constraints (for example, size).
Therefore, we expect that an amino acid of a given class will be substituted by another
amino acid of the same class. In this case, we say that the substitution is syntactically
correct, and that the new sequence belongs to the language generated by the grammar.
Let us illustrate with an example how the grammar generates average amino acid
substitutions obeying general physical-chemical constraints that preserve the stability of
the protein. If, in a given site of a protein sequence, we have aspartic acid (D), we expect
that it will be replaced by glutamic acid (E), because the node n in Fig. 1 of
(Jiménez-Montaño, 1984) is the smallest class that includes both amino acids. Next, we have node
e, which includes two more amino acids, Q and N; that is, e = {D, E, N, Q}. Thus, the
category represented by the symbol n corresponds to the most conservative substitution,
then follows the wider category e, and so on up to the category represented by the
symbol a, which embraces the non-hydrophobic amino acids.
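The idea of the "smallest class that includes both amino acids" can be sketched in a few lines of Python. This is only an illustration: the list below contains just the two classes spelled out in the text (n and e), and the function name `smallest_common_class` is a hypothetical helper, not part of any published software.

```python
# Partial hierarchy: only the two classes whose members are given in the text.
# Symbols follow Fig. 1 of (Jiménez-Montaño, 1984); the list is deliberately incomplete.
CLASSES = [
    ("n", frozenset("DE")),
    ("e", frozenset("DENQ")),
]

def smallest_common_class(x, y, classes=CLASSES):
    """Return the symbol of the smallest listed class containing both residues, or None."""
    candidates = [(len(members), symbol)
                  for symbol, members in classes
                  if x in members and y in members]
    return min(candidates)[1] if candidates else None

print(smallest_common_class("D", "E"))  # n
print(smallest_common_class("D", "Q"))  # e
print(smallest_common_class("D", "W"))  # None (W is in neither listed class)
```

The smaller the returned class, the more conservative the substitution in the sense used above.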
The amino acid dendrograms we are considering were derived from a number of amino
acid substitution matrices by several authors, who employed different clustering
procedures that have a significant influence on the result. Besides, this approach in
protein space disregards the fact that two amino acids in the same group may be
separated by two or three nucleotide substitutions, and are thus unlikely to substitute for one
another. As pointed out long ago by Miyata et al. (1979): “Amino acids separated by
two or three codon position differences are unlikely to interchange even if they are
chemically similar”. Recently, we discussed this problem (Jiménez-Montaño and He,
2009). For example, the category i = {F, Y, W} in Fig. 1 of (Jiménez-Montaño, 1984),
which includes the three large hydrophobic amino acids, should be refined into two
groups: {F, Y} and {W}. This is so because going from the codon of W to any of the
codons of the other two amino acids requires two nucleotide changes; thus, W
constitutes a separate group by itself. This splitting of W was already proposed, for
example, in (Murphy et al., 2000), but for a different reason (which is a consequence of
the above reason): the small number of substitutions observed between W and the other
two amino acids, as reflected in the empirical BLOSUM 50 matrix. Therefore, it is clear
that, to improve on previous approaches, it is necessary to have a syntactic structure at
the codon level.
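The codon argument for splitting category i can be checked directly. The following minimal sketch uses the standard genetic code codons for F, Y and W and counts the minimum number of nucleotide differences between any codon of one amino acid and any codon of the other; the function names are illustrative.

```python
from itertools import product

# Standard genetic code codons for the three residues of category i.
CODONS = {
    "F": ["TTT", "TTC"],
    "Y": ["TAT", "TAC"],
    "W": ["TGG"],
}

def hamming(c1, c2):
    """Number of positions at which two codons differ."""
    return sum(a != b for a, b in zip(c1, c2))

def min_codon_distance(aa1, aa2, codons=CODONS):
    """Smallest number of nucleotide changes needed to go from aa1 to aa2."""
    return min(hamming(c1, c2) for c1, c2 in product(codons[aa1], codons[aa2]))

for pair in [("F", "Y"), ("W", "F"), ("W", "Y")]:
    print(pair, min_codon_distance(*pair))
# ('F', 'Y') 1
# ('W', 'F') 2
# ('W', 'Y') 2
```

F and Y are neighbors in codon space, whereas W is two changes away from both, which is the reason given above for treating {W} as a group of its own.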
6. RESULTS
The first result of this paper is the proposal of the codon dendrogram shown in Fig. 2. It
was obtained by applying the clustering algorithm UPGMA (unweighted pair-group
method using arithmetic averages) to the full codon substitution matrix introduced in
(Schneider et al., 2005). This classification of codons, inferred from an empirical
matrix, induces a corresponding arrangement for amino acids. We observe that the
codons for D and E share the same group; therefore, we expect these two amino acids
to exchange frequently, both because they are very similar and because they are neighbors
in codon space. They are ranked first in our simulations (Table 1) and in Human-Chicken
orthologs, and second in Escherichia coli and Salmonella orthologs; both of the latter
rankings come from the observed data (Table 6 in supp. material from Sasidharan and Chothia, 2007).
However, the group e = {D, E, N, Q} does not occur in Fig. 2; N and S2 (S with AGY
codons) form one category, and Q and H another. Therefore, the letters in category e are
not completely equivalent from the point of view of their substitutability. From the
results in Table 1, the substitutions DN and EQ rank between 12 and
14, while NQ and EN are ranked 44 and 45, respectively, in Table 6 of the supp.
material from (Sasidharan and Chothia, 2007). In the first case the amino acids have
neighboring codons; in the second case the corresponding codons differ by two bases.
Figure 2: Dendrogram (Codon rooted tree) obtained from the full empirical codon-substitution matrix (Schneider et al., 2005), employing the UPGMA method.
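For readers unfamiliar with UPGMA, the following is a minimal pure-Python sketch of the procedure on a toy example. It is not the code used to produce Fig. 2. The four codons are real, but the distances are hypothetical numbers chosen for illustration, and the conversion of an empirical substitution matrix into distances is an assumption that the text does not specify.

```python
def upgma(labels, matrix):
    """UPGMA on a symmetric distance matrix (list of lists); returns nested tuples."""
    nodes = {i: (label, 1) for i, label in enumerate(labels)}  # id -> (subtree, leaf count)
    dist = {(i, j): matrix[i][j]
            for i in range(len(labels)) for j in range(i + 1, len(labels))}
    active = list(range(len(labels)))
    next_id = len(labels)
    while len(active) > 1:
        i, j = min(dist, key=dist.get)            # closest pair of active clusters
        (ti, ni), (tj, nj) = nodes[i], nodes[j]
        new_dist = {}
        for k in active:
            if k in (i, j):
                continue
            dik = dist[tuple(sorted((i, k)))]
            djk = dist[tuple(sorted((j, k)))]
            # Size-weighted mean equals the arithmetic average over all leaf pairs.
            new_dist[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        dist = {p: v for p, v in dist.items() if i not in p and j not in p}
        dist.update(new_dist)
        active = [k for k in active if k not in (i, j)] + [next_id]
        nodes[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    return nodes[active[0]][0]

# Hypothetical distances among the codons of D (GAT, GAC) and E (GAA, GAG).
codons = ["GAT", "GAC", "GAA", "GAG"]
toy = [[0.0, 1.0, 3.0, 3.2],
       [1.0, 0.0, 3.1, 3.3],
       [3.0, 3.1, 0.0, 1.2],
       [3.2, 3.3, 1.2, 0.0]]
tree = upgma(codons, toy)
print(tree)  # (('GAT', 'GAC'), ('GAA', 'GAG'))
```

With these toy numbers the synonymous codons pair first and the two pairs then join, which mirrors the way the codons for D and E share a group in the dendrogram.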
E. COLI-S. ENTERICA   MARKOV SOURCE   RANDOM GENERATOR
RANK ORDER MUTATION   RANK ORDER MUTATION   RANK ORDER MUTATION
2 DE   1 DE/ED   10 DE/ED
1 IV   3 IV/VI   6 IV/VI
4 ST   4 ST/TS   3 ST/TS
5 AT   8 AT/TA   5 AT/TA
10 NS   5 SN/NS   6 SN/NS
3 AS   9 AS/SA   2 AS/SA
15 KR   2 RK/KR   6 RK/KR
18 GS   15 SG/GS   2 SG/GS
8 AV   13 AV/VA   5 AV/VA
6 IL   14 IL/LI   4 IL/LI
12 SP/PS   3 SP/PS
13 LV   6 VL/LV   3 VL/LV
7 LM   7 LM/ML   9 LM/ML
19 AP   20 AP/PA   5 AP/PA
16 AG   17 AG/GA   5 AG/GA
21 HQ   3 QH/HQ   10 QH/HQ
17 FY   10 FY/YF   10 FY/YF
23 NT   19 TN/NT   8 TN/NT
16 VM/MV   10 VM/MV
29 IM   10 IM/MI   11 IM/MI
23 PQ/QP   8 PQ/QP
25 FL   10 LF/FL   6 LF/FL
22 TP/PT   5 TP/PT
19 TI/IT   6 TI/IT
24 QR   18 RQ/QR   6 RQ/QR
14 DN   12 ND/DN   10 ND/DN
19 SC/CS   7 SC/CS
11 KQ   9 KQ/QK   10 KQ/QK
28 TM/MT   10 TM/MT
20 HN   17 NH/HN   10 NH/HN
9 AE   24 AE/EA   8 AE/EA
12 EQ   8 EQ/QE   10 EQ/QE
22 KN   11 KN/NK   10 KN/NK
26 AD   25 AD/DA   8 AD/DA
27 TV   5 TV/VT
28 HR   20 RH/HR   6 RH/HR
30 LQ   22 QL/LQ   6 QL/LQ
20 GE/EG   8 GE/EG
24 PL/LP   2 PL/LP
Table 1: Comparison of rank positions of the most frequent mutation types found in pairs of orthologs from
Escherichia coli and Salmonella enterica, for < 10 % divergence (Sasidharan and Chothia, 2007), with those obtained from simulations generated with our Markov information source, implemented with the help of
the software tool GSEQUENCE (Jiménez-Montaño and Ramos-Fernández, 2013). In the third column we display results obtained with a random source.
Despite the discrepancies just explained, there is very good agreement between most
of the categories in Fig. 1 from (Jiménez-Montaño, 1984) and those in the codon
dendrogram (Fig. 2). For example, the important category h = {L, I, V, M} of aliphatic
amino acids in our hierarchy of amino acid categorizations coincides with the category
d in Fig. 2. The same is true for the small neutral amino acids in group c = {P, A, G, S,
T}, except for glycine (G), which in Fig. 2 belongs to class b = {N, S2, G, D, E}. There are
other coincidences and differences that the reader can easily find. An important
difference worth mentioning is the following: while in the grouping based on amino acid
properties I and L are in the same category q, in the codon dendrogram I joins V to form
a group. Even though there are proofreading processes in protein synthesis to correct
translation errors, the substitutions VI are ranked third and the substitutions
LI are ranked fourteenth in our simulations, and are ranked first and sixth, respectively,
in pairs of orthologs from Escherichia coli and Salmonella enterica (see our Table 1
and Table 6 in supp. material from Sasidharan and Chothia, 2007).
The amino acid substitution pairs from our codon dendrogram, (D, E), (K, R), (I, V),
(Y, F), (M, L), (N, S2), (Q, H) and (A, S), are consistent with the ones reported in the
dendrogram displayed in Fig. 3 of (Görnerup and Jacobi, 2010), except for minor
differences. These are in the last two groups; in their paper (which employs a
completely different agglomeration procedure), Q and H form separate groups, and A
pairs with T instead of S. However, the higher categorizations are completely different
in the two dendrograms. The separation of hydrophobic and hydrophilic amino acids in our
dendrogram (Fig. 2) is consistent with that in Fig. 1 of (Jiménez-Montaño, 1984). The
only difference comes from the small neutral amino acids, which are in the
non-hydrophobic group in our former publication, and are grouped with the hydrophobic
amino acids in the codon dendrogram.
The second, no less important, result is that six of the ten most frequent amino acid
substitution pairs obtained from simulations generated with our Markov information
source, implemented with the help of the software tool GSEQUENCE (Jiménez-Montaño
and Ramos-Fernández, 2013), agree with the pairs in the three sets of orthologs from E.
coli – S. enterica, Human-Mouse and Human-Chicken, respectively, from Table 6 in
supp. material from (Sasidharan and Chothia, 2007). These pairs are: DE, IV, ST, NS,
AT, AS. The pair KR agrees with two of the three sets of orthologs. Additionally, we
have, in descending order, the pairs VL, LM and FY, which are ranked twelfth, thirteenth
and seventeenth, respectively, in the same source. Seven of these amino acid substitution
pairs agree with the codon pairs in the Codon Dendrogram (Fig. 2); they are: DE, IV,
NS, AS, KR, LM, FY. These results are consistent with the most frequent amino acid
exchanges found in (Schmitt et al., 2007). In our simulations, the corresponding codons
outline approximately closed dynamics, as discussed in (Görnerup and Jacobi, 2010).
Therefore, these cycles of the codon dynamics produce amino acid substitutions which
are fixed in the population because of the similarity of the corresponding amino acids.
These outcomes take us to the third and last result of this paper.
AA/CODON  CORR COEF    AA/CODON  CORR COEF    AA/CODON  CORR COEF
K AAA     -0.8791      I ATA     -0.5617      D GAC     -0.7193
K AAG     -0.9266      I ATC     -0.5668      D GAT     -0.7201
N AAC     -0.6477      I ATT     -0.5791      A GCA     -0.3349
N AAT     -0.7568      M ATG     -0.6726      A GCC     -0.3815
T ACA     -0.6894      Q CAA     -0.7166      A GCG      0.7656
T ACC     -0.4522      Q CAG     -0.5851      A GCT     -0.4195
T ACG      0.4707      H CAC     -0.6559      G GGA     -0.4947
T ACT     -0.4727      H CAT     -0.6578      G GGC     -0.8047
R AGA     -0.8543      P CCA     -0.8373      G GGG     -0.2542
R AGG     -0.8397      P CCC     -0.8056      G GGT     -0.7988
R CGA     -0.8787      P CCG      0.3259      V GTA     -0.4626
R CGC     -0.9202      P CCT     -0.7635      V GTC     -0.6288
R CGG     -0.9238      L CTA     -0.6903      V GTG     -0.8257
R CGT     -0.9403      L CTC     -0.964       V GTT     -0.6141
S2 AGC    -0.7744      L CTG     -0.9467      Y TAC     -0.9487
S2 AGT    -0.6739      L CTT     -0.9273      Y TAT     -0.9621
S TCA     -0.9765      L TTA     -0.5547      C TGC     -0.788
S TCC     -0.8915      L TTG     -0.7221      C TGT     -0.7812
S TCG     -0.7629      E GAA     -0.6831      W TGG     -0.7817
S TCT     -0.9292      E GAG     -0.8145      F TTC     -0.4676
                                              F TTT     -0.4737
Table 2: Anti-correlation between the substitution probabilities from the reduced empirical matrix, REM (Jiménez-Montaño and He, 2009), and the physical-chemical dissimilarity index (distance) from (Miyata et al., 1979). For a given codon, e.g. AAA (K), I calculated the correlation between the list of values of the substitution probabilities with its neighbors (AGA (R), GAA (E), etc.) and the list of values of the index of the associated amino acids (in parentheses).
In Table 2 we display the anti-correlation between the substitution probabilities from the
reduced empirical matrix, REM (Jiménez-Montaño and He, 2009), and the physical-chemical
dissimilarity index (distance) from (Miyata et al., 1979), which is based on the
hydrophobicity and volume of amino acids. As expected, amino acid pairs whose
codons substitute frequently have small values of the dissimilarity index, and vice
versa. So, this well-known result from comparisons of amino acid substitution matrices
is corroborated at the codon level.
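The per-codon calculation behind Table 2 can be sketched as a Pearson correlation between two lists. The sketch below assumes that the coefficient used is Pearson's, which the text does not state explicitly. The neighbor labels follow the AAA (K) example in the caption, but every number is a placeholder invented for illustration; none is a value from the REM or from Miyata et al. (1979).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Neighbors of codon AAA (K); all numeric values are hypothetical placeholders.
neighbors = ["AGA (R)", "GAA (E)", "CAA (Q)", "ACA (T)"]
sub_prob = [0.30, 0.10, 0.12, 0.05]        # made-up substitution probabilities
dissimilarity = [0.4, 1.1, 1.0, 1.9]       # made-up dissimilarity indices

r = pearson(sub_prob, dissimilarity)
print(f"correlation for AAA (K): {r:.4f}")
```

A negative value of `r` corresponds to the anti-correlation reported in the table: the more dissimilar the amino acid, the less frequent the substitution.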
7. CONCLUSIONS
After presenting a general conceptual framework for the analysis of protein evolution,
we introduced a theoretical model, which consists of a Markov Information Source that
generates codon sequences, and from them amino acid sequences, that maintain the
same or very similar functions and structures. This invariance is a consequence not only
of natural selection (which preserves the sequences that obey general physical-chemical
constraints, which are responsible for the stability of the protein), but also of the
structure of the genetic code, which controls the amino acid changes that can arise from
single nucleotide mutations. With the help of the model, we introduced a syntactic
formulation (codon dendrogram) to describe a hierarchy of codon categorizations which
explains the pattern of frequent amino acid substitutions in short-term evolution. From
our computer simulations (Jiménez-Montaño and Ramos-Fernández, 2013) we
interpreted the reduced amino acid alphabets simply as a result of the various codon
subdynamics among different clusters of codons, which are neighbors according to the
topology of the genetic code space.
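As a rough illustration of what a Markov information source over codons does, the sketch below evolves a short codon sequence with a small transition table and translates the result. It is not the GSEQUENCE implementation, whose details are not given here. The transition probabilities are hypothetical and restricted to the four codons of D and E, whereas the model described above draws on the full empirical codon matrix.

```python
import random

# Hypothetical one-step substitution probabilities among four codons (rows sum to 1).
TRANSITIONS = {
    "GAT": {"GAT": 0.90, "GAC": 0.06, "GAA": 0.02, "GAG": 0.02},
    "GAC": {"GAT": 0.06, "GAC": 0.90, "GAA": 0.02, "GAG": 0.02},
    "GAA": {"GAT": 0.02, "GAC": 0.02, "GAA": 0.90, "GAG": 0.06},
    "GAG": {"GAT": 0.02, "GAC": 0.02, "GAA": 0.06, "GAG": 0.90},
}
GENETIC_CODE = {"GAT": "D", "GAC": "D", "GAA": "E", "GAG": "E"}

def evolve(codons, steps, rng):
    """Apply `steps` rounds of independent per-site substitutions."""
    seq = list(codons)
    for _ in range(steps):
        seq = [rng.choices(list(TRANSITIONS[c]),
                           weights=list(TRANSITIONS[c].values()))[0]
               for c in seq]
    return seq

def translate(codons):
    """Translate a codon list into a one-letter amino acid string."""
    return "".join(GENETIC_CODE[c] for c in codons)

rng = random.Random(0)            # fixed seed so that runs are repeatable
start = ["GAT", "GAA", "GAC", "GAG"]
end = evolve(start, steps=10, rng=rng)
print(translate(start), "->", translate(end))
```

Because the toy table only connects codons of D and E, every generated sequence stays within that cluster, a small-scale picture of the approximately closed codon subdynamics discussed above.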
Acknowledgements: I wrote this paper while commissioned at Dirección General de
Investigaciones de la Universidad Veracruzana. I want to thank director César I.
Beristain-Guevara for his support. I also thank Q.F.B. Antero Ramos-Fernández for his
help in doing some calculations and preparing the tables and figures. I thank David Abel
for suggestions to make the manuscript clearer and for some references. I express thanks to
Sistema Nacional de Investigadores, México, for partial support. Finally, I thank my
wife, Ma. Eta. Castellanos G. for her patience and understanding.
REFERENCES
Abel, D.L. and Trevors, J.T. (2006) More than Metaphor: Genomes are Objective Sign Systems, Journal of BioSemiotics, 1 253-267.
Arnold, F.H. (2011) The Library of Maynard-Smith: My Search for Meaning in the Protein Universe. Microbe, ASM News 6(7) 316-318.
Ash, R. (1965) Information Theory, New York: Interscience Publishers, 339pp.
Axe, D.D. (2004) Estimating the Prevalence of Protein Sequences Adopting Functional Enzyme Folds. Journal of Molecular Biology, 341 1295-1315.
Benner, S.A., Cohen, M.A. and Gonnet, G.H. (1994) Amino acid substitution during functionally constrained divergent evolution of protein sequences. Protein Engineering, 7 1323–1332.
Cannata, N., Toppo, S., Romualdi, C. and Valle, G. (2002) Simplifying amino acid alphabets by means of a branch and bound algorithm and substitution matrices. Bioinformatics 18 1102-1108.
Crooks, G.E. and Brenner, S.E. (2005) An alternative model of amino acid replacement. Bioinformatics 21 975–980.
Crutchfield, J.P. and Schuster, P. (2003) Evolutionary Dynamics–Exploring the Interplay of Accident, Selection, Neutrality, and Function, Oxford University Press, New York, 452pp.
Dayhoff, M.O., Eck, R.V. and Park, C.M. (1972) A model of evolutionary change in proteins. In: Dayhoff M, ed. Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Washington, D.C., 5 89–99pp.
Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978) A model of evolutionary change in proteins. In: Dayhoff M, ed. Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Washington, D.C. 5(3) 345–352pp.
DePristo, M.A., Weinreich, D.M. and Hartl, D.L. (2005) Missense meanderings in sequence space: a biophysical view of protein evolution. Nature Reviews Genetics 6 678-687.
Dokholyan, N.V. and Shakhnovich, E.I. (2001) Understanding hierarchical protein evolution from first principles. Journal of Molecular Biology 312 289–307.
Ewens, W.J. and Grant, G.R. (2001) Statistical Methods in Bioinformatics: An Introduction, Springer-Verlag, New York, 476pp.
Fan, K. and Wang, W. (2003) What is the Minimum Number of Letters Required to Fold a Protein? Journal of Molecular Biology 328 921–926.
Goldman, N. and Yang, Z. (1994) A Codon-based Model of Nucleotide Substitution for Protein-coding DNA Sequences. Molecular Biology and Evolution 11 725-736.
Gonnet, G.H., Cohen, M.A. and Benner, S.A. (1992) Exhaustive matching of the entire protein sequence database. Science 256 1443-1445.
Görnerup, O. and Jacobi, M.N. (2010) A model-independent approach to infer hierarchical codon substitution dynamics. BMC Bioinformatics 11 201
Grantham, R. (1974) Amino acid difference formula to help explain protein evolution. Science 185 862-864.
Halpern, A.L. and Bruno, W.J. (1998) Evolutionary Distances for Protein-Coding Sequences: Modeling Site-Specific Residue Frequencies. Molecular Biology and Evolution 15 910–917.
Hamming, R.W. (1950) Error detecting and error correcting codes. Bell System Technical Journal 29 147-160.
Hietpas, R.T., Jensen, J.D. and Bolon, D.N.A. (2011) Experimental illumination of a fitness landscape. Proceedings of the National Academy of Sciences 108 7896–7901.
Jiménez-Montaño, M.A. (1984) On the syntactic structure of protein sequences and the concept of grammar complexity. Bulletin of Mathematical Biology 46 641-659.
Jiménez-Montaño, M.A. (1994) On the Syntactic Structure and Redundancy Distribution of the Genetic Code. BioSystems, 32 11-23.
Jiménez-Montaño, M.A. (2004) Applications of Hyper Genetic Code to Bioinformatics. Journal of Biological Systems 12 5-20.
Jiménez-Montaño, M.A. and He, M. (2009) Irreplaceable Amino Acids and Reduced Alphabets in Short-term and Directed Protein Evolution. In Bioinformatics Research and Applications. Mandoiu, Ion; Narasimhan, Giri; Zhang, Yanquing (Eds.). Springer-Verlag Berlin Heidelberg, 297–309pp.
Jiménez-Montaño, M.A. and Ramos-Fernández, A. (2013) Simulation of protein evolution with a Markovian empirical codon-substitution model. Manuscript in preparation.
Jiménez-Montaño, M.A., de la Mora-Basáñez, R. and Pöschel, T. (1996) The Hypercube Structure of the Genetic Code Explains Conservative and Non-Conservative Aminoacid Substitutions in Vivo and in Vitro. BioSystems 39 117-125.
Johnson, M.S. and Overington, J.P. (1993) A structural basis for sequence comparisons—an evaluation of scoring methodologies. Journal of Molecular Biology 233 716–738
Jones, D.T., Taylor, W.R. and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Computer Applications in the Biosciences 8 275–282.
Karasev, V.A. and Soronkin, S.G. (1997) Topological structure of the genetic code, Russian Journal of Genetics 33 622–628.
Kauffman, S. (1989) Adaptation on Rugged Fitness Landscapes. In Lectures in the Sciences of Complexity. Stein, D.L., Editor. Addison-Wesley Publishing Company, Redwood City, California, 527-618pp.
Kleiger, G., Beamer, L.J., Grothe, R., Mallick, P., and Eisenberg, D. (2000) The 1.7 Å Crystal Structure of BPI: A Study of How Two Dissimilar Amino Acid Sequences can Adopt the Same Fold, Journal of Molecular Biology 299 1019-1034.
Kosiol, C. and Goldman, N. (2011) Markovian and Non-Markovian Protein Sequence Evolution: Aggregated Markov Process Models, Journal of Molecular Biology 411 910–923.
Lesk, A.M. and Chothia, C. (1980) How different amino acid sequences determine similar protein structures: The structure and evolutionary dynamics of the globins, Journal of Molecular Biology 136 225-270.
Li, T., Fan, K., Wang, J. and Wang, W. (2003) Reduction of protein sequence complexity by residue grouping, Protein Engineering 16 323-330.
Lim, V.I. (1974) Algorithms for prediction of alpha-helical and beta-structural regions in globular proteins, Journal of Molecular Biology 88 873-94.
Markov, A.A. (1913) Primer statisticheskogo issledovanija nad tekstom `Evgenija Onegina' illjustrirujuschij svjaz' ispytanij v tsep' (An example of statistical study on the text of `Eugene Onegin' illustrating the linking of events to a chain). Izvestija Imp. Akademii nauk, serija VI, 3 153-162.
Miyata, T., Miyazawa, S. and Yasunaga, T. (1979) Two types of amino acid substitutions in protein evolution, Journal of Molecular Evolution 12 219-236.
Müller, T. and Vingron, M. (2000) Modeling amino acid replacement, Journal of Computational Biology 7 761–776.
Murphy, L.R., Wallqvist, A. and Levy, R.M. (2000) Simplified amino acid alphabets for protein fold recognition and implications for folding, Protein Engineering 13 149-152.
Muse, S.V. and Gaut, B.S. (1994) A Likelihood Approach for Comparing Synonymous and Nonsynonymous Nucleotide Substitution Rates, with Application to the Chloroplast Genome, Molecular Biology and Evolution 11 715-724.
Manning, C.D. and Schutze, H. (1999) Foundations of Statistical Natural Language Processing, Cambridge, Massachusetts, MIT Press, 680pp. Reprint: Cambridge, Massachusetts, MIT Press 2003.
Nielsen, R. (2005) Statistical Methods in Molecular Evolution, Springer Verlag, New York, 508pp.
Pál, C., Papp, B. and Lercher, M.J. (2006) An integrated view of protein evolution, Nature Reviews Genetics 7 337-348.
Parisi, G. and Echave, J. (2001) Structural constraints and emergence of sequence patterns in protein evolution, Molecular Biology and Evolution 18 750–756.
Parkhomchuk, D., Amstislavskiy, V., Soldatov, A. and Ogryzko, V. (2009) Use of high throughput sequencing to observe genome dynamics at a single cell level, Proceedings of the National Academy of Sciences 106 20830-20835.
Petoukhov, S.V. (1999) Genetic code and the ancient Chinese book of changes, Symmetry: Culture and Science 10 211-226.
Prince, A. and Smolensky, P. (1997) Optimality: From Neural Networks to Universal Grammar, Science 275 1604-1610.
Sanchez, I.E., Tejero, J., Gomez-Moreno, C., Medina, M. and Serrano, L. (2006) Point Mutations in Protein Globular Domains: Contributions from Function, Stability and Misfolding, Journal of Molecular Biology 363 422–432.
Sánchez, R., Morgado, E. and Grau, R. (2004) The Genetic Code Boolean Lattice, Communications in Mathematical and in Computer Chemistry 52 29-46.
Sander, C. and Schulz, G.E. (1979) Degeneracy of the information contained in amino acid sequences: Evidence from overlaid genes, Journal of Molecular Evolution 13 245-252.
Sasidharan, R. and Chothia, C. (2007) The selection of acceptable protein mutations, Proceedings of the National Academy of Sciences 104 10080–10085.
Schmitt A. O., Schuchhardt, J., Ludwig A., Brockmann G. A. (2007) Protein evolution within and between species, Journal of Theoretical Biology 249 376–383.
Schneider, A., Cannarozzi, G.M. and Gonnet, G.H. (2005) Empirical codon substitution matrix, BMC Bioinformatics 6 134.
Shannon C. (1948). A Mathematical Theory of Communication, The Bell System Technical Journal 27 379–423, 623–656, July, October.
Skipper, M., Dhand, R. and Campbell, P. (2012) Nature/Encode. 2001 Will always be remembered as the year of the human genome, Nature 489: 45.
The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489: 57–74. http://www.nature.com/encode/
Smith, J.M. (1970) Natural Selection and the Concept of a Protein Space, Nature 225 563–564.
Sneath, P.H.A. (1966) Relations between chemical structure and biological activity in peptides, Journal of Theoretical Biology 12 157-195.
Solis, A.D. and Rackovsky, S. (2000) Optimized representations and maximal information in proteins, Proteins: Structure, Function, and Bioinformatics 38 149-164.
Stambuk, N. (2000) Universal metric properties of the genetic code, Croatica Chemica Acta 73 1123-1139.
Swanson, R. (1984) A unifying concept for the amino acid code, Bulletin of Mathematical Biology 46 187-203.
Thorne, J.L. and Goldman, N. (2001) Probabilistic models for the study of protein evolution. Balding, D.J., Bishop, M., Cannings, C. (Eds.), Handbook of Statistical Genetics. John Wiley, Chichester, UK, 67-82pp.
Venkatarajan, M.S. and Braun, W. (2001) New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical–chemical properties, Journal of Molecular Modeling 7 445–453.
Vidal, E., Thollard, F., de la Higuera, C., Casacuberta, F. and Carrasco, R.C. (2005) Probabilistic finite-state machines-Part I, IEEE Trans, Pattern Analysis and Machine Intelligence 27 1013-1025.
Yang, Z. (2006) Computational molecular evolution, Oxford: Oxford University Press 374 pp.
Wagner A. (2012). The Role of Randomness in Darwinian Evolution, Philosophy of Science 79 95-119.