Erdogan and Wong (2015)
Computational simulation of single nucleotide polymorphisms and their effect on the difference between the percent identities of DNA and amino acid sequences
Batuhan Erdogan and Bryan Wong
Science One Program
The University of British Columbia March 2015
ABSTRACT
This study investigated the action of single nucleotide polymorphisms (SNPs) in DNA
sequences. A VPython model was designed to show the effect of SNPs on DNA strands and,
subsequently, on amino acid sequences. The graphs generated in VPython show trends in the
percent similarity (%) of amino acid sequences, as well as the percent similarity (%) of
DNA base pair sequences. Furthermore, a spreadsheet program allowed us to compare directly
the difference between the final percent identities (%) of DNA and amino acid sequences with
respect to mutation rate. Our goal was to find the relationship between the frequency of SNPs
and the final amino acid and DNA sequences. It was hypothesized that a greater number of SNPs
would lead to more mutations within an amino acid sequence, largely due to the redundancy in
the genetic code. The difference between the percent identities of the DNA and amino acid
sequences was found to be proportional to the mutation rate.
Introduction
Single nucleotide polymorphisms (SNPs) give rise to the immense diversity present in
nature. By definition, SNPs are single-base variations in a DNA sequence that are present within
the population at a frequency greater than 1% (Kirk et al. 3295). SNPs in coding regions (cSNPs)
are divided into two main categories. cSNPs that do not affect the protein sequence are known as
synonymous SNPs; those that change the amino acid sequence of a protein are known as
nonsynonymous SNPs (nsSNPs). Because nsSNPs alter the amino acid sequence of the encoded
protein, they are more likely to provoke disease. nsSNPs are subdivided into three groups,
based on the sequence, structure, and annotation of the resulting protein (Wu and Jiang 1).
Previous research has explored numerous aspects of SNPs. Cargill et al. observed that
polymorphisms occur at different rates depending on their identity and their significance to the
encoded protein (231). SNPs are used in a wide range of clinical research. For instance, SNPs
are used in population genetics and medical science through the exploration of genetic
polymorphisms (Wakui 1008). SNPs also allow medication to be personalized, which has given
rise to new techniques in medical practice in the field of pharmacogenomics (Katara 85). In
the field of computer science, Ng and Henikoff created SIFT (Sorting Intolerant From Tolerant),
a program that predicts whether an amino acid substitution, under the influence of SNPs, affects
protein function (3812).
In contrast, this research focuses less on the implications of SNPs and more on the
general mechanics of the mutation process, by designing a computational simulation called
NOVA. NOVA gives the audience a concrete visualization of how DNA strands mutate
over successive cell cycles. The model aims to connect SNPs with their effect on
the final amino acid sequence by quantitatively comparing the percent similarities, or identities
(%), between the original and final amino acid sequences, as well as by comparing the percent
similarities of the final DNA and amino acid sequences, with respect to mutation rate. It is
postulated that an increase in the number of SNPs will generate a progressive increase in
mutations within an amino acid chain, and hence a decrease in the percent similarity between the
original and final amino acid sequences. This is due to the redundancy in the genetic code: the
same amino acid residue can be encoded by multiple, or synonymous, codons (Spencer and Barral 1).
Also, by comparing the percent identities of DNA and amino acid sequences, it is hypothesized
that their difference will increase in proportion to the mutation rate.
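The redundancy referred to above can be made concrete with a short Python sketch. It is an illustration only and is not taken from NOVA: the compact table layout and all function names are assumptions. It builds the standard genetic code, counts how many codons encode each amino acid, and computes the fraction of single-base substitutions that leave the encoded amino acid unchanged.

```python
from collections import Counter
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
# '*' marks a stop codon. This layout is an illustrative assumption.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_per_amino_acid():
    """Count how many codons encode each amino acid (or stop signal)."""
    return Counter(CODON_TABLE.values())


def synonymous_fraction():
    """Fraction of all single-base substitutions that keep the same amino acid."""
    same = 0
    total = 0
    for codon, aa in CODON_TABLE.items():
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue  # not a substitution
                mutant = codon[:pos] + base + codon[pos + 1:]
                total += 1
                same += CODON_TABLE[mutant] == aa
    return same / total


counts = codons_per_amino_acid()
print("Leucine codons:", counts["L"])
print("Methionine codons:", counts["M"])
print("Synonymous fraction:", round(synonymous_fraction(), 3))
```

Under this table, leucine is encoded by six codons and methionine by one, and roughly a quarter of all possible single-base substitutions are synonymous.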
Computational Procedure
The objective of this study was to simulate the action of SNPs on either a randomized or a
manually entered DNA sequence by building a computer program known as NOVA. This
program not only visualizes the action of SNPs, but also displays the resulting amino acid
polymorphism. NOVA is a user-friendly program that quantifies the differences
between the initial and final DNA and amino acid sequences in terms of percent similarity. NOVA
was written in the Python programming language, with its visual component, VPython (Visual
Python), used to display the graphs generated from the inputs. Users can then examine the trends
in the percent similarities of the DNA and the resulting amino acid sequence for analytical
purposes. Users can also choose one specific segment of a mutated protein and compare it with
the original sequence.
NOVA was programmed to show distinct differences between original and mutated DNA
sequences as well as amino acid sequences. First of all, the user is requested to choose between
an ‘automatic’ or a ‘manual’ mode; the ‘automatic’ mode produces a random DNA sequence,
whereas the ‘manual’ mode requires the user to enter a sequence to examine. The user then
supplies the remaining parameters: the number of base pairs to generate (in ‘automatic’ mode),
the mutation probability of each nucleotide, the probability that a mutated nucleotide is
checked and corrected, and the number of cell cycles. The user is then shown the
antiparallel strands of DNA over the chosen number of cell cycles, as the strands are mutated
gradually yet randomly to simulate the point mutations in a real cell cycle.
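The mutation step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not NOVA's actual code: the function names are hypothetical, only one strand is tracked, and an uncorrected mutation is assumed to replace the base with one drawn uniformly from all four bases (so the replacement may coincide with the previous base).

```python
import random

BASES = "ATGC"


def random_sequence(length, rng):
    """'Automatic' mode: generate a random DNA sequence of the given length."""
    return "".join(rng.choice(BASES) for _ in range(length))


def replicate(sequence, mutation_rate, correction_rate, rng):
    """One cell cycle: each base may mutate, and each mutation may be corrected."""
    copied = []
    for base in sequence:
        mutated = rng.random() < mutation_rate
        corrected = rng.random() < correction_rate
        if mutated and not corrected:
            # Assumption: the new base is drawn uniformly from all four bases.
            base = rng.choice(BASES)
        copied.append(base)
    return "".join(copied)


def simulate(sequence, mutation_rate, correction_rate, cycles, seed=0):
    """Apply the per-cycle mutation step for the requested number of cell cycles."""
    rng = random.Random(seed)
    for _ in range(cycles):
        sequence = replicate(sequence, mutation_rate, correction_rate, rng)
    return sequence


original = random_sequence(30, random.Random(1))
print(original)
print(simulate(original, mutation_rate=0.05, correction_rate=0.0, cycles=10))
```

A fixed seed is used here only so that repeated runs of the sketch give the same strands; a real simulation would normally vary it.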
After that, NOVA focuses on the translation of the DNA strands into amino acids. In both
the original and final DNA sequences, NOVA locates the start codon (AUG) and begins
translating triplets of nucleotides into amino acids. When NOVA
encounters a stop codon, it stops reading until another AUG sequence is recognized. More
emphasis was placed on translation than on transcription, as the eventual
percent similarity between the original and mutated amino acid sequences can be calculated
using only one of the two antiparallel strands. After translation, the program shows the
original translated amino acid sequence as well as the mutant amino acid sequence.
Additionally, it displays and compares the protein primary structures that would be produced by
the original and the mutated DNA. The user is then offered the choice to compare specific
original and mutated primary structures in more quantitative detail.
Afterwards, the user is offered the choice to run the simulation in “rapid mode”, which
automatically repeats the above percent similarity calculations a desired number of times,
increasing the mutation rate each time, to observe the effect of mutation rate on the
relationship between DNA and amino acid percent similarities. VPython is used to graph the
percent similarity of the final amino acid sequence (shown in red) and the percent similarity of
the final DNA sequence (shown in green) versus the mutation rate of SNPs. Ultimately, the data
are used to inspect the relationship between the mutation rate and the difference between the
percent identities of the DNA and amino acid strands examined. The y-axes of the graphs
generated by the simulation represent the percent similarity/identity between the original and
mutated strands, while the x-axes represent the mutation possibility/rate, which increases in
increments of 0.001. It is important to note that the rates of polymorphisms are controlled and
varied by altering the mutation rates in the program.
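The “rapid mode” sweep can be illustrated with the sketch below, which covers the DNA curve only. The names are hypothetical, and two assumptions are made that the text does not confirm: percent identity is taken to be the share of matching positions, and every mutation rate is assumed to start again from the original sequence. The amino acid curve would be obtained by applying the same comparison to the translated sequences.

```python
import random

BASES = "ATGC"


def percent_identity(original, mutated):
    """Percentage of matching positions; unequal lengths count as mismatches."""
    longest = max(len(original), len(mutated))
    if longest == 0:
        return 100.0
    matches = sum(a == b for a, b in zip(original, mutated))
    return 100.0 * matches / longest


def mutate(sequence, mutation_rate, correction_rate, cycles, rng):
    """Mutate a sequence over several cell cycles (uniform base replacement assumed)."""
    seq = list(sequence)
    for _ in range(cycles):
        for i in range(len(seq)):
            if rng.random() < mutation_rate and rng.random() >= correction_rate:
                seq[i] = rng.choice(BASES)
    return "".join(seq)


def rapid_mode(sequence, steps, cycles, correction_rate, increment=0.001, seed=0):
    """Return (mutation rate, DNA percent identity) pairs for increasing rates."""
    rng = random.Random(seed)
    results = []
    for k in range(1, steps + 1):
        rate = k * increment
        # Assumption: each rate is applied to a fresh copy of the original strand.
        mutated = mutate(sequence, rate, correction_rate, cycles, rng)
        results.append((rate, percent_identity(sequence, mutated)))
    return results


for rate, identity in rapid_mode("ATGC" * 25, steps=5, cycles=100, correction_rate=0.999):
    print(f"{rate:.3f}  {identity:.1f}%")
```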
A Mathematical Model for DNA Percent Identity
Once all the data had been gathered, a mathematical model was derived to represent the
effect of mutation rate on the percent identity of a DNA strand. This
was done to confirm that the simulation is functional and capable of reproducing
accurate results. The model was then compared to the data from our simulation. The following
demonstrates and explains the mathematical model that was developed.
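$$\frac{dm}{dt} = \frac{3}{4}\,(1 - r_{cor})\,r_{mut}\,(n - m) \;-\; \frac{1}{4}\,(1 - r_{cor})\,r_{mut}\,m \qquad (1)$$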
The overall change in the number of mutated base pairs m per cell cycle is equivalent to the
derivative of m(t), where m is defined as a function of t, the number of cell cycles carried out.
The rate is equal to the number of mutations that cause base pairs to differ from
their original configuration, (¾)(1 - r_cor)(r_mut)(n - m), minus the number of mutations that
cause mutated base pairs to return to their unmutated state, (¼)(1 - r_cor)(r_mut)(m). As seen in
Equation 1, r_cor is the correction rate, r_mut is the mutation rate, n is the number of base
pairs, and the fraction ¼ represents that only one out of four mutations will manage to
‘correct’ the base pair.
By solving the differential equation (1) with the initial condition m(t = 0) = 0, a particular
solution was obtained (Equation 2) that describes how the number of mutated base pairs changes
with respect to the number of cell cycles.
$$m(t) = \frac{3}{4}\,n\left(1 - e^{-(1 - r_{cor})\,r_{mut}\,t}\right) \qquad (2)$$
Equation 2 was then used to account for the effect of changing mutation rates (r_mut),
while keeping the number of base pairs (n) and the number of cell cycles (t) constant.
The resulting equation describes the percent identity (%) of the DNA molecule as a function of
mutation rate, which can simply be expressed as:
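$$\text{percent identity}(r_{mut}) = 100\,\frac{n - m}{n} = 25 + 75\,e^{-(1 - r_{cor})\,r_{mut}\,t} \qquad (3)$$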
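As a cross-check on the model, the sketch below evaluates the closed-form percent identity implied by the differential equation described above and compares it with a direct numerical integration of that equation. It assumes that percent identity is defined as 100(n - m)/n, which gives 25 + 75·exp(-(1 - r_cor)·r_mut·t); the function names and the choice of a forward Euler integrator are illustrative assumptions, not part of the original analysis.

```python
import math


def model_percent_identity(mutation_rate, correction_rate, cycles):
    """Closed form: 25 + 75 * exp(-(1 - r_cor) * r_mut * t)."""
    k = (1.0 - correction_rate) * mutation_rate
    return 25.0 + 75.0 * math.exp(-k * cycles)


def euler_percent_identity(mutation_rate, correction_rate, cycles,
                           n=500, steps_per_cycle=100):
    """Integrate dm/dt = (3/4)k(n - m) - (1/4)k*m with forward Euler, m(0) = 0."""
    k = (1.0 - correction_rate) * mutation_rate
    dt = 1.0 / steps_per_cycle
    m = 0.0
    for _ in range(cycles * steps_per_cycle):
        m += dt * (0.75 * k * (n - m) - 0.25 * k * m)
    return 100.0 * (n - m) / n


closed_form = model_percent_identity(0.05, 0.999, 1000)
numerical = euler_percent_identity(0.05, 0.999, 1000)
print(round(closed_form, 3), round(numerical, 3))
```

In the long run the closed form tends to 25%, the identity expected between two unrelated random sequences drawn from four equally likely bases.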
After the data collection was completed, chi-square tests were carried out between the
model described by Equation 3 and the results obtained through the simulation to assess the
overall accuracy of NOVA. The result of this analysis can be found in the Results section along
with the rest of the data obtained throughout the investigation.
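The chi-square comparison can be sketched as follows. The text does not state the exact formula used, so this sketch assumes Pearson's statistic, the sum of (observed - expected)² / expected over all mutation rates; the function name and the numbers in the usage example are hypothetical and are not data from this study.

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over paired values."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical percent identities, for illustration only.
simulated = [99.8, 99.1, 98.7]
modelled = [99.9, 99.2, 98.5]
print(chi_square_statistic(simulated, modelled))
```

The statistic is then compared against the critical value for the chosen significance level, which can be read from a chi-square table; a statistic below the critical value means the simulated values do not differ significantly from the modelled ones.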
Results
After running multiple trials with NOVA, data were gathered, and graphically presented
via VPython. The graph below depicts the % similarities for both DNA and amino acid
sequences (Figure 1).
Figure 1. Percent identity versus mutation rate for 500 b.p. in 'automatic' mode. Correction rate = 9990; Number of cell cycles per mutation rate = 1000; Number of times the mutation rate is increased by 0.001: 50
Subsequently, the percent identity of the final amino acid sequence was compared directly with
that of the DNA sequence as the mutation rate gradually increased (Figure 2). This comparison is
based on the concept that a DNA strand is eventually transcribed and translated into amino acids.
Figure 2. Difference between DNA and amino acid percent identities versus mutation rate for 500 b.p. in 'automatic' mode. Y-axis: difference between the percent identities of DNA and amino acid sequences, compared to original strands; X-axis: mutation rate, increasing by 0.001 per cell cycle
Next, the number of base pairs was increased to 1000 b.p., 1500 b.p. and 2000 b.p., and
the program was run again. As before, the final amino acid sequence and the DNA sequence were
compared in each scenario.
Figure 3. Percent identity versus mutation rate for 1000 b.p. in 'automatic' mode. Correction rate = 9990; Number of cell cycles per mutation rate = 1000; Number of times the mutation rate is increased by 0.001: 50
Figure 4. Difference between DNA and amino acid percent identities versus mutation rate for 1000 b.p. in 'automatic' mode. Y-axis: difference between the percent identities of DNA and amino acid sequences, compared to original strands; X-axis: mutation rate, increasing by 0.001 per cell cycle
Figure 5. Percent identity versus mutation rate for 1500 b.p. in 'automatic' mode. Correction rate = 9990; Number of cell cycles per mutation rate = 1000; Number of times the mutation rate is increased by 0.001: 50
Figure 6. Difference between DNA and amino acid percent identities versus mutation rate for 1500 b.p. in 'automatic' mode. Y-axis: difference between the percent identities of DNA and amino acid sequences, compared to original strands; X-axis: mutation rate, increasing by 0.001 per cell cycle
Figure 7. Percent identity versus mutation rate for 2000 b.p. in 'automatic' mode. Correction rate = 9990; Number of cell cycles per mutation rate = 5000; Number of times the mutation rate is increased by 0.001: 100
Figure 8. Difference between DNA and amino acid percent identities versus mutation rate for 2000 b.p. in 'automatic' mode. Y-axis: difference between the percent identities of DNA and amino acid sequences, compared to original strands; X-axis: mutation rate, increasing by 0.001 per cell cycle
Lastly, the ‘manual’ mode of NOVA was tested by entering a DNA sequence associated with
Human Immunodeficiency Virus (HIV) infection and the mRNA sequence of Homo sapiens
tubulin, alpha 3c (TUBA3C). The sequences were obtained from the National Center for
Biotechnology Information (NCBI). Once again, the DNA and amino acid sequences were
compared under various mutation rates.
Figure 9. Percent identity versus mutation rate for 1617 b.p. HIV gene in 'manual' mode. Correction rate = 9990; Number of cell cycles per mutation rate = 1000; Number of times the mutation rate is increased by 0.001: 50
Figure 10. Difference between DNA and amino acid percent identities versus mutation rate for 1617 b.p. HIV gene in 'manual' mode. Y-axis: difference between the percent identities of DNA and amino acid sequences, compared to original strands; X-axis: mutation rate, increasing by 0.001 per cell cycle
Figure 11. Percent identity versus mutation rate for 1564 b.p. TUBA3C gene in 'manual' mode. Correction rate = 9990; Number of cell cycles per mutation rate = 1000; Number of times the mutation rate is increased by 0.001: 50
Figure 12. Difference between DNA and amino acid percent identities versus mutation rate for 1564 b.p. TUBA3C gene in 'manual' mode. Y-axis: difference between the percent identities of DNA and amino acid sequences, compared to original strands; X-axis: mutation rate, increasing by 0.001 per cell cycle
Once the data collection step was completed, chi-square tests were carried out to evaluate
the overall accuracy of the simulation by assessing how closely the DNA percent identities
obtained matched the derived model. The results of this analysis are shown in
Table 1 below:
DNA sequence               Chi-square   Critical value (α = 0.05)
500 b.p.                   0.221911     67.5048
1000 b.p.                  0.095864     67.5048
1500 b.p.                  0.061888     67.5048
HIV                        0.060312     67.5048
TUBA3C                     0.08034      67.5048
2000 b.p. (5000 cycles)    0.444035     124.3421

Table 1: Chi-square values and critical values calculated for each DNA sequence.
Discussion and Implications
One of the significant findings of this study is the relationship between the percent
identities of the final DNA and amino acid sequences at various mutation
rates. The trends in Figures 2, 4, 6, 10 and 12 exhibit a linear relationship, whereas the graph
in Figure 8 deviates slightly from linearity for mutation rates above 0.06 (as indicated
by the equation on the bottom left). This reinforces the idea that the difference between the
percent identities of DNA and amino acid sequences increases proportionally with rising mutation
rates. Although these data indicate a linear relationship between the mutation rate and
the difference between DNA and amino acid percent identities, this study
was not able to quantify the exact nature of this relationship explicitly.
In addition, the results reveal that the effect of mutations on amino acids is more drastic
than that on DNA base pairs. This is attributed largely to the redundancy of the genetic code,
as explained by Spencer and Barral (1). Redundancy here refers to the fact that multiple codons
may correspond to one amino acid, so that a specific amino acid can be altered by
mutations occurring at multiple sites.
Statistically, the chi-square (χ²) values presented are far lower
than their respective critical values for α = 0.05, which indicates that the DNA percent
identities obtained from the simulation are close to the
modelled values given by the differential equation. Note that the χ² value for most sequences
examined decreases gradually as the number of base pairs increases. This is because
increasing the number of base pairs reduces the overall noise in the data.
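One way to make the noise argument concrete is sketched below. It rests on an assumption that is not stated in the original analysis: each of the n base pairs ends up mutated independently with the same probability p. The simulated percent identity is then a scaled binomial proportion:

$$\hat{P} = 100\left(1 - \frac{m}{n}\right), \qquad m \sim \mathrm{Binomial}(n, p) \quad\Rightarrow\quad \operatorname{Var}(\hat{P}) = \frac{100^{2}\,p\,(1 - p)}{n}$$

Under this assumption the standard deviation of the percent identity shrinks in proportion to $1/\sqrt{n}$, so quadrupling the sequence length from 500 b.p. to 2000 b.p. would halve the scatter of the simulated points around the modelled curve.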
A major limitation encountered throughout this investigation was the
computational capacity of the device used to simulate the effect of mutation rates on the percent
identities of DNA and amino acid sequences. Initially, in order to obtain results as close to
real-life values as possible, it was decided to run the simulation with a correction rate of
99.99%. However, the computer used to run the program could not produce results
quickly enough. Accordingly, to obtain sensible results in a relatively short amount
of time, the correction rate was lowered to 99.9%. Another significant factor that limited the
extent of this research was our focus on only one type of DNA mutation, namely single
nucleotide polymorphisms, among many other possibilities. Although DNA replication
is based on simple principles and mechanisms, it is itself a complicated process; even a minuscule
mutation in a critical part of the genetic material can have a crucial effect on the
amino acid product, and hence on the proteins synthesized by the cell. Therefore, a more
accurate simulation would also need to take into account the
types of mutations that were disregarded in this investigation.
Ultimately, this quantitative analysis of DNA and amino acid sequences before and after
mutation may make it possible to estimate the number of mutation cycles required in a cell. In
the future, it may be possible to obtain the duration (the number of cell cycles) needed to
transform an original amino acid sequence into a particular mutant sequence. With further coding
and improvements, NOVA could be turned into a systematic tool for estimating the minimum
time required for a DNA strand to evolve to a certain sequence at specific mutation
and correction rates. NOVA may thus assist scientific research in fields such as
genetics, developmental biology, and genome evolution.
Conclusion
This study investigated the effect of SNPs on the percent identities of DNA and amino acid
sequences. With the assistance of computational analysis by NOVA, it was concluded that the
difference between the percent identities of DNA and amino acid sequences increases
proportionally with increasing mutation rate. A mathematical model was constructed to further
support our numerical data; the simulated values were very close to the modelled values, as
indicated by the chi-square (χ²) test. Also, it was observed that the χ² value gets smaller as
the number of base pairs increases. Fundamentally, NOVA proved to be a well-designed tool that
scientists could use to address problems in the fields of evolutionary genomics and
developmental biology.
Acknowledgments
We would like to express our utmost gratitude to our mentor Pamela Kalas, who provided
us with knowledge from the biological perspective. We would also like to thank the peer
reviewers for proofreading this paper and providing feedback for further improvements.
Works Cited
Cargill, M., et al. "Characterization of Single-nucleotide Polymorphisms in Coding Regions of Human Genes." Nature (1999): 231–238.
"Human Immunodeficiency Virus Infection Inhibitors and Therapeutic or Prophylactic Agent for Aids." National Center for Biotechnology Information. U.S. National Library of Medicine, n.d. Web. 13 Mar. 2015. <http://www.ncbi.nlm.nih.gov/nuccore/DM466407.1>.
"Homo Sapiens Tubulin, Alpha 3c (TUBA3C), MRNA." National Center for Biotechnology Information. U.S. National Library of Medicine, n.d. Web. 13 Mar. 2015. <http://www.ncbi.nlm.nih.gov/nuccore/NM_006001.2>.
Katara, P. “Single Nucleotide Polymorphism and Its Dynamics for Pharmacogenomics.” Interdiscip Sci Comput Life Sci 6 (2014): 85–92.
Kirk, B. W., et al. "Single Nucleotide Polymorphism Seeking Long Term Association with Complex Disease." Nucleic Acids Research 30.15 (2002): 3295–3311.
Ng, P. C., Henikoff, S. "Predicting the Effects of Amino Acid Substitutions on Protein Function." Annual Review of Genomics and Human Genetics 7 (2006): 61–80.
Spencer, P. S., Barral, J. M. "Genetic Code Redundancy and Its Influence on the Encoded Polypeptides." Computational and Structural Biotechnology Journal 1.1 (2012): 1–8.
Wakui, M. "Analysis of Single Nucleotide Polymorphisms (SNPs)." The Japanese Journal of Clinical Pathology 61.11 (2013): 1008–1017.
Wu, J., Jiang, R. "Prediction of Deleterious Nonsynonymous Single-Nucleotide Polymorphism for Human Diseases." The Scientific World Journal (2013): 11.