Erdogan and Wong (2015)

Computational simulation of single nucleotide polymorphisms and their effect on the difference between the percent identities of DNA and amino acid sequences

Batuhan Erdogan and Bryan Wong

Science One Program

The University of British Columbia March 2015

ABSTRACT

This study investigated the action of single nucleotide polymorphisms (SNPs) in DNA sequences. A VPython model was designed to show the effect of SNPs on DNA strands and, subsequently, on amino acid sequences. The graphs produced with VPython show trends in the percent similarity (%) of the amino acid sequence and of the DNA base pair sequence. A spreadsheet program was then used to compare directly the difference between the final percent identities (%) of the DNA and amino acid sequences with respect to mutation rate. Our goal was to find the relationship between the frequency of SNPs and the final amino acid and DNA sequences. It was hypothesized that a greater number of SNPs would produce more mutations within an amino acid sequence, largely owing to the redundancy of the genetic code. The difference between the percent identities of the DNA and amino acid sequences was found to be proportional to the mutation rate.

Introduction

Single nucleotide polymorphisms (SNPs) give rise to the immense diversity present in nature. By definition, SNPs are single-base variations in a DNA sequence that are present within the population at a frequency greater than 1% (Kirk et al. 3295). SNPs in coding regions (cSNPs) fall into two main categories: those that do not affect the protein sequence are known as synonymous SNPs, whereas those that change the amino acid sequence of the protein are known as non-synonymous SNPs (nsSNPs). Because they alter the amino acid sequence of the encoded protein, nsSNPs are more likely to provoke disease. nsSNPs are sub-divided into three groups, based on the sequence, structure, and annotation of the resulting protein (Wu and Jiang 1).
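The distinction between synonymous and non-synonymous SNPs can be illustrated with a short Python sketch. It is not taken from NOVA; the function name `classify_snp` and the compact string encoding of the standard genetic code are assumptions made for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_snp(codon, position, new_base):
    """Label a single-base change within one codon (hypothetical helper)."""
    mutant = codon[:position] + new_base + codon[position + 1:]
    same_residue = CODON_TABLE[codon] == CODON_TABLE[mutant]
    return "synonymous" if same_residue else "non-synonymous"


print(classify_snp("GCT", 2, "C"))  # GCT and GCC both encode alanine
print(classify_snp("GAA", 1, "T"))  # GAA (glutamate) becomes GTA (valine)
```

A change in the third position of a codon often leaves the residue unchanged, which is the redundancy referred to throughout this paper.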

Previous research has uncovered numerous aspects of SNPs. Cargill et al. observed that polymorphisms occur at different rates depending on their identity and their significance to the encoded protein (231). SNPs are used in a wide range of clinical research. For instance, they are used in population genetics and medical science to explore genetic polymorphisms (Wakui 1008). SNPs also allow medication to be personalized, which has given rise to the field of pharmacogenomics (Katara 85). In computer science, Ng and Henikoff created SIFT (Sorting Intolerant From Tolerant), a program that predicts whether an amino acid substitution caused by an SNP affects protein function (3812).

In contrast, this research focuses less on the implications of SNPs and more on the mechanics of the mutation process, by means of a computational simulation named NOVA. NOVA gives a concrete visualization of how DNA strands mutate over successive cell cycles. The model aims to connect SNPs with their effect on the final amino acid sequence by quantitatively comparing the percent similarity, or identity (%), between the original and final amino acid sequences, as well as the percent identities of the final DNA and amino acid sequences, with respect to mutation rate. It is postulated that an increase in the number of SNPs will generate a progressive increase in mutations within an amino acid chain, and hence a decrease in % similarity between the original and final amino acid sequences. This is attributed to the redundancy of the genetic code: the same amino acid residue can be encoded by multiple, or synonymous, codons (Spencer and Barral 1). In addition, it is hypothesized that the difference between the percent identities of the DNA and amino acid sequences will increase in proportion to the mutation rate.

Computational Procedure

The objective of this study was to simulate the action of SNPs on either a randomized or a manually entered DNA sequence by building a computer program named NOVA. The program not only visualizes the action of SNPs but also displays the resulting amino acid polymorphism. NOVA is a user-friendly program that reports the differences between the initial and final DNA and amino acid sequences in terms of % similarity. It was written in the Python programming language, and its visual component uses VPython (Visual Python) to display the graphs generated from the inputs. Users can then examine the trends in percent similarity of the DNA and the resulting amino acid sequence for analytical purposes. Users can also choose one segment of a mutated protein and compare it with the original sequence.
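The percent similarity reported by NOVA can be pictured as a position-by-position comparison. The sketch below assumes two sequences of equal length; the function name `percent_identity` is hypothetical, and how NOVA scores amino acid sequences whose lengths differ after mutation is not stated here.

```python
def percent_identity(original, mutated):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(original) != len(mutated):
        raise ValueError("sequences must have the same length")
    if not original:
        return 100.0  # two empty sequences are treated as identical
    matches = sum(a == b for a, b in zip(original, mutated))
    return 100.0 * matches / len(original)


# One of nine bases differs, so eight of nine positions match.
print(percent_identity("ATGGCTTAA", "ATGGCCTAA"))
```

The same function applies to DNA strings and to strings of one-letter amino acid codes.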

NOVA was programmed to show the differences between the original and mutated DNA sequences as well as the amino acid sequences. First, the user is asked to choose between an ‘automatic’ and a ‘manual’ mode; the ‘automatic’ mode produces a random DNA sequence, whereas the ‘manual’ mode requires the user to enter the sequence to be examined. The user then supplies the remaining parameters: the number of base pairs to generate (in ‘automatic’ mode), the mutation probability of each nucleotide, the probability that a nucleotide is checked and corrected, and the number of cell cycles. The user is then shown the anti-parallel strands of DNA over the chosen number of cell cycles as the strands are mutated gradually and randomly, to simulate the point mutations of a real cell cycle.
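The mutation step described above can be sketched as follows. This is a minimal illustration rather than NOVA's source code: the names `replicate` and `simulate` are hypothetical, and the choice to draw the replacement base uniformly from all four bases is an assumption inferred from the 3/4 and 1/4 factors of the mathematical model presented later.

```python
import random

BASES = "ACGT"


def replicate(sequence, r_mut, r_cor, rng):
    """Copy a strand once; each base may be miscopied and each miscopy may be corrected."""
    copy = []
    for base in sequence:
        if rng.random() < r_mut and rng.random() >= r_cor:
            # Uncorrected error: draw any of the four bases uniformly (assumption),
            # so a quarter of these draws reproduce the current base by chance.
            base = rng.choice(BASES)
        copy.append(base)
    return "".join(copy)


def simulate(sequence, r_mut, r_cor, cycles, seed=0):
    """Apply `cycles` rounds of replication and return the final strand."""
    rng = random.Random(seed)
    for _ in range(cycles):
        sequence = replicate(sequence, r_mut, r_cor, rng)
    return sequence


original = "ATGGCTGCTAAATAA"
print(original)
print(simulate(original, r_mut=0.05, r_cor=0.9, cycles=100))
```

Only one strand is tracked in this sketch; the complementary strand can be derived from it at any cycle.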

After that, NOVA turns to the translation of DNA strands into amino acids. For both the original and final DNA sequences, NOVA locates the AUG sequences (start codons) within the sequence and starts translating triplets of nucleotides into amino acids. When NOVA encounters a stop codon, it stops reading until another AUG sequence is recognized. More emphasis was placed on translation than on transcription, because the eventual percent similarity between the original and mutated amino acid sequences can be calculated using only one of the two antiparallel strands. After translation, the program shows the original translated amino acid sequence as well as the mutant amino acid sequence. Additionally, it displays and compares the protein primary structures that would be produced by the original and mutated DNA. The user is offered the choice to compare the original and mutated primary structures directly for more quantitative detail.
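The scan-and-translate behaviour described above might look like the following sketch. It is written for the coding DNA strand, so the start codon appears as ATG rather than AUG; the function name `translate` and the decision to keep a trailing protein that lacks a stop codon are assumptions, not details confirmed for NOVA.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate each reading frame opened by ATG in one left-to-right scan."""
    proteins = []
    i = 0
    while i + 3 <= len(dna):
        if dna[i:i + 3] != "ATG":
            i += 1  # keep scanning one base at a time for a start codon
            continue
        residues = []
        while i + 3 <= len(dna):
            residue = CODON_TABLE[dna[i:i + 3]]
            i += 3
            if residue == "*":  # stop codon: end this protein, resume scanning
                break
            residues.append(residue)
        proteins.append("".join(residues))
    return proteins


print(translate("CCATGGCTTAAGGATGAAATGA"))  # two reading frames: M-A and M-K
```

A single substitution that creates or destroys a start or stop codon changes every residue downstream, which is one way a small DNA difference can become a large amino acid difference.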

Afterwards, the user is offered the choice to run the simulation in “rapid mode”, which repeats the above percent similarity calculations automatically a desired number of times, increasing the mutation rate each time, to observe the effect of mutation rate on the relationship between DNA and amino acid percent similarities. VPython is used to graph the % similarity of the final amino acid sequence (shown in red) and the % similarity of the final DNA sequence (shown in green) against the mutation rate. Finally, the data are used to inspect the relationship between the mutation rate and the difference between the percent identities of the DNA and amino acid strands examined. The y-axes of the graphs generated by the simulation represent the percent similarity/identity between the original and mutated strands, while the x-axes represent the mutation probability/rate, increasing in increments of 0.001. Note that the rates of polymorphism are controlled and varied by altering the mutation rate in the program.
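The structure of a rapid-mode sweep can be sketched as below for the DNA curve only. The names `mutate`, `identity` and `rapid_mode` are hypothetical, and two details are assumptions: that each run restarts from the original sequence, and that the first mutation rate equals one increment. The amino acid curve would be obtained the same way after translating both strands.

```python
import random


def mutate(seq, r_mut, r_cor, cycles, rng):
    """Mutate a strand over several cell cycles (uniform replacement base assumed)."""
    seq = list(seq)
    for _ in range(cycles):
        for i in range(len(seq)):
            if rng.random() < r_mut and rng.random() >= r_cor:
                seq[i] = rng.choice("ACGT")
    return "".join(seq)


def identity(a, b):
    """Percent of matching positions between two equal-length strands."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def rapid_mode(original, r_cor, cycles, steps, increment=0.001, seed=0):
    """Return (mutation rate, DNA percent identity) pairs for increasing rates."""
    rng = random.Random(seed)
    rows = []
    for k in range(1, steps + 1):
        r_mut = k * increment
        mutated = mutate(original, r_mut, r_cor, cycles, rng)  # restart from the original
        rows.append((r_mut, identity(original, mutated)))
    return rows


for r_mut, pct in rapid_mode("ACGT" * 125, r_cor=0.999, cycles=100, steps=5):
    print(f"{r_mut:.3f}  {pct:.1f}")
```

Each returned pair corresponds to one point on the green (DNA) curve of the percent identity graphs.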

A Mathematical Model for DNA Percent Identity

Once all the data had been gathered, a mathematical model was derived to represent the effect of mutation rate on the percent identity of a DNA strand. This was done to confirm that the simulation is functional and capable of producing accurate results. The model was then compared with the data from our simulation. The following explains the mathematical model that was derived.

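$$\frac{dm}{dt} = \tfrac{3}{4}\,(1 - r_{cor})\,r_{mut}\,(n - m) \;-\; \tfrac{1}{4}\,(1 - r_{cor})\,r_{mut}\,m \qquad (1)$$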

The overall change in the number of mutated base pairs $m$ per cell cycle is the derivative of $m(t)$, where $t$ is the number of cell cycles carried out. This rate equals the number of mutations that cause base pairs to differ from their original configuration, $\tfrac{3}{4}(1 - r_{cor})\,r_{mut}\,(n - m)$, minus the number of mutations that return mutated base pairs to their unmutated state, $\tfrac{1}{4}(1 - r_{cor})\,r_{mut}\,m$. In Equation 1, $r_{cor}$ is the correction rate, $r_{mut}$ is the mutation rate, and $n$ is the number of base pairs. The fraction 3/4 reflects that three of the four possible bases differ from the original one, and the fraction 1/4 that only one out of four mutations will manage to ‘correct’ a mutated base pair.

Solving differential equation (1) with the initial condition $m(t = 0) = 0$ gave a particular solution (Equation 2) that describes how the number of mutated base pairs changes with the number of cell cycles.

$$m(t) = \tfrac{3}{4}\,n\left(1 - e^{-(1 - r_{cor})\,r_{mut}\,t}\right) \qquad (2)$$

Equation 2 was then used to account for the effect of changing the mutation rate ($r_{mut}$) while keeping the number of base pairs ($n$) and the number of cell cycles ($t$) constant. The resulting equation gives the percent identity (%) of the DNA molecule as a function of the mutation rate, and can be expressed as:

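$$\text{percent identity}(r_{mut}) = 100\,\frac{n - m}{n} = 25 + 75\,e^{-(1 - r_{cor})\,r_{mut}\,t} \qquad (3)$$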
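As a numerical check on the model, the two rate terms described for Equation 1 can be applied cycle by cycle and compared with the closed-form solution for m(0) = 0. The closed form used here, m(t)/n = (3/4)(1 − exp(−(1 − r_cor) r_mut t)), is derived from those terms and should be read as an assumption about the exact form of the model; the function names are hypothetical, and the correction rate of 0.999 corresponds to the 99.9% used in this study.

```python
import math


def mutated_fraction_closed_form(r_mut, r_cor, t):
    """m(t)/n from dm/dt = k*(3n/4 - m) with m(0) = 0, where k = (1 - r_cor)*r_mut."""
    k = (1.0 - r_cor) * r_mut
    return 0.75 * (1.0 - math.exp(-k * t))


def mutated_fraction_stepwise(r_mut, r_cor, cycles):
    """Apply the gain and loss terms once per cell cycle instead of solving the equation."""
    k = (1.0 - r_cor) * r_mut
    frac = 0.0
    for _ in range(cycles):
        frac += 0.75 * k * (1.0 - frac) - 0.25 * k * frac
    return frac


def percent_identity_model(r_mut, r_cor, t):
    """Percent identity predicted by the closed form: 100 * (1 - m/n)."""
    return 100.0 * (1.0 - mutated_fraction_closed_form(r_mut, r_cor, t))


for r_mut in (0.001, 0.025, 0.05):
    exact = mutated_fraction_closed_form(r_mut, 0.999, 1000)
    stepped = mutated_fraction_stepwise(r_mut, 0.999, 1000)
    print(r_mut, round(percent_identity_model(r_mut, 0.999, 1000), 3), abs(exact - stepped))
```

For rates in the range studied, the two approaches agree closely, and the predicted percent identity tends to 25% at very high mutation rates, the value expected for two unrelated random sequences over four bases.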

After data collection was completed, chi-square tests were carried out between the model described by Equation 3 and the results obtained through the simulation to assess the overall accuracy of NOVA. The result of this analysis can be found in the Results section, along with the rest of the data obtained throughout the investigation.
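The comparison between model and simulation can be sketched with Pearson's chi-square statistic. Several things here are assumptions: that the statistic was computed directly on percent identities, that the model takes the closed form 25 + 75 exp(−(1 − r_cor) r_mut t), and the function name `chi_square`. The "simulated" values are a synthetic stand-in (the model shifted by ±0.2 points), not data from NOVA.

```python
import math


def chi_square(observed, expected):
    """Pearson's statistic: the sum of (O - E)^2 / E over paired values."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Modelled DNA percent identity for 50 mutation rates (0.001 to 0.050),
# assuming r_cor = 0.999 and t = 1000 cell cycles.
modelled = [25.0 + 75.0 * math.exp(-(1.0 - 0.999) * (k * 0.001) * 1000)
            for k in range(1, 51)]
# Synthetic stand-in for simulation output: the model shifted by +/-0.2 points.
simulated = [e + (0.2 if k % 2 else -0.2) for k, e in enumerate(modelled)]

statistic = chi_square(simulated, modelled)
CRITICAL_VALUE = 67.5048  # value reported in Table 1 for the 50-rate runs (alpha = 0.05)
print(statistic, statistic < CRITICAL_VALUE)
```

A statistic below the critical value means the simulated values are not significantly different from the modelled ones at the chosen significance level.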

Results

After running multiple trials with NOVA, data were gathered, and graphically presented

via VPython. The graph below depicts the % similarities for both DNA and amino acid

sequences (Figure 1).

Figure 1. Percent identity versus mutation rate for 500 b.p. in 'automatic' mode. Correction rate = 9990; Number of cell cycles per mutation rate = 1000; Number of times the mutation rate is increased by 0.001: 50

Subsequently, the final amino acid sequence was directly compared with the DNA

sequence according to a gradual increase in mutation rate (Figure 2). This is based on the

concept that a DNA strand is eventually transcribed and translated into amino acids.

Figure 2. Difference between DNA and amino acid percent identities versus mutation rate for 500 b.p. in 'automatic' mode. Y-axis: difference between the percent identities of DNA and amino acid sequences, compared to original strands; X-axis: mutation rate, increasing by 0.001 per cell cycle

Next, the number of base pairs was increased to 1000 b.p., 1500 b.p. and 2000 b.p., and the program was rerun. As before, the final amino acid sequence and the DNA sequence were compared in each scenario.

Figure 3. Percent identity versus mutation rate for 1000 b.p. in 'automatic' mode. Correction rate = 9990; Number of cell cycles per mutation rate = 1000; Number of times the mutation rate is increased by 0.001: 50

Figure 4. Difference between DNA and amino acid percent identities versus mutation rate for 1000 b.p. in 'automatic' mode. Y-axis: difference between the percent identities of DNA and amino acid sequences, compared to original strands; X-axis: mutation rate, increasing by 0.001 per cell cycle

Figure 5. Percent identity versus mutation rate for 1500 b.p. in 'automatic' mode. Correction rate = 9990; Number of cell cycles per mutation rate = 1000; Number of times the mutation rate is increased by 0.001: 50

Figure 6. Difference between DNA and amino acid percent identities versus mutation rate for 1500 b.p. in 'automatic' mode. Y-axis: difference between the percent identities of DNA and amino acid sequences, compared to original strands; X-axis: mutation rate, increasing by 0.001 per cell cycle

Figure 7. Percent identity versus mutation rate for 2000 b.p. in 'automatic' mode. Correction rate = 9990; Number of cell cycles per mutation rate = 5000; Number of times the mutation rate is increased by 0.001: 100

Figure 8. Difference between DNA and amino acid percent identities versus mutation rate for 2000 b.p. in 'automatic' mode. Y-axis: difference between the percent identities of DNA and amino acid sequences, compared to original strands; X-axis: mutation rate, increasing by 0.001 per cell cycle

Lastly, the ‘manual’ mode of NOVA was tested by entering a DNA sequence associated with Human Immunodeficiency Virus (HIV) infection and the mRNA sequence of Homo sapiens tubulin, alpha 3c (TUBA3C). The sequences were obtained from the National Center for Biotechnology Information (NCBI). Once again, the DNA and amino acid sequences were compared under various mutation rates.

Figure 9. Percent identity versus mutation rate for 1617 b.p. HIV gene in 'manual' mode. Correction rate = 9990; Number of cell cycles per mutation rate = 1000; Number of times the mutation rate is increased by 0.001: 50

Figure 10. Difference between DNA and amino acid percent identities versus mutation rate for 1617 b.p. HIV gene in 'manual' mode. Y-axis: difference between the percent identities of DNA and amino acid sequences, compared to original strands; X-axis: mutation rate, increasing by 0.001 per cell cycle

Figure 11. Percent identity versus mutation rate for 1564 b.p. TUBA3C gene in 'manual' mode. Correction rate = 9990; Number of cell cycles per mutation rate = 1000; Number of times the mutation rate is increased by 0.001: 50

Figure 12. Difference between DNA and amino acid percent identities versus mutation rate for 1564 b.p. TUBA3C gene in 'manual' mode. Y-axis: difference between the percent identities of DNA and amino acid sequences, compared to original strands; X-axis: mutation rate, increasing by 0.001 per cell cycle

Once data collection was completed, chi-square tests were carried out to evaluate the overall accuracy of the simulation by assessing how closely the DNA percent identities obtained agree with the derived model. The result of this analysis can be found in Table 1 below:

DNA sequence               500 b.p.    1000 b.p.   1500 b.p.   HIV         TUBA3C      2000 b.p. (5000 cycles)
Chi-square                 0.221911    0.095864    0.061888    0.060312    0.08034     0.444035
Critical value (α=0.05)    67.5048     67.5048     67.5048     67.5048     67.5048     124.3421

Table 1: Chi-square values and critical values calculated for each DNA sequence.

Discussion and Implications

One of the significant findings of this study is the relationship between the percent identities of the final DNA sequence and amino acid sequence at various mutation rates. The trends in Figures 2, 4, 6, 10 and 12 exhibit a linear relationship, whereas in Figure 8 the graph deviates slightly from linearity for mutation rates above 0.06 (as indicated by the equation on the bottom left). This reinforces the idea that the difference between the percent identities of DNA and amino acid sequences increases proportionally with rising mutation rates. Although these data point to a linear relationship between the mutation rate and the difference between DNA and amino acid percent identities, this study was not able to quantify the exact nature of this relationship explicitly.

In addition, the results reveal that the effect of mutations on amino acids is more drastic than that on DNA base pairs. This is attributed largely to the redundancy of the genetic code, as explained by Spencer and Barral (1). Redundancy here refers to the fact that multiple codons may correspond to one amino acid, so that a specific amino acid can be altered by mutations occurring at multiple sites.

Statistically, the chi-square (χ²) values are far lower than their respective critical values for α = 0.05, which indicates that the DNA percent identities produced by the simulation are close to the modelled values given by the solution of the differential equation. Note that the χ² value for most sequences examined decreases gradually as the number of base pairs increases. This is attributed to the fact that an increase in the number of base pairs reduces the overall noise in the data.

A major limitation encountered throughout this investigation was the computational capacity of the device used to simulate the effect of mutation rates on the percent identities of DNA and amino acid sequences. Initially, in order to obtain results as close as possible to real-life values, it was decided to run the simulation with a correction rate of 99.99%. However, the computer used to run the program could not produce results quickly enough. Accordingly, to obtain sensible results in a relatively short amount of time, the correction rate was readjusted to 99.9%. Another significant factor that limited the extent of this research was our focus on only one type of DNA mutation, namely single nucleotide polymorphisms, among many other possibilities. Although DNA replication is based on simple principles and mechanisms, it is itself a complicated process; even a minuscule mutation in a critical part of the genetic material can have a crucial effect on the amino acid product, and hence on the proteins synthesized by the cell. Therefore, a more accurate simulation would also have to take into account the types of mutation that were disregarded in this investigation.

Ultimately, this quantitative analysis of DNA and amino acid sequences before and after mutation may lead to the discovery of the number of mutation cycles required in a cell. It may become possible to obtain the duration (the number of cell cycles) needed to transform an original amino acid sequence into a particular mutant sequence. With further coding and improvements, NOVA could be turned into a systematic tool for minimizing the time required for a DNA strand to evolve into a certain sequence at specific mutation and correction rates. We expect that NOVA can assist scientific research in fields such as genetics, developmental biology, and genome evolution.

Conclusion

This study investigated the effect of SNPs on the percent identities of DNA and amino acid sequences. With the assistance of computational analysis by NOVA, it was concluded that the difference between the percent identities of DNA and amino acid sequences increases proportionally with increasing mutation rate. A mathematical model was constructed to further support our numerical data; the simulated values were very close to the modelled values, as indicated by the chi-square (χ²) test. It was also observed that the χ² value gets smaller as the number of base pairs increases. Overall, NOVA proved to be a well-designed tool for scientists to address problems in evolutionary genomics and developmental biology.

Acknowledgments

We would like to express our utmost gratitude to our mentor, Pamela Kalas, who shared her knowledge of the biological perspective with us. We would also like to thank the peer reviewers for proofreading this paper and providing feedback for further improvements.

Works Cited

Cargill, M., et al. "Characterization of Single-nucleotide Polymorphisms in Coding Regions of Human Genes." Nature Genetics 22 (1999): 231-238.

"Human Immunodeficiency Virus Infection Inhibitors and Therapeutic or Prophylactic Agent for Aids." National Center for Biotechnology Information. U.S. National Library of Medicine, n.d. Web. 13 Mar. 2015. <http://www.ncbi.nlm.nih.gov/nuccore/DM466407.1>.

"Homo Sapiens Tubulin, Alpha 3c (TUBA3C), MRNA." National Center for Biotechnology Information. U.S. National Library of Medicine, n.d. Web. 13 Mar. 2015. <http://www.ncbi.nlm.nih.gov/nuccore/NM_006001.2>.

Katara, P. “Single Nucleotide Polymorphism and Its Dynamics for Pharmacogenomics.” Interdiscip Sci Comput Life Sci 6 (2014): 85–92.

Kirk, B. W., et al. "Single Nucleotide Polymorphism Seeking Long Term Association with Complex Disease." Nucleic Acids Research 30.15 (2002): 3295-3311.

Ng, P. C., Henikoff, S. "Predicting the Effects of Amino Acid Substitutions on Protein Function." Annual Review of Genomics and Human Genetics 7 (2006): 61-80.

Spencer, P. S., Barral, J. M. "Genetic Code Redundancy and Its Influence on the Encoded Polypeptides." Computational and Structural Biotechnology Journal 1.1 (2012): 1-8.

Wakui, M. "Analysis of Single Nucleotide Polymorphisms (SNPs)." The Japanese Journal of Clinical Pathology 61.11 (2013): 1008-1017.

Wu, J., Jiang, R. "Prediction of Deleterious Nonsynonymous Single-Nucleotide Polymorphism for Human Diseases." The Scientific World Journal (2013): 1-1
