Improved short adjacent repeat identification using three evolutionary Monte Carlo schemes


Int. J. Data Mining and Bioinformatics, Vol. x, No. x, xxxx


Jin Xu†*, Qiwei Li‡, Victor O. K. Li†ˆ, Shuo-Yen Robert Li§, and Xiaodan Fan‡

†Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong
ˆVisiting Professor, Department of Computer Engineering, King Saud University, Saudi Arabia
‡Department of Statistics, §Department of Information Engineering, The Chinese University of Hong Kong, Sha Tin, Hong Kong
E-mail: [email protected]
E-mail: [email protected]
E-mail: [email protected]
E-mail: [email protected]
E-mail: [email protected]
*Corresponding author

Abstract: In nature, creatures that have genetic relationships often share similar biological sequences. One of the challenging problems is to identify and compare the short adjacent repeats in multiple sequences among different species. We call this the short adjacent repeats identification problem (SARIP). Although the recently proposed Markov chain Monte Carlo (MCMC) algorithm has been shown to be effective in solving SARIP, it suffers from two major drawbacks: long computation time and a tendency to be easily trapped in local optima. Fortunately, these defects of the MCMC algorithm can be surmounted by applying the Evolutionary Monte Carlo (EMC) algorithm, which is a useful tool for sampling complicated distributions. In this paper, we employ three EMC schemes, i.e., random exchange (RE), best exchange (BE), and crossover, to parallelize the MCMC algorithm to solve SARIP. In order to have a comprehensive comparison, we test all the algorithms on a wide range of synthetic data and real data. All EMC schemes are implemented on a parallel platform. The simulation results show that, compared with the conventional MCMC algorithm, all three EMC schemes can not only shorten the computation time by speeding up the convergence but also improve the solution quality in difficult cases. Moreover, we observe that the performance of the different EMC schemes depends on the degeneracy degree of the motif pattern.

Keywords: Short adjacent repeats, evolutionary Monte Carlo, parallel tempering, sequence motif, maximum a posteriori

Reference to this paper should be made as follows: Xu, J., Li, Q., Li, V.O.K., Li, S.Y.R. and Fan, X. (xxxx) ‘Improved short adjacent repeat identification using three evolutionary Monte Carlo schemes’, Int. J. Data Mining and Bioinformatics, Vol. x, No. x, pp.xxx–xxx.

Copyright © 2009 Inderscience Enterprises Ltd.

Biographical notes: Jin Xu is currently a PhD candidate in the Department of Electrical and Electronic Engineering at the University of Hong Kong, Hong Kong. His research interests include evolutionary algorithms and their applications, parallel and distributed computing, bioinformatics and portfolio optimization.

Qiwei Li received the B.Eng. degree in electronic engineering from Tsinghua University in 2008, and the M.Phil. degree in information engineering from The Chinese University of Hong Kong in 2010. Currently, he is a Research Assistant working at the Department of Statistics, The Chinese University of Hong Kong. His research interests focus on Bioinformatics, including repeats detection, motif discovery, microarray data analysis, gene network inference, etc.

Victor O.K. Li is Chair Professor of Information Engineering and Associate Dean of Engineering at the University of Hong Kong, Hong Kong. He is also Acting Head of the Department of Electrical & Electronic Engineering and Visiting Professor, Department of Computer Engineering, King Saud University, Saudi Arabia. He received SB, SM, EE and ScD degrees in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology in 1977, 1979, 1980, and 1981, respectively. Previously, he was Professor of Electrical Engineering at the University of Southern California (USC), Los Angeles, California, USA, and Director of the USC Communication Sciences Institute. He is a Fellow of the IEEE, IAE, and HKIE, and a UK Royal Academy of Engineering Senior Visiting Fellow.

Bob Li received the PhD degree from UC Berkeley in 1974. He taught applied math at MIT in 1974-76 and math/statistics/CS at U. Illinois, Chicago in 1976-79. After working at Bell Labs/Bellcore for a decade, he joined CUHK as a chair professor in 1989. He now also serves as Co-director of Institute of Network Coding as well as an honorary professor in a few universities. Bob initiated algebraic switching theory and is a cofounder of network coding theory. His “martingale of patterns” (1980) is widely applied to genetics.

Xiaodan Fan is Assistant Professor at Department of Statistics, the Chinese University of Hong Kong, Hong Kong. He received his Ph.D degree in Statistics from Harvard University, USA. Before that, he got his B.E. in Automation and M.S. in Pattern Recognition & Intelligent Systems from Tsinghua University, China. Dr. Fan’s research interests include probabilistic modeling, statistical computing, pattern recognition, and bioinformatics.


1 Introduction

Bioinformatics has been developing rapidly in the past few decades, especially with the advancement of computational and statistical techniques. One major goal of bioinformatics is to study biological evolution and enhance our understanding of the internal factors of diseases. This can be fulfilled by applying “informatics” techniques (e.g., applied mathematics, statistics, and computer science) to handle the large-scale data associated with molecules like DNA, RNA and proteins. This paper mainly focuses on identifying short adjacent repeats in multiple DNA sequences.

Since the late 1990s, researchers from the fields of forensics and bioinformatics have paid great attention to short tandem repeat (STR) analysis. A tandem repeat is a common phenomenon in which a DNA segment contains two or more contiguous, approximate copies of a pattern of nucleotides (Benson, 1999). The pattern width of an STR usually ranges from 2 to 16 base pairs (bp). A typical example of an STR would be

. . . CCTGA CCTGA CCTaA CCTGA CCTGA . . . ,

where the sequence pattern (also called repeat unit) CCTGA is repeated five times, and the lower-case “a” has evolved from “G” due to mutation. STRs can be used for genetic fingerprinting (Butler et al., 1997), and recent efforts have also studied their association with genetic diseases (Sinden, 2000). For example, scientists report that Huntington’s disease, which impairs muscle coordination, is the result of explosive growth in the copy number of the trinucleotide pattern CAG (Huntington’s Disease Collaborative Research Group, 1993).

STR can be generalized to short adjacent repeats (SAR), where gaps between neighboring repeat units are allowed. Such inter-unit insertions often happen due to mutations and crossovers in the evolutionary process. Taking the above example sequence again, it may evolve into

. . . CCTGA GG CCTGA CCTaA CCTGA T CCTGA . . . ,

with the insertions of GG and T. In other words, STR can be considered a special case of SAR where the length of every gap is zero. Besides its application in bioinformatics, the SAR identification problem (SARIP) is also quite common in data analysis, such as pattern recognition in speech, text and images.

The Evolutionary Monte Carlo (EMC) algorithm was established by Liang et al. (2000) and further developed by Goswami et al. (2007). It evolves a population of Markov chain Monte Carlo (MCMC) chains to explore multimodal multivariate distributions. EMC is extended from parallel tempering (PT, also called replica exchange), which was formulated by Swendsen et al. (1986) and then developed by Geyer et al. (1995). Both EMC and PT have been successfully applied to tackle complex computational problems in many fields including biology, physics, chemistry, engineering and material science (Earl et al., 2005). The key idea behind EMC and PT is to implement multiple independent replicas simultaneously under different conditions. One possible condition can be defined as temperature, which is employed to tune the smoothness of the target distribution. Generally speaking, the systems at low temperature are usually able to explore the “local details” of the energy landscape, while those at high temperature are responsible for searching a wide range of the solution space. In the simulation process, replicas at different temperature levels are allowed to exchange or partially swap (crossover scheme) at a certain frequency according to the Metropolis-Hastings rule. As a rule of thumb, the EMC algorithm can usually achieve better performance than canonical MCMC, especially in complex systems with a rugged energy landscape. This is due to the good balance of exploration and exploitation in EMC. In this paper, three EMC schemes, i.e., random exchange (RE), best exchange (BE), and crossover, are adopted to solve SARIP, and their performance is compared with that of the canonical MCMC method.

By virtue of its population-based characteristic, EMC is well suited to implementation on a parallel platform. In EMC and PT, replicas on various processors can be run simultaneously and intermediate results are exchanged from time to time. Moreover, with the quick development of computers and networks, more computation resources are readily available, which makes parallel and distributed computing popular nowadays. Thus, in this paper, all EMC algorithms are executed on a parallel platform to improve the quality of the solution as well as to reduce the computation time.

The rest of the article is organized as follows. In Section 2, we describe the generative model of SARIP. Then, the canonical MCMC algorithm and three EMC algorithms are presented in Sections 3 and 4, respectively. In Section 5, experimental results on synthetic and real data cases are reported and discussed in detail. Finally, we summarize the paper and give potential future work in Section 6.

2 Generative Model of SARIP

In this section, we formulate the probabilistic generative model of SARIP. For a more detailed description of this problem, see Li et al. (2010) and Li et al. (2011).

The input data $R$ is composed of $N$ nucleotide sequences with lengths $L_n$, each of which is denoted by $R_n = \{r_{n,1}, r_{n,2}, \cdots, r_{n,L_n}\}$. Each residue $r_{n,l}$ is sampled from a finite alphabet $\chi = \{A, T, C, G\}$. We assume that each sequence is embedded with a repeat segment, where the same repeat pattern is shared by repeat segments, in a homogeneous background. Our target is to find the most probable repeat segment location and structure for each sequence. They are denoted by $A = [a_1, a_2, \cdots, a_N]^T$ and $S = [s_1^T, s_2^T, \cdots, s_N^T]^T$, respectively.

For parameters $A$, each $a_n$ is an integral number in the range of $[1, L_n]$ and for parameters $S$, each $s_n$ is a base-$(G+1)$ numeral vector of the format $s_n = [g_{n,1}, g_{n,2}, \cdots, g_{n,\Omega_n-1}, -1, \cdots, -1]$, where $\Omega_n$ is the copy number of the $n$-th repeat segment, and the variable $g_{n,\omega}$ is the gap length between the $\omega$-th repeat unit and the $(\omega+1)$-th repeat unit within $s_n$. $G$ is the maximum allowed gap length and therefore $0 \le g_{n,\omega} \le G$. It is required that each $s_n$ should be the same dimension by filling with the trivial value $-1$ from the $\Omega_n$-th to the $(\Omega-1)$-th element of each $s_n$, where $\Omega$ is the maximum allowed copy number. All repeat units with pattern width $J$ are instances sampled from the motif matrix $\Theta = [\theta_1, \theta_2, \cdots, \theta_J]$, where $\theta_j = [\theta_{A,j}, \theta_{T,j}, \theta_{C,j}, \theta_{G,j}]^T$ is the relative frequencies of finding each letter $k$, $k \in \chi$, at the position $j$ among all repeat units. Note that $\sum_k \theta_{k,j} = 1, \forall j$ and $\theta_{k,j} \ge 0, \forall k, j$. All nucleotides in the background are sampled from $\Phi = [\phi_A, \phi_T, \phi_C, \phi_G]^T$, where $\phi_k$, $k \in \chi$, is the probability of finding letter $k$ at a non-unit position. Note that $\sum_k \phi_k = 1$ and $\phi_k \ge 0, \forall k$.

For ease of presentation, Figure 1 (adapted from Li et al. (2011)) shows an example of the schematic diagram of our model. In this specific model, there are 5 sequences with different lengths $L_n$, $1 \le n \le 5$. The maximum allowed copy number $\Omega$ is set as 9. Each repeat segment is composed of multiple repeat units that are represented by the gray blocks. The location and structure vector of each repeat segment are shown above it. Each repeat unit is an instance sampled from the designated motif model. The background area, where letters are generated from a background distribution, is painted in white with dotted borderline. Each white block with solid borderline highlights a gap with length 1 and thus a series of $g$ white blocks denotes a gap of length $g$. Note that the letters at gap locations are also generated from the background distribution.

Figure 1 Schematic diagram of the model under the setting N = 5 and Ω = 9.
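To make the generative model concrete, the following minimal sketch samples one sequence with an embedded repeat segment. It is an illustration under stated assumptions rather than the authors' implementation: the function names `generate_sequence` and `pad_structure`, the letter order in `ALPHABET`, and the representation of the motif matrix as a list of columns are all hypothetical choices made here.

```python
import random

ALPHABET = "ATCG"  # assumed order of the entries in each theta_j and in phi


def pad_structure(gaps, max_copies):
    """Pad the gap list [g_1, ..., g_{Omega_n - 1}] with -1 up to length Omega - 1."""
    return list(gaps) + [-1] * (max_copies - 1 - len(gaps))


def generate_sequence(length, start, gaps, theta, phi, rng):
    """Sample one sequence R_n from the generative model.

    length : sequence length L_n
    start  : 1-based location a_n of the repeat segment
    gaps   : gap lengths between consecutive repeat units
    theta  : motif matrix as a list of J columns (4 probabilities each)
    phi    : background letter probabilities
    """
    width = len(theta)
    copies = len(gaps) + 1
    segment_length = copies * width + sum(gaps)
    if start < 1 or start - 1 + segment_length > length:
        raise ValueError("repeat segment does not fit in the sequence")
    # Background letters and gap letters are all drawn from phi.
    seq = rng.choices(ALPHABET, weights=phi, k=length)
    pos = start - 1
    for w in range(copies):
        # Overwrite one repeat unit, position by position, from the motif matrix.
        for j in range(width):
            seq[pos + j] = rng.choices(ALPHABET, weights=theta[j], k=1)[0]
        pos += width
        if w < len(gaps):
            pos += gaps[w]  # skip the gap, which keeps its background letters
    return "".join(seq)
```

With a fully conserved motif spelling CCTGA, `generate_sequence(30, 4, [2, 0], theta, [0.25] * 4, random.Random(1))` places three units starting at positions 4, 11 and 16, and `pad_structure([2, 0], 9)` gives the padded structure vector of length 8.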

For Bayesian inference of the independent parameters $A$, $S$, $\Theta$, and $\Phi$, we first specify prior knowledge for each parameter, write the complete data likelihood $P(R|A,S,\Theta,\Phi)$ and derive the joint posterior probability $P(A,S,\Theta,\Phi|R)$ via Bayes’ rule. Then, a collapsing technique introduced by Liu et al. (1995) is used to reduce the dimensionality of the state space by integrating out the nuisance parameters $\Theta$ and $\Phi$. Lastly, in this high-dimensional parameter space, we explore the target distribution $P(A,S|R)$, which gives the whole probability density landscape. The maximum a posteriori (MAP) estimate (Gelman et al., 2004) is used as the point estimate of the parameters of interest. The particular $A$ and $S$ corresponding to the MAP of the target distribution are considered to be the best solution for the input data set $R$.
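In symbols, the steps just described can be summarized as follows. This is only a schematic restatement: the factorized prior reflects the stated independence of the parameters, and the specific prior distributions (given in Li et al. (2011)) are not reproduced here.

$$P(A, S, \Theta, \Phi \mid R) \propto P(R \mid A, S, \Theta, \Phi)\, P(A)\, P(S)\, P(\Theta)\, P(\Phi)$$

$$P(A, S \mid R) \propto P(A)\, P(S) \iint P(R \mid A, S, \Theta, \Phi)\, P(\Theta)\, P(\Phi)\, d\Theta\, d\Phi$$

$$(\hat{A}, \hat{S}) = \arg\max_{A,S} P(A, S \mid R)$$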

3 The canonical MCMC algorithm

The key idea of addressing this optimization problem is to use the Metropolis-in-Gibbs scheme (Liu, 2001; Gelman et al., 2004) as shown in Table 1 (adapted from Li et al. (2011)). The MCMC algorithm proceeds through iterations after initialization, each of which updates $a_n$ and $s_n$ for one sequence after another in ascending order from 1 to $N$. Within each iteration, to update the repeat segment location $a_n$ and structure $s_n$ for the $n$-th sequence, we pretend that the locations and structures of the remaining $N-1$ repeat segments are known, and we stochastically predict $a_n$ and $s_n$. In other words, we use the given information, $A_{[-n]}$ and $S_{[-n]}$, to estimate the current ‘motif matrix’ $\Theta_n$ and ‘background distribution’ $\Phi_n$ so as to determine a new $a_n$ via Gibbs sampling and a new $s_n$ via Metropolis-Hastings sampling sequentially. Here, $A_{[-n]}$ and $S_{[-n]}$ denote the sets of locations and structures, respectively, of all repeat segments excluding the $n$-th sequence. Intuitively, the more accurate the estimated motif matrix $\Theta_n$ and background distribution $\Phi_n$ constructed in the predictive update step, the more accurate the determination of $a_n$ and $s_n$ in the following sampling steps, and vice versa.

Table 1 The Schematic Procedure of the MCMC Algorithm BASARD

Step 1: Initialize $A$ and $S$;
Step 2: for $n$ from 1 to $N$ do
    2.1: Predictive update $\Theta_n$ and $\Phi_n$ via $A_{[-n]}$ and $S_{[-n]}$;
    2.2: Sample and update $a_n$ via $P(a_n|s_n, \Theta_n, \Phi_n, R)$;
    2.3: Sample and update $s_n$ via $P(s_n|a_n, \Theta_n, \Phi_n, R)$;
Step 3: Repeat Step 2 until convergence;

Figure 2 The full state transition diagram under the setting G = 1 and Ω = 4.

We use Gibbs sampling (Gelfand et al., 1990) to update each $a_n$. Conditional on the current values of all other parameters $A_{[-n]}$ and $S$, we first break down the sequence $R_n$ into all possible overlapping windows of the same length as the current repeat segment, then calculate the corresponding probability of generating those matching repeat units within each window, and finally sample the new $a_n$ according to such probabilistic weights. We use the Metropolis-Hastings algorithm (Hastings, 1970) to update each $s_n$. In order to make the Markov chain ergodic and fast-convergent, we design three categories of moves: rear indel (insert a repeat unit behind the current last repeat unit or delete the current last repeat unit), partial shift (randomly shift a sub-repeat segment), and front indel (insert a repeat unit in front of the current first repeat unit or delete the current first repeat unit).
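The window-based update of $a_n$ can be sketched as follows. This is a simplified illustration, not the collapsed sampler of Li et al. (2011): it assumes that the motif matrix and background distribution are fixed and strictly positive, scores each window by the log-likelihood ratio of its unit positions against the background, and uses hypothetical function names.

```python
import math
import random

LETTER_INDEX = {c: k for k, c in enumerate("ATCG")}  # assumed letter order


def unit_offsets(gaps, width):
    """Offsets of each repeat unit from the segment start, and the segment length."""
    offsets, pos = [], 0
    for w in range(len(gaps) + 1):
        offsets.append(pos)
        pos += width
        if w < len(gaps):
            pos += gaps[w]
    return offsets, pos


def location_log_weights(seq, gaps, theta, phi):
    """Log weight of every candidate window start (0-based) for a fixed structure."""
    width = len(theta)
    offsets, seg_len = unit_offsets(gaps, width)
    log_w = []
    for a in range(len(seq) - seg_len + 1):
        lw = 0.0
        for off in offsets:
            for j in range(width):
                k = LETTER_INDEX[seq[a + off + j]]
                # Motif probability relative to background at a unit position.
                lw += math.log(theta[j][k] / phi[k])
        log_w.append(lw)
    return log_w


def sample_location(log_w, rng):
    """Draw a 1-based location a_n with probability proportional to exp(log_w)."""
    m = max(log_w)  # subtract the maximum for numerical stability
    weights = [math.exp(v - m) for v in log_w]
    return rng.choices(range(1, len(weights) + 1), weights=weights, k=1)[0]
```

Gap positions contribute a ratio of one and are therefore skipped, which is why only the unit offsets enter the score.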

For ease of presentation, Figure 2 (adapted from Li et al. (2011)) shows an example of the full state transition diagram for all $s_n$. Here we set the maximum allowed gap length $G = 1$ and the maximum allowed copy number $\Omega = 4$. Therefore, each state of $s_n$ can be described as a 3-dimensional binary vector. For convenience, we erase the nuisance value $-1$ from all vectors. The three categories of moves are represented by different types of two-way lines. An upward directed line indicates deleting a repeat unit within the repeat segment, while a downward directed line indicates inserting a repeat unit.

4 EMC algorithms

In EMC algorithms, typical sampling points are generated based on a target distribution

$$\psi(x) \propto e^{-E(x)/T}, \quad x \in \Re^d, \; T > 0$$

where $E(x)$ is the energy function of a thermodynamic system, and $T$ is the temperature parameter used to regulate the shape of the distribution $\psi(x)$. Therefore, the higher the temperature, the “flatter” the target distribution, as shown in Figure 3. A smoother landscape means the sampling points can more easily jump out of local optima.

As described in Section 2, $P(A,S|R)$ is our target probability density function. For point estimates, we target at the solution which has the maximum posterior probability, i.e., MAP. However, thermodynamic systems incline to stay at the lowest free energy, which is a minimization problem. Thus, we define the energy density function as $E(x) = -\ln(P(A,S|R))$. Accordingly, target density function $\psi(x)$ can be written as

$$e^{-E(x)/T} = e^{-(-\ln P(A,S|R))/T} = (P(A,S|R))^{1/T}$$

and when $T = 1$, it is the original $P(A,S|R)$.

Subsequently, the Gibbs sampling for updating $A$ and the Metropolis-Hastings sampling for updating $S$ in Li et al. (2011) should be modified as

$$P(a_n|\Theta_n, A_{[-n]}, S, R) \propto \left(P(R|\Theta_n, A, S)\, P(a_n)\right)^{1/T}$$

and

$$\lambda = \left(\frac{P(s_n^*|\Theta_n, A, S_{[-n]}, R)\, P(s_n; s_n^*)}{P(s_n|\Theta_n, A, S_{[-n]}, R)\, P(s_n^*; s_n)}\right)^{1/T} = \left(\frac{P(R|\Theta_n, A, S)\, P(s_n^*)\, P(s_n; s_n^*)}{P(R|\Theta_n, A, S)\, P(s_n)\, P(s_n^*; s_n)}\right)^{1/T}.$$
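The effect of raising sampling weights to the power $1/T$ can be illustrated with a small sketch. It assumes a discrete set of candidate states with known unnormalized log weights; the function name `tempered_probabilities` is hypothetical and the code is not taken from the authors' implementation.

```python
import math


def tempered_probabilities(log_weights, temperature):
    """Normalize weights w_k^(1/T), working in the log domain for stability."""
    scaled = [lw / temperature for lw in log_weights]
    m = max(scaled)
    unnorm = [math.exp(s - m) for s in scaled]
    total = sum(unnorm)
    return [u / total for u in unnorm]


log_w = [math.log(0.7), math.log(0.2), math.log(0.1)]
original = tempered_probabilities(log_w, 1.0)  # T = 1 leaves the distribution unchanged
flattened = tempered_probabilities(log_w, 1.5)  # T > 1 moves mass toward less likely states
```

At $T = 1$ the original probabilities are recovered, while at $T = 1.5$ the largest probability shrinks and the smallest grows, which is the flattening that lets high-temperature replicas leave local optima more easily.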


Figure 3 Unnormalized target distribution in different temperatures.

A sequence of temperature ladder $T_1 > T_2 > \cdots > T_M > 0$ is adopted, and we set the lowest temperature $T_M = 1$. Thus, for each replica $x_i$, its target density distribution can be expressed as $\psi(x_i) \propto e^{-E(x_i)/T_i}$, where $i = 1, 2, \cdots, M$. Accordingly, the composite system’s solution can be defined as $X = \{x_1, x_2, \cdots, x_M\}$ and the target distribution for the whole system is denoted by

$$\Psi(X) \propto \prod_{i=1}^{M} e^{-E(x_i)/T_i}, \quad X \in \Re^{Md}$$

In the process of EMC algorithms, replicas at different temperature levels are exchanged or partially swapped frequently. The acceptance of this transition ($X \to X^*$) is based on the Hastings ratio, which can be formulated as

$$P(X \to X^*) = \min\left(1, \frac{\Psi(X^*)}{\Psi(X)} \times \frac{\zeta(X;X^*)}{\zeta(X^*;X)}\right),$$

where $\zeta(X;X^*)$ is the candidate-generating density function, and it satisfies $\sum_{X^* \in \Omega \setminus X} \zeta(X;X^*) = 1$. $\Omega$ is all possible states for $X$ and this density can be interpreted as the probability of proposing a transition from $X$ to $X^*$. The pseudocode for the general EMC algorithm is shown in Table 2. We discuss three representative EMC schemes in the following subsections.

Table 2 The pseudocode for the general EMC algorithm

Step 1: Initialize the $M$ replicas $(x_1, x_2, \cdots, x_M)$, and set the temperature ladder $T_1 > T_2 > \cdots > T_M > 0$ with $T_M = 1$;
Step 2: If replica exchange rate met;
    2.1: Select two replicas based on the associated EMC scheme;
    2.2: Produce the new composite solution $X^*$;
    2.3: Calculate the acceptance probability $P(X \to X^*)$;
    2.4: Generate a random value $\alpha$ from the uniform distribution [0,1];
    2.5: Accept the new solution $X^*$ if $\alpha < P(X \to X^*)$;
    Else;
    2.6: Perform the modified MCMC algorithm on each processor;
Step 3: While stopping criterion not met, go back to Step 2;
Step 4: Output the best value of the posterior distribution $P(A,S|R)$ and the corresponding solution $\{A, S\}$

4.1 Random exchange (RE)

This move only happens between neighboring temperature levels, i.e., ($x_i \leftrightarrow x_{i+1}$ or $x_i \leftrightarrow x_{i-1}$). Thus, the new solution $X^* = (x_1, \cdots, x_{i+1}, x_i, \cdots, x_M)$ will be accepted with the probability

$$P(X \to X^*) = \min\left(1, \frac{\Psi(x_1, \cdots, x_{i+1}, x_i, \cdots, x_M)}{\Psi(x_1, \cdots, x_i, x_{i+1}, \cdots, x_M)} \times \frac{\zeta(X;X^*)}{\zeta(X^*;X)}\right) = \min\left(1, e^{(E(x_i)-E(x_{i+1}))\left(\frac{1}{T_i}-\frac{1}{T_{i+1}}\right)} \times \frac{\zeta(X;X^*)}{\zeta(X^*;X)}\right).$$

Note that $\zeta(X;X^*)/\zeta(X^*;X) = 1$ when $i = 1, M$ and $\zeta(X;X^*)/\zeta(X^*;X) = 2$ otherwise. The motivation of RE is to bring the good solution down the ladder while allowing a stuck solution to move to higher temperatures.
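The general EMC loop with random exchange can be sketched on a toy problem as follows. Several parts are assumptions made for illustration only: the one-dimensional double-well `energy` stands in for $-\ln P(A,S|R)$, the local move is a plain random-walk Metropolis step rather than the modified MCMC update, the neighboring pair is picked uniformly, the candidate-generating ratio is passed as a parameter defaulting to 1, and everything runs serially instead of on parallel processors.

```python
import math
import random


def energy(x):
    """Toy double-well energy with minima at x = -2 and x = 2 (an assumption)."""
    return (x * x - 4.0) ** 2 / 2.0


def swap_probability(e_i, e_next, t_i, t_next, zeta_ratio=1.0):
    """RE acceptance: min(1, exp((E_i - E_{i+1}) (1/T_i - 1/T_{i+1})) * zeta_ratio)."""
    log_ratio = (e_i - e_next) * (1.0 / t_i - 1.0 / t_next)
    return min(1.0, math.exp(min(log_ratio, 50.0)) * zeta_ratio)


def local_move(x, temp, rng, step=0.5):
    """One random-walk Metropolis step targeting exp(-E(x) / temp)."""
    cand = x + rng.uniform(-step, step)
    delta = energy(cand) - energy(x)
    if delta <= 0 or rng.random() < math.exp(-delta / temp):
        return cand
    return x


def emc_random_exchange(temps, n_iter, exchange_every, seed=0):
    """Serial sketch of the general EMC loop; temps is ordered from T_1 down to T_M."""
    rng = random.Random(seed)
    xs = [-2.0] * len(temps)  # every replica starts in the left well
    best = xs[-1]
    for it in range(1, n_iter + 1):
        if it % exchange_every == 0:
            i = rng.randrange(len(temps) - 1)  # neighboring pair (i, i + 1)
            p = swap_probability(energy(xs[i]), energy(xs[i + 1]),
                                 temps[i], temps[i + 1])
            if rng.random() < p:
                xs[i], xs[i + 1] = xs[i + 1], xs[i]
        else:
            xs = [local_move(x, t, rng) for x, t in zip(xs, temps)]
        best = min([best] + xs, key=energy)  # track the lowest-energy state seen
    return xs, best
```

A call such as `emc_random_exchange([1.5, 1.31, 1.145, 1.0], 2000, 10)` returns the final replicas and the lowest-energy state visited. Note that when the hotter replica holds the lower energy, `swap_probability` returns 1, so good solutions always move down the ladder.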

4.2 Best exchange (BE)

Suitable replicas migrate slowly in the RE scheme because they are only allowed to exchange at neighboring temperature levels. In other words, it takes at least $M-1$ steps for a good replica found at the highest temperature $T_1$ to move to the lowest temperature $T_M$. This will lower the efficiency of the EMC algorithm under certain situations, which will be discussed in the next section. Thus, following the idea proposed by Goswami et al. (2007), we use the BE scheme to improve the migration of replicas. In the BE scheme, two random replicas $x_i$ and $x_j$ also need to be selected first. Without loss of generality, we assume $i > j$ and the candidate-generating density function should be redefined as:

$$\zeta(X;X^*) = \frac{1/E(x_i)}{\sum_{u \in \{1:i\} \setminus j} 1/E(x_u)}$$

and its reverse move is:

$$\zeta(X^*;X) = \frac{1/E(x_j)}{\sum_{u \in \{1:i-1\}} 1/E(x_u)}.$$

We can consider $\zeta(X;X^*)/\zeta(X^*;X)$ as a stabilizer (Goswami et al., 2007), which facilitates the worse replica to move to lower temperature. This is quite useful when the complex problem has multiple local optima. More precisely, the exchange acceptance ratio now becomes

$$P(X \to X^*) = \min\left(1, \frac{\Psi(x_1, \cdots, x_i, \cdots, x_j, \cdots, x_M)}{\Psi(x_1, \cdots, x_j, \cdots, x_i, \cdots, x_M)} \times \frac{\zeta(X;X^*)}{\zeta(X^*;X)}\right) = \min\left(1, e^{(E(x_j)-E(x_i))\left(\frac{1}{T_j}-\frac{1}{T_i}\right)} \times \frac{1/E(x_i) \times \sum_{u \in \{1:i-1\}} 1/E(x_u)}{1/E(x_j) \times \sum_{u \in \{1:i\} \setminus j} 1/E(x_u)}\right)$$

where $i > j$.
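The exponential factor in this ratio follows directly from the composite target distribution $\Psi(X) \propto \prod_{i=1}^{M} e^{-E(x_i)/T_i}$. As a short worked step, using only quantities already defined in this section: after the exchange, $x_i$ sits at temperature $T_j$ and $x_j$ at temperature $T_i$, and the terms of all other replicas cancel, so that

$$\frac{\Psi(X^*)}{\Psi(X)} = \frac{e^{-E(x_i)/T_j}\, e^{-E(x_j)/T_i}}{e^{-E(x_j)/T_j}\, e^{-E(x_i)/T_i}} = e^{\frac{E(x_j)-E(x_i)}{T_j} - \frac{E(x_j)-E(x_i)}{T_i}} = e^{(E(x_j)-E(x_i))\left(\frac{1}{T_j}-\frac{1}{T_i}\right)}.$$

Since $i > j$ implies $T_j > T_i$, the factor $\frac{1}{T_j}-\frac{1}{T_i}$ is negative, and the ratio is at least 1 whenever the replica at the higher temperature has the lower energy, i.e., $E(x_j) \le E(x_i)$.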

4.3 Crossover

In the crossover scheme, we first randomly choose two replicas $x_i$ and $x_j$. Then one-point exchange operator is employed to generate two new replicas $x'_i$ and $x'_j$. This process can be illustrated as follows:

$$x_i = (a^i_1, s^i_1; \cdots; a^i_q, s^i_q; \cdots; a^i_N, s^i_N) \;\to\; x'_i = (a^i_1, s^i_1; \cdots; a^j_q, s^j_q; \cdots; a^i_N, s^i_N)$$

$$x_j = (a^j_1, s^j_1; \cdots; a^j_q, s^j_q; \cdots; a^j_N, s^j_N) \;\to\; x'_j = (a^j_1, s^j_1; \cdots; a^i_q, s^i_q; \cdots; a^j_N, s^j_N)$$

where sequence $q$ is randomly selected from $\{1:N\}$. Here the location and structure of sequence $q$ in replicas $x_i$ and $x_j$ are swapped. Moreover, this exchange acceptance probability can be computed by

$$P(X \to X^*) = \min\left(1, \frac{\Psi(x_1, \cdots, x'_i, \cdots, x'_j, \cdots, x_M)}{\Psi(x_1, \cdots, x_i, \cdots, x_j, \cdots, x_M)} \times \frac{\zeta(X;X^*)}{\zeta(X^*;X)}\right) = \min\left(1, e^{\frac{E(x_i)-E(x'_i)}{T_i} + \frac{E(x_j)-E(x'_j)}{T_j}} \times \frac{\zeta(X;X^*)}{\zeta(X^*;X)}\right)$$

where $\zeta(X;X^*)/\zeta(X^*;X) = 1$.
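The crossover move and its acceptance probability can be sketched as follows. The representation of a replica as a list of $(a_n, s_n)$ pairs and the function names are assumptions made for illustration; the energies of the new replicas are supplied by the caller, since evaluating $-\ln P(A,S|R)$ is outside the scope of this sketch.

```python
import math


def one_point_exchange(x_i, x_j, q):
    """Swap the (location, structure) pair of sequence q (0-based) between two replicas."""
    new_i, new_j = list(x_i), list(x_j)
    new_i[q], new_j[q] = x_j[q], x_i[q]
    return new_i, new_j


def crossover_acceptance(e_i, e_i_new, e_j, e_j_new, t_i, t_j):
    """min(1, exp((E(x_i) - E(x'_i)) / T_i + (E(x_j) - E(x'_j)) / T_j)), zeta ratio = 1."""
    log_ratio = (e_i - e_i_new) / t_i + (e_j - e_j_new) / t_j
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


# Two replicas for N = 3 sequences; each entry is (a_n, gap list of s_n).
x_i = [(1, [0]), (5, [1, 0]), (9, [2])]
x_j = [(2, [1]), (7, [0, 0]), (3, [0])]
new_i, new_j = one_point_exchange(x_i, x_j, q=1)
```

Here `new_i` keeps the first and third entries of `x_i` but takes `(7, [0, 0])` from `x_j`, and the original replicas are left unchanged so that the move can be rejected without bookkeeping.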

5 Experiment results and analysis

In this section, we first describe the simulation environment, including the testing data and parameters. Second, in order to have an extensive comparison, the canonical MCMC and the three EMC algorithms are tested on both synthetic and real data. We analyze their results in the following subsections. Finally, we integrate and summarize the observations.

5.1 Simulation environment

In general, DNA sequences are unbranched polymers with thousands of nucleotide bases, which makes the corresponding input data quite large and complex. Thus, including the 9 categories in Li et al. (2011), we synthesize 16 data sets in total. Based on the degeneracy degree, which is measured by the average entropy, motif matrices can be classified into three types: low (L), medium (M) and high (H), where the average entropies are about 1.5, 1.3 and 1.1, respectively. We also denote B ($\in [10, 15]$) and S ($\in [5, 10]$) as the copy number in one sequence. Thus, for the scenario of 6 sequences of length 2000, we have 12 different sets, represented by 6HB, 6MB, 6LB, 6HS, 6MS, 6LS, 12HB, 12MB, 12LB, 12HS, 12MS, and 12LS, where the prefix number denotes the pattern width. To test on a large scale, we add 3 more data sets, i.e., 6HB, 6MB, and 6LB, in a scenario of 10 sequences of length 10000. Note that all these 15 data sets only contain one repeat segment in each sequence. However, it is possible to have multiple segments in one sequence, which causes the energy landscape to have many local optima. Taking this scenario into account, one data set with multiple repeat segments is also considered. The settings of all data sets are listed in Table 3, where the background nucleotides are generated from $\Phi = [0.25 \;\; 0.25 \;\; 0.25 \;\; 0.25]^T$.

Table 3 All 16 synthetic data settings

Scenario                        Data sets            Settings
N = 6, Ln = 2000, 1 ≤ n ≤ N     6HB, 6MB, 6LB        G = 2, J = 6, Ω ∈ [10, 15]
                                6HS, 6MS, 6LS        G = 2, J = 6, Ω ∈ [5, 10]
                                12HB, 12MB, 12LB     G = 2, J = 12, Ω ∈ [10, 15]
                                12HS, 12MS, 12LS     G = 2, J = 12, Ω ∈ [5, 10]
N = 10, Ln = 10000              6HB, 6MB, 6LB        G = 2, J = 6, Ω ∈ [10, 15]
N = 5, Ln = 2000                multi segments       G = 2, J = 9, Ω = 9

To construct the temperature ladder for EMC algorithms, we adopt the geometric progression (Predescu et al., 2009) to approximate the ladder set $T_1 > T_2 > \cdots > T_M > 0$. As discussed in Section 4, the lowest temperature $T_M$ is set to 1.000. For the highest temperature $T_1$, we assign the value of 1.500, which is large enough to produce a “flatter” energy landscape. The intermediate temperatures $T_2, \cdots, T_{M-1}$ can be computed by

$$T_j = T_M \mu^{M-j}, \quad \mu = \sqrt[M-1]{\frac{T_1}{T_M}}$$

In our simulation, we establish the parallel platform with four processors, and thus the temperature ladder can be configured as {1.500, 1.310, 1.145, 1.000}. Note that, to guarantee the “detailed balance” requirement in EMC, each processor should run the same number of evaluations before the replica exchange. Hence, we execute the EMC algorithms in a synchronized mode. Moreover, the replica exchange is triggered every 10 iterations. When the exchange is accepted, the starting position $A$ and the segment structure $S$ will be swapped between the chosen replicas.
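The geometric temperature ladder can be checked with a few lines of code. The sketch below indexes the ladder so that $T_1$ is the highest and $T_M$ the lowest temperature, i.e., it uses the exponent $M - j$, which reproduces the four-level ladder quoted above; the function name `geometric_ladder` is a hypothetical choice.

```python
def geometric_ladder(t_high, t_low, m):
    """Return [T_1, ..., T_M] with T_1 = t_high, T_M = t_low and a constant ratio mu."""
    mu = (t_high / t_low) ** (1.0 / (m - 1))
    return [t_low * mu ** (m - j) for j in range(1, m + 1)]


ladder = [round(t, 3) for t in geometric_ladder(1.5, 1.0, 4)]
# ladder is [1.5, 1.31, 1.145, 1.0]
```

A geometric ladder keeps the ratio between neighboring temperatures constant, so every neighboring pair of replicas is separated by the same factor $\mu$.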

All algorithms are coded in C++, and the EMC algorithms are run on a cluster of computers with an Intel Core Quad 2.66GHz CPU and 4 GB of RAM, connected via Ethernet. The communication between replicas is carried out via MPICH2, which is a high-performance and widely used implementation of the Message Passing Interface (MPI). This software library is popular in parallel design.

5.2 Test on synthetic data

In both MCMC and EMC algorithms, we use the maximum a posteriori (MAP) as a metric to judge the quality of the solution $\{A, S\}$. Strictly speaking, the higher the MAP, the better the solution. Without influence on the final results, we utilize the unnormalized natural logarithm of the posterior probability instead of the normalized one, which is extremely difficult to calculate.


5.2.1 Single segment cases

Figures 4 and 5 illustrate the convergence of each compared algorithm on the cases 6HS and 12LB, respectively. These two cases have representative signal strengths. The 6HS case has the lowest signal strength of short adjacent repeats, and thus its convergence track, as shown in Figure 4, fluctuates dramatically. Meanwhile, due to the strongest signal behind 12LB, its convergence curve is smoother, as shown in Figure 5. Besides, it is clear that the three EMC schemes outperform the canonical MCMC in terms of convergence speed.

Figure 4 The trace plot of the unnormalized log joint posterior probability for 6HS with N = 6, Ln = 2000.

Figure 5 The trace plot of the unnormalized log joint posterior probability for 12LB with N = 6, Ln = 2000.

To more accurately assess and compare the efficiency of these four algorithms, we list in Table 4 the average computation time for each algorithm to reach the true value. This can be calculated by

$$\frac{T_{CPU}}{3000} \times N_b,$$

where $T_{CPU}$ represents the total CPU computation time for 3000 iterations, and $N_b$ is the number of iterations for an algorithm to reach the true value. The least computation time (bold) for each case is also shown in Table 4. We conclude that: (1) EMC algorithms perform better than canonical BASARD, and all three schemes can save around 50% computation time; (2) generally, EMC RE, EMC crossover and EMC BE are superior in cases when the degeneracy degree is low, medium, and high, respectively. In fact, on average they can save more than 60% computation time. This is reasonable since in cases of high degeneracy degree, the signal strength is quite low, which results in a rough energy landscape and many local optima. Thus, for this situation, EMC BE is a good choice because its replicas are not confined to neighboring exchange. In cases of low degeneracy degree, by contrast, the energy landscape is smooth and the neighboring exchange is more suitable, which leads to EMC RE being the best one.

Table 4 Average computation time for each algorithm to reach the true value

                  Computation time (s)
Case      BASARD     EMC RE     EMC BE     EMC crossover

N = 6, Ln = 2000
6HB        77.07      50.38      36.72      38.98
6MB        83.68      45.53      44.52      32.81
6LB        46.85      17.30      27.30      23.23
12HB      108.17      56.31      41.36      44.37
12MB      145.20      86.03     113.05      64.23
12LB      177.45      35.16      68.32      54.17
6HS        38.99      27.64       8.74      15.66
6MS        23.19      11.66      13.59      11.76
6LS        28.38      12.19      15.87       8.74
12HS       39.39      27.23      12.12      11.83
12MS       52.24      14.72      16.97      10.58
12LS       65.33      29.71      39.96      50.38

N = 10, Ln = 10000
6HB      1238.29    1057.24     390.83     738.37
6MB       751.42     411.35     318.89     249.53
6LB       745.26     341.38     392.23     573.97
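As a reading aid for Table 4, the relative saving of each EMC scheme over BASARD can be recomputed directly from the tabulated times. The sketch below copies six rows of the table (the "B" cases with N = 6, Ln = 2000); the helper names `best_scheme` and `saving` are introduced here for illustration only.

```python
# Rows copied from Table 4 (N = 6, Ln = 2000); times in seconds.
TABLE4 = {
    "6HB": {"BASARD": 77.07, "EMC RE": 50.38, "EMC BE": 36.72, "EMC crossover": 38.98},
    "6MB": {"BASARD": 83.68, "EMC RE": 45.53, "EMC BE": 44.52, "EMC crossover": 32.81},
    "6LB": {"BASARD": 46.85, "EMC RE": 17.30, "EMC BE": 27.30, "EMC crossover": 23.23},
    "12HB": {"BASARD": 108.17, "EMC RE": 56.31, "EMC BE": 41.36, "EMC crossover": 44.37},
    "12MB": {"BASARD": 145.20, "EMC RE": 86.03, "EMC BE": 113.05, "EMC crossover": 64.23},
    "12LB": {"BASARD": 177.45, "EMC RE": 35.16, "EMC BE": 68.32, "EMC crossover": 54.17},
}


def best_scheme(row):
    """Return the EMC scheme with the least computation time in one row."""
    emc_times = {name: t for name, t in row.items() if name != "BASARD"}
    return min(emc_times, key=emc_times.get)


def saving(row, scheme):
    """Fraction of the BASARD computation time saved by `scheme`."""
    return 1.0 - row[scheme] / row["BASARD"]


if __name__ == "__main__":
    for case, row in TABLE4.items():
        scheme = best_scheme(row)
        print(f"{case}: {scheme} saves {100 * saving(row, scheme):.1f}%")
```

For these six rows the fastest scheme is EMC BE in the high-degeneracy cases, EMC crossover in the medium ones, and EMC RE in the low ones.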


5.2.2 Multiple segment case

Note that, in all single segment cases, the canonical MCMC algorithm can easily obtain the true value owing to the simple structure of the sequences. In this subsection, we consider a more complex structure in which each DNA sequence holds multiple segments. This phenomenon is also frequently found in biology, and in this paper we are mainly interested in locating the most probable segment (the longest) in each sequence. Figure 6 shows an example of multiple segments (marked as blocks). A scenario like this has multiple local optima, which makes it difficult for the canonical MCMC algorithm to reach the global optimum (true value).

Figure 7 compares the convergence track of each algorithm. It can be observed that the canonical MCMC is trapped in a local optimum, while the three EMC algorithms are able to attain the true value. For reference, we list the average and standard deviation values over 20 simulation runs in Table 5. Besides obtaining good solutions, the EMC algorithms are more stable and robust than MCMC. This is of great significance in practice, since we would not like to have to run the algorithm many times to get a relatively good result.

Figure 6 Location of multiple segments within each sequence.

Table 5 Average and standard deviation values for each algorithm

        BASARD      EMC RE      EMC crossover    EMC BE
AVG     -13683.4    -13669.7    -13668.8         -13669.7
SD      8.0         1.4         1.1              3.0

5.3 Test on real data

Previous work by Li et al. (2011) has shown the efficiency of canonical MCMC (or BASARD) in searching for the common short adjacent repeats in 24 publicly available DNA sequences of DRD4 exon III from GenBank. The DRD4 gene is thought to have some relationship with human cognitive function (Previc (1999)). Thus, to have a fair comparison, we use the same 24 DNA sequences from different mammals to test our EMC algorithms. Figure 8 gives the convergence track of all algorithms. Similar to the results on the synthetic data above, all EMC schemes converge much faster than BASARD while obtaining solutions of identical quality. This confirms the effectiveness of our EMC algorithms.


Figure 7 The trace plot of the unnormalized log joint posterior probability for multiple segment case.

Figure 8 The trace plot of the unnormalized log joint posterior probability for real data case.

5.4 Summary

Tests on both synthetic and real data show that, compared to canonical MCMC, our EMC algorithms can greatly improve the convergence speed and hence reduce the computation time. In particular, EMC RE and EMC BE give the best performance when the motif model has a low or high degeneracy degree, respectively, while EMC crossover achieves modest performance. Additionally, EMC algorithms have the ability to escape from local optima, a problem that the canonical MCMC approach commonly struggles with.


6 Conclusions and discussion

EMC (or PT) has attracted much attention as a useful method for handling computationally intensive problems. In essence, both EMC and PT construct a set of temperature levels and allow replicas to explore the energy landscape, while EMC additionally adopts evolutionary ideas such as crossover and mutation. The major advantage of EMC is that information exchange gives a replica the opportunity to surmount energy barriers. Our contributions in this paper can be summarized as follows: (1) Based on reconstructing and parallelizing the MCMC algorithm, three EMC schemes (i.e. EMC RE, EMC BE, and EMC crossover) are proposed to solve SARIP; (2) To comprehensively compare our EMC schemes with the canonical MCMC algorithm, we generate synthetic data (including both single and multiple segment cases) and also adopt a real data set; (3) The simulation results show that the EMC algorithms can not only cut the computation time but also improve the solution quality. Moreover, the performance of the different EMC schemes depends on the degeneracy degree of the motif pattern.
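To make the idea of information exchange across temperature levels concrete, the following is a minimal sketch of the standard parallel tempering swap between neighboring temperature levels. It is not the implementation used in this paper: the function names are hypothetical, and the energy of a replica is assumed to be the negative unnormalized log joint posterior of its state.

```python
import math
import random


def swap_probability(energy_i, energy_j, temp_i, temp_j):
    """Metropolis acceptance probability for exchanging replicas i and j.

    Standard parallel tempering rule:
    min(1, exp((E_i - E_j) * (1/T_i - 1/T_j))).
    """
    delta = (energy_i - energy_j) * (1.0 / temp_i - 1.0 / temp_j)
    return 1.0 if delta >= 0 else math.exp(delta)


def neighbor_exchange(states, energies, temps, rng):
    """Propose one swap between a randomly chosen pair of neighboring levels.

    Swaps the states (and their energies) in place and returns True if the
    proposal is accepted; the temperature ladder itself is left unchanged.
    """
    i = rng.randrange(len(temps) - 1)
    j = i + 1
    p = swap_probability(energies[i], energies[j], temps[i], temps[j])
    if rng.random() < p:
        states[i], states[j] = states[j], states[i]
        energies[i], energies[j] = energies[j], energies[i]
        return True
    return False


if __name__ == "__main__":
    rng = random.Random(0)
    states = ["a", "b", "c"]      # placeholder replica states
    energies = [5.0, 3.0, 1.0]    # hypothetical energies
    temps = [1.0, 2.0, 4.0]       # hypothetical temperature ladder
    print(neighbor_exchange(states, energies, temps, rng), states)
```

A swap that moves a lower-energy state to the colder level is always accepted, which is how a replica trapped at a low temperature can receive a state that has already crossed an energy barrier at a higher temperature.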

Possible future research includes: (1) Since the temperature ladder setting is crucial to the EMC algorithm, more accurate designs such as the feedback scheme (Katzgraber et al. (2006)) could be explored; (2) Taking into account the great impact of the degeneracy degree on the performance of EMC, we may integrate all three schemes into an adaptive one; (3) The trade-off between communication overhead and convergence improvement is also an interesting topic.

Acknowledgements

Jin Xu and Victor O.K. Li are supported in part by the University of Hong Kong Strategic Research Theme of Information Technology. Xiaodan Fan and Qiwei Li are partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project no. CUHK400709).

References

Benson, G. (1999) ‘Tandem repeats finder: a program to analyze DNA sequences’, Nucleic Acids Research, Vol. 27, No. 2, pp. 573–580.

Butler, J.M., Ruitberg, C.M. and Reeder, D.J. (1997) ‘STRBase: a Short tandem repeat DNA Internet-accessible database’, Proc. of the 8th International Symposium on Human Identification, pp. 38–47.

Earl, D.J. and Deem, M.W. (2005) ‘Parallel Tempering: theory, applications, and new perspectives’, Physical Chemistry Chemical Physics, Vol. 7, pp. 3910–3916.

Gelfand, A.E. and Smith, A.F.M. (1990) ‘Sampling-based approaches to calculating marginal densities’, Journal of the American Statistical Association, Vol. 85, pp. 398–409.

Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B. (2004) ‘Bayesian data analysis’, New York, United States: Chapman & Hall / CRC.

Geyer, C.J. and Thompson, E.A. (1995) ‘Annealing Markov chain Monte Carlo with application to ancestral inference’, Journal of the American Statistical Association, Vol. 90, pp. 909–920.


Goswami, G. and Liu, J.S. (2007) ‘On learning strategies for Evolutionary Monte Carlo’, Statistics and Computing, Vol. 17, pp. 23–38.

Hastings, W.K. (1970) ‘Monte Carlo sampling methods using Markov chains and their applications’, Biometrika, Vol. 57, pp. 97–109.

Huntington’s Disease Collaborative Research Group (1993) ‘A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington’s disease chromosomes’, Cell, Vol. 72, pp. 971–983.

Katzgraber, H.G., Trebst, S., Huse, D.A. and Troyer, M. (2006) ‘Feedback-optimized parallel tempering Monte Carlo’, Journal of Statistical Mechanics, Vol. 2006, P03018.

Li, Q., Fan, X., Liang, T. and Li, S.Y.R. (2011) ‘MCMC algorithms for detecting short adjacent repeats in multiple sequences’, Available as technical report on https://sites.google.com/site/liqiwei2000/research/journal-papers.

Li, Q., Liang, T., Li, S.Y.R. and Fan, X. (2010) ‘Bayesian approach for identifying short adjacent repeats in multiple sequences’, Proc. of the 11th International Conference on Bioinformatics and Computational Biology, pp. 255–261.

Liang, F. and Wong, W.H. (2000) ‘Evolutionary Monte Carlo: applications to Cp model sampling and change point problem’, Statistica Sinica, Vol. 10, pp. 317–342.

Liu, J.S. (2001) ‘Monte Carlo strategies in scientific computing’, New York, United States: Springer-Verlag.

Liu, J.S., Neuwald, A.F. and Lawrence, C.E. (1995) ‘Bayesian models for multiple local sequence alignment and Gibbs sampling strategies’, Journal of the American Statistical Association, Vol. 90, No. 432, pp. 1156–1170.

Predescu, C., Predescu, M. and Ciobanu, C. (2009) ‘The incomplete beta function law for parallel tempering sampling of classical canonical systems’, Journal of Chemical Physics, Vol. 120, pp. 4119.

Previc, F.H. (1999) ‘Dopamine and the origins of human intelligence’, Brain and Cognition, Vol. 41, No. 3, pp. 299–350.

Sinden, R.R. (2000) ‘Trinucleotide repeats biological implication of the DNA structure associated with disease-causing triplet repeats’, Human Genetics, Vol. 64, pp. 346–353.

Swendsen, R.H. and Wang, J.S. (1986) ‘Replica Monte Carlo Simulation of spin glasses’, Physical Review Letters, Vol. 57, pp. 2607–2609.