Shuffling biological sequences with motif constraints

Romain Rivière (1), Dominique Barth (2), Johanne Cohen (3), Alain Denise (1, *)

(1) LRI, UMR CNRS 8623. Université Paris-Sud XI. Orsay, France. {Alain.Denise,Romain.Riviere}@lri.fr
(2) PRiSM, UMR CNRS 8144. Université de Versailles-St-Quentin. Versailles, France. Dominique.Barth@prism.uvsq.fr
(3) LORIA, UMR CNRS 7503. Université de Nancy. Nancy, France. Johanne.Cohen@loria.fr
(*) Corresponding author.

Abstract. We study the following problem: given a biological sequence S, a multiset M of motifs and an integer k, generate uniformly random sequences which contain the given motifs and have exactly the same frequencies of occurrence of k-lets (i.e. factors of length k) as S. This is a particularly difficult problem. We notably prove that the problem of deciding whether a sequence respects given motif constraints is NP-complete. We give a random generation algorithm which turns out to be experimentally efficient.

1 Introduction.

The amount of data coming from sequenced genomes is increasing rapidly. Therefore, there is a need for efficient computer-based methods for extracting biological information from sequences. A widely used method involves comparing biological sequences with random sequences, which represent the "background noise" from which any relevant biological information should stand out. This powerful method has been implemented in several areas of sequence analysis [19, 20]. A key example is the search for exceptional motifs in biological sequences. In this approach, an exceptional motif is a pattern that is over- or underrepresented in a biological sequence compared to the expected number of occurrences of that pattern in random sequences. Any overrepresented or underrepresented motif may indicate an important biological function. Random sequences are also used for sequence comparison. Pairwise sequence comparison algorithms give a score that measures the similarity of two sequences. After obtaining the score of an alignment, the main problem is to decide whether the two sequences are homologous (i.e. derive from a common ancestral sequence) or not. This is done by comparing the given score with scores from the comparison of the biological sequences with random sequences [4, 13].

For the results to be relevant, the random sequences must model some well-chosen characteristics of biological sequences. The two most widely accepted random sequence models are based on the number of occurrences of all k-lets, i.e. all motifs of a given fixed length k, in one or several reference biological sequences [9]. In the first model, the random sequences respect on average the given numbers of occurrences. In other words, they obey a stationary Markov chain. In the second model, any random sequence contains exactly the same number of occurrences of k-lets as the reference sequence. The first model is well suited to long sequences or large sets of sequences, and is widely used for searching for exceptional motifs. For one or several rather short sequences, the second model is better adapted, notably because Markov chains may not be irreducible in this case. Therefore, this model is used for comparing genes, which are rather short sequences [3]. Random sequences can be studied from both an analytical and an algorithmic point of view. Indeed, various analytical methods have been developed for studying the probability distribution of motifs in random sequences to search for exceptional motifs (see e.g. [14, 16, 17]). However, in many cases an experimental approach is needed, by generating sets of random sequences. This is particularly necessary for sequence comparison, where there are still very few theoretical results. For the Markovian model, it is straightforward to generate random sequences. However, for the second model (the exact model), the problem is much more difficult. The first efficient algorithm was developed by Kandel, Matias, Unger and Winkler in 1996 [12].

Recent studies in biological sequence analysis have shown that it is necessary to consider models of random sequences that contain more information than previously thought. For example, Beaudoing et al. [6] looked for variants of a polyadenylation signal. They gave a set of sequences where one known motif was strongly overrepresented, and aimed to find other, weaker overrepresented motifs. This is a typical case in which some motifs that contain the strong one, or that partially overlap it, can appear overrepresented. These "wrong signals" are called artefacts. In this study [6], the known strong signal was the motif AAUAAA. The motifs AAAUAA and AUAAAA, among others, were also overrepresented using a classical model of random sequences. Clearly, these too were artefacts. An ad hoc method was then applied to remove these artefacts. However, it has been shown [8] that these artefacts can be removed analytically in a general manner by conditioning the occurrence probabilities by the strong signal. In other words, the strong signal is taken into account in the model of random sequences. Van Helden et al. [18] classified genes according to the number of occurrences of a set of overrepresented motifs. Although some motifs were related to others, for practical reasons all motifs were considered independent from each other. The resulting classification could be improved if these dependencies could be taken into account. Therefore, a model of random sequences needs to account for the presence or the overrepresentation of a set of motifs in biological sequences. Unfortunately, at present, an analytical approach to this problem can only be applied in the simplest cases, in which only one strong motif is to be considered.

Here, we address the problem of generating sequences according to the exact model, but with additional motif constraints. A set of motifs of length greater than k is given, and, as well as the k-lets, the sequences must contain a given number of occurrences of each motif from the set.

In Section 2, we reconsider the algorithm of Kandel et al., which generates sequences without additional constraints. We take this as the starting point of our work and then, in Section 3, we develop our approach. The addition of motif constraints in the model results in difficult problems. We notably prove that the general problem of deciding whether a sequence respects the given motif constraints is NP-complete. In Section 5, we give an algorithm which is experimentally efficient, and present experimental results. For readability, the proofs of our principal results are given in Section 4.

2 The shuffling problem.

Let $S = s_1 s_2 \ldots s_n$ be a sequence of length n over an alphabet L, and k an integer such that $2 \le k \le n$. A factor of S is a word $s_{[p,q]} = s_p \ldots s_q$ for some $1 \le p \le q \le n$. Consider the number of occurrences in S of all possible k-lets, i.e. factors of length k. We call a shuffled sequence any sequence which has exactly the same numbers of occurrences of k-lets as S. For example, let S = ACTACTCACG and k = 3. The sequence S contains two occurrences of the 3-let ACT, and one of each of the following 3-lets: CTA, TAC, CTC, TCA, CAC, ACG. The sequence S' = ACTCACTACG is a shuffled sequence of S, because it has exactly the same numbers of occurrences of 3-lets as S. The shuffling problem is the problem of generating, uniformly at random (u.a.r.), a sequence among all shuffled sequences. Uniformly at random means that all shuffled sequences must have the same probability of being generated.

We first recall a correspondence between the set of shuffled sequences and the set of Eulerian trails of a particular multigraph, which is somewhat similar to the de Bruijn graph. We call this the sequence graph of order k of S.

Definition 1. The sequence graph of order k of S, denoted Gr(S, k), is a directed multigraph G = (V, E), with

$$V = \bigcup_{i=1}^{n-k+2} \{ s_{[i,i+k-2]} \}, \qquad E = \bigcup_{i=1}^{n-k+1} \left[ (s_{[i,i+k-2]},\ s_{[i+1,i+k-1]}) \right].$$

Note that V is a set, while E is a multiset (hence the brackets in the definition of E). An example of a sequence graph is given in Figure 1.

Fig. 1. The sequence graph of S = ATGTTCATGCATGGATGGATAG with k = 3 (nodes: AT, TG, GT, TT, TC, CA, GC, GG, GA, TA, AG).

The nodes of the sequence graph are the factors of size k - 1 of S, and there are as many arcs between two given nodes $v = s_{[1,k-1]}$ and $v' = s_{[2,k-1]} s_k$ as the number of occurrences of the word $s_{[1,k]}$ in S. It follows that any sequence graph is path-Eulerian, i.e. it contains at least one path that covers all arcs exactly once: the sequence of nodes $(s_{[i,i+k-2]})_{i=1}^{n-k+2}$. Such a path is called an Eulerian trail. In the following, we denote by $v_b$ (resp. $v_e$) the vertex which begins (resp. ends) the Eulerian trail. In some particular cases the sequence graph may be Eulerian (i.e. cycle-Eulerian), as well as path-Eulerian. In this case, $v_b$ can be any vertex, and $v_e = v_b$. In all other cases, $v_b$ and $v_e$ are fixed and distinct.

The following definition will help us to formalize the correspondence between shuffled sequences and Eulerian trails.

Definition 2. The trace of a path in a sequence graph is the word produced by concatenation of the k - 1 letters of the first node and the sequence composed of the last letter of every other node in the path.

For example, in Figure 1, the word ATGGATGTTC is the trace of the path (AT, TG, GG, GA, AT, TG, GT, TT, TC).

Now we can state the claimed correspondence.

Proposition 1. Any trace corresponds to exactly one shuffled sequence. The number of Eulerian trails which correspond to any given trace does not depend on the trace, and is equal to $\prod_{v \in V} d^+(v)$, where $d^+(v)$ stands for the outdegree of vertex v.

This correspondence was first noticed by Fitch [9] in 1983, and was the basis of further works by Altschul and Erickson [3] and then Kandel, Matias, Unger and Winkler [12]. Thus, the problem of generating shuffled sequences uniformly at random is reduced to generating Eulerian trails uniformly at random in a (particular) directed multigraph. The next step uses the BEST Theorem [1], which links Eulerian trails and spanning trees of a graph. Here, this theorem can be stated as follows.

Theorem 1 (Aardenne-Ehrenfest and de Bruijn, 1951.). The number of Eulerian trails in G that begin at $v_b$ and end at $v_e$ is equal to

$$T(G)\,\bigl(d^+(v_e)\bigr)! \prod_{v \in V \setminus \{v_e\}} \bigl(d^+(v) - 1\bigr)!$$

where T(G) is the number of inbound spanning trees, or arborescences, whose root is $v_e$, and $d^+(v)$ stands for the outdegree of vertex v.

The proof is constructive and leads to a straightforward algorithm for the random generation of Eulerian trails, provided that an arborescence in G can be generated uniformly at random. Starting at the beginning vertex, we choose uniformly at random, at each step, an arc from among all the arcs leaving the current vertex which have not yet been crossed, except the arc which belongs to the arborescence. That arc can be chosen only if no other arc is available. We then follow the chosen arc to the next vertex, which becomes the new current vertex. The process stops at $v_e$ when all arcs have been crossed.

The problem of generating Eulerian trails uniformly at random is now reduced to the problem of generating arborescences uniformly at random. Kandel et al. [12] give an algorithm which is a variant of a very elegant algorithm found independently by Aldous [2] and by Broder [7] for undirected graphs. The algorithm is as follows: if G is only path-Eulerian, then it is first made cycle-Eulerian by adding a virtual arc between $v_e$ and $v_b$. Then proceed by a free random walk in G, and each time an arc is crossed, add it to the arborescence only if it is not the virtual one and no cycle occurs in the resulting arborescence. The expected time complexity of this algorithm is $O(q^2 n)$, where q is the number of vertices, i.e. the number of distinct (k-1)-lets in the sequence S. Recently, Propp and Wilson [15, 21] have developed new algorithms, based on similar principles, which improve the time complexity.

3 Shuffling sequences with motif constraints.

3.1 Preliminaries

In this section, we address the problem of generating shuffled sequences that are subject to additional constraints.
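As a concrete reference point for the unconstrained shuffling problem of Section 2, the following Python sketch builds the sequence graph, draws an arborescence rooted at $v_e$ by a backward random walk (in the spirit of the Aldous-Broder variant described above), and then walks an Eulerian trail in which each tree arc is used last. It is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and for simplicity the start node is always the first (k-1)-let of the input, which is a restriction when the graph is cycle-Eulerian.

```python
import random
from collections import Counter, defaultdict


def klet_counts(seq, k):
    """Multiset of all factors of length k of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shuffle_klets(seq, k, rng=random):
    """Return a sequence with the same k-let counts as seq (requires 2 <= k <= len(seq))."""
    nodes = [seq[i:i + k - 1] for i in range(len(seq) - k + 2)]
    vb, ve = nodes[0], nodes[-1]
    out_arcs = defaultdict(list)  # node -> successors, with multiplicity
    in_arcs = defaultdict(list)   # node -> predecessors, with multiplicity
    for u, v in zip(nodes, nodes[1:]):
        out_arcs[u].append(v)
        in_arcs[v].append(u)
    if vb != ve:
        # Virtual arc ve -> vb makes the graph cycle-Eulerian; it is only
        # walked backwards and can never enter the tree (ve is the root).
        in_arcs[vb].append(ve)

    # Backward random walk from the root ve: the arc used the first time a
    # node is reached becomes that node's tree arc.
    tree = {}
    remaining = set(nodes) - {ve}
    cur = ve
    while remaining:
        prev = rng.choice(in_arcs[cur])
        if prev in remaining:
            tree[prev] = cur
            remaining.discard(prev)
        cur = prev

    # Random exit order at every node, with the tree arc kept for last
    # (pop() takes from the end of the list, so the tree arc goes to index 0).
    for u, succs in out_arcs.items():
        rng.shuffle(succs)
        if u in tree:
            succs.remove(tree[u])
            succs.insert(0, tree[u])

    # Follow the trail from vb and read its trace.
    cur, letters = vb, [vb]
    while out_arcs[cur]:
        cur = out_arcs[cur].pop()
        letters.append(cur[-1])
    return "".join(letters)
```

A call such as `shuffle_klets("ATGTTCATGCATGGATGGATAG", 3, random.Random(0))` returns a sequence whose 3-let counts can be compared with those of the input using `klet_counts`.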
We consider a reference sequence S of length n on an alphabet L and an integer k such that $2 \le k \le n$. Now, let $M = [M_1, \ldots, M_p]$ be a multiset of words over L with $|M_i| > k$ for all $i \in [1, p]$, such that each $M_i$ is a factor of S, and there are at most as many occurrences of $M_i$ in M as in S. Overlapping occurrences are not taken into account: if the occurrences of two motifs overlap in the sequence, only one of them is counted. In the following, we call the words of M motifs.

The problem consists of generating sequences that have exactly the same k-let counts as S, and contain at least as many occurrences of each motif of M as its number of occurrences in M. Overlapping occurrences are again not taken into account. An acceptable sequence is any sequence that respects these conditions. As the motifs are taken (without overlap) from the reference sequence S, we are guaranteed at least one acceptable sequence.

Our approach consists of two principal steps, developed in Sections 3.2 and 3.3. In the first step, we define a new multi-digraph from Gr(S, k) in which each acceptable sequence is the trace of an Eulerian trail. We then generate an Eulerian trail uniformly at random, and verify that the corresponding trace gives rise to an acceptable sequence, which is not always the case. This step involves an NP-complete problem. However, we propose a simple, efficient heuristic algorithm for solving this problem (Section 5). The second step aims to ensure the uniformity of the random generation. For this, we need to compute the number of Eulerian trails that correspond to any generated trace. Unlike in the original shuffling problem (see Proposition 1), this number strongly depends on the given trace. This counting problem is #P-complete, but we propose a method to solve it in practice.

We present three major definitions involving acceptable sequences.

Definition 3. A configuration of a sequence S according to a multiset of words $M = [M_1, \ldots, M_p]$ is a p-tuple $(i_1, \ldots, i_p)$ of integers, where $i_l$ is the position of one occurrence of the word $M_l$ in S.

Definition 4. Let $C = (i_1, \ldots, i_p)$ and $C' = (i'_1, \ldots, i'_p)$ be two configurations of a sequence S according to the multiset $M = [M_1, \ldots, M_p]$. For any word w in M, let $J_w = \{ j : M_j = w \}$. Configurations C and C' are said to be equivalent if, and only if, for any word w in M, the two sets $\{ i_j : j \in J_w \}$ and $\{ i'_j : j \in J_w \}$ are equal.

Definition 5. A configuration C of a sequence S according to a multiset of words $M = [M_1, \ldots, M_p]$ is perfect if, and only if, for any $i \neq j$, there is no overlap between the two occurrences of $M_i$ and $M_j$.

Clearly, a sequence is acceptable if, and only if, it has a perfect configuration over M.

3.2 Generating acceptable sequences

Constrained sequence graphs

Definition 6. The sequence cluster of order k of a word $M_i = m_1 \ldots m_{r_i} \in M$, denoted $\mathrm{Chi}(M_i, k)$, is a directed multigraph C = (CV, CE) composed of three nodes:

$$CV = \{\, m_{[1,k-1]},\ (m_{[2,r_i-1]}, i),\ m_{[r_i-k+2,r_i]} \,\}$$

and two arcs:

$$CE = \left[\, (m_{[1,k-1]},\ (m_{[2,r_i-1]}, i)),\ ((m_{[2,r_i-1]}, i),\ m_{[r_i-k+2,r_i]}) \,\right].$$

The special node $(m_{[2,r_i-1]}, i)$ is called a cluster node.

An example of a sequence cluster is given in Figure 2.

Fig. 2. Sequence cluster of $M_1$ = GCATGGATGG, with k = 3 (nodes: GC, (CATGGATG,1), GG).

Definition 7. Let S be an acceptable sequence. Let G = Gr(S, k) = (V, E) be the sequence graph associated with S and k. For all $i \in [1, p]$, let $G_i = Gr(M_i, k) = (V_i, E_i)$ and $C_i = \mathrm{Chi}(M_i, k) = (CV_i, CE_i)$ be the sequence graphs and the sequence clusters associated with each $M_i$. The constrained sequence graph G', denoted GrC(S, k, M), is defined by G' = (V', E'), with

$$E' = E \cup \bigcup_{i=1}^{p} CE_i - \bigcup_{i=1}^{p} E_i$$

and

$$V' = \{\, v \in V'' \mid \deg_{G''}(v) \neq 0 \,\}$$

where G'' = (V'', E') with

$$V'' = V \cup \bigcup_{i=1}^{p} CV_i.$$

We have replaced the subgraphs representing each $M_i$ in Gr(S, k) by the sequence cluster of $M_i$. There are as many cluster nodes in GrC(S, k, M) as there are motifs in M. An example of a constrained sequence graph is given in Figure 3.

Fig. 3. The constrained sequence graph of ATGTTCATGCATGGATGGATAG with M = [GCATGGATGG] and k = 3 (nodes: (CATGGATG,1), AT, GA, GG, TG, GC, GT, TT, TC, CA, TA, AG).

The notion of a trace of a sequence graph can be extended to constrained sequence graphs by making the following change: on crossing a cluster node (w, i), its |w| - k last letters have to be concatenated. As in Section 2, the following simple result shows that there is a direct link between acceptable sequences and Eulerian trails in a constrained sequence graph.

Proposition 2. The set of acceptable sequences is included in the set of traces of Eulerian trails in GrC(S, k, M).

Proof. Let S be an acceptable sequence over $M = [M_1, \ldots, M_p]$, a multiset of words, and let $J = (j_1, \ldots, j_p)$ be a perfect configuration of S according to M. Let $(i_1, \ldots, i_p)$ be the positions in S of the occurrences of words pointed to by the perfect configuration J. Let us consider

$$T = (s_{[1,k]}, \ldots, s_{[i_1,i_1+k-1]}, \ldots, s_{[i_1+|M_1|-1,\,i_1+|M_1|+k-2]}, \ldots, s_{[i_p,i_p+k-1]}, \ldots, s_{[i_p+|M_p|-1,\,i_p+|M_p|+k-2]}, \ldots, s_{[n-k+1,n]}),$$

an Eulerian trail in Gr(S, k) whose trace is S. Let $C(M_i)$ be the cluster node associated with $M_i$. Then

$$T' = (s_{[1,k]}, \ldots, s_{[i_1,i_1+k-1]}, C(M_1), \ldots, s_{[i_p,i_p+k-1]}, C(M_p), \ldots, s_{[n-k+1,n]})$$

is an Eulerian trail in GrC(S, k, M) whose trace is S.

Searching for perfect configurations.

Unfortunately, not all Eulerian trails give rise to an acceptable sequence, because motifs may overlap, as shown in Figure 4. Therefore, once a random sequence S has been generated, we have to verify whether it contains a perfect configuration. We call this problem PCS, for "Perfect Configuration Searching", and it is defined as follows.

Instance: An alphabet A, a sequence S over A, a multiset $M = [M_1, \ldots, M_p]$ of p words.
Question: Does there exist a perfect configuration of S according to M?

Fig. 4. In this constrained sequence graph with M = [ATAC, ACAG] (nodes: CA, AG, AA, GA, AT, AC, (TA,1), (CA,1)), the Eulerian trail (AT, TA, AC, CA, AG, GA, AC, CA, AA, AG) gives a sequence ATACAGACAAG, which is not acceptable because the only occurrences of ATAC and ACAG are overlapping.

At this stage, the sequences that we are dealing with are not general sequences, because they result from an Eulerian trail in a sequence graph. Therefore, we need to consider the problem of searching for a perfect configuration in such sequences. Definition 8 and Proposition 3 will allow us to do this.

Definition 8. Let k be a positive integer. A configuration C of a sequence S according to a multiset of words M is (k)-pseudo-perfect if, and only if, for any i and j, there is no overlap of as much as k letters between any two occurrences of $M_i$ and $M_j$.

This means that all the words pointed to by the configuration overlap by at most k - 1 letters. We shall omit the parameter k when explicit reference to a constrained sequence graph is given. In this case, k is the order of the graph. Now, the following property holds.

Proposition 3. A sequence S has a k-pseudo-perfect configuration according to M if, and only if, S is the trace of an Eulerian trail in the constrained sequence graph GrC(S, k, M).

Proof. Let S be the trace of an Eulerian trail $T = (t_1, \ldots, t_{n-k+1})$ given as a sequence of nodes in GrC(S, k, M) = (V', E'). Some of these nodes, say $t_{i_1}, \ldots, t_{i_p}$, are cluster nodes. By construction, for any $l_1, l_2$, there is no arc $(t_{i_{l_1}}, t_{i_{l_2}})$ in E'. Thus, there exists $t_j \in T$ such that $i_{l_1} < j < i_{l_2}$. This implies that the occurrences $m_{i_{l_1}}$ and $m_{i_{l_2}}$ overlap in S by at most $|t_j| = k - 1$ letters, and S contains a pseudo-perfect configuration over $M = [M_1, \ldots, M_p]$.

Conversely, if S has a k-pseudo-perfect configuration $(j_1, \ldots, j_p)$ over M, then we can construct the same Eulerian trail T' from an Eulerian trail T in Gr(S, k) as in the proof of Proposition 2.

Thus, the actual problem we are addressing, FPCS for "Further Perfect Configuration Searching", is defined as follows.

Instance: An alphabet A, a multiset $M = [M_1, \ldots, M_p]$, an integer k and a word S over A such that S has a (k)-pseudo-perfect configuration over M.
Question: Does there exist a perfect configuration of S according to M?

Unfortunately, we have:

Theorem 2. Problem FPCS is NP-complete.

And we deduce:

Corollary 1. PCS is NP-complete.

For readability, the proofs of Theorem 2 and Corollary 1 are given in Section 4.

An algorithm for FPCS

Although FPCS is NP-complete, we present here an algorithm that is efficient in realistic cases (see Section 5). First, we define the overlapping graph of M over S.

Definition 9. The overlapping graph of M over S is the undirected graph G = (V, E) such that every occurrence in S of each word in M is a distinct node, and such that there is an edge between two given nodes if the occurrences they are associated with overlap in S.

An example of an overlapping graph is given in Figure 5.

Fig. 5. The overlapping graph of M = [ATT, TATT, CGAT, TTAT, ATT] over S = ATTATCGATTATATTATCCGACGATTATTC (node groups labelled M1 to M5).

Given an overlapping graph G, the algorithm is essentially a classical arborescent (tree) search. We suppose that the motifs of $M = [M_1, \ldots, M_p]$ and the occurrences $[M_{i,1}, \ldots, M_{i,p_i}]$ of each motif $M_i$ are ordered. The algorithm then proceeds as follows. Take the first occurrence of the first motif and delete all of its neighbours. Then continue in the same manner with the first (remaining) occurrence of the second motif, and so on, until either the last motif is taken, or the process stops before reaching all motifs. In the first case, the set of occurrences that were chosen constitutes a perfect configuration. In the second case, we backtrack until we find a suitable sequence of occurrences.

There is also a direct interpretation of a perfect configuration in terms of the graph G. If we add edges to make cliques on all the vertex-occurrences of a same motif, then there is a one-to-one correspondence between the set of perfect configurations and the set of maximum independent sets in this new graph.
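The backtracking search described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it checks overlaps directly on intervals instead of materialising the overlapping graph, it uses 0-based positions, and the function names are hypothetical. Its worst-case running time is exponential, as expected for an NP-complete problem.

```python
def occurrences(seq, word):
    """All 0-based start positions of word in seq, overlapping ones included."""
    return [i for i in range(len(seq) - len(word) + 1) if seq.startswith(word, i)]


def find_perfect_configuration(seq, motifs):
    """Return one start position per motif such that no two chosen
    occurrences overlap, or None if no perfect configuration exists."""
    occ = [occurrences(seq, m) for m in motifs]
    chosen = []  # half-open intervals (start, end) chosen so far, in motif order

    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    def search(i):
        if i == len(motifs):
            return True
        for start in occ[i]:
            cand = (start, start + len(motifs[i]))
            # Skipping occurrences that overlap a chosen one plays the role
            # of "deleting the neighbours" in the overlapping graph.
            if all(not overlaps(cand, c) for c in chosen):
                chosen.append(cand)
                if search(i + 1):
                    return True
                chosen.pop()  # backtrack
        return False

    return [start for start, _ in chosen] if search(0) else None
```

On the instance of Figure 5 the search finds a perfect configuration, whereas on the sequence ATACAGACAAG of Figure 4 with M = [ATAC, ACAG] it returns `None`, since the only occurrences of the two motifs overlap.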

3.3 Generating sequences uniformly at random.

We now focus on the problem of generating random sequences uniformly. For constrained sequence graphs, no property such as Proposition 1 holds. The number of Eulerian trails corresponding to a given trace strongly depends on this trace. Consequently, the generation process is not necessarily uniform. Therefore, we use a classical rejection method to make the generation uniform. When a trace is generated, we either accept it with a probability inversely proportional to the number of its corresponding Eulerian trails, or we reject it and start the process again. Hence, we need to count the number of Eulerian trails corresponding to a given trace.

Proposition 4. The number of Eulerian trails corresponding to any given trace S is

$$\frac{N(k, S, M)}{\prod_{m \in M} |M|_m!}$$

where N(k, S, M) is the number of (k)-pseudo-perfect configurations of S according to M, and $|M|_m$ is the number of occurrences of m in the multiset M.

Proof. In Propositions 2 and 3, we have seen how to map a k-pseudo-perfect configuration to an Eulerian trail in GrC(S, k, M). This gives the numerator. However, if two configurations are equivalent (see Definition 4), they will be mapped to the same Eulerian trail, giving the denominator.

Now, counting the number of Eulerian trails corresponding to a given trace reduces to counting the number of equivalence classes of pseudo-perfect configurations. Our counting algorithm is based on the pseudo-overlapping graph of M over S, similar to the previously defined overlapping graph.

Definition 10. The pseudo-overlapping graph of M over S is the undirected graph G = (V, E) such that each occurrence in S of every word in M is a distinct node, and there exists an edge between two given nodes if the occurrences with which they are associated overlap by at least k letters.

If we consider the pseudo-overlapping graph in which all the nodes corresponding to occurrences of any same word are connected together in a clique, the number of maximal independent sets (MIS) in this graph is equal to the number of equivalence classes of perfect configurations in the related sequence. The problem of counting MISs is known to be polynomial in intersection graphs (including interval graphs) [5]. Although each pseudo-overlapping graph is clearly an interval graph, the graphs we consider here are not even perfect graphs (the problem of determining the cardinality of an MIS is polynomial for perfect graphs but NP-complete for general graphs, see [11, 10] and references therein). Unfortunately, we have:

Theorem 3. The problem of counting the equivalence classes of perfect configurations of S according to M is #P-complete.

The proof of this theorem is given in Section 4.
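To make the count of Proposition 4 concrete, the following Python sketch enumerates (k)-pseudo-perfect configurations by brute force and divides by the product of the factorials of the motif multiplicities. It is an illustrative assumption-laden sketch, not the tree search over the pseudo-overlapping graph used by the authors: the function names are hypothetical, positions are 0-based, and the enumeration is exponential in the number of motifs, so it is only usable on tiny instances.

```python
import random
from collections import Counter
from itertools import product
from math import factorial, prod


def overlap_len(a, b):
    """Number of letters shared by two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def count_pseudo_perfect(seq, motifs, k):
    """N(k, S, M): configurations whose occurrences pairwise overlap
    by at most k - 1 letters (Definition 8)."""
    occ = [[(i, i + len(m)) for i in range(len(seq) - len(m) + 1)
            if seq.startswith(m, i)] for m in motifs]
    total = 0
    for combo in product(*occ):
        if all(overlap_len(combo[i], combo[j]) < k
               for i in range(len(combo)) for j in range(i + 1, len(combo))):
            total += 1
    return total


def trails_per_trace(seq, motifs, k):
    """Number of Eulerian trails for the trace seq, following Proposition 4."""
    denom = prod(factorial(c) for c in Counter(motifs).values())
    return count_pseudo_perfect(seq, motifs, k) // denom


def accept(n_trails, rng=random):
    """Rejection step: keep a trace with probability 1 / n_trails."""
    return rng.random() < 1.0 / n_trails
```

For instance, with S = ATACAGACAAG, M = [ATAC, ACAG] and k = 3, the two occurrences overlap by two letters, so the single configuration is pseudo-perfect although it is not perfect.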

3.4 The random generation algorithm.

We are now able to state the complete algorithm for generating constrained sequences uniformly at random.

Algorithm 1 Random generation
Input: a sequence S, an integer k, a multiset M
Output: a sequence T
(i) Produce the constrained sequence graph G = GrC(S, k, M).
(ii) Uniformly generate a random Eulerian trail in G, and take its trace T.
(iii) If T has no perfect configuration then goto (ii).
(iv) Compute the number N of Eulerian trails corresponding to this particular trace T.
(v) Return T with probability 1/N, or goto (ii).

If we could compute a lower bound m of the minimum, over all traces, of the number of Eulerian trails associated with a trace, we could replace the acceptance probability in (v) by m/N. However, in general, it is very difficult to compute this lower bound.

Proposition 5. Step (iv) of Algorithm 1 is called at most R times on average, where R is the average number of Eulerian trails per trace.

Proof. Consider the square $[0, 1]^2$. For any given trace of an Eulerian trail, consider an interval of [0, 1] whose length is the probability of choosing this particular trace. Place those intervals one after the other in any given order. Then, above each interval, construct a rectangle whose height is the probability of keeping this trace according to Algorithm 1. The sum of the areas of these rectangles equals the number of distinct traces divided by the number of distinct Eulerian trails. It is easy to verify that this number is, in fact, the inverse of the average number R of Eulerian trails associated with a trace. The expected number of steps needed to hit one of these rectangles, and hence to stop the algorithm, is therefore R.

4 Proofs of Theorem 2, Corollary 1 and Theorem 3

Clearly, FPCS is a special case of PCS. For the sake of clarity, we first prove the NP-completeness of PCS (Corollary 1) and then extend the argument to FPCS (Theorem 2).

Corollary 1 PCS is NP-complete.

Proof. Clearly, we can verify in polynomial time whether or not a given configuration is perfect. So PCS is in NP. We now reduce 3-Dimensional Matching (3DM) to PCS (see Example 1). The 3DM problem [10] is defined as follows:

Instance: A set $C \subseteq X \times Y \times Z$ where X, Y, Z are disjoint sets having the same number q of elements.
Question: Does C contain a matching, that is, a subset $C' \subseteq C$ such that $|C'| = q$ and no two elements of C' agree in any coordinate?

Let us consider an instance of 3DM, that is, three sets X, Y, Z of the same cardinality q and $C \subseteq X \times Y \times Z$. For any $r \in X \cup Y \cup Z$, we define $f_C(r)$ as the number of occurrences of r in C.

Let $C = \{ c_1, \ldots, c_s \}$ and define $S = w_{c_1} \ldots w_{c_s}$ where, for all $c = (x, y, z) \in C$, $w_c = w_x w_y w_z 0$, with $w_x = a\,0^{x-1} 1\, 0^{q-x} a$, $w_y = b\,0^{y-1} 1\, 0^{q-y} b$, and $w_z = ba\,0^{z-1} 1\, 0^{q-z} ba$.

For any $x \in X$, we define a multiset $M_x$ as follows: it contains
1. one occurrence of the motif $m_x = 0^{x-1} 1\, 0^{q-x} ab$;
2. $f_C(x) - 1$ occurrences of the motif $m'_x = a\,0^{x-1} 1\, 0^{q-x} a$.

Similarly, for any $y \in Y$ (resp. $z \in Z$) we define $M_y$ (resp. $M_z$) as the multiset containing one occurrence of $m_y = 0^{y-1} 1\, 0^{q-y} bba$ (resp. $m_z = 0^{z-1} 1\, 0^{q-z} ba0$) and $f_C(y) - 1$ occurrences of $m'_y = b\,0^{y-1} 1\, 0^{q-y} b$ (resp. $f_C(z) - 1$ occurrences of $m'_z = ba\,0^{z-1} 1\, 0^{q-z} ba$). Finally, we set $M = \bigcup_{e \in X \cup Y \cup Z} M_e$, where $\bigcup$ denotes the union of multisets.

Obviously, this transformation is polynomial with respect to the instance of 3DM. So, we only need the following two claims to conclude.

Claim 1 If there exists a perfect matching in C, then there exists a perfect configuration of M over S.

Let C' be a perfect matching for C. We construct a perfect configuration of S over M by independently considering the factors $w_c = a\,0^{x-1} 1\, 0^{q-x} a\; b\,0^{y-1} 1\, 0^{q-y} b\; ba\,0^{z-1} 1\, 0^{q-z} ba\; 0$ of S, for all $c \in C$.

1. Each $c = (x, y, z) \in C'$ is covered by the following three motifs of M: $m_x = 0^{x-1} 1\, 0^{q-x} ab$, $m_y = 0^{y-1} 1\, 0^{q-y} bba$, and $m_z = 0^{z-1} 1\, 0^{q-z} ba0$. Each of these motifs occurs only once in $M_x$, $M_y$ and $M_z$ respectively. Since there is only one occurrence of x, y and z respectively in C' by definition of a matching, only the motifs in $\{ m_x, m_y, m_z : x \in X, y \in Y, z \in Z \}$ of M have been used to cover all the factors $w_c$ of S for any $c \in C'$.
2. Each $c = (x, y, z) \notin C'$ is covered by the following three motifs of M: $m'_x = a\,0^{x-1} 1\, 0^{q-x} a$, $m'_y = b\,0^{y-1} 1\, 0^{q-y} b$, and $m'_z = ba\,0^{z-1} 1\, 0^{q-z} ba$. Since motif $m'_x$ (resp. $m'_y$, $m'_z$) occurs $f_C(x) - 1$ (resp. $f_C(y) - 1$, $f_C(z) - 1$) times in M, all the factors $w_c$ of S for any $c \notin C'$ are covered (except for the terminal 0 in each of them), and all the motifs $m'_x$, $m'_y$ and $m'_z$ of M have been used.

Finally, all motifs of M have been used, and no two overlap. We have thus defined a perfect configuration of S according to M.

Claim 2 If there exists a perfect configuration of M over S, then there exists a perfect matching in C.

Let P be a perfect configuration of M over S. We construct a perfect matching C' in C. We define C' as follows: $c = (x, y, z) \in C'$ if, and only if, in P, the word $w_x$ of the factor $w_c = w_x w_y w_z 0$ of S is partially covered by the motif $m_x = 0^{x-1} 1\, 0^{q-x} ab$ of M.

Since $|\{ m_x : x \in X \}| = |X| = q$, by construction we get $|C'| = |\{ m_x : x \in X \}| = q$. It remains to prove that C' is a perfect matching of C. Indeed, let $c = (x, y, z) \in C'$. In the corresponding factor $w_c = w_x w_y w_z 0$ of S, by definition $w_x$ is covered by $m_x$. The $f_C(x) - 1$ remaining occurrences of $w_x$ in S are covered by the $f_C(x) - 1$ motifs $m'_x$. Thus, $w_c$ is necessarily covered by $m_x m_y m_z$, because, by construction, no two motifs $m_r$ and $m'_s$ (for any $r, s \in X \cup Y \cup Z$) can cover a factor $w_c$ without overlapping. As there is exactly one motif $m_x$ (resp. $m_y$, $m_z$) per element of X (resp. Y, Z), each element of X (resp. Y, Z) occurs exactly once in C'. Finally, C' is a perfect matching of C.

This concludes the proof.

Example 1. We consider an instance I of 3DM such that $X = \{x, x'\}$, $Y = \{y, y'\}$, $Z = \{z, z'\}$, $C = \{ c_1 = (x, y', z),\ c_2 = (x', y, z'),\ c_3 = (x, y, z') \}$. The instance T(I) of PCS is defined as follows:
- $w_x$ = a10a, $w_{x'}$ = a01a, $w_y$ = b10b, $w_{y'}$ = b01b, $w_z$ = ba10ba, $w_{z'}$ = ba01ba
- $w_{c_1}$ = a10ab01bba10ba0, $w_{c_2}$ = a01ab10bba01ba0, $w_{c_3}$ = a10ab10bba01ba0
- M = [10ab, a10a, 01ab, 10bba, b10b, 01bba, 10ba0, 01ba0, ba01ba].
- S = a10ab01bba10ba0a01ab10bba01ba0a10ab10bba01ba0.

The instance I has a matching composed of $c_1$ and $c_2$. Hence the instance T(I) has a perfect configuration of M over S:

$$S = \underbrace{\mathtt{a}\,\overbrace{\mathtt{10ab}}\,\overbrace{\mathtt{01bba}}\,\overbrace{\mathtt{10ba0}}}_{w_{c_1}}\ \underbrace{\mathtt{a}\,\overbrace{\mathtt{01ab}}\,\overbrace{\mathtt{10bba}}\,\overbrace{\mathtt{01ba0}}}_{w_{c_2}}\ \underbrace{\overbrace{\mathtt{a10a}}\,\overbrace{\mathtt{b10b}}\,\overbrace{\mathtt{ba01ba}}\,\mathtt{0}}_{w_{c_3}}$$

Theorem 2 Problem FPCS is NP-complete.

Proof. We only need to prove the result for the particular case where k = 2. It is easy to see that Problem FPCS belongs to NP. Moreover, we transform 3DM to FPCS by the transformation given in Corollary 1. Let us consider an instance I of 3DM, that is, three sets X, Y, Z of the same cardinality q and $C \subseteq X \times Y \times Z$. Let I' be an instance of FPCS obtained by the transformation in Corollary 1. Instance I' is composed of an alphabet A = {0, 1, a, b}, a multiset M of p words, and a word S over A. By the proof of Corollary 1, there exists a perfect matching in C if, and only if, there exists a perfect configuration of M over S.

It remains to prove that S has a (2)-pseudo-perfect configuration over M. Let $x \in X$. Recall that $f_C(r)$ is the number of occurrences of r in C. By the transformation from 3DM in Corollary 1, there is one motif $0^{x-1} 1\, 0^{q-x} ab$ and $f_C(x) - 1$ motifs $a\,0^{x-1} 1\, 0^{q-x} a$ in M. We now construct a pseudo-perfect configuration over M. The $f_C(x)$ patterns $a\,0^{x-1} 1\, 0^{q-x} ab$ contained in S can be covered by the corresponding $f_C(x)$ motifs in M. We apply the same construction for all elements of Y and Z. Now, for each c = (x, y, z) in C, the word $w_c$ in S is covered by three motifs of M, and, by construction, two consecutive motifs overlap by at most one letter. So, the configuration obtained is a (2)-pseudo-perfect configuration over M.
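The transformation from 3DM used in the proofs of Corollary 1 and Theorem 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: elements of X, Y and Z are numbered from 1 to q, triples are given as integer tuples, and the names `unary` and `reduce_3dm` are hypothetical.

```python
from collections import Counter


def unary(i, q):
    """Encode element i in 1..q as the word 0^(i-1) 1 0^(q-i)."""
    return "0" * (i - 1) + "1" + "0" * (q - i)


def reduce_3dm(triples, q):
    """Build the word S and the multiset M (as a list) from a 3DM instance.

    triples: list of (x, y, z) with each coordinate in 1..q.
    """
    # S is the concatenation of w_c = w_x w_y w_z 0 over all triples c.
    s = "".join(
        "a" + unary(x, q) + "a"          # w_x
        + "b" + unary(y, q) + "b"        # w_y
        + "ba" + unary(z, q) + "ba"      # w_z
        + "0"
        for x, y, z in triples
    )
    fx = Counter(x for x, _, _ in triples)
    fy = Counter(y for _, y, _ in triples)
    fz = Counter(z for _, _, z in triples)

    motifs = []
    for i in range(1, q + 1):  # multisets M_x
        motifs.append(unary(i, q) + "ab")
        motifs += ["a" + unary(i, q) + "a"] * (fx[i] - 1)
    for i in range(1, q + 1):  # multisets M_y
        motifs.append(unary(i, q) + "bba")
        motifs += ["b" + unary(i, q) + "b"] * (fy[i] - 1)
    for i in range(1, q + 1):  # multisets M_z
        motifs.append(unary(i, q) + "ba0")
        motifs += ["ba" + unary(i, q) + "ba"] * (fz[i] - 1)
    return s, motifs
```

With x, y, z numbered 1 and x', y', z' numbered 2, the instance of Example 1 corresponds to `reduce_3dm([(1, 2, 1), (2, 1, 2), (1, 1, 2)], 2)`.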

Theorem 3 The counting problem of the equivalence classes of perfect configurations of S according to M is #P-complete.

Proof. Problem #PCS belongs to the class #P because there is a polynomial-time algorithm to determine, given an instance x of #PCS and a configuration y of S according to M, whether y is a perfect configuration of S according to M. We demonstrate that #PCS is #P-hard by showing a parsimonious reduction from the #P-complete problem #Perfect Matching, defined as follows:

Instance: A bipartite graph G.
Question: How many perfect matchings does G have?

Suppose that we are given an instance I of the #Perfect Matching problem with bipartite graph $G = (V_1 \cup V_2, E)$ such that no two vertices within $V_1$ (resp. $V_2$) are adjacent. The reduction can be split into two parts.

First, instance I is transformed into an instance I' of 3DM such that:
- $X = V_1$, $Y = V_2$ and $Z = V_2$.
- $C = \{ (x_1, x_2, x_2) : (x_1, x_2) \in E \wedge x_1 \in V_1 \wedge x_2 \in V_2 \}$

Therefore, the number of perfect matchings in G is equal to the number of matchings in C. Indeed, there is a one-to-one correspondence between the set of perfect matchings in G and the set of matchings in C.

Second, the instance I' of 3DM is transformed into an instance I'' of #PCS using the same transformation as in the proof of Corollary 1. The proofs of Claims 1 and 2 show that there exists a one-to-one correspondence between the set of matchings in C and the set of equivalence classes of perfect configurations of S. So, there exists a one-to-one correspondence between the set of perfect matchings in G and the set of equivalence classes of perfect configurations of S. Thus, this reduction from #Perfect Matching to #PCS is parsimonious.

5 Experimental results.

We know the theoretical complexity of every routine of our algorithm except for
1. the number of times step (ii) of the algorithm is processed;
2. searching whether T contains a perfect configuration (step (iii));
3. counting the number of Eulerian trails which correspond to T (step (iv)).

Therefore, we carried out simulations on random data to determine the average complexity of these routines. We aimed to determine what can and cannot be done in terms of the size of the parameters. Routines 2 and 3 both essentially involve an arborescent search over an overlapping graph. Routine 2 requires only one "good" search and then stops. However, in many cases routine 3 requires a search of the whole search tree. The two algorithms are therefore difficult for different reasons.

We generated random instances of the problem as follows. Sequences of size n were generated according to uniform Bernoulli probabilities over an alphabet of size t. Generally, we took t = 4 because we are interested in DNA sequences.

Given the cardinality p of M and the size s of its motifs, we then generated the multiset M by choosing p positions in the sequence and taking, for each position, the word of length s beginning at that position. Thus, all motifs had the same length.

We first looked at the number of times step (ii) was processed. We found that for small k, say k < 10, almost all the sequences produced contained a perfect configuration. Therefore, the algorithm behaves as if there were no rejection.

For routine 2, a difficult case would occur when n is far larger than 4^s. In this case the motifs would tend to have more than one occurrence in S and therefore, the overlapping graph would have many nodes. The problem is even more difficult if these occurrences overlap, which is the case when the motifs are numerous and large enough. If k >> 1, the instance also becomes difficult because any trace of an Eulerian trail already contains a (k)-pseudo-perfect configuration. Therefore, to generate difficult cases, we need n >> 4^s and s > k >> 1. If we choose k ≈ 10 and s ≈ 11, then n should be greater than 10^8. The graph library we used for our implementation did not allow us to investigate this many values efficiently. Therefore, we restricted our simulations to k = 5 and found no case when n < 100000 and |M| < 1000 in which the computation time of routine 2 was significant. (For bioinformatics purposes, k = 5 is a standard value for shuffling DNA sequences.) We are currently working on implementing a graph library that should allow us further investigations.

Routine 3 is the bottleneck of our algorithm. As it enumerates all pseudo-perfect configurations, its complexity strongly depends on the number of nodes of the pseudo-overlapping graph. As for routine 2, this number is very dependent on the number of occurrences of the motifs in the sequence, which is itself related to the ratio n/4^s. If this ratio is high, we expect a high number of occurrences of motifs and, consequently, a high computation time.
This is what we observed with random data, as illustrated in Figure 6. We show here the case for s = 6, but the results are similar for other values of s, with the time scale increasing exponentially when s decreases.

In practice, the program can generate sequences up to a length of 100000 with |M| up to several dozens of motifs in a few minutes on a standard PC.

We are also trying to improve the processing time. We have found that a number of motifs appear "naturally" in almost any (unconstrained) shuffled sequence, depending on their length and on the nucleotide composition of the starting sequence. Therefore, we can use a variant of the algorithm. We divide M into two multisets M1 and M2 such that M1 contains the "more likely" motifs and M2 contains the "less likely" motifs. We then produce the constrained sequence graph on M1 using only step (i) of the algorithm, and consider M in its entirety in step (iii). As almost any sequence contains the motifs of M2, and as step (iv) may be faster, the total processing time is much improved.
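The random instance generator described at the start of this section can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `random_instance` and `expected_occurrences`, the `seed` parameter, and the choice to sample the p positions with replacement are assumptions (the text does not say whether the positions are distinct). The second function evaluates (n - s + 1)/t^s, the expected number of occurrences of a fixed word of length s in a uniform random sequence of length n over t letters, which for t = 4 is essentially the ratio n/4^s used above to gauge the difficulty of an instance.

```python
import random


def random_instance(n, p, s, alphabet="ACGT", seed=None):
    """Return (S, M) for a random test instance.

    S is a sequence of length n with letters drawn independently and
    uniformly from `alphabet`; M is a list (multiset) of p motifs of
    length s, each read from a random position of S.
    """
    if s > n:
        raise ValueError("motif length s must not exceed sequence length n")
    rng = random.Random(seed)
    sequence = "".join(rng.choice(alphabet) for _ in range(n))
    # Assumption: positions are sampled with replacement, so the same
    # position (and hence the same motif) may be chosen more than once.
    starts = [rng.randrange(n - s + 1) for _ in range(p)]
    motifs = [sequence[i:i + s] for i in starts]
    return sequence, motifs


def expected_occurrences(n, s, t=4):
    """Expected count of a fixed length-s word in a uniform random
    sequence of length n over an alphabet of t letters."""
    return (n - s + 1) / t ** s


if __name__ == "__main__":
    S, M = random_instance(n=1000, p=5, s=6, seed=0)
    print(len(S), M)
    print(expected_occurrences(100000, 6))
```

With n = 100000 and s = 6 the ratio is about 24, so each motif is expected to occur many times in the sequence, which is the regime in which routine 3 becomes costly.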

Fig. 6. Experiments on random data, with k = 3 and s = 6.

6 Acknowledgement

We are grateful to Bodo Lass for his help. This research was partially supported by the French IMPG Program and the project "π-vert" of the ACI "New Interfaces of Mathematics".

References

1. T. van Aardenne-Ehrenfest and N.G. de Bruijn, Circuits and trees in oriented linear graphs, Simon Stevin (= Bull. Belgian Math. Soc.) 28 (1951), pp. 203–217.
2. D. Aldous, A random walk construction of uniform spanning trees and uniform labelled trees, SIAM Journal on Discrete Mathematics 3 (1990), pp. 450–465.
3. S. Altschul and B. Erickson, Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage, Mol. Biol. Evol. 2 (1985), pp. 526–538.
4. S. Altschul, T. Madden, A. Schäffer, J. Zhang, Z. Zhang, W. Miller and D. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research 25 (1997), pp. 3389–3402.
5. E. Balas and C. Yu, On graphs with polynomially solvable maximum-weight clique problem, Networks 19 (1989), pp. 247–253.
6. E. Beaudoing, S. Freier, J. Wyatt, J. Claverie and D. Gautheret, Patterns of variant polyadenylation signal usage in human genes, Genome Research 10 (2000), pp. 1001–1010.
7. A. Broder, Generating random spanning trees, in: Proc. of 30th Annual Symposium on Foundations of Computer Science, IEEE, 1989, pp. 442–447.
8. A. Denise, M. Régnier and M. Vandenbogaert, Assessing statistical significance of overrepresented oligonucleotides, in: O. Gascuel and B. Moret, editors, Proceedings of WABI'01, Lecture Notes in Computer Science 2149 (2001), pp. 85–97.
9. W. Fitch, Random sequences, Journal of Molecular Biology 163 (1983), pp. 171–176.
10. M. R. Garey and D. S. Johnson, "Computers and intractability: a guide to the theory of NP-completeness." W. H. Freeman and Company, 1979.
11. M. Grötschel, L. Lovász and A. Schrijver, Polynomial algorithms for perfect graphs, Annals of Discrete Mathematics 21 (1984), pp. 322–356.
12. D. Kandel, Y. Matias, R. Unger and P. Winkler, Shuffling biological sequences, Discrete Applied Mathematics 71 (1996), pp. 171–185.
13. D. Lipman, W. Wilbur, T. Smith and M. Waterman, On the statistical significance of nucleic acid similarities, Nucleic Acids Research 12 (1984), pp. 215–226.
14. P. Nicodème, B. Salvy and P. Flajolet, Motif statistics, Theoretical Computer Science 287(2) (2002), pp. 593–618.
15. J. Propp and D. Wilson, How to get a perfectly random sample from a generic Markov chain and generate a random spanning tree of a directed graph, Journal of Algorithms 27 (1998), pp. 170–217.
16. M. Régnier, A unified approach to word occurrence probabilities, Discrete Applied Mathematics 104 (2000), pp. 259–280.
17. G. Reinert, S. Schbath and M. Waterman, Probabilistic and statistical properties of words: An overview, Journal of Computational Biology 7 (2000), pp. 1–46.
18. J. van Helden, Metrics for comparing regulatory sequences on the basis of pattern counts, Bioinformatics 20 (2004), pp. 399–406.
19. J. van Helden, B. André and J. Collado-Vides, Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies, Journal of Molecular Biology 281 (1998), pp. 827–842.
20. A. Vanet, L. Marsan and M.-F. Sagot, Promoter sequences and algorithmical methods for identifying them, Res. Microbiol. 150 (1999), pp. 779–799.
21. D. Wilson, Generating random spanning trees more quickly than the cover time, in: Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, 1996, pp. 296–303.
