Motif matching using gapped patterns

15
Motif matching using gapped patterns Emanuele Giaquinta 1 , Kimmo Fredriksson 2 , Szymon Grabowski 3 , Alexandru I. Tomescu 1,4 , and Esko Ukkonen 1 1 Department of Computer Science, University of Helsinki, Finland {emanuele.giaquinta | tomescu | ukkonen}@cs.helsinki.fi 2 School of Computing, University of Eastern Finland [email protected] 3 Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90–924 od´ z, Poland [email protected] 4 Helsinki Institute for Information Technology HIIT Abstract. We present new algorithms for the problem of multiple string matching of gapped patterns, where a gapped pattern is a sequence of strings such that there is a gap of fixed length between each two consecutive strings. The problem has applications in the discovery of transcription factor binding sites in DNA sequences when using gener- alized versions of the Position Weight Matrix model to describe tran- scription factor specificities. We present a simple practical algorithm, based on bit-parallelism, that, given a text of length n, runs in time O(n(log σ + gw-spandk-len(P )/we)+ occ), where occ is the number of occurrences of the patterns in the text, k-len(P ) is the total num- ber of strings in the patterns and 1 gw-span w is the maximum number of distinct gap lengths that span a single word of w bits in our encoding. We then show how to improve the time complexity to O(n(log σ + log 2 gsize(P )dk-len(P )/we)+ occ) in the worst-case, where gsize(P ) is the size of the variation range of the gap lengths. Finally, by parallelizing in a different order we obtain O(dn/welen(P )+n+occ) time, where len(P ) is the total number of alphabet symbols in the patterns. We also provide experimental results which show that the presented algo- rithms are fast in practice, and preferable if all the strings in the patterns have unit-length. 1 Introduction We consider the problem of matching a set P of gapped patterns against a given text of length n, where a gapped pattern is a sequence of strings, over a finite alphabet Σ of size σ, such that there is a gap of fixed length between each two consecutive strings. We are interested in computing the list of matching patterns for each position in the text. Although our results are general, we focus on the case in which all the strings in the patterns have unit length. This problem is a specific instance of the Variable Length Gaps string matching problem (VLG problem) for multiple patterns and has applications in the discovery of tran- scription factor (TF) binding sites in DNA sequences when using generalized versions of the Position Weight Matrix (PWM) model to represent TF binding specificities. arXiv:1306.2483v1 [cs.DS] 11 Jun 2013

Transcript of Motif matching using gapped patterns

Motif matching using gapped patterns

Emanuele Giaquinta1, Kimmo Fredriksson2, Szymon Grabowski3, Alexandru I.Tomescu1,4, and Esko Ukkonen1

1 Department of Computer Science, University of Helsinki, Finlandemanuele.giaquinta | tomescu | [email protected]

2 School of Computing, University of Eastern Finland [email protected] Institute of Applied Computer Science, Lodz University of Technology, Al.

Politechniki 11, 90–924 Lodz, Poland [email protected] Helsinki Institute for Information Technology HIIT

Abstract. We present new algorithms for the problem of multiple stringmatching of gapped patterns, where a gapped pattern is a sequenceof strings such that there is a gap of fixed length between each twoconsecutive strings. The problem has applications in the discovery oftranscription factor binding sites in DNA sequences when using gener-alized versions of the Position Weight Matrix model to describe tran-scription factor specificities. We present a simple practical algorithm,based on bit-parallelism, that, given a text of length n, runs in timeO(n(log σ + gw-spandk-len(P)/we) + occ), where occ is the number ofoccurrences of the patterns in the text, k-len(P) is the total num-ber of strings in the patterns and 1 ≤ gw-span ≤ w is the maximumnumber of distinct gap lengths that span a single word of w bits inour encoding. We then show how to improve the time complexity toO(n(log σ + log2 gsize(P)dk-len(P)/we) + occ) in the worst-case, wheregsize(P) is the size of the variation range of the gap lengths. Finally, byparallelizing in a different order we obtain O(dn/welen(P)+n+occ) time,where len(P) is the total number of alphabet symbols in the patterns.We also provide experimental results which show that the presented algo-rithms are fast in practice, and preferable if all the strings in the patternshave unit-length.

1 Introduction

We consider the problem of matching a set P of gapped patterns against a giventext of length n, where a gapped pattern is a sequence of strings, over a finitealphabet Σ of size σ, such that there is a gap of fixed length between each twoconsecutive strings. We are interested in computing the list of matching patternsfor each position in the text. Although our results are general, we focus on thecase in which all the strings in the patterns have unit length. This problem isa specific instance of the Variable Length Gaps string matching problem (VLGproblem) for multiple patterns and has applications in the discovery of tran-scription factor (TF) binding sites in DNA sequences when using generalizedversions of the Position Weight Matrix (PWM) model to represent TF bindingspecificities.

arX

iv:1

306.

2483

v1 [

cs.D

S] 1

1 Ju

n 20

13

In the VLG problem a pattern is a concatenation of strings and of variable-length gaps. An efficient approach to solve the problem for a single pattern isbased on the simulation of nondeterministic finite automata [10,6]. A method tosolve the case of one or more patterns is to translate the patterns into a regularexpression [11,4]. The best time bound for a regular expression is O(n(k logw

w +log σ)) [4], where k is the number of the strings and gaps in the pattern and w isthe machine word size in bits. Observe that in the case of unit-length keywordsk = Θ(len(P)), where len(P) is the total number of alphabet symbols in thepatterns. There are also algorithms efficient in terms of the total number α ofoccurrences of the strings in the patterns (keywords) within the text [9,13,3] 5.The best bound obtained for a single pattern is O(n log σ+α) [3]. This methodcan also be extended to multiple patterns. However, if all the keywords have unit

length this result is not ideal, because in this case α is Ω(n len(P)σ ) on average

if we assume that the symbols in the patterns are sampled from Σ accordingto a uniform distribution. A similar approach for multiple patterns [8] leadsto O(n(log σ + K) + α′) time, where K is the maximum number of suffixesof a keyword that are also keywords and α′ is the number of text occurrencesof pattern prefixes that end with a keyword. This result may be preferable ingeneral when α′ < α. In the case of unit-length keywords, however, a lowerbound similar to the one on α holds also for α′, as the prefixes of unit length

have on average Ω(n |P|σ ) occurrences in the text. Recently, a variant of thisalgorithm based on word-level parallelism was presented in [15]. This algorithmworks in time O(n(log σ+(log |P|+ k

w )αm)), where k in this case is the maximumnumber of keywords in a single pattern and αm ≥ dα/ne is the maximum numberof occurrences of keywords at a single text position. When α or α′ is large,the bound of [4] may be preferable. The drawback of this algorithm is that,to our knowledge, the method used to implement fixed-length gaps, based onmaintaining multiple bit queues using word-level parallelism, is not practical.

Note that the above bounds do not include preprocessing time and the log σterm in them is due to the simulation of the Aho-Corasick automaton for thestrings in the patterns.

In this paper we present new simple algorithms for the problem of matching aset of gapped patterns. Our algorithms are based on dynamic programming andbit-parallelism. The first algorithm has O(n(log σ+ gw-spandk-len(P)/we) + occ)-time complexity, where k-len(P) is the total number of keywords in the pat-terns and 1 ≤ gw-span ≤ w is the maximum number of distinct gap lengthsthat span a single word in our encoding. Although it is preferable only whengw-span w, it can also be used as a filter to speed up the algorithmby Haapasalo et al [8]. We then show how to improve the time bound toO(n(log σ + log2 gsize(P)dk-len(P)/we) + occ), where gsize(P) is the size of thevariation range of the gap lengths. Note that in the case of unit-length key-words we have k-len(P) = len(P). This bound is a moderate improvement overthe more general bound for regular expressions by Bille and Thorup [4] for

5 Note that the number of occurrences of a keyword that occurs in r patterns and inl positions in the text is equal to r × l

log gsize(P) = o(√

logw). The second algorithm is based on a different paralleliza-tion of the dynamic programming matrix and has O(dn/we len(P)+n+occ)-timecomplexity. The advantage of this bound is that it does not depend on the num-ber of distinct gap lengths. However, it is not strictly on-line, because it processesthe text w characters at a time and it also depends on len(P) rather than onk-len(P). The proposed algorithms obtain a bound similar to the one of [4], inthe restricted case of fixed-length gaps, while being also practical. For this rea-son, they provide an effective alternative when α or α′ is large. They are alsofast in practice, as shown by experimental evaluation.

The rest of the paper is organized as follows. In Section 2 we recall somepreliminary notions and elementary facts. In Section 3 we discuss the motivationfor our work. In Section 4 and 5 we present the new algorithms for the problem ofmatching gapped patterns. Finally, in Section 6 we present experimental resultsto evaluate the performance of our algorithms.

2 Basic notions and definitions

Let Σ = c1, c2, . . . , cσ denote an alphabet of size σ. Let Σ∗ denote the Kleenestar of Σ, and Σm the set of all possible sequences of length m over Σ. |S| isthe length of string S, S[i], i ≥ 0, denotes its (i+ 1)-th character, and S[i . . . j]denotes its substring between the (i+ 1)-st and the (j + 1)-st characters (inclu-sive). For any two strings S and S′, we say that S′ is a suffix of S (in symbols,S′ w S) if S′ = S[i . . . |S| − 1], for some 0 ≤ i < |S|.

A gapped pattern P is of the form

S1 · j1 · S2 · . . . · j`−1 · S` ,

where Si ∈ Σ∗, |Si| ≥ 1, is the i-th string (keyword) and ji ≥ 0 is the lengthof the gap between keywords Si and Si+1, for i = 1, . . . , `. We can transform Pinto the equivalent pattern ψ(S1) · j1 · ψ(S2) · . . . · jl−1 · ψ(S`), where

ψ(S) =

S[0] · 0 · ψ(S[1 . . . |S| − 1]) if |S| > 1

S[0] otherwise,

such that all the keywords have unit length. We denote by

(p1 · j1 · p2 · . . . · jq−1 · pq)

the ψ-encoding of P , where q =∑`i=1 |Si|.

We say that P occurs in a string T at ending position i if T [i−∑q−1i=k (ji+1)] =

pk , for k = 1, . . . , q. In this case we write P wg Ti. We denote by len(P ) = q andk-len(P ) = ` the number of alphabet symbols and keywords in P , respectively.The gapped pattern Pi = S1 · j1 · S2 · . . . · ji−1 · Si is the prefix of P of lengthi ≤ `. Given a set of gapped patterns P, we denote by len(P) =

∑P∈P len(P )

and k-len(P) =∑P∈P k-len(P ) the total number of symbols and keywords in

the patterns, respectively.

The RAM model is assumed, with words of size w in bits. We use somebitwise operations following the standard notation as in the C language: &, |,∼, for and, or, not and left shift, respectively. The function to computethe position of the most significant non-zero bit of a word x is blog2(x)c.

3 Motivation

Given a DNA sequence and a motif that describes the binding specificities of agiven transcription factor, we study the problem of finding all the binding sitesin the sequence that match the motif. The traditional model used to representtranscription factor motifs is the Position Weight Matrix (PWM). This modelassumes that there is no correlation between positions in the sites, that is, thecontribution of a nucleotide at a given position to the total affinity does notdepend on the other nucleotides which appear in other positions. The problemof matching the locations in DNA sequences at which a given transcription factorbinds to is well studied under the PWM model [12]. Many more advanced modelshave been proposed to overcome the independence assumption of the PWM(see [2] for a discussion on the most important ones). One approach, commonto some models, consists in extending the PWM model by assigning weightsto sets of symbol-position pairs rather than to a single pair only. We focus onthe Feature Motif Model (FMM) [14] since, to our knowledge, it is the mostgeneral one. In this model the TF binding specificities are described with so-called features, i.e., rules that assign a weight to a set of associations betweensymbols and positions. Given a DNA sequence, a set of features and a motiflength m, the matching problem consists in computing the score of each site(substring) of length m in the sequence, where the score of a site is the sum ofthe weights of all the features that occur in the site. Formally, a feature can bedenoted as

(a1, i1), . . . , (aq, iq) → ω

where ω is the affinity contribution of the feature and aj ∈ A,C,G, T is thenucleotide which must occur at position ij , for j = 1, . . . , q and 1 ≤ ij ≤ m. Itis easy to transform these rules into new rules where the left side is a gappedpattern: if i1 < i2 < . . . < iq, we can induce the following gapped pattern rule

(a1 · (i2 − i1 − 1) · . . . · (iq − iq−1 − 1) · aq)→ (iq, ω).

Note that we maintain the last position iq to recover the original feature. Thistransformation has the advantage that the resulting pattern is position indepen-dent. Moreover, after this transformation, different features may share the samegapped pattern. Hence, the matching problem can be decomposed into two com-ponents: the first component identifies the occurrences of the groups of featuresby searching for the corresponding gapped patterns, while the second componentcomputes the score for each candidate site using the information provided by thethe first component. For a motif of length m, the second component can be easilyimplemented by maintaining the score for m site alignments simultaneously with

a circular queue of length m. Each time a group of features with an associatedset of position/weight pairs (i1, ω1), . . . , (ir, ωr) is found at position j in thesequence, the algorithm adds the weight ωk to the score of the alignment thatends at position j +m− ik in the sequence, if j ≥ ik.

4 Online algorithm for matching a set of gapped patterns

In this section we present a practical algorithm to search for a set P of gappedpatterns in a text T of length n. We first devise a simple algorithm based ondynamic programming whose time complexity matches the best known boundand then show how to parallelize it using word-level parallelism.

Let P k be the k-th pattern in P and let Ski and jki be its i-th keyword andgap length, respectively. Let also P kl be the prefix of P k of length l. We define

Di = (k, l) | P kl wg Ti ,

for i = 0, . . . , n−1, 1 ≤ k ≤ |P| and 1 ≤ l ≤ k-len(P k). The algorithm computes,for each position i in T , the set Di of the prefixes of the patterns that occurat i. From the definition of Di it follows that the pattern P k occurs in T atposition i if and only if (k, k-len(P k)) ∈ Di. For example, if T = accgtaaacgand P = cgt · 2 · ac, c · 1 · gt, we have D1 = (2, 1), D4 = (1, 1), (2, 2) andD8 = (1, 2), (2, 1) and there is an occurrence of P 2 and P 1 at positions 4 and8, respectively. Let K be the set of distinct keywords in P and let Ti ⊆ K bethe set of matching keywords in T ending at position i. The sequence Ti, for0 ≤ i < n, is basically a new text with character classes over K. In the case ofthe previous example we have T1 = ac, c, T4 = cgt, gt and T8 = ac, c.

We replace each pattern S1 · j1 · S2 · . . . · j`−1 · S` in P with the patternS1 · j1 · S2 · . . . · j`−1 · S` , with unit-length keywords over the alphabet K, whereSi ∈ K and ji = ji + |Si+1| − 1, for 1 ≤ i < `. For P = cgt · 2 · ac, c · 1 · gt, thenew set is ¯cgt · 3 · ac, c · 2 · gt over the alphabet c, ac, gt, ¯cgt.

The sets Di can be computed using the following lemma:

Lemma 1. Let P and T be a set of gapped patterns and a text of length n,respectively. Then (k, l) ∈ Di, for 1 ≤ k ≤ |P|, 1 ≤ l ≤ k-len(P k) and i =0, . . . , n− 1, if and only if

(l = 1 or (k, l − 1) ∈ Di−1−jkl−1) and Skl ∈ Ti.

The idea is to match the transformed patterns against the text T . Let gmin(P)and gmax(P) denote the minimum and maximum gap length in the patterns,respectively. We also denote with gsize(P) = gmax(P) − gmin(P) + 1 the size ofthe variation range of the gap lengths. To compute the sets Di the algorithmpreprocesses the set of patterns so as to obtain the Aho-Corasick (AC) automa-ton [1] for K. The searching phase of the algorithm consists in iterating, for eachposition i, over all the elements in Ti using the AC automaton and extendingDi based on Lemma 1. With set membership queries that require O(1)-time, the

gq-matcher-preprocess (P, T )

1. (δ, root ,B, fo)← AC(P)2. G← ∅3. m← k-len(P)4. I← 0m,M← 0m

5. for g = 0, . . . , gmax(P) do C(g)← 0m

6. l← 07. for S1 · j1 · S2 · . . . · j`−1 · S` ∈ P do8. I← I | 1 l9. for k = 1, . . . , ` do

10. if k = ` then11. M← M | 1 l12. else g ← jk + |Sk+1| − 113. C(g)← C(g) | 1 l14. G← G ∪ g15. l← l + 1

gq-matcher-search (P, T )

1. q ← root2. for i = 0, . . . , |T | − 1 do3. q ← δ(q, T [i]),H← 0m

4. for g ∈ G do5. H← H | (Di−1−g & C(g))6. Di ← ((H 1) | I) & B(fo(q))7. H← Di & M8. report(H)

report(H)

1. while H 6= 0m do2. k ← blog2(H)c3. report(k)4. H← H & ∼(1 k)

Fig. 1. The gq-matcher algorithm for the string matching problem with gapped pat-terns.

time complexity of the algorithm is O(n log σ + α), where α is the number ofoccurrences of the keywords in the text. The AC automaton requires Θ(len(P))space. Moreover, for the recursion of Lemma 1, the algorithm needs to keepthe sets D computed in the last gmax(P) iterations, and the maximum cardi-nality of each such set is bounded by k-len(P). Hence, the space complexity isO(len(P) + gmax(P)k-len(P)). We omit the details because the role of this algo-rithm is to only provide the basic framework for our main bit-parallel algorithm.

We now describe the version of the algorithm based on word-level parallelism.We first build the AC automaton for K. Let Q denote the set of states of theautomaton, root the initial state and label(q) the string which labels the pathfrom state root to q, for any q ∈ Q. The transition function δ(q, c) is defined asthe unique state q′ such that label(q′) is the longest suffix of label(q) · c. We alsostore for each state q a pointer fo(q) to the state q′ such that label(q′) is thelongest suffix of label(q) that is also a keyword. Let

B(q) = (k, l) | Skl w label(q) ,

be the set of all the occurrences of keywords in the patterns in P that aresuffixes of label(q), for any q ∈ Q. We preprocess B(q) for each state q such thatlabel(q) ∈ K and compute it for any other state using B(fo(q)). The sets B canbe preprocessed during the construction of the AC automaton with no overheadin the time complexity. Let G be the set of all the distinct gap lengths in thepatterns. In addition to the sets B(q), we preprocess also a set C(g), for eachg ∈ G, defined as follows:

C(g) = (k, l) | jkl = g ,

for 1 ≤ k ≤ |P| and 1 ≤ l < k-len(P k). We encode the sets Di, B(q) andC(g) as bit-vectors of k-len(P) bits. The generic element (k, l) is mapped onto

bit∑k−1i=1 k-len(P i) + k-len(P kl−1), where k-len(P k0 ) = 0 for any k. We denote

with Di, B(q) and C(g) the bit-vectors representing the sets Di, B(q) and C(g),respectively. We also compute two additional bit-vectors I and M, such that thebit corresponding to the element (k, 1) in I and (k, k-len(P k)) in M is set to 1,for 1 ≤ k ≤ |P|. We basically mark the first and the last bit of each pattern,respectively. Let Hi be the bit-vector equal to the bitwise or of the bit-vectors

Di−1−g & C(g) , (1)

for each g ∈ G. Then the corresponding set Hi is equal to⋃g∈G(k, l) | (k, l) ∈ Di−1−g ∧ jkl = g .

Let q−1 = root and qi = δ(qi−1, T [i]) be the state of the AC automaton afterreading symbol T [i]. It is not hard to see that B(fo(qi)) encodes the set Ti. Thebit-vector Di can then be computed using the following bitwise operations:

Di ← ((Hi 1) | I) & B(fo(qi))

which correspond to the relation

(k, l) | (l = 1 ∨ (k, l − 1) ∈ Hi) ∧ (k, l) ∈ B(fo(qi)) .

To report all the patterns that match at position i it is enough to iterate over allthe bits set in Di & M. The algorithm, named gq-matcher, is given in Figure 1.

The bit-vector Hi can be constructed in time O(gw-spandk-len(P)/we), 1 ≤gw-span ≤ w, as follows: we compute Equation 1 for each word of the bit-vectorseparately, starting from the least significant one. For a given word with indexj, we have to compute equation 1 only for each g ∈ G such that the j-th wordof C(g) has at least one bit set. Each position in the bit-vector is spanned byexactly one gap, so the number of such g is at most w. Hence, if we maintain,for each word index j, the list Gj of all the distinct gap lengths that span the

corresponding positions, we can compute Hi in time∑dk-len(P)/wej=1 |Gj |, which

yields the advertised bound by replacing |Gj | with gw-span = maxj |Gj |.The time complexity of the searching phase of the algorithm is then

O(n(log σ+gw-spandk-len(P)/we)+occ), while the space complexity is O(len(P)+(gmax(P) + k-len(P))dk-len(P)/we).

Observe that the size of the sets Gj depends also on the ordering of thepatterns (unless k-len(P ) is a multiple of w for each P ∈ P), since more thanone pattern can be packed into the same word. Hence, it can be possibly reducedby finding an ordering that maps onto the same word patterns that share manygap lengths. We now show that the problem of minimizing

∑j |Gj | is hard. In

order to formally define the problem, we introduce the following definition:

Definition 1. Let L1, L2, . . . , Ln be a sequence of lists of integers and let Lc bethe list resulting from their concatenation, say Lc = l1, . . . , l|Lc|. For a given in-

teger b, we define the b-mapping of the lists as the sequence of lists Lb1, Lb2, . . . , L

br

where r = d|Lc|/be, list Lbi contains the elements l(i−1)b+1, l(i−1)b+2, . . . , l(i−1)b+b

of Lc, for 1 ≤ i ≤ b|Lc|/bc, and, if r > b|Lc|/bc, list Lbr contains the elementsl(r−1)b+1, l(r−1)b+2, . . . , l(r−1)b+(|Lc| mod b).

Then, the problem of minimizing∑j |Gj | can be stated as (where in our case

we have n = |P|, b = w, U = G and Lk = jk1 , jk2 , . . . , j

kk-len(Pk), for 1 ≤ k ≤ |P|):

Problem 1 (Permutation with Minimum Distinct Binned Symbols, PMDBS).Given a sequence of n lists of integers L1, L2, . . . , Ln over a universe U , andan integer b, find the permutation π of 1, . . . , n which minimizes the sum, overall lists Lb in the b-mapping of Lπ(1), . . . , Lπ(n), of the number of distinct ele-

ments in Lb.

Theorem 1. Problem PMDBS is NP-hard in the strong sense.

Proof. We reduce from the Hamiltonian Path Problem (see [7] for basic notionsand definitions). In the decision version of the Problem PMDBS, we ask for apermutation π of 1, . . . , n such that the sum, over all lists Lb in the b-mappingof Lπ(1), . . . , Lπ(n), of the number of distinct elements in Lb is at most a givennumber M .

The idea behind our reduction is that the vertices of a graph G will beencoded by lists, where the list of a vertex consists of the indices of the edges in-cident to it, under a suitable encoding. This encoding will be such that, choosingM suitably, all adjacent lists in a permutation of 1, . . . , n satisfying the boundM correspond to vertices adjacent in G.

v1

v2

v3 v4e1

e2

e3

e4

(a) A graph G

L1 = 1 2 5 6 1 2 5 6 1 2 5 6 1 2 5 6L2 = 1 3 7 8 1 3 7 8 1 3 7 8 1 3 7 8L3 = 2 3 4 9 2 3 4 9 2 3 4 9 2 3 4 9L4 = 4 10 11 12 4 10 11 12 4 10 11 12 4 10 11 12

(b) The instance LG

Fig. 2. The reduction of the Hamiltonian Path Problem to Problem 1.

Entering into details, given an input G = (V = v1, . . . , vn, E =e1, . . . , em) to the Hamiltonian Path Problem, we construct the following in-stance LG to Problem 1 (see Fig. 2 for an example).

– We have lists L1, . . . , Ln.– The universe U consists of numbers 1, . . . ,m, which will be used to encode

adjacencies, and numbers m+1, . . . , n2−m, which will be used for padding,to ensure that all lists have the same length.

– For every vertex vi ∈ V , having ei1 , ei2 , . . . , eit as incident edges, we say thatthe basic list of Li is the list i1, i2, . . . , it padded (at the end) with m− t newnumbers from m+ 1, . . . , n2−m, unused by any other list. Finally, list Liconsists of n concatenated copies of its basic list, so that |Li| = nm.

– We set b = (n+ 1)m and M = (2m− 1)(n− 1) +m.

We show that G has a Hamiltonian path if and only if instance LG admits apermutation π of 1, . . . , n such that the sum, over all lists Lb in the b-mappingof Lπ(1), . . . , Lπ(n), of the number of distinct elements in Lb is at most M . Sincethe values of the integers in U are bounded by a polynomial in the size of thelists L1, . . . , Ln, this claim will entail the NP-hardness in the strong sense ofProblem PMDBS.

First, observe that from the choice of b and of the lengths of lists Li, for anypermutation π of 1, . . . , n, the b-mapping Lb1, . . . , L

br of Lπ(1), Lπ(2), . . . , Lπ(n)

has a special form. Indeed, since b = (n + 1)m, and the length of the lists Liis nm, we have that r = d(n2m)/((n + 1)m)e = dn2/(n + 1)e = n. It can beeasily shown by induction that, for all 1 ≤ j ≤ n− 1, list Lbj consists of the last(n− j + 1)m integers in the list Lπ(j) followed by the first jm integers from the

list Lπ(j+1). List Lbn consists of the last m integers of list Lπ(n).For the forward direction, let P = vi1 , . . . , vin be a Hamiltonian path ofG. We

show that the permutation π of 1, . . . , n defined such that π(j) = ij satisfies thebound M . Let Lb1, L

b2, . . . , L

bn be the b-mapping of Lπ(1), Lπ(2), . . . , Lπ(n). From

the above observation, for all 1 ≤ j ≤ n − 1, the number of distinct integers inLbj equals the number of distinct integers in Lπ(j), which is m, plus the numberof distinct integers in Lπ(j+1), which is m, minus the number of integers sharedbetween Lπ(j) and Lπ(j+1). Since vπ(j) and vπ(j+1) are connected by an edge,then the index of this edge appears in both Lπ(j) and Lπ(j+1), thus the number

of distinct elements in Lbj is at most 2m − 1. The claim is now clear, since Lbnconsists of m distinct integers.

For the backward implication, let π be a permutation of 1, . . . , n such thatthe sum, over all lists Lb in the b-mapping of Lπ(1), . . . , Lπ(n), of the num-

ber of distinct elements in Lb is at most M . We claim that the sequenceP = vπ(1), . . . , vπ(n) is a Hamiltonian path in G. Since π is a permutation of1, . . . , n, we only have to show that for all 1 ≤ i ≤ n − 1, there is an edgebetween vπ(i) and vπ(i+1).

Let Lb1, Lb2, . . . , L

bn be the b-mapping of Lπ(1), Lπ(2), . . . , Lπ(n). The fact

that the number of distinct elements in the list Lbn is m entails that the sum,over all 1 ≤ j ≤ n − 1 , of the number of distinct elements in Lbj is at mostM − m = (2m − 1)(n − 1). For all 1 ≤ j ≤ n − 1, vertices vπ(j) and vπ(j+1)have at most one edge incident to both of them (the edge connecting them),therefore, the number of distinct integers in each list Lbj is at least 2m−1. Fromthe above observation, for all 1 ≤ j ≤ n − 1, the number of distinct integers ineach list Lbj is exactly 2m− 1.

Since the number of distinct integers in the list Lπ(j) is m and the numberof distinct integers in the list Lπ(j+1) is m, but the number of distinct integers

in Lbj is at most 2m − 1, we have that lists Lπ(j) and Lπ(j+1) share at least

one integer. We padded the basic lists of Lπ(j) and Lπ(j+1) with integers uniqueto them, thus the only integer shared by them must be the index of the edgeincident to both vπ(j) and vπ(j+1). Such an edge connects vπ(j) and vπ(j+1), andthus P is a path in G. ut

The gq-matcher algorithm is preferable only when gw-span w. However,it can also be used as a filter to speed up the pma algorithm [8]. The idea is tosearch for the set of the prefixes of a fixed small length k of the patterns in Pwith the gq-matcher algorithm and feed all the occurrences to pma in such away that pma starts from the prefixes of length k. In this way we reduce the α′

term in the time complexity of pma to the number of occurrences in the text ofpattern prefixes of length ≥ k, which can be significantly better in this context.

We now show how to improve the time complexity in the worst-case by con-structing an equivalent set of patterns with O(log gsize(P)) distinct gap lengths.W.l.o.g. we assume that gmin(P) and gmax(P) are a power of two (if they are notwe round them to the nearest power of two). Let lsb(n) be the bit position of theleast significant bit set in the binary encoding of n, for n ≥ 1. Observe that, forany positive g ∈ G, the minimum and maximum value for lsb(g) are log gmin(P)and log gmax(P), respectively, and the number of bits set in the binary encodingof g is O(log gsize(P)). Let also

G′ = 0 ∪ 2i | log gmin(P) ≤ i ≤ log gmax(P) .

We augment the alphabet Σ with a wildcard symbol ∗ that matches any symbolof the original alphabet and define by recursion the function

φ(g) =

g if g ∈ G′

(2lsb(g) − 1) · ∗ · φ(g − 2lsb(g)) otherwise

that maps a gap length g onto a concatenation of l gap lengths and l−1 wildcardsymbols, where l is the number of bits set in the binary encoding of g if g ispositive or 1 otherwise. By definition, all the gaps in the resulting sequencebelong to the set

G′′ = G′ ∪ 2i − 1 | log gmin(P) ≤ i < log gmax(P) .

We generate a new set of patterns P ′ from P, by transforming each patternS1 · j1 · S2 · . . . · j`−1 · S` in P into the equivalent pattern

S1 · φ(j1) · S2 · . . . · φ(j`−1) · S` .

Observe that extending the algorithm presented above to handle wildcardsymbols is straightforward. By definition of φ we have that k-len(P ′) <log gsize(P)k-len(P), since the number of gaps that are split is at mostk-len(P) − |P| and the number of wildcard symbols that are added per gapis at most log gsize(P). The number of words needed for a bit-vector is then< dlog gsize(P)k-len(P)/we ≤ log gsize(P)dk-len(P)/we. In this way we obtain anequivalent set of patterns such that the set G of distinct gap lengths is containedin G′′ and so its cardinality is O(log gsize(P)). We thus obtain the following re-sult:

gq-matcher-t (P, T )

1. for s ∈ Σ do V[s]← 02. for c← 0 to dn/we do3. for i← cw to min(n, (c+ 1)w)− 1 do V[T [i]]← V[T [i]] | (1 (i mod w))4. for k ← 1 to |P| do5. Dk,w

1,c ← V[pk1 ]

6. for r ← 2 to len(P k) do Dk,wr,c ← V[pkr ] & M(k, r − 1, c, jr−1 + 1)

7. report(Dk,w

len(Pk),c)

8. for i← cw to min(n, (c+ 1)w)− 1 do V[T [i]]← 0

Fig. 3. gq-matcher-t algorithm.

Theorem 2. Given a set P of gapped patterns and a text T of length n, allthe occurrences in T of the patterns in P can be reported in time O(n(log σ +log2 gsize(P)dk-len(P)/we) + occ).

5 Row-wise parallelization for text given in chunks

We now describe a slightly different solution, based on the ideas of the (δ, α)-matching algorithm described in [5]. This algorithm works for a single patternonly, thus to solve the multi-pattern case we need to run (the search phase of)the algorithm several times. In this algorithm we take a different approach tohandle arbitrary length keywords. In particular, we first transform each patternin P into the corresponding ψ-encoding, so that all the keywords have unit lengthand the number of keywords is len(P). We also parallelize over the text, ratherthan over the set of patterns. The main benefit is that now there is only one gaplength to consider at each step. This also means that instead of preprocessingthe set of patterns, we now must preprocess the text. For the same reason thealgorithm is not strictly on-line anymore, as it processes the text w charactersat a time.

Let us decompose D to be a set of |P| matrices, where Dk corresponds to thek-th pattern. Let Dk

r,c = 1 iff P kr wg Tc and 0 otherwise. Thus the matrix Dk has

len(P k) rows and n columns. The new version encodes just the same informationin a slightly different way. This can be computed using the recurrence

Dkr,c =

1 if pkr = T [c] and (r = 1 or Dk

r−1,c−jkr−1−1= 1)

0 otherwise.

The matrix is easy to compute in O(n len(P k)) time using dynamic program-ming. We now show how it can be computed in O(dn/we len(P k)) time usingword-level parallelism by processing chunks of w columns in O(1) time.

To this end, assume that we have a bit-matrix V , such that Vs,c = 1 iffs = T [c], and 0 otherwise, and where s is any symbol appearing in any of the

patterns. This matrix is easy to compute in O(dn/we len(P) + n) total worstcase time over all the patterns.

The computation of Dk will proceed row-wise, w columns at once, as eachmatrix element takes only one bit of storage and we can store w columns intoa single machine word. We adopt the notation Dk,wr,c = Dk

r,cw...(c+1)w−1, and

analogously for V. First notice that by definition Dk,w1,c = Vwpk1 ,c

. Assume now

that the words Dk,wr−1,c′ for c′ ≤ c have been already computed, and we want to

compute Dk,wr,c . To do so, we need to check if any text character in the current

chunk T [cw . . . (c+ 1)w− 1] matches the pattern character pkr (readily solved asVwpkr ,c

), and if g = jr−1 + 1 text characters back there was a matching pattern

prefix of length r − 1. The corresponding bits signaling these prefix matches,relevant to the current chunk, are distributed in at most two consecutive wordsin a w-bit wide interval in the previous row, namely in words Dk,wr−1,c′−1 and

Dk,wr−1,c′ , where c′ = c−bg/wc. We select the relevant bits and combine them intoa single word using the following function:

M(k, r, c, g) = (Dk,wr,c−bg/wc−1 (w − (g mod w))) | (Dk,wr,c−bg/wc (g mod w)).

The recurrence can now be written as

Dk,wr,c ← Vwpkr ,c & M(k, r − 1, c, jr−1 + 1),

and Dk can be computed in O(dn/we len(P k)) time for any k. To check theoccurrences, we just scan the last row of the matrix and report every positionwhere the bit is 1. To handle all the patterns, we run the search algorithm |P|times, which gives O(dn/we len(P) + n + occ) total time, including the prepro-cessing. The algorithm needs O(len(P) + dgmax(P)/wemaxk(len(P k))) words ofspace, as only the current column of Vw and the last O(dgmax(P)/we) columnsof Dk,w need to be kept in memory at any given time. The algorithm, namedgq-matcher-t, is given in Figure 3. We thus obtain the following result:

Theorem 3. Given a set P of gapped patterns and a text T of length n, givenin chunks of w characters, all the occurrences in T of the patterns in P can bereported in time O(dn/we len(P) + n+ occ).

6 Experimental results

The proposed algorithms have been experimentally validated. In particular, wecompared the new algorithms gq-matcher, gq-matcher-t with the pma al-gorithm of [8]. The gq-matcher and gq-matcher-t have been implementedin the C++ programming language and compiled with the GNU C++ Compiler

4.4, using the options -O3. The source code of the pma algorithm was kindlyprovided by the authors. The test machine was a 2.53 GHz Intel Xeon E5630running Ubuntu 10.04 and running times were measured with the getrusage func-tion. The benchmarks consisted of searching for a set of randomly generated

10

100

1000

10000

100000

10 20 30 40 50 60

tim

e in m

illis

econds

max gap (50 patterns)

GQ-MATCHERGQ-MATCHER-T

PMA

100

1000

10000

100000

10 20 30 40 50 60

tim

e in m

illis

econds

max gap (100 patterns)

GQ-MATCHERGQ-MATCHER-T

PMA

10

100

1000

10000

100000

40 60 80 100 120 140 160 180 200

tim

e in m

illis

econds

number of patterns (max gap 20)

GQ-MATCHERGQ-MATCHER-T

PMA

10

100

1000

10000

100000

40 60 80 100 120 140 160 180 200

tim

e in m

illis

econds

number of patterns (max gap 40)

GQ-MATCHERGQ-MATCHER-T

PMA

10

100

1000

10000

2 3 4 5 6

tim

e in m

illis

econds

keyword length (total number of symbols 256)

GQ-MATCHERGQ-MATCHER-T

PMA

10

100

1000

10000

2 3 4 5 6

tim

e in m

illis

econds

keyword length (50 patterns)

GQ-MATCHERGQ-MATCHER-T

PMA

Fig. 4. Experimental results on a genome sequence of Escherichia coli with randomlygenerated gapped patterns with keywords of unit length. Top row: 6 keywords, varyinggap interval with a set of 50 and 100 patterns; Middle row: 6 keywords, varying numberof patterns with maximum gap 20 and 40; Bottom row: 2 keywords, varying keywordlength.

gapped patterns in the genome sequence of 4, 638, 690 base pairs of Escherichiacoli (σ = 4)6. Figure 4 (top row) shows the running times for searching a setof randomly generated gapped patterns with 6 keywords of unit length with afixed number of patterns equal to 50 and 100, respectively, and such that themaximum gap varies between 5 and 60. Figure 4 (middle row) shows the runningtimes for searching a set of randomly generated gapped patterns with 6 keywordsof unit length with a fixed maximum gap of 20 and 40, respectively, and suchthat the number of patterns varies between 25 and 200. We used a logarithmicscale on the y axis. Note that the number of words used by our algorithm isequal to d6× |P|/we, so it is between 3 and 19 in our experiments since w = 64.The experimental results show that the new algorithms are significantly faster(up to 50 times) than the pma algorithm in this particular scenario.

The gq-matcher-t algorithm is preferable if the text can be processed byreading w symbols at a time. This implies that, in the worst-case, we report anoccurrence of a pattern at position i in the text only after reading the symbolsup to position i+w−1. This condition may not be feasible for some applications.Otherwise, albeit slower, the gq-matcher algorithm is a good choice.

To compare the performance of pma and our algorithms for arbitrary key-word lengths we performed another benchmark. Figure 4 (bottom row) showsthe running times for searching a set of patterns with 2 keywords and a fixedmaximum gap of 20 and such that the keyword length varies between 2 and6. In the benchmark to the left the number of patterns is calculated using theformula 4w/2l, where l is the keyword length, so as to fix the total number ofsymbols, i.e., len(P), to 4w (i.e., 4 words in our algorithm). In the one to theright the number of patterns is fixed to 50, so that len(P) increases as the key-word length grows. The benchmark shows that our algorithms are significantlyfaster than pma up to keyword length 4, while for longer keywords they havesimilar performance.

References

1. Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid tobibliographic search. Commun. ACM, 18(6):333–340, 1975.

2. Yingtao Bi, Hyunsoo Kim, Ravi Gupta, and Ramana V. Davuluri. Tree-basedposition weight matrix approach to model transcription factor binding site profiles.PLoS ONE, 6(9), 2011.

3. Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and David Kofoed Wind. Stringmatching with variable length gaps. Theor. Comput. Sci., 443:25–34, 2012.

4. Philip Bille and Mikkel Thorup. Regular expression matching with multi-stringsand intervals. In Moses Charikar, editor, SODA, pages 1297–1308. SIAM, 2010.

5. Kimmo Fredriksson and Szymon Grabowski. Efficient bit-parallel algorithms for(δ, α)-matching. In Carme Alvarez and Maria J. Serna, editors, WEA, volume 4007of Lecture Notes in Computer Science, pages 170–181. Springer, 2006.

6 http://corpus.canterbury.ac.nz/

6. Kimmo Fredriksson and Szymon Grabowski. Nested counters in bit-parallel stringmatching. In Adrian Horia Dediu, Armand-Mihai Ionescu, and Carlos Martın-Vide, editors, LATA, volume 5457 of Lecture Notes in Computer Science, pages338–349. Springer, 2009.

7. Michael R. Garey and David S. Johnson. Computers and Intractability; A Guideto the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA,1990.

8. Tuukka Haapasalo, Panu Silvasti, Seppo Sippu, and Eljas Soisalon-Soininen. Onlinedictionary matching with variable-length gaps. In Panos M. Pardalos and SteffenRebennack, editors, SEA, volume 6630 of Lecture Notes in Computer Science, pages76–87. Springer, 2011.

9. Michele Morgante, Alberto Policriti, Nicola Vitacolonna, and Andrea Zuccolo.Structured motifs search. Journal of Computational Biology, 12(8):1065–1082,2005.

10. Gonzalo Navarro and Mathieu Raffinot. Fast and simple character classes andbounded gaps pattern matching, with applications to protein searching. Journalof Computational Biology, 10(6):903–923, 2003.

11. Gonzalo Navarro and Mathieu Raffinot. New techniques for regular expressionsearching. Algorithmica, 41(2):89–116, 2004.

12. Cinzia Pizzi and Esko Ukkonen. Fast profile matching algorithms - a survey. Theor.Comput. Sci., 395(2-3):137–157, 2008.

13. M. Sohel Rahman, Costas S. Iliopoulos, Inbok Lee, Manal Mohamed, andWilliam F. Smyth. Finding patterns with variable length gaps or don’t cares. InDanny Z. Chen and D. T. Lee, editors, COCOON, volume 4112 of Lecture Notesin Computer Science, pages 146–155. Springer, 2006.

14. Eilon Sharon, Shai Lubliner, and Eran Segal. A feature-based approach to modelingprotein-dna interactions. PLoS Computational Biology, 4(8), 2008.

15. Seppo Sippu and Eljas Soisalon-Soininen. Online matching of multiple regularpatterns with gaps and character classes. In LATA, pages 523–534, 2013.