Abstraction and trees for biosequences analysis


UNIVERSITÀ CA' FOSCARI DI VENEZIA
Dipartimento di Informatica

Technical Report Series in Computer Science

Rapporto di Ricerca CS-2005-11

Ottobre 2005

N. Cannata, N. Cocco, M. Simeoni

Abstraction and trees for biosequences analysis

Dipartimento di Informatica, Università Ca' Foscari di Venezia

Via Torino 155, 30172 Mestre–Venezia, Italy

Abstraction and trees for biosequences analysis

Nicola Cannata¹, Nicoletta Cocco², and Marta Simeoni²

¹ CRIBI, Università di Padova, [email protected]

² Dipartimento di Informatica, Università di Venezia, Italy, {cocco,simeoni}@dsi.unive.it

Technical Report CS-2005-11 (revised version of CS-2005-7)

Abstract. Pattern discovery is used for determining, in a blind way, subsequences characterizing a given sequence or set of sequences. It is applied in genome and proteome analysis for discovering interesting biosequences, which are usually very short when compared to the length of the analyzed sequence. Abstraction of subsequences, that is grouping similar subsequences and representing them in a compact way as patterns, seems particularly useful in the field of pattern discovery in order to stress similarities among interesting subsequences. In this paper we propose a set of techniques for pattern discovery which makes use of abstraction. We define a data structure, the k-trie, which is an enriched and truncated suffix trie, for collecting and counting subsequences of length at most k. We propose an on-line algorithm for building a k-trie in linear time. We associate the chi-square score to the subsequences represented in the tree in order to estimate their "interest". By analyzing the properties of the score w.r.t. symbol concatenation and string abstraction, we derive a method for collecting the most interesting subsequences in an automatic way. Besides, we abstract a set of subsequences of the same length into a set of rigid patterns. Such an abstraction may be represented by a tree corresponding to the prefix automaton associated to the set of patterns. We propose to use such trees for concisely representing the most interesting subsequences, for searching patterns and for comparing biological sequences.

1 Introduction

1.1 Motivation

The statistical analysis of substring occurrences in biological sequences is used as an instrument to discover bio-molecular signals in the sequences and to hypothesize their functional or structural properties [7]. Both over- and under-represented oligo-sequences, occurring with a significant deviation from the expected frequency (in a model of random symbol distribution or in a more sophisticated data-driven model), may reveal some interesting biological meaning. For example, in [35] the detection of over-represented oligonucleotides is adopted as a simple and fast method to isolate DNA binding sites for transcription factors from families of co-regulated genes. Further applications concern the analysis of other nucleic acid binding sites (e.g. for the ribosome to start the mRNA translation [17]), the identification of sorting signals in protein sequences [14], the discovery and representation of protein domains [21], and the search for background regularities in the DNA (e.g. [23]) or in proteins.

However, the molecular machinery allows some degrees of freedom, since it often permits the presence of one nucleotide (or amino acid, for protein sequences), chosen from a set of possible ones, in a fixed position of the sequence. This fact is reflected in the common pattern representations, for example by adopting the IUPAC alphabet [11] when dealing with DNA, or by including into brackets all the amino acid or nucleotide symbols that could occur in a given position of a sequence. The usage of degenerate symbols and simplified alphabets may allow us to discover hidden properties or regularities otherwise not easily seen in the original sequences [9].

In this paper we propose to combine these two techniques: a blind search for over- and under-represented oligo-sequences in biological sequences, and the abstraction of sequences, that is, the possibility to have a set of alternative symbols in some positions of each sequence.

Blind search methods and analyses are generally applied to unaligned sequence sets in order to detect unknown signals, which can then be further refined and investigated with more specific techniques (e.g. multiple sequence alignment [32], Position Weight Matrices [31], HMMs [13], Sequence Logos [30]) by restricting the analysis to the discovered patterns or regions of interest. Besides, we expect that the comparison of the characteristic patterns we can obtain from sets of sequences related to different biological features (e.g. exon/intron, coding/non-coding, different secondary structures of proteins, protein sequences sorted in different cellular components) could produce very interesting results, allowing us to infer putative discriminating tools to be used in biological sequence classification and annotation. The comparison of characteristic patterns of whole genomes or proteomes could help in classifying organisms and in gaining more insight into biological knowledge (e.g. extremophile genomes).

1.2 Related work

In the literature many approaches explore the problem of discovering unknown patterns in biological sequences and propose some partial solution techniques. Some approaches are closer to our proposal because they are based on suffix trees; other approaches are based on completely different techniques such as, for example, graph cliques (Winnower [28]) and random projections (Projection [8]).

A recent survey on existing approaches, methodologies and tools for pattern discovery is presented by Pavesi et al. in [26]. The focus is on discovering specific patterns, namely patterns describing transcription factor binding sites, and some of the methods they report are based on suffix trees [3, 19, 25].

In particular, Apostolico et al. [3] present a deep investigation on how to annotate the nodes of a suffix tree with their expected values, variances and scores of significance, with respect to the simple probabilistic model in which sequences are produced by a random source emitting symbols from a known alphabet independently and according to a given distribution. The authors show how to perform tree annotation in an incremental and efficient way. They prove that, given a text of length m, the full tree annotation can be obtained in optimal O(m²) worst case and O(m log(m)) expected time and space. This result is achieved by expressing mean, variance and related scores of significance incrementally, thus allowing for their efficient computation. Moreover, in [4] the authors present a deep analysis of the monotonicity of some scores of significance w.r.t. string length. Such a property in fact allows one to bound the number of candidate over- and under-represented strings in a sequence and carry out the relative computations in efficient time and space. The tool VERBUMCULUS [5] is based on the efficient techniques presented in [3].

Other approaches based on suffix trees share the same underlying idea: they take advantage of the inherent compactness and efficiency of suffix trees to represent the input sequences, and propose methodologies and algorithms to find unknown patterns, usually potential transcription factor binding sites, by assuming some knowledge on their shape and structure, i.e., patterns with an upper bound on the number of mismatches, patterns with gaps, structured motifs, etc.

Pavesi et al. [25] propose the Weeder algorithm for finding patterns of unknown length in DNA sequences. Starting from a suffix tree representing the input sequences, Weeder allows finding patterns with mismatches, that is, patterns where only symbol substitutions are permitted. The exact length of the patterns to be found does not need to be given as an input parameter. In order to overcome the combinatorial explosion of an exhaustive enumeration method, the algorithm imposes a restriction on the location of mismatches in the patterns to be found. The Weeder algorithm is implemented by the tool Weeder Web [27].

Crochemore and Sagot [12] review the problems of localization and extraction of motifs in sequences, where motifs are both patterns and structured motifs. A structured motif is composed of at least two patterns separated by a (variable) number of spacers. Each pattern represents a transcription factor binding site. In [19] Marsan and Sagot describe two exact algorithms for extracting structured motifs by using a suffix tree. Both algorithms proceed by visiting the suffix tree and looking for the structured motif. In order to "skip" the spacers in the tree, the first algorithm adopts a jumping technique, while the second one temporarily modifies the tree by putting the patterns one beside the other.

Among the approaches and tools for pattern discovery which are not based on suffix trees, we mention Teiresias [29], Winnower [28] and Projection [8].

Most of the approaches consider so-called rigid patterns, that is, patterns allowing for a set of symbols (or a don't care symbol) in each position, but having a fixed length. Other approaches, such as Apostolico et al. [3-5], consider simple strings, often called deterministic patterns in the literature. Another possibility is to consider flexible patterns, which also allow for gaps of flexible length. Although flexible patterns are clearly more expressive and powerful than deterministic and rigid patterns, they are very difficult to handle, since the space of candidate subsequences soon becomes very large. An interesting classification of patterns in terms of which data regularities they can capture is presented in [24], as well as some applications of pattern discovery in molecular biology.

A different way to specify patterns consists in describing them through a matrix profile (also called a position-specific weight matrix). It is a |Σ| × m matrix, where Σ is the alphabet of the input sequence and m is the length of the unknown pattern. Each cell (i, j) of the matrix contains the probability of finding the i-th symbol of Σ at the j-th position of the pattern. In this case the relevant strings are the ones which best fit the profile. Pattern discovery approaches based on matrix profiles make use of Expectation Maximization algorithms and Gibbs sampling techniques [6, 18, 22].
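As a concrete illustration of such a matrix (a hedged toy sketch with made-up aligned strings, not the output of any of the cited tools), the following Python fragment builds a profile where each cell holds the observed frequency of a symbol at a position:

    # A toy matrix profile over the DNA alphabet: profile[a][j] is the
    # frequency of symbol a at position j of the aligned strings.
    sequences = ["ACGT", "AAGT", "ACGA"]   # illustrative input
    alphabet = "ACGT"
    length = len(sequences[0])
    profile = {a: [sum(s[j] == a for s in sequences) / len(sequences)
                   for j in range(length)]
               for a in alphabet}
    # e.g. profile["A"] == [1.0, 1/3, 0.0, 1/3]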

1.3 Structure of the Paper

The paper is organized as follows. In Section 2 we define a tree data structure, the k-trie, for representing all the substrings, up to a fixed length, contained in a sequence or a set of sequences. Each node of the tree represents a substring with the number of its occurrences and possibly other useful related information. The tree is essentially an enriched and cut suffix trie [15, 20, 33, 36], representing all substrings of the given input sequence up to a fixed length k. Since space complexity could be a problem, the tree can be pruned with respect to some filtering conditions in order to reduce its complexity. In Section 3 we define what a pattern is in our context and give its tree representation. In Section 4 the chi-square score is introduced for evaluating the significance of substrings. We discuss how the score distributes over the k-trie and how it is affected by string concatenation and string union. In Section 5 we propose a technique for computing the set of the most interesting substrings of a given length in the k-trie in an automatic way. In Section 6 we describe a technique for abstracting the set of the most interesting substrings into a set of patterns. In Section 7 we consider an alternative and orthogonal way of using abstraction, namely abstracting the alphabet itself, thus making explicit the intended similarities among some of its symbols. In Section 8 we illustrate how we intend to use k-tries, sets of interesting substrings and their abstractions for analyzing and comparing families of biological sequences, in order to single out their similarities and differences. Some concluding remarks and perspectives for future work follow in Section 9.

2 The k-trie

The substrings we search for in biological sequences are generally very short in comparison to the length of the considered sequences: motifs and signals are at most ten or twenty symbols long, while the sequences under examination can be millions of symbols long, such as a whole chromosome. For this reason, given a biological sequence (or set of sequences), we want to represent only the substrings up to a fixed length. In this section we introduce an efficient tree structure supplying such a representation.

The reader is assumed to be familiar with the terminology and the basic concepts of language theory, such as symbol, string, alphabet and regular expression; see for example [16].

2.1 The tree structure

Let T[1..m] be a text of length m on an alphabet Σ; we want to single out all the substrings of length at most k in T, where 1 ≤ k ≤ m. We represent all such substrings in a tree (a suffix trie [15]) of depth k, called the k-trie of T and denoted by Tk.

Each path p = n0 n1 n2 ... nl (l ≤ k), starting from the root n0 of the tree Tk, represents a substring. In particular, the node nj of level j in p, 1 ≤ j ≤ l, is labeled by a symbol aj of Σ and by a counter cj representing the number of occurrences of the substring a1 a2 ... aj in the text T. Further information can be associated to each node in the tree, as shown in the next sections. Hence each node, except the root n0, represents a substring with its information.

Let N be the set of nodes in Tk. We define the function str_Tk : N → Σ*, which takes a node n in Tk and returns the associated string, obtained by concatenating the symbols labelling the nodes of the path from the root to n. Note that str_Tk is an injective function, since different nodes in Tk correspond to different strings.

There are at most |Σ|^k distinct complete paths (and substrings) in Tk. However, in general Tk is not a complete tree.

Example 1. Let T = AGAGGAC and k = 2. The associated 2-trie is the following:

root
├── (A,3)
│   ├── (C,1)
│   └── (G,2)
├── (G,3)
│   ├── (A,2)
│   └── (G,1)
└── (C,1)

It contains four different substrings of length 2, {AC, AG, GA, GG}, and three different substrings of length 1, {A, G, C}. It also contains their numbers of occurrences in T: for example, AG occurs two times, GG occurs one time, and GA occurs two times.

A k-trie has the unique prefix property, namely for each prefix of a string in T of length less than or equal to k, there exists a unique node in Tk and a unique path from the root to that node representing that prefix.

We need to be able to traverse a k-trie in different ways:

– along the paths from the root;
– along all the sons of a node nj of level j, 1 ≤ j ≤ k − 1, i.e., along the alternative symbols in position j + 1 which may follow the substring str_Tk(nj);
– along all the nodes of level j, 1 ≤ j ≤ k, in all the substrings in Tk.

The implementation of the k-trie has to take care of these requirements. Hence, the structure of each node in the tree can be abstractly described by the following type.

type node = record
  symbol  : char;
  counter : int;
  child   : pointer(node);  {pointer to the first child}
  sibling : pointer(node);  {pointer to the next sibling}
  level   : pointer(node);  {pointer to the next node of the same level}
end;

Additional fields can be added to associate further information to each node. To traverse the tree by levels we also need an array of k pointers to the levels of the tree:

  lev = array [1..k] of pointer(node);

which gives access to the list of nodes at each level. All such pointers are initialized to nil.

2.2 Building the k-trie

In order to efficiently build the k-trie, we use an array of k pointers to nodes in the tree,

  ptab = array [1..k] of pointer(node);

which is useful to count the substring occurrences while constructing the tree, but can be deallocated afterwards.

A simple on-line algorithm for building the k-trie associated to the input text T is given below. It linearly scans the text from left to right and, for each symbol T(i), it inserts T(i) in the tree by counting an occurrence in each position (level) in [1, k] for each substring of T ending in position i: one occurrence of T(i) in position (level) 1 for T[i..i], one in position 2 for T[i−1..i], ..., one in position j for T[i−j+1..i], ..., one in position k for T[i−k+1..i].

In order to count the occurrences of T(i) in each position of the substrings of length at most k with a linear scan of T, we save in ptab the pointers to the nodes of each level where the occurrences of T(i−1) have been counted in the previous step.

The procedure Build_tree makes use of lev[1] to access the first level of the tree (i.e., lev[1] can be thought of as the root) and of the procedure Insert(t, j, p1, p2). The latter procedure looks for a node with symbol t in the list pointed by p1 at level j. If t is already in the list, Insert increments the associated counter; otherwise it adds a new node with symbol t to the head of the lists pointed by p1 and by lev[j]. If p1 = nil (lev[j] = nil), it initializes the list by setting p1 (lev[j]) to point to the newly created node. If a new node is created, then it initializes its fields by setting symbol to t, counter to 1, child to nil, sibling to p1 and level to lev[j], that is, the level of the tree to which the node is added. In either case it returns p2, which is a pointer to the node found or created for t.

Build_tree(T, k);
  Insert(T(1), 1, lev[1], ptab(1));  { initialize Tk and ptab }
  for i := 2 to k − 1 do
  begin
    for j := i downto 2 do  { j is the level in which T(i) is inserted }
      Insert(T(i), j, ptab(j − 1).child, ptab(j));
    Insert(T(i), 1, lev[1], ptab(1));
  end;
  for i := k to m do
  begin  { insert and count all the remaining symbols in T }
    for j := k downto 2 do  { j is the level in which T(i) is inserted }
      Insert(T(i), j, ptab(j − 1).child, ptab(j));
    Insert(T(i), 1, lev[1], ptab(1));
  end;
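For readers who prefer executable code, the following Python sketch builds an equivalent k-trie by sliding a window over the text instead of maintaining the ptab, sibling and level pointers of the procedure above; the nested-dict representation (node = [counter, children]) and all names are ours, a hedged illustration rather than the paper's exact structure.

    def build_k_trie(text, k):
        """Count every substring of text of length <= k in a nested-dict trie."""
        root = [0, {}]                    # node = [counter, children]; root counter unused
        for s in range(len(text)):
            node = root
            for symbol in text[s:s + k]:  # the at most k symbols starting at s
                node = node[1].setdefault(symbol, [0, {}])
                node[0] += 1              # one more occurrence of this substring
        return root

    trie = build_k_trie("AGAGGAC", 2)     # the 2-trie of Example 1
    assert trie[1]["A"][0] == 3           # A occurs 3 times
    assert trie[1]["A"][1]["G"][0] == 2   # AG occurs 2 times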

Let us consider the complexity of the procedure Build_tree. Regarding time complexity, we may observe that, in the worst case, each Insert(t, j, p1, p2) has to scan a list pointed by p1 of length |Σ|. Hence in the worst case the tree can be built in time O(k · m · |Σ|).
Regarding space complexity, the worst case is when the k-trie contains all the possible strings on Σ of length at most k. This can be the case when dealing with a large text and a small k, e.g. with long DNA sequences. Since each node stores five items requiring constant space c, we have c · ∑_{i=1}^{k} |Σ|^i = c(|Σ|^{k+1} − |Σ|)/(|Σ| − 1) = O(|Σ|^k). Hence the tree requires O(|Σ|^k) memory cells. This indicates that space can be a problem for large alphabets, even for short substrings. Note that in the definition of the k-trie we consider the minimal information associated to each node. In describing how to use it for analyzing the text T, we actually associate other information to its nodes, such as substring expected frequency and score.

The k-trie can also be used for representing substrings belonging to more than one text, analogously to what is done by a generalized suffix tree [15]. In fact, let us consider l texts T1, T2, ..., Tl on the alphabet Σ; then we can apply the procedure for building the k-trie to each text Ti separately and cumulate the substrings with their counters in one single tree. The worst-case time complexity for building the k-trie is then O(k · m · |Σ|), where m = ∑_{i=1}^{l} |Ti|. In the following, for simplicity's sake, we consider the case of analyzing one single text, even if practical applications generally consider a set of texts.

2.3 Comparison with similar structures

A k-trie corresponds to the first k levels of a suffix trie for T, with further information and links associated to each node.

As an alternative to the described construction, we could first build a compact suffix tree with a standard construction, such as Ukkonen's algorithm [33, 15], and then modify it to obtain an annotated and cut suffix tree. In fact, a standard suffix tree needs to be first annotated with the information concerning the number of occurrences of each substring, and then cut to represent strings of length at most k. This can be done by linearly traversing the suffix tree. The resulting tree is a data structure similar to the k-trie as far as path labelling is concerned. More precisely, for each path of the k-trie there is a corresponding (possibly shorter) path in the suffix tree associated to the same string.

Another possibility, more efficient w.r.t. space, is to directly build a compact suffix tree of depth at most k, using the on-line algorithm recently proposed by Allali and Sagot in [1]. This construction also depends linearly on k, |Σ| and m.

Regarding space, the k-trie worsens the complexity by a factor k in the worst case w.r.t. the corresponding annotated and cut suffix tree. On the other hand, when dealing with long DNA sequences, that is with a large text T, a small k and a small alphabet Σ, we expect to obtain a k-trie which is close to a complete tree, with essentially the same structure (i.e., number of nodes) as the corresponding annotated and cut suffix tree.

There are two further differences between a suffix tree and a k-trie. A k-trie links the nodes by level, which is useful for determining the set of the most interesting strings, as shown in Section 5. Moreover, a k-trie does not report the position of each substring in the original text, since in our intended applications this is not essential information.

2.4 Properties of counters

In this section we point out some properties of the occurrence counters in a k-trie. Such properties may suggest heuristics to reduce the size of the k-trie Tk, while keeping the substrings in Tk which have an "interesting" frequency.

For the sake of simplicity, let us denote with lev[j] all the nodes at level j, and let son_of be a function producing all the sons of a given node. For a node n we denote with c(n) the value of the associated counter. Then the following properties hold in a k-trie.

1. The sum of the counters associated to all the nodes at level j is almost equal to the length of the text:

     ∑_{n ∈ lev[j]} c(n) = m − j + 1, for any j ∈ [1, k].   (1)

   That is, for level 1 the sum is exactly m, for level 2 it is m − 1, ..., and for level k it is m − k + 1. This is due to the fact that k ≤ m and that the first k − 1 symbols in T are not counted at all the levels of the tree. When k is much smaller than m, the interval [m − k, m] is also very small, hence we can say that the sum of the counters associated to all the nodes at any level j approximates the length of the text.
   If the k-trie represents l texts T1, T2, ..., Tl, the sum of counters becomes:

     ∑_{n ∈ lev[j]} c(n) = m − (j − 1)l, for any j ∈ [1, k], where m = ∑_{i=1}^{l} |Ti|.

2. The counter values are non-increasing along a path. In particular, if a node nj can be reached from the root through a path n0 n1 n2 ... nj, then c(nh) ≥ c(nl), for all 1 ≤ h < l ≤ j. This is due to the fact that the nodes along the path n0 n1 ... nj correspond to prefixes of the substring str_Tk(nj) associated to the path itself, and clearly shorter prefixes of a substring are more frequent in T than longer ones.

3. The counter value of a node n at level j is greater than or equal to the sum of the counters associated to its sons: c(n) ≥ ∑_{nh ∈ son_of(n)} c(nh), for any j ∈ [1, k − 1]. In fact, the sons of n represent all the possible substrings in T which are concatenations of the substring s = str_Tk(n) and a symbol a ∈ Σ. Clearly, the occurrences of any string s·a in the text T cannot exceed the occurrences of the string s itself.

2.5 Pruning the k-trie

As already pointed out, a k-trie Tk can be impractically costly in terms of space for large Σ and k. Therefore, it could be useful to find ways of pruning the tree, while maintaining the information we are interested in, that is, the number of occurrences of the "most interesting" substrings. To this end, we can use the properties of counters presented in Section 2.4.

We could apply a pruning strategy by determining a threshold with respect to which a frequency is considered "interesting", and by pruning the tree depending on a local or global pruning condition. By local we mean a pruning condition depending only on the currently visited node. In contrast, by global we mean a pruning condition depending on the path of the visit or on some (statistical) relation among substrings.

When pruning the tree to reduce its complexity, we do not want to produce isolated nodes. Hence, a node is always pruned together with all its descendants. In this way we obtain a pruned k-trie which is smaller than the original one but has the same structural properties. A simple top-down pruning procedure is abstractly described below. It is applied to all the sons of the root of Tk.

Prune_tree(n);
{ n is the node from where the top-down pruning starts }
  if Test(n) then Prune(n)
  { if the test is positive, the node is pruned with all its descendants }
  else for all ni ∈ son_of(n) do Prune_tree(ni);
  { son_of applied to a leaf produces an empty set of sons }
end;

In case of a local pruning condition, the function Test depends only on some input parameters (such as a threshold and a tolerance) and on the counter of the current node n. For a global condition, Test is more complex and it could require keeping track of previous evaluations done during the pruning of the tree. Clearly, this has to be tuned with respect to the particular application: the pruning strategy is actually a very critical step in the analysis of a text T. In [10] we give some examples of local and global pruning techniques.

Since the node n is pruned with all its descendants, by pruning n we can eliminate many substrings, and in particular some "interesting" ones. Special care has to be taken to prune safely, i.e., to avoid the indirect pruning of "interesting" substrings. The simplest pruning heuristic uses a fixed threshold for the whole of Tk. In Test, c(n) is compared with such a threshold. Property 2 of counters guarantees that this local pruning condition is safe.

A local pruning strategy has a time complexity linear in the size of the tree in the worst case. For a global pruning strategy the complexity should be analyzed case by case.
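A minimal sketch of the fixed-threshold local pruning, assuming the [counter, children] node representation of the earlier Python sketch; since counters never increase along a path (Property 2), removing a node whose counter is below the threshold safely removes all its descendants as well:

    def prune(node, threshold):
        """Remove every child whose counter is below threshold, with its subtree."""
        children = node[1]
        for symbol in list(children):
            if children[symbol][0] < threshold:
                del children[symbol]      # the whole subtree disappears
            else:
                prune(children[symbol], threshold)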

3 Patterns and their tree representation

In this section we specify the kind of patterns we consider in our pattern discovery framework and how they can be represented. Clearly a string is a pattern, but symbol variations are typical of biological sequences, and moreover the available knowledge is not precise. As a consequence, we wish to deal with something more vague, or more abstract, than a string, and to allow for some variants of a symbol in a specific position. Hence, a pattern in our context is an abstraction of a set of strings, namely a particular kind of regular expression corresponding to a finite set of strings of fixed length.

Definition 2. A pattern p on an alphabet Σ is a finite concatenation of non-empty subsets of Σ, p = S1 S2 .. Sh, where Sj ⊆ Σ, Sj ≠ ∅, j ∈ [1, h].
Each non-empty subset Sj is called an abstract symbol in Σ. The length of the pattern is the length of the concatenation.

For simplicity we denote sets of symbols in Σ with square brackets and without commas, that is, [ACT] stands for {A, C, T}. Note that [ACT] can also be written as the regular expression A + C + T. Moreover, we identify a pattern with its corresponding regular expression.

Example 3. Consider Σ = {A, T, C, G} and p = [AT][AC][CG]. Then p is a pattern of length 3 corresponding to the regular expression (A + T)(A + C)(C + G). The strings abstracted by p are the strings in the regular set {AAC, AAG, ACC, ACG, TAC, TAG, TCC, TCG}.
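A pattern thus denotes the Cartesian product of its abstract symbols. The following hedged sketch (with an illustrative list-of-strings encoding of patterns) expands the pattern of Example 3:

    from itertools import product

    pattern = ["AT", "AC", "CG"]          # the pattern [AT][AC][CG]
    strings = {"".join(t) for t in product(*pattern)}
    assert strings == {"AAC", "AAG", "ACC", "ACG",
                       "TAC", "TAG", "TCC", "TCG"}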

Abstract symbols can be partially ordered in a natural way by set inclusion.

Definition 4. Let S and S′ be two abstract symbols in Σ. S is strictly (respectively, weakly) more abstract than S′, written S > S′ (S ≥ S′), iff S ⊃ S′ (S ⊇ S′). Equivalently, we say that S′ is strictly (respectively, weakly) more precise than S.

In Example 3, [A] < [AT], [ACG] > [AC], and [ACGT] is strictly more abstract than any other abstract symbol in Σ. In fact, the most abstract symbol is the set Σ itself. It corresponds to what is often called in the literature the "don't care symbol" or "dot".

The partial order on abstract symbols induces a partial order on patterns, which is defined as follows.

Definition 5. Let p = S1 S2 .. Sh and p′ = S′1 S′2 .. S′h be two patterns of length h.
We say that p′ is weakly more abstract than p iff for all j ∈ [1, h], S′j ≥ Sj. Equivalently, we say that p is weakly more precise than p′.
We say that p′ is strictly more abstract than p iff for all j ∈ [1, h], S′j ≥ Sj, and for at least one i ∈ [1, h], S′i > Si. Equivalently, we say that p is strictly more precise than p′.

Besides the linear representation of a pattern p, we can give a tree representation, which corresponds to the prefix automaton accepting the strings in p [2] (shortly, the prefix tree for p). In such a tree, a node is associated to each prefix of a string in p and a leaf is associated to each string in p. The prefix automaton is deterministic, hence the corresponding tree has the unique prefix property: for each prefix of a string in p there is a unique node n and a unique path from the root to n representing that prefix. Therefore both the text T and a pattern p are represented by trees having the unique prefix property.

Example 6. Let Σ be {A, T, C, G} and let p be the pattern T[CG]T[AT], corresponding to the set of strings {TCTA, TGTA, TCTT, TGTT}. The following tree is the prefix tree for p.

root
└── T
    ├── C
    │   └── T
    │       ├── A
    │       └── T
    └── G
        └── T
            ├── A
            └── T

Note that in a prefix tree for a pattern of length h all the leaves are at the same depth h. Moreover note that, given two patterns p and p′ of length h, p′ is weakly (strictly) more abstract than p iff the prefix tree for p is (properly) contained in the prefix tree for p′.

Also a set of patterns (or a set of strings) can be represented by a prefix tree corresponding to the prefix automaton accepting the strings in the set.

The structure and construction of the prefix tree are similar to those given in Section 2 for the k-trie. For this reason we omit the detailed description of the corresponding algorithm.
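Since the detailed algorithm is omitted, here is one possible construction as a hedged Python sketch, reusing the [counter, children] node shape of the k-trie sketch (the counters are simply left unused); nodes are shared among strings, so each prefix gets exactly one node:

    from itertools import product

    def prefix_tree(pattern):
        """Build the prefix tree of a pattern given as a list of abstract symbols."""
        root = [0, {}]
        for string in product(*pattern):
            node = root
            for symbol in string:         # one node per prefix, shared among strings
                node = node[1].setdefault(symbol, [0, {}])
        return root

    tree = prefix_tree(["T", "CG", "T", "AT"])   # the pattern of Example 6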

Besides representing an abstract string, a pattern captures structural similarities among strings. In Section 6 we show how a set of relevant strings of a k-trie can be represented as a set of patterns in order to highlight their similarities. In Section 8 we discuss how the prefix tree representation of a pattern (or a set of patterns) can be used for comparing biological sequences and for determining occurrences of patterns in biological sequences.

4 Scoring the k-trie with respect to string expected frequencies

If we assume or know the expected frequency of any substring, we can associate this information to the corresponding node in the k-trie and define the "interest" of each node (substring) in terms of a relevance score which compares its observed frequency to its expected frequency.

In the following we use the terms expected frequency and observed frequency of a string in T, but we actually mean expected occurrences and observed occurrences of the string in T.

In the literature many different ways to associate a relevance measure to strings have been proposed; see [26] for a recent survey. The simplest choices are obtained by assuming independence of the symbols in each substring and by assuming that the probability associated to each symbol in Σ does not vary in T. In this case, the probability of the substring s = s1 s2 ... sh is the product of the probabilities of its symbols, that is pr(s) = ∏_{i=1}^{h} pr(si), and the expected frequency is

  Exp(s) = (m − h + 1) ∏_{i=1}^{h} pr(si).

When we consider input texts T1, ..., Tl of size m1, ..., ml, respectively, the expected frequency of s generalizes to

  Exp(s) = (m1 − h + 1) ∏_{i=1}^{h} pr(si) + ... + (ml − h + 1) ∏_{i=1}^{h} pr(si) = ∑_{j=1}^{l} (mj − h + 1) ∏_{i=1}^{h} pr(si).

Note that the computation of Exp does not take the overlapping of substrings into account. The relevance of a substring can be measured by various scores. The simplest ones are the following:

z1(n) = c(n) − Exp(n),   z2(n) = c(n)/Exp(n),   z3(n) = (c(n) − Exp(n))²/Exp(n),

where n is the node corresponding to a substring s, that is str_Tk(n) = s, c(n) is the observed frequency of s, i.e., the number of its occurrences in T, and Exp(n) is the expected frequency of s. The score z3 is the well-known chi-square score.

More generally, any relevance score for a substring s associated to a node n uses a relation R for comparing the observed frequency c(n) of s with its expected frequency Exp(n), which depends on the assumptions or knowledge on T. For example, if we drop the simplifying assumption of independence of the symbols in s, Exp(n) could be estimated by analyzing a sample set of substrings (or texts) with an HMM.

The choice of the relevance score depends on the concrete application: it should be the score most useful for identifying the substrings which are "interesting" for the intended purposes.

Given a k-trie annotated in each node with a relevance score, we can analyze it with respect to such a score. In general only a few nodes in the tree have very high (low) scores; these correspond to the relevant substrings.

4.1 Frequency behaviour with respect to symbol concatenation

It is interesting to characterize how the observed and expected frequencies of substrings vary depending on their length. In Section 2.4 we have already described the properties of the observed frequency c(n) w.r.t. symbol concatenation. In this section we state how symbol concatenation affects the expected frequency of substrings.

Let us consider two strings str_Tk(n) = s1 s2 .. s_{h−1}, associated to the node n at level h − 1 in the k-trie, and str_Tk(n′) = s1 s2 .. s_{h−1} s_h, associated to the node n′, son of n. Let |T| = m. Then

  Exp(n) = (m − h + 2) ∏_{i=1}^{h−1} pr(si)   and   Exp(n′) = (m − h + 1) ∏_{i=1}^{h} pr(si).

Hence the expected frequency of a string can be computed incrementally along the paths of the k-trie by concatenating one symbol at a time. In fact

  Exp(n′) = Exp(n) · pr(s_h) · (m − h + 1)/(m − h + 2).

Therefore, with our hypotheses, a traversal of the k-trie is sufficient for annotating it with the expected frequency of each substring, namely this information can be added to each node in linear time w.r.t. the number of nodes in the tree. The previous scores depend only on the observed and the expected frequency of each node, and they can be evaluated in constant time. Hence also the score annotation can be associated to the nodes of the k-trie in the same traversal, that is, in linear time w.r.t. the number of nodes in the tree.

Moreover, since 0 ≤ pr(s_h) · (m − h + 1)/(m − h + 2) ≤ 1, with our hypotheses the expected frequency cannot increase along the paths of the k-trie. The same holds for the observed frequency c(n), as noted in Section 2.4.

When we consider input texts T1, ..., Tl of size m1, ..., ml, respectively, we obtain:

  Exp(n) = ∑_{j=1}^{l} (mj − h + 2) ∏_{i=1}^{h−1} pr(si)   and   Exp(n′) = ∑_{j=1}^{l} (mj − h + 1) ∏_{i=1}^{h} pr(si),

that is

  Exp(n′) = Exp(n) · pr(s_h) · (∑_{j=1}^{l} (mj − h + 1)) / (∑_{j=1}^{l} (mj − h + 2)).

Since 0 ≤ pr(s_h) · (∑_{j=1}^{l} (mj − h + 1)) / (∑_{j=1}^{l} (mj − h + 2)) ≤ 1, also in this case the expected frequency cannot increase along the paths of the k-trie.
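The incremental relation is easy to check numerically; a small hedged sketch under the equiprobable-symbols assumption of the earlier examples:

    from math import isclose, prod

    def expected(s, m, pr):
        return (m - len(s) + 1) * prod(pr[c] for c in s)

    pr, m = {c: 0.25 for c in "ACGT"}, 7     # T = AGAGGAC
    for s, ext in [("A", "AG"), ("AG", "AGA")]:
        h = len(ext)
        step = expected(s, m, pr) * pr[ext[-1]] * (m - h + 1) / (m - h + 2)
        assert isclose(step, expected(ext, m, pr))   # Exp(n') from Exp(n)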

Definition 7. Consider an alphabet Σ and a binary operation on strings op : Σ* × Σ* → Σ*. Let f be a function mapping strings to reals, f : Σ* → R. We say that:

– f is monotone non-increasing w.r.t. op iff f(s1) ≥ f(s1 op s2);
– f is monotone non-decreasing w.r.t. op iff f(s1) ≤ f(s1 op s2);

for all strings s1 and s2 in Σ*.

By the above statements we conclude that both the expected and the observed frequencies are monotone non-increasing functions w.r.t. symbol concatenation.

In the following, we analyze how the chi-square score varies with respect to both string concatenation and string union, which is the basis for string abstraction. A similar analysis is given for z1 and z2 in [10]. We denote with D(n) the difference between the observed and the expected frequency of the substring corresponding to node n in Tk, that is, D(n) = c(n) − Exp(n). We say that the string is over-represented in T (resp. under-represented in T) if D(n) > 0 (resp. D(n) < 0).

4.2 Chi-square score behaviour with respect to symbol concatenation

Consider again the strings str_Tk(n) and str_Tk(n′) and their corresponding nodes n, n′ in the k-trie, with n′ son of n. We use the following notation:

– Δc_n denotes the decrease of the observed frequencies of the two strings, that is, Δc_n = c(n) − c(n′);
– ΔE_n denotes the decrease of the expected frequencies of the two strings, that is, ΔE_n = Exp(n) − Exp(n′) = Exp(n)(1 − pr(s_h)·δ_h), where δ_h = (m − h + 1)/(m − h + 2) for one text T, and δ_h = (∑_{j=1}^{l} (mj − h + 1)) / (∑_{j=1}^{l} (mj − h + 2)) for input texts T1, ..., Tl of size m1, ..., ml, respectively.

Note that both Δc_n ≥ 0 and ΔE_n ≥ 0, because of the monotonicity w.r.t. symbol concatenation of the expected and observed frequencies. Moreover, ΔE_n = 0 only if Exp(n) = 0, and if the expected frequency of a string is zero then, by monotonicity w.r.t. symbol concatenation, the same holds for all the strings obtained from it by concatenating further symbols. For simplicity we exclude the limit case ΔE_n = 0.
The chi-square score always has a positive value, the greater the more over-represented or under-represented is the string to which it is applied.
We have z3(n) = D(n)²/Exp(n) and z3(n′) = D(n′)²/Exp(n′). Since Exp(n) ≥ Exp(n′), we have a sufficient condition for score increase, which is D(n)² ≤ D(n′)². This condition can also be expressed as |c(n) − Exp(n)| ≤ |c(n′) − Exp(n′)|. In the other case, D(n)² > D(n′)², the score may increase or decrease.

Hence the chi-square score is not monotone w.r.t. symbol concatenation, that is, along the paths in the k-trie. This has two important consequences:

1. in general only sparse nodes will have high (low) scores. Hence, by pruning the k-trie w.r.t. the chi-square score, namely by eliminating uninteresting nodes, one would lose the tree structure and produce isolated nodes;
2. string relevance varies along the paths in a non-monotonic way, namely a high score of a node does not imply anything about the scores of its descendants.

Apostolico et al. present in [3] a deep investigation on how to annotate the nodes of a suffix tree efficiently and incrementally with their expected values, variances and scores of significance, under the same probabilistic model we adopt. Moreover, in [4] the authors present a deep analysis of the monotonicity of some scores of significance w.r.t. string length. Since the chi-square score is one of the scores they consider, we could use such efficient annotation methods and monotonicity results in our framework.

4.3 Chi-square score behaviour with respect to string union

In this section we study how the chi-square score is affected by string abstraction. We recall that a pattern is just a particular set of strings, and that the process of abstracting a string consists in applying a union operation with other strings satisfying some shape conditions. Hence, we analyze how the score is affected by adding a string to a set of strings, that is, by the string union operation.

We need to extend appropriately the notion of monotonicity.

Definition 8. Consider an alphabet Σ and a binary operation op : P(Σ*) × Σ* → P(Σ*). Let f be a function mapping sets of strings to reals, f : P(Σ*) → R. We say that:

– f is monotone non-increasing w.r.t. op iff f(S) ≥ f(S op r);
– f is monotone non-decreasing w.r.t. op iff f(S) ≤ f(S op r);

for all sets of strings S and all strings r on Σ, with r ∉ S.

We now extend the notions of probability, observed frequency and expected frequency from a single string to a set of strings of the same length. Two distinct strings may be considered as independent events. Let us consider the set Σ^h of all the strings of length h over Σ. It is easy to show that the probabilities of the strings in Σ^h sum up to one.

Definition 9. Let Tk be a k-trie and let S = {r1, ..., rq} be a set of strings of length h in Tk. We define the probability and the observed and expected frequencies of the set S as follows:

– pr(S) = ∑_{i=1}^{q} pr(ri),
– c(S) = ∑_{i=1}^{q} c(ri),
– Exp(S) = ∑_{i=1}^{q} Exp(ri).

From the previous definition it follows that both the expected and the observed frequencies are compositional and monotone non-decreasing w.r.t. string union. In analogy with the case of a single string, we define the abbreviation D(S) = c(S) − Exp(S) and, due to the previous definitions, we have D(S) = ∑_{i=1}^{q} D(ri).

The chi-square score can also be naturally extended to sets of strings:

  z3(S) = (c(S) − Exp(S))²/Exp(S).

In order to analyze how the score behaves w.r.t. string union, let us consider, in a k-trie Tk, a set S of strings having the same length h, and a further string r of length h, associated to a node n at lev[h], that is, r = str_Tk(n). Consider now the set S′ = S ∪ {r}, with r ∉ S. We have:

  z3(S′) = ((c(S) + c(n)) − (Exp(S) + Exp(n)))² / (Exp(S) + Exp(n)).

Let us compare z3(S) to z3(S′). We multiply both expressions by Exp(S)(Exp(S) + Exp(n)); we simplify common subexpressions and, after dividing both expressions by Exp(n)Exp(S), we can compare

  z3(S)   to   z3(r) + 2·D(S)·D(n)/Exp(n).

Therefore we have:

– z3(S′) is equal to z3(S) iff z3(S) = z3(r) + 2·D(S)·D(n)/Exp(n);
– z3(S′) decreases w.r.t. z3(S) iff z3(S) > z3(r) + 2·D(S)·D(n)/Exp(n);
– z3(S′) increases w.r.t. z3(S) iff z3(S) < z3(r) + 2·D(S)·D(n)/Exp(n).

From the above analysis it follows that the chi-square score is not monotone w.r.t. string union. In fact, by adding a new string to a set of strings (or a pattern), the score of the resulting set (pattern) can either increase or decrease with respect to that of the initial set (pattern). This has important consequences for computing a set containing the most relevant substrings in an incremental and automatic way, as shown in the next section.
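The criterion is mechanical to apply. The following hedged sketch, with made-up (observed, expected) pairs rather than real data, checks that the score of S ∪ {r} grows exactly when z3(S) ≤ z3(r) + 2·D(S)·D(r)/Exp(r):

    def z3(c, e):
        return (c - e) ** 2 / e

    S = [(5, 3.0), (4, 2.5)]              # (observed, expected) per string
    r = (6, 2.0)
    cS, eS = sum(c for c, _ in S), sum(e for _, e in S)
    rhs = z3(*r) + 2 * (cS - eS) * (r[0] - r[1]) / r[1]
    grows = z3(cS + r[0], eS + r[1]) >= z3(cS, eS)
    assert grows == (z3(cS, eS) <= rhs)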

5 Determining the most relevant strings

In this section we discuss how to compute the subset of the most relevant strings from a set of potentially interesting strings in a text T, that is, from strings which are either over- or under-represented in T. Since the chi-square score z3 does not distinguish over-represented strings from under-represented ones, we deal with the two cases separately.

Let us consider the sets Over_h and Under_h of over- and under-represented strings of length h in T, respectively. They can be obtained by examining D(n) associated to each node n at lev[h] in Tk, and by selecting just the over- and under-represented strings. Note that, in principle, these sets could contain O(|Σ|^h) strings.

From Over_h we want to compute the set MOS(T, h) of the Most Over-represented Strings of length h in T. In the same way, from Under_h we want to determine the set MUS(T, h) of the Most Under-represented Strings of length h in T. Whenever T and h are clear from the context, we denote the two sets simply with MOS and MUS.

The simplest way to determine the set MOS (resp. MUS) out of Over_h (resp. Under_h) consists in filtering such a set of strings w.r.t. a threshold of significance on the score values. This can be done by scanning the set and by keeping only the strings with score above the threshold, with a complexity linear in the size of Over_h (resp. Under_h). The resulting set MOS (resp. MUS) clearly depends on the chosen threshold.

Another way to determine the set MOS (resp. MUS) consists in building such a set incrementally with the aim of maximizing its total score. As shown in the previous section, the chi-square score is not monotone w.r.t. string union, therefore only strings that do not decrease the current score of the set can be added. In the resulting set each string is not only interesting per se, but also with respect to the total score of the set itself. In the rest of the section we discuss this alternative construction of MOS (resp. MUS).

In order to get a better understanding of the behaviour of the chi-square score w.r.t. string union, let us consider the case of a set of strings S and a further string r, associated to the node n of the k-trie. For simplicity we identify a string with its node in the tree, namely r with n. Let S′ = S ∪ {n}. We already know that z3(S′) ≥ z3(S) if and only if z3(S) ≤ z3(n) + 2·D(S)·D(n)/Exp(n). By analyzing the 2·D(S)·D(n)/Exp(n) component, we can distinguish three different cases:

1. 2·D(S)·D(n)/Exp(n) = 0. This happens if either D(S) = 0 or D(n) = 0, that is, if the expected frequency is equal to the observed frequency either for the strings in S or for r;
2. 2·D(S)·D(n)/Exp(n) > 0. This happens if D(S) and D(n) are either both positive, that is c(S) > Exp(S) and c(n) > Exp(n), or both negative, namely S and r are both over-represented or both under-represented in T;
3. 2·D(S)·D(n)/Exp(n) < 0. This happens if one of D(S) and D(n) is positive and the other one is negative, i.e., one is over-represented and the other is under-represented in T.

In order to build the set MOS (MUS) out of Over_h (Under_h), only the second of the above cases applies. In fact, the first case corresponds to considering strings which occur exactly as many times as expected, and the third case corresponds to considering a mixture of over-represented and under-represented strings.

Let us first consider how to determine MOS(T, h). Let ni be any string in Over_h; then c(ni) > Exp(ni) and D(ni) > 0. We assume that the strings in Over_h = {n1, n2, ..., nq} are ordered in non-increasing order by score, namely z3(ni) ≥ z3(nj) when i < j. The set S of maximal score is incrementally built as follows.
Initially S = {n1}, where n1 corresponds to a string with the highest score in Over_h. Then, another string ni ∈ Over_h is added to S only if the union of S and ni does not decrease the total score, that is, only if z3(S) ≤ z3(ni) + 2·D(S)·D(ni)/Exp(ni). We recall that 2·D(S)·D(ni)/Exp(ni) > 0, since all the considered strings are over-represented.
An invariant property of this informal procedure is that the score of the set S is always greater than or equal to the score of any string in Over_h, that is, z3(S) ≥ z3(ni) for each ni ∈ Over_h.

The condition z3(S) ≤ z3(ni) + 2·D(S)·D(ni)/Exp(ni) depends on the partial set S already built, that is, on its score z3(S) and on 2·D(S)·D(ni)/Exp(ni), which also depends on S. Two important consequences come from this fact:

a. It is possible that a string ni decreases the score of S by union, while a string nj, with i < j, does not. In fact it can happen that
   – D(ni)²/Exp(ni) ≥ D(nj)²/Exp(nj), that is, z3(ni) ≥ z3(nj);
   – D(nj)/Exp(nj) could be much greater than D(ni)/Exp(ni), so that
     z3(S) > D(ni)²/Exp(ni) + 2·D(S)·D(ni)/Exp(ni), and
     z3(S) ≤ D(nj)²/Exp(nj) + 2·D(S)·D(nj)/Exp(nj).
   As a consequence, we have to consider for union all the strings in Over_h.
b. It is possible that a string ni decreases the score of S by union, although the same string does not decrease the score of a larger set S′ ⊃ S. In fact it can happen that
   – z3(S) < z3(S′);
   – z3(S) > z3(ni) + 2·D(S)·D(ni)/Exp(ni), and
     z3(S′) ≤ z3(ni) + 2·D(S′)·D(ni)/Exp(ni),
   when D(S′) is much greater than D(S).

Because of the facts above, in order to determine the set S of maximal score, each string in Over_h may have to be considered more than once.

The previous informal method for building MOS(T, h) is summarized by the following abstract procedure:

Build_MOS(Over_h);
{ Over_h = {n1, n2, ..., nq} is ordered in non-increasing order w.r.t. the score z3 }
  MOS := {n1}; Mark(n1);
  repeat
    for ni ∈ Over_h and not Marked(ni) do
      if z3(MOS) ≤ z3(ni) + 2·D(MOS)·D(ni)/Exp(ni) then
        begin MOS := MOS ∪ {ni}; Mark(ni); end;
  until MOS is stable;
end;
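A hedged Python rendering of the procedure, where each string is reduced to its (observed, expected) pair; the representation and names are ours, not a definitive implementation:

    def z3(c, e):
        return (c - e) ** 2 / e

    def build_mos(over):
        """over: list of (count, exp) pairs of over-represented strings."""
        over = sorted(over, key=lambda p: z3(*p), reverse=True)
        mos, c_mos, e_mos = [over[0]], over[0][0], over[0][1]
        marked, changed = {0}, True
        while changed:                    # repeat until MOS is stable
            changed = False
            for i, (c, e) in enumerate(over):
                if i in marked:
                    continue
                if z3(c_mos, e_mos) <= z3(c, e) + 2 * (c_mos - e_mos) * (c - e) / e:
                    mos.append((c, e)); marked.add(i)
                    c_mos += c; e_mos += e
                    changed = True
        return mos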

The construction of MOS is automatic, namely it does not depend on the choice of a threshold. On the other hand, the resulting set of strings strongly depends on the initial ordering of Over_h and on the associated initialization of MOS. We could choose another ordering on Over_h, for example one based on the difference between the observed and the expected frequency, namely D(ni) ≥ D(nj) when i < j, and we would obtain a different subset of Over_h. In this case MOS would be initialized with a most over-represented string in Over_h.

Let us analyze the time complexity of the procedure Build_MOS. The worst case is when only one string is added to MOS at each iteration over Over_h. Then exactly n iterations of the repeat cycle are performed before stopping, where n = |Over_h|. Each iteration requires considering all the strings not yet marked in Over_h and testing whether they can be added to MOS. The test requires constant time. Therefore the iterations of the repeat cycle examine n strings at first, then n − 1, n − 2 and so on, until just one string remains. Hence in the worst case the procedure Build_MOS is quadratic in the size of the input set of strings, that is, O(n²). The initial ordering of Over_h does not increase this complexity.

Note that the set Over_h could be very large, since in principle it could contain O(|Σ|^h) strings. Therefore considering all the unmarked strings in Over_h at each iteration until stability could be very expensive. A convenient heuristic might be to stop the construction of the subset as soon as the cardinality of MOS reaches a given number. Another possibility is to filter Over_h w.r.t. a threshold on the score values, thus reducing the size of the set of strings given in input to Build_MOS.

With the simplifying assumption that all the symbols in the alphabet Σ have the same probability 1/|Σ|, the computation of MOS becomes much simpler and faster. In fact, in that case Exp(ni) = (m − h + 1)/|Σ|^h for all i, that is, for all the strings in Over_h. Then, given ni and nj in Over_h, z3(ni) > z3(nj) implies

– D(ni) > D(nj), and then also
– 2·D(S)·D(ni)/Exp(ni) > 2·D(S)·D(nj)/Exp(nj).

As a consequence, case (a) previously described becomes impossible under this assumption. In fact z3(ni) > z3(nj) implies z3(ni) + 2·D(S)·D(ni)/Exp(ni) > z3(nj) + 2·D(S)·D(nj)/Exp(nj). Hence in procedure Build_MOS we can stop considering the strings in Over_h as soon as we find one string which decreases the score of MOS by union.

Also the previous case (b) is no longer a problem: a single scan of Over_h suffices to guarantee that MOS is the subset of strings of maximal score. Each string ni in Over_h needs to be considered only once for union: either ni satisfies the condition z3(MOS) ≤ z3(ni) + 2·D(MOS)·D(ni)/Exp(ni), or it doesn't. Hence, under the simple assumption of equiprobable symbols, the abstract procedure for computing MOS becomes the following:

Build_simple_MOS(Over_h);
{ Over_h = {n1, n2, ..., nq} is ordered in non-increasing order w.r.t. the score z3 }
  MOS := {n1};
  while ni ∈ Over_h and z3(MOS) ≤ z3(ni) + 2·D(MOS)·D(ni)/Exp(ni) do
    MOS := MOS ∪ {ni};
end;

In the worst case, the time complexity of the procedure Build_simple_MOS is linear in the size of the ordered set of strings given in input, that is, O(n), where n = |Over_h|. Note that the resulting set of strings still depends on the initial ordering of Over_h.

For under-represented strings, the corresponding constructions are similar to the ones just described. In particular, we can derive procedures Build_MUS and Build_simple_MUS analogous to the previous ones for over-represented strings.

Note that the set MOS(T, h) (resp. MUS(T, h)) is a set of annotated strings, namely each string has its associated observed frequency, expected frequency and chi-square score. This information is also used when abstracting the most relevant strings, that is, when expressing them as a sum of patterns.

6 Abstracting the most relevant strings

In this section we would like to combine string abstraction and scoring. After the most interesting strings have been determined and stored into MOS(T, h), we might want to perform an abstraction step, in order to concisely represent such a set of strings and to stress similarities among relevant strings. A natural choice in our framework is to define an abstraction of MOS(T, h) as an equivalent set of patterns, which is just a particular kind of regular expression, as pointed out in Section 3. More precisely, the abstraction of MOS is an equivalent sum of disjoint patterns, namely the patterns in the sum form a partition of MOS.

For simplicity's sake, in the following we ignore the information associated to each string (i.e., observed frequency, expected frequency and chi-square score). Moreover, since for regular expressions "set of strings" (or "set of patterns") is equivalent to "sum of strings" (resp. "sum of patterns"), we use one or the other expression depending on the context.

Let us consider the following simple example.

Example 10. Let Σ be {A, T, C, G} and let MOS(T, 4) be the set:

{ACGA, ACTA, TCTA, TATA, CATA, CAAT, CGAT, GGAT, GGCT, CAGA}.

An abstraction of MOS(T, 4) can be given in terms of the equivalent sum of 6 patterns, namely T[CA]TA + AC[GT]A + CA[GT]A + GG[AC]T + CGAT + CAAT.
This abstraction is not the best one. We can in fact further abstract MOS(T, 4) as the sum of 5 patterns: T[CA]TA + AC[GT]A + CA[GT]A + GG[AC]T + C[AG]AT. This second abstraction is clearly preferable since it is more concise.

Note that both a sum of strings and a sum of patterns on an alphabet Σ can be viewed as a sum of strings on an extended alphabet Σ′ isomorphic to P(Σ) − {∅}. Namely, a string is just a special pattern, and in each position of a pattern there is a symbol in Σ′ corresponding to a set of symbols of Σ. In Section 3 we call the symbols in Σ′ abstract symbols. In this way both the initial set of strings MOS(T, h) and its abstraction can be viewed as sums of strings on Σ′, and we can deal with strings and patterns in the same way. For simplicity, we denote the union of s1, s2 ∈ Σ′ either with [s1 s2] or with the corresponding symbol in Σ′, depending on the context.

We now introduce the basic abstraction operation. Let S1 and S2 be two strings. We recall that the Hamming distance of S1 and S2, H(S1, S2), is the number of symbol positions in which the two strings differ. Equivalently, we may say that the Hamming distance of S1 and S2 is the minimal number of symbol substitutions necessary to transform S1 into S2. The Hamming distance is defined on patterns as well, since they are just strings on Σ′. Two patterns P1 and P2 are at Hamming distance one, that is H(P1, P2) = 1, iff they are identical except in a single position: P1 = α s1 β and P2 = α s2 β, where α, β ∈ Σ′* and s1, s2 ∈ Σ′, s1 ≠ s2.

Remark 11. Let P1 and P2 be two distinct patterns. They are at Hamming distance one iff their sum is equivalent to a single pattern whose symbols are the union of the corresponding symbols in the two patterns: H(P1, P2) = 1 iff there exists P such that P = α[s1 s2]β and P ≡ P1 + P2.

This is due to the distributivity of concatenation w.r.t. sum in regular expressions. We call simple abstraction the operation of transforming a pair of disjoint patterns P1 and P2, which are at Hamming distance one, into the equivalent pattern P as stated in Remark 11.

Example 12. Let us consider MOS(T, 4) of the previous Example 10. The strings ACGA and ACTA are at Hamming distance one, since they differ only in the third position. We can apply a simple abstraction to them, since they are disjoint, and we obtain AC[GT]A.
The disjoint patterns AC[GT]A, AC[GT]C and AC[GT]T are pairwise at Hamming distance one, since they differ only in the last position. By applying simple abstraction twice we obtain AC[GT][ACT].
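Encoding patterns as tuples of sets makes the operation a one-liner. A hedged sketch (our own encoding) reproducing the first step of Example 12:

    def hamming(p1, p2):
        """Number of positions where two equal-length patterns differ."""
        return sum(a != b for a, b in zip(p1, p2))

    def simple_abstraction(p1, p2):
        assert hamming(p1, p2) == 1       # defined only at Hamming distance one
        return tuple(a | b for a, b in zip(p1, p2))

    as_pattern = lambda s: tuple(frozenset(c) for c in s)
    p = simple_abstraction(as_pattern("ACGA"), as_pattern("ACTA"))
    # p represents the pattern AC[GT]A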

Definition 13. Let S = {s1, ..., sn} be a set of strings of length h on an alphabet Σ.
An abstraction A of S is a sum of disjoint patterns of length h on Σ which is equivalent to S, that is, A = P1 + ... + Pm, Pi ∩ Pj = ∅ for all i, j ∈ [1, m] with i ≠ j, m ≤ n, and A ≡ S.
The size of the abstraction A is m, the number of patterns it consists of.
A is a maximal abstraction of S iff no pair of patterns in A is at Hamming distance one.

Remark 14. By Remark 11 we could restate the definition of maximal abstraction of a set of strings S as follows: A is a maximal abstraction of S iff it is not possible to apply a simple abstraction to any pair of patterns in A.

In general the maximal abstraction of a set of strings of length h on an alphabet Σ is not unique. Moreover, there can be maximal abstractions with different sizes for the same set S.

Example 15. Let MOS(T, 3) be {ACG, ACT, TCG, TCA}. We can abstract it as AC[GT] + TC[GA], but also as [AT]CG + ACT + TCA. Both abstractions of MOS(T, 3) are maximal; the first one has size 2 and the second one has size 3. The first one is a maximal abstraction of MOS(T, 3) of minimal size.

One way to build a maximal abstraction of MOS(T, h) consists in considering one string s1 in MOS(T, h) and building from it a pattern by "absorbing" as many other strings in MOS(T, h) as possible, that is, by incrementally applying simple abstraction steps to s1 and to all the other strings in MOS(T, h) which allow it. The same procedure is repeated for all the remaining strings in MOS(T, h). Then the set of the resulting patterns is considered and the procedure is iterated until no further simple abstraction is possible.

To better clarify this technique, let us consider the following example.

Example 16. Let MOS(T, 4) be:

{ACGA, AATA, ACTA, AAGA, CCTA, CATA, CAGA, CCGA}.

We choose the first string ACGA and apply a simple abstraction with ACTA; the result is AC[GT]A. Since no other simple abstractions are possible, we choose the next "unused" string AATA and apply a simple abstraction with AAGA; the result is AA[GT]A. We proceed in this way until all the strings in MOS(T, 4) have been considered. The equivalent set of patterns which is obtained is {AC[GT]A, AA[GT]A, CC[GT]A, CA[GT]A}, and it is now considered for further abstractions.

We choose AC[GT]A and AA[GT]A and by a simple abstraction we get A[AC][GT]A. Since no other simple abstractions are possible, we choose CC[GT]A and CA[GT]A and we get C[AC][GT]A. The resulting equivalent set of patterns is {A[AC][GT]A, C[AC][GT]A}. Now a further simple abstraction is possible, thus yielding a maximal abstraction of the initial set of strings, which is the single pattern [AC][AC][GT]A.

The following abstract procedure Build ABS computes ABS, a maximal abstraction of MOS(T, h), as illustrated in the previous example.

Build ABS(MOS(T, h), h);
NewABS := MOS(T, h);
repeat
  ABS := NewABS; NewABS := ∅; Marked := ∅;
  {Marked is the set of marked strings in ABS}
  repeat
    s1 := Choose in(ABS); Marked := Marked ∪ {s1};
    {an unmarked string is nondeterministically chosen in ABS and it is marked}
    for all s2 ∈ ABS and not s2 ∈ Marked do
      if H(s1, s2) = 1 then
      begin
        s1 := S Abstract(s1, s2); Marked := Marked ∪ {s2};
        {a simple abstraction is applied to the two strings and the result is stored into s1}
      end;
    NewABS := NewABS + {s1};
  until |ABS| = |Marked|;
until |ABS| = |NewABS|;
end;
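For concreteness, here is a minimal executable Python sketch of the procedure, under the pattern representation of the earlier sketch (our own; frequency bookkeeping by S Abstract is omitted, and Choose in is realized deterministically as "take the first unmarked element"):

    def to_pattern(s):
        return tuple(frozenset(c) for c in s)

    def can_abstract(p1, p2):
        # simple abstraction applies iff H(p1, p2) = 1 and the patterns are disjoint
        diff = [(a, b) for a, b in zip(p1, p2) if a != b]
        return len(diff) == 1 and diff[0][0].isdisjoint(diff[0][1])

    def build_abs(strings):
        new_abs = [to_pattern(s) for s in strings]
        while True:
            abs_, new_abs, marked = new_abs, [], set()
            for i, p1 in enumerate(abs_):
                if i in marked:
                    continue
                marked.add(i)                      # choose and mark s1
                for j in range(i + 1, len(abs_)):  # scan the unmarked elements
                    if j not in marked and can_abstract(p1, abs_[j]):
                        p1 = tuple(a | b for a, b in zip(p1, abs_[j]))
                        marked.add(j)              # s2 is absorbed into s1
                new_abs.append(p1)
            if len(new_abs) == len(abs_):          # no abstraction in this pass: maximal
                return new_abs

    def show(p):
        return "".join(next(iter(x)) if len(x) == 1
                       else "[" + "".join(sorted(x)) + "]" for x in p)

    # Example 16:
    mos = ["ACGA", "AATA", "ACTA", "AAGA", "CCTA", "CATA", "CAGA", "CCGA"]
    print([show(p) for p in build_abs(mos)])  # -> ['[AC][AC][GT]A']

Note that with this deterministic order the intermediate passes differ from those shown in Example 16, yet the final maximal abstraction coincides here; in general, as observed above, different orders may yield maximal abstractions of different sizes.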

The external repeat cycle stops when no simple abstraction step is possible and ABS = NewABS. Then, by Remark 14, the resulting set of patterns, ABS, is a maximal abstraction of the initial set of strings, MOS(T, h).

Regarding time complexity, let us consider the worst case, when only one simple abstraction is possible at each iteration of the external cycle. Hence, exactly n iterations of the external cycle are performed before stopping, where n = |MOS(T, h)|. Each iteration requires considering a string (or pattern) s1 in ABS and abstracting it together with another string (or pattern) s2, if they are at Hamming distance one. In the worst case the Hamming distance from s1 is evaluated for all the strings in ABS, but only one string makes the abstraction possible. Therefore each iteration of the external cycle requires at most n^2 comparisons of two strings to determine whether their Hamming distance is one. In the worst case, each such comparison requires considering all the h symbols in the strings. Let us assume that Choose in(ABS) requires constant time. Besides, let us assume that symbol comparison and union require constant time; this can be obtained by coding the symbols in Σ′ as memory words with bits representing the symbols in Σ. Then in the worst case the procedure Build ABS is cubic w.r.t. the size of the input set of strings and linear w.r.t. the length of the strings, that is O(hn^3).

The maximal abstraction of MOS(T, h) produced by the procedure Build ABS depends on the order of choice of s1, specified by Choose in(ABS), and on the order in which the other unmarked strings in ABS are considered. The description of the procedure is nondeterministic, to stress the fact that the order in which strings are considered is not relevant for obtaining a maximal abstraction of MOS(T, h). We could give a deterministic version of the procedure by stating an order on the strings in MOS(T, h) and by specifying that Choose in(ABS) chooses, for example, the first unmarked string (or pattern) in the set ABS and that the for all cycle scans the unmarked strings (or patterns) in ABS in the same order. Since the assumption that Choose in(ABS) requires constant time is preserved, the complexity of the deterministic version of the procedure is the same as that of the nondeterministic version. On the other hand, the order of choice of s1 and s2 may be relevant for determining a maximal abstraction of minimal size.

S Abstract(s1, s2) must also handle the relevant information associated with the two strings (or patterns): for example, it could sum both the observed and the expected frequencies of s1 and s2 and associate such sums with their abstraction s1. It could also evaluate the score associated with the corresponding set of strings, as shown in Section 4.3.

7 Abstracting the alphabet

In this section we discuss a second possibility for using abstraction in pattern discovery: we could abstract the alphabet Σ, for example by assuming that some symbols are similar (or indistinguishable) for the purpose of our analysis. This can be a reasonable assumption, in particular for large alphabets such as that of the amino acids. For example, we could know that amino acids with similar properties may be substituted for one another in some protein, and exploit this knowledge while looking in the sequence T for interesting substrings. More generally, in a blind analysis it could be useful to explore different abstractions of Σ.

Abstracting the alphabet Σ means translating it into a new alphabet ∆ that should reflect the knowledge or assumptions on symbol similarity. Such similarity information on Σ can be formalized by means of an equivalence relation R, grouping together similar symbols. This induces a partition of Σ into equivalence classes, which can be interpreted as the (abstract) symbols of a new (abstract) alphabet ∆ isomorphic to Σ/R. The alphabet translation is then a function τ mapping each symbol in Σ to the corresponding symbol in ∆ representing the equivalence class it belongs to.

Example 17. Let Σ = {A, T, C, G} and assume that we want to abstract it by expressing that C and G are similar for the purpose of our analysis. We define a relation R which is reflexive, symmetric and transitive and such that C R G. This produces the partition ({C, G}, {A}, {T}) of Σ. Hence ∆ can be any set isomorphic to {[CG], [A], [T]}. The translation function τ, associated with ∆, is defined by: τ(A) = [A], τ(T) = [T], τ(C) = [CG], τ(G) = [CG]. It can be extended to strings in Σ*, for example τ(ATCCGA) = [A][T][CG][CG][CG][A]. In this way we obtain a translation from a string on Σ into an abstract string on ∆.
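A small Python sketch of such a translation (our own illustration): the equivalence classes are given as a partition of Σ, and τ is realized as a dictionary mapping each symbol to its class.

    def make_translation(partition):
        # tau maps each symbol of Sigma to the equivalence class it belongs to
        return {sym: frozenset(cls) for cls in partition for sym in cls}

    def translate(text, tau):
        # the extension of tau to strings in Sigma*: an abstract string over Delta
        return [tau[c] for c in text]

    # Example 17: C and G are declared similar
    tau = make_translation([{"C", "G"}, {"A"}, {"T"}])
    print(translate("ATCCGA", tau))
    # -> [A][T][CG][CG][CG][A], printed as a list of frozensets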

In the context of pattern discovery, a translation τ of the original alphabet Σ into a more abstract alphabet ∆ can be applied at different levels: we can directly translate the text T, or translate and compact either the k-trie Tk or MOS(T, h). The choice among these alternatives is very much dependent on the application.

Note that abstracting the alphabet and abstracting the most relevant strings can be usefully combined. After abstracting the alphabet, the analysis of the substrings can follow the proposals shown in Section 5 and in Section 6.

8 The analysis framework

In this section we discuss how we intend to use the k-trie, the sets MOS and MUS, and the patterns previously described in order to analyze biological sequences.

Our framework can be summarized by the following schema:

[Schema: the input sequence T (e.g. ...aataataatataattacata...) is processed by BUILD_TREE into the k-trie Tk, which PRUNE_TREE may reduce; BUILD_MOS and BUILD_MUS extract from Tk the set MOS or MUS of interesting strings (e.g. gata); BUILD_ABS abstracts them into the set ABS of patterns (e.g. [acgt]ata, at[at]a).]

From the considered biological sequence(s) T, we derive a k-trie Tk representing all the strings of length at most k in T, with their expected and observed frequencies and chi-square score. A pruning technique can be applied to Tk for space optimization, but the pruning module is orthogonal to the analysis process. Then we focus on the set of the most relevant strings of a given length h in T, thus refining Tk into MOS(Tk, h) or MUS(Tk, h). This is done separately for under- and over-represented strings. We can build MOS(Tk, h) in various ways depending on what we focus on; hence a few independent modules are available for this purpose: they order strings, filter over- or under-represented strings w.r.t. a fixed threshold of significance, and implement the abstract procedures Build MOS, Build simple MOS, Build MUS and Build simple MUS. Each set of interesting strings can be separately represented in an abstract way as a set ABS of relevant patterns. This can be done in different ways, depending on the order on strings used by the procedure Build ABS.

Therefore, we have three different levels of precision in representing the information contained in the input sequence T, which are Tk, MOS(Tk, h) (or MUS(Tk, h)) and ABS. In particular, Tk constitutes a kind of signature or fingerprint of the original sequence T. We can give an experimental validation of the relevance of such information and exploit it for two main processes:

1. We can compare different biological sequences through their representations in order to discover possible similarities and differences. In particular, we can compare two biological sequences, T1 and T2 (or families of sequences), by means of their k-tries. In general the length of the sequences is much greater than k and than |Σ|. Under these hypotheses, the two k-tries might have quite similar structures and symbols, but their observed frequencies and scores might significantly differ. Then, similarities and differences in structure or symbols between the two trees are worth highlighting, while differences in observed frequencies and scores have to be pointed out only if they are significant. We can also compare T1 and T2 by means of their MOS representations for over-represented strings (or MUS for under-represented ones). This is a more synthetic representation and very likely a more meaningful one w.r.t. interesting strings. We could even compare T1 and T2 by means of their ABS representations for over- or under-represented patterns. This is an even more synthetic and readable representation, but it depends on the order on strings used by the Build ABS procedure. As a consequence, we should adopt the same order on the two sets of strings when abstracting them in order to allow for comparison.

2. We can search a text T to determine the presence of a pattern p, assumed to be an interesting one. For example, p could be a pattern found in another text T′, or in the MOS or ABS of T′. Since an interesting pattern represents a set of interesting strings, we would like to determine which part of p occurs in T and how many occurrences it has.

Comparing texts and searching for patterns can be viewed in a unified way as a tree comparison operation. In fact, apart from k-tries, also MOS, ABS and a single pattern p can be represented as a tree with the single prefix property, as shown in Section 3. Hence both problems (1) and (2) can be solved by finding the maximum common subtree starting from the root [34], namely a subtree of both T1 and T2 in case of (1), or of both the prefix tree of p and Tk in case of (2), which starts from the roots of the two trees and whose number of edges is maximum among all possible subtrees.

Let us now focus on problem (2). Let p be a pattern on Σ of length h. p is a regular expression and it defines a (regular) set of strings on Σ. We define when and how p occurs in a k-trie Tk and how many occurrences it has.

Definition 18. Let Tk be the k-trie of a text T on an alphabet Σ and p be a pattern on Σ of length h, with h ≤ k. We say that p occurs in Tk iff there exists a string s in p and a node n in lev[h] of Tk such that str_Tk(n) = s. The node n is called an occurrence point of p in Tk. We say that p totally occurs in Tk iff for each string s in p there exists a node n in lev[h] of Tk such that str_Tk(n) = s.

Remark 19. If a pattern p occurs in Tk, then there exists a unique pattern p′, weakly more precise than p, which totally occurs in Tk and which is maximal w.r.t. this property. p′ is called the maximum occurrence of p in Tk.

In order to determine exactly how many occurrences p has in Tk, we need to consider all the paths of length h in Tk which correspond to strings in the set p, and to sum up the counters associated to the occurrence points of p. Let c(p) be the number of occurrences of p in Tk and let p′ be the maximum occurrence of p in Tk; then c(p) = Σ_{n ∈ lev[h], str_Tk(n) ∈ p′} c(n).

The previous definitions and observations can be naturally extended and applied to a sum of patterns as well.
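As an illustration, the following self-contained Python sketch (the trie layout, names and toy text are our own assumptions) computes c(p) on a small k-trie by enumerating the strings denoted by p and summing the counters of their occurrence points:

    from itertools import product

    def make_node():
        # hypothetical k-trie node: occurrence counter plus sons keyed by symbol
        return {"count": 0, "children": {}}

    def add_string(root, s):
        # one occurrence of s contributes to the counter of each of its prefixes
        node = root
        for c in s:
            node = node["children"].setdefault(c, make_node())
            node["count"] += 1

    def count_pattern(root, pattern):
        # c(p): sum of the counters of the level-h nodes spelling a string of p
        total = 0
        for s in product(*pattern):           # enumerate the strings denoted by p
            node = root
            for c in s:
                node = node["children"].get(c)
                if node is None:
                    break
            else:
                total += node["count"]        # an occurrence point of p
        return total

    # toy text: insert all substrings of length <= 2 of "gataca"
    root, text, k = make_node(), "gataca", 2
    for i in range(len(text)):
        add_string(root, text[i:i + k])
    print(count_pattern(root, [{"a", "g"}, {"t", "c"}]))  # c([ag][tc]) = 2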

We can determine the maximum occurrence of p in Tk by finding the maximum common subtree of Tk and the prefix tree of p. This can be obtained starting from the root and visiting the two trees in parallel through a level-order visit.

The procedure Max Common subtree takes as input two trees, T1 and T2, possibly of different depths, and it uses a queue Q to support the parallel level-order visit with the typical queue operations: Enqueue inserts the specified element in the queue, Dequeue removes and returns the value of the first element in the queue. Each element in the queue Q is actually a pair of nodes: the left one is a node of T1 and the right one is a node of T2.

Initially, the roots of the two trees are enqueued together; then, each time a pair of nodes is dequeued, all their sons are taken into consideration and any two nodes having the same symbol are enqueued together.

In order to highlight the maximum common subtree of the two given trees, each visited node is marked with a matching flag, which can be stored in a single bit field added to each node. We assume that the matching flag of each node is initially set to zero, and that the list of sons of each node is sorted with respect to the alphabet symbols.

Max Common subtree(T1, T2)
{the matching flags of all nodes are set to zero}
Enqueue(Q, (n1, n2)); {n1 is the root of T1 and n2 is the root of T2}
while not isEmpty(Q) do
begin
  (x, y) := Dequeue(Q);
  for all x′ son of x and y′ son of y do
    {order on symbols is used to optimize node comparison}
    if x′.symbol = y′.symbol then
    begin
      x′.matching flag := 1; y′.matching flag := 1; Enqueue(Q, (x′, y′));
    end;
end;

The complexity of the procedure Max Common subtree is O(min(n1, n2)), where n1 and n2 are the number of nodes of T1 and T2, respectively [34].
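For illustration, here is a runnable Python rendering of the same visit (the node layout is our own; sons are indexed by symbol with a dictionary instead of scanning sorted lists):

    from collections import deque

    class Node:
        def __init__(self, symbol, children=()):
            self.symbol = symbol
            self.children = {c.symbol: c for c in children}  # sons indexed by symbol
            self.matching_flag = 0

    def max_common_subtree(root1, root2):
        # parallel level-order visit marking the maximum common subtree from the roots
        q = deque([(root1, root2)])
        while q:
            x, y = q.popleft()
            for sym, xc in x.children.items():
                yc = y.children.get(sym)
                if yc is not None:               # same symbol in both trees
                    xc.matching_flag = yc.matching_flag = 1
                    q.append((xc, yc))

    # prefix tree of C[AC] + G[ACT] against the first levels of a small trie
    t1 = Node("", [Node("C", [Node("A"), Node("C")]),
                   Node("G", [Node("A"), Node("C"), Node("T")])])
    t2 = Node("", [Node("A", [Node("A"), Node("C"), Node("G")]),
                   Node("C", [Node("A"), Node("C"), Node("G")])])
    max_common_subtree(t1, t2)
    print([c.symbol for c in t1.children["C"].children.values() if c.matching_flag])
    # -> ['A', 'C']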

Example 20. Consider the following two trees:

[Figure omitted: two trees with the matched nodes highlighted; the nodes of the k-trie are labelled with pairs (symbol, counter), e.g. (A,10), (C,5), (G,3).]

The tree on the left is the prefix tree associated with the sum of patterns C[AC] + G[ACT]; the one on the right corresponds to the first two levels of a k-trie of a given text T. The result of applying the Max Common subtree procedure is highlighted in both trees.

The maximum occurrence of a pattern p (or of a sum of patterns) shows exactly which part of p is present in the text T, namely which strings in p occur in T, each with its number of occurrences. It also shows which part of p does not occur in T. In pattern discovery this can be as interesting as the symmetric information.

Clearly also problem (1), i.e. the comparison of two k-tries, consists in finding their maximum common subtree starting from the root. The procedure Max Common subtree can be slightly modified for comparing two k-tries and for pointing out their similarities and differences. One possibility is to enqueue together only pairs of nodes having the same symbol and with observed frequencies (or scores) ranging over a given interval of values [δmin, δmax]. Note that when we compare two texts with significantly different lengths, we have to normalize observed frequencies w.r.t. the size of the text.
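A sketch of this variation, reusing the node layout above extended with an observed-frequency field (freq, delta_min and delta_max are our own names; we read the condition as requiring both normalized frequencies to lie in the interval):

    from collections import deque
    from dataclasses import dataclass, field

    @dataclass
    class KNode:
        symbol: str
        freq: float = 0.0           # observed frequency, normalized w.r.t. text size
        children: dict = field(default_factory=dict)
        matching_flag: int = 0

    def compare_ktries(root1, root2, delta_min, delta_max):
        # sons are paired only if they carry the same symbol and both
        # observed frequencies fall in the interval [delta_min, delta_max]
        q = deque([(root1, root2)])
        while q:
            x, y = q.popleft()
            for sym, xc in x.children.items():
                yc = y.children.get(sym)
                if (yc is not None
                        and delta_min <= xc.freq <= delta_max
                        and delta_min <= yc.freq <= delta_max):
                    xc.matching_flag = yc.matching_flag = 1
                    q.append((xc, yc))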

9 Concluding remarks

In this paper we illustrate some techniques for pattern discovery, based on the definition, construction and manipulation of a tree data structure, the k-trie, and on the concept of abstraction.

A k-trie is an enriched and truncated suffix trie, representing and counting all substrings of length at most k of some input texts. We propose an on-line algorithm for building such a k-trie, whose time complexity is linear with respect to the length of the input texts. The space complexity can be problematic, but it is possible to filter out uninteresting substrings and to reduce the space required by the tree by means of pruning techniques.

We consider the chi-square score for measuring the interest of each substring of the input texts. We analyze the score behaviour in terms of monotonicity w.r.t. two basic operations: symbol concatenation and string union. The score behaviour w.r.t. symbol concatenation influences how relevant strings distribute along the paths of the k-trie. The properties of the score w.r.t. string union are the basis for building a set of the most interesting strings in an incremental and automatic way.

Abstraction of substrings, that is grouping similar substrings and representing them in a compact way as patterns, seems particularly useful in the pattern discovery framework, to point out similarities among interesting substrings. We propose to use rigid patterns, which are particular regular expressions denoting a set of strings of the same length, and to consider a tree representation for them which corresponds to their associated prefix automaton. We intend to exploit abstraction and its tree representation in different ways:

– for representing the set of the most interesting strings (i.e. the most over-represented or the most under-represented ones) in a concise way;

– for comparing different biological sequences, with suspected or assumed similar properties, by a direct comparison of the corresponding sets of the most interesting strings; for example, in order to single out similarities and differences between input sequences coming from different DNA regions or different families of proteins;

– for determining how much of a pattern occurs in a k-trie and then in the associated texts.

A prototype tool implementing the proposed techniques has been developed. We plan to use it to verify the real effectiveness of our approach through concrete applications and experiments on biological sequences. Until now we have not focused on reducing the complexity of the procedures: further work is necessary to optimize them. Moreover, we would like to extend our proposal to l-flexible patterns, namely patterns which admit at most l gaps. This is a main issue for future work.

Acknowledgment

The procedure Build ABS has been developed by Alberto Carraro in his BSc degree thesis in Computer Science at the University of Venice.

References

1. J. Allali and M.-F. Sagot. The at most k-depth factor tree. Submitted for publication.

2. D. Angluin. Inference of reversible languages. Journal of the ACM, 29(3):741–765, 1982.

3. A. Apostolico, M. E. Bock, S. Lonardi, and X. Xu. Efficient detection of unusual words. Journal of Computational Biology, 7(1/2):71–94, 2000.

4. A. Apostolico, M. E. Bock, and S. Lonardi. Monotony of surprise and large-scale quest for unusual words. In RECOMB 2002: Int. Conf. on Research in Computational Molecular Biology 2002, pages 22–31. ACM Press, 2002.

5. A. Apostolico, F. Gong, and S. Lonardi. Verbumculus and the discovery of unusual words. Journal of Computer Science and Technology, 19(1):22–41, 2004.

6. T. L. Bailey and C. Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Int. Conference on Intelligent Systems in Molecular Biology, volume 2, pages 28–36, 1994.

7. A. Brazma, I. Jonassen, I. Eidhammer, and D. Gilbert. Approaches to the automatic discovery of patterns in biosequences. J. Comput. Biol., 5(2):279–305, 1998.

8. J. Buhler and M. Tompa. Finding motifs using random projections. In RECOMB 2001: International Conference on Research in Computational Molecular Biology 2001, pages 69–76. ACM Press, 2001.

9. N. Cannata, S. Toppo, C. Romualdi, and G. Valle. Simplifying amino acid alphabets by means of a branch and bound algorithm and substitution matrices. Bioinformatics, 18(8):1102–1108, 2002.

10. N. Cocco, N. Cannata, and M. Simeoni. k-tries and abstraction for biosequences analysis. Technical Report CS-2005-7, Dipartimento di Informatica, Universita Ca' Foscari di Venezia, 2005.

11. A. Cornish-Bowden. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res., 13:3021–3030, 1985.

12. M. Crochemore and M.-F. Sagot. Motifs in sequences: localization and extraction. In A. Konopka et al., editors, Handbook of Computational Chemistry. Marcel Dekker Inc, 2005.

13. S. R. Eddy. Profile hidden Markov models. Bioinformatics, 14(9):755–763, 1998.

14. J. L. Gardy, M. R. Laird, F. Chen, S. Rey, C. J. Walsh, M. Ester, and F. S. Brinkman. PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics, 21(5):617–623, 2005.

15. D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.

16. J. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages and Computation. Addison Wesley, 1979.

17. M. Kozak. Initiation of translation in prokaryotes and eukaryotes. Gene, 234(2):187–208, 1999.

18. C. E. Lawrence, S. F. Altschul, and M. S. Boguski. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208–214, 1993.

19. L. Marsan and M.-F. Sagot. Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. Journal of Computational Biology, 7:345–362, 2000.

20. E. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23:262–272, 1976.

21. N. J. Mulder, R. Apweiler, T. K. Attwood, A. Bairoch, A. Bateman, D. Binns, P. Bradley, P. Bork, P. Bucher, L. Cerutti, R. Copley, E. Courcelle, U. Das, R. Durbin, W. Fleischmann, J. Gough, D. Haft, N. Harte, N. Hulo, D. Kahn, A. Kanapin, M. Krestyaninova, D. Lonsdale, I. Letunic, R. Lopez, M. Madera, J. Maslen, J. McDowall, A. Mitchell, A. N. Nikolskaya, S. Orchard, M. Pagni, C. P. Ponting, E. Quevillon, J. Selengut, C. J. Sigrist, V. Silventoinen, D. J. Studholme, R. Vaughan, and C. H. Wu. InterPro, progress and status in 2005. Nucleic Acids Res., 33:201–205, 2005.

22. A. F. Neuwald, J. S. Liu, and C. E. Lawrence. Gibbs motif sampling: Detection of bacterial outer membrane protein repeats. Protein Sci., 4(2):1618–1632, 1995.

23. P. Schieg and H. Herzel. Periodicities of 10-11bp as indicators of the supercoiled state of genomic DNA. Journal of Molecular Biology, 343(4):891–901, 2004.

24. L. Parida. A formal treatment of the problem of pattern discovery. Tutorial Paper at ISMB/ECCB 2004 – 12th International Conference on Intelligent Systems for Molecular Biology and 3rd European Conference on Computational Biology, 2004.

25. G. Pavesi, G. Mauri, and G. Pesole. An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics, 17(Suppl. 1):S207–S214, 2001.

26. G. Pavesi, G. Mauri, and G. Pesole. In silico representation and discovery of transcription factor binding sites. Briefings in Bioinformatics, 5(3):1–20, 2004.

27. G. Pavesi, P. Meneghetti, G. Mauri, and G. Pesole. Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Research (Web Server Issue), 32:W199–W203, 2004.

28. P. A. Pevzner and S.-H. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. In ISMB'00: Int. Conf. on Molecular Biology 2000, pages 269–278. American Association for Artificial Intelligence, 2000.

29. I. Rigoutsos and A. Floratos. Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics, 14(2):229, 1998.

30. T. D. Schneider and R. M. Stephens. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res., 18(20):6097–6100, 1990.

31. G. D. Stormo. DNA binding sites: representation and discovery. Bioinformatics, 16(1):16–23, 2000.

32. J. D. Thompson, D. G. Higgins, and T. J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22:4673–4680, 1994.

33. E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249–260, 1995.

34. G. Valiente. Algorithms on Trees and Graphs. Springer Verlag, 2002.

35. J. van Helden, B. Andre, and J. Collado-Vides. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. Journal of Molecular Biology, 281:827–842, 1998.

36. P. Weiner. Linear pattern matching algorithms. In 14th IEEE Symposium on Switching and Automata Theory, pages 1–11, 1973.