On The Approximate Pattern Occurrences In A Text


Mireille Régnier
INRIA Rocquencourt
78153 Le Chesnay Cedex, France
[email protected]

Wojciech Szpankowski
Department of Computer Science
Purdue University
W. Lafayette, IN 47907
[email protected]

Abstract

Consider a given pattern H and a random text T generated according to the Bernoulli model. We study the frequency of approximate occurrences of the pattern H in a random text when overlapping copies of the approximate pattern are counted separately. We provide exact and asymptotic formulae for the mean, the variance and the probability of occurrence, as well as asymptotic results including the central limit theorem and large deviations. Our approach is combinatorial: we first construct some language expressions that characterize pattern occurrences, which are translated into generating functions, and finally we use analytical methods to extract asymptotic behaviors of the pattern frequency. Applications of these results include molecular biology, source coding, synchronization, wireless communications, approximate pattern matching, games, and stock market analysis. These findings are of particular interest to information theory (e.g., second-order properties of the relative frequency) and to molecular biology problems (e.g., finding patterns with unexpectedly high or low frequencies, and gene recognition).

1 Introduction

Repeated patterns and related phenomena in words (sequences, strings) are known to play a central role in many facets of computer science, telecommunications, and molecular biology. One of the most fundamental questions arising in such studies is the frequency of pattern occurrences in another string, known as the text. For applications it is even more important to know how many times a given pattern approximately occurs in a (random) text. By approximate occurrence we mean that there exists a substring of the text within a given distance from the (given) pattern. The definition of the distance is irrelevant in this paper.
This problem is also more challenging than exact pattern occurrence. Applications include wireless communications, approximate pattern matching (cf. [14]), molecular biology (cf. [28]), games, code synchronization (cf. [9, 10]), source coding (cf. [4]), stock market analysis, and so forth.

* This research was supported by NATO Collaborative Grant CRG.950060. It was initiated at INRIA, Sophia Antipolis during the summer of 1996, and both authors are grateful to project MISTRAL for hospitality and support.
† This research was additionally supported by ESPRIT LTR Project No. 20244 (ALCOM-IT).
‡ This research was additionally supported by NSF Grants NCR-9206315 and NCR-9415491.

We study the problem in a probabilistic framework in which the text is generated randomly according to the so-called Bernoulli model: every symbol of a finite alphabet S is generated independently of the other symbols, with possibly different probabilities for different symbols (if all the probabilities are equal, the model is called the symmetric Bernoulli model). Our approach to this problem falls under the methodology of "combinatorics on words" (cf. [10, 17]): we construct certain languages that characterize approximate pattern occurrences in a text, which are then translated into generating functions. This approach might be of interest for other problems on words.

Pattern occurrences in a random string is a classical problem (cf. [6]). Several authors have contributed to this problem; however, the most important recent contributions belong to Guibas and Odlyzko. In particular, in [10] the authors computed, in the symmetric Bernoulli model, the moment generating function for the number of strings that do not contain any of a given set of patterns (approximate search) and for the number of strings that contain r occurrences of a given pattern (exact search). An extension of this last result to the asymmetric Bernoulli model can be found in [8]. The Markovian model was also tackled for the exact search: Li [16] and Chrysaphinou and Papastavridis [2] extended the Guibas and Odlyzko result on no pattern occurrence, and Prum et al. [22] (see also [25]) obtained the normal limiting distribution for the number of pattern occurrences in the Markovian model.
In [23], we derived the limiting distribution and computed the first moments.

In this paper, we provide a complete characterization of the frequency of approximate pattern occurrences in a random text generated according to the Bernoulli model, when overlapping approximate copies of the pattern are counted separately. Let H be a set of strings, for example the set of all strings of length m which are within a given distance from some pattern H. We compute exactly the generating function of the number of occurrences of H-patterns (cf. Theorem 2.1), which further provides the mean and the variance (cf. Theorem 2.2). Evaluation of the variance is quite challenging since it depends on the internal structure of the patterns through the so-called correlation matrix introduced in this paper. In addition, we present several asymptotic results concerning the probability of r occurrences for different ranges of r. We consider r = O(1), as well as the central limit and large deviations ranges of r.

Our results should be of particular interest to information theory (e.g., relative frequency, code synchronization, source coding, etc.) and molecular biology. Two problems of molecular biology can benefit from these results, namely: finding patterns with unexpected (high or low) frequencies (the so-called contrast words) [28], and recognizing genes by statistical properties [28]. Statistical methods have been used successfully since the early 80's to extract information from sequences of DNA. In particular, identifying deviant short motifs, the frequency of which is either too high or too low, might point out unknown biological information (cf. [28] and others for the analysis of functions of contrast words in DNA texts). From this perspective, our results give estimates for the statistical significance of deviations of word occurrences from the expected values and allow a biologist to build a dictionary of contrast

words in genetic texts.

One can also use these results to recognize statistical properties of various other information sources such as images, text, etc. In information theory, the relative frequency, defined as rho_n(H) = O_n(H) / (n - m + 1), where m is the length of the pattern, is often used to estimate the statistics of the information source. The relative frequency was mostly studied for exact pattern occurrences, while in this paper we extend it to approximate occurrences. Such an extension is relevant to some recent applications such as a lossy extension of the Lempel-Ziv scheme (cf. [18, 19, 29]) and a lossy extension of the shortest common superstring problem (cf. [7, 30]). It is well known [4, 20] that rho_n(H) for the exact pattern occurrence converges almost surely to the probability P(H) of the pattern H. The same holds for the approximate pattern occurrence if one replaces P(H) by P(H), the total probability of the set H. Recently, Marton and Shields [20] proved that rho_n(H) for the exact pattern occurrence converges exponentially fast to P(H) for sources satisfying the so-called blow-up property (e.g., Markov sources, hidden Markov models, etc.). Our results extend the Marton and Shields result to approximate pattern occurrences (for the Bernoulli model; our results from [23] suggest that an extension to a Markovian model is possible). Such a rate of convergence is needed in some applications (cf. [19]).

This paper is organized as follows. In the next section we present our main results and their consequences. The proofs are delayed until the last section. Our derivation in Section 3.1 uses a combinatorial approach of languages that translates into relationships between the associated generating functions, and finally we use analytical tools in Section 3.2 to derive asymptotic results.

2 Main Results

Our goal is to estimate the frequency of overlapping approximate pattern occurrences in a random text. Let H be a set of any given patterns H_1, ..., H_M such that none contains another as a substring.
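The convergence rho_n(H) -> P(H) discussed above is easy to observe empirically. The sketch below is our own illustration, not the authors' code; it assumes the Hamming distance, the symmetric binary Bernoulli model, and the hypothetical helper names hamming_neighborhood and occurrences.

```python
import random

def hamming_neighborhood(h, k, alphabet="01"):
    """All strings within Hamming distance k of h (the set H of the text)."""
    neigh = {h}
    for _ in range(k):
        neigh |= {w[:i] + c + w[i + 1:]
                  for w in neigh for i in range(len(w)) for c in alphabet}
    return neigh

def occurrences(text, patterns):
    """Number of overlapping occurrences of any pattern from `patterns`."""
    m = len(next(iter(patterns)))
    return sum(text[i:i + m] in patterns for i in range(len(text) - m + 1))

random.seed(1)
H = hamming_neighborhood("1011", 1)        # radius-1 neighbourhood: 5 strings
m, n = 4, 100_000
text = "".join(random.choice("01") for _ in range(n))
rho = occurrences(text, H) / (n - m + 1)   # relative frequency rho_n(H)
P_H = len(H) / 2 ** m                      # P(H) = 5/16 in the symmetric model
print(abs(rho - P_H) < 0.02)               # rho_n(H) stays close to P(H)
```

With n = 100,000 symbols the observed relative frequency lands well within a couple of percent of P(H) = 5/16, in line with the almost-sure convergence cited above.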
This set may be the k-neighbourhood of a given pattern H when a distance d(., .) (such as the Hamming distance, the edit distance, etc.) is defined. We assume that the text string T is a realization of an independent, identically distributed (i.i.d.) sequence of random variables, such that a symbol s of the alphabet S occurs with probability P(s). This defines the so-called Bernoulli model. We denote by P(H_i) the probability of the string H_i occurring in the random text.

We find it convenient and useful to express our findings in terms of languages, i.e., sets of words. We associate with every language L a generating function defined as follows.

Definition 1 For any language L we define its generating function L(z) as

    L(z) = \sum_{w \in L} P(w) z^{|w|}                                     (1)

where P(w) is the probability of the word w, |w| is the length of w, and we adopt the usual convention that P(\epsilon) = 1.
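For a finite language, Definition 1 can be evaluated directly. A minimal sketch (ours, with a hypothetical lang_gf helper), computing L(z) at a numeric point under the Bernoulli model:

```python
from fractions import Fraction

def lang_gf(language, probs, z):
    """L(z) = sum over w in L of P(w) z^|w| (Definition 1), evaluated at a
    numeric point z; P(w) is the product of the symbol probabilities in the
    Bernoulli model, with P(epsilon) = 1 by the usual convention."""
    total = Fraction(0)
    for w in language:
        p = Fraction(1)
        for s in w:
            p *= probs[s]
        total += p * z ** len(w)
    return total

probs = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
L = {"", "a", "ab"}                     # a small example language
print(lang_gf(L, probs, Fraction(1)))   # 1 + 1/2 + 1/4 = 7/4
```

Exact rationals are used so that the coefficient identities appearing later can be checked without floating-point noise.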

It turns out that several properties of pattern occurrences depend on the so-called correlation polynomial, which is defined next.

Definition 2 Given two strings H and F, let A_{H,F} be the set of words

    A_{H,F} = { F_{k+1}^m : H_{m-k+1}^m = F_1^k }                           (2)

and let HF denote the set of positions k satisfying F_1^k = H_{m-k+1}^m. The correlation polynomial, denoted A_{H,F}(z), is defined as the associated generating function. When H is equal to F, A_{H,H}(z) is called the autocorrelation polynomial.

We shall work with matrices and vectors, so we adopt the following convention. Bold upper-case letters are reserved for vectors, which are assumed to be column vectors; e.g., U^t(z) = (U_1(z), ..., U_M(z)), where U_i(z) is the generating function of a language U_{H_i} (see the next section), and the upper index "t" denotes transpose. We use blackboard bold letters for matrices (e.g., A(z) = {A_{H_i,H_j}(z)}_{i,j=1,...,M}). In particular, we write I for the identity matrix, and 1 = (1, ..., 1)^t for the unit vector. Finally, we recall that (I - M)^{-1} = \sum_{k >= 0} M^k provided the inverse matrix exists (i.e., det(I - M) != 0, or ||M|| < 1 where ||.|| is any matrix norm).

In the sequel, we denote by O_n(H) (or simply O_n) a random variable representing the number of approximate occurrences of H in T. We have:

Definition 3 Let H be a set of patterns H = {H_i}_{i in {1,...,M}}.

(i) Let T be the language of words containing at least one occurrence from H, and, for any nonnegative integer r, let T_r be the language of words containing exactly r occurrences from H. We denote by T^{(r)}(z) its generating function, which becomes

    T^{(r)}(z) = \sum_{n >= 0} Pr{O_n(H) = r} z^n                           (3)

for |z| <= 1. In addition, we introduce a bivariate generating function as follows:

    T(z, u) = \sum_{r >= 1} T^{(r)}(z) u^r = \sum_{r >= 1} \sum_{n >= 0} Pr{O_n(H) = r} z^n u^r .   (4)

(ii) We define the correlation matrix of H as the matrix of correlation polynomials, i.e., A(z) = {A_{H_i,H_j}(z)}_{i,j=1,...,M}.

Remark: (i) When H_i does not overlap on its right end with H_j, the set A_{i,j} is empty and A_{i,j}(z) = 0.

(ii) It is worth noting that \epsilon belongs to A_{i,j} (the case k = m) if and only if H_i = H_j, i.e., i = j. Hence, the constant term of A_{i,j}(z) is 0 when i != j and 1 when i = j.

Now we are ready to summarize our main findings in the form of the two following theorems. The first theorem presents exact formulae for the generating functions

T^{(r)}(z) and T(z, u), and can be used to compute exactly parameters related to the pattern occurrence count O_n(H). In the second theorem, we provide asymptotic results for the probability Pr{O_n = r} for various ranges of r. All proofs are presented in the next section. The method of derivation extends the method presented in [23].

Theorem 2.1 Let H be a given set of patterns of length m, and let T be a random text of length n generated according to the Bernoulli model. The generating functions T^{(r)}(z) and T(z, u) can be computed as follows:

    T^{(r)}(z) = R^t(z) M(z)^{r-1} U(z) ,                                   (5)
    T(z, u) = R^t(z) u (I - u M(z))^{-1} U(z) ,                             (6)

where

    M(z) = (D(z) + (z - 1) I) [D(z)]^{-1} ,                                 (7)
    (I - M(z))^{-1} = A(z) + (z^m / (1 - z)) 1 H^t ,                        (8)
    U(z) = (1 / (1 - z)) (I - M(z)) 1 ,                                     (9)
    R^t(z) = (z^m / (1 - z)) H^t (I - M(z)) ,                               (10)

and

    D(z) = (1 - z) A(z) + z^m 1 H^t .                                       (11)

In the above, H = (P(H_1), ..., P(H_M))^t, and A(z) = {A_{H_i,H_j}(z)}_{i,j=1,...,M} is the matrix of the correlation polynomials of patterns from the set H.

The above theorem is the key to the next asymptotic results, which are derived in the next section using analytical tools.

Theorem 2.2 Let the hypotheses of Theorem 2.1 be fulfilled. We denote P(H) = \sum_{H_i \in H} P(H_i) = H^t 1.

(i) Moments. We obtain

    E O_n(H) = (n - m + 1) P(H) ,                                           (12)
    Var O_n(H) = (n - m + 1) [ P(H) + P^2(H) - 2m P^2(H) + 2 H^t (A(1) - I) 1 ]   (13)
                 + m(m - 1) P^2(H) - 2 H^t \dot{A}(1) 1 ,                   (14)

where \dot{A}(1) denotes the derivative of the matrix A(z) at z = 1.

(ii) Distribution, case r = O(1). Let rho_H be the root of det D(z) = 0 of smallest modulus outside the unit circle |z| <= 1. There exists rho > rho_H such that

    Pr{O_n(H) = r} = (-1)^{r+1} (a_{r+1} / r!) (n)_r rho_H^{-(n-m+r+1)}
                     + \sum_{j=1}^{r} (-1)^j a_j \binom{n}{j-1} rho_H^{-(n+j)} + O(rho^{-n}) ,   (15)

where (n)_r = n(n - 1) ... (n - r + 1) and

    a_{r+1} = H^t (D(rho_H) + (rho_H - 1) I)^{r-1} (D*(rho_H))^{r+1} 1 / (det' D(rho_H))^{r+1} ,   (16)

where D*(z) is the adjoint matrix of D(z), and det' D(rho_H) denotes the derivative of det D(z) at z = rho_H. The remaining coefficients a_j can be computed according to the following formula:

    a_j = (1 / (r + 1 - j)!) lim_{z -> rho_H} (d^{r+1-j} / dz^{r+1-j}) [ T^{(r)}(z) (z - rho_H)^{r+1} ]   (17)

for j = 1, 2, ..., r.

(iii) Distribution, case r = E O_n + x sqrt(Var O_n). Let x = O(1). Then

    Pr{O_n(H) = r} = (1 / sqrt(2 pi Var O_n)) e^{-x^2 / 2} (1 + O(1 / sqrt(n))) .   (18)

(iv) Distribution, case r = (1 + delta) E O_n with delta != 0. Define tau(t) to be the root of

    det(I - e^t M(e^{-tau})) = 0 ,                                          (19)

and omega_a to be the root of

    tau'(omega_a) = -a ,                                                    (20)

where a = 1 + delta. Then

    Pr{O_n(H) = r} = (1 / (omega_a sqrt(2 pi Var O_n))) e^{-(n-m+1) I(a)} (1 + O(1/n)) ,   (21)

where I(a) = a omega_a + tau(omega_a).

As mentioned before, the above results have an abundance of applications in information theory and molecular biology. For example, they can be used to estimate the relative frequency, defined as

    rho_n(H) = O_n(H) / (n - m + 1) .

The relative frequency appears in the definition of types and typical types (cf. [4]), and is often used to estimate information source statistics. The reader is referred to [23] for more details.

3 Analysis

The key element of our analysis is an algebraic derivation of the generating function T(z, u) presented in Theorem 2.1 (i.e., we express the languages characterizing T_r or T in terms of known languages for which one easily computes the associated generating functions). In general, the admissible relationships are the ones that easily translate into relationships between generating functions.
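Theorem 2.1 is easy to exercise in the single-pattern case. For M = 1 the matrices collapse to scalars and, by our reading of (5)-(11), T^{(r)}(z) = P(H) z^m (D(z) + z - 1)^{r-1} / D(z)^{r+1}. The sketch below (our own consistency check, not the authors' code) builds D(z) for H = "11" over the symmetric binary alphabet and verifies the Taylor coefficients of this formula against brute-force enumeration of all texts.

```python
from fractions import Fraction
from itertools import product

def s_mul(a, b, n):
    """Truncated product of two power series given as coefficient lists."""
    c = [Fraction(0)] * (n + 1)
    for i, ai in enumerate(a[:n + 1]):
        for j, bj in enumerate(b[:n + 1 - i]):
            c[i + j] += ai * bj
    return c

def s_inv(a, n):
    """Truncated inverse of a power series with a[0] != 0."""
    b = [Fraction(0)] * (n + 1)
    b[0] = 1 / a[0]
    for k in range(1, n + 1):
        b[k] = -sum(a[i] * b[k - i]
                    for i in range(1, min(k, len(a) - 1) + 1)) / a[0]
    return b

def autocorrelation(H, p):
    """Coefficients of A_{H,H}(z) (Definition 2) in the symmetric Bernoulli
    model with symbol probability p: each overlap position k with
    H[:k] == H[len(H)-k:] contributes p^(len(H)-k) z^(len(H)-k)."""
    A = [Fraction(0)] * len(H)
    for k in range(1, len(H) + 1):
        if H[:k] == H[len(H) - k:]:
            A[len(H) - k] += p ** (len(H) - k)
    return A

H, p, N, r = "11", Fraction(1, 2), 10, 2
m = len(H)
PH = p ** m
A = autocorrelation(H, p)                  # A(z) = 1 + z/2
D = [Fraction(0)] * (m + 1)                # D(z) = (1 - z) A(z) + z^m P(H)
for i, ai in enumerate(A):
    D[i] += ai
    D[i + 1] -= ai
D[m] += PH                                 # D(z) = 1 - z/2 - z^2/4
Dz1 = D[:]                                 # D(z) + z - 1
Dz1[0] -= 1
Dz1[1] += 1
# Scalar specialization of (5): T^(r)(z) = P(H) z^m (D + z - 1)^(r-1) / D^(r+1)
T = [PH if i == m else Fraction(0) for i in range(N + 1)]
for _ in range(r - 1):
    T = s_mul(T, Dz1, N)
Dinv = s_inv(D, N)
for _ in range(r + 1):
    T = s_mul(T, Dinv, N)

def brute(n, r):
    """Pr{O_n = r} by enumerating all binary texts of length n."""
    hits = sum(1 for t in product("01", repeat=n)
               if sum(t[i] == t[i + 1] == "1" for i in range(n - 1)) == r)
    return Fraction(hits, 2 ** n)

print(all(T[n] == brute(n, r) for n in range(2, N + 1)))
```

For instance, [z^3] T^{(2)}(z) = 1/8, matching the single length-3 text "111", which contains two overlapping occurrences of "11".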

3.1 Combinatorial Relationships Between Languages

We start with some definitions.

Definition 4 (i) For i, j in {1, ..., M} and r >= 1, we define the language M^{(r-1)}_{i,j} as

    M^{(r-1)}_{i,j} = { w : H_i w \in T_r and H_j occurs at the right end of w } .   (22)

We write M_{i,j} = M^{(1)}_{i,j}.

(ii) The language R_i is the set of words containing only one occurrence of H_i, located at the right end. We also define U_i as

    U_i = { u : H_i u \in T_1 } .                                           (23)

In other words, a word u is in U_i if the only occurrence from H in H_i u is the leading H_i.

We can now describe the languages T and T_r in terms of the languages just introduced. This will further lead to a simple formula for the generating function of O_n(H).

Theorem 3.1 The language T_r can be represented for any r >= 1 as follows:

    T_r = \sum_{i,j \in {1,...,M}} R_i M^{(r-1)}_{i,j} U_j .                (24)

The language T satisfies the fundamental equation

    T = \sum_{r >= 1} \sum_{i,j \in {1,...,M}} R_i M^{(r-1)}_{i,j} U_j .    (25)

Proof: We first prove (24) and obtain our decomposition of T_r as follows. Let the first occurrence of H in a word belonging to T_r be, say, H_i; it determines a prefix p of this word that is in R_i. Then one concatenates a non-empty word w that produces the second occurrence of H, say H_k; hence, w is in some M_{i,k}. This process is repeated r - 1 times, and we may assume the last occurrence is H_j; i.e., the word concatenated to the right of p is in M^{(r-1)}_{i,j}. Finally, one adds after the last H-occurrence a suffix u that does not produce a new occurrence of H. Equivalently, u is in U_j. A word belongs to T if and only if it belongs to T_r for some r >= 1, which proves (25).

We now prove the following result, which summarizes the relationships between the languages introduced in Definition 4. Below, we denote by +, -, and . the disjoint union, subtraction, and concatenation of languages, respectively. For the sake of clarity, we identify a singleton {w} with its unique element w.

Theorem 3.2 The languages M_{i,j}, U_i and R_j satisfy, for i, j in {1, ..., M}:

    \bigcup_{k >= 1} M^{(k)}_{i,j} = W . H_j + A_{i,j} - {\epsilon} ,        (26)
    U_i . S = \bigcup_j M_{i,j} + U_i - {\epsilon} ,                         (27)
    S . R_j - (R_j - H_j) = \bigcup_i H_i . M_{i,j} ,                        (28)

where W is the set of all words, S is the alphabet, and \epsilon is the empty word.

Proof: All the above relations are proved in a similar fashion. We first deal with (26). Let w be in W . H_j, and let k + 1 be the number of subwords of H_i w that are in H. Certainly, this number is greater than or equal to 2, and the last occurrence, say H_j, is at the right end of H_i w. This implies that w is in M^{(k)}_{i,j}. Furthermore, a word w in \bigcup_{k >= 1} M^{(k)}_{i,j} fails to be in W . H_j iff its size |w| is smaller than |H_j|. Then the rightmost H-occurrence in H_i w overlaps with H_i, which means that w is in A_{i,j}. Reciprocally, any word in A_{i,j} qualifies, except the empty word when it belongs to the set. Although \epsilon is not in A_{i,j} when i != j, our set expression remains correct; namely, A_{i,j} - {\epsilon} = A_{i,j} when \epsilon is not in A_{i,j}.

Let us now turn to (27). When one adds a character s right after a word u from U_i, two cases may occur. Either H_i u s still does not contain a second occurrence of H, which means that u s is a non-empty word of U_i; or a new element of H appears, say H_j, necessarily at the right end, and then u s is in M_{i,j}. This gives the left-to-right inclusion. Furthermore, any non-empty word of U_i - {\epsilon} is in U_i . S, and a strict prefix of a word w in M_{i,j} cannot contain any H-occurrence beyond the leading H_i; hence, this prefix is in U_i and w is in U_i . S.

We now prove (28). Let x = s w be a word in H_i . M_{i,j}, where s is a symbol from S. As x contains exactly two occurrences of H, with H_i located at its left end and H_j located at its right end, w is in R_j and x is in S . R_j - R_j. Reciprocally, if a word s w H_j from S . R_j is not in R_j, then s w H_j contains a second H-occurrence, say H_i. As w H_j is in R_j, the only possible position is at the left end, and then x is in H_i . M_{i,j}. We now rewrite

    S . R_j - R_j = S . R_j - (R_j \cap S . R_j) = S . R_j - (R_j - H_j) ,

which completes the proof.

We are now ready to prove (7)-(10).
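Before translating these relations into generating functions, they can be sanity-checked mechanically on small cases. The sketch below is our own check (not from the paper): it specializes to exact matching of a single pattern, so M = 1 and the indices disappear, enumerates all binary words up to a fixed length, and verifies relation (27), U . S = M_{1,1} + (U - {epsilon}).

```python
from itertools import product

H, S, L = "11", "01", 7          # single exact pattern (M = 1), binary alphabet

def occ(t):
    """Overlapping occurrences of H in t."""
    return sum(t[i:i + len(H)] == H for i in range(len(t) - len(H) + 1))

words = [""] + ["".join(w) for n in range(1, L + 1)
                for w in product(S, repeat=n)]

U = {u for u in words if occ(H + u) == 1}            # Definition 4(ii)
M11 = {w for w in words if w and occ(H + w) == 2     # Definition 4(i): M_{1,1}
       and (H + w).endswith(H)}

# Relation (27): U . S = M_{1,1} + (U - {epsilon}), checked up to length L.
lhs = {u + s for u in U if len(u) < L for s in S}
rhs = M11 | (U - {""})
print(lhs == rhs)
```

The check passes because membership in U and M_{1,1} depends only on the word itself, so restricting both sides to words of length at most L preserves the equality.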
We make use of a few rules, namely: the disjoint union + and the concatenation . become the sum and product operations on generating functions. In other words, the union language L = L_1 + L_2 translates into the generating function L(z) = L_1(z) + L_2(z) whenever L_1 \cap L_2 = \emptyset, and the generating function of L = L_1 . L_2 is L(z) = L_1(z) L_2(z) for the Bernoulli model (cf. [23] for the extension to the Markov model). We only prove (8), to provide the general scheme. We rewrite the language relationship (27) from Theorem 3.2 as U_i . S - (U_i - \epsilon) = \bigcup_{j=1}^{M} M_{i,j} for any i in {1, ..., M}. The left side of this equation yields U_i(z)(z - 1) + 1. The right-hand side is the sum of the terms of the

i-th row of the matrix M(z), i.e., the i-th entry of the column vector M(z) 1. As the result holds for any i, we obtain the corresponding equation between two column vectors. Finally, (5) in Theorem 2.1 is a direct consequence of (25).

3.2 Moments and Limiting Distribution

In this final subsection we derive the first two moments of O_n, as well as asymptotics of Pr{O_n = r} for different ranges of r; that is, we prove Theorem 2.2. Actually, we should mention that, using general results from renewal theory, one immediately guesses that the limiting distribution must be normal for r = E O_n + O(sqrt(n)); the challenge here is to estimate the variance precisely. Our approach offers an easy, uniform, and precise derivation of all the moments, including the variance, as well as local limit distributions (including the convergence rate) for the central and large deviations regimes.

A. Moments

First of all, from Theorem 2.1 we conclude that

    T'(z, 1) = z^m H^t 1 / (1 - z)^2 = \sum_{H_i \in H} P(H_i) z^m / (1 - z)^2 ,     (29)

    T''(z, 1) = 2 z^{2m} (H^t 1)^2 / (1 - z)^3 + 2 z^m H^t (A(z) - I) 1 / (1 - z)^2
              = 2 ( \sum_{H_i \in H} P(H_i) z^m )^2 / (1 - z)^3
                + 2 z^m ( \sum_{i,j} P(H_i) A_{i,j}(z) - \sum_{H_i \in H} P(H_i) ) / (1 - z)^2 .   (30)

Indeed, we observe that T'(z, u) = R^t(z) (I - u M(z))^{-2} U(z). Setting u = 1 and using (7)-(10) directly leads to (29). Establishing (30) works the same way, with a little more algebra. Notably, one often uses (I - M(z))^{-1} = (1 - z)^{-1} D(z) (cf. (8)).

Now, we observe that both expressions have as numerator a function that is entire beyond the unit circle, and a power of 1/(1 - z) as denominator. We make use of the following basic formula:

    [z^n] (1 - z)^{-p} = \Gamma(n + p) / (\Gamma(p) \Gamma(n + 1)) .        (31)

To obtain E O_n we proceed as follows:

    E O_n = [z^n] T'(z, 1) = \sum_{H_i \in H} P(H_i) [z^{n-m}] (1 - z)^{-2} = (n - m + 1) \sum_{H_i \in H} P(H_i) .

Computation of the variance is a little more intricate by hand, but straightforward with a symbolic computation system. For any analytic function phi(z), one may rewrite [z^n] phi(z) (z - 1)^{-p} as a function of the p - 1 derivatives of phi at z = 1. We easily obtain our formula for the variance (cf. (14) of Theorem 2.2).

B. Asymptotic Results

We now establish Theorem 2.2; that is, we compute Pr{O_n = r} for different ranges of r. We first observe (cf. [11]) that, for any matrix P,

    [P(z)]^{-1} = P*(z) / det P(z) ,

where P*(z) is the adjoint matrix of P(z). Since every entry of A(z) is a polynomial, we conclude that Delta(z) = det D(z) is a polynomial. It follows easily that D*(z), Delta(z) M(z) and Delta(z)(I - u M(z)) are matrices of polynomials. Hence, T(z, u) rewrites as

    T(z, u) = Omega(z, u) / det( Delta(z)(I - u M(z)) ) ,                   (32)

where Omega and the determinant are polynomials in z and u. This allows one to apply standard results [1, 12]; e.g., Levy's theorem applies, and the number of occurrences is asymptotically Gaussian. This proves Theorem 2.2(iii). Details of this general scheme can be found in [23]. Similarly, observing that the conditions of the Gartner-Ellis theorem [5] are fulfilled, we prove Theorem 2.2(iv). Detailed computations can be found in [23].

We now consider r = O(1), and turn our attention to formula (5) of Theorem 2.1. From (32), this generating function rewrites as

    T^{(r)}(z) = Omega_r(z) / Delta^{r+1}(z) ,

where Omega_r(z) is a polynomial. To establish an asymptotic expression for Pr{O_n = r} one needs to extract the coefficient of z^n in T^{(r)}(z). By Hadamard's theorem (cf. [24]) we conclude that the asymptotics of the coefficients of T^{(r)}(z) depend on the singularities of T^{(r)}(z), i.e., the roots of Delta(z). As Delta(z) is a polynomial, the set of roots is non-empty. Let rho_H be the root of smallest modulus outside |z| <= 1. It is a singularity of order r + 1. In particular, Delta(z) = (z - rho_H) Delta'(rho_H) + O((z - rho_H)^2). The rest of the derivation follows exactly our footsteps from [23], so we refer the reader to it for details.

It is worth noticing that the above results are asymptotic in n. For finite and practical n, one must use our exact results and numerical algorithms; to avoid numerical instability, we observe that (32) guarantees that Pr{O_n = r} satisfies a linear recurrence that may be solved easily numerically. This was realized using the Maple package Gfun for large practical n.
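The numerically stable coefficient extraction just mentioned can be sketched as follows: since T^{(r)}(z) is rational, its Taylor coefficients satisfy the linear recurrence read off from the denominator. The code below is our own illustration (not the authors' Maple/Gfun session), specialized to the exact pattern "11" in a symmetric binary text, where T^{(1)}(z) = (z^2/4) / D(z)^2 with D(z) = 1 - z/2 - z^2/4.

```python
from fractions import Fraction

def rational_coeffs(num, den, N):
    """Taylor coefficients c_0..c_N of num(z)/den(z): the identity
    den(z) C(z) = num(z) gives the linear recurrence
    c_n = (num_n - sum_{k>=1} den_k c_{n-k}) / den_0,
    which is stable even for very large n."""
    c = []
    for n in range(N + 1):
        s = num[n] if n < len(num) else Fraction(0)
        for k in range(1, min(n, len(den) - 1) + 1):
            s -= den[k] * c[n - k]
        c.append(s / den[0])
    return c

# Pr{O_n = 1} for the exact pattern "11" in a symmetric binary text:
# T^(1)(z) = (z^2/4) / D(z)^2 with D(z) = 1 - z/2 - z^2/4.
num = [Fraction(0), Fraction(0), Fraction(1, 4)]
den = [Fraction(1), Fraction(-1), Fraction(-1, 4),
       Fraction(1, 4), Fraction(1, 16)]          # coefficients of D(z)^2
pr = rational_coeffs(num, den, 1000)
print(pr[3], pr[4])   # Pr{O_3 = 1} = 1/4, Pr{O_4 = 1} = 5/16
```

Both values agree with direct enumeration: 2 of the 8 length-3 texts and 5 of the 16 length-4 texts contain exactly one occurrence of "11".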
Applications to the search for exceptional patterns in the yeast genome are currently being conducted.

ACKNOWLEDGEMENT

It is our pleasure to acknowledge several discussions with B. Salvy and E. Coward on the numerical computation.

References

[1] E. Bender, Central and Local Limit Theorems Applied to Asymptotic Enumeration, J. Combin. Theory Ser. A, 15, 91-111, 1973.

[2] C. Chrysaphinou and S. Papastavridis, The Occurrence of Sequence of Patterns in Repeated Dependent Experiments, Theory of Probability and Applications, 167-173, 1990.

[3] M. Crochemore and W. Rytter, Text Algorithms, Oxford University Press, New York, 1995.

[4] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic Press, New York, 1981.

[5] R. Ellis, Large Deviations for a General Class of Random Vectors, Ann. Probab., 1-12, 1984.

[6] W. Feller, An Introduction to Probability and its Applications, Vol. 1, John Wiley & Sons, New York, 1968.

[7] A. Frieze and W. Szpankowski, Greedy Algorithms for the Shortest Common Superstring That Are Asymptotically Optimal, Proc. European Symposium on Algorithms, Springer LNCS No. 1136, 194-207, Barcelona, 1996.

[8] I. Fudos, E. Pitoura and W. Szpankowski, On Pattern Occurrences in a Random Text, Information Processing Letters, 57, 307-312, 1996.

[9] L. Guibas and A. Odlyzko, Maximal Prefix-Synchronized Codes, SIAM J. Appl. Math., 35, 401-418, 1978.

[10] L. Guibas and A. W. Odlyzko, String Overlaps, Pattern Matching, and Nontransitive Games, J. Combin. Theory Ser. A, 30, 183-208, 1981.

[11] R. Horn and C. Johnson, Matrix Analysis, Cambridge University Press, Cambridge, 1985.

[12] H-K. Hwang, Théorèmes Limites Pour les Structures Combinatoires et les Fonctions Arithmétiques, Thèse de Doctorat de l'École Polytechnique, 1994.

[13] P. Jacquet and W. Szpankowski, Autocorrelation on Words and Its Applications. Analysis of Suffix Trees by String-Ruler Approach, J. Combin. Theory Ser. A, 66, 237-269, 1994.

[14] P. Jokinen and E. Ukkonen, Two Algorithms for Approximate String Matching in Static Texts, Proc. MFCS 91, Lecture Notes in Computer Science 520, 240-248, Springer-Verlag, 1991.

[15] D. E. Knuth, The Art of Computer Programming: Fundamental Algorithms, Vol. 1, Addison-Wesley, Reading, 1973.

[16] S. R. Li, A Martingale Approach to the Study of Occurrences of Sequence Patterns in Repeated Experiments, Ann. Probab., 8, 1171-1176, 1980.

[17] M. Lothaire, Combinatorics on Words, Addison-Wesley, Reading, Mass., 1982.

[18] T. Luczak and W. Szpankowski, A Lossy Data Compression Based on String Matching: Preliminary Analysis and Suboptimal Algorithms, Proc. Combinatorial Pattern Matching, Asilomar, LNCS 807, 102-112, Springer-Verlag, 1994.

[19] T. Luczak and W. Szpankowski, A Suboptimal Lossy Data Compression Based on Approximate Pattern Matching, 1996 International Symposium on Information Theory, Whistler, 1996; also Purdue University CSD-TR-94-072, 1994.

[20] K. Marton and P. Shields, The Positive-Divergence and Blowing-up Properties, Israel J. Math., 80, 331-348, 1994.

[21] A. Odlyzko, Asymptotic Enumeration, in Handbook of Combinatorics, Vol. II (Eds. R. Graham, M. Grötschel and L. Lovász), Elsevier Science, 1995.

[22] B. Prum, F. Rodolphe, and E. Turckheim, Finding Words with Unexpected Frequencies in Deoxyribonucleic Acid Sequences, J. R. Stat. Soc. B, 57, 205-220, 1995.

[23] M. Régnier and W. Szpankowski, A Last Word on Pattern Frequency Occurrence in a Markovian Sequence?, Purdue University, CSD-TR-96-043, 1996. Preliminary draft at ISIT'97.

[24] R. Remmert, Theory of Complex Functions, Springer-Verlag, New York, 1991.

[25] S. Schbath, Étude Asymptotique du Nombre d'Occurrences d'un Mot dans une Chaîne de Markov et Application à la Recherche de Mots de Fréquence Exceptionnelle dans les Séquences d'ADN, Thèse, Université René Descartes, Paris V, 1995.

[26] W. Szpankowski, Asymptotic Properties of Data Compression and Suffix Trees, IEEE Trans. Information Theory, 39, 1647-1659, 1993.

[27] W. Szpankowski, A Generalized Suffix Tree and Its (Un)Expected Asymptotic Behaviors, SIAM J. Computing, 22, 1176-1198, 1993.

[28] M. Waterman, Introduction to Computational Biology, Chapman & Hall, New York, 1995.

[29] E. H. Yang and J. Kieffer, On the Performance of Data Compression Algorithms Based upon String Matching, preprint, 1995.

[30] Z. Zhang and E. Yang, An On-Line Universal Lossy Data Compression Algorithm via Continuous Codebook Refinement, Part II: Optimality for Phi-Mixing Source Models, IEEE Trans. Information Theory, 42, 822-836, 1996.