Morphisms on Amiable Words

Adrian Atanasiu*

Faculty of Mathematics and Computer Science, Bucharest University, Str. Academiei 14, Bucharest 010014, Romania

Abstract. Using the fact that the Parikh matrix mapping is not an injective mapping, the paper investigates some properties of the set of words having the same Parikh matrix; these words are called "amiable" or "M-equivalent". The aim is to reduce the number of amiable words using a morphism which provides additional information about them.

1 Introduction

The idea of identifying binary sequences from the number of a's and b's they contain is quite old. Unfortunately this information (given by the Parikh mapping associated with the sequence) is insufficient: there are $\binom{x+y}{x}$ binary sequences $\alpha$ with $|\alpha|_a = x$ and $|\alpha|_b = y$ (we denote by $|\alpha|_w$ the number of occurrences of $w$ in $\alpha$ as a scattered subword). Once the Parikh matrix mapping ([6]) had been defined, and especially once the Parikh matrix mapping associated with binary sequences ([2]) had been studied (where $|\alpha|_{ab}$ is also taken into consideration), the number of sequences sharing the same characteristics decreased drastically. For instance, of the 184756 binary sequences $\alpha$ with $|\alpha|_a = 10$ and $|\alpha|_b = 10$, only 5448 have $|\alpha|_{ab} = 50$; moreover, this is the most unfavourable choice (for other situations see the Examples in [2] and [3]). This represents almost 3% of all 184756 possible strings. Because this number is still quite large, the possibility of identifying the sequences by this procedure is limited (especially for balanced sequences, where $|\alpha|_a$ is almost equal to $|\alpha|_b$).

A remarkable improvement seems to be the use of morphisms that distinguish amiable binary words by the Parikh matrices of their images. To this aim, Section 3 of the paper analyses the morphisms $\varphi : \Sigma_1^* \to \Sigma_2^*$ where $\Sigma_1$ is a binary alphabet and $\Sigma_2$ has at most 3 elements. Such a morphism (the Istrail morphism) was studied in [3]. As a result, these morphisms separate the words in many classes of amiable words, but not in all of them. There are some words, like abbabaab and baababba, that remain amiable under any morphism with $|\Sigma_2| \le 3$.

The last section proposes the construction of a morphism over a general alphabet that is able to separate two arbitrary amiable words. The only problem which remains to be solved is the size of the alphabet $\Sigma_2$, which increases with the length of the analyzed words.

* e-mail: aadrian@gmail.com


2 Definitions and basic results

The Parikh matrix mapping (introduced in [6]) is an extension of the Parikh mapping ([8]). This extension is based on a special type of matrices, in which the classical Parikh vector appears as the second diagonal (see footnote 2).

We start with some basic notations and definitions. Let $\mathbb{Z}$ be the set of integers and $\Sigma$ be a nonempty finite alphabet whose elements are totally ordered; namely, for $a_i, a_j \in \Sigma$ we have $a_i < a_j$ iff $i < j$. We denote by $|\Sigma|$ the number of elements of the set $\Sigma$. The set of all words over $\Sigma$ is $\Sigma^*$, with $\varepsilon$ as the empty word. Finally, for $\alpha \in \Sigma^*$, $|\alpha|$ denotes the length of $\alpha$.

The number of occurrences of a character $a \in \Sigma$ in a word $\alpha \in \Sigma^*$ is denoted by $|\alpha|_a$. If $u, v \in \Sigma^*$, then the word $u$ is a (scattered) subword of $v$ if $u = \beta_1 \beta_2 \ldots \beta_r$ and $v = \gamma_0 \beta_1 \gamma_1 \ldots \gamma_{r-1} \beta_r \gamma_r$ for some $r \ge 1$ and $\beta_i, \gamma_j \in \Sigma^*$. We denote by $|\alpha|_u$ the number of occurrences of $u$ in $\alpha$ as a subword. For instance, $|abab|_{ab} = 3$.
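The subword count $|\alpha|_u$ defined above can be computed by a standard dynamic programme over the prefixes of $u$. The sketch below is only an illustration; the function name `count_subword` is our own and does not come from the cited papers.

```python
def count_subword(word, u):
    """Return |word|_u, the number of occurrences of u in word as a scattered subword."""
    # ways[k] = number of embeddings of the prefix u[:k] into the letters read so far
    ways = [1] + [0] * len(u)
    for ch in word:
        # Scan k from right to left so that ch only extends embeddings
        # that ended strictly before it.
        for k in range(len(u), 0, -1):
            if u[k - 1] == ch:
                ways[k] += ways[k - 1]
    return ways[len(u)]


print(count_subword("abab", "ab"))  # 3, the example from the text
```

With this convention the empty word occurs exactly once in every word, which matches the unit diagonal of the matrices introduced next.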

If $A$ and $B$ are two finite nonempty sets, a morphism on $A$ is a mapping $f : A^* \to B^*$ such that $f(uv) = f(u)f(v)$ for all $u, v \in A^*$. It is uniquely determined by its values on the alphabet $A$. The set $A$ is the domain of $f$, and $B$ is the codomain.

Definition 1. Let $\Sigma = \{a_1, a_2, \ldots, a_s\}$ be an ordered alphabet and $\mathcal{M}_{s+1}$ be the multiplicative monoid of $(s+1)$-dimensional upper-triangular matrices with nonnegative integer entries and unit diagonal. The Parikh matrix mapping, denoted $\Psi_s$, is the morphism

$$\Psi_s : \Sigma^* \to \mathcal{M}_{s+1}$$

defined by the condition: if $k = 1, \ldots, s$ and $\Psi_s(a_k) = (m_{i,j})_{1 \le i,j \le s+1}$, then $m_{i,i} = 1$ for each $1 \le i \le s+1$, $m_{k,k+1} = 1$, and all other elements of the matrix $\Psi_s(a_k)$ are 0.

Because in general the value of $s$ is fixed, there will be no confusion if we denote $\Psi_s(\alpha)$ by $M_\alpha$.

A matrix $M \in \mathcal{M}_{s+1}$ such that $M = M_\alpha$ for a particular word $\alpha \in \Sigma^*$ is called a Parikh matrix.

In [6] some basic properties of Parikh matrices are detailed. The following result will be needed in the sequel.

Theorem 2. ([6]) Consider $\Sigma = \{a_1, a_2, \ldots, a_s\}$ and $\alpha \in \Sigma^*$. The matrix $M_\alpha = \Psi_s(\alpha) = (m_{i,j})_{1 \le i,j \le s+1}$ has the following properties:

– $m_{i,j} = 0$ for all $1 \le j < i \le s+1$,
– $m_{i,i} = 1$ for all $1 \le i \le s+1$,
– $m_{i,j+1} = |\alpha|_{a_i \ldots a_j}$ for all $1 \le i \le j \le s$.

Footnote 2. By the second diagonal of an $(s+1) \times (s+1)$ matrix $M$ we mean the diagonal of length $s$ immediately above the main diagonal.

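Example 1. For the alphabet $\Sigma = \{a, b, c\}$, Theorem 2 implies that

$$M_\alpha = \begin{pmatrix} 1 & |\alpha|_a & |\alpha|_{ab} & |\alpha|_{abc} \\ 0 & 1 & |\alpha|_b & |\alpha|_{bc} \\ 0 & 0 & 1 & |\alpha|_c \\ 0 & 0 & 0 & 1 \end{pmatrix}$$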
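As an illustration of Definition 1 and Theorem 2, the following sketch builds $M_\alpha$ as a product of the generator matrices $\Psi_s(a_k)$ and compares it with the matrix of subword counts. All function names are ours, and the alphabet is passed as a string whose character order is assumed to be the order on $\Sigma$.

```python
def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]


def generator(k, s):
    """Psi_s(a_k): the identity of size s+1 with an extra 1 in row k, column k+1 (1-based)."""
    m = identity(s + 1)
    m[k - 1][k] = 1
    return m


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)] for i in range(n)]


def parikh_matrix(word, alphabet):
    """M_word as the product of the generators, read from left to right (Definition 1)."""
    s = len(alphabet)
    m = identity(s + 1)
    for ch in word:
        m = mat_mul(m, generator(alphabet.index(ch) + 1, s))
    return m


def count_subword(word, u):
    """|word|_u, number of occurrences of u in word as a scattered subword."""
    ways = [1] + [0] * len(u)
    for ch in word:
        for k in range(len(u), 0, -1):
            if u[k - 1] == ch:
                ways[k] += ways[k - 1]
    return ways[len(u)]


def matrix_from_counts(word, alphabet):
    """The matrix described by Theorem 2: entry (i, j+1) is |word|_{a_i ... a_j}."""
    s = len(alphabet)
    m = identity(s + 1)
    for i in range(s):
        for j in range(i, s):
            m[i][j + 1] = count_subword(word, alphabet[i:j + 1])
    return m


print(parikh_matrix("abcab", "abc"))
print(parikh_matrix("abcab", "abc") == matrix_from_counts("abcab", "abc"))
```

Two words are amiable in the sense of the next definition exactly when `parikh_matrix` returns the same matrix for both.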

Definition 3. Two words $\alpha, \beta \in \Sigma^*$ are called amiable, denoted $\alpha \sim_a \beta$, if and only if $M_\alpha = M_\beta$ (see footnote 3).

For further notions and results on the Parikh matrix mapping, the reader is referred to [2], [4], [5], [7], and the references given therein.

3 The morphisms with codomains composed of 1, 2 or 3 characters

Let $\Sigma_1, \Sigma_2$ be two finite ordered alphabets, and $\varphi : \Sigma_1^* \to \Sigma_2^*$ be a morphism. All results obtained in this section are for a binary alphabet denoted by $\Sigma_1 = \{a, b\}$.

Let us consider $|\Sigma_2| = k$; we will detail here only the cases $k = 1, 2, 3$. Moreover, we shall work under the assumption:

$$(\forall x \in \Sigma_2, \exists a \in \Sigma_1) \; [|\varphi(a)|_x > 0]$$

3.1 The case k = 1

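In this case the following result is immediate:

Lemma 4. $(\forall \alpha, \beta \in \Sigma_1^*) [\alpha \sim_a \beta \implies \varphi(\alpha) \sim_a \varphi(\beta) \implies |\varphi(\alpha)| = |\varphi(\beta)|]$.

So no morphism $\varphi$ offers significant information about the words $\alpha$ and $\beta$ (only a relation between their lengths).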

3.2 The case k = 2

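Let $\Sigma_2 = \{x, y\}$ be the codomain of the morphism $\varphi$.

Theorem 5. $(\forall \alpha, \beta \in \Sigma_1^*) [\alpha \sim_a \beta \implies \varphi(\alpha) \sim_a \varphi(\beta)]$.

Proof. Let us consider the common Parikh matrix of $\alpha$ and $\beta$ to be

$$M_\alpha = M_\beta = \begin{pmatrix} 1 & n & q \\ 0 & 1 & p \\ 0 & 0 & 1 \end{pmatrix}$$

Now, it is easy to prove that for every word $\gamma \in \Sigma_1^*$ and for every character $u \in \Sigma_2$,

$$|\varphi(\gamma)|_u = \sum_{r \in \Sigma_1} |\gamma|_r \cdot |\varphi(r)|_u$$

Because $|\alpha|_r = |\beta|_r$ for all $r \in \Sigma_1$, we therefore obtain $|\varphi(\alpha)|_v = |\varphi(\beta)|_v$ for all characters $v \in \Sigma_2$ (that is, the words $\varphi(\alpha)$ and $\varphi(\beta)$ have the same Parikh vector).

It remains to check the equality $|\varphi(\alpha)|_{xy} = |\varphi(\beta)|_{xy}$. We have

$$|\varphi(\gamma)|_{xy} = \sum_{r \in \Sigma_1} |\gamma|_r \cdot |\varphi(r)|_{xy} + \sum_{r,t \in \Sigma_1} |\gamma|_{rt} \cdot |\varphi(r)|_x \cdot |\varphi(t)|_y$$

Because $\{r, t\} \subseteq \Sigma_1 = \{a, b\}$, the following four cases can appear in the second sum:

1. $rt = ab$. Then $|\alpha|_{ab} = |\beta|_{ab} = q$;
2. $rt = ba$. Then (see [2]) $|\alpha|_{ba} = |\beta|_{ba} = n \cdot p - q$;
3. $rt = aa$. Then $|\alpha|_{aa} = |\beta|_{aa} = n \cdot (n-1)/2$;
4. $rt = bb$. Then $|\alpha|_{bb} = |\beta|_{bb} = p \cdot (p-1)/2$.

In all four cases the coefficient $|\gamma|_{rt}$ takes the same value for $\gamma = \alpha$ and $\gamma = \beta$, and the first sum depends only on the Parikh vector of $\gamma$; hence $|\varphi(\alpha)|_{xy} = |\varphi(\beta)|_{xy}$.

Footnote 3. In [7] the term "ambiguous words" is used.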
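The two identities used in the proof can be checked numerically. The sketch below is only a brute-force illustration over short binary words: the bound `max_len`, the sample morphisms and all function names are our own assumptions. It verifies the formula for $|\varphi(\gamma)|_{xy}$ and the conclusion of Theorem 5.

```python
from itertools import product


def count_subword(word, u):
    """|word|_u, number of occurrences of u in word as a scattered subword."""
    ways = [1] + [0] * len(u)
    for ch in word:
        for k in range(len(u), 0, -1):
            if u[k - 1] == ch:
                ways[k] += ways[k - 1]
    return ways[len(u)]


def image(phi, word):
    return "".join(phi[ch] for ch in word)


def xy_by_formula(gamma, phi):
    """Right-hand side of the formula for |phi(gamma)|_xy in the proof of Theorem 5."""
    same_block = sum(count_subword(gamma, r) * count_subword(phi[r], "xy") for r in "ab")
    two_blocks = sum(
        count_subword(gamma, r + t) * count_subword(phi[r], "x") * count_subword(phi[t], "y")
        for r in "ab" for t in "ab"
    )
    return same_block + two_blocks


def check(phi, max_len=6):
    """True if the formula holds and amiable words have amiable images (words up to max_len)."""
    seen = {}  # Parikh data of gamma -> Parikh data of phi(gamma)
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            gamma = "".join(letters)
            img = image(phi, gamma)
            if xy_by_formula(gamma, phi) != count_subword(img, "xy"):
                return False
            key = tuple(count_subword(gamma, u) for u in ("a", "b", "ab"))
            val = tuple(count_subword(img, u) for u in ("x", "y", "xy"))
            if seen.setdefault(key, val) != val:
                return False
    return True


print(check({"a": "xyx", "b": "yx"}))  # True
```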

Remark.

1. The converse of Theorem 5 is not true. Let us define the morphism $\varphi(a) = xy$, $\varphi(b) = yx$, and the words $\alpha = ab$, $\beta = ba$. We have $\alpha \not\sim_a \beta$ but $\varphi(\alpha) \sim_a \varphi(\beta)$.

2. If $|\Sigma_1| > 2$ then Theorem 5 is not true. For example, let us consider $\Sigma_1 = \{a, b, c\}$, $\Sigma_2 = \{x, y\}$ and the morphism

$$\varphi(a) = x, \quad \varphi(b) = \varphi(c) = y.$$

Then $\alpha = cabc$, $\beta = acbc$ are two amiable words, but $\varphi(\alpha) = yxyy$ and $\varphi(\beta) = xyyy$ are not amiable, because $|\varphi(\alpha)|_{xy} = 2$ and $|\varphi(\beta)|_{xy} = 3$. Therefore, for a non-binary domain alphabet, some amiable words can be distinguished by using appropriate morphisms. If the domain is a binary alphabet, a morphism with a binary codomain cannot distinguish two amiable words.
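Both counterexamples of the Remark can be replayed mechanically. In the sketch below the helper names are ours, and the Parikh matrix is computed directly from subword counts, as allowed by Theorem 2.

```python
def count_subword(word, u):
    """|word|_u, number of occurrences of u in word as a scattered subword."""
    ways = [1] + [0] * len(u)
    for ch in word:
        for k in range(len(u), 0, -1):
            if u[k - 1] == ch:
                ways[k] += ways[k - 1]
    return ways[len(u)]


def parikh_matrix(word, alphabet):
    """Entry (i, j), 0-based with j >= i, is |word|_{alphabet[i:j]}; the empty word counts once."""
    s = len(alphabet)
    return tuple(
        tuple(count_subword(word, alphabet[i:j]) if j >= i else 0 for j in range(s + 1))
        for i in range(s + 1)
    )


def amiable(u, v, alphabet):
    return parikh_matrix(u, alphabet) == parikh_matrix(v, alphabet)


def image(phi, word):
    return "".join(phi[ch] for ch in word)


# Remark, item 1: non-amiable words with amiable images
phi1 = {"a": "xy", "b": "yx"}
print(amiable("ab", "ba", "ab"))                              # False
print(amiable(image(phi1, "ab"), image(phi1, "ba"), "xy"))    # True

# Remark, item 2: amiable ternary words with non-amiable images
phi2 = {"a": "x", "b": "y", "c": "y"}
print(amiable("cabc", "acbc", "abc"))                          # True
print(amiable(image(phi2, "cabc"), image(phi2, "acbc"), "xy")) # False
```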

3.3 The case k = 3

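Theorem 6. Let $\Sigma_2 = \{x, y, z\}$ be an alphabet (see footnote 4) and $\varphi : \Sigma_1^* \to \Sigma_2^*$ a morphism. Then

$$(\forall \alpha, \beta \in \Sigma_1^*) \quad \alpha \sim_a \beta \implies M_{\varphi(\alpha)} - M_{\varphi(\beta)} = \begin{pmatrix} 0 & 0 & 0 & r \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \quad r \in \mathbb{Z}$$

Footnote 4. It is possible for $\Sigma_1$ and $\Sigma_2$ to be defined with different orders. For example, $\Sigma_1 = \{a < b\}$, $\Sigma_2 = \{b < c < a\}$.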

Proof. The proof is similar to the proof of Theorem 5.

The Remark of Section 3.2 also holds for Theorem 6.

Since $r \in \mathbb{Z}$ in Theorem 6 can take different values, many amiable words can be separated. As an illustration see the next Example.

Example 2. ([2],[3]) Let $\Sigma_1 = \{a, b\}$, $\Sigma_2 = \{a, b, c\}$ be two alphabets, and let the (Istrail) morphism be defined by

$$\varphi(a) = abc, \quad \varphi(b) = ac.$$

Let us consider Example 3 from [2], where all words with Parikh vector $\Psi = (19, 2)$ are listed. After applying the Istrail morphism, there are no two distinct amiable words $\alpha, \beta \in \Sigma_1^*$ with $\varphi(\alpha) \sim_a \varphi(\beta)$.

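$$\begin{array}{l|rrrrrrrrrrrrrrrrrrr}
|\alpha|_{ab} & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 & 16 & 17 & 18 \\ \hline
|C_\alpha| & 1 & 1 & 2 & 2 & 3 & 3 & 4 & 4 & 5 & 5 & 6 & 6 & 7 & 7 & 8 & 8 & 9 & 9 & 10 \\
\#\varphi & 1 & 1 & 2 & 2 & 3 & 3 & 4 & 4 & 5 & 5 & 6 & 6 & 7 & 7 & 8 & 8 & 9 & 9 & 10
\end{array}$$

Table 1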

Table 1 covers all binary words $\alpha$ with the Parikh vector $\Psi = (19, 2)$ (only the first half of Table 1 is constructed; according to [2], Lemma 1, the second half is a reflected copy of the first half).

For every value of $q = |\alpha|_{ab}$, where $\alpha \in \Sigma_1^*$, the second row of the table shows the number of amiable words in the set $C_\alpha = \{w \mid w \sim_a \alpha\}$, that is, the words having the Parikh matrix

$$M = \begin{pmatrix} 1 & 19 & q \\ 0 & 1 & 2 \\ 0 & 0 & 1 \end{pmatrix}$$

The third row gives the number of classes of amiable words into which the set $X = \{\varphi(w) \mid w \in C_\alpha\}$ is divided.

Therefore, in this case (by using the Istrail morphism), all words can be distinguished.
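The two data rows of Table 1 can be recomputed by enumerating the 210 words with 19 a's and 2 b's. The sketch below is an illustration under our own naming; it assumes the order $a < b < c$ on $\Sigma_2$ and returns, for each value of $q$, the pair (class size, number of distinct Parikh matrices among the images).

```python
from itertools import combinations
from collections import defaultdict


def count_subword(word, u):
    """|word|_u, number of occurrences of u in word as a scattered subword."""
    ways = [1] + [0] * len(u)
    for ch in word:
        for k in range(len(u), 0, -1):
            if u[k - 1] == ch:
                ways[k] += ways[k - 1]
    return ways[len(u)]


def parikh_matrix(word, alphabet):
    s = len(alphabet)
    return tuple(
        tuple(count_subword(word, alphabet[i:j]) if j >= i else 0 for j in range(s + 1))
        for i in range(s + 1)
    )


def table1(n_a=19, n_b=2):
    phi = {"a": "abc", "b": "ac"}  # the Istrail morphism of Example 2
    length = n_a + n_b
    class_size = defaultdict(int)
    image_matrices = defaultdict(set)
    for b_positions in combinations(range(length), n_b):
        word = "".join("b" if i in b_positions else "a" for i in range(length))
        q = count_subword(word, "ab")
        class_size[q] += 1
        img = "".join(phi[ch] for ch in word)
        image_matrices[q].add(parikh_matrix(img, "abc"))
    return {q: (class_size[q], len(image_matrices[q])) for q in sorted(class_size)}


rows = table1()
for q in range(19):
    print(q, rows[q])
```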

Unfortunately, a morphism $\varphi : \Sigma_1^* \to \Sigma_2^*$ with $|\Sigma_2| = 3$ cannot separate every pair of amiable words. Some amiable words remain amiable no matter what morphism is used, as the next Theorem shows:

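Theorem 7. Let $\varphi : \{a, b\}^* \to \{x, y, z\}^*$ be a morphism. If $\alpha, \beta \in \{a, b\}^*$ are amiable, then

$$\varphi(\alpha\beta) \sim_a \varphi(\beta\alpha)$$

Proof. Obviously, if $\alpha \sim_a \beta$ then $\alpha\beta \sim_a \beta\alpha$. From Theorem 6 it follows that $|\varphi(\alpha)|_w = |\varphi(\beta)|_w$ and $|\varphi(\alpha\beta)|_w = |\varphi(\beta\alpha)|_w$ for $w \in \{x, y, z, xy, yz\}$.

It remains to prove only $|\varphi(\alpha\beta)|_{xyz} = |\varphi(\beta\alpha)|_{xyz}$. This equality follows from Theorem 6 and

$$|\varphi(\alpha\beta)|_{xyz} = |\varphi(\alpha)\varphi(\beta)|_{xyz} = |\varphi(\alpha)|_{xyz} + |\varphi(\beta)|_{xyz} + |\varphi(\alpha)|_x |\varphi(\beta)|_{yz} + |\varphi(\alpha)|_{xy} |\varphi(\beta)|_z$$

$$|\varphi(\beta\alpha)|_{xyz} = |\varphi(\beta)\varphi(\alpha)|_{xyz} = |\varphi(\beta)|_{xyz} + |\varphi(\alpha)|_{xyz} + |\varphi(\beta)|_x |\varphi(\alpha)|_{yz} + |\varphi(\beta)|_{xy} |\varphi(\alpha)|_z$$

since the two right-hand sides coincide term by term once $|\varphi(\alpha)|_w = |\varphi(\beta)|_w$ for $w \in \{x, z, xy, yz\}$.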

Example 3. Because $abba \sim_a baab$ we have

$$\varphi(abbabaab) \sim_a \varphi(baababba)$$

for any morphism $\varphi : \{a, b\}^* \to \{x, y, z\}^*$.
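Example 3 can be illustrated by a small exhaustive search. The sketch below (our own naming; the bound `max_len` on the length of $\varphi(a)$ and $\varphi(b)$ is an arbitrary assumption that keeps the search finite) lists the morphisms into $\{x, y, z\}^*$ whose images of two given words are not amiable.

```python
from itertools import product


def count_subword(word, u):
    """|word|_u, number of occurrences of u in word as a scattered subword."""
    ways = [1] + [0] * len(u)
    for ch in word:
        for k in range(len(u), 0, -1):
            if u[k - 1] == ch:
                ways[k] += ways[k - 1]
    return ways[len(u)]


def parikh_matrix(word, alphabet):
    s = len(alphabet)
    return tuple(
        tuple(count_subword(word, alphabet[i:j]) if j >= i else 0 for j in range(s + 1))
        for i in range(s + 1)
    )


def candidate_images(max_len):
    """All words over {x, y, z} of length at most max_len (including the empty word)."""
    return ["".join(t) for n in range(max_len + 1) for t in product("xyz", repeat=n)]


def separating_morphisms(u, v, max_len=2):
    """Morphisms phi with |phi(a)|, |phi(b)| <= max_len such that phi(u), phi(v) are not amiable."""
    found = []
    for pa in candidate_images(max_len):
        for pb in candidate_images(max_len):
            phi = {"a": pa, "b": pb}
            iu = "".join(phi[c] for c in u)
            iv = "".join(phi[c] for c in v)
            if parikh_matrix(iu, "xyz") != parikh_matrix(iv, "xyz"):
                found.append(phi)
    return found


print(separating_morphisms("abbabaab", "baababba"))    # [] as predicted by Theorem 7
print(len(separating_morphisms("abba", "baab")) > 0)   # True: abba and baab can be separated
```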

4 Morphisms defined by control words

Let us consider a general alphabet $\Sigma_1 = \{a_1, \ldots, a_s\}$ ($s \ge 2$) and let $w = c_1 c_2 \ldots c_n$ ($c_i \in \Sigma_1$) be a nonempty word, called the "control word".

Let $\Psi(w) = (n_1, \ldots, n_s)$ be the Parikh vector of $w$. We define the alphabet $\Sigma_2 = \{x_1, \ldots, x_n\}$ and a morphism $\varphi : \Sigma_1^* \to \Sigma_2^*$ as follows: for $r = 1, \ldots, s$,

$$\varphi(a_r) = x_{k_1} \ldots x_{k_{n_r}}, \quad \text{where } c_{k_i} = a_r \ (i = 1, \ldots, n_r).$$

The image of the letter $a_r$ under the morphism $\varphi$ is an (ordered) sequence composed of those characters of the alphabet $\Sigma_2$ that correspond (relative to the defined order) to the positions of the letter $a_r$ in the control word.

Example 4. Let us consider $\Sigma_1 = \{a, b, c\}$ and the control word $w = abccb$. So $n = 5$ and $\Psi(w) = (1, 2, 2)$. According to this construction, we define $\Sigma_2 = \{x_1, x_2, x_3, x_4, x_5\}$ and

$$\varphi(a) = x_1, \quad \varphi(b) = x_2 x_5, \quad \varphi(c) = x_3 x_4.$$
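The construction of $\varphi$ from a control word is easy to mechanise. In the sketch below, which uses our own naming, the letter $x_k$ of $\Sigma_2$ is represented simply by the integer $k$, so that the order on $\Sigma_2$ is the usual order on integers.

```python
def control_morphism(w):
    """Map each letter of the control word w to the increasing list of its 1-based positions in w."""
    phi = {}
    for k, c in enumerate(w, start=1):
        phi.setdefault(c, []).append(k)
    return phi


def image(phi, word):
    """phi(word) as a list of indices k standing for the letters x_k."""
    out = []
    for ch in word:
        out.extend(phi.get(ch, []))  # a letter absent from the control word has an empty image
    return out


phi = control_morphism("abccb")
print(phi)                # {'a': [1], 'b': [2, 5], 'c': [3, 4]}, as in Example 4
print(image(phi, "cab"))  # [3, 4, 1, 2, 5]
```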

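Theorem 8. Using the construction above, for a word $\alpha \in \Sigma_1^*$, the Parikh matrix of $\varphi(\alpha)$ is

$$M_{\varphi(\alpha)} = \begin{pmatrix} 1 & |\alpha|_{c_1} & |\alpha|_{c_1 c_2} & \ldots & |\alpha|_w \\ 0 & 1 & |\alpha|_{c_2} & \ldots & |\alpha|_{c_2 \ldots c_n} \\ \vdots & & & & \vdots \\ 0 & 0 & 0 & \ldots & 1 \end{pmatrix}$$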

Proof. It is enough to prove the equality $|\varphi(\alpha)|_{x_1} = |\alpha|_{c_1}$. All other entries of the Parikh matrix $M_{\varphi(\alpha)}$ are proved similarly.

The character $x_1$ appears in the word $\varphi(\alpha) \in \Sigma_2^*$ whenever $x_1$ is in the image $\varphi(a)$ of a character $a \in \Sigma_1$. Therefore

$$|\varphi(\alpha)|_{x_1} = \sum_{a \in \Sigma_1} |\alpha|_a \cdot |\varphi(a)|_{x_1}$$

By construction, $x_1$ appears only once: in that particular image $\varphi(a_k)$ (for a uniquely determined index $k$) where $a_k = c_1$.

Therefore $|\varphi(\alpha)|_{x_1} = |\alpha|_{a_k} = |\alpha|_{c_1}$.
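The statement of Theorem 8 can be compared numerically with the actual Parikh matrix of $\varphi(\alpha)$. The sketch below uses our own naming and represents $x_k$ by the integer $k$. The check is run on a control word whose consecutive letters are distinct, which is an assumption of ours: when a letter is repeated consecutively in $w$ (as cc in $w = abccb$), the two corresponding letters $x_k x_{k+1}$ lie in the same image, so the count in $\varphi(\alpha)$ also includes occurrences inside a single image, and the last lines show such a case.

```python
def count_subword(word, u):
    """Occurrences of u in word as a scattered subword; works for strings and for lists."""
    ways = [1] + [0] * len(u)
    for ch in word:
        for k in range(len(u), 0, -1):
            if u[k - 1] == ch:
                ways[k] += ways[k - 1]
    return ways[len(u)]


def control_morphism(w):
    phi = {}
    for k, c in enumerate(w, start=1):
        phi.setdefault(c, []).append(k)
    return phi


def image(phi, word):
    out = []
    for ch in word:
        out.extend(phi.get(ch, []))
    return out


def matrix_of_image(alpha, w):
    """Parikh matrix of phi(alpha) over x_1 < ... < x_n, computed from phi(alpha) itself."""
    n = len(w)
    img = image(control_morphism(w), alpha)
    xs = list(range(1, n + 1))
    return [[count_subword(img, xs[i:j]) if j >= i else 0 for j in range(n + 1)]
            for i in range(n + 1)]


def matrix_of_counts(alpha, w):
    """Matrix stated in Theorem 8: entry (i, j+1) is |alpha|_{c_i ... c_j}."""
    n = len(w)
    return [[count_subword(alpha, w[i:j]) if j >= i else 0 for j in range(n + 1)]
            for i in range(n + 1)]


# Control word without equal consecutive letters: both matrices agree.
print(matrix_of_image("bacabcab", "abcab") == matrix_of_counts("bacabcab", "abcab"))  # True

# Control word abccb and alpha = c: phi(c) = x3 x4 contains x3 x4 once, while |c|_cc = 0.
print(matrix_of_image("c", "abccb")[2][4], matrix_of_counts("c", "abccb")[2][4])  # 1 0
```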

Remark. The matrix obtained in Theorem 8 is similar to the extended Parikh matrix defined in [9].

4.1 A morphism based on a cyclic control word

Let us consider an integer $p \ge 2$ and the control word $w = (a_1 \ldots a_s)^p$. We can therefore define the alphabet $\Sigma_2 = \{x_1, x_2, \ldots, x_{sp}\}$ and the morphism $\varphi : \Sigma_1^* \to \Sigma_2^*$ by

$$\varphi(a_i) = x_i x_{s+i} x_{2s+i} \ldots x_{(p-1)s+i} \quad (1 \le i \le s)$$

Remark. For $p = 1$, the Parikh matrix mapping $\Psi_s$ is obtained.

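Let $\alpha \in \Sigma_1^*$ be an arbitrary word. In the $(s \cdot p + 1) \times (s \cdot p + 1)$ Parikh matrix $M_{\varphi(\alpha)}$, all the information about $\alpha$ is contained in the first $s$ rows (the next $s \cdot (p-1)$ rows repeat the information of the first $s$ rows, and the last row provides no information). So we can keep only these first $s$ rows of the Parikh matrix $M_{\varphi(\alpha)}$. Let us denote by $\overline{M}_\alpha$ the matrix

$$\overline{M}_\alpha = \begin{pmatrix} 1 & |\alpha|_{a_1} & |\alpha|_{a_1 a_2} & \ldots & |\alpha|_{a_1 \ldots a_s} & |\alpha|_{a_1 \ldots a_s a_1} & \ldots & |\alpha|_{(a_1 \ldots a_s)^p} \\ 0 & 1 & |\alpha|_{a_2} & \ldots & |\alpha|_{a_2 \ldots a_s} & |\alpha|_{a_2 \ldots a_s a_1} & \ldots & |\alpha|_{a_2 \ldots a_s (a_1 \ldots a_s)^{p-1}} \\ \vdots & & & & & & & \vdots \\ 0 & 0 & 0 & \ldots & |\alpha|_{a_s} & |\alpha|_{a_s a_1} & \ldots & |\alpha|_{a_s (a_1 \ldots a_s)^{p-1}} \end{pmatrix}$$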

In general, for two amiable words $\alpha, \beta \in \Sigma_1^*$, the value of $p$ is the highest integer $p_0$ such that

$$|\alpha| = |\beta| \ge s p_0$$

because for any $p > p_0$ the last $s \cdot (p - p_0)$ columns of $\overline{M}_\alpha$ are zero.

Let us denote by $row_1(\alpha), \ldots, row_s(\alpha)$ the rows of the matrix $\overline{M}_\alpha$.

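Theorem 9. The $s \times (s \cdot p + 1)$ matrix $\overline{M}_\alpha$ can be generated recursively as follows:

1. $$\overline{M}_\varepsilon = \begin{pmatrix} 1 & 0 & \ldots & 0 & \ldots & 0 \\ 0 & 1 & \ldots & 0 & \ldots & 0 \\ \vdots & & & & & \vdots \\ 0 & 0 & \ldots & 1 & \ldots & 0 \end{pmatrix}$$

2. $\overline{M}_{a_i \alpha}$ can be obtained from $\overline{M}_\alpha$ as follows:

(a) $row_j(a_i \alpha) \leftarrow row_j(\alpha)$ for $j \in \{1, 2, \ldots, s\} \setminus \{i\}$;
(b) $row_i(a_i \alpha) \leftarrow row_i(\alpha) + row_{i+1}(\alpha)$ if $1 \le i < s$, and $row_s(a_s \alpha) \leftarrow row_s(\alpha) + Shift_s(row_1(\alpha))$, where

$$Shift_s(row_1(\alpha)) = (\underbrace{0, \ldots, 0}_{s}, 1, |\alpha|_{a_1}, |\alpha|_{a_1 a_2}, \ldots, |\alpha|_{(a_1 \ldots a_s)^{p-1}})$$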

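Proof.

1. Obvious.

2. Let us consider the $k$'th entry of the $j$'th row of the matrix $\overline{M}_\alpha$. We have:

(a) $|a_i \alpha|_{a_j \beta} = |\alpha|_{a_j \beta}$ for $j \ne i$, $j < k < s \cdot p + 1$, and $\beta \in \Sigma_1^*$;
(b) $|a_i \alpha|_{a_i a_{i+1} \beta} = |\alpha|_{a_{i+1} \beta} + |\alpha|_{a_i a_{i+1} \beta}$ for $1 \le i < s$, and $\beta \in \Sigma_1^*$;
(c) $|a_s \alpha|_{a_s a_1 \beta} = |\alpha|_{a_s a_1 \beta} + |\alpha|_{a_1 \beta}$.

Because $|\alpha|_{a_1 \beta}$ ($\beta = a_2 a_3 \ldots$) is the $(k-s)$'th entry of the row $row_1(\alpha)$, it can be rewritten as the $k$'th element by shifting $s$ positions to the right.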
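The recursion of Theorem 9 can be sketched as follows. The function names are ours, and the alphabet is passed as a string in the order $a_1 < \ldots < a_s$. The word is processed from its last letter to its first, since each step prepends a letter, and the result is compared with a direct computation from subword counts.

```python
def count_subword(word, u):
    """|word|_u, number of occurrences of u in word as a scattered subword."""
    ways = [1] + [0] * len(u)
    for ch in word:
        for k in range(len(u), 0, -1):
            if u[k - 1] == ch:
                ways[k] += ways[k - 1]
    return ways[len(u)]


def cyclic_matrix_direct(alpha, alphabet, p):
    """First s rows of the Parikh matrix for the control word (a_1 ... a_s)^p, from subword counts."""
    s = len(alphabet)
    w = alphabet * p
    return [[0] * i + [count_subword(alpha, w[i:j]) for j in range(i, s * p + 1)]
            for i in range(s)]


def cyclic_matrix_recursive(alpha, alphabet, p):
    """The same s x (s*p + 1) matrix, generated by the recursion of Theorem 9."""
    s = len(alphabet)
    width = s * p + 1
    rows = [[int(j == i) for j in range(width)] for i in range(s)]  # matrix of the empty word
    for ch in reversed(alpha):  # prepend the letters of alpha one by one
        i = alphabet.index(ch)
        if i < s - 1:
            other = rows[i + 1]                      # case (b), i < s
        else:
            other = [0] * s + rows[0][:width - s]    # Shift_s(row_1)
        rows[i] = [x + y for x, y in zip(rows[i], other)]
    return rows


print(cyclic_matrix_recursive("abbabaab", "ab", 4))
print(cyclic_matrix_recursive("abbabaab", "ab", 4) == cyclic_matrix_direct("abbabaab", "ab", 4))
```

Each letter of $\alpha$ costs one row addition, so the whole matrix is obtained with $|\alpha|$ additions of vectors of length $s \cdot p + 1$.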

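Example 5. Let us consider $s = 2$, $p = 4$. Then

$$\overline{M}_\alpha = \begin{pmatrix} 1 & |\alpha|_a & |\alpha|_{ab} & |\alpha|_{aba} & |\alpha|_{abab} & |\alpha|_{ababa} & |\alpha|_{ababab} & |\alpha|_{abababa} & |\alpha|_{abababab} \\ 0 & 1 & |\alpha|_b & |\alpha|_{ba} & |\alpha|_{bab} & |\alpha|_{baba} & |\alpha|_{babab} & |\alpha|_{bababa} & |\alpha|_{bababab} \end{pmatrix}$$

If we take the binary words from Example 3, then we have

$$\overline{M}_{abbabaab} = \begin{pmatrix} 1 & 4 & 8 & 10 & 12 & 4 & 4 & 0 & 0 \\ 0 & 1 & 4 & 8 & 10 & 4 & 4 & 0 & 0 \end{pmatrix}$$

and

$$\overline{M}_{baababba} = \begin{pmatrix} 1 & 4 & 8 & 10 & 4 & 4 & 0 & 0 & 0 \\ 0 & 1 & 4 & 8 & 10 & 12 & 4 & 4 & 0 \end{pmatrix}$$

Therefore these words are separated by 4 entries (out of 18).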

In conclusion, a general morphism based on a cyclic control word resolves many ambiguities using a $(1 + s) \cdot (1 + s \cdot \lfloor \log_s(|\alpha|) \rfloor)$ matrix $\overline{M}_\alpha$ instead of a $(1 + s) \cdot (1 + s)$ Parikh matrix $M_\alpha$ (where $\alpha \in \{a_1, \ldots, a_s\}^*$).

Moreover, this matrix can be generated very easily using the recursive procedure of Theorem 9.

Unfortunately, this cyclic morphism does not separate every pair of amiable words.

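Example 6. Let $\alpha = abccb$ and $\beta = acbbc$ be two words over a ternary alphabet $\Sigma_1 = \{a, b, c\}$, having the Parikh matrix

$$M_\alpha = M_\beta = \begin{pmatrix} 1 & 1 & 2 & 2 \\ 0 & 1 & 2 & 2 \\ 0 & 0 & 1 & 2 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

Their images $\varphi(\alpha)$ and $\varphi(\beta)$ constructed with $p = 2$ (but any integer value of $p$ gives the same result) also have the same matrix:

$$\overline{M}_{abccb} = \overline{M}_{acbbc} = \begin{pmatrix} 1 & 1 & 2 & 2 & 0 & 0 & 0 \\ 0 & 1 & 2 & 2 & 0 & 0 & 0 \\ 0 & 0 & 1 & 2 & 0 & 0 & 0 \end{pmatrix}$$

Therefore this morphism is efficient for distinguishing long amiable words defined over small alphabets.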

4.2 A morphism able to distinguish two arbitrary amiable words

The main result of this section is

Theorem 10. Let $\Sigma_1$ be an alphabet with at least two characters, and $\alpha, \beta \in \Sigma_1^*$ be two distinct amiable words. There is an alphabet $\Sigma_2$ and a morphism $\varphi : \Sigma_1^* \to \Sigma_2^*$ such that $\varphi(\alpha) \not\sim_a \varphi(\beta)$.

Proof. The idea is to construct a morphism which uses one of the two amiable words as a control word. Therefore, let us consider the control word $w = \alpha$.

Then, using Theorem 8, the last entry of the first row in the Parikh matrix for (an arbitrary word) $\gamma \in \Sigma_1^*$ is $|\gamma|_\alpha$.

In our case $|\alpha|_\alpha = 1$, while $|\beta|_\alpha = 0$ for any $\beta \ne \alpha$ of the same length; so $|\alpha|_\alpha = |\beta|_\alpha$ if and only if $\alpha = \beta$.

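Example 7. The words $\alpha = acbbc$ and $\beta = abccb$ from Example 6 can be separated using the construction from Theorem 10. So, if we take the control word $w = acbbc$ we have

$$M_{\varphi(acbbc)} = \begin{pmatrix} 1 & 1 & 2 & 2 & 1 & 1 \\ 0 & 1 & 2 & 2 & 1 & 1 \\ 0 & 0 & 1 & 2 & 1 & 1 \\ 0 & 0 & 0 & 1 & 2 & 2 \\ 0 & 0 & 0 & 0 & 1 & 2 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}, \quad M_{\varphi(abccb)} = \begin{pmatrix} 1 & 1 & 2 & 2 & 0 & 0 \\ 0 & 1 & 2 & 2 & 0 & 0 \\ 0 & 0 & 1 & 2 & 1 & 0 \\ 0 & 0 & 0 & 1 & 2 & 2 \\ 0 & 0 & 0 & 0 & 1 & 2 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$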
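The first rows of the two matrices above consist of the subword counts $|\gamma|_{c_1 \ldots c_k}$ for the prefixes of the control word, and they can be recomputed with a few lines. The sketch below uses our own naming and evaluates exactly these counts (the entries given by the formula of Theorem 8); it does not build the images $\varphi(\gamma)$.

```python
def count_subword(word, u):
    """|word|_u, number of occurrences of u in word as a scattered subword."""
    ways = [1] + [0] * len(u)
    for ch in word:
        for k in range(len(u), 0, -1):
            if u[k - 1] == ch:
                ways[k] += ways[k - 1]
    return ways[len(u)]


def first_row(gamma, w):
    """Counts |gamma|_{c_1 ... c_k} for k = 0, ..., n, where w = c_1 ... c_n is the control word."""
    return [count_subword(gamma, w[:k]) for k in range(len(w) + 1)]


w = "acbbc"
print(first_row("acbbc", w))  # [1, 1, 2, 2, 1, 1]
print(first_row("abccb", w))  # [1, 1, 2, 2, 0, 0]
```

The last entries, $|\alpha|_\alpha = 1$ and $|\beta|_\alpha = 0$, are the ones used in the proof of Theorem 10.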

One open problem is: for a class $C_\alpha$ of amiable words, will the control word $w = \alpha$ distinguish every pair of words from $C_\alpha$? (That is, if $\beta_1, \beta_2 \in C_\alpha$ are distinct and $w = \alpha$, is $\varphi(\beta_1) \not\sim_a \varphi(\beta_2)$?)

Unfortunately, the morphism defined in this section builds a very large Parikh matrix: if $\alpha, \beta \in \Sigma_1^*$ are two amiable words of length $n$, the Parikh matrices defined by the morphism have $(n+1)^2$ entries each. So it is unusable for long words.

Remark. The last entry of the first row of $M_{\varphi(\alpha)}$ alone is enough to separate two amiable words $\alpha$ and $\beta$.

Using this remark, the first rows of the corresponding Parikh matrices are enough for separating two amiable words. But if we wish to generate only these rows, a good algorithm has to be found. The algorithm defined by Theorem 9 has time complexity $O(|\alpha|)$, but in this case it needs $O(|\alpha|^2)$ space (where $\alpha \in \Sigma_1^*$ is the control word).

References

1. A. Atanasiu, R. Atanasiu, I. Petre - Parikh Matrices and Amiable Words, Theoretical Computer Science 390, 1 (2008), 102-109.
2. A. Atanasiu - Binary amiable words, Intern. J. Found. Comput. Sci. 18, 2 (2007), 387-400.
3. A. Atanasiu - Parikh Matrices and the Istrail Morphism, IJFCS (in press).
4. A. Atanasiu, C. Martin-Vide, A. Mateescu - On the injectivity of Parikh matrix mapping, Fundamenta Informaticae 49 (2001), 166-180.
5. S. Fosse, G. Richomme - Some characterisations of Parikh matrix equivalent binary words, Inf. Processing Letters 92, 2 (2004), 77-82.
6. A. Mateescu, A. Salomaa, K. Salomaa, S. Yu - On the extension of the Parikh mapping, Theoret. Informatics Appl. 35 (2001), 551-564.
7. A. Mateescu, A. Salomaa - Matrix indicators for subword occurrences and ambiguity, Int. J. Found. Comput. Sci. 15 (2004), 277-292.
8. R.J. Parikh - On context-free languages, J. Assoc. Comput. Mach. 13 (1966), 570-581.
9. V. Serbanuta - Extended Parikh matrices, Theoretical Computer Science 310 (2004), 233-246.