Data Mining of Gene Expression Microarray via Weighted Prefix Trees

Data Mining of Gene Expression Microarray viaWeighted Prefix Trees

Tran Trang, Nguyen Cam Chi, and Hoang Ngoc Minh

Centre Integre de BioInformatique,Centre d’Etude et de Recherche en Informatique Medicale,

Universite de Lille 2, 1 Place Verdun, 59045 Lille Cedex, France{ttran, cnguyen, hoang}@univ-lille2.fr

Abstract. We used discrete combinatoric methods and non numericalalgorithms [9], based on weighted prefix trees, to examine the data min-ing of DNA microarray data, in order to capture biological or medicalinformations and extract new knowledge from these data. We describehierarchical cluster analysis of DNA microarray data using structure ofweighted trees in two manners : classifying the degree of overlap be-tween different microarrays and classifying the degree of expression lev-els between different genes. These are most efficiently done by findingthe characteristic genes and microarrays with the maximum degree ofoverlap and determining the group of candidate genes suggestive of apathology.

Keywords: combinatoric of words, weighted trees, data mining, clusteranalysis, DNA microarrays.

1 Introduction

DNA microarray is technology used to measure simultaneously the expressionlevels of thousands of genes under various conditions and then provide genome-wide insight. Microarray is a microscope slide to which the thousands of DNAfragments are attached. The DNA microarrays are hybridized with fluorescentlylabelled cDNA prepared from total mRNA of studied cells. The cDNA of thefirst cell sample is labelled with a green-fluorescent dye and the second with ared-fluorescent dye. After hybridization, the DNA microarrays are placed in ascanner to create a digital image of the arrays. The intensity of fluorescent lightvaries with the strength of the hybridization. The measure of expression level ofa gene is determined by the logarithm of the ratio of the luminous intensity ofthe red fluorescence IR to the luminous intensity of the green fluorescence IG,E = log2(

IR

IG). In this work, a change of differential expression of a gene between

two cell samples by a factor of greater than 2 was considered significant. Thus,if E ≥ 1, the gene is said up-regulated; if −1 < E < 1, it is said no-regulated; ifE ≤ −1, it is said down-regulated in the second cell sample [3].

The important application of microarray techology is an organization of mi-croarray profiles and gene expression profiles into different clusters according to

T.B. Ho, D. Cheung, and H. Liu (Eds.): PAKDD 2005, LNAI 3518, pp. 21–31, 2005.c© Springer-Verlag Berlin Heidelberg 2005

22 T. Trang, N.C. Chi, and H. N. Minh

degree of expression levels. Since some microarray experiments can contain upto 30 000 target spots, the data generated from a single array mounts up quickly.The interpretation of the experiment results requires the mathematic methodsand software programs to capture the biological or medical information and toextract new knowledge from these data. There exist already several statisticalmethods and software programs to analyze the microarrays. Examples include1) Unsupervised learning methods : hierarchical clustering and k-means clus-tering algorithms based on distance similarity metrics [3, 4] are used to searchfor genes with maximum similarity. 2) Supervised learning methods : supportvector machine approaches based on kernel function [5], the bayesian naive ap-proach based on maximum likelihood method [6, 7] are used to classify the geneexpression into a training set. These methods are based essentially on numericalalgorithms and do not really require structured data.

In this paper, we used combinatoric methods and non-numerical algorithms,based on structure of weighted prefix trees, to analyze microarray data, i.e :

1. defining the distances to compare different genes and different microarrays,2. organizing the microarrays of a pathology into the different clusters,3. classifying the genes of a pathology into the different clusters,4. researching the characteristic genes and/or characteristic microarrays,5. determining the group of candidate genes suggestive of a pathology.

The key to understanding our approach is the information on gene expression in amicroarray is represented by a symbolic sequence, and a collection of microarraysis viewed as a language, which is implementated by weighted prefix trees [8, 9].In fact, the expression levels of a gene will be modeled as an alphabet whosesize is the number of expression levels. Thus, a microarray profile or a geneexpression profile is viewed as a word over this alphabet and a collection ofmicroarrays forms a language that is called a DNA microarray language. Inthis study, we chose three symbols to represent the three expression levels ofa gene according to : a spot representing a gene up-regulated is encoded by +;a gene no-regulated is encoded by ◦; a gene down-regulated is encoded by −.From this modeling, we obtain an alphabet X = {+, ◦,−} and a microarrayis then represented by a ternary word over X. And a collection of microarraysis represented as the set of ternary words. The encoding by symbol sequencespermits the analysis of the microarrays by using ternary weighted trees. Thesetrees provide a visual of a set of data and also are tools to classify enormousmasses of words according to the overlap degree of their prefixes. Thus, thesestructures open a new way to examine the data mining step in the process ofknowledge discovery in databases to understand general characteristics of DNAmicroarray data. In other words, they permit the extraction of hidden biologicaland medical informations from the mass of this data. They also address theautomatic learning and pattern recognition problems. This paper describes twomanners of hierarchical cluster analysis of DNA microarray data, using weightedprefix trees, includes clustering the profile of microarrays and clustering genesexpression profiles. The next section recalls elements of words and languages.Section 3 presents weighted prefix trees. Section 4 describes the experimentalresults on a DNA microarray data of breast cancer cells.

Data Mining of Gene Expression Microarray 23

2 Elements of Words and Languages

2.1 Words and Precoding of Words

Let X = {x1, . . . , xm} be an alphabet of the size m. A w is a sequence. Thelenght of the word w over X is |w|. In particular, the empty word is denotedby ε. The set X∗ is the set of words over X. For any h ≥ 0, we denote Xh theset {w ∈ X∗, s.t. |w| = h}. The concatenation of word xi1 . . . xik

and xj1 . . . xjl

is the word xi1 . . . xikxj1 . . . xjl

. Equipped with the concatenation product, X∗

is a monoid whose element neuter is the empty word ε. A word u (resp. v) iscalled prefix, or left factor (resp. suffix, or right factor) of w if w = uv. For anyu, v ∈ X∗, let a be the longest left common factor of u and v, i.e u = au′ andv = av′. For xi ∈ X, let precod(xi) = i be the precoding of xi, for i = 1, ..,m.The precoding precod(w) of w in base m = CardX is defined as precod(ε) = 0and precod(w) = m precod(u) + precod(x), if w = ux, for u ∈ X∗ and x ∈ X.

2.2 Languages

Let L be a language containing N words over X∗. For u ∈ X∗, let us considerNu = Card{w ∈ L|∃v ∈ X∗, w = uv}, in particular Nε = N . Thus, for u ∈ L,

Nu =∑

x∈X

Nux and for any h ≥ 0, N =∑

u∈Xh

Nu. (1)

Let µ : L → IN be the mass function defined as µ(u) = Nxi1+ · · ·+ Nxih

, for allu = xi1 . . . xih

∈ L. For u ∈ X∗, let us consider also the ratios Pu = Nu/N , inparticular Pε = 1. For u ∈ X∗, xi ∈ X, to simplify the notation, let

p = precod(u), qi = precod(uxi) = mp + precod(xi). (2)

and we consider the ratios

Pp,qi= Nuxi

/Nu, for i = 1, ..,m. (3)

By the formula (1), since∑

xi∈X Nuxi= Nu then the ratios Pp,qi

define thediscrete probability over X∗ : 0 ≤ Pp,qi

≤ 1 and∑

u∈X∗,xi∈X Pp,qi= 1. For

u = xi1 . . . xih∈ Xh, the appearance probability of u is computed by

Pu = Pqi0 ,qi1. . . Pqih−1 ,qih

. (4)

Note that∑

w∈L Pw = 1 and for any h ≥ 0,∑

u∈Xh Pu = 1.

2.3 Rearrangement of Language

Let L = {w1, . . . , wN} be the language such that |wi| = L, for i = 1, .., N . Let SL

denote the set of permutations over [1, .., L]. Let σ ∈ SL and let w = xi1 . . . xiL.

Then σw = xσ(i1) . . . xσ(iL). We extend this definition over L as σL = {σw}w∈L.There exists a permutation σ ∈ SL such that σw1 = av1, . . . , σwN = avN , where


v1, . . . , vN ∈ X∗ and a is the left longest factor of L. The σ is not unique. TheRearrangement(L) algorithm is proposed in [9] as : for h ≤ L and for x ∈ X,let nh(x) be the number of letters x in the position h of the words in L andlet nh = maxx∈X nh(x) and

∑x∈X nh(x) = N . One rearranges then n1, . . . , nL

appearing in the order nσ(1) ≥ · · · ≥ nσ(L) by sorting algorithm [2].

Example 1. Let L = {+ + −◦, ◦ + +◦,− + +◦}. One has n2(+) = n4(◦) = 3 >n3(+) = 2 > n1(+) = n1(◦) = n1(−) = 1, so one permutes the position 1 and 4.

One obtains σ =(

1 2 3 44 2 3 1

)and σL has a = ◦+ as a longest common prefix.

3 Weighted Prefix Trees [9]

3.1 Counting Prefix Trees

Let L ⊆ X∗ be a language. The prefix tree A(L) associated to L is usually usedto optimize the storage of L and defined as follows

• the root is initial node which contains the empty word ε,• the set of the nodes corresponds to the prefixes of L,• the set of the terminal nodes represents L,• the transitions are of form precod(u) x−−−→ precod(ux), for x ∈ X, u ∈ X∗.

Equipped with the number Nux as defined in (1), the prefix tree A(L) be-comes a counting tree. To enumerate the nodes of a tree, we use the precodingof words defined in (2). The internal nodes p = precod(u) associated to prefix uare nodes such that Nu ≥ 2 and the simple internal nodes q are nodes such thatNu = 1. Thus, an internal node p = precod(u) corresponds to Nu words startingwith the same prefix u stored in sub-tree p. The counting tree of the languageL is constructed by Insert(L,A) algorithm in [8, 9]. By this construction, thetransitions between the nodes on a counting tree have the form : for x ∈ X

and u ∈ X∗,precod(u)x,Nux−−−−−→ precod(ux). Counting prefix trees permits the

comparison of all words of L according to the mass of their prefixes as definedin section 2.2. We proposed the Characteristic-Words(A(L)) algorithm in [8, 9],which returns the words having the maximum number of occurences, to extractcharacteristic words.

3.2 Probabilistic Prefix Trees

To compute the appearance probability of an output word over an alphabet X, weintroduce the probabilistic tree. Augmented with the probability Pp,q defined in(3), the counting tree A(L) becomes a probabilistic tree. The labelled probabilityis estimated by the maximum likelihood method (see (3)). It is the conditionalprobability that the word w accepts the common prefix ux knowing the commonprefix u. The transitions between the nodes on a probabilistic tree have the form :

for x ∈ X and u ∈ X∗, p = precod(u)x,Pp,q−−−−−→ q = precod(ux). The appearance

probability of a word is computed by (4).


Example 2. Let L = {+−+◦,++◦◦,++◦◦,++◦◦,+−+◦,+−+◦,++◦◦,++−◦} be the language over X = {+, ◦,−}, we have the weighted trees as below.

a) prefix tree b) counting tree c) probabilistic tree

+ ,8 + ,1

1

o o

o

o

+−1

+,5,3−

6 4

o,1−+,3 ,4

,3o ,1o ,4

,3/8− +

6 4

,1 −

o ,1o

,4/5o

14

o,1

1

+

o

,5/8

,1/5+

,1

4759

6 4

44

0 0 0

19 15 14 1519 14 19 15

59 47 44 59 4447

+ −

+−+o ++−o ++oo +−+o ++−o ++oo +−+o ++−o ++oo

Fig. 1. Weighted trees associated to L. The sequence of nodes 0 → 1 → 4 → 14 →44 represents the characteristic word + + ◦◦ with the maximum mass

3.3 Trees Having Longest Prefix

Consider the schema as follows

A(L) ←→ L σ−→ σL ←→ A(σL),

where A(L) (resp. A(σL)) is the tree associated with L (resp. σL). Where thetree A(σL) represents the longest prefix of σL (see section 2.3).

Example 3. Let L given in Example 2 and let σ =(

1 2 3 41 4 2 3

). One presents these

trees in the Figure 2.

1

+

,3− +,5

6 4

,3+ −

19 15 14

,4oo,3o

4459 47

0

,1 ,4o

,1

,8

++oo++−o+−+o

↔

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

+ − +◦+ + ◦ ◦+ − +◦+ + ◦ ◦+ − +◦+ + ◦ ◦+ + −◦+ + ◦ ◦

⎫⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎬

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎭

σ−→

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

+ ◦ −++ ◦ +◦+ ◦ −++ ◦ +◦+ ◦ −++ ◦ +◦+ ◦ +−+ ◦ +◦

⎫⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎬

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎭

↔

0

1

,8+

,8

5

,3− ,5

1618

,3+ − o,4

505155

+

,1

o

+o−+ +o+− +o+o

Fig. 2. Schema obtaining the tree A(σL)


We have proposed the Longest-Prefix-Tree(L∪w) algorithm to insert a wordinto a longest prefix tree and to give a new longest prefix tree [9].

4 Experimental Results

4.1 Descriptions

We employ the weighted trees, to organize microarrays (resp. genes) expressionprofiles into different clusters such that each cluster contains all microarrays(resp. genes) which represents a highly similar expression in degree of overlap.Here, we describe the cluster according to two manners including 1) the clusteranalysis of DNA microarrays for searching common profile of microarray data;2) the cluster analysis of gene expression profiles to search common expressionprofile of genes. Consider a collection of L genes across in N different measureexperiments. The gene expression profiles or the gene expression patterns is thematrix E =

(log2 IRij

/IGij

)1≤i≤L,1≤j≤N

, where IRij(resp. IGij

) is the luminousintensity of the red (resp. green) fluorescence dye of spot i in experiment j. Bysymbolic represention, the gene expression profiles is represented as a languageof L words of length N over alphabet {+, ◦,−}. In the same way, the profileof microarray data is the transposition of the matrix E, and the profile of mi-croarray data is representated by a language of N words of lenght L. These twolanguages are implemented by use of longest prefix tree permitting automaticallyto classify profile of microarrays (resp. genes) according to common expressionlevels as Figure 3, where the longest common expression indicates the similaritybetween microarrays (resp. genes). In the case of gene expression profiles (Sec-tion 4.3), the prefixe indicate also the co-regulated genes. This method returnsthen the hierarchical clustering using weighted trees which is an unsupervisedlearning technique and thus it does not requires a priori knowledge of clusternumber before clustering. This criterion is important in DNA microarray dataanalysis since the characteristics of the data are often unknown.

+++−++−+ +++o +o+o+o+−+o−+

5055

1618

5−

+,10

1

+,10

0

10 slides

4 genes

51

,5−,2o,3+

+,7,3

50

o ,4

16

,1−

51

gene 5 gene 1−4

5+,5,3−

18

55

gene 6−8

o ,8

1

+,8

0

8 genes

4 experiments

,3+

slide 1−5slide 9−10slide 6−8

Fig. 3. Counting trees of microarray data. Two sub-trees associated to node 5define two clusters corresponding. The process of nodes 0 → 1 → 5 → 16 → 50represents 5 microarrays (resp. 4 genes) having the same expression levels + + +−(resp. + ◦ +◦) with maximum degree of overlap


4.2 Clustering of DNA Microarrays

As an example, we analyze the data of 77 microarrays of 9216 genes of breastcancer cells coming from 77 patients available at the website http://genome-www.stanford.edu/breast-cancer/. The cluster analysis and the characteristicmicroarrays were represented in the Figures 4, 5, 6 and Table 1.

0 100 200 300 400

020

4060

80

Counting prefix tree

Depth of the tree

Occ

uren

ces

num

ber

0 100 200 300 400

0.5

0.6

0.7

0.8

0.9

1.0

Probabilistic prefix tree

Depth of the tree

Pro

babi

listic

tran

sitio

n

Fig. 4. Counting and probabilistic tree. Each depth of tree gives the number(probability) of microarrays having the common expression. The microarrays having amaximal common expression are represented by a dark line

,77

ooo,68

...

,77RNASE1

?C14orf136PPP2R53PICALMFBI1FTH1NDNSTX5ATAF12

DSCR1o,76,1

BET1LCOX5AUBE2Q

chromosome 14 open reading frame 163protein phosphatase 2, regulatory subunit B (B56)phosphatidylinositol binding clathrin assembly proteinHIV−1 inducer of short transcripts binding proteinferritin, heavy polypeptide 1necdin homolog (mouse)

TAF12 RNA polymerase II, TATA boxnescient helix loop helix 1

down syndrome critical region gene 1

blocked early in transport 1 homolog (S. cerevisiae) likecytochrome c oxidase subunit Vaubiquitin−conjugating enzyme E2Q (putative)

syntaxin A5

H.s mRNA; cDNA DKFZp666J045

NHLH1

microarray 18

Gene name77 breast microarrays

ribonuclease, RNase A family, 1 (pancreatic)Fc fragment of IgE, high affinity IFCER1G

−

ooooooooo

microarray 29

Gene symbol

−

++

,77

Fig. 5. Visual of microarrays. The 12 first depth of counting tree includes two genesFCER1G, RNASE1 are up-regulated, one gene is down-regulated and 9 genes are no-regulated on 77 experiments, etc... These three first genes can be viewed as the groupof candidate genes of breast cancer microarray


0 20 40 60 80

0.05

0.10

0.15

1. Depth 150

Microarray experiments

Pro

babi

listic

pre

fix

0 20 40 60 80

0.02

0.04

0.06

0.08

0.10

2. Depth 245


Pro

babi

listic

pre

fix

0 20 40 60 80

0.01

40.

018

0.02

20.

026

3. Depth 350


Pro

babi

listic

pre

fix

0 20 40 60 80

0.01

40.

018

0.02

20.

026

4. Depth 427


Pro

babi

listic

pre

fix

Fig. 6. Appearance probability of prefixes. At the depth 427 there are two mi-croarrays, 18 and 29, with coincident probability of 0.026. It can be considered as thecharacteristic microarrays

Table 1. Group of similar microarrays. At depth 245 there are 4 groups of mi-croarray with highly degree of overlap. Microarray 18 and 29 are most similarity : thereare 427 first genes having the same expression levels in which 35 genes are up-regulated,377 genes are no-regulated and 15 genes are down-regulated

cluster slideID overlap degree (+, ◦,−)

1 18 427 (35,377,15)29 427 (35,377,15)

2 20 392 (30,350,12)25 392 (30,350,12)

3 7 324 (29,284,11)23 324 (29,284,11)

4 5 261 (25,266,10)42 261 (25,266,10)32 245 (25,210,10)36 248 (25,213,10)


0 2000 4000 6000 8000

0.00

00.

002

0.00

4

1. Depth 60

Number of genes

Pro

babi

listic

pre

fix

0 2000 4000 6000 8000

0.00

050.

0015

2. Depth 65

Number of genes

Pro

babi

listic

pre

fix

0 2000 4000 6000 8000

0.00

050.

0015

3. Depth 70

Number of genes

Pro

babi

listic

pre

fix

0 2000 4000 6000 8000

2e−

046e

−04

1e−

03

4. Depth 77

Number of genes

Pro

babi

listic

pre

fix

Fig. 7. Cluster of gene expression profiles. The graphs represent the appearanceprobability of expression prefixes of genes. The weighted tree gives the genes with themaximum degree of overlap of co-upregulated and co-downregulated expression levelswhich are represented in Table 2

4.3 Clustering Gene Expression Profiles

The Figure 7 and Table 2 give the cluster analysis of gene expression profiles.

4.4 Observations and Notes

There are two groups of co-regulated genes. In Table 2, the maximum co-upregu-lated (resp. co-downregulated) genes are represented on the left (resp. right) table.Each cluster represents the maximum co-regulated genes with the correspondingmicroarray identity : the cluster A (resp. a) of Table 2 represents 2 up-regulated(resp. 1 down-regulated) genes over 77 microarrays corresponding to three firstdepths of Figure 5. These maximum co-regulated genes could be then consid-ered as characteristic genes of breast microarrays. These results permit alsoto isolate the groups of no-regulated genes : from the depth 4 to the depth 12,


Table 2. Groups of the co-regulated genes. Left table represents 5 clusters of 21co-upregulated genes. Right table represents 4 clusters of 21 co-downregulated genes

cluster gene symbol overlap (+, ◦,−)

A FCER1G 77 (77,0,0)RNASE1 77 (77,0,0)

B C1orf29 58 (58,0,0)G1P2 58 (58,0,0)MGP 58 (58,0,0)ISLR 58 (58,0,0)C1QG 58 (58,0,0)SPARCL1 58 (58,0,0)

C HLA-DQA2 50 (50,0,0)IGHG3 50 (50,0,0)? 50 (50,0,0)CD34 50 (50,0,0)C1QB 50 (49,1,0)

D SFRP4 41 (41,0,0)FCGR3A 41 (41,0,0)

E MS4A4A 32 (32,0,0)FBLN1 32 (32,0,0)FLJ27099 32 (32,0,0)FCGR2A 32 (32,0,0)CTSK 32 (32,0,0)FGL2 32 (32,0,0)

cluster gene symbol overlap (+, ◦,−)

a ? 77 (0,0,63)DLG7 63 (0,0,63)AHSG 63 (0,0,63)RPS6KA3 63 (0,0,63)GASP 63 (0,0,63)

b MPO 49 (0,0,49)HBZ 49 (0,0,49)SERPINE2 49 (0,0,49)DHFR 49 (0,1,48)ESDN 49 (0,1,48)

c FGA 45 (0,1,44)LGALS4 45 (0,0,45)SLCA5 45 (0,0,45)

d APOB 29 (0,0,29)GPA33 29 (0,0,29)MAL 29 (0,0,29)ARD1 29 (0,1,28)ETV4 29 (0,1,28)ASGR 29 (0,1,28)APOH 29 (0,1,28)AFP 29 (0,1,28)

there are 9 no-regulated genes over 77 microarrays and at the depth 5500 of thecounting tree there are 4348 no-regulated genes (48%) in least 50 microarrays(65%).

5 Conclusions

We used the weighted prefix trees to examine the data mining in the processof knowledge discovery in DNA microarray data. The hierarchical clustering us-ing weighted trees gives a tool to cluster gene expression microarray data. Thelongest prefix tree is used to establish the characteristic genes and/or character-istic microarrays that have the longest common expression and the maximumdegree of overlap. It permits also to determine the groups of candidate (andno-regulated) genes of pathologic condition. We anticipate that with further re-finement these methods may be extremely valuable in analysing the mass ofDNA microarray data, with possible significant clinical applications. In addi-tion to application on microarrays, weighted prefix tree could be also used toexplore other kinds of genomic data and they are pontentially usefull in otherclassification problems.

Acknowledgements. Many thanks to J. Soula for help in the data conversion.


References

1. M.Crochemore and al. : Algorithmique du texte, Vuibert Informatique, 20012. P.Flajolet and R.Sedgewick : An introduction to the analysis of algorithms, Addison-

Wesley, 19963. Jonathan R.Pollack and al. : Microarray analysis reveals a major direct role of DNA

copy number alteration in the transcriptional program of human breast tumors, Proc.Natl. Acad. Sci. USA, 2002

4. M.Eisen, David Botstein and al. : Cluster analysis and display genome-wide expres-sion patterns, Stanford, 1998

5. M.Brown, W.Grundy and al. : Knowledge-based analysis of microarray gene expres-sion data by using support vector machines, University of California, 1999

6. Inaki Inza and al. : Filter versus wrapper gene selection approaches in DNA mi-croarray domains, Artificial Intelligence in Medecine, Elsevier, 2004

7. P.Walker, A. Famili and al. : Data mining of gene expression changes in Alzheimerbrain, Artificial Intelligence in Medecine, Elsevier, 2004

8. Tran Trang, Nguyen Cam Chi, Hoang Ngoc Minh, al. : The management and theanalysis of DNA microarray data by using weighted trees, Hermes Publishing, 2004

9. Tran Trang, Nguyen Cam Chi, Hoang Ngoc Minh : Management and analysis ofDNA microarray data by using weighted trees, to appear in Journal of Global Opti-mization : Modeling, Computation and Optimization in Systems Engineering

Data Mining of Gene Expression Microarray via Weighted Prefix Trees

Documents

Transcript of Data Mining of Gene Expression Microarray via Weighted Prefix Trees