Computers & Operations Research 37 (2010) 1369–1380

Contents lists available at ScienceDirect

Computers & Operations Research

journal homepage: www.elsevier.com/locate/cor

Simulated annealing based automatic fuzzy clustering combined with ANN classification for analyzing microarray data

Ujjwal Maulik a, Anirban Mukhopadhyay b,∗

a Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
b Department of Computer Science and Engineering, University of Kalyani, Kalyani 741235, India

ARTICLE INFO

Available online 19 March 2009

Keywords: Microarray gene expression data; Fuzzy clustering; Cluster validity indices; Variable configuration length simulated annealing; Artificial neural network; Gene ontology

ABSTRACT

Microarray technology has made it possible to monitor the expression levels of many genes simultaneously across a number of experimental conditions. Fuzzy clustering is an important tool for analyzing microarray gene expression data. In this article, a real-coded variable configuration length Simulated Annealing (VSA) based fuzzy clustering method is developed and combined with a popular Artificial Neural Network (ANN) based classifier. The idea is to refine the clustering produced by VSA using the ANN classifier to obtain improved clustering performance. The proposed technique is used to cluster three publicly available real life microarray data sets. The superior performance of the proposed technique is demonstrated by comparing it with some widely used existing clustering algorithms. A statistical significance test has also been conducted to establish the statistical significance of the superior performance of the proposed clustering algorithm. Finally, the biological relevance of the clustering solutions is established.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

With the advancement of microarray technology, it is now possible to measure the expression levels of a huge number of genes across different experimental conditions simultaneously [1]. In recent years, microarray technology has had a major impact on many fields such as medical diagnosis and biomedicine, characterizing various gene functions, and understanding different molecular biological processes [2–5]. Due to its large volume, computational analysis is essential for extracting knowledge from microarray gene expression data. Clustering is one of the primary approaches to analyzing such large amounts of data.

Clustering [6–8] is a popular exploratory pattern classification technique which partitions the input space into K regions {C_1, C_2, . . . , C_K} based on some similarity/dissimilarity metric, where the value of K may or may not be known a priori. The main objective of any clustering technique is to produce a K × n partition matrix U(X) of the given data set X, consisting of n patterns, X = {x_1, x_2, . . . , x_n}. The partition matrix may be represented as U = [u_kj], k = 1, . . . , K and j = 1, . . . , n, where u_kj is the membership of pattern x_j to cluster C_k. In crisp partitioning u_kj = 1 if x_j ∈ C_k, otherwise u_kj = 0. On the other hand, for fuzzy partitioning of the data, the following conditions

∗ Corresponding author. Fax: +91 33 25828282.
E-mail addresses: [email protected] (U. Maulik), [email protected] (A. Mukhopadhyay).

0305-0548/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.cor.2009.02.025

hold on U (representing non-degenerate clustering): 0 < ∑_{j=1}^{n} u_kj < n for each k, ∑_{k=1}^{K} u_kj = 1 for each j, and ∑_{k=1}^{K} ∑_{j=1}^{n} u_kj = n.

Some early works dealt with visual analysis of gene expression patterns to group the genes into functionally relevant classes [2,3,9]. However, as these methods were very subjective, standard clustering methods, such as K-means [10], fuzzy C-means (FCM) [11], hierarchical methods [4], Self Organizing Maps (SOM) [12], the graph theoretic approach [13], Simulated Annealing (SA) based approaches [14,15] and Genetic Algorithm (GA) based clustering methods [16–18], have been utilized for clustering gene expression data.

In this article, a two-stage fuzzy clustering algorithm is proposed that brings supervised classification to bear on the unsupervised clustering of gene expression data by combining the two. In the first stage, a variable configuration length Simulated Annealing (VSA) based fuzzy clustering algorithm that minimizes the Xie–Beni (XB) validity index [23] is utilized for generating the fuzzy partition matrix as well as the number of clusters. Thereafter, the high membership points of each cluster are identified and used to train an Artificial Neural Network (ANN) based classifier [19,20]. Finally, in the second stage, the trained ANN classifier is applied to classify the remaining points. The proposed two-stage technique is named VSA–ANN.

The superiority of the proposed VSA–ANN clustering method, as compared to other popular methods for clustering gene expression data, namely FCM [21], average linkage [6], SOM [12] and the recently proposed SiMM–TS clustering [18], is established for three real life gene expression data sets, viz., Yeast Sporulation, Human Fibroblasts


Serum and Rat CNS. Moreover, statistical tests have been carried out to establish that the proposed technique produces results that are statistically significant and do not come by chance. Finally, a biological significance test has been conducted to establish that the clusters identified by the proposed technique are biologically relevant.

2. Motivation and contribution

Fuzzy clustering of microarray gene expression data has an inherent advantage over crisp partitioning. While clustering the genes, it is often the case that some gene has an expression pattern similar to more than one class of genes. For example, in the MIPS (Munich Information Center for Protein Sequences) categorization of data, several genes belong to more than one category [22]. Hence it is evident that a great amount of imprecision and uncertainty is associated with gene expression data. Therefore, it is natural to apply fuzzy clustering methods for partitioning expression data.

To defuzzify a fuzzy clustering solution, genes are assigned to the cluster to which they have the highest membership degree. It has been observed that, for a particular cluster, some of the genes belonging to it have a high membership degree to that cluster and can be considered properly clustered. On the contrary, some other genes of the same cluster may have lower membership degrees; such genes are not assigned to that cluster with high confidence. Therefore, it would be better if we could identify the low confidence points (genes) in each cluster and reassign them properly. This observation motivates us to refine the clustering result by using an ANN based probabilistic classifier [19,20], which is trained on the points with high membership degree in a cluster. The trained ANN classifier can thereafter be used to classify the remaining points. A variable configuration length Simulated Annealing (VSA) based fuzzy clustering algorithm that minimizes the XB cluster validity index [23] is utilized for generating the fuzzy partition matrix as well as the number of clusters in the first stage. In the subsequent stage, the ANN is applied to classify the points with lower membership degree.

In [18], we proposed a two-stage clustering technique (SiMM–TS) that first identifies the points having significant membership to multiple clusters (SiMM points) using variable string length GA based clustering (VGA) minimizing the XB index [23]. The SiMM points are then excluded from the data set and the remaining points are clustered through a multiobjective clustering method [24]. Finally, the SiMM points are assigned to the nearest clusters. SiMM–TS depends heavily on the choice of a threshold parameter P, which had to be fixed through several iterations and thus takes time. Moreover, SiMM–TS made no use of supervised learning tools to improve the clustering. The clustering technique (VSA–ANN) proposed in this article is different from SiMM–TS. Here we first evolve the number of clusters and the fuzzy membership matrix through VSA based fuzzy clustering minimizing the XB index. Thereafter, the high confidence points (core points) of each cluster are identified and used to train the ANN classifier. The remaining points are then classified using the trained classifier. This method also uses a membership threshold parameter; however, it is evolved automatically. In SiMM–TS, clustering is used in both stages, whereas in VSA–ANN, clustering is used in the first stage only; in the second stage, ANN classification is used. As supervised classification is known to perform better than unsupervised clustering, in this article an effort has been made to use the strength of supervised classification for unsupervised clustering of gene expression data, which is the main novelty of the proposed approach. Thus VSA–ANN is expected to perform better than SiMM–TS, and this is also established experimentally.

Fig. 1. Gene expression matrix.

3. Microarray data

A microarray is a small chip onto which a large number of DNA molecules (probes) are attached in fixed grids. The chip is made of chemically coated glass, nylon, membrane or silicon. Each grid cell of a microarray chip corresponds to a DNA sequence. For a cDNA microarray experiment, the first steps are to extract RNA from a tissue sample and amplify it. Thereafter, two mRNA samples are reverse-transcribed into cDNA (targets) labeled using different fluorescent dyes (red-fluorescent dye Cy5 and green-fluorescent dye Cy3). Due to the complementary nature of the base-pairs, the cDNA binds to the specific oligonucleotides on the array. In the subsequent stage, the dye is excited by a laser so that the amount of cDNA can be quantified by measuring the fluorescence intensities [4,25]. The log ratio of the two dye intensities is used as the gene expression profile:

gene expression level = log_2 (Intensity(Cy5) / Intensity(Cy3)). (1)

A microarray experiment typically measures the expression levels of a large number of genes across different experimental conditions or time points. A microarray gene expression data set consisting of n genes and m conditions can be expressed as a real valued n × m matrix M = [g_ij], i = 1, 2, . . . , n, j = 1, 2, . . . , m. Here each element g_ij represents the expression level of the ith gene at the jth experimental condition or time point (Fig. 1).

The raw gene expression data may contain noise and also suffers from variations arising from biological experiments as well as missing values. Hence, before applying any clustering algorithm, some preprocessing of the data is required. Two widely used preprocessing techniques are missing value estimation and normalization. Normalization is a statistical tool for transforming data into a format suitable for meaningful cluster analysis [26,27]. Among the various normalization techniques, the most widely used is the one by which each row of the matrix M is standardized to have mean 0 and variance 1.
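The row-wise standardization described above can be sketched as follows (a minimal NumPy illustration; the function name is ours, not from the paper, and rows are assumed non-constant so the standard deviation is nonzero):

```python
import numpy as np

def standardize_rows(M):
    """Standardize each row (gene) of the expression matrix M
    to mean 0 and variance 1, as described above."""
    M = np.asarray(M, dtype=float)
    mu = M.mean(axis=1, keepdims=True)
    sigma = M.std(axis=1, keepdims=True)  # population std; rows must be non-constant
    return (M - mu) / sigma
```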

4. SA based fuzzy clustering with variable length configuration

SA [28,29] is a popular search algorithm that utilizes the principles of statistical mechanics regarding the behavior of a large number of atoms at low temperature for finding minimal cost solutions to large optimization problems by minimizing the associated energy. In statistical mechanics, it is very important to investigate the ground states or low energy states of matter, which are achieved at very low temperatures. However, it is not sufficient to lower the temperature alone, since this results in unstable states. In the annealing process, the temperature is first raised and then decreased gradually to a very low value (T_min), while ensuring that sufficient time is spent at each temperature value. This process yields stable low energy states. A convergence proof for SA, provided it is annealed sufficiently slowly, is given in [30]. Being based on strong theory, SA has been applied successfully in diverse areas [31–33]. In this section,


an improved variable configuration length SA based fuzzy clustering algorithm is described.

4.1. Solution representation

A solution to the clustering problem is a set of K cluster centers, K being the number of clusters. Here a solution (configuration) is represented by a string of real numbers which represent the coordinates of the cluster centers. If configuration i encodes the centers of K_i clusters in m-dimensional space, then its length l_i will be m × K_i. For example, in four-dimensional space, the configuration 〈1.3 11.4 53.8 2.6 10.1 21.4 0.4 5.3 35.6 0.0 10.3 17.6〉 encodes three cluster centers, (1.3, 11.4, 53.8, 2.6), (10.1, 21.4, 0.4, 5.3) and (35.6, 0.0, 10.3, 17.6). Each center is considered to be indivisible.

4.2. Initial configuration

The string corresponding to the initial configuration encodes the centers of some K clusters, such that K = (rand() % K*) + 2, where rand() is a function returning a random integer and K* is a soft estimate of the upper bound of the number of clusters. Therefore, the number of clusters varies from 2 to K* + 1. The K centers encoded in the initial configuration are randomly selected distinct points from the input data set.
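The initialization above can be sketched as follows (NumPy-based; the function name and the RNG interface are illustrative assumptions, not from the paper):

```python
import numpy as np

def initial_configuration(X, K_star, seed=None):
    """Pick K = (rand() % K*) + 2 clusters, i.e. K in [2, K*+1],
    and encode K distinct data points as the initial centers."""
    rng = np.random.default_rng(seed)
    K = int(rng.integers(0, K_star)) + 2
    idx = rng.choice(len(X), size=K, replace=False)  # distinct data points
    return np.asarray(X, dtype=float)[idx]           # a K x d configuration
```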

4.3. Computation of energy

The energy of a configuration indicates the degree of goodness of the solution it represents. The goal is to minimize the energy to obtain the lowest energy state. In this article, the XB cluster validity index [23] is used as the energy function. Let X = {x_1, x_2, . . . , x_n} be the set of n data points to be clustered. For computing the energy, the centers encoded in the configuration are first extracted. Let these be denoted as Z = {z_1, z_2, . . . , z_K}. The membership values u_ik, i = 1, 2, . . . , K and k = 1, 2, . . . , n, are computed as follows [21]:

u_ik = 1 / ∑_{j=1}^{K} (D(z_i, x_k) / D(z_j, x_k))^{2/(m−1)}, 1 ≤ i ≤ K, 1 ≤ k ≤ n, (2)

where D(·, ·) is a distance function, m is the weighting coefficient and K is the number of clusters encoded in the configuration. (Note that while computing u_ik using Eq. (2), if D(z_j, x_k) is equal to zero for some j, then u_ik is set to zero for all i = 1, . . . , K, i ≠ j, while u_jk is set equal to one.) Subsequently, the centers encoded in the configuration are updated using the following equation [21]:

z_i = ∑_{k=1}^{n} (u_ik)^m x_k / ∑_{k=1}^{n} (u_ik)^m, 1 ≤ i ≤ K, (3)

and the cluster membership values are recomputed as per Eq. (2).

The XB index is defined as a function of the ratio of the total variation σ to the minimum separation sep of the clusters. Here σ and sep can be written as

σ(U, Z; X) = ∑_{i=1}^{K} ∑_{k=1}^{n} u_ik² D²(z_i, x_k), (4)

and

sep(Z) = min_{i≠j} {D²(z_i, z_j)}, (5)

where U, Z and X represent the partition matrix, the set of cluster centers and the data set, respectively. The XB index is then written as

XB(U, Z; X) = σ(U, Z; X) / (n × sep(Z)) = (∑_{i=1}^{K} ∑_{k=1}^{n} u_ik² D²(z_i, x_k)) / (n × min_{i≠j} {D²(z_i, z_j)}). (6)

Note that when the partitioning is compact and good, the value of σ should be low while sep should be high, thereby yielding lower values of the XB index. The objective is therefore to minimize the XB index for achieving proper clustering.
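The energy computation can be sketched as below, a minimal rendering of Eqs. (2), (3) and (6) assuming Euclidean distance for D; the function names are ours:

```python
import numpy as np

def fuzzy_memberships(Z, X, m=2.0):
    """Membership u_ik of point x_k to cluster center z_i, Eq. (2)."""
    D = np.linalg.norm(Z[:, None, :] - X[None, :, :], axis=2)  # K x n distances
    with np.errstate(divide="ignore", invalid="ignore"):
        U = 1.0 / ((D[:, None, :] / D[None, :, :]) ** (2.0 / (m - 1.0))).sum(axis=1)
    zero = D == 0                      # point coincides with a center: crisp assignment
    cols = zero.any(axis=0)
    U[:, cols] = zero[:, cols].astype(float)
    return U

def update_centers(U, X, m=2.0):
    """Weighted-mean center update, Eq. (3)."""
    W = U ** m
    return (W @ np.asarray(X, dtype=float)) / W.sum(axis=1, keepdims=True)

def xb_index(U, Z, X):
    """Xie-Beni index, Eq. (6): total variation / (n * minimum separation)."""
    D2 = np.linalg.norm(Z[:, None, :] - X[None, :, :], axis=2) ** 2
    sigma = (U ** 2 * D2).sum()                                  # Eq. (4)
    pair = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2) ** 2
    sep = pair[~np.eye(len(Z), dtype=bool)].min()                # Eq. (5)
    return sigma / (len(X) * sep)
```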

4.4. Perturbation

In this article, three perturbation operations have been used to obtain a new configuration from the previous one. The three operations, applied with equal probability, are as follows:

4.4.1. Perturb center
In this method, a random center of the configuration is chosen to be perturbed. A random number δ in the range [−1, 1] is generated with uniform distribution. If the value of the center in the d-th dimension is z_d, after perturbation it becomes (1 ± 2·δ·p)·z_d when z_d ≠ 0, and ±2·δ·p when z_d = 0. The '+' or '−' sign occurs with equal probability. Here p denotes the perturbation rate, which is taken to be 0.01 in this article.

4.4.2. Split center
The biggest cluster encoded in the configuration is split. To do this, first the size S_i of each cluster i is computed as follows:

S_i = ∑_{j=1}^{n} u_ij, 1 ≤ i ≤ K, (7)

where K is the number of clusters encoded in the configuration chosen for perturbation. Thereafter, the center of the biggest cluster is selected and substituted by two new centers that are created as follows. A reference point p is found whose membership value to the biggest cluster is closest to the mean of the membership values above 0.5. The distance between the reference point p and the selected center in the d-th dimension (dist_d) is computed as

dist_d = |z_d − p_d|. (8)

Subsequently, the values of the d-th dimension of the two new centers that replace the currently selected center are given by z_d ± dist_d.

4.4.3. Delete center
In the delete center operation, the smallest cluster is identified as per Eq. (7) and its center is deleted from the configuration. If the delete center operation would result in a single cluster, it is not performed.
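The three operators can be sketched as follows (our own minimal NumPy rendering, assuming Euclidean geometry; tie-breaking and the RNG interface are illustrative choices, not from the paper):

```python
import numpy as np

def perturb_center(conf, p=0.01, seed=None):
    """Perturb one randomly chosen center (Sec. 4.4.1):
    z_d -> (1 +/- 2*delta*p)*z_d, or +/- 2*delta*p when z_d = 0."""
    rng = np.random.default_rng(seed)
    conf = conf.copy()
    i = int(rng.integers(len(conf)))
    delta = rng.uniform(-1.0, 1.0, size=conf.shape[1])
    sign = rng.choice([-1.0, 1.0], size=conf.shape[1])
    z = conf[i]
    conf[i] = np.where(z != 0, (1.0 + sign * 2 * delta * p) * z, sign * 2 * delta * p)
    return conf

def split_center(conf, U, X):
    """Split the biggest fuzzy cluster (Sec. 4.4.2) into z - dist and z + dist."""
    i = int(U.sum(axis=1).argmax())              # sizes S_i, Eq. (7)
    u = U[i]
    high = u[u > 0.5]
    if high.size == 0:
        return conf                              # nothing to split confidently
    ref = int(np.abs(u - high.mean()).argmin())  # reference point
    dist = np.abs(conf[i] - X[ref])              # Eq. (8), per dimension
    return np.vstack([np.delete(conf, i, axis=0), conf[i] - dist, conf[i] + dist])

def delete_center(conf, U):
    """Delete the smallest cluster's center (Sec. 4.4.3),
    unless that would leave a single cluster."""
    if len(conf) <= 2:
        return conf
    return np.delete(conf, int(U.sum(axis=1).argmin()), axis=0)
```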

4.5. Acceptance of the new configuration

Suppose the current configuration curr has energy value E_curr and the new configuration new, obtained by perturbing the current configuration, has energy value E_new. If E_new ≤ E_curr, then the new configuration is accepted and considered as the current configuration. If E_new is greater than E_curr, then the probability p_acc of accepting the new configuration is given by

p_acc = exp(−(E_new − E_curr) / T_t), (9)

where T_t is the current temperature. This means that the probability of accepting a comparatively bad solution decreases with the increasing badness of the new solution and with decreasing temperature.

4.6. Annealing schedule

Starting from the initial high temperature T_1 = T_max, the temperature is decreased as per some annealing schedule at each generation t, such that T_1 ≥ T_2 ≥ · · · ≥ T_t ≥ · · · ≥ T_min ≈ 0, where T_min is the minimum temperature. The asymptotic convergence (i.e., at t → ∞) of the SA is guaranteed for a logarithmic annealing schedule of the form T_t = T_max/(1 + ln t), where t ≥ 1. However, in practice, logarithmic annealing is far too slow and hence we have used a geometric schedule of the form T_t = T_max × (1 − α)^t, where α is a positive real number close to zero. At each temperature, to obtain a stable state, the process of perturbation and acceptance of a new configuration is repeated Iter times. The annealing process terminates when the current temperature reaches T_min.

Fig. 2. The VSA based fuzzy clustering algorithm.

Fig. 3. Three layer feed-forward ANN classifier model.

The different steps of the VSA algorithm are shown in Fig. 2.
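The acceptance rule (Eq. (9)) and the geometric cooling schedule can be sketched compactly as follows (the function names are ours, for illustration only):

```python
import math
import random

def accept(E_curr, E_new, T):
    """Accept an improvement outright; otherwise accept with
    probability exp(-(E_new - E_curr)/T), Eq. (9)."""
    if E_new <= E_curr:
        return True
    return random.random() < math.exp(-(E_new - E_curr) / T)

def temperature(T_max, alpha, t):
    """Geometric cooling T_t = T_max * (1 - alpha)^t, for small alpha > 0."""
    return T_max * (1.0 - alpha) ** t
```

At each of the t temperature values, `accept` would be invoked Iter times on perturbed configurations, matching the schedule described above.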

5. ANN based classifier

The ANN classifier (Fig. 3) used in this article implements a three layer feed-forward neural network with a hyperbolic tangent function for the hidden layer and the softmax function [34] for the output layer. Using softmax, the output of the i-th output neuron is given by:

p_i = e^{q_i} / ∑_{j=1}^{K} e^{q_j}, (10)

where q_i is the net input to the i-th output neuron and K is the number of output neurons. The use of softmax makes it possible to interpret the outputs as probabilities. The number of neurons in the input layer is d, where d is the number of features of the input data set. The number of neurons in the output layer is K, where K is the number of classes. The i-th output neuron provides the class membership degree of the input pattern to the i-th class. The number of hidden layer neurons is taken as 2 × d. The weights are optimized with a maximum a posteriori (MAP) approach: a cross-entropy error function augmented with a Gaussian prior over the weights. The regularization is determined by MacKay's ML-II scheme [20]. The outlier probability of the training examples is also estimated [35]. Fig. 3 shows the feed-forward ANN classifier model.
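The softmax output of Eq. (10) can be sketched as follows (subtracting max(q) before exponentiating is a standard numerical-stability trick, not part of the paper's formulation):

```python
import numpy as np

def softmax(q):
    """Output probabilities p_i = exp(q_i) / sum_j exp(q_j), Eq. (10)."""
    q = np.asarray(q, dtype=float)
    e = np.exp(q - q.max())  # shift for numerical stability; ratios unchanged
    return e / e.sum()
```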

6. Proposed VSA–ANN clustering technique

As discussed earlier, a fuzzy clustering technique such as VSA generates a fuzzy partition matrix U = [u_ik], i = 1, . . . , K and k = 1, . . . , n, where K and n are the number of clusters evolved and the number of data points, respectively. The solution can be defuzzified by assigning each point to the cluster to which it has the highest membership degree. Hence, for each cluster, the points belonging to it may have different membership degrees ranging from high (higher confidence) to low (lower confidence). The points having lower membership degrees are not assigned to their cluster with a reasonable confidence level, whereas the points having high membership values can be considered properly classified. This motivates us to design a clustering method where the points having high membership values in each cluster are used to train a classifier, and the class labels of the remaining points are then predicted using the trained classifier. In this article, we have used the VSA based fuzzy clustering algorithm to evolve the fuzzy membership matrix as well as the number of clusters. Subsequently, an ANN based probabilistic classifier is used to carry out the classification task. The method is named VSA–ANN and its steps are as follows:

Step 1: Cluster the input data set X = {x_1, x_2, . . . , x_n} using the VSA based fuzzy clustering algorithm to evolve the fuzzy membership matrix U = [u_ik], i = 1, . . . , K and k = 1, . . . , n, where K and n are the number of clusters evolved automatically and the number of data points, respectively.

Step 2: Assign each point k (k = 1, . . . , n) to some cluster j (1 ≤ j ≤ K) such that u_jk = max_{i=1,...,K} {u_ik}.

Step 3: For each cluster i (i = 1, . . . , K), select all the points j of that cluster for which u_ij ≥ T_i, where T_i (0 < T_i < 1) is a threshold value on the membership degree for cluster i. These points act as training points for cluster i. Combine the training points of all the clusters to form the complete training set. Keep the remaining points as the test set.

Step 4: Train the probabilistic ANN classifier using the trainingset created in the previous step.

Step 5: Generate the conditional membership probabilities for theremaining points (test points) using the trained ANN classifier.

Step 6: Obtain the new membership matrix U* = [u*_ik]_{K×n} by combining the memberships of the training points (obtained using VSA) and the test points (produced by the trained ANN).

Step 7: Assign each point k (k = 1, . . . , n) to some cluster j (1 ≤ j ≤ K) such that u*_jk = max_{i=1,...,K} {u*_ik}.

Note that the size and the confidence of the training set depend on the choice of the membership thresholds T_i, i = 1, . . . , K. If the T_i values are large, the size of the training set decreases; however, the training set will then contain only the points having high membership degrees to their respective clusters, and hence will have more confidence. On the contrary, for small values of T_i, the size of the training set will increase at the cost of the confidence level. Therefore, the choice of the threshold parameters T_i, i = 1, . . . , K, has a significant effect on the performance of VSA–ANN. During experimentation, it has been noticed that the best clustering performance is achieved if, for each cluster, the points that have membership degrees greater than the mean membership degree of that cluster are selected for training. Taking this into account, after several experiments we have fixed T_i as follows:

T_i = (1/n_i) ∑_{j∈C_i} u_ij, i = 1, . . . , K, (11)

where n_i is the size of cluster i (denoted by C_i). This implies that, for each cluster, the points having membership degrees greater than the mean of the membership degrees of all the points of that cluster are chosen as the training points. Thus the membership threshold value can vary from one cluster to another.
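Steps 2–3 together with the threshold of Eq. (11) can be sketched as follows (the function name is ours; keeping ties at the threshold is our choice, so singleton clusters still contribute a training point):

```python
import numpy as np

def split_train_test(U):
    """Defuzzify by highest membership (Step 2), set each cluster's
    threshold T_i to its mean membership, Eq. (11), and mark points
    at or above the threshold as training points (Step 3)."""
    labels = U.argmax(axis=0)
    train = np.zeros(U.shape[1], dtype=bool)
    for i in range(U.shape[0]):
        members = np.where(labels == i)[0]
        if members.size == 0:
            continue
        T_i = U[i, members].mean()           # Eq. (11)
        train[members] = U[i, members] >= T_i
    return labels, train                      # ~train indexes the test set
```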

For the purpose of illustration, a two-dimensional artificial data set containing five clusters is shown in Fig. 4(a). Figs. 4(b) and (c) show the training set and the test set of points obtained, respectively. This example indicates that the points in the test set are usually situated in the overlapping regions of the clusters and thus carry a large amount of confusion regarding their class assignment.

7. Distance measures

The choice of distance measure plays a great role in the context of microarray clustering. In this article, the Pearson correlation-based distance measure has been used, as this is the most commonly used distance metric for clustering gene expression data. A gene expression data set consisting of n genes and m time points is usually expressed as a real valued n × m matrix E = [g_ij], i = 1, 2, . . . , n, j = 1, 2, . . . , m. Here each element g_ij represents the expression level of the ith gene at the jth time point.

Pearson correlation: Given two feature vectors g_i and g_j, the Pearson correlation coefficient Cor(g_i, g_j) between them is computed as

Cor(g_i, g_j) = ∑_{l=1}^{m} (g_il − μ_gi)(g_jl − μ_gj) / (√(∑_{l=1}^{m} (g_il − μ_gi)²) √(∑_{l=1}^{m} (g_jl − μ_gj)²)). (12)

Here μ_gi and μ_gj represent the arithmetic means of the components of the feature vectors g_i and g_j, respectively. The Pearson correlation coefficient defined in Eq. (12) is a measure of similarity between two objects in the feature space. The distance between two objects g_i and g_j is computed as 1 − Cor(g_i, g_j), which represents the dissimilarity between those two objects.
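The correlation-based dissimilarity 1 − Cor(g_i, g_j) can be sketched as follows (a minimal NumPy version; the function name is ours, and the vectors are assumed non-constant so the denominator is nonzero):

```python
import numpy as np

def pearson_distance(g_i, g_j):
    """Dissimilarity 1 - Cor(g_i, g_j), with Cor as in Eq. (12)."""
    a = np.asarray(g_i, dtype=float)
    b = np.asarray(g_j, dtype=float)
    a = a - a.mean()  # center each profile on its arithmetic mean
    b = b - b.mean()
    cor = (a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum())
    return 1.0 - cor  # 0 for perfectly correlated, 2 for anti-correlated profiles
```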

8. Complexity analysis

In this section, we analyze the time and space complexities of the proposed VSA–ANN clustering technique and compare them with those of the other clustering methods considered here.

8.1. Time complexity

Since the time taken by VSA dominates the training and testing of the ANN, the worst case time complexity of VSA–ANN is dominated by the complexity of VSA alone. The complexity of VSA can be computed as follows:

1. The time required for the initialization of the configuration is proportional to the length of the configuration. As the length of the configuration is proportional to K* × d (K* = soft estimate of the upper bound of the number of clusters, d = data dimension), the time complexity of initialization is O(K* × d).

2. One of the following perturbation operations is performed randomly:
(a) Perturb center: this can be performed in O(d) time.
(b) Split center: this can be performed in O(n × K* × d) time, where n is the number of data points.
(c) Delete center: this can be performed in O(n × K* × d) time.
Hence in the worst case, the complexity of perturbation is O(n × K* × d).
3. Energy computation is composed of three steps:
(a) The complexity of computing the membership of n points to K* clusters is O(n × K* × d).
(b) For updating K* cluster centers, the complexity is O(K* × d).
(c) The complexity of computing the energy function is O(n × K* × d).
Hence the total complexity of energy computation is O(n × K* × d).

Thus, summing up the above complexities, the total time complexity becomes O(n × K∗ × d) per iteration. If the number of different temperatures is t and the number of iterations at each temperature is Iter, then the overall time complexity of VSA, and hence of VSA–ANN, becomes O(t × Iter × n × K∗ × d).

The time complexities of the other algorithms considered in this article are as follows. The time complexity of FCM for k clusters is O(I × n × k × d), where I is the number of iterations. As FCM needs to be run for different numbers of clusters, from 2 to K∗, to find the number of clusters, the total time complexity of FCM becomes O(\sum_{k=2}^{K∗} (I × n × k × d)).

The time complexity of the average linkage algorithm is O(n² log n × d). SOM clustering has a time complexity of O(I × n × k × d) for k map elements and I iterations. To find the number of clusters, SOM is to be executed for each value of k from 2 to K∗, k being even. Hence the total time complexity of SOM becomes O(\sum_{k=2,4,...,K∗} (I × n × k × d)).

The SiMM–TS algorithm has two stages. In the first stage, it uses VGA-based clustering, which has a time complexity of O(G × P × n × K∗ × d), where G and P are the number of generations and the population size, respectively. Identifying the SiMM points takes O(n log n) time. In the second stage, it uses multiobjective GA-based clustering, which also has a time complexity of O(G × P × n × K∗ × d), with two objective functions. However, the second-stage clustering algorithm is executed T times with different values of the threshold parameter P. Hence the overall time complexity of the second stage, and thus of the SiMM–TS algorithm, is O(G × P × T × n × K∗ × d).

8.2. Space complexity

VSA has a worst-case space complexity of O(K∗ × n), i.e., the size of the membership matrix. The ANN has a space complexity of O(N × H + H × F) (the size of the weight matrices), where N, H and F are the number of neurons in the input layer, hidden layer and output layer, respectively. As discussed in Section 5, N = d, H = 2d and F = K∗ in the worst case. Hence the space complexity of the ANN is O(d² + d × K∗), and the space complexity of VSA–ANN is O(K∗ × n), assuming n ≫ d.

The space complexity of FCM is O(K∗ × n). The average linkage algorithm has a space complexity of O(n²). The space complexity of the SOM algorithm is also O(K∗ × n) (the size of the distance matrix at each iteration). Finally, the space complexity of SiMM–TS is O(P × K∗ × n), P being the population size. Note that the input data set, of size O(n × d), is also kept in memory for all the above algorithms for faster performance.

9. Data sets and pre-processing

In this article, three real-life gene expression data sets, viz., Yeast Sporulation, Human Fibroblasts Serum and Rat CNS data

1374 U. Maulik, A. Mukhopadhyay / Computers & Operations Research 37 (2010) 1369 -- 1380


Fig. 4. (a) A two-dimensional artificial data set having five clusters. (b) Training data set. (c) Test data set.

sets, have been considered for experiments. The data sets and the pre-processing techniques used are described below.

9.1. Yeast Sporulation

Microarray data on the transcriptional program of sporulation in budding yeast has been considered here. The data set [3] is publicly available at the website http://cmgm.stanford.edu/pbrown/sporulation. DNA microarrays containing 97% of the known and predicted genes are used; the total number of genes is 6118. During the sporulation process, the mRNA levels were obtained at seven time points: 0, 0.5, 2, 5, 7, 9 and 11.5 h. The ratio of each gene's mRNA level (expression) to its mRNA level in vegetative cells before transfer to the sporulation medium is measured, and the ratio data are then log2-transformed. Among the 6118 genes, the genes whose expression levels did not change significantly during the harvesting have been excluded from further analysis. This is determined with a threshold level of 1.4 for the root mean square of the log2-transformed ratios. The resulting set consists of 690 genes.
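The variation filter described above can be sketched as follows (a hypothetical illustration: the gene names and the `rms_filter` helper are ours; only the 1.4 threshold on the root mean square of the log2 ratios comes from the text):

```python
import math

def rms_filter(expression, threshold=1.4):
    """Keep genes whose log2-ratio profile has root mean square >= threshold.

    expression: dict mapping gene name -> list of log2-transformed ratios.
    """
    kept = {}
    for gene, profile in expression.items():
        rms = math.sqrt(sum(v * v for v in profile) / len(profile))
        if rms >= threshold:
            kept[gene] = profile
    return kept

# Illustrative profiles: a nearly flat gene is dropped, a varying one kept.
data = {"gene_flat": [0.1, -0.2, 0.05, 0.1, 0.0, -0.1, 0.2],
        "gene_varying": [2.0, 1.5, -1.8, 2.2, -2.0, 1.7, 1.9]}
print(list(rms_filter(data)))  # → ['gene_varying']
```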

9.2. Human Fibroblasts Serum

This data set contains the expression levels of 8613 human genes [36]. The data set was obtained as follows. First, human fibroblasts were deprived of serum for 48 h and then stimulated by the addition of serum. After the stimulation, expression levels of the genes were computed over 12 time points, and an additional data point was obtained from a separate unsynchronized sample; hence the data set has 13 dimensions. A subset of 517 genes whose expression levels changed substantially across the time points has been chosen. The data are then log2-transformed. This data set can be downloaded from http://www.sciencemag.org/feature/data/984559.shl.

9.3. Rat CNS

The Rat CNS data set has been obtained by reverse transcription-coupled PCR to examine the expression levels of a set of 112 genes during rat central nervous system development over nine time points [37]. This data set is available at http://faculty.washington.edu/kayee/cluster.

Each data set is normalized so that each row has mean 0 and variance 1 (Z normalization) [27].
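This row-wise Z normalization can be expressed as (a minimal sketch; the helper name is ours):

```python
import statistics

def z_normalize(row):
    """Scale one gene's expression profile to mean 0 and variance 1."""
    mu = statistics.fmean(row)
    sigma = statistics.pstdev(row)  # population std. dev., so variance becomes 1
    return [(v - mu) / sigma for v in row]

z = z_normalize([2.0, 4.0, 6.0, 8.0])
print(round(statistics.fmean(z), 6), round(statistics.pstdev(z), 6))
```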

10. Experimental results

This section first provides a description of the performance metrics used to evaluate the performance of the various algorithms. Thereafter, a comparative study has been made among several algorithms in terms of the performance metrics. Finally, a statistical significance test has been carried out to establish that the superior performance of VSA–ANN is statistically significant.

10.1. Performance metrics

For evaluating the performance of the clustering algorithms, the silhouette index [38] is used. In addition, two cluster visualization tools, namely the Eisen plot and the cluster profile plot, have been utilized.

10.1.1. Silhouette index

The silhouette index [38] is a cluster validity index used to judge the quality of any clustering solution C. Suppose a represents the average distance of a point from the other points of the cluster to which the point is assigned, and b represents the minimum of the average distances of the point from the points of the other clusters. The silhouette width s of the point is then defined as

s = \frac{b - a}{\max\{a, b\}},   (13)

and the silhouette index s(C) is the average silhouette width over all the data points (genes); it reflects the compactness and separation of the clusters. The value of the silhouette index varies from −1 to 1, and a higher value indicates a better clustering result.
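The index can be sketched directly from this definition (an illustrative implementation assuming Euclidean distance and at least two points per cluster; the paper itself uses the Pearson-correlation-based distance of Eq. (12)):

```python
import math

def silhouette_index(points, labels):
    """Average silhouette width s(C) over all points.

    points: list of feature vectors; labels: cluster label per point.
    Assumes Euclidean distance and at least two points per cluster.
    """
    clusters = {c: [p for p, l in zip(points, labels) if l == c]
                for c in set(labels)}
    total = 0.0
    for p, l in zip(points, labels):
        others = [q for q in clusters[l] if q is not p]
        a = sum(math.dist(p, q) for q in others) / len(others)
        b = min(sum(math.dist(p, q) for q in clusters[c]) / len(clusters[c])
                for c in clusters if c != l)
        total += (b - a) / max(a, b)
    return total / len(points)

# Two compact, well-separated clusters give a value close to 1.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(silhouette_index(pts, [0, 0, 1, 1]))
```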

10.1.2. Eisen plot

In an Eisen plot [4] (see Fig. 5(a) for an example), the expression value of a gene at a specific time point is represented by coloring the corresponding cell of the data matrix with a color similar to the original color of its spot on the microarray. Shades of red represent a higher expression level, shades of green represent a lower expression level, and colors towards black represent the absence of differential expression. In our representation, the genes are ordered before plotting so that the genes belonging to the same cluster are placed one after another. The cluster boundaries are identified by white blank rows.

10.1.3. Cluster profile plot

The cluster profile plot (see Fig. 5(b) for an example) shows, for each cluster, the normalized gene expression values (light green) of the


Fig. 5. Yeast Sporulation data clustered using the VSA–ANN clustering method. (a) Eisen plot. (b) Cluster profile plots. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

genes of that cluster with respect to the time points. Also, the average expression values of the genes of the cluster over the different time points are shown as a black line, together with the standard deviation within the cluster at each time point.

10.2. Input parameters for VSA–ANN

The parameters for VSA are as follows: Tmax = 100, Tmin = 1, Iter = 200 and a cooling rate of 0.1. The fuzzy exponent m is chosen to be 2.0. The value of K∗, i.e., the soft estimate of the upper bound of the number of clusters, is taken to be 15 for all the data sets.

10.3. Comparative study

In order to establish the effectiveness of the proposed VSA–ANN clustering scheme, its performance has been compared with the average linkage, SOM [12] and SiMM–TS [18] clustering algorithms. Moreover, VSA and an iterated version of the fuzzy C-means (IFCM) algorithm have been applied independently. FCM [21] is a widely used partitional clustering technique. The objective of the FCM algorithm is to use the principles of fuzzy sets to evolve a partition matrix U(X). It minimizes the measure given by the following equation:

J_m = \sum_{j=1}^{n} \sum_{k=1}^{K} u_{kj}^{m} D^2(z_k, x_j).   (14)

It is known that the FCM algorithm sometimes gets stuck at a suboptimal solution [39]. In the iterated FCM (IFCM), the FCM algorithm is run for different values of K from 2 to K∗, where K∗ is the soft estimate of the upper bound of the number of clusters. For each K, it is executed 10 times from different initial configurations, and the run giving the best J_m value is taken. Among these best solutions for the different K values, the solution producing the minimum XB index (Eq. (6)) value is chosen as the best partitioning; the corresponding K and partition matrix are considered as the solution. For average linkage and SOM, the algorithms are executed for values of K ranging from 2 to K∗, and the K value that provides the best silhouette index score is reported. As SOM can only produce an even number of clusters due to its grid structure, to produce the SOM result for k clusters (k being odd), we have merged the two closest clusters (minimum distance between cluster centers) from the SOM clustering result for k + 1 clusters.
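The FCM updates underlying IFCM can be sketched as follows (a simplified illustration with Euclidean distance and the authors' choice m = 2; the function, data and initialization scheme are ours, and the model-selection loop over K with the XB index is omitted):

```python
import math

def fcm(points, K, m=2.0, iters=100):
    """Minimal fuzzy C-means sketch.

    Returns (centers, U), where U[k][j] is the membership of point j in
    cluster k. Illustrative only: Euclidean distance, farthest-point
    initialization and a fixed iteration count.
    """
    n, d = len(points), len(points[0])
    # Deterministic farthest-point initialization of the K centers.
    centers = [points[0]]
    while len(centers) < K:
        centers.append(max(points,
                           key=lambda x: min(math.dist(c, x) for c in centers)))
    for _ in range(iters):
        # Membership update: u_kj = 1 / sum_i (D_kj / D_ij)^(2/(m-1))
        U = []
        for k in range(K):
            row = []
            for x in points:
                dkj = max(math.dist(centers[k], x), 1e-12)
                s = sum((dkj / max(math.dist(centers[i], x), 1e-12))
                        ** (2.0 / (m - 1.0)) for i in range(K))
                row.append(1.0 / s)
            U.append(row)
        # Center update: z_k = sum_j u_kj^m x_j / sum_j u_kj^m
        centers = []
        for k in range(K):
            w = [U[k][j] ** m for j in range(n)]
            tw = sum(w)
            centers.append(tuple(sum(w[j] * points[j][c] for j in range(n)) / tw
                                 for c in range(d)))
    return centers, U

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers, U = fcm(pts, K=2)
print(sorted(centers))  # one center near each of the two point groups
```

In IFCM this inner loop would be repeated for each K from 2 to K∗ and for several initializations, keeping the run with the best J_m.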

10.3.1. Results for Yeast Sporulation data

Table 1 shows the average silhouette index values for the algorithms VSA–ANN, IFCM, VSA, average linkage, SOM and SiMM–TS over 50 consecutive runs. It can be noted from the table that VSA (and thus VSA–ANN), average linkage, SOM and SiMM–TS have determined the number of clusters as 8, whereas IFCM obtained seven clusters in the data set. From the s(C) values, it is evident that the performance of the proposed VSA–ANN clustering method is superior to that of the other methods.

Table 2 reports the number of points included in the training and test sets by the proposed method for all the data sets. The table also reports the s(C) index scores for the test data points before and after the application of the ANN classifier, and the percentage of test points that changed their class labels after the application of the ANN classifier. It can be noted from the table that for the Sporulation data, before the application of the ANN classifier, the s(C) score of the test


Table 1
Average values of s(C) index scores over 50 consecutive runs of various algorithms for different data sets.

Algorithm         Sporulation      Serum            Rat CNS
                  K   s(C)         K   s(C)         K   s(C)
VSA–ANN           8   0.5103       6   0.4543       6   0.5318
IFCM              7   0.3719       8   0.2933       5   0.4135
VSA               8   0.4872       6   0.3571       6   0.4662
Average linkage   8   0.4852       6   0.3092       6   0.3601
SOM               8   0.3812       6   0.3287       5   0.4121
SiMM–TS           8   0.4982       6   0.4289       6   0.4423

Table 2
The change in s(C) scores of the test points and the percentage of test points that changed their class labels after application of ANN in the second stage of VSA–ANN for different data sets.

Data set      Size   Training set   Test set   s(C) before ANN   s(C) after ANN   % changed label
Sporulation   690    474            216        0.1703            0.2493           41.20
Serum         517    330            187        0.1063            0.1539           33.16
Rat CNS       112    63             49         0.1142            0.2083           20.41


Fig. 6. Human Fibroblasts Serum data clustered using VSA–ANN clustering method. (a) Eisen plot. (b) Cluster profile plots.

points was 0.1703. The low value of s(C) for the test data (compared to the overall silhouette index, which is 0.4872, as can be noticed from Table 1) indicates that these data points had not been clustered properly and needed to be refined. After the application of the ANN classifier in the second stage, 41.2% of the test points changed their class labels, and the s(C) score for the test data improved from 0.1703 to 0.2493. This indicates the utility of the proposed VSA–ANN method.

To demonstrate the result of VSA–ANN clustering visually, Fig. 5 shows the Eisen plot and cluster profile plots corresponding to the best results (in terms of the silhouette index) provided by VSA–ANN on the Yeast data set. The Eisen plot (Fig. 5(a)) clearly shows the eight prominent and distinguished clusters of the Yeast data. As is evident from the figure, the genes of a cluster exhibit similar expression profiles, i.e., they produce similar color patterns. The cluster profile plots (Fig. 5(b)) also demonstrate how the expression profiles of the different groups of genes differ from each other, while the profiles within a group are reasonably similar.

10.3.2. Results for Human Fibroblasts Serum data

The average s(C) values obtained by the different clustering algorithms in 50 consecutive runs on the Human Fibroblasts Serum data are reported in Table 1. For this data set, all the algorithms except IFCM determine the number of clusters as 6, whereas IFCM found eight clusters. For this data set also, the proposed VSA–ANN method outperforms all the other algorithms in terms of s(C). It is evident from Table 2 that, for this data set, 33.16% of the test points changed their class labels and the s(C) score of the test points improved


Fig. 7. Rat CNS data clustered using VSA–ANN clustering method. (a) Eisen plot. (b) Cluster profile plots.

Table 3
Median values of s(C) index scores over 50 consecutive runs of various algorithms for different data sets.

Algorithm         Sporulation   Serum    Rat CNS
VSA–ANN           0.5211        0.4591   0.5307
IFCM              0.3982        0.3013   0.4215
VSA               0.4891        0.3525   0.4692
Average linkage   0.4852        0.3092   0.3601
SOM               0.3793        0.3278   0.4122
SiMM–TS           0.4982        0.4233   0.4468

Table 4
p-values produced by Wilcoxon's rank sum test by comparing VSA–ANN with other algorithms for different data sets.

Data set      IFCM       VSA        Average linkage   SOM        SiMM–TS
Sporulation   2.33E−07   4.87E−06   3.56E−05          3.26E−08   3.22E−03
Serum         3.88E−08   2.42E−10   1.39E−16          3.72E−13   7.19E−03
Rat CNS       1.06E−07   6.82E−07   4.42E−17          4.71E−10   1.36E−04

from 0.1063 to 0.1539 after the application of ANN classification in the second stage.

Fig. 6 shows the Eisen plot and cluster profile plots for the clustering solution obtained by the VSA–ANN technique for the Serum data set. It is evident from the figure that the genes of each cluster are highly co-regulated and thus have similar expression profiles.

10.3.3. Results for Rat CNS data

Table 1 reports the average s(C) values for the clustering results obtained by the different algorithms in 50 consecutive runs on the Rat CNS data. For this data set also, VSA (and thus VSA–ANN), average linkage and SiMM–TS give the number of clusters as 6, similar to that found in [37]; IFCM and SOM identified five clusters in the data set. Again, the proposed VSA–ANN clustering method provides a much improved value of s(C) compared to all the other algorithms. As is evident from Table 2, for this data set, 20.41% of the test points changed their class labels, and the s(C) score of the test data points improved from 0.1142 to 0.2083 after the application of ANN classification. For illustration, the Eisen plot and cluster profile plots for this data set are shown in Fig. 7.

As discussed above, the results indicate a significant improvement in clustering performance using the proposed VSA–ANN clustering approach compared to the other algorithms. A statistical significance test has been carried out next to establish that the superior results obtained by VSA–ANN are statistically significant.


Table 5
The functional enrichment significance score (p-value) for the significant clusters of Yeast Sporulation data as obtained by different algorithms.

Cluster   VSA–ANN     IFCM        VSA         Avg. link.  SOM         SiMM–TS
1         1.400E−45   1.400E−45   1.400E−45   1.400E−45   8.124E−45   1.400E−45
2         1.400E−45   8.823E−41   1.325E−42   7.284E−32   1.332E−28   8.527E−44
3         1.714E−25   1.373E−22   7.652E−25   8.811E−11   7.362E−25   1.102E−24
4         8.976E−25   1.263E−08   1.145E−23   1.282E−08   1.635E−21   1.095E−23
5         2.833E−14   1.211E−08   1.223E−12   1.613E−04   1.434E−07   1.057E−12
6         6.235E−09   1.761E−06   1.032E−06   –           1.710E−06   7.093E−08
7         1.320E−04   –           1.445E−04   –           –           1.664E−04
8         1.858E−04   –           1.823E−03   –           –           1.208E−03

The clusters are sorted according to significance level.


Fig. 8. Plot of the functional enrichment significance score (p-value) for the significant clusters of Yeast Sporulation data as obtained by different algorithms. The p-values have been log-transformed (base 10) for better readability. The clusters are sorted according to significance level.

10.4. Statistical significance test

To judge the statistical significance of the clustering results, a non-parametric statistical significance test, Wilcoxon's rank sum test for independent samples [40], has been conducted at the 1% significance level. Six groups, corresponding to the six algorithms (1. VSA–ANN, 2. IFCM, 3. VSA, 4. average linkage, 5. SOM, 6. SiMM–TS), have been created for each data set. Each group consists of the s(C) index scores produced by 50 consecutive runs of the corresponding algorithm. The median values of the s(C) index scores in each group for all the data sets are shown in Table 3.

It is evident from Table 3 that the median values of the s(C) index scores for VSA–ANN are better than those for the other algorithms. To establish that this superiority is statistically significant, Table 4 reports the p-values produced by Wilcoxon's rank sum test for the comparison of two groups (the group corresponding to VSA–ANN and a group corresponding to another algorithm) at a time. The null hypothesis assumes that there is no significant difference between the median values of the two groups, whereas the alternative hypothesis is that the difference is significant. All the p-values reported in the table are less than 0.01 (1% significance level). For example, the rank sum test between the algorithms VSA–ANN and IFCM for the Sporulation data set provides a p-value of 2.33E−07, which is very small. This is strong evidence against the null hypothesis, indicating that the better median values of the performance metrics produced by VSA–ANN are statistically significant and have not occurred by chance. Similar results are obtained for all the other data sets and algorithms compared with VSA–ANN, establishing the significant superiority of the VSA–ANN algorithm.
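The rank sum test used here can be sketched with the large-sample normal approximation (an illustration only: the s(C) scores below are hypothetical, ties are not corrected for, and an exact test should be preferred for small samples):

```python
import math
from statistics import NormalDist

def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation."""
    n1, n2 = len(x), len(y)
    combined = sorted((v, g) for g, sample in ((0, x), (1, y))
                      for v in sample)
    # Rank sum of the first sample (ranks start at 1).
    W = sum(i + 1 for i, (v, g) in enumerate(combined) if g == 0)
    mu = n1 * (n1 + n2 + 1) / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (W - mu) / sigma
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical s(C) scores: one consistently higher group of runs.
a = [0.51, 0.52, 0.50, 0.53, 0.49, 0.52, 0.51, 0.50]
b = [0.37, 0.39, 0.38, 0.40, 0.36, 0.38, 0.37, 0.39]
print(rank_sum_p(a, b) < 0.01)  # → True
```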

11. Biological significance

The biological relevance of a cluster can be verified based on the statistically significant GO annotation database (http://db.yeastgenome.org/cgi-bin/GO/goTermFinder). This is used to test the functional enrichment of a group of genes in terms of three structured, controlled vocabularies (ontologies), viz., associated biological processes, molecular functions and cellular components.

The p-value of a statistical significance test gives the probability of obtaining a value of the test statistic at least as extreme as the one observed. The degree of functional enrichment (p-value) is computed using a cumulative hypergeometric distribution, which measures the probability of finding the number of genes involved in a given GO term (i.e., function, process, component) within a cluster. For a given GO category, the probability p of getting k or more genes within a


Table 6
The three most significant GO terms and the corresponding p-values for each of the eight clusters of Yeast data as found by the VSA–ANN clustering technique.

Clusters    Significant GO term                                                p-value

Cluster 1   Microtubule organizing center—GO:0005815                           6.235E−9
            Spore wall assembly (sensu Fungi)—GO:0030476                       1.016E−7
            Microtubule cytoskeleton organization and biogenesis—GO:0000226    1.672E−7

Cluster 2   Nucleotide metabolic process—GO:0009117                            1.320E−4
            Glucose catabolic process—GO:0006007                               2.856E−4
            External encapsulating structure—GO:0030312                        3.392E−4

Cluster 3   Cytosolic part—GO:0044445                                          1.4E−45
            Cytosol—GO:0005829                                                 1.4E−45
            Ribosomal large subunit assembly and maintenance—GO:0000027        7.418E−8

Cluster 4   Spore wall assembly (sensu Fungi)—GO:0030476                       8.976E−25
            Sporulation—GO:0030435                                             2.024E−24
            Cell division—GO:0051301                                           7.923E−16

Cluster 5   Glycolysis—GO:0006096                                              2.833E−14
            Cytosol—GO:0005829                                                 3.138E−4
            Cellular biosynthetic process—GO:0044249                           5.380E−4

Cluster 6   M phase of meiotic cell cycle—GO:0051327                           1.714E−25
            M phase—GO:0000279                                                 1.287E−23
            Meiosis I—GO:0007127                                               5.101E−22

Cluster 7   Ribosome biogenesis and assembly—GO:0042254                        1.4E−45
            Intracellular non-membrane-bound organelle—GO:0043232              1.386E−23
            Organelle lumen—GO:0043233                                         9.460E−21

Cluster 8   Organic acid metabolic process—GO:0006082                          1.858E−4
            Amino acid and derivative metabolic process—GO:0006519             4.354E−4
            External encapsulating structure—GO:0030312                        6.701E−4

cluster of size n can be defined as [41]

p = 1 - \sum_{i=0}^{k-1} \frac{\binom{f}{i} \binom{g-f}{n-i}}{\binom{g}{n}},   (15)

where f and g denote the total number of genes within the category and within the genome, respectively. Statistical significance is evaluated for the genes in a cluster by computing p-values for each GO category. This signifies how well the genes in the cluster match the different GO categories. If the majority of the genes in a cluster have the same biological function, then it is unlikely that this happened by chance, and the p-value of the category will be close to 0.
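Eq. (15) can be evaluated directly with exact integer arithmetic (a minimal sketch computing the equivalent upper tail of the hypergeometric distribution; the example counts are hypothetical, with only the genome size 6118 taken from the Sporulation data):

```python
from math import comb

def go_enrichment_p(k, n, f, g):
    """Probability of observing k or more genes from a GO category of
    size f, out of the g genes in the genome, in a cluster of size n.

    Computed as the upper hypergeometric tail, which equals the
    1 - (lower tail) form of Eq. (15).
    """
    total = comb(g, n)
    return sum(comb(f, i) * comb(g - f, n - i)
               for i in range(k, min(f, n) + 1)) / total

# Hypothetical counts: 20 of a cluster's 50 genes carry a GO term that
# annotates only 100 of the 6118 genes in the genome.
print(go_enrichment_p(k=20, n=50, f=100, g=6118) < 1e-10)  # → True
```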

The biological significance test has been conducted for the Yeast Sporulation data at the 1% significance level. For the different algorithms, the numbers of clusters for which the most significant GO terms have p-values less than 0.01 (1% significance level) are as follows: VSA–ANN—8, IFCM—6, VSA—8, average linkage—5, SOM—6, and SiMM–TS—8. Note that only for VSA, VSA–ANN and SiMM–TS are all the produced clusters significantly enriched with some GO categories. In Table 5, the most significant p-values of the functionally enriched clusters of Yeast Sporulation data as obtained by the different algorithms are reported. The clusters are sorted according to significance level; the smaller the p-value, the higher the significance. For visual inspection, Fig. 8 plots the most significant p-values of the functionally enriched clusters (sorted by significance level) of this data set for the different algorithms. The p-values are log-transformed for better readability. It is clear from the figure that the curve corresponding to VSA–ANN lies below all the other curves. This indicates that all eight clusters found by VSA–ANN are more significantly enriched than the clusters obtained by the other algorithms.

For the purpose of illustration, Table 6 reports the three most significant GO terms shared by the genes of each of the eight clusters identified by the VSA–ANN technique (Fig. 5). The most significant GO terms for these eight clusters are microtubule organizing center (p-value: 6.235E−9), nucleotide metabolic process (p-value: 1.320E−4), cytosolic part (p-value: 1.4E−45), spore wall assembly (sensu Fungi) (p-value: 8.976E−25), glycolysis (p-value: 2.833E−14), M phase of meiotic cell cycle (p-value: 1.714E−25), ribosome biogenesis and assembly (p-value: 1.4E−45) and organic acid metabolic process (p-value: 1.858E−4), respectively. It is evident from the table that all the clusters produced by the VSA–ANN clustering scheme are significantly enriched with some GO categories, since all the p-values are less than 0.01 (1% significance level). This establishes that the proposed VSA–ANN clustering scheme is able to produce biologically relevant and functionally enriched clusters.

12. Discussion and conclusions

In this article, a clustering algorithm (VSA–ANN) for clustering microarray gene expression data, which combines a VSA-based fuzzy clustering method with a probabilistic ANN classifier, has been proposed. The number of clusters in a gene expression data set is automatically evolved by the proposed VSA–ANN clustering technique. The results demonstrate how an improvement in clustering performance is obtained by refining the clustering solution produced by VSA using the ANN classifier. The performance of the proposed clustering method has been compared with the average linkage, SOM, VSA, IFCM and recently proposed SiMM–TS clustering algorithms to show its effectiveness on three real-life gene expression data sets. It has been found that the VSA–ANN clustering scheme significantly outperforms all the other clustering methods. Moreover, VSA performs reasonably well in determining the appropriate number of clusters of the gene expression data sets. The clustering solutions have been evaluated both quantitatively (i.e., using the silhouette index) and using gene expression visualization tools. Statistical tests have also been conducted in order to establish the statistical significance of the results produced by the proposed technique. Finally, a biological significance test has been carried out in order to establish the biological relevance of the clusters produced by the VSA–ANN clustering method as compared to the other algorithms.


References

[1] Sharan R, Adi M-K, Shamir R. CLICK and EXPANDER: a system for clustering and visualizing gene expression data. Bioinformatics 2003;19:1787–99.
[2] Alizadeh AA, Eisen MB, Davis R, Ma C, Lossos I, Rosenwald A, et al. Distinct types of diffuse large B-cell lymphomas identified by gene expression profiling. Nature 2000;403:503–11.
[3] Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown PO, et al. The transcriptional program of sporulation in budding yeast. Science 1998;282:699–705.
[4] Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America 1998;95:14863–8.
[5] Bandyopadhyay S, Maulik U, Wang JT. Analysis of biological data: a soft computing approach. Singapore: World Scientific; 2007.
[6] Jain AK, Dubes RC. Algorithms for clustering data. Englewood Cliffs, NJ: Prentice-Hall; 1988.
[7] Tou JT, Gonzalez RC. Pattern recognition principles. Reading: Addison-Wesley; 1974.
[8] Hartigan JA. Clustering algorithms. New York: Wiley; 1975.
[9] Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodica L, et al. A genome-wide transcriptional analysis of mitotic cell cycle. Molecular Cell 1998;2:65–73.
[10] Herwig R, Poustka A, Meuller C, Lehrach H, O'Brien J. Large-scale clustering of cDNA fingerprinting data. Genome Research 1999;9(11):1093–105.
[11] Dembele D, Kastner P. Fuzzy c-means method for clustering microarray data. Bioinformatics 2003;19(8):973–80.
[12] Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, et al. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proceedings of the National Academy of Sciences of the United States of America 1999;96:2907–12.
[13] Hartuv E, Shamir R. A clustering algorithm based on graph connectivity. Information Processing Letters 2000;76:175–81.
[14] Alon U, Barkai N, Notterman DA, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America 1999;96:6745–50.
[15] Lukashin AV, Fuchs R. Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics 2001;17(5):405–14.
[16] Maulik U, Bandyopadhyay S. Genetic algorithm based clustering technique. Pattern Recognition 2000;33:1455–65.
[17] Mukhopadhyay A, Maulik U, Bandyopadhyay S. Multiobjective evolutionary approach to fuzzy clustering of microarray data. Singapore: World Scientific; 2007. p. 303–26 [chapter 13].
[18] Bandyopadhyay S, Mukhopadhyay A, Maulik U. An improved algorithm for clustering gene expression data. Bioinformatics 2007;23(21):2859–65.
[19] Bishop C. Neural networks for pattern recognition. Oxford: Oxford University Press; 1996.
[20] MacKay DJC. The evidence framework applied to classification networks. Neural Computation 1992;4(5):720–36.
[21] Bezdek JC. Pattern recognition with fuzzy objective function algorithms. New York: Plenum Press; 1981.
[22] Mewes HW, Albermann K, Heumann K, Liebl S, Pfeiffer F. MIPS: a database for protein sequences, homology data and yeast genome information. Nucleic Acids Research 1997;25:28–30.
[23] Xie XL, Beni G. A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 1991;13:841–7.
[24] Bandyopadhyay S, Maulik U, Mukhopadhyay A. Multiobjective genetic clustering for pixel classification in remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing 2007;45(5):1506–11.
[25] Domany E. Cluster analysis of gene expression data. Journal of Statistical Physics 2003;110(3–6):1117–39.
[26] Shannon W, Culverhouse R, Duncan J. Analyzing microarray data using cluster analysis. Pharmacogenomics 2003;4(1):41–51.
[27] Kim SY, Lee JW, Bae JS. Effect of data normalization on fuzzy clustering of DNA microarray data. BMC Bioinformatics 2006;7:134.
[28] Kirkpatrick S, Gelatt C, Vecchi M. Optimization by simulated annealing. Science 1983;220:671–80.
[29] van Laarhoven PJM, Aarts EHL. Simulated annealing: theory and applications. Dordrecht: Kluwer Academic Publishers; 1987.
[30] Geman S, Geman D. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 1984;6(6):721–41.
[31] Caves R, Quegan S, White R. Quantitative comparison of the performance of SAR segmentation algorithms. IEEE Transactions on Image Processing 1998;7(11):1534–46.
[32] Maulik U, Bandyopadhyay S, Trinder J. SAFE: an efficient feature extraction technique. Journal of Knowledge and Information Systems 2001;3:374–87.
[33] Bandyopadhyay S, Maulik U, Pakhira MK. Clustering using simulated annealing with probabilistic redistribution. International Journal of Pattern Recognition and Artificial Intelligence 2001;15(2):269–85.
[34] Andersen LN, Larsen J, Hansen LK, Hintz-Madsen M. Adaptive regularization of neural classifiers. In: Proceedings of the IEEE workshop on neural networks for signal processing VII, New York, USA; 1997. p. 24–33.
[35] Sigurdsson S, Larsen J, Hansen L. Outlier estimation and detection: application to skin lesion classification. In: Proceedings of the international conference on acoustics, speech and signal processing; 2002.
[36] Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee J, et al. The transcriptional program in the response of human fibroblasts to serum. Science 1999;283:83–7.
[37] Wen X, Fuhrman S, Michaels GS, Carr DB, Smith S, Barker JL, et al. Large-scale temporal gene expression mapping of central nervous system development. Proceedings of the National Academy of Sciences of the United States of America 1998;95:334–9.
[38] Rousseeuw P. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 1987;20:53–65.
[39] Groll L, Jakel J. A new convergence proof of fuzzy c-means. IEEE Transactions on Fuzzy Systems 2005;13(5):717–20.
[40] Hollander M, Wolfe DA. Nonparametric statistical methods. 2nd ed. New York: Wiley; 1999.
[41] Tavazoie S, Hughes J, Campbell M, Cho R, Church G. Systematic determination of genetic network architecture. Nature Genetics 1999;22:281–5.