
A Differential Evolution Algorithm Based Automatic Determination of Optimal Number of Clusters Validated by Fuzzy Intercluster Hostility Index

Sourav De #1, Siddhartha Bhattacharyya #2, Paramartha Dutta *3

# Dept. of Computer Science & Information Technology, University Institute of Technology, The University of Burdwan
Golapbag (North), Burdwan, West Bengal, India, Pin-713104

1 [email protected]

2 [email protected]

* Dept. of Computer and System Sciences, Visva-Bharati University
Santiniketan, West Bengal, India, Pin-731235

3 [email protected]

Abstract-Automatic data clustering through determination of the optimal number of clusters from the data content is a challenging proposition. Lack of knowledge regarding the underlying data distribution poses constraints on the proper determination of the inherent number of clusters.

A differential evolution (DE) algorithm based approach for the determination of the optimal number of clusters from the data under consideration is presented in this article. The optimum number of clusters obtained by the algorithm is further validated by means of a proposed fuzzy intercluster hostility index between the different clusters thus obtained.

Applications of the proposed approach to the clustering of real life gray level images indicate encouraging results. The proposed method is also compared with the classical DE (which operates with a known number of classes) and the automatic clustering DE (ACDE) algorithms.

I. INTRODUCTION

Clustering plays an important role in various fields, from statistics, computer science, engineering, and biology to the social sciences and psychology. Clustering means partitioning data points into disjoint groups/regions such that data points belonging to the same cluster are similar while data points belonging to different clusters are dissimilar to each other. This measure of similarity/dissimilarity is usually based on some metric. If (x_1, x_2, ..., x_n) are n points in the input space I and all the points are separated into K clusters (U_1, U_2, ..., U_K) based on some features, then

U_i ≠ ∅ for i = 1, ..., K; U_i ∩ U_j = ∅ for i, j = 1, ..., K, i ≠ j; and ∪_{i=1}^{K} U_i = I.

Different clustering techniques are in vogue for evaluating the relationship between patterns by organizing patterns into clusters, so that the intracluster patterns are more similar among themselves than are the patterns belonging to different clusters, i.e., intercluster patterns. A good and extensive overview of clustering algorithms can be found in [1][2]. The k-means algorithm is one of the simplest and most popular iterative clustering algorithms [1][3]. The procedure follows a simple and easy way to classify a given data set into a certain number of clusters. The cluster centroids are initialized randomly or derived from some a priori information. Each data point in the data set is then assigned to the closest cluster. Finally, the centroids are recomputed from their associated data points. This algorithm optimizes the distance criterion either by maximizing the intercluster separation or by minimizing the intracluster separation. The k-means algorithm is a data-dependent greedy algorithm that may converge to a suboptimal solution. An unsupervised clustering algorithm, using the k-means algorithm, is presented by Rosenberger et al. [4]. An improved k-means algorithm for clustering is presented in [5]. In this method, each data point is stored in a kd-tree and it is shown that the algorithm runs faster as the separation between the clusters increases. Fuzzy c-means [6] is another widely used clustering technique. Yang et al. [7] proposed a new clustering algorithm, referred to as the penalized FCM (PFCM) algorithm, for application to images.
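The assign-and-recompute iteration described above can be sketched on 1-D intensity data as follows; the sample data, seed, and iteration cap are illustrative assumptions, not part of the paper.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means on 1-D data (e.g. gray-level intensities).

    Centroids are initialized by random sampling; each point is then
    assigned to its closest centroid, and each centroid is recomputed
    as the mean of its assigned points, until assignments stabilize."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iters):
        # Assignment step: each point goes to the nearest centroid.
        new_assignment = [min(range(k), key=lambda c: abs(p - centroids[c]))
                          for p in points]
        if new_assignment == assignment:   # converged
            break
        assignment = new_assignment
        # Update step: recompute each centroid from its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # keep old centroid if cluster emptied
                centroids[c] = sum(members) / len(members)
    return centroids, assignment
```

On well-separated data this greedy iteration converges quickly, but, as noted above, it may settle on a suboptimal partition depending on the initial centroids.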

The clustering process can be optimized by using the genetic algorithm (GA) in an evolutionary approach. Genetic algorithms (GAs) [8] are derivative-free stochastic optimization methods based on the concepts of natural selection and evolutionary processes. In a GA, a population of candidate solutions, described by chromosomes, is iteratively updated by different genetic operators, viz., selection, mutation, and crossover. Each candidate solution is evaluated by a fitness function that controls the population evolution during optimization. The mutation operator is applied to provide insurance against the development of a uniform population incapable of further evolution. The crossover operator is employed to recombine information between different candidate solutions. Sheikh et al. [9] reported different types of GA based clustering algorithms. GA has been used to segment an image into different regions

978-1-4244-4787-9/09/$25.00 ©2009 IEEE 105 ICAC 2009

depending upon the piecewise linearity of the segments [10]. In that article, GA has been used to determine hyperplanes as decision boundaries, which divide the dataset into a number of clusters. Srikanth et al. [11] proposed a supervised clustering algorithm based on GA. Some classical clustering techniques are often hybridized with GA [12] to cluster unlabeled datasets. An improved genetic algorithm is presented by Katari et al. [13]. In this approach, GA is combined with the Nelder-Mead (NM) simplex search and the k-means algorithm. The GA-based k-means algorithm, i.e., the genetic k-means algorithm (GKA), is also used to cluster unlabeled data sets [14]. However, in most of the reported approaches, the number of clusters k is either set in advance or it is externally provided.

Nevertheless, the number of clusters in a previously unlabeled dataset can be found automatically. Automatic data clustering through determination of the optimal number of clusters from the data content is at the helm of affairs in the research community. Lack of knowledge regarding the underlying data distribution poses constraints on the proper determination of the inherent number of clusters. Several algorithms for automatic clustering exist in the literature [15]. Automatic clustering is implemented by GA in biological data mining and information retrieval [16]. In this method, a cohesion-and-coupling metric is integrated into the proposed hybrid algorithm consisting of a genetic algorithm and a split-and-merge algorithm. An evolutionary strategy (ES) based approach that has the capability to cluster unlabeled datasets is proposed by Lee et al. [17]. In this approach, variable-length strings have been used to search for both the centroids and the optimal number of clusters. Different approaches to dynamically clustering a dataset by applying evolutionary programming have been investigated in the recent past. Bandyopadhyay et al. [18] used a variable string-length genetic algorithm to determine the number of clusters automatically. The underlying string representation comprises both real numbers (for valid cluster centroids) and don't care symbols (for invalid ones).

Evolutionary computing techniques have been applied rigorously to determine clusters in complex data sets in the past few years. The differential evolution (DE) algorithm is one such search and optimization algorithm, which is more likely to find a function's true global optimum. The classical version of the DE algorithm was proposed by Storn et al. [19]. In this approach, each component in the individual vector is represented by real coding of floating point numbers. This technique is employed as a search procedure due to its fast convergence properties and its ability to find optimal solutions. However, the application of the algorithm to real-life situations is limited by the fact that the number of clusters must be provided externally. Das et al. [20] proposed an improved version of the classical form of the DE algorithm by incorporating several modifications. The proposed automatic clustering differential evolution (ACDE) algorithm possesses the capability to cluster unlabeled datasets automatically. The active/inactive cluster centroids are selected based on a threshold value, which is associated with each and every cluster centroid. In this approach, a cluster centroid is

designated as an active cluster if the corresponding threshold value is greater than 0.5. All other clusters with cluster centroids having thresholds less than 0.5 are made inactive, thereby arriving at an optimum number of active clusters. However, the basis for the choice of a threshold value of 0.5 is heuristic.

Several traditional approaches for determining the optimal number of clusters in a data set exist. The quality of partitioning for a range of cluster numbers is generally measured by a cluster validity index or a statistical-mathematical function. These are concerned with the determination of the optimal number of clusters followed by the investigation of the quality of the clustering results. The basic idea of determining the optimum number of classes is to employ several instances of the algorithm with a varied number of classes as input and then to choose the partitioning of the dataset based on the best achievable validity measure. The maximum/minimum values of the validity measures suggest the appropriate partitions. Some well-known validity measures available in the literature are Davies-Bouldin's measure [21], Dunn's separation measure [22], Bezdek's partition coefficient [23], Xie-Beni's separation measure [24], the Pakhira-Bandyopadhyay-Maulik (PBM) index [25], etc. Optimization algorithms are mostly used to evaluate these cluster validity indices.

In this article, we propose a new approach for the automatic determination of the number of clusters in an image. The classification of an image into different regions is determined from the unlabeled intensity data of the constituent pixels. The method is intelligent enough to take decisions so far as the optimal number of clusters is concerned. Valid/invalid clusters are selected from the image information content itself. The proposed method is implemented in two phases. Firstly, it uses the classical DE algorithm (starting with a given maximum number of clusters) for determining the optimal number of clusters in the input image. Since the inherent success of a clustering process lies in the segregation of input data into well-delineated clusters/regions based on their heterogeneity, a fuzzy intercluster hostility index is proposed to quantify intercluster heterogeneity. In the second phase, this fuzzy intercluster hostility index is applied for the selection of the valid clusters. The total dataset is clustered accurately if the fuzzy intercluster hostility index between two different clusters/regions is more than the average fuzzy intercluster hostility index for the entire image data. The best optimal number of valid cluster centroids is thus obtained by the proposed approach. The fitness of the individual chromosome vectors in the DE algorithm is computed by means of the Davies-Bouldin index [21]. The application of the proposed approach is demonstrated on two real life gray level images. The proposed approach is also compared with the conventional classical DE algorithm [19] with a given number of clusters and the automatic clustering DE (ACDE) algorithm [20].


II. IMAGE INTERCLUSTER HETEROGENEITY

An image is clustered into several distinct homogeneous object regions based on a proper selection of unique object centric features to distinguish the different clusters/regions. These features include intensity, spatial co-ordinates, object textures, shape, etc. In this section, we present image clustering in respect of intercluster heterogeneity and intracluster homogeneity.

Let P = {p_ij}, 1 ≤ i ≤ m, 1 ≤ j ≤ n [27] represent the intensity values of the (i, j)th pixels in an image of dimensions m x n. An image comprises a specific permutation of these pixel intensity values, p_ij, depending on the intensity distribution. Let the image be clustered into K number of classes. This clustering/classification mechanism can be represented mathematically by a function c defined as

c: P -> [0, ..., K]    (1)

Such a K-class clustered image can be represented as a sequence of gray levels g_k, 0 ≤ k ≤ K, where g_0 = 0 and g_K = 255 (considering 256 gray level images), and

0 = g_0 ≤ g_1 ≤ ... ≤ g_k ≤ ... ≤ g_(K-1) ≤ g_K = 255    (2)

The kth class comprises all those pixels with intensity levels ∈ [g_(k-1), g_k) = S_k. The individual classes in a K-class classified image follow the following conditions.

• S_k ∩ S_l = ∅, for k ≠ l, 1 ≤ k, l ≤ K

• ∪_k S_k = [0..255]

Thus, S_I = {S_k, 1 ≤ k ≤ K} suggests a class composition of the image I. Let S_Φ = {S_I | S_I is a representative class composition of I}. Hence, the objective of the image clustering procedure is to find out an optimum choice S*_I = {S*_k, 1 ≤ k ≤ K} ∈ S_Φ of image I.

A clustered image thus comprises K number of distinct clusters/regions differing in heterogeneity with respect to the intensity levels. This intercluster heterogeneity can be best quantified by a distance metric based on the representative pixel intensity values in the clusters. Moreover, each cluster in the K-class composition of the image is again homogeneous with respect to the intensity levels of the constituent pixels.

A. Fuzzy Intercluster Hostility Index

As stated above, the basic objective of image clustering is to segregate the constituent pixels based on uniformity/homogeneity of features. A cluster in a clustered image is more homogeneous if the representative fuzzy membership values are closer to each other. On the other hand, the heterogeneity between different clusters arises due to the sharply contrasting fuzzy membership values of the elements in the clusters. Following this proposition, an image gets clustered into several homogeneous clusters/regions (R_1, ..., R_K). A fuzzy intercluster hostility index (ξ_ij) can be defined to evaluate the intercluster heterogeneity between two clusters/regions R_i and R_j as

ξ_ij = Δ(R_i, R_j) = (1/|R_i|)(1/|R_j|) Σ_{p∈R_i} Σ_{q∈R_j} |μ(p) ⊗ μ(q)|    (3)

where 1 ≤ i, j ≤ K, and μ(p) and μ(q) are the fuzzy intensity values of pixels p and q in the clusters/regions R_i and R_j, respectively. ⊗ is a difference operator on the representative fuzzy feature values. |R_i| and |R_j| are the cardinalities of the clusters/regions R_i and R_j, respectively. Hence,

• ξ_ij = Δ(R_i, R_j) = 0, if R_i and R_j contain homogeneous features.

• ξ_ij = Δ(R_i, R_j) = 1, if R_i and R_j contain complementary features.

• 0 ≤ ξ_ij ≤ 1, ∀ 1 ≤ i, j ≤ K, in general.

Thus, K x K intercluster hostility index values can be obtained from a K-class clustered image, referred to as the fuzzy intercluster hostility matrix (H_K(Δ)). It is denoted as

H_K(Δ) = (ξ_ij), where 1 ≤ i, j ≤ K    (4)
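Eqs. (3)-(4) can be sketched as follows. The paper leaves the difference operator ⊗ generic, so this sketch assumes it is the absolute difference of fuzzy membership values (gray levels normalized to [0, 1]); the function names are our own.

```python
def hostility_index(region_i, region_j):
    """Fuzzy intercluster hostility index xi_ij of Eq. (3).

    region_i, region_j: lists of fuzzy membership values in [0, 1].
    The difference operator of Eq. (3) is taken here as the absolute
    difference, which is an assumption -- the paper leaves it generic."""
    total = sum(abs(p - q) for p in region_i for q in region_j)
    return total / (len(region_i) * len(region_j))

def hostility_matrix(regions):
    """K x K fuzzy intercluster hostility matrix H_K of Eq. (4)."""
    return [[hostility_index(ri, rj) for rj in regions] for ri in regions]
```

With this choice of ⊗, a region compared against itself yields a value near zero, while two regions with strongly contrasting memberships approach one, matching the three properties listed above.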

III. DAVIES-BOULDIN (DB) VALIDITY INDEX

The DB-index [21] is applied in a clustering procedure to decide which of the various features and channels would be of most use in the subsequent classification task. It is utilized as a validity criterion to evaluate compact and separate clusters. The ratio of within-cluster scatter to between-cluster separation is calculated by the DB-index. The DB-index uses both the clusters and their sample means. Let X = {X_1, ..., X_K} be the data set and U = {U_1, ..., U_K} be its K-class clustered output. The cluster similarity between the ith and jth clusters (R_ij) is evaluated along with the distance between the cluster centroids (D_ij). Cluster similarity is defined as [21]

R_ij = (S_i + S_j) / D_ij    (5)

where S_i and S_j are the average dispersions of the ith and jth clusters, respectively. The intercluster distance is given by [21]

D_ij = {Σ_{k=1}^{N} |m_ki - m_kj|^p}^(1/p)    (6)

where m_ki is the kth component of the centroid of cluster i. D_ij is the Minkowski distance between the centroids that characterize clusters i and j. The dispersion within a cluster is calculated from

S_i = {(1/T_i) Σ_{x∈C_i} ||x - m_i||^q}^(1/q)    (7)

where m_i is the centroid of cluster i. The centroid m_i is evaluated as m_i = (1/T_i) Σ_{p∈C_i} p, and T_i is the number of points in cluster C_i. Specifically, S_i, as used in this article, is the standard deviation of the Euclidean distance between all data points in a cluster and the cluster's centroid. Finally, the DB-index is given by [21]

DBI = (1/N) Σ_{i=1}^{N} R_i, where R_i = max_{j≠i} R_ij    (8)
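Eqs. (5)-(8) can be sketched for 1-D data as follows; this is a minimal illustration assuming Euclidean exponents (p = q = 2), not the paper's implementation.

```python
import math

def db_index(clusters):
    """Davies-Bouldin index of Eqs. (5)-(8) for 1-D clusters.

    clusters: list of lists of scalar data points.  Dispersion S_i is
    the root-mean-square distance to the centroid (q = 2 in Eq. (7));
    D_ij is the centroid distance (p = 2 in Eq. (6), trivial in 1-D)."""
    centroids = [sum(c) / len(c) for c in clusters]
    S = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
         for c, m in zip(clusters, centroids)]
    n = len(clusters)
    total = 0.0
    for i in range(n):
        # R_i is the worst-case similarity of cluster i to any other cluster.
        total += max((S[i] + S[j]) / abs(centroids[i] - centroids[j])
                     for j in range(n) if j != i)
    return total / n
```

Tighter, better-separated clusters give a smaller index, which is why a minimum DB-index signals faithful clustering.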


A minimum DB-index implies proper and faithful clustering of a dataset.

IV. DIFFERENTIAL EVOLUTION (DE)

The differential evolution (DE) algorithm is a population based evolutionary algorithm proposed by Storn et al. [19]. DE uses real coding of floating point numbers, whereas the simple genetic algorithm uses a binary coding scheme for representing problem parameters. It is a simply structured, easy to use, parallel and direct search method having good convergence and fast implementation properties. The population size in a DE algorithm remains constant throughout the optimisation process. DE has three control parameters, i.e., the population size (NR), the amplification factor of the difference vector (F) and the crossover control parameter (CCP). During one generation, for each vector, DE employs the mutation, crossover and selection operations to produce a new vector for the next generation.

An individual in DE is represented by a D-dimensional vector. A population comprises NR D-dimensional parameter vectors P_i,G (i = 1, 2, ..., NR) used to reach an optimal solution. The evolutionary operations can be summarized as follows.

A. Mutation

For each individual vector P_i,G from generation G, a mutant vector M_i,G+1 is produced by

M_i,G+1 = P_r1,G + F(P_r2,G - P_r3,G)    (9)

where i, r1, r2, r3 ∈ {1, 2, ..., NR} are randomly chosen and i ≠ r1 ≠ r2 ≠ r3. F ∈ [0, 1] is a scaling factor for the difference vector (P_r2,G - P_r3,G). Therefore, the minimum number of parameter vectors must be four.

B. Crossover

In this operation, the parent vector is mixed with the mutated vector using the following scheme to yield a trial vector V_i,G+1 = (v_1i,G+1, v_2i,G+1, ..., v_Di,G+1):

v_ji,G+1 = M_ji,G+1, if (rnd(j) ≤ CCP) or (j = rni)
           P_ji,G,   if (rnd(j) > CCP) and (j ≠ rni)    (10)

where j = 1, 2, ..., D; rnd(j) ∈ [0, 1] is a random number; CCP ∈ [0, 1] is the crossover control parameter; and rni ∈ {1, 2, ..., D} is a randomly selected index which ensures that V_i,G+1 gets at least one element from M_i,G+1 after the crossover operation.

C. Selection

DE uses a greedy selection scheme. The selection operator, used to produce a better offspring P_i,G+1, is given by

P_i,G+1 = V_i,G+1, if f(V_i,G+1) < f(P_i,G)
          P_i,G,   otherwise    (11)

where f is the objective function.

The number of clusters is predefined in the differential evolution algorithm. This is the inherent limitation of the DE algorithm, since, in a real life scenario, the appropriate number of clusters in an unprocessed dataset may be impossible to determine beforehand.
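The three operations of Eqs. (9)-(11) combine into the following minimal sketch. The sphere objective in the usage below, the bounds, and the parameter values are illustrative assumptions; in the paper the vectors would instead encode K cluster centroids scored by the DB-index.

```python
import random

def differential_evolution(f, bounds, NR=20, F=0.8, CCP=0.8,
                           generations=100, seed=1):
    """Minimal classical DE (Eqs. (9)-(11)) minimizing f over a box.

    bounds: list of (lo, hi) pairs, one per dimension."""
    rng = random.Random(seed)
    D = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(NR)]
    fit = [f(p) for p in pop]
    for _ in range(generations):
        for i in range(NR):
            # Mutation: M = P_r1 + F * (P_r2 - P_r3), r's distinct from i.
            r1, r2, r3 = rng.sample([r for r in range(NR) if r != i], 3)
            M = [pop[r1][j] + F * (pop[r2][j] - pop[r3][j]) for j in range(D)]
            # Crossover: take the mutant component with probability CCP,
            # forcing at least one mutant component through index rni.
            rni = rng.randrange(D)
            V = [M[j] if (rng.random() <= CCP or j == rni) else pop[i][j]
                 for j in range(D)]
            # Greedy selection: keep the trial vector only if it improves f.
            fv = f(V)
            if fv < fit[i]:
                pop[i], fit[i] = V, fv
    best = min(range(NR), key=fit.__getitem__)
    return pop[best], fit[best]
```

For instance, minimizing the 2-D sphere function `lambda v: sum(t*t for t in v)` over [-5, 5]^2 drives the best fitness toward zero within a few hundred generations.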

V. AUTOMATIC CLUSTERING DE (ACDE) ALGORITHM

The automatic clustering differential evolution (ACDE) algorithm [20] is an improved version of the classical differential evolution (DE) algorithm designed to cluster unlabeled datasets automatically. Firstly, the different parameters in DE have been tuned in ACDE for better functioning. The amplification factor of the difference vector (F) and the crossover control parameter (CCP) have been modified to improve the convergence properties. The amplification factor (F) is varied randomly in the range (0.5, 1) by using the relation

F = 0.5 * (1 + rnd(0, 1))    (12)

where rnd(0, 1) is a uniformly distributed random number within the range [0, 1]. The diversity of the population is maintained as the search progresses by tuning the crossover control parameter (CCP).

It is decreased linearly with time from CCP_max = 1.0 to CCP_min = 0.5. All components of the parent vector are substituted by the difference vector operator [26]. The value of the CCP decreases as the process progresses according to the following equation

CCP = (CCP_max - CCP_min) * (MNOI - cit) / MNOI    (13)

where CCP_max and CCP_min are the maximum and minimum values of the crossover control parameter (CCP), respectively, cit is the current iteration number, and MNOI is the maximum number of iterations.

The components of the parent vectors are substituted by the difference vector operator at the initial stages of the operation, but the rate of substitution of the components decreases as the process progresses.
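The two schedules can be sketched as follows. Note an assumption: Eq. (13) as printed reaches 0 at the final iteration, so the sketch adds a CCP_min offset to match the stated linear decay from 1.0 to 0.5; the function name is our own.

```python
import random

def acde_parameters(cit, MNOI, CCP_max=1.0, CCP_min=0.5, rng=random.random):
    """ACDE control-parameter schedules (Eqs. (12)-(13)).

    F is redrawn in (0.5, 1) on every use; CCP decays linearly with the
    current iteration cit out of MNOI total iterations.  The CCP_min
    offset below is our addition so the decay spans 1.0 -> 0.5 as the
    text states, since Eq. (13) as printed would end at 0."""
    F = 0.5 * (1 + rng())                                       # Eq. (12)
    CCP = CCP_min + (CCP_max - CCP_min) * (MNOI - cit) / MNOI   # Eq. (13) + offset
    return F, CCP
```

Early iterations thus replace almost every parent component through crossover, while later iterations preserve more of the parent, slowing the substitution rate as described above.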

The most important feature of this algorithm is the chromosome representation for the automatic determination of the optimal number of clusters in an unlabeled dataset. Here, the chromosome is devised as a vector of real numbers and the dimension of the chromosome is K_max + (K_max x D), where K_max is the maximum number of clusters. The first K_max values are positive floating point numbers in [0, 1] and these are treated as the activation threshold values for the cluster centroids. The active/inactive cluster centroids are decided based on a predefined threshold value. A cluster is indicated as an active cluster if the threshold value of the corresponding cluster centroid is greater than 0.5; otherwise, it is designated as an inactive cluster. Thus, the optimal number of active clusters is settled on the basis of the threshold value.
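The chromosome layout and the 0.5 activation rule can be sketched as a decoding step; the function name and the sample chromosome below are hypothetical.

```python
def decode_chromosome(chrom, k_max, dim):
    """Decode an ACDE chromosome of length k_max + k_max * dim.

    The first k_max genes are activation thresholds; a centroid is
    active when its threshold exceeds 0.5 (the heuristic choice the
    paper questions).  Returns the list of active centroids."""
    thresholds = chrom[:k_max]
    centroids = [chrom[k_max + i * dim: k_max + (i + 1) * dim]
                 for i in range(k_max)]
    return [c for t, c in zip(thresholds, centroids) if t > 0.5]
```

For example, with k_max = 3 and dim = 1, the chromosome [0.9, 0.2, 0.7, 10.0, 50.0, 200.0] activates the first and third centroids and discards the second.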


ACDE is efficient in determining the optimal number of clusters from datasets. However, the main limitation of this algorithm lies in the selection of an appropriate threshold for the cluster centroids. The choice of 0.5 as the threshold is simply heuristic and hence is far from being a foolproof choice for diverse data distributions.

VI. PRINCIPLE OF VALID CLUSTER DETERMINATION BY FUZZY INTERCLUSTER HOSTILITY INDEX

It has already been mentioned that the classical DE algorithm is unable to determine the optimal number of clusters from an unlabeled dataset automatically. The number of target clusters is always provided as an input to the algorithm for arriving at an optimal clustered output. It is also clear that though the ACDE algorithm improves upon the classical DE algorithm in the automatic determination of the optimal number of clusters from a starting value of the number of clusters, the choice of the threshold for weeding out the inactive clusters is heuristic in nature. In this section, we present an unsupervised method for the determination of the validity of the cluster centroids derived from a classical DE algorithm. The basic principle of this cluster validation process is centered on the underlying intercluster heterogeneity of the resultant clusters.

As is evident from the properties of the proposed fuzzy intercluster hostility index discussed in Section II-A, two clusters are denoted as homogeneous to each other if the fuzzy intercluster hostility index is closer to zero (0). A value of the fuzzy intercluster hostility index close to one (1) implies that two clusters are heterogeneous/hostile to each other. Given this property, a particular cluster can be merged with its neighbor if the clusters are homogeneous to each other. Thus, the objective of the cluster validation procedure using the fuzzy intercluster hostility index is to find out the distinct clusters, by eliminating some of the clusters through merging with their neighboring clusters by investigating the corresponding fuzzy intercluster hostility indices. The resultant clusters, which remain distinct after the validation process, are those with appreciable heterogeneity between themselves.

The clustering process generally assigns a point p_ij to a particular cluster (out of K clusters) with cluster centroid m_ij, firstly by calculating the distance dt(p_ij, m_ij) between a point p_ij in the image data and the cluster centroids, and secondly by the following condition.

dt(p_ij, m_ij) = min_{c ∈ {1, 2, ..., K}} {dt(p_ij, m_ic)}    (14)

The fuzzy intercluster hostility index (ξ_i(i+1), 1 ≤ i ≤ K_max - 1) between each and every pair of adjacent clusters is computed. The average fuzzy intercluster hostility index (ξ_avg) for the entire dataset is also calculated, firstly to evaluate the average global heterogeneity and secondly, to determine the valid cluster centroids by the following algorithm.

1  Begin
2  Calculate ξ_i(i+1), 1 ≤ i ≤ K_max - 1
   Remark: Find out the fuzzy intercluster hostility index between each and every region.
3  Calculate ξ_avg := Σ_i ξ_i(i+1) / K_max, 1 ≤ i ≤ K_max - 1
   Remark: Find out the average fuzzy intercluster hostility index.
4  Sort all ξ_i(i+1) < ξ_avg
5  Do
6    Determine the smallest among ξ_i(i+1) < ξ_avg
7    If ξ_(i-1)i < ξ_(i+1)(i+2) Then Omit R_i
8    If ξ_(i-1)i > ξ_(i+1)(i+2) Then Omit R_(i+1)
9    If ξ_12 < ξ_avg Then Omit R_1
10   If ξ_(Rmax-1)Rmax < ξ_avg Then Omit R_max
     Remark: Cluster centroid R_1 or R_max refers to the first or last possible centroid, respectively.
11 Loop Until ξ_i(i+1) > ξ_avg
12 End

Hence, the valid number of clusters is determined by this algorithm through the omission of the redundant cluster centroids. The redundancy is decided on the basis of the fuzzy intercluster hostility index between the corresponding adjacent regions.

VII. RESULT

In this section the proposed method is compared with the classical DE [19] and ACDE [20] algorithms. All the methods have been applied to two gray level images, Peppers (Figure 1(a)) and Lena (Figure 1(b)), each of dimensions 128x128.

Fig. 1. (a) Original Peppers image (b) Original Lena image

The number of cluster centroids is fixed (K = 7) in the classical DE method. The maximum (K_max) and the minimum (K_min) numbers of clusters as applied in ACDE and in the proposed method are 10 and 2, respectively. The population size for all the processes is 100. The crossover control parameter (CCP) is equal to 0.8 and the value of F is 0.8. At the beginning, the cluster centroids are selected randomly from the images between the maximum and minimum image gray levels. The Peppers and Lena images clustered by the classical DE method for K = 7 with two different runs of the algorithm are shown in Figures 2(a)(b) and (c)(d), respectively. Figures 3(a)(b) show the clustered Peppers image and Figures 3(c)(d) show the clustered Lena image obtained with the ACDE algorithm. Figures 4(a)(b) show the clustered Peppers image and Figures 4(c)(d) show the clustered Lena image with
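The merging loop of steps 1-12 can be sketched as follows; the hostility function is supplied by the caller (e.g. Eq. (3)), the average is computed once as in step 3, and at least two initial regions are assumed.

```python
def validate_clusters(regions, hostility):
    """Sketch of the Section VI valid-cluster determination loop.

    regions: ordered list of clusters; hostility(a, b) returns the
    fuzzy intercluster hostility index between two adjacent regions.
    Adjacent pairs whose hostility falls below the global average are
    merged away, weakest boundary first, until every remaining
    boundary is more hostile than the average (steps 4-11)."""
    regions = list(regions)
    xi = [hostility(regions[i], regions[i + 1]) for i in range(len(regions) - 1)]
    avg = sum(xi) / len(xi)            # step 3: computed once, then fixed
    while xi and min(xi) < avg:
        i = xi.index(min(xi))          # step 6: weakest adjacent boundary
        if i == 0:                     # step 9: boundary case at R_1
            drop = i
        elif i == len(xi) - 1:         # step 10: boundary case at R_max
            drop = i + 1
        else:                          # steps 7-8: compare outer boundaries
            drop = i if xi[i - 1] < xi[i + 1] else i + 1
        regions.pop(drop)
        xi = [hostility(regions[j], regions[j + 1])
              for j in range(len(regions) - 1)]
    return regions
```

For instance, given four intensity regions of which the first two and the last two are nearly homogeneous pairs, the loop removes one region from each pair and returns the two genuinely distinct clusters.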


the proposed approach. From the results obtained it is evident that the performance of the proposed fuzzy intercluster hostility index based cluster validation method is comparable with those of the DE and the ACDE algorithms. Moreover, the use of the proposed fuzzy intercluster hostility index determined from the image context also obviates the limitations arising out of the choice of the threshold in the ACDE algorithm.

The cluster centroids {CK, K=l, 2,... , K max } for the twoimages with two different runs of the proposed method areshown in Table I. The boldfaced values in Table I indicate thevalid cluster centroids.

TABLE ICLUSTER CENTROIDS OBTAINED WITH THEPROPOSED METHOD FOR THETEST

IMAGES

Fig. 2. (a)(b) Segmented Peppers image with classical DE (K = 7) (c)(d)Segmented Lena image with classical DE (K = 7)

(b)

«I)

(a)

Fig. 4. (a) Clustered Peppers image with the proposed method (obtained K = 7); (b) Clustered Peppers image with the proposed method (obtained K = 6); (c) Clustered Lena image with the proposed method (obtained K = 7); (d) Clustered Lena image with the proposed method (obtained K = 6)

VIII. CONCLUSION

A cluster validation approach for the determination of the optimal number of clusters in a dataset is presented in this article. The basis of this validation procedure for distinguishing between valid and invalid optimal clusters, derived from a classical differential evolution algorithm, lies in measuring the fuzzy intercluster hostility index to reflect the underlying heterogeneity between the clusters. The proposed approach has been successful in determining the valid optimal clusters from gray level images. Comparative results are reported with reference to the clusters obtained with the classical differential evolution algorithm operating with a given number of clusters, as well as with the automatic clustering differential evolution algorithm. Methods, however, remain to be investigated for applying the proposed approach to the determination of the optimal number of clusters in color images as well as in datasets characterized by multiple features. The authors are currently engaged in these directions.

Image  | C1  C2  C3  C4  C5  C6  C7  C8  C9  C10
Lena   |   1 111 104 242 151 237 168 188  60 189
       |   7  20 234 102 244 140 130  41 235  64
Pepper | 161 144  19 126 103 249 130 162 231   8
       | 126 181 151 239 159  18 190 241  20   1


Fig. 3. (a) Clustered Peppers image with ACDE (obtained K = 6); (b) Clustered Peppers image with ACDE (obtained K = 7); (c) Clustered Lena image with ACDE (obtained K = 6); (d) Clustered Lena image with ACDE (obtained K = 5)


ACKNOWLEDGMENT

The authors would like to thank the University Institute of Technology, The University of Burdwan, Burdwan, and Visva-Bharati University, Santiniketan, for the logistic and infrastructural support provided for implementing the proposed method.

