Pattern Recognition 42 (2009) 1193 -- 1209

Contents lists available at ScienceDirect

Pattern Recognition

journal homepage: www.elsevier.com/locate/pr

A distance-relatedness dynamic model for clustering high dimensional data of arbitrary shapes and densities

Noha A. Yousri a,b,∗, Mohamed S. Kamel b, Mohamed A. Ismail a
a Computers and System Engineering, University of Alexandria, Egypt
b Electrical and Computer Engineering, University of Waterloo, Ontario, Canada


Article history:
Received 23 August 2007
Received in revised form 26 July 2008
Accepted 29 August 2008

Keywords:
Clustering
Dynamic model
Arbitrary shaped clusters
Arbitrary density clusters
High dimensional data
Distance-relatedness

It is important to find the natural clusters in high dimensional data where visualization becomes difficult. A natural cluster is a cluster of any shape and density, and it should not be restricted to a globular shape as a wide number of algorithms assume, or to a specific user-defined density as some density-based algorithms require.
In this work, it is proposed to solve the problem by maximizing the relatedness of distances between patterns in the same cluster. It is then possible to distinguish clusters based on their distance-based densities. A novel dynamic model is proposed based on new distance-relatedness measures and clustering criteria. The proposed algorithm "Mitosis" is able to discover clusters of arbitrary shapes and arbitrary densities in high dimensional data. It has a good computational complexity compared to related algorithms. It performs very well on high dimensional data, discovering clusters that cannot be found by known algorithms. It also identifies outliers in the data as a by-product of the cluster formation process. A validity measure that depends on the main clustering criterion is also proposed to tune the algorithm's parameters. The theoretical bases of the algorithm and its steps are presented. Its performance is illustrated by comparing it with related algorithms on several data sets.

© 2008 Elsevier Ltd. All rights reserved.

1. Introduction

Clustering is the unsupervised classification of data into groups/clusters. It is widely used in biological and medical applications [1], computer vision [2], robotics [3], geographical data [4], and others. A clustering algorithm takes as input the data set and a number of parameters, and returns the groups/clusters in the data, according to a specific clustering objective or criterion.

The most important aspect that differentiates clustering algorithms is the objective used for clustering. Many of the traditional algorithms, such as K-Means, average and complete linkage clustering, and the more recent ones such as PAM [5], CLARANS [6], and BIRCH [7], are restricted to obtaining clusters of a particular shape, such as globular or spherical shaped clusters. However, density-based algorithms such as DBScan [8], DenClue [9], and SNN (Shared Nearest Neighbors) [4] have introduced new clustering criteria that do not restrict the shape of clusters, and are thus able to obtain what are called "arbitrary" shaped clusters. Yet, they depend on using a static model for clustering, which deteriorates when arbitrary densities are present in the data. Chameleon [10], on the other hand, uses a dynamic model to be able to merge clusters of related densities together. Yet, Chameleon suffers from some drawbacks.

∗ Corresponding author at: Electrical and Computer Engineering, University of Waterloo, Ontario, Canada.

E-mail addresses: [email protected], [email protected], [email protected] (N.A. Yousri), [email protected] (M.S. Kamel), [email protected] (M.A. Ismail).

0031-3203/$ - see front matter © 2008 Elsevier Ltd. All rights reserved.
doi:10.1016/j.patcog.2008.08.037


In this work, a new clustering algorithm "Mitosis" is proposed to discover clusters of arbitrary shapes and densities. It uses distance relatedness between patterns as a measure to identify clusters of different densities.

The novelty of Mitosis lies in:

• Combining both local and global novel distance-consistency measures in a dynamic model to cluster the data.
• Introducing a greedy clustering approach to keep up with the linear time performance of simple algorithms.
• The ability to identify outliers in the presence of clusters of different densities in the same data set.
• The use of a parameter selection procedure to guide the clustering algorithm.
• Suitability for high dimensional data, where the discovery of arbitrary shape and density clusters reveals other relations not revealed by traditional clustering algorithms.


Mitosis is able to overcome limitations of Chameleon, and at the same time preserve the ability to discover arbitrary shaped clusters of arbitrary densities.

The rest of the paper is organized as follows: Section 2 presents the background, Section 3 gives the proposed dynamic model, Section 4 gives the experimental results, and Section 5 concludes the paper.

2. Background

Partitioning algorithms such as K-Means [11], its variants [12,13], K-Medoids [14], PAM [5] (Partitioning Around Medoids), BIRCH [7] and CLARANS [6] tend to use a center-based clustering objective, resulting in clusters of known geometrical shape, e.g. a sphere or an ellipse. Grid-based clustering algorithms include STING [15], CliQue [16], WaveCluster [17], and DenClue [9]. Both WaveCluster and DenClue are able to obtain arbitrary shaped clusters, yet their limitation to low dimensional data and their static models hinders their ability to find clusters of different densities. Density-based clustering algorithms include DBScan [8], DenClue [9], SNN [4], and GDD [18]. Those algorithms can find arbitrary shaped clusters, yet due to their static density models, they limit the clusters to a particular user-defined density. Hierarchical clustering algorithms include Single, Average, and Complete linkage (see Ref. [11]), CURE [19], ROCK [20] and Chameleon [10]. All of these, except for Chameleon, fail to find arbitrary shaped clusters. Chameleon, however, is an efficient algorithm that uses a dynamic model to obtain clusters of arbitrary shapes and arbitrary densities. Yet, it suffers from the following drawbacks:

• A slow speed compared to related algorithms (due to a quadratic term in the number of initial clusters).
• The overhead of using graph partitioning, where more parameters should be tuned, and more time is required.
• The difficulty of tuning the algorithm's parameters, due to the need for building the complete hierarchy for each single parameter variation.

Moreover, Chameleon is known for low dimensional spaces, and was not applied to high dimensions.

The model proposed in this work introduces a novel dynamic model based on distance relatedness concepts to overcome drawbacks of related algorithms. Related background is presented in the following subsection.

2.1. Dynamic modeling and distance relatedness concepts

The significance of a dynamic model of clustering is first explained, and the theoretical background of distance relatedness is then discussed.

2.1.1. Dynamic vs. static models

A static model refers to using user-defined parameters for defining density thresholds without referring to the characteristics of the underlying data. DBScan, DenClue, WaveCluster and SNN are examples of algorithms that use a static model and are unable to find clusters of arbitrary densities. For example, DBScan uses two parameters Eps and Minpts, where Eps specifies a threshold on range distance, and Minpts specifies the minimum number of neighbors. A pattern with a number of neighbors more than Minpts at a distance less than Eps is considered a core point (pattern) and is used to track clusters. Chameleon, on the other hand, introduced a dynamic model for clustering based on the following:

1. The use of k-nearest neighbors (k-NN).
2. Merging of clusters based on relative closeness and relative interconnectivity measures, weighted by user-defined parameters.

Fig. 1. Distances between patterns reflect the density structure.

In this research, a dynamic model is developed based on distance-relatedness measures to allow the discovery of natural cluster shapes. Background on distance relatedness concepts is given next.

2.1.2. Distance relatedness concepts

Distance-based density: Clusters can be visually detected by recognizing relatively higher density regions in data separated by regions of relatively lower densities. It is observed that the behavior of neighborhood distances determines the degree of a cluster's density; smaller distances reflect higher densities, and vice versa. As shown in Fig. 1b, the average distances between patterns (the Minimum Spanning Tree (MST) distances) differentiate the densities of the clusters in Fig. 1a. Several other points important to build an efficient model are as follows:

• Arbitrary shaped clusters can be detected by tracking connectivity between patterns' neighborhoods.
• The distances between a pattern and its neighbors determine the density of its cluster.
• Patterns of the same cluster should have consistent neighborhood distances to preserve a consistent density throughout the cluster.

Distance relatedness measures: Local and global measures for distance relatedness have been separately considered for clustering. Local relatedness refers to relatedness of distances in two neighboring patterns' neighborhoods, while global relatedness refers to relatedness of distances in a group of patterns' neighborhoods to those in a neighboring group's neighborhoods. Thus global relatedness considers relating the neighborhoods of two patterns that may not be in each other's neighborhoods, but can be reached from one another through a group of neighbor patterns. Global distance relatedness does not guarantee local distance relatedness. Yet considering both local and global relatedness aims at increasing the distance consistency of each cluster, and avoiding the effect of outliers which may merge two clusters together.

Zahn's local measures: In the early work by Zahn [21], "inconsistent" edges (distances/similarities) are removed from the MST representing the data, in order to obtain a clustering solution. An edge between two patterns is considered inconsistent if its value, compared to the average and standard deviation of its neighboring edges, is larger than a specified factor. This would only work for simple data sets, but would not work for other data sets, as pointed out in Ref. [22], due to its sole dependence on local distance relatedness. However, it introduced the distance relatedness concept to clustering.

None of the recently developed algorithms utilize the idea proposed by Zahn. However, Chameleon uses a model based on the same key concept of edge relatedness, this time using a global model and ignoring the local edge relatedness between patterns.

Chameleon's global measures: Chameleon starts by partitioning the k-nearest neighbor graph of the data into a number of initial clusters. It then merges clusters together using a relative closeness/interconnectivity model.

N.A. Yousri et al. / Pattern Recognition 42 (2009) 1193 -- 1209 1195

Fig. 2. Inconsistencies of distances between patterns of Chameleon's initial clusters (bordered clusters).

This model examines the relatedness of edge weights present in two neighboring clusters, and examines the relatedness of the inner edges to the edges connecting the two clusters. Its restriction to global measures and ignorance of local measures results in associating outliers with their initial clusters (see Fig. 2). The disadvantages of this model lie in:

1. Ignoring local structures of single patterns.
2. Demanding pair-wise examination of existing clusters at each new merging step.

The first factor allows outliers to merge to clusters from the very first steps of clustering, making it difficult to identify outliers, whereas the second factor results in a quadratic term complexity (in the number of initial clusters).

3. Proposed distance relatedness dynamic model

The dynamic model proposed here aims at maintaining the consistency of distances between patterns in each cluster. It defines a distance-relatedness clustering criterion based on the following:

1. A dynamic range neighborhood that depends on the density context of the pattern.
2. Measures that reflect the distance structure of the data.

The following sections present the foundations of the clustering criterion, the neighborhood definition, the distance measures used, and the clustering algorithm Mitosis.

3.1. Proposed distance relatedness measures

The proposed distance relatedness model, unlike Zahn's model and Chameleon's model, combines both local and global distance relatedness. It attempts to cluster the data starting from local measures and extending them to global ones in the course of the algorithm. In the following discussion, the symbols x, y, p, and q denote patterns, d(x,y) denotes the distance between patterns x and y, a denotes the association between two patterns x and y, which is the tuple (x, y, d(x,y)), A denotes a set of associations, f and k denote user parameters, and μ denotes the average value of distances.

Local distance relatedness: Any cluster is initially formed from singleton patterns that are related with respect to their neighborhood average distance measure (average local distances, see Eq. (3)), and have an in-between distance related to this measure as well. This local merging criterion is a modification of Zahn's inconsistency model. The rules proposed here are as follows (a distance replaces an edge weight), considering that the average local distance measure is the average of distances in a neighborhood of a pattern, as given later (Eq. (3)):

1. The distance between two patterns is related to each of the patterns' average local distance measures, by a factor k.
2. The two patterns' average local distance measures are related by factor k.

Fig. 3. Effect of distance inconsistency on merging of patterns.

The first rule combines Zahn's rules into a unified rule, while the second rule prevents merging of two distance-inconsistent patterns/clusters. This second rule increases the robustness of the local relatedness, and cannot be satisfied in the previous model presented by Zahn.

Fig. 3 illustrates the importance of the mutual consistency between the three entities associated with the local relatedness criterion: two patterns and an in-between distance. It shows two examples. The first example (p1, p2, d1) has two patterns with distance d1 related to each pattern's neighborhood distances, yet the two neighborhood distance averages are not related. The second case (p3, p4, d2) gives an example of two related neighborhood distance averages, but an unrelated in-between distance d2. The two patterns do not merge in either of these cases, justifying the above proposed local criterion that imposes the mutual consistency between the two neighborhoods and the in-between distance. The same rationale is used when merging two clusters, but this time considering the clusters' global characteristics.

Global distance relatedness: The property of distance consistency extends from local relations between neighborhood patterns to global relations between neighboring clusters. The global distance relatedness model follows the same rules as the local relatedness, but replacing patterns with clusters, a pattern's local average distance with a cluster's average distance, and the distance between two singleton patterns with the distance between the two patterns in the two neighboring clusters. A cluster's average distance (over only the distances approved to be consistent with the cluster) is used to reflect the distance behavior inside the cluster.

3.1.1. Distance consistency definitions

The rules governing the consistency of one distance to another cluster's/pattern's average distance, or of two clusters'/patterns' average distances to each other, are as follows, given an input parameter k ≥ 1:

A distance d(x,y) is k-consistent to an average μ of distances if

d(x, y) < k·μ

Two distance averages μ1 and μ2 are k-consistent to each other if

μ1 < k·μ2 ∧ μ2 < k·μ1, or equivalently max(μ1, μ2) < k·min(μ1, μ2)

A distance d(x,y) is k-consistent to two averages if

d(x, y) < k·μ1 ∧ d(x, y) < k·μ2, i.e. d(x, y) < k·min(μ1, μ2)

According to the above definitions, the merge criteria are proposed.

3.2. Proposed clustering algorithm

3.2.1. Dynamic range neighborhood

Finding arbitrary shapes in data mainly depends on using nearest neighbor information. For the proposed algorithm, nearest neighbors in a dynamic range are sought using a binary metric tree such as the one proposed in Ref. [23] or [24] for high dimensional data. The dynamic range neighborhood is proposed as follows:

Definition 1. Dynamic range: Given that P is the set of patterns, f is a user-defined parameter, and d(x,y) is a distance metric defined for two patterns x and y, the dynamic range for a pattern q is defined as

r(q) = f · min_{p∈P} {d(p, q)}    (1)

Definition 2. Dynamic range nearest neighbors: These are the nearest neighbors in the dynamic range distance, defined as follows:

NNr(q) = {p ∈ P | d(p, q) ≤ r(q)}    (2)

The dynamic range neighborhood enables controlling the relation between the largest and smallest distances between the pattern and its neighbors. This makes the neighborhood's distances consistent with respect to factor f ≥ 1, so any distance is upper bounded by f times the minimum distance. Due to such consistency, outliers or even patterns that belong to a nearby lower density cluster will be excluded from the neighborhood. Also, any distance used to merge the pattern to a cluster is bounded by this minimum distance (as will be seen next).
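As a brute-force illustration of Definitions 1 and 2 (the paper uses a binary metric tree for efficiency; the quadratic scan and Euclidean metric below are assumptions for the sketch), the neighborhood and the local average distance of Eq. (3) can be computed as follows:

```python
import math
from typing import List, Tuple

def euclidean(x: List[float], y: List[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dynamic_range_neighbors(data: List[List[float]], q: int, f: float) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for NNr(q) = {p : d(p, q) <= r(q)},
    where r(q) = f * (distance to the nearest neighbor of q).
    The pattern q itself is excluded; brute force O(n) per query."""
    dists = [(p, euclidean(data[p], data[q])) for p in range(len(data)) if p != q]
    r = f * min(d for _, d in dists)        # dynamic range of Eq. (1)
    return [(p, d) for p, d in dists if d <= r]

def local_average_distance(neighbors: List[Tuple[int, float]]) -> float:
    """Average distance between a pattern and its dynamic-range neighbors (Eq. (3))."""
    return sum(d for _, d in neighbors) / len(neighbors)
```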

3.2.2. Definitions and clustering criteria

The following definitions define the measures used to capture the structure of the data during the algorithm's execution.

Definition 3. Local average distance: The average distance between pattern p and the neighbors in its dynamic range NNr(p),

μ_NN(p) = ( Σ_{x∈NNr(p)} d(p, x) ) / |NNr(p)|    (3)

This measure reflects the behavior of distances in the vicinity of a pattern, and is used to decide if a pattern should be merged to another cluster or pattern.

Definition 4. Cluster average distance: The average of distances of the set of associations Ac in cluster c, given as follows:

μ_c = ( Σ_{(p1,p2)∈Ac} d(p1, p2) ) / |Ac|    (4)

This measure reflects the behavior of the cluster's "accepted" distances (accepted as consistent with older distances in the cluster). It is used to decide whether a cluster should be merged with another cluster/pattern. This measure is calculated and updated in the course of the clustering algorithm.

The proposed algorithm is based on two main criteria that use the above distance measures: the merge criterion and the refining criterion, discussed next.

Merge criterion: A merge criterion that unifies the rules of local and global distance relatedness is proposed. The clustering process is implicitly done by adding associations consistent with the average distance of a cluster.

Incremental merging procedure: To be able to achieve clustering through the local and global measures, the order of the associations considered during the merging process is important. The associations are sorted in a nondecreasing order of distances. This ordering is essential for building denser cluster cores first. This bias towards denser clusters is to avoid the effect of outliers. An outlier is known by behaving abnormally with respect to its distance from other patterns. Thus starting with larger distances first can cause splitting of denser clusters and incorrectly linking them to less dense clusters.

The clusters are formed incrementally, starting from singleton patterns and merging associations, thus implicitly merging patterns into clusters.

Merge step: Given an association a = (p1, p2, d(p1,p2)) between patterns p1 and p2, belonging to clusters c1 and c2, respectively, then following the previous discussion in Section 3.1, c1 and c2 are merged if the following merging condition is satisfied:

d(p1,p2) < k·min(μ_c1, μ_c2) ∧ max(μ_c1, μ_c2) < k·min(μ_c1, μ_c2)    (5)

where c1 and c2 are the clusters of p1 and p2, respectively, μ_c1 and μ_c2 are the average distances of each cluster as given in Eq. (4), and k ≥ 1 is an input parameter. Initially, each cluster contains a single pattern.
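One way to realize the incremental merging stage is with a disjoint-set (union-find) structure that keeps, per cluster, a running distance sum and count so that μ_c can be updated in constant time at every accepted association. The sketch below is an illustration under that assumption; it is not the authors' reference implementation, and the running-average update is a simplification of the paper's rule of combining the contributing averages with the merging distance.

```python
from typing import List, Tuple

Association = Tuple[int, int, float]   # (p1, p2, d(p1, p2))

class Clusters:
    """Union-find with a per-cluster distance sum/count approximating mu_c (Eq. (4))."""
    def __init__(self, local_avg: List[float]):
        self.parent = list(range(len(local_avg)))
        # A singleton cluster starts with its pattern's local average distance (Eq. (3)).
        self.dist_sum = local_avg[:]
        self.dist_cnt = [1] * len(local_avg)

    def find(self, i: int) -> int:
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]   # path halving
            i = self.parent[i]
        return i

    def avg(self, root: int) -> float:
        return self.dist_sum[root] / self.dist_cnt[root]

    def union(self, r1: int, r2: int, d: float) -> None:
        self.parent[r2] = r1
        self.dist_sum[r1] += self.dist_sum[r2] + d
        self.dist_cnt[r1] += self.dist_cnt[r2] + 1

def merge_stage(associations: List[Association], local_avg: List[float], k: float) -> Clusters:
    """Scan associations in nondecreasing distance order and merge per Eq. (5)."""
    clusters = Clusters(local_avg)
    for p1, p2, d in sorted(associations, key=lambda a: a[2]):
        r1, r2 = clusters.find(p1), clusters.find(p2)
        mu1, mu2 = clusters.avg(r1), clusters.avg(r2)
        if d < k * min(mu1, mu2) and max(mu1, mu2) < k * min(mu1, mu2):
            if r1 != r2:
                clusters.union(r1, r2, d)
            else:
                # Internal association accepted: it updates the cluster average.
                clusters.dist_sum[r1] += d
                clusters.dist_cnt[r1] += 1
    return clusters
```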

Refining criterion: In order to avoid the effect of outliers, a shrinking technique is used to shrink the average distance of the cluster, and remove associations with distances not consistent with the new average. This removes outlier distances, where an outlier distance can mistakenly connect two neighboring clusters of similar distance characteristics.

The distance between an outlier and a pattern inside the cluster is generally greater than common distances inside the cluster. The existence of outlier distances causes a skewness of the distribution of distances in the same cluster. To remove this skewness, the average of distances is shrunk towards smaller core distances, and distances that are not consistent with this average are removed. Fig. 4 shows that, by removing the outlier distances from the cluster at the bottom left, with the distance distribution on the upper left, the cluster is split into two genuine clusters (lower right), and the skewness of the distribution is removed (upper right).

A previous shrinking technique was considered by CURE [19] to shrink the cluster representatives towards the center of the cluster. The case is different here, where distances are considered, and the average of distances is shrunk towards the core (smaller) distances of the cluster. Shrinking can be done in a number of ways; the method used here avoids the use of another parameter, and proved by experimentation to be suitable. It uses the harmonic average of distances inside the cluster.

Definition 5. Cluster harmonic distance average: This is the harmonic average of distances between patterns in a cluster, given as follows:

μ_Hc = |Ac| / ( Σ_{i=1}^{|Ac|} 1/d_i )    (6)

where d_i is the i-th association's distance in the list of associations Ac. The harmonic mean is a Pythagorean mean whose value does not exceed that of the arithmetic mean, and it is in general used for averaging rates. It is used here for averaging distances to give the effective distance average, which can easily be shown to be the inverse of the average of similarities between patterns in the cluster, when the similarity is taken as s(x,y) = 1/d(x,y).

Given k as an input parameter, the following condition should be satisfied for an association's distance:

d(p1,p2) < k·μ_Hc    (7)
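A minimal sketch of Eqs. (6) and (7), assuming each cluster is held as a list of its accepted association distances (the function names are illustrative). Note that pruning associations is only the numerical part of refinement; whether the cluster actually splits depends on the remaining connectivity, which is not shown here.

```python
from typing import List

def harmonic_average(distances: List[float]) -> float:
    """Cluster harmonic distance average (Eq. (6)): |Ac| / sum(1/di)."""
    return len(distances) / sum(1.0 / d for d in distances)

def refine_cluster(distances: List[float], k: float) -> List[float]:
    """Keep only association distances consistent with the shrunk (harmonic) average,
    i.e. d < k * mu_Hc (Eq. (7))."""
    mu_hc = harmonic_average(distances)
    return [d for d in distances if d < k * mu_hc]
```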

3.2.3. Algorithm phases

The algorithm takes as input the data set P and two main parameters f and k. The main steps are as follows:

1. Get associations:
   (a) For each pattern, get its dynamic range nearest neighbors (using parameter f), and calculate the local average distance.
   (b) Order the associations between patterns ascendingly on distances in a list L.

Fig. 4. Removing skewness from the distance distribution splits one cluster (left) into two clusters (right).

2. Merge patterns into clusters: For each association in L (visited in order), if the merging criterion is satisfied (according to k), merge the associated patterns/clusters, and update the clusters' measures.

3. Refine clusters:
   (a) For each cluster, add any internal association consistent with the cluster's average.
   (b) For each cluster, calculate the harmonic average of distances, and remove associations not consistent with this measure.

The detailed algorithm is given in Appendix A. The algorithm starts by retrieving nearest neighbors for each pattern, and accordingly generates associations. A list of associations L is created, which is sorted on distances. Initially the clusters are the patterns themselves, and the average distance of a cluster is the pattern's local average distance.

In the merging stage, associations in L are scanned once, in their sorted order, and each is checked for the possibility of merging two clusters (Eq. (5)). When two clusters are merged, a new cluster is created and its average distance is calculated from the averages of the contributing clusters and the merging association's distance. Note that during the merge procedure, when an association between two patterns of the same cluster is visited, it is also added to the cluster if it satisfies the merge criterion, and its distance is used to update the cluster's average distance measure. During the merge stage, each cluster's associations are only those used to merge the cluster's patterns together.

In the refining stage, unlike the merging stage, the associations of a cluster are extended to cover all associations (only within neighborhoods) between patterns inside the cluster that are bounded by the cluster's average distance. Internal associations not added during the merge step are considered in their ascending order, and the cluster's measures are updated incrementally. In the refining stage, the clusters are refined according to Eq. (7). After the three stages, a clustering solution is obtained, in which outliers are identified as patterns in a cluster of size less than 1% of the size of the data.
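Putting the phases together, a driver could look like the sketch below. It reuses the hypothetical helpers introduced earlier; the refining stage (association extension and pruning against the harmonic average) and outlier handling are deliberately omitted, so this is only an outline of the flow, not the full algorithm.

```python
def mitosis_sketch(data, f: float, k: float):
    """Illustrative end-to-end flow of phases 1-2; not the reference implementation."""
    # Phase 1: dynamic-range neighborhoods, local averages, and associations.
    neighborhoods = [dynamic_range_neighbors(data, q, f) for q in range(len(data))]
    local_avg = [local_average_distance(nn) for nn in neighborhoods]
    assoc = {}
    for q, nn in enumerate(neighborhoods):
        for p, d in nn:
            assoc[(min(p, q), max(p, q))] = d      # keep each pair once
    associations = [(a, b, d) for (a, b), d in assoc.items()]

    # Phase 2: merge patterns into clusters in nondecreasing distance order (Eq. (5)).
    clusters = merge_stage(associations, local_avg, k)

    # Phase 3 (refinement, Eqs. (6)-(7)) is omitted here; see refine_cluster above.
    members = {}
    for q in range(len(data)):
        members.setdefault(clusters.find(q), []).append(q)
    return members
```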

3.2.4. Outlier handling

Patterns in clusters of sizes less than 1% of the total data size are considered outliers. To assign outliers to the main clusters, each outlier pattern examines its neighbors, and it is assigned to the cluster which shows a majority in its neighborhood. If a pattern's neighborhood does not contain any assigned patterns, the dynamic range neighborhood is increased to examine a larger set of neighbors. This neighborhood enlargement is repeated until the pattern is assigned to a known cluster. The outlier handling algorithm is given in Appendix A.
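A sketch of the reassignment step, under the assumptions that the labels of non-outlier patterns are already known and that the neighborhood is enlarged by multiplying the range factor by f again (the paper does not fix the enlargement rule; the detailed procedure is in Appendix A):

```python
from collections import Counter
from typing import Dict

def assign_outlier(data, q: int, labels: Dict[int, int], f: float) -> int:
    """Assign outlier pattern q to the majority cluster among its dynamic-range
    neighbors, enlarging the range until a labeled neighbor is found."""
    grow = f
    while True:
        neighbors = dynamic_range_neighbors(data, q, grow)   # hypothetical helper from above
        votes = Counter(labels[p] for p, _ in neighbors if p in labels)
        if votes:
            return votes.most_common(1)[0][0]
        grow *= f     # assumption: enlargement factor, not specified in the text
```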

3.2.5. Parameter tuning

Mitosis uses two main parameters f and k, compared to Chameleon, which uses three main parameters for building the hierarchy and a fourth cutting threshold for obtaining a single clustering solution, beside other additional parameters needed for graph partitioning. DBScan, on the other hand, uses two parameters, but results in less efficient clustering solutions.

Parameter tuning is essential for any clustering algorithm, and it becomes more tolerable for relatively higher speed algorithms. In this section, a heuristic search to tune Mitosis' parameters is given.

Parameter f, controlling the neighborhood size, is the main parameter, which decides the lower bound on the number of clusters that can be obtained. The value of f should be varied in an incremental fashion so as not to deteriorate the speed of the algorithm. Values of f can be selected just above the value of 1, and increased by small steps, to avoid unnecessarily huge neighborhood sizes.

Parameter k controls the degree of merging patterns/clusters together, within the limits of the neighborhood decided by f. Increasing values of k, for the same f value, means decreasing the number of clusters obtained, and vice versa. It is important to note that when decreasing/increasing f, a corresponding increase/decrease in k is needed for achieving similar clustering solutions. This facilitates searching through the parameter space. It is also important to note that the time spent in trying a new value of k for the same f value corresponds to the merging time, which is much less than the time spent in finding nearest neighbors for a new value of f.


Given a particular f value, and in order to select an appropriate k value, a validity measure is suggested here that considers the following factors:

1. Distance standard deviation (σ): This is the distance standard deviation averaged over all the clusters, where the distances considered are those of the MST of each cluster. A cluster's standard deviation is calculated as

σ_MST(c) = sqrt( Σ_{d∈MST(c)} (d − μ_MST(c))² / (|MST(c)| − 1) )    (8)

where MST(c) is the set of distances comprising the MST of cluster c, and μ_MST(c) is the average of distances in the MST of cluster c. The size of the MST is the number of its edges. The main reason that an MST is used is to unify the view of a cluster solution among different solutions, so that sizes of neighborhoods would not be biased to a particular clustering solution compared to another.

2. Density (S): Relatively high density is an implicit objective for Mitosis which reflects the richness of the clusters obtained, where "richness" refers to the number of patterns in a cluster. For a single cluster, it is measured by averaging, over all patterns, the number of a pattern's neighbors that belong to the cluster.

3. Number of patterns clustered, which is the number of clusters multiplied by the average size of the clusters. Only clusters of sizes exceeding 1% of the size of the data set (i.e. not outliers) are considered.

The validity measure is defined as follows:

v = (S / σ) · sqrt(Nc · |C|)    (9)

where S is the average population-based density of all clusters, σ is the average distance standard deviation of all clusters (see Eq. (8)), Nc is the average size of all clusters, and C is the set of clusters obtained.

The intuition behind selecting this validity measure is that seeking more compact clusters, i.e. clusters of small standard deviations, results in finding a large number of tiny fragments of the real clusters. In order to avoid that, the average density of each cluster (which is the average density of all patterns' neighborhoods) should increase. Also, the size of the clustered data (measured by Nc·|C|) should be increased (leading to a smaller number of clusters), but should be balanced by the standard deviation (a smaller number of clusters leads to larger standard deviations). Thus, the objective is to decrease the standard deviation, increase the density of clusters, and increase the number of clustered data, leading to selecting parameters that increase the value of v.

When a value of f is selected, k can vary between the value that obtains the largest number of clusters and the value obtaining the smallest number of clusters. (Since only clusters of sizes larger than 1% of the data size are considered, the f–k trend showing the number of clusters has a peak that does not have to be at the starting value of k.) A local maximum in the trend plotting v across the range of k values for a particular f value is used to indicate a relatively more valid clustering solution. The choice of the first local maximum, after the peak of the number-of-clusters trend, was appropriate in most of the cases.

The best value obtained for v can be investigated over a number of f values. Selecting a suitable value of f among all other values can be done using the pattern fit (PF) measure described later.

The proposed heuristic search is described as follows:

1. Select an initial value for f (slightly above the value of 1), an initial step for changing f, and another one for k.

Table 1
Time complexity comparison to related algorithms

Algorithm | Complexity
DBScan | O(Dn log2(n))
SNN | O(Dn log2(n))
Chameleon | O(nm + Dn log2(n) + m² log2(m))
Mitosis | O(Dn log2(n))

2. For the current value of f, execute the algorithm at values of k ranging between the minimum (no clusters) and the maximum values (minimum number of clusters, after the maximum peak), incremented by its step.

3. For the current value of f, plot the value of validity v against the value of k, and detect the local maxima.

4. Select a new value for f greater than 1, differing by one step from the last f value selected.

5. Repeat steps 2–4, and tabulate the number of clusters at local maxima of v in the f–k trend, to keep track of:
• the number of clusters obtained at local maxima of v in the f–k curves obtained at different f values, and
• the values of k at which local maxima are obtained in the f–k curves obtained at different f values.

Examples of using the parameter selection procedure are given in the experiments section. Stability in the number of clusters at different local maxima of v is expected (as shown later), yet might not always be directly obtained. In these cases, investigating other values of k or f may be required; for instance, smaller steps between f values or between k values can be used. One can start with a step size and later decrease it to get more accurate values of k and f, yet it is believed that the local maxima obtained at larger k step sizes supersede those obtained at smaller k step sizes, so that the k values that produce those local maxima are preferred for investigation over the k values obtained at smaller step sizes. (Table 6 of the experiments section shows stabilities at a larger step value for k.)
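As an illustration only, the heuristic search amounts to a coarse grid scan over f and k that records the number of clusters and the validity v at each setting; local maxima of v along k (for each f) are then the candidate settings described above. The run_mitosis and score callables below are hypothetical placeholders standing in for the full pipeline and for Eq. (9).

```python
def parameter_scan(data, f_values, k_values, run_mitosis, score):
    """Tabulate (number of clusters, validity v) over an f-k grid."""
    table = {}
    for f in f_values:
        for k in k_values:
            clusters = run_mitosis(data, f, k)   # hypothetical: returns genuine clusters
            table[(f, k)] = (len(clusters), score(clusters))
    return table
```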

Given that different f, k settings result in the same number of clusters, the clustering results can be compared using the following measure:

PF(Pr) = ( Σ_{p∈P} |NN_{d(p)}(p) ∩ c(p)| / |NN_{d(p)}(p)| ) / |P|    (10)

This is the average PF value for all patterns, where Pr is the partitioning obtained, P is the data set, c(p) is the cluster of pattern p, and d(p) is the distance that specifies a common dynamic range size for all solutions (it is selected as the distance that properly assigns each pattern to its cluster according to a given solution). The PF measures the average, over all patterns, of the ratio of similarly classified neighbors to the total number of neighbors. The larger this value is, the more preferred the partitioning becomes, due to the increase of connectivity between the patterns of the same cluster. When using this measure to compare two different solutions, a unified f value (used to get d(p)) is used to decide a standard neighborhood size. The use of this procedure is demonstrated later in the experiments section.
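A literal reading of Eq. (10), assuming the neighborhoods at the common range d(p) have been precomputed (e.g. with the helper sketched earlier), is the following; the function name is illustrative.

```python
from typing import Dict, List

def pattern_fit(neighborhoods: Dict[int, List[int]], labels: Dict[int, int]) -> float:
    """PF: average, over all patterns, of the fraction of a pattern's neighbors
    (at the common range d(p)) that share its cluster label (Eq. (10))."""
    ratios = []
    for p, nn in neighborhoods.items():
        same = sum(1 for q in nn if labels[q] == labels[p])
        ratios.append(same / len(nn))
    return sum(ratios) / len(ratios)
```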

3.2.6. Performance issues

Given that n is the size of the data set and D is the dimension size, the average time complexity of Mitosis is O(Dn log2(n)). This is also the average time complexity of DBScan and SNN. They do not need to compute all pair-wise distances, and use only neighborhood information, and are thus bounded by the complexity of searching in a metric tree. A comparison of the algorithm to related algorithms is given in Table 1, showing their average time complexities. The dimension size D is considered for all the algorithms due to the initial calculations of distances (distance measures require comparison of corresponding features, as in the Euclidean distance), and should not be neglected for high dimensions. Nearest neighbors are sought in all of the above algorithms, where DBScan and Chameleon use structures suitable for low dimensions, and Mitosis uses a binary metric tree suitable for high dimensions. Regarding Chameleon's complexity, m refers to the number of initial clusters produced by the graph partitioning. The size of m is dependent on the size of n; when the parameter Minsize (a parameter specifying the size of the smallest possible graph partition) decreases, m will increase, approaching the size of the data set n. The dimension size D weighs the n log2(n) term for seeking neighbors in high dimensional data.

Mitosis' time complexity is derived as follows: A binary metric tree is constructed for the whole data, which requires O(Dn log2(n)) time. The nearest neighbors are then sought within a time bounded by the above average complexity. In the average case, the number of total neighbors is of O(n) (a similar rationale is used in Ref. [10]), and thus sorting the associations is of order O(n log2(n)). The scanning of all associations in each of Mitosis' steps requires, in the average case, O(n) time.

4. Experiments

To examine the efficiency of Mitosis compared to other algorithms, two 2-D data sets used by Chameleon and six other high dimensional data sets were used. In the low dimensional case, Mitosis is compared to the results of Chameleon, DBScan and SNN. For high dimensional data, the results are compared to other algorithms such as Click [1], which is used for high dimensional data, and also to DBScan, after extending its implementation to handle high dimensions. K-Means is also used to include a center-based clustering solution in the comparison. The algorithms are compared with respect to the quality of solution using the F-Measure (see Ref. [25]), the Adjusted Rand index of Ref. [26], and the Jaccard coefficient discussed in Ref. [27] (see Appendix A). The speed of Mitosis is compared to Chameleon and DBScan, and the standard deviation of distances, discussed later, is used to measure the quality of the clustering considering the distance-consistency objective of Mitosis.

4.1. Chameleon's data sets

The results obtained by Mitosis for the two 2-D data sets DS5 and DS4 used by Chameleon [10] are similar to those in Ref. [10], as shown in Fig. 5a and d. The first data set DS5 is a mixture of clusters with different densities (in Fig. 5a, the I-shaped vertical cluster, left of the inverted Y cluster, has a lower density than the surrounding clusters), used by Chameleon to illustrate its efficiency in obtaining arbitrary density clusters. Mitosis was able to obtain the genuine clusters at parameter settings of (f = 2.15, k = 2.5), as shown in Fig. 5a. Both DBScan and SNN (Fig. 5b and c) were unable to obtain a comparable clustering solution. At the setting of (Eps = 10, Minpts = 3), DBScan only identified six instead of eight clusters, merging some of the original clusters (upper left and lower left clusters in Fig. 5b). This is due to the presence of different densities in the same data set, which makes DBScan either identify lower density clusters but merge two neighboring higher density ones, or fail to identify the lower density cluster while identifying the higher density ones correctly. Similarly, SNN merged two clusters together (lower left clusters in Fig. 5c) due to its static nature derived from DBScan.

Mitosis was able to identify the clusters in DS4 at the parameter setting (f = 2.25, k = 2.6) as shown in Fig. 5d. DBScan (Fig. 5e) identified the clusters at the parameter setting Eps = 5.9 and Minpts = 4 due to the density consistency in all the clusters. Similarly, SNN was able to identify the correct clusters for this data set (similar to Fig. 5d). The Mitosis results shown in Fig. 5a and d are obtained after assigning outliers to the main clusters, while the results shown in Fig. 5f and g are obtained after discarding outliers. Clusters of sizes less than 1% of the data size were identified as outliers, showing the algorithm's ability to obtain the genuine clusters and identify outliers.

Comparing results: The Adjusted Rand Index proposed by Ref. [26] is used to compare the results of the algorithms. Table 2 shows the index values, indicating a relatively high match between Mitosis' and Chameleon's results when outliers are handled in Mitosis, and similar DBScan and Mitosis solutions in the case of the DS4 data set. However, a relatively low index value is obtained when comparing Mitosis to DBScan in the case of DS5, due to the failure of DBScan in identifying the true clusters. This result indicates the ability of Mitosis to obtain results of similar quality to Chameleon's.

Speed performance: The speed of Mitosis is compared to the speed of Chameleon and DBScan. The data sets considered are the DS4 and DS5 data sets, after dividing them into data subsets, each of size 1000 patterns. The subsets are added incrementally, and the speed of the algorithms is recorded at each increment. For Chameleon, the time considered is the time required for obtaining the initial graph partitioning, added to the time required for obtaining the lowest level of the hierarchy (the first pairwise comparison between all initial clusters that results in merging two clusters only). The time for obtaining the nearest neighbors is not considered for either Mitosis or Chameleon when comparing them to each other. Yet, for Mitosis, this is included when comparing it to DBScan, as DBScan's time performance is mainly affected by the nearest neighbor search time. The time, measured in seconds, reflects the CPU processing time. The algorithms are executed on a machine with 0.5GB RAM and a 1.73GHz processor.

The results in Fig. 6a and b show a significant difference in speed in favor of Mitosis, where Mitosis exhibits a speed that is more than seven times faster than Chameleon's (considering only Chameleon's lowest level of the hierarchy). Moreover, Chameleon would need much more time to get more levels up the hierarchy. On the other hand, Fig. 6c and d shows that Mitosis' speed is comparable to that of DBScan, which is relatively fast due to its simple but less efficient static model. Thus, despite the fact that Mitosis is comparable to Chameleon with respect to the quality of solution, it is able to obtain the results in much less time, avoiding the quadratic term, discussed before, present in Chameleon's complexity (see [35,36]).

Distance consistency: The results in Table 3 show the averaged distance standard deviations (see Eq. (8)) for all clusters obtained by each algorithm. Mitosis' genuine clusters, excluding outliers, show low distance deviations compared to Chameleon. Mitosis and DBScan are comparable with respect to their clusters' distance characteristics. DBScan only clusters together patterns that are within a fixed distance Eps of each other, so all distances between patterns do not exceed this static threshold. Yet, this static threshold hinders DBScan from discovering relatively lower density clusters in data. Chameleon, on the other hand, cannot obtain such low standard deviation values, due to the presence of outliers in the initial clusters, which is not handled by Chameleon.

4.2. High dimensional data

Six data sets from different applications were used to examine the efficiency of Mitosis compared to the ground truth, DBScan, K-Means, and Click [1] (Click is used for two of the data sets, and avoided for the other data sets due to its restriction to the correlation coefficient similarity measure). SNN was not used for high dimensions due to the difficulty of selecting three parameters, and its inefficiency as shown for Chameleon's 2-D data sets. Chameleon was not used due to the difficulty of adjusting its parameters and the lengthy time required for examining each single parameter variation. The data sets included for comparison are two time series data sets, two character recognition data sets, a document data set, and a protein localization data set. Different distance metrics were used: a Pearson-based metric was used for the time series data sets, and the Euclidean distance metric was used with the other data sets. The F-Measure, the Adjusted Rand index, and the Jaccard coefficient were all used to compare the algorithms.

1200 N.A. Yousri et al. / Pattern Recognition 42 (2009) 1193 -- 1209

Fig. 5. DS5 data set: (a) Mitosis/Chameleon clusters; (b) DBScan clusters; (c) SNN clusters. DS4 data set: (d) Mitosis/Chameleon clusters; (e) DBScan clusters; (f) and (g) Mitosis after removing outliers for data sets DS5 and DS4. (Figures are best seen in color; each cluster is identified by a specific color and a specific marker.)



4.2.1. Time series data

In this section, the efficiency of Mitosis in high dimensions is studied. The data sets considered are labeled sets. Cluster visualization using tools in Ref. [28] is used to illustrate the results. The two data sets used are: the Synthetic Control Charts (SCC) data set, containing 600 patterns, each of 60 time points, and the Cylinder Bell Funnel (CBF) data set, containing 900 patterns, each of 128 time points. Both are obtained from the UCR repository [29]. A suitable metric used here is based on the Pearson correlation, as follows: d(x, y) = sqrt(1 − CORR(x, y)), where CORR(x,y) is the Pearson correlation coefficient. When rows are normalized, this distance is a metric [30] and can be used to retrieve neighbors using a metric tree.
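A minimal sketch of this metric with NumPy; the rows are assumed to be normalized, as stated above, and the clamp only guards against tiny negative values from floating-point round-off.

```python
import numpy as np

def pearson_distance(x: np.ndarray, y: np.ndarray) -> float:
    """d(x, y) = sqrt(1 - CORR(x, y)), with CORR the Pearson correlation coefficient."""
    corr = np.corrcoef(x, y)[0, 1]
    return float(np.sqrt(max(0.0, 1.0 - corr)))
```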

Synthetic control charts data: The SCC data contains six main groups, shown in Table 4. The "Normal" labeled cluster (class 1 in Table 4) is a lower density cluster (large average distance; the distances considered are those of the MST of the cluster) compared to the other clusters. Note that the "Up-Trend" and "Up-Shift" clusters (classes 3 and 5 in Table 4), as well as the "Down-Trend" and "Down-Shift" ones (classes 4 and 6 in Table 4), show consistent distance averages. External validity indexes are given in Fig. 8a to compare Mitosis to the other algorithms. The clustering results of all the algorithms, as well as the original solution, are given in Fig. 7.

Table 2
Adjusted Rand index comparing Mitosis results to Chameleon and DBScan

Data | Cham (vs. Mitosis, outliers included) | DBScan (vs. Mitosis, outliers removed)
Data set DS5 | 0.98 | 0.39
Data set DS4 | 0.96 | 0.713

Fig. 6. Speed of Mitosis compared to Chameleon using (a) the DS5 data set and (b) the DS4 data set, and to DBScan using (c) the DS5 data set and (d) the DS4 data set.


Mitosis (Fig. 7b) used the best detected combination of f and k values, 1.4 and 1.4. Parameters f and k were selected according to the procedure explained earlier; Table 5 shows the number of clusters obtained at different local maxima of v and the corresponding k values that obtained them. Note that there are two tracks of local maxima at different k values. The first tracks a number of clusters of 5 or 4 at k values between 1.2 and 1.3 through all f values examined. The second tracks a number of clusters of 4 at k values of 1.4–1.45 for f values between 1.4 and 1.7. These two tracks recommend stability at a number of clusters of 4. (Table 6 also shows the selection of parameters at larger steps of k.) The clustering solutions obtained at all f values (at number of clusters = 4) were compared according to the PF measure given earlier. The PF values, given as (f, PF), are: (1.4, 0.83916), (1.5, 0.83911), and (1.6, 0.832), recommending the use of f = 1.4.

Table 3
Distance standard deviations of different algorithms

Data | Mitosis (outliers removed) | DBScan | Cham
DS5 | 1.52 | 1.55 | 2.36
DS4 | 1.35 | 1.19 | 2.31


Fig. 7. Time series clusters (mean ± standard deviation) of SCC data: (a) original classes; (b) Mitosis at 1.4, 1.4; (c) DBScan at 0.5, 3; (d) Click at default homogeneity; and (e) K-Means at k = 6.

Table 4
Average distances of known synthetic control charts clusters

Cluster Normal Cyclic Up-trend Down-trend Up-shift Down-shift

Avg. dist. 0.83 0.34 0.40 0.38 0.41 0.41


At the setting of f = 1.4 and k = 1.4, Mitosis obtains 4 clusters covering all six known classes. The four clusters merge the original "Upward" clusters' patterns (original clusters 3 and 5) together, and the "Downward" clusters' patterns (original clusters 4 and 6) together. This is due to the presence of intermediate linking patterns, and the consistent distances of the merged classes. DBScan (Fig. 7c) was unable to discover the cluster labeled "Normal", due to its relatively low density. In an effort to find possible combinations of parameters to realize this cluster, DBScan continues to merge the other clusters until merging all patterns into one cluster, never identifying the "Normal" cluster. It obtained three clusters at the setting of (Eps = 0.5, Minpts = 3), discovering the "Cyclic" cluster and, similar to Mitosis, merging the "Up-Shift" and "Up-Trend" clusters together, as well as the "Down-Shift" and "Down-Trend" clusters. Click (Fig. 7d) was set at its default homogeneity, and obtained nine clusters that do not contain the "Normal" cluster; instead, it over-clustered the other five original clusters. K-Means (Fig. 7e) at k = 6 tends to mingle patterns from different original clusters together. The values of the F-Measure, the Adjusted Rand index and the Jaccard coefficient are shown in Fig. 9a, with Mitosis having the maximum values for all the indexes, showing its superiority to the other algorithms, due to the discovery of the "Normal" low density cluster. Yet, due to merging the connected patterns in the "Upward shift" and "Upward trend" clusters, as well as those patterns in the "Downward shift" and "Downward trend" clusters, it could not reach higher values. According to Mitosis, the connectedness of those patterns cannot be broken, which should be taken into consideration for further investigation of this data.


Fig. 8. Comparing clusters (mean ± standard deviation is plotted) obtained by the reference solution, Mitosis and K-Means for the CBF data: (a) reference solution; (b) Mitosis; and (c) K-Means.

Table 5
Parameter selection for the synthetic control charts time series data set

k \ f | 1.2 | 1.3 | 1.4 | 1.5 | 1.6 | 1.7
1.05 | 8 | 7 | 7 | 8 | 11 | 9
1.1 | 8 | 8 | 7 | 7 | 6 | 7
1.15 | 7 | 7 | 7 | 5 | 5 | 6
1.2 | 5 | 5 | 5 | 4 | 4 | 4
1.25 | 5 | 5 | 4 | 4 | 4 | 4
1.3 | 5 | 5 | 4 | 4 | 4 | 4
1.35 | 5 | 5 | 4 | 4 | 4 | 4
1.4 | 5 | 5 | 4 | 4 | 4 | 4
1.45 | 5 | 5 | 4 | 4 | 4 | 4
1.5 | 5 | 5 | 4 | 3 | 3 | 3
1.55 | 5 | 5 | 3 | 2 | 2 | 2
1.6 | 5 | 3 | 2 | 2 | 2 | 2
1.65 | 4 | 3 | 2 | 2 | 2 | 2
1.7 | 3 | 3 | 2 | 2 | 2 | 2
1.75 | 3 | 3 | 2 | 2 | 2 | 1
1.8 | 3 | 3 | 2 | 2 | 2 |
1.85 | 3 | 3 | 2 | 2 | 1 |
1.9 | 3 | 3 | 2 | 1 | |
1.95 | 3 | 2 | 1 | | |
2 | 2 | 2 | | | |
2.05 | 2 | | | | |
2.1 | 1 | | | | |

Summary (k = 1.2/1.4) | 5 | 5 | 4 | 4 | 4 | 4


Table 6
Using larger step sizes for parameter selection in the synthetic control charts data set (only critical rows are shown)

k \ f | 1.2 | 1.3 | 1.4 | 1.5
1.1 | 8 | 8 | 7 | 7
1.2 | 5 | 5 | 5 | 4
1.3 | 5 | 5 | 4 | 4
1.4 | 5 | 5 | 4 | 4
1.5 | 5 | 5 | 4 | 3
1.6 | 5 | 3 | 2 | 2

Summary (k = 1.2–1.6) | 5 | 5 | 4 | 4


Cylinder bell funnel (CBF) data: This data set has three main clusters. At the setting of (f = 1.14, k = 1.11), Mitosis was able to discover four clusters, two of them corresponding to two of the original clusters, while the other two clusters reveal that the third original cluster can be broken up into a tiny cluster and a larger one, each unique in its time series trends. After exploring a range of parameter settings, DBScan, at Eps = 0.45 and Minpts = 1, was able to discover four clusters: two tiny ones and two larger clusters that do not match any of the original clusters. Click over-clusters the data set at its default homogeneity, giving seven clusters. K-Means at k = 3 merges patterns from different original clusters together.

Fig. 9b shows the values of the F-Measure, the Adjusted Rand index, and the Jaccard coefficient, indicating the superiority of the solution obtained by Mitosis, reaching values of 0.91, 0.81, and 0.77 for the three indexes, respectively. The other algorithms fall significantly below that, giving half these values for two of the indexes, even in the case of DBScan.

The parameter selection scheme led to the values shown in Table 7, indicating that the most stable values were at values of f between 1.13 and 1.15 at the setting k = 1.11. The number of clusters obtained for this range of f values is 4 or 5. Further investigation of the PF value for the f values of 1.14 and 1.15, which gave the same number of clusters, gave the values of 0.8667 and 0.8666, respectively. As a result, the values of f = 1.14 and k = 1.11 were selected.


Fig. 9. F-Measure, Adjusted Rand Index and Jaccard Coefficient (appearing from left to right for each algorithm in each figure) for: (a) the synthetic control chart data set, (b) the CBF data, (c) the optical character recognition set, (d) the pen-digit character recognition set, (e) the document data, and (f) the protein localization data set.


Fig. 8 compares the clusters obtained by each of the reference solution, Mitosis and K-Means. The most similar cluster shapes are aligned beneath each other to ease comparison. It is observed that Mitosis is able to obtain clusters similar to the reference solution, but further divides one of the clusters into two distinct clusters (the clusters positioned in the middle of Fig. 8b both correspond to the cluster in the middle figure of Fig. 8a obtained by the reference solution). Note that the two clusters have distinct pattern shapes, which can be the reason that Mitosis distinguished them. K-Means, on the other hand, obtains one cluster similar to an original one, while the other two clusters (right figures of Fig. 8c) mingle patterns from two different original clusters (i.e. clusters obtained in the reference solution) together, due to its center-based objective. These results confirm the index values shown in Fig. 9b when comparing Mitosis to K-Means.


Table 7
Parameter selection for CBF data (only critical rows are shown)

k \ f     1.12  1.13  1.14  1.15  1.16  1.17
1.06        8    10     9     9     9    12
1.07        9     6     5     6     9     7
1.08        5     6     5     5     5     6
1.09        5     5     5     5     4     4
1.1         3     5     5     5     4     4
1.11        3     5     4     4     4     4

Summary
1.09–1.11   9     5     4     4     4     4

Table 8
Selection of parameters for the optical character recognition set (only critical rows are shown)

k \ f     1.1   1.11  1.12  1.13  1.14  1.15
1.17       10     9    10     9     9     9
1.18       10    10    10     9     9    10
1.19       10     9     9     8     9     9
1.2        10     9     8     8     9     8
1.21        9     8     8     8     9     9
1.22        8     8     7     7     9     8
1.23        8     9     9     8     8     8
1.24        9     9     8     8     8     8
1.25        7     8     8     8     6     6
1.26        7     6     6     6     5     5
1.27        7     6     5     5     5     8

Summary
1.17–1.18  10     9     9     9


4.2.2. Character recognition data

The optical character recognition and the pen-digit character recognition data sets from the UCI repository [31] are used. The first consists of 5620 patterns, each of 64 dimensions. The second consists of 10992 patterns, each of 16 dimensions. The features (dimensions) describe the bitmaps of the digit characters, and the aim is to properly classify the digit characters into the 10 classes 0 to 9. Mitosis, DBScan and K-Means are used here; Click is not used because it is restricted to the correlation coefficient as a similarity measure, whereas the Euclidean distance is used as the distance measure in this case. Fig. 9c and d give the validity index values for the two data sets. For the optical character recognition set, stable values of the validity measure were detected at different values of f and k, as shown in Table 8. The values of f = 1.12 and k = 1.18 were selected, noting that the next values, f = 1.13 and k = 1.17, show no difference in the indexes' values. Regarding DBScan, many parameter settings were examined, and the result with the best indexes' values was obtained at Eps = 18 and Minpts = 1.

Regarding the pen-digit recognition, Mitosis' parameter values are selected according to Table 9; the values of f and k were chosen as 1.14 and 1.4, respectively. For DBScan, the parameter settings that gave the best indexes' values were Eps = 28 and Minpts = 3. Mitosis shows superior performance on all three indexes (Fig. 9c and d).

Table 9
Parameter selection for the pen-digit recognition data set (only critical rows are shown)

k \ f    1.12  1.13  1.14  1.15  1.16
1.4       11    10    11    11    11
1.5       12    11    11    12    11
1.6       12    12    12    12    12
1.7       11    10    10    10    10
1.8        9     9     9     9     9

Summary
1.4/1.7  9–12   10   10–11 10–11 10–11

Table 10
Clusters and their distance standard deviation for the reference, Mitosis and K-Means solutions for the optical character data set

          Reference         Mitosis           K-Means
Cluster   Char(a)  Stdv     Char(a)  Stdv     Char(a)  Stdv
1         0        2.6      0        2.7      0        2.6
2         1        2.9      1        3.2      0        3.2
3         2        2.7      2        2.7      2        2.8
4         3        2.9      (3,9)    3.5      4        2.8
5         4        2.9      4        2.7      5        3.2
6         5        2.7      5        3.1      6        3.3
7         6        2.6      6        2.6      (1,8)    4.3
8         7        3.1      7        2.8      (1,9)    4.3
9         8        3.1      8        3.2      (4,7,9)  3.5
10        9        2.8      9        2.9      (3,8,9)  2.8

(a) Characters in each cluster; characters combined in one cluster are bracketed.

Table 11
Clusters obtained in the optical character and pen-digit character data sets

Optical character data set
Reference: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
Mitosis:   0, 1, 2, 4, 5, 6, 7, 8, 9, (3,9)
DBScan:    0, 2, 4, 5, 6, 7, 9, (1,8), (3,9)
K-Means:   0, 0, 2, 4, 5, 6, (1,8), (1,9), (4,7,9), (3,8,9)

Pen digit data set
Reference: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
Mitosis:   0, 4, 5, 6, 7, 8, (1,7), (1,2), (5,9), (9,5), (3,7,8,9)
DBScan:    0, 4, 5, 6, 7, 8, (1,2), (9,5), (1,3,7)
K-Means:   0, 0, 4, 6, 8, (5,8), (5,9), (1,2,7), (1,2,3,7,9), (1,3,5,7,8,9)

The F-Measure reaches a value above 0.8 and close to 0.9, thus approaching the reference solution. DBScan and K-Means behave similarly regarding their indexes' values. One reason that might degrade the performance of DBScan in this case, rather than an inability to identify different densities, is that it discards many patterns as outliers or noise. On the other hand, because it allocates all patterns to clusters, K-Means gives larger indexes' values than DBScan. Considering the reference, Mitosis' and K-Means' solutions (DBScan excluded as it obtains nine clusters), the character clusters of each algorithm and the corresponding distance standard deviations are given in Table 10. It can be observed that the standard deviations obtained by the reference solution are the lowest, followed by those obtained by Mitosis, with the largest values obtained by K-Means. It is also shown that while Mitosis only combines characters '3' and '9' together in cluster 4, K-Means has four clusters that combine two or three characters together. Thus, for this data set, it is observed that the higher the consistency of distances (i.e. the lower the distance standard deviation), the more the solution approaches the reference solution. Comparing the clusters obtained by each algorithm for the two data sets, the results in Table 11 were observed.

Comparing the results obtained by Mitosis and DBScan to those obtained by K-Means, it is observed that K-Means combines more characters in one cluster, as in the cluster containing six characters ((1,3,5,7,8,9)) for the pen-digit data set. Although Mitosis' and DBScan's results show some consistency, DBScan shows less consistency with the reference solution than Mitosis (as shown in Fig. 9c and d) because it leaves a large number of patterns unclassified.

4.2.3. Document clustering, protein localization and other data sets

The document data set consists of 314 documents, each expressed by a number of features corresponding to the frequency of occurrence of terms (words) in the document. There are around 10100 terms (dimensions). The data set was obtained from the authors of Refs. [25,32], and an accompanying similarity matrix was used to retrieve the nearest neighbors. Mitosis is compared against DBScan and the results obtained by Hierarchical Agglomerative Clustering (HAC) presented in Ref. [25]. The parameters in this case were f = 1.95 and k = 1.7, obtained by the parameter selection method. Only the F-Measure was used in this case, to allow comparison with the results reported in the related work [25] (HAC), as shown in Fig. 9e. Although the differences between the F-Measure values of the three algorithms are small, Mitosis still outperforms both DBScan and HAC. It is to be noted that HAC (average and complete linkage) results in globular-like shaped clusters.

The protein localization data is the Ecoli data set with 336 proteins of seven dimensions, obtained from the UCI repository [31]. The values of the indexes are given in Fig. 9f, showing the superiority of Mitosis over DBScan and K-Means. The parameters for Mitosis in this case were selected at f = 1.5 and k = 1.4. Parameters for DBScan were selected at Eps = 0.16 and Minpts = 9, giving the best indexes' values.

Mitosis was compared to related algorithms in clustering gene expression patterns in Ref. [33]. The algorithm showed the ability to discover clusters of arbitrary densities consisting of groups of connected gene expression patterns. The data sets used are Leukemia (in Ref. [33]), Serum (in Ref. [33]), and breast cancer (in Ref. [34]).

4.3. Discussion

Mitosis was shown to outperform the other algorithms in clustering all the data sets presented here. Its superiority over K-Means is due to the fact that K-Means is restricted to clusters of spherical or globular shapes, while Mitosis can discover the natural cluster shapes, whether they are spherical or not. Mitosis' superiority over DBScan is due to the fact that DBScan restricts the density of clusters to a static user-defined density. These reasons explain why both DBScan and K-Means have lower validity indexes' values than Mitosis in most of the cases: DBScan may not recognize low density clusters (such as the 'Normal' cluster in the SCC data set), and K-Means may combine parts of different original clusters together. Another reason for the lower validity indexes' values of DBScan is that it does not cluster all patterns, whereas Mitosis clusters all patterns by handling outliers.

Distance standard deviation: Table 12 shows the average distance standard deviation (see Eq. (8)) of the clusters obtained by each algorithm and by the original (reference) solution, the number of clusters discovered, and the size of the clustered data. All these factors affect the value of the standard deviation: as the number of clusters increases, the standard deviation decreases, and as the amount of clustered data increases, the standard deviation increases. Comparisons between different solutions should therefore consider solutions with similar numbers of clusters and similar sizes of clustered data, which is why the size of the clustered data is also shown in Table 12. It is observed that Mitosis' clusters maintain the smallest distance deviations, even when compared to the reference solution. DBScan shows competitive and even lower values, but this is due to the large number of unclassified patterns left out (which is also associated with its failure to discover relatively lower density clusters).

Table 12
Distance standard deviation results

Algorithm            No. of clusters   Size of clustered data   Average distance standard deviation

SCC data set
Reference solution   6                 600                      0.051
Mitosis              4                 600                      0.045
DBScan               3                 469                      0.043
Click                9                 453                      0.103
K-Means              6                 600                      0.18

CBF data set
Reference solution   3                 900                      0.0276
Mitosis              4                 900                      0.0272
DBScan               4                 760                      0.021
Click                7                 895                      0.035
K-Means              3                 900                      0.042

Ecoli data set
Reference solution   8                 336                      0.068
Mitosis              10                336                      0.037
DBScan               2                 250                      0.025
K-Means              8                 336                      0.055

Pen digit recognition
Reference solution   10                10992                    7.35
Mitosis              11                10992                    7.67
DBScan               8                 8875                     4.12
K-Means              10                10992                    8.66

Optical character recognition
Reference solution   10                5620                     2.88
Mitosis              10                5620                     2.97
DBScan               9                 ~4090                    1.8
K-Means              10                5620                     3.33

It is also shown that the highest deviation values are obtained by K-Means, followed by lower values obtained by Click. These algorithms tend to find globular clusters, and although Click proposes novel measures, it does not guarantee finding arbitrary shaped clusters.
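For reference, a per-cluster distance standard deviation of the kind reported in Table 12 can be computed as sketched below. This is a hedged reading: it assumes Eq. (8) averages, over clusters, the standard deviation of intra-cluster pairwise Euclidean distances; the exact set of distances used by the authors (for example, only neighborhood or association distances) may differ.

    import numpy as np
    from itertools import combinations

    def average_distance_std(X, labels):
        """Average, over clusters, of the standard deviation of intra-cluster
        pairwise Euclidean distances (a sketch of the Table 12 measure)."""
        stds = []
        labels = np.asarray(labels)
        for c in set(labels.tolist()):
            members = X[labels == c]
            if len(members) < 2:
                continue  # singleton clusters contribute no distances
            dists = [np.linalg.norm(a - b) for a, b in combinations(members, 2)]
            stds.append(np.std(dists))
        return float(np.mean(stds)) if stds else 0.0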

5. Conclusions

A new clustering algorithm, “Mitosis”, is proposed for finding clusters of arbitrary shapes and arbitrary densities in high dimensional data. Unlike previous algorithms, the proposed algorithm uses a dynamic model that combines both local and global distance measures. The model is embodied in the proposed dynamic range neighborhood and in the proposed clustering criteria, which use distance relatedness to merge patterns together.

The algorithm's ability to distinguish arbitrary densities at a complexity of order O(Dn log2(n)) renders it attractive to use. Its speed is comparable to simpler but less effective algorithms, and its effectiveness is comparable to effective but computationally expensive ones. Its ability to distinguish outliers is also of great importance in high dimensions. Moreover, the accompanying parameter selection procedure makes Mitosis more convenient to use compared to related algorithms.

The experimental results illustrate the effectiveness of Mitosis, compared to ground truth, in discovering relatively low and high density clusters of arbitrary shapes. The use of real high dimensional data sets supports its applicability in real life applications. Validity indexes indicate that Mitosis outperforms related algorithms such as DBScan, which finds clusters of arbitrary shapes. It is also compared to a center-based algorithm, illustrating the importance of discovering natural cluster shapes.

Acknowledgment

We would like to thank the referees for their suggestions in enhancing the presentation of this work.


Appendix A

BEGIN
// Phase 1: Get Associations
Build_Tree(P)                        // builds a metric tree T for data set P
Get_Dynamic_Nearest_Neighbors(P)     // uses T and range value r(p) = f.mindist(p); updates minimum distances for p
Get_Average_Distances(P)             // calculates average neighborhood distances for all patterns
L1 ← Sort_Neighborhood_Distances(P)  // sorts the list of associations between neighbors ascendingly on distances

// Phase 2: Merge Patterns into Clusters
L2 = { }                             // set the list of accepted associations L2 to empty
Initialize_Labels                    // each pattern takes its id as the label
Repeat
  (p,q) ← Get_Next_Association(L1)   // retrieve next association from L1
  c1 = Label(p), c2 = Label(q)       // c1 is the cluster label of p, c2 is the cluster label of q
  Initialize_Average_Distances(p,q)  // if pattern p or q is a singleton, set the cluster average distance to the pattern's local average
                                     // and set empty the cluster list of distances Dist(c1) or Dist(c2)
  If d(p,q) < k.min(μ(c1),μ(c2)) And max(μ(c1),μ(c2)) < k.min(μ(c1),μ(c2)) then  // merge condition is satisfied
    c3 ← Get_new_Label(c1,c2)        // c3 gets the maximum cluster label of c1 and c2
    If c1 <> c2 then                 // c1 and c2 are two different clusters
      μ(c3) = d(p,q) + μ(c1).|Dist(c1)| + μ(c2).|Dist(c2)|   // accumulate the distance sum for c3
      Dist(c3) = Dist(c1) ∪ Dist(c2) ∪ {d(p,q)}              // update the list of distances of cluster c3
    Else
      μ(c3) = d(p,q) + μ(c1).|Dist(c1)|                      // accumulate the distance sum for c3
      Dist(c3) = Dist(c1) ∪ {d(p,q)}                         // update the list of distances of cluster c3
    End If
    μ(c3) = μ(c3)/|Dist(c3)|         // update the distance average for c3
    L1 = L1 − a, L2 = L2 ∪ a         // remove a from the list of associations L1, add a to the list of accepted associations L2
  End If                             // end if merge condition is satisfied
Until all associations in L1 are visited

// Phase 3: Refine Clusters
Sort(L2)                             // sort the list of accepted associations L2
Initialize_Clusters                  // create an empty list of associations Clusterlist(c), initialize harmonic sum
Add_Associations_To_Clusters(L2)     // add each association to the cluster list of c and update the harmonic sum for cluster c
Initialize_Labels                    // each pattern takes its id as the label; labels are reset
L2 = { }                             // set list L2 to empty
For each cluster c {
  Get_Harmonic_Average(c)            // get the harmonic average of c
  For each association a = (p,q,d(p,q)) in cluster list Clusterlist(c) {
    c1 = Label(p), c2 = Label(q)     // c1 is the cluster label of p, c2 is the cluster label of q
    If d(p,q) < k.μH(c) then         // distance d(p,q) satisfies the harmonic average of c
      c3 = Get_Label(c1,c2)          // c3 = new cluster label from c1 and c2
      L2 = L2 ∪ a                    // add association a to list L2
    End If } }
For each cluster c { ClusterList(c) = { } }   // set empty the cluster list ClusterList(c)
For each association a = (p,q,d(p,q)) in L2 {
  c1 = Label(p), c2 = Label(q)       // get the cluster labels of p and q
  If c1 = c2 then { Clusterlist(c1) = Clusterlist(c1) ∪ {a} }   // add association a to the cluster list of c1
}
END
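As a concrete reading of the Phase 2 merge test above, the sketch below checks the condition and updates the merged cluster's distance list and average. The names mu (each cluster's average association distance) and dist_lists (the distances already accepted into each cluster) are placeholders for illustration only; relabelling the patterns of the absorbed cluster is left to the caller, as in the pseudocode.

    def try_merge(c1, c2, d_pq, k, mu, dist_lists):
        """Phase 2 merge test: the joining distance and both cluster averages
        must lie within a factor k of the smaller average."""
        lo, hi = min(mu[c1], mu[c2]), max(mu[c1], mu[c2])
        if d_pq < k * lo and hi < k * lo:          # merge condition satisfied
            merged = dist_lists[c1] + dist_lists[c2] + [d_pq]
            c3 = max(c1, c2)                       # new label: maximum of the two labels
            dist_lists[c3] = merged
            mu[c3] = sum(merged) / len(merged)     # updated average distance of c3
            return c3
        return None

    # Hypothetical usage:
    # mu = {0: 0.4, 1: 0.5}; dist_lists = {0: [0.35, 0.45], 1: [0.5]}
    # new_label = try_merge(0, 1, d_pq=0.48, k=1.4, mu=mu, dist_lists=dist_lists)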

Outlier Detection and Handling
Begin
  O = { }                                  // initialize the set of outliers O to empty
  For each cluster c
    { If size(c) < α.size(P) then { O = O ∪ c } }   // add patterns of c to O; α = 0.01 in all experiments
  For each pattern p in O
    { Repeat
        c = cluster assigned to the majority of p's neighbors
        If size(c) > 1 { Assign p to cluster c }
        Else
          f = 2.f                          // increase the value of f, multiplying by a factor (e.g. 1.5, 2, ...)
          Retrieve dynamic range neighbors for p in range f.mindist(p)
        End If
      Until p is assigned to a cluster }
End
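A minimal sketch of the outlier-handling pass above: clusters smaller than a fraction of the data are dissolved and their patterns reassigned to the cluster held by the majority of their neighbors. The neighbor lookup is left as a caller-supplied function; the fraction 0.01 is the value used in the experiments, and the remaining names and structure are illustrative (the range-widening retry in the pseudocode is only indicated by a comment).

    from collections import Counter

    def reassign_outliers(labels, neighbors_of, n_total, frac=0.01):
        """Dissolve tiny clusters and reassign their members to the majority
        cluster among each pattern's neighbors (illustrative sketch)."""
        sizes = Counter(labels)
        outliers = [p for p, c in enumerate(labels) if sizes[c] < frac * n_total]
        for p in outliers:
            votes = Counter(labels[q] for q in neighbors_of(p)
                            if sizes[labels[q]] >= frac * n_total)
            if votes:
                labels[p] = votes.most_common(1)[0][0]
            # otherwise the neighborhood range would be widened and the
            # neighbors re-fetched, as in the pseudocode (not shown here)
        return labels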


Appendix B

1. F-Measure for a clustering result C is defined as follows (as in Ref. [25]):

F_C = \frac{\sum_i |i| \cdot F(i)}{\sum_i |i|}

where i is a cluster, and |i| is its size, F(i) is defined as

F(i)= 2PR/(P + R)

where P is the precision, and R is the recall defined as follows:

P(i, j)= N(i, j)/N(i)

R(i, j)= N(i, j)/N(j)

where N(i,j) is the number of members of class i in cluster j, N(i) is the number of members of class i, and N(j) is the number of members of cluster j.

2. Adjusted Rand Index is defined as follows: Let n_ij be the number of patterns in both cluster i and class j, and let n_i and n_j be the numbers of patterns in cluster i and class j, respectively. The adjusted Rand index is

AR = \frac{\sum_{i,j} \binom{n_{ij}}{2} - \left[\sum_i \binom{n_i}{2} \sum_j \binom{n_j}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{n_i}{2} + \sum_j \binom{n_j}{2}\right] - \left[\sum_i \binom{n_i}{2} \sum_j \binom{n_j}{2}\right] / \binom{n}{2}}

3. Jaccard Coefficient is defined as follows: If a defined partition of the data is given as C1, and the cluster structure is C2, then define:

a: the number of pairs of patterns belonging to the same group in C1 and the same cluster in C2,
b: the number of pairs of patterns belonging to the same group in C1 but to two different clusters in C2,
c: the number of pairs of patterns belonging to two different groups in C1 but to the same cluster in C2.

The Jaccard coefficient can be defined as

J = a/(a + b + c)
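The three indexes above can be computed from a contingency table of classes versus clusters. The sketch below follows the Appendix B formulas, using the usual reading of the Ref. [25] F-Measure (for each class, the best F value over clusters, weighted by class size) and pair counts for the Adjusted Rand Index and Jaccard coefficient; function and variable names are illustrative.

    import numpy as np

    def contingency(classes, clusters):
        """Rows index classes, columns index clusters; entry (i, j) = N(i, j)."""
        cls, clu = sorted(set(classes)), sorted(set(clusters))
        M = np.zeros((len(cls), len(clu)), dtype=int)
        for a, b in zip(classes, clusters):
            M[cls.index(a), clu.index(b)] += 1
        return M

    def f_measure(M):
        # For each class i take the best F(i, j) = 2 N(i,j) / (N(i) + N(j)) over
        # clusters j, weighted by class size.
        Ni, Nj, n = M.sum(axis=1), M.sum(axis=0), M.sum()
        best = [max(2 * M[i, j] / (Ni[i] + Nj[j]) for j in range(M.shape[1]))
                for i in range(M.shape[0])]
        return float(np.sum(Ni * np.array(best)) / n)

    def adjusted_rand(M):
        comb2 = lambda x: x * (x - 1) / 2.0
        sum_ij = comb2(M).sum()
        sum_i = comb2(M.sum(axis=1)).sum()
        sum_j = comb2(M.sum(axis=0)).sum()
        expected = sum_i * sum_j / comb2(M.sum())
        return float((sum_ij - expected) / (0.5 * (sum_i + sum_j) - expected))

    def jaccard(M):
        comb2 = lambda x: x * (x - 1) / 2.0
        a = comb2(M).sum()                    # pairs: same class, same cluster
        b = comb2(M.sum(axis=1)).sum() - a    # pairs: same class, different clusters
        c = comb2(M.sum(axis=0)).sum() - a    # pairs: different classes, same cluster
        return float(a / (a + b + c))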

References

[1] R. Shamir, R. Sharan, Algorithmic approaches to clustering gene expression data, in: Current Topics in Computational Biology, MIT Press, Cambridge, MA, 2002, pp. 269–299.
[2] Y.W. Lim, S.U. Lee, On the color image segmentation algorithm based on the thresholding and the fuzzy c-means techniques, Pattern Recognition 23 (9) (1990) 935–952.
[3] T. Oates, M.D. Schmill, P.R. Cohen, A method for clustering the experiences of a mobile robot that accords with human judgments, © 2000, American Association for Artificial Intelligence.
[4] L. Ertoz, M. Steinbach, V. Kumar, Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data, in: Proceedings of the Second SIAM International Conference on Data Mining, San Francisco, CA, USA, May 2003.
[5] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, New York, 1990.
[6] R.T. Ng, J. Han, CLARANS: a method for clustering objects for spatial data mining, IEEE Trans. Knowl. Data Eng. 14 (5) (2002) 1003–1016.
[7] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large data sets, Data Mining Knowl. Discovery 1 (2) (1997) 141–182.
[8] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial data sets with noise, in: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, pp. 226–231.
[9] A. Hinneburg, D.A. Keim, An efficient approach to clustering in large multimedia databases with noise, in: Knowledge Discovery and Data Mining, 1998.
[10] G. Karypis, E.H. Han, V. Kumar, CHAMELEON: a hierarchical clustering algorithm using dynamic modeling, Computer 32 (8) (1999) 68–75.
[11] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Comput. Surveys 31 (3) (1999).
[12] M.A. Ismail, Soft clustering: algorithms and validity of solutions, in: M.M. Gupta, T. Yamakawa (Eds.), Fuzzy Computing: Theory, Hardware and Applications, Elsevier Science Publishers B.V., North-Holland, Amsterdam, 1988, pp. 445–471.
[13] S.Z. Selim, M.A. Ismail, Soft clustering of multidimensional data: a semi-fuzzy approach, Pattern Recognition 17 (5) (1984) 559–568.
[14] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, New York, 2001.
[15] W. Wang, J. Yang, M. Muntz, STING: a statistical information grid approach to spatial data mining, in: Proceedings of the 1997 International Conference on Very Large Data Bases (VLDB'97), 1997, pp. 186–195.
[16] R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, in: Proceedings of the 1998 ACM-SIGMOD Conference on the Management of Data, pp. 94–105.
[17] G. Sheikholeslami, S. Chatterjee, A. Zhang, WaveCluster: a multiresolution clustering approach for very large spatial databases, in: Proceedings of the 1998 International Conference on Very Large Data Bases (VLDB'98), 1998, pp. 428–439.
[18] B. Qiu, X. Zhang, J. Shen, A grid based clustering algorithm for multi-density, in: Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, 18–21 August 2005.
[19] S. Guha, R. Rastogi, K. Shim, CURE: an efficient clustering algorithm for large databases, in: Proceedings of the 1998 ACM-SIGMOD International Conference on Management of Data (SIGMOD'98), 1998, pp. 73–84.
[20] S. Guha, R. Rastogi, K. Shim, ROCK: a robust clustering algorithm for categorical attributes, in: Proceedings of the IEEE Conference on Data Engineering, 1999.
[21] C. Zahn, Graph-theoretical methods for detecting and describing Gestalt clusters, IEEE Trans. Comput. C-20 (1971) 68–86.
[22] R. Urquhart, Graph theoretical clustering based on limited neighborhood set, Pattern Recognition 15 (3) (1982) 173–187.
[23] N.A. Yousri, M.A. Ismail, M.S. Kamel, Adaptive similarity search in metric trees, in: IEEE Systems Man Cybernetics (SMC) 2007, Montreal, Canada, October 2007.
[24] P. Ciaccia, M. Patella, P. Zezula, M-tree: an efficient access method for similarity search in metric spaces, in: Proceedings of the 23rd Conference on Very Large Databases (VLDB'97), pp. 426–435.


[25] K. Hammouda, M.S. Kamel, Phrase-based document similarity based on an index graph model, in: The 2002 IEEE International Conference on Data Mining (ICDM'02), Maebashi, Japan, December 2002, pp. 203–210.
[26] L. Hubert, P. Arabie, Comparing partitions, J. Classification (1985) 193–218 (supplementary http://faculty.washington.edu/kayee/pca/supp.pdf).
[27] M. Halkidi, Y. Batistakis, M. Vazirgiannis, Cluster validity methods. Part 1, SIGMOD Record 31 (2), June 2002.
[28] 〈http://www.math.tau.ac.il/_rshamir/click/click.html〉.
[29] E. Keogh, X. Xi, L. Wei, C.A. Ratanamahatana, The UCR Time Series Classification/Clustering Homepage: 〈www.cs.ucr.edu/∼eamonn/time_series_data/〉, 2006.
[30] J.A. Hartigan, Clustering Algorithms, Wiley Series in Probability and Mathematical Statistics, 1975.
[31] 〈http://mlearn.ics.uci.edu/MLSummary.html〉.
[32] K. Hammouda, M.S. Kamel, Efficient phrase-based document indexing for web document clustering, IEEE Trans. Knowl. Data Eng. (TKDE) 16 (10) (2004).
[33] N.A. Yousri, M.A. Ismail, M.S. Kamel, Discovering connected patterns in gene expression arrays, in: IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB) 2007, Hawaii, USA.
[34] N.A. Yousri, M.S. Kamel, M.A. Ismail, Pattern cores and connectedness in cancer gene expression, in: Seventh IEEE International Conference on BioInformatics and BioEngineering (BIBE) 2007, Boston, USA, October 2007.
[35] S. Kotsiantis, P. Pintelas, Recent advances in clustering: a brief survey, WSEAS Trans. Inf. Sci. Appl. 1 (2004) 73–81.
[36] R. Xu, D. Wunsch, Survey of clustering algorithms, IEEE Trans. Neural Networks 16 (3) (2005).

About the Author—NOHA A. YOUSRI received the B.Sc. (Hons) and M.Sc. degrees in Computer Science from Alexandria University, Egypt. She joined the University of Waterloo, Canada, in July 2006 as a co-supervised Ph.D. student under a joint program between the University of Alexandria, Egypt, and the University of Waterloo. She is a member of the Pattern Analysis and Machine Intelligence Laboratory at the Department of Electrical and Computer Engineering. Her research interests are in pattern recognition, specifically nontraditional clustering algorithms, and bioinformatics. She has also co-authored papers in the areas of data warehousing, similarity search, outlier analysis, clustering gene expression data and core pattern finding. She is a student member of the IEEE.

About the Author—M.S. KAMEL received the B.Sc. (Hons) EE (Alexandria University), M.A.Sc. (McMaster University), and Ph.D. (University of Toronto). He joined the University of Waterloo, Canada, in 1985, where he is at present Professor and Director of the Pattern Analysis and Machine Intelligence Laboratory at the Department of Electrical and Computer Engineering. Professor Kamel holds a Canada Research Chair in Cooperative Intelligent Systems. Dr. Kamel's research interests are in Computational Intelligence, Pattern Recognition, Machine Learning and Cooperative Intelligent Systems. He has authored and co-authored over 350 papers in journals and conference proceedings, seven edited volumes, two patents and numerous technical and industrial project reports. Under his supervision, 72 Ph.D. and M.A.Sc. students have completed their degrees. He is the Editor-in-Chief of the International Journal of Robotics and Automation, and Associate Editor of the IEEE SMC, Part A, Pattern Recognition Letters, and the Cognitive Neurodynamics journal. He is also a member of the editorial advisory board of the International Journal of Image and Graphics and the Intelligent Automation and Soft Computing journal. He has also served as Associate Editor of Simulation, the Journal of The Society for Computer Simulation. Based on his work at NCR, he received the NCR Inventor Award. He is also a recipient of the Systems Research Foundation Award for outstanding presentation in 1985 and the ISRAM best paper award in 1992. In 1994 he was awarded the IEEE Computer Society Press outstanding referee award. He was also a coauthor of the best paper in the 2000 IEEE Canadian Conference on Electrical and Computer Engineering. Dr. Kamel is a recipient of the University of Waterloo outstanding performance award and the Faculty of Engineering distinguished performance award. Dr. Kamel is a member of ACM and PEO, Fellow of IEEE, Fellow of the Engineering Institute of Canada (EIC) and Fellow of the Canadian Academy of Engineering (CAE). He has served as Consultant for General Motors, NCR, IBM, Northern Telecom and Spar Aerospace. He is a member of the Board of Directors and Co-founder of Virtek Vision Inc. of Waterloo.

About the Author—M.A. ISMAIL received the B.Sc. (honors) and M.Sc. degrees in Computer Science from the University of Alexandria, Egypt, in 1970 and 1974, respectively, and the Ph.D. degree in Electrical Engineering from the University of Waterloo, Canada, in 1980. He is a Professor of Computer Science in the Department of Computer Science, Alexandria University, Egypt. He has taught Computer Science and Engineering at the University of Waterloo, Canada, the University of Petroleum and Minerals (UPM), Saudi Arabia, the University of Windsor, Canada, and the University of Michigan, Ann Arbor. His research interests include pattern analysis and machine intelligence, data structures and analysis, medical computer science, and nontraditional databases.