
A heuristic algorithm for solving the minimum sum-of-squares clustering problems∗

Burak Ordin1 Adil M. Bagirov2

1Department of Mathematics, Faculty of Science, Ege University, Izmir, Turkey

2Faculty of Science and Technology, Federation University Australia, Victoria, Australia

Abstract

Clustering is an important task in data mining. It can be formulated as a global optimization problem which is challenging for existing global optimization techniques even in medium size data sets. Various heuristics were developed to solve the clustering problem. The global k-means and modified global k-means are among the most efficient heuristics for solving the minimum sum-of-squares clustering problem. However, these algorithms are not always accurate in finding global or near global solutions to the clustering problem. In this paper, we introduce a new algorithm to improve the accuracy of the modified global k-means algorithm in finding global solutions. We use an auxiliary cluster problem to generate a set of initial points and apply the k-means algorithm starting from these points to find the global solution to the clustering problems. Numerical results on 16 real-world data sets clearly demonstrate the superiority of the proposed algorithm over the global and modified global k-means algorithms in finding global solutions to clustering problems.

Keywords: minimum sum-of-squares clustering, global optimization, nonsmooth optimization, k-means algorithm, global k-means algorithm.

1 Introduction

Cluster analysis, also known as unsupervised partitioning, deals with the problem of organizing a collection of patterns into clusters based on similarity. It is among the most important tasks in data mining. Cluster analysis has applications in different areas such as medicine, engineering and business.

There are different types of clustering problems. In this paper, we consider the unconstrained hard clustering problem, which can be formulated as an optimization problem, and there are various such formulations.

∗Published as: B. Ordin and A.M. Bagirov, A heuristic algorithm for solving the minimum sum-of-squares clustering problems, Journal of Global Optimization, 61(2), 2015, 341-361.


These include combinatorial, integer programming and nonconvex nonsmooth optimization formulations [3, 5, 6, 10]. In all cases the clustering problem is a global optimization problem; it has many solutions and only global solutions provide the best cluster structure of a data set. Various optimization techniques have been applied to solve this problem. These techniques include dynamic programming, branch and bound, cutting plane and interior point methods, the variable neighborhood search algorithm and metaheuristics such as simulated annealing, tabu search and genetic algorithms [1, 11, 14, 15, 17, 18, 19, 20, 22, 25, 28, 30, 31].

The above mentioned algorithms are not efficient for solving clustering problems in large data sets. Therefore, various heuristics have been developed to tackle such problems. These heuristics include the k-means algorithm and its variations such as h-means and j-means. However, these algorithms are very sensitive to the choice of initial solutions; they can find only local solutions, and such solutions in large data sets may significantly differ from global ones [18].

Over the last several years different incremental algorithms have been developed to address difficulties with the choice of initial solutions in the k-means algorithms. Incremental clustering algorithms attempt to optimally add one new cluster center at each stage. In order to compute the $k$-partition of a set, these algorithms start from an initial state with the $k-1$ centers for the $(k-1)$-clustering problem, and the remaining $k$-th center is placed in an appropriate position.

The global k-means algorithm, introduced in [24], is a significant improvement of the k-means algorithm. In this algorithm each data point is used as a starting point for the $k$-th cluster center. Such an approach leads to a global or near global minimizer. However, this approach is not efficient since it is very time consuming: at each iteration the k-means algorithm is applied as many times as there are data points. Instead, the authors suggest two procedures to reduce the computational load. The incremental approach is also discussed in [21].

The modified global k-means algorithm was introduced in [3]. It is an incremental algorithm where initial solutions for cluster centers are computed by solving an auxiliary clustering problem. Results of numerical experiments presented in [3] demonstrate that the modified global k-means algorithm is able to find a better solution than the global k-means algorithm, although it requires more computational effort.

Although the global and modified global k-means algorithms are efficient algorithms for solving the minimum sum-of-squares clustering problem, they are not always accurate in locating global solutions. In particular, these algorithms are not efficient for solving clustering problems in data sets with isolated clusters as well as in sparse data sets. In this paper, we develop a new algorithm to improve the accuracy of the modified global k-means algorithm in locating global solutions. An important step in the proposed algorithm is a procedure for generating a set of initial solutions. Using results of numerical experiments on 16 real-world data sets we demonstrate that the proposed algorithm is (sometimes significantly) more accurate than the existing incremental algorithms.

The rest of the paper is organized as follows: Section 2 gives a brief description of optimization formulations of the clustering problem. Various optimization based algorithms for solving clustering problems are discussed in Section 3 and necessary conditions for these problems are given in Section 4. Section 5 presents an algorithm for finding initial points. An incremental clustering algorithm is described in Section 6. The results of numerical experiments are given in Section 7. Section 8 concludes the paper.


2 Optimization formulations of clustering problems

In cluster analysis we assume that we have been given a finite set of points $A$ in the $n$-dimensional space $\mathbb{R}^n$, that is

$$A = \{a^1, \ldots, a^m\}, \quad \text{where } a^i \in \mathbb{R}^n, \ i = 1, \ldots, m.$$

The hard unconstrained clustering problem is the distribution of the points of the set $A$ into a given number $k$ of disjoint subsets $A^j$, $j = 1, \ldots, k$ with respect to predefined criteria such that:

1) $A^j \neq \emptyset$, $j = 1, \ldots, k$;

2) $A^j \cap A^l = \emptyset$, $j, l = 1, \ldots, k$, $j \neq l$;

3) $A = \bigcup_{j=1}^{k} A^j$;

4) no constraints are imposed on the clusters $A^j$, $j = 1, \ldots, k$.

The sets $A^j$, $j = 1, \ldots, k$ are called clusters. We assume that each cluster $A^j$ can be identified by its center (or centroid) $x^j \in \mathbb{R}^n$, $j = 1, \ldots, k$. The problem of finding these centers is called the $k$-th clustering problem. Data points from the same cluster are similar and data points from different clusters are dissimilar to each other. In order to formulate the clustering problem one needs to define the so-called similarity (or dissimilarity) measure. The dissimilarity measure between points $x, y \in \mathbb{R}^n$ can be defined using a distance. We define this measure using the squared Euclidean distance:

$$\|x - y\|^2 = \sum_{i=1}^{n} (x_i - y_i)^2, \quad x, y \in \mathbb{R}^n.$$

In this case the clustering problem is also known as the minimum sum-of-squares clustering problem. The $k$-th clustering problem can be reduced to the following optimization problem (see [10, 30]):

$$\text{minimize } \psi_k(x, w) = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k} w_{ij} \, \|x^j - a^i\|^2 \qquad (1)$$

$$\text{subject to } \sum_{j=1}^{k} w_{ij} = 1, \quad i = 1, \ldots, m, \qquad (2)$$

$$w_{ij} \in \{0, 1\}, \quad i = 1, \ldots, m, \ j = 1, \ldots, k. \qquad (3)$$

Here $w_{ij}$ is the association weight of pattern $a^i$ with the cluster $j$, given by

$$w_{ij} = \begin{cases} 1 & \text{if pattern } a^i \text{ is allocated to the cluster } j, \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

and

$$x^j = \frac{\sum_{i=1}^{m} w_{ij} a^i}{\sum_{i=1}^{m} w_{ij}}, \quad j = 1, \ldots, k. \qquad (5)$$

Here $w$ is an $m \times k$ matrix. The problem (1)-(3) is called the integer programming formulation of the clustering problem.


The nonsmooth nonconvex optimization formulation of the clustering problem is as follows (see [4, 5, 6, 10]):

$$\text{minimize } f_k(x) \ \text{ subject to } \ x = (x^1, \ldots, x^k) \in \mathbb{R}^{n \times k}, \qquad (6)$$

where

$$f_k(x^1, \ldots, x^k) = \frac{1}{m} \sum_{i=1}^{m} \min_{j=1,\ldots,k} \|x^j - a^i\|^2. \qquad (7)$$

We call both $\psi_k$ and $f_k$ cluster functions. Comparing these two functions as well as the two different formulations of the hard clustering problem one can note that:

1. The objective function $\psi_k$ depends on the variables $w_{ij}$, $i = 1, \ldots, m$, $j = 1, \ldots, k$, which are integers. However, the function $f_k$ depends only on the continuous variables $x^1, \ldots, x^k$.

2. The number of variables in problem (1)-(3) is $m \times k$, whereas in problem (6) this number is $n \times k$ and the number of variables does not depend on the number of instances. It should be noted that in many real-world data sets the number of instances $m$ is considerably greater than the number of features $n$.

3. Since the function $f_k$ is represented as a sum of minima functions it is nonsmooth for $k > 1$.

4. Both functions $\psi_k$ and $f_k$ are nonconvex for $k > 1$.

5. Problem (1)-(3) is an integer programming problem and problem (6) is a nonsmooth global optimization problem. These problems are equivalent in the sense that their global minimizers coincide (see [10]).

Items 1 and 2 can be considered as advantages of the nonsmooth optimization formulation (6) of the clustering problem.
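To make the nonsmooth formulation (6)-(7) concrete, the following is a minimal sketch (not the authors' code) of how the cluster function $f_k$ can be evaluated for a given collection of centers; all function and variable names are illustrative.

```python
import numpy as np

def cluster_function(centers, data):
    """Evaluate f_k in (7): the mean over all data points of the squared
    Euclidean distance to the closest of the k centers."""
    # centers: array of shape (k, n); data: array of shape (m, n)
    # squared distances of every point to every center, shape (m, k)
    sq_dist = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.min(axis=1).mean()

# tiny usage example with synthetic data
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((100, 2))          # m = 100 points in R^2
    X = A[rng.choice(100, size=3)]    # k = 3 centers taken from the data
    print(cluster_function(X, A))
```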

3 Optimization algorithms for solving clustering problems

In this section we briefly discuss optimization algorithms for solving the minimum sum-of-squares clustering problem. These algorithms have been developed for more than five decades and, without loss of generality, they can be divided into the following groups:

1. Clustering algorithms based on deterministic optimization techniques. The clustering problem is a global optimization problem, therefore different global and local optimization algorithms were applied to solve it. These algorithms include local methods such as dynamic programming, the interior point method and the cutting plane method, and global search methods such as the branch and bound and the neighborhood search methods [14, 18, 20, 22, 25]. Other deterministic optimization techniques include the discrete gradient, truncated codifferential and hyperbolic smoothing methods, which were applied to solve the minimum sum-of-squares clustering problem using its nonsmooth optimization formulation [8, 13, 33, 34]. In [29], the clustering problem is formulated as a nonlinear programming problem for which a tight linear programming relaxation is constructed via the reformulation-linearization technique. This construct is embedded within a specialized branch-and-bound algorithm to solve the problem to global optimality. A clustering algorithm based on a variant of the generalized Benders decomposition, denoted as the global optimum search, is developed in [32].


2. Clustering algorithms based on metaheuristics. Some metaheuristics were applied to solve the clustering problem, including tabu search, simulated annealing and genetic algorithms [1, 2, 11, 28, 31]. Metaheuristics can efficiently deal with both integer and continuous variables and therefore they can be applied to solve the clustering problem using its formulation (1)-(3).

3. Heuristics. These algorithms were specifically designed to solve the clustering problem. The k-means algorithm and its variations are representatives of such heuristics [19, 30].

4. Heuristics based on the incremental approach. These algorithms start with the computation of the center of the whole data set $A$ and attempt to optimally add one new cluster center at each stage. In order to solve Problem (6) for $k > 1$ these algorithms start from an initial state with the $k-1$ centers for the $(k-1)$-clustering problem, and the remaining $k$-th center is placed in an appropriate position. The global k-means and modified global k-means are representatives of these algorithms [3, 7, 21, 24].

Numerical results demonstrate that most of the algorithms mentioned in 1) and 2) are not always efficient for solving clustering problems in large data sets. For example, some deterministic methods and metaheuristics cannot efficiently solve clustering problems in data sets with thousands of data points. These methods require considerably more computational effort than heuristic algorithms in such data sets.

Heuristic clustering algorithms have been shown to be efficient. Their typical representative, the k-means algorithm, can deal with very large data sets. However, this algorithm is very sensitive to the choice of starting points. It can calculate only local minimizers of the clustering problem, and in large data sets these local minimizers might be significantly different from the global minimizers. It is expected that global minimizers of the clustering problem provide the best cluster structure of a data set with the least number of clusters.
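For reference, here is a minimal sketch of the k-means (Lloyd) iteration discussed above, written for the squared Euclidean distance used in this paper; its dependence on the initial centers passed in is exactly the issue the incremental algorithms below address. The names are illustrative, not taken from the paper.

```python
import numpy as np

def kmeans(data, centers, max_iter=100, tol=1e-9):
    """Plain k-means: alternate assignment and centroid update until
    the centers stop moving. Returns the final centers and assignment."""
    centers = centers.copy()
    for _ in range(max_iter):
        # assign every point to its closest center
        sq_dist = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(centers.shape[0]):
            members = data[labels == j]
            if members.size:                 # keep empty clusters unchanged
                new_centers[j] = members.mean(axis=0)
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```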

Since the clustering problem is a global optimization problem, the use of local methods for its solution should be combined with algorithms for finding good starting points for cluster centers. Algorithms based on an incremental approach are able to find such starting points. Incremental clustering algorithms have been developed over the last decade [3, 7, 21, 23, 24]. Numerical results demonstrate that these algorithms are efficient for solving clustering problems in large data sets. In this paper, we develop an incremental clustering algorithm based on the combination of the modified global k-means algorithm and a special algorithm for finding a set of starting points.

4 Clustering and auxiliary clustering problems

In this section we study optimality conditions for the $k$-th clustering problem and also for the so-called $k$-th auxiliary clustering problem.

4.1 The clustering problem

The objective function $f_k$ in Problem (6) can be expressed as a difference of two convex (DC) functions as follows:

$$f_k(X) = f_k^1(X) - f_k^2(X), \quad X = (x^1, \ldots, x^k) \in \mathbb{R}^{n \times k},$$


where

$$f_k^1(X) = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k} \|x^j - a^i\|^2, \qquad f_k^2(X) = \frac{1}{m} \sum_{i=1}^{m} \max_{j=1,\ldots,k} \sum_{t=1, t \neq j}^{k} \|x^t - a^i\|^2.$$

The function $f_k^1$, as a sum of convex functions (squared Euclidean norms), is convex. The representation of the function $f_k^2$ is more complex: it is a sum of maxima of sums of convex functions (squared Euclidean norms). Since a sum of convex functions is convex, the functions under the maximum are convex. Furthermore, since the maximum of a finite number of convex functions is also convex (see [27]), the function $f_k^2$ is represented as a sum of convex functions and therefore it is also convex.
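The DC identity $f_k = f_k^1 - f_k^2$ holds because, for each data point, the inner term of $f_k^2$ equals the total sum of squared distances minus the distance to the closest center. A short sketch (illustrative, not from the paper) that checks the identity numerically on random data:

```python
import numpy as np

def f_k(X, A):
    d = ((A[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # (m, k)
    return d.min(axis=1).mean()

def f_k1(X, A):
    d = ((A[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return d.sum(axis=1).mean()

def f_k2(X, A):
    d = ((A[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # for each point: max over j of the sum of distances to all centers but x^j
    return (d.sum(axis=1, keepdims=True) - d).max(axis=1).mean()

rng = np.random.default_rng(1)
A = rng.random((50, 3))
X = rng.random((4, 3))
assert np.isclose(f_k(X, A), f_k1(X, A) - f_k2(X, A))
```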

The necessary condition for a minimum in Problem (6) is:

$$\partial f_k^2(X) \subseteq \partial f_k^1(X), \qquad (8)$$

where $\partial f(X)$ is the subdifferential of a function $f$ at a point $X$ in the sense of convex analysis [27]. It is clear that the function $f_k^1$ is differentiable and its gradient is:

$$\nabla f_k^1(X) = 2(X - \bar{A}).$$

Here $\bar{A} = (\bar{A}^1, \ldots, \bar{A}^k)$, $\bar{A}^1 = \ldots = \bar{A}^k = (\bar{a}_1, \ldots, \bar{a}_n)$ and $\bar{a}_t = \frac{1}{m} \sum_{i=1}^{m} a^i_t$. This means that the subdifferential $\partial f_k^1(X)$ is a singleton for any $X$. In general, the function $f_k^2$ is nonsmooth. To compute its subdifferential we consider the function

$$\varphi_i(X) = \max_{j=1,\ldots,k} \sum_{t=1, t \neq j}^{k} \|x^t - a^i\|^2$$

and introduce the following set:

$$R_i(X) = \left\{ j \in \{1, \ldots, k\} : \sum_{t=1, t \neq j}^{k} \|x^t - a^i\|^2 = \varphi_i(X) \right\}.$$

The subdifferential $\partial \varphi_i(X)$ of the function $\varphi_i$ at $X$ is as follows:

$$\partial \varphi_i(X) = \operatorname{conv}\left\{ V \in \mathbb{R}^{n \times k} : V = 2(X^j - A^j_i), \ j \in R_i(X) \right\}$$

where

$$X^j = \left(x^1, \ldots, x^{j-1}, 0_n, x^{j+1}, \ldots, x^k\right), \quad A^j_i = \left(A^j_{i1}, \ldots, A^j_{ik}\right) \in \mathbb{R}^{n \times k}$$

and

$$A^j_{it} = a^i, \ t = 1, \ldots, k, \ t \neq j, \qquad A^j_{ij} = 0_n.$$

Then the subdifferential $\partial f_k^2(X)$ can be expressed as:

$$\partial f_k^2(X) = \frac{1}{m} \sum_{i=1}^{m} \partial \varphi_i(X).$$

Since the subdifferential $\partial f_k^1$ is a singleton at any point $X$, it follows from (8) that the subdifferential $\partial f_k^2$ is a singleton at any stationary point of $f_k$. This means that the subdifferentials $\partial \varphi_i(X)$, $i = 1, \ldots, m$ are also singletons at any stationary point $X$, which in its turn means that at any such point $X$ the index sets $R_i(X)$ are singletons for all $i \in \{1, \ldots, m\}$. This implies that for each $i \in \{1, \ldots, m\}$ there exists a unique $j \in \{1, \ldots, k\}$ such that $R_i(X) = \{j\}$. It follows from the DC representation of the function $f_k$ that this $j$ stands for the index of the cluster to which the data point $a^i$ belongs. Thus, if $X$ is a stationary point in the sense of (8) then for each data point $a^i$ there exists only one cluster center $x^j$ such that

$$\min_{t=1,\ldots,k} \|x^t - a^i\|^2 = \|x^j - a^i\|^2$$

and $\|x^t - a^i\|^2 > \|x^j - a^i\|^2$ for any other $t = 1, \ldots, k$, $t \neq j$. This means that the clustering function $f_k$ is continuously differentiable at any stationary point $X$ of Problem (6) and the Clarke subdifferential of this function is a singleton at such points, that is

$$\partial f_k(X) = \{\nabla f_k(X)\},$$

where

$$\nabla f_k(X) = \frac{2}{m} \sum_{i=1}^{m} \sum_{j \in R_i(X)} (x^j - a^i).$$

Then the necessary condition for a minimum is $\nabla f_k(X) = 0$ and, in addition, each cluster center $x^j$, $j = 1, \ldots, k$ attracts only data points $a^i \in A$ such that $j \in R_i(X)$.
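In practice the condition $\nabla f_k(X) = 0$ amounts to every cluster center coinciding with the centroid of the data points it attracts, which is essentially what the k-means iteration enforces at termination. A small sketch (an assumed helper, not from the paper) that checks this at a candidate solution:

```python
import numpy as np

def is_stationary(centers, data, tol=1e-8):
    """Check the necessary condition of Section 4.1: at a stationary point
    every center x^j coincides with the centroid of the data points it
    attracts (so that the gradient of f_k vanishes)."""
    labels = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    for j in range(centers.shape[0]):
        members = data[labels == j]
        if members.size and np.linalg.norm(members.mean(axis=0) - centers[j]) > tol:
            return False
    return True
```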

4.2 The auxiliary clustering problem

Assume that the solution $x^1, \ldots, x^{k-1}$, $k \geq 2$ to the $(k-1)$-clustering problem is known. Denote by $d^i_{k-1}$ the distance between $a^i$, $i = 1, \ldots, m$ and the closest cluster center among the $k-1$ centers $x^1, \ldots, x^{k-1}$:

$$d^i_{k-1} = \min\left\{\|x^1 - a^i\|^2, \ldots, \|x^{k-1} - a^i\|^2\right\}. \qquad (9)$$

We will also use the notation $d^a_{k-1}$ for $a \in A$. Define the following function:

$$\bar{f}_k(y) = \frac{1}{m} \sum_{i=1}^{m} \min\left\{d^i_{k-1}, \|y - a^i\|^2\right\}, \quad y \in \mathbb{R}^n. \qquad (10)$$

We call $\bar{f}_k$ the $k$-th auxiliary cluster function. This function is nonsmooth and, as a sum of minima of convex functions, it is, in general, nonconvex. Moreover, the function is locally Lipschitz and directionally differentiable. It is obvious that

$$\bar{f}_k(y) = f_k(x^1, \ldots, x^{k-1}, y), \quad \forall y \in \mathbb{R}^n.$$
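A minimal sketch of evaluating the auxiliary cluster function (10), assuming the solution of the $(k-1)$-clustering problem is available as an array of centers; names are illustrative, not the authors' code.

```python
import numpy as np

def aux_cluster_function(y, prev_centers, data):
    """Evaluate the k-th auxiliary cluster function (10): for every data
    point take the smaller of (a) its squared distance to the closest of
    the k-1 known centers and (b) its squared distance to the candidate y."""
    # d_{k-1}^i in (9): distance to the closest of the known centers
    d_prev = ((data[:, None, :] - prev_centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    d_y = ((data - y) ** 2).sum(axis=1)
    return np.minimum(d_prev, d_y).mean()
```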

The problem

$$\text{minimize } \bar{f}_k(y) \ \text{ subject to } \ y \in \mathbb{R}^n \qquad (11)$$

is called the $k$-th auxiliary clustering problem. The objective function $\bar{f}_k$ in Problem (11) can be represented as a difference of two convex functions as follows:

$$\bar{f}_k(y) = \bar{f}_k^1(y) - \bar{f}_k^2(y),$$

where

$$\bar{f}_k^1(y) = \frac{1}{m} \sum_{i=1}^{m} d^i_{k-1} + \frac{1}{m} \sum_{i=1}^{m} \|y - a^i\|^2, \qquad \bar{f}_k^2(y) = \frac{1}{m} \sum_{i=1}^{m} \max\left\{d^i_{k-1}, \|y - a^i\|^2\right\}.$$

It is clear that the function $\bar{f}_k^1$, as a sum of a constant and convex functions (squared Euclidean norms), is convex. The function $\bar{f}_k^2$ is represented as a sum of maxima of constant and convex functions (squared Euclidean norms) and therefore it is also convex. The necessary condition for a minimum in Problem (11) can be formulated as:

$$\partial \bar{f}_k^2(y) \subseteq \partial \bar{f}_k^1(y). \qquad (12)$$

It is obvious that the function $\bar{f}_k^1$ is differentiable at any $y \in \mathbb{R}^n$ and

$$\partial \bar{f}_k^1(y) = \left\{\nabla \bar{f}_k^1(y)\right\}$$

where

$$\nabla \bar{f}_k^1(y) = 2(y - \bar{a}), \quad \bar{a} = \frac{1}{m} \sum_{i=1}^{m} a^i.$$

In order to compute the subdifferential of the function $\bar{f}_k^2$ at $y \in \mathbb{R}^n$ we consider the following three sets:

$$B_1(y) = \left\{a \in A : \|y - a\|^2 < d^a_{k-1}\right\},$$
$$B_2(y) = \left\{a \in A : \|y - a\|^2 = d^a_{k-1}\right\},$$
$$B_3(y) = \left\{a \in A : \|y - a\|^2 > d^a_{k-1}\right\}.$$

It is clear that $B_t(y) \cap B_p(y) = \emptyset$, $t, p = 1, 2, 3$, $t \neq p$ and $A = B_1(y) \cup B_2(y) \cup B_3(y)$. Then the subdifferential $\partial \bar{f}_k^2(y)$ at $y$ can be expressed as:

$$\partial \bar{f}_k^2(y) = \frac{2}{m}\left[\sum_{a \in B_2(y)} \operatorname{conv}\{0, y - a\} + \sum_{a \in B_3(y)} (y - a)\right]. \qquad (13)$$

Since the subdifferential $\partial \bar{f}_k^1(y)$ is a singleton at any $y \in \mathbb{R}^n$, it follows from (12) that the subdifferential $\partial \bar{f}_k^2(y)$ is also a singleton at any stationary point $y$ of Problem (11). (13) implies that $\partial \bar{f}_k^2(y)$ is a singleton in two cases: (i) the set $B_2(y) = \emptyset$; (ii) the set $B_2(y)$ is a singleton containing a point $a \in A$ such that $y = a$. In both cases:

$$\partial \bar{f}_k^2(y) = \left\{\frac{2}{m} \sum_{a \in B_3(y)} (y - a)\right\}. \qquad (14)$$

If we have Case (i) then the $k$-th auxiliary cluster function $\bar{f}_k$ is continuously differentiable at the stationary point $y$. This function is strictly differentiable (see, for example, [12] for the definition of strictly differentiable functions) at $y$ if Case (ii) is satisfied. This means that the Clarke subdifferential $\partial \bar{f}_k$ of the function $\bar{f}_k$ is a singleton at any stationary point $y$ and

$$\partial \bar{f}_k(y) = \left\{\frac{2}{m} \sum_{a \in A \setminus B_3(y)} (y - a)\right\}.$$


Here there are three cases: (i) $B_3(y) = \emptyset$; (ii) $B_3(y) \neq \emptyset$ and $B_3(y) \neq A$; (iii) $B_3(y) = A$. We consider these cases separately.

Case (i). This means that every data point is either closer to the point $y$ than to its cluster center or at the same distance from both. In most data sets this case cannot happen. However, if it happens then from the necessary condition for a minimum we have

$$\sum_{a \in A} (y - a) = 0$$

and therefore

$$y = \frac{1}{m} \sum_{a \in A} a,$$

that is, the point $y$ must be the centroid of the whole set $A$.

Case (ii). In this case the set of data points attracted by the stationary point $y$ is not empty and at the same time this set is not the whole set $A$. This means that the set $A \setminus B_3(y) \neq \emptyset$ and the necessary condition for a minimum implies that

$$y = \frac{1}{|A \setminus B_3(y)|} \sum_{a \in A \setminus B_3(y)} a.$$

Here $|A \setminus B_3(y)|$ stands for the cardinality of the set $A \setminus B_3(y)$. This means that $y$ is the centroid of the set $A \setminus B_3(y)$. Moreover, the centroid $y$ attracts only points from the set $A \setminus B_3(y)$ and all other data points are closer to their cluster centers than to the point $y$.

Case (iii). In this case all data points are closer to their cluster centers than to the point $y$. The function $\bar{f}_k$ achieves its global maximum at such points and the value $\bar{f}_k^{\max}$ of the global maximum is:

$$\bar{f}_k^{\max} = \frac{1}{m} \sum_{i=1}^{m} d^i_{k-1}.$$

Such points can also be considered as local minimizers of the function $\bar{f}_k$; however, these points are not good candidates for cluster centers as they do not provide a decrease of the auxiliary cluster function in comparison with the optimal value of the cluster function for the $(k-1)$-clustering problem.

5 Computation of starting points

Given the solution $(x^1, \ldots, x^{l-1}) \in \mathbb{R}^{(l-1)n}$, $l \geq 2$ to the $(l-1)$-partition problem, consider the following two sets:

$$S_1 = \left\{y \in \mathbb{R}^n : \|y - a^i\|^2 \geq d^i_{l-1}, \ \forall i \in \{1, \ldots, m\}\right\},$$
$$S_2 = \left\{y \in \mathbb{R}^n : \exists i \in \{1, \ldots, m\} \text{ such that } \|y - a^i\|^2 < d^i_{l-1}\right\}.$$

It is obvious that the cluster centers $x^1, \ldots, x^{l-1} \in S_1$. The set $S_2$ contains all points $y \in \mathbb{R}^n$ which attract at least one point from $A$. Since the number $l$ of clusters is less than the number of data points in the set $A$, all data points which are not cluster centers belong to the set $S_2$ (because such points attract at least themselves) and therefore this set is not empty. Note that $S_1 \cap S_2 = \emptyset$ and $S_1 \cup S_2 = \mathbb{R}^n$. It is clear that

$$\bar{f}_l(y) \leq \frac{1}{m} \sum_{i=1}^{m} d^i_{l-1}, \quad \forall y \in \mathbb{R}^n$$

and

$$\bar{f}_l(y) = f_{l-1}(x^1, \ldots, x^{l-1}) = \frac{1}{m} \sum_{i=1}^{m} d^i_{l-1}, \quad \forall y \in S_1,$$

that is, the $l$-th auxiliary cluster function is constant on the set $S_1$ and any point from this set is a global maximizer of this function. In general, a local search method will terminate at any of these points. Therefore, starting points for solving Problems (6) and (11) should not be chosen from the set $S_1$. In this section, we design a special procedure which allows one to select starting points from the set $S_2$.

Take any $y \in S_2$. Then one can divide the set $A$ into two subsets as follows:

$$\bar{B}_1(y) = \left\{a \in A : \|y - a\|^2 \geq d^a_{l-1}\right\},$$
$$\bar{B}_2(y) = \left\{a \in A : \|y - a\|^2 < d^a_{l-1}\right\}.$$

Notice that $\bar{B}_1(y) = B_2(y) \cup B_3(y)$ and $\bar{B}_2(y) = B_1(y)$. The set $\bar{B}_2(y)$ contains all data points $a \in A$ which are closer to the point $y$ than to their cluster centers, and the set $\bar{B}_1(y)$ contains all other data points. Since $y \in S_2$ the set $\bar{B}_2(y) \neq \emptyset$. Furthermore, $\bar{B}_1(y) \cap \bar{B}_2(y) = \emptyset$ and $A = \bar{B}_1(y) \cup \bar{B}_2(y)$. Then

$$\bar{f}_l(y) = \frac{1}{m}\left(\sum_{a \in \bar{B}_1(y)} d^a_{l-1} + \sum_{a \in \bar{B}_2(y)} \|y - a\|^2\right).$$

The difference $z_l(y)$ between the value of the $l$-th auxiliary cluster function at $y$ and the value $f_{l-1}(x^1, \ldots, x^{l-1})$ for the $(l-1)$-clustering problem is:

$$z_l(y) = \frac{1}{m} \sum_{a \in \bar{B}_2(y)} \left(d^a_{l-1} - \|y - a\|^2\right)$$

which can be rewritten as

$$z_l(y) = \frac{1}{m} \sum_{a \in A} \max\left\{0, d^a_{l-1} - \|y - a\|^2\right\}. \qquad (15)$$

The difference $z_l(y)$ shows the decrease of the value of the $l$-th cluster function $f_l$, in comparison with the value $f_{l-1}(x^1, \ldots, x^{l-1})$, if the collection $(x^1, \ldots, x^{l-1}, y)$ is chosen as the set of cluster centers for the $l$-clustering problem.
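Since the distances $d^a_{l-1}$ depend only on the previous centers, they can be computed once and reused, after which (15) is a single pass over the data. A short illustrative sketch (not the authors' code):

```python
import numpy as np

def decrease_z(y, d_prev, data):
    """z_l(y) in (15): the decrease of the auxiliary cluster function value
    at y relative to f_{l-1}, where d_prev[i] = d_{l-1}^i from (9)."""
    d_y = ((data - y) ** 2).sum(axis=1)
    return np.maximum(0.0, d_prev - d_y).mean()
```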

If a data point $a \in A$ is a cluster center then it belongs to the set $S_1$; otherwise it belongs to the set $S_2$. Therefore we choose a point $y$ from the set $A \setminus S_1$. We take any $y = a \in A \setminus S_1$, compute $z_l(a)$ and introduce the following number:

$$z_{\max}^1 = \max_{a \in A \setminus S_1} z_l(a). \qquad (16)$$

Let $\gamma_1 \in [0,1]$ be a given number. We compute the following subset of $A$:

$$A_1 = \left\{a \in A \setminus S_1 : z_l(a) \geq \gamma_1 z_{\max}^1\right\}. \qquad (17)$$

If $\gamma_1 = 0$ then $A_1 = A \setminus S_1$, and if $\gamma_1 = 1$ then the set $A_1$ contains data points with the largest decrease $z_{\max}^1$ (the global k-means algorithm from [24] uses such data points as the starting point for the $l$-th cluster center).

For each $a \in A_1$ we compute the set $\bar{B}_2(a)$ and its center $c(a)$. We replace the point $a \in A_1$ by the point $c(a)$ because the latter is a better representative of the set $\bar{B}_2(a)$ than the former. Denote by $A_2$ the set of all such centers. For each $c \in A_2$ we compute the number $z_l^2(c) = z_l(c)$ using (15). Finally, we compute the following number:

$$z_{\max}^2 = \max_{c \in A_2} z_l^2(c). \qquad (18)$$

The number $z_{\max}^2$ represents the largest decrease of the values $f_l(x^1, \ldots, x^{l-1}, c)$ among all centers $c \in A_2$ in comparison with the value $f_{l-1}(x^1, \ldots, x^{l-1})$. Let $\gamma_2 \in [0,1]$ be a given number. We define the following subset of $A_2$:

$$A_3 = \left\{c \in A_2 : z_l^2(c) \geq \gamma_2 z_{\max}^2\right\}. \qquad (19)$$

If $\gamma_2 = 0$ then $A_3 = A_2$, and if $\gamma_2 = 1$ then the set $A_3$ contains only centers $c$ with the largest decrease of the cluster function $f_l$. If we take $\gamma_1 = 0$ and $\gamma_2 = 1$ then we get the scheme for finding starting points used in the modified global k-means algorithm [3].

All points from the set $A_3$ are considered as starting points for solving the auxiliary clustering problem (11). This problem is nonconvex and it is important to find good starting points if one applies a local method for its solution. We use all data points for the computation of the set $A_3$ and therefore this set contains starting points from different parts of the data set. Such a strategy allows one to find either global or near global solutions to Problem (6) (as well as to Problem (11)) using only local methods.

Applying the k-means algorithm and using starting points from $A_3$ we get a set of local minimizers of Problem (11). Since the k-means algorithm starting from different points can arrive at the same solution, the number of local minimizers found is no greater than the cardinality of the set $A_3$. We denote by $A_4$ the set of local minimizers of Problem (11) obtained using the k-means algorithm and starting points from $A_3$. Define

$$\bar{f}_l^{\min} = \min_{y \in A_4} \bar{f}_l(y). \qquad (20)$$

Let $\gamma_3 \in [1, \infty)$ be a given number. Compute the following set:

$$A_5 = \left\{y \in A_4 : \bar{f}_l(y) \leq \gamma_3 \bar{f}_l^{\min}\right\}. \qquad (21)$$

If $\gamma_3 = 1$ then $A_5$ contains the best local minimizers of the function $\bar{f}_l$ obtained using starting points from the set $A_3$. If $\gamma_3$ is sufficiently large then $A_5 = A_4$.

Thus, an algorithm for finding starting points to solve Problem (6) can be summarized as follows.

Algorithm 1 Algorithm for finding the set of starting points.

Input: The solution $(x^1, \ldots, x^{l-1})$ to the $(l-1)$-clustering problem, $l \geq 2$.
Output: The set of starting points for the $l$-th cluster center.

Step 0. (Initialization). Select $\gamma_1, \gamma_2 \in [0,1]$ and $\gamma_3 \in [1, \infty)$.

Step 1. Compute $z_{\max}^1$ using (16) and the set $A_1$ using (17).

Step 2. Compute $z_{\max}^2$ using (18) and the set $A_3$ using (19).

Step 3. Compute the set $A_4$ of local minimizers of the auxiliary clustering problem (11) using the k-means algorithm and starting points from the set $A_3$.

Step 4. Compute $\bar{f}_l^{\min}$ using (20) and the set $A_5$ using (21). $A_5$ is the set of starting points for the $l$-th cluster center.
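The following is a compact sketch of Algorithm 1 under the formulas above. It is a simplified reading (for instance, membership in $A \setminus S_1$ is detected via $z_l(a) > 0$, and the auxiliary problem (11) is solved with the reduced k-means iteration described in Section 6, where only the new center moves); all function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sq_dist(points, centers):
    """Pairwise squared Euclidean distances, shape (len(points), len(centers))."""
    return ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

def aux_kmeans(y, d_prev, data, max_iter=100):
    """Reduced k-means for Problem (11): the previous centers are fixed and
    only the candidate center y is updated (it moves to the centroid of the
    points it attracts, cf. Case (ii) in Section 4.2)."""
    for _ in range(max_iter):
        attracted = ((data - y) ** 2).sum(axis=1) < d_prev
        if not attracted.any():
            break
        new_y = data[attracted].mean(axis=0)
        if np.allclose(new_y, y):
            break
        y = new_y
    return y

def find_starting_points(prev_centers, data, gamma1, gamma2, gamma3):
    """A sketch of Algorithm 1: returns the candidate set A5 of starting
    points for the l-th cluster center."""
    d_prev = sq_dist(data, prev_centers).min(axis=1)                         # d_{l-1}^i, eq. (9)
    z = np.maximum(0.0, d_prev[None, :] - sq_dist(data, data)).mean(axis=1)  # z_l(a), eq. (15)
    # Step 1: candidates in A \ S1 with a large enough decrease, eq. (16)-(17)
    cand = data[(z > 0) & (z >= gamma1 * z.max())]
    # Step 2: replace each candidate by the centroid of the points it attracts,
    # then keep the centroids with a large enough decrease, eq. (18)-(19)
    centers2 = np.array([data[((data - a) ** 2).sum(axis=1) < d_prev].mean(axis=0)
                         for a in cand])
    z2 = np.array([np.maximum(0.0, d_prev - ((data - c) ** 2).sum(axis=1)).mean()
                   for c in centers2])
    A3 = centers2[z2 >= gamma2 * z2.max()]
    # Step 3: local minimizers of the auxiliary problem (11) from each point of A3
    A4 = np.array([aux_kmeans(c, d_prev, data) for c in A3])
    f_vals = np.array([np.minimum(d_prev, ((data - y) ** 2).sum(axis=1)).mean() for y in A4])
    # Step 4: keep the minimizers whose value is within gamma3 of the best, eq. (20)-(21)
    return A4[f_vals <= gamma3 * f_vals.min()]
```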

6 An incremental clustering algorithm and its implementation

In this section we present an incremental algorithm for solving Problem (6).

Algorithm 2 An incremental clustering algorithm.

Input: The data set $A = \{a^1, \ldots, a^m\}$.
Output: The set of $k$ cluster centers $\{x^1, \ldots, x^k\}$.

Step 1. (Initialization). Compute the center $x^1 \in \mathbb{R}^n$ of the set $A$. Set $l := 1$.

Step 2. (Stopping criterion). Set $l := l + 1$. If $l > k$ then stop. The $k$-partition problem has been solved.

Step 3. (Computation of a set of starting points for the next cluster center). Apply Algorithm 1 to compute the set $A_5$ defined by (21).

Step 4. (Computation of a set of cluster centers). For each $y \in A_5$ apply the k-means algorithm starting from the point $(x^1, \ldots, x^{l-1}, y)$ to solve Problem (6) and find a solution $(\bar{y}^1, \ldots, \bar{y}^l)$. Denote by $A_6$ the set of all such solutions.

Step 5. (Computation of the best solution). Compute

$$f_l^{\min} = \min\left\{f_l(\bar{y}^1, \ldots, \bar{y}^l) : (\bar{y}^1, \ldots, \bar{y}^l) \in A_6\right\}$$

and the collection of cluster centers $(\bar{y}^1, \ldots, \bar{y}^l)$ such that

$$f_l(\bar{y}^1, \ldots, \bar{y}^l) = f_l^{\min}.$$

Step 6. (Solution to the $l$-partition problem). Set $x^j := \bar{y}^j$, $j = 1, \ldots, l$ as the solution to the $l$-th partition problem and go to Step 2.

One can see that, in addition to the $k$-partition problem, Algorithm 2 also solves all intermediate $l$-partition problems, $l = 1, \ldots, k-1$. Steps 3 and 4 are the most time-consuming steps of this algorithm. The success of the algorithm heavily depends on Step 3.
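A compact sketch of Algorithm 2, assuming the kmeans, cluster_function and find_starting_points sketches given earlier are in scope; it is illustrative only and omits the accelerations discussed below.

```python
import numpy as np

def incremental_clustering(data, k, gamma1, gamma2, gamma3):
    """A sketch of Algorithm 2. Relies on the kmeans, cluster_function and
    find_starting_points sketches defined earlier (assumed helpers)."""
    centers = data.mean(axis=0)[None, :]          # Step 1: centroid of A
    for l in range(2, k + 1):                     # Steps 2-6
        # Step 3: candidate starting points for the l-th center
        A5 = find_starting_points(centers, data, gamma1, gamma2, gamma3)
        best_val, best_centers = np.inf, None
        for y in A5:
            # Step 4: full k-means started from (x^1, ..., x^{l-1}, y)
            start = np.vstack([centers, y[None, :]])
            cand_centers, _ = kmeans(data, start)
            val = cluster_function(cand_centers, data)
            # Step 5: keep the best l-partition found
            if val < best_val:
                best_val, best_centers = val, cand_centers
        centers = best_centers                    # Step 6
    return centers
```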


Implementation of Algorithm 2. In order to implement this algorithm one should select the values of three parameters: $\gamma_1, \gamma_2 \in [0,1]$ and $\gamma_3 \in [1, \infty)$. Small values of $\gamma_1$ allow most of the data points to be included in the set $A_1$, which makes the algorithm time consuming in large data sets. This is not the case for small data sets. Moreover, in general, points in small data sets are not dense and therefore in such data sets both parameters $\gamma_1$ and $\gamma_2$ can be chosen sufficiently small. In general, points in large data sets are dense and therefore it is not necessary to choose $\gamma_1$ and $\gamma_2$ small in such data sets. The choice of small values for $\gamma_1$ and $\gamma_2$ in large data sets would increase the CPU time considerably without any significant improvement of the cluster function value and consequently of the cluster structure of a data set.

The choice of $\gamma_3$ is not as important as the choice of the other two parameters $\gamma_1$ and $\gamma_2$. Results show that as the number of clusters increases the difference between the values of the cluster and auxiliary cluster problems becomes smaller. This means that it is not necessary to select $\gamma_3$ large.

In our implementation we select $\gamma_1 = 0.3$, $\gamma_2 = 0.3$ and $\gamma_3 = 3$ for small data sets (with the number of data points $m \leq 200$); $\gamma_1 = 0.5$, $\gamma_2 = 0.8/0.9$ and $\gamma_3 = 1.5$ for medium size data sets ($200 < m \leq 6000$); and $\gamma_1 = 0.85/0.9$, $\gamma_2 = 0.99$ and $\gamma_3 = 1.1$ for large data sets (with $m > 6000$).
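The parameter choice above can be encoded, for example, as follows (a sketch; where the text gives two options such as 0.8/0.9, the smaller value is taken here).

```python
def choose_gammas(m):
    """Return (gamma1, gamma2, gamma3) as a function of the number of data
    points m, following the experimental settings described in the text."""
    if m <= 200:                 # small data sets
        return 0.3, 0.3, 3.0
    if m <= 6000:                # medium size data sets
        return 0.5, 0.8, 1.5
    return 0.85, 0.99, 1.1       # large data sets
```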

In order to reduce the computational effort required by Algorithm 2 in large data sets we remove from the set $A_1$ all points which are close to previous cluster centers. Such points are not good candidates to be starting points for cluster centers. Since data points in large data sets are, in general, dense, such an approach allows one to remove a large number of points from $A_1$. This approach was also used in [7] to accelerate convergence of the modified global k-means algorithm.

We apply the reduced version of the k-means algorithm to solve the auxiliary clustering problem (11) in Step 3. In this algorithm the first $l-1$ cluster centers are fixed and only the $l$-th cluster center is updated (see, for details, [3, 7]). In order to solve the $l$-th clustering problem (6) in Step 4 we apply the full version of the k-means algorithm.

7 Results of numerical experiments

To verify the efficiency of Algorithm 2 (MS-MGKM, the multi-start modified global k-means algorithm) numerical experiments with a number of real-world data sets have been carried out on a PC with an Intel Core 2 Duo 2.50 GHz CPU and 3 GB of RAM. 16 data sets have been used in the numerical experiments. A brief description of the data sets is given in Table 1. The detailed description of the German towns and Bavaria postal data sets can be found in [30], of Fisher's Iris Plant data set in [16], of the TSPLIB1060 and TSPLIB3038 data sets in [26] and of all other data sets in [9].

For comparison we also include results by the global k-means (GKM), the modified global k-means (MGKM) and the multi-start k-means (MS-KM) algorithms. All algorithms were implemented in Lahey Fortran 95. For each data set we allowed the MS-KM algorithm to run twice the CPU time used by the MS-MGKM algorithm; however, we restricted the number of initial solutions to 5000. This number is chosen to allow the MS-KM algorithm to run as many times as possible within the given time frame. Due to the time restriction the MS-KM algorithm does not use all starting points in medium size and large data sets. Furthermore, due to this restriction the number of starting points used by the MS-KM algorithm in large data sets is only several hundred, which is significantly less than that for medium size data sets. We computed up to 10 clusters in data sets with no more than 150 instances, up to 50 clusters in data sets with the number of instances between 150 and 1000, and up to 100 clusters in data sets with more than 1000 instances.


Table 1: The brief description of data sets

Data set                        Number of instances   Number of attributes
German towns                    59                    2
Bavaria postal 1                89                    3
Bavaria postal 2                89                    4
Fisher's Iris Plant             150                   4
Heart Disease                   297                   13
Liver Disorders                 345                   6
Ionosphere                      351                   34
Congressional Voting Records    435                   16
Breast Cancer                   683                   9
Pima Indians Diabetes           768                   8
TSPLIB1060                      1060                  2
Image Segmentation              2310                  19
TSPLIB3038                      3038                  2
Page Blocks                     5473                  10
Pendigit                        10992                 16
Letters                         20000                 16

Results of numerical experiments are presented in Tables 2-5. In these tables we use the following notation:

• $k$ is the number of clusters;

• $f_{best}$ is the best known value of the cluster function (7) (multiplied by $m$) for the corresponding number of clusters;

• $E$ is the error in % which is calculated as follows:

$$E = \frac{f - f_{best}}{f_{best}} \times 100\%,$$

where $f$ is the value of the clustering function obtained by an algorithm;

• $N$ is the number of squared Euclidean norm evaluations for the computation of the corresponding number of clusters. To avoid large numbers in tables we use its expression in the form $N = \alpha \times 10^l$ and present the values of $\alpha$ in the tables. $l = 4$ for the German towns, Bavaria postal 1 and 2 and Iris Plant data sets, $l = 5$ for the Heart Disease, Liver Disorders, Ionosphere and Congressional Voting Records data sets, $l = 6$ for the Breast Cancer, Pima Indians Diabetes, TSPLIB1060 and Image Segmentation data sets, $l = 7$ for the TSPLIB3038 and Page Blocks data sets, and $l = 8$ for the Pendigit and Letters data sets;

• $t$ is the CPU time (in seconds);

• red color is used to show cases when Algorithm 2 found the new best solution.

Since the CPU time used by the MS-KM algorithm is twice that used by the MS-MGKM algorithm, in the tables we report only the error $E$ for the MS-KM. For all data sets the number of squared Euclidean norm evaluations required by the MS-KM algorithm is significantly larger than that required by the MS-MGKM algorithm.

Results for small data sets are given in Table 2. These results clearly demonstrate that for small data sets the proposed algorithm finds either better or very similar solutions in comparison with the other three algorithms. In the German towns, Bavaria postal 2 and Iris Plant data sets the proposed algorithm outperforms the GKM and MGKM. Moreover, the MS-MGKM found 4 new best solutions in these data sets. On the other hand, the proposed algorithm requires significantly more norm evaluations and CPU time than the GKM and MGKM algorithms. Results for the MS-KM algorithm show that for small data sets this algorithm can be considered as an alternative to the other algorithms. It failed only once to find global or near global solutions.

Results for data sets containing no more than 500 instances are given in Table 3. We can see that the MS-MGKM algorithm found new best solutions in 25 cases out of 36. In all other cases this algorithm finds either the best known solution or a solution which is only slightly different from the best known one. Again these results demonstrate that the proposed algorithm is (sometimes significantly) more accurate than the other three algorithms. However, it requires significantly more computational effort (both the number of distance calculations and CPU time) than the GKM and MGKM algorithms. The MS-KM algorithm is efficient for solving clustering problems in these data sets with $k \leq 5$ (this number is 10 for the Heart Disease data set). One can see that in all data sets the results of this algorithm deteriorate as the number of clusters increases, and the MS-KM is not an efficient algorithm for solving the clustering problem in these data sets when $k > 5$.

Table 4 presents results for data sets containing from 500 to 2500 instances. The proposed algorithm found new best solutions in 22 cases out of 36. In all other cases (except the case $k = 50$ for the TSPLIB1060 data set) this algorithm finds either the best known solution or a solution very close to it. The new algorithm considerably outperforms the other three algorithms in the sense of accuracy, although it requires more computational effort than the GKM and MGKM algorithms. The MS-KM algorithm can find global solutions to clustering problems in these data sets for $k \leq 10$. We can see that in all data sets the results of this algorithm deteriorate as the number of clusters increases, which means that the MS-KM cannot be considered as an alternative to the other three algorithms in these data sets.

Finally, results for data sets containing more than 3000 instances are presented in Table 5. In these data sets the proposed algorithm found new best solutions to the clustering problems in 18 cases out of 36. In all other 18 cases it finds either the best solution or a solution close to it. One can also notice that for the two largest data sets (Pendigit and Letters) the proposed algorithm requires less computational effort than the GKM and MGKM algorithms. The MS-KM algorithm failed to solve clustering problems in the Page Blocks data set. This algorithm was quite successful in the Pendigit and Letters data sets, which is due to the fact that data points in both data sets are dense. Results for these and also for medium-size data sets demonstrate that the MS-KM is not an efficient algorithm for solving clustering problems even in moderately large data sets with a relatively large number of clusters.


8 Conclusions

In this paper, we developed a new algorithm for solving the minimum sum-of-squares clustering problem. The clustering problem is formulated as a global optimization problem and a special algorithm was designed to generate good starting points for its solution. In order to do so we used the auxiliary cluster problem. The proposed algorithm is an incremental algorithm. We presented results of numerical experiments on 16 real-world data sets with different numbers of instances. These results demonstrate that the proposed algorithm is able to find either global or near global solutions to clustering problems and that it is more accurate than the other two state-of-the-art clustering algorithms: the global k-means and the modified global k-means algorithms. However, in small and medium size data sets the proposed algorithm requires significantly more computational effort than the other two algorithms. At the same time, the difference in the number of distance evaluations and CPU time required by the proposed and the other two algorithms decreases as the size of a data set increases. This means that the proposed algorithm is efficient for globally solving clustering problems in large data sets with high accuracy.

Acknowledgements

This paper was written during the visit of Dr. Burak Ordin to the University of Ballarat, supported by The Scientific and Technological Research Council of Turkey (TUBITAK). The research by Adil M. Bagirov was supported under the Australian Research Council's Discovery Projects funding scheme (Project No. DP140103213).

References

[1] Al-Sultan, K.S.: A tabu search approach to the clustering problem. Pattern Recognition. 28(9), 1443–1451 (1995)

[2] Al-Sultan, K.S., Khan, M.M.: Computational experience on four algorithms for the hard clustering problem. Pattern Recognition Letters. 17, 295–308 (1996)

[3] Bagirov, A.M.: Modified global k-means algorithm for sum-of-squares clustering problems. Pattern Recognition. 41(10), 3192–3199 (2008)

[4] Bagirov, A.M., Rubinov, A.M., Yearwood, J.: A global optimisation approach to classification. Optimization and Engineering. 3(2), 129–155 (2002)

[5] Bagirov, A.M., Rubinov, A.M., Soukhoroukova, N.V., Yearwood, J.: Supervised and unsupervised data classification via nonsmooth and global optimization. TOP: Spanish Operations Research Journal. 11(1), 1–93 (2003)

[6] Bagirov, A.M., Ugon, J.: An algorithm for minimizing clustering functions. Optimization. 54(4-5), 351–368 (2005)

[7] Bagirov, A.M., Ugon, J., Webb, D.: Fast modified global k-means algorithm for sum-of-squares clustering problems. Pattern Recognition. 44, 866–876 (2011)

[8] Bagirov, A.M., Yearwood, J.: A new nonsmooth optimization algorithm for minimum sum-of-squares clustering problems. European Journal of Operational Research. 170, 578–596 (2006)

[9] Blake, C., Keogh, E., Merz, C.J.: UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science (1998)

[10] Bock, H.H.: Clustering and neural networks. In: Rizzi, A., Vichi, M., Bock, H.H. (eds.) Advances in Data Science and Classification. pp. 265–277. Springer-Verlag, Berlin (1998)

[11] Brown, D.E., Huntley, C.L.: A practical application of simulated annealing to the clustering problem. Pattern Recognition. 25, 401–412 (1992)

[12] Clarke, F.H.: Optimization and Nonsmooth Analysis. Wiley-Interscience, New York (1983)

[13] Demyanov, V.F., Bagirov, A.M., Rubinov, A.M.: A method of truncated codifferential with application to some problems of cluster analysis. Journal of Global Optimization. 23(1), 63–80 (2002)

[14] Diehr, G.: Evaluation of a branch and bound algorithm for clustering. SIAM J. Scientific and Statistical Computing. 6, 268–284 (1985)

[15] Dubes, R., Jain, A.K.: Clustering techniques: the user's dilemma. Pattern Recognition. 8, 247–260 (1976)

[16] Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugenics. VII, part II, 179–188 (1936). Reprinted in: Fisher, R.A.: Contributions to Mathematical Statistics. Wiley (1950)

[17] Hanjoul, P., Peeters, D.: A comparison of two dual-based procedures for solving the p-median problem. European Journal of Operational Research. 20, 387–396 (1985)

[18] Hansen, P., Jaumard, B.: Cluster analysis and mathematical programming. Mathematical Programming. 79(1-3), 191–215 (1997)

[19] Hansen, P., Mladenovic, N.: J-means: a new heuristic for minimum sum-of-squares clustering. Pattern Recognition. 34, 405–413 (2001)

[20] Hansen, P., Mladenovic, N.: Variable neighborhood decomposition search. Journal of Heuristics. 7, 335–350 (2001)

[21] Hansen, P., Ngai, E., Cheung, B.K., Mladenovic, N.: Analysis of global k-means, an incremental heuristic for minimum sum-of-squares clustering. Journal of Classification. 22(2), 287–310 (2005)

[22] Koontz, W.L.G., Narendra, P.M., Fukunaga, K.: A branch and bound clustering algorithm. IEEE Transactions on Computers. 24, 908–915 (1975)

[23] Lai, J.Z.C., Huang, T.-J.: Fast global k-means clustering using cluster membership and inequality. Pattern Recognition. 43(3), 731–737 (2010)

[24] Likas, A., Vlassis, M., Verbeek, J.: The global k-means clustering algorithm. Pattern Recognition. 36, 451–461 (2003)

[25] Merle, O. du, Hansen, P., Jaumard, B., Mladenovic, N.: An interior point method for minimum sum-of-squares clustering. SIAM J. on Scientific Computing. 21, 1485–1505 (2001)

[26] Reinelt, G.: TSPLIB - A Traveling Salesman Library. ORSA J. Comput. 3, 319–350 (1991)

[27] Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)

[28] Selim, S.Z., Al-Sultan, K.S.: A simulated annealing algorithm for the clustering problem. Pattern Recognition. 24(10), 1003–1008 (1991)

[29] Sherali, H.D., Desai, J.: A global optimization RLT-based approach for solving the hard clustering problem. Journal of Global Optimization. 32, 281–306 (2005)

[30] Spath, H.: Cluster Analysis Algorithms. Ellis Horwood Limited, Chichester (1980)

[31] Sun, L.X., Xie, Y.L., Song, X.H., Wang, J.H., Yu, R.Q.: Cluster analysis by simulated annealing. Computers and Chemistry. 18, 103–108 (1994)

[32] Tan, M.P., Broach, J.R., Floudas, C.A.: A novel clustering approach and prediction of optimal number of clusters: global optimum search with enhanced positioning. Journal of Global Optimization. 39(3), 323–346 (2007)

[33] Xavier, A.E.: The hyperbolic smoothing clustering method. Pattern Recognition. 43(3), 731–737 (2010)

[34] Xavier, A.E., Xavier, V.L.: Solving the minimum sum-of-squares clustering problem by hyperbolic smoothing and partition into boundary and gravitational regions. Pattern Recognition. 44(1), 70–77 (2011)


Table 2: Results for small data sets

k | fbest | MS-MGKM: E, α, t | GKM: E, α, t | MGKM: E, α, t | MS-KM: E

German towns
2  | 0.12143·10^6 | 0.00  1.84   0.02? 0.00 | 0.00  0.23  0.00 | 0.00  0.59  0.00 | 0.00
3  | 0.77009·10^5 | 0.00  4.21   0.02 | 1.45  0.29  0.00 | 1.45  0.99  0.00 | 0.00
4  | 0.49601·10^5 | 0.24  5.72   0.02 | 0.72  0.37  0.00 | 0.72  1.43  0.00 | 0.00
5  | 0.38716·10^5 | 0.00  8.75   0.02 | 0.00  0.49  0.00 | 0.00  1.91  0.00 | 0.00
6  | 0.30536·10^5 | 0.00  11.02  0.02 | 0.00  0.60  0.00 | 0.27  2.35  0.00 | 0.00
7  | 0.24433·10^5 | 0.08  12.51  0.02 | 0.09  0.73  0.00 | 0.00  2.80  0.00 | 0.00
8  | 0.21631·10^5 | 0.00  15.76  0.02 | 0.64  0.83  0.00 | 0.54  3.26  0.00 | 0.00
9  | 0.18550·10^5 | 2.13  18.99  0.02 | 2.13  1.00  0.00 | 4.46  3.73  0.00 | 0.00
10 | 0.16307·10^5 | 1.81  23.37  0.02 | 1.81  1.12  0.00 | 1.52  4.27  0.00 | 0.00

Bavaria postal 1
2  | 0.60255·10^12 | 0.00  1.20   0.00 | 7.75  0.45  0.00 | 0.00  1.26   0.00 | 0.00
3  | 0.29451·10^12 | 0.00  3.34   0.00 | 0.00  0.51  0.00 | 0.00  2.13   0.00 | 0.00
4  | 0.10447·10^12 | 0.00  5.57   0.00 | 0.00  0.73  0.00 | 0.00  3.16   0.00 | 0.00
5  | 0.59762·10^11 | 0.00  8.36   0.00 | 0.00  1.05  0.00 | 0.00  4.29   0.00 | 0.00
6  | 0.35909·10^11 | 0.00  11.53  0.00 | 0.00  1.17  0.00 | 0.00  5.22   0.00 | 0.00
7  | 0.21983·10^11 | 0.61  16.45  0.00 | 1.50  1.55  0.00 | 1.50  6.41   0.02 | 0.61
8  | 0.13385·10^11 | 0.00  20.67  0.02 | 0.00  1.98  0.00 | 0.00  7.65   0.02 | 0.00
9  | 0.84237·10^10 | 0.00  22.62  0.02 | 0.00  2.15  0.00 | 0.00  8.71   0.02 | 0.00
10 | 0.64465·10^10 | 0.00  28.21  0.02 | 0.00  2.87  0.00 | 0.00  10.20  0.02 | 0.00

Bavaria postal 2
2  | 0.48631·10^11 | 0.00  1.38   0.00 | 7.32  0.45  0.00 | 7.32  1.25  0.00 | 0.00
3  | 0.17399·10^11 | 0.00  3.19   0.00 | 0.00  0.51  0.00 | 0.00  2.13  0.00 | 0.00
4  | 0.75591·10^10 | 0.00  6.05   0.00 | 0.00  0.66  0.00 | 0.00  3.12  0.00 | 0.00
5  | 0.53429·10^10 | 0.00  9.65   0.00 | 1.86  0.80  0.00 | 1.86  4.08  0.00 | 0.00
6  | 0.31876·10^10 | 0.00  13.41  0.00 | 1.21  0.92  0.00 | 1.21  5.00  0.02 | 0.00
7  | 0.22159·10^10 | 0.00  17.99  0.00 | 0.51  1.49  0.00 | 0.51  6.32  0.02 | 0.00
8  | 0.17045·10^10 | 0.18  23.28  0.02 | 0.73  1.71  0.00 | 0.73  7.35  0.02 | 0.00
9  | 0.14030·10^10 | 1.07  27.70  0.02 | 0.00  2.12  0.00 | 0.00  8.41  0.02 | 0.26
10 | 0.11841·10^10 | 0.00  31.91  0.02 | 0.73  2.31  0.00 | 0.73  9.41  0.02 | 4.05

Iris Plant
2  | 152.348 | 0.00  0.45   0.02 | 0.00  0.13  0.00 | 0.00  0.36  0.00 | 0.00
3  | 78.851  | 0.00  1.29   0.02 | 0.01  0.18  0.00 | 0.01  0.63  0.00 | 0.00
4  | 57.228  | 0.05  2.69   0.02 | 0.05  0.22  0.00 | 0.05  0.91  0.00 | 0.00
5  | 46.446  | 0.06  3.89   0.03 | 0.54  0.25  0.02 | 0.54  1.17  0.02 | 0.00
6  | 39.040  | 0.07  5.36   0.03 | 1.44  0.28  0.02 | 1.44  1.44  0.02 | 0.00
7  | 34.298  | 0.01  7.74   0.05 | 3.17  0.31  0.02 | 3.17  1.70  0.02 | 0.00
8  | 29.989  | 0.25  8.92   0.06 | 1.71  0.39  0.02 | 1.71  1.99  0.02 | 0.00
9  | 27.786  | 0.28  12.25  0.08 | 2.85  0.42  0.02 | 2.85  2.24  0.02 | 0.00
10 | 25.834  | 0.07  16.30  0.09 | 3.55  0.45  0.02 | 3.55  2.50  0.02 | 0.00


Table 3: Results for medium size data sets

k | fbest | MS-MGKM: E, α, t | GKM: E, α, t | MGKM: E, α, t | MS-KM: E

Heart Disease
2  | 0.59890·10^6 | 0.00  0.41    0.05 | 0.00  0.05  0.02 | 0.00  0.14  0.02 | 0.00
5  | 0.32797·10^6 | 0.00  3.01    0.27 | 0.52  0.07  0.02 | 0.52  0.43  0.05 | 0.00
10 | 0.20065·10^6 | 0.00  15.26   0.94 | 0.75  0.16  0.03 | 2.70  0.93  0.09 | 0.00
15 | 0.14765·10^6 | 0.00  26.51   1.47 | 0.04  0.27  0.06 | 0.72  1.46  0.17 | 0.30
20 | 0.11778·10^6 | 0.39  37.16   1.94 | 0.00  0.40  0.09 | 1.34  2.04  0.23 | 1.63
25 | 0.99292·10^5 | 0.00  49.62   2.47 | 3.35  0.55  0.11 | 2.86  2.59  0.33 | 3.56
30 | 0.86216·10^5 | 0.00  65.15   3.08 | 2.99  0.68  0.14 | 3.31  3.15  0.44 | 4.66
40 | 0.67701·10^5 | 0.00  89.89   4.05 | 3.13  0.97  0.20 | 1.39  4.35  0.69 | 10.99
50 | 0.54878·10^5 | 0.00  120.18  5.20 | 3.95  1.32  0.27 | 1.85  5.54  1.03 | 10.99

Liver Disorders
2  | 0.42398·10^6 | 0.00  0.30    0.03 | 93.96  0.06  0.00 | 93.96  0.06  0.00 | 0.00
5  | 0.21826·10^6 | 0.01  1.38    0.13 | 0.08   0.10  0.03 | 0.08   0.58  0.03 | 0.00
10 | 0.11940·10^6 | 0.00  13.20   0.63 | 6.93   0.20  0.05 | 6.96   1.27  0.08 | 7.15
15 | 0.97405·10^5 | 0.00  28.94   1.20 | 1.69   0.34  0.08 | 0.06   2.03  0.13 | 0.57
20 | 0.81192·10^5 | 0.00  45.07   1.77 | 1.07   0.51  0.11 | 0.77   2.75  0.19 | 3.32
25 | 0.69212·10^5 | 0.00  69.90   2.61 | 1.98   0.70  0.13 | 1.74   3.51  0.28 | 6.33
30 | 0.60325·10^5 | 0.00  90.65   3.28 | 1.57   0.88  0.16 | 1.36   4.30  0.39 | 14.53
40 | 0.47336·10^5 | 0.00  134.68  4.67 | 4.68   1.46  0.23 | 1.05   6.04  0.66 | 21.29
50 | 0.38305·10^5 | 0.00  180.10  6.06 | 9.01   1.99  0.28 | 3.33   7.80  0.97 | 33.68

Ionosphere
2  | 0.24194·10^4 | 0.00  0.54   0.08 | 0.00  0.07  0.03 | 0.00  0.19  0.05 | 0.00
5  | 0.18915·10^4 | 0.06  3.21   0.39 | 0.07  0.09  0.05 | 0.18  0.59  0.13 | 0.00
10 | 0.15594·10^4 | 0.00  7.29   0.81 | 2.38  0.14  0.08 | 0.64  1.25  0.27 | 0.51
15 | 0.13901·10^4 | 0.00  13.53  1.39 | 5.16  0.19  0.11 | 0.81  1.93  0.42 | 1.29
20 | 0.12524·10^4 | 0.00  19.00  1.86 | 7.33  0.25  0.13 | 1.52  2.61  0.77 | 4.71
25 | 0.11408·10^4 | 0.00  26.19  2.42 | 7.49  0.34  0.16 | 0.68  3.33  1.39 | 5.69
30 | 0.10430·10^4 | 0.00  35.24  3.09 | 7.77  0.44  0.20 | 0.37  4.06  2.20 | 6.23
40 | 0.85658·10^3 | 0.87  63.06  4.98 | 7.82  0.79  0.30 | 0.00  5.58  4.77 | 15.73
50 | 0.70258·10^3 | 0.28  92.73  6.89 | 6.63  1.11  0.38 | 0.00  7.16  8.39 | 23.58

Congressional Voting Records
2  | 0.16409·10^4 | 0.00  0.77    0.11  | 0.12  0.10  0.02 | 0.12  0.29   0.03 | 0.00
5  | 0.13358·10^4 | 0.00  6.27    0.56  | 1.02  0.16  0.05 | 1.02  0.92   0.11 | 0.00
10 | 0.11233·10^4 | 0.00  19.96   1.50  | 2.04  0.28  0.08 | 0.70  1.97   0.20 | 0.27
15 | 0.99207·10^3 | 0.00  32.28   2.25  | 1.69  0.47  0.13 | 1.87  3.07   0.31 | 1.31
20 | 0.90647·10^3 | 0.00  49.55   3.28  | 2.29  0.63  0.17 | 0.88  4.19   0.44 | 2.54
25 | 0.83975·10^3 | 0.00  68.34   4.36  | 3.31  0.76  0.22 | 1.26  5.30   0.58 | 2.38
30 | 0.77677·10^3 | 0.00  90.83   5.61  | 3.44  1.01  0.27 | 0.69  6.48   0.73 | 3.86
40 | 0.68935·10^3 | 0.00  130.63  7.75  | 4.03  1.52  0.38 | 0.69  8.71   1.16 | 5.36
50 | 0.61750·10^3 | 0.00  177.65  10.19 | 5.53  1.99  0.48 | 1.14  11.10  1.84 | 7.27


Table 4: Results for medium size data sets

k | fbest | MS-MGKM: E, α, t | GKM: E, α, t | MGKM: E, α, t | MS-KM: E

Breast Cancer
2  | 0.19323·10^5 | 0.00  0.10   0.09 | 0.00  0.02  0.05 | 0.00  0.07  0.06 | 0.00
5  | 0.13705·10^5 | 0.00  0.86   0.58 | 2.28  0.03  0.09 | 1.86  0.22  0.17 | 0.00
10 | 0.10205·10^5 | 0.00  2.00   1.06 | 0.11  0.06  0.17 | 0.13  0.46  0.33 | 0.47
15 | 0.87047·10^4 | 0.00  3.71   1.72 | 0.88  0.08  0.23 | 0.92  0.71  0.48 | 2.69
20 | 0.76952·10^4 | 0.00  5.82   2.48 | 2.99  0.11  0.31 | 1.17  0.97  0.66 | 3.30
25 | 0.69464·10^4 | 0.00  8.40   3.38 | 4.45  0.13  0.38 | 0.31  1.24  0.83 | 7.67
30 | 0.63603·10^4 | 0.00  11.09  4.27 | 4.75  0.16  0.45 | 1.28  1.50  0.98 | 8.62
40 | 0.54878·10^4 | 0.00  15.22  5.59 | 6.14  0.22  0.61 | 2.36  2.02  1.39 | 13.14
50 | 0.48123·10^4 | 0.00  18.94  6.75 | 8.05  0.30  0.77 | 3.68  2.56  1.83 | 15.50

Pima Indians Diabetes
2  | 0.51424·10^7 | 0.00  0.09   0.13  | 0.00  0.03  0.06 | 0.00  0.09  0.09 | 0.00
5  | 0.17369·10^7 | 0.03  0.59   0.38  | 0.14  0.04  0.13 | 0.14  0.28  0.22 | 0.00
10 | 0.93046·10^6 | 0.00  1.96   0.91  | 1.86  0.07  0.20 | 1.86  0.66  0.41 | 0.25
15 | 0.69499·10^6 | 0.00  4.10   1.73  | 0.33  0.11  0.30 | 0.36  0.94  0.59 | 0.50
20 | 0.57241·10^6 | 0.00  7.58   2.98  | 0.34  0.15  0.39 | 0.71  1.28  0.80 | 1.55
25 | 0.48869·10^6 | 0.00  13.20  4.97  | 0.37  0.22  0.52 | 0.92  1.63  0.98 | 4.71
30 | 0.43219·10^6 | 0.00  24.04  8.75  | 2.83  0.25  0.59 | 0.98  1.99  1.22 | 4.49
40 | 0.35656·10^6 | 0.00  35.54  12.61 | 1.28  0.40  0.83 | 1.81  2.70  1.70 | 6.11
50 | 0.30903·10^6 | 0.00  43.61  15.17 | 1.98  0.53  1.06 | 1.73  3.41  2.28 | 7.99

TSPLIB1060
2   | 0.98319·10^10 | 0.00  0.14    0.06  | 0.00  0.06  0.08 | 0.00  0.17   0.08 | 0.00
10  | 0.17548·10^10 | 0.22  4.04    1.14  | 0.23  0.15  0.36 | 0.05  1.16   0.34 | 0.00
20  | 0.79179·10^9  | 0.14  13.79   3.19  | 1.88  0.30  0.69 | 1.88  2.43   0.66 | 0.08
30  | 0.48125·10^9  | 0.24  25.79   5.39  | 3.34  0.47  1.03 | 3.37  3.73   0.97 | 0.80
40  | 0.34342·10^9  | 0.00  42.92   8.28  | 4.00  0.65  1.38 | 2.82  5.03   1.30 | 2.73
50  | 0.25551·10^9  | 1.16  54.96   10.20 | 3.10  0.89  1.73 | 2.53  6.42   1.69 | 6.07
60  | 0.19960·10^9  | 0.00  79.88   13.95 | 3.16  1.13  2.08 | 2.42  7.80   2.06 | 5.34
80  | 0.12967·10^9  | 0.00  108.81  18.17 | 4.38  1.72  2.80 | 4.44  10.70  2.89 | 13.33
100 | 0.97019·10^8  | 0.00  148.26  23.61 | 3.60  2.27  3.53 | 3.50  13.50  3.75 | 14.22

Image Segmentation
2   | 0.35606·10^8 | 0.00  0.64    0.52   | 0.00  0.27  1.06  | 0.00  0.80   1.39   | 0.00
10  | 0.97952·10^7 | 0.64  9.22    6.00   | 1.76  0.37  3.97  | 1.76  5.16   6.75   | 0.63
20  | 0.51283·10^7 | 0.77  25.55   14.41  | 0.09  0.64  7.58  | 1.49  10.80  13.11  | 13.24
30  | 0.35074·10^7 | 0.00  55.93   27.98  | 0.07  1.25  11.36 | 0.07  16.70  20.89  | 15.94
40  | 0.27398·10^7 | 0.18  101.05  46.63  | 1.25  1.71  16.67 | 1.24  22.50  28.92  | 14.31
50  | 0.22249·10^7 | 0.55  138.79  61.17  | 2.41  2.28  18.73 | 2.41  28.30  37.72  | 20.43
60  | 0.18818·10^7 | 0.00  182.38  77.19  | 1.47  2.97  22.50 | 2.34  34.30  46.91  | 26.69
80  | 0.14207·10^7 | 0.00  237.84  97.61  | 2.59  4.59  30.19 | 1.64  46.60  68.81  | 38.76
100 | 0.11419·10^7 | 0.00  314.08  122.92 | 1.74  6.38  38.00 | 0.81  58.90  93.69  | 43.19


Table 5: Results for large data sets

k | fbest | MS-MGKM: E, α, t | GKM: E, α, t | MGKM: E, α, t | MS-KM: E

TSPLIB3038
2   | 0.31688·10^10 | 0.00  0.11    0.56   | 0.00  0.05  1.38  | 0.00  0.14   0.86  | 0.00
10  | 0.56025·10^9  | 0.57  1.53    4.92   | 2.78  0.09  8.41  | 0.58  0.92   3.30  | 0.00
20  | 0.26681·10^9  | 0.20  4.69    11.89  | 2.00  0.16  16.63 | 0.48  1.92   5.77  | 0.13
30  | 0.17557·10^9  | 0.39  12.84   29.17  | 1.45  0.30  25.00 | 0.67  2.95   8.25  | 0.22
40  | 0.12548·10^9  | 0.33  20.36   43.59  | 1.35  0.40  33.23 | 1.35  3.99   10.70 | 1.07
50  | 0.98400·10^8  | 0.57  30.25   61.75  | 1.19  0.53  41.52 | 1.41  5.05   13.23 | 1.83
60  | 0.81180·10^8  | 0.00  43.61   85.42  | 1.02  0.64  49.75 | 2.01  6.10   15.75 | 1.43
80  | 0.60642·10^8  | 0.00  80.41   147.11 | 0.95  0.96  66.42 | 1.58  8.29   20.94 | 2.40
100 | 0.48182·10^8  | 0.00  120.68  211.00 | 2.11  1.29  83.16 | 1.52  10.50  26.11 | 3.35

Page Blocks
2   | 0.57937·10^11 | 0.00  0.14   1.88   | 0.24  0.15  8.19   | 0.00  0.45   6.92    | 0.24
10  | 0.45662·10^10 | 0.00  1.24   12.08  | 0.80  0.17  49.62  | 0.00  2.86   34.09   | 213.89
20  | 0.16742·10^10 | 0.00  2.51   20.75  | 2.37  0.23  92.30  | 2.57  5.93   62.09   | 698.98
30  | 0.93523·10^9  | 0.00  4.05   29.83  | 1.38  0.32  132.41 | 0.62  9.01   89.42   | 1295.35
40  | 0.62570·10^9  | 0.67  5.62   37.14  | 0.17  0.42  172.13 | 0.00  12.10  118.55  | 732.86
50  | 0.42024·10^9  | 0.00  7.27   43.42  | 2.21  0.59  212.27 | 2.17  15.20  149.77  | 1138.89
60  | 0.30850·10^9  | 0.00  10.59  57.25  | 1.09  1.01  254.88 | 1.42  18.50  184.06  | 4103.92
80  | 0.20436·10^9  | 0.00  19.86  90.89  | 2.16  1.42  334.36 | 0.69  25.00  258.69  | 2418.36
100 | 0.14428·10^9  | 0.00  27.30  114.59 | 0.81  2.05  415.19 | 0.91  31.60  346.94  | 3480.86

Pendigit
2   | 0.12812·10^9 | 0.00  0.01  6.94     | 0.39  0.01  9.42   | 0.00  0.02  16.73   | 0.00
10  | 0.49302·10^8 | 0.00  0.13  58.58    | 0.00  0.11  81.94  | 0.00  0.22  137.17  | 0.00
20  | 0.34123·10^8 | 0.22  0.32  144.13   | 0.22  0.23  168.36 | 0.39  0.46  281.33  | 0.00
30  | 0.27157·10^8 | 0.13  0.56  254.30   | 0.00  0.36  255.52 | 0.00  0.71  425.77  | 0.03
40  | 0.23446·10^8 | 0.00  0.84  387.47   | 0.11  0.49  343.11 | 0.00  0.96  570.50  | 0.63
50  | 0.21090·10^8 | 0.00  1.14  523.19   | 0.56  0.61  430.71 | 0.20  1.20  713.95  | 0.98
60  | 0.19357·10^8 | 0.00  1.47  678.02   | 0.22  0.74  518.97 | 0.36  1.45  858.09  | 1.15
80  | 0.16961·10^8 | 0.00  2.20  1021.52  | 0.25  1.00  696.94 | 0.11  1.96  1148.34 | 0.84
100 | 0.15258·10^8 | 0.00  3.13  1456.25  | 0.45  1.27  876.94 | 0.35  2.46  1438.78 | 1.78

Letters
2   | 0.13819·10^7 | 0.00  0.002  14.48   | 0.00  0.04  40.17   | 0.00  0.08  72.89   | 0.00
10  | 0.85752·10^6 | 0.00  0.23   180.42  | 0.00  0.36  363.78  | 0.00  0.72  583.36  | 0.00
20  | 0.67263·10^6 | 0.00  0.66   463.38  | 0.72  0.78  732.23  | 0.53  1.53  1202.36 | 0.00
30  | 0.57977·10^6 | 0.08  1.13   741.50  | 0.08  1.19  1094.59 | 0.00  2.35  1811.45 | 0.13
40  | 0.51925·10^6 | 0.00  1.65   1057.08 | 0.78  1.61  1453.70 | 0.00  3.16  2457.58 | 0.41
50  | 0.47727·10^6 | 0.00  2.20   1384.72 | 0.29  2.03  1812.06 | 0.23  3.98  3065.69 | 0.0
60  | 0.44166·10^6 | 0.00  2.86   1781.64 | 0.37  2.46  2173.91 | 0.12  4.80  3678.17 | 0.52
80  | 0.39129·10^6 | 0.00  4.20   2594.13 | 0.64  3.32  2892.00 | 0.25  6.46  4907.64 | 0.64
100 | 0.35644·10^6 | 0.00  5.67   3465.92 | 0.29  4.17  3611.05 | 0.04  8.12  6130.53 | 0.78
