Experiments with a Non-convex Variance-Based Clustering Criterion

Rodrigo F. Toso, Evgeny V. Bauman, Casimir A. Kulikowski, and Ilya B. Muchnik

Abstract This paper investigates the effectiveness of a variance-based clustering criterion whose construct is similar to the popular minimum sum-of-squares or k-means criterion, except for two distinguishing characteristics: its ability to discriminate clusters by means of quadratic boundaries and its functional form, for which convexity does not hold. Using a recently proposed iterative local search heuristic that is suitable for general variance-based criteria, convex or not (the first, to our knowledge, to offer such broad support), the alternative criterion has performed remarkably well. In our experimental results, it is shown to be better suited for the majority of the heterogeneous real-world data sets selected. In conclusion, we offer strong reasons to believe that this criterion can be used by practitioners as an alternative to k-means clustering.

Keywords Clustering • Variance-based discriminants • Iterative local search

R.F. Toso • C.A. Kulikowski
Department of Computer Science, Rutgers University, Piscataway, NJ 08854, USA
e-mail: [email protected]; [email protected]

E.V. Bauman
Markov Processes International, Summit, NJ 07901, USA
e-mail: [email protected]

I.B. Muchnik
DIMACS, Rutgers University, Piscataway, NJ 08854, USA
e-mail: [email protected]

F. Aleskerov et al. (eds.), Clusters, Orders, and Trees: Methods and Applications: In Honor of Boris Mirkin's 70th Birthday, Springer Optimization and Its Applications 92, DOI 10.1007/978-1-4939-0742-7_3, © Springer Science+Business Media New York 2014

1 Introduction

Given a data set $D = \{x_1, \ldots, x_n\}$ of $d$-dimensional unlabeled samples, the clustering problem seeks a partition of $D$ into $k$ nonempty clusters such that the most similar samples are aggregated into a common cluster. We follow the variational approach to clustering, where the quality of a clustering is evaluated by a criterion function (or functional), and the optimization process consists in finding a $k$-partition that minimizes such a functional [2]. Perhaps the most successful criteria for this approach are based on the sufficient statistics of each cluster $D_i$, that is, their sample prior probabilities $\hat{p}_{D_i}$, means $\hat{\mu}_{D_i}$, and variances $\hat{\sigma}^2_{D_i}$, which yield not only mathematically motivated but also perceptually confirmable descriptions of the data. In general, such criteria are derived from the equation

$$J(\hat{p}_{D_1}, \hat{\mu}_{D_1}, \hat{\sigma}^2_{D_1}, \ldots, \hat{p}_{D_k}, \hat{\mu}_{D_k}, \hat{\sigma}^2_{D_k}) = \sum_{i=1}^{k} J_{D_i}(\hat{p}_{D_i}, \hat{\mu}_{D_i}, \hat{\sigma}^2_{D_i}). \qquad (1)$$

In this paper, we focus on the functional $J_3 = \sum_{i=1}^{k} \hat{p}^2_{D_i} \hat{\sigma}^2_{D_i}$, proposed by Kiseleva et al. [12]. Other examples include the minimum sum-of-squares criterion $J_1 = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - \hat{\mu}_{D_i}\|^2 = \sum_{i=1}^{k} \hat{p}_{D_i} \hat{\sigma}^2_{D_i}$ and Neyman's [17] variant $J_2 = \sum_{i=1}^{k} \hat{p}_{D_i} \hat{\sigma}_{D_i}$, which is rooted in the theory of sampling. The key difference between the traditional $J_1$ and both $J_2$ and $J_3$ lies in the decision boundaries that discriminate the clusters: those of the latter two are quadratic (the intra-cluster variances, or dispersions, derived from the second normalized cluster moments, also enter the discrimination) and therefore more flexible than the linear discriminants employed by $J_1$, which takes into account only the means (centers) of the clusters.

The distinctive feature of criterion $J_3$ is its lack of convexity, given that both $J_1$ and $J_2$ are convex [15]. This has a deep impact on the practical application of the criterion, given that, to our knowledge, no simple clustering heuristic offers support for non-convex functionals, including the two-phase "k-means" clustering algorithm of Lloyd [14]. (It is worth mentioning that the two-phase algorithm was extended so as to optimize $J_2$, thus enabling an initial computational study of that criterion [15].) Even though there exist more complex algorithms that provide performance guarantees for $J_3$ (see, e.g., [7, 21]), we could not find in the literature any experimental study validating them in practice, perhaps due to their inherent complexity.

There was, to our knowledge, a gap of more than 25 years between the introduction of the criterion $J_3$ in 1986 [12] and the first implementation of an algorithm capable of optimizing it, published in 2012 by Toso et al. [23]. Their work introduced a generalized version of the efficient iterative minimum sum-of-squares local search heuristic studied by Späth [22] in 1980, which, in turn, is a variant of the online one-by-one procedure of MacQueen [16]. (An online algorithm does not require the whole data set as input; it reads the input sample by sample.)

This paper serves two purposes: first and foremost, it provides evidence of the effectiveness of the criterion $J_3$ when contrasted with $J_1$ and $J_2$ on real-world data sets; additionally, it offers an overview of the local search heuristic for general variance-based criterion minimization first published in [23]. Since the minimum sum-of-squares criterion is widely adopted by practitioners (in our view because the k-means algorithm is both effective and easy to implement), we offer strong reasons to believe that $J_3$ can be successfully employed in real-world clustering tasks, given that it shares these very same features.

Our paper is organized as follows. In the next section, we establish the notation and discuss the background of our work. The experimental evaluation appears in Sect. 3, and conclusions are drawn in Sect. 4.

2 Variance-Based Clustering: Criteria and an Optimization Heuristic

We begin this section by establishing the notation adopted throughout our work. Next, we review the framework of variance-based clustering and describe the three main criteria derived from it, including the functional $J_3$. We then conclude with a brief discussion about the local search heuristic that can optimize any criterion represented by Eq. (1).

2.1 Notation

Given an initial $k$-partition of $D$, let the number of samples in a given cluster $D_i$ be $n_{D_i}$; this way, the prior probability of $D_i$ is estimated as $\hat{p}_{D_i} = n_{D_i}/n$. The first and second sample moments of the clusters are given by

$$M^{(1)}_{D_i} = \hat{\mu}_{D_i} = \frac{1}{n_{D_i}} \sum_{x \in D_i} x, \quad \text{and} \qquad (2)$$

$$M^{(2)}_{D_i} = \frac{1}{n_{D_i}} \sum_{x \in D_i} \|x\|^2, \qquad (3)$$

respectively, where the former is the sample mean of the cluster $D_i$. It follows that the sample variance of $D_i$ is computed by

$$\hat{\sigma}^2_{D_i} = M^{(2)}_{D_i} - \|M^{(1)}_{D_i}\|^2. \qquad (4)$$

Here, $\hat{p}_{D_i} \in \mathbb{R}$, $\hat{\mu}_{D_i} \in \mathbb{R}^d$, and $\hat{\sigma}^2_{D_i} \in \mathbb{R}$, for all $i = 1, \ldots, k$.
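For readers who prefer code, the sketch below (ours, not part of the original paper) computes these per-cluster statistics with NumPy for a given $k$-partition; the function name cluster_statistics and the array layout are our own choices.

```python
import numpy as np

def cluster_statistics(X, labels, k):
    """Per-cluster prior, mean, and variance (Eqs. (2)-(4)) for a k-partition.

    X is an (n, d) array; labels is a length-n integer array in {0, ..., k-1};
    every cluster is assumed nonempty.
    """
    n, d = X.shape
    p_hat, mu_hat, var_hat = np.empty(k), np.empty((k, d)), np.empty(k)
    for i in range(k):
        Xi = X[labels == i]
        p_hat[i] = len(Xi) / n                            # prior probability n_i / n
        mu_hat[i] = Xi.mean(axis=0)                       # first moment M^(1), Eq. (2)
        m2 = (np.linalg.norm(Xi, axis=1) ** 2).mean()     # second moment M^(2), Eq. (3)
        var_hat[i] = m2 - np.linalg.norm(mu_hat[i]) ** 2  # sample variance, Eq. (4)
    return p_hat, mu_hat, var_hat
```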


2.2 Variance-Based Clustering Criteria

Among the criterion functions derived from Eq. (1) is the minimum sum-of-squares clustering (MSSC) criterion, since

$$\min \text{MSSC} = \min \sum_{i=1}^{k} \sum_{x \in D_i} \|x - \hat{\mu}_{D_i}\|^2 \qquad (5)$$

$$= \min \frac{1}{n} \sum_{i=1}^{k} \frac{n_{D_i}}{n_{D_i}} \sum_{x \in D_i} \|x - \hat{\mu}_{D_i}\|^2$$

$$= \min \sum_{i=1}^{k} \frac{n_{D_i}}{n} \cdot \frac{1}{n_{D_i}} \sum_{x \in D_i} \|x - \hat{\mu}_{D_i}\|^2$$

$$= \min \sum_{i=1}^{k} \hat{p}_{D_i} \hat{\sigma}^2_{D_i} = J_1, \qquad (6)$$

where the constant factor $1/n$ introduced in the second line does not affect the minimizer.

The functional form in Eq. (5) is called a membership or discriminant function since it explicitly denotes the similarity of a sample with respect to a cluster. On the other hand, Eq. (6) quantifies the similarity of each cluster directly. Note that the former is in fact the gradient of the latter, which, in this case, is convex [2]. In $J_1$, the separating hyperplane (also known as decision boundary) between two clusters $D_i$ and $D_j$ with respect to a sample $x$ is given by the equation

$$\|x - \hat{\mu}_{D_i}\|^2 - \|x - \hat{\mu}_{D_j}\|^2 = 0. \qquad (7)$$

Let us now turn our attention to a criterion proposed by Neyman [17] for one-dimensional sampling:

$$J_2 = \sum_{i=1}^{k} \hat{p}_{D_i} \sqrt{\hat{\sigma}^2_{D_i}}. \qquad (8)$$

Recently, $J_2$ was generalized so as to support multidimensional data and proven to be convex [15]. The decision boundaries produced by criterion $J_2$ are given by Eq. (9) and, contrary to those of $J_1$, take into account the variance of the clusters for discrimination:

$$\left[ \frac{\hat{\sigma}_{D_i}}{2} + \frac{1}{2 \hat{\sigma}_{D_i}} \|x - \hat{\mu}_{D_i}\|^2 \right] - \left[ \frac{\hat{\sigma}_{D_j}}{2} + \frac{1}{2 \hat{\sigma}_{D_j}} \|x - \hat{\mu}_{D_j}\|^2 \right] = 0. \qquad (9)$$


Finally, Kiseleva et al. [12] introduced the one-dimensional criterion function

$$J_3 = \sum_{i=1}^{k} \hat{p}^2_{D_i} \hat{\sigma}^2_{D_i}, \qquad (10)$$

whose decision boundaries are

$$\left[ \hat{p}_{D_i} \hat{\sigma}^2_{D_i} + \hat{p}_{D_i} \|x - \hat{\mu}_{D_i}\|^2 \right] - \left[ \hat{p}_{D_j} \hat{\sigma}^2_{D_j} + \hat{p}_{D_j} \|x - \hat{\mu}_{D_j}\|^2 \right] = 0. \qquad (11)$$

The criterion was extended to multidimensional samples [21], but a recent manuscript has sketched a proof stating that convexity does not hold [15]. On a positive note, and similarly to $J_2$, the decision boundaries of $J_3$ make use of the variances of the clusters.
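For concreteness, a minimal sketch of the three functionals, in terms of the statistics of Sect. 2.1, follows; it is ours rather than the paper's, and the reading of Eq. (11) as a discriminant (smaller bracketed term wins, by analogy with Eq. (7)) is our interpretation, not spelled out above.

```python
import numpy as np

def J1(p_hat, var_hat):
    # Eq. (6): sum_i p_i * sigma_i^2 (minimum sum-of-squares, up to the 1/n factor).
    return np.sum(p_hat * var_hat)

def J2(p_hat, var_hat):
    # Eq. (8): sum_i p_i * sqrt(sigma_i^2).
    return np.sum(p_hat * np.sqrt(var_hat))

def J3(p_hat, var_hat):
    # Eq. (10): sum_i p_i^2 * sigma_i^2.
    return np.sum(p_hat ** 2 * var_hat)

def j3_side(x, p_hat, mu_hat, var_hat, i):
    # Bracketed term of Eq. (11) for cluster D_i evaluated at sample x; the boundary
    # between D_i and D_j is the set of x where the two terms are equal.
    return p_hat[i] * var_hat[i] + p_hat[i] * np.linalg.norm(x - mu_hat[i]) ** 2
```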

2.2.1 Non-convexity of $J_3$

The aforementioned proof relies on the following:

Theorem 1. Assume that $f: \mathbb{R}^d \to \mathbb{R}$ is twice continuously differentiable. Then, $f$ is convex if and only if its Hessian $\nabla^2 f(x)$ is positive semidefinite for all $x \in \mathbb{R}^d$ [20].

It is enough to show that $J_3(D_i) = \hat{p}^2_{D_i} \hat{\sigma}^2_{D_i} = M^{(0)}_{D_i} M^{(2)}_{D_i} - \|M^{(1)}_{D_i}\|^2$ is not convex, with $M^{(m)}_{D_i}$ denoting the $m$-th non-normalized sample moment of cluster $D_i$.

Theorem 2. The functional $J_3$ is non-convex.

Proof. The partial derivatives of $J_3(D_i)$ are

$$\frac{\partial J_3(D_i)}{\partial M^{(0)}_{D_i}} = M^{(2)}_{D_i}, \qquad \frac{\partial J_3(D_i)}{\partial M^{(1)}_{D_i}} = -2 M^{(1)}_{D_i}, \qquad \text{and} \qquad \frac{\partial J_3(D_i)}{\partial M^{(2)}_{D_i}} = M^{(0)}_{D_i}.$$

Thus, Eq. (12) corresponds to the Hessian of the functional of a cluster:

$$\nabla^2 J_3(D_i) = \begin{pmatrix} 0 & 0 & 1 \\ 0 & -2 & 0 \\ 1 & 0 & 0 \end{pmatrix} \qquad (12)$$


We now proceed to show that the Hessian in Eq. (12) is not positive semidefinite, as required by Theorem 1.

Definition 1. A matrix $M \in \mathbb{R}^{d \times d}$ is called positive semidefinite if it is symmetric and $x^\top M x \geq 0$ for all $x \in \mathbb{R}^d$.

We show that there exists an $x$ such that

$$x^\top \begin{pmatrix} 0 & 0 & 1 \\ 0 & -2 & 0 \\ 1 & 0 & 0 \end{pmatrix} x < 0,$$

contradicting Definition 1. Plugging in $x = (1, 2, 1)^\top$ does the job, since $x^\top \nabla^2 J_3(D_i)\, x = 2 \cdot 1 \cdot 1 - 2 \cdot 2^2 = -6 < 0$; hence $\nabla^2 J_3(D_i)$ is not positive semidefinite and, by Theorem 1, $J_3$ is non-convex.
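A quick numerical check (ours, not in the paper) confirms both the negative quadratic form at $x = (1, 2, 1)^\top$ and the presence of negative eigenvalues:

```python
import numpy as np

H = np.array([[0.,  0., 1.],
              [0., -2., 0.],
              [1.,  0., 0.]])      # Hessian of Eq. (12)

x = np.array([1., 2., 1.])
print(x @ H @ x)                   # -6.0 < 0: H is not positive semidefinite
print(np.linalg.eigvalsh(H))       # approximately [-2., -1., 1.]
```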

2.3 Algorithms

Let us begin with $J_1$. Although exact algorithms have been studied [8, 9], minimizing Eq. (5) is NP-hard [5], and hence their use is restricted to small data sets. In general, except for approximation algorithms and local search heuristics with no performance guarantees, no optimization technique is likely to be suitable for minimizing Eq. (1) on large data sets unless P = NP.

The so-called k-means clustering algorithm is perhaps the most studied local search heuristic for the minimization of $J_1$ (disguised in Eq. (5)). In fact, k-means usually refers to (variants of) one of the following two algorithms. The first is an iterative two-phase procedure due to Lloyd [14] that is initialized with a $k$-partition and alternates two phases: (1) given the set of samples $D$ and the $k$ cluster centers, it reassigns each sample to the closest center; and (2) with the resulting updated $k$-partition, it updates the centers. This process is iterated until a stopping condition is met. The second variant is an incremental one-by-one procedure which uses the first $k$ samples of $D$ as the cluster centers. Each subsequent sample is assigned to the closest center, which is then updated to reflect the change. This procedure was introduced by MacQueen [16] and is a single-pass, online procedure, where samples are inspected only once before being assigned to a cluster. In [22], an efficient iterative variant of this approach was given; it can also be found in Duda et al. [6] (cf. Chap. 9, Basic Iterative Minimum-Squared-Error Clustering). A comprehensive survey on the origins and variants of k-means clustering algorithms can be found in [3].
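To make the two-phase idea concrete, here is a bare-bones sketch of a Lloyd-style procedure for $J_1$; the function name, random initialization, and stopping rule are our own simplifications, not the exact variants discussed above.

```python
import numpy as np

def lloyd_two_phase(X, k, max_iter=100, seed=0):
    """Two-phase (Lloyd-style) minimization of J1; an illustrative sketch."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initial centers
    for _ in range(max_iter):
        # Phase 1: assign every sample to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Phase 2: recompute each center from its updated cluster.
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):                 # centers stable: stop
            break
        centers = new_centers
    return labels, centers
```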

The main difference between the two approaches above is when the cluster centers are updated: in a separate phase, after all the samples have been considered (two-phase), or every time a sample is reassigned to a different cluster (one-by-one). It is here that the main drawbacks of the one-by-one method appear: the computational time required to update the cluster centers after each sample is reassigned is necessarily higher than in the two-phase approach, and its ability to escape from local minima has also been questioned [6]. Both issues have been addressed in [23], where the former is shown to be true while the latter is debunked in an experimental analysis. In contrast, significant improvements have been made to the two-phase algorithm when tied to the minimum-squared-error criterion, such as tuning it to run faster [11, 18] or to be less susceptible to local minima [4, 13], and also making it more general [19].

With $J_2$, since convexity holds, the two-phase procedure could be successfully employed in a thorough evaluation reported in [15].

Finally, we turn to $J_3$. Lloyd's [14] two-phase, gradient-based heuristic cannot be applied due to the non-convexity of the functional, but MacQueen's [16] online approach is a viable direction. A randomized sampling-based approximation algorithm was introduced by Schulman [21], relying on dimensionality reduction to make effective use of an exact algorithm whose running time grows exponentially with the dimensionality of the data. A deterministic approximation algorithm also appears in [7]. To our knowledge, no implementations validating these approaches have been reported in the literature. In conclusion, there was no experimental progress with $J_3$ prior to [23], and this can certainly be attributed to the lack of a computationally viable algorithmic alternative.

2.3.1 An Iterative Heuristic for Variance-Based Clustering Criteria

In this section, we revisit the local search heuristic appearing in [23] for criteria in the class of Eq. (1), including the three studied throughout this paper. The algorithm combines the key one-by-one, monotonically decreasing approach of MacQueen [16] with the iterative design of Späth [22], extending the efficient way to maintain and update the sufficient statistics of the clusters also used in the latter.

We first present in Algorithm 1 an efficient procedure that updates a given functional value to reflect the case where an arbitrary sample $x \in D_j$ is reassigned to cluster $D_i$ ($i \neq j$); we use the notation $J(x \to D_i)$ to indicate that $x$, currently in cluster $D_j$, is about to be considered in cluster $D_i$. Similarly to [6, 22], we maintain the unnormalized statistics of each cluster, namely $n_{D_i}$, the number of samples assigned to cluster $D_i$, $m_{D_i} = \sum_{x \in D_i} x$, and $s^2_{D_i} = \sum_{x \in D_i} \|x\|^2$. These quantities not only can be efficiently updated when a sample is moved from one cluster to another but also allow us to compute or update the criterion function quickly. Note that in [6, 22], only $n_{D_i}$ and $m_{D_i}$ need to be maintained since the algorithms are tied to $J_1$ as in Eq. (5) (the minimum sum-of-squares variant).

The main clustering heuristic is shown in Algorithm 2. The procedure is initialized with a $k$-partition that is used to compute the auxiliary and the sample statistics in lines 1 and 2, respectively. In the main loop (lines 5–17), every sample $x \in D$ is considered as follows: Algorithm 1 is used to assess the functional value when the current sample is tentatively moved to each cluster $D_1, \ldots, D_k$ (lines 6–8).


Algorithm 1 Computes $J(x \to D_i)$
Input: sample $x \in D_j$, target cluster $D_i$, current criterion value $J^*$, and cluster statistics: $n_{D_j}$, $m_{D_j}$, $s^2_{D_j}$, $\hat{p}_{D_j}$, $\hat{\mu}_{D_j}$, $\hat{\sigma}^2_{D_j}$, $n_{D_i}$, $m_{D_i}$, $s^2_{D_i}$, $\hat{p}_{D_i}$, $\hat{\mu}_{D_i}$, and $\hat{\sigma}^2_{D_i}$.
1: Let $n'_{D_j} := n_{D_j} - 1$ and $n'_{D_i} := n_{D_i} + 1$.
2: Let $m'_{D_j} := m_{D_j} - x$ and $m'_{D_i} := m_{D_i} + x$.
3: Let $(s^2_{D_j})' := s^2_{D_j} - \|x\|^2$ and $(s^2_{D_i})' := s^2_{D_i} + \|x\|^2$.
4: Let $\hat{p}'_{D_j} := n'_{D_j}/n$ and $\hat{p}'_{D_i} := n'_{D_i}/n$.
5: Let $\hat{\mu}'_{D_j} := \frac{1}{n'_{D_j}} m'_{D_j}$ and $\hat{\mu}'_{D_i} := \frac{1}{n'_{D_i}} m'_{D_i}$.
6: Let $(\hat{\sigma}^2_{D_j})' := \frac{1}{n'_{D_j}} (s^2_{D_j})' - \|\hat{\mu}'_{D_j}\|^2$ and $(\hat{\sigma}^2_{D_i})' := \frac{1}{n'_{D_i}} (s^2_{D_i})' - \|\hat{\mu}'_{D_i}\|^2$.
7: Compute $J(x \to D_i)$ with the updated statistics for clusters $D_i$ and $D_j$.

Algorithm 2 Minimizes a clustering criterion function
Input: an initial $k$-partition.
1: Compute $n_{D_i}$, $m_{D_i}$, and $s^2_{D_i}$ for all $i = 1, \ldots, k$.
2: Compute $\hat{p}_{D_i}$, $\hat{\mu}_{D_i}$, and $\hat{\sigma}^2_{D_i}$ for all $i = 1, \ldots, k$.
3: Set $J^* := J(\hat{p}_{D_1}, \hat{\mu}_{D_1}, \hat{\sigma}^2_{D_1}, \ldots, \hat{p}_{D_k}, \hat{\mu}_{D_k}, \hat{\sigma}^2_{D_k})$.
4: while convergence criterion not reached do
5:   for all $x \in D$ do
6:     for all $i$ such that $h_{D_i}(x) = 0$ do
7:       Compute $J(x \to D_i)$ via Algorithm 1.
8:     end for
9:     if $\exists\, i$ such that $J(x \to D_i) < J^*$ then
10:      Let $\min := \arg\min_i J(x \to D_i)$. (i.e., $x \to D_{\min}$ improves $J^*$ the most.)
11:      Let $j$ be such that $h_{D_j}(x) = 1$. (i.e., $D_j$ is the current cluster of $x$.)
12:      Set $h_{D_{\min}}(x) := 1$ and $h_{D_j}(x) := 0$. (i.e., assign $x$ to cluster $D_{\min}$.)
13:      Update $n_{D_{\min}}$, $n_{D_j}$, $m_{D_{\min}}$, $m_{D_j}$, $s^2_{D_{\min}}$, $s^2_{D_j}$.
14:      Update $\hat{p}_{D_{\min}}$, $\hat{p}_{D_j}$, $\hat{\mu}_{D_{\min}}$, $\hat{\mu}_{D_j}$, $\hat{\sigma}^2_{D_{\min}}$, $\hat{\sigma}^2_{D_j}$.
15:      Set $J^* := J(x \to D_{\min})$.
16:    end if
17:  end for
18: end while

Here $h_{D_i}(x) \in \{0, 1\}$ denotes the membership indicator of sample $x$ in cluster $D_i$.

If there exists a cluster $D_{\min}$ for which the objective function can be improved, the sample is reassigned to that cluster and all the statistics are updated. The algorithm stops when a convergence goal is reached.

With Algorithm 1 running in $\Theta(d)$ time, the running time of one iteration of Algorithm 2 is $\Theta(nkd)$, the same as that of an iteration of the simple two-phase procedure [6]. In practice, though, due to the constant terms hidden in the analysis, the latter is consistently faster than the former. Results for clustering quality have shown that the two approaches offer comparable clusterings [23].
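The sketch below is our own compact Python rendering of the idea of Algorithms 1 and 2, specialized to $J_3$; the function names and the nonempty-cluster guard are ours, and for clarity it copies the statistics arrays instead of performing the $\Theta(d)$ in-place delta update, so it should be read as an illustration rather than the authors' code from [23].

```python
import numpy as np

def j3_from_stats(n_i, m_i, s2_i, n):
    """J3 = sum_i p_i^2 * sigma_i^2, from unnormalized statistics (n_i, m_i, s2_i)."""
    p = n_i / n
    var = s2_i / n_i - np.linalg.norm(m_i / n_i[:, None], axis=1) ** 2
    return float(np.sum(p ** 2 * var))

def iterative_j3(X, labels, k, max_sweeps=100):
    """One-by-one local search in the spirit of Algorithms 1 and 2, for J3.

    X is (n, d); labels is a length-n integer array; every initial cluster
    is assumed nonempty.
    """
    n = len(X)
    n_i = np.array([np.sum(labels == i) for i in range(k)], dtype=float)
    m_i = np.array([X[labels == i].sum(axis=0) for i in range(k)])
    s2_i = np.array([np.sum(np.linalg.norm(X[labels == i], axis=1) ** 2) for i in range(k)])
    best = j3_from_stats(n_i, m_i, s2_i, n)
    for _ in range(max_sweeps):                  # "while convergence criterion not reached"
        moved = False
        for idx in range(n):                     # "for all x in D"
            x, j = X[idx], labels[idx]
            if n_i[j] <= 1:                      # guard (ours): keep clusters nonempty
                continue
            candidates = []
            for i in range(k):                   # tentative moves x -> D_i (Algorithm 1)
                if i == j:
                    continue
                n2, m2, s2 = n_i.copy(), m_i.copy(), s2_i.copy()
                n2[j] -= 1; n2[i] += 1
                m2[j] -= x; m2[i] += x
                s2[j] -= x @ x; s2[i] += x @ x
                candidates.append((j3_from_stats(n2, m2, s2, n), i, n2, m2, s2))
            val, i, n2, m2, s2 = min(candidates)  # most improving target cluster
            if val < best:                        # reassign only if J3 strictly improves
                labels[idx] = i
                n_i, m_i, s2_i, best = n2, m2, s2, val
                moved = True
        if not moved:                             # full sweep with no reassignment: stop
            break
    return labels, best
```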


The algorithm is quite flexible and can be extended with one or more regularization terms, such as to balance cluster sizes. Given that it is simple to implement and quick to run, the method can also be employed as a local search procedure with a global optimization overlay, such as genetic algorithms.

3 Experimental Results

This section offers an overview of the results obtained in [23]. Algorithm 2 was implemented in C++, compiled with g++ version 4.1.2, and run on a single 2.3 GHz CPU with 128 GB of RAM. The algorithm was stopped when no sample was moved to a different cluster in a complete iteration.

A visualization-friendly two-dimensional instance illustrating the capabilities of $J_3$ is depicted in Fig. 1, offering a glimpse of how the decision boundaries of the three criteria under study discriminate samples. Clearly, $J_2$ and $J_3$ built quadratic boundaries around the central cluster (of smaller variance) and linear hyperplanes between the external clusters (of the same variance), since Eqs. (9) and (11) become linear when $\hat{\sigma}^2_{D_i} = \hat{\sigma}^2_{D_j}$. For $J_1$, all boundaries are linear and thus unable to provide a proper discrimination for the central cluster.

In Fig. 2 we plot the value of $J_3$ after each sample is inspected by the algorithm over the course of nine iterations on a synthetic data set, at which point the algorithm halted. The figure not only shows that the criterion is consistently reduced up until the second iteration but also provides a hint for a possible enhancement, suggesting that the algorithm could be stopped at the end of the third iteration. Since our focus is on $J_3$ itself, no such running-time enhancements were made to the algorithm.


Fig. 1 Decision boundaries for a mixture of five equiprobable Gaussian distributions. The central cluster has a quarter of the variance of the external clusters. (a) Criterion $J_1$. (b) Criterion $J_2$. (c) Criterion $J_3$


[Figure: plot of criterion value (y-axis, roughly 3.0 to 6.5) versus iteration (x-axis, 0 to 9).]

Fig. 2 Evolution of the criterion value for a randomly generated mixture of Gaussians with $n = 10{,}000$, $k = d = 50$, and $\sigma^2 = 0.05$. Each of the $k = 50$ centers was randomly placed inside a $d$-dimensional unit hypercube

3.1 Clustering Quality Analysis

In the subsequent experiments, we selected 12 real-world classification data sets (those with available class labels) from the UCI Machine Learning Repository [1] having fairly heterogeneous parameters, as shown in Table 1. The available class labels from the selected data sets were used to compare the functionals under the following measures of clustering quality: accuracy, widely adopted by the classification community, and the Adjusted Rand Index (ARI) [10], a pair-counting measure adjusted for chance that is extensively adopted by the clustering community (see Vinh et al. [24]).
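For reference, both measures can be computed along the following lines; scikit-learn provides ARI directly, while "accuracy" is obtained here by best-matching cluster ids to class labels with the Hungarian method, which is one common convention and not necessarily the exact procedure used in [23].

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, confusion_matrix

def clustering_accuracy(y_true, y_pred):
    """Accuracy after best-matching cluster ids to class labels (Hungarian method).

    Assumes both label vectors are integer-coded.
    """
    C = confusion_matrix(y_true, y_pred)
    rows, cols = linear_sum_assignment(-C)   # maximize the total matched count
    return C[rows, cols].sum() / C.sum()

def evaluate(y_true, y_pred):
    return clustering_accuracy(y_true, y_pred), adjusted_rand_score(y_true, y_pred)
```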

From the qualitative results in Table 1, we note that $J_3$ significantly outperforms both $J_1$ and $J_2$ on average, being about 2 % better than its counterparts in both quality measures. Although we chose not to display the individual standard deviations for each data set, the average standard deviation in accuracy across all data sets was 0.0294, 0.0291, and 0.0276 for $J_1$, $J_2$, and $J_3$, respectively; for ARI, it was 0.0284, 0.0282, and 0.0208, respectively. In this regard, $J_3$ also offered more stable behavior across the different initial solutions.

Table 1 Description and solution quality for real-world data sets obtained from the UCI Repository [1]

| Data set | k | d | n | Accuracy J1 | Accuracy J2 | Accuracy J3 | ARI J1 | ARI J2 | ARI J3 |
|---|---|---|---|---|---|---|---|---|---|
| Arcene | 2 | 10000 | 200 | 0.6191 | 0.6173 | 0.6750 | 0.0559 | 0.0536 | 0.1180 |
| Breast-cancer | 2 | 30 | 569 | 0.8541 | 0.8735 | 0.8770 | 0.4914 | 0.5502 | 0.5613 |
| Credit | 2 | 42 | 653 | 0.5513 | 0.5865 | 0.5819 | 0.0019 | 0.0226 | 0.0193 |
| Inflammations | 4 | 6 | 20 | 0.6773 | 0.6606 | 0.7776 | 0.4204 | 0.4008 | 0.6414 |
| Internet-ads | 2 | 1558 | 2359 | 0.8953 | 0.8279 | 0.7961 | 0.4975 | 0.3434 | 0.2771 |
| Iris | 3 | 4 | 150 | 0.8933 | 0.8933 | 0.8933 | 0.7302 | 0.7302 | 0.7282 |
| Lenses | 2 | 6 | 24 | 0.6036 | 0.6011 | 0.6012 | 0.0346 | 0.0326 | 0.0382 |
| Optdigits | 10 | 64 | 5619 | 0.7792 | 0.7702 | 0.7959 | 0.6619 | 0.6498 | 0.6810 |
| Pendigits | 10 | 16 | 10992 | 0.6857 | 0.6960 | 0.7704 | 0.5487 | 0.5746 | 0.6155 |
| Segmentation | 7 | 19 | 2310 | 0.5612 | 0.5516 | 0.5685 | 0.3771 | 0.3758 | 0.4028 |
| Spambase | 2 | 57 | 4601 | 0.6359 | 0.6590 | 0.6564 | 0.0394 | 0.0773 | 0.0726 |
| Voting | 2 | 16 | 232 | 0.8966 | 0.8875 | 0.8865 | 0.6274 | 0.5988 | 0.5959 |
| Average | | | | 0.7211 | 0.7187 | 0.7400 | 0.3739 | 0.3675 | 0.3959 |
| Wins | | | | 4 | 3 | 7 | 3 | 3 | 7 |

Quality measures are averaged over 1,000 runs with random initial k-partitions

4 Summary and Future Research

This paper has shown promising qualitative results for a non-convex criterion function for clustering problems, obtaining outstanding results on heterogeneous data sets from various real-world applications, including digit recognition, image segmentation, and discovery of medical conditions. We strongly believe that this criterion can be an excellent addition to applications involving exploratory data analysis, given its ability to discriminate clusters with quadratic boundaries based on their variance.

Future research paths include more extensive experimentation with functionals $J_2$ and $J_3$, while also contrasting them with $J_1$, to better understand their strengths and weaknesses.

References

1. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository (2009). http://archive.ics.uci.edu/ml/
2. Bauman, E.V., Dorofeyuk, A.A.: Variational approach to the problem of automatic classification for a class of additive functionals. Autom. Remote Control 8, 133–141 (1978)
3. Bock, H.-H.: Origins and extensions of the k-means algorithm in cluster analysis. Electron. J. Hist. Probab. Stat. 4(2) (2008)
4. Bradley, P.S., Fayyad, U.M.: Refining initial points for k-means clustering. In: Proceedings of the 15th International Conference on Machine Learning, pp. 91–99. Morgan Kaufmann Publishers, San Francisco (1998)
5. Brucker, P.: On the complexity of clustering problems. In: Optimization and Operations Research. Lecture Notes in Economics and Mathematical Systems, vol. 157, pp. 45–54. Springer, Berlin (1978)
6. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley-Interscience, New York (2000)
7. Efros, M., Schulman, L.J.: Deterministic clustering with data nets. Technical Report 04-050, Electronic Colloquium on Computational Complexity (2004)
8. Grötschel, M., Wakabayashi, Y.: A cutting plane algorithm for a clustering problem. Math. Program. 45, 59–96 (1989)
9. Hansen, P., Jaumard, B.: Cluster analysis and mathematical programming. Math. Program. 79, 191–215 (1997)
10. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2, 193–218 (1985)
11. Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 881–892 (2002)
12. Kiseleva, N.E., Muchnik, I.B., Novikov, S.G.: Stratified samples in the problem of representative types. Autom. Remote Control 47, 684–693 (1986)
13. Likas, A., Vlassis, N., Verbeek, J.J.: The global k-means algorithm. Pattern Recognit. 36, 451–461 (2003)
14. Lloyd, S.P.: Least squares quantization in PCM. Technical report, Bell Telephone Labs Memorandum (1957)
15. Lytkin, N.I., Kulikowski, C.A., Muchnik, I.B.: Variance-based criteria for clustering and their application to the analysis of management styles of mutual funds based on time series of daily returns. Technical Report 2008-01, DIMACS (2008)
16. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297. University of California Press, Berkeley (1967)
17. Neyman, J.: On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. J. R. Stat. Soc. 97, 558–625 (1934)
18. Pelleg, D., Moore, A.: Accelerating exact k-means algorithms with geometric reasoning. In: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 277–281. ACM, New York (1999)
19. Pelleg, D., Moore, A.: x-means: extending k-means with efficient estimation of the number of clusters. In: Proceedings of the 17th International Conference on Machine Learning, pp. 727–734. Morgan Kaufmann Publishers, San Francisco (2000)
20. Ruszczyński, A.: Nonlinear Programming. Princeton University Press, Princeton (2006)
21. Schulman, L.J.: Clustering for edge-cost minimization. In: Proceedings of the 32nd Annual ACM Symposium on Theory of Computing, pp. 547–555. ACM, New York (2000)
22. Späth, H.: Cluster Analysis Algorithms for Data Reduction and Classification of Objects. E. Horwood, Chichester (1980)
23. Toso, R.F., Kulikowski, C.A., Muchnik, I.B.: A heuristic for non-convex variance-based clustering criteria. In: Klasing, R. (ed.) Experimental Algorithms. Lecture Notes in Computer Science, vol. 7276, pp. 381–392. Springer, Berlin (2012)
24. Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1073–1080. ACM, New York (2009)