A Comparison of the Stability Characteristics of Some Graph Theoretic Clustering Methods



IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. PAMI-3, NO. 4, JULY 1981

A Comparison of the Stability Characteristics of Some Graph Theoretic Clustering Methods

VIJAY V. RAGHAVAN AND C. T. YU

Abstract-Assessing the stability of a clustering method involves the measurement of the extent to which the generated clusters are affected by perturbations in the input data. A measure which specifies the disturbance in a set of clusters as the minimum number of operations required to restore the set of modified clusters to the original ones is adopted. A number of well-known graph theoretic clustering methods are compared in terms of their stability as determined by this measure. Specifically, it is shown that among the clustering methods in any of several families of graph theoretic methods, clusters defined as the connected components are the most stable and the clusters specified as the maximal complete subgraphs are the least stable. Furthermore, as one proceeds from the method producing the most narrow clusters (maximal complete subgraphs) to those producing relatively broader clusters, the clustering process is shown to remain at least as stable as any method in the previous stages. Finally, the lower and the upper bounds for the measure of stability, when clusters are defined as the connected components, are derived.

Index Terms-Connected components, graph theoretic clustering, maximal complete subgraphs, stability bounds, stability comparisons, stable classifications.

I. INTRODUCTION

A. The Problem

In many fields such as information retrieval, social sciences, and life sciences, a specialist finds himself confronted with a large number of objects. An important step in analyzing, understanding, and coping with such data often consists of classifying or clustering this set into "natural" and "homogeneous" groups.

Many automatic clustering techniques are now available [1]-[13]. The suitability of a clustering method to a given application depends primarily on the extent to which the classes obtained meet a chosen classificatory criterion. It is also common practice to consider other factors such as computational efficiency, whether or not the generated clusters depend on the order in which the objects are processed, whether certain objects may be assigned to more than one cluster (cluster overlap), and whether the algorithm is sensitive to errors in the input data.

When large amounts of input data are involved, it is likely

that some errors occur in the parametric representation (choice of attributes and their values) of objects and in the conversion of this information into machine readable form. Consequently, it is desirable to evaluate clustering techniques in terms of the effect such errors have on the clusters obtained. In this context a classification algorithm is considered "stable" if the generated clusters are not grossly affected by small errors in the input. The current work is concerned with the comparison of some well-known automatic clustering methods from the point of view of their stability.

Manuscript received September 26, 1978; revised May 28, 1980. This work was supported in part by a Natural Sciences and Engineering Research Council of Canada Grant. V. V. Raghavan is with the Department of Computer Science, University of Regina, Regina, Sask., Canada. C. T. Yu is with the Department of Information Engineering, University of Illinois at Chicago Circle, Chicago, IL 60638.

Jardine and Sibson [9] have identified a formal model of

classification methods in which one of the steps in the process consists of transforming the "input" similarity values to "target" similarity values. They have shown that the transformation corresponding to the definition of clusters as the connected components is continuous, thus suggesting that those clusters are stable.

Corneil and Woodward [15] have compared, besides other characteristics, the stability of the clustering methods proposed by Vaswani [18], Gotlieb and Kumar [19], and Zahn [20] using the measure suggested by Rand [21]. Sokal and Rohlf [16] and Jackson [17] suggested measures to compute the similarity between two classifications. Such measures might be useful in stability comparisons. Jackson [22] proposed a measure to estimate the stability of a similarity function and also provided efficient algorithms to compute this measure. Yu [14] has suggested a measure of stability which is suitable for evaluating graph theoretic clustering methods.

B. Overview

The use of measures such as those proposed by Sokal, Rohlf, and Jackson facilitates experimental evaluation of the stability of different classification strategies. In contrast, it is found that the measure proposed by Yu can be used to evaluate the relative stability of graph theoretic clustering strategies analytically. This paper provides the details and the results of such an analysis.

In Section II a few essential definitions relating to graph theoretic clustering methods are introduced. In Section III clusters defined as the maximal complete subgraphs are shown to be the least stable. Certain properties that are common to many well-known graph theoretic clustering methods are then identified, and the connected component method is shown to be the most stable among all the clustering methods possessing these properties. The stability characteristics of particular clustering methods relative to the others in several families of clustering methods are considered in Section IV. Section V deals with the derivation of lower and upper bounds for the stability of clusters defined as the connected components. Finally, Section VI provides a summary of the findings.


II. BASIC DEFINITIONS AND NOTATIONS

A. Graph Theoretic Definitions

Graphs have been used to abstract the relationships between many kinds of physical entities. In the context of the clustering problem, the set of objects to be classified corresponds to the vertices of the graph. A similarity measure is used to specify the closeness between any pair of objects. An edge would be associated with a pair of objects if the measure of closeness between them is sufficiently large (as determined by a threshold). Formal definitions of common notions such as a graph, subgraph, complement of a graph, adjacency, connectedness, and cycle can be found in standard books on graph theory [23], [24].

Definition 1: Let P be a property. A maximal set S having property P is a set such that S ∪ {x} does not satisfy P for any x ∉ S. A minimal set S' having property P is a set such that S' - {x}¹ does not have P for any x ∈ S'.

Definition 2: A maximal complete subgraph (MCS), G1 = (V1, E1), of a graph G = (V, E) is a maximal subgraph of G such that for every x1, x2 ∈ V1, x1 is adjacent to x2.

By definition, when two vertices are adjacent they are very close to each other. Thus, every object contained in a maximal complete subgraph is very close to every other object in that subgraph. A maximal complete subgraph of n vertices has C(n, 2) edges.²

Definition 3: A connected component (or simply, compo-

nent, abbreviated as CC) G1 = (V1, E1) of a graph G = (V, E) is a maximal subgraph of G such that for every x1, x2 ∈ V1, x1 and x2 are connected.

The definition indicates that each connected component represents a largest possible set of related, although not necessarily very close, objects. A component G1 = (V1, E1) of a graph is minimally connected if it is minimal with respect to the edge set E1. That is, none of the edges in a minimally connected graph can be removed without causing the resulting graph to be disconnected. It is easy to show by induction that such a graph on n vertices has (n - 1) edges and no cycles. A related fact is that any edge in a cycle of a graph can be removed without affecting the connectivity. Finally, we introduce the notion of a bridge.

Definition 4: If v and w are adjacent and if the deletion of this edge would cause v and w to be disconnected, then the edge (v, w) is a bridge.
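To make Definitions 2 and 3 concrete, the short sketch below computes both kinds of clusters for a small graph. It is an illustrative aid rather than anything taken from the paper: the function names, the adjacency-set representation, and the example graph are our own choices, and the MCS routine is a plain Bron-Kerbosch enumeration that is adequate only for small graphs.

```python
def connected_components(vertices, edges):
    """Return the CC's of an undirected graph as a list of vertex sets (Definition 3)."""
    adj = {v: set() for v in vertices}
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    seen, components = set(), []
    for v in vertices:
        if v in seen:
            continue
        stack, comp = [v], set()
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        components.append(comp)
    return components

def maximal_complete_subgraphs(vertices, edges):
    """Return the MCS's, i.e., maximal cliques, of an undirected graph (Definition 2).
    Simple Bron-Kerbosch recursion without pivoting; fine for small illustrative graphs."""
    adj = {v: set() for v in vertices}
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    cliques = []
    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)          # r cannot be extended: it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    expand(set(), set(vertices), set())
    return cliques

# A 5-vertex example: one connected component, three maximal complete subgraphs.
V = [1, 2, 3, 4, 5]
E = [(1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
print(connected_components(V, E))         # [{1, 2, 3, 4, 5}]
print(maximal_complete_subgraphs(V, E))   # [{1, 2}, {2, 3}, {3, 4, 5}]
```

The same two helpers (connected_components and maximal_complete_subgraphs) are reused in the later sketches.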

B. Notations Relating to the Measure of Stability

The measure suggested by Yu [14] estimates the amount of change in a set of clusters by the minimum number of "operations" required to restore the set of modified clusters to the original ones, where an operation is either an addition or a deletion of an edge. Implicit in the above statement is the assumption that the change in the data is so small that there is a 1-1 correspondence between the vertices of the modified graph and those of the original graph. In other words, none of the original objects is lost and no new ones are created.

¹If A and B are two sets, then (A - B) is the set of elements in A, but not in B.
²C(n, 2) = n(n - 1)/2, the number of 2-combinations of n elements.

Fig. 1. An initial graph G, a perturbed graph G*, and a restored graph G** to illustrate the amount of work.

Therefore, the operations consist only of the addition and/or deletion of edges needed to restore the set of modified clusters to the original ones. More precisely, let G = (V, E) be the graph that would represent the object-object similarities if there had been no errors in the input data. Let G* = (V, E*) denote the graph actually obtained as the result of some perturbations in the input. That is, E* is obtained by deleting some edges from E and adding some edges from the complement of E. Thus, edge deletions come from the original graph, whereas edge additions are from the complement graph. Given a cluster defining method D, suppose G** = (V, E**) denotes a graph which is obtained through a minimum number of changes to G* such that G and G** have an identical set of clusters. Then the amount of change is specified by the expression |(E** - E*) ∪ (E* - E**)|.³ This concept is illustrated using Fig. 1.

The changes to G due to errors are the addition of edge (O3, O7) and the deletion of (O1, O2) and (O4, O5). Let the clusters be defined as the CC's. Clearly, the removal of (O3, O7) and the addition of one of the edges {(O1, O2), (O1, O3), (O4, O2), (O4, O5)} are sufficient to restore the clusters (refer to G**), and the work needed for restoration is 2. Clearly, certain edge changes [adding (O3, O7) or deleting (O6, O8)] would affect the resulting clusters, and yet certain other changes [say, deletion of (O2, O3)] do not alter the clusters obtained. In this sense, the measure takes the structure of the graph into account.
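Under the CC definition the restoration cost can be computed directly from the rule used in this example: every edge added between two different components of G must be deleted, and a component of G that has been split into m pieces needs m - 1 reconnecting edges (this is Lemma 1 of Section III). The sketch below, which reuses the connected_components helper from the Section II sketch, applies that rule to an assumed stand-in for the graphs of Fig. 1; the exact edge set of the figure is not recoverable from the text, so the vertex labels and edges here are illustrative only.

```python
def norm(edges):
    """Treat edges as unordered pairs."""
    return {frozenset(e) for e in edges}

def cc_restoration_work(vertices, orig_edges, pert_edges):
    """W(CC, G, G*): intercomponent additions that must be removed, plus
    (m_j - 1) reconnecting edges for each original component split into m_j pieces."""
    E, E_star = norm(orig_edges), norm(pert_edges)
    comps = connected_components(vertices, [tuple(e) for e in E])
    comp_of = {v: i for i, c in enumerate(comps) for v in c}
    added = E_star - E
    inter = {e for e in added if len({comp_of[v] for v in e}) == 2}
    work = len(inter)                          # each intercomponent addition must go
    intra = [tuple(e) for e in E_star - inter]
    for c in comps:                            # reconnect each split component
        sub = [e for e in intra if e[0] in c and e[1] in c]
        pieces = connected_components(sorted(c), sub)
        work += len(pieces) - 1
    return work

# Assumed stand-in for the situation of Fig. 1 (labels are illustrative).
V = ["O1", "O2", "O3", "O4", "O5", "O6", "O7", "O8"]
G  = [("O1", "O2"), ("O1", "O3"), ("O2", "O3"), ("O3", "O4"), ("O4", "O5"),
      ("O6", "O7"), ("O6", "O8")]
Gs = [("O1", "O3"), ("O2", "O3"), ("O3", "O4"),   # (O1,O2) and (O4,O5) deleted
      ("O3", "O7"),                               # one intercomponent edge added
      ("O6", "O7"), ("O6", "O8")]
print(cc_restoration_work(V, G, Gs))              # 2
```

On this stand-in the computed work is 2, matching the value discussed above: one deletion for (O3, O7) and one addition to rejoin the vertex that was cut off.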

This measure, which represents the amount of work necessary, is denoted W(D, G, G*), where D represents the clustering method. However, for simplicity, the notation W(D) will be used henceforth.

³If A is a set of p elements, then |A| = p.


The following definition enables the comparison of the stability of one clustering method to another in terms of our measure.

Definition 5: A clustering method D1 is at least as stable as another method D2 if W(D1) ≤ W(D2) for every G and G*.

Thus, if D is a class of clustering methods, then a method D1 is the most stable in D if, for every D2 ∈ D, W(D1) ≤ W(D2) for every G and G*; the definition of the least stable method is analogous.

III. STABILITY CHARACTERISTICS OF THE CC AND THE MCS DEFINITIONS

It has been shown by Yu [14] that the amount of work required to convert a set of modified clusters to the original ones is as much as it can possibly be when clusters are defined as the MCS's. The following proposition states that the MCS's of two graphs are identical only if the graphs are identical. Thus, every edge addition and deletion must be reversed in order to have the same set of clusters. The proof of the proposition can be found in [14].

Proposition 1: Let {c1, ..., ci} and {c1*, ..., cj*} be two sets of clusters defined as the MCS's, respectively, of the graphs G = (V, E) and G** = (V, E**). Then {c1, ..., ci} = {c1*, ..., cj*} if and only if G = G**.

The MCS method has the most stringent requirement for a set of objects to be placed in a cluster and, consequently, produces very narrow or tight clusters. In contrast, the CC method has the least restrictive requirement and leads to very broad or loose clusters. The result above, which implies that clusters defined as MCS's are the least stable, suggests that perhaps the CC method is the most stable. This possibility is explored in a formal setting in this paper.

Since there are a large number of graph theoretic clustering

methods, it is not very practical to analyze their stability individually. This paper, therefore, provides a collective treatment by defining certain properties that characterize several families of graph theoretic clustering methods, and then by obtaining results that are indicative of the relative stability of particular clustering methods in these families.

The basic objective of clustering is to obtain groups of closely related objects. Thus, it is desirable to ensure that objects placed in the same cluster are at least somewhat related. This requirement can be stated more formally as follows.

Definition 6: A cluster is unfragmented if for any two vertices v and w in the same cluster, v and w are connected.

This property implies that any cluster must be a subset

(although not necessarily a proper subset) of a connected component. In the rest of the paper the term cluster will refer only to unfragmented clusters.

Just as we place unrelated objects in different clusters, it is also natural to always have close objects placed in some cluster. This notion is formalized in the following way.

Definition 7: A cluster definition is adjacency oriented if for every pair of adjacent vertices v and w in a graph, the application of the definition to the graph yields at least one cluster that contains both v and w.

It is easy to see that clusters defined as the CC's and MCS's are adjacency oriented and unfragmented. Several other existing clustering algorithms, such as the fine k-clustering (Bk), the coarse k-clustering, and the u-diametric clustering (Cu) methods described in [9], [25], [26], and the method of Gotlieb and Kumar [19], which forms diffuse concepts from sharp ones, also have these properties.⁴ The stability characteristics of these methods are analyzed in this and the next sections.

When cluster definitions that are not adjacency oriented are adopted, there may be adjacent vertices that do not belong to any cluster. Let the edges corresponding to these vertices be referred to as intercluster edges. If an edge is not an intercluster edge, then it is between vertices in a single cluster. Such an edge is termed an intracluster edge. For these cluster definitions it is reasonable to expect the clusters to remain unaltered if there is a decrease in the number of intercluster edges. We specify this as a property that is required of any reasonable cluster definition.

Definition 8: A cluster definition is cohesively consistent

if the deletion of intercluster edges leaves all the clusters unchanged.

Certain well-known methods, where the clusters are given for k > 1 by the set of k-bonds, k-components, or k-blocks [13], are not adjacency oriented but are cohesively consistent and unfragmented. Consequently, in the rest of the paper it is assumed that a cluster definition generates unfragmented clusters and is either adjacency oriented or cohesively consistent.

The main result of this section is to show that clusters defined as the CC's are the most stable among all adjacency oriented definitions. We first specify some further notation. The graphs G and G* represent the graphs before and after certain input changes, and G** denotes the restored graph that has the same clusters as G. Let G have e edges and s components, with nj vertices in the jth connected component gj for 1 ≤ j ≤ s, and let nj ≥ nj+1 for 1 ≤ j ≤ (s - 1). The set of graphs that have these properties is denoted by 𝒢(s, e, n1, n2, ..., ns). Finally, we denote the subgraphs consisting of the vertices of gj in G* and G**, respectively, by gj* and gj**. Note that gj* need not be connected.

The proof of Proposition 2 consists, essentially, of showing

that for any adjacency oriented definition, G and G** can have the same clusters only if, corresponding to each component in G, the graph G** has a component consisting of exactly the same vertices. Thus, in order to restore clusters according to any adjacency oriented clustering method, it is necessary that all edges added between the components of G be removed and that enough intracluster edges be added to gj* so that gj** becomes connected. The amount of work required to restore clusters defined as the CC's is exactly that. Lemma 1 gives the amount of work required to restore all gj*'s that might be disconnected to gj**'s. Lemma 2 is a technical result used in proving Proposition 2.

Lemma 1: Let G have s components. Suppose that the perturbations in the data are such that no intercomponent edges are added to G and that gj in G, 1 ≤ j ≤ s, is split into mj components in G*. Then W(CC) = Σ_{j=1}^{s} (mj - 1).

⁴Since each final cluster identified is either identical to one of the MCS's present in the given graph or is the union of some of them, these methods are adjacency oriented and unfragmented.


Proof: It is easy to show by induction on the number of components that the minimum number of edges needed to combine mj components into one is (mj - 1).

Lemma 2: Let H = (V, E) be a connected graph and let H' = (V, E') be another graph on the same vertex set. If v is connected to w in H' for every v, w ∈ V such that v is adjacent to w in H, then H' is connected.

Proof: Let a, b be any two vertices in H'. Since H is connected, there exists a path from a to b. Suppose it is a, v1, v2, ..., vj, b. Then, by hypothesis, in the graph H', a is connected to v1, vi is connected to vi+1 for 1 ≤ i ≤ j - 1, and vj is connected to b. Thus, a is connected to b in H'. Since the above result is true for any a, b ∈ V, H' is connected. #

Proposition 2: Given a cluster defining method D, W(CC) ≤

W(D) for every initial graph G and for every set of changes to G if and only if the cluster definition D is adjacency oriented.

Proof: (Necessity) If D is not adjacency oriented, then there exists a graph and a set of changes to the graph such that W(D) < W(CC). Specifically, consider the graph G = (V, E) such that

1) the edge (u, v) ∈ E is a bridge, and
2) the vertices u and v do not belong to any cluster according to D.

If the only change to G is the deletion of the edge (u, v), then it is clear that W(CC) = 1. Since D is cohesively consistent, W(D) = 0.

(Sufficiency) The changes causing G to be modified to G* are of the following two types.

Case 1: The addition of intercomponent edges, given by {e = (v, w) | v and w are in different components of G}.

Consider any edge e = (v, w) given above. When restoring clusters according to the CC definition, this edge must be deleted. With the definition D, v and w cannot be assigned to the same cluster in G since D is unfragmented. However, there will be at least one cluster in G* containing v and w since D is adjacency oriented. Therefore, the edge e must be deleted for definition D as well. Thus, for the edges considered here, W(D) = W(CC).

Case 2: The intracomponent edge additions and deletions that result in gj* having (mj + 1) components, 0 ≤ mj ≤ nj - 1 and 1 ≤ j ≤ s.

Since all intercomponent edge additions described in Case 1

must eventually be removed, they are assumed to be deleted from G*. Since D is adjacency oriented and always produces unfragmented clusters, for every pair of adjacent vertices (v, w) in gj, v must be connected to w in G**. Thus, both the definitions D and CC require the addition of mj edges to gj* (Lemmas 1 and 2). This work is required for each gj. Whereas the CC method requires exactly this amount, further changes in edges may be required to ensure that the set of vertices in each gj forms the same clusters according to definition D in G**. Thus, for the changes considered in this case, W(D) ≥ W(CC).

Putting the two cases together, we have W(D) ≥ W(CC). #

The sufficiency part of the proposition implies that the CC method is the most stable among all the adjacency oriented techniques, such as the fine and the coarse k-clustering methods, for any given initial graph and for any set of changes. The necessity part shows that clustering methods such as the k-bond methods will be more stable than the CC method for some initial graphs and certain kinds of changes to them. Yu [14] has informally addressed the related problem of establishing that the CC method is the most stable among all graph theoretic clustering methods in a probabilistic sense. The following corollary, which is an immediate consequence of Propositions 1 and 2, summarizes the results of this section.

Corollary 1: For every G and every set of modifications to G, W(MCS) ≥ W(D) ≥ W(CC), where D is any adjacency oriented clustering method.

IV. STABILITY ORDERING IN ADJACENCY ORIENTED FAMILIES OF CLUSTERING METHODS

Clusters identified as the CC's are very broad, whereas those defined as the MCS's are very tightly knit. From a practical point of view, however, cluster definitions that represent a compromise between these two extremes have been considered more appropriate. Therefore, it is of interest to study the stability characteristics of such intermediate cluster definitions.

The various adjacency oriented methods referred to earlier [9], [19] are families of clustering methods where the degree of tightness of the clusters depends on a certain parameter. The MCS's of the graph are obtained as clusters for the highest value of the parameter, whereas the lowest value of the parameter yields the broadest form of clusters. Furthermore, the latter clusters correspond to the CC's of the graph in several of these families. Since the results of the earlier section imply that the MCS and the CC methods are the least and the most stable, respectively, within each family, it is conjectured that successively more stable clustering methods would be obtained as the value of the parameter decreases. The property to be investigated is defined below.

Definition 9: Suppose At denotes a sequence of adjacency oriented clustering methods with the property that any cluster defined by At2 must be included in some cluster defined by At1 for any t1 < t2. Furthermore, let each member At produce unfragmented clusters. Such a sequence At is stability ordered if, for every initial graph and any set of changes to it, W(At1) ≤ W(At2) for t1 < t2.

First, the Bk methods [9] are considered with respect to this property. These methods have the following graph theoretic description:

Given a graph G = (V, E), |V| = p, its MCS's are marked. Whenever there are k or more vertices in common between two of these subgraphs, the edges necessary to make the union of the vertices in the two subgraphs into a single MCS are inserted. This process is repeated until no further alteration is possible. The Bk clusters are the MCS's of the resulting graph. Let this algorithm be referred to as B_CLUSTER.
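The B_CLUSTER procedure can be rendered directly from the description above. The sketch below reuses the maximal_complete_subgraphs helper from the Section II sketch; it is a naive, unoptimized reading of the description (recompute the MCS's, merge one overlapping pair, repeat), not the authors' implementation.

```python
from itertools import combinations

def b_cluster(vertices, edges, k):
    """Bk clustering: repeatedly take two MCS's sharing at least k vertices and
    insert the edges that make the union of their vertices a complete subgraph;
    the Bk clusters are the MCS's of the resulting graph."""
    E = {frozenset(e) for e in edges}
    while True:
        mcs = maximal_complete_subgraphs(vertices, [tuple(e) for e in E])
        merged = False
        for A, B in combinations(mcs, 2):
            if len(A & B) >= k:
                # complete the union of the two subgraphs
                new = {frozenset(pair) for pair in combinations(A | B, 2)} - E
                if new:
                    E |= new
                    merged = True
                    break
        if not merged:
            clusters = maximal_complete_subgraphs(vertices, [tuple(e) for e in E])
            return clusters, E

# Remark 1 in action: B1 recovers the connected components.
V = [1, 2, 3, 4, 5]
E = [(1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
clusters_b1, _ = b_cluster(V, E, 1)
print(clusters_b1)   # one cluster per connected component: [{1, 2, 3, 4, 5}]
```

Returning the edge set of the resulting graph as well as the clusters makes it easy to check Remark 2 numerically, as done after Proposition 3 below.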

Remark 1: At the end of the process, the overlap between any two MCS's is no more than (k - 1). Thus, B1 = CC and Bp-1 = MCS.

Remark 2: For k ≤ j, consider Gk and Gj obtained, respectively, by B_CLUSTER(G, k) and B_CLUSTER(G, j). Since Ej ⊆ Ek, in the process of finding the clusters for k = 1, 2, ..., p - 1 it is sufficient to work down the values of k and find Gk from Gk+1 rather than go back to G itself.

The stability orderedness of the Bk methods is established in

the following proposition.


Fig. 2. An initial graph G and a perturbed graph G* to illustrate that Dt is not stability ordered.

Proposition 3: The sequence Bk of fine clustering methods is stability ordered.

Proof: Let G* = (V, E*) denote the graph obtained after some perturbations have occurred in the original graph. Let W(Bk) and W(Bk+1) be, respectively, the minimum amounts of work required to change G* to some G** and to some G**' such that B_CLUSTER(G**, k) returns Gk and B_CLUSTER(G**', k + 1) returns Gk+1. Then by Remark 2,

B_CLUSTER(G**', k) = B_CLUSTER(B_CLUSTER(G**', k + 1), k) = B_CLUSTER(Gk+1, k) = Gk.

Therefore, W(Bk) ≤ W(Bk+1). #
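The nesting observation of Remark 2, on which this proof rests, can be spot-checked with the b_cluster sketch given earlier; the assertion below is a numerical check on one small assumed graph, not a proof.

```python
# Spot check of Remark 2: B_CLUSTER(G, k) == B_CLUSTER(B_CLUSTER(G, k+1), k).
V = [1, 2, 3, 4, 5]
E = [(1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
for k in range(1, 4):
    direct, _ = b_cluster(V, E, k)
    _, Ek1 = b_cluster(V, E, k + 1)
    nested, _ = b_cluster(V, [tuple(e) for e in Ek1], k)
    assert sorted(map(sorted, direct)) == sorted(map(sorted, nested))
print("Remark 2 holds on this example")
```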

The critical aspect of the above proof is the observation made in Remark 2. The algorithms for the coarse k-clustering and Cu methods, and for the method proposed by Gotlieb and Kumar, all share this feature. Thus, these methods are also stability ordered.

We next identify a family of adjacency oriented cluster definitions that is not stability ordered. Notice that in the case of a CC, it is sufficient to have n - 1 edges for a set of n objects to form a cluster, whereas C(n, 2) edges are needed to form a complete graph. Intermediate cluster definitions can, therefore, be thought of as containing a number of edges somewhere between these two extremes. A natural generalization of the way clusters are obtained using the MCS definition suggests the following family of cluster definitions.

Definition 10: A subgraph G1 = (V1, E1) of G, where |V1| = p, is maximal t-complete if it is connected, has at least ⌈t · C(p, 2)⌉⁵ edges, 0 ≤ t ≤ 1, and is maximal with respect to V1.

avoid the possibility of an identified cluster being a subset ofanother.Let Dt denote the definition whose clusters are given by all

maximal t-complete subgraphs of G. Clearly, D1 = MCS andDo = CC. Thus, Dt for 0 < t < 1, represents a compromisebetween the clusters defined as the CC's and those defined as

the MCS's. It is clear that this family of definitions are adja-cency oriented and produce clusters that are unfragmented.

It is shown by means of a counterexample that the sequence Dt, defined earlier, is not stability ordered. Consider the graphs in Fig. 2. Let G and G* represent the original and the modified graphs, respectively.

⁵⌈x⌉ is the smallest integer greater than or equal to x.

For Dt with t = 0.8, the clusters in both G and G* are {(1, 2), (2, 3, 4, 5)}. Thus, no restoration is needed in this case. However, if t = 0.65 then the cluster defined by Dt on G is {(1, 2, 3, 4, 5)}, which is different from the clusters defined by the method on G*. Consequently, some restoration must be performed. Clearly, for this example, W(Dt, t = 0.65) > W(Dt, t = 0.8).
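The failure of stability ordering for Dt can be reproduced with a small amount of code. Since the exact edge sets of Fig. 2 are not recoverable from the text, the graph below is an assumed stand-in chosen to exhibit the behaviour described above; the helper names are ours, and connected_components is reused from the Section II sketch.

```python
from itertools import combinations
from math import ceil

def is_t_complete(S, edges, t):
    """Definition 10: S induces a connected subgraph with at least ceil(t * C(|S|, 2)) edges."""
    sub = [e for e in edges if e[0] in S and e[1] in S]
    p = len(S)
    enough = len(sub) >= ceil(t * p * (p - 1) / 2)
    connected = len(connected_components(sorted(S), sub)) == 1
    return enough and connected

def dt_clusters(vertices, edges, t):
    """All maximal t-complete subgraphs (exhaustive search; fine for tiny graphs)."""
    cands = [set(S) for r in range(2, len(vertices) + 1)
             for S in combinations(vertices, r) if is_t_complete(S, edges, t)]
    return [S for S in cands if not any(S < T for T in cands)]

# Assumed stand-in for Fig. 2; the figure's exact edges are not given in the text.
V = [1, 2, 3, 4, 5]
G  = [(1, 2), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
Gs = [e for e in G if e != (3, 4)]              # one intracluster edge deleted

print(dt_clusters(V, G, 0.8), dt_clusters(V, Gs, 0.8))      # identical clusters
print(dt_clusters(V, G, 0.65), dt_clusters(V, Gs, 0.65))    # clusters change
```

For t = 0.8 the clusters of G and G* coincide, so no restoration is needed, while for t = 0.65 the single cluster {1, 2, 3, 4, 5} of G is lost in G*, so some restoration work is required.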

V. THE MAXIMUM AND THE MINIMUM AMOUNT OF WORK FOR CLUSTERS GIVEN AS THE CC'S

In the earlier sections, the stability characteristics of many clustering methods are compared to one another. One of the results obtained is that the CC method is the least sensitive to errors in input among several families of clustering methods. We now derive certain results that are useful in estimating the amount of work needed for restoring the CC clusters when only the general characteristics of the initial graph and the kinds of changes to be made are known.

It is easily seen that if the CC's of a graph are close to being minimally connected and the changes consist primarily of deletions, then substantial perturbations in the clusters would result. On the other hand, if the components are maximally connected, then the same kinds of changes in the edges would lead to very little change in the original clusters. Properties of this nature are exploited to derive an upper and a lower bound for the amount of work required to restore clusters defined by the CC's.

Let the original graph G be a member of 𝒢(s, e, n1, n2, ..., ns) and suppose that a is the number of edge additions (which is |E* - E|) and d is the number of edge deletions (given by |E - E*|) made to G to obtain the modified graph G*. Further, let G* be an element of 𝒢*(G), where 𝒢*(G) is the set of graphs having the same vertices as G, each of which is obtainable from G by a additions and d deletions. Then the lower and the upper bounds correspond, respectively, to

min_{G* ∈ 𝒢*(G)} W(CC, G, G*)   and   max_{G* ∈ 𝒢*(G)} W(CC, G, G*).

In other words, given the parameters s, e, n1, n2, ..., ns, a, and d, we are interested in characterizing the G and G* that correspond to the best and the worst cases in terms of the amount of work needed to restore clusters defined as the CC's.

A. The Lower Bound

When edges are deleted from a CC, the component may be broken up into smaller components. To achieve the most stable case, the edges that are not essential to maintain the connectivity of a component should be chosen for deletion.

Now consider the effect of edge additions. The addition of edges can occur within or between components. Whenever an intercluster edge is added, one operation has to be performed to get rid of the edge. However, when intracluster edges are added, there is no disturbance in the clusters obtained. Thus, in obtaining the lower bound, if the number of additions is small enough (a ≤ Σ_{j=1}^{s} C(nj, 2) - e) to occur within components, then the edges for addition will be chosen in such a way


that the number of intercluster additions is zero. Consequently, we have the following proposition.

Proposition 4: For any a, d, and G ∈ 𝒢(s, e, n1, ..., ns),

W(CC) ≥ a'                              if d ≤ z,
W(CC) ≥ a' + (d - z) ⊖ (a - a')         if d > z,⁶

where

z = e - Σ_{j=1}^{s} (nj - 1)   and   a' = a ⊖ (Σ_{j=1}^{s} C(nj, 2) - e).

Proof: a' is the number of intercluster edge additions. Each such edge must be deleted. Thus, W(CC) ≥ a' in both cases (d ≤ z and d > z). Let ej denote the number of edges in gj. Since (nj - 1) edges are sufficient to keep gj connected, it is possible to delete zj = ej - (nj - 1) edges and still guarantee that gj remains connected. Summing over j from 1 to s, it is clear that the work W(CC) due to edge deletions equals zero when the number of deletions d ≤ z. After z deletions, we have s minimally connected components. Therefore, each deletion beyond the z deletions causes one unit of work. However, these deletions may be offset by the intracluster additions, which number (a - a').

When d ≤ z, it is easy to see that the deletions can occur in such a way that the remaining edges are enough to ensure that each component stays connected.

Lemma A.1 (Appendix A) shows that in order to ensure that, for any number of deletions (up to the number of intracluster additions that might be made to gj) made from a gj, gj remains connected, the set of edge additions chosen must not form a cycle. By using this lemma, a graph G and a sequence of changes to it for which the lower bound given in Proposition 4 holds can be identified for any d > z and a (Lemma A.2). Thus, the following proposition is obtained.

Proposition 5: The lower bound of Proposition 4 is the

tightest.
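The lower bound of Proposition 4 is easy to evaluate once s, e, n1, ..., ns, a, and d are known. The sketch below is a direct transcription of the proposition; the input format and the illustrative numbers are our own.

```python
def cc_lower_bound(n_sizes, e, a, d):
    """Lower bound of Proposition 4 for W(CC), given the component sizes
    n_1 >= ... >= n_s, the number of edges e, a additions, and d deletions."""
    def monus(x, y):                                    # x (-) y in the paper
        return x - y if x > y else 0
    z = e - sum(n - 1 for n in n_sizes)                 # edges that lie on cycles
    capacity = sum(n * (n - 1) // 2 for n in n_sizes) - e
    a_prime = monus(a, capacity)                        # forced intercluster additions
    if d <= z:
        return a_prime
    return a_prime + monus(d - z, a - a_prime)

# Two components on 6 and 4 vertices, 12 edges, 3 additions, 9 deletions (illustrative).
print(cc_lower_bound([6, 4], e=12, a=3, d=9))           # 2
```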

B. The Upper Bound

In order to obtain the upper bound, the additions must occur between components, whereas deletions must break up the components of the original graph. For ease of presentation, the number of edge additions is initially assumed to be zero. The general situation will be considered later in this section.

If H denotes a connected component of a graph from which certain edges are deleted, then in the perturbed graph either the vertices of H are still connected or they split into a number of components. By Lemma 1 it is clear that the larger the number of components in the perturbed graph, the more work is required to do the restoration. This relationship suggests that one of the critical steps in deriving the upper bound is to determine the maximum number of components that a graph with certain specified characteristics can have. Given n vertices and e edges, suppose we wish to construct a graph with the maximum number of components. An intuitive approach to the problem is to concentrate the edges among as few vertices as possible. That is, a single component which is just large enough to contain all the edges is formed and the remaining vertices are left isolated (this problem is, in fact, the dual of maximizing the number of edges with a fixed number of components and vertices).

⁶x ⊖ y = x - y if x > y and zero otherwise.

Fig. 3. An illustration of (i, k)-compact and (i, k)-packed graphs, with parts of sizes n1 = 6, n2 = 5, n3 = 3. (a) A (2, 3)-compact graph G. (b) A (2, 3)-packed graph G'.

In the problem considered above, no restriction has been placed on how the edges should be assigned to the vertices. But in our context the original graph G contains s components. After d edge deletions, the jth component gj becomes gj*. Since gj* may be disconnected, we refer to gj* as a part of G*. Our aim is to find a graph having the maximum number of connected components. This graph must have s parts (corresponding to the s components of G) such that the jth part consists of nj vertices, 1 ≤ j ≤ s, there is a total of (e - d) edges in the s parts, and nj ≥ nj+1 for 1 ≤ j ≤ (s - 1). Let the set of graphs with the above characteristics be denoted by 𝒢(s, e - d, n1, n2, ..., ns). A natural generalization of the scheme referred to earlier is to use the (e - d) edges to first fill the part with the largest number of vertices, then the next largest, and so on, until all the edges are exhausted. The characteristics of the resulting graph are formalized in the definitions below.

Definition 11: A graph H = (V, E), where |V| = n and |E| = e, is partially complete if C(n - 1, 2) + 1 ≤ e ≤ C(n, 2).

Definition 12: A part consisting of m vertices, of a graph H = (V, E) ∈ 𝒢, is k-compact if it contains a partially complete subgraph of (k + 1) vertices and m - (k + 1) isolated vertices.

Definition 13: A graph G = (V, E) ∈ 𝒢(s, e, n1, n2, ..., ns) is (i, k)-compact if the first (i - 1) parts, 1 ≤ i ≤ s, are complete subgraphs, part i is k-compact, 1 ≤ k ≤ ni, and parts i + 1, ..., s are isolated vertices.

These ideas are clarified by an example. Fig. 3(a) is an (i, k)-

compact graph in 𝒢(3, 19, 6, 5, 3), where i = 2, k = 3, and g2 is 3-compact. Since the gj's, j = 1, 2, 3, are the parts, none of the graphs in 𝒢 can have an edge from a vertex in gu to one in gv, u ≠ v. Clearly, the number of CC's in G is six and is given


by the expression i - 1 - k + Σ_{j=i}^{s} nj. In terms of this example, we wish to assert that the maximum number of components that any graph in 𝒢(3, 19, 6, 5, 3) can have is six. The corresponding claim for the general case is that an (i, k)-compact graph in 𝒢(s, e, n1, n2, ..., ns) has at least as many components as any graph in that set. The proof of this claim is provided in Lemma B.2 (Appendix B), which uses Lemma B.1 and Corollary B.1.
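The counting behind this claim can be sketched as a small procedure that builds the (i, k)-compact structure greedily: fill whole parts (largest first) as complete subgraphs, pack the leftover edges into a partially complete subgraph on as few vertices as possible, and leave everything else isolated. The function below is our rendering of that construction, whose component count is the maximum by Lemma B.2; it is not code from the paper.

```python
def max_components(n_sizes, edges_left):
    """Number of components of the (i, k)-compact graph in G(s, edges_left, n_1, ..., n_s);
    the part sizes n_sizes must be given in nonincreasing order (Definitions 11-13)."""
    C = lambda n: n * (n - 1) // 2
    comps, remaining = 0, edges_left
    for n in n_sizes:
        if remaining >= C(n):              # whole part becomes a complete subgraph
            comps += 1
            remaining -= C(n)
        elif remaining == 0:               # remaining parts stay as isolated vertices
            comps += n
        else:                              # k-compact part: k + 1 vertices carry the edges
            k = 1
            while C(k + 1) < remaining:
                k += 1
            comps += 1 + (n - (k + 1))
            remaining = 0
    return comps

# The parameters of Fig. 3(a): part sizes 6, 5, 3 and 19 edges.
print(max_components([6, 5, 3], 19))       # 6
```

On the parameters of Fig. 3(a) it returns six components, as stated above.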

We now specify a graph in the set 𝒢(s, e, n1, n2, ..., ns) such that, after d edge deletions are made, the resulting graph requires the maximum amount of work for restoration. It turns out that the graph in 𝒢(s, e, n1, n2, ..., ns) that attains the upper bound is the one in which as many of the larger parts as possible are maximally connected and the remaining parts are minimally connected. The following definitions make the structure of the desired graph precise.

Definition 14: Let g = (V', E'), where |V'| = m and |E'| = e, be a component of a graph. Suppose for some k, 2 ≤ k ≤ (m - 1), C(k, 2) + 1 ≤ e - (m - k - 1) ≤ C(k + 1, 2). Then g is k-packed if it contains a partially complete subgraph of (k + 1) vertices, a minimally connected subgraph on the remaining (m - k - 1) vertices, and exactly one edge between the two subgraphs.

Definition 15: A graph G = (V, E) ∈ 𝒢(s, e, n1, n2, ..., ns) is (i, k)-packed if the first (i - 1) components, 1 ≤ i ≤ s, are complete subgraphs, component i is k-packed, 2 ≤ k ≤ ni, and components i + 1, ..., s are minimally connected.

Fig. 3(b) illustrates an (i, k)-packed graph in 𝒢(3, 19, 6, 5, 3), where i = 2, k = 3, and g2 is k-packed. It is easy to see that if G' is the initial graph and the number of deletions d is three or less, then the edge deletions can be made from the minimally connected regions of g'2 and g'3. This causes one unit of work for each deletion. If d > 3 then the deletions can be made such that the resulting graph consists of the maximum possible number of components that a graph with the remaining number of edges can have. Thus, an (i, k)-packed graph should lead to the worst case for W(CC). This result is stated next.

Proposition 6: Let G = (V, E) ∈ 𝒢(s, e, n1, ..., ns) be an (i, k)-packed graph. Then for any given number of deletions d the maximum amount of work required for the restoration is attained for G.

Proof: See Appendix C.

We now present the expressions for the upper bound for the

different possible cases of d. Initially, each deletion causes one unit of work. Then the (k + 1)st vertex of the ith component is isolated. The isolation of this vertex may involve a number of deletions. The other cases consider situations in which some edges are to be deleted from complete subgraphs. For these cases, the amount of work is expressed as a function of the number of edges deleted from a complete subgraph and the number of vertices in the complete subgraph.

Proposition 7: Let G ∈ 𝒢(s, e, n1, n2, ..., ns) be such that for some i, 1 ≤ i ≤ s, and for some k, 2 ≤ k ≤ ni, e satisfies

Σ_{j=1}^{i-1} C(nj, 2) + C(k, 2) + 1 ≤ e - p ≤ Σ_{j=1}^{i-1} C(nj, 2) + C(k + 1, 2),   (1)

where p is the number of edges in the minimally connected portions of G.

If the perturbation involves d edge deletions, then

W(CC) ≤ d   if d ≤ p   (deletions from the minimally connected portion);

W(CC) ≤ p   if p < d < q   (further deletions remove edges between the (k + 1)st vertex and the complete subgraph of k vertices in the ith part);

W(CC) ≤ p + 1   if d = q   (the (k + 1)st vertex above is isolated);

W(CC) ≤ p + 1 + f(k, d - q)   if q < d ≤ b_{i-1}   (deletions remove edges of the complete subgraph of size k in the ith part);

W(CC) ≤ Σ_{j=t+1}^{s} (nj - 1) + f(nt, d - bt)   if bt < d ≤ b_{t-1}   (deletions remove edges of the tth part for some t, 1 ≤ t ≤ i - 1);

where

bt = e - Σ_{j=1}^{t} C(nj, 2),

q = b_{i-1} - C(k, 2)   (the number of edges not in the maximally connected sections of G),

for t, 0 ≤ t ≤ i - 1, and f(x, y) is the minimum amount of work needed to restore a cluster, which is originally a complete subgraph of size x, after y deletions occur. f(x, y) is shown in the lemma of Appendix D to be at most 1 + 4y/(2x - 1), which says that for each deletion the amount of work for restoration is, roughly, (2/x).

Proof: Obvious, since an (i, k)-packed graph satisfies the inequality (1).

Suppose the perturbation involves both edge additions and edge deletions. The upper bound for the amount of work due to additions will be achieved if the perturbations leading to edge additions occur between components, as every edge so added must finally be removed. Thus, if the number of edge additions is not so large, all additions can occur between components, and the bounds of Proposition 7 will be increased by the quantity a', the number of intercluster additions. When the number of intracluster edge additions (a - a') is not zero, it is easy to see that the maximum work will be attained for an (i', k')-packed graph of (e + a - a') edges. Consequently, the values of p, q, and bt, 1 ≤ t ≤ i - 1, of Proposition 7 will have to be appropriately modified.

C. A Comparison of the Bounds

In this section the results concerning the lower and the upper bounds are summarized. Given a graph G ∈ 𝒢(s, e, n1, ..., ns), let the changes involve d deletions and a additions. It was mentioned earlier that to achieve the lower bound, whenever possible, the additions must be within components; for the upper bound, however, the additions would be between components. For convenience, assume that the numbers of additions of these two types are held constant independently of each other. That is, let a' be the number of additions between components and (a - a') the number of additions within.

Fig. 4 characterizes the bounds of the CC definition relative to the actual amount of work needed for the MCS definition. The line for W(MCS) shows that each deletion increases the amount of work by one.


Fig. 4. Amount of work for graphs in 𝒢(s, e, n1, n2, ..., ns) for d deletions, a' intercluster additions, and (a - a') intracluster additions; the horizontal axis is the number of deletions d. Here p', (e - q'), and b'_j, 1 ≤ j ≤ i' - 1, are, respectively, the number of edges in the minimally connected sections, the maximally connected sections, and the first (i' - 1) components of an (i', k')-packed graph.

The upper bound for W(CC) overlaps with the line for W(MCS) up to a certain point, and then increases at a decreasing rate. The values of p', q', and b'_j, 1 ≤ j ≤ i' - 1, are obtained using the steps in Proposition 7, but in reference to an (i', k')-packed graph in 𝒢(s, e + a - a', n1, ..., ns). In other words, when (a - a') intracluster edges are added to an (i, k)-packed graph of e edges (in such a way that the desired structure is retained), the values of i and k may be modified to i' and k', respectively. The lower bound for W(CC) initially remains at a' and, after z + (a - a') deletions have occurred, increases at the same rate as W(MCS), provided (a - a') ≤ Σ_{j=1}^{s} (nj - 1). If (a - a') > Σ_{j=1}^{s} (nj - 1), then, as shown in Proposition 5, W(CC) remains at a' irrespective of d.

VI. CONCLUSIONS

A measure of work that determines the disturbance in a set of clusters as the minimum number of operations required to restore the set of modified clusters to the original ones is used to compare the stability of several types of clustering methods. A number of families of graph theoretic cluster defining schemes (such as those proposed in [19], [25], and [26]), which are identified to be adjacency oriented, are shown to include two well-known methods of defining clusters: connected components and maximal complete subgraphs. The connected component method is shown to be the most stable of all clustering methods in each of these families. The maximal complete subgraphs are found to represent the worst possible case in terms of stability. Furthermore, it is shown that certain families of adjacency oriented clustering methods are such that, as one proceeds from the method producing the most narrow clusters (maximal complete subgraphs) to those producing relatively broader clusters, the clustering process remains at least as stable as any method in the previous stages. A lower and an upper bound for the amount of work, when clusters are defined as the connected components, are derived. These bounds are the tightest possible. Future research may consider the possibility of using similar techniques to relate the stability of nonadjacency oriented clustering methods to the results of this work. One promising approach is to compare the "average" stability of such methods to that of the CC method.

APPENDIX A

Lemma A.1: Let H̄ = (V, Ē) and H'' = (V, E'') be, respectively, the complement graph and a minimally connected subgraph of H = (V, E). If H̄ has no cycles, then there exists a set of edges T in H'' such that, whenever |T| ≤ |Ē|, H** = (V, Ē ∪ E'' - T) is connected.

Proof: Let t = |T|. Consider a set of t edges in H̄. Add any one of these edges to H''. Exactly one cycle is created in H''. This cycle must contain an edge originally in H''. Clearly, this edge can be deleted (i.e., assigned to T) without affecting the connectivity of H''. This process is then repeated for the t edges in Ē. Each time a cycle is created, by the hypothesis that H̄ has no cycles, the cycle must contain at least one edge originally in H'', and this edge can be deleted while maintaining the connectivity of H**.

Lemma A.2: Consider gj, the jth component of G. Let ḡj

denote the complement graph of gj, 1 ≤ j ≤ s, and Ḡ the union of all the ḡj's. Assume that G is chosen in such a way that if there is a sufficient number of edges in Ḡ then each ḡj is connected; otherwise, each ḡj has no cycles. Then it is possible to make a additions and d > z deletions such that W(CC) = a' + (d - z) ⊖ (a - a'), where z = e - Σ_{j=1}^{s} (nj - 1) and a' = a ⊖ (Σ_{j=1}^{s} C(nj, 2) - e).

Proof: Let aj and dj be, respectively, the numbers of additions and deletions made to gj. Clearly, Σ_{j=1}^{s} aj = a - a'. The first z deletions are made from G such that the resulting graph G'' is minimally connected. That is, as indicated in Proposition 4, zj deletions transform gj into a minimally connected component gj''. We now consider the effect of aj additions (from ḡj) and (dj - zj) deletions (from gj'') on gj''.

Consider gj'' as corresponding to H'' of Lemma A.1. If the number of edges in ḡj is less than or equal to (nj - 1), then let ḡj correspond to H̄ of Lemma A.1. Otherwise, let a minimally connected subgraph of ḡj correspond to H̄. If aj ≥ (nj - 1), then these added edges are sufficient to make the jth component connected. Thus, no work is needed irrespective of dj. If aj < (nj - 1), then either aj ≥ (dj - zj) or aj < (dj - zj). In the former case, by Lemma A.1 the graph obtained after the additions and deletions remains connected. Thus, no work is involved. In the latter case, the graph can remain connected for up to (aj + zj) deletions and aj additions (by Lemma A.1). Each deletion after that increases the number of components in the subgraph by one. Thus, the total amount of work is Σ_{j=1}^{s} (dj - zj - aj), which is d - z - (a - a'). #

APPENDIX B

Problem of Lemma B.1

Consider the following problem. Given Lj, 1 ≤ j ≤ n, such that L1 ≥ L2 ≥ ... ≥ Ln,

maximize Σ_{j=1}^{n} (xj² + xj)   (A.1)

subject to the constraints Σ_{j=1}^{n} xj = c, for some constant c ≤ Σ_{j=1}^{n} Lj, and 0 ≤ xj ≤ Lj for each j.

Lemma B.1: Let c = Σ_{j=1}^{i-1} Lj + b, for some i, 1 ≤ i ≤ n, and


some b, 1 ≤ b ≤ Li. Then the following is a maximum solution to (A.1):

xj = Lj for 1 ≤ j ≤ i - 1,
xi = b, and xj = 0 for i + 1 ≤ j ≤ n.   (A.2)

Remark: It is clear that both (x1, ..., xs, ..., xt, ..., xn) with xs = 0 and xt ≠ 0 and (x1, ..., xt, ..., xs, ..., xn) have the same value of Σ_{j=1}^{n} (xj² + xj). Thus, we shall assume that all nonzero values occur at the beginning.

Proof: Suppose (A.2) is not a maximum. Then let

y1, y2, ..., yn   (A.3)

be a maximum solution different from (A.2). Let s be the smallest j such that yj < Lj. Let t > s be the smallest j such that yj > 0.

Case 1: ys ≥ yt.

Let ys' = ys + 1 and yt' = yt - 1. Then by simple computation it can be shown that (y1, ..., y_{s-1}, ys', y_{s+1}, ..., y_{t-1}, yt', y_{t+1}, ..., yn) is a better solution to (A.1) than (A.3), contradicting the maximality of (A.3).

Case 2: ys < yt (ys ≠ 0 by the remark).

Let a new solution be formed with ys' = yt + 1 and yt' = ys - 1.

None of the constraints is violated, since Ls ≥ Lt and yt ≤ Lt. Again, it is a simple matter to show that the solution with ys' and yt' replacing ys and yt is better than (A.3). #

Corollary B.1: Suppose problem (A.1) is modified so that Σ_{j=1}^{n} xj ≤ c = Σ_{j=1}^{i-1} Lj + b. Then solving this modified problem is equivalent to solving (A.1), since Σ_{j=1}^{n} (xj² + xj) is a nondecreasing function. Furthermore, the maximum value is

Σ_{j=1}^{i-1} Lj² + b² + Σ_{j=1}^{i-1} Lj + b.
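Lemma B.1 and Corollary B.1 say that this "fill the largest limits first" allocation maximizes the objective. A brute-force spot check on a tiny instance is given below; it is only a sanity check under our reading of (A.1), not part of the proof.

```python
from itertools import product

def greedy_fill(L, c):
    """The allocation of Lemma B.1: fill x_1, x_2, ... to their limits in order."""
    x = []
    for limit in L:
        take = min(limit, c)
        x.append(take)
        c -= take
    return x

def objective(x):
    return sum(v * v + v for v in x)

# Brute-force check of Lemma B.1 on a small instance (L must be nonincreasing).
L, c = [4, 3, 2], 6
best = max(objective(x) for x in product(*(range(l + 1) for l in L)) if sum(x) == c)
assert objective(greedy_fill(L, c)) == best
print(greedy_fill(L, c), best)    # [4, 2, 0] 26
```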

Lemma B.2: Let G = (V, E) ∈ 𝒢(s, e, n1, n2, ..., ns) be an (i, k)-compact graph. Then e satisfies

C(k, 2) + 1 ≤ e - Σ_{j=1}^{i-1} C(nj, 2) ≤ C(k + 1, 2),   (A.4)

and G has at least as many components as any graph in 𝒢(s, e, n1, n2, ..., ns).

Proof: It is clear that inequality (A.4) is satisfied by G, since a complete subgraph on x vertices has C(x, 2) edges. By simple computation, G has i - 1 - k + Σ_{j=i}^{s} nj components. To show that this is the maximum, suppose that a graph different from G in the set 𝒢(s, e, n1, n2, ..., ns) has a greater number of components. Let the jth part of this graph have nj - xj components, 0 ≤ xj ≤ (nj - 1). Then, by hypothesis,

Σ_{j=1}^{s} (nj - xj) > i - 1 - k + Σ_{j=i}^{s} nj.

This implies that

Σ_{j=1}^{s} xj ≤ Σ_{j=1}^{i-1} (nj - 1) + (k - 1).   (A.5)

By Theorem 2.3 in Deo [23], a graph with nj vertices and (nj - xj) components has no more than C(xj + 1, 2) edges. Thus, there are at most C(xj + 1, 2) edges in the jth part. Consequently, the number of edges in the graph satisfies

e ≤ Σ_{j=1}^{s} C(xj + 1, 2) = (1/2) (Σ_{j=1}^{s} xj² + Σ_{j=1}^{s} xj).

Given inequality (A.5) and assuming without loss of generality that nj ≥ nj+1, 1 ≤ j ≤ (s - 1), by Corollary B.1 we have

e ≤ (1/2) {Σ_{j=1}^{i-1} (nj - 1)² + (k - 1)² + Σ_{j=1}^{i-1} (nj - 1) + (k - 1)} = Σ_{j=1}^{i-1} C(nj, 2) + C(k, 2).

This contradicts the inequality (A.4).

APPENDIX C

Proof of Proposition 6

Case 1: 1 ≤ d ≤ p. That is, the number of edge deletions is not more than the number of edges in the minimally connected parts of G. Consequently, each deletion in this range increases the number of components by one. Clearly, this is the maximum possible.

Case 2: p < d ≤ e. It is clear that the deletion of the d edges can be made from G such that the modified graph is an (i, k)-compact graph in 𝒢(s, e - d, n1, n2, ..., ns). By Lemma B.2 this graph has the maximum number of components. By Lemma 1 the number of edge additions required for restoration is then the maximum. #

APPENDIX D

Lemma: Given a complete graph G on n vertices, suppose i edges are deleted from G. If f(n, i) is the minimum number of edge additions required to make the resulting graph connected, then

f(n, i) ≤ 1 + (4i/(2n - 1)).

Proof: It is clear that f(n, i) will be the largest if the deletions are made in such a way that the perturbed graph has the largest number of components. From Lemma B.2 we can deduce that a graph with C(n, 2) - i edges and n vertices that has the maximum number of components has (n - k) isolated vertices and a partially complete subgraph on the remaining k vertices, where C(k, 2) + 1 ≤ C(n, 2) - i ≤ C(k + 1, 2). Thus, f(n, i) ≤ n - k. Expressing k as a function of n and i, we have

f(n, i) ≤ n - ⌈{-1 + √[1 + 4(n(n - 1) - 2i)]}/2⌉
        ≤ n + 1/2 - √(1 + 4n² - 4n - 8i)/2
        = n + 1/2 - √((2n - 1)² - 8i)/2
        = n + 1/2 - √[(2n - 1 - 2√(2i))(2n - 1 + 2√(2i))]/2
        ≤ n + 1/2 - (4n² - 4n + 1 - 8i)/(4n - 2),   since √(ab) ≥ 2ab/(a + b) for a, b > 0,
        = 1 + 4i/(2n - 1) ≈ 1 + i(2/n).
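The bound can be spot-checked numerically. The sketch below computes, under our reading of the packing argument above, the worst-case number of additions needed after i deletions from a complete graph on n vertices, and verifies that it never exceeds 1 + 4i/(2n - 1).

```python
def worst_case_restore_cost(n, i):
    """Worst case of f(n, i): after i deletions chosen to maximize the number of
    components, the C(n, 2) - i remaining edges are packed onto the smallest
    possible number m of vertices; the other n - m vertices are isolated and
    n - m additions reconnect the graph."""
    C = lambda v: v * (v - 1) // 2
    remaining = C(n) - i
    if remaining <= 0:
        return n - 1                     # nothing left: rebuild a spanning tree
    m = 1
    while C(m) < remaining:
        m += 1
    return n - m

# Spot check of the closed-form bound of the Appendix D lemma.
for n in range(3, 12):
    for i in range(0, n * (n - 1) // 2 + 1):
        assert worst_case_restore_cost(n, i) <= 1 + 4 * i / (2 * n - 1)
print("f(n, i) <= 1 + 4i/(2n - 1) holds on all sampled cases")
```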


REFERENCES

[1] R. E. Bonner, "On some clustering techniques," IBM J. Res. Develop., vol. 8, pp. 22-32, 1964.
[2] G. H. Ball, "Data analysis in social sciences: What about details," in Proc. AFIPS, FJCC. New York: Macmillan, 1965, pp. 533-559.
[3] S. Watanabe, "A unified view of clustering algorithms," in Information Processing 71. Amsterdam, The Netherlands: North-Holland, 1972, pp. 149-154.
[4] S. C. Johnson, "Hierarchical clustering schemes," Psychometrika, vol. 32, pp. 241-254, 1967.
[5] G. N. Lance and W. T. Williams, "A general theory of classificatory sorting strategies: I. Hierarchical systems," Comput. J., vol. 9, pp. 373-382, 1967.
[6] G. N. Lance and W. T. Williams, "A general theory of classificatory sorting strategies: II. Clustering systems," Comput. J., vol. 10, pp. 271-277, 1967.
[7] R. M. Cormack, "A review of classification," J. Roy. Stat. Soc. Series A, vol. 134, pp. 321-367, 1971.
[8] R. F. Ling, "On the theory and construction of k-clusters," Comput. J., vol. 15, pp. 326-332, 1972.
[9] N. Jardine and R. Sibson, Mathematical Taxonomy. New York: Wiley, 1971.
[10] L. J. Hubert, "Some applications of graph theory to clustering," Psychometrika, vol. 39, pp. 283-309, 1974.
[11] C. T. Yu, "A clustering algorithm based on user queries," J. Amer. Soc. Inform. Sci., vol. 25, pp. 218-226, 1974.
[12] W. H. E. Day, "Validity of clusters formed by graph-theoretic cluster methods," Math. Biosci., vol. 36, pp. 299-317, 1977.
[13] D. W. Matula, "Graph theoretic techniques for cluster analysis algorithms," in Advanced Seminar on Classification and Clustering, J. Van Ryzin, Ed. New York: Academic, 1977, pp. 95-129.
[14] C. T. Yu, "The stability of two common matching functions in classification with respect to a proposed measure," J. Amer. Soc. Inform. Sci., vol. 27, pp. 248-254, 1976.
[15] D. G. Corneil and M. E. Woodward, "A comparison and evaluation of graph theoretical clustering techniques," INFOR, vol. 16, pp. 74-89, 1978.
[16] R. R. Sokal and F. J. Rohlf, "The comparison of dendrograms by objective methods," Taxon, vol. 11, pp. 33-40, 1962.
[17] D. M. Jackson, "Comparison of classifications," in Numerical Taxonomy, A. J. Cole, Ed. New York: Academic, 1969, pp. 91-111.
[18] P. K. T. Vaswani, "A technique for cluster emphasis and its application to automatic indexing," in Information Processing 68. Amsterdam, The Netherlands: North-Holland, 1968, pp. 1300-1303.
[19] C. C. Gotlieb and S. Kumar, "Semantic clustering of index terms," J. Ass. Comput. Mach., vol. 15, pp. 493-513, 1968.
[20] C. T. Zahn, "Graph theoretical methods for detecting and describing gestalt clusters," IEEE Trans. Comput., vol. C-20, pp. 68-86, 1971.
[21] W. M. Rand, "Objective criteria for the evaluation of clustering methods," J. Amer. Stat. Ass., vol. 66, pp. 846-850, 1971.
[22] D. M. Jackson, "An error analysis for functions of qualitative attributes with applications to information retrieval," in Software Engineering-COINS III, vol. 2, J. T. Tou, Ed. New York: Academic, 1969, pp. 71-87.
[23] N. Deo, Graph Theory with Applications to Engineering and Computer Science. Englewood Cliffs, NJ: Prentice-Hall, 1974, pp. 22-23.
[24] F. Harary, Graph Theory. Reading, MA: Addison-Wesley, 1969.
[25] N. Jardine and R. Sibson, "The construction of hierarchic and non-hierarchic classifications," Comput. J., vol. 11, pp. 177-184, 1968.
[26] N. Jardine and R. Sibson, "A model for taxonomy," Math. Biosci., vol. 2, pp. 465-482, 1968.

Vijay V. Raghavan received the B.Tech. degree from the Indian Institute of Technology, Madras, India, in 1968, the M.B.A. degree from McMaster University, Hamilton, Ont., Canada, in 1970, and the Ph.D. degree in computer science from the University of Alberta, Edmonton, Alta., Canada, in 1978.

From 1970 to 1973 he was a Lecturer in Business Administration at Saint Mary's University, Halifax, Nova Scotia. Currently, he is an Assistant Professor of Computer Science at the University of Regina, Regina, Sask., Canada. His research interests include problems relating to automatic information retrieval and database management systems. He has published articles in the Association for Computing Machinery TODS, the Journal of the American Society for Information Science, and Information Processing Letters.

Dr. Raghavan is a member of the Association for Computing Machinery and the Canadian Information Processing Society.

C. T. Yu was born on August 31, 1948. He received the B.Sc. degree from Columbia University, New York, NY, in 1970 and the M.Sc. and Ph.D. degrees from Cornell University, Ithaca, NY, in 1972 and 1973, respectively.

From 1973 to 1977 he was an Assistant Professor of Computing Science at the University of Alberta, Edmonton, Canada, and in 1977 became an Associate Professor. Since 1978 he has been an Associate Professor in the Department of Information Engineering at the University of Illinois at Chicago Circle. His research interests include database management and information retrieval. He has published articles in the Journal of the Association for Computing Machinery, the Association for Computing Machinery TODS, Information Processing and Management, Information Systems, Information Processing Letters, the Journal of the American Society for Information Science, and Theoretical Computer Science. He has also presented papers at numerous conferences.
