
Transcript of Synthesizing Statistical Knowledge from Incomplete Mixed-Mode Data

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. PAMI-9, NO. 6, NOVEMBER 1987

Synthesizing Statistical Knowledge from Incomplete Mixed-Mode Data

ANDREW K. C. WONG, MEMBER, IEEE, AND DAVID K. Y. CHIU

Abstract-The difficulties in analyzing and clustering (synthesizing) multivariate data of the mixed type (discrete and continuous) are largely due to: 1) nonuniform scaling in different coordinates, 2) the lack of order in nominal data, and 3) the lack of a suitable similarity measure. This paper presents a new approach which bypasses these difficulties and can acquire statistical knowledge from incomplete mixed-mode data. The proposed method adopts an event-covering approach which covers a subset of statistically relevant outcomes in the outcome space of variable-pairs. Once the covered event patterns are acquired, subsequent analysis tasks such as probabilistic inference, cluster analysis, and detection of event patterns for each cluster based on the incomplete probability scheme can be performed. There are four phases in our method: 1) the discretization of the continuous components based on a maximum entropy criterion so that the data can be treated as n-tuples of discrete-valued features; 2) the estimation of the missing values using our newly developed inference procedure; 3) the initial formation of clusters by analyzing the nearest-neighbor distance on subsets of selected samples; and 4) the reclassification of the n-tuples into more reliable clusters based on the detected interdependence relationships. For performance evaluation, experiments have been conducted using both simulated and real-life data.

Index Terms-Cluster analysis, event-covering, incomplete probability scheme, mixed-mode data, probabilistic inference, statistical knowledge.

I. INTRODUCTION

A NEW challenge to computer-based pattern recognition is to detect probabilistic patterns from a database which is usually characterized by heterogeneous features of different types, including the mixed discrete and continuous type [1], [2]. This challenge arises from the needs of the decision-making process when management control and strategic planning are involved [3]. Such a process usually requires unstructured and semistructured decision-making using information from a database. Unlike structured decision-making, which often has well-defined objectives and is usually supported by the database schema and query language, unstructured and semistructured decision-making may have to select relevant information of which the decision-makers may not previously be aware. Hence, extracted knowledge in the form of statistical patterns (based on statistical and cluster analysis) will be very useful in rendering the information suitable for this kind of decision-making.

Manuscript received May 2, 1985; revised June 1, 1987. Recommended for acceptance by J. Kittler. This work was supported by the Natural Sciences and Engineering Research Council of Canada.

A. K. C. Wong is with the Department of Systems Design Engineering, University of Waterloo, Waterloo, Ont. N2L 3G1, Canada.

D. K. Y. Chiu is with the Department of Computing and Information Science, University of Guelph, Guelph, Ont., Canada.

IEEE Log Number 8716844.

However, while pattern analysis techniques such as cluster analysis on multivariate continuous data are well established [4] and methods to analyze discrete-valued data have been proposed [5]-[7], the problem of detecting clustering patterns in multivariate data of the mixed type remains mostly unsolved. The problem posed by the coexistence of continuous and discrete-valued features is obvious. Methods based on similarity measures are generally difficult, if not impossible, to apply to such data [4]. Alternative methodologies based on probabilistic modeling [8] require an extremely large amount of data. When discrete-valued variables are transformed into binary-valued variables [9], the transformation drastically increases the number of variables in the analysis. Also, the information that certain outcomes come from the same variable is lost in the subsequent analysis. Furthermore, when the data contain considerable noise which is irrelevant to the analysis, and when the parametric form of the probability distribution of the data is unknown, the task becomes even more difficult. Despite these difficulties, a practical method to detect clustering and statistical patterns in such data would be very useful and desirable.

What we propose here is a practical approach to circumvent these difficulties. In our method, a mixed-mode probability model is approximated by a discrete one. We first discretize the continuous components using a minimum-loss-of-information criterion. Treating a mixed-mode feature n-tuple as a discrete-valued one, we propose a new statistical approach for the synthesis of knowledge based on cluster analysis. The advantage of this method, as we shall see later, is that it requires neither scale normalization nor ordering of the discrete features. Hence, it bypasses two serious concerns in pattern recognition, namely: 1) the problem of nonuniform scaling in different feature coordinates, and 2) the lack of order in nominal data.

By synthesis of the data into statistical knowledge, we refer to the following processes: 1) synthesize and detect from the data inherent patterns which indicate statistical interdependency (between certain variables and/or a subset of their outcomes); 2) group the given data into inherent clusters based on these detected interdependencies; and 3) interpret the underlying patterns for each cluster identified. The method of synthesis is based on our newly developed event-covering approach [5], [6]. By event-covering, we mean covering or selecting a subset of statistically interdependent events in the outcome space of variable-pairs, disregarding whether or not the variables (considering the complete outcome set) are statistically interdependent. From the detected statistical interdependence patterns of the data, a probabilistic decision rule is used to group the data into clusters. Finally, again using event-covering, we can detect the interdependence patterns between the feature events and the detected subgroups.

Since the proposed method is based on a general pattern analysis technique on a set of sample observations, it can be applied to a broad spectrum of problem domains where a simple self-learning and automatic information selection capability is desirable. It can then play an important role in extending some of the existing decision-support and knowledge-based systems. It can be used to provide new knowledge of a problem domain, or to verify important interdependence relationships provided by human experts. Information thus obtained can also be used as additional knowledge to logical information in deductive databases [10], or to data partitioning in distributed databases [1], [2], [11].

For performance evaluation, the proposed method is applied to cluster incomplete data (or data with missing outcomes). The method has the following phases:

1) discretization of the continuous components based on the maximum entropy criterion;

2) estimation of the missing values in the data set using the developed inference method;

3) initiation of clusters by analyzing the distance and the nearest-neighbor characteristics of selected samples;

4) reclassification of the samples into more reliable clusters based on the statistical interdependence patterns of the samples.

II. MIXED-MODE DATA AND DISCRETIZATION OF THE CONTINUOUS COMPONENTS

A. Data Representation and Definitions

Before describing in detail the new approach to handling mixed-mode data by discretizing the continuous components, a few related definitions are introduced here. The representation of the data is similar to that in the relational model of databases, where the data are represented as tuples.

Definition 1: Let $x = (x_1, x_2, \ldots, x_p, \ldots, x_q, \ldots, x_n)$ be an n-tuple ($1 \le p < q \le n$) such that the values of $x_1, x_2, \ldots, x_p$ are continuous, $x_{p+1}, x_{p+2}, \ldots, x_q$ are discrete ordinal, and $x_{q+1}, x_{q+2}, \ldots, x_n$ are discrete nominal. Then $x$ is called a mixed-mode n-tuple, and the corresponding random n-tuple is represented as $X = (X_1, X_2, \ldots, X_n)$, where $X_k$, $1 \le k \le n$, is a continuous or discrete-valued variable.
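To make Definition 1 concrete, the following is a minimal sketch (not from the paper) of how a mixed-mode n-tuple and its type schema might be represented in code; the field layout mirrors the simulated data of Section IV, and all names are our own assumptions.

```python
# Minimal sketch: a mixed-mode n-tuple stored as a plain tuple of values,
# with a separate schema recording the type of each position.
from typing import Optional, Union

Value = Optional[Union[float, int, str]]   # None marks a missing value

# Hypothetical schema for n = 7 (three continuous, two ordinal, two nominal),
# as in the simulated data of Section IV.
SCHEMA = ("continuous", "continuous", "continuous",
          "ordinal", "ordinal", "nominal", "nominal")

x: tuple[Value, ...] = (1.2, 5.8, 3.1, 6, 1, "F", None)   # last value missing
assert len(x) == len(SCHEMA)
```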

Definition 2: Let the interval $[a, b]$ be the range space of the continuous random variable $X_j$ in $X$. A partition on $X_j$ is defined as a set of $L_j$ intervals $\{[z_0, z_1], [z_1, z_2], \ldots, [z_{L_j-1}, z_{L_j}]\}$, where $z_0 = a$, $z_{L_j} = b$, and $z_{i-1} < z_i$ for $i = 1, 2, \ldots, L_j$.

Definition 3: In association with the partition, the boundary set is defined to be the set of ordered end-points $z_0, z_1, \ldots, z_{L_j}$ which delimit the $L_j$ intervals. $\{a_{jr} \mid r = 1, 2, \ldots, L_j\}$ then denotes a set of quanta such that $z_{r-1} < a_{jr} < z_r$.

Definition 4: A finite probability scheme $\psi$ on the partition is defined to be the set of probability values $\{P_i\}$ such that

$$P_i = \int_{z_{i-1}}^{z_i} f(X_j)\, dX_j = F(z_i) - F(z_{i-1}), \qquad i = 1, 2, \ldots, L_j,$$

where $f$ and $F$ are the probability density function and the cumulative probability function of $X_j$ on $[a, b]$, respectively, and $z_{i-1}$ and $z_i$ are two consecutive elements in the ordered boundary set.

With these definitions in mind, discretization is referred to as the process that produces, from the range of the continuous random variable $X_j$, the partition of $L_j$ intervals. Thus, associated with the intervals are a boundary set and a quanta set. From the probability function and the partition, a finite probability scheme is obtained.
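As a small illustration of Definitions 2-4, the sketch below (our own helper, not the authors') estimates the finite probability scheme of a partition by relative frequencies, which is how an unknown density is handled in practice in Section II-B.

```python
import bisect

def finite_probability_scheme(values, boundaries):
    """Empirical finite probability scheme {P_i} for a partition.

    `boundaries` is the ordered boundary set z_0 < z_1 < ... < z_L; each value
    is assigned to the interval (z_{i-1}, z_i] (values at or below z_0 fall in
    the first interval), and P_i is the relative frequency of interval i.
    """
    counts = [0] * (len(boundaries) - 1)
    for v in values:
        i = bisect.bisect_left(boundaries, v, lo=1) - 1
        counts[min(max(i, 0), len(counts) - 1)] += 1
    total = float(len(values))
    return [c / total for c in counts]

# Tiny illustration with two intervals on [0.0, 2.0]:
print(finite_probability_scheme([0.2, 0.5, 1.1, 1.7, 1.9], [0.0, 1.0, 2.0]))
# -> [0.4, 0.6]
```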

B. Maximum Marginal Entropy Discretization and Partitioning

It is clear that the number of ways to discretize the outcomes of a continuous variable is infinite. A common procedure is to divide the range into equal-length intervals. When the outcomes are not evenly distributed, a large amount of information may be lost after discretization using this method (see Section II-C). To minimize the information loss, we adopt the following partitioning method based on maximum entropy [12].

Formally, let $\Psi$ be the set of all finite probability schemes that can be derived by all the discretization processes for a fixed probability function. The maximum entropy discretization problem is to find a $\psi^* \in \Psi$ such that

$$H(\psi^*) \ge H(\psi) \quad \forall\, \psi \in \Psi,$$

where $H$ is Shannon's entropy function. This method will ensure maximum entropy, with minimum loss of information, after discretization.

Since high-dimensional discretization is highly combinatoric, an approximation using marginal entropy is proposed [13]. In practice, we are generally given an ensemble of samples with their probability distribution unknown. The discretization problem thus becomes a partition problem on the observed values of a variable $X_j$ (where some of the outcomes may be repeated). The intervals on the range of a variable $X_j$ are chosen so as to maximize the marginal entropy calculated on the finite probability scheme. Since the algorithm is still combinatoric in nature, a local improvement technique is introduced to furnish a computationally efficient algorithm [14].

When selecting the number of intervals $L_j$ for a continuous variable $X_j$, it is obvious that in general the more intervals there are, the less information will be lost. However, the reliability of the probability estimation based on an $L_j$-interval partitioning is affected by the sample size. Hence a rule of thumb [15] is adopted for determining the upper bound of $L_j$. Since second-order statistics are required in the probability estimation, the sample size for reliable estimation should be greater than $A \times L_j^2$ for all $X_j$, where $A$ can be taken as 3 for a liberal estimation. Subject to this upper bound, the values of $L_j$ in practice will depend on the size of available memory and computational resources.

The partitioning algorithm, using maximum marginal entropy, can be divided into two phases: 1) initial detection of interval boundaries, and 2) improvement of the interval boundaries. The first phase is devised to find intervals such that the sample frequency in each interval is as even as possible. The second phase is introduced to improve the interval boundaries iteratively by increasing the entropy value through local perturbation. It iterates until no improvement can be made. In practice, if the observed continuous data take distinct values, then iteration is not necessary.

Even though maximum entropy discretization may not produce a unique solution for some data sets, the heuristic algorithm [14] we adopted can arbitrarily select a unique set of maximum entropy intervals when more than one such set exists [16]. The partitioning algorithm can in principle be applied to ordinal-valued variables so as to reduce the number of distinct outcomes in the analysis. However, when the sample size and computational resources are sufficient, there is no need for such an application. Despite the algorithm's heuristic nature, it is simple, computationally acceptable, and gives good results.
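A minimal sketch of the two-phase partitioning just described, under our own simplifying assumptions: phase 1 places boundaries at equal-frequency cut points of the sorted sample, and phase 2 greedily moves each interior boundary to an observed midpoint whenever that raises the marginal entropy. The exact perturbation scheme of [14] is not reproduced here, so this illustrates the idea rather than the authors' algorithm.

```python
import math

def entropy(counts):
    """Shannon entropy of a frequency vector."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def interval_counts(sorted_xs, bounds):
    """Counts of samples falling in (z_{i-1}, z_i]; first interval closed below."""
    counts, i = [0] * (len(bounds) - 1), 0
    for x in sorted_xs:
        while i < len(counts) - 1 and x > bounds[i + 1]:
            i += 1
        counts[i] += 1
    return counts

def max_marginal_entropy_partition(xs, L):
    xs = sorted(xs)
    n = len(xs)
    # Phase 1: equal-frequency initialization (boundaries midway between cut pairs).
    bounds = [xs[0]]
    for i in range(1, L):
        k = round(i * n / L)
        bounds.append((xs[k - 1] + xs[k]) / 2.0)
    bounds.append(xs[-1])
    # Phase 2: local perturbation of each interior boundary.
    midpoints = [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
    improved = True
    while improved:
        improved = False
        for i in range(1, L):
            best = entropy(interval_counts(xs, bounds))
            for cand in midpoints:
                if bounds[i - 1] < cand < bounds[i + 1]:
                    trial = bounds[:i] + [cand] + bounds[i + 1:]
                    h = entropy(interval_counts(xs, trial))
                    if h > best + 1e-12:
                        best, bounds, improved = h, trial, True
    return bounds
```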

C. Comparison of the Maximum Entropy and the Equal-Width Discretization Approaches

To evaluate the proposed maximum entropy discretization approach for discrete probability distribution estimation, we compare it with the approach based on equal-width interval discretization. Given an ensemble of sample observations with unknown probability density function, the number of observations falling into each interval is a maximum likelihood estimate of the probability density function [17]. This estimate is a maximum likelihood estimate irrespective of how the intervals are chosen, given that the number of intervals is fixed. To illustrate the difference between the two approaches, we perform the following experiments.

1) Maximum Entropy Discretization Experiment 1: Consider a variable X for which the following 30 sample values are observed, sorted in increasing order:

Sample  X value    Sample  X value    Sample  X value
   1      0.1        11      4.0        21      9.5
   2      0.9        12      4.5        22      9.5
   3      1.5        13      4.9        23      9.7
   4      2.0        14      5.5        24      9.7
   5      2.8        15      6.0        25     10.0
   6      3.2        16      7.3        26     10.3
   7      3.3        17      8.5        27     10.5
   8      3.5        18      8.8        28     11.1
   9      3.7        19      9.1        29     11.8
  10      3.8        20      9.2        30     12.9

[Figure: two histograms of these 30 observations over the range [0.0, 13.0]; the maximum entropy discretization places interval boundaries near 3.25, 4.25, 8.95, and 9.85, while the equal-width discretization places them at 2.6, 5.2, 7.8, and 10.4.]

Fig. 1. Comparing histograms based on maximum entropy and equal-width discretization.

The probability distribution of X can be estimated from the histogram constructed from these values. Let us arbitrarily select the number of intervals to be 5 and let the range of the outcomes of X be [0.0, 13.0]. The maximum entropy method then assigns 6 samples to each of the five intervals, whereas the equal-width interval method assigns the interval width to be 2.6. The histograms for probability estimation are plotted in Fig. 1. Comparing the two methods, we observe that the maximum entropy method gives a more precise estimate than the equal-width interval method. It is expected that the precision would increase with the number of discretization intervals, given that the sample size is large enough.
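The comparison of Fig. 1 can be reproduced approximately with the short sketch below (our own illustration); the equal-frequency boundaries are computed from the sorted sample rather than copied from the figure, and ties in the data can make the interval counts slightly uneven.

```python
# Comparing equal-width and equal-frequency ("maximum marginal entropy")
# discretizations of the 30 observations above.
xs = sorted([0.1, 0.9, 1.5, 2.0, 2.8, 3.2, 3.3, 3.5, 3.7, 3.8,
             4.0, 4.5, 4.9, 5.5, 6.0, 7.3, 8.5, 8.8, 9.1, 9.2,
             9.5, 9.5, 9.7, 9.7, 10.0, 10.3, 10.5, 11.1, 11.8, 12.9])

def counts(bounds):
    """Number of observations in each interval (z_{i-1}, z_i]."""
    return [sum(1 for x in xs if lo < x <= hi or (x == lo == bounds[0]))
            for lo, hi in zip(bounds, bounds[1:])]

L, lo, hi = 5, 0.0, 13.0
equal_width = [lo + i * (hi - lo) / L for i in range(L + 1)]
# Cut the sorted sample into L groups of 6 and place each interior boundary
# midway between neighboring groups; tied values at a cut point (the two 9.7s)
# may make the resulting groups slightly uneven.
equal_freq = [lo] + [(xs[6 * i - 1] + xs[6 * i]) / 2 for i in range(1, L)] + [hi]

print("equal width:", equal_width, counts(equal_width))
print("equal freq.:", equal_freq, counts(equal_freq))
```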

2) Maximum Entropy Discretization Experiment 2: A supervised classification task based on Bayes' decision rule is used in the second experiment to show the effectiveness of maximum entropy discretization for class discrimination. Three classes of two-dimensional data of the form $x = (x_1, x_2)$ and with different means are stochastically generated. Data from the first class are generated from a random combination of two bivariate normal distributions, whereas data from the second and third classes are each generated from a single bivariate normal distribution. The variance matrices are then varied to produce 48 simulated data sets for each of the 7 sets of correlation coefficients (Table I). The hold-out method, with 10 percent of the data held out, is used to evaluate the classification result. The 7 sets of correlation coefficients and the average misclassification rates are also tabulated in Table I. The result shows that the maximum entropy discretization approach is consistently superior to the equal-width discretization approach.



be identified similarly. $E_j^k$ represents the subset of the hypothesized values which are interdependent with the outcomes of $X_k$. It should be noted that $E_j^k \times E_k^j$ then represents an event subspace of the complete outcome space of the variable-pair, selected by this covering process. Statistical information can be analyzed based on the incomplete probability scheme [18] defined on the event subspace spanned by $E_k^j \times E_j^k$, rather than on the complete set of outcomes.

2) Interdependency Between Restricted Variables: Based on $E_k^j \times E_j^k$, the interdependency between the two restricted variables defined on $E_k^j$ and $E_j^k$ can be calculated. Let the restricted variables be represented as $X_k^j$ and $X_j^k$, defined on $E_k^j$ and $E_j^k$, respectively. An information measure called the interdependence redundancy [7], defined on the incomplete probability schemes of the subsets, is calculated as

$$R(X_k^j, X_j^k) = I(X_k^j, X_j^k) / H(X_k^j, X_j^k),$$

where $I(X_k^j, X_j^k)$ and $H(X_k^j, X_j^k)$ are the expected mutual information and Shannon's entropy defined on $X_k^j$ and $X_j^k$, respectively. The value of $R(X_k^j, X_j^k)$ indicates the degree of statistical interdependency between the two restricted variables. We have chosen the interdependence redundancy measure because it is normalized and bounded by 0 and 1. Note that if either $|E_j^k| = 1$ or $|E_k^j| = 1$ then $R(X_k^j, X_j^k) = 0$, since there is then only redundancy information for each of the variables rather than interdependency information between the variables. If the redundancy information is also desirable in this situation, we can adopt a two-phased approach [5] which makes inferences based on the interdependency information in the first phase (our proposed method) and then, when a rejection occurs, makes inferences based on the redundancy information. $R(X_k^j, X_j^k)$ has an asymptotic chi-square property [7]:

$$2\, R(X_k^j, X_j^k)\, M(X_k^j, X_j^k)\, H(X_k^j, X_j^k) \;\sim\; \chi^2_{df},$$

where $df$ is the corresponding degree of freedom, having the value $(|E_k^j| - 1)(|E_j^k| - 1)$, and $M(X_k^j, X_j^k)$ is the number of observations in the incomplete scheme of $(X_k^j, X_j^k)$. The chi-square test is then used to detect statistical interdependency between the two restricted variables at a given significance level.
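A minimal sketch of the redundancy measure and its chi-square test as described above, under our own assumption that the incomplete scheme is estimated by simple relative frequencies restricted to the covered subsets (function and variable names are ours):

```python
import math

def interdependence_redundancy(pairs, Ek, Ej):
    """R = I/H for the restricted variable pair, plus the chi-square statistic.

    `pairs` is a list of observed (x_k, x_j) outcome pairs; only pairs whose
    components fall in the covered subsets Ek and Ej enter the incomplete scheme.
    """
    kept = [(a, b) for a, b in pairs if a in Ek and b in Ej]
    M = len(kept)
    joint, pk, pj = {}, {}, {}
    for a, b in kept:
        joint[(a, b)] = joint.get((a, b), 0) + 1
        pk[a] = pk.get(a, 0) + 1
        pj[b] = pj.get(b, 0) + 1
    I = sum((c / M) * math.log((c / M) / ((pk[a] / M) * (pj[b] / M)))
            for (a, b), c in joint.items())
    H = -sum((c / M) * math.log(c / M) for c in joint.values())
    R = I / H if H > 0 else 0.0
    chi_stat = 2.0 * M * I            # equals 2 * R * M * H
    df = (len(Ek) - 1) * (len(Ej) - 1)
    return R, chi_stat, df, M

# Usage (assuming SciPy is available for the critical value):
# from scipy.stats import chi2
# R, stat, df, M = interdependence_redundancy(observed_pairs, Ek, Ej)
# significant = df > 0 and stat > chi2.ppf(0.95, df)
```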

B. Probabilistic Imputation of Missing Values in Mixed-Mode Data

Before performing cluster analysis, the missing values in the data set are estimated from the other observed values, which are selected based on the detected statistical interdependency. Since a missing value can occur in any of the variables in the tuple, statistical interdependency is calculated between all the variable-pairs. Using the two statistical tests described in the previous sections, only values which are statistically significant for the estimation process are selected. Let the unknown value in an n-tuple $x$ be $x_j$ and its hypothesized value be $a_{jr}$. The conditions for a value $x_k = a_{ks}$ ($k \ne j$) in $x$ to be selected for estimation are:

1) the value $x_k$ is an observed value;

2) $R(X_k^j, X_j^k)$ is statistically significant;

3) $a_{ks} \in E_k^j$ and $a_{jr} \in E_j^k$.

An information measure called the normalized surprisal (NS) is used in the decision rule for estimating the missing values. NS corresponds to the weighted information of a hypothesized value $a_{jr}$, conditioned on the selected values (denoted here as $x'(a_{jr})$). $x'(a_{jr}) = \{x_1', x_2', \ldots, x_m'\}$ represents a sub-n-tuple of $x$, where $m$ ($m < n$) is the number of values selected. $NS(x_j = a_{jr} \mid x'(a_{jr}))$ is defined as follows:

$$NS(x_j = a_{jr} \mid x'(a_{jr})) = \frac{I(x_j = a_{jr} \mid x'(a_{jr}))}{m \sum_{k=1}^{m} R(X_j^k, X_k^j)},$$

where

1) $I(x_j = a_{jr} \mid x'(a_{jr})) = \sum_{k=1}^{m} R(X_j^k, X_k^j)\, I(a_{jr} \mid x_k')$;

2) $I(a_{jr} \mid x_k')$ is the conditional information defined on the incomplete probability scheme on $E_k^j \times E_j^k$, where

$$I(a_{jr} \mid x_k') = -\log \frac{P(a_{jr} \mid x_k')}{\sum_{a_{ju} \in E_j^k} P(a_{ju} \mid x_k')}$$

and the incomplete scheme conditioned on $x_k'$ contains more than $T$ observations, where $T$ is chosen as a size threshold for reliable probability estimation [15] (see footnote 2).

NS is normalized by the total weight and by the number of selected events, after weighting each conditional information by $R(X_j^k, X_k^j)$, its measure of interdependence redundancy. In [5] we have discussed more thoroughly the intuitive properties of NS, which are as follows:

1) the larger the weights, the more reliable the estimation;

2) the larger the conditional information, the more reliable the estimation.

To render the calculation meaningful, $x_k$ is selected only if a reasonable sample size is available, that is, only if the number of observations of the covered outcomes $a_{ju} \in E_j^k$ occurring jointly with $x_k$ exceeds the size threshold $T$.

The following decision rule, based on the information measure NS, is designed. Given $T_j = \{a_{jr} \mid r = 1, 2, \ldots, L_j\}$ as the set of all possible values that can be assigned to an unknown $x_j$, then

$$x_j = a_{jt} \quad \text{if} \quad NS(x_j = a_{jt} \mid x'(a_{jt})) = \min_{a_{jr} \in T_j} NS(x_j = a_{jr} \mid x'(a_{jr})).$$

Footnote 2: Since second-order statistics are required in the probability estimation, the minimum sample size for a reliable estimation is recommended to be

$$T = A \times \max_{j = 1, 2, \ldots, n} L_j^2,$$

where the constant $A$ may be taken as 3 for a liberal estimation and $L_j$ is the number of possible events for variable $X_j$ in $X$. A size threshold $T$ is also used in the cluster initiation phase; however, we do not find the result sensitive to the choice of $T$ in our experiments [6]. If the sample size or the cluster size is small, $T$ can be chosen to be smaller based on some initial trials, and small clusters can still be detected while large clusters are not affected.


If $x'$ is an empty set for all hypothesized values, or if more than one hypothesized value yields the minimum NS value, then the estimation cannot be made, and the estimated value remains unknown. Samples which are incomplete because of such unresolved estimation are initially left out of cluster initiation.

The computational complexity of the inference method is relatively low. The number of chi-square test applications is $(L_k + L_j + 1)$ for a variable-pair $(X_k, X_j)$, where $L_k$ and $L_j$ represent the number of distinct events for $X_k$ and $X_j$, respectively. The tests determine the statistical significance of the events for $X_k$ and $X_j$ with respect to their interdependency. For data represented as an n-tuple, there are $\binom{n}{2} = n(n-1)/2$ different variable-pairs, and the total number of statistical test applications is

$$\sum_{j=1}^{n} \sum_{k=1,\, k \ne j}^{n} (L_k + L_j + 1),$$

or

$$O\!\left[n^2 \left(\max_k L_k\right)\right], \qquad k = 1, 2, \ldots, n.$$

Including the calculation of the probability estimates, the complexity of the event-covering process is

$$O\!\left[M n^2 \left(\max_k L_k\right)\right], \qquad k = 1, 2, \ldots, n,$$

where $M$ is the number of samples for probability estimation. The NS calculation is also linearly proportional to the number of selected events in the estimation.
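The sketch below illustrates the NS-based decision rule for one missing value, assuming the covered subsets, redundancies, and (smoothed, hence nonzero) conditional probabilities of each selected variable have already been computed; the record layout and names are ours, not the paper's.

```python
import math

def normalized_surprisal(a_jr, selected):
    """NS for the hypothesis x_j = a_jr given the selected observed values.

    `selected` holds one record per variable x_k that passed the selection
    conditions of Section III-B; each record carries the interdependence
    redundancy R and the conditional probabilities of the covered outcomes
    a_ju in E_j^k given the observed x_k (assumed strictly positive).
    """
    usable = [s for s in selected if a_jr in s["cond_prob"]]
    if not usable:
        return None                        # no information for this hypothesis
    m = len(usable)
    total_R = sum(s["R"] for s in usable)
    if total_R == 0:
        return None
    info = 0.0
    for s in usable:
        norm = sum(s["cond_prob"].values())          # restrict to covered outcomes
        info += s["R"] * (-math.log(s["cond_prob"][a_jr] / norm))
    return info / (m * total_R)

def estimate_missing(candidates, selected):
    """Return the value minimizing NS, or None if the estimate is ambiguous."""
    scores = {a: normalized_surprisal(a, selected) for a in candidates}
    scores = {a: v for a, v in scores.items() if v is not None}
    if not scores:
        return None
    best = min(scores.values())
    winners = [a for a, v in scores.items() if math.isclose(v, best)]
    return winners[0] if len(winners) == 1 else None
```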

C. Unbiased Probability Estimator

When estimating probabilities based on an ensemble of samples, zero probabilities may be encountered if the estimation is based on direct frequency counts. In order to have better probability estimates in these cases, an unbiased probability estimator proposed in [19], [20] is adopted.

Consider a pair of restricted variables $(X_k^j, X_j^k)$ with the incomplete probability scheme involving events in $E_k^j$ and $E_j^k$. The unbiased marginal distribution of $X_j^k$ is defined as

$$P(X_j^k = a_{jr}) = \frac{M(a_{jr}) + 1}{M(X_k^j, X_j^k) + |E_j^k|},$$

where $M(a_{jr})$ and $M(X_k^j, X_j^k)$ are, respectively, the frequency of occurrence of $a_{jr}$ and the sample size of the incomplete scheme. Similarly, the unbiased joint distribution of $X_k^j$ and $X_j^k$ is defined as

$$P(X_j^k = a_{jr}, X_k^j = a_{ks}) = \frac{M(a_{jr}, a_{ks}) + 1}{M(X_k^j, X_j^k) + |E_j^k|\,|E_k^j|},$$

where $M(a_{jr}, a_{ks})$ is the number of occurrences of the joint outcome $(a_{jr}, a_{ks})$ in the incomplete scheme of the ensemble. Hence the conditional information $I(a_{jr} \mid a_{ks})$ is calculated as

$$I(a_{jr} \mid a_{ks}) = -\log \frac{P(a_{jr}, a_{ks}) / P(a_{ks})}{\sum_{a_{jt} \in E_j^k} P(a_{jt}, a_{ks}) / P(a_{ks})} = -\log \frac{M(a_{jr}, a_{ks}) + 1}{\sum_{a_{jt} \in E_j^k} \left( M(a_{jt}, a_{ks}) + 1 \right)}.$$
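A small sketch of these smoothed estimates (helper names are ours), returning the add-one marginal, joint, and conditional-information values that feed the NS calculation:

```python
import math
from collections import Counter

def unbiased_estimates(pairs, Ek, Ej):
    """Add-one ("unbiased") estimates on the incomplete scheme of (X_k^j, X_j^k).

    `pairs` holds observed (a_ks, a_jr) outcome pairs; only those inside the
    covered subsets Ek x Ej are counted.
    """
    kept = [(a, b) for a, b in pairs if a in Ek and b in Ej]
    M = len(kept)
    joint = Counter(kept)

    def p_joint(a_ks, a_jr):
        return (joint[(a_ks, a_jr)] + 1) / (M + len(Ek) * len(Ej))

    def p_marginal_j(a_jr):
        return (sum(1 for _, b in kept if b == a_jr) + 1) / (M + len(Ej))

    def cond_information(a_jr, a_ks):
        # I(a_jr | a_ks), normalized over the covered outcomes of X_j.
        num = joint[(a_ks, a_jr)] + 1
        den = sum(joint[(a_ks, a_jt)] + 1 for a_jt in Ej)
        return -math.log(num / den)

    return p_joint, p_marginal_j, cond_information
```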

D. Cluster Analysis on the Data

After the missing values are imputed, cluster analysis can be performed. First, clusters are initiated based on the nearest-neighbor characteristics of the ensemble. Then the clusters are regrouped based on the statistical interdependency detected from the data.

1) Cluster Initiation: The cluster initiation process involves three phases: 1) selecting samples which are not yet clustered and are more likely to form clusters first; 2) finding a data-dependent nearest-neighbor criterion which reflects the cluster characteristics; and 3) merging reliable samples to form clusters based on this criterion. These three phases are applied iteratively until all the samples are considered.

The first phase of cluster initiation estimates the probability for a sample to occur and then selects a subset of samples with higher probability estimates. The probability of a sample is estimated by a second-order product approximation on the discretized data [21] (see footnote 3). Further, the process involves the calculation of a mean probability [6] (see footnote 4). With the probability estimate of each sample calculated and the mean probability of a given set of samples defined, a subensemble of the unclustered samples with relatively higher probability estimates is selected by the following criterion: a sample is selected if its probability is greater than the mean probability of the remaining unclustered samples. We represent these selected samples as $S'$.

Footnote 3: An estimate of $P(x)$, known as the dependence tree product approximation [21], can be expressed as

$$P(x) = \prod_{j=1}^{n} P(x_{m_j} \mid x_{m_{k(j)}}),$$

where 1) the indices $\{m_1, m_2, \ldots, m_n\}$ are a permutation of the integer set $\{1, 2, \ldots, n\}$ and $k(j)$ is a function of $j$; 2) the ordered pairs $(x_{m_j}, x_{m_{k(j)}})$ are identified from the branches of a spanning tree defined on $X$, whose branch weights are the expected mutual information between the variable nodes, and the ordered pairs are chosen such that the summed expected mutual information over all the branches is maximized; and 3) $P(x_{m_1} \mid x_{m_{k(1)}}) = P(x_{m_1})$. The probability defined above is known to be the best second-order approximation of the high-order probability distribution [21].

Footnote 4: Let a set of selected samples be denoted as $S$. The mean probability for $S$ is defined as

$$\bar{P}(S) = \sum_{x \in S} P(x) / |S|,$$

where $|S|$ is the sample size.
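A compact sketch of the dependence-tree (Chow-Liu) product approximation of footnote 3, assuming complete, fully discretized tuples; the maximum-weight spanning tree is built greedily over pairwise mutual information, and all names are ours.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(samples, a, b):
    """Expected mutual information (in nats) between columns a and b."""
    n = len(samples)
    pa = Counter(s[a] for s in samples)
    pb = Counter(s[b] for s in samples)
    pab = Counter((s[a], s[b]) for s in samples)
    return sum((c / n) * math.log((c / n) / ((pa[u] / n) * (pb[v] / n)))
               for (u, v), c in pab.items())

def dependence_tree(samples):
    """Edges of a maximum-weight spanning tree over the variables (Kruskal-style)."""
    n_vars = len(samples[0])
    edges = sorted(((mutual_information(samples, a, b), a, b)
                    for a, b in combinations(range(n_vars), 2)), reverse=True)
    parent = list(range(n_vars))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    tree = []
    for _, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append((a, b))
    return tree

def product_approximation(samples, x, tree):
    """Second-order approximation P(x) ~ P(x_root) * prod P(x_j | x_parent)."""
    n = len(samples)
    neighbors = {}
    for a, b in tree:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    prob, visited, stack = 1.0, {0}, [(0, None)]   # variable 0 as arbitrary root
    while stack:
        j, par = stack.pop()
        if par is None:
            prob *= sum(1 for s in samples if s[j] == x[j]) / n
        else:
            denom = sum(1 for s in samples if s[par] == x[par])
            num = sum(1 for s in samples if s[par] == x[par] and s[j] == x[j])
            prob *= num / denom if denom else 0.0
        for c in neighbors.get(j, []):
            if c not in visited:
                visited.add(c)
                stack.append((c, j))
    return prob
```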


The second phase involves the calculation of the nearest-neighbor distance (footnote 5) for all the samples in $S'$. Let $D(x, S')$ be the nearest-neighbor distance of $x$ considering all the samples in $S'$. Among these distances, let $D^*$ be the maximum value (see footnote 6). Using the clustering procedure reported in [6], samples can be merged into clusters based on the analysis of the nearest-neighbor distance. The cluster initiation is outlined as follows:

1) Calculate $P(x)$ for all $x$ in the ensemble.

2) Set $K := 0$; $t := 0$.

3) Let $C_0$ be a dummy subgroup representing samples of unknown cluster. Initially $C_0$ is empty. Initialize the first cluster $C_1$ to contain the sample $x$ such that $P(x)$ is highest.

4) If the number of unclustered samples $> T$, then $P'$ is assigned the mean probability of the unclustered samples; else $P'$ is assigned 0. ($T$ is a size threshold indicating the smallest size of a cluster.)

5) List in a table $S'$ all the unclustered samples $x$ with $P(x) > P'$.

6) Calculate $D(x, S')$ for all $x$ in $S'$.

7) $D^* := \max_{x \in S'} D(x, S')$ (see footnote 6).

8) For all $x$ in $S'$ do:

* Get $x$ in $S'$ such that $P(x)$ is highest.
* If $D(x, C_{k_i}) \le D^*$ for more than one cluster (say $C_{k_i}$ for $i = 1, 2, \ldots, t'$, $t' > 1$), then: if $k_i \le K$ for some $i$, then $C_0 := C_0 \cup \{x\}$; else $C_{k_i} := \{x\} \cup C_{k_i}$ for all $i$.
* If $D(x, C_k) \le D^*$ for exactly one cluster $C_k$, then $C_k := \{x\} \cup C_k$.
* If $D(x, C_k) > D^*$ for all clusters $C_k$, $k = 1, 2, \ldots, t$, then $t := t + 1$ and $C_t := \{x\}$.
* Remove $x$ from $S'$.

9) $K := t$.

10) Go to 4 until all samples are considered.

11) If $|C_k| < T$, the size threshold, then $C_0 := C_0 \cup C_k$, for all such $k$.

Computationally, the proposed cluster initiation procedure is reasonably fast. It requires the calculation of nearest-neighbor distances between sample-pairs in a subensemble only, and the probability estimate is calculated only once for each sample. Also, it can be applied with any distance measure, and it allows uncertain samples to be temporarily assigned to the unknown cluster.

Footnote 5: We use the Hamming distance on the complete discretized n-tuples. Let $x$ and $y$ be two n-tuples; then the Hamming distance $d(x, y)$ is defined as

$$d(x, y) = \sum_{k=1}^{n} \delta(x_k, y_k), \qquad \delta(x_k, y_k) = \begin{cases} 0 & x_k = y_k \\ 1 & \text{otherwise.} \end{cases}$$

Footnote 6: Since an outlier has a large nearest-neighbor distance and will affect the value of $D^*$, which is the maximum of such distances, we use a heuristic method: $D^*$ is chosen as the maximum of all nearest-neighbor distances in $S'$ provided there is a sample in $S'$ having a nearest-neighbor distance value of $D^* - 1$ (with the distance values rounded to the nearest integer). In other words, this method screens out the effect of outliers on the value of $D^*$.
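The cluster initiation outline above is condensed into code below, under our own simplifications: Hamming distance on discretized tuples (footnote 5), a plain maximum for $D^*$ (omitting the outlier screening of footnote 6), and a caller-supplied probability function such as the dependence-tree approximation.

```python
def hamming(x, y):
    """Hamming distance between two discretized n-tuples (footnote 5)."""
    return sum(1 for a, b in zip(x, y) if a != b)

def initiate_clusters(samples, prob, T):
    """Simplified cluster initiation; returns (clusters, unknown).

    `prob` maps each sample (tuple) to its estimated probability; T is the
    smallest admissible cluster size. Outlier screening of D* is omitted.
    """
    unclustered = sorted(samples, key=prob, reverse=True)
    clusters, unknown = [], []
    while unclustered:
        K = len(clusters)
        mean_p = sum(prob(x) for x in unclustered) / len(unclustered)
        p_prime = mean_p if len(unclustered) > T else 0.0
        S = [x for x in unclustered if prob(x) > p_prime] or unclustered[:1]
        # Data-dependent nearest-neighbor threshold D*.
        d_star = max(
            (min(hamming(x, y) for j, y in enumerate(S) if j != i)
             if len(S) > 1 else 0)
            for i, x in enumerate(S))
        for x in list(S):                       # highest-probability samples first
            near = [i for i, c in enumerate(clusters)
                    if min(hamming(x, y) for y in c) <= d_star]
            if len(near) > 1:
                # Ambiguous w.r.t. an established cluster -> unknown group C_0.
                if any(i < K for i in near):
                    unknown.append(x)
                else:
                    for i in near:
                        clusters[i].append(x)
            elif len(near) == 1:
                clusters[near[0]].append(x)
            else:
                clusters.append([x])
            unclustered.remove(x)
    # Dissolve clusters smaller than the size threshold into the unknown group.
    for c in [c for c in clusters if len(c) < T]:
        unknown.extend(c)
        clusters.remove(c)
    return clusters, unknown
```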

2) Cluster Regrouping: After finding the initial clusters along with their membership, the cluster membership (or cluster label) of each sample $x$ can be considered as an additional value of $x$. Let the cluster label variable be $C$ and its current set of detected outcomes be $\{c_1, c_2, \ldots, c_g\}$, where $g$ is the current number of clusters detected. The regrouping process is thus essentially an inference process for estimating the cluster label of a sample. During this process, the values which are statistically interdependent with the cluster label (now treated as a variable) are selected. Joint outcomes (second or higher order outcomes) which are found to be interactive in a sample $x$ can be considered as additional observed features if computational resources and storage space are available [6]. Then the decision rule based on the minimum NS value (see Section III-B) can be applied to estimate the cluster label of a sample. The process of estimation iterates until stable clusters are found. The cluster regrouping algorithm is outlined as follows:

1) Compute the finite probability schemes based on the current cluster labels.

2) Identify the events in the covered event subspace for all variables with respect to the cluster label variable.

3) Set number_of_change := 0.

4) For each $x$ in the ensemble do:

* If the estimation is uncertain, because more than one cluster label satisfies the minimum criterion or because no value in the n-tuple has been selected, then assign the label as missing.
* Otherwise assign $x$ the cluster label $c_j$ if

$$NS(c_j \mid x'(c_j)) = \min_{c_u \in C} NS(c_u \mid x'(c_u)).$$

* If $c_j \ne$ the previous cluster label, then number_of_change := number_of_change + 1 and update the cluster label of $x$.

5) If number_of_change > 0, then go to 1; else stop.

Because there is no distance measure defined for mixed-mode data, the cluster analysis in the final phase of the algorithm is performed based on statistical properties rather than a distance measure, with all the variables treated as nominal variables, including the ordinal ones. However, since the cluster initiation is based on the nearest-neighbor characteristics, the final clusters embody both the distance and the statistical information of the data ensemble.

When the clusters are found, the interdependency between the class and the event values is a piece of synthesized knowledge which is extracted from the ensemble of data as a whole, and could not be acquired from any individual sample in isolation.
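A high-level sketch of the regrouping loop, assuming caller-supplied helpers for event-covering against the cluster-label variable and for the NS score (both could be assembled from the pieces sketched earlier; all names are ours).

```python
def regroup(samples, labels, cover_events, ns_score):
    """Iteratively re-estimate cluster labels until no sample changes.

    `labels` maps sample index -> current label (or None for unknown);
    `cover_events(samples, labels)` returns whatever statistics the NS score
    needs (covered subsets, redundancies, probabilities) for the label variable;
    `ns_score(x, c, stats)` returns NS(c | x'(c)) or None if no value is selected.
    """
    while True:
        stats = cover_events(samples, labels)          # steps 1 and 2
        changes = 0                                    # step 3
        for i, x in enumerate(samples):                # step 4
            candidates = {c for c in labels.values() if c is not None}
            scores = {c: ns_score(x, c, stats) for c in candidates}
            scores = {c: v for c, v in scores.items() if v is not None}
            if not scores:
                continue                               # nothing selected: keep label
            best = min(scores.values())
            winners = [c for c, v in scores.items() if v == best]
            if len(winners) != 1:
                continue                               # ambiguous: keep label (simplified)
            if winners[0] != labels[i]:
                labels[i] = winners[0]
                changes += 1
        if changes == 0:                               # step 5
            return labels
```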

IV. EXPERIMENT USING SIMULATED DATA

In evaluating this approach to mixed-mode data analysis, an experiment using simulated data is performed. To generate a set of simulated data, four clusters are created based on four n-tuples (Fig. 2). The data are represented as $x = (x_1, x_2, \ldots, x_7)$. These n-tuples are repeated a number of times to produce an ensemble of 500 n-tuples.


Class   n-tuple (X1, X2, X3, X4, X5, X6, X7)   Frequency
  1     (1, 6, 3, 6, 1, F, A)                    200
  2     (6, 3, 1, 6, 6, C, C)                    150
  3     (3, 1, 6, 1, 3, A, F)                     75
  4     (6, 3, 6, 3, 1, A, C)                     75
Total                                            500

(X1-X3 are continuous, X4-X5 ordinal, X6-X7 nominal.)

Fig. 2. Original n-tuples for generating the simulated data involving four classes.

TABLE II
RESULT OF EXPERIMENT IN ESTIMATING MISSING VALUES

Variable:    X1   X2   X3   X4   X5   X6   X7   Total
Type:        Continuous      Ordinal    Nominal
Incorrect    19   10    9    9    5    1    2     55
Reject        0    0    0    8    1    0    0      9
Correct      37   46   36   31   36   49   51    286
Total        56   56   45   48   42   50   53    350

Note that the clusters are not determined by a single value but by the joint information of the n-tuple. To create mixed-mode data with noise perturbation, 40 percent of the values are randomly replaced by a value drawn with equal probability from the set {B, D, E} for the nominal variables, and from {2, 4, 5} for the continuous and ordinal variables. These replaced values carry no information about the cluster. Then noise with a normal distribution of zero mean and 0.5 standard deviation is imposed on all the values in the ensemble. The variables X1, X2, X3 are designated as continuous, X4, X5 as ordinal discrete, and X6, X7 as nominal discrete. The generated noise values are added to the continuous and ordinal discrete values; thus X1, X2, X3 take on real values after the addition. For X4, X5, the value is rounded to the nearest integer bounded by 1 and 6. For X6 and X7, if the Gaussian noise value generated is greater than 1, the resulting value is randomly changed to an arbitrary possible outcome with equal probability. To create a set of incomplete n-tuples, 10 percent of all values are randomly removed, so that there is a total of 350 missing values.

The purpose of the experiment is to cluster this set of incomplete n-tuples. First, the maximum entropy discretization of the continuous values is applied. Each continuous value is represented by one of six discrete quantum values indicating six intervals (i.e., $L_j = 6$ for $j = 1, 2, 3$). Then the inference method is applied to estimate any missing value, and cluster analysis is performed on the data. For continuous variables, the original value is compared to see whether it falls within the range of the estimated interval. The 95 percent confidence level is used in all the chi-square tests.

The result of the experiment in estimating the missing values is tabulated in Table II. The number of errors on the different types of variables is probably proportional to the amount of noise imposed.

TABLE III
RESULT OBTAINED IN CLUSTERING SIMULATED DATA (INITIAL CLUSTERS)

Class   Misclass.   Unknown   Correct   Total
  1         1          26       173      200
  2         3          28       119      150
  3         1          15        59       75
  4         2          48        25       75
Total       7         117       376      500

Note: There are 8 incomplete n-tuples in class 1 and 1 incomplete n-tuple in class 2 among the n-tuples of the unknown class.

TABLE IV
RESULT OBTAINED IN CLUSTERING SIMULATED DATA (FINAL CLUSTERS)

Class   Misclass.   Unknown   Correct   Total
  1         2           0       198      200
  2         3           0       147      150
  3         4           0        71       75
  4        10           0        65       75
Total      19           0       481      500

The error rate of the initial clustering result is very low even though the unknown rate is high (Table III). The unknown rate is highest in class 4 because, compared to the other classes, the original n-tuple that represents it is the most similar to the others. The final result is given in Table IV. The overall result indicates that the method achieves high reliability for this set of data.

V. EXPERIMENT USING HYDROMETRIC NETWORK DATA

The next experiment involves hydrometric network data. In order to integrate the hydrologic, meteorologic, and physiographic aspects of a hydrometric network in a quantitative analysis, a set of samples collected over 131 catchment areas in British Columbia, Canada [22], is used for cluster analysis. Seven hydrometric features are chosen for each catchment area: 1) mean annual runoff; 2) mean annual precipitation; 3) mean runoff coefficient; 4) relief and bathymetry; 5) ground water activities; 6) moisture index; and 7) forest coverage. They are expressed as $x = (x_1, x_2, \ldots, x_7)$ (Fig. 3), where the first three features are of the continuous type and the others are of the discrete type (nominal as well as ordinal). Since the data are complete n-tuples, the discretization process on the continuous features can be applied immediately, and then the event-covering and cluster analysis are performed. The continuous variables are discretized into four intervals ($L_j = 4$ for $j = 1, 2, 3$). All the statistical tests are based on a confidence level of 95 percent. After cluster regrouping, the final result is shown in Fig. 4.

When examining the two clusters found, the feature values characterizing the clusters are noted. Generally speaking, the samples of cluster 1 are flat river basins, such as flatland and plateau; they have relatively low annual runoff and low precipitation, with noticeable underground water activities. The samples of cluster 2 are river basins having relatively high annual runoff and precipitation; they are mostly mountainous, with relatively low underground water activities.


Var.   Data type     Possible values
X1     continuous    3.0 to 65.0
X2     continuous    11.0 to 71.0
X3     continuous    0 to 1.0
X4     nominal       {mountain, flatland, valley, plateau}
X5     nominal       {underground water, underground water in downstream, no underground water}
X6     ordinal       {sub-arid, semi-arid, sub-humid, humid}
X7     ordinal       {poorly-covered, half-covered, fully-covered}

Fig. 3. A description of the hydrometric network data.

Cluster 1: Size 50

General river basin characteristics:
  relatively low annual runoff
  relatively low annual precipitation
  relatively high runoff coefficient
  majority of flat topography
  noticeable underground water activities
  low to medium moisture content
  less forest coverage

Unique feature values:
  mean annual runoff below 13.3
  mean annual precipitation below 18.5
  mean runoff coefficient above 0.675
  flatland and plateau
  underground water activity

Cluster 2: Size 81

General river basin characteristics:
  relatively high annual runoff
  relatively high annual precipitation
  relatively low runoff coefficient
  mostly mountainous topography
  relative scarcity of underground water activities
  relatively high moisture content
  more forest coverage

Unique feature values:
  none

Fig. 4. A description of the clusters found on the hydrometric network data.

Restricted variable   R(X_k^c, C)   E_k^c
X1^c                    0.200       {3.0-6.3, 6.4-13.3, 13.4-24.9, 25.0-65.0}
X2^c                    0.138       {<18.5, 18.6-23.7, 23.8-36.1, >36.2}
X3^c                    0.175       {<0.29, 0.30-0.51, 0.52-0.67, >0.68}
X4^c                    0.169       {mountain, flatland, plateau}
X5^c                    0.567       {underground water, no underground water}
X6^c                    0.160       {semi-arid, sub-humid, humid}
X7^c                    0.071       {poorly-covered, half-covered}

Fig. 5. Measure of interdependence redundancy between the cluster label and the restricted variables.

The measures of interdependence redundancy between the restricted variables and the cluster label variable are given in Fig. 5. They indicate that the ground water activities are the most important factor in determining the subgroups and that the forest coverage is the least important.

Var.   Event                                  Statistical significance
X1     < 13.30                                indicates cluster 1
       > 13.30                                more likely cluster 2
X2     < 18.50                                indicates cluster 1
       18.50-23.70                            more likely cluster 1
       23.70-36.10                            more likely cluster 2
       > 36.10                                highly probable cluster 2
X3     < 0.295                                highly probable cluster 2
       0.295-0.515                            not indicative
       0.515-0.675                            highly probable cluster 1
       > 0.675                                indicates cluster 1
X4     mountain                               more likely cluster 2
       flatland                               indicates cluster 1
       valley                                 not indicative
       plateau                                indicates cluster 1
X5     underground water                      indicates cluster 1
       underground water in downstream        not indicative
       no underground water                   highly probable cluster 2
X6     sub-arid                               not indicative *
       semi-arid                              highly probable cluster 1
       sub-humid                              highly probable cluster 1
       humid                                  highly probable cluster 2
X7     poorly-covered                         highly probable cluster 1
       half-covered                           more likely cluster 2
       fully-covered                          not indicative *

* may be due to small sample size

Fig. 6. The significance of the events in indicating the subgroups.

Fig. 6 shows the significance of the different events in indicating the subgroups.

VI. CONCLUSION

In order to acquire more information for tackling complicated tasks involving high-level skills, there is an increasing need to analyze complex multivariate data with variables from different sources and with different forms of description. This paper has proposed a feasible solution for detecting clustering patterns in mixed-mode data in an integrated way. The method is mathematically and intuitively meaningful [13], [16]. Furthermore, it possesses algorithmic simplicity. When a reasonably large set of observations is analyzed by the general inference and cluster analysis algorithm using the event-covering approach, new knowledge is acquired indicating different forms of interdependence patterns: subsets of interdependent events, interdependence patterns between the restricted variables involving only these events, and clustering patterns based on these acquired interdependence relationships. Once the clusters are formed, further class-value interdependence patterns can be extracted. Information thus obtained reflects synthesized knowledge inherent in the data as a whole. Despite the influence of statistically irrelevant events in the data, experiments using simulated incomplete data and real-life hydrometric network data have produced very encouraging results.


REFERENCES

[1] A. K. C. Wong and H. C. Shen, "Data base partitioning for data analysis," in Proc. 1979 Int. Conf. Cybernetics and Society, Denver, CO, pp. 514-518.
[2] H. C. Shen, M. S. Kamel, and A. K. C. Wong, "Intelligent data base management systems," in Proc. 1983 Int. Conf. Systems, Man, and Cybernetics.
[3] R. J. Thierauf, Decision Support Systems for Effective Planning and Control. Englewood Cliffs, NJ: Prentice-Hall, 1982.
[4] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[5] A. K. C. Wong and D. K. Chiu, "An event-covering method for effective probabilistic inference," Pattern Recognition, vol. 20, no. 2, pp. 245-255, 1987.
[6] D. K. Chiu and A. K. C. Wong, "Synthesizing knowledge: A cluster analysis approach using event-covering," IEEE Trans. Syst., Man, Cybern., vol. SMC-16, pp. 251-259, Mar./Apr. 1986.
[7] A. K. C. Wong and T. S. Liu, "Typicality, diversity, and feature pattern of an ensemble," IEEE Trans. Comput., vol. C-24, pp. 158-181, Feb. 1975.
[8] S. S. Wilks, Mathematical Statistics. New York: Wiley, 1962.
[9] D. J. Hand, "Statistical pattern recognition of binary variables," in Pattern Recognition Theory and Applications, J. Kittler, K. S. Fu, and L. F. Pau, Eds. Dordrecht, The Netherlands: D. Reidel, 1982, pp. 19-33.
[10] H. Gallaire, J. Minker, and J. Nicolas, "Logic and databases: A deductive approach," Comput. Surveys, vol. 16, no. 2, pp. 153-185, June 1984.
[11] S. Navathe, S. Ceri, G. Wiederhold, and J. Dou, "Vertical partitioning algorithms for database design," ACM Trans. Database Syst., vol. 9, no. 4, pp. 680-710, Dec. 1984.
[12] F. M. Reza, An Introduction to Information Theory. New York: McGraw-Hill, 1961.
[13] B. Forte, M. de Lascurain, and A. K. C. Wong, "The best lower bound of the maximum entropy for discretized two-dimensional probability distributions," IEEE Trans. Inform. Theory, to be published.
[14] R. S. Garfinkel and G. L. Nemhauser, Integer Programming. New York: Wiley, 1972.
[15] J. C. Stoffel, "A classifier design technique for discrete variable pattern recognition problems," IEEE Trans. Comput., vol. C-23, pp. 428-441, 1974.
[16] C. T. Ng and A. K. C. Wong, "On nonuniqueness of discretization of two-dimensional probability distribution subject to maximization of Shannon's entropy," IEEE Trans. Inform. Theory, to be published.
[17] R. A. Tapia and J. R. Thompson, Nonparametric Probability Density Estimation. Baltimore, MD: The Johns Hopkins University Press, 1978.
[18] S. Guiasu, Information Theory with Applications. New York: McGraw-Hill, 1977.
[19] R. Christensen, "Entropy minimax, a non-Bayesian approach to probability estimation from empirical data," in Proc. IEEE 1973 Int. Conf. Cybernetics and Society, pp. 321-325.
[20] R. D. Smallwood, A Decision Structure for Teaching Machines. Cambridge, MA: M.I.T. Press, 1962.
[21] C. K. Chow and C. N. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Trans. Inform. Theory, vol. IT-14, pp. 462-467, 1968.
[22] A. K. C. Wong, "Problem definition for the pattern analysis and display systems of hydrometric networks in British Columbia," Report to Environment Canada, Mar. 1982.

Andrew K. C. Wong (M'79) received the Ph.D. degree from Carnegie-Mellon University, Pittsburgh, PA, in 1968.

For several years, he taught at Carnegie-Mellon. He is currently a Professor of Systems Design Engineering and the Director of the PAMI Group, University of Waterloo, Waterloo, Ont., Canada. In 1984, he also assumed the responsibility of Research Director, in charge of the research portion of the Robotic Vision and Knowledge Base Project at the University. He has authored and coauthored chapters/sections in several engineering books and published many articles in scientific journals and conference proceedings.

Dr. Wong is an Associate Editor of the Journal of Computers in Biology and Medicine.

David K. Y. Chiu received the M.Sc. degree in computing and information science from Queen's University, Kingston, Ont., Canada, in 1979 and the Ph.D. degree in systems design engineering from the University of Waterloo, Waterloo, Ont., in 1986.

Currently, he is on the faculty of the Department of Computing and Information Science, University of Guelph, Guelph, Ont. His research interests include pattern analysis and machine intelligence, knowledge-based systems, and computer vision.
