Chemometrics and Intelligent Laboratory Systems 32 (1996) 1-10

Tutorial

Robust methods for multivariate analysis - a tutorial review

Yi-Zeng Liang a, Olav M. Kvalheim b,*

a Department of Chemistry and Chemical Engineering, Hunan University, Changsha, P.R. China

b Department of Chemistry, University of Bergen, N-5007 Bergen, Norway

Received 3 July 1994; accepted 21 December 1994

Abstract

Robust methods developed in statistics and chemometrics for multivariate calibration and exploratory analysis are reviewed. Robust methods can be classified according to aim: (i) regression methods, (ii) methods for outlier detection (diagnostics), and (iii) methods for dimensionality reduction (exploratory analysis). Based on this taxonomy, some of the methods are described in detail and illustrated with examples.

Keywords: Calibration; Classification; Least squares; Multivariate analysis; Partial least squares; Regression; Outlier detection; Robust analysis; Robust regression

Contents

1. Introduction
2. Robust regression methods
   2.1. M-estimators and generalised M-estimator
   2.2. Least median of squares (LMS)
   2.3. Least trimmed squares (LTS)
   2.4. Robust partial least squares (RPLS)
   2.5. Robust principal component regression (RPCR)
3. Diagnostics
   3.1. Classic diagnostics
   3.2. Robust diagnostics
4. Robust methods for dimensionality reduction
   4.1. Projection pursuit (PP)
   4.2. Robust PCA and robust SVD
5. Discussion
Acknowledgements
References

* Corresponding author.


0169-7439/96/$15.00 © 1996 Elsevier Science B.V. All rights reserved. SSDI 0169-7439(95)00006-2


1. Introduction

A normal distribution of observations represents a necessary assumption for most of the commonly used data-analytical statistical methods. However, Clancey [1] examined approximately 250 error distributions involving 50000 chemical analyses of metals and found that only 10-15% of the series could be regarded as normally distributed. A similar phenomenon has also been observed in the analysis of blood constituents [2]. Observations of this sort may be due to error distributions deviating from normality or to the presence of outliers in the data. Outliers are observations that appear to break the pattern shown by the major part of the data [3]. There are many reasons for the presence of outliers, from recording errors to a non-representative sampling design. The reader is referred to Barnett and Lewis [3] for a comprehensive discussion of this point. Outliers can be accommodated or rejected in the modelling process. If accommodation is chosen, robust estimation methods are necessary for the model building. Robust estimation reduces the influence of outlying observations in the model.

Most multivariate methods applied to chemical data are based on least squares (LS) techniques. For instance, principal component analysis (PCA), multiple linear regression (MLR), principal component regression (PCR), and partial least squares (PLS) regression are all LS techniques. Least squares techniques are not robust against outliers. This follows from the properties of the objective function for LS procedures. Thus, the objective function is the sum of the squared residuals:

Minimise $\sum_i r_i^2 = \sum_i (y_i - \theta_1 x_{i1} - \dots - \theta_m x_{im})^2$   (1)

In Eq. (1), {r_i; i = 1, ..., n} are the residuals, {y_i; i = 1, ..., n} are the corresponding values of the response (or dependent) variable, {x_ij; i = 1, ..., n; j = 1, ..., m} are the values of the explanatory variables and {θ_j; j = 1, ..., m} are the LS estimates of the parameters.

A first step towards robust regression estimators came in 1887 from Edgeworth [4]. He argued that outliers have a very large influence on LS because the residuals {r_i} are squared. As a cure, Edgeworth proposed the least absolute values regression estimator:

Minimise $\sum_i |r_i|$   (2)

This technique is often referred to as L1 regression, whereas LS with squared residuals is denoted L2. Unfortunately, L1 regression is very sensitive to a second kind of outlier, the so-called bad leverage points. A leverage point is an observation which is isolated from the major part of the observations. A bad leverage point is a point which in addition deviates strongly from the regression line defined by the other observations. Examples of this kind of outliers can be found in a paper by Rousseeuw [5].

In order to evaluate the robustness of the methods used in statistics, Hodges [6] introduced the term breakdown point. A general formulation of this concept was given by Hampel [7]. Following Rousseeuw [5], the breakdown point can be loosely defined as the smallest fraction of contamination (outliers) that seriously offsets the estimator from the 'true' one. The breakdown point for LS (L2) is 1/n, which means that a single outlier in a set of n observations can destroy the LS estimates. The breakdown point of L1 is also 1/n due to the possible presence of leverage points.

There are two ways to deal with outliers and abnormal error distributions: diagnostics and robust estimators. As discussed by Rousseeuw and Leroy [8], diagnostics and robust regression have the same goal although they proceed in opposite order: in a diagnostic approach one starts by identifying the outliers and then fits the rest of the data by a classic method. In a robust approach one starts by fitting a model that does justice to the majority of the data and at the same time reveals the outliers as observations with large residuals from the robust estimator. In some applications, both approaches yield exactly the same results. The choice between them is thus almost a matter of taste.

The aim of this paper is to present an overview of robust methods recently developed in statistics and chemometrics for multivariate analysis. Thus, robust regression, diagnostics and robust methods for dimensionality reduction are described and illustrated with examples.
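The following small Python sketch (with invented data and an arbitrary outlier value, for illustration only) shows the practical meaning of the 1/n breakdown point of LS discussed above: corrupting a single response out of twenty is enough to move the LS estimates far from the values that fit the remaining nineteen observations.

# Illustration of the 1/n breakdown point of LS (hypothetical data).
import numpy as np

rng = np.random.default_rng(0)
n = 20
x = np.linspace(0.0, 1.0, n)
y = 0.5 + 0.4 * x + rng.normal(0.0, 0.01, n)     # data generated around theta = (0.5, 0.4)
X = np.column_stack([np.ones(n), x])             # design matrix with an intercept column

theta_clean = np.linalg.lstsq(X, y, rcond=None)[0]

y_bad = y.copy()
y_bad[-1] = 10.0                                 # one gross outlier in the response
theta_bad = np.linalg.lstsq(X, y_bad, rcond=None)[0]

print("LS estimates, clean data     :", theta_clean)   # close to (0.5, 0.4)
print("LS estimates, single outlier :", theta_bad)     # dragged far away from (0.5, 0.4)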

2. Robust regression methods

2.1. M-estimators and generalised M-estimator

Methods collected under the term maximum likelihood estimators, abbreviated as M-estimators [9], replace the squared residuals r_i^2 in Eq. (1) by another function of the residuals. Thus, the analogue to Eq. (1) becomes

Minimise $\sum_i \rho(r_i)$   (3)

The function ρ is symmetric (i.e. ρ(-t) = ρ(t) for all t) with a unique minimum at zero. Differentiating Eq. (3) with respect to the regression coefficients {θ_j; j = 1, ..., m} yields

$\sum_i \psi(r_i/s)\,\mathbf{x}_i = \mathbf{0}$   (4)

where ψ is the derivative of ρ and s is a dispersion or scale estimator obtained from {r_i}. The two vectors x_i and θ in Eq. (4) are defined by x_i = (x_i1, ..., x_im)^T and θ = (θ_1, ..., θ_m)^T. Note that the choice ρ(r_i) = ½r_i^2 provides the ordinary LS method from the M-estimator.

In practice, M-estimators are derived directly from the ψ(r_i) functions and not by proceeding via ρ(r_i) functions as implied by the formal definition (Eqs. (3) and (4)). Several ψ(r_i) functions have been proposed. We provide one example to illustrate the approach. Andrews defined a ψ(r_i) weight function as [10]:

$\psi(z) = \begin{cases} \sin(z/c) & |z| \le c\pi \\ 0 & |z| > c\pi \end{cases}$   (5)

In Eq. (5), c is a tuning parameter. Note that z is calculated from the residuals {r_i}, meaning that ψ is also a function of the estimates of the regression coefficients θ_j. Without an estimation method one cannot obtain the θ_j values and thus the weight function ψ. The M-estimator therefore calls for an iterative procedure [11].

The explicit iterative procedure for solving Eq. (4) can be expressed as follows.

(i) Use the LS technique to obtain initial estimates of the regression coefficients:

$\hat{\theta} = (X^T X)^{-1} X^T y$   (6)

The initial residuals can be modified to down-weight large residuals by

$r_i = \begin{cases} 1.5\,\mathrm{median}\{|r_i|\}\,\mathrm{sign}(r_i) & \text{if } |r_i| > 1.5\,\mathrm{median}\{|r_i|\} \\ r_i & \text{if } |r_i| \le 1.5\,\mathrm{median}\{|r_i|\} \end{cases}$   (7)

The median of the r_i represents a robust estimate of the mean or, in statistical terminology, of the location of the error distribution.

The actual initial estimates of the θ_j values can then be obtained by

$\theta_{\mathrm{ini}} = \hat{\theta} + \Delta\theta$   (8)

where

$\Delta\theta = (X^T X)^{-1} X^T r$   (9)

with r the vector of modified residuals from Eq. (7).

(ii) In the kth iteration, let

$(W_i^{(k+1)})^2 = \psi(r_i^{(k)}/s^{(k)}) / (r_i^{(k)}/s^{(k)})$   (10)

where

$s^{(k)} = \mathrm{median}\{|r_i^{(k)}|\}$   (11)

Then, Eq. (4) becomes

$\sum_i \big[\, r_i^{(k)} \psi(r_i^{(k)}/s^{(k)})\,\mathbf{x}_i / r_i^{(k)} \,\big]/s^{(k)} = \mathbf{0}/s^{(k)}$   (12)

that is,

$\sum_i r_i^{(k)} (W_i^{(k+1)})^2 \mathbf{x}_i = \sum_i \big[\,(y_i - \theta^{(k+1)T}\mathbf{x}_i)\,W_i^{(k+1)}\,\big]\,(W_i^{(k+1)}\mathbf{x}_i) = \mathbf{0}$   (13)

This results in

$\theta^{(k+1)} = (X^T W^{(k+1)} X)^{-1} X^T W^{(k+1)} y$   (14)

where $W^{(k+1)}$ is an n × n matrix with diagonal elements equal to $(W_i^{(k+1)})^2$ and all off-diagonal elements equal to zero:

$W^{(k+1)} = \mathrm{diag}\big[(W_i^{(k+1)})^2\big]$   (15)

The iterative procedure defined by Eqs. (10)-(14) continues until the difference between the regression coefficients θ^(k+1) and θ^(k) is less than a pre-set threshold.
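The reweighting cycle of Eqs. (10)-(15) can be sketched compactly in Python. The sketch below starts from a plain LS fit rather than the modified start of Eqs. (6)-(9), and the tuning constant c and the convergence tolerance are illustrative choices, not values taken from the paper; the quantities placed on the diagonal of the weighting matrix are the weights ψ(z_i)/z_i of Eq. (10).

import numpy as np

def andrews_psi(z, c):
    # Andrews' psi function, Eq. (5): sin(z/c) for |z| <= c*pi, zero otherwise
    return np.where(np.abs(z) <= c * np.pi, np.sin(z / c), 0.0)

def m_estimator(X, y, c=1.5, tol=1e-8, max_iter=50):
    theta = np.linalg.lstsq(X, y, rcond=None)[0]          # initial LS estimate, cf. Eq. (6)
    for _ in range(max_iter):
        r = y - X @ theta                                 # current residuals
        s = np.median(np.abs(r))                          # scale estimate, Eq. (11)
        if s == 0:
            break
        z = r / s
        z_safe = np.where(np.abs(z) < 1e-12, 1.0, z)
        w = np.where(np.abs(z) < 1e-12, 1.0 / c,
                     andrews_psi(z, c) / z_safe)          # weights psi(z)/z, Eq. (10)
        W = np.diag(w)                                    # diagonal weighting matrix, Eq. (15)
        theta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted LS step, Eq. (14)
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new
    return theta

As in Table 1, observations whose scaled residuals fall beyond cπ receive zero weight and therefore drop out of the weighted LS step.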


An example illustrating the iterative procedure described above is given in Table 1. From this table one can observe that the iterative procedure for calculating the M-estimator converges rapidly. The down-weighting procedure defined by Eqs. (7)-(9) for observations with large residuals in the ordinary LS calculation works well, and three outliers are detected (the observations with y = 1.704, 1.792 and 2.372 in Table 1). In the following iteration, defined by Eqs. (10)-(14), the influence of these outliers is down-weighted by setting their weights to zero. In the second iteration, one remaining outlier is detected, i.e. the observation with y = 1.006 (residual 0.2778 in the second iteration). Its influence is eliminated in the following iteration and the procedure converges in the third run, in which the weights are equal to 1.0 for all observations except the four outliers.

In chemometrics, Philips and Eyring [12] were the first to apply an M-estimator to regression analysis. They analysed 38 data sets, each containing at least ten observations, to determine the slope and intercept of univariate regression models. They found that the results of robust regression either matched or surpassed those of LS regression. Wolters and Kateman [13] applied another M-estimator to study the influence of different error distributions on parameter estimation using Monte Carlo simulations. They concluded that, for a number of measurements equal to or greater than ten, better parameter estimates were achieved with the robust methodology. Wei et al. [14] applied a sine-function M-estimator to multicomponent analysis of UV data. Their results showed that when the noise was not normally distributed, the robust method provided better concentration estimates than LS. Xie et al. [15] applied several M-estimators to multicomponent analysis to cure problems with partial non-linearity. Wavelengths with strong deviation from linearity were regarded as outliers. Xie et al. obtained improved concentration estimates with the robust method compared to LS.

Table 1
Calculation of the M-estimator based upon Andrews' weighting function [10], i.e. Eq. (5) in the text

                                  Iteration No.
                                  First (a)            Second (b)           Third (c)
  y        x1       x2          Residual   Weight    Residual   Weight    Residual   Weight
 0.5266   0.5207   0.6640        0.0933    0.9986     0.0218    0.9999     0.0004    1.000
 0.6596   0.6640   0.8185        0.0985    0.9984     0.0219    0.9999    -0.0001    1.000
 0.8001   0.8185   0.9766        0.0955    0.9984     0.0207    0.9999    -0.0001    1.000
 0.9399   0.9766   1.129         0.0842    0.9988     0.0175    0.9999    -0.0003    1.000
 1.107    1.129    1.265         0.0675    0.9992     0.0129    1.000     -0.0001    1.000
 1.183    1.265    1.375         0.0392    0.9997    -0.0012    1.000      0.0004    1.000
 1.268    1.375    1.449         0.0068    1.000     -0.0107    1.000     -0.0002    1.000
 1.318    1.449    1.482        -0.0286    0.9999    -0.0004    1.000     -0.0004    1.000
 1.329    1.482    1.469        -0.0663    0.9993    -0.0283    0.9999     0.0001    1.000
 1.299    1.469    1.409        -0.1037    0.9982    -0.0361    0.9998    -0.0003    1.000
 1.227    1.409    1.307        -0.1351    0.9970    -0.0411    0.9997     0.0002    1.000
 1.122    1.307    1.169        -0.1598    0.9958    -0.0440    0.9997     0.0003    1.000
 0.9890   1.169    1.009        -0.1704    0.9952    -0.0445    0.9997     0.0000    1.000
 0.8396   1.009    0.8370       -0.1732    0.9950    -0.0428    0.9998     0.0004    1.000
 1.006    0.8370   0.6671        0.1513    0.9939     0.2778    0.000      0.3199    0.000
 0.5378   0.6671   0.5100       -0.1486    0.9963    -0.0378    0.9998    -0.0001    1.000
 1.704    0.5100   0.3735        1.137     0.000      1.268     0.000      1.299     0.000
 1.792    0.3735   0.2618        1.369     0.000      1.474     0.000      1.500     0.000
 2.372    0.2618   0.1754        2.071     0.000      2.151     0.000      2.171     0.000
 0.1328   0.1754   0.1122       -0.0924    0.9991     0.0139    1.000      0.0002    1.000

Estimates by LS: 2.3759, -1.4033; after correction by Eqs. (7)-(9): 1.3358, -0.3981; expected estimates: 0.5000, 0.4000.
(a) θ1 = 1.336; θ2 = -0.3981. (b) θ1 = 0.7033; θ2 = 0.2086. (c) θ1 = 0.5005; θ2 = 0.4000.


2.2. Least median of squares (LMS)

The breakdown point of M-estimators or generalised M-estimators (GM-estimators) is, in general, not larger than 30%. In order to obtain a robust method with a higher breakdown point, Rousseeuw developed the so-called least median of squares (LMS) estimator [16], using the following objective function:

Minimise $\mathrm{median}_i(r_i^2)$   (16)

Here median(r_i^2) denotes the median of the squared residuals. This objective function provides a breakdown point of 50%, which is the maximum that can be achieved by a robust method. Furthermore, the LMS estimator is robust with respect to outliers in both y and X. An example can be found in a paper by Rousseeuw [5]. Unfortunately, the LMS method performs poorly from the point of view of convergence rate. The time needed for the LMS calculation is in general proportional to n^3, where n is the number of observations. Steele and Steiger [17] invented an algorithm in which the computation time is proportional to [n log(n)]^2. An algorithm for LMS was presented in Ref. [8].
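Since the objective in Eq. (16) is not smooth, LMS is usually approximated by a search over candidate fits. The Python sketch below uses a random p-subset strategy purely as an illustration; it is not the algorithm of Steele and Steiger [17] or of Ref. [8].

import numpy as np

def lms_fit(X, y, n_subsets=1000, seed=0):
    # Approximate the LMS estimator of Eq. (16) by exact fits to random p-subsets,
    # keeping the candidate with the smallest median of squared residuals.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_theta, best_obj = None, np.inf
    for _ in range(n_subsets):
        idx = rng.choice(n, size=p, replace=False)        # elemental subset of p observations
        try:
            theta = np.linalg.solve(X[idx], y[idx])       # exact fit through the subset
        except np.linalg.LinAlgError:
            continue                                      # skip singular subsets
        obj = np.median((y - X @ theta) ** 2)             # LMS objective, Eq. (16)
        if obj < best_obj:
            best_theta, best_obj = theta, obj
    return best_theta, best_obj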

Least median of squares (LMS) regression was introduced to chemometrics by Massart et al. [18]. They applied LMS to a number of data sets from the chemical literature and found that the robust method was more efficient than LS. Rutan and Carr [19] used simulated univariate data to compare several robust procedures and an adaptive Kalman filtering technique with respect to their ability to eliminate outliers in small data sets. Ukkelberg and Borgen [20] developed a robust alternating regression method aiming at outlier detection.

2.3. Least trimmed squares (LTS)

In order to repair the poor convergence rate (asymptotic efficiency) of LMS, Rousseeuw introduced the least trimmed squares (LTS) estimator, given by

Minimise $\sum_{i=1}^{h} (r^2)_{i:n}$   (17)

where $(r^2)_{1:n} \le \dots \le (r^2)_{n:n}$ are the squared residuals ordered according to their values. This is very similar to LS. The difference is that the largest squared residuals are left out of the summation. The most robust LTS estimator is obtained when h is approximately n/2, in which case the breakdown point is 50%. The LTS estimator converges at the usual rate and the method behaves satisfactorily with respect to asymptotic efficiency.
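Relative to the LMS sketch above, only the objective function changes; Eq. (17) sums the h smallest squared residuals instead of taking their median. The default for h below (roughly n/2, as mentioned in the text) is an illustrative choice.

import numpy as np

def lts_objective(theta, X, y, h=None):
    # LTS objective of Eq. (17): sum of the h smallest squared residuals.
    r2 = np.sort((y - X @ theta) ** 2)
    if h is None:
        h = (len(y) + X.shape[1] + 1) // 2   # roughly n/2, the most robust choice
    return np.sum(r2[:h])

Plugging this objective into the same random-subset search used above for LMS gives a crude LTS estimator.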

2.4. Robust partial least squares (RPLS)

Partial least squares (PLS) regression, developed by Wold [21], is a popular regression method in chemometrics. A robust partial least squares (RPLS) method was recently developed by Wakeling and MacFie [22]. They substituted the procedure for estimating w (the loading vector for the X block) and c (the loading vector for the Y block) in PLS with a robust regression step. They showed the robust approach to be efficient in the presence of random outliers in the Y block. The price to be paid was to abandon the orthogonality constraint on successive w. The algorithm was designed to compensate independently for outliers in the X and Y data blocks. The robust step in this RPLS uses the biweight method, an M-estimator developed by Beaton and Tukey [23].

2.5. Robust principal component regression (RPCR)

A robust principal component regression (RPCR) procedure, primarily aimed at working as an outlier detection tool, was recently developed by Walczak and Massart [24]. Their approach is based on ellipsoidal multivariate trimming (MVT) [8,25] and the least median of squares (LMS) method, with a simple computational procedure. The MVT method was utilised to get a robust dispersion matrix, and thus robust principal components, in order to reveal outliers in the X data block as a first step in PCR. The LMS regression technique was then employed in the PCR in order to identify the outliers in the y data set using standardised residuals from the robust model. It is worth pointing out that Walczak and Massart [24] concluded: "As RPCR minimizes the median of residuals instead of the sum of squared residuals it should probably not be treated as the final model. The RPCR should preferably be considered only as an outlier identification tool, and the final model should be obtained by the usual PCR procedure." This philosophy of laundering the data for outliers to obtain an outlier-free subset has later been followed by Walczak [26]. The laundering of the data is based on a genetic algorithm.


3. Diagnostics

Outlier diagnostics focuses attention on observations with a large influence on the least squares (LS) estimator. The field of diagnostics consists of a combination of numerical and graphical tools. In this work, we divide the diagnostic methods into two categories, classic and robust diagnostics.

3.1. Classic diagnostics

Classic diagnostics are based on the LS model and its residuals or some other non-robust estimates, such as the mean or covariance matrix. Some quantities that occur frequently in classic diagnostics are the diagonal elements of the LS projection matrix H.

The multiple regression model (direct calibration in chemometrics terms) can be written in matrix form as follows:

$y = X\theta + e$   (18)

where

$X = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \dots & x_{np} \end{pmatrix}$   (19)

The hat matrix is defined by

$H = X(X^T X)^{-1} X^T$   (20)

This n × n matrix is called the hat matrix because it transforms the observed vector y into its LS estimate, that is, $\hat{y} = Hy$ ($= X(X^T X)^{-1} X^T y = X\hat{\theta}$). It can easily be verified that

$HH = H$ (idempotent)   (21)

$H^T = H$ (symmetric)   (22)

and

$\mathrm{trace}(H) = p$ (i.e. $\sum_i h_{ii} = p$)   (23)

Eq. (23) follows from

$\mathrm{trace}(H) = \mathrm{trace}\big(X(X^T X)^{-1} X^T\big) = \mathrm{trace}\big(X^T X (X^T X)^{-1}\big) = \mathrm{trace}(I_p) = p$   (24)

Eq. (24) utilizes that $\mathrm{trace}(AB) = \mathrm{trace}(BA)$.

The fact that H is idempotent and symmetric implies that

$h_{ii} = (H)_{ii} = (HH)_{ii} = \sum_j h_{ij} h_{ji} = \sum_j h_{ij} h_{ij} = \sum_j h_{ij}^2 = h_{ii}^2 + \sum_{j \ne i} h_{ij}^2$   (25)

From Eq. (25) one observes that $0 \le h_{ii} \le 1$. Note that

$\partial \hat{y}_i / \partial y_i = h_{ii}$   (26)

This indicates that h_ii measures the effect of the ith observation on its own prediction. Thus, a large h_ii (close to 1) means that observation i has an unusually large influence on the LS regression coefficients. Statisticians regard points as potentially influential when h_ii > 2p/n or 3p/n (two or three times the average value of h_ii). In the latter case the ith observation might be regarded as a leverage point [27-30].

The Mahalanobis distance (MD) is another commonly used diagnostic tool in statistics and chemometrics:

$MD_i^2 = (\mathbf{x}_i - \bar{\mathbf{x}})\,C^{-1}\,(\mathbf{x}_i - \bar{\mathbf{x}})^T$   (27)

Here C is the covariance matrix of X and $\bar{\mathbf{x}}$ is the average vector of the x_i (i = 1, ..., n). It can be proved that

$MD_i^2 = (n-1)(h_{ii} - 1/n)$   (28)

Thus, the Mahalanobis distance has almost the same diagnostic ability as the hat matrix. Both methods might be useful when the data contain outlying observations {x_i}. However, leverage points may have some masking effect on each other, so that the corresponding h_ii values and thus the MD values appear normal. The reason is that the Mahalanobis distance and h_ii are based on a classic covariance matrix which is not robust against outliers.
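Both classic diagnostics of this subsection are short computations in Python; the sketch below assumes that, for the identity of Eq. (28) to hold, the hat matrix is formed from a design matrix that includes an intercept column.

import numpy as np

def hat_diagonal(X):
    # Diagonal elements h_ii of the hat matrix H = X (X^T X)^(-1) X^T, Eq. (20)
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return np.diag(H)

def mahalanobis_sq(X):
    # Squared Mahalanobis distances of Eq. (27), using the classic mean and covariance
    xc = X - X.mean(axis=0)
    C_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum('ij,jk,ik->i', xc, C_inv, xc)

# With X_design = np.column_stack([np.ones(n), X]), hat_diagonal(X_design) and
# mahalanobis_sq(X) satisfy MD_i^2 = (n - 1) * (h_ii - 1/n), Eq. (28).
# Leverage cut-offs used in the text: flag observation i when h_ii > 2*p/n or 3*p/n.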

Examination of h_ii or MD_i alone is not sufficient to disclose all outliers in regression analysis, because neither of them takes y into account. In order to assess the influence of the ith observation on y, it is useful to run the regression both with and without that observation. This procedure results in Cook's squared distance [30]:

$CD^2(i) = \big([\hat{\theta} - \hat{\theta}(i)]^T M [\hat{\theta} - \hat{\theta}(i)]\big)/c$   (29)

In Eq. (29), $\hat{\theta}$ is the LS estimate from the full data set, and $\hat{\theta}(i)$ is the LS estimate on the data set without observation i. Usually, one chooses $M = X^T X$ and $c = ps^2 = p\big[\sum_i r_i^2/(n-p)\big]$. A large value of CD^2(i) implies that the ith observation has a considerable influence on the determination of $\hat{\theta}$. Cook's squared distance can be extended to the diagnosis of multiple outliers by measuring the joint effect of deleting more than one case [28]:

$CD^2(I) = \big([\hat{\theta} - \hat{\theta}(I)]^T M [\hat{\theta} - \hat{\theta}(I)]\big)/c$   (30)

Here I represents the indices corresponding to a subset of observations. The other symbols have the same meaning as in Cook's squared distance for the case of a single observation. The quantity CD^2(I) can be interpreted in an analogous way to CD^2(i). However, the selection of observations to be included in I is not at all obvious. It may well happen that subsets of observations are jointly influential although individual observations are not. Therefore, single-observation diagnostics do not reveal which subsets have to be considered. Moreover, the computation for all pairs, triplets and so on leads to $C_n^m$ runs, where m = 1, 2, ..., n/2. Diagnostic methods that are able to cope efficiently with multiple outliers without suffering from the masking effect seem still to be waiting for their invention.
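Eq. (29) can be implemented directly; the sketch below recomputes every leave-one-out fit explicitly for clarity, with M = X^T X and c = p*s^2 as in the text.

import numpy as np

def cooks_squared_distance(X, y):
    # Cook's squared distance CD^2(i) of Eq. (29) with M = X^T X and c = p * s^2.
    n, p = X.shape
    theta = np.linalg.lstsq(X, y, rcond=None)[0]             # LS estimate on the full data
    s2 = np.sum((y - X @ theta) ** 2) / (n - p)              # residual variance estimate
    M = X.T @ X
    cd = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        theta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]   # LS estimate without obs. i
        d = theta - theta_i
        cd[i] = d @ M @ d / (p * s2)
    return cd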

3.2. Robust diagnostics

In order to avoid the masking effect, a robust distance (RD) was proposed by Rousseeuw et al. [31]:

$RD_i^2 = [\mathbf{x}_i - T(X)]\,C(X)^{-1}\,[\mathbf{x}_i - T(X)]^T$   (31)

Table 2
Comparison between different diagnostic methods

Species             log(body W), x1   log(brain W), x2   MD_i     RD_i     h_ii
Mountain beaver          0.1301            0.9085        1.01     0.54     0.0226
Cow                      2.6675            2.6268        0.70     0.54     0.0534
Grey wolf                1.5603            2.0774        0.30     0.40     0.0371
Goat                     1.4419            2.0607        0.38     0.63     0.0394
Guinea pig               0.0170            0.7404        1.15     0.74     0.0185
Diplodocus               4.0682            1.6990        2.64     6.83*    0.2165*
Asian elephant           3.4060            3.6630        1.71     1.59     0.1011
Donkey                   2.2721            2.6222        0.71     0.64     0.0528
Horse                    2.7168            2.8162        0.86     0.48     0.0600
Potar monkey             1.0000            2.0607        0.80     1.67     0.0591
Cat                      0.5185            1.4082        0.69     0.69     0.0351
Giraffe                  2.7235            2.8325        0.87     0.50     0.0606
Gorilla                  2.3160            2.6085        0.68     0.52     0.0517
Human                    1.7924            3.1206        1.72     3.39*    0.1133
African elephant         3.8231            3.7568        1.76     1.14     0.1094
Triceratops              3.9731            1.8451        2.37     6.11*    0.1858
Rhesus monkey            0.8325            2.2529        1.22     2.72*    0.0895
Kangaroo                 1.5441            1.7482        0.20     0.67     0.0233
Hamster                 -0.9208            0.0000        1.86     1.19     0.0268
Mouse                   -1.6383           -0.3979        2.27     1.24     0.0520
Rabbit                   0.3979            1.0828        0.83     0.47     0.0208
Sheep                    1.7443            2.2430        0.42     0.54     0.0418
Jaguar                   2.0000            2.1959        0.26     0.29     0.0364
Chimpanzee               1.7173            2.6435        1.05     1.95     0.0706
Brachiosaurus            4.9395            2.1889        2.91*    7.26*    0.3012*
Rat                     -0.5528            0.2788        1.59     1.04     0.0215
Mole                    -0.9136            0.4771        1.58     1.19     0.0602
Pig                      2.2833            2.2553        0.40     0.75     0.0393

Distances exceeding the cut-off value $\sqrt{\chi^2_{2,0.975}} = 2.72$ for MD (Mahalanobis distance) and RD (robust distance), and h_ii values greater than 3p/n = 0.2143, are marked with an asterisk.


By comparing Eq. (31) with Eq. (27) for the calculation of the Mahalanobis distance, we observe that T(X) and C(X) replace the mean $\bar{\mathbf{x}}$ and the covariance matrix C in Eq. (27) to obtain a robust diagnostic. T(X) and C(X) are the so-called minimum volume ellipsoid (MVE) estimators [32] and they can be obtained by the following iteratively weighted procedure:

$T(X)^k = \Big(\sum_i w_i^k \mathbf{x}_i\Big) \Big/ \Big(\sum_i w_i^k\Big)$   (32)

and

$C(X)^k = \Big(\sum_i w_i^k - 1\Big)^{-1} \sum_i w_i^k \,[\mathbf{x}_i - T(X)^k]^T [\mathbf{x}_i - T(X)^k]$   (33)

where the weights $w_i^k = w(RD_i^{k-1})$ depend on the previous robust distances,

$w_i^k = \begin{cases} 1 & \text{if } RD_i^{k-1} \le c \\ 0 & \text{otherwise} \end{cases}$   (34)

The cut-off value c might be chosen as $\chi^2_{p,0.975}$.
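A sketch of the reweighting cycle of Eqs. (32)-(34) is given below: the classic mean and covariance are used as starting values and the squared distances are compared with SciPy's chi-squared quantile, both of which are illustrative choices; the MVE estimators of Ref. [32] proper require a resampling search that is not shown here.

import numpy as np
from scipy.stats import chi2

def robust_distances(X, n_iter=20):
    # Iteratively reweighted location T(X) and scatter C(X), Eqs. (32)-(34),
    # and the resulting robust distances of Eq. (31).
    n, p = X.shape
    cutoff = chi2.ppf(0.975, df=p)               # cut-off applied to the squared distances
    T = X.mean(axis=0)                           # classic starting values (illustrative)
    C = np.cov(X, rowvar=False)
    for _ in range(n_iter):
        d = X - T
        rd2 = np.einsum('ij,jk,ik->i', d, np.linalg.inv(C), d)    # RD_i^2, Eq. (31)
        w = (rd2 <= cutoff).astype(float)                         # 0/1 weights, Eq. (34)
        T = (w[:, None] * X).sum(axis=0) / w.sum()                # weighted mean, Eq. (32)
        dc = X - T
        C = (w[:, None] * dc).T @ dc / (w.sum() - 1.0)            # weighted scatter, Eq. (33)
    d = X - T
    return np.sqrt(np.einsum('ij,jk,ik->i', d, np.linalg.inv(C), d))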

Rousseeuw et al. used this robust diagnostic together with residuals from LMS (instead of LS) to diagnose outliers both in y and X. However, Cook and Hawkins [33] showed that this procedure may indicate a plethora of outliers, the identity of which can change dramatically with small changes in the parameters of the algorithm for robust estimation. More recently, Atkinson and Mulira [34] proposed a new robust diagnostic technique, the so-called stalactite plot for the deletion of multivariate outliers, to remedy this failure. Instead of replacing the means and covariance matrix with robust estimates in the Mahalanobis distance, they made a robust covariance matrix by sequential construction of an outlier-free subset of the data, starting from a small random subset. The detection of multiple outliers in multivariate data can be accomplished by a forward procedure in which the Mahalanobis distances are calculated, and these distances are subsequently used to produce a stalactite plot. The stalactite plot provides a summary of suspected outliers as the subset increases. Combined with probability plots and resampling procedures, the stalactite plot leads to identification of multivariate outliers, even in the presence of appreciable masking effects. An example is shown in Table 2. The data consist of the logarithmically transformed brain weight (in grams) and body weight (in kilograms) of 28 species. The problem to be investigated is to determine whether a larger brain is required to govern a heavier body.

Fig. 1. Scatter plot of brain vs. body weight of 28 species after logarithmic transformation (Table 2).

The diagonal elements of the hat matrix, the Mahalanobis distances and the robust distances are collected in Table 2. The scatter plot of the data is shown in Fig. 1 with the outliers marked. Table 2 shows that the Mahalanobis distance detects only one outlier (marked in Table 2). The hat matrix does not perform much better: only two outliers are detected. On the other hand, the robust distance (defined by Eq. (31)) discloses three strong and two weak outliers. The stalactite plot detects the three strong outliers found by the robust diagnostics [33].

In chemometrics, Næs developed an outlier diagnostic procedure for principal component regression (PCR) in order to obtain leverage and influence measures [35]. Hu and Massart [36] examined the outlier detection ability of several robust methods, such as the single median method, the repeated median, least median of squares (LMS), and fuzzy calibration. Their overall conclusion is that robust and fuzzy methods provide acceptable calibration results in the presence of outliers. In addition, the outlier diagnostics based on the residuals from LS failed to give correct results.

4. Robust methods for dimensionality reduction

4.1. Projection pursuit (PP)

If a multivariate observation is plotted as a point in a p-dimensional variable space, a set of n observations forms a point cloud in this p-dimensional space. The goal of multivariate data analysis is to find and describe the structure of the point cloud in a lower-dimensional space. Projection pursuit (PP) is a method for dimensionality reduction. It searches for a lower-dimensional subspace that reflects the structure of the original p-dimensional data in an optimal way. PP techniques were reviewed by Huber [37]. Friedman and Stuetzle extended the idea behind PP and added projection pursuit regression (PPR) [38], projection pursuit classification (PPC) [39] and projection pursuit density estimation (PPDE) [40].

In fact, principal component analysis is a special PP procedure. Let X be a data set of n observations {x_i = (x_i1, x_i2, ..., x_ip); i = 1, ..., n} with covariance matrix $V = X^T X$. Denote the eigenvalues of V by γ_1, γ_2, ..., γ_p. The first principal component is the projection of X onto a certain direction, that is,

$\gamma_1 = \max(\mathbf{a}_1^T V \mathbf{a}_1) = \max(\mathbf{a}_1^T X^T X \mathbf{a}_1), \quad \|\mathbf{a}_1\| = 1$

$\gamma_2 = \max(\mathbf{a}_2^T V \mathbf{a}_2), \quad \|\mathbf{a}_2\| = 1 \text{ and } \mathbf{a}_2 \perp \mathbf{a}_1$

$\vdots$

$\gamma_p = \max(\mathbf{a}_p^T V \mathbf{a}_p), \quad \|\mathbf{a}_p\| = 1 \text{ and } \mathbf{a}_p \perp \mathbf{a}_1, \dots, \mathbf{a}_{p-1}$   (35)

It is well known that {a_i; i = 1, ..., p} are the associated eigenvectors of V. On the other hand, if the principal components of V are known, V can be constructed by

$V = \sum_i \gamma_i \mathbf{a}_i \mathbf{a}_i^T$   (36)

The projection index of classic PCA is the variance, which is very sensitive to outliers. This observation suggests that if one chooses a robust projection index, a robust PCA should be possible. It is also interesting to note that there is a relation between robust regression and PP [5]. As remarked by Rousseeuw, the usefulness of PP in synthesising high-breakdown procedures is therefore not surprising.

4.2. Robust PCA and robust SVD

Li and Chen [41] developed a robust PCA based on the PP approach. The projection index was chosen as Huber's M-estimator of dispersion. Xie et al. [42] developed a robust principal component analysis (PCA) based on projection pursuit and simulated annealing. The sample median was used as projection index. The results for simulated data showed the robust PCA to be resistant to deviations from the normal distribution and to outliers. A robust singular value decomposition (RSVD), also based on projection pursuit (PP), was recently developed by Ammann [43]. The method can be described as an iteration in two steps: a least squares regression fit of the data matrix followed by a rotation to the regression hyperplanes. Robust location and covariance estimators are developed via general M-estimators for the covariance matrix eigenvectors and eigenvalues. The proposed RSVD can be used as a basis for several multivariate methods such as errors-in-variables regression, discriminant analysis, and principal components.
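A toy projection-pursuit search for a first robust component is sketched below, in the spirit of the methods above but not reproducing the algorithms of Refs. [41]-[43]: candidate directions are drawn at random and scored with the median absolute deviation (MAD) of the projected data instead of the variance.

import numpy as np

def mad(z):
    # Median absolute deviation, a robust spread measure used here as projection index
    return np.median(np.abs(z - np.median(z)))

def pp_first_component(X, n_directions=5000, seed=0):
    rng = np.random.default_rng(seed)
    Xc = X - np.median(X, axis=0)                 # robust (coordinate-wise median) centring
    best_a, best_index = None, -np.inf
    for _ in range(n_directions):
        a = rng.normal(size=X.shape[1])
        a /= np.linalg.norm(a)                    # random candidate unit direction
        index = mad(Xc @ a)                       # robust projection index
        if index > best_index:
            best_a, best_index = a, index
    return best_a, best_index

Further components can be obtained by repeating the search in the subspace orthogonal to the directions already found.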


5. Discussion

Several robust methods for multivariate analysis have recently been developed in statistics and chemometrics. One driving force for this interest is the industrial need for robust methods. Thus, outliers occur frequently in real world data. On the other hand, all the common chemometrics methods make explicit or implicit assumptions about the underlying data structure, assumptions that may be vulnerable to deviations from normality. These assumptions are sometimes mathematically convenient rationalisations of an often fuzzy knowledge or belief. To cure this situation, robust methods are necessary alternatives. Chemometricians should test these methods, and, if necessary, adapt them to chemical practice. New methods specifically developed for chemical purposes may also be necessary.

Acknowledgements

Y.Z.L. thanks the Norwegian Research Council for Science and Humanities (NAVF), the National Natural Science Foundation of the P.R.C. and the Fok Ying Tong Education Foundation for a travel grant.

References

[1] V.J. Clancey, Nature, 159 (1947) 339-340.
[2] E.K. Harris and D.L. DeMets, Clin. Chem., 18 (1972) 605-612.
[3] V. Barnett and T. Lewis, Outliers in Statistical Data, 3rd edn., Wiley, Chichester, 1993.
[4] F.Y. Edgeworth, Hermathena, 6 (1887) 279-285.
[5] P.J. Rousseeuw, J. Chemom., 5 (1991) 1-20.
[6] J.L. Hodges, Proc. Fifth Berkeley Symp. Math. Stat. Probab., 1 (1967) 163-168.
[7] H.R. Hampel, Ann. Math. Stat., 42 (1971) 1887-1896.
[8] P.J. Rousseeuw and A.M. Leroy, Robust Regression and Outlier Detection, Wiley, New York, 1987.
[9] P.J. Huber, Robust Statistics, Wiley, New York, 1981.
[10] D.F. Andrews, Technometrics, 16 (1974) 523-531.
[11] P.W. Holland and R.E. Welsch, Commun. Stat. (Theory Methods), 6 (1977) 813-828.
[12] G.R. Philips and E.R. Eyring, Anal. Chem., 55 (1983) 1134-1138.
[13] R. Wolters and G. Kateman, J. Chemom., 3 (1989) 329-342.
[14] W.Z. Wei, W.H. Zhu and S. Yao, Chemom. Intell. Lab. Syst., 18 (1993) 17-26.
[15] Y.L. Xie, Y.Z. Liang, J.H. Jiang and R.Q. Yu, Anal. Chim. Acta, 313 (1995) 185-196.
[16] P.J. Rousseeuw, J. Am. Stat. Assoc., 79 (1984) 871-880.
[17] J.M. Steele and W.L. Steiger, Discrete Appl. Math., 14 (1986) 93-100.
[18] D.L. Massart, L. Kaufman, P.J. Rousseeuw and A. Leroy, Anal. Chim. Acta, 187 (1985) 171-179.
[19] S. Rutan and P.W. Carr, Anal. Chim. Acta, 16 (1988) 131-142.
[20] Å. Ukkelberg and O.S. Borgen, Anal. Chim. Acta, 277 (1993) 489-494.
[21] K. Jöreskog and H. Wold (Eds.), Systems under Indirect Observation: Causality, Structure, Prediction, North-Holland, Amsterdam, 1982.
[22] I.N. Wakeling and H.J.H. Macfie, J. Chemom., 6 (1992) 189-198.
[23] A.E. Beaton and J.W. Tukey, Technometrics, 16 (1974) 147-185.
[24] B. Walczak and D.L. Massart, Chemom. Intell. Lab. Syst., 27 (1995) 41-54.
[25] J.S. Devlin, R. Gnanadesikan and J.R. Kettering, J. Am. Stat. Assoc., 76 (1981) 354-362.
[26] B. Walczak, Chemom. Intell. Lab. Syst., 28 (1995) 259-272.
[27] J.P. Stevens, Psychol. Bull., 95 (1984) 334-344.
[28] R.D. Cook and S. Weisberg, Residuals and Influence in Regression, Chapman and Hall, London, 1982.
[29] R.R. Hocking, Technometrics, 25 (1983) 219-249.
[30] R.D. Cook, Technometrics, 19 (1977) 15-18.
[31] P.J. Rousseeuw and B.C. van Zomeren, J. Am. Stat. Assoc., 85 (1990) 633-639.
[32] P.J. Rousseeuw, in W. Grossmann, G. Pflug and I. Vincze (Eds.), Mathematical Statistics and Applications, Vol. B, Reidel, Dordrecht, 1985, pp. 283-297.
[33] R.D. Cook and D.M. Hawkins, J. Am. Stat. Assoc., 85 (1990) 640-644.
[34] A.C. Atkinson and H.-M. Mulira, Stat. Comput., 3 (1993) 27-35.
[35] T. Næs, Chemom. Intell. Lab. Syst., 5 (1989) 155-168.
[36] Y.Z. Hu, J. Smeyers-Verbeke and D.L. Massart, Chemom. Intell. Lab. Syst., 9 (1990) 31-44.
[37] P.J. Huber, Ann. Stat., 13 (1985) 435-475.
[38] J.H. Friedman and W. Stuetzle, Projection pursuit classification, unpublished manuscript, 1980.
[39] J.H. Friedman and W. Stuetzle, J. Am. Stat. Assoc., 76 (1981) 817-823.
[40] J.H. Friedman and W. Stuetzle, J. Am. Stat. Assoc., 79 (1984) 599-608.
[41] G.Y. Li and Z.L. Chen, J. Am. Stat. Assoc., 80 (1985) 759-766.
[42] Y.L. Xie, J.H. Wang, Y.Z. Liang, L.X. Sun, X.H. Song and R.Q. Yu, J. Chemom., 7 (1993) 527-541.
[43] L.P. Ammann, J. Am. Stat. Assoc., 88 (1994) 505-514.