Climaco Pinto et al ACA 2009 653 131 142

Analytica Chimica Acta 653 (2009) 131–142
Contents lists available at ScienceDirect
Journal homepage: www.elsevier.com/locate/aca

Improving the detection of significant factors using ANOVA-PCA by selective reduction of residual variability

R. Climaco-Pinto a,b, A.S. Barros b, N. Locquet c, L. Schmidtke d, D.N. Rutledge a,c,*

a Laboratoire de Chimie Analytique, AgroParisTech, 16, rue Claude Bernard, 75005 Paris, France
b Departamento de Química, Universidade de Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
c INRA/AgroParisTech UMR 214 “IAQA”, 16, rue Claude Bernard, 75005 Paris, France
d National Wine and Grape Industry Centre, Charles Sturt University, Boorooma Street, Wagga Wagga, NSW 2650, Australia

* Corresponding author at: Laboratoire de Chimie Analytique, AgroParisTech, 16, rue Claude Bernard, 75005 Paris, France. Tel.: +33 1 44 08 16 48; fax: +33 1 44 08 16 53.
E-mail addresses: [email protected] (R. Climaco-Pinto), [email protected] (A.S. Barros), [email protected] (L. Schmidtke), [email protected] (D.N. Rutledge).

Article history: Received 18 February 2009; Received in revised form 15 July 2009; Accepted 10 September 2009; Available online 16 September 2009.

Keywords: ANOVA-PCA; ASCA; Error removal; Discrimination

Abstract

Selective elimination of residual error can be used when applying Harrington’s ANOVA-PCA in order to improve the capabilities of the method. ANOVA-PCA is sometimes unable to discriminate between levels of a factor when sources of high residual variability are present. In some cases this variability is not random, possesses some structure and is large enough to be responsible for the first principal components calculated by the PCA step in the ANOVA-PCA. This sometimes makes it impossible for the interesting variance to appear in the first two PCA components. By using the proposed selective residuals elimination procedure, one may improve the ability of the method to detect significant factors, as well as gain an understanding of the different kinds of residual variance present in the data. Two datasets are used to show how the method is used to iteratively detect variance associated with the factors even when it is not initially visible. A permutation method is used to confirm that the observed significance of the factors was not accidental.

© 2009 Elsevier B.V. All rights reserved.
0003-2670/$ – see front matter. doi:10.1016/j.aca.2009.09.016

1. Introduction

1.1. ANOVA-PCA (or APCA)

Analysis of variance-principal components analysis (ANOVA-PCA) has been used for the detection of biomarkers [1,2], to assess the stability of reference materials [3] and to evaluate the significance of factors of an experimental design, as well as for prediction of new samples [4]. This supervised method uses the ANOVA paradigm to create a series of matrices containing the means for the different levels of the main effects and interactions of the factors of an experimental design, to which are added the residual errors. PCA is then applied to each of these mean plus error matrices in order to evaluate the significance of the effects against the residual error. As usual with PCA, scores and loadings are obtained, which may be used to study the existence of groupings of individuals, to evaluate the importance of the initial variables in the definition of the effects and the sources of residual variation, and to compare them to the different factors in the experimental design.

It is clear from the above description that this procedure is not related to the ANOVA-based method that is often used to detect significant variables prior to a multivariate analysis such as PCA. It is in fact very similar to ASCA (analysis of variance-simultaneous components analysis), where similar matrices, but without the residual errors added back, are analyzed by simultaneous components analysis [5–7]. To avoid confusion and underline this similarity, we prefer to use the term APCA throughout this paper, rather than ANOVA-PCA. The most important difference between APCA and ASCA is that with the latter method the multivariate analysis is performed on the matrices of level means of the factors without the residual errors having been added back, which means that it is necessary to use a resampling procedure such as bootstrapping in order to be able to evaluate the significance of the factors in comparison to the residual error. With standard APCA, resampling is not necessary as the significance of the factors can be estimated by examining the scores plots. But, although resampling is not required by APCA, it may of course be applied in a similar way to gain further insight into the characteristics of the factors and samples.

Depending on the data being analyzed, problems may arise with APCA. One clear limitation of the method is when there is a large amount of structured residual variance, due for example to an interfering substance with specific absorbance peaks. The variability due to this interference may give rise to principal components with high





variance, which will make it difficult to reach a conclusion. In the original APCA method, it was considered that if the first principal component is not due to the variability of the factor being tested, that factor is considered non-significant.

By eliminating part of the variability from the residual error matrix, it may be possible to make the spectra of the replicates more comparable, so that the principal components which are related to the factor become the ones with largest variance, instead of those related to the more or less structured variability present in the residuals. By applying PCA to the residuals matrix calculated in the first step of the APCA, and eliminating a certain number of principal components, it is possible to selectively reduce the residual variability. Then, a less-noisy version of the initial matrix, with reduced intra-sample variability, can be rebuilt and analyzed again by APCA.

1.2. APCA with selective residuals variance reduction

The ability of the APCA method may be improved in cases where there is only a separation of the scores of the samples for the levels of the Factor under consideration along PC2 or higher, indicative of a situation where the Factor is significant even though its variability is less than that of the residual errors.

The modification of the APCA method consists in rebuilding a series of increasingly less-noisy initial matrices by eliminating increasing numbers of principal components from the residual error matrix, thus reducing the inter-repetition variability. These less-noisy matrices are then analyzed by APCA to determine at what point it is possible to observe a separation of the scores for the samples belonging to different levels of the considered Factor.

To verify that the separation is not simply an artifact due to the elimination of too much residual error, the Factor levels are randomized and the APCA is repeated, in order to compare the results of the tested Factor levels with those of the random levels. If the distances between group centroids for the tested Factor levels are superior to the distances for the random levels, the Factor may be considered as significant compared to that reduced residual error. This procedure makes it easier to study the relationship between the information due to a Factor and the different parts of the residual variance.

The present work shows how the selective reduction of residual variance can help the standard APCA method to overcome one of its limitations. Two real datasets are used to illustrate the proposed procedure.

2. Theory

2.1. APCA

APCA can be seen as a supervised method to test whether a data matrix contains information related to the various Factors of an experimental design. Each of the samples (rows) of the matrix is attributed to a level for each of the Factors and Interactions of the design. APCA successively calculates a series of matrices corresponding to the means of the variables at each level of each Factor, and then subtracts them from the original matrix to give a final matrix of residual errors. In this matrix, the vector of residuals of each sample is different.

In the simple case of an experimental design with two Factors with j and k levels, this decomposition can be written in vector form as:

x_i = x̄ + ᾱ_j + β̄_k + (ᾱβ̄)_jk + ε_i   (1)

x_i being the vector of responses for sample i, x̄ the global average vector for the whole data matrix X (of dimensions I × F). ᾱ_j and β̄_k are the effects for Factors one and two respectively, (ᾱβ̄)_jk is the interaction between these Factors and ε_i the residuals for the vector of responses for the sample. The length of all these vectors is F.
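In NumPy terms, the decomposition of Eq. (1) for a balanced two-factor design can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name `apca_decompose` is our own.

```python
import numpy as np

def apca_decompose(X, f1, f2):
    """Split X (I x F) into the terms of Eq. (1): global average matrix,
    two factor-effect matrices, their interaction, and residuals."""
    I, F = X.shape
    xbar = np.tile(X.mean(axis=0), (I, 1))      # global average, one row per sample
    A, B, AB = (np.zeros_like(X) for _ in range(3))
    for lv in np.unique(f1):                    # effect of Factor 1 (level means, centred)
        m = f1 == lv
        A[m] = X[m].mean(axis=0) - xbar[0]
    for lv in np.unique(f2):                    # effect of Factor 2
        m = f2 == lv
        B[m] = X[m].mean(axis=0) - xbar[0]
    for l1 in np.unique(f1):                    # interaction term
        for l2 in np.unique(f2):
            m = (f1 == l1) & (f2 == l2)
            if m.any():
                AB[m] = X[m].mean(axis=0) - xbar[0] - A[m][0] - B[m][0]
    E = X - xbar - A - B - AB                   # residuals, one vector per sample
    return xbar, A, B, AB, E
```

Summing the five matrices recovers X exactly; in a balanced design each effect matrix also has zero column means.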

The residuals matrix is then added back to each of the factor matrices and a PCA is performed on each one of these “factor + residuals” matrices, and the scores and loadings are examined. The original APCA procedure only considered the first two PCs. Conclusions may be drawn concerning each Factor by examining the Hotelling T² [8] ellipses (at 95% confidence) for each level. Four situations may arise:

1. If there is a separation of the levels along PC1, the Factor is significant compared to the residual error.

2. If the separation is along PC2, there is information related to the Factor levels, but the noise variance contained in PC1, related to the residual error, is greater.

3. If there is no separation in the first two components, it may mean that the data matrix contains no information related to the different levels of the Factor.

4. If there is no separation in the first two components, it may mean that although there is information in the data matrix related to the different levels of the Factor, the residual error variability is so much greater that it is only visible in later PCs.

In order to be able to decide between situations 3 and 4, a method is proposed here to progressively reduce the part of the variance of the residual error matrix to be added back to the level means matrices. Once a separation is observed along PC1, its validity is verified by using a permutation method. The proposed method is therefore an intermediate along the continuum between APCA and ASCA.

2.2. Improved detection of significant factors by selective elimination of residual variability

In some situations the variability between replicates behaves like white noise, in which case simply averaging the replicates may reduce it sufficiently for the differences among different samples to become apparent. However, this is very often not the case, and it is then necessary to selectively eliminate certain parts of the variability. In the case of APCA, this may be done very simply by reducing the variance of the residual errors matrix.

As was explained by Zhu et al. in their paper on EROS [9]: “When the measurements were replicated . . . there was considerable variability between the replicate spectra. . . . A pre-treatment designed to remove this variability from the spectra was constructed in the following way. Data on replicate measurements were analyzed by principal component analysis (PCA) to identify the dominant structure in the variability, and the corresponding factors were subtracted from the spectra by orthogonal projection. The approach, which was successful in this application, is quite general, and could be used in many other situations.”

Their idea here is to perform a data pre-treatment which reduces variation in the replicates by subtracting information related to the main sources of variability present in the residuals. Such a data pre-treatment is in some ways similar to Orthogonal Signal Correction (OSC) [10] and Orthogonal Partial Least Squares regression (OPLS) [11]. As with these latter two techniques, the number of orthogonal components to be used must be optimised to ensure that too much residual variability is not removed (to avoid an “over-fitting” of the residuals). A permutation test is used here to do this, making it possible to check the significance of each Factor compared to the residual error. Including this data pre-treatment within APCA means that the procedure can now be situated on a continuum between standard APCA, where the complete matrix of residuals is added back to the matrices of means, and ASCA, where none of the residuals are added to the matrices of means.


Table 1
Mahalanobis distances between the 2 centroids due to the Factor Concentration levels, calculated for the model with the true group attribution and for 500 models with group classification randomly attributed. p-value calculated as the number of “random” model distances superior to the “true” distances, divided by the number of models (500).

Number of PCs eliminated from ε | Maximum random distance (500 iterations) | Real distance | p-value (500 iterations)
0        | 0.181 | 0.256 | 0.000
1        | 0.082 | 0.256 | 0.000
60 (all) | 0.071 | 0.254 | 0.000

In order to optimise the selection of the number of principal components to be removed from the residuals matrix, a permutation procedure similar to that used in ASCA must be implemented.

The way the residual variability is extracted from the residuals matrix is succinctly described below, in Eq. (2). This intra-sample (or inter-replicates) variability is the residual error of APCA, ε (of dimensions I × F), calculated according to Eq. (1). The residual variability may be progressively reduced by applying a PCA to ε and subtracting n PCs from the residuals, by using its scores T and eigenvectors P according to:

ε_r = ε − TPᵀ   (2)

Fig. 1. (a) APCA scores and (b and c) loadings using APCA on the initial data for Factor Concentration. Although there is a separation of the two concentration levels, it is not on PC1, which is associated to the CO2 contributing to the residuals.

Fig. 2. (a) Loadings of PC1 extracted from the residual variance matrix; (b) resulting data matrix X̃ after eliminating PC1.

where ε_r is the matrix of reduced residuals after elimination of the selected variance, and T (of dimensions I × n) and P (of dimensions F × n) are the matrices composed of, respectively, the n selected PC scores and eigenvectors from the PCA on ε.
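Eq. (2) amounts to deflating the residuals by their first n principal components. A minimal SVD-based sketch (assuming, as holds for balanced ANOVA residuals, that ε is already column-centred; the function name is our own):

```python
import numpy as np

def reduce_residuals(E, n):
    """Eq. (2): remove the first n principal components from the APCA
    residuals E (I x F) and return the reduced residuals epsilon_r."""
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    T = U[:, :n] * s[:n]        # scores of the n selected PCs (I x n)
    P = Vt[:n].T                # corresponding eigenvectors (F x n)
    return E - T @ P.T          # reduced residuals
```

The reduced residuals then carry no variance along the n removed directions.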

A reduced residual error version of the initial dataset can then be “rebuilt” to give X̃ by summing all the matrices of the Factor levels and the matrix of reduced residuals, as in Eq. (3):

X̃ = X̄ + ᾱ + β̄ + (ᾱβ̄) + ε_r   (3)

where X̄ is a matrix in which all lines are equal to the global average, ᾱ is a matrix in which each line is the average of the group to which the sample belongs for that Factor (the same for β̄ and (ᾱβ̄)), and ε_r is the matrix of reduced residuals.

Table 2
Mahalanobis distances between the 2 centroids due to the Factor Day levels, calculated for the model with the true group attribution and for 500 models with group classification randomly attributed. p-value estimated as the number of “random” model distances superior to the “true” distances, divided by the number of models (500).

Number of PCs eliminated from ε | Maximum random distance (500 iterations) | Real distance | p-value (500 iterations)
0        | 0.171 | 0.109 | 0.034
1        | 0.051 | 0.082 | 0.000
60 (all) | 0.053 | 0.078 | 0.000


Table 4
Mahalanobis distances between the 2 centroids due to the Factor Oak levels, calculated for the model with the true group attribution and for 500 models with group classification randomly attributed. p-value estimated as the number of “random” model distances superior to the “true” distances, divided by the number of models (500).

Number of PCs eliminated from ε | Maximum random distance (500 iterations) | Real distance | p-value (500 iterations)
0         | 2.982 | 1.223 | 0.282
3         | 0.745 | 0.580 | 0.019
108 (all) | 0.960 | 0.572 | 0.024

Table 5
Mahalanobis distances between the 2 centroids due to the Factor Mox levels, calculated for the model with the true group attribution and for 500 models with group classification randomly attributed. p-value estimated as the number of “random” model distances superior to the “true” distances, divided by the number of models (500).

Number of PCs eliminated from ε | Maximum random distance (500 iterations) | Real distance | p-value (500 iterations)
0         | 2.722 | 0.186 | 0.968
4         | 1.001 | 0.272 | 0.206
108 (all) | 0.798 | 0.265 | 0.250

Fig. 3. (a) APCA scores and (b) loadings using APCA for Factor Concentration of X̃ after eliminating PC1 from the residual variance matrix. This reduction in the residual variance has eliminated the structured CO2 related variance, leading to the Factor Concentration being significant compared to the reduced residual error.

2.3. APCA with selective reduction of residual variability

If no separation of the levels is achieved when performing the standard APCA method on the data, one starts to eliminate residual variability from the data. This is done iteratively by subtracting increasing proportions of the variance in ε, using increasing numbers of PCs as P in Eq. (2), to then calculate X̃ as in Eq. (3). The iterations are continued until a good separation of the groups of the levels of the desired Factor is attained when APCA is applied to X̃. When the separation is attained for n PCs, the corresponding residuals ε_n will be retained for subsequent analyses.

Table 3
Mahalanobis distances between the 3 centroids due to the Factor Year levels, calculated for the model with the true group attribution and for 500 models with group classification randomly attributed. p-value estimated as the number of “random” model distances superior to the “true” distances, divided by the number of models (500).

Number of PCs eliminated from ε | Maximum random distance (500 iterations) | Real distance | p-value (500 iterations)
0         | 7.820 | 5.914 | 0.078
1         | 4.728 | 7.952 | 0.000
2         | 3.261 | 8.555 | 0.000
108 (all) | 2.813 | 8.544 | 0.000

Fig. 4. Data related to the results obtained in Fig. 1, before eliminating any residual variance. (a) Averages of the levels for the normal dataset X (black) and plot of residuals (grey). (b) Mahalanobis distances between the two levels for 500 random permutations (grey) and for the normal classification (black dot).


Fig. 5. Data related to the results obtained in Fig. 3, after eliminating PC1 from the residual variance matrix. (a) Averages for the concentration levels of the dataset X̃ after eliminating PC1 from the residuals matrix (black) and residuals (grey). (b) Mahalanobis distances for 500 random permutations (grey) and for the normal classification (black dot).





In fact, whether a Factor is significant or not, a separation of the groups will eventually be found, because the residuals tend to zero as more principal components are eliminated. In the extreme case of completely eliminating the residual variability, it is certain that even small differences in level averages will give rise to a separation of the groups. To verify that the observed separation of the levels is not due to chance (just because the residuals are too small), as in ASCA, a resampling procedure based on permutation of group memberships is used [7]. A bootstrap alternative can be used, as proposed by Timmerman and Ter Braak [12].
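For a single factor (ignoring interactions), the iterative rebuild-and-test loop of this section can be sketched as follows; `rebuild_reduced` is a hypothetical helper, not code from the paper:

```python
import numpy as np

def rebuild_reduced(X, labels, n):
    """Rebuild a less-noisy X-tilde (Eq. 3) after removing the first n PCs
    from the residuals (Eq. 2).  One-factor sketch, no interaction term."""
    group_mean = np.zeros_like(X)
    for g in np.unique(labels):
        group_mean[labels == g] = X[labels == g].mean(axis=0)
    E = X - group_mean                   # residuals of the one-factor model
    _, _, Vt = np.linalg.svd(E, full_matrices=False)
    P = Vt[:n].T                         # loadings of the n PCs to eliminate
    return group_mean + E - (E @ P) @ P.T

# Iterate n = 0, 1, 2, ... until APCA on rebuild_reduced(X, labels, n)
# separates the factor levels along PC1, then validate by permutation.
```

With n = 0 the original matrix is returned unchanged; removing all PCs leaves only the level means, which is why the permutation check is essential.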

2.4. Permutation procedure

The samples are randomly attributed to Factor levels and the level averages are calculated for this random group classification. The residuals previously calculated for the standard analysis (ε_n) are added to the resulting matrix of Factor averages and the PCA is performed. The total or partial Mahalanobis [13] inter-group distances corresponding to each of the levels are calculated. The permutation procedure is repeated hundreds or thousands of times and the distances are added and averaged in order to obtain a sampling distribution. A histogram of all the distances is plotted, to be compared with the inter-group distances obtained by the non-random APCA.
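The inter-centroid Mahalanobis distance can be computed on the APCA scores, for instance with a pooled within-group covariance; the paper does not spell out the implementation, so the following is one plausible sketch with names of our own:

```python
import numpy as np

def mahalanobis_centroid_distance(scores, labels, g1, g2):
    """Mahalanobis distance between the centroids of two level groups,
    using the pooled within-group covariance of the scores."""
    A = scores[labels == g1]
    B = scores[labels == g2]
    pooled = ((len(A) - 1) * np.cov(A, rowvar=False)
              + (len(B) - 1) * np.cov(B, rowvar=False)) / (len(A) + len(B) - 2)
    d = A.mean(axis=0) - B.mean(axis=0)
    return float(np.sqrt(d @ np.linalg.solve(pooled, d)))
```

The distance is symmetric in the two groups and scale-invariant, which makes it a natural choice for comparing "true" and permuted groupings.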

Two situations may be observed when examining the histogram of inter-group distances:

1. The randomly permuted levels are also separated with the tested matrix of residuals ε_n, meaning that the Factor is not significant. So much variability has been deleted from the residuals using n PCs that even random averages separate the level groups.

2. The random level groups are not separated while the non-random ones are, in which case the Factor may be considered as significant.
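A sketch of the permutation loop for a two-level factor is given below; for brevity the Euclidean distance between centroids stands in for the Mahalanobis distance and the PCA step is omitted, so this illustrates the logic rather than the exact published procedure:

```python
import numpy as np

def centroid_distance(M, labels):
    """Euclidean distance between the centroids of the two level groups."""
    g = np.unique(labels)
    return float(np.linalg.norm(M[labels == g[0]].mean(axis=0)
                                - M[labels == g[1]].mean(axis=0)))

def permutation_distances(X, labels, E_n, n_perm=500, seed=0):
    """Level averages are recomputed for shuffled labels, the residuals E_n
    from the *normal* classification are added back, and the inter-centroid
    distance is recorded, building the random reference distribution."""
    rng = np.random.default_rng(seed)
    dists = []
    for _ in range(n_perm):
        perm = rng.permutation(labels)
        means = np.zeros_like(X)
        for g in np.unique(perm):
            means[perm == g] = X[perm == g].mean(axis=0)
        dists.append(centroid_distance(means + E_n, perm))
    return np.array(dists)
```

Reusing the residuals of the normal classification (rather than recomputing them per permutation) is deliberate, as explained at the end of this section.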

Fig. 6. (a) APCA scores and (b and c) loadings for the Factor Day on the initial Carraghenan dataset X. PC1 is related to CO2 and does not separate the groups, while there is some separation along PC2.


It is possible to estimate the significance of an inter-group distance, and therefore of a Factor, from the proportion of distances in the random reference distribution that are larger than the “true” inter-group distance. For example, if there are 25 distances in a random reference distribution of 500 distances that are greater than the “true” inter-group distance, then the probability of finding a larger distance (the p-value) is 25/500 = 0.05.

The reason that the residuals added to both the normal and the random permutation models are those from the calculation with the normal classification of samples is that, if a Factor is significant, the residuals from the permutation models are necessarily larger than those of the standard method. So, by using the smaller residuals of the standard model, the probability of observing a significant separation of the random level groups is increased, thus reducing the probability of false positives, i.e. accepting as significant a Factor which is not.
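The p-value is simply the proportion of permuted distances exceeding the true one; with hypothetical numbers matching the 25-out-of-500 example:

```python
import numpy as np

# Hypothetical permutation distances: 475 small, 25 large (illustrative values)
random_distances = np.concatenate([np.full(475, 0.10), np.full(25, 0.30)])
true_distance = 0.25
p_value = (random_distances > true_distance).mean()  # 25/500 = 0.05
```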

3. Materials and methods

Two datasets (herein named “Carraghenan” and “Wine”) were used to demonstrate the method. Both datasets were acquired using a Fourier Transform Mid-infrared Spectrometer (Bruker Vector 33) with a thermostated Attenuated Total Reflection (ATR) sampling device (“Golden Gate”, Specac).

Fig. 7. (a) Averages for the 2 days when random (grey) and normal classifications (black) are calculated on the initial dataset X. (b) Mahalanobis distances for 500 group permutations (grey) and for the normal case (black dot).


3.1. Carraghenan data

This dataset has already been studied and described in detail elsewhere [4]. Mid-infrared spectra of Carraghenan solutions were collected and 64 scans averaged over the region 4050–600 cm−1, at 4 cm−1 resolution. The diamond ATR crystal was thermostated at the temperatures defined by the experimental design.

Factors: Concentration (1 and 2%), Temperature (30, 40, 45, 50, 60 °C) and Day (spectra measured on 2 different days by different operators). Each sample was measured in triplicate. 2 × 5 × 2 × 3 = 60 spectra in all.

3.2. Wine data

Mid-infrared spectra of model red wines: 32 scans were collected and averaged over the region 4000–600 cm−1, at 4 cm−1 resolution, after 20 min of evaporation on the ATR crystal warmed at 70 °C.

Factors: Year (A, B, C), Oak chips addition (Yes, No), Micro-oxygenation (Yes, No). Three samples for each type of wine, each sample measured in triplicate. 3 × 2 × 2 × 3 × 3 = 108 spectra in all.

Fig. 8. (a) APCA scores and (b) loadings for the Factor Day on the Carraghenan dataset X̃ after eliminating PC1.


4. Results and discussion

The use of the scores plot is important as it allows one to observe the position of the groups in relation to each other for each one of the factors. Of course this is more important when there are three or more groups, as one may see tendencies in the data related to the factor under study, and not only whether there is a separation, as in the case of only two groups.

An interesting capability of the method is the possibility to understand, to a certain extent, the causes of residual variability by looking at the loadings of the extracted PCs. As to how many components should be extracted, and which, no prior knowledge of the residual sources of variation is needed. The residual variability (due for example to the CO2 peak) is extracted sequentially until one obtains a separation along PC1 of the levels of the factor being studied. At that moment, a permutation test is applied to evaluate the statistical significance of this separation. The different principal components associated to the extracted residual variability may be studied separately. This will be understood with the help of the case studies.

Fig. 9. (a) Averages for the 2 days when random (grey) and normal classifications (black) are calculated on the Carraghenan dataset X̃ after eliminating PC1. (b) Mahalanobis distances for 500 group permutations (grey) and for the normal one (black dot).


4.1. Carraghenan data

As the study of the factor Temperature is not interesting in the framework of this method, because of the results already obtained, the study of this dataset will focus on the factors Concentration and Day.

4.1.1. Factor Concentration

4.1.1.1. Standard APCA. In a previous article [4], the same set of mid-infrared spectra of Carraghenan was studied using APCA. In that study, there was a large variance due to the CO2 peak which gave rise to PC1 when studying the Factor Concentration. PC1 was therefore related to this peak and not to the Factor (Concentration) being studied, which appeared in PC2 (Fig. 1). That CO2 variance, of no interest for the study, was not random but structured, and sufficiently large to create PC1. Based on the original philosophy of APCA, the tested factor would be considered as not significant compared to the residual error. When the CO2 region at around 2350 cm−1 was deleted, the Factor Concentration then became significant compared to the residual error. Deleting parts of the spectra is, however, not optimal as in some cases there may be information in that same region of the spectrum.

Fig. 10. (a) Initial wine raw data and (b) after column-centering.



4.1.1.2. Selective reduction of residual variability—APCA. The Carraghenan dataset was studied by selective reduction of residual variability before APCA, using the complete spectrum. The distances between the centroids of the two levels were calculated for the real and for the random groupings. The results presented in Table 1 show that the distance between the two groups is already slightly larger for the normal than for the random calculation, indicating that the variability that separates the two groups is due to a significant factor even though it is contributing to PC2. To investigate what would happen if part of the residual error matrix variability is eliminated, residual error reduction was applied. T and P were calculated using n = 1 and then APCA was applied to the resulting matrix X̃. As can be seen in Fig. 2, after this treatment, the contribution to the variability of the CO2 peak centered at 2350 cm−1 has almost completely disappeared. The concentration levels are now separated by PC1 of the APCA (Fig. 3) and the distance between the real groups is much larger than between the random permutation groups, so the factor may now be considered as significant compared to the reduced residual error.

The separation along PC1 having been achieved, the distanceetween the two groups is calculated. Also 500 random permuta-ions are performed and APCA applied to the permutated datasets.he level averages and the distances histogram obtained usinghe initial data are presented in Fig. 4, for comparison with the

ig. 11. (a) Scores and (b and c) loadings for the Factor Year following APCA on thenitial wine data.

mica Acta 653 (2009) 131–142

ones of Fig. 5, obtained after elimination of PC1 from the residualvariance matrix. As can be seen in Table 1, none of the random mod-els presents inter-groups distance larger than the model with the“true” group attributions.

The elimination of just one component from the residuals matrixis enough to provide the separation of the two concentration levelsalong PC1 of the APCA.

Although the real distances between the groups decreasesslightly when PC1 is eliminated (Table 1), the random distancesdecreases much more due to smaller residuals, amplifying the rel-ative distance between the real and the random distances.

It is interesting to note that if a Factor is significant comparedto the residual error, its loadings hardly change even though theseparation of the levels increases with increasing number of PCseliminated from the residual variance. Also interesting to noteis that the standard normal variates (SNVs) pre-treatment wasapplied to the data after the residual variance was extracted, notbefore. This can make a significant difference for the case when
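The residual-reduction step (calculating the scores T and loadings P with n = 1 and subtracting their product) can be sketched as follows. This is a minimal illustration with a hypothetical function name, not the authors' code, assuming the residual error matrix E has already been obtained by subtracting the level means from the data:

```python
import numpy as np

def reduce_residuals(X, E, n=1):
    """Subtract the first n principal components of the residual
    error matrix E from the data matrix X (illustrative sketch).
    """
    Ec = E - E.mean(axis=0)                        # column-centre the residuals
    U, s, Vt = np.linalg.svd(Ec, full_matrices=False)
    T = U[:, :n] * s[:n]                           # scores of the first n PCs
    P = Vt[:n, :]                                  # corresponding loadings
    return X - T @ P                               # reconstituted data matrix
```

APCA is then re-applied to the returned matrix; increasing `n` removes successively more of the structured residual variability.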

4.1.2. Factor Day

4.1.2.1. Standard APCA. Spectra were acquired on two different days, when different ATR crystal cleaning procedures were used, resulting in different quantities of residual ethanol in the system. The idea was to check if the different cleaning procedures would give different results and to compare the residual variance for each of the procedures.

Looking at Fig. 6 one can see that the 2 days are not well separated along PC1 (which is again due to CO2) and not completely separated along PC2, which would appear to be due to ethanol and to a baseline shift in the right-hand side of the spectra. Fig. 7 shows that the distance between the 2 days is comparable to the distance when groups are attributed randomly.

4.1.2.2. Selective reduction of residual variability—APCA. The procedure for selective reduction of residual variability is applied to the data and PC1 is removed. Then APCA is applied to the X matrix. As can be seen in Fig. 8b and c, the separation of the 2 days along PC1 seems to be due to a combination of the spectral features present in the previous PC1 (atmospheric CO2) and PC2 (ethanol near 3300 and 1000–1200 cm−1, and baseline shift). PC2 reflects differences in the variability of the quantity of ethanol present in the spectra on the 2 days. In Fig. 8a, the variability within day 1 along PC2 is much greater than that for day 2, which is explained by the way the crystal was washed, using much more ethanol; its loadings resemble the ethanol vapor spectrum as shown in a previous article [4]. Fig. 9b shows that the distance between the centroids for the 2 days is larger than that due to the randomly permuted levels.

Fig. 12. (a) Averages for the 3 years when random (grey) and normal classifications (black) are calculated on the wine dataset. (b) Mahalanobis distances for 500 group permutations (grey) and for the normal one (black dot). There is no separation of Year with the initial data using the standard APCA method.

Fig. 13. (a) Scores and (b and c) loadings for APCA after eliminating PC1.

Fig. 14. (a) Averages for the 3 years when random (grey) and normal classifications (black) are calculated on the wine dataset X after eliminating PC1. (b) Mahalanobis distances for 500 group permutations (grey) and for the normal one (black dot).

The results obtained after eliminating increasing numbers of PCs from the residuals matrix are presented in Table 2. This table shows that after the elimination of PC1, the distance between the centroids of the 2 days is greater for the normal calculation than for the randomly permuted ones. The real and the maximum random distances do not change significantly with increasing amounts of variance eliminated and are more or less stable after eliminating PC1 from the residual variance. Although without extracting any variance there is still a percentage of random models (p = 0.034) with larger inter-group distances than for the “true” model, after extracting 1 PC none of the random models calculated presents a larger distance and, as the groups are separated along PC1, Factor Day is now significant.
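The permutation check used throughout (500 random reassignments of the level labels, then comparison of inter-centroid distances) can be sketched as follows. This is an illustrative reimplementation with hypothetical names, not the authors' code, using a plain Euclidean distance between two level centroids:

```python
import numpy as np

def permutation_p_value(X, labels, n_perm=500, seed=0):
    """Fraction of label permutations whose inter-centroid distance
    is at least the distance for the true grouping (illustrative
    sketch for a two-level factor)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)

    def centroid_distance(lab):
        g0, g1 = np.unique(lab)              # two-level factor assumed
        return np.linalg.norm(X[lab == g0].mean(axis=0)
                              - X[lab == g1].mean(axis=0))

    real = centroid_distance(labels)
    rand = np.array([centroid_distance(rng.permutation(labels))
                     for _ in range(n_perm)])
    return float(np.mean(rand >= real)), real
```

A small p-value indicates that the true grouping separates the levels far better than chance label assignments do.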

4.2. Wine data

In the previous dataset, the elimination of part of the residual variance was decisive in changing the results for the significance of the tested factors. In the following example this effect is even more evident. The initial data, before and after column-centering, are presented in Fig. 10. There is a large quantity of residual variance in the fingerprint region (far right) of the spectra as the reproducibility of the evaporation method was not very good.


4.2.1. Factor Year

4.2.1.1. Standard APCA. The APCA analysis for the whole spectrum does not show any separation of the levels for the Factor Year (Figs. 11a and 12b). There is again a strong influence of atmospheric CO2 in the spectra (Fig. 11b), which accounts for most of the variability of the spectra and so is present in PC1. In PC2 there is no real separation for any of the Factor levels. Examining the loadings in Fig. 11, one can have a better understanding of the type of residual variance and note the regions which interfere with the detection of the interesting factors.

The results above indicate that it is necessary to test the effect of eliminating part of the residual error on APCA. Principal components are eliminated successively from the intra-sample residual error matrix, until the APCA results show a separation of the factor levels.

4.2.1.2. Selective reduction of residual variability—APCA.
4.2.1.2.1. After eliminating PC1. The elimination of PC1 from the residual variance leads to a slight separation of the 3 years, as can be seen in Fig. 13. This separation is mainly along PC2, but includes some variation also in PC1. There is still a large amount of residual variability in the fingerprint region of the spectra (Fig. 14b), which contributes to the observed results.

Fig. 15. (a) Scores and (b and c) loadings for APCA on the Factor Year after eliminating the first two PCs.

Fig. 16. (a) Averages for the 3 years when random (grey) and normal classifications (black) are calculated on the wine dataset X after eliminating PCs 1–2. (b) Mahalanobis distances for 500 group permutations (grey) and for the normal one (black dot).

Although the separation of the groups is not along PC1, the real distance between the three levels is already larger than the random distances (Fig. 14b), which means that the Factor is significant. As this separation is due to both PC1 and PC2 (Fig. 13a), it is not clear which spectral features are responsible for it. It is therefore of interest to continue reducing residual variance in order to obtain a better separation of the groups along PC1 so as to understand better the origin of the residual variability still present in the data.
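The inter-group separations reported in the figures and tables are Mahalanobis rather than Euclidean distances. A minimal sketch of one plausible form of that computation, assuming a pooled within-group covariance (the exact formulation is not spelled out in this section, so treat this as an assumption):

```python
import numpy as np

def mahalanobis_centroid_distance(A, B):
    """Mahalanobis distance between the centroids of two groups,
    using the pooled within-group covariance (illustrative sketch;
    the pseudo-inverse guards against singular covariance matrices)."""
    diff = A.mean(axis=0) - B.mean(axis=0)
    Sp = (np.cov(A, rowvar=False) * (len(A) - 1)
          + np.cov(B, rowvar=False) * (len(B) - 1)) / (len(A) + len(B) - 2)
    return float(np.sqrt(diff @ np.linalg.pinv(Sp) @ diff))
```

Unlike the plain Euclidean distance, this measure scales the centroid separation by the within-group scatter, so a Factor only scores high when its levels are separated relative to their own variability.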

4.2.1.2.2. After elimination of PCs 1–2. After the elimination of the first two principal components of the residual variance matrix, a much better separation of the groups is obtained, as can be seen in Fig. 15. The three groups are separated by both PCs, which may have an interesting spectral interpretation. There may be two types of effects in the wines due to the Factor Year:

• the conditions inherent to each year of harvest (climate, growing conditions, etc.), which result in physico-chemical differences in the wines;
• an effect due to ageing of the wine, even with appropriate storage, between vinification and analysis.


In Fig. 16a it can be seen that the residual variability was greatly reduced in relation to the real averages. The consequence of this is that the relative difference between the real and random distances becomes much larger (Fig. 16b).

After randomization, the total inter-group Mahalanobis distances obtained are shown in Table 3, where it can be seen that the results do not change substantially after the first two PCs are eliminated from the residual variation.

4.2.2. Factor Oak
It was not possible to separate the two levels for the Factor Oak using the standard APCA method. Therefore, residual variance was eliminated according to the proposed procedure until a separation was obtained. This required removing the first three PCs from the residual variance matrix (Table 4).

4.2.2.1. After extracting PCs 1–3. The residual variance matrix for the Factor Oak is by definition the same as that for the Factor Year. The variability eliminated in PCs 1–3 for the Factor Oak is similar to that eliminated by PCs 1–2 for the Factor Year, i.e. mainly related to CO2 and to variability in the fingerprint region of the spectra.

Once a separation was obtained (Fig. 17a), it was necessary to test the significance of the Factor by calculating the distances for the permutations. Those distances are presented in Table 4 and, although only 8 out of 500 permutations gave values higher than the real distance (Fig. 18b), Factor Oak cannot be considered as significant as Factor Year, probably because of the presence of the CO2 peak (Fig. 17b), but it is still significant.

Fig. 17. (a) Scores and (b and c) loadings for APCA on the Factor Oak after eliminating PCs 1–3.

Fig. 18. (a) Averages for the two Oak levels when random (grey) and normal classifications (black) are calculated on the wine dataset X after eliminating the first three PCs. (b) Mahalanobis distances for 500 group permutations (grey) and for the normal one (black).

This highlights a difficulty inherent to APCA which is not addressed by any treatment of the residual error matrix. If an interfering spectral feature is not uniformly present at the different Factor levels, it will contribute to the average vectors and so to the Factor matrix, instead of to the residual error matrix. Such is the case for this dataset, where the CO2 peak varied uniformly over the levels for Year, but not for Oak.

4.2.3. Factor Micro-oxygenation (Mox)
The analysis of the Factor Mox was performed in the same way as for the Factor Oak. Here, 4 PCs had to be eliminated in order to obtain a separation of the levels, but the real distance between these factor levels was much less than the randomly calculated distances, indicating that the Factor Mox is not significant. The distances for different numbers of eliminated PCs are presented in Table 5, as well as the large values of p obtained.

5. Conclusion

The procedure for the iterative use of residual error reduction with APCA has been shown. PCA is used to eliminate successive numbers of principal components from the residual error, or intra-sample variance, matrix, and APCA is applied to the reconstituted data matrix to evaluate the significance of the Factors compared to the reduced residual error. A permutation procedure was used to verify the validity of the separation of groups. The procedure enables one to understand in more detail the sources of residual variation within the data.

In the first dataset studied, carraghenan, one of the Factors, Concentration, was related to PC2. By using the proposed procedure, it was possible to reduce the contribution of the intra-sample variance so that the tested factor is represented by PC1, and so is significant in comparison to the residuals. Similarly, the Factor Day was not considered significant when using the initial data but, after elimination of PC1 from the residual variance matrix, it also became significant.

As for the second dataset, Wine, there was no separation of the factor levels on PCs 1 or 2 using standard APCA. After applying the proposed sequential residual variability reduction, the Factor Year became significant, as confirmed by the permutation procedure. The other two Factors, Oak and Mox, were not significant compared to the reduced residual error using the whole spectral region.

The method presented here improves the detection capabilities of APCA and the study of the sources of variance that may interfere with the results.

One advantage of the method is that it is possible to eliminate some sources of residual variability prior to pre-treatment of the spectra, which can make a difference in the results.

Acknowledgements

The authors thank the “Fundação para a Ciência e Tecnologia” (FCT), Portugal, for Rui Climaco Pinto’s PhD grant. The authors also thank Véronique Bosc for some of the carraghenan spectra and Jin Chen for all the care taken in the acquisition of the wine spectra.

References

[1] P. Harrington, N. Vieira, J. Espinoza, J. Nien, R. Romero, A. Yergey, Anal. Chim. Acta 544 (2005) 118–127.
[2] P. Harrington, N. Vieira, P. Chen, J. Espinoza, J. Nien, R. Romero, A. Yergey, Chemometr. Intell. Lab. Syst. 82 (2006) 283–293.
[3] J. Sarembaud, R. Pinto, D.N. Rutledge, M. Feinberg, Anal. Chim. Acta 603 (2007) 147–154.
[4] R. Climaco-Pinto, V. Bosc, H. Nocairi, A.S. Barros, D.N. Rutledge, Anal. Chim. Acta 629 (2008) 47–55.
[5] A.K. Smilde, J.J. Jansen, H.C.J. Hoefsloot, R.-J.A.N. Lamers, J. Greef, M.E. Timmerman, Bioinformatics 21 (13) (2005) 3043–3048.
[6] J.J. Jansen, H.C.J. Hoefsloot, J. Greef, M.E. Timmerman, J.A. Westerhuis, A.K. Smilde, J. Chemometr. 19 (9) (2005) 469–481.
[7] D.J. Vis, J.A. Westerhuis, A.K. Smilde, J. Greef, BMC Bioinform. 8 (2007) 322.
[8] H. Hotelling, A generalized T test and measure of multivariate dispersion, in: Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, 1951, pp. 23–41.
[9] Y. Zhu, T. Fearn, D. Samuel, A. Dhar, O. Hameed, S.G. Bown, L.B. Lovat, J. Chemometr. 22 (2008) 130–134.
[10] S. Wold, H. Antti, F. Lindgren, J. Ohman, Chemometr. Intell. Lab. Syst. 44 (1998) 175–185.
[11] J. Trygg, S. Wold, J. Chemometr. 16 (2002) 119–128.
[12] M.E. Timmerman, C.J.F. Ter Braak, Comp. Stat. Data Anal. 52 (2008) 1837–1849.
[13] P.C. Mahalanobis, On the generalised distance in statistics, in: Proceedings of the National Institute of Science of India, vol. 12, 1936, pp. 49–55.