Received: 10 March 2009, Revised: 17 June 2009, Accepted: 22 June 2009, Published online in Wiley InterScience: 3 November 2009 (www.interscience.wiley.com) DOI: 10.1002/cem.1254

PLS-Trees®, a top-down clustering approach

Lennart Eriksson a*, Johan Trygg b and Svante Wold b,c

a Umetrics AB, POB 7960, SE-907 19 Umeå, Sweden
b Institute of Chemistry, Umeå University, SE-901 87 Umeå, Sweden
c Umetrics Inc., 42 Pine Hill Rd, Hollis, NH 03049, USA
* Correspondence to: L. Eriksson, Umetrics AB, POB 7960, SE-907 19 Umeå, Sweden. E-mail: [email protected]

A hierarchical clustering approach based on a set of PLS models is presented. Called PLS-Trees®, this approach is analogous to classification and regression trees (CART), but uses the scores of PLS regression models as the basis for splitting the clusters, instead of the individual X-variables. The split of one cluster into two is made along the sorted first X-score (t1) of a PLS model of the cluster, but may potentially be made along a direction corresponding to a combination of scores. The position of the split is selected according to the improvement of a weighted combination of (a) the variance of the X-score, (b) the variance of Y and (c) a penalty function discouraging an unbalanced split with very different numbers of observations. Cross-validation is used to terminate the branches of the tree, and to determine the number of components of each cluster PLS model. Some obvious extensions of the approach to OPLS-Trees, and to trees based on hierarchical PLS or OPLS models with the variables divided into blocks depending on their type, are also mentioned. The possibility to greatly reduce the number of variables in each PLS model on the basis of their PLS w-coefficients is also pointed out. The approach is illustrated by means of three examples. The first two examples are quantitative structure-activity relationship (QSAR) data sets, while the third is based on hyper-spectral images of liver tissue for identifying different sources of variability in the liver samples. Copyright © 2009 John Wiley & Sons, Ltd.

Keywords: PLS-Trees; PLS; dendrogram; data mining; clustering; variable selection; outlier detection

1. INTRODUCTION

1.1. General considerations

The large amounts of data created in all parts of science and technology present interesting challenges. The management, analysis and visualization of large data sets—often called data mining—differ in several respects from the handling of 'ordinary' data sets of moderate size, with, say, up to a few hundred observations and variables.

A common problem is the inhomogeneity of large data sets, where relationships between variables are different for different sub-groups (clusters) of the data. If not identified and acted upon, this situation results in more or less trivial models dominated by group differences. Using such models for predictions gives low-quality results, with the predicted values being near the averages of the closest group (cluster). The only known remedy is to segment the data into smaller, fairly homogeneous and balanced groups (clusters). Here outliers can be considered special groups with just one or a few observations. This segmentation will, if successful, allow the estimation of relationships among both variables (columns of a data table) and observations (rows of a data table) with good predictive power.

Hence, data mining, and other analyses of large and complex data sets, should involve one or more clustering approaches in the early stages of the data analysis, preferably combined with an embedded multivariate regression step (i.e. PLS, OPLS, O2PLS, or similar). Such a multivariate clustering would model a large data set with potentially many and collinear variables as a set of models (clusters), where the complexity in the observational direction is reduced from N observations to G clusters, and the complexity in the variable direction is reduced from K variables to Ag scores (latent variables), with Ag typically being different for the G clusters (index g).

Since large data sets usually have missing data, noise and other complications, clustering approaches need to be able to handle such complications. Another desirable property for a clustering approach is computational speed. A 'top-down' approach, starting from the whole data set and successively dividing the data into smaller parts, is usually faster than a 'bottom-up' approach, which starts with clusters of size 1, combines the two closest into a cluster of size 2, and continues the merging until all observations have been combined into a single, final cluster. The latter approach needs to keep track of all inter-observation distances as well as observation-cluster and cluster-cluster distances (dissimilarities), which becomes computationally costly when the number of observations, N, exceeds a couple of thousand.

1.2. Latent variable-based approaches to clustering

The idea of adopting a latent variable-based approach to hierarchical clustering is not new. A few years ago, Böcker et al. introduced a hierarchical clustering algorithm called NIPALSTREE [1]. This is a PCA-based procedure, which at each level of the tree projects a data set onto the first principal component. Thereafter, the data set is sorted along this one dimension and split close to the median position. In a final step, the clustering is recursively applied to the resulting sub-sets until the maximal distance between cluster members exceeds a user-defined threshold. Because NIPALSTREE uses PCA, it is relatively slow on large data sets. Furthermore, only the X-block is taken into consideration in the clustering; any Y-data do not influence the results.


A slightly different approach was taken by Barros and Rutledge in their development of the so-called DiPLS_Cluster method [2]. The DiPLS_Cluster method starts with a random binary (0/1) Y-vector, i.e. two random groups. In an iterative fashion, the Y-values are replaced by new values predicted by the PLS-DA algorithm using a one-component model. After convergence, the predicted Y-data are rounded off to the nearest integer (0 or 1) and used for cluster assignment. Hence, this approach works only with the X-data, although it uses PLS internally in its calculations. A modified approach, called Generalized PLS_Cluster, was recently reported [3]. This approach allows any reasonably small number of clusters in each node of the dendrogram, instead of just two.

To overcome some deficiencies of the DiPLS_Cluster approach, Thomas and co-workers recently published a method named DiPCA_Cluster [4], which they claim to be 'an optimal alternative to DiPLS_Cluster'. The basic difference between the two approaches is that Thomas et al. strive to define a stringent objective function for the clustering, which leads to a PCA-based approach using only the X-part of the data.

1.3. Aims and scope

Our objective is to present a top-down hierarchical clustering approach that applies to a data matrix X, and is influenced by a connected response matrix Y (both X and Y are optionally preprocessed). The approach is fast, robust and flexible. We call it PLS-Trees. The approach is PLS-based to allow the responses (Y) to influence the clustering; it also explicitly uses Y as a part of the splitting criterion. PLS-Trees apply to both binary and continuous Y-data, and to single or multiple Y-variables. Being top-down and PLS-based, they are also computationally fast.

The main result of the clustering approach is a hierarchical tree (dendrogram) of connected PLS models, i.e. the PLS-Tree. The approach is analogous to classification and regression trees (CART) [5], but uses the scores of PLS regression models as the basis for splitting the clusters, instead of individual X-variables. Although the step from CART to PLS-Trees seems obvious, a literature search on PLS and CART gives no results. Hence, we here present PLS-Trees as a new approach to Y-focused top-down clustering of large data sets. Figure 1 shows an example dendrogram from the implementation in SIMCA®-P+ 12.

2. DATA SETS

The PLS-Trees approach will be illustrated by three examples. Two are fairly small: the first without any known classes, while the second has 14 known chemical classes. The third is larger, with roughly 49 000 observations and 470 variables, three known classes and possibly several unknown ones.

2.1. CYP3A4 inhibition

The first example concerns a quantitative structure-activity relationship (QSAR) model for the inhibition of the liver enzyme CYP3A4 by a series of N = 930 compounds [6]. A training set of 551 compounds was previously defined by means of onion design [6]; hence, the prediction set has 379 compounds. For a physicochemical characterization of the compounds, K = 307 chemical descriptors were employed. The biological effect is the inhibition of the CYP3A4 enzyme, measured as log IC50. This data set is fairly homogeneous, with uniform coverage of the PLS score space.

The objective is to investigate whether the PLS-Tree approach may indicate subtle groupings in the training set.

Figure 1. A typical PLS-tree. Each branch of the tree corresponds to a PLS model fitted to a sub-set of observations.


Such possible sub-groupings may correspond to locally preserved themes in the structure-activity relationship (SAR), and may hence warrant more than one PLS model.

2.2. Log KOC

The second illustration is also a QSAR example, and deals with soil sorption of environmental pollutants [7]. In contrast to the first example, these data exhibit pronounced clustering in the PLS score space. The log KOC data set contains 351 compounds distributed across 14 known chemical classes, denoted Class A–N. There are 64 chemical descriptors. The environmental effect is the soil sorption coefficient, denoted log KOC.

The objective here is to investigate whether the PLS-Tree approach may confirm the existing clustering, whether some clusters can be united into larger groups, and whether local QSAR modeling may lead to enhanced predictive ability.

2.3. Mouse liver imaging data

The third example comprises hyper-spectral images of liver tissue samples [8]. Spectral images were recorded for a total of 12 mouse livers, denoted mouse liver A–L [8]. Each image consists of a 64 × 64 array of pixels, i.e. 4096 pixels. In each pixel, an FT-IR spectrum was measured covering the region 950–1850 cm⁻¹ and making up a total of 469 spectral variables.

In the data analysis, each hyper-spectral image was first unfolded to a two-dimensional data matrix, where each row constitutes an FT-IR spectrum at a specific pixel in the micro-spectroscopic image. Thus, the total mouse liver data set comprises 4096 (pixels) × 12 (mouse livers) = 49 152 observations (rows) and 469 variables (columns).

The imaging data set contains three known groups arising from the obvious features of red blood cell areas (erythrocytes), liver cell areas (hepatocytes), and empty areas [8]. The data analysis objective is two-fold: (i) to investigate if the PLS-Tree approach is able to differentiate between such known structural differences within the set of pixels, and (ii) to explore whether the approach may highlight other systematic differences—if any—between the images or image parts. Additionally, the treatment of this data set may shed some light on how to handle the situation when no obvious Y-variable is initially available.
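The unfolding step can be written compactly in array code. Below is a minimal sketch, assuming the 12 images are held in a NumPy array of shape (12, 64, 64, 469); the array and variable names are hypothetical, not from the original study.

```python
import numpy as np

# Hypothetical stack of 12 hyper-spectral images:
# (livers, height, width, wavenumbers) = (12, 64, 64, 469)
images = np.random.rand(12, 64, 64, 469)

# Unfold: every pixel becomes one row holding its full FT-IR spectrum
X = images.reshape(-1, images.shape[-1])

print(X.shape)  # (49152, 469): 4096 pixels x 12 livers, by 469 variables
```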

3. DATA ANALYTICAL METHODS

In this paper, we use PLS regression [9] and PLS-Trees as implemented in the SIMCA®-P+ 12.0 software (www.umetrics.com).

PLS-Trees is an approach similar to classification and regression trees [5], but uses the first score vector of PLS [9] or OPLS [10], instead of the original X-variables, for splitting a cluster into two. For clarity, we note that the first score vector, t1, is obtained as t1 = Xw, where w = Xᵀy/‖Xᵀy‖₂. It should also be pointed out that t1 is different in each model; hence, lower-layer PLS models cluster along different directions than the upper-layer PLS models.
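For concreteness, the first-component computation just described can be sketched in a few lines of NumPy. This is a minimal illustration of t1 = Xw for a single, centered y; it is not the SIMCA®-P+ implementation (which, for example, also handles missing data via NIPALS).

```python
import numpy as np

def first_pls_score(X, y):
    """Return w and the first score vector t1 = Xw for a
    column-centered X and a single centered response y."""
    w = X.T @ y                  # w proportional to X'y
    w /= np.linalg.norm(w)       # normalize: w = X'y / ||X'y||
    t1 = X @ w                   # one score value per observation
    return w, t1
```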

Thus, the basic idea of a PLS-Tree is to start with a PLS model of the whole data set—the user must first specify what is X and Y, scaling, etc.—and then split the data along the sorted first score vector, t1, of the PLS model. The position of the split along t1 is selected according to the improvement of a combination of (a) the variance of the X-score (t1), (b) the variance of Y and (c) a penalty function encouraging a balanced split with approximately equal numbers of observations in each resulting branch.

More specifically, after sorting all observations from maximum to minimum score value along t1, we look to identify the 'cut' point on t1 that divides X and Y into two parts, 1 and 2, such that the following expression is minimized:

\[
B \cdot \frac{(N_1 - N_2)^2}{(N_1 + N_2)^2}
+ (1 - B)\left[ A \cdot \frac{V_{Y1} + V_{Y2}}{V_Y}
+ (1 - A) \cdot \frac{V_{t1} + V_{t2}}{V_t} \right]
\qquad (1)
\]

In the above expression, VY1 is the variance of Y for group 1 above the cut, VY2 is the variance of Y for group 2 below the cut, VY is the variance of the entire Y-block, Vt1 is the variance of the t-vector above the cut, Vt2 is the variance of the t-vector below the cut, Vt is the variance of the entire t-vector, N1 is the number of observations above the cut, and N2 is the number of observations below the cut.

The values of the user-selected parameters A and B lie between 0 and 1. The first parameter, A, sets the balance between the score t1 and Y; the closer to zero, the more weight is attributed to the score t1. The second parameter, B, takes into account the group sizes of the resulting clusters; the closer to zero, the less important it becomes to have similar group sizes in the dendrogram.

In addition, a smallest resulting sub-cluster size—nmin = min(N1, N2)—is specified to disallow solutions with many very small clusters, even when the parameter B is zero or very small. The present default in SIMCA®-P+ is nmin = 5.
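To make the criterion concrete, here is a minimal sketch that evaluates Equation (1) at every allowed cut along the sorted score. It is an illustration using plain variances and an exhaustive scan, not the SIMCA®-P+ implementation (which uses the approximate search of Section 3.1); the function name is ours.

```python
import numpy as np

def cut_criterion(t_sorted, y_sorted, A=0.3, B=0.3, n_min=5):
    """Evaluate Equation (1) at each allowed cut along sorted t1.
    Observations 0..i-1 form group 1; returns (best_cut, values)."""
    N = len(t_sorted)
    Vy, Vt = np.var(y_sorted), np.var(t_sorted)
    values = np.full(N, np.inf)
    for i in range(n_min, N - n_min):      # respect smallest cluster size
        N1, N2 = i, N - i
        size_term = B * (N1 - N2) ** 2 / (N1 + N2) ** 2
        y_term = A * (np.var(y_sorted[:i]) + np.var(y_sorted[i:])) / Vy
        t_term = (1 - A) * (np.var(t_sorted[:i]) + np.var(t_sorted[i:])) / Vt
        values[i] = size_term + (1 - B) * (y_term + t_term)
    return int(np.argmin(values)), values
```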

Thus, the output value of the variance function (Equation (1)) is a weighted combination of three terms. Figure 2 shows, for a PLS root model in a typical PLS-Tree, line plots of these three terms and the total variance function against the sorted t1 score vector of the root model. In summary, this means that a division along t1 is sought that minimizes the variation within a group, and hence maximizes the differences between groups in t1 and Y, while also taking the difference in group size into consideration.

Cross-validation (CV) [11] is used to terminate the branches of the tree, and to determine the number of components of each cluster PLS model. The user may also specify the desired number of layers in the PLS-Tree and the smallest number of observations inside a cluster. We note that the clustering phase is speeded up considerably by restricting all PLS models to one single component. Later, in the interpretation phase, models of interest can be extended to include all CV-significant components (autofitting).

Finally, the maximum depth, D, of the PLS-Tree—i.e. the maximum number of layers in the tree—is specified, typically 4 (default) or 5. The number of models making up the tree will be 2^(D+1) − 1, e.g. 31 PLS models with D = 4 and 63 models with D = 5.
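Putting the pieces together, the overall construction can be summarized as a short recursive routine. The sketch below is our schematic reading of the procedure, not the SIMCA®-P+ code: `first_pls_score` and `cut_criterion` are the hypothetical helpers sketched above, and the cross-validation stopping rule is reduced to simple depth and size checks.

```python
import numpy as np

def pls_tree(X, y, depth=0, D=4, n_min=5, A=0.3, B=0.3):
    """Recursively split (X, y) along the sorted first PLS score;
    returns a nested dict sketching the dendrogram."""
    node = {"n_obs": len(y)}
    # Stop at maximum depth or when too few observations remain; the real
    # algorithm instead terminates a branch when cross-validation finds
    # no significant PLS component.
    if depth >= D or len(y) <= 2 * n_min:
        return node
    w, t1 = first_pls_score(X, y)
    order = np.argsort(t1)[::-1]        # sort from maximum to minimum t1
    cut, _ = cut_criterion(t1[order], y[order], A=A, B=B, n_min=n_min)
    left, right = order[:cut], order[cut:]
    node["children"] = [
        pls_tree(X[left], y[left], depth + 1, D, n_min, A, B),
        pls_tree(X[right], y[right], depth + 1, D, n_min, A, B),
    ]
    return node
```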

3.1. Searching for the best split

To speed up the search for the best split along the search direction when the number of observations in a cluster is large, an approximate search based on a piece-wise quadratic polynomial approximation of the search profile is recommended. The following has been implemented in SIMCA®-P+ with good results so far (a schematic code sketch follows the list):

1. The number of points (npol) used for each polynomial approximation is selected; default: npol = min(11, sqrt(N)).

2. The number of polynomial pieces will be min(7, integer{2*N/npol} − 1).

3. Calculate the initial step length so that the whole range of observations, except the first and last nmin ones, is covered, while each polynomial piece overlaps half of the adjacent pieces on each side.

4. Compute the 'cut expression' of Equation (1) for each point in each polynomial piece.

5. Fit a quadratic polynomial (by least squares) to each polynomial piece, and calculate the lowest total minimum over all pieces.

6. If the step length exceeds 1, divide the step length by 4, and lay out a new polynomial piece centered on the current minimum.

7. Compute the 'cut expression' of Equation (1) for each point in this polynomial piece.

8. Fit a quadratic polynomial to these data, and calculate the minimum of the latest polynomial.

9. If the step length exceeds 1, repeat from step 6; otherwise stop.
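Our schematic reading of this coarse-to-fine search is sketched below. The piece layout and overlap bookkeeping are simplified to a single sliding window, and `criterion(i)` stands for the cut expression of Equation (1) evaluated at cut index i; treat this as an illustration of the idea rather than the implemented algorithm.

```python
import numpy as np

def approx_best_cut(criterion, N, n_min=5, n_pol=None):
    """Coarse-to-fine search for the cut index minimizing `criterion`,
    using local quadratic fits instead of an exhaustive scan."""
    if n_pol is None:
        n_pol = int(min(11, np.sqrt(N)))        # step 1
    lo, hi = n_min, N - n_min                   # step 3: skip end regions
    step = max(1, (hi - lo) // max(1, 2 * n_pol))
    center = (lo + hi) // 2
    while True:
        # Sample n_pol points around the current center (one 'piece')...
        offsets = step * (np.arange(n_pol) - n_pol // 2)
        idx = np.unique(np.clip(center + offsets, lo, hi - 1))
        vals = np.array([criterion(int(i)) for i in idx])   # steps 4/7
        # ...fit a quadratic by least squares, jump to its minimum (steps 5/8)
        a, b, c = np.polyfit(idx, vals, 2)
        vertex = -b / (2 * a) if a > 0 else idx[np.argmin(vals)]
        center = int(np.clip(round(vertex), lo, hi - 1))
        if step <= 1:                           # step 9: converged
            return center
        step = max(1, step // 4)                # step 6: refine step length
```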

4. RESULTS

4.1. CYP3A4 inhibition

The first data set is fairly homogeneous, with uniform coverage of the PLS score space (Figure 3). In its analysis, we investigate the resulting PLS-Trees when varying the values of the two adjustable parameters, A and B. These were changed using a 3² factorial design with the settings 0.1, 0.3 and 0.5 for both A and B. Nine PLS-Trees were initiated using these values of A and B. As seen in Figure 4, the dendrograms have quite a varied appearance. Some features are fairly 'predictable' and some are more 'unexpected'. For instance, the higher the value of B, the more equal in size the resulting groups tend to be. The top left dendrogram in Figure 4 (based on A = 0.1 and B = 0.5) visualizes the impact of a high B: the first cut is made at 0.5 (50% of the samples go to each model), the second cut close to 0.25 (25% go to each model), the third close to 0.125, etc. On the other hand, with a high A, enabling the structure of the skewed Y to have influence, the shape of the dendrograms changes more drastically and somewhat unpredictably.

To evaluate the predictive power of the set of PLS models forming each PLS-Tree, we used a vertical cut-off value of 0.35. All branches existing at this cut-off and at higher values were evaluated for predictive power.

Figure 2. Line plot showing the evolution of the variance function and its three components (see Equation (1)) against the index of the sorted t1 score vector (called 'Num') of the root PLS model of a typical PLS-Tree. The solid line represents the total variance function, the dotted line its t-variance term, the dash-dotted line its Y-variance component and the dashed line its sub-group size part. Being a weighted combination of the dotted, dash-dotted and dashed lines, the shape of the solid line will change depending on the settings of the adjustable parameters A and B. The cut position is the point on the solid line where the minimal value is encountered. In the current case, the minimal value is found at 'Num' 249 and hence the N = 551 observations of the PLS model are split into two sub-groups containing 249 and 302 observations.


Each local model was autofitted, and the external test set (with 379 compounds) was subjected to each model. The distance-to-model criterion, called DModX in SIMCA®-P+, was used to determine which of the 379 prediction set compounds fitted each local model. The root mean square error of prediction (RMSEP) was calculated only for those compounds fitting the local model.

In eight out of the nine PLS-Trees, one finds a local PLS model enhancing the predictive power compared to the top 'mother model'. Thus, there seems to exist a structure-activity theme that is invariant (or almost so) to the settings of A and B. With the most optimistic evaluation, the PLS-Tree lowers RMSEP by around 40%.
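For reference, the RMSEP reported here is the standard quantity, computed over only the n prediction-set compounds accepted by the local model (our notation):

\[
\mathrm{RMSEP} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\]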

Figure 4. The nine PLS-trees obtained by fitting PLS models to the training set of 551 compounds and using different values of the parameters A and B, as dictated by a 3 × 3 factorial design with the levels 0.1, 0.3 and 0.5.

Figure 3. Plots of the reference 'mother' PLS model based on the training set of 551 compounds. (Left) Scatter plot of observed and predicted biological effects. (Right) Scatter plot of the first two scores, t1 and t2.


4.2. Log KOC

The second data set exhibits strong clustering in the PLS score space (Figure 5), and the 351 compounds belong to 14 known chemical classes, A–N. For constructing training and prediction sets, the compounds were first enumerated as 1, 2, 3, 1, 2, 3, 1, ... etc., giving three sub-sets of 351/3 = 117 compounds. Three rounds of calculations were made, with two of the sub-sets constituting the training set and the remaining sub-set constituting the prediction set.

Furthermore, compared with the first example, the number of investigated combinations of A and B was reduced from nine to five. The reason is that several settings of A and B produced similar prediction results for the CYP3A4 data set, and hence it was concluded that the number of tested combinations could be decreased without compromising the reliability of the results. Also for the log KOC data set, the principles of DOE were acknowledged when devising a proper number of A/B combinations. Consequently, a 2² full factorial design (in four runs) augmented by one centerpoint was chosen, encoding the following five settings of A and B: (0.1, 0.1); (0.5, 0.1); (0.3, 0.3); (0.1, 0.5); (0.5, 0.5).

Thus, in summary, 15 PLS-Trees were computed. The resulting dendrograms are not shown. The evaluation of the predictive power was done as in example 1 above. The conclusion is that local modeling reduces RMSEP by between 3 and 25%, depending on which training set is used.

Additionally, the analysis of this data set indicates possible interpretations using the PLS-Tree. Zooming in on a model called M14 (training set 1&2 and A = B = 0.1) gives Figure 6, which shows the relationship between Y-observed and Y-predicted. By marking the same compounds in the original reference model, we realize that M14 is focused on compounds with low Y-values.

As seen in Figure 7, the cluster of M14 contains some of the known classes. The compounds of chemical class C are all allocated to this branch of the PLS-Tree (recall that they are not all marked, because a third of them were deliberately put in the prediction set). In contrast, the compounds of chemical class L are completely absent from this part of the PLS-Tree. Hence, in this case the clusters of the PLS-Tree to some extent correspond to known chemical classes. In such a case, the PLS-Tree can be seen as consistent with clustering that is believed or determined to exist in a data set.

For PLS-Trees, the comparison of loading parameters of different models provides a measure of similarity/dissimilarity between clusters. Figure 8 shows scatter plots of the first loading vector for a case with similar and a case with dissimilar loading profiles. Models M14 and M27 (Figure 8, left), which have similar loading profiles, also have similar RMSEP values. This suggests that the QSAR themes identified by these models are related. Conversely, models M14 and M261 (Figure 8, right) seem to have identified different QSAR themes.

4.3. Mouse liver imaging data

Figure 9 shows an image of mouse liver sample A, with structure (dark areas) arising from tissue, and empty regions (light areas). The predominant part of the dark areas is made up of liver cells (hepatocytes). Only the intensely dark area in the top right corner corresponds to red blood cells (erythrocytes). Similar structural variability, i.e. tissue and empty sections, is found in the remaining 11 images [8], although these images are not reproduced here for reasons of brevity.

It would be of interest to determine whether the PLS-Tree approach would respond to these groupings in the images.

Figure 5. (Left) Agreement between observed and predicted log KOC for the PLS reference model. (Right) Score plot t1/t2. In both plots the coding is done according to chemical class. The clustered nature of the data set is obvious.


First, it is a tedious and time-consuming task to manually assign a meaning to each pixel, and therefore an ensemble of predictive models capable of distinguishing erythrocytes from hepatocytes, and tissue from empty spots, would be practical and welcome from a routine clinical perspective. Second, there might exist systematic and clinically relevant, but so far unseen, differences between the images, and an examination of whether such information can be uncovered and interpreted by the new PLS-Tree approach is therefore warranted.

Figure 10 displays the hyper-spectral and pseudo-colored image for sample A recorded at 1655 cm⁻¹. In order to apply the PLS-Tree to this spectral information, as well as to all the other 468 hyper-spectral images (arising from the remaining 468 wavenumbers), we must provide a Y-variable or a block of Y-variables. In the original publication [8], a 1/0 binary Y-variable was used in an OPLS-DA model as a means to encode the erythrocyte and hepatocyte pixel classes. In that model, the spectral variable at 1719 cm⁻¹ had the strongest correlation to the erythrocyte/hepatocyte separation. Hence, to focus the current PLS-Tree on erythrocyte and hepatocyte differences in the set of 12 images, we used this maximally group-separating variable as the single Y-variable.

Figure 6. (Left) Yobs versus Ypred for M14. (Right) Same for the reference model.

Figure 7. (Left) Yobs versus Ypred for M14. (Right) Score space t1/t2 for the original reference model. Note that this plot only shows a sub-set of the observations earlier plotted in the right-hand part of Figure 5.


The PLS-Tree was then applied to the 49 152 (rows; 4096 pixels from 12 images) by 468 (columns; all wavenumbers except 1719 cm⁻¹) X-block and the single Y-variable (the 1719 cm⁻¹ wavenumber), using A = B = 0.3. In the two previous examples the precise choice of A and B seemed to matter less; the results are not overly sensitive to this choice. Hence, using A = B = 0.3, i.e. the centerpoint combination of the two foregoing DOEs, appeared a good first default in the current case. Moreover, by setting the depth of the tree to D = 4 (see Section 3), the dendrogram was constrained to comprise 31 PLS models (Figure 11).

The dendrogram in Figure 11 emanates from a 'mother' PLS model, M5 in SIMCA-P+, based on all 49 152 pixels. After the first split, D = 1, the pixels have been divided into two sub-sets of 22 371 and 26 781 pixels. These two sub-sets are then modeled by two new PLS models (M6 and M7 in Figure 11), and partitioned into four sub-groups. These four clusters, constituting 10 873, 11 498, 15 311 and 11 470 pixels, respectively, are then modeled by the PLS models M8-M11 (Figure 11), and so on, until the fourth layer (D = 4), where the data have been divided into 16 sub-groups, each of which is handled by a separate, local PLS model.

An inspection of pixel origin suggests that the PLS-Tree divides the mouse liver images into three categories. Table I summarizes the distribution of pixels in model M6, i.e. one of the two models arising after the first split.

Figure 8. Scatter plot of the first PLS loading vectors between similar (left) and dissimilar (right) models.

Figure 9. Visual image of mouse liver sample A (in 64 × 64 pixels).

Figure 10. Pseudo-colored hyper-spectral image of mouse liver sample A. Compare with Figure 9.


It can be seen that, out of the 22 371 pixels encompassed by M6, 3995 pixels originate from mouse liver A and 3947 from mouse liver B. This means that sample A to 98% and sample B to 96% are processed in the left M6-wing of the dendrogram. Consequently, the remaining 2% of the A-pixels and 4% of the B-pixels are handled by the right M7-wing of the dendrogram. Mouse liver samples A and B form the first category of samples found by the PLS-Tree, i.e. samples that are almost exclusively modeled by M6 (and the associated PLS models further down the tree).

The second group comprises mouse liver samples that are reasonably well explained, at approximately 50%, by M6 and therefore also reasonably well accounted for by M7. To this category belong mouse livers C and K, which are accounted for by M6 to 47 and 56%, respectively. The third group comprises the rest of the samples, which are poorly modeled by M6. For this last group of eight mouse livers, barely one third or less of the pixels are dealt with by M6. Thus, on a superficial image-to-image level of analysis, it can be concluded that the PLS-Tree divides the dozen mouse livers into three categories. A further discussion of this grouping is given below in Section 5.

In addition, the PLS-Tree may be analyzed on a detailed pixel-to-pixel basis. Using the 4096 pixels of mouse liver A as an illustration, Figure 12 outlines how such an interpretation may be done. All images seen in Figure 12, which are arranged according to the dendrogram structure and its first two splits, are pseudo-colored t1 score images of the respective PLS models (i.e. models M5-M11). These score images should be compared with Figures 9 and 10.

As evidenced by Figure 12, at the first split the A-pixels are predominantly flushed down the left-hand side of the dendrogram. Only 2% of the pixels are directed to the right-hand side. The score images arising from this 2% part of the A-pixels reveal that this cluster of A-pixels essentially represents noise. Conversely, on the left-hand side, the 98% part of the A-pixels represents either tissue or empty regions. After the second split, these 98% are divided into two sub-groups containing 78 and 20% of the A-pixels. As indicated in the commentary text in Figure 12, these two sub-clusters represent empty regions and hepatocyte/erythrocyte structures.

5. DISCUSSION

We have here introduced a new top-down, PLS-based clustering approach, called PLS-Trees. One appealing feature is its speed, which is especially useful when clustering a large data set. For example, it took less than 4 min to compute the PLS-Tree for the mouse liver imaging data set. A comparative study on exactly the same data set, using hierarchical cluster analysis based on the first few principal components, lasted more than half an hour.

As shown by way of the three examples, the PLS-Tree is a versatile, flexible and transparent clustering tool that brings about a number of interpretative possibilities and improved models. The net result of many of the local models pertaining to the first two data sets was improved predictive power. Moreover, the PLS-Tree was also found to corroborate classes known or anticipated in the highly clustered second data set.

This ability of the PLS-Tree to recognize and predict structured data was also apparent in the third data set.

Figure 11. Resulting dendrogram of the mouse liver imaging data set. See text for explanation.

Table I. Image allocation of pixels modeled by M6

Image   Pixels   Fraction
A       3995     0.98
B       3947     0.96
C       1928     0.47
D       1085     0.26
E       1477     0.36
F       1416     0.35
G       1421     0.35
H       943      0.23
I       1287     0.31
J       1345     0.33
K       2288     0.56
L       1239     0.30
Sum     22371


Not only could the PLS-Tree separate between empty areas and portions corresponding to tissue, but it could equally well segregate between hepatocytes and erythrocytes. The latter is evident when comparing Figures 9 and 10 with the M9 score image of Figure 12. It is also evident when comparing other similar score images (unpublished pictures), related to the remaining 11 mouse livers, with the equivalent output images of the work presented in Reference [8]. There is very good agreement regarding pixel allocation into hepatocyte and erythrocyte fractions.

In addition, and perhaps yet more interesting, was the discovery of the three sub-groups of mouse liver samples. It is expected that there might be a time direction in the mouse imaging data set, since samples A and B were prepared first, and all the other samples arose from a second, complementary measurement campaign.

In this context, Figure 13 is of interest, displaying score and loading line plots of model M6. The score plot has 22 371 points and the loading plot has 467 points. The two first mouse liver images are seen to have smaller variability in the score values than the other images. Thus, the structures of A and B being modeled by M6 appear more homogeneous than the corresponding structures of images C-L. This might suggest that what is actually being modeled for images C-L by M6 represents edge effects and transition effects between empty space and tissue areas, rather than the empty or tissue regions themselves. Another possible explanation might be the presence of thickness differences between the slices of the mouse liver samples.

The interpretation of a PLS-Tree is facilitated by the fact that each node has an associated ordinary PLS (or OPLS) model. Hence, the rich diagnostics of PLS, such as plots of scores, loadings, VIP, coefficients and predictions, are directly useful for this interpretation.

In addition, subjecting the PLS-Tree clusters to supervised classification by means of PLS-discriminant analysis (PLS-DA) [12] and/or SIMCA classification [13] reveals such things as which variables are important for cluster separation, how well the clusters are separated, and which observations are poorly classified.

5.1. PLS-Trees and CART

PLS-Trees may be seen as an extension of classification and regression trees (CART), in the same way as PLS is an extension of multiple linear regression. In comparison with conventional regression approaches, PLS can handle many, collinear and noisy X-variables, and also moderate amounts of missing data in X and Y, if the latter is multivariate.

Regression trees may have some problems finding the optimal cut-off value of the currently investigated X-variable, as well as in highlighting the correct X-variable among a set of correlated X-variables. This has led to the proposition of so-called random forests, where several reminiscent and randomly generated trees based on similar cut-offs are pursued, leading to improved predictive precision, albeit at the cost of increased computing time and interpretational complexity.

The PLS-Trees seem to have fewer problems with the instability of the cut-off values. This is probably due to two reasons. First, the smoothing of the current score by fitting sliding piecewise polynomials, in a Savitzky-Golay fashion, relieves some of the burden of finding a stable cut-off value. Second, being based on score vectors (linear combinations of the individual X-variables), the selection of the right X-variables is less of a problem, since the score vector is a weighted average of all X-variables, where the 'best' ones, highly correlated to the Y-data, are heavily weighted.

No detailed comparisons between CART and PLS-Trees are provided in this paper, since these methods apply to different kinds of data; the latter typically involves rank-deficient data with many variables, the former few and fairly independent X-variables.

Figure 12. Pseudo-colored t1 score images of mouse liver sample A. Representing the two upper levels of the dendrogram seen in Figure 11, the score images are arranged according to the corresponding tree structure. To simplify interpretation, model belonging and the number of covered pixels are indicated in the header of every score image.


5.2. PLS-Trees and conventional hierarchical cluster analysis

In applications where responses (Y-variables) are available, PLS-Trees have the advantage over conventional 'Y-independent' clustering that they improve the relation between Y and the clusters. This is analogous to how PLS may lead to a better Y-focused latent variable model than PCA. Interestingly, when Y comprises a discrete matrix with 1/0 columns corresponding to a number of predefined classes, the result is a PLS classification tree.

5.3. The adjustable parameters A and B

The two adjustable parameters A and B in the PLS-Tree algorithm provide a range of variants to address the problem at hand:

• with A = 1, the approach is fully focused on Y and hence similar to CART;

• with A = 0, the approach is fully focused on t and the X-space, and hence is more similar to classical cluster analysis, including NIPALSTREE and the methods of Rutledge and coworkers;

• with A = 0.3, a reasonable compromise is obtained between focus on Y and focus on X;

• with B = 0, the approach gives traditional cluster analysis which, in the beginning, often splits off many very small clusters and individual observations (outliers);

• with B = 1, the approach forces the division of clusters into equal parts, which is rather rigid;

• with B = 0.3, a reasonable balance seems to be obtained.

The approach taken here, of running a formal design in A and B, is recommended when knowledge about the data structure is questionable or non-existent. This actually takes a step toward some kind of random forest approach.

5.4. Possible extensions

A number of obvious extensions of this first effort are listed below. These will be tried as soon as time and resources permit.

5.4.1. Searching along a combination of the first two (or three) X-scores

The inspection of cluster geometries in many applications makes it clear that clusters are often oriented in a fashion that is not parallel with the first X-score (t1). By introducing a parameter c, valued between −1 and 1, and searching along combinations of t1 and t2, i.e. {c·t2 + (1 − |c|)·t1}, the whole plane of the first two components will be investigated instead of only the line along t1.

Moreover, as correctly pointed out by a referee, by using t2 of the first PLS model as a direction orthogonal to t1 to further split clusters 1 and 2, and analogously for models in lower layers, the similarities between PLS-Trees and CART are strengthened, in the sense that tree-based methods like CART can easily be extended to splitting in multiple dimensions.
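A one-line sketch of the proposed combined search direction, under the parameterization above (the function name is ours):

```python
def combined_direction(t1, t2, c):
    """Search direction in the t1/t2 plane: c*t2 + (1 - |c|)*t1,
    with -1 <= c <= 1; c = 0 recovers the pure t1 search."""
    assert -1.0 <= c <= 1.0
    return c * t2 + (1.0 - abs(c)) * t1
```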

5.4.2. Cluster models based on OPLS/O2PLS and/or hierarchical models

Using OPLS/O2PLS [8,10] for the cluster 'mother model' makes the t1 search direction in X-space more aligned with the Y-space, which may be desirable in many cases. We note that with ordinary PLS 'mother models', the split point search is always along the X-score(s), even if the Y-variance enters the search criterion.

With many X-variables (and/or Y-variables), hierarchical models [14] provide an interesting alternative to variable selection (see Section 5.4.3), and facilitate the model interpretation. The PLS-Tree approach would also easily work with hierarchical PLS or OPLS/O2PLS models, providing an interesting alternative for data sets with many X-variables.

Figure 13. Score line plot and loading line plot corresponding to the first PLS component of model M6.


5.4.3. Variable selection

In data mining, the reduction of the set of variables is often desirable. With large numbers of X-variables, PLS models may also have sub-optimal predictive power. Eliminating variables that either are strongly correlated to the best predictors or are uncorrelated to the Y-variables may improve the predictive power of the model and simplify the model interpretation. The approach of 'Selective PLS' [15], later used under the name IVS-PLS for interactive variable selection in PLS [16], provides a simple but powerful way to variable selection, based on keeping only variables with large PLS weights (wk or wk*). Either all variables with smaller absolute weights than a certain limit, say 75% of the largest weight, or, say, the 75% of the variables with the smallest absolute weights, would be excluded.

In the present context, this variable selection would start with the top (mother) model; the next layer would then start with the reduced set of variables, reducing it further, until the lowest layer would have the smallest number of variables. Alternatively, one can of course restart the variable selection for each cluster model, and optionally also use different thresholds for the variable selection in the different layers of the PLS-Tree.
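A minimal sketch of this weight-based selection rule, using the fraction-of-largest-weight variant (our illustration; 75% is the example threshold from the text):

```python
import numpy as np

def select_by_weight(w, frac=0.75):
    """Keep variables whose |w_k| is at least `frac` of the largest
    absolute PLS weight; returns the indices of the kept columns."""
    w = np.asarray(w)
    return np.flatnonzero(np.abs(w) >= frac * np.abs(w).max())
```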

6. CONCLUSIONS

PLS-Trees, as implemented in SIMCA®-P+ 12, provide an interesting and rapid approach to data mining and clustering of large data sets. The PLS-Tree approach can, like any PLS model, handle multiple and collinear variables, even more numerous than the number of observations. Moderate amounts of missing data are automatically handled by the PLS NIPALS algorithm. An advantage of the PLS-Tree approach is its reliance on ordinary PLS or OPLS models, which simplifies the interpretation of results through the well-known diagnostic tool-box of these methods. PLS-Trees is not a 'new' method isolated from other data analysis, but rather another extension of PLS into the complicated domain of data mining in large and complex data sets.

REFERENCES

1. Böcker A, Schneider G, Teckentrup A. NIPALSTREE: a new hierarchical clustering approach for large compound libraries and its application to virtual screening. J. Chem. Inf. Model. 2006; 46: 2020–2029.

2. Barros AS, Rutledge DN. PLS_Cluster: a novel technique for cluster analysis. Chemometr. Intell. Lab. Syst. 2004; 70: 99–112.

3. Bouveresse DJR, Barros AS, Rutledge DN. Generalised PLS_Cluster: an extension of PLS_Cluster for interpretable hierarchical clustering of multivariate data. Sens. & Instrumen. Food Qual. 2007; 1: 79–90.

4. Thomas V, Robert S, Richard J. DiPCA_Cluster: an optimal alternative to DiPLS_Cluster for unsupervised classification. Chemometr. Intell. Lab. Syst. 2008; 90: 8–14.

5. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Wadsworth & Brooks/Cole: Monterey, CA, 1984.

6. Kriegl JM, Eriksson L, Arnhold T, Beck B, Johansson E, Fox T. Multivariate modeling of cytochrome P450 3A4 inhibition. Eur. J. Pharmaceut. Sci. 2005; 24: 451–463.

7. Eriksson L, Johansson E, Müller M, Wold S. On the selection of training set in environmental QSAR when compounds are clustered. J. Chemometr. 2000; 14: 599–616.

8. Stenlund H, Gorzsás A, Persson P, Sundberg B, Trygg J. Orthogonal projections to latent structures discriminant analysis modeling on in-situ FT-IR spectral imaging of liver tissue for identifying sources of variability. Anal. Chem. 2008; 80: 6898–6906.

9. Wold S, Sjöström M, Eriksson L. PLS-regression: a basic tool of chemometrics. Chemometr. Intell. Lab. Syst. 2001; 58: 109–130.

10. Trygg J, Wold S. Orthogonal projections to latent structures (O-PLS). J. Chemometr. 2002; 16: 119–128.

11. Wold S. Cross-validatory estimation of the number of components in factor and principal components models. Technometrics 1978; 20: 397–405.

12. Sjöström M, Wold S, Söderström M. PLS discriminant plots. In Pattern Recognition in Practice II, Gelsema ES, Kanal LN (eds). Elsevier Science Publishers: North-Holland, 1986; 461–470.

13. Wold S, Albano C, Dunn WJ III, Edlund U, Esbensen K, Geladi P, Hellberg S, Johansson E, Lindberg W, Sjöström M. Multivariate data analysis in chemistry. In Chemometrics—Mathematics and Statistics in Chemistry, Kowalski BR (ed.). D. Reidel Publishing Company: Dordrecht, The Netherlands, 1984; 1–81.

14. Eriksson L, Berglind R, Larsson R, Sjöström M. Multivariate biological profiling of the sub-acute effects of halogenated aliphatic hydrocarbons. J. Env. Sci. Health 1993; A28: 1123–1144.

15. Kettaneh-Wold N, MacGregor JF, Dayal B, Wold S. Multivariate design of process experiments (M-DOPE). Chemometr. Intell. Lab. Syst. 1994; 23: 39–50.

16. Lindgren F, Geladi P, Rännar S, Wold S. Interactive variable selection (IVS) for PLS. Part 1: theory and algorithms. J. Chemometr. 1994; 8: 349–363.
