
Chemometrics and Intelligent Laboratory Systems 122 (2013) 65–77


Comparison of Sparse and Jack-knife partial least squares regression methods for variable selection

İbrahim Karaman a,⁎, El Mostafa Qannari b,c, Harald Martens d,e, Mette Skou Hedemann a, Knud Erik Bach Knudsen a, Achim Kohler e,d

a Aarhus University, Department of Animal Science, Blichers Allé 20, DK-8830 Tjele, Denmark
b LUNAM University, ONIRIS, USC “Sensometrics and Chemometrics Laboratory”, Rue de la Géraudière, Nantes, F-44322, France
c INRA, Nantes, F-44316, France
d Nofima – Norwegian Institute of Food, Fisheries and Aquaculture Research, P.O. Box 210, 1431 Ås, Norway
e Centre for Integrative Genetics (CIGENE), Department of Mathematical Sciences and Technology (IMT), Norwegian University of Life Sciences, 1432 Ås, Norway

⁎ Corresponding author at: Aarhus University, Department of Animal Science, P.O. Box 50, DK-8830 Tjele, Denmark. Tel.: +45 8715 4259; fax: +45 8715 6076.
E-mail address: [email protected] (İ. Karaman).

0169-7439/$ – see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.chemolab.2012.12.005

Article history:
Received 24 September 2012
Received in revised form 20 December 2012
Accepted 21 December 2012
Available online 7 January 2013

Keywords:
Sparse PLSR
Jack-knife PLSR
Cross model validation
Perturbation parameter

Abstract

The objective of this study was to compare two different techniques of variable selection, Sparse PLSR and Jack-knife PLSR, with respect to their predictive ability and their ability to identify relevant variables. Sparse PLSR is a method that is frequently used in genomics, whereas Jack-knife PLSR is often used by chemometricians. In order to evaluate the predictive ability of both methods, cross model validation was implemented. The performance of both methods was assessed using FTIR spectroscopic data on the one hand and a set of simulated data on the other. The stability of the variable selection procedures was highlighted by the frequency of the selection of each variable in the cross model validation segments. Computationally, Jack-knife PLSR was much faster than Sparse PLSR. But while it was found that both methods have more or less the same predictive ability, Sparse PLSR turned out to be generally very stable in selecting the relevant variables, whereas Jack-knife PLSR was very prone to also selecting uninformative variables. To remedy this drawback, a strategy of analysis consisting in adding a perturbation parameter to the uncertainty variances obtained by means of Jack-knife PLSR is demonstrated.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Partial least squares regression (PLSR) has frequently been used during the last decades for the purpose of prediction with multicollinear data. It is well known that this strategy of analysis has a good predictive ability and yields graphical displays which enhance the interpretation of the models [1,2]. Even though PLSR efficiently handles multicollinear and noisy data, plots of loadings and regression coefficients may be cumbersome and difficult to interpret if the number of variables is very high [3].

Several approaches for variable selection in PLSR have been proposed to identify those variables in a matrix X which are relevant for the prediction of a matrix Y (either univariate or multivariate) [4–18]. One of these approaches is based on the calculation of correlations between subsets of X-variables and Y-variables. This is done either by iteratively detecting the X-variables with the highest correlation with the Y-variables and selecting an interval around the detected variables [9], or by splitting X into a predefined number of variable intervals and then applying PLSR to each interval in order to select the most predictive intervals [10]. Furthermore, it has been suggested to determine subsets of X-variables with high correlations with the Y-variables by grouping the X-variables according to their correlations with the Y-variables and selecting the subset with the best predictive ability [9]. Another approach, consisting of a PLSR method combined with a backward variable selection, has been reported for near infrared spectroscopic data [19]. In this approach, at each step, the variable that is deemed to be irrelevant is removed from X.

Another group of variable selection methods is based on the assessment of the relative importance of the PLS regression coefficients. Examples of such methods are uninformative variable elimination by PLS (UVE-PLS) [11] and PLSR with Jack-knifing [12–14]. In both of these methods, uncertainty estimates for the regression coefficients are calculated. In UVE-PLS, new variables composed of random numbers are appended to X, and the new X thus formed is used in PLSR. Thereafter, uncertainty estimates are calculated by Jack-knifing. The variables in the original X having uncertainty estimates of the same magnitude as the new (appended) variables are considered as uninformative and removed from X prior to performing a final PLSR. In Jack-knife PLSR, uncertainty estimates of the X-variables are used in a backward elimination algorithm until only those variables that are deemed to be significant are left [14]. Alternatively, the procedure can be stopped after one iteration or repeated manually [20].


Recently, methods leading to sparse or almost sparse loading weights have gained ground. Examples of these methods are powered PLS (PPLS) [15], soft-threshold-PLS (ST-PLS) [16], and sparse PLS [17,18]. All these algorithms aim at determining loading weights that are sparse or very close to sparse. PPLS consists of a stepwise optimization over a set of candidate loading weights obtained by taking powers of the y-X correlations and X standard deviations. In ST-PLS, soft-thresholding is used to shrink the loading weights, while in sparse PLS, the idea is to penalize the loading weights by LASSO [21] or Elastic Net [22].

Sparse PLSR methods have become very popular within the field of genomics, since they have proved to be very effective in situations where the number of variables is high. They can be traced back to optimization criteria with penalties [17,18,23,24]. Jack-knife PLSR, which can also provide sparse models, is well known among chemometricians [12–14,25,26]. It has been widely used in various applications, especially for spectroscopic data, and has been implemented in the commercial software The Unscrambler [20]. Contrariwise, Sparse PLSR has not been used for spectroscopic data. An important advantage of Jack-knife PLSR over Sparse PLSR is its speed, since it is based on uncertainty tests of the regression coefficients. By comparison, Sparse PLSR is computer-intensive, since it involves the tuning of parameters prior to modeling. The objective of this study was to compare Sparse PLSR and Jack-knife PLSR with respect to their predictive ability and their ability to find important variables.

It is worth noting that both methods involve optimization steps and are, therefore, prone to overfitting. In order to overcome this issue, we implemented a cross model validation in order to better compare these two methods [13,27,28]. Furthermore, by using a cross model validation, we also aimed at assessing whether the variable selection procedure is stable across the different sub-models from cross model validation. Thus, the two methods of analysis are compared on the basis of the prediction ability of the subsets of variables selected at each cross model validation loop and the frequency of selection of each individual variable over the various loops. More precisely, a method will be assessed as stable if the same variables are selected with a high frequency in the course of the cross model validation.

Although we conclude that Jack-knife PLSR has potential for analyzing spectroscopic data, we have identified a shortcoming, namely that some uninformative variables were not discarded. In order to remedy this defect, we suggest introducing a perturbation parameter that aims at inflating the uncertainty variance associated with the regression coefficients. Thereafter, it is assessed whether the regression coefficients remain significant in spite of this perturbation. Illustrations on the basis of a data set from bio-spectroscopy and a simulated data set are discussed.

2. Theory

2.1. Terminology

In the following, matrices are written as upper-case bold letters (X), vectors are written as lower-case bold letters (b), and scalars are written as italic letters (a, A). The number of samples is denoted by N, and the number of variables of the data matrix X by K.

The index of one cross-validation segment is denoted by m = 1,…,M, the left-out subset of X by X_m, and the remaining samples by X_{-m}. In cross-validation, a model is set up with (X_{-m}, Y_{-m}) by a regression method. This model is applied to X_m in order to predict Y_m. Thereafter, the prediction error is calculated to assess the predictive ability of the model.

For the index of one cross model validation segment we use l = 1,…,L. Thus, the left-out subset of X is X_l, and the remaining samples are X_{-l}.

The index of PLS components used in the PLSR models is denoted by a = 1,…,A, and the optimal number of PLS components by Aopt. The transpose of a matrix X is denoted by X′, and a prediction of a matrix Y by a (PLS) regression model is denoted by Ŷ.

2.2. Sparse PLSR

Originally, sparse projection methods were based on imposing sparsity on the loading vectors in principal component analysis (PCA), in order to enhance the interpretability of the principal components [29–31]. The rationale behind this method is that small loadings may be set to zero without significantly decreasing the total variance recovered by the principal components. Subsequently, sparsity was extended to PLS regression. For this purpose, several approaches based on the penalization of the loading weights were proposed [16–18,32].

The strategy of Sparse PLS regression proposed in [17] is based on a PLS-singular value decomposition (PLS-SVD) [33]. This consists of a PLS regression of a data set Y upon a data set X. For this purpose, the matrix M = X′Y is computed and thereafter subjected to an iterative algorithm to obtain loading weights of X and Y. The sparsity is achieved by applying a soft thresholding function to the loading weights. This function, denoted by STλ, is based on a threshold parameter λ and is defined for a given scalar z as STλ(z) = sign(z)·max(|z| − λ, 0). In other words, STλ sets to zero those loading weights whose absolute values are smaller than the threshold parameter λ. A noteworthy property of soft thresholding is that it is linked to the Elastic Net method [22], and it has been shown that soft thresholding is more or less identical to Elastic Net when the number of variables is much higher than the number of samples [30]. In this study, the soft thresholding was applied only to the loading weights associated with the data matrix X.
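To make the soft-thresholding step concrete, the following minimal NumPy sketch (not the authors' MATLAB implementation; the function names are ours) computes a single sparsified X loading-weight vector from the dominant singular vector of M = X′Y. The full algorithm of [17] additionally iterates and deflates the data between components.

```python
import numpy as np

def soft_threshold(z, lam):
    # ST_lambda(z) = sign(z) * max(|z| - lambda, 0), applied element-wise
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def sparse_pls_weight(X, Y, lam):
    # Cross-product matrix M = X'Y (K x q), as in the PLS-SVD formulation
    M = X.T @ Y
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    w = soft_threshold(U[:, 0], lam)       # sparsify the X loading weights only
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w     # re-normalize the surviving weights
```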

2.2.1. Selection of the sparsity parameter and the optimal number of PLS components

The term degree of sparsity is defined as the number of zeros in the loading weights for each PLS component, and it has a direct correspondence to λ [31]. It was suggested that λ could be tuned by cross-validation (CV) or by random selection [17]. In this study, we tuned λ for each PLS component by CV. For each PLS component, Q², an index that reflects the predictive ability of the model [34], was calculated for several values of λ within a pre-specified range. Q² is calculated for a univariate y vector as:

Q^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}    (1)

Here ŷ_i represents the prediction of the ith sample and ȳ represents the mean of y.
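As a sketch, Eq. (1) translates directly into code (the helper name `q2` is our own and is reused in later sketches):

```python
import numpy as np

def q2(y_true, y_pred):
    # Q2 = 1 - sum((y_i - yhat_i)^2) / sum((y_i - ybar)^2), cf. Eq. (1)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    press = np.sum((y_true - y_pred) ** 2)          # prediction error sum of squares
    tss = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares
    return 1.0 - press / tss
```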

Thereafter, the degree of sparsity which corresponds to the highest Q² was selected. In the case of multivariate Y, the root mean squared error of cross-validation (RMSECV) can be used, and the degree of sparsity which corresponds to the lowest RMSECV should be selected. It is worth noting that since this CV process is likely to be time consuming, one can predefine a minimum degree of sparsity and, therefore, drastically restrict the range of exploration of the optimal tuning parameter.

Once the sparsity parameters are determined for the various PLS components, we proceed to determining the optimal number of PLS components (Aopt). For this purpose, the Q² values were calculated by generating new models with an increasing number of PLS components. Usually, the optimal number of PLS components is the one that corresponds to the highest value of Q². However, since over-fitting is a serious issue, we advocate not systematically choosing the model which corresponds to the highest Q². Rather, we recommend accepting a slightly lower Q² for fewer components. More precisely, the following strategy was followed to select the appropriate number of components, Aopt:

i) Find the number of PLS components with the highest Q².
ii) Check whether there are models with a smaller number of PLS components whose Q² indices are larger than 95% of the optimal Q² found in (i).
iii) Among such models, choose the model with the smallest number of PLS components.

The aforementioned strategy can easily be modified using RMSECV instead of Q² in the case of multivariate Y. Obviously, the optimal RMSECV will be searched among the lowest values.
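A minimal sketch of this selection rule (our hypothetical helper, assuming `q2_values[a-1]` holds the cross-validated Q² of an a-component model) could look as follows:

```python
def select_n_components(q2_values, fraction=0.95):
    # Step (i): best Q2 over all candidate numbers of components
    best_q2 = max(q2_values)
    # Steps (ii)-(iii): smallest model whose Q2 is at least 95% of the best Q2
    for a, q in enumerate(q2_values, start=1):
        if q >= fraction * best_q2:
            return a
    return len(q2_values)
```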

2.2.2. The Sparse PLSR model

As a result of optimizing the degree of sparsity and the number of PLS components by cross-validation, we obtain a final model with sparse loading weights and Aopt PLS components. Obviously, a variable that has a zero loading weight in all the selected PLS components has a zero regression coefficient. Therefore, in Sparse PLSR we consider the variables with nonzero regression coefficients as the selected variables.

2.3. Jack-knife PLSR

Jack-knifing is a resampling method that is well known among chemometricians [35]. It has been modified for PLS regression and PCA, where it is used to select variables by calculating uncertainty estimates associated with the regression coefficients and the loadings [12,14,36].

The resampling in Jack-knife PLSR is achieved by CV, in which the N available samples are split into M subsets. In the present implementation, this split was done randomly. Prior to Jack-knifing, a PLSR model is generated and regression coefficients (b) are calculated using all the available samples. Thereafter, regression coefficients for every sub-model of the CV steps (b_{-m}) are calculated. The uncertainty variances for each regression coefficient, s²(b), are calculated according to

s^2(b) = \left( \sum_{m=1}^{M} (b - b_{-m})^2 \right) \frac{M-1}{M}    (2)

where (M−1)/M is a scaling correction factor that accounts for the partial overlap between the M sample sets used for developing the models in CV.

Using these variances, t-statistics are computed as:

t_b = \frac{b}{s(b)}    (3)

In this notation, b refers to a regression coefficient associated with a specific X/Y variable combination, obtained with a PLS regression where all the samples are included.

The t-statistics are used in t-tests with M degrees of freedom in order to assess whether the coefficients associated with the various variables are significantly different from 0. As a by-product, p-values are obtained. Regression coefficients with small p-values, i.e. below a pre-specified threshold, are considered as significant.
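The following sketch (our Python/SciPy illustration, not The Unscrambler or the authors' MATLAB implementation) combines Eqs. (2) and (3) for a univariate y: it takes the coefficients of the full model and of the M CV sub-models and returns uncertainty estimates, t-values and two-sided p-values with M degrees of freedom, as described in the text.

```python
import numpy as np
from scipy import stats

def jackknife_uncertainty(b_full, b_sub):
    """b_full: (K,) coefficients from the model using all samples.
    b_sub: (M, K) coefficients from the M cross-validation sub-models."""
    M = b_sub.shape[0]
    s2 = np.sum((b_full - b_sub) ** 2, axis=0) * (M - 1) / M   # Eq. (2)
    s = np.sqrt(s2)
    t = b_full / s                                             # Eq. (3)
    p = 2.0 * stats.t.sf(np.abs(t), df=M)                      # two-sided p-values
    return s, t, p
```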

2.3.1. Selection of variables and optimal number of PLS components

In Jack-knife PLSR, variable selection can be achieved by comparing the p-values associated with the regression coefficients with a predefined threshold value. The variables in X with p-values greater than the threshold were discarded from X, and a new PLSR model was set up using the retained set of variables [20].

For multivariate Y, Jack-knife PLSR has been extended, and the method is readily available in the commercial software The Unscrambler. For each combination of an X- and a Y-variable, a p-value is obtained, meaning that a given X-variable has multiple p-values. In order to decide whether an X-variable is relevant or not, it is usually advocated to check whether at least one of the p-values corresponding to the different Y-variables for this X-variable is sufficiently small, i.e., lower than a threshold. This is equivalent to merging the selected variables for each column of Y.
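For multivariate Y, the rule above amounts to keeping an X-variable if its smallest p-value across the Y columns is below the threshold. A small illustration (our hypothetical helper, not part of The Unscrambler):

```python
import numpy as np

def merge_multivariate_selection(p_values, threshold=0.05):
    # p_values has shape (K, q): one p-value per X-variable / Y-variable combination.
    # An X-variable is retained if at least one of its q p-values is below the
    # threshold, which is equivalent to merging the per-column selections.
    return np.min(p_values, axis=1) < threshold
```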

2.3.2. Introduction of a perturbation parameter in Jack-knifing

The rationale behind Eq. (3) is to highlight those variables that have relatively large regression coefficients while, at the same time, having small uncertainty variances. However, it occurs that some variables with relatively small regression coefficients but also small uncertainty variances lead to relatively large t-values, thus incorrectly indicating that these coefficients are significant. These variables correspond to uninformative yet stable variables. For instance, they may correspond to some experimental artifact such as a background signal. In order to remedy this pitfall, we propose to augment the uncertainty variance by a quantity ε and to assess whether, in spite of this perturbation, the regression coefficients remain significant. In other terms, by introducing this perturbation, the t-statistic becomes:

t_{b,\varepsilon} = \frac{b}{s(b) + \varepsilon}    (4)

Therefore, the t-values will decrease and, unless the coefficient b is large enough to counterbalance the increase of the uncertainty variance, the t-test will tend to be non-significant and the associated variable will be discarded. This approach was adapted from a study on gene expressions measured by microarrays, where a similar statistic was defined and a small positive constant was added to the denominator, as ε is added here [37]. Adding the quantity ε in Eq. (4) is somewhat comparable to adding random noise of the same magnitude over the data set. Furthermore, the perturbation parameter, which needs to be customized to the data at hand, provides a possibility to control the degree of sparsity, with the understanding that the higher ε is, the more variables are discarded.

The perturbation parameter ε is chosen intuitively, since it depends on the size of the b and s(b) values in the data set. In this study, we initialized ε at zero, so that all X-variables are in the Jack-knife PLSR model, and then increased ε until only a few X-variables remained.
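Continuing the earlier sketch, the ε-perturbed test of Eq. (4) only changes the denominator of the t-statistic; everything else stays the same (again an illustration, with M degrees of freedom as in the text and a hypothetical function name):

```python
import numpy as np
from scipy import stats

def perturbed_selection(b_full, s, epsilon, M, alpha=0.05):
    # t_(b,eps) = b / (s(b) + eps), Eq. (4); a larger eps discards more variables
    t_eps = b_full / (s + epsilon)
    p = 2.0 * stats.t.sf(np.abs(t_eps), df=M)
    return p < alpha              # boolean mask of retained X-variables
```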

2.4. Cross model validation

Cross model validation (CMV) is used in order to prevent overfitting and selecting false positive variables [13,38,39]. CMV is very similar to CV, but an additional external CV loop is included where a subset of the samples is randomly chosen as an “independent test set” for testing the optimized model, while the inner CV loop is used for optimizing the model. The work-flow of CMV for Sparse PLSR and Jack-knife PLSR as used in this study is depicted in Fig. 1. Firstly, one sample or a group of samples (X_l, Y_l) was left out from the data set, and a model was set up with the remaining samples (X_{-l}, Y_{-l}) by either Sparse PLSR or Jack-knife PLSR using CV. This model was applied to X_l, the independent test samples, to predict their values Ŷ_l. Thereafter, Q² values were calculated for the independent cross model validation samples (Q²_CMV) in order to assess the predictive ability of the model.

Each cross model validation loop resulted in a set of selected variables. In order to assess the stability of the procedure of variable selection, the different sets of selected variables were compared. For each variable, we computed how many times it was selected in the CMV process. The frequency of selection of a given variable in the CMV process may be seen as a measure of the degree of relevance of the variable under consideration. Moreover, this frequency also highlights the extent to which the procedure of selection is stable. In other terms, a procedure of selection of variables will be considered as stable if, in the course of the CMV, it leads to the same subset of selected variables.

Fig. 1. Flow chart of the cross model validation algorithm.
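The double loop of Fig. 1 can be summarized by the following schematic sketch (our illustration, not the authors' MATLAB code; `fit_with_inner_cv` is a hypothetical callable that runs the inner CV optimization for either Sparse or Jack-knife PLSR and exposes a `predict` method and a boolean `selected_` mask; `q2` is the helper from the earlier sketch; the per-fold averaging of Q²_CMV is a simplification):

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_model_validation(X, y, fit_with_inner_cv, n_outer=8, seed=0):
    outer = KFold(n_splits=n_outer, shuffle=True, random_state=seed)
    q2_cmv = []
    selection_count = np.zeros(X.shape[1])
    for train_idx, test_idx in outer.split(X):
        # Inner CV loop: optimize sparsity / number of components / epsilon
        model = fit_with_inner_cv(X[train_idx], y[train_idx])
        # Outer loop: evaluate on the left-out, independent test samples
        y_hat = model.predict(X[test_idx])
        q2_cmv.append(q2(y[test_idx], y_hat))
        selection_count += model.selected_   # frequency of selection per variable
    return np.mean(q2_cmv), selection_count
```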

3. Data sets

A real and a simulated data set were used in order to assess the predictive performance and the variable selection of the Sparse PLSR and Jack-knife PLSR methods. We purposefully chose to work on vibrational spectroscopic data because, to our knowledge, Sparse PLSR has never been applied to this kind of data, whereas Jack-knife PLSR has been frequently used for vibrational spectroscopic data.

3.1. Real data set

The real data set consists of Fourier transform infrared (FTIR) spectra obtained from a microbiology study where strains of a food-borne bacterial species, Listeria monocytogenes, were characterized by FTIR spectroscopy according to their susceptibility towards bacteriocins such as sakacin P. The susceptibility was characterized by a continuous measurement variable y. We chose the FTIR data as independent variables X and the continuous variable describing the susceptibility of L. monocytogenes to sakacin P as dependent variable y. Spectra were pre-processed by taking the second derivative and applying extended multiplicative signal correction [40,41]. The data set consisted in total of 88 samples, and the matrix X consisted of 3631 variables, where the variables represent the infrared range between 4000 cm−1 and 500 cm−1. Pre-selection of spectral regions is often performed but was not applied herein.

3.2. Simulated data set

A 50×5000 matrix X was created with five components and overlapping features by the following model:

X = C S' + E_X    (5)

where C is a 50×5 matrix of concentrations. Each column of C was generated by randomly drawing numbers from a uniform distribution on the open interval (0,1). S is a 5000×5 matrix representing five “pure component” spectra, one in each column. Each component was placed at a different position within the interval 1–500 of the full spectrum (Fig. 2) and generated from a normal distribution with varying standard deviations (Table 1). The rest of each spectrum, without information, was set to zero. E_X, a 50×5000 matrix of noise, was randomly drawn from a normal distribution with mean zero. In order to produce data sets of different noise levels, we used different standard deviations (0.05, 0.01, 0.005 and 0.001) and thus obtained four different data sets. Finally, all variables of X were forced to be non-negative by subtracting the minimum value of each column from every entry of the corresponding column.

A 50×1 vector y was created by the following model:

y = 3 c_1 + c_4 + e_y    (6)

where c_1 and c_4 are the concentration vectors of the first and fourth components, i.e. the first and fourth columns of C; e_y is a 50×1 vector of noise with elements randomly drawn using the same four noise levels as for E_X. The first and fourth components are related to y, and therefore the corresponding variables will be relevant to the final models. In Fig. 2, it can be seen that the first and the fourth components cover the regions 1–100 (region A) and 251–400 (region D).

Fig. 2. Representative spectra of five pure components used to generate X in the simulated data set.
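A sketch of the data generation following Eqs. (5) and (6) and Table 1 is given below (our illustration; the Gaussian shape of the pure-component “spectra” and the mapping of the Table 1 standard deviations to peak widths are assumptions, since only the intervals and the relevance of the components are fully specified in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, noise_sd = 50, 5000, 0.05          # one of the four noise levels

# Five pure-component "spectra" placed on the intervals of Table 1
intervals = [(1, 100), (51, 200), (151, 300), (251, 400), (351, 500)]
widths = [0.2, 0.3, 0.4, 0.5, 0.6]       # Table 1 standard deviations (scaling assumed)
x = np.arange(1, K + 1)
S = np.zeros((K, 5))
for j, ((lo, hi), w) in enumerate(zip(intervals, widths)):
    centre = 0.5 * (lo + hi)
    S[:, j] = np.exp(-0.5 * ((x - centre) / (w * (hi - lo))) ** 2)  # Gaussian-shaped peak
    S[: lo - 1, j] = 0.0                 # outside its interval the component is zero
    S[hi:, j] = 0.0

C = rng.uniform(0.0, 1.0, size=(N, 5))                     # concentrations
X = C @ S.T + rng.normal(0.0, noise_sd, size=(N, K))       # Eq. (5): X = CS' + E_X
X = X - X.min(axis=0)                                      # force non-negative columns

y = 3.0 * C[:, 0] + C[:, 3] + rng.normal(0.0, noise_sd, N) # Eq. (6): y = 3c1 + c4 + e_y
```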


4. Software

The Sparse and Jack-knife PLSR algorithms, including the CV and CMV procedures, were implemented using in-house and standard MATLAB routines [42].

5. Results and discussion

The Sparse PLSR and Jack-knife PLSR methods were applied to the real and the simulated data sets. The models were generated by CV. Since leave-one-out CV is computationally expensive for data sets with a high number of samples and since it is sometimes biased, we used 8- and 10-fold CV for the real and simulated data sets, respectively, as previously suggested [43] (the default for the commercial software Simca-P is 7-fold [44]). In this paper, 8-fold CV was used for the real data set (N=88), which allowed taking out 11 samples in each segment, and 10-fold CV for the simulated data set (N=50), which allowed taking out 5 samples in each segment. Similarly, for the CMV calculations, 8-fold CMV (7-fold CV within each CMV step) and 10-fold CMV (9-fold CV within each CMV step) were used for the real and the simulated data sets, respectively.

5.1. Real data set

5.1.1. Sparse PLSR

The optimization of the degree of sparsity was performed for the first 10 PLS components using 8-fold CV as explained in Section 2.2.1. The optimization process was carried out starting from a degree of sparsity of 3250 out of 3631 in order to save time, i.e., starting with 381 initial variables in the model. As a result, optimized degree of sparsity values for 10 PLS components were determined; the first four are 3296, 3599, 3514, and 3256, respectively, i.e., 335, 32, 117, and 375 selected variables for each PLS component. After this process, the number of PLS components to be used in the Sparse PLSR model was optimized and determined to be two. Thus, the final model, with a Q² of 0.45, was generated with two PLS components. In total, 335 variables were selected by setting up the Sparse PLSR model.

Table 1
Details of the simulated data set with five components.

         Mean   Stdev   Interval   Relevant
Comp 1   0      0.2     1–100      Yes
Comp 2   0      0.3     51–200     No
Comp 3   0      0.4     151–300    No
Comp 4   0      0.5     251–400    Yes
Comp 5   0      0.6     351–500    No

In Fig. 3a and b, plots of the loading weights are shown for the two retained components of the Sparse PLSR model. In Fig. 3c, a plot of the regression coefficients of the same model is shown. On the first component, the highest contribution arose from the band around 985 cm−1, which is related to the variation in susceptibility towards sakacin P [45]. In addition, a contribution of the band around 840 cm−1 was detected, which has also been suggested earlier to be related to the variation in susceptibility towards sakacin P. On the second component, the contributions of the fatty acid and protein regions can clearly be seen. From the plot of the regression coefficients, it appears that peaks in the fatty acid and protein regions influence the model more than the fingerprint region.

In Fig. 3d and e, score and correlation loading plots are shown. In order to enhance the interpretation of these outcomes, samples are colored according to sakacin groups. The sakacin groups were obtained by defining a Group A containing samples with a low level of sakacin sensitivity (between 0.0085 and 0.0303) and a Group B containing samples with a high level of sakacin sensitivity (between 0.0494 and 0.5983). These groups are separated by a gap in the sensitivity level according to the previous study [46]. According to the score plot, the first component separates the samples into three distinct clusters (earlier termed FTIR groups [47]). The fingerprint region contributes strongly to this separation [47]. The grouping with respect to sakacin groups is mainly due to the second component. The correlation loading plot shows that peaks around 1630 and 1660 cm−1, which correspond to the protein region, are highly correlated with the grouping, and there are contributions of other peaks around 1560 and 2915 cm−1, which correspond to the protein and fatty acid regions, respectively.

An 8-fold CMV (7-fold CV within each CMV step) was performed in order to assess the predictive ability of the final model. Q²_CMV was equal to 0.40, which is close to Q²_CV. In order to assess the stability of the variable selection, the selected variables of the final CV model and the models of each CMV step were compared. Table 2 gives the frequency of selection of the various variables. As can be seen in the first row of Table 2, 100% of the variables that were selected by the final CV model were selected by at least one CMV model. The second row of Table 2 shows that 100% of the variables that were selected by the final CV model were selected by at least two CMV models, etc. We observe that all variables of the final CV model were selected by at least four of the CMV sub-models and 80% of the variables were selected by all CMV sub-models. In addition, more than 95% of the variables of the final CV model are observed in at least six of the CMV sub-models. Therefore, it can be concluded that variable selection in Sparse PLSR for this data set is very stable. In addition, it is possible to view which variables are selected within all CMV sub-models. Since 80% of the variables of the final CV model were selected in all CMV sub-models, the final CV model appears as a stable and reliable model. In Fig. 3f, the frequency of selection is plotted for each variable. It can be seen that important peaks relating to the fatty acid, protein, mixed and fingerprint regions are constantly selected in the CMV steps. Peaks from uninformative FTIR regions were not selected, with a few exceptions such as peaks around 3500 cm−1.

Fig. 3. Plots of sparse loading weights of (a) LV 1 and (b) LV 2, (c) plot of regression coefficients, (d) score and (e) correlation loading plots of LV 1 vs LV 2 from 2-component Sparse PLSR model; (f) frequencies of selected variables by CMV of Sparse PLSR model for real data set. (Group A is labeled with blue ■ and Group B with red ● on the score plot in d.)

Table 2
Percentages of observing the 335 selected variables of the CV model in different numbers of CMV models using Sparse PLSR for the real data set.

Number of CMV models   Percent (%)
1                      100.00
2                      100.00
3                      100.00
4                      100.00
5                      98.81
6                      95.22
7                      86.87
8                      80.00


5.1.2. Jack-knife PLSR without the perturbation parameter ε

In the first step, a Jack-knife PLSR model was generated by 8-fold CV using all variables, and the optimal number of PLS components was found to be three. Then p-values for the regression coefficients of this 3-component PLSR model were estimated by an uncertainty test as described above. According to Section 2.3.1, an empirical threshold p-value of 0.05 was selected in order to remove non-significant variables. Removing non-significant variables resulted in 236 selected variables. In the next step, a new PLSR model was generated with the selected variables. The optimal number of PLS components was estimated as one for the new model by CV. Thus, the final Jack-knife PLSR model was generated with one PLS component and a Q² of 0.49, using the 236 selected variables.

In Fig. 4, the Jack-knife PLSR results are displayed. In Fig. 4a, a plot of the regression coefficients for the Jack-knife PLSR model is presented. Since the variable selection process was performed on the regression coefficients, the selected variables are visualized together with the regression coefficients. Only regression coefficients of the selected variables used in the final model were included; the coefficients of the discarded variables were set to zero. Similarly to the Sparse PLSR results, a high contribution of the fatty acid region (3000 to 2800 cm−1) and the protein region (1700 to 1500 cm−1) can be observed. In addition, a contribution of the mixed region (1500 to 1200 cm−1) was observed. In the fingerprint region, high values of regression coefficients at around 850, 950 and 1100 cm−1 were detected. The band around 840 cm−1 was suggested to be related to the variation in susceptibility towards sakacin P [45], and the band around 1100 cm−1 was also found to have a significant contribution [47].

Fig. 4. (a) Plot of regression coefficients, (b) score and (c) correlation loading plots of LV 1 vs LV 2 from 1-component Jack-knife PLSR model; (d) a representative FTIR sample spectrum pre-processed by EMSC; (e) frequencies of selected variables by CMV of Jack-knife PLSR model for real data set. (Group A is labeled with blue ■ and Group B with red ● on the score plot in d.) (f) Graphical demonstration of selected variables by Jack-knife PLSR with different ε values for real data set. Selected variables by Jack-knife PLSR with ε=0 (red ♦), ε=1 (magenta ▲), ε=3 (blue ●), ε=5 (green ★), and ε=10 (yellow ▼).

In Fig. 4b and c, score and correlation loading plots are depicted. From the score plot, a fairly good separation between Groups A and B is visible along the first component. The second component has no effect on the separation. The correlation loading plot shows that bands around 1560, 4000 and 1270 cm−1 are positively correlated with Group A; likewise, bands around 850, 2960, 2870, 1230 and 1105 cm−1 are positively correlated with Group B along the first component. In the spectral region between 4000 and 3000 cm−1, where we expect noise, the second derivative spectra used for the descriptor matrix X are flat, except for some minor oscillations due to water vapor signals from the atmosphere that are present in this region (see Fig. 4). This illustrates the fact that Jack-knife PLSR is likely to select irrelevant variables. An 8-fold CMV (7-fold CV within each CMV step) was performed in order to test the predictive ability of the final Jack-knife PLSR model. Q²_CMV was found to be 0.40, which is close to Q²_CV. Similarly to Sparse PLSR, the selected variables of the final CV model and the models of each CMV step were compared. In Table 3, the percentages of observing the selected variables of the CV model in the eight CMV sub-models are shown. It can be seen that about 80% of the variables of the CV model were selected in at least three of the CMV sub-models, and about 10% of the variables were selected in all CMV sub-models. In addition, approximately half of the variables of the CV model were selected in at least six of the CMV sub-models. Therefore, it can be concluded that variable selection in Jack-knife PLSR is not as stable as in Sparse PLSR. Similarly to the Sparse PLSR results, we can also identify which variables were selected within all CMV sub-models for Jack-knife PLSR. By plotting the frequency of selection (see Fig. 4e), we observe that the majority of peaks were selected from the protein and fatty acid regions. However, compared to Sparse PLSR, many variables from uninformative regions in the infrared domain appeared with high frequency, and some of them were observed as frequently as peaks from the protein and fatty acid regions.

5.1.3. Jack-knife PLSR with increasing perturbation parameter ε

In Fig. 4f, the variables selected by Jack-knife PLSR are shown together with a second derivative spectrum when tentative values for the perturbation parameter ε (ε=0, 1, 3, 5, 10) were employed. We observe that when increasing ε, uninformative variables gradually disappear. This is because, according to Eq. (4), ε forces the selection towards variables with relatively large regression coefficients (high peaks) rather than variables that merely have low uncertainty values. For ε=3, only peaks from the fingerprint, mixed, protein, and fatty acid regions were used in the model. The regression coefficients are mostly high within these regions, as seen in Fig. 5a. For ε=10, the model was generated by peaks from the protein region only. Q² for the Jack-knife PLSR models with ε=0, 1, 3, 5, and 10 were 0.49, 0.48, 0.46, 0.41, and 0.35, respectively. Therefore, applying and increasing ε in a Jack-knife PLSR model prevents selecting irrelevant variables and therefore enhances the interpretation of the outcomes. However, this comes with a decrease in the predictive ability of the models.

Table 3
Percentages of observing the 236 selected variables of the CV model in different numbers of CMV models using Jack-knife PLSR for the real data set.

Number of CMV models   Percent (%)
1                      92.80
2                      82.20
3                      80.08
4                      69.92
5                      57.63
6                      47.03
7                      26.27
8                      9.75

In order to show that in Jack-knife PLSR with ε=0 many variables with relatively low regression coefficients are selected, and that this can be avoided by introducing ε, we plotted in Fig. 5b and c histograms of the regression coefficients of the selected variables and histograms of the square roots of the uncertainty estimates, i.e. s(b), of the selected variables for ε=0 and ε=3. In Fig. 5b, we observe that for ε=0 a high number of variables was selected with b values close to zero. Accordingly, in Fig. 5c we observe that a high number of variables was also selected with very low uncertainty s(b). That the high number of selected variables with very low s(b) mostly refers to variables with relatively low regression coefficients b becomes obvious when introducing an ε. For ε=3, most of the variables with the lowest regression coefficients and, at the same time, the lowest stability values have disappeared, as can be seen in Fig. 5b. In conclusion we can say that, as expected, the high number of low s(b) values disappears when using ε=3. But surprisingly, this also suppresses the selection of low b values, since for ε=3 almost all low b values have disappeared. On the other hand, variables with high b values were mostly kept in the models. Thus, using ε helps prevent the selection of uninformative stable variables.

5.1.4. Comparison of Sparse PLSR and Jack-knife PLSR for the real data set

A first advantage of Sparse PLSR compared to Jack-knife PLSR is that the selection of variables is operated component-wise. This means that each component that is introduced in the model is a linear combination of a subset of the original variables, whereas for Jack-knife PLSR the selection of variables is performed once all the necessary components are introduced in the model. The gain of this advantage in terms of getting a better insight from the outcomes and in terms of the interpretation of the successive components introduced in the model is obvious.

The selected variables for Sparse PLSR and Jack-knife PLSR are depicted in Fig. 6. It turns out that the main difference between the variable selections was the presence of selected variables from the spectral noise region (4000 to 3700 cm−1) and the water absorption region (around 3400 cm−1) when Jack-knife PLSR was used. As stated above, this can be explained by the fact that the latter regression method is likely to select variables associated with noise when the variables present some stability (i.e., small uncertainty variances). This effect is to a large extent corrected by the introduction of a perturbation parameter, ε, but at the expense of a lower predictive ability.

Other findings can also be highlighted:

- Out of the 335 variables selected by Sparse PLSR and the 236 variables selected by means of Jack-knife PLSR, 63 variables were identified to be in common.
- Sparse PLSR selected several bands in the fingerprint region while Jack-knife PLSR did not.
- Both methods selected the peaks around 850, 945, 1105 and 1132 cm−1. There are no peaks in the mixed region selected in common by both methods. In the protein region, peaks around 1540, 1560, 1650 and 1665 cm−1, and in the fatty acid region, peaks around 2850 and 2920 cm−1, were selected by both methods.
- In both methods, a strong correlation between the protein region and the susceptibility level was observed.

Regarding the predictive ability, relatively similar Q² values were obtained by means of the Sparse PLSR and Jack-knife PLSR methods (0.45 and 0.49, respectively). The predictive ability of both methods was assessed by means of Q²_CMV, and the values corresponding to both methods were very close (0.40 for both, with ε=0 for Jack-knife PLSR).

Fig. 5. (a) Plot of regression coefficients from 2-component Jack-knife PLSR model with ε=3; selection frequency histograms for (b) regression coefficients and (c) square root of uncertainty estimates of variables in Jack-knife PLSR models with ε=0 and ε=3 for real data set. (The histograms were obtained after 8-fold CMV. The y-axis was truncated at 500 for better visualization.)

CMV made it possible to check the consistency and the stability of the methods. From this standpoint, it turned out that Sparse PLSR was more stable than Jack-knife PLSR in variable selection, since 80% of the variables of the Sparse PLSR CV model were observed in all Sparse PLSR CMV sub-models, whereas this percentage is equal to about 10% for Jack-knife PLSR.

Regarding the computation time, Jack-knife PLSR took approximately 24 s, whereas it took approximately 29 min for Sparse PLSR, to set up an 8-fold CV model using five PLS components. Obviously, Sparse PLSR is more time consuming than Jack-knife PLSR because of the optimization of the degree of sparsity for each PLS component.

Fig. 6. Graphical demonstration of selected variables for real data set within the whole FTIR region. Selected variables by Sparse PLSR (blue ▲), Jack-knife PLSR (green ▼), and both methods (red ♦).

5.2. Simulated data set

The same procedures of analysis described above were performed for the simulated data.

5.2.1. Sparse PLSR

The optimization of the degree of sparsity was achieved using a 10-fold CV, starting from a degree of sparsity of 4300. The optimized degrees of sparsity for the first four PLS components are 4934, 4843, 4389, and 4620, respectively, i.e., 66, 157, 611, and 380 selected variables for each PLS component. In the subsequent step, Aopt for the Sparse PLSR model was determined to be three, with a Q² of 0.98. In total, 619 variables were selected for setting up the Sparse PLSR model. In order to test the effect of the size of the CV segments, we performed 4, 5, 6, 7, 10, 25, and 50-fold CV and obtained almost identical results for Q² (results not shown).

Fig. 7. Plots of sparse loading weights of (a) LV 1, (b) LV 2 and (c) LV 3, (d) plot of regression coefficients from 3-component Sparse PLSR model; and (e) frequencies of selected variables by CMV of Sparse PLSR model for simulated data set.

In Fig. 7a, b and c, the sparse loading weights are plotted. From the first loading weights, it appears that only variables from the relevant regions A (1–100) and D (251–400) were selected. The percentage of variance in y explained by the first component is relatively high (around 70%). The first component did not select variables from the noise region F (501–5000). By considering the second and the third components, further relevant variables were selected. However, a few irrelevant variables from region E (401–500) and many from the noise region F were also selected. In fact, the inclusion of the third component resulted in an improvement of the predictive ability of the model, but it favored the inclusion of irrelevant variables. From the plot of the regression coefficients (Fig. 7d) for the 3-component model, the influence of the noise region F can be observed. It can be seen that this region has a very small influence on the model, since the associated coefficients are close to 0. In addition, for the simulated study it is possible to check how many of the relevant variables from regions A and D have been selected. Herein, Sparse PLSR selected 70 of 100 from region A and 132 of 150 from region D.

Table 4
Percentages of observing selected variables of the CV models in different numbers of CMV models using Sparse PLSR for the simulated data set with increasing standard deviation of added noise. The CV models have 619, 622, 52 and 382 selected variables, respectively, from left to right.

                        Standard deviation of added noise
Number of CMV models    0.05      0.01      0.005     0.001
1                       75.77     96.46     100.00    100.00
2                       62.20     89.87     100.00    100.00
3                       54.77     82.64     100.00    100.00
4                       48.63     78.62     100.00    100.00
5                       44.75     76.21     100.00    100.00
6                       41.84     74.28     100.00    97.38
7                       37.16     68.97     100.00    86.65
8                       25.69     63.18     86.54     67.02
9                       15.19     57.40     80.77     47.38
10                      4.36      44.86     42.31     15.71

Fig. 8. (a) Plot of regression coefficients from 2-component Jack-knife PLSR model; (b) frequencies of selected variables by CMV of Jack-knife PLSR model for simulated data set. (c) Graphical demonstration of selected variables by Jack-knife PLSR with different ε values for simulated data set. Selected variables by Jack-knife PLSR with ε=0 (red ♦), ε=0.001 (magenta ▲), ε=0.003 (blue ●), ε=0.005 (green ★), and ε=0.010 (yellow ▼). (d) Plot of regression coefficients for simulated data set from 1-component Jack-knife PLSR model with ε=0.005.

In order to test the predictive ability of the final Sparse PLSR model, a 10-fold CMV (9-fold CV within each CMV step) was performed. Q²_CMV was found to be 0.97, which is close to Q²_CV. In Table 4 the stability of the variable selection is evaluated. As in Table 2, the frequency of selection of the various variables is shown, but this time as a function of the noise level. It can be seen that the stability of selection highly depends on the noise level: when the noise level is below 0.005, all variables that were selected by the global model are selected by more than 50% of the cross model validation segments. As could be expected, the more the standard deviation of the added noise decreases, the more stable Sparse PLSR is. If the selection frequencies of the variables within CMV are considered (Fig. 7e), the frequently selected variables were mostly from the regions A, D, and E. These regions cover the relevant regions and, in addition, they cover the region which corresponds to the fifth pure component spectrum in the data set (see Fig. 2). Furthermore, there were variables from the noise region F which were frequently selected in the Sparse PLSR modeling process. Decreasing the noise level in the data set to 0.005 resulted in selecting far fewer variables in the noise region (results not shown). However, variables in the region covered by the fifth pure component spectrum were selected less frequently.

5.2.2. Jack-knife PLSR without the perturbation parameter ε

A Jack-knife PLSR model was generated by 10-fold CV using all the predictive variables, and the optimal number of PLS components was found to be equal to three. The selection of variables from the uncertainty tests resulted in 379 selected variables. Thereafter, PLSR was run on these variables and two PLS components were eventually retained. This model corresponds to a Q² of 0.98. As for Sparse PLSR, in order to test the effect of the size of the CV segments, we performed 4, 5, 6, 7, 10, 25, and 50-fold CV and obtained almost identical results for Q² (results not shown).

Table 5
Percentages of observing selected variables of the CV models in different numbers of CMV models using Jack-knife PLSR for the simulated data set with increasing standard deviation of added noise. The CV models have 379, 359, 411 and 387 selected variables, respectively, from left to right.

                        Standard deviation of added noise
Number of CMV models    0.05      0.01      0.005     0.001
1                       94.20     95.82     93.67     96.90
2                       84.70     90.81     87.83     89.41
3                       73.88     84.96     79.81     81.65
4                       61.74     77.44     72.02     75.71
5                       53.56     71.87     65.94     70.28
6                       44.59     63.23     56.93     66.15
7                       39.31     55.99     53.77     57.88
8                       33.51     50.70     45.74     53.75
9                       27.97     44.29     40.39     48.32
10                      22.69     37.05     33.82     31.78

In Fig. 8, the results for the Jack-knife PLSR model are shown for the simulated data set. In Fig. 8a, the plot of the regression coefficients for the Jack-knife PLSR model is displayed. We observe that many variables are selected in the noise region F (501–5000). A closer look at the spectral region (1–500), which is shown in the inset in Fig. 8a, revealed that most of the variables were selected in region A (1–100), with relatively large regression coefficients. This was expected since this region is related to the y vector. From the second informative part, region D (251–400), only a few variables are selected, with relatively low regression coefficients. Among the variables in region A, 79 of 100 were selected, whereas in region D only 30 of 150 were selected.

A 10-fold CMV (9-fold CV within each CMV step) was performed in order to test the predictive ability of the final Jack-knife PLSR model. Q²_CMV was found to be equal to 0.96, which is close to Q²_CV. In Table 5, the stability of the variable selection is evaluated, where, as in Table 4, different noise levels were used. This result differs from that of Sparse PLSR: changing the noise level only slightly affects the performance of Jack-knife PLSR. From Fig. 8b, which depicts the selection frequencies of the variables within CMV, it appears that many relevant variables of the informative regions A and D were selected in all CMV sub-models. Still, almost independently of the noise level, several variables from the noise region F were selected with high frequency; some of them were even observed in all CMV sub-models.

5.2.3. Jack-knife PLSR with an increasing perturbation parameter ε

In Fig. 8c, the variables selected by Jack-knife PLSR are shown for different values of ε (ε=0, 0.001, 0.003, 0.005, 0.010). Increasing ε prevented the selection of variables from the noise region F (501–5000). However, some variables from region D (251–400) were also discarded. Moreover, almost all the variables of region A (1–100) were included in the model, as can clearly be seen in Fig. 8d, which depicts the regression coefficients for ε=0.005. Q² for the Jack-knife PLSR models with ε=0, 0.001, 0.003, 0.005, and 0.010 were respectively 0.98, 0.99, 0.97, 0.99, and 0.91. However, even when ε was as high as 0.005, variables from the noise region F were still selected. When ε was further increased to 0.010, noise variables were not selected anymore, but also no variable from the informative region D was selected.

5.2.4. Comparison of Sparse PLSR and Jack-knife PLSR for the simulated data set

The predictive abilities of Sparse PLSR and Jack-knife PLSR were similar both in cross-validation and in cross model validation. Thus, we conclude that both methods have a similar predictive ability.

When comparing the stability of the variable selection, there were major differences between the two methods. When the noise level in the data is low to moderate, Sparse PLSR performs very well, but the stability of the variable selection decreases considerably when the noise is increased. The stability of the variable selection in Jack-knife PLSR is inferior to Sparse PLSR when the noise level is low to moderate. On the other hand, for high noise levels Jack-knife PLSR performs better.

The difference between Sparse PLSR and Jack-knife PLSR with respect to the variable selection performance under the present conditions may reflect that several component-wise variable selections performed better than one joint variable selection for the whole PLSR model. Alternatively, the difference may reflect that the Elastic Net optimization is superior to the Jack-knife resampling.

Regarding the computation time, the findings from the real case study were corroborated. Indeed, the setting up of a 10-fold CV model with four PLS components was achieved in approximately 41 s when using Jack-knife PLSR and approximately 49 min when using Sparse PLSR.

6. Conclusions

Our results show that both Sparse PLSR and Jack-knife PLSR models have a good and, to a large extent, similar predictive ability. Both methods enhance the interpretation of the outcomes since only a subset of variables is retained for the analysis. By using Sparse PLSR, in addition to enhancing the interpretation of the plot of the regression coefficients associated with the selected variables, it is possible to gain more insight from the loading weights associated with the successive components. Indeed, unlike Jack-knife PLSR, Sparse PLSR operates component-wise, thus leading to sparse loading weights.

Furthermore, we showed that cross model validation (CMV) is a useful tool for evaluating the predictive ability of the models when optimization is involved. Yet, when comparing the Q2 values obtained by CMV with the ones obtained by CV, we found that the Q2 values obtained by CMV were only slightly smaller, showing that the models obtained in the present case are stable. It also turned out that CMV is very useful for evaluating the stability of the selected variables. By determining the frequency of selection of each variable in CMV, we could investigate the relevance of the selected variables. These frequencies also make it possible to assess the stability of the variable selection performed by the regression methods. From the results on the real data set and from the results of the simulated data sets employing low and moderate noise, it turned out that Sparse PLSR was very stable in terms of the selected variables. Therefore, we can conclude that, when low or moderate noise is present, as in vibrational spectroscopy, Sparse PLSR performs very well: both its predictive ability and its variable selection accuracy are good. When the noise level was increased in the simulated data set, the variable selection performance of Sparse PLSR was poor. On the contrary, the variable selection performance of Jack-knife PLSR was almost unaffected when the noise level was increased. However, it is important to point out that the variable selection performance of Jack-knife PLSR was on average poorer than that of Sparse PLSR. We explained this by the fact that in Jack-knife PLSR some variables, although not relevant for predicting the Y-variables, may have very small uncertainty variances and are thus detected as significant in the Jack-knife PLSR variable selection process. In order to circumvent this pitfall, we introduced a perturbation of the uncertainty variances which discarded those variables that were not relevant. Yet, when increasing the perturbation parameter, the predictive ability slightly decreased. Further investigations are needed to optimize the perturbation parameter with respect to both variable selection and predictive ability.

As a general conclusion, we can state that under the conditions tested here, Sparse PLSR outperformed Jack-knife PLSR if we take into account both the stability of the variable selection and the predictive ability. But Jack-knife PLSR was to some extent improved by modifying the variable selection criterion, and it was much less time-consuming to run. Sparse PLSR may therefore be the method of choice when dealing with spectroscopic data, where the level of noise is relatively low, while for genetic data, where the level of noise is comparable to the signal, Jack-knife PLSR could be the better choice.

Acknowledgements

The authors are grateful for financial support by the Nordic Centre of Excellence on Food, Nutrition and Health "Systems biology in controlled dietary interventions and cohort studies" (SYSDIET; 070014) funded by NordForsk. In addition, this work was supported by grant 203699 (New statistical tools for integrating and exploiting complex genomic and phenotypic data sets) financed by the Research Council of Norway.

References

[1] H. Wold, Soft modeling: the basic design and some extensions, Systems under indirect observation, North-Holland, Amsterdam, 1982, pp. 1–53.
[2] H. Martens, T. Næs, Multivariate Calibration, Wiley, Chichester, 1992.
[3] N. Kettaneh, A. Berglund, S. Wold, PCA and PLS with very large data sets, Computational Statistics and Data Analysis 48 (2005) 69–85.
[4] T. Mehmood, K.H. Liland, L. Snipen, S. Sæbø, A review of variable selection methods in Partial Least Squares Regression, Chemometrics and Intelligent Laboratory Systems 118 (2012) 62–69.
[5] P. Zerzucha, B. Walczak, Again about partial least squares and feature selection, Chemometrics and Intelligent Laboratory Systems 115 (2012) 9–17.
[6] P. Filzmoser, M. Gschwandtner, V. Todorov, Review of sparse methods in regression and classification with application to chemometrics, Journal of Chemometrics 26 (2012) 42–51.
[7] J.P.M. Andries, Y. Vander Heyden, L.M.C. Buydens, Improved variable reduction in partial least squares modelling based on Predictive-Property-Ranked Variables and adaptation of partial least squares complexity, Analytica Chimica Acta 705 (2011) 292–305.
[8] C.M. Andersen, R. Bro, Variable selection in regression—a tutorial, Journal of Chemometrics 24 (2010) 728–737.
[9] A. Höskuldsson, Variable and subset selection in PLS regression, Chemometrics and Intelligent Laboratory Systems 55 (2001) 23–38.
[10] L. Nørgaard, A. Saudland, J. Wagner, J.P. Nielsen, L. Munck, S.B. Engelsen, Interval partial least-squares regression (iPLS): a comparative chemometric study with an example from near-infrared spectroscopy, Applied Spectroscopy 54 (2000) 413–419.
[11] V. Centner, D.L. Massart, O.E. de Noord, S. De Jong, B.M. Vandeginste, C. Sterna, Elimination of uninformative variables for multivariate calibration, Analytical Chemistry 68 (1996) 3851–3858.
[12] H. Martens, M. Martens, Modified Jack-knife estimation of parameter uncertainty in bilinear modelling by partial least squares regression (PLSR), Food Quality and Preference 11 (2000) 5–16.
[13] E. Anderssen, K. Dyrstad, F. Westad, H. Martens, Reducing over-optimism in variable selection by cross-model validation, Chemometrics and Intelligent Laboratory Systems 84 (2006) 69–74.
[14] F. Westad, H. Martens, Variable selection in near infrared spectroscopy based on significance testing in partial least squares regression, Journal of Near Infrared Spectroscopy 8 (2000) 117–124.
[15] U. Indahl, A twist to partial least squares regression, Journal of Chemometrics 19 (2005) 32–44.
[16] S. Sæbø, T. Almøy, J. Aarøe, A.H. Aastveit, ST-PLS: a multi-directional nearest shrunken centroid type classifier via PLS, Journal of Chemometrics 22 (2008) 54–62.
[17] K.A. Lê Cao, D. Rossouw, C. Robert-Granié, P. Besse, A sparse PLS for variable selection when integrating omics data, Statistical Applications in Genetics and Molecular Biology 7 (2008).
[18] H. Chun, S. Keles, Sparse partial least squares regression for simultaneous dimension reduction and variable selection, Journal of the Royal Statistical Society: Series B: Statistical Methodology 72 (2010) 3–25.
[19] J.A. Fernandez Pierna, O. Abbas, V. Baeten, P. Dardenne, A backward variable selection method for PLS regression (BVSPLS), Analytica Chimica Acta 642 (2009) 89–93.
[20] The Unscrambler, Camo Software, Oslo, Norway, 2012.
[21] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series B: Methodological 58 (1996) 267–288.
[22] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B: Statistical Methodology 67 (2005) 301–320.
[23] K.A. Lê Cao, P.G.P. Martin, C. Robert-Granié, P. Besse, Sparse canonical methods for biological data integration: application to a cross-platform study, BMC Bioinformatics 10 (2009).
[24] D. Chung, S. Keles, Sparse partial least squares classification for high dimensional data, Statistical Applications in Genetics and Molecular Biology 9 (2010).
[25] H. Martens, M. Høy, F. Westad, D. Folkenberg, M. Martens, Analysis of designed experiments by stabilised PLS regression and jack-knifing, Chemometrics and Intelligent Laboratory Systems 58 (2001) 151–170.
[26] C.M. Rubingh, S. Bijlsma, E.P.P.A. Derks, I. Bobeldijk, E.R. Verheij, S. Kochhar, A.K. Smilde, Assessing the performance of statistical validation tools for megavariate metabolomics data, Metabolomics 2 (2006) 53–61.
[27] J.S. Urban Hjort, Computer Intensive Statistical Methods, Chapman and Hall, London, 1993.
[28] L. Gidskehaug, E. Anderssen, B.K. Alsberg, Cross model validated feature selection based on gene clusters, Chemometrics and Intelligent Laboratory Systems 84 (2006) 172–176.
[29] I.T. Jolliffe, N.T. Trendafilov, M. Uddin, A modified principal component technique based on the LASSO, Journal of Computational and Graphical Statistics 12 (2003) 531–547.
[30] H. Zou, T. Hastie, R. Tibshirani, Sparse principal component analysis, Journal of Computational and Graphical Statistics 15 (2006) 265–286.
[31] H. Shen, J.Z. Huang, Sparse principal component analysis via regularized low rank matrix approximation, Journal of Multivariate Analysis 99 (2008) 1015–1034.
[32] N. Krämer, A.L. Boulesteix, G. Tutz, Penalized Partial Least Squares with applications to B-spline transformations and functional data, Chemometrics and Intelligent Laboratory Systems 94 (2008) 60–69.
[33] J.A. Wegelin, Survey of Partial Least Squares (PLS) Methods, with Emphasis on the Two-Block Case, Technical Report 371, Department of Statistics, University of Washington, Seattle, 2000.
[34] G. Cruciani, M. Baroni, S. Clementi, G. Costantino, D. Riganelli, B. Skagerberg, Predictive ability of regression models. Part I: Standard deviation of prediction errors (SDEP), Journal of Chemometrics 6 (1992) 335–346.
[35] B. Efron, The Jack-knife, the Bootstrap and Other Resampling Plans, Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania, 1982.
[36] F. Westad, M. Hersletha, P. Lea, H. Martens, Variable selection in PCA in sensory descriptive and consumer data, Food Quality and Preference 14 (2007) 463–472.
[37] V.G. Tusher, R. Tibshirani, G. Chu, Significance analysis of microarrays applied to the ionizing radiation response, Proceedings of the National Academy of Sciences 98 (2001) 5116–5121.
[38] F. Westad, N.K. Afseth, R. Bro, Finding relevant spectral regions between spectroscopic techniques by use of cross model validation and partial least squares regression, Analytica Chimica Acta 595 (2007) 323–327.
[39] J.A. Westerhuis, H.C.J. Hoefsloot, S. Smit, D.J. Vis, A.K. Smilde, E.J.J. Velzen, J.P.M. Duijnhoven, F.A. Dorsten, Assessment of PLSDA cross validation, Metabolomics 4 (2008) 81–89.
[40] H. Martens, E. Stark, Extended multiplicative signal correction and spectral interference subtraction: new preprocessing methods for near infrared spectroscopy, Journal of Pharmaceutical and Biomedical Analysis 9 (1991) 625–635.
[41] A. Kohler, M. Zimonja, V. Segtnan, H. Martens, Standard normal variate, multiplicative signal correction and extended multiplicative signal correction preprocessing in biospectroscopy, in: D.B. Stephen, T. Romá, W. Beata (Eds.), Comprehensive Chemometrics, Elsevier, Oxford, 2009, pp. 139–162.
[42] MATLAB, The Mathworks Inc., Natick, Massachusetts, USA, 2010.
[43] S. Hassani, H. Martens, E.M. Qannari, A. Kohler, Degrees of freedom estimation in principal component analysis and consensus principal component analysis, Chemometrics and Intelligent Laboratory Systems 118 (2012) 246–259.
[44] Umetrics, User's Guide to SIMCA-P, SIMCA-P+, Umetrics, 2005.
[45] A. Oust, T. Møretrø, K. Naterstad, G.D. Sockalingum, I. Adt, M. Manfait, A. Kohler, Fourier transform infrared and Raman spectroscopy for characterization of Listeria monocytogenes strains, Applied and Environmental Microbiology 72 (2006) 228–232.
[46] T. Katla, K. Naterstad, M. Vancanneyt, J. Swings, L. Axelsson, Differences in susceptibility of Listeria monocytogenes strains to sakacin P, sakacin A, pediocin PA-1, and nisin, Applied and Environmental Microbiology 69 (2003) 4431–4437.
[47] A. Kohler, M. Hanafi, D. Bertrand, E.M. Qannari, A.O. Janbu, T. Møretrø, K. Naterstad, H. Martens, Interpreting several types of measurements in bioscience, Biomedical Vibrational Spectroscopy, John Wiley & Sons, Inc., 2008, pp. 333–356.

[47] A. Kohler, M. Hanafi, D. Bertrand, E.M. Qannari, A.O. Janbu, T. Møretrø, K.Naterstad, H. Martens, Interpreting several types of measurements in bioscience,Biomedical Vibrational Spectroscopy, John Wiley & Sons, Inc., 2008, pp. 333–356.