Prediction of retention indices for identification of fatty acid methyl esters

8
Journal of Chromatography A, 1198–1199 (2008) 188–195 Contents lists available at ScienceDirect Journal of Chromatography A journal homepage: www.elsevier.com/locate/chroma Prediction of retention indices for identification of fatty acid methyl esters Orsolya Farkas a , Igor G. Zenkevich b , Forrest Stout c , John H. Kalivas c ,K´ aroly H ´ eberger a,a Chemical Research Center, Hungarian Academy of Sciences, P.O. Box 17, H-1525 Budapest, Hungary b Chemical Research Institute, Universitetsky pr., 26, St. Petersburg 198504, Russia c Department of Chemistry, Idaho State University, Pocatello, ID 83209, USA article info Article history: Received 5 March 2008 Received in revised form 6 May 2008 Accepted 8 May 2008 Available online 14 May 2008 Keywords: Prediction of retention index Fatty acid methyl-esters QSRR Variable selection MLR PLS Pair-correlation method Lasso method Forward selection abstract Quantitative structure–retention relationships have been developed to predict retention indices of fatty acid methyl esters on standard non-polar polydimethylsiloxane stationary phases. Branched, saturated and unsaturated compounds were included. All retention indices have been evaluated by statistical pro- cessing of experimentally measured and literature data in accordance with the concept of interlaboratory data randomization. Multiple linear regression (MLR) has been carried out to find relationships between selected properties and retention indices. Models have been built in two different ways (i) the same degrees of freedom for all models have been fixed and the variable selection ability has been compared; (ii) variable selection methods have been used in their best performance. The five selection methods were: pair-wise correlation, forward selection, partial least squares projection of latent structures, modified best subset selection and the Lasso method. The stability and the validity of models have been tested by inter- nal and external validation. The error of predicted retention indices is close to the error of interlaboratory reproducibility of retention indices. The most relevant variables in description of retention indices were molecular mass, number of double bonds and number of rotatable bonds complemented with topological descriptors. Predictive models have been built for 130 fatty acid methyl esters for identification purposes. Moreover, prediction of unknown retention indices for 37 fatty acid methyl esters has also been carried out. © 2008 Elsevier B.V. All rights reserved. 1. Introduction Fatty acid methyl esters (FAMEs) belong to the most important objects for GC and GC–MS analyses. Methyl esters are preferable from analytical point of view for determination of fatty acids in the composition of low- or non-volatile triglycerides and lipids of bio- genic origin. The typical number of carbon atoms in the molecules of fatty acids from various sources is generally between 12 and 24. These compounds belong to different chemical sub-groups; their general classification includes saturated, monoenic, polyenic and branched carboxylic acids. The total number of compounds of this class exceeds a few hundred; their contemporary list is presented in the Internet review [1]. The volatility of most FAMEs allows their GC and GC–MS anal- yses with the use of standard wall-coated open tubular (WCOT) columns. However, positive identification of these analytes needs comprehensive information including both standard mass spectra (electron ionization, 70 eV) and GC retention indices (I) on stan- dard phases. The mass spectra for many isomeric methyl esters, Corresponding author. Tel.: +36 1 438 0490; fax: +36 1 325 75 54. E-mail address: [email protected] (K. H ´ eberger). especially those of branched and polyenic acids are highly similar [2]. Therefore, GC and GC–MS identification of FAMEs necessitates the utilization of retention indices [3,4]. Nevertheless, experimen- tal values of these parameters are known only for a relatively small number of most widespread compounds at present [4–16]. In spite of the structural simplicity of FAMEs, no reliable additive schemes for I evaluation have been proposed yet. I values for isomeric alka- nes depend on the position of branching within carbon skeleton of the molecule. However, the position of even methyl group in branched FAMEs should be located by enumeration of carbon atoms starting from both terminal sides of carbon chain. Thus, the first descriptor (k) presents the position of a methyl substituent rel- ative to methoxycarbonyl group, while the second one (n k + 1) presents their location relative to another side of the chain. Both of these descriptors influence the I values. The same problem exists for the descriptors locating the posi- tion of C C double bonds in the molecule. Besides that, if a few substituents are present in the chain, it is necessary to reflect their location in sterically hindered (k 1, k) or non-hindered (k m, k (m > 1)) positions. Similarly, the location of double bonds in (Z) or (E) configuration in conjugated (l 2, l) or non-conjugated (l m, l(m > 2)) positions should be fixed. As a result, the number of descriptors becomes rather large, that explains us the absence of 0021-9673/$ – see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.chroma.2008.05.019

Transcript of Prediction of retention indices for identification of fatty acid methyl esters

Journal of Chromatography A, 1198–1199 (2008) 188–195

Contents lists available at ScienceDirect

Journal of Chromatography A

journa l homepage: www.e lsev ier .com/ locate /chroma

Prediction of retention indices for identification of fatty acid methyl estersa b c c a,∗

Orsolya Farkas , Igor G. Zenkevich , Forrest Stout , John H. Kalivas , Karoly Heberger

a Chemical Research Center, Hungarian Academy of Sciences, P.O. Box 17, H-1525 Budapest, Hungaryb Chemical Research Institute, Universitetsky pr., 26, St. Petersburg 198504, Russiac Department of Chemistry, Idaho State University, Pocatello, ID 83209, USA

tentiondardds w

measiple lietentmodods hard s

asso m. Then indf doudels hnkno

a r t i c l e i n f o

Article history:Received 5 March 2008Received in revised form 6 May 2008Accepted 8 May 2008Available online 14 May 2008

Keywords:Prediction of retention indexFatty acid methyl-estersQSRRVariable selectionMLRPLSPair-correlation methodLasso methodForward selection

a b s t r a c t

Quantitative structure–reacid methyl esters on staand unsaturated compouncessing of experimentallydata randomization. Multselected properties and rdegrees of freedom for all(ii) variable selection methpair-wise correlation, forwsubset selection and the Lnal and external validationreproducibility of retentiomolecular mass, number odescriptors. Predictive moMoreover, prediction of uout.

1. Introduction

Fatty acid methyl esters (FAMEs) belong to the most importantobjects for GC and GC–MS analyses. Methyl esters are preferablefrom analytical point of view for determination of fatty acids in thecomposition of low- or non-volatile triglycerides and lipids of bio-genic origin. The typical number of carbon atoms in the moleculesof fatty acids from various sources is generally between 12 and 24.These compounds belong to different chemical sub-groups; theirgeneral classification includes saturated, monoenic, polyenic andbranched carboxylic acids. The total number of compounds of thisclass exceeds a few hundred; their contemporary list is presentedin the Internet review [1].

The volatility of most FAMEs allows their GC and GC–MS anal-yses with the use of standard wall-coated open tubular (WCOT)columns. However, positive identification of these analytes needscomprehensive information including both standard mass spectra(electron ionization, 70 eV) and GC retention indices (I) on stan-dard phases. The mass spectra for many isomeric methyl esters,

∗ Corresponding author. Tel.: +36 1 438 0490; fax: +36 1 325 75 54.E-mail address: [email protected] (K. Heberger).

0021-9673/$ – see front matter © 2008 Elsevier B.V. All rights reserved.doi:10.1016/j.chroma.2008.05.019

n relationships have been developed to predict retention indices of fattynon-polar polydimethylsiloxane stationary phases. Branched, saturated

ere included. All retention indices have been evaluated by statistical pro-ured and literature data in accordance with the concept of interlaboratorynear regression (MLR) has been carried out to find relationships betweenion indices. Models have been built in two different ways (i) the sameels have been fixed and the variable selection ability has been compared;ave been used in their best performance. The five selection methods were:election, partial least squares projection of latent structures, modified best

ethod. The stability and the validity of models have been tested by inter-error of predicted retention indices is close to the error of interlaboratoryices. The most relevant variables in description of retention indices wereble bonds and number of rotatable bonds complemented with topologicalave been built for 130 fatty acid methyl esters for identification purposes.

wn retention indices for 37 fatty acid methyl esters has also been carried

© 2008 Elsevier B.V. All rights reserved.

especially those of branched and polyenic acids are highly similar[2]. Therefore, GC and GC–MS identification of FAMEs necessitates

the utilization of retention indices [3,4]. Nevertheless, experimen-tal values of these parameters are known only for a relatively smallnumber of most widespread compounds at present [4–16]. In spiteof the structural simplicity of FAMEs, no reliable additive schemesfor I evaluation have been proposed yet. I values for isomeric alka-nes depend on the position of branching within carbon skeletonof the molecule. However, the position of even methyl group inbranched FAMEs should be located by enumeration of carbon atomsstarting from both terminal sides of carbon chain. Thus, the firstdescriptor (k) presents the position of a methyl substituent rel-ative to methoxycarbonyl group, while the second one (n − k + 1)presents their location relative to another side of the chain. Both ofthese descriptors influence the I values.

The same problem exists for the descriptors locating the posi-tion of C C double bonds in the molecule. Besides that, if a fewsubstituents are present in the chain, it is necessary to reflect theirlocation in sterically hindered (k − 1, k) or non-hindered (k − m, k(m > 1)) positions. Similarly, the location of double bonds in (Z) or(E) configuration in conjugated (l − 2, l) or non-conjugated (l − m,l(m > 2)) positions should be fixed. As a result, the number ofdescriptors becomes rather large, that explains us the absence of

gr. A 1

O. Farkas et al. / J. Chromato

simple algorithms for I prediction for FAMEs. The selection of opti-mal structural descriptors (variables) remains an actual, pertinentproblem.

Sometimes, I values are available from a single source ofinformation, i.e. they cannot be verified by the comparison withindependently measured data.

Most I values published in the original papers are collected in theNIST database [17]. In this database all compounds with unknown Ivalues are characterized by theoretical evaluations with the use ofan additive scheme. However, these evaluations are not sufficientfor identification of isomers with small retention index differences,because their confidential intervals are ca. ±50 i.u. (confidence level50%) and ±200 (95%).

The quantitative structure–retention relationship (QSRR) estab-lishes relations between retention data (I) and molecular structure.The molecular structure is encoded by variables (descriptors). QSRRcan be decomposed into three main steps: (1) variable selection,(2) model building and (3) model validation [18]. Perhaps the firststep is the most interesting and difficult procedure because thereis no absolute way for the selection of relevant variables from alarge descriptor set. The same algorithm can lead to the selectionof different variables using different selection criteria and variousmodels can be obtained using the same criterion with differentalgorithms. As complex systems can be described with a great num-ber of independent variables, one of the main goals in QSRR is theextraction of relevant information, together with the exclusion ofredundant descriptors (filtering the noise). However, an exhaustivesearch for all possible solutions is not feasible as the number ofindependent variables increases. A lot of techniques are used forvariable selection and model building [18–29].

Our main aim is to build models for description and predictionof Kovats retention indices for a wide variety of fatty acid methylesters replacing the measured properties by easily calculated two-dimensional descriptors. Reliable chromatographic identificationof the analytes for this group is very important in the analyses ofnatural substances like lipids, glycerol esters, lecitins, etc. Besidesthis, the goal of our paper is to compare the efficiency of five variableselection methods (pair-wise correlation, Lasso methods, forwardstepwise, best subset selection, and partial least square projec-tion of latent structures). The comparison has been carried out intwo ways: (i) the same degrees of freedom were applied for allmodels and (ii) the methods were compared in their best perfor-mance, i.e. the models were compared with different degrees offreedom, using different number of descriptors. In this work we

compare the two approaches and make conclusions comparing thetwo.

2. Computation methodology

2.1. Retention data

Average I values from different sources of information havebeen calculated for 130 fatty acid methyl esters. Experimental data(I.Z., minor part) and all available data from literature (major part)were averaged. These data include both target I values on stan-dard non-polar polydimethyl siloxane stationary phases [5–11], and“raw” information, namely the net retention times [30,31], the rel-ative retention times [32], or the so-called ECL values (equivalentchain lengths), often used in chemistry of fatty acids [15,33]. Allthese “non-I” retention data have been recalculated into retentionindices. A more detailed list of references for every compound ispresented in the first release of the US National Institute of Stan-dards and Technology (NIST, Gaithersburg, MD, USA) I databaseincorporated in the NIST/EPA/NIH MS Library (June 2005) [17].

198–1199 (2008) 188–195 189

Experimental I measurements have been performed on aBiokhrom-1 gas chromatograph (Moscow, USSR) equipped withWCOT quartz column (25 m × 0.20 mm) with standard non-polarphase OV-101 in the temperature programming regime (initial tem-perature 40–60 ◦C, final temperature 220–240 ◦C or until the lastpeak has been eluted, ramp 4–6 K/min), carrier gas helium or nitro-gen, split ratio 1:30, followed by calculation of linear-logarithmicretention indices [34,35] using n-alkanes as reference compounds.The methyl esters were linear or branched, and many of them con-tain at least one double bond. The number of carbon atoms variesfrom 3 to 31. The compounds selected belong to a congener serieswith relatively little variability. The real source of the noise in thedata was that the various experimentalists who produced the Idatabase failed to use standardized conditions.

2.2. Variable selection methods

Lasso [24] is a form of Tikhonov regularization [19], as is theshrinkage method ridge regression [20], with subtle but importantdifferences. The estimation of the regression coefficients for

y = Xb (1)

in the case of Tikhonov regularization in general form is given byEq. (2):

b = (XTX + �LTL)−1

XTy (2)

where b (n × 1) is the regression coefficient vector, X (m × n) is thematrix of descriptors, � is a meta-parameter that must be opti-mized, L is a matrix of values, typically set to identity and y collectsthe quantitative information of prediction property from m samples[19]. Eq. (2) is the solution to

min(||y − Xb||22 + �||Lb||22) (3)

When L = I, Tikhonov regularization is said to be in standard formand is know as ridge regression. The ||b||22 symbolizes the square ofthe Euclidean vector norm and it is called 2-norm. The left and rightterms of Eq. (3) represent the bias and variance criteria. Thus, thefinal optimized model minimizes bias and variance concurrentlyand is termed “harmonious” cf. Ref. [28]. The right terminal Eq. (3)with L = I can also be expressed with the so-called 1-norm form:

||b||1 =n∑

i=1

|bi| (4)

This form of Tikhonov regularization is known as the Lassomethod where poor descriptors can be excluded meaning thesevariables are given zero or near-zero regression coefficients. Pair-wise correlation method (PCM) has been developed to choosebetween two correlated independent variables [25,27]. The distinc-tion between two variables can be made using an arrangementinto a 2 × 2 contingency table. PCM can easily be generalizedfor selection and ranking of more than two variables (GPCM).The comparison of factors can be made pair-wise in all possiblecombinations. Successive completion of GPCM is recommendedpreserving maximal diversity of the already ranked descriptorswhile using the same data set. GPCM can be applied also in caseof not normal distributions, but the selected descriptors are notnecessarily the best ones from the point of view of prediction.

Best subset selection (BSS), also called all possible regression, isable to select the best combination of descriptors from the group ofindependent variables, as all combinations are examined [22]. Thecalculation time increases exponentially with the increasing num-ber of descriptors (combinatorial explosion) and the found ‘best’variable combinations are frequently random; they are valid onlyfor the given training set and cannot be generalized.

gr. A 1

190 O. Farkas et al. / J. Chromato

Stepwise regression methods such as forward selection (FS)have been developed to provide good models with much less cal-culations as compared to all possible regressions [36]. Forwardselection can be applied for large descriptor sets, as well, but manyvariable combinations can be overlooked using this method.

Partial least square projection of latent structures (PLS) has beenprimarily developed for prediction [21]. PLS and ridge regressiontends to behave similarly, but PLS shrinks in discrete steps whilethe � parameters for Lasso and ridge vary over a continuous range.

The details of variable selection and model building can be seenin Section 2.5.

2.3. Descriptor calculation

Zero-, one- and two-dimensional descriptors were calculatedusing the Dragon program package [36]. The program omitted oneof the two (or more) variables automatically that showed a cor-relation higher than 0.99. The reduced number of descriptors was152. These descriptors served as starting pool for the comparisonof variable selection methods.

2.4. Data set division and model validation

To avoid selection bias [37], the retention index data set hasbeen divided into three parts [22,37]. Thirty-four percent of thecompounds have been chosen for variable selection (training) and33% as a parameter estimation (calibration) set. Further thirty-threepercent of methyl esters have been taken as external validation set(test set). These sets allow a relatively good mapping of the datastructure, i.e. methyl esters from the entire retention index rangehave been properly represented in all sets.

The descriptive ability of the models has been measured by R2train

and R2calibr (coefficient of determination for training and calibra-

tion set) as well as RMSEtrain and RMSEcalibr (root mean squareerror for training and calibration set). These statistical parametersdo not represent an unbiased estimation of the prediction. How-ever, the independent test set is suitable for method comparison.The predictive ability has been measured by the R2

test (coefficient ofdetermination for the external validation) values. F-value, and rootmean square error for validation (RMSEtest) have also been calcu-lated. Further examination of model applicability was undertakenby reviewing the residual plots and the plots of the experi-mental versus calculated retention data for the entire retention

index set.

2.5. Variable selection and model building

Five different techniques reduced the number of descriptors.Statistical calculations were performed by Statistica, Unscramblerand Matlab software packages [38–40], respectively.

The five different variable selection methods to be comparedwere the following: GPCM, Lasso, FS, modified BSS (mBSS), and PLS.The models were built in two different ways. At first, MLR methodhas been carried out to describe the linear relation between theretention index and independent variables (descriptors) and thenumber of descriptors has been fixed to five. The significance levelwas set at 5% in case of FS and an F-test was used as selection cri-terion. Secondly, optimal number of descriptors has been used ineach case of models.

It may cause problems to define what an optimal model isin case of GPCM and mBSS. As PCM can only be used for vari-able selection, MLR method has been used for model building; insuch a way we ensure the comparability of the methods. Sevendifferent MLR models have been built; the first one contained

198–1199 (2008) 188–195

the best descriptor ranked by PCM, the second one containedthe first and second best independent variables, etc. The calibra-tion set was used to select the optimal model: RMSEcalibr valuesof these models have been compared and the smallest one wasselected.

To avoid combinatorial explosion the number of the indepen-dent variables in a subset has been maximized to three in caseof BSS. After the running procedure, BSS was repeated withoutthe three best variables and this process has been repeated seventimes successively. Then, BSS has been performed on the best21 variables complemented with the most frequently occurringdescriptors again: the five best have been selected. In this way, thecalculation time of BSS was abridged considerably (mBSS). The opti-mal descriptor number has been defined in a similar way to that ofGPCM using the calibration set.

Lasso has been performed on autoscaled X and y data. The targetoptimization approach described in reference [41] was used here.This method optimizes the regression coefficients (model parame-ters) in order to minimize the Euclidean distance to user set targetvalues for bias and variance indicators: the root mean square errorfor the training set (RMSEtrain) and ||b||1, respectively. In the opti-mization process, RMSEtrain and ||b||1 values were normalized to thefull rank (factor) PLS model. The optimization criteria RMSEtrain andthe 1-norm are normalized to prevent optimizations that favor oneof the two criteria due to differences in criteria ranges. Without thisscaling to the full rank PLS model (or some other nearly full rankmodel such as PCR or RR), the implemented Lasso algorithm canyield exclusively underfit models.

The optimization program identifies numerous Lasso modelsminimizing the distance to user set RMSEtrain and ||b||1 target val-ues. By plotting 1-norm values against the RMSEtrain values, anapproximate L-shaped curve is obtained (L-curve) [42,43]. The har-monious model is at or near the corner of the L-curve. Descriptorswere ranked by their Lasso regression coefficients for the finalselected Lasso model based on the training set. Variable selec-tion can be performed by variables with regression coefficientmagnitudes less than a user defined cut-off value for near-zerocoefficients (i.e. 10−4).

PLS optimum performance was established using the calibra-tion set similarly to the way of finding the optimum performancefor GPCM and mBSS. First, the uncertainties of the regression coef-ficients were determined on the training set using the Martens’jackknife method [44] as described in the Unscrambler manual.Then, 66 significant descriptors were selected on the train set. Using

the calibration set 7 PLS components provided the optimal perfor-mance. Its predictive ability was checked using the independent(external) validation set (test set). The best (most significant) fivedescriptors (from 66) have been used in case of the five-descriptorPLS model.

3. Results and discussion

3.1. Comparison of the selected independent variables in thefive-descriptor models

The statistical parameters of the regression models for the train-ing, calibration and test sets can be seen in Table 1. (The notations,list of descriptors and abbreviations [45] used can be found inTable 2.)

Many descriptors have been found as significant by the GPCMmethod; 115 variables were selected in case of simple orderingmethod and 56 descriptors were relevant using the differenceordering and the probability-weighted difference ordering method.The best five descriptors were the following, according to each

O. Farkas et al. / J. Chromatogr. A 1198–1199 (2008) 188–195 191

Table 1Statistical parameters of MLR models with five descriptorsa

MSEtr

.u.

5.60

Method forvariableselection

Descriptors R2train

Ftrain Ri

GPCM 5 Whete (0.7322), Eig1Z (0.0356), TIC0 0.99837 4,774 2

(0.1370), MW (0.8704), ATS1v (0.2061)

Lasso 5 RBN (0.0019), MPC10 (0.4522), Eig1Z(0.8622), MW (0.0000), PHI (0.0516)

0.99891 7,176 20.88

FS 5 HNar (0.2760), RBF (0.3054), nDB(0.0112), RBN (0.0000), QIndex(0.0000)

0.99871 6,018 22.80

mBSS 5 RBN (0.0000), piPC03 (0.0000), PCD(0.9660), GATS3v (0.0011), Qindex(0.1109)

0.99933 11,692 16.36

PLS 5 MW (0.0000), Eig1Z (0.3298), TIC3(0.0677), TIC4 (0.0308), TIC 5 (0.3293)

0.99762 3,270 30.92

See Table 2 for the abbreviations.a R2

train, R2

calibr: coefficient of determination for the training and the calibration set, res

calibration and validation set, respectively. RMSEtrain =

√∑m

i=1(yitrain

−yitrain)2

m−n ; RMSEcalibr =predicted and actual values for m training samples and p descriptors. Ftrain, Fcalibr, Ftest: Fish

index for external validation; ||b||2: 2-norm or Euclidean vector norm. The limit values fcalibration set are indicated by bold.)

GPCM selection criterion methods: Wiener type index fromelectronegativity-weighted distance matrix, leading eigenvaluefrom Z-weighted distance matrix, total information content index,molecular mass and Broto-Moreau autocorrelation index.

Table 2Name and abbreviation of descriptors [45]

Abbreviation Descriptor name

ATS1v Broto-Moreau autocorrelation of a topolATS3m Broto-Moreau autocorrelation of a topolBEHe5 Highest eigenvalue number 5 of BurdenBEHm1 Highest eigenvalue number 3 of BurdenBEHm8 Highest eigenvalue number 8 of BurdenBELm1 Lowest eigenvalue number 1 of BurdenBELm5 Lowest eigenvalue number 5 of BurdenBELp6 Lowest eigenvalue number 6 of BurdenCIC1 Complementary information content (nEig1Z Leading eigenvalue from Z-weighted disGATS3v Geary autocorrelation − lag 3/weightedGATS6v Geary autocorrelation − lag 6/weightedGGI1 Topological charge index of order 1HNar Narumi harmonic topological indexIVDM Mean information content vertex degreeIDDE Mean information content on the distanJGI2 Mean topological charge index of orderLop Lopping centric indexMPC10 Molecular path count of order 10MW Molecular massMWC06 Molecular walk count of order 6nDB Number of double bondsPCD Difference of multiple path counts to paPHI Kier flexibility indexpiPC03 Multiple path count of order 3piPC04 Multiple path count of order 4piPC07 Multiple path count of order 7piPC10 Multiple path count of order 10QIndex Quadratic indexRBF Rotatable bond fractionRBN Number of rotatable bondsSCBO Sum of conventional bond orders (H-depTIC0 Total information content indexTIC3 Total information content index neighboTIC4 Total information content index neighboTIC5 Total information content index neighboVEA1 Eigenvector coefficient sum from adjaceVEA2 Average eigenvector coefficient sum fromWhete Wiener type index from electronegativitXt Total structure-connectivity index

ain R2calibr

Fcalibr RMSEcalib

i.u.R2

test Ftest RMSEtest

i.u.||b||2i.u.

0.99781 3466 28.18 0.99773 37,149 28.74 3590

0.99803 3844 24.29 0.99505 18,710 42.44 49

0.99810 6018 26.26 0.99792 45,365 25.58 880

0.99847 4976 23.53 0.97381 4,368 90.72 464

0.99764 3213 29.27 0.97511 4,883 87.43 15

pectively; RMSEtrain, RMSEcalibr, RMSEtest: root mean square error for the training,√∑m

i=1(yicalibr

−yicalibr)2

m−n ; RMSEtest =

√∑m

i=1(yitest −yitest )2

m , where yi and yi denote the

er ratio for the training, calibration and validation set, respectively. R2test: correlation

or significance (p-values) can be found in brackets. (Significant descriptors in the

The optimal Lasso model is in the corner region of the L-curve(||b||1 plotted against RMSEtrain). For this model, 24 variables areselected as significant when a cut-off of 10−4 was used to detectnear-zero coefficients. The best five descriptors were as follows:

ogical structure lag 1/weighted by atomic van der Waals volumesogical structure lag 3/weighted by atomic massesmatrix/weighted by atomic Sanderson electronegativitiesmatrix/weighted by atomic massesmatrix/weighted by atomic masses

matrix/weighted by atomic massesmatrix/weighted by atomic massesmatrix/weighted by atomic polarizabilitieseighborhood symmetry of order 1)tance matrixby atomic van der Waals valuesby atomic van der Waals values

magnitudece degree equality2

th counts

leted)

rhood symmetry of order 3rhood symmetry of order 4rhood symmetry of order 4ncy matrix

adjacency matrixy-weighted distance matrix

gr. A 1198–1199 (2008) 188–195

192 O. Farkas et al. / J. Chromato

leading eigenvalue from Z-weighted distance matrix, molecularmass, number of rotatable bonds, Kier flexibility index and molec-ular path count of order 10.

Five independent variables were kept in the model by the for-ward selection algorithm (using a predefined 5% significance level).These descriptors were: number of double bonds, number of rotat-able bonds, Narumi harmonical topological index, quadratic index(a topological descriptor) and rotatable bond fraction.

Partial least squares selected molecular mass, leading eigen-value from Z-weighted distance matrix and total informationcontent index neighborhood symmetry of order 3, 4 and 5.

Modified best subset selection found number of rotatable bonds,multiple path count of order 3, difference of multiple path countsto path counts, quadratic index and Geary autocorrelation lag 3 assignificant.

Representation of structure of fatty acid methyl esters can beachieved by topological indices complemented with molecularmass and other zero-dimensional descriptors. Retention in gaschromatography is a very complex process that involves the inter-action of several intermolecular forces. On the non-polar stationaryphases the non-specific dispersion forces determine differencesamong the compounds. Furthermore, the methoxycarbonyl frag-ment is constant in each solute so we cannot expect the leadingrole of polarity and polarity related parameters. In that case thechromatographic behavior of molecules could be explained bytopological indices and bulkiness parameters. The length of thefatty acid chains and the branches in the structure can be describedproperly with these descriptors. The number of double bonds candraw a distinction between compounds with the same chain length.Flexibility and number of rotatable bonds can also discriminatecompounds with or without double bonds and branches. It is veryinteresting, that the FS model does not contain the molecular mass.Although this model does not give the best description (see R2, F andRMSE values on the training and calibration sets), it provides thebest prediction for the retention indices on the independent test set.

Some selected descriptors were not significant in the modelsfitted on the calibration set. The significant independent variableswere marked with bold font in Table 1.

3.2. Models with the five best variables

R2train, Ftrain and R2

calibr, Fcalibr values of the five-variable models(assigned to GPCM 5, Lasso 5, FS 5, PLS 5 and mBSS 5) are reason-ably high, which verifies a good descriptive ability. The differences

between the R2 values are small, yet significant. The different struc-ture encoding possibilities were more or less equivalent. In fact, thedescriptors are highly correlated. RMSEtrain and RMSEcalibr valuesare the smallest (best) in case of the mBSS 5 model and the largest(worst) in case of PLS 5 model. The R2

test values are nearly equal forFS 5 and GPCM 5 models similarly to their RMSEtest values. R2

test andRMSEtest values of Lasso 5 differ significantly from the other two.Model GPCM 5 and FS 5 have the best RMSEtest values. The highRMSEtest values do not prove good predictive power for the mBSS5, Lasso 5 and PLS 5 models. Contrary to the perceptions, mBSS wasnot effective for prediction in this modified form; furthermore, thecalculations were rather slow (several hours on a standard 1.6 GHzcomputer). It is fairly uncommon that FS yields the best model. (FSis not very effective in case of many variables.) Good and reliableQSRR models can be built only if more variable selection methodsare used.

PLS is a widely used method for prediction but our previous[29,46] and present examples show that when uncertainties ofregression coefficients are used for variable subset selection (fol-lowed by MLR to build the final model, Table 1) not very usefulresults are obtained.

Fig. 1. Model FS 5: External validation. Plot of experimental versus predictedretention indices; I: Kovats retention index, ♦: calibration set of methyl-esters,�: prediction set of methyl-esters. (Averaged literature and experimental data areindicated by ‘–’.)

The expected prediction uncertainty inflation is given in Table 1by ||b||2. That is, when the regression vector 2-norm is large, itacts as an inflation factor and greater prediction uncertainties areexpected compared to a model with a smaller 2-norm. The 2-normand RMSEtest columns do not allow drawing unambiguous conclu-sions, but clearly show the complexity of the problem. RMSEtest issmall while ||b||2 is large in the case of FS5 and GPCM5. The Lassoand PLS selected variables generate much smaller ||b||2 and hence,predictions with smaller variance and greater bias are expected.Additionally, the poorer prediction performance of Lasso and PLS isa consequence of not optimal design: five descriptors are far frombeing optimal or harmonious for these methods. Instead, the topfive variables were used to form MLR models. Section 3.3 describesresults with optimal PLS and Lasso models.

The plot of the calculated versus experimental retention indicesof the FS 5 model can be seen in Fig. 1. Fig. 2 shows the observedretention index values versus residual values of the GPCM 5 model.The shape of the residual plot is adequate but some outliers exist(according to ±2� criterion, ∼95% confidence interval, i.e. 80 oreven 100 i.u. deviation is not acceptable). It is interesting, that outly-ing compounds are nearly the same in case of GPCM 5, Lasso 5, mBSS

5 and PLS 5. The outliers were methyl esters of 2-nonenoic, 9,11-octadecadienoic (trans) and henicosanoic acids (only 2-nonenoicacid is not an outlier in case of model mBSS 5). The structure ofthese compounds does not differ from the other methyl esters, e.g.there are no special branches in the compounds. We presume thatthe measured I values are not appropriate and need additional ver-ification. This proposition should be taken into account for any setsof experimental data, especially for the single-measured values.

3.3. Optimal models

The results of the optimal models can be found in Table 3.When using FS the determination of the optimal number of

descriptors was obvious. This stepwise variable selection methodselected five descriptors on 5% error level; therefore the optimalmodel is the same as the FS 5 model in Table 1.

GPCM model with four descriptors has been chosen as optimalusing the minimum of the RMSEcalibr. No significant differenceshave been found between the statistical parameters of GPCM opti-mal and GPCM 5 neither in the training nor in the calibration or inthe validation set. The advantage of GPCM opt model is that every

O. Farkas et al. / J. Chromatogr. A 1

overoptimistic.

Fig. 2. Model GPCM 5 Plot of residuals versus, predicted retention indices. I: Kovatsretention index. Three outliers can be distinguished.

descriptor is significant in the calibration set whereas only Eig1zis significant in the GPCM 5 model. Moreover, the 2-norm value ofthis model shows that prediction with relatively small variance ispossible.

There are some difficulties to define the optimal number of inde-

pendent variables in case of mBSS because the higher the numberof descriptors involved in the model is, the better is the descrip-tion on the training set. Therefore, the calibration set was also usedto find the optimum value. The model containing seven variablescan be chosen empirically as optimal, however, it is likely that ifthe number of descriptors was further increased an even smallerRMSEtest would be resulted. According to the principle of parsimony(also known as Occam’s razor) the simplest one of two (or more)competing models is to be preferred.

Using 66 variables selected by the Martens’ uncertainty test [44]RMSEtrain, RMSEcalibr and RMSEtest values were the best in case ofthe optimal PLS model among all models including optimal andfive-descriptor models. No wonder, PLS model utilizes the largestnumber of descriptors (66). On the expenses of using more dataand more calculation time better predictions were achieved. Theinterpretation of PLS model is difficult if not impossible with these66 variables.

Using 10−4 as cut-off value Lasso optimal model resulted thesecond best RMSEtest value after PLS opt model. However, the 2-norm value of this model is considerably higher than the 2-norm of

Table 3Statistical parameters of optimal models (for notation see Table 1)

Method forvariableselection

Descriptors R2train

Ftrain RMSEtrain i.u.

GPCM opt Whete (0.0145), Eig1Z (0.0010),TIC0 (0.0412), MW (0.0004)

0.99834 6,003 25.52

Lasso opt See below Table 3a 0.99840 26,753 24.18FS opt HNar (0.2760), RBF (0.3054), nDB

(0.0112), RBN (0.0000), QIndex(0.0000)

0.99871 6,018 22.80

mBSS opt Whete (0.0000), TIC0 (0.0002),Eig1z (0.0000), piPC03 (0.0003),piID (0.0187), PCD (0.0192),GATS3v (0.0011)

0.99979 12,345 13.46

PLS opt 7 pls components 0.99939 707,000 14.88

a Descriptors in Lasso opt model: pilD, Eig1, RBN, MW, SCBO, nCq, PCD, PHI, Whete, piPCHNar, Me, TIC.

198–1199 (2008) 188–195 193

the PLS optimal model so prediction with higher variance is possi-ble. Because of the difficulty of the interpretation of 24 descriptorsit is practical to use a simpler model such as GPCM opt. Or else,using a similarly complicated model such as PLS opt results betterstatistical parameters.

3.4. Internal validation

We are well aware that indicators of internal validation areloosely connected to the prediction performance, if at all [47]. Nev-ertheless, internal validation can be useful to compare the resultswith that of external validation. The same splits were used as ear-lier: 1/3 of the data were predicted from the remaining 2/3, threetimes. This method is also called 3-fold cross validation. The resultsare summarized in Table 4.

Always fixed (optimal) models of Table 3 were used. The only dif-ference concerns for PLS opt (last two lines), where all descriptorswere applied not just the 66 selected ones by Martens’ uncertaintytest. (There is no need to repeat the same results for PLS opt.)

The selection of training (Tr), calibration (Ca) and test (Te) setswere random: the RMSE values vary according to the models:Ca > Te > Tr (GPCM), Te > Ca > Tr (FS), Te ≈ Ca > Tr (mBSS), Tr > Te > Ca(PLS opt for global minimum). The comparison of RMSEtest ofTable 3 and RMSEave of Table 4 clearly shows that the pre-diction performance is (slightly) worse for external validation,as expected. It is well-known that internal cross validation is

Optimal performance for PLS can be achieved only after variableselection and optimization of number of PLS components. Including15 or 18 PLS components in the model realizes a definite overfit.Surprisingly, mBSS provides the best models for two of the threeof internal validations with Lasso being better in the other one.However, the RMSEave value for mBSS and Lasso are nearly the same.Considering that Lasso utilizes 24 variables and mBSS only seven,mBSS may be preferred for easier interpretation.

3.5. Calculated retention indices

Supplementary Table S1 shows measured and calculated reten-tion indices of the fatty acid methyl esters with three modelingmethods (FS 5, GPCM opt and PLS opt). We suggest that more mod-els should be used for the same purpose (e.g. to predict I): it is called“consensus modeling” or “ensemble averaging”. It enhances thereliability of the modeling procedure greatly. cis–trans Isomerism(E, Z configuration) is not reflected in GPCM opt model.

Consideration of data presented in Supplementary Table S1confirms that in most of the cases three modeling methods (FS5,

R2calibr

Fcalibr RMSEcalibr i.u. R2test Ftest RMSEtest i.u. ||b||2 i.u.

0.99772 4,260 28.42 0.99764 36,181 27.04 171

0.99868 31,820 20.81 0.99831 31,151 22.78 3840.99810 6,018 26.26 0.99792 45,365 25.58 880

0.99846 3,344 24.26 0.99786 47,235 25.94 407

0.99885 363,000 19.48 0.99863 31,180 20.54 14.2

04, MPC010, piPC03, MWC06, BAC, piPC07, piPC10, CIC1, Lop, TIC0, BEHm8, BELm5,

gr. A 1

t pred

F

17,922,920,825,113,424,9

6 fortion a

194 O. Farkas et al. / J. Chromato

Table 4Internal validation of optimal models

Method forvariable selection

Training set predicted Calibration se

R2 F RMSE i.u. R2

GPCM opt 0.99823 24,283 25.38 0.99767Lasso opt 0.99899 42,523 19.18 0.99818FS opt 0.99849 28,354 23.49 0.99799mBSS opt 0.99918 52,240 17.31 0.99833PLS opta 0.99730 15,861 31.38 0.99688PLS optb 0.99802 21,700 26.84 0.99832

a First minimum (Unscrambler convention). Number of PLS components are 5, 5,b Global minimum. Number of PLS components are 18, 15, 15 for training, calibra

GPCMopt and PLSopt) provide the appropriate coincidence of pre-dicted I values with experimental data. For example, predicteddata for methyl ester of 6,8-dimethyl decanoic acid are 1448, 1477,and 1448 i.u. (average value is 1458 ± 17), while experimental I is1455 ± 10 i.u., that is good correspondence of averages with over-lapped standard deviations. Such a way, an important possibility toverify the identification for FAMEs has been provided besides theuse of mass spectra. The number of mass spectra still exceeds that ofexperimental retention indices [10] (NIST MS Database). The sameregularity can be observed for methyl esters of non-conjugatedpolyunsaturated fatty acids, e.g. methyl homo-�-linolenate (pre-dicted I values are 2254, 2255, 2250, average value 2253 ± 3,experimental I is 2251 ± 3).

Supplementary Table S2 contains 37 fatty acid methyl esters,for which Kovats retention indices are not known. Prediction ofretention indices of these compounds has also been performed withthe FS 5, GPCM opt and PLS opt models.

At the same time, the restriction of the approach in rela-tion to methyl esters of conjugated polyunsaturated carboxylicacids should be mentioned. Predicted Is are significantly less thanexperimental I values for all of these types of esters, e.g. methyl(E)-9,11-octadecadienoate (predicted Is 2077, 2072, 2112, experi-mental 2171 ± 11), etc. The set of chosen variables does not includedescriptors reflecting the influence of conjugated double bonds onchromatographic retention.

In accordance with this conclusion, we can further conclude thatpredicted Is of esters of conjugated acids, namely �-parinaric, twogalendic, and two eleostearic acids in Supplementary Table S2 areless than they should be. On the other hand, all examples whenthere is no coincidence between predicted and experimental I data

for non-conjugated FAMEs should be interpreted that the referencevalues should be corrected in contemporary databases. Such exam-ples are methyl 7,10,13-nonadecatrienoate (predicted Is 2163, 2154,2155, average value 2157 ± 5, experimental = 2134 ± 4) and methyl8,11,14,17-icosatetraenoate (predicted Is 2244, 2245, 2238, averagevalue 2242 ± 3, experimental I = 2197 ± 18). Unreliability of exper-imental I value in the last case is confirmed by its large standarddeviations (± 18 i.u.) exceeding I standard deviations for all FAMEspresented in Supplementary Tables S1 and S2.

Both tables are suitable for identification purposes with theabove restrictions. Optimal models (their parameters, descriptors,etc.) are available from the authors upon request.

4. Conclusions

Useful predictive models can be built using two-dimensionaldescriptors. There is no need for geometry optimization usingthese descriptors and a rapid model building procedure ispossible.

The method comparison was carried out in two different ways.PLS is superior to other techniques in its optimal performance.

198–1199 (2008) 188–195

icted Test set predicted

RMSE i.u. R2 F RMSE i.u. RMSEave i.u.

85 27.66 0.99781 19,168 26.83 26.6295 24.47 0.99865 30,968 21.11 21.5966 25.69 0.99790 19,944 26.30 25.1658 23.40 0.99833 25,063 23.47 21.3938 31.99 0.99744 16,333 29.06 30.8148 23.49 0.99788 19,763 26.42 25.59

training, calibration and test sets, respectively.nd test sets, respectively. RMSEave = average of training, calibration and test RMSEs.

However, the RMSE on the test set differs only slightly for the vari-ous methods in the optimum performance.

Applying the principle of parsimony with respect to (keepingthe number of descriptors minimal), Lasso, and especially the PLS,was not able to produce good predictive models for Kovats reten-tion indices when selected variables were fixed to five and MLR isused to solve y = Xb. With new methods of effective rank computa-tions [48], it should be possible to include and compare final modeldegrees of freedom for parsimony considerations in model compar-isons regardless of the descriptors used or the modeling method.This aspect is currently being studied.

The measured I values of some compounds do not seem to beappropriate. These I values have been detected as outliers in caseof more models. Models without these outliers should have signif-icantly better descriptive and predictive abilities.

Prediction of retention indices has also been carried out for 130fatty acid methyl esters with known experimental data and for 37fatty acid methyl esters for which no measured Kovats indices areavailable.

Acknowledgements

This material is based upon the work supported by the Hun-garian Research Foundation (OTKA T 037684) and the NationalScience Foundation under Grant No. CHE-0400034 and is gratefullyacknowledged.

Appendix A. Supplementary data

Supplementary data associated with this article can be found,in the online version, at doi:10.1016/j.chroma.2008.05.019.

References

[1] Cyberlipid Center 22 March 2007, http://www.cyberlipid.org/search/search0.htm.[2] W.W. Christie, Gas Chromatography–Mass Spectrometry and Fatty Acids, Gas

Chromatography and Lipids, Oily Press, Ayr, 1989, p. 161.[3] G. Dobson, W.W. Christie, Eur. J. Lipid Sci. Technol. 104 (2002) 36.[4] S.A. Mjøs, Eur. J. Lipid Sci. Technol. 106 (2004) 550.[5] K. Ubik, K. Stransky, M. Streibl, Collect. Czech. Chem. Commun. 40 (1974) 2826.[6] R.V. Golovnya, V.P. Uralets, T.E. Kuzmenko, J. Chromatogr. 121 (1976) 118.[7] R.V. Golovnya, T.E. Kuzmenko, Chromatographia 10 (1977) 545.[8] M. Spiteller, G. Spiteller, J. Chromatogr. 164 (1979) 253.[9] D.J. Stern, R.A. Flath, T.R. Mon, R. Teranishi, R.E. Lundin, M.E. Benson, J. Agric.

Food Chem. 33 (1985) 180.[10] K. Kittiratanapiboon, N. Jeyashoke, K. Krisnangkura, J. Chromatogr. Sci. 36

(1998) 361.[11] T. Hanai, C. Hong, J. High Resolut. Chromatogr. 12 (1989) 327.[12] J.A. Barve, F.D. Gunstone, F.R. Jacobsberg, P. Winlow, Chem. Phys. Lipids 8 (1972)

117.[13] R.G. Ackman, S.N. Hooper, J. Chromatogr. 86 (1973) 73.[14] R.G. Ackman, A. Manzer, J. Joseph, Chromatographia 7 (1974) 107.[15] S.A. Mjøs., J. Chromatogr. A 1122 (2006) 249.[16] S.A. Mjøs., O. Grahl-Nielsen, J. Chromatogr. A 1110 (2006) 171.[17] NIST Standard Reference Database Number 69, US National Institute for Sci-

ence and Technology (NIST) MS Data Center, Gaithersburg, MD, June 2005,http://www.webbook.nist.gov.

[[

[

[

[[[

[[[[

[

O. Farkas et al. / J. Chromatogr. A 1

[18] N.R. Draper, H. Smith, Applied Regression Analysis, Wiley, New York, 1981, p.412.

[19] A.N. Tikhonov, Sov. Math. Dokl. 4 (1963) 1035.20] A.E. Hoerl, R.W. Kennard, Technometrics 12 (1970), p. 55, p. 69.21] N.R. Draper, H. Smith, Applied Regression Analysis, Wiley, New York, 1981,

p. 307.22] A.J. Miller, Subset Selection in Regression, Chapman and Hall, London, 1990,

p. 43.23] S. Wold, in: H. Waterbeemd (Ed.), Chemometric Methods for Molecular Design,

VCH, Weinheim, 1995, p. 195.24] R. Tibshirani, J. R. Stat. Soc. B 58 (1996) 267.25] R. Rajko, K. Heberger, Chemom. Intell. Lab. Syst. 57 (2001) 1.26] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learn-

ing. Data Mining, Interference and Prediction, Springer, New York, 2001,p. 57.

27] K. Heberger, R. Rajko, J. Chemom. 16 (2002) 436.28] J.B. Forrester, J.H. Kalivas, J. Chemom. 18 (2004) 372.29] O. Farkas, K. Heberger, J. Chem. Inf. Model. 45 (2005) 339.30] B.Y. Li, Y.Z. Liang, X.P. Du, C.J. Xu, X.N. Li, Y.O. Song, H. Cui, Chromatographia 57

(2003) 235.31] J.J. Jimenez, J.L. Bernal, S. Aumente, L. Toribio, J. Bernal, J. Chromatogr. A 1007

(2003) 101.

[[[[[

[

[[[

[[[[[

[[[

198–1199 (2008) 188–195 195

32] J. Vine, J. Chromatogr. 196 (1980) 415.33] F.T. Gillan, J. Chromatogr. Sci. 21 (1983) 293.34] I.G. Zenkevich, Zh. Anal. Khim. 39 (1984) 1297.35] I.G. Zenkevich, B.W. Ioffe, J. Chromatogr. 439 (1988) 185.36] R. Todeschini, V. Consonni, M. Pavan, Dragon Software Version 2.1, Talete, Milan,

2002.37] A.J. Miller, Subset Selection in Regression, Chapman and Hall, London, 1990, p.

12.38] Statistica 6 Software Package, StatSoft, Tulsa, OK, 2002.39] Unscrambler 9.2, Camo Process, Oslo, 2005.40] Matlab 7.0 with the Matlab Optimization Toolbox, The MathWorks, Natick, MA,

2003.41] J.H. Kalivas, R.L. Green, Appl. Spectrosc. 55 (2001) 1645.42] P.C. Hansen, SIAM Rev. 34 (1992) 561.43] P.C. Hansen, D.P. O’Leary, SIAM J. Sci. Comput. 14 (1993) 1487.44] H. Martens, J.P. Nielsen, S.B. Engelsen, Anal. Chem. 75 (2003) 394.45] R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, Wiley–VCH,

Weinheim, 2000.46] O. Farkas, K. Heberger, I.G. Zenkevich, Chemom. Intell. Lab. Syst. 72 (2004)173.47] A. Tropscha, P. Gramatica, V.K. Gombar, QSAR Comb. Sci. 22 (2003) 69.48] H.A. Seipel, J.H. Kalivas, J. Chemom. 18 (2004) 306;

(erratum) H.A. Seipel, J.H. Kalivas, J. Chemom. 19 (2005) 64.