
ORIGINAL PAPER

Application of support vector machines to 1H NMR data of fish oils: methodology for the confirmation of wild and farmed salmon and their origins

Saeed Masoum & Christophe Malabat & Mehdi Jalali-Heravi & Claude Guillou & Serge Rezzi & Douglas Neil Rutledge

Received: 19 June 2006 / Revised: 23 October 2006 / Accepted: 15 November 2006 / Published online: 3 January 2007
© Springer-Verlag 2007

Abstract Support vector machines (SVMs) were used as a novel learning machine in the authentication of the origin of salmon. SVMs have the advantage of relying on a well-developed theory and have already proved to be successful in a number of practical applications. This paper provides a new and effective method for the discrimination between wild and farmed salmon and eliminates the possibility of fraud through misrepresentation of the country of origin of salmon. The method requires a very simple sample preparation of the fish oils extracted from the white muscle of salmon samples. 1H NMR spectroscopic analysis provides data that are very informative for analysing the fatty acid constituents of the fish oils. The SVM was able to distinguish correctly between the wild and farmed salmon; however, ca. 5% of the countries of origin were misclassified.

Keywords Support vector machines (SVMs) · Salmon · Authenticity · NMR · Correlation optimized warping (COW)

Introduction

Statistical learning theory has motivated the introduction of support vector machines (SVMs) as a new class of classification algorithms [1–4]. SVMs implement classifiers of an adjustable complexity, controlling the latter for optimal generalization ability, i.e. the performance for future unknown samples. Among the applications for which SVMs have been used so far are recognition of handwritten digits [5], text categorization [6], gene expression analysis [7] and others [8]. Although the current range of applications of the method lies more in the digit recognition and data-mining fields, the SVM has recently become a well-known tool in chemometrics [9–11].

Nuclear magnetic resonance (NMR) spectroscopy is a non-destructive analytical technique, which enables both qualitative (e.g. structural determination) and quantitative analysis. 1H NMR spectroscopy can be used to analyse lipid mixtures such as fish oil, providing information on lipid classes, level of unsaturation, molar fractions of specific fatty acids (e.g. ω-3 fatty acids), and of several minor compounds. Such an analytical approach requires a very simple sample preparation of the fish oil (dilution in deuterated solvent) and the acquisition of quantitative 1H NMR spectra can be achieved in a relatively short time of the order of a few minutes.

Anal Bioanal Chem (2007) 387:1499–1510
DOI 10.1007/s00216-006-1025-x

S. Masoum · M. Jalali-Heravi
Department of Chemistry, Sharif University of Technology, P.O. Box 11365-9516, Tehran, Iran

S. Masoum
e-mail: [email protected]

M. Jalali-Heravi
e-mail: [email protected]

C. Malabat · D. N. Rutledge (*)
Laboratoire de Chimie Analytique, UMR 214 INRA/INA P-G, 16, rue Claude Bernard, 75005 Paris, France
e-mail: [email protected]

C. Guillou · S. Rezzi
European Commission, Joint Research Centre, Institute for Health and Consumer Protection, Physical and Chemical Exposure Unit, BEVABS T.P. 740, 21020 Ispra (VA), Italy

Present Address:
S. Rezzi
BioAnalytical Science, Metabonomics and Biomarkers, Nestlé Research Center, P.O. Box 44, 1000 Lausanne 26, Switzerland

Within the last decade, owing to improvements in aquaculture, the consumption of salmon has increased more than 100-fold in the European markets. World farmed salmon output increased from 7,000 to 700,000 tons between 1980 and 1997 and continues to increase. Salmon, especially farmed salmon, are a good source of healthy n-3 fatty acids, but they also contain high concentrations of organochlorine compounds such as polychlorinated biphenyls (PCBs), dioxins and chlorinated pesticides. The presence of these contaminants may reduce the net health benefits derived from the consumption of farmed salmon, despite the high level of n-3 fatty acids in these fish compared to wild salmon [12]. A distinction in terms of quality and, notably, price between the wild-fished and farmed products leaves open a real possibility of fraud [13]. Wild salmon, however, is still perceived by many to be superior eating compared to farmed salmon, and because of its much more restricted availability compared to the farmed fish, wild salmon typically commands a price 2 to 3 times that of the farmed equivalent. With such a price difference, there is a temptation to mislabel farmed fish as “wild”. This is particularly acute in years when the harvest is poor due to the environmental factors that lead to year-to-year fluctuations [14].

In addition to this misrepresentation there is also the possibility of making illicit gains by the misrepresentation of the country of origin of salmon. In some countries, salmon from some areas are seen as more desirable than from others; the main European salmon-producing areas are Scotland, Ireland and Norway. The price of fish from an area preferred by consumers of a particular country is higher, and again there is scope for fraud. A further factor is that when a country produces a large quantity of any product, the price tends to fall, and anti-dumping duties are levied over a certain amount of the product exported or sold. Producers of large amounts of farmed fish can be tempted to mislabel their fish, to hide its origin and avoid anti-dumping duties [15].

The potential to distinguish juvenile wild from cultured fishes and to discriminate among juvenile fishes by species based on fatty acid composition was demonstrated by Tritt et al. [16]. Statistical approaches to data evaluation included analysis of variance, correlation analysis, principal component analysis (PCA) and quadratic discriminant analysis (QDA).

Villarreal and coworkers have detected differences in fatty acid compositions between wild and cultured red drum [17]. A major goal of the study was to aid law enforcement officials in differentiating endemic (wild) fishes from cultured ones to discourage illegal capture or commercial fraud.

Since September 2001, a European consortium of five partners in France, Italy, the UK and Norway has been working to develop a validated method to enable official laboratories to determine exactly where fish come from, and whether or not they are wild. The partners under the “Confirmation of the Origin of Farmed And Wild Salmon and other fish” (COFAWS) project examined a range of techniques: determining deuterium distribution in the fatty acids by NMR; oxygen and carbon isotope levels by isotope ratio mass spectrometry (IRMS); distribution of fatty acids in triglycerides and the composition of the fatty acid mixture by gas chromatography and 13C NMR spectroscopy [15].

The present study aims to develop a validated method for the authentication of the origin of salmon and salmon-based products using support vector machines. Success in developing such methods would reduce the possibility of fraud.

Theory

Correlation optimized warping

NMR spectra of complex mixtures typically show unwanted local peak shifts caused by matrix and instrument variability, which must be compensated for prior to statistical analysis and interpretation of the data. Several peak alignment algorithms have been developed, some of which have been applied previously to NMR spectra [18, 19].

A procedure called correlation optimized warping (COW) was introduced by Nielsen et al. [20] to correct for misalignments or shifts in discrete data signals. It is a piecewise or segmented data preprocessing method (operating on one sample record at a time) aimed at aligning a sample data vector towards a reference vector by allowing limited changes in segment lengths of the sample vector. Both signals, the profile to align and the reference profile, are first divided into a user-specified number of sections (N), with length m for the unaligned profile [21]. Each section in the profile to be aligned is then warped, starting from the first section. The length of this section is stretched or shortened by shifting the position of its end point by a limited number of points (from −t to +t), defined by the slack parameter, t. The stretched and shortened segments obtained for this section are then linearly interpolated to the length of the corresponding section in the reference profile. For each possible end point of this section (from −t to +t), the correlation coefficient between the interpolated section and the corresponding reference section is computed and stored together with the position of the section end point. Then, one goes on to the second section, which starts after each of the different end points of the first section. The end point of this section is also shifted from −t to +t points for each end point of the first section. The stretched and shortened sections are then linearly interpolated to the length of section two of the reference and the correlation coefficients are again computed. The above operation is performed on all sections until the last section of the profile to be aligned has been treated. To score a warping solution, an objective function, F, is constructed as a cumulative sum of the correlation coefficients of the previous sections. After examining all possible end points of all sections, a global warping solution is calculated by working backwards from the last section, determining the maximal value of the objective function and its corresponding end point for every section. For a more detailed description of the COW algorithm we refer the reader to references [22, 23].
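The inner step of the procedure described above (warping one section for every allowed end-point shift and scoring it against the reference) can be sketched as follows. This is only a minimal illustration, not the implementation used in this work: the function names `interpolate_to_length`, `pearson` and `best_shift` and the toy signals are hypothetical, and the dynamic-programming search over all sections is deliberately omitted.

```python
import math

def interpolate_to_length(segment, new_length):
    """Linearly interpolate a segment onto new_length equally spaced points."""
    old_length = len(segment)
    if new_length == 1:
        return [segment[0]]
    result = []
    for k in range(new_length):
        # Position of the k-th output point on the original index axis.
        pos = k * (old_length - 1) / (new_length - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, old_length - 1)
        frac = pos - lo
        result.append((1.0 - frac) * segment[lo] + frac * segment[hi])
    return result

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # flat segment: correlation undefined, treated as 0 here
    return cov / math.sqrt(var_a * var_b)

def best_shift(sample, start, end, ref_segment, slack):
    """Try end-point shifts from -slack to +slack for sample[start:end]."""
    best = None
    for shift in range(-slack, slack + 1):
        new_end = end + shift
        if new_end - start < 2 or new_end > len(sample):
            continue  # section too short or beyond the end of the signal
        warped = interpolate_to_length(sample[start:new_end], len(ref_segment))
        r = pearson(warped, ref_segment)
        if best is None or r > best[1]:
            best = (shift, r)
    return best

# Toy data (assumed): the sample peak is a stretched copy of the reference peak.
reference = [math.exp(-((i / 49 - 0.5) ** 2) / 0.01) for i in range(50)]
sample = [math.exp(-((i / 52 - 0.5) ** 2) / 0.01) for i in range(53)] + [0.0] * 7
shift, r = best_shift(sample, 0, 50, reference, 3)
```

In the full algorithm the correlation obtained for each end point would be stored rather than maximised section by section, so that the cumulative objective F can be optimised globally by backtracking from the last section.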

Support vector machines

With the progress in computer science and technology, data processing has become widely used in both chemical research and chemical plant control [24]. Until recently, the statistical methods used in chemistry were almost always based on classical multivariate statistical methods such as principal components analysis and factorial discriminant analysis. However, in the case of complex data sets, these methods may not be able to produce robust and efficient predictive models; this was the case for the 1H NMR data studied here. Several methods based on the concepts of statistical learning theory, including support vector machines (SVMs) and weight-decay artificial neural networks (WD-ANN) [25, 26], have been proposed recently to solve this problem. Compared with other algorithms used in computer chemistry, the SVM has some outstanding advantages: it can be used for both classification (support vector classification, SVC) and regression (support vector regression, SVR); it is suitable for both linear and nonlinear data processing; it has special generalization ability, in particular for problems where the sample sizes are small; and it has no difficulty with the local minimum problem. Nowadays, SVMs are replacing neural networks as tools for solving pattern recognition problems [27]. They are based on some elegantly simple ideas and provide a clear intuition of what learning from examples is all about. In very simple terms, an SVM corresponds to a linear method in a very high dimensional feature space that is nonlinearly related to the input space. Even though we think of it as a linear algorithm in a high dimensional feature space, in practice it does not involve any computations in the high dimensional space. By the use of kernels, all necessary computations are performed directly in the input space.

SVM performs pattern recognition for two-class problems by determining the separating hyperplane with maximum distance to the closest points of the training set. These points are called support vectors and are shown in Fig. 1.

The SVM algorithm constructs a separating hypersurface in the input space. It acts as follows:

(i) Map the input space into a higher dimensional feature space through some nonlinear mapping function, chosen a priori (kernel);
(ii) construct the maximal margin hyperplane (MMH) in this feature space. The MMH maximizes the distance of the closest vectors belonging to the different classes to the hyperplane.

For a linearly separable data set containing n spectra $X = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $x_i \in R^{n}$, $y_i \in \{-1, +1\}$, $i = 1, \ldots, n$, the hyperplanes are of the type $w^{T}x + b = 0$, corresponding to decision functions:

$$f(x) = w^{T}x + b \qquad (1)$$

where b and w are the hyperplane parameters of offset and weight vector, respectively. A point x is attributed to the class ‘+1’ if f(x) > 0 and to the class ‘−1’ otherwise; the class boundaries themselves lie at f(x) = ±1.

From the above definition, it follows that the distance between the class boundaries, the margin, is given by $M = 2 / \lVert w \rVert$.
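The expression for the margin can be checked with a short worked step. The points $x_{+}$ and $x_{-}$ are introduced here only for illustration: $x_{+}$ lies on the boundary $f(x) = +1$, $x_{-}$ lies on the boundary $f(x) = -1$, and they are chosen so that their difference is parallel to $w$.

$$
w^{T}x_{+} + b = +1, \qquad w^{T}x_{-} + b = -1 \;\;\Rightarrow\;\; w^{T}(x_{+} - x_{-}) = 2
$$

$$
M = \lVert x_{+} - x_{-} \rVert = \frac{w^{T}(x_{+} - x_{-})}{\lVert w \rVert} = \frac{2}{\lVert w \rVert}
$$

A shorter weight vector therefore corresponds to a wider margin, which is why the margin is maximized by minimizing $\lVert w \rVert$.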

In the case of overlapping class distributions, not all samples can be classified correctly and deviations $\xi_i$ from the desired values of the decision function have to be allowed for individual training samples.

As the margin M is given by $2 / \lVert w \rVert$, one can find the weight vector w, which maximizes M and simultaneously minimizes the empirical error, by solving the following joint minimization problem:

$$\min \left\{ 0.5\,\lVert w \rVert^{2} + C \sum_{i=1}^{N} \xi_i \right\} \qquad (2)$$

subject to the constraints:

$$y_i \left( w^{T}x_i + b \right) \ge 1 - \xi_i, \quad \text{for class } y_i = \pm 1 \qquad (3)$$

$$\xi_i \ge 0, \quad \text{for all } i$$

Fig. 1 The decision hyperplane given by the SVM attempts to maximize the margin between two types of data


The first term in Eq. (2) corresponds to maximizing the margin, whereas the second term penalizes samples on the wrong side of the class boundaries (‘margin errors’) in the nonseparable case.

As both objectives, minimizing $\lVert w \rVert^{2}$ and $\sum \xi_i$, are not directly connected from the theory, the empirical trade-off parameter C has to be introduced, which governs the emphasis on one or the other objective. C is to be chosen beforehand.
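The role of C can be illustrated by evaluating the objective of Eq. (2) for a fixed hyperplane. The sketch below is illustrative only: the helper names, the toy samples and the chosen values of w, b and C are assumptions, and each slack is taken as the smallest value allowed by the constraints, i.e. max(0, 1 − y·f(x)).

```python
def decision_value(w, b, x):
    """Linear decision function f(x) = w^T x + b of Eq. (1)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def primal_objective(w, b, samples, labels, C):
    """Return the value of Eq. (2) and the slack of every sample."""
    slacks = [max(0.0, 1.0 - y * decision_value(w, b, x))
              for x, y in zip(samples, labels)]
    margin_term = 0.5 * sum(wi * wi for wi in w)
    return margin_term + C * sum(slacks), slacks

# Toy data (assumed): the third sample lies on the wrong side of the boundary.
samples = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
labels = [+1, -1, -1]
w, b = (1.0, 0.0), 0.0

low_c, slacks = primal_objective(w, b, samples, labels, C=0.1)
high_c, _ = primal_objective(w, b, samples, labels, C=100.0)
```

With a small C the single margin error adds little to the objective, whereas with a large C it dominates, so the optimizer would prefer a narrower margin that classifies this sample correctly.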

The above minimization is a standard problem in optimization theory: minimization of a quadratic objective function with linear constraints. It can be solved by applying Lagrangian theory. From this theory, one can derive that the resulting weight vector w of the decision function (Eq. (1)) is given by a linear combination of the training data $x_i$:

$$w = \sum_{i=1}^{N} \alpha_i y_i x_i \qquad (4)$$

Only the training samples with a nonzero coefficient $\alpha_i$ contribute to this sum. These samples, the support vectors, act as prototypes for the class distinction; they can be found in a straightforward manner and provide compressed information for constructing the decision function. Omitting all training vectors that are not support vectors will lead to the same final result for the decision function, which reduces the computational overhead.

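A reformulation of Eq. (2) gives the final optimization problem to be solved with respect to the $\alpha_i$:

$$\min \left\{ 0.5 \sum_{i,j} \alpha_i y_i x_i^{T} x_j \alpha_j y_j - \sum_{i} \alpha_i \right\} \qquad (5)$$

subject to $\sum_i \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$, for all i.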

The objective function (Eq. (5)) involves the sought contribution factors $\alpha_i$ for every training sample, their class labels $y_i$, the trade-off parameter C and the scalar products between all possible pairs of training samples.
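As a small numerical illustration of Eq. (5), the sketch below evaluates the dual objective and checks its constraints for a three-sample toy problem. The data, the helper names and the value of C are assumptions made for this example; for this particular toy set the coefficients α = (0.5, 0.5, 0) satisfy the optimality conditions, so any other feasible choice gives a larger objective value.

```python
def dot(u, v):
    """Scalar product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alpha, samples, labels):
    """Value of the objective in Eq. (5) for given coefficients alpha."""
    n = len(alpha)
    quadratic = sum(alpha[i] * labels[i] * dot(samples[i], samples[j])
                    * alpha[j] * labels[j]
                    for i in range(n) for j in range(n))
    return 0.5 * quadratic - sum(alpha)

def is_feasible(alpha, labels, C, tol=1e-12):
    """Check sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C."""
    balanced = abs(sum(a * y for a, y in zip(alpha, labels))) <= tol
    boxed = all(-tol <= a <= C + tol for a in alpha)
    return balanced and boxed

# Toy data (assumed): the third sample lies far from the class boundary.
samples = [(1.0, 0.0), (-1.0, 0.0), (3.0, 1.0)]
labels = [+1, -1, +1]

optimum = dual_objective([0.5, 0.5, 0.0], samples, labels)
nearby = dual_objective([0.4, 0.4, 0.0], samples, labels)
feasible = is_feasible([0.5, 0.5, 0.0], labels, C=10.0)
```

Only scalar products between pairs of samples enter the computation, which is what later allows them to be replaced by a kernel function.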

A fundamental difference between the optimization problem (Eq. (5)) and that encountered in most neural network approaches is that here no local minima can occur. The solution is found in a deterministic manner, i.e. given the same training data (and C) the algorithm will always deliver the same final decision function. Strictly speaking, however, the offset b may not be unique.

The offset parameter b of the decision function (Eq. (1)) can be determined from the constraints (Eq. (3)) as follows:

$$b = \frac{1}{N_b} \sum_{0 < \alpha_i < C} \left( y_i - w^{T}x_i \right) \qquad (6)$$

where $N_b$ is the number of support vectors that lie on the ‘class boundary’ f(x) = ±1 and for which $0 < \alpha_i < C$.
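Equations (4) and (6) can be illustrated with a few lines of code. The sketch below assumes that the coefficients α have already been obtained from an optimizer; the toy samples, the α values and the helper names are hypothetical and serve only to show that samples with α = 0 can be dropped without changing w.

```python
def dot(u, v):
    """Scalar product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def weight_vector(alpha, samples, labels):
    """Eq. (4): w as a linear combination of the training samples."""
    dim = len(samples[0])
    return [sum(a * y * x[k] for a, y, x in zip(alpha, labels, samples))
            for k in range(dim)]

def offset(alpha, samples, labels, w, C, tol=1e-12):
    """Eq. (6): average of y_i - w^T x_i over samples with 0 < alpha_i < C."""
    on_boundary = [i for i, a in enumerate(alpha) if tol < a < C - tol]
    residuals = [labels[i] - dot(w, samples[i]) for i in on_boundary]
    return sum(residuals) / len(on_boundary)

# Toy solution (assumed): two support vectors and one sample with alpha = 0.
samples = [(1.0, 0.0), (-1.0, 0.0), (3.0, 1.0)]
labels = [+1, -1, +1]
alpha = [0.5, 0.5, 0.0]

w = weight_vector(alpha, samples, labels)
b = offset(alpha, samples, labels, w, C=10.0)

# Dropping the non-support vector leaves the weight vector unchanged.
w_reduced = weight_vector(alpha[:2], samples[:2], labels[:2])
```

Here w = (1, 0) and b = 0, so the third sample has f(x) = 3 and lies well outside the margin, consistent with its zero coefficient.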

The entire construction can be extended rather naturally to include nonlinear separating hypersurfaces. The scalar product $x_i^{T}x_j$ is replaced by a function $K(x_i, x_j)$, named the kernel. If certain conditions are satisfied (Mercer’s conditions, [2]) this kernel function can implicitly compute a nonlinear mapping $x \rightarrow \Phi(x)$ and the subsequent scalar multiplication $\Phi(x_i)^{T}\Phi(x_j)$ in the mapped space, in one step. The nonlinear decision function is then given by:

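$$f(x) = \sum_{i \in \mathrm{SV}} \alpha_i y_i K(x_i, x) + b \qquad (7)$$

where the sum runs over the support vectors $x_i$ and x is the sample to be classified.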

By applying the nonlinear mapping to the original data x and using a linear classifier in the mapped space Φ(x), nonlinear decision boundaries in the input space are obtained (Fig. 2).

Important families of admissible kernel functions are:

– Gaussian radial basis function kernel (RBF),

$$K(x, y) = \exp\left( -\frac{\lVert x - y \rVert^{2}}{2\sigma^{2}} \right) \qquad (8)$$

where σ is the width (standard deviation) of the Gaussian, so that σ² is its variance.

– Polynomial kernel,

$$K(x, y) = \left( 1 + x^{T}y \right)^{d} \qquad (9)$$

where d is the degree of the polynomial (d = 1 for the linear kernel).
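The two kernels of Eqs. (8) and (9), and their use in a kernel decision function of the form of Eq. (7), can be written compactly. The following is a minimal sketch rather than the code used in this work: the function names, the two support vectors and their coefficients are hypothetical values chosen for illustration.

```python
import math

def rbf_kernel(x, y, sigma):
    """Gaussian RBF kernel of Eq. (8)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def polynomial_kernel(x, y, degree):
    """Polynomial kernel of Eq. (9)."""
    return (1.0 + sum(a * b for a, b in zip(x, y))) ** degree

def decision_function(x, support_vectors, alphas, labels, b, kernel):
    """Kernel decision function: sum over support vectors plus offset."""
    return sum(a * y * kernel(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b

# Hypothetical support vectors and coefficients, for illustration only.
support_vectors = [(1.0, 0.0), (-1.0, 0.0)]
alphas = [1.0, 1.0]
labels = [+1, -1]

def kernel(u, v):
    return rbf_kernel(u, v, sigma=1.0)

f_right = decision_function((0.5, 0.0), support_vectors, alphas, labels, 0.0, kernel)
f_left = decision_function((-0.5, 0.0), support_vectors, alphas, labels, 0.0, kernel)
```

A sample is assigned to a class according to the sign of the returned value; here the point on the right is closer to the positive support vector and receives a positive value, and the point on the left a negative one.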

Fig. 2 Nonlinear mapping applied to the original data x, followed by a linear classifier in the mapped space Φ(x)


Experimental

Analysis of fatty acids by 1H NMR

The 1H NMR spectra of the fish oil extracted from the white muscle of the salmon samples were recorded on a Bruker GmbH Advance spectrometer operating at the basic frequency of 500 MHz using an inverse broadband 5-mm probe. Sample preparation was reduced to the dilution of 120 μL of fish oil in 700 μL of deuterated chloroform. Quantitative spectra were obtained in 25 min experimental time in a fully automated way (automatic sample tube insertion, lock, shimming and acquisition). An example of the 1H NMR spectrum of salmon oil is shown in Fig. 3. Table 1 summarises the useful information about the NMR spectra of the fatty acids in salmon oil.

Software

All computations and chemometric analyses were executed with programs in Matlab v6.5 (The Mathworks, Inc., Natick, MA, USA). Different algorithms have been proposed in the literature to perform SVM for classification [28–30]. Lin’s LIBSVM v2.33 algorithm [30] was used in the present work. All calculations were performed on a 3-GHz Pentium IV with 2 Gbytes of RAM.

Results and discussion

Lipids are components of all living cells. The lipid content and fatty acid composition of an organism may vary within a species because of factors such as sex, location, season or diet [31]. Because the fatty acid composition of tissue lipids

Fig. 3 1H NMR spectrum of a salmon oil

Table 1 Characteristics of the 1H NMR spectrum of the fatty acids in salmon oil

| Peak | Compound | Carbon | Chemical shift (ppm) |
|---|---|---|---|
| 1 | All fatty acids (FA) except ω-3 FA | -CH3 | 0.85–0.91 |
| 2 | ω-3 FA | -CH3 | 0.95–1.00 |
| 3 | All FA except EPA (a) and DHA (b) | -(CH2)n- | 1.20–1.40 |
| 4 | All FA except DHA | -(CH2)n-CH2-COOH | 1.61 |
| 5 | Unsaturated FA | -(CH2)n-CH=CH | 1.97–2.12 |
| 6 | All FA except DHA | -CH2-COOH | 2.28–2.34 |
| 7 | DHA | =CH-CH2-CH2-COOH | 2.37–2.41 |
| 8 | Polyunsaturated FA | =CH-CH2-CH= | 2.74–2.90 |
| 9 | Phosphatidylcholine | -N(CH3)3 | 3.35 |
| 10 | Glyceryl | C1,3 protons | 4.10–4.19 |
| 11 | | C1,3 protons | 4.26–4.33 |
| 12 | | C2 proton | 5.24–5.28 |
| 13 | Unsaturated FA | -CH=CH- | 5.30–5.44 |
| 14 | p-Dinitrobenzene | C1–4 protons | 8.42 |

(a) Eicosapentaenoic acid
(b) Docosahexaenoic acid


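Table 2 Classification results for the training, monitoring and validation sets using SVM for both classification criteria

| Sample | Given specification | Wild (W)/farmed (F) salmon: class | Wild (W)/farmed (F) salmon: SVM results c | Country of origin: class | Country of origin: SVM results c |
|---|---|---|---|---|---|
| 1a | S-AlaskaW | W | + | Alaska | + |
| 2 | S-AlaskaW | W | + | Alaska | + |
| 3a | S-AlaskaW | W | + | Alaska | + |
| 4b | S-AlaskaW | W | + | Alaska | + |
| 5 | S-AlaskaW | W | + | Alaska | + |
| 6a | S-Canada1F | F | + | Canada | + |
| 7 | S-Canada1F | F | + | Canada | + |
| 8a | S-Canada1F | F | + | Canada | + |
| 9b | S-Canada1F | F | + | Canada | + |
| 10 | S-Canada1F | F | + | Canada | + |
| 11 | S-Faroes1F | F | + | Faroes | + |
| 12a | S-Faroes1F | F | + | Faroes | + |
| 13b | S-Faroes1F | F | + | Faroes | + |
| 14a | S-Faroes1F | F | + | Faroes | + |
| 15 | S-Faroes1F | F | + | Faroes | + |
| 16a | S-Iceland1F | F | + | Iceland | + |
| 17 | S-Iceland1F | F | + | Iceland | + |
| 18a | S-Iceland1F | F | + | Iceland | + |
| 19 | S-Iceland1F | F | + | Iceland | + |
| 20b | S-Iceland1F | F | + | Iceland | + |
| 21b | S-Ireland1F | F | + | Ireland | + |
| 22a | S-Ireland1F | F | + | Ireland | + |
| 23 | S-Ireland1F | F | + | Ireland | + |
| 24 | S-Ireland1F | F | + | Ireland | + |
| 25a | S-Ireland1F | F | + | Ireland | + |
| 26 | S-Ireland2F | F | + | Ireland | + |
| 27 | S-Ireland2F | F | + | Ireland | + |
| 28a | S-Ireland2F | F | + | Ireland | + |
| 29 | S-Ireland2F | F | + | Ireland | + |
| 30 | S-Ireland2F | F | + | Ireland | + |
| 31b | S-IrelandW | W | + | Ireland | + |
| 32a | S-IrelandW | W | + | Ireland | − (Scotland) |
| 33 | S-IrelandW | W | + | Ireland | + |
| 34 | S-IrelandW | W | + | Ireland | + |
| 35a | S-IrelandW | W | + | Ireland | + |
| 36b | S-Norway1F | F | + | Norway | + |
| 37a | S-Norway1F | F | + | Norway | + |
| 38 | S-Norway1F | F | + | Norway | + |
| 39 | S-Norway1F | F | + | Norway | + |
| 40a | S-Norway1F | F | + | Norway | + |
| 41 | S-Norway1F | F | + | Norway | + |
| 42 | S-Norway1F | F | + | Norway | + |
| 43a | S-Norway1F | F | + | Norway | + |
| 44 | S-Norway1F | F | + | Norway | + |
| 45 | S-Norway1F | F | + | Norway | + |
| 46a | S-Norway1F | F | + | Norway | + |
| 47 | S-Norway1F | F | + | Norway | + |
| 48 | S-Norway1F | F | + | Norway | + |
| 49a | S-Norway1F | F | + | Norway | + |
| 50b | S-Norway1F | F | + | Norway | + |
| 51 | S-Norway1F | F | + | Norway | + |
| 52a | S-Norway1F | F | + | Norway | + |
| 53 | S-Norway1F | F | + | Norway | + |
| 54 | S-Norway1F | F | + | Norway | + |
| 55a | S-Norway1F | F | + | Norway | + |
| 56 | S-Norway1F | F | + | Norway | + |
| 57 | S-Norway1F | F | + | Norway | + |
| 58a | S-Norway1F | F | + | Norway | + |
| 59b | S-Norway1F | F | + | Norway | + |
| 60 | S-Norway1F | F | + | Norway | + |
| 61a | S-Norway2F | F | + | Norway | + |
| 62 | S-Norway2F | F | + | Norway | + |
| 63 | S-Norway2F | F | + | Norway | + |
| 64a | S-Norway2F | F | + | Norway | + |
| 65 | S-Norway2F | F | + | Norway | + |
| 66 | S-Norway2F | F | + | Norway | + |
| 67a | S-Norway2F | F | + | Norway | + |
| 68 | S-Norway2F | F | + | Norway | + |
| 69 | S-Norway2F | F | + | Norway | + |
| 70a | S-Norway2F | F | + | Norway | + |
| 71b | S-Norway2F | F | + | Norway | + |
| 72 | S-Norway2F | F | + | Norway | + |
| 73a | S-Norway2F | F | + | Norway | + |
| 74 | S-Norway2F | F | + | Norway | + |
| 75 | S-Norway2F | F | + | Norway | + |
| 76b | S-NorwayW | W | + | Norway | + |
| 77a | S-NorwayW | W | + | Norway | + |
| 78 | S-NorwayW | W | + | Norway | + |
| 79b | S-NorwayW | W | + | Norway | + |
| 80 | S-NorwayW | W | + | Norway | + |
| 81a | S-NorwayW | W | + | Norway | + |
| 82 | S-NorwayW | W | + | Norway | + |
| 83 | S-NorwayW | W | + | Norway | + |
| 84a | S-NorwayW | W | + | Norway | + |
| 85 | S-NorwayW | W | + | Norway | + |
| 86b | S-NorwayW | W | + | Norway | + |
| 87 | S-NorwayW | W | + | Norway | + |
| 88a | S-NorwayW | W | + | Norway | − (Ireland) |
| 89 | S-NorwayW | W | + | Norway | + |
| 90 | S-NorwayW | W | + | Norway | + |
| 91a | S-NorwayW | W | + | Norway | + |
| 92 | S-Scotland1F | F | + | Scotland | + |
| 93a | S-Scotland1F | F | + | Scotland | − (Ireland) |
| 94 | S-Scotland1F | F | + | Scotland | + |
| 95 | S-Scotland1F | F | + | Scotland | + |
| 96a | S-Scotland1F | F | + | Scotland | + |
| 97 | S-Scotland2F | F | + | Scotland | + |
| 98b | S-Scotland2F | F | + | Scotland | + |
| 99 | S-Scotland2F | F | + | Scotland | + |
| 100a | S-Scotland2F | F | + | Scotland | + |
| 101 | S-Scotland2F | F | + | Scotland | + |
| 102 | S-Scotland2F | F | + | Scotland | + |
| 103a | S-Scotland2F | F | + | Scotland | + |
| 104 | S-Scotland2F | F | + | Scotland | + |
| 105b | S-Scotland2F | F | + | Scotland | + |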


of animals often reflects that of the diet [32–34], analysis of fatty acids is a powerful tool for distinguishing wild and farmed fishes. The main aim of the present work was the development of a validated method for the confirmation of wild and farmed salmon and their origins. Reaching this goal involves three requirements: (1) choosing a diverse data set, (2) using a reliable and relatively fast technique related to the structure of the fatty acids in fish oil and (3) using a powerful method to classify the samples from the results obtained by the analytical method.

Choice of data set

The confirmation method is based on two criteria: discrimination between wild and farmed salmon as “criterion one” and the country of origin of the fish as “criterion two”. A valid confirmation method requires a diverse data set. Criterion one has only two cases, wild or farmed, whereas criterion two can play a major role in the diversity of the method. Therefore a total of 141 salmon fish oils from eight different regions (Canada, Alaska, Faroes, Ireland, Iceland, Norway, Scotland and Tasmania) were considered in this work. As can be seen from Table 2, this data set was divided into three sets: training (74 samples), monitoring (45 samples) and validation (22 samples). The monitoring and validation sets were chosen randomly in such a way that they are adequately representative of the training set. The training set was used to develop the model. Together with the performance on the training set, the performance on an independent set (the monitoring set) must also be monitored to prevent overtraining. The developed model can then be evaluated by using the fish samples included in the validation set, which have not been used in the training and monitoring sets.
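A random division of 141 samples into sets of the stated sizes can be sketched as follows. This is an illustration under assumptions and not the authors' procedure: the function name `split_indices` and the fixed seed are hypothetical, and a plain shuffle such as this one does not by itself guarantee that every region is represented in each set, which would require a stratified selection.

```python
import random

def split_indices(n_samples, n_monitoring, n_validation, seed=0):
    """Randomly split sample indices into training, monitoring and validation."""
    rng = random.Random(seed)  # fixed seed (assumed) for reproducibility
    indices = list(range(n_samples))
    rng.shuffle(indices)
    monitoring = sorted(indices[:n_monitoring])
    validation = sorted(indices[n_monitoring:n_monitoring + n_validation])
    training = sorted(indices[n_monitoring + n_validation:])
    return training, monitoring, validation

# Sizes taken from the text: 74 training, 45 monitoring, 22 validation samples.
training, monitoring, validation = split_indices(141, 45, 22)
```

The three index lists are disjoint by construction, so the validation samples play no part in model development or in the monitoring of overtraining.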

Analysis of fish oils

The next step was choosing a suitable method to analyse the fatty acids of the salmon fish oil. The oxidative and hydrolytic degradation of lipids in fish oil was monitored using partial least-squares (PLS) regression and near-infrared reflectance (NIR) spectroscopy [35]. Jalali-Heravi et al. have used principal component analysis (PCA) together with HELP on two-dimensional data obtained by GC-MS to analyse the fatty acid methyl esters (FAMEs) in the complex matrix of commercial fish oils [36]. Sheng-Suan Cai and coworker compared atmospheric pressure photo-ionization (APPI), atmospheric pressure chemical ionization (APCI) and electrospray ionization mass spectrometry (ESI) LC/MS for analysis of lipids [37]. Analysis of fish oils by APPI showed significantly enhanced target analyte intensities in comparison with APCI and ESI. Although most of the methods used for analyzing the fish oils have been successful, a suitable method for discriminating between the farmed or wild fishes and their origins should have two properties: (1) its resulting data must be informative and useful as inputs for the classification methods; (2) its sample preparation should be as simple and fast as possible. Among different techniques 1H NMR

Table 2 (continued)

| Sample | Given specification | Wild (W)/farmed (F) salmon: class | Wild (W)/farmed (F) salmon: SVM results c | Country of origin: class | Country of origin: SVM results c |
|---|---|---|---|---|---|
| 106 | S-Scotland2F | F | + | Scotland | + |
| 107a | S-Scotland2F | F | + | Scotland | + |
| 108 | S-Scotland2F | F | + | Scotland | + |
| 109b | S-Scotland2F | F | + | Scotland | − (Norway) |
| 110 | S-Scotland2F | F | + | Scotland | + |
| 111a | S-Scotland2F | F | + | Scotland | + |
| 112 | S-Scotland2F | F | + | Scotland | + |
| 113 | S-Scotland2F | F | + | Scotland | + |
| 114b | S-Scotland2F | F | + | Scotland | + |
| 115a | S-Scotland2F | F | + | Scotland | + |
| 116 | S-Scotland2F | F | + | Scotland | + |
| 117 | S-Scotland3F | F | + | Scotland | + |
| 118a | S-Scotland3F | F | + | Scotland | + |
| 119 | S-Scotland3F | F | + | Scotland | + |
| 120 | S-Scotland3F | F | + | Scotland | + |
| 121b | S-Scotland3F | F | + | Scotland | + |
| 122 | S-Scotland4F | F | + | Scotland | + |
| 123a | S-Scotland4F | F | + | Scotland | + |
| 124 | S-Scotland4F | F | + | Scotland | + |
| 125 | S-Scotland4F | F | + | Scotland | + |
| 126b | S-Scotland4F | F | + | Scotland | + |
| 127 | S-ScotlandW | W | + | Scotland | + |
| 128 | S-ScotlandW | W | + | Scotland | + |
| 129a | S-ScotlandW | W | + | Scotland | + |
| 130 | S-ScotlandW | W | + | Scotland | + |
| 131b | S-ScotlandW | W | + | Scotland | + |
| 132a | S-ScotlandW | W | + | Scotland | + |
| 133 | S-ScotlandW | W | + | Scotland | + |
| 134 | S-ScotlandW | W | + | Scotland | + |
| 135b | S-ScotlandW | W | + | Scotland | + |
| 136a | S-ScotlandW | W | + | Scotland | + |
| 137 | S-Tasmania1F | F | + | Tasmania | + |
| 138a | S-Tasmania1F | F | + | Tasmania | + |
| 139 | S-Tasmania1F | F | + | Tasmania | + |
| 140b | S-Tasmania1F | F | + | Tasmania | + |
| 141a | S-Tasmania1F | F | + | Tasmania | + |

a The samples included in the monitoring set
b The samples included in the validation set. The remaining samples are considered in the training set
c + correctly classified; − misclassified


spectroscopy is an analytical approach which requires a very simple sample preparation of the fish oils and presents quantitative data in a relatively short time. Also, the 1H NMR spectra contain a lot of information related to the structures of the molecules, which is useful in classifying the salmon fish oils. Therefore, 1H NMR spectroscopy was used for analyzing the fatty acids of the fish oils extracted from the white muscle of the salmon samples.

Fish oils are complex mixtures consisting of different polysaturated and unsaturated fatty acids. Figure 3 shows a typical 1H NMR spectrum of fish oil. Parts of this spectrum are expanded to show the complexity of the spectrum. The characteristics of the NMR spectrum of the fatty acids in salmon oil are summarized in Table 1. It can be seen that the 1H chemical shifts ranged from 0.85 to 8.424 ppm.

Classification of 1H NMR data of salmon fish oils

The NMR data sets are very informative and therefore could be useful for classification purposes. However, due to the increasing complexity of these data sets, it has become important to utilize data reduction and chemometric techniques to accurately access the latent chemical information within the data. In the present work, the initial goal was classifying the 1H NMR data based on the wild/farmed salmon and their origins. To accomplish this goal, pattern recognition methods can be employed to reduce the complexity and size of the NMR data so that classification can be visually pursued. The simplest of these techniques are referred to as “unsupervised methods”, since there is no need for operator input. These methods are often used as exploratory techniques during the initial stages of data analysis. Examples of chemometric analysis of NMR data using these unsupervised methods are principal component analysis (PCA) and hierarchical cluster analysis (HCA) [38]. However, in complex systems where the number of groups to be separated during classification becomes larger, the performance of simple unsupervised methods degrades, requiring the use of more sophisticated supervised chemometric techniques. Among different supervised methods SVM seems to be the most suitable one, because for the classification only support vectors are needed. This means that for the classification a limited number of data points are used and therefore the calculation processes would be reduced. As discussed in section “Support vector machines”, except for the support vectors all other training vectors can be eliminated without changing the final decision function. In the present work, among the 74 samples of the training set only a total of 18 samples were chosen as support vectors for wild/farmed classification. The other 56 samples were discarded in developing the optimized model, and the result was very satisfactory. In the case of classifying the fishes based on their origins, a total of 60 samples were chosen as support vectors. The increased number of support vectors in this case compared to the previous one is due to the larger number of classes and the greater complexity of the problem (8 classes as the origins of salmon compared to 2 as wild/farmed).

Fig. 4 Performance of COW: the appearance before (a) and after (b) peak alignment

The data set comprised salmon from 8 different regions (Canada, Alaska, Faroes, Iceland, Ireland, Norway, Scotland and Tasmania), including both wild and farmed salmon. The initial size of each NMR spectrum was 1×32,768, which was reduced to 1×11,501 by eliminating uninformative spectral regions and by calculating the average of every two points. The final data set, comprising 141 spectra, is a matrix of size 141×11,501.
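The size reduction described above can be sketched in two small steps: keeping selected index ranges and then averaging adjacent pairs of points. The retained spectral regions are not specified here, so the single index window used below is purely hypothetical and was chosen only because it leaves 23,002 points, i.e. twice 11,501; the function names are likewise assumptions.

```python
def keep_regions(spectrum, regions):
    """Keep only the points inside the given (start, stop) index windows."""
    kept = []
    for start, stop in regions:
        kept.extend(spectrum[start:stop])
    return kept

def average_pairs(spectrum):
    """Replace every two consecutive points by their mean."""
    if len(spectrum) % 2 != 0:
        raise ValueError("spectrum length must be even for 2-point averaging")
    return [(spectrum[i] + spectrum[i + 1]) / 2.0
            for i in range(0, len(spectrum), 2)]

# Stand-in for a raw spectrum of 32,768 points (values are placeholders).
raw = [float(i) for i in range(32768)]

# Hypothetical informative window; the actual regions are not given in the text.
truncated = keep_regions(raw, [(1000, 24002)])
reduced = average_pairs(truncated)
```

Applying the same two steps to every spectrum and stacking the results row by row would give a matrix with one row per sample and 11,501 columns.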

NMR analysis of complex samples is accompanied by variation in peak position and peak shape not directly linked to the sample. This is due to temperature variation, inhomogeneities in the applied magnetic field, instrumental instabilities or relative concentration differences in the background matrix of the sample. These variations are complicated and limit the interpretation and analysis of NMR data by chemometric methods. Alignment of the NMR signals by COW, which corrects them in relation to a reference spectrum, may circumvent these limitations and is an important preprocessing step prior to multivariate analysis. The alignment with COW requires the optimization of two input parameters, m and t. The optimization is done as follows: first a reference signal is selected. It must be a representative NMR spectrum in which most peaks are clearly present. An NMR spectrum that fulfills these requirements is therefore chosen as reference. The values of m and t are optimized by varying m between 10 and 1,000 and t between 1 and 3. These optimized parameters (m=100, t=3) are then used to align the whole data set. If some signals still remain unaligned, m and t can be further optimized for the badly aligned signals.

Fig. 5 a A typical 1H NMR spectrum of salmon after truncation and 2-point averaging, b after SNV, c after scaling between 1 and 10, d after Log transformation

Figure 4 shows a section of NMR spectra from 0.8 to 1.1 ppm. It can be seen that the peaks are not very well aligned originally, whereas after COW their positions are clearly aligned.

The standard normal variate (SNV) transform was then used for the pretreatment of the salmon NMR spectra. As there are small peaks that are significant for the discrimination and are not taken fully into account without some sort of scaling, each row was scaled between 1 and 10, and then a Log transform was performed to enhance the small peaks, as shown in Fig. 5.
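The three pretreatment steps can be chained as in the sketch below. Several details are assumptions made for the example rather than facts from the text: the SNV step uses the sample standard deviation (n − 1 denominator), the logarithm is taken to base 10, and the function names and the toy spectrum are hypothetical. Scaling each row to the range 1 to 10 before the logarithm guarantees positive arguments, since SNV produces negative values.

```python
import math
import statistics

def snv(row):
    """Standard normal variate: centre each spectrum and scale to unit spread."""
    mean = statistics.mean(row)
    std = statistics.stdev(row)  # sample standard deviation (assumed)
    return [(v - mean) / std for v in row]

def scale_between(row, low=1.0, high=10.0):
    """Rescale a row linearly so that its minimum is low and its maximum high."""
    lo, hi = min(row), max(row)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in row]

def log_transform(row):
    """Base-10 logarithm (assumed base) of every point."""
    return [math.log10(v) for v in row]

def preprocess(row):
    """SNV, then scaling between 1 and 10, then log transform."""
    return log_transform(scale_between(snv(row)))

# Toy spectrum (assumed): several small peaks next to one dominant peak.
spectrum = [0.0, 0.1, 0.2, 5.0, 0.3, 100.0]
normalized = snv(spectrum)
processed = preprocess(spectrum)
```

After this treatment every spectrum spans the range 0 to 1, and the logarithm compresses the dominant peaks relative to the small ones.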

Table 3 Classified results of the prediction set for the two classification criteria using SVM

| Sample | Given specification | Predicted wild (W)/farmed (F) | Predicted country of origin |
|---|---|---|---|
| 1 | Cofaws-blinds | F | Canada |
| 2 | Cofaws-blinds | F | Norway |
| 3 | Cofaws-blinds | F | Norway |
| 4 | Cofaws-blinds | W | Norway |
| 5 | Cofaws-blinds | F | Scotland |
| 6 | Cofaws-blinds | F | Norway |
| 7 | Cofaws-blinds | W | Norway |
| 8 | Cofaws-blinds | F | Canada |
| 9 | Cofaws-blinds | F | Norway |
| 10 | Cofaws-blinds | W | Ireland |
| 11 | Market-Canada-W | W | Ireland |
| 12 | Market-Canada-W | W | Ireland |
| 13 | Market-Canada-W | W | Ireland |
| 14 | Market-Canada-W | W | Norway |
| 15 | Market-Canada-W | W | Ireland |
| 16 | Market-Italy-F | F | Norway |
| 17 | Market-Italy-F | F | Norway |
| 18 | Market-Italy-F | F | Norway |
| 19 | Market-Italy-F | F | Scotland |
| 20 | Market-UK | W | Norway |
| 21 | Market-UK | F | Norway |
| 22 | Market-UK | F | Norway |
| 23 | Market-UK | F | Norway |
| 24 | Market-UK | F | Norway |
| 25 | Market-UK | F | Norway |
| 26 | MarketFrance1F | F | Scotland |
| 27 | MarketFrance1F | F | Scotland |
| 28 | MarketFrance1F | F | Norway |
| 29 | MarketFrance1F | F | Scotland |
| 30 | MarketFrance1F | F | Scotland |
| 31 | MarketFrance1F | F | Ireland |
| 32 | MarketFrance1F | F | Scotland |
| 33 | MarketFrance1F | F | Scotland |
| 34 | MarketFrance1F | F | Scotland |
| 35 | MarketFrance1F | W | Scotland |
| 36 | MarketNorwayF | F | Norway |
| 37 | MarketNorwayF | F | Norway |
| 38 | MarketNorwayF | F | Norway |
| 39 | MarketNorwayF | F | Norway |
| 40 | MarketNorwayF | F | Norway |
| 41 | MarketNorwayF | F | Norway |
| 42 | MarketNorwayF | F | Norway |
| 43 | MarketNorwayF | F | Tasmania |
| 44 | MarketNorwayF | F | Norway |
| 45 | MarketNorwayF | F | Norway |
| 46 | MarketNorwayF | F | Norway |
| 47 | MarketNorwayF | F | Norway |
| 48 | MarketNorwayF | F | Norway |
| 49 | MarketNorwayF | F | Norway |
| 50 | MarketNorwayF | F | Norway |
| 51 | MarketNorwayF | F | Norway |
| 52 | MarketNorwayF | F | Norway |
| 53 | MarketNorwayF | F | Norway |
| 54 | MarketNorwayF | F | Norway |
| 55 | MarketNorwayF | F | Norway |
| 56 | MarketNorwayF | F | Norway |
| 57 | MarketNorwayF | F | Norway |
| 58 | MarketNorwayF | W | Norway |
| 59 | MarketNorwayF | F | Norway |
| 60 | MarketNorwayF | W | Faroes |
| 61 | MarketNorwayW | F | Norway |
| 62 | MarketNorwayW | F | Norway |
| 63 | MarketNorwayW | F | Norway |
| 64 | MarketNorwayW | F | Norway |
| 65 | MarketNorwayW | F | Norway |

Different kernels have been tested on these data, and the results showed that the RBF kernel is the most reasonable choice because of its simplicity and ability to model data of arbitrary complexity. Another reason for using RBF is the parsimony of the number of hyperparameters, which influence the complexity of model selection (the polynomial kernel, for instance, requires more hyperparameters than RBF). Also, the RBF kernel is computationally easier than, for instance, the polynomial one, where the kernel value may go to infinity or zero when the polynomial degree increases. The SVM parameters were optimized with training and monitoring sets. The optimal parameter settings for C (Eq. 2) and σ (Eq. 8) were selected as the values that give the maximum correct classification rate. They were found to be C=500 and σ=0.01 for discrimination of wild and farmed salmon and C=500 and σ=0.02 for country of origin of salmon.
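The numerical argument for preferring the RBF kernel can be illustrated with a short sketch. It assumes the common Gaussian form exp(-||x - y||^2 / (2σ^2)) for the RBF kernel, which may differ from Eq. 8 by a constant factor in the exponent, and the polynomial form (gamma·<x, y> + coef0)^degree used by many SVM packages. The function names and the example vectors are hypothetical.

```python
import math


def rbf_kernel(x, y, sigma):
    """Gaussian RBF kernel, assuming the form exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def poly_kernel(x, y, degree, gamma=1.0, coef0=0.0):
    """Polynomial kernel (gamma * <x, y> + coef0) ** degree: three hyperparameters."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


x, y = [0.2, 0.4], [0.3, 0.1]  # hypothetical vectors with inner product below 1
big = [3.0, 4.0]               # hypothetical vector with inner product above 1

for degree in (2, 5, 20):
    # The first value shrinks toward zero and the second grows rapidly with degree.
    print(degree, poly_kernel(x, y, degree), poly_kernel(big, big, degree))

# The RBF value always lies in (0, 1], whatever the width sigma.
print(rbf_kernel(x, y, sigma=0.5), rbf_kernel(x, x, sigma=0.5))
```

The polynomial kernel needs a degree, a scale and an offset, whereas the RBF kernel needs only its width σ, so together with C the model selection reduces to a two-parameter search.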

Table 2 presents the classified results for the training, monitoring and validation sets using SVM for both classification criteria. It can be seen from this table that the SVM has correctly classified all salmon samples in the training, monitoring and validation sets based on criterion one (wild/farmed). However, in the case of criterion two (country of origin), three samples in the monitoring set (6.7%) and one sample in the validation set (4.6%) were misclassified. Figure 6 compares the number of accurately predicted origins of the salmon from eight countries with the number of samples in the monitoring and validation sets. It can be seen that some discrepancies exist for three countries (Ireland, Norway and Scotland), which have the larger numbers of samples. Since the results were so encouraging, we used the optimized SVM models to predict the country of origin and the type (wild/farmed) of salmon in a set of data mainly consisting of “blinds” and “market” samples. The classified results for this set, which consisted of 65 samples, are presented in Table 3. We believe that applying SVMs to 1H NMR data can be considered a powerful tool for the confirmation of wild and farmed salmon and their origins. This could help in discouraging and prosecuting commercial fraud occurring in this area.

Conclusion

Support vector machines were used as a new class of classification algorithms in a validated method for the confirmation of wild and farmed salmon and their origins to reduce the possibility of fraud. This method seems to be the most suitable one, because a limited number of data points (support vectors) are needed for the classification. We believe that the power of this method partly depends on the analytical method used for the analysis of the fatty acids of the fish oils, as the more informative the data produced by an analytical method, the more useful they are for the classification. Therefore, 1H NMR spectra of the molecules were used for analyzing the fish oils. The combination of 1H NMR spectroscopy with SVMs has provided a novel method for the classification of salmon.

Acknowledgements The authors are grateful to the European project COFAWS (European Commission DG RTD FP5 project GRD2–2000–31813) and to all the collaborators from the partners of this project (Eurofins Scientific (Nantes, France), North Atlantic Fisheries College (Scalloway, Shetland Islands, United Kingdom), SINTEF Fisheries and Aquaculture (Trondheim, Norway), Joint Research Centre (Ispra, Italy)) who contributed to the collection and preparation of fish samples, and for the authorization to exploit their NMR data in this work.

Fig. 6 Comparison of the number of accurately predicted origins of salmon from eight countries with the number of samples in monitoring and validation sets


References

1. Vapnik VN (2000) The nature of statistical learning theory.Springer, Berlin Heidelberg New York

2. Cristiani N, Shawe-Taylor J (2000) An introduction to supportvector machines. Cambridge University Press, Cambridge

3. Herbrich R (2001) Learning kernel classifiers. Theory andalgorithms. MIT Press, London

4. Schölkopf B, Smola A (2002) Learning with kernels. MIT Press,Cambridge, MA

5. DeCoste B, Schölkopf B (2002) Mach Learn 46:161–1906. Drucker H, Wu D, Vapnik VN (1999) IEEE Trans Neural Netw

10:1048–10547. Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M,

Hausseler D (2000) Bioinformatics 16:906–9148. SVM application list http://www.clopinet.com/isabelle/Projects/

SVM/applist.html9. Belousov AI, Verzakov SA, Von Frese J (2002) Chemom Intell

Lab Syst 64:15–2510. Belousov AI, Verzakov SA, Von Frese J (2002) J Chemometrics

16:482–48911. Fernández Pierna JA, Baeten V, Michotte Renier A, Cogdill RP,

Dardenne P (2004) J Chemometrics 18:341–34912. Hamilton MC, Hites RA, Schwager SJ, Foran JA, Knuth BA,

Carpenter DO (2005) Environ Sci Technol 39:8622–862913. http://www.eurofins.com/research-development/cofaws/index.asp14. http://www.cordis.lu/growth/calls/top-3.19.htm15. http://europa.eu.int/comm/research/growth/gcc/projects/food-

fraud.html#top16. Tritt KL, O’Bara CJ, Wells MJM (2005) J Agric Food Chem

53:5304–531217. Villarreal BW, Rosenblum PM, Fries LT (1994) Trans Am Fish

Soc 123:194–20318. Forshed J, Schuppe-Koistinen I, Jacobsson SP (2003) Anal Chim

Acta 487:189–19919. Wu W, Daszykowski M, Walczak B, Sweatman BC, Susan C,

Connor SC, Haselden JN, Crowther DJ, Rob W, Gill RW, MichaelW, Lutz MW (2006) J Chem Inf Model 46:863–875

20. Nielsen NPV, Carstensen JM, Smedsgaard J (1998) J ChromatogrA 805:17–35

21. Van Nederkassel AM, Xu CJ, Lancelin P, Sarraf M, MacKenzieDA, Walton NJ, Bensaid F, Lees M, Martin GJ, Desmurs JR,Massart DL, Smeyers-Verbeke J, Vander Heyden Y (2006) JChromatogr A (in press)

22. Pravdova V, Walczak B, Massart DL (2002) Anal Chim Acta456:77–92

23. Tomasi G, Van den Berg F, Andersson C (2004) J Chemometrics18:231–241

24. Chen N, Lu W, Yang J, Li G (2004) Support vector machine inchemistry. World Scientific, Singapore

25. Haykin S (1999) Neural networks (a comprehensive foundation).Prentice Hall

26. Suykens JAK, Gestel TV, Brabanter JD, De Moor B, Vandewalle J(2002) Least square support vector machines. World Scientific,Singapore

27. Amendolia SR, Cossu G, Ganadu ML, Golosio B, Masala GL,Mura GM (2003) Chemom Intell Lab Syst 69:13–20

28. The kernel machines http://www.kernel-machines.org29. Image speech and intelligent systems research group (1998)

University of Southampton, UK, available on http://www.isis.ecs.soton.ac.uk/isystems/kernel

30. Chih-Chung C, Chin-Jen L (2002) National Taiwan University,available on http://www.csie.ntu.edu.tw/~cjlin

31. Stansby ME (1981) J Am Oil Chem Soc 58:13–1632. Suzuki H, Okazaki K, Hayakawa S, Wada S, Tamura S (1986) J

Agric Food Chem 34:58–6033. Bergstrom E (1989) Aquaculture 82:205–21734. Alasalvar C, Taylor KDA, Zubcov E, Shahidi F, Alexis M (2002)

Food Chem 79:145–15035. Cozzolino D, Murray I, Chree A, Scaife JR (2005) LWT 38:821–

82836. Jalali-Heravi M, Vosough M (2004) J Chromatogr A 1024:165–

17637. Cai SS, Syage JA (2006) Anal Chem 78:1191–119938. Vandeginste BGM, Massart DL, Buydens LM, De Jong S, Lewi

PJ, Smeyers-Verbeke J (1998) Handbook of chemometrics andqualimetrics: part B. Elsevier, Amsterdam

