Rapid and non-destructive identification of strawberry cultivars by direct PTR-MS headspace analysis and data mining techniques

Sensors and Actuators B xxx (2006) xxx–xxx

Pablo M. Granitto a, Franco Biasioli a,∗, Eugenio Aprea a,b, Daniela Mott a, Cesare Furlanello c, Tilmann D. Märk b, Flavia Gasperi a

a Istituto Agrario di S. Michele a/A, S. Michele, Via E. Mach 2, 38010, Italy
b Institut für Ionenphysik, Universität Innsbruck, Technikerstr. 25, A-6020 Innsbruck, Austria
c ITC/IRST Centro per la Ricerca Scientifica e Tecnologica, Via Sommarive 18, 38050 Povo, Trento, Italy

∗ Corresponding author. Tel.: +39 0461 61 51 87; fax: +39 0461 65 09 56. E-mail address: [email protected] (F. Biasioli).

Received 30 January 2006; received in revised form 27 March 2006; accepted 27 March 2006

0925-4005/$ – see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.snb.2006.03.047

Abstract

Proton transfer reaction-mass spectrometry (PTR-MS) is a spectrometric technique that allows direct injection and analysis of mixtures of volatile compounds. Its coupling with data mining techniques provides a reliable and fast method for the automatic characterization of agroindustrial products. We test the validity of this approach to identify samples of strawberry cultivars by measurements of single intact fruits. The samples used were collected over 3 years and harvested in different locations. Three data mining techniques (random forests, penalized discriminant analysis and discriminant partial least squares) have been applied to the full PTR-MS spectra without any preliminary projection or feature selection. We tested the classification models in three different ways (leave-one-out and leave-group-out internal cross validation, and leaving a full year aside), thereby demonstrating that strawberry cultivars can be identified by rapid non-destructive measurements of single fruits. Performances of the different classification methods are compared.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Proton transfer reaction-mass spectrometry; Volatile organic compounds; Random forest; Penalized discriminant analysis; Discriminant partial least squares; Data mining

1. Introduction

The importance of volatile organic compounds (VOCs) for the characterization of agroindustrial products, fruits in particular, drove the research for new integrated methods, from headspace sampling to data analysis, to classify fruits based on VOCs emission. Non-invasive and effective quality control [1] and taxonomic issues [2] are of paradigmatic importance.

For a crop like strawberry, VOCs emissions are particularly relevant: its aroma is one of the most popular and recognized by consumers and has widespread applications in the food and flavour industry. This led to the publication of a number of papers on the measuring of volatile compounds emitted by strawberries, reviewed for example in Ref. [3,4] and on studies on the link of strawberry aroma with cultivar [5], cultivation methods [6] and postharvest processing [7]. More generally the biogenesis of strawberry volatiles is a popular research field that brings together different areas of science from molecular biology to sensory analysis [8]. It is considered as a benchmark for a multidisciplinary approach to the understanding of the flavour of fruits for its complexity and applicative relevance [8].

The development of improved VOCs sensors is an active field of research [9–11]. Proton transfer reaction-mass spectrometry [12] is one of the techniques of choice for fast and effective VOCs detection in particular when the focus is on the availability of a rapid and quantitative fingerprinting, possibly highly correlated with other features of practical interest. When used in this mode, the mass spectrum measured by PTR-MS is considered as the equivalent of an array of sensors, giving a fingerprint of the total volatile profile [13].

After testing a semi-static sampling method and evaluating the use of multivariate analysis of PTR-MS spectra [13] we studied, in a previous work [14], the possibility of analysing PTR-MS spectra by discriminant analysis for the classification of strawberry cultivars. Only a reduced number of samples was evaluated and most of the cultivars were harvested in one location and at the same time. In this paper, we evaluate the proposed method in a real-world scale, considering samples of nine strawberry cultivars collected over 3 years and harvested in different locations.

Usually standard discrimination methods are applied for classification based on instrumental data, e.g., linear discriminant analysis (LDA) or, as in our preliminary study [14], discriminant partial least squares (dPLS). However, it is common knowledge in the data mining community that there are no universally good classification methods, and different datasets may require different algorithms to provide optimum performances [15]. Based on this and on our recent experience on sensory data [16], we also test (for the first time for PTR-MS data and, as far as we know, for spectra produced by chemical ionization) two modern discrimination techniques that should provide optimum performances in classification reducing, at the same time, the risk of overfitting: random forest and penalized discriminant analysis.

2. Materials and methods

2.1. Samples

Among the PTR-MS measurements of single whole strawberry fruits evaluated in our laboratories during various measurement campaigns in 2002, 2003 and 2004, we selected data of nine cultivars that have been evaluated in each year with a relatively balanced number of samples. All fruits were harvested in the Cesena and Verona provinces (north Italy) both in experimental and productive fields. Other differences in cultivation methods (organic/traditional, tunnel/open field) are present but not discussed here. Fruits were collected in small batches of two to five fruits (in most cases three) for each cultivar. Fruits of the same batch have been collected in the same place and in the same day and different batches of the same cultivar are completely independent. After harvesting, fruits were transported to our laboratories where measurements took place after a 2–5 day period of storage at 4 °C. No specific criteria for selection of the fruits were applied except for a rough evaluation of a proper ripening and the absence of evident defects or peculiarities. In Table 1, we report for each cultivar the one-digit code used throughout this paper, their parents and the number of fruits evaluated for each year.

Table 1
Codes, names, parentals and the number of samples evaluated in each year for the nine cultivars studied in this work

Code   Cultivar     Parental     Parental      Samples 2002  Samples 2003  Samples 2004  Samples total
1      CS 96627     CS 91.143.5  Miss          4             6             12            22
2      Queen Elisa  Miss         Usb 35        6             7             17            30
3      CS 966210    CS 91.143.5  Miss          4             6             18            28
4      CS 972694    Darselect    CS 91.143.5   3             6             13            22
5      Patty        Marmolada    Honeoye       3             9             18            30
6      VR 96571     CS 91.143.5  CS 89.250.2   4             6             11            21
7      VR 96582     CS 91.143.5  CS 90.608.1   3             6             19            28
8      VR 97645     Darselect    CS 89.384.20  4             6             13            23
9      Miss         CS 80.39.1   Dana          6             6             17            29
Total                                          37            58            138           233

2.2. Measurements

PTR-MS is a mass spectrometric technique based on a particular implementation of chemical ionization using proton transfer from protonated water ions to the volatile substance to be detected [12]. More in detail, a relatively pure H3O+ ion beam is formed in a hollow cathode discharge and it crosses the volatile mixture that continuously flows in a drift tube; eventually all molecules with a proton affinity higher than water (most of the volatile organic compounds fulfil this condition while normal air constituents do not, allowing the use of air as carrier gas) are ionized and, subsequently, analyzed and detected by a standard quadrupole mass spectrometer [12]. During the 3-year period of measurements considered in this work (2002–2004), five different commercial PTR-MS (Ionicon Analytik, Innsbruck, Austria) apparatuses have been used both in standard (2002 and 2003) and high sensitivity configuration (2003 and 2004). Table 2 shows the experimental settings of the main PTR-MS parameters used.

Table 2
Cultivars measured and experimental settings of the various PTR-MS apparatuses used in this study

Year  Type      Drift voltage (V)  Drift temperature (°C)  E/N (Td)  Cultivars
2002  Standard  600                40                      135       All
2003  HS        540                75                      135       2, 9
2003  Standard  540                75                      135       All
2003  Standard  400                75                      101       1, 3, 4, 5, 6, 7, 8
2004  HS        600                40                      135       All
2004  HS        400                40                      91        All
2004  HS        600                75                      151       All
2004  HS        540                75                      135       All

All measurements have been carried out by direct sampling of the headspace of single whole strawberry fruits without any pre-treatment. Samples were evaluated by the measuring procedure described in Ref. [14]. Basically, it involves the stabilization of the headspace in a glass vessel (400 ml) before the measurement of five complete mass spectra (4 min). To avoid possible systematic memory effects from one measurement to the next one, the apparatus was flushed with lab air between the measurements, replicate order was randomized and we used different glass vessels for every fruit. The spectra considered here range over a mass to charge ratio (m/z) from 20 up to 250 amu. The average of the last three spectra has been used to characterize each sample. Conversion into ppb has been performed by the formula [17]:

\[
\mathrm{ppb} = \frac{1}{k \cdot t} \cdot \frac{[\mathrm{C}^{+}]}{[\mathrm{H_3O^{+}}]} \cdot 10^{9} \cdot \frac{K_B \cdot T}{P} \qquad (1)
\]

where k is the reaction constant for the proton transfer reaction, t the drift time in the reaction chamber, [C+] the measured ion intensity (counts/s), [H3O+] the intensity of the primary ion beam (counts/s), KB the Boltzmann constant, T the temperature and P is the pressure. We use the same value k = 2 × 10^−9 cm^3/s for all masses. While the normalization to H3O+ intensity and gas density present in this conversion formula should allow the comparison of measurements obtained with different apparatuses and experimental conditions, the presence of residual fragmentation and its dependence on impact energy introduce systematic differences among spectra acquired at different E/N values. In this sense, the comparison of spectra collected at different E/N provides a fingerprint with systematic shifts that makes the classification task more difficult. For this reason, the classification errors shown in this paper should be considered as a pessimistic estimation of the performances of the proposed approach.
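As an illustration only, Eq. (1) can be sketched in a few lines of Python. The unit choices (k in cm^3/s, t in s, T in K, P in Pa) and the example readings are assumptions made for this sketch; the drift time and drift pressure used by the authors are not given in the text, and the function name is hypothetical.

```python
# Sketch of Eq. (1): convert PTR-MS ion count rates to a mixing ratio in ppb.
BOLTZMANN_J_PER_K = 1.380649e-23  # K_B in J/K


def counts_to_ppb(c_cps, h3o_cps, drift_time_s, temperature_k, pressure_pa,
                  k_cm3_per_s=2e-9):
    """Return the mixing ratio (ppb) for one mass channel.

    k_cm3_per_s defaults to the single reaction constant quoted in the text.
    """
    # Analyte number density in the drift tube (molecules per cm^3).
    analyte_density = (c_cps / h3o_cps) / (k_cm3_per_s * drift_time_s)
    # Volume per gas molecule, K_B*T/P, converted from m^3 to cm^3.
    volume_per_molecule_cm3 = BOLTZMANN_J_PER_K * temperature_k / pressure_pa * 1e6
    return analyte_density * volume_per_molecule_cm3 * 1e9


# Hypothetical readings (assumed, not from the paper): 1000 counts/s of product
# ion, 1e6 counts/s of primary ion, 100 microsecond drift time, 40 degC, 200 Pa.
example_ppb = counts_to_ppb(1000.0, 1e6, 1e-4, 313.15, 200.0)
```

Because the result is linear in the ratio [C+]/[H3O+], drifts in the primary ion intensity cancel out, which is the normalization the paragraph above refers to.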

Based on the indications of [14], we used only spectra normalized to unit area. By this we are assuming that the information is not in the absolute but rather in the relative intensities of the spectrometric peaks. In Fig. 1, we show, for informative purposes only, the average spectra of the first four cultivars of Table 1. Relevant features are, for example, the series of peaks related to esters at m/z = 61 + 14n and the peaks at m/z = 33 (methanol), m/z = 59 (acetone and propanal), m/z = 37 and 55 related to water clusters. In the present work, however, we use PTR-MS spectra as anonymous fingerprints of the samples and the chemical information entangled in the spectra is not used.
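Normalization to unit area amounts to dividing every intensity by the sum over the spectrum. A minimal sketch follows, assuming a spectrum stored as a mapping from m/z to intensity; the function name and the example intensities are hypothetical.

```python
def normalize_to_unit_area(spectrum):
    """Scale a spectrum {m/z: intensity} so that the intensities sum to 1."""
    total = sum(spectrum.values())
    if total <= 0:
        raise ValueError("spectrum has no signal to normalize")
    return {mz: intensity / total for mz, intensity in spectrum.items()}


# Hypothetical intensities (ppb) at a few of the m/z values named in the text.
raw = {33: 120.0, 37: 40.0, 59: 30.0, 61: 10.0}
fingerprint = normalize_to_unit_area(raw)
```

After this step two fruits that emit the same mixture at different overall concentrations give the same fingerprint, which is the assumption stated above: only relative intensities carry the information.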

2.3. Discrimination methods

The last decade has seen the development of new powerful statistical learning methods and, in particular, a lot of work was devoted to two strategies: ensemble methods and the generalization of linear models [18]. In this work, we selected a representative method from each class in order to explore their potential in PTR-MS applications. We also implement dPLS models, as in our previous study [14]; dPLS is probably the most popular algorithm for spectrometric applications. We use implementations available as free packages for the R statistical environment software [19]. As mass spectrometric data are usually high dimensional (a few hundred peaks in the case of PTR-MS), typically a data compression method like principal component analysis (PCA) is used to reduce the dimensionality as a pre-processing step. However, as this procedure involves the risk of overfitting [20], we choose to use the raw measures of concentration in the PTR-MS spectra as inputs of the discrimination methods, without any other pre-processing than the normalization described in Ref. [14].

Fig. 1. Example of the average normalized spectra (over all samples and experimental conditions) of four cultivars considered in the paper coded as in Table 1. Most intense peaks are labelled by their m/z ratios.

2.3.1. Random forests (RF)

The statistical principle of the “bias-variance trade-off” is at the basis of the new modelling tools based on ensembles (or groups) of predictors: if a combined classifier is formed out of several low-bias-high-variance models, variance is reduced and the result is a classification method which is on average more accurate than its members [21]. Furthermore, the accuracy of the ensemble is positively correlated to the level of diversity of the individual members [22,23].

A random forest (RF) [24] is formed by growing several decision trees. To create diversity among the members of the ensemble, RF fits each tree on a bootstrap replicate of the full set of examples [25] and allows only a small random sample of all features to be chosen as candidates by the splitting algorithm when adding a new node. The combination of these two sources of diversity leads to easy-to-build ensembles with usually good performance.

RF is very resistant to overfitting (even when the ensemble contains thousands of trees) and usually performs well in problems with a low samples/features ratio, like DNA-microchip experiments or spectroscopic and spectrometric data. The algorithm has, in practice, only one free parameter: the number m of features made available at each node during the growing of trees. But, as shown by Breiman [24], results are not strongly dependent on this parameter, and the default value of m (the square root of the total number of features) usually gives near optimal results. We use this default value in all experiments presented.

2.3.2. Penalized discriminant analysis (PDA)

LDA is a standard tool for classification and dimension reduction [26]. Roughly, it seeks a linear combination of the features which maximizes the ratio of its between-class variance to its within-class variance. After that, classes are typically assigned according, e.g., to Mahalanobis distances to class centroids in this transformed space. LDA can perform well in many cases [27], but can be too flexible and overfit in situations with hundreds of features and a reduced number of samples, like PTR-MS data. Penalized discriminant analysis (PDA) [28] is a regularized version of LDA, more appropriate for such situations. The method is based on recasting the LDA problem as a regularized linear regression one, and then applying any of the many well-known techniques available for this task. We use standard ridge regression [18], which has only one free parameter, the ridge constant lambda that penalizes high values of the fitted variables.

2.3.3. Discriminant partial least squares (dPLS)

The dPLS algorithm has been extensively described and used in the chemometrics literature [29] and we used it for PTR-MS spectra of strawberry headspace in Ref. [14]. Basically, it has two steps, a PLS projection followed by the application of LDA in the projected subspace. The number of scores used in the projection step plays the role of a regularizing parameter, controlling the overall flexibility of the model.

2.4. Evaluation methods

In a first experiment, we perform an internal leave-one-out (LOO) cross validation with the full dataset of 233 measured fruits. One at a time, each sample was removed from the dataset and an RF, PDA or dPLS model was developed over the rest of the data. Free parameters, like the value of the ridge constant for PDA or the number of scores kept by dPLS, were chosen using an internal cross validation at this step. Finally, the obtained models classified the sample left aside. The procedure was iterated over the full dataset. This is the most idealized case: we suppose that the samples from 3 years and the diverse conditions of measurement are a representative sample of the whole population, so one expects new samples arriving in the future to be similar to some of the ones already measured.

In a second experiment, to estimate classification errors with less-assuming conditions, we consider the use of small batches of samples instead of individual fruits to perform the cultivar discrimination. In this case, 75 small groups were selected according to similar experimental conditions and same production batch (same location and harvesting time). Most batches are formed by three strawberries but a few of them are formed by two, four or five fruits. As the groups have a unique combination of year/conditions, this experiment simulates the identification of individual fruits from an unknown production batch measured at a different place/time with different equipment. Instead of the leave-one-out, we perform a leave-group-out (LGO) evaluation, iterating the process of leaving a group out as test set, and using the rest of the dataset to fit the models (again, free parameters are selected at this step by internal cross validation). After that, those models individually classified the samples of the independent test batch.

The third evaluation method involves the use of a full year of production as test set (leave-year-out: LYO). This is the most difficult classification task, because all information about the apparatuses and year is deleted from the training set.

We set aside the year 2002, as it is the smallest subset (around 15% of the samples) and also because experimental conditions are relatively close to at least one of the conditions used in 2003 and 2004. In this experiment an RF, a PDA and a dPLS model were adjusted using samples from years 2003/2004, and each sample from year 2002 was classified with these models. Free parameters were previously chosen using the full year 2003 as a validation set for models developed only with data from 2004.
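The three validation schemes differ only in which samples are held out together. The following sketch, in plain Python rather than the R packages used in the paper, generates the train/test index splits; the helper name and the toy batch and year labels are hypothetical, and the model fitting and internal parameter selection are left out.

```python
def leave_group_out(group_labels):
    """Yield (train_indices, test_indices), holding out one group at a time."""
    labels = list(group_labels)
    distinct = []
    for label in labels:
        if label not in distinct:
            distinct.append(label)
    for held_out in distinct:
        test = [i for i, g in enumerate(labels) if g == held_out]
        train = [i for i, g in enumerate(labels) if g != held_out]
        yield train, test


# Hypothetical toy data: six fruits with their production batch and year.
batches = ["b1", "b1", "b2", "b2", "b2", "b3"]
years = [2002, 2002, 2003, 2003, 2004, 2004]

# LOO: every fruit is its own group.
loo_splits = list(leave_group_out(range(len(batches))))
# LGO: fruits of the same batch are always held out together.
lgo_splits = list(leave_group_out(batches))
# LYO as in the text: only the split that holds out 2002 is used.
lyo_split = next(s for s in leave_group_out(years)
                 if all(years[i] == 2002 for i in s[1]))
```

Holding out whole batches or a whole year keeps fruits that share harvest and instrument conditions out of the training set, which is why the LGO and LYO errors are expected to be higher than the LOO ones.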


The three procedures described above aim at the classification of individual fruits. Another application of the methods described here is the identification of the cultivar of small batches.

In order to classify a batch, each sample is evaluated by a model that assigns a probability to each of the nine possible cultivars (instead of directly assigning a class label). These class probabilities are averaged over all the samples of the batch, which is then associated with the cultivar with the highest average probability. This procedure was applied to the three evaluation methods described before, leave-one-out, leave-group-out and leave-year-out validation, using the partition into independent batches already used for the leave-group-out evaluation.
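The batch rule described above can be sketched as follows. The function name and the probabilities are hypothetical and only three cultivar codes are shown for brevity; in the paper the per-fruit probabilities come from the RF, PDA or dPLS models.

```python
def classify_batch(sample_probabilities):
    """Average per-fruit class probabilities; return (best_class, averages)."""
    n_samples = len(sample_probabilities)
    classes = list(sample_probabilities[0].keys())
    averages = {
        c: sum(p[c] for p in sample_probabilities) / n_samples for c in classes
    }
    best = max(averages, key=averages.get)
    return best, averages


# Hypothetical probabilities for a batch of three fruits.
batch = [
    {5: 0.6, 9: 0.3, 3: 0.1},
    {5: 0.4, 9: 0.5, 3: 0.1},  # taken alone, this fruit would be assigned to 9
    {5: 0.7, 9: 0.2, 3: 0.1},
]
cultivar, mean_probs = classify_batch(batch)
```

In this toy batch the second fruit is misclassified on its own, but the averaged probabilities still favour cultivar 5, which illustrates how averaging can correct isolated errors.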

3. Results and discussion

In Fig. 2, we show the two most informative components of principal component analysis (PCA), PDA and dPLS models, all fitted using the whole dataset. The plots only provide qualitative information about relative positions and distances of the diverse varieties, but have moderate value in order to evaluate discrimination capabilities. The top panel shows that even an unsupervised technique like PCA provides a relatively good separation of several cultivars with only two components. For example, cultivars 5 and 9, partly confused on the left of the biplot, are separated from 1, 2 and 6 on the opposite region. The same holds for cultivars 4 and 8, partly confused, with positive ordinate values, which are clearly separated from cultivar 3 and, less clearly, from 9. PDA and PLS training data (middle and lower panel) show that the relative positions of the cultivars are more or less conserved in all methods. For example, cultivars 1, 2 and 6 lie on one side, opposite to 3, 5 and 9 on the other.

Fig. 2. Plot of the most informative components for PCA analysis (upper panel) of PTR-MS data and for the training models of PDA (middle panel) and PLS (lower panel). Symbol shape and colour identify cultivars. The coding is the same indicated in the middle panel for all graphs.

Table 3 summarizes the mean classification errors for all the algorithms and evaluation experiments. The first row shows the average errors for the LOO experiment with individual fruits, which are low and very similar. Table 4 shows the same results in the form of confusion matrices, where the rows correspond to the real classes of the samples, and the columns to the classes assigned by a particular discrimination method. All methods show similar behaviours with good classification performances and the errors scattered almost uniformly around classes. Possible exceptions are classes 3 and 5 for RF and 5 and 9 for dPLS that seem partially overlapped. Some error counts that are present only for one method (e.g., 2–7 for PDA) indicate samples that probably lie on the border between classes, while errors repeated by all methods (e.g., one sample of class 5 that is classified as class 9) are probably outliers that are deep into the space covered by another class. In general, LOO results confirm the qualitative findings of Fig. 1, about the good separation provided by PTR-MS spectra. The second row of Table 3 shows the corresponding results for evaluation of small batches. PDA and dPLS classify all batches correctly and RF only misclassifies one batch. This result is not surprising because individual classifications showed only scattered errors, which can be easily corrected by averaging a few samples.

Table 3
Relative classification errors (%) for random forest (RF), penalized discriminant analysis (PDA) and discriminant partial least squares (dPLS) for three different validation methods in the case of single fruits and small independent batches

Validation method  Test set      RF    PDA   dPLS
Leave-one-out      Single fruit  7.3   6.9   6.0
                   Batch         1.3   0     0
Leave-group-out    Single fruit  17.8  17.0  15.4
                   Batch         12.0  9.3   5.3
Leave-year-out     Single fruit  21.6  13.5  16.2
                   Batch         9.0   9.0   9.0

The third and fourth rows of Table 3 show the mean classification error for the second test: leave-group-out (LGO). Results for single fruits are similar for all three methods, in all cases higher than the corresponding LOO levels. For classification of small batches dPLS outperforms the other two methods, with only four errors over the 75 groups (7 for PDA and 9 for RF). Table 5 shows the confusion matrices for this case. Again, for RF classes 3–5 still have the highest error, and classes 5–9 are the most confused for PDA. Overall, the results of these experiments show that small batches of closely related strawberries can be accurately identified with PTR-MS spectra, even when varying the measuring conditions and/or year/place of production.

Table 4
Confusion matrices for leave-one-out tests for the three classification methods evaluated in this work

      1   2   3   4   5   6   7   8   9
RF
1    20   0   0   0   0   2   0   0   0
2     0  30   0   0   0   0   0   0   0
3     0   0  23   1   4   0   0   0   0
4     0   0   0  20   1   0   1   0   0
5     0   0   0   1  28   0   0   0   1
6     0   0   0   0   0  21   0   0   0
7     0   0   0   0   0   0  26   2   0
8     0   0   0   1   0   0   0  22   0
9     0   0   1   0   1   0   1   0  26
PDA
1    21   0   0   0   0   1   0   0   0
2     0  28   0   0   1   0   1   0   0
3     0   0  24   2   1   0   0   0   1
4     0   0   0  20   0   0   1   1   0
5     0   0   0   0  29   0   0   0   1
6     0   0   0   0   0  21   0   0   0
7     0   0   0   1   0   0  27   0   0
8     0   0   0   2   0   1   0  20   0
9     0   0   1   0   1   0   0   0  27
PLS
1    22   0   0   0   0   0   0   0   0
2     0  29   0   0   1   0   0   0   0
3     0   0  25   1   1   0   0   0   1
4     0   0   0  21   0   0   1   0   0
5     0   0   0   0  29   0   0   0   1
6     0   0   0   0   0  21   0   0   0
7     0   0   0   1   0   0  27   0   0
8     0   0   0   2   0   1   0  20   0
9     0   0   0   0   3   0   1   0  25

The first column indicates the actual cultivar and the first row the assigned cultivar. Upper panel: random forest; middle panel: penalized discriminant analysis; lower panel: discriminant partial least squares.
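The relative errors of Table 3 follow directly from the confusion matrices of Table 4: the error is the fraction of samples lying off the diagonal. As a small worked check (a sketch, with a hypothetical function name), the random forest leave-one-out matrix gives 17 misclassified fruits out of 233, i.e. the 7.3% listed in Table 3.

```python
# Random forest leave-one-out confusion matrix copied from Table 4
# (rows: actual cultivar 1-9, columns: assigned cultivar 1-9).
RF_LOO = [
    [20, 0, 0, 0, 0, 2, 0, 0, 0],
    [0, 30, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 23, 1, 4, 0, 0, 0, 0],
    [0, 0, 0, 20, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 28, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 21, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 26, 2, 0],
    [0, 0, 0, 1, 0, 0, 0, 22, 0],
    [0, 0, 1, 0, 1, 0, 1, 0, 26],
]


def relative_error_percent(confusion):
    """Percentage of samples that fall off the diagonal of a confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return 100.0 * (total - correct) / total


rf_error = relative_error_percent(RF_LOO)  # 17 errors out of 233 fruits
```

The row sums of the matrix also reproduce the per-cultivar sample totals of Table 1, which is a convenient consistency check when transcribing the tables.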

The results of the LYO experiment agree with this conclusion. The last two rows of Table 3 show that in this case PDA is the best performer, with 5 errors over 37 samples, against 6 and 8 made by dPLS and RF, respectively, for individual fruits classification. Whether the increase of error for this last test is related to actual year-to-year variability or, more likely, to the difference in experimental conditions cannot be decided but, as stated before, this is an upper limit for the classification error of the proposed approaches. When classifying batches in this setting, all methods correctly identify 10 out of 11 groups. The misclassified group is the same for all three methods: class 5 is classified as class 9.

Is the relatively higher confusion between some pairs of classes related to limitations of the classification methods, which have the difficult task of discriminating between nine classes at the same time? To answer this question, we select two couples of classes with high confusion for the LOO experiment, classes 3–5 (because it has the maximum error with four samples misclassified by RF) and 4–8 (because they systematically show confusion in all classification experiments). We repeat the LOO experiment, but with datasets containing only samples from two classes: 3–5 (in the first case) or 4–8 (in the second). For both binary problems and all classification methods, the degree of confusion does not decrease significantly indicating that all methods can efficiently handle the complexity of a multiclass problem. We will further investigate this issue for PTR-MS data in the future.

Table 5
As in Table 4 for leave-group-out tests

      1   2   3   4   5   6   7   8   9
RF
1     5   1   0   0   0   1   0   0   0
2     0   9   0   0   0   0   0   0   0
3     0   0   7   0   2   0   0   0   0
4     0   0   0   5   0   0   1   1   0
5     0   0   0   0   9   0   0   0   1
6     1   0   0   0   0   6   0   0   0
7     0   0   0   0   0   0   9   0   0
8     0   0   0   1   0   0   0   6   0
9     0   0   0   0   0   0   0   0  10
PDA
1     7   0   0   0   0   0   0   0   0
2     0   8   0   0   0   0   1   0   0
3     0   0   9   0   0   0   0   0   0
4     0   0   0   5   0   0   1   1   0
5     0   0   0   0   9   0   0   0   1
6     0   0   0   0   0   7   0   0   0
7     0   0   0   0   0   0   9   0   0
8     0   0   0   1   0   0   0   6   0
9     0   0   0   0   2   0   0   0   8
PLS
1     7   0   0   0   0   0   0   0   0
2     0   9   0   0   0   0   0   0   0
3     0   0   8   0   1   0   0   0   0
4     0   0   0   7   0   0   0   0   0
5     0   0   0   0  10   0   0   0   0
6     0   0   0   0   0   7   0   0   0
7     0   0   0   0   0   0   9   0   0
8     0   0   0   1   0   0   0   6   0
9     0   0   1   0   1   0   0   0   8

4. Conclusions

PTR-MS fingerprinting of the volatile compounds emitted by strawberry fruits can be effectively used for the rapid and non-destructive identification of both whole single fruits and small batches. As expected, averaging over a small group of fruits decreases the identification errors reducing the role of possible outliers or experimental errors. Different classification methods have been compared. Overall, dPLS shows good performances, giving the best result in five experiments and a good error level in the remaining one. PDA is the best performer for the LYO experiment. This is probably related with its more “rigid” nature according to the general principle that rigid (regularized) models generalize better in the case of big differences between the training and test sets [30].

We choose a straightforward approach using data available from various field campaigns obtained with different apparatuses and experimental settings avoiding any data pre-processing like outlier detection (this could be the case of a sample of cultivar 5 that is systematically assigned to class 9 in every experiment) or feature selection/reduction. The reported classification errors should thus be considered as a pessimistic estimate and we expect that they can be reduced using homogeneous experimental conditions or refining the statistical approach.

In fact, experiments on reduced datasets with homogeneous E/N values (data not shown) provide near-perfect classification. The present work extends our previous findings [14] in three main aspects: (i) the use of several modern data mining techniques and validation methods, (ii) the extension of the sampling over different places and years and (iii) the use of different apparatuses and experimental set-ups. This indicates that the proposed methodology is useful in real applications even if the data come from different laboratories/places/years.

The links between PTR-MS data and the sensory [31,32] and molecular characterization [33] of agroindustrial products, together with the results of this work, indicate the opportunity to further investigate the application of PTR-MS as an informative classification tool to assist breeders in selection and for quality control in applied settings.

Acknowledgments

This work has been partly supported by MURST-MIUR project QUALIFRAPE and PAT projects SAMPPA and INTERBERRY. We thank W. Faedi and A. Testoni for providing the fruits and for their support.

References

[1] R. Dorfner, T. Ferge, C. Yeretzian, A. Kettrup, R. Zimmermann, Laser mass spectrometry as on-line sensor for industrial process analysis: process control of coffee roasting, Anal. Chem. 76 (2004) 1386–1402.

[2] P.J. Dunlop, C.M. Bignell, D. Brynn Hibbert, M.I.H. Brooker, Use of gas chromatograms of the essential leaf oils of the genus eucalyptus for taxonomic purposes: E. subser. Euglobulares (Blakely), Flavour Frag. J. 18 (2003) 162–169.

[3] A. Williams, D. Ryan, A. Olarte Guasca, P. Marriott, E. Pang, Analysis of strawberry volatiles using comprehensive two-dimensional gas chromatography with headspace solid-phase microextraction, J. Chromatogr. B 817 (2005) 97–107.

[4] I. Zabetakis, M.A. Holden, Strawberry flavour: analysis and biosynthesis, J. Sci. Food. Agric. 74 (1997) 421–434.

[5] M.A. Hakala, A.T. Lapvetelainen, H.P. Kallio, Volatile compounds of selected strawberry varieties analyzed by purge-and-trap headspace GC-MS, J. Agric. Food Chem. 50 (2002) 1133–1142.

[6] S.Y. Wang, W. Zheng, G.J. Galletta, Cultural system affects fruit quality and antioxidant capacity in strawberries, J. Agric. Food Chem. 50 (2002) 6534–6542.

[7] C. Pelayo, S.E. Ebeler, A.A. Kader, Postharvest life and flavor quality of three strawberry cultivars kept at 5 °C in air or air + 20 kPa CO2, Postharvest Biol. Technol. 27 (2003) 171–183.

[8] K.G. Bood, I. Zabetakis, The biosynthesis of strawberry flavor (II): biosynthetic and molecular biology studies, J. Food. Sci. 67 (2002) 2–8.

[9] E.N. Schmidt, H. Orsnes, T. Graf, T. Christensen, H. Degn, Monitoring organic compounds in aqueous solution by rotating ball inlet mass spectrometry with continuous wave infrared laser desorption, Sens. Actuators B 76 (2001) 411–418.

[10] C. Peres, F. Begnaud, J. Berdague, Fast characterization of Camembert cheeses by static headspace-mass spectrometry, Sens. Actuators B 87 (2002) 491–497.

[11] M. Vinaixa, A. Vergara, C. Duran, E. Llobet, C. Badia, J. Brezmes, X. Vilanova, X. Correig, Fast detection of rancidity in potato crisps using e-noses based on mass spectrometry or gas sensors, Sens. Actuators B 106 (2005) 67–75.

[12] W. Lindinger, A. Hansel, A. Jordan, On-line monitoring of volatile organic compounds at pptv levels by means of proton-transfer-reaction mass spectrometry (PTR-MS). Medical applications, food control and environmental research, Int. J. Mass Spectrom. Ion Processes 173 (1998) 191–241.

[13] F. Biasioli, F. Gasperi, E. Aprea, L. Colato, E. Boscaini, T.D. Märk, Fingerprinting mass spectrometry by PTR-MS: heat treatment vs. pressure treatment of red orange juice - a case study, Int. J. Mass Spectrom. 223–224 (2003) 343–353.

[14] F. Biasioli, F. Gasperi, E. Aprea, D. Mott, E. Boscaini, D. Mayr, T.D. Märk, Coupling proton transfer reaction-mass spectrometry with linear discriminant analysis: a case study, J. Agric. Food Chem. 51 (2003) 7227–7233.

[15] D.H. Wolpert, The lack of a priori distinctions between learning algorithms, Neural Comput. 8 (1996) 1341–1390, see also, http://www.no-free-lunch.org/.

[16] P.M. Granitto, F. Gasperi, F. Biasioli, C. Furlanello, Modelling sensory analysis datasets: the case of Italian cheeses, in: Proceedings of Argentine Symposium on Artificial Intelligence, 29–30 August 2005, Rosario, Argentina, 2005.

[17] W. Lindinger, A. Hansel, A. Jordan, Proton-transfer-reaction mass spectrometry (PTR-MS): on-line monitoring of volatile organic compounds at pptv levels, Chem. Soc. Rev. 27 (1998) 347–354.

[18] T. Hastie, R. Tibshirani, J.H. Friedman, The Elements of Statistical Learning, Springer-Verlag, New York, 2001.

[19] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2005. ISBN 3-900051-07-0, http://www.R-project.org.

[20] C. Ambroise, G.J. McLachlan, Selection bias in gene extraction in tumour classification on basis of microarray gene expression data, Proc. Natl. Acad. Sci. USA 99 (2002) 6562–6566.

[21] S. Geman, E. Bienenstock, R. Doursat, Neural Networks and the bias/variance dilemma, Neural Comput. 4 (1992) 1–58.

[22] P.M. Granitto, P.F. Verdes, H.A. Ceccatto, Neural network ensembles: evaluation of aggregation algorithms, Artif. Intell. 163 (2005) 139–162.

[23] T.G. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization, Mach. Learn. 40 (2000) 139–158.

[24] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.

[25] B. Efron, R.J. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall, New York, 1983.

[26] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, 1996.

[27] D. Michie, D. Spiegelhalter, C. Taylor (Eds.), Machine Learning, Neural and Statistical Classification, Ellis Horwood, 1994.

[28] T. Hastie, A. Buja, R. Tibshirani, Penalized discriminant analysis, Ann. Stat. 23 (1995) 73–102.

[29] S. Wold, M. Sjostrom, L. Eriksson, PLS-regression: a basic tool of chemometrics, Chemometr. Intell. Lab. 58 (2001) 109–130.

[30] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1999.

[31] F. Gasperi, G. Gallerani, A. Boschetti, F. Biasioli, A. Monetti, E. Boscaini, A. Jordan, W. Lindinger, S. Iannotta, The mozzarella cheese flavour profile: a comparison between judge panel analysis and proton transfer reaction mass spectrometry, J. Sci. Food. Agric. 81 (2000) 357–363.

[32] F. Biasioli, F. Gasperi, E. Aprea, I. Endrizzi, V. Framondino, F. Marini, D. Mott, T.D. Märk, Correlation of PTR-MS spectral fingerprints with sensory characterisation of flavour and odour profile of Trentingrana cheese, Food Qual. Prefer. 17 (2006) 63–75.

[33] E. Zini, F. Biasioli, F. Gasperi, D. Mott, E. Aprea, T.D. Märk, A. Patocchi, C. Gessler, M. Komjanc, QTL mapping of volatile compounds in ripe apples detected by proton transfer reaction-mass spectrometry, Euphytica 145 (2005) 271–281.