Classification of Brazilian soils by using LIBS and variable selection in the wavelet domain

Analytica Chimica Acta 642 (2009) 12–18

Márcio José Coelho Pontes (a), Juliana Cortez (b), Roberto Kawakami Harrop Galvão (c), Celio Pasquini (b), Mário César Ugulino Araújo (a,*), Ricardo Marques Coelho (d), Márcio Koiti Chiba (d), Mônica Ferreira de Abreu (d), Beáta Emöke Madari (e)

(a) Universidade Federal da Paraíba, Departamento de Química, João Pessoa, PB, Brazil
(b) Universidade Estadual de Campinas, Instituto de Química, Campinas, SP, Brazil
(c) Instituto Tecnológico de Aeronáutica, Divisão de Engenharia Eletrônica, São José dos Campos, SP, Brazil
(d) Instituto Agronômico de Campinas, Centro de Pesquisa e Desenvolvimento de Solos e Recursos Ambientais, Campinas, SP, Brazil
(e) Embrapa Solos, Rio de Janeiro, RJ, Brazil

(*) Corresponding author at: Universidade Federal da Paraíba, Departamento de Química – Laboratório de Automação e Instrumentação em Química Analítica/Quimiometria (LAQA), Caixa Postal 5093, CEP 58051-970 – João Pessoa, PB, Brazil. Tel.: +55 83 3216 7438; fax: +55 83 3216 7437. E-mail address: [email protected] (M.C.U. Araújo).

Article history: Received 28 September 2008; Accepted 4 March 2009; Available online 13 March 2009

Keywords: Brazilian soils; Laser-induced breakdown spectroscopy; Classification; Wavelet compression; Successive projections algorithm; Linear discriminant analysis

© 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.aca.2009.03.001

Abstract

This paper proposes a novel analytical methodology for soil classification based on the use of laser-induced breakdown spectroscopy (LIBS) and chemometric techniques. In the proposed methodology, linear discriminant analysis (LDA) is employed to build a classification model on the basis of a reduced subset of spectral variables. For the purpose of variable selection, three techniques are considered, namely the successive projections algorithm (SPA), the genetic algorithm (GA), and a stepwise formulation (SW). The use of a data compression procedure in the wavelet domain is also proposed to reduce the computational workload involved in the variable selection process. The methodology is validated in a case study involving the classification of 149 Brazilian soil samples into three different orders (Argissolo, Latossolo and Nitossolo). For comparison, soft independent modelling of class analogy (SIMCA) models are also employed. The best discrimination of soil types was attained by SPA–LDA, which achieved an average classification rate of 90% in the validation set and 72% in cross-validation. Moreover, the proposed wavelet compression procedure was found to be of value by providing a 100-fold reduction in computational workload without significantly compromising the classification accuracy of the resulting models.

1. Introduction

Soil classification is an important subject in several areas, such as agriculture and civil engineering. In fact, proper handling and use of the soil, including cultivation planning and design of drainage systems, depend on the soil class. The Brazilian System of Soil Classification [1] employs chemical, physical and morphological parameters. However, the reference methods for determination of these parameters are laborious and time-consuming, mainly due to the required sample treatment procedures. In addition, some classification criteria are subjective and difficult to quantify. The American [2] and French [3] systems of soil classification suffer from the same problems.

Some papers have been published on the use of parameters such as fertility [4] and morphological characteristics [5] for soil classification. However, few works have been concerned with the development of analytical techniques and/or data treatment procedures to simplify the use of existing soil classification systems [6–8].

Zagatto [6] classified some types of Brazilian soils on the sole basis of chemical composition. For this purpose, the total contents of 20 elements in soil samples and their extracts were quantified by Inductively Coupled Plasma Optical Emission Spectrometry (ICP-OES) or Atomic Absorption Spectroscopy (AAS) and used as parameters for classification. K-nearest neighbors (KNN) and Soft Independent Modelling of Class Analogy (SIMCA) were employed and a correct classification rate of 80% was obtained.

Demattê et al. [7] evaluated soil types and soil tillage systems by using visible (VIS)–near infrared (NIR) reflectance spectroscopy in the 450–2500 nm range. Different depths were utilized to determine soil classes. Soil survey maps were developed by descriptive interpretation of the spectral curves and statistical analysis. The results were favourably compared to those of a conventional
method in terms of soil line demarcation and number of detected soil classes.

Mouazen et al. [8] employed VIS–NIR reflectance spectroscopy (306.5–1710.9 nm) to discriminate soil texture classes. Factorial discriminant analysis (FDA) was applied to the first five principal components (PCs) resulting from the principal component analysis (PCA) of the VIS–NIR spectra. Four different classes of soil samples were classified with 85.7% and 81.8% of correct classification for the calibration and validation sets, respectively. After two similar classes (coarse and fine sand) were merged, the correct classification rate increased to 89.9% (calibration) and 85.1% (validation).

The present paper proposes a novel analytical methodology for soil classification based on the use of laser-induced breakdown spectroscopy (LIBS). In LIBS, a pulsed laser of high power is focused on the sample surface. The high power per area (irradiance) causes the vaporization of sample constituents and the formation of plasma. The spectrum of emission from the plasma is then acquired and used as analytical response. This technique can be applied to solid, liquid or gaseous materials with little or no sample treatment [9].

LIBS has been successfully applied to classification of different samples, including chemical and biological warfare agent simulants [10], alloys [11], archaeological objects [12], polymers [13], explosives [14], among others. However, only one paper has been published on the use of LIBS in the context of soil classification [15]. In that work, LIBS spectra initially containing more than 50,000 points were reduced to 68 points corresponding to the spectral lines of eight major soil elements (aluminium, silicon, iron, calcium, magnesium, potassium, titanium and manganese) and PCA was applied to the reduced data. As a result, only two soil classes could be discriminated. The substantial dispersion of the remaining samples prevented an adequate classification.

The present paper investigates the use of LIBS and chemometric techniques for classification of Brazilian soil samples into three different orders, namely Argissolo, Latossolo and Nitossolo. These soil orders were defined in the Brazilian System of Soil Classification [1], created in 1999. According to the international classification of FAO (Food and Agriculture Organization of the United Nations) [16], the Argissolo, Latossolo and Nitossolo orders are equivalent to the Acrisol, Ferralsol and Nitisol soil groups, respectively. The Argissolo order consists of exchangeable basic-cation poor, morphologically and physically heterogeneous soils. Latossolo soils are exchangeable basic-cation poor and morphologically homogeneous. The Nitossolo order comprises soils with variable content of exchangeable cations, carrying a unique set of physical and morphological properties that reflects on a typical hydrological and mechanical behaviour. These three orders are representative of humid tropical regions with soils typically developed from highly weathered parent material. These soils are constituted mostly by iron and aluminium oxides (e.g. goethite and gibbsite) and 1:1 (Si:Al) layer silicate (basically kaolinite). According to IBGE [17], Argissolo and Latossolo are predominant in Brazil, as well as in other countries of South America. Nitossolo corresponds to approximately 1% of the Brazilian territory.

Owing to the very large number of variables in a LIBS spectrum, the use of appropriate feature extraction procedures is required. In this context, a possible approach consists of selecting spectral lines corresponding to specific elements [15]. However, in order to reduce the possibility of losing relevant information for the classification task, the present work employs statistical variable selection algorithms instead of a priori considerations. More specifically, the successive projections algorithm (SPA) [18], the genetic algorithm (GA) [18], and a stepwise formulation (SW) [19] are adopted for this purpose. Linear discriminant analysis is then employed to obtain a classification model based on the selected spectral variables. In addition, the use of a data compression procedure in the wavelet domain is proposed to reduce the computational workload involved in the variable selection process. For means of comparison, the results obtained by using SIMCA models are also presented.

2. Theory

The linear discriminant analysis (LDA) classification method employs linear decision boundaries (hyperplanes), which are defined in order to maximize the ratio of between-class to within-class dispersion [20]. In order to have a well-posed problem, the number of calibration (training) objects must be larger than the number of variables to be included in the LDA model. Therefore, the use of LDA for classification of spectral data usually requires appropriate variable selection procedures [18,19,21]. In this section, the three algorithms adopted for this purpose in the present work (SPA, SW, and GA) will be described. Moreover, a wavelet compression (WC) method, which can be employed prior to variable selection, will also be presented.

2.1. Successive projections algorithm

The successive projections algorithm (SPA) was originally proposed by Araújo et al. [22] to minimize multi-collinearity effects and thus improve the conditioning of multiple linear regression (MLR) modelling for spectral data. In the original formulation, candidate subsets of variables were defined as the result of projection operations carried out on the matrix of instrumental response data. These subsets were then used to build MLR models, which were compared in terms of the prediction error in a set of validation samples. This validation set was not employed in either the projection operations or the calibration of the MLR models. At the end, the subset of variables leading to the smallest root-mean-square error of validation (RMSEV) was adopted.
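The projection step is only summarized above. A minimal sketch of how one chain of candidate variables could be built is given below, assuming the commonly described form of SPA in which, at each step, every column of the response matrix is projected onto the subspace orthogonal to the last selected column and the column with the largest remaining norm is chosen. The function name `spa_chain`, the choice of starting column and the absence of any preprocessing are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def spa_chain(X, start, n_select):
    """Return indices of n_select columns of X chosen by successive projections.

    Hypothetical helper: X is (samples x variables), start is the index of the
    first variable of the chain.
    """
    P = np.asarray(X, dtype=float).copy()    # working copy; columns get projected
    selected = [start]
    for _ in range(n_select - 1):
        ref = P[:, selected[-1]]
        # Remove from every column its component along the last selected column.
        P = P - np.outer(ref, ref @ P) / (ref @ ref)
        norms = np.linalg.norm(P, axis=0)
        norms[selected] = -1.0               # never re-select a column
        selected.append(int(np.argmax(norms)))
    return selected

# Column 1 is almost collinear with column 0, so column 2 is picked first.
X = np.array([[1.0, 2.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.1, 0.0]])
print(spa_chain(X, start=0, n_select=3))
```

In the approach described in the text, chains of this kind (for different starting variables and lengths) would then be compared by a validation criterion, RMSEV in the original formulation.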

In a subsequent paper [18], SPA was adapted for use in classification problems. As in the original formulation, the candidate subsets of variables were formed as the result of projection operations intended to minimize multi-collinearity effects, which are a known cause of poor generalization performance in LDA [23]. However, the RMSEV metric was replaced with an average risk $G$ of LDA misclassification. Such a cost function is calculated in the validation set as

$$G = \frac{1}{K_v} \sum_{k=1}^{K_v} g_k, \qquad (1)$$

where $g_k$ (risk of misclassification of the $k$th validation object $x_k$, $k = 1, \ldots, K_v$) is defined as

$$g_k = \frac{r^2(x_k, \mu_{I_k})}{\min_{I_j \neq I_k} r^2(x_k, \mu_{I_j})}. \qquad (2)$$

In this definition, the numerator $r^2(x_k, \mu_{I_k})$ is the squared Mahalanobis distance [24] between object $x_k$ (of class index $I_k$) and the sample mean $\mu_{I_k}$ of its true class. The denominator in Eq. (2) corresponds to the squared Mahalanobis distance between object $x_k$ and the center of the closest wrong class. In the Mahalanobis distance calculations, the sample mean for each class and the pooled covariance matrix for each variable subset under consideration are computed by using the training data.
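As an illustration of Eqs. (1) and (2), the sketch below computes the average risk G for a given subset of variables. It assumes that the pooled covariance is the within-class scatter divided by the number of training objects minus the number of classes; the paper does not state the normalization, and the function name `average_risk` is hypothetical.

```python
import numpy as np

def average_risk(X_train, y_train, X_val, y_val):
    """Average risk G of Eq. (1) for the variables contained in X_train / X_val."""
    X_train, X_val = np.asarray(X_train, float), np.asarray(X_val, float)
    y_train, y_val = np.asarray(y_train), np.asarray(y_val)
    classes = np.unique(y_train)
    means = {c: X_train[y_train == c].mean(axis=0) for c in classes}
    # Pooled covariance from the training set (assumed normalization: N - n_classes).
    resid = np.vstack([X_train[y_train == c] - means[c] for c in classes])
    pooled = resid.T @ resid / (len(X_train) - len(classes))
    inv = np.linalg.inv(pooled)

    def r2(x, mu):
        d = x - mu
        return float(d @ inv @ d)            # squared Mahalanobis distance

    risks = []
    for x, c in zip(X_val, y_val):
        closest_wrong = min(r2(x, means[k]) for k in classes if k != c)
        risks.append(r2(x, means[c]) / closest_wrong)   # g_k of Eq. (2)
    return float(np.mean(risks))

# One variable, two classes with training means 1 and 10 and pooled variance 2.
G = average_risk([[0.0], [2.0], [9.0], [11.0]], [0, 0, 1, 1],
                 [[1.0], [4.0]], [0, 0])
print(G)
```

A value of g_k below one means that the object lies closer to its true class than to any wrong class, so smaller G indicates a better variable subset.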

2.2. Stepwise algorithm

The stepwise (SW) selection algorithm adopted in the present work was proposed by Caneca et al. [19] for classification of diesel-engine lubricating oils on the basis of near and mid-infrared spectra. Initially, the algorithm calculates the discriminability of each spectral variable with respect to the classes under consideration [20]. The variable with the largest discriminability value is selected and a leave-one-out cross-validation procedure is carried out by using LDA. Among the remaining variables, those having a large correlation with the selected one are then discarded to avoid collinearity problems. This process is repeated at each subsequent iteration by successively adding variables to the LDA model until no more variables are available for selection. The subset of variables leading to the smallest number of cross-validation errors is then adopted. If different subsets lead to the same number of cross-validation errors, the subset with the smallest number of variables is chosen.

It is worth noting that, after the second iteration, the discarding of variables is based on the coefficient of multiple correlation, which is defined, for each variable $x_i$ still available for selection, as

$$r_i = \frac{\sigma(\hat{x}_i)}{\sigma(x_i)}, \qquad (3)$$

where $\sigma(\cdot)$ denotes the standard deviation calculated in the training set and $\hat{x}_i$ is an estimate of $x_i$ obtained by multiple linear regression from the variables already selected. If $r_i$ is close to one, variable $x_i$ is redundant because its values can be predicted, with good accuracy, from the variables already included in the LDA model. An inconvenience of this algorithm is the need to set a threshold for $r_i$ in order to decide which variables are to be discarded. However, it is possible to test different threshold values and then compare the resulting LDA models on the basis of the classification errors obtained in a separate validation set.

Fig. 1. Filter bank implementation of the wavelet transform. In this diagram, H, G represent a low-pass and a high-pass digital filter, respectively, and ↓2 denotes the dyadic downsampling operation.

Fig. 2. Diagram of LIBS instrument. (a) Laser source and cooler, (b) Nd:YAG laser head, (c) dichroic mirror, (d) focusing lens, (e) soil sample, (f) sample cell, (g) collecting lens, (h) fiber optic, (i) detector trigger signal, (j) echelle polychromator, (k) ICCD detector and (l) computer.

2.3. Genetic algorithm

The GA is a versatile search technique inspired by the biological mechanisms of evolution by natural selection [25–27]. In variable selection problems, the algorithm typically encodes subsets of variables in the form of strings of binary (0/1) values termed “chromosomes”. Each position (or “gene”) in the chromosome is associated with one of the variables available for selection. Genes with a “1” value indicate that the corresponding variables are to be included in the model. The algorithm starts with a population of randomly generated chromosomes, which are then combined according to certain rules in order to generate a new generation of chromosomes (offspring). This process is repeated until a given stopping criterion is satisfied.

The present work adopts the GA formulation presented in Ref. [18], which has the following features. A fitness value is defined for each chromosome as the inverse of the validation cost defined in Eq. (1) calculated for the subset of variables encoded in the chromosome (“1” genes). The probability of a given chromosome being selected for offspring generation is proportional to its fitness (“roulette” method) [25]. By using this probabilistic method, pairs of chromosomes are formed and then combined to generate pairs of descendants by one-point crossover and mutation operators. The population size is kept constant, each generation being completely replaced by its descendants. However, the best individual is automatically transferred to the next generation (elitism) to avoid the loss of good solutions. This evolutionary process is repeated until a pre-specified number of cycles is completed.
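The GA features listed above (fitness as the inverse of the validation cost, roulette selection, one-point crossover, mutation and elitism) can be sketched in a few lines. This is a generic illustration, not the implementation of Ref. [18]: the function names are hypothetical, the cost function is a toy stand-in for the risk G of Eq. (1) and must be strictly positive, and mutation is assumed to act independently on each gene.

```python
import random

def roulette_pick(population, fitness, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    return rng.choices(population, weights=fitness, k=1)[0]

def one_point_crossover(a, b, rng):
    cut = rng.randrange(1, len(a))               # cut strictly inside the string
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(chrom, p_mut, rng):
    # Assumed per-gene mutation: flip each bit with probability p_mut.
    return [1 - g if rng.random() < p_mut else g for g in chrom]

def next_generation(population, cost, rng, p_cross=0.6, p_mut=0.1):
    fitness = [1.0 / cost(c) for c in population]    # fitness = inverse of the cost
    best = population[fitness.index(max(fitness))]
    children = [best[:]]                             # elitism: best individual survives
    while len(children) < len(population):
        a = roulette_pick(population, fitness, rng)
        b = roulette_pick(population, fitness, rng)
        if rng.random() < p_cross:
            a, b = one_point_crossover(a, b, rng)
        children += [mutate(a, p_mut, rng), mutate(b, p_mut, rng)]
    return children[:len(population)]                # population size kept constant

rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(8)] for _ in range(6)]
toy_cost = lambda chrom: 1.0 + sum(chrom)            # placeholder for G of Eq. (1)
population = next_generation(population, toy_cost, rng)
```

In an actual variable selection run, `toy_cost` would be replaced by a function that builds the LDA model from the variables flagged with "1" and returns its validation risk.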

2.4. Wavelet compression

The SPA, SW and GA algorithms described above may involve considerable computational workload if the number of variables is large, as in the case of LIBS spectra. This problem can be alleviated by using a compression technique to reduce the dimensionality of the data prior to the variable selection procedures. In the present work, a wavelet compression method is adopted for this purpose.

The wavelet transform (WT) is a multi-resolution signal processing tool [28] that has found several applications in denoising, feature extraction and compression of instrumental signals [29–34]. The WT of a spectrum $\mathbf{x} = [x(\lambda_1)\; x(\lambda_2)\; \cdots\; x(\lambda_J)]$, where $\lambda_j$ is the $j$th wavelength, can be obtained by using a digital filter bank structure [28,31,35] of the form depicted in Fig. 1.
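A filter bank of this kind can be sketched for the simplest case, the Haar (Daubechies 1) wavelet, whose low-pass and high-pass filters have length two. The code below is an illustrative assumption rather than the authors' routine: it filters the signal with H and G, keeps one in every two output points, and reapplies the operation to the low-pass output. The function names are hypothetical and the signal length is assumed to be divisible by two at every level.

```python
import numpy as np

def haar_level(x):
    """One level of the filter bank: filter with H and G, then downsample by 2."""
    x = np.asarray(x, dtype=float)
    h = np.array([1.0, 1.0]) / np.sqrt(2.0)    # low-pass filter H
    g = np.array([1.0, -1.0]) / np.sqrt(2.0)   # high-pass filter G
    approx = np.convolve(x, h)[1::2]           # smoothed, coarser version of x
    detail = np.convolve(x, g)[1::2]           # sharp features and noise
    return approx, detail

def haar_decompose(x, levels):
    """Reapply the filter bank to the low-pass output for the given number of levels."""
    details = []
    approx = np.asarray(x, dtype=float)
    for _ in range(levels):
        approx, detail = haar_level(approx)
        details.insert(0, detail)              # coarsest details first
    return np.concatenate([approx] + details)

coeffs = haar_decompose([1.0, 1.0, 3.0, 5.0], levels=2)
print(coeffs)
```

Because the Haar filters are orthonormal, the sum of squared coefficients equals the sum of squared signal values, which is what makes a variance-based ranking of coefficients meaningful.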

The basic structure of the filter bank consists of a pair of low-pass (H) and high-pass (G) filters, followed by a downsampling operation, which discards one in every two points of the filtering outcome. The downsampled output of the low-pass filter, termed “approximation coefficients”, is a smoothed version of the spectrum at a coarser resolution. The downsampled output of the high-pass filter, termed “detail coefficients”, corresponds to high-frequency noise, as well as sharp features of the spectrum, such as narrow peaks. This operation can be reapplied to the approximation coefficients up to the number of decomposition levels specified by the analyst. The result of the transform comprises the final approximation coefficients, as well as the detail coefficients obtained along the entire filter bank. With a slight abuse of language, this result will be henceforth termed “wavelet coefficients”.

The H and G filters employed in the filter bank are typically of finite length, which implies that each approximation or detail coefficient corresponds to a reduced range of wavelengths within the spectrum. This spatial localization feature is often invoked as one of the main advantages of WT over the Fourier transform [28,35]. However, the choice of appropriate H and G filters for a specific application may not be straightforward [29,31]. In the present work, different wavelet filters were tested and compared in terms of compression ability for the LIBS data set under consideration. The decomposition levels were set to the maximum number for which the spatial localization features of the WT are not lost [32]. This limit situation occurs when the H, G filters span the entire length of the downsampled approximation coefficients [36].

3. Experimental

3.1. Brazilian soil data set

A total of 149 Brazilian soil samples of three different orders (Argissolo: 46, Latossolo: 84 and Nitossolo: 19) collected at the B horizon (subsurface layer) were employed in the study. Before LIBS spectral recording, these samples were dried in an oven at 105 °C for 2.5 h, ground and sieved to a particle size smaller than 350 µm.

Table 1. Number of training and validation samples in each class.

Class        Training   Validation
Argissolo    31         15
Latossolo    56         28
Nitossolo    12         7
Total        99         50

Fig. 3. Mean LIBS spectrum of each soil order.

Table 2. Classification rates obtained with GA–LDA, SW–LDA, SPA–LDA and SIMCA for (1) Argissolo, (2) Latossolo and (3) Nitossolo. The number of spectral variables employed in each model is indicated in parentheses. N indicates the number of samples employed in the calculation of the classification rates. Columns 1, 2 and 3 under each model give the predicted class index (%).

True class   N     GA–LDA (17)      SW–LDA (7)       SPA–LDA (5)      SIMCA
                   1    2    3      1    2    3      1    2    3      1    2    3
Validation set
1            15    73   13   13     73   27   0      80   20   0      100  80   80
2            28    0    89   11     0    79   21     4    89   7      93   100  79
3            7     0    29   71     0    29   71     0    0    100    100  100  100
Cross-validation
1            46    72   15   13     74   20   7      70   20   11     98   72   67
2            84    11   69   20     10   75   16     11   73   17     79   98   60
3            19    16   32   53     16   16   68     11   16   74     90   95   100
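The average classification rates quoted for the LDA models are consistent with an unweighted mean of the three per-class correct rates (the diagonal entries of each block of Table 2). The short check below assumes that definition, which is inferred from the numbers rather than stated in the text, and also reproduces the conversion from sample counts to percentages for the 15 Argissolo validation samples classified by GA–LDA.

```python
# Diagonal (correct-classification) rates from Table 2, in % for classes 1, 2, 3.
validation = {"GA-LDA": (73, 89, 71), "SW-LDA": (73, 79, 71), "SPA-LDA": (80, 89, 100)}
cross_val = {"GA-LDA": (72, 69, 53), "SW-LDA": (74, 75, 68), "SPA-LDA": (70, 73, 74)}

def average_rate(rates):
    """Unweighted mean of the per-class rates, rounded to the nearest integer."""
    return round(sum(rates) / len(rates))

def row_percentages(counts):
    """Convert the counts of one true class into predicted-class percentages."""
    return [round(100 * c / sum(counts)) for c in counts]

for name in validation:
    print(name, average_rate(validation[name]), average_rate(cross_val[name]))

print(row_percentages([11, 2, 2]))   # 11, 2 and 2 of the 15 Argissolo samples
```

Under this reading, the validation averages are 78%, 74% and 90% for GA–LDA, SW–LDA and SPA–LDA, and the cross-validation average is 72% for both SW–LDA and SPA–LDA.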

Fig. 4. (a) PC2 × PC1 and (b) PC3 × PC1 score plots for the overall set of 149 soil samples (a different marker is used for each of the Argissolo, Latossolo and Nitossolo orders).

3.2. LIBS instrument

The measurements were carried out with a lab-made LIBS instrument consisting of a Nd:YAG laser (Quantel, 1064 nm, 360 mJ/pulse and pulse duration of 5 ns), an echelle polychromator (52.13 lines/mm, Mechelle 5000, Andor Technology), an Intensified Charge Coupled Device (ICCD) detector with an array of 1024 × 1024 pixels (Model DH734, Andor Technology) and an adjustable position plate for the sample. Fig. 2 presents a diagram of the LIBS instrument.

3.3. Spectra acquisition

Thirty spectra were acquired for each sample by applying the laser pulse to different points of the sample surface. Prior to the measurement process, the sample cell was filled and the soil surface was levelled. After every five measurements, on different points, the sample surface was re-levelled to eliminate the small craters produced by the laser beam.

The laser energy, delay time and integration time gate were 110 mJ/pulse, 500 ns and 10 µs, respectively. The focal point was situated 0.5 cm below the sample surface. The spectra were acquired in the range 203.13–987.64 nm. Each resulting spectrum had 26,624 points.

3.4. Software

Each individual spectrum was pre-treated by Standard Normal Variate (SNV) [37]. Afterwards, the average spectrum for each sample was calculated. The average spectra were then divided into training and validation sets by using the classic Kennard-Stone (KS) algorithm [38]. The KS algorithm was applied to each class separately, as described in Ref. [18]. The number of samples in each set is presented in Table 1.
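The two preprocessing steps named above can be sketched as follows. SNV is taken here as row-wise centering and scaling by the sample standard deviation, and KS is written in its usual max-min distance form with Euclidean distances; both are assumptions about details that the text delegates to Refs. [37,38] and to the software used, and the function names are hypothetical.

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: centre and scale each spectrum (row) individually."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)   # assumed: sample std (ddof=1)
    return (spectra - mean) / std

def kennard_stone(X, n_select):
    """Return indices of n_select samples (rows of X) chosen by the KS algorithm."""
    X = np.asarray(X, dtype=float)
    sq = (X ** 2).sum(axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    # Start with the two mutually most distant samples.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    chosen = [int(i), int(j)]
    while len(chosen) < n_select:
        remaining = [k for k in range(len(X)) if k not in chosen]
        # Add the sample whose nearest already-chosen neighbour is farthest away.
        nearest = dist[np.ix_(remaining, chosen)].min(axis=1)
        chosen.append(remaining[int(np.argmax(nearest))])
    return chosen

print(snv([[1.0, 2.0, 3.0]]))
print(kennard_stone([[0.0], [1.0], [2.0], [10.0]], n_select=3))
```

Applied class by class, `kennard_stone` would be called with the training sizes of Table 1 (31, 56 and 12), the unselected samples of each class forming the validation set.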

For the purpose of WC, 22 different wavelets were tested (Symlet 4-10, Daubechies 1-10 and Coiflet 1-5). The low-pass and high-pass filters for dbN, symN and coifN have length 2N, 2N, and 6N, respectively (i.e., small values of N are associated with wavelets of small width). These wavelets were selected in view of previous works concerning FT-IR [36] and UV–VIS [39] spectrometry. The maximum number of decomposition levels for each wavelet was employed, as discussed in Section 2.4. The percentage of data variance retained in the compression process was set to 95%.
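The text does not spell out how the 95% variance criterion is applied. One plausible reading, sketched below purely as an assumption, is to compute the variance of each wavelet coefficient across the samples, rank the coefficients by decreasing variance and keep the smallest set whose cumulative variance reaches the chosen fraction of the total. The function name `select_coefficients` is hypothetical.

```python
import numpy as np

def select_coefficients(W, retained=0.95):
    """W: (samples x wavelet coefficients). Return indices of the retained columns."""
    var = np.asarray(W, dtype=float).var(axis=0)
    order = np.argsort(var)[::-1]                     # largest variance first
    cumulative = np.cumsum(var[order]) / var.sum()
    # Smallest number of coefficients whose cumulative variance reaches the target.
    n_keep = int(np.searchsorted(cumulative, retained) + 1)
    return np.sort(order[:n_keep])

# Column variances are 9, 1, 81 and 0.25: columns 2 and 0 explain 98.6% together.
W = np.array([[3.0, 1.0, 9.0, 0.5],
              [-3.0, -1.0, -9.0, -0.5]])
print(select_coefficients(W))
```

The number of indices returned plays the role of the "number of retained coefficients" used to compare wavelets.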

SNV, PCA and SIMCA were performed with the default settings of the Unscrambler® 9.6 software (CAMO A/S). The optimal number of PCs was determined from the residual variance curve. The first local minimum was adopted unless later PCs gave significantly lower residual variance. The significance level of the F-test for SIMCA classification was set to the default value (5%). The WC, KS, GA–LDA, SW–LDA and SPA–LDA classification routines were implemented in Matlab® 6.5. The GA routine was run for 200 generations with 400 chromosomes each. Crossover and mutation probabilities were set to 60% and 10%, respectively, as in [18]. Moreover, the algorithm was repeated three times, starting from different random initial populations. The best solution (in terms of the fitness value) resulting from the three realizations of the GA was employed. Seven threshold values (0.1, 0.2, 0.5, 0.7, 0.8, 0.9, and 0.95) for the coefficient of multiple correlation were tested in the SW–LDA algorithm. The best threshold was selected on the basis of the classification errors in the validation set. If two threshold values provided the same number of classification errors, the threshold providing the simplest model (smallest number of selected variables) was favoured.
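The thresholds above are applied to the coefficient of multiple correlation of Eq. (3). A minimal sketch of that quantity is given below, assuming an MLR model with an intercept fitted by ordinary least squares; the function name `multiple_correlation` is hypothetical.

```python
import numpy as np

def multiple_correlation(X_selected, x_candidate):
    """r_i of Eq. (3): std of the MLR estimate divided by std of the variable."""
    X_selected = np.asarray(X_selected, dtype=float)
    x = np.asarray(x_candidate, dtype=float)
    # MLR with intercept: regress the candidate on the already selected variables.
    A = np.column_stack([np.ones(len(x)), X_selected])
    coef, *_ = np.linalg.lstsq(A, x, rcond=None)
    x_hat = A @ coef
    return float(np.std(x_hat) / np.std(x))

selected = np.array([[-1.0], [0.0], [1.0]])
print(multiple_correlation(selected, [-1.0, 1.0, 3.0]))   # exact linear function
print(multiple_correlation(selected, [1.0, -2.0, 1.0]))   # uncorrelated variable
```

A candidate whose r_i exceeds the chosen threshold (0.2 to 0.95 in the tests described above) would be discarded as redundant.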

The results were expressed in terms of classification rates for the validation set. In addition, cross-validation results were obtained by applying the leave-one-out approach to the entire data set of 149 samples.

4. Results and discussion

Fig. 3 presents the mean LIBS spectrum of each soil order in the range of approximately 203–1000 nm. As can be seen, discriminating the three soil orders on the basis of LIBS measurements is not straightforward, owing to the complexity of the spectra. The difficulty involved in the classification task is also apparent in the PC score plots presented in Fig. 4. As can be seen, the dispersion within each class is considerable. Such a dispersion can be ascribed to the poor repeatability of the LIBS measurements, as well as the large chemical and mineralogical variability within each soil type. In Fig. 4, the best discrimination is found between Latossolo and Argissolo samples, which are reasonably well separated along PC1. In fact, these two orders are the most distinct in terms of mineralogical constitution. However, they are considerably overlapped by Nitossolo. It may be argued that distinctive features of Nitossolo are not adequately captured by the LIBS spectra.

4.1. Classification in the original spectral domain

Table 2 presents the classification results (validation set andcross-validation) obtained in the original spectral domain. This

M.J.C. Pontes et al. / Analytica Chimica Acta 642 (2009) 12–18 17

tesTm

dcibescirat

vfiawenSSttSt

teTt

4

pe9tt

Table 3Number of wavelet coefficients required to explain 95% of the data variance.

Wavelet Number of retained coefficients

Sym4 663Sym5 684Sym6 696Sym7 701Sym8 723Sym9 729Sym10 751Db1 785Db2 692Db3 690Db4 738Db5 753Db6 781Db7 818Db8 858Db9 865Db10 896Coif1 678Coif2 677

original domain was 65%. In view of the overall validation and cross-validation results, it can be concluded that the WC process doesnot significantly compromise the classification performance of theresulting models.

Table 4Average classification rate (%) in the validation set (original spectral domain andwavelet-compressed data).

GA–LDA SW–LDA SPA–LDA

Original domain 78 74 90

Fig. 5. Determination of the optimum number of variables in SPA–LDA.

table also indicates the number of spectral variables (wavelengths) employed in each model. In the case of SW–LDA, the threshold value selected according to the criteria described in Section 3.4 was 0.2. The number of variables for SPA–LDA was determined from the minimum of the cost function displayed in Fig. 5.

The rates in Table 2 express both correct classifications (predicted class index equal to correct class index) and incorrect classifications (predicted class index different from correct class index). In each LDA model, the three rates in a row add up to 100% (apart from rounding), because every sample is included in one and only one class. For example, the 15 validation samples of class 1 (Argissolo) were classified by GA–LDA in the following manner: 11 samples (73%) were correctly included in class 1, two samples (13%) were incorrectly included in class 2 (Latossolo), and two samples (13%) were incorrectly included in class 3 (Nitossolo). In contrast, SIMCA may include a given sample in more than one class. Therefore, the sum of the three rates in a row may be larger than 100% for SIMCA.

Among the LDA models, the worst overall results in terms of validation and cross-validation were obtained with GA–LDA. This finding may be ascribed to the fact that GA–LDA does not take into account multicollinearity effects in the variable selection process, whereas SPA–LDA and SW–LDA were designed to minimize such effects. In fact, it is worth noting that GA–LDA selected a larger number of spectral variables (17), as compared to SW–LDA (7) and SPA–LDA (5). As regards the comparison between SW–LDA and SPA–LDA, it can be seen that SPA–LDA provides better results in the validation set for all three soil types (average correct classification rate of 90%). In terms of overall cross-validation performance, SW–LDA and SPA–LDA are similar, as the average correct classification rate was 72% for both models.

SIMCA provided good validation and cross-validation results in terms of correctly including the samples in their true class. However, almost all samples were also included in an incorrect class. This problem may be ascribed to the dispersion and overlapping of the soil classes, as seen in the score plots presented in Fig. 4.
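The difference between the LDA and SIMCA rates can be illustrated with a minimal Python sketch. The LDA counts are those quoted in the text for the class 1 validation samples under GA–LDA; the SIMCA membership flags are invented purely for illustration, and both function names are hypothetical.

```python
def lda_row_rates(counts):
    """Percentage of the samples of one true class assigned to each predicted class."""
    total = sum(counts)
    return [round(100 * c / total) for c in counts]


def simca_row_rates(memberships):
    """Percentage of samples accepted by each class model.

    memberships[i][k] is True when sample i is accepted by the model of class k.
    A sample may be accepted by several class models, or by none.
    """
    n_samples = len(memberships)
    n_classes = len(memberships[0])
    return [round(100 * sum(row[k] for row in memberships) / n_samples)
            for k in range(n_classes)]


# GA-LDA, validation samples of class 1 (Argissolo): 11, 2 and 2 out of 15.
lda_rates = lda_row_rates([11, 2, 2])

# Hypothetical SIMCA outcome for four samples of one class (illustration only).
simca_rates = simca_row_rates([
    [True, True, False],
    [True, True, True],
    [True, False, False],
    [True, True, False],
])

print(lda_rates, sum(lda_rates))      # exclusive assignment: sums to 100 up to rounding
print(simca_rates, sum(simca_rates))  # overlapping membership: sum may exceed 100
```

With exclusive assignment the row total can only deviate from 100% through rounding, whereas overlapping class models let the same sample be counted in several columns.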

4.2. Use of wavelet compression

As discussed in Section 3.4, 22 wavelets were tested for compression of the LIBS spectra. Table 3 presents the results, which are expressed in terms of the number of coefficients required to explain 95% of the data variance. Overall, the best performances (i.e., the smallest number of required coefficients) were obtained with the smallest wavelets within each family. In fact, small wavelets may be a better match to the narrow emission peaks found in LIBS spectra.

Table 3 (continued)
Wavelet    Number of retained coefficients
Coif3      700
Coif4      719
Coif5      751
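The compression criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes that coefficients are ranked by their variance across the spectra and retained until 95% of the total variance is reached, and it uses the Haar (db1) wavelet because it can be written without external libraries. The function names and the toy spectra are hypothetical and are not taken from the paper.

```python
import math


def haar_transform(signal):
    """Full orthonormal Haar (db1) decomposition; the length must be a power of two."""
    coeffs = list(signal)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        approx = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        detail = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        coeffs[:n] = approx + detail  # approximations first, then details of this level
        n = half
    return coeffs


def select_coefficients(spectra, explained=0.95):
    """Indices of the highest-variance coefficients that reach the requested variance share."""
    transformed = [haar_transform(s) for s in spectra]
    n_samples = len(transformed)
    n_coeffs = len(transformed[0])
    variances = []
    for j in range(n_coeffs):
        column = [row[j] for row in transformed]
        mean = sum(column) / n_samples
        variances.append(sum((v - mean) ** 2 for v in column) / (n_samples - 1))
    order = sorted(range(n_coeffs), key=lambda j: variances[j], reverse=True)
    total = sum(variances)
    kept, cumulative = [], 0.0
    for j in order:
        kept.append(j)
        cumulative += variances[j]
        if cumulative >= explained * total:
            break
    return kept


# Toy "spectra" with narrow peaks of varying height (illustration only).
spectra = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 3, 0, 0, 0, 0, 0],
    [0, 0, 2, 0, 0, 0, 5, 0],
    [0, 0, 4, 0, 0, 0, 1, 0],
]
kept = select_coefficients(spectra)
print(len(kept), "of", len(spectra[0]), "coefficients retained")
```

Repeating the selection with different wavelets and comparing the number of retained coefficients gives the kind of ranking reported in Table 3.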

Classification tests were carried out by using the five best wavelets in terms of compression performance (sym4, db2, db3, coif1 and coif2). Table 4 presents the validation results obtained by applying GA–LDA, SW–LDA and SPA–LDA to the compressed data set. The best wavelets for GA–LDA, SW–LDA and SPA–LDA were sym4 (considering compression performance in addition to the classification rate), coif1 and coif2, respectively. By using these wavelets, a classification rate of 84% was obtained with the three LDA models. For GA–LDA and SW–LDA, this rate is an improvement in comparison with the results obtained in the original spectral domain. In the case of SPA–LDA, the result became slightly worse, as the classification rate in the original domain was 90%. However, the computational workload involved in the modelling process was substantially reduced by the use of WC, as the number of variables was reduced by a factor of approximately 40 (from 26,624 to 677 with coif2, for example). By using a computer with a Celeron 2.66 GHz processor and 2 GB RAM, the time required for variable selection by SPA was reduced from approximately 1000 min to 8 min. It is worth noting that the time spent in the WC process itself is relatively small (approximately 28 s for the coif2 wavelet).
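The reduction factors follow directly from the figures reported in this paragraph. The short Python check below uses only those figures; the variable names are illustrative.

```python
n_original = 26624      # spectral variables in the original domain
n_compressed = 677      # wavelet coefficients retained with coif2
t_original_min = 1000   # approximate SPA selection time, original domain (min)
t_compressed_min = 8    # approximate SPA selection time, compressed data (min)

variable_reduction = n_original / n_compressed
time_reduction = t_original_min / t_compressed_min

print(f"variables: {variable_reduction:.1f}-fold, time: {time_reduction:.0f}-fold")
```

The variable count shrinks by a factor of about 39, which the text rounds to 40, while the reported selection time shrinks by a factor of about 125, consistent in order of magnitude with the 100-fold workload reduction cited in the conclusions.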

By using the best wavelet for each model, the cross-validation rates for GA–LDA, SW–LDA and SPA–LDA were 69%, 70% and 71%, respectively. For SW–LDA and SPA–LDA, these results are slightly worse than the rate obtained in the original domain (72%). For GA–LDA, the result is actually better, as the rate obtained in the

Table 4 (continued)
                   GA–LDA   SW–LDA   SPA–LDA
Sym4               84       81       79
Db2                84       77       68
Db3                84       83       80
Coif1              77       84       79
Coif2              83       75       84


Table 5
Classification rates obtained with SPA–LDA and SIMCA for (1) Argissolo, (2) Latossolo and (3) Nitossolo. The number of wavelet coefficients employed in each model is indicated in parentheses. N indicates the number of samples employed in the calculation of the classification rates.

Predicted class index (%)
                           SPA–LDA (6)         SIMCA (677)         SIMCA (6)
True class index    N      1     2     3       1     2     3       1     2     3
Validation set
1                   15     67    20    13      100   80    80      100   80    67
2                   28     4     86    11      93    100   68      86    96    54
3                   7      0     0     100     100   100   100     71    43    100
Cross-validation
1                   46     67    17    15      96    70    65      93    65    63
2                   84     10    71    19      79    98    52      80    94    60
3                   19     16    11    74      95    95    100     74    90    100

For comparison purposes, Table 5 presents the classification results of SPA–LDA and SIMCA for the coif2-compressed data set. SIMCA models were constructed with the 677 coefficients resulting from the WC process and also with the six coefficients selected by SPA–LDA. Overall, the SIMCA classification rates were similar to those obtained in the original domain with the full spectrum (Table 2). This result corroborates the conclusion that the wavelet compression retains discriminatory information concerning the soil classes under study.
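The average classification rates quoted for SPA–LDA on the coif2-compressed data can be recovered from the SPA–LDA columns of Table 5. The sketch below assumes that the average is the unweighted mean of the three per-class correct classification rates. This assumption is not stated explicitly in this section, but it reproduces the reported 84% (validation) and 71% (cross-validation). The function name is hypothetical.

```python
# SPA-LDA rows of Table 5 (true class -> predicted class, in %).
spa_lda_validation = [
    [67, 20, 13],   # true class 1 (Argissolo)
    [4, 86, 11],    # true class 2 (Latossolo)
    [0, 0, 100],    # true class 3 (Nitossolo)
]
spa_lda_cross_validation = [
    [67, 17, 15],
    [10, 71, 19],
    [16, 11, 74],
]


def average_correct_rate(rate_matrix):
    """Unweighted mean of the diagonal (correct classification) rates."""
    diagonal = [rate_matrix[k][k] for k in range(len(rate_matrix))]
    return sum(diagonal) / len(diagonal)


validation_rate = average_correct_rate(spa_lda_validation)
cross_validation_rate = average_correct_rate(spa_lda_cross_validation)
print(round(validation_rate), round(cross_validation_rate))
```

Weighting each class by its number of samples N would give slightly different values, so the unweighted mean should be read as an inferred convention rather than a documented one.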

5. Conclusions

This paper presented a novel methodology for soil classification based on the use of LIBS data and chemometric methods. The methodology was validated in a case study involving three Brazilian soil types (Argissolo, Latossolo and Nitossolo). Better discrimination of the soil types was attained by employing a subset of selected spectral variables for LDA, as compared to the use of full-spectrum SIMCA modelling. More specifically, the best results were obtained with SPA–LDA, which achieved an average classification rate of 90% in the validation set and 72% in cross-validation. The proposed wavelet compression procedure was useful to reduce the computational workload (by a factor of 100) without significantly compromising the classification accuracy. It is worth noting that, after the classification models have been obtained, the proposed methodology can be applied to new samples in a fast and straightforward manner.

Future work could investigate the combination of LIBS with other techniques, such as VIS–NIR spectroscopy, for the purpose of improving the classification outcome.

Acknowledgments

The authors thank PROCAD/CAPES (Grant 0081/05-1) and FAPESP (Grant 03/07419-5) for partial financial support. The research fellowships and scholarships granted by CNPq and CAPES are also gratefully acknowledged.
