Spatial soil zinc content distribution from terrain parameters: A GIS-based decision-tree model in...



Contents lists available at ScienceDirect

Environmental Pollution 158 (2010) 520–528

Environmental Pollution

journal homepage: www.elsevier.com/locate/envpol

Spatial soil zinc content distribution from terrain parameters: A GIS-based decision-tree model in Lebanon

Rania Bou Kheir a,b,*, Mogens H. Greve b, Chadi Abdallah c, Tommy Dalgaard b

a Lebanese University, Faculty of Letters and Human Sciences, Department of Geography, GIS Research Laboratory, P.O. Box 90-1065, Fanar, Lebanon
b Department of Agroecology and Environment, Faculty of Agricultural Sciences (DJF), Aarhus University, Blichers Allé 20, P.O. Box 50, DK-8830 Tjele, Denmark
c National Council for Scientific Research, Remote Sensing Center, P.O. Box 11-8281, Beirut, Lebanon

GIS regression-tree analysis explained 88% of the variability in field/laboratory zinc concentrations.

Article info

Article history:
Received 22 February 2009
Received in revised form 3 July 2009
Accepted 26 August 2009

Keywords:
Zn concentration
Soil pollution
Decision-tree model
GIS
Terrain analysis

* Corresponding author at: Lebanese University, Faculty of Letters and Human Sciences, Department of Geography, GIS Research Laboratory, P.O. Box 90-1065, Fanar, Lebanon. Tel.: +961 3 982 848.

E-mail addresses: [email protected], [email protected] (R. Bou Kheir).

0269-7491/$ – see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.envpol.2009.08.009

Abstract

Heavy metal contamination has been and continues to be a worldwide phenomenon that has attracted a great deal of attention from governments and regulatory bodies. In this context, our study proposes a regression-tree model to predict the concentration level of zinc in the soils of northern Lebanon (as a case study of Mediterranean landscapes) under a GIS environment. The developed tree model explained 88% of the variance in zinc concentration using pH (100% in relative importance), surroundings of waste areas (90%), proximity to roads (80%), nearness to cities (50%), distance to drainage line (25%), lithology (24%), land cover/use (14%), slope gradient (10%), conductivity (7%), soil type (7%), organic matter (5%), and soil depth (5%). The overall accuracy of the quantitative zinc map produced (at 1:50.000 scale) was estimated to be 78%. The proposed tree model is relatively simple and may also be applied to other areas.

1. Introduction

Soil contamination by metals has become a widespread and serious problem in many parts of the world, including Mediterranean environments. The reasons are manifold: an increase in irrigation using waste water; the uncontrolled application of sewage sludge, industrial effluents, pesticides and fertilizers; rapid urbanization; the atmospheric deposition of dust and aerosols; and vehicular emissions, to mention but a few. Due to the non-biodegradability of heavy metals and their long biological half-lives for elimination, their accumulation in the food chain will have a significant effect on human health in the long term (Fuge, 2005). Past studies have revealed that human exposure to high concentrations of heavy metals leads to their accumulation in the fatty tissues of the human body and affects the central nervous system, or that the heavy metals may be deposited in the circulatory system and disrupt the normal functioning of the internal organs (Waisberg et al., 2003; Bocca et al., 2004). In order to prevent further environmental deterioration and adverse health effects and to examine possible methods of remediation, an identification of the spatial distribution of the contaminated areas, especially those affected by zinc (Zn), is urgently needed.

The spatial variability of Zn topsoil contents may be affected by soil parent materials and anthropogenic sources. The natural background concentration of zinc in dry soils varies between 0 mg kg⁻¹ and 18 mg kg⁻¹ (De Temmerman et al., 2003). The problems associated with the characterization of Zn in the majority of sites are often due to multiple sources, including agricultural practices, enhanced urbanization, industrialization, and mining activities. Zinc is used as an additive in the vulcanization process to strengthen crude rubber in tyre manufacturing (Adachia and Tainoshob, 2004). Tyre wear, motor oil, grease, brake emissions, and corrosion of galvanized parts may contribute to the high Zn content in roadside soils (Smolders and Degryse, 2002; Reimann and de Caritat, 2005). Crash barriers and roofs are also considered a major source of pollutants, charging running water with heavy metals, especially zinc (1045 mg/m/year) (Manta et al., 2002).

In this context, our study proposes a statistical regression decision-tree model that predicts, in a simple, realistic and practical way, the spatial distribution and concentrations of soil zinc, based on the analysis of terrain parameters (predictors) likely to affect soil zinc concentration and on the quantification of their weights using Geographic Information Systems (GIS) in a study area in Lebanon. It comprises a set of rules to classify (predict) a dependent target variable (zinc concentration in mg kg⁻¹) using the values of independent terrain parameters. These predictors are both ordinal (e.g., slope gradient) and nominal (e.g., lithology) features; they cover natural (e.g., soil type) and anthropic (e.g., proximity to roads) parameters. The machine learning, probabilistic, non-parametric, decision-tree method has been extensively exploited for vegetation mapping (Kandrika, 2008), ecological modeling (Michaelsen et al., 1994), gully erosion modeling (Bou Kheir et al., 2007), soil mapping (Henderson et al., 2005; Bou Kheir et al., 2008), epidemiological monitoring (Schroder, 2006), and remote sensing studies such as land-use classification based on threshold values of various band data (Huang and Jensen, 1997; Friedl et al., 1999). It is essentially an exploratory method (Venables and Ripley, 1994), and it has been used predictively (Franklin, 1998) without the need to specify the form of the function to be fitted to the data. The zinc concentration map resulting from the conversion of the decision-tree model, at 1:50.000 cartographic scale, serves as a useful inventory for land-use management and environmental decision-making.

2. Study area

The study area (Fig. 1), located in the northern part of Lebanon, extends over 195 km² and exhibits a variety of soils developed on diverse lithologies, various land cover/use modes and a wide range of climatic conditions. It is also susceptible to Zn pollution from various human activities. For this reason, it was selected for developing and testing the decision-tree methodology.

The physiography of the region is highly varied. Contrasting landscapes are found, from the mountain environment in the eastern and central parts of the area to the western coastal plains. This diversity is paralleled by the distribution of vegetation cover, from forests (mainly composed of Pinus pinea, 14.5%, and Quercus calliprinos, 6.5%) to sparse shrubby (20%) and herbaceous (7.5%) vegetation. The agricultural land (27%) is less widespread than natural vegetation, but pesticides are commonly used in high quantities in order to increase yields. In addition, the study area has a well-established network of highways and roads, and many residential, commercial and industrial buildings (17%) are erected alongside the roads.

Fig. 1. Location of the study area within Lebanon. [Map labels: Mediterranean Sea, Cyprus, Egypt, Jordan, Lebanon, Tripoli, Beirut, Tyr, study area. Legend: faults; basalts; Quaternary: (Q) Quaternary deposits; Paleogene: (m) Miocene; Cretaceous: (C6) Senonian, (C5) Turonian, (C4) Cenomanian, (C3) Albian, (C2) Aptian, (C1) Barremian; Jurassic: (J) Kimmeridgian. Scale bar in km.]

The outcropping stratigraphic sequence exposes rock formations spanning from the Middle Jurassic to more recent times. According to the 1:50.000 geological maps of Lebanon (Dubertret, 1945), fourteen rock units are present in the study area (Fig. 1). The highly fractured and jointed limestone and dolomitic limestone outcrops of the Cenomanian (C4) form around 32% of the area, followed in importance by the dolomite, limestone and dolomitic limestone of the Kimmeridgian (J6, 25%). One perennial coastal river (Nahr El-Jawz) crosses the study area, flowing seawards with an E–W orientation. The climate is typically Mediterranean, with a mean annual precipitation between 900 and 1100 mm and a significant amount of rain (75–80%) falling in the winter period (November–March). The terrains are often unprotected during winter, which is the critical period for water runoff and the leaching of pollutants from soils.

The important diversity in parent materials, along with geomorphic features, land cover/use and climate, has conditioned the development of a wide variety of soils. A total of 18 different soil units were identified on the available soil map (Geze, 1956), of which 15 are soil series and three are soil associations. Three of these cover 71% of the soils in the study area (discontinuous red mountainous soils, 55%; mixed soils on alternating marl and limestone, 8%; and sandy soils, 8%).

3. Modeling approach

The quantitative soil map of zinc distribution was generated in several steps. Base maps were used to produce the landscape unit map, which was combined with field surveys and laboratory analyses for allocating site locations and determining their zinc concentration (mg kg⁻¹). The resulting point location layer was then intersected with maps of predictor terrain parameters. A decision-tree model was then built on the result of this intersection, which combines, for each field location, the zinc concentration and the corresponding parameter values. This model was used for quantitative mapping of zinc (mg kg⁻¹) within the soils of the study area in a GIS environment.
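The intersection step can be pictured with a small sketch. It assumes, purely for illustration, that a predictor map is stored as a north-up raster grid; the function name `sample_grid`, its parameters and the toy grid are hypothetical and do not come from the study, which performed the overlay inside a GIS.

```python
def sample_grid(grid, x_min, y_max, cell_size, x, y):
    """Return the predictor class stored in the raster cell containing (x, y).

    grid      : list of rows, row 0 being the northernmost row (assumption)
    x_min     : x coordinate of the left edge of the grid
    y_max     : y coordinate of the top edge of the grid
    cell_size : cell width and height, in the same units as x and y
    Returns None when the point falls outside the grid.
    """
    col = int((x - x_min) // cell_size)
    row = int((y_max - y) // cell_size)
    if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
        return grid[row][col]
    return None


# Toy 2 x 2 raster of class numbers with 50 m cells (invented values).
toy_classes = [[1, 2],
               [3, 4]]
print(sample_grid(toy_classes, 0.0, 100.0, 50.0, 25.0, 75.0))  # 1
print(sample_grid(toy_classes, 0.0, 100.0, 50.0, 75.0, 25.0))  # 4
```

Repeating the lookup for every predictor layer at each sampled site would yield one row of predictor values per site, which is the kind of table a tree is fitted to.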

3.1. Model choice

The decision-tree model is a logical model (deductive reasoning) represented as a binary (two-way split) tree that shows how the value of a variable, named the target variable, can be predicted by using the values of a set of predictor variables. It has a number of advantages when compared to numerically oriented techniques such as linear and nonlinear regression (function fitting), logistic regression, artificial neural networks (ANNs), and genetic algorithms (McKenzie and Ryan, 1999; Breiman, 2001). Decision trees are easy to build and interpret, and can automatically handle interactions between both continuous (ordinal, interval) and categorical (nominal) variables. Linear regression, for instance, is appropriate only if the data can be modeled by a straight-line function, which is often not the case. Also, linear regression cannot easily handle categorical variables, nor is it easy to look for interactions between variables. As with linear regression, nonlinear and logistic regressions are not well suited for categorical variables.

Decision trees can identify the most decisive variables, which are those that are used for creating the splits near the top of the tree. In addition, they do not require the specification of the form of a function to be fitted to the data, as is necessary for other competing procedures (e.g., nonlinear regression).

Artificial Neural Networks (ANNs) are often compared to decision-tree models because both methods can model data that have nonlinear relationships between variables, and both can handle interactions between variables. ANNs have also been used for assessing soil Zn content (Chow et al., 1997). However, neural networks have a number of drawbacks. They do not present an easily understandable model allowing researchers to get the full explanation of the underlying nature of the data being analyzed. In addition, they handle only binary categorical input data, and not those with multiple classes. It is also difficult to incorporate a neural network model into a computer system without using a dedicated interpreter for the model. In contrast, once a decision-tree model has been built, it can be converted to statements that are implemented easily in most computer languages without requiring a separate interpreter.
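As a rough illustration of how a fitted tree turns into plain conditional statements, the sketch below walks a tree stored as nested dictionaries and prints if/else rules. The dictionary layout, the function name `tree_to_rules`, the predictor names and every number in `toy_tree` are hypothetical; they are not the tree obtained in this study.

```python
def tree_to_rules(node, indent=0):
    """Render a binary tree (nested dicts) as indented if/else statements."""
    pad = "    " * indent
    if "value" in node:  # terminal node: emit the predicted value
        return [f"{pad}return {node['value']}"]
    lines = [f"{pad}if {node['feature']} < {node['threshold']}:"]
    lines += tree_to_rules(node["left"], indent + 1)
    lines.append(f"{pad}else:")
    lines += tree_to_rules(node["right"], indent + 1)
    return lines


# Invented two-level tree, for illustration only.
toy_tree = {
    "feature": "pH", "threshold": 8,
    "left": {"value": 200},
    "right": {
        "feature": "road_distance_m", "threshold": 50,
        "left": {"value": 150},
        "right": {"value": 110},
    },
}
print("\n".join(tree_to_rules(toy_tree)))
```

The printed rules need nothing beyond comparisons and branches, which is why they port to most languages without a separate interpreter.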

Moreover, decision trees can identify a target (dependent) variable. This is not possible with unsupervised machine learning methods like cluster analysis, factor analysis (principal component analysis) and statistical measures (Lin et al., 2002; Saadia et al., 2006; Manzoor et al., 2006), which treat all variables equally without predicting the value of a variable. These methods rather look for patterns, groupings or other ways to characterize the data that may lead to an understanding of the way the data interrelate.

In addition, decision trees can indicate the relative weight of each predictor variable in explaining the training data, while bivariate analysis demonstrates only the effect of a pair of predictor variables on the target variable. Therefore, if the goal of an analysis is to predict the value of some variable, then decision-tree modeling is a recommended approach. Predicting the future nevertheless remains a hard task, even with decision trees.

3.2. Allocation of field sites and zinc distribution

Field surveys coupled with laboratory analyses were conducted for detailed measurements of zinc concentration at selected sites. These sites were chosen by a stratified random sampling method to provide a representative sample of the whole study area. In this sampling program, soils from urban, suburban, county park areas and arable lands were collected based on the different site conditions. This sampling method thus covers all landscape units, which differ by at least one of the following variables: geological substrate, soil type and land cover/use (Bou Kheir et al., 2004). These landscape units were determined by combining the corresponding GIS layers (explained below). A sampling density of one sample per km² was used.

Geographic locations of all sampling points (200) were determined using a global positioning system (GPS) with 10-m precision. At each site, soil samples were taken from the top 20-cm soil layer. The spatial distribution of Zn content in this layer is greatly influenced by various anthropic sources such as urbanization, industrialization, agricultural land uses, vehicular emissions and atmospheric deposition of dust and aerosols. The collected soil samples were then stored in polyethylene bags for transport and storage. These samples were dried in an oven at 50 °C for three days and subsequently sieved through a 2.0-mm polyethylene sieve to remove stones, coarse materials and other debris. Soil subsamples (≈20 g) were placed in a mechanical agate grinder and finely ground (<200 μm). The prepared soil samples were then stored in polyethylene bags in desiccators for chemical analysis.
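The stratified random selection of sites described above could be sketched as follows. This is a minimal illustration that assumes each candidate site already carries a landscape-unit label; the function `stratified_sample`, the fixed number of sites per unit and the toy candidates are hypothetical and do not reproduce the sampling design actually used.

```python
import random
from collections import defaultdict


def stratified_sample(candidates, per_unit, seed=0):
    """Draw up to `per_unit` random sites from every landscape unit.

    candidates : list of (site_id, landscape_unit) pairs
    per_unit   : number of sites requested in each unit (assumption)
    """
    rng = random.Random(seed)
    by_unit = defaultdict(list)
    for site_id, unit in candidates:
        by_unit[unit].append(site_id)
    chosen = {}
    for unit, sites in by_unit.items():
        # Small units contribute all of their sites.
        chosen[unit] = rng.sample(sites, min(per_unit, len(sites)))
    return chosen


# Invented candidates: ten sites in unit "A", five in unit "B".
toy_candidates = [(i, "A" if i < 10 else "B") for i in range(15)]
selection = stratified_sample(toy_candidates, per_unit=3)
```

Drawing within each unit separately is what guarantees that rare landscape units are represented, which simple random sampling over the whole area would not.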

The ground soil samples were analyzed for zinc using a strong acid digestion method. Approximately 0.5 g of each soil sample was weighed and placed in a pre-cleaned Pyrex test tube. Concentrated nitric acid (8 ml) and concentrated perchloric acid (3 ml) were added. The mixtures were heated in an aluminium block at 200 °C for 3 h until they were completely dry. After the test tubes had cooled, 10.0 ml of 5% HNO3 was added and heated at 70 °C for 1 h with occasional mixing. Upon cooling, the mixtures were decanted into polyethylene tubes and centrifuged at 3500 rpm for 10 min. Zinc concentrations of the solutions were measured using inductively coupled plasma-atomic emission spectrometry (ICP-AES; Perkin-Elmer 3300 DV).
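The text does not spell out how a solution reading is converted to a soil concentration. A minimal sketch of the usual mass-balance arithmetic is given below, assuming that the 10.0 ml of 5% HNO3 is the final solution volume, that no further dilution is applied, and that the instrument reports mg per litre; the function name and the example reading are hypothetical.

```python
def soil_concentration_mg_per_kg(solution_mg_per_l,
                                 final_volume_ml=10.0,
                                 sample_mass_g=0.5):
    """Convert a solution reading (mg/L) to a soil concentration (mg/kg).

    Assumes the whole digested sample ends up in `final_volume_ml` of
    solution with no additional dilution (an assumption, not stated
    in the text).
    """
    volume_l = final_volume_ml / 1000.0
    mass_kg = sample_mass_g / 1000.0
    return solution_mg_per_l * volume_l / mass_kg


# Invented reading of 5 mg/L: 5 * 0.010 L / 0.0005 kg = 100 mg/kg.
print(round(soil_concentration_mg_per_kg(5.0), 6))  # 100.0
```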

3.3. Collection of predictor terrain parameters of zinc distribution

Soil Zn accumulation is driven by terrain characteristics, industrial and agricultural land uses, soil properties and enhanced urbanization (Facchinelli et al., 2001; Li et al., 2001; Navas and Machin, 2002). Accordingly, in the regression-tree analysis, fourteen terrain parameters were selected as the independent variables, whereas the classes of Zn pollution were the dependent variable. The generated parameters (Table 1), i.e. lithology, slope gradient, slope length, distance to the drainage line, soil type, pH, conductivity, organic matter, stoniness ratio, soil depth, land cover/use, proximity to roads, surroundings of waste areas, and nearness to cities, were extracted from satellite imagery, digital terrain models (DTMs) and/or ancillary maps. The rules for quantifying and understanding the impact of these terrain parameters on Zn content could be acquired from the training data through the built regression-tree model. The pollution class for Zn at unsampled locations could be predicted based on these rules. A synoptic classification of five equal-interval classes was plotted for each of the considered parameters (with the exception of the lithologies and soil types) (Table 1) to match the pollution classes of Zn in the produced quantitative map.

Table 1
The different terrain parameters (predictors) likely to impact on the zinc concentration and their corresponding classes.

Class no.   Signification

(a) Lithology
1           J6 (dolomite and dolomitic limestone)
2           bJ6 (basalt)
3           C1 (calcareous sandstone and intercalations of siltstone and clays)
4           C2 (clastic limestone, limestone and dolomitic limestone)
5           C3 (shaley limestone and marl)
6           C4 (limestone, marly limestone and dolomitic limestone)
7           C5 (marly limestone and limestone)
8           C6 (chalky marl and marly limestone)
9           bC (basalt)
10          m2 (marly limestone and marl)
11          qc (colluvial deposits)
12          qd (fixed dunes)
13          qta (alluvium)
14          bq (basalt)

(b) Land cover/use
1–20        Natural vegetation cover
21–41       Agricultural lands
42–45       Bare lands
46–47       Water bodies
47–58       Human practices

(c) Slope gradient
1           ≤8%
2           8–15%
3           15–30%
4           30–60%
5           ≥60%

(d) Slope length
1           ≤50 m
2           50–100 m
3           100–150 m
4           150–200 m
5           >200 m

(e) Distance to drainage line
1           ≤50 m
2           50–100 m
3           100–150 m
4           150–200 m
5           >200 m

(f) Proximity to roads
1           ≤50 m
2           50–100 m
3           100–150 m
4           150–200 m
5           >200 m

(g) Surroundings of waste areas
1           ≤50 m
2           50–100 m
3           100–150 m
4           150–200 m
5           >200 m

(h) Soil type
1           Coastal sand
2           Consolidated dunes
3           Gravel and massive landslide deposits
4           Decalcified and rubified coastal sand
5           Discontinuous red mountainous soils/terra rossa
6           Continuous red mountainous soils/terra rossa
7           Brown soils
8           Yellowish mountain soils
9           Sandy soils
10          Greyey soils
11          Recent fluvial alluvium
12          Mixed soils on marl with bedded limestone
13          Mixed soils on alternating marl or limestone and sandstone
14          Mixed soils on alternating marl, limestone, sandstone and basalt
15          Black or greyey soils
16          Dolomitic sand associated to terra rossa
17          Yellowish soils associated to greyey soils
18          Sandy soils associated to greyey soils

(i) Soil pH
Extremely to moderately acid       <5.8
Slightly acid                      5.8–6.5
Neutral                            6.5–7.2
Slightly alkaline                  7.2–7.9
Moderately to strongly alkaline    >7.9

(j) Hydraulic conductivity
Rapid               >42.34
Moderately rapid    14.11–42.34
Moderate            4.23–14.11
Moderately slow     1.41–4.23
Slow                <1.41

(k) Nearness to cities
1           ≤50 m
2           50–100 m
3           100–150 m
4           150–200 m
5           >200 m

(l) Organic matter content
1           <0.5%
2           0.5–1%
3           1–1.5%
4           1.5–2%
5           >2%

(m) Stoniness ratio
1           <5%
2           5–35%
3           35–65%
4           65–95%
5           >95%

(n) Soil depth
Very shallow    <0.25 m
Shallow         0.25–<0.5 m
Moderate        0.5–<1.0 m
Deep            1.0–<1.5 m
Very deep       1.5–5 m
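The 50 m distance classes used for several predictors in Table 1 can be reproduced with a few lines of code. The sketch below is illustrative only: the function name `distance_class` is hypothetical, and treating each upper bound as inclusive (so that exactly 50 m falls in class 1) is an assumption, since the table does not say how boundary values were assigned.

```python
import math


def distance_class(distance_m, width_m=50.0, n_classes=5):
    """Map a distance in metres to a class number from 1 to n_classes.

    Classes are width_m wide; everything beyond the last bound goes
    into the final class. Upper bounds are treated as inclusive
    (assumption).
    """
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    cls = math.ceil(distance_m / width_m)
    return min(max(cls, 1), n_classes)


for d in (0, 50, 75, 200, 250):
    print(d, distance_class(d))
# 0 1
# 50 1
# 75 2
# 200 4
# 250 5
```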

3.3.1. Parent material

The natural concentration of Zn in soils depends primarily on the geochemistry of the parent material (De Temmerman et al., 2003). This can exhibit high spatial variability over heterogeneous lithologies. Information on lithology was extracted from the scanned and registered geological map of Lebanon at 1:50.000 scale (Dubertret, 1945).

3.3.2. Slope gradient and length

Topography reflects soil types and likely transport processes for heavy metal pollution across the landscape. Slope gradient and slope length were derived from a mosaic of digital terrain models (DTMs) with a planimetric resolution of 50 m, generated for the study area from topographic maps at a 1:20.000 scale with an elevation contour interval of 20 m (DGA, 1963) using ArcGIS software. Five classes of slope were distinguished: ≤8%, 8–15%, 15–30%, 30–60%, and ≥60%. These classes were determined using a clustering method based on the frequency distribution of slopes and the breaks in slope that influence the formation of various soils. Five classes of slope length were also distinguished, with an equal interval of 50 m between each class (Table 1).

3.3.3. Distance to drainage line

Drainage networks, which can be considered a major source of pollution, were extracted from topographic maps at a scale of 1:20.000 (DGA, 1963). A topology was built for the drainage systems, which were then buffered. The influence of drainage was assigned to a buffer zone extending up to 50 m from the closest drainage line (Saha et al., 2002). Thus, five classes were determined in the study area (Table 1).

3.3.4. Soil properties

Soils in this study area are derived from different parent materials, and this is a key factor determining their natural content of heavy metals. Soil types were represented by a digital registered form of the soil map of Lebanon established by Geze (1956) at a 1:200.000 scale. Soil pH, hydraulic conductivity and organic matter content were determined, respectively, by soil suspension in water and potassium chloride (ratio 1:2), the double-ring method, and oxidation with potassium dichromate (K2Cr2O7). Soil depth was measured by auger sounding at each site. Stoniness ratio (the relative proportion of stones on the soil surface) was determined by visual observation with five classes: <5%, 5–35%, 35–65%, 65–95%, and >95%.

3.3.5. Land cover/use

Agricultural practices can add Zn to soils through the application of manure or inorganic fertilizers. Repeated use of Zn-based chemicals in orchards and arable lands can also result in soil Zn contamination. The land cover/use parameter was estimated from a recent land cover/use map at 1:20.000 scale. This map was produced through visual interpretation of high-resolution Indian Remote Sensing (IRS) satellite images (5 m) acquired in October 1998 and based on the CORINE (Coordinated Information on the European Environment) Land Cover methodology (level 4) (LNCSR-LMoA, 2002). Fifty-eight classes were plotted, belonging to five major categories: 1) natural vegetation cover, 2) agricultural lands, 3) bare lands, 4) water bodies and 5) human practices.

3.3.6. Proximity to roads

Roads are usually sites of anthropogenically induced pollution (e.g., vehicular emissions). For this reason, roads were included in this study. A buffer zone of 50 m above the road (maximum height of a talus cut created by the construction of a road) was created, and five classes were considered, ranging from less than 50 m to more than 200 m from the road.

3.3.7. Surroundings of waste areas

Zinc may also accumulate as a result of waste burning in the environment. The influence of waste areas on Zn accumulation can reach from several metres to several kilometres from the point source, depending on the industry involved, and the distance affected is not easy to determine. Here, 50 m, 100 m, 150 m, 200 m, and more than 200 m were selected as the buffer distances around the waste areas.

3.3.8. Nearness to cities

Recent enhanced urbanization and industrialization have contributed to the increased occurrence of heavy metals, including Zn contamination, in ecosystems. In this study, a buffer zone of 50 m around the city or town boundary was selected.

3.4. Statistical analysis

An ASCII point data file for the field sample locations was constructed. This file comprises several columns: geographic site coordinates (x, y), the zinc concentration in mg kg⁻¹ (target variable), and various predictor parameters, both ordinal (pH, conductivity, organic matter, stoniness ratio, soil depth, slope gradient, slope length, distance to the drainage line, proximity to roads, surroundings of waste areas, and nearness to cities) and nominal (lithology, soil type, land cover/use). A decision-tree model was built from this file in several steps: (1) find the best possible split through the examination of each predictor, (2) create two child nodes, (3) determine into which child node each field location goes, and (4) continue the process until the criterion of the minimum node size is fulfilled.
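Step (1) of the procedure above can be sketched for a single ordinal predictor. Note the swapped-in criterion: the study uses the Gini method, whereas this minimal sketch scores candidate thresholds by the summed squared error of the target in the two child nodes, a common choice for regression trees. The function name, the `min_node_size` parameter and the pH/zinc values are all invented for illustration.

```python
def best_threshold_split(xs, ys, min_node_size=1):
    """Return (score, threshold) of the best two-way split on one predictor.

    score is the summed squared error of ys around the mean of each
    child node (assumed criterion, not the Gini method of the study).
    Returns None when no admissible split exists.
    """
    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal predictor values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        if len(left) < min_node_size or len(right) < min_node_size:
            continue
        score = sse(left) + sse(right)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or score < best[0]:
            best = (score, threshold)
    return best


# Invented data: pH values and zinc concentrations (mg/kg).
toy_ph = [6.5, 7.0, 7.5, 8.2, 8.5, 8.8]
toy_zn = [210, 190, 200, 100, 120, 110]
score, threshold = best_threshold_split(toy_ph, toy_zn)
```

Steps (2) to (4) would apply the same search recursively to the rows on each side of the chosen threshold until a node is too small to split.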

The number of splits to be evaluated is equal to 2^(k−1) − 1, where k is the number of categorical classes of a predictor variable (Breiman, 2001). For example, if the land cover/use with five classes is considered, fifteen splits need to be tried. With the increase of the number of classes, there is an exponential growth of splits and computation time. We considered differences in the value of a continuous variable up to 1% of the whole range, which is equivalent to ten thousand classes (Loh and Shih, 1997).
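The count 2^(k−1) − 1 can be checked by enumerating the two-way groupings directly. The sketch below is only a verification aid; the function name `binary_splits` is hypothetical and the class labels are assumed to be distinct.

```python
from itertools import combinations


def binary_splits(classes):
    """List every way to divide distinct classes into two non-empty groups."""
    classes = list(classes)
    anchor, rest = classes[0], classes[1:]
    splits = []
    # Keeping the first class on the left avoids counting mirror images twice;
    # stopping before len(rest) keeps the right-hand group non-empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = (anchor,) + extra
            right = tuple(c for c in rest if c not in extra)
            splits.append((left, right))
    return splits


five_class_splits = binary_splits([1, 2, 3, 4, 5])
print(len(five_class_splits))  # 15, i.e. 2**(5 - 1) - 1
```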

The algorithm used for evaluating the quality of the constructed tree is the Gini splitting method, which is considered the default method (Breiman, 2001). Each split is chosen in order to minimize the heterogeneity (node impurity) of the classes of the target variable in the child nodes. The Gini method has been considered slightly better than the Entropy tree-fitting algorithm (Loh and Shih, 1997; Breiman, 2001). The Gini coefficient is used to measure the degree of inequality of a variable in terms of frequency distribution. It ranges between 0 (perfect equality) and 1 (perfect inequality). The Gini mean difference (GMD) is defined as the mean of the difference between each observation and every other observation (Breiman, 2001):

\[
\mathrm{GMD} = \frac{1}{N^{2}} \sum_{j=1}^{N} \sum_{k=1}^{N} \left| X_j - X_k \right| \tag{1}
\]

where X_j and X_k are the cumulative percentages (or fractions) for observations j and k, respectively, and N is the number of elements (observations).
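A direct transcription of Eq. (1) is shown below as a minimal sketch. It follows the double sum literally, including the j = k terms that contribute zero, and the sample values are invented.

```python
def gini_mean_difference(values):
    """Gini mean difference as in Eq. (1): mean absolute pairwise difference."""
    n = len(values)
    if n == 0:
        raise ValueError("at least one observation is required")
    total = sum(abs(xj - xk) for xj in values for xk in values)
    return total / n ** 2


# For [1, 2, 3] the absolute differences sum to 8, so GMD = 8 / 9.
print(gini_mean_difference([1, 2, 3]))
# Identical observations give perfect equality.
print(gini_mean_difference([5, 5, 5]))  # 0.0
```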

A minimum node size of 10 was applied in this study, since a simpler tree is easier to understand and faster to use and, more importantly, smaller trees often provide greater predictive accuracy for unseen data. This number has also been used in other studies (Murphy et al., 1994; Zhang and Burton, 1999).

The maximum tree depth was not specified in this study, as was the case for earlier programs such as AID (Automatic Interaction Detection). A tree that is too large needs to be pruned back to its optimal size (i.e. backward pruning) on the basis of a V-fold cross-validation (Berk, 2003) with several assumptions such as random-row-holdback validation, fixed number of terminal nodes, and smooth minimum spikes. Cross-validation does not require a separate, independent dataset, which would reduce the data used to build the tree. It partitions the dataset used to build the reference unpruned tree into a number of groups (i.e. folds). A value of 10 folds was adopted in this study, since a larger value increases computation time and may not result in a more optimal tree.
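The partitioning behind V-fold cross-validation can be sketched in a few lines. This is an illustration under stated assumptions: the function name `v_fold_partition`, the seed and the row count of 140 are hypothetical, and the tree fitting and scoring steps are deliberately left out.

```python
import random


def v_fold_partition(n_rows, n_folds=10, seed=0):
    """Split row indices 0..n_rows-1 into n_folds random, disjoint folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Dealing the shuffled indices round-robin keeps fold sizes within one row.
    return [indices[f::n_folds] for f in range(n_folds)]


folds = v_fold_partition(140, n_folds=10)
for held_out in folds:
    training = [i for fold in folds if fold is not held_out for i in fold]
    # A tree would be grown on `training` and scored on `held_out` here;
    # averaging the ten scores estimates accuracy on unseen data.
```

Every row is held out exactly once, so no separate validation set has to be withheld from tree building.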

3.5. Construction of the zinc concentration map

Using the resulting decision-tree model, a predictive concentration map of zinc distribution in soils of the study area was produced under a GIS environment through conditional parametric weight application. If different end results (concentrations in mg kg⁻¹) characterized field sites within a given unit of the intersection layer (between different predictor parameters), new sub-polygons were delineated. In the case of similar results, unit polygons were joined (merged). This map was validated based on field surveys. An independent dataset was randomly chosen in all landscape units, consisting of 30% (60 sites) of the total number of field sites. The accuracy assessment was based on analysis of the error matrix, which is a square array of dimension n × n (n is the number of classes). The total accuracy refers to the ratio of the total number of correctly inferred Zn classes to the total number of test samples (60).

Fig. 2. Decision-tree model for predicting zinc concentration based on terrain parameters (pH = "pH"; Proximity to roads = "Roads"; Land cover/use = "LUC"; Soil type = "Soils"; Distance to drainage line = "Drain"; Slope gradient = "SG"; Soil depth = "Sd"; Organic matter = "om"; Hydraulic conductivity = "cond"; Lithology = "litho"; Surroundings of waste areas = "WA"; Nearness to cities = "C").

4. Results and discussion

4.1. Model performance evaluation

The regression-tree model created with 25 terminal nodes (Fig. 2) correctly classified 88% of the training data, but V-fold cross-validation indicated that this model would correctly classify 76% of an independent dataset. The number of observations (N) per terminal node ranged from 1 to 8, and a node is not split if it contains fewer than 10 observations.

The mean value of the target variable (zinc concentration in mg kg⁻¹) over the rows that fall in a terminal node of the tree is the value estimated for that node. From the tree (Fig. 2), one can see that if the pH is less than 8 (N = 110 rows), then the estimated (average) value of the target variable is 200 mg kg⁻¹, whereas if the pH exceeds 8 (N = 30 rows), then the average value of the target variable is 110 mg kg⁻¹. In fact, zinc, like other heavy metals, becomes more water-soluble under acid conditions and can move downwards with water through the soils.
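Because each node value is an average over its rows, the two child-node values at the pH split imply an approximate mean for the parent (root) node. The following is a back-of-the-envelope check that assumes the two reported node values are exact means; the resulting figure is derived here for illustration and is not reported in the study.

$$ \overline{\mathrm{Zn}}_{\mathrm{root}} \approx \frac{110 \times 200 + 30 \times 110}{110 + 30} = \frac{25300}{140} \approx 181\ \mathrm{mg\,kg^{-1}} $$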

The minimum validation relative error (or validation cost) occurs with five nodes, at a value of 0.26 and a validation standard error of 0.02. This validation error value is the cost relative to the cost for a tree with one node. It is the best measure of how well the tree will fit an independent dataset different from the learning dataset.

4.2. Evaluation of the influence of independent parameters on soil Zn content

The relative importance of the predictor parameters is as follows: pH (100% in relative importance), surroundings of waste areas (90%), proximity to roads (80%), nearness to cities (50%), distance to the drainage line (25%), lithology (24%), land cover/use (14%), slope gradient (10%), conductivity (7%), soil type (7%), organic matter (5%), and soil depth (5%). pH appears as the master parameter (generating the split from the parent node) in the whole process of zinc pollution, since it is strongly correlated with soil types and parent materials influencing cation mobility, and it regulates the solubility of zinc in soils. The spatial variability of zinc in topsoils is affected by extrinsic aspects (roads, waste areas and cities) and intrinsic aspects (drainage lines and soil parent materials). In addition, there is a good correspondence between zinc distribution and land cover/use, especially that related to long-standing anthropogenic activity (e.g., cultivation resulting in chemical anomalies). The slope gradient has demonstrated a certain impact by influencing the accumulation of zinc in mosses. Similarly, hydraulic conductivity and soil type (texture) control zinc distribution, but to a lesser extent (7%). In clay-rich, low-permeability soils, heavy metals are bound to soil particles, and thus toxicity risks are low. Organic matter and soil depth have similar effects on the zinc concentration. The former influences heavy metal absorption in soils; this effect is probably due to the cation exchange capacity of organic material. In addition, since heavy metals come from parent materials, their concentration will typically increase with soil depth due to two processes. One is soil formation from rock parent materials, which have higher concentrations of heavy metals; the other is deposition of materials at the soil surface, such as plant organic matter, which typically contain low heavy metal concentrations. As the soil materials from these two processes mix, a gradient of increasing concentration is created through the soil profile. The other parameters (i.e. slope length and stoniness ratio) do not intervene in building the regression tree, and their effect is totally masked.

4.3. Production of the zinc concentration map

The predictive concentration map of zinc (Fig. 3), at a 1:50.000 cartographic scale, was produced using the results of the built decision-tree model. This map was divided into five concentration classes having an equal range of distribution (Table 2). This division seems necessary to prioritize the measures needed to reduce the level of zinc concentration in contaminated areas. In this map, class 3 (with a zinc concentration ranging between 100 and 150 mg kg⁻¹) covers the largest area (39%), being dispersed throughout the studied region (Table 2). This indicates a widespread higher risk of contamination if no remedial measures are applied. Classes 4 (high contamination) and 5 (very high contamination) occur in 31.5% of the area, but they have a far larger impact since they are distributed as patches in densely populated areas.

Fig. 3. Predicted zinc concentration map of the study area.

Table 2
Classes of zinc concentration and corresponding areas for the studied region.

| Classes | Zinc concentration | Area (km²) | % of the study area |
|---|---|---|---|
| 1 (very low) | ≤50 mg kg⁻¹ | 5 | 2.5 |
| 2 (low) | 50–100 mg kg⁻¹ | 53 | 27 |
| 3 (medium) | 100–150 mg kg⁻¹ | 75 | 39 |
| 4 (high) | 150–200 mg kg⁻¹ | 46 | 23.5 |
| 5 (very high) | >200 mg kg⁻¹ | 16 | 8 |
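The five-class scheme of Table 2 can be expressed as a small lookup. This is a sketch under stated assumptions: the function name is hypothetical, the class areas are copied from Table 2, and the treatment of boundary values (each upper bound taken as inclusive) is an assumption, since the table does not say to which class a value of exactly 100 mg kg⁻¹ belongs.

```python
# Class areas in km², copied from Table 2.
AREAS_KM2 = {1: 5, 2: 53, 3: 75, 4: 46, 5: 16}

# Upper class bounds in mg/kg; treating them as inclusive is an assumption.
UPPER_BOUNDS = [50, 100, 150, 200]


def zinc_class(concentration):
    """Map a zinc concentration (mg/kg) to a class number from 1 to 5."""
    for cls, bound in enumerate(UPPER_BOUNDS, start=1):
        if concentration <= bound:
            return cls
    return 5


total_area = sum(AREAS_KM2.values())
shares = {cls: 100 * area / total_area for cls, area in AREAS_KM2.items()}
print(total_area, {cls: round(share, 1) for cls, share in shares.items()})
```

Shares recomputed this way agree with the published percentages to within about one percentage point, which is consistent with the table values having been rounded.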

4.4. Accuracy of the zinc concentration map

The predicted Zn values were verified against test data. The confusion matrix between the measured zinc concentration classes and the modeled ones indicates a good overall accuracy of 78% (Table 3). This accuracy value denotes the number of correctly modeled sites divided by the total number of sites. It differs from the explained variance of the built decision-tree model (88%), since it validates all adopted approaches combined: the integration of the landscape unit map, the quantification of terrain parameters and the decision-tree modeling. In contrast, the explained variance reflects only the field/laboratory measurements of zinc concentration chosen for model training. The user's accuracy (Au), i.e. the percentage of sites belonging to a model class that correctly correspond to the reference data, read along columns, ranges from 50% to 87.5% (50% being the result of dividing 4 by 8 and multiplying by 100; 87.5% is derived from dividing 7 by 8 and multiplying by 100). Omission errors (deficits) (Eo), computed along columns, correspond to the distribution of sites of a class, derived from modeling, in the various classes of reference data. They are equal (in %) to "100% − user's accuracy" and vary between 12.5% and 50%. The producer's accuracy (Ap), the percentage of sites belonging to a reference class correctly classified by the model, read along the rows, is 67–88%. Commission errors (excesses) (Ec), computed along rows, correspond to the distribution of a reference class among various classes derived from modeling. Being equal to "100% − producer's accuracy", they range between 12% and 33%. Modeling often overestimates zinc concentration levels, while underestimation is rare; this can be considered a positive point for management planning, because the possibility of overlooking actual zinc pollution risks decreases.

Table 3
Error matrix for measuring the accuracy of the map predicting zinc concentration. Rows: field observation/laboratory analysis; columns: modeled map.

| Zinc concentration | Very low | Low | Medium | High | Very high | Total | Ap (%) | Ec (%) |
|---|---|---|---|---|---|---|---|---|
| Very low (a) | 7 | 2 | 1 | 0 | 0 | 10 | 70 | 30 |
| Low | 1 | 8 | 2 | 0 | 0 | 11 | 73 | 27 |
| Medium | 0 | 0 | 23 | 3 | 0 | 26 | 88 | 12 |
| High | 0 | 0 | 0 | 4 | 2 | 6 | 67 | 33 |
| Very high | 0 | 0 | 1 | 1 | 5 | 7 | 71 | 29 |
| Total | 8 | 10 | 27 | 8 | 7 | 60 | | |
| Au (%) | 87.5 | 80 | 85 | 50 | 71 | Σ = 47 | | |
| Eo (%) | 12.5 | 20 | 15 | 50 | 29 | OA = 78% | | |

Au = user's accuracy, Eo = omission errors (deficits), Ap = producer's accuracy, Ec = commission errors (excesses), OA = overall accuracy. Values on the diagonal (bold italics in the original) indicate sites classified correctly; Σ = total number of correctly modeled sites.
(a) Very low zinc concentration: ≤50 mg kg⁻¹; Low: 50–100 mg kg⁻¹; Medium: 100–150 mg kg⁻¹; High: 150–200 mg kg⁻¹; Very high: >200 mg kg⁻¹.
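The accuracy figures discussed above follow directly from the counts in Table 3. The sketch below recomputes them from the error matrix; the function and variable names are hypothetical, and the row-wise and column-wise percentages are labelled Ap and Au only because that is how Table 3 labels them.

```python
# Error matrix from Table 3: rows = field/laboratory classes, columns = modeled classes.
MATRIX = [
    [7, 2, 1, 0, 0],   # very low
    [1, 8, 2, 0, 0],   # low
    [0, 0, 23, 3, 0],  # medium
    [0, 0, 0, 4, 2],   # high
    [0, 0, 1, 1, 5],   # very high
]


def accuracy_summary(matrix):
    """Return overall accuracy plus row-wise and column-wise accuracies, all in %."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    overall = 100 * correct / total
    row_acc = [100 * matrix[i][i] / row_totals[i] for i in range(n)]  # Ap in Table 3
    col_acc = [100 * matrix[j][j] / col_totals[j] for j in range(n)]  # Au in Table 3
    return overall, row_acc, col_acc


overall, ap, au = accuracy_summary(MATRIX)
# Sites above the diagonal were modeled in a higher class than observed, and vice versa.
over = sum(MATRIX[i][j] for i in range(5) for j in range(5) if j > i)
under = sum(MATRIX[i][j] for i in range(5) for j in range(5) if j < i)
print(round(overall, 1), [round(v, 1) for v in ap], [round(v, 1) for v in au], over, under)
```

Counting the off-diagonal cells separately above and below the diagonal gives a quick check on the statement that overestimation is more frequent than underestimation.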

4.5. Advantages and problems of the constructed model

Our regression-tree model (88% of variance explained) has defined a map of zinc concentration with five classes for a region situated in the northern part of Lebanon (195 km²). Such a map was not available in Lebanon, nor in many other countries. It represents the result of modeling from geo-environmental characteristics and can meet the scientific needs of researchers and decision-makers for exploring land-related problems.

The explained model variance can be improved by adding further detail to the predictor parameters, such as soil structure, soil compactness, slope aspect, slope curvature, etc. This is an important future research topic, since the contribution of such variables to explaining additional variance can be tested. This model can be extrapolated to other areas in the country if the functional capacities of GIS are used, because they allow the integration of several parametric maps for producing landscape unit maps, on which zinc concentration measurements can be determined.

The concept of decision-tree modeling can also be tested for other heavy metals (e.g., copper, lead, cadmium, etc.). However, a major difficulty is encountered related to the coarse scale of some parametric maps used for producing the zinc concentration map.

5. Conclusion

The constructed decision-tree model enabled, for the first time, the mapping of predicted zinc concentrations in a region of Lebanon at a scale of 1:50.000, based on geo-environmental characteristics (e.g., topography, geology, soil and land cover/use). The modeling approach was easily implemented with available GIS software and is suitable for data exploration and predictive zinc concentration mapping. It is explicit and can be critically evaluated and revised when necessary. It can also easily be extrapolated to other Mediterranean countries undergoing socio-economic change.

This decision-tree approach did not produce concentration maps of soil zinc pollution with accuracies substantially higher than those reported in the literature for other methods and/or models (Facchinelli et al., 2001; Li et al., 2001; Norra et al., 2001; Zhang, 2005; Franco et al., 2006). However, it can be considered a quick, simple, realistic, and informative method for combining geo-environmental variables in order to generate a map describing the potential zinc concentration (predictive map). This map can be used to prioritize the choice of study areas for further measurement and modeling and may in the short term help with the selection and adoption of measures to reduce the contamination of densely populated areas and arable lands. Although the chosen scale of this map (1:50.000) seems to be sufficient for estimating the zinc concentration to consider strategies for land protection, the map can be improved for more localized contamination assessment if more detailed data sets are available [higher resolution DTMs and more detailed GIS parametric maps (predictors)].

References

Adachia, K., Tainoshob, Y., 2004. Characterization of heavy metal particles embedded in tire dust. Environmental International 30, 1009–1017.

Berk, R.A., 2003. An Introduction to Ensemble Methods for Data Analysis. UCLA Department of Statistics. Technical Report.

Bocca, B., Alimonti, A., Petrucci, F., Violante, N., Sancessario, G., Forte, G., 2004. Quantification of trace elements by sector field inductively coupled plasma spectrometry in urine, serum, blood and cerebrospinal fluid of patients with Parkinson’s disease. Spectrochimica Acta B59, 559–566.


Bou Kheir, R., Girard, M.-C., Khawlie, M., 2004. Use of a structural classification OASIS for the mapping of landscape units in a representative region of Lebanon. Canadian Journal of Remote Sensing 30 (4), 617–630 (in French).

Bou Kheir, R., Wilson, J., Deng, Y., 2007. Use of terrain variables for predictive gully erosion mapping in Lebanon. Earth Surface Processes and Landforms 32, 1770–1782.

Bou Kheir, R., Chorowicz, J., Abdallah, C., Dhont, D., 2008. Soil and bedrocks distribution estimated from gully form and frequency: a GIS-based decision-tree model for Lebanon. Geomorphology 93, 482–492.

Breiman, L., 2001. Random forests. Machine Learning 45 (1), 5–32.

Chow, C.W.K., Davey, D.E., Mulcahy, D.E., 1997. A neural network approach to zinc and copper interferences in potentiometric stripping analysis. Journal of Intelligent Material Systems and Structures 8 (2), 177–183.

De Temmerman, L., Vanongeval, L.B., Hoenig, M., 2003. Heavy metal content of arable soil in Northern Belgium. Water, Air and Soil Pollution 148, 61–76.

DGA, 1963. Topographic Maps at a Scale 1:20.000. Direction of Geographic Affairs, Republic of Lebanon.

Dubertret, L., 1945. Geological Maps of Lebanon at 1:50.000 Scale. Ministry of Public Affairs, Republic of Lebanon.

Facchinelli, A., Sacchi, E., Mallen, L., 2001. Multivariate statistical and GIS-based approach to identify heavy metal sources in soils. Environmental Pollution 114, 313–324.

Franco, C., Soares, A., Delgado, J., 2006. Geostatistical modelling of heavy metal contamination in the topsoil of Guadiamar river margins (S Spain) using a stochastic simulation technique. Geoderma 136, 852–864.

Franklin, J., 1998. Predicting the distributions of shrub species in California chaparral and coastal stage communities from climate and terrain-derived variables. Journal of Vegetation Science 9, 733–748.

Friedl, M.A., Brodley, C.E., Strahler, A.H., 1999. Maximizing land cover classification accuracies produced by decision trees at continental to global scales. IEEE Transactions on Geoscience Remote Sensing 37, 969–977.

Fuge, R., 2005. Anthropogenic sources. In: Selinum, O. (Ed.), Essentials of Medical Geology: Impacts of the Natural Environment on Public Health. Academic Press, Amsterdam, pp. 43–60.

Geze, B., 1956. Soil Map of Lebanon at a Scale of 1:200.000, Explicative Note, 1956. Republic of Lebanon. Ministry of Agriculture, Beirut, Lebanon.

Henderson, B.L., Bui, E.N., Moran, C.J., Simon, D.A.P., 2005. Australia-wide predictions of soil properties using decision trees. Geoderma 124, 383–398.

Huang, X., Jensen, J.R., 1997. A machine-learning approach to automated knowledge-base building for remote sensing image analysis with GIS data. Photogrammetric Engineering & Remote Sensing 63, 1185–1194.

Kandrika, S., 2008. Land use/land cover classification of Orissa using multi-temporal IRS-P6 awifs data: a decision tree approach. International Journal of Applied Earth Observation and Geoinformation 10 (2), 186–193.

Li, X.D., Poon, C.S., Liu, P.S., 2001. Heavy metal contamination of urban soils and street dusts in Hong Kong. Applied Geochemistry 16, 1361–1368.

Lin, Y.P., Teng, T.P., Chang, T.K., 2002. Multivariate analysis of soil heavy metal pollution and landscape pattern in Changhua County in Taiwan. Landscape and Urban Planning 62, 19–35.

LNCSR-LMoA, 2002. Land Cover/Use Map of Lebanon at a Scale of 1:20.000. Lebanese National Council for Scientific Research and Lebanese Ministry of Agriculture.

Loh, W.Y., Shih, Y.S., 1997. Split selection methods for classification trees. Statistica Sinica 7, 815–840.

Manta, D.S., Angelone, M., Bellanca, A., Neri, R., Sprovieria, M., 2002. Heavy metals in urban soils: a case study from the city of Palermo (Sicily), Italy. Science of the Total Environment 300 (1–3), 229–243.

Manzoor, S., Munir, H.S., Shaheen, N., Khalique, A., Jaffar, M., 2006. Multivariate analysis of trace metals in textile effluents in relation to soil and groundwater. Journal of Hazardous Materials A137, 31–37.

McKenzie, N.J., Ryan, P.J., 1999. Spatial prediction of soil properties using environmental correlation. Geoderma 89, 67–94.

Michaelsen, J., Schimel, D.S., Friedl, M.A., Davis, F.W., Dubayah, R.C., 1994. Regression tree analysis of satellite and terrain data to guide vegetation sampling and surveys. Journal of Vegetation Science 5, 673–686.

Murphy, P., Patrick, M., Michael, J.P., 1994. Exploring the decision forest: an empirical investigation of Occam’s Razor in decision-tree induction. Journal of Artificial Intelligence Research 1, 257–275.

Navas, A., Machin, J., 2002. Spatial distribution of heavy metals and arsenic in soils of Aragon (Northeast Spain): controlling factors and environmental implications. Applied Geochemistry 17, 961–973.

Norra, S., Weber, A., Kramer, U., Stuben, D., 2001. Mapping of trace metals in urban soils. Journal of Soil Sediment 1, 77–97.

Reimann, C., de Caritat, P., 2005. Distinguishing between natural and anthropogenic sources for elements in the environment: regional geochemical surveys versus enrichment factors. The Science of the Total Environment 337 (1–3), 91–107.

Saadia, R.T., Munir, H.S., Shaheen, N., Khalique, A., Manzoor, S., Jaffar, M., 2006. Multivariate analysis of trace metal levels in tannery effluents in relation to soil and water: a case study from Peshawar, Pakistan. Journal of Environmental Management 79, 20–29.

Saha, A.K., Gupta, R.P., Arora, M.K., 2002. GIS-based landslide hazard zonation in the Bhagirathi (Ganga) valley, Himalayas. International Journal of Remote Sensing 23 (2), 357–369.

Schroder, W., 2006. GIS, geostatistics, metadata banking, and tree-based models for data analysis and mapping in environmental monitoring and epidemiology. International Journal of Medical Microbiology 296 (S1), 23–36.

Smolders, E., Degryse, F., 2002. Fate and effect of zinc from tire debris in soil. Environmental Science and Technology 36 (17), 3706–3710.

Venables, W.M., Ripley, B.D., 1994. Modern Applied Statistics with S-Plus. Springer, New York.

Waisberg, M., Joseph, P., Hale, B., Beyersmann, D., 2003. Molecular and cellular mechanisms of cadmium carcinogenesis. Toxicology 192, 95–117.

Zhang, C., 2005. Using multivariate analyses and GIS to identify pollutants and their spatial patterns in urban soils in Galway, Ireland. Environmental Pollution 142, 501–511.

Zhang, H., Burton, S., 1999. Recursive Partitioning in the Health Sciences. Springer, New York, 226 pp.