Backfilling missing microbial concentrations in a riverine database using artificial neural networks

ARTICLE IN PRESS

Available at www.sciencedirect.com

Water Research 41 (2007) 217–227

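0043-1354/$ - see front matter © 2006 Elsevier Ltd. All rights reserved.
doi:10.1016/j.watres.2006.08.022

* Corresponding author. Tel.: +1 859 257 … E-mail address: [email protected] (G. Brion).
1 Currently with Department of Civil Engineering, University of Kentucky, Lexington, KY 40506, USA.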

journal homepage: www.elsevier.com/locate/watres

Backfilling missing microbial concentrations in a riverine database using artificial neural networks

V. Chandramouli (a,1), Gail Brion (b,*), T.R. Neelakantan (c), Srinivasa Lingireddy (b)

(a) Indian Institute of Technology, Guwahati, Assam, India
(b) Department of Civil Engineering and Environmental Health, University of Kentucky, Lexington, KY 40506-0281, USA
(c) School of Civil Engineering, SASTRA Deemed University, Thanjavur 613402, India

Article info

Article history:
Received 8 June 2004
Received in revised form 19 August 2006
Accepted 22 August 2006
Available online 30 October 2006

Keywords:
Artificial neural networks
Fecal coliform bacteria
Atypical bacteria
Backfilling

Abbreviations:
ANN, Artificial neural network models
RSE, Relative strength effect
FC, Fecal coliform bacteria
AC, Atypical total coliform colonies
TC, Total coliform group colonies
BG, Background colonies
CFU, Colony forming units
MSE, Mean square error
MRE, Mean relative error


Abstract

Predicting peak pathogen loadings can provide a basis for watershed and water treatment

plant management decisions that can minimize microbial risk to the public from contact or

ingestion. Artificial neural network models (ANN) have been successfully applied to the

complex problem of predicting peak pathogen loadings in surface waters. However, these

data-driven models require substantial, multiparameter databases upon which to train,

and missing input values for pathogen indicators must often be estimated. In this study,

ANN models were evaluated for backfilling values for individual observations of indicator

bacterial concentrations in a river from 44 other related physical, chemical, and

bacteriological data contained in a multi-year database. The ANN modeling approach

provided slightly superior predictions of actual microbial concentrations when compared

to conventional imputation and multiple linear regression models. The ANN model

provided excellent classification of 300 randomly selected, individual data observations

into two defined ranges for fecal coliform concentrations with 97% overall accuracy. The

application of the relative strength effect (RSE) concept to the selection of input variables for ANN modeling, and an approach for identifying anomalous data observations using cross validation with an ANN model, are also presented.

© 2006 Elsevier Ltd. All rights reserved.

1. Introduction

It has been established that waterborne disease outbreaks are

frequently associated with peak loadings of waterborne

microbes in surface waters, as happened with the large


protozoa outbreak that occurred in Milwaukee after heavy

rains (Mac Kenzie et al., 1994). Waterborne outbreaks have

occurred without the presence of indicator bacteria in treated

water (Craun et al., 1997; Marshall et al., 1997), forcing the

issue of source quality control to reduce potential health


threats. Apart from physical and chemical water quality

measurements, bacterial concentrations in source waters are

still essential for estimating the potential presence of pathogens and for minimizing outbreaks of waterborne and water contact diseases. The presence of fecal coliforms (FC) in

high numbers in a water sample indicates the water may

have received fecal matter from a pathogen source, but other

indicators may be needed for confirmation. To handle the

complex web of relationships between multiple water quality

indicators and pathogens, advanced data-driven modeling

techniques such as artificial neural network (ANN) models are

needed. However, it is difficult to assemble the large, robust,

multivariate databases used to train ANN models without

encountering the problem of occasional missing inputs, so an

appropriate method for backfilling missing data values is

required. This paper investigates the potential for ANN

modeling to fill missing microbial data collected from a

multiyear database obtained from a single location on the

Kentucky River and presents novel approaches to input

parameter selection and data cleaning.

In the past decade, ANNs have been successfully applied in

water resources and environmental research for forecasting,

prediction, classification and pattern recognition. Neelakantan et al. (2001) used a simple feed forward artificial neural

network model to relate peak Cryptosporidium and Giardia

concentrations with other biological, chemical and physical

parameters at a single point along a multi-source impacted

river, and then investigated other, more sophisticated types of

training algorithms with limited improvement in predictive

results (Neelakantan et al., 2002). In other studies using ANN

models, new groups of microbial indicators were discovered

when the models would not train without their inclusion as

inputs. The relationship between atypical colonies (AC), total

coliform colonies (TC), and FC concentrations was key to ANN

classification of runoff types and predominant fecal pollution

sources (Brion et al., 2002; Brion and Mao, 2000; Brion and

Lingireddy, 1999) and for estimating the overall microbial

quality of surface water (Neiman and Brion, 2003). This study

extends this work by trying to predict FC and AC concentrations by ANN modeling from other indicators of water quality.

The underlying relationships between microbial concentrations in surface waters and environmental factors are complex, and the behavior of single groups of fecal indicator bacteria with respect to flow is often indeterminate, as found in this study. While TC and FC from fecal sources normally die off after being flushed into the river, the majority of AC are thought to be more associated with the normal flora of the river and exist in relatively stable concentrations that fluctuate with nutrient spikes and runoff inputs. Beyond the difficulty of modeling the complex relationships between groups of potential indicator bacteria, microbial concentrations are hard to predict with precision, in part because of the expected variance in any reported measurement from the analytical methods used. An alternative approach is to assign a classification representing ranges of bacterial concentrations that are related to the underlying distribution of the data, or to imposed limits of interest.

In this study, the main objectives were: (i) to predict

concentration estimates for fecal coliform (FC) and atypical

(AC) bacterial concentrations using ANN models created from

a select group of physical, chemical and bacterial indices, (ii)

to compare the ANN prediction of FC and AC concentrations

using the same set of input parameters to other standard

multivariate modeling and imputation procedures, (iii) to

classify/categorize FC concentrations with respect to high and

low concentration schemes using ANN models, (iv) to

investigate the applicability of a first-order method based

upon the relative strength effect (RSE) for eliminating

nominal input variables, and (v) to see if ANN modeling

could be used to identify anomalous observations for

potential data cleaning.

2. Neural network models

Multi-layered feed-forward networks have proven to be very powerful computational tools that excel in pattern recognition and function approximation. The general structure and working principles of a feed-forward neural network have been described elsewhere (Brion et al., 2002; Masters, 1993). In this study, simple feed-forward, three-layered ANN models with a back propagation training algorithm and the sigmoidal activation function in the input and output nodes were used.
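To make the model family concrete, the following is a minimal sketch of a three-layer feed-forward network with sigmoidal activations trained by back propagation. It is not the authors' Neurosort software; the class name `ThreeLayerNet`, the weight initialization range, the learning rate and the squared-error loss are all assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class ThreeLayerNet:
    """Input layer -> one sigmoidal hidden layer -> one sigmoidal output node."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # One row of weights per hidden node; the last entry of each row is the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        h = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
             for row in self.w_hidden]
        y = sigmoid(sum(w * v for w, v in zip(self.w_out, h)) + self.w_out[-1])
        return h, y

    def train_step(self, x, target, lr=0.5):
        """One back-propagation update (squared error) for a single observation."""
        h, y = self.forward(x)
        delta_out = (y - target) * y * (1.0 - y)
        # Hidden deltas are computed from the output weights before they change.
        delta_hidden = [delta_out * self.w_out[j] * h[j] * (1.0 - h[j])
                        for j in range(len(h))]
        for j in range(len(h)):
            self.w_out[j] -= lr * delta_out * h[j]
        self.w_out[-1] -= lr * delta_out
        for j, row in enumerate(self.w_hidden):
            for i in range(len(x)):
                row[i] -= lr * delta_hidden[j] * x[i]
            row[-1] -= lr * delta_hidden[j]
        return 0.5 * (y - target) ** 2
```

Under these assumptions, a network with 7 inputs and 9 hidden nodes would be created as `ThreeLayerNet(7, 9)`; inputs and targets are expected to be scaled into the open interval (0, 1) because the output node is sigmoidal.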

3. Methods

3.1. Database compilation

Surface water quality data were gathered routinely by standard methods by water treatment plant operators three times daily at the intake (Lock 9) of a privately owned potable water supply on the Kentucky River and collated into a database by the authors during a USGS funded study (Fig. 1). Since the river system is essentially a series of small lakes separated by locks, and behaves quite differently under high and low flow conditions as well as seasonally, water quality data were augmented with flow and rainfall data from several upstream sites at locks and gauging stations available from the USGS hydrological data web site (Fig. 1). The resultant data set of 1165 observations that was used for modeling comprised 44 separate input parameters per individual observation of FC or AC concentrations. The database contained observations for 6 commonly measured indicator bacteria, 7 commonly measured physical/chemical water quality measurements, measurements of rainfall at 3 points in the watershed, measurements of flow at 6 points along the watershed, and 23 input fields created by lagging flow and rainfall data by 1, 2 or 3 days. The observations were randomly shuffled to eliminate any temporal patterns and then split into training and testing sets before use in separate ANN modeling exercises, or used for other modeling or imputation studies.
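The two data preparation steps described here, creating lagged flow or rainfall fields and randomly shuffling before splitting, can be sketched as follows. The helper names `add_lags` and `shuffle_split`, the fixed seed, and the choice to drop the first days that lack a full lag history are assumptions for illustration, not details given in the paper.

```python
import random

def add_lags(series, lags=(1, 2, 3)):
    """Return rows [value, value lagged by each lag]; the first max(lags) days are dropped."""
    start = max(lags)
    return [[series[t]] + [series[t - k] for k in lags]
            for t in range(start, len(series))]

def shuffle_split(rows, n_test, seed=0):
    """Randomly shuffle observations, then hold out n_test of them for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows[n_test:], rows[:n_test]
```

For example, `add_lags([10, 20, 30, 40, 50])` yields two rows, each holding the current value followed by its 1-, 2- and 3-day lagged values, and `shuffle_split(observations, 300)` holds out 300 observations and leaves the remainder for training.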

3.2. Artificial neural networks

In this study, simple feed-forward, three-layered, ANN

models with back propagation training algorithm and the

sigmoidal activation function in the input and output nodes

were used after examining training algorithms such as radial

basis function network, genetic algorithm-based training and

Fig. 1 – Kentucky River basin map.

conjugate gradient method. None of the other types of models provided superior performance to that of the simple back propagation training algorithm for this data set. Three different types of activation functions, namely sigmoidal, clipped linear, and hyperbolic, were tried in different combinations in the hidden layer and output layer before adopting the sigmoidal function for both layers. Out of the available 1165 data observations, 300 were used for testing/validation and the remainder for training the ANN model. Neurosort Ver II software, a C based program that interfaces with Visual Basic, developed and coded entirely in-house by the authors, was used for all ANN modeling (trial versions of this copyrighted software are available upon request from Dr. Lingireddy). Training was terminated early by monitoring the testing and training indices to prevent overtraining. In this study, data transformations did not provide any additional benefit to the predictive power of the ANN models. The input and output data were normalized; each value in a domain was divided by 1.2 times the maximum value encountered in that domain. The normalization factor was set 20% higher than the maximum in any single input/output data domain to prevent forcing the ANN model to predict from input values at the extreme ends of the range.
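The normalization rule is simple enough to state in a few lines. The sketch below assumes each domain (input or output column) is a list of non-negative numbers with a positive maximum; the function names are illustrative.

```python
def normalize(values, headroom=1.2):
    """Divide every value by headroom times the maximum of its domain."""
    scale = headroom * max(values)
    return [v / scale for v in values], scale

def denormalize(scaled, scale):
    """Map network outputs back to the original units."""
    return [s * scale for s in scaled]
```

With the 20% headroom, the largest value in a domain maps to 1/1.2, roughly 0.83, so the sigmoidal output node is never asked to reach its asymptote at 1.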

3.3. Input variable selection by relative strength effect

Input variable selection is critical to the performance of

classification or prediction models of all types and is

generally accomplished from an understanding of the basic

phenomenon being modeled. Suggestions provided by Maier and Dandy (2000) and ASCE (2001a, b) were very useful in deciding the best architecture. During training, if an input variable is not related to the output sought, the corresponding connecting weights are reduced until they are negligible. Although

with ANN modeling, all available input variables, even those

unrelated to the output variable, or more closely related to

another input, can be used for building the model; some

extraneous inputs may actually introduce noise into the

model which can reduce the precision of prediction, or the

inputs may be irrelevant for the task and should be

eliminated to reduce complexity. Thus, even for ANN models,

the task of selecting a subset of the available input variables is

important. Removal of useless variables helps to understand

the pattern behind the prediction, may add precision, and

speeds up data acquisition and processing.

There are several ways to select useful input variables for

ANN models. Maier and Dandy (1996) carried out sensitivity

analyses and based on the ratio of percent change in output

to percent change in input the less sensitive inputs were

identified and deleted. Stokelj et al. (2002) used principal

component analysis and found that this method was not

efficient for their rainfall-runoff modeling. In this paper, an

approach using a first-order method based upon the relative

strength effect (RSE) for eliminating nominal input variables

is presented. This approach was based upon the work of Kim

et al. (2001) who proposed the RSE as a means to differentiate

the relative influence of different input variables. They

defined the RSE as the partial derivative of the output variable y_k with respect to the input x_i, ∂y_k/∂x_i. The RSE can be used to measure the relative importance of inputs in predicting outputs. When ∂y_k/∂x_i is positive, an increase in the input increases the output; when it is negative, an increase in the input causes a fall in the output.

For each data set examined in this study, the RSE value was estimated for each input. The RSE values of all the inputs were normalized by the maximum absolute RSE value among the inputs. Hence, for a given data set, the RSE values lie between +1 and −1. For basic screening, the average RSE value of an input over the p data sets used in training is considered. The larger the absolute value of the RSE, the greater the contribution of that input variable.
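As a rough illustration of the screening statistic, the sketch below computes the partial derivative of a three-layer sigmoidal network's output with respect to each input by the chain rule, scales the derivatives for each observation by their largest absolute value, and averages over observations. The full RSE formulation of Kim et al. (2001) may differ in detail; the function names and the weight layout (one weight row per hidden node, biases passed separately) are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output_gradient(x, w_hidden, b_hidden, w_out, b_out):
    """dy/dx_i for one observation of a three-layer sigmoidal network."""
    h = [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
         for row, b in zip(w_hidden, b_hidden)]
    y = sigmoid(sum(v * hj for v, hj in zip(w_out, h)) + b_out)
    return [y * (1 - y) * sum(w_out[j] * h[j] * (1 - h[j]) * w_hidden[j][i]
                              for j in range(len(h)))
            for i in range(len(x))]

def relative_strength(observations, w_hidden, b_hidden, w_out, b_out):
    """Average over observations of the gradients scaled to lie in [-1, +1]."""
    totals = [0.0] * len(observations[0])
    for x in observations:
        grad = output_gradient(x, w_hidden, b_hidden, w_out, b_out)
        biggest = max(abs(g) for g in grad) or 1.0  # guard against an all-zero gradient
        for i, g in enumerate(grad):
            totals[i] += g / biggest
    return [t / len(observations) for t in totals]
```

Inputs whose averaged value stays close to zero are the candidates for elimination; the sign indicates whether raising the input raises or lowers the predicted output.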

3.4. ANN input variable selection for river data

The first ANN model, constructed with 44 inputs, 15 hidden nodes in a single hidden layer, and one output (FC concentration), was used for input parameter elimination by RSE. Fig. 2 presents the RSE values estimated for the input parameters considered in this first training of the ANN model for predicting FC. Many of the 44 inputs contributed less than 0.1 as measured by the RSE index. Based on these results, the insignificant input parameters (the lowest 5 RSE ranks in successive ANN trainings) were eliminated until only 9 inputs remained. These 9 inputs were used to train a last ANN model to predict FC, and 2 more inputs were removed due to very low RSE values. A similar approach was used to develop the AC ANN model. After input variable elimination, the RSE values continued to be monitored for the remaining input variables in all subsequent modeling exercises. The final ANN models used architectures (input:hidden:output nodes) of 7:9:1 for FC and 10:5:1 for AC predictions.
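The successive elimination can be sketched as a loop that retrains, ranks inputs by absolute RSE, and drops the five lowest until nine remain. Here `score_fn` is a hypothetical callback standing in for "retrain the ANN on these inputs and return the average RSE of each"; it is not part of the paper or of any library.

```python
def eliminate_inputs(names, score_fn, drop_per_round=5, stop_at=9):
    """Repeatedly drop the inputs with the smallest |RSE| until stop_at remain.

    score_fn(names) is a hypothetical callback that retrains a model on the
    given inputs and returns {name: average RSE}.
    """
    names = list(names)
    history = [len(names)]
    while len(names) > stop_at:
        scores = score_fn(names)
        ranked = sorted(names, key=lambda n: abs(scores[n]))
        n_drop = min(drop_per_round, len(names) - stop_at)
        dropped = set(ranked[:n_drop])
        names = [n for n in names if n not in dropped]
        history.append(len(names))
    return names, history
```

Starting from 44 inputs, seven rounds of five removals give the sequence 44, 39, 34, 29, 24, 19, 14, 9, which matches the stopping point of 9 inputs reported in the text.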

Table 1 shows the statistical information for the most

important input parameters for the first ANN model as well as

for the output FC concentrations. Table 2 provides input

variables used in the final ANN models for FC and AC. Flow

values from upstream of the river intake (the Middle Fork of

the Kentucky River, the Red River, and Kentucky River flow

through Lock 14) were lagged to be the previous day’s reported

values. Flow from Lock 10 is directly upstream of the intake

Fig. 2 – Relative strength effect for 44 input parameters. (x-axis: input variable number; y-axis: RSE.)

where the water was withdrawn from the river’s Lock 9 pool.

Flow data with 2 and 3 day lag periods for upstream sites was

not a major contributor to the model, nor was the inclusion of

rainfall data directly found to be useful. However, the

variance in calcium hardness seemed to capture the dilution

effect of rainfall in this limestone saturated river system.

Even though the related input parameters temperature and alkalinity correlated well with the bacterial data, the ANN model was capable of learning the given patterns successfully without them.

3.5. ANN cross validation study for the identification of anomalous observations

During training, it became apparent that there were some rare, highly skewed observations that were not well modeled by the ANN model and were consistently mispredicted. These anomalous observations became the subject of further scrutiny to determine if they were potential data entry errors, or a combination of inputs caused by events so rare that the ANN model did not have enough of them to train upon. A cross validation study for data screening was performed by segregating the available data into four groups, after which three groups of the data were used for model training and the fourth group (which was not used for training) was used for model verification. This resulted in four opportunities to evaluate predictions on individual data observations. Anomalous observations were identified using the ANN cross validation results as follows. First, individual observations with poor prediction were identified from the verification results. Then, the same observations were flagged and evaluated when used in the training sets. When poor performance was noted for a single observation in all three training sets, along with poor prediction in the one verification set, that individual observation was labeled as an anomalous observation. Although this method for identifying anomalous observations is presented as a proposed first step in a data cleaning procedure, for some of the following modeling exercises these observations were retained in the data sets in order to create a "worst-case" scenario for training and validation exercises, a scenario that should highlight the ability of ANN to adapt to widely varying


Table 1 – Statistical details of data parameters

Parameter                               Mean      Max.        Min.    Std. Dev.
Flow through Lock 10 (cfs)              4019.16   63,500.00   52.00   6487.70
Flow Middle Fork KY River (cfs)         612.86    5180.00     9.50    911.75
Calcium hardness (mg/L)                 91.4      180.0       24.0    31.2
Turbidity (NTU)                         29.8      810.0       1.2     58.1
Background colonies BG (CFU/100 mL)     6546      123,000     <1      13,161
Total coliforms TC (CFU/100 mL)         1598      25,000      1       3192
Fecal streptococci FS (CFU/100 mL)      346       17,000      <1      1138
Fecal coliforms FC (CFU/100 mL)         145       12,000      <1      571
Atypical colonies AC (CFU/100 mL)       5064      86,000      1       8948

observations. The few anomalous observations identified

were removed only for the ANN, multiple linear regression,

and imputation modeling efforts that attempted to predict a

numerical concentration for either FC or AC, but were not

eliminated from the ANN modeling exercises where the data

was grouped into classifications.
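The four-group screening procedure of Section 3.5 can be sketched in code. Each observation is held out once and used for training three times, and it is flagged only if it is poorly predicted in all four models. To keep the example self-contained, a least-squares line stands in for the trained ANN, and "poor prediction" is taken to mean an absolute error above a user-chosen tolerance; the paper does not state its threshold, so both choices are assumptions.

```python
def fit_line(points):
    """Least-squares line through (x, y) points; a stand-in for ANN training."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    slope = sum((x - mx) * (y - my) for x, y in points) / sxx
    return lambda x: my + slope * (x - mx)

def flag_anomalies(points, tolerance, n_groups=4):
    """Return indices of observations mispredicted by every one of the n_groups models."""
    miss_count = [0] * len(points)
    for k in range(n_groups):
        # Group k is held out for verification; the other groups train the model.
        training = [p for i, p in enumerate(points) if i % n_groups != k]
        model = fit_line(training)
        for i, (x, y) in enumerate(points):
            if abs(model(x) - y) > tolerance:
                miss_count[i] += 1
    return [i for i, c in enumerate(miss_count) if c == n_groups]
```

Requiring a miss in all four models is what separates a genuinely anomalous record from one that is merely hard to predict when unseen.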

4. Results

4.1. Observations of microbial data variability and analytical precision

The microbial data supplied to the ANN for estimating FC and

AC counts had considerable ranges in concentrations

during the time of study, with greater range of values

exhibited by the FC values (Table 1). For the majority of

time, when river conditions were stable, FC concentrations

were low, below 35 CFU/100 mL, but climbed orders of

magnitude to 12,000 CFU/100 mL during runoff or high

flow events. Of the 1165 observations available, over 87% of

the observations have less than 200 FC CFU/100 mL recorded.

This skew in data provides fewer observations of peak events

upon which to train an ANN model, and it would be expected

that the model would perform better at predicting low

concentrations just due to the robust numbers of training

observations.

To understand some of the difficulties the ANN faces when attempting to predict microbial concentrations with great numerical precision, one must understand the achievable precision of the estimated number of microorganisms reported from the analytical methods, and the impact of the changing environment upon microbial concentrations. Microbial counts are not absolute, firm measurements, but estimates of the mean concentration of a contaminant assumed to follow a Poisson distribution, with a standard deviation equal to the square root of the mean. For a count on a membrane filter of 30 colonies obtained from triplicate, 1-mL aliquots of environmental water samples, the 95% confidence interval would be ±10.9, or 36% of the mean count (Vail et al., 2003). A 10-fold dilution with the same mean count and a 95% confidence interval would encompass a range of reportable values between 200 and 400 CFU/100 mL. Lower counts result in values with even greater expected percentage variance. Unavoidable assay variability, combined with a lack of replicate aliquots, produces data with great expected variances.
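The arithmetic behind the quoted interval can be checked with a short calculation. This sketch uses the normal approximation to the Poisson distribution, in which the standard deviation of a count equals the square root of its mean; the approximation and the function name are assumptions, and it gives about ±10.7 for a count of 30, close to the ±10.9 cited from Vail et al. (2003).

```python
import math

def poisson_ci_half_width(count, z=1.96):
    """Normal-approximation half-width of the confidence interval for a Poisson count."""
    return z * math.sqrt(count)

half = poisson_ci_half_width(30)   # roughly 10.7 colonies
relative = half / 30               # roughly 0.36 of the count
```

Because the half-width grows with the square root of the count, the relative uncertainty grows as counts fall: the same calculation for a count of 10 gives a half-width of more than 60% of the count.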

The precision of model prediction is impacted by the

imprecision with which bacteria can be measured and large

cumulative errors in modeling diagnostics can be generated

from fuzzy data. One way to minimize the impact of

imprecise and variable data quality is to categorize data into

overlapping groups and frequencies that have meaning

relative to the system under study, as was done for FC later

in this paper, rather than focusing on predicting a precise

concentration. This is where the neural network approach

can help to smooth the generalization provided by the model

and produce a result that can have greater meaning than an

imprecise numerical value.


Table 2 – Input parameters used to model bacterial concentrations

Columns: Flow Lock 10; Flow Middle Fork KY (a); Flow Red River (a); Flow Lock 14 (a); TC; BG; FS; FC; Turbidity; Calcium hardness

FC model: X X X X X X X (7 inputs)
AC model: X X X X X X X X X X (10 inputs)

(a) One-day lagged flow value.

Fig. 3 – Anomalous observation identification by use of ANN prediction and cross validation. (Series: Actual, Verification, Training I, Training II, Training III; y-axis: FC counts in 100 mL.)

4.2. Examination of anomalous observations

Mistakes in data entry, variations in data quality, errors in

data measurement, and unusual events within a watershed

can result in anomalous observations that should not be

relied upon to generalize the underlying functions a model is

attempting to capture. In this study, consistent misprediction

of values during ANN cross validation studies was investi-

gated as a means to identify individual observations that

warranted closer investigation as a proposed first step in the

data cleaning process.

As an example of the procedures followed for examining anomalous observations detected by ANN misprediction of recorded values, five individual observations that had a recorded FC value of 10 CFU/100 mL were selected from the database for an analysis of the 7 key input parameter values associated with this common river condition. The ANN model predicted this low FC value well for 4 of the 5 observations, even though there was considerable variation in input values between the observations, especially flow at the Middle Fork of the Kentucky River the day prior. However, the FC value of the 5th observation was mispredicted by over a factor of 10 (the recorded value was 10 CFU/100 mL; the model predicted 320 CFU/100 mL). This observation was recorded during April 1998, after a rainfall in the upper catchments increased flow and turbidity without increasing recorded FC concentrations, a very rare and unexpected event that raises suspicions of data entry error. On this 5th event, levels of other fecally associated bacterial indicators that tended to trend with FC levels were elevated (fecal streptococci = 200 CFU/mL). While the reported value for fecal streptococci was under the geometric mean of 346 fecal streptococci CFU/100 mL, one would expect FC levels to be much greater than the recorded value of 10, nearer to the geometric mean of 145 FC CFU/100 mL. Further, consistent misprediction of the value of FC for this observation in cross validation analysis by ANN modeling confirmed that the 5th observation was indeed an anomalous observation for water quality conditions in this watershed. It is suspected that the FC value recorded may have been a data entry or recording error.

The use of the ANN to identify observations that warrant

further inspection identified conditions that could not be

easily attributed to recording or data entry errors. The results

of three different ANN training and validation predictions for

an observation with a high recorded FC value are presented in

Fig. 3. It is readily seen that this observation containing a very

high recorded value for FC was not properly modeled by the

ANN in either testing or training. The reasons for this remain

unknown and this may represent a rare event in the

watershed.

The cross validation identification of anomalous observations did not point out a large number for further inspection. In total, only 25 anomalous observations were identified from the total 1165 data observations used for predicting FC. Similarly, in the case of AC, only 19 anomalous observations were identified. This is less than 4% of all observations for these two bacterial groups combined, and about 2% or less of the total observations individually. It appears that the ANN cross validation approach presented may provide a way to identify anomalous events by the observation of consistent misprediction. This may be a function that the authors encode into the software they are developing to flag observations, but more study is needed to validate this approach against other, more standard methods of identifying outliers.


4.3. Results of ANN modeling of bacterial concentrations

ANN models were trained and verified on data sets with anomalous observations removed to assess their ability to predict concentrations of FC and AC at a single point in the Kentucky River. Figs. 4 and 5 show the observed versus estimated values for FC concentrations in the river predicted from the four different groupings of the data sets during ANN validation and training, respectively. To relate the two figures, graphs labeled Batch 1 are those where data sets 2, 3, and 4 were used to train the ANN, while data set 1 was used to test, or validate, the findings. As expected, the training results in Fig. 5 show that the FC concentrations can be fit well to all assembled data sets by ANN during training, with an average R2 value of 0.91 and

Fig. 4 – FC results for ANN validation sets (without anomalous observations). Panels plot estimated FC against observed FC (CFU/100 mL). Batch 1: y = 0.6538x + 44.517, R2 = 0.633. Batch 3: y = 0.6265x + 46.811, R2 = 0.733.

Fig. 5 – FC results for ANN training sets (without anomalous observations). Panels plot estimated FC against observed FC (CFU/100 mL). Batch 1: y = 0.9315x + 17.944, R2 = 0.9407. Batch 3: y = 0.8334x + 15.604, R2 = 0.8731.

slope near 1. Validation on the unseen data sets has an average R2 value of 0.71 with slopes varying from 0.66 to 1.6. The better fit to the training data set is expected, as the model has seen all of those observations during training, in contrast to predicting on unknown observations during verification tests. The variance in the predicted values versus the observed values is greatest at FC concentrations greater than 1000 CFU/100 mL. In addition, the number of high FC observations is limited (only 177 observations above 200 FC CFU/100 mL), and this can negatively impact the model's ability to generalize in this higher range. This can be seen in the under-prediction of the recorded observation of 12,000 FC CFU/100 mL as 6000 FC CFU/100 mL. While the ANN model appears to be learning the training set, the precision of predicting FC concentrations appears to decrease as reported values increase.
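The slope and R2 diagnostics used here can be reproduced with a few lines of code. The sketch below fits a least-squares line of estimated against observed values and then averages the R2 values printed on the Batch 1 to 4 panels of Figs. 4 and 5; the function name is illustrative, and the per-observation data behind the figures are not available, so only the reported panel statistics are used.

```python
def fit_and_r2(observed, estimated):
    """Slope, intercept and R^2 of the least-squares line estimated = a * observed + b."""
    n = len(observed)
    mx = sum(observed) / n
    my = sum(estimated) / n
    sxx = sum((x - mx) ** 2 for x in observed)
    syy = sum((y - my) ** 2 for y in estimated)
    sxy = sum((x - mx) * (y - my) for x, y in zip(observed, estimated))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy ** 2 / (sxx * syy)

validation_r2 = [0.633, 0.8053, 0.733, 0.6692]   # Fig. 4, Batches 1 to 4
training_r2 = [0.9407, 0.8888, 0.8731, 0.9364]   # Fig. 5, Batches 1 to 4
mean_validation = sum(validation_r2) / len(validation_r2)
mean_training = sum(training_r2) / len(training_r2)
```

The two averages round to 0.71 and 0.91, in agreement with the values quoted in the text.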

Fig. 4, Batches 2 and 4 (validation sets, without anomalous observations). Batch 2: y = 0.7757x + 55.446, R2 = 0.8053. Batch 4: y = 0.7054x + 22.033, R2 = 0.6692.

Fig. 5, Batches 2 and 4 (training sets, without anomalous observations). Batch 2: y = 0.8621x + 35.26, R2 = 0.8888. Batch 4: y = 0.9291x + 7.2746, R2 = 0.9364.

Classification schemes were developed as an alternative to predicting absolute values with precision from reported values that have unknown precision but large expected analytical variance. The ANN model correctly classified 97% of the observations with values below 200 CFU/100 mL and 61% of values above 200 CFU/100 mL (Table 3). Upon further inspection, it was noted that most of the mispredictions occurred when the FC value was within a 0.5 log10 deviation around 200 CFU/mL, that is, within a range of values between 63 and 630 FC CFU/mL. This range is in a transition zone for river conditions and had limited data observations, which influenced the size of the training and prediction sets.
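Both figures in this paragraph follow from simple arithmetic, sketched below. The per-block counts are those reported in Table 3 (correct classifications over total observations in each range), and the 0.5 log10 band is obtained by dividing and multiplying the threshold by 10^0.5. The function names are illustrative.

```python
# (correct, total) per cross-validation block, as reported in Table 3.
high = [(29, 50), (34, 48), (29, 46), (16, 33)]          # above 200 CFU/100 mL
low = [(243, 250), (244, 252), (245, 254), (256, 267)]   # below 200 CFU/100 mL

def pooled_accuracy(blocks):
    """Pool the blocks and return (correct, total, fraction correct)."""
    correct = sum(c for c, _ in blocks)
    total = sum(t for _, t in blocks)
    return correct, total, correct / total

def log_band(center, half_width_log10=0.5):
    """Bounds lying half_width_log10 below and above center on a log10 scale."""
    factor = 10 ** half_width_log10
    return center / factor, center * factor
```

Pooling the blocks gives 108/177 (61%) for the high range and 988/1023 (97%) for the low range, and `log_band(200)` returns bounds of roughly 63 and 632, consistent with the 63 to 630 range stated above.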

To study the influence of the size of the database used to train the ANN model, the ability of the ANN model to predict FC concentrations was examined when the training database was limited to only 300 randomly selected data observations. In this experiment, the remaining 865 observations were used as the verification data set. The results showed that the ability of the ANN to classify FC counts into these two classifications did not suffer greatly even when the training database size was limited (72% and 95% accuracy for FC concentrations >200 and <200 CFU/100 mL, respectively).

The next attempt to predict FC concentrations into set classifications with an ANN model used three overlapping ranges to sort the data into three groups: FC > 250 CFU/100 mL, 150 < FC < 250 CFU/100 mL, and FC < 150 CFU/100 mL. The rationale behind the three classifications was to identify two statistically separate populations of observations and to isolate the observations between them, whose reported values fall within the range established by a 68% CI around a reported value of 200 FC CFU/mL (assumed to be obtained from a 1 mL sample of a 10-fold dilution). This classification clearly delineated observations that fell into normal or elevated microbial risk categories as represented by FC values. The low-risk category (output = 0) held the majority of observations and was created to contain observations whose microbial risk would be considered normal for this watershed, while the observations with output = 1 clearly represented elevated indicator levels with an assumed elevated microbial risk.
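The grouping rule can be written down compactly. The sketch below is an illustration only: the label names are hypothetical, the assignment of values exactly equal to 150 or 250 to the intermediate group is an assumption (the text does not say how ties are handled), and only the target outputs 0 and 1 come from the text.

```python
LOW, INTERMEDIATE, ELEVATED = "low", "intermediate", "elevated"

# Target outputs stated in the text; no output code is given for the
# intermediate group, so it is left out of this mapping.
TARGET_OUTPUT = {LOW: 0, ELEVATED: 1}

def classify_fc(fc_cfu_per_100ml, lower=150.0, upper=250.0):
    """Sort one FC observation (CFU/100 mL) into one of the three groups."""
    if fc_cfu_per_100ml < lower:
        return LOW
    if fc_cfu_per_100ml > upper:
        return ELEVATED
    # Assumption: boundary values (exactly 150 or 250) are treated as intermediate.
    return INTERMEDIATE

for value in (40, 200, 1200):
    print(value, classify_fc(value))
```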

Table 3 – Validation: results of ANN fecal coliform dual-range model

| Block no. | Range testing predicted: >200 CFU/100 mL | Range testing predicted: <200 CFU/100 mL |
|---|---|---|
| 1 | 29/50 | 243/250 |
| 2 | 34/48 | 244/252 |
| 3 | 29/46 | 245/254 |
| 4 | 16/33 | 256/267 |
| Total | 108/177 (61%) | 988/1023 (97%) |

Using the same input variables as the two-group classification scheme discussed above, the ANN model was trained on 865 observations and predicted FC classifications for the remaining 300, which served for validation. The performance of the ANN model in this single training and testing exercise is presented in Table 4. For the two distinct classification groups whose FC values differ significantly from each other (<150 and >250 CFU/100 mL), the combined accuracy of the model was greater than 97.6%. The low/normal microbial risk classifications of individual observations were predicted with slightly greater accuracy than those for elevated risk (98.3% and 92.5%, respectively). Using these three overlapping categories, with two categories clearly distinct from each other, the trained ANN model could be used to backfill missing data classifications for the majority of the observations with great precision and confidence.
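The percentages quoted above can be reproduced from the correct/total counts in the "Total" row of Table 4. The short sketch below does that arithmetic; the dictionary and function names are illustrative.

```python
# (correct, total) counts from the "Total" row of Table 4.
TABLE_4_TOTALS = {"I": (945, 961), "II": (16, 69), "III": (124, 134)}

def accuracy(correct, total):
    """Percentage of observations classified correctly."""
    return 100.0 * correct / total

def combined_accuracy(counts, classes):
    """Pooled accuracy over several classes (sum of correct / sum of total)."""
    correct = sum(counts[c][0] for c in classes)
    total = sum(counts[c][1] for c in classes)
    return accuracy(correct, total)

for name, (correct, total) in TABLE_4_TOTALS.items():
    print(f"Range {name}: {accuracy(correct, total):.1f}%")

# Pooled accuracy over the two clearly distinct ranges (I and III).
print(f"Ranges I + III: {combined_accuracy(TABLE_4_TOTALS, ['I', 'III']):.1f}%")
```

Pooling ranges I and III gives 1069 correct out of 1095, or about 97.6%, while the intermediate range II on its own stays near 23%.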

The results clearly indicate that misclassification is worst in the defined intermediate range (observations between 150 and 250 FC CFU/100 mL). This intermediate classification had the fewest observations (69) available for training and testing. The reasons are likely a combination of factors, with data scarcity and the imprecision of the measured concentrations the likely contributors.

Predicting fecally introduced bacteria is different from predicting the influence of flow and weather conditions on bacterial groups that are indigenous to the river system, such as the atypical colonies (AC) discussed earlier. Using a similar approach with slightly different input variables (Table 2), an ANN model was trained; comparisons of predicted and reported concentrations for the testing and training sets are presented in Figs. 6 and 7, respectively. These figures show that the model fits the training data fairly well, with less precise prediction for the validation test results. There are limited data observations (82) above 20,000 CFU/100 mL, which was the 75th percentile. Previous studies of AC concentrations in the Kentucky River and its tributaries have shown that AC concentrations remain relatively stable in comparison to those of FC, and this was the case in this study as well (Table 1). The geometric mean AC concentration was about 5064 CFU/100 mL, and the maximum was only 86,000 AC CFU/100 mL.

Table 4 – Results of ANN FC three-range classification model

| Datasets | I (<200) correct/total | II (150–250) correct/total | III (>250) correct/total |
|---|---|---|---|
| Training | 701/702 (99.9%) | 14/52 (26.9%) | 105/110 (95.5%) |
| Testing | 244/259 (94.2%) | 2/17 (11.8%) | 19/24 (79.2%) |
| Total | 945/961 (98.3%) | 16/69 (23.2%) | 124/134 (92.5%) |


Fig. 6 – AC: results for ANN validation sets (without anomalous observations). Four panels plot estimated AC (CFU/100 mL) against observed AC (CFU/100 mL), with both axes running from 0 to 100,000. Fitted regression lines:
Batch 1: y = 0.627x + 1598, R2 = 0.6575
Batch 2: y = 0.5509x + 1658.8, R2 = 0.6537
Batch 3: y = 0.7871x + 1377.6, R2 = 0.6941
Batch 4: y = 0.7964x + 1036.3, R2 = 0.6687

Fig. 7 – AC: results for ANN training sets (without anomalous observations). Four panels plot estimated AC (CFU/100 mL), with axes running from 0 to 100,000. Fitted regression lines:
Batch 1: y = 0.6435x + 1581.3, R2 = 0.687
Batch 2: y = 0.7335x + 1021.3, R2 = 0.7644
Batch 3: y = 0.6687x + 1506.1, R2 = 0.6854
Batch 4: y = 0.6867x + 1396.5, R2 = 0.7043


Comparing the AC plots with the FC training and testing fits, the AC fits differ less between testing and training, but the scatter around the regression line is greater than that seen for FC, especially in the training results. This poorer fit is to be expected, as the bacterial groups being predicted differ, as do their interactions with the environment.

It was expected that the less variable AC group would be more difficult to model because these bacteria are present in higher numbers, are difficult to discern on the assay, and are autochthonous. The ANN model must predict not only the increase due to wash-in by rain events, followed by decay, but also the increase caused by growth spurred by nutrients added to the river in the days after the storm, and the die-off after these nutrients are depleted.

4.4. Comparison of ANN with other statistical models for backfilling

Two other popular models, multiple linear regression and multiple imputation, were developed to predict FC and AC absolute values. These methods were applied to the raw data set with the same inputs used for the ANN models, for comparison. To generate an imputation of the missing microbial values, a probability model on the complete data (observed and missing values) was first run using the popular multiple imputation NORM software (Schafer, 2000), which uses the multivariate normal distribution for modeling. The multiple imputation method has two stages. First, it generates a maximum likelihood-based covariance matrix and vector of means, as in expectation maximization. Then statistical uncertainty is introduced into the model to emulate the natural variability among cases found in a complete database, and actual data values are imputed to fill in the incomplete data points in the data matrix. For the second comparison model, a multiple linear regression of the following form was developed on the same data set used for imputation:

$$aX_1 + bX_2 + \cdots + nX_n + \text{Constant} = Y \tag{1}$$

A comparison of ANN, multiple linear regression, and multiple imputation was carried out on the raw data set (outliers intact, no transformation except normalization to the highest value for ANN), with 50% of the randomly sorted observations used to train each model and 50% used for validation.

Table 5 – Comparing the relative precision of imputation, multiple linear regression and ANN modeling for backfilling FC and AC values

| Model | Fecal coliform MSE | Fecal coliform MRE | AC MSE | AC MRE |
|---|---|---|---|---|
| ANN | 81,360 | 726.58 | 35,912,569 | 243.75 |
| Regression | 99,781 | 920.10 | 39,817,963 | 261.33 |
| Imputation | 202,790 | 2780.2 | 77,155,074 | 474.82 |
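As an illustration of the regression arm of this comparison, the sketch below fits a model of the form of Eq. (1) on half of a data set and scores it on the other half. It is not the authors' procedure: the data are synthetic and noise-free, the fit uses the ordinary normal equations, all function names are hypothetical, and the MRE formula (mean of |predicted − observed| / observed, as a percentage) is an assumption, since the text does not define MRE.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_linear(xs, ys):
    """Least-squares fit of Y = a*X1 + ... + n*Xn + Constant (normal equations)."""
    rows = [list(x) + [1.0] for x in xs]  # trailing 1.0 carries the constant
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)  # [a, b, ..., Constant]

def predict(coeffs, x):
    """Evaluate the fitted equation for one observation."""
    return sum(c * v for c, v in zip(coeffs, x)) + coeffs[-1]

def mse(observed, predicted):
    """Mean square error."""
    return sum((p - o) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def mre_percent(observed, predicted):
    """Assumed definition: mean of |predicted - observed| / observed, in percent."""
    ratios = [abs(p - o) / o for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

# Synthetic stand-in data with known coefficients (not the river database).
rng = random.Random(0)
xs = [(rng.uniform(1, 10), rng.uniform(1, 10)) for _ in range(40)]
ys = [3.0 * x1 + 0.5 * x2 + 20.0 for x1, x2 in xs]

half = len(xs) // 2  # 50% to train, 50% to validate
coeffs = fit_linear(xs[:half], ys[:half])
validation_pred = [predict(coeffs, x) for x in xs[half:]]
print("coefficients:", [round(c, 3) for c in coeffs])
print("MSE:", mse(ys[half:], validation_pred))
print("MRE (%):", mre_percent(ys[half:], validation_pred))
```

Because the synthetic data contain no noise, the fit recovers the generating coefficients and both error measures are essentially zero; with real microbial data the same two functions would give the kind of MSE and MRE comparison reported in Table 5.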

The results from the commercially available imputation model were not encouraging for backfilling FC or AC bacterial concentrations when compared with either multiple linear regression or ANN backfilling (Table 5). Based on MSE, the ANN model results were superior to those of both the regression model and the imputation model. The imputation model performed with the least precision, with MSE values about twice those of multiple linear regression. Multiple linear regression, owing to very strong correlations between microbial input and output parameter concentrations, performed well in comparison to imputation, but ANN still improved on it in terms of MSE. The MSE value showed 18% and 8.3% improvements for AC and FC prediction, respectively, when comparing multiple linear regression to ANN microbial concentration predictions. The results for the multiple linear regression, which used inputs selected by the RSE criterion, suggest that these tools may be of use in identifying influential inputs from a group of available data. Combining other selection criteria with the ANN-based RSE approach may further enhance the selection of influential input variables.
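The MSE comparisons can be checked against the entries of Table 5. The sketch below assumes that "improvement" means the relative reduction in MSE with the regression model as baseline, (regression − ANN) / regression; the text does not state which baseline was used, so this is one plausible reading rather than the authors' calculation.

```python
# MSE values as printed in Table 5.
TABLE_5_MSE = {
    "FC": {"ANN": 81_360, "Regression": 99_781, "Imputation": 202_790},
    "AC": {"ANN": 35_912_569, "Regression": 39_817_963, "Imputation": 77_155_074},
}

def relative_reduction(baseline, improved):
    """Percentage reduction of `improved` relative to `baseline`."""
    return 100.0 * (baseline - improved) / baseline

for group, values in TABLE_5_MSE.items():
    gain = relative_reduction(values["Regression"], values["ANN"])
    ratio = values["Imputation"] / values["Regression"]
    print(f"{group}: ANN vs regression {gain:.1f}% lower MSE, "
          f"imputation/regression MSE ratio {ratio:.2f}")
```

Under this reading, the printed table entries give a reduction of about 18.5% for FC and about 9.8% for AC, and imputation MSE values roughly twice those of regression (ratios near 2.03 and 1.94). Figures quoted in the text may rest on a different baseline or on unrounded values.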

5. Conclusion

The complex relationships between microorganisms and the surface water environment can be modeled well as distinct classifications through artificial neural network modeling. When compiling a multivariate database, missing observations of microbial concentrations can be estimated with either multiple linear regression or ANN models, but ANN modeling can provide a slightly more precise estimate of microbial concentrations than the other methods evaluated and often does not require data transformation for best results. However, the advantage of ANN models for backfilling observations that occur in clear ranges of interest is best seen in their ability to classify fuzzy data, as shown by the ANN model's performance in predicting whether fecal coliform values fell in the below 200 CFU/100 mL (normal) or the greater than 250 CFU/100 mL (peak) concentration classification. Even without identifying and removing outliers from the river database, ANN models classified peak and normal FC concentrations with greater than 97% accuracy. Using the RSE function of ANN modeling, non-essential input variables can be quickly eliminated from modeling exercises, and ANN can then be used to identify outlier events that should be considered for removal from the database. In all, the application of ANN to microbial data is quite appropriate, provided the imprecision of the underlying analytical method is taken into account when designing risk classifications, and it can provide insights unhampered by multicollinearity, discontinuity, and non-normal data distribution.

Acknowledgements

This research was supported by the United States Environmental Protection Agency (STAR) Project R830376 and by USGS Grants managed by the Water Resource Research Institute of Kentucky. Thanks to Kentucky American Water Company for access to their in-house data on Kentucky River water quality.

References

Brion, G.M., Lingireddy, S., 1999. A neural network approach to identify non-point sources of microbial contamination. Water Res. 33 (14), 3099–3106.

Brion, G.M., Mao, H.H., 2000. Use of total coliform test for watershed monitoring with respects to atypicals. ASCE J. Environ. Eng. 126 (2), 175–181.

Brion, G.M., Neelakantan, T.R., Lingireddy, S., 2002. A neural network based classification scheme for sorting sources and ages of fecal contamination in water. Water Res. 36, 3765–3774.

Craun, G.F., Berger, P.S., Calderon, R.L., 1997. Coliform bacteria and waterborne disease outbreaks. J. AWWA 89, 96–104.

Kim, C.Y., Bae, G.J., Hong, S.W., Park, C.H., Moon, H.K., Shin, H.S., 2001. Neural network based prediction of ground surface settlements due to tunneling. Comput. Geotech. 28, 517–547.

Mac Kenzie, W.R., Hoxie, N.J., Proctor, M.E., Gradus, M.S., Blair, K.A., Peterson, D.E., Kazmierczak, J.J., Addiss, D.G., Fox, K.R., Rose, J.B., Davis, J.P., 1994. A massive outbreak in Milwaukee of Cryptosporidium infection transmitted through the public water supply. N. Engl. J. Med. 331, 161–167.

Maier, H.R., Dandy, G.C., 1996. The use of artificial neural networks for the prediction of water quality parameters. Water Resour. Res. 32 (4), 1013–1022.

Maier, H.R., Dandy, G.C., 2000. Neural networks for the prediction and forecasting of water resources variables: a review of modeling issues and applications. Environ. Model. Software 15, 101–124.

Marshall, M.M., Naumovitz, D., Ortega, Y., Sterling, C.R., 1997. Waterborne protozoan pathogens. Clin. Microb. Rev. 10 (1), 67–85.

Masters, T., 1993. Practical Neural Network Recipes in C++. Academic Press, USA.

Neelakantan, T., Brion, G.M., Lingireddy, S., 2001. Neural network modeling of Cryptosporidium and Giardia concentrations in the Delaware river. Water Sci. Technol. 43 (12), 125–132.

Neelakantan, T., Lingireddy, S., Brion, G.M., 2002. Effectiveness of different artificial neural network training algorithms in predicting protozoa in surface waters. J. Environ. Eng. ASCE 128 (6), 533–542.

Neiman, J., Brion, G.M., 2003. Novel bacterial ratio for predicting fecal age. Water Sci. Technol. 47 (3), 45–49.

Schafer, J.L., 2000. NORM: multiple imputation of incomplete multivariate data under a normal model, version 2.03, software for Windows 95/98/NT, available from http://www.stat.psu.edu/~jls/misoftwa.html.

Stokelj, T., Paravan, D., Golob, R., 2002. Enhanced artificial neural network inflow forecasting algorithm for run-of-river hydropower plants. ASCE J. Water Resour. Plann. Manage. 128 (6), 415–423.

Vail, J.H., Morgan, R., Merino, C.R., Gonzales, F., Millar, R., Ram, J.L., 2003. Enumeration of waterborne Escherichia coli with Petrifilm plates: comparison to standard methods. J. Environ. Qual. 32, 368–373.


