Data Analytics for Environmental Science and Engineering ...

Data Analytics for Environmental Science and Engineering ResearchSuraj Gupta Diana Aga Amy Pruden Liqing Zhang and Peter Vikesland

Cite This httpsdoiorg101021acsest1c01026 Read Online

ACCESS Metrics amp More Article Recommendations

ABSTRACT The advent of new data acquisition and handlingtechniques has opened the door to alternative and morecomprehensive approaches to environmental monitoring that willimprove our capacity to understand and manage environmentalsystems Researchers have recently begun using machine learning(ML) techniques to analyze complex environmental systems and theirassociated data Herein we provide an overview of data analyticsframeworks suitable for various Environmental Science and Engineer-ing (ESE) research applications We present current applications ofML algorithms within the ESE domain using three representative case studies (1) Metagenomic data analysis for characterizing andtracking antimicrobial resistance in the environment (2) Nontarget analysis for environmental pollutant profiling and (3) Detectionof anomalies in continuous data generated by engineered water systems We conclude by proposing a path to advance incorporationof data analytics approaches in ESE research and application

KEYWORDS environmental science and engineering data analytics machine learning metagenomics nontarget analysisanomaly detection water

1 BACKGROUND

Data science is a rapidly evolving interdisciplinary fieldincorporating fundamentals from computer science informa-tion science mathematics and statistics1 It combinesprinciples and methodologies that facilitate and guideextraction of knowledge and insights from available datastreams in a usable format that supports data-driven policy anddecision making2 Moreover data science provides the capacityto better delineate problems by improving alignment betweenthe data that is available and the corresponding questions thatcan be addressed The availability of voluminous amounts ofdata powerful computational resources affordable datastorage and highly efficient algorithms is enabling broaderand deeper data analyses than previously possibleMachine learning (ML) is an emerging data science subfield

that includes algorithms and methodologies that can be used tofind hidden patterns within data and aid in predictive modelconstruction3 ML enables analysis of bigger and ever morecomplex data sets in more efficient and more accurate waysthan otherwise possible4 In the past decade ML has begun tosignificantly impact numerous disciplines5 and EnvironmentalScience and Engineering (ESE) has benefited from this surginginterest in ML and its applications6 At its core ESE isconcerned with improving and maintaining the environmentwith the ultimate goal of protecting human and ecologicalhealth ESE encompasses diverse areas such as water andwastewater treatment air quality environmental impactassessment and hazardous waste management ESE incorpo-rates concepts from disciplines such as basic sciences public

health engineering biological sciences and nanotechnologyResearch and industrial work within ESE increasingly requirecollection of vast amounts of data that simultaneously reflectnumerous data types (eg air and water quality measurementsflow measurements spatial discretization etc) with widespatial and temporal variability Data of this natureencompasses a broad and dynamic range with masses reportedfrom nanogram to teragram and flows ranging from microlitersper second to millions of liters per day Recent advances inenvironmental metagenomics and nontarget analysis furtherexpand the types and volumes of data being generated withinESEThe advent of novel data acquisition and handling

techniques has opened the door to alternative and potentiallymore comprehensive means of environmental monitoring thatwill improve our capacity to understand and manageenvironmental systems For instance metagenomics andnontarget analysis are evolving as powerful ways to identifyunknown contaminants for surveillance78 Further identifica-tion of anomalous or unusual events in waterwastewatertreatment processes using real time water quality data is a keytenet of the ldquodigital waterrdquo movement that holds immense

Featurepubsacsorgest

copy XXXX American Chemical SocietyA

httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX

Dow

nloa

ded

via

NA

NJI

NG

IN

ST O

F G

EO

GR

APH

Y amp

LIM

NO

LO

GY

on

Aug

ust 4

202

1 at

01

053

5 (U

TC

)Se

e ht

tps

pub

sac

sor

gsh

arin

ggui

delin

es f

or o

ptio

ns o

n ho

w to

legi

timat

ely

shar

e pu

blis

hed

artic

les

potential for the water industry9 However interpreting thesetypes of data is not trivial and requires appropriate applicationof ML techniquesThe aim of this Feature is to provide an overview of data

analysis frameworks suitable for application within ESE Wefirst introduce a general data analysis framework then highlightcurrent applications of ML within ESE problem domains andconclude with thoughts about the future

2 GENERAL DATA ANALYSIS FRAMEWORK

A general data analysis framework includes data acquisitiondata exploration data preprocessing and visualizationalgorithms for model building and ways to evaluate andinterpret the models and the results (Figure 1)21 Data Acquisition Data acquisition is the process of

collecting or acquiring data and storing it in a readily accessibleform such as a database (Figure 1a) Data acquisition requiresthorough planning to ensure the data set can be effectivelyinterrogated for the desired end purpose Key elements of sucha plan include deciding upon the data collection methodologyselecting variables of interest determining the samplingfrequency and the number of replicates assessing how muchdata is needed to effectively build the model(s) and test the

study hypotheses and deciding upon means to properlydocument and store the dataWork within ESE increasingly requires acquisition of vast

amounts of data Hence it is necessary to frame and adoptsuccinct protocols to minimize biases and increase compara-bility and reproducibility across the discipline To that extentseveral publications and guidance documents have focused ongood implementation practices for collection of robust datasets10111213

22 Data ExplorationVisualization and Preprocess-ing Data exploration or visualization is used to understand thedata characteristics Data exploration helps illustrate keyaspects such as sample size missing values distributionsinitial patterns correlations or variables that appear to besensitive to the system of interest Exploration is commonlyachieved by plotting and visualizing the data in a manner suchthat distributions and trends are readily apparent14

Raw data are often incomplete noisy and inconsistent dueto data redundancy incompatibilities between multiple datasources and divergent data collection protocols Poor dataquality can lead to inaccurate or misleading analyses15 Datapreprocessing is intended to improve model quality andminimize computational resource usage

Figure 1 Data Analysis Framework and Methodologies (a) Data Collection Gather the data (b) Data ExplorationPreprocessing Fix thediscrepancies visualize and transform the data to the required form (c) Model Building and Evaluation Train the model and evaluateperformance (d) Interpretation Use the model to make predictions analyze and interpret the outcomes (e) Monitoring Based on the learnedknowledge and domain expertise The top thread on the figure represents the tools and software that are available to execute various steps in theend-to-end data analysis framework

Environmental Science amp Technology pubsacsorgest Feature


B

Table

1Overviewof

Pop

ular

MLAlgorithm

s

machine

learning

model

description

positives

andnegatives

exem

plaryapplications

with

inES

E

Logistic

Regression

(Logit

model)24

bullSupervised

MLclassificationalgorithm

bullEasy

toimplem

ent

bullEstim

atetheprobablepresence

ofatype

ofvegetatio

nusing

environm

entalvariables2

5

bullStatisticalmodelthatuses

alogisticfunctio

nto

modeltheprobabilityof

the

output

variable

bullHighlyinterpretable

bullPredictcontam

inantsources26

andwater

quality

2728indifferent

water

system

s

bullUsedto

perform

binary

classificationtaskswhere

givensamples

are

classifiedinto

twogroups

bullDoesnotrequireheavycomputatio

nalresources

bullAnomalydetectionin

water

treatm

entsystem

s28

bullCan

beextended

tomultip

leoutput

variables

bullOftenused

asthefirstmodelor

benchm

arkto

compare

tomorecomplex

models

RandomForest

(RF)

29bullAtree-based

ensemblelearning

methodthat

isused

forsupervised

classificationandregression

problems

bullRobustcanhandlelinearandnonlineardatao

utliersaswellas

noisydata

bullPredictrelativeabundancelevels

insewage7

bullThe

basicunitisadecision

tree

bullNot

proneto

overfittin

gwhenthemodelovertrains

onthetraining

dataandfails

togeneralizeon

the

testingor

newdata

set3

0bullSelect

importantvariablesin

predictin

ganom

alouseventsin

water

distributio

nsystem

s28

bullSamples

andfeatures

arerandom

lysampled

from

thefulldata

setand

individualdecision

treesaremadeon

thesampled

data

set

bullDepending

onthesize

ofthedatasetand

theensembleof

treesRFs

cangetcom

plex

andmay

require

high

computatio

nalresourcesandlong

times

forboth

training

andpredictio

nbullPredictnutrient

concentrations

usingwater

quality

variables3

2

bullThe

outcom

esgeneratedfrom

theensembleof

decision

treesarecombined

todevelopthefinalmodelpredictio

nbullPredictnanofiltrationreverse

osmosis-m

embranerejectionof

emerging

contam

inants31

SupportVector

Machines

(SVM)

bullSupervised

learning

algorithm

bullUsesasubset

oftraining

pointsin

thedecision

functio

n(supportvectors)

andismem

oryefficient

bullNonlineartim

eseries

forecasting

modelto

predictwater

quality

33

bullUsedforboth


tasks

bullEffectiveinhigh

dimensionalspaces

andincaseswhere

thenumberof

samples

islessthan

thenumber

ofdimensionsHow

everchoosingkernelfunctio

nsandregularizatio

niscrucialto

avoidoverfittin

gbullPredicttheconcentrationof

nitrateandnitrite

inthemixed

liquorof

awastewater

treatm

ent

plant34

bullAimsto

findahyperplane

(inan

Ndimensionalfeaturespace)

that

has

maximum

distance

betweendata

pointsof

allclassesThe

hyperplane

classifiesthesamples

ordata

pointsunderconsideration

bullPh

enolicpollutant

concentration

predictio

nin

drinking

water35

bullMaximizingthemargindistance

brings

confidence

intheclassificationof

newdata

points

bullSupportvectorsaredata

pointsthat

arecloser

tothehyperplane

and

influence

thepositio

nandorientationof

thehyperplane

bullSetsof

mathematicalfunctio

ns(kernels)canbe

used

totransform

theinput

data

asrequired

Artificial

NeuralNet-

works

(ANNs)

bullANNsarecomputin

gsystem

sinspired

toanalyzeandprocessinform

ation

theway

thehuman

braindoes

bullDLhasem

ergedas

thestate-of-the-artMLmethod36minus39

bullPredicteffluent

concentrations

inawastewater

treatm

entplant4

2

bullAsimpleANN

consistsof

aninputlayerahidden

layermadeof

entities

(callednodes)

that

receiveweightedinputsfrom

theinputlayerandthen

transform

them

usinganonlinearfunctio

nthat

transm

itsthetransformed

valueto

theoutput

layerwhere

thefinaloutput

isused

togetpredictio

ns

bullDLmodelsrequirelargedatasetsfortraining

toachieverobustperformanceH

oweveritisdifficultto

know

thesamplesize

required

fortheparticular

task

beforehandS

omerulesof

thum

bhave

been

proposed

basedon

anumberof

assumptionsF

orinstancethe

samplesize

needsto

be50minus1000timesthe

numberof

classesor

10minus100times

thenumberof

featureso

r10minus50timesthenumberof

weightsin

the

network40


water

treatm

entsystem

2843

bullIn

conventio

nalANN

architecturetheinform

ationmoves

unidirectio

nally

(iefrom

inputto

output

layer)T

hisarchitectureisalso

know

nas

afeedforwardnetwork

bullFo

rsm

allerdata

setsalternativeMLmodelsmay

perform

better

bullAirquality

forecastingmodel44

bullWhenthenumberof

hidden

layersisincreasedto

achievehigher

levelsof

abstractionin

orderto

extractcomplex

inform

ationfrom

thedatathe

bullANN

architectures

existthat

canbe

used

toperform

supervisedu

nsupervised

orsemisupervised

learning

tasksSomeexam

ples

arerecurrentneuralnetworks

(RNNs)convolutio

naln

euraln

etworks

(CNNs)and

autoencoders



C

Table

1continued

machine

learning

model

description

positives

andnegatives

exem

plaryapplications

with

inES

E

resulting

architectureisknow

nas

adeep

neuralnetworkor

deep

learning

(DL)

bullANNstend

tooverfitC

onscientious

modelbuildingthorough

hyper-

parameter

tuning

andextensivevalidationisoftenrequired41

bullConsideredldquoblack

boxrdquo

modelsas

theinternalfunctio

ning

ofANNscanbe

hard

tointerpretThiscan

lead

toresultmisinterpretatio

nandtheacceptance

ofpoor

modelsAthorough

understandingof

varioushyperparam

etersisnecessaryto

understand

thenuancesof

themodel

Principal

Com

ponent

Analysis

(PCA)4546

bullAnunsupervised

technique

PCAcombinesfeatures

andcreatesanew

featureseto

fthe

samesize

where

each

newfeature(ietheldquocom

ponentsrdquo)

isacombinatio

nof

theoriginalfeatures

bullWidelyused

featureextractio

nmethod

Usedto

reduce

thedimensionality

ofthedata

setwhile

preserving

asmuchdata

variability

aspossible

bullEstim

atethevariance

explained

bygivenvariablesof

awater

quality

data47

bullNew

features

areform

edby

finding

theeigenvectors

andcorresponding

eigenvaluesthat

providean

estim

ationof

howmuchvariance

each

component

explains

bullOne

canselect

afewcomponentsthatexplainthemajority

ofthevariability

inthedata

anddrop

the

remaining

ones

bullIntercom

pare

airpollutio

npat-

terns4

4

bullCan

beused

toreduce

multicollinearityinthedataas

newcomponentsareindependento

feachother

bullEstablishthekeycompounds

todistinguishbetweensamples

usinghigh

resolutio

nmassspec-

trom

etry

(HRMS)

data4849

bullEstim

atetheimportance

ofdifferent

variablesandidentify

key-foulantsin

amem

brane

baseddrinking

water

treatm

ent

process5

0

K-means

Clustering

(K-means)51

bullUnsupervisedMLalgorithm

that

aimsto

clustersamples

usingfeature

values

bullSimpleto

implem

entandcaneasilyscaleto

largedata

sets

bullDelineate

different

airquality

indexthresholdclustersusingair

pollutio

ndata52

bullUserdefines

atarget

kvaluewhich

refersto

thenumberof

clustersone

wantsto

groupthesamples

into

bullDisadvantageisthat

oneneedsto

aprioridefinethekvaluewhich

may

bedifficult

bullAnalyze

water

quality

data55354

bullThe

kvaluerefersto

thenumberof

thecentroid

which

refersto

alocatio

nrepresentin

gthecenter

ofagivencluster

bullFails

forclusterswith

complex

nonsphericalgeom

etricshapes

bullGeneratevulnerability

mapsfor

anaquiferusingwater

quality

data55

bullK-m

eans

startswith

random

centroidsanditerativelyfinds

themost

appropriatecentroidsandassignsthesampleinto

respectiveclusters

such

thatthesum

ofthesquareddistance

betweenthesamples

andthecentroid

isminimized

bullClustercentroidscangetskewed

inthepresence

ofoutliershenceitisadvisedto

handleoutliersbefore

applying

k-means

bullRiskassessmentof

water

pollu-

tionsources5

6

Hierarchical

Clustering


that

aimsto

clustersamples

usingfeature

values

bullNoaprioriinform

ationaboutthenumberof

clustersisrequired

bullFeatureclustering

basedon

Pearsoncorrelations

inHRMS

data57

bullThe

hierarchicalclustering

algorithm

inadditio

nto

breaking

uptheobjects

into

clustersalso

show

sthehierarchyor

rankingof

thedistance

andshow

showdissimilaroneclusterisfrom

othersH

ierarchicalclustering

isrepresentedas

dendograms

bullTimeintensive

bullEstim

atethesimilarity

between

theresistom

eprofilesof

different

samples

basedon

Brayminus

Curtis

dissimilarity

metric58

bullInputisameasureofdissimilaritybetweensamplesT

hechoice

ofsimilarity

(ordistance)measuresisthemaininfluencing

factor

inhierarchical

clustering

bullCould

bedifficultto

identifythecorrectnumberof

clusters

basedupon

thedendogram

bullWater

quality

assessmentwith

hierarchicalclusteranalysisbased

onMahalanobisdistance59

bullHierarchicalclusterscanbe

built

either

asagglom

erative(bottom-up

each

samplestartsin

itsow

nclusterandclustering

isdone

bymerging

each

sampleandmovingup

thehierarchy)

ordivisive

(top-dow

nallsam

ples

are

putin

oneclusterandsplitsareperformed

recursivelymovingdownthe

hierarchy)



D

ESE data sets can incorporate numerous data types withwide spatial and temporal variability Accordingly if possible athorough data visualizationexploration and preprocessingprocess should be performed to gain initial insights andstreamline the data for the subsequent model building steps(Figure 1b) To aid this depending on the data type wellestablished community guidelines could be leveraged Thisprocess typically includes four steps data cleaning integrationreduction and transformation (See the glossary for de-tails)1617

23 Model Building ML Models ML models fall intothree classes supervised unsupervised and semisupervised(Figure 1c) Supervised learning is a ML task where analgorithm is used to train a model using known data and thenthe learned model is used to predict the outcome of new orunforeseen instances18 In supervised learning a sample in thedata set has two components (1) a set of variables (features)that define the sample and (2) class labels (outcomes) that onewants to predict Two types of models are built usingsupervised learning algorithms classification or regressionmodels A classification model is suggested when the outputdata can be categorized into specific groups or classes (iediscrete output variables) If the output variable is a continuousvariable a regression model is recommendedUnsupervised learning is done in the absence of pre-existing

class labels It is a ML category where the goal is to find hiddenand unknown data patterns or to determine the datadistribution across the data set19 Unsupervised learningalgorithms are often used for data exploration and visual-ization The two major types of unsupervised learningapproaches are clustering and dimensionality reductionClustering algorithms are used to group similar samplestogether into clusters whereas dissimilar samples are relegatedto separate clusters Dimensionality reduction algorithms areused to reduce the dimension of the feature set Too manyuninformative features can hinder predictive modeling this isoften referred to as the ldquocurse of dimensionalityrdquo Somecommon supervised and unsupervised learning algorithms arelisted in Table 1Semisupervised learning is a hybrid of supervised and

unsupervised learning where a set of labeled data is used inconjunction with a set of unlabeled training data In manycases getting true labels for a data set can be difficultSemisupervised learning tackles this problem by making use ofunlabeled data in the learning process and then leveraging thislearned information along with the labeled data set forsubsequent prediction20

Model Optimization and Evaluation Supervised learningis usually a three-step process In the first step the data aresplit into training and testing data sets with the split rangingbetween 60 and 80 for training and 40minus20 for testing Thetraining data set is then iteratively separated into training andvalidation sets The model is trained on the training set whilethe validation set is used to tune the model hyperparametersthat consist of the model architecture and the parameters thataffect the speed and quality of the training process Followingthis cross-validation the learned model is tested against thetest data set to evaluate model robustness and accuracy Toensure unbiased model evaluation the test data set cannot beused for training Commonly used model evaluation metricsinclude accuracy F1-score precision recall area under thereceiver operating characteristic (ROC) curve mean absoluteerror (MAE) and mean squared error (MSE)21

Evaluating unsupervised models can be challenging becausedata do not have prior labels Evaluation typically involvesinternal and external validation22 Internal validation is done byestimating inter- and intra-cluster distances to evaluate clusterquality A cluster refers to a collection of data points thataccumulate together based on certain similarities Modelsminimizing intracluster distances while maximizing interclusterdistances are preferred Such a result suggests that thesemodels are performing well at clustering similar data pointstogether Some commonly used metrics for this purpose arethe Adjusted Rand Index and the Silhouette CoefficientHowever a good internal validation score does not guaranteethe effectiveness of the obtained clusters for real applicationsHence external validation is necessary to assess whether thedata points are assigned to the correct clusters This stepusually requires human evaluation or comparison againstbenchmarks For dimensionality reduction task reconstructionerror or loss is estimated to evaluate model performanceThese methods seek to minimize the reconstruction error orlossminusdefined as the distance between the original data pointand its projection onto a lower-dimensional subspace (itsldquoestimaterdquo)23

The ESE community has begun to leverage various MLalgorithms and techniques to analyze environmental data setsWhile it is difficult to determine a priori which ML algorithm isbest suited for a particular data set an improved understandingof the positives and negatives and the inherent assumptions ofthe specific algorithms aid in the choice of the righttechnique(s) Table 1 summarizes algorithms popular withinthe ESE community The table describes the algorithms theiradvantages and disadvantages the assumptions for specific datatypes and gives examples where the algorithms have been usedto address ESE related problems

24 Data Interpretation The final step in the modelbuilding cycle is to interpret the learnt model and developreasonable explanations for the obtained analysis (Figure 1d)Data interpretation could mean extracting important variablescontributing to the model prediction These variables couldhelp in verifying the hypotheses of the study or understandingwhat factors are critical in driving the observations in the studyunder consideration At this stage the literature and expertopinions are queried to make connections between the modeloutputs and knowledge about the relevant chemical physicaland biological phenomena to make data-driven conclusionsand decisions

3 CURRENT APPLICATIONS OF MACHINE LEARNINGIN ESE

In this section we discuss current applications of MLalgorithms within the ESE domain by presenting three casestudies representing different application areas

31 Metagenomic Data Analysis High throughputshotgun (untargeted) metagenomic DNA sequencing offers arobust and effective way to access the microbial world and isnow often used in ESE Differentiating microbial communitiesin different environments6061 studying the dissemination ofantibiotic resistance in environmental systems6263 definingbacterial communities in contaminated environments toexplore bioremediation64 and characterizing microbial com-munities in wastewater treatment processes6566 are allexamples of environments where metagenomic character-ization is increasingly being applied



E

In a typical metagenomic study the first step is often tosearch sequenced DNA reads against a reference database toderive taxonomic or functional gene annotation The obtainedtaxonomic or functional gene profiles are then normalized toprovide relative abundance information corresponding todifferent species or genes in the sample that can be visualizedand analyzed to answer specific questions about the sampleContinuous optimization and declining costs of sequencingplatforms have made metagenomics related research moreaccessible than ever before This trend has led to thegeneration of large volumes of publicly available sequencingdata but rendering these data into meaningful interpretationsremains challenging67 Fortunately the advent of ML methodshas immensely accelerated advancement at every cross sectionof a typical metagenomic study68 Here we highlight someapplications of ML in antibiotic resistance-related studiesOne often faced challenge with metagenomic data is that of

high dimensionality and low sample size (HDLSS ie morevariables than independent samples) For example the numberof genes annotated within a given environmental sample caneasily exceed hundreds of thousands Hence many unsuper-vised techniques such as principal component analysis (PCA)and non-metric multi-dimensional scaling (NMDS) are usedas preprocessing steps to reduce dimensionality by removinguninformative features A common application of thesetechniques is to examine the similaritydissimilarity of differentenvironmental samples by clustering based on species or genecomposition69 Network-based ML (data is represented in theform of a graph where nodes are the data points to be clusteredand edges represent the relationship or similarity between thedata points) is another unsupervised approach to studyproteinminusprotein or geneminusgene interactions to understanddifferent functional pathways70

Given the sample labels or response variables and the genecomposition support vector machines (SVMs) ANNs andensemble methods (such as random forest (RF) or extremelyrandomized trees) have been extensively used to performsupervised learning and build predictive models For exampleidentifying interesting patterns and important genes in a dataset58 predicting relative antibiotic resistance abundancelevels7 understanding the role of socioeconomic status inshaping the resistome or microbiome71 or the prediction ofantibiotic resistance phenotype72 are some of the uniqueproblems that have been examined using the above-mentionedmethods Hidden Markov models (HMMs) which can beused in both supervised or unsupervised fashion are one of themore readily used algorithms to detect antibiotic resistancegene variants or potential functional homologues73 and havebeen applied for the discovery of novel ARGs frommetagenomic data74

Word embedding which has gained a lot of popularity overthe recent years is a feature learning technique used in naturallanguage processing (NLP) Word embedding is a term usedfor the representation of words or sentences in the form ofnumeric vectors that can be used for downstream ML tasks75

Raw DNAprotein sequences sliced into k-mers are analogousto the structure of a sentence and can be analyzed in a similarfashion as natural languages Thus there is a rapid thrusttoward analyzing raw sequences using NLP based techni-ques7677 MetaMLP78 based on a similar idea uses a wordembedding based classification model to predict ARGphenotype and has been found to perform 50times faster thanDIAMOND (one of the fastest sequence alignment methods)

with similar accuracy Word embedding based models holdimmense promise in analyzing metagenomic data as they havethe ability to learn patterns directly from within raw sequencesSimilar to the antibiotic resistance example other

applications can also be envisioned For instance analyzingmetagenomics data for taxonomic classification7980 or mobilegenetic element classification81 where ML algorithms wereable to outperform the traditional methods In essence MLalgorithms have enhanced our ability to robustly analyzecomplex metagenomic data

32 Nontarget Analysis of Environmental Samples Arapidly developing approach in environmental analysis andtoxicology involves the use of high resolution massspectrometry (HRMS) based nontarget analysis (NTA)where data on accurate masses of molecular and fragmentions are collected without a priori information on thechemicals being analyzed NTA can potentially aid in thescreening and analysis of the vast and diverse universe oforganic pollutants a grand challenge faced by the ESEcommunity82 Because the chemicals being detected are notpredetermined NTA provides an opportunity to comprehen-sively examine the occurrence fate and transport of chemicalcontaminants in different environmental niches with minimalbias82 Similarly NTA can be applied in environmentalmetabolomics to facilitate understanding of the effects ofchemical perturbations on exposed organisms (eg plantsanimals and humans) without focusing on a particularbiochemical pathway8384 A recent review highlighted studiesthat employed the different types of NTA techniques used inmetabolomics to discover metabolite changes in plants inducedby exposure to xenobiotics (eg pharmaceuticals personal careproducts pesticides flame retardants and engineered nano-materials) to examine the effect of altered levels of nutrients inthe environment on plant systems85 Here we present a briefsummary of how the combination of NTA with ML canadvance monitoring of water wastewater quality soil andexposed organisms (eg humans wildlife)While there are various methods that can be used for NTA

in ESE HRMS is the most popular because of its capacity forsensitive detection of low levels of contaminants andmetabolites in complex environmental samples However thepower of NTA using HRMS has been limited by problemsassociated with sample preparation and data analysis Inanalyzing environmental samples using MS sample concen-tration and cleanup are critical because the signal intensity ofthe MS features depend not only on the concentrations of thechemicals but also on the amounts and the nature of thematrix present in the sample extracts The reproducibility ofthe ionization of compounds in MS can be compromisedsignificantly by matrix effects Therefore it is important to havean appropriate number of replicate samples and properlyselected blank samples when conducting NTA Solid-phaseextraction for sample cleanup is commonly used but thisapproach can introduce bias because highly polar contaminantsmay be lost during sample preparation Unlike in targetanalysis where variations due to matrix effects and samplelosses can be corrected using stable isotope-labeled referencecompounds as surrogates this approach cannot be used inNTA because the purpose of NTA is to identify unknowncontaminants that were not included in the target list Tonormalize for instrumental variation and matrix effects internalstandards with varying polarity can be added to the sampleextract prior to analysis Recently several compounds of



F

diverse structure log Kow chromatographic behavior andionization efficiency have been proposed for inclusion in aquality control mixture usable for NTA86 However theproposed mixture has limitations such as for the detection ofhydrophilic compounds and molecular formula generation forcompounds containing fluorine Finally due to the substantialamount of MS data acquired indiscriminately under full-scanand the subsequent MSMS fragmentation of molecular andfragment ions data processing to identify relevant features isdaunting Therefore NTA workflows with built-in filters andcriteria are being developed to facilitate prioritization of MSfeatures that are useful for chemical structure annotation Inthis regard advanced data processing tools high throughputstatistical packages and user-friendly visualization programsare needed to fully interrogate the rich data sets acquired byNTANTA based on HRMS holds great promise for the

comprehensive monitoring of the occurrence and fate ofcontaminants in water and wastewater87 It also allowsresearchers to retrospectively analyze stored HRMS data toscreen for suspected contaminants (ie suspect screening) thatmay have been missed during target analysis88 Using advancedcomputational strategies such as hierarchical cluster analysiscommon patterns in the occurrence of contaminants in theenvironment and their emission pathways can be predictedbased on time series analysis of the aggregated NTA data89

NTA combined with cluster analysis was successfully appliedto reveal previously unmonitored chemical contaminants insoil and sediment samples90 The application of NTA andadvanced postacquisition data treatment will continue toenhance our ability to discover emerging contaminants in theenvironment including those that bioaccumulate and poserisks to humans and wildlife For instance new polyhalo-genated compounds were detected for the first time in blubbersamples from marine mammal sentinel species using both LC-HRMS and GC-HRMS for analysis and an open-source datamining software in the R programming environment thatdetected halogenated signatures in full scan HRMS91 FinallyNTA combined with personal passive samplers and propersample preparation techniques can be used to unlock thecomposition of chemical mixtures that humans are exposed toon a daily basis which can be used in investigating the humanexposome92

Akin to metagenomic data NTA data pose a similar HDLSSchallenge as each mass spectrum constitutes a large number ofpeaks representing potential compounds present in a givensample Hence data preprocessing is crucial and inevitableVarious data preprocessing steps are performed such asdetecting peaks subtracting peaks that were found in blankor control samples componentization (ie grouping of signalsthat probably belong to one unique molecular structure) andremoving noise using replicate measurements93 Followingpreprocessing various supervised and unsupervised MLalgorithms can be applied to engineer select and extractrelevant features However before any further analysis datanormalization and data scaling are two crucial steps thatrequire considerationThe most common algorithms used for NTA are linear

projection methods such as PCA and supervised partial least-squares discriminant analysis (PLS-DA)949548 PCA aids insample comparison by removing variance from the sample setSupervised PLS-DA along with relevant metadata can be usedto extract features pertaining to the specific questions that are

being asked However a major challenge is that featuredetection based on peak intensity can be misleading because oferroneous peak assignments Also relevant information can belost when differentiating the actual contaminant signals andbackground noise when using intensity information for featureselection Hence several alternative approaches that aim atusing raw data signals (retention time times mass-to-charge ratio)that bypass the peak detection step have been proposed andimplemented for improved extraction of information andunderlying patterns in the data set96minus98 This includestechniques such as transforming the raw data signal to distancematrices for dissimilarity analysis96minus98 and using clustering toextract features99

A number of other data analysis algorithms exist but havenot yet become common in NTA environmental analysisClustering techniques such as k-means and hierarchicalclustering can be used to identify similar samples Algorithmslike RF SVMs and ANNs can be applied to classify samplesbased on different categories and would be advantageous inillustrating nonlinear relations in the data These algorithmshave shown excellent promise in analyzing HRMS data inother fields100minus103 and this promise can certainly be extendedto environmental sample analysis

33 Anomaly Detection in Engineered Water Sys-tems (EWS) EWS is an umbrella term for systems of watercollection treatment distribution storage and their operationRecently there has been a major thrust toward digitalization ofthe water sector104 In particular the evolution of cyberinfras-tructure and of online process control instrumentation has ledto the development of advanced process control solutions suchas supervisory control and data acquisition (SCADA)systems105106 Such advances have enabled water utilities tocontinuously monitor water quality identify problems andeffectively oversee maintenance issues both remotely and morelocally These systems entail collection of a large volume of rawdata that could be in conjunction with appropriate dataanalysis techniques transformed into valuable information thatcan be leveraged to make proactive decisions to optimizeoverall performance107 In particular there is surging interest inusing ML techniques to identify unusual patterns in raw EWSdata as a means of discovering unexpected activitiesminusthis isbroadly termed anomaly detection108

A typical anomaly detection task in EWS aims todifferentiate between natural expected variations in waterquality and unusual or suspicious variations caused bycontamination or failure somewhere in the system Thechallenge is to characterize these normal water qualityvariations as it requires analysis of long-term data thatencompasses inherent background variability109 This charac-terized normal response helps to flag anomalous events in thedata Various ML algorithms have been applied to address thisproblem (Table 1)Within EWS one primary form of the generated data is a

time series where each time point can be considered a discretesample The time series consists of measurements of indicatorscollected over time from one or several sensors which formthe feature set Hence each sample can be represented as amultidimensional vector where each dimension represents afeature Historic data is then used to train the model Givenappropriate analysis of the supporting data set one can frameand solve problems to address a multitude of aims such asdetecting leaks sensor failures abrupt changes in water qualityor contamination events110minus112



G

There are a number of different anomaly detectiontechniques available113 In the unsupervised approachunlabeled samples are clustered using an algorithm such ask-means density-based clustering algorithms or expectation-maximization (EM) clustering The concept is that the normaldata points cluster to form high density clusters whereasanomalous data points cluster separately or distant from thenormal data pointsclusters and are located in low-densityregions This way one can carry out additional investigationusing new samples and classify them relative to the normaldata In the supervised approach labels (normal or anomalous)are assigned to each sample based on past information andexpert knowledge of anomalous behavior In this way theproblem is converted into a binary classification problemThough it can easily be extended to a multiclass classificationto categorize different types of anomalies SVM ANNBayesian Network Logistic Regression models and theirvariants are well explored algorithms in this space113 A specialcase of SVM One-Class SVM (OCSVM) is a widely usedmethod for anomaly detection In OCSVM the entire trainingdata set is considered as one-class (eg normal class) and thenew data points are classified as similar (normal) or different(anomalous) to the training data Because considering all datapoints from one class is equivalent to having no label OCSVMis considered as unsupervised learning methodOwing to the success of ML in detecting anomalies in other

domains such as network security researchers have startedexploring its potential in water quality anomaly detec-tion43114115 Based on the studies published so far DLalgorithms have shown promise in detecting anomalies andhave outperformed conventional techniques on a number ofoccasions However it should be noted that DL methods couldbe slow to train depending on the depth of the network andthe amount of data available116

Many studies have adopted a batch learning approach wherethe model is trained on historical data and then the new data iscategorized using this trained model With a continuousincoming time-series data stream and the possibility of novelanomalies retraining the models with every new data set canquickly become impractical and difficult to reliably executeContinual or active learning frameworks that continuouslylearn as the new data stream comes in and that can identifyanomalous behavior in real time are required to circumventthese issues117118 Variants of Latent Dirichlet AllocationMarkov Models and ANN based architectures are algorithmsthat have been applied in other domains to achieve continualonline learning119minus121 However thes approaches remainunderexplored for anomaly detection in water systems asthese frameworks are not trivial and have their ownimplementation challenges118119

Ultimately anomaly detection will be most valuable if it canlearn continually as the data stream comes in and yield real-time reporting that informs immediate corrective action It iscrucial that the models being utilized are fast and accurateHence going forward there is a need to shift the focus towardhybrid approaches (combinations of different algorithms) tobuild powerful models that are able to detect multiple types ofanomalies in real-time Such models will enhance EWSanomaly detection and advance public health9118119

4 PATH FORWARDThere is increasing interest in and a growing body of researchdocumenting how data analytics is being used to address ESE

problems As illustrated in this Feature the power of dataanalytics has been widely recognized as has been the vast needfor its application Anomaly detection has particularlypromising applications for water professionals and practi-tioners Notably broader application of metagenomics andNTA can revolutionize environmental monitoring effortsHowever the application of data analytics in ESE practiceremains in its infancy and concerted efforts are required tomake data analytics an integral part of ESE research andeducation Such a goal would most effectively be achieved ifthere were an agreed-upon plan of action Coordinated effortson several fronts are needed to help the ESE community reapthe potential benefits of data analyticsFirst there is a need to encourage collaborations between

data scientists and ESE practitioners that involve diverselyengaged and integrated research teams from multipledisciplines Such collaborations will facilitate cross-disciplinarycommunication and simultaneous skills building For exampledata scientists come from a culture highly supportive of datasharing (eg GitHub Bitbucket public databases) The ESEcommunity should embrace this culture and incorporate datasharing both locally regionally and globally Recent efforts toaddress the COVID-19 pandemic122123 the global dissem-ination of antimicrobial resistance7 and the development ofglobally vetted data analysis approaches within the NTAcommunity are all steps in this direction Such collaborationshave the potential to yield profound data-driven insights andconclusions that would be impossible to achieve within normalacademic and professional silos Working on their own datascientists may create models that answer specific questions butlack the interpretive training required to contextualize theirresults into real world applications Simultaneously ESEpractitioners with data may not possess the correspondingtools or insights required for proper analysis Strategiccollaborations will lead to improved data analytics datainterpretation and improved decision makingSecond there is a lack of user-friendly data analytics tools

and platforms devised for ESE problems Although under-standing the theory behind ML algorithms is not a difficult taskfor many trained scientists a major hurdle that ESE researchersface is in the coding and implementation of these models Withthe continual advancement of data science programming isbecoming an increasingly necessary skill without which modelsmay be improperly developed or may lack efficiency User-friendly tools may circumvent the problems with codinglearning curves and may be essential for widespreadpractitioner adoption While many tools (Microsoft powerBI Weka Tableau Azure) are available for general dataanalysis application-specific toolsplatforms would greatly aidin performing end-to-end analysesThird the field needs to incorporate data analytics-oriented

curricula This could include the addition of data analyticsfocused modules within existing coursework the introductionof new data science and statistics courses within ESE degreeprograms and the development of relevant capstone projectsfor ESE applications Such experiential learning would helpstudents understand the intricacies of different algorithms andprovide hands-on experience in applying various data analyticstechniques in the context of specific ESE problems Furtherencouraging students to take part in data science internshipswould be an excellent means of developing such expertiseFourth introducing data analytics workshops and making

the relevant resources available is valuable for students



H

researchers and professionals A plethora of online resourcessuch as Coursera LinkedIn Learning and Edx are available forlow or no cost offering high-quality content in data scienceExpanding access to and creation of open source learningplatforms could prove instrumental in instigating the data-driven problem-solving approachFinally we know that data analytics is not new but

continues to evolve at a rapid pace These advancementsfurnish us with new and powerful ways to analyze the data thatcan help holistically tackle the challenges faced by ESEcommunity and ultimately inform decision-making and policyformulation at scales never previously imagined Hence it iscritical to be abreast of new developments in data analytics andto continue making efforts to harvest their full potential

AUTHOR INFORMATIONCorresponding AuthorPeter Vikesland minus Via Department of Civil andEnvironmental Engineering Virginia Tech BlacksburgVirginia 24061 United States orcidorg0000-0003-2654-5132 Email pvikesvtedu

AuthorsSuraj Gupta minus The Interdisciplinary PhD Program in GeneticsBioinformatics and Computational Biology Virginia TechBlacksburg Virginia 24061 United States

Diana Aga minus Department of Chemistry University at BuffaloThe State University of New York Buffalo New York 14226United States orcidorg0000-0001-6512-7713

Amy Pruden minus Via Department of Civil and EnvironmentalEngineering Virginia Tech Blacksburg Virginia 24061United States orcidorg0000-0002-3191-6244

Liqing Zhang minus Department of Computer Science VirginiaTech Blacksburg Virginia 24061 United States

Complete contact information is available athttpspubsacsorg101021acsest1c01026

NotesThe authors declare no competing financial interestBiography

Peter J Vikesland is the Nick Prillaman Professor of Civil andEnvironmental Engineering at Virginia Tech He received his BAfrom Grinnell College in Chemistry in 1993 and MS and PhD inCivil and Environmental Engineering from University of Iowa in 1995and 1998 Vikeslandrsquos research focuses on the fate of nanomaterials inthe environment nanotechnology enabled sensor platforms forenvironmental quality assessment the environmental disseminationof antibiotic resistance and the application of data analytics in

environmental science and engineering He is a past President of theAssociation of Environmental Engineering and Science Professors(AEESP) is a US National Science Foundation CAREER awardeeand is the recipient of the 2018 Walter Weber Research InnovationAward from AEESP

ACKNOWLEDGMENTS

This study was supported by National Science Foundation(NSF) awards OISE (1545756) ECCSS NNCI (1542100)and CSSI (2004751) Additional support was provided by theCenter for Science and Engineering of the Exposome at theVirginia Tech Institute for Critical Technology and AppliedScience (ICTAS) and the Virginia Tech Graduate Schoolsupported Genetics Bioinformatics and ComputationalBiology (GBCB) and Sustainable Nanotechnology (VTSuN)programs

GLOSSARY

Accuracy Correct predictionstotal predictions

Autoencoders A deep learning techni-que that aims to build amodel where the outputtargets are set equal tothe input variables Itseeks to learn an ap-proximate representa-tion of the data andreconstruct it It is anunsupervised learningapproach used for di-mensionality reduction

Bayesian Network Bayesian networks are akind of probabilisticgraphical model thatutilizes Bayesian infer-ence for probability es-timations Bayesian net-works are used tomodel conditional de-pendence and hencecausation using a direc-ted graph

Data cleaning Identification of outliersor filling in of missingvalues Outlier identifi-cation can be done byusing Z-score quartilevalues or hypothesistesting Missing datacan be handled in mul-tiple ways such as fillingthe value by computingthe summary statisticsof the given variableusing predicted valuescomputed by an MLalgorithm manual cura-tion or by ignoring themissing record



I

Data integration The combination ofmultiple data sourcesinto a single coherentform

Data reduction Aggregation or elimina-tion of redundant in-formation to reduce thedata set size

Data transformation Conversion of raw databy normalizing to acommon scale to en-sure consistency andcomparability

Density Based Clustering Work by recognizingldquodenserdquo groups ofpoints permitting it tolearn clusters of discre-tionary shape and dis-tinguish anomalies inthe information

Dimensionality Reduction The process of reducingthe number of randomvariables or attributesunder consideration

EM Clustering Similar to k-means ex-cept it assigns the sam-ples into clusters basedon the probabilities es-timated using the EMalgorithm The objec-tive is to maximize theoverall probability ofthe data for the given(final) clusters

Ensemble Methods Ensemble methods arealgorithms that com-bine predictions fromseveral base models toobtain one optimalmodel

F1-Score F1-Score is a harmonicmean of recall andprecision

HMM HMM is a statisticalMarkov mode l i nwhich the system beingmodeled is assumed tobe a Markov process

k-mer A substring with alength k in a biologicalsequence of nucleoti-des

Naive Bayes Naive Bayes methodsare a family of simpleprobabilistic classifiersbased on Bayes ruleThe method is calledldquoNaiverdquo for its assump-tion of conditional in-dependence among thefeatures

Natural Language Processing (NLP) A subfield of computerscience artificial intelli-gence and linguisticsthat deals with theinteraction of com-puters with human lan-guages

NMDS NMDS is an ordinationtechnique based on dis-tance or dissimilaritymatrix NMDS repre-sents pairwise dissimi-larity between samplesin a low-dimensionalspace

PLS-DA PLS-DA is a linearclassification modelthat is able to predictthe class of new sam-ples

R Programming languagefor Statistical Comput-ing

ROC-Curve A receiver operatingcharacteristic (ROC)curve is a plot of truepositive rate (TPR-yaxis) against the falsepositive rate (FPR-xaxis) It measures theperformance of a classi-fier

SCADA Supervisory control andd a t a a c q u i s i t i o n(SCADA) is a systemdesigned to gather andanalyze real time dataIt is used in waterwastewater treatmentplants to monitor andmanage processes

REFERENCES(1) Tukey J W The future of data analysis Ann Math Stat 196233 (1) 1minus67(2) Hayashi C What is data science Fundamental concepts and aheuristic example In Data science classification and related methodsSpringer 1998 pp 40minus51(3) Bishop C M Pattern recognition and machine learning Springer2006(4) Qiu J Wu Q Ding G Xu Y Feng S A survey of machinelearning for big data processing EURASIP Journal on Advances inSignal Processing 2016 2016 (1) 67(5) Agarwal R Dhar V Big data data science and analytics Theopportunity and challenge for IS research In INFORMS 201425443(6) Wilcox C Woon W L Aung Z Applications of machinelearning in environmental engineering Citeseer 2013(7) Hendriksen R S Munk P Njage P Van Bunnik BMcNally L Lukjancenko O Roumlder T Nieuwenhuijse DPedersen S K Kjeldgaard J Global monitoring of antimicrobialresistance based on metagenomics analyses of urban sewage NatCommun 2019 10 (1) 1124(8) Sobus J R Wambaugh J F Isaacs K K Williams A JMcEachran A D Richard A M Grulke C M Ulrich E MRager J E Strynar M J Integrating tools for non-targeted analysis



J

research and chemical safety evaluations at the US EPA J ExposureSci Environ Epidemiol 2018 28 (5) 411minus426(9) Dogo E M Nwulu N I Twala B Aigbavboa C A survey ofmachine learning methods applied to anomaly detection on drinking-water quality data Urban Water J 2019 16 (3) 235minus248(10) Wilde F D Radtke D B Gibs J Iwatsubo R T Nationalfield manual for the collection of water-quality data US Geological SurveyTechniques of Water-Resources Investigations Book 9 1998(11) National Research Council Confronting the nationrsquos waterproblems The role of research National Academies Press 2004(12) Bharti R Grimm D G Current challenges and best-practiceprotocols for microbiome analysis Briefings in bioinformatics202122178(13) Veenaas C Developing tools for non-target analysis and digitalarchiving of organic urban water pollutants 2018(14) Keim D A Visual exploration of large data sets CommunACM 2001 44 (8) 38minus44(15) Chen J Chen Y Du X Li C Lu J Zhao S Zhou X Bigdata challenge a data management perspective Frontiers of ComputerScience 2013 7 (2) 157minus164(16) Chakrabarti S Cox E Frank E Guting R H Han J JiangX Kamber M Lightstone S S Nadeau T P Neapolitan R EData mining know it all Morgan Kaufmann 2008(17) Gibert K Sanchez-Marre M Izquierdo J A survey on pre-processing techniques Relevant issues in the context of environ-mental data mining AI Communications 2016 29 (6) 627minus663(18) Russell S Norvig P Artificial intelligence a modern approach2002(19) Hinton G E Sejnowski T J Poggio T A UnsupervisedLearning Foundations of Neural Computation MIT press 1999(20) Van Engelen J E Hoos H H A survey on semi-supervisedlearning Machine Learning 2020 109 (2) 373minus440(21) Brink H Richards J W Fetherolf M Cronin B Real-worldmachine learning Manning Shelter Island NY 2017(22) Kassambara A Practical guide to cluster analysis in RUnsupervised machine learning Sthda Vol 1(23) Ghodsi A Dimensionality reduction a short tutorialDepartment of Statistics and Actuarial Science Univ of WaterlooOntario Canada 2006 37 (38) 2006(24) Wright R E Logistic regression 1995(25) Alvarez-Arbesu R Feliciacutesimo A M GIS and logisticregression as tools for environmental management a coastal cliffvegetation model in Northern Spain WIT Transactions on Informationand Communication Technologies 2002 26(26) Liu L Sankarasubramanian A Ranjithan S R Logisticregression analysis to estimate contaminant sources in waterdistribution systems J Hydroinf 2011 13 (3) 545minus557(27) Thoe W Gold M Griesbach A Grimmer M Taggart ML Boehm A B Predicting water quality at Santa Monica Beachevaluation of five different models for public notification of unsafeswimming conditions Water Res 2014 67 105minus117(28) Muharemi F Logofa tu D Andersson C Leon FApproaches to building a detection model for water quality a casestudy In Modern Approaches for Intelligent Information and DatabaseSystems Springer 2018 pp 173minus183(29) Liaw A Wiener M Classification and regression byrandomForest R news 2002 2 (3) 18minus22(30) Gromski P S Muhamadali H Ellis D I Xu Y Correa ETurner M L Goodacre R A tutorial review Metabolomics andpartial least squares-discriminant analysis-a marriage of convenienceor a shotgun wedding Anal Chim Acta 2015 879 10minus23(31) Lee S Kim J Prediction of Nanofiltration and Reverse-Osmosis-Membrane Rejection of Organic Compounds UsingRandom Forest Model J Environ Eng 2020 146 (11) 04020127(32) Castrillo M Garciacutea A L Estimation of high frequencynutrient concentrations from water quality surrogates using machinelearning methods Water Res 2020 172 115490

(33) Tan G Yan J Gao C Yang S Prediction of water qualitytime series data based on least squares support vector machineProcedia Eng 2012 31 1194minus1199(34) Yang Y H Guergachi A Khan G Support vector machinesfor environmental informatics application to modelling the nitrogenremoval processes in wastewater treatment systems Journal ofEnvironmental Informatics 2006 7 (1) 14minus25(35) Feng J Pan L Cui B Sun Y Zhang A Gong M NewApproach for Concentration Prediction of Aqueous PhenolicContaminants by Using Wavelet Analysis and Support VectorMachine Environmental Engineering Science 2020 37 (5) 382minus392(36) Silver D Schrittwieser J Simonyan K Antonoglou IHuang A Guez A Hubert T Baker L Lai M Bolton AMastering the game of go without human knowledge Nature 2017550 (7676) 354minus359(37) Mikolov T Deoras A Povey D Burget L Cernocky JStrategies for Training Large Scale Neural Network Language ModelsIEEE 2011 pp 196minus201(38) Deng J Dong W Socher R Li L-J Li K Fei-Fei L InImagenet A Large-Scale Hierarchical Image Database Ieee 2009 pp248minus255(39) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (Aug) 2493minus2537(40) Alwosheel A van Cranenburgh S Chorus C G Is yourdataset big enough Sample size requirements when using artificialneural networks for discrete choice analysis Journal of choice modelling2018 28 167minus182(41) Yang J Xu J Zhang X Wu C Lin T Ying Y Deeplearning for vibrational spectral analysis Recent progress and apractical guide Anal Chim Acta 2019 1081 6minus17(42) Guo H Jeong K Lim J Jo J Kim Y M Park J-p Kim JH Cho K H Prediction of effluent concentration in a wastewatertreatment plant using machine learning models J Environ Sci 201532 90minus101(43) Inoue J Yamagata Y Chen Y Poskitt C M Sun JAnomaly Detection for a Water Treatment System Using UnsupervisedMachine Learning 2017 2017 IEEE 2017 pp 1058minus1065(44) Voukantsis D Karatzas K Kukkonen J Raumlsaumlnen TKarppinen A Kolehmainen M Intercomparison of air quality datausing principal component analysis and forecasting of PM10 andPM2 5 concentrations using artificial neural networks in Thessalonikiand Helsinki Sci Total Environ 2011 409 (7) 1266minus1276(45) Wold S Esbensen K Geladi P Principal componentanalysis Chemom Intell Lab Syst 1987 2 (1minus3) 37minus52(46) Jolliffe I T Cadima J Principal component analysis a reviewand recent developments Philos Trans R Soc A 2016 374 (2065)20150202(47) Dalal S G Shirodkar P V Jagtap T G Naik B G Rao GS Evaluation of significant sources influencing the variation of waterquality of Kandla creek Gulf of Katchchh using PCA Environ MonitAssess 2010 163 (1minus4) 49minus56(48) Scholleacutee J E Schymanski E L Avak S E Loos MHollender J Prioritizing unknown transformation products frombiologically-treated wastewater using high-resolution mass spectrom-etry multivariate statistics and metabolic logic Anal Chem 2015 87(24) 12121minus12129(49) Masiaacute A Campo J Blasco C Picoacute Y Ultra-highperformance liquid chromatography-quadrupole time-of-flight massspectrometry to identify contaminants in water an insight onenvironmental forensics Journal of Chromatography A 2014 134586minus97(50) Peiris R H Halleacute C Budman H Moresoli C Peldszus SHuck P M Legge R L Identifying fouling events in a membrane-based drinking water treatment process using principal componentanalysis of fluorescence excitation-emission matrices Water Res 201044 (1) 185minus194(51) MacQueen J In Some methods for classification and analysis ofmultivariate observations 1967 Oakland CA 1967 pp 281minus297



K

(52) Kingsy G R Manimegalai R Geetha D M S Rajathi SUsha K Raabiathul B N In Air pollution analysis using enhanced K-Means clustering algorithm for real time sensor data IEEE 2016 pp1945minus1949(53) Zou H Zou Z Wang X An enhanced K-means algorithmfor water quality analysis of the Haihe River in China Int J EnvironRes Public Health 2015 12 (11) 14400minus14413(54) Areerachakul S Sanguansintukul S Clustering analysis of waterquality for canals in Bangkok Thailand Springer 2010 pp 215minus227(55) Javadi S Hashemy S M Mohammadi K Howard K W FNeshat A Classification of aquifer vulnerability using K-means clusteranalysis J Hydrol 2017 549 27minus37(56) Li C Sun L Jia J Cai Y Wang X Risk assessment ofwater pollution sources based on an integrated k-means clustering andset pair analysis method in the region of Shiyan China Sci TotalEnviron 2016 557 307minus316(57) Brunner A M Bertelkamp C Dingemans M M LKolkman A Wols B Harmsen D Siegers W Martijn B JOorthuizen W A Ter Laak T L Integration of target analyses non-target screening and effect-based monitoring to assess OMP relatedwater quality changes in drinking water treatment Sci Total Environ2020 705 135779(58) Gupta S Arango-Argoty G Zhang L Pruden A VikeslandP Identification of discriminatory antibiotic resistance genes amongenvironmental resistomes using extremely randomized tree algorithmMicrobiome 2019 7 (1) 123(59) Du X Shao F Wu S Zhang H Xu S Water qualityassessment with hierarchical cluster analysis based on Mahalanobisdistance Environ Monit Assess 2017 189 (7) 1minus12(60) Bibby K Crank K Greaves J Li X Wu Z Hamza I AStachler E Metagenomics and the development of viral water qualitytools npj Clean Water 2019 2 (1) 1minus13(61) Chu B T T Petrovich M L Chaudhary A Wright DMurphy B Wells G Poretsky R Metagenomics reveals the impactof wastewater treatment plants on the dispersal of microorganismsand genes in aquatic sediments Appl Environ Microbiol 2018 84 (5)No e02168-17(62) Petrovich M L Zilberman A Kaplan A Eliraz G R WangY Langenfeld K Duhaime M Wigginton K Poretsky R AvisarD Microbial and Viral Communities and Their Antibiotic ResistanceGenes Throughout a Hospital Wastewater Treatment System FrontMicrobiol 2020 11 153(63) Dai D Rhoads W J Edwards M A Pruden A Shotgunmetagenomics reveals taxonomic and functional shifts in hot watermicrobiome due to temperature setting and stagnation FrontMicrobiol 2018 9 2695(64) Malla M A Dubey A Yadav S Kumar A Hashem AAbd_Allah E F Understanding and designing the strategies for themicrobe-mediated remediation of environmental contaminants usingomics approaches Front Microbiol 2018 9 1132(65) Giwa A S Ali N Athar M A Wang K Dissectingmicrobial community structure in sewage treatment plant forpathogensrsquo detection using metagenomic sequencing technologyArch Microbiol 2020 202 1minus9(66) Guo J Ni B-J Han X Chen X Bond P Peng Y YuanZ Data on metagenomic profiles of activated sludge from a full-scalewastewater treatment plant Data in brief 2017 15 833minus839(67) Laudadio I Fulci V Stronati L Carissimi C Next-generation metagenomics Methodological challenges and opportu-nities Omics a journal of integrative biology 2019 23 (7) 327minus333(68) Soueidan H Nikolski M Machine learning for metagenomicsmethods and tools arXiv preprint arXiv151006621 2015DOI 101515metgen-2016-0001(69) Ramette A Multivariate analyses in microbial ecology FEMSMicrobiol Ecol 2007 62 (2) 142minus160(70) Yang P Yu S Cheng L Ning K Meta-network optimizedspecies-species network analysis for microbial communities BMCGenomics 2019 20 (2) 187

(71) Collignon P Beggs J J Walsh T R Gandra SLaxminarayan R Anthropological and socioeconomic factorscontributing to global antimicrobial resistance a univariate andmultivariable analysis Lancet Planetary Health 2018 2 (9) No e398-e405(72) Arango-Argoty G Garner E Pruden A Heath L SVikesland P Zhang L DeepARG a deep learning approach forpredicting antibiotic resistance genes from metagenomic dataMicrobiome 2018 6 (1) 1minus15(73) El-Gebali S Mistry J Bateman A Eddy S R Luciani APotter S C Qureshi M Richardson L J Salazar G A Smart AThe Pfam protein families database in 2019 Nucleic Acids Res 201947 (D1) D427minusD432(74) Berglund F Marathe N P Osterlund T Bengtsson-PalmeJ Kotsakis S Flach C-F Larsson D G J Kristiansson EIdentification of 76 novel B1 metallo-β-lactamases through large-scalescreening of genomic and metagenomic data Microbiome 2017 5 (1)134(75) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (ARTICLE) 2493minus2537(76) Ng P dna2vec Consistent vector representations of variable-length k-mers arXiv preprint arXiv170106279 2017(77) Woloszynek S Zhao Z Chen J Rosen G L 16S rRNAsequence embeddings Meaningful numeric feature representations ofnucleotide sequences that are convenient for downstream analysesPLoS Comput Biol 2019 15 (2) No e1006721(78) Arango-Argoty G A Heath L S Pruden A Vikesland PZhang L MetaMLP A fast word embedding based classifier to profiletarget gene databases in metagenomic samples bioRxiv 2019 569970(79) Liang Q Bible P W Liu Y Zou B Wei L DeepMicrobestaxonomic classification for metagenomics with deep learning NARGenomics and Bioinformatics 2020 2 (1) No lqaa009(80) Fiannaca A La Paglia L La Rosa M Renda G Rizzo RGaglio S Urso A Deep learning models for bacteria taxonomicclassification of metagenomic data BMC Bioinf 2018 19 (7) 198(81) Krawczyk P S Lipinski L Dziembowski A PlasFlowpredicting plasmid sequences in metagenomic data using genomesignatures Nucleic Acids Res 2018 46 (6) No e35-e35(82) Hollender J van Bavel B Dulio V Farmen E FurtmannK Koschorreck J Kunkel U Krauss M Munthe J Schlabach MHigh resolution mass spectrometry-based non-target screening cansupport regulatory environmental monitoring and chemicals manage-ment Environ Sci Eur 2019 31 (1) 42(83) Soria N G C Montes A Bisson M A Atilla-Gokcumen GE Aga D S Mass spectrometry-based metabolomics to assess uptakeof silver nanoparticles by Arabidopsis thaliana Environ Sci Nano2017 4 (10) 1944minus1953(84) Matich E K Ghafari M Camgoz E Caliskan E Pfeifer BA Haznedaroglu B Z Atilla-Gokcumen G E Time-serieslipidomic analysis of the oleaginous green microalga species Ettliaoleoabundans under nutrient stress Biotechnol Biofuels 2018 11 (1)1minus15(85) Matich E K Soria N G C Aga D S Atilla-Gokcumen GE Applications of metabolomics in assessing ecological effects ofemerging contaminants and pollutants on plants J Hazard Mater2019 373 527minus535(86) Knolhoff A M Premo J H Fisher C M A ProposedQuality Control Standard Mixture and Its Uses for EvaluatingNontargeted and Suspect Screening LCHR-MS Method Perform-ance Anal Chem 2021931596(87) Samanipour S Kaserzon S Vijayasarathy S Jiang H ChoiP Reid M J Mueller J F Thomas K V Machine learningcombined with non-targeted LC-HRMS analysis for a risk warningsystem of chemical hazards in drinking water A proof of conceptTalanta 2019 195 426minus432(88) Angeles L F Islam S Aldstadt J Saqeeb K N Alam MKhan M A Johura F-T Ahmed S I Aga D S Retrospectivesuspect screening reveals previously ignored antibiotics antifungal



L

compounds and metabolites in Bangladesh surface waters Sci TotalEnviron 2020 712 136285(89) Albergamo V Scholleacutee J E Schymanski E L Helmus RTimmer H Hollender J De Voogt P Nontarget screening revealstime trends of polar micropollutants in a riverbank filtration systemEnviron Sci Technol 2019 53 (13) 7584minus7594(90) Chiaia-Hernaacutendez A C Scheringer M Muller A StiegerG Waumlchter D Keller A Pintado-Herrera M G Lara-Martin PA Bucheli T D Hollender J Target and suspect screening analysisreveals persistent emerging organic contaminants in soils andsediments Sci Total Environ 2020 740 140181(91) Cariou R Meacutendez-Fernandez P Hutinet S b Guitton YCaurant F Le Bizec B Spitz J r m Vetter W Dervilly GNontargeted LCESI-HRMS Detection of Polyhalogenated Com-pounds in Marine Mammals Stranded on French Atlantic Coasts ACSESampT Water 20211309(92) Travis S C Kordas K Aga D S Optimized workflow forunknown screening using gas chromatography high-resolution massspectrometry expands identification of contaminants in siliconepersonal passive samplers Rapid Commun Mass Spectrom 2021 35(8) No e9048(93) Hollender J Schymanski E L Singer H P Ferguson P LNontarget screening with high resolution mass spectrometry in theenvironment ready to go ACS Publications 2017(94) Muller A Schulz W Ruck W K L Weber W H A newapproach to data evaluation in the non-target screening of organictrace substances in water analysis Chemosphere 2011 85 (8) 1211minus1219(95) Scholleacutee J E Schymanski E L Hollender J Statisticalapproaches for LC-HRMS data to characterize prioritize and identifytransformation products from water treatment processes In AssessingTransformation Products of Chemicals by Non-Target and SuspectScreening- Strategies and Workflows Vol 1 ACS Publications 2016 pp45minus65(96) Sireacuten K Fischer U Vestner J Automated supervisedlearning pipeline for non-targeted GC-MS data analysis AnalyticaChimica Acta X 2019 1 100005(97) Sinkov N A Harynuk J J Cluster resolution A metric forautomated objective and optimized feature selection in chemometricmodeling Talanta 2011 83 (4) 1079minus1087(98) Johnsen L G Amigo J M Skov T Bro R Automatedresolution of overlapping peaks in chromatographic data J Chemom2014 28 (2) 71minus82(99) Smirnov A Jia W Walker D I Jones D P Du X ADAP-GC 32 graphical software tool for efficient spectral deconvolution ofgas chromatography-high-resolution mass spectrometry metabolomicsdata J Proteome Res 2018 17 (1) 470minus478(100) Chen C Husny J Rabe S Predicting fishiness off-flavourand identifying compounds of lipid oxidation in dairy powders bySPME-GCMS and machine learning Int Dairy J 2018 77 19minus28(101) Taghadomi-Saberi S Mas Garcia S Allah Masoumi ASadeghi M Marco S Classification of bitter orange essential oilsaccording to fruit ripening stage by untargeted chemical profiling andmachine learning Sensors 2018 18 (6) 1922(102) Yang Q Xu L Tang L-J Yang J-T Wu B-Q Chen NJiang J-H Yu R-Q Simultaneous detection of multiple inheritedmetabolic diseases using GC-MS urinary metabolomics by chemo-metrics multi-class classification strategies Talanta 2018 186 489minus496(103) Smolinska A Hauschild A C Fijten R R R Dallinga JW Baumbach J Van Schooten F J Current breathomicsa reviewon data pre-processing techniques and machine learning inmetabolomics breath analysis J Breath Res 2014 8 (2) 027105(104) Garrido-Baserba M Corominas L Corteacutes U Rosso DPoch M The fourth-revolution in the water sector encounters thedigital revolution Environ Sci Technol 2020 54 (8) 4698minus4705(105) Vujnovic G Perisic J Bozilovic Z Milovanovic MGetman R V Radovanovic L USING SCADA SYSTEM FOR

PROCESS CONTROL IN WATER INDUSTRY Acta TechnicaCorviniensis-Bulletin of Engineering 2019 12 (2) 67minus72(106) Olsson G ICA and me-a subjective review Water Res 201246 (6) 1585minus1624(107) Zhao H Hou D Huang P Zhang G Water quality eventdetection in drinking water network Water Air Soil Pollut 2014 225(11) 2183(108) Chandola V Banerjee A Kumar V Anomaly detection Asurvey ACM computing surveys (CSUR) 2009 41 (3) 1minus58(109) Hasan J Goldbloom-Helzner D Ichida A Rouse TGibson M Technologies and Techniques for Early Warning Systems toMonitor and Evaluate Drinking Water Quality A State-of-the-ArtReview Environmental Protection Agency Washington Dc Office ofWater 2005(110) Izquierdo J Loacutepez P A Martiacutenez F J Peacuterez R Faultdetection in water supply systems using hybrid (theory and data-driven) modelling Mathematical and Computer Modelling 2007 46(3minus4) 341minus350(111) Murray S Ghazali M McBean E A Real-time water qualitymonitoring assessment of multisensor data using Bayesian beliefnetworks Journal of Water Resources Planning and Management 2012138 (1) 63minus70(112) Raciti M Cucurull J Nadjm-Tehrani S Anomaly detectionin water management systems In Critical Infrastructure ProtectionSpringer 2012 pp 98minus119(113) Ahmed M Mahmood A N Hu J A survey of networkanomaly detection techniques Journal of Network and ComputerApplications 2016 60 19minus31(114) Yuan Y Jia K A Water Quality Assessment Method Based onSparse Autoencoder 2015 IEEE 2015 pp 1minus4(115) Muharemi F Logofatu D Leon F Machine learningapproaches for anomaly detection of water quality on a real-worlddata set Journal of Information and Telecommunication 2019 3 (3)294minus307(116) Alom M Z Taha T M Yakopcic C Westberg S SidikeP Nasrin M S Hasan M Van Essen B C Awwal A A S AsariV K A state-of-the-art survey on deep learning theory andarchitectures Electronics 2019 8 (3) 292(117) Russo S Lurig M Hao W Matthews B Villez K Activelearning for anomaly detection in environmental data EnvironmentalModelling amp Software 2020 134 104869(118) Ahmad S Lavin A Purdy S Agha Z Unsupervised real-time anomaly detection for streaming data Neurocomputing 2017262 134minus147(119) Stocco A Tonella P Towards Anomaly Detectors that LearnContinuously IEEE 2020 pp 201minus208(120) Hoffman M Bach F R Blei D M In Online learning forlatent dirichlet allocation 2010 Citeseer pp 856minus864(121) Mongillo G Deneve S Online learning with hidden Markovmodels Neural computation 2008 20 (7) 1706minus1716(122) Bogler A Packman A Furman A Gross A Kushmaro ARonen A Dagot C Hill C Vaizel-Ohayon D Morgenroth ERethinking wastewater risks and monitoring in light of the COVID-19pandemic Nature Sustainability 2020 9 1minus10(123) Bivins A North D Ahmad A Ahmed W Alm E BeenF Bhattacharya P Bijlsma L Boehm A B Brown J Wastewater-Based Epidemiology Global Collaborative to Maximize Contributions inthe Fight Against COVID-19 ACS Publications 2020



M

potential for the water industry9 However interpreting thesetypes of data is not trivial and requires appropriate applicationof ML techniquesThe aim of this Feature is to provide an overview of data

analysis frameworks suitable for application within ESE Wefirst introduce a general data analysis framework then highlightcurrent applications of ML within ESE problem domains andconclude with thoughts about the future

2 GENERAL DATA ANALYSIS FRAMEWORK

A general data analysis framework includes data acquisitiondata exploration data preprocessing and visualizationalgorithms for model building and ways to evaluate andinterpret the models and the results (Figure 1)21 Data Acquisition Data acquisition is the process of

collecting or acquiring data and storing it in a readily accessibleform such as a database (Figure 1a) Data acquisition requiresthorough planning to ensure the data set can be effectivelyinterrogated for the desired end purpose Key elements of sucha plan include deciding upon the data collection methodologyselecting variables of interest determining the samplingfrequency and the number of replicates assessing how muchdata is needed to effectively build the model(s) and test the

study hypotheses and deciding upon means to properlydocument and store the dataWork within ESE increasingly requires acquisition of vast

amounts of data Hence it is necessary to frame and adoptsuccinct protocols to minimize biases and increase compara-bility and reproducibility across the discipline To that extentseveral publications and guidance documents have focused ongood implementation practices for collection of robust datasets10111213

22 Data ExplorationVisualization and Preprocess-ing Data exploration or visualization is used to understand thedata characteristics Data exploration helps illustrate keyaspects such as sample size missing values distributionsinitial patterns correlations or variables that appear to besensitive to the system of interest Exploration is commonlyachieved by plotting and visualizing the data in a manner suchthat distributions and trends are readily apparent14

Raw data are often incomplete noisy and inconsistent dueto data redundancy incompatibilities between multiple datasources and divergent data collection protocols Poor dataquality can lead to inaccurate or misleading analyses15 Datapreprocessing is intended to improve model quality andminimize computational resource usage

Figure 1 Data Analysis Framework and Methodologies (a) Data Collection Gather the data (b) Data ExplorationPreprocessing Fix thediscrepancies visualize and transform the data to the required form (c) Model Building and Evaluation Train the model and evaluateperformance (d) Interpretation Use the model to make predictions analyze and interpret the outcomes (e) Monitoring Based on the learnedknowledge and domain expertise The top thread on the figure represents the tools and software that are available to execute various steps in theend-to-end data analysis framework



B

Table

1Overviewof

Pop

ular

MLAlgorithm

s

machine

learning

model

description

positives

andnegatives

exem

plaryapplications

with

inES

E

Logistic

Regression

(Logit

model)24

bullSupervised


bullEasy

toimplem

ent

bullEstim


ofatype

ofvegetatio

nusing

environm

entalvariables2

5


alogisticfunctio

nto


the

output

variable


bullPredictcontam

inantsources26

andwater

quality

2728indifferent

water

system

s

bullUsedto

perform

binary


givensamples

are

classifiedinto

twogroups


nalresources


water

treatm

entsystem

s28

bullCan

beextended

tomultip

leoutput

variables

bullOftenused

asthefirstmodelor

benchm

arkto

compare

tomorecomplex

models

RandomForest

(RF)

29bullAtree-based

ensemblelearning

methodthat

isused

forsupervised


problems


utliersaswellas

noisydata


insewage7

bullThe


tree

bullNot

proneto

overfittin


onthetraining

dataandfails

togeneralizeon

the

testingor

newdata

set3

0bullSelect


predictin

ganom

alouseventsin

water

distributio

nsystem

s28

bullSamples

andfeatures

arerandom

lysampled

from

thefulldata

setand

individualdecision

treesaremadeon

thesampled

data

set

bullDepending

onthesize

ofthedatasetand

theensembleof

treesRFs

cangetcom

plex

andmay

require

high

computatio

nalresourcesandlong

times

forboth

training

andpredictio


concentrations

usingwater

quality

variables3

2

bullThe

outcom

esgeneratedfrom

theensembleof

decision

treesarecombined



osmosis-m

embranerejectionof

emerging

contam

inants31

SupportVector

Machines

(SVM)

bullSupervised

learning

algorithm

bullUsesasubset

oftraining

pointsin

thedecision

functio

n(supportvectors)

andismem

oryefficient

bullNonlineartim

eseries

forecasting

modelto

predictwater

quality

33

bullUsedforboth


tasks

bullEffectiveinhigh

dimensionalspaces

andincaseswhere

thenumberof

samples

islessthan

thenumber

ofdimensionsHow


nsandregularizatio

niscrucialto

avoidoverfittin


nitrateandnitrite

inthemixed

liquorof

awastewater

treatm

ent

plant34

bullAimsto

findahyperplane

(inan


that

has

maximum

distance

betweendata

pointsof

allclassesThe

hyperplane


ordata


bullPh

enolicpollutant

concentration

predictio

nin

drinking

water35


brings

confidence


newdata

points


pointsthat

arecloser

tothehyperplane

and

influence

thepositio

nandorientationof

thehyperplane

bullSetsof

mathematicalfunctio

ns(kernels)canbe

used

totransform

theinput

data

asrequired

Artificial

NeuralNet-

works

(ANNs)

bullANNsarecomputin

gsystem

sinspired


ation

theway

thehuman

braindoes

bullDLhasem

ergedas


bullPredicteffluent

concentrations

inawastewater

treatm

entplant4

2

bullAsimpleANN

consistsof

aninputlayerahidden

layermadeof

entities

(callednodes)

that



transform

them


nthat

transm

itsthetransformed

valueto

theoutput

layerwhere

thefinaloutput

isused

togetpredictio

ns




know

thesamplesize

required

fortheparticular

task

beforehandS

omerulesof

thum

bhave

been

proposed

basedon

anumberof

assumptionsF

orinstancethe

samplesize

needsto


numberof

classesor

10minus100times

thenumberof

featureso


weightsin

the

network40


water

treatm

entsystem

2843

bullIn

conventio

nalANN


ationmoves

unidirectio

nally

(iefrom

inputto

output

layer)T


know

nas

afeedforwardnetwork

bullFo

rsm

allerdata


perform

better

bullAirquality

forecastingmodel44

bullWhenthenumberof

hidden

layersisincreasedto

achievehigher

levelsof

abstractionin

orderto

extractcomplex

inform

ationfrom

thedatathe

bullANN

architectures

existthat

canbe

used

toperform

supervisedu

nsupervised

orsemisupervised

learning

tasksSomeexam

ples


(RNNs)convolutio

naln

euraln

etworks

(CNNs)and

autoencoders



C

Table

1continued

machine

learning

model

description

positives

andnegatives

exem

plaryapplications

with

inES

E

resulting

architectureisknow

nas

adeep

neuralnetworkor

deep

learning

(DL)

bullANNstend

tooverfitC

onscientious


hyper-

parameter

tuning



boxrdquo

modelsas

theinternalfunctio

ning

ofANNscanbe

hard

tointerpretThiscan

lead


nandtheacceptance

ofpoor

modelsAthorough

understandingof

varioushyperparam

etersisnecessaryto

understand

thenuancesof

themodel

Principal

Com

ponent

Analysis

(PCA)4546

bullAnunsupervised

technique

PCAcombinesfeatures

andcreatesanew

featureseto

fthe

samesize

where

each


ponentsrdquo)

isacombinatio

nof

theoriginalfeatures

bullWidelyused

featureextractio

nmethod

Usedto

reduce

thedimensionality

ofthedata

setwhile

preserving

asmuchdata

variability

aspossible

bullEstim

atethevariance

explained

bygivenvariablesof

awater

quality

data47

bullNew

features

areform

edby

finding

theeigenvectors

andcorresponding

eigenvaluesthat

providean

estim

ationof

howmuchvariance

each

component

explains

bullOne

canselect


ofthevariability

inthedata

anddrop

the

remaining

ones

bullIntercom

pare

airpollutio

npat-

terns4

4

bullCan

beused

toreduce



feachother



usinghigh

resolutio

nmassspec-

trom

etry

(HRMS)

data4849

bullEstim

atetheimportance

ofdifferent


key-foulantsin

amem

brane

baseddrinking

water

treatm

ent

process5

0

K-means

Clustering

(K-means)51


that

aimsto

clustersamples

usingfeature

values

bullSimpleto

implem


largedata

sets

bullDelineate

different

airquality


pollutio

ndata52

bullUserdefines

atarget

kvaluewhich

refersto

thenumberof

clustersone

wantsto

groupthesamples

into


oneneedsto


may

bedifficult

bullAnalyze

water

quality

data55354

bullThe

kvaluerefersto

thenumberof

thecentroid

which

refersto

alocatio

nrepresentin

gthecenter

ofagivencluster

bullFails

forclusterswith

complex

nonsphericalgeom

etricshapes


mapsfor

anaquiferusingwater

quality

data55

bullK-m

eans

startswith

random


themost


respectiveclusters

such

thatthesum


betweenthesamples

andthecentroid

isminimized


inthepresence



applying

k-means


water

pollu-

tionsources5

6

Hierarchical

Clustering


that

aimsto

clustersamples

usingfeature

values

bullNoaprioriinform


clustersisrequired


basedon

Pearsoncorrelations

inHRMS

data57

bullThe


algorithm

inadditio

nto

breaking

uptheobjects

into

clustersalso

show

sthehierarchyor

rankingof

thedistance

andshow


othersH


isrepresentedas

dendograms

bullTimeintensive

bullEstim

atethesimilarity

between

theresistom

eprofilesof

different

samples

basedon

Brayminus

Curtis

dissimilarity

metric58


hechoice

ofsimilarity


factor

inhierarchical

clustering

bullCould

bedifficultto


clusters

basedupon

thedendogram

bullWater

quality

assessmentwith




built

either

asagglom

erative(bottom-up

each

samplestartsin

itsow


isdone

bymerging

each

sampleandmovingup

thehierarchy)

ordivisive

(top-dow

nallsam

ples

are

putin



hierarchy)



D














E












F













G













H












ACKNOWLEDGMENTS


GLOSSARY







I





















J





K





L





M

Table

1Overviewof

Pop

ular

MLAlgorithm

s

machine

learning

model

description

positives

andnegatives

exem

plaryapplications

with

inES

E

Logistic

Regression

(Logit

model)24

bullSupervised


bullEasy

toimplem

ent

bullEstim


ofatype

ofvegetatio

nusing

environm

entalvariables2

5


alogisticfunctio

nto


the

output

variable


bullPredictcontam

inantsources26

andwater

quality

2728indifferent

water

system

s

bullUsedto

perform

binary


givensamples

are

classifiedinto

twogroups


nalresources


water

treatm

entsystem

s28

bullCan

beextended

tomultip

leoutput

variables

bullOftenused

asthefirstmodelor

benchm

arkto

compare

tomorecomplex

models

RandomForest

(RF)

29bullAtree-based

ensemblelearning

methodthat

isused

forsupervised


problems


utliersaswellas

noisydata


insewage7

bullThe


tree

bullNot

proneto

overfittin


onthetraining

dataandfails

togeneralizeon

the

testingor

newdata

set3

0bullSelect


predictin

ganom

alouseventsin

water

distributio

nsystem

s28

bullSamples

andfeatures

arerandom

lysampled

from

thefulldata

setand

individualdecision

treesaremadeon

thesampled

data

set

bullDepending

onthesize

ofthedatasetand

theensembleof

treesRFs

cangetcom

plex

andmay

require

high

computatio

nalresourcesandlong

times

forboth

training

andpredictio


concentrations

usingwater

quality

variables3

2

bullThe

outcom

esgeneratedfrom

theensembleof

decision

treesarecombined



osmosis-m

embranerejectionof

emerging

contam

inants31

SupportVector

Machines

(SVM)

bullSupervised

learning

algorithm

bullUsesasubset

oftraining

pointsin

thedecision

functio

n(supportvectors)

andismem

oryefficient

bullNonlineartim

eseries

forecasting

modelto

predictwater

quality

33

bullUsedforboth


tasks

bullEffectiveinhigh

dimensionalspaces

andincaseswhere

thenumberof

samples

islessthan

thenumber

ofdimensionsHow


nsandregularizatio

niscrucialto

avoidoverfittin


nitrateandnitrite

inthemixed

liquorof

awastewater

treatm

ent

plant34

bullAimsto

findahyperplane

(inan


that

has

maximum

distance

betweendata

pointsof

allclassesThe

hyperplane


ordata


bullPh

enolicpollutant

concentration

predictio

nin

drinking

water35


brings

confidence


newdata

points


pointsthat

arecloser

tothehyperplane

and

influence

thepositio

nandorientationof

thehyperplane

bullSetsof

mathematicalfunctio

ns(kernels)canbe

used

totransform

theinput

data

asrequired

Artificial

NeuralNet-

works

(ANNs)

bullANNsarecomputin

gsystem

sinspired


ation

theway

thehuman

braindoes

bullDLhasem

ergedas


bullPredicteffluent

concentrations

inawastewater

treatm

entplant4

2

bullAsimpleANN

consistsof

aninputlayerahidden

layermadeof

entities

(callednodes)

that



transform

them


nthat

transm

itsthetransformed

valueto

theoutput

layerwhere

thefinaloutput

isused

togetpredictio

ns




know

thesamplesize

required

fortheparticular

task

beforehandS

omerulesof

thum

bhave

been

proposed

basedon

anumberof

assumptionsF

orinstancethe

samplesize

needsto


numberof

classesor

10minus100times

thenumberof

featureso


weightsin

the

network40


water

treatm

entsystem

2843

bullIn

conventio

nalANN


ationmoves

unidirectio

nally

(iefrom

inputto

output

layer)T


know

nas

afeedforwardnetwork

bullFo

rsm

allerdata


perform

better

bullAirquality

forecastingmodel44

bullWhenthenumberof

hidden

layersisincreasedto

achievehigher

levelsof

abstractionin

orderto

extractcomplex

inform

ationfrom

thedatathe

bullANN

architectures

existthat

canbe

used

toperform

supervisedu

nsupervised

orsemisupervised

learning

tasksSomeexam

ples


(RNNs)convolutio

naln

euraln

etworks

(CNNs)and

autoencoders



C

Table

1continued

machine

learning

model

description

positives

andnegatives

exem

plaryapplications

with

inES

E

resulting

architectureisknow

nas

adeep

neuralnetworkor

deep

learning

(DL)

bullANNstend

tooverfitC

onscientious


hyper-

parameter

tuning



boxrdquo

modelsas

theinternalfunctio

ning

ofANNscanbe

hard

tointerpretThiscan

lead


nandtheacceptance

ofpoor

modelsAthorough

understandingof

varioushyperparam

etersisnecessaryto

understand

thenuancesof

themodel

Principal

Com

ponent

Analysis

(PCA)4546

bullAnunsupervised

technique

PCAcombinesfeatures

andcreatesanew

featureseto

fthe

samesize

where

each


ponentsrdquo)

isacombinatio

nof

theoriginalfeatures

bullWidelyused

featureextractio

nmethod

Usedto

reduce

thedimensionality

ofthedata

setwhile

preserving

asmuchdata

variability

aspossible

bullEstim

atethevariance

explained

bygivenvariablesof

awater

quality

data47

bullNew

features

areform

edby

finding

theeigenvectors

andcorresponding

eigenvaluesthat

providean

estim

ationof

howmuchvariance

each

component

explains

bullOne

canselect


ofthevariability

inthedata

anddrop

the

remaining

ones

bullIntercom

pare

airpollutio

npat-

terns4

4

bullCan

beused

toreduce



feachother



usinghigh

resolutio

nmassspec-

trom

etry

(HRMS)

data4849

bullEstim

atetheimportance

ofdifferent


key-foulantsin

amem

brane

baseddrinking

water

treatm

ent

process5

0

K-means

Clustering

(K-means)51


that

aimsto

clustersamples

usingfeature

values

bullSimpleto

implem


largedata

sets

bullDelineate

different

airquality


pollutio

ndata52

bullUserdefines

atarget

kvaluewhich

refersto

thenumberof

clustersone

wantsto

groupthesamples

into


oneneedsto


may

bedifficult

bullAnalyze

water

quality

data55354

bullThe

kvaluerefersto

thenumberof

thecentroid

which

refersto

alocatio

nrepresentin

gthecenter

ofagivencluster

bullFails

forclusterswith

complex

nonsphericalgeom

etricshapes


mapsfor

anaquiferusingwater

quality

data55

bullK-m

eans

startswith

random


themost


respectiveclusters

such

thatthesum


betweenthesamples

andthecentroid

isminimized


inthepresence



applying

k-means


water

pollu-

tionsources5

6

Hierarchical

Clustering


that

aimsto

clustersamples

usingfeature

values

bullNoaprioriinform


clustersisrequired


basedon

Pearsoncorrelations

inHRMS

data57

bullThe


algorithm

inadditio

nto

breaking

uptheobjects

into

clustersalso

show

sthehierarchyor

rankingof

thedistance

andshow


othersH


isrepresentedas

dendograms

bullTimeintensive

bullEstim

atethesimilarity

between

theresistom

eprofilesof

different

samples

basedon

Brayminus

Curtis

dissimilarity

metric58


hechoice

ofsimilarity


factor

inhierarchical

clustering

bullCould

bedifficultto


clusters

basedupon

thedendogram

bullWater

quality

assessmentwith




built

either

asagglom

erative(bottom-up

each

samplestartsin

itsow


isdone

bymerging

each

sampleandmovingup

thehierarchy)

ordivisive

(top-dow

nallsam

ples

are

putin



hierarchy)



D














E












F













G













H












ACKNOWLEDGMENTS


GLOSSARY







I





















J





K





L





M

Table

1continued

machine

learning

model

description

positives

andnegatives

exem

plaryapplications

with

inES

E

resulting

architectureisknow

nas

adeep

neuralnetworkor

deep

learning

(DL)

bullANNstend

tooverfitC

onscientious


hyper-

parameter

tuning



boxrdquo

modelsas

theinternalfunctio

ning

ofANNscanbe

hard

tointerpretThiscan

lead


nandtheacceptance

ofpoor

modelsAthorough

understandingof

varioushyperparam

etersisnecessaryto

understand

thenuancesof

themodel

Principal

Com

ponent

Analysis

(PCA)4546

bullAnunsupervised

technique

PCAcombinesfeatures

andcreatesanew

featureseto

fthe

samesize

where

each


ponentsrdquo)

isacombinatio

nof

theoriginalfeatures

bullWidelyused

featureextractio

nmethod

Usedto

reduce

thedimensionality

ofthedata

setwhile

preserving

asmuchdata

variability

aspossible

bullEstim

atethevariance

explained

bygivenvariablesof

awater

quality

data47

bullNew

features

areform

edby

finding

theeigenvectors

andcorresponding

eigenvaluesthat

providean

estim

ationof

howmuchvariance

each

component

explains

bullOne

canselect


ofthevariability

inthedata

anddrop

the

remaining

ones

bullIntercom

pare

airpollutio

npat-

terns4

4

bullCan

beused

toreduce



feachother



usinghigh

resolutio

nmassspec-

trom

etry

(HRMS)

data4849

bullEstim

atetheimportance

ofdifferent


key-foulantsin

amem

brane

baseddrinking

water

treatm

ent

process5

0

K-means

Clustering

(K-means)51


that

aimsto

clustersamples

usingfeature

values

bullSimpleto

implem


largedata

sets

bullDelineate

different

airquality


pollutio

ndata52

bullUserdefines

atarget

kvaluewhich

refersto

thenumberof

clustersone

wantsto

groupthesamples

into


oneneedsto


may

bedifficult

bullAnalyze

water

quality

data55354

bullThe

kvaluerefersto

thenumberof

thecentroid

which

refersto

alocatio

nrepresentin

gthecenter

ofagivencluster

bullFails

forclusterswith

complex

nonsphericalgeom

etricshapes


mapsfor

anaquiferusingwater

quality

data55

bullK-m

eans

startswith

random


themost


respectiveclusters

such

thatthesum


betweenthesamples

andthecentroid

isminimized


inthepresence



applying

k-means


water

pollu-

tionsources5

6

Hierarchical

Clustering


that

aimsto

clustersamples

usingfeature

values

bullNoaprioriinform


clustersisrequired


basedon

Pearsoncorrelations

inHRMS

data57

bullThe


algorithm

inadditio

nto

breaking

uptheobjects

into

clustersalso

show

sthehierarchyor

rankingof

thedistance

andshow


othersH


isrepresentedas

dendograms

bullTimeintensive

bullEstim

atethesimilarity

between

theresistom

eprofilesof

different

samples

basedon

Brayminus

Curtis

dissimilarity

metric58


hechoice

ofsimilarity


factor

inhierarchical

clustering

bullCould

bedifficultto


clusters

basedupon

thedendogram

bullWater

quality

assessmentwith




built

either

asagglom

erative(bottom-up

each

samplestartsin

itsow


isdone

bymerging

each

sampleandmovingup

thehierarchy)

ordivisive

(top-dow

nallsam

ples

are

putin



hierarchy)



D














E












F













G













H












ACKNOWLEDGMENTS


GLOSSARY







I





















J





K





L





M














E












F













G













H












ACKNOWLEDGMENTS


GLOSSARY







I





















J





K





L





M












F













G













H












ACKNOWLEDGMENTS


GLOSSARY







I





















J





K





L





M













G













H












ACKNOWLEDGMENTS


GLOSSARY







I





















J





K





L





M













H












ACKNOWLEDGMENTS


GLOSSARY







I





















J





K





L





M












ACKNOWLEDGMENTS


GLOSSARY







I





















J





K





L





M





















J





K





L





M





K





L





M





L





M





M

Data Analytics for Environmental Science and Engineering ...

Documents

Transcript of Data Analytics for Environmental Science and Engineering ...