Applications in Environmental Engineering Technology ... - UTHM
Data Analytics for Environmental Science and Engineering ...
-
Upload
khangminh22 -
Category
Documents
-
view
1 -
download
0
Transcript of Data Analytics for Environmental Science and Engineering ...
Data Analytics for Environmental Science and Engineering ResearchSuraj Gupta Diana Aga Amy Pruden Liqing Zhang and Peter Vikesland
Cite This httpsdoiorg101021acsest1c01026 Read Online
ACCESS Metrics amp More Article Recommendations
ABSTRACT The advent of new data acquisition and handlingtechniques has opened the door to alternative and morecomprehensive approaches to environmental monitoring that willimprove our capacity to understand and manage environmentalsystems Researchers have recently begun using machine learning(ML) techniques to analyze complex environmental systems and theirassociated data Herein we provide an overview of data analyticsframeworks suitable for various Environmental Science and Engineer-ing (ESE) research applications We present current applications ofML algorithms within the ESE domain using three representative case studies (1) Metagenomic data analysis for characterizing andtracking antimicrobial resistance in the environment (2) Nontarget analysis for environmental pollutant profiling and (3) Detectionof anomalies in continuous data generated by engineered water systems We conclude by proposing a path to advance incorporationof data analytics approaches in ESE research and application
KEYWORDS environmental science and engineering data analytics machine learning metagenomics nontarget analysisanomaly detection water
1 BACKGROUND
Data science is a rapidly evolving interdisciplinary fieldincorporating fundamentals from computer science informa-tion science mathematics and statistics1 It combinesprinciples and methodologies that facilitate and guideextraction of knowledge and insights from available datastreams in a usable format that supports data-driven policy anddecision making2 Moreover data science provides the capacityto better delineate problems by improving alignment betweenthe data that is available and the corresponding questions thatcan be addressed The availability of voluminous amounts ofdata powerful computational resources affordable datastorage and highly efficient algorithms is enabling broaderand deeper data analyses than previously possibleMachine learning (ML) is an emerging data science subfield
that includes algorithms and methodologies that can be used tofind hidden patterns within data and aid in predictive modelconstruction3 ML enables analysis of bigger and ever morecomplex data sets in more efficient and more accurate waysthan otherwise possible4 In the past decade ML has begun tosignificantly impact numerous disciplines5 and EnvironmentalScience and Engineering (ESE) has benefited from this surginginterest in ML and its applications6 At its core ESE isconcerned with improving and maintaining the environmentwith the ultimate goal of protecting human and ecologicalhealth ESE encompasses diverse areas such as water andwastewater treatment air quality environmental impactassessment and hazardous waste management ESE incorpo-rates concepts from disciplines such as basic sciences public
health engineering biological sciences and nanotechnologyResearch and industrial work within ESE increasingly requirecollection of vast amounts of data that simultaneously reflectnumerous data types (eg air and water quality measurementsflow measurements spatial discretization etc) with widespatial and temporal variability Data of this natureencompasses a broad and dynamic range with masses reportedfrom nanogram to teragram and flows ranging from microlitersper second to millions of liters per day Recent advances inenvironmental metagenomics and nontarget analysis furtherexpand the types and volumes of data being generated withinESEThe advent of novel data acquisition and handling
techniques has opened the door to alternative and potentiallymore comprehensive means of environmental monitoring thatwill improve our capacity to understand and manageenvironmental systems For instance metagenomics andnontarget analysis are evolving as powerful ways to identifyunknown contaminants for surveillance78 Further identifica-tion of anomalous or unusual events in waterwastewatertreatment processes using real time water quality data is a keytenet of the ldquodigital waterrdquo movement that holds immense
Featurepubsacsorgest
copy XXXX American Chemical SocietyA
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
Dow
nloa
ded
via
NA
NJI
NG
IN
ST O
F G
EO
GR
APH
Y amp
LIM
NO
LO
GY
on
Aug
ust 4
202
1 at
01
053
5 (U
TC
)Se
e ht
tps
pub
sac
sor
gsh
arin
ggui
delin
es f
or o
ptio
ns o
n ho
w to
legi
timat
ely
shar
e pu
blis
hed
artic
les
potential for the water industry9 However interpreting thesetypes of data is not trivial and requires appropriate applicationof ML techniquesThe aim of this Feature is to provide an overview of data
analysis frameworks suitable for application within ESE Wefirst introduce a general data analysis framework then highlightcurrent applications of ML within ESE problem domains andconclude with thoughts about the future
2 GENERAL DATA ANALYSIS FRAMEWORK
A general data analysis framework includes data acquisitiondata exploration data preprocessing and visualizationalgorithms for model building and ways to evaluate andinterpret the models and the results (Figure 1)21 Data Acquisition Data acquisition is the process of
collecting or acquiring data and storing it in a readily accessibleform such as a database (Figure 1a) Data acquisition requiresthorough planning to ensure the data set can be effectivelyinterrogated for the desired end purpose Key elements of sucha plan include deciding upon the data collection methodologyselecting variables of interest determining the samplingfrequency and the number of replicates assessing how muchdata is needed to effectively build the model(s) and test the
study hypotheses and deciding upon means to properlydocument and store the dataWork within ESE increasingly requires acquisition of vast
amounts of data Hence it is necessary to frame and adoptsuccinct protocols to minimize biases and increase compara-bility and reproducibility across the discipline To that extentseveral publications and guidance documents have focused ongood implementation practices for collection of robust datasets10111213
22 Data ExplorationVisualization and Preprocess-ing Data exploration or visualization is used to understand thedata characteristics Data exploration helps illustrate keyaspects such as sample size missing values distributionsinitial patterns correlations or variables that appear to besensitive to the system of interest Exploration is commonlyachieved by plotting and visualizing the data in a manner suchthat distributions and trends are readily apparent14
Raw data are often incomplete noisy and inconsistent dueto data redundancy incompatibilities between multiple datasources and divergent data collection protocols Poor dataquality can lead to inaccurate or misleading analyses15 Datapreprocessing is intended to improve model quality andminimize computational resource usage
Figure 1 Data Analysis Framework and Methodologies (a) Data Collection Gather the data (b) Data ExplorationPreprocessing Fix thediscrepancies visualize and transform the data to the required form (c) Model Building and Evaluation Train the model and evaluateperformance (d) Interpretation Use the model to make predictions analyze and interpret the outcomes (e) Monitoring Based on the learnedknowledge and domain expertise The top thread on the figure represents the tools and software that are available to execute various steps in theend-to-end data analysis framework
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
B
Table
1Overviewof
Pop
ular
MLAlgorithm
s
machine
learning
model
description
positives
andnegatives
exem
plaryapplications
with
inES
E
Logistic
Regression
(Logit
model)24
bullSupervised
MLclassificationalgorithm
bullEasy
toimplem
ent
bullEstim
atetheprobablepresence
ofatype
ofvegetatio
nusing
environm
entalvariables2
5
bullStatisticalmodelthatuses
alogisticfunctio
nto
modeltheprobabilityof
the
output
variable
bullHighlyinterpretable
bullPredictcontam
inantsources26
andwater
quality
2728indifferent
water
system
s
bullUsedto
perform
binary
classificationtaskswhere
givensamples
are
classifiedinto
twogroups
bullDoesnotrequireheavycomputatio
nalresources
bullAnomalydetectionin
water
treatm
entsystem
s28
bullCan
beextended
tomultip
leoutput
variables
bullOftenused
asthefirstmodelor
benchm
arkto
compare
tomorecomplex
models
RandomForest
(RF)
29bullAtree-based
ensemblelearning
methodthat
isused
forsupervised
classificationandregression
problems
bullRobustcanhandlelinearandnonlineardatao
utliersaswellas
noisydata
bullPredictrelativeabundancelevels
insewage7
bullThe
basicunitisadecision
tree
bullNot
proneto
overfittin
gwhenthemodelovertrains
onthetraining
dataandfails
togeneralizeon
the
testingor
newdata
set3
0bullSelect
importantvariablesin
predictin
ganom
alouseventsin
water
distributio
nsystem
s28
bullSamples
andfeatures
arerandom
lysampled
from
thefulldata
setand
individualdecision
treesaremadeon
thesampled
data
set
bullDepending
onthesize
ofthedatasetand
theensembleof
treesRFs
cangetcom
plex
andmay
require
high
computatio
nalresourcesandlong
times
forboth
training
andpredictio
nbullPredictnutrient
concentrations
usingwater
quality
variables3
2
bullThe
outcom
esgeneratedfrom
theensembleof
decision
treesarecombined
todevelopthefinalmodelpredictio
nbullPredictnanofiltrationreverse
osmosis-m
embranerejectionof
emerging
contam
inants31
SupportVector
Machines
(SVM)
bullSupervised
learning
algorithm
bullUsesasubset
oftraining
pointsin
thedecision
functio
n(supportvectors)
andismem
oryefficient
bullNonlineartim
eseries
forecasting
modelto
predictwater
quality
33
bullUsedforboth
classificationandregression
tasks
bullEffectiveinhigh
dimensionalspaces
andincaseswhere
thenumberof
samples
islessthan
thenumber
ofdimensionsHow
everchoosingkernelfunctio
nsandregularizatio
niscrucialto
avoidoverfittin
gbullPredicttheconcentrationof
nitrateandnitrite
inthemixed
liquorof
awastewater
treatm
ent
plant34
bullAimsto
findahyperplane
(inan
Ndimensionalfeaturespace)
that
has
maximum
distance
betweendata
pointsof
allclassesThe
hyperplane
classifiesthesamples
ordata
pointsunderconsideration
bullPh
enolicpollutant
concentration
predictio
nin
drinking
water35
bullMaximizingthemargindistance
brings
confidence
intheclassificationof
newdata
points
bullSupportvectorsaredata
pointsthat
arecloser
tothehyperplane
and
influence
thepositio
nandorientationof
thehyperplane
bullSetsof
mathematicalfunctio
ns(kernels)canbe
used
totransform
theinput
data
asrequired
Artificial
NeuralNet-
works
(ANNs)
bullANNsarecomputin
gsystem
sinspired
toanalyzeandprocessinform
ation
theway
thehuman
braindoes
bullDLhasem
ergedas
thestate-of-the-artMLmethod36minus39
bullPredicteffluent
concentrations
inawastewater
treatm
entplant4
2
bullAsimpleANN
consistsof
aninputlayerahidden
layermadeof
entities
(callednodes)
that
receiveweightedinputsfrom
theinputlayerandthen
transform
them
usinganonlinearfunctio
nthat
transm
itsthetransformed
valueto
theoutput
layerwhere
thefinaloutput
isused
togetpredictio
ns
bullDLmodelsrequirelargedatasetsfortraining
toachieverobustperformanceH
oweveritisdifficultto
know
thesamplesize
required
fortheparticular
task
beforehandS
omerulesof
thum
bhave
been
proposed
basedon
anumberof
assumptionsF
orinstancethe
samplesize
needsto
be50minus1000timesthe
numberof
classesor
10minus100times
thenumberof
featureso
r10minus50timesthenumberof
weightsin
the
network40
bullAnomalydetectionin
water
treatm
entsystem
2843
bullIn
conventio
nalANN
architecturetheinform
ationmoves
unidirectio
nally
(iefrom
inputto
output
layer)T
hisarchitectureisalso
know
nas
afeedforwardnetwork
bullFo
rsm
allerdata
setsalternativeMLmodelsmay
perform
better
bullAirquality
forecastingmodel44
bullWhenthenumberof
hidden
layersisincreasedto
achievehigher
levelsof
abstractionin
orderto
extractcomplex
inform
ationfrom
thedatathe
bullANN
architectures
existthat
canbe
used
toperform
supervisedu
nsupervised
orsemisupervised
learning
tasksSomeexam
ples
arerecurrentneuralnetworks
(RNNs)convolutio
naln
euraln
etworks
(CNNs)and
autoencoders
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
C
Table
1continued
machine
learning
model
description
positives
andnegatives
exem
plaryapplications
with
inES
E
resulting
architectureisknow
nas
adeep
neuralnetworkor
deep
learning
(DL)
bullANNstend
tooverfitC
onscientious
modelbuildingthorough
hyper-
parameter
tuning
andextensivevalidationisoftenrequired41
bullConsideredldquoblack
boxrdquo
modelsas
theinternalfunctio
ning
ofANNscanbe
hard
tointerpretThiscan
lead
toresultmisinterpretatio
nandtheacceptance
ofpoor
modelsAthorough
understandingof
varioushyperparam
etersisnecessaryto
understand
thenuancesof
themodel
Principal
Com
ponent
Analysis
(PCA)4546
bullAnunsupervised
technique
PCAcombinesfeatures
andcreatesanew
featureseto
fthe
samesize
where
each
newfeature(ietheldquocom
ponentsrdquo)
isacombinatio
nof
theoriginalfeatures
bullWidelyused
featureextractio
nmethod
Usedto
reduce
thedimensionality
ofthedata
setwhile
preserving
asmuchdata
variability
aspossible
bullEstim
atethevariance
explained
bygivenvariablesof
awater
quality
data47
bullNew
features
areform
edby
finding
theeigenvectors
andcorresponding
eigenvaluesthat
providean
estim
ationof
howmuchvariance
each
component
explains
bullOne
canselect
afewcomponentsthatexplainthemajority
ofthevariability
inthedata
anddrop
the
remaining
ones
bullIntercom
pare
airpollutio
npat-
terns4
4
bullCan
beused
toreduce
multicollinearityinthedataas
newcomponentsareindependento
feachother
bullEstablishthekeycompounds
todistinguishbetweensamples
usinghigh
resolutio
nmassspec-
trom
etry
(HRMS)
data4849
bullEstim
atetheimportance
ofdifferent
variablesandidentify
key-foulantsin
amem
brane
baseddrinking
water
treatm
ent
process5
0
K-means
Clustering
(K-means)51
bullUnsupervisedMLalgorithm
that
aimsto
clustersamples
usingfeature
values
bullSimpleto
implem
entandcaneasilyscaleto
largedata
sets
bullDelineate
different
airquality
indexthresholdclustersusingair
pollutio
ndata52
bullUserdefines
atarget
kvaluewhich
refersto
thenumberof
clustersone
wantsto
groupthesamples
into
bullDisadvantageisthat
oneneedsto
aprioridefinethekvaluewhich
may
bedifficult
bullAnalyze
water
quality
data55354
bullThe
kvaluerefersto
thenumberof
thecentroid
which
refersto
alocatio
nrepresentin
gthecenter
ofagivencluster
bullFails
forclusterswith
complex
nonsphericalgeom
etricshapes
bullGeneratevulnerability
mapsfor
anaquiferusingwater
quality
data55
bullK-m
eans
startswith
random
centroidsanditerativelyfinds
themost
appropriatecentroidsandassignsthesampleinto
respectiveclusters
such
thatthesum
ofthesquareddistance
betweenthesamples
andthecentroid
isminimized
bullClustercentroidscangetskewed
inthepresence
ofoutliershenceitisadvisedto
handleoutliersbefore
applying
k-means
bullRiskassessmentof
water
pollu-
tionsources5
6
Hierarchical
Clustering
bullUnsupervisedMLalgorithm
that
aimsto
clustersamples
usingfeature
values
bullNoaprioriinform
ationaboutthenumberof
clustersisrequired
bullFeatureclustering
basedon
Pearsoncorrelations
inHRMS
data57
bullThe
hierarchicalclustering
algorithm
inadditio
nto
breaking
uptheobjects
into
clustersalso
show
sthehierarchyor
rankingof
thedistance
andshow
showdissimilaroneclusterisfrom
othersH
ierarchicalclustering
isrepresentedas
dendograms
bullTimeintensive
bullEstim
atethesimilarity
between
theresistom
eprofilesof
different
samples
basedon
Brayminus
Curtis
dissimilarity
metric58
bullInputisameasureofdissimilaritybetweensamplesT
hechoice
ofsimilarity
(ordistance)measuresisthemaininfluencing
factor
inhierarchical
clustering
bullCould
bedifficultto
identifythecorrectnumberof
clusters
basedupon
thedendogram
bullWater
quality
assessmentwith
hierarchicalclusteranalysisbased
onMahalanobisdistance59
bullHierarchicalclusterscanbe
built
either
asagglom
erative(bottom-up
each
samplestartsin
itsow
nclusterandclustering
isdone
bymerging
each
sampleandmovingup
thehierarchy)
ordivisive
(top-dow
nallsam
ples
are
putin
oneclusterandsplitsareperformed
recursivelymovingdownthe
hierarchy)
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
D
ESE data sets can incorporate numerous data types withwide spatial and temporal variability Accordingly if possible athorough data visualizationexploration and preprocessingprocess should be performed to gain initial insights andstreamline the data for the subsequent model building steps(Figure 1b) To aid this depending on the data type wellestablished community guidelines could be leveraged Thisprocess typically includes four steps data cleaning integrationreduction and transformation (See the glossary for de-tails)1617
23 Model Building ML Models ML models fall intothree classes supervised unsupervised and semisupervised(Figure 1c) Supervised learning is a ML task where analgorithm is used to train a model using known data and thenthe learned model is used to predict the outcome of new orunforeseen instances18 In supervised learning a sample in thedata set has two components (1) a set of variables (features)that define the sample and (2) class labels (outcomes) that onewants to predict Two types of models are built usingsupervised learning algorithms classification or regressionmodels A classification model is suggested when the outputdata can be categorized into specific groups or classes (iediscrete output variables) If the output variable is a continuousvariable a regression model is recommendedUnsupervised learning is done in the absence of pre-existing
class labels It is a ML category where the goal is to find hiddenand unknown data patterns or to determine the datadistribution across the data set19 Unsupervised learningalgorithms are often used for data exploration and visual-ization The two major types of unsupervised learningapproaches are clustering and dimensionality reductionClustering algorithms are used to group similar samplestogether into clusters whereas dissimilar samples are relegatedto separate clusters Dimensionality reduction algorithms areused to reduce the dimension of the feature set Too manyuninformative features can hinder predictive modeling this isoften referred to as the ldquocurse of dimensionalityrdquo Somecommon supervised and unsupervised learning algorithms arelisted in Table 1Semisupervised learning is a hybrid of supervised and
unsupervised learning where a set of labeled data is used inconjunction with a set of unlabeled training data In manycases getting true labels for a data set can be difficultSemisupervised learning tackles this problem by making use ofunlabeled data in the learning process and then leveraging thislearned information along with the labeled data set forsubsequent prediction20
Model Optimization and Evaluation Supervised learningis usually a three-step process In the first step the data aresplit into training and testing data sets with the split rangingbetween 60 and 80 for training and 40minus20 for testing Thetraining data set is then iteratively separated into training andvalidation sets The model is trained on the training set whilethe validation set is used to tune the model hyperparametersthat consist of the model architecture and the parameters thataffect the speed and quality of the training process Followingthis cross-validation the learned model is tested against thetest data set to evaluate model robustness and accuracy Toensure unbiased model evaluation the test data set cannot beused for training Commonly used model evaluation metricsinclude accuracy F1-score precision recall area under thereceiver operating characteristic (ROC) curve mean absoluteerror (MAE) and mean squared error (MSE)21
Evaluating unsupervised models can be challenging becausedata do not have prior labels Evaluation typically involvesinternal and external validation22 Internal validation is done byestimating inter- and intra-cluster distances to evaluate clusterquality A cluster refers to a collection of data points thataccumulate together based on certain similarities Modelsminimizing intracluster distances while maximizing interclusterdistances are preferred Such a result suggests that thesemodels are performing well at clustering similar data pointstogether Some commonly used metrics for this purpose arethe Adjusted Rand Index and the Silhouette CoefficientHowever a good internal validation score does not guaranteethe effectiveness of the obtained clusters for real applicationsHence external validation is necessary to assess whether thedata points are assigned to the correct clusters This stepusually requires human evaluation or comparison againstbenchmarks For dimensionality reduction task reconstructionerror or loss is estimated to evaluate model performanceThese methods seek to minimize the reconstruction error orlossminusdefined as the distance between the original data pointand its projection onto a lower-dimensional subspace (itsldquoestimaterdquo)23
The ESE community has begun to leverage various MLalgorithms and techniques to analyze environmental data setsWhile it is difficult to determine a priori which ML algorithm isbest suited for a particular data set an improved understandingof the positives and negatives and the inherent assumptions ofthe specific algorithms aid in the choice of the righttechnique(s) Table 1 summarizes algorithms popular withinthe ESE community The table describes the algorithms theiradvantages and disadvantages the assumptions for specific datatypes and gives examples where the algorithms have been usedto address ESE related problems
24 Data Interpretation The final step in the modelbuilding cycle is to interpret the learnt model and developreasonable explanations for the obtained analysis (Figure 1d)Data interpretation could mean extracting important variablescontributing to the model prediction These variables couldhelp in verifying the hypotheses of the study or understandingwhat factors are critical in driving the observations in the studyunder consideration At this stage the literature and expertopinions are queried to make connections between the modeloutputs and knowledge about the relevant chemical physicaland biological phenomena to make data-driven conclusionsand decisions
3 CURRENT APPLICATIONS OF MACHINE LEARNINGIN ESE
In this section we discuss current applications of MLalgorithms within the ESE domain by presenting three casestudies representing different application areas
31 Metagenomic Data Analysis High throughputshotgun (untargeted) metagenomic DNA sequencing offers arobust and effective way to access the microbial world and isnow often used in ESE Differentiating microbial communitiesin different environments6061 studying the dissemination ofantibiotic resistance in environmental systems6263 definingbacterial communities in contaminated environments toexplore bioremediation64 and characterizing microbial com-munities in wastewater treatment processes6566 are allexamples of environments where metagenomic character-ization is increasingly being applied
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
E
In a typical metagenomic study the first step is often tosearch sequenced DNA reads against a reference database toderive taxonomic or functional gene annotation The obtainedtaxonomic or functional gene profiles are then normalized toprovide relative abundance information corresponding todifferent species or genes in the sample that can be visualizedand analyzed to answer specific questions about the sampleContinuous optimization and declining costs of sequencingplatforms have made metagenomics related research moreaccessible than ever before This trend has led to thegeneration of large volumes of publicly available sequencingdata but rendering these data into meaningful interpretationsremains challenging67 Fortunately the advent of ML methodshas immensely accelerated advancement at every cross sectionof a typical metagenomic study68 Here we highlight someapplications of ML in antibiotic resistance-related studiesOne often faced challenge with metagenomic data is that of
high dimensionality and low sample size (HDLSS ie morevariables than independent samples) For example the numberof genes annotated within a given environmental sample caneasily exceed hundreds of thousands Hence many unsuper-vised techniques such as principal component analysis (PCA)and non-metric multi-dimensional scaling (NMDS) are usedas preprocessing steps to reduce dimensionality by removinguninformative features A common application of thesetechniques is to examine the similaritydissimilarity of differentenvironmental samples by clustering based on species or genecomposition69 Network-based ML (data is represented in theform of a graph where nodes are the data points to be clusteredand edges represent the relationship or similarity between thedata points) is another unsupervised approach to studyproteinminusprotein or geneminusgene interactions to understanddifferent functional pathways70
Given the sample labels or response variables and the genecomposition support vector machines (SVMs) ANNs andensemble methods (such as random forest (RF) or extremelyrandomized trees) have been extensively used to performsupervised learning and build predictive models For exampleidentifying interesting patterns and important genes in a dataset58 predicting relative antibiotic resistance abundancelevels7 understanding the role of socioeconomic status inshaping the resistome or microbiome71 or the prediction ofantibiotic resistance phenotype72 are some of the uniqueproblems that have been examined using the above-mentionedmethods Hidden Markov models (HMMs) which can beused in both supervised or unsupervised fashion are one of themore readily used algorithms to detect antibiotic resistancegene variants or potential functional homologues73 and havebeen applied for the discovery of novel ARGs frommetagenomic data74
Word embedding which has gained a lot of popularity overthe recent years is a feature learning technique used in naturallanguage processing (NLP) Word embedding is a term usedfor the representation of words or sentences in the form ofnumeric vectors that can be used for downstream ML tasks75
Raw DNAprotein sequences sliced into k-mers are analogousto the structure of a sentence and can be analyzed in a similarfashion as natural languages Thus there is a rapid thrusttoward analyzing raw sequences using NLP based techni-ques7677 MetaMLP78 based on a similar idea uses a wordembedding based classification model to predict ARGphenotype and has been found to perform 50times faster thanDIAMOND (one of the fastest sequence alignment methods)
with similar accuracy Word embedding based models holdimmense promise in analyzing metagenomic data as they havethe ability to learn patterns directly from within raw sequencesSimilar to the antibiotic resistance example other
applications can also be envisioned For instance analyzingmetagenomics data for taxonomic classification7980 or mobilegenetic element classification81 where ML algorithms wereable to outperform the traditional methods In essence MLalgorithms have enhanced our ability to robustly analyzecomplex metagenomic data
32 Nontarget Analysis of Environmental Samples Arapidly developing approach in environmental analysis andtoxicology involves the use of high resolution massspectrometry (HRMS) based nontarget analysis (NTA)where data on accurate masses of molecular and fragmentions are collected without a priori information on thechemicals being analyzed NTA can potentially aid in thescreening and analysis of the vast and diverse universe oforganic pollutants a grand challenge faced by the ESEcommunity82 Because the chemicals being detected are notpredetermined NTA provides an opportunity to comprehen-sively examine the occurrence fate and transport of chemicalcontaminants in different environmental niches with minimalbias82 Similarly NTA can be applied in environmentalmetabolomics to facilitate understanding of the effects ofchemical perturbations on exposed organisms (eg plantsanimals and humans) without focusing on a particularbiochemical pathway8384 A recent review highlighted studiesthat employed the different types of NTA techniques used inmetabolomics to discover metabolite changes in plants inducedby exposure to xenobiotics (eg pharmaceuticals personal careproducts pesticides flame retardants and engineered nano-materials) to examine the effect of altered levels of nutrients inthe environment on plant systems85 Here we present a briefsummary of how the combination of NTA with ML canadvance monitoring of water wastewater quality soil andexposed organisms (eg humans wildlife)While there are various methods that can be used for NTA
in ESE HRMS is the most popular because of its capacity forsensitive detection of low levels of contaminants andmetabolites in complex environmental samples However thepower of NTA using HRMS has been limited by problemsassociated with sample preparation and data analysis Inanalyzing environmental samples using MS sample concen-tration and cleanup are critical because the signal intensity ofthe MS features depend not only on the concentrations of thechemicals but also on the amounts and the nature of thematrix present in the sample extracts The reproducibility ofthe ionization of compounds in MS can be compromisedsignificantly by matrix effects Therefore it is important to havean appropriate number of replicate samples and properlyselected blank samples when conducting NTA Solid-phaseextraction for sample cleanup is commonly used but thisapproach can introduce bias because highly polar contaminantsmay be lost during sample preparation Unlike in targetanalysis where variations due to matrix effects and samplelosses can be corrected using stable isotope-labeled referencecompounds as surrogates this approach cannot be used inNTA because the purpose of NTA is to identify unknowncontaminants that were not included in the target list Tonormalize for instrumental variation and matrix effects internalstandards with varying polarity can be added to the sampleextract prior to analysis Recently several compounds of
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
F
diverse structure log Kow chromatographic behavior andionization efficiency have been proposed for inclusion in aquality control mixture usable for NTA86 However theproposed mixture has limitations such as for the detection ofhydrophilic compounds and molecular formula generation forcompounds containing fluorine Finally due to the substantialamount of MS data acquired indiscriminately under full-scanand the subsequent MSMS fragmentation of molecular andfragment ions data processing to identify relevant features isdaunting Therefore NTA workflows with built-in filters andcriteria are being developed to facilitate prioritization of MSfeatures that are useful for chemical structure annotation Inthis regard advanced data processing tools high throughputstatistical packages and user-friendly visualization programsare needed to fully interrogate the rich data sets acquired byNTANTA based on HRMS holds great promise for the
comprehensive monitoring of the occurrence and fate ofcontaminants in water and wastewater87 It also allowsresearchers to retrospectively analyze stored HRMS data toscreen for suspected contaminants (ie suspect screening) thatmay have been missed during target analysis88 Using advancedcomputational strategies such as hierarchical cluster analysiscommon patterns in the occurrence of contaminants in theenvironment and their emission pathways can be predictedbased on time series analysis of the aggregated NTA data89
NTA combined with cluster analysis was successfully appliedto reveal previously unmonitored chemical contaminants insoil and sediment samples90 The application of NTA andadvanced postacquisition data treatment will continue toenhance our ability to discover emerging contaminants in theenvironment including those that bioaccumulate and poserisks to humans and wildlife For instance new polyhalo-genated compounds were detected for the first time in blubbersamples from marine mammal sentinel species using both LC-HRMS and GC-HRMS for analysis and an open-source datamining software in the R programming environment thatdetected halogenated signatures in full scan HRMS91 FinallyNTA combined with personal passive samplers and propersample preparation techniques can be used to unlock thecomposition of chemical mixtures that humans are exposed toon a daily basis which can be used in investigating the humanexposome92
Akin to metagenomic data NTA data pose a similar HDLSSchallenge as each mass spectrum constitutes a large number ofpeaks representing potential compounds present in a givensample Hence data preprocessing is crucial and inevitableVarious data preprocessing steps are performed such asdetecting peaks subtracting peaks that were found in blankor control samples componentization (ie grouping of signalsthat probably belong to one unique molecular structure) andremoving noise using replicate measurements93 Followingpreprocessing various supervised and unsupervised MLalgorithms can be applied to engineer select and extractrelevant features However before any further analysis datanormalization and data scaling are two crucial steps thatrequire considerationThe most common algorithms used for NTA are linear
projection methods such as PCA and supervised partial least-squares discriminant analysis (PLS-DA)949548 PCA aids insample comparison by removing variance from the sample setSupervised PLS-DA along with relevant metadata can be usedto extract features pertaining to the specific questions that are
being asked However a major challenge is that featuredetection based on peak intensity can be misleading because oferroneous peak assignments Also relevant information can belost when differentiating the actual contaminant signals andbackground noise when using intensity information for featureselection Hence several alternative approaches that aim atusing raw data signals (retention time times mass-to-charge ratio)that bypass the peak detection step have been proposed andimplemented for improved extraction of information andunderlying patterns in the data set96minus98 This includestechniques such as transforming the raw data signal to distancematrices for dissimilarity analysis96minus98 and using clustering toextract features99
A number of other data analysis algorithms exist but havenot yet become common in NTA environmental analysisClustering techniques such as k-means and hierarchicalclustering can be used to identify similar samples Algorithmslike RF SVMs and ANNs can be applied to classify samplesbased on different categories and would be advantageous inillustrating nonlinear relations in the data These algorithmshave shown excellent promise in analyzing HRMS data inother fields100minus103 and this promise can certainly be extendedto environmental sample analysis
33 Anomaly Detection in Engineered Water Sys-tems (EWS) EWS is an umbrella term for systems of watercollection treatment distribution storage and their operationRecently there has been a major thrust toward digitalization ofthe water sector104 In particular the evolution of cyberinfras-tructure and of online process control instrumentation has ledto the development of advanced process control solutions suchas supervisory control and data acquisition (SCADA)systems105106 Such advances have enabled water utilities tocontinuously monitor water quality identify problems andeffectively oversee maintenance issues both remotely and morelocally These systems entail collection of a large volume of rawdata that could be in conjunction with appropriate dataanalysis techniques transformed into valuable information thatcan be leveraged to make proactive decisions to optimizeoverall performance107 In particular there is surging interest inusing ML techniques to identify unusual patterns in raw EWSdata as a means of discovering unexpected activitiesminusthis isbroadly termed anomaly detection108
A typical anomaly detection task in EWS aims todifferentiate between natural expected variations in waterquality and unusual or suspicious variations caused bycontamination or failure somewhere in the system Thechallenge is to characterize these normal water qualityvariations as it requires analysis of long-term data thatencompasses inherent background variability109 This charac-terized normal response helps to flag anomalous events in thedata Various ML algorithms have been applied to address thisproblem (Table 1)Within EWS one primary form of the generated data is a
time series where each time point can be considered a discretesample The time series consists of measurements of indicatorscollected over time from one or several sensors which formthe feature set Hence each sample can be represented as amultidimensional vector where each dimension represents afeature Historic data is then used to train the model Givenappropriate analysis of the supporting data set one can frameand solve problems to address a multitude of aims such asdetecting leaks sensor failures abrupt changes in water qualityor contamination events110minus112
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
G
There are a number of different anomaly detectiontechniques available113 In the unsupervised approachunlabeled samples are clustered using an algorithm such ask-means density-based clustering algorithms or expectation-maximization (EM) clustering The concept is that the normaldata points cluster to form high density clusters whereasanomalous data points cluster separately or distant from thenormal data pointsclusters and are located in low-densityregions This way one can carry out additional investigationusing new samples and classify them relative to the normaldata In the supervised approach labels (normal or anomalous)are assigned to each sample based on past information andexpert knowledge of anomalous behavior In this way theproblem is converted into a binary classification problemThough it can easily be extended to a multiclass classificationto categorize different types of anomalies SVM ANNBayesian Network Logistic Regression models and theirvariants are well explored algorithms in this space113 A specialcase of SVM One-Class SVM (OCSVM) is a widely usedmethod for anomaly detection In OCSVM the entire trainingdata set is considered as one-class (eg normal class) and thenew data points are classified as similar (normal) or different(anomalous) to the training data Because considering all datapoints from one class is equivalent to having no label OCSVMis considered as unsupervised learning methodOwing to the success of ML in detecting anomalies in other
domains such as network security researchers have startedexploring its potential in water quality anomaly detec-tion43114115 Based on the studies published so far DLalgorithms have shown promise in detecting anomalies andhave outperformed conventional techniques on a number ofoccasions However it should be noted that DL methods couldbe slow to train depending on the depth of the network andthe amount of data available116
Many studies have adopted a batch learning approach wherethe model is trained on historical data and then the new data iscategorized using this trained model With a continuousincoming time-series data stream and the possibility of novelanomalies retraining the models with every new data set canquickly become impractical and difficult to reliably executeContinual or active learning frameworks that continuouslylearn as the new data stream comes in and that can identifyanomalous behavior in real time are required to circumventthese issues117118 Variants of Latent Dirichlet AllocationMarkov Models and ANN based architectures are algorithmsthat have been applied in other domains to achieve continualonline learning119minus121 However thes approaches remainunderexplored for anomaly detection in water systems asthese frameworks are not trivial and have their ownimplementation challenges118119
Ultimately anomaly detection will be most valuable if it canlearn continually as the data stream comes in and yield real-time reporting that informs immediate corrective action It iscrucial that the models being utilized are fast and accurateHence going forward there is a need to shift the focus towardhybrid approaches (combinations of different algorithms) tobuild powerful models that are able to detect multiple types ofanomalies in real-time Such models will enhance EWSanomaly detection and advance public health9118119
4 PATH FORWARDThere is increasing interest in and a growing body of researchdocumenting how data analytics is being used to address ESE
problems As illustrated in this Feature the power of dataanalytics has been widely recognized as has been the vast needfor its application Anomaly detection has particularlypromising applications for water professionals and practi-tioners Notably broader application of metagenomics andNTA can revolutionize environmental monitoring effortsHowever the application of data analytics in ESE practiceremains in its infancy and concerted efforts are required tomake data analytics an integral part of ESE research andeducation Such a goal would most effectively be achieved ifthere were an agreed-upon plan of action Coordinated effortson several fronts are needed to help the ESE community reapthe potential benefits of data analyticsFirst there is a need to encourage collaborations between
data scientists and ESE practitioners that involve diverselyengaged and integrated research teams from multipledisciplines Such collaborations will facilitate cross-disciplinarycommunication and simultaneous skills building For exampledata scientists come from a culture highly supportive of datasharing (eg GitHub Bitbucket public databases) The ESEcommunity should embrace this culture and incorporate datasharing both locally regionally and globally Recent efforts toaddress the COVID-19 pandemic122123 the global dissem-ination of antimicrobial resistance7 and the development ofglobally vetted data analysis approaches within the NTAcommunity are all steps in this direction Such collaborationshave the potential to yield profound data-driven insights andconclusions that would be impossible to achieve within normalacademic and professional silos Working on their own datascientists may create models that answer specific questions butlack the interpretive training required to contextualize theirresults into real world applications Simultaneously ESEpractitioners with data may not possess the correspondingtools or insights required for proper analysis Strategiccollaborations will lead to improved data analytics datainterpretation and improved decision makingSecond there is a lack of user-friendly data analytics tools
and platforms devised for ESE problems Although under-standing the theory behind ML algorithms is not a difficult taskfor many trained scientists a major hurdle that ESE researchersface is in the coding and implementation of these models Withthe continual advancement of data science programming isbecoming an increasingly necessary skill without which modelsmay be improperly developed or may lack efficiency User-friendly tools may circumvent the problems with codinglearning curves and may be essential for widespreadpractitioner adoption While many tools (Microsoft powerBI Weka Tableau Azure) are available for general dataanalysis application-specific toolsplatforms would greatly aidin performing end-to-end analysesThird the field needs to incorporate data analytics-oriented
curricula This could include the addition of data analyticsfocused modules within existing coursework the introductionof new data science and statistics courses within ESE degreeprograms and the development of relevant capstone projectsfor ESE applications Such experiential learning would helpstudents understand the intricacies of different algorithms andprovide hands-on experience in applying various data analyticstechniques in the context of specific ESE problems Furtherencouraging students to take part in data science internshipswould be an excellent means of developing such expertiseFourth introducing data analytics workshops and making
the relevant resources available is valuable for students
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
H
researchers and professionals A plethora of online resourcessuch as Coursera LinkedIn Learning and Edx are available forlow or no cost offering high-quality content in data scienceExpanding access to and creation of open source learningplatforms could prove instrumental in instigating the data-driven problem-solving approachFinally we know that data analytics is not new but
continues to evolve at a rapid pace These advancementsfurnish us with new and powerful ways to analyze the data thatcan help holistically tackle the challenges faced by ESEcommunity and ultimately inform decision-making and policyformulation at scales never previously imagined Hence it iscritical to be abreast of new developments in data analytics andto continue making efforts to harvest their full potential
AUTHOR INFORMATIONCorresponding AuthorPeter Vikesland minus Via Department of Civil andEnvironmental Engineering Virginia Tech BlacksburgVirginia 24061 United States orcidorg0000-0003-2654-5132 Email pvikesvtedu
AuthorsSuraj Gupta minus The Interdisciplinary PhD Program in GeneticsBioinformatics and Computational Biology Virginia TechBlacksburg Virginia 24061 United States
Diana Aga minus Department of Chemistry University at BuffaloThe State University of New York Buffalo New York 14226United States orcidorg0000-0001-6512-7713
Amy Pruden minus Via Department of Civil and EnvironmentalEngineering Virginia Tech Blacksburg Virginia 24061United States orcidorg0000-0002-3191-6244
Liqing Zhang minus Department of Computer Science VirginiaTech Blacksburg Virginia 24061 United States
Complete contact information is available athttpspubsacsorg101021acsest1c01026
NotesThe authors declare no competing financial interestBiography
Peter J Vikesland is the Nick Prillaman Professor of Civil andEnvironmental Engineering at Virginia Tech He received his BAfrom Grinnell College in Chemistry in 1993 and MS and PhD inCivil and Environmental Engineering from University of Iowa in 1995and 1998 Vikeslandrsquos research focuses on the fate of nanomaterials inthe environment nanotechnology enabled sensor platforms forenvironmental quality assessment the environmental disseminationof antibiotic resistance and the application of data analytics in
environmental science and engineering He is a past President of theAssociation of Environmental Engineering and Science Professors(AEESP) is a US National Science Foundation CAREER awardeeand is the recipient of the 2018 Walter Weber Research InnovationAward from AEESP
ACKNOWLEDGMENTS
This study was supported by National Science Foundation(NSF) awards OISE (1545756) ECCSS NNCI (1542100)and CSSI (2004751) Additional support was provided by theCenter for Science and Engineering of the Exposome at theVirginia Tech Institute for Critical Technology and AppliedScience (ICTAS) and the Virginia Tech Graduate Schoolsupported Genetics Bioinformatics and ComputationalBiology (GBCB) and Sustainable Nanotechnology (VTSuN)programs
GLOSSARY
Accuracy Correct predictionstotal predictions
Autoencoders A deep learning techni-que that aims to build amodel where the outputtargets are set equal tothe input variables Itseeks to learn an ap-proximate representa-tion of the data andreconstruct it It is anunsupervised learningapproach used for di-mensionality reduction
Bayesian Network Bayesian networks are akind of probabilisticgraphical model thatutilizes Bayesian infer-ence for probability es-timations Bayesian net-works are used tomodel conditional de-pendence and hencecausation using a direc-ted graph
Data cleaning Identification of outliersor filling in of missingvalues Outlier identifi-cation can be done byusing Z-score quartilevalues or hypothesistesting Missing datacan be handled in mul-tiple ways such as fillingthe value by computingthe summary statisticsof the given variableusing predicted valuescomputed by an MLalgorithm manual cura-tion or by ignoring themissing record
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
I
Data integration The combination ofmultiple data sourcesinto a single coherentform
Data reduction Aggregation or elimina-tion of redundant in-formation to reduce thedata set size
Data transformation Conversion of raw databy normalizing to acommon scale to en-sure consistency andcomparability
Density Based Clustering Work by recognizingldquodenserdquo groups ofpoints permitting it tolearn clusters of discre-tionary shape and dis-tinguish anomalies inthe information
Dimensionality Reduction The process of reducingthe number of randomvariables or attributesunder consideration
EM Clustering Similar to k-means ex-cept it assigns the sam-ples into clusters basedon the probabilities es-timated using the EMalgorithm The objec-tive is to maximize theoverall probability ofthe data for the given(final) clusters
Ensemble Methods Ensemble methods arealgorithms that com-bine predictions fromseveral base models toobtain one optimalmodel
F1-Score F1-Score is a harmonicmean of recall andprecision
HMM HMM is a statisticalMarkov mode l i nwhich the system beingmodeled is assumed tobe a Markov process
k-mer A substring with alength k in a biologicalsequence of nucleoti-des
Naive Bayes Naive Bayes methodsare a family of simpleprobabilistic classifiersbased on Bayes ruleThe method is calledldquoNaiverdquo for its assump-tion of conditional in-dependence among thefeatures
Natural Language Processing (NLP) A subfield of computerscience artificial intelli-gence and linguisticsthat deals with theinteraction of com-puters with human lan-guages
NMDS NMDS is an ordinationtechnique based on dis-tance or dissimilaritymatrix NMDS repre-sents pairwise dissimi-larity between samplesin a low-dimensionalspace
PLS-DA PLS-DA is a linearclassification modelthat is able to predictthe class of new sam-ples
R Programming languagefor Statistical Comput-ing
ROC-Curve A receiver operatingcharacteristic (ROC)curve is a plot of truepositive rate (TPR-yaxis) against the falsepositive rate (FPR-xaxis) It measures theperformance of a classi-fier
SCADA Supervisory control andd a t a a c q u i s i t i o n(SCADA) is a systemdesigned to gather andanalyze real time dataIt is used in waterwastewater treatmentplants to monitor andmanage processes
REFERENCES(1) Tukey J W The future of data analysis Ann Math Stat 196233 (1) 1minus67(2) Hayashi C What is data science Fundamental concepts and aheuristic example In Data science classification and related methodsSpringer 1998 pp 40minus51(3) Bishop C M Pattern recognition and machine learning Springer2006(4) Qiu J Wu Q Ding G Xu Y Feng S A survey of machinelearning for big data processing EURASIP Journal on Advances inSignal Processing 2016 2016 (1) 67(5) Agarwal R Dhar V Big data data science and analytics Theopportunity and challenge for IS research In INFORMS 201425443(6) Wilcox C Woon W L Aung Z Applications of machinelearning in environmental engineering Citeseer 2013(7) Hendriksen R S Munk P Njage P Van Bunnik BMcNally L Lukjancenko O Roumlder T Nieuwenhuijse DPedersen S K Kjeldgaard J Global monitoring of antimicrobialresistance based on metagenomics analyses of urban sewage NatCommun 2019 10 (1) 1124(8) Sobus J R Wambaugh J F Isaacs K K Williams A JMcEachran A D Richard A M Grulke C M Ulrich E MRager J E Strynar M J Integrating tools for non-targeted analysis
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
J
research and chemical safety evaluations at the US EPA J ExposureSci Environ Epidemiol 2018 28 (5) 411minus426(9) Dogo E M Nwulu N I Twala B Aigbavboa C A survey ofmachine learning methods applied to anomaly detection on drinking-water quality data Urban Water J 2019 16 (3) 235minus248(10) Wilde F D Radtke D B Gibs J Iwatsubo R T Nationalfield manual for the collection of water-quality data US Geological SurveyTechniques of Water-Resources Investigations Book 9 1998(11) National Research Council Confronting the nationrsquos waterproblems The role of research National Academies Press 2004(12) Bharti R Grimm D G Current challenges and best-practiceprotocols for microbiome analysis Briefings in bioinformatics202122178(13) Veenaas C Developing tools for non-target analysis and digitalarchiving of organic urban water pollutants 2018(14) Keim D A Visual exploration of large data sets CommunACM 2001 44 (8) 38minus44(15) Chen J Chen Y Du X Li C Lu J Zhao S Zhou X Bigdata challenge a data management perspective Frontiers of ComputerScience 2013 7 (2) 157minus164(16) Chakrabarti S Cox E Frank E Guting R H Han J JiangX Kamber M Lightstone S S Nadeau T P Neapolitan R EData mining know it all Morgan Kaufmann 2008(17) Gibert K Sanchez-Marre M Izquierdo J A survey on pre-processing techniques Relevant issues in the context of environ-mental data mining AI Communications 2016 29 (6) 627minus663(18) Russell S Norvig P Artificial intelligence a modern approach2002(19) Hinton G E Sejnowski T J Poggio T A UnsupervisedLearning Foundations of Neural Computation MIT press 1999(20) Van Engelen J E Hoos H H A survey on semi-supervisedlearning Machine Learning 2020 109 (2) 373minus440(21) Brink H Richards J W Fetherolf M Cronin B Real-worldmachine learning Manning Shelter Island NY 2017(22) Kassambara A Practical guide to cluster analysis in RUnsupervised machine learning Sthda Vol 1(23) Ghodsi A Dimensionality reduction a short tutorialDepartment of Statistics and Actuarial Science Univ of WaterlooOntario Canada 2006 37 (38) 2006(24) Wright R E Logistic regression 1995(25) Alvarez-Arbesu R Feliciacutesimo A M GIS and logisticregression as tools for environmental management a coastal cliffvegetation model in Northern Spain WIT Transactions on Informationand Communication Technologies 2002 26(26) Liu L Sankarasubramanian A Ranjithan S R Logisticregression analysis to estimate contaminant sources in waterdistribution systems J Hydroinf 2011 13 (3) 545minus557(27) Thoe W Gold M Griesbach A Grimmer M Taggart ML Boehm A B Predicting water quality at Santa Monica Beachevaluation of five different models for public notification of unsafeswimming conditions Water Res 2014 67 105minus117(28) Muharemi F Logofa tu D Andersson C Leon FApproaches to building a detection model for water quality a casestudy In Modern Approaches for Intelligent Information and DatabaseSystems Springer 2018 pp 173minus183(29) Liaw A Wiener M Classification and regression byrandomForest R news 2002 2 (3) 18minus22(30) Gromski P S Muhamadali H Ellis D I Xu Y Correa ETurner M L Goodacre R A tutorial review Metabolomics andpartial least squares-discriminant analysis-a marriage of convenienceor a shotgun wedding Anal Chim Acta 2015 879 10minus23(31) Lee S Kim J Prediction of Nanofiltration and Reverse-Osmosis-Membrane Rejection of Organic Compounds UsingRandom Forest Model J Environ Eng 2020 146 (11) 04020127(32) Castrillo M Garciacutea A L Estimation of high frequencynutrient concentrations from water quality surrogates using machinelearning methods Water Res 2020 172 115490
(33) Tan G Yan J Gao C Yang S Prediction of water qualitytime series data based on least squares support vector machineProcedia Eng 2012 31 1194minus1199(34) Yang Y H Guergachi A Khan G Support vector machinesfor environmental informatics application to modelling the nitrogenremoval processes in wastewater treatment systems Journal ofEnvironmental Informatics 2006 7 (1) 14minus25(35) Feng J Pan L Cui B Sun Y Zhang A Gong M NewApproach for Concentration Prediction of Aqueous PhenolicContaminants by Using Wavelet Analysis and Support VectorMachine Environmental Engineering Science 2020 37 (5) 382minus392(36) Silver D Schrittwieser J Simonyan K Antonoglou IHuang A Guez A Hubert T Baker L Lai M Bolton AMastering the game of go without human knowledge Nature 2017550 (7676) 354minus359(37) Mikolov T Deoras A Povey D Burget L Cernocky JStrategies for Training Large Scale Neural Network Language ModelsIEEE 2011 pp 196minus201(38) Deng J Dong W Socher R Li L-J Li K Fei-Fei L InImagenet A Large-Scale Hierarchical Image Database Ieee 2009 pp248minus255(39) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (Aug) 2493minus2537(40) Alwosheel A van Cranenburgh S Chorus C G Is yourdataset big enough Sample size requirements when using artificialneural networks for discrete choice analysis Journal of choice modelling2018 28 167minus182(41) Yang J Xu J Zhang X Wu C Lin T Ying Y Deeplearning for vibrational spectral analysis Recent progress and apractical guide Anal Chim Acta 2019 1081 6minus17(42) Guo H Jeong K Lim J Jo J Kim Y M Park J-p Kim JH Cho K H Prediction of effluent concentration in a wastewatertreatment plant using machine learning models J Environ Sci 201532 90minus101(43) Inoue J Yamagata Y Chen Y Poskitt C M Sun JAnomaly Detection for a Water Treatment System Using UnsupervisedMachine Learning 2017 2017 IEEE 2017 pp 1058minus1065(44) Voukantsis D Karatzas K Kukkonen J Raumlsaumlnen TKarppinen A Kolehmainen M Intercomparison of air quality datausing principal component analysis and forecasting of PM10 andPM2 5 concentrations using artificial neural networks in Thessalonikiand Helsinki Sci Total Environ 2011 409 (7) 1266minus1276(45) Wold S Esbensen K Geladi P Principal componentanalysis Chemom Intell Lab Syst 1987 2 (1minus3) 37minus52(46) Jolliffe I T Cadima J Principal component analysis a reviewand recent developments Philos Trans R Soc A 2016 374 (2065)20150202(47) Dalal S G Shirodkar P V Jagtap T G Naik B G Rao GS Evaluation of significant sources influencing the variation of waterquality of Kandla creek Gulf of Katchchh using PCA Environ MonitAssess 2010 163 (1minus4) 49minus56(48) Scholleacutee J E Schymanski E L Avak S E Loos MHollender J Prioritizing unknown transformation products frombiologically-treated wastewater using high-resolution mass spectrom-etry multivariate statistics and metabolic logic Anal Chem 2015 87(24) 12121minus12129(49) Masiaacute A Campo J Blasco C Picoacute Y Ultra-highperformance liquid chromatography-quadrupole time-of-flight massspectrometry to identify contaminants in water an insight onenvironmental forensics Journal of Chromatography A 2014 134586minus97(50) Peiris R H Halleacute C Budman H Moresoli C Peldszus SHuck P M Legge R L Identifying fouling events in a membrane-based drinking water treatment process using principal componentanalysis of fluorescence excitation-emission matrices Water Res 201044 (1) 185minus194(51) MacQueen J In Some methods for classification and analysis ofmultivariate observations 1967 Oakland CA 1967 pp 281minus297
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
K
(52) Kingsy G R Manimegalai R Geetha D M S Rajathi SUsha K Raabiathul B N In Air pollution analysis using enhanced K-Means clustering algorithm for real time sensor data IEEE 2016 pp1945minus1949(53) Zou H Zou Z Wang X An enhanced K-means algorithmfor water quality analysis of the Haihe River in China Int J EnvironRes Public Health 2015 12 (11) 14400minus14413(54) Areerachakul S Sanguansintukul S Clustering analysis of waterquality for canals in Bangkok Thailand Springer 2010 pp 215minus227(55) Javadi S Hashemy S M Mohammadi K Howard K W FNeshat A Classification of aquifer vulnerability using K-means clusteranalysis J Hydrol 2017 549 27minus37(56) Li C Sun L Jia J Cai Y Wang X Risk assessment ofwater pollution sources based on an integrated k-means clustering andset pair analysis method in the region of Shiyan China Sci TotalEnviron 2016 557 307minus316(57) Brunner A M Bertelkamp C Dingemans M M LKolkman A Wols B Harmsen D Siegers W Martijn B JOorthuizen W A Ter Laak T L Integration of target analyses non-target screening and effect-based monitoring to assess OMP relatedwater quality changes in drinking water treatment Sci Total Environ2020 705 135779(58) Gupta S Arango-Argoty G Zhang L Pruden A VikeslandP Identification of discriminatory antibiotic resistance genes amongenvironmental resistomes using extremely randomized tree algorithmMicrobiome 2019 7 (1) 123(59) Du X Shao F Wu S Zhang H Xu S Water qualityassessment with hierarchical cluster analysis based on Mahalanobisdistance Environ Monit Assess 2017 189 (7) 1minus12(60) Bibby K Crank K Greaves J Li X Wu Z Hamza I AStachler E Metagenomics and the development of viral water qualitytools npj Clean Water 2019 2 (1) 1minus13(61) Chu B T T Petrovich M L Chaudhary A Wright DMurphy B Wells G Poretsky R Metagenomics reveals the impactof wastewater treatment plants on the dispersal of microorganismsand genes in aquatic sediments Appl Environ Microbiol 2018 84 (5)No e02168-17(62) Petrovich M L Zilberman A Kaplan A Eliraz G R WangY Langenfeld K Duhaime M Wigginton K Poretsky R AvisarD Microbial and Viral Communities and Their Antibiotic ResistanceGenes Throughout a Hospital Wastewater Treatment System FrontMicrobiol 2020 11 153(63) Dai D Rhoads W J Edwards M A Pruden A Shotgunmetagenomics reveals taxonomic and functional shifts in hot watermicrobiome due to temperature setting and stagnation FrontMicrobiol 2018 9 2695(64) Malla M A Dubey A Yadav S Kumar A Hashem AAbd_Allah E F Understanding and designing the strategies for themicrobe-mediated remediation of environmental contaminants usingomics approaches Front Microbiol 2018 9 1132(65) Giwa A S Ali N Athar M A Wang K Dissectingmicrobial community structure in sewage treatment plant forpathogensrsquo detection using metagenomic sequencing technologyArch Microbiol 2020 202 1minus9(66) Guo J Ni B-J Han X Chen X Bond P Peng Y YuanZ Data on metagenomic profiles of activated sludge from a full-scalewastewater treatment plant Data in brief 2017 15 833minus839(67) Laudadio I Fulci V Stronati L Carissimi C Next-generation metagenomics Methodological challenges and opportu-nities Omics a journal of integrative biology 2019 23 (7) 327minus333(68) Soueidan H Nikolski M Machine learning for metagenomicsmethods and tools arXiv preprint arXiv151006621 2015DOI 101515metgen-2016-0001(69) Ramette A Multivariate analyses in microbial ecology FEMSMicrobiol Ecol 2007 62 (2) 142minus160(70) Yang P Yu S Cheng L Ning K Meta-network optimizedspecies-species network analysis for microbial communities BMCGenomics 2019 20 (2) 187
(71) Collignon P Beggs J J Walsh T R Gandra SLaxminarayan R Anthropological and socioeconomic factorscontributing to global antimicrobial resistance a univariate andmultivariable analysis Lancet Planetary Health 2018 2 (9) No e398-e405(72) Arango-Argoty G Garner E Pruden A Heath L SVikesland P Zhang L DeepARG a deep learning approach forpredicting antibiotic resistance genes from metagenomic dataMicrobiome 2018 6 (1) 1minus15(73) El-Gebali S Mistry J Bateman A Eddy S R Luciani APotter S C Qureshi M Richardson L J Salazar G A Smart AThe Pfam protein families database in 2019 Nucleic Acids Res 201947 (D1) D427minusD432(74) Berglund F Marathe N P Osterlund T Bengtsson-PalmeJ Kotsakis S Flach C-F Larsson D G J Kristiansson EIdentification of 76 novel B1 metallo-β-lactamases through large-scalescreening of genomic and metagenomic data Microbiome 2017 5 (1)134(75) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (ARTICLE) 2493minus2537(76) Ng P dna2vec Consistent vector representations of variable-length k-mers arXiv preprint arXiv170106279 2017(77) Woloszynek S Zhao Z Chen J Rosen G L 16S rRNAsequence embeddings Meaningful numeric feature representations ofnucleotide sequences that are convenient for downstream analysesPLoS Comput Biol 2019 15 (2) No e1006721(78) Arango-Argoty G A Heath L S Pruden A Vikesland PZhang L MetaMLP A fast word embedding based classifier to profiletarget gene databases in metagenomic samples bioRxiv 2019 569970(79) Liang Q Bible P W Liu Y Zou B Wei L DeepMicrobestaxonomic classification for metagenomics with deep learning NARGenomics and Bioinformatics 2020 2 (1) No lqaa009(80) Fiannaca A La Paglia L La Rosa M Renda G Rizzo RGaglio S Urso A Deep learning models for bacteria taxonomicclassification of metagenomic data BMC Bioinf 2018 19 (7) 198(81) Krawczyk P S Lipinski L Dziembowski A PlasFlowpredicting plasmid sequences in metagenomic data using genomesignatures Nucleic Acids Res 2018 46 (6) No e35-e35(82) Hollender J van Bavel B Dulio V Farmen E FurtmannK Koschorreck J Kunkel U Krauss M Munthe J Schlabach MHigh resolution mass spectrometry-based non-target screening cansupport regulatory environmental monitoring and chemicals manage-ment Environ Sci Eur 2019 31 (1) 42(83) Soria N G C Montes A Bisson M A Atilla-Gokcumen GE Aga D S Mass spectrometry-based metabolomics to assess uptakeof silver nanoparticles by Arabidopsis thaliana Environ Sci Nano2017 4 (10) 1944minus1953(84) Matich E K Ghafari M Camgoz E Caliskan E Pfeifer BA Haznedaroglu B Z Atilla-Gokcumen G E Time-serieslipidomic analysis of the oleaginous green microalga species Ettliaoleoabundans under nutrient stress Biotechnol Biofuels 2018 11 (1)1minus15(85) Matich E K Soria N G C Aga D S Atilla-Gokcumen GE Applications of metabolomics in assessing ecological effects ofemerging contaminants and pollutants on plants J Hazard Mater2019 373 527minus535(86) Knolhoff A M Premo J H Fisher C M A ProposedQuality Control Standard Mixture and Its Uses for EvaluatingNontargeted and Suspect Screening LCHR-MS Method Perform-ance Anal Chem 2021931596(87) Samanipour S Kaserzon S Vijayasarathy S Jiang H ChoiP Reid M J Mueller J F Thomas K V Machine learningcombined with non-targeted LC-HRMS analysis for a risk warningsystem of chemical hazards in drinking water A proof of conceptTalanta 2019 195 426minus432(88) Angeles L F Islam S Aldstadt J Saqeeb K N Alam MKhan M A Johura F-T Ahmed S I Aga D S Retrospectivesuspect screening reveals previously ignored antibiotics antifungal
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
L
compounds and metabolites in Bangladesh surface waters Sci TotalEnviron 2020 712 136285(89) Albergamo V Scholleacutee J E Schymanski E L Helmus RTimmer H Hollender J De Voogt P Nontarget screening revealstime trends of polar micropollutants in a riverbank filtration systemEnviron Sci Technol 2019 53 (13) 7584minus7594(90) Chiaia-Hernaacutendez A C Scheringer M Muller A StiegerG Waumlchter D Keller A Pintado-Herrera M G Lara-Martin PA Bucheli T D Hollender J Target and suspect screening analysisreveals persistent emerging organic contaminants in soils andsediments Sci Total Environ 2020 740 140181(91) Cariou R Meacutendez-Fernandez P Hutinet S b Guitton YCaurant F Le Bizec B Spitz J r m Vetter W Dervilly GNontargeted LCESI-HRMS Detection of Polyhalogenated Com-pounds in Marine Mammals Stranded on French Atlantic Coasts ACSESampT Water 20211309(92) Travis S C Kordas K Aga D S Optimized workflow forunknown screening using gas chromatography high-resolution massspectrometry expands identification of contaminants in siliconepersonal passive samplers Rapid Commun Mass Spectrom 2021 35(8) No e9048(93) Hollender J Schymanski E L Singer H P Ferguson P LNontarget screening with high resolution mass spectrometry in theenvironment ready to go ACS Publications 2017(94) Muller A Schulz W Ruck W K L Weber W H A newapproach to data evaluation in the non-target screening of organictrace substances in water analysis Chemosphere 2011 85 (8) 1211minus1219(95) Scholleacutee J E Schymanski E L Hollender J Statisticalapproaches for LC-HRMS data to characterize prioritize and identifytransformation products from water treatment processes In AssessingTransformation Products of Chemicals by Non-Target and SuspectScreening- Strategies and Workflows Vol 1 ACS Publications 2016 pp45minus65(96) Sireacuten K Fischer U Vestner J Automated supervisedlearning pipeline for non-targeted GC-MS data analysis AnalyticaChimica Acta X 2019 1 100005(97) Sinkov N A Harynuk J J Cluster resolution A metric forautomated objective and optimized feature selection in chemometricmodeling Talanta 2011 83 (4) 1079minus1087(98) Johnsen L G Amigo J M Skov T Bro R Automatedresolution of overlapping peaks in chromatographic data J Chemom2014 28 (2) 71minus82(99) Smirnov A Jia W Walker D I Jones D P Du X ADAP-GC 32 graphical software tool for efficient spectral deconvolution ofgas chromatography-high-resolution mass spectrometry metabolomicsdata J Proteome Res 2018 17 (1) 470minus478(100) Chen C Husny J Rabe S Predicting fishiness off-flavourand identifying compounds of lipid oxidation in dairy powders bySPME-GCMS and machine learning Int Dairy J 2018 77 19minus28(101) Taghadomi-Saberi S Mas Garcia S Allah Masoumi ASadeghi M Marco S Classification of bitter orange essential oilsaccording to fruit ripening stage by untargeted chemical profiling andmachine learning Sensors 2018 18 (6) 1922(102) Yang Q Xu L Tang L-J Yang J-T Wu B-Q Chen NJiang J-H Yu R-Q Simultaneous detection of multiple inheritedmetabolic diseases using GC-MS urinary metabolomics by chemo-metrics multi-class classification strategies Talanta 2018 186 489minus496(103) Smolinska A Hauschild A C Fijten R R R Dallinga JW Baumbach J Van Schooten F J Current breathomicsa reviewon data pre-processing techniques and machine learning inmetabolomics breath analysis J Breath Res 2014 8 (2) 027105(104) Garrido-Baserba M Corominas L Corteacutes U Rosso DPoch M The fourth-revolution in the water sector encounters thedigital revolution Environ Sci Technol 2020 54 (8) 4698minus4705(105) Vujnovic G Perisic J Bozilovic Z Milovanovic MGetman R V Radovanovic L USING SCADA SYSTEM FOR
PROCESS CONTROL IN WATER INDUSTRY Acta TechnicaCorviniensis-Bulletin of Engineering 2019 12 (2) 67minus72(106) Olsson G ICA and me-a subjective review Water Res 201246 (6) 1585minus1624(107) Zhao H Hou D Huang P Zhang G Water quality eventdetection in drinking water network Water Air Soil Pollut 2014 225(11) 2183(108) Chandola V Banerjee A Kumar V Anomaly detection Asurvey ACM computing surveys (CSUR) 2009 41 (3) 1minus58(109) Hasan J Goldbloom-Helzner D Ichida A Rouse TGibson M Technologies and Techniques for Early Warning Systems toMonitor and Evaluate Drinking Water Quality A State-of-the-ArtReview Environmental Protection Agency Washington Dc Office ofWater 2005(110) Izquierdo J Loacutepez P A Martiacutenez F J Peacuterez R Faultdetection in water supply systems using hybrid (theory and data-driven) modelling Mathematical and Computer Modelling 2007 46(3minus4) 341minus350(111) Murray S Ghazali M McBean E A Real-time water qualitymonitoring assessment of multisensor data using Bayesian beliefnetworks Journal of Water Resources Planning and Management 2012138 (1) 63minus70(112) Raciti M Cucurull J Nadjm-Tehrani S Anomaly detectionin water management systems In Critical Infrastructure ProtectionSpringer 2012 pp 98minus119(113) Ahmed M Mahmood A N Hu J A survey of networkanomaly detection techniques Journal of Network and ComputerApplications 2016 60 19minus31(114) Yuan Y Jia K A Water Quality Assessment Method Based onSparse Autoencoder 2015 IEEE 2015 pp 1minus4(115) Muharemi F Logofatu D Leon F Machine learningapproaches for anomaly detection of water quality on a real-worlddata set Journal of Information and Telecommunication 2019 3 (3)294minus307(116) Alom M Z Taha T M Yakopcic C Westberg S SidikeP Nasrin M S Hasan M Van Essen B C Awwal A A S AsariV K A state-of-the-art survey on deep learning theory andarchitectures Electronics 2019 8 (3) 292(117) Russo S Lurig M Hao W Matthews B Villez K Activelearning for anomaly detection in environmental data EnvironmentalModelling amp Software 2020 134 104869(118) Ahmad S Lavin A Purdy S Agha Z Unsupervised real-time anomaly detection for streaming data Neurocomputing 2017262 134minus147(119) Stocco A Tonella P Towards Anomaly Detectors that LearnContinuously IEEE 2020 pp 201minus208(120) Hoffman M Bach F R Blei D M In Online learning forlatent dirichlet allocation 2010 Citeseer pp 856minus864(121) Mongillo G Deneve S Online learning with hidden Markovmodels Neural computation 2008 20 (7) 1706minus1716(122) Bogler A Packman A Furman A Gross A Kushmaro ARonen A Dagot C Hill C Vaizel-Ohayon D Morgenroth ERethinking wastewater risks and monitoring in light of the COVID-19pandemic Nature Sustainability 2020 9 1minus10(123) Bivins A North D Ahmad A Ahmed W Alm E BeenF Bhattacharya P Bijlsma L Boehm A B Brown J Wastewater-Based Epidemiology Global Collaborative to Maximize Contributions inthe Fight Against COVID-19 ACS Publications 2020
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
M
potential for the water industry9 However interpreting thesetypes of data is not trivial and requires appropriate applicationof ML techniquesThe aim of this Feature is to provide an overview of data
analysis frameworks suitable for application within ESE Wefirst introduce a general data analysis framework then highlightcurrent applications of ML within ESE problem domains andconclude with thoughts about the future
2 GENERAL DATA ANALYSIS FRAMEWORK
A general data analysis framework includes data acquisitiondata exploration data preprocessing and visualizationalgorithms for model building and ways to evaluate andinterpret the models and the results (Figure 1)21 Data Acquisition Data acquisition is the process of
collecting or acquiring data and storing it in a readily accessibleform such as a database (Figure 1a) Data acquisition requiresthorough planning to ensure the data set can be effectivelyinterrogated for the desired end purpose Key elements of sucha plan include deciding upon the data collection methodologyselecting variables of interest determining the samplingfrequency and the number of replicates assessing how muchdata is needed to effectively build the model(s) and test the
study hypotheses and deciding upon means to properlydocument and store the dataWork within ESE increasingly requires acquisition of vast
amounts of data Hence it is necessary to frame and adoptsuccinct protocols to minimize biases and increase compara-bility and reproducibility across the discipline To that extentseveral publications and guidance documents have focused ongood implementation practices for collection of robust datasets10111213
22 Data ExplorationVisualization and Preprocess-ing Data exploration or visualization is used to understand thedata characteristics Data exploration helps illustrate keyaspects such as sample size missing values distributionsinitial patterns correlations or variables that appear to besensitive to the system of interest Exploration is commonlyachieved by plotting and visualizing the data in a manner suchthat distributions and trends are readily apparent14
Raw data are often incomplete noisy and inconsistent dueto data redundancy incompatibilities between multiple datasources and divergent data collection protocols Poor dataquality can lead to inaccurate or misleading analyses15 Datapreprocessing is intended to improve model quality andminimize computational resource usage
Figure 1 Data Analysis Framework and Methodologies (a) Data Collection Gather the data (b) Data ExplorationPreprocessing Fix thediscrepancies visualize and transform the data to the required form (c) Model Building and Evaluation Train the model and evaluateperformance (d) Interpretation Use the model to make predictions analyze and interpret the outcomes (e) Monitoring Based on the learnedknowledge and domain expertise The top thread on the figure represents the tools and software that are available to execute various steps in theend-to-end data analysis framework
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
B
Table
1Overviewof
Pop
ular
MLAlgorithm
s
machine
learning
model
description
positives
andnegatives
exem
plaryapplications
with
inES
E
Logistic
Regression
(Logit
model)24
bullSupervised
MLclassificationalgorithm
bullEasy
toimplem
ent
bullEstim
atetheprobablepresence
ofatype
ofvegetatio
nusing
environm
entalvariables2
5
bullStatisticalmodelthatuses
alogisticfunctio
nto
modeltheprobabilityof
the
output
variable
bullHighlyinterpretable
bullPredictcontam
inantsources26
andwater
quality
2728indifferent
water
system
s
bullUsedto
perform
binary
classificationtaskswhere
givensamples
are
classifiedinto
twogroups
bullDoesnotrequireheavycomputatio
nalresources
bullAnomalydetectionin
water
treatm
entsystem
s28
bullCan
beextended
tomultip
leoutput
variables
bullOftenused
asthefirstmodelor
benchm
arkto
compare
tomorecomplex
models
RandomForest
(RF)
29bullAtree-based
ensemblelearning
methodthat
isused
forsupervised
classificationandregression
problems
bullRobustcanhandlelinearandnonlineardatao
utliersaswellas
noisydata
bullPredictrelativeabundancelevels
insewage7
bullThe
basicunitisadecision
tree
bullNot
proneto
overfittin
gwhenthemodelovertrains
onthetraining
dataandfails
togeneralizeon
the
testingor
newdata
set3
0bullSelect
importantvariablesin
predictin
ganom
alouseventsin
water
distributio
nsystem
s28
bullSamples
andfeatures
arerandom
lysampled
from
thefulldata
setand
individualdecision
treesaremadeon
thesampled
data
set
bullDepending
onthesize
ofthedatasetand
theensembleof
treesRFs
cangetcom
plex
andmay
require
high
computatio
nalresourcesandlong
times
forboth
training
andpredictio
nbullPredictnutrient
concentrations
usingwater
quality
variables3
2
bullThe
outcom
esgeneratedfrom
theensembleof
decision
treesarecombined
todevelopthefinalmodelpredictio
nbullPredictnanofiltrationreverse
osmosis-m
embranerejectionof
emerging
contam
inants31
SupportVector
Machines
(SVM)
bullSupervised
learning
algorithm
bullUsesasubset
oftraining
pointsin
thedecision
functio
n(supportvectors)
andismem
oryefficient
bullNonlineartim
eseries
forecasting
modelto
predictwater
quality
33
bullUsedforboth
classificationandregression
tasks
bullEffectiveinhigh
dimensionalspaces
andincaseswhere
thenumberof
samples
islessthan
thenumber
ofdimensionsHow
everchoosingkernelfunctio
nsandregularizatio
niscrucialto
avoidoverfittin
gbullPredicttheconcentrationof
nitrateandnitrite
inthemixed
liquorof
awastewater
treatm
ent
plant34
bullAimsto
findahyperplane
(inan
Ndimensionalfeaturespace)
that
has
maximum
distance
betweendata
pointsof
allclassesThe
hyperplane
classifiesthesamples
ordata
pointsunderconsideration
bullPh
enolicpollutant
concentration
predictio
nin
drinking
water35
bullMaximizingthemargindistance
brings
confidence
intheclassificationof
newdata
points
bullSupportvectorsaredata
pointsthat
arecloser
tothehyperplane
and
influence
thepositio
nandorientationof
thehyperplane
bullSetsof
mathematicalfunctio
ns(kernels)canbe
used
totransform
theinput
data
asrequired
Artificial
NeuralNet-
works
(ANNs)
bullANNsarecomputin
gsystem
sinspired
toanalyzeandprocessinform
ation
theway
thehuman
braindoes
bullDLhasem
ergedas
thestate-of-the-artMLmethod36minus39
bullPredicteffluent
concentrations
inawastewater
treatm
entplant4
2
bullAsimpleANN
consistsof
aninputlayerahidden
layermadeof
entities
(callednodes)
that
receiveweightedinputsfrom
theinputlayerandthen
transform
them
usinganonlinearfunctio
nthat
transm
itsthetransformed
valueto
theoutput
layerwhere
thefinaloutput
isused
togetpredictio
ns
bullDLmodelsrequirelargedatasetsfortraining
toachieverobustperformanceH
oweveritisdifficultto
know
thesamplesize
required
fortheparticular
task
beforehandS
omerulesof
thum
bhave
been
proposed
basedon
anumberof
assumptionsF
orinstancethe
samplesize
needsto
be50minus1000timesthe
numberof
classesor
10minus100times
thenumberof
featureso
r10minus50timesthenumberof
weightsin
the
network40
bullAnomalydetectionin
water
treatm
entsystem
2843
bullIn
conventio
nalANN
architecturetheinform
ationmoves
unidirectio
nally
(iefrom
inputto
output
layer)T
hisarchitectureisalso
know
nas
afeedforwardnetwork
bullFo
rsm
allerdata
setsalternativeMLmodelsmay
perform
better
bullAirquality
forecastingmodel44
bullWhenthenumberof
hidden
layersisincreasedto
achievehigher
levelsof
abstractionin
orderto
extractcomplex
inform
ationfrom
thedatathe
bullANN
architectures
existthat
canbe
used
toperform
supervisedu
nsupervised
orsemisupervised
learning
tasksSomeexam
ples
arerecurrentneuralnetworks
(RNNs)convolutio
naln
euraln
etworks
(CNNs)and
autoencoders
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
C
Table
1continued
machine
learning
model
description
positives
andnegatives
exem
plaryapplications
with
inES
E
resulting
architectureisknow
nas
adeep
neuralnetworkor
deep
learning
(DL)
bullANNstend
tooverfitC
onscientious
modelbuildingthorough
hyper-
parameter
tuning
andextensivevalidationisoftenrequired41
bullConsideredldquoblack
boxrdquo
modelsas
theinternalfunctio
ning
ofANNscanbe
hard
tointerpretThiscan
lead
toresultmisinterpretatio
nandtheacceptance
ofpoor
modelsAthorough
understandingof
varioushyperparam
etersisnecessaryto
understand
thenuancesof
themodel
Principal
Com
ponent
Analysis
(PCA)4546
bullAnunsupervised
technique
PCAcombinesfeatures
andcreatesanew
featureseto
fthe
samesize
where
each
newfeature(ietheldquocom
ponentsrdquo)
isacombinatio
nof
theoriginalfeatures
bullWidelyused
featureextractio
nmethod
Usedto
reduce
thedimensionality
ofthedata
setwhile
preserving
asmuchdata
variability
aspossible
bullEstim
atethevariance
explained
bygivenvariablesof
awater
quality
data47
bullNew
features
areform
edby
finding
theeigenvectors
andcorresponding
eigenvaluesthat
providean
estim
ationof
howmuchvariance
each
component
explains
bullOne
canselect
afewcomponentsthatexplainthemajority
ofthevariability
inthedata
anddrop
the
remaining
ones
bullIntercom
pare
airpollutio
npat-
terns4
4
bullCan
beused
toreduce
multicollinearityinthedataas
newcomponentsareindependento
feachother
bullEstablishthekeycompounds
todistinguishbetweensamples
usinghigh
resolutio
nmassspec-
trom
etry
(HRMS)
data4849
bullEstim
atetheimportance
ofdifferent
variablesandidentify
key-foulantsin
amem
brane
baseddrinking
water
treatm
ent
process5
0
K-means
Clustering
(K-means)51
bullUnsupervisedMLalgorithm
that
aimsto
clustersamples
usingfeature
values
bullSimpleto
implem
entandcaneasilyscaleto
largedata
sets
bullDelineate
different
airquality
indexthresholdclustersusingair
pollutio
ndata52
bullUserdefines
atarget
kvaluewhich
refersto
thenumberof
clustersone
wantsto
groupthesamples
into
bullDisadvantageisthat
oneneedsto
aprioridefinethekvaluewhich
may
bedifficult
bullAnalyze
water
quality
data55354
bullThe
kvaluerefersto
thenumberof
thecentroid
which
refersto
alocatio
nrepresentin
gthecenter
ofagivencluster
bullFails
forclusterswith
complex
nonsphericalgeom
etricshapes
bullGeneratevulnerability
mapsfor
anaquiferusingwater
quality
data55
bullK-m
eans
startswith
random
centroidsanditerativelyfinds
themost
appropriatecentroidsandassignsthesampleinto
respectiveclusters
such
thatthesum
ofthesquareddistance
betweenthesamples
andthecentroid
isminimized
bullClustercentroidscangetskewed
inthepresence
ofoutliershenceitisadvisedto
handleoutliersbefore
applying
k-means
bullRiskassessmentof
water
pollu-
tionsources5
6
Hierarchical
Clustering
bullUnsupervisedMLalgorithm
that
aimsto
clustersamples
usingfeature
values
bullNoaprioriinform
ationaboutthenumberof
clustersisrequired
bullFeatureclustering
basedon
Pearsoncorrelations
inHRMS
data57
bullThe
hierarchicalclustering
algorithm
inadditio
nto
breaking
uptheobjects
into
clustersalso
show
sthehierarchyor
rankingof
thedistance
andshow
showdissimilaroneclusterisfrom
othersH
ierarchicalclustering
isrepresentedas
dendograms
bullTimeintensive
bullEstim
atethesimilarity
between
theresistom
eprofilesof
different
samples
basedon
Brayminus
Curtis
dissimilarity
metric58
bullInputisameasureofdissimilaritybetweensamplesT
hechoice
ofsimilarity
(ordistance)measuresisthemaininfluencing
factor
inhierarchical
clustering
bullCould
bedifficultto
identifythecorrectnumberof
clusters
basedupon
thedendogram
bullWater
quality
assessmentwith
hierarchicalclusteranalysisbased
onMahalanobisdistance59
bullHierarchicalclusterscanbe
built
either
asagglom
erative(bottom-up
each
samplestartsin
itsow
nclusterandclustering
isdone
bymerging
each
sampleandmovingup
thehierarchy)
ordivisive
(top-dow
nallsam
ples
are
putin
oneclusterandsplitsareperformed
recursivelymovingdownthe
hierarchy)
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
D
ESE data sets can incorporate numerous data types withwide spatial and temporal variability Accordingly if possible athorough data visualizationexploration and preprocessingprocess should be performed to gain initial insights andstreamline the data for the subsequent model building steps(Figure 1b) To aid this depending on the data type wellestablished community guidelines could be leveraged Thisprocess typically includes four steps data cleaning integrationreduction and transformation (See the glossary for de-tails)1617
23 Model Building ML Models ML models fall intothree classes supervised unsupervised and semisupervised(Figure 1c) Supervised learning is a ML task where analgorithm is used to train a model using known data and thenthe learned model is used to predict the outcome of new orunforeseen instances18 In supervised learning a sample in thedata set has two components (1) a set of variables (features)that define the sample and (2) class labels (outcomes) that onewants to predict Two types of models are built usingsupervised learning algorithms classification or regressionmodels A classification model is suggested when the outputdata can be categorized into specific groups or classes (iediscrete output variables) If the output variable is a continuousvariable a regression model is recommendedUnsupervised learning is done in the absence of pre-existing
class labels It is a ML category where the goal is to find hiddenand unknown data patterns or to determine the datadistribution across the data set19 Unsupervised learningalgorithms are often used for data exploration and visual-ization The two major types of unsupervised learningapproaches are clustering and dimensionality reductionClustering algorithms are used to group similar samplestogether into clusters whereas dissimilar samples are relegatedto separate clusters Dimensionality reduction algorithms areused to reduce the dimension of the feature set Too manyuninformative features can hinder predictive modeling this isoften referred to as the ldquocurse of dimensionalityrdquo Somecommon supervised and unsupervised learning algorithms arelisted in Table 1Semisupervised learning is a hybrid of supervised and
unsupervised learning where a set of labeled data is used inconjunction with a set of unlabeled training data In manycases getting true labels for a data set can be difficultSemisupervised learning tackles this problem by making use ofunlabeled data in the learning process and then leveraging thislearned information along with the labeled data set forsubsequent prediction20
Model Optimization and Evaluation Supervised learningis usually a three-step process In the first step the data aresplit into training and testing data sets with the split rangingbetween 60 and 80 for training and 40minus20 for testing Thetraining data set is then iteratively separated into training andvalidation sets The model is trained on the training set whilethe validation set is used to tune the model hyperparametersthat consist of the model architecture and the parameters thataffect the speed and quality of the training process Followingthis cross-validation the learned model is tested against thetest data set to evaluate model robustness and accuracy Toensure unbiased model evaluation the test data set cannot beused for training Commonly used model evaluation metricsinclude accuracy F1-score precision recall area under thereceiver operating characteristic (ROC) curve mean absoluteerror (MAE) and mean squared error (MSE)21
Evaluating unsupervised models can be challenging becausedata do not have prior labels Evaluation typically involvesinternal and external validation22 Internal validation is done byestimating inter- and intra-cluster distances to evaluate clusterquality A cluster refers to a collection of data points thataccumulate together based on certain similarities Modelsminimizing intracluster distances while maximizing interclusterdistances are preferred Such a result suggests that thesemodels are performing well at clustering similar data pointstogether Some commonly used metrics for this purpose arethe Adjusted Rand Index and the Silhouette CoefficientHowever a good internal validation score does not guaranteethe effectiveness of the obtained clusters for real applicationsHence external validation is necessary to assess whether thedata points are assigned to the correct clusters This stepusually requires human evaluation or comparison againstbenchmarks For dimensionality reduction task reconstructionerror or loss is estimated to evaluate model performanceThese methods seek to minimize the reconstruction error orlossminusdefined as the distance between the original data pointand its projection onto a lower-dimensional subspace (itsldquoestimaterdquo)23
The ESE community has begun to leverage various MLalgorithms and techniques to analyze environmental data setsWhile it is difficult to determine a priori which ML algorithm isbest suited for a particular data set an improved understandingof the positives and negatives and the inherent assumptions ofthe specific algorithms aid in the choice of the righttechnique(s) Table 1 summarizes algorithms popular withinthe ESE community The table describes the algorithms theiradvantages and disadvantages the assumptions for specific datatypes and gives examples where the algorithms have been usedto address ESE related problems
24 Data Interpretation The final step in the modelbuilding cycle is to interpret the learnt model and developreasonable explanations for the obtained analysis (Figure 1d)Data interpretation could mean extracting important variablescontributing to the model prediction These variables couldhelp in verifying the hypotheses of the study or understandingwhat factors are critical in driving the observations in the studyunder consideration At this stage the literature and expertopinions are queried to make connections between the modeloutputs and knowledge about the relevant chemical physicaland biological phenomena to make data-driven conclusionsand decisions
3 CURRENT APPLICATIONS OF MACHINE LEARNINGIN ESE
In this section we discuss current applications of MLalgorithms within the ESE domain by presenting three casestudies representing different application areas
31 Metagenomic Data Analysis High throughputshotgun (untargeted) metagenomic DNA sequencing offers arobust and effective way to access the microbial world and isnow often used in ESE Differentiating microbial communitiesin different environments6061 studying the dissemination ofantibiotic resistance in environmental systems6263 definingbacterial communities in contaminated environments toexplore bioremediation64 and characterizing microbial com-munities in wastewater treatment processes6566 are allexamples of environments where metagenomic character-ization is increasingly being applied
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
E
In a typical metagenomic study the first step is often tosearch sequenced DNA reads against a reference database toderive taxonomic or functional gene annotation The obtainedtaxonomic or functional gene profiles are then normalized toprovide relative abundance information corresponding todifferent species or genes in the sample that can be visualizedand analyzed to answer specific questions about the sampleContinuous optimization and declining costs of sequencingplatforms have made metagenomics related research moreaccessible than ever before This trend has led to thegeneration of large volumes of publicly available sequencingdata but rendering these data into meaningful interpretationsremains challenging67 Fortunately the advent of ML methodshas immensely accelerated advancement at every cross sectionof a typical metagenomic study68 Here we highlight someapplications of ML in antibiotic resistance-related studiesOne often faced challenge with metagenomic data is that of
high dimensionality and low sample size (HDLSS ie morevariables than independent samples) For example the numberof genes annotated within a given environmental sample caneasily exceed hundreds of thousands Hence many unsuper-vised techniques such as principal component analysis (PCA)and non-metric multi-dimensional scaling (NMDS) are usedas preprocessing steps to reduce dimensionality by removinguninformative features A common application of thesetechniques is to examine the similaritydissimilarity of differentenvironmental samples by clustering based on species or genecomposition69 Network-based ML (data is represented in theform of a graph where nodes are the data points to be clusteredand edges represent the relationship or similarity between thedata points) is another unsupervised approach to studyproteinminusprotein or geneminusgene interactions to understanddifferent functional pathways70
Given the sample labels or response variables and the genecomposition support vector machines (SVMs) ANNs andensemble methods (such as random forest (RF) or extremelyrandomized trees) have been extensively used to performsupervised learning and build predictive models For exampleidentifying interesting patterns and important genes in a dataset58 predicting relative antibiotic resistance abundancelevels7 understanding the role of socioeconomic status inshaping the resistome or microbiome71 or the prediction ofantibiotic resistance phenotype72 are some of the uniqueproblems that have been examined using the above-mentionedmethods Hidden Markov models (HMMs) which can beused in both supervised or unsupervised fashion are one of themore readily used algorithms to detect antibiotic resistancegene variants or potential functional homologues73 and havebeen applied for the discovery of novel ARGs frommetagenomic data74
Word embedding which has gained a lot of popularity overthe recent years is a feature learning technique used in naturallanguage processing (NLP) Word embedding is a term usedfor the representation of words or sentences in the form ofnumeric vectors that can be used for downstream ML tasks75
Raw DNAprotein sequences sliced into k-mers are analogousto the structure of a sentence and can be analyzed in a similarfashion as natural languages Thus there is a rapid thrusttoward analyzing raw sequences using NLP based techni-ques7677 MetaMLP78 based on a similar idea uses a wordembedding based classification model to predict ARGphenotype and has been found to perform 50times faster thanDIAMOND (one of the fastest sequence alignment methods)
with similar accuracy Word embedding based models holdimmense promise in analyzing metagenomic data as they havethe ability to learn patterns directly from within raw sequencesSimilar to the antibiotic resistance example other
applications can also be envisioned For instance analyzingmetagenomics data for taxonomic classification7980 or mobilegenetic element classification81 where ML algorithms wereable to outperform the traditional methods In essence MLalgorithms have enhanced our ability to robustly analyzecomplex metagenomic data
32 Nontarget Analysis of Environmental Samples Arapidly developing approach in environmental analysis andtoxicology involves the use of high resolution massspectrometry (HRMS) based nontarget analysis (NTA)where data on accurate masses of molecular and fragmentions are collected without a priori information on thechemicals being analyzed NTA can potentially aid in thescreening and analysis of the vast and diverse universe oforganic pollutants a grand challenge faced by the ESEcommunity82 Because the chemicals being detected are notpredetermined NTA provides an opportunity to comprehen-sively examine the occurrence fate and transport of chemicalcontaminants in different environmental niches with minimalbias82 Similarly NTA can be applied in environmentalmetabolomics to facilitate understanding of the effects ofchemical perturbations on exposed organisms (eg plantsanimals and humans) without focusing on a particularbiochemical pathway8384 A recent review highlighted studiesthat employed the different types of NTA techniques used inmetabolomics to discover metabolite changes in plants inducedby exposure to xenobiotics (eg pharmaceuticals personal careproducts pesticides flame retardants and engineered nano-materials) to examine the effect of altered levels of nutrients inthe environment on plant systems85 Here we present a briefsummary of how the combination of NTA with ML canadvance monitoring of water wastewater quality soil andexposed organisms (eg humans wildlife)While there are various methods that can be used for NTA
in ESE HRMS is the most popular because of its capacity forsensitive detection of low levels of contaminants andmetabolites in complex environmental samples However thepower of NTA using HRMS has been limited by problemsassociated with sample preparation and data analysis Inanalyzing environmental samples using MS sample concen-tration and cleanup are critical because the signal intensity ofthe MS features depend not only on the concentrations of thechemicals but also on the amounts and the nature of thematrix present in the sample extracts The reproducibility ofthe ionization of compounds in MS can be compromisedsignificantly by matrix effects Therefore it is important to havean appropriate number of replicate samples and properlyselected blank samples when conducting NTA Solid-phaseextraction for sample cleanup is commonly used but thisapproach can introduce bias because highly polar contaminantsmay be lost during sample preparation Unlike in targetanalysis where variations due to matrix effects and samplelosses can be corrected using stable isotope-labeled referencecompounds as surrogates this approach cannot be used inNTA because the purpose of NTA is to identify unknowncontaminants that were not included in the target list Tonormalize for instrumental variation and matrix effects internalstandards with varying polarity can be added to the sampleextract prior to analysis Recently several compounds of
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
F
diverse structure log Kow chromatographic behavior andionization efficiency have been proposed for inclusion in aquality control mixture usable for NTA86 However theproposed mixture has limitations such as for the detection ofhydrophilic compounds and molecular formula generation forcompounds containing fluorine Finally due to the substantialamount of MS data acquired indiscriminately under full-scanand the subsequent MSMS fragmentation of molecular andfragment ions data processing to identify relevant features isdaunting Therefore NTA workflows with built-in filters andcriteria are being developed to facilitate prioritization of MSfeatures that are useful for chemical structure annotation Inthis regard advanced data processing tools high throughputstatistical packages and user-friendly visualization programsare needed to fully interrogate the rich data sets acquired byNTANTA based on HRMS holds great promise for the
comprehensive monitoring of the occurrence and fate ofcontaminants in water and wastewater87 It also allowsresearchers to retrospectively analyze stored HRMS data toscreen for suspected contaminants (ie suspect screening) thatmay have been missed during target analysis88 Using advancedcomputational strategies such as hierarchical cluster analysiscommon patterns in the occurrence of contaminants in theenvironment and their emission pathways can be predictedbased on time series analysis of the aggregated NTA data89
NTA combined with cluster analysis was successfully appliedto reveal previously unmonitored chemical contaminants insoil and sediment samples90 The application of NTA andadvanced postacquisition data treatment will continue toenhance our ability to discover emerging contaminants in theenvironment including those that bioaccumulate and poserisks to humans and wildlife For instance new polyhalo-genated compounds were detected for the first time in blubbersamples from marine mammal sentinel species using both LC-HRMS and GC-HRMS for analysis and an open-source datamining software in the R programming environment thatdetected halogenated signatures in full scan HRMS91 FinallyNTA combined with personal passive samplers and propersample preparation techniques can be used to unlock thecomposition of chemical mixtures that humans are exposed toon a daily basis which can be used in investigating the humanexposome92
Akin to metagenomic data NTA data pose a similar HDLSSchallenge as each mass spectrum constitutes a large number ofpeaks representing potential compounds present in a givensample Hence data preprocessing is crucial and inevitableVarious data preprocessing steps are performed such asdetecting peaks subtracting peaks that were found in blankor control samples componentization (ie grouping of signalsthat probably belong to one unique molecular structure) andremoving noise using replicate measurements93 Followingpreprocessing various supervised and unsupervised MLalgorithms can be applied to engineer select and extractrelevant features However before any further analysis datanormalization and data scaling are two crucial steps thatrequire considerationThe most common algorithms used for NTA are linear
projection methods such as PCA and supervised partial least-squares discriminant analysis (PLS-DA)949548 PCA aids insample comparison by removing variance from the sample setSupervised PLS-DA along with relevant metadata can be usedto extract features pertaining to the specific questions that are
being asked However a major challenge is that featuredetection based on peak intensity can be misleading because oferroneous peak assignments Also relevant information can belost when differentiating the actual contaminant signals andbackground noise when using intensity information for featureselection Hence several alternative approaches that aim atusing raw data signals (retention time times mass-to-charge ratio)that bypass the peak detection step have been proposed andimplemented for improved extraction of information andunderlying patterns in the data set96minus98 This includestechniques such as transforming the raw data signal to distancematrices for dissimilarity analysis96minus98 and using clustering toextract features99
A number of other data analysis algorithms exist but havenot yet become common in NTA environmental analysisClustering techniques such as k-means and hierarchicalclustering can be used to identify similar samples Algorithmslike RF SVMs and ANNs can be applied to classify samplesbased on different categories and would be advantageous inillustrating nonlinear relations in the data These algorithmshave shown excellent promise in analyzing HRMS data inother fields100minus103 and this promise can certainly be extendedto environmental sample analysis
33 Anomaly Detection in Engineered Water Sys-tems (EWS) EWS is an umbrella term for systems of watercollection treatment distribution storage and their operationRecently there has been a major thrust toward digitalization ofthe water sector104 In particular the evolution of cyberinfras-tructure and of online process control instrumentation has ledto the development of advanced process control solutions suchas supervisory control and data acquisition (SCADA)systems105106 Such advances have enabled water utilities tocontinuously monitor water quality identify problems andeffectively oversee maintenance issues both remotely and morelocally These systems entail collection of a large volume of rawdata that could be in conjunction with appropriate dataanalysis techniques transformed into valuable information thatcan be leveraged to make proactive decisions to optimizeoverall performance107 In particular there is surging interest inusing ML techniques to identify unusual patterns in raw EWSdata as a means of discovering unexpected activitiesminusthis isbroadly termed anomaly detection108
A typical anomaly detection task in EWS aims todifferentiate between natural expected variations in waterquality and unusual or suspicious variations caused bycontamination or failure somewhere in the system Thechallenge is to characterize these normal water qualityvariations as it requires analysis of long-term data thatencompasses inherent background variability109 This charac-terized normal response helps to flag anomalous events in thedata Various ML algorithms have been applied to address thisproblem (Table 1)Within EWS one primary form of the generated data is a
time series where each time point can be considered a discretesample The time series consists of measurements of indicatorscollected over time from one or several sensors which formthe feature set Hence each sample can be represented as amultidimensional vector where each dimension represents afeature Historic data is then used to train the model Givenappropriate analysis of the supporting data set one can frameand solve problems to address a multitude of aims such asdetecting leaks sensor failures abrupt changes in water qualityor contamination events110minus112
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
G
There are a number of different anomaly detectiontechniques available113 In the unsupervised approachunlabeled samples are clustered using an algorithm such ask-means density-based clustering algorithms or expectation-maximization (EM) clustering The concept is that the normaldata points cluster to form high density clusters whereasanomalous data points cluster separately or distant from thenormal data pointsclusters and are located in low-densityregions This way one can carry out additional investigationusing new samples and classify them relative to the normaldata In the supervised approach labels (normal or anomalous)are assigned to each sample based on past information andexpert knowledge of anomalous behavior In this way theproblem is converted into a binary classification problemThough it can easily be extended to a multiclass classificationto categorize different types of anomalies SVM ANNBayesian Network Logistic Regression models and theirvariants are well explored algorithms in this space113 A specialcase of SVM One-Class SVM (OCSVM) is a widely usedmethod for anomaly detection In OCSVM the entire trainingdata set is considered as one-class (eg normal class) and thenew data points are classified as similar (normal) or different(anomalous) to the training data Because considering all datapoints from one class is equivalent to having no label OCSVMis considered as unsupervised learning methodOwing to the success of ML in detecting anomalies in other
domains such as network security researchers have startedexploring its potential in water quality anomaly detec-tion43114115 Based on the studies published so far DLalgorithms have shown promise in detecting anomalies andhave outperformed conventional techniques on a number ofoccasions However it should be noted that DL methods couldbe slow to train depending on the depth of the network andthe amount of data available116
Many studies have adopted a batch learning approach wherethe model is trained on historical data and then the new data iscategorized using this trained model With a continuousincoming time-series data stream and the possibility of novelanomalies retraining the models with every new data set canquickly become impractical and difficult to reliably executeContinual or active learning frameworks that continuouslylearn as the new data stream comes in and that can identifyanomalous behavior in real time are required to circumventthese issues117118 Variants of Latent Dirichlet AllocationMarkov Models and ANN based architectures are algorithmsthat have been applied in other domains to achieve continualonline learning119minus121 However thes approaches remainunderexplored for anomaly detection in water systems asthese frameworks are not trivial and have their ownimplementation challenges118119
Ultimately anomaly detection will be most valuable if it canlearn continually as the data stream comes in and yield real-time reporting that informs immediate corrective action It iscrucial that the models being utilized are fast and accurateHence going forward there is a need to shift the focus towardhybrid approaches (combinations of different algorithms) tobuild powerful models that are able to detect multiple types ofanomalies in real-time Such models will enhance EWSanomaly detection and advance public health9118119
4 PATH FORWARDThere is increasing interest in and a growing body of researchdocumenting how data analytics is being used to address ESE
problems As illustrated in this Feature the power of dataanalytics has been widely recognized as has been the vast needfor its application Anomaly detection has particularlypromising applications for water professionals and practi-tioners Notably broader application of metagenomics andNTA can revolutionize environmental monitoring effortsHowever the application of data analytics in ESE practiceremains in its infancy and concerted efforts are required tomake data analytics an integral part of ESE research andeducation Such a goal would most effectively be achieved ifthere were an agreed-upon plan of action Coordinated effortson several fronts are needed to help the ESE community reapthe potential benefits of data analyticsFirst there is a need to encourage collaborations between
data scientists and ESE practitioners that involve diverselyengaged and integrated research teams from multipledisciplines Such collaborations will facilitate cross-disciplinarycommunication and simultaneous skills building For exampledata scientists come from a culture highly supportive of datasharing (eg GitHub Bitbucket public databases) The ESEcommunity should embrace this culture and incorporate datasharing both locally regionally and globally Recent efforts toaddress the COVID-19 pandemic122123 the global dissem-ination of antimicrobial resistance7 and the development ofglobally vetted data analysis approaches within the NTAcommunity are all steps in this direction Such collaborationshave the potential to yield profound data-driven insights andconclusions that would be impossible to achieve within normalacademic and professional silos Working on their own datascientists may create models that answer specific questions butlack the interpretive training required to contextualize theirresults into real world applications Simultaneously ESEpractitioners with data may not possess the correspondingtools or insights required for proper analysis Strategiccollaborations will lead to improved data analytics datainterpretation and improved decision makingSecond there is a lack of user-friendly data analytics tools
and platforms devised for ESE problems Although under-standing the theory behind ML algorithms is not a difficult taskfor many trained scientists a major hurdle that ESE researchersface is in the coding and implementation of these models Withthe continual advancement of data science programming isbecoming an increasingly necessary skill without which modelsmay be improperly developed or may lack efficiency User-friendly tools may circumvent the problems with codinglearning curves and may be essential for widespreadpractitioner adoption While many tools (Microsoft powerBI Weka Tableau Azure) are available for general dataanalysis application-specific toolsplatforms would greatly aidin performing end-to-end analysesThird the field needs to incorporate data analytics-oriented
curricula This could include the addition of data analyticsfocused modules within existing coursework the introductionof new data science and statistics courses within ESE degreeprograms and the development of relevant capstone projectsfor ESE applications Such experiential learning would helpstudents understand the intricacies of different algorithms andprovide hands-on experience in applying various data analyticstechniques in the context of specific ESE problems Furtherencouraging students to take part in data science internshipswould be an excellent means of developing such expertiseFourth introducing data analytics workshops and making
the relevant resources available is valuable for students
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
H
researchers and professionals A plethora of online resourcessuch as Coursera LinkedIn Learning and Edx are available forlow or no cost offering high-quality content in data scienceExpanding access to and creation of open source learningplatforms could prove instrumental in instigating the data-driven problem-solving approachFinally we know that data analytics is not new but
continues to evolve at a rapid pace These advancementsfurnish us with new and powerful ways to analyze the data thatcan help holistically tackle the challenges faced by ESEcommunity and ultimately inform decision-making and policyformulation at scales never previously imagined Hence it iscritical to be abreast of new developments in data analytics andto continue making efforts to harvest their full potential
AUTHOR INFORMATIONCorresponding AuthorPeter Vikesland minus Via Department of Civil andEnvironmental Engineering Virginia Tech BlacksburgVirginia 24061 United States orcidorg0000-0003-2654-5132 Email pvikesvtedu
AuthorsSuraj Gupta minus The Interdisciplinary PhD Program in GeneticsBioinformatics and Computational Biology Virginia TechBlacksburg Virginia 24061 United States
Diana Aga minus Department of Chemistry University at BuffaloThe State University of New York Buffalo New York 14226United States orcidorg0000-0001-6512-7713
Amy Pruden minus Via Department of Civil and EnvironmentalEngineering Virginia Tech Blacksburg Virginia 24061United States orcidorg0000-0002-3191-6244
Liqing Zhang minus Department of Computer Science VirginiaTech Blacksburg Virginia 24061 United States
Complete contact information is available athttpspubsacsorg101021acsest1c01026
NotesThe authors declare no competing financial interestBiography
Peter J Vikesland is the Nick Prillaman Professor of Civil andEnvironmental Engineering at Virginia Tech He received his BAfrom Grinnell College in Chemistry in 1993 and MS and PhD inCivil and Environmental Engineering from University of Iowa in 1995and 1998 Vikeslandrsquos research focuses on the fate of nanomaterials inthe environment nanotechnology enabled sensor platforms forenvironmental quality assessment the environmental disseminationof antibiotic resistance and the application of data analytics in
environmental science and engineering He is a past President of theAssociation of Environmental Engineering and Science Professors(AEESP) is a US National Science Foundation CAREER awardeeand is the recipient of the 2018 Walter Weber Research InnovationAward from AEESP
ACKNOWLEDGMENTS
This study was supported by National Science Foundation(NSF) awards OISE (1545756) ECCSS NNCI (1542100)and CSSI (2004751) Additional support was provided by theCenter for Science and Engineering of the Exposome at theVirginia Tech Institute for Critical Technology and AppliedScience (ICTAS) and the Virginia Tech Graduate Schoolsupported Genetics Bioinformatics and ComputationalBiology (GBCB) and Sustainable Nanotechnology (VTSuN)programs
GLOSSARY
Accuracy Correct predictionstotal predictions
Autoencoders A deep learning techni-que that aims to build amodel where the outputtargets are set equal tothe input variables Itseeks to learn an ap-proximate representa-tion of the data andreconstruct it It is anunsupervised learningapproach used for di-mensionality reduction
Bayesian Network Bayesian networks are akind of probabilisticgraphical model thatutilizes Bayesian infer-ence for probability es-timations Bayesian net-works are used tomodel conditional de-pendence and hencecausation using a direc-ted graph
Data cleaning Identification of outliersor filling in of missingvalues Outlier identifi-cation can be done byusing Z-score quartilevalues or hypothesistesting Missing datacan be handled in mul-tiple ways such as fillingthe value by computingthe summary statisticsof the given variableusing predicted valuescomputed by an MLalgorithm manual cura-tion or by ignoring themissing record
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
I
Data integration The combination ofmultiple data sourcesinto a single coherentform
Data reduction Aggregation or elimina-tion of redundant in-formation to reduce thedata set size
Data transformation Conversion of raw databy normalizing to acommon scale to en-sure consistency andcomparability
Density Based Clustering Work by recognizingldquodenserdquo groups ofpoints permitting it tolearn clusters of discre-tionary shape and dis-tinguish anomalies inthe information
Dimensionality Reduction The process of reducingthe number of randomvariables or attributesunder consideration
EM Clustering Similar to k-means ex-cept it assigns the sam-ples into clusters basedon the probabilities es-timated using the EMalgorithm The objec-tive is to maximize theoverall probability ofthe data for the given(final) clusters
Ensemble Methods Ensemble methods arealgorithms that com-bine predictions fromseveral base models toobtain one optimalmodel
F1-Score F1-Score is a harmonicmean of recall andprecision
HMM HMM is a statisticalMarkov mode l i nwhich the system beingmodeled is assumed tobe a Markov process
k-mer A substring with alength k in a biologicalsequence of nucleoti-des
Naive Bayes Naive Bayes methodsare a family of simpleprobabilistic classifiersbased on Bayes ruleThe method is calledldquoNaiverdquo for its assump-tion of conditional in-dependence among thefeatures
Natural Language Processing (NLP) A subfield of computerscience artificial intelli-gence and linguisticsthat deals with theinteraction of com-puters with human lan-guages
NMDS NMDS is an ordinationtechnique based on dis-tance or dissimilaritymatrix NMDS repre-sents pairwise dissimi-larity between samplesin a low-dimensionalspace
PLS-DA PLS-DA is a linearclassification modelthat is able to predictthe class of new sam-ples
R Programming languagefor Statistical Comput-ing
ROC-Curve A receiver operatingcharacteristic (ROC)curve is a plot of truepositive rate (TPR-yaxis) against the falsepositive rate (FPR-xaxis) It measures theperformance of a classi-fier
SCADA Supervisory control andd a t a a c q u i s i t i o n(SCADA) is a systemdesigned to gather andanalyze real time dataIt is used in waterwastewater treatmentplants to monitor andmanage processes
REFERENCES(1) Tukey J W The future of data analysis Ann Math Stat 196233 (1) 1minus67(2) Hayashi C What is data science Fundamental concepts and aheuristic example In Data science classification and related methodsSpringer 1998 pp 40minus51(3) Bishop C M Pattern recognition and machine learning Springer2006(4) Qiu J Wu Q Ding G Xu Y Feng S A survey of machinelearning for big data processing EURASIP Journal on Advances inSignal Processing 2016 2016 (1) 67(5) Agarwal R Dhar V Big data data science and analytics Theopportunity and challenge for IS research In INFORMS 201425443(6) Wilcox C Woon W L Aung Z Applications of machinelearning in environmental engineering Citeseer 2013(7) Hendriksen R S Munk P Njage P Van Bunnik BMcNally L Lukjancenko O Roumlder T Nieuwenhuijse DPedersen S K Kjeldgaard J Global monitoring of antimicrobialresistance based on metagenomics analyses of urban sewage NatCommun 2019 10 (1) 1124(8) Sobus J R Wambaugh J F Isaacs K K Williams A JMcEachran A D Richard A M Grulke C M Ulrich E MRager J E Strynar M J Integrating tools for non-targeted analysis
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
J
research and chemical safety evaluations at the US EPA J ExposureSci Environ Epidemiol 2018 28 (5) 411minus426(9) Dogo E M Nwulu N I Twala B Aigbavboa C A survey ofmachine learning methods applied to anomaly detection on drinking-water quality data Urban Water J 2019 16 (3) 235minus248(10) Wilde F D Radtke D B Gibs J Iwatsubo R T Nationalfield manual for the collection of water-quality data US Geological SurveyTechniques of Water-Resources Investigations Book 9 1998(11) National Research Council Confronting the nationrsquos waterproblems The role of research National Academies Press 2004(12) Bharti R Grimm D G Current challenges and best-practiceprotocols for microbiome analysis Briefings in bioinformatics202122178(13) Veenaas C Developing tools for non-target analysis and digitalarchiving of organic urban water pollutants 2018(14) Keim D A Visual exploration of large data sets CommunACM 2001 44 (8) 38minus44(15) Chen J Chen Y Du X Li C Lu J Zhao S Zhou X Bigdata challenge a data management perspective Frontiers of ComputerScience 2013 7 (2) 157minus164(16) Chakrabarti S Cox E Frank E Guting R H Han J JiangX Kamber M Lightstone S S Nadeau T P Neapolitan R EData mining know it all Morgan Kaufmann 2008(17) Gibert K Sanchez-Marre M Izquierdo J A survey on pre-processing techniques Relevant issues in the context of environ-mental data mining AI Communications 2016 29 (6) 627minus663(18) Russell S Norvig P Artificial intelligence a modern approach2002(19) Hinton G E Sejnowski T J Poggio T A UnsupervisedLearning Foundations of Neural Computation MIT press 1999(20) Van Engelen J E Hoos H H A survey on semi-supervisedlearning Machine Learning 2020 109 (2) 373minus440(21) Brink H Richards J W Fetherolf M Cronin B Real-worldmachine learning Manning Shelter Island NY 2017(22) Kassambara A Practical guide to cluster analysis in RUnsupervised machine learning Sthda Vol 1(23) Ghodsi A Dimensionality reduction a short tutorialDepartment of Statistics and Actuarial Science Univ of WaterlooOntario Canada 2006 37 (38) 2006(24) Wright R E Logistic regression 1995(25) Alvarez-Arbesu R Feliciacutesimo A M GIS and logisticregression as tools for environmental management a coastal cliffvegetation model in Northern Spain WIT Transactions on Informationand Communication Technologies 2002 26(26) Liu L Sankarasubramanian A Ranjithan S R Logisticregression analysis to estimate contaminant sources in waterdistribution systems J Hydroinf 2011 13 (3) 545minus557(27) Thoe W Gold M Griesbach A Grimmer M Taggart ML Boehm A B Predicting water quality at Santa Monica Beachevaluation of five different models for public notification of unsafeswimming conditions Water Res 2014 67 105minus117(28) Muharemi F Logofa tu D Andersson C Leon FApproaches to building a detection model for water quality a casestudy In Modern Approaches for Intelligent Information and DatabaseSystems Springer 2018 pp 173minus183(29) Liaw A Wiener M Classification and regression byrandomForest R news 2002 2 (3) 18minus22(30) Gromski P S Muhamadali H Ellis D I Xu Y Correa ETurner M L Goodacre R A tutorial review Metabolomics andpartial least squares-discriminant analysis-a marriage of convenienceor a shotgun wedding Anal Chim Acta 2015 879 10minus23(31) Lee S Kim J Prediction of Nanofiltration and Reverse-Osmosis-Membrane Rejection of Organic Compounds UsingRandom Forest Model J Environ Eng 2020 146 (11) 04020127(32) Castrillo M Garciacutea A L Estimation of high frequencynutrient concentrations from water quality surrogates using machinelearning methods Water Res 2020 172 115490
(33) Tan G Yan J Gao C Yang S Prediction of water qualitytime series data based on least squares support vector machineProcedia Eng 2012 31 1194minus1199(34) Yang Y H Guergachi A Khan G Support vector machinesfor environmental informatics application to modelling the nitrogenremoval processes in wastewater treatment systems Journal ofEnvironmental Informatics 2006 7 (1) 14minus25(35) Feng J Pan L Cui B Sun Y Zhang A Gong M NewApproach for Concentration Prediction of Aqueous PhenolicContaminants by Using Wavelet Analysis and Support VectorMachine Environmental Engineering Science 2020 37 (5) 382minus392(36) Silver D Schrittwieser J Simonyan K Antonoglou IHuang A Guez A Hubert T Baker L Lai M Bolton AMastering the game of go without human knowledge Nature 2017550 (7676) 354minus359(37) Mikolov T Deoras A Povey D Burget L Cernocky JStrategies for Training Large Scale Neural Network Language ModelsIEEE 2011 pp 196minus201(38) Deng J Dong W Socher R Li L-J Li K Fei-Fei L InImagenet A Large-Scale Hierarchical Image Database Ieee 2009 pp248minus255(39) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (Aug) 2493minus2537(40) Alwosheel A van Cranenburgh S Chorus C G Is yourdataset big enough Sample size requirements when using artificialneural networks for discrete choice analysis Journal of choice modelling2018 28 167minus182(41) Yang J Xu J Zhang X Wu C Lin T Ying Y Deeplearning for vibrational spectral analysis Recent progress and apractical guide Anal Chim Acta 2019 1081 6minus17(42) Guo H Jeong K Lim J Jo J Kim Y M Park J-p Kim JH Cho K H Prediction of effluent concentration in a wastewatertreatment plant using machine learning models J Environ Sci 201532 90minus101(43) Inoue J Yamagata Y Chen Y Poskitt C M Sun JAnomaly Detection for a Water Treatment System Using UnsupervisedMachine Learning 2017 2017 IEEE 2017 pp 1058minus1065(44) Voukantsis D Karatzas K Kukkonen J Raumlsaumlnen TKarppinen A Kolehmainen M Intercomparison of air quality datausing principal component analysis and forecasting of PM10 andPM2 5 concentrations using artificial neural networks in Thessalonikiand Helsinki Sci Total Environ 2011 409 (7) 1266minus1276(45) Wold S Esbensen K Geladi P Principal componentanalysis Chemom Intell Lab Syst 1987 2 (1minus3) 37minus52(46) Jolliffe I T Cadima J Principal component analysis a reviewand recent developments Philos Trans R Soc A 2016 374 (2065)20150202(47) Dalal S G Shirodkar P V Jagtap T G Naik B G Rao GS Evaluation of significant sources influencing the variation of waterquality of Kandla creek Gulf of Katchchh using PCA Environ MonitAssess 2010 163 (1minus4) 49minus56(48) Scholleacutee J E Schymanski E L Avak S E Loos MHollender J Prioritizing unknown transformation products frombiologically-treated wastewater using high-resolution mass spectrom-etry multivariate statistics and metabolic logic Anal Chem 2015 87(24) 12121minus12129(49) Masiaacute A Campo J Blasco C Picoacute Y Ultra-highperformance liquid chromatography-quadrupole time-of-flight massspectrometry to identify contaminants in water an insight onenvironmental forensics Journal of Chromatography A 2014 134586minus97(50) Peiris R H Halleacute C Budman H Moresoli C Peldszus SHuck P M Legge R L Identifying fouling events in a membrane-based drinking water treatment process using principal componentanalysis of fluorescence excitation-emission matrices Water Res 201044 (1) 185minus194(51) MacQueen J In Some methods for classification and analysis ofmultivariate observations 1967 Oakland CA 1967 pp 281minus297
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
K
(52) Kingsy G R Manimegalai R Geetha D M S Rajathi SUsha K Raabiathul B N In Air pollution analysis using enhanced K-Means clustering algorithm for real time sensor data IEEE 2016 pp1945minus1949(53) Zou H Zou Z Wang X An enhanced K-means algorithmfor water quality analysis of the Haihe River in China Int J EnvironRes Public Health 2015 12 (11) 14400minus14413(54) Areerachakul S Sanguansintukul S Clustering analysis of waterquality for canals in Bangkok Thailand Springer 2010 pp 215minus227(55) Javadi S Hashemy S M Mohammadi K Howard K W FNeshat A Classification of aquifer vulnerability using K-means clusteranalysis J Hydrol 2017 549 27minus37(56) Li C Sun L Jia J Cai Y Wang X Risk assessment ofwater pollution sources based on an integrated k-means clustering andset pair analysis method in the region of Shiyan China Sci TotalEnviron 2016 557 307minus316(57) Brunner A M Bertelkamp C Dingemans M M LKolkman A Wols B Harmsen D Siegers W Martijn B JOorthuizen W A Ter Laak T L Integration of target analyses non-target screening and effect-based monitoring to assess OMP relatedwater quality changes in drinking water treatment Sci Total Environ2020 705 135779(58) Gupta S Arango-Argoty G Zhang L Pruden A VikeslandP Identification of discriminatory antibiotic resistance genes amongenvironmental resistomes using extremely randomized tree algorithmMicrobiome 2019 7 (1) 123(59) Du X Shao F Wu S Zhang H Xu S Water qualityassessment with hierarchical cluster analysis based on Mahalanobisdistance Environ Monit Assess 2017 189 (7) 1minus12(60) Bibby K Crank K Greaves J Li X Wu Z Hamza I AStachler E Metagenomics and the development of viral water qualitytools npj Clean Water 2019 2 (1) 1minus13(61) Chu B T T Petrovich M L Chaudhary A Wright DMurphy B Wells G Poretsky R Metagenomics reveals the impactof wastewater treatment plants on the dispersal of microorganismsand genes in aquatic sediments Appl Environ Microbiol 2018 84 (5)No e02168-17(62) Petrovich M L Zilberman A Kaplan A Eliraz G R WangY Langenfeld K Duhaime M Wigginton K Poretsky R AvisarD Microbial and Viral Communities and Their Antibiotic ResistanceGenes Throughout a Hospital Wastewater Treatment System FrontMicrobiol 2020 11 153(63) Dai D Rhoads W J Edwards M A Pruden A Shotgunmetagenomics reveals taxonomic and functional shifts in hot watermicrobiome due to temperature setting and stagnation FrontMicrobiol 2018 9 2695(64) Malla M A Dubey A Yadav S Kumar A Hashem AAbd_Allah E F Understanding and designing the strategies for themicrobe-mediated remediation of environmental contaminants usingomics approaches Front Microbiol 2018 9 1132(65) Giwa A S Ali N Athar M A Wang K Dissectingmicrobial community structure in sewage treatment plant forpathogensrsquo detection using metagenomic sequencing technologyArch Microbiol 2020 202 1minus9(66) Guo J Ni B-J Han X Chen X Bond P Peng Y YuanZ Data on metagenomic profiles of activated sludge from a full-scalewastewater treatment plant Data in brief 2017 15 833minus839(67) Laudadio I Fulci V Stronati L Carissimi C Next-generation metagenomics Methodological challenges and opportu-nities Omics a journal of integrative biology 2019 23 (7) 327minus333(68) Soueidan H Nikolski M Machine learning for metagenomicsmethods and tools arXiv preprint arXiv151006621 2015DOI 101515metgen-2016-0001(69) Ramette A Multivariate analyses in microbial ecology FEMSMicrobiol Ecol 2007 62 (2) 142minus160(70) Yang P Yu S Cheng L Ning K Meta-network optimizedspecies-species network analysis for microbial communities BMCGenomics 2019 20 (2) 187
(71) Collignon P Beggs J J Walsh T R Gandra SLaxminarayan R Anthropological and socioeconomic factorscontributing to global antimicrobial resistance a univariate andmultivariable analysis Lancet Planetary Health 2018 2 (9) No e398-e405(72) Arango-Argoty G Garner E Pruden A Heath L SVikesland P Zhang L DeepARG a deep learning approach forpredicting antibiotic resistance genes from metagenomic dataMicrobiome 2018 6 (1) 1minus15(73) El-Gebali S Mistry J Bateman A Eddy S R Luciani APotter S C Qureshi M Richardson L J Salazar G A Smart AThe Pfam protein families database in 2019 Nucleic Acids Res 201947 (D1) D427minusD432(74) Berglund F Marathe N P Osterlund T Bengtsson-PalmeJ Kotsakis S Flach C-F Larsson D G J Kristiansson EIdentification of 76 novel B1 metallo-β-lactamases through large-scalescreening of genomic and metagenomic data Microbiome 2017 5 (1)134(75) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (ARTICLE) 2493minus2537(76) Ng P dna2vec Consistent vector representations of variable-length k-mers arXiv preprint arXiv170106279 2017(77) Woloszynek S Zhao Z Chen J Rosen G L 16S rRNAsequence embeddings Meaningful numeric feature representations ofnucleotide sequences that are convenient for downstream analysesPLoS Comput Biol 2019 15 (2) No e1006721(78) Arango-Argoty G A Heath L S Pruden A Vikesland PZhang L MetaMLP A fast word embedding based classifier to profiletarget gene databases in metagenomic samples bioRxiv 2019 569970(79) Liang Q Bible P W Liu Y Zou B Wei L DeepMicrobestaxonomic classification for metagenomics with deep learning NARGenomics and Bioinformatics 2020 2 (1) No lqaa009(80) Fiannaca A La Paglia L La Rosa M Renda G Rizzo RGaglio S Urso A Deep learning models for bacteria taxonomicclassification of metagenomic data BMC Bioinf 2018 19 (7) 198(81) Krawczyk P S Lipinski L Dziembowski A PlasFlowpredicting plasmid sequences in metagenomic data using genomesignatures Nucleic Acids Res 2018 46 (6) No e35-e35(82) Hollender J van Bavel B Dulio V Farmen E FurtmannK Koschorreck J Kunkel U Krauss M Munthe J Schlabach MHigh resolution mass spectrometry-based non-target screening cansupport regulatory environmental monitoring and chemicals manage-ment Environ Sci Eur 2019 31 (1) 42(83) Soria N G C Montes A Bisson M A Atilla-Gokcumen GE Aga D S Mass spectrometry-based metabolomics to assess uptakeof silver nanoparticles by Arabidopsis thaliana Environ Sci Nano2017 4 (10) 1944minus1953(84) Matich E K Ghafari M Camgoz E Caliskan E Pfeifer BA Haznedaroglu B Z Atilla-Gokcumen G E Time-serieslipidomic analysis of the oleaginous green microalga species Ettliaoleoabundans under nutrient stress Biotechnol Biofuels 2018 11 (1)1minus15(85) Matich E K Soria N G C Aga D S Atilla-Gokcumen GE Applications of metabolomics in assessing ecological effects ofemerging contaminants and pollutants on plants J Hazard Mater2019 373 527minus535(86) Knolhoff A M Premo J H Fisher C M A ProposedQuality Control Standard Mixture and Its Uses for EvaluatingNontargeted and Suspect Screening LCHR-MS Method Perform-ance Anal Chem 2021931596(87) Samanipour S Kaserzon S Vijayasarathy S Jiang H ChoiP Reid M J Mueller J F Thomas K V Machine learningcombined with non-targeted LC-HRMS analysis for a risk warningsystem of chemical hazards in drinking water A proof of conceptTalanta 2019 195 426minus432(88) Angeles L F Islam S Aldstadt J Saqeeb K N Alam MKhan M A Johura F-T Ahmed S I Aga D S Retrospectivesuspect screening reveals previously ignored antibiotics antifungal
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
L
compounds and metabolites in Bangladesh surface waters Sci TotalEnviron 2020 712 136285(89) Albergamo V Scholleacutee J E Schymanski E L Helmus RTimmer H Hollender J De Voogt P Nontarget screening revealstime trends of polar micropollutants in a riverbank filtration systemEnviron Sci Technol 2019 53 (13) 7584minus7594(90) Chiaia-Hernaacutendez A C Scheringer M Muller A StiegerG Waumlchter D Keller A Pintado-Herrera M G Lara-Martin PA Bucheli T D Hollender J Target and suspect screening analysisreveals persistent emerging organic contaminants in soils andsediments Sci Total Environ 2020 740 140181(91) Cariou R Meacutendez-Fernandez P Hutinet S b Guitton YCaurant F Le Bizec B Spitz J r m Vetter W Dervilly GNontargeted LCESI-HRMS Detection of Polyhalogenated Com-pounds in Marine Mammals Stranded on French Atlantic Coasts ACSESampT Water 20211309(92) Travis S C Kordas K Aga D S Optimized workflow forunknown screening using gas chromatography high-resolution massspectrometry expands identification of contaminants in siliconepersonal passive samplers Rapid Commun Mass Spectrom 2021 35(8) No e9048(93) Hollender J Schymanski E L Singer H P Ferguson P LNontarget screening with high resolution mass spectrometry in theenvironment ready to go ACS Publications 2017(94) Muller A Schulz W Ruck W K L Weber W H A newapproach to data evaluation in the non-target screening of organictrace substances in water analysis Chemosphere 2011 85 (8) 1211minus1219(95) Scholleacutee J E Schymanski E L Hollender J Statisticalapproaches for LC-HRMS data to characterize prioritize and identifytransformation products from water treatment processes In AssessingTransformation Products of Chemicals by Non-Target and SuspectScreening- Strategies and Workflows Vol 1 ACS Publications 2016 pp45minus65(96) Sireacuten K Fischer U Vestner J Automated supervisedlearning pipeline for non-targeted GC-MS data analysis AnalyticaChimica Acta X 2019 1 100005(97) Sinkov N A Harynuk J J Cluster resolution A metric forautomated objective and optimized feature selection in chemometricmodeling Talanta 2011 83 (4) 1079minus1087(98) Johnsen L G Amigo J M Skov T Bro R Automatedresolution of overlapping peaks in chromatographic data J Chemom2014 28 (2) 71minus82(99) Smirnov A Jia W Walker D I Jones D P Du X ADAP-GC 32 graphical software tool for efficient spectral deconvolution ofgas chromatography-high-resolution mass spectrometry metabolomicsdata J Proteome Res 2018 17 (1) 470minus478(100) Chen C Husny J Rabe S Predicting fishiness off-flavourand identifying compounds of lipid oxidation in dairy powders bySPME-GCMS and machine learning Int Dairy J 2018 77 19minus28(101) Taghadomi-Saberi S Mas Garcia S Allah Masoumi ASadeghi M Marco S Classification of bitter orange essential oilsaccording to fruit ripening stage by untargeted chemical profiling andmachine learning Sensors 2018 18 (6) 1922(102) Yang Q Xu L Tang L-J Yang J-T Wu B-Q Chen NJiang J-H Yu R-Q Simultaneous detection of multiple inheritedmetabolic diseases using GC-MS urinary metabolomics by chemo-metrics multi-class classification strategies Talanta 2018 186 489minus496(103) Smolinska A Hauschild A C Fijten R R R Dallinga JW Baumbach J Van Schooten F J Current breathomicsa reviewon data pre-processing techniques and machine learning inmetabolomics breath analysis J Breath Res 2014 8 (2) 027105(104) Garrido-Baserba M Corominas L Corteacutes U Rosso DPoch M The fourth-revolution in the water sector encounters thedigital revolution Environ Sci Technol 2020 54 (8) 4698minus4705(105) Vujnovic G Perisic J Bozilovic Z Milovanovic MGetman R V Radovanovic L USING SCADA SYSTEM FOR
PROCESS CONTROL IN WATER INDUSTRY Acta TechnicaCorviniensis-Bulletin of Engineering 2019 12 (2) 67minus72(106) Olsson G ICA and me-a subjective review Water Res 201246 (6) 1585minus1624(107) Zhao H Hou D Huang P Zhang G Water quality eventdetection in drinking water network Water Air Soil Pollut 2014 225(11) 2183(108) Chandola V Banerjee A Kumar V Anomaly detection Asurvey ACM computing surveys (CSUR) 2009 41 (3) 1minus58(109) Hasan J Goldbloom-Helzner D Ichida A Rouse TGibson M Technologies and Techniques for Early Warning Systems toMonitor and Evaluate Drinking Water Quality A State-of-the-ArtReview Environmental Protection Agency Washington Dc Office ofWater 2005(110) Izquierdo J Loacutepez P A Martiacutenez F J Peacuterez R Faultdetection in water supply systems using hybrid (theory and data-driven) modelling Mathematical and Computer Modelling 2007 46(3minus4) 341minus350(111) Murray S Ghazali M McBean E A Real-time water qualitymonitoring assessment of multisensor data using Bayesian beliefnetworks Journal of Water Resources Planning and Management 2012138 (1) 63minus70(112) Raciti M Cucurull J Nadjm-Tehrani S Anomaly detectionin water management systems In Critical Infrastructure ProtectionSpringer 2012 pp 98minus119(113) Ahmed M Mahmood A N Hu J A survey of networkanomaly detection techniques Journal of Network and ComputerApplications 2016 60 19minus31(114) Yuan Y Jia K A Water Quality Assessment Method Based onSparse Autoencoder 2015 IEEE 2015 pp 1minus4(115) Muharemi F Logofatu D Leon F Machine learningapproaches for anomaly detection of water quality on a real-worlddata set Journal of Information and Telecommunication 2019 3 (3)294minus307(116) Alom M Z Taha T M Yakopcic C Westberg S SidikeP Nasrin M S Hasan M Van Essen B C Awwal A A S AsariV K A state-of-the-art survey on deep learning theory andarchitectures Electronics 2019 8 (3) 292(117) Russo S Lurig M Hao W Matthews B Villez K Activelearning for anomaly detection in environmental data EnvironmentalModelling amp Software 2020 134 104869(118) Ahmad S Lavin A Purdy S Agha Z Unsupervised real-time anomaly detection for streaming data Neurocomputing 2017262 134minus147(119) Stocco A Tonella P Towards Anomaly Detectors that LearnContinuously IEEE 2020 pp 201minus208(120) Hoffman M Bach F R Blei D M In Online learning forlatent dirichlet allocation 2010 Citeseer pp 856minus864(121) Mongillo G Deneve S Online learning with hidden Markovmodels Neural computation 2008 20 (7) 1706minus1716(122) Bogler A Packman A Furman A Gross A Kushmaro ARonen A Dagot C Hill C Vaizel-Ohayon D Morgenroth ERethinking wastewater risks and monitoring in light of the COVID-19pandemic Nature Sustainability 2020 9 1minus10(123) Bivins A North D Ahmad A Ahmed W Alm E BeenF Bhattacharya P Bijlsma L Boehm A B Brown J Wastewater-Based Epidemiology Global Collaborative to Maximize Contributions inthe Fight Against COVID-19 ACS Publications 2020
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
M
Table
1Overviewof
Pop
ular
MLAlgorithm
s
machine
learning
model
description
positives
andnegatives
exem
plaryapplications
with
inES
E
Logistic
Regression
(Logit
model)24
bullSupervised
MLclassificationalgorithm
bullEasy
toimplem
ent
bullEstim
atetheprobablepresence
ofatype
ofvegetatio
nusing
environm
entalvariables2
5
bullStatisticalmodelthatuses
alogisticfunctio
nto
modeltheprobabilityof
the
output
variable
bullHighlyinterpretable
bullPredictcontam
inantsources26
andwater
quality
2728indifferent
water
system
s
bullUsedto
perform
binary
classificationtaskswhere
givensamples
are
classifiedinto
twogroups
bullDoesnotrequireheavycomputatio
nalresources
bullAnomalydetectionin
water
treatm
entsystem
s28
bullCan
beextended
tomultip
leoutput
variables
bullOftenused
asthefirstmodelor
benchm
arkto
compare
tomorecomplex
models
RandomForest
(RF)
29bullAtree-based
ensemblelearning
methodthat
isused
forsupervised
classificationandregression
problems
bullRobustcanhandlelinearandnonlineardatao
utliersaswellas
noisydata
bullPredictrelativeabundancelevels
insewage7
bullThe
basicunitisadecision
tree
bullNot
proneto
overfittin
gwhenthemodelovertrains
onthetraining
dataandfails
togeneralizeon
the
testingor
newdata
set3
0bullSelect
importantvariablesin
predictin
ganom
alouseventsin
water
distributio
nsystem
s28
bullSamples
andfeatures
arerandom
lysampled
from
thefulldata
setand
individualdecision
treesaremadeon
thesampled
data
set
bullDepending
onthesize
ofthedatasetand
theensembleof
treesRFs
cangetcom
plex
andmay
require
high
computatio
nalresourcesandlong
times
forboth
training
andpredictio
nbullPredictnutrient
concentrations
usingwater
quality
variables3
2
bullThe
outcom
esgeneratedfrom
theensembleof
decision
treesarecombined
todevelopthefinalmodelpredictio
nbullPredictnanofiltrationreverse
osmosis-m
embranerejectionof
emerging
contam
inants31
SupportVector
Machines
(SVM)
bullSupervised
learning
algorithm
bullUsesasubset
oftraining
pointsin
thedecision
functio
n(supportvectors)
andismem
oryefficient
bullNonlineartim
eseries
forecasting
modelto
predictwater
quality
33
bullUsedforboth
classificationandregression
tasks
bullEffectiveinhigh
dimensionalspaces
andincaseswhere
thenumberof
samples
islessthan
thenumber
ofdimensionsHow
everchoosingkernelfunctio
nsandregularizatio
niscrucialto
avoidoverfittin
gbullPredicttheconcentrationof
nitrateandnitrite
inthemixed
liquorof
awastewater
treatm
ent
plant34
bullAimsto
findahyperplane
(inan
Ndimensionalfeaturespace)
that
has
maximum
distance
betweendata
pointsof
allclassesThe
hyperplane
classifiesthesamples
ordata
pointsunderconsideration
bullPh
enolicpollutant
concentration
predictio
nin
drinking
water35
bullMaximizingthemargindistance
brings
confidence
intheclassificationof
newdata
points
bullSupportvectorsaredata
pointsthat
arecloser
tothehyperplane
and
influence
thepositio
nandorientationof
thehyperplane
bullSetsof
mathematicalfunctio
ns(kernels)canbe
used
totransform
theinput
data
asrequired
Artificial
NeuralNet-
works
(ANNs)
bullANNsarecomputin
gsystem
sinspired
toanalyzeandprocessinform
ation
theway
thehuman
braindoes
bullDLhasem
ergedas
thestate-of-the-artMLmethod36minus39
bullPredicteffluent
concentrations
inawastewater
treatm
entplant4
2
bullAsimpleANN
consistsof
aninputlayerahidden
layermadeof
entities
(callednodes)
that
receiveweightedinputsfrom
theinputlayerandthen
transform
them
usinganonlinearfunctio
nthat
transm
itsthetransformed
valueto
theoutput
layerwhere
thefinaloutput
isused
togetpredictio
ns
bullDLmodelsrequirelargedatasetsfortraining
toachieverobustperformanceH
oweveritisdifficultto
know
thesamplesize
required
fortheparticular
task
beforehandS
omerulesof
thum
bhave
been
proposed
basedon
anumberof
assumptionsF
orinstancethe
samplesize
needsto
be50minus1000timesthe
numberof
classesor
10minus100times
thenumberof
featureso
r10minus50timesthenumberof
weightsin
the
network40
bullAnomalydetectionin
water
treatm
entsystem
2843
bullIn
conventio
nalANN
architecturetheinform
ationmoves
unidirectio
nally
(iefrom
inputto
output
layer)T
hisarchitectureisalso
know
nas
afeedforwardnetwork
bullFo
rsm
allerdata
setsalternativeMLmodelsmay
perform
better
bullAirquality
forecastingmodel44
bullWhenthenumberof
hidden
layersisincreasedto
achievehigher
levelsof
abstractionin
orderto
extractcomplex
inform
ationfrom
thedatathe
bullANN
architectures
existthat
canbe
used
toperform
supervisedu
nsupervised
orsemisupervised
learning
tasksSomeexam
ples
arerecurrentneuralnetworks
(RNNs)convolutio
naln
euraln
etworks
(CNNs)and
autoencoders
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
C
Table
1continued
machine
learning
model
description
positives
andnegatives
exem
plaryapplications
with
inES
E
resulting
architectureisknow
nas
adeep
neuralnetworkor
deep
learning
(DL)
bullANNstend
tooverfitC
onscientious
modelbuildingthorough
hyper-
parameter
tuning
andextensivevalidationisoftenrequired41
bullConsideredldquoblack
boxrdquo
modelsas
theinternalfunctio
ning
ofANNscanbe
hard
tointerpretThiscan
lead
toresultmisinterpretatio
nandtheacceptance
ofpoor
modelsAthorough
understandingof
varioushyperparam
etersisnecessaryto
understand
thenuancesof
themodel
Principal
Com
ponent
Analysis
(PCA)4546
bullAnunsupervised
technique
PCAcombinesfeatures
andcreatesanew
featureseto
fthe
samesize
where
each
newfeature(ietheldquocom
ponentsrdquo)
isacombinatio
nof
theoriginalfeatures
bullWidelyused
featureextractio
nmethod
Usedto
reduce
thedimensionality
ofthedata
setwhile
preserving
asmuchdata
variability
aspossible
bullEstim
atethevariance
explained
bygivenvariablesof
awater
quality
data47
bullNew
features
areform
edby
finding
theeigenvectors
andcorresponding
eigenvaluesthat
providean
estim
ationof
howmuchvariance
each
component
explains
bullOne
canselect
afewcomponentsthatexplainthemajority
ofthevariability
inthedata
anddrop
the
remaining
ones
bullIntercom
pare
airpollutio
npat-
terns4
4
bullCan
beused
toreduce
multicollinearityinthedataas
newcomponentsareindependento
feachother
bullEstablishthekeycompounds
todistinguishbetweensamples
usinghigh
resolutio
nmassspec-
trom
etry
(HRMS)
data4849
bullEstim
atetheimportance
ofdifferent
variablesandidentify
key-foulantsin
amem
brane
baseddrinking
water
treatm
ent
process5
0
K-means
Clustering
(K-means)51
bullUnsupervisedMLalgorithm
that
aimsto
clustersamples
usingfeature
values
bullSimpleto
implem
entandcaneasilyscaleto
largedata
sets
bullDelineate
different
airquality
indexthresholdclustersusingair
pollutio
ndata52
bullUserdefines
atarget
kvaluewhich
refersto
thenumberof
clustersone
wantsto
groupthesamples
into
bullDisadvantageisthat
oneneedsto
aprioridefinethekvaluewhich
may
bedifficult
bullAnalyze
water
quality
data55354
bullThe
kvaluerefersto
thenumberof
thecentroid
which
refersto
alocatio
nrepresentin
gthecenter
ofagivencluster
bullFails
forclusterswith
complex
nonsphericalgeom
etricshapes
bullGeneratevulnerability
mapsfor
anaquiferusingwater
quality
data55
bullK-m
eans
startswith
random
centroidsanditerativelyfinds
themost
appropriatecentroidsandassignsthesampleinto
respectiveclusters
such
thatthesum
ofthesquareddistance
betweenthesamples
andthecentroid
isminimized
bullClustercentroidscangetskewed
inthepresence
ofoutliershenceitisadvisedto
handleoutliersbefore
applying
k-means
bullRiskassessmentof
water
pollu-
tionsources5
6
Hierarchical
Clustering
bullUnsupervisedMLalgorithm
that
aimsto
clustersamples
usingfeature
values
bullNoaprioriinform
ationaboutthenumberof
clustersisrequired
bullFeatureclustering
basedon
Pearsoncorrelations
inHRMS
data57
bullThe
hierarchicalclustering
algorithm
inadditio
nto
breaking
uptheobjects
into
clustersalso
show
sthehierarchyor
rankingof
thedistance
andshow
showdissimilaroneclusterisfrom
othersH
ierarchicalclustering
isrepresentedas
dendograms
bullTimeintensive
bullEstim
atethesimilarity
between
theresistom
eprofilesof
different
samples
basedon
Brayminus
Curtis
dissimilarity
metric58
bullInputisameasureofdissimilaritybetweensamplesT
hechoice
ofsimilarity
(ordistance)measuresisthemaininfluencing
factor
inhierarchical
clustering
bullCould
bedifficultto
identifythecorrectnumberof
clusters
basedupon
thedendogram
bullWater
quality
assessmentwith
hierarchicalclusteranalysisbased
onMahalanobisdistance59
bullHierarchicalclusterscanbe
built
either
asagglom
erative(bottom-up
each
samplestartsin
itsow
nclusterandclustering
isdone
bymerging
each
sampleandmovingup
thehierarchy)
ordivisive
(top-dow
nallsam
ples
are
putin
oneclusterandsplitsareperformed
recursivelymovingdownthe
hierarchy)
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
D
ESE data sets can incorporate numerous data types withwide spatial and temporal variability Accordingly if possible athorough data visualizationexploration and preprocessingprocess should be performed to gain initial insights andstreamline the data for the subsequent model building steps(Figure 1b) To aid this depending on the data type wellestablished community guidelines could be leveraged Thisprocess typically includes four steps data cleaning integrationreduction and transformation (See the glossary for de-tails)1617
23 Model Building ML Models ML models fall intothree classes supervised unsupervised and semisupervised(Figure 1c) Supervised learning is a ML task where analgorithm is used to train a model using known data and thenthe learned model is used to predict the outcome of new orunforeseen instances18 In supervised learning a sample in thedata set has two components (1) a set of variables (features)that define the sample and (2) class labels (outcomes) that onewants to predict Two types of models are built usingsupervised learning algorithms classification or regressionmodels A classification model is suggested when the outputdata can be categorized into specific groups or classes (iediscrete output variables) If the output variable is a continuousvariable a regression model is recommendedUnsupervised learning is done in the absence of pre-existing
class labels It is a ML category where the goal is to find hiddenand unknown data patterns or to determine the datadistribution across the data set19 Unsupervised learningalgorithms are often used for data exploration and visual-ization The two major types of unsupervised learningapproaches are clustering and dimensionality reductionClustering algorithms are used to group similar samplestogether into clusters whereas dissimilar samples are relegatedto separate clusters Dimensionality reduction algorithms areused to reduce the dimension of the feature set Too manyuninformative features can hinder predictive modeling this isoften referred to as the ldquocurse of dimensionalityrdquo Somecommon supervised and unsupervised learning algorithms arelisted in Table 1Semisupervised learning is a hybrid of supervised and
unsupervised learning where a set of labeled data is used inconjunction with a set of unlabeled training data In manycases getting true labels for a data set can be difficultSemisupervised learning tackles this problem by making use ofunlabeled data in the learning process and then leveraging thislearned information along with the labeled data set forsubsequent prediction20
Model Optimization and Evaluation Supervised learningis usually a three-step process In the first step the data aresplit into training and testing data sets with the split rangingbetween 60 and 80 for training and 40minus20 for testing Thetraining data set is then iteratively separated into training andvalidation sets The model is trained on the training set whilethe validation set is used to tune the model hyperparametersthat consist of the model architecture and the parameters thataffect the speed and quality of the training process Followingthis cross-validation the learned model is tested against thetest data set to evaluate model robustness and accuracy Toensure unbiased model evaluation the test data set cannot beused for training Commonly used model evaluation metricsinclude accuracy F1-score precision recall area under thereceiver operating characteristic (ROC) curve mean absoluteerror (MAE) and mean squared error (MSE)21
Evaluating unsupervised models can be challenging becausedata do not have prior labels Evaluation typically involvesinternal and external validation22 Internal validation is done byestimating inter- and intra-cluster distances to evaluate clusterquality A cluster refers to a collection of data points thataccumulate together based on certain similarities Modelsminimizing intracluster distances while maximizing interclusterdistances are preferred Such a result suggests that thesemodels are performing well at clustering similar data pointstogether Some commonly used metrics for this purpose arethe Adjusted Rand Index and the Silhouette CoefficientHowever a good internal validation score does not guaranteethe effectiveness of the obtained clusters for real applicationsHence external validation is necessary to assess whether thedata points are assigned to the correct clusters This stepusually requires human evaluation or comparison againstbenchmarks For dimensionality reduction task reconstructionerror or loss is estimated to evaluate model performanceThese methods seek to minimize the reconstruction error orlossminusdefined as the distance between the original data pointand its projection onto a lower-dimensional subspace (itsldquoestimaterdquo)23
The ESE community has begun to leverage various MLalgorithms and techniques to analyze environmental data setsWhile it is difficult to determine a priori which ML algorithm isbest suited for a particular data set an improved understandingof the positives and negatives and the inherent assumptions ofthe specific algorithms aid in the choice of the righttechnique(s) Table 1 summarizes algorithms popular withinthe ESE community The table describes the algorithms theiradvantages and disadvantages the assumptions for specific datatypes and gives examples where the algorithms have been usedto address ESE related problems
24 Data Interpretation The final step in the modelbuilding cycle is to interpret the learnt model and developreasonable explanations for the obtained analysis (Figure 1d)Data interpretation could mean extracting important variablescontributing to the model prediction These variables couldhelp in verifying the hypotheses of the study or understandingwhat factors are critical in driving the observations in the studyunder consideration At this stage the literature and expertopinions are queried to make connections between the modeloutputs and knowledge about the relevant chemical physicaland biological phenomena to make data-driven conclusionsand decisions
3 CURRENT APPLICATIONS OF MACHINE LEARNINGIN ESE
In this section we discuss current applications of MLalgorithms within the ESE domain by presenting three casestudies representing different application areas
31 Metagenomic Data Analysis High throughputshotgun (untargeted) metagenomic DNA sequencing offers arobust and effective way to access the microbial world and isnow often used in ESE Differentiating microbial communitiesin different environments6061 studying the dissemination ofantibiotic resistance in environmental systems6263 definingbacterial communities in contaminated environments toexplore bioremediation64 and characterizing microbial com-munities in wastewater treatment processes6566 are allexamples of environments where metagenomic character-ization is increasingly being applied
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
E
In a typical metagenomic study the first step is often tosearch sequenced DNA reads against a reference database toderive taxonomic or functional gene annotation The obtainedtaxonomic or functional gene profiles are then normalized toprovide relative abundance information corresponding todifferent species or genes in the sample that can be visualizedand analyzed to answer specific questions about the sampleContinuous optimization and declining costs of sequencingplatforms have made metagenomics related research moreaccessible than ever before This trend has led to thegeneration of large volumes of publicly available sequencingdata but rendering these data into meaningful interpretationsremains challenging67 Fortunately the advent of ML methodshas immensely accelerated advancement at every cross sectionof a typical metagenomic study68 Here we highlight someapplications of ML in antibiotic resistance-related studiesOne often faced challenge with metagenomic data is that of
high dimensionality and low sample size (HDLSS ie morevariables than independent samples) For example the numberof genes annotated within a given environmental sample caneasily exceed hundreds of thousands Hence many unsuper-vised techniques such as principal component analysis (PCA)and non-metric multi-dimensional scaling (NMDS) are usedas preprocessing steps to reduce dimensionality by removinguninformative features A common application of thesetechniques is to examine the similaritydissimilarity of differentenvironmental samples by clustering based on species or genecomposition69 Network-based ML (data is represented in theform of a graph where nodes are the data points to be clusteredand edges represent the relationship or similarity between thedata points) is another unsupervised approach to studyproteinminusprotein or geneminusgene interactions to understanddifferent functional pathways70
Given the sample labels or response variables and the genecomposition support vector machines (SVMs) ANNs andensemble methods (such as random forest (RF) or extremelyrandomized trees) have been extensively used to performsupervised learning and build predictive models For exampleidentifying interesting patterns and important genes in a dataset58 predicting relative antibiotic resistance abundancelevels7 understanding the role of socioeconomic status inshaping the resistome or microbiome71 or the prediction ofantibiotic resistance phenotype72 are some of the uniqueproblems that have been examined using the above-mentionedmethods Hidden Markov models (HMMs) which can beused in both supervised or unsupervised fashion are one of themore readily used algorithms to detect antibiotic resistancegene variants or potential functional homologues73 and havebeen applied for the discovery of novel ARGs frommetagenomic data74
Word embedding which has gained a lot of popularity overthe recent years is a feature learning technique used in naturallanguage processing (NLP) Word embedding is a term usedfor the representation of words or sentences in the form ofnumeric vectors that can be used for downstream ML tasks75
Raw DNAprotein sequences sliced into k-mers are analogousto the structure of a sentence and can be analyzed in a similarfashion as natural languages Thus there is a rapid thrusttoward analyzing raw sequences using NLP based techni-ques7677 MetaMLP78 based on a similar idea uses a wordembedding based classification model to predict ARGphenotype and has been found to perform 50times faster thanDIAMOND (one of the fastest sequence alignment methods)
with similar accuracy Word embedding based models holdimmense promise in analyzing metagenomic data as they havethe ability to learn patterns directly from within raw sequencesSimilar to the antibiotic resistance example other
applications can also be envisioned For instance analyzingmetagenomics data for taxonomic classification7980 or mobilegenetic element classification81 where ML algorithms wereable to outperform the traditional methods In essence MLalgorithms have enhanced our ability to robustly analyzecomplex metagenomic data
32 Nontarget Analysis of Environmental Samples Arapidly developing approach in environmental analysis andtoxicology involves the use of high resolution massspectrometry (HRMS) based nontarget analysis (NTA)where data on accurate masses of molecular and fragmentions are collected without a priori information on thechemicals being analyzed NTA can potentially aid in thescreening and analysis of the vast and diverse universe oforganic pollutants a grand challenge faced by the ESEcommunity82 Because the chemicals being detected are notpredetermined NTA provides an opportunity to comprehen-sively examine the occurrence fate and transport of chemicalcontaminants in different environmental niches with minimalbias82 Similarly NTA can be applied in environmentalmetabolomics to facilitate understanding of the effects ofchemical perturbations on exposed organisms (eg plantsanimals and humans) without focusing on a particularbiochemical pathway8384 A recent review highlighted studiesthat employed the different types of NTA techniques used inmetabolomics to discover metabolite changes in plants inducedby exposure to xenobiotics (eg pharmaceuticals personal careproducts pesticides flame retardants and engineered nano-materials) to examine the effect of altered levels of nutrients inthe environment on plant systems85 Here we present a briefsummary of how the combination of NTA with ML canadvance monitoring of water wastewater quality soil andexposed organisms (eg humans wildlife)While there are various methods that can be used for NTA
in ESE HRMS is the most popular because of its capacity forsensitive detection of low levels of contaminants andmetabolites in complex environmental samples However thepower of NTA using HRMS has been limited by problemsassociated with sample preparation and data analysis Inanalyzing environmental samples using MS sample concen-tration and cleanup are critical because the signal intensity ofthe MS features depend not only on the concentrations of thechemicals but also on the amounts and the nature of thematrix present in the sample extracts The reproducibility ofthe ionization of compounds in MS can be compromisedsignificantly by matrix effects Therefore it is important to havean appropriate number of replicate samples and properlyselected blank samples when conducting NTA Solid-phaseextraction for sample cleanup is commonly used but thisapproach can introduce bias because highly polar contaminantsmay be lost during sample preparation Unlike in targetanalysis where variations due to matrix effects and samplelosses can be corrected using stable isotope-labeled referencecompounds as surrogates this approach cannot be used inNTA because the purpose of NTA is to identify unknowncontaminants that were not included in the target list Tonormalize for instrumental variation and matrix effects internalstandards with varying polarity can be added to the sampleextract prior to analysis Recently several compounds of
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
F
diverse structure log Kow chromatographic behavior andionization efficiency have been proposed for inclusion in aquality control mixture usable for NTA86 However theproposed mixture has limitations such as for the detection ofhydrophilic compounds and molecular formula generation forcompounds containing fluorine Finally due to the substantialamount of MS data acquired indiscriminately under full-scanand the subsequent MSMS fragmentation of molecular andfragment ions data processing to identify relevant features isdaunting Therefore NTA workflows with built-in filters andcriteria are being developed to facilitate prioritization of MSfeatures that are useful for chemical structure annotation Inthis regard advanced data processing tools high throughputstatistical packages and user-friendly visualization programsare needed to fully interrogate the rich data sets acquired byNTANTA based on HRMS holds great promise for the
comprehensive monitoring of the occurrence and fate ofcontaminants in water and wastewater87 It also allowsresearchers to retrospectively analyze stored HRMS data toscreen for suspected contaminants (ie suspect screening) thatmay have been missed during target analysis88 Using advancedcomputational strategies such as hierarchical cluster analysiscommon patterns in the occurrence of contaminants in theenvironment and their emission pathways can be predictedbased on time series analysis of the aggregated NTA data89
NTA combined with cluster analysis was successfully appliedto reveal previously unmonitored chemical contaminants insoil and sediment samples90 The application of NTA andadvanced postacquisition data treatment will continue toenhance our ability to discover emerging contaminants in theenvironment including those that bioaccumulate and poserisks to humans and wildlife For instance new polyhalo-genated compounds were detected for the first time in blubbersamples from marine mammal sentinel species using both LC-HRMS and GC-HRMS for analysis and an open-source datamining software in the R programming environment thatdetected halogenated signatures in full scan HRMS91 FinallyNTA combined with personal passive samplers and propersample preparation techniques can be used to unlock thecomposition of chemical mixtures that humans are exposed toon a daily basis which can be used in investigating the humanexposome92
Akin to metagenomic data NTA data pose a similar HDLSSchallenge as each mass spectrum constitutes a large number ofpeaks representing potential compounds present in a givensample Hence data preprocessing is crucial and inevitableVarious data preprocessing steps are performed such asdetecting peaks subtracting peaks that were found in blankor control samples componentization (ie grouping of signalsthat probably belong to one unique molecular structure) andremoving noise using replicate measurements93 Followingpreprocessing various supervised and unsupervised MLalgorithms can be applied to engineer select and extractrelevant features However before any further analysis datanormalization and data scaling are two crucial steps thatrequire considerationThe most common algorithms used for NTA are linear
projection methods such as PCA and supervised partial least-squares discriminant analysis (PLS-DA)949548 PCA aids insample comparison by removing variance from the sample setSupervised PLS-DA along with relevant metadata can be usedto extract features pertaining to the specific questions that are
being asked However a major challenge is that featuredetection based on peak intensity can be misleading because oferroneous peak assignments Also relevant information can belost when differentiating the actual contaminant signals andbackground noise when using intensity information for featureselection Hence several alternative approaches that aim atusing raw data signals (retention time times mass-to-charge ratio)that bypass the peak detection step have been proposed andimplemented for improved extraction of information andunderlying patterns in the data set96minus98 This includestechniques such as transforming the raw data signal to distancematrices for dissimilarity analysis96minus98 and using clustering toextract features99
A number of other data analysis algorithms exist but havenot yet become common in NTA environmental analysisClustering techniques such as k-means and hierarchicalclustering can be used to identify similar samples Algorithmslike RF SVMs and ANNs can be applied to classify samplesbased on different categories and would be advantageous inillustrating nonlinear relations in the data These algorithmshave shown excellent promise in analyzing HRMS data inother fields100minus103 and this promise can certainly be extendedto environmental sample analysis
33 Anomaly Detection in Engineered Water Sys-tems (EWS) EWS is an umbrella term for systems of watercollection treatment distribution storage and their operationRecently there has been a major thrust toward digitalization ofthe water sector104 In particular the evolution of cyberinfras-tructure and of online process control instrumentation has ledto the development of advanced process control solutions suchas supervisory control and data acquisition (SCADA)systems105106 Such advances have enabled water utilities tocontinuously monitor water quality identify problems andeffectively oversee maintenance issues both remotely and morelocally These systems entail collection of a large volume of rawdata that could be in conjunction with appropriate dataanalysis techniques transformed into valuable information thatcan be leveraged to make proactive decisions to optimizeoverall performance107 In particular there is surging interest inusing ML techniques to identify unusual patterns in raw EWSdata as a means of discovering unexpected activitiesminusthis isbroadly termed anomaly detection108
A typical anomaly detection task in EWS aims todifferentiate between natural expected variations in waterquality and unusual or suspicious variations caused bycontamination or failure somewhere in the system Thechallenge is to characterize these normal water qualityvariations as it requires analysis of long-term data thatencompasses inherent background variability109 This charac-terized normal response helps to flag anomalous events in thedata Various ML algorithms have been applied to address thisproblem (Table 1)Within EWS one primary form of the generated data is a
time series where each time point can be considered a discretesample The time series consists of measurements of indicatorscollected over time from one or several sensors which formthe feature set Hence each sample can be represented as amultidimensional vector where each dimension represents afeature Historic data is then used to train the model Givenappropriate analysis of the supporting data set one can frameand solve problems to address a multitude of aims such asdetecting leaks sensor failures abrupt changes in water qualityor contamination events110minus112
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
G
There are a number of different anomaly detectiontechniques available113 In the unsupervised approachunlabeled samples are clustered using an algorithm such ask-means density-based clustering algorithms or expectation-maximization (EM) clustering The concept is that the normaldata points cluster to form high density clusters whereasanomalous data points cluster separately or distant from thenormal data pointsclusters and are located in low-densityregions This way one can carry out additional investigationusing new samples and classify them relative to the normaldata In the supervised approach labels (normal or anomalous)are assigned to each sample based on past information andexpert knowledge of anomalous behavior In this way theproblem is converted into a binary classification problemThough it can easily be extended to a multiclass classificationto categorize different types of anomalies SVM ANNBayesian Network Logistic Regression models and theirvariants are well explored algorithms in this space113 A specialcase of SVM One-Class SVM (OCSVM) is a widely usedmethod for anomaly detection In OCSVM the entire trainingdata set is considered as one-class (eg normal class) and thenew data points are classified as similar (normal) or different(anomalous) to the training data Because considering all datapoints from one class is equivalent to having no label OCSVMis considered as unsupervised learning methodOwing to the success of ML in detecting anomalies in other
domains such as network security researchers have startedexploring its potential in water quality anomaly detec-tion43114115 Based on the studies published so far DLalgorithms have shown promise in detecting anomalies andhave outperformed conventional techniques on a number ofoccasions However it should be noted that DL methods couldbe slow to train depending on the depth of the network andthe amount of data available116
Many studies have adopted a batch learning approach wherethe model is trained on historical data and then the new data iscategorized using this trained model With a continuousincoming time-series data stream and the possibility of novelanomalies retraining the models with every new data set canquickly become impractical and difficult to reliably executeContinual or active learning frameworks that continuouslylearn as the new data stream comes in and that can identifyanomalous behavior in real time are required to circumventthese issues117118 Variants of Latent Dirichlet AllocationMarkov Models and ANN based architectures are algorithmsthat have been applied in other domains to achieve continualonline learning119minus121 However thes approaches remainunderexplored for anomaly detection in water systems asthese frameworks are not trivial and have their ownimplementation challenges118119
Ultimately anomaly detection will be most valuable if it canlearn continually as the data stream comes in and yield real-time reporting that informs immediate corrective action It iscrucial that the models being utilized are fast and accurateHence going forward there is a need to shift the focus towardhybrid approaches (combinations of different algorithms) tobuild powerful models that are able to detect multiple types ofanomalies in real-time Such models will enhance EWSanomaly detection and advance public health9118119
4 PATH FORWARDThere is increasing interest in and a growing body of researchdocumenting how data analytics is being used to address ESE
problems As illustrated in this Feature the power of dataanalytics has been widely recognized as has been the vast needfor its application Anomaly detection has particularlypromising applications for water professionals and practi-tioners Notably broader application of metagenomics andNTA can revolutionize environmental monitoring effortsHowever the application of data analytics in ESE practiceremains in its infancy and concerted efforts are required tomake data analytics an integral part of ESE research andeducation Such a goal would most effectively be achieved ifthere were an agreed-upon plan of action Coordinated effortson several fronts are needed to help the ESE community reapthe potential benefits of data analyticsFirst there is a need to encourage collaborations between
data scientists and ESE practitioners that involve diverselyengaged and integrated research teams from multipledisciplines Such collaborations will facilitate cross-disciplinarycommunication and simultaneous skills building For exampledata scientists come from a culture highly supportive of datasharing (eg GitHub Bitbucket public databases) The ESEcommunity should embrace this culture and incorporate datasharing both locally regionally and globally Recent efforts toaddress the COVID-19 pandemic122123 the global dissem-ination of antimicrobial resistance7 and the development ofglobally vetted data analysis approaches within the NTAcommunity are all steps in this direction Such collaborationshave the potential to yield profound data-driven insights andconclusions that would be impossible to achieve within normalacademic and professional silos Working on their own datascientists may create models that answer specific questions butlack the interpretive training required to contextualize theirresults into real world applications Simultaneously ESEpractitioners with data may not possess the correspondingtools or insights required for proper analysis Strategiccollaborations will lead to improved data analytics datainterpretation and improved decision makingSecond there is a lack of user-friendly data analytics tools
and platforms devised for ESE problems Although under-standing the theory behind ML algorithms is not a difficult taskfor many trained scientists a major hurdle that ESE researchersface is in the coding and implementation of these models Withthe continual advancement of data science programming isbecoming an increasingly necessary skill without which modelsmay be improperly developed or may lack efficiency User-friendly tools may circumvent the problems with codinglearning curves and may be essential for widespreadpractitioner adoption While many tools (Microsoft powerBI Weka Tableau Azure) are available for general dataanalysis application-specific toolsplatforms would greatly aidin performing end-to-end analysesThird the field needs to incorporate data analytics-oriented
curricula This could include the addition of data analyticsfocused modules within existing coursework the introductionof new data science and statistics courses within ESE degreeprograms and the development of relevant capstone projectsfor ESE applications Such experiential learning would helpstudents understand the intricacies of different algorithms andprovide hands-on experience in applying various data analyticstechniques in the context of specific ESE problems Furtherencouraging students to take part in data science internshipswould be an excellent means of developing such expertiseFourth introducing data analytics workshops and making
the relevant resources available is valuable for students
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
H
researchers and professionals A plethora of online resourcessuch as Coursera LinkedIn Learning and Edx are available forlow or no cost offering high-quality content in data scienceExpanding access to and creation of open source learningplatforms could prove instrumental in instigating the data-driven problem-solving approachFinally we know that data analytics is not new but
continues to evolve at a rapid pace These advancementsfurnish us with new and powerful ways to analyze the data thatcan help holistically tackle the challenges faced by ESEcommunity and ultimately inform decision-making and policyformulation at scales never previously imagined Hence it iscritical to be abreast of new developments in data analytics andto continue making efforts to harvest their full potential
AUTHOR INFORMATIONCorresponding AuthorPeter Vikesland minus Via Department of Civil andEnvironmental Engineering Virginia Tech BlacksburgVirginia 24061 United States orcidorg0000-0003-2654-5132 Email pvikesvtedu
AuthorsSuraj Gupta minus The Interdisciplinary PhD Program in GeneticsBioinformatics and Computational Biology Virginia TechBlacksburg Virginia 24061 United States
Diana Aga minus Department of Chemistry University at BuffaloThe State University of New York Buffalo New York 14226United States orcidorg0000-0001-6512-7713
Amy Pruden minus Via Department of Civil and EnvironmentalEngineering Virginia Tech Blacksburg Virginia 24061United States orcidorg0000-0002-3191-6244
Liqing Zhang minus Department of Computer Science VirginiaTech Blacksburg Virginia 24061 United States
Complete contact information is available athttpspubsacsorg101021acsest1c01026
NotesThe authors declare no competing financial interestBiography
Peter J Vikesland is the Nick Prillaman Professor of Civil andEnvironmental Engineering at Virginia Tech He received his BAfrom Grinnell College in Chemistry in 1993 and MS and PhD inCivil and Environmental Engineering from University of Iowa in 1995and 1998 Vikeslandrsquos research focuses on the fate of nanomaterials inthe environment nanotechnology enabled sensor platforms forenvironmental quality assessment the environmental disseminationof antibiotic resistance and the application of data analytics in
environmental science and engineering He is a past President of theAssociation of Environmental Engineering and Science Professors(AEESP) is a US National Science Foundation CAREER awardeeand is the recipient of the 2018 Walter Weber Research InnovationAward from AEESP
ACKNOWLEDGMENTS
This study was supported by National Science Foundation(NSF) awards OISE (1545756) ECCSS NNCI (1542100)and CSSI (2004751) Additional support was provided by theCenter for Science and Engineering of the Exposome at theVirginia Tech Institute for Critical Technology and AppliedScience (ICTAS) and the Virginia Tech Graduate Schoolsupported Genetics Bioinformatics and ComputationalBiology (GBCB) and Sustainable Nanotechnology (VTSuN)programs
GLOSSARY
Accuracy Correct predictionstotal predictions
Autoencoders A deep learning techni-que that aims to build amodel where the outputtargets are set equal tothe input variables Itseeks to learn an ap-proximate representa-tion of the data andreconstruct it It is anunsupervised learningapproach used for di-mensionality reduction
Bayesian Network Bayesian networks are akind of probabilisticgraphical model thatutilizes Bayesian infer-ence for probability es-timations Bayesian net-works are used tomodel conditional de-pendence and hencecausation using a direc-ted graph
Data cleaning Identification of outliersor filling in of missingvalues Outlier identifi-cation can be done byusing Z-score quartilevalues or hypothesistesting Missing datacan be handled in mul-tiple ways such as fillingthe value by computingthe summary statisticsof the given variableusing predicted valuescomputed by an MLalgorithm manual cura-tion or by ignoring themissing record
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
I
Data integration The combination ofmultiple data sourcesinto a single coherentform
Data reduction Aggregation or elimina-tion of redundant in-formation to reduce thedata set size
Data transformation Conversion of raw databy normalizing to acommon scale to en-sure consistency andcomparability
Density Based Clustering Work by recognizingldquodenserdquo groups ofpoints permitting it tolearn clusters of discre-tionary shape and dis-tinguish anomalies inthe information
Dimensionality Reduction The process of reducingthe number of randomvariables or attributesunder consideration
EM Clustering Similar to k-means ex-cept it assigns the sam-ples into clusters basedon the probabilities es-timated using the EMalgorithm The objec-tive is to maximize theoverall probability ofthe data for the given(final) clusters
Ensemble Methods Ensemble methods arealgorithms that com-bine predictions fromseveral base models toobtain one optimalmodel
F1-Score F1-Score is a harmonicmean of recall andprecision
HMM HMM is a statisticalMarkov mode l i nwhich the system beingmodeled is assumed tobe a Markov process
k-mer A substring with alength k in a biologicalsequence of nucleoti-des
Naive Bayes Naive Bayes methodsare a family of simpleprobabilistic classifiersbased on Bayes ruleThe method is calledldquoNaiverdquo for its assump-tion of conditional in-dependence among thefeatures
Natural Language Processing (NLP) A subfield of computerscience artificial intelli-gence and linguisticsthat deals with theinteraction of com-puters with human lan-guages
NMDS NMDS is an ordinationtechnique based on dis-tance or dissimilaritymatrix NMDS repre-sents pairwise dissimi-larity between samplesin a low-dimensionalspace
PLS-DA PLS-DA is a linearclassification modelthat is able to predictthe class of new sam-ples
R Programming languagefor Statistical Comput-ing
ROC-Curve A receiver operatingcharacteristic (ROC)curve is a plot of truepositive rate (TPR-yaxis) against the falsepositive rate (FPR-xaxis) It measures theperformance of a classi-fier
SCADA Supervisory control andd a t a a c q u i s i t i o n(SCADA) is a systemdesigned to gather andanalyze real time dataIt is used in waterwastewater treatmentplants to monitor andmanage processes
REFERENCES(1) Tukey J W The future of data analysis Ann Math Stat 196233 (1) 1minus67(2) Hayashi C What is data science Fundamental concepts and aheuristic example In Data science classification and related methodsSpringer 1998 pp 40minus51(3) Bishop C M Pattern recognition and machine learning Springer2006(4) Qiu J Wu Q Ding G Xu Y Feng S A survey of machinelearning for big data processing EURASIP Journal on Advances inSignal Processing 2016 2016 (1) 67(5) Agarwal R Dhar V Big data data science and analytics Theopportunity and challenge for IS research In INFORMS 201425443(6) Wilcox C Woon W L Aung Z Applications of machinelearning in environmental engineering Citeseer 2013(7) Hendriksen R S Munk P Njage P Van Bunnik BMcNally L Lukjancenko O Roumlder T Nieuwenhuijse DPedersen S K Kjeldgaard J Global monitoring of antimicrobialresistance based on metagenomics analyses of urban sewage NatCommun 2019 10 (1) 1124(8) Sobus J R Wambaugh J F Isaacs K K Williams A JMcEachran A D Richard A M Grulke C M Ulrich E MRager J E Strynar M J Integrating tools for non-targeted analysis
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
J
research and chemical safety evaluations at the US EPA J ExposureSci Environ Epidemiol 2018 28 (5) 411minus426(9) Dogo E M Nwulu N I Twala B Aigbavboa C A survey ofmachine learning methods applied to anomaly detection on drinking-water quality data Urban Water J 2019 16 (3) 235minus248(10) Wilde F D Radtke D B Gibs J Iwatsubo R T Nationalfield manual for the collection of water-quality data US Geological SurveyTechniques of Water-Resources Investigations Book 9 1998(11) National Research Council Confronting the nationrsquos waterproblems The role of research National Academies Press 2004(12) Bharti R Grimm D G Current challenges and best-practiceprotocols for microbiome analysis Briefings in bioinformatics202122178(13) Veenaas C Developing tools for non-target analysis and digitalarchiving of organic urban water pollutants 2018(14) Keim D A Visual exploration of large data sets CommunACM 2001 44 (8) 38minus44(15) Chen J Chen Y Du X Li C Lu J Zhao S Zhou X Bigdata challenge a data management perspective Frontiers of ComputerScience 2013 7 (2) 157minus164(16) Chakrabarti S Cox E Frank E Guting R H Han J JiangX Kamber M Lightstone S S Nadeau T P Neapolitan R EData mining know it all Morgan Kaufmann 2008(17) Gibert K Sanchez-Marre M Izquierdo J A survey on pre-processing techniques Relevant issues in the context of environ-mental data mining AI Communications 2016 29 (6) 627minus663(18) Russell S Norvig P Artificial intelligence a modern approach2002(19) Hinton G E Sejnowski T J Poggio T A UnsupervisedLearning Foundations of Neural Computation MIT press 1999(20) Van Engelen J E Hoos H H A survey on semi-supervisedlearning Machine Learning 2020 109 (2) 373minus440(21) Brink H Richards J W Fetherolf M Cronin B Real-worldmachine learning Manning Shelter Island NY 2017(22) Kassambara A Practical guide to cluster analysis in RUnsupervised machine learning Sthda Vol 1(23) Ghodsi A Dimensionality reduction a short tutorialDepartment of Statistics and Actuarial Science Univ of WaterlooOntario Canada 2006 37 (38) 2006(24) Wright R E Logistic regression 1995(25) Alvarez-Arbesu R Feliciacutesimo A M GIS and logisticregression as tools for environmental management a coastal cliffvegetation model in Northern Spain WIT Transactions on Informationand Communication Technologies 2002 26(26) Liu L Sankarasubramanian A Ranjithan S R Logisticregression analysis to estimate contaminant sources in waterdistribution systems J Hydroinf 2011 13 (3) 545minus557(27) Thoe W Gold M Griesbach A Grimmer M Taggart ML Boehm A B Predicting water quality at Santa Monica Beachevaluation of five different models for public notification of unsafeswimming conditions Water Res 2014 67 105minus117(28) Muharemi F Logofa tu D Andersson C Leon FApproaches to building a detection model for water quality a casestudy In Modern Approaches for Intelligent Information and DatabaseSystems Springer 2018 pp 173minus183(29) Liaw A Wiener M Classification and regression byrandomForest R news 2002 2 (3) 18minus22(30) Gromski P S Muhamadali H Ellis D I Xu Y Correa ETurner M L Goodacre R A tutorial review Metabolomics andpartial least squares-discriminant analysis-a marriage of convenienceor a shotgun wedding Anal Chim Acta 2015 879 10minus23(31) Lee S Kim J Prediction of Nanofiltration and Reverse-Osmosis-Membrane Rejection of Organic Compounds UsingRandom Forest Model J Environ Eng 2020 146 (11) 04020127(32) Castrillo M Garciacutea A L Estimation of high frequencynutrient concentrations from water quality surrogates using machinelearning methods Water Res 2020 172 115490
(33) Tan G Yan J Gao C Yang S Prediction of water qualitytime series data based on least squares support vector machineProcedia Eng 2012 31 1194minus1199(34) Yang Y H Guergachi A Khan G Support vector machinesfor environmental informatics application to modelling the nitrogenremoval processes in wastewater treatment systems Journal ofEnvironmental Informatics 2006 7 (1) 14minus25(35) Feng J Pan L Cui B Sun Y Zhang A Gong M NewApproach for Concentration Prediction of Aqueous PhenolicContaminants by Using Wavelet Analysis and Support VectorMachine Environmental Engineering Science 2020 37 (5) 382minus392(36) Silver D Schrittwieser J Simonyan K Antonoglou IHuang A Guez A Hubert T Baker L Lai M Bolton AMastering the game of go without human knowledge Nature 2017550 (7676) 354minus359(37) Mikolov T Deoras A Povey D Burget L Cernocky JStrategies for Training Large Scale Neural Network Language ModelsIEEE 2011 pp 196minus201(38) Deng J Dong W Socher R Li L-J Li K Fei-Fei L InImagenet A Large-Scale Hierarchical Image Database Ieee 2009 pp248minus255(39) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (Aug) 2493minus2537(40) Alwosheel A van Cranenburgh S Chorus C G Is yourdataset big enough Sample size requirements when using artificialneural networks for discrete choice analysis Journal of choice modelling2018 28 167minus182(41) Yang J Xu J Zhang X Wu C Lin T Ying Y Deeplearning for vibrational spectral analysis Recent progress and apractical guide Anal Chim Acta 2019 1081 6minus17(42) Guo H Jeong K Lim J Jo J Kim Y M Park J-p Kim JH Cho K H Prediction of effluent concentration in a wastewatertreatment plant using machine learning models J Environ Sci 201532 90minus101(43) Inoue J Yamagata Y Chen Y Poskitt C M Sun JAnomaly Detection for a Water Treatment System Using UnsupervisedMachine Learning 2017 2017 IEEE 2017 pp 1058minus1065(44) Voukantsis D Karatzas K Kukkonen J Raumlsaumlnen TKarppinen A Kolehmainen M Intercomparison of air quality datausing principal component analysis and forecasting of PM10 andPM2 5 concentrations using artificial neural networks in Thessalonikiand Helsinki Sci Total Environ 2011 409 (7) 1266minus1276(45) Wold S Esbensen K Geladi P Principal componentanalysis Chemom Intell Lab Syst 1987 2 (1minus3) 37minus52(46) Jolliffe I T Cadima J Principal component analysis a reviewand recent developments Philos Trans R Soc A 2016 374 (2065)20150202(47) Dalal S G Shirodkar P V Jagtap T G Naik B G Rao GS Evaluation of significant sources influencing the variation of waterquality of Kandla creek Gulf of Katchchh using PCA Environ MonitAssess 2010 163 (1minus4) 49minus56(48) Scholleacutee J E Schymanski E L Avak S E Loos MHollender J Prioritizing unknown transformation products frombiologically-treated wastewater using high-resolution mass spectrom-etry multivariate statistics and metabolic logic Anal Chem 2015 87(24) 12121minus12129(49) Masiaacute A Campo J Blasco C Picoacute Y Ultra-highperformance liquid chromatography-quadrupole time-of-flight massspectrometry to identify contaminants in water an insight onenvironmental forensics Journal of Chromatography A 2014 134586minus97(50) Peiris R H Halleacute C Budman H Moresoli C Peldszus SHuck P M Legge R L Identifying fouling events in a membrane-based drinking water treatment process using principal componentanalysis of fluorescence excitation-emission matrices Water Res 201044 (1) 185minus194(51) MacQueen J In Some methods for classification and analysis ofmultivariate observations 1967 Oakland CA 1967 pp 281minus297
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
K
(52) Kingsy G R Manimegalai R Geetha D M S Rajathi SUsha K Raabiathul B N In Air pollution analysis using enhanced K-Means clustering algorithm for real time sensor data IEEE 2016 pp1945minus1949(53) Zou H Zou Z Wang X An enhanced K-means algorithmfor water quality analysis of the Haihe River in China Int J EnvironRes Public Health 2015 12 (11) 14400minus14413(54) Areerachakul S Sanguansintukul S Clustering analysis of waterquality for canals in Bangkok Thailand Springer 2010 pp 215minus227(55) Javadi S Hashemy S M Mohammadi K Howard K W FNeshat A Classification of aquifer vulnerability using K-means clusteranalysis J Hydrol 2017 549 27minus37(56) Li C Sun L Jia J Cai Y Wang X Risk assessment ofwater pollution sources based on an integrated k-means clustering andset pair analysis method in the region of Shiyan China Sci TotalEnviron 2016 557 307minus316(57) Brunner A M Bertelkamp C Dingemans M M LKolkman A Wols B Harmsen D Siegers W Martijn B JOorthuizen W A Ter Laak T L Integration of target analyses non-target screening and effect-based monitoring to assess OMP relatedwater quality changes in drinking water treatment Sci Total Environ2020 705 135779(58) Gupta S Arango-Argoty G Zhang L Pruden A VikeslandP Identification of discriminatory antibiotic resistance genes amongenvironmental resistomes using extremely randomized tree algorithmMicrobiome 2019 7 (1) 123(59) Du X Shao F Wu S Zhang H Xu S Water qualityassessment with hierarchical cluster analysis based on Mahalanobisdistance Environ Monit Assess 2017 189 (7) 1minus12(60) Bibby K Crank K Greaves J Li X Wu Z Hamza I AStachler E Metagenomics and the development of viral water qualitytools npj Clean Water 2019 2 (1) 1minus13(61) Chu B T T Petrovich M L Chaudhary A Wright DMurphy B Wells G Poretsky R Metagenomics reveals the impactof wastewater treatment plants on the dispersal of microorganismsand genes in aquatic sediments Appl Environ Microbiol 2018 84 (5)No e02168-17(62) Petrovich M L Zilberman A Kaplan A Eliraz G R WangY Langenfeld K Duhaime M Wigginton K Poretsky R AvisarD Microbial and Viral Communities and Their Antibiotic ResistanceGenes Throughout a Hospital Wastewater Treatment System FrontMicrobiol 2020 11 153(63) Dai D Rhoads W J Edwards M A Pruden A Shotgunmetagenomics reveals taxonomic and functional shifts in hot watermicrobiome due to temperature setting and stagnation FrontMicrobiol 2018 9 2695(64) Malla M A Dubey A Yadav S Kumar A Hashem AAbd_Allah E F Understanding and designing the strategies for themicrobe-mediated remediation of environmental contaminants usingomics approaches Front Microbiol 2018 9 1132(65) Giwa A S Ali N Athar M A Wang K Dissectingmicrobial community structure in sewage treatment plant forpathogensrsquo detection using metagenomic sequencing technologyArch Microbiol 2020 202 1minus9(66) Guo J Ni B-J Han X Chen X Bond P Peng Y YuanZ Data on metagenomic profiles of activated sludge from a full-scalewastewater treatment plant Data in brief 2017 15 833minus839(67) Laudadio I Fulci V Stronati L Carissimi C Next-generation metagenomics Methodological challenges and opportu-nities Omics a journal of integrative biology 2019 23 (7) 327minus333(68) Soueidan H Nikolski M Machine learning for metagenomicsmethods and tools arXiv preprint arXiv151006621 2015DOI 101515metgen-2016-0001(69) Ramette A Multivariate analyses in microbial ecology FEMSMicrobiol Ecol 2007 62 (2) 142minus160(70) Yang P Yu S Cheng L Ning K Meta-network optimizedspecies-species network analysis for microbial communities BMCGenomics 2019 20 (2) 187
(71) Collignon P Beggs J J Walsh T R Gandra SLaxminarayan R Anthropological and socioeconomic factorscontributing to global antimicrobial resistance a univariate andmultivariable analysis Lancet Planetary Health 2018 2 (9) No e398-e405(72) Arango-Argoty G Garner E Pruden A Heath L SVikesland P Zhang L DeepARG a deep learning approach forpredicting antibiotic resistance genes from metagenomic dataMicrobiome 2018 6 (1) 1minus15(73) El-Gebali S Mistry J Bateman A Eddy S R Luciani APotter S C Qureshi M Richardson L J Salazar G A Smart AThe Pfam protein families database in 2019 Nucleic Acids Res 201947 (D1) D427minusD432(74) Berglund F Marathe N P Osterlund T Bengtsson-PalmeJ Kotsakis S Flach C-F Larsson D G J Kristiansson EIdentification of 76 novel B1 metallo-β-lactamases through large-scalescreening of genomic and metagenomic data Microbiome 2017 5 (1)134(75) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (ARTICLE) 2493minus2537(76) Ng P dna2vec Consistent vector representations of variable-length k-mers arXiv preprint arXiv170106279 2017(77) Woloszynek S Zhao Z Chen J Rosen G L 16S rRNAsequence embeddings Meaningful numeric feature representations ofnucleotide sequences that are convenient for downstream analysesPLoS Comput Biol 2019 15 (2) No e1006721(78) Arango-Argoty G A Heath L S Pruden A Vikesland PZhang L MetaMLP A fast word embedding based classifier to profiletarget gene databases in metagenomic samples bioRxiv 2019 569970(79) Liang Q Bible P W Liu Y Zou B Wei L DeepMicrobestaxonomic classification for metagenomics with deep learning NARGenomics and Bioinformatics 2020 2 (1) No lqaa009(80) Fiannaca A La Paglia L La Rosa M Renda G Rizzo RGaglio S Urso A Deep learning models for bacteria taxonomicclassification of metagenomic data BMC Bioinf 2018 19 (7) 198(81) Krawczyk P S Lipinski L Dziembowski A PlasFlowpredicting plasmid sequences in metagenomic data using genomesignatures Nucleic Acids Res 2018 46 (6) No e35-e35(82) Hollender J van Bavel B Dulio V Farmen E FurtmannK Koschorreck J Kunkel U Krauss M Munthe J Schlabach MHigh resolution mass spectrometry-based non-target screening cansupport regulatory environmental monitoring and chemicals manage-ment Environ Sci Eur 2019 31 (1) 42(83) Soria N G C Montes A Bisson M A Atilla-Gokcumen GE Aga D S Mass spectrometry-based metabolomics to assess uptakeof silver nanoparticles by Arabidopsis thaliana Environ Sci Nano2017 4 (10) 1944minus1953(84) Matich E K Ghafari M Camgoz E Caliskan E Pfeifer BA Haznedaroglu B Z Atilla-Gokcumen G E Time-serieslipidomic analysis of the oleaginous green microalga species Ettliaoleoabundans under nutrient stress Biotechnol Biofuels 2018 11 (1)1minus15(85) Matich E K Soria N G C Aga D S Atilla-Gokcumen GE Applications of metabolomics in assessing ecological effects ofemerging contaminants and pollutants on plants J Hazard Mater2019 373 527minus535(86) Knolhoff A M Premo J H Fisher C M A ProposedQuality Control Standard Mixture and Its Uses for EvaluatingNontargeted and Suspect Screening LCHR-MS Method Perform-ance Anal Chem 2021931596(87) Samanipour S Kaserzon S Vijayasarathy S Jiang H ChoiP Reid M J Mueller J F Thomas K V Machine learningcombined with non-targeted LC-HRMS analysis for a risk warningsystem of chemical hazards in drinking water A proof of conceptTalanta 2019 195 426minus432(88) Angeles L F Islam S Aldstadt J Saqeeb K N Alam MKhan M A Johura F-T Ahmed S I Aga D S Retrospectivesuspect screening reveals previously ignored antibiotics antifungal
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
L
compounds and metabolites in Bangladesh surface waters Sci TotalEnviron 2020 712 136285(89) Albergamo V Scholleacutee J E Schymanski E L Helmus RTimmer H Hollender J De Voogt P Nontarget screening revealstime trends of polar micropollutants in a riverbank filtration systemEnviron Sci Technol 2019 53 (13) 7584minus7594(90) Chiaia-Hernaacutendez A C Scheringer M Muller A StiegerG Waumlchter D Keller A Pintado-Herrera M G Lara-Martin PA Bucheli T D Hollender J Target and suspect screening analysisreveals persistent emerging organic contaminants in soils andsediments Sci Total Environ 2020 740 140181(91) Cariou R Meacutendez-Fernandez P Hutinet S b Guitton YCaurant F Le Bizec B Spitz J r m Vetter W Dervilly GNontargeted LCESI-HRMS Detection of Polyhalogenated Com-pounds in Marine Mammals Stranded on French Atlantic Coasts ACSESampT Water 20211309(92) Travis S C Kordas K Aga D S Optimized workflow forunknown screening using gas chromatography high-resolution massspectrometry expands identification of contaminants in siliconepersonal passive samplers Rapid Commun Mass Spectrom 2021 35(8) No e9048(93) Hollender J Schymanski E L Singer H P Ferguson P LNontarget screening with high resolution mass spectrometry in theenvironment ready to go ACS Publications 2017(94) Muller A Schulz W Ruck W K L Weber W H A newapproach to data evaluation in the non-target screening of organictrace substances in water analysis Chemosphere 2011 85 (8) 1211minus1219(95) Scholleacutee J E Schymanski E L Hollender J Statisticalapproaches for LC-HRMS data to characterize prioritize and identifytransformation products from water treatment processes In AssessingTransformation Products of Chemicals by Non-Target and SuspectScreening- Strategies and Workflows Vol 1 ACS Publications 2016 pp45minus65(96) Sireacuten K Fischer U Vestner J Automated supervisedlearning pipeline for non-targeted GC-MS data analysis AnalyticaChimica Acta X 2019 1 100005(97) Sinkov N A Harynuk J J Cluster resolution A metric forautomated objective and optimized feature selection in chemometricmodeling Talanta 2011 83 (4) 1079minus1087(98) Johnsen L G Amigo J M Skov T Bro R Automatedresolution of overlapping peaks in chromatographic data J Chemom2014 28 (2) 71minus82(99) Smirnov A Jia W Walker D I Jones D P Du X ADAP-GC 32 graphical software tool for efficient spectral deconvolution ofgas chromatography-high-resolution mass spectrometry metabolomicsdata J Proteome Res 2018 17 (1) 470minus478(100) Chen C Husny J Rabe S Predicting fishiness off-flavourand identifying compounds of lipid oxidation in dairy powders bySPME-GCMS and machine learning Int Dairy J 2018 77 19minus28(101) Taghadomi-Saberi S Mas Garcia S Allah Masoumi ASadeghi M Marco S Classification of bitter orange essential oilsaccording to fruit ripening stage by untargeted chemical profiling andmachine learning Sensors 2018 18 (6) 1922(102) Yang Q Xu L Tang L-J Yang J-T Wu B-Q Chen NJiang J-H Yu R-Q Simultaneous detection of multiple inheritedmetabolic diseases using GC-MS urinary metabolomics by chemo-metrics multi-class classification strategies Talanta 2018 186 489minus496(103) Smolinska A Hauschild A C Fijten R R R Dallinga JW Baumbach J Van Schooten F J Current breathomicsa reviewon data pre-processing techniques and machine learning inmetabolomics breath analysis J Breath Res 2014 8 (2) 027105(104) Garrido-Baserba M Corominas L Corteacutes U Rosso DPoch M The fourth-revolution in the water sector encounters thedigital revolution Environ Sci Technol 2020 54 (8) 4698minus4705(105) Vujnovic G Perisic J Bozilovic Z Milovanovic MGetman R V Radovanovic L USING SCADA SYSTEM FOR
PROCESS CONTROL IN WATER INDUSTRY Acta TechnicaCorviniensis-Bulletin of Engineering 2019 12 (2) 67minus72(106) Olsson G ICA and me-a subjective review Water Res 201246 (6) 1585minus1624(107) Zhao H Hou D Huang P Zhang G Water quality eventdetection in drinking water network Water Air Soil Pollut 2014 225(11) 2183(108) Chandola V Banerjee A Kumar V Anomaly detection Asurvey ACM computing surveys (CSUR) 2009 41 (3) 1minus58(109) Hasan J Goldbloom-Helzner D Ichida A Rouse TGibson M Technologies and Techniques for Early Warning Systems toMonitor and Evaluate Drinking Water Quality A State-of-the-ArtReview Environmental Protection Agency Washington Dc Office ofWater 2005(110) Izquierdo J Loacutepez P A Martiacutenez F J Peacuterez R Faultdetection in water supply systems using hybrid (theory and data-driven) modelling Mathematical and Computer Modelling 2007 46(3minus4) 341minus350(111) Murray S Ghazali M McBean E A Real-time water qualitymonitoring assessment of multisensor data using Bayesian beliefnetworks Journal of Water Resources Planning and Management 2012138 (1) 63minus70(112) Raciti M Cucurull J Nadjm-Tehrani S Anomaly detectionin water management systems In Critical Infrastructure ProtectionSpringer 2012 pp 98minus119(113) Ahmed M Mahmood A N Hu J A survey of networkanomaly detection techniques Journal of Network and ComputerApplications 2016 60 19minus31(114) Yuan Y Jia K A Water Quality Assessment Method Based onSparse Autoencoder 2015 IEEE 2015 pp 1minus4(115) Muharemi F Logofatu D Leon F Machine learningapproaches for anomaly detection of water quality on a real-worlddata set Journal of Information and Telecommunication 2019 3 (3)294minus307(116) Alom M Z Taha T M Yakopcic C Westberg S SidikeP Nasrin M S Hasan M Van Essen B C Awwal A A S AsariV K A state-of-the-art survey on deep learning theory andarchitectures Electronics 2019 8 (3) 292(117) Russo S Lurig M Hao W Matthews B Villez K Activelearning for anomaly detection in environmental data EnvironmentalModelling amp Software 2020 134 104869(118) Ahmad S Lavin A Purdy S Agha Z Unsupervised real-time anomaly detection for streaming data Neurocomputing 2017262 134minus147(119) Stocco A Tonella P Towards Anomaly Detectors that LearnContinuously IEEE 2020 pp 201minus208(120) Hoffman M Bach F R Blei D M In Online learning forlatent dirichlet allocation 2010 Citeseer pp 856minus864(121) Mongillo G Deneve S Online learning with hidden Markovmodels Neural computation 2008 20 (7) 1706minus1716(122) Bogler A Packman A Furman A Gross A Kushmaro ARonen A Dagot C Hill C Vaizel-Ohayon D Morgenroth ERethinking wastewater risks and monitoring in light of the COVID-19pandemic Nature Sustainability 2020 9 1minus10(123) Bivins A North D Ahmad A Ahmed W Alm E BeenF Bhattacharya P Bijlsma L Boehm A B Brown J Wastewater-Based Epidemiology Global Collaborative to Maximize Contributions inthe Fight Against COVID-19 ACS Publications 2020
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
M
Table
1continued
machine
learning
model
description
positives
andnegatives
exem
plaryapplications
with
inES
E
resulting
architectureisknow
nas
adeep
neuralnetworkor
deep
learning
(DL)
bullANNstend
tooverfitC
onscientious
modelbuildingthorough
hyper-
parameter
tuning
andextensivevalidationisoftenrequired41
bullConsideredldquoblack
boxrdquo
modelsas
theinternalfunctio
ning
ofANNscanbe
hard
tointerpretThiscan
lead
toresultmisinterpretatio
nandtheacceptance
ofpoor
modelsAthorough
understandingof
varioushyperparam
etersisnecessaryto
understand
thenuancesof
themodel
Principal
Com
ponent
Analysis
(PCA)4546
bullAnunsupervised
technique
PCAcombinesfeatures
andcreatesanew
featureseto
fthe
samesize
where
each
newfeature(ietheldquocom
ponentsrdquo)
isacombinatio
nof
theoriginalfeatures
bullWidelyused
featureextractio
nmethod
Usedto
reduce
thedimensionality
ofthedata
setwhile
preserving
asmuchdata
variability
aspossible
bullEstim
atethevariance
explained
bygivenvariablesof
awater
quality
data47
bullNew
features
areform
edby
finding
theeigenvectors
andcorresponding
eigenvaluesthat
providean
estim
ationof
howmuchvariance
each
component
explains
bullOne
canselect
afewcomponentsthatexplainthemajority
ofthevariability
inthedata
anddrop
the
remaining
ones
bullIntercom
pare
airpollutio
npat-
terns4
4
bullCan
beused
toreduce
multicollinearityinthedataas
newcomponentsareindependento
feachother
bullEstablishthekeycompounds
todistinguishbetweensamples
usinghigh
resolutio
nmassspec-
trom
etry
(HRMS)
data4849
bullEstim
atetheimportance
ofdifferent
variablesandidentify
key-foulantsin
amem
brane
baseddrinking
water
treatm
ent
process5
0
K-means
Clustering
(K-means)51
bullUnsupervisedMLalgorithm
that
aimsto
clustersamples
usingfeature
values
bullSimpleto
implem
entandcaneasilyscaleto
largedata
sets
bullDelineate
different
airquality
indexthresholdclustersusingair
pollutio
ndata52
bullUserdefines
atarget
kvaluewhich
refersto
thenumberof
clustersone
wantsto
groupthesamples
into
bullDisadvantageisthat
oneneedsto
aprioridefinethekvaluewhich
may
bedifficult
bullAnalyze
water
quality
data55354
bullThe
kvaluerefersto
thenumberof
thecentroid
which
refersto
alocatio
nrepresentin
gthecenter
ofagivencluster
bullFails
forclusterswith
complex
nonsphericalgeom
etricshapes
bullGeneratevulnerability
mapsfor
anaquiferusingwater
quality
data55
bullK-m
eans
startswith
random
centroidsanditerativelyfinds
themost
appropriatecentroidsandassignsthesampleinto
respectiveclusters
such
thatthesum
ofthesquareddistance
betweenthesamples
andthecentroid
isminimized
bullClustercentroidscangetskewed
inthepresence
ofoutliershenceitisadvisedto
handleoutliersbefore
applying
k-means
bullRiskassessmentof
water
pollu-
tionsources5
6
Hierarchical
Clustering
bullUnsupervisedMLalgorithm
that
aimsto
clustersamples
usingfeature
values
bullNoaprioriinform
ationaboutthenumberof
clustersisrequired
bullFeatureclustering
basedon
Pearsoncorrelations
inHRMS
data57
bullThe
hierarchicalclustering
algorithm
inadditio
nto
breaking
uptheobjects
into
clustersalso
show
sthehierarchyor
rankingof
thedistance
andshow
showdissimilaroneclusterisfrom
othersH
ierarchicalclustering
isrepresentedas
dendograms
bullTimeintensive
bullEstim
atethesimilarity
between
theresistom
eprofilesof
different
samples
basedon
Brayminus
Curtis
dissimilarity
metric58
bullInputisameasureofdissimilaritybetweensamplesT
hechoice
ofsimilarity
(ordistance)measuresisthemaininfluencing
factor
inhierarchical
clustering
bullCould
bedifficultto
identifythecorrectnumberof
clusters
basedupon
thedendogram
bullWater
quality
assessmentwith
hierarchicalclusteranalysisbased
onMahalanobisdistance59
bullHierarchicalclusterscanbe
built
either
asagglom
erative(bottom-up
each
samplestartsin
itsow
nclusterandclustering
isdone
bymerging
each
sampleandmovingup
thehierarchy)
ordivisive
(top-dow
nallsam
ples
are
putin
oneclusterandsplitsareperformed
recursivelymovingdownthe
hierarchy)
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
D
ESE data sets can incorporate numerous data types withwide spatial and temporal variability Accordingly if possible athorough data visualizationexploration and preprocessingprocess should be performed to gain initial insights andstreamline the data for the subsequent model building steps(Figure 1b) To aid this depending on the data type wellestablished community guidelines could be leveraged Thisprocess typically includes four steps data cleaning integrationreduction and transformation (See the glossary for de-tails)1617
23 Model Building ML Models ML models fall intothree classes supervised unsupervised and semisupervised(Figure 1c) Supervised learning is a ML task where analgorithm is used to train a model using known data and thenthe learned model is used to predict the outcome of new orunforeseen instances18 In supervised learning a sample in thedata set has two components (1) a set of variables (features)that define the sample and (2) class labels (outcomes) that onewants to predict Two types of models are built usingsupervised learning algorithms classification or regressionmodels A classification model is suggested when the outputdata can be categorized into specific groups or classes (iediscrete output variables) If the output variable is a continuousvariable a regression model is recommendedUnsupervised learning is done in the absence of pre-existing
class labels It is a ML category where the goal is to find hiddenand unknown data patterns or to determine the datadistribution across the data set19 Unsupervised learningalgorithms are often used for data exploration and visual-ization The two major types of unsupervised learningapproaches are clustering and dimensionality reductionClustering algorithms are used to group similar samplestogether into clusters whereas dissimilar samples are relegatedto separate clusters Dimensionality reduction algorithms areused to reduce the dimension of the feature set Too manyuninformative features can hinder predictive modeling this isoften referred to as the ldquocurse of dimensionalityrdquo Somecommon supervised and unsupervised learning algorithms arelisted in Table 1Semisupervised learning is a hybrid of supervised and
unsupervised learning where a set of labeled data is used inconjunction with a set of unlabeled training data In manycases getting true labels for a data set can be difficultSemisupervised learning tackles this problem by making use ofunlabeled data in the learning process and then leveraging thislearned information along with the labeled data set forsubsequent prediction20
Model Optimization and Evaluation Supervised learningis usually a three-step process In the first step the data aresplit into training and testing data sets with the split rangingbetween 60 and 80 for training and 40minus20 for testing Thetraining data set is then iteratively separated into training andvalidation sets The model is trained on the training set whilethe validation set is used to tune the model hyperparametersthat consist of the model architecture and the parameters thataffect the speed and quality of the training process Followingthis cross-validation the learned model is tested against thetest data set to evaluate model robustness and accuracy Toensure unbiased model evaluation the test data set cannot beused for training Commonly used model evaluation metricsinclude accuracy F1-score precision recall area under thereceiver operating characteristic (ROC) curve mean absoluteerror (MAE) and mean squared error (MSE)21
Evaluating unsupervised models can be challenging becausedata do not have prior labels Evaluation typically involvesinternal and external validation22 Internal validation is done byestimating inter- and intra-cluster distances to evaluate clusterquality A cluster refers to a collection of data points thataccumulate together based on certain similarities Modelsminimizing intracluster distances while maximizing interclusterdistances are preferred Such a result suggests that thesemodels are performing well at clustering similar data pointstogether Some commonly used metrics for this purpose arethe Adjusted Rand Index and the Silhouette CoefficientHowever a good internal validation score does not guaranteethe effectiveness of the obtained clusters for real applicationsHence external validation is necessary to assess whether thedata points are assigned to the correct clusters This stepusually requires human evaluation or comparison againstbenchmarks For dimensionality reduction task reconstructionerror or loss is estimated to evaluate model performanceThese methods seek to minimize the reconstruction error orlossminusdefined as the distance between the original data pointand its projection onto a lower-dimensional subspace (itsldquoestimaterdquo)23
The ESE community has begun to leverage various MLalgorithms and techniques to analyze environmental data setsWhile it is difficult to determine a priori which ML algorithm isbest suited for a particular data set an improved understandingof the positives and negatives and the inherent assumptions ofthe specific algorithms aid in the choice of the righttechnique(s) Table 1 summarizes algorithms popular withinthe ESE community The table describes the algorithms theiradvantages and disadvantages the assumptions for specific datatypes and gives examples where the algorithms have been usedto address ESE related problems
24 Data Interpretation The final step in the modelbuilding cycle is to interpret the learnt model and developreasonable explanations for the obtained analysis (Figure 1d)Data interpretation could mean extracting important variablescontributing to the model prediction These variables couldhelp in verifying the hypotheses of the study or understandingwhat factors are critical in driving the observations in the studyunder consideration At this stage the literature and expertopinions are queried to make connections between the modeloutputs and knowledge about the relevant chemical physicaland biological phenomena to make data-driven conclusionsand decisions
3 CURRENT APPLICATIONS OF MACHINE LEARNINGIN ESE
In this section we discuss current applications of MLalgorithms within the ESE domain by presenting three casestudies representing different application areas
31 Metagenomic Data Analysis High throughputshotgun (untargeted) metagenomic DNA sequencing offers arobust and effective way to access the microbial world and isnow often used in ESE Differentiating microbial communitiesin different environments6061 studying the dissemination ofantibiotic resistance in environmental systems6263 definingbacterial communities in contaminated environments toexplore bioremediation64 and characterizing microbial com-munities in wastewater treatment processes6566 are allexamples of environments where metagenomic character-ization is increasingly being applied
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
E
In a typical metagenomic study the first step is often tosearch sequenced DNA reads against a reference database toderive taxonomic or functional gene annotation The obtainedtaxonomic or functional gene profiles are then normalized toprovide relative abundance information corresponding todifferent species or genes in the sample that can be visualizedand analyzed to answer specific questions about the sampleContinuous optimization and declining costs of sequencingplatforms have made metagenomics related research moreaccessible than ever before This trend has led to thegeneration of large volumes of publicly available sequencingdata but rendering these data into meaningful interpretationsremains challenging67 Fortunately the advent of ML methodshas immensely accelerated advancement at every cross sectionof a typical metagenomic study68 Here we highlight someapplications of ML in antibiotic resistance-related studiesOne often faced challenge with metagenomic data is that of
high dimensionality and low sample size (HDLSS ie morevariables than independent samples) For example the numberof genes annotated within a given environmental sample caneasily exceed hundreds of thousands Hence many unsuper-vised techniques such as principal component analysis (PCA)and non-metric multi-dimensional scaling (NMDS) are usedas preprocessing steps to reduce dimensionality by removinguninformative features A common application of thesetechniques is to examine the similaritydissimilarity of differentenvironmental samples by clustering based on species or genecomposition69 Network-based ML (data is represented in theform of a graph where nodes are the data points to be clusteredand edges represent the relationship or similarity between thedata points) is another unsupervised approach to studyproteinminusprotein or geneminusgene interactions to understanddifferent functional pathways70
Given the sample labels or response variables and the genecomposition support vector machines (SVMs) ANNs andensemble methods (such as random forest (RF) or extremelyrandomized trees) have been extensively used to performsupervised learning and build predictive models For exampleidentifying interesting patterns and important genes in a dataset58 predicting relative antibiotic resistance abundancelevels7 understanding the role of socioeconomic status inshaping the resistome or microbiome71 or the prediction ofantibiotic resistance phenotype72 are some of the uniqueproblems that have been examined using the above-mentionedmethods Hidden Markov models (HMMs) which can beused in both supervised or unsupervised fashion are one of themore readily used algorithms to detect antibiotic resistancegene variants or potential functional homologues73 and havebeen applied for the discovery of novel ARGs frommetagenomic data74
Word embedding which has gained a lot of popularity overthe recent years is a feature learning technique used in naturallanguage processing (NLP) Word embedding is a term usedfor the representation of words or sentences in the form ofnumeric vectors that can be used for downstream ML tasks75
Raw DNAprotein sequences sliced into k-mers are analogousto the structure of a sentence and can be analyzed in a similarfashion as natural languages Thus there is a rapid thrusttoward analyzing raw sequences using NLP based techni-ques7677 MetaMLP78 based on a similar idea uses a wordembedding based classification model to predict ARGphenotype and has been found to perform 50times faster thanDIAMOND (one of the fastest sequence alignment methods)
with similar accuracy Word embedding based models holdimmense promise in analyzing metagenomic data as they havethe ability to learn patterns directly from within raw sequencesSimilar to the antibiotic resistance example other
applications can also be envisioned For instance analyzingmetagenomics data for taxonomic classification7980 or mobilegenetic element classification81 where ML algorithms wereable to outperform the traditional methods In essence MLalgorithms have enhanced our ability to robustly analyzecomplex metagenomic data
32 Nontarget Analysis of Environmental Samples Arapidly developing approach in environmental analysis andtoxicology involves the use of high resolution massspectrometry (HRMS) based nontarget analysis (NTA)where data on accurate masses of molecular and fragmentions are collected without a priori information on thechemicals being analyzed NTA can potentially aid in thescreening and analysis of the vast and diverse universe oforganic pollutants a grand challenge faced by the ESEcommunity82 Because the chemicals being detected are notpredetermined NTA provides an opportunity to comprehen-sively examine the occurrence fate and transport of chemicalcontaminants in different environmental niches with minimalbias82 Similarly NTA can be applied in environmentalmetabolomics to facilitate understanding of the effects ofchemical perturbations on exposed organisms (eg plantsanimals and humans) without focusing on a particularbiochemical pathway8384 A recent review highlighted studiesthat employed the different types of NTA techniques used inmetabolomics to discover metabolite changes in plants inducedby exposure to xenobiotics (eg pharmaceuticals personal careproducts pesticides flame retardants and engineered nano-materials) to examine the effect of altered levels of nutrients inthe environment on plant systems85 Here we present a briefsummary of how the combination of NTA with ML canadvance monitoring of water wastewater quality soil andexposed organisms (eg humans wildlife)While there are various methods that can be used for NTA
in ESE HRMS is the most popular because of its capacity forsensitive detection of low levels of contaminants andmetabolites in complex environmental samples However thepower of NTA using HRMS has been limited by problemsassociated with sample preparation and data analysis Inanalyzing environmental samples using MS sample concen-tration and cleanup are critical because the signal intensity ofthe MS features depend not only on the concentrations of thechemicals but also on the amounts and the nature of thematrix present in the sample extracts The reproducibility ofthe ionization of compounds in MS can be compromisedsignificantly by matrix effects Therefore it is important to havean appropriate number of replicate samples and properlyselected blank samples when conducting NTA Solid-phaseextraction for sample cleanup is commonly used but thisapproach can introduce bias because highly polar contaminantsmay be lost during sample preparation Unlike in targetanalysis where variations due to matrix effects and samplelosses can be corrected using stable isotope-labeled referencecompounds as surrogates this approach cannot be used inNTA because the purpose of NTA is to identify unknowncontaminants that were not included in the target list Tonormalize for instrumental variation and matrix effects internalstandards with varying polarity can be added to the sampleextract prior to analysis Recently several compounds of
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
F
diverse structure log Kow chromatographic behavior andionization efficiency have been proposed for inclusion in aquality control mixture usable for NTA86 However theproposed mixture has limitations such as for the detection ofhydrophilic compounds and molecular formula generation forcompounds containing fluorine Finally due to the substantialamount of MS data acquired indiscriminately under full-scanand the subsequent MSMS fragmentation of molecular andfragment ions data processing to identify relevant features isdaunting Therefore NTA workflows with built-in filters andcriteria are being developed to facilitate prioritization of MSfeatures that are useful for chemical structure annotation Inthis regard advanced data processing tools high throughputstatistical packages and user-friendly visualization programsare needed to fully interrogate the rich data sets acquired byNTANTA based on HRMS holds great promise for the
comprehensive monitoring of the occurrence and fate ofcontaminants in water and wastewater87 It also allowsresearchers to retrospectively analyze stored HRMS data toscreen for suspected contaminants (ie suspect screening) thatmay have been missed during target analysis88 Using advancedcomputational strategies such as hierarchical cluster analysiscommon patterns in the occurrence of contaminants in theenvironment and their emission pathways can be predictedbased on time series analysis of the aggregated NTA data89
NTA combined with cluster analysis was successfully appliedto reveal previously unmonitored chemical contaminants insoil and sediment samples90 The application of NTA andadvanced postacquisition data treatment will continue toenhance our ability to discover emerging contaminants in theenvironment including those that bioaccumulate and poserisks to humans and wildlife For instance new polyhalo-genated compounds were detected for the first time in blubbersamples from marine mammal sentinel species using both LC-HRMS and GC-HRMS for analysis and an open-source datamining software in the R programming environment thatdetected halogenated signatures in full scan HRMS91 FinallyNTA combined with personal passive samplers and propersample preparation techniques can be used to unlock thecomposition of chemical mixtures that humans are exposed toon a daily basis which can be used in investigating the humanexposome92
Akin to metagenomic data NTA data pose a similar HDLSSchallenge as each mass spectrum constitutes a large number ofpeaks representing potential compounds present in a givensample Hence data preprocessing is crucial and inevitableVarious data preprocessing steps are performed such asdetecting peaks subtracting peaks that were found in blankor control samples componentization (ie grouping of signalsthat probably belong to one unique molecular structure) andremoving noise using replicate measurements93 Followingpreprocessing various supervised and unsupervised MLalgorithms can be applied to engineer select and extractrelevant features However before any further analysis datanormalization and data scaling are two crucial steps thatrequire considerationThe most common algorithms used for NTA are linear
projection methods such as PCA and supervised partial least-squares discriminant analysis (PLS-DA)949548 PCA aids insample comparison by removing variance from the sample setSupervised PLS-DA along with relevant metadata can be usedto extract features pertaining to the specific questions that are
being asked However a major challenge is that featuredetection based on peak intensity can be misleading because oferroneous peak assignments Also relevant information can belost when differentiating the actual contaminant signals andbackground noise when using intensity information for featureselection Hence several alternative approaches that aim atusing raw data signals (retention time times mass-to-charge ratio)that bypass the peak detection step have been proposed andimplemented for improved extraction of information andunderlying patterns in the data set96minus98 This includestechniques such as transforming the raw data signal to distancematrices for dissimilarity analysis96minus98 and using clustering toextract features99
A number of other data analysis algorithms exist but havenot yet become common in NTA environmental analysisClustering techniques such as k-means and hierarchicalclustering can be used to identify similar samples Algorithmslike RF SVMs and ANNs can be applied to classify samplesbased on different categories and would be advantageous inillustrating nonlinear relations in the data These algorithmshave shown excellent promise in analyzing HRMS data inother fields100minus103 and this promise can certainly be extendedto environmental sample analysis
33 Anomaly Detection in Engineered Water Sys-tems (EWS) EWS is an umbrella term for systems of watercollection treatment distribution storage and their operationRecently there has been a major thrust toward digitalization ofthe water sector104 In particular the evolution of cyberinfras-tructure and of online process control instrumentation has ledto the development of advanced process control solutions suchas supervisory control and data acquisition (SCADA)systems105106 Such advances have enabled water utilities tocontinuously monitor water quality identify problems andeffectively oversee maintenance issues both remotely and morelocally These systems entail collection of a large volume of rawdata that could be in conjunction with appropriate dataanalysis techniques transformed into valuable information thatcan be leveraged to make proactive decisions to optimizeoverall performance107 In particular there is surging interest inusing ML techniques to identify unusual patterns in raw EWSdata as a means of discovering unexpected activitiesminusthis isbroadly termed anomaly detection108
A typical anomaly detection task in EWS aims todifferentiate between natural expected variations in waterquality and unusual or suspicious variations caused bycontamination or failure somewhere in the system Thechallenge is to characterize these normal water qualityvariations as it requires analysis of long-term data thatencompasses inherent background variability109 This charac-terized normal response helps to flag anomalous events in thedata Various ML algorithms have been applied to address thisproblem (Table 1)Within EWS one primary form of the generated data is a
time series where each time point can be considered a discretesample The time series consists of measurements of indicatorscollected over time from one or several sensors which formthe feature set Hence each sample can be represented as amultidimensional vector where each dimension represents afeature Historic data is then used to train the model Givenappropriate analysis of the supporting data set one can frameand solve problems to address a multitude of aims such asdetecting leaks sensor failures abrupt changes in water qualityor contamination events110minus112
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
G
There are a number of different anomaly detectiontechniques available113 In the unsupervised approachunlabeled samples are clustered using an algorithm such ask-means density-based clustering algorithms or expectation-maximization (EM) clustering The concept is that the normaldata points cluster to form high density clusters whereasanomalous data points cluster separately or distant from thenormal data pointsclusters and are located in low-densityregions This way one can carry out additional investigationusing new samples and classify them relative to the normaldata In the supervised approach labels (normal or anomalous)are assigned to each sample based on past information andexpert knowledge of anomalous behavior In this way theproblem is converted into a binary classification problemThough it can easily be extended to a multiclass classificationto categorize different types of anomalies SVM ANNBayesian Network Logistic Regression models and theirvariants are well explored algorithms in this space113 A specialcase of SVM One-Class SVM (OCSVM) is a widely usedmethod for anomaly detection In OCSVM the entire trainingdata set is considered as one-class (eg normal class) and thenew data points are classified as similar (normal) or different(anomalous) to the training data Because considering all datapoints from one class is equivalent to having no label OCSVMis considered as unsupervised learning methodOwing to the success of ML in detecting anomalies in other
domains such as network security researchers have startedexploring its potential in water quality anomaly detec-tion43114115 Based on the studies published so far DLalgorithms have shown promise in detecting anomalies andhave outperformed conventional techniques on a number ofoccasions However it should be noted that DL methods couldbe slow to train depending on the depth of the network andthe amount of data available116
Many studies have adopted a batch learning approach wherethe model is trained on historical data and then the new data iscategorized using this trained model With a continuousincoming time-series data stream and the possibility of novelanomalies retraining the models with every new data set canquickly become impractical and difficult to reliably executeContinual or active learning frameworks that continuouslylearn as the new data stream comes in and that can identifyanomalous behavior in real time are required to circumventthese issues117118 Variants of Latent Dirichlet AllocationMarkov Models and ANN based architectures are algorithmsthat have been applied in other domains to achieve continualonline learning119minus121 However thes approaches remainunderexplored for anomaly detection in water systems asthese frameworks are not trivial and have their ownimplementation challenges118119
Ultimately anomaly detection will be most valuable if it canlearn continually as the data stream comes in and yield real-time reporting that informs immediate corrective action It iscrucial that the models being utilized are fast and accurateHence going forward there is a need to shift the focus towardhybrid approaches (combinations of different algorithms) tobuild powerful models that are able to detect multiple types ofanomalies in real-time Such models will enhance EWSanomaly detection and advance public health9118119
4 PATH FORWARDThere is increasing interest in and a growing body of researchdocumenting how data analytics is being used to address ESE
problems As illustrated in this Feature the power of dataanalytics has been widely recognized as has been the vast needfor its application Anomaly detection has particularlypromising applications for water professionals and practi-tioners Notably broader application of metagenomics andNTA can revolutionize environmental monitoring effortsHowever the application of data analytics in ESE practiceremains in its infancy and concerted efforts are required tomake data analytics an integral part of ESE research andeducation Such a goal would most effectively be achieved ifthere were an agreed-upon plan of action Coordinated effortson several fronts are needed to help the ESE community reapthe potential benefits of data analyticsFirst there is a need to encourage collaborations between
data scientists and ESE practitioners that involve diverselyengaged and integrated research teams from multipledisciplines Such collaborations will facilitate cross-disciplinarycommunication and simultaneous skills building For exampledata scientists come from a culture highly supportive of datasharing (eg GitHub Bitbucket public databases) The ESEcommunity should embrace this culture and incorporate datasharing both locally regionally and globally Recent efforts toaddress the COVID-19 pandemic122123 the global dissem-ination of antimicrobial resistance7 and the development ofglobally vetted data analysis approaches within the NTAcommunity are all steps in this direction Such collaborationshave the potential to yield profound data-driven insights andconclusions that would be impossible to achieve within normalacademic and professional silos Working on their own datascientists may create models that answer specific questions butlack the interpretive training required to contextualize theirresults into real world applications Simultaneously ESEpractitioners with data may not possess the correspondingtools or insights required for proper analysis Strategiccollaborations will lead to improved data analytics datainterpretation and improved decision makingSecond there is a lack of user-friendly data analytics tools
and platforms devised for ESE problems Although under-standing the theory behind ML algorithms is not a difficult taskfor many trained scientists a major hurdle that ESE researchersface is in the coding and implementation of these models Withthe continual advancement of data science programming isbecoming an increasingly necessary skill without which modelsmay be improperly developed or may lack efficiency User-friendly tools may circumvent the problems with codinglearning curves and may be essential for widespreadpractitioner adoption While many tools (Microsoft powerBI Weka Tableau Azure) are available for general dataanalysis application-specific toolsplatforms would greatly aidin performing end-to-end analysesThird the field needs to incorporate data analytics-oriented
curricula This could include the addition of data analyticsfocused modules within existing coursework the introductionof new data science and statistics courses within ESE degreeprograms and the development of relevant capstone projectsfor ESE applications Such experiential learning would helpstudents understand the intricacies of different algorithms andprovide hands-on experience in applying various data analyticstechniques in the context of specific ESE problems Furtherencouraging students to take part in data science internshipswould be an excellent means of developing such expertiseFourth introducing data analytics workshops and making
the relevant resources available is valuable for students
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
H
researchers and professionals A plethora of online resourcessuch as Coursera LinkedIn Learning and Edx are available forlow or no cost offering high-quality content in data scienceExpanding access to and creation of open source learningplatforms could prove instrumental in instigating the data-driven problem-solving approachFinally we know that data analytics is not new but
continues to evolve at a rapid pace These advancementsfurnish us with new and powerful ways to analyze the data thatcan help holistically tackle the challenges faced by ESEcommunity and ultimately inform decision-making and policyformulation at scales never previously imagined Hence it iscritical to be abreast of new developments in data analytics andto continue making efforts to harvest their full potential
AUTHOR INFORMATIONCorresponding AuthorPeter Vikesland minus Via Department of Civil andEnvironmental Engineering Virginia Tech BlacksburgVirginia 24061 United States orcidorg0000-0003-2654-5132 Email pvikesvtedu
AuthorsSuraj Gupta minus The Interdisciplinary PhD Program in GeneticsBioinformatics and Computational Biology Virginia TechBlacksburg Virginia 24061 United States
Diana Aga minus Department of Chemistry University at BuffaloThe State University of New York Buffalo New York 14226United States orcidorg0000-0001-6512-7713
Amy Pruden minus Via Department of Civil and EnvironmentalEngineering Virginia Tech Blacksburg Virginia 24061United States orcidorg0000-0002-3191-6244
Liqing Zhang minus Department of Computer Science VirginiaTech Blacksburg Virginia 24061 United States
Complete contact information is available athttpspubsacsorg101021acsest1c01026
NotesThe authors declare no competing financial interestBiography
Peter J Vikesland is the Nick Prillaman Professor of Civil andEnvironmental Engineering at Virginia Tech He received his BAfrom Grinnell College in Chemistry in 1993 and MS and PhD inCivil and Environmental Engineering from University of Iowa in 1995and 1998 Vikeslandrsquos research focuses on the fate of nanomaterials inthe environment nanotechnology enabled sensor platforms forenvironmental quality assessment the environmental disseminationof antibiotic resistance and the application of data analytics in
environmental science and engineering He is a past President of theAssociation of Environmental Engineering and Science Professors(AEESP) is a US National Science Foundation CAREER awardeeand is the recipient of the 2018 Walter Weber Research InnovationAward from AEESP
ACKNOWLEDGMENTS
This study was supported by National Science Foundation(NSF) awards OISE (1545756) ECCSS NNCI (1542100)and CSSI (2004751) Additional support was provided by theCenter for Science and Engineering of the Exposome at theVirginia Tech Institute for Critical Technology and AppliedScience (ICTAS) and the Virginia Tech Graduate Schoolsupported Genetics Bioinformatics and ComputationalBiology (GBCB) and Sustainable Nanotechnology (VTSuN)programs
GLOSSARY
Accuracy Correct predictionstotal predictions
Autoencoders A deep learning techni-que that aims to build amodel where the outputtargets are set equal tothe input variables Itseeks to learn an ap-proximate representa-tion of the data andreconstruct it It is anunsupervised learningapproach used for di-mensionality reduction
Bayesian Network Bayesian networks are akind of probabilisticgraphical model thatutilizes Bayesian infer-ence for probability es-timations Bayesian net-works are used tomodel conditional de-pendence and hencecausation using a direc-ted graph
Data cleaning Identification of outliersor filling in of missingvalues Outlier identifi-cation can be done byusing Z-score quartilevalues or hypothesistesting Missing datacan be handled in mul-tiple ways such as fillingthe value by computingthe summary statisticsof the given variableusing predicted valuescomputed by an MLalgorithm manual cura-tion or by ignoring themissing record
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
I
Data integration The combination ofmultiple data sourcesinto a single coherentform
Data reduction Aggregation or elimina-tion of redundant in-formation to reduce thedata set size
Data transformation Conversion of raw databy normalizing to acommon scale to en-sure consistency andcomparability
Density Based Clustering Work by recognizingldquodenserdquo groups ofpoints permitting it tolearn clusters of discre-tionary shape and dis-tinguish anomalies inthe information
Dimensionality Reduction The process of reducingthe number of randomvariables or attributesunder consideration
EM Clustering Similar to k-means ex-cept it assigns the sam-ples into clusters basedon the probabilities es-timated using the EMalgorithm The objec-tive is to maximize theoverall probability ofthe data for the given(final) clusters
Ensemble Methods Ensemble methods arealgorithms that com-bine predictions fromseveral base models toobtain one optimalmodel
F1-Score F1-Score is a harmonicmean of recall andprecision
HMM HMM is a statisticalMarkov mode l i nwhich the system beingmodeled is assumed tobe a Markov process
k-mer A substring with alength k in a biologicalsequence of nucleoti-des
Naive Bayes Naive Bayes methodsare a family of simpleprobabilistic classifiersbased on Bayes ruleThe method is calledldquoNaiverdquo for its assump-tion of conditional in-dependence among thefeatures
Natural Language Processing (NLP) A subfield of computerscience artificial intelli-gence and linguisticsthat deals with theinteraction of com-puters with human lan-guages
NMDS NMDS is an ordinationtechnique based on dis-tance or dissimilaritymatrix NMDS repre-sents pairwise dissimi-larity between samplesin a low-dimensionalspace
PLS-DA PLS-DA is a linearclassification modelthat is able to predictthe class of new sam-ples
R Programming languagefor Statistical Comput-ing
ROC-Curve A receiver operatingcharacteristic (ROC)curve is a plot of truepositive rate (TPR-yaxis) against the falsepositive rate (FPR-xaxis) It measures theperformance of a classi-fier
SCADA Supervisory control andd a t a a c q u i s i t i o n(SCADA) is a systemdesigned to gather andanalyze real time dataIt is used in waterwastewater treatmentplants to monitor andmanage processes
REFERENCES(1) Tukey J W The future of data analysis Ann Math Stat 196233 (1) 1minus67(2) Hayashi C What is data science Fundamental concepts and aheuristic example In Data science classification and related methodsSpringer 1998 pp 40minus51(3) Bishop C M Pattern recognition and machine learning Springer2006(4) Qiu J Wu Q Ding G Xu Y Feng S A survey of machinelearning for big data processing EURASIP Journal on Advances inSignal Processing 2016 2016 (1) 67(5) Agarwal R Dhar V Big data data science and analytics Theopportunity and challenge for IS research In INFORMS 201425443(6) Wilcox C Woon W L Aung Z Applications of machinelearning in environmental engineering Citeseer 2013(7) Hendriksen R S Munk P Njage P Van Bunnik BMcNally L Lukjancenko O Roumlder T Nieuwenhuijse DPedersen S K Kjeldgaard J Global monitoring of antimicrobialresistance based on metagenomics analyses of urban sewage NatCommun 2019 10 (1) 1124(8) Sobus J R Wambaugh J F Isaacs K K Williams A JMcEachran A D Richard A M Grulke C M Ulrich E MRager J E Strynar M J Integrating tools for non-targeted analysis
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
J
research and chemical safety evaluations at the US EPA J ExposureSci Environ Epidemiol 2018 28 (5) 411minus426(9) Dogo E M Nwulu N I Twala B Aigbavboa C A survey ofmachine learning methods applied to anomaly detection on drinking-water quality data Urban Water J 2019 16 (3) 235minus248(10) Wilde F D Radtke D B Gibs J Iwatsubo R T Nationalfield manual for the collection of water-quality data US Geological SurveyTechniques of Water-Resources Investigations Book 9 1998(11) National Research Council Confronting the nationrsquos waterproblems The role of research National Academies Press 2004(12) Bharti R Grimm D G Current challenges and best-practiceprotocols for microbiome analysis Briefings in bioinformatics202122178(13) Veenaas C Developing tools for non-target analysis and digitalarchiving of organic urban water pollutants 2018(14) Keim D A Visual exploration of large data sets CommunACM 2001 44 (8) 38minus44(15) Chen J Chen Y Du X Li C Lu J Zhao S Zhou X Bigdata challenge a data management perspective Frontiers of ComputerScience 2013 7 (2) 157minus164(16) Chakrabarti S Cox E Frank E Guting R H Han J JiangX Kamber M Lightstone S S Nadeau T P Neapolitan R EData mining know it all Morgan Kaufmann 2008(17) Gibert K Sanchez-Marre M Izquierdo J A survey on pre-processing techniques Relevant issues in the context of environ-mental data mining AI Communications 2016 29 (6) 627minus663(18) Russell S Norvig P Artificial intelligence a modern approach2002(19) Hinton G E Sejnowski T J Poggio T A UnsupervisedLearning Foundations of Neural Computation MIT press 1999(20) Van Engelen J E Hoos H H A survey on semi-supervisedlearning Machine Learning 2020 109 (2) 373minus440(21) Brink H Richards J W Fetherolf M Cronin B Real-worldmachine learning Manning Shelter Island NY 2017(22) Kassambara A Practical guide to cluster analysis in RUnsupervised machine learning Sthda Vol 1(23) Ghodsi A Dimensionality reduction a short tutorialDepartment of Statistics and Actuarial Science Univ of WaterlooOntario Canada 2006 37 (38) 2006(24) Wright R E Logistic regression 1995(25) Alvarez-Arbesu R Feliciacutesimo A M GIS and logisticregression as tools for environmental management a coastal cliffvegetation model in Northern Spain WIT Transactions on Informationand Communication Technologies 2002 26(26) Liu L Sankarasubramanian A Ranjithan S R Logisticregression analysis to estimate contaminant sources in waterdistribution systems J Hydroinf 2011 13 (3) 545minus557(27) Thoe W Gold M Griesbach A Grimmer M Taggart ML Boehm A B Predicting water quality at Santa Monica Beachevaluation of five different models for public notification of unsafeswimming conditions Water Res 2014 67 105minus117(28) Muharemi F Logofa tu D Andersson C Leon FApproaches to building a detection model for water quality a casestudy In Modern Approaches for Intelligent Information and DatabaseSystems Springer 2018 pp 173minus183(29) Liaw A Wiener M Classification and regression byrandomForest R news 2002 2 (3) 18minus22(30) Gromski P S Muhamadali H Ellis D I Xu Y Correa ETurner M L Goodacre R A tutorial review Metabolomics andpartial least squares-discriminant analysis-a marriage of convenienceor a shotgun wedding Anal Chim Acta 2015 879 10minus23(31) Lee S Kim J Prediction of Nanofiltration and Reverse-Osmosis-Membrane Rejection of Organic Compounds UsingRandom Forest Model J Environ Eng 2020 146 (11) 04020127(32) Castrillo M Garciacutea A L Estimation of high frequencynutrient concentrations from water quality surrogates using machinelearning methods Water Res 2020 172 115490
(33) Tan G Yan J Gao C Yang S Prediction of water qualitytime series data based on least squares support vector machineProcedia Eng 2012 31 1194minus1199(34) Yang Y H Guergachi A Khan G Support vector machinesfor environmental informatics application to modelling the nitrogenremoval processes in wastewater treatment systems Journal ofEnvironmental Informatics 2006 7 (1) 14minus25(35) Feng J Pan L Cui B Sun Y Zhang A Gong M NewApproach for Concentration Prediction of Aqueous PhenolicContaminants by Using Wavelet Analysis and Support VectorMachine Environmental Engineering Science 2020 37 (5) 382minus392(36) Silver D Schrittwieser J Simonyan K Antonoglou IHuang A Guez A Hubert T Baker L Lai M Bolton AMastering the game of go without human knowledge Nature 2017550 (7676) 354minus359(37) Mikolov T Deoras A Povey D Burget L Cernocky JStrategies for Training Large Scale Neural Network Language ModelsIEEE 2011 pp 196minus201(38) Deng J Dong W Socher R Li L-J Li K Fei-Fei L InImagenet A Large-Scale Hierarchical Image Database Ieee 2009 pp248minus255(39) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (Aug) 2493minus2537(40) Alwosheel A van Cranenburgh S Chorus C G Is yourdataset big enough Sample size requirements when using artificialneural networks for discrete choice analysis Journal of choice modelling2018 28 167minus182(41) Yang J Xu J Zhang X Wu C Lin T Ying Y Deeplearning for vibrational spectral analysis Recent progress and apractical guide Anal Chim Acta 2019 1081 6minus17(42) Guo H Jeong K Lim J Jo J Kim Y M Park J-p Kim JH Cho K H Prediction of effluent concentration in a wastewatertreatment plant using machine learning models J Environ Sci 201532 90minus101(43) Inoue J Yamagata Y Chen Y Poskitt C M Sun JAnomaly Detection for a Water Treatment System Using UnsupervisedMachine Learning 2017 2017 IEEE 2017 pp 1058minus1065(44) Voukantsis D Karatzas K Kukkonen J Raumlsaumlnen TKarppinen A Kolehmainen M Intercomparison of air quality datausing principal component analysis and forecasting of PM10 andPM2 5 concentrations using artificial neural networks in Thessalonikiand Helsinki Sci Total Environ 2011 409 (7) 1266minus1276(45) Wold S Esbensen K Geladi P Principal componentanalysis Chemom Intell Lab Syst 1987 2 (1minus3) 37minus52(46) Jolliffe I T Cadima J Principal component analysis a reviewand recent developments Philos Trans R Soc A 2016 374 (2065)20150202(47) Dalal S G Shirodkar P V Jagtap T G Naik B G Rao GS Evaluation of significant sources influencing the variation of waterquality of Kandla creek Gulf of Katchchh using PCA Environ MonitAssess 2010 163 (1minus4) 49minus56(48) Scholleacutee J E Schymanski E L Avak S E Loos MHollender J Prioritizing unknown transformation products frombiologically-treated wastewater using high-resolution mass spectrom-etry multivariate statistics and metabolic logic Anal Chem 2015 87(24) 12121minus12129(49) Masiaacute A Campo J Blasco C Picoacute Y Ultra-highperformance liquid chromatography-quadrupole time-of-flight massspectrometry to identify contaminants in water an insight onenvironmental forensics Journal of Chromatography A 2014 134586minus97(50) Peiris R H Halleacute C Budman H Moresoli C Peldszus SHuck P M Legge R L Identifying fouling events in a membrane-based drinking water treatment process using principal componentanalysis of fluorescence excitation-emission matrices Water Res 201044 (1) 185minus194(51) MacQueen J In Some methods for classification and analysis ofmultivariate observations 1967 Oakland CA 1967 pp 281minus297
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
K
(52) Kingsy G R Manimegalai R Geetha D M S Rajathi SUsha K Raabiathul B N In Air pollution analysis using enhanced K-Means clustering algorithm for real time sensor data IEEE 2016 pp1945minus1949(53) Zou H Zou Z Wang X An enhanced K-means algorithmfor water quality analysis of the Haihe River in China Int J EnvironRes Public Health 2015 12 (11) 14400minus14413(54) Areerachakul S Sanguansintukul S Clustering analysis of waterquality for canals in Bangkok Thailand Springer 2010 pp 215minus227(55) Javadi S Hashemy S M Mohammadi K Howard K W FNeshat A Classification of aquifer vulnerability using K-means clusteranalysis J Hydrol 2017 549 27minus37(56) Li C Sun L Jia J Cai Y Wang X Risk assessment ofwater pollution sources based on an integrated k-means clustering andset pair analysis method in the region of Shiyan China Sci TotalEnviron 2016 557 307minus316(57) Brunner A M Bertelkamp C Dingemans M M LKolkman A Wols B Harmsen D Siegers W Martijn B JOorthuizen W A Ter Laak T L Integration of target analyses non-target screening and effect-based monitoring to assess OMP relatedwater quality changes in drinking water treatment Sci Total Environ2020 705 135779(58) Gupta S Arango-Argoty G Zhang L Pruden A VikeslandP Identification of discriminatory antibiotic resistance genes amongenvironmental resistomes using extremely randomized tree algorithmMicrobiome 2019 7 (1) 123(59) Du X Shao F Wu S Zhang H Xu S Water qualityassessment with hierarchical cluster analysis based on Mahalanobisdistance Environ Monit Assess 2017 189 (7) 1minus12(60) Bibby K Crank K Greaves J Li X Wu Z Hamza I AStachler E Metagenomics and the development of viral water qualitytools npj Clean Water 2019 2 (1) 1minus13(61) Chu B T T Petrovich M L Chaudhary A Wright DMurphy B Wells G Poretsky R Metagenomics reveals the impactof wastewater treatment plants on the dispersal of microorganismsand genes in aquatic sediments Appl Environ Microbiol 2018 84 (5)No e02168-17(62) Petrovich M L Zilberman A Kaplan A Eliraz G R WangY Langenfeld K Duhaime M Wigginton K Poretsky R AvisarD Microbial and Viral Communities and Their Antibiotic ResistanceGenes Throughout a Hospital Wastewater Treatment System FrontMicrobiol 2020 11 153(63) Dai D Rhoads W J Edwards M A Pruden A Shotgunmetagenomics reveals taxonomic and functional shifts in hot watermicrobiome due to temperature setting and stagnation FrontMicrobiol 2018 9 2695(64) Malla M A Dubey A Yadav S Kumar A Hashem AAbd_Allah E F Understanding and designing the strategies for themicrobe-mediated remediation of environmental contaminants usingomics approaches Front Microbiol 2018 9 1132(65) Giwa A S Ali N Athar M A Wang K Dissectingmicrobial community structure in sewage treatment plant forpathogensrsquo detection using metagenomic sequencing technologyArch Microbiol 2020 202 1minus9(66) Guo J Ni B-J Han X Chen X Bond P Peng Y YuanZ Data on metagenomic profiles of activated sludge from a full-scalewastewater treatment plant Data in brief 2017 15 833minus839(67) Laudadio I Fulci V Stronati L Carissimi C Next-generation metagenomics Methodological challenges and opportu-nities Omics a journal of integrative biology 2019 23 (7) 327minus333(68) Soueidan H Nikolski M Machine learning for metagenomicsmethods and tools arXiv preprint arXiv151006621 2015DOI 101515metgen-2016-0001(69) Ramette A Multivariate analyses in microbial ecology FEMSMicrobiol Ecol 2007 62 (2) 142minus160(70) Yang P Yu S Cheng L Ning K Meta-network optimizedspecies-species network analysis for microbial communities BMCGenomics 2019 20 (2) 187
(71) Collignon P Beggs J J Walsh T R Gandra SLaxminarayan R Anthropological and socioeconomic factorscontributing to global antimicrobial resistance a univariate andmultivariable analysis Lancet Planetary Health 2018 2 (9) No e398-e405(72) Arango-Argoty G Garner E Pruden A Heath L SVikesland P Zhang L DeepARG a deep learning approach forpredicting antibiotic resistance genes from metagenomic dataMicrobiome 2018 6 (1) 1minus15(73) El-Gebali S Mistry J Bateman A Eddy S R Luciani APotter S C Qureshi M Richardson L J Salazar G A Smart AThe Pfam protein families database in 2019 Nucleic Acids Res 201947 (D1) D427minusD432(74) Berglund F Marathe N P Osterlund T Bengtsson-PalmeJ Kotsakis S Flach C-F Larsson D G J Kristiansson EIdentification of 76 novel B1 metallo-β-lactamases through large-scalescreening of genomic and metagenomic data Microbiome 2017 5 (1)134(75) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (ARTICLE) 2493minus2537(76) Ng P dna2vec Consistent vector representations of variable-length k-mers arXiv preprint arXiv170106279 2017(77) Woloszynek S Zhao Z Chen J Rosen G L 16S rRNAsequence embeddings Meaningful numeric feature representations ofnucleotide sequences that are convenient for downstream analysesPLoS Comput Biol 2019 15 (2) No e1006721(78) Arango-Argoty G A Heath L S Pruden A Vikesland PZhang L MetaMLP A fast word embedding based classifier to profiletarget gene databases in metagenomic samples bioRxiv 2019 569970(79) Liang Q Bible P W Liu Y Zou B Wei L DeepMicrobestaxonomic classification for metagenomics with deep learning NARGenomics and Bioinformatics 2020 2 (1) No lqaa009(80) Fiannaca A La Paglia L La Rosa M Renda G Rizzo RGaglio S Urso A Deep learning models for bacteria taxonomicclassification of metagenomic data BMC Bioinf 2018 19 (7) 198(81) Krawczyk P S Lipinski L Dziembowski A PlasFlowpredicting plasmid sequences in metagenomic data using genomesignatures Nucleic Acids Res 2018 46 (6) No e35-e35(82) Hollender J van Bavel B Dulio V Farmen E FurtmannK Koschorreck J Kunkel U Krauss M Munthe J Schlabach MHigh resolution mass spectrometry-based non-target screening cansupport regulatory environmental monitoring and chemicals manage-ment Environ Sci Eur 2019 31 (1) 42(83) Soria N G C Montes A Bisson M A Atilla-Gokcumen GE Aga D S Mass spectrometry-based metabolomics to assess uptakeof silver nanoparticles by Arabidopsis thaliana Environ Sci Nano2017 4 (10) 1944minus1953(84) Matich E K Ghafari M Camgoz E Caliskan E Pfeifer BA Haznedaroglu B Z Atilla-Gokcumen G E Time-serieslipidomic analysis of the oleaginous green microalga species Ettliaoleoabundans under nutrient stress Biotechnol Biofuels 2018 11 (1)1minus15(85) Matich E K Soria N G C Aga D S Atilla-Gokcumen GE Applications of metabolomics in assessing ecological effects ofemerging contaminants and pollutants on plants J Hazard Mater2019 373 527minus535(86) Knolhoff A M Premo J H Fisher C M A ProposedQuality Control Standard Mixture and Its Uses for EvaluatingNontargeted and Suspect Screening LCHR-MS Method Perform-ance Anal Chem 2021931596(87) Samanipour S Kaserzon S Vijayasarathy S Jiang H ChoiP Reid M J Mueller J F Thomas K V Machine learningcombined with non-targeted LC-HRMS analysis for a risk warningsystem of chemical hazards in drinking water A proof of conceptTalanta 2019 195 426minus432(88) Angeles L F Islam S Aldstadt J Saqeeb K N Alam MKhan M A Johura F-T Ahmed S I Aga D S Retrospectivesuspect screening reveals previously ignored antibiotics antifungal
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
L
compounds and metabolites in Bangladesh surface waters Sci TotalEnviron 2020 712 136285(89) Albergamo V Scholleacutee J E Schymanski E L Helmus RTimmer H Hollender J De Voogt P Nontarget screening revealstime trends of polar micropollutants in a riverbank filtration systemEnviron Sci Technol 2019 53 (13) 7584minus7594(90) Chiaia-Hernaacutendez A C Scheringer M Muller A StiegerG Waumlchter D Keller A Pintado-Herrera M G Lara-Martin PA Bucheli T D Hollender J Target and suspect screening analysisreveals persistent emerging organic contaminants in soils andsediments Sci Total Environ 2020 740 140181(91) Cariou R Meacutendez-Fernandez P Hutinet S b Guitton YCaurant F Le Bizec B Spitz J r m Vetter W Dervilly GNontargeted LCESI-HRMS Detection of Polyhalogenated Com-pounds in Marine Mammals Stranded on French Atlantic Coasts ACSESampT Water 20211309(92) Travis S C Kordas K Aga D S Optimized workflow forunknown screening using gas chromatography high-resolution massspectrometry expands identification of contaminants in siliconepersonal passive samplers Rapid Commun Mass Spectrom 2021 35(8) No e9048(93) Hollender J Schymanski E L Singer H P Ferguson P LNontarget screening with high resolution mass spectrometry in theenvironment ready to go ACS Publications 2017(94) Muller A Schulz W Ruck W K L Weber W H A newapproach to data evaluation in the non-target screening of organictrace substances in water analysis Chemosphere 2011 85 (8) 1211minus1219(95) Scholleacutee J E Schymanski E L Hollender J Statisticalapproaches for LC-HRMS data to characterize prioritize and identifytransformation products from water treatment processes In AssessingTransformation Products of Chemicals by Non-Target and SuspectScreening- Strategies and Workflows Vol 1 ACS Publications 2016 pp45minus65(96) Sireacuten K Fischer U Vestner J Automated supervisedlearning pipeline for non-targeted GC-MS data analysis AnalyticaChimica Acta X 2019 1 100005(97) Sinkov N A Harynuk J J Cluster resolution A metric forautomated objective and optimized feature selection in chemometricmodeling Talanta 2011 83 (4) 1079minus1087(98) Johnsen L G Amigo J M Skov T Bro R Automatedresolution of overlapping peaks in chromatographic data J Chemom2014 28 (2) 71minus82(99) Smirnov A Jia W Walker D I Jones D P Du X ADAP-GC 32 graphical software tool for efficient spectral deconvolution ofgas chromatography-high-resolution mass spectrometry metabolomicsdata J Proteome Res 2018 17 (1) 470minus478(100) Chen C Husny J Rabe S Predicting fishiness off-flavourand identifying compounds of lipid oxidation in dairy powders bySPME-GCMS and machine learning Int Dairy J 2018 77 19minus28(101) Taghadomi-Saberi S Mas Garcia S Allah Masoumi ASadeghi M Marco S Classification of bitter orange essential oilsaccording to fruit ripening stage by untargeted chemical profiling andmachine learning Sensors 2018 18 (6) 1922(102) Yang Q Xu L Tang L-J Yang J-T Wu B-Q Chen NJiang J-H Yu R-Q Simultaneous detection of multiple inheritedmetabolic diseases using GC-MS urinary metabolomics by chemo-metrics multi-class classification strategies Talanta 2018 186 489minus496(103) Smolinska A Hauschild A C Fijten R R R Dallinga JW Baumbach J Van Schooten F J Current breathomicsa reviewon data pre-processing techniques and machine learning inmetabolomics breath analysis J Breath Res 2014 8 (2) 027105(104) Garrido-Baserba M Corominas L Corteacutes U Rosso DPoch M The fourth-revolution in the water sector encounters thedigital revolution Environ Sci Technol 2020 54 (8) 4698minus4705(105) Vujnovic G Perisic J Bozilovic Z Milovanovic MGetman R V Radovanovic L USING SCADA SYSTEM FOR
PROCESS CONTROL IN WATER INDUSTRY Acta TechnicaCorviniensis-Bulletin of Engineering 2019 12 (2) 67minus72(106) Olsson G ICA and me-a subjective review Water Res 201246 (6) 1585minus1624(107) Zhao H Hou D Huang P Zhang G Water quality eventdetection in drinking water network Water Air Soil Pollut 2014 225(11) 2183(108) Chandola V Banerjee A Kumar V Anomaly detection Asurvey ACM computing surveys (CSUR) 2009 41 (3) 1minus58(109) Hasan J Goldbloom-Helzner D Ichida A Rouse TGibson M Technologies and Techniques for Early Warning Systems toMonitor and Evaluate Drinking Water Quality A State-of-the-ArtReview Environmental Protection Agency Washington Dc Office ofWater 2005(110) Izquierdo J Loacutepez P A Martiacutenez F J Peacuterez R Faultdetection in water supply systems using hybrid (theory and data-driven) modelling Mathematical and Computer Modelling 2007 46(3minus4) 341minus350(111) Murray S Ghazali M McBean E A Real-time water qualitymonitoring assessment of multisensor data using Bayesian beliefnetworks Journal of Water Resources Planning and Management 2012138 (1) 63minus70(112) Raciti M Cucurull J Nadjm-Tehrani S Anomaly detectionin water management systems In Critical Infrastructure ProtectionSpringer 2012 pp 98minus119(113) Ahmed M Mahmood A N Hu J A survey of networkanomaly detection techniques Journal of Network and ComputerApplications 2016 60 19minus31(114) Yuan Y Jia K A Water Quality Assessment Method Based onSparse Autoencoder 2015 IEEE 2015 pp 1minus4(115) Muharemi F Logofatu D Leon F Machine learningapproaches for anomaly detection of water quality on a real-worlddata set Journal of Information and Telecommunication 2019 3 (3)294minus307(116) Alom M Z Taha T M Yakopcic C Westberg S SidikeP Nasrin M S Hasan M Van Essen B C Awwal A A S AsariV K A state-of-the-art survey on deep learning theory andarchitectures Electronics 2019 8 (3) 292(117) Russo S Lurig M Hao W Matthews B Villez K Activelearning for anomaly detection in environmental data EnvironmentalModelling amp Software 2020 134 104869(118) Ahmad S Lavin A Purdy S Agha Z Unsupervised real-time anomaly detection for streaming data Neurocomputing 2017262 134minus147(119) Stocco A Tonella P Towards Anomaly Detectors that LearnContinuously IEEE 2020 pp 201minus208(120) Hoffman M Bach F R Blei D M In Online learning forlatent dirichlet allocation 2010 Citeseer pp 856minus864(121) Mongillo G Deneve S Online learning with hidden Markovmodels Neural computation 2008 20 (7) 1706minus1716(122) Bogler A Packman A Furman A Gross A Kushmaro ARonen A Dagot C Hill C Vaizel-Ohayon D Morgenroth ERethinking wastewater risks and monitoring in light of the COVID-19pandemic Nature Sustainability 2020 9 1minus10(123) Bivins A North D Ahmad A Ahmed W Alm E BeenF Bhattacharya P Bijlsma L Boehm A B Brown J Wastewater-Based Epidemiology Global Collaborative to Maximize Contributions inthe Fight Against COVID-19 ACS Publications 2020
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
M
ESE data sets can incorporate numerous data types withwide spatial and temporal variability Accordingly if possible athorough data visualizationexploration and preprocessingprocess should be performed to gain initial insights andstreamline the data for the subsequent model building steps(Figure 1b) To aid this depending on the data type wellestablished community guidelines could be leveraged Thisprocess typically includes four steps data cleaning integrationreduction and transformation (See the glossary for de-tails)1617
23 Model Building ML Models ML models fall intothree classes supervised unsupervised and semisupervised(Figure 1c) Supervised learning is a ML task where analgorithm is used to train a model using known data and thenthe learned model is used to predict the outcome of new orunforeseen instances18 In supervised learning a sample in thedata set has two components (1) a set of variables (features)that define the sample and (2) class labels (outcomes) that onewants to predict Two types of models are built usingsupervised learning algorithms classification or regressionmodels A classification model is suggested when the outputdata can be categorized into specific groups or classes (iediscrete output variables) If the output variable is a continuousvariable a regression model is recommendedUnsupervised learning is done in the absence of pre-existing
class labels It is a ML category where the goal is to find hiddenand unknown data patterns or to determine the datadistribution across the data set19 Unsupervised learningalgorithms are often used for data exploration and visual-ization The two major types of unsupervised learningapproaches are clustering and dimensionality reductionClustering algorithms are used to group similar samplestogether into clusters whereas dissimilar samples are relegatedto separate clusters Dimensionality reduction algorithms areused to reduce the dimension of the feature set Too manyuninformative features can hinder predictive modeling this isoften referred to as the ldquocurse of dimensionalityrdquo Somecommon supervised and unsupervised learning algorithms arelisted in Table 1Semisupervised learning is a hybrid of supervised and
unsupervised learning where a set of labeled data is used inconjunction with a set of unlabeled training data In manycases getting true labels for a data set can be difficultSemisupervised learning tackles this problem by making use ofunlabeled data in the learning process and then leveraging thislearned information along with the labeled data set forsubsequent prediction20
Model Optimization and Evaluation Supervised learningis usually a three-step process In the first step the data aresplit into training and testing data sets with the split rangingbetween 60 and 80 for training and 40minus20 for testing Thetraining data set is then iteratively separated into training andvalidation sets The model is trained on the training set whilethe validation set is used to tune the model hyperparametersthat consist of the model architecture and the parameters thataffect the speed and quality of the training process Followingthis cross-validation the learned model is tested against thetest data set to evaluate model robustness and accuracy Toensure unbiased model evaluation the test data set cannot beused for training Commonly used model evaluation metricsinclude accuracy F1-score precision recall area under thereceiver operating characteristic (ROC) curve mean absoluteerror (MAE) and mean squared error (MSE)21
Evaluating unsupervised models can be challenging becausedata do not have prior labels Evaluation typically involvesinternal and external validation22 Internal validation is done byestimating inter- and intra-cluster distances to evaluate clusterquality A cluster refers to a collection of data points thataccumulate together based on certain similarities Modelsminimizing intracluster distances while maximizing interclusterdistances are preferred Such a result suggests that thesemodels are performing well at clustering similar data pointstogether Some commonly used metrics for this purpose arethe Adjusted Rand Index and the Silhouette CoefficientHowever a good internal validation score does not guaranteethe effectiveness of the obtained clusters for real applicationsHence external validation is necessary to assess whether thedata points are assigned to the correct clusters This stepusually requires human evaluation or comparison againstbenchmarks For dimensionality reduction task reconstructionerror or loss is estimated to evaluate model performanceThese methods seek to minimize the reconstruction error orlossminusdefined as the distance between the original data pointand its projection onto a lower-dimensional subspace (itsldquoestimaterdquo)23
The ESE community has begun to leverage various MLalgorithms and techniques to analyze environmental data setsWhile it is difficult to determine a priori which ML algorithm isbest suited for a particular data set an improved understandingof the positives and negatives and the inherent assumptions ofthe specific algorithms aid in the choice of the righttechnique(s) Table 1 summarizes algorithms popular withinthe ESE community The table describes the algorithms theiradvantages and disadvantages the assumptions for specific datatypes and gives examples where the algorithms have been usedto address ESE related problems
24 Data Interpretation The final step in the modelbuilding cycle is to interpret the learnt model and developreasonable explanations for the obtained analysis (Figure 1d)Data interpretation could mean extracting important variablescontributing to the model prediction These variables couldhelp in verifying the hypotheses of the study or understandingwhat factors are critical in driving the observations in the studyunder consideration At this stage the literature and expertopinions are queried to make connections between the modeloutputs and knowledge about the relevant chemical physicaland biological phenomena to make data-driven conclusionsand decisions
3 CURRENT APPLICATIONS OF MACHINE LEARNINGIN ESE
In this section we discuss current applications of MLalgorithms within the ESE domain by presenting three casestudies representing different application areas
31 Metagenomic Data Analysis High throughputshotgun (untargeted) metagenomic DNA sequencing offers arobust and effective way to access the microbial world and isnow often used in ESE Differentiating microbial communitiesin different environments6061 studying the dissemination ofantibiotic resistance in environmental systems6263 definingbacterial communities in contaminated environments toexplore bioremediation64 and characterizing microbial com-munities in wastewater treatment processes6566 are allexamples of environments where metagenomic character-ization is increasingly being applied
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
E
In a typical metagenomic study the first step is often tosearch sequenced DNA reads against a reference database toderive taxonomic or functional gene annotation The obtainedtaxonomic or functional gene profiles are then normalized toprovide relative abundance information corresponding todifferent species or genes in the sample that can be visualizedand analyzed to answer specific questions about the sampleContinuous optimization and declining costs of sequencingplatforms have made metagenomics related research moreaccessible than ever before This trend has led to thegeneration of large volumes of publicly available sequencingdata but rendering these data into meaningful interpretationsremains challenging67 Fortunately the advent of ML methodshas immensely accelerated advancement at every cross sectionof a typical metagenomic study68 Here we highlight someapplications of ML in antibiotic resistance-related studiesOne often faced challenge with metagenomic data is that of
high dimensionality and low sample size (HDLSS ie morevariables than independent samples) For example the numberof genes annotated within a given environmental sample caneasily exceed hundreds of thousands Hence many unsuper-vised techniques such as principal component analysis (PCA)and non-metric multi-dimensional scaling (NMDS) are usedas preprocessing steps to reduce dimensionality by removinguninformative features A common application of thesetechniques is to examine the similaritydissimilarity of differentenvironmental samples by clustering based on species or genecomposition69 Network-based ML (data is represented in theform of a graph where nodes are the data points to be clusteredand edges represent the relationship or similarity between thedata points) is another unsupervised approach to studyproteinminusprotein or geneminusgene interactions to understanddifferent functional pathways70
Given the sample labels or response variables and the genecomposition support vector machines (SVMs) ANNs andensemble methods (such as random forest (RF) or extremelyrandomized trees) have been extensively used to performsupervised learning and build predictive models For exampleidentifying interesting patterns and important genes in a dataset58 predicting relative antibiotic resistance abundancelevels7 understanding the role of socioeconomic status inshaping the resistome or microbiome71 or the prediction ofantibiotic resistance phenotype72 are some of the uniqueproblems that have been examined using the above-mentionedmethods Hidden Markov models (HMMs) which can beused in both supervised or unsupervised fashion are one of themore readily used algorithms to detect antibiotic resistancegene variants or potential functional homologues73 and havebeen applied for the discovery of novel ARGs frommetagenomic data74
Word embedding which has gained a lot of popularity overthe recent years is a feature learning technique used in naturallanguage processing (NLP) Word embedding is a term usedfor the representation of words or sentences in the form ofnumeric vectors that can be used for downstream ML tasks75
Raw DNAprotein sequences sliced into k-mers are analogousto the structure of a sentence and can be analyzed in a similarfashion as natural languages Thus there is a rapid thrusttoward analyzing raw sequences using NLP based techni-ques7677 MetaMLP78 based on a similar idea uses a wordembedding based classification model to predict ARGphenotype and has been found to perform 50times faster thanDIAMOND (one of the fastest sequence alignment methods)
with similar accuracy Word embedding based models holdimmense promise in analyzing metagenomic data as they havethe ability to learn patterns directly from within raw sequencesSimilar to the antibiotic resistance example other
applications can also be envisioned For instance analyzingmetagenomics data for taxonomic classification7980 or mobilegenetic element classification81 where ML algorithms wereable to outperform the traditional methods In essence MLalgorithms have enhanced our ability to robustly analyzecomplex metagenomic data
32 Nontarget Analysis of Environmental Samples Arapidly developing approach in environmental analysis andtoxicology involves the use of high resolution massspectrometry (HRMS) based nontarget analysis (NTA)where data on accurate masses of molecular and fragmentions are collected without a priori information on thechemicals being analyzed NTA can potentially aid in thescreening and analysis of the vast and diverse universe oforganic pollutants a grand challenge faced by the ESEcommunity82 Because the chemicals being detected are notpredetermined NTA provides an opportunity to comprehen-sively examine the occurrence fate and transport of chemicalcontaminants in different environmental niches with minimalbias82 Similarly NTA can be applied in environmentalmetabolomics to facilitate understanding of the effects ofchemical perturbations on exposed organisms (eg plantsanimals and humans) without focusing on a particularbiochemical pathway8384 A recent review highlighted studiesthat employed the different types of NTA techniques used inmetabolomics to discover metabolite changes in plants inducedby exposure to xenobiotics (eg pharmaceuticals personal careproducts pesticides flame retardants and engineered nano-materials) to examine the effect of altered levels of nutrients inthe environment on plant systems85 Here we present a briefsummary of how the combination of NTA with ML canadvance monitoring of water wastewater quality soil andexposed organisms (eg humans wildlife)While there are various methods that can be used for NTA
in ESE HRMS is the most popular because of its capacity forsensitive detection of low levels of contaminants andmetabolites in complex environmental samples However thepower of NTA using HRMS has been limited by problemsassociated with sample preparation and data analysis Inanalyzing environmental samples using MS sample concen-tration and cleanup are critical because the signal intensity ofthe MS features depend not only on the concentrations of thechemicals but also on the amounts and the nature of thematrix present in the sample extracts The reproducibility ofthe ionization of compounds in MS can be compromisedsignificantly by matrix effects Therefore it is important to havean appropriate number of replicate samples and properlyselected blank samples when conducting NTA Solid-phaseextraction for sample cleanup is commonly used but thisapproach can introduce bias because highly polar contaminantsmay be lost during sample preparation Unlike in targetanalysis where variations due to matrix effects and samplelosses can be corrected using stable isotope-labeled referencecompounds as surrogates this approach cannot be used inNTA because the purpose of NTA is to identify unknowncontaminants that were not included in the target list Tonormalize for instrumental variation and matrix effects internalstandards with varying polarity can be added to the sampleextract prior to analysis Recently several compounds of
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
F
diverse structure log Kow chromatographic behavior andionization efficiency have been proposed for inclusion in aquality control mixture usable for NTA86 However theproposed mixture has limitations such as for the detection ofhydrophilic compounds and molecular formula generation forcompounds containing fluorine Finally due to the substantialamount of MS data acquired indiscriminately under full-scanand the subsequent MSMS fragmentation of molecular andfragment ions data processing to identify relevant features isdaunting Therefore NTA workflows with built-in filters andcriteria are being developed to facilitate prioritization of MSfeatures that are useful for chemical structure annotation Inthis regard advanced data processing tools high throughputstatistical packages and user-friendly visualization programsare needed to fully interrogate the rich data sets acquired byNTANTA based on HRMS holds great promise for the
comprehensive monitoring of the occurrence and fate ofcontaminants in water and wastewater87 It also allowsresearchers to retrospectively analyze stored HRMS data toscreen for suspected contaminants (ie suspect screening) thatmay have been missed during target analysis88 Using advancedcomputational strategies such as hierarchical cluster analysiscommon patterns in the occurrence of contaminants in theenvironment and their emission pathways can be predictedbased on time series analysis of the aggregated NTA data89
NTA combined with cluster analysis was successfully appliedto reveal previously unmonitored chemical contaminants insoil and sediment samples90 The application of NTA andadvanced postacquisition data treatment will continue toenhance our ability to discover emerging contaminants in theenvironment including those that bioaccumulate and poserisks to humans and wildlife For instance new polyhalo-genated compounds were detected for the first time in blubbersamples from marine mammal sentinel species using both LC-HRMS and GC-HRMS for analysis and an open-source datamining software in the R programming environment thatdetected halogenated signatures in full scan HRMS91 FinallyNTA combined with personal passive samplers and propersample preparation techniques can be used to unlock thecomposition of chemical mixtures that humans are exposed toon a daily basis which can be used in investigating the humanexposome92
Akin to metagenomic data NTA data pose a similar HDLSSchallenge as each mass spectrum constitutes a large number ofpeaks representing potential compounds present in a givensample Hence data preprocessing is crucial and inevitableVarious data preprocessing steps are performed such asdetecting peaks subtracting peaks that were found in blankor control samples componentization (ie grouping of signalsthat probably belong to one unique molecular structure) andremoving noise using replicate measurements93 Followingpreprocessing various supervised and unsupervised MLalgorithms can be applied to engineer select and extractrelevant features However before any further analysis datanormalization and data scaling are two crucial steps thatrequire considerationThe most common algorithms used for NTA are linear
projection methods such as PCA and supervised partial least-squares discriminant analysis (PLS-DA)949548 PCA aids insample comparison by removing variance from the sample setSupervised PLS-DA along with relevant metadata can be usedto extract features pertaining to the specific questions that are
being asked However a major challenge is that featuredetection based on peak intensity can be misleading because oferroneous peak assignments Also relevant information can belost when differentiating the actual contaminant signals andbackground noise when using intensity information for featureselection Hence several alternative approaches that aim atusing raw data signals (retention time times mass-to-charge ratio)that bypass the peak detection step have been proposed andimplemented for improved extraction of information andunderlying patterns in the data set96minus98 This includestechniques such as transforming the raw data signal to distancematrices for dissimilarity analysis96minus98 and using clustering toextract features99
A number of other data analysis algorithms exist but havenot yet become common in NTA environmental analysisClustering techniques such as k-means and hierarchicalclustering can be used to identify similar samples Algorithmslike RF SVMs and ANNs can be applied to classify samplesbased on different categories and would be advantageous inillustrating nonlinear relations in the data These algorithmshave shown excellent promise in analyzing HRMS data inother fields100minus103 and this promise can certainly be extendedto environmental sample analysis
33 Anomaly Detection in Engineered Water Sys-tems (EWS) EWS is an umbrella term for systems of watercollection treatment distribution storage and their operationRecently there has been a major thrust toward digitalization ofthe water sector104 In particular the evolution of cyberinfras-tructure and of online process control instrumentation has ledto the development of advanced process control solutions suchas supervisory control and data acquisition (SCADA)systems105106 Such advances have enabled water utilities tocontinuously monitor water quality identify problems andeffectively oversee maintenance issues both remotely and morelocally These systems entail collection of a large volume of rawdata that could be in conjunction with appropriate dataanalysis techniques transformed into valuable information thatcan be leveraged to make proactive decisions to optimizeoverall performance107 In particular there is surging interest inusing ML techniques to identify unusual patterns in raw EWSdata as a means of discovering unexpected activitiesminusthis isbroadly termed anomaly detection108
A typical anomaly detection task in EWS aims todifferentiate between natural expected variations in waterquality and unusual or suspicious variations caused bycontamination or failure somewhere in the system Thechallenge is to characterize these normal water qualityvariations as it requires analysis of long-term data thatencompasses inherent background variability109 This charac-terized normal response helps to flag anomalous events in thedata Various ML algorithms have been applied to address thisproblem (Table 1)Within EWS one primary form of the generated data is a
time series where each time point can be considered a discretesample The time series consists of measurements of indicatorscollected over time from one or several sensors which formthe feature set Hence each sample can be represented as amultidimensional vector where each dimension represents afeature Historic data is then used to train the model Givenappropriate analysis of the supporting data set one can frameand solve problems to address a multitude of aims such asdetecting leaks sensor failures abrupt changes in water qualityor contamination events110minus112
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
G
There are a number of different anomaly detectiontechniques available113 In the unsupervised approachunlabeled samples are clustered using an algorithm such ask-means density-based clustering algorithms or expectation-maximization (EM) clustering The concept is that the normaldata points cluster to form high density clusters whereasanomalous data points cluster separately or distant from thenormal data pointsclusters and are located in low-densityregions This way one can carry out additional investigationusing new samples and classify them relative to the normaldata In the supervised approach labels (normal or anomalous)are assigned to each sample based on past information andexpert knowledge of anomalous behavior In this way theproblem is converted into a binary classification problemThough it can easily be extended to a multiclass classificationto categorize different types of anomalies SVM ANNBayesian Network Logistic Regression models and theirvariants are well explored algorithms in this space113 A specialcase of SVM One-Class SVM (OCSVM) is a widely usedmethod for anomaly detection In OCSVM the entire trainingdata set is considered as one-class (eg normal class) and thenew data points are classified as similar (normal) or different(anomalous) to the training data Because considering all datapoints from one class is equivalent to having no label OCSVMis considered as unsupervised learning methodOwing to the success of ML in detecting anomalies in other
domains such as network security researchers have startedexploring its potential in water quality anomaly detec-tion43114115 Based on the studies published so far DLalgorithms have shown promise in detecting anomalies andhave outperformed conventional techniques on a number ofoccasions However it should be noted that DL methods couldbe slow to train depending on the depth of the network andthe amount of data available116
Many studies have adopted a batch learning approach wherethe model is trained on historical data and then the new data iscategorized using this trained model With a continuousincoming time-series data stream and the possibility of novelanomalies retraining the models with every new data set canquickly become impractical and difficult to reliably executeContinual or active learning frameworks that continuouslylearn as the new data stream comes in and that can identifyanomalous behavior in real time are required to circumventthese issues117118 Variants of Latent Dirichlet AllocationMarkov Models and ANN based architectures are algorithmsthat have been applied in other domains to achieve continualonline learning119minus121 However thes approaches remainunderexplored for anomaly detection in water systems asthese frameworks are not trivial and have their ownimplementation challenges118119
Ultimately anomaly detection will be most valuable if it canlearn continually as the data stream comes in and yield real-time reporting that informs immediate corrective action It iscrucial that the models being utilized are fast and accurateHence going forward there is a need to shift the focus towardhybrid approaches (combinations of different algorithms) tobuild powerful models that are able to detect multiple types ofanomalies in real-time Such models will enhance EWSanomaly detection and advance public health9118119
4 PATH FORWARDThere is increasing interest in and a growing body of researchdocumenting how data analytics is being used to address ESE
problems As illustrated in this Feature the power of dataanalytics has been widely recognized as has been the vast needfor its application Anomaly detection has particularlypromising applications for water professionals and practi-tioners Notably broader application of metagenomics andNTA can revolutionize environmental monitoring effortsHowever the application of data analytics in ESE practiceremains in its infancy and concerted efforts are required tomake data analytics an integral part of ESE research andeducation Such a goal would most effectively be achieved ifthere were an agreed-upon plan of action Coordinated effortson several fronts are needed to help the ESE community reapthe potential benefits of data analyticsFirst there is a need to encourage collaborations between
data scientists and ESE practitioners that involve diverselyengaged and integrated research teams from multipledisciplines Such collaborations will facilitate cross-disciplinarycommunication and simultaneous skills building For exampledata scientists come from a culture highly supportive of datasharing (eg GitHub Bitbucket public databases) The ESEcommunity should embrace this culture and incorporate datasharing both locally regionally and globally Recent efforts toaddress the COVID-19 pandemic122123 the global dissem-ination of antimicrobial resistance7 and the development ofglobally vetted data analysis approaches within the NTAcommunity are all steps in this direction Such collaborationshave the potential to yield profound data-driven insights andconclusions that would be impossible to achieve within normalacademic and professional silos Working on their own datascientists may create models that answer specific questions butlack the interpretive training required to contextualize theirresults into real world applications Simultaneously ESEpractitioners with data may not possess the correspondingtools or insights required for proper analysis Strategiccollaborations will lead to improved data analytics datainterpretation and improved decision makingSecond there is a lack of user-friendly data analytics tools
and platforms devised for ESE problems Although under-standing the theory behind ML algorithms is not a difficult taskfor many trained scientists a major hurdle that ESE researchersface is in the coding and implementation of these models Withthe continual advancement of data science programming isbecoming an increasingly necessary skill without which modelsmay be improperly developed or may lack efficiency User-friendly tools may circumvent the problems with codinglearning curves and may be essential for widespreadpractitioner adoption While many tools (Microsoft powerBI Weka Tableau Azure) are available for general dataanalysis application-specific toolsplatforms would greatly aidin performing end-to-end analysesThird the field needs to incorporate data analytics-oriented
curricula This could include the addition of data analyticsfocused modules within existing coursework the introductionof new data science and statistics courses within ESE degreeprograms and the development of relevant capstone projectsfor ESE applications Such experiential learning would helpstudents understand the intricacies of different algorithms andprovide hands-on experience in applying various data analyticstechniques in the context of specific ESE problems Furtherencouraging students to take part in data science internshipswould be an excellent means of developing such expertiseFourth introducing data analytics workshops and making
the relevant resources available is valuable for students
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
H
researchers and professionals A plethora of online resourcessuch as Coursera LinkedIn Learning and Edx are available forlow or no cost offering high-quality content in data scienceExpanding access to and creation of open source learningplatforms could prove instrumental in instigating the data-driven problem-solving approachFinally we know that data analytics is not new but
continues to evolve at a rapid pace These advancementsfurnish us with new and powerful ways to analyze the data thatcan help holistically tackle the challenges faced by ESEcommunity and ultimately inform decision-making and policyformulation at scales never previously imagined Hence it iscritical to be abreast of new developments in data analytics andto continue making efforts to harvest their full potential
AUTHOR INFORMATIONCorresponding AuthorPeter Vikesland minus Via Department of Civil andEnvironmental Engineering Virginia Tech BlacksburgVirginia 24061 United States orcidorg0000-0003-2654-5132 Email pvikesvtedu
AuthorsSuraj Gupta minus The Interdisciplinary PhD Program in GeneticsBioinformatics and Computational Biology Virginia TechBlacksburg Virginia 24061 United States
Diana Aga minus Department of Chemistry University at BuffaloThe State University of New York Buffalo New York 14226United States orcidorg0000-0001-6512-7713
Amy Pruden minus Via Department of Civil and EnvironmentalEngineering Virginia Tech Blacksburg Virginia 24061United States orcidorg0000-0002-3191-6244
Liqing Zhang minus Department of Computer Science VirginiaTech Blacksburg Virginia 24061 United States
Complete contact information is available athttpspubsacsorg101021acsest1c01026
NotesThe authors declare no competing financial interestBiography
Peter J Vikesland is the Nick Prillaman Professor of Civil andEnvironmental Engineering at Virginia Tech He received his BAfrom Grinnell College in Chemistry in 1993 and MS and PhD inCivil and Environmental Engineering from University of Iowa in 1995and 1998 Vikeslandrsquos research focuses on the fate of nanomaterials inthe environment nanotechnology enabled sensor platforms forenvironmental quality assessment the environmental disseminationof antibiotic resistance and the application of data analytics in
environmental science and engineering He is a past President of theAssociation of Environmental Engineering and Science Professors(AEESP) is a US National Science Foundation CAREER awardeeand is the recipient of the 2018 Walter Weber Research InnovationAward from AEESP
ACKNOWLEDGMENTS
This study was supported by National Science Foundation(NSF) awards OISE (1545756) ECCSS NNCI (1542100)and CSSI (2004751) Additional support was provided by theCenter for Science and Engineering of the Exposome at theVirginia Tech Institute for Critical Technology and AppliedScience (ICTAS) and the Virginia Tech Graduate Schoolsupported Genetics Bioinformatics and ComputationalBiology (GBCB) and Sustainable Nanotechnology (VTSuN)programs
GLOSSARY
Accuracy Correct predictionstotal predictions
Autoencoders A deep learning techni-que that aims to build amodel where the outputtargets are set equal tothe input variables Itseeks to learn an ap-proximate representa-tion of the data andreconstruct it It is anunsupervised learningapproach used for di-mensionality reduction
Bayesian Network Bayesian networks are akind of probabilisticgraphical model thatutilizes Bayesian infer-ence for probability es-timations Bayesian net-works are used tomodel conditional de-pendence and hencecausation using a direc-ted graph
Data cleaning Identification of outliersor filling in of missingvalues Outlier identifi-cation can be done byusing Z-score quartilevalues or hypothesistesting Missing datacan be handled in mul-tiple ways such as fillingthe value by computingthe summary statisticsof the given variableusing predicted valuescomputed by an MLalgorithm manual cura-tion or by ignoring themissing record
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
I
Data integration The combination ofmultiple data sourcesinto a single coherentform
Data reduction Aggregation or elimina-tion of redundant in-formation to reduce thedata set size
Data transformation Conversion of raw databy normalizing to acommon scale to en-sure consistency andcomparability
Density Based Clustering Work by recognizingldquodenserdquo groups ofpoints permitting it tolearn clusters of discre-tionary shape and dis-tinguish anomalies inthe information
Dimensionality Reduction The process of reducingthe number of randomvariables or attributesunder consideration
EM Clustering Similar to k-means ex-cept it assigns the sam-ples into clusters basedon the probabilities es-timated using the EMalgorithm The objec-tive is to maximize theoverall probability ofthe data for the given(final) clusters
Ensemble Methods Ensemble methods arealgorithms that com-bine predictions fromseveral base models toobtain one optimalmodel
F1-Score F1-Score is a harmonicmean of recall andprecision
HMM HMM is a statisticalMarkov mode l i nwhich the system beingmodeled is assumed tobe a Markov process
k-mer A substring with alength k in a biologicalsequence of nucleoti-des
Naive Bayes Naive Bayes methodsare a family of simpleprobabilistic classifiersbased on Bayes ruleThe method is calledldquoNaiverdquo for its assump-tion of conditional in-dependence among thefeatures
Natural Language Processing (NLP) A subfield of computerscience artificial intelli-gence and linguisticsthat deals with theinteraction of com-puters with human lan-guages
NMDS NMDS is an ordinationtechnique based on dis-tance or dissimilaritymatrix NMDS repre-sents pairwise dissimi-larity between samplesin a low-dimensionalspace
PLS-DA PLS-DA is a linearclassification modelthat is able to predictthe class of new sam-ples
R Programming languagefor Statistical Comput-ing
ROC-Curve A receiver operatingcharacteristic (ROC)curve is a plot of truepositive rate (TPR-yaxis) against the falsepositive rate (FPR-xaxis) It measures theperformance of a classi-fier
SCADA Supervisory control andd a t a a c q u i s i t i o n(SCADA) is a systemdesigned to gather andanalyze real time dataIt is used in waterwastewater treatmentplants to monitor andmanage processes
REFERENCES(1) Tukey J W The future of data analysis Ann Math Stat 196233 (1) 1minus67(2) Hayashi C What is data science Fundamental concepts and aheuristic example In Data science classification and related methodsSpringer 1998 pp 40minus51(3) Bishop C M Pattern recognition and machine learning Springer2006(4) Qiu J Wu Q Ding G Xu Y Feng S A survey of machinelearning for big data processing EURASIP Journal on Advances inSignal Processing 2016 2016 (1) 67(5) Agarwal R Dhar V Big data data science and analytics Theopportunity and challenge for IS research In INFORMS 201425443(6) Wilcox C Woon W L Aung Z Applications of machinelearning in environmental engineering Citeseer 2013(7) Hendriksen R S Munk P Njage P Van Bunnik BMcNally L Lukjancenko O Roumlder T Nieuwenhuijse DPedersen S K Kjeldgaard J Global monitoring of antimicrobialresistance based on metagenomics analyses of urban sewage NatCommun 2019 10 (1) 1124(8) Sobus J R Wambaugh J F Isaacs K K Williams A JMcEachran A D Richard A M Grulke C M Ulrich E MRager J E Strynar M J Integrating tools for non-targeted analysis
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
J
research and chemical safety evaluations at the US EPA J ExposureSci Environ Epidemiol 2018 28 (5) 411minus426(9) Dogo E M Nwulu N I Twala B Aigbavboa C A survey ofmachine learning methods applied to anomaly detection on drinking-water quality data Urban Water J 2019 16 (3) 235minus248(10) Wilde F D Radtke D B Gibs J Iwatsubo R T Nationalfield manual for the collection of water-quality data US Geological SurveyTechniques of Water-Resources Investigations Book 9 1998(11) National Research Council Confronting the nationrsquos waterproblems The role of research National Academies Press 2004(12) Bharti R Grimm D G Current challenges and best-practiceprotocols for microbiome analysis Briefings in bioinformatics202122178(13) Veenaas C Developing tools for non-target analysis and digitalarchiving of organic urban water pollutants 2018(14) Keim D A Visual exploration of large data sets CommunACM 2001 44 (8) 38minus44(15) Chen J Chen Y Du X Li C Lu J Zhao S Zhou X Bigdata challenge a data management perspective Frontiers of ComputerScience 2013 7 (2) 157minus164(16) Chakrabarti S Cox E Frank E Guting R H Han J JiangX Kamber M Lightstone S S Nadeau T P Neapolitan R EData mining know it all Morgan Kaufmann 2008(17) Gibert K Sanchez-Marre M Izquierdo J A survey on pre-processing techniques Relevant issues in the context of environ-mental data mining AI Communications 2016 29 (6) 627minus663(18) Russell S Norvig P Artificial intelligence a modern approach2002(19) Hinton G E Sejnowski T J Poggio T A UnsupervisedLearning Foundations of Neural Computation MIT press 1999(20) Van Engelen J E Hoos H H A survey on semi-supervisedlearning Machine Learning 2020 109 (2) 373minus440(21) Brink H Richards J W Fetherolf M Cronin B Real-worldmachine learning Manning Shelter Island NY 2017(22) Kassambara A Practical guide to cluster analysis in RUnsupervised machine learning Sthda Vol 1(23) Ghodsi A Dimensionality reduction a short tutorialDepartment of Statistics and Actuarial Science Univ of WaterlooOntario Canada 2006 37 (38) 2006(24) Wright R E Logistic regression 1995(25) Alvarez-Arbesu R Feliciacutesimo A M GIS and logisticregression as tools for environmental management a coastal cliffvegetation model in Northern Spain WIT Transactions on Informationand Communication Technologies 2002 26(26) Liu L Sankarasubramanian A Ranjithan S R Logisticregression analysis to estimate contaminant sources in waterdistribution systems J Hydroinf 2011 13 (3) 545minus557(27) Thoe W Gold M Griesbach A Grimmer M Taggart ML Boehm A B Predicting water quality at Santa Monica Beachevaluation of five different models for public notification of unsafeswimming conditions Water Res 2014 67 105minus117(28) Muharemi F Logofa tu D Andersson C Leon FApproaches to building a detection model for water quality a casestudy In Modern Approaches for Intelligent Information and DatabaseSystems Springer 2018 pp 173minus183(29) Liaw A Wiener M Classification and regression byrandomForest R news 2002 2 (3) 18minus22(30) Gromski P S Muhamadali H Ellis D I Xu Y Correa ETurner M L Goodacre R A tutorial review Metabolomics andpartial least squares-discriminant analysis-a marriage of convenienceor a shotgun wedding Anal Chim Acta 2015 879 10minus23(31) Lee S Kim J Prediction of Nanofiltration and Reverse-Osmosis-Membrane Rejection of Organic Compounds UsingRandom Forest Model J Environ Eng 2020 146 (11) 04020127(32) Castrillo M Garciacutea A L Estimation of high frequencynutrient concentrations from water quality surrogates using machinelearning methods Water Res 2020 172 115490
(33) Tan G Yan J Gao C Yang S Prediction of water qualitytime series data based on least squares support vector machineProcedia Eng 2012 31 1194minus1199(34) Yang Y H Guergachi A Khan G Support vector machinesfor environmental informatics application to modelling the nitrogenremoval processes in wastewater treatment systems Journal ofEnvironmental Informatics 2006 7 (1) 14minus25(35) Feng J Pan L Cui B Sun Y Zhang A Gong M NewApproach for Concentration Prediction of Aqueous PhenolicContaminants by Using Wavelet Analysis and Support VectorMachine Environmental Engineering Science 2020 37 (5) 382minus392(36) Silver D Schrittwieser J Simonyan K Antonoglou IHuang A Guez A Hubert T Baker L Lai M Bolton AMastering the game of go without human knowledge Nature 2017550 (7676) 354minus359(37) Mikolov T Deoras A Povey D Burget L Cernocky JStrategies for Training Large Scale Neural Network Language ModelsIEEE 2011 pp 196minus201(38) Deng J Dong W Socher R Li L-J Li K Fei-Fei L InImagenet A Large-Scale Hierarchical Image Database Ieee 2009 pp248minus255(39) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (Aug) 2493minus2537(40) Alwosheel A van Cranenburgh S Chorus C G Is yourdataset big enough Sample size requirements when using artificialneural networks for discrete choice analysis Journal of choice modelling2018 28 167minus182(41) Yang J Xu J Zhang X Wu C Lin T Ying Y Deeplearning for vibrational spectral analysis Recent progress and apractical guide Anal Chim Acta 2019 1081 6minus17(42) Guo H Jeong K Lim J Jo J Kim Y M Park J-p Kim JH Cho K H Prediction of effluent concentration in a wastewatertreatment plant using machine learning models J Environ Sci 201532 90minus101(43) Inoue J Yamagata Y Chen Y Poskitt C M Sun JAnomaly Detection for a Water Treatment System Using UnsupervisedMachine Learning 2017 2017 IEEE 2017 pp 1058minus1065(44) Voukantsis D Karatzas K Kukkonen J Raumlsaumlnen TKarppinen A Kolehmainen M Intercomparison of air quality datausing principal component analysis and forecasting of PM10 andPM2 5 concentrations using artificial neural networks in Thessalonikiand Helsinki Sci Total Environ 2011 409 (7) 1266minus1276(45) Wold S Esbensen K Geladi P Principal componentanalysis Chemom Intell Lab Syst 1987 2 (1minus3) 37minus52(46) Jolliffe I T Cadima J Principal component analysis a reviewand recent developments Philos Trans R Soc A 2016 374 (2065)20150202(47) Dalal S G Shirodkar P V Jagtap T G Naik B G Rao GS Evaluation of significant sources influencing the variation of waterquality of Kandla creek Gulf of Katchchh using PCA Environ MonitAssess 2010 163 (1minus4) 49minus56(48) Scholleacutee J E Schymanski E L Avak S E Loos MHollender J Prioritizing unknown transformation products frombiologically-treated wastewater using high-resolution mass spectrom-etry multivariate statistics and metabolic logic Anal Chem 2015 87(24) 12121minus12129(49) Masiaacute A Campo J Blasco C Picoacute Y Ultra-highperformance liquid chromatography-quadrupole time-of-flight massspectrometry to identify contaminants in water an insight onenvironmental forensics Journal of Chromatography A 2014 134586minus97(50) Peiris R H Halleacute C Budman H Moresoli C Peldszus SHuck P M Legge R L Identifying fouling events in a membrane-based drinking water treatment process using principal componentanalysis of fluorescence excitation-emission matrices Water Res 201044 (1) 185minus194(51) MacQueen J In Some methods for classification and analysis ofmultivariate observations 1967 Oakland CA 1967 pp 281minus297
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
K
(52) Kingsy G R Manimegalai R Geetha D M S Rajathi SUsha K Raabiathul B N In Air pollution analysis using enhanced K-Means clustering algorithm for real time sensor data IEEE 2016 pp1945minus1949(53) Zou H Zou Z Wang X An enhanced K-means algorithmfor water quality analysis of the Haihe River in China Int J EnvironRes Public Health 2015 12 (11) 14400minus14413(54) Areerachakul S Sanguansintukul S Clustering analysis of waterquality for canals in Bangkok Thailand Springer 2010 pp 215minus227(55) Javadi S Hashemy S M Mohammadi K Howard K W FNeshat A Classification of aquifer vulnerability using K-means clusteranalysis J Hydrol 2017 549 27minus37(56) Li C Sun L Jia J Cai Y Wang X Risk assessment ofwater pollution sources based on an integrated k-means clustering andset pair analysis method in the region of Shiyan China Sci TotalEnviron 2016 557 307minus316(57) Brunner A M Bertelkamp C Dingemans M M LKolkman A Wols B Harmsen D Siegers W Martijn B JOorthuizen W A Ter Laak T L Integration of target analyses non-target screening and effect-based monitoring to assess OMP relatedwater quality changes in drinking water treatment Sci Total Environ2020 705 135779(58) Gupta S Arango-Argoty G Zhang L Pruden A VikeslandP Identification of discriminatory antibiotic resistance genes amongenvironmental resistomes using extremely randomized tree algorithmMicrobiome 2019 7 (1) 123(59) Du X Shao F Wu S Zhang H Xu S Water qualityassessment with hierarchical cluster analysis based on Mahalanobisdistance Environ Monit Assess 2017 189 (7) 1minus12(60) Bibby K Crank K Greaves J Li X Wu Z Hamza I AStachler E Metagenomics and the development of viral water qualitytools npj Clean Water 2019 2 (1) 1minus13(61) Chu B T T Petrovich M L Chaudhary A Wright DMurphy B Wells G Poretsky R Metagenomics reveals the impactof wastewater treatment plants on the dispersal of microorganismsand genes in aquatic sediments Appl Environ Microbiol 2018 84 (5)No e02168-17(62) Petrovich M L Zilberman A Kaplan A Eliraz G R WangY Langenfeld K Duhaime M Wigginton K Poretsky R AvisarD Microbial and Viral Communities and Their Antibiotic ResistanceGenes Throughout a Hospital Wastewater Treatment System FrontMicrobiol 2020 11 153(63) Dai D Rhoads W J Edwards M A Pruden A Shotgunmetagenomics reveals taxonomic and functional shifts in hot watermicrobiome due to temperature setting and stagnation FrontMicrobiol 2018 9 2695(64) Malla M A Dubey A Yadav S Kumar A Hashem AAbd_Allah E F Understanding and designing the strategies for themicrobe-mediated remediation of environmental contaminants usingomics approaches Front Microbiol 2018 9 1132(65) Giwa A S Ali N Athar M A Wang K Dissectingmicrobial community structure in sewage treatment plant forpathogensrsquo detection using metagenomic sequencing technologyArch Microbiol 2020 202 1minus9(66) Guo J Ni B-J Han X Chen X Bond P Peng Y YuanZ Data on metagenomic profiles of activated sludge from a full-scalewastewater treatment plant Data in brief 2017 15 833minus839(67) Laudadio I Fulci V Stronati L Carissimi C Next-generation metagenomics Methodological challenges and opportu-nities Omics a journal of integrative biology 2019 23 (7) 327minus333(68) Soueidan H Nikolski M Machine learning for metagenomicsmethods and tools arXiv preprint arXiv151006621 2015DOI 101515metgen-2016-0001(69) Ramette A Multivariate analyses in microbial ecology FEMSMicrobiol Ecol 2007 62 (2) 142minus160(70) Yang P Yu S Cheng L Ning K Meta-network optimizedspecies-species network analysis for microbial communities BMCGenomics 2019 20 (2) 187
(71) Collignon P Beggs J J Walsh T R Gandra SLaxminarayan R Anthropological and socioeconomic factorscontributing to global antimicrobial resistance a univariate andmultivariable analysis Lancet Planetary Health 2018 2 (9) No e398-e405(72) Arango-Argoty G Garner E Pruden A Heath L SVikesland P Zhang L DeepARG a deep learning approach forpredicting antibiotic resistance genes from metagenomic dataMicrobiome 2018 6 (1) 1minus15(73) El-Gebali S Mistry J Bateman A Eddy S R Luciani APotter S C Qureshi M Richardson L J Salazar G A Smart AThe Pfam protein families database in 2019 Nucleic Acids Res 201947 (D1) D427minusD432(74) Berglund F Marathe N P Osterlund T Bengtsson-PalmeJ Kotsakis S Flach C-F Larsson D G J Kristiansson EIdentification of 76 novel B1 metallo-β-lactamases through large-scalescreening of genomic and metagenomic data Microbiome 2017 5 (1)134(75) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (ARTICLE) 2493minus2537(76) Ng P dna2vec Consistent vector representations of variable-length k-mers arXiv preprint arXiv170106279 2017(77) Woloszynek S Zhao Z Chen J Rosen G L 16S rRNAsequence embeddings Meaningful numeric feature representations ofnucleotide sequences that are convenient for downstream analysesPLoS Comput Biol 2019 15 (2) No e1006721(78) Arango-Argoty G A Heath L S Pruden A Vikesland PZhang L MetaMLP A fast word embedding based classifier to profiletarget gene databases in metagenomic samples bioRxiv 2019 569970(79) Liang Q Bible P W Liu Y Zou B Wei L DeepMicrobestaxonomic classification for metagenomics with deep learning NARGenomics and Bioinformatics 2020 2 (1) No lqaa009(80) Fiannaca A La Paglia L La Rosa M Renda G Rizzo RGaglio S Urso A Deep learning models for bacteria taxonomicclassification of metagenomic data BMC Bioinf 2018 19 (7) 198(81) Krawczyk P S Lipinski L Dziembowski A PlasFlowpredicting plasmid sequences in metagenomic data using genomesignatures Nucleic Acids Res 2018 46 (6) No e35-e35(82) Hollender J van Bavel B Dulio V Farmen E FurtmannK Koschorreck J Kunkel U Krauss M Munthe J Schlabach MHigh resolution mass spectrometry-based non-target screening cansupport regulatory environmental monitoring and chemicals manage-ment Environ Sci Eur 2019 31 (1) 42(83) Soria N G C Montes A Bisson M A Atilla-Gokcumen GE Aga D S Mass spectrometry-based metabolomics to assess uptakeof silver nanoparticles by Arabidopsis thaliana Environ Sci Nano2017 4 (10) 1944minus1953(84) Matich E K Ghafari M Camgoz E Caliskan E Pfeifer BA Haznedaroglu B Z Atilla-Gokcumen G E Time-serieslipidomic analysis of the oleaginous green microalga species Ettliaoleoabundans under nutrient stress Biotechnol Biofuels 2018 11 (1)1minus15(85) Matich E K Soria N G C Aga D S Atilla-Gokcumen GE Applications of metabolomics in assessing ecological effects ofemerging contaminants and pollutants on plants J Hazard Mater2019 373 527minus535(86) Knolhoff A M Premo J H Fisher C M A ProposedQuality Control Standard Mixture and Its Uses for EvaluatingNontargeted and Suspect Screening LCHR-MS Method Perform-ance Anal Chem 2021931596(87) Samanipour S Kaserzon S Vijayasarathy S Jiang H ChoiP Reid M J Mueller J F Thomas K V Machine learningcombined with non-targeted LC-HRMS analysis for a risk warningsystem of chemical hazards in drinking water A proof of conceptTalanta 2019 195 426minus432(88) Angeles L F Islam S Aldstadt J Saqeeb K N Alam MKhan M A Johura F-T Ahmed S I Aga D S Retrospectivesuspect screening reveals previously ignored antibiotics antifungal
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
L
compounds and metabolites in Bangladesh surface waters Sci TotalEnviron 2020 712 136285(89) Albergamo V Scholleacutee J E Schymanski E L Helmus RTimmer H Hollender J De Voogt P Nontarget screening revealstime trends of polar micropollutants in a riverbank filtration systemEnviron Sci Technol 2019 53 (13) 7584minus7594(90) Chiaia-Hernaacutendez A C Scheringer M Muller A StiegerG Waumlchter D Keller A Pintado-Herrera M G Lara-Martin PA Bucheli T D Hollender J Target and suspect screening analysisreveals persistent emerging organic contaminants in soils andsediments Sci Total Environ 2020 740 140181(91) Cariou R Meacutendez-Fernandez P Hutinet S b Guitton YCaurant F Le Bizec B Spitz J r m Vetter W Dervilly GNontargeted LCESI-HRMS Detection of Polyhalogenated Com-pounds in Marine Mammals Stranded on French Atlantic Coasts ACSESampT Water 20211309(92) Travis S C Kordas K Aga D S Optimized workflow forunknown screening using gas chromatography high-resolution massspectrometry expands identification of contaminants in siliconepersonal passive samplers Rapid Commun Mass Spectrom 2021 35(8) No e9048(93) Hollender J Schymanski E L Singer H P Ferguson P LNontarget screening with high resolution mass spectrometry in theenvironment ready to go ACS Publications 2017(94) Muller A Schulz W Ruck W K L Weber W H A newapproach to data evaluation in the non-target screening of organictrace substances in water analysis Chemosphere 2011 85 (8) 1211minus1219(95) Scholleacutee J E Schymanski E L Hollender J Statisticalapproaches for LC-HRMS data to characterize prioritize and identifytransformation products from water treatment processes In AssessingTransformation Products of Chemicals by Non-Target and SuspectScreening- Strategies and Workflows Vol 1 ACS Publications 2016 pp45minus65(96) Sireacuten K Fischer U Vestner J Automated supervisedlearning pipeline for non-targeted GC-MS data analysis AnalyticaChimica Acta X 2019 1 100005(97) Sinkov N A Harynuk J J Cluster resolution A metric forautomated objective and optimized feature selection in chemometricmodeling Talanta 2011 83 (4) 1079minus1087(98) Johnsen L G Amigo J M Skov T Bro R Automatedresolution of overlapping peaks in chromatographic data J Chemom2014 28 (2) 71minus82(99) Smirnov A Jia W Walker D I Jones D P Du X ADAP-GC 32 graphical software tool for efficient spectral deconvolution ofgas chromatography-high-resolution mass spectrometry metabolomicsdata J Proteome Res 2018 17 (1) 470minus478(100) Chen C Husny J Rabe S Predicting fishiness off-flavourand identifying compounds of lipid oxidation in dairy powders bySPME-GCMS and machine learning Int Dairy J 2018 77 19minus28(101) Taghadomi-Saberi S Mas Garcia S Allah Masoumi ASadeghi M Marco S Classification of bitter orange essential oilsaccording to fruit ripening stage by untargeted chemical profiling andmachine learning Sensors 2018 18 (6) 1922(102) Yang Q Xu L Tang L-J Yang J-T Wu B-Q Chen NJiang J-H Yu R-Q Simultaneous detection of multiple inheritedmetabolic diseases using GC-MS urinary metabolomics by chemo-metrics multi-class classification strategies Talanta 2018 186 489minus496(103) Smolinska A Hauschild A C Fijten R R R Dallinga JW Baumbach J Van Schooten F J Current breathomicsa reviewon data pre-processing techniques and machine learning inmetabolomics breath analysis J Breath Res 2014 8 (2) 027105(104) Garrido-Baserba M Corominas L Corteacutes U Rosso DPoch M The fourth-revolution in the water sector encounters thedigital revolution Environ Sci Technol 2020 54 (8) 4698minus4705(105) Vujnovic G Perisic J Bozilovic Z Milovanovic MGetman R V Radovanovic L USING SCADA SYSTEM FOR
PROCESS CONTROL IN WATER INDUSTRY Acta TechnicaCorviniensis-Bulletin of Engineering 2019 12 (2) 67minus72(106) Olsson G ICA and me-a subjective review Water Res 201246 (6) 1585minus1624(107) Zhao H Hou D Huang P Zhang G Water quality eventdetection in drinking water network Water Air Soil Pollut 2014 225(11) 2183(108) Chandola V Banerjee A Kumar V Anomaly detection Asurvey ACM computing surveys (CSUR) 2009 41 (3) 1minus58(109) Hasan J Goldbloom-Helzner D Ichida A Rouse TGibson M Technologies and Techniques for Early Warning Systems toMonitor and Evaluate Drinking Water Quality A State-of-the-ArtReview Environmental Protection Agency Washington Dc Office ofWater 2005(110) Izquierdo J Loacutepez P A Martiacutenez F J Peacuterez R Faultdetection in water supply systems using hybrid (theory and data-driven) modelling Mathematical and Computer Modelling 2007 46(3minus4) 341minus350(111) Murray S Ghazali M McBean E A Real-time water qualitymonitoring assessment of multisensor data using Bayesian beliefnetworks Journal of Water Resources Planning and Management 2012138 (1) 63minus70(112) Raciti M Cucurull J Nadjm-Tehrani S Anomaly detectionin water management systems In Critical Infrastructure ProtectionSpringer 2012 pp 98minus119(113) Ahmed M Mahmood A N Hu J A survey of networkanomaly detection techniques Journal of Network and ComputerApplications 2016 60 19minus31(114) Yuan Y Jia K A Water Quality Assessment Method Based onSparse Autoencoder 2015 IEEE 2015 pp 1minus4(115) Muharemi F Logofatu D Leon F Machine learningapproaches for anomaly detection of water quality on a real-worlddata set Journal of Information and Telecommunication 2019 3 (3)294minus307(116) Alom M Z Taha T M Yakopcic C Westberg S SidikeP Nasrin M S Hasan M Van Essen B C Awwal A A S AsariV K A state-of-the-art survey on deep learning theory andarchitectures Electronics 2019 8 (3) 292(117) Russo S Lurig M Hao W Matthews B Villez K Activelearning for anomaly detection in environmental data EnvironmentalModelling amp Software 2020 134 104869(118) Ahmad S Lavin A Purdy S Agha Z Unsupervised real-time anomaly detection for streaming data Neurocomputing 2017262 134minus147(119) Stocco A Tonella P Towards Anomaly Detectors that LearnContinuously IEEE 2020 pp 201minus208(120) Hoffman M Bach F R Blei D M In Online learning forlatent dirichlet allocation 2010 Citeseer pp 856minus864(121) Mongillo G Deneve S Online learning with hidden Markovmodels Neural computation 2008 20 (7) 1706minus1716(122) Bogler A Packman A Furman A Gross A Kushmaro ARonen A Dagot C Hill C Vaizel-Ohayon D Morgenroth ERethinking wastewater risks and monitoring in light of the COVID-19pandemic Nature Sustainability 2020 9 1minus10(123) Bivins A North D Ahmad A Ahmed W Alm E BeenF Bhattacharya P Bijlsma L Boehm A B Brown J Wastewater-Based Epidemiology Global Collaborative to Maximize Contributions inthe Fight Against COVID-19 ACS Publications 2020
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
M
In a typical metagenomic study the first step is often tosearch sequenced DNA reads against a reference database toderive taxonomic or functional gene annotation The obtainedtaxonomic or functional gene profiles are then normalized toprovide relative abundance information corresponding todifferent species or genes in the sample that can be visualizedand analyzed to answer specific questions about the sampleContinuous optimization and declining costs of sequencingplatforms have made metagenomics related research moreaccessible than ever before This trend has led to thegeneration of large volumes of publicly available sequencingdata but rendering these data into meaningful interpretationsremains challenging67 Fortunately the advent of ML methodshas immensely accelerated advancement at every cross sectionof a typical metagenomic study68 Here we highlight someapplications of ML in antibiotic resistance-related studiesOne often faced challenge with metagenomic data is that of
high dimensionality and low sample size (HDLSS ie morevariables than independent samples) For example the numberof genes annotated within a given environmental sample caneasily exceed hundreds of thousands Hence many unsuper-vised techniques such as principal component analysis (PCA)and non-metric multi-dimensional scaling (NMDS) are usedas preprocessing steps to reduce dimensionality by removinguninformative features A common application of thesetechniques is to examine the similaritydissimilarity of differentenvironmental samples by clustering based on species or genecomposition69 Network-based ML (data is represented in theform of a graph where nodes are the data points to be clusteredand edges represent the relationship or similarity between thedata points) is another unsupervised approach to studyproteinminusprotein or geneminusgene interactions to understanddifferent functional pathways70
Given the sample labels or response variables and the genecomposition support vector machines (SVMs) ANNs andensemble methods (such as random forest (RF) or extremelyrandomized trees) have been extensively used to performsupervised learning and build predictive models For exampleidentifying interesting patterns and important genes in a dataset58 predicting relative antibiotic resistance abundancelevels7 understanding the role of socioeconomic status inshaping the resistome or microbiome71 or the prediction ofantibiotic resistance phenotype72 are some of the uniqueproblems that have been examined using the above-mentionedmethods Hidden Markov models (HMMs) which can beused in both supervised or unsupervised fashion are one of themore readily used algorithms to detect antibiotic resistancegene variants or potential functional homologues73 and havebeen applied for the discovery of novel ARGs frommetagenomic data74
Word embedding which has gained a lot of popularity overthe recent years is a feature learning technique used in naturallanguage processing (NLP) Word embedding is a term usedfor the representation of words or sentences in the form ofnumeric vectors that can be used for downstream ML tasks75
Raw DNAprotein sequences sliced into k-mers are analogousto the structure of a sentence and can be analyzed in a similarfashion as natural languages Thus there is a rapid thrusttoward analyzing raw sequences using NLP based techni-ques7677 MetaMLP78 based on a similar idea uses a wordembedding based classification model to predict ARGphenotype and has been found to perform 50times faster thanDIAMOND (one of the fastest sequence alignment methods)
with similar accuracy Word embedding based models holdimmense promise in analyzing metagenomic data as they havethe ability to learn patterns directly from within raw sequencesSimilar to the antibiotic resistance example other
applications can also be envisioned For instance analyzingmetagenomics data for taxonomic classification7980 or mobilegenetic element classification81 where ML algorithms wereable to outperform the traditional methods In essence MLalgorithms have enhanced our ability to robustly analyzecomplex metagenomic data
32 Nontarget Analysis of Environmental Samples Arapidly developing approach in environmental analysis andtoxicology involves the use of high resolution massspectrometry (HRMS) based nontarget analysis (NTA)where data on accurate masses of molecular and fragmentions are collected without a priori information on thechemicals being analyzed NTA can potentially aid in thescreening and analysis of the vast and diverse universe oforganic pollutants a grand challenge faced by the ESEcommunity82 Because the chemicals being detected are notpredetermined NTA provides an opportunity to comprehen-sively examine the occurrence fate and transport of chemicalcontaminants in different environmental niches with minimalbias82 Similarly NTA can be applied in environmentalmetabolomics to facilitate understanding of the effects ofchemical perturbations on exposed organisms (eg plantsanimals and humans) without focusing on a particularbiochemical pathway8384 A recent review highlighted studiesthat employed the different types of NTA techniques used inmetabolomics to discover metabolite changes in plants inducedby exposure to xenobiotics (eg pharmaceuticals personal careproducts pesticides flame retardants and engineered nano-materials) to examine the effect of altered levels of nutrients inthe environment on plant systems85 Here we present a briefsummary of how the combination of NTA with ML canadvance monitoring of water wastewater quality soil andexposed organisms (eg humans wildlife)While there are various methods that can be used for NTA
in ESE HRMS is the most popular because of its capacity forsensitive detection of low levels of contaminants andmetabolites in complex environmental samples However thepower of NTA using HRMS has been limited by problemsassociated with sample preparation and data analysis Inanalyzing environmental samples using MS sample concen-tration and cleanup are critical because the signal intensity ofthe MS features depend not only on the concentrations of thechemicals but also on the amounts and the nature of thematrix present in the sample extracts The reproducibility ofthe ionization of compounds in MS can be compromisedsignificantly by matrix effects Therefore it is important to havean appropriate number of replicate samples and properlyselected blank samples when conducting NTA Solid-phaseextraction for sample cleanup is commonly used but thisapproach can introduce bias because highly polar contaminantsmay be lost during sample preparation Unlike in targetanalysis where variations due to matrix effects and samplelosses can be corrected using stable isotope-labeled referencecompounds as surrogates this approach cannot be used inNTA because the purpose of NTA is to identify unknowncontaminants that were not included in the target list Tonormalize for instrumental variation and matrix effects internalstandards with varying polarity can be added to the sampleextract prior to analysis Recently several compounds of
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
F
diverse structure log Kow chromatographic behavior andionization efficiency have been proposed for inclusion in aquality control mixture usable for NTA86 However theproposed mixture has limitations such as for the detection ofhydrophilic compounds and molecular formula generation forcompounds containing fluorine Finally due to the substantialamount of MS data acquired indiscriminately under full-scanand the subsequent MSMS fragmentation of molecular andfragment ions data processing to identify relevant features isdaunting Therefore NTA workflows with built-in filters andcriteria are being developed to facilitate prioritization of MSfeatures that are useful for chemical structure annotation Inthis regard advanced data processing tools high throughputstatistical packages and user-friendly visualization programsare needed to fully interrogate the rich data sets acquired byNTANTA based on HRMS holds great promise for the
comprehensive monitoring of the occurrence and fate ofcontaminants in water and wastewater87 It also allowsresearchers to retrospectively analyze stored HRMS data toscreen for suspected contaminants (ie suspect screening) thatmay have been missed during target analysis88 Using advancedcomputational strategies such as hierarchical cluster analysiscommon patterns in the occurrence of contaminants in theenvironment and their emission pathways can be predictedbased on time series analysis of the aggregated NTA data89
NTA combined with cluster analysis was successfully appliedto reveal previously unmonitored chemical contaminants insoil and sediment samples90 The application of NTA andadvanced postacquisition data treatment will continue toenhance our ability to discover emerging contaminants in theenvironment including those that bioaccumulate and poserisks to humans and wildlife For instance new polyhalo-genated compounds were detected for the first time in blubbersamples from marine mammal sentinel species using both LC-HRMS and GC-HRMS for analysis and an open-source datamining software in the R programming environment thatdetected halogenated signatures in full scan HRMS91 FinallyNTA combined with personal passive samplers and propersample preparation techniques can be used to unlock thecomposition of chemical mixtures that humans are exposed toon a daily basis which can be used in investigating the humanexposome92
Akin to metagenomic data NTA data pose a similar HDLSSchallenge as each mass spectrum constitutes a large number ofpeaks representing potential compounds present in a givensample Hence data preprocessing is crucial and inevitableVarious data preprocessing steps are performed such asdetecting peaks subtracting peaks that were found in blankor control samples componentization (ie grouping of signalsthat probably belong to one unique molecular structure) andremoving noise using replicate measurements93 Followingpreprocessing various supervised and unsupervised MLalgorithms can be applied to engineer select and extractrelevant features However before any further analysis datanormalization and data scaling are two crucial steps thatrequire considerationThe most common algorithms used for NTA are linear
projection methods such as PCA and supervised partial least-squares discriminant analysis (PLS-DA)949548 PCA aids insample comparison by removing variance from the sample setSupervised PLS-DA along with relevant metadata can be usedto extract features pertaining to the specific questions that are
being asked However a major challenge is that featuredetection based on peak intensity can be misleading because oferroneous peak assignments Also relevant information can belost when differentiating the actual contaminant signals andbackground noise when using intensity information for featureselection Hence several alternative approaches that aim atusing raw data signals (retention time times mass-to-charge ratio)that bypass the peak detection step have been proposed andimplemented for improved extraction of information andunderlying patterns in the data set96minus98 This includestechniques such as transforming the raw data signal to distancematrices for dissimilarity analysis96minus98 and using clustering toextract features99
A number of other data analysis algorithms exist but havenot yet become common in NTA environmental analysisClustering techniques such as k-means and hierarchicalclustering can be used to identify similar samples Algorithmslike RF SVMs and ANNs can be applied to classify samplesbased on different categories and would be advantageous inillustrating nonlinear relations in the data These algorithmshave shown excellent promise in analyzing HRMS data inother fields100minus103 and this promise can certainly be extendedto environmental sample analysis
33 Anomaly Detection in Engineered Water Sys-tems (EWS) EWS is an umbrella term for systems of watercollection treatment distribution storage and their operationRecently there has been a major thrust toward digitalization ofthe water sector104 In particular the evolution of cyberinfras-tructure and of online process control instrumentation has ledto the development of advanced process control solutions suchas supervisory control and data acquisition (SCADA)systems105106 Such advances have enabled water utilities tocontinuously monitor water quality identify problems andeffectively oversee maintenance issues both remotely and morelocally These systems entail collection of a large volume of rawdata that could be in conjunction with appropriate dataanalysis techniques transformed into valuable information thatcan be leveraged to make proactive decisions to optimizeoverall performance107 In particular there is surging interest inusing ML techniques to identify unusual patterns in raw EWSdata as a means of discovering unexpected activitiesminusthis isbroadly termed anomaly detection108
A typical anomaly detection task in EWS aims todifferentiate between natural expected variations in waterquality and unusual or suspicious variations caused bycontamination or failure somewhere in the system Thechallenge is to characterize these normal water qualityvariations as it requires analysis of long-term data thatencompasses inherent background variability109 This charac-terized normal response helps to flag anomalous events in thedata Various ML algorithms have been applied to address thisproblem (Table 1)Within EWS one primary form of the generated data is a
time series where each time point can be considered a discretesample The time series consists of measurements of indicatorscollected over time from one or several sensors which formthe feature set Hence each sample can be represented as amultidimensional vector where each dimension represents afeature Historic data is then used to train the model Givenappropriate analysis of the supporting data set one can frameand solve problems to address a multitude of aims such asdetecting leaks sensor failures abrupt changes in water qualityor contamination events110minus112
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
G
There are a number of different anomaly detectiontechniques available113 In the unsupervised approachunlabeled samples are clustered using an algorithm such ask-means density-based clustering algorithms or expectation-maximization (EM) clustering The concept is that the normaldata points cluster to form high density clusters whereasanomalous data points cluster separately or distant from thenormal data pointsclusters and are located in low-densityregions This way one can carry out additional investigationusing new samples and classify them relative to the normaldata In the supervised approach labels (normal or anomalous)are assigned to each sample based on past information andexpert knowledge of anomalous behavior In this way theproblem is converted into a binary classification problemThough it can easily be extended to a multiclass classificationto categorize different types of anomalies SVM ANNBayesian Network Logistic Regression models and theirvariants are well explored algorithms in this space113 A specialcase of SVM One-Class SVM (OCSVM) is a widely usedmethod for anomaly detection In OCSVM the entire trainingdata set is considered as one-class (eg normal class) and thenew data points are classified as similar (normal) or different(anomalous) to the training data Because considering all datapoints from one class is equivalent to having no label OCSVMis considered as unsupervised learning methodOwing to the success of ML in detecting anomalies in other
domains such as network security researchers have startedexploring its potential in water quality anomaly detec-tion43114115 Based on the studies published so far DLalgorithms have shown promise in detecting anomalies andhave outperformed conventional techniques on a number ofoccasions However it should be noted that DL methods couldbe slow to train depending on the depth of the network andthe amount of data available116
Many studies have adopted a batch learning approach wherethe model is trained on historical data and then the new data iscategorized using this trained model With a continuousincoming time-series data stream and the possibility of novelanomalies retraining the models with every new data set canquickly become impractical and difficult to reliably executeContinual or active learning frameworks that continuouslylearn as the new data stream comes in and that can identifyanomalous behavior in real time are required to circumventthese issues117118 Variants of Latent Dirichlet AllocationMarkov Models and ANN based architectures are algorithmsthat have been applied in other domains to achieve continualonline learning119minus121 However thes approaches remainunderexplored for anomaly detection in water systems asthese frameworks are not trivial and have their ownimplementation challenges118119
Ultimately anomaly detection will be most valuable if it canlearn continually as the data stream comes in and yield real-time reporting that informs immediate corrective action It iscrucial that the models being utilized are fast and accurateHence going forward there is a need to shift the focus towardhybrid approaches (combinations of different algorithms) tobuild powerful models that are able to detect multiple types ofanomalies in real-time Such models will enhance EWSanomaly detection and advance public health9118119
4 PATH FORWARDThere is increasing interest in and a growing body of researchdocumenting how data analytics is being used to address ESE
problems As illustrated in this Feature the power of dataanalytics has been widely recognized as has been the vast needfor its application Anomaly detection has particularlypromising applications for water professionals and practi-tioners Notably broader application of metagenomics andNTA can revolutionize environmental monitoring effortsHowever the application of data analytics in ESE practiceremains in its infancy and concerted efforts are required tomake data analytics an integral part of ESE research andeducation Such a goal would most effectively be achieved ifthere were an agreed-upon plan of action Coordinated effortson several fronts are needed to help the ESE community reapthe potential benefits of data analyticsFirst there is a need to encourage collaborations between
data scientists and ESE practitioners that involve diverselyengaged and integrated research teams from multipledisciplines Such collaborations will facilitate cross-disciplinarycommunication and simultaneous skills building For exampledata scientists come from a culture highly supportive of datasharing (eg GitHub Bitbucket public databases) The ESEcommunity should embrace this culture and incorporate datasharing both locally regionally and globally Recent efforts toaddress the COVID-19 pandemic122123 the global dissem-ination of antimicrobial resistance7 and the development ofglobally vetted data analysis approaches within the NTAcommunity are all steps in this direction Such collaborationshave the potential to yield profound data-driven insights andconclusions that would be impossible to achieve within normalacademic and professional silos Working on their own datascientists may create models that answer specific questions butlack the interpretive training required to contextualize theirresults into real world applications Simultaneously ESEpractitioners with data may not possess the correspondingtools or insights required for proper analysis Strategiccollaborations will lead to improved data analytics datainterpretation and improved decision makingSecond there is a lack of user-friendly data analytics tools
and platforms devised for ESE problems Although under-standing the theory behind ML algorithms is not a difficult taskfor many trained scientists a major hurdle that ESE researchersface is in the coding and implementation of these models Withthe continual advancement of data science programming isbecoming an increasingly necessary skill without which modelsmay be improperly developed or may lack efficiency User-friendly tools may circumvent the problems with codinglearning curves and may be essential for widespreadpractitioner adoption While many tools (Microsoft powerBI Weka Tableau Azure) are available for general dataanalysis application-specific toolsplatforms would greatly aidin performing end-to-end analysesThird the field needs to incorporate data analytics-oriented
curricula This could include the addition of data analyticsfocused modules within existing coursework the introductionof new data science and statistics courses within ESE degreeprograms and the development of relevant capstone projectsfor ESE applications Such experiential learning would helpstudents understand the intricacies of different algorithms andprovide hands-on experience in applying various data analyticstechniques in the context of specific ESE problems Furtherencouraging students to take part in data science internshipswould be an excellent means of developing such expertiseFourth introducing data analytics workshops and making
the relevant resources available is valuable for students
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
H
researchers and professionals A plethora of online resourcessuch as Coursera LinkedIn Learning and Edx are available forlow or no cost offering high-quality content in data scienceExpanding access to and creation of open source learningplatforms could prove instrumental in instigating the data-driven problem-solving approachFinally we know that data analytics is not new but
continues to evolve at a rapid pace These advancementsfurnish us with new and powerful ways to analyze the data thatcan help holistically tackle the challenges faced by ESEcommunity and ultimately inform decision-making and policyformulation at scales never previously imagined Hence it iscritical to be abreast of new developments in data analytics andto continue making efforts to harvest their full potential
AUTHOR INFORMATIONCorresponding AuthorPeter Vikesland minus Via Department of Civil andEnvironmental Engineering Virginia Tech BlacksburgVirginia 24061 United States orcidorg0000-0003-2654-5132 Email pvikesvtedu
AuthorsSuraj Gupta minus The Interdisciplinary PhD Program in GeneticsBioinformatics and Computational Biology Virginia TechBlacksburg Virginia 24061 United States
Diana Aga minus Department of Chemistry University at BuffaloThe State University of New York Buffalo New York 14226United States orcidorg0000-0001-6512-7713
Amy Pruden minus Via Department of Civil and EnvironmentalEngineering Virginia Tech Blacksburg Virginia 24061United States orcidorg0000-0002-3191-6244
Liqing Zhang minus Department of Computer Science VirginiaTech Blacksburg Virginia 24061 United States
Complete contact information is available athttpspubsacsorg101021acsest1c01026
NotesThe authors declare no competing financial interestBiography
Peter J Vikesland is the Nick Prillaman Professor of Civil andEnvironmental Engineering at Virginia Tech He received his BAfrom Grinnell College in Chemistry in 1993 and MS and PhD inCivil and Environmental Engineering from University of Iowa in 1995and 1998 Vikeslandrsquos research focuses on the fate of nanomaterials inthe environment nanotechnology enabled sensor platforms forenvironmental quality assessment the environmental disseminationof antibiotic resistance and the application of data analytics in
environmental science and engineering He is a past President of theAssociation of Environmental Engineering and Science Professors(AEESP) is a US National Science Foundation CAREER awardeeand is the recipient of the 2018 Walter Weber Research InnovationAward from AEESP
ACKNOWLEDGMENTS
This study was supported by National Science Foundation(NSF) awards OISE (1545756) ECCSS NNCI (1542100)and CSSI (2004751) Additional support was provided by theCenter for Science and Engineering of the Exposome at theVirginia Tech Institute for Critical Technology and AppliedScience (ICTAS) and the Virginia Tech Graduate Schoolsupported Genetics Bioinformatics and ComputationalBiology (GBCB) and Sustainable Nanotechnology (VTSuN)programs
GLOSSARY
Accuracy Correct predictionstotal predictions
Autoencoders A deep learning techni-que that aims to build amodel where the outputtargets are set equal tothe input variables Itseeks to learn an ap-proximate representa-tion of the data andreconstruct it It is anunsupervised learningapproach used for di-mensionality reduction
Bayesian Network Bayesian networks are akind of probabilisticgraphical model thatutilizes Bayesian infer-ence for probability es-timations Bayesian net-works are used tomodel conditional de-pendence and hencecausation using a direc-ted graph
Data cleaning Identification of outliersor filling in of missingvalues Outlier identifi-cation can be done byusing Z-score quartilevalues or hypothesistesting Missing datacan be handled in mul-tiple ways such as fillingthe value by computingthe summary statisticsof the given variableusing predicted valuescomputed by an MLalgorithm manual cura-tion or by ignoring themissing record
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
I
Data integration The combination ofmultiple data sourcesinto a single coherentform
Data reduction Aggregation or elimina-tion of redundant in-formation to reduce thedata set size
Data transformation Conversion of raw databy normalizing to acommon scale to en-sure consistency andcomparability
Density Based Clustering Work by recognizingldquodenserdquo groups ofpoints permitting it tolearn clusters of discre-tionary shape and dis-tinguish anomalies inthe information
Dimensionality Reduction The process of reducingthe number of randomvariables or attributesunder consideration
EM Clustering Similar to k-means ex-cept it assigns the sam-ples into clusters basedon the probabilities es-timated using the EMalgorithm The objec-tive is to maximize theoverall probability ofthe data for the given(final) clusters
Ensemble Methods Ensemble methods arealgorithms that com-bine predictions fromseveral base models toobtain one optimalmodel
F1-Score F1-Score is a harmonicmean of recall andprecision
HMM HMM is a statisticalMarkov mode l i nwhich the system beingmodeled is assumed tobe a Markov process
k-mer A substring with alength k in a biologicalsequence of nucleoti-des
Naive Bayes Naive Bayes methodsare a family of simpleprobabilistic classifiersbased on Bayes ruleThe method is calledldquoNaiverdquo for its assump-tion of conditional in-dependence among thefeatures
Natural Language Processing (NLP) A subfield of computerscience artificial intelli-gence and linguisticsthat deals with theinteraction of com-puters with human lan-guages
NMDS NMDS is an ordinationtechnique based on dis-tance or dissimilaritymatrix NMDS repre-sents pairwise dissimi-larity between samplesin a low-dimensionalspace
PLS-DA PLS-DA is a linearclassification modelthat is able to predictthe class of new sam-ples
R Programming languagefor Statistical Comput-ing
ROC-Curve A receiver operatingcharacteristic (ROC)curve is a plot of truepositive rate (TPR-yaxis) against the falsepositive rate (FPR-xaxis) It measures theperformance of a classi-fier
SCADA Supervisory control andd a t a a c q u i s i t i o n(SCADA) is a systemdesigned to gather andanalyze real time dataIt is used in waterwastewater treatmentplants to monitor andmanage processes
REFERENCES(1) Tukey J W The future of data analysis Ann Math Stat 196233 (1) 1minus67(2) Hayashi C What is data science Fundamental concepts and aheuristic example In Data science classification and related methodsSpringer 1998 pp 40minus51(3) Bishop C M Pattern recognition and machine learning Springer2006(4) Qiu J Wu Q Ding G Xu Y Feng S A survey of machinelearning for big data processing EURASIP Journal on Advances inSignal Processing 2016 2016 (1) 67(5) Agarwal R Dhar V Big data data science and analytics Theopportunity and challenge for IS research In INFORMS 201425443(6) Wilcox C Woon W L Aung Z Applications of machinelearning in environmental engineering Citeseer 2013(7) Hendriksen R S Munk P Njage P Van Bunnik BMcNally L Lukjancenko O Roumlder T Nieuwenhuijse DPedersen S K Kjeldgaard J Global monitoring of antimicrobialresistance based on metagenomics analyses of urban sewage NatCommun 2019 10 (1) 1124(8) Sobus J R Wambaugh J F Isaacs K K Williams A JMcEachran A D Richard A M Grulke C M Ulrich E MRager J E Strynar M J Integrating tools for non-targeted analysis
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
J
research and chemical safety evaluations at the US EPA J ExposureSci Environ Epidemiol 2018 28 (5) 411minus426(9) Dogo E M Nwulu N I Twala B Aigbavboa C A survey ofmachine learning methods applied to anomaly detection on drinking-water quality data Urban Water J 2019 16 (3) 235minus248(10) Wilde F D Radtke D B Gibs J Iwatsubo R T Nationalfield manual for the collection of water-quality data US Geological SurveyTechniques of Water-Resources Investigations Book 9 1998(11) National Research Council Confronting the nationrsquos waterproblems The role of research National Academies Press 2004(12) Bharti R Grimm D G Current challenges and best-practiceprotocols for microbiome analysis Briefings in bioinformatics202122178(13) Veenaas C Developing tools for non-target analysis and digitalarchiving of organic urban water pollutants 2018(14) Keim D A Visual exploration of large data sets CommunACM 2001 44 (8) 38minus44(15) Chen J Chen Y Du X Li C Lu J Zhao S Zhou X Bigdata challenge a data management perspective Frontiers of ComputerScience 2013 7 (2) 157minus164(16) Chakrabarti S Cox E Frank E Guting R H Han J JiangX Kamber M Lightstone S S Nadeau T P Neapolitan R EData mining know it all Morgan Kaufmann 2008(17) Gibert K Sanchez-Marre M Izquierdo J A survey on pre-processing techniques Relevant issues in the context of environ-mental data mining AI Communications 2016 29 (6) 627minus663(18) Russell S Norvig P Artificial intelligence a modern approach2002(19) Hinton G E Sejnowski T J Poggio T A UnsupervisedLearning Foundations of Neural Computation MIT press 1999(20) Van Engelen J E Hoos H H A survey on semi-supervisedlearning Machine Learning 2020 109 (2) 373minus440(21) Brink H Richards J W Fetherolf M Cronin B Real-worldmachine learning Manning Shelter Island NY 2017(22) Kassambara A Practical guide to cluster analysis in RUnsupervised machine learning Sthda Vol 1(23) Ghodsi A Dimensionality reduction a short tutorialDepartment of Statistics and Actuarial Science Univ of WaterlooOntario Canada 2006 37 (38) 2006(24) Wright R E Logistic regression 1995(25) Alvarez-Arbesu R Feliciacutesimo A M GIS and logisticregression as tools for environmental management a coastal cliffvegetation model in Northern Spain WIT Transactions on Informationand Communication Technologies 2002 26(26) Liu L Sankarasubramanian A Ranjithan S R Logisticregression analysis to estimate contaminant sources in waterdistribution systems J Hydroinf 2011 13 (3) 545minus557(27) Thoe W Gold M Griesbach A Grimmer M Taggart ML Boehm A B Predicting water quality at Santa Monica Beachevaluation of five different models for public notification of unsafeswimming conditions Water Res 2014 67 105minus117(28) Muharemi F Logofa tu D Andersson C Leon FApproaches to building a detection model for water quality a casestudy In Modern Approaches for Intelligent Information and DatabaseSystems Springer 2018 pp 173minus183(29) Liaw A Wiener M Classification and regression byrandomForest R news 2002 2 (3) 18minus22(30) Gromski P S Muhamadali H Ellis D I Xu Y Correa ETurner M L Goodacre R A tutorial review Metabolomics andpartial least squares-discriminant analysis-a marriage of convenienceor a shotgun wedding Anal Chim Acta 2015 879 10minus23(31) Lee S Kim J Prediction of Nanofiltration and Reverse-Osmosis-Membrane Rejection of Organic Compounds UsingRandom Forest Model J Environ Eng 2020 146 (11) 04020127(32) Castrillo M Garciacutea A L Estimation of high frequencynutrient concentrations from water quality surrogates using machinelearning methods Water Res 2020 172 115490
(33) Tan G Yan J Gao C Yang S Prediction of water qualitytime series data based on least squares support vector machineProcedia Eng 2012 31 1194minus1199(34) Yang Y H Guergachi A Khan G Support vector machinesfor environmental informatics application to modelling the nitrogenremoval processes in wastewater treatment systems Journal ofEnvironmental Informatics 2006 7 (1) 14minus25(35) Feng J Pan L Cui B Sun Y Zhang A Gong M NewApproach for Concentration Prediction of Aqueous PhenolicContaminants by Using Wavelet Analysis and Support VectorMachine Environmental Engineering Science 2020 37 (5) 382minus392(36) Silver D Schrittwieser J Simonyan K Antonoglou IHuang A Guez A Hubert T Baker L Lai M Bolton AMastering the game of go without human knowledge Nature 2017550 (7676) 354minus359(37) Mikolov T Deoras A Povey D Burget L Cernocky JStrategies for Training Large Scale Neural Network Language ModelsIEEE 2011 pp 196minus201(38) Deng J Dong W Socher R Li L-J Li K Fei-Fei L InImagenet A Large-Scale Hierarchical Image Database Ieee 2009 pp248minus255(39) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (Aug) 2493minus2537(40) Alwosheel A van Cranenburgh S Chorus C G Is yourdataset big enough Sample size requirements when using artificialneural networks for discrete choice analysis Journal of choice modelling2018 28 167minus182(41) Yang J Xu J Zhang X Wu C Lin T Ying Y Deeplearning for vibrational spectral analysis Recent progress and apractical guide Anal Chim Acta 2019 1081 6minus17(42) Guo H Jeong K Lim J Jo J Kim Y M Park J-p Kim JH Cho K H Prediction of effluent concentration in a wastewatertreatment plant using machine learning models J Environ Sci 201532 90minus101(43) Inoue J Yamagata Y Chen Y Poskitt C M Sun JAnomaly Detection for a Water Treatment System Using UnsupervisedMachine Learning 2017 2017 IEEE 2017 pp 1058minus1065(44) Voukantsis D Karatzas K Kukkonen J Raumlsaumlnen TKarppinen A Kolehmainen M Intercomparison of air quality datausing principal component analysis and forecasting of PM10 andPM2 5 concentrations using artificial neural networks in Thessalonikiand Helsinki Sci Total Environ 2011 409 (7) 1266minus1276(45) Wold S Esbensen K Geladi P Principal componentanalysis Chemom Intell Lab Syst 1987 2 (1minus3) 37minus52(46) Jolliffe I T Cadima J Principal component analysis a reviewand recent developments Philos Trans R Soc A 2016 374 (2065)20150202(47) Dalal S G Shirodkar P V Jagtap T G Naik B G Rao GS Evaluation of significant sources influencing the variation of waterquality of Kandla creek Gulf of Katchchh using PCA Environ MonitAssess 2010 163 (1minus4) 49minus56(48) Scholleacutee J E Schymanski E L Avak S E Loos MHollender J Prioritizing unknown transformation products frombiologically-treated wastewater using high-resolution mass spectrom-etry multivariate statistics and metabolic logic Anal Chem 2015 87(24) 12121minus12129(49) Masiaacute A Campo J Blasco C Picoacute Y Ultra-highperformance liquid chromatography-quadrupole time-of-flight massspectrometry to identify contaminants in water an insight onenvironmental forensics Journal of Chromatography A 2014 134586minus97(50) Peiris R H Halleacute C Budman H Moresoli C Peldszus SHuck P M Legge R L Identifying fouling events in a membrane-based drinking water treatment process using principal componentanalysis of fluorescence excitation-emission matrices Water Res 201044 (1) 185minus194(51) MacQueen J In Some methods for classification and analysis ofmultivariate observations 1967 Oakland CA 1967 pp 281minus297
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
K
(52) Kingsy G R Manimegalai R Geetha D M S Rajathi SUsha K Raabiathul B N In Air pollution analysis using enhanced K-Means clustering algorithm for real time sensor data IEEE 2016 pp1945minus1949(53) Zou H Zou Z Wang X An enhanced K-means algorithmfor water quality analysis of the Haihe River in China Int J EnvironRes Public Health 2015 12 (11) 14400minus14413(54) Areerachakul S Sanguansintukul S Clustering analysis of waterquality for canals in Bangkok Thailand Springer 2010 pp 215minus227(55) Javadi S Hashemy S M Mohammadi K Howard K W FNeshat A Classification of aquifer vulnerability using K-means clusteranalysis J Hydrol 2017 549 27minus37(56) Li C Sun L Jia J Cai Y Wang X Risk assessment ofwater pollution sources based on an integrated k-means clustering andset pair analysis method in the region of Shiyan China Sci TotalEnviron 2016 557 307minus316(57) Brunner A M Bertelkamp C Dingemans M M LKolkman A Wols B Harmsen D Siegers W Martijn B JOorthuizen W A Ter Laak T L Integration of target analyses non-target screening and effect-based monitoring to assess OMP relatedwater quality changes in drinking water treatment Sci Total Environ2020 705 135779(58) Gupta S Arango-Argoty G Zhang L Pruden A VikeslandP Identification of discriminatory antibiotic resistance genes amongenvironmental resistomes using extremely randomized tree algorithmMicrobiome 2019 7 (1) 123(59) Du X Shao F Wu S Zhang H Xu S Water qualityassessment with hierarchical cluster analysis based on Mahalanobisdistance Environ Monit Assess 2017 189 (7) 1minus12(60) Bibby K Crank K Greaves J Li X Wu Z Hamza I AStachler E Metagenomics and the development of viral water qualitytools npj Clean Water 2019 2 (1) 1minus13(61) Chu B T T Petrovich M L Chaudhary A Wright DMurphy B Wells G Poretsky R Metagenomics reveals the impactof wastewater treatment plants on the dispersal of microorganismsand genes in aquatic sediments Appl Environ Microbiol 2018 84 (5)No e02168-17(62) Petrovich M L Zilberman A Kaplan A Eliraz G R WangY Langenfeld K Duhaime M Wigginton K Poretsky R AvisarD Microbial and Viral Communities and Their Antibiotic ResistanceGenes Throughout a Hospital Wastewater Treatment System FrontMicrobiol 2020 11 153(63) Dai D Rhoads W J Edwards M A Pruden A Shotgunmetagenomics reveals taxonomic and functional shifts in hot watermicrobiome due to temperature setting and stagnation FrontMicrobiol 2018 9 2695(64) Malla M A Dubey A Yadav S Kumar A Hashem AAbd_Allah E F Understanding and designing the strategies for themicrobe-mediated remediation of environmental contaminants usingomics approaches Front Microbiol 2018 9 1132(65) Giwa A S Ali N Athar M A Wang K Dissectingmicrobial community structure in sewage treatment plant forpathogensrsquo detection using metagenomic sequencing technologyArch Microbiol 2020 202 1minus9(66) Guo J Ni B-J Han X Chen X Bond P Peng Y YuanZ Data on metagenomic profiles of activated sludge from a full-scalewastewater treatment plant Data in brief 2017 15 833minus839(67) Laudadio I Fulci V Stronati L Carissimi C Next-generation metagenomics Methodological challenges and opportu-nities Omics a journal of integrative biology 2019 23 (7) 327minus333(68) Soueidan H Nikolski M Machine learning for metagenomicsmethods and tools arXiv preprint arXiv151006621 2015DOI 101515metgen-2016-0001(69) Ramette A Multivariate analyses in microbial ecology FEMSMicrobiol Ecol 2007 62 (2) 142minus160(70) Yang P Yu S Cheng L Ning K Meta-network optimizedspecies-species network analysis for microbial communities BMCGenomics 2019 20 (2) 187
(71) Collignon P Beggs J J Walsh T R Gandra SLaxminarayan R Anthropological and socioeconomic factorscontributing to global antimicrobial resistance a univariate andmultivariable analysis Lancet Planetary Health 2018 2 (9) No e398-e405(72) Arango-Argoty G Garner E Pruden A Heath L SVikesland P Zhang L DeepARG a deep learning approach forpredicting antibiotic resistance genes from metagenomic dataMicrobiome 2018 6 (1) 1minus15(73) El-Gebali S Mistry J Bateman A Eddy S R Luciani APotter S C Qureshi M Richardson L J Salazar G A Smart AThe Pfam protein families database in 2019 Nucleic Acids Res 201947 (D1) D427minusD432(74) Berglund F Marathe N P Osterlund T Bengtsson-PalmeJ Kotsakis S Flach C-F Larsson D G J Kristiansson EIdentification of 76 novel B1 metallo-β-lactamases through large-scalescreening of genomic and metagenomic data Microbiome 2017 5 (1)134(75) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (ARTICLE) 2493minus2537(76) Ng P dna2vec Consistent vector representations of variable-length k-mers arXiv preprint arXiv170106279 2017(77) Woloszynek S Zhao Z Chen J Rosen G L 16S rRNAsequence embeddings Meaningful numeric feature representations ofnucleotide sequences that are convenient for downstream analysesPLoS Comput Biol 2019 15 (2) No e1006721(78) Arango-Argoty G A Heath L S Pruden A Vikesland PZhang L MetaMLP A fast word embedding based classifier to profiletarget gene databases in metagenomic samples bioRxiv 2019 569970(79) Liang Q Bible P W Liu Y Zou B Wei L DeepMicrobestaxonomic classification for metagenomics with deep learning NARGenomics and Bioinformatics 2020 2 (1) No lqaa009(80) Fiannaca A La Paglia L La Rosa M Renda G Rizzo RGaglio S Urso A Deep learning models for bacteria taxonomicclassification of metagenomic data BMC Bioinf 2018 19 (7) 198(81) Krawczyk P S Lipinski L Dziembowski A PlasFlowpredicting plasmid sequences in metagenomic data using genomesignatures Nucleic Acids Res 2018 46 (6) No e35-e35(82) Hollender J van Bavel B Dulio V Farmen E FurtmannK Koschorreck J Kunkel U Krauss M Munthe J Schlabach MHigh resolution mass spectrometry-based non-target screening cansupport regulatory environmental monitoring and chemicals manage-ment Environ Sci Eur 2019 31 (1) 42(83) Soria N G C Montes A Bisson M A Atilla-Gokcumen GE Aga D S Mass spectrometry-based metabolomics to assess uptakeof silver nanoparticles by Arabidopsis thaliana Environ Sci Nano2017 4 (10) 1944minus1953(84) Matich E K Ghafari M Camgoz E Caliskan E Pfeifer BA Haznedaroglu B Z Atilla-Gokcumen G E Time-serieslipidomic analysis of the oleaginous green microalga species Ettliaoleoabundans under nutrient stress Biotechnol Biofuels 2018 11 (1)1minus15(85) Matich E K Soria N G C Aga D S Atilla-Gokcumen GE Applications of metabolomics in assessing ecological effects ofemerging contaminants and pollutants on plants J Hazard Mater2019 373 527minus535(86) Knolhoff A M Premo J H Fisher C M A ProposedQuality Control Standard Mixture and Its Uses for EvaluatingNontargeted and Suspect Screening LCHR-MS Method Perform-ance Anal Chem 2021931596(87) Samanipour S Kaserzon S Vijayasarathy S Jiang H ChoiP Reid M J Mueller J F Thomas K V Machine learningcombined with non-targeted LC-HRMS analysis for a risk warningsystem of chemical hazards in drinking water A proof of conceptTalanta 2019 195 426minus432(88) Angeles L F Islam S Aldstadt J Saqeeb K N Alam MKhan M A Johura F-T Ahmed S I Aga D S Retrospectivesuspect screening reveals previously ignored antibiotics antifungal
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
L
compounds and metabolites in Bangladesh surface waters Sci TotalEnviron 2020 712 136285(89) Albergamo V Scholleacutee J E Schymanski E L Helmus RTimmer H Hollender J De Voogt P Nontarget screening revealstime trends of polar micropollutants in a riverbank filtration systemEnviron Sci Technol 2019 53 (13) 7584minus7594(90) Chiaia-Hernaacutendez A C Scheringer M Muller A StiegerG Waumlchter D Keller A Pintado-Herrera M G Lara-Martin PA Bucheli T D Hollender J Target and suspect screening analysisreveals persistent emerging organic contaminants in soils andsediments Sci Total Environ 2020 740 140181(91) Cariou R Meacutendez-Fernandez P Hutinet S b Guitton YCaurant F Le Bizec B Spitz J r m Vetter W Dervilly GNontargeted LCESI-HRMS Detection of Polyhalogenated Com-pounds in Marine Mammals Stranded on French Atlantic Coasts ACSESampT Water 20211309(92) Travis S C Kordas K Aga D S Optimized workflow forunknown screening using gas chromatography high-resolution massspectrometry expands identification of contaminants in siliconepersonal passive samplers Rapid Commun Mass Spectrom 2021 35(8) No e9048(93) Hollender J Schymanski E L Singer H P Ferguson P LNontarget screening with high resolution mass spectrometry in theenvironment ready to go ACS Publications 2017(94) Muller A Schulz W Ruck W K L Weber W H A newapproach to data evaluation in the non-target screening of organictrace substances in water analysis Chemosphere 2011 85 (8) 1211minus1219(95) Scholleacutee J E Schymanski E L Hollender J Statisticalapproaches for LC-HRMS data to characterize prioritize and identifytransformation products from water treatment processes In AssessingTransformation Products of Chemicals by Non-Target and SuspectScreening- Strategies and Workflows Vol 1 ACS Publications 2016 pp45minus65(96) Sireacuten K Fischer U Vestner J Automated supervisedlearning pipeline for non-targeted GC-MS data analysis AnalyticaChimica Acta X 2019 1 100005(97) Sinkov N A Harynuk J J Cluster resolution A metric forautomated objective and optimized feature selection in chemometricmodeling Talanta 2011 83 (4) 1079minus1087(98) Johnsen L G Amigo J M Skov T Bro R Automatedresolution of overlapping peaks in chromatographic data J Chemom2014 28 (2) 71minus82(99) Smirnov A Jia W Walker D I Jones D P Du X ADAP-GC 32 graphical software tool for efficient spectral deconvolution ofgas chromatography-high-resolution mass spectrometry metabolomicsdata J Proteome Res 2018 17 (1) 470minus478(100) Chen C Husny J Rabe S Predicting fishiness off-flavourand identifying compounds of lipid oxidation in dairy powders bySPME-GCMS and machine learning Int Dairy J 2018 77 19minus28(101) Taghadomi-Saberi S Mas Garcia S Allah Masoumi ASadeghi M Marco S Classification of bitter orange essential oilsaccording to fruit ripening stage by untargeted chemical profiling andmachine learning Sensors 2018 18 (6) 1922(102) Yang Q Xu L Tang L-J Yang J-T Wu B-Q Chen NJiang J-H Yu R-Q Simultaneous detection of multiple inheritedmetabolic diseases using GC-MS urinary metabolomics by chemo-metrics multi-class classification strategies Talanta 2018 186 489minus496(103) Smolinska A Hauschild A C Fijten R R R Dallinga JW Baumbach J Van Schooten F J Current breathomicsa reviewon data pre-processing techniques and machine learning inmetabolomics breath analysis J Breath Res 2014 8 (2) 027105(104) Garrido-Baserba M Corominas L Corteacutes U Rosso DPoch M The fourth-revolution in the water sector encounters thedigital revolution Environ Sci Technol 2020 54 (8) 4698minus4705(105) Vujnovic G Perisic J Bozilovic Z Milovanovic MGetman R V Radovanovic L USING SCADA SYSTEM FOR
PROCESS CONTROL IN WATER INDUSTRY Acta TechnicaCorviniensis-Bulletin of Engineering 2019 12 (2) 67minus72(106) Olsson G ICA and me-a subjective review Water Res 201246 (6) 1585minus1624(107) Zhao H Hou D Huang P Zhang G Water quality eventdetection in drinking water network Water Air Soil Pollut 2014 225(11) 2183(108) Chandola V Banerjee A Kumar V Anomaly detection Asurvey ACM computing surveys (CSUR) 2009 41 (3) 1minus58(109) Hasan J Goldbloom-Helzner D Ichida A Rouse TGibson M Technologies and Techniques for Early Warning Systems toMonitor and Evaluate Drinking Water Quality A State-of-the-ArtReview Environmental Protection Agency Washington Dc Office ofWater 2005(110) Izquierdo J Loacutepez P A Martiacutenez F J Peacuterez R Faultdetection in water supply systems using hybrid (theory and data-driven) modelling Mathematical and Computer Modelling 2007 46(3minus4) 341minus350(111) Murray S Ghazali M McBean E A Real-time water qualitymonitoring assessment of multisensor data using Bayesian beliefnetworks Journal of Water Resources Planning and Management 2012138 (1) 63minus70(112) Raciti M Cucurull J Nadjm-Tehrani S Anomaly detectionin water management systems In Critical Infrastructure ProtectionSpringer 2012 pp 98minus119(113) Ahmed M Mahmood A N Hu J A survey of networkanomaly detection techniques Journal of Network and ComputerApplications 2016 60 19minus31(114) Yuan Y Jia K A Water Quality Assessment Method Based onSparse Autoencoder 2015 IEEE 2015 pp 1minus4(115) Muharemi F Logofatu D Leon F Machine learningapproaches for anomaly detection of water quality on a real-worlddata set Journal of Information and Telecommunication 2019 3 (3)294minus307(116) Alom M Z Taha T M Yakopcic C Westberg S SidikeP Nasrin M S Hasan M Van Essen B C Awwal A A S AsariV K A state-of-the-art survey on deep learning theory andarchitectures Electronics 2019 8 (3) 292(117) Russo S Lurig M Hao W Matthews B Villez K Activelearning for anomaly detection in environmental data EnvironmentalModelling amp Software 2020 134 104869(118) Ahmad S Lavin A Purdy S Agha Z Unsupervised real-time anomaly detection for streaming data Neurocomputing 2017262 134minus147(119) Stocco A Tonella P Towards Anomaly Detectors that LearnContinuously IEEE 2020 pp 201minus208(120) Hoffman M Bach F R Blei D M In Online learning forlatent dirichlet allocation 2010 Citeseer pp 856minus864(121) Mongillo G Deneve S Online learning with hidden Markovmodels Neural computation 2008 20 (7) 1706minus1716(122) Bogler A Packman A Furman A Gross A Kushmaro ARonen A Dagot C Hill C Vaizel-Ohayon D Morgenroth ERethinking wastewater risks and monitoring in light of the COVID-19pandemic Nature Sustainability 2020 9 1minus10(123) Bivins A North D Ahmad A Ahmed W Alm E BeenF Bhattacharya P Bijlsma L Boehm A B Brown J Wastewater-Based Epidemiology Global Collaborative to Maximize Contributions inthe Fight Against COVID-19 ACS Publications 2020
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
M
diverse structure log Kow chromatographic behavior andionization efficiency have been proposed for inclusion in aquality control mixture usable for NTA86 However theproposed mixture has limitations such as for the detection ofhydrophilic compounds and molecular formula generation forcompounds containing fluorine Finally due to the substantialamount of MS data acquired indiscriminately under full-scanand the subsequent MSMS fragmentation of molecular andfragment ions data processing to identify relevant features isdaunting Therefore NTA workflows with built-in filters andcriteria are being developed to facilitate prioritization of MSfeatures that are useful for chemical structure annotation Inthis regard advanced data processing tools high throughputstatistical packages and user-friendly visualization programsare needed to fully interrogate the rich data sets acquired byNTANTA based on HRMS holds great promise for the
comprehensive monitoring of the occurrence and fate ofcontaminants in water and wastewater87 It also allowsresearchers to retrospectively analyze stored HRMS data toscreen for suspected contaminants (ie suspect screening) thatmay have been missed during target analysis88 Using advancedcomputational strategies such as hierarchical cluster analysiscommon patterns in the occurrence of contaminants in theenvironment and their emission pathways can be predictedbased on time series analysis of the aggregated NTA data89
NTA combined with cluster analysis was successfully appliedto reveal previously unmonitored chemical contaminants insoil and sediment samples90 The application of NTA andadvanced postacquisition data treatment will continue toenhance our ability to discover emerging contaminants in theenvironment including those that bioaccumulate and poserisks to humans and wildlife For instance new polyhalo-genated compounds were detected for the first time in blubbersamples from marine mammal sentinel species using both LC-HRMS and GC-HRMS for analysis and an open-source datamining software in the R programming environment thatdetected halogenated signatures in full scan HRMS91 FinallyNTA combined with personal passive samplers and propersample preparation techniques can be used to unlock thecomposition of chemical mixtures that humans are exposed toon a daily basis which can be used in investigating the humanexposome92
Akin to metagenomic data NTA data pose a similar HDLSSchallenge as each mass spectrum constitutes a large number ofpeaks representing potential compounds present in a givensample Hence data preprocessing is crucial and inevitableVarious data preprocessing steps are performed such asdetecting peaks subtracting peaks that were found in blankor control samples componentization (ie grouping of signalsthat probably belong to one unique molecular structure) andremoving noise using replicate measurements93 Followingpreprocessing various supervised and unsupervised MLalgorithms can be applied to engineer select and extractrelevant features However before any further analysis datanormalization and data scaling are two crucial steps thatrequire considerationThe most common algorithms used for NTA are linear
projection methods such as PCA and supervised partial least-squares discriminant analysis (PLS-DA)949548 PCA aids insample comparison by removing variance from the sample setSupervised PLS-DA along with relevant metadata can be usedto extract features pertaining to the specific questions that are
being asked However a major challenge is that featuredetection based on peak intensity can be misleading because oferroneous peak assignments Also relevant information can belost when differentiating the actual contaminant signals andbackground noise when using intensity information for featureselection Hence several alternative approaches that aim atusing raw data signals (retention time times mass-to-charge ratio)that bypass the peak detection step have been proposed andimplemented for improved extraction of information andunderlying patterns in the data set96minus98 This includestechniques such as transforming the raw data signal to distancematrices for dissimilarity analysis96minus98 and using clustering toextract features99
A number of other data analysis algorithms exist but havenot yet become common in NTA environmental analysisClustering techniques such as k-means and hierarchicalclustering can be used to identify similar samples Algorithmslike RF SVMs and ANNs can be applied to classify samplesbased on different categories and would be advantageous inillustrating nonlinear relations in the data These algorithmshave shown excellent promise in analyzing HRMS data inother fields100minus103 and this promise can certainly be extendedto environmental sample analysis
33 Anomaly Detection in Engineered Water Sys-tems (EWS) EWS is an umbrella term for systems of watercollection treatment distribution storage and their operationRecently there has been a major thrust toward digitalization ofthe water sector104 In particular the evolution of cyberinfras-tructure and of online process control instrumentation has ledto the development of advanced process control solutions suchas supervisory control and data acquisition (SCADA)systems105106 Such advances have enabled water utilities tocontinuously monitor water quality identify problems andeffectively oversee maintenance issues both remotely and morelocally These systems entail collection of a large volume of rawdata that could be in conjunction with appropriate dataanalysis techniques transformed into valuable information thatcan be leveraged to make proactive decisions to optimizeoverall performance107 In particular there is surging interest inusing ML techniques to identify unusual patterns in raw EWSdata as a means of discovering unexpected activitiesminusthis isbroadly termed anomaly detection108
A typical anomaly detection task in EWS aims todifferentiate between natural expected variations in waterquality and unusual or suspicious variations caused bycontamination or failure somewhere in the system Thechallenge is to characterize these normal water qualityvariations as it requires analysis of long-term data thatencompasses inherent background variability109 This charac-terized normal response helps to flag anomalous events in thedata Various ML algorithms have been applied to address thisproblem (Table 1)Within EWS one primary form of the generated data is a
time series where each time point can be considered a discretesample The time series consists of measurements of indicatorscollected over time from one or several sensors which formthe feature set Hence each sample can be represented as amultidimensional vector where each dimension represents afeature Historic data is then used to train the model Givenappropriate analysis of the supporting data set one can frameand solve problems to address a multitude of aims such asdetecting leaks sensor failures abrupt changes in water qualityor contamination events110minus112
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
G
There are a number of different anomaly detectiontechniques available113 In the unsupervised approachunlabeled samples are clustered using an algorithm such ask-means density-based clustering algorithms or expectation-maximization (EM) clustering The concept is that the normaldata points cluster to form high density clusters whereasanomalous data points cluster separately or distant from thenormal data pointsclusters and are located in low-densityregions This way one can carry out additional investigationusing new samples and classify them relative to the normaldata In the supervised approach labels (normal or anomalous)are assigned to each sample based on past information andexpert knowledge of anomalous behavior In this way theproblem is converted into a binary classification problemThough it can easily be extended to a multiclass classificationto categorize different types of anomalies SVM ANNBayesian Network Logistic Regression models and theirvariants are well explored algorithms in this space113 A specialcase of SVM One-Class SVM (OCSVM) is a widely usedmethod for anomaly detection In OCSVM the entire trainingdata set is considered as one-class (eg normal class) and thenew data points are classified as similar (normal) or different(anomalous) to the training data Because considering all datapoints from one class is equivalent to having no label OCSVMis considered as unsupervised learning methodOwing to the success of ML in detecting anomalies in other
domains such as network security researchers have startedexploring its potential in water quality anomaly detec-tion43114115 Based on the studies published so far DLalgorithms have shown promise in detecting anomalies andhave outperformed conventional techniques on a number ofoccasions However it should be noted that DL methods couldbe slow to train depending on the depth of the network andthe amount of data available116
Many studies have adopted a batch learning approach wherethe model is trained on historical data and then the new data iscategorized using this trained model With a continuousincoming time-series data stream and the possibility of novelanomalies retraining the models with every new data set canquickly become impractical and difficult to reliably executeContinual or active learning frameworks that continuouslylearn as the new data stream comes in and that can identifyanomalous behavior in real time are required to circumventthese issues117118 Variants of Latent Dirichlet AllocationMarkov Models and ANN based architectures are algorithmsthat have been applied in other domains to achieve continualonline learning119minus121 However thes approaches remainunderexplored for anomaly detection in water systems asthese frameworks are not trivial and have their ownimplementation challenges118119
Ultimately anomaly detection will be most valuable if it canlearn continually as the data stream comes in and yield real-time reporting that informs immediate corrective action It iscrucial that the models being utilized are fast and accurateHence going forward there is a need to shift the focus towardhybrid approaches (combinations of different algorithms) tobuild powerful models that are able to detect multiple types ofanomalies in real-time Such models will enhance EWSanomaly detection and advance public health9118119
4 PATH FORWARDThere is increasing interest in and a growing body of researchdocumenting how data analytics is being used to address ESE
problems As illustrated in this Feature the power of dataanalytics has been widely recognized as has been the vast needfor its application Anomaly detection has particularlypromising applications for water professionals and practi-tioners Notably broader application of metagenomics andNTA can revolutionize environmental monitoring effortsHowever the application of data analytics in ESE practiceremains in its infancy and concerted efforts are required tomake data analytics an integral part of ESE research andeducation Such a goal would most effectively be achieved ifthere were an agreed-upon plan of action Coordinated effortson several fronts are needed to help the ESE community reapthe potential benefits of data analyticsFirst there is a need to encourage collaborations between
data scientists and ESE practitioners that involve diverselyengaged and integrated research teams from multipledisciplines Such collaborations will facilitate cross-disciplinarycommunication and simultaneous skills building For exampledata scientists come from a culture highly supportive of datasharing (eg GitHub Bitbucket public databases) The ESEcommunity should embrace this culture and incorporate datasharing both locally regionally and globally Recent efforts toaddress the COVID-19 pandemic122123 the global dissem-ination of antimicrobial resistance7 and the development ofglobally vetted data analysis approaches within the NTAcommunity are all steps in this direction Such collaborationshave the potential to yield profound data-driven insights andconclusions that would be impossible to achieve within normalacademic and professional silos Working on their own datascientists may create models that answer specific questions butlack the interpretive training required to contextualize theirresults into real world applications Simultaneously ESEpractitioners with data may not possess the correspondingtools or insights required for proper analysis Strategiccollaborations will lead to improved data analytics datainterpretation and improved decision makingSecond there is a lack of user-friendly data analytics tools
and platforms devised for ESE problems Although under-standing the theory behind ML algorithms is not a difficult taskfor many trained scientists a major hurdle that ESE researchersface is in the coding and implementation of these models Withthe continual advancement of data science programming isbecoming an increasingly necessary skill without which modelsmay be improperly developed or may lack efficiency User-friendly tools may circumvent the problems with codinglearning curves and may be essential for widespreadpractitioner adoption While many tools (Microsoft powerBI Weka Tableau Azure) are available for general dataanalysis application-specific toolsplatforms would greatly aidin performing end-to-end analysesThird the field needs to incorporate data analytics-oriented
curricula This could include the addition of data analyticsfocused modules within existing coursework the introductionof new data science and statistics courses within ESE degreeprograms and the development of relevant capstone projectsfor ESE applications Such experiential learning would helpstudents understand the intricacies of different algorithms andprovide hands-on experience in applying various data analyticstechniques in the context of specific ESE problems Furtherencouraging students to take part in data science internshipswould be an excellent means of developing such expertiseFourth introducing data analytics workshops and making
the relevant resources available is valuable for students
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
H
researchers and professionals A plethora of online resourcessuch as Coursera LinkedIn Learning and Edx are available forlow or no cost offering high-quality content in data scienceExpanding access to and creation of open source learningplatforms could prove instrumental in instigating the data-driven problem-solving approachFinally we know that data analytics is not new but
continues to evolve at a rapid pace These advancementsfurnish us with new and powerful ways to analyze the data thatcan help holistically tackle the challenges faced by ESEcommunity and ultimately inform decision-making and policyformulation at scales never previously imagined Hence it iscritical to be abreast of new developments in data analytics andto continue making efforts to harvest their full potential
AUTHOR INFORMATIONCorresponding AuthorPeter Vikesland minus Via Department of Civil andEnvironmental Engineering Virginia Tech BlacksburgVirginia 24061 United States orcidorg0000-0003-2654-5132 Email pvikesvtedu
AuthorsSuraj Gupta minus The Interdisciplinary PhD Program in GeneticsBioinformatics and Computational Biology Virginia TechBlacksburg Virginia 24061 United States
Diana Aga minus Department of Chemistry University at BuffaloThe State University of New York Buffalo New York 14226United States orcidorg0000-0001-6512-7713
Amy Pruden minus Via Department of Civil and EnvironmentalEngineering Virginia Tech Blacksburg Virginia 24061United States orcidorg0000-0002-3191-6244
Liqing Zhang minus Department of Computer Science VirginiaTech Blacksburg Virginia 24061 United States
Complete contact information is available athttpspubsacsorg101021acsest1c01026
NotesThe authors declare no competing financial interestBiography
Peter J Vikesland is the Nick Prillaman Professor of Civil andEnvironmental Engineering at Virginia Tech He received his BAfrom Grinnell College in Chemistry in 1993 and MS and PhD inCivil and Environmental Engineering from University of Iowa in 1995and 1998 Vikeslandrsquos research focuses on the fate of nanomaterials inthe environment nanotechnology enabled sensor platforms forenvironmental quality assessment the environmental disseminationof antibiotic resistance and the application of data analytics in
environmental science and engineering He is a past President of theAssociation of Environmental Engineering and Science Professors(AEESP) is a US National Science Foundation CAREER awardeeand is the recipient of the 2018 Walter Weber Research InnovationAward from AEESP
ACKNOWLEDGMENTS
This study was supported by National Science Foundation(NSF) awards OISE (1545756) ECCSS NNCI (1542100)and CSSI (2004751) Additional support was provided by theCenter for Science and Engineering of the Exposome at theVirginia Tech Institute for Critical Technology and AppliedScience (ICTAS) and the Virginia Tech Graduate Schoolsupported Genetics Bioinformatics and ComputationalBiology (GBCB) and Sustainable Nanotechnology (VTSuN)programs
GLOSSARY
Accuracy Correct predictionstotal predictions
Autoencoders A deep learning techni-que that aims to build amodel where the outputtargets are set equal tothe input variables Itseeks to learn an ap-proximate representa-tion of the data andreconstruct it It is anunsupervised learningapproach used for di-mensionality reduction
Bayesian Network Bayesian networks are akind of probabilisticgraphical model thatutilizes Bayesian infer-ence for probability es-timations Bayesian net-works are used tomodel conditional de-pendence and hencecausation using a direc-ted graph
Data cleaning Identification of outliersor filling in of missingvalues Outlier identifi-cation can be done byusing Z-score quartilevalues or hypothesistesting Missing datacan be handled in mul-tiple ways such as fillingthe value by computingthe summary statisticsof the given variableusing predicted valuescomputed by an MLalgorithm manual cura-tion or by ignoring themissing record
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
I
Data integration The combination ofmultiple data sourcesinto a single coherentform
Data reduction Aggregation or elimina-tion of redundant in-formation to reduce thedata set size
Data transformation Conversion of raw databy normalizing to acommon scale to en-sure consistency andcomparability
Density Based Clustering Work by recognizingldquodenserdquo groups ofpoints permitting it tolearn clusters of discre-tionary shape and dis-tinguish anomalies inthe information
Dimensionality Reduction The process of reducingthe number of randomvariables or attributesunder consideration
EM Clustering Similar to k-means ex-cept it assigns the sam-ples into clusters basedon the probabilities es-timated using the EMalgorithm The objec-tive is to maximize theoverall probability ofthe data for the given(final) clusters
Ensemble Methods Ensemble methods arealgorithms that com-bine predictions fromseveral base models toobtain one optimalmodel
F1-Score F1-Score is a harmonicmean of recall andprecision
HMM HMM is a statisticalMarkov mode l i nwhich the system beingmodeled is assumed tobe a Markov process
k-mer A substring with alength k in a biologicalsequence of nucleoti-des
Naive Bayes Naive Bayes methodsare a family of simpleprobabilistic classifiersbased on Bayes ruleThe method is calledldquoNaiverdquo for its assump-tion of conditional in-dependence among thefeatures
Natural Language Processing (NLP) A subfield of computerscience artificial intelli-gence and linguisticsthat deals with theinteraction of com-puters with human lan-guages
NMDS NMDS is an ordinationtechnique based on dis-tance or dissimilaritymatrix NMDS repre-sents pairwise dissimi-larity between samplesin a low-dimensionalspace
PLS-DA PLS-DA is a linearclassification modelthat is able to predictthe class of new sam-ples
R Programming languagefor Statistical Comput-ing
ROC-Curve A receiver operatingcharacteristic (ROC)curve is a plot of truepositive rate (TPR-yaxis) against the falsepositive rate (FPR-xaxis) It measures theperformance of a classi-fier
SCADA Supervisory control andd a t a a c q u i s i t i o n(SCADA) is a systemdesigned to gather andanalyze real time dataIt is used in waterwastewater treatmentplants to monitor andmanage processes
REFERENCES(1) Tukey J W The future of data analysis Ann Math Stat 196233 (1) 1minus67(2) Hayashi C What is data science Fundamental concepts and aheuristic example In Data science classification and related methodsSpringer 1998 pp 40minus51(3) Bishop C M Pattern recognition and machine learning Springer2006(4) Qiu J Wu Q Ding G Xu Y Feng S A survey of machinelearning for big data processing EURASIP Journal on Advances inSignal Processing 2016 2016 (1) 67(5) Agarwal R Dhar V Big data data science and analytics Theopportunity and challenge for IS research In INFORMS 201425443(6) Wilcox C Woon W L Aung Z Applications of machinelearning in environmental engineering Citeseer 2013(7) Hendriksen R S Munk P Njage P Van Bunnik BMcNally L Lukjancenko O Roumlder T Nieuwenhuijse DPedersen S K Kjeldgaard J Global monitoring of antimicrobialresistance based on metagenomics analyses of urban sewage NatCommun 2019 10 (1) 1124(8) Sobus J R Wambaugh J F Isaacs K K Williams A JMcEachran A D Richard A M Grulke C M Ulrich E MRager J E Strynar M J Integrating tools for non-targeted analysis
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
J
research and chemical safety evaluations at the US EPA J ExposureSci Environ Epidemiol 2018 28 (5) 411minus426(9) Dogo E M Nwulu N I Twala B Aigbavboa C A survey ofmachine learning methods applied to anomaly detection on drinking-water quality data Urban Water J 2019 16 (3) 235minus248(10) Wilde F D Radtke D B Gibs J Iwatsubo R T Nationalfield manual for the collection of water-quality data US Geological SurveyTechniques of Water-Resources Investigations Book 9 1998(11) National Research Council Confronting the nationrsquos waterproblems The role of research National Academies Press 2004(12) Bharti R Grimm D G Current challenges and best-practiceprotocols for microbiome analysis Briefings in bioinformatics202122178(13) Veenaas C Developing tools for non-target analysis and digitalarchiving of organic urban water pollutants 2018(14) Keim D A Visual exploration of large data sets CommunACM 2001 44 (8) 38minus44(15) Chen J Chen Y Du X Li C Lu J Zhao S Zhou X Bigdata challenge a data management perspective Frontiers of ComputerScience 2013 7 (2) 157minus164(16) Chakrabarti S Cox E Frank E Guting R H Han J JiangX Kamber M Lightstone S S Nadeau T P Neapolitan R EData mining know it all Morgan Kaufmann 2008(17) Gibert K Sanchez-Marre M Izquierdo J A survey on pre-processing techniques Relevant issues in the context of environ-mental data mining AI Communications 2016 29 (6) 627minus663(18) Russell S Norvig P Artificial intelligence a modern approach2002(19) Hinton G E Sejnowski T J Poggio T A UnsupervisedLearning Foundations of Neural Computation MIT press 1999(20) Van Engelen J E Hoos H H A survey on semi-supervisedlearning Machine Learning 2020 109 (2) 373minus440(21) Brink H Richards J W Fetherolf M Cronin B Real-worldmachine learning Manning Shelter Island NY 2017(22) Kassambara A Practical guide to cluster analysis in RUnsupervised machine learning Sthda Vol 1(23) Ghodsi A Dimensionality reduction a short tutorialDepartment of Statistics and Actuarial Science Univ of WaterlooOntario Canada 2006 37 (38) 2006(24) Wright R E Logistic regression 1995(25) Alvarez-Arbesu R Feliciacutesimo A M GIS and logisticregression as tools for environmental management a coastal cliffvegetation model in Northern Spain WIT Transactions on Informationand Communication Technologies 2002 26(26) Liu L Sankarasubramanian A Ranjithan S R Logisticregression analysis to estimate contaminant sources in waterdistribution systems J Hydroinf 2011 13 (3) 545minus557(27) Thoe W Gold M Griesbach A Grimmer M Taggart ML Boehm A B Predicting water quality at Santa Monica Beachevaluation of five different models for public notification of unsafeswimming conditions Water Res 2014 67 105minus117(28) Muharemi F Logofa tu D Andersson C Leon FApproaches to building a detection model for water quality a casestudy In Modern Approaches for Intelligent Information and DatabaseSystems Springer 2018 pp 173minus183(29) Liaw A Wiener M Classification and regression byrandomForest R news 2002 2 (3) 18minus22(30) Gromski P S Muhamadali H Ellis D I Xu Y Correa ETurner M L Goodacre R A tutorial review Metabolomics andpartial least squares-discriminant analysis-a marriage of convenienceor a shotgun wedding Anal Chim Acta 2015 879 10minus23(31) Lee S Kim J Prediction of Nanofiltration and Reverse-Osmosis-Membrane Rejection of Organic Compounds UsingRandom Forest Model J Environ Eng 2020 146 (11) 04020127(32) Castrillo M Garciacutea A L Estimation of high frequencynutrient concentrations from water quality surrogates using machinelearning methods Water Res 2020 172 115490
(33) Tan G Yan J Gao C Yang S Prediction of water qualitytime series data based on least squares support vector machineProcedia Eng 2012 31 1194minus1199(34) Yang Y H Guergachi A Khan G Support vector machinesfor environmental informatics application to modelling the nitrogenremoval processes in wastewater treatment systems Journal ofEnvironmental Informatics 2006 7 (1) 14minus25(35) Feng J Pan L Cui B Sun Y Zhang A Gong M NewApproach for Concentration Prediction of Aqueous PhenolicContaminants by Using Wavelet Analysis and Support VectorMachine Environmental Engineering Science 2020 37 (5) 382minus392(36) Silver D Schrittwieser J Simonyan K Antonoglou IHuang A Guez A Hubert T Baker L Lai M Bolton AMastering the game of go without human knowledge Nature 2017550 (7676) 354minus359(37) Mikolov T Deoras A Povey D Burget L Cernocky JStrategies for Training Large Scale Neural Network Language ModelsIEEE 2011 pp 196minus201(38) Deng J Dong W Socher R Li L-J Li K Fei-Fei L InImagenet A Large-Scale Hierarchical Image Database Ieee 2009 pp248minus255(39) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (Aug) 2493minus2537(40) Alwosheel A van Cranenburgh S Chorus C G Is yourdataset big enough Sample size requirements when using artificialneural networks for discrete choice analysis Journal of choice modelling2018 28 167minus182(41) Yang J Xu J Zhang X Wu C Lin T Ying Y Deeplearning for vibrational spectral analysis Recent progress and apractical guide Anal Chim Acta 2019 1081 6minus17(42) Guo H Jeong K Lim J Jo J Kim Y M Park J-p Kim JH Cho K H Prediction of effluent concentration in a wastewatertreatment plant using machine learning models J Environ Sci 201532 90minus101(43) Inoue J Yamagata Y Chen Y Poskitt C M Sun JAnomaly Detection for a Water Treatment System Using UnsupervisedMachine Learning 2017 2017 IEEE 2017 pp 1058minus1065(44) Voukantsis D Karatzas K Kukkonen J Raumlsaumlnen TKarppinen A Kolehmainen M Intercomparison of air quality datausing principal component analysis and forecasting of PM10 andPM2 5 concentrations using artificial neural networks in Thessalonikiand Helsinki Sci Total Environ 2011 409 (7) 1266minus1276(45) Wold S Esbensen K Geladi P Principal componentanalysis Chemom Intell Lab Syst 1987 2 (1minus3) 37minus52(46) Jolliffe I T Cadima J Principal component analysis a reviewand recent developments Philos Trans R Soc A 2016 374 (2065)20150202(47) Dalal S G Shirodkar P V Jagtap T G Naik B G Rao GS Evaluation of significant sources influencing the variation of waterquality of Kandla creek Gulf of Katchchh using PCA Environ MonitAssess 2010 163 (1minus4) 49minus56(48) Scholleacutee J E Schymanski E L Avak S E Loos MHollender J Prioritizing unknown transformation products frombiologically-treated wastewater using high-resolution mass spectrom-etry multivariate statistics and metabolic logic Anal Chem 2015 87(24) 12121minus12129(49) Masiaacute A Campo J Blasco C Picoacute Y Ultra-highperformance liquid chromatography-quadrupole time-of-flight massspectrometry to identify contaminants in water an insight onenvironmental forensics Journal of Chromatography A 2014 134586minus97(50) Peiris R H Halleacute C Budman H Moresoli C Peldszus SHuck P M Legge R L Identifying fouling events in a membrane-based drinking water treatment process using principal componentanalysis of fluorescence excitation-emission matrices Water Res 201044 (1) 185minus194(51) MacQueen J In Some methods for classification and analysis ofmultivariate observations 1967 Oakland CA 1967 pp 281minus297
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
K
(52) Kingsy G R Manimegalai R Geetha D M S Rajathi SUsha K Raabiathul B N In Air pollution analysis using enhanced K-Means clustering algorithm for real time sensor data IEEE 2016 pp1945minus1949(53) Zou H Zou Z Wang X An enhanced K-means algorithmfor water quality analysis of the Haihe River in China Int J EnvironRes Public Health 2015 12 (11) 14400minus14413(54) Areerachakul S Sanguansintukul S Clustering analysis of waterquality for canals in Bangkok Thailand Springer 2010 pp 215minus227(55) Javadi S Hashemy S M Mohammadi K Howard K W FNeshat A Classification of aquifer vulnerability using K-means clusteranalysis J Hydrol 2017 549 27minus37(56) Li C Sun L Jia J Cai Y Wang X Risk assessment ofwater pollution sources based on an integrated k-means clustering andset pair analysis method in the region of Shiyan China Sci TotalEnviron 2016 557 307minus316(57) Brunner A M Bertelkamp C Dingemans M M LKolkman A Wols B Harmsen D Siegers W Martijn B JOorthuizen W A Ter Laak T L Integration of target analyses non-target screening and effect-based monitoring to assess OMP relatedwater quality changes in drinking water treatment Sci Total Environ2020 705 135779(58) Gupta S Arango-Argoty G Zhang L Pruden A VikeslandP Identification of discriminatory antibiotic resistance genes amongenvironmental resistomes using extremely randomized tree algorithmMicrobiome 2019 7 (1) 123(59) Du X Shao F Wu S Zhang H Xu S Water qualityassessment with hierarchical cluster analysis based on Mahalanobisdistance Environ Monit Assess 2017 189 (7) 1minus12(60) Bibby K Crank K Greaves J Li X Wu Z Hamza I AStachler E Metagenomics and the development of viral water qualitytools npj Clean Water 2019 2 (1) 1minus13(61) Chu B T T Petrovich M L Chaudhary A Wright DMurphy B Wells G Poretsky R Metagenomics reveals the impactof wastewater treatment plants on the dispersal of microorganismsand genes in aquatic sediments Appl Environ Microbiol 2018 84 (5)No e02168-17(62) Petrovich M L Zilberman A Kaplan A Eliraz G R WangY Langenfeld K Duhaime M Wigginton K Poretsky R AvisarD Microbial and Viral Communities and Their Antibiotic ResistanceGenes Throughout a Hospital Wastewater Treatment System FrontMicrobiol 2020 11 153(63) Dai D Rhoads W J Edwards M A Pruden A Shotgunmetagenomics reveals taxonomic and functional shifts in hot watermicrobiome due to temperature setting and stagnation FrontMicrobiol 2018 9 2695(64) Malla M A Dubey A Yadav S Kumar A Hashem AAbd_Allah E F Understanding and designing the strategies for themicrobe-mediated remediation of environmental contaminants usingomics approaches Front Microbiol 2018 9 1132(65) Giwa A S Ali N Athar M A Wang K Dissectingmicrobial community structure in sewage treatment plant forpathogensrsquo detection using metagenomic sequencing technologyArch Microbiol 2020 202 1minus9(66) Guo J Ni B-J Han X Chen X Bond P Peng Y YuanZ Data on metagenomic profiles of activated sludge from a full-scalewastewater treatment plant Data in brief 2017 15 833minus839(67) Laudadio I Fulci V Stronati L Carissimi C Next-generation metagenomics Methodological challenges and opportu-nities Omics a journal of integrative biology 2019 23 (7) 327minus333(68) Soueidan H Nikolski M Machine learning for metagenomicsmethods and tools arXiv preprint arXiv151006621 2015DOI 101515metgen-2016-0001(69) Ramette A Multivariate analyses in microbial ecology FEMSMicrobiol Ecol 2007 62 (2) 142minus160(70) Yang P Yu S Cheng L Ning K Meta-network optimizedspecies-species network analysis for microbial communities BMCGenomics 2019 20 (2) 187
(71) Collignon P Beggs J J Walsh T R Gandra SLaxminarayan R Anthropological and socioeconomic factorscontributing to global antimicrobial resistance a univariate andmultivariable analysis Lancet Planetary Health 2018 2 (9) No e398-e405(72) Arango-Argoty G Garner E Pruden A Heath L SVikesland P Zhang L DeepARG a deep learning approach forpredicting antibiotic resistance genes from metagenomic dataMicrobiome 2018 6 (1) 1minus15(73) El-Gebali S Mistry J Bateman A Eddy S R Luciani APotter S C Qureshi M Richardson L J Salazar G A Smart AThe Pfam protein families database in 2019 Nucleic Acids Res 201947 (D1) D427minusD432(74) Berglund F Marathe N P Osterlund T Bengtsson-PalmeJ Kotsakis S Flach C-F Larsson D G J Kristiansson EIdentification of 76 novel B1 metallo-β-lactamases through large-scalescreening of genomic and metagenomic data Microbiome 2017 5 (1)134(75) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (ARTICLE) 2493minus2537(76) Ng P dna2vec Consistent vector representations of variable-length k-mers arXiv preprint arXiv170106279 2017(77) Woloszynek S Zhao Z Chen J Rosen G L 16S rRNAsequence embeddings Meaningful numeric feature representations ofnucleotide sequences that are convenient for downstream analysesPLoS Comput Biol 2019 15 (2) No e1006721(78) Arango-Argoty G A Heath L S Pruden A Vikesland PZhang L MetaMLP A fast word embedding based classifier to profiletarget gene databases in metagenomic samples bioRxiv 2019 569970(79) Liang Q Bible P W Liu Y Zou B Wei L DeepMicrobestaxonomic classification for metagenomics with deep learning NARGenomics and Bioinformatics 2020 2 (1) No lqaa009(80) Fiannaca A La Paglia L La Rosa M Renda G Rizzo RGaglio S Urso A Deep learning models for bacteria taxonomicclassification of metagenomic data BMC Bioinf 2018 19 (7) 198(81) Krawczyk P S Lipinski L Dziembowski A PlasFlowpredicting plasmid sequences in metagenomic data using genomesignatures Nucleic Acids Res 2018 46 (6) No e35-e35(82) Hollender J van Bavel B Dulio V Farmen E FurtmannK Koschorreck J Kunkel U Krauss M Munthe J Schlabach MHigh resolution mass spectrometry-based non-target screening cansupport regulatory environmental monitoring and chemicals manage-ment Environ Sci Eur 2019 31 (1) 42(83) Soria N G C Montes A Bisson M A Atilla-Gokcumen GE Aga D S Mass spectrometry-based metabolomics to assess uptakeof silver nanoparticles by Arabidopsis thaliana Environ Sci Nano2017 4 (10) 1944minus1953(84) Matich E K Ghafari M Camgoz E Caliskan E Pfeifer BA Haznedaroglu B Z Atilla-Gokcumen G E Time-serieslipidomic analysis of the oleaginous green microalga species Ettliaoleoabundans under nutrient stress Biotechnol Biofuels 2018 11 (1)1minus15(85) Matich E K Soria N G C Aga D S Atilla-Gokcumen GE Applications of metabolomics in assessing ecological effects ofemerging contaminants and pollutants on plants J Hazard Mater2019 373 527minus535(86) Knolhoff A M Premo J H Fisher C M A ProposedQuality Control Standard Mixture and Its Uses for EvaluatingNontargeted and Suspect Screening LCHR-MS Method Perform-ance Anal Chem 2021931596(87) Samanipour S Kaserzon S Vijayasarathy S Jiang H ChoiP Reid M J Mueller J F Thomas K V Machine learningcombined with non-targeted LC-HRMS analysis for a risk warningsystem of chemical hazards in drinking water A proof of conceptTalanta 2019 195 426minus432(88) Angeles L F Islam S Aldstadt J Saqeeb K N Alam MKhan M A Johura F-T Ahmed S I Aga D S Retrospectivesuspect screening reveals previously ignored antibiotics antifungal
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
L
compounds and metabolites in Bangladesh surface waters Sci TotalEnviron 2020 712 136285(89) Albergamo V Scholleacutee J E Schymanski E L Helmus RTimmer H Hollender J De Voogt P Nontarget screening revealstime trends of polar micropollutants in a riverbank filtration systemEnviron Sci Technol 2019 53 (13) 7584minus7594(90) Chiaia-Hernaacutendez A C Scheringer M Muller A StiegerG Waumlchter D Keller A Pintado-Herrera M G Lara-Martin PA Bucheli T D Hollender J Target and suspect screening analysisreveals persistent emerging organic contaminants in soils andsediments Sci Total Environ 2020 740 140181(91) Cariou R Meacutendez-Fernandez P Hutinet S b Guitton YCaurant F Le Bizec B Spitz J r m Vetter W Dervilly GNontargeted LCESI-HRMS Detection of Polyhalogenated Com-pounds in Marine Mammals Stranded on French Atlantic Coasts ACSESampT Water 20211309(92) Travis S C Kordas K Aga D S Optimized workflow forunknown screening using gas chromatography high-resolution massspectrometry expands identification of contaminants in siliconepersonal passive samplers Rapid Commun Mass Spectrom 2021 35(8) No e9048(93) Hollender J Schymanski E L Singer H P Ferguson P LNontarget screening with high resolution mass spectrometry in theenvironment ready to go ACS Publications 2017(94) Muller A Schulz W Ruck W K L Weber W H A newapproach to data evaluation in the non-target screening of organictrace substances in water analysis Chemosphere 2011 85 (8) 1211minus1219(95) Scholleacutee J E Schymanski E L Hollender J Statisticalapproaches for LC-HRMS data to characterize prioritize and identifytransformation products from water treatment processes In AssessingTransformation Products of Chemicals by Non-Target and SuspectScreening- Strategies and Workflows Vol 1 ACS Publications 2016 pp45minus65(96) Sireacuten K Fischer U Vestner J Automated supervisedlearning pipeline for non-targeted GC-MS data analysis AnalyticaChimica Acta X 2019 1 100005(97) Sinkov N A Harynuk J J Cluster resolution A metric forautomated objective and optimized feature selection in chemometricmodeling Talanta 2011 83 (4) 1079minus1087(98) Johnsen L G Amigo J M Skov T Bro R Automatedresolution of overlapping peaks in chromatographic data J Chemom2014 28 (2) 71minus82(99) Smirnov A Jia W Walker D I Jones D P Du X ADAP-GC 32 graphical software tool for efficient spectral deconvolution ofgas chromatography-high-resolution mass spectrometry metabolomicsdata J Proteome Res 2018 17 (1) 470minus478(100) Chen C Husny J Rabe S Predicting fishiness off-flavourand identifying compounds of lipid oxidation in dairy powders bySPME-GCMS and machine learning Int Dairy J 2018 77 19minus28(101) Taghadomi-Saberi S Mas Garcia S Allah Masoumi ASadeghi M Marco S Classification of bitter orange essential oilsaccording to fruit ripening stage by untargeted chemical profiling andmachine learning Sensors 2018 18 (6) 1922(102) Yang Q Xu L Tang L-J Yang J-T Wu B-Q Chen NJiang J-H Yu R-Q Simultaneous detection of multiple inheritedmetabolic diseases using GC-MS urinary metabolomics by chemo-metrics multi-class classification strategies Talanta 2018 186 489minus496(103) Smolinska A Hauschild A C Fijten R R R Dallinga JW Baumbach J Van Schooten F J Current breathomicsa reviewon data pre-processing techniques and machine learning inmetabolomics breath analysis J Breath Res 2014 8 (2) 027105(104) Garrido-Baserba M Corominas L Corteacutes U Rosso DPoch M The fourth-revolution in the water sector encounters thedigital revolution Environ Sci Technol 2020 54 (8) 4698minus4705(105) Vujnovic G Perisic J Bozilovic Z Milovanovic MGetman R V Radovanovic L USING SCADA SYSTEM FOR
PROCESS CONTROL IN WATER INDUSTRY Acta TechnicaCorviniensis-Bulletin of Engineering 2019 12 (2) 67minus72(106) Olsson G ICA and me-a subjective review Water Res 201246 (6) 1585minus1624(107) Zhao H Hou D Huang P Zhang G Water quality eventdetection in drinking water network Water Air Soil Pollut 2014 225(11) 2183(108) Chandola V Banerjee A Kumar V Anomaly detection Asurvey ACM computing surveys (CSUR) 2009 41 (3) 1minus58(109) Hasan J Goldbloom-Helzner D Ichida A Rouse TGibson M Technologies and Techniques for Early Warning Systems toMonitor and Evaluate Drinking Water Quality A State-of-the-ArtReview Environmental Protection Agency Washington Dc Office ofWater 2005(110) Izquierdo J Loacutepez P A Martiacutenez F J Peacuterez R Faultdetection in water supply systems using hybrid (theory and data-driven) modelling Mathematical and Computer Modelling 2007 46(3minus4) 341minus350(111) Murray S Ghazali M McBean E A Real-time water qualitymonitoring assessment of multisensor data using Bayesian beliefnetworks Journal of Water Resources Planning and Management 2012138 (1) 63minus70(112) Raciti M Cucurull J Nadjm-Tehrani S Anomaly detectionin water management systems In Critical Infrastructure ProtectionSpringer 2012 pp 98minus119(113) Ahmed M Mahmood A N Hu J A survey of networkanomaly detection techniques Journal of Network and ComputerApplications 2016 60 19minus31(114) Yuan Y Jia K A Water Quality Assessment Method Based onSparse Autoencoder 2015 IEEE 2015 pp 1minus4(115) Muharemi F Logofatu D Leon F Machine learningapproaches for anomaly detection of water quality on a real-worlddata set Journal of Information and Telecommunication 2019 3 (3)294minus307(116) Alom M Z Taha T M Yakopcic C Westberg S SidikeP Nasrin M S Hasan M Van Essen B C Awwal A A S AsariV K A state-of-the-art survey on deep learning theory andarchitectures Electronics 2019 8 (3) 292(117) Russo S Lurig M Hao W Matthews B Villez K Activelearning for anomaly detection in environmental data EnvironmentalModelling amp Software 2020 134 104869(118) Ahmad S Lavin A Purdy S Agha Z Unsupervised real-time anomaly detection for streaming data Neurocomputing 2017262 134minus147(119) Stocco A Tonella P Towards Anomaly Detectors that LearnContinuously IEEE 2020 pp 201minus208(120) Hoffman M Bach F R Blei D M In Online learning forlatent dirichlet allocation 2010 Citeseer pp 856minus864(121) Mongillo G Deneve S Online learning with hidden Markovmodels Neural computation 2008 20 (7) 1706minus1716(122) Bogler A Packman A Furman A Gross A Kushmaro ARonen A Dagot C Hill C Vaizel-Ohayon D Morgenroth ERethinking wastewater risks and monitoring in light of the COVID-19pandemic Nature Sustainability 2020 9 1minus10(123) Bivins A North D Ahmad A Ahmed W Alm E BeenF Bhattacharya P Bijlsma L Boehm A B Brown J Wastewater-Based Epidemiology Global Collaborative to Maximize Contributions inthe Fight Against COVID-19 ACS Publications 2020
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
M
There are a number of different anomaly detectiontechniques available113 In the unsupervised approachunlabeled samples are clustered using an algorithm such ask-means density-based clustering algorithms or expectation-maximization (EM) clustering The concept is that the normaldata points cluster to form high density clusters whereasanomalous data points cluster separately or distant from thenormal data pointsclusters and are located in low-densityregions This way one can carry out additional investigationusing new samples and classify them relative to the normaldata In the supervised approach labels (normal or anomalous)are assigned to each sample based on past information andexpert knowledge of anomalous behavior In this way theproblem is converted into a binary classification problemThough it can easily be extended to a multiclass classificationto categorize different types of anomalies SVM ANNBayesian Network Logistic Regression models and theirvariants are well explored algorithms in this space113 A specialcase of SVM One-Class SVM (OCSVM) is a widely usedmethod for anomaly detection In OCSVM the entire trainingdata set is considered as one-class (eg normal class) and thenew data points are classified as similar (normal) or different(anomalous) to the training data Because considering all datapoints from one class is equivalent to having no label OCSVMis considered as unsupervised learning methodOwing to the success of ML in detecting anomalies in other
domains such as network security researchers have startedexploring its potential in water quality anomaly detec-tion43114115 Based on the studies published so far DLalgorithms have shown promise in detecting anomalies andhave outperformed conventional techniques on a number ofoccasions However it should be noted that DL methods couldbe slow to train depending on the depth of the network andthe amount of data available116
Many studies have adopted a batch learning approach wherethe model is trained on historical data and then the new data iscategorized using this trained model With a continuousincoming time-series data stream and the possibility of novelanomalies retraining the models with every new data set canquickly become impractical and difficult to reliably executeContinual or active learning frameworks that continuouslylearn as the new data stream comes in and that can identifyanomalous behavior in real time are required to circumventthese issues117118 Variants of Latent Dirichlet AllocationMarkov Models and ANN based architectures are algorithmsthat have been applied in other domains to achieve continualonline learning119minus121 However thes approaches remainunderexplored for anomaly detection in water systems asthese frameworks are not trivial and have their ownimplementation challenges118119
Ultimately anomaly detection will be most valuable if it canlearn continually as the data stream comes in and yield real-time reporting that informs immediate corrective action It iscrucial that the models being utilized are fast and accurateHence going forward there is a need to shift the focus towardhybrid approaches (combinations of different algorithms) tobuild powerful models that are able to detect multiple types ofanomalies in real-time Such models will enhance EWSanomaly detection and advance public health9118119
4 PATH FORWARDThere is increasing interest in and a growing body of researchdocumenting how data analytics is being used to address ESE
problems As illustrated in this Feature the power of dataanalytics has been widely recognized as has been the vast needfor its application Anomaly detection has particularlypromising applications for water professionals and practi-tioners Notably broader application of metagenomics andNTA can revolutionize environmental monitoring effortsHowever the application of data analytics in ESE practiceremains in its infancy and concerted efforts are required tomake data analytics an integral part of ESE research andeducation Such a goal would most effectively be achieved ifthere were an agreed-upon plan of action Coordinated effortson several fronts are needed to help the ESE community reapthe potential benefits of data analyticsFirst there is a need to encourage collaborations between
data scientists and ESE practitioners that involve diverselyengaged and integrated research teams from multipledisciplines Such collaborations will facilitate cross-disciplinarycommunication and simultaneous skills building For exampledata scientists come from a culture highly supportive of datasharing (eg GitHub Bitbucket public databases) The ESEcommunity should embrace this culture and incorporate datasharing both locally regionally and globally Recent efforts toaddress the COVID-19 pandemic122123 the global dissem-ination of antimicrobial resistance7 and the development ofglobally vetted data analysis approaches within the NTAcommunity are all steps in this direction Such collaborationshave the potential to yield profound data-driven insights andconclusions that would be impossible to achieve within normalacademic and professional silos Working on their own datascientists may create models that answer specific questions butlack the interpretive training required to contextualize theirresults into real world applications Simultaneously ESEpractitioners with data may not possess the correspondingtools or insights required for proper analysis Strategiccollaborations will lead to improved data analytics datainterpretation and improved decision makingSecond there is a lack of user-friendly data analytics tools
and platforms devised for ESE problems Although under-standing the theory behind ML algorithms is not a difficult taskfor many trained scientists a major hurdle that ESE researchersface is in the coding and implementation of these models Withthe continual advancement of data science programming isbecoming an increasingly necessary skill without which modelsmay be improperly developed or may lack efficiency User-friendly tools may circumvent the problems with codinglearning curves and may be essential for widespreadpractitioner adoption While many tools (Microsoft powerBI Weka Tableau Azure) are available for general dataanalysis application-specific toolsplatforms would greatly aidin performing end-to-end analysesThird the field needs to incorporate data analytics-oriented
curricula This could include the addition of data analyticsfocused modules within existing coursework the introductionof new data science and statistics courses within ESE degreeprograms and the development of relevant capstone projectsfor ESE applications Such experiential learning would helpstudents understand the intricacies of different algorithms andprovide hands-on experience in applying various data analyticstechniques in the context of specific ESE problems Furtherencouraging students to take part in data science internshipswould be an excellent means of developing such expertiseFourth introducing data analytics workshops and making
the relevant resources available is valuable for students
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
H
researchers and professionals A plethora of online resourcessuch as Coursera LinkedIn Learning and Edx are available forlow or no cost offering high-quality content in data scienceExpanding access to and creation of open source learningplatforms could prove instrumental in instigating the data-driven problem-solving approachFinally we know that data analytics is not new but
continues to evolve at a rapid pace These advancementsfurnish us with new and powerful ways to analyze the data thatcan help holistically tackle the challenges faced by ESEcommunity and ultimately inform decision-making and policyformulation at scales never previously imagined Hence it iscritical to be abreast of new developments in data analytics andto continue making efforts to harvest their full potential
AUTHOR INFORMATIONCorresponding AuthorPeter Vikesland minus Via Department of Civil andEnvironmental Engineering Virginia Tech BlacksburgVirginia 24061 United States orcidorg0000-0003-2654-5132 Email pvikesvtedu
AuthorsSuraj Gupta minus The Interdisciplinary PhD Program in GeneticsBioinformatics and Computational Biology Virginia TechBlacksburg Virginia 24061 United States
Diana Aga minus Department of Chemistry University at BuffaloThe State University of New York Buffalo New York 14226United States orcidorg0000-0001-6512-7713
Amy Pruden minus Via Department of Civil and EnvironmentalEngineering Virginia Tech Blacksburg Virginia 24061United States orcidorg0000-0002-3191-6244
Liqing Zhang minus Department of Computer Science VirginiaTech Blacksburg Virginia 24061 United States
Complete contact information is available athttpspubsacsorg101021acsest1c01026
NotesThe authors declare no competing financial interestBiography
Peter J Vikesland is the Nick Prillaman Professor of Civil andEnvironmental Engineering at Virginia Tech He received his BAfrom Grinnell College in Chemistry in 1993 and MS and PhD inCivil and Environmental Engineering from University of Iowa in 1995and 1998 Vikeslandrsquos research focuses on the fate of nanomaterials inthe environment nanotechnology enabled sensor platforms forenvironmental quality assessment the environmental disseminationof antibiotic resistance and the application of data analytics in
environmental science and engineering He is a past President of theAssociation of Environmental Engineering and Science Professors(AEESP) is a US National Science Foundation CAREER awardeeand is the recipient of the 2018 Walter Weber Research InnovationAward from AEESP
ACKNOWLEDGMENTS
This study was supported by National Science Foundation(NSF) awards OISE (1545756) ECCSS NNCI (1542100)and CSSI (2004751) Additional support was provided by theCenter for Science and Engineering of the Exposome at theVirginia Tech Institute for Critical Technology and AppliedScience (ICTAS) and the Virginia Tech Graduate Schoolsupported Genetics Bioinformatics and ComputationalBiology (GBCB) and Sustainable Nanotechnology (VTSuN)programs
GLOSSARY
Accuracy Correct predictionstotal predictions
Autoencoders A deep learning techni-que that aims to build amodel where the outputtargets are set equal tothe input variables Itseeks to learn an ap-proximate representa-tion of the data andreconstruct it It is anunsupervised learningapproach used for di-mensionality reduction
Bayesian Network Bayesian networks are akind of probabilisticgraphical model thatutilizes Bayesian infer-ence for probability es-timations Bayesian net-works are used tomodel conditional de-pendence and hencecausation using a direc-ted graph
Data cleaning Identification of outliersor filling in of missingvalues Outlier identifi-cation can be done byusing Z-score quartilevalues or hypothesistesting Missing datacan be handled in mul-tiple ways such as fillingthe value by computingthe summary statisticsof the given variableusing predicted valuescomputed by an MLalgorithm manual cura-tion or by ignoring themissing record
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
I
Data integration The combination ofmultiple data sourcesinto a single coherentform
Data reduction Aggregation or elimina-tion of redundant in-formation to reduce thedata set size
Data transformation Conversion of raw databy normalizing to acommon scale to en-sure consistency andcomparability
Density Based Clustering Work by recognizingldquodenserdquo groups ofpoints permitting it tolearn clusters of discre-tionary shape and dis-tinguish anomalies inthe information
Dimensionality Reduction The process of reducingthe number of randomvariables or attributesunder consideration
EM Clustering Similar to k-means ex-cept it assigns the sam-ples into clusters basedon the probabilities es-timated using the EMalgorithm The objec-tive is to maximize theoverall probability ofthe data for the given(final) clusters
Ensemble Methods Ensemble methods arealgorithms that com-bine predictions fromseveral base models toobtain one optimalmodel
F1-Score F1-Score is a harmonicmean of recall andprecision
HMM HMM is a statisticalMarkov mode l i nwhich the system beingmodeled is assumed tobe a Markov process
k-mer A substring with alength k in a biologicalsequence of nucleoti-des
Naive Bayes Naive Bayes methodsare a family of simpleprobabilistic classifiersbased on Bayes ruleThe method is calledldquoNaiverdquo for its assump-tion of conditional in-dependence among thefeatures
Natural Language Processing (NLP) A subfield of computerscience artificial intelli-gence and linguisticsthat deals with theinteraction of com-puters with human lan-guages
NMDS NMDS is an ordinationtechnique based on dis-tance or dissimilaritymatrix NMDS repre-sents pairwise dissimi-larity between samplesin a low-dimensionalspace
PLS-DA PLS-DA is a linearclassification modelthat is able to predictthe class of new sam-ples
R Programming languagefor Statistical Comput-ing
ROC-Curve A receiver operatingcharacteristic (ROC)curve is a plot of truepositive rate (TPR-yaxis) against the falsepositive rate (FPR-xaxis) It measures theperformance of a classi-fier
SCADA Supervisory control andd a t a a c q u i s i t i o n(SCADA) is a systemdesigned to gather andanalyze real time dataIt is used in waterwastewater treatmentplants to monitor andmanage processes
REFERENCES(1) Tukey J W The future of data analysis Ann Math Stat 196233 (1) 1minus67(2) Hayashi C What is data science Fundamental concepts and aheuristic example In Data science classification and related methodsSpringer 1998 pp 40minus51(3) Bishop C M Pattern recognition and machine learning Springer2006(4) Qiu J Wu Q Ding G Xu Y Feng S A survey of machinelearning for big data processing EURASIP Journal on Advances inSignal Processing 2016 2016 (1) 67(5) Agarwal R Dhar V Big data data science and analytics Theopportunity and challenge for IS research In INFORMS 201425443(6) Wilcox C Woon W L Aung Z Applications of machinelearning in environmental engineering Citeseer 2013(7) Hendriksen R S Munk P Njage P Van Bunnik BMcNally L Lukjancenko O Roumlder T Nieuwenhuijse DPedersen S K Kjeldgaard J Global monitoring of antimicrobialresistance based on metagenomics analyses of urban sewage NatCommun 2019 10 (1) 1124(8) Sobus J R Wambaugh J F Isaacs K K Williams A JMcEachran A D Richard A M Grulke C M Ulrich E MRager J E Strynar M J Integrating tools for non-targeted analysis
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
J
research and chemical safety evaluations at the US EPA J ExposureSci Environ Epidemiol 2018 28 (5) 411minus426(9) Dogo E M Nwulu N I Twala B Aigbavboa C A survey ofmachine learning methods applied to anomaly detection on drinking-water quality data Urban Water J 2019 16 (3) 235minus248(10) Wilde F D Radtke D B Gibs J Iwatsubo R T Nationalfield manual for the collection of water-quality data US Geological SurveyTechniques of Water-Resources Investigations Book 9 1998(11) National Research Council Confronting the nationrsquos waterproblems The role of research National Academies Press 2004(12) Bharti R Grimm D G Current challenges and best-practiceprotocols for microbiome analysis Briefings in bioinformatics202122178(13) Veenaas C Developing tools for non-target analysis and digitalarchiving of organic urban water pollutants 2018(14) Keim D A Visual exploration of large data sets CommunACM 2001 44 (8) 38minus44(15) Chen J Chen Y Du X Li C Lu J Zhao S Zhou X Bigdata challenge a data management perspective Frontiers of ComputerScience 2013 7 (2) 157minus164(16) Chakrabarti S Cox E Frank E Guting R H Han J JiangX Kamber M Lightstone S S Nadeau T P Neapolitan R EData mining know it all Morgan Kaufmann 2008(17) Gibert K Sanchez-Marre M Izquierdo J A survey on pre-processing techniques Relevant issues in the context of environ-mental data mining AI Communications 2016 29 (6) 627minus663(18) Russell S Norvig P Artificial intelligence a modern approach2002(19) Hinton G E Sejnowski T J Poggio T A UnsupervisedLearning Foundations of Neural Computation MIT press 1999(20) Van Engelen J E Hoos H H A survey on semi-supervisedlearning Machine Learning 2020 109 (2) 373minus440(21) Brink H Richards J W Fetherolf M Cronin B Real-worldmachine learning Manning Shelter Island NY 2017(22) Kassambara A Practical guide to cluster analysis in RUnsupervised machine learning Sthda Vol 1(23) Ghodsi A Dimensionality reduction a short tutorialDepartment of Statistics and Actuarial Science Univ of WaterlooOntario Canada 2006 37 (38) 2006(24) Wright R E Logistic regression 1995(25) Alvarez-Arbesu R Feliciacutesimo A M GIS and logisticregression as tools for environmental management a coastal cliffvegetation model in Northern Spain WIT Transactions on Informationand Communication Technologies 2002 26(26) Liu L Sankarasubramanian A Ranjithan S R Logisticregression analysis to estimate contaminant sources in waterdistribution systems J Hydroinf 2011 13 (3) 545minus557(27) Thoe W Gold M Griesbach A Grimmer M Taggart ML Boehm A B Predicting water quality at Santa Monica Beachevaluation of five different models for public notification of unsafeswimming conditions Water Res 2014 67 105minus117(28) Muharemi F Logofa tu D Andersson C Leon FApproaches to building a detection model for water quality a casestudy In Modern Approaches for Intelligent Information and DatabaseSystems Springer 2018 pp 173minus183(29) Liaw A Wiener M Classification and regression byrandomForest R news 2002 2 (3) 18minus22(30) Gromski P S Muhamadali H Ellis D I Xu Y Correa ETurner M L Goodacre R A tutorial review Metabolomics andpartial least squares-discriminant analysis-a marriage of convenienceor a shotgun wedding Anal Chim Acta 2015 879 10minus23(31) Lee S Kim J Prediction of Nanofiltration and Reverse-Osmosis-Membrane Rejection of Organic Compounds UsingRandom Forest Model J Environ Eng 2020 146 (11) 04020127(32) Castrillo M Garciacutea A L Estimation of high frequencynutrient concentrations from water quality surrogates using machinelearning methods Water Res 2020 172 115490
(33) Tan G Yan J Gao C Yang S Prediction of water qualitytime series data based on least squares support vector machineProcedia Eng 2012 31 1194minus1199(34) Yang Y H Guergachi A Khan G Support vector machinesfor environmental informatics application to modelling the nitrogenremoval processes in wastewater treatment systems Journal ofEnvironmental Informatics 2006 7 (1) 14minus25(35) Feng J Pan L Cui B Sun Y Zhang A Gong M NewApproach for Concentration Prediction of Aqueous PhenolicContaminants by Using Wavelet Analysis and Support VectorMachine Environmental Engineering Science 2020 37 (5) 382minus392(36) Silver D Schrittwieser J Simonyan K Antonoglou IHuang A Guez A Hubert T Baker L Lai M Bolton AMastering the game of go without human knowledge Nature 2017550 (7676) 354minus359(37) Mikolov T Deoras A Povey D Burget L Cernocky JStrategies for Training Large Scale Neural Network Language ModelsIEEE 2011 pp 196minus201(38) Deng J Dong W Socher R Li L-J Li K Fei-Fei L InImagenet A Large-Scale Hierarchical Image Database Ieee 2009 pp248minus255(39) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (Aug) 2493minus2537(40) Alwosheel A van Cranenburgh S Chorus C G Is yourdataset big enough Sample size requirements when using artificialneural networks for discrete choice analysis Journal of choice modelling2018 28 167minus182(41) Yang J Xu J Zhang X Wu C Lin T Ying Y Deeplearning for vibrational spectral analysis Recent progress and apractical guide Anal Chim Acta 2019 1081 6minus17(42) Guo H Jeong K Lim J Jo J Kim Y M Park J-p Kim JH Cho K H Prediction of effluent concentration in a wastewatertreatment plant using machine learning models J Environ Sci 201532 90minus101(43) Inoue J Yamagata Y Chen Y Poskitt C M Sun JAnomaly Detection for a Water Treatment System Using UnsupervisedMachine Learning 2017 2017 IEEE 2017 pp 1058minus1065(44) Voukantsis D Karatzas K Kukkonen J Raumlsaumlnen TKarppinen A Kolehmainen M Intercomparison of air quality datausing principal component analysis and forecasting of PM10 andPM2 5 concentrations using artificial neural networks in Thessalonikiand Helsinki Sci Total Environ 2011 409 (7) 1266minus1276(45) Wold S Esbensen K Geladi P Principal componentanalysis Chemom Intell Lab Syst 1987 2 (1minus3) 37minus52(46) Jolliffe I T Cadima J Principal component analysis a reviewand recent developments Philos Trans R Soc A 2016 374 (2065)20150202(47) Dalal S G Shirodkar P V Jagtap T G Naik B G Rao GS Evaluation of significant sources influencing the variation of waterquality of Kandla creek Gulf of Katchchh using PCA Environ MonitAssess 2010 163 (1minus4) 49minus56(48) Scholleacutee J E Schymanski E L Avak S E Loos MHollender J Prioritizing unknown transformation products frombiologically-treated wastewater using high-resolution mass spectrom-etry multivariate statistics and metabolic logic Anal Chem 2015 87(24) 12121minus12129(49) Masiaacute A Campo J Blasco C Picoacute Y Ultra-highperformance liquid chromatography-quadrupole time-of-flight massspectrometry to identify contaminants in water an insight onenvironmental forensics Journal of Chromatography A 2014 134586minus97(50) Peiris R H Halleacute C Budman H Moresoli C Peldszus SHuck P M Legge R L Identifying fouling events in a membrane-based drinking water treatment process using principal componentanalysis of fluorescence excitation-emission matrices Water Res 201044 (1) 185minus194(51) MacQueen J In Some methods for classification and analysis ofmultivariate observations 1967 Oakland CA 1967 pp 281minus297
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
K
(52) Kingsy G R Manimegalai R Geetha D M S Rajathi SUsha K Raabiathul B N In Air pollution analysis using enhanced K-Means clustering algorithm for real time sensor data IEEE 2016 pp1945minus1949(53) Zou H Zou Z Wang X An enhanced K-means algorithmfor water quality analysis of the Haihe River in China Int J EnvironRes Public Health 2015 12 (11) 14400minus14413(54) Areerachakul S Sanguansintukul S Clustering analysis of waterquality for canals in Bangkok Thailand Springer 2010 pp 215minus227(55) Javadi S Hashemy S M Mohammadi K Howard K W FNeshat A Classification of aquifer vulnerability using K-means clusteranalysis J Hydrol 2017 549 27minus37(56) Li C Sun L Jia J Cai Y Wang X Risk assessment ofwater pollution sources based on an integrated k-means clustering andset pair analysis method in the region of Shiyan China Sci TotalEnviron 2016 557 307minus316(57) Brunner A M Bertelkamp C Dingemans M M LKolkman A Wols B Harmsen D Siegers W Martijn B JOorthuizen W A Ter Laak T L Integration of target analyses non-target screening and effect-based monitoring to assess OMP relatedwater quality changes in drinking water treatment Sci Total Environ2020 705 135779(58) Gupta S Arango-Argoty G Zhang L Pruden A VikeslandP Identification of discriminatory antibiotic resistance genes amongenvironmental resistomes using extremely randomized tree algorithmMicrobiome 2019 7 (1) 123(59) Du X Shao F Wu S Zhang H Xu S Water qualityassessment with hierarchical cluster analysis based on Mahalanobisdistance Environ Monit Assess 2017 189 (7) 1minus12(60) Bibby K Crank K Greaves J Li X Wu Z Hamza I AStachler E Metagenomics and the development of viral water qualitytools npj Clean Water 2019 2 (1) 1minus13(61) Chu B T T Petrovich M L Chaudhary A Wright DMurphy B Wells G Poretsky R Metagenomics reveals the impactof wastewater treatment plants on the dispersal of microorganismsand genes in aquatic sediments Appl Environ Microbiol 2018 84 (5)No e02168-17(62) Petrovich M L Zilberman A Kaplan A Eliraz G R WangY Langenfeld K Duhaime M Wigginton K Poretsky R AvisarD Microbial and Viral Communities and Their Antibiotic ResistanceGenes Throughout a Hospital Wastewater Treatment System FrontMicrobiol 2020 11 153(63) Dai D Rhoads W J Edwards M A Pruden A Shotgunmetagenomics reveals taxonomic and functional shifts in hot watermicrobiome due to temperature setting and stagnation FrontMicrobiol 2018 9 2695(64) Malla M A Dubey A Yadav S Kumar A Hashem AAbd_Allah E F Understanding and designing the strategies for themicrobe-mediated remediation of environmental contaminants usingomics approaches Front Microbiol 2018 9 1132(65) Giwa A S Ali N Athar M A Wang K Dissectingmicrobial community structure in sewage treatment plant forpathogensrsquo detection using metagenomic sequencing technologyArch Microbiol 2020 202 1minus9(66) Guo J Ni B-J Han X Chen X Bond P Peng Y YuanZ Data on metagenomic profiles of activated sludge from a full-scalewastewater treatment plant Data in brief 2017 15 833minus839(67) Laudadio I Fulci V Stronati L Carissimi C Next-generation metagenomics Methodological challenges and opportu-nities Omics a journal of integrative biology 2019 23 (7) 327minus333(68) Soueidan H Nikolski M Machine learning for metagenomicsmethods and tools arXiv preprint arXiv151006621 2015DOI 101515metgen-2016-0001(69) Ramette A Multivariate analyses in microbial ecology FEMSMicrobiol Ecol 2007 62 (2) 142minus160(70) Yang P Yu S Cheng L Ning K Meta-network optimizedspecies-species network analysis for microbial communities BMCGenomics 2019 20 (2) 187
(71) Collignon P Beggs J J Walsh T R Gandra SLaxminarayan R Anthropological and socioeconomic factorscontributing to global antimicrobial resistance a univariate andmultivariable analysis Lancet Planetary Health 2018 2 (9) No e398-e405(72) Arango-Argoty G Garner E Pruden A Heath L SVikesland P Zhang L DeepARG a deep learning approach forpredicting antibiotic resistance genes from metagenomic dataMicrobiome 2018 6 (1) 1minus15(73) El-Gebali S Mistry J Bateman A Eddy S R Luciani APotter S C Qureshi M Richardson L J Salazar G A Smart AThe Pfam protein families database in 2019 Nucleic Acids Res 201947 (D1) D427minusD432(74) Berglund F Marathe N P Osterlund T Bengtsson-PalmeJ Kotsakis S Flach C-F Larsson D G J Kristiansson EIdentification of 76 novel B1 metallo-β-lactamases through large-scalescreening of genomic and metagenomic data Microbiome 2017 5 (1)134(75) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (ARTICLE) 2493minus2537(76) Ng P dna2vec Consistent vector representations of variable-length k-mers arXiv preprint arXiv170106279 2017(77) Woloszynek S Zhao Z Chen J Rosen G L 16S rRNAsequence embeddings Meaningful numeric feature representations ofnucleotide sequences that are convenient for downstream analysesPLoS Comput Biol 2019 15 (2) No e1006721(78) Arango-Argoty G A Heath L S Pruden A Vikesland PZhang L MetaMLP A fast word embedding based classifier to profiletarget gene databases in metagenomic samples bioRxiv 2019 569970(79) Liang Q Bible P W Liu Y Zou B Wei L DeepMicrobestaxonomic classification for metagenomics with deep learning NARGenomics and Bioinformatics 2020 2 (1) No lqaa009(80) Fiannaca A La Paglia L La Rosa M Renda G Rizzo RGaglio S Urso A Deep learning models for bacteria taxonomicclassification of metagenomic data BMC Bioinf 2018 19 (7) 198(81) Krawczyk P S Lipinski L Dziembowski A PlasFlowpredicting plasmid sequences in metagenomic data using genomesignatures Nucleic Acids Res 2018 46 (6) No e35-e35(82) Hollender J van Bavel B Dulio V Farmen E FurtmannK Koschorreck J Kunkel U Krauss M Munthe J Schlabach MHigh resolution mass spectrometry-based non-target screening cansupport regulatory environmental monitoring and chemicals manage-ment Environ Sci Eur 2019 31 (1) 42(83) Soria N G C Montes A Bisson M A Atilla-Gokcumen GE Aga D S Mass spectrometry-based metabolomics to assess uptakeof silver nanoparticles by Arabidopsis thaliana Environ Sci Nano2017 4 (10) 1944minus1953(84) Matich E K Ghafari M Camgoz E Caliskan E Pfeifer BA Haznedaroglu B Z Atilla-Gokcumen G E Time-serieslipidomic analysis of the oleaginous green microalga species Ettliaoleoabundans under nutrient stress Biotechnol Biofuels 2018 11 (1)1minus15(85) Matich E K Soria N G C Aga D S Atilla-Gokcumen GE Applications of metabolomics in assessing ecological effects ofemerging contaminants and pollutants on plants J Hazard Mater2019 373 527minus535(86) Knolhoff A M Premo J H Fisher C M A ProposedQuality Control Standard Mixture and Its Uses for EvaluatingNontargeted and Suspect Screening LCHR-MS Method Perform-ance Anal Chem 2021931596(87) Samanipour S Kaserzon S Vijayasarathy S Jiang H ChoiP Reid M J Mueller J F Thomas K V Machine learningcombined with non-targeted LC-HRMS analysis for a risk warningsystem of chemical hazards in drinking water A proof of conceptTalanta 2019 195 426minus432(88) Angeles L F Islam S Aldstadt J Saqeeb K N Alam MKhan M A Johura F-T Ahmed S I Aga D S Retrospectivesuspect screening reveals previously ignored antibiotics antifungal
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
L
compounds and metabolites in Bangladesh surface waters Sci TotalEnviron 2020 712 136285(89) Albergamo V Scholleacutee J E Schymanski E L Helmus RTimmer H Hollender J De Voogt P Nontarget screening revealstime trends of polar micropollutants in a riverbank filtration systemEnviron Sci Technol 2019 53 (13) 7584minus7594(90) Chiaia-Hernaacutendez A C Scheringer M Muller A StiegerG Waumlchter D Keller A Pintado-Herrera M G Lara-Martin PA Bucheli T D Hollender J Target and suspect screening analysisreveals persistent emerging organic contaminants in soils andsediments Sci Total Environ 2020 740 140181(91) Cariou R Meacutendez-Fernandez P Hutinet S b Guitton YCaurant F Le Bizec B Spitz J r m Vetter W Dervilly GNontargeted LCESI-HRMS Detection of Polyhalogenated Com-pounds in Marine Mammals Stranded on French Atlantic Coasts ACSESampT Water 20211309(92) Travis S C Kordas K Aga D S Optimized workflow forunknown screening using gas chromatography high-resolution massspectrometry expands identification of contaminants in siliconepersonal passive samplers Rapid Commun Mass Spectrom 2021 35(8) No e9048(93) Hollender J Schymanski E L Singer H P Ferguson P LNontarget screening with high resolution mass spectrometry in theenvironment ready to go ACS Publications 2017(94) Muller A Schulz W Ruck W K L Weber W H A newapproach to data evaluation in the non-target screening of organictrace substances in water analysis Chemosphere 2011 85 (8) 1211minus1219(95) Scholleacutee J E Schymanski E L Hollender J Statisticalapproaches for LC-HRMS data to characterize prioritize and identifytransformation products from water treatment processes In AssessingTransformation Products of Chemicals by Non-Target and SuspectScreening- Strategies and Workflows Vol 1 ACS Publications 2016 pp45minus65(96) Sireacuten K Fischer U Vestner J Automated supervisedlearning pipeline for non-targeted GC-MS data analysis AnalyticaChimica Acta X 2019 1 100005(97) Sinkov N A Harynuk J J Cluster resolution A metric forautomated objective and optimized feature selection in chemometricmodeling Talanta 2011 83 (4) 1079minus1087(98) Johnsen L G Amigo J M Skov T Bro R Automatedresolution of overlapping peaks in chromatographic data J Chemom2014 28 (2) 71minus82(99) Smirnov A Jia W Walker D I Jones D P Du X ADAP-GC 32 graphical software tool for efficient spectral deconvolution ofgas chromatography-high-resolution mass spectrometry metabolomicsdata J Proteome Res 2018 17 (1) 470minus478(100) Chen C Husny J Rabe S Predicting fishiness off-flavourand identifying compounds of lipid oxidation in dairy powders bySPME-GCMS and machine learning Int Dairy J 2018 77 19minus28(101) Taghadomi-Saberi S Mas Garcia S Allah Masoumi ASadeghi M Marco S Classification of bitter orange essential oilsaccording to fruit ripening stage by untargeted chemical profiling andmachine learning Sensors 2018 18 (6) 1922(102) Yang Q Xu L Tang L-J Yang J-T Wu B-Q Chen NJiang J-H Yu R-Q Simultaneous detection of multiple inheritedmetabolic diseases using GC-MS urinary metabolomics by chemo-metrics multi-class classification strategies Talanta 2018 186 489minus496(103) Smolinska A Hauschild A C Fijten R R R Dallinga JW Baumbach J Van Schooten F J Current breathomicsa reviewon data pre-processing techniques and machine learning inmetabolomics breath analysis J Breath Res 2014 8 (2) 027105(104) Garrido-Baserba M Corominas L Corteacutes U Rosso DPoch M The fourth-revolution in the water sector encounters thedigital revolution Environ Sci Technol 2020 54 (8) 4698minus4705(105) Vujnovic G Perisic J Bozilovic Z Milovanovic MGetman R V Radovanovic L USING SCADA SYSTEM FOR
PROCESS CONTROL IN WATER INDUSTRY Acta TechnicaCorviniensis-Bulletin of Engineering 2019 12 (2) 67minus72(106) Olsson G ICA and me-a subjective review Water Res 201246 (6) 1585minus1624(107) Zhao H Hou D Huang P Zhang G Water quality eventdetection in drinking water network Water Air Soil Pollut 2014 225(11) 2183(108) Chandola V Banerjee A Kumar V Anomaly detection Asurvey ACM computing surveys (CSUR) 2009 41 (3) 1minus58(109) Hasan J Goldbloom-Helzner D Ichida A Rouse TGibson M Technologies and Techniques for Early Warning Systems toMonitor and Evaluate Drinking Water Quality A State-of-the-ArtReview Environmental Protection Agency Washington Dc Office ofWater 2005(110) Izquierdo J Loacutepez P A Martiacutenez F J Peacuterez R Faultdetection in water supply systems using hybrid (theory and data-driven) modelling Mathematical and Computer Modelling 2007 46(3minus4) 341minus350(111) Murray S Ghazali M McBean E A Real-time water qualitymonitoring assessment of multisensor data using Bayesian beliefnetworks Journal of Water Resources Planning and Management 2012138 (1) 63minus70(112) Raciti M Cucurull J Nadjm-Tehrani S Anomaly detectionin water management systems In Critical Infrastructure ProtectionSpringer 2012 pp 98minus119(113) Ahmed M Mahmood A N Hu J A survey of networkanomaly detection techniques Journal of Network and ComputerApplications 2016 60 19minus31(114) Yuan Y Jia K A Water Quality Assessment Method Based onSparse Autoencoder 2015 IEEE 2015 pp 1minus4(115) Muharemi F Logofatu D Leon F Machine learningapproaches for anomaly detection of water quality on a real-worlddata set Journal of Information and Telecommunication 2019 3 (3)294minus307(116) Alom M Z Taha T M Yakopcic C Westberg S SidikeP Nasrin M S Hasan M Van Essen B C Awwal A A S AsariV K A state-of-the-art survey on deep learning theory andarchitectures Electronics 2019 8 (3) 292(117) Russo S Lurig M Hao W Matthews B Villez K Activelearning for anomaly detection in environmental data EnvironmentalModelling amp Software 2020 134 104869(118) Ahmad S Lavin A Purdy S Agha Z Unsupervised real-time anomaly detection for streaming data Neurocomputing 2017262 134minus147(119) Stocco A Tonella P Towards Anomaly Detectors that LearnContinuously IEEE 2020 pp 201minus208(120) Hoffman M Bach F R Blei D M In Online learning forlatent dirichlet allocation 2010 Citeseer pp 856minus864(121) Mongillo G Deneve S Online learning with hidden Markovmodels Neural computation 2008 20 (7) 1706minus1716(122) Bogler A Packman A Furman A Gross A Kushmaro ARonen A Dagot C Hill C Vaizel-Ohayon D Morgenroth ERethinking wastewater risks and monitoring in light of the COVID-19pandemic Nature Sustainability 2020 9 1minus10(123) Bivins A North D Ahmad A Ahmed W Alm E BeenF Bhattacharya P Bijlsma L Boehm A B Brown J Wastewater-Based Epidemiology Global Collaborative to Maximize Contributions inthe Fight Against COVID-19 ACS Publications 2020
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
M
researchers and professionals A plethora of online resourcessuch as Coursera LinkedIn Learning and Edx are available forlow or no cost offering high-quality content in data scienceExpanding access to and creation of open source learningplatforms could prove instrumental in instigating the data-driven problem-solving approachFinally we know that data analytics is not new but
continues to evolve at a rapid pace These advancementsfurnish us with new and powerful ways to analyze the data thatcan help holistically tackle the challenges faced by ESEcommunity and ultimately inform decision-making and policyformulation at scales never previously imagined Hence it iscritical to be abreast of new developments in data analytics andto continue making efforts to harvest their full potential
AUTHOR INFORMATIONCorresponding AuthorPeter Vikesland minus Via Department of Civil andEnvironmental Engineering Virginia Tech BlacksburgVirginia 24061 United States orcidorg0000-0003-2654-5132 Email pvikesvtedu
AuthorsSuraj Gupta minus The Interdisciplinary PhD Program in GeneticsBioinformatics and Computational Biology Virginia TechBlacksburg Virginia 24061 United States
Diana Aga minus Department of Chemistry University at BuffaloThe State University of New York Buffalo New York 14226United States orcidorg0000-0001-6512-7713
Amy Pruden minus Via Department of Civil and EnvironmentalEngineering Virginia Tech Blacksburg Virginia 24061United States orcidorg0000-0002-3191-6244
Liqing Zhang minus Department of Computer Science VirginiaTech Blacksburg Virginia 24061 United States
Complete contact information is available athttpspubsacsorg101021acsest1c01026
NotesThe authors declare no competing financial interestBiography
Peter J Vikesland is the Nick Prillaman Professor of Civil andEnvironmental Engineering at Virginia Tech He received his BAfrom Grinnell College in Chemistry in 1993 and MS and PhD inCivil and Environmental Engineering from University of Iowa in 1995and 1998 Vikeslandrsquos research focuses on the fate of nanomaterials inthe environment nanotechnology enabled sensor platforms forenvironmental quality assessment the environmental disseminationof antibiotic resistance and the application of data analytics in
environmental science and engineering He is a past President of theAssociation of Environmental Engineering and Science Professors(AEESP) is a US National Science Foundation CAREER awardeeand is the recipient of the 2018 Walter Weber Research InnovationAward from AEESP
ACKNOWLEDGMENTS
This study was supported by National Science Foundation(NSF) awards OISE (1545756) ECCSS NNCI (1542100)and CSSI (2004751) Additional support was provided by theCenter for Science and Engineering of the Exposome at theVirginia Tech Institute for Critical Technology and AppliedScience (ICTAS) and the Virginia Tech Graduate Schoolsupported Genetics Bioinformatics and ComputationalBiology (GBCB) and Sustainable Nanotechnology (VTSuN)programs
GLOSSARY
Accuracy Correct predictionstotal predictions
Autoencoders A deep learning techni-que that aims to build amodel where the outputtargets are set equal tothe input variables Itseeks to learn an ap-proximate representa-tion of the data andreconstruct it It is anunsupervised learningapproach used for di-mensionality reduction
Bayesian Network Bayesian networks are akind of probabilisticgraphical model thatutilizes Bayesian infer-ence for probability es-timations Bayesian net-works are used tomodel conditional de-pendence and hencecausation using a direc-ted graph
Data cleaning Identification of outliersor filling in of missingvalues Outlier identifi-cation can be done byusing Z-score quartilevalues or hypothesistesting Missing datacan be handled in mul-tiple ways such as fillingthe value by computingthe summary statisticsof the given variableusing predicted valuescomputed by an MLalgorithm manual cura-tion or by ignoring themissing record
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
I
Data integration The combination ofmultiple data sourcesinto a single coherentform
Data reduction Aggregation or elimina-tion of redundant in-formation to reduce thedata set size
Data transformation Conversion of raw databy normalizing to acommon scale to en-sure consistency andcomparability
Density Based Clustering Work by recognizingldquodenserdquo groups ofpoints permitting it tolearn clusters of discre-tionary shape and dis-tinguish anomalies inthe information
Dimensionality Reduction The process of reducingthe number of randomvariables or attributesunder consideration
EM Clustering Similar to k-means ex-cept it assigns the sam-ples into clusters basedon the probabilities es-timated using the EMalgorithm The objec-tive is to maximize theoverall probability ofthe data for the given(final) clusters
Ensemble Methods Ensemble methods arealgorithms that com-bine predictions fromseveral base models toobtain one optimalmodel
F1-Score F1-Score is a harmonicmean of recall andprecision
HMM HMM is a statisticalMarkov mode l i nwhich the system beingmodeled is assumed tobe a Markov process
k-mer A substring with alength k in a biologicalsequence of nucleoti-des
Naive Bayes Naive Bayes methodsare a family of simpleprobabilistic classifiersbased on Bayes ruleThe method is calledldquoNaiverdquo for its assump-tion of conditional in-dependence among thefeatures
Natural Language Processing (NLP) A subfield of computerscience artificial intelli-gence and linguisticsthat deals with theinteraction of com-puters with human lan-guages
NMDS NMDS is an ordinationtechnique based on dis-tance or dissimilaritymatrix NMDS repre-sents pairwise dissimi-larity between samplesin a low-dimensionalspace
PLS-DA PLS-DA is a linearclassification modelthat is able to predictthe class of new sam-ples
R Programming languagefor Statistical Comput-ing
ROC-Curve A receiver operatingcharacteristic (ROC)curve is a plot of truepositive rate (TPR-yaxis) against the falsepositive rate (FPR-xaxis) It measures theperformance of a classi-fier
SCADA Supervisory control andd a t a a c q u i s i t i o n(SCADA) is a systemdesigned to gather andanalyze real time dataIt is used in waterwastewater treatmentplants to monitor andmanage processes
REFERENCES(1) Tukey J W The future of data analysis Ann Math Stat 196233 (1) 1minus67(2) Hayashi C What is data science Fundamental concepts and aheuristic example In Data science classification and related methodsSpringer 1998 pp 40minus51(3) Bishop C M Pattern recognition and machine learning Springer2006(4) Qiu J Wu Q Ding G Xu Y Feng S A survey of machinelearning for big data processing EURASIP Journal on Advances inSignal Processing 2016 2016 (1) 67(5) Agarwal R Dhar V Big data data science and analytics Theopportunity and challenge for IS research In INFORMS 201425443(6) Wilcox C Woon W L Aung Z Applications of machinelearning in environmental engineering Citeseer 2013(7) Hendriksen R S Munk P Njage P Van Bunnik BMcNally L Lukjancenko O Roumlder T Nieuwenhuijse DPedersen S K Kjeldgaard J Global monitoring of antimicrobialresistance based on metagenomics analyses of urban sewage NatCommun 2019 10 (1) 1124(8) Sobus J R Wambaugh J F Isaacs K K Williams A JMcEachran A D Richard A M Grulke C M Ulrich E MRager J E Strynar M J Integrating tools for non-targeted analysis
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
J
research and chemical safety evaluations at the US EPA J ExposureSci Environ Epidemiol 2018 28 (5) 411minus426(9) Dogo E M Nwulu N I Twala B Aigbavboa C A survey ofmachine learning methods applied to anomaly detection on drinking-water quality data Urban Water J 2019 16 (3) 235minus248(10) Wilde F D Radtke D B Gibs J Iwatsubo R T Nationalfield manual for the collection of water-quality data US Geological SurveyTechniques of Water-Resources Investigations Book 9 1998(11) National Research Council Confronting the nationrsquos waterproblems The role of research National Academies Press 2004(12) Bharti R Grimm D G Current challenges and best-practiceprotocols for microbiome analysis Briefings in bioinformatics202122178(13) Veenaas C Developing tools for non-target analysis and digitalarchiving of organic urban water pollutants 2018(14) Keim D A Visual exploration of large data sets CommunACM 2001 44 (8) 38minus44(15) Chen J Chen Y Du X Li C Lu J Zhao S Zhou X Bigdata challenge a data management perspective Frontiers of ComputerScience 2013 7 (2) 157minus164(16) Chakrabarti S Cox E Frank E Guting R H Han J JiangX Kamber M Lightstone S S Nadeau T P Neapolitan R EData mining know it all Morgan Kaufmann 2008(17) Gibert K Sanchez-Marre M Izquierdo J A survey on pre-processing techniques Relevant issues in the context of environ-mental data mining AI Communications 2016 29 (6) 627minus663(18) Russell S Norvig P Artificial intelligence a modern approach2002(19) Hinton G E Sejnowski T J Poggio T A UnsupervisedLearning Foundations of Neural Computation MIT press 1999(20) Van Engelen J E Hoos H H A survey on semi-supervisedlearning Machine Learning 2020 109 (2) 373minus440(21) Brink H Richards J W Fetherolf M Cronin B Real-worldmachine learning Manning Shelter Island NY 2017(22) Kassambara A Practical guide to cluster analysis in RUnsupervised machine learning Sthda Vol 1(23) Ghodsi A Dimensionality reduction a short tutorialDepartment of Statistics and Actuarial Science Univ of WaterlooOntario Canada 2006 37 (38) 2006(24) Wright R E Logistic regression 1995(25) Alvarez-Arbesu R Feliciacutesimo A M GIS and logisticregression as tools for environmental management a coastal cliffvegetation model in Northern Spain WIT Transactions on Informationand Communication Technologies 2002 26(26) Liu L Sankarasubramanian A Ranjithan S R Logisticregression analysis to estimate contaminant sources in waterdistribution systems J Hydroinf 2011 13 (3) 545minus557(27) Thoe W Gold M Griesbach A Grimmer M Taggart ML Boehm A B Predicting water quality at Santa Monica Beachevaluation of five different models for public notification of unsafeswimming conditions Water Res 2014 67 105minus117(28) Muharemi F Logofa tu D Andersson C Leon FApproaches to building a detection model for water quality a casestudy In Modern Approaches for Intelligent Information and DatabaseSystems Springer 2018 pp 173minus183(29) Liaw A Wiener M Classification and regression byrandomForest R news 2002 2 (3) 18minus22(30) Gromski P S Muhamadali H Ellis D I Xu Y Correa ETurner M L Goodacre R A tutorial review Metabolomics andpartial least squares-discriminant analysis-a marriage of convenienceor a shotgun wedding Anal Chim Acta 2015 879 10minus23(31) Lee S Kim J Prediction of Nanofiltration and Reverse-Osmosis-Membrane Rejection of Organic Compounds UsingRandom Forest Model J Environ Eng 2020 146 (11) 04020127(32) Castrillo M Garciacutea A L Estimation of high frequencynutrient concentrations from water quality surrogates using machinelearning methods Water Res 2020 172 115490
(33) Tan G Yan J Gao C Yang S Prediction of water qualitytime series data based on least squares support vector machineProcedia Eng 2012 31 1194minus1199(34) Yang Y H Guergachi A Khan G Support vector machinesfor environmental informatics application to modelling the nitrogenremoval processes in wastewater treatment systems Journal ofEnvironmental Informatics 2006 7 (1) 14minus25(35) Feng J Pan L Cui B Sun Y Zhang A Gong M NewApproach for Concentration Prediction of Aqueous PhenolicContaminants by Using Wavelet Analysis and Support VectorMachine Environmental Engineering Science 2020 37 (5) 382minus392(36) Silver D Schrittwieser J Simonyan K Antonoglou IHuang A Guez A Hubert T Baker L Lai M Bolton AMastering the game of go without human knowledge Nature 2017550 (7676) 354minus359(37) Mikolov T Deoras A Povey D Burget L Cernocky JStrategies for Training Large Scale Neural Network Language ModelsIEEE 2011 pp 196minus201(38) Deng J Dong W Socher R Li L-J Li K Fei-Fei L InImagenet A Large-Scale Hierarchical Image Database Ieee 2009 pp248minus255(39) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (Aug) 2493minus2537(40) Alwosheel A van Cranenburgh S Chorus C G Is yourdataset big enough Sample size requirements when using artificialneural networks for discrete choice analysis Journal of choice modelling2018 28 167minus182(41) Yang J Xu J Zhang X Wu C Lin T Ying Y Deeplearning for vibrational spectral analysis Recent progress and apractical guide Anal Chim Acta 2019 1081 6minus17(42) Guo H Jeong K Lim J Jo J Kim Y M Park J-p Kim JH Cho K H Prediction of effluent concentration in a wastewatertreatment plant using machine learning models J Environ Sci 201532 90minus101(43) Inoue J Yamagata Y Chen Y Poskitt C M Sun JAnomaly Detection for a Water Treatment System Using UnsupervisedMachine Learning 2017 2017 IEEE 2017 pp 1058minus1065(44) Voukantsis D Karatzas K Kukkonen J Raumlsaumlnen TKarppinen A Kolehmainen M Intercomparison of air quality datausing principal component analysis and forecasting of PM10 andPM2 5 concentrations using artificial neural networks in Thessalonikiand Helsinki Sci Total Environ 2011 409 (7) 1266minus1276(45) Wold S Esbensen K Geladi P Principal componentanalysis Chemom Intell Lab Syst 1987 2 (1minus3) 37minus52(46) Jolliffe I T Cadima J Principal component analysis a reviewand recent developments Philos Trans R Soc A 2016 374 (2065)20150202(47) Dalal S G Shirodkar P V Jagtap T G Naik B G Rao GS Evaluation of significant sources influencing the variation of waterquality of Kandla creek Gulf of Katchchh using PCA Environ MonitAssess 2010 163 (1minus4) 49minus56(48) Scholleacutee J E Schymanski E L Avak S E Loos MHollender J Prioritizing unknown transformation products frombiologically-treated wastewater using high-resolution mass spectrom-etry multivariate statistics and metabolic logic Anal Chem 2015 87(24) 12121minus12129(49) Masiaacute A Campo J Blasco C Picoacute Y Ultra-highperformance liquid chromatography-quadrupole time-of-flight massspectrometry to identify contaminants in water an insight onenvironmental forensics Journal of Chromatography A 2014 134586minus97(50) Peiris R H Halleacute C Budman H Moresoli C Peldszus SHuck P M Legge R L Identifying fouling events in a membrane-based drinking water treatment process using principal componentanalysis of fluorescence excitation-emission matrices Water Res 201044 (1) 185minus194(51) MacQueen J In Some methods for classification and analysis ofmultivariate observations 1967 Oakland CA 1967 pp 281minus297
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
K
(52) Kingsy G R Manimegalai R Geetha D M S Rajathi SUsha K Raabiathul B N In Air pollution analysis using enhanced K-Means clustering algorithm for real time sensor data IEEE 2016 pp1945minus1949(53) Zou H Zou Z Wang X An enhanced K-means algorithmfor water quality analysis of the Haihe River in China Int J EnvironRes Public Health 2015 12 (11) 14400minus14413(54) Areerachakul S Sanguansintukul S Clustering analysis of waterquality for canals in Bangkok Thailand Springer 2010 pp 215minus227(55) Javadi S Hashemy S M Mohammadi K Howard K W FNeshat A Classification of aquifer vulnerability using K-means clusteranalysis J Hydrol 2017 549 27minus37(56) Li C Sun L Jia J Cai Y Wang X Risk assessment ofwater pollution sources based on an integrated k-means clustering andset pair analysis method in the region of Shiyan China Sci TotalEnviron 2016 557 307minus316(57) Brunner A M Bertelkamp C Dingemans M M LKolkman A Wols B Harmsen D Siegers W Martijn B JOorthuizen W A Ter Laak T L Integration of target analyses non-target screening and effect-based monitoring to assess OMP relatedwater quality changes in drinking water treatment Sci Total Environ2020 705 135779(58) Gupta S Arango-Argoty G Zhang L Pruden A VikeslandP Identification of discriminatory antibiotic resistance genes amongenvironmental resistomes using extremely randomized tree algorithmMicrobiome 2019 7 (1) 123(59) Du X Shao F Wu S Zhang H Xu S Water qualityassessment with hierarchical cluster analysis based on Mahalanobisdistance Environ Monit Assess 2017 189 (7) 1minus12(60) Bibby K Crank K Greaves J Li X Wu Z Hamza I AStachler E Metagenomics and the development of viral water qualitytools npj Clean Water 2019 2 (1) 1minus13(61) Chu B T T Petrovich M L Chaudhary A Wright DMurphy B Wells G Poretsky R Metagenomics reveals the impactof wastewater treatment plants on the dispersal of microorganismsand genes in aquatic sediments Appl Environ Microbiol 2018 84 (5)No e02168-17(62) Petrovich M L Zilberman A Kaplan A Eliraz G R WangY Langenfeld K Duhaime M Wigginton K Poretsky R AvisarD Microbial and Viral Communities and Their Antibiotic ResistanceGenes Throughout a Hospital Wastewater Treatment System FrontMicrobiol 2020 11 153(63) Dai D Rhoads W J Edwards M A Pruden A Shotgunmetagenomics reveals taxonomic and functional shifts in hot watermicrobiome due to temperature setting and stagnation FrontMicrobiol 2018 9 2695(64) Malla M A Dubey A Yadav S Kumar A Hashem AAbd_Allah E F Understanding and designing the strategies for themicrobe-mediated remediation of environmental contaminants usingomics approaches Front Microbiol 2018 9 1132(65) Giwa A S Ali N Athar M A Wang K Dissectingmicrobial community structure in sewage treatment plant forpathogensrsquo detection using metagenomic sequencing technologyArch Microbiol 2020 202 1minus9(66) Guo J Ni B-J Han X Chen X Bond P Peng Y YuanZ Data on metagenomic profiles of activated sludge from a full-scalewastewater treatment plant Data in brief 2017 15 833minus839(67) Laudadio I Fulci V Stronati L Carissimi C Next-generation metagenomics Methodological challenges and opportu-nities Omics a journal of integrative biology 2019 23 (7) 327minus333(68) Soueidan H Nikolski M Machine learning for metagenomicsmethods and tools arXiv preprint arXiv151006621 2015DOI 101515metgen-2016-0001(69) Ramette A Multivariate analyses in microbial ecology FEMSMicrobiol Ecol 2007 62 (2) 142minus160(70) Yang P Yu S Cheng L Ning K Meta-network optimizedspecies-species network analysis for microbial communities BMCGenomics 2019 20 (2) 187
(71) Collignon P Beggs J J Walsh T R Gandra SLaxminarayan R Anthropological and socioeconomic factorscontributing to global antimicrobial resistance a univariate andmultivariable analysis Lancet Planetary Health 2018 2 (9) No e398-e405(72) Arango-Argoty G Garner E Pruden A Heath L SVikesland P Zhang L DeepARG a deep learning approach forpredicting antibiotic resistance genes from metagenomic dataMicrobiome 2018 6 (1) 1minus15(73) El-Gebali S Mistry J Bateman A Eddy S R Luciani APotter S C Qureshi M Richardson L J Salazar G A Smart AThe Pfam protein families database in 2019 Nucleic Acids Res 201947 (D1) D427minusD432(74) Berglund F Marathe N P Osterlund T Bengtsson-PalmeJ Kotsakis S Flach C-F Larsson D G J Kristiansson EIdentification of 76 novel B1 metallo-β-lactamases through large-scalescreening of genomic and metagenomic data Microbiome 2017 5 (1)134(75) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (ARTICLE) 2493minus2537(76) Ng P dna2vec Consistent vector representations of variable-length k-mers arXiv preprint arXiv170106279 2017(77) Woloszynek S Zhao Z Chen J Rosen G L 16S rRNAsequence embeddings Meaningful numeric feature representations ofnucleotide sequences that are convenient for downstream analysesPLoS Comput Biol 2019 15 (2) No e1006721(78) Arango-Argoty G A Heath L S Pruden A Vikesland PZhang L MetaMLP A fast word embedding based classifier to profiletarget gene databases in metagenomic samples bioRxiv 2019 569970(79) Liang Q Bible P W Liu Y Zou B Wei L DeepMicrobestaxonomic classification for metagenomics with deep learning NARGenomics and Bioinformatics 2020 2 (1) No lqaa009(80) Fiannaca A La Paglia L La Rosa M Renda G Rizzo RGaglio S Urso A Deep learning models for bacteria taxonomicclassification of metagenomic data BMC Bioinf 2018 19 (7) 198(81) Krawczyk P S Lipinski L Dziembowski A PlasFlowpredicting plasmid sequences in metagenomic data using genomesignatures Nucleic Acids Res 2018 46 (6) No e35-e35(82) Hollender J van Bavel B Dulio V Farmen E FurtmannK Koschorreck J Kunkel U Krauss M Munthe J Schlabach MHigh resolution mass spectrometry-based non-target screening cansupport regulatory environmental monitoring and chemicals manage-ment Environ Sci Eur 2019 31 (1) 42(83) Soria N G C Montes A Bisson M A Atilla-Gokcumen GE Aga D S Mass spectrometry-based metabolomics to assess uptakeof silver nanoparticles by Arabidopsis thaliana Environ Sci Nano2017 4 (10) 1944minus1953(84) Matich E K Ghafari M Camgoz E Caliskan E Pfeifer BA Haznedaroglu B Z Atilla-Gokcumen G E Time-serieslipidomic analysis of the oleaginous green microalga species Ettliaoleoabundans under nutrient stress Biotechnol Biofuels 2018 11 (1)1minus15(85) Matich E K Soria N G C Aga D S Atilla-Gokcumen GE Applications of metabolomics in assessing ecological effects ofemerging contaminants and pollutants on plants J Hazard Mater2019 373 527minus535(86) Knolhoff A M Premo J H Fisher C M A ProposedQuality Control Standard Mixture and Its Uses for EvaluatingNontargeted and Suspect Screening LCHR-MS Method Perform-ance Anal Chem 2021931596(87) Samanipour S Kaserzon S Vijayasarathy S Jiang H ChoiP Reid M J Mueller J F Thomas K V Machine learningcombined with non-targeted LC-HRMS analysis for a risk warningsystem of chemical hazards in drinking water A proof of conceptTalanta 2019 195 426minus432(88) Angeles L F Islam S Aldstadt J Saqeeb K N Alam MKhan M A Johura F-T Ahmed S I Aga D S Retrospectivesuspect screening reveals previously ignored antibiotics antifungal
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
L
compounds and metabolites in Bangladesh surface waters Sci TotalEnviron 2020 712 136285(89) Albergamo V Scholleacutee J E Schymanski E L Helmus RTimmer H Hollender J De Voogt P Nontarget screening revealstime trends of polar micropollutants in a riverbank filtration systemEnviron Sci Technol 2019 53 (13) 7584minus7594(90) Chiaia-Hernaacutendez A C Scheringer M Muller A StiegerG Waumlchter D Keller A Pintado-Herrera M G Lara-Martin PA Bucheli T D Hollender J Target and suspect screening analysisreveals persistent emerging organic contaminants in soils andsediments Sci Total Environ 2020 740 140181(91) Cariou R Meacutendez-Fernandez P Hutinet S b Guitton YCaurant F Le Bizec B Spitz J r m Vetter W Dervilly GNontargeted LCESI-HRMS Detection of Polyhalogenated Com-pounds in Marine Mammals Stranded on French Atlantic Coasts ACSESampT Water 20211309(92) Travis S C Kordas K Aga D S Optimized workflow forunknown screening using gas chromatography high-resolution massspectrometry expands identification of contaminants in siliconepersonal passive samplers Rapid Commun Mass Spectrom 2021 35(8) No e9048(93) Hollender J Schymanski E L Singer H P Ferguson P LNontarget screening with high resolution mass spectrometry in theenvironment ready to go ACS Publications 2017(94) Muller A Schulz W Ruck W K L Weber W H A newapproach to data evaluation in the non-target screening of organictrace substances in water analysis Chemosphere 2011 85 (8) 1211minus1219(95) Scholleacutee J E Schymanski E L Hollender J Statisticalapproaches for LC-HRMS data to characterize prioritize and identifytransformation products from water treatment processes In AssessingTransformation Products of Chemicals by Non-Target and SuspectScreening- Strategies and Workflows Vol 1 ACS Publications 2016 pp45minus65(96) Sireacuten K Fischer U Vestner J Automated supervisedlearning pipeline for non-targeted GC-MS data analysis AnalyticaChimica Acta X 2019 1 100005(97) Sinkov N A Harynuk J J Cluster resolution A metric forautomated objective and optimized feature selection in chemometricmodeling Talanta 2011 83 (4) 1079minus1087(98) Johnsen L G Amigo J M Skov T Bro R Automatedresolution of overlapping peaks in chromatographic data J Chemom2014 28 (2) 71minus82(99) Smirnov A Jia W Walker D I Jones D P Du X ADAP-GC 32 graphical software tool for efficient spectral deconvolution ofgas chromatography-high-resolution mass spectrometry metabolomicsdata J Proteome Res 2018 17 (1) 470minus478(100) Chen C Husny J Rabe S Predicting fishiness off-flavourand identifying compounds of lipid oxidation in dairy powders bySPME-GCMS and machine learning Int Dairy J 2018 77 19minus28(101) Taghadomi-Saberi S Mas Garcia S Allah Masoumi ASadeghi M Marco S Classification of bitter orange essential oilsaccording to fruit ripening stage by untargeted chemical profiling andmachine learning Sensors 2018 18 (6) 1922(102) Yang Q Xu L Tang L-J Yang J-T Wu B-Q Chen NJiang J-H Yu R-Q Simultaneous detection of multiple inheritedmetabolic diseases using GC-MS urinary metabolomics by chemo-metrics multi-class classification strategies Talanta 2018 186 489minus496(103) Smolinska A Hauschild A C Fijten R R R Dallinga JW Baumbach J Van Schooten F J Current breathomicsa reviewon data pre-processing techniques and machine learning inmetabolomics breath analysis J Breath Res 2014 8 (2) 027105(104) Garrido-Baserba M Corominas L Corteacutes U Rosso DPoch M The fourth-revolution in the water sector encounters thedigital revolution Environ Sci Technol 2020 54 (8) 4698minus4705(105) Vujnovic G Perisic J Bozilovic Z Milovanovic MGetman R V Radovanovic L USING SCADA SYSTEM FOR
PROCESS CONTROL IN WATER INDUSTRY Acta TechnicaCorviniensis-Bulletin of Engineering 2019 12 (2) 67minus72(106) Olsson G ICA and me-a subjective review Water Res 201246 (6) 1585minus1624(107) Zhao H Hou D Huang P Zhang G Water quality eventdetection in drinking water network Water Air Soil Pollut 2014 225(11) 2183(108) Chandola V Banerjee A Kumar V Anomaly detection Asurvey ACM computing surveys (CSUR) 2009 41 (3) 1minus58(109) Hasan J Goldbloom-Helzner D Ichida A Rouse TGibson M Technologies and Techniques for Early Warning Systems toMonitor and Evaluate Drinking Water Quality A State-of-the-ArtReview Environmental Protection Agency Washington Dc Office ofWater 2005(110) Izquierdo J Loacutepez P A Martiacutenez F J Peacuterez R Faultdetection in water supply systems using hybrid (theory and data-driven) modelling Mathematical and Computer Modelling 2007 46(3minus4) 341minus350(111) Murray S Ghazali M McBean E A Real-time water qualitymonitoring assessment of multisensor data using Bayesian beliefnetworks Journal of Water Resources Planning and Management 2012138 (1) 63minus70(112) Raciti M Cucurull J Nadjm-Tehrani S Anomaly detectionin water management systems In Critical Infrastructure ProtectionSpringer 2012 pp 98minus119(113) Ahmed M Mahmood A N Hu J A survey of networkanomaly detection techniques Journal of Network and ComputerApplications 2016 60 19minus31(114) Yuan Y Jia K A Water Quality Assessment Method Based onSparse Autoencoder 2015 IEEE 2015 pp 1minus4(115) Muharemi F Logofatu D Leon F Machine learningapproaches for anomaly detection of water quality on a real-worlddata set Journal of Information and Telecommunication 2019 3 (3)294minus307(116) Alom M Z Taha T M Yakopcic C Westberg S SidikeP Nasrin M S Hasan M Van Essen B C Awwal A A S AsariV K A state-of-the-art survey on deep learning theory andarchitectures Electronics 2019 8 (3) 292(117) Russo S Lurig M Hao W Matthews B Villez K Activelearning for anomaly detection in environmental data EnvironmentalModelling amp Software 2020 134 104869(118) Ahmad S Lavin A Purdy S Agha Z Unsupervised real-time anomaly detection for streaming data Neurocomputing 2017262 134minus147(119) Stocco A Tonella P Towards Anomaly Detectors that LearnContinuously IEEE 2020 pp 201minus208(120) Hoffman M Bach F R Blei D M In Online learning forlatent dirichlet allocation 2010 Citeseer pp 856minus864(121) Mongillo G Deneve S Online learning with hidden Markovmodels Neural computation 2008 20 (7) 1706minus1716(122) Bogler A Packman A Furman A Gross A Kushmaro ARonen A Dagot C Hill C Vaizel-Ohayon D Morgenroth ERethinking wastewater risks and monitoring in light of the COVID-19pandemic Nature Sustainability 2020 9 1minus10(123) Bivins A North D Ahmad A Ahmed W Alm E BeenF Bhattacharya P Bijlsma L Boehm A B Brown J Wastewater-Based Epidemiology Global Collaborative to Maximize Contributions inthe Fight Against COVID-19 ACS Publications 2020
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
M
Data integration The combination ofmultiple data sourcesinto a single coherentform
Data reduction Aggregation or elimina-tion of redundant in-formation to reduce thedata set size
Data transformation Conversion of raw databy normalizing to acommon scale to en-sure consistency andcomparability
Density Based Clustering Work by recognizingldquodenserdquo groups ofpoints permitting it tolearn clusters of discre-tionary shape and dis-tinguish anomalies inthe information
Dimensionality Reduction The process of reducingthe number of randomvariables or attributesunder consideration
EM Clustering Similar to k-means ex-cept it assigns the sam-ples into clusters basedon the probabilities es-timated using the EMalgorithm The objec-tive is to maximize theoverall probability ofthe data for the given(final) clusters
Ensemble Methods Ensemble methods arealgorithms that com-bine predictions fromseveral base models toobtain one optimalmodel
F1-Score F1-Score is a harmonicmean of recall andprecision
HMM HMM is a statisticalMarkov mode l i nwhich the system beingmodeled is assumed tobe a Markov process
k-mer A substring with alength k in a biologicalsequence of nucleoti-des
Naive Bayes Naive Bayes methodsare a family of simpleprobabilistic classifiersbased on Bayes ruleThe method is calledldquoNaiverdquo for its assump-tion of conditional in-dependence among thefeatures
Natural Language Processing (NLP) A subfield of computerscience artificial intelli-gence and linguisticsthat deals with theinteraction of com-puters with human lan-guages
NMDS NMDS is an ordinationtechnique based on dis-tance or dissimilaritymatrix NMDS repre-sents pairwise dissimi-larity between samplesin a low-dimensionalspace
PLS-DA PLS-DA is a linearclassification modelthat is able to predictthe class of new sam-ples
R Programming languagefor Statistical Comput-ing
ROC-Curve A receiver operatingcharacteristic (ROC)curve is a plot of truepositive rate (TPR-yaxis) against the falsepositive rate (FPR-xaxis) It measures theperformance of a classi-fier
SCADA Supervisory control andd a t a a c q u i s i t i o n(SCADA) is a systemdesigned to gather andanalyze real time dataIt is used in waterwastewater treatmentplants to monitor andmanage processes
REFERENCES(1) Tukey J W The future of data analysis Ann Math Stat 196233 (1) 1minus67(2) Hayashi C What is data science Fundamental concepts and aheuristic example In Data science classification and related methodsSpringer 1998 pp 40minus51(3) Bishop C M Pattern recognition and machine learning Springer2006(4) Qiu J Wu Q Ding G Xu Y Feng S A survey of machinelearning for big data processing EURASIP Journal on Advances inSignal Processing 2016 2016 (1) 67(5) Agarwal R Dhar V Big data data science and analytics Theopportunity and challenge for IS research In INFORMS 201425443(6) Wilcox C Woon W L Aung Z Applications of machinelearning in environmental engineering Citeseer 2013(7) Hendriksen R S Munk P Njage P Van Bunnik BMcNally L Lukjancenko O Roumlder T Nieuwenhuijse DPedersen S K Kjeldgaard J Global monitoring of antimicrobialresistance based on metagenomics analyses of urban sewage NatCommun 2019 10 (1) 1124(8) Sobus J R Wambaugh J F Isaacs K K Williams A JMcEachran A D Richard A M Grulke C M Ulrich E MRager J E Strynar M J Integrating tools for non-targeted analysis
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
J
research and chemical safety evaluations at the US EPA J ExposureSci Environ Epidemiol 2018 28 (5) 411minus426(9) Dogo E M Nwulu N I Twala B Aigbavboa C A survey ofmachine learning methods applied to anomaly detection on drinking-water quality data Urban Water J 2019 16 (3) 235minus248(10) Wilde F D Radtke D B Gibs J Iwatsubo R T Nationalfield manual for the collection of water-quality data US Geological SurveyTechniques of Water-Resources Investigations Book 9 1998(11) National Research Council Confronting the nationrsquos waterproblems The role of research National Academies Press 2004(12) Bharti R Grimm D G Current challenges and best-practiceprotocols for microbiome analysis Briefings in bioinformatics202122178(13) Veenaas C Developing tools for non-target analysis and digitalarchiving of organic urban water pollutants 2018(14) Keim D A Visual exploration of large data sets CommunACM 2001 44 (8) 38minus44(15) Chen J Chen Y Du X Li C Lu J Zhao S Zhou X Bigdata challenge a data management perspective Frontiers of ComputerScience 2013 7 (2) 157minus164(16) Chakrabarti S Cox E Frank E Guting R H Han J JiangX Kamber M Lightstone S S Nadeau T P Neapolitan R EData mining know it all Morgan Kaufmann 2008(17) Gibert K Sanchez-Marre M Izquierdo J A survey on pre-processing techniques Relevant issues in the context of environ-mental data mining AI Communications 2016 29 (6) 627minus663(18) Russell S Norvig P Artificial intelligence a modern approach2002(19) Hinton G E Sejnowski T J Poggio T A UnsupervisedLearning Foundations of Neural Computation MIT press 1999(20) Van Engelen J E Hoos H H A survey on semi-supervisedlearning Machine Learning 2020 109 (2) 373minus440(21) Brink H Richards J W Fetherolf M Cronin B Real-worldmachine learning Manning Shelter Island NY 2017(22) Kassambara A Practical guide to cluster analysis in RUnsupervised machine learning Sthda Vol 1(23) Ghodsi A Dimensionality reduction a short tutorialDepartment of Statistics and Actuarial Science Univ of WaterlooOntario Canada 2006 37 (38) 2006(24) Wright R E Logistic regression 1995(25) Alvarez-Arbesu R Feliciacutesimo A M GIS and logisticregression as tools for environmental management a coastal cliffvegetation model in Northern Spain WIT Transactions on Informationand Communication Technologies 2002 26(26) Liu L Sankarasubramanian A Ranjithan S R Logisticregression analysis to estimate contaminant sources in waterdistribution systems J Hydroinf 2011 13 (3) 545minus557(27) Thoe W Gold M Griesbach A Grimmer M Taggart ML Boehm A B Predicting water quality at Santa Monica Beachevaluation of five different models for public notification of unsafeswimming conditions Water Res 2014 67 105minus117(28) Muharemi F Logofa tu D Andersson C Leon FApproaches to building a detection model for water quality a casestudy In Modern Approaches for Intelligent Information and DatabaseSystems Springer 2018 pp 173minus183(29) Liaw A Wiener M Classification and regression byrandomForest R news 2002 2 (3) 18minus22(30) Gromski P S Muhamadali H Ellis D I Xu Y Correa ETurner M L Goodacre R A tutorial review Metabolomics andpartial least squares-discriminant analysis-a marriage of convenienceor a shotgun wedding Anal Chim Acta 2015 879 10minus23(31) Lee S Kim J Prediction of Nanofiltration and Reverse-Osmosis-Membrane Rejection of Organic Compounds UsingRandom Forest Model J Environ Eng 2020 146 (11) 04020127(32) Castrillo M Garciacutea A L Estimation of high frequencynutrient concentrations from water quality surrogates using machinelearning methods Water Res 2020 172 115490
(33) Tan G Yan J Gao C Yang S Prediction of water qualitytime series data based on least squares support vector machineProcedia Eng 2012 31 1194minus1199(34) Yang Y H Guergachi A Khan G Support vector machinesfor environmental informatics application to modelling the nitrogenremoval processes in wastewater treatment systems Journal ofEnvironmental Informatics 2006 7 (1) 14minus25(35) Feng J Pan L Cui B Sun Y Zhang A Gong M NewApproach for Concentration Prediction of Aqueous PhenolicContaminants by Using Wavelet Analysis and Support VectorMachine Environmental Engineering Science 2020 37 (5) 382minus392(36) Silver D Schrittwieser J Simonyan K Antonoglou IHuang A Guez A Hubert T Baker L Lai M Bolton AMastering the game of go without human knowledge Nature 2017550 (7676) 354minus359(37) Mikolov T Deoras A Povey D Burget L Cernocky JStrategies for Training Large Scale Neural Network Language ModelsIEEE 2011 pp 196minus201(38) Deng J Dong W Socher R Li L-J Li K Fei-Fei L InImagenet A Large-Scale Hierarchical Image Database Ieee 2009 pp248minus255(39) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (Aug) 2493minus2537(40) Alwosheel A van Cranenburgh S Chorus C G Is yourdataset big enough Sample size requirements when using artificialneural networks for discrete choice analysis Journal of choice modelling2018 28 167minus182(41) Yang J Xu J Zhang X Wu C Lin T Ying Y Deeplearning for vibrational spectral analysis Recent progress and apractical guide Anal Chim Acta 2019 1081 6minus17(42) Guo H Jeong K Lim J Jo J Kim Y M Park J-p Kim JH Cho K H Prediction of effluent concentration in a wastewatertreatment plant using machine learning models J Environ Sci 201532 90minus101(43) Inoue J Yamagata Y Chen Y Poskitt C M Sun JAnomaly Detection for a Water Treatment System Using UnsupervisedMachine Learning 2017 2017 IEEE 2017 pp 1058minus1065(44) Voukantsis D Karatzas K Kukkonen J Raumlsaumlnen TKarppinen A Kolehmainen M Intercomparison of air quality datausing principal component analysis and forecasting of PM10 andPM2 5 concentrations using artificial neural networks in Thessalonikiand Helsinki Sci Total Environ 2011 409 (7) 1266minus1276(45) Wold S Esbensen K Geladi P Principal componentanalysis Chemom Intell Lab Syst 1987 2 (1minus3) 37minus52(46) Jolliffe I T Cadima J Principal component analysis a reviewand recent developments Philos Trans R Soc A 2016 374 (2065)20150202(47) Dalal S G Shirodkar P V Jagtap T G Naik B G Rao GS Evaluation of significant sources influencing the variation of waterquality of Kandla creek Gulf of Katchchh using PCA Environ MonitAssess 2010 163 (1minus4) 49minus56(48) Scholleacutee J E Schymanski E L Avak S E Loos MHollender J Prioritizing unknown transformation products frombiologically-treated wastewater using high-resolution mass spectrom-etry multivariate statistics and metabolic logic Anal Chem 2015 87(24) 12121minus12129(49) Masiaacute A Campo J Blasco C Picoacute Y Ultra-highperformance liquid chromatography-quadrupole time-of-flight massspectrometry to identify contaminants in water an insight onenvironmental forensics Journal of Chromatography A 2014 134586minus97(50) Peiris R H Halleacute C Budman H Moresoli C Peldszus SHuck P M Legge R L Identifying fouling events in a membrane-based drinking water treatment process using principal componentanalysis of fluorescence excitation-emission matrices Water Res 201044 (1) 185minus194(51) MacQueen J In Some methods for classification and analysis ofmultivariate observations 1967 Oakland CA 1967 pp 281minus297
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
K
(52) Kingsy G R Manimegalai R Geetha D M S Rajathi SUsha K Raabiathul B N In Air pollution analysis using enhanced K-Means clustering algorithm for real time sensor data IEEE 2016 pp1945minus1949(53) Zou H Zou Z Wang X An enhanced K-means algorithmfor water quality analysis of the Haihe River in China Int J EnvironRes Public Health 2015 12 (11) 14400minus14413(54) Areerachakul S Sanguansintukul S Clustering analysis of waterquality for canals in Bangkok Thailand Springer 2010 pp 215minus227(55) Javadi S Hashemy S M Mohammadi K Howard K W FNeshat A Classification of aquifer vulnerability using K-means clusteranalysis J Hydrol 2017 549 27minus37(56) Li C Sun L Jia J Cai Y Wang X Risk assessment ofwater pollution sources based on an integrated k-means clustering andset pair analysis method in the region of Shiyan China Sci TotalEnviron 2016 557 307minus316(57) Brunner A M Bertelkamp C Dingemans M M LKolkman A Wols B Harmsen D Siegers W Martijn B JOorthuizen W A Ter Laak T L Integration of target analyses non-target screening and effect-based monitoring to assess OMP relatedwater quality changes in drinking water treatment Sci Total Environ2020 705 135779(58) Gupta S Arango-Argoty G Zhang L Pruden A VikeslandP Identification of discriminatory antibiotic resistance genes amongenvironmental resistomes using extremely randomized tree algorithmMicrobiome 2019 7 (1) 123(59) Du X Shao F Wu S Zhang H Xu S Water qualityassessment with hierarchical cluster analysis based on Mahalanobisdistance Environ Monit Assess 2017 189 (7) 1minus12(60) Bibby K Crank K Greaves J Li X Wu Z Hamza I AStachler E Metagenomics and the development of viral water qualitytools npj Clean Water 2019 2 (1) 1minus13(61) Chu B T T Petrovich M L Chaudhary A Wright DMurphy B Wells G Poretsky R Metagenomics reveals the impactof wastewater treatment plants on the dispersal of microorganismsand genes in aquatic sediments Appl Environ Microbiol 2018 84 (5)No e02168-17(62) Petrovich M L Zilberman A Kaplan A Eliraz G R WangY Langenfeld K Duhaime M Wigginton K Poretsky R AvisarD Microbial and Viral Communities and Their Antibiotic ResistanceGenes Throughout a Hospital Wastewater Treatment System FrontMicrobiol 2020 11 153(63) Dai D Rhoads W J Edwards M A Pruden A Shotgunmetagenomics reveals taxonomic and functional shifts in hot watermicrobiome due to temperature setting and stagnation FrontMicrobiol 2018 9 2695(64) Malla M A Dubey A Yadav S Kumar A Hashem AAbd_Allah E F Understanding and designing the strategies for themicrobe-mediated remediation of environmental contaminants usingomics approaches Front Microbiol 2018 9 1132(65) Giwa A S Ali N Athar M A Wang K Dissectingmicrobial community structure in sewage treatment plant forpathogensrsquo detection using metagenomic sequencing technologyArch Microbiol 2020 202 1minus9(66) Guo J Ni B-J Han X Chen X Bond P Peng Y YuanZ Data on metagenomic profiles of activated sludge from a full-scalewastewater treatment plant Data in brief 2017 15 833minus839(67) Laudadio I Fulci V Stronati L Carissimi C Next-generation metagenomics Methodological challenges and opportu-nities Omics a journal of integrative biology 2019 23 (7) 327minus333(68) Soueidan H Nikolski M Machine learning for metagenomicsmethods and tools arXiv preprint arXiv151006621 2015DOI 101515metgen-2016-0001(69) Ramette A Multivariate analyses in microbial ecology FEMSMicrobiol Ecol 2007 62 (2) 142minus160(70) Yang P Yu S Cheng L Ning K Meta-network optimizedspecies-species network analysis for microbial communities BMCGenomics 2019 20 (2) 187
(71) Collignon P Beggs J J Walsh T R Gandra SLaxminarayan R Anthropological and socioeconomic factorscontributing to global antimicrobial resistance a univariate andmultivariable analysis Lancet Planetary Health 2018 2 (9) No e398-e405(72) Arango-Argoty G Garner E Pruden A Heath L SVikesland P Zhang L DeepARG a deep learning approach forpredicting antibiotic resistance genes from metagenomic dataMicrobiome 2018 6 (1) 1minus15(73) El-Gebali S Mistry J Bateman A Eddy S R Luciani APotter S C Qureshi M Richardson L J Salazar G A Smart AThe Pfam protein families database in 2019 Nucleic Acids Res 201947 (D1) D427minusD432(74) Berglund F Marathe N P Osterlund T Bengtsson-PalmeJ Kotsakis S Flach C-F Larsson D G J Kristiansson EIdentification of 76 novel B1 metallo-β-lactamases through large-scalescreening of genomic and metagenomic data Microbiome 2017 5 (1)134(75) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (ARTICLE) 2493minus2537(76) Ng P dna2vec Consistent vector representations of variable-length k-mers arXiv preprint arXiv170106279 2017(77) Woloszynek S Zhao Z Chen J Rosen G L 16S rRNAsequence embeddings Meaningful numeric feature representations ofnucleotide sequences that are convenient for downstream analysesPLoS Comput Biol 2019 15 (2) No e1006721(78) Arango-Argoty G A Heath L S Pruden A Vikesland PZhang L MetaMLP A fast word embedding based classifier to profiletarget gene databases in metagenomic samples bioRxiv 2019 569970(79) Liang Q Bible P W Liu Y Zou B Wei L DeepMicrobestaxonomic classification for metagenomics with deep learning NARGenomics and Bioinformatics 2020 2 (1) No lqaa009(80) Fiannaca A La Paglia L La Rosa M Renda G Rizzo RGaglio S Urso A Deep learning models for bacteria taxonomicclassification of metagenomic data BMC Bioinf 2018 19 (7) 198(81) Krawczyk P S Lipinski L Dziembowski A PlasFlowpredicting plasmid sequences in metagenomic data using genomesignatures Nucleic Acids Res 2018 46 (6) No e35-e35(82) Hollender J van Bavel B Dulio V Farmen E FurtmannK Koschorreck J Kunkel U Krauss M Munthe J Schlabach MHigh resolution mass spectrometry-based non-target screening cansupport regulatory environmental monitoring and chemicals manage-ment Environ Sci Eur 2019 31 (1) 42(83) Soria N G C Montes A Bisson M A Atilla-Gokcumen GE Aga D S Mass spectrometry-based metabolomics to assess uptakeof silver nanoparticles by Arabidopsis thaliana Environ Sci Nano2017 4 (10) 1944minus1953(84) Matich E K Ghafari M Camgoz E Caliskan E Pfeifer BA Haznedaroglu B Z Atilla-Gokcumen G E Time-serieslipidomic analysis of the oleaginous green microalga species Ettliaoleoabundans under nutrient stress Biotechnol Biofuels 2018 11 (1)1minus15(85) Matich E K Soria N G C Aga D S Atilla-Gokcumen GE Applications of metabolomics in assessing ecological effects ofemerging contaminants and pollutants on plants J Hazard Mater2019 373 527minus535(86) Knolhoff A M Premo J H Fisher C M A ProposedQuality Control Standard Mixture and Its Uses for EvaluatingNontargeted and Suspect Screening LCHR-MS Method Perform-ance Anal Chem 2021931596(87) Samanipour S Kaserzon S Vijayasarathy S Jiang H ChoiP Reid M J Mueller J F Thomas K V Machine learningcombined with non-targeted LC-HRMS analysis for a risk warningsystem of chemical hazards in drinking water A proof of conceptTalanta 2019 195 426minus432(88) Angeles L F Islam S Aldstadt J Saqeeb K N Alam MKhan M A Johura F-T Ahmed S I Aga D S Retrospectivesuspect screening reveals previously ignored antibiotics antifungal
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
L
compounds and metabolites in Bangladesh surface waters Sci TotalEnviron 2020 712 136285(89) Albergamo V Scholleacutee J E Schymanski E L Helmus RTimmer H Hollender J De Voogt P Nontarget screening revealstime trends of polar micropollutants in a riverbank filtration systemEnviron Sci Technol 2019 53 (13) 7584minus7594(90) Chiaia-Hernaacutendez A C Scheringer M Muller A StiegerG Waumlchter D Keller A Pintado-Herrera M G Lara-Martin PA Bucheli T D Hollender J Target and suspect screening analysisreveals persistent emerging organic contaminants in soils andsediments Sci Total Environ 2020 740 140181(91) Cariou R Meacutendez-Fernandez P Hutinet S b Guitton YCaurant F Le Bizec B Spitz J r m Vetter W Dervilly GNontargeted LCESI-HRMS Detection of Polyhalogenated Com-pounds in Marine Mammals Stranded on French Atlantic Coasts ACSESampT Water 20211309(92) Travis S C Kordas K Aga D S Optimized workflow forunknown screening using gas chromatography high-resolution massspectrometry expands identification of contaminants in siliconepersonal passive samplers Rapid Commun Mass Spectrom 2021 35(8) No e9048(93) Hollender J Schymanski E L Singer H P Ferguson P LNontarget screening with high resolution mass spectrometry in theenvironment ready to go ACS Publications 2017(94) Muller A Schulz W Ruck W K L Weber W H A newapproach to data evaluation in the non-target screening of organictrace substances in water analysis Chemosphere 2011 85 (8) 1211minus1219(95) Scholleacutee J E Schymanski E L Hollender J Statisticalapproaches for LC-HRMS data to characterize prioritize and identifytransformation products from water treatment processes In AssessingTransformation Products of Chemicals by Non-Target and SuspectScreening- Strategies and Workflows Vol 1 ACS Publications 2016 pp45minus65(96) Sireacuten K Fischer U Vestner J Automated supervisedlearning pipeline for non-targeted GC-MS data analysis AnalyticaChimica Acta X 2019 1 100005(97) Sinkov N A Harynuk J J Cluster resolution A metric forautomated objective and optimized feature selection in chemometricmodeling Talanta 2011 83 (4) 1079minus1087(98) Johnsen L G Amigo J M Skov T Bro R Automatedresolution of overlapping peaks in chromatographic data J Chemom2014 28 (2) 71minus82(99) Smirnov A Jia W Walker D I Jones D P Du X ADAP-GC 32 graphical software tool for efficient spectral deconvolution ofgas chromatography-high-resolution mass spectrometry metabolomicsdata J Proteome Res 2018 17 (1) 470minus478(100) Chen C Husny J Rabe S Predicting fishiness off-flavourand identifying compounds of lipid oxidation in dairy powders bySPME-GCMS and machine learning Int Dairy J 2018 77 19minus28(101) Taghadomi-Saberi S Mas Garcia S Allah Masoumi ASadeghi M Marco S Classification of bitter orange essential oilsaccording to fruit ripening stage by untargeted chemical profiling andmachine learning Sensors 2018 18 (6) 1922(102) Yang Q Xu L Tang L-J Yang J-T Wu B-Q Chen NJiang J-H Yu R-Q Simultaneous detection of multiple inheritedmetabolic diseases using GC-MS urinary metabolomics by chemo-metrics multi-class classification strategies Talanta 2018 186 489minus496(103) Smolinska A Hauschild A C Fijten R R R Dallinga JW Baumbach J Van Schooten F J Current breathomicsa reviewon data pre-processing techniques and machine learning inmetabolomics breath analysis J Breath Res 2014 8 (2) 027105(104) Garrido-Baserba M Corominas L Corteacutes U Rosso DPoch M The fourth-revolution in the water sector encounters thedigital revolution Environ Sci Technol 2020 54 (8) 4698minus4705(105) Vujnovic G Perisic J Bozilovic Z Milovanovic MGetman R V Radovanovic L USING SCADA SYSTEM FOR
PROCESS CONTROL IN WATER INDUSTRY Acta TechnicaCorviniensis-Bulletin of Engineering 2019 12 (2) 67minus72(106) Olsson G ICA and me-a subjective review Water Res 201246 (6) 1585minus1624(107) Zhao H Hou D Huang P Zhang G Water quality eventdetection in drinking water network Water Air Soil Pollut 2014 225(11) 2183(108) Chandola V Banerjee A Kumar V Anomaly detection Asurvey ACM computing surveys (CSUR) 2009 41 (3) 1minus58(109) Hasan J Goldbloom-Helzner D Ichida A Rouse TGibson M Technologies and Techniques for Early Warning Systems toMonitor and Evaluate Drinking Water Quality A State-of-the-ArtReview Environmental Protection Agency Washington Dc Office ofWater 2005(110) Izquierdo J Loacutepez P A Martiacutenez F J Peacuterez R Faultdetection in water supply systems using hybrid (theory and data-driven) modelling Mathematical and Computer Modelling 2007 46(3minus4) 341minus350(111) Murray S Ghazali M McBean E A Real-time water qualitymonitoring assessment of multisensor data using Bayesian beliefnetworks Journal of Water Resources Planning and Management 2012138 (1) 63minus70(112) Raciti M Cucurull J Nadjm-Tehrani S Anomaly detectionin water management systems In Critical Infrastructure ProtectionSpringer 2012 pp 98minus119(113) Ahmed M Mahmood A N Hu J A survey of networkanomaly detection techniques Journal of Network and ComputerApplications 2016 60 19minus31(114) Yuan Y Jia K A Water Quality Assessment Method Based onSparse Autoencoder 2015 IEEE 2015 pp 1minus4(115) Muharemi F Logofatu D Leon F Machine learningapproaches for anomaly detection of water quality on a real-worlddata set Journal of Information and Telecommunication 2019 3 (3)294minus307(116) Alom M Z Taha T M Yakopcic C Westberg S SidikeP Nasrin M S Hasan M Van Essen B C Awwal A A S AsariV K A state-of-the-art survey on deep learning theory andarchitectures Electronics 2019 8 (3) 292(117) Russo S Lurig M Hao W Matthews B Villez K Activelearning for anomaly detection in environmental data EnvironmentalModelling amp Software 2020 134 104869(118) Ahmad S Lavin A Purdy S Agha Z Unsupervised real-time anomaly detection for streaming data Neurocomputing 2017262 134minus147(119) Stocco A Tonella P Towards Anomaly Detectors that LearnContinuously IEEE 2020 pp 201minus208(120) Hoffman M Bach F R Blei D M In Online learning forlatent dirichlet allocation 2010 Citeseer pp 856minus864(121) Mongillo G Deneve S Online learning with hidden Markovmodels Neural computation 2008 20 (7) 1706minus1716(122) Bogler A Packman A Furman A Gross A Kushmaro ARonen A Dagot C Hill C Vaizel-Ohayon D Morgenroth ERethinking wastewater risks and monitoring in light of the COVID-19pandemic Nature Sustainability 2020 9 1minus10(123) Bivins A North D Ahmad A Ahmed W Alm E BeenF Bhattacharya P Bijlsma L Boehm A B Brown J Wastewater-Based Epidemiology Global Collaborative to Maximize Contributions inthe Fight Against COVID-19 ACS Publications 2020
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
M
research and chemical safety evaluations at the US EPA J ExposureSci Environ Epidemiol 2018 28 (5) 411minus426(9) Dogo E M Nwulu N I Twala B Aigbavboa C A survey ofmachine learning methods applied to anomaly detection on drinking-water quality data Urban Water J 2019 16 (3) 235minus248(10) Wilde F D Radtke D B Gibs J Iwatsubo R T Nationalfield manual for the collection of water-quality data US Geological SurveyTechniques of Water-Resources Investigations Book 9 1998(11) National Research Council Confronting the nationrsquos waterproblems The role of research National Academies Press 2004(12) Bharti R Grimm D G Current challenges and best-practiceprotocols for microbiome analysis Briefings in bioinformatics202122178(13) Veenaas C Developing tools for non-target analysis and digitalarchiving of organic urban water pollutants 2018(14) Keim D A Visual exploration of large data sets CommunACM 2001 44 (8) 38minus44(15) Chen J Chen Y Du X Li C Lu J Zhao S Zhou X Bigdata challenge a data management perspective Frontiers of ComputerScience 2013 7 (2) 157minus164(16) Chakrabarti S Cox E Frank E Guting R H Han J JiangX Kamber M Lightstone S S Nadeau T P Neapolitan R EData mining know it all Morgan Kaufmann 2008(17) Gibert K Sanchez-Marre M Izquierdo J A survey on pre-processing techniques Relevant issues in the context of environ-mental data mining AI Communications 2016 29 (6) 627minus663(18) Russell S Norvig P Artificial intelligence a modern approach2002(19) Hinton G E Sejnowski T J Poggio T A UnsupervisedLearning Foundations of Neural Computation MIT press 1999(20) Van Engelen J E Hoos H H A survey on semi-supervisedlearning Machine Learning 2020 109 (2) 373minus440(21) Brink H Richards J W Fetherolf M Cronin B Real-worldmachine learning Manning Shelter Island NY 2017(22) Kassambara A Practical guide to cluster analysis in RUnsupervised machine learning Sthda Vol 1(23) Ghodsi A Dimensionality reduction a short tutorialDepartment of Statistics and Actuarial Science Univ of WaterlooOntario Canada 2006 37 (38) 2006(24) Wright R E Logistic regression 1995(25) Alvarez-Arbesu R Feliciacutesimo A M GIS and logisticregression as tools for environmental management a coastal cliffvegetation model in Northern Spain WIT Transactions on Informationand Communication Technologies 2002 26(26) Liu L Sankarasubramanian A Ranjithan S R Logisticregression analysis to estimate contaminant sources in waterdistribution systems J Hydroinf 2011 13 (3) 545minus557(27) Thoe W Gold M Griesbach A Grimmer M Taggart ML Boehm A B Predicting water quality at Santa Monica Beachevaluation of five different models for public notification of unsafeswimming conditions Water Res 2014 67 105minus117(28) Muharemi F Logofa tu D Andersson C Leon FApproaches to building a detection model for water quality a casestudy In Modern Approaches for Intelligent Information and DatabaseSystems Springer 2018 pp 173minus183(29) Liaw A Wiener M Classification and regression byrandomForest R news 2002 2 (3) 18minus22(30) Gromski P S Muhamadali H Ellis D I Xu Y Correa ETurner M L Goodacre R A tutorial review Metabolomics andpartial least squares-discriminant analysis-a marriage of convenienceor a shotgun wedding Anal Chim Acta 2015 879 10minus23(31) Lee S Kim J Prediction of Nanofiltration and Reverse-Osmosis-Membrane Rejection of Organic Compounds UsingRandom Forest Model J Environ Eng 2020 146 (11) 04020127(32) Castrillo M Garciacutea A L Estimation of high frequencynutrient concentrations from water quality surrogates using machinelearning methods Water Res 2020 172 115490
(33) Tan G Yan J Gao C Yang S Prediction of water qualitytime series data based on least squares support vector machineProcedia Eng 2012 31 1194minus1199(34) Yang Y H Guergachi A Khan G Support vector machinesfor environmental informatics application to modelling the nitrogenremoval processes in wastewater treatment systems Journal ofEnvironmental Informatics 2006 7 (1) 14minus25(35) Feng J Pan L Cui B Sun Y Zhang A Gong M NewApproach for Concentration Prediction of Aqueous PhenolicContaminants by Using Wavelet Analysis and Support VectorMachine Environmental Engineering Science 2020 37 (5) 382minus392(36) Silver D Schrittwieser J Simonyan K Antonoglou IHuang A Guez A Hubert T Baker L Lai M Bolton AMastering the game of go without human knowledge Nature 2017550 (7676) 354minus359(37) Mikolov T Deoras A Povey D Burget L Cernocky JStrategies for Training Large Scale Neural Network Language ModelsIEEE 2011 pp 196minus201(38) Deng J Dong W Socher R Li L-J Li K Fei-Fei L InImagenet A Large-Scale Hierarchical Image Database Ieee 2009 pp248minus255(39) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (Aug) 2493minus2537(40) Alwosheel A van Cranenburgh S Chorus C G Is yourdataset big enough Sample size requirements when using artificialneural networks for discrete choice analysis Journal of choice modelling2018 28 167minus182(41) Yang J Xu J Zhang X Wu C Lin T Ying Y Deeplearning for vibrational spectral analysis Recent progress and apractical guide Anal Chim Acta 2019 1081 6minus17(42) Guo H Jeong K Lim J Jo J Kim Y M Park J-p Kim JH Cho K H Prediction of effluent concentration in a wastewatertreatment plant using machine learning models J Environ Sci 201532 90minus101(43) Inoue J Yamagata Y Chen Y Poskitt C M Sun JAnomaly Detection for a Water Treatment System Using UnsupervisedMachine Learning 2017 2017 IEEE 2017 pp 1058minus1065(44) Voukantsis D Karatzas K Kukkonen J Raumlsaumlnen TKarppinen A Kolehmainen M Intercomparison of air quality datausing principal component analysis and forecasting of PM10 andPM2 5 concentrations using artificial neural networks in Thessalonikiand Helsinki Sci Total Environ 2011 409 (7) 1266minus1276(45) Wold S Esbensen K Geladi P Principal componentanalysis Chemom Intell Lab Syst 1987 2 (1minus3) 37minus52(46) Jolliffe I T Cadima J Principal component analysis a reviewand recent developments Philos Trans R Soc A 2016 374 (2065)20150202(47) Dalal S G Shirodkar P V Jagtap T G Naik B G Rao GS Evaluation of significant sources influencing the variation of waterquality of Kandla creek Gulf of Katchchh using PCA Environ MonitAssess 2010 163 (1minus4) 49minus56(48) Scholleacutee J E Schymanski E L Avak S E Loos MHollender J Prioritizing unknown transformation products frombiologically-treated wastewater using high-resolution mass spectrom-etry multivariate statistics and metabolic logic Anal Chem 2015 87(24) 12121minus12129(49) Masiaacute A Campo J Blasco C Picoacute Y Ultra-highperformance liquid chromatography-quadrupole time-of-flight massspectrometry to identify contaminants in water an insight onenvironmental forensics Journal of Chromatography A 2014 134586minus97(50) Peiris R H Halleacute C Budman H Moresoli C Peldszus SHuck P M Legge R L Identifying fouling events in a membrane-based drinking water treatment process using principal componentanalysis of fluorescence excitation-emission matrices Water Res 201044 (1) 185minus194(51) MacQueen J In Some methods for classification and analysis ofmultivariate observations 1967 Oakland CA 1967 pp 281minus297
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
K
(52) Kingsy G R Manimegalai R Geetha D M S Rajathi SUsha K Raabiathul B N In Air pollution analysis using enhanced K-Means clustering algorithm for real time sensor data IEEE 2016 pp1945minus1949(53) Zou H Zou Z Wang X An enhanced K-means algorithmfor water quality analysis of the Haihe River in China Int J EnvironRes Public Health 2015 12 (11) 14400minus14413(54) Areerachakul S Sanguansintukul S Clustering analysis of waterquality for canals in Bangkok Thailand Springer 2010 pp 215minus227(55) Javadi S Hashemy S M Mohammadi K Howard K W FNeshat A Classification of aquifer vulnerability using K-means clusteranalysis J Hydrol 2017 549 27minus37(56) Li C Sun L Jia J Cai Y Wang X Risk assessment ofwater pollution sources based on an integrated k-means clustering andset pair analysis method in the region of Shiyan China Sci TotalEnviron 2016 557 307minus316(57) Brunner A M Bertelkamp C Dingemans M M LKolkman A Wols B Harmsen D Siegers W Martijn B JOorthuizen W A Ter Laak T L Integration of target analyses non-target screening and effect-based monitoring to assess OMP relatedwater quality changes in drinking water treatment Sci Total Environ2020 705 135779(58) Gupta S Arango-Argoty G Zhang L Pruden A VikeslandP Identification of discriminatory antibiotic resistance genes amongenvironmental resistomes using extremely randomized tree algorithmMicrobiome 2019 7 (1) 123(59) Du X Shao F Wu S Zhang H Xu S Water qualityassessment with hierarchical cluster analysis based on Mahalanobisdistance Environ Monit Assess 2017 189 (7) 1minus12(60) Bibby K Crank K Greaves J Li X Wu Z Hamza I AStachler E Metagenomics and the development of viral water qualitytools npj Clean Water 2019 2 (1) 1minus13(61) Chu B T T Petrovich M L Chaudhary A Wright DMurphy B Wells G Poretsky R Metagenomics reveals the impactof wastewater treatment plants on the dispersal of microorganismsand genes in aquatic sediments Appl Environ Microbiol 2018 84 (5)No e02168-17(62) Petrovich M L Zilberman A Kaplan A Eliraz G R WangY Langenfeld K Duhaime M Wigginton K Poretsky R AvisarD Microbial and Viral Communities and Their Antibiotic ResistanceGenes Throughout a Hospital Wastewater Treatment System FrontMicrobiol 2020 11 153(63) Dai D Rhoads W J Edwards M A Pruden A Shotgunmetagenomics reveals taxonomic and functional shifts in hot watermicrobiome due to temperature setting and stagnation FrontMicrobiol 2018 9 2695(64) Malla M A Dubey A Yadav S Kumar A Hashem AAbd_Allah E F Understanding and designing the strategies for themicrobe-mediated remediation of environmental contaminants usingomics approaches Front Microbiol 2018 9 1132(65) Giwa A S Ali N Athar M A Wang K Dissectingmicrobial community structure in sewage treatment plant forpathogensrsquo detection using metagenomic sequencing technologyArch Microbiol 2020 202 1minus9(66) Guo J Ni B-J Han X Chen X Bond P Peng Y YuanZ Data on metagenomic profiles of activated sludge from a full-scalewastewater treatment plant Data in brief 2017 15 833minus839(67) Laudadio I Fulci V Stronati L Carissimi C Next-generation metagenomics Methodological challenges and opportu-nities Omics a journal of integrative biology 2019 23 (7) 327minus333(68) Soueidan H Nikolski M Machine learning for metagenomicsmethods and tools arXiv preprint arXiv151006621 2015DOI 101515metgen-2016-0001(69) Ramette A Multivariate analyses in microbial ecology FEMSMicrobiol Ecol 2007 62 (2) 142minus160(70) Yang P Yu S Cheng L Ning K Meta-network optimizedspecies-species network analysis for microbial communities BMCGenomics 2019 20 (2) 187
(71) Collignon P Beggs J J Walsh T R Gandra SLaxminarayan R Anthropological and socioeconomic factorscontributing to global antimicrobial resistance a univariate andmultivariable analysis Lancet Planetary Health 2018 2 (9) No e398-e405(72) Arango-Argoty G Garner E Pruden A Heath L SVikesland P Zhang L DeepARG a deep learning approach forpredicting antibiotic resistance genes from metagenomic dataMicrobiome 2018 6 (1) 1minus15(73) El-Gebali S Mistry J Bateman A Eddy S R Luciani APotter S C Qureshi M Richardson L J Salazar G A Smart AThe Pfam protein families database in 2019 Nucleic Acids Res 201947 (D1) D427minusD432(74) Berglund F Marathe N P Osterlund T Bengtsson-PalmeJ Kotsakis S Flach C-F Larsson D G J Kristiansson EIdentification of 76 novel B1 metallo-β-lactamases through large-scalescreening of genomic and metagenomic data Microbiome 2017 5 (1)134(75) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (ARTICLE) 2493minus2537(76) Ng P dna2vec Consistent vector representations of variable-length k-mers arXiv preprint arXiv170106279 2017(77) Woloszynek S Zhao Z Chen J Rosen G L 16S rRNAsequence embeddings Meaningful numeric feature representations ofnucleotide sequences that are convenient for downstream analysesPLoS Comput Biol 2019 15 (2) No e1006721(78) Arango-Argoty G A Heath L S Pruden A Vikesland PZhang L MetaMLP A fast word embedding based classifier to profiletarget gene databases in metagenomic samples bioRxiv 2019 569970(79) Liang Q Bible P W Liu Y Zou B Wei L DeepMicrobestaxonomic classification for metagenomics with deep learning NARGenomics and Bioinformatics 2020 2 (1) No lqaa009(80) Fiannaca A La Paglia L La Rosa M Renda G Rizzo RGaglio S Urso A Deep learning models for bacteria taxonomicclassification of metagenomic data BMC Bioinf 2018 19 (7) 198(81) Krawczyk P S Lipinski L Dziembowski A PlasFlowpredicting plasmid sequences in metagenomic data using genomesignatures Nucleic Acids Res 2018 46 (6) No e35-e35(82) Hollender J van Bavel B Dulio V Farmen E FurtmannK Koschorreck J Kunkel U Krauss M Munthe J Schlabach MHigh resolution mass spectrometry-based non-target screening cansupport regulatory environmental monitoring and chemicals manage-ment Environ Sci Eur 2019 31 (1) 42(83) Soria N G C Montes A Bisson M A Atilla-Gokcumen GE Aga D S Mass spectrometry-based metabolomics to assess uptakeof silver nanoparticles by Arabidopsis thaliana Environ Sci Nano2017 4 (10) 1944minus1953(84) Matich E K Ghafari M Camgoz E Caliskan E Pfeifer BA Haznedaroglu B Z Atilla-Gokcumen G E Time-serieslipidomic analysis of the oleaginous green microalga species Ettliaoleoabundans under nutrient stress Biotechnol Biofuels 2018 11 (1)1minus15(85) Matich E K Soria N G C Aga D S Atilla-Gokcumen GE Applications of metabolomics in assessing ecological effects ofemerging contaminants and pollutants on plants J Hazard Mater2019 373 527minus535(86) Knolhoff A M Premo J H Fisher C M A ProposedQuality Control Standard Mixture and Its Uses for EvaluatingNontargeted and Suspect Screening LCHR-MS Method Perform-ance Anal Chem 2021931596(87) Samanipour S Kaserzon S Vijayasarathy S Jiang H ChoiP Reid M J Mueller J F Thomas K V Machine learningcombined with non-targeted LC-HRMS analysis for a risk warningsystem of chemical hazards in drinking water A proof of conceptTalanta 2019 195 426minus432(88) Angeles L F Islam S Aldstadt J Saqeeb K N Alam MKhan M A Johura F-T Ahmed S I Aga D S Retrospectivesuspect screening reveals previously ignored antibiotics antifungal
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
L
compounds and metabolites in Bangladesh surface waters Sci TotalEnviron 2020 712 136285(89) Albergamo V Scholleacutee J E Schymanski E L Helmus RTimmer H Hollender J De Voogt P Nontarget screening revealstime trends of polar micropollutants in a riverbank filtration systemEnviron Sci Technol 2019 53 (13) 7584minus7594(90) Chiaia-Hernaacutendez A C Scheringer M Muller A StiegerG Waumlchter D Keller A Pintado-Herrera M G Lara-Martin PA Bucheli T D Hollender J Target and suspect screening analysisreveals persistent emerging organic contaminants in soils andsediments Sci Total Environ 2020 740 140181(91) Cariou R Meacutendez-Fernandez P Hutinet S b Guitton YCaurant F Le Bizec B Spitz J r m Vetter W Dervilly GNontargeted LCESI-HRMS Detection of Polyhalogenated Com-pounds in Marine Mammals Stranded on French Atlantic Coasts ACSESampT Water 20211309(92) Travis S C Kordas K Aga D S Optimized workflow forunknown screening using gas chromatography high-resolution massspectrometry expands identification of contaminants in siliconepersonal passive samplers Rapid Commun Mass Spectrom 2021 35(8) No e9048(93) Hollender J Schymanski E L Singer H P Ferguson P LNontarget screening with high resolution mass spectrometry in theenvironment ready to go ACS Publications 2017(94) Muller A Schulz W Ruck W K L Weber W H A newapproach to data evaluation in the non-target screening of organictrace substances in water analysis Chemosphere 2011 85 (8) 1211minus1219(95) Scholleacutee J E Schymanski E L Hollender J Statisticalapproaches for LC-HRMS data to characterize prioritize and identifytransformation products from water treatment processes In AssessingTransformation Products of Chemicals by Non-Target and SuspectScreening- Strategies and Workflows Vol 1 ACS Publications 2016 pp45minus65(96) Sireacuten K Fischer U Vestner J Automated supervisedlearning pipeline for non-targeted GC-MS data analysis AnalyticaChimica Acta X 2019 1 100005(97) Sinkov N A Harynuk J J Cluster resolution A metric forautomated objective and optimized feature selection in chemometricmodeling Talanta 2011 83 (4) 1079minus1087(98) Johnsen L G Amigo J M Skov T Bro R Automatedresolution of overlapping peaks in chromatographic data J Chemom2014 28 (2) 71minus82(99) Smirnov A Jia W Walker D I Jones D P Du X ADAP-GC 32 graphical software tool for efficient spectral deconvolution ofgas chromatography-high-resolution mass spectrometry metabolomicsdata J Proteome Res 2018 17 (1) 470minus478(100) Chen C Husny J Rabe S Predicting fishiness off-flavourand identifying compounds of lipid oxidation in dairy powders bySPME-GCMS and machine learning Int Dairy J 2018 77 19minus28(101) Taghadomi-Saberi S Mas Garcia S Allah Masoumi ASadeghi M Marco S Classification of bitter orange essential oilsaccording to fruit ripening stage by untargeted chemical profiling andmachine learning Sensors 2018 18 (6) 1922(102) Yang Q Xu L Tang L-J Yang J-T Wu B-Q Chen NJiang J-H Yu R-Q Simultaneous detection of multiple inheritedmetabolic diseases using GC-MS urinary metabolomics by chemo-metrics multi-class classification strategies Talanta 2018 186 489minus496(103) Smolinska A Hauschild A C Fijten R R R Dallinga JW Baumbach J Van Schooten F J Current breathomicsa reviewon data pre-processing techniques and machine learning inmetabolomics breath analysis J Breath Res 2014 8 (2) 027105(104) Garrido-Baserba M Corominas L Corteacutes U Rosso DPoch M The fourth-revolution in the water sector encounters thedigital revolution Environ Sci Technol 2020 54 (8) 4698minus4705(105) Vujnovic G Perisic J Bozilovic Z Milovanovic MGetman R V Radovanovic L USING SCADA SYSTEM FOR
PROCESS CONTROL IN WATER INDUSTRY Acta TechnicaCorviniensis-Bulletin of Engineering 2019 12 (2) 67minus72(106) Olsson G ICA and me-a subjective review Water Res 201246 (6) 1585minus1624(107) Zhao H Hou D Huang P Zhang G Water quality eventdetection in drinking water network Water Air Soil Pollut 2014 225(11) 2183(108) Chandola V Banerjee A Kumar V Anomaly detection Asurvey ACM computing surveys (CSUR) 2009 41 (3) 1minus58(109) Hasan J Goldbloom-Helzner D Ichida A Rouse TGibson M Technologies and Techniques for Early Warning Systems toMonitor and Evaluate Drinking Water Quality A State-of-the-ArtReview Environmental Protection Agency Washington Dc Office ofWater 2005(110) Izquierdo J Loacutepez P A Martiacutenez F J Peacuterez R Faultdetection in water supply systems using hybrid (theory and data-driven) modelling Mathematical and Computer Modelling 2007 46(3minus4) 341minus350(111) Murray S Ghazali M McBean E A Real-time water qualitymonitoring assessment of multisensor data using Bayesian beliefnetworks Journal of Water Resources Planning and Management 2012138 (1) 63minus70(112) Raciti M Cucurull J Nadjm-Tehrani S Anomaly detectionin water management systems In Critical Infrastructure ProtectionSpringer 2012 pp 98minus119(113) Ahmed M Mahmood A N Hu J A survey of networkanomaly detection techniques Journal of Network and ComputerApplications 2016 60 19minus31(114) Yuan Y Jia K A Water Quality Assessment Method Based onSparse Autoencoder 2015 IEEE 2015 pp 1minus4(115) Muharemi F Logofatu D Leon F Machine learningapproaches for anomaly detection of water quality on a real-worlddata set Journal of Information and Telecommunication 2019 3 (3)294minus307(116) Alom M Z Taha T M Yakopcic C Westberg S SidikeP Nasrin M S Hasan M Van Essen B C Awwal A A S AsariV K A state-of-the-art survey on deep learning theory andarchitectures Electronics 2019 8 (3) 292(117) Russo S Lurig M Hao W Matthews B Villez K Activelearning for anomaly detection in environmental data EnvironmentalModelling amp Software 2020 134 104869(118) Ahmad S Lavin A Purdy S Agha Z Unsupervised real-time anomaly detection for streaming data Neurocomputing 2017262 134minus147(119) Stocco A Tonella P Towards Anomaly Detectors that LearnContinuously IEEE 2020 pp 201minus208(120) Hoffman M Bach F R Blei D M In Online learning forlatent dirichlet allocation 2010 Citeseer pp 856minus864(121) Mongillo G Deneve S Online learning with hidden Markovmodels Neural computation 2008 20 (7) 1706minus1716(122) Bogler A Packman A Furman A Gross A Kushmaro ARonen A Dagot C Hill C Vaizel-Ohayon D Morgenroth ERethinking wastewater risks and monitoring in light of the COVID-19pandemic Nature Sustainability 2020 9 1minus10(123) Bivins A North D Ahmad A Ahmed W Alm E BeenF Bhattacharya P Bijlsma L Boehm A B Brown J Wastewater-Based Epidemiology Global Collaborative to Maximize Contributions inthe Fight Against COVID-19 ACS Publications 2020
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
M
(52) Kingsy G R Manimegalai R Geetha D M S Rajathi SUsha K Raabiathul B N In Air pollution analysis using enhanced K-Means clustering algorithm for real time sensor data IEEE 2016 pp1945minus1949(53) Zou H Zou Z Wang X An enhanced K-means algorithmfor water quality analysis of the Haihe River in China Int J EnvironRes Public Health 2015 12 (11) 14400minus14413(54) Areerachakul S Sanguansintukul S Clustering analysis of waterquality for canals in Bangkok Thailand Springer 2010 pp 215minus227(55) Javadi S Hashemy S M Mohammadi K Howard K W FNeshat A Classification of aquifer vulnerability using K-means clusteranalysis J Hydrol 2017 549 27minus37(56) Li C Sun L Jia J Cai Y Wang X Risk assessment ofwater pollution sources based on an integrated k-means clustering andset pair analysis method in the region of Shiyan China Sci TotalEnviron 2016 557 307minus316(57) Brunner A M Bertelkamp C Dingemans M M LKolkman A Wols B Harmsen D Siegers W Martijn B JOorthuizen W A Ter Laak T L Integration of target analyses non-target screening and effect-based monitoring to assess OMP relatedwater quality changes in drinking water treatment Sci Total Environ2020 705 135779(58) Gupta S Arango-Argoty G Zhang L Pruden A VikeslandP Identification of discriminatory antibiotic resistance genes amongenvironmental resistomes using extremely randomized tree algorithmMicrobiome 2019 7 (1) 123(59) Du X Shao F Wu S Zhang H Xu S Water qualityassessment with hierarchical cluster analysis based on Mahalanobisdistance Environ Monit Assess 2017 189 (7) 1minus12(60) Bibby K Crank K Greaves J Li X Wu Z Hamza I AStachler E Metagenomics and the development of viral water qualitytools npj Clean Water 2019 2 (1) 1minus13(61) Chu B T T Petrovich M L Chaudhary A Wright DMurphy B Wells G Poretsky R Metagenomics reveals the impactof wastewater treatment plants on the dispersal of microorganismsand genes in aquatic sediments Appl Environ Microbiol 2018 84 (5)No e02168-17(62) Petrovich M L Zilberman A Kaplan A Eliraz G R WangY Langenfeld K Duhaime M Wigginton K Poretsky R AvisarD Microbial and Viral Communities and Their Antibiotic ResistanceGenes Throughout a Hospital Wastewater Treatment System FrontMicrobiol 2020 11 153(63) Dai D Rhoads W J Edwards M A Pruden A Shotgunmetagenomics reveals taxonomic and functional shifts in hot watermicrobiome due to temperature setting and stagnation FrontMicrobiol 2018 9 2695(64) Malla M A Dubey A Yadav S Kumar A Hashem AAbd_Allah E F Understanding and designing the strategies for themicrobe-mediated remediation of environmental contaminants usingomics approaches Front Microbiol 2018 9 1132(65) Giwa A S Ali N Athar M A Wang K Dissectingmicrobial community structure in sewage treatment plant forpathogensrsquo detection using metagenomic sequencing technologyArch Microbiol 2020 202 1minus9(66) Guo J Ni B-J Han X Chen X Bond P Peng Y YuanZ Data on metagenomic profiles of activated sludge from a full-scalewastewater treatment plant Data in brief 2017 15 833minus839(67) Laudadio I Fulci V Stronati L Carissimi C Next-generation metagenomics Methodological challenges and opportu-nities Omics a journal of integrative biology 2019 23 (7) 327minus333(68) Soueidan H Nikolski M Machine learning for metagenomicsmethods and tools arXiv preprint arXiv151006621 2015DOI 101515metgen-2016-0001(69) Ramette A Multivariate analyses in microbial ecology FEMSMicrobiol Ecol 2007 62 (2) 142minus160(70) Yang P Yu S Cheng L Ning K Meta-network optimizedspecies-species network analysis for microbial communities BMCGenomics 2019 20 (2) 187
(71) Collignon P Beggs J J Walsh T R Gandra SLaxminarayan R Anthropological and socioeconomic factorscontributing to global antimicrobial resistance a univariate andmultivariable analysis Lancet Planetary Health 2018 2 (9) No e398-e405(72) Arango-Argoty G Garner E Pruden A Heath L SVikesland P Zhang L DeepARG a deep learning approach forpredicting antibiotic resistance genes from metagenomic dataMicrobiome 2018 6 (1) 1minus15(73) El-Gebali S Mistry J Bateman A Eddy S R Luciani APotter S C Qureshi M Richardson L J Salazar G A Smart AThe Pfam protein families database in 2019 Nucleic Acids Res 201947 (D1) D427minusD432(74) Berglund F Marathe N P Osterlund T Bengtsson-PalmeJ Kotsakis S Flach C-F Larsson D G J Kristiansson EIdentification of 76 novel B1 metallo-β-lactamases through large-scalescreening of genomic and metagenomic data Microbiome 2017 5 (1)134(75) Collobert R Weston J Bottou L Karlen M KavukcuogluK Kuksa P Natural language processing (almost) from scratch JMach Learn Res 2011 12 (ARTICLE) 2493minus2537(76) Ng P dna2vec Consistent vector representations of variable-length k-mers arXiv preprint arXiv170106279 2017(77) Woloszynek S Zhao Z Chen J Rosen G L 16S rRNAsequence embeddings Meaningful numeric feature representations ofnucleotide sequences that are convenient for downstream analysesPLoS Comput Biol 2019 15 (2) No e1006721(78) Arango-Argoty G A Heath L S Pruden A Vikesland PZhang L MetaMLP A fast word embedding based classifier to profiletarget gene databases in metagenomic samples bioRxiv 2019 569970(79) Liang Q Bible P W Liu Y Zou B Wei L DeepMicrobestaxonomic classification for metagenomics with deep learning NARGenomics and Bioinformatics 2020 2 (1) No lqaa009(80) Fiannaca A La Paglia L La Rosa M Renda G Rizzo RGaglio S Urso A Deep learning models for bacteria taxonomicclassification of metagenomic data BMC Bioinf 2018 19 (7) 198(81) Krawczyk P S Lipinski L Dziembowski A PlasFlowpredicting plasmid sequences in metagenomic data using genomesignatures Nucleic Acids Res 2018 46 (6) No e35-e35(82) Hollender J van Bavel B Dulio V Farmen E FurtmannK Koschorreck J Kunkel U Krauss M Munthe J Schlabach MHigh resolution mass spectrometry-based non-target screening cansupport regulatory environmental monitoring and chemicals manage-ment Environ Sci Eur 2019 31 (1) 42(83) Soria N G C Montes A Bisson M A Atilla-Gokcumen GE Aga D S Mass spectrometry-based metabolomics to assess uptakeof silver nanoparticles by Arabidopsis thaliana Environ Sci Nano2017 4 (10) 1944minus1953(84) Matich E K Ghafari M Camgoz E Caliskan E Pfeifer BA Haznedaroglu B Z Atilla-Gokcumen G E Time-serieslipidomic analysis of the oleaginous green microalga species Ettliaoleoabundans under nutrient stress Biotechnol Biofuels 2018 11 (1)1minus15(85) Matich E K Soria N G C Aga D S Atilla-Gokcumen GE Applications of metabolomics in assessing ecological effects ofemerging contaminants and pollutants on plants J Hazard Mater2019 373 527minus535(86) Knolhoff A M Premo J H Fisher C M A ProposedQuality Control Standard Mixture and Its Uses for EvaluatingNontargeted and Suspect Screening LCHR-MS Method Perform-ance Anal Chem 2021931596(87) Samanipour S Kaserzon S Vijayasarathy S Jiang H ChoiP Reid M J Mueller J F Thomas K V Machine learningcombined with non-targeted LC-HRMS analysis for a risk warningsystem of chemical hazards in drinking water A proof of conceptTalanta 2019 195 426minus432(88) Angeles L F Islam S Aldstadt J Saqeeb K N Alam MKhan M A Johura F-T Ahmed S I Aga D S Retrospectivesuspect screening reveals previously ignored antibiotics antifungal
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
L
compounds and metabolites in Bangladesh surface waters Sci TotalEnviron 2020 712 136285(89) Albergamo V Scholleacutee J E Schymanski E L Helmus RTimmer H Hollender J De Voogt P Nontarget screening revealstime trends of polar micropollutants in a riverbank filtration systemEnviron Sci Technol 2019 53 (13) 7584minus7594(90) Chiaia-Hernaacutendez A C Scheringer M Muller A StiegerG Waumlchter D Keller A Pintado-Herrera M G Lara-Martin PA Bucheli T D Hollender J Target and suspect screening analysisreveals persistent emerging organic contaminants in soils andsediments Sci Total Environ 2020 740 140181(91) Cariou R Meacutendez-Fernandez P Hutinet S b Guitton YCaurant F Le Bizec B Spitz J r m Vetter W Dervilly GNontargeted LCESI-HRMS Detection of Polyhalogenated Com-pounds in Marine Mammals Stranded on French Atlantic Coasts ACSESampT Water 20211309(92) Travis S C Kordas K Aga D S Optimized workflow forunknown screening using gas chromatography high-resolution massspectrometry expands identification of contaminants in siliconepersonal passive samplers Rapid Commun Mass Spectrom 2021 35(8) No e9048(93) Hollender J Schymanski E L Singer H P Ferguson P LNontarget screening with high resolution mass spectrometry in theenvironment ready to go ACS Publications 2017(94) Muller A Schulz W Ruck W K L Weber W H A newapproach to data evaluation in the non-target screening of organictrace substances in water analysis Chemosphere 2011 85 (8) 1211minus1219(95) Scholleacutee J E Schymanski E L Hollender J Statisticalapproaches for LC-HRMS data to characterize prioritize and identifytransformation products from water treatment processes In AssessingTransformation Products of Chemicals by Non-Target and SuspectScreening- Strategies and Workflows Vol 1 ACS Publications 2016 pp45minus65(96) Sireacuten K Fischer U Vestner J Automated supervisedlearning pipeline for non-targeted GC-MS data analysis AnalyticaChimica Acta X 2019 1 100005(97) Sinkov N A Harynuk J J Cluster resolution A metric forautomated objective and optimized feature selection in chemometricmodeling Talanta 2011 83 (4) 1079minus1087(98) Johnsen L G Amigo J M Skov T Bro R Automatedresolution of overlapping peaks in chromatographic data J Chemom2014 28 (2) 71minus82(99) Smirnov A Jia W Walker D I Jones D P Du X ADAP-GC 32 graphical software tool for efficient spectral deconvolution ofgas chromatography-high-resolution mass spectrometry metabolomicsdata J Proteome Res 2018 17 (1) 470minus478(100) Chen C Husny J Rabe S Predicting fishiness off-flavourand identifying compounds of lipid oxidation in dairy powders bySPME-GCMS and machine learning Int Dairy J 2018 77 19minus28(101) Taghadomi-Saberi S Mas Garcia S Allah Masoumi ASadeghi M Marco S Classification of bitter orange essential oilsaccording to fruit ripening stage by untargeted chemical profiling andmachine learning Sensors 2018 18 (6) 1922(102) Yang Q Xu L Tang L-J Yang J-T Wu B-Q Chen NJiang J-H Yu R-Q Simultaneous detection of multiple inheritedmetabolic diseases using GC-MS urinary metabolomics by chemo-metrics multi-class classification strategies Talanta 2018 186 489minus496(103) Smolinska A Hauschild A C Fijten R R R Dallinga JW Baumbach J Van Schooten F J Current breathomicsa reviewon data pre-processing techniques and machine learning inmetabolomics breath analysis J Breath Res 2014 8 (2) 027105(104) Garrido-Baserba M Corominas L Corteacutes U Rosso DPoch M The fourth-revolution in the water sector encounters thedigital revolution Environ Sci Technol 2020 54 (8) 4698minus4705(105) Vujnovic G Perisic J Bozilovic Z Milovanovic MGetman R V Radovanovic L USING SCADA SYSTEM FOR
PROCESS CONTROL IN WATER INDUSTRY Acta TechnicaCorviniensis-Bulletin of Engineering 2019 12 (2) 67minus72(106) Olsson G ICA and me-a subjective review Water Res 201246 (6) 1585minus1624(107) Zhao H Hou D Huang P Zhang G Water quality eventdetection in drinking water network Water Air Soil Pollut 2014 225(11) 2183(108) Chandola V Banerjee A Kumar V Anomaly detection Asurvey ACM computing surveys (CSUR) 2009 41 (3) 1minus58(109) Hasan J Goldbloom-Helzner D Ichida A Rouse TGibson M Technologies and Techniques for Early Warning Systems toMonitor and Evaluate Drinking Water Quality A State-of-the-ArtReview Environmental Protection Agency Washington Dc Office ofWater 2005(110) Izquierdo J Loacutepez P A Martiacutenez F J Peacuterez R Faultdetection in water supply systems using hybrid (theory and data-driven) modelling Mathematical and Computer Modelling 2007 46(3minus4) 341minus350(111) Murray S Ghazali M McBean E A Real-time water qualitymonitoring assessment of multisensor data using Bayesian beliefnetworks Journal of Water Resources Planning and Management 2012138 (1) 63minus70(112) Raciti M Cucurull J Nadjm-Tehrani S Anomaly detectionin water management systems In Critical Infrastructure ProtectionSpringer 2012 pp 98minus119(113) Ahmed M Mahmood A N Hu J A survey of networkanomaly detection techniques Journal of Network and ComputerApplications 2016 60 19minus31(114) Yuan Y Jia K A Water Quality Assessment Method Based onSparse Autoencoder 2015 IEEE 2015 pp 1minus4(115) Muharemi F Logofatu D Leon F Machine learningapproaches for anomaly detection of water quality on a real-worlddata set Journal of Information and Telecommunication 2019 3 (3)294minus307(116) Alom M Z Taha T M Yakopcic C Westberg S SidikeP Nasrin M S Hasan M Van Essen B C Awwal A A S AsariV K A state-of-the-art survey on deep learning theory andarchitectures Electronics 2019 8 (3) 292(117) Russo S Lurig M Hao W Matthews B Villez K Activelearning for anomaly detection in environmental data EnvironmentalModelling amp Software 2020 134 104869(118) Ahmad S Lavin A Purdy S Agha Z Unsupervised real-time anomaly detection for streaming data Neurocomputing 2017262 134minus147(119) Stocco A Tonella P Towards Anomaly Detectors that LearnContinuously IEEE 2020 pp 201minus208(120) Hoffman M Bach F R Blei D M In Online learning forlatent dirichlet allocation 2010 Citeseer pp 856minus864(121) Mongillo G Deneve S Online learning with hidden Markovmodels Neural computation 2008 20 (7) 1706minus1716(122) Bogler A Packman A Furman A Gross A Kushmaro ARonen A Dagot C Hill C Vaizel-Ohayon D Morgenroth ERethinking wastewater risks and monitoring in light of the COVID-19pandemic Nature Sustainability 2020 9 1minus10(123) Bivins A North D Ahmad A Ahmed W Alm E BeenF Bhattacharya P Bijlsma L Boehm A B Brown J Wastewater-Based Epidemiology Global Collaborative to Maximize Contributions inthe Fight Against COVID-19 ACS Publications 2020
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
M
compounds and metabolites in Bangladesh surface waters Sci TotalEnviron 2020 712 136285(89) Albergamo V Scholleacutee J E Schymanski E L Helmus RTimmer H Hollender J De Voogt P Nontarget screening revealstime trends of polar micropollutants in a riverbank filtration systemEnviron Sci Technol 2019 53 (13) 7584minus7594(90) Chiaia-Hernaacutendez A C Scheringer M Muller A StiegerG Waumlchter D Keller A Pintado-Herrera M G Lara-Martin PA Bucheli T D Hollender J Target and suspect screening analysisreveals persistent emerging organic contaminants in soils andsediments Sci Total Environ 2020 740 140181(91) Cariou R Meacutendez-Fernandez P Hutinet S b Guitton YCaurant F Le Bizec B Spitz J r m Vetter W Dervilly GNontargeted LCESI-HRMS Detection of Polyhalogenated Com-pounds in Marine Mammals Stranded on French Atlantic Coasts ACSESampT Water 20211309(92) Travis S C Kordas K Aga D S Optimized workflow forunknown screening using gas chromatography high-resolution massspectrometry expands identification of contaminants in siliconepersonal passive samplers Rapid Commun Mass Spectrom 2021 35(8) No e9048(93) Hollender J Schymanski E L Singer H P Ferguson P LNontarget screening with high resolution mass spectrometry in theenvironment ready to go ACS Publications 2017(94) Muller A Schulz W Ruck W K L Weber W H A newapproach to data evaluation in the non-target screening of organictrace substances in water analysis Chemosphere 2011 85 (8) 1211minus1219(95) Scholleacutee J E Schymanski E L Hollender J Statisticalapproaches for LC-HRMS data to characterize prioritize and identifytransformation products from water treatment processes In AssessingTransformation Products of Chemicals by Non-Target and SuspectScreening- Strategies and Workflows Vol 1 ACS Publications 2016 pp45minus65(96) Sireacuten K Fischer U Vestner J Automated supervisedlearning pipeline for non-targeted GC-MS data analysis AnalyticaChimica Acta X 2019 1 100005(97) Sinkov N A Harynuk J J Cluster resolution A metric forautomated objective and optimized feature selection in chemometricmodeling Talanta 2011 83 (4) 1079minus1087(98) Johnsen L G Amigo J M Skov T Bro R Automatedresolution of overlapping peaks in chromatographic data J Chemom2014 28 (2) 71minus82(99) Smirnov A Jia W Walker D I Jones D P Du X ADAP-GC 32 graphical software tool for efficient spectral deconvolution ofgas chromatography-high-resolution mass spectrometry metabolomicsdata J Proteome Res 2018 17 (1) 470minus478(100) Chen C Husny J Rabe S Predicting fishiness off-flavourand identifying compounds of lipid oxidation in dairy powders bySPME-GCMS and machine learning Int Dairy J 2018 77 19minus28(101) Taghadomi-Saberi S Mas Garcia S Allah Masoumi ASadeghi M Marco S Classification of bitter orange essential oilsaccording to fruit ripening stage by untargeted chemical profiling andmachine learning Sensors 2018 18 (6) 1922(102) Yang Q Xu L Tang L-J Yang J-T Wu B-Q Chen NJiang J-H Yu R-Q Simultaneous detection of multiple inheritedmetabolic diseases using GC-MS urinary metabolomics by chemo-metrics multi-class classification strategies Talanta 2018 186 489minus496(103) Smolinska A Hauschild A C Fijten R R R Dallinga JW Baumbach J Van Schooten F J Current breathomicsa reviewon data pre-processing techniques and machine learning inmetabolomics breath analysis J Breath Res 2014 8 (2) 027105(104) Garrido-Baserba M Corominas L Corteacutes U Rosso DPoch M The fourth-revolution in the water sector encounters thedigital revolution Environ Sci Technol 2020 54 (8) 4698minus4705(105) Vujnovic G Perisic J Bozilovic Z Milovanovic MGetman R V Radovanovic L USING SCADA SYSTEM FOR
PROCESS CONTROL IN WATER INDUSTRY Acta TechnicaCorviniensis-Bulletin of Engineering 2019 12 (2) 67minus72(106) Olsson G ICA and me-a subjective review Water Res 201246 (6) 1585minus1624(107) Zhao H Hou D Huang P Zhang G Water quality eventdetection in drinking water network Water Air Soil Pollut 2014 225(11) 2183(108) Chandola V Banerjee A Kumar V Anomaly detection Asurvey ACM computing surveys (CSUR) 2009 41 (3) 1minus58(109) Hasan J Goldbloom-Helzner D Ichida A Rouse TGibson M Technologies and Techniques for Early Warning Systems toMonitor and Evaluate Drinking Water Quality A State-of-the-ArtReview Environmental Protection Agency Washington Dc Office ofWater 2005(110) Izquierdo J Loacutepez P A Martiacutenez F J Peacuterez R Faultdetection in water supply systems using hybrid (theory and data-driven) modelling Mathematical and Computer Modelling 2007 46(3minus4) 341minus350(111) Murray S Ghazali M McBean E A Real-time water qualitymonitoring assessment of multisensor data using Bayesian beliefnetworks Journal of Water Resources Planning and Management 2012138 (1) 63minus70(112) Raciti M Cucurull J Nadjm-Tehrani S Anomaly detectionin water management systems In Critical Infrastructure ProtectionSpringer 2012 pp 98minus119(113) Ahmed M Mahmood A N Hu J A survey of networkanomaly detection techniques Journal of Network and ComputerApplications 2016 60 19minus31(114) Yuan Y Jia K A Water Quality Assessment Method Based onSparse Autoencoder 2015 IEEE 2015 pp 1minus4(115) Muharemi F Logofatu D Leon F Machine learningapproaches for anomaly detection of water quality on a real-worlddata set Journal of Information and Telecommunication 2019 3 (3)294minus307(116) Alom M Z Taha T M Yakopcic C Westberg S SidikeP Nasrin M S Hasan M Van Essen B C Awwal A A S AsariV K A state-of-the-art survey on deep learning theory andarchitectures Electronics 2019 8 (3) 292(117) Russo S Lurig M Hao W Matthews B Villez K Activelearning for anomaly detection in environmental data EnvironmentalModelling amp Software 2020 134 104869(118) Ahmad S Lavin A Purdy S Agha Z Unsupervised real-time anomaly detection for streaming data Neurocomputing 2017262 134minus147(119) Stocco A Tonella P Towards Anomaly Detectors that LearnContinuously IEEE 2020 pp 201minus208(120) Hoffman M Bach F R Blei D M In Online learning forlatent dirichlet allocation 2010 Citeseer pp 856minus864(121) Mongillo G Deneve S Online learning with hidden Markovmodels Neural computation 2008 20 (7) 1706minus1716(122) Bogler A Packman A Furman A Gross A Kushmaro ARonen A Dagot C Hill C Vaizel-Ohayon D Morgenroth ERethinking wastewater risks and monitoring in light of the COVID-19pandemic Nature Sustainability 2020 9 1minus10(123) Bivins A North D Ahmad A Ahmed W Alm E BeenF Bhattacharya P Bijlsma L Boehm A B Brown J Wastewater-Based Epidemiology Global Collaborative to Maximize Contributions inthe Fight Against COVID-19 ACS Publications 2020
Environmental Science amp Technology pubsacsorgest Feature
httpsdoiorg101021acsest1c01026Environ Sci Technol XXXX XXX XXXminusXXX
M