Managing uncertainty in early estimation of epidemic behaviors using scenario trees

This article was downloaded by: [Monash University Library] On: 24 June 2014, At: 16:26 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK IIE Transactions Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/uiie20 Managing uncertainty in early estimation of epidemic behaviors using scenario trees Ralph Gailis a , Ajith Gunatilaka a , Leo Lopes b , Alex Skvortsov a & Kate Smith-Miles bc a HPP Division, Defence Science and Technology Organization, Melbourne, Victoria 3207, Australia b SAS Institute, Cary, NC 27513, USA c School of Mathematical Sciences, Monash University, Melbourne, Victoria 3800, Australia E- mail: Accepted author version posted online: 31 May 2013.Published online: 01 May 2014. To cite this article: Ralph Gailis, Ajith Gunatilaka, Leo Lopes, Alex Skvortsov & Kate Smith-Miles (2014) Managing uncertainty in early estimation of epidemic behaviors using scenario trees, IIE Transactions, 46:8, 828-842, DOI: 10.1080/0740817X.2013.803641 To link to this article: http://dx.doi.org/10.1080/0740817X.2013.803641 PLEASE SCROLL DOWN FOR ARTICLE Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. 
Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content. This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions

IIE Transactions (2014) 46, 828–842
Copyright © “IIE”
ISSN: 0740-817X print / 1545-8830 online
DOI: 10.1080/0740817X.2013.803641

Managing uncertainty in early estimation of epidemic behaviors using scenario trees

RALPH GAILIS1, AJITH GUNATILAKA1, LEO LOPES2, ALEX SKVORTSOV1, and KATE SMITH-MILES2,3,∗

1 HPP Division, Defence Science and Technology Organization, Melbourne, Victoria 3207, Australia
2 SAS Institute, Cary, NC 27513, USA
3 School of Mathematical Sciences, Monash University, Melbourne, Victoria 3800, Australia
E-mail: [email protected]

Received June 2012 and accepted April 2013

The onset of an epidemic can be foreshadowed theoretically through observation of a number of syndromic signals, such as absenteeism or rising sales of particular pharmaceuticals. The success of such approaches depends on how well the uncertainty associated with the early stages of an epidemic can be managed. This article uses scenario trees to summarize the uncertainty in the parameters defining an epidemiological process and the future path the epidemic might take. Extensive simulations are used to generate various syndromic and epidemic time series, which are then summarized in scenario trees, creating a simple data structure that can be explored quickly at surveillance time without the need to fit models. Decisions can be made based on the subset of the uncertainty (the subtree) that best fits the current observed syndromic signals. Simulations are performed to investigate how well an underlying dynamic model of an epidemic with inhomogeneous mixing and noise fluctuations can capture the effects of social interactions. Two noise terms are introduced to capture the observable fluctuations in the social network connectivity and variation in some model parameters (e.g., infectious time). Finally, it is shown how the entire framework can be used to compare syndromic surveillance systems against each other; to evaluate the effect of lag and noise on accuracy; and to evaluate the impact that differences in syndromic behavior among susceptible and infected populations have on accuracy.

Keywords: Epidemic, syndromic surveillance, scenario trees, SIR model

1. Introduction

Early indicators of an epidemic include various syndromic signals, such as the number of visits to doctors, calls to “hot lines,” sales of a particular product, hits of particular web sites, etc. (Dailey et al., 2007). When such signals are detected, however, it is unclear whether they represent the beginning of an epidemic or just normal variance in a stochastic process, especially at the onset stages of an epidemic. Although intervention at these early stages is extraordinarily valuable, the benefits of intervening when a true epidemic is present have to be balanced against the costs of intervening when the data observed are just normal variance. Thus, estimating the state of an epidemic process using syndromic data is an extraordinarily important (HSPD-21, 2007) and challenging problem that has attracted significant effort from diverse scientific communities (see, for example, Wagner et al. (2011), Wilson et al. (2006), Chen et al. (2009), Sparks et al. (2010), and many others).

∗Corresponding author
Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/uiie.

The following topics are among the many important and complex questions yet to be answered in the literature.

1. How effective is a particular set of syndromes at moving the first confident detection date earlier?

2. How much lag can a syndromic process have relative to an epidemic process and still be a useful predictor?

3. How much noise can that signal have and still provide insight?

4. How different does the syndromic signal produced by the epidemic have to be compared to its endemic signal to be detectable?

This article proposes a general framework for studying these questions in silico using scenario trees as a visual, simple, and powerful estimator of the power of a set of syndromes to predict the onset of an epidemic. Scenario trees (Dupacova et al., 2000) summarize very complex stochastic processes in a computationally tractable, easy to visualize, and relatively explanatory way. They store the dependence of future outcomes on prior information. As a stochastic process drifts from a known initial fixed point, the tree branches out. Each node of the tree represents a complex state of the system; each edge describes the possible evolution from one state to the next over a period of time; and the weight on that edge stores the probability of such a transition.
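The node, edge, and weight structure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `ScenarioNode`, `add_child`, and `path_probabilities`, the single-number state, and all probabilities are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioNode:
    """One node of a scenario tree: a summarized system state at a given stage."""
    stage: int
    state: tuple  # hypothetical state summary, e.g. (proportion infected,)
    children: list = field(default_factory=list)  # (transition probability, node) pairs

    def add_child(self, probability, node):
        """Attach a child reached with the given edge weight and return it."""
        self.children.append((probability, node))
        return node


def path_probabilities(node, prob=1.0):
    """Yield (leaf, probability of reaching it) by multiplying edge weights."""
    if not node.children:
        yield node, prob
        return
    for p, child in node.children:
        yield from path_probabilities(child, prob * p)


# Toy three-stage tree with made-up states and weights.
root = ScenarioNode(0, (0.01,))
mild = root.add_child(0.7, ScenarioNode(1, (0.02,)))
severe = root.add_child(0.3, ScenarioNode(1, (0.05,)))
mild.add_child(0.5, ScenarioNode(2, (0.02,)))
mild.add_child(0.5, ScenarioNode(2, (0.04,)))
severe.add_child(1.0, ScenarioNode(2, (0.12,)))

leaves = list(path_probabilities(root))
```

Because the weights on the edges leaving any node sum to one, the leaf probabilities also sum to one, which is what allows a subtree to be read as a conditional distribution over future paths.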

Formulations of problems based on scenario trees can assist with non-anticipativity, where decisions must be made while information about the future remains unknown. Thus, they are common in stochastic programming for optimization (see, for example, Høyland and Wallace (2001) and Kaut and Wallace (2007)). In this article, we use scenario trees for a novel application, namely, real-time estimation of parameters from a complex distribution that are most likely to underpin an epidemiological model and explain observations at that time, and identification of the likely path an epidemic will follow.

Clustering of scenario trees enables arbitrarily sophisticated behavior in the underlying stochastic process to be preserved, instead of limiting the analysis to the dynamics of a specific model, such as one based on a Bayesian update framework (Cooper et al., 2004). This is especially valuable when the stochastic model is very sophisticated (e.g., an agent-based simulation). In the scenario tree approach presented here, all major computation is done a priori. The operations required in real-time are simple, fast, and independent of the underlying models.

The procedure works in three major steps. The first two are done once in an off-line training mode to understand the kinds of variation seen in model-generated epidemic and syndromic data; the final step is the real-time deployment of the system in test mode to identify the likely event of an epidemic based on observed syndromic data. In more detail, these steps involve the following tasks.

1. Data generation: training data from both epidemic models and correlated syndromic data is generated according to defined models under certain assumptions. Many replicates of multivariate time series are generated, each containing a correlated ensemble of model-driven epidemic processes and various syndromes. The syndromes have varying responses to the underlying epidemic process.

2. Scenario tree generation: the stochastic time series generated in step 1 are clustered into a scenario tree structure. This is essentially an unsupervised learning of the similarities and differences between time series (used as training data) based on their variations apparent at various stages of the time series (e.g., day 1, day 2, etc.). The root node of the tree contains all generated time series as they are observed on day 0, since we assume they are all indistinguishable at this early stage. Once we observe more data (say, from day 1) we begin to see how the different time series start to become distinguished, and clustering based on similarity creates a few branches in the tree at day 1. With more time points used for clustering at each stage of the tree, we eventually generate an entire scenario tree revealing how all the time series behaviors can be grouped in a manner robust to noise in the data and how their path can be traced back to an earlier stage where the distinction started to become apparent.

3. Real-time filtering: the scenario tree is used at deployment time to identify the likely event of an epidemic. In this article, we generate a test ensemble of epidemic and syndromic data drawn from the same distribution as that used to train the scenario tree, playing the role of the data observed in real-time in a deployed system. The initial stages of the new syndromic signal are then used to identify the early stage node of the scenario tree that best matches this signal and to extract the corresponding subtree of later stage nodes that are likely to contain the eventual path of this signal. The subtree represents an envelope of uncertainty around the state of the epidemic given the observed syndromic data. The ability of this subtree to adequately summarize and correctly predict the evolution of the test epidemic provides the means with which to answer questions 1 to 4 above.

The accuracy of the proposed approach at deployment time will be highly dependent on our model assumptions in the data generation process and how well those assumptions hold for real-world signals. Simulation of epidemics is a well-researched topic, with many models proposed under various assumptions, such as the distribution of infection time (Lloyd, 2001). Likewise, there have been studies examining simulation of syndromic data (Buckeridge et al., 2004; Lotze et al., 2009; Maciejewski et al., 2009), under various assumptions. For our methodology, it is not particularly important which assumptions we make for the generation of the data, as long as those assumptions can reasonably be expected to hold for the test data and real-world deployment. If we do not feel comfortable that any one model can mimic the kind of behavior we may encounter in a real-world setting, then we should adopt multiple models for the data generation stage to ensure that the scenario tree is afforded a breadth of experience in understanding the uncertainty that can arise in stochastic processes under various model assumptions.

In this article, we have adopted a single model for generating the epidemic process: a modified Susceptible–Infected–Recovered (SIR) model, which includes power law corrections, uncertainty, and population mixing, and has been validated on real epidemic data from the United States (see Skvortsov and Ristic (2012) and references therein). The chosen model also has a strong correlation between the predicted number of infected individuals and real-world syndromic data from Google Flu Trends (Wilson et al., 2008). We have also adopted a single model for generating syndromic data that includes a lag parameter for each syndromic series and replaces the frequent normality assumption with a beta distribution, which enables a wider range of behaviors to be modeled with its two parameters affecting the distribution. As with the epidemic simulation, there are a variety of approaches to simulating syndromic data using autocorrelation, cross-correlation, and seasonality patterns (Buckeridge et al., 2004; Lotze et al., 2009; Maciejewski et al., 2009), but our choices do not limit the broader applicability of the proposed scenario tree framework to consider multiple models in the future, ensuring a wider range of plausible behaviors are represented by the scenario tree.

Traditional approaches to modeling epidemics such as the SIR models and their stochastic extensions (Dangerfield et al., 2009) can provide valuable estimates of important behaviors of epidemics. Most approaches to state estimation fit observed infection data to an SIR-derived model (Bettencourt and Ribeiro, 2008; Ristic et al., 2009). However, these models do not incorporate syndromic data, which can contain valuable leading indicators of an epidemic process (Zheng et al., 2007). When chosen appropriately, syndromic data at least provide corroborating evidence that can strengthen statistical approaches (Ginsberg et al., 2008; Hulth et al., 2009). However, it is difficult to evaluate current designs for syndromic surveillance systems. With few exceptions (Paul et al., 2008) the important consideration that syndromes are the result of both epidemic and endemic processes is ignored. Testing and validation of syndromic surveillance procedures is exceedingly difficult (Hurt-Mullen and Coberly, 2005). One of the aims of this article is to provide a methodology to examine the effectiveness of a syndromic signal in detecting an epidemic.

Ensembles of syndromic models have been used to improve state estimation performance using data mining (Lotze and Shmueli, 2008), although adapting existing data assimilation approaches (Cazelles and Chau, 1997; Bettencourt and Ribeiro, 2008; Mondini et al., 2009) to include ensembles complicates the analysis and the computation significantly, by requiring the fitting and tuning of model ensembles at each stage of the epidemic. Indeed, a data mining approach to our problem of early detection of an epidemic could be taken, whereby summary data of the current state of the system is used as an input (number of infected individuals, syndromic signals) and the output is the number of infected individuals at some future time. If the number of infected individuals in the future is higher than expected (based on some statistical test) in the training data, then this example is labeled as an outbreak. This is the approach that has been taken with the surveillance tree method for detecting unusually high numbers of vehicle crashes (Sparks et al., 2012), where a recursive partitioning data mining algorithm is used to partition the high-dimensional input space (age, gender, vehicle type) into regions of normal or abnormal events (higher than expected vehicle crashes). Other data mining or machine learning techniques that partition a high dimensional space in a supervised manner, with labeled data indicating where anomalies lie, could be adopted here if we wished to pursue a supervised learning approach to this problem (Friedman and Fisher, 1999).

In this research, our approach is quite different. We begin with time series of syndromic and epidemic data as in Schindeler et al. (2009) and others. In contrast with the existing literature, in our next step we categorize the temporal evolution of epidemic behaviors independently of the model that produced them, doing away with the ensemble data-fitting step of data assimilation approaches and without labeling any of the time series as an epidemic, an anomaly, or outbreak, as in traditional supervised learning approaches. Instead, we summarize the behaviors into a scenario tree using unsupervised learning, based on the similarity of the time series given the available information at each temporal stage. This enables a simple (at the cost of being exponentially large) data structure, the scenario tree, to contain very detailed but easy to manipulate information on the stochastic processes describing epidemic behaviors and syndromic data. At surveillance time, the tree can be explored quickly without the need to fit models, and decisions can be made based on the subset of the uncertainty (i.e., the subtree) that best defines the neighborhood of the observed syndrome. Scenario tree structures have been used before for tracking temporal systems, and various methods have been proposed to prune redundant branches of the trees using Markov chains and Bayesian Belief Networks (Hood et al., 2009). In this article, we propose the use of clustering of the scenario tree branches to achieve this pruning and to recognize that branches of a scenario tree are redundant if they lead to similar enough outcomes for the epidemic.

Once the resulting subtree has been identified for a real-time surveillance case, the outcomes of the subtree can then be used to predict the epidemic path with more certainty and plan responses to the syndrome. Alternatively, the parameters used to generate the training data falling within the subtree can be recovered and used to drive new simulations and evaluate potential responses (Carrat et al., 2006) or even to prepare first responders in simulation exercises.

This article presents the proposed framework, the developed algorithms, and a series of experimental results that are used to answer questions 1 to 4 above. The remainder of this article is organized as follows: In Section 2 we present the scenario tree methodology, first discussing the data generation process in Section 2.1, including the adopted modified SIR model containing population mixing and our method for generating syndromic data to accompany the epidemic data. In Section 2.2, we introduce the scenario tree structure and describe our scenario tree clustering-based algorithm. With the methodology for the first two (off-line training) steps explained, we then present an example in Section 3 that demonstrates how the algorithms work and the resulting scenario tree. Section 4 then presents the method for real-time filtering and evaluating the accuracy of the scenario tree. A series of experimental results is presented in Section 5 for a variety of model assumptions, with experiments designed to answer questions 1 to 4. Section 6 concludes the article and suggests some further improvements and future directions.

2. Generating a scenario tree of epidemic and syndromic data

In this section we present the methodology for the off-line generation of the scenario tree, comprising two steps: data generation followed by scenario tree generation.

2.1. Generating data

There are many available models for simulating epidemics, ranging from the standard SIR model (Anderson and May, 1979) to stochastic variations (Allen and Burgin, 2000; Dangerfield et al., 2009) and those that consider different assumptions about model parameters such as infection time (Lloyd, 2001). Since there is a social network in any realistic community, mixing will occur and needs to be modeled. In this article we adopt the model developed in our earlier work (Skvortsov et al., 2010; Skvortsov and Ristic, 2012), which captures this nuance by adding two inhomogeneity mixing parameters to a stochastic SIR model, described briefly in Section 2.1.1. The simulated epidemic data are then used to generate synthetic syndromic data using a further stochastic process, described in Section 2.1.2. As stated in the Introduction, the proposed scenario tree methodology is not dependent on the chosen models, and the accuracy of the approach should only be improved by including multiple models to generate the epidemic and syndromic data. In this article, however, we restrict the data generation to the processes described in the following sections, without loss of generality.

2.1.1. Simulating epidemics
Full details of the epidemiological model we have adopted can be found in Skvortsov and Ristic (2012) but, in brief, it is a generalized SIR model with Gaussian noise extended with the so-called inhomogeneous mixing with explicit dependency of social interactions on underlying network topology (Roy and Pascual, 2006). Demographic noise, modeled as fluctuations of model parameters caused by the finite size of population, is also included (Van Herwaarden and Grasman, 1995). The implemented model can be expressed by two stochastic differential equations:

$$\frac{ds}{dt} = -q + \sigma_q \xi \quad \text{and} \quad \frac{di}{dt} = q - \beta i - \sigma_q \xi + \sigma_\beta \zeta, \tag{1}$$

and $r = 1 - s - i$ is the “conservation” law for the population. Here, $q \equiv q(s, i)$ is a nonlinear mixing term, describing social contacts between individuals; $\beta$ is the recovery rate (a disease-specific parameter); $\xi$, $\zeta$ are two uncorrelated (white) standard Gaussian noise processes. The terms $\sigma_q \equiv \sigma_q(s, i)$ and $\sigma_\beta \equiv \sigma_\beta(s, i)$ are introduced to capture the demographic noise (Dangerfield et al., 2009).

In the standard SIR model (i.e., with homogeneous mixing) $q(s, i) = \alpha i s$, with a constant rate of social contacts $\alpha$ (Daley and Gani, 1996). We adopt the simple phenomenological extension of this model that includes the non-homogeneous mixing case as proposed by Roy and Pascual (2006) with

$$q(s, i) = \alpha\, i^{\mu} s^{\nu}. \tag{2}$$

Parameters $\mu$, $\nu$ describe a mixing inhomogeneity, with $\mu = \nu = 1$ corresponding to the uniform mixing scenario, while $\mu$, $\nu$ can be treated as parameters of the model that are determined by underlying social network topology. Some results of the theoretical derivation of $\mu$, $\nu$ are presented in Novozhilov (2008).

The amplitude of the demographic noise terms can be established from a scaling law of Gaussian fluctuations generated by the random contact rate $q$ and recovery rate $\beta i$ (Dangerfield et al., 2009). This corresponds to the well-known diffusion approximation in the theory of stochastic processes (Van Herwaarden and Grasman, 1995). As a result, for a dynamical system consisting of a large number of individuals $P$ we can write $\sigma_q(s, i) = \sqrt{q(s, i)/P}$ and $\sigma_\beta(s, i) = \sqrt{\beta i/P}$. In summary, the chosen model of the epidemic process is parameterized by $\alpha$, $\beta$, $\mu$, $\nu$, and noise parameters.
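As a rough illustration of how Equations (1) and (2) might be simulated, the sketch below uses a simple Euler–Maruyama discretization. It makes several assumptions that are not taken from the article: the noise terms are treated as independent Wiener increments, the step size `dt`, the initial infected proportion `i0`, and the clipping of `s` and `i` to [0, 1] are arbitrary choices, and the function name `simulate_sir` is hypothetical.

```python
import math
import random


def simulate_sir(alpha, beta, mu, nu, P, i0=0.001, days=30, dt=0.01, rng=None):
    """Euler-Maruyama sketch of Equations (1)-(2); returns daily (s, i) pairs."""
    rng = rng or random.Random()
    s, i = 1.0 - i0, i0                        # assumed initial condition
    steps_per_day = round(1 / dt)
    series = [(s, i)]
    for _ in range(days):
        for _ in range(steps_per_day):
            q = alpha * (i ** mu) * (s ** nu)  # mixing term, Equation (2)
            sigma_q = math.sqrt(q / P)         # demographic noise amplitudes
            sigma_b = math.sqrt(beta * i / P)
            dw_q = rng.gauss(0.0, math.sqrt(dt))  # Wiener increments (assumption)
            dw_b = rng.gauss(0.0, math.sqrt(dt))
            ds = -q * dt + sigma_q * dw_q
            di = (q - beta * i) * dt - sigma_q * dw_q + sigma_b * dw_b
            # Clip to [0, 1] so that q and the square roots stay well defined.
            s = min(max(s + ds, 0.0), 1.0)
            i = min(max(i + di, 0.0), 1.0)
        series.append((s, i))
    return series


run = simulate_sir(alpha=0.6, beta=0.5, mu=1.0, nu=0.7, P=10_000,
                   rng=random.Random(1))
```

The recovered proportion follows from the conservation law as `1 - s - i` at each recorded day. Note that the same increment `dw_q` appears with opposite signs in `ds` and `di`, mirroring the shared noise term in Equation (1).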

2.1.2. Simulating syndromes
To determine which syndromes are most useful for early detection of epidemics, our approach is to build a synthetic dataset characterized by the following properties (Lotze et al., 2009).

1. The lag between the epidemic and the syndrome.
2. The prevalence of the syndrome in populations where the disease is both epidemic and endemic.
3. The variance of the syndromic signal in populations where the disease is both epidemic and endemic.

To enable syndromes with behaviors that are as diverse as possible, we model them as beta-distributed factors with a lag parameter in an ensemble of stochastic SIR series generated using the procedure above. This captures a large variety of model behaviors. Each syndrome is assumed to have two different rates: one for infected individuals and one for the endemic behaviors in the population. Thus, each syndrome is represented by a convex combination of two beta random variates, weighted by the proportion of infected people at a given time. Beta variates were chosen since they are known to provide flexibility in describing epidemic behavior with their two parameters that define the shape of the distribution and their support over the interval [0, 1] (Pollett et al., 2010).

In detail now, let $\Sigma(t)$ be a set of syndromes of interest; i.e., a multivariate time series. Each syndrome $i$ will be characterized using five parameters.

$(a_i, b_i)$: The parameters of the beta distribution describing the syndromic behavior of infected individuals.

$(\bar{a}_i, \bar{b}_i)$: The parameters of the beta distribution describing the endemic syndromic behavior.

$\lambda_i$: The lag between the infection time and the moment when the syndromic behavior begins to be observable.

If $I_t$ represents the proportion of infected people at time $t$, and $B(a, b)$ is a random variate from a beta distribution with parameters $a$ and $b$, then

$$\Sigma_i(t) = I_{t-\lambda_i} B(a_i, b_i) + (1 - I_{t-\lambda_i}) B(\bar{a}_i, \bar{b}_i). \tag{3}$$
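Equation (3) can be sketched directly with the standard-library beta sampler. The function name `syndrome_series` and the numerical values below are illustrative assumptions, as is the choice to treat the lagged infected proportion as zero before the lag has elapsed; the article does not specify how the first few time steps are handled.

```python
import random


def syndrome_series(infected, a_inf, b_inf, a_end, b_end, lag, rng=None):
    """Sketch of Equation (3): lagged convex mix of two beta variates."""
    rng = rng or random.Random()
    signal = []
    for t in range(len(infected)):
        # Assumption: no infected contribution until `lag` steps have passed.
        i_lag = infected[t - lag] if t >= lag else 0.0
        from_infected = rng.betavariate(a_inf, b_inf)  # behavior of infected people
        from_endemic = rng.betavariate(a_end, b_end)   # endemic (baseline) behavior
        signal.append(i_lag * from_infected + (1.0 - i_lag) * from_endemic)
    return signal


# Hypothetical infected proportions and beta parameters, for illustration only.
infected = [0.0, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05, 0.0]
signal = syndrome_series(infected, a_inf=2, b_inf=10, a_end=2, b_end=200,
                         lag=2, rng=random.Random(0))
```

Because both variates lie in [0, 1] and the weights sum to one, every value of the resulting signal also lies in [0, 1].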

2.2. Generating a scenario tree

A comprehensive review of many scenario generation techniques, including clustering techniques, can be found in Dupacova et al. (2000). There are many existing methods in the literature to build reasonably sized scenario trees. Henrion et al. (2009) addresses computational complexity by using probability metrics for scenario reduction. Dupacova et al. (2000) proposed a weighted clustering approach; these weights serve as the basis for our clustering procedure. Kaut and Wallace (2007) considered statistical parameters of the relationship between trees and outputs in stochastic programs, such as bias. Høyland and Wallace (2001) proposed a method that matches statistical properties of the distributions of the original data to the distributions of data in the tree. Di Domenica et al. (2007) empirically considered stability of several parameters of stochastic programs using simulation.

In this article, we apply the concept of scenario trees to the representation of the evolution of epidemic and syndromic signals and use clustering of the time series at discrete points in time to generate the stages of the scenario tree. The proposed methodology follows.

2.2.1. Methodology
The infection rate and observations of syndromic behaviors, generated via the processes described in Sections 2.1.1 and 2.1.2, respectively, can be summarized into a multivariate time series $\omega$. Different simulation methods and different parameters within each method can generate a collection $\Omega = \{\omega_1, \omega_2, \ldots, \omega_{|\Omega|}\}$, where each $\omega_i = [\omega_i^0, \ldots, \omega_i^\tau]$ is a time series of length $\tau + 1$ and each $\omega_i^t = (\gamma_1, \ldots, \gamma_r)$ is a vector of observations at time $t \in \{0, \ldots, \tau\}$, where two components are the number of infected and recovered individuals and the other components are syndromes.

Given the same initial conditions, all $\omega_i$ should exhibit similar initial behavior. As time goes on, the behaviors of each $\omega_i$ will vary. The proposed algorithm clusters the time series based on their behavior, independent of the specific simulation or parameter set used to generate that time series.

Algorithm 1 Generate a scenario tree from arrays S of |Ω| independent time series scenarios, where S contains the original time series at t = 0, and k_1, ..., k_τ provides a lower bound for the number of nodes to be generated at each stage of the tree (k_0 = 1).

GenerateScenarioTree(S = Ω, t = 0)
    p = |S| / |Ω|
    κ = ceil(p k_t)
    Delete ω_i^t, ∀ ω_i ∈ S
    Let Q be a normalized transformation of S.
    {S_i}_{i ∈ {1,...,κ}} = k-means(Q, κ)
    C_t = ⋃_{i ∈ {1,...,κ}} GenerateScenarioTree(S_i, t + 1)
    return C_τ

Algorithm 1 describes the clustering procedure. The algorithm recursively generates a tree structure C_τ that describes the similarities between time series as more data becomes available and t is increased from zero to τ. Each new t value generates a new layer of branches across the tree. The algorithm begins at the root node with the set of |Ω| multivariate time series at t = 0 (where all time series appear very similar). The set of time series to be clustered (initially all S = Ω) is first normalized to [0, 1] to mitigate against the fact that each syndrome and epidemic might be measured on different scales. This transformed data Q is then clustered using the k-means algorithm (although any clustering algorithm would be appropriate) to generate a suitable proportion (given by p) of the stipulated number of clusters across the whole tree at time t, given by k_t. After clustering is complete at a given stage t, the data points for stage t are removed from all the time series, so that they do not affect the clustering at later stages. The algorithm recursively performs this clustering of subsets of the time series found in a given node as we simulate time progressing and more data becoming available.
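The recursion in Algorithm 1 can be sketched in plain Python as follows. This is one possible reading rather than the authors' code. In particular, it assumes that clustering at stage t uses only the stage-t observation vector (so earlier stages cannot influence later clustering), it uses a small hand-written Lloyd-style k-means, it caps the number of clusters at the number of scenarios in the node, and it represents nodes as dictionaries. The helper names `normalise`, `kmeans`, and `generate_tree` are hypothetical.

```python
import math
import random


def normalise(rows):
    """Min-max scale every column of `rows` to [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    span = [(max(c) - l) or 1.0 for c, l in zip(cols, lo)]  # avoid division by zero
    return [[(x - l) / d for x, l, d in zip(r, lo, span)] for r in rows]


def kmeans(rows, k, rng, iters=20):
    """Plain Lloyd iterations; returns one cluster label per row."""
    centres = rng.sample(rows, k)
    labels = [0] * len(rows)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(r, centres[c])) for r in rows]
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:  # keep the old centre if a cluster is empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def generate_tree(ids, series, k, total, t=0, rng=None):
    """Split the scenarios `ids` into about ceil(p * k[t]) stage-t nodes, recursively."""
    rng = rng or random.Random(0)
    tau = len(series[ids[0]]) - 1
    p = len(ids) / total                        # share of all scenarios in this node
    kappa = min(len(ids), math.ceil(p * k[t]))  # cap is an assumption of this sketch
    feats = normalise([list(series[i][t]) for i in ids])  # stage-t vectors only
    labels = kmeans(feats, kappa, rng)
    nodes = []
    for c in range(kappa):
        members = [i for i, l in zip(ids, labels) if l == c]
        if not members:
            continue
        children = generate_tree(members, series, k, total, t + 1, rng) if t < tau else []
        nodes.append({"stage": t, "members": members, "children": children})
    return nodes


# Toy data: 8 scenarios, 3 stages, one observed component per stage.
series = [[(0.0,), (float(j % 2),), (float(j % 4),)] for j in range(8)]
tree = generate_tree(list(range(8)), series, k=[1, 2, 4], total=8)
```

With `k = [1, 2, 4]` the toy data yield a single root, two stage-1 nodes, and four stage-2 nodes. Swapping in a library clustering routine, or clustering on a longer window of observations, would only change `kmeans` and the construction of `feats`.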

3. Illustrative example of a scenario tree

Now that the process and algorithm have been described for generating a scenario tree, a small example is introduced to provide greater clarity about how the algorithm can be implemented and to illustrate what the resulting scenario tree looks like and the information it contains. Focusing on a community with a rather small population (P ≤ 10 000 people), such as a military community, suppose we can measure the proportion of people who have become infected with a contagious virus, and the proportion of people displaying syndromic behaviors likely to be caused by people contracting the virus: pharmaceutical sales of products to relieve the symptoms of the virus and absenteeism from school and work. Although real-world data could be used, it is too limited to provide a useful and wide-ranging set of scenarios to build a scenario tree, and we use simulated data containing plausible stochastic variations to generate a rich set of stochastic epidemic and syndromic data, as described generally in Section 2.1.

Fig. 1. Beta distributions for the two syndromes in our example.

3.1. Generating the epidemic

Specifically, for this example, we generated an epidemic series over a 30 day period using the modified stochastic SIR model described in Section 2.1.1 with α = 0.6, μ = 1.0, ν = 0.7, and for 15 unique β values varying in the range [0.25, 0.95] in increments of 0.05. For each β value, we generated five replicates of the process with Gaussian noise creating variation, producing a total set of 75 distinct epidemic time series for varying β, and with fixed α, μ, and ν. Certainly, the other parameters could also have been varied, and probably should be to ensure the scenario tree is exposed to a wide range of epidemic behaviors, but for computational simplicity we have varied only β for this example.
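As a quick check of the counts quoted above, the parameter grid can be enumerated directly. The sketch below only builds the list of (β, replicate) pairs; it does not simulate the SIR model, and the variable names are illustrative.

```python
# Parameter grid for the example (values taken from the text above).
alpha, mu, nu = 0.6, 1.0, 0.7
betas = [round(0.25 + 0.05 * i, 2) for i in range(15)]
replicates_per_beta = 5

# One entry per simulated epidemic: (beta, replicate index).
runs = [(beta, r) for beta in betas for r in range(replicates_per_beta)]
```

The grid has 15 β values and 5 replicates each, which gives the 75 epidemic series used throughout the example.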

3.2. Generating the syndromes

With the 75 epidemic time series generated, we generated two associated syndromic signals for each epidemic. For this example only, we assumed that the two syndromes had beta distributions as shown in Fig. 1. Syndrome 1 in this example is pharmaceutical sales and shows a distinct difference in distribution between the infected and non-infected populations due to a difference in the parameter b (b = 10 for the infected population compared with b = 200 for the non-infected population). Syndrome 2 in this example is absenteeism and shows a distinction between the two populations due to a difference in both the parameters a and b (b is still 20 times greater for the non-infected population, but now a is magnified for the non-infected population as well). Using these distinct values of a and b for both the infected and non-infected populations, and with a time lag of 5 days for syndrome 1 and 10 days for syndrome 2, we generated the syndromic signals using Equation (3) for a given epidemic time series. Figure 2 shows one instance (from the 75) of such an epidemic series and its associated syndromic signals generated for a particular value of β = 0.5.

After generating the corresponding syndromic data for each of the 75 epidemic time series, we have the collection of time series shown in Fig. 3. The red dots show the decision stages that we have defined for where we would like to evaluate whether an epidemic is likely to exist. Decision stages correspond to days 1, 2, 3, 6, and 9. This choice is arbitrary and for the purposes of illustration of the methodology. It is clear that at stages 1 and 2 it is very difficult to visually determine which time series are indicative of a future surge corresponding to an epidemic. It should also be noted that each of the time series is on a different scale due to differences in the endemic signal (in this example, we conclude from the data that around 20% of the population tend to buy pharmaceutical products associated with a virus, even if they have not contracted this particular virus, whereas only 3% of the population are absent from school or work in the non-infected population). Before these time series are used for clustering in the scenario tree, they are normalized to [0, 1] to align them to the same scale. At this stage, we have 75 time series, each with four components: the proportion of infected and recovered individuals at time t, along with the normalized proportions of the population with syndromes 1 and 2. This collection of 75 time series is for fixed values of the beta distribution parameters (a, b) as shown in Fig. 1 and the time lags λ1 = 5 and λ2 = 10 for the two syndromes. We will vary these parameters in Section 5 but, for this illustrative example, we will assume these parameters represent the breadth of behaviors of the series we are likely to need to consider. It is a trivial extension to generate additional series with different model assumptions and parameters.

Fig. 2. Example simulated proportion of epidemic (blue), syndrome 1 (green, lag of 5 days) and syndrome 2 (red, lag of 10 days) in total population.

Fig. 3. Time series of epidemic (blue), syndrome 1 (green), and syndrome 2 (red) used for generating the scenario tree. The red dots show the nominated decision stages.
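The endemic levels quoted above can be related to the beta parameters through the standard result that a Beta(a, b) distribution has mean a / (a + b). The sketch below is a hedged illustration: it assumes the quoted levels correspond to the means of the non-infected distributions, takes the (a, b) values from Case 1 of Table 1, and uses illustrative names throughout.

```python
import random


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Case 1 parameters (a, b) as listed in Table 1.
case1 = {
    "syndrome 1, infected": (50, 10),
    "syndrome 1, non-infected": (50, 200),
    "syndrome 2, infected": (20, 100),
    "syndrome 2, non-infected": (50, 2000),
}
means = {name: beta_mean(a, b) for name, (a, b) in case1.items()}

# Monte Carlo cross-check for one of the distributions.
rng = random.Random(0)
sample_mean = sum(rng.betavariate(50, 200) for _ in range(10000)) / 10000
```

Under this reading the non-infected mean for syndrome 1 is 0.20, in line with the 20% figure, and the non-infected mean for syndrome 2 is about 0.024, of the same order as the quoted 3%.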

3.3. Generating the scenario tree

We now take the 75 multivariate time series and distribute them across a scenario tree with stages of the tree corresponding to the decision stages shown as red points in Fig. 3, with an additional stage on day 21, after the epidemic has peaked. That is, τ = 6 and the tree will have six stages after the root node, corresponding to t = 1, 2, 3, 6, 9, 21. At each of these stages we need to supply a lower bound for the number of clusters or nodes across that layer of the tree. We arbitrarily chose k0 = 1, k1 = 4, k2 = 16, k3 = 24, k4 = 48, k5 = 60, k6 = 75. At the later stages we expect that the time series have evolved to be quite distinct, and we expect to have a large number of clusters. But at the earlier stages, the series are still quite similar and we seek just a small number of clusters to help reduce uncertainty and decide which subtree we expect the path of the epidemic to take.

Figure 4 shows the resulting scenario tree when Algorithm 1 is applied to the |�| = 75 multivariate time series in this example. In the upper panel we see that the scenario tree structure comprises six stages after the root node. The root node has been highlighted to reveal the time series stored within it, as shown in the lower panel where we see the original 75 multivariate time series. Although we have plotted all time series over the full 30 day period, the scenario tree does not have access to more than t = 0 data at stage 0 and, at that time, the 75 time series are quite similar when normalized.

Stepping through the algorithm with this example, we see that initially S = � at t = 0 and we will first be clustering all 75 multivariate time series. At stage 1, we have a lower bound of k1 = 4 clusters to generate, and since we only have one node in the previous stage (the root node), all of the clusters will come from this source (so p = 1, indicating that the full proportion of the clusters in stage 1 will come from the clustering of S). We ensure that the time series are normalized before calling the k-means clustering algorithm to produce κ = 4 clusters. We now divide S into S1, S2, S3, S4 corresponding to the four clusters. The root node (stage 0) is then extended to stage 1 of the tree, showing four branches, and this partial tree is labeled C1 in the algorithm. Recursively, the construction of the tree calls the entire algorithm again, this time with the set of time series coming from S1 (which produces two clusters in stage 2); repeating for S2, S3, and S4, so that the total number of clusters across stage 2 matches or exceeds the required minimum of k2 = 16. This process is continued until we reach stage τ, corresponding to day 21, and the final stage of the tree Cτ shows 75 clusters, each containing one unique multivariate time series.

Fig. 4. The original collection of 75 series: epidemics (blue) and syndromes (green and red).

Figure 5 reveals the time series that comprise a particular node in stage 2 of the tree (highlighted in the upper panel). It contains three multivariate time series (so nine time series are plotted in the lower panel) that show epidemic behavior, and that were recognized as different from other time series, even as early as stage 2. Figure 6 shows the time series from another part of the tree, also at stage 2, which were seen as similar to each other but different from other time series. Looking beyond the stage 2 time period, these 14 time series can later be seen to correspond to no epidemic, but the similarity of their behavior was recognized at stage 2 (day 2). Clearly, the scenario tree has been able to find similarities and differences between the early syndromic and epidemic data and to partition the time series based on this early behavior in a way that can successfully discriminate between the epidemic outcomes at later stages.

This example has illustrated how the model-generated epidemic and syndromic data can be used to build a scenario tree to represent the uncertainty in the path an epidemic might take at various decision stages. This process can be done once, under a set of model assumptions, and need not be repeated unless those model assumptions and parameters are considered no longer valid for real-world deployment. In such a case, additional data would need to be generated to match the new model assumptions and a new scenario tree generated.

4. Real-time deployment of the scenario tree

Assuming that we are comfortable that the data used to generate the scenario tree include the kind of data likely to be encountered in real-world settings, we are now ready to deploy the scenario tree for real-time identification of an epidemic status given some early syndromic data. In this section we present an algorithm to use the scenario tree to predict the path of an epidemic, based only on early syndromic signals. We discuss how to test and measure the accuracy of the model and determine the effectiveness of the syndromic data so we can answer questions 1 to 4 stated in the Introduction.

Fig. 5. Three series displaying epidemic behaviors.

4.1. Real-time filtering

In real time, when a real series γ of syndromic data with length τ′ < τ becomes available (the prime distinguishes the observed length from the final stage τ of the tree), it can be compared to the syndromic data represented in each node of the tree at stage τ′. The set V of nodes at stage τ′ whose distance from γ is less than a parameter d can then be used to estimate the parameters of the underlying epidemic process. Here, we use Euclidean distance between the syndromic data vector γ and the centroid of the time series in each node. If the behavior of the nodes that are children of V at stages t ≥ τ′ is similar to the future behavior of γ, then the procedure will have effectively reduced the uncertainty in the parameter space by eliminating all the nodes at stage τ′ not in V. In other words, if we were to look ahead at a future time to the behavior of γ and we discover that the path the syndromic signal takes is within the subtree of nodes descended from nodes in V, then we can be confident that we can eliminate uncertainty in future behaviors based on very early data by effectively pruning the scenario tree. The effectiveness of the procedure is a function of τ′ and the chosen distance threshold d. If more data are available (τ′ is later), we expect the success rate will improve; increasing d should also improve the success rate, at the cost of a larger V and therefore a smaller reduction in uncertainty.
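The filtering step just described can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the dictionary layout of the nodes, and the numerical values are all hypothetical, and d is treated here as an absolute distance threshold as in the paragraph above.

```python
import math


def centroid(series_list):
    """Component-wise mean of equally long series."""
    n = len(series_list)
    return [sum(s[j] for s in series_list) / n for j in range(len(series_list[0]))]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def filter_nodes(gamma, nodes, d):
    """Return the names of the nodes whose centroid lies within distance d of gamma."""
    return [name for name, members in nodes.items()
            if euclidean(gamma, centroid(members)) < d]


# Hypothetical nodes at the observed stage, each holding the syndromic
# series (truncated to the observed length) of its member scenarios.
nodes = {
    "A": [[0.10, 0.12], [0.12, 0.14]],
    "B": [[0.10, 0.30], [0.14, 0.34]],
    "C": [[0.50, 0.90]],
}
gamma = [0.11, 0.15]  # newly observed syndromic series
V = filter_nodes(gamma, nodes, d=0.2)
```

Raising `d` admits more nodes into `V`, which makes it more likely that the true path is retained but removes less of the tree.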

Testing of the model is essential to gain the confidence to use it for a real-world deployment, and we will present an algorithm for evaluating model performance in Section 4.3. Rather than testing on a single syndromic example, we have performed a more rigorous statistical analysis of some synthetically generated test data, as described in Section 4.2. This also affords us the ability to evaluate the effectiveness of various syndromic signals based on their statistical properties.

4.2. Model evaluation

The ideal syndromic set is one that dramatically reduces the amount of uncertainty over the future dynamics of the epidemic after a short observation. Furthermore, that reduced interval of uncertainty contains the true epidemic behavior if the syndromic signals are effective. There are two measures of interest: the reduction in uncertainty and the likelihood that the prediction is correct. These two quantities are always in conflict, as in any estimation exercise.

Fig. 6. Fourteen series displaying endemic behaviors.

Syndromes may not have sufficient power to reduce the estimation interval using short-length samples. Thus, the decision maker may want to evaluate the effect of the length of the observation window on the performance of the syndromic set.

Another feature of any syndromic set is noise. If a procedure is pedantic regarding the similarity between an observed syndromic vector and a set of syndromic vectors represented by the centroid of a tree node, it can reduce uncertainty significantly but at the cost of reliability. If it is too lax, then the subtree will contain the true syndrome, but uncertainty will not be sufficiently reduced to be useful.

Since this measure of strictness is a parameter that is most likely to be set by the sensitivities of the human operator of the system, it cannot be automatically removed. It is similar to the risk tolerance in a portfolio optimization model in that an arbitrary tolerance can be mandated but probably should not be treated in that manner. Instead, the response of the system to a variety of risk profiles provides more insight.

Therefore, one outcome of each experiment in this research is a pair of three-dimensional surfaces. One surface illustrates how often the procedure is successful as measured by accuracy; i.e., how often the subset of remaining uncertainty contains the original epidemic. The other surface shows how much uncertainty was removed (represented by the size of the resulting tree). Each surface is parameterized by the number of stages of syndromic data used and by the similarity threshold used when deciding which subtree a syndromic signal best matches.

Figure 7 shows these two surfaces for our example scenario tree presented in Section 3. Parameter τ′ is the length of the available data on which to base a prediction. If τ′ is too small, then not enough information has been observed to accurately predict the later behavior of the epidemic. Parameter d is the maximum relative distance (compared to the closest node) between the series and the node centroids that permits a node to stay in the matched subtree. If d is too large, then high success rates are obtained but uncertainty is not significantly reduced.

These surfaces were generated using data calculated in Algorithm 2, presented in the following section. Rather than visually comparing two surfaces, it is more convenient and objective to combine these two notions of accuracy and uncertainty reduction into a single measure for various observed stages τ′. It is natural to model this using a Receiver Operating Characteristic (ROC) curve (Brown and Davis, 2006). The Area Under the Curve (AUC) can be used to summarize the effectiveness of the particular set of syndromes to balance accuracy with uncertainty reduction.

Fig. 7. The prediction success (left) and proportion of non-pruned nodes (right) as a function of the stages available (τ′) and the flexibility in matching nodes (d).
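One simple way to turn a set of (uncertainty, accuracy) pairs into a single AUC value is the trapezoidal rule. The sketch below assumes, for illustration only, that the horizontal axis is the proportion of nodes kept and the vertical axis is the success rate, with one point per value of d; the function name and the sample points are hypothetical and are not taken from the experiments.

```python
def auc_trapezoid(points):
    """Area under a piecewise-linear curve through (x, y) points."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Hypothetical (proportion of nodes kept, success rate) pairs, one per d value.
curve = [(0.0, 0.0), (0.25, 0.5), (0.5, 0.8), (1.0, 1.0)]
area = auc_trapezoid(curve)
```

A curve that rises quickly, meaning high success while keeping few nodes, yields an area close to 1, whereas the diagonal from (0, 0) to (1, 1) yields 0.5.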

Figure 8 shows ROC curves for our example, showing the accuracy of predicting the path of the epidemic by correctly identifying, at early stages, the subtree eventually taken by the epidemic. As more parts of the tree are explored using a larger d value for the distance threshold, a greater number of nodes of the tree are included, and the accuracy is increased, although at the expense of uncertainty, since we are not confident that we know in which part of the tree the epidemic resides. The AUC metric for the quality of the accuracy obtained at stages 2, 3, and 4 is 0.66, 0.73, and 0.73, respectively. A highly effective syndrome will have a higher AUC at an earlier stage, compared with a less effective syndrome. The syndromes used to create the ROC curve in Fig. 8 were based on the statistical testing procedure described in the next section.

Fig. 8. ROC curves for our example.

4.3. Algorithm for evaluating effectiveness of a syndromic set

Our test procedure considers a given stage of available information τ′ and the distance threshold d for deciding if a series is similar to a node centroid. For a given (τ′, d) pair, we performed 100 replications of the following steps and reported the average accuracy and proportion of discarded nodes, as per Fig. 7, which can then be summarized as an ROC curve as per Fig. 8.

1. A new syndromic series γ is generated with parameters in a range similar to those used to create the tree.

2. The similarity of the new series to those represented by the tree is computed using only information up to the observed stage τ′, and the best node (smallest Euclidean distance) is identified. Then the similarity is computed using the future information at the final stage τ. If the best node at stage τ lies within the subtree of the best node at stage τ′, then we can conclude that the pruning of the tree has been successful.

2.1. The distance from the test syndromic series γ to the centroid of all series in each node up to stage τ′ is computed. The best node is identified as the one with the minimum Euclidean distance. V comprises the best node together with the other nodes at the same stage τ′ of the tree whose distance is no larger than a factor d (1 ≤ d ≤ 20) times the distance to the best node. Thus, the set V includes nodes whose centroid series is similar enough to the test series γ, for varying thresholds of similarity d. We can essentially discard the parts of the scenario tree that are not descended from V since we believe that the future path of the series lies within the descendants of V and we can eliminate some uncertainty.

2.2. The same distance is computed between the test epidemic and each of the nodes in the final stage τ. This treats the series as a test, where we know the final path and seek to establish which final node the test epidemic series best matches.

2.3. The procedure is considered a success if one of the nodes picked in step 2.2 is a descendant of one of the nodes picked in step 2.1. In other words, if an early stage syndromic signal is able to assist the pruning of the tree to identify the likely subtree that an epidemic path will evolve within, then the syndromic signal is valuable.

The accuracy (defined as the percentage of the 100 replications that were successful) and the average proportion of nodes eliminated by the filtering process are reported for each (τ′, d) pair. The procedure is described in more detail in Algorithm 2.

Returning to our example, and explaining how Algorithm 2 was used to generate the data for the AUC results shown in Fig. 8, we randomly generated 100 replicates of a syndromic series with properties similar to those of syndrome 2 (absenteeism). That is, α = 0.6, ν = 0.7, μ = 1, with β randomly sampled from the range [0.25, 0.95], a = 50 and b = 2000 for the non-infected population, and a = 20 and b = 100 for the infected population, with a lag of 10 days to the epidemic. Using the notation in Algorithm 2, E and Y are the epidemic and syndromic training data used to generate the scenario tree C6 presented in Fig. 4.

5. Experimental results

We now explore how various experimental conditions affect the AUC and use the testing approach proposed in the previous section to seek answers to the questions posed in Section 1. In addition to the parameters defining our illustrative example, which we now label as Case 1, we consider the following cases with variations in the a and b values of the beta distributions for both syndromes in infected and non-infected populations, as well as variations in lag time for each syndrome. Table 1 shows the parameter values used to define a total of seven cases.

Cases 2 to 4 are used to illustrate the impact that different distributions of syndromic behaviors in infected and non-infected populations have on the ability of the scenario tree approach to recognize the onset of an epidemic. Case 2 presents a highly effective syndromic signal with small variance. It induces a strong response in the infected population and a weak response for the non-infected population. We expect that the scenario tree approach will be able to effectively use this syndromic signal to detect the onset of an epidemic. Case 3 provides a larger variance of the beta distributions, with a high degree of overlap of signal for both infected and non-infected populations. We expect that this will result in lower AUC metrics since it will be more difficult for the clustering process to separate the epidemic from the endemic signals. Case 4 has identical distributions between the infected and non-infected populations for the two syndromes, and consequently we expect the lowest AUC.

Algorithm 2 A heuristic for testing the impact that a set of syndromes has on both the ability of the scenario tree filter to correctly identify states, and eliminate unlikely states. The algorithm is called for 100 replications, with average Success and Pruned reported.

TestSyndromicSet(
    E = {(S_t, I_t)_k | t = 0 ... τ, k = 1 ... |�|}, a collection of epidemic series;
    Y = {(a_i, b_i, a, b_i, λ_i) | i = 1 ... s}, a collection of syndromes;
    C_τ, a scenario tree of syndromic and epidemic data (trained by Algorithm 1 based on E and Y)
)
Randomly generate an epidemic process
Randomly generate a syndromic series
Success ≡ 0, how often does the filtering correctly identify the node
Pruned ≡ 0, how many non-interesting states can the procedure eliminate
for t = 1 : (τ − 1) do
    for d = 1 : 20 by 0.1 do
        δ_vt ≡ distance between the test syndrome and the centroid of the subset of Y in each node v in stage t, using only information up to stage t
        δ_uτ ≡ distance between the test epidemic and the centroid of the subset of E in each node u in last stage τ, using all information available
        w ≡ arg min_u δ_uτ
        V ≡ {v ∈ C_t : δ_vt / min(δ_vt) ≤ d}
        Pruned ≡ 1 − |V| / |C_τ|
        if w ∈ {descendants of V} then
            count as success
        end if
    end for
end for
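The inner step of Algorithm 2 can be sketched for a single replication and a single (t, d) pair as below. This is an illustrative sketch under assumptions, not the authors' implementation: the function `evaluate`, the dictionary layout, and the toy centroids are hypothetical; the relative test δ / min(δ) ≤ d is written as δ ≤ d · min(δ) to avoid dividing by zero; and the denominator of Pruned is passed in as `tree_size`, since what |C_τ| counts is left to the caller here.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def evaluate(test_syndrome, test_epidemic, stage_nodes, leaf_centroids, d, tree_size):
    """Return (success, pruned) for one replication and one (t, d) pair."""
    # delta_vt: distance from the test syndrome to each stage-t centroid.
    delta = {v: euclidean(test_syndrome, node["centroid"])
             for v, node in stage_nodes.items()}
    best = min(delta.values())
    # V: nodes whose distance is within a factor d of the best distance.
    V = [v for v, dist in delta.items() if dist <= d * best]
    # w: final-stage node closest to the test epidemic.
    w = min(leaf_centroids,
            key=lambda u: euclidean(test_epidemic, leaf_centroids[u]))
    success = any(w in stage_nodes[v]["leaves"] for v in V)
    pruned = 1 - len(V) / tree_size
    return success, pruned


# Hypothetical stage-t nodes (syndromic centroids and their final-stage
# descendants) and hypothetical final-stage epidemic centroids.
stage_nodes = {
    "v1": {"centroid": [0.1, 0.1], "leaves": {"u1", "u2"}},
    "v2": {"centroid": [0.2, 0.3], "leaves": {"u3"}},
    "v3": {"centroid": [0.8, 0.9], "leaves": {"u4"}},
}
leaf_centroids = {
    "u1": [0.1, 0.2, 0.3],
    "u2": [0.1, 0.1, 0.1],
    "u3": [0.2, 0.5, 0.9],
    "u4": [0.9, 0.9, 0.9],
}
strict = evaluate([0.12, 0.12], [0.2, 0.5, 0.8], stage_nodes, leaf_centroids, 1, 3)
lax = evaluate([0.12, 0.12], [0.2, 0.5, 0.8], stage_nodes, leaf_centroids, 10, 3)
```

In this toy setting the strict threshold keeps only the closest node and misses the true final node, while the laxer threshold keeps two nodes and succeeds, which mirrors the trade-off between accuracy and uncertainty reduction discussed in Section 4.2.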

Cases 5 to 7 are used to explore the effect of the lag between the onset of an epidemic and the increase of a syndromic signal. Case 5 is similar to Case 2 but with greater difference between the two syndromic levels. It has zero lag and will be used as a comparison with Cases 6 and 7 that are identical to Case 5 but with lag increasing to 5 and 10 days, respectively.

Table 1. Summary of test case parameter values

        Syndrome 1                        Syndrome 2
        Infected    Non-infected  Lag     Infected    Non-infected  Lag
Case    a     b     a     b       (days)  a     b     a     b       (days)
Case 1  50    10    50    200     5       20    100   50    2000    10
Case 2  1000  100   100   1000    0       2000  50    50    2000    0
Case 3  15    10    10    15      0       8     5     5     8       0
Case 4  50    200   50    200     0       50    2000  50    2000    0
Case 5  200   50    50    200     0       2000  50    50    2000    0
Case 6  200   50    50    200     5       2000  50    50    2000    5
Case 7  200   50    50    200     10      2000  50    50    2000    10

In all cases, we used both syndromes and the simulated epidemic time series (from the stochastic SIR model) to generate the scenario tree for each case, but we tested the approach using only data distributed according to the same distribution as syndrome 2. Table 2 summarizes the AUC metrics for each case at various stages of attempting to detect the onset of an epidemic. Clearly, the later we leave the decision about whether we believe an epidemic has occurred, the more accurate we are in identifying the path an epidemic will take. This is supported by the AUC generally increasing as the stage increases (the only exception in Table 2 is a slight dip for Case 1 between stages 3 and 4). Moreover, the variation in results between cases needs to be explored in order to understand the impact that syndromic distributions have on epidemic and endemic signals and the effect that lag between signals has on the ability to accurately predict the epidemic behavior.

Comparing the effect of the beta distribution parameters in Cases 2 to 4, where there is no lag, Case 2 obtains the highest predictive accuracy at each stage. This is expected when we are using a syndromic signal that clearly distinguishes between the infected and non-infected populations. As the distributions between infected and non-infected populations start to become more similar, we see that the accuracy decreases (Case 3) to the point where the accuracy is lowest (Case 4) when the distributions are identical. Comparing Case 5, which has zero lag, with Cases 6 and 7 with increasing lag between the epidemic and the syndromic signals, we observe that as the syndromic signal is delayed it becomes harder to accurately predict the onset of an epidemic within the first four stages (stage 4 is day 6, stage 3 is day 3, stage 2 is day 2 in these experiments). These results are common sense and what we expect to see as we balance the need for accuracy with certainty of prediction.

Table 2. AUC for ROC curves of each case at stages 2, 3, and 4 of the epidemic

Stage   Case 1   Case 2   Case 3   Case 4   Case 5   Case 6   Case 7
2       0.6763   0.7352   0.7005   0.6553   0.6849   0.6693   0.6561
3       0.7303   0.8155   0.7059   0.7263   0.7935   0.6999   0.6927
4       0.7260   0.8481   0.8086   0.7733   0.8445   0.8041   0.7396

6. Conclusions

This article has proposed the use of scenario trees to indicate the likely path taken by an epidemic, based on both epidemic and syndromic signals. The epidemic has been modeled using a stochastic SIR model with inhomogeneous mixing due to an underlying social network. The syndromes have been modeled using beta distributions, with lags from the onset of the epidemic. How effectively can these syndromic signals provide early warning of an epidemic outbreak? Naturally, this depends on how useful the syndromes are for distinguishing between the infected and non-infected populations, and a variety of cases have been explored. Where the syndromes are useful, the proposed methodology enables higher accuracy in predicting the future path of an epidemic, despite the uncertainty in the early stages. The uncertainty in the signals and their future trajectory is managed through clustering, to recognize which signals are likely to end up with a similar outcome, regardless of their initial variation. The accuracy of this summarization process, that is, whether the epidemic ends up in the subtree that we predict, depends on how vigilantly we cluster the time series by carefully applying similarity metrics, and this is user-controlled. By clustering various time series to arrive at a scenario tree we have adopted a lean data structure that replaces the need to use ensembles of models.

In the scenario tree approach, we pay for the reduction in complexity (from removing explicit models for syndromic behavior) by requiring exponentially more storage. We do not pay a significant price in terms of computational time, only in terms of computational space, since the data structures can be implemented efficiently and manipulated mostly in order O(log n) where n is the number of states.


The power of the proposed approach has been illustrated with a series of experimental cases designed to explore whether the scenario trees produce the kind of results we expect, and our results have confirmed that they indeed do. The balance between accuracy and certainty in predicting future outcomes has been demonstrated via ROC curves. We have confirmed that the distributions of the syndromic signals have the expected impact on the accuracy of the selected subtree, and that the effect of lag between the epidemic and syndromic signals can be well understood.

The proposed approach, with an updated scenario tree trained on data with distributions matching a particular real-world community, can be used rapidly in real-time deployment without the need to run an ensemble of models or to devise new methods for handling the uncertainty in the prediction of any one model. Besides its utility as a prediction tool, the proposed approach is also useful for exploring the effectiveness of different syndromic data streams for predicting epidemics, as well as understanding the epidemic model parameters that can be inferred from the observed data.

A limitation of the current results presented in this article is that a restricted set of model assumptions and parameters has been used to generate the data and the subsequent scenario tree. We have tested the ability of this scenario tree to accurately predict the path an epidemic is likely to take (or at least to eliminate large regions of uncertainty), but we have assumed that the test data follow the same model assumptions and parameters used to build the scenario tree. Naturally, for the broadest possible applicability of the scenario tree approach, we should ensure that a rich variety of modeling approaches is used to generate the data (Lloyd, 2001) and that a wide selection of plausible parameters is used. The proposed approach remains the same once the data generation process is augmented, and we will be considering an augmented data generation process in future extensions of this work. Alternative clustering approaches, apart from the k-means algorithm, could be used for clustering the scenario tree in future extensions of this work, including clustering approaches that do not require the number of clusters to be specified a priori. Additionally, we could consider using an augmented set of features of the time series representing an epidemic, going beyond the number of infected and recovered individuals to characterize the path of an epidemic based on statistical properties of the duration of infection. In future research we will also be extending this framework to demonstrate how the scenario tree data structure can be used within a stochastic optimization model to assist decision support around intervention strategies for epidemic management.

Acknowledgements

The authors are grateful to Peter Dawson (DSTO) for providing useful discussion on this research and the three anonymous referees whose suggestions greatly improved the clarity of the article.

Funding

This project was supported by a DSTO Bioterrorism Preparedness Corporate Enabling Research Program grant, 2009–2011.

References

Allen, L. and Burgin, A. (2000) Comparison of deterministic and stochastic SIS and SIR models in discrete time. Mathematical Biosciences, 163(1), 1–34.

Anderson, R.M. and May, R.M. (1979) Population biology of infectious diseases: part 1. Nature, 280, 361–367.

Bettencourt, L. and Ribeiro, R. (2008) Real time Bayesian estimation of the epidemic potential of emerging infectious diseases. PLoS One, 3(5), e2185.

Brown, C. and Davis, H. (2006) Receiver operating characteristics curves and related decision measures: a tutorial. Chemometrics and Intelligent Laboratory Systems, 80(1), 24–38.

Buckeridge, D., Burkom, H., Moore, A., Pavlin, J., Cutchis, P. and Hogan, W. (2004) Evaluation of syndromic surveillance systems: design of an epidemic simulation model. MMWR Morbidity and Mortality Weekly Report, 53, 137–143.

Carrat, F., Luong, J., Lao, H., Salle, A., Lajaunie, C. and Wackernagel, H. (2006) A “small-world-like” model for comparing interventions aimed at preventing and controlling influenza pandemics. BMC Medicine, 4, 26.

Cazelles, B. and Chau, N. (1997) Using the Kalman filter and dynamic models to assess the changing HIV/AIDS epidemic. Mathematical Biosciences, 140(2), 131–154.

Chen, H., Zeng, D. and Yan, P. (2009) Infectious Disease Informatics: Syndromic Surveillance for Public Health and Biodefense, volume 21, Springer, New York, NY.

Cooper, G., Dash, D., Levander, J., Wong, W., Hogan, W. and Wagner, M. (2004) Bayesian biosurveillance of disease outbreaks, in Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, AUAI Press, Arlington, VA, pp. 94–103.

Dailey, L., Watkins, R. and Plant, A. (2007) Timeliness of data sources used for influenza surveillance. Journal of the American Medical Informatics Association, 14(5), 626–631.

Daley, D. and Gani, J. (1996) Epidemic Modelling, Cambridge University Press, Cambridge, UK.

Dangerfield, C.E., Ross, J.V. and Keeling, M.J. (2009) Integrating stochasticity and network structure into an epidemic model. Journal of the Royal Society Interface, 6(38), 761–774.

Di Domenica, N., Mitra, G., Valente, P. and Birbilis, G. (2007) Stochastic programming and scenario generation within a simulation framework: an information systems perspective. Decision Support Systems, 42(4), 2197–2218.

Dupacova, J., Consigli, G. and Wallace, S. (2000) Scenarios for multistage stochastic programs. Annals of Operations Research, 100(1), 25–53.

Friedman, J. and Fisher, N. (1999) Bump hunting in high-dimensional data. Statistics and Computing, 9(2), 123–143.

Ginsberg, J., Mohebbi, M., Patel, R., Brammer, L., Smolinski, M. and Brilliant, L. (2008) Detecting influenza epidemics using search engine query data. Nature, 457(7232), 1012–1014.

Henrion, R., Kuchler, C. and Romisch, W. (2009) Scenario reduction in stochastic programming with respect to discrepancy distances. Computational Optimization and Applications, 43(1), 67–93.

Hood, G., Barry, S. and Martin, P. (2009) Alternative methods for computing the sensitivity of complex surveillance systems. Risk Analysis, 29(12), 1686–1698.

Høyland, K. and Wallace, S. (2001) Generating scenario trees for multistage decision problems. Management Science, 47(2), 295–307.

HSPD-21. (2007) Homeland security presidential directive. Available at http://www.fas.org/irp/offdocs/nspd/hspd-21.htm.

Hulth, A., Rydevik, G. and Linde, A. (2009) Web queries as a source for syndromic surveillance. PLoS One, 4(2), e4378 (accessed May 8, 2012).

Hurt-Mullen, K. and Coberly, J. (2005) Syndromic surveillance on the epidemiologist’s desktop: making sense of much data. MMWR Morbidity and Mortality Weekly Report, 54, 141–146.

Kaut, M. and Wallace, S. (2007) Evaluation of scenario-generation methods for stochastic programming. Pacific Journal of Optimization, 3(2), 257–271.

Lloyd, A. (2001) Realistic distributions of infectious periods in epidemic models: changing patterns of persistence and dynamics. Theoretical Population Biology, 60(1), 59–71.

Lotze, T. and Shmueli, G. (2008) Ensemble forecasting for disease outbreak detection, in Proceedings of the 23rd AAAI Conference on Artificial Intelligence, Fox, D. and Gomes, C.P. (eds), AAAI Press, Palo Alto, CA, pp. 1470–1471.

Lotze, T., Shmueli, G. and Yahav, I. (2009) Simulating and evaluating biosurveillance datasets, in Biosurveillance: Methods and Case Studies, Kass-Hout, T. and Zhang, X. (eds), Chapman & Hall/CRC, Boca Raton, FL, pp. 23–52.

Maciejewski, R., Hafen, R., Rudolph, S., Tebbetts, G., Cleveland, W., Grannis, S. and Ebert, D. (2009) Generating synthetic syndromic-surveillance data for evaluating visual-analytics techniques. IEEE Computer Graphics and Applications, 29(3), 18–28.

Mondini, A., de Moraes Bronzoni, R., Nunes, S., Neto, F., Massad, E., Alonso, W., Lazzaro, E., Ferraz, A., de Andrade Zanotto, P. and Nogueira, M. (2009) Spatio-temporal tracking and phylodynamics of an urban dengue 3 outbreak in Sao Paulo, Brazil. PLoS Neglected Tropical Diseases, 3(5), e448.

Novozhilov, A.S. (2008) On the spread of epidemics in a closed heterogeneous population. Mathematical Biosciences, 215(2), 177–185.

Paul, M., Held, L. and Toschke, A. (2008) Multivariate modelling of infectious disease surveillance data. Statistics in Medicine, 27(29), 6250–6267.

Pollett, P., Dooley, A. and Ross, J. (2010) Modelling population processes with random initial conditions. Mathematical Biosciences, 223(2), 142–150.

Ristic, B., Skvortsov, A. and Morelande, M. (2009) Predicting the progress and the peak of an epidemic, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE Press, Piscataway, NJ, pp. 513–516.

Roy, M. and Pascual, M. (2006) On representing network heterogeneities in the incidence rate of simple epidemic models. Journal of Ecological Complexity, 3(1), 80–90.

Schindeler, S., Muscatello, D., Ferson, M., Rogers, K., Grant, P. and Churches, T. (2009) Evaluation of alternative respiratory syndromes for specific syndromic surveillance of influenza and respiratory syncytial virus: a time series analysis. BMC Infectious Diseases, 9(1), 190.

Skvortsov, A. and Ristic, B. (2012) Monitoring and prediction of an epidemic outbreak using syndromic observations. Mathematical Biosciences, 240, 12–19.

Skvortsov, A., Ristic, B. and Woodruff, C. (2010) Predicting an epidemic based on syndromic surveillance, in Proceedings of the 13th Conference on Information Fusion, IEEE Press, Piscataway, NJ, pp. 1–8.

Sparks, R., Keighley, T. and Muscatello, D. (2010) Early warning CUSUM plans for surveillance of negative binomial daily disease counts. Journal of Applied Statistics, 37(11), 1911–1929.

Sparks, R., Okugami, C. and Bolt, S. (2012) Outbreak detection of spatio-temporally smoothed crashes. Open Journal of Safety Science and Technology, 2(3), 98–107.

Van Herwaarden, O.A. and Grasman, J. (1995) Stochastic epidemics: major outbreaks and the duration of the endemic period. Journal of Mathematical Biology, 33(4), 581–601.

Wagner, M., Moore, A. and Aryel, R. (2011) Handbook of Biosurveillance, Elsevier Academic Press, Burlington, MA.

Wilson, A., Wilson, G. and Olwell, D. (2006) Statistical Methods in Counterterrorism: Game Theory, Modeling, Syndromic Surveillance, and Biometric Authentication, Springer, New York, NY.

Wilson, N., Mason, K., Tobias, M., Peacey, M., Huang, Q. and Baker, M. (2008) Interpreting Google flu trends data for pandemic H1N1 influenza: the New Zealand experience. Euro Surveillance: European Communicable Disease Bulletin, 14(44), 429–433.

Zheng, W., Aitken, R., Muscatello, D. and Churches, T. (2007) Potential for early warning of viral influenza activity in the community by monitoring clinical diagnoses of influenza in hospital emergency departments. BMC Public Health, 7(1), 250.

Biographies

Ralph Gailis received his B.Sc. (Hon) in Physics and Mathematics at the University of Melbourne in 1992. He completed his Ph.D. in Theoretical Physics at the University of Melbourne in 1996, followed by a postdoctoral appointment in the same department on cosmological gravitational structure formation. He joined the Defence Science and Technology Organisation (DSTO) in 1999 to work in the area of Chemical, Biological, and Radiological (CBR) hazard assessment, particularly atmospheric dispersion modeling. His interests and responsibilities have broadened in the subsequent years to include CBR data fusion, radiological defence, and operations research. His current research and leadership responsibilities focus particularly on aspects of data fusion and stochastic modeling applied to disease surveillance and atmospheric hazard assessment.

Ajith Gunatilaka received his B.Sc. (Engineering) degree with First Class Honours in Electronics and Telecommunication Engineering from the University of Moratuwa, Sri Lanka, in 1990. He received his Ph.D. in Electrical Engineering from the Ohio State University in 2000. In November 2005, he joined the Defence Science and Technology Organisation, where he is engaged in Chemical, Biological, and Radiological (CBR) hazard modeling and CBR data fusion research.

Leo Lopes received his Ph.D. in Industrial Engineering from Northwestern University in 2003. He has been an Assistant Professor at the University of Arizona and a Research Fellow at Monash University and is currently a senior operations research specialist at the SAS Institute, North Carolina. His research expertise spans stochastic optimization and mathematical programming, with a primary research interest in helping people build and use realistic models efficiently.

Alex Skvortsov holds a Ph.D. degree in Theoretical Physics from Moscow University of Applied Physics and Technology. He has been working at the Defence Science and Technology Organisation (DSTO) since 2005 in the area of mathematical modeling. His areas of research include atmospheric dispersion, hazard source backtracking, sensor systems, and mathematical biology.

Kate Smith-Miles received a B.Sc. (Hons) in Mathematics in 1993 and a Ph.D. from the University of Melbourne in 1996. Her academic career has spanned the disciplines of information technology, engineering, and mathematics, and she has held professorial positions in all three of these disciplines at Monash University and Deakin University. She is currently Professor and Head of the School of Mathematical Sciences at Monash University in Melbourne, Australia. Her research interests include combinatorial optimization, data mining, neural networks, and mathematical modeling of problems ranging from neuroscience to economic applications.
