arXiv:2110.12273v4 [stat.AP] 5 Jan 2022


Inferring the sources of HIV infection in Africa from deep-sequence data with semi-parametric Bayesian Poisson flow models

Xiaoyue Xi

Department of Mathematics, Imperial College London, London SW72AZ, United Kingdom.

Simon EF Spencer

Department of Statistics, University of Warwick, Coventry CV47AL, United Kingdom.

Matthew Hall

Big Data Institute, University of Oxford, Oxford OX3 7LF, United Kingdom.

M Kate Grabowski

Department of Pathology, Johns Hopkins University, Baltimore, MD, USA; Rakai Health Sciences Program, Entebbe, Uganda.

Joseph Kagaayi

Rakai Health Sciences Program, Entebbe, Uganda.

Oliver Ratmann

Department of Mathematics, Imperial College London, London SW72AZ, United Kingdom.

E-mail: [email protected]

on behalf of Rakai Health Sciences Program and PANGEA-HIV

Summary. Pathogen deep-sequencing is an increasingly routinely used technology in infectious disease surveillance. We present a semi-parametric Bayesian Poisson model to exploit these emerging data for inferring infectious disease transmission flows and the sources of infection at the population level. The framework is computationally scalable in high-dimensional flow spaces thanks to Hilbert Space Gaussian process approximations, allows for sampling bias adjustments, and estimation of gender- and age-specific transmission flows at finer resolution than previously possible. We apply the approach to densely sampled, population-based HIV deep-sequence data from Rakai, Uganda, and find substantive evidence that adolescent and young women are predominantly infected through age-disparate relationships.


1. Introduction

1.1. Inferring the sources, sinks and hubs of transmission flows to aid the design of HIV prevention interventions

HIV remains one of the largest public health threats, especially in sub-Saharan Africa where approximately 61% of all new cases worldwide occur (UNAIDS, 2019). In recent years, rates of incident cases have overall dropped considerably with the widespread adoption of prevention interventions such as voluntary medical male circumcision (VMMC) to reduce the risk of HIV acquisition in men, or immediate provision of antiretroviral therapy (ART) to suppress the virus in infected individuals and thereby stop onward transmission (Cohen et al., 2011; Grabowski et al., 2017; Hayes et al., 2019), although they remain well above UNAIDS thresholds for elimination (UNAIDS, 2018).

Within Africa, there is increasing focus on identifying groups of individuals that are at high risk of acquiring HIV and at high risk of spreading the virus with the goal of targeted control interventions to these groups (Abeler-Dörner et al., 2019). Conceptually, the first step in this strategy is to break down the epidemic into source, sink and hub populations, according to the transmission flows that occur between them (Figure 1). Sources are population groups that disproportionately pass on infection, sinks are groups that disproportionately acquire infection, and hubs are both sources and sinks. The population groups can be defined in various ways.

For example, Dwyer-Lindgren et al. (2019) provided sub-national estimates of HIV prevalence across Africa, adding to data showing that the epidemic is highly heterogeneous across Africa, with small areas of very high prevalence (i.e., hotspots) that are surrounded by neighbouring areas with substantially lower prevalence. Although often assumed, it is unclear if hotspots are also sources of epidemic spread to neighbouring lower-prevalence communities (Ratmann et al., 2020). Here, study populations are divided into individuals living in high-prevalence areas ($h$) and low-prevalence areas ($l$), and then transmission flows are estimated within and between them,

$$
\pi = \begin{pmatrix} \pi_{hh} & \pi_{hl} \\ \pi_{lh} & \pi_{ll} \end{pmatrix}, \tag{1}
$$

where $\pi_{ab}$ is the proportion of transmission flows from group $a$ to group $b$, subject to $\sum_{a,b} \pi_{ab} = 1$.

[Figure: panel A sketches source, sink and hub populations; panel B sketches a deep-sequence phylogeny for individuals u, v, w and x.]

Fig. 1. Analysis aims and sketch of deep-sequence phylogenetic data to address these aims. (A) The overall aim of phylogenetic source attribution analyses is to infer how pathogens are passed on between population groups. Conceptually populations can be divided into source populations that predominantly transmit disease, sink populations that predominantly receive infection, and hub populations that both disproportionally transmit and receive infections. (B) Viral deep-sequencing generates many sequence samples per host, which can be used to establish phylogenetic orderings between individuals, and thereby estimate the direction of pathogen spread among sampled individuals. The figure sketches a deep-sequence phylogeny of pathogen sequences from individuals u, v, w, and x. Each tip (diamonds) represents a unique sequence, and the size of the tip copy number. Black tips correspond to out-of-sample reference sequences. Phylogenetic lineages are attributed to individuals (colours) using ancestral state reconstruction. Black lineages cannot be attributed to individuals. The subgraph of the tree associated with individual u is ancestral to that of v, suggesting that infection spread from u to v potentially via unsampled intermediates. Individual w has five subgraphs, some of which are ancestral to those of x and some of which are descendent from those of x, indicating a complex ordering from which the direction of infection spread cannot be inferred. Subgraphs of v and w are not phylogenetically adjacent (disconnected), suggesting that one did not infect the other. With such information from a population-based sample of infected individuals, it is possible to quantify population-level transmission flows, sources, sinks, and hubs.

Another prominent application concerns the interruption of infection cycles between men and women of different ages. De Oliveira et al. (2017) proposed the scenario that young women aged <25 years are predominantly infected by older men aged 25-40 years, and later spread the virus to similarly aged men in their late twenties and early thirties. Here, study populations are divided into sex-specific age groups (we consider 1-year age groups between 15 and 49 years), and then transmission flows between and within age groups are estimated,

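$$
\pi = \begin{pmatrix} \pi^{mf} & 0 \\ 0 & \pi^{fm} \end{pmatrix}, \quad
\pi^{mf} = \begin{pmatrix} \pi^{mf}_{11} & \cdots & \pi^{mf}_{1K} \\ \vdots & \ddots & \vdots \\ \pi^{mf}_{K1} & \cdots & \pi^{mf}_{KK} \end{pmatrix}, \quad
\pi^{fm} = \begin{pmatrix} \pi^{fm}_{11} & \cdots & \pi^{fm}_{1K} \\ \vdots & \ddots & \vdots \\ \pi^{fm}_{K1} & \cdots & \pi^{fm}_{KK} \end{pmatrix}, \tag{2}
$$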


where $\pi^{mf}_{ab}$ is the proportion of transmissions from men in age band $a$ to women in age band $b$, and similarly for $\pi^{fm}_{ab}$. We consider here only male-female transmission flows because we found no evidence of male-male transmission in previous analyses (Ratmann et al., 2019) and sexual transmission between women is extremely rare. The flow matrix (2) has $2K^2$ non-zero entries to estimate, which for 1-year age bands amounts to 2450 variables. Important summary statistics are the vector of sources of infection in group $b$ individuals ($\delta^b$), for example in young women aged 20 years; the vector of recipients of infection from group $a$ individuals ($\omega^a$), for example from men aged 25 years; and flow ratios from $a$ to $b$ ($\gamma_{ab}$), for example the ratio of transmissions from high-prevalence to low-prevalence areas compared to transmissions from low-prevalence to high-prevalence areas. Respectively, these quantities are defined by

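\begin{align}
\delta^b &= (\delta^b_a)_{a \in \mathcal{A}}, & \delta^b_a &= \pi_{ab} \Big/ \sum_c \pi_{cb}; \tag{3a} \\
\omega^a &= (\omega^a_b)_{b \in \mathcal{A}}, & \omega^a_b &= \pi_{ab} \Big/ \sum_c \pi_{ac}; \tag{3b} \\
\gamma_{ab} &= \pi_{ab} / \pi_{ba}. \tag{3c}
\end{align}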
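As an illustration, the summary statistics (3a)-(3c) can be computed directly from a flow matrix. The following is a minimal Python sketch; the function name `flow_summaries` and the 2 × 2 example values are hypothetical and not taken from the analysis, and all flows are assumed strictly positive so that no division by zero occurs.

```python
import numpy as np

def flow_summaries(pi):
    """Return sources, recipients and flow ratios for a flow matrix pi.

    pi[a, b] is the proportion of transmissions from group a to group b.
    Assumes strictly positive entries (no structural zeros).
    """
    pi = np.asarray(pi, dtype=float)
    # (3a) sources of infection in group b: normalise each column.
    delta = pi / pi.sum(axis=0, keepdims=True)
    # (3b) recipients of infection from group a: normalise each row.
    omega = pi / pi.sum(axis=1, keepdims=True)
    # (3c) flow ratios gamma[a, b] = pi[a, b] / pi[b, a].
    gamma = pi / pi.T
    return delta, omega, gamma

# Hypothetical flows between high-prevalence (index 0) and
# low-prevalence (index 1) areas; values are for illustration only.
pi = np.array([[0.5, 0.2],
               [0.1, 0.2]])
delta, omega, gamma = flow_summaries(pi)
```

In this toy example, `gamma[0, 1]` is the ratio of high-to-low over low-to-high transmissions, and each column of `delta` and each row of `omega` sums to one.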

Flow matrices of the form (1-2) are also of central interest to characterise human migration flows between countries (Raymer et al., 2013), transport flows between locations (Tebaldi and West, 1998), bacterial migration in humans (Ailloud et al., 2019), or human contact intensities (van de Kassteele et al., 2017), and are alternatively referred to as origin-destination matrices (Hazelton, 2001; Miller et al., 2019).

1.2. Inference from pathogen sequence data

Transmission flow matrices have been estimated from contact tracing or survey data on partner characteristics, though these data are often subject to reporting and/or social desirability biases, especially for sexually transmitted diseases that are associated with stigma or remain criminalised in many countries (Barré-Sinoussi et al., 2018). Here, we are concerned with estimating the quantities (1-3) from pathogen sequences, which are considered an objective marker of disease flow. For fast-evolving pathogens like HIV, mutations accrue quickly and the phylogenetic relationship of pathogen sequences can be used to evaluate many aspects of transmission dynamics such as the origins of HIV (Faria et al., 2014), the contribution of different disease phases to onward spread (Volz et al., 2013; Ratmann et al., 2016), or outbreak detection (Poon et al., 2016).

Traditionally, HIV sequences are obtained through Sanger sequencing, which returns for each sample one consensus nucleotide sequence that captures the entire viral diversity in the sample from one individual. The genetic distance between two consensus sequences can be used to estimate if the corresponding two individuals are epidemiologically closely related; however, the data are insufficient to estimate the direction of transmission between any two sampled individuals (Leitner and Romero-Severson, 2018). For this reason most methods infer transmission flows indirectly from statistics of the entire phylogeny, usually the coalescent times (i.e. the times when two lineages coalesce into one, backwards in time) and the disease states of infected individuals at time of sampling (such as location or age in the two applications discussed above). In the mugration model (Lemey et al., 2009), the states of viral lineages at any time are described with a continuous-time Markov chain (CTMC) that is independent of the evolutionary process. Flow estimates between groups can be obtained from the posterior distribution of the transition rates of the CTMC model via MCMC sampling, as well as posterior estimates of the phylogeny and the states of its lineages, which are latent variables in the model (Lemey et al., 2009). The MultiTypeTree model (Vaughan et al., 2014) removes the assumption that the evolutionary history of the genealogy is independent of population structure; however, sampling correlated latent phylogenies and state histories is often computationally infeasible. This limitation is addressed with the structured coalescent of Volz et al. (2009), which integrates over the state histories of the phylogeny and describes the marginal probabilities of each viral lineage to be in a particular state at a particular time. The changes in the state probabilities along lineages and through coalescent events are derived under ordinary differential equation (ODE) models of disease spread. The flow parameters are obtained as by-products of the estimated latent states and parameters of the compartmental model, and in general vary in time over the phylogenetic history. Adopting the marginalisation approach, a greater range of flow models and data sets can be analysed, though computational run-times often remain on the order of several weeks for data from hundreds of individuals (Vaughan et al., 2014; Volz et al., 2013).

An emerging strategy for estimating transmission flows involves phylogenetic analysis of multiple distinct pathogen sequences per infected host, because such data make it possible to attribute sets of viral lineages to individuals and infer the ancestral relationships between them, which can provide direct evidence on the direction of transmission between two individuals (Leitner and Romero-Severson, 2018). Such analyses are becoming broadly applicable, because deep sequencing technology now allows generating thousands to millions of distinct pathogen sequence fragments per sample (Gall et al., 2012; Zhang et al., 2020). Prior work focused on software development (Wymant et al., 2017; Skums et al., 2018), validation of the bioinformatics protocol for inferring the direction of transmission (Ratmann et al., 2019; Zhang et al., 2020), and reconstruction of partially observed transmission networks at the population level (Ratmann et al., 2019).


1.3. Semi-parametric Poisson flow models

The starting point of this paper is the output of a typical deep-sequence phylogenetic analysis (Wymant et al., 2017), which includes, for each ordered pair of sampled individuals, a viral phylogenetic measure in $[0, 1]$ giving a score that transmission occurred from the first to the second individual, possibly via unsampled intermediate individuals (phylogenetic direction scores). This is advantageous, because first, in this canonical form the data enable us to present the estimation problem in terms of a class of Bayesian Poisson models for non-Gaussian flow data that can flexibly describe a range of epidemiological questions including transmission in space (1), or by age and sex (2), similar to the Poisson models used for estimating transport or migration flows (Tebaldi and West, 1998; Raymer et al., 2013). Second, the models can account for multi-level sampling heterogeneity, which is typically present in population-based disease occurrence data but was not emphasised in phylogenetic analysis. We leverage Bayesian data augmentation to adjust for sampling heterogeneity (Givens et al., 1997), and exploit the fact that the additional latent variables can be integrated out in our framework, so that computational inference remains inexpensive. Third, while typical phylodynamic approaches are limited to estimating transmission flows between coarse population strata, for example by age brackets 15-24 years and 25-40 years (De Oliveira et al., 2017; Le Vu et al., 2019), we can employ Gaussian-process-based regularisation techniques to capture fine detail in transmission flows by annual age increments. Specifically, we propose using recently developed Hilbert Space Gaussian Process (HSGP) approximations to ensure the regularisation priors remain computationally tractable (Solin and Särkkä, 2020). This brings our approach into the form of semi-parametric Bayesian Poisson models, which enable inference of high-resolution transmission flows similar to the Integrated Nested Laplace Approximations used for inferring high-resolution human contact matrices (van de Kassteele et al., 2017).

In Section 2, we introduce our notation and develop the semi-parametric Bayesian Poisson flow model. In Sections 3.1-3.2, we assess the performance of the Poisson model in estimating transmission flows from sampling-biased data, and identify suitable HSGP approximations. Sections 3.3-3.6 illustrate our approach on HIV deep sequence data from the Rakai Community Cohort Study (RCCS) of the Rakai Health Sciences Program, situated in south-eastern Uganda (Grabowski et al., 2017). Between August 10, 2011 and January 30, 2015, virus from 2652 HIV-infected individuals could be deep-sequenced, and 293 pairs of individuals with phylogenetically strong support for the direction of transmission were identified. We demonstrate that the new type of phylogenetic data and our statistical model enable estimation of age- and gender-specific transmission flows at finer detail than previously possible while remaining computationally scalable. Particular attention is given to potential sampling biases, and we propose a hierarchical model of the sequence sampling cascade for analysis of transmission flows. Section 4 closes with a discussion.

[Figure: panel A is a map of Rakai District showing community type (inland communities, fishing sites); panel B shows source-recipient pairs from female to male and from male to female, with age of source (15 to 50) on the horizontal axis and age of recipient (15 to 50) on the vertical axis.]

Fig. 2. Location of the Rakai Community Cohort Study, and data. (A) Location of Rakai District (red) in south-eastern Uganda at the shores of Lake Victoria. HIV surveillance data were obtained from two survey rounds in 36 inland communities of the Rakai Community Cohort Study (green circles) and three survey rounds in the four main fishing communities within 3 km of Lake Victoria (green triangles) between August 10, 2011 and January 30, 2015. (B) The study did a household census, and all individuals aged 15-49 years capable of providing informed consent and resident for at least 1 month with the intention to stay were invited to participate. Viral deep-sequencing was performed on plasma blood samples from HIV infected participants who reported no ART use. Deep-sequence phylogenetic analysis returned phylogenetic transmission scores between individuals, and 293 pairs had strong support of phylogenetic linkage and transmission direction. 173 pairs were male-to-female and 120 were female-to-male. The figures show the phylogenetically likely source-recipient pairs by age of the source and recipient at the midpoint of the study period.


2. Methodology

2.1. Notation and Definitions

In this section we present the notation that is used to estimate transmission flows in a population $\mathcal{P}$ of size $N$, during a study period $T = [t_1, t_2]$. We define by $i = 1, \dots, N$ the identifier of infected individuals in $\mathcal{P}$ during $T$. The transmission events during the study period can thus be described in an $N \times N$ binary matrix $Z$, where $z_{ij} = 1$ denotes transmission from person $i$ to person $j$, and $z_{ij} = 0$ denotes no transmission. The transmission matrix is not symmetric, and diagonal entries are zero.

We estimate transmission flows between population strata, and denote the strata by $a$ and the set of strata by $\mathcal{A}$, which is of dimension $A > 0$. The number of transmission events in $T$ from group $a$ to group $b$ is $z_{ab} = \sum_{i \in a, j \in b} z_{ij}$, and the primary object of interest is the $A \times A$ flow matrix $\pi$ with entries $\pi_{ab} = z_{ab}/z_+$ where $z_+ = \sum_{a,b} z_{ab}$. The flow matrix is in general not symmetric, and is subject to $\sum_{a,b \in \mathcal{A}} \pi_{ab} = 1$. The matrix may contain structural zeros, for example in the case of HIV female-to-female transmission is extremely unlikely. We denote the number of structurally non-zero entries by $L$, which satisfies $L \le A^2$.
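The aggregation of individual-level transmission events into group-level flows can be sketched as follows. This is a minimal illustration under assumed inputs: the function name `flow_matrix`, the integer coding of strata, and the four-person example are hypothetical.

```python
import numpy as np

def flow_matrix(Z, strata, n_strata):
    """Aggregate an N x N binary transmission matrix into group-level flows.

    Z[i, j] = 1 if person i infected person j; strata[i] is the integer
    label (0, ..., n_strata - 1) of the stratum of person i.
    """
    Z = np.asarray(Z)
    strata = np.asarray(strata)
    # Membership matrix M[i, a] = 1 if individual i belongs to stratum a.
    M = np.eye(n_strata, dtype=int)[strata]
    # z[a, b] = sum of Z[i, j] over i in a and j in b.
    z = M.T @ Z @ M
    # pi[a, b] = z[a, b] / z_plus.
    return z, z / z.sum()

# Hypothetical example: persons 0 and 1 are in stratum 0, persons 2 and 3
# in stratum 1, with transmissions 1 -> 0, 0 -> 2 and 2 -> 3.
Z = np.zeros((4, 4), dtype=int)
Z[1, 0] = Z[0, 2] = Z[2, 3] = 1
z, pi = flow_matrix(Z, strata=[0, 0, 1, 1], n_strata=2)
```

Here the stratum 1 to stratum 0 entry is zero only because no such event occurs in the toy data, not because it is a structural zero.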

In general the flow matrix is time-dependent due to changes in population composition and varying transmission rates (Anderson and May, 1992). For instance in a compartment model of susceptible ($S$), infected ($I$) and treated ($T$) men and women of high ($h$) and low risk ($l$) of onward transmission, the ODEs pertaining to the male ($m$) high risk population are

$$
\begin{aligned}
\dot{S}_{mh}(t) &= -\lambda(t) S_{mh}(t) + \mu - \mu S_{mh}(t), \\
\dot{I}_{mh}(t) &= \lambda(t) S_{mh}(t) - \gamma(t) I_{mh}(t) - \mu I_{mh}(t), \\
\dot{T}_{mh}(t) &= \gamma(t) I_{mh}(t) - \mu T_{mh}(t),
\end{aligned} \tag{4}
$$

where the force of infection is $\lambda(t) = \beta_{fh}(t) I_{fh}(t)/N_{fh}(t) + \beta_{fl}(t) I_{fl}(t)/N_{fl}(t)$, the birth/death rate $\mu$ is constant, and the viral suppression rate $\gamma$ and transmission rates $\beta_{fh}$, $\beta_{fl}$ are time-dependent. The actual, unobserved number of transmissions from high risk women to high risk men in $T = [t_1, t_2]$ is

$$
z_{fh,mh}([t_1, t_2]) = \int_{t_1}^{t_2} \beta_{fh}(t) I_{fh}(t) S_{mh}(t) / N_{fh}(t) \, dt, \tag{5}
$$

and the corresponding proportion of transmissions is

$$
\pi_{fh,mh}([t_1, t_2]) = \frac{z_{fh,mh}([t_1, t_2])}{Z([t_1, t_2])}, \tag{6}
$$

where $Z([t_1, t_2])$ is the sum of transmission events in $T = [t_1, t_2]$. Here, we focus on estimating transmission flows in a given study period, and for ease of notation drop the dependence of our data and estimates on $T = [t_1, t_2]$.

Pathogen deep sequence data are available from sampled, infected individuals. We denote the sampling status vector for all individuals in $\mathcal{P}$ by $s = (s_i)_{i=1,\dots,N}$, where $s_i = 1$ denotes that person $i$ is sampled, and $s_i = 0$ that person $i$ is not sampled. The number of sampled individuals is $N^s$, which corresponds here to the 2652 individuals for whom a viral deep sequence is available for analysis. We will characterise population sampling in terms of individual-level characteristics, such as age or location of residence, that are described with $p$ covariates, which we denote with the $N \times p$ matrix $X$. The output of the phylogenetic deep sequence analysis can be summarised in an $N \times N$ direction score matrix $W$ that describes the evidence for transmission from $i$ to $j$ with the weight $w_{ij} \in [0, 1]$. The direction score matrix is not symmetric, diagonal entries are zero, and entries involving unsampled individuals are missing. To estimate the flow matrix $\pi$, consider the observed flow counts

$$
n_{ab} = \sum_{i \in a, j \in b} \mathbb{1}\{s_i = 1\} \, \mathbb{1}\{s_j = 1\} \, \hat{z}_{ij} = \sum_{i \in a, j \in b} \mathbb{1}\{s_i = 1\} \, \mathbb{1}\{s_j = 1\} \, \mathbb{1}\{w_{ij} > \zeta\}, \tag{7}
$$

where $\zeta \in (0, 1)$ is a threshold that can be used to select phylogenetically highly supported source-recipient pairs, and $\hat{z}_{ij}$ is taken as a perfect predictor of $z_{ij}$ among sampled individuals. The counts can be arranged into the $A \times A$ count matrix $n$, and sum to $n_+ = \sum_{a,b} n_{ab}$. In previous studies, $\zeta$ was set to 0.5 or 0.6, and $n_+$ was between 100 and 500 (Hall et al., 2019; Ratmann et al., 2020).

The naïve flow estimator is defined by $\hat{\pi}_{ab} = n_{ab} / \sum_{c,d} n_{cd}$. If we suppose that each population group $a$ is independently sampled at random with probability $\xi_a$, and the actual flows from group $a$ to group $b$ are $z_{ab}$, then $E(\hat{\pi}_{ab}) = (z_{ab}\xi_a\xi_b)/(\sum_{c,d} z_{cd}\xi_c\xi_d)$. Consequently, the naïve flow estimator is only unbiased when the population groups were homogeneously sampled, i.e. $\xi_a$ is the same for all $a$, which is rarely the case (Ratmann et al., 2020).
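The construction of the observed flow counts (7) and the naïve estimator can be sketched in a few lines of Python. The function names, the treatment of missing scores as `NaN`, and the toy score matrix are assumptions made for illustration; they do not reproduce the phyloscanner output format.

```python
import numpy as np

def observed_flow_counts(W, sampled, strata, n_strata, zeta=0.6):
    """Count phylogenetically supported source-recipient pairs by stratum.

    W[i, j] is the direction score for transmission from i to j; entries
    involving unsampled individuals are assumed to be NaN.
    """
    W = np.asarray(W, dtype=float)
    sampled = np.asarray(sampled, dtype=bool)
    strata = np.asarray(strata)
    # Indicator that the score exceeds the threshold (missing scores count as 0).
    linked = np.nan_to_num(W, nan=0.0) > zeta
    # Both individuals of the pair must be sampled.
    linked &= np.outer(sampled, sampled)
    # Diagonal entries are zero by definition.
    np.fill_diagonal(linked, False)
    M = np.eye(n_strata, dtype=int)[strata]
    return M.T @ linked.astype(int) @ M

def naive_flows(n):
    """Naive flow estimator: observed counts divided by their total."""
    n = np.asarray(n, dtype=float)
    return n / n.sum()

# Hypothetical scores for four individuals; person 3 is not sampled.
W = np.zeros((4, 4))
W[0, 2], W[2, 0] = 0.9, 0.05
W[1, 0], W[0, 1] = 0.7, 0.1
W[3, :] = np.nan
W[:, 3] = np.nan
n = observed_flow_counts(W, sampled=[1, 1, 1, 0], strata=[0, 0, 1, 1], n_strata=2)
pi_naive = naive_flows(n)
```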

2.2. Inferring flows from heterogeneously sampled data

The data inputs for estimating transmission flows from pathogen deep-sequence data (7) are of the same form as for estimating origin-destination matrices with unobserved transport routes (Hazelton, 2001), migration flows (Raymer et al., 2013), or contact intensities (van de Kassteele et al., 2017), prompting us to formulate the statistical model in general terms. Considering the actual, unobserved number of flow events (i.e., transmissions) between all population groups, $z = (z_{ab})_{a,b \in \mathcal{A}}$, the complete data likelihood that arises under mathematical models of the form (4) in a fixed study period is the multinomial

$$
p(z \mid z_+, \pi) \propto \prod_{a,b} (\pi_{ab})^{z_{ab}}, \tag{8}
$$

where for ease of notation we denote by $\pi$ the vector of non-zero elements of the flow matrices (1-2), and similarly for $n$ and $z$. Model (8) ignores potential second-order correlations between flows, for example that a female infected by an older male may be more likely to transmit to men of older age. Since the total number of transmissions is in itself a random variable, we consider the related Poisson model

$$
p(z \mid \lambda) \propto \prod_{a,b} (\lambda_{ab})^{z_{ab}} \exp(-\lambda_{ab}) = \Big[ \prod_{a,b} (\pi_{ab})^{z_{ab}} \Big] \Big[ \eta^{z_+} \exp(-\eta) \Big], \tag{9}
$$

where $\lambda_{ab}$ can be interpreted as the flow intensities from group $a$ to group $b$, $\eta = \sum_{c,d} \lambda_{cd}$, and $\pi_{ab}$ are recovered via $\pi_{ab} = \lambda_{ab}/\eta$.
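The factorisation in (9) can be checked numerically: the product of independent Poisson densities equals a multinomial density for the flows given their total, times a Poisson density for the total. The sketch below uses only the Python standard library; the intensities and counts are arbitrary illustrative values.

```python
import math

def log_poisson(k, rate):
    """Log Poisson probability mass of count k under the given rate."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)

def log_multinomial(counts, probs):
    """Log multinomial probability mass of counts under probabilities probs."""
    total = sum(counts)
    out = math.lgamma(total + 1)
    for k, p in zip(counts, probs):
        out += k * math.log(p) - math.lgamma(k + 1)
    return out

# Hypothetical flow intensities and latent flow counts (vectorised flows).
lam = [2.0, 0.5, 1.5, 4.0]
z = [3, 0, 1, 5]

eta = sum(lam)                    # total intensity
pi = [l / eta for l in lam]       # flow proportions, pi = lambda / eta

# Left-hand side: independent Poisson model for each flow.
lhs = sum(log_poisson(k, l) for k, l in zip(z, lam))
# Right-hand side: multinomial given the total, times Poisson for the total.
rhs = log_multinomial(z, pi) + log_poisson(sum(z), eta)
```

Both sides agree up to floating point error, which is why inference on $\lambda$ yields inference on $\pi$ without conditioning on the total number of transmissions.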

The actual flows $z$ are not observed. We assume that individuals are sampled at random within strata (SARWS). SARWS implies that sampling is independent of being a source or not, and the likelihood of the observed counts conditional on the complete data is $p(n \mid z, \xi) = \prod_{a,b} \text{Binomial}(n_{ab}; z_{ab}, \xi_a\xi_b)$, where $\xi_a$ is the sampling probability in group $a$. In this class of models the latent flow counts $z_{ab}$ can be conveniently integrated out, yielding for the observed flow counts the Poisson model

$$
p(n \mid \lambda, \xi) = \prod_{a,b} (\lambda_{ab}\xi_a\xi_b)^{n_{ab}} \exp(-\lambda_{ab}\xi_a\xi_b). \tag{10}
$$
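The marginalisation step can be made explicit for a single flow. The following derivation is a sketch that writes $p = \xi_a\xi_b$ for brevity (a shorthand introduced here, not notation from the text) and keeps the factor $1/n_{ab}!$, which does not depend on $\lambda$ or $\xi$.

```latex
\begin{align*}
p(n_{ab} \mid \lambda_{ab}, \xi)
 &= \sum_{z=n_{ab}}^{\infty} \binom{z}{n_{ab}} p^{n_{ab}} (1-p)^{z-n_{ab}}
    \, \frac{\lambda_{ab}^{z} e^{-\lambda_{ab}}}{z!} \\
 &= \frac{(\lambda_{ab} p)^{n_{ab}} e^{-\lambda_{ab}}}{n_{ab}!}
    \sum_{k=0}^{\infty} \frac{\{\lambda_{ab}(1-p)\}^{k}}{k!} \\
 &= \frac{(\lambda_{ab} p)^{n_{ab}} \exp(-\lambda_{ab} p)}{n_{ab}!}.
\end{align*}
```

The second line substitutes $k = z - n_{ab}$, and the third uses the exponential series, so that Binomial thinning of a Poisson count is again Poisson with intensity $\lambda_{ab}\xi_a\xi_b$.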

An important point in this construction is that we are free to choose the stratification $\mathcal{A}$ in order to accommodate the SARWS assumption. We will show that flow estimates on different, coarser population stratifications that are of primary interest are easily obtained through the aggregation property of the Poisson system (9).

The sampling-adjusted maximum-likelihood estimates of $\lambda_{ab}$ and $\pi_{ab}$ under (10) can be derived under the SARWS assumption. The number of sampled individuals in $a$, $N^s_a = \sum_{i \in a} \mathbb{1}\{s_i = 1\}$, is a Binomial sample of the number of all individuals in $a$, $N_a$, which leads to

$$
\hat{\pi}_{ab} = \frac{n_{ab}}{\hat{\xi}_a\hat{\xi}_b} \Big/ \Big[ \sum_{c,d} \frac{n_{cd}}{\hat{\xi}_c\hat{\xi}_d} \Big], \tag{11}
$$

where $\hat{\xi}_a = N^s_a/N_a$ (Supplementary Material, section S1).
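To illustrate how (11) corrects for heterogeneous sampling, the sketch below applies the naïve and the sampling-adjusted estimators to expected counts $z_{ab}\xi_a\xi_b$. The function name `adjusted_flows` and all numerical values are hypothetical; the sampling probabilities are treated as known rather than estimated.

```python
import numpy as np

def adjusted_flows(n, xi):
    """Sampling-adjusted flow estimator, following the form of (11)."""
    n = np.asarray(n, dtype=float)
    xi = np.asarray(xi, dtype=float)
    # Weight each count by the inverse probability that both the source
    # (group a) and the recipient (group b) were sampled.
    weighted = n / np.outer(xi, xi)
    return weighted / weighted.sum()

# Hypothetical true flow counts and group-specific sampling probabilities.
z = np.array([[40.0, 20.0],
              [10.0, 30.0]])
xi = np.array([0.6, 0.35])

pi_true = z / z.sum()
# Expected observed counts under sampling at random within strata.
n_expected = z * np.outer(xi, xi)

pi_naive = n_expected / n_expected.sum()
pi_adjusted = adjusted_flows(n_expected, xi)
```

In this toy setting the naïve estimator overstates flows within the better sampled group, whereas the adjusted estimator recovers the true proportions exactly because expected counts are used in place of random draws.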

2.3. Inferring high-resolution flows with regularising priors

Bayesian regularisation techniques play a central role in obtaining robust and suitably smoothed flow estimates. Considering population sampling, we exploit additional information on the sampling vector $s$. We assume that flows are independent of sampling, allowing us to decompose the joint posterior distribution into

$$
\begin{aligned}
p(\lambda, \xi \mid n, s, X) &\propto p(n \mid \lambda, \xi, s, X) \, p(\lambda \mid \xi, s, X) \, p(\xi \mid s, X) \\
&= p(n \mid \lambda, \xi) \, p(\lambda \mid \xi) \, p(\xi \mid s, X).
\end{aligned} \tag{12}
$$

A possible limitation of (11) is that the counts $N_a$, $N^s_a$ can be small when the population is finely stratified. It is thus often advantageous to model individual-level sampling probabilities in terms of a linear combination of predictors. Using, for example, a logistic regression approach, we obtain

$$
p(\xi_a \mid s, X) = \int \text{logit}^{-1}(x_a\beta) \, p(\beta \mid s, X) \, d\beta, \tag{13}
$$

where $x_a$ is the row vector of population characteristics that specify group $a$, $\beta$ are the regression coefficients, and $p(\beta \mid s, X)$ is the posterior density of the regression coefficients, estimated from the sampling status vector $s$ of all individuals in the study population.
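In practice, an integral of the form (13) is typically approximated by Monte Carlo: each posterior draw of the regression coefficients is mapped through the inverse logit. The sketch below assumes that posterior draws of $\beta$ are already available; here they are simulated from a normal distribution purely as a stand-in, and the function names and covariate coding are hypothetical.

```python
import numpy as np

def inv_logit(x):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + np.exp(-x))

def sampling_probability_draws(x_a, beta_draws):
    """Map posterior draws of beta to draws of the sampling probability xi_a.

    x_a has shape (p,), beta_draws has shape (n_draws, p).
    """
    return inv_logit(np.asarray(beta_draws) @ np.asarray(x_a))

# Stand-in for posterior draws of (intercept, group effect); in a real
# analysis these would come from a fitted logistic regression of s on X.
rng = np.random.default_rng(1)
beta_draws = rng.normal(loc=[0.4, -0.8], scale=0.05, size=(4000, 2))

x_a = np.array([1.0, 1.0])        # hypothetical covariate row for group a
xi_draws = sampling_probability_draws(x_a, beta_draws)
xi_mean = xi_draws.mean()         # Monte Carlo estimate of the integral
```

Keeping the full vector `xi_draws`, rather than only its mean, propagates uncertainty in the sampling probabilities into the flow estimates.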

Considering the prior density on the transmission intensities $\lambda$, in some applications the population strata are unordered such as in application (1). In this case we propose using

$$
\lambda_{ab} \mid \xi_a, \xi_b \sim \text{Gamma}(\alpha_{ab}, \beta), \quad \alpha_{ab} = 0.8/L, \quad \beta = 0.8/Z_p(\xi_a, \xi_b), \tag{14}
$$

where $L$ is the length of the flow vector $\pi$, which is equivalent to the number of structurally non-zero entries in the flow matrices (1-2), and $Z_p$ is the number of expected transmission events, $Z_p = \sum_{a,b: n_{ab} \neq 0} \frac{n_{ab}}{\xi_a\xi_b} + \sum_{a,b: n_{ab} = 0} \frac{1 - \xi_a\xi_b}{\xi_a\xi_b}$. This choice is motivated by the fact that (14) induces on $\pi$ an objective Dirichlet prior density with parameters $\alpha_{ab} = 0.8/L$ (Berger et al., 2015), such that the likelihood (10) dominates the prior (14) regardless of the number of flows to estimate. However when the population groups can be ordered, such as the $K$ 1-year age bands in (2), the structure of the flow model (12) enables using regularising prior densities that penalise against large deviations in flow intensities between similar source and recipient populations. For (2), we opted for (stacked) two-dimensional Gaussian-process priors on the entries of $\lambda$,

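$$
\begin{aligned}
\log \lambda &= \mu \mathbf{1} + \nu \mathbf{1}_{mf} + f, \\
f &= (f_{mf}^T, f_{fm}^T)^T, \\
f_{mf} &\sim \text{GP}(0, k_{mf}), \quad f_{fm} \sim \text{GP}(0, k_{fm}), \\
k_{mf}\big((a_1, b_1), (a_2, b_2)\big) &= \sigma^2_{mf} \exp\Big( -\Big[ \frac{(a_2 - a_1)^2}{2\ell^2_{mf,a}} + \frac{(b_2 - b_1)^2}{2\ell^2_{mf,b}} \Big] \Big), \\
k_{fm}\big((a_1, b_1), (a_2, b_2)\big) &= \sigma^2_{fm} \exp\Big( -\Big[ \frac{(a_2 - a_1)^2}{2\ell^2_{fm,a}} + \frac{(b_2 - b_1)^2}{2\ell^2_{fm,b}} \Big] \Big), \\
\sigma^2_{mf}, \sigma^2_{fm} &\sim \text{Half-Normal}(0, 10), \\
\ell_{d,i} &\sim \text{Inv-Gamma}(\alpha_{d,i}, \beta_{d,i}),
\end{aligned} \tag{15}
$$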


where the first $K^2$ entries of $\lambda$ correspond to flows in the male-female direction and the remaining $K^2$ entries correspond to flows in the female-male direction, $\mu$ is the baseline log transmission intensity, $\nu$ is a scalar on the elements of $\lambda$ in the male-female direction, and $k_{mf}$, $k_{fm}$ are gender-specific squared exponential kernels with variance parameters $\sigma^2_{mf}$, $\sigma^2_{fm}$ and length scales $\ell_{mf,a}$, $\ell_{mf,b}$, $\ell_{fm,a}$, $\ell_{fm,b}$ (Rasmussen and Williams, 2006).
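For concreteness, the squared exponential kernel in (15) can be evaluated on the grid of source and recipient ages as sketched below. The function name `se_kernel_2d` and the parameter values (unit variance, length scales of 5 years) are illustrative assumptions, not estimates from the analysis.

```python
import numpy as np

def se_kernel_2d(points, sigma, ell_a, ell_b):
    """Squared exponential kernel between all pairs of 2D points (a, b)."""
    pts = np.asarray(points, dtype=float)
    # Pairwise differences in source age (a) and recipient age (b).
    da = pts[:, None, 0] - pts[None, :, 0]
    db = pts[:, None, 1] - pts[None, :, 1]
    return sigma**2 * np.exp(-(da**2 / (2 * ell_a**2) + db**2 / (2 * ell_b**2)))

# 1-year age bands from 15 to 49 years, so K = 35 and K^2 = 1225 flows
# in one direction of transmission.
ages = np.arange(15, 50)
grid = np.array([(a, b) for a in ages for b in ages])

# Hypothetical kernel parameters for the male-female direction.
K_mf = se_kernel_2d(grid, sigma=1.0, ell_a=5.0, ell_b=5.0)
```

The resulting covariance matrix is dense with $K^2 \times K^2$ entries for each direction of transmission, which motivates the low-rank approximation described in the next section.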

2.4. Scalable numerical inference

The semi-parametric Poisson model can be efficiently fitted with the dynamic Hamiltonian Monte Carlo sampler of the Stan probabilistic programming framework (Carpenter et al., 2017). The implementation uses the Hilbert Space Gaussian process (HSGP) approximation to the GP prior (15) developed by Solin and Särkkä (2020), in which the squared exponential covariance kernel $k$ in (15) is approximated through a series expansion of eigenvalues and eigenfunctions of the Laplacian differential operator on a compact domain $\Omega$ of the input space. In our two-dimensional case, we consider $\Omega = [-B_1, B_1] \times [-B_2, B_2]$, and the boundary points play an important role in the approximation, and need to be specified appropriately.

Briefly, the HSGP approximation involves the spectral density associated with the stationary kernel of the GP prior. For the squared exponential kernel with two-dimensional inputs, it is given by

$$
S_\theta(\omega) = 2\pi\sigma^2 \prod_{d=1}^{2} \ell_d \exp\Big( -\frac{1}{2} \sum_{d=1}^{2} \ell_d^2 \omega_d^2 \Big), \tag{16}
$$

where $\omega = (\omega_1, \omega_2)$ denote the frequencies, and $\theta = (\sigma, \ell_1, \ell_2)$ are the kernel parameters (Rasmussen and Williams, 2006). The approximation further involves the eigenvalues and eigenfunctions of the Laplacian differential operator. On the compact domain $\Omega$, the $j$th univariate eigenvalues and eigenfunctions in dimension $d = 1, 2$ can be computed (Solin and Särkkä, 2020), and are

$$
\lambda_{dj} = \Big( \frac{j\pi}{2B_d} \Big)^2, \qquad \phi_{dj}(x_d) = \sqrt{1/B_d} \, \sin\Big( \sqrt{\lambda_{dj}} \, (x_d + B_d) \Big).
$$

The HSGP approximation involves the first $m_1$ and $m_2$ such terms of both dimensions. There are $m = m_1 \times m_2$ possible combinations of such terms, which we index through $K \in \mathbb{N}^{2 \times m}$. For example, if $m_1 = 3$ and $m_2 = 4$, then

$$
K = \begin{pmatrix} 1 & 1 & 1 & 1 & 2 & 2 & 2 & 2 & 3 & 3 & 3 & 3 \\ 1 & 2 & 3 & 4 & 1 & 2 & 3 & 4 & 1 & 2 & 3 & 4 \end{pmatrix}.
$$

For 2D inputs, the $j = 1, \dots, m$ eigenvalues and eigenfunctions are the combinations of the univariate eigenvalues and eigenfunctions, $\lambda_j = (\lambda_{1K_{1j}}, \lambda_{2K_{2j}})$, and $\phi_j(a, b) = \phi_{1K_{1j}}(a) \, \phi_{2K_{2j}}(b)$. This fully specifies the HSGP approximation to Gaussian processes with 2D inputs,

$$
\begin{aligned}
f &\sim \text{HSGP}(0, k_{\text{HSGP}}), \\
k_{\text{HSGP}}\big((a, b), (a', b')\big) &= \sum_{j=1}^{m} S_\theta\big(\sqrt{\lambda_j}\big) \, \phi_j(a, b) \, \phi_j(a', b'),
\end{aligned} \tag{17}
$$

which depends on the choice of $m_1$, $m_2$ and $B_1$, $B_2$. In our work, we applied the approximation (17) to each of the stacked Gaussian process prior components in (15). To our knowledge, this is one of the first applications of the HSGP approximation. Note that the structure of the data inputs and the form of the Poisson model is the same in many flow applications (Raymer et al., 2013; van de Kassteele et al., 2017), and so the HSGP approximation could make estimation of high-resolution flows also numerically scalable in these settings.

[Figure: panel A shows population size over time by compartment (infected, recovered, susceptible) for women and men in rural and urban areas; panel B shows worst case error against sampling differences, with and without adjustment for sampling differences.]

Fig. 3. Simulation experiments to assess the accuracy of flow estimates under sampling bias. (A) ODE-based models were used to simulate epidemic trajectories of susceptible, infected and treated men and women across two population strata, and transmission flows between them. Panel A visualises one of the simulated trajectories. (B) Transmission flows were estimated under the Poisson likelihood model (10), without adjustments for sampling differences (dark grey), and with adjustments for sampling differences (light grey). Accuracy was measured with the worst case error between posterior median estimates and the simulated true values, and shown are the average error and 95% range in 100 replicate simulations.

Full details on the algorithm are reported in Supplementary Text S2 and a tutorial is provided at https://github.com/BDI-pathogens/phyloscanner/blob/master/phyloflows/vignettes/08_practical_example.md.


3. Applications

3.1. Accuracy with and without adjustments for sampling heterogeneity

Standard phylodynamic methods ignore sampling differences between population strata (Volz et al., 2009; Le Vu et al., 2019; Scire et al., 2020). We first assessed the impact of sampling heterogeneity on estimating transmission flows in simulation experiments, which are fully reported in Supplementary Material, section S3. The first experiment is a minimal example involving flows between two population groups, which for simplicity we refer to as individuals in rural areas (group a) and individuals in large communities (group b). Transmission chains were simulated under the ODE model (4), i.e. not from our simpler likelihood model (10), and the simulated flow matrix $\pi^0_r$ was recorded in $r = 1, \dots, 100$ replicate simulations (Figure 3A). Observations were drawn under heterogeneous sampling, with fixed sampling probabilities $\xi_a = 0.6$ in group a, and decreasing sampling probabilities in group b, $\xi_b = 0.6, 0.55, \dots, 0.35$. First, we estimated flows from (12) assuming no sampling differences between population strata, which we implemented by setting $p(\xi_a|s,X)$ and $p(\xi_b|s,X)$ to the Beta density with parameters 25.5, 25.5. Second, we estimated flows with information on sampling differences included, by setting $p(\xi_a|s,X)$ to the Beta density with shape parameters $N^s_a + \alpha_\xi$, $N_a - N^s_a + \beta_\xi$, where $N^s_a$ is the number of sampled individuals in group a, $N_a$ is the population denominator, and the hyperparameters $\alpha_\xi$, $\beta_\xi$ were both set to 0.5 under Jeffreys' prior on the Binomial sampling probabilities. The density $p(\xi_b|s,X)$ was specified analogously. Figure 3B compares the accuracy in the posterior median flow estimates $\pi_{r,ab}$ in terms of the worst case error (WCE) $\varepsilon_r = \max_{a,b} |\pi_{r,ab} - \pi^T_{r,ab}|$, and illustrates that the Poisson model (12) can minimise the impact of sampling heterogeneity on flow estimates. More complex simulation experiments yielded similar results (Supplementary Material, section S3).
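The two ingredients of this experiment that are easy to reproduce in isolation are the Beta density on the sampling probabilities and the worst case error. The sketch below is one possible way to write them in Python with NumPy; the function names and the example counts are hypothetical and are not taken from the simulation code.

```python
import numpy as np

def sampling_posterior_draws(n_sampled, n_population, rng, size=4000,
                             alpha=0.5, beta=0.5):
    """Draws from Beta(N^s + alpha, N - N^s + beta), the density placed on a
    Binomial sampling probability when alpha = beta = 0.5."""
    return rng.beta(n_sampled + alpha, n_population - n_sampled + beta, size=size)

def worst_case_error(pi_est, pi_true):
    """Largest absolute difference between estimated and true flows."""
    return float(np.max(np.abs(np.asarray(pi_est) - np.asarray(pi_true))))

rng = np.random.default_rng(0)
# Hypothetical counts: 60 of 100 individuals sampled in group a, 35 of 100 in group b
xi_a = sampling_posterior_draws(60, 100, rng)
xi_b = sampling_posterior_draws(35, 100, rng)

# Hypothetical 2 x 2 flow matrices (each sums to one)
pi_true = np.array([[0.35, 0.15], [0.20, 0.30]])
pi_est = np.array([[0.40, 0.10], [0.20, 0.30]])
wce = worst_case_error(pi_est, pi_true)
```

With these counts the draws for group a concentrate near 0.6 and those for group b near 0.35, which mirrors the sampling probabilities used at the two ends of the experiment.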

3.2. Accuracy with different smoothing priors

Next, we assessed on simulations the impact of different regularising prior densities on estimating flows across orderable population strata, which is typically not considered in standard phylodynamic inference approaches (Volz et al., 2009; De Oliveira et al., 2017; Le Vu et al., 2019; Scire et al., 2020; Bbosa et al., 2020). We focused on age- and gender-specific transmission flows by 1-year age inputs between 15-24 years, resulting in $8 \times 10^2 = 800$ flow combinations, and simulated 300 transmission pairs from the GP model (15) using ground-truth parameters that were motivated by analyses of the Rakai data set (see Supplementary Material, section S3). Figure 4A illustrates the simulated transmission pairs from men to women, and the corresponding underlying flows in 2 of 20 simulations generated. Transmission flows were then inferred using the HSGP

[Figure 4: (A) age of female recipients against age of male source (15 to 25 years), for simulation 1 and simulation 2; (B) runtime (hours) and mean absolute error (×10^-4) against boundary factor (1, 1.05, 1.25, 1.5, 2), by regularising prior (HSGP with m = 10, 20, 30, 40, 50, 60).]

Fig. 4. Simulation experiments to assess the accuracy of flow estimates under different smoothing priors. (A) Age- and gender-specific transmission pairs were simulated from the GP model (15), and shown are the simulated male to female transmission pairs for two simulations (red), along with contours of the simulated ground-truth transmission flow density. (B) Average runtimes of transmission flow inferences across 20 simulated data sets using different HSGP and GP priors, and average mean absolute errors in inferred posterior median flow estimates. The dashed line corresponds to average runtimes and average mean absolute errors when using the GP prior, under which the simulations were generated.

approximation (17) to (15) in our semi-parametric Poisson flow model. We varied the number m of basis functions from 100 to 3600 by setting m1 = m2 = 10, 20, . . . , 60, and chose as HSGP domain Ω an expanded version of the input domain, [15/B, 50B] × [15/B, 50B], with boundary factor B = 1, 1.05, 1.25, 1.5, 2. To have a benchmark for the performance of the HSGP approximations, inferences were also performed using the GP prior (15), from which the data were simulated. Priors for all parameters are described in the Supplementary Material, section S3. Figure 4B shows that relative to using GP priors, average HMC runtimes across the 20 simulated data sets improved by more than ten-fold with HSGP priors when the number of basis functions was less than 50. Figure 4B also summarises the average mean absolute error in posterior median flow estimates. Similar to Riutort-Mayol et al. (2020), our results indicate that the HSGP approximations were least accurate for small B ≤ 1.05, and for larger boundary factors when the number of basis functions was not increased simultaneously. On our simulations, HSGP approximations performed almost as well as GPs for the tuning parameters (B = 1.25, m = 20-30), (B = 1.5, m = 20-40) or (B = 2, m = 30-40). As computational cost increases with m, we chose B = 1.25, m = 30.

3.3. Application to population-based deep-sequence data from Rakai, Uganda

We illustrate application of the semi-parametric Poisson flow model (12) on a population-based sample of HIV deep sequences from the RCCS in south-eastern Uganda at the shores of Lake Victoria (Ratmann et al., 2019, 2020). Between 2011/08/10 and 2015/01/30, two survey rounds were conducted in 36 inland communities, and three survey rounds in 4 fishing communities (Figure 2A). Preceding each survey, a household census was conducted to identify individuals aged 15-49 years who lived in the communities for at least one month and with intention to stay, who were eligible to participate. In brief, there were 37645 census-eligible individuals, of whom 25882 (68.8%) participated in the RCCS. Participation was higher among women than men, increased with age for both men and women, and was similar in fishing and inland communities (Supplementary Material, section S4). 11404 (96.9%) of non-participants were absent for school or work. Infected individuals who did not report ART use were selected for sequencing, and deep-sequencing rates were higher among men than women, similar by age for men and women, and were higher in fishing communities (Supplementary Material, section S4). There were 293 heterosexual pairs with phylogenetic support for linkage and direction of transmission (source-recipient pairs) when using the threshold ζ = 0.6 in (7). The estimated infection times of the recipients were between 2009/10/01 and 2015/01/30, which defined the study period T during which we estimated transmission flows. Figure 2B illustrates the reconstructed source-recipient pairs by age of both individuals at the midpoint of the study period.

To interpret these observations, we defined as denominator transmission events to census-eligible individuals in RCCS communities who were infected during the study period T, and formalised the individual steps in the sampling cascade of transmission events (Figure 5A). The sources and recipients had to participate in at least one survey round between 2011/08/10 and 2015/01/30, report no ART use, and have virus sequenced successfully. In Rakai, each survey was preceded by a household census, and with this denominator we numerically estimated age-, gender-, and location-specific conditional sampling probabilities at each step of the sampling cascade using Bayesian logistic-Binomial regression models, and then multiplied Monte Carlo draws from these distributions to numerically approximate the posterior distribution of the overall sampling probabilities ξa, ξb, ∀a, b (see Supplementary Material, section S4). Figure 5B illustrates the resulting, overall sampling probabilities of female-to-male transmissions

[Figure 5: (A) sampling cascade of transmission events: new infections in RCCS during the study period among individuals aged 15-49 years, and their sources aged 15-49 years (living in RCCS or living outside RCCS), each followed by the steps "participate in RCCS during surveillance period", "reporting no ART use" and "virus deep-sequenced"; the diagram is annotated with counts of 3311, 2262, 3823 and 2629 individuals. (B) sampling fractions (0.0 to 0.5) by age of female sources and age of male recipient (15 to 50 years).]

Fig. 5. Model of the sampling cascade of transmission events. (A) The sampling cascade formalises the steps involved in sampling sources and recipients of transmission events. Recipients were defined as individuals aged 15-49 years who acquired infection in one of the RCCS communities during the study period. Sources were defined as individuals aged 15-49 years who transmitted to one of the recipients. Each arrow corresponds to a sampling step of the source and recipient populations, which we model using cohort data. (B) Estimated posterior median sampling probabilities of female to male transmission events in fishing communities by age of source and recipient. Conditional sampling probabilities were numerically estimated for each step of the sampling cascade using logistic Binomial regression, and then multiplied to give overall sampling probabilities.

in fishing communities. The estimated sampling probabilities indicate that the observed data over-represent transmissions between older individuals.
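The multiplication of Monte Carlo draws across the steps of the sampling cascade can be sketched as follows. This is an illustration only: the Beta draws below are hypothetical stand-ins for posterior draws from the logistic-Binomial regression models, and the function name and the numbers are assumptions made for the example.

```python
import numpy as np

def overall_sampling_draws(*step_draws):
    """Elementwise product of per-step draws, giving one draw of the overall
    sampling probability per Monte Carlo sample."""
    return np.prod(np.stack(step_draws, axis=0), axis=0)

rng = np.random.default_rng(3)
n_draws = 4000

# Hypothetical conditional sampling probabilities for one stratum
p_participate = rng.beta(70, 30, size=n_draws)   # participated in a survey round
p_no_art = rng.beta(60, 40, size=n_draws)        # reported no ART use, given participation
p_sequenced = rng.beta(50, 50, size=n_draws)     # virus sequenced, given the above

xi_draws = overall_sampling_draws(p_participate, p_no_art, p_sequenced)
xi_median = float(np.median(xi_draws))
```

Multiplying draw by draw, rather than multiplying point estimates, carries the uncertainty of every step through to the overall sampling probability.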

3.4. Transmission flows between areas with high and low disease prevalence

We then used the source-recipient data of Figure 2B to address problem (1) and estimate HIV transmission flows within and between high- and low-prevalence RCCS communities. The high-prevalence communities comprised the four fishing communities, and the low-prevalence communities comprised the remaining 36 inland communities. Detailed analyses have been reported in Ratmann et al. (2020); here we focus on illustrating how known sampling heterogeneities can be accounted for, and how they affect inferences of transmission flow.

The participation, ART use, and sequence sampling probabilities differed by gender, age, and location, and we stratified the population accordingly to meet the SARWS assumption that underlies the Poisson flow model (Section 2.2). Specifically, we stratified

[Figure 6: (A) estimated sampling proportions (15% to 35%) for high->high, high->low, low->high and low->low prevalence pairs; (B) estimated transmission flows (0% to 60%) for fishing->fishing, fishing->inland, inland->fishing and inland->inland, by sampling adjustments (no; participation, sequencing sampling).]

Fig. 6. Estimated sampling probabilities of transmission events, and impact on flow estimates. (A) Boxplots of estimated pairwise sampling probabilities of transmission events between low-prevalence inland and high-prevalence fishing communities of the RCCS. The sampling probabilities were obtained by marginalising over age- and gender-specific differences, and indicate that transmission events between high-prevalence fishing communities were over-represented in the data set. (B) Transmission flow estimates between low-prevalence inland and high-prevalence fishing communities of the RCCS. Shown are posterior median estimates (horizontal line), interquartile ranges (box), and 95% credible ranges when sampling heterogeneity by gender, age, and locations was ignored (red), and when sampling heterogeneity was accounted for as described in the main text.

populations by gender, 1-year age bands (between 15 and 49 years), and resident location (low or high prevalence), which resulted in 140 sampling groups. Following our sampling cascade model (Figure 5), we then sought to estimate the $2 \times 35^2$ age- and gender-specific transmission flows for each of the 4 combinations of geographic source and recipient locations, through the joint posterior distribution (12). We further accounted for geographic in-migration, resulting in 14,700 flow variables. On this high-resolution flow space, we were able to directly apply the estimated, structured sampling probabilities that are illustrated in Figure 5B. This shows that accounting for the observed heterogeneities in how the census population was sampled resulted in a more complex inferential problem than (1-2) suggest.
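The dimensions quoted above can be checked by enumerating the strata. The short Python sketch below does this with the standard library; the variable names are hypothetical, and the labels for genders and locations are placeholders.

```python
from itertools import product

genders = ("female", "male")
ages = range(15, 50)                  # 1-year age bands 15, 16, ..., 49
locations = ("low", "high")           # low- or high-prevalence residence

# One sampling group per combination of gender, age band and location
sampling_groups = list(product(genders, ages, locations))
n_groups = len(sampling_groups)

# Heterosexual flows: each of the two directions (male-to-female and
# female-to-male) contributes 35 x 35 combinations of source and recipient age
n_age_gender_flows = 2 * len(ages) ** 2

# Combinations of source location and recipient location
n_location_pairs = len(locations) ** 2
n_flows_without_migration = n_age_gender_flows * n_location_pairs
```

This gives 140 sampling groups, 2450 age- and gender-specific flows, and 9800 flows across the 4 location combinations. The 14,700 flow variables reported after accounting for in-migration equal 6 × 2450; how the two additional source categories are defined is not shown in this section, so the sketch stops before that step.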

To regularise inferences, we used the HSGP approximation (17) to the stacked Gaussian process prior (15). We further sought to allow for differences in transmission dynamics across locations, and for this reason specified independent HSGP priors on the parts of the flow space that correspond to transmissions to low prevalence areas, and on the parts of the flow space that correspond to transmissions to high prevalence areas. Numerical inference of the joint posterior density (12) took 89 hours on four 2.4 GHz processors with Stan version 2.19. There were no convergence, mixing, or divergence warnings, as long as informative prior densities on the length scale hyperparameters were chosen, which

[Figure 7: panels A-D. x-axes: age of male sources (years), 15 to 50; y-axes: sources of male-to-female transmission, shown as posterior mean and as coefficient of variation; colours: age of individuals acquiring HIV (recipients), in 1-year bands from 15-16 to 49-50.]

Fig. 7. Impact of using regularising prior densities on estimation of age- and gender-specific sources of infection. We compared estimates of the sources of HIV infection in women without regularisation, using the Gamma prior densities (14), to estimates obtained under the regularising HSGP prior on the log transmission intensities (17). Each colour corresponds to infection recipients of a particular age. (A) Each line shows the posterior median estimates of the probability of infections attributed to the age of individuals of the opposite gender, i.e. the sources of infection, obtained without regularisation when using 5-year age strata. (B) Same as A, using the HSGP prior with boundary factor B = 1.25 and m = 30 basis functions when using 1-year age strata. (C-D) Corresponding estimates of the posterior coefficient of variation in the source estimates.

we set by matching their 99% credible ranges to the empirical 99% quantiles in Figure 2. Thus flow inferences remained computationally manageable even in the high-resolution space considered here.

Figure 6B shows the marginal posterior estimates of the aggregated flow vector $\pi = (\pi_{hh}, \pi_{hl}, \pi_{lh}, \pi_{ll})$, Equation (1), when sampling heterogeneity was ignored by setting all $\xi_a$ to the average sampling probability (red), and when gender-, age-, and location-specific participation and sequence sampling probabilities of sources and recipients were accounted for as described above (turquoise). The average sampling difference between individuals in fishing and inland communities was 7.16%, suggesting based on our results in Figure 3B that after accounting for sampling differences, the sampling-adjusted estimates could differ by up to 5% from the unadjusted estimates. Figure 6B shows that our results are in line with this expectation. The estimated flow ratio (inland→fishing / fishing→inland) was 2.18 (1.06-4.71) when sampling heterogeneity was accounted for, and 2.58 (1.23-5.86) when sampling heterogeneity was not accounted for. We thus see that sampling heterogeneity can have an impact on flow estimates, and that the finding that high-prevalence fishing communities were net sinks, and not sources, of local infection flows is robust to sampling heterogeneity.
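A flow ratio of this kind is a function of the posterior draws of the aggregated flow vector, so its median and credible range can be computed draw by draw. The Python sketch below illustrates this under two assumptions: that the subscripts are ordered source then recipient, so that index 2 is low→high (inland→fishing) and index 1 is high→low (fishing→inland), and that posterior draws are available as an array. The Dirichlet draws are hypothetical and do not reproduce the estimates reported above.

```python
import numpy as np

def flow_ratio_summary(pi_draws, numerator, denominator):
    """Posterior median and 95% credible range of a ratio of two flows.

    pi_draws has one row per posterior draw and one column per flow."""
    ratio = pi_draws[:, numerator] / pi_draws[:, denominator]
    median, lower, upper = np.quantile(ratio, [0.5, 0.025, 0.975])
    return float(median), float(lower), float(upper)

rng = np.random.default_rng(1)
# Hypothetical posterior draws of (pi_hh, pi_hl, pi_lh, pi_ll)
pi_draws = rng.dirichlet([20.0, 5.0, 10.0, 65.0], size=4000)

median, lower, upper = flow_ratio_summary(pi_draws, numerator=2, denominator=1)
```

Computing the ratio within each draw before summarising keeps the posterior dependence between the two flows, which a ratio of two separately computed medians would ignore.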

3.5. Transmission flows between age groups

We next turned to problem (2) and estimated transmission flows by age and gender from the source-recipient data shown in Figure 2. Here, we focus on illustrating how the HSGP prior in the Poisson model allows us to borrow information across data points, and thereby go beyond existing phylodynamic methods (De Oliveira et al., 2017; Le Vu et al., 2019; Bbosa et al., 2020) and make flow inferences by 1-year age bands.

To do this, we compared estimates of the source vectors δ, defined in (3a), when we used the independent Gamma prior density (14) (no regularisation) to those when we used the HSGP prior density (17) (with regularisation). We focused the comparison on the age- and gender-specific sources of infection regardless of location, i.e. the source vector that corresponds to Equation (2), which was obtained by aggregating over the high- and low-prevalence locations of the source and recipient population groups. Figures 7A-B show the posterior median source estimates for each recipient group respectively without regularisation when using 5-year age bands and with regularisation using 1-year age bands, and Figures 7C-D show the corresponding posterior coefficients of variation. The estimated coefficients of variation were similar with and without regularisation, and well below 1 with regularisation, except for sources associated with little contribution to onward transmission and for very young or very old recipients. These findings suggest that the 1-year flow estimates are statistically meaningful, and at high resolution provide better insights. More detailed analyses are reported in Supplementary Text S5. First, we document the obvious, that it is not possible to estimate 2450 flow variables from 293 data points without regularisation. Second, we show that, as age bands are widened, estimates increasingly depend on the particular start and end points of the chosen age strata, and so we caution against inferences by 5-year age bands or wider.
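The posterior coefficient of variation used in this comparison can be computed directly from posterior draws. The Python sketch below assumes that the source vector for a recipient group is obtained by normalising flows over sources, which is our reading of the definition in (3a) and is not spelled out in this section; the Gamma draws and function names are hypothetical.

```python
import numpy as np

def sources_from_flows(flow_draws):
    """Normalise flows over sources within each recipient group.

    flow_draws has shape (n_draws, n_sources, n_recipients)."""
    return flow_draws / flow_draws.sum(axis=1, keepdims=True)

def coefficient_of_variation(draws):
    """Posterior standard deviation divided by posterior mean, per component."""
    return draws.std(axis=0, ddof=1) / draws.mean(axis=0)

rng = np.random.default_rng(2)
# Hypothetical posterior draws of flows between 7 source and 7 recipient groups
flow_draws = rng.gamma(shape=5.0, scale=1.0, size=(2000, 7, 7))

delta_draws = sources_from_flows(flow_draws)
cv = coefficient_of_variation(delta_draws)
```

Values of `cv` well below 1 indicate that the posterior standard deviation of a source estimate is small relative to its mean, which is the sense in which the coefficients of variation are interpreted above.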

[Figure 8: estimated sources of infections in women (0% to 100%) against age of female recipients (20 to 50 years), in three facets: men younger or same age, men 1-5 yrs older, men >5 yrs older.]

Fig. 8. Estimated sources of infection in women in Rakai, Uganda, 2009-2015. We estimated the sources of HIV infection in women of increasing age. Sources were defined as men that were younger or the same age as the women (grey), 1-5 years older (orange), and over 5 years older (blue), and sum to 100% for each age of infected women on the x-axis. Posterior median source estimates are shown along with 95% marginal credibility intervals.

3.6. Sources of transmission to women aged <25 years

Across sub-Saharan Africa, HIV prevalence rises rapidly among young women aged <25 years (UNAIDS, 2018), which has prompted efforts to prevent infection among adolescent girls and young women, most notably the DREAMS partnership (Saul et al., 2018). Infections among young women aged <25 years are commonly attributed to older men, with a recent phylogenetic study from South Africa finding that of 60 identified transmission pairs involving women aged <25 years, 42 (70.0%) had a probable male partner aged >25 years (De Oliveira et al., 2017). Our larger data set with directional phylogenetic information allowed us to revisit these estimates.

Our data contained 96 source-recipient pairs involving women aged <25 years, of which 59 (61.5%) originated from men and 37 (38.5%) from women. However, under the regularising HSGP prior density (17), our gender- and age-specific flow estimates also borrow information from the other source-recipient pairs that involved older women. We report in Figure 8 the estimated sources of HIV infection in women of increasing age. The facets show the contribution of each source, men younger or the same age, men up to 5 years older, and men more than 5 years older, and sum to 100% for each age of infected women on the x-axis. At age 15, an estimated 91.8% (77.6% - 97.7%) of women were infected by men more than 5 years older, while at age 20, this was 71.8% (60.6% - 80.7%), and at age 25 this was 47.8% (38.2% - 57.4%). These estimates document the overwhelming impact that men more than 5 years older have on driving infection in very young women in our observation period 2010-2015, and they show that the contribution of these men to infection in women declines rapidly with the age of the women. We provide exact estimates in Table S9.
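The aggregation of 1-year source estimates into the three age-gap categories of Figure 8 can be sketched as below. The sketch assumes that ages are recorded as integer 1-year bands, so that "up to 5 years older" corresponds to a gap of 1 to 5 years, and that posterior draws of the sources of infection in women are available as an array that sums to one over male ages; the function name and the uniform example input are hypothetical.

```python
import numpy as np

def sources_by_age_gap(delta_draws, ages):
    """Aggregate sources of infection in women into three age-gap categories.

    delta_draws has shape (n_draws, n_male_ages, n_female_ages) and sums to one
    over male ages for each female age. The result has shape
    (n_draws, 3, n_female_ages), ordered as: men younger or same age,
    men 1-5 years older, men more than 5 years older."""
    ages = np.asarray(ages)
    gap = ages[:, None] - ages[None, :]           # male age minus female age
    masks = [gap <= 0, (gap >= 1) & (gap <= 5), gap > 5]
    return np.stack([(delta_draws * mask).sum(axis=1) for mask in masks], axis=1)

ages = np.arange(15, 50)
# Hypothetical input: every male age contributes equally to every female age
delta_draws = np.full((10, len(ages), len(ages)), 1.0 / len(ages))

by_gap = sources_by_age_gap(delta_draws, ages)
summary = np.quantile(by_gap, [0.5, 0.025, 0.975], axis=0)
```

Because the three masks partition all male ages, the categories sum to 100% for each age of infected women, as in Figure 8.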

4. Discussion

In this study we introduce a class of semi-parametric Bayesian Poisson models for estimating high-resolution flows between population strata, and apply the model to estimate the sources of HIV infections from pathogen deep-sequence data. The modelling framework is flexible, and enables addressing a range of epidemiological questions on pathogen spread between geographic areas, by 1-year age bands, gender, or indeed other discretely-valued sociodemographic characteristics. We templated the model with and without Hilbert space approximations for scalable inference of high-resolution flows in generic Stan model files, and hope that given the canonical structure of the semi-parametric Poisson flow model, these will be helpful in other movement, origin-destination or flow applications as well (Tebaldi and West, 1998; Hazelton, 2001; Raymer et al., 2013; Lindström et al., 2013; Faye et al., 2015; van de Kassteele et al., 2017; Sun et al., 2021).

Existing phylodynamic estimation approaches are tailored for pathogen consensus sequences (Lemey et al., 2009; Vaughan et al., 2014; Volz et al., 2009; Scire et al., 2020). The approach described here is tailored for pathogen deep-sequence data, in that an observed, time-homogeneous flow matrix is required as input, which can be derived through aggregation from individual source-recipient relationships in deep-sequence phylogenies. The main advantages are first, that population-level spread can be directly estimated from individual source-recipient relationships, and modelled in terms of associated individual covariates. In comparison, standard phylogeographic models estimate transition rates from the shape of viral phylogenies captured in times to lineage coalescence (Lemey et al., 2009; Stadler and Bonhoeffer, 2013; Scire et al., 2020), which are often harder to interpret. Second, relatively little computational effort is needed to fit the Poisson flow model (10-12) to deep-sequence data, because it falls within the class of Bayesian hierarchical models for binary data, for which efficient fitting and regularisation techniques exist (Carpenter et al., 2017; Rasmussen and Williams, 2006; Solin and Särkkä, 2020). This makes it computationally feasible to investigate complex aspects of disease spread such as population-level HIV transmission by 1-year age bands. Third, differences in how the phylogenetic data were sampled for each stratum can be explicitly accounted for in the model, which is particularly important for characterising HIV transmission, which tends to concentrate in marginalised, vulnerable, and hard to reach populations (UNAIDS, 2019). We found relatively small differences in the estimated sources of infection in the study communities by location and age when inferences were performed with and without sampling adjustments, in line with the relatively limited differences in sampling inclusion probabilities in the RCCS communities and the expected impact on source attribution on simulated data (Figure 3B). However, in other use cases, the impact of sampling heterogeneity on source attribution can be substantially larger. For example, the RCCS included in surveyed communities an estimated 75.7% of the lakeside population within 3km of the shoreline of Lake Victoria along the Rakai region, and an estimated 16.2% of the inland population of the Rakai region (Ratmann et al., 2020). We can use the proposed framework to extrapolate our inferences from the RCCS communities to the underlying population, and given the larger sampling differences we estimated that 88.7% (84.5–91.9%) of transmissions in the Rakai region occurred in the inland population (Ratmann et al., 2020). Specifying and characterising the denominator population is thus crucial for interpreting phylogenetic source attribution estimates, and we believe the proposed Bayesian semi-parametric flow model provides a useful tool in this endeavour.

The method we propose has limitations. First, the method requires pathogen deep-sequence data instead of consensus sequences, which at present are uncommon, though increasingly generated in routine clinical care (Houlihan et al., 2018). Second, current deep-sequencing protocols typically generate short sequence fragments, usually of 200 to 300 base pairs in length after trimming adaptors and low quality ends, and merging paired end fragments. This implies that pathogens need to evolve at a fast rate, because otherwise reconstructed deep-sequence phylogenies do not contain the pattern of ancestral subgraphs that is characteristic of pathogen spread in one direction. Such high evolutionary rates are typical for viral pathogens that infect and evolve in humans over long periods of time, such as HIV or hepatitis C. We expect that the methods developed here will become applicable to a broad range of viral and bacterial infectious diseases as existing deep-sequencing methods that generate substantially longer pathogen sequence fragments become cheaper (Rhoads and Au, 2015). Third, our inferences are based on source-recipient pairs with strong evidence for the direction of transmission, which is a subset of all the data available, and we cannot exclude that this selection step introduces bias into flow estimates. Fourth, the model was not designed to estimate time changes in transmission flows. While in principle it is possible to add time as a covariate to the linear predictor of the log transmission intensities (15), the resulting flow estimates will in general not be consistent with the constraints imposed by standard assumptions on disease spread, as in Equations (4-6), and other techniques such as the structured coalescent may be better suited (Volz et al., 2009).


Reducing HIV incidence among adolescent and young women is a key priority for public health programs across sub-Saharan Africa to achieve epidemic control milestones (UNAIDS, 2018). The DREAMS intervention aims to promote determined, resilient, empowered, AIDS-free, mentored, and safe adolescent girls and young women, and includes educational programs that aim to address the socio-behavioral factors that underlie vulnerability and infection risk (Saul et al., 2018). Our analysis of a large cross-sectionally sampled HIV deep-sequence data set from Rakai, Uganda, supports previous analyses (De Oliveira et al., 2017; Probert et al., 2019; Bbosa et al., 2020) and indicates that 68.9% (60.2%-76.9%) of adolescent and young women aged <25 years acquired HIV in age-disparate relationships with men at least 5 years older. The estimated proportion of infections attributable to age-disparate relationships was approximately 90% among adolescent girls, and decreased to approximately 50% among women aged 25 years. Taken together, the data from this study and other phylogenetic studies from Uganda, Zambia, and South Africa suggest that rapid increases in HIV prevalence among adolescent and young women may be driven by the same source populations across sub-Saharan Africa, and support DREAMS interventions that include clear prevention messages about age-disparate sexual relationships.

Acknowledgements

This study was supported by the Bill & Melinda Gates Foundation (OPP1175094, OPP1084362), the National Institute of Allergy and Infectious Diseases (R01AI110324, U01AI100031, U01AI075115, R01AI102939, K01AI125086-01), National Institute of Mental Health (R01MH107275), the National Institute of Child Health and Development (RO1HD070769, R01HD050180), the Division of Intramural Research of the National Institute for Allergy and Infectious Diseases, the World Bank, the Doris Duke Charitable Foundation, the Johns Hopkins University Center for AIDS Research (P30AI094189), and the President's Emergency Plan for AIDS Relief through the Centers for Disease Control and Prevention (NU2GGH000817). We acknowledge data management support provided in part by the Office of Cyberinfrastructure and Computational Biology at the National Institute for Allergy and Infectious Diseases, computational support through the Imperial College Research Computing Service, doi: 10.14469/hpc/2232. We thank the participants of the Rakai Community Cohort Study and the many staff and investigators who made this study possible, as well as the PANGEA-HIV steering committee, the RCCS leadership, and two anonymous reviewers for their helpful comments on this manuscript.


List of Supplementary Material

S1. Maximum likelihood flow estimates under heterogeneous sampling
S2. Numerical inference algorithms
S3. Simulation experiments
S4. Modelling and estimation of the sampling cascade
S5. Analyses using different age bands
S6. Supplementary Figures and Tables

Stan files were provided as text documents.
prior_gamma.txt
prior_gp.txt
prior_gp_approx.txt

References

Abeler-Dörner, L., Grabowski, M. K., Rambaut, A., Pillay, D., Fraser, C. et al. (2019) PANGEA-HIV 2: Phylogenetics and networks for generalised epidemics in Africa. Current Opinion in HIV and AIDS, 14, 173–180.

Ailloud, F., Didelot, X., Woltemate, S., Pfaffinger, G., Overmann, J., Bader, R. C., Schulz, C., Malfertheiner, P. and Suerbaum, S. (2019) Within-host evolution of Helicobacter pylori shaped by niche-specific adaptation, intragastric migrations and selective sweeps. Nature Communications, 10, 1–13.

Anderson, R. M. and May, R. M. (1992) Infectious diseases of humans: dynamics and control. Oxford University Press.

Barré-Sinoussi, F., Abdool Karim, S. S., Albert, J., Bekker, L.-G., Beyrer, C., Cahn, P., Calmy, A., Grinsztejn, B., Grulich, A., Kamarulzaman, A. et al. (2018) Expert consensus statement on the science of HIV in the context of criminal law. Journal of the International AIDS Society, 21, e25161.

Bbosa, N., Ssemwanga, D., Ssekagiri, A., Xi, X., Mayanja, Y., Bahemuka, U., Seeley, J., Pillay, D., Abeler-Dörner, L., Golubchik, T. et al. (2020) Phylogenetic and demographic characterization of directed HIV-1 transmission using deep sequences from high-risk and general population cohorts/groups in Uganda. Viruses, 12, 331.

Berger, J. O., Bernardo, J. M. and Sun, D. (2015) Overall objective priors. Bayesian Analysis, 10, 189–221.


Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P. and Riddell, A. (2017) Stan: A probabilistic programming language. Journal of Statistical Software, 76.

Cohen, M. S., Chen, Y. Q., McCauley, M., Gamble, T., Hosseinipour, M. C., Kumarasamy, N., Hakim, J. G., Kumwenda, J., Grinsztejn, B., Pilotto, J. H. et al. (2011) Prevention of HIV-1 infection with early antiretroviral therapy. New England Journal of Medicine, 365, 493–505.

De Oliveira, T., Kharsany, A. B., Gräf, T., Cawood, C., Khanyile, D., Grobler, A., Puren, A., Madurai, S., Baxter, C., Karim, Q. A. et al. (2017) Transmission networks and risk of HIV infection in KwaZulu-Natal, South Africa: a community-wide phylogenetic study. The Lancet HIV, 4, e41–e50.

Dwyer-Lindgren, L., Cork, M. A., Sligar, A., Steuben, K. M., Wilson, K. F., Provost, N. R., Mayala, B. K., VanderHeide, J. D., Collison, M. L., Hall, J. B. et al. (2019) Mapping HIV prevalence in sub-Saharan Africa between 2000 and 2017. Nature, 570, 189.

Faria, N. R., Rambaut, A., Suchard, M. A., Baele, G., Bedford, T., Ward, M. J., Tatem, A. J., Sousa, J. D., Arinaminpathy, N., Pépin, J. et al. (2014) The early spread and epidemic ignition of HIV-1 in human populations. Science, 346, 56–61.

Faye, O., Boëlle, P.-Y., Heleze, E., Faye, O., Loucoubar, C., Magassouba, N., Soropogui, B., Keita, S., Gakou, T., Koivogui, L. et al. (2015) Chains of transmission and control of Ebola virus disease in Conakry, Guinea, in 2014: an observational study. The Lancet Infectious Diseases, 15, 320–326.

Gall, A., Ferns, B., Morris, C., Watson, S., Cotten, M., Robinson, M., Berry, N., Pillay, D. and Kellam, P. (2012) Universal amplification, next-generation sequencing, and assembly of HIV-1 genomes. Journal of Clinical Microbiology, 50, 3838–3844.

Givens, G. H., Smith, D. and Tweedie, R. (1997) Publication bias in meta-analysis: a Bayesian data-augmentation approach to account for issues exemplified in the passive smoking debate. Statistical Science, 221–240.

Golubchik, T., Ratmann, O., Wymant, C., Hall, M., Bonsall, D., Grabowski, M. K., Laeyendecker, O. and Fraser, C. (2017) Quantifying within-host viral diversification using deep sequencing data: recent vs chronic HIV infection.

Grabowski, M. K., Lessler, J., Redd, A. D., Kagaayi, J., Laeyendecker, O., Ndyanabo, A., Nelson, M. I., Cummings, D. A., Bwanika, J. B., Mueller, A. C. et al. (2014) The role of viral introductions in sustaining community-based HIV epidemics in rural Uganda: evidence from spatial clustering, phylogenetics, and egocentric transmission models. PLoS Medicine, 11.

Grabowski, M. K., Reynolds, S. J., Kagaayi, J., Gray, R. H., Clarke, W., Chang, L., Nakigozi, G., Laeyendecker, O., Redd, A. D., Goud-Billoux, V. et al. (2018) The validity of self-reported antiretroviral use in persons living with HIV: a population-based study. AIDS, 32, 363.

Grabowski, M. K., Serwadda, D. M., Gray, R. H., Nakigozi, G., Kigozi, G., Kagaayi, J., Ssekubugu, R., Nalugoda, F., Lessler, J., Lutalo, T. et al. (2017) HIV prevention efforts and incidence of HIV in Uganda. New England Journal of Medicine, 377, 2154–2166.

Hall, M. D., Holden, M. T., Srisomang, P., Mahavanakul, W., Wuthiekanun, V., Limmathurotsakul, D., Fountain, K., Parkhill, J., Nickerson, E. K., Peacock, S. J. et al. (2019) Improved characterisation of MRSA transmission using within-host bacterial sequence diversity. eLife, 8, e46402.

Hayes, R. J., Donnell, D., Floyd, S., Mandla, N., Bwalya, J., Sabapathy, K., Yang, B., Phiri, M., Schaap, A., Eshleman, S. H., Piwowar-Manning, E., Kosloff, B., James, A., Skalland, T., Wilson, E., Emel, L., Macleod, D., Dunbar, R., Simwinga, M., Makola, N., Bond, V., Hoddinott, G., Moore, A., Griffith, S., Deshmane Sista, N., Vermund, S. H., El-Sadr, W., Burns, D. N., Hargreaves, J. R., Hauck, K., Fraser, C., Shanaube, K., Bock, P., Beyers, N., Ayles, H. and Fidler, S. (2019) Effect of Universal Testing and Treatment on HIV Incidence — HPTN 071 (PopART). New England Journal of Medicine, 381, 207–218. PMID: 31314965.

Hazelton, M. L. (2001) Inference for origin–destination matrices: estimation, prediction and reconstruction. Transportation Research Part B: Methodological, 35, 667–676.

Houlihan, C. F., Frampton, D., Ferns, R. B., Raffle, J., Grant, P., Reidy, M., Hail, L., Thomson, K., Mattes, F., Kozlakidis, Z. et al. (2018) Use of whole-genome sequencing in the investigation of a nosocomial influenza virus outbreak. The Journal of Infectious Diseases, 218, 1485–1489.

van de Kassteele, J., van Eijkeren, J., Wallinga, J. et al. (2017) Efficient estimation of age-specific social contact rates between men and women. The Annals of Applied Statistics, 11, 320–339.

Le Vu, S., Ratmann, O., Delpech, V., Brown, A. E., Gill, O. N., Tostevin, A., Dunn, D., Fraser, C., Volz, E. M. and the UK HIV Drug Resistance Database (2019) HIV-1 transmission patterns in Men Who Have Sex with Men: Insights from genetic source attribution analysis. AIDS Research and Human Retroviruses, 35, 805–813.

Leitner, T. and Romero-Severson, E. (2018) Phylogenetic patterns recover known HIV epidemiological relationships and reveal common transmission of multiple variants. Nature Microbiology, 3, 983–988.

Lemey, P., Rambaut, A., Drummond, A. J. and Suchard, M. A. (2009) Bayesian phylogeography finds its roots. PLoS Computational Biology, 5, e1000520.

Lindström, T., Grear, D. A., Buhnerkempe, M., Webb, C. T., Miller, R. S., Portacci, K. and Wennergren, U. (2013) A Bayesian approach for modeling cattle movements in the United States: scaling up a partially observed network. PLoS One, 8, e53432.

Miller, H. J., Dodge, S., Miller, J. and Bohrer, G. (2019) Towards an integrated science of movement: converging research on animal movement ecology and human mobility science. International Journal of Geographical Information Science, 33, 855–876.

Poon, A. F., Gustafson, R., Daly, P., Zerr, L., Demlow, S. E., Wong, J., Woods, C. K., Hogg, R. S., Krajden, M., Moore, D. et al. (2016) Near real-time monitoring of HIV transmission hotspots from routine HIV genotyping: an implementation case study. The Lancet HIV, 3, e231–e238.

Probert, W., Hall, M., Xi, X., Sauter, R., Golubchik, T., Bonsall, D., Abeler-Dörner, L., Pickles, M., Cori, A., Bwalya, J., Floyd, S., Mandla, N., Shanaube, K., Yang, B., Ayles, H., Bock, P., Donnell, D., Grabowski, K., Pillay, D., Rambaut, A., Ratmann, O., Fidler, S., Hayes, R., Fraser, C., PANGEA consortium and the HPTN 071 (PopART) study team (2019) Quantifying the contribution of different aged men and women to onwards transmission of HIV-1 in generalised epidemics in sub-Saharan Africa: A modelling and phylogenetics approach from the HPTN071 (PopART) trial.

Rasmussen, C. E. and Williams, C. (2006) Gaussian processes for Machine Learning. MIT Press.

Ratmann, O., Grabowski, M. K., Hall, M., Golubchik, T., Wymant, C., Abeler-Dörner, L., Bonsall, D., Hoppe, A., Brown, A. L., de Oliveira, T. et al. (2019) Inferring HIV-1 transmission networks and sources of epidemic spread in Africa with deep-sequence phylogenetic analysis. Nature Communications, 10, 1–13.

Ratmann, O., Kagaayi, J., Hall, M., Golubchik, T., Kigozi, G., Xi, X., Wymant, C., Nakigozi, G., Abeler-Dörner, L., Bonsall, D. et al. (2020) Quantifying HIV transmission flow between high-prevalence hotspots and surrounding communities: a population-based study in Rakai, Uganda. The Lancet HIV, 7, e173–e183.

Ratmann, O., Van Sighem, A., Bezemer, D., Gavryushkina, A., Jurriaans, S., Wensing, A., De Wolf, F., Reiss, P., Fraser, C. et al. (2016) Sources of HIV infection among men having sex with men and implications for prevention. Science Translational Medicine, 8, 320ra2.

Raymer, J., Wiśniowski, A., Forster, J. J., Smith, P. W. and Bijak, J. (2013) Integrated modeling of European migration. Journal of the American Statistical Association, 108, 801–819.

Rhoads, A. and Au, K. F. (2015) PacBio sequencing and its applications. Genomics, Proteomics & Bioinformatics, 13, 278–289.

Riutort-Mayol, G., Bürkner, P.-C., Andersen, M. R., Solin, A. and Vehtari, A. (2020) Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming. arXiv preprint arXiv:2004.11408.

Rue, H. and Held, L. (2005) Gaussian Markov random fields: theory and applications. CRC Press.

Saul, J., Bachman, G., Allen, S., Toiv, N., Cooney, C. and Beamon, T. (2018) Determined resilient empowered AIDS-free mentored and safe (DREAMS): What is the core package and why now. PLOS One, 13, e0208167.

Scire, J., Barido-Sottani, J., Kühnert, D., Vaughan, T. G. and Stadler, T. (2020) Improved multi-type birth-death phylodynamic inference in BEAST 2. bioRxiv, 2020.01.06.895532.

Skums, P., Zelikovsky, A., Singh, R., Gussler, W., Dimitrova, Z., Knyazev, S., Mandric, I., Ramachandran, S., Campo, D., Jha, D. et al. (2018) QUENTIN: reconstruction of disease transmissions from viral quasispecies genomic data. Bioinformatics, 34, 163–170.

Solin, A. and Särkkä, S. (2020) Hilbert space methods for reduced-rank Gaussian process regression. Statistics and Computing, 30, 419–446.

Sun, K., Wang, W., Gao, L., Wang, Y., Luo, K., Ren, L., Zhan, Z., Chen, X., Zhao, S., Huang, Y. et al. (2021) Transmission heterogeneities, kinetics, and controllability of SARS-CoV-2. Science, 371.


Tebaldi, C. and West, M. (1998) Bayesian inference on network traffic using link count data. Journal of the American Statistical Association, 93, 557–573.

UNAIDS (2018) Miles to go: closing gaps, breaking barriers, righting justice, document jc2924. URL: https://www.unaids.org/sites/default/files/media_asset/miles-to-go_en.pdf.

UNAIDS (2019) UNAIDS Data 2019, document jc2959e. URL: https://www.unaids.org/en/resources/documents/2019/2019-UNAIDS-data.

Vaughan, T. G., Kühnert, D., Popinga, A., Welch, D. and Drummond, A. J. (2014) Efficient Bayesian inference under the structured coalescent. Bioinformatics, 30, 2272–2279.

Vaughan, T. G. and Drummond, A. J. (2013) A stochastic simulator of birth–death master equations with application to phylodynamics. Molecular Biology and Evolution, 30, 1480–1493.

Vehtari, A., Gelman, A., Simpson, D., Carpenter, B. and Bürkner, P.-C. (2019) Rank-normalization, folding, and localization: An improved R̂ for assessing convergence of MCMC. arXiv preprint arXiv:1903.08008.

Volz, E. M., Ionides, E., Romero-Severson, E. O., Brandt, M.-G., Mokotoff, E. and Koopman, J. S. (2013) HIV-1 transmission during early infection in Men who have Sex with Men: a phylodynamic analysis. PLoS Medicine, 10, e1001568.

Volz, E. M., Pond, S. L. K., Ward, M. J., Brown, A. J. L. and Frost, S. D. (2009) Phylodynamics of infectious disease epidemics. Genetics, 183, 1421–1430.

Wymant, C., Hall, M., Ratmann, O., Bonsall, D., Golubchik, T., de Cesare, M., Gall, A., Cornelissen, M., Fraser, C., STOP-HCV Consortium, The Maela Pneumococcal Collaboration and The BEEHIVE Collaboration (2017) PHYLOSCANNER: inferring transmission from within- and between-host pathogen genetic diversity. Molecular Biology and Evolution, 35, 719–733.

Zhang, Y., Wymant, C., Laeyendecker, O., Grabowski, M. K., Hall, M., Hudelson, S., Piwowar-Manning, E., McCauley, M., Gamble, T., Hosseinipour, M. C. et al. (2020) Evaluation of phylogenetic methods for inferring the direction of HIV transmission: HPTN 052. Clinical Infectious Diseases, Epub ahead of print, ciz1247.

Stadler, T. and Bonhoeffer, S. (2013) Uncovering epidemiological dynamics in heterogeneous host populations using phylogenetic methods. Philosophical Transactions of the Royal Society B: Biological Sciences, 368, 20120198.


S1. Maximum likelihood flow estimates under heterogeneous sampling

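We here describe the derivation of the maximum likelihood estimates of the Poisson model (10),

\[
p(\boldsymbol{n} \mid \boldsymbol{\lambda}, \boldsymbol{\xi}) \propto \prod_{a,b} (\lambda_{ab}\xi_a\xi_b)^{n_{ab}} \exp(-\lambda_{ab}\xi_a\xi_b),
\]

when the number of sampled individuals in $a$, $N^s_a = \sum_{i \in a} \mathbb{1}_{\{s_i = 1\}}$, is a Binomial sample of the number of all individuals in $a$, $N_a$. We denote the vector of sampled individuals across population strata by $\boldsymbol{N}^s = (N^s_a)_{a \in \mathcal{A}}$, and similarly the vector of individuals in each population group by $\boldsymbol{N} = (N_a)_{a \in \mathcal{A}}$. Then the product likelihood is

\[
p(\boldsymbol{n}, \boldsymbol{N}, \boldsymbol{N}^s \mid \boldsymbol{\lambda}, \boldsymbol{\xi}) = \Big( \prod_{a,b} \mathrm{Poisson}(n_{ab}; \lambda_{ab}\xi_a\xi_b) \Big) \Big( \prod_{a} \mathrm{Binomial}(N^s_a; N_a, \xi_a) \Big). \tag{S1}
\]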

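Setting the derivatives of the logarithm of the likelihood (S1) with respect to $\boldsymbol{\lambda}$ and $\boldsymbol{\xi}$ to zero, and writing $\lambda_{ab} = \pi_{ab}\eta$, gives

\[
\pi_{ab}\eta = \frac{n_{ab}}{\xi_a\xi_b}, \qquad \xi_a = \frac{N^s_a}{N_a}. \tag{S2}
\]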
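The step from (S1) to (S2) can be spelled out as follows. This is a sketch of the intermediate algebra that is not part of the original derivation; it assumes $\lambda_{ab} = \pi_{ab}\eta$ with $\eta = \sum_{a,b}\lambda_{ab}$ and drops additive constants.

```latex
\begin{align*}
\ell(\boldsymbol{\lambda}, \boldsymbol{\xi})
  &= \sum_{a,b} \big[ n_{ab} \log(\lambda_{ab}\xi_a\xi_b) - \lambda_{ab}\xi_a\xi_b \big]
   + \sum_{a} \big[ N^s_a \log \xi_a + (N_a - N^s_a) \log(1 - \xi_a) \big] + \text{const}, \\
\frac{\partial \ell}{\partial \lambda_{ab}}
  &= \frac{n_{ab}}{\lambda_{ab}} - \xi_a\xi_b = 0
  \quad\Longrightarrow\quad \lambda_{ab} = \frac{n_{ab}}{\xi_a\xi_b}, \\
\frac{\partial \ell}{\partial \xi_a}
  &= \sum_{b} \Big[ \frac{n_{ab}}{\xi_a} - \lambda_{ab}\xi_b \Big]
   + \sum_{b} \Big[ \frac{n_{ba}}{\xi_a} - \lambda_{ba}\xi_b \Big]
   + \frac{N^s_a}{\xi_a} - \frac{N_a - N^s_a}{1 - \xi_a} = 0 .
\end{align*}
```

At $\lambda_{ab} = n_{ab}/(\xi_a\xi_b)$ every bracketed Poisson term in the last line vanishes, so the remaining Binomial terms give $\xi_a = N^s_a/N_a$.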

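As $\sum_{a,b} \pi_{ab} = 1$,

\[
\eta = \sum_{a,b} \frac{n_{ab}}{\xi_a\xi_b}, \tag{S3}
\]

and we obtain

\[
\pi_{ab} = \frac{n_{ab}}{\xi_a\xi_b} \Big/ \sum_{c,d} \frac{n_{cd}}{\xi_c\xi_d}. \tag{S4}
\]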
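The estimates (S2)-(S4) are simple to compute directly. The following Python sketch is illustrative only: the function name `mle_flows` and the example counts are hypothetical and not taken from the phyloflows package.

```python
def mle_flows(n, n_sampled, n_total):
    """Maximum likelihood estimates (S2)-(S4) for K population groups.

    n[a][b]     observed transmission counts from group a to group b
    n_sampled   number of sampled individuals per group (N^s_a)
    n_total     number of all individuals per group (N_a)
    """
    k = len(n_total)
    # (S2): sampling probabilities are the sampled fractions.
    xi = [n_sampled[a] / n_total[a] for a in range(k)]
    # (S2): counts inflated by the probability that both partners are sampled.
    lam = [[n[a][b] / (xi[a] * xi[b]) for b in range(k)] for a in range(k)]
    # (S3): total transmission intensity.
    eta = sum(sum(row) for row in lam)
    # (S4): flows are the normalised intensities.
    pi = [[lam[a][b] / eta for b in range(k)] for a in range(k)]
    return xi, eta, pi


# Hypothetical counts for two groups a and b, sampled at 60% and 35%.
counts = [[36, 4], [6, 18]]
xi_hat, eta_hat, pi_hat = mle_flows(counts, n_sampled=[600, 350], n_total=[1000, 1000])
# Flows obtained when sampling differences are ignored, for comparison.
total = sum(sum(row) for row in counts)
naive = [[c / total for c in row] for row in counts]
```

In this toy example the flow within the less densely sampled group b is larger under `pi_hat` than under `naive`, which is the correction that (S4) is meant to provide.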

S2. Numerical inference algorithms

We here describe numerical algorithms for estimating transmission flows with the Poisson flow model (10).

It is possible to implement the flow model (10) in the Stan computing language (Carpenter et al., 2017). We hope this feature enhances robust numerical inference in a well-tested software environment, facilitates model sharing and model extensions, and gives end-users the option to apply alternative inference algorithms. For the work presented in this paper, we have used the dynamic Hamiltonian Monte Carlo algorithm of the Stan probabilistic programming framework in Stan version 2.21. Section S2.1 describes the Stan model specification for inference of the joint posterior distribution (12) when population strata are unordered and the prior density on the transmission intensities $\boldsymbol{\lambda}$ is independent Gamma (14). Section S2.2 describes the Stan model specification for inference when population strata can be ordered and the prior densities for the transmission intensities are correlated through a Gaussian process prior (15). For simplicity, we describe models and algorithms for the case that sampling probabilities are the same among source and recipient cases. Extensions to different sampling probabilities are straightforward. Stan is not necessary for numerical inference. To illustrate this, we describe in Section S2.3 a custom MCMC-within-Gibbs sampler for inference of the joint posterior distribution (12) when population strata are unordered and the prior density on the transmission intensities $\boldsymbol{\lambda}$ is independent Gamma (14).

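S2.1. Stan implementation using independent prior on transmission intensities

When population strata are unordered, we focus on inference of the posterior distribution

\[
\begin{aligned}
p(\boldsymbol{\lambda}, \boldsymbol{\xi} \mid \boldsymbol{n}, \boldsymbol{s}, \boldsymbol{X})
 &\propto p(\boldsymbol{n} \mid \boldsymbol{\lambda}, \boldsymbol{\xi}, \boldsymbol{s}, \boldsymbol{X})\, p(\boldsymbol{\lambda} \mid \boldsymbol{\xi}, \boldsymbol{s}, \boldsymbol{X})\, p(\boldsymbol{\xi} \mid \boldsymbol{s}, \boldsymbol{X}) \\
 &= p(\boldsymbol{n} \mid \boldsymbol{\lambda}, \boldsymbol{\xi})\, p(\boldsymbol{\lambda} \mid \boldsymbol{\xi})\, p(\boldsymbol{\xi} \mid \boldsymbol{s}, \boldsymbol{X}) \\
 &= \Big( \prod_{a,b} \mathrm{Poisson}(n_{ab}; \lambda_{ab}\xi_a\xi_b) \Big) \Big( \prod_{a,b} \mathrm{Gamma}(\lambda_{ab}; \alpha_{ab}, \beta) \Big)\, p(\boldsymbol{\xi} \mid \boldsymbol{s}, \boldsymbol{X}),
\end{aligned} \tag{S5}
\]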

where

$\boldsymbol{\lambda}$: vector of transmission intensities from group $a$ to group $b$, with length $L$,
$\boldsymbol{\xi}$: vector of sampling proportions for each population group $a$, $a \in \mathcal{A}$,
$\boldsymbol{n}$: vector of observed transmission counts, with length $L$,
$\boldsymbol{s}$: sampling status vector, with length $N$,
$\boldsymbol{X}$: matrix of sampling characteristics, with dimension $N \times p$,

and $\alpha_{ab}$, $\beta$ are as in (14), and the flow vector $\boldsymbol{\pi}$ is obtained from the transformation $\pi_{ab} = \lambda_{ab} / \big(\sum_{c,d} \lambda_{cd}\big)$.

To specify the posterior density $p(\boldsymbol{\xi} \mid \boldsymbol{s}, \boldsymbol{X})$ in Stan, we further approximated its components $p(\xi_a \mid \boldsymbol{s}, \boldsymbol{X})$ through a suitable closed-form density. In our applications, we opted for Beta densities with both shape parameters estimated via maximum likelihood. The resulting Stan model file was provided in the file named prior_gamma.txt.

Use of the Stan model is illustrated in the online example script https://github.com/BDI-pathogens/phyloscanner/blob/master/phyloflows/vignettes/07_age_analysis.md.
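To make the structure of (S5) concrete, the following Python sketch evaluates the unnormalised log posterior density with Beta densities for the sampling probabilities. It is not the Stan model in prior_gamma.txt; the function names and all numerical values are hypothetical and chosen for illustration only.

```python
import math


def log_poisson(n, mu):
    """Log Poisson probability mass of count n with mean mu."""
    return n * math.log(mu) - mu - math.lgamma(n + 1)


def log_gamma_density(x, shape, rate):
    """Log Gamma density in the shape/rate parameterisation."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(x) - rate * x)


def log_beta_density(x, shape1, shape2):
    """Log Beta density with two shape parameters."""
    return (math.lgamma(shape1 + shape2) - math.lgamma(shape1) - math.lgamma(shape2)
            + (shape1 - 1.0) * math.log(x) + (shape2 - 1.0) * math.log(1.0 - x))


def log_posterior(lam, xi, n, alpha, beta, xi_shapes):
    """Unnormalised log posterior (S5): Poisson likelihood, Gamma priors on
    lam[a][b], and Beta densities approximating p(xi_a | s, X)."""
    k = len(xi)
    total = 0.0
    for a in range(k):
        total += log_beta_density(xi[a], *xi_shapes[a])
        for b in range(k):
            total += log_poisson(n[a][b], lam[a][b] * xi[a] * xi[b])
            total += log_gamma_density(lam[a][b], alpha[a][b], beta)
    return total


def flows(lam):
    """Flow transformation pi_ab = lam_ab / sum_cd lam_cd."""
    eta = sum(sum(row) for row in lam)
    return [[value / eta for value in row] for row in lam]


# Hypothetical inputs for two population groups.
n_obs = [[36, 4], [6, 18]]
alpha = [[0.8, 0.8], [0.8, 0.8]]
xi_shapes = [(600.5, 400.5), (350.5, 650.5)]
lam_near = [[100.0, 19.0], [28.5, 147.0]]
lam_far = [[1.0, 1.0], [1.0, 1.0]]
lp_near = log_posterior(lam_near, [0.6, 0.35], n_obs, alpha, 0.01, xi_shapes)
lp_far = log_posterior(lam_far, [0.6, 0.35], n_obs, alpha, 0.01, xi_shapes)
```

Intensities close to the sampling-adjusted counts (`lam_near`) receive a much higher log posterior density than intensities far from them (`lam_far`), as expected from the Poisson term in (S5).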


S2.2. Stan implementation using Gaussian process prior on transmission intensities

When population strata can be ordered as in the application to age-specific transmission flows, we focus on inference of the posterior distribution (S5) with the independent Gamma priors replaced by a two-dimensional Hilbert space Gaussian process (HSGP) prior on the non-zero components of $\boldsymbol{\lambda}$, (15). To implement this in Stan, Equation (17) can be more compactly written as

\[
k_{\mathrm{HSGP}}\big((a, b), (a', b')\big) = \boldsymbol{\phi}(a, b)^{T} \Delta\, \boldsymbol{\phi}(a', b'), \tag{S6}
\]

where $\boldsymbol{\phi}(a, b) = (\phi_j(a, b))_{j=1}^{m} \in \mathbb{R}^{m}$ is the column vector of eigenfunctions and $\Delta \in \mathbb{R}^{m \times m}$ is the diagonal matrix whose $(j, j)$th entry is $S_\theta(\sqrt{\lambda_j})$. Consequently, the HSGP Gram matrix for $n$ observations and associated inputs $(a_i, b_i)_{i=1}^{n} \in \Omega$ is

\[
K_{\mathrm{HSGP}} = \Phi \Delta \Phi^{T}, \tag{S7}
\]

where

\[
\Phi = \begin{pmatrix}
\phi_1(a_1, b_1) & \cdots & \phi_m(a_1, b_1) \\
\vdots & \ddots & \vdots \\
\phi_1(a_n, b_n) & \cdots & \phi_m(a_n, b_n)
\end{pmatrix} \in \mathbb{R}^{n \times m}.
\]

The two-dimensional zero-mean HSGP model is fully defined in terms of the kernel (17), as the stochastic process for which any finite set of two-dimensional inputs in $\Omega$ follows a multivariate normal distribution with variance-covariance matrix specified by (17), and this is vectorised as (S7). Using Cholesky decomposition, the HSGP model (17) can be conveniently calculated from $m$ standard normal random variables through

\[
f(a, b) = \sum_{j=1}^{m} \sqrt{S_\theta(\sqrt{\lambda_j})}\, \phi_j(a, b)\, \beta_j, \tag{S8}
\]

where $\beta_j \sim \mathcal{N}(0, 1)$ for $j = 1, \dots, m$. The resulting Stan model file is prior_gp_approx.txt. Use of the Stan model is illustrated in the online example script https://github.com/BDI-pathogens/phyloscanner/blob/master/phyloflows/vignettes/07_age_analysis.md. We explored optimal choices of the tuning parameters of the HSGP approximation on simulations; please see Sections 3.2 and S3.4 for details. The Stan file using the GP prior without approximation is in prior_gp.txt, as a comparison to the approximated GP priors.
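The constructions (S6) and (S8) can be sketched in a few lines of Python. The sketch is not a translation of prior_gp_approx.txt and rests on assumptions not stated in this section: inputs are centred on a rectangle $[-L_1, L_1] \times [-L_2, L_2]$, the eigenfunctions are the sine eigenfunctions of the Laplacian with Dirichlet boundary conditions, the kernel is squared exponential with one lengthscale per dimension, and `m` counts basis functions per dimension (so $m^2$ in total). All function names and values are hypothetical.

```python
import math
import random


def sqrt_eigenvalue(j, half_width):
    """Square root of the j-th Laplacian eigenvalue on [-L, L]."""
    return math.pi * j / (2.0 * half_width)


def eigenfunction(j, x, half_width):
    """j-th one-dimensional sine eigenfunction evaluated at x in [-L, L]."""
    return math.sin(sqrt_eigenvalue(j, half_width) * (x + half_width)) / math.sqrt(half_width)


def spectral_density(w1, w2, sigma, ell1, ell2):
    """Spectral density of a 2D squared exponential kernel."""
    return (sigma ** 2 * 2.0 * math.pi * ell1 * ell2
            * math.exp(-0.5 * ((ell1 * w1) ** 2 + (ell2 * w2) ** 2)))


def basis_and_weights(point, m, half_widths, sigma, ells):
    """Return phi_j(a, b) and the diagonal of Delta for all m*m index pairs."""
    (a, b), (l1, l2) = point, half_widths
    phi, delta = [], []
    for j1 in range(1, m + 1):
        for j2 in range(1, m + 1):
            phi.append(eigenfunction(j1, a, l1) * eigenfunction(j2, b, l2))
            delta.append(spectral_density(sqrt_eigenvalue(j1, l1),
                                          sqrt_eigenvalue(j2, l2),
                                          sigma, ells[0], ells[1]))
    return phi, delta


def hsgp_kernel(x, y, m, half_widths, sigma, ells):
    """(S6): phi(x)^T Delta phi(y)."""
    phi_x, delta = basis_and_weights(x, m, half_widths, sigma, ells)
    phi_y, _ = basis_and_weights(y, m, half_widths, sigma, ells)
    return sum(d * px * py for d, px, py in zip(delta, phi_x, phi_y))


def hsgp_draw(points, m, half_widths, sigma, ells, rng):
    """(S8): one draw of f at all points from m*m standard normal variables."""
    beta = [rng.gauss(0.0, 1.0) for _ in range(m * m)]
    draws = []
    for point in points:
        phi, delta = basis_and_weights(point, m, half_widths, sigma, ells)
        draws.append(sum(math.sqrt(d) * p * b for d, p, b in zip(delta, phi, beta)))
    return draws


def exact_kernel(x, y, sigma, ells):
    """Exact squared exponential kernel, for comparison with (S6)."""
    return sigma ** 2 * math.exp(-0.5 * (((x[0] - y[0]) / ells[0]) ** 2
                                         + ((x[1] - y[1]) / ells[1]) ** 2))


# Hypothetical settings: centred inputs, boundary half-widths of 10.
settings = dict(m=15, half_widths=(10.0, 10.0), sigma=1.0, ells=(3.0, 2.0))
approx = hsgp_kernel((0.0, 0.0), (1.0, 2.0), **settings)
exact = exact_kernel((0.0, 0.0), (1.0, 2.0), settings["sigma"], settings["ells"])
f_draw = hsgp_draw([(0.0, 0.0), (1.0, 2.0), (-2.0, 3.0)], rng=random.Random(1), **settings)
```

For inputs well inside the rectangle, `approx` and `exact` should agree closely; the agreement degrades near the boundary or when too few basis functions are used, which is why the tuning parameters need to be checked in simulations.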


S2.3. Markov Chain Monte Carlo within Gibbs algorithm

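This section describes implementation details of a custom Metropolis-within-Gibbs algorithm to estimate the posterior distribution (S5),

\[
p(\boldsymbol{\lambda}, \boldsymbol{\xi} \mid \boldsymbol{n}, \boldsymbol{s}, \boldsymbol{X}) \propto \Big( \prod_{a,b} \mathrm{Poisson}(n_{ab}; \lambda_{ab}\xi_a\xi_b) \Big) \Big( \prod_{a,b} \mathrm{Gamma}(\lambda_{ab}; \alpha_{ab}, \beta) \Big)\, p(\boldsymbol{\xi} \mid \boldsymbol{s}, \boldsymbol{X}).
\]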

The algorithm is also implemented in the phyloflows R package, version 1.2.0.

S2.3.1. Overall structure of algorithm

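The algorithm exploits the factorisation of the posterior density (S5) into the full conditionals

\[
\begin{aligned}
& p(\xi_a \mid \boldsymbol{\xi}_{-a}, \boldsymbol{\lambda}, \boldsymbol{s}, \boldsymbol{n}, \boldsymbol{X}) \quad \text{for all } a, \\
& p(\lambda_{ab} \mid \boldsymbol{\xi}, \boldsymbol{\lambda}_{-ab}, \boldsymbol{s}, \boldsymbol{n}, \boldsymbol{X}) \quad \text{for all } a, b,
\end{aligned}
\]

where $\xi_a$ is the sampling probability of the $a$th population group, $\lambda_{ab}$ denotes the average transmission counts from population group $a$ to population group $b$, $\boldsymbol{\xi}_{-a} = \boldsymbol{\xi} \setminus \xi_a$ and $\boldsymbol{\lambda}_{-ab} = \boldsymbol{\lambda} \setminus \lambda_{ab}$. The density $p(\xi_a \mid \boldsymbol{\xi}_{-a}, \boldsymbol{\lambda}, \boldsymbol{s}, \boldsymbol{n}, \boldsymbol{X})$ is not available in closed form, and we performed Metropolis-within-Gibbs updates for each $\xi_a$. The density $p(\lambda_{ab} \mid \boldsymbol{\xi}, \boldsymbol{\lambda}_{-ab}, \boldsymbol{s}, \boldsymbol{n}, \boldsymbol{X})$ is of Gamma form, and so a Gibbs step can be used to update $\lambda_{ab}$. The resulting MCMC algorithm iterates through Metropolis-within-Gibbs updates for each $\xi_a$, followed by a series of Gibbs updates for each $\lambda_{ab}$.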

S2.3.2. Metropolis-Hastings within Gibbs steps

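The algorithm starts by updating in turn the sampling proportions $\xi_a$ of each population group $a \in \mathcal{A}$. The full conditional distribution of $\xi_a$ is

\[
\begin{aligned}
p(\xi_a \mid \boldsymbol{\xi}_{-a}, \boldsymbol{\lambda}, \boldsymbol{n}, \boldsymbol{s}, \boldsymbol{X})
 &\propto \prod_{c=a \text{ or } d=a} p(n_{cd} \mid \lambda_{cd}, \xi_c\xi_d)\, p(\xi_a \mid \boldsymbol{s}, \boldsymbol{X}) \\
 &\propto \prod_{c=a \text{ or } d=a} \mathrm{Poi}(n_{cd}; \lambda_{cd}\xi_c\xi_d)\, \mathrm{Gamma}(\lambda_{cd}; \xi_c\xi_d)\, p(\xi_a \mid \boldsymbol{s}, \boldsymbol{X}),
\end{aligned} \tag{S9}
\]

which does not have closed form, and so a Metropolis-Hastings update is used. We sought to avoid tuning parameters as much as possible, and for this reason proposed moves from the prior,

\[
q(\xi'_a \mid \xi_a) = p(\xi'_a \mid \boldsymbol{s}, \boldsymbol{X}). \tag{S10}
\]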


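The resulting Metropolis-Hastings ratio is

\[
\begin{aligned}
r &= \frac{p(\xi'_a \mid \boldsymbol{\xi}_{-a}, \boldsymbol{\lambda}, \boldsymbol{n}, \boldsymbol{s}, \boldsymbol{X})}{p(\xi_a \mid \boldsymbol{\xi}_{-a}, \boldsymbol{\lambda}, \boldsymbol{n}, \boldsymbol{s}, \boldsymbol{X})} \times \frac{q(\xi_a)}{q(\xi'_a)} \\
 &= \frac{\prod_{b} p(n_{ab} \mid \lambda_{ab}, \xi'_a\xi_b) \prod_{b \neq a} p(n_{ba} \mid \lambda_{ba}, \xi_b\xi'_a) \prod_{b} p(\lambda_{ab} \mid \boldsymbol{\xi}') \prod_{b \neq a} p(\lambda_{ba} \mid \boldsymbol{\xi}')\, p(\xi'_a \mid \boldsymbol{s}, \boldsymbol{X})}{\prod_{b} p(n_{ab} \mid \lambda_{ab}, \xi_a\xi_b) \prod_{b \neq a} p(n_{ba} \mid \lambda_{ba}, \xi_b\xi_a) \prod_{b} p(\lambda_{ab} \mid \boldsymbol{\xi}) \prod_{b \neq a} p(\lambda_{ba} \mid \boldsymbol{\xi})\, p(\xi_a \mid \boldsymbol{s}, \boldsymbol{X})} \times \frac{p(\xi_a \mid \boldsymbol{s}, \boldsymbol{X})}{p(\xi'_a \mid \boldsymbol{s}, \boldsymbol{X})} \\
 &= \frac{\prod_{b} p(n_{ab} \mid \lambda_{ab}, \xi'_a\xi_b) \prod_{b \neq a} p(n_{ba} \mid \lambda_{ba}, \xi_b\xi'_a) \prod_{b} p(\lambda_{ab} \mid \boldsymbol{\xi}') \prod_{b \neq a} p(\lambda_{ba} \mid \boldsymbol{\xi}')}{\prod_{b} p(n_{ab} \mid \lambda_{ab}, \xi_a\xi_b) \prod_{b \neq a} p(n_{ba} \mid \lambda_{ba}, \xi_b\xi_a) \prod_{b} p(\lambda_{ab} \mid \boldsymbol{\xi}) \prod_{b \neq a} p(\lambda_{ba} \mid \boldsymbol{\xi})},
\end{aligned} \tag{S11}
\]

where for brevity we denoted by $\boldsymbol{\xi}'$ the vector of sampling probabilities such that $\xi'_c = \xi_c$ for all indices $c \neq a$ and $\xi'_a$ inserted at index $a$. Proposed moves are accepted with probability $\min(1, r)$.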

In our applications, different sampling cascades apply to source and recipient population groups, which we denote respectively by $\xi^s_a$ and $\xi^r_a$. Equations (S9-S11) generalise straightforwardly to this setting, and the algorithm updates the sampling probabilities $\boldsymbol{\xi}^s$ and $\boldsymbol{\xi}^r$ in turn.
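A single Metropolis-Hastings update of this kind can be sketched as follows. The sketch is not the phyloflows implementation. It assumes that proposals are drawn from Beta densities approximating $p(\xi_a \mid s, X)$, and, as a simplification, that the Gamma prior on the intensities does not depend on the sampling probabilities, so that the $p(\lambda \mid \xi)$ terms in (S11) cancel; if the prior (14) depends on $\xi$, those terms would have to be added to `log_r`. All names and values are hypothetical.

```python
import math
import random


def log_poisson(n, mu):
    """Log Poisson probability mass of count n with mean mu."""
    return n * math.log(mu) - mu - math.lgamma(n + 1)


def log_lik_terms(a, xi, lam, n):
    """Poisson log-likelihood terms with group a as source or recipient."""
    k = len(xi)
    total = 0.0
    for c in range(k):
        for d in range(k):
            if c == a or d == a:
                total += log_poisson(n[c][d], lam[c][d] * xi[c] * xi[d])
    return total


def mh_update_xi(a, xi, lam, n, prior_shapes, rng):
    """One update of xi[a] with an independence proposal from the prior (S10)."""
    proposal = list(xi)
    proposal[a] = rng.betavariate(*prior_shapes[a])
    # The prior density and the proposal density cancel, as in (S11).
    log_r = log_lik_terms(a, proposal, lam, n) - log_lik_terms(a, xi, lam, n)
    accept = rng.random() < math.exp(min(0.0, log_r))
    return (proposal, True) if accept else (list(xi), False)


# Hypothetical state for two population groups.
rng = random.Random(3)
lam = [[100.0, 19.0], [28.5, 147.0]]
n_obs = [[36, 4], [6, 18]]
prior_shapes = [(600.5, 400.5), (350.5, 650.5)]
xi = [0.6, 0.35]
n_accepted = 0
for _ in range(200):
    for a in range(2):
        xi, accepted = mh_update_xi(a, xi, lam, n_obs, prior_shapes, rng)
        n_accepted += accepted
```

Proposing from the prior avoids tuning a step size; the acceptance rate then depends on how much the transmission counts add to what the sampling data already say about each $\xi_a$.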

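S2.3.3. Gibbs step

The conditional distribution $p(\lambda_{ab} \mid \boldsymbol{\xi}, \boldsymbol{\lambda}_{-ab}, \boldsymbol{s}, \boldsymbol{n}, \boldsymbol{X})$ is

\[
\begin{aligned}
p(\lambda_{ab} \mid \boldsymbol{\xi}, \boldsymbol{\lambda}_{-ab}, \boldsymbol{s}, \boldsymbol{n}, \boldsymbol{X})
 &\propto p(\lambda_{ab})\, p(n_{ab} \mid \lambda_{ab}, \xi_a, \xi_b) \\
 &\propto \exp(-\lambda_{ab}\xi_a\xi_b)\, (\lambda_{ab}\xi_a\xi_b)^{n_{ab}}\, \lambda_{ab}^{\alpha_{ab} - 1} \exp(-\beta\lambda_{ab}) \\
 &\propto \exp\big(-\lambda_{ab}(\xi_a\xi_b + \beta)\big)\, \lambda_{ab}^{n_{ab} + \alpha_{ab} - 1} \\
 &\sim \mathrm{Gamma}(n_{ab} + \alpha_{ab},\ \xi_a\xi_b + \beta).
\end{aligned} \tag{S12}
\]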

Each λab is updated in turn by sampling from the right hand side.
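The Gibbs step (S12) amounts to one Gamma draw per intensity. A minimal Python sketch, with hypothetical names and values, is given below. Note that Python's `random.gammavariate` takes a shape and a scale, so the rate in (S12) is inverted.

```python
import random


def gibbs_update_lambda(n, xi, alpha, beta, rng):
    """Draw every lam[a][b] from Gamma(n_ab + alpha_ab, xi_a * xi_b + beta), as in (S12)."""
    k = len(xi)
    lam = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(k):
            shape = n[a][b] + alpha[a][b]
            rate = xi[a] * xi[b] + beta
            # gammavariate(shape, scale) with scale = 1 / rate
            lam[a][b] = rng.gammavariate(shape, 1.0 / rate)
    return lam


# Hypothetical inputs for two population groups.
rng = random.Random(11)
n_obs = [[36, 4], [6, 18]]
alpha = [[0.8, 0.8], [0.8, 0.8]]
draws = [gibbs_update_lambda(n_obs, [0.6, 0.35], alpha, 0.01, rng) for _ in range(4000)]
mean_aa = sum(d[0][0] for d in draws) / len(draws)
expected_aa = (36 + 0.8) / (0.6 * 0.6 + 0.01)  # shape / rate
```

The Monte Carlo mean `mean_aa` should be close to `expected_aa`, the mean of the Gamma full conditional.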

S3. Simulation experiments

We here describe several simulation experiments that we implemented to validate the hierarchical Poisson model (10) to estimate transmission flows in the context of sampling heterogeneity. Our overall strategy was to simulate transmission counts, re-estimate transmission flows, and compare the estimated flows with the true simulated flows.

S3.1. ODE-based simulation experiments

The first experiment was a minimal example to assess (1) the impact of sampling differences on flow estimates and (2) the basic performance of our approach. The main results were reported in Figure 3B. We here provide further detail on these simulation experiments.

Table S2. Parameter values used for the ODE-based simulations.

Parameter                                    Symbol              Value
Transmission rates, female to male (further divided by pop size)
  a->a                                       $\beta_{fm,aa}$     0.0713
  b->a                                       $\beta_{fm,ba}$     0.0071
  a->b                                       $\beta_{fm,ab}$     0.0122
  b->b                                       $\beta_{fm,bb}$     0.0713
Transmission rates, male to female (further divided by pop size)
  a->a                                       $\beta_{mf,aa}$     0.1019
  b->a                                       $\beta_{mf,ba}$     0.0173
  a->b                                       $\beta_{mf,ab}$     0.0224
  b->b                                       $\beta_{mf,bb}$     0.1019
Viral suppression rate                       $\gamma$            0.0444
Birth/death rate                             $\mu$               0.01667

We considered transmission flows between two population groups $\mathcal{A} = (a, b)$, which for ease of illustration we refer to as individuals living in rural areas or small communities (group a), and individuals living in large communities (group b). The population was further structured by men and women, yielding in total 4 population strata. We then simulated epidemics based on the compartmental model (4) among susceptible, infected, and treated individuals in the four population strata. The parameters of the model were specified such that 40% of individuals were in group a and 60% were in group b; half of the individuals in each group were men and half were women; and HIV prevalence was approximately 60% at equilibrium. Stochastic simulations in a population of 30000 individuals were performed using MASTER (Vaughan and Drummond, 2013), and run for 400 time units. The simulations are specified in the xml file https://github.com/BDI-pathogens/phyloscanner/blob/master/phyloflows/inst/misc/Master.xml. Table S2 lists the model parameters that we used in the simulations, where in terms of notation we replaced the population subscripts $l$ with $a$ and $h$ with $b$ compared to (4).

To assess the impact of sampling heterogeneity on flow estimates, we kept the sampling probability in population group a at 60%, and varied the sampling probability in group b from 60% to 35% in order to assess the impact of sampling differences of 0%, 5%, 10%, 15%, 20%, 25%. The sampling status of each individual in a was simulated under a Bernoulli draw with probability $\xi_a$, and respectively for each individual in b with probability $\xi_b$, and observed transmissions $n^r_{ab}$ from group a to group b were calculated as in (S15).

[Figure S1: two panels, "sampling in location a" and "sampling in location b"; x-axis: sampling probabilities; legend: posterior density estimated from simulated sampling status data, and Beta(25.5, 25.5) used when ignoring sampling heterogeneity.]
Fig. S1. Re-estimated sampling probabilities in the ODE-based simulations. We simulated transmission flows between men and women in two locations a and b under the ODE model (4) in Section S3.1. From the simulated sampling status data, posterior estimates of the sampling proportions in a and b were obtained with Equation (S13), and then used in flow inference. We show posterior densities of the sampling probabilities in a and b (grey), compared to a Beta(25.5, 25.5) prior density. Vertical lines indicate the true sampling fractions, $\xi_a = 0.6$ and $\xi_b = 0.35$.

From the simulated data, we first re-estimated the transmission flows while ignoring sampling heterogeneity. We estimated the posterior distribution (12) with the phyloflows MCMC-within-Gibbs sampler of Section S2.3, with $p(\xi_a \mid \boldsymbol{s}, \boldsymbol{X})$ and $p(\xi_b \mid \boldsymbol{s}, \boldsymbol{X})$ set to Beta(25.5, 25.5). Thus, the specified sampling densities ignored sampling heterogeneity. Code is available at https://github.com/BDI-pathogens/phyloscanner/blob/master/phyloflows/inst/misc/ode_estimation.R. For each simulation, we then calculated the worst case error. The main results are shown in Figure 3B, and indicate that worst case error increased substantially with sampling heterogeneity.

Second, we re-estimated the transmission flows while accounting for sampling heterogeneity. The posterior distribution of sampling probabilities $p(\xi_a \mid \boldsymbol{s}, \boldsymbol{X})$ was estimated from the number of sampled individuals in the populations $N^s_a$, assuming we know the total population size $N_a$, through

\[
p(\xi_a \mid \boldsymbol{s}, \boldsymbol{X}) = \mathrm{Beta}(\xi_a;\ N^s_a + \alpha_\xi,\ N_a - N^s_a + \beta_\xi), \tag{S13}
\]

where we set $\alpha_\xi = 0.5$, $\beta_\xi = 0.5$, and similarly for $\xi_b$. Figure S1 illustrates the estimated posterior sampling distributions, in comparison to the true values. We then estimated the posterior distribution (12) with the phyloflows MCMC-within-Gibbs sampler of Section S2.3, using the posterior distribution of sampling probabilities (S13). Figure 3B reports the worst case error of posterior flow estimates.
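The conjugate update (S13) can be illustrated with simulated sampling status data. The Python sketch below uses hypothetical group sizes and function names; it is an illustration, not the code in ode_estimation.R.

```python
import random


def simulate_sampling_status(n_total, xi, rng):
    """Bernoulli sampling status for each of n_total individuals."""
    return [1 if rng.random() < xi else 0 for _ in range(n_total)]


def beta_posterior_shapes(status, alpha_xi=0.5, beta_xi=0.5):
    """Shape parameters of the Beta posterior (S13)."""
    n_sampled = sum(status)
    return n_sampled + alpha_xi, len(status) - n_sampled + beta_xi


# Hypothetical group sizes; true sampling probabilities 0.60 and 0.35.
rng = random.Random(42)
status_a = simulate_sampling_status(5000, 0.60, rng)
status_b = simulate_sampling_status(5000, 0.35, rng)
shapes_a = beta_posterior_shapes(status_a)
shapes_b = beta_posterior_shapes(status_b)
mean_a = shapes_a[0] / sum(shapes_a)
mean_b = shapes_b[0] / sum(shapes_b)
```

With several thousand individuals per group the posterior means `mean_a` and `mean_b` concentrate near the true sampling probabilities, in contrast to the Beta(25.5, 25.5) density centred at 0.5 that is used when sampling heterogeneity is ignored.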


[Figure S2: panels for simulations 1 to 4 (columns) and n+ = 100, 300, 600 (rows); x-axis: flows a->a, a->b, b->a, b->b; y-axis: transmission flows; legend: true flows, estimated flows adjusted for sampling differences, estimated flows not adjusted for sampling differences.]
Fig. S2. True and re-estimated flows in four ODE-based simulations, varying in sample size. We simulated transmission flows between men and women in two locations a and b under the ODE model (4) in Section S3.2. Columns show results for four simulated data sets, and rows show results for increasing sample size. The true transmission flows are shown in blue. Posterior flow estimates (median and 95% credibility intervals) are shown when sampling heterogeneity was ignored (grey), and when sampling heterogeneity was accounted for as in (S13) (orange).

S3.2. Sensitivity to overall sample size of observed transmission flows

We further assessed the accuracy of flow inferences as a function of overall sample size of the observed transmission flows, $n^{+} = \sum_{a,b} n_{ab}$. We re-considered the ODE-based simulation experiments of Section S3.1, with a 25% sampling difference between a and b locations. The overall population size parameter was set to 11200, 18000, 52000 such that the overall sample size was $n^{+} \approx 100, 300, 600$.

Figure S2 (grey bars) shows posterior estimates of the transmission flows on 4 randomly selected simulated data sets, for the case that 60% of the population in a and 35% of the population in b were sampled, and sampling heterogeneity was ignored. Figure S2 (orange bars) shows posterior estimates of the same transmission flows on the same 4 randomly selected data sets, when sampling heterogeneity was accounted for as in (S13). There was considerable variability in the flow data sets when sample size was low ($n^{+} = 100$). However, information on population sampling substantially improved flow estimates regardless of sample size. Figure S3 summarises the worst case error in these simulations. When sampling heterogeneity was not accounted for, variability in flow estimates decreased with increasing sample size, but error magnitude did not decrease. When sampling heterogeneity was accounted for, both the variability in flow estimates and error magnitude decreased with increasing sample size.

[Figure S3: bar chart of worst case error (y-axis) for sample sizes 100, 300, 600 (x-axis), with and without adjustment for sampling differences.]
Fig. S3. Worst case error in estimated transmission flows with increasing sample size. We simulated transmission flows between men and women in two locations a and b under the ODE model (4) as described in Section S3.1. Transmission flows were re-estimated with the hierarchical Poisson model (10) on simulated data sets that varied in sample size $n^{+}$ = 100, 300, 600 (x-axis). In the first set of inference runs (dark grey), sampling differences across locations a, b were not adjusted for, and in the second set of inferences (light grey), sampling differences were accounted for based on counts of sampled and infected individuals. The worst case error between the true transmission flows and median posterior estimates was calculated in each scenario, and the median (bar) and 2.5% and 97.5% quantiles (error bars) are shown. Errors decreased with increasing sample size, although errors remained very large when sampling differences were not accounted for.

S3.3. More complex simulation experiments

The second experiment mimicked the more complex population structure and sampling heterogeneity as observed in the Rakai case study and described in Section S4.

We considered simulated populations stratified into 24 population sub-groups, by gender (male, female), location (inland, fishing), in-migration status (in-migrant, resident), and age (15-24, 25-34, 35-50 years), and simulated 576 transmission flows between the male-female and female-male sub-group combinations. For simplicity, we did not generate epidemic trajectories and instead simulated transmission counts $n_{ab}$ from pre-specified transmission flows $\boldsymbol{\pi}_0$, and pre-specified sampling probabilities $\xi_a$. For a fixed target sample size $n^{+}$, the total number of actual transmissions $z^{+}$ was simulated from $z^{+} \sim \mathrm{Poisson}(n^{+}/\bar{\xi})$, where $\bar{\xi} = \frac{1}{24} \sum_a \xi_a$ was the average sampling probability in the population. Then, the actual transmission counts between population groups were simulated from $\boldsymbol{z} \sim \mathrm{Multinomial}(z^{+}, \boldsymbol{\pi}_0)$. The sampling status of each individual in a was simulated under a Bernoulli draw with probability $\xi_a$, and respectively in population b with probability $\xi_b$, and observed transmissions $n_{ab}$ from group a to group b were calculated as in (S15). 100 such simulated data sets were generated, for a target sample size $n^{+} = 300$.

[Figure S4: panels for simulations 1 to 4; x-axis: flows fishing->fishing, fishing->inland, inland->fishing, inland->inland; y-axis: transmission flows; legend: true flows, estimated flows adjusted for sampling differences, estimated flows not adjusted for sampling differences.]
Fig. S4. True and re-estimated flows in more complex simulations between 24 population groups. We simulated transmission flows by gender, 3 age bands, and location under model parameters similar to those obtained for the Rakai analysis; see Section S3.3. Panels show results for four simulated data sets. Posterior flow estimates (median and 95% credibility intervals) are shown when sampling heterogeneity was ignored (grey), and when sampling heterogeneity was accounted for as in (S13) (orange).

The difference between sampling proportions in inland and fishing communities wasabout 10%. We transformed the sampling proportions with a sine function to obtaina greater range of average sampling differences between inland and fishing communitiesfrom 5% to 25%. Additional simulations were then generated under these samplingscenarios.
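The count simulation described in this section can be sketched in Python as follows. Equation (S15) is not reproduced in this section, so the thinning step below assumes that a transmission is observed when both the source and the recipient are sampled, independently with probabilities $\xi_a$ and $\xi_b$; this, the function names and the two-group example are hypothetical.

```python
import random


def draw_poisson(rate, rng):
    """Poisson draw: count unit-rate exponential arrivals before time `rate`."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < rate:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_counts(pi0, xi, n_target, rng):
    """Simulate actual counts z and observed counts n for K population groups."""
    k = len(xi)
    xi_bar = sum(xi) / k  # average sampling probability
    z_plus = draw_poisson(n_target / xi_bar, rng)
    pairs = [(a, b) for a in range(k) for b in range(k)]
    weights = [pi0[a][b] for a, b in pairs]
    z = [[0] * k for _ in range(k)]
    n_obs = [[0] * k for _ in range(k)]
    # Multinomial allocation of the z_plus transmissions to source-recipient pairs.
    for a, b in rng.choices(pairs, weights=weights, k=z_plus):
        z[a][b] += 1
        # Assumption: observed only if source and recipient are both sampled.
        if rng.random() < xi[a] and rng.random() < xi[b]:
            n_obs[a][b] += 1
    return z_plus, z, n_obs


# Hypothetical two-group example.
rng = random.Random(7)
pi0 = [[0.4, 0.1], [0.1, 0.4]]
z_plus, z, n_obs = simulate_counts(pi0, xi=[0.6, 0.4], n_target=300, rng=rng)
```

The same function applies unchanged to 24 groups by passing a 24 by 24 flow matrix and a vector of 24 sampling probabilities.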

From the simulated data, we re-estimated the 576-dimensional transmission flows while ignoring sampling heterogeneity, and then while accounting for sampling heterogeneity as described in Section S3.1. To facilitate comparison, we aggregated the true and re-estimated flows into the 4-dimensional vector of flows within and between inland and fishing communities, and calculated worst case errors between them. Figure S4 shows posterior estimates of the transmission flows on 4 randomly selected simulated data sets, for the baseline case in which the difference in sampling probabilities between inland and fishing communities was 10%. Figure S5 shows that in these more complex simulation scenarios, trends in worst case error were overall similar to those in the simpler ODE-based simulation experiments in Section S3.1. When ignoring sampling heterogeneity, worst case error increased considerably with average sampling differences between inland and fishing communities. When accounting for sampling heterogeneity, worst case error remained largely unaffected by increasing average sampling differences, and was overall slightly higher than in the ODE-based simulation experiments.

[Figure S5: worst case error (y-axis, 0% to 30%) against sampling differences (x-axis, 0.05 to 0.25); legend: adjustment for sampling differences, yes or no.]

Fig. S5. Worst case error in estimated transmission flows in more complex simulations of transmission flow between 24 population groups. We simulated transmission flows between men and women in 24 population groups as described in Section S3.3. Simulation scenarios varied in average sampling differences between simulated inland and fishing communities (x-axis). Inferences were performed while ignoring sampling heterogeneity (dark grey), and while accounting for sampling heterogeneity (light grey). In each simulation, worst case error was reported after aggregating flow estimates to the 4 flow combinations within and between inland and fishing communities for comparison with the ODE-based simulations, and its median and 2.5% and 97.5% quantiles are shown (y-axis). Overall, trends in worst case error were similar to those found in the simpler ODE-based simulation experiments.
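The aggregation step used for this comparison can be sketched in a few lines. The sketch below assumes that worst case error is the largest absolute difference between the aggregated true and estimated flow proportions; the precise definition is given in Section S3.1 and is not restated here, so this reading is an assumption. The names `aggregate_flows`, `worst_case_error` and `location_of_group` are hypothetical, and the flows are random placeholders rather than simulation output.

```python
import numpy as np

def aggregate_flows(flows, location_of_group):
    """Sum a G x G flow matrix into a 2 x 2 matrix by location (0 = inland, 1 = fishing)."""
    flows = np.asarray(flows, dtype=float)
    loc = np.asarray(location_of_group)
    agg = np.zeros((2, 2))
    for src in (0, 1):
        for rec in (0, 1):
            agg[src, rec] = flows[np.ix_(loc == src, loc == rec)].sum()
    return agg

def worst_case_error(true_flows, est_flows, location_of_group):
    """Assumed definition: max absolute difference between aggregated flow proportions."""
    t = aggregate_flows(true_flows, location_of_group)
    e = aggregate_flows(est_flows, location_of_group)
    return float(np.abs(t / t.sum() - e / e.sum()).max())

rng = np.random.default_rng(1)
G = 24                                              # 24 groups, hence 24 * 24 = 576 flows
location_of_group = np.repeat([0, 1], G // 2)       # hypothetical group-to-location map
true_flows = rng.dirichlet(np.ones(G * G)).reshape(G, G)   # placeholder "truth"
est_flows = rng.dirichlet(np.ones(G * G)).reshape(G, G)    # placeholder "estimate"
print(worst_case_error(true_flows, est_flows, location_of_group))
```

Aggregating before comparing keeps the error measure on the same 4-dimensional scale as the ODE-based experiments, regardless of how finely the flows were estimated.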

S3.4. Accuracy of HSGP approximation

We finally assessed the accuracy of the HSGP approximation in the context of heterogeneous sampling. The main results are summarised in Section 3.2. This supplementary material provides more details of this simulation experiment and visualises the performance of the final model specification. We considered transmission flows between one-year-increment age groups between 15 and 24 by gender and location, and simulated transmission intensities and counts via a slight extension of Equation (15),

$$
\begin{aligned}
n &\sim \text{Poisson}(\lambda\, \xi^{T} \xi^{R}) \\
\log \lambda &= \mu_{mf,hh} \mathbf{1}_{mf,hh} + \mu_{mf,hl} \mathbf{1}_{mf,hl} + \mu_{mf,lh} \mathbf{1}_{mf,lh} + \mu_{mf,ll} \mathbf{1}_{mf,ll} \\
&\quad + \mu_{fm,hh} \mathbf{1}_{fm,hh} + \mu_{fm,hl} \mathbf{1}_{fm,hl} + \mu_{fm,lh} \mathbf{1}_{fm,lh} + \mu_{fm,ll} \mathbf{1}_{fm,ll} + f, \\
f &= (f_{mf}^{T}, f_{fm}^{T})^{T}, \\
f_{mf} &\sim \text{GP}(0, k_{mf}), \quad f_{fm} \sim \text{GP}(0, k_{fm}), \\
k_{mf}\big((a_1, b_1), (a_2, b_2)\big) &= \sigma^2_{mf} \exp\left(-\left[\frac{(a_2 - a_1)^2}{2\ell^2_{mf,a}} + \frac{(b_2 - b_1)^2}{2\ell^2_{mf,b}}\right]\right) \\
k_{fm}\big((a_1, b_1), (a_2, b_2)\big) &= \sigma^2_{fm} \exp\left(-\left[\frac{(a_2 - a_1)^2}{2\ell^2_{fm,a}} + \frac{(b_2 - b_1)^2}{2\ell^2_{fm,b}}\right]\right),
\end{aligned}
\tag{S14}
$$

under the hyper-parameter values in Table S3.

Table S3. Parameter values used to simulate transmission flows in populations stratified by 1-year age bands under the extension of GP model (15).

Parameter                        Symbol           Value
Intercept, female to male
  h->h                           $\mu_{fm,hh}$    -1
  h->l                           $\mu_{fm,hl}$    -10
  l->h                           $\mu_{fm,lh}$    -9
  l->l                           $\mu_{fm,ll}$    -2.5
Intercept, male to female
  h->h                           $\mu_{mf,hh}$    -0.5
  h->l                           $\mu_{mf,hl}$    -9
  l->h                           $\mu_{mf,lh}$    -9
  l->l                           $\mu_{mf,ll}$    -1
Lengthscale
  female to male                 $\ell_{fm}$      (4.1, 2.3)
  male to female                 $\ell_{mf}$      (2.3, 4.6)
Marginal standard deviation
  female to male                 $\sigma_{fm}$    1.8
  male to female                 $\sigma_{mf}$    1.5

This extension captures average transmission intensities by location and gender, as well as age-dependent transmission dynamics that differ between the male-to-female and female-to-male directions. Meanwhile, samples from the models for sampling probability (Supplementary Material S4) were reused and their means were taken as ground-truth sampling intensities. 20 replicates of transmission intensities and counts were generated.
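As an illustration of how one replicate can be generated under (S14), the following minimal sketch draws the male-to-female surface for a single location pair. It assumes that the two entries of the lengthscale in Table S3 correspond to the first and second age coordinate, uses constant placeholder sampling probabilities `xi_T` and `xi_R` instead of the posterior means from Section S4, and adds a small diagonal jitter purely for numerical stability. All function and variable names are hypothetical.

```python
import numpy as np

def se_kernel(points, sigma, ell_a, ell_b):
    """Squared-exponential kernel of (S14) on pairs (a, b) of ages."""
    da = points[:, 0][:, None] - points[:, 0][None, :]
    db = points[:, 1][:, None] - points[:, 1][None, :]
    return sigma**2 * np.exp(-(da**2 / (2 * ell_a**2) + db**2 / (2 * ell_b**2)))

rng = np.random.default_rng(0)
ages = np.arange(15, 25, dtype=float)                   # 1-year bands 15..24
grid = np.array([(a, b) for a in ages for b in ages])   # 100 age pairs

# Table S3 values for the male-to-female direction and the h->h intercept
mu_mf_hh, sigma_mf, ell_mf = -0.5, 1.5, (2.3, 4.6)
K_mf = se_kernel(grid, sigma_mf, *ell_mf)

# Draw f_mf ~ GP(0, k_mf) through a Cholesky factor (jitter is a numerical aid)
chol = np.linalg.cholesky(K_mf + 1e-6 * np.eye(len(grid)))
f_mf = chol @ rng.standard_normal(len(grid))

lam = np.exp(mu_mf_hh + f_mf)              # transmission intensities
xi_T = np.full(len(grid), 0.5)             # placeholder sampling probabilities
xi_R = np.full(len(grid), 0.4)             # placeholder sampling probabilities
n = rng.poisson(lam * xi_T * xi_R)         # simulated observed counts
print(n.reshape(len(ages), len(ages)))
```

Repeating the draw with different seeds would give independent replicates; the full experiment additionally stacks both transmission directions and all four location pairs.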

Flows and hyperparameters were re-estimated using both the Gaussian process (GP) model and the Hilbert space Gaussian process (HSGP) model based on observed counts in the simulation, under priors

$$
\begin{aligned}
\sigma^2_{mf}, \sigma^2_{fm} &\sim \text{Half-Normal}(0, 10) \\
\ell_{d,i} &\sim \text{Inv-Gamma}(\alpha_{d,i}, \beta_{d,i}) \\
\mu_{mf,hh}, \mu_{mf,hl}, \mu_{mf,lh}, \mu_{mf,ll}, \mu_{fm,hh}, \mu_{fm,hl}, \mu_{fm,lh}, \mu_{fm,ll} &\sim \text{Normal}(0, 10).
\end{aligned}
$$

[Figure S6: panel columns GP and HSGP; panel rows infected women aged 15, 18, 21 and 24 years; x-axis: age of male sources (15 to 25); y-axis: source (0% to 40%).]

Fig. S6. Estimated sources of infection in women under the final HSGP approximation. We simulated age-specific transmission flows under the kernel parameters shown in Table S3, and re-estimated transmission flows along with all other parameters under HSGP approximations to (15). The subfigures on the left illustrate the estimated sources of infection in women under the GP prior from which the data were simulated. The subfigures on the right illustrate the estimated sources under the final HSGP approximation. Shades illustrate the estimated posterior credibility intervals, showing 40% credibility intervals in dark red to 80% credibility intervals in light red. Posterior medians are shown as a red line. True values are shown in black. We found no systematic differences in source estimates when using the final HSGP approximation over the GP prior.

Here, we illustrate further the performance of the HSGP approximation under the final tuning parameters, B = 1.25, m = 30. Figure S6 compares the marginal posterior distribution of the sources of infection in women when using the chosen HSGP approximation with that obtained when using the GP prior. We found no particular differences in the estimated sources of infection when using the HSGP approximation over the GP prior. Figure S7 compares the marginal posterior densities of the kernel parameters between the chosen HSGP approximation and the GP prior. Again, we found no systematic differences in the estimated kernel parameters when using the chosen HSGP approximation compared to the GP prior.

[Figure S7: one density panel per parameter (sigma_mf, sigma_fm, mu_mfhh, mu_mfhl, mu_mflh, mu_mfll, mu_fmhh, mu_fmhl, mu_fmlh, mu_fmll, ell_mf1, ell_mf2, ell_fm1, ell_fm2); legend: method GP or HSGP.]

Fig. S7. Estimated covariance kernel parameters under the final HSGP approximation. We simulated age-specific transmission flows under the kernel parameters shown in Table S3, and re-estimated transmission flows along with all other parameters under HSGP approximations to (15). The subfigures illustrate the marginal posterior densities of the kernel parameters under the GP model (in red) and the HSGP approximation (in blue) based on B = 1.25, m = 30 (right panels).
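For readers unfamiliar with the construction, the idea behind a Hilbert space approximation to a squared-exponential kernel can be sketched in one dimension. The block below uses the standard basis-function form (Laplacian eigenfunctions on an interval with Dirichlet boundaries, weighted by the kernel's spectral density). It is a generic illustration, not the paper's two-dimensional implementation: the boundary factor `c = 2.0` is an illustrative value, how `c` and `m` map onto the tuning parameters B and m quoted above is an assumption, and the kernel parameters are borrowed from Table S3 only for concreteness.

```python
import numpy as np

def se_spectral_density(omega, sigma, ell):
    """Spectral density of the 1D squared-exponential kernel."""
    return sigma**2 * np.sqrt(2 * np.pi) * ell * np.exp(-0.5 * (ell * omega) ** 2)

def hsgp_basis(x, m, L):
    """First m Dirichlet Laplacian eigenfunctions on [-L, L], evaluated at x."""
    j = np.arange(1, m + 1)
    sqrt_lambda = j * np.pi / (2 * L)
    phi = np.sqrt(1.0 / L) * np.sin(sqrt_lambda[None, :] * (x[:, None] + L))
    return phi, sqrt_lambda

ages = np.arange(15, 25, dtype=float)
x = ages - ages.mean()                 # centred inputs
c, m = 2.0, 30                         # illustrative boundary factor and basis size
L = c * np.abs(x).max()                # half-width of the approximation domain
sigma, ell = 1.5, 2.3                  # kernel parameters, for concreteness only

phi, sqrt_lambda = hsgp_basis(x, m, L)
K_approx = (phi * se_spectral_density(sqrt_lambda, sigma, ell)) @ phi.T
K_exact = sigma**2 * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ell**2)
print(np.max(np.abs(K_approx - K_exact)))
```

The approximation replaces an n x n covariance matrix with an n x m basis matrix, which is the source of the computational savings; its accuracy depends on the domain half-width relative to the lengthscale and on the number of basis functions.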

S4. Modelling and estimation of the sampling cascade

We here describe the statistical models used to characterise each step of the sampling cascade of transmission events, shown in Figure 5A.

To recall our setting, we defined in the main text transmission events as HIV infection events to individuals who lived in RCCS communities, were aged 15-49 years, and were infected in the study period T = [2009/10/01, 2015/01/30]. The main study objective was to estimate transmission flows and related quantities (1-3) among different population groups, expressed in proportions relative to this denominator of transmission events, denoted by Z. Each transmission event in Z can be indexed by its source i and recipient j in the population of individuals that were infected by the end of T, and we denote transmission from i to j with the indicator variable $z_{ij} = 1$, and no transmission with $z_{ij} = 0$. Thus, recipients are individuals in P that were infected during the study period T, i.e. in our case 2009/10/01-2015/01/30. Sources are infected individuals who transmitted to one of the recipients. In our applications, population groups were defined by location (inland and fishing communities), gender, and 1-year age bands from age 15 to age 49, yielding A = 140 population strata. The actual transmission events from group a to group b are

$$
z_{ab} = \sum_{i \in a,\, j \in b} z_{ij}, \tag{S15}
$$

and observing any such event depends on the sampling status of the source and recipient,

$$
n_{ab} = \sum_{i \in a,\, j \in b} \mathbb{1}\{s^S_i = 1\}\, \mathbb{1}\{s^R_j = 1\}\, z_{ij}. \tag{S16}
$$
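A small worked example may help to fix the notation of (S15) and (S16). The sketch below uses four hypothetical individuals in two groups; the function name `group_counts` and all inputs are illustrative and not taken from the study data.

```python
import numpy as np

def group_counts(z, group, sampled_source, sampled_recipient, n_groups):
    """Return actual (S15) and observed (S16) group-level transmission counts."""
    z = np.asarray(z)
    membership = np.eye(n_groups, dtype=int)[np.asarray(group)]   # N x A one-hot matrix
    z_ab = membership.T @ z @ membership                          # sum of z_ij over i in a, j in b
    # An event is observed only if source i and recipient j are both sampled
    observed = z * np.outer(sampled_source, sampled_recipient)
    n_ab = membership.T @ observed @ membership
    return z_ab, n_ab

# Individuals 0 and 1 belong to group 0; individuals 2 and 3 to group 1
group = [0, 0, 1, 1]
z = np.zeros((4, 4), dtype=int)
z[0, 2] = z[1, 3] = z[2, 1] = 1          # transmissions 0->2, 1->3, 2->1
sampled_source = [1, 0, 1, 1]            # s^S_i
sampled_recipient = [1, 1, 1, 0]         # s^R_j

z_ab, n_ab = group_counts(z, group, sampled_source, sampled_recipient, n_groups=2)
print(z_ab)
print(n_ab)
```

In this toy example two transmission events occur from group 0 to group 1, but only one of them is observed because individual 1 is not sampled as a source and individual 3 is not sampled as a recipient.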

To make inference on transmission flows, our strategy is to explicitly account for sampling heterogeneity in order to reduce bias in the flow estimates. We propose estimating sampling proportions and then propagating these estimates into inference of transmission flows. Overall, we assume that sources and recipients are independently sampled, and conditionally at random within each population sub-group a with probabilities $\xi^S = (\xi^S_a)_{a \in \mathcal{A}}$ and $\xi^R = (\xi^R_a)_{a \in \mathcal{A}}$. For the sake of brevity, we derived the flow model in equations (8-12) using the same sampling probabilities for sources and recipients. In our applications we consider its extension to distinct sampling probabilities for sources and recipients, with equations generalising straightforwardly.
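The role of the two sets of sampling probabilities can be made explicit with a short calculation. Assuming, as stated above, that sources and recipients are sampled independently and at random within strata, so that each source in group a is sampled with probability $\xi^S_a$ and each recipient in group b with probability $\xi^R_b$, taking the expectation of (S16) conditional on the transmission events gives the following. This is a sketch under those assumptions rather than a restatement of equations (8-12).

```latex
\begin{aligned}
\mathbb{E}\left[n_{ab} \mid z\right]
  &= \sum_{i \in a,\, j \in b} \Pr(s^S_i = 1)\, \Pr(s^R_j = 1)\, z_{ij} \\
  &= \xi^S_a\, \xi^R_b \sum_{i \in a,\, j \in b} z_{ij}
   = \xi^S_a\, \xi^R_b\, z_{ab}.
\end{aligned}
```

Observed counts are therefore, in expectation, the actual counts thinned by the product of the source and recipient sampling probabilities, which is why both probabilities are estimated in the remainder of this section.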

We modelled the overall sampling probabilities $\xi^S_a$, $\xi^R_a$ as the product of conditional sampling probabilities along the sampling cascade of Figure 5A; see the forthcoming Equations (S20) and (S23).

The sampling cascade required that sources and recipients participated in the cohort, reported no ART use, and had virus successfully deep-sequenced. Section S4.1 describes estimation of participation probabilities for source cases, and Section S4.2 describes estimation of sequence sampling probabilities conditional on participation among source cases. Section S4.3 describes estimation of participation probabilities for recipient cases, and Section S4.4 describes estimation of sequence sampling probabilities conditional on participation among recipient cases. Section S4.5 illustrates the resulting sampling probabilities of transmission events.

In each step of the cascade, we compared the prediction accuracy of candidate models through ten-fold cross-validation. To do so, the data $y \in \mathbb{R}^N$ were divided into 10 sets; the kth test set is denoted by $y_k$ and the remaining observations form the training set $y_{-k}$. We defined the following accuracy measures.

• Hold-out data in 95% PPCrI (Posterior Predictive Credibility Interval): we simulated data (sampled individuals) from the model fitted on $y_{-k}$, found the 95% credibility intervals of the simulated data, and counted the percentage of observed data $y_k$ included in the credibility interval.
• MAE (Mean Absolute Error): we simulated data (sampled individuals) from the model fitted on $y_{-k}$, found the median estimates, and calculated the mean of the absolute differences between the observations $y_k$ and the median estimates.
• ELPD (expected log point-wise predictive density for a new dataset): this evaluates the log predictive density at each data point in the kth fold,

$$
\text{ELPD}_k = \sum_{i=1}^{N_k} \text{ELPD}_{ki} = \sum_{i=1}^{N_k} \log\left(\frac{1}{S} \sum_{s=1}^{S} p(y_{ki} \mid \theta^{k,s})\right), \tag{S17}
$$

where $N_k$ is the number of observations in the kth fold, $y_{ki}$ denotes the ith observation in the kth test set, and $\theta^{k,s}$ denotes the sth sample drawn from the posterior fitted to the training data, $p(\theta \mid y_{-k})$.
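The three accuracy measures can be computed for one fold from posterior predictive draws and point-wise log-likelihood evaluations. The following is a minimal sketch, assuming these quantities have already been extracted from the model fitted on the training set; the function name `fold_metrics` and the array layout are hypothetical.

```python
import numpy as np

def fold_metrics(y_test, y_sim, log_lik):
    """Accuracy measures for one cross-validation fold.

    y_test  : (N_k,) hold-out observations
    y_sim   : (S, N_k) posterior predictive draws from the model fitted on the training set
    log_lik : (S, N_k) values of log p(y_ki | theta^{k,s})
    """
    lo, hi = np.percentile(y_sim, [2.5, 97.5], axis=0)
    coverage = float(np.mean((y_test >= lo) & (y_test <= hi)))       # hold-out data in 95% PPCrI
    mae = float(np.mean(np.abs(y_test - np.median(y_sim, axis=0))))  # mean absolute error
    # (S17): log of the posterior mean predictive density, via a stable log-sum-exp
    n_draws = log_lik.shape[0]
    mx = log_lik.max(axis=0)
    elpd_i = mx + np.log(np.exp(log_lik - mx).sum(axis=0)) - np.log(n_draws)
    return coverage, mae, float(elpd_i.sum())

# Toy usage with placeholder draws
rng = np.random.default_rng(3)
y_test = rng.poisson(5, size=20).astype(float)
y_sim = rng.poisson(5, size=(500, 20)).astype(float)
log_lik = rng.normal(-2.0, 0.1, size=(500, 20))
print(fold_metrics(y_test, y_sim, log_lik))
```

Averaging the densities on the log scale with log-sum-exp avoids underflow when individual likelihood values are very small.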

S4.1. Modelling and estimating participation probabilities of source cases

We are interested in participation probabilities among infected individuals who transmitted virus. In general, HIV infections in surveyed locations may have their source outside these locations, with previous work suggesting that approximately 30% of infections in RCCS communities originate from outside the cohort (Grabowski et al., 2014). Sources from outside the RCCS communities by definition did not participate in any survey round, and in the absence of any data we focused on sources from RCCS communities. This is a general limitation of our flow inferences.

As part of demographic surveillance, the RCCS conducts a population census immediately prior to each survey round in the cohort communities, which provides an enumeration of the underlying age-eligible population of 15-49 years. We used these data to estimate participation probabilities $\xi^p_a$ over the survey period in each of the underlying population sub-groups a, and thereby approximated the participation probabilities of source cases. This involved three assumptions. First, we assumed that individuals participated regardless of infection status; second, that they participated regardless of being a source or not; and third, that participation probabilities were the same in the survey period (2011/08/10-2015/01/30) and in the slightly longer study period (2009/10/01-2015/01/30), during which transmission flows were estimated.

We used a Beta-Binomial regression framework to estimate the posterior distribution of population participation probabilities $\xi^p = (\xi^p_a)_{a \in \mathcal{A}}$. Denote by $N^e_a$ the number of participation-eligible individuals in population group a, by $N^p_a$ the number of participants in population group a, by $X_a \in \{0, 1\}^p$ a row vector of p predictor variables associated with population a, including location status, gender, age band, and possible interaction terms, and by $X \in \{0, 1\}^{A \times p}$ the design matrix that stacks the row vectors $X_a$ for each population group a. The general form of the regression models considered was

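$$
\begin{aligned}
N^p_a &\sim \text{Beta-Binomial}(N^e_a, \xi^p_a, \gamma), \quad \forall a \\
\text{logit}(\xi^p) &= \beta_0 \mathbf{1} + \beta X \\
\beta_0 &\sim \mathcal{N}(0, 100) \\
\beta &\sim \mathcal{N}(0, 10 I) \\
\gamma &\sim \text{Exp}(1)
\end{aligned}
\tag{S18}
$$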

where the Beta-Binomial is parameterised in terms of the number of Bernoulli trials, the mean of the marginal probabilities of success and the dispersion parameter, $\beta_0 \in \mathbb{R}$ is the baseline participation probability on the logit scale, $\beta \in \mathbb{R}^p$ is the vector of regression coefficients, and I is the p × p identity matrix. Four models were considered and assessed using 10-fold cross-validation.
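The mean and dispersion parameterisation can be related to the more common shape parameterisation of the Beta-Binomial as follows. The exact mapping used in the fitted models is not stated in this section; the sketch below assumes shape parameters alpha = ξ/γ and beta = (1 − ξ)/γ, under which the distribution approaches the Binomial as γ tends to 0, consistent with the statement that γ = 0 reflects no overdispersion. The function names are hypothetical.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Logarithm of the Beta function."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_logpmf(k, n, xi, gamma):
    """Log pmf with mean probability xi and dispersion gamma > 0.

    Assumed mapping (not confirmed by the source): alpha = xi / gamma,
    beta = (1 - xi) / gamma.
    """
    alpha, beta = xi / gamma, (1.0 - xi) / gamma
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)

n_eligible, xi, gamma = 50, 0.3, 0.1      # illustrative values only
pmf = [exp(beta_binomial_logpmf(k, n_eligible, xi, gamma)) for k in range(n_eligible + 1)]
mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
print(mean, var, n_eligible * xi * (1 - xi))
```

Under this mapping the mean number of participants stays at the number of eligible individuals times ξ, while the variance exceeds the Binomial variance whenever γ > 0 and there is more than one trial.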

• Model P1. Model (S18) with indicator variables on location, gender, and 1-year age bands as contrasts, leading to p = 36 predictors. The corresponding regression coefficients were given independent normal prior densities, and the dispersion coefficient was set to γ = 0 to reflect no overdispersion.
• Model P2. As model P1 but with γ estimated to allow for overdispersion.
• Model P3. Model (S18) with indicator variables on location, and gender and 1-year age band interactions as contrasts, leading to p = 70 predictors. The corresponding regression coefficients were given independent normal prior densities, and the dispersion coefficient was set to γ = 0 to reflect no overdispersion.
• Model P4. As model P3 but with γ estimated to allow for overdispersion.

Table S4. Prediction accuracy of models P1-P4 on hold-out population participation data in 10-fold cross-validation.

Model  Hold-out data in 95% PPCrI  MAE                  ELPD
       median (range)              median (range)       median (std dev)
P1     93.1% (91.5% - 95.1%)       1.79 (1.50 - 1.92)   -3705.27 (26.78)
P2     98.0% (97.2% - 99.2%)       1.76 (1.49 - 1.98)   -4168.58 (37.05)
P3     93.1% (92.4% - 95.3%)       1.67 (1.58 - 1.90)   -3651.39 (27.61)
P4     98.4% (96.5% - 98.8%)       1.70 (1.53 - 1.90)   -4072.42 (37.21)

* PPCrI: posterior predictive credibility interval; MAE: mean absolute error; ELPD: expected log predictive density.

The models were fitted in Stan version 2.21, convergence diagnostics (Vehtari et al., 2019) were checked, and effective sample sizes were 10000. Table S4 shows the proportion of hold-out data in 95% posterior predictive credibility intervals (PPCrI), the mean absolute error (MAE), and the expected log predictive density in 10-fold cross-validation (ELPD) for each model. Based on these statistics, model P4 was chosen to model participation probabilities across population strata. Figure S8 summarises the marginal posterior densities of the regression coefficients under model P4. Figure S9 shows the marginal posterior participation probabilities by location, gender, and age bands.
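The predictor counts quoted for models P1 and P3 can be checked by constructing the two design matrices over the 140 strata. The sketch below assumes treatment contrasts with inland, female and age 15 as reference levels for the main-effects design, and the female, age 49 cell as the reference for the interaction design; the latter is inferred from the coefficient labels of Figure S8, which include no age49_female term, and is an assumption. All names are hypothetical.

```python
import itertools
import numpy as np

AGES = list(range(15, 50))    # 35 one-year age bands
STRATA = list(itertools.product(("inland", "fishing"), ("female", "male"), AGES))

def design_main_effects():
    """P1-style contrasts: fishing (1) + male (1) + age bands 16..49 (34)."""
    rows = [[loc == "fishing", sex == "male"] + [age == a for a in AGES[1:]]
            for loc, sex, age in STRATA]
    return np.array(rows, dtype=int)

def design_interactions():
    """P3-style contrasts: fishing (1) + gender-by-age cells (69)."""
    cells = [(sex, a) for a in AGES for sex in ("female", "male")
             if (sex, a) != ("female", 49)]          # assumed reference cell
    rows = [[loc == "fishing"] + [(sex, age) == c for c in cells]
            for loc, sex, age in STRATA]
    return np.array(rows, dtype=int)

X1, X3 = design_main_effects(), design_interactions()
print(len(STRATA), X1.shape, X3.shape)
```

Both matrices remain of full column rank once an intercept column is added, so the coefficients are identifiable from the 140 strata even before the priors are taken into account.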

[Figure S8: posterior summaries (y-axis, -3 to 3) of the regression coefficients a, age15_female, age15_male, ..., age48_female, age48_male, age49_male, and fishing.]

Fig. S8. Marginal posterior densities of regression coefficients of model P4 to estimate population participation probabilities. Circle: median; bar: interquartile range; line: 95% credibility interval. The first variable a refers to the intercept.

[Figure S9: panels for fishing sites and inland communities; x-axis: one-year increment age groups (15 to 50); y-axis: probability of participation (0% to 100%); legend: gender, female or male.]

Fig. S9. Marginal posterior densities of population participation probabilities under model P4. Circle: median; line: 95% credibility interval.

S4.2. Modelling and estimating sequencing probabilities of source cases

We are now interested in the probability that (participating) source cases report no ART use, and the probability that source cases who reported no ART use also had virus successfully deep-sequenced. We combined both steps and sought to estimate the probability that source cases had virus successfully deep-sequenced. Considerable demographic and clinical data were collected among RCCS participants, which allowed us to relax the assumptions of Section S4.1 when modelling and estimating sequencing probabilities of source cases. In general, many infected individuals do not transmit the virus onwards, and individuals who transmitted virus in the study period T must have had unsuppressed HIV at some point in T. A routinely collected proxy variable of viral suppression is self-reported ART use (Grabowski et al., 2018). On this basis, we excluded from consideration as potential source cases infected individuals who consistently reported ART use, defined as reporting ART use at each survey visit, and approximated the probability of obtaining a deep-sequence sample from source cases who participated in the cohort by the probability of obtaining a deep-sequence sample from infected participants who reported no ART use on at least one visit during the survey period. In this approximation, we further assumed that individuals who consistently reported ART use during the survey period (2011/08/10-2015/01/30) would also have consistently reported ART use during the study period (2009/10/01-2015/01/30).

Table S5. Prediction accuracy of models S1-S8 on hold-out deep-sequencing data in 10-fold cross-validation.

Model  Hold-out data in 95% PPCrI  MAE                  ELPD
       median (range)              median (range)       median (std dev)
S1     99.2% (98.6% - 100.0%)      0.60 (0.55 - 0.65)   -1082.79 (12.99)
S2     100% (99.2% - 100%)         0.58 (0.54 - 0.65)   -1130.18 (15.91)
S3     99.2% (98.4% - 100.0%)      0.60 (0.53 - 0.70)   -1057.71 (13.1)
S4     99.2% (99.2% - 100.0%)      0.61 (0.56 - 0.70)   -1097.95 (15.9)
S5     99.2% (99.2% - 100%)        0.58 (0.56 - 0.61)   -1106.56 (12.79)
S6     100% (99.2% - 100%)         0.57 (0.55 - 0.62)   -1159.65 (15.77)
S7     99.2% (98.4% - 100%)        0.58 (0.56 - 0.61)   -1118.38 (12.58)
S8     100% (99.2% - 100%)         0.58 (0.55 - 0.61)   -1174.11 (15.6)

* PPCrI: posterior predictive credibility interval; MAE: mean absolute error; ELPD: expected log predictive density.

[Figure S10: posterior summaries (y-axis, -10 to 10) of the regression coefficients a, male, fishing, sigma, and phi15, phi16, ..., phi49.]

Fig. S10. Marginal posterior densities of regression coefficients of model S5 to estimate conditional deep-sequencing probabilities. Circle: median; bar: interquartile range; line: 95% credibility interval. The first variable a refers to the intercept.

We again used a Beta-Binomial regression framework to estimate the posterior distribution of deep-sequencing probabilities $\xi^s = (\xi^s_a)_{a \in \mathcal{A}}$, where $\xi^s_a$ denotes the probability of obtaining a deep-sequence sample from potential source cases, defined as infected participants who reported no ART use on at least one visit during the survey period in population sub-group a. However, due to smaller numbers we also considered regularised regression models with correlations imposed among the regression coefficients. We follow the notation of (S18) and further denote by $N^{\text{naïve}}_a$ the number of infected participants in a who did not report consistent ART use, and by $N^s_a$ the number of infected participants in a who did not report consistent ART use and had virus deep-sequenced successfully. We considered models of the form

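$$
\begin{aligned}
N^s_a &\sim \text{Beta-Binomial}(N^{\text{naïve}}_a, \xi^s_a, \gamma), \quad \forall a \\
\text{logit}(\xi^s) &= \beta_0 \mathbf{1} + \beta X \\
\beta_0 &\sim \mathcal{N}(0, 100) \\
\beta &\sim \mathcal{N}(0, 10 I) \\
\gamma &\sim \text{Exp}(1),
\end{aligned}
\tag{S19}
$$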

and similar models with 1st-order intrinsic conditional auto-regressive (ICAR) prior distributions across adjacent age bands (Rue and Held, 2005). Eight models were considered and assessed using 10-fold cross-validation.

• Model S1. Model (S19) with indicator variables on location (1), gender (1), and 1-year age bands (34) as contrasts, leading to p = 36 predictors. The corresponding regression coefficients were given independent normal prior densities, and the dispersion coefficient was set to γ = 0 to reflect no overdispersion.
• Model S2. As model S1 but with γ estimated to allow for overdispersion.
• Model S3. Model (S19) with indicator variables on location (1), and gender and 1-year age band interactions (69) as contrasts, leading to p = 70 predictors. The corresponding regression coefficients were given independent normal prior densities, and the dispersion coefficient was set to γ = 0 to reflect no overdispersion.
• Model S4. As model S3 but with γ estimated to allow for overdispersion.
• Models S5-S8. As models S1-S4 respectively, but using an ICAR prior density on adjacent age bands.
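To make the ICAR prior on adjacent age bands concrete, the sketch below builds the structure matrix of a first-order random walk over 35 age bands and evaluates the corresponding log density up to an additive constant. It is a generic illustration of a first-order ICAR prior on a chain of neighbours; details of the fitted models, such as any sum-to-zero constraint on the age effects or the prior on the scale, are not given in this section and are not reproduced. The names `rw1_structure_matrix` and `icar_log_density` are hypothetical.

```python
import numpy as np

def rw1_structure_matrix(n):
    """Structure matrix Q = D'D, where D takes first differences between neighbours."""
    D = np.diff(np.eye(n), axis=0)          # (n - 1) x n difference operator
    return D.T @ D

def icar_log_density(phi, sigma):
    """Log density of a first-order ICAR prior, up to a constant free of phi and sigma."""
    phi = np.asarray(phi, dtype=float)
    n = phi.size
    return -(n - 1) * np.log(sigma) - 0.5 * np.sum(np.diff(phi) ** 2) / sigma**2

n_age_bands = 35                             # ages 15..49
Q = rw1_structure_matrix(n_age_bands)
rng = np.random.default_rng(4)
phi = np.cumsum(rng.normal(0.0, 0.3, n_age_bands))   # a smooth placeholder age effect
print(icar_log_density(phi, sigma=0.3))
```

The prior penalises differences between neighbouring age effects but not their overall level, which is why the structure matrix has rank n − 1 and why such priors borrow strength across adjacent ages when counts per stratum are small.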

The models were fitted in Stan version 2.21, convergence diagnostics (Vehtari et al., 2019) were checked, and effective sample sizes were again 10000. In Table S5 we report as before the proportion of hold-out data in 95% PPCrI, the MAE, and the ELPD for each model in 10-fold cross-validation. Based on these figures, model S5 was chosen to model deep-sequencing probabilities (conditional on participation) across population strata. Figure S10 shows the marginal posterior densities of the regression coefficients under model S5. Figure S11 summarises the marginal posterior deep-sequencing probabilities by location, gender, and age bands.

[Figure S11: panels for fishing sites and inland communities; x-axis: one-year increment age groups (15 to 50); y-axis: probability of sequencing a source case (0% to 100%); legend: gender, female or male.]

Fig. S11. Marginal posterior densities of source sequencing probabilities under model S5. Circle: median; line: 95% credibility interval.

To summarise, the sampling probability of source cases was approximated by

$$
\xi^S_a = \xi^p_a \cdot \xi^s_a \tag{S20}
$$

for all population sub-groups a.
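One way to carry posterior uncertainty through (S20) is to multiply posterior draws of the two component probabilities stratum by stratum. The sketch below assumes that draws from the participation model and the sequencing model are combined draw by draw, which treats the two posteriors as independent; whether the analysis proceeded in exactly this way is not stated here. The Beta-distributed arrays are placeholders standing in for actual posterior samples.

```python
import numpy as np

rng = np.random.default_rng(2)
n_draws, n_strata = 1000, 140

# Placeholder posterior draws; in practice these would come from the fitted models
xi_p = rng.beta(8, 2, size=(n_draws, n_strata))    # participation probabilities
xi_s = rng.beta(5, 5, size=(n_draws, n_strata))    # sequencing probabilities given participation

xi_S = xi_p * xi_s                                  # (S20), applied to each draw and stratum
median = np.median(xi_S, axis=0)
lower, upper = np.percentile(xi_S, [2.5, 97.5], axis=0)
print(median[:3], lower[:3], upper[:3])
```

Summaries such as medians and 95% credibility intervals are then taken over the product draws rather than by multiplying the summaries of the two components.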

S4.3. Modelling and estimating participation probabilities of recipient cases

We are now interested in participation probabilities among recipients, i.e. individuals who acquired infection in the study period T, 2009/10/01-2015/01/30. We proceeded as in Section S4.1 and used the RCCS census data to approximate the participation probabilities among individuals who acquired HIV in T by participation probabilities over the survey period in each of the underlying population sub-groups. This involved two assumptions. First, we assumed that individuals participated regardless of infection status, and second that participation probabilities were the same in the survey period (2011/08/10-2015/01/30) and in the slightly longer study period (2009/10/01-2015/01/30), during which transmission flows were estimated. Following Section S4.1, we adjusted for heterogeneity in participation probabilities among newly infected individuals using the posterior estimates under model P4, reported in Figure S9.

Table S6. Prediction accuracy of models N1-N3 on hold-out infection time data in 10-fold cross-validation.

Model  Hold-out data in 95% PPCrI  MAE                        ELPD
       median (range)              median (range)             median (std dev)
N1     87.3% (85.0% - 88.9%)       0.1458 (0.1364 - 0.1830)   -537 (10.1)
N2     97.2% (95.6% - 98.5%)       0.0284 (0.0152 - 0.0467)   -103 (3.61)
N3     97.3% (95.6% - 98.5%)       0.0303 (0.0152 - 0.0438)   -101 (3.56)

* PPCrI: posterior predictive credibility interval; MAE: mean absolute error; ELPD: expected log predictive density.

S4.4. Modelling and estimating sequencing probabilities of recipient cases

We are finally interested in the probability that newly infected participants report no ART use, and the probability that newly infected participants reporting no ART use had virus successfully deep-sequenced. As in Section S4.2, we combined both steps, and sought to estimate the probability that newly infected participants had virus successfully deep-sequenced.

Based on available data collected among RCCS participants, we could again relax the assumptions of Section S4.3. Our approach here differed in two aspects from the approach we employed to estimate sequencing probabilities among source cases in Section S4.2. First, we predicted which of the infected participants acquired infection during the study period T, as opposed to before the start of T. We used longitudinal individual-level serostatus data and phylogenetic estimates of the infection time for this step (Grabowski et al., 2017; Golubchik et al., 2017). Second, we estimated the probability that infected participants who were classified to have acquired HIV in T (regardless of ART use, in contrast to Section S4.2) also had virus successfully deep-sequenced.

We now describe step 1. Of 5032 infected participants, 883 (17.5%) had a positive HIV test before the start of T, 65 (1.3%) had a phylogenetically estimated mean time of infection before T, and 1698 (33.8%) had a phylogenetically estimated time of infection in T. Information on last negative test was not available for this analysis. The remaining 2,386 (47.4%) had no data or phylogenetic estimates on the likely time of infection. Based on the 2646 individuals with data on infection times, we trained Bernoulli regression models to estimate predictor variables associated with HIV acquisition in T ($Y_i = 1$) versus before T ($Y_i = 0$). The predictive accuracy of several models was compared using 10-fold cross-validation. Using the best model, we predicted the timing of HIV acquisition among the 2,386 infected participants with no such data. We then retained for further analysis the individuals with evidence of HIV acquisition during T, plus those individuals for whom HIV acquisition during T was predicted.

[Figure S12: posterior summaries (y-axis, -300 to 0) of the regression coefficients a, male, inmigrant, age, date, firstposdate, art, and fishing.]

Fig. S12. Marginal posterior densities of regression coefficients of model N3 to estimate conditional deep-sequencing probabilities. Circle: median; bar: interquartile range; line: 95% credibility interval. The first variable a refers to the intercept.

The Bernoulli regression models had the form

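$$
\begin{aligned}
Y_i &\sim \text{Bernoulli}(\eta_i), \quad \forall i \\
\text{logit}(\eta) &= \beta_0 \mathbf{1} + \beta X \\
\beta_0 &\sim \mathcal{N}(0, 100) \\
\beta &\sim \mathcal{N}(0, 10 I)
\end{aligned}
\tag{S21}
$$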

where $\eta_i$ is the probability that the ith infected participant acquired HIV in T, and the design matrix was based on the covariates gender (indicator variable), first survey visit date (real valued), age of individual at first visit date (real valued), in-migration into the current RCCS community of residence in the two years prior to the first survey visit (indicator variable), date of first HIV positive test (real valued), residence in fishing or inland communities at time of first positive test (indicator variable), and ART use at first visit date when infected (indicator variable). Three models were considered.

• Model N1. Model (S21) with the aforementioned covariates except ART status and time of first positive test, leading to p = 5 predictors.
• Model N2. Model (S21) with the aforementioned covariates except ART status, leading to p = 6 predictors.
• Model N3. Model (S21) with the aforementioned covariates, leading to p = 7 predictors.

Table S7. Demographic characteristics of infected participants by estimated infection time.

Community            Gender  Age          Infection before T  Infection in T  Proportion of infection in T
Fishing communities  Male    15-19 years  0 (0.0%)            15 (1.6%)       100%
                             20-24 years  6 (3.9%)            104 (11.2%)     94.5%
                             25-29 years  16 (10.5%)          256 (27.5%)     94.1%
                             30-34 years  49 (32.0%)          253 (27.2%)     83.8%
                             35-39 years  39 (25.5%)          162 (17.4%)     80.6%
                             40-44 years  25 (16.3%)          116 (12.5%)     82.3%
                             45-50 years  18 (11.8%)          25 (2.7%)       58.1%
                             subtotal     153                 931             85.9%
                     Female  15-19 years  7 (2.5%)            63 (5.9%)       90%
                             20-24 years  27 (9.5%)           225 (21.2%)     89.3%
                             25-29 years  50 (17.5%)          314 (29.6%)     86.3%
                             30-34 years  85 (29.8%)          236 (22.2%)     73.5%
                             35-39 years  71 (24.9%)          145 (13.7%)     67.1%
                             40-44 years  35 (12.3%)          55 (5.2%)       61.1%
                             45-50 years  10 (3.5%)           23 (2.2%)       69.7%
                             subtotal     285                 1061            78.8%
Inland communities   Male    15-19 years  9 (2.2%)            14 (3%)         60.9%
                             20-24 years  14 (3.5%)           72 (15.5%)      83.7%
                             25-29 years  34 (8.5%)           129 (27.8%)     79.1%
                             30-34 years  83 (20.8%)          99 (21.3%)      54.4%
                             35-39 years  103 (25.8%)         85 (18.3%)      45.2%
                             40-44 years  81 (20.2%)          44 (9.5%)       35.2%
                             45-50 years  76 (19%)            21 (4.5%)       21.6%
                             subtotal     400                 464             53.7%
                     Female  15-19 years  19 (2.2%)           78 (9.1%)       80.4%
                             20-24 years  62 (7.2%)           220 (25.7%)     78.0%
                             25-29 years  143 (16.6%)         228 (26.7%)     61.5%
                             30-34 years  203 (23.6%)         174 (20.4%)     46.2%
                             35-39 years  208 (24.2%)         37 (4.3%)       15.1%
                             40-44 years  137 (15.9%)         91 (10.6%)      39.9%
                             45-50 years  88 (10.2%)          27 (3.2%)       23.5%
                             subtotal     860                 855             49.85%
Total                                     1698                3311            66.1%

The models were fitted in Stan version 2.21, convergence diagnostics (Vehtari et al., 2019) were checked, and effective sample sizes were 5000. Table S6 reports as before the proportion of hold-out data in 95% PPCrI, the MAE, and the ELPD for each model in 10-fold cross-validation. Model N3 was chosen to model HIV acquisition during the survey period across population strata. Figure S12 shows the marginal posterior densities of the regression coefficients under model N3. Participants without evidence of infection times were classified as having acquired HIV infection in T if the median marginal posterior estimate of $\eta_i$ was above 0.37, and were otherwise classified as having acquired HIV infection before T. The threshold was selected by maximising the F1 score for binary outcomes. Table S7 summarises the number of infected participants classified as having acquired infection before or during the study period 2009/10/01-2015/01/30.
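Threshold selection by maximising the F1 score can be sketched as a simple grid search. The example below is a generic illustration on toy labels and probabilities; the grid, the strict inequality used for classification, and the function names are assumptions, and the toy inputs bear no relation to the study data or to the reported threshold of 0.37.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 = 2 TP / (2 TP + FP + FN), with F1 = 0 when there are no true positives."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 0.0 if tp == 0 else float(2 * tp / (2 * tp + fp + fn))

def best_f1_threshold(y_true, eta, grid=np.linspace(0.01, 0.99, 99)):
    """Return the first grid threshold that maximises F1 when classifying eta > threshold."""
    scores = [f1_score(y_true, (eta > t).astype(int)) for t in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), scores[best]

# Toy example: labels (1 = acquired HIV in T) and posterior median probabilities
y_true = np.array([0, 0, 1, 1, 1])
eta = np.array([0.105, 0.405, 0.355, 0.8, 0.9])
threshold, score = best_f1_threshold(y_true, eta)
print(threshold, score)
```

Because F1 balances precision and recall for the positive class, the selected threshold need not equal 0.5, particularly when the two classes are unbalanced.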

Table S8. Prediction accuracy of models S′1-S′8 on hold-out deep-sequencing data among patients infected in T in 10-fold cross-validation.

Model  Hold-out data in 95% PPCrI  MAE                  ELPD
       median (range)              median (range)       median (std dev)
S′1    98.5% (56% - 100%)          0.65 (0.52 - 1.07)   -817.49 (14.23)
S′2    100% (98.6% - 100%)         0.62 (0.48 - 0.72)   -951.26 (15.04)
S′3    100.0% (98.1% - 100.0%)     0.66 (0.50 - 0.73)   -883 (12.3)
S′4    100.0% (98.3% - 100.0%)     0.67 (0.49 - 0.77)   -917.12 (15.12)
S′5    99.5% (99% - 100%)          0.60 (0.51 - 0.68)   -930.96 (11.93)
S′6    100% (99% - 100%)           0.61 (0.51 - 0.67)   -980.39 (14.95)
S′7    99.5% (97.5% - 100%)        0.62 (0.49 - 0.68)   -940.31 (11.77)
S′8    100.0% (98.5% - 100.0%)     0.62 (0.47 - 0.66)   -992.35 (14.84)

* PPCrI: posterior predictive credibility interval; MAE: mean absolute error; ELPD: expected log predictive density.

We now describe step 2. To estimate the deep-sequencing probabilities among newly infected participants, we excluded from further consideration the 1698 participants that were classified to have acquired HIV before the start of the study period T, and based estimates on the 3311 participants that were classified to have acquired HIV in T. As in Section S4.2 we used a Beta-Binomial regression framework to estimate the posterior distribution of deep-sequencing probabilities $\xi^{s'} = (\xi^{s'}_a)_{a \in \mathcal{A}}$, where $\xi^{s'}_a$ denotes the probability of obtaining a deep-sequence sample from (classified) newly infected participants in population sub-group a. We follow the notation of Section S4.2 and further denote by $N^{\text{study}}_a$ the number of infected participants in a who were classified to have acquired HIV during the study period T, and by $N^{s'}_a$ the number of infected participants in a who were classified to have acquired infection during the study period and had virus deep-sequenced successfully. We considered models of the form

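$$
\begin{aligned}
N^{s'}_a &\sim \text{Beta-Binomial}(N^{\text{study}}_a, \xi^{s'}_a, \gamma), \quad \forall a, \\
\text{logit}(\xi^{s'}) &= \beta_0 \mathbf{1} + \beta X, \\
\beta_0 &\sim \mathcal{N}(0, 100), \\
\beta &\sim \mathcal{N}(0, 10 I), \\
\gamma &\sim \text{Exp}(1),
\end{aligned}
\tag{S22}
$$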

and potentially including regularising ICAR prior densities as in Section S4.2. We compared eight models S′1-S′8 that were analogous to models S1-S8 in Section S4.2. The models were fitted in Stan version 2.21, convergence diagnostics (Vehtari et al., 2019) were checked, and effective sample sizes were 10000. Table S8 reports the proportion of hold-out data in 95% PPCrI, the MAE, and ELPD for each model in 10-fold cross-validation. We chose to model deep-sequencing probabilities among newly infected participants across population strata with model S′5. Figure S13 shows the marginal posterior densities of the regression coefficients under model S′5, and Figure S14 shows the resulting deep-sequencing probabilities among recipients. We further compared the estimated deep-sequencing probabilities among recipients (approximated among newly infected participants regardless of ART use) to those among potential source cases (approximated among all infected participants that reported no ART use) in Figure S15. Deep-sequence sampling probabilities were estimated to be slightly lower among newly infected populations in older age groups when compared to source populations, but were overall very similar. Note that during the study period immediate start of treatment was not yet recommended by guidelines, and so infected populations were mostly ART naive at the time of their first survey date, which explains why sampling probabilities in the two groups were overall similar.

[Figure S13 image not reproduced. Coefficients shown: a, male, fishing, sigma, phi15 to phi49; value axis from −10 to 10.]

Fig. S13. Marginal posterior densities of regression coefficients of model S′5 to estimate conditional deep-sequencing probabilities among infected participants who were classified to have acquired HIV during the study period. Circle: median; bar: interquartile range; line: 95% credibility interval. The first variable a refers to the intercept.
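As an illustration of the Beta-Binomial regression in (S22), the sketch below evaluates the log-probability of one observed count and the logit-linear predictor. The parameterisation is an assumption made here for illustration: the three-argument Beta-Binomial is read as having mean probability $\xi$ and Beta shape parameters $\xi\gamma$ and $(1-\xi)\gamma$, which may differ from the convention defined in Section S4.2. All function names are hypothetical, and this is not the authors' Stan implementation.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_logpmf(k, n, xi, gamma):
    """Log-probability of k successes out of n trials.

    ASSUMPTION: mean/size parameterisation with shape parameters
    alpha = xi * gamma and beta = (1 - xi) * gamma.
    """
    alpha = xi * gamma
    beta = (1.0 - xi) * gamma
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def sequencing_probability(beta0, beta, covariates):
    """xi for one sub-group from logit(xi) = beta0 + sum_j beta_j * x_j."""
    return inv_logit(beta0 + sum(b * x for b, x in zip(beta, covariates)))


# Toy check with hypothetical values: probabilities over k = 0..n sum to one.
n_study, xi, gamma = 10, 0.3, 2.0
pmf = [math.exp(beta_binomial_logpmf(k, n_study, xi, gamma))
       for k in range(n_study + 1)]
total_probability = sum(pmf)
mean_count = sum(k * p for k, p in enumerate(pmf))
```

Under this assumed parameterisation the expected count equals `n_study * xi`, which is what makes $\xi^{s'}_a$ interpretable as a sequencing probability.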

To summarise, the sampling probability of recipient cases was approximated by

$$
\xi^R_a = \xi^p_a \times \xi^{s'}_a \tag{S23}
$$

for all population sub-groups $a$.

[Figure S14 image not reproduced. Panels: fishing sites, inland communities. x-axis: one-year increment age groups (15 to 50); y-axis: probability of sequencing a recipient case (0% to 100%); colour: gender (female, male).]

Fig. S14. Marginal posterior densities of recipient sequencing probabilities under model S′5. Circle: median; line: 95% credibility interval.

[Figure S15 image not reproduced. Panels: female and male, each for fishing sites and inland communities. x-axis: one-year increment age groups (15 to 50); y-axis: sequencing probability (0% to 100%); colour: sampling of newly infected populations, source populations.]

Fig. S15. Comparison of marginal posterior densities of regression coefficients between model S5 and model S′5. Circle: median; bar: interquartile range; line: 95% credibility interval.

[Figure S16 image not reproduced. Panels: female -> male, male -> female. x-axis: age (transmitters), 15 to 50; y-axis: age (recipients), 15 to 50; colour: sampling fractions (0.0 to 0.5).]

Fig. S16. Estimated sampling probabilities of transmission events in RCCS inland communities, by 1-year age bands. Sampling probabilities of transmission events were calculated by multiplying the sampling probabilities of source and recipient populations. Shown are the posterior median estimates of the sampling probabilities by the age of source and recipient populations at the midpoint of the study period. Left panel: female to male transmission. Right panel: male to female transmission.

S4.5. Pairwise sampling probabilities of sources and recipients

We then used the sampling probabilities of source and recipient cases, $\xi^S_a$ and $\xi^R_b$ in (S20) and (S23), to calculate the sampling probability of transmission events from sub-population $a$ to sub-population $b$ as $\xi^S_a \times \xi^R_b$. In doing so, we assumed that source and recipient cases are independently sampled. Figure S16 illustrates the resulting sampling probabilities of transmission events in inland communities, and Figure S17 illustrates the sampling probabilities of transmission events in fishing communities.
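The two multiplications involved, the recipient sampling probability in (S23) and the pairwise product over source and recipient sub-populations, can be sketched as below. The sub-group labels and all numerical values are hypothetical and chosen only to make the arithmetic easy to follow; the function names are likewise illustrative.

```python
def recipient_sampling_probability(participation, sequencing_new):
    """xi^R_a = xi^p_a * xi^{s'}_a for every sub-group a, as in (S23)."""
    return {a: participation[a] * sequencing_new[a] for a in participation}


def pairwise_sampling_probability(source, recipient):
    """xi^S_a * xi^R_b for every ordered pair (a, b).

    Relies on the assumption stated in the text that source and
    recipient cases are sampled independently.
    """
    return {(a, b): source[a] * recipient[b]
            for a in source for b in recipient}


# Hypothetical sub-groups and probabilities (not study estimates).
participation = {"F20": 0.8, "M25": 0.5}
sequencing_new = {"F20": 0.5, "M25": 0.4}
source = {"F20": 0.3, "M25": 0.25}

recipient = recipient_sampling_probability(participation, sequencing_new)
pairs = pairwise_sampling_probability(source, recipient)
```

Here `pairs[("M25", "F20")]` is the probability that a transmission event from the hypothetical group "M25" to "F20" has both its source and its recipient sampled.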

[Figure S17 image not reproduced. Panels: female -> male, male -> female. x-axis: age (transmitters), 15 to 50; y-axis: age (recipients), 15 to 50; colour: sampling fractions (0.0 to 0.5).]

Fig. S17. Estimated sampling probabilities of transmission events in RCCS fishing communities, by 1-year age bands. Shown are the posterior median estimates of the sampling probabilities by the age of source and recipient populations at the midpoint of the study period. Left panel: female to male transmission. Right panel: male to female transmission.

S5. Analyses using different age bands

In the main text, we stratified and analyzed data by 1-year age bands to obtain detailed estimates of age-specific transmission flows. In this section, we present results of analyses using data stratified by 2-year age bands and by 5-year age bands. For the 5-year age band analysis, we consider two stratifications, 15-19, 20-24, ... and (12.5-17.5], (17.5-22.5], ... . We used the same methodological approach as for 1-year age bands, except that based on the fitted GP model in Equation (15), we also predicted transmission flows at 1-year age bands from the model fitted to 2-year or 5-year stratified data. To illustrate the impact of using different age stratifications, we here focus on the proportion of transmissions from men who are of the same age as or younger than female recipients, up to 5 years older, or > 5 years older than their female counterparts.
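The summary statistic used in this section can be sketched as follows: given flows from men to women at 1-year resolution, the flows into one female age are split into the three source categories and normalised. The rule for integer ages (a gap of exactly 5 years counts as "up to 5 years older") is an assumption of this sketch, and the names and toy flows are hypothetical.

```python
CATEGORIES = (
    "same age or younger",
    "up to 5 years older",
    "more than 5 years older",
)


def source_age_category(male_age, female_age):
    """Assign a male source to one of the three age categories."""
    gap = male_age - female_age
    if gap <= 0:
        return CATEGORIES[0]
    if gap <= 5:
        return CATEGORIES[1]
    return CATEGORIES[2]


def source_proportions(flows, female_age):
    """Share of flows into women of one age, by source category.

    `flows` maps (male_age, female_age) to a non-negative flow.
    """
    totals = {category: 0.0 for category in CATEGORIES}
    for (male_age, recipient_age), flow in flows.items():
        if recipient_age == female_age:
            totals[source_age_category(male_age, recipient_age)] += flow
    total = sum(totals.values())
    if total == 0:
        raise ValueError("no flow into this recipient age")
    return {category: value / total for category, value in totals.items()}


# Hypothetical flows, not study estimates.
toy_flows = {(18, 20): 1.0, (22, 20): 2.0, (25, 20): 1.0,
             (27, 20): 4.0, (30, 25): 3.0}
proportions_age_20 = source_proportions(toy_flows, 20)
```

Because the proportions are computed from 1-year predictions, the same function applies whether the model was fitted to 1-year, 2-year or 5-year stratified data.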

Figures S18A-D show the observed data, the phylogenetically likely source-recipient pairs, by 1-year, 2-year and 5-year age stratifications respectively. To facilitate comparison across the choice of age strata, the observed counts were divided by the average sampling fractions, so that the figures show the expected number of phylogenetically likely source-recipient pairs after adjusting for sampling differences. To characterise sources of infections in young and adolescent women, our primary interest is to estimate if male source partners are more than 5 years older, i.e. for 17 year old women, interest is in men aged ≥ 23 or < 23 and for 24 year old women interest is in men aged ≥ 29 or < 29, and these cut-offs are indicated as a dashed lower diagonal line in Figure S18. Intuitively, it is clear that analyses stratified by 5-year age strata are too coarse to address the primary analytic aims.
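The sampling adjustment of the observed counts amounts to a single division, sketched below. It assumes that the average is an unweighted mean over the 1-year cells that make up a stratum, which the text does not specify; the function name and the numbers are hypothetical.

```python
def adjusted_count(observed_count, cell_sampling_fractions):
    """Observed count divided by the average sampling fraction.

    ASSUMPTION: the average is an unweighted mean over the sampling
    fractions of the cells aggregated into the stratum.
    """
    if not cell_sampling_fractions:
        raise ValueError("need at least one sampling fraction")
    average = sum(cell_sampling_fractions) / len(cell_sampling_fractions)
    return observed_count / average


# Toy example: 3 observed pairs in a stratum built from four cells.
expected_pairs = adjusted_count(3, [0.2, 0.3, 0.25, 0.25])
```

With an average sampling fraction of 0.25, three observed pairs correspond to twelve expected pairs after adjustment.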

Figure S19A illustrates the estimated age-specific transmission flows at 1-year age stratification (posterior median) from data stratified by 1-year age bands, and Figures S19B-D illustrate the predicted age-specific transmission flows at 1-year age stratification (posterior predictive median) from data stratified respectively by 2-year and 5-year age bands. The figures show that when using coarser age stratifications, the inferred peak of male to female transmission flows shifts to younger ages, and falls more strongly into the diagonal band indicating transmission from men aged up to 5 years older. We also find that the predicted transmission flows shift considerably depending on how exactly the start and end points of 5-year stratifications are chosen. Figure S20 shows that using 5-year age stratifications with different start and end points provides markedly different estimates of the sources of infections in young and adolescent women, while estimates using 1-year and 2-year age stratifications are very similar.
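The sensitivity to the start and end points of the 5-year bands can be seen from a small worked example. The sketch below maps ages to the two stratifications considered in this section and compares two hypothetical source-recipient pairs, one with a 4-year and one with a 6-year age gap; the function names and the example ages are illustrative assumptions.

```python
import math


def band_15_19(age):
    """Band (lower, upper) in the stratification 15-19, 20-24, ..."""
    lower = 15 + 5 * ((age - 15) // 5)
    return (lower, lower + 4)


def band_shifted(age):
    """Band (lower, upper] in the stratification (12.5-17.5], (17.5-22.5], ..."""
    index = math.ceil((age - 12.5) / 5) - 1
    lower = 12.5 + 5 * index
    return (lower, lower + 5)


# Hypothetical pairs (male source age, female recipient age).
pair_gap_4 = (24, 20)   # man up to 5 years older
pair_gap_6 = (26, 20)   # man more than 5 years older

standard = {pair: (band_15_19(pair[0]), band_15_19(pair[1]))
            for pair in (pair_gap_4, pair_gap_6)}
shifted = {pair: (band_shifted(pair[0]), band_shifted(pair[1]))
           for pair in (pair_gap_4, pair_gap_6)}
```

Under the 15-19, 20-24, ... bands the two pairs fall into different cells, whereas under the shifted bands they fall into the same cell, so the 5-year cut-off of interest is no longer visible in the aggregated data.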

These analyses suggest that statistical models that are able to borrow information across the exact ages of the individuals in likely transmission pairs are an important technique to estimate age-specific transmission flows, and in particular the sources of transmission in young and adolescent women. Comparing the findings from 1-year and 2-year age stratifications suggests further that analyses could have been performed by 2-year age bands at smaller computational cost.

[Figure S18 image not reproduced. Panels A to D. x-axis: ages of male source (20 to 50); y-axis: ages of female recipients (20 to 50); colour: adjusted counts.]

Fig. S18. Aggregated male-to-female transmission counts. We compared male-to-female transmission counts (in colour) aggregated by 1-year (Panel A), 2-year (Panel B) and 5-year age bands (Panels C-D). These counts were adjusted by average sampling fractions per stratum. The two black dashed lines are reference lines, indicating sources of the same age as recipients (top) and sources 5 years older than recipients.

[Figure S19 image not reproduced. Panels A to D. x-axis: ages of male source (20 to 50); y-axis: ages of female recipients (20 to 50); colour: estimated flows.]

Fig. S19. Estimated male-to-female transmission flows. Using the aggregated counts in Figure S18, we modelled the transmission intensities using the GP prior densities (15), and predicted flows by 1-year age bands. Panels A-D show the estimated transmission flows (in colour) from men (x-axis) to women (y-axis) using the corresponding data in Figure S18. The two black dashed lines are reference lines, indicating sources of the same age as recipients (top) and sources 5 years older than recipients.

[Figure S20 image not reproduced. Facets: men younger or same age, men 1-5 yrs older, men >5 yrs older. x-axis: age of female recipients (20 to 50); y-axis: estimated sources of infections in women (0% to 100%); colour: 1-year age band, 2-year age band, 5-year age band (12.5-17.5, 17.5-22.5, ...), 5-year age band (15-19, 20-24, ...).]

Fig. S20. Source of infections in women under 4 types of aggregation. Using the aggregated counts in Figure S18, we modelled the transmission intensities using the GP prior densities (15), predicted flows by 1-year age bands and evaluated the proportion of infections from men in three age categories (younger or of the same age, aged 1-5 years older and aged > 5 years older). The proportions of infections from the 3 age categories (in facets) are shown against the ages of female recipients in the bar plots, and estimates are based on 4 types of aggregation (in colour).


S6. Supplementary Figures and Tables

Table S9. Estimated sources of infections in women, Rakai, Uganda, 2009-2015. Entries are the estimated sources by age, given as posterior mean (95% credibility interval).

| Age of infected women | Men younger or same age | Men 1-5 yrs older | Men >5 yrs older |
|---|---|---|---|
| 15 | 0.6% (0.1% - 3.2%) | 7.6% (2.1% - 20%) | 91.8% (77.6% - 97.7%) |
| 16 | 1.1% (0.1% - 4.9%) | 9.7% (3.7% - 20.5%) | 89% (75.2% - 95.9%) |
| 17 | 1.7% (0.3% - 6.4%) | 12.3% (5.9% - 23.1%) | 85.6% (72.9% - 93.4%) |
| 18 | 2.5% (0.6% - 7.6%) | 15.6% (8.8% - 26.2%) | 81.5% (69.2% - 89.9%) |
| 19 | 3.5% (1% - 8.9%) | 19.4% (12% - 29.3%) | 76.7% (65.1% - 85.6%) |
| 20 | 4.8% (1.8% - 10.5%) | 23% (15.5% - 32.6%) | 71.8% (60.6% - 80.7%) |
| 21 | 6.6% (3% - 12.6%) | 26.4% (18.6% - 35.3%) | 66.8% (56.1% - 76%) |
| 22 | 8.7% (4.6% - 15.2%) | 29.1% (21.2% - 37.8%) | 61.9% (51.6% - 71.4%) |
| 23 | 11.3% (6.4% - 18.2%) | 31.3% (23.5% - 40.1%) | 57.1% (47% - 66.9%) |
| 24 | 14% (8.6% - 21.1%) | 33.2% (25.4% - 41.6%) | 52.5% (42.6% - 62.1%) |
| 25 | 17% (11% - 24.3%) | 34.9% (27.6% - 43.1%) | 47.8% (38.2% - 57.4%) |
| 26 | 20.2% (13.8% - 28.1%) | 36.1% (28.6% - 44.7%) | 43.4% (34.2% - 53.1%) |
| 27 | 23.9% (16.6% - 32.5%) | 36.6% (28.6% - 45.9%) | 39.2% (30.3% - 49.3%) |
| 28 | 28.1% (20.2% - 37.2%) | 35.8% (28% - 45.3%) | 35.8% (26.7% - 45.7%) |
| 29 | 32.8% (24.5% - 42.1%) | 33.8% (26.6% - 42.7%) | 32.9% (24.2% - 43.3%) |
| 30 | 37.7% (28.6% - 47.6%) | 31.2% (24.3% - 39.1%) | 30.8% (22.4% - 40.6%) |
| 31 | 42.2% (32.1% - 52.7%) | 28.6% (21.2% - 36%) | 28.9% (20.6% - 39.1%) |
| 32 | 46% (34.8% - 57%) | 26.6% (18.8% - 34.5%) | 27.2% (18.6% - 38.2%) |
| 33 | 48.9% (36.4% - 60.3%) | 25.6% (17.5% - 34.6%) | 25.1% (16.6% - 37.7%) |
| 34 | 51.6% (36.9% - 63.6%) | 25.3% (16.9% - 34.8%) | 22.5% (13.9% - 35.7%) |
| 35 | 54.6% (39.3% - 66.9%) | 25.5% (17.1% - 35.2%) | 19.4% (11.1% - 32.7%) |

References

Abeler-Dörner, L., Grabowski, M. K., Rambaut, A., Pillay, D., Fraser, C. et al. (2019) PANGEA-HIV 2: Phylogenetics and networks for generalised epidemics in Africa. Current Opinion in HIV and AIDS, 14, 173–180.

Ailloud, F., Didelot, X., Woltemate, S., Pfaffinger, G., Overmann, J., Bader, R. C., Schulz, C., Malfertheiner, P. and Suerbaum, S. (2019) Within-host evolution of Helicobacter pylori shaped by niche-specific adaptation, intragastric migrations and selective sweeps. Nature Communications, 10, 1–13.

Anderson, R. M. and May, R. M. (1992) Infectious diseases of humans: dynamics and control. Oxford University Press.

Barré-Sinoussi, F., Abdool Karim, S. S., Albert, J., Bekker, L.-G., Beyrer, C., Cahn, P., Calmy, A., Grinsztejn, B., Grulich, A., Kamarulzaman, A. et al. (2018) Expert consensus statement on the science of HIV in the context of criminal law. Journal of the International AIDS Society, 21, e25161.

Bbosa, N., Ssemwanga, D., Ssekagiri, A., Xi, X., Mayanja, Y., Bahemuka, U., Seeley, J., Pillay, D., Abeler-Dörner, L., Golubchik, T. et al. (2020) Phylogenetic and demographic characterization of directed HIV-1 transmission using deep sequences from high-risk and general population cohorts/groups in Uganda. Viruses, 12, 331.

Berger, J. O., Bernardo, J. M. and Sun, D. (2015) Overall objective priors. Bayesian Analysis, 10, 189–221.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P. and Riddell, A. (2017) Stan: A probabilistic programming language. Journal of Statistical Software, 76.

Cohen, M. S., Chen, Y. Q., McCauley, M., Gamble, T., Hosseinipour, M. C., Kumarasamy, N., Hakim, J. G., Kumwenda, J., Grinsztejn, B., Pilotto, J. H. et al. (2011) Prevention of HIV-1 infection with early antiretroviral therapy. New England Journal of Medicine, 365, 493–505.

De Oliveira, T., Kharsany, A. B., Gräf, T., Cawood, C., Khanyile, D., Grobler, A., Puren, A., Madurai, S., Baxter, C., Karim, Q. A. et al. (2017) Transmission networks and risk of HIV infection in KwaZulu-Natal, South Africa: a community-wide phylogenetic study. The Lancet HIV, 4, e41–e50.

Dwyer-Lindgren, L., Cork, M. A., Sligar, A., Steuben, K. M., Wilson, K. F., Provost, N. R., Mayala, B. K., VanderHeide, J. D., Collison, M. L., Hall, J. B. et al. (2019) Mapping HIV prevalence in sub-Saharan Africa between 2000 and 2017. Nature, 570, 189.

Faria, N. R., Rambaut, A., Suchard, M. A., Baele, G., Bedford, T., Ward, M. J., Tatem, A. J., Sousa, J. D., Arinaminpathy, N., Pépin, J. et al. (2014) The early spread and epidemic ignition of HIV-1 in human populations. Science, 346, 56–61.

Faye, O., Boëlle, P.-Y., Heleze, E., Faye, O., Loucoubar, C., Magassouba, N., Soropogui, B., Keita, S., Gakou, T., Koivogui, L. et al. (2015) Chains of transmission and control of Ebola virus disease in Conakry, Guinea, in 2014: an observational study. The Lancet Infectious Diseases, 15, 320–326.

Gall, A., Ferns, B., Morris, C., Watson, S., Cotten, M., Robinson, M., Berry, N., Pillay, D. and Kellam, P. (2012) Universal amplification, next-generation sequencing, and assembly of HIV-1 genomes. Journal of Clinical Microbiology, 50, 3838–3844.

Givens, G. H., Smith, D. and Tweedie, R. (1997) Publication bias in meta-analysis: a Bayesian data-augmentation approach to account for issues exemplified in the passive smoking debate. Statistical Science, 221–240.

Golubchik, T., Ratmann, O., Wymant, C., Hall, M., Bonsall, D., Grabowski, M. K., Laeyendecker, O. and Fraser, C. (2017) Quantifying within-host viral diversification using deep sequencing data: recent vs chronic HIV infection.

Grabowski, M. K., Lessler, J., Redd, A. D., Kagaayi, J., Laeyendecker, O., Ndyanabo, A., Nelson, M. I., Cummings, D. A., Bwanika, J. B., Mueller, A. C. et al. (2014) The role of viral introductions in sustaining community-based HIV epidemics in rural Uganda: evidence from spatial clustering, phylogenetics, and egocentric transmission models. PLoS Medicine, 11.

Grabowski, M. K., Reynolds, S. J., Kagaayi, J., Gray, R. H., Clarke, W., Chang, L., Nakigozi, G., Laeyendecker, O., Redd, A. D., Goud-Billoux, V. et al. (2018) The validity of self-reported antiretroviral use in persons living with HIV: a population-based study. AIDS, 32, 363.

Grabowski, M. K., Serwadda, D. M., Gray, R. H., Nakigozi, G., Kigozi, G., Kagaayi, J., Ssekubugu, R., Nalugoda, F., Lessler, J., Lutalo, T. et al. (2017) HIV prevention efforts and incidence of HIV in Uganda. New England Journal of Medicine, 377, 2154–2166.

Hall, M. D., Holden, M. T., Srisomang, P., Mahavanakul, W., Wuthiekanun, V., Limmathurotsakul, D., Fountain, K., Parkhill, J., Nickerson, E. K., Peacock, S. J. et al. (2019) Improved characterisation of MRSA transmission using within-host bacterial sequence diversity. eLife, 8, e46402.

Hayes, R. J., Donnell, D., Floyd, S., Mandla, N., Bwalya, J., Sabapathy, K., Yang, B., Phiri, M., Schaap, A., Eshleman, S. H., Piwowar-Manning, E., Kosloff, B., James, A., Skalland, T., Wilson, E., Emel, L., Macleod, D., Dunbar, R., Simwinga, M., Makola, N., Bond, V., Hoddinott, G., Moore, A., Griffith, S., Deshmane Sista, N., Vermund, S. H., El-Sadr, W., Burns, D. N., Hargreaves, J. R., Hauck, K., Fraser, C., Shanaube, K., Bock, P., Beyers, N., Ayles, H. and Fidler, S. (2019) Effect of Universal Testing and Treatment on HIV Incidence — HPTN 071 (PopART). New England Journal of Medicine, 381, 207–218. PMID: 31314965.

Hazelton, M. L. (2001) Inference for origin–destination matrices: estimation, prediction and reconstruction. Transportation Research Part B: Methodological, 35, 667–676.

Houlihan, C. F., Frampton, D., Ferns, R. B., Raffle, J., Grant, P., Reidy, M., Hail, L., Thomson, K., Mattes, F., Kozlakidis, Z. et al. (2018) Use of whole-genome sequencing in the investigation of a nosocomial influenza virus outbreak. The Journal of Infectious Diseases, 218, 1485–1489.

van de Kassteele, J., van Eijkeren, J., Wallinga, J. et al. (2017) Efficient estimation of age-specific social contact rates between men and women. The Annals of Applied Statistics, 11, 320–339.

Le Vu, S., Ratmann, O., Delpech, V., Brown, A. E., Gill, O. N., Tostevin, A., Dunn, D., Fraser, C., Volz, E. M. and Database, U. H. D. R. (2019) HIV-1 transmission patterns in Men Who Have Sex with Men: Insights from genetic source attribution analysis. AIDS Research and Human Retroviruses, 35, 805–813.

Leitner, T. and Romero-Severson, E. (2018) Phylogenetic patterns recover known HIV epidemiological relationships and reveal common transmission of multiple variants. Nature Microbiology, 3, 983–988.

Lemey, P., Rambaut, A., Drummond, A. J. and Suchard, M. A. (2009) Bayesian phylogeography finds its roots. PLoS Computational Biology, 5, e1000520.

Lindström, T., Grear, D. A., Buhnerkempe, M., Webb, C. T., Miller, R. S., Portacci, K. and Wennergren, U. (2013) A Bayesian approach for modeling cattle movements in the United States: scaling up a partially observed network. PLoS One, 8, e53432.

Miller, H. J., Dodge, S., Miller, J. and Bohrer, G. (2019) Towards an integrated science of movement: converging research on animal movement ecology and human mobility science. International Journal of Geographical Information Science, 33, 855–876.

Poon, A. F., Gustafson, R., Daly, P., Zerr, L., Demlow, S. E., Wong, J., Woods, C. K., Hogg, R. S., Krajden, M., Moore, D. et al. (2016) Near real-time monitoring of HIV transmission hotspots from routine HIV genotyping: an implementation case study. The Lancet HIV, 3, e231–e238.

Probert, W., Hall, M., Xi, X., Sauter, R., Golubchik, T., Bonsall, D., Abeler-Dörner, L., Pickles, M., Cori, A., Bwalya, J., Floyd, S., Mandla, N., Shanaube, K., Yang, B., Ayles, H., Bock, P., Donnell, D., Grabowski, K., Pillay, D., Rambaut, A., Ratmann, O., Fidler, S., Hayes, R., Fraser, C., consortium, P. and the HPTN 071 (PopART) study team (2019) Quantifying the contribution of different aged men and women to onwards transmission of HIV-1 in generalised epidemics in sub-Saharan Africa: A modelling and phylogenetics approach from the HPTN071 (PopART) trial.

Rasmussen, C. E. and Williams, C. (2006) Gaussian processes for Machine Learning. MIT Press.

Ratmann, O., Grabowski, M. K., Hall, M., Golubchik, T., Wymant, C., Abeler-Dörner, L., Bonsall, D., Hoppe, A., Brown, A. L., de Oliveira, T. et al. (2019) Inferring HIV-1 transmission networks and sources of epidemic spread in Africa with deep-sequence phylogenetic analysis. Nature Communications, 10, 1–13.

Ratmann, O., Kagaayi, J., Hall, M., Golubchik, T., Kigozi, G., Xi, X., Wymant, C., Nakigozi, G., Abeler-Dörner, L., Bonsall, D. et al. (2020) Quantifying HIV transmission flow between high-prevalence hotspots and surrounding communities: a population-based study in Rakai, Uganda. The Lancet HIV, 7, e173–e183.

Ratmann, O., Van Sighem, A., Bezemer, D., Gavryushkina, A., Jurriaans, S., Wensing, A., De Wolf, F., Reiss, P., Fraser, C. et al. (2016) Sources of HIV infection among men having sex with men and implications for prevention. Science Translational Medicine, 8, 320ra2.

Raymer, J., Wiśniowski, A., Forster, J. J., Smith, P. W. and Bijak, J. (2013) Integrated modeling of European migration. Journal of the American Statistical Association, 108, 801–819.

Rhoads, A. and Au, K. F. (2015) PacBio sequencing and its applications. Genomics, Proteomics & Bioinformatics, 13, 278–289.

Riutort-Mayol, G., Bürkner, P.-C., Andersen, M. R., Solin, A. and Vehtari, A. (2020) Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming. arXiv preprint arXiv:2004.11408.

Rue, H. and Held, L. (2005) Gaussian Markov random fields: theory and applications. CRC Press.

Saul, J., Bachman, G., Allen, S., Toiv, N., Cooney, C. and Beamon, T. (2018) Determined resilient empowered AIDS-free mentored and safe (DREAMS): What is the core package and why now. PLoS One, 13, e0208167.

Scire, J., Barido-Sottani, J., Kühnert, D., Vaughan, T. G. and Stadler, T. (2020) Improved multi-type birth-death phylodynamic inference in BEAST 2. bioRxiv, 2020.01.06.895532.

Skums, P., Zelikovsky, A., Singh, R., Gussler, W., Dimitrova, Z., Knyazev, S., Mandric, I., Ramachandran, S., Campo, D., Jha, D. et al. (2018) QUENTIN: reconstruction of disease transmissions from viral quasispecies genomic data. Bioinformatics, 34, 163–170.

Solin, A. and Särkkä, S. (2020) Hilbert space methods for reduced-rank Gaussian process regression. Statistics and Computing, 30, 419–446.

Sun, K., Wang, W., Gao, L., Wang, Y., Luo, K., Ren, L., Zhan, Z., Chen, X., Zhao, S., Huang, Y. et al. (2021) Transmission heterogeneities, kinetics, and controllability of SARS-CoV-2. Science, 371.

Tebaldi, C. and West, M. (1998) Bayesian inference on network traffic using link count data. Journal of the American Statistical Association, 93, 557–573.

UNAIDS (2018) Miles to go: closing gaps, breaking barriers, righting injustices, document jc2924. URL: https://www.unaids.org/sites/default/files/media_asset/miles-to-go_en.pdf.

UNAIDS (2019) UNAIDS Data 2019, document jc2959e. URL: https://www.unaids.org/en/resources/documents/2019/2019-UNAIDS-data.

Vaughan, T. G., Kühnert, D., Popinga, A., Welch, D. and Drummond, A. J. (2014) Efficient Bayesian inference under the structured coalescent. Bioinformatics, 30, 2272–2279.

Vaughan, T. G. and Drummond, A. J. (2013) A stochastic simulator of birth–death master equations with application to phylodynamics. Molecular Biology and Evolution, 30, 1480–1493.

Vehtari, A., Gelman, A., Simpson, D., Carpenter, B. and Bürkner, P.-C. (2019) Rank-normalization, folding, and localization: An improved $\widehat{R}$ for assessing convergence of MCMC. arXiv preprint arXiv:1903.08008.

Volz, E. M., Ionides, E., Romero-Severson, E. O., Brandt, M.-G., Mokotoff, E. and Koopman, J. S. (2013) HIV-1 transmission during early infection in Men who have Sex with Men: a phylodynamic analysis. PLoS Medicine, 10, e1001568.

Volz, E. M., Pond, S. L. K., Ward, M. J., Brown, A. J. L. and Frost, S. D. (2009) Phylodynamics of infectious disease epidemics. Genetics, 183, 1421–1430.

Wymant, C., Hall, M., Ratmann, O., Bonsall, D., Golubchik, T., de Cesare, M., Gall, A., Cornelissen, M., Fraser, C., STOP-HCV Consortium, The Maela Pneumococcal Collaboration and The BEEHIVE Collaboration (2017) PHYLOSCANNER: inferring transmission from within- and between-host pathogen genetic diversity. Molecular Biology and Evolution, 35, 719–733.

Zhang, Y., Wymant, C., Laeyendecker, O., Grabowski, M. K., Hall, M., Hudelson, S., Piwowar-Manning, E., McCauley, M., Gamble, T., Hosseinipour, M. C. et al. (2020) Evaluation of phylogenetic methods for inferring the direction of HIV transmission: HPTN 052. Clinical Infectious Diseases, Epub ahead of print, ciz1247.

Stadler, T. and Bonhoeffer, S. (2013) Uncovering epidemiological dynamics in heterogeneous host populations using phylogenetic methods. Philosophical Transactions of the Royal Society B: Biological Sciences, 368, 20120198.