Coverage statistics for sequence census methods

10
RESEARCH ARTICLE Open Access Coverage statistics for sequence census methods Steven N Evans 1,2 , Valerie Hower 1* , Lior Pachter 1,3 Abstract Background: We study the statistical properties of fragment coverage in genome sequencing experiments. In an extension of the classic Lander-Waterman model, we consider the effect of the length distribution of fragments. We also introduce a coding of the shape of the coverage depth function as a tree and explain how this can be used to detect regions with anomalous coverage. This modeling perspective is especially germane to current high- throughput sequencing experiments, where both sample preparation protocols and sequencing technology particulars can affect fragment length distributions. Results: Under the mild assumptions that fragment start sites are Poisson distributed and successive fragment lengths are independent and identically distributed, we observe that, regardless of fragment length distribution, the fragments produced in a sequencing experiment can be viewed as resulting from a two-dimensional spatial Poisson process. We then study the successive jumps of the coverage function, and show that they can be encoded as a random tree that is approximately a Galton-Watson tree with generation-dependent geometric offspring distributions whose parameters can be computed. Conclusions: We extend standard analyses of shotgun sequencing that focus on coverage statistics at individual sites, and provide a null model for detecting deviations from random coverage in high-throughput sequence census based experiments. Our approach leads to explicit determinations of the null distributions of certain test statistics, while for others it greatly simplifies the approximation of their null distributions by simulation. Our focus on fragments also leads to a new approach to visualizing sequencing data that is of independent interest. Background The classic Lander-Waterman model [1] provides statistical estimates for the read depth in a whole gen- ome shotgun (WGS) sequencing experiment via the Poisson approximation to the Binomial distribution. Although originally intended for estimating the redun- dancy when mapping by fingerprinting random clones, the Lander-Waterman model has served as an essential tool for estimating sequencing requirements for modern WGS experiments [2]. Further-more, although it makes a number of simplifying assumptions (e.g. fixed frag- ment length and uniform fragment selection) that are violated in actual experiments, extensions and generali- zations [3-9] have continued to be developed and applied in a variety of settings. The advent of high-throughput sequencing, which refers to massively parallel sequencing technologies has greatly increased the scope and applicability of sequen- cing experiments. With the increasing scope of experi- ments, new statistical questions about coverage statistics have emerged. In particular, in the context of sequence census methods, it has become important to understand the shape of coverage functions. Sequence census methods [10] are experiments designed to assess the content of a mixture of molecules via the creation of DNA fragments whose abundances can be used to infer those of the original molecules. The DNA fragments are identified by sequencing, and the desired abundances inferred by solution of an inverse problem. An example of a sequence census method is ChIP-Seq. In this experiment, the goal is to determine the locations in the genome where a specific protein binds. An anti-body to the protein is used to pull downfragments of DNA that are bound via a process called chromatin immunoprecipitation (abbreviated by ChIP). These fragments form the mixture of moleculesand after purifying the DNA, the fragments are deter- mined by sequencing. The resulting sequences are * Correspondence: [email protected] 1 Department of Mathematics, University of California, Berkeley, California, USA Full list of author information is available at the end of the article Evans et al. BMC Bioinformatics 2010, 11:430 http://www.biomedcentral.com/1471-2105/11/430 © 2010 Evans et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Transcript of Coverage statistics for sequence census methods

RESEARCH ARTICLE Open Access

Coverage statistics for sequence census methodsSteven N Evans1,2, Valerie Hower1*, Lior Pachter1,3

Abstract

Background: We study the statistical properties of fragment coverage in genome sequencing experiments. In anextension of the classic Lander-Waterman model, we consider the effect of the length distribution of fragments.We also introduce a coding of the shape of the coverage depth function as a tree and explain how this can beused to detect regions with anomalous coverage. This modeling perspective is especially germane to current high-throughput sequencing experiments, where both sample preparation protocols and sequencing technologyparticulars can affect fragment length distributions.

Results: Under the mild assumptions that fragment start sites are Poisson distributed and successive fragmentlengths are independent and identically distributed, we observe that, regardless of fragment length distribution,the fragments produced in a sequencing experiment can be viewed as resulting from a two-dimensional spatialPoisson process. We then study the successive jumps of the coverage function, and show that they can beencoded as a random tree that is approximately a Galton-Watson tree with generation-dependent geometricoffspring distributions whose parameters can be computed.

Conclusions: We extend standard analyses of shotgun sequencing that focus on coverage statistics at individualsites, and provide a null model for detecting deviations from random coverage in high-throughput sequencecensus based experiments. Our approach leads to explicit determinations of the null distributions of certain teststatistics, while for others it greatly simplifies the approximation of their null distributions by simulation. Our focuson fragments also leads to a new approach to visualizing sequencing data that is of independent interest.

BackgroundThe classic “Lander-Waterman model” [1] providesstatistical estimates for the read depth in a whole gen-ome shotgun (WGS) sequencing experiment via thePoisson approximation to the Binomial distribution.Although originally intended for estimating the redun-dancy when mapping by fingerprinting random clones,the Lander-Waterman model has served as an essentialtool for estimating sequencing requirements for modernWGS experiments [2]. Further-more, although it makesa number of simplifying assumptions (e.g. fixed frag-ment length and uniform fragment selection) that areviolated in actual experiments, extensions and generali-zations [3-9] have continued to be developed andapplied in a variety of settings.The advent of “high-throughput sequencing”, which

refers to massively parallel sequencing technologies has

greatly increased the scope and applicability of sequen-cing experiments. With the increasing scope of experi-ments, new statistical questions about coverage statisticshave emerged. In particular, in the context of sequencecensus methods, it has become important to understandthe shape of coverage functions.Sequence census methods [10] are experiments

designed to assess the content of a mixture of moleculesvia the creation of DNA fragments whose abundancescan be used to infer those of the original molecules. TheDNA fragments are identified by sequencing, and thedesired abundances inferred by solution of an inverseproblem. An example of a sequence census method isChIP-Seq. In this experiment, the goal is to determinethe locations in the genome where a specific proteinbinds. An anti-body to the protein is used to “pulldown” fragments of DNA that are bound via a processcalled chromatin immunoprecipitation (abbreviated byChIP). These fragments form the “mixture of molecules”and after purifying the DNA, the fragments are deter-mined by sequencing. The resulting sequences are

* Correspondence: [email protected] of Mathematics, University of California, Berkeley, California,USAFull list of author information is available at the end of the article

Evans et al. BMC Bioinformatics 2010, 11:430http://www.biomedcentral.com/1471-2105/11/430

© 2010 Evans et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative CommonsAttribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction inany medium, provided the original work is properly cited.

compared to the genome, leading to a coverage functionthat records, at each site, the number of sequenced frag-ments that contained it. As with many sequence censusmethods, “noise” in the experiment leads to randomsequenced fragments that may not correspond to boundDNA, and therefore it is necessary to identify regions ofthe coverage function that deviate from what is expectedin the “null” situation when only noise is present. Find-ing peaks that are extreme requires a definition of“extreme” in the sense of some test statistic taking alarge value as well as a probability model for the cover-age process that leads to the null distribution of the teststatistic and hence to means for calibrating what valuesof the test statistic are improbably large in the nullregime. The height of a peak is one obvious statistics,but we hope to get more discriminating procedures byalso considering a suitably defined numerical summaryof the shape of a peak. Indeed, the shape-based methodspresented here have been used to develop a peak-caller–T-PIC–for the ChIP-Seq assay [11].The purpose of this paper, however, is not to develop

methods for data analysis, but rather to present a nullmodel for the shape of a coverage function that is ofgeneral utility. That is, we propose a definition for theshape of a coverage function in terms of the topology ofa tree. We describe a random instance assuming thatfragments are selected at random from a genome, withlengths of fragments given by a known distribution. Weindicate how our description can be used to either com-pute analytically or approximate via simple Monte Carlosimulation the distributions of quantities of interest in adata analysis.

MethodsIn this section, we use some specialized mathematicalterminology and notation that the reader may be unfa-miliar with. We feel it is important to include this inorder to make our statements rigorous and mathemati-cally correct. We will give the definitions of some of theconcepts and a general idea of others, but first we setsome notation. The symbols R,ℤ, and ℤ≥0 stand for thereal numbers, integers, and non-negative integers(respectively), and the elements of a set can be listedinside curly braces, for instance A = {1,2,3}.

The shape of a fragment coverage functionWe begin by explaining what we mean by a coveragefunction. Given a genome of length N, a coverage func-tion is a function f : {1, ..., N} ® ℤ≥0. The interpretationof this function is that f(i) is the number of sequencedfragments obtained from a sequencing experiment thatcover position i in the genome. Because N is very large,we work with the set R and redefine a coverage functionas f : R ® ℤ≥0, which simplifies our analysis. We next

introduce an object that describes a sequence coveragefunction’s shape. Our approach is motivated by recentapplications of topology including persistent homology[12,13] and the use of critical points in shape analysis[14-16]. For a given coverage function f : R ® ℤ≥0, wewill define a rooted tree, which is a particular type ofdirected graph with all the directed edges pointing awayfrom the root. This tree Tf is based on the upper-excur-sion sets off : Uh: = {(x,f(x))|f(x) ≥ h},h Î ℤ≥0 and keepstrack of how the sets Uh evolve as h decreases. Longpaths in Tf represent features of the coverage functionthat persist through many values of h.Specifically, for each h Î ℤ≥0, let Ch denote the set of

connected components of the upper-excursion set Uh.That is, each element of Ch is an interval I such that f(x)≥ h for all x Î I and if J is another interval for whichI ⊂ J and J ≠ I (so that J strictly contains I), then f(y) <hfor some y Î J. We define the rooted tree Tf = (V,E) asfollows

• Vertices in V correspond to the connected compo-nents in the sets Ch, with h ranging over all non-negative integers.• (i, j) Î E provided their corresponding connectedcomponents c Ci hi

∈ and c Cj h j∈ with hi <hj

satisfy hi = hj-1 and cj ⊂ ci.

Note that the root of Tf corresponds to the single con-nected component in C0. The tree Tf is very similar to acontour tree [[14],§4.1], which is built using level sets ofa function, and a join tree [17]. Indeed, suppose weignore every vertex that is adjacent to only one vertexwith greater height. Then, the remaining vertices of Tf

correspond to (equivalence classes of) local extrema of f.Each local maximum of f yields the birth of a new con-nected component as we sweep down through h Î ℤ≥0

while a local minimum of f merges connected compo-nents. Since we do not require f to have distinct criticalvalues (as is frequently assumed), the vertices in Tf canhave arbitrary (but assumed to be finite) degrees, as isdepicted in Figure 1C.In the sequel, we will use the following equivalent

characterization that can be found in [[18], §2.3]. Givena coverage function f : R ® ℤ≥0 with f(a) = f(b) = 0 andf(x) > 0 for x Î (a, b), we form an integer-valuedsequence x0, ..., x2n that records the changes in height off on the interval [a,b]. First, we note that while the cov-erage from one nucleotide to the next may jump bymore than one, we can always extend the known func-tion values to define a coverage function f on R whosejumps are all one unit. In any case, for the probabilitymodel of the coverage function that we propose below,jumps of size greater than one occur with zero probabil-ity. Then, the sequence x0, ..., x2n consists of the y values

Evans et al. BMC Bioinformatics 2010, 11:430http://www.biomedcentral.com/1471-2105/11/430

Page 2 of 10

that f travels through from x0 := f(a) = 0 to x2n := f(b) =0 and satisfies

x x

x for i n

x x for i n

n

i

i i

0 2

1

0

0 0 2

1 1 2

= => < <− = ≤ ≤−

,

,

.

Such a sequence is called a lattice path excursionaway from 0. Next, we define an equivalence relation onthe set {0, 1, ..., 2n} by setting

i j x x xi j i k j k≡ ⇔ = =≤ ≤min .

The equivalence classes under this relation are in 1:1correspondence with the connected components in theupper-excursion sets of f|[a,b]. One equivalence class is{0, 2n}, and if {i1, ..., ip} is an equivalence class with 0<i1 <i2 < ... <ip then x xi i1 11 1− = − , , whereasx xi iq q− = +1 1 for 2 ≤ q ≤ p Conversely, any index iwith xi-1 = xi-1 is the minimal element of an equivalenceclass. We use the minimal element of each equivalenceclass as its representative. Thus, we can view the verticesof Tf a b|[ , ] as the set {0} ∪ {i|xi-1 = xi-1} Two indices i1<i2 are adjacent in Tf a b|[ , ] provided x xi i2 1

1= + andx xk i≥

1for i1 ≤ k ≤ i2. Figure 1 gives an example of a

coverage function together with its lattice path excur-sion (0, 1, 2, 3, 4, 3, 2, 3, 4, 5, 4, 3, 2, 3, 2, 1, 0) androoted tree. The minimal elements of each equivalenceclass in Figure 1B are depicted with red squares.

Planar Poisson processes from sequencing experimentsIn order to model random coverage along the genomethought of as a continuum, we adopt the perspective of

the Lander-Waterman model and use a Poisson processto give random starting locations for the fragments. Spe-cifically, we suppose that the left end-points of the frag-ments form stationary Poisson point process on R withintensity r.At each point of the Poisson point process we lay

down an interval that has that point as its left end-point. The lengths of the successive intervals are inde-pendent and identically distributed with common distri-bution μ. We will use the notation X for a coveragefunction built from this process and Xt for the height ata point t.Let t1,t2, ... be the left-end points and l1,l2, ... be the

corresponding lengths of intervals. The interval given by(ti, li) will cover a nucleotide t0 provided ti ≤ t0 and ti +li ≥ t0. We can view this pictorially by plotting points{(tj,lj)} in the plane. Then Xt0

– the number of intervalscovering t0– is the number of points in the wedge-shaped region in Figure 2.Before defining a two-dimensional Poisson process, we

note that the reader can think of Borel sets as being the“nice” subsets of R2 that measures are defined on, wherea measure is a generalization of the area of a set. Anyset the reader can imagine is almost certainly a Borel setand we include this terminology to maintain mathemati-cal rigor - there are difficulties that arise in definingmeasures in a self-consistent manner on all subsets ofR2 that don’t arise if we restrict to Borel sets. We nowrecall the definition of a two-dimensional Poisson pro-cess and refer the reader to [[19],§6.13] or [[20], §2.4]for the details. Suppose is a locally finite measure on theBorel sets ℬ (R2) (that is, Γ assigns finite mass to anybounded set). A random countable subset ∏ of R2 iscalled a non-homogeneous Poisson process with mean

Figure 1 A coverage function, lattice path excursion, and rooted tree. A coverage function is depicted in (A) with its associated lattice pathexcursion (0,1,2,3,4,3,2,3,4,5,4,3,2,3,2,1,0) in (B). The lattice path excursion in (B) differs from the function (A) in that it records only the jumps of(A). It does not give any information regarding how long the function remains at each y-value. The rooted tree for the coverage function is in(C). The rooted tree is equivalent to the lattice path excursion (B). The red squares in (B) are the equivalence class representatives.

Evans et al. BMC Bioinformatics 2010, 11:430http://www.biomedcentral.com/1471-2105/11/430

Page 3 of 10

measure Γ if, for all Borel subsets A, the random vari-ables N(A) := #(A ⋂ ∏) satisfy:

1. N(A) has the Poisson distribution with parameterΓ(A), and2. If A1, ..., Ak are disjoint Borel subset of R2, then N(A1), ..., N(Ak) are independent random variables.

The following theorem is a theoretical statementabout our null model for random fragment placementand is a consequence of [[21], Proposition 12.3]. Thetheorem and the work that follows from it will allow usto access the shape of random fragment placement bygiving a description we can simulate.Theorem 1. The collection {(tj,lj)} of points obtained as

described above is a non-homogeneous Poisson processwith mean measure rm ⊗ μ. Here m is Lebesgue mea-sure (that is, length measure) on R.The expected value of the coverage function Xt0

atan arbitrary point t0 is the expected number of pointsthat the Poisson process puts into the wedge-shapedregion in Figure 2. By definition, this is the massassigned to the wedge by the mean measure rm ⊗μThat is, [ ]Xt0

= rm ⊗ μNote that

m dv du

t u du

t u

t

t

⊗ =

= − ∞

=

−∞

−∞∞

∫∫∫∫

( ) ( )

(( , ))

(

wedge0

0

0

0

0(( , ))

( ),

s ds

s ds

=∞

∫ 0

where the last line follows from an integration-by-parts. Thus, [ ]Xt0

] is the product of the intensityr and the mean length of a fragment.Remark: The average height E [ Xt0

] can be com-puted without the use of Theorem 1. We include thederivation above as a first illustration that properties ofthe coverage function can be understood in terms of thetwo-dimensional Poisson process.

Fragment lengths have a general distributionTo use the shape of fragment coverage in a data analy-sis, one needs to understand the distribution of theshape when fragments are laid down according to thenull model described above. In particular, one is inter-ested in the probability of seeing shapes associated withtrees that have a height exceeding some high level. Oneway of doing this would be to first simulate a very longstretch of the two-dimensional Poisson process, deter-mine the coverage function, construct the trees forpeaks that exceed a high level, compute our shape sta-tistic for each tree, and then record the empirical distri-bution of the resulting values. However, peaks thatexceed high levels occur very infrequently and so wewould need to simulate infeasibly long stretches of thePoisson process in order to determine the probabilitieswe are interested in with reasonable accuracy. Thus, inthis section we propose a Markov approximation thatlets us start at high levels (rather than wait for them toappear in simulations of the Poisson process). The cor-responding trees are distributed as Galton-Watson treeswith generation-dependent geometric offspring distribu-tions and these are easy to simulate. In the Results andDiscussion section, we compare this approximation tothat obtained by simulating the Poisson process forfixed length fragments.Suppose that we have a general distribution μfor the

fragment lengths. The discrete-time stochastic processthat records the values of X at its successive jumps istypically not a Markov chain (although, as we illustratein the Results and Discussion section, it is if the distri-bution μis exponential), but we will compute the condi-tional probability that X takes the values k ± 1 at itsnext jump given that it currently has the value k anduse the discrete-time Markov chain with transitionprobabilities given by these conditional probabilities asan approximation for the actual process of successivevalues of X. More precisely, we observe X at some fixed“time” -which might as well be 0 because of stationarity,and ask for the conditional probabilities given X0 thatthe next jump of X will be upwards to X0 + 1 or down-wards to X0-1. Let T denote the time until the next frag-ment comes along. This random variable has an expo-nential distribution with rate r and is independent ofX0 [[20], §2.1]. If we condition on X0 = k, the

Figure 2 A two dimensional view of a sequencing experiment.A typical wedge in the (t,l) plane is shown. Each interval gives apoint (ti,li) in this plane where ti gives the start position of aninterval and li gives the length. The number of points in the greenwedge gives the height Xt0

of the coverage function at t0.

Evans et al. BMC Bioinformatics 2010, 11:430http://www.biomedcentral.com/1471-2105/11/430

Page 4 of 10

two-dimensional Poisson point process must have kpoints in the region

A t l t t l: {( , ) : , },= −∞ < ≤ − < < ∞0

depicted in Figure 3. Conditionally, these k points in Ahave the same distribution as k points chosen at randomin A according to the probability measure

m Bm A

B A⊗⊗

⊂( )( )

.for

However, in order that the next jump after 0 is up-wards, the two-dimensional Poisson point process musthave no points in the orange region

{( , ) : , }t l t t l T t−∞ < ≤ − < < −0

in Figure 3 as these fragments end before time T. Thisleaves the k points lying in the blue region

{( , ) : , },t l t T t l−∞ < ≤ − ≤ < ∞0

which occurs with probability

T u du

u du

k∞∫ ∞∞∫ ∞

⎝⎜⎜

⎠⎟⎟

(( , ))

(( , ))0

.

Thus, conditional on X0 = k, the probability that thenext jump will be upwards is

t u du

u due dt

k

t∞∫ ∞∞∫ ∞

⎜⎜

⎟⎟

∞−∫

(( , ))

(( , )).

00

Write p(k) for this quantity. To build trees, we areinterested in the jumps of the coverage function, andhence we define a discrete-time Markov chain on thenonnegative integers with transition probabilities

P i j

j

p i i j i

p i i( , )

, ,

( ), ,

( ),=

= =≥ = +

− ≥

1 0 1

1 1

1 1

if i and

if and

if annd

otherwise

j i= −

⎨⎪⎪

⎩⎪⎪

1

0

,

, .

Suppose now we have a lattice path excursion startingat 0. Given a vertex v of the associated tree at height k,we are interested in the number of offspring (at height k+ 1) of this vertex. Suppose i0 is the minimal equiva-lence class representative for vertex v, and suppose i0the equivalence class of i0 is {i0, i1,..., in} with i0 <i1 < ...<in. Then, we have x kir

= for 0 ≤ r ≤ n, x kir + = +1 1for 0 ≤ r ≤ n - 1, x kin + = −1 1 , and xt >k for i0 <t <inwith t ≠ some ir. From the Markov property, for 0 ≤ j ≤n we have the equations

{ | } ( )

{ | } ( ).

x k x k p k

x k x k p k

i i

i i

j j

j j

+

+

= + = =

= − = = −1

1

1

1 1

and

The resulting tree is a Galton-Watson tree with gen-eration-dependent offspring distributions (see [22-25]for more on Galton-Watson trees). Indeed, the probabil-ity a vertex at height k has n offspring is given by

p k p kn( ) ( ( )),1 − (1)

which is the probability of n failures before the firstsuccess in a sequence of independent Bernoulli trialswhere the probability of success equals 1-p(k). The uti-lity of Equation 1 is that it allows one to (approximately)simulate trees for peaks that exceed a high level underthe null model, making it possible to compare treesbuilt from actual data to those formed by random frag-ment placement.We close this section by processing another feature of

the trees (under the null model) that we can computeusing our Markov approximation. Let r(i, j) be probabil-ity that our Markov chain started in height i reachesheight j before it hits height 0. We have the relations

r i j p i r i j p i r i j( , ) ( ) ( , ) ( - ( )) ( - , ) = + +1 1 1 (2)

with the boundary conditions r(i, j) = 1 and r(0, j) = 0:

Next, given a height H, let Ynr n Hr H

: ( , )( , )

=1

, for 1 ≤ n ≤ H.

Using equation (2), we have

YYn p n Yn

p nn+ = + − −1

1 1( ( ) )( )

for 2 ≤ n ≤ H - 1 with Y1 = 1, Yp211

=( ) . We may

solve inductively for YH and obtain r HYH

( , )11= . The

Figure 3 A wedge from the planar Poisson process . Theintervals that correspond to points in both the blue and orangeregions contribute to the height X0. Any point in the orange regionwould “die” before T while points in the blue region contribute tothe height XT.

Evans et al. BMC Bioinformatics 2010, 11:430http://www.biomedcentral.com/1471-2105/11/430

Page 5 of 10

quantity r(1,H) gives the probability that a tree corre-sponding to a single lattice path excursion away from0 and coming from the null model is at least as tall asheight H. Note that this type of tree comes from ablock where the coverage function rises from 0 andthen back again-often referred to as an island or con-tig. This probability can be used to do an initial “filter-ing” of peaks in a data analysis: one first concentrateson peaks that exceed some height that is calibratedusing a knowledge of r(1,H) and then computes theshape statistic and associated p-values for just thosepeaks. As an example, Figure 5 in the Results and Dis-cussion section shows r(1,H) plotted for the fixed frag-ment length.

Results and DiscussionFragment lengths have the exponential distributionWhen the distribution μof fragment lengths is exponen-tial with rate l, our Markov approximation is exact, as

shown below. In this case, we have μ((s,∞)) = ℙ{l >s} =e-l sand

( ) .X e dsts= =−

0

Claim 1. The process X is a stationary, time-homoge-neous Markov process.Proof. It is clear that X is stationary because of the

manner in which it is constructed from a Poisson pro-cess on R2 that has a distribution which is in-variantunder translations in the t direction; that is, the randomset {(ti,li)} has the same distribution as {(ti + t,li)} for anyfixed t Î R. Since μis exponential, it is memoryless,meaning for any interval length l with an exponentialdistribution

{ | } { }.l a b l a l b> + > = >

Figure 4 Comparison of the Poisson process and Markov approximation in terms of tree height. Histograms of the densities for treeheight are shown for trees built from a simulated Poisson process (solid yellow) and Galton-Watson trees from the Markov approximation (bluestriped) for the case of fixed fragment lengths. Each tree corresponds to one lattice path excursion away from 0 (also referred to as sequenceislands or contigs). The simulations include average height θ = 6 with 14,466 trees simulated for each type (A), θ = 9 with 3,551 trees simulatedfor each type (B), θ = 12 with 1,429 trees simulated for each type (C), and θ = 15 with 217 trees simulated for each type (D).

Evans et al. BMC Bioinformatics 2010, 11:430http://www.biomedcentral.com/1471-2105/11/430

Page 6 of 10

This means that probability that an interval covers t2knowing that it covers t1 is the same as the probabilitythat an interval starting at t1 covers t2. Thus, the prob-ability that X kt 2

= given Xt for at t ≤ t1 only dependson the value of Xt1

.Indeed, in terms of time,{ | }X k X kt t2 1

= = ′ depends only on t2 -t1.More specifically, X is a birth-and-death process with

birth rate b(k) = r in all states k and death rate δ(k) =kl in state k ≥ 1. The jumps of X are given by a dis-crete-time Markov chain with transition matrix

P i j

i j

ii j i

ii

i( , )

, ,

, ,

,=

= =

+≥ = +

+≥

1 0 1

1 1

1

if and

if and

if

aand

otherwise

j i= −

⎪⎪⎪

⎪⎪⎪

1

0

,

, ,

and we have the probability a vertex at height k has noffspring is

+

⎝⎜

⎠⎟ +k

kk

n

.

Note that as the exponential distribution is the onlydistribution with the memoryless property, we lose theMarkov property when μis not exponential.

Fragments have a fixed lengthSuppose μ is the point mass at L (that is, all fragmentlengths are L). Then

(( , )),

,,u

u L

u L∞ =

<≥

⎧⎨⎩

1

0

Figure 5 Comparison of trees built from the Poisson process with the probability r(1, H). The function r(1,H) = Π{Galton-Watson tree hasheight ≥ H} is plotted in red. Using trees from a simulated Poisson process, the function Π{tree from simulated Poisson process has height ≥ H}is plotted in blue. The plots include average height θ = 6 (A), θ = 9 (B), θ = 12 (C) and θ = 15 (D) for the case of fixed fragment lengths.

Evans et al. BMC Bioinformatics 2010, 11:430http://www.biomedcentral.com/1471-2105/11/430

Page 7 of 10

and

t

t

L

u dudu L t t L

t L

∫ ∫∞ == − <

⎧⎨⎪

⎩⎪(( , ))

,

, .0

This gives

p kL t k

Lke dt

e Ldw

e e dw

w

Lt

k L Lw

k w

( )( )

( )

= −

=

=

∫∫

− −

0

0

1

0

1

for k ≥ 1, where θ: = rL = [ ]X0 . We integrate byparts and find that p(k) = θe-θq(k) where

q kke w k

e dw

e kq k

ww

w

w

k w( )

( )

= −

= − −

=

=

−∫

0

1

1

0

1

1

for k ≥ 2, which yields the recursion

p kk

p k( ) ( ),= − −1 1

(3)

for k ≥ 2 with p e( )1 1 1= − +−

. solving explicitly,we obtain

p k kk j

j k j

k ek

j

k

( ) !( )

!

( )= − −

− + − − −⎛

⎜⎜

⎟⎟

=∑ 1 1 1

0

(4)

for k ≥ 1. Below we verify that Equation (4) satisfiesthe recursion in Equation (3):

1 1

11 1

11 2

10

1

− −

= − − − −

− − + − − −

⎜=

kp k

k k j

j k j

k ek

j

k

( )

! ( )

!

( )⎜⎜

⎟⎟

= + − −

− + − − −⎛

⎜⎜

⎟⎟

=

=

∑kk

k j

j k j

k ek

k

j

k

!!

( )

!

( )

!(

1 1 1 1

0

1

−− −

− + − − −⎛

⎜⎜

⎟⎟

==∑ 1 1 1

0

)

!

( )

( ).

k j

j k j

k ek

p k

j

k

Next, we compare the trees built from the Markovapproximation to the trees arising from the Poisson

process when fragments have a fixed length. We simu-late trees with average height θ = 6,9,12, and 15 usingboth the Poisson process and the Markov approxima-tion. The histograms in Figure 4 show the densities ofsimulated trees for the Markov approximation (bluestriped bars) and for the Poisson process (yellow solidbars) for θ = 6,9,12, and 15. Additionally, the plots inFigure 5 depict the probabilities r(1,H) (in red) andΠ{tree from simulated Poisson process has height ≥ H}(in blue). These figures illustrate that, for large θ, theMarkov approach seems like a reasonableapproximation.Our observation that randomly sequenced fragments

from a genome form a planar Poisson process in (posi-tion, length) coordinates has implications beyond thecoverage function analysis performed in this paper. Forexample we have found that the visualization of sequen-cing data in this novel form is useful for quickly identi-fying instances of sequencing bias by eye, as it is easy to“see” deviations from the Poisson process. An exampleis shown in Figure 6 where fragments from an Illuminasequencing experiment are compared with an idealizedsimulation (where the fragments are placed uniformly atrandom). Specifically, paired-end reads from an RNA-Seq experiment conducted on a GAII sequencer weremapped back to the genome and fragments inferredfrom the read end locations. Bias in the sequencing isimmediately visible, likely due to non-uniform PCRamplification [26] and other effects.The “shape” we have proposed for coverage functions

was motivated by persistence ideas from topological dataanalysis (TDA). In the context of TDA, our setting isvery simple (1-dimensional), however unlike what is typi-cally done in TDA, we have provided a detailed probabil-istic analysis that can be used to construct a nullhypothesis for coverage-based test statistics. For example,computing test statistics [27] based on the trees con-structed from coverage functions and comparing those tothe statistics expected from the Galton-Watson trees hasbeen used to determine protein binding sites in ChIP-Seqassay [11]. It should be interesting to perform similaranalyses with high-dimensional generalizations for whichwe believe many of our ideas can be translated. There arealso biological applications, for example in the analysis ofChip-Seq experiments [11], as previously mentioned.

ConclusionsWe believe that the study of sequence coverage func-tions that we have initiated may be of use in the analysisof many sequence census methods. The number of pro-posed protocols used in such methods has exploded inthe past two years, as a result of dramatic drops in theprice of sequencing. For example, in January 2010, the

Evans et al. BMC Bioinformatics 2010, 11:430http://www.biomedcentral.com/1471-2105/11/430

Page 8 of 10

company Illumina announced a new sequencer, theHiSeq 2000, that they claim “changes the trajectory ofsequencing” and can be used to sequence 25 Gb perday. Al-though technologies such as the HiSeq 2000were motivated by human genome sequencing a surpris-ing development has been the fact that the majority ofsequencing is in fact being used for sequence censusexperiments [10]. The vast amounts of sequence beingproduced in the context of complex sequencing proto-cols, means that a detailed probabilistic under-standingof random sequencing is likely to become increasinglyimportant in the coming years.

AcknowledgementsSNE is supported in part by NSF grant DMS-0907630 and VH is funded byNSF fellowship DMS-0902723. We thank Adam Roberts for his help inmaking Figure 6.

Author details1Department of Mathematics, University of California, Berkeley, California,USA. 2Department of Statistics, University of California, Berkeley, California,USA. 3Department of Molecular and Cell Biology, University of California,Berkeley, California, USA.

Authors’ contributionsLP proposed the problem of understanding the random behaviour ofcoverage functions in the context of sequence census methods. VHinvestigated the coverage function and lattice path excursions based onideas from topological data analysis. SE developed the probability theoryand identified the relevance of Theorem 1. SNE, VH and LP worked togetheron all aspects of the paper and wrote the manuscript. All authors read andapproved the final manuscript.

Received: 23 April 2010 Accepted: 18 August 2010Published: 18 August 2010

References1. Lander E, Waterman M: Genomic mapping by finger-printing random

clones: a mathematical analysis. Genomics 1988, 2:231-239.

2. Weber J, Myers E: Human whole-genome shotgun sequencing. GenomeResearch 1997, 7:401-409.

3. Wendl M, Barbazuk WB: Extension of Lander-Waterman theory forsequencing ltered DNA libraries. BMC Bioinformatics 2005, 6:245.

4. Wendl M: A general coverage theory for shotgun DNA sequencing.Journal of Computational Biology 2006, 13:1177-1196.

5. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ,Kravitz SA, Mobarry CM, Reinert KH, Remington KA, Anson EL, Bolanos RA,Chou HH, Jordan CM, Halpern AL, Lonardi S, Beasley EM, Brandon RC,Chen L, Dunn PJ, Lai Z, Liang Y, Nusskern DR, Zhan M, Zhang Q, Zheng X,Rubin GM, Adams MD, Venter JC: A Whole-Genome Assembly ofDrosophila. Science 2000, 287(5461):2196-2204.

6. Holst L: Random arcs on the circle. Journal of Mathematical Sciences 1984,25(3):1231-1233.

7. Sharon I, Pati A, Markowitz V, Pinter R: A Statistical Framework for theFunctional Analysis of Metagenomes. Research in Computational MolecularBiology 2009, 496-511.

8. Arratia R, Lander ES, Tavare S, Waterman MS: Genomic mapping byanchoring random clones: A mathematical analysis. Genomics 1991,11(4):806-827.

9. Schbath S: Coverage Processes in Physical Mapping by AnchoringRandom Clones. Journal of Computational Biology 1997, 4:61-82.

10. Wold B, Myers R: Sequence census methods for functional genomics.Nature Methods 2008, 5:19-21.

11. Hower V, Evans SN, Pachter L: Shape-based peak identification for ChIP-Seq. ArXiv e-prints 2010.

12. Carlsson G: Topology and data. Bull Amer Math Soc (N.S.) 2009,46(2):255-308.

13. Zomorodian A, Carlsson G: Computing persistent homology. DiscreteComput Geom 2005, 33(2):249-274.

14. Biasotti S, Giorgi D, Spagnuolo M, Falcidieno B: Reeb graphs for shapeanalysis and applications. Theoretical Computer Science 2008, 392(1-3):5-22.

15. de Berg M, van Kreveld M: Trekking in the Alps without freezing orgetting tired. Algorithmica 1997, 18(3):306-323.

16. Edelsbrunner H, Harer J, Zomorodian A: Hierarchical Morse-Smalecomplexes for piecewise linear 2-manifolds. Discrete Comput Geom 2003,30:87-107.

17. Carr H, Snoeyink J, Axen U: Computing contour trees in all dimensions.Comput Geom 2003, 24(2):75-94.

18. Evans SN: Probability and real trees, Volume 1920 of Lecture Notes inMathematics Berlin: Springer 2008.

19. Grimmett GR, Stirzaker DR: Probability and random processes New York:Oxford University Press, third 2001.

Figure 6 Examples of sequencing in the (t,l) plane. (A) Fragments from a sequencing experiment shown in the (t,l) plane. (B) The spatialPoisson process resulting from fragments with the same length distribution as (A) but with position sampled uniformly at random.

Evans et al. BMC Bioinformatics 2010, 11:430http://www.biomedcentral.com/1471-2105/11/430

Page 9 of 10

20. Daley DJ, Vere-Jones D: An introduction to the theory of point processesSpringer Series in Statistics, New York: Springer-Verlag 1988.

21. Kallenberg O: Foundations of modern probability Probability and itsApplications (New York), New York: Springer-Verlag, second 2002.

22. Fearn DH: Galton-Watson processes with generation dependence.Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics andProbability (Univ. California, Berkeley, Calif., 1970/1971), Vol. IV: Biology andhealth Berkeley, Calif.: Univ. California Press 1972, 159-172.

23. Good IJ: The joint distribution for the sizes of the generations in acascade process. Proc Cambridge Philos Soc 1955, 51:240-242.

24. Harris TE: The theory of branching processes Dover Phoenix Editions, Mineola,NY: Dover Publications Inc 2002.

25. Jagers P: Galton-Watson processes in varying environments. J ApplProbability 1974, 11:174-178.

26. Hansen K, Brenner S, Dudoit S: Biases in Illumina transcriptomesequencing caused by random hex-amer priming. Nucleic Acids Research2010.

27. Matsen F: A geometric approach to tree shape statistics. SystematicBiology 2006, 4:652-661.

doi:10.1186/1471-2105-11-430Cite this article as: Evans et al.: Coverage statistics for sequence censusmethods. BMC Bioinformatics 2010 11:430.

Submit your next manuscript to BioMed Centraland take full advantage of:

• Convenient online submission

• Thorough peer review

• No space constraints or color figure charges

• Immediate publication on acceptance

• Inclusion in PubMed, CAS, Scopus and Google Scholar

• Research which is freely available for redistribution

Submit your manuscript at www.biomedcentral.com/submit

Evans et al. BMC Bioinformatics 2010, 11:430http://www.biomedcentral.com/1471-2105/11/430

Page 10 of 10