The Undirected Incomplete Perfect Phylogeny Problem

12
1 The Undirected Incomplete Perfect Phylogeny Problem Ravi Vijaya Satya, Member, IEEE, and Amar Mukherjee, Fellow, IEEE Abstract— The incomplete perfect phylogeny (IPP) problem and the incomplete perfect phylogeny haplotyping (IPPH) prob- lem deal with constructing a phylogeny for a given set of haplo- types or genotypes with missing entries. The earlier approaches for both of these problems dealt with restricted versions of the problems, where the root is either available or can be trivially re-constructed from the data, or certain assumptions were made about the data. In this paper, we deal with the unrestricted versions of the problems, where the root of the phylogeny is neither available nor trivially recoverable from the data. Both IPP and IPPH problems have previously been proven to be NP- complete. Here, we present efficient enumerative algorithms that can handle practical instances of the problem. Empirical analysis on simulated data shows that the algorithms perform very well both in terms of speed and in terms accuracy of the recovered data. Index Terms— Phylogenetics, Perfect Phylogeny, Incomplete Perfect Phylogeny, Haplotype Inference I. I NTRODUCTION A frequently encountered problem in evolutionary genetics is the construction of the evolutionary tree for a set of taxa. This problem is referred to as the problem of phylo- genetic reconstruction. A phylogeny in which each mutation event is unique is called a perfect phylogeny. A given set of taxa may or may not admit a perfect phylogeny, but if they do admit a perfect phylogeny, then it is highly likely that the phylogeny is the actual evolutionary tree for the given taxa. A perfect phylogeny for the given set of taxa can be constructed in polynomial time in terms of the number of differentiating loci (henceforth referred to as ‘characters’) when the number of states (alleles) in each character is bound by a constant [1]. Most of the variation in the human genome is due to Single Nucleotide Polymorphisms (SNPs). SNPs are single-base loci in the genome where multiple alleles occur in a population. A position in the human genome is generally classified as a SNP only if there are at least two alleles in that position that occur with a frequency greater than a certain threshold. It is estimated that there are 10 million SNPs in the human genome [2]. Most of the SNPs in the human genome are bi-allelic. Since the human genome is diploid, there are two copies of each chromosome in each cell of an individual. One of these copies is derived from the mother of the individual, and the other is derived from the father of the individual. Each of these copies is called a haplotype. Collecting haplotype data empir- ically is prohibitively expensive. Therefore, genome variation R. Vijaya Satya and A. Mukherjee are with the School of Electrical Engineering and Computer Science, University of Central Florida, Orlando, Florida 32816, USA. E-mail: [email protected], [email protected] studies like the HapMap Project (www.hapmap.org) collect the genotype data instead. The genotype of an individual is the conflated information about the two haplotypes of the individual. A site is homozygous in a genotype if the two constituent haplotypes have the same allele in that site, and heterozygous otherwise. The exact allele information for all the homozygous sites in an individual is available from the genotype of an individual. However, in case of heterozygous sites, the only information available from the genotype is that the sites are heterozygous. Haplotype data is preferred and necessary for disease association studies. Therefore, the general approach is to obtain the genotypes empirically, and computationally infer the most likely pair of haplotypes that explain each genotype. This problem is called as the haplotype inference (HI) problem. A genotype with m heterozygous sites can be the result of the conflation of any of 2 m1 haplotype pairs. The fun- damental issue in the HI problem is to find the most likely explanation for the genotype out of the 2 m1 haplotype pairs. The haplotype-inference problem was first introduced by Clark [3], and a parsimony-based exponential-time algorithm was presented. Gusfield [4] presented a new approach to the HI problem based on perfect phylogeny. The perfect phylogeny approach to haplotype inference is to resolve each genotype in the given set of genotypes into a pair of haplotypes so that all the resulting haplotypes in the population form a perfect phylogeny. This version of the HI problem is called as the perfect phylogeny haplotyping (PPH) problem. The PPH problem is justified because of the block structure of the human genome. Multiple linear-time solutions have been presented for the PPH problem [5], [6], [7], [8], [9]. A. Missing Data Real biological data is generally incomplete. i.e., the state of some loci might not be known in each taxon. The incomplete perfect phylogeny (IPP) problem is to determine if the given set of incomplete taxa admit a perfect phylogeny. This problem is known to be NP-complete [10], even when each character is bi-allelic. However, if at least one taxon in the input set is complete, any one of the complete taxa can be considered as the root for the phylogeny, and the problem then becomes the incomplete directed phylogeny (IDP) problem. The IDP problem is solvable in polynomial time [11]. When the input consists of incomplete genotypes, the prob- lem is called the incomplete perfect phylogeny haplotyping (IPPH) problem. The IPPH problem is clearly NP-complete since the IPP problem can be viewed as a special case of

Transcript of The Undirected Incomplete Perfect Phylogeny Problem

1

The Undirected Incomplete Perfect PhylogenyProblem

Ravi Vijaya Satya,Member, IEEE,and Amar Mukherjee,Fellow, IEEE

Abstract— The incomplete perfect phylogeny (IPP) problemand the incomplete perfect phylogeny haplotyping (IPPH) prob-lem deal with constructing a phylogeny for a given set of haplo-types or genotypes with missing entries. The earlier approachesfor both of these problems dealt with restricted versions oftheproblems, where the root is either available or can be triviallyre-constructed from the data, or certain assumptions were madeabout the data. In this paper, we deal with the unrestrictedversions of the problems, where the root of the phylogeny isneither available nor trivially recoverable from the data. BothIPP and IPPH problems have previously been proven to be NP-complete. Here, we present efficient enumerative algorithms thatcan handle practical instances of the problem. Empirical analysison simulated data shows that the algorithms perform very wellboth in terms of speed and in terms accuracy of the recovereddata.

Index Terms— Phylogenetics, Perfect Phylogeny, IncompletePerfect Phylogeny, Haplotype Inference

I. I NTRODUCTION

A frequently encountered problem in evolutionary geneticsis the construction of the evolutionary tree for a set of

taxa. This problem is referred to as the problem of phylo-genetic reconstruction. A phylogeny in which each mutationevent is unique is called aperfectphylogeny. A given set oftaxa may or may not admit a perfect phylogeny, but if theydo admit a perfect phylogeny, then it is highly likely that thephylogeny is the actual evolutionary tree for the given taxa. Aperfect phylogeny for the given set of taxa can be constructedin polynomial time in terms of the number of differentiatingloci (henceforth referred to as ‘characters’) when the numberof states (alleles) in each character is bound by a constant [1].

Most of the variation in the human genome is due to SingleNucleotide Polymorphisms (SNPs). SNPs are single-base lociin the genome where multiple alleles occur in a population.A position in the human genome is generally classified as aSNP only if there are at least two alleles in that position thatoccur with a frequency greater than a certain threshold. It isestimated that there are 10 million SNPs in the human genome[2]. Most of the SNPs in the human genome are bi-allelic.Since the human genome isdiploid, there are two copies ofeach chromosome in each cell of an individual. One of thesecopies is derived from the mother of the individual, and theother is derived from the father of the individual. Each of thesecopies is called ahaplotype. Collecting haplotype data empir-ically is prohibitively expensive. Therefore, genome variation

R. Vijaya Satya and A. Mukherjee are with the School of ElectricalEngineering and Computer Science, University of Central Florida, Orlando,Florida 32816, USA.E-mail: [email protected], [email protected]

studies like the HapMap Project (www.hapmap.org) collectthe genotypedata instead. The genotype of an individual isthe conflated information about the two haplotypes of theindividual. A site is homozygousin a genotype if the twoconstituent haplotypes have the same allele in that site, andheterozygous otherwise. The exact allele information for allthe homozygous sites in an individual is available from thegenotype of an individual. However, in case of heterozygoussites, the only information available from the genotype isthat the sites are heterozygous. Haplotype data is preferredand necessary for disease association studies. Therefore,thegeneral approach is to obtain the genotypes empirically, andcomputationally infer the most likely pair of haplotypes thatexplain each genotype. This problem is called as thehaplotypeinference(HI) problem.

A genotype withm heterozygous sites can be the resultof the conflation of any of2m−1 haplotype pairs. The fun-damental issue in the HI problem is to find the most likelyexplanation for the genotype out of the2m−1 haplotype pairs.The haplotype-inference problem was first introduced by Clark[3], and a parsimony-based exponential-time algorithm waspresented. Gusfield [4] presented a new approach to the HIproblem based on perfect phylogeny. The perfect phylogenyapproach to haplotype inference is to resolve each genotypein the given set of genotypes into a pair of haplotypes sothat all the resulting haplotypes in the population form aperfect phylogeny. This version of the HI problem is calledas the perfect phylogeny haplotyping(PPH) problem. ThePPH problem is justified because of the block structure ofthe human genome. Multiple linear-time solutions have beenpresented for the PPH problem [5], [6], [7], [8], [9].

A. Missing Data

Real biological data is generallyincomplete. i.e., the state ofsome loci might not be known in each taxon. The incompleteperfect phylogeny (IPP) problem is to determine if the givenset of incomplete taxa admit a perfect phylogeny. This problemis known to be NP-complete [10], even when each characteris bi-allelic. However, if at least one taxon in the input setis complete, any one of the complete taxa can be consideredas the root for the phylogeny, and the problem then becomesthe incomplete directed phylogeny (IDP) problem. The IDPproblem is solvable in polynomial time [11].

When the input consists of incomplete genotypes, the prob-lem is called theincomplete perfect phylogeny haplotyping(IPPH) problem. The IPPH problem is clearly NP-completesince the IPP problem can be viewed as a special case of

2

the IPPH problem in which there are no heterozygous loci.Interestingly, even the rooted version of the IPPH problemwas shown to be NP-complete [12].

In this paper, we handle the IPP and IPPH problems intheir original form, without making any assumptions aboutthe input data. Using empirical analysis, we demonstrate thatthe IPP problem can almost always be solved in polynomialtime, even when as much as 50% of the input data is missing.We extend this approach to the IPPH problem, and present anefficient algorithm for the IPPH problem. As stated in [13], thenecessary and sufficient conditions under which an incompletematrix admits a unique perfect phylogeny are unknown. Wesolve this open problem, and formulate a set of necessaryand sufficient conditions under which any given IPP or IPPHinstance has a unique solution. Some of the results of thispaper have previously been published as a technical report[14].

II. PROBLEM STATEMENT AND PREVIOUS WORK

In this paper, we exclusively deal with bi-allelic data. There-fore, a complete haplotype is represented by a length-m vectorover the alphabet{0, 1}, where 0 and 1 are representativeof the two alleles in each position. An incomplete haplotypeis a length-m vector over the alphabet{0, 1, ?}, where ‘?’represents missing data. A complete genotype is representedby a length-m vector over the alphabet{0, 1, 2}, where ‘0’or ‘1’ indicate that the corresponding SNP is homozygousin the genotype with the ‘0’ or ‘1’ allele respectively, and‘2’ indicates that the corresponding SNP is heterozygous. Anincomplete genotype is a length-m vector over the alphabet{0, 1, 2, ?}.

The perfect phylogenyT for ann×m complete haplotypematrix M is a tree with exactlym edges that satisfies thefollowing properties:

1) Each vertexV of T is labeled by a haplotype vectorL(V ) of lengthm.

2) Each of them columns labels exactly one edge inT .3) If U and V are two adjacent vertices, and the edge

(U, V ) is labeled by thekth column, thenL(U)[k] =L(V )[k]. The vertex labelsL(U) andL(V ) are identicalin all other positions.

4) Each row inM labels some vertex inT .

The input to the IPP problem is ann×m matrixM over thealphabet{0, 1, ?}. Each of then rows in the matrix representsa haplotype. The incomplete perfect phylogeny (IPP) problemis to determine if there is an assignment of ‘0’ or ‘1’ to each ‘?’in M so that the resulting matrix admits a perfect phylogeny.

We define the following terms. An ordered pair(a, b), a ∈{0, 1}, b ∈ {0, 1}, is said to beinducedby a pair of orderedcolumns(i, j) if there is a rowr in M such thatM [r, i] = aand M [r, j] = b. The set of ordered pairs induced by a pairof columns(i, j) is denoted byI(i, j). According to the well-established four-gamete test, the matrixM will admit a perfectphylogeny only if|I(i, j)| ≤ 3 for every pair of columns(i, j).

Halperin et al. [13] made the assumption that|I(i, j)| = 3for every pair of columns(i, j) in M . They call this as-sumption therich data hypothesis. If an incomplete matrixM

satisfies the rich data hypothesis,M will either admit uniqueperfect phylogeny or will not admit any perfect phylogeny atall. When the matrixM satisfies the rich data hypothesis, theypresented aO(nm) algorithm to recover a complete haplotypeand construct the perfect phylogeny forM . In Section V, weshow that the IPP problem can in fact be solved as the perfectphylogeny problem on complete haplotype data when the richdata hypothesis is satisfied. The procedure presented in SectionV is applicable in many situations, even when the rich datahypothesis isnot satisfied.

When the root, i.e., any complete haplotype that must bein the perfect phylogeny is available, the IPP problem can besolved as the IDP problem. Peer et al [11] present an efficientsolution for the IDP problem when the root is an all-0 vector.If the root is not an all-0 vector, it can be converted into anall-zero vector by flipping (replacing each ‘0’ by ‘1’ and each‘1’ by a ‘0’) every column that is not ‘0’ in the root. Thealgorithm presented by Peer et al [11] takes an expected timeof O(nm), wheren is the number of taxa andm is the numberof characters.

In case of the IPPH problem, Kimmel and Shamir [12]present an algorithm with expected time ofO(nm2), whencertain assumptions about the input data are satisfied. Themost significant of these assumptions is that the columns in thematrix are much fewer than the number of rows. Specifically,they assume thatm = O(n0.5). Their algorithm, in fact,involves an exhaustive search through all-possible haplotypevectors that could be the root of the perfect phylogeny. Foreach root, they try all possiblephaserelationships betweenpairs of columns in order to search for a solution. Thoughtheir algorithm has exponential time worst-case complexity,they show that the algorithm takesO(nm2) time when theassumptions they make are satisfied.

Gramm et al. [15] introduced a special case of IPPHproblem, where the perfect phylogeny is known to be a path.They show that even this restricted version of the problem,known asPerfect Path Phylogeny Haplotyping, is NP-hard.

III. R EALIZABILITY CONDITIONS FOR THE IPP PROBLEM

In this section, we present the conditions under which agiven undirected IPP instance admits a perfect phylogeny. Ouralgorithm for the IPP problem is based on these conditions.In the following, we introduce some definitions.

For any pair of columns (i, j), the set of non-induced pairs, denoted byU(i, j), is given by U(i, j) ={(0, 0), (0, 1), (1, 0), (1, 1)} − I(i, j). The four-gamete testcan be re-stated in terms of the non-induced pairs as in thefollowing sentence - any matrixM (complete or incomplete)does not admit a perfect phylogeny if|U(i, j)| = 0 for anypair of columns(i, j). When|U(i, j)| = 1, the single orderedpair (fij , fji) ∈ U(i, j) is defined as theforbidden pair forthe pair of columns(i, j), denoted byF(i, j). Throughout thispaper, we follow the notation thatF(i, j) = (fij , fji).

A column i is said to benon-polymorphicif there are no‘1’s or no ‘0’s in the sub-matrixM [∗, i]. It can be triviallyshown that non-polymorphic columns are un-informative, andhence can be removed from the matrix without effecting the

3

i j

k

b

b a

c

c

a

k

c c b b a a

j i

(a)

(b)

T1

T2

Fig. 1. The two possible topologies for any three sitesi, j andk in a perfectphylogeny.

matrix in any way. Therefore, throughout the rest of this paper,we assume that every column is polymorphic in the givencomplete or incomplete matrixM .

A. Significance of the forbidden pairs

In a perfect phylogeny, there are certain relationships be-tween the forbidden pairs of any three columns. In any perfectphylogeny, the topology of the tree formed by a triplet ofcolumns(i, j, k) must be one of the two topologies shownin Figure 1. i.e, the three of them must form a Y-shapedtree as shown in Figure 1a, or a path, as in Figure 1b. Theedges can be labeled differently, but the overall topology mustbe either that in Figure 1a or that in Figure 1b. Let(a, a),(b, b), (c, c), be the pairs of alleles for the sitesi, j and krespectively. Consider the labeling in Figure 1a. There canbe no vertex in the perfect phylogenyT1 with the alleleain site i and b in site j. Hence,F(i, j) = (a, b), wherefij = a and fji = b. Similarly, fik = a, fki = c, andfjk = b, fkj = c. Therefore, for the topology inT1, fij = fik,fji = fjk, andfki = fkj , irrespective of how the edges areactually labeled. Similarly, for the topology inT2, fij = fik,fki = fkj , and fji = fjk, where j is the column in themiddle. Therefore, in any perfect phylogeny, there are somerestrictions and associations between the forbidden pairsoftriplets of columns. In the following sections, we present aformalization for these associations.

B. The 3-way compatibility expression

For any three distinct columnsi, j and k with F(i, j) =(fij , fji), F(j, k) = (fjk, fkj), and F(i, k) = (fik, fki),we define the3-way compatibility expression, denoted byR(i, j, k):

R(i, j, k) = (Ei + Ej)(Ej + Ek)(Ek + Ei)

= EiEj + EjEk + EkEi (1)

whereEi = 1 ⊕ fij ⊕ fik, Ej = 1 ⊕ fji ⊕ fjk andEk =1 ⊕ fki ⊕ fkj . Here, ‘+’ is the logical OR operator, and ‘⊕’is the logical XOR operator. We define the three columnsi,j andk to be3-way compatibleif R(i, j, k) = 1, and3-wayincompatibleif R(i, j, k) = 0.

×

×

×

kjki

jkji

ikij

ffk

ffj

ffi

kji

(a) (b)

kiikij

jkjiikij

fffr

ffffr

kji

?

?

............

2

1

=

==

M

M

Fig. 2. (a) The rowsr1 and r2 in M ; (b) A matrix representation of theforbidden pairs

Theorem 1:An incomplete matrixM with |I(i, j)| = 3 forevery pair of columnsi and j admits a perfect phylogenyiffR(i, j, k) = 1 for every triplet of columns(i, j, k).

Proof: We first prove thatM does not admit a perfectphylogeny ifR(i, j, k) = 0 for any triplet of columns(i, j, k).Since the expressionR is symmetric with respect toEi, Ej

and Ek, we only prove the case whenEi = Ej = 0. Fromthe definition ofEi andEj , if Ei = 0 andEj = 0, we get:

fij = fik andfji = fjk (2)

Since |I(i, j)| = 3 and F(i, j) = (fij , fji), the pair(fij , fji) is in I(i, j), and there will be a rowr1 in M withM [r1, i] = fij = fik and M [r1, j] = fji = fjk, as shownin Figure 2. Similarly, there will be a rowr2 in M withM [r2, i] = fik = fij andM [r2, k] = fki, as shown in Figure2. Without loss of generality, assume thatM [r1, k] =? andM [r2, j] =?. We will show that any assignment of values toM [r1, k] andM [r2, j] leads to a forbidden pair for some pairof columns.

SinceM [r1, i] = fik andM [r1, j] = fjk, in order to avoida forbidden pair in row inr1, we must assignM [r1, k] =fki = fkj . This implies thatfki = fkj . In row r2, sinceM [r2, i] = fij , M [r2, j] must be equal tofji in order toavoid F(i, j). However, sincefji = fjk, this means thatfki

cannot be equal tofkj in order to avoid havingF(j, k) inr2. Therefore, it is not possible to complete both the rowsr1

and r2 without introducing the forbidden pair in some pairof columns. Hence, the matrixM does not admit a perfectphylogeny whenR(i, j, k) = 0 for any triplet of columns.

Now, we prove thatM admits a perfect phylogeny whenR(i, j, k) = 1 for every triplet of columns(i, j, k). This meansthat we should be able to assign values to all missing entriesin M without inducing the forbidden pair for any pair ofcolumns. The proof is by contradiction. Assume that thereis some entryM [r, k] =? which cannot be assigned a valuewithout forcing a forbidden pair for some pair of columns.This can only happen if there are at least two columnsi andj such thatM [r, i] = fik, M [r, j] = fjk and fki = fkj . IfM [r, k] is set tofki, F(i, k) will be induced into rowr. SinceR(i, j, k) = 1, fki = fkj (i.e., Ek = 0) implies thatfij = fik

(Ei = 1) andfjk = fji (Ej = 1). This implies thatF(i, j) isalready induced by the rowr. However, this is not possible,since we know that|I(i, j)| = 3. Therefore, there can be nosuch entryM [r, k] in M , and every ‘?’ inM can be assigned

4

a 0 or 1 so that there is a perfect phylogeny for the resultingmatrix.

Theorem 1 can be better understood from the matrix rep-resentation of the forbidden pairs shown in Figure 2-(b). Thevariables along the diagonal are not defined. The termsEi,Ej and Ek in R(i, j, k) are relationships between the twovariables in the rowsi, j and k, respectively.Ei, Ej or Ek

will be zero if the two variables in the corresponding row arenot equal to each other. The columns(i, j, k) will be 3-waycompatible if variables in at least two rows are equal to eachother.

In some situations, Theorem 1 allows us to defineF(i, j)even if it is not directly induced by the matrixM . For example,consider the case whenF(i, k) andF(j, k) are known, andfki = fkj . Applying Theorem 1 will tell us thatfij must beequal tofik andfji must be equal tofjk if M is to admit aperfect phylogeny. Hence we can indirectly defineF(i, j) inthis case, using Theorem 1.

C. Conditions for any incomplete matrixM

In the following, we answer the following question - givenan incomplete matrixM in which |I(i, j)| < 3 for somepairs of columns(i, j), is there an assignmentF(i, j) =(fij , fji) for every pair of such columns(i, j) that leads toa perfect phylogeny? In other words, is it possible to havea matrix in which the forbidden row cannot be defined forsome pairs of columns and every possible assignment offorbidden pairs results in a matrix that does not admit a perfectphylogeny, but the original incomplete matrix does admit aperfect phylogeny? To answer this question, we need to firstexamine the circumstances under which|I(i, j)| can be lessthan 3 for two columnsi and j in a complete matrixMthat admits a perfect phylogeny. We will later (in Theorem2) extend these conditions to an incomplete matrix.

Property 1: For any pair of columnsi andj in a completematrix M that admits a perfect phylogeny,2 ≤ |I(i, j)| ≤ 3.

Property 2: For any pair of columnsi andj in a completematrix M that admits a perfect phylogeny,|I(i, j)| = 2 onlyif i andj label the same edge inT .

Property 1 directly follows from the fact that every columnin M must be polymorphic. Property 2 is evident from Figure3a. Let i and j label the two edges as shown, with the twoalleles of the sitei beinga and a and the two alleles of thesite j beingb andb. As shown, the state of the columns(i, j)is (a, b) at vertexA and (a, b) at vertexB. If there is anyinternal nodeC in the path fromA to B (other thanA andB), the state ofC will be (a, b). When |I(i, j)| = 2, therecan be no such third nodeC. Therefore,i andj must label asingle edge connecting the nodesA andB.

Property 2 leads to an additional result - SinceI(i, j) ={(a, b), (a, b)} when |I(i, j)| = 2, I(i, j) must be either{(0, 0), (1, 1)} or {(0, 1), (1, 0)}.

Theorem 2:An incomplete matrixM with |I(i, j)| < 3 forsome pairs of columns(i, j) admits a perfect phylogenyiffthere is a matrixM ′ obtained by adding additional rows toM so that (a)|I(i, j)| = 3 for every pair of columns(i, j) inM ′; and (b)M ′ admits a perfect phylogeny.

i aa

(a,b) A B

C

(a)

(b) c1

(c)

A B c2

C

C'

c1,c2

A B

(a,b)

(a,b)

(a,b)

(a,b) (a,b)

j bb

(a,b) (a,b)

(a,b)

(a,b)(a,b) (a,b)

Fig. 3. (a) Illustration of Property 2; (b) Columnsc1 andc2 in the originaltree;(c) Columnsc1 andc2 after splitting the edge (A,B)

Proof: Let M be a matrix that admits a perfect phylogenyT . Without loss of generality, we assume that every columnin M is polymorphic (Non-polymorphic columns are un-informative, and can be removed fromM without affectingthe perfect phylogeny forM in any way). From Property 1,we know thatI(i, j) ≥ 2 for every pair of columns(i, j)in a complete matrixM . Let c1 and c2 be two columns thatlabel the same edge inT , as depicted by the edge(A, B) inFigure 3b. From Property 2, we know that|I(c1, c2)| = 2 inM . Clearly, the node labels ofA andB are identical exceptin the sitesc1 and c2. If c1 = a and c2 = b at A, c1 and c2

will be a and b at B. We can always introduce a new nodeC so thatc1 labels the edge(A, C) and c2 labels the edge(B, C) by introducing an extant leafC′ as shown in Figure3c. At verticesC andC′, c1 = a andc2 = b, and every othercolumn takes the same value as at nodesA andB. If we addthe label of the leafC′ to M ′, |I(c1, c2)| will be equal to 3 inM ′. The same can be done for every pair of columns(i, j) forwhich |I(i, j)| = 2 in M . Hence, for every incomplete matrixM that admits a perfect phylogeny, there will be a matrixM ′

in which |I(i, j)| = 3.Because of Theorem 2, we can use the 3-way compatibility

expression to determine if a given matrixM admits a perfectphylogeny, even if|I(i, j)| < 3 for some pairs of columnsin M . If M admits a perfect a phylogeny,F(i, j) can bedefined for every pair of columns(i, j). Applying the 3-waycompatibility expression on any triplet of columnsi, j andk,we obtain the following set of equations:

(1 ⊕ fij ⊕ fik) + (1 ⊕ fji ⊕ fjk) = 1

(1 ⊕ fji ⊕ fjk) + (1 ⊕ fkj ⊕ fki) = 1 (3)

(1 ⊕ fkj ⊕ fki) + (1 ⊕ fij ⊕ fik) = 1

In total there will bem(m − 1)(m − 2)/2 such equations,since there arem(m − 1)(m − 2)/6 possible ways to choosei, j and k. The incomplete matrix will admit a perfect phy-logeny only if there is an assignment of 0 or 1 to each variablethat satisfies all these equations. In the special situationinwhich at least two out of the four variables in each expressioncan be assigned a value, the problem can be reduced to the2-SAT problem, and can be solved in polynomial time.

For any pair of columns(i, j), if |I(i, j)| = 3, then bothfij

and fji will be known. When|I(i, j)| = 2, eitherfij or fji

5

c

U V

i j icf

icf jcf jcf 0 1

Fig. 4. For any site labeling an edge(U, V ), the state of any other sitei atboth the verticesU andV will be fic

will be known, or one of them can be expressed as the otheror the complement of the other. For instance:

I(i, j) = {(0, 0), (0, 1)} ⇒ F(i, j) ∈ {(1, 0), (1, 1)}

⇒ fij = 1

Similarly,

I(i, j) = {(0, 0), (1, 1)} ⇒ F(i, j) ∈ {(1, 0), (0, 1)}

⇒ fij = fji

When|I(i, j)| = 1, fij andfji will be related by a disjunction.For instance:

I(i, j) = {(0, 0)} ⇒ F(i, j) ∈ {(1, 0), (0, 1), (1, 1)}

⇒ fij + fji = 1

For the matrix M to admit perfect phylogeny, the abovedisjunctions have to be satisfied in addition to equation 3.

D. Properties of the forbidden pairs

It is convenient to represent the forbidden pairs using anm × m matrix F , whereF [i, j] = fij ∀ (i, j), i 6= j. Thediagonal of the matrix, i.e.,F [i, i] ∀ i, is not defined. When|I(i, j)| = 3, bothfij andfji can be assigned a value of 0 or1. When|I(i, j)| < 3, we might be able to define one of thetwo variables(fij , fji), or introduce an equality or disjunctionrelationship between the two variables.

In any phylogenyT , we denote the node-label of a nodeV using the notationL(V ). L(V ) is a length-m haplotypevector. The following are some interesting properties ofF .

Property 3: Exactly two node labels inT can be obtainedfrom each column inF .Assume the matrixM admits a perfect phylogenyT . There-fore, every sitei in M labels a unique edge inT . Let a columnc label an edge(U, V ), whereU and V are nodes inT . Weshow how to construct the node labels ofU and V from F .Without loss of generality, letL(U)[c] = 0, andL(V )[c] = 1,as shown in Figure 4. For any columni in T , the state ofthe columni at both the nodesU andV will be fic. This isirrespective of which ‘side’ ofc the columni appears inT ,as is evident from the columnsi and j shown in Figure 4.Therefore, if we knowfic for every sitei, we will be able tobuild the node labels for both the verticesU and V . i.e., ifevery entry (except F[c,c], which is not defined) in the columnc in F is known, we can construct the node labels for the twonodes that define the edge labeled with columnc.

Let Hc be a vector formed by transposing the columnc inF . The node labels forU andV can be constructed by setting

00100

10100 01100

01101 01110 10000

c1 c2

c3 c4 c5

T (b) (a)

×

×

×

×

×

1111

1111

0000

0011

1101

5

4

3

2

1

54321

c

c

c

c

c

ccccc

Fig. 5. (a) A perfect phylogenyT ; (b) The forbidden matrixF for T

L(U)[i] = L(V )[i] = Hc[i] ∀i 6= c, and settingL(U)[c] = 0andL(V )[c] = 1. Hence, if we can assign a value to everyentry in a column inF , we can convert the IPP problem intothe IDP problem. Note that the rich-data hypothesis need notbe satisfied on the matrixM for us to be able to fill a columnc in F completely. The algorithm for the IPP problem wepresent in Section V makes use of this property ofF .

Property 4: Each node label inT can be derived from acolumn inF .Another interesting property ofF is that each node label inTcan be obtained directly from some column inF . As describedabove, each columnc in F describes the two node labelsL(U),L(V ) where (U, V ) is the edge labeled by the sitec in T .Therefore, we can derive the node label of any nodeX in Tfrom any columni that labels an edge incident onX . Therecan be at mostm+1 nodes inT , and we can obtain2m node-labels fromF . The number of times a node-label is repeatedin these2m node-labels gives the degree of the correspondingnode inT . For example, refer to Figure 5. A phylogenyT isshown in Figure 5a, and the corresponding forbidden matrixFis shown in Figure 5b. The node label{01100} can be derivedfrom any of the three columnsc2, c4, or c5 in F . As the nodewith the label{01110} is a leaf in the tree, it can be derivedfrom only one column (columnc4) in F . Also, notice that anyrow that is all-0 or all-1 inF is a site that is incident on aleaf in T . All three leaves of the tree satisfy this property.

IV. REALIZABILITY CONDITIONS FOR THEIPPHPROBLEM

The input to the IPPH problem is a matrixM ={0, 1, 2, ?}n×m. Each row in the matrixM represents agenotype. Each genotype contains the conflated informationabout two haplotypes. LetH1 andH2 be two haplotype vectorsthat are conflated to produce a genotypeG. If H1 andH2 havethe same allele in a sitei, i.e, if H1[i] = H2[i] = a, a ∈ {0, 1},thenG[i] = a. On the other hand, if the sitei is heterozygousin G, G[i] = 2.

Formally stated, the IPPH problem is to determine if thereis a 2n × m complete haplotype matrixA so that:

1) The matrixA admits a perfect phylogeny2) For every rowr in M , there are two rows(r, r′) in

A such thatA[r, i] = A[r′, i] for every positionsi inwhich M [r, i] = 2, andA[r, i] = A[r′, i] = M [r, i] inevery positioni in which M [r, i] ∈ {0, 1}.

The IPP problem can be viewed as a special case ofthe IPPH problem in which there are no ‘2’s in the matrixM . Therefore, the discussion and results in Section III areapplicable for the IPPH problem too. The only difference is

6

that the definition of the induced rows is slightly different, andan additional set of constraints apply on triplets of columnsthat are all ‘2’ in the same row. For a genotype matrixM , arow ab, a ∈ {0, 1}, b ∈ {0, 1} is in I(i, j) for a pair of columns(i, j) if there is a rowr in M such thatM [r, i] = a andM [r, j] = b, or M [r, i] = a and M [r, j] = 2, or M [r, i] = 2andM [r, j] = b. The definitions ofU(i, j) or F(i, j) do notchange, as they are defined in terms ofI(i, j).

A triplet of columns(i, j, k) is said to be acompaniontripletif there is a rowr in M such thatM [r, i] = M [r, j] =M [r, k] = 2. Since all the three columnsi, j and k areheterozygous,i, j and k must mutate in the path betweenthe two haplotypes for the rowr in any perfect phylogenyTfor M . Hence, any companion triplet of columns must forma path topology, as shown in Figure 1b. There are three waysin which the columnsi, j and k can label three edges in anun-directed path, each corresponding to the columnsi, j ork labeling the ‘inner’ edge in the path. This restriction on acompanion triplet of columns can be expressed in terms of theforbidden pairs as:

EiEjEk + EiEjEk + EiEjEk = 1 (4)

whereEi, Ej andEk are as described in Section III-B. Itcan easily be seen thatEiEjEk = 1 iff the columnsi, j andk form a path withi in the middle. Similarly the other twoterms in Equation 4 correspond toj being in the middle andk being in the middle. Equation 4 can be simplified to thefollowing form:

fij⊕fji ⊕ fjk ⊕ fkj ⊕ fik ⊕ fki

⊕(fij ⊕ fik)(fji ⊕ fjk)(fki ⊕ fkj) = 1 (5)

If a pair of columns(i, j) are both ‘2’ in a genotype, thepair of columns can be expanded as either{(0, 0), (1, 1)} or{(0, 1), (1, 0)} in the two haplotypes for the genotype. It hasbeen previously established [16], [6], [9] that every genotypein M in which the columnsi andj are ‘2’ must be expandedthe same way ifM is to admit a perfect phylogeny. Thisrelationship between a pair of columns is defined as thephasebetween the two columns [6], [9]. The phase between the pairof columns(i, j) is represented asP (i, j). In terms of theforbidden pairs, the phase between a pair of columns(i, j) canbe expressed asP (i, j) = 1⊕ fij ⊕ fji. The genotype matrixM will admit a perfect phylogenyiff Equation 3 is satisfiedon every non-companion triplet of columns and Equation 5 issatisfied on every companion triplet of columns.

Assume that the matrixM admits a perfect phylogenyT . If three columnsi, j and k are all ‘2’ in some rowrof the matrixM , the pairwise phase relationships will havesome interdependencies, very similar to those introduced in[6]. Let H1 and H2 be the two haplotypes that combine toproduce the rowr in M . Since i, j and k are all ‘2’ inrow r, H1 and H2 differ in all three columnsi, j and k.In T , the path between the vertices labeled withH1 andH2 must contain all the three edges labeled withi, j andk. Theorem 3 establishes the interdependencies between thepairwise relationships. Theorem 3 for the rooted PPH problem

H1 i j k H2

H1 i k j H2

H1 j i k H2

(a)

(b)

(c)

(a,b,c) (a,b,c) (a,b,c) (a,b,c)

Fig. 6. (a)j in the middle (b)k in the middle (c)i in the middle

was first introduced by Bafna et. al. [16] using differentterminology. Here, we present a generalized version of thetheorem that is applicable to the un-rooted version of theproblem.

Theorem 3:In any genotype matrixM that admits a perfectphylogeny, if three columnsi, j and k are all ‘2’ in somerow r, then the pairwise phase relationships are related by theexpressionP (i, j) ⊕ P (j, k) = P (i, k).

Proof: Let us consider the arrangement in Figure 6-(a).Let the two alleles for columnsi, j andk be{a, a}, {b, b} and{c, c}, respectively. Leta, b, c be the alleles on the left of edgeslabeledi, j andk in Figure 6-(a). Therefore, the vertex labelswill be (a, b, c) (labeling H1), (a, b, c) (any vertex betweenthe edgesi and j), (a, b, c) (any vertex between the edgejand k), and (a, b, c) (labelingH2). Clearly,F(i, j) = (a, b),F(i, k) = (a, c), andF(j, k) = (b, c). Hence,

P (i, j) ⊕ P (j, k) = 1 ⊕ fij ⊕ fji ⊕ 1 ⊕ fjk ⊕ fkj

⇒ P (i, j) ⊕ P (j, k) = a ⊕ b ⊕ b ⊕ c

⇒ P (i, j) ⊕ P (j, k) = 1 ⊕ a ⊕ c = P (i, k)

V. A LGORITHMS

The algorithms we present in this section for the IPP andIPPH problems are enumerative algorithms - they essentiallydo an exhaustive search of all possible arrangements that mightresult in a perfect phylogeny. However, the search space canbereduced by applying the conditions explained in the previoussections. This reduction of search space happens to be soeffective that most practical instances of the problems canbesolved in less than a second on a commodity CPU.

The brute-force enumerative solution for IPP and IPPHproblems will be try all possible assignments for all theunknown entries inF until we obtain an assignment thatsatisfies the equations 3 and 5 on each triplet. We can thenbuild the perfect phylogeny fromF . However, this approachmight be impractical, since there can be quite a few un-assigned entries inF , and the computational complexity wouldbe exponential in terms of the number of un-assigned entriesin F . Our approach, instead, is to apply equations 3 and 5on triplets of columns in order to fillF to the fullest extentpossible. If all the entries in a complete column ofF are filledusing this procedure, we can convert the un-rooted version ofthe problem into a rooted version of the same problem. For

7

both the problems, solving the rooted version of the problemis much simpler, and the rooted version of the IPP problemcan be solved inO(mn) time.

A. An algorithm for the IPP problem

In the following, we present a practical algorithm for the IPPproblem. For simplicity of illustration, we assume that eachcolumn in the input matrixM is polymorphic. As describedearlier, non-polymorphic columns inM are un-informative,and will not label any edges in the perfect phylogeny forM . We also assume that there are no complete rows inM ,as the problem can be solved as the IDP problem if thereis a complete row inM . The algorithm first constructs theforbidden matrixF from M . If there is a complete column inF , the root of the phylogeny can be derived from this completecolumn as described in Section III-D, and the problem canagain be solved as the IDP problem. In the following, weassume that no such complete column is directly availablefrom the data.Obtaining all possible information from FIf there is no complete column inF , we apply conditionR(i, j, k) = 1 on triplets of columns, in order to obtainassignments for unknown entries. We continue to do this untileither all the entries in a column or assigned, or until no furtherinformation can be obtained fromF . In general, if two of thefour variables in any of the three expressions in Equation 3 areknown, it might be possible to infer some information aboutthe others. This step of obtaining all the possible informationfrom F can be implemented to run inO(m3 + nm2) time.Solving the IPP problemAfter obtaining all possible information fromF , if all theentries in any column are assigned values, the root of thephylogeny can be obtained as described in Section III-D,and the IPP problem can be solved as the IDP problem.If a complete column is not available, we select the mostcomplete column inF and obtain all possible assignments forthe un-assigned entries in this column. Each assignment canbe solved as an IDP problem. The computational complexityof this step will be exponential in terms of the number of un-assigned entries in the most complete column ofF . A highlevel description of the algorithm is given in Figure 7.Uniqueness of the solutionA given incomplete matrixM will have a unique perfectphylogeny if there is a unique way of fillingF so that Equation3 is satisfied on every triplet of columns. Each complete matrixF that satisfies Equation 3 has a unique perfect phylogenyT .This is because a complete matrixF refers to a hypotheticalmatrix M ′ in which I|(i, j)| = 3 for all pairs of columns. Theincomplete matrixM consists of a subset of rows fromM ′.

B. Algorithm for the IPPH problem

Our IPPH algorithm is fundamentally similar to the IPPalgorithm described in Section V-A. The approach is to obtainall the information that can be obtained by applying Equations3 and 5 on triplets of columns until no further information canbe obtained. In practice, this leads to a situation in which atleast one column inF is complete or nearly complete. Once

1) Construct the matrixF from M by building I(i, j) and inferringF(i, j) from I(i, j). When 1 ≤ |I(i, j)| < 3, relate fij and fji

by a disjunction or an equality so that all the restrictions imposed byI(i, j) on F(i, j) are accounted for.

2) Apply R(i, j, k) = 1 on triplets of columns fromM until a columnin F is complete or until no new assignments/equalities/disjunctionscan be derived.

3) If a column c in F is complete, derive the root fromc, and solvethe problem as an IDP problem. Otherwise select a columnc with thefewest un-assigned entries. Letp be the number of un-assigned entriesin c.

a) For each of the2p possible ways in which the columnc in Fcan be completed:i) Derive the rootR from columnc, and solve the problem as

an IDP problem. If the problem can be solved as an IDPproblem rooted atR, report the solution and halt.

4) report that the matrixM does not admit a perfect phylogeny.

Fig. 7. The algorithm for the IPP problem

a complete column inF is known, the IPPH problem can beconverted to the directed IPPH problem. However, as eventhe directed IPPH problem is NP-complete, obtaining the rootdoes not lead to a solution for the IPPH problem. We presentan enumerative algorithm for the directed IPPH problem. Thisenumerative algorithm is based on the IDP algorithm presentedin [11], and makes maximum use of the information availablefrom the matrix F . In fact, if the forbidden matrixF iscomplete, this algorithm will, in principle, be the same as theIDP algorithm presented in [11].Enumerative algorithm for the directed (rooted) IPPHproblem We first present a brief overview of the enumerativealgorithm for the directed IPPH problem. The algorithm triesto construct the tree starting at the root. In order to build thesubtree under any node in the tree, the algorithm tries to finda column that must label an edge incident on the node. Weterm such a column as abranching columnfrom the node.There could be multiple branching columns at any node. Acolumn that can be a branching column needs to satisfy certainconditions, and the search for the branching column can benarrowed down to a few columns based on these conditions.Once the algorithm finds a branching column, the problemcan be divided into two independent sub-problems - all thehaplotypes that must be in a sub-tree under branching columnform one sub-problem, and all the remaining haplotypes formthe other sub-problem. For each candidate branching column,there can be multiple ways of distributing the haplotypes intothe two sub-problems. Selection of a branching column anddistribution of the haplotypes into sub-problems force someassignments to the unknown entries in the input matrix. Eachpossible branching column and each possible distribution haveto be tried in order to see if they can lead to a perfectphylogeny. The procedure is repeated for each sub-problem.The algorithm halts when it arrives at a perfect phylogenyor when all the possible choices for branching columns andpartitions have been tried out. A more detailed explanationofthe algorithm follows.

We assume that the root is an all-zero vector. When thegiven root is not an all-zero vector, a simple transformationof F and M can be performed to ensure that the root is an

8

all-zero vector - for every column that is ‘1’ in the root, wecan complement the corresponding row inF and column inM in order to ensure that the root is an all-zero vector. A2(m + n) × m matrix B is constructed as follows:

1) The first 2m rows in B are incomplete or completehaplotypes derived fromF . Two haplotypes are derivedfrom each column inF , as described in Property 4.

2) The remaining2n rows inB are genotypes or haplotypesderived fromM :

a) If a row r in M has exactly one columni that is‘2’, B[2m + 2r, i] = 0 andB[2m + 2r + 1, i] = 1.In every columnj 6= i, B[2m + 2r, j] = B[2m +2r + 1, j] = M [r, j].

b) If a row r in M has no ‘2’s or has multiple ‘2’s,B[2m + 2r] = B[2m + 2r + 1] = M [r].

The matrixB has the following properties:

1) Every IPPH solution forB is an IPPH solution forM .2) Every IPPH solution forM is an IPPH solution forB.

Decomposing matrixB into smaller matricesThe very first step of the algorithm is to find if the matrixB

can be split into independent sub-matrices. Each independentsub-matrix can then be solved using the enumerative algo-rithm.

Let C be the set of columns inB, and letR be the setof rows in B. A graph G2 = (V, E) is constructed, whereV = C

⋃R, and the edge(c1, c2) ∈ E if there is a rowr in

which B[r, c1] ∈ {1, 2} andB[r, c2] ∈ {1, 2}. Also, the edge(r, c) ∈ E2 if B[r, c] ∈ {1, 2}. We determine the connectedcomponents ofG2.

We call the graphG2 as the2-dependencygraph. If twocolumns are both in the same connected component ofG2,there might be some relationships between two columns thatneed to be satisfied in order to build a perfect phylogeny forB.On the other hand, if two columns are in different connectedcomponents ofG2, they are completely independent of eachother.

The columns and rows in each connected component ofBform a completely independent sub-problem. In any rowr ofB, all the columns that do not belong to the same connectedcomponent asr can be set to ‘0’. Because of the way thematrix B is built, for each columnc in B, there will be atleast one rowr in B with B[r, c] = 1. Hence, each columnc will be in a non-trivial connected component ofG2, whereas there can be rows inB that form isolated vertices inG2.The overall effect is that the matrixB can be divided intoa set of sub-matrices as shown in Figure 8a. All the entriesoutside each sub-matrix can be set to ‘0’. The columns ineach non-trivial connected component inG2 form a subtreeof the phylogeny for the matrixB, as shown in Figure 8b.This decomposition is very similar to the decomposition on ahaplotype matrix presented in [11].Solving each sub-matrix

For each sub-matrixBi, i ∈ {1, ..., k}, we need to checkif the matrixBi admits a perfect phylogeny. The fundamentalapproach at this stage is to select a branching columncb andto divide Bi into two sub-matricesH0 andH1, as shown inFigure 9a. In the matrixBi, there might be multiple choices

B1

B2

B3

Bk

Each of these rows is an

isolated component in G2

The entries in the

un-shaded region

can be set to 0

B1 B2 Bk

(a)

(b)

Fig. 8. Decomposition ofB - (a) B1 to Bk are sub matrices derived fromconnected components inG2. (b) The phylogeny forB consistsk subtreesrooted at the same vertex.

1

1

1

1

1

1 0

0

0

0

0

0

cb

H1

H0

All entries

here must be

set to 0

All entries

here must

be set to 0

cb

(b)

(a)

Bi

H0 H1

Fig. 9. (a) DividingBi into two sub-matricesH0 andH1 (b) The phylogenyfor Bi.

for the branching columncb. For each such choice, there mightbe multiple ways of distributing the columns betweenH0 andH1. In order do this efficiently, we do the following.

For the sub-matrixBi, let Ci be the set of columns inBi,and letRi be the set of rows inBi. We construct a graphGi

1= (V, E), where V = Ci

⋃Ri. For any two columns

c1 ∈ Ci and c2 ∈ Ci, the edge(c1, c2) ∈ E if there is arow r ∈ Ri with Bi[r, c1] = 1 and Bi[r, c2] ∈ {1, 2}, orif Bi[r, c2] = 1 and Bi[r, c1] ∈ {1, 2}. For any rowr andcolumn c, the edge(r, c) ∈ E if Bi[r, c] = 1. Note that therows that do not have any 1’s form isolated vertices inGi

1.The significance of the graphGi

1is that if we take any

non-trivial connected component ofGi1, there must be at least

one column that should be 1 in every rowr in the connectedcomponent in order for the matrixBi to admit a perfectphylogeny. Such a column is called asemi-universalcolumn.A column c is semi-universal if it satisfies the followingconditions:

1) Bi[r, c] ∈ {1, ?} in every row r that is in the sameconnected component ofGi

1asc.

2) For any columnc1 that is in the same connected compo-nent asc, there must be no rowr ∈ Ri with Bi[r, c] = 0andBi[r, c1] = 2. (Refer to Figure 11 to see an exampleof this situation).

All columns that can be semi-universal are candidates for

9

1

1

?

1

?

?

?

?

?

0

?

2

2

0

D1

D2

D3

cb Bi

Fig. 10. (a) Distributing the connected components inBi into H0 andH1,for a branching columncb: D1 should be inH1, D3 in H0. D2 can be inH0 or H1

the branching column. At least one of these columns must be abranching column at this level. For each branching columncb,we need to test all possible ways of distributing the columnsin Bi between the two matricesH0 and H1. The followingproperties are taken into consideration while distributing thecolumns betweenH0 andH1:

1) The columns in any non-trivial connected component ofGi

1must all be in the same matrixH0 or H1.

2) All the columns that are in the same connected compo-nent ascb must be inH1.

3) If Bi[r, cb] = 0 in any row r that is in a non-trivialconnected component ofGi

1, then the all the columns in

the same connected component asr must be inH0.

Figure 10 shows some scenarios for distributing columnsbetweenH0 andH1. D1, D2 andD3 are sub-matrices derivedfrom non-trivial connected components inGi

1. All the columnsin D1 (excludingcb) must be inH1, because they belong tothe same connected component ascb. Sincecb is 0 in one ofthe rows inD3, all the columns inD3 must go intoH0. Sincecb is ‘?’ in every row inD2, D2 can either be part ofH0 orH1. Both these possibilities have to be tested.

Once the columns are distributed betweenH0 and H1,the isolated rows inBi that do not belong to any connectedcomponent ofGi

1have to be distributed betweenH0 andH1.

This is done based on the following criteria:

1) Any row r with Bi[r, c] =? in every columnc in H1 isinserted intoH0.

2) Any row r with Bi[r, cb] =? andBi[r, c0] =? in everycolumnc0 in H0 is put intoH1.

3) For any rowr with Bi[r, cb] = 2 or Bi[r, c1] = 2 forany columnc1 in H1 andBi[r, c0] = 2 for any columnc0 in H0,

a) A new row r1 is inserted intoH1. For everycolumn c1 in H1, Bi[r

1, c1] = 1 if Bi[r, c1] = 2and Bi[r

1, c1] = Bi[r, c1] otherwise.Bi[r1, c0] =

0 for every columnc0 in H0.b) A new row r0 is inserted intoH0. For every

column c0 in H0, Bi[r0, c0] = 1 if Bi[r, c0] = 2

and Bi[r0, c0] = Bi[r, c0] otherwise.Bi[r

0, c1] =

22???????

?2222????

??2?2??20

221??????

?1?????0?

?11??????

???10????

????1????

??011????

?????221?

????????1

0?????011

12

11

10

9

8

7

6

5

4

3

2

1

987654321

r

r

r

r

r

r

r

r

r

r

r

r

ccccccccc

D1

D1

D2 D3

D3

D2

(a)

H1 H0

H1

H0

(b)

22?000000

?11000000

??1000000

221000000

?1?000000

?11000000

00011???1

000?1??01

00010???1

000?1???1

00011???1

000??22?1

000????11

000???011

12

011

010

9

8

7

111

110

6

5

4

3

2

1

987654312

r

r

r

r

r

r

r

r

r

r

r

r

r

r

ccccccccc

Fig. 11. (a) A matrixBi (b) Splitting Bi into H0 and H1 after selectingc2 as the branching column and includingD2 in H1

0 for every columnc1 in H1.

If the matrixH0 or H1 created in this fashion does not haveany 2’s, it can be solved as an instance of the IDP problem.Otherwise, each of these sub-matrices is again subject tofurther decomposition. The process is repeated recursivelyuntil a solution is found or until all possible choices forsplitting Bi into H0 and H1 fail. If all possible choices forH0 and H1 fail, a different branching column is selected. Ifall candidate branching columns fail to produce a solution,thematrix Bi does not admit a perfect phylogeny.

Figure 11 demonstrates the process of splittingBi intoH0 and H1 through an example. Consider the matrixBi inFigure 11a. There are three non-trivial connected componentsin the graphGi

1: D1 = {c1, c2, c3, c4, r1, r2, r3}, D2 ={c5, c6, r4, r5, r6}, andD3 = {c7, c8, c9, r7, r8, r9}. The rows

10

Procedure SolveIPPH(), Input: A genotype matrixM1) Construct the incomplete forbidden matrixF from the input genotype

matrix M .2) Apply Equation 3 on each triplet of columns and apply Equation 5 on

companion triplets of columns to infer unknown values inF . Repeatthe process until no new information can be obtained.

3) Take the most complete columnF and derive a (possibly) incompletehaplotype vectorR from it, as described in Property 3. For eachpossible completionRc of R, do the following:

a) Root the matricesF andM in Rc and construct the matrixBfrom F andM .

b) If SolveDirectedIPPH(B) succeeds in building a perfect phy-logeny, report the phylogeny and halt.

4) Report that the incomplete matrixM does not admit a perfect phy-logeny.

Procedure SolveDirectedIPPH(), Input: A genotype matrixB1) Construct the 2-dependency graphG2 for the input genotype matrix

B.2) Based on the non-trivial connected components inG2, decompose the

matrix B into a set of sub-matrices{B1, ....,Bk}.3) For each submatrixBi, i ∈ {1, ..., k}, if ConstructPhylogeny(Bi )

fails to construct a perfect phylogeny, report that the matrix B doesnot admit a perfect phylogeny. If every sub-matrix admits a perfectphylogeny, assemble all the phylogenies into one phylogenyand reportthat the input genotype matrixB admits a perfect phylogeny.

Procedure ConstructPhylogeny(), Input: A genotype matrixM1) Construct the graphG1 for the input genotype matrixM , and construct

the connected components inG1.2) Construct the set of possible semi-universal columnsS in the input

matrix M . If none of the columns can be semi-universal, report thatthe matrixM does not admit a perfect phylogeny and return.

3) For each columncb ∈ S, do the following:a) Usecb as the branching column. For each possible division ofM

into two matricesH0 andH1, if both SolveDirectedIPPH(H0 )and SolveDirectedIPPH(H1 ) succeed in building perfect phylo-genies, report the phylogenies and return.

4) Report that the input genotype matrixM does not admit a perfectphylogeny.

Fig. 12. The algorithm for the IPPH problem

r10, r11 and r12 do not belong to any non-trivial connectedcomponent ofGi

1, as they do not have any 1’s. The columns

c2, c5, c6 andc7 can be semi-universal. Notice that the columnc1 can not be semi-universal because of the rowr10 asc1 andc2 are in the same connected component, andBi[r10, c1] = 0andBi[r10, c2] = 2.

Let us assume we pickc2 as the branching column. Now weneed to splitBi into H0 andH1 based onc2. As c2 is in D1,all the columns inD1 must be inH1. SinceBi[r8, c2] = 0 andr8 is in D3, D3 must be inH0. D2, however, can be eitherin H0 or in H1, and both the possibilities must be tested.Consider the case whenD2 is included inH1. Since the rowsr10 andr11 have 2’s in bothH0 andH1, each one of them isphased into two rows, with one of them going intoH1, andthe other going intoH0, as indicated by the rowsr1

10, r0

10, r1

11

and r0

11in Figure 11b. Since the rowr12 has 2’s only in the

columns inH0, it is copied intoH0 without phasing. All theentries out sideH0 andH1 must be set to 0.

A high level description of the complete IPPH algorithm isgiven in Figure 12.

TABLE I

PERCENTAGE OF INPUT DATA SETS IN WHICH A COMPLETE COLUMN IS

DIRECTLY AVAILABLE FROM F (BEFORE APPLYINGR(i, j, k) = 1), AND

AFTER APPLYINGR(i, j, k) = 1

Before applyingR(i, j, k) = 1n × m p = 0.1 p = 0.2 p = 0.3 p = 0.4 p = 0.550 × 50 99.9 99.2 93.9 79.0 49.550 × 100 100 97.4 84.8 55.7 19.0100 × 50 100 99.9 99.6 97.4 86.9100 × 100 100 99.9 99.2 93.9 72.2

After applying R(i, j, k) = 1n × m p = 0.1 p = 0.2 p = 0.3 p = 0.4 p = 0.550 × 50 99.9 99.8 99.0 95.3 81.850 × 100 100 100 98.4 90.3 66.8100 × 50 100 100 100 99.8 97.7100 × 100 100 100 99.9 100 97.2

VI. RESULTS

We have implemented the algorithms inC++. The implementation is available fromhttp://www.cs.ucf.edu/∼rvijaya/ipph/. The algorithms weretested on simulated data. We first generate complete haplotypematrices that admit perfect phylogenies using the programMS [17]. For the IPPH case, we combine consecutiverows in the haplotype matrix to form genotypes. We thencreate incomplete haplotype/genotype matrices from thesecomplete matrices by converting each entry in the matrixto a ‘?’ with a fixed masking probabilityp. The incompletehaplotype/genotype matrices created in this fashion are inputsto the IPP/IPPH algorithms. The algorithms were tested on apentium 3.2 GHz machine running redhat Linux.

A. Results of the IPP Algorithm

The IPP algorithm was tested on data sets withm and nranging from 50 to 100. The input data was generated sothat the minimum allele frequency for any site is at least2%. For each data set, incomplete matrices were created withmasking probabilityp ranging from 0.1 to 0.5. The motivationin selecting data sets of this size is to test the performanceofthe algorithm under the worst possible scenarios. Whenmis large, andn is equal to or even less thanm, very littleinformation is available from the input matrix.

The experiment was repeated one thousand times for eachproblem size and each value ofp. Interestingly, the input datasets never satisfied the rich data hypothesis on all pairs of sites.A complete haplotype was only available for very few data sets(< 5% of the data sets withp = 0.1). Table I shows that justbuilding the forbidden matrixF (which takesO(nm2) time) orapplying the conditionR(i, j, k) = 1 on triplets (which takesO(m3) time) is sufficient, in most cases to convert the IPPproblem into the IDP problem. It is evident from the resultsthat even with 50% missing data, the root can be effectivelyinferred from the matrixF in most situations. In the casesin which a complete column was not available inF evenafter applyingR(i, j, k) = 1 on triplets of columns, therewere at most 5 unknown values in the most complete column.Therefore the maximum number of root vectors tested (numberof IDP instances tried) for any data set never exceeded 32.

11

TABLE II

AVERAGE NUMBER OF ROOTS TESTED PER DATA SET

n × m p = 0.1 p = 0.2 p = 0.3 p = 0.4 p = 0.550 × 50 1.00 1.00 1.01 1.06 1.3750 × 100 1.00 1.00 1.02 1.10 1.49100 × 50 1.00 1.00 1.00 1.00 1.03100 × 100 1.00 1.00 1.00 1.00 1.03

TABLE III

PERFORMANCE OF THEIPPALGORITHM ON A PENTIUM 3.2 GHZ PC - ALL

TIMES ARE IN SECONDS, AND ARE AVERAGES OVER1000MATRICES

n × m p = 0.1 p = 0.2 p = 0.3 p = 0.4 p = 0.550 × 50 0.015 0.013 0.012 0.012 0.01350 × 100 0.045 0.040 0.038 0.044 0.062100 × 50 0.026 0.025 0.022 0.020 0.019100 × 100 0.082 0.073 0.065 0.057 0.057

Table II shows the average number of roots (average numberof IDP instances) that had to be tested for each test case. Itcan be seen that this quantity is very close to 1 for all testcases withp ≤ 0.4. Even for the most difficult problem withn = 50 andm = 100 with 50% of the input data missing, theaverage number of IDP instances that needed to be tested wasjust 1.49.

The performance of the algorithm in terms of speed isshown in Table III. All the times are averages over 1000 runs.The standard deviation for the run times varied greatly, andwas as high as 20% for some test cases. It can be seen thatthe time taken is less than 0.1 seconds for all problem sizes.Also, it can be seen that time taken for a given problem sizedid not vary much with masking probability.

B. Results of the IPPH Algorithm

The IPPH problem is significantly harder than the IPPproblem, as there is no polynomial-time solution even whenthe root is known. Therefore, to establish the usefulness ofthe algorithm, we tested the algorithm on a practical200×30dataset, in addition to the test cases presented above. ThoughKimmel and Shamir [12] have also tested their algorithm onmatrices of size200× 30, their matrices were generated froma maximum of 9 distinct haplotypes. Since the only restrictionwe apply on the data is that the minimum allele frequency begreater than 2%, our input data sets can have up to 50 distincthaplotypes. Therefore, we are dealing with problems that aremuch harder to solve. Our algorithm has performed extremelywell on all the problem sizes.

For the problem sizes we are dealing with here, we considerthat the algorithm has failed on the data set if the algorithmtakes longer than 60 seconds to come up with a solution. Thereare very few input matrices on which the algorithm took morethan one minute. In very few instances of the problem, thealgorithm took up to 10 hours to complete.

Table IV presents the results for the problem size of200×30. The error rate is the average percentage of loci in originalhaplotype matrices that are incorrectly recovered. Noticethatthe algorithm takes less than 0.1s nearly 95% of the time, evenwith 50% missing data. The algorithm completed in less than

TABLE V

PERFORMANCE OF OURIPPHALGORITHM ON DATA SETS OF DIFFERENT

SIZES.

Median run time (sec)n × m p = 0.1 p = 0.2 p = 0.3 p = 0.4 p = 0.550 × 50 0.15 0.15 0.15 0.15 0.1450 × 100 1.27 1.26 1.22 1.18 1.13100 × 50 0.17 0.17 0.16 0.16 0.15100 × 100 1.36 1.34 1.33 1.28 1.25

Percentage of data sets that completed in less than 60sn × m p = 0.1 p = 0.2 p = 0.3 p = 0.4 p = 0.550 × 50 100 99.9 100 98.7 97.650 × 100 97.8 98.1 98.0 95.4 97.9100 × 50 100 100 99.6 99.1 99.5100 × 100 99.5 99.1 99.0 98.3 96.4

a second in more than 99.4% of the cases with 50% missingdata.

Performance of the IPPH algorithm on more difficult datasets is presented in Table V. As can be seen, the algorithmtakes less than two seconds on most of the test cases. Even forp = 0.5, greater than 97% of the instances complete in lessthan one minute. Note that the algorithm is extremely fast ondata sets with 10% missing data. As no more than 10% of thedata is missing in case of real genotype data, the algorithm isvery practical for application on real genotype data.

C. Conclusion

The algorithms presented in this paper, though very effi-cient, are mainly of theoretical significance. This is becauselarge data sets of real haplotype/genotype data often contain adirect violation of the 3-gamete rule, and hence seldom admita perfect phylogeny. The results presented in this paper canbe of practical significance if they can be extended to handleimperfectphylogenies.

ACKNOWLEDGMENT

This work has been partially supported by NSF grant0312724.

REFERENCES

[1] R. Agarwala and D. Fernandez-Baca, “A polynomial-time algorithm forthe perfect phylogeny problem when the number of character states isfixed,” SIAM J. Computing, vol. 23, pp. 1216–1224, 1994.

[2] HapMapConsortium, “The international hapmap project,” Nature,vol. 426, pp. 789–796, December 2003.

[3] A. G. Clark, “Inference of haplotypes from pcr-amplifiedsamples ofdiploid populations,”Mol. Biol. Evol., vol. 7, pp. 111–122, 1990.

[4] D. Gusfield, “Haplotyping as perfect phylogeny: conceptual frameworkand efficient solutions,” inProceedings of RECOMB, 2002.

[5] Y. Liu and C.-Q. Zhang, “A linear solution for haplotype perfectphylogeny problem,” inProceedings of ICBA, (Fort Lauderdale, USA),2004.

[6] R. Vijaya Satya and A. Mukherjee, “An efficient algorithmfor perfectphylogeny haplotyping,” inProceedings of CSB2005, (Stanford, CA),pp. 103–110, August 2005.

[7] Z. Ding, V. Filkov, and D. Gusfield, “A linear time algorithm forthe perfect phylogeny haplotyping (PPH) problem,” inProceedings ofRECOMB, (MIT, Cambridge, MA), pp. 585–600, 2005.

[8] Z. Ding, V. Filkov, and D. Gusfield, “A linear time algorithm for theperfect phylogeny haplotyping (PPH) problem,”J Comput Biol., vol. 13,no. 2, pp. 522–553, 2006.

12

TABLE IV

PERFORMANCE OF THEIPPHALGORITHM ON DATA SETS OF SIZE200 × 30

p = 0.1 p = 0.2 p = 0.3 p = 0.4 p = 0.5% of incorrectly recovered loci (error rate) 0.298 0.659 1.08 1.60 2.32median run time (sec) 0.07 0.06 0.06 0.06 0.06% of data sets that completed in less than 0.1s 96.7 97.5 97.4 96.5 94.8% of data sets that completed in less than 1s 99.9 99.9 99.7 99.7 99.4% of data sets that completed in less than 60s 100 100 99.9 99.9 99.8

[9] R. Vijaya Satya and A. Mukherjee, “An optimal algorithm for perfectphylogeny haplotyping,”Journal of Computational Biology, vol. 13,no. 4, pp. 897–928, 2006.

[10] M. Steel, “The complexity of reconstructing trees fromqualitativecharacters and subtrees,”Journal of Classification, vol. 9, pp. 91–116,1992.

[11] I. Peer, T. Pupko, R. Shamir, and R. Sharan, “Incompletedirected perfectphylogeny,”SIAM Journal on Computing, vol. 33, no. 3, pp. 590–607,2004.

[12] G. Kimmel and R. Shamir, “The incomplete perfect phylogeny problem,”J Bioinform Comput Biol., vol. 3, no. 2, pp. 1–25, 2005.

[13] E. Halperin and R. Karp, “Perfect phylogeny and haplotype assignment,”in Proceedings of RECOMB, 2004.

[14] R. Vijaya Satya and A. Mukherjee, “The undirected incomplete perfectphylogeny problem,” Tech. Rep. CS-TR-05-11, School of ComputerScience, University of Central Florida, Orlando, December12, 2005.

[15] J. Gramm, T. Nierhoff, R. Sharan, and T. Tantau, “Haplotyping withmissing data via perfect path phylogenies,” inProceedings of the secondRECOMB datellite workshop on Computational methods for SNPs andhaplotypes, pp. 35–46, 2004.

[16] V. Bafna, D. Gusfield, G. Lancia, and S. Yooseph, “Haplotyping asperfect phylogeny: a direct approach,”J Comput Biol., vol. 10, no. 3–4,pp. 323–340, 2003.

[17] R. Hudson, “Generating samples under the wright-fisherneutral modelof genetic variation,”Bioinformatics, vol. 18, pp. 337–338, 2002.

Ravi Vijaya Satya received B.Tech(1999) inElectronics and Communications Engineering fromJawaharlal Nehru Technological University, Hyder-abad, India. He received MS in Computer Sci-ence in 2001 and Ph.D. in Computer Science in2006 from the school of Electrical Engineering andComputer Science, University of Central Florida,Orlando, USA. His research interests include stringalgorithms, text compression and computational bi-ology. He is especially interested in problems incomparative genomics and phylogenetics. He has

published several papers on perfect phylogeny based approaches to thehaplotype inference problem. He is currently working as a Research Scientistwith the Biotechnology High Performance Computing Software ApplicationsInstitute in Ft. Detrick, Maryland, USA.

Amar Mukherjee (a.k.a. Amar Mukhopadhyay)received his D. Phil. (Sc.) degree from the Universityof Calcutta in 1963. He took a faculty position atPrinceton University in 1963-64 with the ElectricalEngineering Department and then joined the TataInstitute of Fundamental Research, Mumbai in 1964-66. He later joined University of Iowa during 1969-79. He moved to University of Central Florida asa Professor of Computer Science in 1979 and isnow with the School of Electrical Engineering andComputer Science.

Professor Mukherjee’s research interests include switching theory and logicdesign, cellular automata, graph theory, VLSI design, string matching algo-rithms and hardware, data compression, compressed domain pattern matchingand bioinformatics. Dr. Mukherjee is the author of a textbook Introduction tonMOS and CMOS VLSI Systems Design (Prentice Hall, 1986) and is a coauthor and editor of Recent Developments in Switching Theory (AcademicPress, 1971), which is a collection of fundamental works on switching theoryby leading researchers. He is also a co author and editor of New Paradigmsfor Manufacturing, Proceedings of the NSF sponsored workshop, May 2 4,1994, Arlington.

Professor Mukherjee is a Fellow of the IEEE and has been inducted asa member of the Golden Core of the IEEE. He has served the IEEE invarious capacities: Editor, IEEE Transactions on computers for three terms(1973-76, 1982-86, 1992-94); Editor, IEEE Transactions onVLSI (1994-96),Chair, IEEE Technical Committee on VLSI (1984-86), Member,SteeringCommittee, IEEE Transactions on VLSI (1996-2000) and Member of theSelection Committee, IEEE Fellow. He continues to serve as the Chair ofthe Steering Committee and General Co-Chair of International Symposiumon VLSI. Professor Mukherjee received the IEEE Computer Society 2005Technical Achievement award.