Techniques for the verification of minimal phylogenetic trees illustrated with ten mammalian...

10
Biochem. J. (1980) 187, 65-74 65 Printed in Great Britain Techniques for the Verification of Minimal Phylogenetic Trees Illustrated with Ten Mammalian Haemoglobin Sequences David PENNY,* M. D.HENDYt and L. R. FOULDSt *Department ofBotany and Zoology, and tDepartment ofMathematics, Massey University, Palmerston North, New Zealand (Received 17 October 1979) We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127-150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151-1661. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin a-chain sequences. Finally we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree. It is well known since the work of Fitch & Margoliash (1967) and Eck & Dayhoff (1966) that amino acid sequences can be used to construct evolutionary trees. It has not been possible pre- viously to decide whether a particular tree rep- resents only a local minimum or whether it is a global minimum. Any conclusions that could be drawn from a particular case have always been subject to the criticism that a shorter tree may be found for the data. Probably this is one of the reasons why these results have not been widely accepted. Hence a global minimal tree is of value for two reasons. Firstly, for comparative purposes it is a mathematically proven benchmark for trees con- structed by other methods. Secondly, it provides a mathematical solution to the model of minimizing the number of changes in an evolutionary tree. There is no guarantee of course, that this solution represents what actually happened in evolutionary history. We have described a new two-part technique (Foulds et al., 1979a,b) that allows the identi- fication of a minimal tree that can link a set of protein sequences. The first part is an efficient tree-building algorithm. The length of the con- structed tree is an upper bound on the length of any minimal solution since a minimal tree cannot be longer than any tree connecting the sequences. The Vol. 187 second part independently analyses the data to determine a lower bound on the length of any tree. No tree can exist with a length shorter than the lower bound. The analysis keeps improving the lower bound and searching for shorter trees until the upper and lower bounds are equal. The final tree is then a tree of minimal length. The present paper presents some improved tech- niques for the construction and verification of minimal phylogenetic trees from protein-sequence data. It is laid out as follows. The first section presents a method for converting a given data set of amino acid sequences to nucleotide sequences. Some problem cases that arise are dealt with and it is explained why it is desirable to make the con- version. The main section reveals a new strategy for constructing a minimal phylogenetic tree from a set of nucleotide sequences. In previous papers (Foulds et al., 1979a,b; Hendy et al., 1978) we have first built up a tree, taxon by taxon, using a variation of a method described by Prim (1957). We then ex- amined the final tree and attempted to prove it was minimal by partitioning the data into subsets. If this was not successful we tried to transform the tree into a shorter one. We no longer use this two-stage approach but our strategy is to progressively build a tree, taxon by taxon, and attempt to show minimality at each step. 0306-3275/80/040065-1O$O1.50/1 )1980 The Biochemical Society C

Transcript of Techniques for the verification of minimal phylogenetic trees illustrated with ten mammalian...

Biochem. J. (1980) 187, 65-74 65Printed in Great Britain

Techniques for the Verification of Minimal Phylogenetic Trees Illustratedwith Ten Mammalian Haemoglobin Sequences

David PENNY,* M. D.HENDYt and L. R. FOULDSt*Department ofBotany and Zoology, and tDepartment ofMathematics, Massey University,

Palmerston North, New Zealand

(Received 17 October 1979)

We have recently reported a method to identify the shortest possible phylogenetic treefor a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13,127-150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151-1661. The presentpaper discusses issues that arise during the construction of minimal phylogenetic treesfrom protein-sequence data. The conversion of the data from amino acid sequences intonucleotide sequences is shown to be advantageous. A new variation of a method forconstructing a minimal tree is presented. Our previous methods have involved firstconstructing a tree and then either proving that it is minimal or transforming it into aminimal tree. The approach presented in the present paper progressively builds up a tree,taxon by taxon. We illustrate this approach by using it to construct a minimal tree forten mammalian haemoglobin a-chain sequences. Finally we define a measure of thecomplexity of the data and illustrate a method to derive a directed phylogenetic treefrom the minimal tree.

It is well known since the work of Fitch &Margoliash (1967) and Eck & Dayhoff (1966) thatamino acid sequences can be used to constructevolutionary trees. It has not been possible pre-viously to decide whether a particular tree rep-resents only a local minimum or whether it is aglobal minimum. Any conclusions that could bedrawn from a particular case have always beensubject to the criticism that a shorter tree may befound for the data. Probably this is one of thereasons why these results have not been widelyaccepted. Hence a global minimal tree is of value fortwo reasons. Firstly, for comparative purposes it is amathematically proven benchmark for trees con-structed by other methods. Secondly, it provides amathematical solution to the model of minimizingthe number of changes in an evolutionary tree. Thereis no guarantee of course, that this solutionrepresents what actually happened in evolutionaryhistory.We have described a new two-part technique

(Foulds et al., 1979a,b) that allows the identi-fication of a minimal tree that can link a set ofprotein sequences. The first part is an efficienttree-building algorithm. The length of the con-structed tree is an upper bound on the length of anyminimal solution since a minimal tree cannot belonger than any tree connecting the sequences. The

Vol. 187

second part independently analyses the data todetermine a lower bound on the length of any tree.No tree can exist with a length shorter than the lowerbound. The analysis keeps improving the lowerbound and searching for shorter trees until the upperand lower bounds are equal. The final tree is then atree of minimal length.

The present paper presents some improved tech-niques for the construction and verification ofminimal phylogenetic trees from protein-sequencedata. It is laid out as follows. The first sectionpresents a method for converting a given data set ofamino acid sequences to nucleotide sequences. Someproblem cases that arise are dealt with and it isexplained why it is desirable to make the con-version. The main section reveals a new strategy forconstructing a minimal phylogenetic tree from a setof nucleotide sequences. In previous papers (Fouldset al., 1979a,b; Hendy et al., 1978) we have first builtup a tree, taxon by taxon, using a variation of amethod described by Prim (1957). We then ex-amined the final tree and attempted to prove it wasminimal by partitioning the data into subsets. If thiswas not successful we tried to transform the tree intoa shorter one.We no longer use this two-stage approach but our

strategy is to progressively build a tree, taxon bytaxon, and attempt to show minimality at each step.

0306-3275/80/040065-1O$O1.50/1 )1980 The Biochemical SocietyC

D. PENNY, M. D. HENDY AND L. R. FOULDS

The advantages of this approach are detailed in thesection. The strategy is illustrated by finding aminimal tree for ten mammalian haemoglobina-chain sequences. The final section of the paperintroduces a measure of complexity of the data inconstructing a minimal tree from a given data set.

The Conversion to Nucleotide Sequences

We have pointed out in a previous paper (Fouldset al., 1979a) that it is possible to use a criterion ofminimizing either the total number of weightedamino acid changes or the number of nucleotidesubstitutions when constructing a phylogenetic treebased on amino acid sequences. Several workershave converted amino acid sequences into nucleo-tide sequences before constructing phylogenetic trees(e.g. Fitch & Margoliash, 1967; Moore et al., 1973).In dealing with complex problems we have found itadvantageous to convert amino acid sequences intonucleotide sequences. This is because the differencebetween any pair of nucleotides is of unit weight,which leads to easier partitioning strategies. Themethod we use to convert the data into nucleotidesequences is now outlined with an example.

For each protein site we list the amino acids thatoccur at that site. As an example, suppose that onlyglycine, histidine, leucine, and proline occur indifferent sequences at a particular site. The tripletsthat code for these amino acids are given below:

Amino acidsGlycine (Gly)

Histidine

Leucine

Proline

TripletsGGX

(His) CAY

(Leu) CUX or UUR

(Pro) CCX

where any of the four nucleotides may occur in theposition of an X, A or G may occur in the positionof an R, and U or C may occur in the position of aY (Fitch & Farris, 1974). Each column of the datais then examined independently. Assuming first thatCUX codes for leucine, in this case we find that onlyC and G occur, hence there must be a C-Gsubstitution present in any tree. The second columnhas four nucleotides, which must require threesubstitutions. The third column does not require anysubstitutions since only X and Y occur and thereforeeither U or C can occur at all positions in thecolumn. Thus any tree will require at least four(1 + 3 + 0) substitutions for this amino acid site,assuming that CUX codes for leucine in this case.The procedure is repeated with UUR replacing

CUX for leucine. However with the choice of UURit can be seen that any tree will require at least six(2 + 3+ 1) nucleotide substitutions for this amino

acid site. As our criterion is one of minimizing thetotal number of nucleotide substitutions in the finaltree, we assume that CUX codes for leucine in thiscase.

Arginine and serine are also coded for by any oftheir six triplets. If more than one amino acid codedfor by six triplets occurs at an amino acid site, thenall possibilities are considered. The combinationcorresponding to the smallest number of substi-tutions is used. Ties are settled by choosing an X inpreference to a Y or R, where possible, withremaining ties being settled arbitrarily. If a ter-minating triplet (UAR or UGA) is present, bothpossibilities are considered in the same way asabove. A terminating triplet is used at the freeC-terminus of a protein sequence that is shorter thanother sequences in the data set. There could be up to16 combinations of codes to test if arginine, leucine,serine and a terminating code all occurred at thesame amino acid site.

Deletions at the N-terminus are treated differ-ently. For example, two different sequences mayhave at the N-terminus:

Met-Gly-Arg-His-Lys-Asn-Gly ...Lys-Asn-Gly ...

The criterion we use for minimizing the tree is to findthe smallest number of changes that will connect allthe sequences. In the second sequence the positioncorresponding to that of His in the first sequence iscoded for by a deletion symbol (/). This is con-sidered to be one nucleotide change distant from allother codings. Positions to the left are replaced by anull character, which is not compared with any othercharacter. Consequently the two sequences shownare considered to be one change apart for thepositions shown.

It is indeed possible that in the course of theevolution of the two sequences there may have beenmore than one change for the positions shown.Governed by our criterion of minimizing the numberof changes, we must select the possibility corres-ponding to this single change. The same procedure isused for internal deletions. To summarize, the methodconverts the protein sequences into nucleotidesequences with the minimum total number ofchanges between the nucleotide sequences.

These nucleotide sequences are used to constructa phylogenetic tree by methods described below. Atthis stage of our method, at any site a particularamino acid is coded for by the same triplet ofnucleotides for all taxa in which it occurs. Howeverit may be possible later to shorten the length of aspecific tree by using alternative codes for aparticular amino acid for different taxa. An exampleis given in Fig. 1, where the tree shown is concernedwith a single amino acid site and is part of a treederived by considering complete sequences. Note

1980

66

VERIFICATION OF MINIMAL PHYLOGENETIC TREES

Val

Fig. 1. Shortening the length of a phylogenetic tree byusing alternative codingsfor an amino acid

that leucine is present at two points in the tree. Thecodes shown: CUX at position (1) and UUR atposition (2) lead to a tree of length 7. If the samecode, either CUX or UUR is used at both positions(1) and (2) the minimal tree is of length 8. If UUR iscoded at position (1), CUX at positions (2) theminimal tree is of length 9. Other workers havedeveloped algorithms to produce optimal codings fora given tree (Moore et al., 1973; Fitch & Farris,1974). After we have constructed a minimal tree forthe derived data, changes to the codings are thenmade whenever the length of this tree can beshortened. This final step cannot be carried out untila tree is constructed.

Therefore there are three steps in the process.First codings are assigned that require the fewestchanges, before any tree is constructed. In thesecond step these codings are used to construct atree with methods that we have found usuallyproduces minimal trees. The third step involvesrevising the codings whenever this leads to adecrease in the total length of the tree.

However, it is conceivable that different codingsmight have resulted in a shorter tree (of differenttopology) than that finally obtained. This possibleproblem occurs only when the data are convertedfrom amino acid into nucleotide sequences. Theproof of optimality used below is that no shorter treecan be constructed by using the derived data set.

The H1 strategyIn our experience of tree construction and

subsequent proof of minimality, we have found themost difficult phase has been proof of minimality.Hence we have adapted the tree-construction algo-rithm in order to assist this analysis. We use whatare known as 'partitions' of the data set. A partitionof the set S of sites of the data set is a collection ofsubsets of S with the property that every site appearsin precisely one subset. Our strategy is to let the tree'grow', i.e. to link one new taxon at each stage tothe tree of the previous stage, and attempt to find a

Vol. 187

suitable partitioning of the data set to prove that thisnew tree is minimal. The major advantage of thisapproach over our previously described algorithm(Foulds et al., 1979a) is that the partitions 'grow'with the tree. For example, suppose T, is a treelinking i taxa, which has been proved minimal by thepartition II, of the data set. Then the necessarypartitioning 1,+ I to prove T, 1 minimal (where T, 1is obtained from T, by linking one further taxon) canusually be derived simply from II,.

This version of our tree-construction strategy isbased loosely on the method described by Prim(1957) of constructing minimal spanning trees. Wealways retain a connnected tree T, at each stage, andthe tree T,, 1 is derived by linking the closest taxon tothe tree T, that is. not part of T,. The major differencefrom Prim's (1957) algorithm is that we areconstructing a minimal phylogenetic spanning tree,which allows for the introduction of Steiner points,i.e. new points at which the tree branches. These newpoints represent definite sequences that were notpresent in the original data set (Hendy et al., 1978;Foulds et al., 1979a). In Prim's (1957) algorithm,such an addition is not permitted. Except in thesimplest cases, when the two are equal, the Prim(1957) tree will be of greater length than the minimalphylogenetic spanning tree.A secondary difference is that we do not

necessarily choose the shortest link to create ourinitial tree, T2. Instead we select a pair of taxa thatare relatively close to each other and to many otherspecies. From experience we have found this to bemore successful than the initial step of Prim's (1957)algorithm, which chooses the shortest link. Thisgives a more uniform growth to T, and 1,. We haveapplied the strategy only to data that have beenconverted into nucleotide sequences by using theprocess described above. The tree we produce is atree of minimal length that spans the derivednucleotide sequences. We have called this develop-ment the II, strategy.We illustrate the II, strategy with an example,

finding an optimal tree for the data set of mam-malian haemoglobin a-chain nucleotide sequenceslisted in Table 1. These sequences were derived byusing our conversion algorithm from proteinsequences, which come from various sources. Wehave condensed the sequences by deleting those sitesat which there is no variation, and renumbering thesite numbers. From a table of differences betweentaxa, we note the shortest link is MAN-APE, oflengthl; however, we soon discover, if we com-mence the tree building with this link, that MAN-APE and MKY are a relatively isolated part of thetree. Thus we found it more fruitful for the JCIIanalysis to begin with the COW-EWE link. (Thethree-capital-letter abbreviations for the animals aredefined in Table 1.)

67

D. PENNY, M. D. HENDY AND L. R. FOULDS

7 ()1£10 0 0 0 < <

v. e> (I)jsI< < < < < < < <

o(Z)K)I 0 0 0 0 0 0 0 0 0

V - (Z)I I °*s, ru(1)1110 0 0 D D D D <

(Z)Z8 U) U) <

| N (I)Z80 0 0 0 0 0 0 0 0 < <

K _(16L O O S O<

sx f o (Z)8L < < <

" cr (1)8L < 6 < O

.z oc(1)9L < < < U> < U) U)

*5 t(Z)89 < < 3 0< < <S < 0<

] ()8 <<< <U< < <

i o5>(I)O9 < < < O < < <<

o (Z )LS

ZA

t '-(I)SS < <

.o ,< (Z)89 < <

X _^ u (I)S < < < < O O < < 0

(Z)09 3 3 3 < < 3 u 3 3 3 <

(z) z¢~~~Uo0ozl U U

- *5 c oc (Z)6I U U 0 0 0 0 U 0 0 0

co (I)LIO < < 0 <

F n Q (Z)SI 00 0 < 0 <

(1)OTO 00 < < 0 < 0

< (Z)8 U

i= N(1)8 < < < < < < 0< <.e '-_ (T)t'

02

~~~~~~~~~~~ oU C

cis 0 0 < U.EcisU

.a0|dwt0d ;tZ3

1980

68

VERIFICATION OF MINIMAL PHYLOGENETIC TREES

We have shown (D. Penny, M. D. Hendy & L. R.Foulds, unpublished work) how to prove optimalityof a tree with no more than five taxa directly, so T2,T3 (trivially optimal by construction), T4, and T, areeasily proven optimal, and so T6 is the first caserequiring partitioning. The tree T6 of length 50connecting EWE, HOS, PIG, MUS, MKY and

76=(22) {1}126}

(7)(4)(3)(5)(4)(5)50

The partition is listed with the length of eachsubset (in nucleotide changes), followed by the sitesincluded in that subset. The first line includes all sitesthat are in subsets by themselves; that is each subsetin the first line has only one site. There are 19 suchsubsets, of which three (23, 26, 36) require twonucleotide changes each on the tree.

14) 18) 19) {10} {14} {16} 1221 1231 {25}{27} 1291 130) 1311 136) 137) 1401 {421

12,3,24,33)111, 38}115,43)118,28,351119, 39)120, 34,411

COW is illustrated in Fig. 2. This tree is minimal,this being proved via the partition 116. To define I16we list each subset of the partition by the sitenumbers, followed by the length, in brace brackets(1 }), of the minimal spanning tree needed to span thetaxa on that subset of data. We prove minimalityprovided the sum of these lengths is equal to thelength of T6'

From our partitioning theorem (Foulds et al.,1979a) we can state that it is not possible to linkthese data with fewer than 50 nucleotide changes.But a tree of length 50 already exists (Fig. 2), andtherefore the tree in Fig. 2 is the shortest possibleway of linking the six species.Now T7, of length 54, is created, by adding MAN,

as shown in Fig. 3. All duplications of T7 except

ES A~~~~~~

Fig. 2. T6, a tree oflength 50 that links the six speciesNo shorter tree can exist, but the analysis has not excluded the possibility of alternative trees of the same total length.Nucleotide changes are expressed, for example as 1CG, on the MKY-S14 link. MKY will have a C at position 1(Table 1) and S 14 a G. The order in which the changes occurred is not specified on the links. Additional evidence isrequired (see the section on rooted phylogenetic trees) before it is known whether the change was from C to G or viceversa.

Vol. 187

69

D. PENNY, M. D. HENDY AND L. R. FOULDS

those of the MAN-S5 link are justified by the 7;6partition. Hence it is only the further duplications atsites 3, 9, 25 and 26 that need to be justified. Thenew partition 7r7 is derived from .6

7r7= (11)(3)(4)(7)(5)(4)(3)(5)(4)(5)(3).54

{4} {8} {10}{23,40}{1,26}{2,3,27,33}{9, 25, 30}{11, 38}{15,43}{18,28,35}{19, 39}{20, 34, 41}{24, 29}

Site 36 in the first line requires two nucleotidechanges. Thus the seven species cannot be linkedwith fewer than 54 changes, T7 (Fig. 3) has 54changes, and therefore Fig. 3 is the shortest possibletree.

T. of length 55 is obtained by adding APE withno further duplications, so ; = 7r7 with one furthersite, 12, requiring one change. Thus T8 is minimal.

Fig. 4 shows the tree T9, which is of length 67,obtained when RAB (rabbit) is added. We have notbeen able to prove that it is minimal by consideringpartitions that can be generated from those of ;8.Hence we have added DOG to create T1o. Thecorresponding partitions are:

ir: (8)(14)(9)(9)

(14)(7)(6)(7)(4)78

The subset is on the fifth line of the partition andincludes sites 6, 10, 15, 23, 37, 40 and 43. Fig. 6includes all the different combinations of codes thatoccur among the ten species being studied [omitting

{14} {16} {22} {31} {36} {37} {42}

kangaroo (ROO)]. Species that are only onenucleotide change apart are linked unless a circuitwould be formed. Species linked by links of length 1are called 1-clusters. The method examines allpossible ways of linking the 1-clusters together andno trees of length less than 14 are found. The treemust therefore have at least 14 changes at these sites(Foulds et al., 1979a) and this is the numberoccurring in Fig. 5. We have not been able to proveT9 minimal by extending the 7;8 partition. However,Tg has been proven minimal (D. Penny, M. D.Hendy & L. R. Foulds, unpublished work) byextensions of the methods given in Hendy et al.(1978).

{22} {31 {32}

A Measure of Complexity

It has been shown (Foulds et al., 1979a) that alower bound on the length of the minimal tree forany set of data can be calculated in the followingway. At each site in the data where there is variation,construct a minimal spanning tree for the charactersthat occur. Then the sum of the lengths of theseminimal spanning trees is a valid lower bound on thelength of the minimal phylogenetic tree.

{5} {13} {14} {17}{1,8, 18,26,28,35}{2,3,27,33,39}{4, 11, 16,38}{6, 10, 15,23,37,40,43}{9, 12, 25, 30}{19,36}{20, 34,41,42}{24, 29}

Site 14 requires two nucleotide changes in f1lu.This partition demonstrated that Fig. 5 is theshortest possible way of linking the ten species. Aswill be observed 7+, is obtained in each case from7;, by combining some subsets and occasionallybreaking up a subset to recombine it in differingways.A more detailed explanation about how a par-

tition is shown to be minimal is illustrated in Fig. 6,where a subset is taken from the partition for 7roo.

1980

70

VERIFICATION OF MINIMAL PHYLOGENETIC TREES

}~~~~~~~~~~~~~~~~~G2G 26A CGU15G1C3C

Fig. 3. Trees T, and T.A tree of length 54 now includes MAN, and APE can be included with one more link. Both trees are the shortestpossible.

C~~~~~~~~~~~~~~~~~~G

Fig. 4. Tree T9, oflength 66, with RAB included

This is P0 and is the total number of nucleotide Now as we have assumed that the data have beensubstitutions required considering each site indi- converted into nucleotide sequences, all distinctvidually. m m characters are exactly one nucleotide substitution

Let P0 = E (Nf-1)= (j Nf)-m apart. If there are N1 distinct nucleotides in column i,f=l f=l then the minimal spanning tree has length (n,- 1).

Vol. 187

71

D. PENNY, M. D. HENDY AND L. R. FOULDS

Fig. 5 T1O, a tree oflength 78 that links the ten species being studiedAgain, no shorter tree is possible for this coding.

MSY / MUS/ AGCACURG GGAY GAR

Fig. 6. A minimal tree with two Steinerpoints linking the subset ofsites 6, 10, 15, 23, 37, 40 and 43The subset is from the partition for IIlo. For this subset of sites, APE and MKY have the same codes as MAN, andEWE the same as COW. Other trees of length 14 can link this subset, but no tree of length less than 14 can link them.There must therefore be at least 14 changes on the tree (Fig. 5) involving these sites (Foulds et al., 1979a).

Thus if the length of the minimal spanning tree is L*,and there are m sites with varying characters.

m m

L*> 1i (Ni- 1) = (I2 N,)-m1=1 1=1

The the difference, L*-PO, is a non-negativequantity representing the discrepancy between thelength of a minimal tree for the data and P0. It is the

number of duplications (Penny, 1976) that occur inthe tree. Then the quantity C [=(L*-P0)/PO] isdefined as the measure ofcomplexity of the data.

Given a data set, L* will generally not be knownuntil a tree is proved to be minimal, but an upperbound on its value can be obtained after a tree isconstructed. For this analysis, sites are omitted ifthey do not have at least two codes occurring morethan once, because such sites will neither have

1980

72

VERIFICATION OF MINIMAL PHYLOGENETIC TREES

duplications on the tree nor will help explainduplications at other sites (Fitch, 1977). For 7r1these are sites 5, 13, 14, 17, 22, 31 and 32, and whenthese are omitted, P0= 42 and L* = 70.Thus:

70-42C= = 0.667

42

To summarize, a measure C has been defined forwhich an upper bound can be easily found. Thismeasure often gives a good guide as to how difficultit will be to produce a minimal tree. C = 0 representsthe case where L* = Po, a tree in which noduplication is necessary, which is the simplestpossible case.The difficulty in verifying that a given final tree is

minimal appears to depend more on the complexityof the data (as previously defined) than on thenumber of species present. For instance, ourprevious methods, which produced a minimal treefor a data set with 23 species but complexity 0.010(Foulds et al., 1979b), failed to identify a minimaltree for the present data set, which has ten speciesand complexity 0.667. However the previousmethods are easier to implement than the 7approach and hence more desirable to use on data ofrelatively low complexity. Minimal trees have beenconstructed for a set of ten species derived by usingthe III approach on a number of different proteinsequence data sets (D. Penny, M. D. Hendy & L. R.Foulds, unpublished work).

Rooted Phylogenetic Trees

Most workers have displayed phylogenies withdirected links, with one end of each link being earlierin time than the other end. The model that adopts asits optimality criterion the minimization of the lengthof the tree is concerned solely with trees ofundirectedlinks. However should a directed tree (i.e. withdirected links) be desired, it can be constructed froman undirected tree provided additional information isavailable to determine a root (original ancestor) ofthe tree (Penny, 1976).One possible method for determining the root of a

tree is to introduce a species with known sequencethat is thought to have diverged from the group ofspecies under study before any other divergencewithin the group. This species would identify the rootof the tree. The kangaroo, as a marsupial, is asuitable species for our data set as its haemoglobinsequence is available. The shortest way of con-necting the kangaroo (ROO) to the existing tree inFig. 5 is to join it to the HOS-S2 link, giving aphylogeny with directed links shown in Fig. 7. Thenucleotide changes on each link are not shown inFig. 7, but the same changes occur as in Fig. 1 plus16 on the ROO-S 19 line, giving a total length of 94for the tree in Fig. 7.

Vol. 187

/

//

/

/

/

/

/ i/ 71 3 / / NO U) O X wJ > X) >- W ZO 0° D

Fig. 7. Introduction of a new species to determine theroot ofthe tree

The direction on the links is now specified. The 1CGchange (see Fig. 5) would now go from G on S 14 toC on monkey whereas the direction of the changewas not specified in Figs. 2-5. The tree is of length94.

/

//

//

//

//

//

//

/

/

C) U) >- W z CD D nLD 0 ~e ( < 0 < .

. . An a v dE d t, aL 2o 9

Fig. 8. An alternative directed tree, also oflength 94

It should be noted that the tree produced by theaddition of the ROO-S 19 link is not the onlyminimal tree with the enlarged set of eleven species.The rooted tree shown in Fig. 8 has the same lengthof 94 and thus Figs. 7 and 8 are equally likely onthese data.

Discussion

It should be remembered that the claim to havingfound the shortest possible tree for a data set islimited in the present case by possible alternativenucleotide sequence codings from the original amino

73

cca

74 D. PENNY, M. D. HENDY AND L. R. FOULDS

acid sequences. This will always be a limitation whenamino acid sequences are used, but with the markedimprovement in DNA sequencing it is likely thatnucleic acids will increasingly be used in the futurefor determining phylogenetic relationships. The useof nucleic acid sequences will simplify the method byeliminating the transformation of the data and willallow a more positive statement that no shorter treescan exist. Another limitation is that it is not usuallypossible, except with smaller data sets (Hendy et al.,1978), to determine how many trees exist with theminimal length. The method leads to the conclusionin a specific case 'that no shorter tree can link thedata', and does not exclude the possibility that otherequally short trees exist. An example is shown inFigs. 7 and 8, where two topologies with the samelength link the data.The present paper describes part of a larger study

using the same eleven species but five differentpolypeptides. It is apparent from this work that it isunwise to claim too much biological significance tothe one protein used. For this reason any biologicalconclusions should be avoided, except to mentionthat primates form a natural group in all cases andthat the rodents (as represented by the mouse) aremore closely related to primates than to any of theother groups studied.The present paper therefore reports an improved

method for determining minimal trees with complexdata sets. It uses the well known conversion ofamino acid sequences imto nucleotides. The tree-building method is based on the minimal-spanning-tree method of Prim (1957) rather than on that of

Kruskal (1956), which attempts to determine aftereach species has been added that the tree is minimalbefore proceeding further. A measure of the com-plexity of the data is defined, and this can be used todecide how to approach the problem of determiningthe shortest tree. The conversion of an undirectedinto a directed (or rooted) tree is illustrated by usinga marsupial (kangaroo) sequence.

It is expected that the use of methods such asthese will become increasingly common in the nextfew years as a result of marked improvements inDNA-sequencing techniques.

References

Eck, R. V. & Dayhoff, M. 0. (1966) Atlas of ProteinSequence and Structure 1966, National BiomedicalResearch Foundation, Silver Springs, MD

Fitch, W. M. (1977) Am. Nat. 111, 223-258Fitch, W. M. & Farris, J. S. (1974) J. Mol. Evol. 3, 263Fitch, W. M. & Margoliash, E. (1967) Science 155,

279-284Foulds, L. R., Hendy, M. D. & Penny, D. (1979a)J. Mol.

Evol. 13, 127-150Foulds, L. R., Penny, D. & Hendy, M. D. (1979b)J. Mol.

Evol. 13, 151-167Hendy, M. D., Penny, D. & Foulds, L. R. (1978) J.

Theor. Biol. 71, 441-452Kruskal, J. R. (1956) Proc. Am. Math. Soc. 7, 48-50Moore, G. W., Barnabas, J. & Goodman, M. (1973) J.

Theor. Biol. 38,459-485Penny, D. (1976)J. Mol. Evol. 8, 95-116Prim, R. C. (1957) Bell Syst. Tech. J. 36, 1389-1401

1980