Cuvier meets Watson and Crick: the utility of molecules as classical homologies

18
Biological Journal of the Linnean Sociely (1991), 44: 307-324. With 7 figures Cuvier meets Watson and Crick: the utility of molecules as classical homologies PAUL MORRIS AND EMILY COBABE*,t Department of Earth and Planetary Sciences, Harvard University, Cambridge MA, 02138, U.S.A. Received 5 Jub 1990, accepted for publication 27 July 1990 Much phylogenetic information has been derived from the analysis of sequence similarity in genes and proteins. These data are generally considered to be more reliable than an examination of the phylogenetic distribution of similar biologically active molecules. However, molecules can provide significant phylogenetic information when accompanied by a careful analysis of their structure, synthesis, genetics and function. Molecules may be highly structurally and functionally constrained. Thus, similar or even chemically identical molecules may be unrelated, and this may be discernible only by examination of information beyond the simple structure of the molecule. Phylogenetic variation in the synthesis of tyrosine and lysine demonstrates that chemical identity of molecules may be brought about by unrelated synthetic pathways. The widespread distribution of collagen triple helices is shown to result from convergence under structural and functional constraints. This is demonstrated by an examination of the steric constraints upon collagens and the presence of several independent families of collagen genes. Lysyl oxidase crosslinking occurs in several unrelated proteins, indicating that similarity in the post-translational modification of proteins is not evidence of homology. KEY WORDS:-Homology - evolution - structural constraint - functional morphology - phylogenetics - collagen - amino acids - lysyl oxidase. CONTENTS Introduction . . . . . . . . . . . . . . . . . . . 307 Classical criteria for homology . . . . . . . . . . . . . . . 308 Variation in amino acid synthesis . . . . . . . . . . . . . . 309 Tyrosine synthesis pathways . . . . . . . . . . . . . . 309 Lysine synthesis pathways . . . . . . . . . . . . . . . Structural constraints on collagens . . . . . . . . . . . . . Gene structure as evidence of convergence . . . . . . . . . . . Post-translational pathways in proteins . . . . . . . . . . . . . 31 I 3 13 313 315 3 19 Conclusion . . . . . . . . . . . . . . . . . . . 322 Acknowledgements . . . . . . . . . . . . . . . . . 322 References . . . . . . . . . . . . . . . . . . . 322 Collagen triple helices . . . . . . . . . . . . . . . . . INTRODUCTION The explosive growth of molecular biology in the last few decades has produced a wealth of information about the development, function and structure *Author to whom correspondence should be addressed. ?Present address: Department of Geology, Wills Memorial Building, Queens Road, University of Bristol, Bristol BS8 IRJ. 00244066/91/120307 + 18 803.00/0 307 0 1991 The Linnean Society of London

Transcript of Cuvier meets Watson and Crick: the utility of molecules as classical homologies

Biological Journal of the Linnean Sociely (1991), 44: 307-324. With 7 figures

Cuvier meets Watson and Crick: the utility of molecules as classical homologies

PAUL MORRIS AND EMILY COBABE*,t

Department of Earth and Planetary Sciences, Harvard University, Cambridge MA, 02138, U.S.A.

Received 5 Jub 1990, accepted f o r publication 27 July 1990

Much phylogenetic information has been derived from the analysis of sequence similarity in genes and proteins. These data are generally considered to be more reliable than an examination of the phylogenetic distribution of similar biologically active molecules. However, molecules can provide significant phylogenetic information when accompanied by a careful analysis of their structure, synthesis, genetics and function. Molecules may be highly structurally and functionally constrained. Thus, similar or even chemically identical molecules may be unrelated, and this may be discernible only by examination of information beyond the simple structure of the molecule. Phylogenetic variation in the synthesis of tyrosine and lysine demonstrates that chemical identity of molecules may be brought about by unrelated synthetic pathways. The widespread distribution of collagen triple helices is shown to result from convergence under structural and functional constraints. This is demonstrated by an examination of the steric constraints upon collagens and the presence of several independent families of collagen genes. Lysyl oxidase crosslinking occurs in several unrelated proteins, indicating that similarity in the post-translational modification of proteins is not evidence of homology.

KEY WORDS:-Homology - evolution - structural constraint - functional morphology -

phylogenetics - collagen - amino acids - lysyl oxidase.

CONTENTS

Introduction . . . . . . . . . . . . . . . . . . . 307 Classical criteria for homology . . . . . . . . . . . . . . . 308 Variation in amino acid synthesis . . . . . . . . . . . . . . 309

Tyrosine synthesis pathways . . . . . . . . . . . . . . 309 Lysine synthesis pathways . . . . . . . . . . . . . . .

Structural constraints on collagens . . . . . . . . . . . . . Gene structure as evidence of convergence . . . . . . . . . . .

Post-translational pathways in proteins . . . . . . . . . . . . .

31 I 3 13 313 315 3 19

Conclusion . . . . . . . . . . . . . . . . . . . 322 Acknowledgements . . . . . . . . . . . . . . . . . 322 References . . . . . . . . . . . . . . . . . . . 322

Collagen triple helices . . . . . . . . . . . . . . . . .

INTRODUCTION

The explosive growth of molecular biology in the last few decades has produced a wealth of information about the development, function and structure

*Author to whom correspondence should be addressed. ?Present address: Department of Geology, Wills Memorial Building, Queens Road, University of Bristol,

Bristol BS8 IRJ.

00244066/91/120307 + 18 803.00/0 307

0 1991 The Linnean Society of London

308 P. MORRIS AND E. COBABE

of organisms, undreamt of by classical comparative anatomists. Little of this information has been used to test hypotheses of genealogical relationships among organisms. Other than the radical changes in higher-level bacterial taxonomy produced by examination of the molecular biology of bacterial genetics (for examples, see Woese, Magrum & Fox, 1978; Woese, 1982; Schleifer & Stackebrandt, 1985), the primary contribution made by molecular biology to phylogenetic studies has been in the area of sequence similarity. I t has been recognized that the construction of phylogenies from the distribution of similar molecules is problematic. Numerous phylogenetic inferences have been drawn from the known occurrences of such molecules as chitin Ueuniaux, 1963), creatine phosphate (Roche, Thoai & Robin, 1957; Rudull & Kechington, 1973) and respiratory pigments (Terwilliger, 1980). Kerkut ( 1960) noted that such biochemical phylogenies based upon presence/absence data in a few taxa are fraught with error, while Wilmer (1990: 87), in a discussion of the disreputable status of comparative biochemistry, comments: “Scoring the presence of a molecule in a few samples of a few species from a phylum is never going to be good enough as grounds for comparisons between higher taxa, even without the problems of purification and uncertainties as to whether the molecules are endogenously synthesized or derived from the food or environment. Some of the classic success stories of early comparative biochemistry appear to be untenable . . . Clearly there is a need to move on rapidly to a more sophisticated consideration of molecular history and interrelationships”. We agree that the construction of phylogenies from the distribution of similar molecules is an inadequate approach. In many cases, molecular structure alone can be misleading because molecules are structurally and functionally constrained. However, in these instances, other kinds of information, such as synthetic pathways and higher-level historical information, such as gene structure and chromosome placement, can be used as characters. Molecular structures can be used as homologies in the traditional sense, following the lines of reasoning of classical comparative anatomy.

This paper looks at the potential usefulness of molecules as classical homologous structures. I t discusses how variations in constructional pathways, such as enzymatic series, can provide clues to the phylogenetic relationships among organisms. It details how gene structure can be used in phylogenetic studies. The study of molecules in this manner may provide additional information about genealogical relationships, beyond that available in studies of sequence similarity or anatomy.

CLASSICAL CRITERIA FOR HOMOLOGY

The concept of homology was developed by comparative anatomists of the 19th century, such as Georges Cuvier, Richard Owen and August Weismann. A general definition of homology involves features “. . . developed . . . from the basis of inheritance from an ancestral type common to all” (Simpson, 1967: 180) ; in essence, similarity by virtue of common genealogy. Classical homologies have been recognized by precise similarities in structure, positional relationships to other anatomical elements and by developmental origin. Recent discussions have examined both definitions of homology and criteria for the recognition of homologies in the contexts of development and molecular sequence similarity

MOLECULES AS HOMOLOGIES 309

(e.g. Roth, 1984; Patterson, 1988; Michaux, 1989; Lewin, 1987). Identification of homologies is critical to the elucidation of phylogenetic relationships, as homologous characters are the basis of both cladistic and evolutionary systematics.

Analogous criteria may be applied to comparisons of molecular structures. Homologies in molecules can be defined by precise structural, synthetic and genetic similarities. Like anatomical features, molecules can be under considerable structural and functional constraint. Identification of properties not under the same functional constraints as the chemical structure of a molecule, such as synthetic pathways (as well as such genetic information as chromosome placement), can provide additional insight into genealogical relationships.

The development of techniques for the study of the mechanisms of bacterial genetics has lead to a revolution in the understanding of the relationships between the major groups of bacteria. The transition from a fundamental division of organisms into prokaryotes and eukaryotes to the realization that organisms are divided into three distinct lineages (archaebacteria, eubacteria and eukaryotes) resulted from the use of molecular characters including ribosomal structure, cell wall composition and genetic organization (Woese, 1985). This work illustrates the potential power of molecular information in phylogenetic and evolutionary studies. In some cases, however, more is needed to define homologies than simply chemical structures.

VARIATION IN AMINO ACID SYNTHESIS

Tyrosine synthesis pathway

A striking example of the homologies that may be revealed at a molecular level of organisms is provided by the synthesis of the amino acids tyrosine and lysine. The synthetic pathways of both of these amino acids are different in different organisms, illustrating that even though two molecules are structurally identical, they may not be homologous because they have been built in different ways.

In all organisms, the synthesis of the aromatic amino acids (tryptophan, phenylalanine and tyrosine) proceeds from a pathway in which molecules of the central carbohydrate pathways (erythrose-4-phosphate and phosphoenol- pyruvate) are linked. In a series of seven enzymatic reactions, they are converted into a cyclic carboxylic acid, chorismate (Fig. 1A). Chorismate is the precursor of all three of the aromatic amino acids. Tryptophan is synthesized by a five-step pathway from chorismate. A rearrangement of chorismate by a mutase produces prephenate, a precursor to tyrosine (Fig. 1B) and phenylalanine. Two additional steps are necessary for the conversion of prephenate into tyrosine (Fig. lC)-an amino group must be added by a transaminase and an unwanted side chain must be removed to make the ring aromatic (Zubay, 1983; Harborne, 1990).

In plants, tyrosine is synthesized by the transamination of prephenate followed by aromatization (Fig. 2A). Tyrosine follows a pathway in which prephenate is transaminated to form the non-protein amino acid, arogenate, and then converted by a dehydrogenase into tyrosine. In Escherichia coli (Miguli) and related bacteria, the conversion of prephenate into tyrosine follows a reverse

310 P. MORRIS AND E. COBABE

0 no no -O-P-O-CH2-C-C-C

I I I '0- p=o

Erythrose Cphosphate I

H,C=C-COO' II I I /p

A A 'n 0 0-

0- \ Phosphoenolppvate

9 Tryptophan I 5 Steps coo- - a

Chorismate

0 I

-- Transamination -- Removal of COO Side Chain

I I

n- c -NH:

Tyrosine Figure 1. Biosynthetic pathway of the aromatic amino acids. The third aromatic amino acid, phenylalanine, is also derived from this pathway by modification of prephenate. (Modified from Stryer, 1981.)

pathway (Fig. 2B). Prephenate is first converted into 4-hydroxyphenyl pyruvate by a dehydrogenase. It is then transaminated to form tyrosine. Thus, a molecule of tyrosine in a plant is formed by a different synthetic pathway than an identical molecule of tyrosine in a bacterium- the last two enzymatic steps are reversed (Harborne, 1990; Zubay, 1983). The variation in these last two steps means that the substrates upon which these enzymes are acting at these last two stages are different. Therefore, the enzymes acting on these substrates are probably different. These two molecules of tyrosine could be considered non- homologous because the synthetic pathways are different, even though the molecules themselves are structurally identical. The synthetic pathway of phenylalanine, the other amino acid derived from prephenate, differs between

MOLECULES AS HOMOLOGIES

0

31 I

PLANTS BACTERIA Prephenate

Prephenate dehydrogenase

coo- l c=o I

Prephenate trnnsaminase

OH Arogenate

coo- l

H-C-NH: I CH,

0 I on

Tyrosine

bH 4-Hydroxyphenyl pyruvate

QHydmxyphenyl pyruvate ttansaminase

YO- H-C-NH:

I 6 OH

Tyrosine

Figure 2. Comparison of tyrosine synthesis in plants and bacteria. (Modified from Stryer, 1981.)

plants and bacteria in the same manner (Harborne, 1990). Anatomically similar structures that differ only in subtle aspects of their development are often considered homologous. Given this, the example of tyrosine may be seen as one of homologous structures. However, when structurally identical molecules are derived from completely dissimilar pathways, that their similarity results from homology is a considerably less favourable hypothesis.

Lysine synthesis pathways

Lysine synthesis is also phylogenetically variable. Bacteria and plants share a similar synthetic pathway, while fungi and at least some protists use a very different pathway. Protein amino acids may be divided into six biosynthetic

312 P. MORRIS AND E. COBABE

Funnal Lvsine Synthesis Bacterial Lvsine Synthesis

a-Ketoglutarate I

coo- l

(CH,), I I CHZ I coo-

HO-C-COO-

Homocitrate 1

cis-Homoaconitate 1

Homoisocitrate 1

u-Ketoadipate 1

a-Aminoadipate I

6 -Adenyl-a-aminoadipate 1

u-Aminoadiptic-6 -semialdehyde 1

Saccharopine I

c H,N H: I I I

(C H 2)3

HCNH:

coo-

Pyruvate Aspartic-S-semialdehyde

2,3-Dihydrodipicolinate I

d-Piperideine-2,6-dicarboyla te 1

N-Succinyl-c-keto-a-aminopirnelate 1

N-Succinyl-L,L-u,c-didnopimelate 1

L,L-u,c-diaminopimelate I

Lysine

Lysine

la Figure 3. Comparison of lysine synthesis pathways found in fungi and bacteria. In these taxa identical molecules of lysine are derived via different biosynthesis pathways from unrelated precursors. (Modified from Zubay, 1983.)

families, that is they are derived from a-ketoglutarate, oxaloacetate, 3-phosphoglycearate, pyruvate, ribose 5-phosphate or by the combination of phosphoenolpyruvate and erythrose-4-phosphate. All these molecules are intermediates from central metabolic pathways (the citric acid cycle and the glycolytic pentose phosphate pathway) (Stryer, 1981). In fungi and Euglena, lysine is synthesized from a-ketoglutarate (the source of glutamine, proline and arginine) (Fig. 3A). In bacteria and plants, lysine belongs to the oxaloacetate biosynthetic family (Fig. 3B) (Zubay, 1983). Here we cannot invoke simple changes of enzyme function to produce a minor change in a synthesis pathway. The synthetic pathways of lysine in a bacterium and a fungus are completely

MOLECULES AS HOMOLOGIES 313

dissimilar, suggesting independent derivations for the molecules in the two groups. A molecule of lysine in a fungus is a structure more closely related in origin to a molecule of proline than a molecule of lysine in a bacterium.

The examples of tyrosine and lysine demonstrate what may be a common problem in biology-otherwise indistinguishable structures formed by different developmental pathways. What do we call homologous in this situation? While a molecule of tyrosine in a bacterium may not be homologous to a molecule of tyrosine in a plant because they have been synthesized by different enzymes, clearly the molecules of lysine in fungi and bacteria are not homologous. These molecules are not closely related to each other, even though they are structurally identical, because the synthetic pathways from which they are derived are so dissimilar.

However, tyrosine and lysine are fundamental amino acids in all organisms. The presence of codons coding for them and requirements for them in proteins are both features that are clearly rooted very deeply in the evolution of life. In these cases, we may claim that the requirement for tyrosine and lysine by these organisms is homologous. The presence of these structures is required by very strong functional constraints. I t is therefore not surprising that we observe identical structures ( tyrosine and lysine) being produced by different synthetic pathways in these taxa.

Functional constraints upon morphology are well known in anatomy. In this manner, tyrosine and lysine synthesis are like the morphology of squid and human eyes. The gross anatomy of these eyeballs is strikingly similar. That they have been independently derived can be deduced from the histological structure of their retinas and their development. The retina of vertebrate eyes is derived by folding of the mesenchymal optic cup, while a squid eye develops from an invagination of epithelium (Simpson, 1967). The gross morphology of eyes with lenses is strongly constrained by the laws of optics. Thus, the unrelated eyes of squids and humans appear morphologically identical. Functionally or structurally constrained molecules, like morphological features, may appear structurally identical, yet be developmentally or synthetically entirely unrelated.

Collagen and other triple helices are good examples of molecules that are highly structurally constrained. For this reason, the sequences in these molecules provide limited phylogenetic information. However, some information about the relatedness of collagen molecules can be derived from the structure of their genes.

COLLAGEN TRIPLE HELICES

Structural constraints on collagens

The recent discovery of a collagen triple helical motif in a capsomer protein of a bacteriophage (Bamford & Bamford, 1990) highlights the diversity of molecules known to contain Gly-X-Y sequences in triple helical domains. Outside of the 15 types of collagens, triple helical regions are believed to occur in number of other molecules (Table 1). A triple helical domain within these molecules consists of three interwound amino acid chains (Fig. 4C). These amino acid chains are each twisted into a conformation termed the collagen fold (Fig. 4B), considered to be a secondary structure of the same class as or-helices and

314 P. MORRIS AND E. COBABE

Polyproline II Helix

P Y Y

Collagen Fold Collagen Triple Helix Figure 4. Amino acid sequence and structure in polyproline I1 helices, the collagen fold, and the collagen triple helix. Note the location of glycine residues in the core of the collagen helix. (Modified from Piez, 1984.)

8-sheets (Miller & Gay, in press). The structure of this collagen fold in a single collagen subunit is effectively a left-handed polyproline type I1 helix (Fig. 4A). Such polyproline I1 helices are fairly common secondary structures, found in extracellular glycoproteins of plants and green algae (van Holst & Varner, 1984; Homer & Roberts, 1979; van den Ende et al., 1988).

Triple helical quaternary structures can form when three such helices, with the amino acid sequence Gly-Pro-Y, are interwound. This structure imposes significant restrictions on the amino acids that may be present in many of the positions of the molecule. Glycine is required in one-third of the sites for simple steric reasons, while the mechanism of helix formation requires a substantial number of proline residues to be present. In a triple helical molecule composed of three interwound collagen folds, every third amino acid on each chain is oriented such that the side chain faces the core of the helix. Glycine is the only amino acid that will fit into this space (Linsenmayer, 1981). Therefore, the presence of glycine in every third site is a structural requirement for a collagenous molecule.

Proline is favoured in the X position (of Gly-X-Y) for slightly more complex reasons. Type I collagen molecules have four main steps in the assembly of their subunits. First, three random-coiled, freshly transcribed subunits are linked to form the C-terminal globular domain. These are linked by disulphide bonds,

TABLE 1. Non-collagenous molecules in which collagenous triple helical domains have been identified either by inference from gene sequence or by demonstration

through XRD

Molecule References

Immune complement Clq Acetocholinesterase Pulmonary surfactant approprotein Serum mannose binding protein Macrophage scavenger receptor Phage PRDl 34 kD capsomer protein

Kilchherr ct al., 1985 Piez, 1984 Floros cf al., 1986 Drickamer cf al., 1986 Kodoma cf al., 1990 Bamford & Bamford, 1990

MOLECULES AS HOMOLOGIES 315

which require the action of an enzyme. Next, the three randomly coiled strands zip up into an ordered triple helix. This coiling is energetically favoured and occurs spontaneously if sufficient proline is present. Next, crosslinks form in the N- terminal globular domain and finally, the terminal globular domains are excised (Kivirikko & Myllyla, 1984). Proline is an imino acid, rather than an amino acid. Its side chain links to both the carbon and the nitrogen of its backbone. This peculiarity endows proline with two characteristics that favour its presence in collagen.

In proline, rotation about the Ca-N bond of the amino acid backbone is restricted, and the unusual cis form of the peptide bond may occur. The trans configuration of the peptide bond is energetically favoured in the amino acids (side chains on adjacent peptides almost always point away from each other). The cis form of this bond may occur in some 16% of the prolines in a randomly coiled peptide. The trans configuration is, however, still the energetically favourable state. The conversion of a freshly transcribed randomly coiled amino acid chain into an ordered structure involves a decrease in entropy and is not an energetically favoured process. The conversion of cis-proline in a randomly coiled procollagen strand to trans-proline in the helix supplies sufficient energy to overcome the gain in entropy. Thus, once three random coil Gly-Pro-Y sequences are linked, they will spontaneously coil into a triple helix (Miller & Gay, in press). In a collagen molecule, glycine must be present for a triple helix to form, and the presence of proline allows the helix to assemble spontaneously. Because of these structural constraints, amino acid sequences provides little historical information in these molecules.

Given these constraints upon the collagen triple helical structure, the presence of triple helical domains in a variety of very disparate molecules (e.g. a vertebrate pulmonary surfactant and a bacteriophage capsomer protein) suggests that this structure may have multiple evolutionary origins. Collagenous triple helical domains may be functionally useful. Amino acid chains with repeating Gly-Pro-Y sequences readily self-assemble into triple helical structures (after the bonding of a terminal domain). Triple helical domains are large, mechanically rigid parts of a molecule. Considerable electrochemical variability may be produced through the choice of amino acids placed in the third position of the structure, and the highly ordered repeating helix may be of utility in organized interaction with other molecules (Piez, 1984). Essentially, these are self- assembling molecules with a preferred structure for a wide range of molecular activities. While it is not entirely implausible that all triple helical domains have a single genealogical source and that gene transfections and duplications have spread this structure into a variety of molecular contexts, it is perhaps more parsimonious to think about a number of independently derived molecules that share a single, versatile and effective structure. This suggests that Bamford & Bamford’s statement that “. . . this peptide motif [collagen] is evolutionarily old . . .” (1990: 497) may not be correct. This structure may simply have been independently derived several times.

Gene slructure as evidence of convergence

Even though the presence of triple helical domains in non-collagenous molecules thus may be best assumed to be a convergent feature, direct evidence

316 P. MORRIS AND E. COBABE

TABLE 2. Classification of collagens by domain structure, type and exon structure. (Compiled from Olsen el al., 1989; Burbelo el al.,

1989; Ramerez, 1989; Miller & Gay, in press)

Collagen structure Collagen types Exon structure

Fi brillar I, 11, 111, V, XI 54 bp repeats FACIT IX, XI1 54 bp repeats Basement membrane IV 3’... 213,99,129,72 ... 5’ Short chain VIII, X, XIV Single exon

for independent origins of triple helical domains comes from the structure of the genes of collagens themselves. At least three major independent families of collagen genes appear to exist in vertebrates. These families are distinguished soley upon the basis of gene structure (Table 2) .

Collagens are a group of extracellular proteins that are widespread in the animal kingdom. All are comprised of three subunits which are largely interwound in triple helical domains. Collagens share some complex post- translational modifications. They undergo hydroxylation of proline and lysine groups (Kivirikko & Myllyla, 1984). Some of the hydroxylysine residues are glycolated with a unique set of disaccharides (Glca 1-2Galfl-0-Lys) (Hakomori el al., 1984). Additionally, collagen molecules are cross-linked by an unusual set of pathways, initiated by the action of lysyl oxidase (Eyre, 1987). These characteristics are not good evidence of relatedness amongst these molecules, since post-translational modification pathways may be shared by unrelated proteins.

Roughly 15 vertebrate collagen types and several additional invertebrate collagens have been described. Vertebrate collagens have been named by roman numerals in the order of their discovery. As a single collagen molecule may be coded for by more than a single gene, some 25 vertebrate collagen genes have been identified. For example, Type I collagens are coded for by two genes (CollAl and CollA2) whose products form a trimer [0l1(1)]~ aP(1). (The polypeptide produced by the CollAl gene is referred to as al(I).) Structurally, these 15 collagens may be divided into four broad groups. The fibrillar collagens (Types I, 11, 111, V, XI) are the typical fibre-forming collagens of cartilage, tendons and bone (Fig. 5,I). The basement membrane collagens (Type IV) form meshworks in the basement membrane, the thin extracellular matric that underlies epithelial cell sheets (Fig. 5,IV). The FACIT collagens (Fibril Associated Collagens with Interrupted Triple Helices) (Types IX, XI1 and XIV) bind to the surface of collagen fibres and are believed to mediate their interactions with other extracellular molecules (Fig. 5,IX). Other short chain collagens (Types VI, VII, VIII and X) are a poorly known group of molecules, with potential roles in development (Fig. 5,X) (Olsen & Nimni, 1989). Thus, the 15 collagens may be classified into four structurally similar groups.

The relatedness of molecules within these groups and the relationships between these groups is not addressable by amino acid sequence analysis. The constraints imposed by the structure of collagen upon the sequence of the molecule are sufficient to prevent division of the c. 25 collagen genes into phylogenetically related groups. Collagen gene structures can answer questions about their phylogenetic relationships. The fibrillar collagens have genes

MOLECULES AS HOMOLOGIES 317

Type Domains Molecular Aggregates

I @1_1 -- Ip H -=F X L1I

JNHI

KEY eGL T i I , a 4 . NH 0s GAG

Figure 5. Domain structure and molecular aggregates of some representative collagens. Each molecule figured is composed of three peptide chains. I, Type I collagen, a typical fibrillar collagen, forms with terminal globular domains that are trimmed off before extracellular aggregation. The body of the molecule, some 900 amino acids long, is a single large triple helical domain composed of GIy-X-Y repeats. Type I collagen polymerizes to form ordered fibres. IV, Type IV collagen has a globular terminal domain and numerous small non-helical regions. I t polymerizes into an open meshwork in the basal lamina. IX, Type IX collagen (a FACIT collagen) has a terminal globular domain, three non-helical interupts and has a glycosaminoglycan bound to one of its chains. FACIT collagens may be found attached to the surface of collagen fibres, possibly mediating interactions between the fibres and other matrix components, such as the GAG bearing proteoglycans. X, Type X collagen is a small collagen with an interupted triple helical domain that is capped by short non- helical domains. Key: GL, globular domain; TH, triple helical domain; NHI, short non-helical interuption; NH, non-helical domain; OS, 0-linked mono or disaccharide; GAG, glycosaminoglycan (chain of charged amino sugars). (Modified from Kiririkko & Myllyla, 1984; Olsen & Nimini, 1989; and Ninomiya ct al., 1989.)

composed of multiple exons that are 54 bp in length. The Type I V collagens of basement membranes are divided into exons of variable lengths, while the short chain (Type VIII and X ) collagens are encoded for by a single large exon (Ninomiya et al., 1989; Olsen & Nimni, 1989).

Most of the exons in the genes that are encoding the fibrillar collagen chains are composed of either 54 base pairs or are multiples of 54 bp. That is, they code for multiple of six Gly-X-Y repeats. The other exons are 45 or 99 base pairs in length; they have deleted one Gly-X-Y repeat. For example, in the gene for the 012 helix of Type I collagen from the chicken, there are 42 exons encoding the triple helical domain. Of these, 23 are 54 bp in length, eight are 108 bp, one is 162 bp, five are 45 bp and five are 99 bp in length (Ninomiya et al., 1989). This exon structure is believed to have been derived by multiple recombination of an ancestral gene of 54 bp in length (coding for 18 amino acids, six Gly-X-Y repeats, or one and a half turns of the helix) (Ramirez, 1989). The presence of

318 P. MORRIS AND E. COBABE

108 and 162 bp units is easily explained by fusion of 54 bp exons. Such a fusion has been documented in the gene for the a1 chain of Type I collagen. This gene differs from the a2(I) , a1 (11)) a1 (111) and a2(V) genes in the fusion of two 54 bp exons (numbers 33 and 34) into a single 108 bp exon. The 99 bp exons can be explained by fusion of 54 and 45 bp exons. The 45 bp exons are one Gly-X-Y repeat shorter than the 54 bp exons and may also be derived from an ancestral 54 bp sequence (Ramirez, 1989).

The FACIT collagens (Type IX, XI1 and XIV) also have a triple helical domain with 54 bp exons. However, they have a second triple helical domain that appears to have been derived from a 36 bp sequence. Some aspects of the gene structure, such as split exons and exons that code for both triple helical and non-helical domains suggest that these molecules are fairly distant from the fibrillar collagens (Olsen et al., 1989; Ninomiya et al., 1989). However, the presence of a triple helical domain with 54 bp exons does suggest that these genes are related to the same ancestral sequence as the fibrillar collagens, and thus to the same broad family.

The basement membrane collagens (the products of the Type IV gene family) appear to be unrelated to the FACIT or fibrillar collagens. Genes of human and Drosophila Type IV collagens are divided into exons of varying sizes that lack an underlying 54 bp repeat. The human Type IV collagen is divided into 52 exons. These exons vary in size from 26 to 2 13 bp, only two are 54 bp long. Seven exons coding for triple helical sequences are clearly unrelated to a 54 bp repeat (Soininen et al., 1989). The N-terminal exon structure of human and mouse a1 (IV) collagen genes are similar (Burbelo et al., 1989)) however a Drosophila Type IV collagen gene, a l ( I V ) , is divided into only nine large introns (Blumberg, Mackrell & Fessler, 1988). Sequences at the junctions of two of these exons show a 75% sequence similarity with sequences at comparable locations in the human gene. Four more sequences of greater than 70% sequence similarity show evidence of rearrangement in the order of exons. These sequences similarities do not occur between the al(1V) and the a2(IV) genes. The structure of these genes and the domain structure of Type IV collagen suggests that Type I V collagen genes are related, but that the duplication of the a1 (IV) and the a2(IV) genes occurred prior to the separation of protostome and deuterostome lineages. (Soininen et al., 1989).

The Type VIII and X short chain collagens are encoded by genes with only a single large exon coding for most of the molecule. They have a few small N-terminal exons, but the bulk of the gene has no introns. The chicken a l ( X ) gene is composed of three exons: a short 5’ untranslated exon, a short exon coding for a signal peptide and the N-terminal non-helical domain, and a single large exon encoding the entire triple helical domain and the C-terminal non- helical domain (Ninomiya et al., 1989). The gene structure of Type VIII collagen is similar. It has a single large exon coding for the entire triple helical domain (B.R. Olsen, personal communication, 1989), These short chain collagens are similar in domain structure, gene structure and the location of the eight non-helical interruptions of the triple helix. They also have about a 60% DNA sequence similarity (Ninomiya et al., 1989).

On the evidence of gene structure, there are at least three independent families of molecules termed collagens. One group of collagens are encoded for by genes with 54 bp exons. These are the fibrillar collagens which have long

MOLECULES AS HOMOLOGIES 319

uninterrupted triple helical domains and the FACIT collagens which have interrupted triple helices and large globular domains. In contrast, the Type IV collagens have interrupted triple helices, large globular domains and an incompletely determined gene structure that appears to lack 54 bp exons. The known short chain collagens are clearly different. They are small, have globular terminal domains, and an interrupted triple helix encoded for by a single large exon. The gene structures of other collagens are yet to be determined (Ninomiya et al., 1989).

The genes for human collagens are dispersed over several chromosomes. Few are in close proximity to each other. Genes for the Type I, I1 and I11 collagens are spread over chromosomes 17, 7, 12 and 2. These fibrillar collagens share the same repeated 54 bp exon structure. Since duplicated genes are usually found in close proximity this chromosomal dispersion suggests that they are either unrelated (in which case, the shared exon structure must be explained) or are products of a very ancient gene duplication (Miller & Gay, in press). The genes for the al(1V) and a2(IV) collagen subunits are located on chromosome 13 (Burbelo et al., 1989), while collagen IV genes are known from chromosome 21 and a collagen I X gene occurs on chromosome 6 (Olsen et al., 1989). These data are consistent with independent origins of these collagen gene families.

Despite the complexity and oddity of their synthesis, the presence of different gene structures suggests that collagens have multiple independent evolutionary origins. The similarity in primary structure between these proteins is largely produced by structural constraints upon the collagen fold. At least three long distinct, if not independent, families of collagen genes exist that produce molecules with these characteristics and with differing functions in the extracellular matrix.

POST-TRANSLATIONAL PATHWAYS IN PROTEINS

Immune complement protein Cql shares both triple helical domains and lysyl oxidase crosslinking with collagens. The lysyl oxidase crosslinking pathway is almost entirely restricted to collagens. The enzyme lysyl oxidase initiates crosslink formation between various lysine and hydroxylysine residues (Fig. 6A,B). It can form lysine-lysine crosslinks (lysinorleucine crosslinks), but most pathways involve hydroxylysine residues (residues that must have been enzymatically modified from lysine during protein synthesis) (Eyre, 1987) (Fig. 6C). It may be suggested that the shared presence of lysyl oxidase crosslinking in complement Cq 1 and collagen is evidence that the triple helical domains of these molecules are related. Unfortunately, unlike the synthesis of non-peptide molecules, similarities in post-translational modifications of proteins are not developmental evidences for homology.

Lysyl oxidase-initiated crosslinking pathways are largely restricted to collagens. However, elastin, an entirely unrelated protein found only in chordates, is also crosslinked by a lysyl oxidase pathway. Even if we assume that the enzyme lysyl oxidase used in elastin crosslinking is homologous to that used in collagen crosslinking, the shared presence of this pathway in collagen and elastin is not an indication of a genealogical relationship between these molecules. Elastin and collagen are completely different in both amino acid sequence and structure (Fig. 7 ) . Elastin is composed of repeating random coil

320 P. MORRIS AND E. COBABE

93Qn N9 01 HLN

87” C A

B

C

Lysi ne I I Lysyl oxidase

LN HLN

<a-

Lysinorleucine Crosslink (LN)

Hydroxylysine

Lysyl oxidase I 1

H ydroxyall ysine

HP Figure 6. Formation of intermolecular crosslinks in collagen. A, Two difunctional crosslinking sites in Type I collagen. A number of di-, tri- and tetrafunctional crosslinks form between lysine. For example, a HLN crosslink forms between a lysine residue 16 amino acids from the C-terminus of a strand of collagen and a hydroxylysine residue 87 residues from the N-terminus of a strand in an adjacent trimer. The formation of crosslinks between lysines in stylized positions such as these maintains the highly ordered and mechanically strong organization of collagen molecules into banded fibres. B, Formation of a lysinorleucine (LN) crosslink, illustrating the chemistry of lysyl oxidase crosslink formation. A lysine residue is modified to an allysine by the action of the enzyme lysyl oxidase. This modified residue is brought into proximity with a lysine residue on another molecule. These residues react without further enzymatic activity to form a difunctional LN crosslink. C, Pathways by which the common di- and trifunctional lysyl oxidase crosslinks are formed. Modification of lysine residues results in the formation of the difunctional LN and HLN (hydroxylysinorleucine) crosslinks (also tri- and tetrafunctional crosslinks, not illustrated). Modification of hydroxylysine by lysyl oxidase leads to the formation of a similar set of di- and trifunctional crosslinks. The LN pathway shown in B is marked by a dotted line in C. This is the pathway of lysyl oxidase crosslink formation found in elastin. (Modified from Eyre, 1987; Miller & Gay, in press and Kiririkko & Myllyla, 1984.)

MOLECULES AS HOMOLOGIES

Type I Collagen

32 1

F1-J ,Triple Helical (Gly-X-Y) Domain

Crosslinking Sites Non-Helical Domain

Elastin ,Random Coil Domain

\Crosslinking Domain

Figure 7. Domain structure in elastin and collagen I. A, A molecule of Type I collagen is composed of a long triple helical domain capped at each end by short non-helical domains. Crosslinking sites appear to be concentrated at the ends of the molecule but are found in both domains. An attached disaccharide is illustrated. B, A molecule of elastin is composed of repeating domains. Each repeat is composed of a short crosslinking domain and a larger random coil domain. The crosslinking domain has a conserved sequence and contains several lysine residues that link to other elastin molecules. The thermodynamics of this interlinked set of random coils are responsible for the elastic properties of the molecule. (Modified from Kiririkko & Myllyla, 1984; Miller & Gay, in press and Franzblau el al., 1989.)

domains and crosslinking domains (Franzblau el al., 1989). The activity of lysyl oxidase in elastin can only indicate either independent origins of an enzyme with lysyl oxidase activity in elastin or the co-opting of a lysyl oxidase from collagen synthesis to elastin synthesis. (Collagens and their crosslinks are widespread amongst all animals, while elastin is restricted to chordates.) Elastin is clearly of more recent phylogenetic origin than collagen.

The shared presence of lysyl oxidase crosslinking in elastin and collagen indicates that enzymes evolved for the processing of one molecule may become involved in the processing of another unrelated molecule. A similarity in the post-translational modification of two proteins (a developmental similarity) is not an indication of a genealogical relationship between those proteins. The shared presence of hydroxylysine in collagen and complement Cql is not evidence of a homologous relationship between those molecules. The enzymes used for hydroxylation of lysine in collagen synthesis and complement Cql synthesis may be homologous, but this does not provide evidence of homology between triple helical domains in collagen and complement Cq 1. The triple helical domains in complement Cql are seen as having an independent origin from the triple helical domains in collagens. Similarity of synthesis in proteins is not evidence of homology between those proteins.

322 P. MORRIS AND E. COBABE

CONCLUSION

While much has been gained from the use of sequence similarity to determine phylogenetic relationships, clearly molecules can be used as homologies in the more traditional sense. A good example of the utility of molecules as characters is the major revision of bacterial taxonomy that has occurred in the last decade. Structural, developmental and positional characteristics are used to evaluate anatomical homologies; in the same way, structural, synthetic and genetic characteristics can be used in molecules. The examples of amino acid synthesis illustrate strong phylogenetic constraints upon these molecules. The complex system by which genetic material is translated and transcribed into proteins was fixed long ago, and with it, the requirement for the 20 amino acids. Organisms may have diverged in their pathways for the synthesis of these molecules, but they are historically constrained to produce molecules of these chemical structures (morphologies). The large quantity of glycine in collagenous molecules is an example of a structural constraint. The construction of a collagenous triple helix imposes a steric constraint upon the amino acids that may occupy the core of the helix. The structure of collagen genes suggests that this is a group of molecules that were independently derived several times. The versatility of collagen as an electrochemically variable structure may have lead to these convergences.

As with anatomical data, structural identity in molecules is not always indicative of relatedness. Molecules can be highly structurally and functionally constrained. In proteins, this may require a demonstration of homology beyond sequence similarity, such as gene structure or chromosome placement. The shared similarity of lysyl oxidase crosslinking in the clearly unrelated proteins, collagen and elastin, demonstrates that similarity in post-translational modifications is not evidence of homology in proteins. In non-peptide molecules, such as sugars, demonstration of homology at the level of the synthetic pathway may be required to substantiate the relationships between identical molecules. Given this information, however, homologies in molecules may provide much needed information in determining genealogical relationships.

ACKNOWLEDGEMENTS

We would like to thank Stephen Jay Gould, Kathy Burgess and Warren Allon for their insightful comments on this manuscript.

REFERENCES

BAMFORD, D. H. & BAMFORD, J. K. H., 1990. Collagenous proteins multiply. Nature, 334: 497. BLUMBERG, B., MACKRELL, A. K. & FESSLER, J. H., 1988. Drosophila basement membrane prorollagen

a1 (IV): 11. Complete cDNA sequence, genomic structure, and general implications for supra-molecular assemblies. journal of Biological Chemistry, 253: 18328-18337.

BURBELO, P., KILLEN, P. D., EBIHARA, I . , SAKURAI, Y. & YAMADA, Y., 1989. Structure and expression of collagen IV genes. In B. R. Olsen & M. E. Nimni (Eds), Collagen: Volume ZV, Molecular Biology. Boca Raton: CRC Press.

DRICKAMER, K., DORDAL, M. S. & REYNOLDS, L., 1986. Mannose-binding proteins isolated from rat liver contain carbohydrate-recognition domains linked to collagenous tails. Journal of Biological Chemistry, 261: 6 a 7 a - m ~ .

MOLECULES AS HOMOLOGIES 323

EYRE, D. R., 1987. Collagen cross-linking amino acids. Methods in Engvnology, I#: 115-139. FLOROS, J. ct al . , 1986. Isolation and characterization of cDNA clones from the 35-kDa pulmonary

surfactant-associated protein. Journal of Biological Chemistry, 261: 9029-9033. FRANZBLAU, C., PRAT, C. A., FARIS, B., COLANNINO, N. M., OFFNER, G. D., MOGAYZEL, P. J.

Jr. & TROXLER, R. F., 1989. Role of tropoelastin fragmentation in elastogenesis in rat smooth muscle cells. Journal of Biological Chemistry, 284: 151 15-15119.

HAKOMORI, S., FUKUDA, M., SEKIGUCHI, K. & CARTER, W. G., 1984. Fibronectin, laminin, and other extracellular glycoproteins. In K. A. Piez & A. H. Reddi (Eds', Extracellular Matrix Biochemzstry. New York: Elsevier.

HARBORNE, J. B., 1990. Constraints on the evolution of biochemical pathways. Biological Journal of the Linnean Society, 39: 135-1 5 1.

HOMER, R. B. & ROBERTS, K., 1979. Protein conformation in plant cell walls. Circular dichroism reveals a polyproline I1 structure. Planta, 146: 21 7-222.

JEUNIAUX, C., 1963. Chitin et chitiolyse. Paris: Masson. KERKUT, G. A., 1960. Implications of Evolution. Oxford: Pergamon Press. KILCHHERR, E., HOFFMAN, H., STEIGENANN, W. & ENGEL, J., 1985. Structural model of the

collagen-like region Clq comprising the kink region and the fiber-like packing of the six triple helices. Journal of Molecular Biology, 186: 403415.

KIVIRIKKO, K. I. & MYLLYLA, R., 1984. Biosynthesis of the collagens. In K. A. Piez & A. H. Reddi (Eds), Extracellular Matrix Biochemistry. New York: Elsevier.

KODAMA, T., FREEMAN, M., ROHRER, L., ZABRECKY, J., MATSUDAIRA, P. & KRIEGER, M., 1990. Type I macrophage scavenger receptor contains a-helical and collagen-like coiled coils. Nature, 343: 531-535.

LEWIN, R., 1987. When does homology mean something else? Science, 237: 1570. LINSENMAYER, T . F., 1981. Collagen. In E. D. Hay (Ed.), Cell Biology of the Extracellular Matr ix . New York:

Plenum. MICHAUX, B., 1989. Homology: A question of form or a product of genealogy? Rivista dz Biologica-Biology

Forum, 82: 217-246. MILLER, E. J. & GAY, S., 1991. Collagen structure and function. In Cohen el al . (Eds), Wound Healing:

Philadelphia: W. B. Saunders. NINOMIYA, Y., CASTAGNOLA, Y., GERECKE, D., GORDON, M., JACENKO, O., LUVALLE, P.,

MCCARTHY, M., MURAGAKI, Y., NISHIMURA, I., OH, S., ROSENBLUM, N., SATO, N., SUGRUE, S., TAYLOR, R., VASIOS, G., YAMAGUCHI, N. & OLSEN, B. R., 1989. The molecular biology of collagens with short triple-helical domains. In C. Boyd, P. Byers & L. Sandell (Eds), Collagen Genes: Structure, Regulation and Abnormalities: San Diego: Academic Press.

OLSEN, B. R. & NIMNI, M. E. (Ed), 1989. Collagen: Volume I V , Molecular Biology. Boca Raton: CRC Press. OLSEN, B. R., NINOMYIA, Y., GERECKE, D., GORDON, M., GREEN, G., KIMURA, 'I'.,

MURAGAKI, Y., NISHIMURA I. & SUGRUE, S., 1989. A new dimension in the extracellular matrix. In B. R. Olsen & M. E. Nimni (Eds), Collagen: Volume I V , Molecular Biology. Boca Raton: CRC Press.

PATTERSON, C., 1988. Homology in classical and molecular biology. Molecular Biology and Evolution, 5: 603425.

PIEZ, K. A., 1984. Molecular and aggregate structures of the collagens. In K. A. Piez & A. H. Reddi (Eds), Extracellular Matrix Biochemistry. New York: Elsevier.

RAMIREZ, F., 1989. Organization and evolution of the fibrillar collagen genes. In B. R. Olsen & M. E. Nimni (Eds), Collagen: Volume I V , Molecular Biology. Boca Raton: CRC Press.

ROCHE, J., THOAI, N.-V. & ROBIN, Y., 1957. Sur la prksence de la crtathe chez les inverttbrks et sa signification biologique. Biochimica el Biophysics A d a , 24: 514-5 19.

ROTH, V. L., 1984. On homology. Biological Journal of the Linnean Society, 22: 13-19. RUDALL, K. M. & KECHINGTON, W., 1973. The chitin system. Biological Review, 49: 597-636. SCHLEIFER, K. H. & STACKEBRANDT, E. (Eds), 1985. Evolution of Prokaryotes. Federation of European

SIMPSON, G. G., 1967. The Meaning of Evolution. New Haven: Yale University Press. SOININEN, R., HOUTARI, M., GANGULY, A., PROCKOP, D. J. & TRYGGVASON, K., 1989.

Structural organization of the gene for the a1 chain of human Type IV collagen. Journal of Biological Chemistry, 264: 13565-13571.

Microbiological Societies, Symposium No. 29. New York: Academic Press.

STRYER, L., 1981. Biochemistry, 2nd edition. New York: W. H. Freeman. TERWILLIGER, R. C., 1980. Structure of invertebrate haemoglobins. American <OO/Ogi51, 20: 53-67. VAN DEN ENDE, H., KLIS, F. M., MUSGRAVE, A. & STEGWEE, D., 1988. Sexual agglutination in

Chlamydomonas eugamelos. In G . P. Chapman, C. C. Ainsworth & C. J. Chatham (Eds), Eukaryote Cell Recognition. Cambridge: Cambridge University Press.

VAN HOLST, G. J. & VARNER, J. E., 1984. Reinforced polyproline I1 conformation in a hydroxyproline- rich cell wall glycoprotein from carrot root. Plant Physiology, 74: 247-251.

WILMER, P., 1990. Invertebrale Relationships. Cambridge: Cambridge University Press. WOESE, C. R., MAGRUM, L. J. & FOX, G. E., 1978. Archaebacteria. Journal of Molecular Evolution, 11:

245-252.

324

WOESE, C. R. , 1982. Archaebacteria and cellular origins: an overview. <mfralblaff fuer Bakferiologie, Parasitenkunde, und Infecktionskrankhciten. Abteilung 1. Medizinischc-Hygicnischc Bakteriologie Virusforschung und Tierischc Parasifologie. Reihe C, 3: 1-17.

WOESE, C. R., 1985. Why study evolutionary relationships among bacteria? In K. H. Schleifer & E. Stackebrandt (Eds), Evolution of Prokaryotes. Federation of European Microbiological Societies, Symposium No. 29. New York: Academic Press.

P. MORRIS AND E. COBABE

ZUBAY, G., 1983. Biochemistry, 2nd edition. Reading, Mass: Addison-Wesley.