Conservation and Functional Element Discovery in 20 Angiosperm Plant Genomes


Article

Daniel Hupalo*,1 and Andrew D. Kern2

1Department of Biological Sciences, Dartmouth College, Hanover, New Hampshire
2Department of Genetics, Rutgers University

*Corresponding author: E-mail: [email protected].

Associate editor: Hideki Innan

Abstract

Here, we describe the construction of a phylogenetically deep, whole-genome alignment of 20 flowering plants, along with an analysis of plant genome conservation. Each included angiosperm genome was aligned to a reference genome, Arabidopsis thaliana, using the LASTZ/MULTIZ paradigm and tools from the University of California–Santa Cruz Genome Browser source code. In addition to the multiple alignment, we created a local genome browser displaying multiple tracks of newly generated genome annotation, as well as annotation sourced from published data of other research groups. An investigation into A. thaliana gene features present in the aligned A. lyrata genome revealed better conservation of start codons, stop codons, and splice sites within our alignments (51% of features from A. thaliana conserved without interruption in A. lyrata) when compared with previous publicly available plant pairwise alignments (34% of features conserved). The detailed view of conservation across angiosperms revealed not only high coding-sequence conservation but also a large set of previously uncharacterized intergenic conservation. From this, we annotated the collection of conserved features, revealing dozens of putative noncoding RNAs, including some with recorded small RNA expression. Comparing conservation between kingdoms revealed a faster decay of vertebrate genome features when compared with angiosperm genomes. Finally, conserved sequences were searched for folding RNA features, including but not limited to noncoding RNA (ncRNA) genes. Among these, we highlight a double hairpin in the 5′-untranslated region (5′-UTR) of the PRIN2 gene and a putative ncRNA with homology targeting the LAF3 protein.

Key words: Arabidopsis, alignment, conservation, comparative genomics, ultraconserved elements, angiosperm, RNA folding.

Introduction

Within the past decade, a flood of whole-genome data has enabled a comparative genomics approach to functional element discovery. The construction of phylogenetically deep, whole-genome multiple alignments in models such as humans (Miller et al. 2007; Rhead et al. 2010; Fujita et al. 2011), Drosophila (Drosophila 12 Genomes Consortium et al. 2007), and yeast (Kellis et al. 2003) has allowed the research community to understand each genome in a comparative framework. These alignments have bridged annotation between similar species, and subsequent investigations in each individual organism have utilized these resources to discover a variety of functional genomic elements and genome characteristics (Pedersen et al. 2006; Stark et al. 2007; Friedman et al. 2009; Kim et al. 2009; Stojanovic 2009).

Comparative genomic methods that use sequence similarity, protein alignments, and whole-genome alignments between two and five species have been widely applied by plant scientists to rice and Arabidopsis. Initially, these investigations into angiosperms focused primarily on synteny relationships between species (Acarkan et al. 2000; Ku et al. 2000; Gebhardt et al. 2003; Tang, Bowers, et al. 2008; Tang, Wang, et al. 2008), but have subsequently expanded into observations of lineage-specific protein-coding genes (Campbell et al. 2007; Yang et al. 2009), RNA genes (Michaud et al. 2011), miRNAs (Zhang et al. 2006; Lenz et al. 2011), and of particular note, conserved noncoding sequences (Kaplinsky 2002; Guo 2003; Inada et al. 2003; Thomas et al. 2007; Wang et al. 2009; Kritsas et al. 2012). As the availability of sequenced species increases, comparative genomics in plants may now be performed using the same powerful frameworks and methodologies that have been applied to other model systems.

The wealth of genetic resources available for work on Arabidopsis thaliana, combined with its compact genome, has made it the prime target for comparative genomics research within plants (Schmidt 2002). Currently, there exist dozens of sequenced angiosperm genomes, along with a large number of sequenced Arabidopsis genomes. This wealth of data, in conjunction with the detailed molecular biological characterization of plant genes available from The Arabidopsis Information Resource (TAIR) (Lamesch et al. 2011), has the potential to reveal a more complete set of functional elements in the A. thaliana genome through the use of sequence comparison. One major axis of motivation for this research is the need to bridge biological knowledge gained from study of Arabidopsis to agricultural plants (Morrell et al. 2011); comparative genomics can be a potent tool toward these ends.

© The Author 2013. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]

Mol. Biol. Evol. 30(7):1729–1744 doi:10.1093/molbev/mst082 Advance Access publication May 2, 2013 1729

Downloaded from http://mbe.oxfordjournals.org/ by guest on July 6, 2016

For some time now, many pairwise and small-scale multiple plant genome alignments have been available, mainly based on the VISTA comparative genomics pipeline (Dubchak et al. 2000; Frazer et al. 2004). This system has utilized the LAGAN alignment tool (Brudno et al. 2003) to generate dozens of Arabidopsis-based pairwise alignments, as well as create five-way multiple alignments in model organisms (Brudno et al. 2007). Yet, no attempt is known to have been made to create or analyze a deep merged data set that can assess general conservation across genera, in similar treatment to that seen in all other kingdoms of life. To address this, we have used the University of California–Santa Cruz (UCSC) source tree (Kent et al. 2002) in combination with a LASTZ/MULTIZ paradigm (Blanchette et al. 2004; Harris 2007) to create a 20way plant alignment that reaches nearly to single-nucleotide resolution of conservation; we have provided that information in its entirety to the plant community via a plant genome browser available at genome.genetics.rutgers.edu.

A major goal for our research is to characterize patterns of global conservation within angiosperms and leverage conservation data for functional element discovery. A recent analysis of a 105 kb syntenic segment of sequence between five Solanaceae demonstrated that measuring the conservation of DNA in plants can be a potent method of investigation for coding and noncoding sequence (Wang et al. 2008). This look into the nightshade family, along with investigations in fruit flies (Drosophila 12 Genomes Consortium et al. 2007), humans (Miller et al. 2007; Rhead et al. 2010; Fujita et al. 2011), and yeast (Kellis et al. 2003), has made clear the utility of a comparative genomics perspective on genome function. Identifying and combining conserved regions of the A. thaliana genome with known annotation from the plant community will help identify novel highly conserved features and provide insight into contrasting evolutionary histories among the kingdoms of life.

Results

Alignment of Angiosperms to an A. thaliana Reference Genome

We have assembled the largest comparative genomic data set in plants to date, using whole-genome sequence data spanning the breadth of flowering plants. Choice of species to include in the alignment was based on data availability, and, in some cases, on simplicity of genome architecture. The wheat genome, for example, was excluded due to its size and complexity. The included species span all angiosperms, with representatives from four monocot Poaceae (Goff et al. 2002; Paterson et al. 2009; Schnable et al. 2009; Vogel et al. 2010), as well as 16 eudicots including four Brassicales (Arabidopsis Genome Initiative 2000; Ming et al. 2008; Hu et al. 2011; Wang et al. 2011), one Malvale (Argout et al. 2011), two Malpighiales (Tuskan et al. 2006; Chan et al. 2010), four Fabales (Retzel et al. 2007; Sato et al. 2008; Kim et al. 2010; Schmutz et al. 2010), one Cucurbitale (Huang et al. 2009), two Rosales (Velasco et al. 2010; Shulaev et al. 2011), one Vitale (Velasco et al. 2007), and one Solanaceae (Xu et al. 2011). The common names and genome details can be reviewed in table 1, sorted by divergence from A. thaliana.

Table 1. Species Information and Alignment Coverage for Each Included Species in the 20way Comparison.

Name                             Common Name         Type               Nucleotides  Assembly Date         Total Align (%)  CDS Align (%)  Subs/Site(a)
Arabidopsis thaliana             Thale cress         Pseudochromosomes  119 Mbp      February 2009 TAIRv9  -                -              -
Arabidopsis lyrata               Lyrate rockcress    Pseudochromosomes  206 Mbp      May 2011              77.82            98.16          0.09
Brassica rapa                    Chinese cabbage     Scaffold           274 Mbp      August 2011           65.85            96.78          0.35
Carica papaya Linnaeus           Papaya              Scaffold           342 Mbp      December 2007         34.84            79.70          1.04
Theobroma cacao                  Cocoa               Scaffold           290 Mbp      August 2010           36.59            83.28          1.11
Vitis vinifera                   Grape               Pseudochromosomes  497 Mbp      March 2010            32.21            75.93          1.19
Populus trichocarpa              Poplar              Scaffold           417 Mbp      March 2011            35.41            81.40          1.21
Malus × domestica Borkh.         Apple               Scaffold           881 Mbp      November 2009         35.07            80.89          1.24
Ricinus communis                 Castor bean         Scaffold           350 Mbp      February 2009         34.31            80.78          1.25
Fragaria vesca                   Strawberry          Scaffold           214 Mbp      June 2010             34.43            80.30          1.28
Glycine max                      Soybean             Pseudochromosomes  973 Mbp      January 2010          36.77            78.67          1.33
Glycine soja                     Wild soybean        Sequence Reads     973 Mbp      -                     36.06            78.94          1.33
Lotus japonica                   Birdsfoot trefoil   Pseudochromosomes  301 Mbp      May 2008              25.92            62.84          1.43
Cucumis sativus var. sativus L.  Cucumber            Scaffold           203 Mbp      January 2010          32.54            76.79          1.47
Medicago truncatula              Clover              Scaffold           307 Mbp      August 2007           28.40            67.49          1.51
Solanum tuberosum                Potato              Scaffold           727 Mbp      July 2011             34.88            77.77          1.52
Sorghum bicolor                  Sorghum             Pseudochromosomes  738 Mbp      January 2007          26.24            63.93          1.92
Oryza sativa L. ssp. Japonica    Rice                Pseudochromosomes  373 Mbp      January 2009          25.33            63.39          1.94
Brachypodium distachyon          Purple false brome  Scaffold           271 Mbp      December 2009         25.37            63.68          1.95
Zea mays ssp. Mays               Corn                Pseudochromosomes  2.06 Gbp     March 2010            25.96            62.93          1.96

(a) The "Substitutions per Site" column lists the divergence from A. thaliana based on the neutral tree of figure 1.


There is a diverse set of methods available for whole-genome alignment, including both open-source and commercial packages. Our goal was both to create a deep multiple alignment and to make that data set available for community use. The success in the alignment of vertebrate genomes, and their subsequent browsable alignment, demonstrated that both these goals can be achieved in an integrated open-source manner (Miller et al. 2007; Rhead et al. 2010; Fujita et al. 2011). Following this example, we created a mirror of the UCSC genome browser (genome.genetics.rutgers.edu) and built within its framework databases for multiple plant species. Currently, A. thaliana is the model browser for eudicot species, using TAIR version 9 annotations, and an additional browser with a TAIR version 8 assembly is provided for legacy support. Each of the 20 genomes was aligned in a pairwise fashion using tuned parameters (see Materials and Methods), followed by chaining and conversion to pairwise alignment files. A phylogenetic tree covering all included species, without branch lengths, was drawn from an angiosperm supertree (Davies et al. 2004) and used to guide the MULTIZ (Blanchette et al. 2004) merging of pairwise alignments. Using the 20-way alignment, branch lengths for a neutral tree based on 4-fold degenerate sites were computed using the PHAST package (Hubisz et al. 2011) and are displayed in figure 1. Other analyses of eudicots and angiosperms have constructed phylogenies with similar substitutions per site as those seen in the neutral tree used in this investigation (Yang et al. 1999; Tang, Wang, et al. 2008).
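The neutral tree rests on 4-fold degenerate sites, that is, third codon positions at which any nucleotide yields the same amino acid. As a rough illustration only, such sites could be located in an in-frame coding sequence as sketched below. The branch lengths in this study were computed with the PHAST package, not with this code; the function names are hypothetical, and the standard nuclear genetic code is assumed.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def fourfold_prefixes():
    """Return the two-base prefixes whose four codons all encode one amino acid."""
    prefixes = set()
    for first, second in product(BASES, repeat=2):
        encoded = {CODON_TABLE[first + second + third] for third in BASES}
        if len(encoded) == 1:
            prefixes.add(first + second)
    return prefixes


def fourfold_sites(cds):
    """Return 0-based indices of 4-fold degenerate third positions in an in-frame CDS."""
    prefixes = fourfold_prefixes()
    cds = cds.upper()
    sites = []
    for start in range(0, len(cds) - 2, 3):
        if cds[start:start + 2] in prefixes:
            sites.append(start + 2)
    return sites
```

For example, `fourfold_sites("ATGGCTTAA")` reports only the third position of the alanine codon GCT, because ATG and TAA do not belong to 4-fold degenerate codon families.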

Base pair coverage for the whole genome and for coding DNA sequence (CDS) regions is presented in table 1 sorted by divergence from A. thaliana based on a neutral phylogenetic tree. Genomes included in the alignment vary greatly in terms of genome architecture, sequence quality, size, and phylogenetic distance from the reference. The coverage shows generally similar patterns compared with numbers gathered from mammalian alignments (Miller et al. 2007; Rhead et al. 2010; Fujita et al. 2011). It is informative to compare alignment coverage at various evolutionary distances between vertebrate alignments (Miller et al. 2007; Rhead et al. 2010; Fujita et al. 2011) and plant alignments. For instance, A. thaliana and Brassica rapa are roughly as divergent as humans and the galago Otolemur garnettii at 0.35 and 0.33 substitutions per 4D-site, respectively. At this level of divergence, our plant alignment shows a greater proportion of aligned bases (65.8% vs. 44.3% aligned, respectively). Coding region alignments in this comparison follow suit with 96% versus 80% aligned base pairs in plants versus animals. Looking at the most diverged species comparison in our analysis, A. thaliana to Zea mays (1.96 substitutions per site), we find this is roughly comparable to the divergence between human and Xenopus tropicalis (1.97 substitutions per site). In this comparison, vertebrates lose a greater amount of overall alignment (26% aligned in plants vs. 8% in vertebrates); however, the vertebrate coding regions are more conserved (62% aligned in plants vs. 87% in vertebrates). Despite the differences seen in one-to-one comparisons, we observe a shared pattern that as distance increases, coverage by whole-genome sequence drops precipitously, bottoming at roughly 35% across eudicots and 26% across monocots. Unsurprisingly, protein-coding sequence shows higher conservation, never dropping below 62%.

Coverage and Gene Feature Comparisons in the Arabidopsis Genus

The VISTA genome browser has made available for public use a number of precomputed whole-genome alignments of plant genomes (Frazer et al. 2004; Brudno et al. 2007).

[Figure 1: tree of the 20 aligned species, with clades labeled Eudicot and Monocot; scale bar, 0.1 substitutions per site.]

FIG. 1. A phylogenetic tree of the relationships between species included in the 20way angiosperm alignment and used to guide MULTIZ merging of pairwise alignments. The neutral tree is based on 4-fold degenerate sites sampled from each chromosome with branches proportional to the listed scale, with substitutions per site determined by the PhyloFit software. Average trees for conserved and nonconserved regions with branch lengths are available in supplementary figure S5, Supplementary Material online.


These alignments range from pairwise up to 4way and have been used by the scientific community for comparisons between angiosperm genomes (Swarbreck et al. 2008; Zeller et al. 2009). We used one of these alignments created by the VISTA pipeline comparing A. thaliana to A. lyrata as a benchmark for the quality of our pairwise alignments, which used the LASTZ/MULTIZ and axtChain methodology (Kent et al. 2003; Blanchette et al. 2004; Harris 2007). This VISTA A. thaliana vs. A. lyrata (Ath/Aly) alignment is available on the araTha8 genome browser, along with its corresponding conservation track and TAIR version 8 gene annotation.

To evaluate nucleotide coverage, several types of base alignment were measured, including the number of exact base pair matches, the number of mismatched nucleotides, the number of gaps, RepeatMasked regions, and regions where no relationship between the two genomes was assigned, which is equivalent to a gap (fig. 2A, supplementary table S1, Supplementary Material online). The number of exact matches and mismatches between the VISTA alignments and our alignment was close to identical, with each covering 65% and 66% of the A. thaliana genome, respectively. Large differences can be seen in the amount of masking applied to both the reference and query genome. Comparing the RepeatMasker track created during our masking of A. thaliana to the VISTA alignment track on the TAIR v8 genome browser, it is evident that, although the VISTA alignment employs some masking, it is limited, and repeat regions are often gapped. This results in a comparatively higher proportion of gapped sequences within the VISTA alignment. Both methods yield raw coverage within 10% of other pairwise alignments of Ath/Aly (Hu et al. 2011).
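The coverage categories described above can be made concrete with a small sketch. This is not the mafCoverage utility used in the study; it is a hypothetical tally over pairwise alignment blocks, under the assumptions that masked bases appear as N, that gaps appear as "-", and that any reference base outside every block counts as having no alignment.

```python
from collections import Counter

CATEGORIES = ("exact", "mismatch", "gap", "masked", "no_align")


def classify_columns(ref_row, qry_row):
    """Count reference bases by category within one pairwise alignment block."""
    if len(ref_row) != len(qry_row):
        raise ValueError("alignment rows must have equal length")
    counts = Counter()
    for ref_base, qry_base in zip(ref_row.upper(), qry_row.upper()):
        if ref_base == "-":
            continue  # insertion in the query: no reference base at this column
        if ref_base == "N" or qry_base == "N":
            counts["masked"] += 1  # assumption: masked sequence is written as N
        elif qry_base == "-":
            counts["gap"] += 1
        elif ref_base == qry_base:
            counts["exact"] += 1
        else:
            counts["mismatch"] += 1
    return counts


def coverage_fractions(blocks, ref_length):
    """Return the fraction of the reference genome falling in each category."""
    totals = Counter()
    for ref_row, qry_row in blocks:
        totals.update(classify_columns(ref_row, qry_row))
    # Reference bases not covered by any block have no assigned relationship.
    totals["no_align"] = ref_length - sum(totals.values())
    return {name: totals[name] / ref_length for name in CATEGORIES}
```

Calling `coverage_fractions(blocks, ref_length)` with a list of `(reference_row, query_row)` string pairs returns one fraction per category, and the five fractions sum to 1 as long as the blocks do not overlap on the reference.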

With coverage numbers comparable to previous alignments, we wanted to investigate how the constituent parts in A. thaliana are affected by the process of multiple alignment. Base-by-base coverage numbers may conceal errors in reading frame or poorly aligned functional sites such as splice sites. We used TAIR protein coding gene annotation (Lamesch et al. 2011) and the cleanGenes program in the PHAST package (Hubisz et al. 2011) to locate and evaluate start codons, stop codons, and splice sites, and to identify frameshift/nonsense mutations. Annotated gene regions containing all the listed functional elements without interruptions between an A. thaliana and A. lyrata alignment were 16% greater (5,240) in our alignments, compared with the VISTA alignments (fig. 2B). Features listed as having no alignment have an excess of gaps obscuring any measurement of features. Gene regions with no alignment occur more than twice as often in the VISTA data set. In the subsequent observations, there were marginal increases in failed tests of gene features for our alignments, attributable to a greater proportion of features passing the initial "no alignment" test. The cleanGenes software also examines full exons listed in the annotation for the A. thaliana genome. Using this function, we tabulated the number of exons with uninterrupted alignment in an Ath/Aly alignment. Over 4,000 more conserved exons with no gaps in alignment were identified in our alignments, compared with the VISTA alignments (supplementary table S1, Supplementary Material online).
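The kinds of checks described above (start codon, stop codon, splice sites, frameshift, and nonsense mutations) can be illustrated with a simplified sketch. It is not the cleanGenes implementation from PHAST; the function names and failure labels are hypothetical, and it assumes the two rows come from a pairwise alignment of a spliced reference CDS, with "-" marking gaps and canonical GT...AG introns.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_frameshift(row):
    """True if any gap run in an alignment row has a length not divisible by 3."""
    return any(len(run) % 3 for run in re.findall(r"-+", row))


def check_aligned_cds(ref_row, qry_row):
    """List the ways the aligned query fails to preserve the reference gene model."""
    query = qry_row.replace("-", "").upper()
    if not query:
        return ["no_alignment"]
    failures = []
    if query[:3] != "ATG":
        failures.append("start_codon")
    if query[-3:] not in STOP_CODONS:
        failures.append("stop_codon")
    if has_frameshift(ref_row) or has_frameshift(qry_row):
        failures.append("frameshift")
    else:
        # Premature stops are only meaningful when the reading frame is intact.
        internal = [query[i:i + 3] for i in range(0, len(query) - 3, 3)]
        if any(codon in STOP_CODONS for codon in internal):
            failures.append("nonsense")
    return failures


def intron_has_canonical_splice_sites(intron):
    """True if an intron starts with GT (5' splice site) and ends with AG (3' splice site)."""
    intron = intron.upper()
    return intron.startswith("GT") and intron.endswith("AG")
```

An empty list from `check_aligned_cds` corresponds to a feature that passes all tests; a query row made up entirely of gaps is reported as `no_alignment` before any other test is attempted, mirroring the order of tests described in the text.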

[Figure 2: (A) pie charts of alignment coverage. VISTA Ath/Aly: Exact 65%, Mismatch 7%, Gap 13%, Masked 1%, No Align 14%. 20way Ath/Aly: Exact 66%, Mismatch 8%, Gap 5%, Masked 14%, No Align 7%. (B) Bar chart (0 to 16,000 features) comparing VISTA Ath/Aly and 20way Ath/Aly across the categories Features Passing All Tests, No Alignment, Failed Start Codon, Failed Stop Codon, Failed 5' Splice Site, Failed 3' Splice Site, Nonsense Mutation, and Frameshift Mutation.]

FIG. 2. Alignment coverage and quality comparison between our implementation of the LASTZ/MULTIZ paradigm and a publicly available alignment hosted by the VISTA genome browser using a LAGAN-based alignment. (A) Coverage statistics as tabulated by the mafCoverage utility for each methodology detailing exact nucleotide matches, alignment with mismatched nucleotides, gapped sequence, sequence intentionally removed due to repeats, and regions where no relationship between the two genomes was assigned (equivalent to a gap). (B) Results from the cleanGenes utility that takes a TAIR v8 annotation and measures whether a given alignment has conserved the gene feature and maintained its protein coding ability. If the gene alignment between the two genomes is not cleanly conserved, the type of error is recorded.

To further address whether our methods are creating suitable alignments beyond pairwise comparisons to A. lyrata, we investigated gene conservation in two additional species present in the alignment. Alignments created using LASTZ/MULTIZ for Vitis vinifera and Glycine max were compared with existing VISTA pairwise alignments created using the LAGAN pipeline. Similar to the results observed when comparing A. thaliana with A. lyrata, our alignments outperformed existing alignments. As detailed in supplementary table S2, Supplementary Material online, LASTZ was able to cleanly align nearly twice (801/427) the number of features compared with LAGAN when comparing Arabidopsis to wine grape. The same result was found in the A. thaliana/G. max alignment, which found more than twice the number of gene features cleanly conserved when compared with LAGAN (837/336). This difference can be attributed to the preservation of start/stop codons and splice sites during the alignment process.

Base-by-Base Conservation and Discrete Conserved Elements among Angiosperms

To predict conserved regions in our multiple alignment, we used the phyloHMM method of Siepel et al. (2005), which was previously used to search for conserved elements within four different groups of organisms: vertebrates, insects, worms, and yeast. This phyloHMM scores general conservation across an alignment and also creates a smaller set of discrete elements, representing the most highly conserved blocks of sequence (mostCons).
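The idea of segmenting an alignment into conserved and nonconserved stretches with a hidden Markov model can be illustrated with a deliberately reduced toy. A real phyloHMM scores each alignment column with phylogenetic substitution models on a tree; the sketch below swaps that for a two-state HMM whose only observation is whether a column is identical across species. All parameter values and names are illustrative assumptions, not values from Siepel et al. (2005) or from this study.

```python
import math

STATES = ("conserved", "neutral")


def viterbi_conserved(columns, p_match_cons=0.9, p_match_neut=0.4, p_stay=0.95):
    """Label each column as conserved or neutral with the Viterbi algorithm.

    columns: sequence of booleans, True when the alignment column is
    identical in every species (a crude stand-in for a phylogenetic score).
    """
    if not columns:
        return []
    p_match = {"conserved": p_match_cons, "neutral": p_match_neut}

    def log_emit(state, is_match):
        p = p_match[state] if is_match else 1.0 - p_match[state]
        return math.log(p)

    log_stay, log_switch = math.log(p_stay), math.log(1.0 - p_stay)

    # Uniform prior over the two states at the first column.
    score = {s: math.log(0.5) + log_emit(s, columns[0]) for s in STATES}
    backpointers = []
    for is_match in columns[1:]:
        new_score, pointers = {}, {}
        for state in STATES:
            candidates = {
                prev: score[prev] + (log_stay if prev == state else log_switch)
                for prev in STATES
            }
            best_prev = max(candidates, key=candidates.get)
            new_score[state] = candidates[best_prev] + log_emit(state, is_match)
            pointers[state] = best_prev
        score = new_score
        backpointers.append(pointers)

    # Trace the best path back from the final column.
    path = [max(score, key=score.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

Runs of columns labeled `conserved` play the role of discrete conserved elements in this toy; the high `p_stay` value discourages the path from switching state on a single discordant column.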

The normal composition of genome features in A. thaliana is illustrated in figure 3A and serves as a reference to which the predicted conserved regions can be compared. The composition of all scored conserved elements (fig. 3B) can be contrasted with the normal distribution, revealing an expansion in the proportion of protein-coding sequence and unannotated intergenic sequence. Annotations specific to regions not associated with protein-coding genes, such as noncoding RNAs and translational RNAs, only represent a fraction of the larger conserved region data set. This contrasts with the relatively higher proportion of conserved RNAs observed in previous analyses in organismal groups such as vertebrates (Siepel et al. 2005). Using this previous data, and reproducing the analysis for angiosperm genomes, we observe a greater amount of CDS conservation in angiosperms (42%) than seen in vertebrate (18%) and insect (26%) conservation but less than that seen in worms (55%) and yeast (86%). In general, when comparing the patterns of element conservation and diversity seen in angiosperm genomes to the same distributions previously mapped in reference vertebrate, yeast, insect, and worm genomes, we find that angiosperms most closely resemble the distribution of conserved elements seen in nematodes such as Caenorhabditis elegans.

PhastCons produces a set of discrete regions that are the most conserved within the alignment and is graphed by annotation type in figure 3C. To further isolate regions with the deepest phylogenetic conservation, we selected the top 10% of this mostCons set, as defined by having a logarithm of the odds (LOD) score greater than 88. The majority of the mostCons set and the tail of its distribution are annotated as protein-coding sequence. Despite filtering for only the highest scoring regions in the mostCons set, elements mapping to intergenic regions are still represented. These intergenic regions do not include any known DNA-level annotation; this suggests that there is substantial undiscovered functionality present in A. thaliana and other plant genomes. Cis-regulatory elements are equally represented among the normal composition, conserved, and most-conserved regions. Although using short sequence motifs to identify regulatory elements may accrue false-positive regions that share sequence identity but are nonfunctional, the deep conservation of many of these sites demonstrates that most are likely functional in some way. In general, the mostCons data set serves as the starting point for further analysis and annotation of conserved regions within angiosperms.
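Selecting the top 10% of elements by LOD score amounts to finding a score cutoff and keeping everything at or above it. A minimal sketch of that step follows; the element names and scores used with it would be placeholders, since the study's actual cutoff (LOD greater than 88) came from the real mostCons distribution, and the function names are hypothetical.

```python
import math


def top_fraction_cutoff(scores, fraction=0.10):
    """Return the lowest score that still falls within the top `fraction` of scores."""
    if not scores:
        raise ValueError("at least one score is required")
    ranked = sorted(scores, reverse=True)
    keep = max(1, math.floor(len(ranked) * fraction))
    return ranked[keep - 1]


def select_top_elements(elements, fraction=0.10):
    """Keep (name, lod_score) pairs scoring at or above the top-fraction cutoff.

    Elements tied with the cutoff are all kept, so slightly more than
    `fraction` of the input can be returned.
    """
    cutoff = top_fraction_cutoff([score for _, score in elements], fraction)
    return [element for element in elements if element[1] >= cutoff]
```

Applied to a list of `(name, lod_score)` pairs, `select_top_elements` returns the highest-scoring subset, which could then be intersected with annotation tracks in the way the paragraph above describes.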

One way of characterizing the conserved portion of the genome is to ask what functional annotations are enriched among identified conserved elements. Figure 3D provides such a view of the conserved portion of plant genomes. In particular, translational RNAs are the most enriched annotation among conserved regions, followed by protein-coding sequences. Following these two groups, we observe that RNAs that regulate transcription, such as miRNAs, are enriched among conserved sequences. We observed that this annotation set of miRNA and noncoding RNA (ncRNA) annotations was diverse in its alignment depth. It included RNAs present in many, if not all, of the 20 included species, and RNAs with alignment to only Brassicales. This wide variation in the depth of conservation of RNAs makes their mild enrichment unsurprising. In addition to the enrichment of regulatory RNAs, regions tentatively annotated as binding transcription factors are also enriched. As a control, transposable elements are drastically under-represented: being highly repetitive, they do not align well, nor should they be conserved in most cases between species. Comparing the enrichment of angiosperm annotations among conserved regions to vertebrate annotation enriched in conserved regions determined by the 46way vertebrate conservation track showed nearly the same ordering of enriched annotation types (supplementary fig. S1, Supplementary Material online). Vertebrate enrichment values trended higher in all categories compared with angiosperm enrichment.
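One common way to express enrichment of this kind is as a fold change: the share of conserved bases that carry an annotation, divided by the share of the whole genome that carries it. The text here does not spell out the exact formula behind figure 3D, so the definition and function name below are assumptions made for illustration only.

```python
def fold_enrichment(annot_in_conserved_bp, conserved_bp, annot_in_genome_bp, genome_bp):
    """Fold enrichment of an annotation class among conserved bases.

    Computes (annot_in_conserved_bp / conserved_bp) divided by
    (annot_in_genome_bp / genome_bp), rearranged to avoid intermediate rounding.
    Values above 1 indicate over-representation among conserved bases.
    """
    if min(conserved_bp, annot_in_genome_bp, genome_bp) <= 0:
        raise ValueError("base-pair totals must be positive")
    return (annot_in_conserved_bp * genome_bp) / (conserved_bp * annot_in_genome_bp)
```

Under this definition, an annotation class covering 10% of the genome but 50% of conserved bases has a fold enrichment of 5, while a class such as transposable elements that is rare among conserved bases falls well below 1.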

Previous investigations into conservation between vertebrate species have looked into the alignability (i.e., percentage of bp with aligned sequence to a reference) of different components of the genome as a function of evolutionary divergence (Miller et al. 2007). To compare and contrast the animal results of Miller et al. (2007) with conservation within plant species, we recapitulated their analysis using the 46way alignment information (Fujita et al. 2011) and overlaid the trend lines on selected angiosperm results (fig. 3F). Comparing conservation of RefSeq CDS regions from vertebrates to conservation of TAIR CDS regions within angiosperms showed a faster decline of alignability in vertebrate species. Similarly, a faster decline in vertebrate alignability was observed when comparing angiosperms to vertebrate cis-regulatory sites as seen in the trend line of figure 3F. This may be due to ORegAnno annotation being biochemically validated compared with our initial set of regulatory sites that are bioinformatically predicted.
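The relationship between alignability and divergence can be explored directly from the values in table 1. The sketch below fits an ordinary least-squares line to CDS alignment percentage against substitutions per site for the 19 non-reference species. The text does not say how the trend lines in figure 3F were fitted, so the choice of a straight line is an assumption, and the function name is hypothetical.

```python
# (substitutions per site, CDS alignment %) for the 19 non-reference species in table 1.
TABLE1_CDS = [
    (0.09, 98.16), (0.35, 96.78), (1.04, 79.70), (1.11, 83.28), (1.19, 75.93),
    (1.21, 81.40), (1.24, 80.89), (1.25, 80.78), (1.28, 80.30), (1.33, 78.67),
    (1.33, 78.94), (1.43, 62.84), (1.47, 76.79), (1.51, 67.49), (1.52, 77.77),
    (1.92, 63.93), (1.94, 63.39), (1.95, 63.68), (1.96, 62.93),
]


def least_squares_line(points):
    """Return (slope, intercept) of the ordinary least-squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = least_squares_line(TABLE1_CDS)
print(f"CDS alignability changes by {slope:.1f} percentage points per substitution per site")
```

The same function could be applied to the Total Align (%) column, or to vertebrate values from the 46way alignment, to compare rates of decline between the two groups.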


FIG. 3. An analysis of conservation in the 20way plant alignment. (A) The composition of the Arabidopsis thaliana genome sourced from TAIR v9 and newly generated annotations. “Other” contains ncRNAs, miRNAs, tRNAs, rRNAs, small nuclear RNAs (snRNAs)/small nucleolar RNAs (snoRNAs), and pseudogenes. (B) PhastCons-predicted conserved elements sorted by annotation. (C) The discrete “mostCons” regions predicted by phastCons and the 10% highest scoring tail of the distribution of conservation scores for mostCons regions. Color and segment position correspond to annotation type described in (A) and (B). (D) Enrichment of conserved elements within different feature types. Significance was determined by a Fisher’s exact test with single stars denoting P < 0.05 and double stars P < 0.01. (E) Enrichment of EvoFold-predicted secondary structures in different types of genome features. (F) Alignability of A. thaliana genome features to corresponding features in plants at increasing phylogenetic distances. Also plotted are proportional trend lines of vertebrate alignability of cis-regulatory sites and exons drawn from 46 vertebrate species (Fujita et al. 2011). The distance is scaled according to substitutions per site drawn from a 4-fold degenerate neutral tree. (G) BLAST homology annotation of unannotated intergenic conserved elements from (B). Annotation categories are shown as proportional areas with subsets shown as Venn diagrams. Regions are labeled with their putative annotation type and total number of elements identified.


De novo Annotation of Unknown Conserved Elements

The conservation analysis from figure 3B revealed that there remains a large percentage of conserved intergenic DNA that is not associated with any documented annotation or function. To further investigate this set of regions, we applied a BLAST homology search to all existing plant databases to annotate new sequence (see Materials and Methods). BLAST annotation terms associated with each region of unannotated conservation were recorded and graphed as proportional area, so as to visualize the diversity of the previously unknown conservation (fig. 3G). Each circular area represents a group of conserved regions that do not overlap existing annotation in Arabidopsis but that share sequence identity to an annotation group in Arabidopsis or in any other plant genome. It is important to note that this is by no means an exact one-to-one annotation, as most regions show moderate sequence identity. However, it illustrates that single-genome computational predictions of functional elements have overlooked many biologically relevant sites within the Arabidopsis genome and provides inroads toward their further characterization.

Intersecting predicted folding RNAs (fRNAs) with conserved regions that show sequence homology to angiosperm tRNAs revealed that more than half of these regions also exhibit complex RNA folding patterns. This overlap of independent methods of identification gives a strong indication that these conserved regions are part of previously unannotated tRNAs in Arabidopsis. The remaining conserved regions that only have homology to tRNAs may be truncated and lack the complementary sequence to accurately predict a fold within that region. Forty conserved regions were found that showed some sequence identity to, but not overlap with, currently annotated noncoding RNAs in plants; these regions were intersected with a track listing regions of small RNA expression, which revealed that 14 sequences in that set were transcribed. Of those 14 regions that expressed RNA, seven also have the predicted folding structures associated with the conserved region. Although the first BLAST sequence identity term for these elements was similar to ncRNAs, many also have protein-coding homology as a secondary BLAST term, suggesting their potential targets for regulation. Despite all attempts at classification, the function of 10% of the starting data set of unannotated conserved elements remains unknown. These elements of unusually high sequence conservation among species, labeled in figure 3G as “no homology,” cannot yet be fully characterized; however, similar to many of the other regions successfully identified, a subset shows small RNA expression or predicted folding, giving clues to a currently veiled function.

The most prominent set of newly annotated elements (fig. 3G) is conserved regions with homology to protein-coding sequence. Overlapping these regions of protein homology are subsets that have been intersected with different whole-genome annotation tracks. This is visualized as an internal Venn diagram of different types of feature characteristics, such as structure or expression. One possibility is that this large group of elements comprises regions that could code for proteins, either currently or ancestrally. To explore this, we evaluated the reading frames of each conserved region and identified that, for the length of the conservation, roughly one-third (573) have one or more viable reading frames without stop codons. An additional measure of potential protein-coding ability was evaluating the proximity to known exons. About one-fifth of the regions with good open reading frames (174) were within 300 bp of a known exon, making them candidates for being involved as an alternative variant of a transcript or as an unknown exon of an annotated gene. Although all these included regions have some homology to protein sequence, such homology is not always an indication that the conserved sequence contributes to an mRNA transcript. RNAs that regulate transcripts or target DNA require homology to that target DNA (e.g., miRNAs). As such, parts of these regions of homology could result from targeting protein-coding regions as part of a regulatory mechanism.
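The two screens described above, a reading frame free of stop codons across the conserved region and proximity to a known exon, can be sketched in a few lines of Python. This is an illustration under stated assumptions: all six frames are scanned (whether both strands were evaluated is not stated in the text), coordinates are half-open, and the function names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def stop_free_frames(seq):
    """Return labels (strand + offset) of frames containing no stop codon."""
    seq = seq.upper()
    frames = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = (s[i:i + 3] for i in range(offset, len(s) - 2, 3))
            if not any(codon in STOPS for codon in codons):
                frames.append(f"{strand}{offset}")
    return frames

def near_exon(region, exons, max_gap=300):
    """True if the region overlaps or lies within max_gap bp of any exon."""
    r_start, r_end = region
    for e_start, e_end in exons:
        gap = max(e_start - r_end, r_start - e_end, 0)
        if gap <= max_gap:
            return True
    return False

print(stop_free_frames("ATGAAATAG"))
print(near_exon((1000, 1100), [(1350, 1500)]))
```

A region passing both checks would correspond to the candidates for an unannotated exon or alternative transcript variant mentioned above; neither check alone shows that a region is translated.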

To further differentiate this large group of 1,787 elements, other features were employed to identify additional characteristics of each conserved region. Secondary structure and small RNA expression were used to elucidate potential RNA genes within this set. This resulted in 53 elements at the intersection of secondary structure and small RNA expression, which made prime targets for further investigation. One intriguing region from this set had a top BLAST term, which listed an unknown protein, and a second BLAST term, which listed the protein LAF3 (AT3G55850) (fig. 5). The LAF3 protein participates in regulating phytochrome A signal transduction in the cytosol (Hare et al. 2003). The third highest scoring BLAST hit was the noncoding RNA AT1g70185, located 500 kb downstream of the unknown conserved region on chromosome 1, with a stretch of homology 80 bp long with 10 substitutions along that stretch. The biological function of this related ncRNA is unknown. Our predicted ncRNA, which was found through its pattern of conservation, shares homology with the sequence for the LAF3 protein, as well as with the TAIR-annotated ncRNA. The protein homology overlaps the expressed small RNAs mapped to the region. EvoFold-predicted secondary structure shows an unusually high level of conservation, with almost no substitutions in the fold found among angiosperms. In addition to this ncRNA, several other targets from this data set share similar patterns of high conservation, expression, and high-scoring secondary structure; these are annotated as part of a browser track on the A. thaliana genome.

RNA Secondary Structure Prediction across A. thaliana

Previous successes in whole-genome comparisons among species groups have opened a window onto using multiple alignments and phylogenetic trees to identify RNA genes (Pedersen et al. 2006; Stark et al. 2007). To identify possible RNA genes in our plant alignment, we used the phyloSCFG algorithm implemented in the EvoFold software package (Pedersen et al. 2006), in addition to the RNAalifold program (Bernhart et al. 2008). These previous RNA structure analyses have highlighted the inherent high rates of false positives in folding prediction. Using these two independent prediction methods provided the opportunity to evaluate each fRNA from multiple perspectives; this helped to eliminate false positives that may have resulted from characteristics that are unique to a particular algorithm. An example of this approach can be seen in predictions such as those illustrated in figure 4E, where the two algorithms overlap in their annotation of a fold.

The combined predictions of the two approaches identified 86,000 sites that could potentially fold. Short folds of less than 15 bp were found to be the majority of predictions, though longer folds were also found in large numbers (fig. 4A). To assess the accuracy in determining fRNA from highly conserved alignments, the set of TAIR annotations for transfer RNAs consisting of 689 sites was used as a positive control for fRNA prediction. Our fold classifications predict 97% (637) of these established fRNAs (fig. 4B). The remaining 3% of annotated tRNAs were not identified, due to poor alignment or low conservation. This suggests accurate prediction of known, conserved fRNAs, on par with previous investigations into fRNA genes in other organisms.
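The positive-control check above amounts to asking what fraction of known tRNA intervals is overlapped by at least one fold prediction. A minimal Python sketch, assuming half-open intervals on a single chromosome and using invented coordinates rather than TAIR data:

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def recall(known, predicted):
    """Fraction of known features overlapped by at least one prediction."""
    if not known:
        return 0.0
    hits = sum(1 for k in known if any(overlaps(k, p) for p in predicted))
    return hits / len(known)

# Hypothetical intervals: four annotated tRNAs and three predicted folds.
known_trnas = [(100, 172), (500, 573), (900, 972), (1500, 1574)]
fold_predictions = [(110, 160), (560, 600), (1400, 1490)]
print(recall(known_trnas, fold_predictions))
```

Any overlap counts as a hit here. A stricter criterion, such as requiring a minimum overlap fraction, would lower the reported recall, and the text does not state which criterion was applied.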

Secondary structure in RNAs can take many physical forms; we quantified this variation in shape by recording the type of matching seen in both long and short folds (fig. 4C). The hairpin type dominates among shorter folds. Long folds show much higher diversity in shape, including complex folds that have more than three hairpins in the folded structure. Both long and short regions show a greater proportion of folds comprising two hairpins in angiosperms, compared with the distribution observed among folds in the human genome by Pedersen et al. (2006), who found double hairpins to be rarer in primates. The types of annotation that overlap regions which fold are described in figure 4D for long and short folds. In vertebrates, nearly half of all known folds are intergenic, with the remainder being associated with introns and CDS. In contrast, angiosperms have few intergenic folds, with 70% or more occurring within coding sequence. This difference mirrors the differences seen in the type of conservation of all sequence between species. The data set used for both analyses, the “mostCons” (most-conserved regions identified by phastCons), impacts the distribution of folds among annotation types. As a result, we see that the mostCons composition in figure 3C is similar to the composition of folds detected in figure 4D.
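Classifying a fold by its number of hairpins can be illustrated on dot-bracket strings, the notation in which paired bases are written as matching parentheses and unpaired bases as dots. The Python sketch below counts hairpin loops as an opening bracket followed only by unpaired bases and then a closing bracket. The category labels are hypothetical stand-ins for the classes in figure 4C, and pseudoknots are not handled.

```python
import re

# A hairpin loop is an opening pair closed with only unpaired bases in between.
HAIRPIN_LOOP = re.compile(r"\(\.+\)")

def count_hairpins(dot_bracket):
    """Count hairpin loops in a dot-bracket secondary structure string."""
    return len(HAIRPIN_LOOP.findall(dot_bracket))

def classify_fold(dot_bracket):
    """Assign an illustrative shape label based on the hairpin count."""
    n = count_hairpins(dot_bracket)
    if n == 0:
        return "no hairpin"
    if n == 1:
        return "hairpin"
    if n == 2:
        return "double hairpin"
    if n == 3:
        return "three hairpins"
    return "complex"

print(classify_fold("((((....))))"))
print(classify_fold("(((...)))..(((....)))"))
```

Tallying these labels separately for short and long folds would give the kind of shape distribution summarized above.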

As a vignette describing one of the types of folds detected in this analysis, we selected a previously undescribed conserved high-scoring double hairpin within the 5′-untranslated region of the plastid redox insensitive 2 (PRIN2) gene (fig. 4E). PRIN2 is a nuclear-encoded chloroplast-localized protein whose expression levels are altered by light (Kindgren et al. 2011). The PRIN2 protein was also found to interact with the plastid-encoded RNA polymerase, altering expression, and therefore is thought to be a nonessential regulator of plastid gene expression. The folded RNA structure is highly conserved among all flowering plants, with few mismatching base pairs (fig. 4F). The consensus fold shows two strongly conserved hairpins joined by a more variable region (fig. 4G). The gene shows two transcripts scored 3 and 4 stars by TAIR, differing only in the length of the 5′-untranslated region (5′-UTR): one with the predicted folds and one without. Interestingly, in the longer transcript, the hairpins directly overlap the ribosome initiation site that begins at the 5′-cap. Additionally, we detected two cis-regulatory motifs: one near, and one within, the UTR regions (as seen in fig. 4E). The first cis-regulatory element, a “sequence overrepresented in light repressed promoters number 3” (SORLREP3) motif, has previously been found to occur near promoters whose transcript levels are reduced under a continuous red light stimulus (Hudson and Quail 2003). The second motif, found within the 5′-UTR of both transcripts, is an I-box and is known to exist in the promoter regions of light-regulated genes (Giuliano et al. 1988). How these two cis-regulatory elements contribute to expression levels of each of the alternative PRIN2 transcripts is unknown, but the observed regulatory pattern fits with previous knowledge about stimuli associated with PRIN2 expression. Their presence flanking the predicted folds may indicate that different expression patterns of the gene are possible, depending on transcription factor binding. Although these new predictions need further validation, they highlight the ability of these genome-wide data sets to add value to existing gene investigations.

Uninterrupted Conservation in Angiosperms

One peculiarity found in the genomes of mammals and insects is long stretches of uninterrupted conservation (Bejerano et al. 2004). These ultraconserved elements (UCEs) were originally located by a comparative genomic search between human, mouse, and rat, which showed evidence of deep phylogenetic conservation, as well as ongoing purifying selection in the human genome (Bejerano et al. 2004; Katzman et al. 2007; Chiang et al. 2008). UCEs can extend to lengths of more than 500 bp and can be best described as the extreme tail of the distribution of genome-wide conserved elements. There is a degree of controversy as to whether they exist within plant genomes, with some researchers reporting their discovery and others remaining skeptical (Zheng and Zhang 2008; Freeling and Subramaniam 2009). More recent research has used BLAST searches across multiple plant genomes, identifying regions that have been termed ultraconserved-like elements (ULEs) (Kritsas et al. 2012). ULEs have unusually high levels of conservation, and negative selection acting on their sequence, but lack the uninterrupted segments and extreme purifying selection that are found in mammals.

araTha9.chr1  ACGACCTTACTTGAACAGGATCTGTTCTATAGGCTCGTACCTCTGTTTCCTTGATTTCTAAGGAGACAG
lyrata        ACGACCTTACTTGAACAGGATCTGTTCTATAGGCTCGTACCTCTGTTTCCTTGATTTCTAAGGAGACAG
rapa          ACGACCTTACTTGAACAGGATCTGTTCTATAGGA-AGTACCTCTGTATCCTTGATTTCTAAGGAG-CAG
papaya        ACGACCTTACTTGAACAGGATCTGTTCTATAGGTTCGTACCTCTGTTTCCTGGAGTTCGAAGGAGACAG
cacao         ACGACCTTACTTGAACAGGATCTGTTCTATAGGCTCGTACCTCTGTATCCTTTAGCACAAAGGAGACAG
glycineMax    ACGACCTTACTTGAACAGGATCTGTTCTATAGGCTCGTACCTCTGTGTCCTTGAGTTCTAAGGAGACAG
glycineSoja   ACGACCTTACTTGAACAGGATCTGTTCTATAGGCTCGTACCTCTGTGTCCTTGAGTTCTAAGGAGACAG
malus         ACGACCTTACTTGAACAGGATCTGTTCTATAGGATCGTACCTCTGTATCCTTGATTTCTAAGGAGACAG
fragaria      ACGACCTTACTTGAACAGGATCTGTTCTATAGGATCGTACCTCTGTATCCTTGACTTCTAAGGAGACAG
cucumis       ACGACCTTACTTGAACAGGATCTGTTCTATAGG-TTGTACATCTGTGTCCTTGAGTTCTAAGGAGACAA
ricinus       ACGACCTTACTTGAACAGGATCTGTTCTATAGGCTCGTACCTCTGTGTCCTTTATCACAAAGGAGACAG
populus       ACGACCTTACTTGAACAGGATCTGTTCTATAGGATCGTACCTCTGTATCCTTAATCACTAAGGAGACAG
vitis         ACGACCTTACTTGAACAGGATCTATTCTATAGA-TTGTACCTCTGTATCCTTGAGTTCTAAGGAGACAG
tuberosum     ACGACCTTACTTGAACAGGATCTGTTCTATAGGCTCGTACCACTGAATCCTTGATTTCTAAGGAGACAG
sorghum       ACGACCTTACTTGAACAGGATCTGTTCTATAGGATCGTACCGCTGCATCCTTGATTAATAAGGAGGCAA
zea           ACGACCTTACTTGAACAGGATCTGTTCTATAGGATCGTACTGTTGTATCCTTGATTGATAAGGAGGCAA
oryza         ACGACCTTACTTGAACAGGATCTGTTCTATAGGCTCGTACCGTTGCATCCTTGACTAATAAGGAGGCAA
brachypodium  ACGACCTTACTTGAACAGGATCTGTTCTATAGGATCGTACCGCTACATCCTTTACCAAAAAGGAGGCAA
SS anno       (((((((.....((((((...))))))...))).))))....((((.((((((.....)))))).))))
pair symbol   abcdefg     hijklm   mlkjih   gfe dcba    abcd efghij     jihgfe dcba

FIG. 4. Predicted secondary structure based on a 20way angiosperm alignment using EvoFold and RNAalifold. (A) Fold lengths separated into short (<15 bp) or long folds. (B) Coverage of known tRNAs intersected with fold predictions. Remaining 3% (21) of folds were due to low conservation or poor alignment. (C) Fold structure for both long and short fRNA sets. The number of hairpins was counted in a single fold, and classified based on the fold’s structure. (D) Type of overlapping annotation for both long and short data sets. (E) UCSC genome browser screenshot of predicted hairpins in the PRIN2 (AT1G10522) gene that is involved in plastid gene transcription, and is alternatively spliced. The hairpins were predicted by both EvoFold and RNAalifold, and overlap the ribosomal initiation site. Also pictured are cis-regulatory predictions, a TAIR gene track, and the phastCons conservation track. (G) 20way alignment of the region colored blue where there is a single substitution compatible with the annotated pair, green with a compatible double substitution, and red where there is a substitution not compatible with the annotated pair. (F) Consensus Vienna RNAfold predicted MFE structure of the highlighted 5′-UTR region.

To explore whether flowering plants contain even modestly extended stretches of the uninterrupted conservation seen in mammals and insects, we conducted a search using methods mirroring those used to detect these regions in mammals (Bejerano et al. 2004; Glazov et al. 2005). These approaches use whole-genome multiple alignment to detect blocks of conservation. Specifically, the algorithm identifies UCEs by starting with a conserved alignment column and stringing together subsequent preserved columns until this pattern breaks due to any kind of nucleotide change. Constraining this algorithm are two parameters: the number of genomes within the alignment and the minimum threshold for declaring a UCE. Using all 20 aligned species to search for regions of consistent alignment column conservation returns no regions when using a minimum length cutoff of 18 bp. One possibility is that gaps in alignment could be due to alignment or assembly quality errors. To better account for this, we limited the search space by using three-way alignments to A. thaliana. The alignment of G. max and V. vinifera to A. thaliana returned the largest number of uninterrupted regions greater than 18 bp. Using TAIR annotation and BLAST homology, we annotated 1,600 uninterrupted regions detected in this 3way alignment and determined they all fall within known types of genome features (supplementary fig. S2, Supplementary Material online). Considering that novel metazoan-type uninterrupted conservation has not been found, it can be concluded that metazoan-like UCEs are not present in angiosperm genomes at the investigated phylogenetic depths. As suggested by Kritsas et al., plant genomes may contain features that may serve a similar purpose but with altered or reduced conservation characteristics.
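The column-stringing search described above can be sketched as a single pass over alignment columns. The Python example below assumes the alignment is supplied as equal-length upper-case strings, one per species, with "-" marking gaps. The function name and the toy alignment are hypothetical, and this is an illustration of the idea rather than the software used in the study.

```python
def uninterrupted_runs(rows, min_len=18):
    """Return half-open (start, end) column ranges in which every row carries
    the same nucleotide, with no gaps or Ns, for at least min_len columns."""
    n_cols = len(rows[0])
    runs = []
    start = None
    # Iterate one column past the end so that a run reaching the last column is closed.
    for col in range(n_cols + 1):
        conserved = (
            col < n_cols
            and rows[0][col] not in "-N"
            and all(row[col] == rows[0][col] for row in rows)
        )
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_len:
                runs.append((start, col))
            start = None
    return runs

# Toy three-way alignment; column 4 carries a substitution in the second row.
toy = ["ACGTACGTAA",
       "ACGTTCGTAA",
       "ACGTACGTAA"]
print(uninterrupted_runs(toy, min_len=4))
```

The two constraining parameters named in the text map onto the number of rows passed in and `min_len`. Reducing the rows from 20 species to a three-way alignment relaxes the first constraint in the same way as the search above.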

Discussion

Is the evolution of plant genomes distinct from that of animals? Here, we construct a phylogenetically deep alignment of angiosperm genomes, to ask how sequence conservation in angiosperms compares to groups of species in other kingdoms of life. Relating the entirety of currently sequenced genomes can reveal a more complete story on how similar plant genomes are and on what features they “value” as part of a shared evolutionary history. Additionally, conservation of sequence has been shown to quickly and clearly identify functional regions which might otherwise have been overlooked. As such, by analyzing genome conservation in flowering plants, we have been able to add new annotations based on patterns of conservation and identify novel features with secondary structure and potential target sequences.

Information content of multiple alignments increases as the number of species and the breadth of the phylogeny increase. Although this is true for the first few included species, there are diminishing returns as further species are added. To better quantify this, some have looked into how many genomes are necessary to reach the nucleotide-level resolution of conservation in comparative studies (Cooper et al. 2003; Eddy 2005). Although these investigations focus on the number of mammalian genomes, they still provide potent rules of thumb for estimating how many genomes are needed for high resolution. Depending on the phylogenetic relationships, anywhere between 15 and 40 genomes may be necessary. Our choice of how many genomes to include was largely dictated by availability, as roughly 20 genomes were available to us for use. Although plant alignments could benefit from further inclusion of comparative data of close phylogenetic distance to A. thaliana, this phylogenetically broad 20way alignment is a large step toward nucleotide-level identification of conserved sequences in angiosperm genomes.

Aligning plant genomes inevitably leads to the question of how recent polyploid events impact the construction of data sets and their analysis. Arabidopsis thaliana is one of the smallest sequenced plant genomes; as a result, alignments generated with an A. thaliana reference will be equally minimal. In instances where the reference (Arabidopsis) only has a single copy, species aligned to this compact reference will exclude regions of less conserved paralogs within the query genome. This method has worked well for broad use in less dynamic whole-genome comparative data sets such as vertebrates. In light of the complex ploidy history of plant genomes, we want our inference to be conservative, relative to the influence of polyploidy; thus, we focus attention only on questions of genome conservation within the A. thaliana genome. A modified alignment process, using high-quality chromosome data to resolve genome duplication events before the LASTZ alignment step, could produce alignments without such a complex composite nature. However, only a sparse number of genomes have such data available.

FIG. 5. A screenshot from the Arabidopsis thaliana genome browser displaying tracks overlaid on a putative noncoding RNA, detected due to its high conservation and expression. Tracks include conservation for each of the 20 included species, a track showing conserved regions BLAST-annotated as having protein homology, a track showing secondary structure computed with EvoFold and RNAalifold with dark green denoting fold predictions and light green nonfolding regions, and a track from the Arabidopsis Small RNA Project Database showing small RNA expression overlapping the conserved regions. Also shown is the consensus Vienna RNAfold predicted MFE structure of the putative ncRNA.

Our investigation into coding-sequence conservation demonstrates that we are generating quality whole-genome alignments, which preserve essential gene features in aligned regions. Our approach also implemented a new platform for visualizing and accessing alignment data for plants; such tools have long been available for other model species, but this is the first instance of a deep comparative browser for plants. The results from benchmarking the gene quality of the alignment (fig. 2) were surprising in that almost half of A. thaliana gene annotations contained a change that would disrupt function when aligned to A. lyrata. This is despite the two sharing the large majority of A. thaliana’s coding sequence, with A. lyrata having alignments for 98% of the coding sequence present in A. thaliana. This stands in stark contrast to start/stop codon conservation comparisons in vertebrate species, where even distant mammals such as platypus retain more than 60% of these essential sites (Miller et al. 2007). One explanation for this observation could be that the alignment process creates composite genes by aligning multiple copies of A. lyrata genes onto single copy genes in the A. thaliana reference. This hypothesis, however, does not fully explain the number of disruptions in aligned genes. The majority of genes between the two species occur colinearly, with a minority being duplicated (Hu et al. 2011). If alignment errors from paralogs were creating all these observed disruptions, then we would expect them to occur at a similar rate to the proportion of duplicated genes overall. More likely, there is a mix of causes, with only some being false positives due to poor alignment. Even taking these potential false positives into consideration, we see a trend that is markedly different from the gene feature conservation seen in vertebrate species.

The proportion of conserved genome features, such as coding sequence, introns, and UTRs, relative to the complete set of detected conserved elements within a reference genome, has been previously investigated for vertebrates, insects, worms, and yeasts (Siepel et al. 2005). Specifically, this comparison shows a trend relating an increase in the complexity of the conserved element set to an increase in overall organismal complexity. Reproducing this analysis (fig. 3B) for angiosperms revealed that the proportion of gene features among the complete set of conserved elements most closely parallels the proportions observed in worms such as C. elegans. Both nematodes and plants are known to exhibit a wide degree of phenotypic plasticity, making drastic alterations to body structure due to environmental stress (Sultan 2000; Sommer and Ogawa 2011). The observations that both angiosperms and nematodes share a common distribution of conserved elements, and that they both make use of a more flexible phenotypic landscape, may imply that this less diverse composition of conserved elements is necessary for environmentally induced large phenotypic changes. Moreover, it could be that the developmental plans of plants are relatively flexible in comparison to animals, and thus, this developmental lability is reflected in genomic architecture and evolution.

It is clear that angiosperm coding sequence and essential RNAs can be reliably aligned and identified, even across protracted phylogenetic timelines. However, these components represent only a fraction of genome features. Knowing the phylogenetic distance at which we lose sequence identity for rapidly diverging features in plants can help make informed decisions in experimental design. Comparing alignability in vertebrates to angiosperm species reveals a faster decay in vertebrates compared with plant species. Both between closely related species and in phylogenetic comparisons beyond the Brassicales, the alignability of coding sequence to the A. thaliana reference was greater than that of equally distant vertebrates to a human reference (fig. 3F). Similarly, the alignment coverage of reference genome coding sequence between equally distant vertebrate and plant species showed a trend of higher coverage in plant species that were recently and distantly diverged. The implication of this result is that although plant genomes can be highly variable intergenically, essential features such as coding sequences are highly conserved between species. It is more difficult to draw definitive conclusions about the alignability of cis-regulatory modules between kingdoms due to differing quality of annotations. However, we do observe a substantial difference in alignability when comparing the two kingdoms of life.

The preservation of conserved elements with no known annotation, even in the most stringently filtered sets, was surprising. This pattern illustrates that there are still segments of the A. thaliana genome that are conceivably functional but as yet uncharacterized. Our first-pass annotation of these conserved regions (fig. 3G) has shed light on the type of function associated with this DNA. Although we detected several types of features, the highlight is the identification of dozens of new potential RNA genes. These regions, found by partial BLAST homology to existing ncRNAs, or alternatively by finding protein homology with overlapping small RNA expression, may represent a previously unknown source of regulation in A. thaliana. However, the homology-based method used to annotate conserved regions is simple and broad and as a result cannot produce truly definitive annotations. These new annotation groups, however, are small enough for future manual refinement. Ultimately, function can only truly be assigned as a result of validation through benchwork that verifies RNA expression and effects on phenotype. The vignettes of novel folding regions highlighted here demonstrate the ability to quickly identify novel genome features and can serve as a guide for bench scientists to probe deeper into their gene families of choice with the help of our genome browser. Novel putative ncRNAs, such as that described in figure 5, are promising but require further investigation to confirm the paradigm that conservation implies function.

Beyond identifying new annotations via conservation, we have added depth to existing annotation by layering it with RNA folding and cis-regulatory information. In doing so, we have characterized on a genome-wide scale how RNA genes fold in flowering plants. Our use of two independent algorithms to precompute RNA folds and display them genome-wide gives researchers an instant second opinion on a methodology that is often subject to high false discovery rates and intense computational time (Gorodkin et al. 2010). This folding information, combined with TAIR gene tracks, conservation, and cis-regulatory motifs, allowed us to identify a potential fold that may well control regulation of a gene (fig. 4E).

This study articulated a few avenues for investigation into a comparative alignment. Most of the analyses presented can be viewed as tracks on the Arabidopsis genome browser; many can be reconstituted using the genome browser and table browser web tools. Plant comparative genomics has unique challenges due to the architecture of genomes in this kingdom of life. Continued accrual of sequence information and annotation will help empower further analysis of these complex organisms. At both the gene level and the genome level, this integration of plant DNA information will help inform decisions and formulate targets for investigation to gain further insight about plant evolution.

Materials and Methods

Pairwise Alignment

In this analysis, we included only those angiosperm genomes that have been previously published, following the guidelines of the Ft. Lauderdale agreement on rapid data release. The 20 genomes that were included in this analysis and their version numbers are listed in table 1, sorted by coverage. Total align refers to the number of nucleotides aligned to A. thaliana as determined by the mafCoverage software. CDS align refers to the overlap of alignment with existing A. thaliana CDS annotation as determined by intersections using the featureBits software. Both programs are part of the UCSC source tree (Kent et al. 2002).

A pairwise alignment pipeline was used to generate whole-genome alignments against a version 9 A. thaliana reference genome sequence assembled by TAIR (Swarbreck et al. 2008). All sequences were obtained as scaffolds or pseudochromosomes from the web repositories of the respective sequencing groups with the exception of the G. soja genome. Sequence data for G. soja were obtained from the Sequence Read Archive and mapped to G. max using Maq (Li et al. 2008), following the methods of the sequencing group (Kim et al. 2010). Lineage-specific repetitive regions were masked using the RepeatMasker (Smit and Hubley 2004) software suite, which improved BLAST results. Each query genome was split into regions of 1 million base pairs or less, whereas the reference genome, A. thaliana, was split into its seven pseudochromosomes. The alignment then proceeded using the LASTZ program (Harris 2007), a local alignment algorithm optimized for whole-genome alignment, which locally compared the A. thaliana reference genome sequence against all sequences in each query genome. This process was parallelized across a computer cluster to efficiently generate alignments from large data sets. LASTZ output relating query to reference was then linked into longer chains of contiguous alignment using axtChain (Kent et al. 2003). The alignment chains were sorted using chainNet, which keeps only the single best-aligned chain and maximizes coverage across the reference genome. The nets were then converted to multiple alignment files. The resulting pairwise alignments of each query genome to the A. thaliana reference were joined using MULTIZ (Blanchette et al. 2004), guided by the tree topology in figure 1. Postprocessing of the alignments included inserting annotations for alignment breaks and gaps using the mafAddIrows tool and identifying regions removed by RepeatMasker.
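The splitting of each query genome into regions of at most 1 million base pairs can be sketched as follows. This is only an illustration under stated assumptions: the function name `chunk_intervals` and the 0-based, half-open coordinates are choices made here, not details taken from the pipeline itself.

```python
def chunk_intervals(seq_length, max_chunk=1_000_000):
    """Return half-open (start, end) windows that tile a sequence.

    Every window is at most max_chunk bases long; only the last one
    may be shorter. (Hypothetical helper, for illustration only.)
    """
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    return [
        (start, min(start + max_chunk, seq_length))
        for start in range(0, seq_length, max_chunk)
    ]


# A hypothetical 2.5 Mb scaffold becomes three query chunks.
chunks = chunk_intervals(2_500_000)
```

Each chunk can then be aligned as an independent job, which is what makes the step straightforward to distribute across a cluster.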

Evaluating Alignment Quality and Refining Parameters

To determine whether the alignment process was producing reliable sequence relationships, alignments were evaluated based on base coverage and on annotation-specific quality. The mafCoverage program, part of the UCSC source tree, was used to measure the number of raw bases aligned, the number of exact base matches, the number of base mismatches, and the coverage. Starting with default parameters for all programs, each step in the alignment process was tuned to maximize coverage and minimize mismatch. The LASTZ alignment algorithm proved to be robust; even without any tuned parameters, the software produced pairwise alignments between A. thaliana and A. lyrata with coverage only 5% less than pairwise alignments made by the A. lyrata genome sequencing project. The final LASTZ parameters were as follows for all alignments: inner = 2,000, xdrop = 9,400, gappedthresh = 3,000, hspthresh = 2,200. Using these parameters, for example, coverage was increased by 4% between Ath/Aly compared with the default parameter baseline.
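As a rough illustration, the four tuned values could be assembled into a LASTZ invocation as shown below. This is a sketch, not the authors' actual command: the FASTA file names are hypothetical, no output-format option is given because the text does not state one, and the `--name=value` option spellings should be checked against the documentation of the installed LASTZ version.

```python
import shlex

# Tuned values reported in the text; everything else is left at defaults.
LASTZ_PARAMS = {
    "inner": 2000,
    "xdrop": 9400,
    "gappedthresh": 3000,
    "hspthresh": 2200,
}


def lastz_command(target_fasta, query_fasta, params=LASTZ_PARAMS):
    """Build an argument list for one reference-versus-chunk alignment."""
    cmd = ["lastz", target_fasta, query_fasta]
    cmd += [f"--{name}={value}" for name, value in params.items()]
    return cmd


# Hypothetical file names for one reference chromosome and one query chunk.
cmd = lastz_command("ath_chr1.fa", "aly_chunk_0001.fa")
print(shlex.join(cmd))
```

Keeping the parameters in one dictionary makes it easy to apply identical settings to every pairwise alignment, as the text describes.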

Evaluating Gene Quality

To judge the conservation of known A. thaliana gene features in pairwise aligned genomes beyond simple coverage numbers, the program cleanGenes (part of the PHAST package) was used to evaluate feature conservation. Using version 9 genome annotation from TAIR, gene-feature coordinates were extracted and located in pairwise alignments, and conservation was assessed. The types of features evaluated included start sites, stop sites, splice sites, frameshifts, and nonsense mutations; these were searched for “cleanly” conserved exons without gaps or mutations. Features were tallied as passing or failing after evaluating each feature’s conservation in a pairwise alignment.

To benchmark our alignment pipeline against the previously released, publicly available whole-genome plant alignments created for the VISTA genome browser, TAIR version 8 A. thaliana sequence was aligned to A. lyrata. More recent alignments of A. thaliana/V. vinifera and A. thaliana/G. max, which utilized a TAIR10 reference, were used for additional comparisons. VISTA alignments were incompatible with genome browser tools; thus, the mfa alignments were converted into MAF block format using a custom Python script. The resulting MAF format alignments for Ath/Aly were uploaded to a TAIR version 8 genome browser as browser MAF tracks. This allowed mafCoverage, as well as cleanGenes results from VISTA alignments, to be directly compared with alignments from the LASTZ/MULTIZ pipeline.

Scoring Conservation

To compute conservation tracks for the multiple alignment, phyloFit (Siepel and Haussler 2004) (a component of the PHAST package) was used to fit a phylogenetic model to 4-fold-degenerate sites found on each chromosome of the MULTIZ alignment as an initial starting sample (as described in the PHAST documentation). The resulting phylogenetic model was used in conjunction with the phastCons (Siepel et al. 2005) tool to create conserved and nonconserved phylogenetic trees. The phastCons program requires several iterations to refine parameters that predict conservation as part of its phylo-HMM. Starting with parameters for expected coverage and expected length gathered from a previous conservation analysis focusing on Solanaceae (Wang et al. 2008), the phastCons run was tuned to fit predefined criteria. Similar to previous studies analyzing conservation, our criteria were 60% coverage of the annotated coding regions by predicted conserved elements, as well as a phylogenetic information threshold score close to 10 bits as measured by the consEntropy software. The resulting parameters that fit our criteria were an expected coverage of 0.2 and an expected length of 80. Wig format data files were used to create a conservation track on the A. thaliana genome browser, which visualizes conservation scores as a continuous variable. Resulting conserved region lengths are graphed in supplementary figure S3, Supplementary Material online.

Conserved regions were classified using A. thaliana annotation tracks based on TAIR version 9 GFF files. Intersections and enrichment values of the annotation tracks versus the conserved region track were computed using the featureBits command line tool, part of the UCSC source tree. Significance was determined by Fisher’s exact test, using values gathered by featureBits, to determine whether certain groups were over-represented versus the normal composition. Normal composition of the A. thaliana genome was determined using the same methodology. Vertebrate conserved element enrichment was determined using featureBits and the phastConsElements46way track with annotation drawn from UCSC (Fujita et al. 2011).
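The enrichment test can be illustrated with a small, self-contained sketch. The layout of the 2x2 table (rows: conserved or not conserved; columns: inside or outside an annotation class) and the one-sided alternative are assumptions made here for illustration; the text does not specify them. For genome-scale base counts, an optimized routine such as `scipy.stats.fisher_exact` would be a more practical choice than this pure-Python version.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favourable = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(total, row1)


def fold_enrichment(a, b, c, d):
    """Share of conserved bases in the class over its share of all bases."""
    return (a / (a + b)) / ((a + c) / (a + b + c + d))


# Toy table: rows = conserved / not conserved, columns = in class / not.
p_value = fisher_exact_greater(3, 1, 1, 3)
enrichment = fold_enrichment(3, 1, 1, 3)
```

In this toy table the class holds three of the four conserved units but only half of all units, giving a 1.5-fold enrichment that is far from significant at this sample size.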

To evaluate and compare the conservation of specific genome features between species, the tool maf_interval_alignability.py was employed. This tool, part of the bx-python package and utilized in Miller et al. (2007), scores alignments to annotated features by measuring presence or absence of aligned sequence. Specifically, the program proceeds by tabulating the number of bases covered by a query species, compared with the number of bases within an interval that have missing alignment information. The alignability value is the number of bases with alignment divided by the sum of the number of positions with and without alignment. The graph in figure 3F displays mean values of alignability for annotation groups in select species. The columns of mean alignability values for each species are then scaled based on phylogenetic distance, as determined by substitutions per site drawn from a 4-fold degenerate neutral tree. Trend lines for vertebrate data were recapitulated using the latest alignment information drawn from the phastCons46way alignment (Fujita et al. 2011) to confirm the previously observed pattern. Annotation for cis-regulatory sites in the 46way alignment was drawn from the ORegAnno track annotation (Griffith et al. 2008).
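The alignability score defined above reduces to a simple ratio, sketched here for clarity. The function names and the per-interval `(aligned, unaligned)` count tuples are hypothetical conveniences; this is not the bx-python implementation.

```python
def alignability(aligned_bases, unaligned_bases):
    """Fraction of positions in an interval that have aligned sequence."""
    total = aligned_bases + unaligned_bases
    if total == 0:
        raise ValueError("interval has no positions")
    return aligned_bases / total


def mean_alignability(counts):
    """Average alignability over (aligned, unaligned) pairs for one group."""
    values = [alignability(a, u) for a, u in counts]
    return sum(values) / len(values)


# Two hypothetical features: 75 of 100 and 10 of 20 bases aligned.
example = mean_alignability([(75, 25), (10, 10)])
```

Note that this averages per-feature ratios, so short and long features carry equal weight; pooling bases across features before dividing would be a different, length-weighted summary.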

Building the Browser

To visualize alignments and make use of the collection of browser genomics tools, a mirror of the UCSC genome browser (Kent et al. 2002) was installed at local facilities and remains available at genome.genetics.rutgers.edu. The focus of this browser is to host comparative genomics data for Drosophila and plant species. We selected Oryza sativa and A. thaliana as reference genome browsers for monocots and eudicots, respectively, due to their extensive annotation and high-quality pseudochromosomes. Development has focused on the A. thaliana browser tracks as a prototype for a plant comparative genomics browser. These tracks include a bed file-based display of regions identified by RepeatMasker as repetitive sequence and gene tracks based on known genome annotations. Specifically, the foundation of the browser is gff3 format annotation created by TAIR, filtered for a single coverage of genes across each genome, and then converted to gene prediction (genePred) format and uploaded to the browser MySQL database. An alternative gene prediction track, created using Gnomon gene prediction software as part of a recent TAIR release (Lamesch et al. 2011), was also included as part of the browser.

Cis-regulatory elements were predicted based on regular expressions of A. thaliana transcription factor binding sites as listed in the AGRIS cis-regulatory database (Yilmaz et al. 2011). To create a browser-compatible track of elements, putative binding sites were called using GREP according to TAIR version 9 chromosomes and then mapped using BLAT (Kent 2002). The resulting coordinates were formatted into an extended bed genome-browser track, labeling the type of motif and its coordinates.
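Regular-expression motif calling of this kind can be sketched in a few lines. The example below is an assumption-laden illustration rather than the GREP and BLAT procedure described above: it scans the forward strand only, reports BED-style 0-based half-open coordinates directly, and uses the well-known G-box core CACGTG as a stand-in pattern instead of the actual AGRIS expressions.

```python
import re


def scan_motif(chrom, sequence, name, pattern):
    """Return BED-like (chrom, start, end, name) rows for motif matches.

    The pattern is wrapped in a lookahead so that overlapping matches
    are also reported. Only the forward strand is scanned.
    """
    regex = re.compile(f"(?=({pattern}))", re.IGNORECASE)
    return [
        (chrom, m.start(1), m.end(1), name)
        for m in regex.finditer(sequence)
    ]


# Toy sequence containing two G-box cores; names are illustrative.
hits = scan_motif("Chr1", "AACACGTGTTCACGTGG", "G-box", "CACGTG")
```

Degenerate sites can be expressed with character classes in the pattern, for example `[AC]ACGTG`, which is one reason regular expressions suit short binding-site motifs.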

Identifying Uninterrupted Conservation

To locate regions within the A. thaliana genome that are also found in all other sequenced and aligned angiosperm genomes in an uninterrupted block, a Python program based on mafUltras was written. The mafUltras software was used to identify ultraconserved elements in vertebrates and has been adapted for use here with the 20way alignment. Unlike a phastCons conservation analysis, this search method is dependent on a user-defined threshold; specifically, the threshold is the minimum length of uninterrupted alignment columns. When searching the human genome, this threshold was defined as 100 bp. To maximize inclusion of highly conserved elements and account for the overall shorter length of plant conserved elements, this threshold was set to 18 bp for the search in the 20way alignment. This value was chosen because angiosperm-conserved regions are in general substantially shorter than their mammalian counterparts (supplementary fig. S4, Supplementary Material online) and because 18 bp is the shortest length an expected noncoding RNA might be. We expect that this shortening of the threshold compared with mammals, from 100 to 18 bp, is an inclusive estimate rather than exclusive. Any detected elements were sorted according to overlap of known A. thaliana annotation from TAIR. Regions that did not map to known annotation were de novo annotated based on BLAST homology.
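The core of such a search, finding runs of identical alignment columns at least as long as a threshold, can be sketched as follows. This is a simplified illustration and not the mafUltras-derived program: it assumes a single alignment block given as equal-length strings (reference first, `-` for gaps), and it reports alignment-column coordinates rather than genome coordinates.

```python
def ultraconserved_runs(rows, min_len=18):
    """Find runs of columns where every species has the same base.

    Returns half-open (start, end) column ranges of length >= min_len.
    Columns containing a gap or an N in any row break a run.
    """
    if not rows or len({len(r) for r in rows}) != 1:
        raise ValueError("rows must be non-empty and of equal length")
    runs, start = [], None
    n_cols = len(rows[0])
    for i in range(n_cols + 1):  # one extra step closes a trailing run
        conserved = False
        if i < n_cols:
            column = {row[i].upper() for row in rows}
            conserved = len(column) == 1 and column.isdisjoint("-N")
        if conserved:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


# Toy three-species block; a short threshold is used so a run is found.
block = ["ACGTACGTAC", "ACGTACGAAC", "ACGTACG-AC"]
runs = ultraconserved_runs(block, min_len=4)
```

With the 18 bp threshold used in the text, the toy block above would yield no elements, which shows how strongly the result depends on the user-defined minimum length.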

Computing Secondary Structure

Computing folding structure of RNA molecules can be informed by conservation between related genomes. To identify secondary structure conservation, EvoFold was implemented to predict folding given a MAF block and a phylogenetic tree. As with similar EvoFold studies (Pedersen et al. 2006; Stark et al. 2007), conserved regions predicted by PhastCons were first joined to any neighboring conserved region at a distance no greater than 30 bp. These extended regions were subsequently split into lengths no greater than 750 bp. MAF blocks were extracted from the 20way alignment using the MafFrag utility, part of the UCSC software package, and postprocessed to be compatible with EvoFold alignment format. The 20 species newick tree used for all EvoFold runs was sourced from the PhastCons conservation run. EvoFold predictions were distributed across a compute cluster using default values in the control file provided in the EvoFold source code. Folds with scores below an LOD of 100 and folds with overlap of repetitive elements were filtered out. Result files were formatted into a BED6 structure, adhering to the file format used for previously implemented UCSC genome browser EvoFold tracks, and uploaded to a MySQL database for use as a browser track. Resulting fold lengths are graphed in supplementary figure S3, Supplementary Material online.
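The joining and splitting of conserved regions before folding can be illustrated with the sketch below. It assumes half-open coordinates on a single chromosome and splits long regions into consecutive 750 bp windows; whether the original pipeline split regions this way or into equal parts is not stated, so treat that choice as an assumption.

```python
def merge_and_split(regions, max_gap=30, max_len=750):
    """Join regions separated by <= max_gap bp, then cap piece length.

    regions: iterable of half-open (start, end) pairs on one chromosome.
    """
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous region: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    pieces = []
    for start, end in merged:
        for s in range(start, end, max_len):
            pieces.append((s, min(s + max_len, end)))
    return pieces


# Hypothetical conserved elements: the first two lie 20 bp apart.
pieces = merge_and_split([(100, 200), (220, 300), (400, 1300)])
```

Here the first two elements are joined into one 200 bp region, while the 900 bp element is cut into a 750 bp piece and a 150 bp remainder.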

A similar approach was taken to predict secondary structure in RNAs using the RNAalifold algorithm, part of the Vienna RNA package (Hofacker et al. 2002; Bernhart et al. 2008). This method uses the same data set described earlier, which contains conserved elements identified by PhastCons and processed for length and format. Results were filtered using the same thresholds as above and postprocessed to create a genome browser track. Secondary structure enrichment was found by intersecting the EvoFold annotation track with other annotation tracks using the featureBits command line tool. A final browser track was created containing the composite scores of both independent prediction methods. This data set was used for the results in figure 4.

The control data set used to verify the accuracy of predictions was sourced from TAIR annotation of tRNAs, which produced 658 annotations. Predicted fRNAs overlap 637 of the 658 annotations. The enrichment of predicted fRNAs in the set of existing annotation for the A. thaliana genome can be seen in figure 3E. As would be expected, translation-related RNAs (including tRNAs, rRNAs, snRNAs, and snoRNAs) are significantly enriched for having folding regions, with an enrichment (17.52x) more than triple that of the next nearest category, regulatory RNAs (4.08x).

Annotating Conserved Noncoding Regions

To characterize unannotated conserved regions scored by phastCons as most conserved within flowering plants, we relied on BLAST-based homology searches with default search parameters. Annotation focused on the top 10% of the distribution of most-conserved elements, so as to limit a considerably large data set to only the most highly conserved regions. A first-pass search for homology was performed using the BLAST algorithm to scan TAIR version 10 genome-wide annotation. BLAST results from this first-pass search were parsed using a custom script, which extracted the top-scoring search term for any result with an e-value of 0.1 or less. Regions with no homology within known A. thaliana annotation were then searched for homology to any known plant annotation contained in the Plant Genome Database, using an e-value cutoff of 0.1 or less (Duvick et al. 2007). In each case, the top BLAST search term was used as its tentative annotation. Further annotation was achieved by intersecting bed files containing coordinates of conserved regions annotated by BLAST homology with secondary structure browser tracks, proximity to exons, and existing small RNA expression databases sourced from the ASRP (Backman et al. 2008). Exon proximity was determined by searching for coordinates that were within 164 bp of an annotated exon, the average intron length in A. thaliana.
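Two of the filters described above, keeping the top-scoring hit under an e-value cutoff and flagging regions within 164 bp of an exon, can be sketched as follows. The tuple layouts, function names, and the use of bit score to rank hits are assumptions made for illustration; they are not taken from the custom script mentioned in the text.

```python
def top_hit(hits, max_evalue=0.1):
    """Return the best-scoring (subject, evalue, bitscore) hit, or None.

    Only hits with evalue <= max_evalue are considered.
    """
    passing = [h for h in hits if h[1] <= max_evalue]
    return max(passing, key=lambda h: h[2]) if passing else None


def interval_gap(a, b):
    """Gap in bp between half-open intervals; 0 if they overlap or touch."""
    return max(0, a[0] - b[1], b[0] - a[1])


def near_exon(region, exons, max_dist=164):
    """True if the region lies within max_dist bp of any exon."""
    return any(interval_gap(region, exon) <= max_dist for exon in exons)


# Hypothetical BLAST hits and coordinates for one conserved region.
best = top_hit([("hitA", 1e-5, 50.0), ("hitB", 0.5, 60.0), ("hitC", 0.01, 42.0)])
close = near_exon((1000, 1050), [(500, 900)])
far = near_exon((1000, 1050), [(500, 800)])
```

In this toy case `hitB` has the highest score but fails the e-value cutoff, so `hitA` becomes the tentative annotation.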

Programming and Data

All programs were written in the Python and C programming languages. All custom software used in the development and analyses is available upon request. All data sets of conserved elements and annotations have been made available as files and tracks on the A. thaliana genome browser (araTha9) located at genome.genetics.rutgers.edu.

Supplementary Material

Supplementary tables S1 and S2 and figures S1–S5 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Acknowledgments

This work was supported by two grants to A.D.K., NSF MCB-1052148 and DOE/USDA 124336, as well as the Human Genetics Institute of New Jersey.

References

Acarkan A, Rossberg M, Koch M, Schmidt R. 2000. Comparative genome analysis reveals extensive conservation of genome organisation for Arabidopsis thaliana and Capsella rubella. Plant J. 23:55–62.

Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–815.

Argout X, Salse J, Aury J-M, et al. (61 co-authors). 2011. The genome of Theobroma cacao. Nat Genet. 43:101–108.

Backman TWH, Sullivan CM, Cumbie JS, Miller ZA, Chapman EJ, Fahlgren N, Givan SA, Carrington JC, Kasschau KD. 2008. Update of ASRP: the Arabidopsis small RNA project database. Nucleic Acids Res. 36:D982–D985.

Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, Haussler D. 2004. Ultraconserved elements in the human genome. Science 304:1321–1325.

Bernhart SH, Hofacker IL, Will S, Gruber AR, Stadler PF. 2008. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics 9:474.

Blanchette M, Kent WJ, Riemer C, et al. (12 co-authors). 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14:708–715.

Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, NISC Comparative Sequencing Program, Green ED, Sidow A, Batzoglou S. 2003. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 13:721–731.

Brudno M, Poliakov A, Minovitsky S, Ratnere I, Dubchak I. 2007. Multiple whole genome alignments and novel biomedical applications at the VISTA portal. Nucleic Acids Res. 35:W669–W674.

Campbell MA, Zhu W, Jiang N, Lin H, Ouyang S, Childs KL, Haas BJ, Hamilton JP, Buell CR. 2007. Identification and characterization of lineage-specific genes within the Poaceae. Plant Physiol. 145:1311–1322.

Chan AP, Crabtree J, Zhao Q, et al. (18 co-authors). 2010. Draft genome sequence of the oilseed species Ricinus communis. Nat Biotechnol. 28:951–956.

Chiang CWK, Derti A, Schwartz D, Chou MF, Hirschhorn JN, Wu C-T. 2008. Ultraconserved elements: analyses of dosage sensitivity, motifs and boundaries. Genetics 180:2277–2293.

Cooper GM, Brudno M, NISC Comparative Sequencing Program, Green ED, Batzoglou S, Sidow A. 2003. Quantitative estimates of sequence divergence for comparative analyses of mammalian genomes. Genome Res. 13:813–820.

Davies TJ, Barraclough TG, Chase MW, Soltis PS, Soltis DE, Savolainen V. 2004. Darwin’s abominable mystery: insights from a supertree of the angiosperms. Proc Natl Acad Sci U S A. 101:1904–1909.

Drosophila 12 Genomes Consortium, Clark AG, Eisen MB, et al. (418 co-authors). 2007. Evolution of genes and genomes on the Drosophila phylogeny. Nature 450:203–218.

Dubchak I, Brudno M, Loots GG, Pachter L, Mayor C, Rubin EM, Frazer KA. 2000. Active conservation of noncoding sequences revealed by three-way species comparisons. Genome Res. 10:1304–1306.

Duvick J, Fu A, Muppirala U, Sabharwal M, Wilkerson MD, Lawrence CJ, Lushbough C, Brendel V. 2007. PlantGDB: a resource for comparative plant genomics. Nucleic Acids Res. 36:D959–D965.

Eddy SR. 2005. A model of the statistical power of comparative genome sequence analysis. PLoS Biol. 3:e10.

Frazer KA, Pachter L, Poliakov A, Rubin EM, Dubchak I. 2004. VISTA: computational tools for comparative genomics. Nucleic Acids Res. 32:W273–W279.

Freeling M, Subramaniam S. 2009. Conserved noncoding sequences (CNSs) in higher plants. Curr Opin Plant Biol. 12:126–132.

Friedman RC, Farh KK-H, Burge CB, Bartel DP. 2009. Most mammalian mRNAs are conserved targets of microRNAs. Genome Res. 19:92–105.

Fujita PA, Rhead B, Zweig AS, et al. (27 co-authors). 2011. The UCSC Genome Browser database: update 2011. Nucleic Acids Res. 39:D876–D882.

Gebhardt C, Walkemeier B, Henselewski H, Barakat A, Delseny M, Stuber K. 2003. Comparative mapping between potato (Solanum tuberosum) and Arabidopsis thaliana reveals structurally conserved domains and ancient duplications in the potato genome. Plant J. 34:529–541.

Giuliano G, Pichersky E, Malik VS, Timko MP, Scolnik PA, Cashmore AR. 1988. An evolutionarily conserved protein binding sequence upstream of a plant light-regulated gene. Proc Natl Acad Sci U S A. 85:7089–7093.

Glazov EA, Pheasant M, McGraw EA, Bejerano G, Mattick JS. 2005. Ultraconserved elements in insect genomes: a highly conserved intronic sequence implicated in the control of homothorax mRNA splicing. Genome Res. 15:800–808.

Goff SA, Ricke D, Lan T-H, et al. (55 co-authors). 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296:92–100.

Gorodkin J, Hofacker IL, Torarinsson E, Yao Z, Havgaard JH, Ruzzo WL. 2010. De novo prediction of structured RNAs from genomic sequences. Trends Biotechnol. 28:9–19.

Griffith O, Montgomery SB, Bernier B, et al. (27 co-authors). 2008. ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Res. 36:D107–D113.

Guo H. 2003. Conserved noncoding sequences among cultivated cereal genomes identify candidate regulatory sequence elements and patterns of promoter evolution. Plant Cell 15:1143–1158.

Hare PD, Moller SG, Huang L-F, Chua N-H. 2003. LAF3, a novel factor required for normal phytochrome A signaling. Plant Physiol. 133:1592–1604.

Harris RS. 2007. Improved pairwise alignment of genomic DNA. [PhD thesis]. [University Park (PA)]: The Pennsylvania State University.

Hofacker IL, Fekete M, Stadler PF. 2002. Secondary structure prediction for aligned RNA sequences. J Mol Biol. 319:1059–1066.

Hu TT, Pattyn P, Bakker EG, et al. (30 co-authors). 2011. The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nat Genet. 43:476–481.

Huang S, Li R, Zhang Z, et al. (96 co-authors). 2009. The genome of the cucumber, Cucumis sativus L. Nat Genet. 41:1275–1281.

Hubisz MJ, Pollard KS, Siepel A. 2011. PHAST and RPHAST: phylogenetic analysis with space/time models. Brief Bioinform. 12:41–51.

Hudson ME, Quail PH. 2003. Identification of promoter motifs involved in the network of phytochrome A-regulated gene expression by combined analysis of genomic sequence and microarray data. Plant Physiol. 133:1605–1616.

Inada DC, Bashir A, Lee C, Thomas BC, Ko C, Goff SA, Freeling M. 2003. Conserved noncoding sequences in the grasses. Genome Res. 13:2030–2041.

Kaplinsky NJ. 2002. Utility and distribution of conserved noncoding sequences in the grasses. Proc Natl Acad Sci U S A. 99:6147–6151.

Katzman S, Kern AD, Bejerano G, Fewell G, Fulton L, Wilson RK, Salama SR, Haussler D. 2007. Human genome ultraconserved elements are ultraselected. Science 317:915.

Kellis M, Patterson N, Endrizzi M, Birren B. 2003. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423:241–254.

Kent WJ. 2002. BLAT—the BLAST-like alignment tool. Genome Res. 12:656–664.

Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. 2003. Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 100:11484–11489.

Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. 2002. The human genome browser at UCSC. Genome Res. 12:996–1006.

Kim J, He X, Sinha S. 2009. Evolution of regulatory sequences in 12 Drosophila species. PLoS Genet. 5:e1000330.

Kim MY, Lee S, Van K, et al. (29 co-authors). 2010. Whole-genome sequencing and intensive analysis of the undomesticated soybean (Glycine soja Sieb. and Zucc.) genome. Proc Natl Acad Sci U S A. 107:22032–22037.

Kindgren P, Kremnev D, Blanco NE, de Dios Barajas Lopez J, Fernandez AP, Tellgren-Roth C, Small I, Strand A. 2011. The plastid redox insensitive 2 mutant of Arabidopsis is impaired in PEP activity and high light-dependent plastid redox signalling to the nucleus. Plant J. 70:279–291.

Kritsas K, Wuest SE, Hupalo D, Kern AD, Wicker T, Grossniklaus U. 2012. Computational analysis and characterization of UCE-like elements (ULEs) in plant genomes. Genome Res. 22:2455–2466.

Ku HM, Vision T, Liu J, Tanksley SD. 2000. Comparing sequenced segments of the tomato and Arabidopsis genomes: large-scale duplication followed by selective gene loss creates a network of synteny. Proc Natl Acad Sci U S A. 97:9121–9126.

Lamesch P, Berardini T, Li D. 2011. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 21:1–9.

Lenz D, May P, Walther D. 2011. Comparative analysis of miRNAs and their targets across four plant species. BMC Res Notes. 4:483.

Li H, Ruan J, Durbin R. 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18:1851–1858.

Michaud M, Cognat V, Duchene A-M, Marechal-Drouard L. 2011. A global picture of tRNA genes in plant genomes. Plant J. 66:80–93.

Miller W, Rosenbloom K, Hardison RC, et al. (26 co-authors). 2007. 28-way vertebrate alignment and conservation track in the UCSC Genome Browser. Genome Res. 17:1797–1808.

Ming R, Hou S, Feng Y, et al. (85 co-authors). 2008. The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature 452:991–996.

Morrell PL, Buckler ES, Ross-Ibarra J. 2011. Crop genomics: advances and applications. Nat Rev Genet. 13:85–96.

Paterson AH, Bowers JE, Bruggmann R, et al. (45 co-authors). 2009. The Sorghum bicolor genome and the diversification of grasses. Nature 457:551–556.

Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander ES, Kent J, Miller W, Haussler D. 2006. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol. 2:e33.

Retzel EF, Johnson JE, Crow JA, Lamblin AF, Paule CE. 2007. Legume resources: MtDB and Medicago.Org. Methods Mol Biol. 406:261–274.

Rhead B, Karolchik D, Kuhn RM, et al. (20 co-authors). 2010. The UCSC Genome Browser database: update 2010. Nucleic Acids Res. 38:D613–D619.

Sato S, Nakamura Y, Kaneko T, et al. (29 co-authors). 2008. Genome structure of the legume, Lotus japonicus. DNA Res. 15:227–239.

Schmutz J, Cannon SB, Schlueter J, et al. (45 co-authors). 2010. Genome sequence of the palaeopolyploid soybean. Nature 463:178–183.

Schmidt R. 2002. Plant genome evolution: lessons from comparative genomics at the DNA level. Plant Mol Biol. 48:21–37.

Schnable PS, Ware D, Fulton RS, et al. (157 co-authors). 2009. The B73 maize genome: complexity, diversity, and dynamics. Science 326:1112–1115.

Shulaev V, Sargent DJ, Crowhurst RN, et al. (71 co-authors). 2011. The genome of woodland strawberry (Fragaria vesca). Nat Genet. 43:109–116.

Siepel A, Bejerano G, Pedersen JS, et al. (16 co-authors). 2005. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15:1034–1050.

Siepel A, Haussler D. 2004. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol. 21:468–488.

Smit A, Hubley R. 2004. RepeatMasker Open-3.0. 1996–2010 [Internet]. Institute for Systems Biology. Available from: http://www.repeatmasker.org

Sommer RJ, Ogawa A. 2011. Hormone signaling and phenotypic plasticity in nematode development and evolution. Curr Opin Biol. 21:R758–66.

Stark A, Lin MF, Kheradpour P, et al. (44 co-authors). 2007. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature 450:219–232.

Stojanovic N. 2009. A study of the distribution of phylogenetically conserved blocks within clusters of mammalian homeobox genes. Genet Mol Biol. 32:666–673.

Sultan SE. 2000. Phenotypic plasticity for plant development, function and life history. Trends Plant Sci. 5:537–542.

Swarbreck D, Wilks C, Lamesch P, et al. (16 co-authors). 2008. The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res. 36:D1009–D1014.

Tang H, Bowers JE, Wang X, Ming R, Alam M, Paterson AH. 2008. Synteny and collinearity in plant genomes. Science 320:486–488.

Tang H, Wang X, Bowers JE, Ming R, Alam M, Paterson AH. 2008. Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps. Genome Res. 18:1944–1954.

Thomas BC, Rapaka L, Lyons E, Pedersen B, Freeling M. 2007. Arabidopsis intragenomic conserved noncoding sequence. Proc Natl Acad Sci U S A. 104:3348–3353.

Tuskan GA, Difazio S, Jansson S, et al. (110 co-authors). 2006. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313:1596–1604.

Velasco R, Zharkikh A, Affourtit J, et al. (86 co-authors). 2010. The genome of the domesticated apple (Malus × domestica Borkh.). Nat Genet. 42:833–839.

Velasco R, Zharkikh A, Troggio M, et al. (57 co-authors). 2007. A high quality draft consensus sequence of the genome of a heterozygous grapevine variety. PLoS One 2:e1326.

Vogel J, Garvin D, Mockler T, Schmutz J. 2010. Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature 463:763–768.

Wang X, Haberer G, Mayer KF. 2009. Discovery of cis-elements between sorghum and rice using co-expression and evolutionary conservation. BMC Genomics 10:284.

Wang XX, Wang HHH, Wang JJJ, et al. (110 co-authors). 2011. The genome of the mesopolyploid crop species Brassica rapa. Nat Genet. 43:1035–1039.

Wang Y, Diehl A, Wu F, Vrebalov J, Giovannoni J, Siepel A, Tanksley SD. 2008. Sequencing and comparative analysis of a conserved syntenic segment in the Solanaceae. Genetics 180:391–408.

Xu X, Pan S, Cheng S, et al. (98 co-authors). 2011. Genome sequence and analysis of the tuber crop potato. Nature 475:189–195.

Yang X, Jawdy S, Tschaplinski T. 2009. Genome-wide identification of lineage-specific genes in Arabidopsis, Oryza and Populus. Genomics 93:473–480.

Yang Y-W, Lai K-N, Tai P-Y, Li W-H. 1999. Rates of nucleotide substitution in angiosperm mitochondrial DNA sequences and dates of divergence between Brassica and other angiosperm lineages. J Mol Evol. 48:597–604.

Yilmaz A, Mejia-Guerra MK, Kurz K, Liang X, Welch L, Grotewold E. 2011. AGRIS: the Arabidopsis Gene Regulatory Information Server, an update. Nucleic Acids Res. 39:D1118–D1122.

Zeller G, Henz SR, Widmer CK, Sachsenberg T, Ratsch G, Weigel D, Laubinger S. 2009. Stress-induced changes in the Arabidopsis thaliana transcriptome analyzed using whole-genome tiling arrays. Plant J. 58:1068–1082.

Zhang B, Pan X, Cannon C, Cobb G. 2006. Conservation and divergence of plant microRNA genes. Plant J. 46:243–259.

Zheng W-X, Zhang C-T. 2008. Ultraconserved elements between the genomes of the plants Arabidopsis thaliana and rice. J Biomol Struct Dyn. 26:1–8.
