Using Supermatrices for Phylogenetic Inquiry: An Example Using the Sedges


Syst. Biol. 62(2):205–219, 2013
© The Author(s) 2012. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com
DOI:10.1093/sysbio/sys088
Advance Access publication October 26, 2012


CODY E. HINCHLIFF∗ AND ERIC H. ROALSON

School of Biological Sciences, Washington State University, Pullman, WA 99164-4236, USA
∗Correspondence to be sent to: Department of Ecology and Evolutionary Biology, University of Michigan, 2071A Kraus Natural Science Building, 830 N University, Ann Arbor, MI 48109-1048, USA; E-mail: cody.hinchliff@gmail.com.

Received 29 February 2012; reviews returned 14 June 2012; accepted 15 October 2012
Associate Editor: Michael A. Charleston

Abstract: In this article, we use supermatrix data-mining methods to reconstruct a large, highly inclusive phylogeny of Cyperaceae from nucleotide data available on GenBank. We explore the properties of these trees and their utility for phylogenetic inference, and show that even the highly incomplete alignments characteristic of supermatrix approaches may yield very good estimates of phylogeny. We present a novel pipeline for filtering sparse alignments to improve their phylogenetic utility by maximizing the partial decisiveness of the matrices themselves through a technique we call “phylogenetic scaffolding,” and we present a new method of scoring tip instability (i.e. “rogue taxa”) based on the I statistic implemented in the software Mesquite. The modified statistic, which we call IS, is somewhat more straightforward to interpret than similar statistics, and our implementation of it may be applied to large sets of large trees. The largest sedge trees presented here contain more than 1500 tips (about one quarter of all sedge species) and are based on multigene alignments with more than 20 000 sites and more than 90% missing data. These trees match well with previously supported phylogenetic hypotheses, but have lower overall support values and less resolution than more heavily filtered trees. Our best-resolved trees are characterized by stronger support values than any previously published sedge phylogenies, and show some relationships that are incongruous with previous studies. Overall, we show that supermatrix methods offer powerful means of pursuing phylogenetic study and these tools have high potential value for many systematic biologists.

[Cyperaceae; decisiveness; megaphylogeny; PHLAWD; supermatrix.]

The sedge family, Cyperaceae, is one of the world’s 10 most speciose families of flowering plants (Stevens 2008), containing more than 5400 species (Govaerts et al. 2007). Sedge species richness is highest in the tropics, but sedges may be found in nearly all angiosperm-supporting habitats across the biosphere, spanning arctic tundra, alpine zones, temperate and tropical forests, savannas, prairies, marshes, swamps, and deserts (Dahlgren et al. 1985; Bruhl 1995). Sedges are phenotypically and ecologically diverse, including tiny ephemerals, fire-resistant tussocks, scandent vines more than 10 m long, and submerged aquatic herbs. The combination of their broad distribution and diverse phenotypes makes sedges a compelling system for the study of global macroevolutionary ecology in plants (Naczi and Ford 2008).

Modern methods exploring macroevolutionary dynamics frequently require well-resolved, highly complete phylogenies (Smith and Donoghue 2008; Smith et al. 2009; Holton and Pisani 2010), which can be prohibitively expensive to obtain via directed sequencing of individual loci. Fortunately, public resources such as GenBank contain an abundance of existing data that can be used for this purpose. Here, we take advantage of supermatrix data-mining methods to reconstruct the most inclusive sedge phylogeny yet published, using nucleotide data gathered from GenBank via a novel data pipeline that relies on the software PHLAWD (Smith and Donoghue 2008; Smith et al. 2009). PHLAWD implements a protocol described in its documentation as “megaphylogeny”: an iterative process involving the extraction and alignment of nucleotide sequences, which results in fully aligned matrices. PHLAWD relies on the software Muscle (Edgar 2004) and MAFFT (Katoh et al. 2005) to perform recursive profile alignments that attempt to maximize site homology even at deep evolutionary scales. The pipeline we present in this article further improves the utility of the resulting alignments through a series of complementary refinement steps that filter taxa and genes to improve taxon coverage and branch stability.

Supermatrix methods offer a variety of advantages, including, perhaps most significantly, the ability to reconstruct phylogeny at broad scales with minimal investment in sequencing (McMahon and Sanderson 2006; de Queiroz and Gatesy 2007; Ren et al. 2009; Davis et al. 2010; van der Linde et al. 2010; Wolsan and Sato 2010), provided that the requisite sequence data are already available. These methods often present their own challenges, however, including issues of computational power and time efficiency, contentious questions regarding the effects of very sparse alignments on branch lengths (Lemmon et al. 2009; Wiens and Morrill 2011), and concerns about data integrity.

One important challenge is presented by the detrimental effects that high levels of missing data may have on tree reconstruction. A major component of this effect has recently been examined in detail by Steel and Sanderson (2010) and Sanderson et al. (2010). These papers defined and explored a property of data matrices that they named “partial decisiveness” (PDC), which is a practical measure of the limits of a multilocus data set to inform phylogenetic inference: PDC describes the proportion of branches across all possible trees that may be resolved given the pattern of missing samples. In most real-world alignments, some branches in the treespace cannot be resolved because of insufficient taxon overlap; alignments containing such patterns thus have a PDC of <1, and are said to be “indecisive” for those branches that cannot be reconstructed. Supermatrices are often characterized by very high levels of missing data, and as such can typically be expected to have PDC values lower than more conventional matrices. The PDC metric has not been broadly applied across a sample of such large combined matrices, but in practice we have observed PDC values between 0.6 and 0.7 to be common for unoptimized matrices compiled from public databases, which frequently contain >70% missing data. Although it is not necessary to have a completely decisive matrix to reconstruct the true tree (because no single tree contains every possible branch in the treespace), maximizing PDC can nonetheless generally be expected to improve tree resolution in the majority of real-world scenarios.

A related challenge is presented by so-called “rogue taxa”: destabilizing tips that move around the tree enough in replicate searches to cause the collapse of clades and the erosion of branch support values. The instability of a given tip may be compounded by multiple attributes, although perhaps most frequently by (i) data nonintegrity due to sequence misidentification and (ii) very weak phylogenetic signal simply due to a dearth of meaningful phylogenetic information. Identification and removal of rogue taxa is generally expected to help improve the resolution of trees containing such unstable tips (Thorley and Wilkinson 1999; Thomson and Shaffer 2010), and several methods have been presented to accomplish this goal (Thorley and Wilkinson 1999; Thorley and Page 2000; Smith and Dunn 2008). Here, we present and make use of a statistic modified from the I score implemented in Mesquite (Maddison and Maddison 2010), which overcomes some of the drawbacks of previously available instability metrics.

In this study, we present results from a supermatrix phylogeny (a PHLAWD megaphylogeny) of Cyperaceae. In this context, we address issues of data integrity as they relate to large combined analyses, and we explore the association between data decisiveness and tree resolution. We compare the results of our tree searches to results from a previously published family-level analysis of the Cyperaceae, and we show that maximizing the number of loci and taxa using the methods employed here may in fact yield greater power to resolve phylogenetic trees than more traditional, focused sequencing approaches that rely on smaller numbers of better-sampled loci, even with a highly incomplete alignment.

MATERIALS AND METHODS

Data Collection and Validation

Data were gathered from GenBank release 185 using the software tool PHLAWD (Smith et al. 2009), which searches the NCBI database for all nucleotide sequences for species within a given taxon that match a given text query. For instance, PHLAWD can be used to yield an alignment of all NCBI sequences matching the query string “internal transcribed spacer” within the NCBI taxon Cyperaceae. Because some nonorthologous sequences may be returned for any given query, PHLAWD requires a set of presupplied guide sequences for each query. Any query results that do not match these guide sequences by some arbitrary minimum amount of coverage and identity are excluded. We used cutoff proportions of 0.3 for coverage and 0.2 for identity for all searches.
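The guide-sequence filter described above can be sketched as follows. This is only an illustration and not PHLAWD's implementation: the function names, the gap symbol, and the particular definitions of coverage and identity (computed here from a query that has already been aligned to a guide) are assumptions made for the example. Only the cutoff values 0.3 and 0.2 come from the text.

```python
GAP = "-"  # assumed gap symbol for this sketch


def coverage_and_identity(query, guide):
    """Return (coverage, identity) for two pre-aligned, equal-length strings.

    Assumed definitions, for illustration only:
      coverage = guide residues paired with a query residue / guide residues
      identity = identical pairs / positions where both have a residue
    """
    if len(query) != len(guide):
        raise ValueError("sequences must be aligned to the same length")
    guide_sites = sum(1 for g in guide if g != GAP)
    paired = [(q, g) for q, g in zip(query, guide) if q != GAP and g != GAP]
    if guide_sites == 0 or not paired:
        return 0.0, 0.0
    coverage = len(paired) / guide_sites
    identity = sum(1 for q, g in paired if q == g) / len(paired)
    return coverage, identity


def passes_filter(query_alignments, min_coverage=0.3, min_identity=0.2):
    """Keep a hit if it meets both cutoffs against at least one guide.

    query_alignments: list of (aligned_query, aligned_guide) pairs,
    one pair per guide sequence.
    """
    for query, guide in query_alignments:
        cov, ident = coverage_and_identity(query, guide)
        if cov >= min_coverage and ident >= min_identity:
            return True
    return False
```

With these assumed definitions, a hit that overlaps only one tenth of every guide would be rejected on coverage regardless of how similar the overlapping portion is.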

We gathered data on 23 genes for all available species within the Cyperaceae. Figure 1 presents a heatmap of the coverage of these data for all recognized genera. We also ranked genera by overall coverage for these markers, according to several statistics intended to measure the amount of information available for each genus (Fig. 1, bottom 4 rows; see Supplementary material for details; Data Dryad doi: 10.5061/dryad.6p76c3pb). From these data, we generated alignment files representing more than 1500 species from almost all of the family’s genera. Guide sequences for each gene were chosen arbitrarily from the set of available GenBank data, in a manner that sought to maximize their phylogenetic spread. The resulting alignment files were concatenated on the basis of species name using the software phyutility (Smith and Dunn 2008). This approach combines available data from multiple exemplar specimens of each species to make phylogeny estimation at this scale possible, and it has proven effective for estimating phylogeny of large numbers of species when parallel data from multiple markers are not available from a single specimen (Jones et al. 2002; Smith and Donoghue 2008; Smith et al. 2009).
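Concatenation on the basis of species name can be illustrated with a short sketch. It is not phyutility's code: the function names, the dictionary-based alignment representation, and the use of "?" for missing data are assumptions chosen for the example.

```python
def concatenate_by_species(alignments, missing="?"):
    """Concatenate per-locus alignments keyed by species name.

    alignments: list of dicts (one per locus) mapping species name to an
    aligned sequence; all sequences within a locus share one length.
    A species absent from a locus is padded with the missing-data symbol.
    """
    species = sorted(set().union(*alignments))
    parts = {name: [] for name in species}
    for locus in alignments:
        lengths = {len(seq) for seq in locus.values()}
        if len(lengths) != 1:
            raise ValueError("each locus must be aligned to a single length")
        width = lengths.pop()
        for name in species:
            parts[name].append(locus.get(name, missing * width))
    return {name: "".join(chunks) for name, chunks in parts.items()}


def missing_proportion(matrix, missing="?"):
    """Proportion of cells in the concatenated matrix that are missing."""
    cells = "".join(matrix.values())
    return cells.count(missing) / len(cells)


# Hypothetical toy input: two loci, one species lacking the second locus.
rbcl = {"Carex a": "ACGT", "Cyperus b": "ACGA"}
ndhf = {"Carex a": "TT"}
supermatrix = concatenate_by_species([rbcl, ndhf])
```

Because sequences are joined by name alone, two sequences filed under the same species name end up in the same row even if they came from different vouchers, which is how the chimeric taxa discussed in this section can arise.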

Because data integrity on public databases such as GenBank is uncertain, it is possible to include poor-quality or incorrectly identified sequences when data mining these resources. When multiple sequences from different vouchers are concatenated on the basis of species identity, the potential exists for sequences from misidentified species to be concatenated with correctly identified ones, leading to the creation of so-called “chimeric” taxa. The tip nodes that are associated with these mistaken taxa are expected to contain conflicting phylogenetic signal which, at least in the case of bootstrapping, can erode support values for otherwise well-supported clades, because the chimeric tip is either expected to be placed near each of its different constituent species in different replicate searches, or to contain strong enough conflicting signal that it simply causes parts of the tree to collapse. One approach to a posteriori identification of these taxa involves scoring each taxon for a leaf stability index, a statistic that quantifies how much its position changes relative to other taxa in replicate tree searches. Those taxa with the lowest stability are assumed to contain either conflicting phylogenetic signal (especially in the case of chimeric taxa) or such low levels of signal that many placements are equivocal.

[Figure 1 image not reproduced. Heatmap with Cyperaceae genera as columns and the 23 markers as rows, followed by four summary rows (TOT_percent_cover, TOT_proportion_spp, ABS_importance, ABS_import_rank); a color key and histogram show value (0 to 1) against count.]

FIGURE 1. Heatmap of nucleotide sequence sampling in Cyperaceae. Sampling density and summary statistics for all currently recognized genera of Cyperaceae are shown for each of the 23 most densely sampled genetic markers on GenBank release 185. Sampling density for each marker/genus combination is a proportion of the total number of species sampled for that marker/genus divided by the total number of species recognized in that genus. Summary statistics are used to rank the relative quality of taxonomic sampling: “TOT percent cover” is the proportion of total species sequences obtained for some genus; “TOT proportion spp” is the relative contribution of a genus to the species richness of the family; “ABS importance” is a summary statistic that incorporates both TOT percent cover and TOT proportion spp to estimate the value of additional sequencing within some genus. “RANK importance” is the percentile rank of each value of ABS importance and represents the relative necessity of additional sequencing investment for each genus. See Supplementary material for a more thorough explanation including formulas used for summary statistics.

Multiple methods of measuring leaf stability have been proposed (Thorley and Wilkinson 1999; Thorley and Page 2000; Smith and Dunn 2008). For this study, we built upon previous work by Maddison and Maddison (2010) in Mesquite version 2.7. They present a statistic, I, which summarizes taxon movement among a set of trees, using differences in patristic distance between all taxon pairs across all pairs of trees in the set. Each difference in distance for a pair of taxa i and j between each pair of trees x and y is scaled to the sum of the patristic distance between i and j across both trees x and y, and these are summed across all trees. The Mesquite implementation of this method also requires an alignment to be loaded because the instability scores for each taxon are plotted against the percentage of missing data for that taxon in the alignment, but this requirement is inconvenient when the comparison of instability to missing data is not of interest. An additional drawback is that because the magnitude of the scores depends on the number of trees and the number of taxa, direct comparison of I scores generated from different distributions of trees is not straightforward. It is also not feasible to perform these analyses in Mesquite using very large phylogenies because of memory restrictions. To overcome these challenges, we wrote a Python (http://www.python.org; last accessed November 2012) program that can take advantage of parallel multiprocessing architectures and large amounts of memory to rapidly calculate instability scores even for large sets of very large trees. This script generates scores we have termed IS ($I^{S}$ in the equation below), which differ from I in that they are instead scaled to the random expectation for taxon movement in the provided set of trees. Thus, for all combinations {x,y} of trees x and y in some set of trees X (e.g. a set of bootstrap replicates or a Bayesian posterior distribution), and all taxa j ≠ i in these trees, the summary statistic $I^{S}_{i}$ for a given taxon i is calculated as:

$$I^{S}_{i} = \frac{2\sum_{\{x,y\}}\sum_{j \neq i}\left|D_{ijx} - D_{ijy}\right|}{D_{Re}\cdot n(n-1)}$$

where $D_{ija}$ is the unweighted patristic distance between taxa i and j in tree a, $D_{Re}$ is the random expectation for this distance, which is calculated by taking the average per-tree value of the numerator in the above equation on a subsample of trees from X with their tips randomized, and n is the number of trees in X. This statistic has the advantage of being directly interpretable as a multiple of the random expectation for taxon movement, and as such it may be compared directly even between different distributions of trees with different sets of taxa. A score of $I^{S}_{j} = 0.7$, for instance, means that taxon j moves ∼70% of the random expectation for taxon movement relative to all the other taxa across a given set of trees. A taxon with a score of 1.4 would, therefore, move twice this much.
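A small sketch may help make the IS calculation concrete. It is not the authors' program: trees are represented as nested tuples of tip names, distances are edge counts on the rooted tree, and the random expectation is interpreted here as the mean per-taxon movement after independently shuffling tip labels on every tree. All function names and that interpretation of the random expectation are assumptions made for this example.

```python
import itertools
import random


def tip_paths(tree):
    """Map each tip to its list of ancestor node ids, root first."""
    paths, counter = {}, itertools.count()

    def walk(node, ancestors):
        if isinstance(node, tuple):
            here = ancestors + [next(counter)]
            for child in node:
                walk(child, here)
        else:
            paths[node] = ancestors

    walk(tree, [])
    return paths


def patristic(paths, a, b):
    """Unweighted patristic distance: number of edges between tips a and b."""
    shared = 0
    for u, v in zip(paths[a], paths[b]):
        if u != v:
            break
        shared += 1
    return len(paths[a]) + len(paths[b]) - 2 * shared + 2


def movement(path_sets, taxon, taxa):
    """2 * sum over tree pairs and other taxa of |D_ijx - D_ijy|, over n(n-1)."""
    total = 0
    for px, py in itertools.combinations(path_sets, 2):
        for j in taxa:
            if j != taxon:
                total += abs(patristic(px, taxon, j) - patristic(py, taxon, j))
    n = len(path_sets)
    return 2 * total / (n * (n - 1))


def relabel(tree, mapping):
    """Return a copy of the tree with tip names replaced via mapping."""
    if isinstance(tree, tuple):
        return tuple(relabel(child, mapping) for child in tree)
    return mapping[tree]


def instability_scores(trees, seed=0):
    """Score every taxon as a multiple of the (assumed) random expectation."""
    taxa = sorted(tip_paths(trees[0]))
    real = [tip_paths(t) for t in trees]
    rng = random.Random(seed)
    shuffled = []
    for t in trees:
        perm = taxa[:]
        rng.shuffle(perm)
        shuffled.append(tip_paths(relabel(t, dict(zip(taxa, perm)))))
    d_re = sum(movement(shuffled, i, taxa) for i in taxa) / len(taxa)
    if d_re == 0:
        raise ValueError("random expectation is zero for this tree set")
    return {i: movement(real, i, taxa) / d_re for i in taxa}


# Hypothetical toy tree set in which tip "R" changes position in every tree.
toy_trees = [
    (((("R", "A"), "B"), "C"), ("D", "E")),
    ((("A", "B"), "C"), (("D", "R"), "E")),
    ((("A", "B"), ("C", "R")), ("D", "E")),
]
toy_scores = instability_scores(toy_trees)
```

In the toy set, "R" receives the highest score because its distances to the other tips change most between trees. The exhaustive loop over tree pairs and taxa is only practical for small inputs, which is presumably why the program described in the text relies on parallel processing.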

Data Filtering and Heuristic Searches

We used 2 strategies to subsample the data gathered by PHLAWD to improve the resolution of resulting trees. The first strategy consisted of removing the 10% of taxa with the highest IS scores to improve branch stability; the second consisted of excluding taxa lacking sequences for both of 2 loci that are broadly sampled and contain strong phylogenetic signal for deep nodes in the tree, ndhF and rbcL (i.e. retaining only taxa with a sequence for at least one of them). This second approach, which we refer to as phylogenetic “scaffolding,” improves the PDC of the matrix by maximizing taxon coverage for the selected markers, and relies on the assumption that those markers contain sufficient signal and sampling to reconstruct major relationships (i.e. deep nodes) in the tree. To independently assess levels of phylogenetic signal across our data set, we estimated phylogenetic informativeness profiles (Townsend and López-Giráldez 2010; Townsend and Leuenberger 2011) for the most broadly sampled loci in this study (Fig. 2), using the PhyDesign website (López-Giráldez and Townsend 2011). Informativeness profiles indicate the predicted phylogenetic signal contained by each locus as a function of tree depth, and can be useful for assessing the utility of loci to reconstruct relationships in different parts of the tree. Although rbcL has overall lower informativeness throughout the entire tree than ndhF (Fig. 2), it is the most broadly sampled marker across the family and we justified its use as part of the scaffold because it allows the inclusion of many rare taxa that would otherwise be absent, with a minimal decrease in decisiveness.
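The two filters can be sketched in a few lines. The function names, the input structures, and the rounding rule used to pick "the top 10%" are assumptions for illustration; the text does not say how ties or fractional counts were handled.

```python
import math


def scaffold_taxa(sampling, backbone=("ndhF", "rbcL")):
    """Taxa that have a sequence for at least one backbone locus.

    sampling: dict mapping taxon name to the set of loci sampled for it.
    """
    wanted = set(backbone)
    return {taxon for taxon, loci in sampling.items() if loci & wanted}


def drop_rogues(scores, fraction=0.10):
    """Remove the given fraction of taxa with the highest instability scores.

    The number removed is rounded down (an assumption of this sketch).
    """
    n_drop = math.floor(len(scores) * fraction)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[n_drop:])


# Hypothetical sampling table and instability scores.
sampling = {
    "Carex a": {"rbcL", "ITS"},
    "Carex b": {"ITS", "ETS"},
    "Cyperus c": {"ndhF"},
}
kept_by_scaffold = scaffold_taxa(sampling)
```

Applying `scaffold_taxa` first and `drop_rogues` second mirrors the most strictly filtered data set in the study, although in practice the instability scores would need to be recomputed from trees built on the scaffolded alignment.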

To assess the relative utility of both of our filtering approaches, we conducted 4 parallel rounds of inference using data sets filtered by either (i) neither method (i.e. raw concatenated PHLAWD alignments), (ii) filtering rogue taxa only, (iii) filtering taxa lacking sequences for selected loci only, or (iv) both methods (Table 1). We performed maximum likelihood (ML) bootstraps to estimate phylogenies from these alignments using RAxML version 7.2.6 (Stamatakis 2006). For each alignment, a 300-replicate rapid bootstrap heuristic search (RAxML’s “-f a” option) was performed. Finally, we calculated descriptive statistics about the alignments and the resulting trees to compare our filtering methods’ efficacy for improving tree resolution.
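For readers who want to reproduce a comparable search, the sketch below assembles a RAxML command line for a 300-replicate rapid bootstrap run with the “-f a” option. The executable name, substitution model, random seeds, and file names are assumptions of this example; the text specifies only the RAxML version, the “-f a” option, and the number of replicates.

```python
def raxml_rapid_bootstrap_cmd(alignment, run_name, replicates=300,
                              model="GTRGAMMA", seed=12345,
                              executable="raxmlHPC"):
    """Build an argument list for a rapid bootstrap plus ML search.

    -f a : rapid bootstrap analysis and best-tree search in one run
    -#   : number of bootstrap replicates
    -x   : rapid bootstrap random seed
    -p   : parsimony random seed
    The model, seeds, and executable name are placeholders, not values
    reported in the article.
    """
    return [
        executable, "-f", "a",
        "-s", alignment, "-n", run_name,
        "-m", model, "-#", str(replicates),
        "-x", str(seed), "-p", str(seed),
    ]


# Hypothetical file and run names.
cmd = raxml_rapid_bootstrap_cmd("sedges_sc_rf.phy", "sc_rf")
print(" ".join(cmd))
```

The list form can be passed directly to `subprocess.run` on a machine where RAxML is installed.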

For comparison, we also reconstructed the Cyperaceae family alignment used in Muasya et al. (2009), which consisted of rbcL and trnL-trnF data. We applied the same heuristic search methods to this alignment and evaluated the resulting trees under the same criteria. The sequence data were gathered directly from GenBank release 185 using the accession IDs listed in Muasya et al. (2009). Alignment was performed in Muscle v3.7 (Edgar 2004), with subsequent manual adjustments by eye. The trnL-trnF region is highly polymorphic at this phylogenetic depth and some regions were excluded because of difficulty assessing homology, as in the original paper. Using these methods, we were able to collect and align sequence data for 253 of the original 262 taxa.

RESULTS

A heatmap of sampling density for select markers across all currently recognized Cyperaceae genera is presented in Figure 1. Only a small handful of loci (those at the top of the matrix) show reasonably broad sampling across the family. The most broadly sampled are ndhF and rbcL. The second to last row, ABS_importance, contains values of a summary statistic we developed to quantify taxon coverage for each genus, for the markers used in this study (see Supplementary material for formulas). Because all genera have been poorly sampled for most of these markers, all their importance scores are quite similar. We therefore calculated the percentile rank of each importance score; these ranks are presented in the last row, ABS_import_rank. High scores in this row indicate those genera with the lowest levels of sampling. This heatmap was generated using the gplots package in R (Warnes et al. 2011).

[Figure 2 image not reproduced. Panels: trnL-trnF, rps16, ndhF, ITS, atpB, matK, rbcL; x-axis: tree depth (0.8 to 0); y-axis: net phylogenetic informativeness (“Net Phy. Inform.”, labeled with a 10^-7 scale).]

FIGURE 2. Phylogenetic informativeness profiles for the most broadly sampled loci in the combined PHLAWD alignment, generated using PhyDesign. Labeled panels contain the informativeness profile for the indicated gene, while the bottom-middle panel contains all profiles and the bottom right panel contains the chronogram used by the informativeness algorithm. Shapes in the tree represent the same clades as in Figure 3.

TABLE 1. Descriptive statistics for alignments and trees resulting from each of four combined approaches to data filtering, and an rbcL/trnL-trnF alignment reconstructed from Muasya et al. (2009)

Method                                              Tips   Sites  Missing  Decisiveness  Resolved  >0.95  >0.7  >0.6
0. trnL-F+rbcL tree                                  253    3006     0.46          0.95      0.49   0.31  0.77  0.91
1. All tips/rogues not filtered (AT/NF)             1526   20017     0.89          0.74      0.57   0.31  0.72  0.86
2. All tips/rogues filtered (AT/RF)                 1366   19425     0.88          0.76      0.64   0.33  0.74  0.87
3. Scaffold taxa only/rogues not filtered (SC/NF)    484   16025     0.79          0.86      0.71   0.41  0.75  0.87
4. Scaffold taxa only/rogues filtered (SC/RF)        435   16016     0.79          0.86      0.76   0.42  0.76  0.86

Notes: Row 0 corresponds to this rbcL/trnL-trnF alignment. Rows 1–4 correspond to various filtering methods applied to the PHLAWD alignment:
1. No filtering: all available data for the markers chosen at the data-mining step are used.
2. Filtering of unstable tips only: all sequence data from the 90% most stable tips are used.
3. Filtering of less informative loci only: taxa not represented by at least an rbcL or ndhF sequence are removed, all other available markers for the remaining taxa retained.
4. Filtering of loci and taxa: all available loci for the 90% most stable taxa represented by at least ndhF or rbcL are retained.
All decimal values are proportions. The “Tips” and “Sites” columns show the actual count of tips and sites in the alignments; “Missing” is the proportion of missing data across all cells; “Decisiveness” is the PDC metric; “Resolved” shows the proportion of resolved nodes in the majority rule consensus trees calculated from the ML bootstrap tree sets, and the last 3 columns represent the proportion of these resolved nodes bearing a bootstrap support value greater than or equal to the respective cutoff value.

Table 1 presents statistics describing the alignments and corresponding ML trees constructed from the raw and filtered PHLAWD alignments and also the rbcL + trnL-trnF alignment reconstructed from Muasya et al. (2009). Despite having very high decisiveness, the rbcL + trnL-trnF alignment performed similarly to the concatenated, unfiltered PHLAWD alignment, and was actually surpassed by the raw PHLAWD data in terms of total proportion of nodes resolved, by ∼10%. For the filtered PHLAWD alignments, each increase in filtering stringency was associated with further improvement to the resolution of the resulting consensus trees (Fig. 3), as evidenced by the increase in the proportion of resolved nodes (Table 1). Although the frequency of nodes with >0.6 bootstrap proportion (BP) varied little as filter strictness increased, the proportion of nodes with 0.7 or greater BP showed a steady increase. The most dramatic improvement was seen when taxa not represented by either an rbcL or ndhF sequence were excluded from the alignment: the proportion of nodes resolved with >0.95 BP increased by 8%, with a concomitant and presumably related 10% increase in the decisiveness of the matrix. The best-resolved and best-supported tree (Figs. 4–7) was generated from the most strictly filtered alignment (Table 1, row 4), which contained only the 90% most stable tips from the alignment consisting only of those taxa represented by a sequence for at least one of the loci ndhF and rbcL.

The majority-rule consensus topologies from the ML bootstrap searches corresponding to rows 1–4 of Table 1 are presented in Figure 3. The trees are presented without taxon names or support values because space constraints preclude legible presentation of these data in printed form (higher-resolution versions of trees 1–3 with legible taxon names and support values are available in Supplementary material; Data Dryad doi: 10.5061/dryad.6p76c3pb). Figure 3 clearly shows an overall increasing trend in the proportion of resolved nodes that accompanied more stringent filtering methods. The best-resolved consensus tree, corresponding to the most heavily filtered alignment (Table 1, row 4), is presented in Figures 4–7. Support values throughout most of the tree are high, especially for deep nodes, and relationships among major lineages of sedges are generally well-resolved, with strong support for the monophyly of the 2 subfamilies Mapanioideae and Cyperoideae, and generally strong support for most previously recognized major clades. The monophyly of many currently recognized genera, however, is not supported, in agreement with previous findings from the studies that generated the data we gathered from GenBank.

DISCUSSION

Data Decisiveness and Tree Reconstruction

PDC (Sanderson et al. 2010; Steel and Sanderson 2010) provides a general method of assessing the effect of missing data on tree reconstruction methods, and it is of interest for supermatrix data mining because maximizing PDC is likely to improve phylogenetic resolution. This metric is independent from measures of phylogenetic signal; instead it describes the proportion of branches, out of the set of all branches in all possible trees, which may be reconstructed given the patterns of taxon overlap among markers in the data set. Sampling matrices with high decisiveness may be used to reconstruct most or all branches in many trees, whereas those with low decisiveness may lack the taxon overlap to even allow resolution of many or all branches, even if the sampled markers have strong phylogenetic signal.

Because missing data can have such strong effects on tree searching, decisiveness should be of interest to all systematists working with even moderately incomplete data sets, and particularly so for those using very large, sparse data sets. However, achieving complete decisiveness is not necessarily a requirement for resolving the best tree (or set of trees), because under the common assumption that there are relatively few best trees (or just one), only a tiny fraction of all possible branches is represented in them (at least for large trees). As long as patterns of missing data do not preclude the resolution of these “good” branches, very good trees may be found using matrices with PDC considerably <1, as we have demonstrated here.

The PDC metric itself actually measures the proportion of 4-way combinations (quartets) of taxa that are represented by homologous data for at least a single locus in the alignment. Matrices within which every quartet is represented by a set of homologous sequences for at least one locus are fully decisive. Steel and Sanderson (2010) called this the 4-way partition property. The phylogenetic scaffolding approach we used relies on this property: it maximizes the proportion of homologous quartets/triplets by excluding taxa not represented for some subset of loci (the scaffold backbone). When a single locus is used as a backbone, scaffolding results in a completely decisive matrix, because the 4-way partition property is met through the presence of all taxa at that locus. We used 2 well-sampled (but not completely sampled) loci to form a backbone, which does not necessarily result in a completely decisive matrix, but has the advantage of allowing more tips to be included.
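The quartet-based description above translates directly into a brute-force calculation, sketched here for small matrices. The function name and input structure are assumptions for this example, and exhaustive enumeration of quartets is not feasible for matrices of the size used in this study (a 1500-taxon matrix has on the order of 10^11 quartets), so this should be read as an illustration of the definition rather than a practical tool.

```python
from itertools import combinations


def partial_decisiveness(sampling):
    """Proportion of taxon quartets sampled together for at least one locus.

    sampling: dict mapping locus name to the set of taxa sequenced for it.
    """
    taxa = sorted(set().union(*sampling.values()))
    total = covered = 0
    for quartet in combinations(taxa, 4):
        total += 1
        members = set(quartet)
        if any(members <= locus_taxa for locus_taxa in sampling.values()):
            covered += 1
    return covered / total if total else 0.0


# Hypothetical example: one locus sampled for every taxon gives PDC = 1.
complete_backbone = {
    "rbcL": {"A", "B", "C", "D", "E"},
    "ITS": {"A", "B"},
}

# Two partially overlapping loci cover only 2 of the 5 possible quartets.
partial_backbone = {
    "rbcL": {"A", "B", "C", "D"},
    "ndhF": {"B", "C", "D", "E"},
}
```

The first toy matrix illustrates the single-locus backbone case described in the text; the second shows how a two-locus backbone can admit an extra taxon at the cost of complete decisiveness.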

Phylogenetic Informativeness and Scaffolding

An additional advantage of phylogenetic scaffolding is that it provides a means to robustly reconstruct deep nodes in the tree (using the backbone loci), while allowing the inclusion of faster evolving markers to inform shallower relationships, even when sampling does not overlap among clades. Identifying good backbone markers requires consideration not only of how broadly sampled the candidate markers are but also the depths at which they will be maximally informative. One metric that may be helpful for selecting scaffold backbones is the phylogenetic informativeness of Townsend and Leuenberger (2011) and Townsend and López-Giráldez (2010), which quantifies the predicted phylogenetic signal for a given marker as a function of tree depth.

FIGURE 3. Majority-rule consensus trees resulting from parallel 300-replicate ML bootstrap searches performed on each of the alignments summarized in Table 1. a)–d) correspond to rows 1–4 from Table 1, respectively. Although tip and branch labels could not be included due to space constraints, topological landmark nodes corresponding to several major clades are labeled. The star labels the genus Carex, the square Eleocharis, the circle the tribe Cypereae, and the triangle the tribe Schoeneae. The dotted lines in c) and d) indicate an area of the tree that apparently experienced a decrease in resolution as a result of more stringent rogue taxon filtering, although as discussed in the text, this apparent loss of resolution may actually represent an increase in phylogenetic accuracy.

As shown in Figure 2, some loci that are broadly sampled, such as ITS and trnL-trnF, are characterized by having informativeness peaks at relatively shallow depths. The peak informativeness is presumably the point at which the sequence data reach saturation in terms of number of changes, and beyond which their signal is replaced by noise. Markers showing this pattern are presumed to be more informative for shallow nodes than deep ones. Loci exhibiting this pattern are of suspect utility for use as a scaffold backbone, but may be included in alignments containing strong signal in other loci for deep nodes. In the context of this study, loci such as rbcL and ndhF have lower predicted overall signal than some more rapidly evolving markers, but because they remain unsaturated deep in the tree, they are presumed to contain more information to resolve deep nodes (Fig. 2). Unsaturated loci such as these are better candidates for use as backbones. We have observed, however, that high predicted signal does not necessarily equate with high resolution even in clades for which a locus is well-sampled (e.g. trnL-trnF for the genus Carex). The relatively low performance of the rbcL+trnL-trnF alignment (Table 1, row 1), despite the exceptional predicted signal for trnL-trnF, provides an example of this (Fig. 2).
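The contrast between rapidly and slowly evolving loci described above can be illustrated with a small numerical sketch. It assumes a per-site informativeness profile of the form 16 * rate^2 * depth * exp(-4 * rate * depth), which is commonly associated with informativeness profiling but is not given in this article, and it may differ from the exact formulations in the works cited. The two loci, their site counts, and their rates are invented for illustration and are not estimates for any sedge marker.

```python
import math

def site_informativeness(rate, depth):
    """Assumed per-site profile: 16 * rate^2 * depth * exp(-4 * rate * depth)."""
    return 16.0 * rate ** 2 * depth * math.exp(-4.0 * rate * depth)

def locus_profile(site_rates, depths):
    """Sum the per-site profile over all sites of a locus at each depth."""
    return [sum(site_informativeness(r, t) for r in site_rates) for t in depths]

def peak_depth(site_rates, depths):
    """Return the depth on the supplied grid at which the profile is highest."""
    profile = locus_profile(site_rates, depths)
    best = max(range(len(depths)), key=lambda i: profile[i])
    return depths[best]

# Hypothetical loci (rates in substitutions per site per unit of tree depth).
fast_locus = [2.0] * 500     # few sites, all fast (spacer-like)
slow_locus = [0.25] * 1400   # many sites, all slow (coding-like)

# Grid of tree depths from 0.01 to 2.00, measured from the tips.
depths = [i / 100.0 for i in range(1, 201)]

fast_peak = peak_depth(fast_locus, depths)
slow_peak = peak_depth(slow_locus, depths)
print("fast locus peaks at depth", fast_peak)
print("slow locus peaks at depth", slow_peak)

# Predicted signal of each locus at a deep node (depth 1.5 on this scale).
deep_fast = locus_profile(fast_locus, [1.5])[0]
deep_slow = locus_profile(slow_locus, [1.5])[0]
print("signal at depth 1.5 (fast, slow):", deep_fast, deep_slow)
```

Under these assumptions the fast locus peaks near the tips and retains almost no predicted signal at depth, whereas the slow locus peaks much deeper, which is the qualitative pattern that makes slowly evolving loci better backbone candidates.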

Rogue Taxon Filtering

Filtering out the 10% least stable taxa from the most-inclusive alignment (Table 1, row 2) resulted in a 7% increase in the proportion of resolved nodes, whereas filtering the 10% least stable taxa from the scaffolded alignment resulted in a 5% increase. These differences may seem small, but visual inspection of the resulting consensus trees (Fig. 3, compare 3a vs. 3b and 3c vs. 3d) confirms that they represent a meaningful increase in resolution. The newly resolved nodes are concentrated in areas of the tree that were mostly to entirely unresolved by the unfiltered alignments. Some areas of the tree actually experienced an apparent decrease in resolution as a result of stringent rogue filtering (see dotted outlines in Fig. 3d), but the nodes that were lost were characterized by relatively low support values, and with their removal, support values for surrounding nodes increased.

FIGURE 4. Majority-rule consensus tree resulting from a 300-replicate ML search performed on the alignment corresponding to Table 1, row 4. Branch labels indicate BP. Major clades and grades of the Cyperaceae are labeled to the right. D. = Dulichieae; Fuir. grade = Fuireneae grade; T. = Trilepideae. Each labeled group occurs only once in the tree, but it may extend across multiple pages. In these cases, the parts of the group on different pages are labeled individually.

FIGURE 5. See legend to Figure 4.


FIGURE 6. See legend to Figure 4.


FIGURE 7. See legend to Figure 4.


Several leaf instability indices exist, all of which can be used to filter rogues (Thorley and Wilkinson 1999; Thorley and Page 2000; Smith and Dunn 2008; Maddison and Maddison 2010). They serve similar purposes and should give essentially similar results, but their interpretations may differ. The statistic developed by Thorley and Wilkinson (1999), implemented in RadCon (Thorley and Page 2000) and phyutility (Smith and Dunn 2008), is based on branch support values and may be sensitive to them, while the I statistic employed by Maddison and Maddison (2010) in Mesquite measures taxon movement directly (as opposed to using support values), but is not standardized among data sets. The statistic IS that we employ also measures movement directly, but creates scores that may be directly compared among data sets. The Python script implementing the IS calculations may also be applied to larger sets of larger trees than some of the other methods, using parallel computation. How the IS method compares to the graph-based approach RogueNaRok (Aberer et al. 2012) remains to be seen.
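To make the idea of measuring taxon movement directly more concrete, the following is a minimal sketch of a distance-change score computed over a set of trees. It is not the I statistic of Mesquite or the IS statistic used in this study, whose exact definitions are not given in this section. The scoring formula, the function names, the nested-tuple tree encoding, and the three toy trees are all assumptions made for illustration.

```python
from itertools import combinations

def tip_paths(tree, path=()):
    """Map each tip name to the tuple of child indices leading to it from the root."""
    if isinstance(tree, str):
        return {tree: path}
    paths = {}
    for i, child in enumerate(tree):
        paths.update(tip_paths(child, path + (i,)))
    return paths

def node_distance(path_a, path_b):
    """Count the edges separating two tips, given their root-to-tip paths."""
    shared = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        shared += 1
    return (len(path_a) - shared) + (len(path_b) - shared)

def pairwise_distances(tree):
    """Edge-count distance for every pair of tips in one tree."""
    paths = tip_paths(tree)
    return {frozenset((a, b)): node_distance(paths[a], paths[b])
            for a, b in combinations(sorted(paths), 2)}

def instability_scores(trees):
    """Per tip, sum how much its distance to each other tip changes between trees.

    Assumes every tree contains exactly the same set of tips.
    """
    dists = [pairwise_distances(t) for t in trees]
    scores = {tip: 0.0 for tip in tip_paths(trees[0])}
    for d_x, d_y in combinations(dists, 2):
        for pair in d_x:
            a, b = d_x[pair], d_y[pair]
            change = abs(a - b) / float((a + b) ** 2)
            for tip in pair:
                scores[tip] += change
    return scores

def least_stable(scores, fraction):
    """Return the given fraction of tips with the highest scores (at least one tip)."""
    n_drop = max(1, int(round(fraction * len(scores))))
    ranked = sorted(scores, key=lambda tip: (-scores[tip], tip))
    return ranked[:n_drop]

# Three toy "bootstrap" trees in which the hypothetical tip R changes position.
trees = [
    ((("A", "B"), "R"), ("C", "D")),
    (("A", "B"), (("C", "D"), "R")),
    ((("A", "R"), "B"), ("C", "D")),
]

scores = instability_scores(trees)
rogues = least_stable(scores, 0.10)
print(sorted(scores.items(), key=lambda item: -item[1]))
print("tips to filter:", rogues)
```

In this toy set the wandering tip R receives the highest score and is the one selected for removal, while the tips C and D, which never change their position relative to each other, score lowest. A raw sum of this kind grows with the number of trees and tips, which is one reason a score standardized among data sets is easier to interpret.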

Data Filtering and Sampling Strategies

Although filtering taxa out of alignments reduces their taxonomic inclusivity, it is a tradeoff that may be useful for developing highly resolved trees (Kearney 2002; Wiens 2003b; Campbell and Lapointe 2009; Thomson and Shaffer 2010; Aberer et al. 2012). In the case of supermatrices, even heavily filtered data sets are often considerably more taxon inclusive than manually generated ones. In this study, our best-resolved tree has 435 tips, only 29% of the available Cyperaceae operational taxonomic units (OTUs) on GenBank, but this still represents an addition of 173 OTUs (a 66% increase) when compared with the next largest published phylogeny for the group (Muasya et al. 2009). In addition, our trees present stronger support for deep relationships than any previous family-level tree (Katsuyama et al. 2007; Muasya et al. 2009; Hinchliff et al. 2010), which is likely directly due to the greater number and variety of loci used to generate them.

The sampling patterns shown in Figure 1 reflect a composite sampling strategy that has been applied to many organismal groups, because most phylogenetic studies have traditionally fallen into one of 2 sampling categories: (i) many studies focus narrowly on a single (often large, species-rich) clade, which is sampled extensively for a small number of rapidly evolving markers that may not be shared by other studies, while (ii) some studies take a broader sample across a deeper taxonomic scale of one or more relatively slowly evolving markers that are often not shared with the finer-scale studies. High-throughput sequencing is likely to influence these strategies, but there will still exist slowly evolving markers that are broadly represented, and fast markers specific to certain groups.

In the case of Cyperaceae, ndhF and rbcL have been sampled broadly by family-level studies, whereas ITS, ETS, and a variety of chloroplast markers have been used by infrageneric studies. Because of low levels of overlap among studies, Figure 1 contains a high proportion of blue cells representing absent data for most locus-taxon combinations. This apparent lack of sampling, however, is not necessarily problematic for phylogeny reconstruction, because the phylogenetic sampling is geared specifically to address the relationships in the most reasonable trees, through careful selection of markers and taxa by the authors of phylogenetic studies.

While careful addition of sequence data to sparse alignments can dramatically improve results (van der Linde et al. 2010; Brown J.W., unpublished data), it is typically redundant to require complete coverage for all markers, and prohibitively so for supermatrices. The power of these methods is their ability to tie together disparate but already information-rich data sets, and the generation of full alignments is not required for success (Wiens 2003a). This is simply because, given the sampling patterns inherent in phylogenetic studies, only a small subset of additional sequence data is likely to contain additional phylogenetic information. It is not advisable to exhaustively sample slowly evolving markers for closely related taxa, because they will contain little variation, nor does it make sense to sequence rapidly evolving markers for distantly related taxa, because they will contain a great deal of noise. It is worth pointing out, however, that although relationships can be reconstructed accurately despite high levels of missing data, sparse alignments can (and do) negatively impact our ability to infer branch lengths (Lemmon et al. 2009). Some disagreement exists regarding the severity of this problem (see Wiens and Morrill 2011), and the topic of sampling optimization remains of interest (Yan et al. 2005; Mittelbach et al. 2007; Weir and Schluter 2007; Svenning et al. 2008; Sanderson et al. 2010; Thomson and Shaffer 2010; Townsend and López-Giráldez 2010; Cho et al. 2011).

Implications for Large Combined Analyses

Results from our large-tree analyses are promising. The best resolved trees from this study improve upon less inclusive family-level trees (Katsuyama et al. 2007; Muasya et al. 2009; Hinchliff et al. 2010), suggesting an important role for effective methods to deal with large, sparse, and potentially noisy data sets. As advances in sequencing technology drive the generation of more large, noisy data sets, these methods will become increasingly important. Harnessing the power of these data sets requires an initial planning investment to ensure data integrity and to optimize phylogenetic resolution, but the result of successful planning and data management is very large, very well-resolved phylogenies that in many cases may surpass previous expectations about the limits of possibility.


Although supermatrix methods may offer an alternative to the generation of novel sequence data, it is important to keep in mind that these approaches are complementary. Data mining methods rely on the adequacy of previous sampling, and many taxa and genomes remain absent from public databases (Fig. 1). Wet laboratory techniques fulfill the vital role of improving the coverage of these collaborative data sets, and the generation of novel sequence data is likely to remain essential to progress in systematic biology for the foreseeable future.

Systematics and Classification of Cyperaceae

Our results corroborate many previous findings regarding sedge phylogeny, but provide new results that contradict some earlier studies. One of the largest differences between this and the next most-inclusive sedge phylogeny in the literature (Muasya et al. 2009) is that the tree we present has very strong support for nearly all relationships among major clades. Several differences in topology also exist. Classification units referred to herein follow Simpson et al. (2007) and Goetghebeur (1998). For a thorough, concise discussion of classification, refer to Muasya et al. (2009).

In our best-resolved sedge phylogeny (Figs. 4–7), subfamily Mapanioideae is strongly supported as sister to the rest of the Cyperaceae, with the Trilepideae sister to all remaining lineages, as in Muasya et al. (2009) (Fig. 4). However, we find a Sclerieae+Bisboeckelereae clade strongly supported as sister to all remaining Cyperaceae, which is incongruous with Muasya et al. (2009), where these tribes were nested within the Schoeneae. A monophyletic Sclerieae+Bisboeckelereae that is separate from Schoeneae better fits expectations based on morphological classifications (Goetghebeur 1998) and some previous molecular phylogenetic analyses (Simpson et al. 2007). Like Muasya et al. (2009), however, we resolve Didymiandrum and Lagenocarpus, members of tribe Cryptangieae that have previously been classified with Scleria (Goetghebeur 1998), within the Schoeneae with very high support (Fig. 4). We find support for this Schoeneae+Cryptangieae clade as sister to all other remaining lineages. Although relationships within the Schoeneae are generally unresolved, several well-supported clades corresponding to major genera are present. Some genera in Schoeneae appear polyphyletic, such as Tetraria (corroborated by Verboom 2006) and Gahnia (though this finding is incongruent with Zhang et al. 2004). There is good support for a sister relationship between Cladium and the rest of the tribe (Fig. 4). The genus Rhynchospora is strongly supported as sister to all remaining lineages and contains all present members of the genus Pleurostachys (Fig. 4).

Strong support exists for one large additional major clade consisting of 2 sister subclades: (i) Cariceae+Dulichieae+Khaosokia+Scirpeae; and (ii) Abildgaardieae+Cypereae+Eleocharis+Fuireneae (Figs. 5–7). Within the first of these, Dulichieae is supported in a sister relationship with a Cariceae+Khaosokia+Scirpeae clade, but relationships between Cariceae, Scirpeae, and the monotypic Khaosokia caricoides are unresolved. The monophyly of the genus Carex, if circumscribed to include Cymophyllus, Kobresia, Schoenoxiphium, and Uncinia (i.e. the Cariceae), is strongly supported (Figs. 5 and 6). Most species of Scirpeae present in this data set are not represented by sequences at rapidly evolving, data-rich markers such as ndhF and ITS, and the addition of these markers may help clarify the relationships within this group and among Cariceae, Khaosokia, and Scirpeae. The monophyly of a clade containing the Abildgaardieae and Eleocharis is well supported, as is the monophyly of both Eleocharis and the Abildgaardieae themselves (Fig. 6). Relationships among the genera Bolboschoenus, Fuirena, Schoenoplectiella, and Schoenoplectus are poorly resolved (Figs. 6 and 7), and additional studies involving novel sequencing effort will likely be necessary to shed light on this problematic area. Within the Cypereae, a strongly supported sister relationship is indicated between (i) a Ficinia + Isolepis clade in which these 2 genera form nearly monophyletic groups, and (ii) a clade in which Scirpoides is sister to a clade containing a highly paraphyletic Cyperus with numerous other genera interdigitated throughout it (Fig. 7).

Relationships within major clades are in general less well-resolved than those in the studies from which these data were originally published (Roalson and Friar 2000; Yen and Olmstead 2000; Muasya et al. 2001; Roalson et al. 2001; Muasya et al. 2002; Yano et al. 2004; Zhang et al. 2004; Chacón et al. 2006; Verboom 2006; Ghamkhar et al. 2007; Katsuyama et al. 2007; Simpson et al. 2007; Hinchliff and Roalson 2009; Muasya et al. 2009; Starr and Ford 2009; Thomas et al. 2009; Waterway et al. 2009; Hinchliff et al. 2010; Roalson et al. 2010). This may be due to decreased levels of decisiveness for shallow branches because of the addition of samples (such as outgroup taxa) that bridge taxonomic gaps among studies but lack overlap for sampled genes. Low resolution near the tips may also result from lower amounts of phylogenetic information available to inform shallow splits, because of the smaller numbers of taxa (and therefore fewer sampled loci) that can be used.

We find strong support for recognizing the following clades (disregarding the taxonomic rank at which they are recognized): Mapanioideae, Trilepideae, Sclerieae (including Bisboeckelereae), Schoeneae (including Cryptangieae), Rhynchosporeae, Abildgaardieae (including Arthrostylideae), and Eleocharideae (Figs. 4–7). The Scirpeae+Cariceae clade (Figs. 5 and 6), and the Fuireneae+Cypereae clade (Figs. 6 and 7), each contain poorly resolved regions, which may contain clades that conflict with current classification units. The key question regarding the Scirpeae+Cariceae clade is whether the Scirpeae is monophyletic or forms a grade leading to Cariceae; both of these patterns have been resolved in published studies (grade: Hinchliff et al. 2010; clade, including Dulichieae: Muasya et al. 2009), although never with strong support. The phylogeny we present suggests the presence of 4 lineages, if Eriophorum crinigerum groups with the rest of Scirpeae: the Dulichieae (Dulichium + Blysmus), Khaosokia, Scirpeae, and Cariceae. More targeted work on the Scirpeae will be necessary to clarify this. The Fuireneae+Cypereae clade presents a similar problem: the monophyletic Cypereae contains 2 well-supported clades (Cyperus s.l. and Ficinia/Isolepis), but the taxa usually attributed to the Fuireneae form a polytomy below Cypereae (Figs. 6 and 7). Previous studies have seen these lineages positioned in many different locations, usually without strong support (Simpson et al. 2007; Muasya et al. 2009), but a study using ndhF and psbB-psbH (Hinchliff et al. 2010) showed strong support for a Fuireneae grade leading to the Cypereae. Additional sampling of ndhF and other data-rich cpDNA regions such as psbB-psbH and perhaps matK may help clarify these relationships.

Overall, 9 clades are strongly supported and morphologically diagnosable (Mapanioideae, Trilepideae, Sclerieae, Schoeneae, Rhynchosporeae, Abildgaardieae, Eleocharis, and Cypereae), and should be recognized in a new classification, as previous classifications clearly do not define phylogenetic lineages as we now know them. Additional research will clarify how many diagnosable lineages will need to be recognized within the Carex + Dulichieae + Khaosokia + Scirpeae clade and the Fuireneae assemblage.

CONCLUDING REMARKS

We have shown that highly incomplete alignments can produce well-resolved, highly taxon-inclusive trees, and that supermatrix data-mining methods that yield these are feasible to employ. The utility of highly incomplete alignments may be negatively impacted by missing data, but the PDC metric of Sanderson et al. (2010) provides a powerful way of assessing and extending these limits to inference. One such improvement method is matrix scaffolding: the removal of tips lacking data for one or more backbone loci known to contain strong signal for important nodes deep in the tree. We have shown that this method can dramatically improve both the PDC of a matrix and the resolution of the resulting trees. We also present a novel statistic for identifying tips with weak or conflicting signal, IS, based on previous work by Maddison and Maddison (2010), but which can be more easily interpreted across varied data sets.
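The scaffolding step summarized above, removing tips that lack data for one or more backbone loci, can be sketched as a simple filter on a taxon-by-locus sampling table. The sketch below is only an illustration under stated assumptions: the taxon names, the sampling pattern, and the function names are hypothetical, and the fraction of empty cells it reports is a plain measure of sparseness, not the PDC metric of Sanderson et al. (2010), which is not computed here.

```python
def scaffold(matrix, backbone_loci):
    """Keep only the taxa that have data for every backbone locus."""
    return {taxon: loci for taxon, loci in matrix.items()
            if all(locus in loci for locus in backbone_loci)}

def missing_fraction(matrix):
    """Fraction of empty taxon-by-locus cells, over all loci present in the matrix."""
    all_loci = set().union(*matrix.values()) if matrix else set()
    cells = len(matrix) * len(all_loci)
    filled = sum(len(loci) for loci in matrix.values())
    return 1.0 - filled / float(cells) if cells else 0.0

# Hypothetical sampling table: taxon name -> set of loci with sequence data.
matrix = {
    "sp_A": {"rbcL", "ndhF", "ITS"},
    "sp_B": {"rbcL", "ITS", "ETS"},
    "sp_C": {"ITS"},
    "sp_D": {"rbcL", "trnL-trnF"},
    "sp_E": {"ETS", "ITS"},
}

backbone = ["rbcL"]  # assumed backbone locus for this toy example
scaffolded = scaffold(matrix, backbone)

print("taxa retained:", sorted(scaffolded))
print("missing before:", missing_fraction(matrix))
print("missing after: ", missing_fraction(scaffolded))
```

Requiring more than one backbone locus makes the filter stricter, so the choice of backbone trades taxon inclusivity against the overlap among the tips that remain.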

The trees we present represent the largest, most well-resolved sedge phylogenies published to date, and have direct implications for sedge systematics and classification. This study is exemplary of a recent but rapidly developing trend in systematics, where large, aggregate data sets gathered by automated tools can be used to yield answers to longstanding questions in biology at scales once thought impossible. As these methods continue to mature, we can expect to see greater data integration, more powerful tree-searching and statistical tools, and many new opportunities for the study of organic evolution on Earth.

SUPPLEMENTARY MATERIAL

Data files and/or other supplementary information related to this paper have been deposited on Dryad at http://datadryad.org under doi: 10.5061/dryad.6p76c3pb.

FUNDING

This work was supported by the National Science Foundation [DEB 1011206 to C.E.H.].

ACKNOWLEDGEMENTS

Useful discussions from WSU/UI PuRGe were invaluable, as was specific feedback from Matt Pennell, Luke Harmon, and Jeremiah Busch. Many thanks to Steve Orzell and Edwin Bridges of Avon Park, FL, for companionship and hospitality in the field that mistakenly went unobserved in an earlier paper.

REFERENCES

Aberer A.J., Krompass D., Stamatakis A. 2012. Pruning rogue taxa improves phylogenetic accuracy: an efficient algorithm and webservice. Syst. Biol. (in press) doi:10.1093/sysbio/sys078.

Bruhl J.J. 1995. Sedge genera of the world: relationships and a new classification of the Cyperaceae. Aust. Syst. Bot. 8:125–305.

Campbell V., Lapointe F.J. 2009. The use and validity of composite taxa in phylogenetic analysis. Syst. Biol. 58(6):560–572.

Chacón J., Madriñán S., Chase M.W., Bruhl J.J. 2006. Molecular phylogenetics of Oreobolus (Cyperaceae) and the origin and diversification of the American species. Taxon 55(2):359–366.

Cho S., Zwick A., Regier J.C., Mitter C., Cummings M.P., Yao J., Du Z., Zhao H., Kawahara A.Y., Weller S., Davis D.R., Baixeras J., Brown J.W., Parr C. 2011. Can deliberately incomplete gene sample augmentation improve a phylogeny estimate for the advanced moths and butterflies (Hexapoda: Lepidoptera)? Syst. Biol. 60(6):782–796.

de Queiroz A., Gatesy J. 2007. The supermatrix approach to systematics. Trends Ecol. Evol. 22(1):34–41.

Dahlgren R.M.T., Clifford H.T., Yeo P.F. 1985. The families of monocotyledons: structure, evolution, and taxonomy. Berlin: Springer-Verlag.

Davis B.W., Li G., Murphy W.J. 2010. Supermatrix and species tree methods resolve phylogenetic relationships within the big cats, Panthera (Carnivora: Felidae). Mol. Phylogenet. Evol. 56(1):64–76.

Edgar R.C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792–1797.

Ghamkhar K., Marchant A.D., Wilson K.L., Bruhl J.J. 2007. Phylogeny of Abildgaardieae (Cyperaceae) inferred from ITS and trnL-F data. Aliso 23:149–164.

Goetghebeur P. 1998. Cyperaceae. In: Kubitzki K., editor. The families and genera of vascular plants. Berlin: Springer. p. 164.

Govaerts R., Simpson D.A., Bruhl J.J., Egorova T.V., Goetghebeur P., Wilson K.L. 2007. World checklist of Cyperaceae. London: Royal Botanic Gardens Kew.

Hinchliff C.E., Lliully Aguilar A., Carey T., Roalson E.H. 2010. The origins of Eleocharis (Cyperaceae) and the status of Websteria, Egleria, and Chillania. Taxon 59(3):709–719.

Hinchliff C.E., Roalson E.H. 2009. Stem architecture in Eleocharis subgenus Limnochloa (Cyperaceae): evidence of dynamic morphological evolution in a group of pantropical sedges. Am. J. Bot. 96(8):1487–1499.

Holton T.A., Pisani D. 2010. Deep genomic-scale analyses of the metazoa reject Coelomata: evidence from single- and multigene families analyzed under a supertree and supermatrix paradigm. Genome Biol. Evol. 2:310–324.

Jones K., Purvis A., MacLarnon A., Bininda-Emonds O., Simmons N. 2002. A phylogenetic supertree of the bats (Mammalia: Chiroptera). Biol. Rev. 77:223–259.

Katoh K., Kuma K., Toh H., Miyata T. 2005. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33:511–518.

Katsuyama T., Hirahara T., Hoshino T. 2007. Suprageneric phylogeny of Japanese Cyperaceae based on DNA sequences from chloroplast ndhF and 5.8S nuclear ribosomal DNA. Acta Phytotaxon. Geobot. 58(2):57.

Kearney M. 2002. Fragmentary taxa, missing data, and ambiguity: mistaken assumptions and conclusions. Syst. Biol. 51(2):369–381.

Lemmon A.R., Brown J.M., Stanger-Hall K., Lemmon E.M. 2009. The effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and Bayesian inference. Syst. Biol. 58(1):130–145.

López-Giráldez F., Townsend J.P. 2011. PhyDesign: an online application for profiling phylogenetic informativeness. BMC Evol. Biol. 11(1):152.

Maddison W.P., Maddison D.R. 2010. Mesquite: a modular system for evolutionary analysis. Vancouver, BC: University of British Columbia.

McMahon M., Sanderson M.J. 2006. Phylogenetic supermatrix analysis of GenBank sequences from 2228 papilionoid legumes. Syst. Biol. 55(5):818.

Mittelbach G., Schemske D.W., Cornell H.V., Allen A.P., Brown J.M., Bush M.B., Harrison S.P., Hurlbert A., Knowlton N., Lessios H. 2007. Evolution and the latitudinal diversity gradient: speciation, extinction and biogeography. Ecol. Lett. 10(4):315–331.

Muasya A.M., Simpson D.A., Chase M.W. 2002. Phylogenetic relationships in Cyperus L. s.l. (Cyperaceae) inferred from plastid DNA sequence data. Bot. J. Linn. Soc. 138(2):145–153.

Muasya A.M., Simpson D.A., Chase M.W., Culham A. 2001. A phylogeny of Isolepis (Cyperaceae) inferred using plastid rbcL and trnL-F sequence data. Syst. Bot. 26(2):342–353.

Muasya A.M., Simpson D.A., Verboom G.A., Goetghebeur P., Naczi R., Chase M.W., Smets E. 2009. Phylogeny of Cyperaceae based on DNA sequence data: current progress and future prospects. Bot. Rev. 75(1):2–21.

Naczi R.F., Ford B.A. 2008. Sedges: uses, diversity, and systematics of the Cyperaceae. Saint Louis (MO): Missouri Botanic Garden Press.

Ren F., Tanaka H., Yang Z. 2009. A likelihood look at the supermatrix-supertree controversy. Gene 441:119–125.

Roalson E.H., Columbus J.T., Friar E.A. 2001. Phylogenetic relationships in Cariceae (Cyperaceae) based on ITS (nrDNA) and trnT-LF (cpDNA) region sequences: assessment of subgeneric and sectional relationships in Carex with emphasis on section Acrocystis. Syst. Bot. 26:318–341.

Roalson E.H., Friar E.A. 2000. Infrageneric classification of Eleocharis (Cyperaceae) revisited: evidence from the internal transcribed spacer (ITS) region of nuclear ribosomal DNA. Syst. Bot. 25(2):323–336.

Roalson E.H., Hinchliff C.E., Trevisan R., Da Silva C. 2010. Phylogenetic relationships in Eleocharis (Cyperaceae): C4 photosynthesis origins and patterns of diversification in the Spikerushes. Syst. Bot. 35(2):257–271.

Sanderson M.J., McMahon M., Steel M. 2010. Phylogenomics with incomplete taxon coverage: the limits to inference. BMC Evol. Biol. 10(1):155.

Simpson D.A., Muasya A.M., Alves M., Bruhl J.J., Dhooge S., Chase M.W., Furness C.A., Ghamkhar K., Goetghebeur P., Hodkinson T., Marchant A.D., Reznicek A.A., Nieuwborg R., Roalson E.H., Smets E., Starr J.R., Thomas W.W., Wilson K.L., Zhang X. 2007. Phylogeny of Cyperaceae based on DNA sequence data: a new rbcL analysis. Aliso 23:72–83.

Smith S.A., Beaulieu J.M., Donoghue M.J. 2009. Mega-phylogeny approach for comparative biology: an alternative to supertree and supermatrix approaches. BMC Evol. Biol. 9:37.

Smith S.A., Donoghue M.J. 2008. Rates of molecular evolution are linked to life history in flowering plants. Science 322(5898):86–89.

Smith S.A., Dunn C.W. 2008. Phyutility: a phyloinformatics tool for trees, alignments and molecular data. Bioinformatics 24(5):715–716.

Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:2688–2690.

Starr J.R., Ford B.A. 2009. Phylogeny and evolution in Cariceae (Cyperaceae): current knowledge and future directions. Bot. Rev. 75(1):110–137.

Steel M., Sanderson M.J. 2010. Characterizing phylogenetically decisive taxon coverage. Appl. Math. Lett. 23(1):82–86.

Stevens P.F. 2008. Angiosperm Phylogeny Website. Available from: URL http://www.mobot.org/mobot/research/APweb/. Last accessed November 2012.

Svenning J., Borchsenius F., Bjorholm S., Balslev H. 2008. High tropical net diversification drives the New World latitudinal gradient in palm (Arecaceae) species richness. J. Biogeogr. 35(3):394–406.

Thomas W.W., Araújo A.C., Alves M. 2009. A preliminary molecular phylogeny of the Rhynchosporeae (Cyperaceae). Bot. Rev. 75(1):22–29.

Thomson R.C., Shaffer H.B. 2010. Sparse supermatrices for phylogenetic inference: taxonomy, alignment, rogue taxa, and the phylogeny of living turtles. Syst. Biol. 59(1):42–58.

Thorley J.L., Page R.D.M. 2000. RadCon: phylogenetic tree comparison and consensus. Bioinformatics 16:486–487.

Thorley J.L., Wilkinson M. 1999. Testing the phylogenetic stability of early tetrapods. J. Theor. Biol. 200(3):343.

Townsend J.P., Leuenberger C. 2011. Taxon sampling and the optimal rates of evolution for phylogenetic inference. Syst. Biol. 60(3):358–365.

Townsend J.P., López-Giráldez F. 2010. Optimal selection of gene and ingroup taxon sampling for resolving phylogenetic relationships. Syst. Biol. 59(4):446–457.

van der Linde K., Houle D., Spicer G., Steppan S. 2010. A supermatrix-based molecular phylogeny of the family Drosophilidae. Genet. Res. 92(1):25–38.

Verboom G.A. 2006. A phylogeny of the schoenoid sedges (Cyperaceae: Schoeneae) based on plastid DNA sequences, with special reference to the genera found in Africa. Mol. Phylogenet. Evol. 38(1):79–89.

Warnes G.R., Bolker B., Bonebakker L., Gentleman R., Huber W., Liaw A., Lumley T., Maechler M., Magnusson A., Moeller S., Schwartz M., Venables B. 2011. gplots package for R statistical framework. Rochester, NY: University of Rochester.

Waterway M., Hoshino T., Masaki T. 2009. Phylogeny, species richness, and ecological specialization in Cyperaceae tribe Cariceae. Bot. Rev. 75(1):138–159.

Weir J., Schluter D. 2007. The latitudinal gradient in recent speciation and extinction rates of birds and mammals. Science 315(5818):1574.

Wiens J.J. 2003a. Missing data, incomplete taxa, and phylogenetic accuracy. Syst. Biol. 52(4):528–538.

Wiens J.J. 2003b. Incomplete taxa, incomplete characters, and phylogenetic accuracy: is there a missing data problem? J. Vertebrate Paleontol. 23(2):297–310.

Wiens J.J., Morrill M.C. 2011. Missing data in phylogenetic analysis: reconciling results from simulations and empirical data. Syst. Biol. 60(5):719–731.

Wolsan M., Sato J.J. 2010. Effects of data incompleteness on the relative performance of parsimony and Bayesian approaches in a supermatrix phylogenetic reconstruction of Mustelidae and Procyonidae (Carnivora). Cladistics 26(2):168–194.

Yan C., Burleigh J.G., Eulenstein O. 2005. Identifying optimal incomplete phylogenetic data sets from sequence databases. Mol. Phylogenet. Evol. 35(2):528–535.

Yano O., Katsuyama T., Tsubota H., Hoshino T. 2004. Molecular phylogeny of Japanese Eleocharis (Cyperaceae) based on ITS sequence data, and chromosomal evolution. J. Plant Res. 117:11.

Yen A., Olmstead R. 2000. Molecular systematics of Cyperaceae tribe Cariceae based on two chloroplast DNA regions: ndhF and trnL intron-intergenic spacer. Syst. Bot. 25(3):479–494.

Zhang X., Marchant A., Wilson K., Bruhl J.J. 2004. Phylogenetic relationships of Carpha and its relatives (Schoeneae, Cyperaceae) inferred from chloroplast trnL intron and trnL-trnF intergenic spacer sequences. Mol. Phylogenet. Evol. 31(2):647–657.
