Development of molecular genetic maps and massive SNP mining through NGS technology in Cynara...

Development of Molecular Genetic Maps and Massive SNP Mining through NGS Technology in Cynara cardunculus L.

E. Portis1, A. Acquadro1, D. Scaglione1, Z. Lai2, S. Knapp3, L. Rieseberg4, R.P. Mauro5, G. Mauromicale5 and S. Lanteri1

1 DISAFA - Plant Genetics and Breeding, University of Torino, Italy
2 Center for Genomics and Bioinformatics, Indiana University, USA
3 Center for Applied Genetics Technologies, University of Georgia, USA
4 Department of Botany, University of British Columbia, Canada
5 DISPA - Scienze Agronomiche, University of Catania, Italy

Keywords: linkage map, microsatellite, SNP, transcriptome, next generation sequencing

Abstract
Two F1 progenies involving the cross of the globe artichoke genotype 'Romanesco C3' by the cultivated cardoon 'Altilis 41' as well as the wild cardoon 'Creta 4' were generated; the former allowed the construction of the first cultivated cardoon map, which was integrated with that of globe artichoke. A wide set of SSR loci derived from ESTs was positioned on the reference genetic maps and a consensus SSR-based linkage map was developed. To further saturate the C. cardunculus map, with the goal of investigating the genetic basis of traits of interest, a broad-based sequencing approach was applied for the development of a large and robust SNP dataset. Next-Generation Sequencing (NGS) technologies were applied using two complementary approaches: (i) genomic RAD (Restriction-site Associated DNA) tag sequencing in combination with the Illumina platform; (ii) transcriptome sequencing, via 454 (Roche) and Illumina technologies. These SNPs represent a one-stop resource to produce a dense C. cardunculus genetic map via high-throughput genotyping technologies.

INTRODUCTION
Cynara cardunculus (2n=2x=34) is an out-breeding and thus characteristically highly heterozygous species. We previously developed the first globe artichoke genetic maps, which were based on a cross between the two genotypes 'Romanesco C3' and 'Spinoso di Palermo', by applying a two-way pseudo-testcross approach (Lanteri et al., 2006). A new F1 progeny involving the cross of the same 'Romanesco C3' genotype by the cultivated cardoon genotype 'Altilis 41' was also generated; this allowed the construction of the first cultivated cardoon map, which was aligned with that of globe artichoke (Portis et al., 2009). More recently, crosses between globe artichoke and its ancestor wild cardoon have generated highly segregating F1 populations exploitable as ornamentals (Lanteri et al., 2012) as well as for mapping studies (Sonnante et al., 2011). With the goal of developing higher resolution maps as well as a consensus map, a wide set of SSR markers was developed from ESTs (expressed sequence tags) of globe artichoke, made available by the Compositae Genome Project (CGP; http://compgenomics.ucdavis.edu/). Using a custom bioinformatic pipeline, 36,321 ESTs were assembled into 19,055 unigenes (6,621 contigs and 12,434 singletons), annotated, and mined for perfect SSRs. Over 4,000 potential EST-SSR loci, lying within some 3,300 genes (1 SSR per 3.6 kbp), were identified, and PCR primers for the amplification of more than 2,000 of these were designed; in a test of a sample of 300 of these assays, over half proved to be informative between the parents of the available mapping populations (Scaglione et al., 2009).

Here we describe the integration of a large number of these EST-SSR loci into the globe artichoke and cultivated cardoon maps, and show that the two maps can be readily aligned and integrated to develop an SSR-based consensus map of the species. In addition, to further saturate the C. cardunculus map, with the goal of investigating the genetic basis of traits of interest, Next-Generation Sequencing (NGS) technologies were applied for SNP mining and the development of a large and robust SNP dataset.

Proc. 8th IS on Artichoke, Cardoon and Their Wild Relatives. Ed.: M.A. Pagnotta. Acta Hort. 983, ISHS 2013

MATERIALS AND METHODS

Linkage Analysis and Consensus Map Construction
The set of 178 informative Cynara Expressed Microsatellite (CyEM) markers identified by Scaglione et al. (2009) was used to genotype 94 F1 hybrids (randomly selected from 154 true hybrids as described by Portis et al., 2009) from the cross 'Romanesco C3' (globe artichoke; female parent) x 'Altilis 41' (cultivated cardoon; male parent). The CyEM genotypes were at first combined with previous genotypic data based on 605 AFLP, 27 S-SAP and 56 other SSRs (Portis et al., 2009), along with ten SNPs from genes underlying caffeoylquinic acid synthesis (reported by Comino et al., 2009 and Menin et al., 2010). JoinMap v4.0 (van Ooijen, 2006) was used to generate two separate linkage maps (one for each parent) using the double pseudo-testcross mapping strategy. A genotypic data set based on just SSRs and a few SNPs was then used to construct a consensus SSR-based map. Here, the three segregation classes were 1:1 (alleles segregating in one of the two parents), 1:2:1 (the same pair of alleles segregating in each parent), and 1:1:1:1 (different alleles segregating in each parent). The most likely locus order was established from a comparison of the C3, Altilis 41 and consensus linkage groups (LGs), and where these differed substantially from one another, the most likely order was assumed to be the one associated with the lowest χ² value (estimating goodness-of-fit) and the lowest mean χ² contribution for all loci. Once the framework map was established, some LGs were merged by lowering the LOD threshold to 5 and the consensus LGs were numbered serially in descending order of genetic length.
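The three segregation classes named above can be illustrated with a simple chi-square goodness-of-fit test on F1 genotype counts. The following is a minimal sketch only: it tests segregation ratios of a single marker, which is a common screening step before mapping, and it is not the map-order χ² that JoinMap computes internally. The function names and the genotype counts are hypothetical and are not taken from the study.

```python
# Minimal sketch: chi-square goodness-of-fit test of a marker's F1 genotype
# counts against one of the three segregation classes named in the text.
# The counts used below are invented for illustration, not data from the study.

# Standard chi-square critical values at alpha = 0.05 for 1, 2 and 3 d.f.
CRITICAL_0_05 = {1: 3.841, 2: 5.991, 3: 7.815}

SEGREGATION_CLASSES = {
    "1:1": (1, 1),
    "1:2:1": (1, 2, 1),
    "1:1:1:1": (1, 1, 1, 1),
}


def chi_square(observed, ratio):
    """Return (statistic, degrees of freedom) for counts vs. an expected ratio."""
    if len(observed) != len(ratio):
        raise ValueError("observed counts and ratio must have the same length")
    total = sum(observed)
    expected = [total * r / sum(ratio) for r in ratio]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


def fits(observed, segregation_class):
    """True if the counts do not deviate significantly (alpha = 0.05)."""
    stat, df = chi_square(observed, SEGREGATION_CLASSES[segregation_class])
    return stat <= CRITICAL_0_05[df]


# Hypothetical marker scored on 94 F1 individuals.
stat, df = chi_square([50, 44], SEGREGATION_CLASSES["1:1"])
print(round(stat, 3), df)
```

Markers whose counts deviate significantly from every class would typically be flagged as distorted rather than silently discarded; how the authors handled such markers is not stated in this text.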

SNP Mining
Two complementary approaches were adopted for SNP development (Fig. 1):

1) Genomic RAD (Restriction-site Associated DNA) tag sequencing (Miller et al., 2007) in combination with the Genome Analyzer GAIIx (Illumina) sequencing device (as reported by Scaglione et al., 2012a). Three genomic RAD libraries were obtained from C. cardunculus genotypes belonging to the three taxa of the species (globe artichoke, cultivated cardoon and wild cardoon) and parents of two mapping populations. The first mapping population is the previously described progeny involving the cross between globe artichoke 'Romanesco C3' and cultivated cardoon 'Altilis 41'. The second one is an F1 progeny involving the cross between the same female parent and the wild cardoon genotype 'Creta 4' (Lanteri et al., 2012).

2) Transcriptome sequencing, via 454 and Illumina technologies, of a total of eleven C. cardunculus EST libraries: 3 libraries, derived from the three mapping parents, were sequenced with the 454 Titanium (Roche) to produce a reference transcriptome; 8 libraries, set up from five globe artichoke accessions, two cultivated cardoon and one wild cardoon genotypes, were sequenced using the Illumina platform, in order to greatly increase the total number of SNP calls. Alongside, a functional characterisation and annotation of the obtained sequence set was performed, as reported by Scaglione et al. (2012b).

RESULTS AND DISCUSSION

SSR-Based Consensus Map
The integration of the EST-SSR loci has significantly improved the resolution and accuracy of the 'Romanesco C3' and 'Altilis 41' maps (Portis et al., 2012). The number of informative shared co-dominant markers was raised to 66 (64 SSRs and 2 SNPs), equivalent to a number of bridge markers (1-15) per LG; these markers allowed the alignment of all the LGs. Following alignment, a consensus linkage map based exclusively on microsatellite and SNP markers was constructed (Fig. 2). The consensus map (Fig. 3) comprised 227 loci (217 SSRs and 10 SNPs targeting genes involved in the synthesis of caffeoylquinic acids) arranged into 20 LGs (LOD threshold >6.0). The consensus map length was 1068.0 cM, with a mean inter-marker spacing of 5.2 cM. The length of LGs varied from 4.0 to 113.7 cM (mean 62.8 cM), with the largest LG containing 36 loci. Lowering the LOD threshold to 5.0 resulted in the merging of three pairs of LGs, thereby reducing the overall number to 17, corresponding to the haploid chromosome complement of the species. The majority of the LGs contained a mixture of 'Romanesco C3', 'Altilis 41' and shared co-dominant markers, with only four (LG_9, LG_13, LG_14 and LG_17) carrying shared loci and markers only present in the 'Romanesco C3' map.

This SSR-based consensus map of C. cardunculus is based on a robust marker platform of SSRs and a few gene-based SNP loci. It is expected that the further positioning of markers within target regions will provide key tools for marker-assisted breeding programs as well as the necessary framework to exploit mapping data obtained from diverse populations. At present around 200 of the loci on the consensus map (about 88%) are sited within genic sequence, presenting some opportunity to identify candidate genes for particular traits within the species. The known genomic location of gene-derived markers (such as the SNPs within the genes underlying caffeoylquinic acid synthesis) may help in understanding how these genes influence key traits, and so simplify the process of elucidating their underlying mechanisms. Finally, the basing of markers on genic sequence will facilitate comparative genomic analyses within the Asteraceae.
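The summary statistics quoted for the consensus map can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check under stated assumptions: the locus total is taken as the sum of the 217 SSRs and 10 SNPs, the mean spacing is assumed to be map length divided by the number of intervals (loci minus LGs at LOD >6.0), and the mean LG length is assumed to refer to the merged set of LGs. The variable names are hypothetical.

```python
# Consistency check of the consensus-map summary figures quoted in the text.
# The way each mean is computed here is an assumption, not the authors' stated method.

total_length_cm = 1068.0
n_loci = 217 + 10              # SSRs + SNPs
n_lgs_lod6 = 20                # LGs at LOD threshold > 6.0
n_lgs_lod5 = n_lgs_lod6 - 3    # three pairs of LGs merged at LOD 5.0
haploid_number = 34 // 2       # from 2n = 2x = 34
genic_loci = 200               # "around 200" loci within genic sequence

# Assumed definition: one interval fewer than the number of loci in each LG.
mean_spacing_cm = total_length_cm / (n_loci - n_lgs_lod6)
mean_lg_length_cm = total_length_cm / n_lgs_lod5
genic_share_pct = 100 * genic_loci / n_loci

print(round(mean_spacing_cm, 1), round(mean_lg_length_cm, 1), round(genic_share_pct))
```

Under these assumptions the computed values agree with the reported 5.2 cM spacing, 62.8 cM mean LG length and roughly 88% genic loci.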

The genetic maps, obtained from the domesticated C. cardunculus forms, were compared with the Sonnante et al. (2011) consensus map constructed from a cross between the var. scolymus cultivar 'Mola' and the var. sylvestris (wild cardoon) accession 'Tolfa', by considering 125 (117 SSRs, eight SNPs) common markers. In general, marker order and genetic separation were comparable, with some exceptions. Over 100 SSR loci featured in our SSR-based consensus map apparently were either non-informative or remained as singlet loci in the 'Mola'/'Tolfa' population (Portis et al., 2012).

SNP Mining through NGS Technology
The RAD-seq exercise produced 9.7 million reads, equivalent to ~1 Gbp of raw sequence. The distribution of reads was uneven across the three DNA samples, with 1.2 million reads achieved for globe artichoke, 2.6 million for cultivated cardoon and 5.9 million for wild cardoon; the latter, being the largest set, was chosen as the basis for de novo contig assembly. The assembly procedure created 19,061 reference genomic contigs, spanning 6.11 Mbp. The characterisation of the contig sequences resulted in the annotation of 5,315 contigs (28.0%). Enzyme codes were retrieved for 7,327 contigs, defining a unique set of 313 putative enzymatic activities, which were mapped onto KEGG reference pathways (http://www.genome.jp/kegg/). The sequences generated for each mapping parent were aligned using the reference contig set as a scaffold. In total, ~33,000 sequence variants were detected, including 1,520 short indels, distributed over 12,068 contigs. The overall SNP frequency was estimated to be 5.6 per 1,000 nucleotides. A subset of these variants (16,121 SNPs and 123 1-2 nt indels), distributed over 7,478 contigs, was obtained by considering only allelic variants that were informative for both mapping populations. The number of heterozygous SNP loci was 1,235 in the globe artichoke, 2,863 in the cultivated cardoon and 5,069 in the wild cardoon mapping parents. Heterozygous SNPs are of key importance for mapping studies, since a two-way pseudo-testcross approach, based on a segregating F1 progeny, was adopted for the linkage analysis. In this sense, a key parameter for the successful isolation of such useful SNP markers was the sequencing coverage.
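Why heterozygous loci matter for a two-way pseudo-testcross can be shown by enumerating the F1 genotypes expected from a pair of diploid parental genotypes. This is a minimal sketch, assuming simple Mendelian inheritance at a single locus; the function names and the example genotypes are hypothetical and are not part of the authors' pipeline.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def f1_segregation(mother, father):
    """Expected F1 genotype frequencies at one diploid locus.

    Each parent is a 2-character string of allele symbols, e.g. "AG".
    """
    offspring = Counter(
        "".join(sorted(m + f)) for m, f in product(mother, father)
    )
    total = sum(offspring.values())
    return {genotype: Fraction(n, total) for genotype, n in offspring.items()}


def segregation_class(mother, father):
    """Name the F1 segregation class implied by the two parental genotypes."""
    freqs = sorted(f1_segregation(mother, father).values())
    quarter, half = Fraction(1, 4), Fraction(1, 2)
    if freqs == [half, half]:
        return "1:1"          # heterozygous in one parent only
    if freqs == [quarter, quarter, half]:
        return "1:2:1"        # same pair of alleles heterozygous in both parents
    if freqs == [quarter] * 4:
        return "1:1:1:1"      # both parents heterozygous, different allele pairs
    return "not segregating"  # both parents homozygous


# Hypothetical SNP genotypes (mother, father).
for parents in [("AG", "AA"), ("AG", "AG"), ("AA", "GG")]:
    print(parents, segregation_class(*parents))
```

A locus that is homozygous in both parents yields a single F1 genotype and carries no mapping information, which is why the counts of heterozygous SNPs per parent are the relevant figure.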

The 454-based transcriptome sequencing of the three mapping parents generated some 1.7 M reads with an overall length of 695 Mb of raw sequence, which was reduced to 692 Mb after post-sequencing filtering. cDNA libraries of the other eight genotypes, sequenced using a GAIIx Illumina platform, produced 6.9 Gbp of raw data (46.4 M paired-end reads) with a mean of 5.8 M reads per accession. The data set was reduced to 6.2 Gbp after quality trimming. The assembly of 454 reads generated 37,622 contigs for 'Romanesco C3', 40,130 contigs for 'Altilis 41', and 12,837 contigs for 'Creta 4', with mean coverage levels of 7.31X, 8.45X, and 9.17X, respectively. A final set of 38,726 reference contigs, spanning 32 Mbp of the transcriptome, was obtained after merging the three taxon-specific assemblies. Sequences were functionally annotated: enzymes were tagged on KEGG's reference pathways (www.genome.jp/kegg/), including primary and secondary metabolism. On the whole, 16,419 enzyme codes were retrieved (12,449 transcripts) and mapped onto KEGG's pathways. The sample of C. cardunculus enzymes consisted of 1,133 unique enzyme codes distributed across 147 pathways. About 1 M of the 454-derived reads (about 0.4 Gbp) were aligned to the reference contig set (38,726), while read alignment with correct pair information was successful for 34 M Illumina sequences (about 2.6 Gbp), resulting in a median reference transcriptome coverage of 96X. Reliable SNPs (Bayesian probability >95%) were detected at 195,400 sites across the set of eleven accessions. The average SNP frequency was calculated at 1 per 167 bp, with a mean of five per contig. Overall, SNPs were most frequent in the 3'-UTR (one per 126 bp), followed by the CDS (one per 169 bp) and the 5'-UTR (one per 265 bp).
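The transcriptome SNP densities quoted above can be roughly reproduced from the reported totals. The sketch below is a back-of-the-envelope check only; it uses the rounded 32 Mbp reference span from the text, so the per-SNP spacing it yields is approximate, and the variable names are hypothetical.

```python
# Rough consistency check of the transcriptome SNP figures quoted in the text.
# Inputs are the rounded values reported there; results are approximate.

reference_span_bp = 32_000_000   # "spanning 32 Mbp" (rounded)
n_contigs = 38_726               # reference contigs
n_snps = 195_400                 # reliable SNP sites
n_illumina_libraries = 8
mean_reads_per_accession_m = 5.8

bp_per_snp = reference_span_bp / n_snps
snps_per_contig = n_snps / n_contigs
illumina_reads_m = n_illumina_libraries * mean_reads_per_accession_m

# Span implied by the reported density of one SNP per 167 bp.
implied_span_mbp = 167 * n_snps / 1e6

print(round(bp_per_snp), round(snps_per_contig, 2),
      round(illumina_reads_m, 1), round(implied_span_mbp, 1))
```

With the rounded span the spacing comes out a few base pairs below the reported 167 bp; the reported density corresponds to a span of about 32.6 Mbp, which is consistent with "32 Mbp" being a rounded value.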

The combination of two NGS platforms (454 FLX Titanium, Roche and GAIIx, Illumina) for the extensive characterization of the genome and transcriptome of C. cardunculus has proven to be a highly reliable tool for SNP discovery. Overall, the availability of such a large number of sequence-based markers, in a format allowing for high-throughput genotyping, offers opportunities to develop a high-density genetic map and to carry out association mapping studies aimed at correlating molecular polymorphisms with variation in phenotypic traits, as well as for molecular breeding approaches.

ACKNOWLEDGEMENTS
This research was supported by: (i) U.S. National Science Foundation grants (DBI-0820451) for LHR, SJK, and ZL, and (ii) MIPAAF (Ministero delle Politiche Agricole, Alimentari e Forestali - Italy) through the CYNERGIA ("Costituzione e valutazione dell'adattabilità di genotipi di Cynara cardunculus per la produzione di biomassa e biodiesel in ambiente mediterraneo") project and the CARVARVI ("Valorizzazione di germoplasma di carciofo attraverso la costituzione varietale ed il risanamento da virus") project.

Literature Cited
Comino, C., Hehn, A., Moglia, A., Menin, B., Bourgaud, F., Lanteri, S. and Portis, E. 2009. The isolation and mapping of a novel hydroxycinnamoyltransferase in the globe artichoke chlorogenic acid pathway. BMC Plant Biology 9:30.

Lanteri, S., Acquadro, A., Comino, C., Mauro, R., Mauromicale, G. and Portis, E. 2006. A first linkage map of globe artichoke (Cynara cardunculus var. scolymus L.) based on AFLP, S-SAP, M-AFLP and microsatellite markers. Theoretical and Applied Genetics 112:1532-1542.

Lanteri, S., Portis, E., Acquadro, A., Mauro, R.P. and Mauromicale, G. 2012. Morphology and SSR fingerprinting of newly developed Cynara cardunculus genotypes exploitable as ornamentals. Euphytica 184:311-321.

Menin, B., Comino, C., Moglia, A., Dolzhenko, Y., Portis, E. and Lanteri, S. 2010. Identification and mapping of genes related to caffeoylquinic acid synthesis in Cynara cardunculus L. Plant Science 179:338-347.

Miller, M., Dunham, J., Amores, A., Cresko, W. and Johnson, E. 2007. Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. Genome Research 17:240-248.

Portis, E., Mauromicale, G., Mauro, R., Acquadro, A., Scaglione, D. and Lanteri, S. 2009. Construction of a reference molecular linkage map of globe artichoke (Cynara cardunculus var. scolymus). Theoretical and Applied Genetics 120:59-70.

Portis, E., Scaglione, D., Acquadro, A., Mauromicale, G., Mauro, R., Knapp, S.J. and Lanteri, S. 2012. Genetic mapping and identification of QTL for earliness in the globe artichoke / cultivated cardoon complex. BMC Research Notes 5:252.

Scaglione, D., Acquadro, A., Portis, E., Taylor, C., Lanteri, S. and Knapp, S. 2009. Ontology and diversity of transcript-associated microsatellites mined from a globe artichoke EST database. BMC Genomics 10:454.

Scaglione, D., Acquadro, A., Portis, E., Tirone, M., Knapp, S.J. and Lanteri, S. 2012a. RAD tag sequencing as a source of SNP markers in Cynara cardunculus L. BMC Genomics 13:3.

Scaglione, D., Lanteri, S., Acquadro, A., Lai, Z., Knapp, S.J., Rieseberg, L. and Portis, E. 2012b. Large-scale transcriptome characterization and mass discovery of SNPs in globe artichoke and its related taxa. Plant Biotechnology Journal 10(8):956-969.

Sonnante, G., Gatto, A., Morgese, A., Montemurro, F., Sarli, G., Blanco, E. and Pignone, D. 2011. Genetic map of artichoke x wild cardoon: toward a consensus map for Cynara cardunculus. Theoretical and Applied Genetics 123:1215-1229.

van Ooijen, J.W. 2006. JoinMap® v.4, Software for the calculation of genetic linkage maps in experimental populations. Wageningen, Netherlands: Kyazma BV.

Figures

Fig. 1. Cynara cardunculus SNP mining workflow. [Flow chart not reproduced. Genomic branch: gDNA RAD libraries from the 3 mapping parents (~3.5 M reads per parent), assembled into ~19,000 contigs (mean length 312 bp), yielding ~33,000 variants. Transcriptome branch: libraries from the 3 mapping parents (~0.5 M reads per parent) and 8 germplasm genotypes (~5.8 M reads per accession), assembled into ~38,000 contigs, yielding ~195,000 variants.]

Fig. 2. Examples of alignment and consensus LG construction. Alignment of the 'Romanesco C3' (white) and the 'Altilis 41' (dark gray) LGs based on common markers (A). Construction of the SSR-based consensus LGs (light gray) (B). 'r-' and 'a-' indicate markers segregating only in 'Romanesco C3' and 'Altilis 41', respectively. Marker nomenclature is the one reported in Portis et al. (2009, 2012) and Scaglione et al. (2009).

ISSN 0567-7572
ISBN 978-90-6605-319-9