Repeat subtraction-mediated sequence capture from a complex genome

TECHNICAL ADVANCE

Repeat subtraction-mediated sequence capture from a complex genome

Yan Fu1,2, Nathan M. Springer3, Daniel J. Gerhardt4, Kai Ying5,6, Cheng-Ting Yeh1,7, Wei Wu1, Ruth Swanson-Wagner5,6,

Mark D’Ascenzo4, Tracy Millard4, Lindsay Freeberg4, Natsuyo Aoyama4, Jacob Kitzman4, Daniel Burgess4, Todd Richmond4,

Thomas J. Albert4, W. Brad Barbazuk8, Jeffrey A. Jeddeloh4,* and Patrick S. Schnable1,2,6,7,*

1Department of Agronomy, Iowa State University, Ames, IA 50011, USA, 2Center for Carbon-Capturing Crops, Iowa State University, Ames, IA 50011, USA, 3Department of Plant Biology, University of Minnesota, St Paul, MN 55108, USA, 4Roche NimbleGen Inc., Madison, WI 53719, USA, 5Interdepartmental Genetics Graduate Program, Iowa State University, Ames, IA 50011, USA, 6Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA, 7Center for Plant Genomics, Iowa State University, Ames, IA 50011, USA, and 8Department of Biology and the Genetics Institute, University of Florida, Gainesville, FL 32610, USA

Received 30 October 2009; revised 3 February 2010; accepted 9 February 2010; published online 12 April 2010.

*For correspondence (fax +1 608 218 7601; e-mail [email protected] or fax +1 515 294 5256; e-mail [email protected]).

SUMMARY

Sequence capture technologies, pioneered in mammalian genomes, enable the resequencing of targeted

genomic regions. Most capture protocols require blocking DNA, the production of which in large quantities can

prove challenging. A blocker-free, two-stage capture protocol was developed using NimbleGen arrays. The

first capture depletes the library of repetitive sequences, while the second enriches for target loci. This strategy

was used to resequence non-repetitive portions of an approximately 2.2 Mb chromosomal interval and a set of

43 genes dispersed in the 2.3 Gb maize genome. This approach achieved approximately 1800–3000-fold

enrichment and 80–98% coverage of targeted bases. More than 2500 SNPs were identified in target genes. Low

rates of false-positive SNP predictions were obtained, even in the presence of captured paralogous sequences.

Importantly, it was possible to recover novel sequences from non-reference alleles. The ability to design novel

repeat-subtraction and target capture arrays makes this technology accessible in any species.

Keywords: NimbleGen sequence capture, genotyping, SNP, molecular marker, reduced representation

sequencing, allele mining.

INTRODUCTION

Identifying genetic variation is a critical step in relating

genotypes to phenotypes. Reference genome sequences

exist for several plant species; however, it remains expen-

sive and difficult to perform whole-genome sequencing of

dozens of haplotypes per species, especially in crops such as

maize (Martienssen et al., 2004; Bennetzen, 2005), wheat

(Feuillet et al., 2008) and conifers (Morse et al., 2009), which

have large genomes with complicated high-copy repeat

interspersion. The ability to perform targeted resequencing

of specific intervals of the low-copy fraction of these

genomes has significant potential for a range of applica-

tions, such as discovering markers, identifying the basis of

mutants, and cloning qualitative and quantitative trait loci

(QTL), all of which can contribute to the genetic improve-

ment of crops.

Microarray-based sequence capture (Albert et al., 2007;

Hodges et al., 2007; Okou et al., 2007; D’Ascenzo et al., 2009)

has been successfully applied to mammalian genomes for

resequencing exons, large genomic loci and candidate gene

sets. Sequence capture has also been performed using

solution hybrid selection with long oligos (Porreca et al.,

2007) or PCR products (Herman et al., 2009) as probes. The

substantial enrichment of target sequences achieved via

sequence capture makes it much less expensive than

© 2010 The Authors. Journal compilation © 2010 Blackwell Publishing Ltd

The Plant Journal (2010) 62, 898–909 doi: 10.1111/j.1365-313X.2010.04196.x


resequencing whole genomes or even the entire gene space.

Even for marker discovery, where it is possible to obtain

large numbers of ‘random’ SNPs via high-throughput

sequencing of genomic fractions, as has been done for

maize (Barbazuk et al., 2007), large numbers of SNPs are not

typically discovered within a set of specific genes or a

defined genomic region (e.g. a QTL interval).

Since their earliest implementation, hybridization-based

complexity reduction technologies for targeted sequencing

have required the use of blocking DNA in massive excess

(Bashiardes et al., 2005). Most commonly, the blocking reagent

of choice has been the most repetitive genomic portion, the

Cot-1 fraction (Strachan and Read, 1999). Blocking is

believed to suppress non-specific DNA binding that could

lead to capture of ‘off-target fragments’. A second function

of the blocker is to suppress the secondary capture of library

molecules based upon their intrinsic repeat content.

Secondary capture could occur when an array probe anneals

to a complementary fragment from a sample and there

is repeat content elsewhere on that captured genomic

fragment, which could potentially anneal to and capture

other repeat-containing library molecules. To prevent this

secondary capture, the blocking DNA must match the

specific repeats in the genome of interest for effective

sequence capture. Over 85% of the maize B73 genome

(2.3 Gb) consists of repetitive DNA (Bennetzen et al., 2001;

Martienssen et al., 2004; Schnable et al., 2009). In most

genomes of agricultural interest, genes represent a small

percentage of the genome and are dispersed among large

blocks of highly repeated retrotransposon-derived sequence.

Consequently, it is necessary to develop species-specific

sequence capture blocking reagents. The logistical burden of

Cot-1 production drove us to develop a ‘blocker-free’ capture

protocol.

Here we have used a two-stage sequential sequence

capture strategy (repeat subtraction-mediated sequence

capture, RSSC) to sequence an approximately 2.2 Mb chro-

mosomal interval and a set of 43 genes dispersed across the

maize genome. The first capture is designed to deplete

highly repetitive elements from input plant genomic DNA

using a repeat subtraction array. The second capture

employs a target-specific array that enriches for the target

in a reduced-complexity sample. We have further stream-

lined capture implementation by directly capturing from an

approximately 700 bp insert 454 Life Sciences GS FLX-

Titanium long-read sequencing library.

RESULTS

Repeat subtraction-mediated sequence capture (RSSC)

The process of array-based RSSC is shown in Figure 1. RSSC

consists of two phases: reducing the abundance of repetitive

sequences within the capture library and capturing target

sequences from the resulting reduced-complexity library.

The publicly available 454 Life Sciences GS FLX-Titanium

(454 hereafter) library construction protocol was utilized to

produce a single-stranded A/B-adapted sequencing library

for either B73 or Mo17 inbreds with a mean insert size of

approximately 700 bp. This library was then amplified via

limited cycles of PCR using primers designed to the 454 A/B

adapters, purified, and quality checked. Next, RSSC was

performed on a maize repeat array constructed by tiling

probes across the maize accessions in a cereal repeat data-

base (see Experimental Procedures for design criteria).

In addition to the maize repeat array, two specific

sequence capture arrays were designed and generated by

Roche NimbleGen. The first capture array (Interval 377)

targets an approximately 2.2 Mb genomic interval from

chromosome 3 of the B73 inbred (Experimental Proce-

dures). This array was designed based on the sequences of

a series of 70 overlapping BACs. The Interval 377 capture

array models situations in other crop genomes where a

specific region of a sequenced genome is under investiga-

tion or where several sequenced BACs covering a region of

interest are available from an otherwise unsequenced

genome. Such situations may be expected when chromo-

some walking in a large genome such as wheat or pine. The

second capture array (43-Gene array) targets 43 genes

dispersed throughout the genome. The 43-Gene capture

array models the situation whereby several genes in an

otherwise unsequenced genome are under investigation.

For the Interval 377 capture array only, repeat sequences

in the interval were masked prior to probe design (see

Experimental Procedures and Figure S1). Table 1 provides

summary statistics for the design of both capture arrays. The

target region for each array consists of a non-redundant set

of sequences that comprise the probe space. Figure S2

shows the distribution of designed probes across Inter-

val 377. The mapping of probes onto the whole genome

provided an estimate of the repetitiveness of each probe

from the Interval 377 capture array (Table S1). As expected,

all probes except one map to Interval 377. However, 9.9% of

probes (4067/41 555) were mapped to 1–3 additional loci

elsewhere in the genome due to the ancient allotetraploid

nature of the maize genome (Paterson et al., 2004), the high

frequency of transposon-mediated redistribution of genic

fragments, and the existence of nearly identical paralogs

(Schnable et al., 2009).
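The probe-repetitiveness estimate described above amounts to counting, for each probe, how many of its genomic matches fall outside the intended interval. The following is a minimal sketch of that bookkeeping, not the pipeline used in this study: the input layout (`probe_hits` as a mapping from probe name to a list of `(chromosome, position)` matches), the half-open interval convention and the function name are all assumptions made for illustration.

```python
from collections import Counter

def additional_loci_histogram(probe_hits, interval):
    """Count probes by their number of matches outside the target interval.

    probe_hits: dict mapping probe name -> list of (chromosome, position) hits
                (hypothetical layout, assumed for this sketch).
    interval:   (chromosome, start, end), half-open on the end coordinate.
    """
    chrom, start, end = interval
    histogram = Counter()
    for hits in probe_hits.values():
        # A hit is "additional" if it lies anywhere outside the interval.
        extra = sum(
            1 for c, pos in hits
            if not (c == chrom and start <= pos < end)
        )
        histogram[extra] += 1
    return histogram

# Toy data: three probes, one of which also matches a locus on chromosome 1.
toy_hits = {
    "probe_a": [("chr3", 1500)],
    "probe_b": [("chr3", 2500), ("chr1", 900)],
    "probe_c": [("chr3", 3500)],
}
toy_histogram = additional_loci_histogram(toy_hits, ("chr3", 1000, 5000))
```

In such a histogram, the bin at zero holds probes unique to the interval, and the bins at one or more hold probes expected to capture paralogous sequence.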

Regional and dispersed sequence capture (from B73)

To characterize RSSC, the Interval 377 capture array was

used to perform two independent captures of B73 genomic

DNA. DNA fragments eluted from the capture arrays were

sequenced using the 454 pyrosequencing technology, and

the resulting filtered reads were mapped to the B73 refer-

ence genome (B73 RefGen_v1; Experimental Procedures).

If the two captures from the B73 genotype are considered as

a pool, more than 97% of the filter-passing captured 454


sequence reads can be mapped uniquely to the reference

genome (Table S2). The two independent B73 captures

show similar proportions of on-target reads and base coverage, as

well as similar median and mean base-pair coverage

statistics (Figure S2 and Table S2). Consequently, all sub-

sequent analyses were performed after pooling reads from

the two independent B73 captures to more fully encompass

all sources of technical variation. Nearly 90% of the pooled

B73 reads that can be mapped to B73 RefGen_v1 map to a

single location, and are therefore non-repetitive sequences

(Figure S3). Given that 85–90% of the maize genome is

highly repetitive (Bennetzen et al., 2001; Martienssen et al.,

2004; Schnable et al., 2009), this finding demonstrates the

efficiency of the array-based repeat-subtraction procedure.

Reads are considered ‘on-target’ if they overlap the target

region. By comparing the percentage of on-target reads

(31%) with the proportion of the genome contained in

Interval 377, we calculate that RSSC for B73 achieved an

approximately 2600-fold enrichment of target sequences

(Table 2). This enrichment is illustrated by the dramatic dif-

ference in coverage achieved between targeted regions

within Interval 377 and flanking untargeted regions (Fig-

ure 2a). Approximately 98% (271 498 bp/277 305 bp) of all

bases in the target region were covered by at least one

sequence read, and approximately 97% of all bases in the

target region have greater than or equal to threefold cover-

age (Table 2). The threefold coverage value is significant

because that is the minimum coverage previously estab-

lished for SNP identification between inbred maize lines

(Barbazuk et al., 2007).
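The fold-enrichment calculation follows the definition given in footnote e of Table 2: the fraction of reads that are on target divided by the fraction of the genome that the target region represents. A minimal sketch of this arithmetic is shown below; the function name and argument names are illustrative choices, and the 2.3 Gb genome size is the value used in the text.

```python
def fold_enrichment(on_target_reads, total_reads, target_bp,
                    genome_bp=2_300_000_000):
    """Fraction of reads on target divided by fraction of genome targeted."""
    if total_reads <= 0 or target_bp <= 0 or genome_bp <= 0:
        raise ValueError("read counts and lengths must be positive")
    read_fraction = on_target_reads / total_reads
    genome_fraction = target_bp / genome_bp
    return read_fraction / genome_fraction

# Pooled B73 captures on the Interval 377 array (counts from Tables 1 and 2):
# 83 429 on-target reads of 268 350 filtered reads, 277 305 bp target region.
b73_interval_377 = fold_enrichment(83_429, 268_350, 277_305)
print(f"{b73_interval_377:.0f}-fold")
```

With these inputs the result rounds to the approximately 2600-fold enrichment reported for B73; the same function applies to the paralog enrichment in footnote h when the on-paralog read count and paralogous region length are substituted.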

Mapped reads are highly clustered near probe locations,

suggesting highly efficient capture of probe sequences but

reduced capture of sequences >500 bp from probes, consis-

tent with a capture library consisting of fragments of

approximately 700 bp. B73 reads that could not be mapped

to Interval 377 (i.e. off-target reads) exhibit a seemingly

Figure 1. Workflow for the NimbleGen repeat subtraction-mediated sequence

capture experiments for the maize genome.

Target regions in the maize genome were selected and an array was designed

to represent the unique portions of each target (red, green and blue segments

in DNA). (1) A GS FLX-Titanium sequencing library was constructed for each

sample; this process places the sequences necessary for 454 sequencing into

the library before capture. The magnified section shows Titanium molecule

ends (approximately 4 h). (2) The sequencing library was amplified via the 454

adapter sequences (approximately 2 h). (3) The sample is hybridized to a

repeat subtraction array; design content focused on tiling the repeat content

from the sample genome (24–72 h). The repeat-containing library molecules

hybridize to the repeat array (black molecules and arrows), but the target

fragments do not (red, green and blue). (4) The repeat content array is

discarded. (5) The hybridization cocktail is recovered and placed onto the

capture array. (6) The array is hybridized (72 h) and washed (1 h including

elution time). (7) Captured target fragments are eluted from the array (red,

green and blue). (8) Target fragments are amplified via 454 adapters

(approximately 2 h). Because the 454 adapters are on the fragments, the

samples are simply diluted and directly sequenced (9) using the 454 GS FLX-

Titanium (24 h). A total of 8–16 samples can move from step 1 through to

sequencing in as little as 2 weeks of total time.

[Figure 1 panel labels: 1. Build Titanium sequencing library; 2. Amplification; 3. Hybridization (subtraction array); 4. Discard array; 5. Recover cocktail and repeat hybridization (sequence capture array); 6. Array washing; 7. Target fragment elution; 8. Amplification; 9. Sequencing. Target regions A, B and C.]







random distribution across the genome, with two notable

exceptions. In the first exception, 91% of an approximately

53 kb interval of chromosome 1 exhibits ≥98% identity to

Interval 377. Consequently, 2032 probes perfectly match

both Interval 377 and this interval of chromosome 1. The

second exception involves an approximately 33 kb interval

of chromosome 8 that exhibits less sequence identity to

Interval 377, but even so 28 probes perfectly match both

intervals. In both of these cases, the existence of perfectly

matched probes resulted in paralog capture (Figure 3).

When the 43-Gene capture array was similarly used to

capture B73 sequences via RSSC, an approximately 2900-

fold enrichment was achieved (Table 2). Even though many

fewer reads were generated for this capture [approximately

16 k versus approximately 268 k], 91% of the targeted bases

were covered by at least one sequence read. The reduced

read number results in a lower percentage of greater than

or equal to threefold coverage (73% versus 97%) and a lower

mean coverage (6× versus 106×).
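The coverage statistics quoted here (percentage of target bases covered by at least 1, 3 or 10 reads, and mean coverage) can be derived from per-base read depth over the target region. The sketch below illustrates one simple way to compute them from mapped read intervals; it assumes half-open `(start, end)` coordinates on a single sequence and uses hypothetical names, and it is not intended to reproduce the alignment filtering described in Experimental Procedures.

```python
from collections import Counter

def coverage_summary(read_intervals, target_intervals, thresholds=(1, 3, 10)):
    """Return ({threshold: percent of target bases at >= threshold}, mean depth).

    Intervals are half-open (start, end) pairs on one reference sequence.
    """
    target_bases = set()
    for start, end in target_intervals:
        target_bases.update(range(start, end))
    if not target_bases:
        raise ValueError("target region is empty")

    # Accumulate depth only at positions that belong to the target region.
    depth = Counter()
    for start, end in read_intervals:
        for pos in range(start, end):
            if pos in target_bases:
                depth[pos] += 1

    n = len(target_bases)
    percent = {
        t: 100.0 * sum(1 for pos in target_bases if depth[pos] >= t) / n
        for t in thresholds
    }
    mean_depth = sum(depth.values()) / n
    return percent, mean_depth

# Toy example: a 10 bp target and four short reads.
toy_percent, toy_mean = coverage_summary(
    read_intervals=[(0, 5), (0, 5), (0, 5), (3, 8)],
    target_intervals=[(0, 10)],
)
```

For real data a per-position loop would be slow; a depth array or an interval sweep would be the usual replacement, but the definitions of the statistics stay the same.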

Capture of allelic sequences from Mo17

To determine the efficiency of capturing sequences from a

non-reference inbred using an array based on the B73 ref-

erence genome, RSSC was performed using both B73 arrays

Table 2 Summary statistics for maize capture data using two arrays and two genotypes

| Statistic | Interval 377 array, B73 [b] | Interval 377 array, Mo17 | 43-Gene array [a], B73 | 43-Gene array [a], Mo17 |
|---|---|---|---|---|
| Number of filtered reads [c] | 268 350 | 132 162 | 16 135 | 30 367 |
| Number of on-target reads [d] (percentage of on-target reads) | 83 429 (31%) | 29 226 (22%) | 5612 (35%) | 11 074 (36%) |
| Fold enrichment [e] | Approximately 2600 | Approximately 1800 | Approximately 2900 | Approximately 3000 |
| On-paralog reads [f] (percentage of on-paralog reads) | 8939 (3.3%) | 5157 (3.9%) | ND [g] | ND |
| Fold enrichment for paralogs [h] | Approximately 1700 | Approximately 2000 | ND | ND |
| Coverage: percentage target bases covered by ≥1/≥3/≥10 capture reads | 98/97/94 | 82/78/70 | 91/73/20 | 81/70/46 |
| Coverage: mean coverage of target bases | 106 | 38 | 6 | 12 |
| Coverage: mean coverage per 1000 on-target reads | 1.3 | 1.3 | 1.1 | 1.1 |

[a] Calculations were based on combined data from all genes.
[b] Two B73 regional captures were combined for calculation.
[c] Reads remaining after removal of low-quality reads (Experimental Procedures).
[d] Reads mapping to a region overlapping with the target region (Figure S3).
[e] Percentage of on-target reads/(length of target region/size of B73 reference [2.3 Gb]).
[f] The read mapped to a region overlapping the target paralog region (Figure 3).
[g] Not determined.
[h] Percentage of on-paralog reads/(length of target paralogous region/size of B73 reference genome [2.3 Gb]).

Table 1 Summary statistics for maize capture array design

| Array design statistics | Interval 377 array [a] | 43-Gene array |
|---|---|---|
| Total length (bp) | 2 224 325 | 303 557 |
| Primary target space* after repeat masking (bp) [b] | 666 488 | No masking |
| Length of target region (bp) [c] | 277 305 | 280 749 |
| Percentage of primary target space covered by probes [d] | 42% | 92% |
| Length of target paralogous region (bp) [c] | 45 434 | Not determined |
| Number of non-transposable element protein-encoding genes | 40 [e] | 43 |

[a] Using the B73_RefV1 sequence as the reference sequence (Experimental Procedures).
[b] See Figure S1 for detailed method.
[c] Target region consists of a non-redundant set of sequences used for probe synthesis.
[d] Length of target region/length of primary target space.
[e] Based on members of the ‘filtered gene set’ (Schnable et al., 2009) that overlapped with the target region.
* Albert et al., 2007; Hodges et al., 2007.




Figure 2. RSSC of Interval 377.

(a) Many sequence reads map to Interval 377 (indicated by the black bar), but few sequence reads map to adjacent, non-target regions.

(b) Detailed view of Interval 377. The black bars in the top track indicate regions targeted by sequence capture probes. The next track (orange bars) provides CGH

data for probes within this interval taken from Springer et al., 2009. For each probe, the log2 of the ratio of Mo17/B73 hybridization signals (y axis) is provided.

Negative values indicate higher hybridization values for B73 than Mo17. The blue, red and purple tracks provide normalized coverage (y axis) for B73 sequence

captures (pool of two captures) and a Mo17 sequence capture, and their difference (M – B), respectively. The arrows highlight two examples with negative

log2(Mo17/B73) CGH values and normalized coverage difference (Mo17 – B73). The green track indicates locations of SNPs identified from sequence capture data.

(c) Close-up view of sequence coverage in a small region of the capture interval indicated by the black bar below the green track in (b). Tracks are as described for (b).

Figure 3. Capture of paralogs from chromosome 8.

A total of 28 probes from Interval 377 (shown in green) perfectly match an interval of chromosome 8. The y axis indicates the depth of coverage at each nucleotide

by captured sequences that uniquely align to this interval of chromosome 8.



with a capture library constructed from Mo17 genomic DNA.

Applying the same stringent alignment criteria used for B73,

we achieved approximately 1800-fold and approximately

3000-fold enrichment of Mo17 sequences that match the

targets from the two arrays, respectively (Table 2). For the

43-Gene capture array similar enrichments were achieved

for the B73 and Mo17 genotypes (approximately 2900-fold

and approximately 3000-fold, respectively). In contrast,

when using the Interval 377 capture array, less enrichment

was obtained for Mo17 than was achieved for B73 (approx-

imately 1800-fold versus approximately 2600-fold, respec-

tively). We hypothesized that this could be a consequence

of polymorphisms between B73 and Mo17 within Inter-

val 377. Polymorphisms could reduce the capture of Mo17

sequences by probes designed based on B73 sequences,

and/or cause difficulties in properly mapping Mo17 reads to

the B73 reference sequence (Figure S4).

Maize sequence capture and comparative genomic

hybridization

The hypothesis that polymorphisms are responsible for the

reduced fold enrichment achieved from Mo17 is supported

by data from independent maize comparative genomic

hybridization (CGH) experiments performed with a 2.1 mil-

lion oligonucleotide microarray (Springer et al., 2009) and

the 43-Gene capture array (this study). Consistent with pre-

vious studies, our CGH data indicate that there is extensive

sequence and structural variation between these two maize

haplotypes. Our whole-genome array contains 2072 probes

from within Interval 377. Regions of Interval 377 that had

substantial B73 capture but low or no coverage by Mo17

capture reads typically exhibited negative log2 ratios of

Mo17/B73 hybridization signals in our whole-genome CGH

experiment (Figure 2b). This relationship was also observed


Figure 4. Successful capture of genic regions using the 43-Gene array.

(a) The zmet2 gene, including flanking regions, is one of 43 targeted for capture using this array. The position of an approximately 4.9 kb retrotransposon insertion

in the Mo17 allele is indicated by a red triangle. High-density CGH data (orange data track) provide information on sequence variation between the B73 and

Mo17 alleles. Extreme log2(Mo17/B73) values (y axis) indicate high levels of SNPs, InDel polymorphisms or presence/absence variants. Normalized coverage

(y axis) by B73 (blue) and Mo17 (red) sequence reads is shown.

(b) Successful recovery of novel Mo17 allelic sequences. The VISTA identity plot (pink) was presented using the zmet2 Mo17 allele sequence as the reference

sequence (bottom). Mo17 capture reads were mapped to both the B73 reference sequence and the known sequence of the Mo17 allele. Reads that could be aligned

to both alleles are shown in green. Reads that could be aligned only to the Mo17 allele are shown in red. Reads that align to only Mo17 span the junction of the

Mo17-specific insertion and overlie a highly polymorphic region. By de novo assembling Mo17 sequence reads into contigs (shown in orange) prior to alignment

with the B73 reference allele, it was possible to recover Mo17 sequences that are highly divergent from the B73 reference allele.



in the 43-Gene capture experiments (Figure 4a). As expected,

regions with equivalent coverage typically exhibited CGH

log2 ratios close to zero. We hypothesized that the reason

that the fold enrichments observed for the B73 and Mo17

captures from the 43-Gene capture array were similar is that

the well-characterized genes on this array generally exhibit a

higher degree of conservation between the two genotypes

than do the predicted genes located in Interval 377. This

hypothesis is supported by the finding that 15% of the CGH

probes in Interval 377 (319/2072) exhibit greater than or

equal to twofold variation in hybridization signals (reflecting

significant structural variation between B73 and Mo17),

whereas only approximately 6% of CGH probes designed for

the 43-Gene array do so (980/16 406; Table S3).
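The proportion of CGH probes showing at least twofold variation can be expressed as the fraction of probes whose log2(Mo17/B73) ratio has an absolute value of 1 or more. The sketch below makes that interpretation explicit; treating "greater than or equal to twofold variation" as |log2 ratio| ≥ 1 in either direction is an assumption of this example, as are the function name and the toy signal values.

```python
import math

def fraction_twofold(signal_pairs, fold=2.0):
    """Fraction of probes with >= `fold` difference between two signals.

    signal_pairs: iterable of (mo17_signal, b73_signal), both positive.
    """
    pairs = list(signal_pairs)
    if not pairs:
        raise ValueError("no probes supplied")
    cutoff = math.log2(fold)
    flagged = sum(
        1 for mo17, b73 in pairs
        if abs(math.log2(mo17 / b73)) >= cutoff
    )
    return flagged / len(pairs)

# Toy signals: log2 ratios of 0, -1, -2 and about 0.26.
toy_fraction = fraction_twofold([(100, 100), (50, 100), (100, 400), (120, 100)])

# Reported probe counts from the text.
interval_377_fraction = 319 / 2072    # about 15%
gene_array_fraction = 980 / 16_406    # about 6%
```

The two reported fractions differ by roughly 2.5-fold, which is the contrast the hypothesis above relies on.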

For both Interval 377 and the 43-Gene capture arrays, we

noted that the sequence capture provided coverage for a

larger proportion of the target bases in B73 than in Mo17. We

hypothesized that Mo17 regions without coverage may be

caused by our inability to align captured Mo17 sequence

reads with allelic B73 sequences due to high levels of DNA

sequence polymorphism. To test this hypothesis, we aligned

all Mo17 reads captured from the 43-Gene array to existing

sequences of B73 and Mo17 alleles of four genes from this

array. It was possible to align 2010 of the reads captured

from Mo17 to the sequences of B73 alleles. Interestingly, 223

reads that could not be mapped to the B73 alleles of these

genes could be mapped to the sequences of the Mo17 alleles

of these genes. This finding demonstrates that some Mo17

sequences had been captured but had not been detected as

being on target because they did not align to B73. Pre-

assembly of the Mo17 reads into contigs, followed by

mapping of the contigs onto the B73 reference, allowed

identification of nearly 90% of these reads (197/223). These

newly rescued Mo17 reads cover regions of the Mo17

haplotype that are poorly conserved relative to, or even

absent from, B73. Figure 4(b) depicts this analysis for one of

the four genes.
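The read-rescue analysis above is, in essence, a comparison of three sets of read identifiers: reads alignable to the B73 alleles, reads alignable to the Mo17 alleles, and reads recovered after contig pre-assembly. A minimal sketch of that set logic follows; the read identifiers and function name are hypothetical, and the alignment and assembly steps themselves are outside its scope.

```python
def rescue_summary(aligned_to_b73, aligned_to_mo17, recovered_via_contigs):
    """Identify Mo17-only reads and the share recovered via assembled contigs."""
    b73 = set(aligned_to_b73)
    mo17 = set(aligned_to_mo17)
    # Reads captured from Mo17 that do not align to the B73 allele.
    mo17_only = mo17 - b73
    rescued = mo17_only & set(recovered_via_contigs)
    fraction = len(rescued) / len(mo17_only) if mo17_only else 0.0
    return mo17_only, rescued, fraction

# Toy identifiers: r3 and r4 align only to Mo17; r3 is recovered via contigs.
toy_only, toy_rescued, toy_fraction = rescue_summary(
    aligned_to_b73=["r1", "r2"],
    aligned_to_mo17=["r1", "r2", "r3", "r4"],
    recovered_via_contigs=["r1", "r3"],
)

# Reported counts: 197 of the 223 Mo17-only reads were recovered.
reported_fraction = 197 / 223
```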

SNP prediction and validation

An important application of sequence capture is to develop

SNP-based markers within targeted genomic regions by

using captured sequences from non-reference genotypes.

The ability to use RSSC-derived data to identify SNPs within

targeted regions was tested by aligning the captured Mo17

reads from the two arrays that uniquely map to the target

regions (‘on-target reads’; Table 2) to the B73 reference

genome. Potential SNP sites were required to be covered by

a minimum of three Mo17 reads. Because Mo17 is homo-

zygous at each locus, Mo17 base calls at the polymorphic

site were expected to be identical. Hence, only those poly-

morphic sites that were mono-allelic within all Mo17 reads

were designated ‘high-confidence SNPs’. SNP sites that had

more than one base call within the aligned reads of a single

genotype were assumed to result from the inadvertent

alignment of paralogous sequences; such SNPs were

designated ‘lower-confidence SNPs’. The alignments of

Mo17 reads to the B73 reference genome were used to

predict 1357 and 1221 high-confidence SNPs from the

Interval 377 and 43-Gene arrays, respectively (Experimental

Procedures and Table 3). Rates of false-positive SNP

predictions were estimated via comparison with known

SNPs that had been detected by alignment of existing partial

sequences of the Mo17 alleles for four of the genes present

on the 43-Gene array. We predicted a total of 212 SNPs,

including 151 high-confidence and 61 lower-confidence

SNPs, within the corresponding regions of these four genes.

All of the 151 high-confidence SNPs and 56 of the lower-

confidence SNPs were confirmed via comparisons to our

previously known control sequences. Based on this analysis,

the rate of false-positive SNP prediction is extremely low

(<3%).
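The SNP classification rules described above (at least three Mo17 reads per site, mono-allelic sites called high-confidence, sites with more than one base call called lower-confidence) can be summarized in a few lines. The following is a simplified sketch under those stated rules only; the function name and return labels are illustrative, and it omits the read filtering and repeat-region exclusions described in Experimental Procedures.

```python
from collections import Counter

VALID_BASES = {"A", "C", "G", "T"}

def classify_site(ref_base, read_bases, min_reads=3):
    """Return 'high', 'lower', or None (no call) for one aligned site."""
    calls = [b.upper() for b in read_bases if b.upper() in VALID_BASES]
    if len(calls) < min_reads:
        return None                      # insufficient coverage
    alleles = Counter(calls)
    if set(alleles) == {ref_base.upper()}:
        return None                      # all reads match the reference
    if len(alleles) == 1:
        return "high"                    # mono-allelic and non-reference
    return "lower"                       # mixed calls, possible paralog reads

# Validation counts from the text: 212 predicted SNPs, 151 + 56 confirmed.
predicted = 212
confirmed = 151 + 56
unconfirmed_fraction = (predicted - confirmed) / predicted
```

For example, `classify_site("A", ["G", "G", "G"])` gives a high-confidence call, whereas `classify_site("A", ["G", "G", "A"])` gives a lower-confidence call. The unconfirmed fraction computed from the reported counts is 5/212, consistent with the stated rate of less than 3%.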

It was possible to identify ‘on-target reads’ for use in the

analysis described above because we had access to the B73

reference genome sequence. This would not be possible in a

Table 3 SNP prediction using reads captured from B73 and Mo17

| Array | Input data [a] | Number of SNPs | Number of high-quality SNPs [b] | Number of genes [c] with high-quality SNPs |
|---|---|---|---|---|
| Interval 377 | B73 (all) | 8531 | 98 | 2 |
| Interval 377 | B73 (target) | 23 | 5 | 1 |
| Interval 377 | Mo17 (all) | 8044 | 1693 | 35 |
| Interval 377 | Mo17 (target) | 1649 | 1357 | 34 |
| 43-Gene set | B73 (all) | 170 | 31 | 11 |
| 43-Gene set | B73 (target) | 144 | 30 | 11 |
| 43-Gene set | Mo17 (all) | 2249 | 1240 | 40 |
| 43-Gene set | Mo17 (target) | 1790 | 1221 | 39 |

[a] Two sets of B73 and Mo17-derived sequence reads were used for SNP prediction: all filtered reads (‘all’) and only on-target reads (‘target’).
[b] High-quality SNPs are those that are mono-allelic for all aligning reads. In addition, SNPs identified within repetitive DNA regions of Interval 377 were removed (Experimental Procedures).
[c] There are 40 and 43 genes represented on the Interval 377 and 43-Gene arrays, respectively (Table 1).



species that lacks a reference genome sequence. To test

whether SNPs could be successfully predicted without

access to a reference genome sequence, we performed a

second experiment in which we aligned all Mo17 reads

captured from the two arrays to their respective B73 capture

intervals. This experiment yielded 1693 and 1240 high-

confidence SNPs from the two arrays (Table 3), representing

25% and 2% increases in the numbers of SNPs predicted,

compared to using genome-directed ‘on-target reads’.

To test the hypothesis that the inclusion of non-target

paralogous sequences in the SNP discovery pipeline is

responsible for the increased numbers of high-confidence

SNPs predicted in this second experiment, we performed a

SNP discovery experiment using B73 sequences captured by

the Interval 377 array to predict ‘SNPs’ relative to the B73

reference genome. We have previously shown that there is

little residual heterozygosity in B73 (Emrich et al., 2007).

Hence, the rate at which we identify ‘SNPs’ when using

captured B73 reads is a measure of the number of putative

SNPs that are false-positive due to sequencing errors or the

inadvertent identification of ‘paramorphisms’ as SNPs.

Paramorphisms are sequence variants between highly sim-

ilar paralogs (Fu et al., 2004).

Alignment of only on-target B73 reads to Interval 377 of the B73 reference genome yielded five such high-confidence ‘SNPs’ (Table 3). Because few paralogous sequences are expected among the on-target reads, most of these false-positive SNPs are probably due to sequencing errors in either the captured reads or the reference genome. Examination of the alignments of B73 and Mo17 captured sequences in the regions of the five potential false-positive polymorphic sites indicated that two are the result of sequence errors in the reference genome and one is probably caused by capture of paralogous sequences, while the causes of the remaining two could not be determined because Mo17 reads were not available for these sites. The low rate of false-positives caused by sequencing errors reflects the high stringency of our SNP prediction pipeline.

Alignment of all B73 reads to Interval 377 of the B73 reference genome yielded 98 high-confidence ‘SNPs’, which probably include false positives due to both sequencing errors (ours and reference) and paramorphisms. Overall, the low rate of false-positive SNP calls caused by sequencing errors led us to conclude that <6% (98/1693) of the high-confidence SNPs generated in the absence of paralog removal represent false positives due to paramorphisms.
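The upper bound quoted above is simple arithmetic, and a short sketch makes the calculation explicit. The function name is hypothetical; only the two counts (98 B73 ‘SNPs’ and 1693 Mo17 SNPs, both from alignments of all reads to Interval 377) are taken from the text.

```python
def false_positive_fraction(false_calls: int, total_calls: int) -> float:
    """Fraction of SNP calls that could be false positives.

    false_calls: 'SNPs' called when B73 reads are aligned to the B73 reference
    total_calls: SNPs called when Mo17 reads are aligned to the same interval
    """
    if total_calls <= 0:
        raise ValueError("total_calls must be positive")
    return false_calls / total_calls


# Counts reported in the text for Interval 377 (all reads, no paralog removal).
upper_bound = false_positive_fraction(98, 1693)
print(f"Estimated upper bound on false positives: {upper_bound:.1%}")
```

The estimate treats every B73 self-comparison ‘SNP’ as a false positive, so it is an upper bound rather than a measured error rate.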

DISCUSSION

Repeat subtraction-mediated sequence capture (RSSC)

Over the past two decades, several approaches to achieve a reduction in genomic complexity have been attempted, including EST sequencing, methyl filtration, and high-Cot DNA selection (Barbazuk et al., 2005). Each of these approaches has been successful in reducing genome complexity, but none delivers sequences of interest in a targeted fashion as is possible with hybridization-based sequence capture.

In initial experiments in which we utilized Cot1 DNA as a blocker, we found that maize Cot1 DNA improved the performance of sequence capture compared with human Cot1 DNA (data not shown). Extending this idea suggests that adapting sequence capture technology to crop genomes would require the production of a species-specific blocking agent for each important crop. Published maize Cot1 production protocols have only approximately 10% yield, meaning that scaling of production is prohibitive from the perspective of genomic DNA consumption (Zwick et al., 1997). Furthermore, in our hands, 16 of 20 independent attempts at using the previously published Cot1-based protocol yielded fold enrichments that were at least an order of magnitude below those achieved in the current study (P.S.S., Y.F., N.M.S., W.B.B. and J.A.J., unpublished results). We therefore investigated the use of a two-stage microarray sequence capture method that might yield samples with consistently reduced complexity.

A repeat-subtraction microarray was designed to remove DNA fragments that contain highly repetitive sequences. A similar approach has been used to improve hybridization performance (Newkirk et al., 2005). To date, seven of nine maize RSSC sample attempts have been successful in providing >1000-fold enrichment in each sample. The two ‘failures’ have been traced to a hybridization reagent issue within one experiment (D.J.G. and J.A.J., unpublished results).

Approximately one-third of captured reads are ‘on target’. Although this is sufficient to make this technology very attractive for practical applications, it must be asked why two-thirds of the reads are ‘off target’. Interestingly, in the capture experiments for the human genome, probe sets in the range of 200–500 kb resulted in off-target rates that are in the same range as observed here (Albert et al., 2007). We therefore hypothesize that this off-target rate is probably a consequence of the small design space on our array. When only a small amount of library is hybridized to the array (so as to not overwhelm the repeat subtraction), there are only limited numbers of copies of the target in the sample. Increasing the design space might result in higher ‘on-target’ rates. Recent results from maize using a larger capture target design space support the correlation between design space and specificity that was first observed in the human genome (J.A.J., T.J.A. and D.J.G., unpublished results). Larger designs (approximately fivefold) exhibited an approximately twofold better on-target read rate in two independent tests (data not shown). Other potential approaches to increase the rate of on-target reads include making the numbers of various types of probes on the repeat subtraction array proportional to their copy number in the genome and reducing fragment sizes in the capture library (thereby reducing the potential for secondary capture).

Use of sequence capture to identify allelic variation

The two applications of sequence capture described here highlight the potential uses of this technology. Sequence capture of chromosomal regions, such as Interval 377, which contain a target gene, mutation or QTL can provide two important outcomes. First, the targeted resequencing identifies polymorphisms such as SNPs that can be converted into high-density genetic markers. Not much sequencing was required to obtain a large number of SNPs: a 1/16 region 454 PicoTiterPlate run was well paired, from a coverage perspective, with our approximately 300 kb capture interval within this homozygous genome. Approximately 1600 SNPs were identified that can be used to map the gene or causative QTL to high resolution. Second, the targeted resequencing concomitantly provides a set of potentially causative polymorphisms.

Another application of sequence capture is the isolation and characterization of novel alleles from a non-reference genome. Maize is a highly polymorphic species with many SNPs, InDel polymorphisms and presence/absence variants (Springer et al., 2009). We demonstrated that characterization of novel alleles is greatly facilitated by the combination of CGH and sequence capture. CGH data provide information on the relative conservation of the target genome and the reference genome. Regions of the genome that had lower hybridization to the target genome in CGH experiments often had much lower Mo17 coverage. By complementing sequence capture with CGH, it is possible to rapidly identify conserved and non-conserved regions and to focus novel allele characterization efforts on highly variable regions.

Although mapping Mo17 reads to the B73 reference sequence provided coverage of many regions, thereby allowing us to identify SNPs, a number of regions lacked Mo17 read coverage. It is likely that a subset of the regions with missing Mo17 sequence reflects presence/absence variants (i.e. B73-derived sequences that are simply absent from the Mo17 genome). The remaining regions with missing Mo17 sequence represent either reduced capture of Mo17 sequence due to polymorphisms or inability to align the sequences to the reference B73 sequence due to polymorphisms. Using comparisons to the known Mo17 sequences of several genomic regions, we found that the Mo17 sequences had been effectively captured, but that SNPs and InDel polymorphisms limited our ability to map these captured Mo17 sequence reads to the appropriate location on the B73 reference genome. It was possible to recover some of these sequence reads by first performing an assembly of all captured reads and then mapping the longer Mo17 assemblies to the B73 genome (Figure 4b). Importantly, assembly prior to alignment allowed us to recover novel allelic sequences from a non-reference haplotype that were not targeted by the capture array. Even though the number of sequence reads rescued in this manner is not large, such reads are valuable because they can be used to construct and extend the sequence of a captured haplotype and are therefore useful for identifying insertion/deletion polymorphisms, and may be useful for iterative capture-mediated chromosome walking.

RSSC and paralogs

Maize arose from an allotetraploidization event in the past 5–10 million years (Paterson et al., 2004), and has retained an extensive degree of gene duplication. Processes such as transposon capture of gene fragments (Schnable et al., 2009) have provided additional paralog complexity. Consequently, approximately 10% of our capture probes had more than one identical match in the maize genome, potentially making them eligible to capture paralogs. As expected, these probes were equally capable of capturing the target sequence and the paralogous sequence. Paralogous reads were recovered at a frequency consistent with their probe representation frequency [e.g. 10.7% (8939/83 429) for the B73 captures from the Interval 377 array; Table 2]. The degree to which paralog capture complicates SNP discovery depends on the structure of the genome being analyzed, but we were encouraged to discover that, even in maize, very few of the mono-allelic putative SNPs appear to be false positives.

Broader applicability of RSSC

We have reported a protocol implementation that allowed us to achieve 1800–3000-fold enrichment of both a defined chromosomal interval and a set of dispersed genes. This enrichment is comparable to that achieved for the human genome (Albert et al., 2007). For both captures, 80–98% of targeted bases were covered by captured sequences. The mean coverage of the target regions per 1000 on-target reads was similar for captures from the two different arrays (1.3 versus 1.1), highlighting the overall robustness of the approach. Therefore, the RSSC protocol provides a method to resequence targeted genomic regions of the maize genome, and is expected to exhibit similar levels of performance in other genomes. The ability to design reagents required for repeat subtraction in silico significantly reduces the technical hurdles involved in applying sequence capture across diverse species. Because highly repetitive elements can be discovered using only limited amounts of whole-genome shotgun sequencing data, it should be possible to design species-specific repeat-subtraction arrays with limited investment of resources in combination with next-generation sequencing technologies. Hence, it will be possible to apply RSSC not only to species with sequenced reference genomes, but also to those whose genomes have not yet been sequenced. Importantly, we have established that polymorphism analyses performed in the absence of a fully sequenced reference genome are not substantially cumbersome. We therefore foresee application of this technology for studies of population genetics, cloning of loci controlling quantitative variation, and allele mining in crops, model organisms, and, importantly, non-model species.

EXPERIMENTAL PROCEDURES

Repeat array design

A customized NimbleGen 3 x 720 K sequence capture microarray (081110_Zea_mays_repeats_cap) was synthesized three times per slide to contain maize repetitive elements in the MAGI Cereal Repeat Database (version 3.1; http://magi.plantgenomics.iastate.edu/repeatdb.html) and the Maize Repeat Database (version 4; http://maize.jcvi.org/repeat_db.shtml). The design may be ordered by request. There are 2.1 M total probes on the array, although only the center sub-array containing 720 K probes was utilized in this study. The median probe length is 74 bp.

Maize NimbleGen sequence capture array design

A large genomic region on a BAC fingerprint contig (FPC Ctg138, chromosome 3) was originally selected for targeting. Based on the physical map released prior to 29 May 2008, a total of 70 sequenced BACs are within this FPC contig, and their sequences were downloaded from GenBank on 29 May 2008. The physical map has been updated to the latest release (maize Golden Path AGP version 1, release 4a.53). Details regarding sequence annotation and gene prediction are shown in Figure S1. A total of approximately 1.5 Mb, comprising 44 unordered sequence fragments with 83 non-redundant predicted non-repetitive genes, were soft-masked for probe design. The uniqueness/repetitiveness of all the probes and physical locations of the probes were determined based on the collection of maize BAC sequences available in March 2008. The array design was constructed by tiling at approximately 5 bp spacing across the target regions. Probes with a mean 15-mer frequency in the genome greater than 100 were excluded, as were probes that had more than five close matches in the genome. A close match is a match to the genome that is at least 38 bp long, allowing up to five insertions/deletions/mismatches. When the probes are shorter than 50 bp, we use the length of the probe (12 bp, the seed length) as the minimum match size. A total of 41 555 probes were selected, and replicated at least 17 times on the array. To reconcile with the reference genome sequence, probes were remapped to B73 RefGen_v1 (Schnable et al., 2009). The final sequence interval was defined from 1 kb upstream of the leftmost mapped probe (REGION0042FS000010140) to 1 kb downstream of the rightmost mapped probe (REGION0028FS000002032), i.e. 183 062 553–185 609 824 bp on chromosome 3. Two fragments (183 315 664–183 553 126 bp and 183 880 178–183 965 661 bp) were excluded from analyses because they were not present in the sequences used for probe design. This design was used to generate a customized NimbleGen 3 x 720 K sequence capture array. Only the center sub-array was utilized for this study; this array may be ordered from Roche NimbleGen by requesting 081028_Zea_mays_schnable_cap.
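The two probe exclusion rules described above can be sketched as a simple filter. This is an illustration only, not the design software actually used: the `Probe` record and `passes_filters` function are hypothetical, and the k-mer frequencies and close-match counts are assumed to have been computed beforehand by some other tool.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    probe_id: str          # hypothetical identifier
    mean_kmer_freq: float  # mean k-mer frequency of the probe in the genome
    close_matches: int     # number of close matches in the genome


def passes_filters(probe: Probe,
                   max_mean_kmer_freq: float = 100,
                   max_close_matches: int = 5) -> bool:
    """Keep a probe unless it exceeds either threshold.

    Defaults follow the Interval 377 design: probes with a mean 15-mer
    frequency greater than 100, or more than five close matches, are excluded.
    """
    return (probe.mean_kmer_freq <= max_mean_kmer_freq
            and probe.close_matches <= max_close_matches)


# Made-up candidate probes, for illustration only.
candidates = [
    Probe("p1", 12.0, 1),   # kept
    Probe("p2", 150.0, 1),  # excluded: too repetitive by k-mer frequency
    Probe("p3", 40.0, 6),   # excluded: too many close matches
    Probe("p4", 100.0, 5),  # kept: exactly at both thresholds
]
selected = [p.probe_id for p in candidates if passes_filters(p)]
```

The thresholds are arguments so that the looser settings of the 43-Gene design (mean 13-mer frequency of 500 and seven close matches) can be passed instead.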

The second NimbleGen 3 x 720 K sequence capture array design was constructed by tiling at approximately 15 bp spacing across 43 dispersed gene targets. Probes with a mean 13-mer frequency in the genome greater than 500 were excluded, as were probes that had more than seven close matches in the genome. A total of 16 406 probes were selected and replicated 44 times on the array. This array comprises approximately 350 kbp of genomic space, but has only 123 kbp represented within the probes. Again, only the center sub-array was utilized for this study. This design may be ordered by requesting 080328_maize_cap_springer_1. These probes are of various lengths and the median probe length for both designs is 76 bp.

Maize sequence capture and 454 sequencing

DNA was isolated from 14-day-old seedlings of two maize inbreds, B73 and Mo17, using a previously described protocol (Li et al., 2007). A 700 bp mean insert size 454 GS FLX-Titanium sequencing library (454 Life Sciences, http://www.454.com) was generated for each inbred and subjected to eight cycles of amplification using primers based upon the sequencing adapters. Amplicons were purified using a QIAquick/MinElute spin column (Qiagen, http://www.qiagen.com/). The DNA concentration was determined using NanoDrop ND1000 (Thermo Scientific, http://www.thermo.com) and the molecular weight range was determined using an Agilent Bioanalyzer 2100 with a DNA7500 kit (Agilent Technologies, http://www.agilent.com). We progressively decreased the total amount of library used per hybridization across the study. The Interval 377 captures used 500 ng, and the 43-Gene captures utilized either 250 or 150 ng for the repeat-subtraction hybridization. The indicated mass of double-stranded sequencing library was hybridized to the maize repeat subtraction array at low stringency (37°C) using the Mai Tai system (SciGene, http://www.scigene.com) with NimbleGen hybridization solution supplemented with Tween-20 at 0.1% v/v, together with a 100-fold molar excess of non-extendable primers complementary to the sequencing adapters. The rotation speed in the SciGene hybridization oven was set to 15. The hybridization cocktail was recovered by separating the two slides with the gasket array on the bottom (facing up) and the subtraction array on the top (facing down). The remaining hybridization cocktail, containing the library fragments of interest (still on the gasket slide), was subjected to a second capture array aimed at the gene space of interest. The capture array was placed with the probe side facing down onto the hybridization cocktail on the gasket slide. The gasket slide remained in the Mai-Tai rig during placement. The capture array was then subjected to an additional 4 days of hybridization at 42.5°C with the rotator set to 15. The capture array was washed as previously described (Albert et al., 2007) and eluted using a sodium hydroxide method that is available from Roche NimbleGen Technical Support on request. The eluted molecules were amplified via the sequencing adapters (14 cycles), and the products were purified and quantified. The double-stranded eluted libraries were diluted for emulsion PCR (emPCR) as recommended by 454 Life Sciences, and sequenced using the 454 Life Sciences GS FLX-Titanium protocol according to the manufacturer’s instructions using a 4- or 16-region Titanium PicoTiterPlate. Prior to emPCR, the diluted double-stranded eluate libraries were heat-treated at 95°C for 2 min in a thermal cycler. This heating step was found to be essential to avoid amplification-associated artifacts in the emPCR. Raw 454 capture reads (deposited in the GenBank Short Read Archive under accession number SRA009261.9) of low quality (parameters: maximum mean error = 0.01, maximum error at ends = 0.01) and short 454 reads (<200 bp) were removed using Lucy (Chou and Holmes, 2001). This cut-off was selected because few of the 454 reads of <200 bp could be mapped to the B73 reference genome (K.Y., Y.F. and P.S.S., unpublished results).
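The length cut-off applied to the reads can be illustrated with a minimal sketch. It reproduces only the 200 bp length filter, not the quality trimming performed by Lucy; the FASTA parser, function names and example reads are all hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def keep_long_reads(records: Iterable[Tuple[str, str]],
                    min_length: int = 200) -> List[Tuple[str, str]]:
    """Discard reads shorter than min_length (200 bp in this study)."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


# Two made-up reads: one of 240 bp and one of 160 bp.
example = [">read_a", "ACGT" * 60, ">read_b", "ACGT" * 40]
kept = keep_long_reads(parse_fasta(example))
```

In this example only `read_a` survives, because `read_b` falls below the 200 bp cut-off.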

Data analyses

To estimate on-target rates, all filtered B73 and Mo17 captured 454 reads were aligned to the B73 reference genome sequence, i.e. B73_RefGen_v1 (Schnable et al., 2009) (BLAST alignment criteria: 95% similarity and total unaligned regions of both 5′ and 3′ ends of 454 reads ≤15 bp). Sequence reads whose best match overlapped a target region were classified as on-target. The target paralog region is defined as a non-redundant set of sequences of those probes that can be mapped both inside and outside Interval 377. Sequence reads with a best match that overlaps the target paralog region are considered as ‘on-paralog’ reads. Whole-genome CGH data were retrieved from the NCBI GEO database (GSE16938) (Springer et al., 2009). Only CGH probes within targeted regions were used to calculate normalized coverage. GFF files were generated for data visualization using NimbleScan (version 2.4, NimbleGen). Shell and AWK scripts for the analysis pipeline are available upon request. Additional CGH data for the 43-Gene array are given in Table S3. Sequence alignments between B73 and Mo17 allelic sequences were performed using VISTA (LAGAN alignment program used with default settings) (Frazer et al., 2004). CAP3 (Huang and Madan, 1999) was used for assembling Mo17 reads from the 43-Gene array (parameters used: overlap percentage identity ≥95, overlap length ≥50 bp).
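The read classification described above amounts to an interval-overlap test on each read's best alignment. The sketch below is an assumption about how such a step might look, not the authors' Shell/AWK pipeline; the function names and all coordinates are hypothetical, and checking targets before paralog regions is a choice made here for illustration.

```python
from typing import List, Tuple

# (chromosome, start, end), 1-based inclusive coordinates
Region = Tuple[str, int, int]


def overlaps(a: Region, b: Region) -> bool:
    """True if two regions lie on the same chromosome and share at least 1 bp."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def classify_read(best_hit: Region,
                  targets: List[Region],
                  paralog_regions: List[Region]) -> str:
    """Label a read by where its best alignment falls."""
    if any(overlaps(best_hit, t) for t in targets):
        return "on-target"
    if any(overlaps(best_hit, p) for p in paralog_regions):
        return "on-paralog"
    return "off-target"


# Hypothetical regions: one target inside Interval 377 and one paralog elsewhere.
targets = [("chr3", 183_100_000, 183_101_000)]
paralogs = [("chr8", 5_000, 6_000)]

labels = [
    classify_read(("chr3", 183_100_900, 183_101_300), targets, paralogs),
    classify_read(("chr8", 5_500, 5_900), targets, paralogs),
    classify_read(("chr1", 1, 300), targets, paralogs),
]
```

The on-target rate is then the number of reads labelled `on-target` divided by the total number of filtered reads.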

Comparative genomic hybridization (CGH)

CGH was performed using the 43-Gene capture array in place of a CGH design within a standard Roche NimbleGen CGH workflow for NimbleGen human CGH with 385 K arrays. Two arrays were utilized in a B73 versus Mo17 dye swap. Labeling, hybridization, washing, scanning and analytical conditions were as previously reported (Springer et al., 2009).

SNP discovery

SNP discovery was performed either using all filtered 454 reads or the subset of on-target 454 reads defined above. The 454 reads were aligned to the reference sequences (either the chromosome 3 Interval 377 or the 43-Gene set) using MosaikAligner (Hillier et al., 2008) with the following parameters: -a (alignment algorithm), all; -p (CPUs used), 8; -mmp (maximum percentage of read length to be mismatched), 0.05; -minp (minimum percentage of the read length aligned), 0.95; -mmal (aligned read length rather than the original read length when counting errors); -m (alignment mode), unique; -hs (hash size), 15; -mhp (maximum number of positions to use), 100. These alignment parameters ensured that each 454 sequence read was uniquely aligned; sequences that failed to meet these criteria were discarded from the analysis. SNPs were identified within the alignments using the GIGABAYES package (http://bioinformatics.bc.edu/marthlab). Arguments to GIGABAYES were: –D (pairwise nucleotide diversity), 0.003; –ploidy (sample ploidy), haploid; –algorithm, recursive; –sample (sequence source), single; –anchor; –CAL (minimum overall allele coverage), 3; –QRL (minimum base quality value), 20. Potential SNP sites were required to be covered by a minimum of three Mo17 reads, and all Mo17 base calls at the polymorphic site were expected to be identical. SNP sites that had more than one allele within the aligned reads were assumed to result from alignment of paralog sequences. In addition, potential high-confidence SNP sites within the Interval 377 region were required to be from non-repetitive regions. The false SNP discovery rate was determined by identifying potential SNPs from B73 captured reads.
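The post-calling criteria for a high-confidence SNP site (at least three Mo17 reads, all carrying the same base) can be sketched as follows. This illustrates the filtering logic only: it does not reproduce the GIGABAYES model or the base-quality threshold, and the function name and category labels are hypothetical.

```python
from collections import Counter


def call_site(ref_base: str, read_bases: str, min_coverage: int = 3) -> str:
    """Classify one reference position from the bases of the aligned reads.

    ref_base:   reference (B73) base at the site
    read_bases: one character per aligned Mo17 read covering the site
    """
    bases = [b.upper() for b in read_bases]
    if len(bases) < min_coverage:
        return "low-coverage"
    counts = Counter(bases)
    if len(counts) > 1:
        # More than one allele among the reads is assumed to reflect
        # alignment of paralogous sequences, so the site is rejected.
        return "multi-allelic"
    return "snp" if bases[0] != ref_base.upper() else "reference"


calls = [
    call_site("A", "GGG"),  # three concordant reads differing from reference
    call_site("A", "GG"),   # too few reads
    call_site("A", "GGA"),  # discordant reads
    call_site("A", "aaa"),  # reads match the reference
]
```

Only sites labelled `snp` would count as high-confidence calls under these assumptions; for Interval 377 the text additionally requires that the site lie in a non-repetitive region.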

ACKNOWLEDGEMENTS

We thank James Birchler (Division of Biological Sciences, University of Missouri) for providing repeat clones for early development work, John Luckey, Jason Norton and Paul Marrione for support with Mai-tai hybridization optimization and platform development, Rudi Seibl, Rebecca Selzer and Courtney Erickson for research and development support, and the Maize Genome Sequencing Project (NSF DBI-0527192) for sharing genome sequences and annotation prior to publication. This project was supported in part by funding to P.S.S. from the Iowa State University Plant Sciences Institute, funding to N.M.S. from the University of Minnesota, and a grant from the National Science Foundation Plant Genome Program (DBI-0501758) and funding from University of Florida to W.B.B. The Roche NimbleGen research and development group is privately funded.

SUPPORTING INFORMATION

Additional Supporting Information may be found in the online version of this article:
Figure S1. Workflow for preparing BAC sequences within FPC Ctg138 for design of probes for the Interval 377 array.
Figure S2. Evaluation of the coverage and reproducibility of sequence capture from Interval 377.
Figure S3. Mapping of capture reads to Interval 377.
Figure S4. Comparison of CGH data and sequence capture efficiency.
Table S1. Re-mapping of probes designed for Interval 377 to B73 RefGen_v1.
Table S2. Summary statistics for two B73 captures of Interval 377.
Table S3. Gene array CGH data.
Please note: As a service to our authors and readers, this journal provides supporting information supplied by the authors. Such materials are peer-reviewed and may be re-organized for online delivery, but are not copy-edited or typeset. Technical support issues arising from supporting information (other than missing files) should be addressed to the authors.

REFERENCES

Albert, T.J., Molla, M.N., Muzny, D.M. et al. (2007) Direct selection of human genomic loci by microarray hybridization. Nat. Methods, 4, 903–905.

Barbazuk, W.B., Bedell, J.A. and Rabinowicz, P.D. (2005) Reduced representation sequencing: a success in maize and a promise for other plant genomes. Bioessays, 27, 839–848.

Barbazuk, W.B., Emrich, S.J., Chen, H.D., Li, L. and Schnable, P.S. (2007) SNP discovery via 454 transcriptome sequencing. Plant J. 51, 910–918.

Bashiardes, S., Veile, R., Helms, C., Mardis, E.R., Bowcock, A.M. and Lovett, M. (2005) Direct genomic selection. Nat. Methods, 2, 63–69.

Bennetzen, J.L. (2005) Transposable elements, gene creation and genome rearrangement in flowering plants. Curr. Opin. Genet. Dev. 15, 621–627.

Bennetzen, J.L., Chandler, V.L. and Schnable, P. (2001) National Science Foundation-sponsored workshop report. Maize genome sequencing project. Plant Physiol. 127, 1572–1578.

Chou, H.H. and Holmes, M.H. (2001) DNA sequence quality trimming and vector removal. Bioinformatics, 17, 1093–1104.

D’Ascenzo, M., Meacham, C., Kitzman, J. et al. (2009) Mutation discovery in the mouse using genetically guided array capture and re-sequencing. Mamm. Genome, 20, 424–436.

Emrich, S.J., Li, L., Wen, T.J., Yandeau-Nelson, M.D., Fu, Y., Guo, L., Chou, H.H., Aluru, S., Ashlock, D.A. and Schnable, P.S. (2007) Nearly identical paralogs: implications for maize (Zea mays L.) genome evolution. Genetics, 175, 429–439.

Feuillet, C., Langridge, P. and Waugh, R. (2008) Cereal breeding takes a walk on the wild side. Trends Genet. 24, 24–32.

Frazer, K.A., Pachter, L., Poliakov, A., Rubin, E.M. and Dubchak, I. (2004) VISTA: computational tools for comparative genomics. Nucleic Acids Res. 32, W273–W279.

Fu, Y., Hsia, A.P., Guo, L. and Schnable, P.S. (2004) Types and frequencies of sequencing errors in methyl-filtered and high C0t maize genome survey sequences. Plant Physiol. 135, 2040–2045.

Herman, D.S., Hovingh, G.K., Iartchouk, O., Rehm, H.L., Kucherlapati, R., Seidman, J.G. and Seidman, C.E. (2009) Filter-based hybridization capture of subgenomes enables resequencing and copy-number detection. Nat. Methods, 6, 507–510.

Hillier, L.W., Marth, G.T., Quinlan, A.R. et al. (2008) Whole-genome sequencing and variant discovery in C. elegans. Nat. Methods, 5, 183–188.



Hodges, E., Xuan, Z., Balija, V. et al. (2007) Genome-wide in situ exon capture for selective resequencing. Nat. Genet. 39, 1522–1527.

Huang, X. and Madan, A. (1999) CAP3: a DNA sequence assembly program. Genome Res. 9, 868–877.

Li, J., Harper, L.C., Golubovskaya, I., Wang, C.R., Weber, D., Meeley, R.B., McElver, J., Bowen, B., Cande, W.Z. and Schnable, P.S. (2007) Functional analysis of maize RAD51 in meiosis and double-strand break repair. Genetics, 176, 1469–1482.

Martienssen, R.A., Rabinowicz, P.D., O’Shaughnessy, A. and McCombie, W.R. (2004) Sequencing the maize genome. Curr. Opin. Plant Biol. 7, 102–107.

Morse, A.M., Peterson, D.G., Islam-Faridi, M.N. et al. (2009) Evolution of genome size and complexity in Pinus. PLoS ONE, 4, e4332.

Newkirk, H.L., Knoll, J.H. and Rogan, P.K. (2005) Distortion of quantitative genomic and expression hybridization by Cot-1 DNA: mitigation of this effect. Nucleic Acids Res. 33, e191.

Okou, D.T., Steinberg, K.M., Middle, C., Cutler, D.J., Albert, T.J. and Zwick, M.E. (2007) Microarray-based genomic selection for high-throughput resequencing. Nat. Methods, 4, 907–909.

Paterson, A.H., Bowers, J.E. and Chapman, B.A. (2004) Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc. Natl Acad. Sci. USA, 101, 9903–9908.

Porreca, G.J., Zhang, K., Li, J.B. et al. (2007) Multiplex amplification of large sets of human exons. Nat. Methods, 4, 931–936.

Schnable, P.S., Ware, D., Fulton, R.S. et al. (2009) The B73 maize genome: complexity, diversity, and dynamics. Science, 326, 1112–1115.

Springer, N.M., Ying, K., Fu, Y. et al. (2009) Maize inbreds exhibit high levels of copy number variation (CNV) and presence/absence variation (PAV) in genome content. PLoS Genet. 5, e1000734.

Strachan, T. and Read, A.P. (1999) Human Molecular Genetics. New York: John Wiley & Sons Inc.

Zwick, M.S., Hanson, R.E., Islam-Faridi, M.N., Stelly, D.M., Wing, R.A., Price, H.J. and McKnight, T.D. (1997) A rapid procedure for the isolation of C0t-1 DNA from plants. Genome, 40, 138–142.























