Repeat subtraction-mediated sequence capture from a complex genome
Transcript of Repeat subtraction-mediated sequence capture from a complex genome
TECHNICAL ADVANCE
Repeat subtraction-mediated sequence capture froma complex genome
Yan Fu1,2, Nathan M. Springer3, Daniel J. Gerhardt4, Kai Ying5,6, Cheng-Ting Yeh1,7, Wei Wu1, Ruth Swanson-Wagner5,6,
Mark D’Ascenzo4, Tracy Millard4, Lindsay Freeberg4, Natsuyo Aoyama4, Jacob Kitzman4, Daniel Burgess4, Todd Richmond4,
Thomas J. Albert4, W. Brad Barbazuk8, Jeffrey A. Jeddeloh4,* and Patrick S. Schnable1,2,6,7,*
1Department of Agronomy, Iowa State University, Ames, IA 50011, USA,2Center for Carbon-Capturing Crops, Iowa State University, Ames, IA 50011, USA,3Department of Plant Biology, University of Minnesota, St Paul, MN 55108, USA,4Roche NimbleGen Inc., Madison, WI 53719, USA,5Interdepartmental Genetics Graduate Program, Iowa State University, Ames, IA 50011, USA,6Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA,7Center for Plant Genomics, Iowa State University, Ames, IA 50011, USA, and8Department of Biology and the Genetics Institute, University of Florida, Gainesville, FL 32610, USA
Received 30 October 2009; revised 3 February 2010; accepted 9 February 2010; published online 12 April 2010.*For correspondence (fax +1 608 218 7601; e-mail [email protected] or fax +1 515 294 5256; e-mail [email protected]).
SUMMARY
Sequence capture technologies, pioneered in mammalian genomes, enable the resequencing of targeted
genomic regions. Most capture protocols require blocking DNA, the production of which in large quantities can
prove challenging. A blocker-free, two-stage capture protocol was developed using NimbleGen arrays. The
first capture depletes the library of repetitive sequences, while the second enriches for target loci. This strategy
was used to resequence non-repetitive portions of an approximately 2.2 Mb chromosomal interval and a set of
43 genes dispersed in the 2.3 Gb maize genome. This approach achieved approximately 1800–3000-fold
enrichment and 80–98% coverage of targeted bases. More than 2500 SNPs were identified in target genes. Low
rates of false-positive SNP predictions were obtained, even in the presence of captured paralogous sequences.
Importantly, it was possible to recover novel sequences from non-reference alleles. The ability to design novel
repeat-subtraction and target capture arrays makes this technology accessible in any species.
Keywords: NimbleGen sequence capture, genotyping, SNP, molecular marker, reduced representation
sequencing, allele mining.
INTRODUCTION
Identifying genetic variation is a critical step in relating
genotypes to phenotypes. Reference genome sequences
exist for several plant species; however, it remains expen-
sive and difficult to perform whole-genome sequencing of
dozens of haplotypes per species, especially in crops such as
maize (Martienssen et al., 2004; Bennetzen, 2005), wheat
(Feuillet et al., 2008) and conifers (Morse et al., 2009), which
have large genomes with complicated high-copy repeat
interspersion. The ability to perform targeted resequencing
of specific intervals of the low-copy fraction of these
genomes has significant potential for a range of applica-
tions, such as discovering markers, identifying the basis of
mutants, and cloning qualitative and quantitative trait loci
(QTL), all of which can contribute to the genetic improve-
ment of crops.
Microarray-based sequence capture (Albert et al., 2007;
Hodges et al., 2007; Okou et al., 2007; D’Ascenzo et al., 2009)
has been successfully applied to mammalian genomes for
resequencing exons, large genomic loci and candidate gene
sets. Sequence capture has also been performed using
solution hybrid selection with long oligos (Porreca et al.,
2007) or PCR products (Herman et al., 2009) as probes. The
substantial enrichment of target sequences achieved via
sequence capture makes it much less expensive than
898 ª 2010 The AuthorsJournal compilation ª 2010 Blackwell Publishing Ltd
The Plant Journal (2010) 62, 898–909 doi: 10.1111/j.1365-313X.2010.04196.x
resequencing whole genomes or even the entire gene space.
Even for marker discovery, where it is possible to obtain
large numbers of ‘random’ SNPs via high-throughput
sequencing of genomic fractions, as has been done for
maize (Barbazuk et al., 2007), large numbers of SNPs are not
typically discovered within a set of specific genes or a
defined genomic region (e.g. a QTL interval).
Since their earliest implementation, hybridization-based
complexity reduction technologies for targeted sequencing
have required the use of blocking DNA in massive excess
(Bashiardes et al., 2005). Most usually, the blocking reagent
of choice has been the most repetitive genomic portion, the
Cot-1 fraction (Strachan and Read, 1999). Blocking is
believed to suppress non-specific DNA binding that could
lead to capture of ‘off-target fragments’. A second function
of the blocker is to suppress the secondary capture of library
molecules based upon their intrinsic repeat content.
Secondary capture could occur when an array probe anneals
to a complementary fragment from a sample and there
is repeat content elsewhere on that captured genomic
fragment, which could potentially anneal to and capture
other repeat-containing library molecules. To prevent this
secondary capture, the blocking DNA must match the
specific repeats in the genome of interest for effective
sequence capture. Over 85% of the maize B73 genome
(2.3 Gb) consists of repetitive DNA (Bennetzen et al., 2001;
Martienssen et al., 2004; Schnable et al., 2009). In most
genomes of agricultural interest, genes represent a small
percentage of the genome and are dispersed among large
blocks of highly repeated retrotransposon-derived sequence.
Consequently, it is necessary to develop species-specific
sequence capture blocking reagents. The logistical burden of
Cot-1 production drove us to develop a ‘blocker-free’ capture
protocol.
Here we have used a two-stage sequential sequence
capture strategy (repeat subtraction-mediated sequence
capture, RSSC) to sequence an approximately 2.2 Mb chro-
mosomal interval and a set of 43 genes dispersed across the
maize genome. The first capture is designed to deplete
highly repetitive elements from input plant genomic DNA
using a repeat subtraction array. The second capture
employs a target-specific array that enriches for the target
in a reduced-complexity sample. We have further stream-
lined capture implementation by directly capturing from an
approximately 700 bp insert 454 Life Sciences GS FLX-
Titanium long-read sequencing library.
RESULTS
Repeat subtraction-mediated sequence capture (RSSC)
The process of array-based RSSC is shown in Figure 1. RSSC
consists of two phases: reducing the abundance of repetitive
sequences within the capture library and capturing target
sequences from the resulting reduced-complexity library.
The publically available 454 Life Sciences GS FLX-Titanium
(454 hereafter) library construction protocol was utilized to
produce a single-stranded A/B-adapted sequencing library
for either B73 or Mo17 inbreds with a mean insert size of
approximately 700 bp. This library was then amplified via
limited cycles of PCR using primers designed to the 454 A/B
adapters, purified, and quality checked. Next, RSSC was
performed on a maize repeat array constructed by tiling
probes across the maize accessions in a cereal repeat data-
base (see Experimental Procedures for design criteria).
In addition to the maize repeat array, two specific
sequence capture arrays were designed and generated by
Roche NimbleGen. The first capture array (Interval 377)
targets an approximately 2.2 Mb genomic interval from
chromosome 3 of the B73 inbred (Experimental Proce-
dures). This array was designed based on the sequences of
a series of 70 overlapping BACs. The Interval 377 capture
array models situations in other crop genomes where a
specific region of a sequenced genome is under investiga-
tion or where several sequenced BACs covering a region of
interest are available from an otherwise unsequenced
genome. Such situations may be expected when chromo-
some walking in a large genome such as wheat or pine. The
second capture array (43-Gene array) targets 43 genes
dispersed throughout the genome. The 43-Gene capture
array models the situation whereby several genes in an
otherwise unsequenced genome are under investigation.
For the Interval 377 capture array only, repeat sequences
in the interval were masked prior to probe design (see
Experimental Procedures and Figure S1). Table 1 provides
summary statistics for the design of both capture arrays. The
target region for each array consists of a non-redundant set
of sequences that comprise the probe space. Figure S2
shows the distribution of designed probes across Inter-
val 377. The mapping of probes onto the whole genome
provided an estimate of the repetitiveness of each probe
from the Interval 377 capture array (Table S1). As expected,
all probes except one map to Interval 377. However, 9.9% of
probes (4067/41 555) were mapped to 1�3 additional loci
elsewhere in the genome due to the ancient allotetraploid
nature of the maize genome (Paterson et al., 2004), the high
frequency of transposon-mediated redistribution of genic
fragments, and the existence of nearly identical paralogs
(Schnable et al., 2009).
Regional and dispersed sequence capture (from B73)
To characterize RSSC, the Interval 377 capture array was
used to perform two independent captures of B73 genomic
DNA. DNA fragments eluted from the capture arrays were
sequenced using the 454 pyrosequencing technology, and
the resulting filtered reads were mapped to the B73 refer-
ence genome (B73 RefGen_v1; Experimental Procedures).
If the two captures from the B73 genotype are considered as
a pool, more than 97% of the filter-passing captured 454
Sequence capture in maize 899
ª 2010 The AuthorsJournal compilation ª 2010 Blackwell Publishing Ltd, The Plant Journal, (2010), 62, 898–909
sequence reads can be mapped uniquely to the reference
genome (Table S2). The two independent B73 captures
show similar proportions of on-target reads and base cov-
erage, median, as well as similar mean base-pair coverage
statistics (Figure S2 and Table S2). Consequently, all sub-
sequent analyses were performed after pooling reads from
the two independent B73 captures to more fully encompass
all sources of technical variation. Nearly 90% of the pooled
B73 reads that can be mapped to B73 RefGen_v1 map to a
single location, and are therefore non-repetitive sequences
(Figure S3). Given that 85–90% of the maize genome is
highly repetitive (Bennetzen et al., 2001; Martienssen et al.,
2004; Schnable et al., 2009), this finding demonstrates the
efficiency of the array-based repeat-subtraction procedure.
Reads are considered ‘on-target’ if they overlap the target
region. By comparing the percentage of on-target reads
(31%) with the proportion of the genome contained in
Interval 377, we calculate that RSSC for B73 achieved an
approximately 2600-fold enrichment of target sequences
(Table 2). This enrichment is illustrated by the dramatic dif-
ference in coverage achieved between targeted regions
within Interval 377 and flanking untargeted regions (Fig-
ure 2a). Approximately 98% (271 498 bp/277 305 bp) of all
bases in the target region were covered by at least one
sequence read, and approximately 97% of all bases in the
target region have greater than or equal to threefold cover-
age (Table 2). The threefold coverage value is significant
because that is the minimum coverage previously estab-
lished for SNP identification between inbred maize lines
(Barbazuk et al., 2007).
Mapped reads are highly clustered near probe locations,
suggesting highly efficient capture of probe sequences but
reduced capture of sequences >500 bp from probes, consis-
tent with a capture library consisting of fragments of
approximately 700 bp. B73 reads that could not be mapped
to Interval 377 (i.e. off-target reads) exhibit a seemingly
Figure 1. Workflow for the NimbleGen repeat subtraction-mediated seq-
uence capture experiments for the maize genome.
Target regions in the maize genome were selected and an array was designed
to represent the unique portions of each target (red, green and blue segments
in DNA). (1) A GS FLX-Titanium sequencing library was constructed for each
sample; this process places the sequences necessary for 454 sequencing into
the library before capture. The magnified section shows Titanium molecule
ends (approximately 4 h). (2) The sequencing library was amplified via the 454
adapter sequences (approximately 2 h). (3) The sample is hybridized to a
repeat subtraction array; design content focused on tiling the repeat content
from the sample genome (24–72 h). The repeat-containing library molecules
hybridize to the repeat array (black molecules and arrows), but the target
fragments do not (red, green and blue). (4) The repeat content array is
discarded. (5) The hybridization cocktail is recovered and placed onto the
capture array. (6) The array is hybridized (72 h) and washed (1 h including
elution time). (7) Captured target fragments are eluted from the array (red,
green and blue). (8) Target fragments are amplified via 454 adapters
(approximately 2 h). Because the 454 adapters are on the fragments, the
samples are simply diluted and directly sequenced (9) using the 454 GS FLX-
Titanium (24 h). A total of 8–16 samples can move from step 1 through to
sequencing in as little as 2 weeks of total time.
Sequence capture array6. Array washing
9. Sequencing
5. Recover cocktail & repeat hybridization
7. Target fragment elution
8. Amplification
3. Hybridization
4. Discard array Subtraction array
1. Build titanium sequencing library
2. Amplification
Targetregion A
Targetregion B
Targetregion C
900 Yan Fu et al.
ª 2010 The AuthorsJournal compilation ª 2010 Blackwell Publishing Ltd, The Plant Journal, (2010), 62, 898–909
random distribution across the genome, with two notable
exceptions. In the first exception, 91% of an approximately
53 kb interval of chromosome 1 exhibits ‡98% identity to
Interval 377. Consequently, 2032 probes perfectly match
both Interval 377 and this interval of chromosome 1. The
second exception involves an approximately 33 kb interval
of chromosome 8 that exhibits less sequence identity to
Interval 377, but even so 28 probes perfectly match both
intervals. In both of these cases, the existence of perfectly
matched probes resulted in paralog capture (Figure 3).
When the 43-Gene capture array was similarly used to
capture B73 sequences via RSSC, an approximately 2900-
fold enrichment was achieved (Table 1). Even though many
fewer reads were generated for this capture [approximately
16 k versus approximately 268 k], 91% of the targeted bases
were covered by at least one sequence read. The reduced
read number results in a lower percentage of greather than
or equal to threefold coverage (73% versus 97%) and a lower
mean coverage (6· versus 106·).
Capture of allelic sequences from Mo17
To determine the efficiency of capturing sequences from a
non-reference inbred using an array based on the B73 ref-
erence genome, RSSC was performed using both B73 arrays
Table 2 Summary statistics for maize capture data using two arrays and two genotypes
Genotype
Interval 377 array 43-Gene arraya
B73b Mo17 B73 Mo17
Number of filtered readsc 268 350 132 162 16 135 30 367Number of on-target readsd
(percentage of on-target reads)83 429 (31%) 29 226 (22%) 5612 (35%) 11 074 (36%)
Fold enrichmente Approximately 2600 Approximately 1800 Approximately 2900 Approximately 3000On-paralog readsf (percentageof on-paralog reads)
8939 (3.3%) 5157 (3.9%) NDg ND
Fold enrichment for paralogsh Approximately 1700 Approximately 2000 ND NDCoverage
Percentage target bases coveredby ‡1/‡3/‡10 capture reads
98/97/94 82/78/70 91/73/20 81/70/46
Mean coverage of target bases 106 38 6 12Mean coverage per 1000 on-target reads 1.3 1.3 1.1 1.1
aCalculations were based on combined data from all genes.bTwo B73 regional captures were combined for calculation.cReads remaining after removal of low-quality reads (Experimental Procedures).dReads mapping to a region overlapping with the target region (Figure S3).ePercentage of on-target reads/(length of target region/size of B73 reference [2.3 Gb]).fThe read mapped to a region overlapping the target paralog region (Figure 3).gNot determined.hPercentage of on-paralog reads/(length of target paralogous region/size of B73 reference genome [2.3 Gb]).
Table 1 Summary statistics for maize cap-ture array design Array design statistics Interval 377 arraya 43-Gene array
Total length (bp) 2 224 325 303 557Primary target space* after repeat masking (bp)b 666 488 No maskingLength of target region (bp)c 277 305 280 749Percentage of primary target space coveredby probesd
42% 92%
Length of target paralogous region (bp)c 45 434 Not determinedNumber of non-transposable elementprotein-encoding genes
40e 43
aUsing the B73_RefV1 sequence as the reference sequence (Experimental Procedures).bSee Figure S1 for detailed method.cTarget region consists of a non-redundant set of sequences used for probe synthesis.dLength of target region/length of primary target space.eBased on members of the ‘filtered gene set’ (Schnable et al., 2009) that overlapped with thetarget region.*Albert et al., 2007; Hodges et al., 2007.
Sequence capture in maize 901
ª 2010 The AuthorsJournal compilation ª 2010 Blackwell Publishing Ltd, The Plant Journal, (2010), 62, 898–909
(a)
(b)
(c)
Figure 2. RSSC of Interval 377.
(a) Many sequence reads map to Interval 377 (indicated by the black bar), but few sequence reads map to adjacent, non-target regions.
(b) Detailed view of Interval 377. The black bars in the top track indicate regions targeted by sequence capture probes. The next track (orange bars) provides CGH
data for probes within this interval taken from Springer et al., 2009. For each probe, the log2 of the ratio of Mo17/B73 hybridization signals (y axis) is provided.
Negative values indicate higher hybridization values for B73 than Mo17. The blue, red and purple tracks provide normalized coverage (y axis) for B73 sequence
captures (pool of two captures) and a Mo17 sequence capture, and their difference (M – B), respectively. The arrows highlight two examples with negative
log2(Mo17/B73) CGH values and normalized coverage difference (Mo17 – B73). The green track indicates locations of SNPs identified from sequence capture data.
(c) Close-up view of sequence coverage in a small region of the capture interval indicated by the black bar below the green track in (b). Tracks are as described for (b).
Figure 3. Capture of paralogs from chromosome 8.
A total of 28 probes from Interval 377 (shown in green) perfectly match an interval of chromosome 8. The y axis indicates the depth of coverage at each nucleotide
by captured sequences that uniquely align to this interval of chromosome 8.
902 Yan Fu et al.
ª 2010 The AuthorsJournal compilation ª 2010 Blackwell Publishing Ltd, The Plant Journal, (2010), 62, 898–909
with a capture library constructed from Mo17 genomic DNA.
Applying the same stringent alignment criteria used for B73,
we achieved approximately 1800-fold and approximately
3000-fold enrichment of Mo17 sequences that match the
targets from the two arrays, respectively (Table 2). For the
43-Gene capture array similar enrichments were achieved
for the B73 and Mo17 genotypes (approximately 2900-fold
and approximately 3000-fold, respectively). In contrast,
when using the Interval 377 capture array, less enrichment
was obtained for Mo17 than was achieved for B73 (approx-
imately 1800-fold versus approximately 2600-fold, respec-
tively). We hypothesized that this could be a consequence
of polymorphisms between B73 and Mo17 within Inter-
val 377. Polymorphisms could reduce the capture of Mo17
sequences by probes designed based on B73 sequences,
and/or cause difficulties in properly mapping Mo17 reads to
the B73 reference sequence (Figure S4).
Maize sequence capture and comparative genomic
hybridization
The hypothesis that polymorphisms are responsible for the
reduced fold enrichment achieved from Mo17 is supported
by data from independent maize comparative genomic
hybridization (CGH) experiments performed with a 2.1 mil-
lion oligonucleotide microarray (Springer et al., 2009) and
the 43-Gene capture array (this study). Consistent with pre-
vious studies, our CGH data indicate that there is extensive
sequence and structural variation between these two maize
haplotypes. Our whole-genome array contains 2072 probes
from within Interval 377. Regions of Interval 377 that had
substantial B73 capture but low or no coverage by Mo17
capture reads typically exhibited negative log2 ratios of
Mo17/B73 hybridization signals in our whole-genome CGH
experiment (Figure 2b). This relationship was also observed
(a)
(b)
Figure 4. Successful capture of genic regions using the 43-Gene array.
(a) The zmet2 gene, including flanking regions, is one of 43 targeted for capture using this array. The position of an approximately 4.9 kb retrotransposon insertion
(red triangle) in the Mo17 allele is indicated by a triangle. High-density CGH data (orange data track) provide information on sequence variation between the B73 and
Mo17 alleles. Extreme log2(Mo17/B73) values (y axis) indicate high levels of SNPs, InDel polymorphismss or presence/absence variants. Normalized coverage
(y axis) by B73 (blue) and Mo17 (red) sequence reads is shown.
(b) Successful recovery of novel Mo17 allelic sequences. The VISTA identity plot (pink) was presented using the zmet2 Mo17 allele sequence as the reference
sequence (bottom). Mo17 capture reads were mapped to both the B73 reference sequence and the known sequence of the Mo17 allele. Reads that could be aligned
to both alleles are shown in green. Reads that could be aligned only to the Mo17 allele are shown in red. Reads that align to only Mo17 span the junction of the
Mo17-specific insertion and over-lie a highly polymorphic region. By de novo assembling Mo17 sequence reads into contigs (shown in orange) prior to alignment
with the B73 reference allele, it was possible to recover Mo17 sequences that are highly divergent from the B73 reference allele.
Sequence capture in maize 903
ª 2010 The AuthorsJournal compilation ª 2010 Blackwell Publishing Ltd, The Plant Journal, (2010), 62, 898–909
in the 43-Gene capture experiments (Figure 4a). As expected,
regions with equivalent coverage typically exhibited CGH
log2 ratios close to zero. We hypothesized that the reason
that the fold enrichments observed for the B73 and Mo17
captures from the 43-Gene capture array were similar is that
the well-characterized genes on this array generally exhibit a
higher degree of conservation between the two genotypes
than do the predicted genes located in Interval 377. This
hypothesis is supported by the finding that 15% of the CGH
probes in Interval 377 (319/2072) exhibit greater than or
equal to twofold variation in hybridization signals (reflecting
significant structural variation between B73 and Mo17),
whereas only approximately 6% of CGH probes designed for
the 43-Gene array do so (980/16 406; Table S3).
For both Interval 377 and the 43-Gene capture arrays, we
noted that the sequence capture provided coverage for a
larger proportion of the target bases in B73 than in Mo17. We
hypothesized that Mo17 regions without coverage may be
caused by our inability to align captured Mo17 sequence
reads with allelic B73 sequences due to high levels of DNA
sequence polymorphism. To test this hypothesis, we aligned
all Mo17 reads captured from the 43-Gene array to existing
sequences of B73 and Mo17 alleles of four genes from this
array. It was possible to align 2010 of the reads captured
from Mo17 to the sequences of B73 alleles. Interestingly, 223
reads that could not be mapped to the B73 alleles of these
genes could be mapped to the sequences of the Mo17 alleles
of these genes. This finding demonstrates that some Mo17
sequences had been captured but had not been detected as
being on target because they did not align to B73. Pre-
assembly of the Mo17 reads into contigs, followed by
mapping of the contigs onto the B73 reference, allowed
identification of nearly 90% of these reads (197/223). These
newly rescued Mo17 reads cover regions of the Mo17
haplotype that are poorly conserved relative to, or even
absent from, B73. Figure 4(b) depicts this analysis for one of
the four genes.
SNP prediction and validation
An important application of sequence capture is to develop
SNP-based markers within targeted genomic regions by
using captured sequences from non-reference genotypes.
The ability to use RSSC-derived data to identify SNPs within
targeted regions was tested by aligning the captured Mo17
reads from the two arrays that uniquely map to the target
regions (‘on-target reads’; Table 2) to the B73 reference
genome. Potential SNP sites were required to be covered by
a minimum of three Mo17 reads. Because Mo17 is homo-
zygous at each locus, Mo17 base calls at the polymorphic
site were expected to be identical. Hence, only those poly-
morphic sites that were mono-allelic within all Mo17 reads
were designated ‘high-confidence SNPs’. SNP sites that had
more than one base call within the aligned reads of a single
genotype were assumed to result from the inadvertent
alignment of paralogous sequences; such SNPs were
designated ‘lower-confidence SNPs’. The alignments of
Mo17 reads to the B73 reference genome were used to
predict 1357 and 1221 high-confidence SNPs from the
Interval 377 and 43-Gene arrays, respectively (Experimental
Procedures and Table 3). Rates of false-positive SNP
predictions were estimated via comparison with known
SNPs that had been detected by alignment of existing partial
sequences of the Mo17 alleles for four of the genes present
on the 43-Gene array. We predicted a total of 212 SNPs,
including 151 high-confidence and 61 lower-confidence
SNPs, within the corresponding regions of these four genes.
All of the 151 high-confidence SNPs and 56 of the lower-
confidence SNPs were confirmed via comparisons to our
previously known control sequences. Based on this analysis,
the rate of false-positive SNP prediction is extremely low
(<3%).
It was possible to identify ‘on-target reads’ for use in the
analysis described above because we had access to the B73
reference genome sequence. This would not be possible in a
Table 3 SNP prediction using reads cap-tured from B73 and Mo17
Input dataaNumber ofSNPs
Number ofhigh-quality SNPsb
Number of genesc withhigh-quality SNPs
Interval 377 B73 (all) 8531 98 2B73 (target) 23 5 1Mo17 (all) 8044 1693 35Mo17 (target) 1649 1357 34
43-Gene set B73 (all) 170 31 11B73 (target) 144 30 11Mo17 (all) 2249 1240 40Mo17 (target) 1790 1221 39
aTwo sets of B73 and Mo17-derived sequence reads were used for SNP prediction: all filteredreads (‘all’) and only on-target reads (‘target’).bHigh-quality SNPs are those that are mono-allelic for all aligning reads. In addition, SNPsidentified within repetitive DNA regions of Interval 377 were removed (Experimental Proce-dures).cThere are 40 and 43 genes represented on the Interval 377 and 43-Gene arrays, respectively(Table 1).
904 Yan Fu et al.
ª 2010 The AuthorsJournal compilation ª 2010 Blackwell Publishing Ltd, The Plant Journal, (2010), 62, 898–909
species that lacks a reference genome sequence. To test
whether SNPs could be successfully predicted without
access to a reference genome sequence, we performed a
second experiment in which we aligned all Mo17 reads
captured from the two arrays to their respective B73 capture
intervals. This experiment yielded 1693 and 1240 high-
confidence SNPs from the two arrays (Table 3), representing
25% and 2% increases in the numbers of SNPs predicted,
compared to using genome-directed ‘on-target reads’.
To test the hypothesis that the inclusion of non-target
paralogous sequences in the SNP discovery pipeline is
responsible for the increased numbers of high-confidence
SNPs predicted in this second experiment, we performed a
SNP discovery experiment using B73 sequences captured by
the Interval 377 array to predict ‘SNPs’ relative to the B73
reference genome. We have previously shown that there is
little residual heterozygosity in B73 (Emrich et al., 2007).
Hence, the rate at which we identify ‘SNPs’ when using
captured B73 reads is a measure of the number of putative
SNPs that are false-positive due to sequencing errors or the
inadvertent identification of ‘paramorphisms’ as SNPs.
Paramorphisms are sequence variants between highly sim-
ilar paralogs (Fu et al., 2004).
Alignment of only on-target B73 reads to Interval 377 of
the B73 reference genome yielded five such high-confidence
‘SNPs’ (Table 3). Because few paralogous sequences are
expected among the on-target reads, most of these false-
positive SNPs are probably due to sequencing errors in
either the captured reads or the reference genome. Exam-
ination of the alignments of B73 and Mo17 captured
sequences in the regions of the five potential false-positive
polymorphic sites indicated that two are the result of
sequence errors in the reference genome and one is
probably caused by capture of paralogous sequences, while
the causes of the remaining two could not be determined
because Mo17 reads were not available for these sites. The
low rate of false-positives caused by sequencing errors
reflects the high stringency of our SNP prediction pipeline.
Alignment of all B73 reads to Interval 377 of the B73
reference genome yielded 98 high-confidence ‘SNPs’, which
probably includes false positives due to both sequencing
errors (ours and reference) and paramorphisms. Overall, the
low rate of false-positive SNP calls caused by sequencing
errors led us to conclude that <6% (98/1693) of the high-
confidence SNPs generated in the absence of paralog
removal represent false positives due to paramorphisms.
DISCUSSION
Repeat subtraction-mediated sequence capture (RSSC)
Over the past two decades, several approaches to achieve a
reduction in genomic complexity have been attempted,
including EST sequencing, methyl filtration, and high-Cot
DNA selection (Barbazuk et al., 2005). Each of these
approaches has been successful in reducing genome com-
plexity, but none delivers sequences of interest in a targeted
fashion as is possible with hybridization-based sequence
capture.
In initial experiments in which we utilized Cot1 DNA as a
blocker, we found that maize Cot1 DNA improved the
performance of sequence capture compared with human
Cot1 DNA (data not shown). Extending this idea suggests
that adapting sequence capture technology for the many
crop genomes would require the production of species-
specific blocking agents for each of the many important
crops. Published maize Cot1 production protocols have only
approximately 10% yield, meaning that scaling of produc-
tion is prohibitive from the perspective of genomic DNA
consumption (Zwick et al., 1997). Furthermore, in our hands,
16 of 20 independent attempts at using the previously
published Cot1-based protocol yielded fold enrichments that
were at least an order of magnitude below those achieved
in the current study (P.S.S., Y.F., N.M.S., W.B.B. and J.A.J.,
unpublished results). We therefore investigated the use of a
two-stage microarray sequence capture method that might
yield samples with consistently reduced complexity.
A repeat-subtraction microarray was designed to remove
DNA fragments that contain highly repetitive sequences. A
similar approach has been used to improve hybridization
performance (Newkirk et al., 2005). To date, seven of nine
maize RSSC sample attempts have been successful in
providing >1000-fold enrichment in each sample. The two
‘failures’ have been traced to a hybridization reagent issue
within one experiment (D.J.G. and J.A.J., unpublished
results).
Approximately one-third of captured reads are ‘on target’.
Although this is sufficient to make this technology very
attractive for practical applications, it must be asked why
two-thirds of the reads are ‘off target’. Interestingly, in the
capture experiments for the human genome, probe sets in
the range of 200–500 kb resulted in off-target rates that are
in the same range as observed here (Albert et al., 2007). We
therefore hypothesize that this off-target rate is probably a
consequence of the small design space on our array. When
only a small amount of library is hybridized to the array (so
as to not overwhelm the repeat subtraction), there are only
limited numbers of copies of the target in the sample.
Increasing the design space might result in higher ‘on-
target’ rates. Recent results from maize using a larger
capture target design space support the correlation between
design space and specificity that was first observed in the
human genome (J.A.J., T.J.A. and D.J.G., unpublished
results). Larger designs (approximately fivefold) exhibited
an approximately twofold better on-target read rate in two
independent tests (data not shown). Other potential
approaches to increase the rate of on-target reads include
making the numbers of various types of probes on the repeat
subtraction array proportional to their copy number in the
Sequence capture in maize 905
ª 2010 The AuthorsJournal compilation ª 2010 Blackwell Publishing Ltd, The Plant Journal, (2010), 62, 898–909
genome and reducing fragment sizes in the capture library
(thereby reducing the potential for secondary capture).
Use of sequence capture to identify allelic variation
The two applications of sequence capture described here
highlight the potential uses of this technology. Sequence
capture of chromosomal regions, such as Interval 377,
which contain a target gene, mutation or QTL can provide
two important outcomes. First, the targeted resequencing
identifies polymorphisms such as SNPs that can be con-
verted into high-density genetic markers. Not much
sequencing was required to obtain a large number of SNPs:
a 1/16 region 454 PicoTiterPlate run was well paired, from a
coverage perspective, with our approximately 300 kb cap-
ture interval within this homozygous genome. Approxi-
mately 1600 SNPs were identified that can be used to map
the gene or causative QTL to high resolution. Second, the
targeted resequencing concomitantly provides a set of
potentially causative polymorphisms.
Another application of sequence capture is the isolation
and characterization of novel alleles from a non-reference
genome. Maize is a highly polymorphic species with many
SNPs, InDel polymorphisms and presence/absence variants
(Springer et al., 2009). We demonstrated that characteriza-
tion of novel alleles is greatly facilitated by the combination
of CGH and sequence capture. CGH data provide information
on the relative conservation of the target genome and the
reference genome. Regions of the genome that had lower
hybridization to the target genome in CGH experiments often
had much lower Mo17 coverage. By complementing
sequence capture with CGH, it is possible to rapidly identify
conserved and non-conserved regions and to focus novel
allele characterization efforts on highly variable regions.
Although mapping Mo17 reads to the B73 reference
sequence provided coverage of many regions, thereby
allowing us to identify SNPs, a number of regions lacked
Mo17 read coverage. It is likely that a subset of the examples
of regions with missing Mo17 sequence may reflect pres-
ence/absence variants (i.e. B73-derived sequences that are
simply absent from the Mo17 genome). The remaining
regions with missing Mo17 sequence represent either
reduced capture of Mo17 sequence due to polymorphisms
or inability to align the sequences to the reference B73
sequence due to polymorphisms. Using comparisons to the
known Mo17 sequences of several genomic regions, we
found that the Mo17 sequences had been effectively cap-
tured, but that SNPs and InDel polymorphisms limited our
ability to map these captured Mo17 sequence reads to the
appropriate location on the B73 reference genome. It was
possible to recover some of these sequence reads by first
performing an assembly of all captured reads and then
mapping the longer Mo17 assemblies to the B73 genome
(Figure 4b). Importantly, assembly prior to alignment
allowed us to recover novel allelic sequences from a
non-reference haplotype that were not targeted by the
capture array. Even though the number of sequence reads
rescued in this manner is not large, such reads are valuable
because they can be used to construct and extend the
sequence of a captured haplotype and are therefore useful
for identifying insertion/deletion polymorphisms, and
may be useful for iterative capture-mediated chromosome
walking.
RSSC and paralogs
Maize arose from an allotetraploidization event in the past
5–10 million years (Paterson et al., 2004), and has retained an
extensive degree of gene duplication. Processes such as
transposon capture of gene fragments (Schnable et al., 2009)
have provided additional paralog complexity. Consequently,
approximately 10% of our capture probes had more than one
identical match in the maize genome, potentially making
them eligible to capture paralogs. As expected, these probes
were equally capable of capturing the target sequence and
the paralogous sequence. Paralogous reads were recovered
at a frequency consistent with their probe representation
frequency [e.g. 10.7% (8939/83 429) for the B73 captures
from the Interval 377 array; Table 2]. The degree to which
paralog capture complicates SNP discovery depends on the
structure of the genome being analyzed, but we were
encouraged to discover that, even in maize, very few of the
mono-allelic putative SNPs appear to be false positives.
Broader applicability of RSSC
We have reported a protocol implementation that allowed us
to achieve 1800–3000-fold enrichment of both a defined
chromosomal interval and a set of dispersed genes. This
enrichment is comparable to that achieved for the human
genome (Albert et al., 2007). For both captures, 80–98% of
targeted bases were covered by captured sequences. The
mean coverage of the target regions per 1000 on-target
reads was similar for captures from the two different arrays
(1.3 versus 1.1), highlighting the overall robustness of the
approach. Therefore, the RSSC protocol provides a method
to resequence targeted genomic regions of the maize gen-
ome, and is expected to exhibit similar levels of performance
in other genomes. The ability to design reagents required for
repeat subtraction in silico significantly reduces the techni-
cal hurdles involved in applying sequence capture across
diverse species. Because highly repetitive elements can be
discovered using only limited amounts of whole-genome
shotgun sequencing data, it should be possible to design
species-specific repeat-subtraction arrays with limited
investment of resources in combination with next-genera-
tion sequencing technologies. Hence, it will be possible to
apply RSSC not only to species with sequenced reference
genomes, but also to those whose genomes have not yet
been sequenced. Importantly, we have established that
polymorphism analyses performed in the absence of a fully
906 Yan Fu et al.
ª 2010 The AuthorsJournal compilation ª 2010 Blackwell Publishing Ltd, The Plant Journal, (2010), 62, 898–909
sequenced reference genome are not substantially cum-
bersome. We therefore foresee application of this technol-
ogy for studies of population genetics, cloning of loci
controlling quantitative variation, and allele mining in crops,
model organisms, and, importantly, non-model species.
EXPERIMENTAL PROCEDURES
Repeat array design
A customized NimbleGen 3 x 720 K sequence capture microarray(081110_Zea_mays_repeats_cap) was synthesized three times perslide to contain maize repetitive elements in the MAGI Cereal RepeatDatabase (version 3.1; http://magi.plantgenomics.iastate.edu/repeatdb.html) and the Maize Repeat Database (version 4; http://maize.jcvi.org/repeat_db.shtml). The design may be ordered byrequest. There are 2.1 M total probes on the array, although only thecenter sub-array containing 720 K probes was utilized in this study.The median probe length is 74 bp.
Maize NimbleGen sequence capture array design
A large genomic region on a BAC fingerprint contig (FPC Ctg138,chromosome 3) was originally selected for targeting. Based on thephysical map released prior to 29 May 2008, a total of 70 sequencedBACs are within this FPC contig, and their sequences were down-loaded from GenBank on 29 May 2008. The physical map has beenupdated to the latest release (maize Golden Path AGP version 1,release 4a.53). Details regarding sequence annotation and geneprediction are shown in Figure S1. A total of approximately 1.5 Mb,comprising 44 unordered sequence fragments with 83 non-redun-dant predicted non-repetitive genes, were soft-masked for probedesign. The uniqueness/repetitiveness of all the probes and physi-cal locations of the probes were determined based on the collectionof maize BAC sequences available in March 2008. The array designwas constructed by tiling at approximately 5 bp spacing across thetarget regions. Probes with a mean 15-mer frequency in the genomegreater than 100 were excluded, as were probes that had more thanfive close matches in the genome. A close match is a match to thegenome that is at least 38 bp long, allowing up to five insertions/deletions/mismatches. When the probes are shorter than 50 bp, weuse the length of the probe (12 bp, the seed length) as the minimummatch size. A total of 41 555 probes were selected, and replicated atleast 17 times on the array. To reconcile with the reference genomesequence, probes were remapped to B73 RefGen_v1 (Schnableet al., 2009). The final sequence interval was defined from the 1 kbupstream of themost-left mappedprobe (REGION0042FS000010140)to the 1 kb downstream of the most-right mapped probe(REGION0028FS000002032), i.e. 183 062 553–185 609 824 bp onchromosome 3. Two fragments (183 315 664–183 553 126 bp and183 880 178–183 965 661 bp) were excluded from analysesbecause they were not present in the sequences used for probedesign. This design is used to generate a customized NimbleGen 3 x720 K sequence capture array. Only the center sub-array wasutilized for this study; this array may be ordered from RocheNimbleGen by requesting 081028_Zea_mays_schnable_cap.
The second NimbleGen 3 x 720 K sequence capture array designwas constructed by tiling at approximately 15 bp spacing across 43dispersed gene targets. Probes with a mean 13-mer frequency in thegenome greater than 500 were excluded, as were probes that hadmore than seven close matches in the genome. A total of 16 406probes were selected and replicated 44 times on the array. Thisarray comprises approximately 350 kbp of genomic space, but hasonly 123 kbp represented within the probes. Again, only the centersub-array was utilized for this study. This design may be ordered by
requesting 080328_maize_cap_springer_1. These probes are ofvarious lengths and the median probe length for both designs is76 bp.
Maize sequence capture and 454 sequencing
DNA was isolated from 14-day-old seedlings of two maize inbreds,B73 and Mo17, using a previously described protocol (Li et al.,2007). A 700 bp mean insert size 454 GS FLX-Titanium sequencinglibrary (454 Life Sciences, http://www.454.com) was generated foreach inbred and subjected to eight cycles of amplification usingprimers based upon the sequencing adapters. Amplicons werepurified using a QIAquick/MinElute spin column (Qiagen, http://www.qiagen.com/). The DNA concentration was determined usingNanoDrop ND1000 (Thermo Scientific, http://www.thermo.com)and the molecular weight range was determined using an AgilentBioanalyzer 2100 with a DNA7500 kit (Agilent Technologies, http://www.agilent.com). We progressively decreased the total amount oflibrary used per hybridization across the study. The Interval 377captures used 500 ng, and the 43-Gene captures utilized either 250or 150 ng for the repeat-subtraction hybridization. The indicatedmass of double-stranded sequencing library was hybridized to themaize repeat subtraction at low stringency (37�C) using the Mai Taisystem (SciGene, http://www.scigene.com) with NimbleGenhybridization solution supplemented with Tween-20 at 0.1% v/v,together with a 100-fold molar excess of non-extendable primerscomplementary to the sequencing adapters. The rotation speed inthe SciGene hybridization oven was set to 15. The hybridizationcocktail was recovered by separating the two slides with the gasketarray on the bottom (facing up) and the subtraction array on the top(facing down). The remaining hybridization cocktail, containing thelibrary fragments of interest (still on the gasket slide), was subjectedto a second capture array aimed at the gene space of interest. Thecapture array was placed with the probe side facing down onto thehybridization cocktail on the gasket slide. The gasket slide remainedin the Mai-Tai rig during placement. The capture array was thensubjected to an additional 4 days of hybridization at 42.5�C with therotator set to 15. The capture array was washed as previouslydescribed (Albert et al., 2007) and eluted using a sodium hydroxidemethod that is available from Roche NimbleGen Technical Supporton request. The eluted molecules were amplified via the sequencingadapters (14 cycles), and the products were purified and quantified.The double-stranded eluted libraries were diluted for emulsion PCR(emPCR) as recommended by 454 Life Sciences, and sequencedusing the 454 Life Sciences GS FLX-Titanium protocol according tothe manufacturer’s instructions using a 4- or 16-region TitaniumPicoTiterPlate. Prior to emPCR, the diluted double-stranded eluatelibraries were heat-treated at 95�C for 2 min in a thermal cycler. Thisheating step was found to be essential to avoid amplification-associated artifacts in the emPCR. The raw 454 capture reads(deposited to the GenBank Short Read Archive with accessionnumber SRA009261.9) with low quality (parameters: maximummean error = 0.01, maximum error at ends = 0.01), and short 454reads (<200 bp) were removed using Lucy (Chou and Holmes, 2001).This cut-off was selected because few of the 454 reads of <200 bpcould be mapped to the B73 reference genome (K.Y., Y.F. and P.S.S.,unpublished results).
Data analyses
To estimate on-target rates, all filtered B73 and Mo17 captured 454reads were aligned to the B73 reference genome sequence, i.e.B73_RefGen_v1 (Schnable et al., 2009) (BLAST alignment criteria:95% similarity and total unaligned regions of both 5¢ and 3¢ ends of454 reads £15 bp). Sequence reads whose best match overlapped a
Sequence capture in maize 907
ª 2010 The AuthorsJournal compilation ª 2010 Blackwell Publishing Ltd, The Plant Journal, (2010), 62, 898–909
target region were classified as on-target. The target paralog regionis defined as a non-redundant set of sequences of those probes thatcan be mapped both inside and outside Interval 377. Sequencereads with a best match that overlaps the target paralog region areconsidered as ‘on-paralog’ reads. Whole-genome CGH data wereretrieved from the NCBI GEO database (GSE16938) (Springer et al.,2009). Only CGH probes within targeted regions were used to cal-culate normalized coverage. GFF files were generated for datavisualization using NimbleScan (version 2.4, NimbleGen). Shell andAWK scripts for the analysis pipeline are available upon request.Additional CGH data for the 43-Gene array are given in Table S3.Sequence alignments between B73 and Mo17 allelic sequenceswere performed using VISTA (LAGAN alignment program used withdefault settings) (Frazer et al., 2004). CAP3 (Huang and Madan,1999) was used for assembling Mo17 reads from the 43-Gene array(parameters used: overlap percentage identity ‡95, overlap length‡50 bp).
Comparative genomic hybridization (CGH)
CGH was performed using the 43-Gene capture array in place of aCGH design within a standard Roche NimbleGen CGH workflow forNimbleGen human CGH with 385 K arrays. Two arrays were utilizedin a B73 versus Mo17 dye swap. Labeling, hybridization, washing,scanning and analytical conditions were as previously reported(Springer et al., 2009).
SNP discovery
SNP discovery was performed either using all filtered 454 reads orthe subset of on-target 454 reads defined above. The 454 reads werealigned to the reference sequences (either the chromosome 3Interval 377 or the 43-Gene set) using MosaikAligner (Hillier et al.,2008) with the following parameters: -a (alignment algorithm), all;)p (CPUs used), 8; )mmp (maximum percentage of read length tobe mismatched), 0.05; –minp (minimum percentage of the readlength aligned), 0.95; –mmal (aligned read length rather than theoriginal read length when counting errors); )m (alignment mode),unique; )hs (hash size), 15; )mhp (maximum number of positionsto use), 100. These alignment parameters ensured that each 454sequence read was uniquely aligned; sequences that failed to meetthese criteria were discarded from the analysis. SNPs were identi-fied within the alignments using the GIGABAYES package (http://bioinformatics.bc.edu/marthlab). Arguments to GIGABAYES were: –D(pairwise nucleotide diversity), 0.003; –ploidy (sample ploidy),haploid; –algorithm, recursive; –sample (sequence source), single;–anchor; –CAL (minimum overall allele coverage), 3; –QRL (mini-mum base quality value), 20. Potential SNP sites were required to becovered by a minimum of three Mo17 reads, and all Mo17 base callsat the polymorphic site were expected to be identical. SNP sites thathad more than one allele within the aligned reads were assumed toresult from alignment of paralog sequences. In addition, potentialhigh-confidence SNP sites within the Interval 377 region were re-quired to be from non-repetitive regions. The false SNP discoveryrate was determined by identifying potential SNPs from B73 cap-tured reads.
ACKNOWLEDGEMENTS
We thank James Birchler (Division of Biological Sciences, Universityof Missouri) for providing repeat clones for early developmentwork, John Luckey, Jason Norton and Paul Marrione for supportwith Mai-tai hybridization optimization and platform development,Rudi Seibl, Rebecca Selzer and Courtney Erickson for research anddevelopment support, and the Maize Genome Sequencing Project(NSF DBI-0527192) for sharing genome sequences and annotation
prior to publication. This project was supported in part by funding toP.S.S. from the Iowa State University Plant Sciences Institute,funding to N.M.S. from the University of Minnesota, and a grantfrom the National Science Foundation Plant Genome Program (DBI-0501758) and funding from University of Florida to W.B.B. TheRoche NimbleGen research and development group is privatelyfunded.
SUPPORTING INFORMATION
Additional Supporting Information may be found in the onlineversion of this article:Figure S1. Workflow for preparing BAC sequences within FPCCtg138 for design of probes for the Interval 377 array.Figure S2. Evaluation of the coverage and reproducibility ofsequence capture from Interval 377.Figure S3. Mapping of capture reads to Interval 377.Figure S4. Comparison of CGH data and sequence capture effi-ciency.Table S1. Re-mapping of probes designed for Interval 377 to B73RefGen_v1.Table S2. Summary statistics for two B73 captures of Interval 377.Table S3. Gene array CGH data.Please note: As a service to our authors and readers, this journalprovides supporting information supplied by the authors. Suchmaterials are peer-reviewed and may be re-organized for onlinedelivery, but are not copy-edited or typeset. Technical supportissues arising from supporting information (other than missingfiles) should be addressed to the authors.
REFERENCES
Albert, T.J., Molla, M.N., Muzny, D.M. et al. (2007) Direct selection of human
genomic loci by microarray hybridization. Nat. Methods, 4, 903–905.
Barbazuk, W.B., Bedell, J.A. and Rabinowicz, P.D. (2005) Reduced represen-
tation sequencing: a success in maize and a promise for other plant
genomes. Bioessays, 27, 839–848.
Barbazuk, W.B., Emrich, S.J., Chen, H.D., Li, L. and Schnable, P.S. (2007) SNP
discovery via 454 transcriptome sequencing. Plant J. 51, 910–918.
Bashiardes, S., Veile, R., Helms, C., Mardis, E.R., Bowcock, A.M. and Lovett,
M. (2005) Direct genomic selection. Nat. Methods, 2, 63–69.
Bennetzen, J.L. (2005) Transposable elements, gene creation and genome
rearrangement in flowering plants. Curr. Opin. Genet. Dev. 15, 621–627.
Bennetzen, J.L., Chandler, V.L. and Schnable, P. (2001) National Science
Foundation-sponsored workshop report. Maize genome sequencing pro-
ject. Plant Physiol. 127, 1572–1578.
Chou, H.H. and Holmes, M.H. (2001) DNA sequence quality trimming and
vector removal. Bioinformatics, 17, 1093–1104.
D’Ascenzo, M., Meacham, C., Kitzman, J. et al. (2009) Mutation discovery in
the mouse using genetically guided array capture and re-sequencing.
Mamm. Genome, 20, 424–436.
Emrich, S.J., Li, L., Wen, T.J., Yandeau-Nelson, M.D., Fu, Y., Guo, L., Chou,
H.H., Aluru, S., Ashlock, D.A. and Schnable, P.S. (2007) Nearly identical
paralogs: implications for maize (Zea mays L.) genome evolution. Genetics,
175, 429–439.
Feuillet, C., Langridge, P. and Waugh, R. (2008) Cereal breeding takes a walk
on the wild side. Trends Genet. 24, 24–32.
Frazer, K.A., Pachter, L., Poliakov, A., Rubin, E.M. and Dubchak, I. (2004)
VISTA: computational tools for comparative genomics. Nucleic Acids Res.
32, W273–W279.
Fu, Y., Hsia, A.P., Guo, L. and Schnable, P.S. (2004) Types and frequencies of
sequencing errors in methyl-filtered and high C0t maize genome survey
sequences. Plant Physiol. 135, 2040–2045.
Herman, D.S., Hovingh, G.K., Iartchouk, O., Rehm, H.L., Kucherlapati, R.,
Seidman, J.G. and Seidman, C.E. (2009) Filter-based hybridization capture
of subgenomes enables resequencing and copy-number detection. Nat.
Methods, 6, 507–510.
Hillier, L.W., Marth, G.T., Quinlan, A.R. et al. (2008) Whole-genome
sequencing and variant discovery in C. elegans. Nat. Methods, 5, 183–188.
908 Yan Fu et al.
ª 2010 The AuthorsJournal compilation ª 2010 Blackwell Publishing Ltd, The Plant Journal, (2010), 62, 898–909
Hodges, E., Xuan, Z., Balija, V. et al. (2007) Genome-wide in situ exon capture
for selective resequencing. Nat. Genet., 39, 1522–1527.
Huang, X. and Madan, A. (1999) CAP3: a DNA sequence assembly program.
Genome Res. 9, 868–877.
Li, J., Harper, L.C., Golubovskaya, I., Wang, C.R., Weber, D., Meeley, R.B.,
McElver, J., Bowen, B., Cande, W.Z. and Schnable, P.S. (2007) Functional
analysis of maize RAD51 in meiosis and double-strand break repair.
Genetics, 176, 1469–1482.
Martienssen, R.A., Rabinowicz, P.D., O’Shaughnessy, A. and McCombie, W.R.
(2004) Sequencing the maize genome. Curr. Opin. Plant Biol. 7, 102–107.
Morse, A.M., Peterson, D.G., Islam-Faridi, M.N. et al. (2009) Evolution of
genome size and complexity in Pinus. PLoS ONE, 4, e4332.
Newkirk, H.L., Knoll, J.H. and Rogan, P.K. (2005) Distortion of quantitative
genomic and expression hybridization by Cot-1 DNA: mitigation of this
effect. Nucleic Acids Res. 33, e191.
Okou, D.T., Steinberg, K.M., Middle, C., Cutler, D.J., Albert, T.J. and Zwick,
M.E. (2007) Microarray-based genomic selection for high-throughput
resequencing. Nat. Methods, 4, 907–909.
Paterson, A.H., Bowers, J.E. and Chapman, B.A. (2004) Ancient polyploidi-
zation predating divergence of the cereals, and its consequences for
comparative genomics. Proc. Natl Acad. Sci. USA, 101, 9903–9908.
Porreca, G.J., Zhang, K., Li, J.B. et al. (2007) Multiplex amplification of large
sets of human exons. Nat. Methods, 4, 931–936.
Schnable, P.S., Ware, D., Fulton, R.S. et al. (2009) The B73 maize genome:
complexity, diversity, and dynamics. Science, 326, 1112–1115.
Springer, N.M., Ying, K., Fu, Y. et al. (2009) Maize inbreds exhibit high levels
of copy number variation (CNV) and presence/absence variation (PAV) in
genome content. PLoS Genet. 5, e1000734.
Strachan, T. and Read, A.P. (1999) Human Molecular Genetics. New York:
John Wiley & Sons Inc.
Zwick, M.S., Hanson, R.E., Islam-Faridi, M.N., Stelly, D.M., Wing, R.A., Price,
H.J. and McKnight, T.D. (1997) A rapid procedure for the isolation of C0t-1
DNA from plants. Genome, 40, 138–142.
Sequence capture in maize 909
ª 2010 The AuthorsJournal compilation ª 2010 Blackwell Publishing Ltd, The Plant Journal, (2010), 62, 898–909