Genome-Tagged Amplification (GTA): a PCR-based method to prepare sample-tagged amplicons from hundreds of individuals for next generation sequencing

Thien Ho • Linda Cardle • Xin Xu • Micha Bayer • K. Silvas Jebakumar Prince •

Raymond N. Mutava • David F. Marshall • Naeem Syed

Received: 3 February 2014 / Accepted: 2 April 2014

© Springer Science+Business Media Dordrecht 2014

Abstract Sampling the sequence of a relatively

small fraction of the genome in large numbers of

individuals is an important objective for population

genetics and association genetics approaches. How-

ever, currently available ‘sequence capture’ methods

either require expensive instrumentation or have

problems dealing with high sample numbers and

relatively small target sizes. We have developed

Genome-Tagged Amplification (GTA) as a flexible

PCR-based method for preparing pools of hundreds of

amplicons from hundreds of samples for next gener-

ation sequencing. The method involves tagging of

genomic DNA with barcode adapters at restriction

sites, followed by PCR amplification from flanking

DNA. It is freely scalable for both sample number and

amplicon number and has no specialized equipment

requirement. An optimized protocol is presented

which provides a matrix of 96 × 192 combinations

of samples × amplicons, corresponding to a complete

454 Titanium run. Initially, we used 454 sequencing;

however, GTA could easily be adapted to Illumina

sequencing platforms as read lengths have signifi-

cantly increased in this system.

Keywords Next generation sequencing · Barley · PCR · Pooling

Introduction

Next generation sequencing (NGS) technologies (Bent-

ley 2006; Choi et al. 2009; Margulies et al. 2005) have

transformed sequence-based genetic and genomic ana-

lysis of human, animal and plant species. The capacity

Electronic supplementary material The online version of this article (doi:10.1007/s11032-014-0090-7) contains supplementary material, which is available to authorized users.

T. Ho · X. Xu

Division of Plant Sciences, University of Dundee at The

James Hutton Institute, Invergowrie, Dundee DD2 5DA,

Scotland, UK

Present Address:

T. Ho

Department of Plant Pathology, University of Arkansas,

495N Campus Dr, Fayetteville, AR 72701, USA

L. Cardle · M. Bayer · D. F. Marshall

Information and Computational Sciences Group, The

James Hutton Institute, Invergowrie, Dundee DD2 5DA,

Scotland, UK

K. S. J. Prince · R. N. Mutava

National Centre for Soybean Biotechnology, Division

of Plant Sciences, University of Missouri, Columbia,

MO 65211, USA

N. Syed (✉)

Department of Geographical and Life Sciences,

Canterbury Christ Church University,

Canterbury CT1 1QU, UK

e-mail: [email protected]

Mol Breeding

DOI 10.1007/s11032-014-0090-7

of these enabling technology platforms to generate

sequences at the gigabase level per experiment allows

mutation detection, gene discovery and comparative

genomic analysis at a level which was impossible

previously (Mardis 2008). In genomes carrying high

levels of nucleotide diversity, direct sequencing is now

a viable alternative and arguably a replacement for

molecular markers. The goal for diversity analysis by

sequencing is a matrix of multiple sequenced loci

(typically genes or segments thereof) in multiple

samples. This raises important technical questions.

First, how is the genomic subset isolated? Recently,

several methods for this have been described. The first

uses molecular inversion probes (derived from padlock

probes), based upon hybridization, gap filling and

ligation to provide thousands of PCR amplicons (Dahl

et al. 2007; Porreca et al. 2007; Turner et al. 2009).

Another, more widely used approach involves ‘capture’

of genomic target DNA by hybridization (Bainbridge

et al. 2010; Bodi et al. 2013; Gnirke et al. 2009;

Mamanova et al. 2010; Okou et al. 2007). Potential

problems with this approach include paralog capture

(most genes are members of multiple gene families) and

the need to block hybridization of repetitious DNAs

(particularly transposable elements and their pseudo-

genes) (Okou et al. 2007). Capture methods work best

in genomes, such as Homo sapiens (Hopp et al. 2011;

Tan et al. 2011) with relatively simple sets of repetitious

DNA, and when the capture target is greater than 2 Mb,

giving capture specificities around 85 % (Lee et al.

2009). For the larger maize genome with a more

complex repetitious DNA population, preparation of

blocking DNA to prevent non-specific hybridization of

DNA to the capture probe has been troublesome

(Bashiardes et al. 2005) and capture specificities have

been lower (between 22 and 36 %) though larger

capture targets may improve this (Fu et al. 2010).

The second technical issue with generating large

numbers of (sample × locus) combinations is how to

track multiple samples per locus during NGS of

pooled templates. The manufacturers have addressed

this issue firstly by providing multiple-lane gaskets

(typically 2, 4, 8 or 16 lanes) and secondly by

implementing short sequence ‘barcode’ tag capability

(typically up to 12 bp) (Binladen et al. 2007). Decon-

volution of sample ID is performed later in the

bioinformatics processing steps. Barcoding was first

developed in the context of NGS by Binladen et al.

(2007), who used 2-bp barcodes at the 5′ ends of their

PCR primers to track sample identity. Barcodes were

originally used by Qiu et al. (2003) to differentiate

maize cDNA libraries from different samples (Ash-

lock et al. 2002) and have been widely employed

(Craig et al. 2008; Smith et al. 2010). A problem with

barcoding is significant unevenness in sample cover-

age associated with particular barcodes (Binladen

et al. 2007; Mamanova et al. 2010; Qiu et al. 2003).

Can conventional PCR be used for production of

complex multi (sample × locus) NGS templates? PCR

is cheap, simple and has inherently high specificity for

targeting genome sub-fractions. This was the original

method used for array re-sequencing in humans (Albert

et al. 2007), but the approach was aimed at very large

stretches of very few individual genomes and involved

thousands of difficult long range PCRs. For surveys of

hundreds of samples, the complexity of amplified

genomic DNA is far less challenging; for example, a

capture experiment involving ~200 × 0.4 Kb loci in

100 sample genotypes (80 Kb complexity, 8 Mbp

capture) corresponds to a complete 454 sequencer run

(800,000 sequence reads) at 40-fold sequencing depth.

However, a new problem arises. Preparation of the

above amplicons requires 20,000 PCRs (Fig. 1),

divided by the multiplex ratio per PCR. In practice,

multiplexing is only realistic for amplicon sets which

will be used repeatedly as it requires much optimiza-

tion. In addition, the PCR primers must be tagged to

keep track of sample origin in the NGS step. This

necessitates at least 40,000 primers for the above

experiment (Fig. 1), costing at least as much as the

direct sequencing costs, and this expense would rise

linearly with increase in sample size and target loci.

Conventional tagged PCRs are therefore unrealistic

options for NGS template preparation in large-scale

diversity studies. Recently, microfluidic approaches to

conventional PCR setup have been developed (Tewhey

et al. 2009). These are a useful option but they have the

disadvantages of requiring somewhat expensive dedi-

cated equipment for reaction setup and PCR, and also

are tied to particular combinations of sample × locus

numbers (see ‘Discussion’ section).
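
The sizing arithmetic in the worked example above can be checked with a short sketch. It assumes that "40-fold sequencing depth" means the average number of reads per (sample × locus) combination; the function and parameter names are illustrative and are not part of the published pipeline.

```python
def capture_sizing(n_loci, locus_bp, n_samples, total_reads):
    """Return basic sizing figures for a (sample x locus) sequencing survey."""
    complexity_bp = n_loci * locus_bp        # unique target sequence per genome
    capture_bp = complexity_bp * n_samples   # target sequence summed over all samples
    combinations = n_loci * n_samples        # one conventional PCR per combination
    reads_per_combination = total_reads / combinations
    return {
        "complexity_bp": complexity_bp,
        "capture_bp": capture_bp,
        "combinations": combinations,
        "reads_per_combination": reads_per_combination,
    }


# Figures quoted in the text: ~200 loci of 0.4 Kb in 100 samples, 800,000 reads.
example = capture_sizing(n_loci=200, locus_bp=400, n_samples=100, total_reads=800_000)
print(example)
```

With these inputs the sketch reproduces the 80 Kb complexity, 8 Mbp capture and 20,000 combinations quoted in the text.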

The purpose of this study was to address the

problems of tagged PCRs for targeted amplification of

multiple loci for multiple samples and to develop a

simple flexible method capable of amplifying tagged

gene segments from hundreds of loci in hundreds of

samples, corresponding to a complete NGS run

capacity.

Materials and methods

Plant samples and DNAs

Barley lines used in this study are listed in Supple-

mentary Material 1. Genomic DNAs were isolated

from plant leaves using Qiagen DNeasy 96 kit

following the manufacturer’s instructions. DNA sam-

ples were quantified using either NanoDrop 2000

spectrophotometer (Thermo Scientific), combined

with agarose gel visualization in parallel or picogreen

(Molecular Probes) (Ahn et al. 1996), which we

Fig. 1 Tagging amplicons from multiple samples: The GTA

approach. a Conventional tagging of PCR amplicons, where

each genotype sample has a corresponding tag (color coded)

incorporated into the PCR primer. For X amplicons and Y

samples, this requires Y(X + 1) primers and XY PCRs, or

2XY primers if both ends are tagged (and this raises potential

problems because sequence fidelity is often low at one end of an

NGS sequence run). b The Genome-Tagging principle. Each

genotype sample is tagged (horizontally in the figure—indicated

by colored stars), all genotypes are pooled (vertically) then

aliquots of the pools are amplified at gene loci (3 are shown in

the figure). Each amplicon in the sequence pool carries a tag

indicating its sample origin, so the mixture can be sequenced by

NGS. In principle, for X amplicons and Y samples, Y tagging

reactions plus X amplifications are needed. c The Genome-

Tagged Approach (GTA). (1) Each DNA sample is digested

with a frequent cutter restriction enzyme. A single gene is shown

here. (2) Adapters containing sample-specific tags (color-coded

red, dark blue and purple in the figure) and a generic primer

sequence (brown) are ligated to all fragments. (3) Nested PCR is

carried out with the generic primer and two successive locus-

specific primers (brown and blue arrows; nested PCR improves

locus specificity). Finally, the amplification products are pooled

and subjected to 454 sequencing. (Color figure online)

recommend. In all cases, parallel standard DNA

dilution series controls were used.

Tagged (barcoded) adapter design and

gene-specific primers used

The structure of GTA adapters and corresponding

adapter PCR primers are shown in Supplementary

Material 2. Adapters carry a 3′-AT overhang to ligate

to DNA restricted with MseI (5′ T/TAA 3′), BfaI

(C/TAG) or Csp6I (G/TAC). Gene-specific primer

sequences are listed in Supplementary Material 3,

barcode tag sequences are listed in Supplementary

Material 4 and numbers of sequences obtained relative

to loci, samples and tags in all 3 GTA experiments are

in the Supplementary Spreadsheet. Genomic barley

gene sequence data (Rostoks et al. 2005) and EST

sequence data (http://www.harvest-web.org/) were

mined, and nested primer pairs were matched to the

generic PCR adapter primer using Primer3 (http://

frodo.wi.mit.edu/primer3/).
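
As a rough illustration of the site-selection step, the sketch below scans a sequence for the three recognition sites and keeps those lying a chosen distance downstream of a primer. It is not the authors' design procedure: the 80 to 200 bp default window is borrowed from the first experiment described in the Results, `primer_end` is a hypothetical coordinate for the 3′ end of the gene-specific primer, and the demonstration sequence is synthetic. Because all three sites are palindromic, scanning one strand is sufficient.

```python
# Recognition sequences of the three enzymes used in the protocol.
SITES = {"MseI": "TTAA", "BfaI": "CTAG", "Csp6I": "GTAC"}


def find_sites(seq, motif):
    """Return every start position of motif in seq (overlapping hits included)."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


def candidate_sites(seq, primer_end, min_dist=80, max_dist=200):
    """List (position, enzyme, distance) for sites min_dist..max_dist bp past primer_end."""
    seq = seq.upper()
    hits = []
    for enzyme, motif in SITES.items():
        for pos in find_sites(seq, motif):
            distance = pos - primer_end
            if min_dist <= distance <= max_dist:
                hits.append((pos, enzyme, distance))
    return sorted(hits)


# Synthetic sequence with one MseI site at position 100 and one Csp6I site at 154.
demo_seq = "A" * 100 + "TTAA" + "C" * 50 + "GTAC" + "G" * 100
demo_hits = candidate_sites(demo_seq, primer_end=10)
print(demo_hits)
```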

GTA protocol

The optimized GTA protocol is described in detail in

Supplementary Material 5. Briefly, exactly equal

amounts of barley genomic DNAs (0.6 µg each) were

restriction digested separately with each of the following

enzymes producing a 5′ TA overhang: MseI (T/TAA),

BfaI (C/TAG) and Csp6I (G/TAC). After checking

digestion completion by agarose gel electrophoresis and

heat inactivation of the restriction enzymes, each

digested DNA sample was ligated with a corresponding

sample-specific adapter carrying an 8-bp barcode tag,

followed by heat inactivation of the DNA ligase.

Adapter-ligated DNA samples were then tested

separately for their quality as templates for the GTA

reaction by amplifying a single locus (AF052665)

previously shown to produce a single predicted

product from a reference DNA sample. For these

tests, high PCR cycle numbers were used for the two

consecutive nested GTA PCRs to ensure strong

amplification bands. The first test PCR used 12

‘touchdown’ cycles at decreasing annealing temperatures

from 65 to 56 °C, followed by 27 cycles of 56 °C

and the second (nested) test PCR (25 µl, 25 cycles)

used 0.04 µl of the first round PCRs. Five microliters

of the PCR products were visualized on a 2 % agarose-

TBE-ethidium bromide gel under UV light (examples

are shown in Fig. 2). Any sample failing to produce

the predicted band was replaced by another fresh

aliquot of the same or an alternative sample.

GTA amplifications for downstream NGS were

performed as follows: First, equal volumes of adapted

samples from above were pooled into GTA template

mixes which were cleaned up using Qiaquick PCR

cleanup kits (Qiagen). Different sample pooling

depths between 12 and 96 were explored for these

template mixes in the study (see ‘Results’ section),

with 48 sample pooling selected for the final validated

protocol. Aliquots of this mix were GTA amplified

separately for each amplicon for 28 cycles (55 °C

annealing temperature). PCR products were checked

by gel electrophoresis as above. Typically, PCR

products obtained using these conditions were very

faint and hardly visible on the gels (e.g., Supplemen-

tary Material 6B iii, top lane). The second (nested)

GTA PCRs used exactly the same cycling conditions

as the first round PCR but with 20 cycles and an

annealing temperature of 58 °C, with 0.01 µl DNA

input from the first round PCR, together with nested

gene-specific primer and the same generic adapter

primer used in the first round PCR. Nested PCRs were

checked by gel electrophoresis. Amplified samples

were normalized for DNA content (Invitrogen SequalPrep™)

then pooled for NGS.

454 Sequence data processing

A complete annotated processing pipeline containing

Linux shell and Perl scripts for GTA sequence data is

obtainable from the authors on request. Briefly, reads

and quality scores were extracted from the sequence

providers’ SFF files. The sequence reads were com-

pared to the nucleotide tags representing each corre-

sponding genotype source using BLAST (Altschul

et al. 1997). A read was only accepted for further

Fig. 2 Examples of locus-specific nested GTA test amplifica-

tions (GTA Experiment 1). First round PCRs (a) and nested

PCRs (b) are shown for gene-specific primers 97–144 on pooled

samples 1–96. See Supplementary Material 1 and 3 for samples,

loci and primer sequences

processing if it matched a tag perfectly and did not

match more than one tag. Each read was then trimmed

to remove the tag and any sequence external to it. Any

reads containing 454 primers after this step were

discarded. Each read’s identifier was then prefixed

with the genotype name corresponding to its tag, and

all reads containing a given tag were added to a

corresponding genotype bin. The corresponding qual-

ity scores for each read were also trimmed and binned

by genotype in the same manner.
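
The binning rule described above (accept a read only if it matches exactly one tag perfectly, trim the tag and anything external to it, then prefix the read identifier with the genotype name) can be sketched as follows. This is a simplified stand-in, not the authors' Perl/BLAST pipeline: it uses exact string matching, assumes the tag lies within the first `window` bases of a forward-oriented read, and all names and sequences are hypothetical.

```python
def bin_reads(reads, tags, window=40):
    """Assign reads to genotype bins by exact tag match.

    reads: dict mapping read identifier -> sequence
    tags:  dict mapping genotype name -> tag sequence
    Reads matching no tag, or more than one tag, are discarded.
    """
    bins = {}
    for read_id, seq in reads.items():
        # Look for each tag inside the first `window` bases only.
        hits = [(name, seq.find(tag, 0, window)) for name, tag in tags.items()]
        hits = [(name, pos) for name, pos in hits if pos != -1]
        if len(hits) != 1:
            continue  # untagged or ambiguous read
        name, pos = hits[0]
        trimmed = seq[pos + len(tags[name]):]  # drop the tag and everything before it
        bins.setdefault(name, {})[f"{name}_{read_id}"] = trimmed
    return bins


demo_tags = {"G1": "ACGTACGT", "G2": "TTGGCCAA"}
demo_reads = {
    "r1": "GGGG" + "ACGTACGT" + "CCCTTT",    # one tag, preceded by external sequence
    "r2": "TTGGCCAA" + "AAAA",               # one tag at the very start
    "r3": "CCCCCCCC",                        # no tag
    "r4": "ACGTACGT" + "TTGGCCAA" + "GG",    # two tags: ambiguous
}
demo_bins = bin_reads(demo_reads, demo_tags)
print(demo_bins)
```

In a real pipeline the quality scores would need to be trimmed with the same coordinates, as the text notes.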

Each genotype bin was mapped to the amplicon

sequences using Newbler Reference Mapper Command

Line Interface (454 Life Sciences) with the runMapping

command. All contigs derived from each genotype bin

were prefixed with the bin’s genotype name and

corresponding amplicon name. Any excess sequence

extending beyond the restriction site used to design the

amplicons (due either to incomplete restriction digestion

or polymorphism in the restriction site) was trimmed

from each consensus. Again, the consensus quality scores

were trimmed in parallel. Consensus sequences were

then obtained from all genotype amplicon contigs and

these, with corresponding quality scores, were assembled

to produce amplicon super-contigs (one super-contig per

amplicon), using CAP3 (Huang and Madan 1999) on

default settings. SNP finding within amplicon super-

contigs used GigaBayes (http://bioinformatics.bc.edu/

marthlab/GigaBayes). SNP calls were confirmed by

visual inspection of the aligned genotype contigs in

Tablet (Milne et al. 2013).

In Experiment 1, cases were observed where a

sample was represented in more than one genotype

amplicon contig consensus. Inspection of these

showed that they derived either from two paralogous

genes being amplified by a given PCR primer pair

(limited to amplicons ABC0001327, ABC0001791

and ABC0002813), or by chimera formation during

the GTA amplification (see Supplementary Material 7

and ‘Results’ below). Chimeric assemblies were not

seen if the optimized GTA protocol was followed.

Results

Development of the Genome-Tagged

Amplification method

One way to take advantage of the high specificity of

the PCR-based approach to generate tens of thousands

of amplicon-by-sample-tagged amplicons, while

avoiding the high cost for primers and PCR reactions

entailed by a conventional tagged PCR method

(Fig. 1a), is to tag each genome before the PCRs are

carried out (Fig. 1b). Thus, locus-specific primers do

not need to be tagged and, in the case described above,

100 ‘genome-tagging’ reactions followed by 400 PCRs (on

pools of 100 mixed genotypes) using 400

amplicon primers (200 outer and 200 nested) plus one

universal tag primer can achieve the same result as

20,000 PCRs involving 40,000 primers. To tag

genomic DNA (Fig. 1c), we first digest it with a

frequent cutter restriction enzyme then ligate adapters

containing a sample-specific tag (red, dark blue and

purple in Fig. 1c) flanked by a generic primer region

(brown in the figure; see also Supplementary Material

2). Last, PCR is performed using the generic primer

(brown arrow) and two nested locus-specific primers

(blue arrows). This is essentially an anchored PCR

approach (Waugh et al. 1997). Because the adapter

carries a genotype-coded tag, all samples can be

pooled for NGS sequencing after the adapter ligation

step [Fig. 1c (iii)].
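
The saving in reactions and primers can be made explicit with a small sketch. The conventional case assumes both primers of every PCR carry a tag (the 2XY case, which gives the 40,000 primers quoted in the text), and the GTA case assumes one outer and one nested locus-specific primer per amplicon plus a single generic adapter primer. The function names are illustrative only.

```python
def conventional_tagged_pcr(n_amplicons, n_samples):
    """Reaction and primer counts when every PCR uses two sample-tagged primers."""
    return {
        "pcrs": n_amplicons * n_samples,
        "primers": 2 * n_amplicons * n_samples,
    }


def gta(n_amplicons, n_samples):
    """Counts for GTA: tag each genome once, pool, then run nested PCR per locus."""
    return {
        "tagging_reactions": n_samples,   # one digestion/ligation per sample
        "tagged_adapters": n_samples,     # one sample-specific adapter per sample
        "pcrs": 2 * n_amplicons,          # first-round plus nested PCR per locus
        "locus_primers": 2 * n_amplicons, # outer plus nested primer per locus
        "generic_primers": 1,             # shared adapter primer
    }


conventional = conventional_tagged_pcr(n_amplicons=200, n_samples=100)
tagged = gta(n_amplicons=200, n_samples=100)
print(conventional)
print(tagged)
```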

The first experiment to test this approach involved

amplifying 96 diverse barley genotypes at 205

dispersed barley gene loci that were known to contain

at least one validated SNP polymorphism and an MseI

restriction site (T/TAA) between 80 and 200 bp from a

PCR primer site. The genomic DNA samples were

restriction digested separately with MseI and a differ-

ently tagged adapter was ligated to each product. Each

adapter contained an 8 base tag (Supplementary

Spreadsheet sheet 2) that was distinguishable from

any other used by at least a 2 base walk to ensure

accurate sample binning of the sequenced products

(Qiu et al. 2003). Equal amounts of the adapter-ligated

DNAs were then pooled, and 205 identical aliquots

were amplified separately, each at a different gene

locus, by nested PCR; 203 of the 205 loci yielded

discrete bands after nested amplification (see Fig. 2

for examples). The 203 locus-specific amplicons were

cleaned up, combined together in equimolar amounts

and sequenced bi-directionally by 454.
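
A tag set can be screened for the separation requirement mentioned above with a few lines of code. The sketch assumes that "at least a 2 base walk" means every pair of tags differs at two or more positions (Hamming distance of at least 2); the tag sequences shown are invented for illustration and are not taken from the Supplementary Spreadsheet.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length tags differ."""
    if len(a) != len(b):
        raise ValueError("tags must have the same length")
    return sum(x != y for x, y in zip(a, b))


def too_close(tags, min_dist=2):
    """Return every pair of tags separated by fewer than min_dist mismatches."""
    return [(a, b) for a, b in combinations(tags, 2) if hamming(a, b) < min_dist]


demo_tag_set = ["ACGTACGT", "ACGTACGA", "TTGGCCAA"]
print(too_close(demo_tag_set))  # the first two tags differ at one position only
```

With a minimum distance of 2, a single sequencing error in a tag cannot convert it into another valid tag, which is consistent with the perfect-match binning rule used in the data processing.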

A total of 371,136 sequence reads were obtained

from this experiment, comprising 67.7 Mbp in total

(summarized in Table 1). The size spectrum of reads

was very similar to the corresponding spectrum of

DNAs sent for sequencing (not shown), and 74 % of

reads contained one of the tags plus generic primer.

The drop-out fraction mainly comprised short

sequence reads from the locus-specific primer end

that failed to reach the tag. Tag-containing reads were

binned by tag to their 96 samples of origin using

BLAST, and contigs were obtained for each bin. The

results showed excellent amplification specificity,

with greater than 99 % of contigs deriving from the

203 amplicon set (data not shown).

However, two problems were apparent with the first

GTA protocol. First, we observed an unacceptably

high proportion of presumed chimeric sequences

(mixed haplotypes) for some samples at some loci

(Supplementary Material 7). PCR assays confirmed

that the original genotypes lack these chimeric mole-

cules (data not shown), proving they were experimen-

tal artefacts introduced during the GTA procedure. We

hypothesized that PCR overcycling (continued PCR

cycling after the exponential growth phase has ended)

(Klock and Lesley 2009) was responsible for this

problem and confirmed this using a recombination

assay involving two amplicons polymorphic for a

22-bp indel sequence (Supplementary Material 6B, C).

This assay showed that lowered PCR cycle numbers in

both the first and nested PCRs greatly reduces chimera

formation (Supplementary Material 6C iii). These

improved conditions were validated in GTA experi-

ment 2, which tested three differential pooling regimes

for 48 DNA samples amplified at 46 loci, each carried

through to separate 454 sequencing (Table 1). Tagged

samples were GTA-amplified in sample pools of 12, 24

or 48, with the last being the most stringent test for the

new reaction conditions, and the first being the least

demanding but the most complicated to set up.

Chimera occurrence was not observed in the processed

(sample × locus) contigs for any of the three

super-pools, showing that at least 48 samples can be

successfully pooled in the optimized GTA protocol.

The second problem we encountered was biased

representation of sequences in the data set, with

respect to both sample and locus. To explore the

source(s) of bias by sample, we designed a new set of

GTA control experiments (GTA test sets 1–10;

Fig. 3), to test the parameters DNA sample (‘good/

bad’ and ‘old/new’), digestion/ligation (‘old/new’)

and tag adapter (‘good/bad’; ‘old/new’), where

‘good’/‘bad’ refers to numbers of sequences obtained

from the first GTA experiment and ‘old’/‘new’ refers

to GTA experiments 1 and 2, respectively. For each

parameter, a set of five similarly-performing variants

(tag or DNA sample) was selected based on perfor-

mance in the first GTA, and a shared set of 10 well-

performing loci were amplified for all 10 GTA pools

(thus, all variants were tested against each other for

each of 10 pools). The results are summarized in Fig. 3

and detailed in Supplementary Spreadsheet sheets

4–8. A clear correlation was seen between tag adapter

quality (‘good/bad’ tags) and proportion of sequences

obtained in both GTA experiments 1 and 2. For the 10

pooled GTA test sets, the combined Pearson product-

moment correlation coefficient r in tag performance

between the two experiments is 0.702 and for each tag

adapter treated separately r is 0.586. The only

pronounced difference between the two experiments

was the performance of GTA Set 3 (‘old good’ DNA

sample), which over-performed in GTA experiment 1

and under-performed somewhat in GTA experiment 2,

perhaps due to decay in quality during prolonged

storage between the two experiments. We conclude

that tag sequences show reproducible variation in

amplification performance during the GTA procedure,

allowing us to discard tags with strong positive or

negative bias in GTA reactions (see ‘Discussion’

section). None of the other three experimental param-

eters (‘good’/‘bad’ adapter ligation, ‘old’/‘new’ tag

adapter and ‘old’/‘new’ DNA) showed an obvious

correlation with the number of sequences obtained.
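
For readers who wish to repeat the comparison on their own data, the Pearson product-moment correlation coefficient can be computed as in the sketch below. The read counts are hypothetical placeholders, not values from the Supplementary Spreadsheet, and the function name is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical reads per tag in two experiments (placeholder values only).
reads_exp1 = [3100, 2400, 900, 2700, 1500]
reads_exp2 = [2900, 2600, 1200, 2300, 1700]
print(round(pearson_r(reads_exp1, reads_exp2), 3))
```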

Much of the locus-specific bias in sequence repre-

sentation was traced back to the use of poor quality

GTA-amplified products in the first experiment

Table 1 Summary of sequence data obtained from GTA experiments

                                     Before quality control    After quality control
                                     Sequence reads    Mb      Sequence reads    Mb
GTA experiment 1                     371,136           67.7    275,216           41.7
GTA experiment 2                     469,953           74.1    282,435           43.2
  Super-pool 1 (12-sample pooling)   89,104            15.8    65,504            10.6
  Super-pool 2 (24-sample pooling)   105,056           17.8    72,888            11.4
  Super-pool 3 (48-sample pooling)   142,658           25.0    103,517           16.6
  Super-pool 4                       133,135           15.5    40,526            4.8
GTA experiment 3                     1,471,936         314.9   834,455           161.5

(Supplementary Spreadsheet sheet 3), and more strin-

gent standards for amplicon inclusion in the GTA

protocol solved this in subsequent experiments (see

below). Residual bias-by-locus (amplicon) was

explored by GTA sets 3–4 and 11–14 (Fig. 3) which

together tested the reproducibility of amplicon repre-

sentation in sequences from GTA experiments 1 and 2.

Amplicons were binned into ‘good’, ‘medium’ and

‘bad’ test set categories based upon performance in the

first GTA experiment (Supplementary Spreadsheet

sheets 1 and 5) and amplified from a common set of

well-performing DNAs and tags. The results (Fig. 3)

showed that amplicon performance in a GTA NGS

super-pool is reasonably reproducible, and we con-

clude that a degree of uneven representation between

amplicons is an inherent property of the amplicons.

The detailed GTA protocol incorporating the

improvements described above is given in

Supplementary Material 5 and its performance in

GTA Experiment 2, super-pool 3 with regard to bias

by amplicon and tag is shown in Fig. 4 and Supple-

mentary Material 9. Using good quality amplicons and

tags together largely removes the bias effects noted

above and yields satisfactory sequence depth in 95 %

of the (sample × locus) combinations (a single poorly

performing amplicon accounted for much of the poor

coverage). Our final experiment (GTA Experiment 3)

focussed on 58 loci in 286 samples, compared to 205

loci in 96 samples for experiment 1, to demonstrate the

versatility of the method and to test a large set of

previously unused tags. The results are summarized in

Table 1 and detailed in Supplementary Spreadsheet

sheet 10. Again, chimeric amplicons were not seen,

excellent amplicon specificity was achieved and a set

of 192 validated tags was chosen for future use in GTA

experiments (Supplementary Material 4).
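
Coverage across a (sample × locus) matrix can be summarized with the read-depth classes used in Fig. 4 (black: more than 5 reads; gray: 3 to 5 reads; white: 0 to 2 reads). The sketch below is illustrative only: the depth matrix is invented, and no threshold for "satisfactory" depth is implied beyond the three classes themselves.

```python
def depth_class(reads):
    """Map a read count to the Fig. 4 colour classes."""
    if reads > 5:
        return "black"
    if reads >= 3:
        return "gray"
    return "white"


def class_fractions(matrix):
    """Fraction of (sample x locus) cells falling in each depth class."""
    counts = {"black": 0, "gray": 0, "white": 0}
    total = 0
    for row in matrix:
        for reads in row:
            counts[depth_class(reads)] += 1
            total += 1
    return {name: n / total for name, n in counts.items()}


# Invented 2 x 2 matrix of read depths (rows: loci, columns: samples).
demo_depths = [[6, 3], [0, 10]]
demo_fractions = class_fractions(demo_depths)
print(demo_fractions)
```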

Fig. 3 Exploring source(s) of bias in sequence representation

relative to GTA experimental parameters. The summed

sequences obtained for the GTA experiments 1 and 2 are

shown, split into the 14 GTA test sets, as described in the text

and detailed in the Supplementary Spreadsheet. The component

GTA templates, Tags and Amplicons tested in GTA Experiment

2 are color-keyed by performance in Experiment 1 (pink for

‘Bad’, green for ‘Good’ and yellow for ‘Medium’; see

Supplementary Spreadsheet). The numbers of sequences

obtained per test set per GTA experiment are also color-coded

to show correlations between experiment parameters and

sequence representation bias. Green = greater than 2,500

sequences/experiment; Yellow = between 1,500 and 2,500;

pink = less than 1,500. Total sequence numbers for GTA

Experiment 1 are normalized to GTA Experiment 2. (Color

figure online)

Discussion

Advantages of the GTA approach compared

to hybridization capture

The GTA method developed and described here offers

an alternative to hybridization-based capture methods

and emerging microfluidic approaches (Tewhey et al.

2009). Its advantages relative to other current

approaches are as follows:

1. It is very low-tech and inexpensive. All the

reactions are simple incubations (restriction

digestions, ligations, PCRs) in plate format, so the

experiment can be carried out in any molecular

biology lab with no extra equipment expenditure.

There is no need for a pre-NGS library preparation

step per sample, which for high sample numbers

becomes time-consuming and very expensive

(currently ~£30,000 for 100 samples), or expensive

equipment associated with hybridization

capture and microfluidics.

2. The GTA procedure has no requirement for

blocking of hybridization between repeats, which

is an essential part of all hybridization-based

capture methods and requires a new blocker

preparation for each genome under investigation.

3. It does not require detailed knowledge of the

genome under study. The only information

needed is small DNA segments (less than 1 Kb)

which could be very easily obtained by a single

NGS sequencing run on cDNA. Currently, hybrid-

ization capture arrays are only available for a

limited number of species. Furthermore, design of

capture arrays for a new species demands the

availability of a large body of genome sequence to

exclude repetitive sequences from the capture

oligonucleotides. This problem is much greater

for crop plants with large genomes, because their

dispersed repeat retrotransposon populations are

highly complex (Gribbon et al. 1999; SanMiguel

et al. 1996).

4. It is easily scalable in both dimensions (sample

and locus). We have tested 309 different tags

(Supplementary Material 4) and recommend 192

that should be combined in groups of up to 48.

One problem with using tags in hybridization

capture is blocking cross-hybridization between

tags to form ‘daisy chains’. In fact, the recently

announced Agilent SureSelect barcoding (Gnirke

et al. 2009) specifies that barcoded DNAs should

be hybridized separately then pooled for sequenc-

ing. This makes the analysis of hundreds of

samples very difficult and expensive.

5. It can deal perfectly well with indel polymor-

phism. Hybridization capture is adversely affected

by indels because the capture probes have pro-

gressively lower homology with increasing indel

size. GTA PCRs simply generate an amplicon of a

different size, unless the deleted region knocks out

one of the primer-binding sites (see below).

Limitations of the GTA approach

Disadvantages of the GTA approach include the

following:

6. It is dependent upon restriction enzyme cleavage sites, making it less suitable than a hybridization-based approach for capturing large genome regions, because the experimental design is rather time-consuming and particular sites may be missing from some regions. We have reduced this problem by using multiple enzymes that produce identical sticky ends (MseI, T/TAA; BfaI, C/TAG; and Csp6I, G/TAC) in GTA experiment 3. However, several digestions are needed per sample, leading to an approximate 25 % loss in coverage due to overlaps between amplicons. There is also a failure rate of roughly 15 % in candidate GTA primers, necessitating a planning stage for primer validation on a reference sample. It is important to note that restriction site polymorphism is not a big problem for the GTA approach, because the amplified sequence simply extends to the next site and, as sequencing is from both ends, a 454 template longer than the ~450 bp read length will retain its tag for half of the sequence reads in such circumstances.

Fig. 4 Representation (read depth) bias by sample and locus in the optimized GTA protocol. A matrix of 48 samples (X axis) by 46 loci (Y axis) is shown. Samples and loci are ordered by total read number per column or row, in descending order from top to bottom (loci) and left to right (samples). Each cell is coded according to read depth as follows (all reads are full length): black, greater than 5 reads; gray, 3–5 reads; white, 0–2 reads. The 9 sporadic white squares in well-represented (amplicon × sample) combinations represent possible examples of allelic dropout (see text)

7. GTA is only well suited to NGS applications that generate long (at least 200 bp) sequencing reads. GTA amplicons carry tagged PCR adapters (Waugh et al. 1997) at one end, and only sequences that contain the tag retain the barcoding. At present, this limits the application of GTA to Roche-454 sequencing (Margulies et al. 2005). However, with the latest increase in read lengths on Illumina platforms, GTA could readily be adapted to Illumina sequencing in place of 454 sequencing.

8. Sample pooling depth and PCR cycling are both limited by chimera formation and representation bias. Each halving of the GTA sample pool size almost doubles the number of reactions needed but halves the competition between amplicons. Our validated protocol uses 48-sample pooling, but we suggest that 96-sample pooling will give acceptable results, provided that bioinformatics-based exclusion of chimeras is carried out and at least 40-fold sequencing depth is used. Provided that a minor allele (say 1/48 of the input in a 48-sample GTA pool) results in a corresponding tagged, amplified sequence with less than 50 % chimera contamination, the bioinformatics filtering can eliminate the contamination effectively.

9. PCR-based target enrichment is potentially sus-

ceptible to allelic dropout (loss of particular

sequences because they amplify poorly). For

example, mutation to a PCR priming site might

reduce or remove a particular amplicon in a subset

of samples. This would be detected as sporadic

low representation in well-represented amplicons

and possible examples of this (0.5 % of the cells)

are visible in Fig. 4.

10. Bias by barcode and amplicon during the GTA

amplifications limits the total coverage obtained.

This can be reduced by avoiding tags and

amplicons showing strong bias but the latter limits

target options. However, our studies here show that at least 95 % coverage can easily be obtained at acceptable read depth (40×) and in crucial

situations, a ‘catch-up’ GTA experiment focussed

on the missing samples and/or sequences could be

performed. In such cases, the poorly performing

sets (sample or amplicon) would be amplified

separately to reduce bias (all poorly performing

pool members would be competing against

equally poor members) and for poor samples,

new DNA preps and adapter ligations, together

with high performance tags would be the best

approach. For poorly performing amplicons, high

performance tags would be used.
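The read-depth coding of Fig. 4 and the idea of sporadic low representation described in point 9 can be sketched in Python as follows. The depth classes follow the Fig. 4 legend; the function names, the toy matrix and the `min_mean_depth` threshold used to call a locus or sample 'well-represented' are assumptions made purely for illustration and are not the criteria used in this study.

```python
from typing import List, Tuple


def classify_depth(reads: int) -> str:
    """Map a read count to the Fig. 4 colour classes."""
    if reads > 5:
        return "black"   # greater than 5 reads
    if reads >= 3:
        return "gray"    # 3-5 reads
    return "white"       # 0-2 reads


def dropout_candidates(
    depth: List[List[int]], min_mean_depth: float = 5.0
) -> List[Tuple[int, int]]:
    """Return (locus, sample) indices of white cells whose locus and sample
    are both well represented on average (assumed threshold)."""
    n_loci = len(depth)
    n_samples = len(depth[0])
    locus_means = [sum(row) / n_samples for row in depth]
    sample_means = [
        sum(depth[i][j] for i in range(n_loci)) / n_loci
        for j in range(n_samples)
    ]
    hits = []
    for i, row in enumerate(depth):
        for j, reads in enumerate(row):
            if (
                classify_depth(reads) == "white"
                and locus_means[i] >= min_mean_depth
                and sample_means[j] >= min_mean_depth
            ):
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    # Toy matrix: rows are loci, columns are samples (hypothetical counts).
    depth = [
        [12, 9, 8, 7],
        [10, 0, 9, 6],
        [6, 7, 2, 1],
    ]
    print(dropout_candidates(depth))  # only cell (1, 1) is flagged
```

Low counts in the last row are not flagged because that locus is poorly represented overall, which corresponds to representation bias rather than sporadic dropout.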

Comparison of the GTA approach with other capture technologies

In Table 2, we compare the GTA method with other

sequence capture approaches, to show the most

appropriate use for it. The GTA and Fluidigm

approaches are best suited for high sample numbers

(hundreds) and low capture complexity (below 100 kb), whereas hybridization capture, MIPs and Raindance technologies are better suited to a few samples and high-complexity (>0.4 Mb) captures. GTA is a low-tech option for medium-sized

projects, whereas Fluidigm requires capital invest-

ment in equipment, but GTA involves more lab work

(ligations, PCRs, gels) (30) than Fluidigm. GTA is

very flexible in terms of sample number versus

amplicon number, but Fluidigm is tied to 48 samples

by 48 amplicons. Lastly, 192 validated tag options are

available for GTA (Supplementary Material 4),

whereas Fluidigm has 48, which necessitates either

the use of sequencing plate gaskets or much deeper

NGS sequence pooling and lower coverage.
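As a rough planning aid, the sample × amplicon arithmetic behind this comparison can be sketched in Python. The function and field names are hypothetical, and the sketch assumes one PCR per amplicon per sample pool, which is a simplification and not a statement of the published protocol; the 40-fold depth is the figure suggested in the text.

```python
import math
from typing import Dict


def plan_gta_run(
    n_samples: int, n_amplicons: int, pool_size: int, depth: int = 40
) -> Dict[str, int]:
    """Estimate pools, PCRs and reads for a samples x amplicons matrix.

    Assumption: every sample pool is amplified once per amplicon.
    """
    pools = math.ceil(n_samples / pool_size)
    combinations = n_samples * n_amplicons
    return {
        "pools": pools,
        "pcr_reactions": pools * n_amplicons,
        "combinations": combinations,
        "reads_needed": combinations * depth,
    }


if __name__ == "__main__":
    # 96 samples x 48 amplicons, as in the GTA row of Table 2.
    print(plan_gta_run(96, 48, pool_size=48))
    # Halving the pool size doubles the PCR count under this assumption.
    print(plan_gta_run(96, 48, pool_size=24))
```

For 96 samples and 48 amplicons this gives 4,608 sample × amplicon combinations, in line with the GTA entry in Table 2, and about 184,320 reads at 40-fold depth.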

In passing, it is worth noting that the occurrence of chimeric PCR products during our early GTA experiments has worrying implications for both hybridization capture and conventional PCR amplification of multi-gene families, such as histocompatibility or plant resistance genes, because both involve PCR of mixed template loci. Under such conditions, recombination between multiplex barcoded samples or among multi-gene family members will occur if overcycling conditions are used in a single sample.

Table 2 Comparison of capture technologies

| Technology | Basis | Typical capture target | Capture complexity (Mb) | Typical sample size | Total sequence capture (bp) | % of target captured | Applicable sequencing platform(s) | Capital equipment cost | Approximate cost per run (c) |
|---|---|---|---|---|---|---|---|---|---|
| Genome-Tagged Amplification (GTA) | Anchored PCR | 48 PCR amplicons | 0.02 | 96 | 4,608 (2 Mb) | >95 | Roche-454 only | Low (b) | Low (c) |
| Fluidigm | Microfluidic PCR | 48 PCR amplicons | 0.01 | 48 | 2,304 (1 Mb) | >90 | Roche-454 only | £70,000 | Low (c) |
| Raindance | Microfluidic PCR | 20,000 PCR amplicons | 8 | 1 | 20,000 (8 Mb) | 98 | Roche-454 or Illumina | £200,000 | 1,000 |
| Molecular inversion probes (MIP) | Gap fill ligation | 55,000 56 bp amplimers | 0.4 | 1 | 55,000 (0.4 Mb) | 90 | Illumina only | Low (b) | Low (c) |
| Array capture (Nimblegen) | Hybridization | 385,000 or 2.1 M baits (twofold depth) | 0.4 or 2 | 1–12 | (5 or 30 Mb) | 50–90 (a) | Roche-454 or Illumina | £15,000 | £800 |
| Liquid capture (Agilent) | Hybridization | 20,000 baits | 2.5 | 1 | >15,000 (2.5 Mb) | 40–80 | Illumina only | Low | 900 |

(a) Depends upon target size; see text
(b) No specialized lab equipment needed
(c) Main capital outlay is oligonucleotides, which can be used for multiple runs

Conclusions

In this study, we have described a new PCR-based

approach for sequence capture. We particularly recom-

mend the GTA approach for studies of hundreds of gene

segments in hundreds of samples. Typical applications

include population genetics, evolutionary studies and

association genetics screens in genome subsets. We are

currently using it to scan across association peaks in

barley but it could readily be used, for example, to

screen the members of a dispersed gene family in

hundreds of samples for association with traits of

interest. We used 454 sequencing during development of the GTA method; however, it could easily be adapted to Illumina sequencing platforms, as read lengths have significantly increased recently. Furthermore, GTA could be used for virtually any organism with available genome/sequence information.

Acknowledgments The authors thank Patrick Schnable for

suggestions on barcoding, Dan Ashlock for generating and

providing barcodes for this study and Jeffrey Jeddeloh for

communicating unpublished observations on hybridization

capture technologies. We also thank Margaret Hughes, Garry

Cusack and the Roche-454 Delaware sequencing facility for

much help and advice with 454 sequencing. This work was

supported by Biotechnology and Biological Sciences Research

Council grant BB/E003184/1.

References

Ahn SJ, Costa J, Emanuel JR (1996) PicoGreen quantitation of

DNA: effective evaluation of samples pre- or post-PCR.

Nucleic Acids Res 24:2623–2625

Albert TJ, Molla MN, Muzny DM, Nazareth L, Wheeler D,

Song XZ, Richmond TA, Middle CM, Rodesch MJ,

Packard CJ, Weinstock GM, Gibbs RA (2007) Direct

selection of human genomic loci by microarray hybrid-

ization. Nat Methods 4:903–905

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z,

Miller W, Lipman DJ (1997) Gapped BLAST and PSI-

BLAST: a new generation of protein database search pro-

grams. Nucleic Acids Res 25:3389–3402

Ashlock D, Guo L, Qiu F (2002) Greedy closure evolutionary

algorithms. In: Proceedings of the 2002 congress on evo-

lutionary computation, pp 1296–1301

Bainbridge MN, Wang M, Burgess DL, Kovar C, Rodesch MJ,

D’Ascenzo M, Kitzman J, Wu YQ, Newsham I, Richmond

TA, Jeddeloh JA, Muzny D, Albert TJ, Gibbs RA (2010)

Whole exome capture in solution with 3 Gbp of data.

Genome Biol 11:R62

Bashiardes S, Veile R, Helms C, Mardis ER, Bowcock AM,

Lovett M (2005) Direct genomic selection. Nat Methods

2:63–69

Bentley DR (2006) Whole-genome re-sequencing. Curr Opin

Genet Dev 16:545–552

Binladen J, Gilbert MTP, Bollback JP, Panitz F, Bendixen C,

Nielsen R, Willerslev E (2007) The use of coded PCR

primers enables high-throughput sequencing of multiple

homolog amplification products by 454 parallel sequenc-

ing. PLoS ONE 2:e197. doi:10.1371/journal.pone.0000197

Bodi K, Perera AG, Adams PS, Bintzler D, Dewar K, Grove DS,

Kieleczawa J, Lyons RH, Neubert TA, Noll AC, Singh S,

Steen R, Zianni M (2013) Comparison of commercially

available target enrichment methods for next-generation

sequencing. J Biomol Tech 24:73–86

Choi M, Scholl UI, Ji W, Liu T, Tikhonova IR, Zumbo P, Nayir

A, Bakkaloglu A, Ozen S, Sanjad S, Nelson-Williams C,

Farhi A, Mane S, Lifton RP (2009) Genetic diagnosis by

whole exome capture and massively parallel DNA

sequencing. Proc Natl Acad Sci USA 106:19096–19101

Craig DW, Pearson JV, Szelinger S, Sekar A, Redman M,

Corneveaux JJ, Pawlowski TL, Laub T, Nunn G, Stephan

DA, Homer N, Huentelman MJ (2008) Identification of

genetic variants using bar-coded multiplexed sequencing.

Nat Methods 5:887–893

Dahl F, Stenberg J, Fredriksson S, Welch K, Zhang M, Nilsson

M, Bicknell D, Bodmer WF, Davis RW, Ji HL (2007)

Multigene amplification and massively parallel sequencing

for cancer mutation discovery. Proc Natl Acad Sci USA

104:9387–9392

Fu Y, Springer NM, Gerhardt DJ, Ying K, Yeh CT, Wu W,

Swanson-Wagner R, D’Ascenzo M, Millard T, Freeberg L,

Aoyama N, Kitzman J, Burgess D, Richmond T, Albert TJ,

Barbazuk WB, Jeddeloh JA, Schnable PS (2010) Repeat

subtraction-mediated sequence capture from a complex

genome. Plant J Cell Mol Biol 62:898–909

Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM,

Brockman W, Fennell T, Giannoukos G, Fisher S, Russ C,

Gabriel S, Jaffe DB, Lander ES, Nusbaum C (2009)

Solution hybrid selection with ultra-long oligonucleotides

for massively parallel targeted sequencing. Nat Biotechnol

27:182–189

Gribbon BM, Pearce SR, Kalendar R, Schulman AH, Paulin L,

Jack P, Kumar A, Flavell AJ (1999) Phylogeny and trans-

positional activity of Ty1-copia group retrotransposons in

cereal genomes. Mol Gen Genetics 261:883–891

Hopp K, Heyer CM, Hommerding CJ, Henke SA, Sundsbak JL,

Patel S, Patel P, Consugar MB, Czarnecki PG, Gliem TJ,

Torres VE, Rossetti S, Harris PC (2011) B9D1 is revealed

as a novel Meckel syndrome (MKS) gene by targeted exon-

enriched next-generation sequencing and deletion analysis.

Hum Mol Genet 20:2524–2534

Huang X, Madan A (1999) CAP3: a DNA sequence assembly

program. Genome Res 9:868–877

Klock HE, Lesley SA (2009) The Polymerase Incomplete Pri-

mer Extension (PIPE) method applied to high-throughput

cloning and site-directed mutagenesis. Methods Mol Biol

498:91–103


Lee H, O’Connor BD, Merriman B, Funari VA, Homer N, Chen

Z, Cohn DH, Nelson SF (2009) Improving the efficiency of

genomic loci capture using oligonucleotide arrays for high

throughput resequencing. BMC Genom 10:646

Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH,

Kumar A, Howard E, Shendure J, Turner DJ (2010) Target-

enrichment strategies for next-generation sequencing. Nat

Methods 7:111–118

Mardis ER (2008) Next-generation DNA sequencing methods.

Annu Rev Genom Hum G 9:387–402

Margulies M, Egholm M, Altman WE, Attiya S, Bader JS,

Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z,

Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He

W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML,

Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Le-

amon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H,

Makhijani VB, McDade KE, McKenna MP, Myers EW,

Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth

GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M,

Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH,

Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM

(2005) Genome sequencing in microfabricated high-den-

sity picolitre reactors. Nature 437:376–380

Milne I, Stephen G, Bayer M, Cock PJA, Pritchard L, Cardle L,

Shaw PD, Marshall D (2013) Using Tablet for visual

exploration of second-generation sequencing data. Brief

Bioinform 14:193–202

Okou DT, Steinberg KM, Middle C, Cutler DJ, Albert TJ, Zwick

ME (2007) Microarray-based genomic selection for high-

throughput resequencing. Nat Methods 4:907–909

Porreca GJ, Zhang K, Li JB, Xie B, Austin D, Vassallo SL,

LeProust EM, Peck BJ, Emig CJ, Dahl F, Gao Y, Church

GM, Shendure J (2007) Multiplex amplification of large

sets of human exons. Nat Methods 4:931–936

Qiu F, Guo L, Wen TJ, Liu F, Ashlock DA, Schnable PS (2003)

DNA sequence-based ‘‘bar codes’’ for tracking the origins

of expressed sequence tags from a maize cDNA library

constructed using multiple mRNA sources. Plant Physiol

133:475–481

Rostoks N, Mudie S, Cardle L, Russell J, Ramsay L, Booth A,

Svensson JT, Wanamaker SI, Walia H, Rodriguez EM,

Hedley PE, Liu H, Morris J, Close TJ, Marshall DF, Waugh

R (2005) Genome-wide SNP discovery and linkage ana-

lysis in barley based on genes responsive to abiotic stress.

Mol Genetics Genomics 274:515–527

SanMiguel P, Tikhonov A, Jin YK, Motchoulskaia N, Zakharov

D, Melake-Berhan A, Springer PS, Edwards KJ, Lee M,

Avramova Z, Bennetzen JL (1996) Nested retrotranspo-

sons in the intergenic regions of the maize genome. Science

274:765–768

Smith AM, Heisler LE, St Onge RP, Farias-Hesson E, Wallace

IM, Bodeau J, Harris AN, Perry KM, Giaever G, Pourmand

N, Nislow C (2010) Highly-multiplexed barcode

sequencing: an efficient method for parallel analysis of

pooled samples. Nucleic Acids Res 38:e142. doi:10.1093/

nar/gkq368

Tan IB, Cutcutache I, Zang ZJ, Iqbal J, Yap SF, Hwang W, Lim

WT, Teh BT, Rozen S, Tan EH, Tan P (2011) Fanconi’s

anemia in adulthood: chemoradiation-induced bone mar-

row failure and a novel FANCA mutation identified by

targeted deep sequencing. J Clin Oncol Off J Am Soc Clin

Oncol 29:e591–e594

Tewhey R, Warner JB, Nakano M, Libby B, Medkova M, David PH,

Kotsopoulos SK, Samuels ML, Hutchison JB, Larson JW,

Topol EJ, Weiner MP, Harismendy O, Olson J, Link DR, Frazer

KA (2009) Microdroplet-based PCR enrichment for large-scale

targeted sequencing. Nat Biotechnol 27:U1025–U1094

Turner EH, Lee CL, Ng SB, Nickerson DA, Shendure J (2009)

Massively parallel exon capture and library-free rese-

quencing across 16 genomes. Nat Methods 6:315–316

Waugh R, McLean K, Flavell AJ, Pearce SR, Kumar A, Thomas

BB, Powell W (1997) Genetic distribution of Bare-1-like

retrotransposable elements in the barley genome revealed

by sequence-specific amplification polymorphisms (S-

SAP). Mol Gen Genetics 253:687–694
