Global patterns of apparent copy number variation in birds revealed by cross species

comparative genomic hybridization

Benjamin M Skinner1,2, Deborah Smith2, Martin Völker2, Nilofour Hojjat2, Sannaa Raja2,

Steven Trim2, Peter Houde3, William J. Boecklen3, Abdullah Al Mutery2 , Darren K Griffin2

1 Department of Pathology, University of Cambridge, Cambridge, UK, CB2 1QP

2 School of Biosciences, University of Kent, Canterbury, UK, CT2 7NJ

3 New Mexico State University, Department of Biology, Las Cruces NM 88003 USA

Running title: Copy number variation in birds

Key words: evolution, avian, chromosome, microarray, CNVs, array CGH

Corresponding author

Darren K. Griffin

School of Biosciences, University of Kent, Canterbury, Kent, UK, CT2 7NJ

Tel: +44 1227 823022

Fax: +44 1227 763912

[email protected]

This is the post-review version of this article. The original publication is available at

www.springerlink.com: http://link.springer.com/article/10.1007/s10577-014-9405-0

DOI: 10.1007/s10577-014-9405-0

Abstract

There is a growing interest in copy number variation (CNV) and the recognition of its

importance in phenotype, disease, adaptation and speciation. CNV data is usually ascertained

by array-CGH within-species, but similar inter-species comparisons have also been made in

primates, mice and domestic mammals. Here, we conduct a broad appraisal of putative cross

species CNVs in birds, 16 species in all, using the standard array-CGH approach. Using a

chicken oligonucleotide microarray we detected 790 apparent CNVs within 135 unique

regions and developed a bioinformatic tool “CNV Analyser” for analysing and visualising

cross species data sets. We successfully addressed four hypotheses: 1) That cross species

CNVs (compared to chicken) are, as suggested from preliminary evidence, smaller and fewer

in number than in mammals; this “dogma” was rejected in the light of the new evidence. 2)

That CNVs in birds are likely to have a functional effect through an association with genes; a

large proportion of detected regions (70%) were indeed associated with genes (suggesting

functional significance); however, not necessarily more so than in mammals. 3) That there are

more CNVs in birds with more rearranged karyotypes; this hypothesis was rejected; indeed,

Falco species contained fewer CNVs than most species with relatively standard (chicken-like) karyotypes.

4) That there are more CNVs per megabase on microchromosomes than

macrochromosomes; this hypothesis was accepted; indeed, in species with rearranged

karyotypes characterised by chromosomal fusions, the fused former microchromosomes still

“behaved” as though they were their microchromosomal ancestors. Gene ontology analysis of

CNVRs revealed enrichment in immune response and antigen presentation genes and five

CNVRs were perfectly correlated with the unique loss of sexual dichromatism in one

Galliform species.

Introduction

Variations in the copy number of DNA fragments ≥1kb (excepting insertions or deletions of

transposable elements), CNVs (Feuk et al. 2006; Freeman et al. 2006), are increasingly

recognised for their role in phenotype, disease, adaptation and speciation (Freeman et al.

2006; Emanuel and Saitta 2007). In medicine, CNVs have been linked with schizophrenia,

autism and other psychiatric disorders (Stefansson et al. 2008; St Clair 2009), and there have

been associations of HIV-AIDS susceptibility with CNVs at the CCL3L locus in humans,

chimpanzees and macaques (reviewed by Colobran et al. 2010).

CNVs can facilitate adaptation by conferring phenotypes with a selective advantage, e.g. an

increase in copy number of the human amylase (AMY1) gene has been linked to the adoption

of a starch-rich diet in human populations (Perry et al. 2007). This is one of many examples

of the proposed effects of CNVs on human evolution, which were summarised recently

(Iskow et al. 2012). CNVs have also been linked to genome evolution on a chromosomal

scale in that they may facilitate non-homologous recombination (Lupski 1998; Stankiewicz

and Lupski 2002). That is, a correlation has been observed in primates between the locations

of segmental duplications (low copy-number CNVs with >90% sequence identity; Lander et

al. 2001) and the breakpoints of chromosomal rearrangements. It has also been suggested that

chromosomal fusions may be facilitated by the presence of segmental duplications (Emanuel

and Saitta 2007). CNV generation is therefore perhaps driven, at least in part, through non-

allelic homologous recombination (NAHR) following ancestral segmental duplications (Perry

et al. 2006; Repping et al. 2006; Völker et al. 2010).

Amongst mammals, CNV research has focussed on primates (Perry et al. 2006), mice

(Graubert et al. 2007), cattle (Fadista et al. 2010), pigs (Fadista et al. 2008; Wang et al. 2012)

and sheep (Fontanesi et al. 2011). For the vast majority of CNV studies, the platform used for

detection is array CGH (microarray based comparative genomic hybridization). While most

have been performed between individuals of the same species, inter-specific array CGH is

becoming increasingly commonplace with primate and mouse cross species array CGH all

reporting numerous lineage-specific gene gains and losses (Samonte and Eichler 2002; Locke

et al. 2003; Fortna et al. 2004; Newman et al. 2005; Bailey and Eichler 2006; Dumas et al.

2007). The 40+ year-old hypothesis that gene duplication is a major factor in evolutionary

change (Ohno 1970) is thus supported. Despite inherent drawbacks associated with the fact

that all inter-specific array CGH is made with respect to a reference species (usually human),

cross species array CGH is a useful tool for determining global patterns of DNA copy number

variation between species as it can identify areas of the genome suitable for further study. In

other words, although apparent losses of DNA copy number compared to the reference

species may, in some cases, be attributable to reduction in microarray hybridization dynamics

(e.g. brought about by sequence divergence), inter-specific array CGH data is nonetheless

extremely valuable for cross species genomic comparisons. For the purposes of this study,

therefore, we will use CNVs and CNVRs to mean the outputs of array CGH studies,

being mindful of the potential limitations of the platform.

In birds, three studies by our own group (Griffin et al. 2008; Skinner et al. 2009; Völker et al.

2010) in turkey, duck and zebra finch respectively, examined a small number of animals per

species to derive the first results of cross species array CGH in birds (using chicken as the

reference species). That is, we reported the identification of 16 putative CNVs between

chicken and turkey (Griffin et al. 2008), 32 between chicken and duck (Skinner et al. 2009),

and 20 between chicken and zebra finch (Völker et al. 2010). This compares with 58 CNVs

detected on a comparable platform between chimp and human (Perry et al. 2006). With a

reported divergence of humans and chimpanzees of 6 million years, and of chickens and

turkey, duck and zebra finch, 40, 80 and 100 million years respectively, the initial inference is

that CNVs are far less common in birds than in mammals. Within-species avian studies are

limited to chickens, and have revealed patterns of breed specific CNVs (Wang et al. 2010;

Wang et al. 2012; Crooijmans In Press), many of which contain genes of interest for

cytoskeletal components (mainly the keratin superfamily), assorted transcription factors and

metabolic processes. Nonetheless, general patterns of avian CNVs between species remain

very under-explored, despite being essential to provide the basis for studies of the genome

evolution of this extremely important phylogenetic class.
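
The comparison of CNV counts against divergence times made earlier in this paragraph can be put on a rough common scale. The sketch below simply divides each reported count by the quoted divergence time. This normalisation is an illustrative assumption rather than an analysis from the cited studies, and the function and variable names are hypothetical.

```python
# CNV counts and divergence times (million years) as quoted in the text.
# Dividing one by the other is a crude, illustrative normalisation only;
# it ignores differences in platform, sample size and genome size.
comparisons = {
    "human-chimpanzee": (58, 6),
    "chicken-turkey": (16, 40),
    "chicken-duck": (32, 80),
    "chicken-zebra finch": (20, 100),
}


def cnvs_per_million_years(count, divergence_my):
    """Return reported CNVs per million years of divergence."""
    return count / divergence_my


rates = {
    pair: cnvs_per_million_years(count, divergence_my)
    for pair, (count, divergence_my) in comparisons.items()
}

for pair, rate in rates.items():
    print(f"{pair}: {rate:.2f} CNVs per million years")
```

On these figures the primate comparison yields roughly an order of magnitude more apparent CNVs per million years than any of the three avian comparisons, which is the basis of the initial inference.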

Bird genomes are characterised by a small size (reviewed in Gregory 2005) and by distinctive

and highly conserved karyotypes, consisting of about 10 pairs of macrochromosomes and

~30 pairs of microchromosomes (reviewed in Griffin et al. 2007). The genomes in most bird

lineages have undergone far fewer interchromosomal rearrangements than mammalian ones

(Hillier et al. 2004; Griffin et al. 2007); though recent work suggests intrachromosomal

rearrangements may be common (Völker et al. 2010; Skinner and Griffin 2011). The chicken

genome has a low number of segmental duplications relative to a typical mammalian genome

(Hillier et al. 2004) and this also appears to be the case for more recently published zebra

finch and turkey genomes (Dalloul et al. 2010; Warren et al. 2010). These findings are thus

consistent with the notion that segmental duplications facilitate chromosomal rearrangements

(Kehrer-Sawatzki and Cooper 2008).

Given these smaller genome sizes, smaller intergenic distances, low repeat content and low

number of pseudogenes in the avian genome (Burt et al. 1999; Hillier et al. 2004) it is

reasonable to hypothesise that avian CNVs (as identified by array CGH) will be

correspondingly smaller and fewer in number than those reported in mammals. As mentioned

above, our preliminary evidence (Griffin et al. 2008; Skinner et al. 2009; Völker et al. 2010)

supports this. One could also therefore propose a second hypothesis that those CNVs present

are more likely to have a functional effect through association with genes. This leads to a third

hypothesis, namely that birds that are characterised by a high degree of chromosome

rearrangement (such as Falco and Gyps species) have a higher proportion of CNVs than the

avian norm. Given the inverse relationship between chromosome size and gene density (bird

microchromosomes contain a higher density of genes and fewer repeats than

macrochromosomes), a fourth hypothesis may be proposed, i.e. that microchromosomes

contain more CNVs per megabase than macrochromosomes.

The purpose of this study was therefore to extend CNV investigations to several key avian

clades including Galliformes, Anseriformes, Passeriformes, Gruiformes, and Falconiformes

to address the above hypotheses. We used the previously well-established array-CGH

approach, hybridising genomic DNA from 16 bird species across six orders to a chicken

whole genome tiling path microarray. Our results provide a global overview of the patterns of

CNVs (or at least “apparent” CNVs as revealed by array CGH) in birds. We have also

performed gene ontology analysis on the putative CNV regions to ask whether any

conclusions can be drawn about the role of CNVs in phenotypic variation. Finally, we present

a program named "CNV Analyser" that we have developed for the collation and presentation

of cross species array CGH (putative CNV) data from a Nimblegen microarray platform.

Materials and Methods

Sample preparation

Blood samples and/or feather pulp were acquired from individuals from 16 species (Table 1).

Genomic DNA was prepared using a Qiagen DNeasy Blood and Tissue Kit (#69504)

according to the manufacturer’s instructions. The reference (Red Jungle Fowl) DNA, from

the same animal used in the chicken genome sequencing project, was kindly provided by Dr

Hans Cheng (Michigan State University).

Array-CGH

The NimbleGen chicken whole-genome tiling array (Catalogue Number/Design Name

B3791001-00-01, galGal3 WG CGH - Roche NimbleGen, Milton Keynes, UK) was used for

the array CGH experiments. It contains 385,000 50-mer oligonucleotides with an average

spacing of 2,586 base pairs (source - UCSC, build - galGal3). Labeling of genomic DNA and

hybridization to the NimbleGen array were performed by the company and used random

priming to incorporate modified nucleotides by either amino-allyl or direct linkage to either

of the two dyes used (Cy3 and Cy5). All of the hybridizations in this experiment used two

dyes per slide (Cy3 and Cy5). Red Jungle Fowl reference DNA was co-hybridized with test

DNA. Array data has been uploaded to ArrayExpress under accessions E-TABM-920 and E-

MTAB-1554.

Data analysis

CGH analysis proceeded in three stages: normalization, window averaging and segmentation.

After combining the signal intensity and genomic coordinate information, the Cy3 and Cy5

signal intensities were normalized to one another using Qspline normalization (Workman et

al. 2002). Qspline is a robust non-linear method for normalization using array signal

distribution analysis and cubic splines. Once normalized, the data was prepared for DNA

segmentation analysis.

Previously (Griffin et al. 2008; Skinner et al. 2009; Völker et al. 2010), we have used the

default segmentation analysis provided by Nimblegen. In this study, we performed

segmentation using DNACopy (Olshen et al. 2004). Each probe on the array was annotated as

to whether it was located within an Ensembl exon, using an in-house script. The data were

then divided into two groups: Group 1 contained all probes on the array. Group 2 contained

only probes NOT found in exons. The DNACopy algorithm was run on both data sets. CNVs

were called for segments spanning 4 or more probes in which the absolute log2 ratio was

greater than 0.5. The resulting CNVs from groups 1 and 2 were compared, and only those

present in both groups (the vast majority) were taken forward for further analysis. This extra

step was used to check that our CNVs were not simply genomic regions highly conserved

across species appearing as gains following normalisation. Following normalisation, regions

of higher conservation appear as gains, while regions of lower conservation appear as losses.

Since exonic regions are more likely to be highly conserved than intronic or intergenic regions,

we tested for this possibility by removing all exonic probes from the dataset before running

the segmentation algorithm. The vast majority of the CNVs were still detected in this

analysis, indicating that the data is not simply due to differential conservation of sequence

identity. Furthermore, there was no change in the number of CNVs excluded by this method

per species with increasing phylogenetic distance from chicken, as would be expected if

sequence divergence were dominating the results.
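
The calling rule described above can be sketched as follows. The thresholds (4 or more probes, absolute log2 ratio greater than 0.5) come from the text. The `Segment` record, the function names and the example values are hypothetical. The text does not state how a call was judged "present in both groups", so matching by coordinate overlap with the same direction of change is an assumption made here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One segment as it might be returned by a segmentation step (hypothetical record)."""
    chrom: str
    start: int
    end: int
    n_probes: int
    mean_log2: float


MIN_PROBES = 4      # from the text: segments spanning 4 or more probes
MIN_ABS_LOG2 = 0.5  # from the text: absolute log2 ratio greater than 0.5


def call_cnvs(segments):
    """Keep only segments that pass both calling thresholds."""
    return [
        s for s in segments
        if s.n_probes >= MIN_PROBES and abs(s.mean_log2) > MIN_ABS_LOG2
    ]


def overlaps(a, b):
    """True if two segments share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def confirmed_cnvs(all_probe_calls, non_exonic_calls):
    """Assumed matching rule: overlapping call with the same sign in both runs."""
    return [
        a for a in all_probe_calls
        if any(
            overlaps(a, b) and (a.mean_log2 > 0) == (b.mean_log2 > 0)
            for b in non_exonic_calls
        )
    ]


# Hypothetical segments: group 1 uses all probes, group 2 excludes exonic probes.
group1 = [
    Segment("1", 1000, 20000, 8, 0.8),
    Segment("1", 50000, 60000, 3, 0.9),   # too few probes
    Segment("2", 5000, 30000, 10, -0.7),
    Segment("3", 100, 9000, 6, 0.2),      # ratio too small
    Segment("4", 1000, 40000, 12, 0.6),
]
group2 = [
    Segment("1", 1500, 19000, 6, 0.75),
    Segment("2", 5000, 30000, 9, -0.65),
    Segment("4", 1000, 40000, 5, 0.3),    # fails once exonic probes are removed
]

calls1 = call_cnvs(group1)
calls2 = call_cnvs(group2)
confirmed = confirmed_cnvs(calls1, calls2)
print([(c.chrom, c.start, c.end) for c in confirmed])
```

In this toy example the chromosome 4 call is dropped because it does not survive removal of the exonic probes, which is the behaviour the extra step is designed to detect.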

Overlapping CNVs were combined into CNV regions (CNVRs) irrespective of the degree of

overlap or whether the change was a gain or loss (as in Redon et al. 2006 and subsequent

CNV studies). We incorporated our previously published data from turkey (Griffin et al.

2008), duck (Skinner et al. 2009) and zebra finch (Völker et al. 2010) into the data analysis

for completeness, including the DNACopy segmentation. Chromosomes Z, W and

unassigned linkage groups were discarded from the CNVR calling. In order to visualise

CNVRs across individuals within the same species, or multiple species compared to chicken,

a bespoke tool “CNV Analyser” was developed specifically for this study. The purpose of this

was to display the results in a graphical format compared to a standard chicken ideogram (see

figure 1), and to provide a simple tool for discovering genes within the CNVRs. CNV

Analyser is freely available by contacting the authors.
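
Combining overlapping CNVs into CNVRs, irrespective of the degree of overlap or the direction of change, amounts to a standard interval merge. The following is a minimal sketch, not the CNV Analyser implementation; the function name, tuple layout and coordinates are hypothetical, and coordinates are assumed to be closed intervals so that CNVs sharing a single base are merged.

```python
def merge_into_cnvrs(cnvs):
    """Merge overlapping (chrom, start, end) CNVs into CNVRs.

    Returns a list of (chrom, start, end, n_cnvs) tuples. Gains and losses
    are not distinguished, mirroring the rule described in the text.
    """
    cnvrs = []
    for chrom, start, end in sorted(cnvs):
        if cnvrs and cnvrs[-1][0] == chrom and start <= cnvrs[-1][2]:
            # Overlaps the current region: extend it and count the CNV.
            _, region_start, region_end, n = cnvrs[-1]
            cnvrs[-1] = (chrom, region_start, max(region_end, end), n + 1)
        else:
            # No overlap: start a new region.
            cnvrs.append((chrom, start, end, 1))
    return cnvrs


# Hypothetical CNV calls pooled across animals and species.
cnvs = [
    ("1", 100, 500),
    ("1", 400, 900),
    ("1", 2000, 2500),
    ("2", 100, 300),
    ("1", 850, 1000),
]

cnvrs = merge_into_cnvrs(cnvs)
total_bp = sum(end - start for _, start, end, _ in cnvrs)
print(cnvrs)
print(total_bp)
```

Summing the merged lengths, as in the last lines, is one way a figure such as the total genome sequence encompassed by CNVRs could be obtained.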

Gene content of CNVRs and gene ontology were identified from Ensembl using BioMart

(www.ensembl.org/biomart/martview). Gene enrichment analysis was performed using

DAVID (Da Wei Huang and Lempicki 2008; http://david.abcc.ncifcrf.gov/). Gene clusters

with an enrichment score >2 were taken to be significant. Statistical phenotypic associative

comparisons focused on the fowl (Phasianidae), the nearest relatives to chicken, upon which

all CNVs were scored. As members of the same family, these birds are genetically similar to

one another and differ in a relatively narrowly defined set of phenotypes; genes are less likely

to have diverged functions than they might between more distantly related species. Their

phylogeny is well-established so autocorrelation of traits due to shared ancestry can be

accounted for (Kimball and Braun 2008; Bonilla et al. 2010). A data set of phenotypes

related to sexual traits is available for all of these species (Kimball et al. 2011); the traits encode

body size, shape, plumage and sex, and are therefore likely to be related to numerous

developmental and hormonal regulators. Correlations between CNV and sexual phenotype

data sets were assessed visually and with a pairwise correlation matrix (Minitab Statistical

Software, Minitab, Inc.). Statistical testing was carried out using R.
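
As an illustration of what a perfect association between a CNVR and a binary phenotype means in such a comparison, the sketch below looks for CNVRs whose presence across species matches a trait exactly. It is not the Minitab or R analysis used in this study and makes no correction for shared ancestry; the CNVR identifiers, the call sets and the trait values are hypothetical.

```python
def perfectly_associated(cnvr_calls, trait):
    """Return CNVR ids whose presence pattern across species equals the trait.

    cnvr_calls maps a CNVR id to the set of species in which it was called;
    trait maps each species to True/False for a binary phenotype.
    """
    species = sorted(trait)
    target = [trait[s] for s in species]
    return [
        cnvr_id
        for cnvr_id, called_in in cnvr_calls.items()
        if [s in called_in for s in species] == target
    ]


# Hypothetical binary trait (e.g. loss of sexual dichromatism) and CNVR calls.
trait = {
    "chukar": True,
    "common quail": False,
    "turkey": False,
    "Lady Amherst's pheasant": False,
}
cnvr_calls = {
    "CNVR_A": {"chukar"},
    "CNVR_B": {"chukar", "turkey"},
    "CNVR_C": {"common quail"},
    "CNVR_D": {"chukar"},
}

hits = perfectly_associated(cnvr_calls, trait)
print(hits)
```

With a trait that is unique to one species, any CNVR private to that species will match, so such hits are best treated as candidates for follow-up rather than as evidence of causation.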

Results

Cross-species array-CGH identification of CNVs

We successfully hybridised genomic DNA from 16 avian species to the 385,000

oligonucleotide chicken microarray (Roche Nimblegen). We also successfully reanalysed and

incorporated previous CNV data from turkey (Griffin et al. 2008), duck (Skinner et al. 2009)

and zebra finch (Völker et al. 2010). Our full data set thus includes species from six orders

and three major clades, the Galliformes, Anseriformes and Neoaves. The details of the

species and animals used are in Table 1. An example of the output generated by CNV

Analyser is given in Figure 1: CNVRs in the common quail are shown over a chicken

karyotype. Gains are shown in green to the right of the chromosome and apparent losses in

red to the left.

Figure 1. Output from CNV Analyser showing relative gains and losses in 4 common quails compared to chickens. Chicken chromosomes are numbered in order at the top of the figure and ideograms are represented by vertical lines. Apparent CNVR gains are represented by green bars to the right of each chromosome, apparent losses by red bars to the left.

All reported ‘gains’ and ‘losses’ are presented with respect to chicken. Excluding the sex

chromosomes and unassigned linkage groups, 790 CNVs were found (supplementary table

S1). The overlapping CNVs were combined into 135 CNV regions (supplementary table S2).

These CNVRs are similar in size and distribution to those seen in our previous experiments:

CNVRs ranged from 7.15kb to 1.56Mb, with a median size of 32.9kb (figure 2). Overall,

11.8Mb of the chicken genome was encompassed by CNVRs.

Figure 2. Distribution of CNV sizes across the species in this study. The grey bounded region shows the median and quartiles for the dataset as a whole. Numbers at the top show the number of CNVs per species.

Gene content of CNVRs

We examined the CNVRs for association with known or novel genes on the Ensembl

database (release 69); 95 CNVRs (70%) contained, or overlapped with, genes (supplementary

table S2). Both gains and losses were predominantly associated with genes, either known or

novel; 47 of the gains (79%) and 48 of the losses (63%) had associated genes. Furthermore,

most of these genes are annotated as protein-coding (supplementary figure S2).
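
The proportion of CNVRs that contain or overlap genes can be computed with a simple interval test. The sketch below is an assumption about how such a tally might be made, not the BioMart workflow used here; the function names and all coordinates are hypothetical.

```python
def overlaps_any_gene(cnvr, genes):
    """True if a (chrom, start, end) CNVR shares at least one base with any gene."""
    chrom, start, end = cnvr
    return any(
        g_chrom == chrom and g_start <= end and start <= g_end
        for g_chrom, g_start, g_end in genes
    )


def fraction_with_genes(cnvrs, genes):
    """Fraction of CNVRs that contain or overlap at least one gene."""
    if not cnvrs:
        return 0.0
    hits = sum(1 for cnvr in cnvrs if overlaps_any_gene(cnvr, genes))
    return hits / len(cnvrs)


# Hypothetical CNVRs and gene coordinates.
cnvrs = [("1", 100, 500), ("1", 2000, 2500), ("2", 100, 300), ("3", 10, 20)]
genes = [("1", 450, 800), ("2", 50, 120), ("3", 500, 900)]

fraction = fraction_with_genes(cnvrs, genes)
print(f"{fraction:.0%} of CNVRs overlap a gene")
```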

CNVR frequency in birds with highly rearranged karyotypes

Both Falco (falcon) and Gyps (vulture) species have previously been reported as having

significantly rearranged karyotypes and high degrees of interchromosomal rearrangements

(Bed'Hom et al. 2003; Nanda et al. 2006; Nishida et al. 2008). The median number of CNVs

per genome was 10 for Falco species and 12 for Gyps species, compared to an overall

median of 11.75. Differences were not statistically significant.

CNVR density in macro- versus microchromosomes

We compared the sizes and densities of CNVRs against chromosome sizes. As predicted, we

found a significant inverse association between CNVR frequency and chromosome size

(Figure 3A) – CNVRs are more common on smaller chromosomes (rho=-0.88, p<0.0001,

n=25, Spearman's rank). The size of CNVRs however does not show any correlation with

chromosome size (Figure 3B) – small CNVRs are as common as large CNVRs on small

chromosomes (rho=0.08, p=0.67, n=25, Spearman's rank). Consequently, smaller

chromosomes tend to have a higher percentage of sequence within CNVs than larger

chromosomes. The same patterns were seen when the CNVRs were separated into gains and

losses, with similar statistical significances.
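
For readers unfamiliar with the statistic, Spearman's rank correlation between chromosome size and CNVR density can be computed as the Pearson correlation of the ranked values. The statistical testing in this study was carried out in R; the pure-Python sketch below is only an illustration, the chromosome sizes and CNVR counts are hypothetical, and no p-value is computed.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical chromosome sizes (Mb) and CNVR counts per chromosome.
sizes_mb = [200, 150, 110, 90, 60, 20, 10, 5]
cnvr_counts = [4, 3, 3, 2, 2, 2, 1, 1]
density = [count / size for count, size in zip(cnvr_counts, sizes_mb)]

rho = spearman_rho(sizes_mb, density)
print(f"rho = {rho:.2f}")
```

A strongly negative rho, as in this toy data, corresponds to the pattern reported here: the smaller the chromosome, the more CNVRs per megabase.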

Figure 3. Correlations between chromosome size and A) CNVR density (the number of CNVRs per megabase of chromosome); B) the median CNVR length. There is a strong significant inverse correlation with CNVR density (Spearman's rank, rho=-0.88, p<0.0001, n=25) but no correlation with median CNVR length (Spearman's rank, rho=0.08, p=0.67, n=25).

Overlap with CNVs amongst chicken breeds

A total of 47 of the 135 CNVRs from our species (~36%) overlap chicken CNVRs found by

Crooijmans et al. (2013) looking between chicken breeds (supplementary table S6). The

shared CNVRs are distributed across most chromosomes and contain both gains and losses.

Many of the CNVRs seen as gains in our species relative to Red Jungle Fowl appear as losses

in chicken breeds relative to Red Jungle Fowl, consistent with the high percentage of losses

(75%) found by Crooijmans et al. (2013).

Gene ontology analysis

A gene ontology analysis using DAVID revealed enrichment among losses for genes involved

in immune responses and antigen presentation (supplementary Table S3). No specific

enrichment categories were seen amongst gains. A total of 51 CNVRs (38%) are shared

between two or more species (see also figure S3). While the total number of gains and losses

are similar (59 versus 76 respectively), the 18 CNVRs shared amongst more than four species

are all losses (see figure S3). These CNVRs predominantly contain genes related to immune

function, including CNVR109, which covers almost all of the available GGA16 sequence

(the MHC region).

Some of the CNVRs appear lineage-specific: for example, the gain CNVR64 is found only in

the four quails. It contains the gene LRFN5, which while uncharacterised in chicken, in

humans may play a role in regulating synapse formation (Mah et al. 2010). Similarly, the gain

CNVR56 is found only in four animals from Gyps fulvus, and contains ADAMTS3. This gene

has roles in cell signalling, cell fusion, and cell-cell interactions, and in humans is implicated

in multiple diseases including myocardial infarction (Lee et al. 2012). An annotated list of

genes in CNVRs seen concordant in most animals of a species is given in supplementary

table S5.

Statistical comparisons between our CNV dataset and phenotypic datasets detected some

CNVRs correlated with sexual phenotypic data (despite deliberately excluding Z and W

probes from the analysis). That is, five CNVRs (four losses and one gain) were found to be

perfectly correlated with the unique loss of sexual dichromatism in the chukar (Alectoris

chukar) (Kimball et al. 2011). A sixth CNVR (weak loss vs strong losses in chukar) was

shared by a pheasant with sexual dichromatism. This CNVR overlaps an intron of DACH2,

associated with the development of female primary sexual characteristics, and an as-yet un-

annotated antisense miRNA within the intron.

Supplementary table S6 lists autapomorphic species-specific and lineage-specific

putative CNVRs containing genes. Since many of the genes remain unannotated in

chicken, information from the human orthologue of the gene has been included where

possible to provide an approximation of function.

Discussion

In this study we successfully addressed four hypotheses, as outlined in the introduction,

pertaining to cross species CNVs (as revealed by array CGH using chicken as a reference).

We developed a novel in-house tool for analysing and visualising cross species CNVRs and

identified putative relationships between inter-specific CNVRs and phenotypic variation in

birds. As such, this study represents the most comprehensive appraisal of copy number

variation (or at least cross species array CGH data) in birds to date and its possible role in

avian species variation and genome evolution.

Hypothesis 1) Size and number of copy number variants in birds

The median CNV size (32.9kb) presented here is smaller than reported in our individual

studies of turkey (127.1 kb), duck (50kb) and zebra finch (90kb) (Griffin et al. 2008; Skinner

et al. 2009; Völker et al. 2010), albeit employing slightly different segmentation methods.

The range of CNV sizes is near identical in all the species studied, with only a slight

variability. These results compare favourably with those of Wang et al. (2010, 2012) who

examined CNVs between chicken breeds, finding median CNV sizes of 42.59kb and 14.43kb,

with similar ranges.

Our first hypothesis was that CNVs apparent in birds would appear smaller than those seen in

mammals; this was supported by initial evidence. Recent within-species and cross-species

studies in mammals however indicate this not to be the case. That is, experiments with

oligonucleotide platforms within species in mice, cattle and pigs have found median CNV

sizes of 122kb, 16.7kb and 142.9kb respectively (Graubert et al. 2007; Fadista et al. 2010;

Wang et al. 2012). Cross species experiments comparing sheep and goats to cattle have found

median sizes of 55.9kb and 49.5kb (Fontanesi et al. 2011). It is becoming clearer therefore

that as the resolution of CNV detection platforms improves, smaller CNVs are being found.

We thus do not feel confident in accepting our first hypothesis that bird CNVs are smaller

than those of mammals, and await comparable results using similar platforms in order to confirm or

refute this.

The number of CNVs per individual genome also seems comparable to the situation in

mammals. We found a median of 11.75 CNVs per animal, varying by species from 5.5 in

Lady Amherst's pheasant to 39.75 in the red-legged partridge. CNVs within chickens were

found with an average of 17 and 40 CNVs per bird (Wang et al. 2010; Wang et al. 2012). The

equivalent numbers for mammals range from 11.6 to 47 in cattle (Fadista et al. 2010; Liu et

al. 2010), about 17 per individual genome in dogs, goats and sheep (Chen et al. 2009;

Fontanesi et al. 2010; Fontanesi et al. 2011) and 22 per individual genome in mice (Graubert

et al. 2007). Consequently, it appears that CNVs are equally frequent across birds as they are

across mammals, accounting for slight variations in array platforms and analysis methods,

and thus are likely to play as important a functional role. Given that bird genomes are, on

average, one third smaller than mammalian ones, however, the number of CNVs per megabase

in birds is nonetheless correspondingly larger.

Hypothesis 2) Gene content of CNVRs

Our second hypothesis was that CNVRs would frequently be associated with genes, and more

so than in mammals. In the current data set, we found Ensembl genes within the majority of

CNVRs (70%). Redon et al. (2006), found 58% of CNVRs in the human genome overlapping

known RefSeq genes. Amongst other mammals, CNVRs overlapping with genes range from

30% in cattle (Fadista et al. 2008; Fadista et al. 2010) through 61% in sheep and goats

(Fontanesi et al. 2010; Fontanesi et al. 2011) to 74% in pigs and dogs (Chen et al. 2009;

Wang et al. 2012). Amongst chicken breeds, the overlaps are 58% (Wang et al. 2010), 66%

(Crooijmans In Press) and 87.7% (Wang et al. 2012). Thus it is possible that there is a slightly

higher association between genes and CNVRs amongst birds, but this is only tentative. For

the moment, we thus feel it prudent to conclude that, while we accept the hypothesis that

CNVRs are frequently associated with genes, there appear to be not dissimilar proportions of

genes involved in copy number variation in birds and mammals. We thus await more robust

statistical analysis of comparable platforms and larger data sets. In any event, the high

association of the avian CNVRs identified in this study with protein coding genes argues for

the functional significance of these CNVRs.

Hypothesis 3) CNVs in species with highly rearranged karyotypes

The falcon and vulture species have substantial karyotypic rearrangements compared with the

‘typical’ avian karyotype. The reduction in chromosome number, typical of the Falconidae,

was mediated by fusions of microchromosomes with macrochromosomes,

microchromosomes with other microchromosomes, and some fissions of the largest ancestral

chromosomes (Bed'Hom et al. 2003; Nanda et al. 2006; Nishida et al. 2008). Our third

hypothesis, that we would find a higher number of CNVRs in these birds as compared with

species with more stable karyotypes was not supported. That is, the number of CNVs per

animal was not greater in either Falco or Gyps species compared to the others, nor were CNV

sizes noticeably different from the global median. In fact, with an average of about six CNVs

per animal in Falco species, they were among the lowest counts in the dataset. This contrasts

with our initial expectation that species with highly rearranged karyotypes would be more

prone to copy number variation. As CNVs have been observed to be associated with intra-

chromosomal rearrangements (Völker et al. 2010), we hypothesise that CNVs correlate more

strongly with intra-chromosomal rearrangements than with inter-chromosomal

rearrangements. The in-progress sequence assemblies of falcon and parrot species will aid in

testing this.

Hypothesis 4) Association between chromosome size and CNV content

Models of copy-number variation suggest it is driven by non-allelic homologous

recombination (NAHR), and so is expected to be more frequent in regions of the genome

with high recombination rates. In birds, the smaller chromosomes have high recombination

rates, due in part to the obligatory crossover during meiosis (Burt 2005; Schmid et al. 2005).

Consequently, our final hypothesis was that CNV density (number of CNVs per megabase)

would inversely correlate with chromosome size. As hypothesised, the smaller chromosomes

appeared to be more CNV-dense (Figure 3A). We also found that the CNVs have an equal

range of sizes across chromosomes, meaning that the proportion of a microchromosome

affected by copy number variation is greater than the comparable proportion of a

macrochromosome. This impacts on the functional relevance of avian CNVRs, given about

two thirds of most avian species’ genes are found on the microchromosomes (Hillier et al.

2004; Dalloul et al. 2010; Warren et al. 2010). Moreover, Falco CNVR density still follows

the association with chicken chromosome size, and CNVR length still shows no association with

chromosome size, as is also the case in species with “standard” (i.e. chicken-like) karyotypes. Thus,

while these birds have rearranged karyotypes, the individual chromosome synteny blocks are

still 'behaving' like their ancestral chromosomes. Similarly, chicken GGA4p, the ancestral

chromosome ten (Griffin et al. 2007), still has a GC content, CpG density and recombination

rate resembling a microchromosome (Hillier et al. 2004; Griffin et al. 2008).
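
The density comparison described above can be sketched in a few lines of Python. This is an illustrative sketch only: the chromosome names, sizes and CNV counts are hypothetical placeholders rather than data from this study, and the use of a Spearman rank correlation is an assumption made for the example, not necessarily the test applied to Figure 2.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def cnv_density(n_cnvs, size_mb):
    """Number of CNVs per megabase of chromosome."""
    return n_cnvs / size_mb


# Hypothetical values for illustration: name -> (size in Mb, number of CNVs).
chromosomes = {
    "macro_A": (200.0, 20),
    "macro_B": (150.0, 18),
    "mid_C": (60.0, 12),
    "micro_D": (20.0, 8),
    "micro_E": (5.0, 4),
}

sizes = [size for size, _ in chromosomes.values()]
densities = [cnv_density(n, size) for size, n in chromosomes.values()]
rho = spearman(sizes, densities)
print(f"Spearman rho between size and CNV density: {rho:.2f}")
```

A negative rho indicates that smaller chromosomes carry more CNVs per megabase, which is the direction predicted by the hypothesis.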

Do our reported “gains” and “losses” represent genuine CNVs?

Inference of true DNA copy number variation from relative hybridization efficiencies and

fluorescent intensities on a microarray platform is a point for discussion. It is possible for

instance, given the means by which the data was ascertained (array CGH), that some of our

apparent losses are, in point of fact, due to sequence divergence rather than copy number

change. We would argue, however, that because they mostly overlap with genes, these apparent CNVRs

still represent genomic regions of interest, whatever the cause. On the other hand, it is hard

to imagine how the gains (relative to chicken) could be anything other than genuine CNVs.

Since generating the primary array CGH data we have made multiple attempts to verify the

results independently, chiefly by quantitative PCR. While some results (not presented)

appeared to confirm the array data, others did not. Moreover, qPCR of “known” inter-specific

CNVs did not consistently confirm the data either. After around two years of

attempts we were thus forced to conclude that the qPCR data were no more or less reliable

than the array CGH data. That is, while qPCR is excellent for confirming array CGH data between

individuals of the same species, the possible confounding effects of differing annealing

efficiencies between species-specific primers lead us to believe that it is not a suitable

strategy for confirming cross-species array CGH data. Possible alternatives include

FISH-based approaches, e.g. fibre FISH for small CNVs or 3D FISH for larger ones. Another

possibility is massively parallel deep sequencing of the genomes in question. Both approaches,

particularly the latter, will need to wait for better-characterised genome assemblies.

Comparison with the pre-existing chicken data (in which genuine CNVs are presumably

represented) reveals significant overlaps, thus leading us to conclude that a large proportion

(perhaps the majority) of the “gains” and “losses” seen in this data set are genuine CNVs.
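
As a minimal sketch of how such an overlap comparison could be made, the following Python fragment counts the fraction of cross-species CNVRs that share at least one base with a previously reported chicken CNV. The coordinates and the function names are hypothetical, and the sketch reports only the raw overlap fraction; assessing whether an overlap is statistically significant would additionally require a null model (for example, random placement of regions), which is not shown here.

```python
def overlaps(a, b):
    """True if two (chromosome, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def fraction_overlapping(query, reference):
    """Fraction of query intervals overlapping at least one reference interval."""
    if not query:
        return 0.0
    hits = sum(any(overlaps(q, r) for r in reference) for q in query)
    return hits / len(query)


# Hypothetical coordinates for illustration only (not data from this study).
cross_species_cnvrs = [
    ("GGA1", 100, 200),
    ("GGA1", 500, 600),
    ("GGA16", 10, 50),
    ("GGA4", 0, 30),
]
known_chicken_cnvs = [
    ("GGA1", 150, 250),
    ("GGA16", 40, 90),
    ("GGA2", 0, 1000),
]

fraction = fraction_overlapping(cross_species_cnvrs, known_chicken_cnvs)
print(f"Fraction of CNVRs overlapping known chicken CNVs: {fraction:.2f}")
```

The all-against-all comparison is quadratic in the number of regions, which is acceptable for a few hundred CNVRs; a sorted or indexed structure would be preferable for larger sets.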

Gene ontology analysis

Many CNVRs found consistently in all (or most) individuals of every species appear

to be losses relative to chicken, e.g. CNVR109 on GGA16. The only sequence available for

GGA16 (~400kb) contains the MHC locus (as well as rRNA genes), a rapidly

changing and copy number variable region of the genome (e.g. Ewald and Livant 2004).

Furthermore, the analysis of gene enrichment categories using DAVID revealed other

immune system function and cytoskeletal components to be particularly highly enriched

(supplementary table S4) amongst the losses. The fact that no gains were seen in these

regions supports the notion that a significant proportion of our 'losses' here are caused by

sequence divergence in these rapidly evolving genomic regions.
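
For readers unfamiliar with enrichment testing, the underlying idea can be sketched as a hypergeometric upper-tail probability. DAVID itself reports a modified Fisher's exact test (the EASE score), so the plain hypergeometric test below is a simplified stand-in rather than a reproduction of the DAVID calculation, and all gene counts in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """Probability of seeing at least k annotated genes in a list of n genes.

    N is the number of genes in the background, K the number of background
    genes carrying the annotation, n the size of the gene list, and k the
    number of annotated genes observed in that list.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts for illustration only (not data from this study):
# 15000 background genes, 300 annotated with an immune function term,
# 100 genes within loss CNVRs, 8 of which carry the annotation.
p_value = hypergeom_enrichment_p(k=8, n=100, K=300, N=15000)
print(f"Enrichment p-value: {p_value:.3g}")
```

A small p-value indicates that the annotation occurs in the gene list more often than expected from the background frequency alone.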

The majority of CNVRs (62%) that were found appeared to be unique to individual species.

Those that included genes were almost uniformly either all gains or all losses, e.g. 24 gene

losses and 1 gain in Meleagris, 14 gene gains and no losses in Coturnix. The fact that many

of these genes have multifaceted transcriptional and/or regulatory roles (e.g., containing

LIM-domains; Jurata and Gill 1997) is consistent with the suggestion that the correlation of gains

or losses may be due to their participation in functional networks. Further, it is a reasonable

assumption that these genomic regions may contain genes involved in speciation and/or

adaptation of that lineage because most closely related species typically exhibit opposite

trends of gains versus losses. The most interesting examples of these are those CNVRs that

are present in all or most of the animals within that species, e.g. the quail-specific CNVR64.

Other interesting examples include:

• Genes that may be involved in diet in turkey (e.g., losses of F1N9P3_CHICK and

F1NP19_CHICK, implicated in proteolytic digestion and degradation of trypsin

inhibitors; Szmola et al. 2003);

• The uniquely migratory behaviour of common quail among fowl (e.g., gains of

OBSCN, implicated in hypertrophy of myofibrils, and of MAPK8IP3, implicated in

respiratory gaseous exchange; Borisov et al. 2003);

• Muscle activity in the speedy falcons (e.g., gain of MYOZ3, preferentially expressed

in fast-twitch myofibers and skeletal muscle; Frey and Olson 2002);

• Immune function in the common quail and silver pheasant (e.g., LEAP2 and ITCH,

respectively);

• Homeotic genes in the common pheasant and California quail (e.g., SCML2 and

DLX5, respectively);

• Brain development and neuronal function in turkey (e.g., loss of CTXN1), common

quail (gain of LRFN5) and duck (e.g., DLGAP2), based on comparison of

orthologous gene functions in mammals (Watson et al. 1994; Mah et al. 2010).

Conclusions

Copy number variation is increasingly being recognised as an important and still unexplored

area of genetic variation in most species, contributing to disease states, adaptations and

speciation. This work represents a broad survey of copy number variation in avian species,

and establishes a base from which future studies can build. Cross-species array CGH in birds

will undoubtedly prove valuable to investigations into copy number variation

in general and avian genetic variability in particular.

Acknowledgements

We thank Wingham Wildlife Park (Kent), Cherry Valley Farms, British United Turkeys,

the Central Veterinary Research Laboratory in Dubai and the Falcon Hospital Dubai for

providing the feather and blood samples from which the DNA was extracted.

Supplementary Figures

Figure S1. Numbers of CNVRs containing genes in each species, and the proportion of genes that are protein-coding.

Figure S2. Heatmap in which gains are shown in green, losses in red, with the brightness proportional to the log2 ratio of the CNV. The animals and CNVs have been clustered, revealing patterns of lineage-specific CNVRs, and demonstrating that losses are more frequently shared across multiple species than gains.

Figure S3. The number of CNVRs which are unique to one species, or which are shared between two or more species. Gains are shared between at most four species; losses are shared with up to 18 species. Shared losses are likely attributable to sequence divergence rather than true copy number change.

Table 1. Species analysed in this study, comprising 66 animals from 18 species across 6 orders.

Species                    Common name               Order            n
Alectoris chukar           Red legged partridge      Galliformes      4
Anas platyrhynchos         Duck                      Anseriformes     2
Callipepla californica     California quail          Galliformes      2
Chlamydotis undulata       Houbara bustard           Gruiformes       9
Chrysolophus amherstiae    Lady Amherst's pheasant   Galliformes      2
Chrysolophus pictus        Golden pheasant           Galliformes      2
Colinus virginianus        Bobwhite quail            Galliformes      2
Coturnix coturnix          Common quail              Galliformes      4
Falco chiquera             Red-necked falcon         Falconiformes    5
Falco rusticolus           Gyrfalcon                 Falconiformes    11
Gyps africanus             White-backed vulture      Accipitriformes  2
Gyps fulvus                Griffon vulture           Accipitriformes  2
Gyps rueppellii            Rüppell's Vulture         Accipitriformes  2
Lophura nycthemera         Silver pheasant           Galliformes      4
Meleagris gallopavo        Turkey                    Galliformes      4
Pavo cristatus             Peafowl                   Galliformes      4
Phasianus colchicus        Common pheasant           Galliformes      3
Taeniopygia guttata        Zebra finch               Passeriformes    2

References

Bailey JA, Eichler EE. 2006. Primate segmental duplications: crucibles of evolution, diversity and disease. Nature Reviews Genetics 7(7): 552-564.

Bed'Hom B, Coullin P, Guillier-Gencik Z, Moulin S, Bernheim A, Volobouev V. 2003. Characterization of the atypical karyotype of the black-winged kite Elanus caeruleus (Falconiformes: Accipitridae) by means of classical and molecular cytogenetic techniques. Chromosome Research 11(4): 335-343.

Bonilla AJ, Braun EL, Kimball RT. 2010. Comparative molecular evolution and phylogenetic utility of 3′-UTRs and introns in Galliformes. Molecular phylogenetics and evolution 56(2): 536-542.

Borisov AB, Raeker MO, Kontrogianni-Konstantopoulos A, Yang K, Kurnit DM, Bloch RJ, Russell MW. 2003. Rapid response of cardiac obscurin gene cluster to aortic stenosis: differential activation of Rho-GEF and MLCK and involvement in hypertrophic growth. Biochemical and biophysical research communications 310(3): 910-918.

Burt DW. 2005. Chicken genome: current status and future opportunities. Genome Research 15(12): 1692-1698.

Burt DW, Bruley C, Dunn IC, Jones CT, Ramage A, Law AS, Morrice DR, Paton IR, Smith J, Windsor D. 1999. The dynamics of chromosome evolution in birds and mammals. Nature 402(6760): 411-413.

Chen W-K, Swartz JD, Rush LJ, Alvarez CE. 2009. Mapping DNA structural variation in dogs. Genome research 19(3): 500-509.

Colobran R, Pedrosa E, Carretero‐Iglesia L, Juan M. 2010. Copy number variation in chemokine superfamily: the complex scene of CCL3L–CCL4L genes in health and disease. Clinical & Experimental Immunology 162(1): 41-52.

Crooijmans RPMA, Fife MS, Fitzgerald T, Strickland S, Cheng HH, Kaiser P, Redon R, Groenen MAM. 2013. Large scale variation in DNA copy number in chicken breeds. BMC genomics 14: 398.

Huang DW, Sherman BT, Lempicki RA. 2008. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature protocols 4(1): 44-57.

Dalloul RA, Long JA, Zimin AV, Aslam L, Beal K, Bouffard P, Burt DW, Crasta O, Crooijmans RP, Cooper K. 2010. Multi-platform next-generation sequencing of the domestic turkey (Meleagris gallopavo): genome assembly and analysis. PLoS biology 8(9): e1000475.

Dumas L, Kim YH, Karimpour-Fard A, Cox M, Hopkins J, Pollack JR, Sikela JM. 2007. Gene copy number variation spanning 60 million years of human and primate evolution. Genome Research 17(9): 1266-1277.

Emanuel BS, Saitta SC. 2007. From microscopes to microarrays: dissecting recurrent chromosomal rearrangements. Nature Reviews Genetics 8(11): 869-883.

Ewald S, Livant E. 2004. Distinctive polymorphism of chicken B-FI (major histocompatibility complex class I) molecules. Poultry science 83(4): 600-605.

Fadista J, Nygaard M, Holm L-E, Thomsen B, Bendixen C. 2008. A snapshot of CNVs in the pig genome. PLoS One 3(12): e3916.

Fadista J, Thomsen B, Holm L-E, Bendixen C. 2010. Copy number variation in the bovine genome. BMC genomics 11(1): 284.

Feuk L, Carson AR, Scherer SW. 2006. Structural variation in the human genome. Nature Reviews Genetics 7(2): 85-97.

Fontanesi L, Beretti F, Martelli P, Colombo M, Dall'Olio S, Occidente M, Portolano B, Casadio R, Matassino D, Russo V. 2011. A first comparative map of copy number variations in the sheep genome. Genomics 97(3): 158-165.

Fontanesi L, Martelli P, Beretti F, Riggio V, Dall'Olio S, Colombo M, Casadio R, Russo V, Portolano B. 2010. An initial comparative map of copy number variations in the goat (Capra hircus) genome. BMC genomics 11(1): 639.

Fortna A, Kim Y, MacLaren E, Marshall K, Hahn G, Meltesen L, Brenton M, Hink R, Burgers S, Hernandez-Boussard T. 2004. Lineage-specific gene duplication and loss in human and great ape evolution. PLoS biology 2(7): e207.

Freeman JL, Perry GH, Feuk L, Redon R, McCarroll SA, Altshuler DM, Aburatani H, Jones KW, Tyler-Smith C, Hurles ME. 2006. Copy number variation: new insights in genome diversity. Genome Research 16(8): 949-961.

Frey N, Olson EN. 2002. Calsarcin-3, a novel skeletal muscle-specific member of the calsarcin family, interacts with multiple Z-disc proteins. Journal of Biological Chemistry 277(16): 13998-14004.

Graubert TA, Cahan P, Edwin D, Selzer RR, Richmond TA, Eis PS, Shannon WD, Li X, McLeod HL, Cheverud JM. 2007. A high-resolution map of segmental DNA copy number variation in the mouse genome. PLoS Genetics 3(1): e3.

Gregory TR. 2005. The evolution of the genome. Academic Press.

Griffin D, Robertson L, Tempest H, Skinner B. 2007. The evolution of the avian genome as revealed by comparative molecular cytogenetics. Cytogenetic and genome research 117(1-4): 64-77.

Griffin DK, Robertson LB, Tempest HG, Vignal A, Fillon V, Crooijmans RP, Groenen MA, Deryusheva S, Gaginskaya E, Carré W. 2008. Whole genome comparative studies between chicken and turkey and their implications for avian genome evolution. BMC genomics 9(1): 168.

Hillier LW, Miller W, Birney E, Warren W, Hardison RC, Ponting CP, Bork P, Burt DW, Groenen MA, Delany ME. 2004. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432(7018): 695-716.

Iskow RC, Gokcumen O, Lee C. 2012. Exploring the role of copy number variants in human adaptation. Trends in Genetics.

Jurata LW, Gill GN. 1997. Functional analysis of the nuclear LIM domain interactor NLI. Molecular and cellular biology 17(10): 5688-5698.

Kehrer-Sawatzki H, Cooper DN. 2008. Molecular mechanisms of chromosomal rearrangement during primate evolution. Chromosome Research 16(1): 41-56.

Kimball RT, Braun EL. 2008. A multigene phylogeny of Galliformes supports a single origin of erectile ability in non‐feathered facial traits. Journal of Avian Biology 39(4): 438-445.

Kimball RT, Mary CMS, Braun EL. 2011. A macroevolutionary perspective on multiple sexual traits in the Phasianidae (Galliformes). International journal of evolutionary biology 2011.

Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W. 2001. Initial sequencing and analysis of the human genome. Nature 409(6822): 860-921.

Lee CW, Hwang I, Park C-S, Lee H, Park D-W, Kang S-J, Lee S-W, Kim Y-H, Park S-W, Park S-J. 2012. Expression of ADAMTS-2,-3,-13, and-14 in culprit coronary lesions in patients with acute myocardial infarction or stable angina. Journal of thrombosis and thrombolysis 33(4): 362-370.

Liu GE, Hou Y, Zhu B, Cardone MF, Jiang L, Cellamare A, Mitra A, Alexander LJ, Coutinho LL, Dell'Aquila ME. 2010. Analysis of copy number variations among diverse cattle breeds. Genome research 20(5): 693-703.

Locke DP, Segraves R, Carbone L, Archidiacono N, Albertson DG, Pinkel D, Eichler EE. 2003. Large-scale variation among human and great ape genomes determined by array comparative genomic hybridization. Genome Research 13(3): 347-357.

Lupski JR. 1998. Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits. Trends in genetics: TIG 14(10): 417.

Mah W, Ko J, Nam J, Han K, Chung WS, Kim E. 2010. Selected SALM (synaptic adhesion-like molecule) family proteins regulate synapse formation. The Journal of Neuroscience 30(16): 5559-5568.

Nanda I, Karl E, Volobouev V, Griffin D, Schartl M, Schmid M. 2006. Extensive gross genomic rearrangements between chicken and Old World vultures (Falconiformes: Accipitridae). Cytogenetic and genome research 112(3-4): 286-295.

Newman TL, Tuzun E, Morrison VA, Hayden KE, Ventura M, McGrath SD, Rocchi M, Eichler EE. 2005. A genome-wide survey of structural variation between human and chimpanzee. Genome Research 15(10): 1344-1356.

Nishida C, Ishijima J, Kosaka A, Tanabe H, Habermann FA, Griffin DK, Matsuda Y. 2008. Characterization of chromosome structures of Falconinae (Falconidae, Falconiformes, Aves) by chromosome painting and delineation of chromosome rearrangements during their differentiation. Chromosome Research 16(1): 171-181.

Ohno S. 1970. Evolution by gene duplication. London: George Allen & Unwin Ltd. Berlin, Heidelberg and New York: Springer-Verlag.

Olshen AB, Venkatraman E, Lucito R, Wigler M. 2004. Circular binary segmentation for the analysis of array‐based DNA copy number data. Biostatistics 5(4): 557-572.

Perry GH, Dominy NJ, Claw KG, Lee AS, Fiegler H, Redon R, Werner J, Villanea FA, Mountain JL, Misra R. 2007. Diet and the evolution of human amylase gene copy number variation. Nature genetics 39(10): 1256-1260.

Perry GH, Tchinda J, McGrath SD, Zhang J, Picker SR, Cáceres AM, Iafrate AJ, Tyler-Smith C, Scherer SW, Eichler EE. 2006. Hotspots for copy number variation in chimpanzees and humans. Proceedings of the National Academy of Sciences 103(21): 8006-8011.

Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W. 2006. Global variation in copy number in the human genome. Nature 444(7118): 444-454.

Repping S, van Daalen SK, Brown LG, Korver CM, Lange J, Marszalek JD, Pyntikova T, van der Veen F, Skaletsky H, Page DC. 2006. High mutation rates have driven extensive structural polymorphism among human Y chromosomes. Nature genetics 38(4): 463-467.

Samonte RV, Eichler EE. 2002. Segmental duplications and the evolution of the primate genome. Nature Reviews Genetics 3(1): 65-72.

Schmid M, Nanda I, Burt DW. 2005. Second report on chicken genes and chromosomes 2005. Cytogenetic and genome research 109(4): 415-479.

Skinner B, Griffin D. 2011. Intrachromosomal rearrangements in avian genome evolution: evidence for regions prone to breakpoints. Heredity 108(1): 37-41.

Skinner B, Robertson L, Tempest H, Langley E, Ioannou D, Fowler K, Crooijmans R, Hall A, Griffin D, Völker M. 2009. Comparative genomics in chicken and Pekin duck using FISH mapping and microarray analysis. BMC genomics 10(1): 357.

St Clair D. 2009. Copy number variation and schizophrenia. Schizophrenia bulletin 35(1): 9-12.

Stankiewicz P, Lupski JR. 2002. Genome architecture, rearrangements and genomic disorders. Trends in genetics: TIG 18(2): 74.

Stefansson H, Rujescu D, Cichon S, Pietiläinen OP, Ingason A, Steinberg S, Fossdal R, Sigurdsson E, Sigmundsson T, Buizer-Voskamp JE. 2008. Large recurrent microdeletions associated with schizophrenia. Nature 455(7210): 232-236.

Szmola R, Kukor Z, Sahin-Tóth M. 2003. Human mesotrypsin is a unique digestive protease specialized for the degradation of trypsin inhibitors. Journal of Biological Chemistry 278(49): 48580-48589.

Völker M, Backström N, Skinner BM, Langley EJ, Bunzey SK, Ellegren H, Griffin DK. 2010. Copy number variation, chromosome rearrangement, and their association with recombination during avian evolution. Genome Research 20(4): 503-511.

Wang J, Jiang J, Fu W, Jiang L, Ding X, Liu J-F, Zhang Q. 2012. A genome-wide detection of copy number variations using SNP genotyping arrays in swine. BMC genomics 13(1): 273.

Wang X, Nahashon S, Feaster TK, Bohannon-Stewart A, Adefope N. 2010. An initial map of chromosomal segmental copy number variations in the chicken. BMC genomics 11(1): 351.

Warren WC, Clayton DF, Ellegren H, Arnold AP, Hillier LW, Künstner A, Searle S, White S, Vilella AJ, Fairley S. 2010. The genome of a songbird. Nature 464(7289): 757-762.

Watson KL, Justice RW, Bryant PJ. 1994. Drosophila in cancer research: the first fifty tumor suppressor genes. Journal of cell science Supplement 18: 19.

Workman C, Jensen LJ, Jarmer H, Berka R, Gautier L, Nielsen HB, Saxild H-H, Nielsen C, Brunak S, Knudsen S. 2002. A new non-linear normalization method for reducing variability in DNA microarray experiments. Genome biol 3(9): 1-16.