
Genomics 86 (2005) 505–517
www.elsevier.com/locate/ygeno

Integrating QTL and high-density SNP analyses in mice to identify Insig2 as a susceptibility gene for plasma cholesterol levels

Alessandra C. Cervino (a), Guoya Li (a), Steve Edwards (a), Jun Zhu (a), Cathy Laurie (a), George Tokiwa (a), Pek Yee Lum (a), Susanna Wang (b), Lawrence W. Castellini (b), Aldons J. Lusis (b), Sonia Carlson (a), Alan B. Sachs (a), Eric E. Schadt (a,*)

(a) Rosetta Inpharmatics, LLC, a Wholly Owned Subsidiary of Merck & Co., Inc., Seattle, WA 98109, USA

(b) Division of Cardiology, Department of Medicine, and Department of Microbiology, Immunology, and Molecular Genetics, David Geffen School of Medicine at the University of California at Los Angeles, Los Angeles, CA 90095-1679, USA

Received 19 May 2005; accepted 25 July 2005

Available online 29 August 2005

Abstract

The use of inbred strains of mice to dissect the genetic complexity of common diseases offers a viable alternative to human studies, given the

control over experimental parameters that can be exercised. Central to efforts to map susceptibility loci for common diseases in mice is a

comprehensive map of DNA variation among the common inbred strains of mice. Here we present one of the most comprehensive high-density,

single nucleotide polymorphism (SNP) maps of mice constructed to date. This map consists of 10,350 SNPs genotyped in 62 strains of inbred

mice. We demonstrate the utility of these data via a novel integrative genomics approach to mapping susceptibility loci for complex traits. By

integrating in silico quantitative trait locus (QTL) mapping with progressive QTL mapping strategies in segregating mouse populations that

leverage large-scale mapping of the genetic determinants of gene expression traits, we not only facilitate identification of candidate quantitative

trait genes, but also protect against spurious associations that can arise in genetic association studies due to allelic association among unlinked

markers. Application of this approach to our high-density SNP map and two previously described F2 crosses between strains C57BL/6J (B6) and

DBA/2J and between B6 ApoE−/− and C3H/HeJ ApoE−/− results in the identification of Insig2 as a strong candidate susceptibility gene for total

plasma cholesterol levels.

© 2005 Elsevier Inc. All rights reserved.

Keywords: SNP; QTL; Complex traits; Insig2; Cholesterol; Gene expression; Genetics

The Mouse Phenome Database (http://www.jax.org/phenome) accession numbers for the phenomes discussed herein are MPD:8 (chol_fat from the Albers1 set), MPD:29 (total_chol from the Paigen1 set), and MPD:99 (chol_17 from the Paigen2 set).

* Corresponding author. E-mail address: [email protected] (E.E. Schadt).

0888-7543/$ - see front matter © 2005 Elsevier Inc. All rights reserved.

doi:10.1016/j.ygeno.2005.07.010

A.C. Cervino et al. / Genomics 86 (2005) 505–517

One class of genetic approaches to dissecting complex disease traits treats variations in DNA as markers and tests whether such markers are associated with a given disease trait of interest. Markers found to cosegregate or otherwise associate with a given disease highlight regions in the genome that are considered linked to the disease and that at least partially explain susceptibility to the disease. The use of inbred strains of mice in this setting has proven to be a viable alternative to human genetic studies given the degree of control that can be exercised over experimental parameters such as environment, breeding scheme, and detailed phenotyping. More recently, quantitative trait locus (QTL) mapping has led to the identification of numerous genetic loci for a variety of traits relevant to human diseases, including behavioral differences, lipid levels, obesity, and atherosclerosis [1]. Although the use of QTL analysis to identify complex disease genes can still be a long and laborious undertaking, important progress in the development of genomics and bioinformatics tools, such as the creation of whole-genome congenic strains, microarray technologies, and the availability of the genomic sequence from several different strains, is serving to accelerate this process. Previously, we have reported on exploiting genomic variations in mice used to infer causal relationships among gene expression traits and between gene expression and disease traits [2–4]. One intuitive strategy in particular involved examining

genes located in regions of the genome genetically linked to the

disease trait and then testing for colocalization of cis-acting

expression QTL (eQTL) for those genes residing in the region

linked to the disease and correlated with disease-related

phenotypes. These genes serve as natural candidates for disease

phenotypes of interest. One problem with this approach arises if

the gene contributing to a particular disease is not causally

associated with respect to transcript abundances, but instead

involves some other mechanism of action unrelated to changes in

transcript levels (e.g., alternative splicing or amino acid changes

leading to changes in protein function). In such cases, or in cases

in which the relevant tissues cannot be profiled or the particular

state of a system in which changes in expression lead to disease

cannot be determined, there will be no cis eQTL colocalizing

with the disease trait QTL of interest. Further, the number of cis

eQTL with significant lod scores in a given linkage region of

interest that are independent but closely linked to the QTL for the

trait of interest also complicates this process, as we have

previously described [5]. These shortcomings, combined with

the difficulty in general in positionally cloning quantitative trait

genes underlying QTL, speak to the need for alternative methods

to identify more efficiently causal genes for disease.

One alternative approach that has been recently detailed

exploits naturally occurring DNA variation over any number of

common strains of mice to recover haplotype structures in

these mice, which in turn can be used to carry out genome-wide

association scans [6,7]. Such “in silico” genetics procedures

have been applied to collections of inbred lines to narrow

significantly genomic regions previously linked to disease in

segregating mouse populations, even allowing for the direct

identification of the causal gene and ultimately the causal

polymorphism giving rise to the disease trait [6,7]. A necessary

first step in carrying out this type of approach is a dense map of

markers informative over the strains of mice commonly used in

biological research. Here we present one of the most

comprehensive high-density single nucleotide polymorphism

(SNP) maps constructed to date for common strains of

laboratory mice. Our map consists of 10,350 SNPs genotyped

in 62 strains of inbred mice, including 56 common strains

ordered from The Jackson Laboratory. Using hierarchical trees,

phylogenetic analysis, and the reconstruction of haplotype

blocks, we investigated the ancestral relationships among these

62 strains of mice. The SNP data are essential to establish

ancestral relationships among the strains and to leverage these

relationships to reconstruct haplotypes for any region of

interest in the genome, thereby facilitating more rapid

identification of genes underlying complex disease traits.

However, we also show that there is strong, widespread

nonrandom association operating among unlinked loci (linkage

disequilibrium) that has the potential to lead to spurious

associations in genetic association studies. To protect against

such spurious associations, we developed an integrative

genomics approach to map susceptibility loci for complex

traits. This approach involves integrating in silico (association-

based) QTL mapping with QTL mapping in segregating mouse

populations. By leveraging large-scale mapping of the genetic

determinants of gene expression traits, we not only facilitate

identification of candidate quantitative trait genes, but also

protect against spurious associations due to allelic association

among unlinked markers. Application of this novel process to

our high-density SNP map and two previously described F2

crosses between strains C57BL/6J (B6) and DBA/2J (DBA)

and between B6 ApoE−/− and C3H/HeJ (C3H) ApoE−/−

results in the identification of Insig2 as a strong candidate gene

for total plasma cholesterol levels.

Results

SNP distribution

The exact number of SNPs between all common inbred

strains of mice is still unknown, although the completion of the

sequencing of the mouse genome and several large-scale SNP

discovery studies have begun to shed light on the extent of DNA

differences among common laboratory strains of mice. The

distribution of SNPs across the genome does not appear to be

random, but rather bimodal, with areas of high SNP frequency

(~40 SNPs per 10 kb) and low SNP frequency (~0.5 SNP per 10

kb) corresponding to areas of shared ancestry [8]. From some of

the more comprehensive studies aimed at assessing DNA

variation in common strains of mice, SNP frequency estimates

range from 1 SNP every 700+ bp for related inbred strains of

mice to 1 SNP every 150 bp for more distantly related mice [9]. These

estimates are consistent with our own estimates obtained from

the Celera Mouse Genome Database, in which more than

2,400,000 SNPs are represented for five inbred strains of mice.

In August 2004, Celera and NCBI had approximately

60,000 high-confidence SNPs represented in their publicly

available databases. Others have significantly leveraged the

complete set of SNPs represented in the Celera database to

construct a map of 10,990 SNPs across 48 inbred lines of mice

and demonstrated the utility of such high-density maps in

carrying out in silico mapping studies [10]. While the

proprietary Celera database represented in the commercially

available Celera Discovery System offers the most compre-

hensive source of SNP data available in mouse, the focus on

only 5 inbred strains of mice may limit the ability to derive

more general ancestral relationships among more varied strains

of mice due to strain-specific biases in the SNPs represented.

We sought to construct a high-density SNP map that was

complementary to those provided by Celera and Pletcher et al.

[10]. In an attempt to reduce strain-specific biases, we chose to

limit the selection of SNPs coming from the Celera database,

so that SNPs were chosen from this database only when

necessary to cover a specific region of the genome that could

not otherwise be covered.

Table 1
Summary description of SNP distribution per chromosome for the high-density SNP map described in the text

Chromosome   Length in Mb (a)   No. of SNPs   Average distance in kb (SD)
1            195                866           230 (261)
2            182                795           228 (282)
3            161                677           242 (270)
4            154                632           244 (299)
5            149                613           245 (291)
6            150                588           256 (321)
7            133                610           225 (292)
8            129                565           226 (234)
9            124                522           242 (277)
10           131                447           291 (337)
11           122                586           207 (269)
12           115                472           238 (250)
13           116                513           224 (234)
14           117                486           239 (265)
15           104                425           249 (286)
16           99                 447           216 (279)
17           94                 409           234 (258)
18           91                 386           234 (286)
19           61                 311           197 (270)

(a) Based on Ensembl.

The 60,000 SNPs available in the public domain (dbSNP) at the time we designed our SNP panel give rise to an average SNP density of 1 SNP every ~40 kb. We selected an initial set of 15,360 SNPs for genotyping by Illumina from this set (see Methods). Of these 15,360 SNPs we were able to map 12,572 to the mouse genome (NCBI Build 32) with high confidence. From this reduced set of SNPs Illumina was able to convert 10,350 into valid genotyping assays. This set of SNPs provided for good coverage on all chromosomes, with an average SNP

spacing of 200 kb (Table 1). On average, roughly 33% of the

validated SNPs were informative between any pair of strains,

allowing for the selection of high-density SNP panels for use in

genotyping experimental crosses constructed from any pair of

strains represented in our set. We have seen that increasing the

number of markers used in a QTL analysis from the

recommended 100 to 200 for a moderate-sized cross [11] to

1500 or more can improve linkage results by increasing the

accuracy of the positioning of QTL (manuscript in preparation).
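As a rough illustration of what "informative between any pair of strains" means in practice, the sketch below computes the fraction of SNPs at which two inbred strains carry different alleles, and averages it over all strain pairs. The function names and the toy genotypes are hypothetical and are not taken from the study; missing calls are simply skipped, which is an assumption of this sketch.

```python
from itertools import combinations


def informative_fraction(geno_a, geno_b):
    """Fraction of SNPs at which two inbred strains carry different alleles.

    Missing calls (None) are skipped, so the fraction is taken over the
    SNPs typed in both strains.
    """
    typed = [(a, b) for a, b in zip(geno_a, geno_b)
             if a is not None and b is not None]
    if not typed:
        return 0.0
    return sum(a != b for a, b in typed) / len(typed)


def mean_pairwise_informative(genotypes):
    """Average informative fraction over all pairs of strains."""
    pairs = list(combinations(genotypes, 2))
    total = sum(informative_fraction(genotypes[s], genotypes[t])
                for s, t in pairs)
    return total / len(pairs)


# Hypothetical toy genotypes: one allele per SNP per inbred strain.
toy = {
    "B6":  ["A", "C", "G", "T", "A", "C"],
    "DBA": ["A", "T", "G", "T", "G", "C"],
    "C3H": ["G", "T", "G", None, "G", "C"],
}

if __name__ == "__main__":
    print(informative_fraction(toy["B6"], toy["DBA"]))
    print(mean_pairwise_informative(toy))
```

The SNPs that differ between a chosen pair of strains are the ones that could be carried forward into a genotyping panel for a cross between them.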

Haplotype blocks

Due to the close genetic relationship among the various

strains, one would expect large areas of the mouse genome to

be identical by descent between any two common strains. One

recent study by Petkov et al. [12] reported an average block

size of 6 Mb (personal communication). If this were in fact the

case, one would expect 1 SNP per 200 kb to capture a large

part of the haplotype structure in mice. However, others have

provided data indicating a much finer haplotype structure. Park

et al. [13] used haplotype block analysis in an attempt to map a

quantitative trait gene for a complex disease trait mapping to

the proximal end of chromosome 19. Their data appear to be

consistent with haplotype blocks that are greater than 80 kb,

but with the largest observed haplotype block spanning only

~150 kb in their chromosome 19 region of interest. Because

the experimental approach employed by Park et al. did not

sequence the entire region of interest, but instead investigated

only short amplicons across the region, they could not rule out

haplotype blocks of substantially shorter length. In yet another

study, Wade et al. [8] found only two different haplotypes

among 22 of 27 sequence segments of length 500–1000 bp,

highlighting again the potential to use haplotype-based analysis

to identify candidate genes, given that regions of the mouse

genome that are identical between any two given strains

exhibiting phenotypic differences can be excluded as candidate

regions underlying such differences. Others have reported

increased complexity in the mouse genome, including micro-

heterogeneity in otherwise identical stretches of genomic

sequence among different strains of mice, highlighting the

potential risks in creating haplotype blocks that may contain

small varying genomic segments [14,15].

We investigated the relationship between strains by looking

at regions of the genome between two strains that were not

informative (potentially identical by descent). We calculated

the block distance as the distance between the two polymorphic

markers most closely flanking the stretch of homozygous

markers defining the block. Of course, the average block length

defined in this way is an upper bound on the true block length,

given that informative SNPs that were not genotyped could

have broken up the homozygous regions into smaller blocks.

To illustrate this point we estimated block lengths using

varying densities of SNPs for a given region. In addition, we

constructed similar blocks based on all of the SNPs available

from NCBI and Celera for the DBA/2J and C57BL/6J strains

of mice in which regions identical by descent (IBD) can be

precisely identified given the high density of SNPs [5]. In Fig.

1, described below in more detail, we demonstrate that our

high-density SNP panel can be used to accurately identify

regions that are IBD between two strains of mice. Using the

comprehensive set of SNPs available in the Celera RefSNP

database for the B6 and DBA strains of mice, wherein the

genomes for these two strains of mice have been completely

sequenced, all IBD regions between these two strains were

identified. Comparison to the regions of sharing constructed

from our high-density SNP map highlights that our dense SNP

map was able to identify the major IBD blocks in this region

(highlighted in Fig. 1).
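The block definition used above (the distance between the two polymorphic markers most closely flanking a run of shared markers) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name and toy data are hypothetical, and runs that touch either end of the region are skipped because they lack a flanking polymorphic marker on one side.

```python
def shared_blocks(positions, geno_a, geno_b):
    """Return (left_flank, right_flank, length) for each run of shared markers.

    positions: marker coordinates, sorted in ascending order.
    A block is a run of one or more markers with identical alleles in both
    strains; its length is the distance between the polymorphic markers
    flanking the run. Runs at either end of the region are skipped
    (an assumption of this sketch).
    """
    blocks = []
    last_poly = None   # position of the most recent polymorphic marker
    in_run = False     # True while inside a run of shared markers
    for pos, a, b in zip(positions, geno_a, geno_b):
        if a == b:
            in_run = True
        else:
            if in_run and last_poly is not None:
                blocks.append((last_poly, pos, pos - last_poly))
            last_poly = pos
            in_run = False
    return blocks


# Hypothetical markers (positions in kb) for two strains.
positions = [100, 300, 500, 700, 900, 1100, 1300]
strain_a = ["A", "A", "C", "G", "T", "T", "A"]
strain_b = ["G", "A", "C", "T", "T", "T", "C"]

dense = shared_blocks(positions, strain_a, strain_b)

# Dropping the polymorphic marker at 700 kb mimics a sparser panel:
# two blocks merge into one longer apparent block.
keep = [i for i, p in enumerate(positions) if p != 700]
sparse = shared_blocks([positions[i] for i in keep],
                       [strain_a[i] for i in keep],
                       [strain_b[i] for i in keep])

if __name__ == "__main__":
    print(dense)
    print(sparse)
```

The comparison between the dense and sparse calls shows why block lengths estimated this way are an upper bound: an untyped informative SNP can only split a block, never lengthen it.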

Relationship between strains based on hierarchical trees

To investigate the relationship between the 62 strains we

computed a cluster dendrogram using all of the SNPs across

the entire genome between any two strains as a measure of

similarity between them. Only one haplotype per strain was

used in the calculations since the mice were inbred, and the

number of observed heterozygotes (proportional to the

genotyping error rate in this case) was judged to be small

enough to have no significant impact on our conclusions.

Because SNPs were selected based on the information content

in the strains of interest, care must be exercised in interpreting

the tree as representative of genetic history. However, given

that our SNP selection process was not intentionally biased

toward any particular group of strains, it is likely to have

resulted in a less biased set of markers that captures more of the

true genetic history among the strains.

Fig. 1. Defining regions identical by descent (IBD) that highlight the haplotype structure through the chromosome 1 region identified in the text for the B6, DBA, and C3H strains of mice. The top is a graph of the SNP counts between B6 and DBA in the 120- to 130-Mb chromosome 1 region, in which the counts are specific to 50-kb windows moved through the region in 10-kb increments. The Celera genome sequence for the B6 and DBA strains was used to define the SNPs, providing a nearly complete map of DNA variation between these two strains [44]. Regions with consecutive counts that fall below 5 are considered IBD between the B6 and the DBA strains of mice [5]. The second graph is the same as the first, but smoothed by computing a moving average through the interval. The third graph indicates regions that are called IBD using the frequency plot given in the second graph. Regions in which the count dropped below 5 were considered IBD (curve set to 0), while all other regions were considered heterozygous (curve set to 1). Panel 4 indicates IBD regions for the DBA and B6 strains of mice defined by the high-density SNP map described in the text. Yellow indicates IBD regions as defined in the text (homozygous blocks), while green indicates regions that are polymorphic between the two strains. The IBD regions identified by the high-density SNP map in this fairly narrow interval can be seen to approximate well the IBD regions identified using the complete data from the Celera database. The bottom panel is similar to the fourth but for the B6 and C3H strains of mice. The vertical black line near 124 Mb indicates the location of the Insig2 gene.

The first similarity measure used to assess relationships among the mouse strains was computed as the average length over all blocks of SNPs in which a given strain pair shares the same alleles. We expect the length of these “blocks” to be proportional to the extent of shared ancestry (i.e., the longer the

shared blocks, the more closely related two strains are).

Hierarchical trees were generated for each chromosome using

R and its standard hclust function. Trees for the 19

autosomal chromosomes are available in the supplementary

material, but a representative tree is shown in Fig. 2A for

chromosome 1. In looking at the similarity between strains

using the longest block algorithm, it is interesting to note that

the trees vary by chromosome, suggesting that stochastic,

perhaps genetic, and even epigenetic processes were at play in

fixing DNA variation among the strains. This interchromosome

tree variation warrants further investigation to assess whether

the trees are significantly different or statistically indistin-

guishable. For ease of interpretation we have colored the strains

by group, as categorized by Beck et al. [16]. It is apparent that

the clusters generally reflect the known relationships among

the different strains of mice. This fact at least partially validates

the choice of distance metric.
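The clustering step can be illustrated with a small agglomerative procedure. The sketch below is an assumption-laden stand-in rather than the analysis described above: it is written in Python instead of R, it uses the fraction of differing SNPs as the dissimilarity (the text uses a shared-block-length similarity), and it applies complete linkage, which is the default method of R's hclust. Strain names and genotypes are hypothetical.

```python
from itertools import combinations


def allele_distance(g1, g2):
    """Fraction of SNPs at which two strains differ (stand-in dissimilarity)."""
    return sum(a != b for a, b in zip(g1, g2)) / len(g1)


def complete_linkage(genotypes):
    """Agglomerative clustering with complete linkage.

    Returns the merge history as (cluster_a, cluster_b, height) tuples,
    where each cluster is a tuple of strain names.
    """
    dist = {frozenset(pair): allele_distance(genotypes[pair[0]],
                                             genotypes[pair[1]])
            for pair in combinations(genotypes, 2)}

    def linkage(c1, c2):
        # Complete linkage: largest distance between any two members.
        return max(dist[frozenset((x, y))] for x in c1 for y in c2)

    clusters = [(name,) for name in sorted(genotypes)]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: linkage(*pair))
        merges.append((c1, c2, linkage(c1, c2)))
        merged = tuple(sorted(c1 + c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
    return merges


# Hypothetical genotype strings (one character per SNP).
toy = {
    "lab1":  "AAAAAAAA",
    "lab2":  "AAAAAAGG",
    "wild1": "GGGGGGAA",
    "wild2": "GGGGGTTT",
}

merges = complete_linkage(toy)

if __name__ == "__main__":
    for left, right, height in merges:
        print(left, right, height)
```

In this toy case the two hypothetical laboratory strains join first, the two hypothetical wild strains join second, and the final merge separates the two groups at the root, mirroring the kind of first bifurcation discussed for the real trees.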

Fig. 2. Relationships among inbred lines of mice inferred from the high-density SNP map. (A) Hierarchical tree highlighting relationships among 62 strains of mice based on chromosome 1 SNPs. (B) Phylogenetic tree for 62 common inbred strains of mice based on SNPs in chromosome 1.

The main characteristic shared across all 19 chromosomes is that of separating “wild-derived” mice (represented in brown) from the remaining inbred mice at the first bifurcation. At most it took two bifurcations in the tree to separate all wild strains from the standard inbred laboratory strains. As seen in Fig. 2A, the first bifurcation separates the wild strains Mus musculus musculus, M. m. molossinus, M. spicilegus (formerly hortulanus), M. spretus, and M. m. castaneus from all other strains. The second bifurcation separates the M. m. domesticus strains from the common inbred laboratory strains, indicating a closer genetic distance from the common inbred laboratory strains to the M. m. domesticus subspecies. The standard laboratory mice are reported to be descendants of both M. m. domesticus and M. m. musculus [16]. The M. m. domesticus group, based on the chromosome 1 tree in Fig. 2A, includes strains PERA, PERC, WSB, WMP, TIRANO, and ZALENDE as well as IS. IS is only partly M. m. domesticus and it is interesting to see it cluster with the other domesticus strains. More surprising is

the clustering pattern of M. m. poschiavinus. RBF clusters

with other wild-derived strains only for chromosomes 6, 8, 9,

12, 13, 14, and 18. RBF is adjacent to PL for three of the

chromosomes (1, 3, and 7), RF for two chromosomes (2 and

17), and M16 for two other chromosomes (5 and 16).

Otherwise the separation between the “M subspecies” is

exactly what is expected: M. spicilegus and M. spretus are

the most distant strains in our tree, both strains having

diverged over 2 million years ago, followed by M. m.

domesticus, castaneus, and musculus, which diverged ap-

proximately 1 million years ago [16]. Our data further

support the view that M. m. molossinus would be derived

from M. m. musculus and M. m. castaneus [12].

Similar conclusions were reached when looking at the

genealogy tree (Fig. 2B). We reconstructed a mouse

phylogenetic tree using Phylip 3.6. Kimura’s two-parameter

model was used for calculation of genetic distance using the

SNPs on chromosome 1 [17]. The neighbor method was

selected for building the phylogenetic trees [18]. Compared to

Petkov et al. [12], our tree does not separate the group 1 and

group 2 mice they reported (the Bagg albino derives from the

Swiss mice). The minor differences between the two trees are

likely due to biases in SNP selection, so that the Petkov tree

is able to separate these closely related groups because the

SNP selection they employed was more biased toward these

strains.
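For readers unfamiliar with the Kimura two-parameter model mentioned above, the distance between two aligned sequences is d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P is the proportion of sites that differ by a transition and Q the proportion that differ by a transversion. The sketch below is a minimal stand-alone version of that formula, not a reproduction of the Phylip run; the function name and toy sequences are hypothetical. Note that when only preselected SNP sites are supplied, as here, the resulting values are inflated relative to per-base distances and are mainly useful for comparing strains with one another.

```python
import math

BASES = set("ACGT")
# Transitions are purine-purine (A<->G) or pyrimidine-pyrimidine (C<->T).
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned allele sequences.

    Sites where either strain lacks an A/C/G/T call are ignored.
    Raises ValueError (from math.log) if the sequences are too divergent
    for the model to give a finite distance.
    """
    sites = [(a, b) for a, b in zip(seq_a, seq_b)
             if a in BASES and b in BASES]
    n = len(sites)
    if n == 0:
        raise ValueError("no comparable sites")
    diffs = [frozenset((a, b)) for a, b in sites if a != b]
    n_transitions = sum(d in TRANSITIONS for d in diffs)
    p = n_transitions / n                  # transition proportion
    q = (len(diffs) - n_transitions) / n   # transversion proportion
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Hypothetical example: 20 sites, 2 transitions (A/G), 1 transversion (A/C).
strain_x = "A" * 20
strain_y = "GG" + "C" + "A" * 17

if __name__ == "__main__":
    print(k2p_distance(strain_x, strain_y))
```

A matrix of such pairwise distances is the kind of input a neighbor-joining style tree builder takes.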

Fig. 3. Map of allelic association for markers on chromosomes 8 and 14. The black points depict marker pairs for which R² is greater than 0.4 (corresponding to a p value less than 2.3 × 10^-8), where the upper triangle of the matrix corresponds to the observed marker data and the lower triangle corresponds to the permuted marker data. There are six regions indexed by roman numerals representing correlations between marker pairs: (I) on chromosome 8 for observed genotype data, (II) on chromosome 8 for permuted data, (III) on chromosome 14 for observed genotype data, (IV) on chromosome 14 for permuted genotype data, (V) on chromosomes 8 and 14 for observed genotype data, and (VI) on chromosomes 8 and 14 for permuted genotype data. Highlighted by the red boxes are LD blocks on chromosomes 8 and 14 that are strongly associated with each other.

Strong nonrandom associations observed among unlinked markers

We next computed all pair-wise correlations as a measure of linkage disequilibrium (LD) between SNP markers over the 62 inbred strains of mice to assess the degree of nonrandom association between alleles at different loci (both linked and unlinked). After a filter was set to eliminate SNPs with minor allele frequencies less than 15%, a total of 6641 of the 10,350 genotyped SNPs were selected for this association analysis. Of the pairs considered, 930,037 (4.2%) were significantly correlated at the 0.0001 significance level (2204 would be expected by chance). To estimate the false discovery rate (FDR) [19] at the 0.0001 level, we permuted the genotypes for each marker 1000 times and computed all pair-wise correlations for each permuted marker set. The mean number of pairs with correlations significant at the 0.0001 level over all permuted sets was 8901, leading to an FDR of just under 1%. Interestingly, of the 930,037 pairs that were significantly correlated, 836,315 (nearly 90%) comprised SNPs located on different chromosomes, demonstrating a degree of allelic association that has not been previously reported for the common strains of inbred mice.

Fig. 3 highlights the extent of “long-range” allelic association among markers defining LD blocks on chromosomes 8 and 14. Highlighted in Fig. 3 are two LD blocks on

chromosomes 8 and 14 that are strongly associated with

each other. The LD block highlighted on chromosome 8 is

associated with many regions over both chromosomes, a trend

that extends to all other chromosomes examined. This type of

behavior highlights that there are some LD blocks or specific

markers that are highly connected with many different regions

of the genome (“hub” nodes of the marker interaction

network), suggesting a scale-free topology [20] for the network

of associated markers (Supplementary Fig. 1). Nonrandom

association of alleles at unlinked markers is a well-known

phenomenon in human genetic studies. Clustered sampling,

epistatic interactions among two or more loci, and population

admixture are all well-recognized mechanisms that can lead to

the type of behavior exhibited in Fig. 3 [21].
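The pair-wise correlation analysis and permutation-based FDR estimate described in this section can be sketched as follows. This is a simplified illustration under stated assumptions: alleles are coded 0/1, the significance test is replaced by a plain R² threshold, and the helper names and toy markers are hypothetical. The final lines only redo the arithmetic on the counts quoted in the text.

```python
import math
import random


def r_squared(x, y):
    """Squared Pearson correlation between two 0/1-coded allele vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker: correlation undefined, treat as 0
    return sxy * sxy / (sxx * syy)


def count_pairs_above(markers, threshold):
    """Number of marker pairs whose R^2 exceeds the threshold."""
    return sum(r_squared(markers[i], markers[j]) > threshold
               for i in range(len(markers))
               for j in range(i + 1, len(markers)))


def permuted_mean_count(markers, threshold, n_perm, seed=0):
    """Mean count of pairs above threshold after shuffling strain labels
    independently for each marker (breaks any real association)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_perm):
        shuffled = [rng.sample(m, len(m)) for m in markers]
        total += count_pairs_above(shuffled, threshold)
    return total / n_perm


def empirical_fdr(mean_null_count, observed_count):
    """Ratio of the permutation-based null count to the observed count."""
    return mean_null_count / observed_count


# Hypothetical toy markers typed in eight strains (0/1 allele coding).
toy_markers = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]

# Arithmetic on the counts quoted in the text.
n_pairs = math.comb(6641, 2)
expected_by_chance = n_pairs * 1e-4
fdr = empirical_fdr(8901, 930037)

if __name__ == "__main__":
    print(count_pairs_above(toy_markers, 0.4))
    print(permuted_mean_count(toy_markers, 0.4, n_perm=200))
    print(n_pairs, expected_by_chance, fdr)
```

Permuting each marker independently preserves allele frequencies while removing association between markers, which is why the permuted counts serve as the null reference.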

Two pieces of data suggest that at least part of the

observed allelic association may be due to epistatic selection

of unlinked alleles during the construction of the inbred

strains of mice. First, the SNP interaction network appears to

exhibit scale-free properties (Supplementary Fig. 1), which

may indicate specific regions that were strongly selected for

during the inbreeding process and interacting epistatically

with many other regions across the genome. Second, when

we examined genes in the associated chromosome 8 and 14

LD blocks (Fig. 3) for enrichment of GO Biological Process

categories, we found that fertilization pathways were

significantly enriched in these two regions. Specifically, there

were 813 genes represented in the chromosome 8 and 14 LD

regions and 128 genes represented in the GO Biological

Process fertilization category, of 28,069 genes annotated in

the genome [22]. Of the 813 genes in the LD regions, 16

were in the fertilization category (4 would be expected by

chance). The probability of observing this enrichment by

chance is 1.6 × 10^-6, as determined by the Fisher exact test.

Whatever the explanation for the long-range allelic associ-

ation, the extent of correlation among these markers has the

potential to lead to spurious associations in association-based,

genome-wide QTL mapping studies.
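The enrichment calculation reported above corresponds to the upper tail of a hypergeometric distribution, which is what a one-sided Fisher exact test evaluates. The sketch below recomputes the expected count and the tail probability from the gene counts quoted in the text (813 genes in the LD regions, 128 fertilization genes, 28,069 annotated genes, 16 observed in the overlap). It is an illustrative check with a hypothetical function name, not the authors' code, and small differences from the published value may arise from the exact test variant used.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when `draws` genes are sampled without replacement from
    `population` genes, `successes` of which belong to the category."""
    denom = comb(population, draws)
    upper = min(successes, draws)
    numer = sum(comb(successes, i) * comb(population - successes, draws - i)
                for i in range(k, upper + 1))
    return numer / denom


# Counts quoted in the text.
TOTAL_GENES = 28069
FERTILIZATION_GENES = 128
LD_REGION_GENES = 813
OBSERVED_OVERLAP = 16

expected_overlap = LD_REGION_GENES * FERTILIZATION_GENES / TOTAL_GENES
p_enrichment = hypergeom_upper_tail(OBSERVED_OVERLAP, TOTAL_GENES,
                                    FERTILIZATION_GENES, LD_REGION_GENES)

if __name__ == "__main__":
    print(expected_overlap)
    print(p_enrichment)
```

The expected overlap comes out close to 4 genes, and the tail probability can be compared with the value reported in the text.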

From linkage to in silico QTL mapping, an integrated

approach to mapping genes for complex traits

One use of the high-density SNP panel described herein is in

carrying out association-based (in silico) QTL mapping studies

among the inbred strains represented in the set. However, the

limited number of strains represented in the set of strains

genotyped over this high-density panel, combined with testing

many markers in the context of a genome-wide scan (multiple

testing issues) and the problem of allelic association among

unlinked markers, not only lead to reduced power to identify

regions of the genome associated with phenotypes of interest,

but can also lead to spurious associations between markers and

disease phenotypes. To address these limitations we propose a

multistep strategy to carry out association-based QTL mapping

studies.

The first step involves analysis of one or more experimen-

tal cross populations to identify QTL for disease phenotypes

of interest. The identification of QTL for disease phenotypes

serves to focus attention on a significantly reduced portion of

the genome, in which association studies can then be carried

out to fine-map the QTL regions of interest. A significant

number of QTL studies have been carried out in experimental

cross populations, so that these populations make for a useful

resource in focusing attention on regions known to be

associated with disease phenotypes. Further, QTL analyses

in experimental crosses such as an F2 intercross are not

subject to allelic association between genes on different

chromosomes, such as that depicted in Fig. 3. However, in

these types of experimental crosses, the precise positioning of

QTL is well known to be problematic [11], so that following

up detection of QTL in this setting with an association-based

approach could provide a more rapid method to fine-map the

QTL regions. This complementary linkage/association ap-

proach is a common strategy employed in human genetic

studies [23].

The second step involves carrying out genetic association

studies in the QTL support intervals of interest. The haplotype

structure among the inbred strains of mice is used in this

type of analysis to narrow more rapidly the precise region

controlling for the disease phenotypes of interest, as previously

described by others [6,7,10]. However, unlike these and related

studies, the power to detect association will be significantly

increased given the focus on a single QTL region (which nearly

eliminates multiple testing issues), and there will be less danger

of spurious associations with unlinked regions that are not

associated with the disease phenotypes of interest.

The final step leverages expression data in the experimen-

tal cross populations as a way to prioritize positional

candidates supported by haplotypes identified as associated

with the disease phenotypes of interest. Several groups have

proposed a straightforward strategy for identifying candidate

susceptibility genes for QTL of interest by integrating gene

expression and genotypic data in the context of experimental

cross populations [2,4,24–29]. Briefly, genes physically

located in the regions supported by the QTL of interest are

identified. The corresponding expression traits from a given

tissue for these genes are then treated as quantitative traits in

the experimental cross setting, and eQTL are mapped for

these expression traits. Genes can be considered experimen-

tally supported positional candidates identified in a complete-

ly objective fashion if they have the following properties: (1)

detected as expressed in the tissue of interest, (2) physically

residing in the QTL support interval for the phenotypes of

interest, (3) correlated with the phenotype of interest, and (4)

exhibiting cis eQTL behavior.
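The four properties listed above amount to a conjunctive filter over per-gene evidence. The sketch below illustrates that logic only. The record fields, the thresholds (`alpha`, `min_cis_lod`), and the example genes are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class GeneEvidence:
    symbol: str
    expressed_in_tissue: bool    # property 1
    position_mb: float           # used for property 2
    trait_correlation_p: float   # used for property 3
    cis_eqtl_lod: float          # used for property 4


def is_supported_candidate(gene, interval_mb, alpha=0.01, min_cis_lod=4.3):
    """Return True only if the gene satisfies all four properties.
    `alpha` and `min_cis_lod` are illustrative thresholds."""
    start, end = interval_mb
    return (
        gene.expressed_in_tissue
        and start <= gene.position_mb <= end
        and gene.trait_correlation_p < alpha
        and gene.cis_eqtl_lod >= min_cis_lod
    )


# Hypothetical genes, one failing each of properties 1 to 3.
genes = [
    GeneEvidence("GeneA", True, 124.0, 1e-6, 12.0),
    GeneEvidence("GeneB", True, 122.5, 0.20, 9.0),   # not correlated with trait
    GeneEvidence("GeneC", True, 140.0, 1e-4, 8.0),   # outside support interval
    GeneEvidence("GeneD", False, 126.0, 1e-3, 6.0),  # not expressed in tissue
]
candidates = [g.symbol for g in genes if is_supported_candidate(g, (120.0, 128.0))]
print(candidates)
```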

Application of the integrated linkage/association-based QTL

mapping strategy to identify Insig2 as a strong candidate gene

for total plasma cholesterol levels

In a complementary study we have mapped QTL for

metabolic traits and liver gene expression traits in F2 mice

constructed from the C57BL/6J (B6) and C3H/HeJ (C3H) strains

of mice on an ApoE null background [30]. Linkage analysis

was performed on various metabolic traits of interest to obesity.

Table 2
QTL analysis results for the BXH mice for the chromosome 1 locus defined in the text

Trait                 Chr.  Start         End     Max     LOD       R2            Add       Dom
Free fatty acids      1     0.5331        1.0281  0.799   6.747234  0.104713      16.22848  1.895533
Glucose               1     4.28 × 10⁻²   1.0281  0.7733  7.786703  0.119519      54.53892  11.004
Log(HDL cholesterol)  1     0.2144        1.0281  0.7787  9.282095  0.140734      0.212238  6.24 × 10⁻²
LDL + VLDL            1     0.2751        1.0281  0.6351  5.476793  8.57 × 10⁻²   127.9582  93.38116
Total cholesterol     1     0.2551        1.0281  0.6351  5.934453  9.25 × 10⁻²   135.1756  95.01039
Log(triglycerides)    1     0.5209        1.0281  0.7787  8.928773  0.136227      0.380037  −0.16505

The third and fourth columns define the region in Morgans supporting the linkage, the fifth column gives the position in Morgans for the peak lod score in the
linkage region, the sixth column gives the maximum lod score in the linkage region, the seventh column gives the fraction of variation explained by the QTL, and the
final two columns give estimates of the additive and dominance effects for the QTL.
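The lod score and the fraction of variation explained in Table 2 are linked under a simple regression model. Assuming a single-QTL normal model fitted by least squares (an assumption of this sketch, not a statement about the authors' software), LOD = −(n/2) log10(1 − R²), where n is the number of animals contributing to the fit. The snippet below inverts that relation for each row of Table 2; the function name is hypothetical.

```python
import math

# (LOD, R2) pairs copied from Table 2.
TABLE2 = {
    "Free fatty acids": (6.747234, 0.104713),
    "Glucose": (7.786703, 0.119519),
    "Log(HDL cholesterol)": (9.282095, 0.140734),
    "LDL + VLDL": (5.476793, 8.57e-2),
    "Total cholesterol": (5.934453, 9.25e-2),
    "Log(triglycerides)": (8.928773, 0.136227),
}


def implied_sample_size(lod, r2):
    """Solve LOD = -(n / 2) * log10(1 - R2) for n (regression-model assumption)."""
    return -2.0 * lod / math.log10(1.0 - r2)


implied_n = {trait: implied_sample_size(lod, r2) for trait, (lod, r2) in TABLE2.items()}
for trait, n in implied_n.items():
    print(f"{trait}: implied n = {n:.1f}")
```

If the implied n is similar across traits, the reported LOD and R2 columns are mutually consistent under this model.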

A.C. Cervino et al. / Genomics 86 (2005) 505–517

The strongest lod scores were observed on chromosome 1 for
free fatty acids, plasma glucose levels, plasma HDL cholesterol
levels, plasma LDL + VLDL cholesterol levels, total plasma
cholesterol levels, and triglyceride levels (Table 2). Fig. 4 gives
the chromosome 1 lod score curve for total cholesterol and
insulin levels. Although the lod peak covers a large interval of
about 40 cM (typical for an F2 cross), the region defined by a
one lod drop (99% confidence interval for true QTL location) is
5.5 cM (between 58.9 and 64.4 cM, corresponding to the
physical interval 120 to 128 Mb).

Fig. 4. Cis eQTL for the Insig2 gene (RIKEN cDNA 2900053I11) from the
BXH liver expression data overlapping cQTL for insulin and total cholesterol
level traits scored in the BXH animals. The curves represented at the top were
constructed using the full BXH data set, while the curves at the bottom were
constructed using only the female animals in the BXH set.
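The one-lod-drop interval mentioned above can be read off a lod curve by walking outward from the peak until the score falls more than one lod unit below the maximum. The sketch below shows that procedure on a made-up curve. The positions and scores are hypothetical and are not the BXH data.

```python
def lod_drop_interval(positions, lods, drop=1.0):
    """Return (left, peak, right) positions of the contiguous region around
    the peak in which the lod score stays within `drop` of the maximum."""
    peak = max(range(len(lods)), key=lods.__getitem__)
    threshold = lods[peak] - drop
    left = peak
    while left > 0 and lods[left - 1] >= threshold:
        left -= 1
    right = peak
    while right < len(lods) - 1 and lods[right + 1] >= threshold:
        right += 1
    return positions[left], positions[peak], positions[right]


# Hypothetical lod curve sampled every 2 cM.
positions_cm = [54, 56, 58, 60, 62, 64, 66, 68]
lod_scores = [2.0, 3.5, 5.1, 5.9, 5.6, 5.0, 3.9, 2.2]
interval = lod_drop_interval(positions_cm, lod_scores)
print(interval)
```

The resolution of the interval is limited by the spacing of the evaluated positions, so a denser scan gives tighter bounds.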

We performed genetic association tests between SNPs and

lipid traits represented in the JAX Mouse Phenome database

to attempt to support and narrow further the chromosome 1

linkage region identified in the BXH cross. We queried the

JAX Phenome database for experiments that were compara-

ble to the BXH cross and for which lipid traits had been

scored. The mice comprising the BXH data set were 24

weeks of age and had been on a high-fat Western diet for 14

weeks at the time the cholesterol and insulin phenotypes

were scored. Given this, we selected the following three sets

of data from the phenome database [31]: “chol_17” from the
Paigen2 set of experiments (MPD:99) consisting of total
cholesterol measurements on mice that were 25 weeks of age
and that had been on a high-fat diet for 17 weeks, “chol_fat”
from the Albers1 set of experiments (MPD:8) consisting of
cholesterol measurements on 12-week-old mice that had been
on a high-fat diet for 6 weeks, and “total_chol” from the

Paigen1 set of experiments (MPD:29) consisting of choles-

terol measurements on 16-week-old mice that had been on an

atherogenic diet for 8 weeks. Association analyses were

performed separately for these three data sets; where

appropriate, gender was considered as a covariate. Of the

different strains we had genotyped, 33 were represented in

the Paigen2 set, 12 in the Albers1 set, and 31 in the Paigen1

set. The results from single SNP association tests with the

total cholesterol traits are given in Table 3. Only a single

SNP at 121 Mb on chromosome 1 was identified as

significantly associated with cholesterol levels in all three
data sets. This site lies within the 99% confidence interval
for the QTL location (120 to 128 Mb).

Table 3
Tests for association between SNP markers with minor allele frequencies
greater than 1% in the chromosome 1 linkage region defined in the text and
total cholesterol levels

Data set  SNP        p value
Albers1   119144634  0.005294
Albers1   120992934  0.005294
Albers1   131027701  0.007862
Paigen1   119144634  0.004477
Paigen1   120992934  0.001748
Paigen2   70842743   0.009736
Paigen2   120992934  0.003204

SNP 120992934 is the only marker significantly associated (at the 0.01
significance level) with total cholesterol in all three data sets.
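The replication criterion used above, significance at the 0.01 level in every data set, is a set intersection over the per-data-set hits. The snippet below applies it to the p values listed in Table 3; the variable names are illustrative.

```python
ALPHA = 0.01

# SNP position -> p value, copied from Table 3.
association_results = {
    "Albers1": {119144634: 0.005294, 120992934: 0.005294, 131027701: 0.007862},
    "Paigen1": {119144634: 0.004477, 120992934: 0.001748},
    "Paigen2": {70842743: 0.009736, 120992934: 0.003204},
}

significant_sets = [
    {snp for snp, p in hits.items() if p < ALPHA}
    for hits in association_results.values()
]
replicated = set.intersection(*significant_sets)
print(sorted(replicated))  # [120992934]
```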

Further support for a cholesterol trait QTL in this region

is provided by a previously described cross between the B6

and DBA strains of mice (BXD) [3], which shows linkage to

this region for cholesterol levels, free fatty acids, aortic

lesions, and fat pad mass, with lod scores of 2.0, 1.5, 1.5,

and 3.0, respectively. Although lod scores of 2–3 may be

considered only suggestive in a genome-wide scan, they can

be considered significant in the context of a region-specific

test of a hypothesis generated by the BXH data. If we

assume that the same QTL is segregating in both the B6 × C3H and the B6 × DBA crosses, then further positional

information is provided by the high-density SNP map. This

QTL is unlikely to fall in a region that is IBD between B6

and C3H or between B6 and DBA. Fig. 1 shows three

relatively large blocks of IBD that exclude a considerable

portion of the QTL 1-LOD region, leaving the interval 121

to 125 Mb and small intervals near 120 and 128 Mb as the

most likely locations.

We identified all genes physically residing in the 1-LOD

(120 to 128 Mb) interval that (1) had gene expression values in

liver from the BXH cross that were significantly correlated

with cholesterol levels and (2) gave rise to a cis-acting eQTL in

the BXH liver data (Fig. 4). Only one gene in this interval was

found to meet these criteria, Insig2, which is located at about

124 Mb (Fig. 1). In addition, there was a significant cis-acting

eQTL for Insig2 liver expression in the BXD data giving rise to

a lod score of 4.7. This eQTL roughly explains the same

amount of variation as the cis-acting eQTL for this gene in the

BXH cross (the smaller BXD lod score reflects the fact that the

BXD cross is roughly 40% the size of the BXH cross), where

the C3H and DBA versions of this gene are showing increased

expression relative to the B6 version of the gene. Therefore, the

BXD data are completely consistent with the BXH and inbred

strain data and support Insig2 as a strong candidate gene

underlying the chromosome 1 cholesterol linkage. Insig2 is a

strong candidate susceptibility gene for cholesterol regulation

given the key roles of the Insig proteins in lipid metabolism

[32,33]. Furthermore, Insig2 is supported as a causal candidate

for these traits using the likelihood test for causality previously

described [2].

Table 4
Correlations between gene expression traits (Ins1, Ins2, Scap, Srebpc1, Srebpc2, Insig1, Insig2) and plasma protein levels (insulin, total cholesterol)

            INS2      Insulin   INS1      Total Chol     SREBP1
INS2        .         0.822964  0         0.796904       0.801376
Insulin     0.015572  .         0.917524  5.75 × 10⁻¹⁴   0.161077
INS1        0.978438  0.024302  .         0.701406       0.93651
Total Chol  0.014874  0.419804  0.045175  .              0.208941
SREBP1      0.019599  −0.08149  0.008504  0.0682         .
HDLC        0.049301  0.169336  0.073052  0.248179       −0.24365
SREBP2      −0.02595  −0.02545  −0.03464  −0.14182       0.249391
Insig1      −0.05377  0.142939  −0.05692  0.068291       0.371524
SCAP        0.029268  −0.29465  0.007832  −0.26951       0.423068
Insig2      −0.00983  0.424657  0.019525  0.343963       −0.40619

            HDLC          SREBP2        Insig1         SCAP           Insig2
INS2        0.435403      0.362521      0.301656       0.345414       0.759756
Insulin     0.031237      0.210432      0.010866       2.20 × 10⁻⁷    6.66 × 10⁻¹⁶
INS1        0.413371      0.410997      0.467686       0.75256        0.798459
Total Chol  7.36 × 10⁻⁵   0.001509      0.124138       1.24 × 10⁻⁶    1.03 × 10⁻⁹
SREBP1      7.38 × 10⁻⁵   5.35 × 10⁻⁶   1.12 × 10⁻¹⁰   3.71 × 10⁻¹⁴   6.59 × 10⁻¹²
HDLC        .             0.083233      0.953825       2.81 × 10⁻⁶    1.61 × 10⁻¹¹
SREBP2      −0.10635      .             0              0              0.000172
Insig1      −0.05974      0.499925      .              8.13 × 10⁻⁷    0.044492
SCAP        −0.27707      0.453157      0.290725       .              1.78 × 10⁻¹⁰
Insig2      0.378067      −0.21607      0.022155       −0.34113       .

The upper triangle of the matrix gives the p values for the correlations, and the lower triangle of the matrix gives the Pearson correlation coefficients. Diagonal cells are marked with a period.

Genes interacting with Insig2 further support its role in the
regulation of cholesterol

As described in the literature, the regulation of the
Srebp–Scap–Insig system is complex [33–36]. Insig1, Insig2,
Srebp-1c, and Srebp-2 affect the transcription of different lipid
and cholesterol metabolism genes. At physiological levels,
Srebp-1c plays a larger role in controlling the transcription of
lipogenic genes such as fatty acid synthase. Srebp-2 is a potent
activator of many cholesterol biosynthetic genes like HMG-CoA
reductase, HMG-CoA synthase, and squalene epoxidase,
in addition to many others. Only recently was Insig2a
discovered as a liver-specific isoform [33]. Yabe et al. showed
that the expression of the liver-specific isoform, Insig2a, is
repressed by insulin. In contrast, the expression of Srebp-1c
was up-regulated in the presence of insulin. The probe for
Insig2 represented on the gene expression array used in the
BXH cross is situated in the 3′ UTR of the gene and is
specific to Insig2a, the liver-specific isoform. Insig1 and
Insig2 are also regulated differently by sterols. Interestingly,
our results show that Insig2a and not Insig1 is associated with
plasma cholesterol levels. To investigate further the biological
significance of Insig2a in the BXH cross, we computed all
pair-wise Pearson correlations between the Srebp-1c, Srebp-2,
Scap, Ins1, Ins2, Insig1, and Insig2a gene expression traits as
well as plasma cholesterol and insulin levels (Table 4). Insig2a
expression in the liver is significantly correlated with plasma
insulin, HDLC, and total cholesterol levels. Insig2a expression
levels are also negatively correlated with Srebp-1, Srebp-2, and
Scap expression levels. The negative correlation between
Srebp-1 and Insig2a in our data is in line with the results by
Yabe et al. [33], in which Srebp-1 and Insig2a are oppositely
regulated by insulin.

To investigate further the different roles Insig1 and Insig2a
may play in vivo, we used our previously described Bayesian
gene network reconstruction method [4] to explore the gene
neighborhoods surrounding the Insig1 and Insig2a gene nodes
in the female liver transcriptional network for the BXH cross
(Fig. 5). Although Insig1 and Insig2a were in the same gene
network neighborhood, genes appearing downstream of them
were distinct. Genes most strongly connected to Insig1
included fatty acid synthase (Fasn), stearoyl-coenzyme A
desaturase 1 (Scd1), acetyl-CoA synthetase 1, ATP citrate lyase
(Acly), and other lipogenic genes (Fig. 5). These genes are
known to be activated by Srebp-1c. In contrast, genes most
closely related to Insig2a were generally cholesterol
biosynthetic genes, such as squalene epoxidase (Sqle), HMG-CoA
synthase, and Cyp51, among several others (Fig. 5). Srebp-2 is
known to activate cholesterol biosynthetic genes such as these.
To assess further the significance of the cholesterol synthesis
association observed in the Insig2a subnetwork, we examined
those genes in the BXH liver data that were significantly
correlated with Insig2 expression levels. Of the 14,424 most
transcriptionally active genes in the liver, 878 were
significantly correlated with Insig2 at the 0.01 significance level.
These genes were tested for enrichment of GO Biological
Process categories. Cholesterol metabolism was among the
most significantly enriched categories in this gene set
(uncorrected Fisher exact p value 9.9 × 10⁻¹¹; p value after
correction for multiple testing was 1.4 × 10⁻⁷). In addition, of
the 14,424 most transcriptionally active genes, 954 had eQTL
with LODs greater than 2 falling in a 15-cM window
containing Insig2, and of these, 123 were also significantly
correlated with Insig2. The probability we would observe this
overlap by chance is 4.7 × 10⁻¹⁶, as determined by the Fisher
exact test.

Fig. 5. Insig1/Insig2 subnetwork from the BXH liver Bayesian network. The Bayesian network was reconstructed as described in the text. The subnetwork was
obtained from the full liver expression network by identifying nodes within a pathlength of 4 of Insig1 or Insig2. Highlighted in the network are nodes corresponding
to genes involved in lipid and cholesterol metabolism. As described in the text, Insig2 is significantly associated with a set of genes that are enriched for those
involved in cholesterol metabolism (light blue nodes), whereas Insig1 is significantly involved in lipid synthesis (green nodes). The Insig1-specific network appears
to be involved in biological processes that are distinct from biological processes associated with Insig2.

Discussion

We have employed a novel procedure to identify genes

underlying quantitative traits using a comprehensive SNP map

constructed over 62 common strains of mice. By intersecting

F2 gene expression and clinical trait QTL data and publicly

available phenotype data on common inbred strains of mice,

we were able to narrow significantly a region of the genome

controlling for cholesterol levels in mice. Combining the

haplotype data with the genetics of gene expression data

generated in the BXH and BXD crosses, we were able to

identify Insig2 as a strong candidate susceptibility gene for the

cholesterol trait in the refined region. The process leading to

the identification of Insig2 was completely objective and data

driven, leveraging a number of novel data sources and

methods to identify the gene.

Among the interesting observations from our high-density

SNP map is the extensive long-range allelic association

among unlinked markers throughout the genome. The extent

of the allelic association has not been previously described in

the inbred strains of mice. While population admixture is a

plausible explanation for the extensive long-range allelic

association observed in this collection of inbred strains of

mice, it also seems likely that epistatic selection may have

played an important role. The interaction network of

associated markers appears to exhibit scale-free properties

(Supplementary Fig. 1), which may indicate specific regions

that were strongly selected in interaction with many other

regions across the genome during domestication and inbreed-

ing. Furthermore, epistatic selection related to hybrid

incompatibilities may have occurred as a consequence of

the origin of laboratory mice as a hybrid mixture of M. m.

domesticus from Western Europe and other subspecies (M.

m. musculus, molossinus, and castaneus) from Asia [8,37]. It

is well known that hybrids between these subspecies show

marked genetic incompatibilities (i.e., negative epistatic

interactions) that cause infertility, particularly in males

[38,39]. The genetic divergence between subspecies could

result in incompatibilities affecting a number of traits in the

hybrid progenitors of laboratory mice.


The identification of Insig2 as a susceptibility gene for

circulating levels of cholesterol using the fine-mapping strategy

described herein is supported by several additional lines of

evidence. First, the literature supports Insig2 as one of the key

genes involved in the regulation of lipid and cholesterol

metabolism. Second, our gene network showed that Insig2a is

associated with cholesterol biosynthesis genes, and liver gene

expression traits significantly correlated with Insig2 are

significantly enriched for genes involved in cholesterol

metabolism. Third, genes that are significantly correlated to

Insig2 are enriched for eQTL that link to the Insig2 gene, and

this set of genes is significantly enriched for genes involved in

cholesterol metabolism. Finally, the DBA strain carries the

same haplotype as the C3H strain in the region containing the

Insig2 gene, and the BXD data are completely consistent with

the BXH data and support Insig2 as at least one of the genes

controlling for cholesterol levels in the BXH chromosome 1

linkage region. In addition to cholesterol levels, Insig2 liver

expression was found to be significantly associated with

obesity and diabetes traits. From the network in Fig. 5 we

note that Insig2 is upstream of several genes (BC014805,

Socs2, and Mod1) testing as causal for obesity traits in the

BXD cross [1,2]. In a growth hormone deficient state, Socs2 is

known to increase fat mass in rodents and humans. Therefore,

our data also suggest that Insig2 may play a key role in

metabolic traits related to obesity and diabetes.

These data taken together provide strong evidence that

Insig2 is at least one of the genes underlying the chromosome 1

linkage in the BXH cross to cholesterol traits. It is interesting to

note that although Insig1 is a known gene involved in lipid and

cholesterol metabolism, we found Insig2a and not Insig1 to be

supported as a causal candidate for plasma cholesterol levels in

the BXH animals. The gene network we constructed for the

BXH data also showed Insig2a to be more closely associated

with cholesterol biosynthesis genes than Insig1, whereas Insig1

was more closely associated with lipogenic genes. This could

indicate that despite the apparent similarity in terms of binding

Scap and regulating the movement of Srebp’s into the Golgi for

processing, there may be some level of compartmentalization

regarding the roles of these genes, perhaps via more specific

interactions with the different Srebp’s. Alternatively, it may be

that Insig1 activity was not as polymorphic in the BXH and

BXD crosses, so that we were not as powered to detect

associations between Insig1 activity and the cholesterol traits in

these crosses.

Despite the strong evidence presented here supporting Insig2

as a causal candidate gene for cholesterol synthesis, the putative

functional role of this gene must be further validated using more

traditional methods such as gene knockouts or transgenics.

Isolating the C3H version of the Insig2 gene on the B6

background and demonstrating that these mice have increased

cholesterol levels over B6 control mice would provide the

necessary experimental support to conclude that Insig2 is at

least one of the quantitative trait genes underlying the

cholesterol linkages in the BXH cross. However, what was

accomplished via the application of our process was an

objective, completely data-driven approach to reducing the

number of genes to consider as candidates for the chromosome

1 cholesterol linkage. Other methods based on in silico QTL

mapping, QTL mapping in an F2 mouse population, or gene

expression experiments among the inbred strains of mice on

their own would not have supported Insig2 as strongly as a

susceptibility gene. It was the combination of multiple filters

over multiple orthogonal data sets that ultimately led to the

identification of this gene. The degree of pathway coherence

that exists among the genes linked to the Insig2 locus,

correlated with the expression of the Insig2 gene and correlated

with the cholesterol traits, provides additional confidence that

Insig2 is likely involved in the regulation of circulating

cholesterol levels. In the future, more integrative approaches

such as this may facilitate a more objective, data-driven

approach to the identification and prioritization of candidate

genes underlying QTL for complex traits of therapeutic interest.

Methods

Strain selection

A total of 63 inbred strains were selected based on their demonstrated

utility for biological and biomedical research, including their relevance as

models for common human diseases. The set of 63 strains consisted of A/J,

AKR/J, BALB/cJ, BKSdb, BTBR + (T) tf/tf, BUB/BnJ, C3H/HeJ, C3H/

HeSnJ, C57BL/10J, C57BL/6J, C57BLKS/J, C57BR/cdJ, C58/J, CAST/Ei,

CBA/J, CE/J, CZECHII/Ei, DBA/2J, DDY/Jcl, EL/SuzSeyFrkJ, FVB/NJ,

HTG/GosfSn, I/LnJ, ICR, IS/CamRk, 129P3/J, 129S1/SvImJ, 129SBE-M,

129X1/SvJ, KK/HlJ, LG/J, LLCA, LP/J, M16, MOLF/EiJ, MRL/MpJ, NOD/

LtJ, NON/LtJ, NZB/BlNJ, NZO/HILtJ, NZW/LaC, P/J, Mus Pahari/Ei,

PANCEVO/Ei, PERA/EiJ, PERC/EiJ, PL/J, PWK/Ph, RBF/DnJ, RF/J,

RIIIS/J, SAMR1, SAMP10, SAMP8, SJL/J, SKIVE/Ei, SM/J, SPRET/Ei,

SWR/J, TIRANO/Ei, WMP/Pas, WSB/Ei, and ZALENDE/Ei. DNAs from

SAMR1, SAMP10, and SAMP8 mice were sent to us by Jennifer Greenhall

of UCSD; DNAs from the ICR and M16 strains were sent to us by Daniel

Pomp of the University of Nebraska; DNA from BKSdb was sent to us by

Richard Davis of UCLA; and DNAs from the remaining 57 strains were

ordered from The Jackson Laboratory. Only 1 of the strains, Mus pahari,

failed the genotyping set and was therefore excluded from all of the analyses.

SNP selection

Reference SNPs from dbSNP (Build 118) were mapped to the public

(October 2003, NCBI Build 32) and Celera (release 13) assemblies as

described below. SNPs were selected from 800-kb windows across both

genomes where possible. When using the Celera assembly, Celera SNPs

(release 3.4) from their mouse genome assembly were used in addition to the

publicly available SNPs in dbSNP. Multiple SNPs were picked from

windows for which strain information was available to increase the

likelihood that the window would contain at least one informative SNP for

a wide range of strain pairings. Four reference strains were used (C57BL/6,

A, CAST, CZECHII) to represent a diverse population of common strains,

and SNPs were selected to provide informative markers for all strains

annotated vs the four reference strains in each window. SNPs within each

window were scored for a variety of parameters as described below, and the

top scoring SNPs were selected from each window. Celera SNPs selected for

genotyping were subsequently mapped to the public (October 2003, NCBI

Build 32) assembly as described below.

The SNP flanking sequences (Celera SNP and dbSNP) were mapped onto

the Celera mouse assembly and the public mouse assembly (mm4, October

2003 release http://genome.ucsc.edu) by BLASTN (1) with the word size

parameter set to 28 and an e value threshold of 1 × 10⁻³. SNP positions

matching the following criteria were stored: (1) the alignments were longer than

100 bp with 97% identity, (2) the genomic sequence at the SNP position


matched one of the annotated alleles for the SNP, and (3) the aligned sequence

included at least 50 bp on either side of the SNP.
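The three mapping criteria above can be expressed as a predicate over one alignment record. This is a minimal sketch: the record fields are hypothetical and do not correspond to any particular BLASTN output format, and "97% identity" is assumed to mean at least 97%.

```python
from dataclasses import dataclass


@dataclass
class SnpAlignment:
    snp_id: str
    aligned_length: int      # length of the alignment in bp
    percent_identity: float  # identity over the alignment, 0 to 100
    genomic_base: str        # assembly base at the SNP position
    alleles: tuple           # annotated alleles for the SNP
    left_flank: int          # aligned bp upstream of the SNP
    right_flank: int         # aligned bp downstream of the SNP


def passes_mapping_criteria(a, min_length=100, min_identity=97.0, min_flank=50):
    """Apply criteria (1) to (3); identity threshold assumed inclusive."""
    long_and_similar = a.aligned_length > min_length and a.percent_identity >= min_identity
    allele_matches = a.genomic_base in a.alleles
    flanked = a.left_flank >= min_flank and a.right_flank >= min_flank
    return long_and_similar and allele_matches and flanked


# Hypothetical records: the second fails the flank requirement.
records = [
    SnpAlignment("snp_1", 201, 99.5, "A", ("A", "G"), 100, 100),
    SnpAlignment("snp_2", 140, 98.0, "C", ("C", "T"), 30, 109),
    SnpAlignment("snp_3", 201, 99.0, "T", ("A", "G"), 100, 100),
]
kept = [r.snp_id for r in records if passes_mapping_criteria(r)]
print(kept)
```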

SNPs in a window were scored based on proximity to other SNPs or

repetitive, low-complexity sequence and by the quality of the alignment to the

genomic sequence. SNPs within 5 bases of a repeat sequence or a neighboring

SNP and SNPs that mapped ambiguously to multiple locations in the genome

were excluded from consideration. In addition, the validation status and

heterozygosity reported by dbSNP were used to prioritize SNPs within any

given window. Finally, SNPs predicted to yield poor genotyping results using

the Illumina genotyping platform were excluded from consideration, and the

highest scoring SNPs remaining were selected from each window.

DNA preparation and genotyping assay

DNA for each strain was isolated by commercial vendors and research

collaborators and sent to the Gene Expression Laboratory at Rosetta to assess

quantity. Each DNA sample was serially diluted (1:40, 1:400) in duplicate and

PicoGreen dsDNA assays (Molecular Probes) were performed on the samples.

Average DNA concentrations were determined using standard curves for the

assay. A 96-well microtiter plate containing 12 μg (>100 ng/μl) of each

individual DNA sample was prepared and genotyping for all SNPs was carried

out by Illumina as described previously [3].

BXD and BXH F2 crosses

The BXD F2 mouse population and associated liver gene expression data

used in this study have been previously described [3,40]. Briefly, an F2

population consisting of 111 mice was constructed from two inbred strains of

mice, C57BL/6J and DBA/2J. Only female mice were maintained in this

population. Mice were on a rodent chow diet up to 12 months of age and then

switched to an atherogenic high-fat, high-cholesterol diet for another 4 months.

At 16 months of age the mice were euthanized and their livers extracted for

gene expression profiling. The mice were genotyped at 139 microsatellite

markers uniformly distributed over the mouse genome to allow for the genetic

mapping of the gene expression and disease traits.

The BXH F2 mouse population and associated liver gene expression data

used in this study will be more fully described in a future publication [30]. This

population was constructed from C57BL/6J ApoE null (B6.ApoE−/−) mice
purchased from The Jackson Laboratory and from C3H/HeJ ApoE null
(C3H.ApoE−/−) mice generated by backcrossing B6.ApoE−/− to C3H for 10
generations. F1 mice were generated from reciprocal intercrossing between
B6.ApoE−/− and C3H.ApoE−/−, and F2 mice were subsequently bred by
intercrossing F1 mice. A total of 334 mice (169 female, 165 male) were bred. All
mice were fed Purina Chow containing 4% fat until 8 weeks of age and then
transferred to a “Western” diet containing 42% fat and 0.15% cholesterol for 16

weeks. Mice were sacrificed at 24 weeks. At death, livers were immediately

collected and flash-frozen in liquid N2. Genomic DNA was isolated from

kidney by phenol–chloroform extraction. An examination of the Celera

database identified over 1800 SNPs that showed variation between the B6 and

the C3H strains, and a complete linkage map for all 19 autosomes was

constructed using 1032 of these SNPs, giving rise to a genetic map with an

average density of 1.5 cM. Genotyping was conducted by ParAllele using the

molecular-inversion probe multiplex technique [41].

For the BXH plasma lipid measures, mice were fasted overnight before

blood was collected through retro-orbital vein puncture under isoflurane

anesthesia. Plasma total cholesterol, HDL cholesterol, and triglyceride were

measured with enzymatic assays as previously described [42].

All procedures were in accordance with the current National Research

Council Guide for the Care and Use of Laboratory Animals and were approved

by the UCLA Animal Research Committee.

RNA sample preparation, microarray hybridization, and expression

analysis

Total RNA from the BXD and BXH liver samples was purified from 25-mg

portions using an RNeasy Mini Kit according to the manufacturer’s instructions

(Qiagen, Valencia, CA, USA), as previously described for the BXD set [3].

Fluorescently labeled cRNA (5 μg) from each F2 animal in each cross was

hybridized against a pool of RNAs specific to each cross. The RNA pools for

each cross were constructed from equal aliquots of RNA from all animals in the

BXD cross and 150 randomly selected animals in the BXH cross. Array images

were processed as previously described to obtain background noise, single-

channel intensity, and associated measurement error estimates [43]. Expression

changes between two samples were quantified as log10 (expression ratio),

where the “expression ratio” was taken to be the ratio between normalized,

background-corrected intensity values for the two channels (red and green) for

each spot on the array. An error model for the log ratio was applied as

previously described to quantify the significance of differential expression

between two samples [43].

QTL mapping

QTL analyses on the gene expression traits and phenotypes in the BXD and

BXH crosses were carried out as previously described [3,30,40]. Statistical

procedures for the association-based (in silico) QTL mapping were implemented

in SPLUS 6.1. Cholesterol and insulin measures were treated as continuous

traits and were fitted to genotypic values for a given SNP using a generalized

linear model. When appropriate, sex was treated as a covariate and sex-by-

genotype interactions were modeled. An SNP and phenotype were considered

significantly associated if the likelihood ratio statistic between the full model

and the null model was significant. The null model was similar to the full

model, except that all parameters from the full model containing the genotype

random variable were removed.
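
The likelihood ratio comparison described above can be sketched in Python for the simplest case. This is not the SPLUS implementation used in the study. It assumes a Gaussian model with identity link, a single numerically coded genotype term (for example 0/1 for the two alleles among inbred strains), and no sex covariate or sex-by-genotype interaction, so that the full and null models differ by one parameter. The function name is hypothetical.

```python
import math


def lrt_snp_association(genotypes, phenotypes):
    """Likelihood ratio test of y ~ 1 + g (full) against y ~ 1 (null).

    Assumes Gaussian errors, so the statistic is n * ln(RSS_null / RSS_full),
    referred to a chi-square distribution with 1 degree of freedom.
    Returns (slope, likelihood_ratio_statistic, p_value).
    """
    n = len(phenotypes)
    if n != len(genotypes) or n < 3:
        raise ValueError("need matching genotype/phenotype lists of length >= 3")
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    rss_null = sum((y - mean_y) ** 2 for y in phenotypes)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    slope = sxy / sxx
    # Residual sum of squares after fitting the genotype term.
    rss_full = rss_null - slope * sxy
    if rss_full <= 0:
        # Perfect fit: the statistic is unbounded.
        return slope, math.inf, 0.0
    lr = max(n * math.log(rss_null / rss_full), 0.0)
    # Upper tail of a chi-square with 1 df: P(X > lr) = erfc(sqrt(lr / 2)).
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return slope, lr, p_value
```

Adding sex and a sex-by-genotype term would require a multiple-regression fit and a chi-square reference distribution with as many degrees of freedom as there are genotype-containing parameters removed in the null model.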

Additional statistical analyses

The Pearson correlation coefficient was computed between all phenotypes

and gene expression traits to identify expression traits that were significantly

correlated with relevant phenotypes or other gene expression traits. When

appropriate, gene expression and phenotypic values were adjusted for gender

effects before computing the Pearson correlations. Five hundred thirty-three

genes were identified as significantly differentially expressed in at least 20% of

the BXH liver samples, out of the more than 23,000 genes represented on the

microarray used in this experiment. A gene was determined to be significantly

differentially expressed in a given sample if it had an absolute value of the fold

change greater than 1.5, a mean log average intensity greater than −1.5, and a

mean log ratio that was significant at the 0.01 level. The 533 gene expression

traits in addition to 21 metabolic phenotypes were then provided as input into

our previously described Bayesian network reconstruction program, using the

parameter settings originally described by Zhu et al. [4]. That portion of the

network containing Insig1 and Insig2 is shown in Fig. 5.
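
The gene selection rule described above can be sketched in Python as follows. The data layout (a dictionary mapping each gene to one tuple per sample) and the function names are hypothetical. The sketch also assumes that the fold-change criterion is applied symmetrically on the log10 ratio scale and that the p value comes from an error model computed elsewhere.

```python
import math

FOLD_CHANGE_CUTOFF = 1.5       # absolute fold change must exceed this
LOG_INTENSITY_CUTOFF = -1.5    # mean log average intensity must exceed this
P_VALUE_CUTOFF = 0.01          # significance level for the mean log ratio
MIN_SAMPLE_FRACTION = 0.20     # gene must pass in at least 20% of samples


def is_differentially_expressed(log10_ratio, log_intensity, p_value):
    """Apply the three per-sample criteria to one gene in one sample."""
    fold_ok = abs(log10_ratio) > math.log10(FOLD_CHANGE_CUTOFF)
    return (fold_ok
            and log_intensity > LOG_INTENSITY_CUTOFF
            and p_value < P_VALUE_CUTOFF)


def select_active_genes(measurements):
    """Return genes passing the criteria in at least 20% of their samples.

    measurements: dict mapping gene name to a list of
    (log10_ratio, log_intensity, p_value) tuples, one per sample.
    """
    selected = []
    for gene, samples in measurements.items():
        if not samples:
            continue
        hits = sum(1 for s in samples if is_differentially_expressed(*s))
        if hits / len(samples) >= MIN_SAMPLE_FRACTION:
            selected.append(gene)
    return selected
```

The genes returned by such a filter would then be passed, together with the metabolic phenotypes, to the network reconstruction step.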

Acknowledgments

We thank Iris Hovatta (MRL, San Diego) and Jennifer

Greenhall (UCSD) for providing the SAM DNAs, Daniel

Pomp (UN) for providing the M16 and ICR DNAs, and

Richard Davis and Jake Lusis’ lab (UCLA) for providing

DNAs for a number of strains. We also thank the Rosetta Gene

Expression Laboratory for the microarray work.

Appendix A. Supplementary data

Supplementary data associated with this article can be found,

in the online version, at doi:10.1016/j.ygeno.2005.07.010.

References

[1] H. Allayee, A. Ghazalpour, A.J. Lusis, Using mice to dissect genetic

factors in atherosclerosis, Arterioscler. Thromb. Vasc. Biol. 23 (2003)

1501–1509.

[2] E.E. Schadt, et al., An integrative genomics approach to infer causal

associations between gene expression and disease, Nat. Genet. 37 (2005)

710–717.

[3] E.E. Schadt, et al., Genetics of gene expression surveyed in maize, mouse

and man, Nature 422 (2003) 297–302.

[4] J. Zhu, et al., An integrative genomics approach to the reconstruction of

gene networks in segregating populations, Cytogenet. Genome Res. 105

(2004) 363–374.

[5] S. Doss, E.E. Schadt, T.A. Drake, A.J. Lusis, Cis-acting expression

quantitative trait loci in mice, Genome Res. 15 (2005) 681–691.

[6] A. Grupe, et al., In silico mapping of complex disease-related traits in

mice, Science 292 (2001) 1915–1918.

[7] G. Liao, et al., In silico genetics: identification of a functional

element regulating H2-Ealpha gene expression, Science 306 (2004)

690–695.

[8] C.M. Wade, et al., The mosaic structure of variation in the laboratory

mouse genome, Nature 420 (2002) 574–578.

[9] K.A. Frazer, et al., Segmental phylogenetic relationships of inbred mouse

strains revealed by fine-scale analysis of sequence variation across 4.6 Mb

of mouse genome, Genome Res. 14 (2004) 1493–1500.

[10] M.T. Pletcher, et al., Use of a dense single nucleotide polymorphism map

for in silico mapping in the mouse, PLoS Biol. 2 (2004) e393.

[11] A. Darvasi, A. Weinreb, V. Minke, J.I. Weller, M. Soller, Detecting

marker-QTL linkage and estimating QTL gene effect and map location

using a saturated genetic map, Genetics 134 (1993) 943–951.

[12] P.M. Petkov, et al., An efficient SNP system for mouse genome

scanning and elucidating strain relationships, Genome Res. 14 (2004)

1806–1811.

[13] Y.G. Park, R. Clifford, K.H. Buetow, K.W. Hunter, Multiple cross and

inbred strain haplotype mapping of complex-trait candidate genes,

Genome Res. 13 (2003) 118–121.

[14] F.Y. Ideraabdullah, et al., Genetic and haplotype diversity among wild-

derived mouse inbred strains, Genome Res. 14 (2004) 1880–1887.

[15] B. Yalcin, et al., Unexpected complexity in the haplotypes of commonly

used inbred strains of laboratory mice, Proc. Natl. Acad. Sci. USA 101

(2004) 9734–9739.

[16] J.A. Beck, et al., Genealogies of mouse inbred strains, Nat. Genet. 24

(2000) 23–25.

[17] M. Kimura, A simple method for estimating evolutionary rates of base

substitutions through comparative studies of nucleotide sequences, J. Mol.

Evol. 16 (1980) 111–120.

[18] J. Felsenstein, Phylogenies from molecular sequences: inference and

reliability, Annu. Rev. Genet. 22 (1988) 521–565.

[19] G.R. Grant, J. Liu, C.J. Stoeckert Jr., A practical false discovery rate

approach to identifying patterns of differential expression in microarray

data, Bioinformatics 21 (2005) 2684–2690.

[20] A.L. Barabasi, Z.N. Oltvai, Network biology: understanding the cell's

functional organization, Nat. Rev. Genet. 5 (2004) 101–113.

[21] J. Ott, Analysis of Human Genetic Linkage, Johns Hopkins Univ. Press,

Baltimore, 1999.

[22] T. Hubbard, et al., The Ensembl genome database project, Nucleic Acids

Res. 30 (2002) 38–41.

[23] A. Helgadottir, et al., The gene encoding 5-lipoxygenase activating protein

confers risk of myocardial infarction and stroke, Nat. Genet. 36 (2004)

233–239.

[24] R.B. Brem, G. Yvert, R. Clinton, L. Kruglyak, Genetic dissection

of transcriptional regulation in budding yeast, Science 296 (2002)

752–755.

[25] E.J. Chesler, et al., Complex trait analysis of gene expression uncovers

polygenic and pleiotropic networks that modulate nervous system

function, Nat. Genet. 37 (2005) 233–242.

[26] R.C. Jansen, J.P. Nap, Genetical genomics: the added value from

segregation, Trends Genet. 17 (2001) 388–391.

[27] C.L. Karp, et al., Identification of complement factor 5 as a susceptibility

locus for experimental allergic asthma, Nat. Immunol. 1 (2000) 221–226.

[28] S.A. Monks, et al., Genetic inheritance of gene expression in human cell

lines, Am. J. Hum. Genet. 75 (2004) 1094–1105.

[29] M. Morley, et al., Genetic analysis of genome-wide variation in human

gene expression, Nature 430 (2004) 743–747.

[30] S. Wang, et al., Genetic and genomic analysis of a fat mass trait with

complex inheritance reveals marked sex-specificity, submitted for

publication (2005).

[31] K. Paigen, J.T. Eppig, A mouse phenome project, Mamm. Genome 11

(2000) 715–717.

[32] A.D. Attie, Insig: a significant integrator of nutrient and hormonal signals,

J. Clin. Invest. 113 (2004) 1112–1114.

[33] D. Yabe, R. Komuro, G. Liang, J.L. Goldstein, M.S. Brown, Liver-

specific mRNA for Insig-2 down-regulated by insulin: implications for

fatty acid synthesis, Proc. Natl. Acad. Sci. USA 100 (2003) 3155–3160.

[34] R.G. Anderson, J. Goldstein, M. Brown, From cholesterol homeostasis

to new paradigms in membrane biology, Trends Cell Biol. 13 (2003)

534–539.

[35] R. McPherson, A. Gauthier, Molecular regulation of SREBP function: the

Insig–SCAP connection and isoform-specific modulation of lipid

synthesis, Biochem. Cell Biol. 82 (2004) 201–211.

[36] T. Yang, et al., Crucial step in cholesterol homeostasis: sterols promote

binding of SCAP to INSIG-1, a membrane protein that facilitates retention

of SREBPs in ER, Cell 110 (2002) 489–500.

[37] L.M. Silver, Mouse Genetics: Concepts and Applications, Oxford Univ.

Press, New York, 1995.

[38] J. Britton-Davidian, F. Fel-Clair, J. Lopez, P. Alibert, P. Boursot,

Postzygotic isolation between the two European subspecies of the house

mouse: estimates from fertility patterns in wild and laboratory-bred

hybrids, Biol. J. Linnean Soc. 84 (2005) 379–393.

[39] J. Forejt, P. Ivanyi, Genetic studies on male sterility of hybrids between

laboratory and wild mice (Mus musculus L.), Genet. Res. 24 (1974)

189–206.

[40] T.A. Drake, et al., Genetic loci determining bone density in mice with diet-

induced atherosclerosis, Physiol. Genom. 5 (2001) 205–215.

[41] P. Hardenbol, et al., Highly multiplexed molecular inversion probe

genotyping: over 10,000 targeted SNPs genotyped in a single tube assay,

Genome Res. 15 (2005) 269–275.

[42] C.C. Hedrick, L.W. Castellani, C.H. Warden, D.L. Puppione, A.J. Lusis,

Influence of mouse apolipoprotein A-II on plasma lipoproteins in

transgenic mice, J. Biol. Chem. 268 (1993) 20676–20682.

[43] Y.D. He, et al., Microarray standard data set and figures of merit for

comparing data processing methods and experiment designs,

Bioinformatics 19 (2003) 956–965.

[44] R.J. Mural, et al., A comparison of whole-genome shotgun-derived mouse

chromosome 16 and the human genome, Science 296 (2002) 1661–1671.