
INTEGRATIVE METHODS FOR THE ANALYSIS OF

GENOME WIDE ASSOCIATION STUDIES

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Marc A. Schaub

June 2012

http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/qt820xd3631

© 2012 by Marc Andreas Schaub. All Rights Reserved.

Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.


I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Serafim Batzoglou, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Atul Butte

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

David Dill

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.


Abstract

Genome Wide Association Studies (GWAS) have identified over 4,500 common vari-

ants in the human genome that are statistically associated with diseases and other

phenotypical traits. Most identified associations, however, only have a small effect on

disease risk, and their relevance in a clinical setting remains the subject of extensive

debate. In this thesis I present three integrative analysis directions that extend on

GWAS by developing new methods, by using genotyping data to ask new questions,

and by integrating additional types of data to generate functional hypotheses about

the biological processes underlying associations.

First, I introduce a new classifier-based methodology that identifies similarities

in the genetic architecture of diseases. This method can successfully identify both

known and novel relationships between common diseases such as type 1 diabetes,

rheumatoid arthritis, hypertension and bipolar disease.

Second, I show how control individuals from a GWAS can be used to detect

genetic differences between the pseudoautosomal regions of chromosomes X and Y

in the general population, which can be attributed to differences in allele frequency

between the two sex chromosomes likely caused by selective pressure.

Finally, I present an approach that integrates experimental data generated by the

ENCODE consortium in order to identify functional Single Nucleotide Polymorphisms

(SNPs). These functional SNPs are associated with a phenotype, either directly or

through linkage disequilibrium, and overlap a functional part of the genome such as

a transcribed region or a transcription factor binding site. GWAS associations are

significantly enriched for functional annotations, and up to 80% of all associations


previously reported in a GWAS can be mapped to a functional SNP. For most asso-

ciations the functional SNP most strongly supported by experimental evidence is a

SNP in linkage disequilibrium with the reported association rather than the reported

SNP itself.


Acknowledgment

I would like to thank my advisor Serafim Batzoglou for his advice, feedback, encour-

agements and support throughout my graduate career, for giving me the freedom to

explore a broad range of research directions, and for bringing together such a truly

outstanding group of researchers. In particular, I would like to thank George Asi-

menos and Chuong Do for all their highly valuable advice early in my graduate career,

Anshul Kundaje for his advice and support during the second half of my thesis work,

and Irene Kaplow for having been such a fantastic summer student.

I would like to thank Atul Butte for all the feedback and encouragement through-

out my thesis work, for inviting me to his group meetings, for serving on my qualifying

examination, defense and reading committees, and for giving me the privilege of col-

laborating closely with two amazing Ph.D. students in his group, Marina Sirota and

Linda Liu. Working with Marina and Linda has certainly been the favorite part of

my research work at Stanford, and I’m very grateful for everything they have done to

make these joint projects such a successful and tremendously enriching experience.

I would like to thank Michael Snyder for giving me the opportunity to collaborate

with his group on the analysis of the ENCODE data, for his advice and support, and

for serving on my defense committee, Ross Hardison for his support of my work on

linking ENCODE and GWAS data, David Dill for serving on my qualifying exam,

defense and reading committees and Arend Sidow for chairing my thesis defense.

I would like to thank my friends and colleagues in the Batzoglou, Butte and Snyder

labs and in the Stanford biomedical research community for their support, encour-

agements, advice, feedback, and the many fruitful discussions we had about research

and life in general: Sarah Aerni, Andy Beck, Sivan Bercovici, Alan Boyle, Leticia


Britos, David Chen, Rong Chen, Tiffany Chen, Annie Chiang, Erik Corona, Michelle

Davison, Eugene Davydov, Omkar Deshpande, Joel Dudley, Robert Edgar, Megan El-

more, Sangeeta English, Patrick Flaherty, Jason Flannick, Chuan Sheng Foo, Eugene

Fratkin, Yael Garten, Andrew Gentles, Sam Gross, Adam Grossman, Philip Guo,

Manoj Hariharan, Lin Huang, Nadine Hussami, Robert Ikeda, Konrad Karczewski,

Dorna Kashef-Haghighi, Peter Kang, Purvesh Khatri, Keiichi Kodama, Andy Kogel-

nik, Sofia Kyriazopoulou-Panagiotopoulou, Wei-Nchih Lee, Daniel Li, Li Li, Max

Libbrecht, Irene Liu, Yuling Liu, Alex Morgan, Daniel Newburger, Tony Novak,

Jon Palma, Chirag Patel, Victoria Popic, Yannick Pouliot, Dmitry Pushkarev, Jesse

Rodriguez, Jon Rodriguez, David Ruau, Olga Russakovsky, Karen Sachs, Raheleh

Salari, Nicelio Sanchez-Luege, Shai Shen-Orr, Andreas Sundquist, Silpa Suthram,

Nick Tatonetti, Rob Tirrell, Shivkumar Venkatasubrahmanyam, Dan Webster and

Noah Zimmerman.

My research work would not have been possible without the outstanding techni-

cal and administrative support of Miles Davis, Kathi DiTommaso, Sebastian Gutier-

rez, Alex Sandra Pinedo, Alex Skrenchuk, Tanya Raschke, Liliana Rivera and Verna

Wong.

During my time at Stanford, I had the privilege of being involved in a broad

range of extracurricular activities. I would like to thank all my friends in Stan-

ford EMS, and in particular Florian Schmitzberger, Chris Cheung, Brian Cheung,

Glenn Ulansey, Lauren Mamer, Mark Liao and James Liao, the teaching staff of the

Stanford EMT program, the Stanford Wilderness Medicine instructor team, and the

Escondido Village Community Associates for their friendship, encouragements and

support throughout my graduate career, and for just being an amazing group of peo-

ple! These programs tremendously enriched my experience at Stanford, and would

not have been possible without the support of the Department of Public Safety, the

Division of Emergency Medicine and Stanford Outdoor Education. I would like to

thank the Graduate Life Office, and in particular Ken Hsu, Laurette Beeson and

Anne Boswell for their support of my work as a Community Associate in Studio 2,

and all the great work they do to assist the Stanford graduate student community in

general.


While many miles away, my friends and family in Switzerland, and in particular

Frederic Evequoz, Gregory Mermoud and Gregory Theoduloz as well as my brother

Alain have always been very supportive of my work.

Finally, none of this would have been possible without the unwavering support of

my family throughout my entire career. I am deeply grateful to my father Andreas

and my mother Margrith for everything they have done in order to give me the

opportunity to follow my interests, and for always encouraging me to do so, even

when it meant living nine timezones away from home. Danke viel, vielmals fur alles!

Joint Work

Chapter 3 and Sections 2.1 and 2.2 of Chapter 2 are a reproduction, in part, of a

previously published article:

M.A. Schaub, I.M. Kaplow, M. Sirota, C.B. Do, A.J. Butte, S. Batzoglou. A

Classifier-based Approach to Identify Genetic Similarities Between Diseases. Bioin-

formatics 25: i21-29. 2009.

I would like to thank my co-authors Irene M. Kaplow, Marina Sirota, Chuong

B. Do, Atul J. Butte and Serafim Batzoglou for their contributions to this project.

I conceived and designed the study, performed all data preprocessing, implemented

the version of the decision tree classifier used to obtain the reported results, analyzed

the data and wrote the manuscript. Irene M. Kaplow performed exploratory research

comparing various classifiers, which led to the choice of the Decision Tree classifier

we used. Marina Sirota revised Figure 3.1, and designed the version shown herein.

Chuong B. Do and Marina Sirota provided input and feedback on the study design

and data analysis. Atul J. Butte and Serafim Batzoglou helped conceive the study

and supervised the study. All authors revised the manuscript.

Chapter 4 and Section 2.3 of Chapter 2 represent joint work that will become

part of a manuscript to be submitted after the time of submission of this thesis.

I would like to thank my co-authors on this upcoming manuscript Linda Y. Liu,

Marina Sirota, Serafim Batzoglou and Atul J. Butte for their contributions to this


project. Linda Y. Liu and I jointly conceived and designed the study. Linda Y.

Liu performed the analysis on the WTCCC data set. I performed the analysis on

the HapMap 3 dataset, developed the modified Hardy-Weinberg model, identified

the sequence homology issue leading to false positives in autosomes, and wrote the

chapter. Marina Sirota provided input and feedback on the study design and data

analysis. Atul J. Butte helped conceive the study. Serafim Batzoglou and Atul J.

Butte supervised the study.

Chapters 5 and 6 are a reproduction, in part, of a research article which, at the

time of submission of this thesis, has been accepted for publication:

M.A. Schaub, A.P. Boyle, A. Kundaje, S. Batzoglou, M.P. Snyder. Linking Disease

Associations with Regulatory Information in the Human Genome. Genome Research.

In press.

I would like to thank my co-authors Alan P. Boyle, Anshul Kundaje, Serafim

Batzoglou and Michael P. Snyder for their contributions to this project. I conceived

and designed the study, performed all data analysis steps and wrote the manuscript.

Alan P. Boyle designed RegulomeDB and provided regulatory annotations for the list

of all SNPs used in this work. Anshul Kundaje, Serafim Batzoglou and Michael P.

Snyder helped conceive the study and supervised the study. All authors revised the

manuscript.

Funding

This work would not have been possible without the very generous donors who supported

me through the Richard and Naomi Horowitz Stanford Graduate Fellowship,

and a School of Engineering Fellowship.

Parts of this work have been supported by the ENCODE consortium under Grant

No. NIH 5U54 HG 004558, by the National Science Foundation under Grant No.

0640211, and by a King Abdullah University of Science and Technology research

grant.


Data

This study makes use of data generated by the Wellcome Trust Case-Control Con-

sortium. A full list of the investigators who contributed to the generation of the data

is available from www.wtccc.org.uk. Funding for the project was provided by the

Wellcome Trust under award 076113.

This work makes use of data generated and processed by the ENCODE consor-

tium, the Office of Population Genomics at the National Human Genome Research

Institute, the HapMap consortium, and the Genome Bioinformatics Group at the

University of California Santa Cruz.


Contents

Abstract

Acknowledgment

1 Introduction

1.1 Genome wide association studies

1.2 Integrative analysis methods

2 Data Quality

2.1 The Wellcome Trust Case Control Consortium data set

2.2 Genotype calling artifacts

2.2.1 Genotype calling

2.2.2 Examples of artifacts

2.2.3 Consensus approach

2.3 Effect of homology with the sex chromosomes

3 Identifying Similarities Between Diseases

3.1 Introduction

3.2 Approach

3.3 Results

3.3.1 Classifier performance

3.3.2 Disease similarities

3.3.3 Differences between control sets

3.4 Discussion

3.5 Conclusion

3.6 Methods

3.6.1 Classification task

3.6.2 Decision trees

3.6.3 Identifying similarities

4 Analysis of the Pseudoautosomal Regions

4.1 Introduction

4.2 Results

4.2.1 Significant differences between males and females in the pseudoautosomal region 1

4.2.2 Replication

4.2.3 Modified Hardy-Weinberg model

4.2.4 Phasing

4.2.5 Evolution of differing allele frequencies

4.3 Discussion

4.4 Conclusion

4.5 Methods

4.5.1 Identifying differences between males and females

4.5.2 Trio-based phasing in PARs

5 Integrating Regulatory Information

5.1 Introduction

5.2 Results

5.2.1 Lead SNP annotation

5.2.2 Linkage disequilibrium

5.2.3 Integrating gene expression data

5.2.4 SNP comparison within linkage disequilibrium regions

5.2.5 Associations are enriched for regulatory elements

5.2.6 Analysis at the phenotype level

5.3 Discussion

5.3.1 Identifying functional SNPs in linkage disequilibrium with lead SNPs

5.3.2 Comparison of functional assays

5.3.3 Differences between tissue types

5.3.4 Functional SNPs beyond reported associations

5.3.5 Analysis at the phenotype level

5.4 Conclusion

5.5 Data

5.5.1 GWAS catalog

5.5.2 Linkage disequilibrium

5.5.3 Genotyping arrays

5.5.4 SNP properties

5.5.5 Functional annotations

5.5.6 Transcribed regions

5.6 Methods

5.6.1 Lead SNP annotation

5.6.2 Linkage disequilibrium integration

5.6.3 Randomization

5.6.4 Analysis at the phenotype level

6 Analysis of Functional SNPs

6.1 Strongly supported functional SNPs

6.2 Replication of a previously validated functional SNP

6.3 A new functional SNP for type 2 diabetes

6.3.1 Results

6.3.2 Discussion

6.4 The 9p21 region in coronary artery disease

6.4.1 Results

6.4.2 Discussion

6.5 Methods

Bibliography


List of Tables

2.1 Incorrect and correct genotype calls for rs2241572

2.2 Incorrect genotype calls for rs2491853

3.1 Classifier performance (cross-validation)

3.2 Separate training set classifier performance

4.1 Significant genotype differences between males and females in WTCCC controls

4.2 Analysis of rs312258 in WTCCC disease populations

4.3 Genotype counts for rs312258 in HapMap 3

4.4 Comparison of observed and estimated genotype counts

4.5 Allele counts per chromosome for rs312258 in HapMap 3

4.6 Phasing cases in trios

5.1 Fraction of associations overlapping functional regions for different linkage disequilibrium thresholds.

5.2 Fraction of associations overlapping functional regions for different linkage disequilibrium thresholds (European populations).

5.3 Comparison of functional evidence between the lead SNP and the best SNP in the linkage disequilibrium region.

5.4 Overview of enrichment: lead SNPs only, all populations.

5.5 Overview of enrichment: perfect LD, all populations.

5.6 Overview of enrichment: r² ≥ 0.9, all populations.

5.7 Overview of enrichment: r² ≥ 0.8, all populations.

5.8 Overview of enrichment: r² ≥ 0.5, all populations.

5.9 Overview of enrichment: lead SNPs only, European populations.

5.10 Overview of enrichment: perfect LD, European populations.

5.11 Overview of enrichment: r² ≥ 0.9, European populations.

5.12 Overview of enrichment: r² ≥ 0.8, European populations.

5.13 Overview of enrichment: r² ≥ 0.5, European populations.

5.14 Height-associated functional SNPs overlapping CTCF binding sites

5.15 Prostate cancer-associated functional SNPs overlapping AR binding sites

5.16 Modified RegulomeDB scoring scheme.

6.1 Overview of the lead SNPs most strongly supported by functional evidence.

6.2 Strongly supported functional SNPs in linkage disequilibrium with an associated lead SNP in all populations.

6.3 Strongly supported functional SNPs in linkage disequilibrium with an associated lead SNP in the European population.


List of Figures

1.1 Overview of a genome wide association study

2.1 Overview of genotype calling

2.2 Genotype calling for rs2241572

2.3 Genotyping chip signal intensities for rs312258 in controls

2.4 Effect of homology on genotype calling

2.5 Signal intensities for rs2491853

3.1 Overview of the approach.

3.2 Distribution of the disease-class probabilities for the type 1 diabetes classifier.

3.3 Disease-class probabilities comparisons.

3.4 Distribution of the class probabilities for the control-control classifier

4.1 Manhattan plot of differences between males and females in PAR1

5.1 Schematic overview of the functional SNP approach.

5.2 Proportions of associations for different types of functional data.

5.3 Enrichment for different combinations of assays.

5.4 Overview of enrichment.

5.5 Phenotype level overview of the overlap between associations and ChIP-seq binding.

6.1 Functional SNP rs7163757

6.2 Overview of the 9p21 region

6.3 Evidence supporting the implication of rs1333047 in coronary artery disease.


Chapter 1

Introduction

The completion of the landmark effort in sequencing the human genome over ten years

ago [1, 2] has had a profound impact on disease genetics research [3]. The availability

of the full sequence of the human genome led to the identification of increasingly

complete lists of common variation between individuals [4]. The most common form

of variation is called a Single Nucleotide Polymorphism (SNP): a position in the

genome where individuals differ at exactly one base pair, but the flanking sequence

on both sides of the SNP is identical in the population [5]. For most SNPs individuals

have one of two base pairs: the major allele, which is the most prevalent in the

population, and the minor allele. As humans are diploid, each individual will have two

alleles at a given SNP: one on the maternal copy of the chromosome, and one on the

paternal copy. These two alleles are combined to form the genotype of the individual

at that SNP. SNPs for which the minor allele is relatively frequent (a commonly

used threshold is 5% of the population) are called common SNPs. The HapMap

project [6, 7] led to the identification of over 3.8 million common SNPs. This large

catalog of variation combined with cost-effective genotyping technologies [8], which

allow the measurement of hundreds of thousands of SNPs per individual, made it

possible to study how populations differ at the genetic level. In the context of diseases,

it became possible to compare a population of individuals who suffer from a given

disease to the general population [9]. This idea forms the basis of Genome Wide

Association Studies (GWAS). A landmark GWAS was published by the Wellcome


Trust Case Control Consortium in 2007, and assessed 14,000 cases and 3,000

controls in order to identify variants associated with common diseases [10]. Since

then, GWAS have led to the identification of thousands of loci associated with a large

number of phenotypes [11, 12].
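
The definitions of major allele, minor allele and common SNP given above can be illustrated with a minimal Python sketch. It is an illustration only: the function names, the two-character genotype encoding and the sample data are assumptions made for this example, and the 5% threshold is applied here to the minor allele frequency.

```python
from collections import Counter

# Assumption: the 5% threshold from the text, applied to the minor allele frequency.
COMMON_MAF_THRESHOLD = 0.05


def allele_frequencies(genotypes):
    """Return {allele: frequency} for a list of two-character genotypes such as "AT"."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def minor_allele_frequency(genotypes):
    """Frequency of the less frequent allele at a biallelic SNP (0.0 if monomorphic)."""
    freqs = allele_frequencies(genotypes)
    if len(freqs) > 2:
        raise ValueError("expected a biallelic SNP")
    if len(freqs) < 2:
        return 0.0
    return min(freqs.values())


def is_common(genotypes, threshold=COMMON_MAF_THRESHOLD):
    """True if the minor allele frequency reaches the chosen threshold."""
    return minor_allele_frequency(genotypes) >= threshold


# Hypothetical sample of ten diploid individuals (20 chromosomes in total).
sample = ["AA"] * 6 + ["AT"] * 3 + ["TT"]
maf = minor_allele_frequency(sample)
```

In this hypothetical sample, A is the major allele and T the minor allele, and each two-character string is the genotype formed by the maternal and paternal alleles of one individual.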

In this chapter I first show how a GWAS is performed, and then describe how my

thesis builds on existing GWAS by developing new methods, asking new questions,

and integrating GWAS results with functional data.

1.1 Genome wide association studies

The goal of a GWAS is to detect loci that have statistically significant differences in

genotype frequencies between individuals who have a phenotype of interest and the

general population. A commonly used study design is a case-control study, in which

individuals who have a known phenotype (such as a disease) are recruited to be part of

the case population, and matched healthy individuals are recruited to be part of the

control population. GWAS have also been performed in longitudinal cohort studies in

which a population has been followed over time [13]. Most recent GWAS have

used genotyping platforms that allow the measurement of 500,000 to a million SNPs

per sample, and can thus be used to detect many common variants associated with

a phenotype across the entire genome. This makes GWAS particularly applicable to

the study of common diseases under the assumption that a large number of common

variants each contribute a small amount to the overall disease risk [14].

For each SNP, the numbers of individuals with each genotype are tallied separately

in the cases and the controls. A statistical test is then applied to the genotype counts

in order to assess whether there is a statistically significant difference between the

genotype distribution in cases and in controls. A tutorial by Balding [15] summarizes

statistical methods that are relevant to GWAS. As the number of SNPs assessed in a

GWAS is very large, it is important to correct for multiple hypothesis testing. While

multiple methods have been proposed, most studies use the conservative Bonferroni

correction [16], which multiplies each P-value by the number of tested hypotheses.

In this context the number of hypotheses is equal to the number of SNPs on the


genotyping chip. While a P-value indicates how much statistical support there is for

a given association, it does not directly show how strong the effect is. An association

with a weak effect but which is tested in a very large population can obtain a much

stronger P-value than an association with a stronger effect that is tested in a smaller

population. Odds ratios can be computed in order to assess the effect size, but it is

important to keep in mind that a large odds ratio does not necessarily mean that the

association is significant. Figure 1.1 provides an overview of the approach used in

order to assess the association between a SNP and a phenotype in a GWAS.
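
The per-SNP procedure described above can be sketched in a few lines of Python. The sketch assumes a Pearson chi-square test on the 2x3 table of genotype counts, which is only one of several tests that can be used in a GWAS, together with an allelic odds ratio as the measure of effect size. The function names and the genotype counts are hypothetical and were chosen for illustration.

```python
import math


def genotype_chi_square(controls, cases):
    """Pearson chi-square test on a 2x3 table of genotype counts.

    Returns (statistic, p_value). A 2x3 table has 2 degrees of freedom,
    for which the chi-square survival function is exactly exp(-x / 2).
    """
    table = [list(controls), list(cases)]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    if len(col_totals) != 3 or min(col_totals) == 0 or min(row_totals) == 0:
        raise ValueError("expected a 2x3 table with no empty row or column")
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, math.exp(-statistic / 2.0)


def bonferroni(p_value, n_tests):
    """Multiply the P-value by the number of tested hypotheses, capped at 1."""
    return min(1.0, p_value * n_tests)


def allelic_odds_ratio(controls, cases):
    """Odds of the second allele in cases relative to controls.

    Counts are ordered (AA, AT, TT); the second allele is T.
    """
    def allele_counts(counts):
        hom_first, het, hom_second = counts
        return 2 * hom_first + het, het + 2 * hom_second

    ctrl_first, ctrl_second = allele_counts(controls)
    case_first, case_second = allele_counts(cases)
    return (case_second * ctrl_first) / (case_first * ctrl_second)


# Hypothetical genotype counts (AA, AT, TT) and chip size.
controls = (500, 400, 100)
cases = (400, 450, 150)
n_snps = 500_000

statistic, p_value = genotype_chi_square(controls, cases)
corrected = bonferroni(p_value, n_snps)
odds_ratio = allelic_odds_ratio(controls, cases)
```

With these hypothetical counts the uncorrected P-value is small, yet the Bonferroni-corrected value is capped at 1 when 500,000 SNPs are tested, which illustrates why the correction is considered conservative. The odds ratio is computed independently of the P-value, in line with the distinction between statistical support and effect size made above.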

SNPs that are physically located in close proximity in the genome tend to be cor-

related with each other [17]. This phenomenon is a consequence of the evolution of the

human genome and is called Linkage Disequilibrium (LD) [18]. Various metrics can be

used to quantify linkage disequilibrium [19], including the squared correlation coefficient

r². Figure 1.1 shows two perfectly correlated (r² = 1.0) SNPs, SNP2 and SNP3.

Perfectly correlated SNPs are said to be in perfect linkage disequilibrium. SNPs can

be grouped into haplotype blocks of loci that have strongly correlated genotypes. The

HapMap project studied this haploblock structure in multiple populations [6, 7]. This

information is used in the design of genotyping platforms in order to decide which

SNPs in a haploblock need to be measured, and which SNPs can be inferred using in-

formation from other correlated SNPs. In the example of Figure 1.1, measuring SNP3

in addition to SNP2 would not provide any additional information. The SNP that is

present on the genotyping platform is called the tag SNP. Imputation methods [20]

can be used to obtain the likely genotypes of the individuals in the study at variants

that were not assessed directly through genotyping, but whose correlation with tag

SNPs is known from higher resolution data such as HapMap. Therefore, while GWAS

allow the identification of particular tag SNPs that are statistically associated with a

phenotype of interest, these SNPs are often part of a larger region of linkage disequi-

librium. Regions of strong linkage disequilibrium can be large, and SNPs associated

with a phenotype have been found to be in perfect linkage disequilibrium with SNPs

several hundred kilobases away. Linkage disequilibrium therefore makes it difficult

to precisely pinpoint which SNPs play a functional role in the phenotype of interest,

and which SNPs happen to be associated with a phenotype only because they are in


[Figure 1.1 graphic: the two chromosome copies of three control individuals and three case individuals, with SNP1, SNP2, SNP3 and one rare variant marked. SNP2 and SNP3 are labeled as being in complete linkage disequilibrium. A 2x3 table lists the SNP2 genotype counts (columns AA, AT, TT) for the two groups, with rows 2, 1, 0 and 0, 1, 2. Summary statistics: P-value and odds ratio.]

Figure 1.1: Overview of a genome wide association study

In this cartoon example of a Genome Wide Association Study, three healthy control individuals are compared to three case individuals that have some disease phenotype of interest. Each individual has two copies of each chromosome. Three SNPs in which there is frequent variation in the population are shown. One rare variant, for which only one individual has a mutation, is also shown. Parts of the sequence that are identical in the entire population are in grey. SNP2 and SNP3 are perfectly correlated: if a chromosome contains the A allele at SNP2, it contains the C allele at SNP3, and if it contains the T allele at SNP2, it contains the G allele at SNP3. SNP2 and SNP3 are said to be in perfect linkage disequilibrium. The genotype counts for SNP2 are shown in a 2x3 table. Summary statistics for SNP2 can be computed based on this table.


linkage disequilibrium with another SNP that has a functional role. In Figure 1.1

both SNP2 and SNP3 are equally strongly associated with the phenotype since their

genotypes are perfectly correlated.
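
As an illustration of the squared correlation coefficient, the following Python sketch computes r² between two biallelic SNPs from phased chromosomes, using the standard definition r² = D² / (pA(1 - pA) pB(1 - pB)) with D = pAB - pA pB. The function name and the haplotype counts are hypothetical; the allele pairing (A with C, T with G) follows the SNP2 and SNP3 example of Figure 1.1.

```python
def r_squared(haplotypes, allele_a, allele_b):
    """Squared correlation r^2 between two biallelic SNPs.

    haplotypes: list of (allele_at_first_snp, allele_at_second_snp) tuples,
                one tuple per phased chromosome.
    allele_a:   reference allele chosen at the first SNP.
    allele_b:   reference allele chosen at the second SNP.
    """
    n = len(haplotypes)
    p_a = sum(1 for first, _ in haplotypes if first == allele_a) / n
    p_b = sum(1 for _, second in haplotypes if second == allele_b) / n
    p_ab = sum(
        1 for first, second in haplotypes
        if first == allele_a and second == allele_b
    ) / n
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denominator


# Hypothetical chromosomes in which A at the first SNP always travels with C
# at the second SNP, and T always travels with G (perfect LD).
perfect = [("A", "C")] * 7 + [("T", "G")] * 5

# Adding two hypothetical recombinant chromosomes breaks the perfect correlation.
partial = perfect + [("A", "G")] * 2

r2_perfect = r_squared(perfect, "A", "C")
r2_partial = r_squared(partial, "A", "C")
```

In the first data set, knowing the allele at one SNP fully determines the allele at the other, which is the situation in which measuring both SNPs on a genotyping platform provides no additional information.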

While GWAS have led to the identification of a large number of variants associated

with common diseases, several major challenges remain unaddressed [21]. First,

interpreting GWAS results is difficult since most reported associations merely point to

larger regions of correlated variants [22]. Furthermore, while GWAS provide a list of

SNPs that are statistically associated with a phenotype of interest, they do not offer

any direct evidence about the biological processes that link the associated variant to

the phenotype. The fraction of associated loci found in GWAS that overlap known

coding regions is relatively low. While many associated SNPs are located near known

genes, strong associations have been found in so-called gene deserts [23, 24]. Second,

individual variants often have a small effect size, and even all variants associated

with a disease together only explain a small fraction of the disease risk [25]. Third,

associations identified in one study may not be replicable in a different study [26],

specially if the study is done in a population of different geographic origin.

1.2 Integrative analysis methods

In this thesis, I present three integrative approaches that extend genome wide association studies by developing new methods, asking new questions, and integrating new data. These approaches are applied to a wide variety of data sets, and provide insights into a broad range of complex human diseases.

Chapter 3 discusses a method for identifying similarities between diseases at the genetic level. This is an extension of GWAS both from a methods and from a questions perspective. Identifying genetic similarities between diseases is highly relevant for the translation of GWAS results to medicine: if two diseases share a common genetic architecture, then it is likely that the underlying disease processes are also similar. I develop new methods in order to achieve the goal of identifying disease similarities using information from multiple SNPs. By training a classifier that distinguishes cases and controls, a model of the disease architecture is learned. This is a significant improvement over methods that only consider similarities at the level of individual SNPs. The trained classifier is then applied to individuals that have a different disease, and by aggregating the classification of those individuals, we can estimate how close the genetic architectures of the two diseases are. While the performance of the classifier is insufficient to apply it to single individuals, it is able to identify significant relationships between diseases when aggregating predictions over a large number of individuals.

In Chapter 4, I re-purpose GWAS data in order to ask a new question. The genotype information of individuals used as controls in a GWAS is now used to identify differences between males and females. This approach allows me to identify significant differences in the pseudoautosomal regions of the sex chromosomes X and Y. Therefore, data originally collected for the purpose of identifying variations linked to disease risk shed new light on the more fundamental biological question of the differences between males and females, in a particularly interesting, yet understudied region of the human genome.

In Chapter 5, I go back to the traditional associations identified using GWAS, but with the goal of adding to the understanding of the biological mechanism underlying individual associations. I integrate functional data about experimentally identified regulatory and transcribed regions together with GWAS results. Linkage disequilibrium information is used to study the entire regions associated with a phenotype. I show that functional hypotheses can be generated for a majority of previously identified associations. This is also an exercise in re-purposing data: the data sets used in this chapter were generated by ENCODE in order to help understand which regions of the human genome are functional, and how those functional aspects differ between cell lines. Integrating these data sets with GWAS results is, however, one of the most promising avenues for translating information about human gene regulation identified by the ENCODE consortium to the study of human disease in general.

While Chapter 6 mainly discusses specific examples of functional SNPs identified in Chapter 5, the analysis is also integrative. The analysis of the 9p21 region relies on using information in a different way than originally intended, as I build a population genetics argument based on a negative result, the lack of replication of a prominent association in a different population, in order to show how a functional SNP may play an important role in coronary artery disease.

Using GWAS data in an innovative way presents interesting challenges from a quality control perspective. Chapter 2 highlights two particular aspects of genotype calling that are directly relevant to the rest of the work presented herein.

Chapter 2

Data Quality

2.1 The Wellcome Trust Case Control Consortium data set

In Chapters 3 and 4 of this thesis, we use individual level genotyping data provided by the Wellcome Trust Case Control Consortium (WTCCC). Authorization to use these data sets was obtained separately from the WTCCC for each project.

The data sets we use come from a genome-wide association study [10] of seven common diseases: type 1 diabetes (T1D), type 2 diabetes (T2D), coronary artery disease (CAD), Crohn’s disease (CD), bipolar disorder (BD), hypertension (HT), and rheumatoid arthritis (RA). The data consist of a total of 2000 individuals per disease and 3000 shared controls, with 1500 control individuals from the 1958 British Birth Cohort (58C control set) and 1500 individuals from blood donors recruited specifically for the project (UKBS control set). The genotyping of 500,568 SNPs per individual was performed using the Affymetrix GeneChip 500K Mapping Array Set. In the original analysis of this data set by the WTCCC, a total of 809 individuals and 31,011 SNPs that did not pass quality control checks were excluded. In addition, SNPs that appeared to have a strong association in the original study were manually inspected for quality issues, and 578 additional SNPs were removed. In this work, we exclude all individuals and SNPs that were excluded in the WTCCC study, as well as an additional 9,881 SNPs that do not appear in the WTCCC summary results.

The use of genotyping data as input to a classifier (Chapter 3) and to study differences between males and females (Chapter 4) presents challenges that do not exist when performing a case-control study as done by the WTCCC. In this chapter we discuss specific data quality artifacts that could have affected our results, and how we addressed them.

2.2 Genotype calling artifacts

A major concern in the analysis of GWAS data is the possibility that reported genotypes for an individual could be incorrect. Genotype calling algorithms process the raw signal obtained from the genotyping chips in order to assign a genotype to each SNP and for each individual. Inaccuracies in genotype calling can lead to false positives in a GWAS [27]. While current algorithms are very accurate, a very large number of SNPs are analyzed in a GWAS. This means that even a very small error rate could lead to many false positives. If a genotype calling algorithm is accurate for 99.9% of all SNPs, then over 500 SNPs would still be incorrect for a study of the size of WTCCC. The purpose of the quality control steps performed by the WTCCC is to identify SNPs that have poor genotype quality, either due to poor raw data quality on the chip, or due to genotype calling errors. Quality control steps need to carefully balance sensitivity and specificity. An overly sensitive approach would eliminate a larger number of SNPs from the subsequent steps of the study. If a SNP that was correctly genotyped is eliminated at this stage, then a potentially significant association may be missed. It does therefore make sense to choose a more specific approach, and only discard SNPs that are very clearly of poor quality. This can be done without sacrificing sensitivity by adding an additional quality control step after the analysis has been performed. All SNPs identified to be significantly associated with the phenotype are inspected to identify any genotyping or genotype calling issue. As the number of significant associations is orders of magnitude lower than the number of SNPs on the genotyping platform, this step can be done manually. The WTCCC study uses this approach to ensure that none of the reported associations is a false positive due to a genotyping artifact.
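The arithmetic behind the 99.9% example can be spelled out in a few lines of Python. This is only a back-of-the-envelope sketch: it uses the 500,568 SNPs of the Affymetrix array set mentioned in Section 2.1 and treats the accuracy as a per-SNP probability, which is a simplification. The function name is illustrative.

```python
N_SNPS = 500_568  # SNPs genotyped per individual, as stated in Section 2.1


def expected_miscalled_snps(n_snps, accuracy):
    """Expected number of SNPs with incorrect calls for a given per-SNP accuracy."""
    return n_snps * (1.0 - accuracy)


# Compare a few hypothetical accuracy levels, including the 99.9% from the text.
for accuracy in (0.99, 0.999, 0.9999):
    n_bad = expected_miscalled_snps(N_SNPS, accuracy)
    print(f"accuracy {accuracy:.2%}: about {n_bad:.0f} SNPs with incorrect calls")
```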

In this section we briefly describe genotype calling, provide examples of artifacts, and then describe an alternative method that we need to use when applying a classifier to genotyping data.

2.2.1 Genotype calling

A SNP assessed on a genotyping platform generally has two alleles, a major allele and a minor allele. The genotyping array contains multiple probes of the reverse complement of the sequence around each SNP. Some probes contain the reverse complement of the sequence including the major allele, and others contain the reverse complement of the sequence including the minor allele. The probe sequences are chosen in such a way that only the sequence near the SNP binds to them. Small fragments of the genome of the individual who is being genotyped then bind to those probes. If the individual is homozygous for one of the alleles, then the sequence fragments around the SNP will only bind to probes that have the reverse complement of the sequence including that allele. If the individual is heterozygous, then half the sequence fragments will contain the major allele, and half will contain the minor allele, and they will bind to the corresponding reverse complement probes. The amount of binding for both probes is then measured as a signal intensity. As binding affinities between the sequence and its reverse complement are variable, the genotyping calls are made for all individuals in the population at the same time. Signal intensities for both probes can be represented on a two-dimensional plot. Individuals that are homozygous will form clusters along each axis, whereas individuals that are heterozygous will form a cluster along the diagonal. Clustering algorithms are then used to assign each individual to one of three clusters, and thus determine its genotype for the SNP of interest. Figure 2.1 shows the appearance of a genotyping signal intensity plot.
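To make the clustering idea concrete, the following Python sketch calls genotypes for a handful of individuals by running a one-dimensional k-means (k = 3) on the angle of each (T intensity, A intensity) pair. This is a deliberately simplified toy and not the method used by Chiamo or BRLMM. The intensity values, the initial cluster centres and all function names are assumptions made for illustration.

```python
import math

GENOTYPES = ["TT", "AT", "AA"]  # ordered by increasing angle in the (T, A) plane


def nearest(angle, centers):
    """Index of the cluster centre closest to the given angle."""
    return min(range(len(centers)), key=lambda i: abs(angle - centers[i]))


def call_genotypes(intensities, n_iter=20):
    """Toy caller: cluster all individuals jointly on the angle atan2(A, T)."""
    angles = [math.atan2(a, t) for t, a in intensities]
    # Initial guesses: along the T axis, along the diagonal, along the A axis.
    centers = [0.0, math.pi / 4, math.pi / 2]
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for angle in angles:
            groups[nearest(angle, centers)].append(angle)
        # Keep the previous centre when a cluster is empty (e.g. a rare homozygote).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return [GENOTYPES[nearest(angle, centers)] for angle in angles]


# Hypothetical (T intensity, A intensity) pairs for six individuals.
samples = [(1.9, 0.2), (2.1, 0.3), (1.0, 1.1), (1.2, 1.0), (0.2, 2.0), (0.3, 1.8)]
calls = call_genotypes(samples)
print(calls)
```

Because the cluster centres are estimated from all individuals at once, the sketch reflects the point made above that calls are made jointly for the population rather than against fixed thresholds.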

2.2.2 Examples of artifacts

[Figure 2.1: schematic signal intensity plot. x-axis: Normalized Signal for Allele T; y-axis: Normalized Signal for Allele A. Three groups of individuals are labelled Homozygote TT, Heterozygote AT and Homozygote AA.]

Figure 2.1: Overview of genotype calling

Each X represents the genotype of an individual. The arrows indicate the direction along which individuals with each genotype will cluster. Dashed lines represent the clusters that a genotype calling algorithm should identify for this example.

An example of a SNP in which an error in genotype calling in the original WTCCC data leads to a false positive is rs2241572. The original genotyping counts made by the Chiamo algorithm [28] used by WTCCC are shown in Table 2.1. Computing a P-value based on these counts leads to an extremely significant association with coronary artery disease. This association was flagged as a false positive by the WTCCC after inspection of the genotype intensity plots (Figure 2.2). It is interesting to note that a cause of the specific genotype calling error shown here is that WTCCC performed genotype calling separately for each population, and the algorithm made cluster-to-genotype assignment decisions that were inconsistent between populations. This issue would likely have been partially avoided if genotype calling had been performed jointly for the cases and the controls. An additional issue comes from the fact that the algorithm did not identify the small number of individuals with the GG genotype as a cluster, and thus incorrectly assigned this genotype to other clusters. A second genotyping algorithm (the standard Affymetrix algorithm BRLMM) correctly calls the genotypes for this SNP (Table 2.1), and there is no significant difference between cases and controls.

Chiamo
                            GG     GC     CC
Controls                  2924      0      0
Coronary Artery Disease    240      0   1658

BRLMM
                            GG     GC     CC
Controls                     9    335   2580
Coronary Artery Disease      8    242   1638

Table 2.1: Incorrect and correct genotype calls for rs2241572
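The contrast between the two sets of calls in Table 2.1 can be checked with a standard Pearson chi-square test on the genotype counts. The Python sketch below is one possible way to do this and is not necessarily the test used by the WTCCC. It drops genotype columns that are empty in both groups and relies on closed-form tail probabilities that only hold for 1 or 2 degrees of freedom. The function name is illustrative.

```python
import math


def chi_square_p_value(table):
    """Pearson chi-square test of independence for a 2 x k table of counts.

    Columns whose total is zero are dropped. Closed-form tail probabilities
    are used, so only 1 or 2 degrees of freedom are supported.
    """
    columns = [col for col in zip(*table) if sum(col) > 0]
    row_totals = [sum(col[r] for col in columns) for r in range(2)]
    total = sum(row_totals)
    chi2 = 0.0
    for col in columns:
        for r in range(2):
            expected = sum(col) * row_totals[r] / total
            chi2 += (col[r] - expected) ** 2 / expected
    dof = len(columns) - 1
    if dof == 1:
        p_value = math.erfc(math.sqrt(chi2 / 2.0))
    elif dof == 2:
        p_value = math.exp(-chi2 / 2.0)
    else:
        raise ValueError("only 1 or 2 degrees of freedom are supported")
    return chi2, p_value


# Genotype counts (GG, GC, CC) from Table 2.1; rows are controls, then CAD cases.
chiamo = [(2924, 0, 0), (240, 0, 1658)]
brlmm = [(9, 335, 2580), (8, 242, 1638)]

chi2_chiamo, p_chiamo = chi_square_p_value(chiamo)
chi2_brlmm, p_brlmm = chi_square_p_value(brlmm)
print(f"Chiamo: chi2 = {chi2_chiamo:.1f}, p = {p_chiamo:.3g}")
print(f"BRLMM:  chi2 = {chi2_brlmm:.2f}, p = {p_brlmm:.3g}")
```

For the Chiamo counts the statistic is so large that the P-value underflows to zero in double precision, whereas the BRLMM counts give a P-value well above 0.05, in line with the description above.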

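[Figure 2.2: the genotype count tables embedded in the figure repeat Table 2.1, with genotypes labelled MM, Mm and mm. Original calls: controls 2924, 0, 0 and coronary artery disease 240, 0, 1658. Different algorithm: controls 9, 335, 2580 and coronary artery disease 8, 242, 1638.]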

Figure 2.2: Genotype calling for rs2241572

In the control sets (58C and NBS), the genotype calling algorithm clusters both individuals homozygous for the C allele and heterozygous individuals into a single cluster (red). This cluster is assigned the GG genotype. In the cases (CAD), the clustering algorithm finds two clusters, which are assigned the GG genotype (red) and the CC genotype (blue).

2.2.3 Consensus approach

In the WTCCC study, visual inspection of signal intensities is done after the analysis, which makes it possible to manually inspect the small subset of SNPs that are potentially significant. In a classifier-based approach, it is impractical to perform any kind of visual inspection, and we must try to minimize the errors due to genotype calling prior to the analysis. It is intractable to manually inspect all SNPs prior to training the classifier. If we choose to inspect SNPs that are used as features by the classifier after training, then the whole classifier needs to be re-trained every time a feature must be discarded due to poor data quality. Such an iterative approach is also slow, and only works if a small number of features are used (which is the case in a decision tree). In order to be able to use any classifier, we developed an approach that combines the independent genotype calls of different algorithms in order to lower the risk of genotyping artifacts. While additional genotype calling algorithms were available at the time of our study [29], we did not have access to the raw genotyping chip signal intensity data that would have been necessary to use them.

The WTCCC study only uses genotype calls made by a custom algorithm, Chiamo [28], but the genotype calls made using the standard Affymetrix algorithm BRLMM are also available. While the study does show that Chiamo has, on average, a lower error rate than BRLMM, there are SNPs that are discarded during the quality control process that show errors in the genotype calls made by Chiamo (such as the example shown in Section 2.2.2). We use the two genotype sets to create a consensus data set in which the genotype of a given individual at a given SNP is used only if there is agreement between the call made by Chiamo and the call made by BRLMM, and is considered to be unknown if the calls are different. This approach individually considers the calls made for every individual at every SNP, and does not discard entire SNPs. The handling of SNPs that have a high proportion of unknown genotypes is left to the classification algorithm, and is discussed in Section 3.6.2. While this approach does reduce the errors in genotype calling, this comes at the cost of discarding cases in which Chiamo is right but BRLMM is not. Overall, the frequency of unknown genotypes is 2% using the consensus approach, compared to 0.65% using Chiamo and 0.74% using BRLMM. Furthermore, BRLMM genotype calls are entirely missing for a total of 184 individuals, which are thus excluded from our study.
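The consensus rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for this work: genotypes are represented as strings, calls are stored in dictionaries keyed by (individual, SNP), and a call that is unknown in either input is treated as unknown in the consensus. All of these representation choices, as well as the toy data, are assumptions.

```python
UNKNOWN = None  # marker for a missing or discarded genotype call


def consensus_call(chiamo_call, brlmm_call):
    """Keep a genotype only when both callers report the same known genotype."""
    if chiamo_call is UNKNOWN or brlmm_call is UNKNOWN:
        return UNKNOWN
    return chiamo_call if chiamo_call == brlmm_call else UNKNOWN


def build_consensus(chiamo, brlmm):
    """Combine two {(individual, snp): genotype} dictionaries call by call."""
    return {key: consensus_call(call, brlmm.get(key, UNKNOWN))
            for key, call in chiamo.items()}


def unknown_fraction(calls):
    """Fraction of (individual, SNP) pairs whose genotype is unknown."""
    return sum(call is UNKNOWN for call in calls.values()) / len(calls)


# Hypothetical toy calls for two individuals at two SNPs.
chiamo = {("ind1", "snpA"): "GG", ("ind1", "snpB"): "CT",
          ("ind2", "snpA"): "GC", ("ind2", "snpB"): UNKNOWN}
brlmm = {("ind1", "snpA"): "GG", ("ind1", "snpB"): "CC",
         ("ind2", "snpA"): "GC", ("ind2", "snpB"): "CT"}

consensus = build_consensus(chiamo, brlmm)
print(consensus)
print(f"unknown fraction: {unknown_fraction(consensus):.2f}")
```

Because the rule operates on individual calls, a SNP is never discarded as a whole. It only accumulates unknown genotypes, which matches the behaviour described in the text.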

After performing these pre-processing steps, the data set used in this work consists of 459,075 SNPs measured in 2938 control individuals (58C: 1480, UKBS: 1458), 1963 individuals with type 1 diabetes, 1916 individuals with type 2 diabetes, 1882 individuals with coronary artery disease, 1698 individuals with Crohn’s disease, 1819 individuals with bipolar disorder, 1952 individuals with hypertension and 1834 individuals with rheumatoid arthritis.

[Figure 2.3: two signal intensity scatter plots for rs312258, one for the NBS control set and one for the 58C control set. x-axis: Signal for Allele A; y-axis: Signal for Allele C; both axes range from 0.0 to 2.5.]

Figure 2.3: Genotyping chip signal intensities for rs312258 in controls

2.3 Effect of homology with the sex chromosomes

We perform visual inspection of all SNPs that show significant differences in genotype frequency between males and females that were identified using the approach described in Chapter 4. The SNPs in PAR1 that exhibit significant differences do not show any sign of genotype calling artifact. Figure 2.3 shows the signal intensity plots for the most significant SNP, rs312258.

We observe that the significant differences identified on autosomes are the result of bad genotype calling. Furthermore, a difference in signal intensity between males and females can often be observed for one allele but not the other. We look for sequences that are homologous to the sequence around SNPs for which we identify this issue. We find that in such cases the sequence containing one of the alleles is homologous to sequence on either the X or the Y chromosome.

Sequence homology can adversely affect genotyping. Figure 2.4 shows the effect of homology on the signal intensity observed on the genotyping chip. In the absence of any homologous sequence, the entirety of the signal is caused by binding of sequences around the SNP to the respective probes (Figure 2.4A). In this case the three genotype clusters can easily be distinguished. If there is homology for the sequence that includes the A allele of the SNP, but not for the sequence that includes the T allele, then the homologous sequence will also bind to the probe for the A allele. As this binding is independent of which allele the individual has for the SNP of interest, it results in an overall increase of the observed intensity for the probe with the A allele, but not for the probe with the T allele (Figure 2.4B). Note that while the absolute intensities are shifted, the normalized intensities would look similar to the case in which there is no homology. Homology with a sequence on an autosome would therefore impact the spread of the signal intensity, but genotype clusters should remain correctly separable. If there is homology with the Y chromosome, then only the intensity of males will be shifted upwards. This leads to the superposition of clusters shown in Figure 2.4C. Running a genotype calling algorithm on such an example will likely result in all heterozygotes as well as males with the TT genotype being clustered together. This incorrect genotyping will result in a significant difference between male and female genotype distributions. A similar difference in shift between males and females will exist if the homologous sequence is on the X chromosome since females will have two copies of the homologous sequence, and males only one.

[Figure 2.4: three panels. (A) No homology; (B) homology for the sequence including the A allele on an autosome; (C) homology for the sequence including the A allele on the Y chromosome. x-axis: Normalized Signal for Allele T; y-axis: Normalized Signal for Allele A.]

Figure 2.4: Effect of homology on genotype calling

The green, blue and red ovals represent the areas where individuals with respectively AA, AT and TT genotypes will cluster. Clusters of male individuals are represented with dashed borders in panel C.

            CC     CG     GG
Male      1114    363      0
Female    1060    342     34

Table 2.2: Incorrect genotype calls for rs2491853

A concrete example of this situation is rs2491853, a SNP located on chromosome 1. The sequence around the SNP with the C allele at rs2491853 is homologous with a segment of sequence on chromosome Y. Figure 2.5C shows the probe sequences, which are the reverse complement of a segment of sequence that is present on chromosomes 1 and Y. Table 2.2 shows the incorrect genotype counts obtained when running Chiamo on the original signal intensities. No male individual is assigned the GG genotype. This is consistent with a shift of signal intensities for the probe binding to the sequence containing the C allele, as would be expected based on the homology of this sequence with the Y chromosome. Figure 2.5A shows the raw signal intensities for this SNP. The intensities of male individuals are shifted vertically. When scaling intensities per sex (Figure 2.5C), the intensities for males and females overlap, and it becomes visible that there are indeed male individuals with a GG genotype.
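One way to gauge how implausible the male counts in Table 2.2 are is to ask how many GG males would be expected if males had the same allele frequency as females. The Python sketch below does this under Hardy-Weinberg proportions. Hardy-Weinberg equilibrium is not invoked in the text above, so this is an added simplifying assumption, as is the choice of the female calls as the reference. The function names are illustrative.

```python
def minor_allele_frequency(cc, cg, gg):
    """Frequency of the G allele from (CC, CG, GG) genotype counts."""
    return (cg + 2 * gg) / (2 * (cc + cg + gg))


def expected_homozygotes(n_individuals, allele_frequency):
    """Expected number of homozygotes under Hardy-Weinberg proportions."""
    return n_individuals * allele_frequency ** 2


# Genotype counts (CC, CG, GG) from Table 2.2.
male = (1114, 363, 0)
female = (1060, 342, 34)

# Assumption: the female calls are unaffected by the Y homology and serve as reference.
g_freq = minor_allele_frequency(*female)
expected_gg_female = expected_homozygotes(sum(female), g_freq)
expected_gg_male = expected_homozygotes(sum(male), g_freq)

print(f"G allele frequency in females: {g_freq:.3f}")
print(f"Expected GG females: {expected_gg_female:.1f} (observed {female[2]})")
print(f"Expected GG males:   {expected_gg_male:.1f} (observed {male[2]})")
```

Under these assumptions roughly thirty GG males would be expected where none are called, which is consistent with the GG males having been absorbed into another cluster.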

We assess the extent to which SNPs on the Affymetrix 500K genotyping chip are potentially subject to this homology artifact. We intersect the list of all SNPs on the array with the Human Chained Self Alignments track of the UCSC browser [30]. This track uses a method originally developed to compare the human and mouse genomes [31] in order to align the human genome with itself, and therefore identify homologous regions of the human genome. We identify a total of 854 SNPs on the genotyping chip that are in a region that is homologous to a region of the X chromosome, 723 SNPs that are in a region that is homologous to a region of the Y chromosome, and 54 SNPs that are homologous to regions on both X and Y. These results illustrate that while this artifact is not limited to the case we discuss herein, it does affect less than 1% of the genotyped SNPs.
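The intersection step can be pictured as a position-in-interval lookup. The following Python sketch shows one simple way to do it for a single chromosome, using sorted, non-overlapping, half-open intervals and a binary search. It does not parse the actual UCSC track format, and the coordinates and SNP names are hypothetical. The last lines recompute the affected share from the counts reported above, treating the three counts as disjoint, which gives an upper bound.

```python
from bisect import bisect_right


def snps_in_regions(snps, regions):
    """Return the names of SNPs whose position falls inside any region.

    snps: iterable of (name, position) pairs on one chromosome.
    regions: sorted, non-overlapping half-open (start, end) intervals.
    """
    starts = [start for start, _ in regions]
    hits = []
    for name, position in snps:
        i = bisect_right(starts, position) - 1  # last region starting at or before the SNP
        if i >= 0 and position < regions[i][1]:
            hits.append(name)
    return hits


# Hypothetical coordinates, for illustration only.
regions = [(100, 200), (500, 650)]
snps = [("snp1", 150), ("snp2", 300), ("snp3", 649), ("snp4", 650)]
hits = snps_in_regions(snps, regions)
print(hits)

# Upper bound on the affected share, using the counts reported in the text.
affected_share = (854 + 723 + 54) / 500_568
print(f"{affected_share:.2%} of the genotyped SNPs")
```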

In this work, we identify SNPs that show differences between males and females

due to homology, and discard them as false positives. It is important to note that this

issue is not limited to the specific study of genotype differences between males and

AAACAAGAGGGACTGAGGTGAAGGT AAACAAGAGGCACTGAGGTGAAGGT ACAAGAGGCACTGAGGTGAAGGTTT ACAAGAGGGACTGAGGTGAAGGTTT AAAAACAAGAGGCACTGAGGTGAAG AAAAACAAGAGGGACTGAGGTGAAG ACAAAAAACAAGAGGGACTGAGGTG ACAAAAAACAAGAGGCACTGAGGTG TCTTGTTTTTTGTTCTCCCTGACTCCACTTCCAAATATCTTGTTTTTTGTTCTCCCTGACTCCACTTCCAAATA

0.0 0.5 1.0 1.5 2.0 2.5

0.0

0.5

1.0

1.5

2.0

2.5

rs2491853 controls

[Figure 2.5 plot. Panel titles: A. Raw intensities; B. Scaled per sex; C. Probe sequences (chr1:243,090,399-243,090,437 and chrY:13,946,923-13,946,961). Axis labels in the plot: Signal for Allele G and Signal for Allele C; Intensity A and Intensity B; Normalized Intensity A and Normalized Intensity B (rs2491853 controls).]

Figure 2.5: Signal intensities for rs2491853

In panels A and B, males are represented in red and females in blue. Boxes indicate individuals for which the algorithm did not make a genotype call. Panel C represents the probes assessing this SNP on the Affymetrix chip, and the sequence surrounding the SNP on chromosome 1. The sequence containing the C allele at rs2491853 is homologous with chromosome Y.


females. Genotype calls will likely be incorrect for both cases and controls whenever

there is homology between the region around a SNP and a region on a sex chromo-

some. As the error is likely to affect one sex disproportionately (for example males in

rs2491853), any difference in sex proportions between the two studies could lead to a

difference in observed genotype frequency between cases and controls. Because of this

artifact, sex can therefore become a hidden variable. Hidden variables can lead to

false positives even when there is no effect of sex on actual genotype frequencies and

no association between the SNP and the disease in either sex. Every GWAS should

therefore include a step in which this issue is addressed for all SNPs in regions ho-

mologous to a sex chromosome. Running a genotype calling algorithm separately on

males and females could potentially lead to similar issues as discussed in Section 2.2.2.

It is therefore recommended to first rescale the intensity for each sex, and then run

the genotype calling algorithm on the rescaled intensities of all individuals.
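The recommendation above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the text does not specify the scaling method, so standardizing each sex to zero mean and unit variance is an assumption, and the function name `rescale_per_sex` and the toy values are hypothetical.

```python
from statistics import mean, pstdev

def rescale_per_sex(intensities, sexes):
    """Standardize each intensity within its sex group (assumed scaling choice)."""
    scaled = [0.0] * len(intensities)
    for sex in set(sexes):
        idx = [i for i, s in enumerate(sexes) if s == sex]
        values = [intensities[i] for i in idx]
        mu = mean(values)
        sigma = pstdev(values) or 1.0  # guard against a constant group
        for i in idx:
            scaled[i] = (intensities[i] - mu) / sigma
    return scaled

# Toy data: males show a systematically higher signal for one allele.
signal = [2.0, 2.2, 1.8, 1.0, 1.2, 0.8]
sex = ["M", "M", "M", "F", "F", "F"]
scaled = rescale_per_sex(signal, sex)
```

After rescaling, the two sexes occupy the same range, so a single run of the genotype calling algorithm on all individuals is no longer driven by the sex-specific offset.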

Chapter 3

Identifying Similarities Between

Diseases

3.1 Introduction

Genome-wide Association Studies (GWAS) allow the identification of associations

between genotype and phenotype. The Wellcome Trust Case Control Consortium

(WTCCC) genotyped 500,000 SNPs in seven common diseases: type 1 diabetes (T1D),

type 2 diabetes (T2D), coronary artery disease (CAD), Crohn’s disease (CD), bipolar

disease (BD), hypertension (HT) and rheumatoid arthritis (RA) [10]. In this chapter

we use the individual genotype data from this study in order to identify similarities

between the genetic architecture of diseases.

Computational methods have been used to identify disease similarities using a

variety of data sources, including gene expression in cancer [32] and known relation-

ships between mutations and phenotypes [33]. However, while a large number of

GWAS focusing on individual diseases have been recently published, the attempts

to integrate the results of multiple studies have been limited. Most of these integra-

tion approaches focus on combining multiple studies of the same disease in order to

increase the statistical power [34], or use data from other high-throughput measurement modalities to improve the results of GWAS [35]. Comparisons between the genetic components of diseases have been made using four different approaches.


CHAPTER 3. IDENTIFYING SIMILARITIES BETWEEN DISEASES 20

The first approach is based on the identification of the association between one SNP

in two different diseases in two independent studies. The second approach selects

a group of SNPs that have been previously associated with some disease and tests

if they are also associated with a different disease. An example of this approach is

the genotyping of a large number of individuals with type 1 diabetes at 17 SNPs

that have been associated with other autoimmune diseases, which led to the identification of a locus previously associated only with rheumatoid arthritis as being

significantly associated with type 1 diabetes as well [36]. The third approach pools

data from individuals with several diseases prior to the statistical analysis, and has

been used in the original WTCCC study. Several similar diseases (autoimmune dis-

eases, metabolic and cardiovascular diseases) are grouped in order to increase the

statistical power for identifying SNPs that are significantly associated with all the

diseases in the pool. The fourth approach compares the results of multiple GWAS,

and has been previously applied to the WTCCC data set [37]. The authors of that analysis use the P-values

indicating the significance of the association between a SNP and a single disease, and

compute the correlations between these P-values in pairs of diseases, as well as the

size of the intersection of the 1000 most significant SNPs in pairs of diseases. They

identify strong similarities between type 1 diabetes and rheumatoid arthritis, between

Crohn’s disease and hypertension, and between bipolar disease and type 2 diabetes.
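The two statistics used by the fourth approach can be illustrated with a short sketch. It is not the implementation of [37]: whether that analysis correlates raw or transformed P-values is not stated here, so correlating raw P-values is an assumption, and the SNP identifiers and P-values below are made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def top_k_overlap(p_a, p_b, k):
    """Number of SNPs that are among the k smallest P-values in both diseases."""
    top_a = set(sorted(p_a, key=p_a.get)[:k])
    top_b = set(sorted(p_b, key=p_b.get)[:k])
    return len(top_a & top_b)

# Hypothetical per-SNP association P-values for two diseases.
p_disease_1 = {"snp1": 1e-8, "snp2": 0.40, "snp3": 1e-5, "snp4": 0.90}
p_disease_2 = {"snp1": 1e-6, "snp2": 0.70, "snp3": 0.20, "snp4": 1e-4}

snps = sorted(p_disease_1)
r = pearson([p_disease_1[s] for s in snps], [p_disease_2[s] for s in snps])
shared = top_k_overlap(p_disease_1, p_disease_2, k=2)
```

In the analysis described above, k is 1000 and the comparison is repeated for every pair of diseases.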

In this work we introduce a novel approach to identify similarities in the genetic

architecture of diseases. We train a classifier that distinguishes between a reference

disease and the control set. We then use this classifier to classify all the individuals

that have a query disease. If there is a similarity at the genetic level between the

query disease and the reference disease, we expect more individuals with the query

disease to be classified as belonging to the disease class than if there is no similarity.

We generalize our procedure to multiple disease comparison: given a set of multiple

diseases, we use each in turn as the reference disease while treating all others as query

diseases.

There are two main differences between our new approach and existing analyses.

First, previous approaches (such as [37]) compute a significance score for each SNP,

and then use these scores for comparing diseases. In our approach, we first compute a


classification for each individual, and then compare diseases using these classifications.

Second, we train the classifier using information from all SNPs, and during this

learning process select the SNPs that contribute to the classification based on the

genotype data only. This genome-wide approach makes it possible to see the classifier

as a statistical representation of the differences between the disease set and the control

set.

The use of classifiers in the context of GWAS has been limited so far. In partic-

ular, attempts at using them for predicting outcome based on genotype have been

unsuccessful. For example, a recent prospective study in type 2 diabetes [38] found

that using 18 loci known to be associated with type 2 diabetes in a logistic regression

classifier together with known phenotypic risk factors does not significantly improve

the risk classification, and leads to a reclassification in only 4% of the patients. A par-

ticular challenge in the context of outcome prediction is that the prevalence of most

diseases is relatively low and that it is therefore necessary to achieve high precision in

order for the classifier to be usable. Our goal is not predicting individual outcomes,

and we only compare predictions made by a single classifier. We can therefore ignore

disease prevalence.

A second challenge in the use of a classification approach for finding disease simi-

larities is that the classifier does not explicitly identify genetic features of the disease,

but rather learns to distinguish the disease set from the control set. Differences

between the two sets that are due to other factors might therefore lead to incorrect

results. In most GWAS, a careful choice of matched controls limits this risk. However,

when using a classifier trained on one GWAS to classify individuals from a different

study, there is a risk that the background distribution of SNPs is very different be-

tween the populations in which the data sets have been collected, which could lead to

errors, particularly when comparing diseases using data sets from different geographic

origins. This risk can be limited by using disease data from a single source. In this

work, we use genotype data provided by the WTCCC study, in which all individu-

als were living in Great Britain and individuals with non-Caucasian ancestry were

excluded.

In this chapter, we first provide a detailed description of the analysis approach. We


then show that we are able to train classifiers that achieve a classification error that is

clearly below the baseline error for type 1 diabetes, type 2 diabetes, bipolar disease,

hypertension and coronary artery disease. We use these classifiers to identify strong

similarities between type 1 diabetes and rheumatoid arthritis, as well as between

hypertension and bipolar disease, and weak similarities between type 1 diabetes and

both bipolar disease and hypertension. We also show that we are able to train a

classifier that distinguishes between the two control sets in the WTCCC data. We

use this classifier to identify similarities between some diseases and individual control

sets. This finding matches observations made during the quality check phase of the

original study. The implications of this finding on our approach are addressed in the

results section. Finally, we discuss the implications of the similarities we find, and

propose extensions of this approach. The data set used in this work and the data pre-processing are described in Chapter 2; the decision tree classifier and the comparison procedure are described in Section 3.6.

3.2 Approach

In this section, we define the general classifier-based approach to identify genetic

similarities between diseases. The approach can be separated into four steps: data

collection, pre-processing, classifier training and disease comparison. Figure 3.1 pro-

vides an overview of the training and comparison steps.

The data collection step consists of collecting samples from individuals with several

diseases, as well as matched controls, and genotyping them. Alternatively, existing

data can be reanalyzed. In both cases, it is important to limit the differences between

the disease sets and the control sets that are not related to the disease phenotype.

Similarly, differences between the different disease sets should also be limited. In

particular, it is recommended to use individuals with the same geographic origin, the

same ancestry, and a single genotyping technology for the whole study. In this work

we use existing data from the WTCCC which satisfies these criteria.


[Figure 3.1 diagram. Top: "Genotypes of Individuals with a Reference Disease" and "Genotypes of Healthy Control Individuals" feed into "Learn a Classifier For a Reference Disease vs. Control". Middle: two histograms, titled Controls and Reference Disease, plot Frequency against disease-class probability (0.0 to 1.0), with the average shown in red; the labeled values are 0.26 and 0.65. Bottom: "Apply Reference Disease Classifier to Genotype Data from Other Diseases and Compare Results", showing Disease A, Disease B and Disease C on a disease-class probability axis (0.3 to 0.6) alongside Controls and Reference Disease.]

Figure 3.1: Overview of the approach.

This figure presents the classification and comparison steps of our analysis pipeline. These steps are repeated using a different reference disease each time. The classifier returns a real value between 0.0 and 1.0 which we call disease-class probability. The histograms represent the distribution of the disease-class probability of the individuals with the reference disease (left) and of the controls (right). In the situation depicted on this figure, there is evidence that query disease C is more similar to the reference disease than the other query diseases.


In the pre-processing step, the data are filtered: uncertain genotype measurements, as well as individuals and SNPs that do not meet quality requirements, are discarded. It is important to develop pre-processing steps that ensure good data quality.

Approaches that analyze each SNP individually can afford to have a more stringent,

often manual post-processing step on the relatively few SNPs that show strong asso-

ciation. The SNPs that do not pass this quality inspection can be discarded without

affecting the results obtained on other SNPs. In our approach however, classifier

training is done using genome-wide information, and removing even a single SNP

used by the classifier could potentially require re-training the entire classifier. It is

therefore impractical to perform any kind of post-processing at the SNP level. Chap-

ter 2 describes the data used in this work, as well as the quality control measures we

take.

The classifier training and comparison steps are interleaved. We start with a list

of diseases and a set of individual genotypes for each disease, as well as at least

one set of control genotypes. We pick one disease as reference disease, and refer

to the remaining diseases as query diseases. We train a classifier distinguishing the

corresponding disease set from the control set. For any individual, this classifier could

either return a binary classification (with value 0 and 1 indicating that the classifier

believes the individual is part of, respectively, the controls class or the disease class)

or a continuous value between 0 and 1. This continuous value can be seen as the

probability of the individual to be part of the disease class, as predicted by the

classifier. We refer to this value as disease-class probability. For simplicity, we will only

use the disease-class probability values for the rest of this section, but the comparison

step can be performed similarly using binary classifications. During the comparison

step, we classify individuals from the query disease sets using the classifier obtained

in the training step, and for each query disease, compute the average disease-class

probability. The training and comparison steps are then repeated so that each disease

is used once as reference disease.
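The interleaved training and comparison steps can be sketched as a loop over reference diseases. The sketch below is an illustration under stated assumptions, not the pipeline used in this work: `train_stump` is a hypothetical single-SNP stand-in for the decision tree classifier, genotypes are assumed to be coded as 0, 1 or 2 copies of an allele, and the toy data are made up. Cross-validation on the reference and control sets is omitted.

```python
from statistics import mean

def train_stump(cases, controls):
    """Hypothetical stand-in classifier: uses the single SNP whose mean genotype
    differs most between cases and controls, and returns a function mapping a
    genotype to a disease-class probability."""
    n_snps = len(cases[0])
    best = max(range(n_snps),
               key=lambda j: abs(mean(g[j] for g in cases) - mean(g[j] for g in controls)))
    counts = {}  # genotype value at the chosen SNP -> [training cases, training total]
    for label, group in ((1, cases), (0, controls)):
        for g in group:
            c = counts.setdefault(g[best], [0, 0])
            c[0] += label
            c[1] += 1
    prior = len(cases) / (len(cases) + len(controls))

    def predict(genotype):
        c = counts.get(genotype[best])
        return c[0] / c[1] if c else prior  # unseen value: fall back to case fraction

    return predict

def compare_diseases(disease_sets, controls, train=train_stump):
    """Use each disease once as reference; return the average disease-class
    probability of every query disease under that reference classifier."""
    results = {}
    for reference, cases in disease_sets.items():
        predict = train(cases, controls)
        results[reference] = {
            query: mean(predict(g) for g in genotypes)
            for query, genotypes in disease_sets.items() if query != reference
        }
    return results

# Toy genotypes over two SNPs (made up for illustration).
controls = [(0, 0), (0, 1), (1, 0), (0, 0)]
toy = {
    "A": [(2, 0), (2, 1), (1, 0), (2, 0)],
    "B": [(2, 1), (2, 0), (2, 0), (1, 1)],
    "C": [(0, 2), (0, 2), (1, 2), (0, 1)],
}
table = compare_diseases(toy, controls)
```

In this toy example, `table["A"]["B"]` is larger than `table["A"]["C"]`, which would be read as query disease B being more similar to reference disease A than query disease C is. Any classifier that returns a value between 0 and 1 can be passed as `train`.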

We can compare the average disease-class probability of the different query dis-

eases to identify similarities between them. Diseases that have a higher average

disease-class probability are more likely to be similar to the reference disease than


diseases with a lower average disease-class probability. Using cross-validation, we can

obtain the average disease-class probability of the reference disease set and the control

set used for training the classifier, and compare them to the values of the other dis-

eases. One particular caveat that needs to be considered in this analysis is that while

the classifier does distinguish the control set from the disease set, there is no guaran-

tee that it will only identify genetic features of the disease set. It is also possible that

it will identify and use characteristics of the training set, especially if there are data

quality issues. This case can be identified during the comparison step if the average

disease-class probability of most query diseases is close to the average disease-class

probability of the reference disease, but very different from the average disease-class

probability of the control set. It is therefore important to look at the distribution of

the average disease-class probabilities of all query diseases before concluding that an

individual disease is similar to the reference disease.

It is important to note that the disease-class probability of a given individual

does not correspond to the probability of this individual actually having the disease.

The disease frequency is significantly higher in the data sets we use for training the

classifier than in the real population. In a machine learning problem in which the

test data are class-imbalanced, training is commonly done on class-balanced data, and

class priors are then used to correct for the imbalance. Such priors would, however,

scale all probabilities linearly, and would not affect the relationships we identify, nor

their significance. Estimating the probability of an individual having the disease is

not the goal of this project and we can therefore ignore class priors.

A large variety of classifiers can be integrated into the analysis pipeline used

in our approach. The methods section provides a more formal description of the

classification task. In this work, we use a common classifier, decision trees, to show

that this approach allows us to identify similarities. The specific details about the

decision tree classifier, and how its outputs are used in the analysis step are described

in the methods section.


Disease   Baseline   Error     Precision   Recall    ∆p      Leaves
T1D       40.05%     22.93%    71.65%      70.71%    0.383   9
RA        38.43%     33.45%    59.12%      42.09%    0.130   12
BD        38.24%     33.59%    62.60%      30.18%    0.087   11
HT        39.92%     36.77%    57.98%      28.64%    0.080   12
CAD       39.05%     36.62%    55.25%      32.73%    0.075   12
T2D       39.5%      38.0%     54.12%      25.05%    0.052   14
CD        36.63%     36.28%    29.83%      18.43%    0.046   11

Table 3.1: Classifier performance (cross-validation)

Baseline corresponds to the baseline error, Error, Precision and Recall to the cross-validation performance of the decision tree classifier, ∆p to the difference between the average disease-class probability of the control set, and the average disease-class probability of the disease set, and Leaves to the maximum number of leaves in the pruned classifiers for this disease.

3.3 Results

We evaluate the ability of our analysis approach to identify similarities between dis-

eases using the set of seven diseases provided by the WTCCC. In this section, we first

evaluate the performance of individual classifiers that distinguish one disease from the

joint control set. We then show that these classifiers can identify similarities between

diseases. Finally, we use our classifier to identify differences between the two control

sets, and provide evidence indicating that these differences do not affect the disease

similarities we identify.

3.3.1 Classifier performance

We first train one classifier for each disease using both the 58C and the UKBS sets as

controls. The performance of each classifier is evaluated using cross-validation, and

reported in Table 3.1. We compare our classifier to a baseline classifier that classifies

all individuals into one class without using the SNP data at all. The best error such

a classifier can achieve during cross-validation is the frequency of the smaller class in

the training set. We refer to this value as the baseline error.
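As a minimal sketch of the baseline error defined above, the following computes the error of a classifier that assigns the majority label to every individual. The function name and the class sizes are hypothetical and chosen only for illustration; they are not the WTCCC set sizes.

```python
from collections import Counter


def baseline_error(labels):
    """Error of a classifier that assigns the majority label to everyone.

    This equals the frequency of the smaller class in `labels`.
    """
    counts = Counter(labels)
    majority = max(counts.values())
    return 1.0 - majority / len(labels)


# Hypothetical class sizes, chosen only for illustration.
labels = ["control"] * 3000 + ["disease"] * 2000
print(f"baseline error: {baseline_error(labels):.2%}")
```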

The disease for which the classifier performs best is type 1 diabetes, with a clas-

sification error of 22.93%, compared to a baseline error of 40.05%. The classification

error obtained by the decision tree classifier is also below the baseline error for sev-

eral other diseases, although by a substantially smaller margin. This is the case

for rheumatoid arthritis (with an error of 33.45% versus 38.43%), bipolar disease

(33.59% versus 38.24%), hypertension (36.77% versus 39.92%) and coronary artery

disease (36.62% versus 39.05%). For two diseases, type 2 diabetes and Crohn’s disease,

the improvement compared to the baseline error is only minimal, and we choose

not to use these classifiers in our analysis. While the classifiers that we keep only

provide small improvements in terms of classification error (with the exception of

type 1 diabetes), they have a significantly better trade-off between precision (at least

55%) and recall (at least 28%) than the baseline classifier (which would classify all

individuals as controls).

We do not use these classifiers in a binary way, but rather use the disease-class

probability, which is the conditional probability of an individual to be part of the

disease-class given its genotype, under the model of the reference disease learned by

the classifier (see Methods for a precise definition for decision trees). It is therefore

interesting to consider the distributions of the disease-class probability, as obtained

during cross-validation. Figure 3.2 illustrates that these distributions differ signifi-

cantly for type 1 diabetes. It can also be seen that there are individuals for which

the disease-class probability is close to 50%, meaning that there are leaf nodes in

the classifier that represent subsets of the data that cannot be distinguished well.

Our approach takes this into account by using disease-class probabilities rather than

binary classifications. In order to evaluate the ability of our classifiers to distinguish

between the disease set and the control set using the disease-class probability met-

ric, we use the difference ∆p of the average disease-class probability between the two

sets. The classifiers that we keep all have ∆p values of at least 0.075. This illustrates

that while there are only small improvements in binary classification performance,

the classifiers are able to distinguish between the disease set and the control set in

the way we intend to use them.
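The ∆p metric can be sketched as follows, assuming that cross-validation yields one disease-class probability per individual. The helper names and the probability values are hypothetical. With the averages reported for type 1 diabetes (0.642 for the disease set and 0.259 for the control set), the same subtraction gives the ∆p of 0.383 listed in Table 3.1.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def delta_p(control_probs, disease_probs):
    """Average disease-class probability of the disease set minus
    the average disease-class probability of the control set."""
    return mean(disease_probs) - mean(control_probs)


# Hypothetical cross-validation outputs, one probability per individual.
control_probs = [0.20, 0.25, 0.30, 0.55]
disease_probs = [0.45, 0.60, 0.70, 0.85]
dp = delta_p(control_probs, disease_probs)
print(f"delta p = {dp:.3f}")
```

Note that one control (0.55) and one case (0.45) in this toy example fall on the wrong side of the 0.5 cut-off, yet the two sets remain clearly separated on average.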

3.3.2 Disease similarities

For each of the five classifiers with sufficiently good performance, we compute the aver-

age disease-class probability of each of the six query diseases. In summary, we identify

strong symmetrical similarities between type 1 diabetes and rheumatoid arthritis, as

well as between bipolar disease and hypertension. Furthermore we find that type 1

diabetes is closer to both bipolar disease and hypertension than other diseases, even

though we did not find the symmetrical relation using the type 1 diabetes classifier.

This section provides a detailed presentation of these results.

For type 1 diabetes, the average disease-class probabilities for the control set and

the disease set, as computed using cross-validation, are respectively 0.259 and 0.642.

Figure 3.2 shows the distribution of the average disease-class probabilities for the

query diseases. Rheumatoid arthritis, another auto-immune disease, is clearly the

closest to type 1 diabetes (average disease-class probability of 0.337). This result is

significant, with a P-value smaller than $10^{-5}$ (see the Methods section for details on how

P-values are obtained). All other diseases have an average disease-class probability

that is close to that of the control set, which means that there is no evidence of

similarity with type 1 diabetes.

For rheumatoid arthritis, the average disease-class probabilities are 0.303 for the

control set and 0.433 for the disease set. The distribution of the average disease-class

probabilities for the other diseases is shown in Figure 3.3a. We can observe that

type 1 diabetes (average disease-class probability of 0.397) is closest to rheumatoid

arthritis (P-value < $10^{-5}$), meaning that we find a symmetrical similarity between

the two diseases. All other diseases have an average disease-class probability close to

that of the control set.

For bipolar disease, the average disease-class probabilities are 0.297 for the con-

trol set and 0.384 for the disease set. The distribution of the average disease-class

probabilities for the query diseases is shown in Figure 3.3b. We can observe that

there is a wider spread in the average disease-class probabilities, and that there is

no cluster of diseases close to the control set. We can also observe that hypertension

(average disease-class probability of 0.359, P-value < $10^{-5}$) is closest to bipolar disease,

followed by type 1 diabetes (average disease-class probability of 0.354, P-value of 0.001).

[Figure 3.2 panels: histograms of frequency against disease-class probability for the joint control set (top, average 0.259) and the type 1 diabetes set (bottom, average 0.642). Central plot, average disease-class probabilities of the other diseases: CD 0.239, BD 0.246, CAD 0.248, T2D 0.254, HT 0.255, RA 0.337.]

Figure 3.2: Distribution of the disease-class probabilities for the type 1 diabetes classifier.

The two histograms show the distribution of the disease-class probability of the individuals respectively in the joint control set (top) and in the type 1 diabetes set (bottom), as computed during cross-validation. The red lines represent the average disease-class probabilities, and the black line indicates the 0.5 probability cut-off used for binary classification. The plot in between the histograms shows the average disease-class probabilities of the six other diseases on the interval between the average disease-class probabilities of the control set and of the disease set.

[Figure 3.3 panels, average disease-class probabilities:
a. Rheumatoid Arthritis. Controls: 0.303, RA: 0.433. CD 0.3, BD 0.305, CAD 0.307, HT 0.307, T2D 0.312, T1D 0.397.
b. Bipolar Disease. Controls: 0.297, BD: 0.384. CD 0.332, CAD 0.338, RA 0.343, T2D 0.344, T1D 0.354, HT 0.359.
c. Hypertension. Controls: 0.315, HT: 0.395. CAD 0.338, CD 0.343, RA 0.348, T2D 0.351, T1D 0.368, BD 0.381.
d. Coronary Artery Disease. Controls: 0.303, CAD: 0.378. HT 0.319, CD 0.321, RA 0.331, BD 0.337, T1D 0.341, T2D 0.342.]

Figure 3.3: Disease-class probabilities comparisons.

The plots represent the interval between the average disease-class probabilities of the control set and of the disease set for respectively rheumatoid arthritis, bipolar disease, hypertension and coronary artery disease. The average disease-class probabilities for all the query diseases are shown in blue on every plot. Note that while all plots on this figure use the same scale, different scales are used for the central plots of Figures 3.2 and 3.4.

For hypertension, the average disease-class probabilities are 0.315 for the con-

trol set and 0.395 for the disease set. The distribution of the average disease-class

probabilities for the other diseases is shown in Figure 3.3c. We can observe that

bipolar disease (average disease-class probability of 0.381, P-value < $10^{-5}$) is clearly

closest to hypertension. Type 1 diabetes (average disease-class probability of 0.368,

P-value < $10^{-5}$) is also closer to hypertension than the remaining diseases.

For coronary artery disease the average differences between the query diseases are

smaller than for all the other classifiers (Figure 3.3d). Furthermore, the classifier

for coronary artery disease is the one with the worst performance amongst the ones

we use in the comparison phase. Therefore we believe that the results are not strong

enough to report putative similarities identified using this classifier, even though some

differences between diseases have significant P-values.

3.3.3 Differences between control sets

The original WTCCC study found several SNPs that are significantly associated with

one of the two control sets. These SNPs are filtered out during pre-processing, both

in the WTCCC study and in this work. However, the mere existence of differences

between two control sets prompted the question whether a classifier could distinguish

the two sets, and if so, what the implications of this finding would be on the validity

of results obtained with these control sets.

We perform several experiments using the two control sets separately, and report

the results in Table 3.2. First, we train a control-control classifier that distinguishes

the two control sets from each other. This classifier achieves an error of 41.15%

compared to a baseline error of 49.62%, and a ∆p of 0.093. This shows that we are

able to distinguish to some extent between the two control sets. Figure 3.4 shows

the distribution of the 58C class probability (which corresponds to the value called

disease-class probability when the classifier distinguishes between one disease and the

controls). In order to verify that this result is due to differences between the two

specific control sets, and not to the ability of our classifier to distinguish between any

two sets, we randomly split all control individuals into two sets, R1 and R2. We

train a classifier to distinguish between these two sets. We find that this classifier

only minimally improves the classification error (error of 49.45%, baseline error

of 50.03%, ∆p of -0.003).

Experiment   Baseline   Error    Precision   Recall   ∆p       Leaves
UKBS / 58C   49.62%     41.15%   58.33%      64.05%   0.093    11
R1 / R2      50.03%     49.45%   50.59%      46.42%   -0.003   11
UKBS / T1D   42.62%     23.15%   79.53%      80.34%   0.402    8
58C / T1D    42.99%     24.46%   76.60%      82.22%   0.370    8
UKBS / RA    44.29%     36.42%   66.21%      70.72%   0.144    10
58C / RA     44.66%     38.11%   64.89%      67.83%   0.135    9

Table 3.2: Separate training set classifier performance

Baseline corresponds to the baseline error, Error, Precision and Recall to the cross-validation performance of the decision tree classifier, ∆p to the difference between the average disease-class probability of the control set, and the average disease-class probability of the disease set, and Leaves to the maximum number of leaves in the pruned classifiers for this experiment. R1 and R2 represent two random splits of the joint control set.

We apply the comparison step of our pipeline using the control-control classifier

in order to identify possible similarities between the disease set and one of the control

sets. Figure 3.4 shows the distribution of the average 58C class probabilities for each

disease. The average disease-class probabilities obtained during cross-validation are

0.477 for the UKBS set and 0.561 for the 58C set. Both hypertension (average 58C

class probability of 0.521, P-value < $10^{-5}$) and bipolar disease (average 58C class

probability of 0.514, P-value of 0.0002) are closer to the 58C control set, whereas

both rheumatoid arthritis (average 58C class probability of 0.487, P-value < $10^{-5}$)

and coronary artery disease (average 58C class probability of 0.489, P-value of 0.0003)

are closer to the UKBS control set.

[Figure 3.4 panels: histograms of frequency against 58C class probability for the UKBS control set (top) and the 58C control set (bottom). Averages marked on the central plot: 0.468 and 0.561. Central plot, average 58C class probabilities of the diseases: RA 0.487, CAD 0.489, T1D 0.498, T2D 0.499, CD 0.503, BD 0.514, HT 0.521.]

Figure 3.4: Distribution of the class probabilities for the control-control classifier.

This classifier distinguishes the UKBS control set from the 58C control set. The two histograms show the distribution of the 58C class probability of the individuals respectively in the UKBS control set (top) and in the 58C control set (bottom), as computed during cross-validation. The red lines represent the average class probabilities, and the black line indicates the 0.5 probability cut-off used for binary classification. The plot in between the histograms shows the average disease-class probabilities of all seven other diseases on the interval between the average class probabilities of the two control sets.

Given the differences between the control sets, and the unexpected similarities

between control sets and diseases, we are interested in verifying that the performance

of the disease-classifiers used in the analysis is not an artifact caused by these differences.

We therefore train two new classifiers for each disease, one using only UKBS

as control set, and one using only 58C as control set. The performance of these

classifiers for type 1 diabetes and rheumatoid arthritis is shown in Table 3.2, and is

similar to the performance of the classifiers that use both control sets together. For

the remaining diseases (including hypertension and bipolar disease), the classifiers

using only one of the control sets do not achieve a classification error below the base-

line error, most likely due to the smaller training set (i.e. overfitting). For each of

the classifiers for type 1 diabetes and rheumatoid arthritis we compute the average

disease-class probability for the other six diseases as well as the unused control set.

The similarities between the two diseases are significant in all four classifiers. Further-

more, the average disease-class probability of the unused control set is similar to the

average disease-class probability of the other five diseases, and not significantly

closer to type 1 diabetes or rheumatoid arthritis. Therefore we can conclude that the

results obtained using the type 1 diabetes and rheumatoid arthritis classifiers are not

due to differences between the control sets. Furthermore, the results using a single

control set provide further evidence indicating that the classifiers do identify relevant

features of respectively type 1 diabetes and rheumatoid arthritis, rather than relevant

features of the control set.

3.4 Discussion

In this work, we introduce a novel approach for identifying genetic similarities between

diseases using classifiers. We identify genetic similarities between several diseases.

In this section, we first discuss the implications of these findings. We then consider

challenges in the application of classifiers to GWAS data. Finally, we propose possible

extensions of this approach.

We identify a strong similarity between type 1 diabetes and rheumatoid arthritis.

Genetic factors that are common to these two autoimmune diseases were identified

well before the advent of GWAS, and linked to the HLA genes [39], [40]. The original

WTCCC study [10] identifies several genes that appear to be associated with both

diseases. We look at the classifiers corresponding to these two diseases. The SNP

with the highest information gain in type 1 diabetes is rs9273363, which is located on

chromosome 6, near MHC class II gene HLA-DQB1, and is also the SNP that is most

strongly associated with type 1 diabetes in the initial analysis of the WTCCC data,

with a P-value of $4.29 \cdot 10^{-298}$ [41]. This is the strongest association reported for any

disease in the WTCCC study, which explains to a large extent why the type 1 diabetes

classifier so clearly outperforms the classifiers for the other diseases. This SNP is also

significantly associated with rheumatoid arthritis (P-value of $6.74 \cdot 10^{-11}$). The SNP

with the highest information gain in rheumatoid arthritis is rs9275418, which is also

part of the MHC region, and is strongly associated with both rheumatoid arthritis

(P-value of $1.00 \cdot 10^{-48}$) and type 1 diabetes (P-value of $7.36 \cdot 10^{-126}$). This shows

that our approach is able to recover a known result, and uses SNPs that have been

found to be significantly associated with both diseases in an independent analysis of

the same data.

The similarity we identify between hypertension and bipolar disease is interest-

ing, since there does not appear to be previous evidence of a link between the two

diseases at the genetic level. However, a recent study identified an increased risk

of hypertension in patients with bipolar disease compared to the general population, as

well as compared to patients with schizophrenia in the Danish population [42]. The

WTCCC study only identified SNPs with moderate association to hypertension (lowest

P-value of $7.85 \cdot 10^{-6}$) and a single SNP with strong association with bipolar disease

(P-value of $6.29 \cdot 10^{-8}$). The decision trees for both diseases use a large number of

SNPs that have a very weak association with the respective disease. Both classifiers

have a classification error that is clearly below the baseline error, and provide evi-

dence of similarity between the two diseases. This indicates that our classifier-based

approach is able to use the weak signals of a large number of SNPs to identify evidence

for similarities that would be missed by comparing only SNPs that show moderate or

strong association with the diseases. Further analyses are necessary to identify the

nature and implications of the similarity we find between hypertension and bipolar

disease, as well as the weaker similarity we identified between these two diseases and

type 1 diabetes.

We also show that we can train a classifier that can distinguish the two control sets,

and we use it to identify diseases that are more similar to one of the control sets than

the other. This is not an unexpected finding, since SNPs that were strongly associated

with a control set were identified and discarded in the WTCCC study. These SNPs

were also removed in the pre-processing step of our study, and the results we obtain

when trying to distinguish the two control sets therefore show that the decision tree

classifier is able to achieve a classification error below the baseline error even though

the SNPs with the strongest association could not be used by the classifier. The

similarities between some diseases and one of the control sets can most likely be

explained by some subtle data quality issue. During quality control, the authors of

the WTCCC study found several hundred SNPs in which some data sets exhibited

a particular probe intensity clustering (see the Supplementary Material of the original

WTCCC study [10] for details). This particular pattern was always observed in 58C,

BD, CD, HT, T1D, T2D, but not in UKBS, RA and CAD. This matches the result

obtained using our classifier-based approach, in which RA and CAD were predicted

to be most similar to UKBS, and could therefore be a possible explanation of the

similarities we find.

While we do find several interesting similarities between diseases, we also observe

that training a classifier that distinguishes between individuals with a disease and

controls using SNP data poses numerous challenges. The first is that whether someone

will develop a disease is strongly influenced by environmental factors. The genetic

associations that can be identified using GWAS are only predispositions, and it is

therefore likely that some fraction of the control set will have the predispositions, but

will not develop the disease. Furthermore, depending on the level of screening, the

disease might be undiagnosed in some control individuals, and individuals that are

part of a disease set might have other diseases as well. This is especially true for high

prevalence diseases like hypertension.

Obtaining good classifier performance by itself is not, however, the main goal

of our approach. We show that we can find similarities even when the classifier

performance only shows small improvements compared to the baseline error. In this

work, we focus on the comparison approach, not on developing a classifier specially

suited for the particular task of GWAS classification. We use decision trees because

they are a simple, commonly used classification algorithm.

This work shows that classifiers can be used to identify similarities between diseases.

This novel approach can be expanded in several directions. First, classification

performance can potentially be improved by using a different generic classifier,

or by developing classifiers that take into account the specific characteristics of

SNP data. Second, further analysis methods need to be developed in order to analyze

the trained classifiers, and identify precisely the SNPs that lead to the similarities

this approach detects. Such a methodology would be useful, for example, to further

analyze the putative similarity between hypertension and bipolar disease. Third,

building on the fact that our approach considers the whole genotype of an individual,

it could be possible to identify subtypes of diseases, and cluster individuals according

to their subtype. Finally, modifying the approach to allow the integration of studies

performed in populations of different origins or using different genotyping platforms

would allow the comparison of a larger number of diseases.

Our approach identifies similarities between the genetic architecture of diseases.

This is however only one of the many axes along which disease similarities could be

described. In particular, both genetic and environmental factors interact in diseases,

and the genetic architecture for two diseases could be similar, but the environmental

triggers could be different, leading to low co-occurrence. There is therefore a need

for methods that integrate similarities of different kinds that were identified using

different measurement and analysis modalities. An example of such an approach

is the computation of disease profiles that integrate both environmental etiological

factors and genetic factors [43].

In a study published after this work, and which is not part of this thesis, we present

a separate method that integrates allele-specific effects into the study of similarities

and differences in the genetic architecture of diseases [44]. We find that autoimmune

diseases can be separated into two classes. Furthermore, we identify SNPs for which

one allele increases the risk of having one autoimmune disease, but decreases the risk

of having another autoimmune disease.

3.5 Conclusion

Genome-wide association studies have been used to identify candidate loci likely to

be linked to a wide variety of diseases. In this work, we introduce a novel approach

that allows identifying similarities between diseases using GWAS data. Our approach

is based on training a classifier that distinguishes between a reference disease and a

control set, and then using this classifier for comparing several query diseases to

the reference disease. This approach is based on the classification of individuals using

their full genotype, and is thus different from previous work in which the independent

statistical significance of each SNP is used for comparing diseases.

We apply this approach to the genotype data of seven common diseases provided

by the Wellcome Trust Case-Control Consortium, and show that we are able to iden-

tify similarities between diseases. We replicate the known finding that there is a

common genetic basis for type 1 diabetes and rheumatoid arthritis, find strong ev-

idence for genetic similarities between bipolar disease and hypertension, as well as

evidence for genetic similarities between type 1 diabetes and both bipolar disease

and hypertension. We also find similarities between one of the control sets used in

the WTCCC (UKBS) and two disease sets, rheumatoid arthritis and coronary artery

disease. This similarity can possibly be a consequence of the subtle differences in

genotyping quality that were observed during the initial quality control performed by

the WTCCC.

Our results demonstrate that it is possible to use a classifier-based approach to

identify genetic similarities between diseases, and more generally between multiple

phenotypes. We expect that this approach can be improved by using classifiers that

are more specifically tailored for the analysis of GWAS data, and by the integration

of a larger number of disease phenotypes. The ability to compare similarities between

diseases at the whole-genome level will likely identify many more currently unknown

similarities. Genetic similarities between diseases provide new hypotheses to pursue

in the investigation of the underlying biology of the diseases, and have the potential

to lead to improvements in how these diseases are treated in the clinical setting.

3.6 Methods

In this section, we first formally define the classification task that is central to our

approach, then describe the specific classifier we use in this work and how we evaluate

its performance, and finally describe how we use the classification results to infer

relationships between diseases.

3.6.1 Classification task

The data consist of a list of individuals i, a list of SNPs s ∈ S, and the measurement

of the genotype g(s, i) of individual i at SNP s. We use $G_i = \{g(1, i), \ldots, g(|S|, i)\}$

to denote the genotype of individual i at all the SNPs in the study. The genotype

measurement is a discrete variable which can take four values: homozygote for the

major allele, homozygote for the minor allele, heterozygote and unknown:

$g(s, i) \in \{\mathrm{maj}, \mathrm{min}, \mathrm{het}, \mathrm{unk}\}$. Each individual belongs to one of several disease sets, or to

the control set. For the WTCCC data used in this work, we have seven disease sets:

T1D, T2D, CAD, CD, BD, RA, HT, and we use the union of the 58C and UKBS

sets as control set.

For each disease d, we train a classifier that distinguishes between that disease set

and the controls. The individuals that are not part of these sets are ignored during

the training of this classifier. For each individual i used during training, a binary class

variable $c_i$ indicates whether the individual belongs to the disease set ($c_i == \mathrm{disease}$)

or to the control set ($c_i == \mathrm{control}$). The supervised classification task consists of

predicting the class $c_i$ of an individual i given its genotype $G_i$. In this work, we use

a decision tree classifier, but any algorithm able to solve this classification task can

be easily integrated into our analysis pipeline.

3.6.2 Decision trees

In this section, we describe the decision tree classifier [45]. We use cross-validation in

order to train the classifier, prune the trained decision tree, and evaluate its perfor-

mance on distinct sets of individuals.

We train a decision tree T by recursively splitting the individuals in each node

using maximum information gain for feature selection. We use binary categorical

splits, meaning that we find the best rule of the form $g(s, i) == \gamma$, where

$\gamma \in \{\mathrm{maj}, \mathrm{min}, \mathrm{het}\}$. Binary splits make it possible to handle cases in which only one

of the three possible genotypes is associated with the disease without unnecessarily

splitting individuals that have the two other genotypes. Unknown values are ignored

when computing information gain. This is necessary since there is a correlation

between the frequency of unknown values and the quality of the genotyping, which

in turn is variable between the different data sets. Counting unknown values during

training could therefore lead to classifiers separating the two sets of individuals based

on data quality differences, rather than based on genetic differences. However, if a

large number of measurements are unknown for a given SNP, the information gain

for that SNP will be biased. This is particularly true if the fraction of unknowns is

very different between the cases and the controls. In order to avoid this situation,

we discard all SNPs that have more than 5% unknown genotypes amongst the

training individuals in the node we are splitting. In each leaf node L, we compute the

fraction $f_L$ of training individuals in that node that are part of the disease class:

$$f_L = \frac{\sum_{i \in L} (c_i == \mathrm{disease})}{|L|}.$$
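The split selection and the leaf fraction described above can be sketched as follows. This is an illustrative reading rather than the implementation used in this work: the function names are hypothetical, entropy is measured in bits, and "ignoring" unknown values is assumed to mean that individuals with an unknown genotype at a SNP are left out when the gain of that SNP is computed.

```python
import math

GENOTYPES = ("maj", "min", "het")  # "unk" marks an unknown measurement


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        h -= p * math.log2(p)
    return h


def information_gain(genotypes, labels, gamma):
    """Gain of the binary split g(s, i) == gamma, ignoring unknown values."""
    known = [(g, c) for g, c in zip(genotypes, labels) if g != "unk"]
    if not known:
        return 0.0
    left = [c for g, c in known if g == gamma]
    right = [c for g, c in known if g != gamma]
    n = len(known)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy([c for _, c in known]) - children


def best_split(snps, labels, max_unknown=0.05):
    """Return (gain, snp, gamma) for the best rule, or None if no SNP is usable.

    `snps` maps a SNP identifier to the genotypes of the individuals in the node.
    """
    best = None
    for snp, genotypes in snps.items():
        if genotypes.count("unk") / len(genotypes) > max_unknown:
            continue  # discard SNPs with too many unknown genotypes in this node
        for gamma in GENOTYPES:
            gain = information_gain(genotypes, labels, gamma)
            if best is None or gain > best[0]:
                best = (gain, snp, gamma)
    return best


def leaf_fraction(labels):
    """f_L: fraction of training individuals in a leaf that are in the disease class."""
    return sum(c == "disease" for c in labels) / len(labels)


# Hypothetical node with four individuals and three SNPs.
labels = ["disease", "disease", "control", "control"]
snps = {
    "snpA": ["min", "min", "maj", "het"],  # "min" separates the classes perfectly
    "snpB": ["maj", "het", "maj", "het"],  # uninformative
    "snpC": ["unk", "min", "maj", "maj"],  # 25% unknown, so it is discarded
}
best = best_split(snps, labels)
```

In this toy node, the rule on `snpA` with the value `min` isolates both cases without splitting the `maj` and `het` controls apart, which is the situation that motivates binary categorical splits.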

In order to choose a pruning algorithm, we compare the cross-validation perfor-

mance obtained using Cost-Complexity Pruning [45], Reduced Error Pruning [46], as

well as a simple approach consisting of limiting the tree depth. We find that Reduced

Error Pruning outperforms Cost-Complexity Pruning, and performs as well

as limiting the tree depth, but results in smaller decision trees. We therefore use

Reduced Error Pruning, which consists of recursively eliminating subtrees that do not

improve the classification error on the pruning set (which only contains individuals

that were not used during training).
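A minimal sketch of Reduced Error Pruning follows, under several assumptions that are not stated in the text: every node stores the disease-class fraction of its training individuals, a genotype is a dictionary from SNP identifier to value, and a subtree that is not reached by any pruning individual is collapsed. The `Node` class and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    f_L: float                      # disease-class fraction of training individuals here
    snp: Optional[str] = None       # None for a leaf
    gamma: Optional[str] = None     # rule: genotype[snp] == gamma
    if_true: Optional["Node"] = None
    if_false: Optional["Node"] = None


def predict(node, genotype):
    """Binary classification: follow the rules to a leaf, then apply the 0.5 cut-off."""
    while node.snp is not None:
        node = node.if_true if genotype.get(node.snp) == node.gamma else node.if_false
    return "disease" if node.f_L > 0.5 else "control"


def errors(node, data):
    """Number of misclassified (genotype, label) pairs."""
    return sum(predict(node, g) != c for g, c in data)


def reduced_error_prune(node, data):
    """Prune bottom-up; `data` holds the pruning individuals that reach `node`."""
    if node.snp is None:
        return node
    true_side = [(g, c) for g, c in data if g.get(node.snp) == node.gamma]
    false_side = [(g, c) for g, c in data if g.get(node.snp) != node.gamma]
    node.if_true = reduced_error_prune(node.if_true, true_side)
    node.if_false = reduced_error_prune(node.if_false, false_side)
    as_leaf = Node(f_L=node.f_L)
    # Collapse the subtree when it does not beat a single leaf on the pruning set.
    if errors(node, data) >= errors(as_leaf, data):
        return as_leaf
    return node


# Hypothetical tree: the split on snpB hurts on the pruning set and is removed.
tree = Node(
    f_L=0.5, snp="snpA", gamma="min",
    if_true=Node(f_L=0.9),
    if_false=Node(f_L=0.2, snp="snpB", gamma="het",
                  if_true=Node(f_L=0.6), if_false=Node(f_L=0.1)),
)
pruning_set = [
    ({"snpA": "min", "snpB": "maj"}, "disease"),
    ({"snpA": "maj", "snpB": "het"}, "control"),
    ({"snpA": "maj", "snpB": "maj"}, "control"),
    ({"snpA": "het", "snpB": "het"}, "control"),
]
pruned = reduced_error_prune(tree, pruning_set)
```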

The classification of an individual i using a decision tree T is done by traversing the

tree from the root towards a leaf node L(i) according to the genotype of the individual

which is classified. If $f_{L(i)}$ is greater than 0.5, then the individual is classified as disease,

else the individual is classified as control. We can consider the decision tree T as

a high level statistical model of the difference between the disease and the control sets.

Under this model, the fraction $f_{L(i)}$ represents the conditional probability of individual i

to be part of the disease class given its genotype: $P_T(c_i == \mathrm{disease} \mid G_i) = f_{L(i)}$.

This value is the disease-class probability of individual i. In order to compute the

fractions $f_L$ over sufficiently large numbers of individuals, we further prune our tree to

only have leaf nodes containing at least 100 training individuals. The benefit of using

this probability rather than the binary classification is that it allows us to distinguish

leaf nodes in which there are mainly training individuals from one class from those

in which both classes are almost equally represented.
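The traversal and the resulting disease-class probability can be sketched as follows. The nested-dictionary tree layout, the SNP names and the leaf fractions are hypothetical, and an unknown or missing genotype is assumed to simply fail the rule, which the text does not specify.

```python
def disease_class_probability(tree, genotype):
    """Walk from the root to the leaf L(i) and return f_{L(i)}."""
    node = tree
    while "f_L" not in node:
        matches = genotype.get(node["snp"]) == node["gamma"]
        node = node["if_true"] if matches else node["if_false"]
    return node["f_L"]


def classify(tree, genotype):
    """Binary classification with the 0.5 cut-off."""
    return "disease" if disease_class_probability(tree, genotype) > 0.5 else "control"


# Hypothetical pruned tree with three leaves.
tree = {
    "snp": "snpA", "gamma": "min",
    "if_true": {"f_L": 0.80},
    "if_false": {
        "snp": "snpB", "gamma": "het",
        "if_true": {"f_L": 0.52},
        "if_false": {"f_L": 0.20},
    },
}

confident = {"snpA": "min", "snpB": "maj"}
borderline = {"snpA": "maj", "snpB": "het"}
print(classify(tree, confident), disease_class_probability(tree, confident))
print(classify(tree, borderline), disease_class_probability(tree, borderline))
```

Both example individuals receive the same binary label, but the probabilities 0.80 and 0.52 keep the distinction between a leaf dominated by one class and a leaf in which both classes are almost equally represented.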

In order to assess the performance of our classifier, we perform 5-fold cross-

validation. We start by separating the data into five random sets containing 20%

of the individuals each. A decision tree T is trained using four of these sets, while

one set is reserved for pruning and testing. The unused set is split randomly into two

equal sets. The first of these sets is used to obtain the pruned tree T′ from tree T, and

the individuals in the second set are used to evaluate the performance of tree T′. The

last step is then repeated using the second set for pruning, and the first for testing.

Finally, we repeat the training and evaluation four more times, each time leaving

out a different set for pruning and testing. This ensures that for every individual

in our data set, there is one pruned decision tree for which the individual was used

neither for training nor for pruning. We can therefore evaluate the performance of

the classifier on unseen data. We can also compute the average disease-class proba-

bility p(C) of the control individuals, and the average disease-class probability p(d)

of the individuals with disease d. The difference ∆p between those two probabilities

indicates how well the classifier is able to distinguish controls from diseases. We use

the cross-validation results to compare the performance of the classifier against a

baseline classifier which simply assigns the most frequent label amongst the training

set to all individuals. Classifiers that do not outperform this baseline classifier, or for

which the difference ∆p is small, are not used to identify similarities between diseases.
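The splitting scheme described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `nested_cv_splits`, the use of a fixed random seed, and the strided construction of the five sets are our own assumptions. Training and pruning of the decision tree itself are left out.

```python
import random


def nested_cv_splits(individuals, n_folds=5, seed=0):
    """Yield (train, prune, test) lists for the 5-fold scheme with a split held-out set.

    Hypothetical helper: each held-out fold is cut into two halves, and each
    half is used once for pruning and once for testing.
    """
    rng = random.Random(seed)
    shuffled = list(individuals)
    rng.shuffle(shuffled)
    # Five random sets of (almost) equal size.
    folds = [shuffled[k::n_folds] for k in range(n_folds)]
    for k, held_out in enumerate(folds):
        # The tree T would be trained on the four remaining sets.
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        half = len(held_out) // 2
        first, second = held_out[:half], held_out[half:]
        yield train, first, second  # prune on the first half, test on the second
        yield train, second, first  # then swap the roles of the two halves
```

With five folds this yields ten (train, prune, test) triples, and every individual appears in exactly one test set, for a pruned tree that never saw it during training or pruning.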

Given the cross-validation scheme used, we end up training not one, but several

possibly distinct decision trees. Rather than arbitrarily choosing one, we use the set Td

of all decision trees trained during cross-validation for a given disease d. In order to

classify a new individual i, we first classify i using each classifier independently, and


then return the average classification. Similarly, we average the results of individual

classifiers to obtain the average disease-class probability: $P_{T_d}(c_i = \text{disease} \mid G_i) = \frac{1}{|T_d|} \sum_{T \in T_d} P_T(c_i = \text{disease} \mid G_i)$.

3.6.3 Identifying similarities

Once a classifier has been trained to distinguish the set of individuals with reference

disease d from the control set, we can use it to identify diseases that are similar

to disease d. Using the classifier, we can compute the disease-class probability of

an individual with a query disease d′. In order to be able to compare diseases, we

are interested in computing the average disease-class probability of all individuals

in d′: $p(d') = \frac{1}{|d'|} \sum_{i \in d'} P_{T_d}(c_i = \text{disease} \mid G_i)$. We expect this average probability to be in, or close to, the interval between p(C) and p(d), which were the averages computed on the control set and the disease set d, respectively, during cross-validation. If p(d′) is

close to p(C), then d′ is not very different from the control set, whereas a value p(d′)

that is close to p(d) indicates similarity between the two diseases. Using this method,

we can compare all query diseases to the reference disease d, and identify if there are

diseases that are more similar to d than others.

If we find that a query disease d′ is closer to reference disease d than the other

query diseases, then we need to assess the significance of this finding. In order to do

so, we randomly sample a set r of individuals from all the disease sets except d, such

that r is of the same size as d′, and compute p(r). We repeat this procedure 10,000

times. The fraction of random samples r for which p(r) ≥ p(d′) indicates how often a

random set of individuals would obtain a probability of being part of the disease-class

at least as high as the set d′, and is therefore a P-value indicating how significant the

similarity between d′ and d is.
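The resampling procedure above can be sketched in a few lines. This is an illustration under stated assumptions rather than the code used in this work: `similarity_p_value` is a hypothetical name, `query_probs` holds the disease-class probabilities of the individuals in d′, and `pool_probs` holds those of all individuals from every disease set except d. We assume sampling without replacement.

```python
import random


def mean(values):
    return sum(values) / len(values)


def similarity_p_value(query_probs, pool_probs, n_samples=10_000, seed=0):
    """Fraction of random sets r, of the same size as d', with p(r) >= p(d')."""
    rng = random.Random(seed)
    p_query = mean(query_probs)  # p(d')
    size = len(query_probs)
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(pool_probs, size)  # random set r, drawn without replacement
        if mean(sample) >= p_query:  # p(r) >= p(d')
            hits += 1
    return hits / n_samples
```

With 10,000 samples the smallest non-zero value this estimate can take is 0.0001, so a returned value of 0 is best read as "below 0.0001".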

Chapter 4

Analysis of the Pseudoautosomal

Regions

4.1 Introduction

In humans, the sex chromosomes X and Y represent the major genetic difference between males and females. In general, females have two copies of the X chromosome,

whereas males have one copy of the X chromosome and one copy of the Y chromo-

some. The Sex-determining region Y gene SRY [47] is located on the Y chromosome

and has been shown to initiate testis development, which leads to a male phenotype.

The two sex chromosomes evolved from autosomes and differentiation started early

in the mammalian lineage [48]. In humans, sequence homology between the X and

Y chromosomes is mostly low, and the X chromosome is substantially longer (150

mega base pairs) than the Y chromosome (50 mega base pairs). There are, how-

ever, two regions of high sequence homology between the sex chromosomes called

Pseudoautosomal Regions (PAR). The Pseudoautosomal Region 1 (PAR1 ) is located

at the distal end of the p arm of the X and Y chromosomes, and measures 2.6

mega base pairs [49]. The Pseudoautosomal Region 2 (PAR2 ) is located at the distal

end of the q arm of the X and Y chromosomes, and is much shorter than PAR1

(320 kilo base pairs) [50, 51]. As their name indicates, pseudoautosomal regions behave in a similar way to autosomes. In particular, given the sequence homology


between the two sex chromosomes in the PARs, recombination can happen in these

regions [52, 53]. An obligatory recombination event does happen in PAR1 during

male meiosis [54]. A specific mechanism underlying this obligatory recombination

has been identified in mice [55]. The Pseudoautosomal Boundary (PAB) separates

the PARs from the chromosome-specific regions of the sex chromosomes, which can-

not recombine [56]. The PAB results from the presence of an Alu repeat sequence

on the Y chromosome [57], an event that happened after the divergence between the

Old World monkey and great ape lineages [58]. This mechanism is critical for the

XY sex-determination system to work, as recombination between the X and the Y

chromosomes outside of the PAR would result in X chromosomes carrying SRY. It is

therefore interesting to note that SRY is located only 5.3 kilo bases proximal of the

PAB separating PAR1 from the male specific region of the Y chromosome.

Several important genes are located in the PARs [59, 60]. These genes are not

affected by X-inactivation [61], and therefore behave like autosomal genes. Further-

more, the XG gene, which encodes a blood group antigen [62], includes exons located

on both sides of the PAB [63]. Three exons on the 5’ end of the gene are located in

PAR1, and therefore present on both sex chromosomes. The remaining exons of the

functional copy of the gene are located on the X chromosome only [64, 65]. A truncated version of the gene, XGPY, appears to be transcribed from the Y chromosome,

but is not known to be functional [66].

In this work, we investigate differences between males and females at common

Single Nucleotide Polymorphisms (SNPs) located in the PARs. In order to do so, we

re-purpose Genome Wide Association Studies (GWAS) data originally collected for

the purpose of identifying disease associations. We separate the individuals in the

control sets according to their sex, and identify significant differences in genotype fre-

quencies between males and females for SNPs located in PAR1. We show that for the

most significant association, SNP rs312258, the genotype frequency differences can be

explained by differences in minor allele frequency between the X and Y chromosome.

We reproduce this result in the HapMap 3 [67] Utah residents with ancestry from

northern and western Europe (CEU) population. We hypothesize that the difference

results from different selective pressures acting on the X and Y chromosomes.


[Figure 4.1: plot of log(P-value) (y-axis, −15 to 0) against position on chrX (x-axis, 0 to 3,000,000) for SNPs in PAR1, with tracks for Chr. X, Chr. Y and PAR1. Labeled SNPs: rs2535443, rs311149, rs311150, rs311161, rs312258.]

Figure 4.1: Manhattan plot of differences between males and females in PAR1. The red line indicates the pseudoautosomal boundary (PAB).

4.2 Results

4.2.1 Significant differences between males and females in the pseudoautosomal region 1

We use all control individuals, and split them according to the sex information re-

ported by WTCCC. We then perform an analysis that is similar to a traditional


GWAS, except that we compare males to females rather than cases to controls. We

identify several genome-wide significant differences between males and females in

PAR1. Figure 4.1 shows a Manhattan plot of all SNPs in PAR1. SNPs that show

significant differences in genotype frequencies between males and females appear close

to the PAB. Table 4.1 shows the genotype counts and P-values for all genome-wide

significant SNPs identified in PAR1. The most significant difference is observed for rs312258, a SNP located in the last intron of gene CD99 (P-value of 5.91 · 10^-16), and 41 kilo bases distal to the PAB. The frequency of the minor allele (A) in females is pf = 0.22, compared to pm = 0.31 in males. This leads to an allelic odds ratio of 1.7.

             Allele   Female genotypes    Male genotypes                       A allele frequency
SNP          A   B    AA    AB    BB      AA    AB    BB     P-value          Female   Male
rs312258     A   C    74    487   905     117   641   654    5.91 · 10^-16    0.22     0.31
rs311161     A   C    673   648   157     830   538   74     6.95 · 10^-13    0.67     0.76
rs2535443    A   G    418   730   337     525   703   211    1.31 · 10^-9     0.53     0.61
rs311149     A   T    323   720   403     207   701   508    7.43 · 10^-9     0.47     0.39
rs311150     C   T    420   710   327     508   691   208    3.73 · 10^-8     0.53     0.61

Table 4.1: Significant genotype differences between males and females in WTCCC controls

We also identify several genome-wide significant SNPs on autosomes, which are

however all false positives caused by genotype calling artifacts (see Section 2.3).

4.2.2 Replication

We first attempt to replicate the result we observe for rs312258 in the control pop-

ulations using disease populations from the WTCCC study. Results are shown in

Table 4.2. For each disease, we first assess the association between rs312258 and the

disease separately in males and females. We show that there is no genome-wide signif-

icant association for any disease in either sex. The disease data sets can therefore be

used as replication cohorts for the purpose of showing a difference between males and

females in general. We observe significant differences in most disease populations. We

combine the P-values obtained in the disease populations and the control population

using Fisher’s combined probability test [68], and obtain a meta-analysis P-value of


1.87 · 10^-49.

                          Female genotypes    Male genotypes      Female vs male    Case vs control (P-value)
Population                AA    AC    CC      AA    AC    CC      (P-value)         Female    Male
Controls                  74    487   905     117   641   654     5.91 · 10^-16
Bipolar disease           61    386   678     63    284   319     4.78 · 10^-7      0.73      0.43
Coronary artery disease   20    122   242     113   588   643     9.02 · 10^-7      0.86      0.71
Crohn’s disease           48    348   621     54    293   304     2.40 · 10^-8      0.84      0.99
Hypertension              55    410   699     67    306   390     2.20 · 10^-5      0.55      0.064
Rheumatoid arthritis      91    475   788     33    190   219     0.0056            0.06      0.50
Type 1 diabetes           39    362   548     109   435   446     1.86 · 10^-11     0.037     0.073
Type 2 diabetes           40    297   455     93    514   486     6.89 · 10^-8      0.12      0.61

Table 4.2: Analysis of rs312258 in WTCCC disease populations

As both the controls and the disease data sets were collected as part of the same

WTCCC study, they share the same genotyping platform, genotype calling algorithm

and quality control pipeline. Each of these parts could lead to artifacts that may

explain the association. It is therefore important to replicate the association using a

completely different data set. We do so in the HapMap 3 CEU population, which like

the WTCCC populations is of European descent. We observe a significant difference

between males and females for rs312258 (P-value of 0.0074). Given the small number

of individuals in HapMap 3, this P-value would not reach genome-wide significance

if all SNPs in HapMap 3 had been tested. It is however significant for the purpose of

replicating this individual result. Table 4.3 shows the genotype counts for rs312258

in the HapMap 3 CEU population. The minor allele frequencies in this population

are comparable to those observed in the WTCCC controls (0.23 for females, 0.32 for

males).

          AA    AC    CC
Female    5     15    35
Male      2     31    22

Table 4.3: Genotype counts for rs312258 in HapMap 3


4.2.3 Modified Hardy-Weinberg model

In order to identify possible causes for the observed genotype frequency differences we

compare the observed genotype frequencies in both sexes to the frequencies expected

under the Hardy-Weinberg model [69, 70], in which the allele and genotype frequen-

cies remain stable from one generation to the next. Deviations from the genotype

frequencies expected under Hardy-Weinberg are a symptom of many disturbances,

such as positive selection or non-random mating.

For a given SNP, the Hardy-Weinberg model assumes that both alleles of an indi-

vidual are drawn from the same pool of possible alleles. If the frequency of the minor

allele in the population at the previous generation is p (and the major allele frequency

therefore $(1 - p)$), then the probability of an individual in the current generation to be homozygote for the minor allele will be $p^2$. Similarly, the probability of an individual to be homozygote for the major allele will be $(1 - p)^2$, and the probability of an individual to be heterozygote will be $2 \cdot p \cdot (1 - p)$. These probabilities correspond to

the estimated frequency of each genotype in the population at the current generation.

It can easily be shown that the allele frequencies remain unchanged.
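The invariance of the allele frequencies follows in one step. Each minor-allele homozygote carries two copies of the minor allele and each heterozygote carries one, so the minor allele frequency in the current generation, written p′ here (our own notation), is:

```latex
p' = \frac{2 \cdot p^2 + 1 \cdot 2p(1-p)}{2} = p^2 + p(1-p) = p
```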

We compare the observed genotype counts for rs312258 in males and females in the

WTCCC controls to the expected genotype counts under Hardy-Weinberg, using the

observed minor allele frequencies for each sex (Table 4.4). We compute the percentage

absolute difference between the observed and the expected genotype counts. We note

that this difference is larger in males (5.1%) than in females (1.5%).

                    AA     AC     CC     Difference
Female              74     487    905
  Hardy-Weinberg    68     497    899    1.5%
Male                120    658    675
  Hardy-Weinberg    138    620    693    5.1%
  Modified model    126    645    681    1.7%

Table 4.4: Comparison of observed and estimated genotype counts

The Hardy-Weinberg model as described above assumes that both alleles of

the SNP are drawn from the same pool of alleles. This assumption is generally valid


in the context of autosomes, as well as for the non-PAR regions of the X chromosome.

The Hardy-Weinberg model does not apply to the male specific region of the Y chro-

mosome since a male will only have one copy of the chromosome. In the context of

Hardy-Weinberg, the pseudoautosomal regions do not behave exactly like autosomes.

It is possible that, for a given SNP, the minor allele frequency is different on the X

chromosome than on the Y chromosome. This will not change the Hardy-Weinberg

model for females, who will draw two alleles from the X chromosome pool. However,

for males one allele will be drawn from X chromosome pool and one from the Y

chromosome pool. A female receives one X chromosome from a male and one from

a female, and can transmit either chromosome to a male or a female. We can thus

assume that there is a single X chromosome pool. We therefore need to modify the

Hardy-Weinberg model for males. We consider pX as the minor allele frequency of

the SNP on the X chromosome, and pY as the minor allele frequency of the SNP on

the Y chromosome. Since our model uses a single pool for the X chromosome, we

can use the observed minor allele frequency in females as an estimate of the minor

allele frequency on the X chromosome: pX = pf . Since we have pm = (pX + pY )/2

we can compute an estimate of the minor allele frequency on the Y chromosome:

pY = 2 · pm− pX . The modified genotype frequencies under this model are pX · pY for

homozygous for the minor allele, pX · (1− pY ) + (1− pX) · pY for heterozygous, and

(1− pY ) · (1− pX) for homozygous for the major allele.

We apply this method to the male controls in the WTCCC data set, and obtain an

estimated minor allele frequency for rs312258 on the Y chromosome of pY = 0.40,

compared to pX = 0.21 on the X chromosome. Using the modified Hardy-Weinberg

model, we reduce the percentage absolute error to 1.7% (Table 4.4). The observation

that a modified Hardy-Weinberg model that allows different minor allele frequencies

on the X and Y chromosome fits the observed data better than the original Hardy-

Weinberg model is evidence that the minor allele frequency at rs312258 likely differs

between the two sex chromosomes.
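The computation behind Table 4.4 can be sketched as follows, using the genotype counts listed in that table. The function names are our own, and the definition of the percentage absolute difference (sum of absolute count differences divided by the number of individuals) is an assumption on our part, chosen because it is consistent with the values reported in the table.

```python
def minor_allele_frequency(n_minor_hom, n_het, n_major_hom):
    """Observed minor allele frequency from genotype counts (AA, AC, CC)."""
    n = n_minor_hom + n_het + n_major_hom
    return (2 * n_minor_hom + n_het) / (2 * n)


def hardy_weinberg_counts(p, n):
    """Expected (AA, AC, CC) counts when both alleles come from one pool."""
    return (p * p * n, 2 * p * (1 - p) * n, (1 - p) ** 2 * n)


def modified_counts(p_x, p_y, n):
    """Expected male (AA, AC, CC) counts with separate X and Y allele pools."""
    return (
        p_x * p_y * n,
        (p_x * (1 - p_y) + (1 - p_x) * p_y) * n,
        (1 - p_x) * (1 - p_y) * n,
    )


def percent_abs_difference(observed, expected):
    """Assumed definition: summed absolute count differences, as % of individuals."""
    return 100 * sum(abs(o - e) for o, e in zip(observed, expected)) / sum(observed)


female = (74, 487, 905)   # Table 4.4, observed female counts
male = (120, 658, 675)    # Table 4.4, observed male counts

p_x = minor_allele_frequency(*female)  # single X pool, estimated from females
p_m = minor_allele_frequency(*male)
p_y = 2 * p_m - p_x                    # from p_m = (p_x + p_y) / 2

n_male = sum(male)
hw_diff = percent_abs_difference(male, hardy_weinberg_counts(p_m, n_male))
mod_diff = percent_abs_difference(male, modified_counts(p_x, p_y, n_male))
```

On these counts the estimate of the Y chromosome frequency comes out close to 0.40, and the modified model reduces the difference to roughly a third of the value obtained under the original Hardy-Weinberg model.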


4.2.4 Phasing

We can use the modified Hardy-Weinberg model to estimate the minor allele fre-

quencies on the X and Y chromosomes in the population based on the genotype

information only. Genotyping platforms cannot directly indicate which allele is on

which physical chromosome if the individual is heterozygous. The task of assigning

alleles to individual chromosomes is called phasing. Phasing would be particularly

useful in the context of the PARs in males, as it would determine which allele is

on the X chromosome, and which allele is on the Y chromosome. Computational

approaches have been developed to phase genotype data on autosomes [71], but these

approaches have not been designed to handle the particular case of the PARs. In par-

ticular, such approaches do not integrate the differences in allele frequency between

chromosomes that we identify herein. The assumptions underlying the models used

in existing phasing methods are therefore not applicable to the case we are interested

in, and we cannot use them in order to phase individuals genotyped in the WTCCC

study.

The HapMap 3 data set does, however, include data for trios (mother, father,

child). This makes it possible to exactly determine which allele a male child has on

each sex chromosome in most cases (see Section 4.5.2). We can therefore compute the

actual frequencies of the minor allele on the X and Y chromosomes, and compare

them to the estimated frequencies obtained on the WTCCC controls. Table 4.5

shows the counts for each allele on each sex chromosome in the HapMap 3 CEU

population. The frequencies of the minor alleles are comparable to the estimates

obtained on the WTCCC controls: pY = 0.34, and pX = 0.21. Given the small

number of individuals in HapMap 3 this difference is not statistically significant on its

own (Fisher’s exact test P-value of 0.10), but this result provides additional evidence

supporting a difference in allele frequencies between males and females at rs312258.

4.2.5 Evolution of differing allele frequencies

We observe a difference in minor allele frequency at rs312258 between the X and the

Y chromosomes in the current population. In this section, we build a model of the


     A     C
Y    15    28
X    27    99

Table 4.5: Allele counts per chromosome for rs312258 in HapMap 3

evolution of this difference over time under the assumption that this region is not

under selective pressure. Let the frequencies of the minor allele A on the X and Y chromosomes at the current generation be pX and pY , respectively, and the corresponding frequencies in the next generation be pX′ and pY ′ . During male meiosis I, exactly one recombination event between X and Y must happen in PAR1 [54]. As this unique recombination event happens at a stage of meiosis

during which there are two sister chromatids for each sex chromosome, there is a

50% chance that no recombination happens on the sex chromosome transmitted from

the father to the child. We first consider the case in which the child is male. If

the recombination event is distal to the SNP, or not on the transmitted chromosome,

then we have pY ′ = pY . If the recombination event happens between the SNP and the

PAB, then the allele that was on the X chromosome will be on the Y chromosome

passed on to the male child, and we have pY ′ = pX . If we consider r as being the

probability of a recombination happening between the SNP and the PAB, then we

obtain pY ′ = (1− r) · pY + r · pX . Since we have pY > pX the minor allele frequency

on the Y chromosome will decrease: pY ′ < pY . The case in which the child is female

is similar. However only one third of the X chromosomes in the next generation are

passed on from a male, and therefore 2/3 of the X chromosomes will not recombine

with Y between two consecutive generations. The minor allele frequency on the X

chromosome in the next generation is therefore $p_{X'} = \frac{2}{3} \cdot p_X + \frac{1}{3} \cdot ((1 - r) \cdot p_X + r \cdot p_Y) > p_X$.

Therefore the difference between the minor allele frequencies will decrease over time.

It is thus surprising to observe such a large difference between males and females at

rs312258 in the current population.
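The two update rules above can be iterated directly. Subtracting them shows that the difference pY − pX is multiplied by (1 − 4r/3) at every generation. The sketch below uses hypothetical function names, and the value r = 0.01 used in the example is purely illustrative and is not an estimate from this work.

```python
def next_generation(p_x, p_y, r):
    """One generation of the neutral model; r is the probability of a
    recombination event between the SNP and the PAB."""
    new_p_y = (1 - r) * p_y + r * p_x
    # Two thirds of X chromosomes come from females and are unchanged.
    new_p_x = (2 / 3) * p_x + (1 / 3) * ((1 - r) * p_x + r * p_y)
    return new_p_x, new_p_y


def difference_after(p_x, p_y, r, generations):
    """Difference p_Y - p_X after the given number of generations."""
    for _ in range(generations):
        p_x, p_y = next_generation(p_x, p_y, r)
    return p_y - p_x


# Illustrative values only: start from the estimated frequencies, assume r = 0.01.
remaining = difference_after(0.21, 0.40, r=0.01, generations=100)
```

Under this assumed r, about a quarter of the initial difference remains after 100 generations, which illustrates why a persistent difference calls for an explanation beyond neutral drift.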


4.3 Discussion

The discovery of SNPs that show significant differences in genotype frequencies be-

tween males and females appears surprising at first. In the context of a traditional

case-control GWAS, a SNP showing a difference between the two populations as

strong as rs312258 (P-value of 5.91 · 10^-16, odds ratio of 1.7) would be a very significant association. The interpretation would be that the risk allele confers a statistically

significant increase in disease risk. Furthermore, this association would likely be seen

as causal: the SNP is either itself involved in a biological process of relevance for the

disease, or is a proxy (through linkage disequilibrium) of a nearby SNP involved in

the disease. In the context of a complex disease, the SNP would likely be one of many

factors contributing to the overall disease risk of an individual. We are, however, not

comparing cases to controls, but males to females, and most of these interpretations

are not applicable. In particular, having an A allele at rs312258 does not contribute

to the overall probability of an individual to be a male, even though the A allele is sig-

nificantly more prevalent in males. The SRY gene on chromosome Y has been shown

to be the sole determinant of sex in humans, and therefore a causal role of rs312258

in sex determination can be ruled out. Unlike in a traditional GWAS, the observed

genotype is not part of the cause of the studied phenotype, but is its consequence.

We show that the reason for the difference in genotype frequencies is likely to be

a difference in minor allele frequency between the X and the Y chromosome. The

possibility of such a difference has been hypothesized in theoretical models of the

sex chromosomes [72, 73], but to our knowledge, this study is the first to report an

observed difference in humans. Both existing models are consistent with our analysis of

the evolution of allele frequencies, and show that even small recombination rates are

sufficient to keep the X and Y chromosomes similar unless there is active selection.

This leads to two potential explanations for the differences we observe: extremely low

recombination, and selection.

Given the necessity of one recombination event per male meiosis within the com-

parably small PAR1, the overall recombination rate in this region is high. Estimates

in males indicate that it can be as high as 20 centimorgans per megabase (cM/Mb), whereas the recombination rate in females (0.30-2.55 cM/Mb) is similar

to the recombination rate observed on the whole X chromosome in females (1.21

cM/Mb), as well as on autosomes (1.40-2.80 cM/Mb) [74]. Estimating precise genetic maps of the PARs has been challenging, as discussed in a recent review by Flaquer

et al. [74]: genetic maps built from families lack resolution, genetic maps inferred

from unrelated individuals cannot distinguish males from females, and sperm typing

studies suffer from the high variability between individuals. Early analyses of the

recombination in PAR1 have shown a linear decrease in recombination rates when

moving from telomeres towards the PAB [75, 76, 77, 78]. It is interesting to note that

all the significant SNPs we identify on PAR1 are located close to the PAB. If the

recombination rate in the region between those SNPs and the PAB is low, then it is

possible that allele frequency differences between males and females can remain in the

population for a long period of time. This explanation alone is, however, insufficient

as it does not indicate why such a difference would exist in the first place, especially

given that the XY sex-determination system, and the PAR1 itself, date back to the

early primate lineage.

Selective forces are likely to be playing an active role in the evolution of PAR1,

and to account for most of the differences we observe. The Y chromosome has

recently been shown to be one of the fastest evolving regions of the human genome, with significant differences between humans and their closest relative, the chimpanzee [79].

Furthermore, a comparison between human and macaque Y chromosomes shows that

events that suppressed recombination between segments of X and Y led to rapid

gene loss on the Y chromosome, followed by evolutionary conservation [80]. The SRY

gene itself has undergone rapid sequence evolution in the mammalian lineage [81].

Various models of sex-specific selection that incorporate different selective forces

in males and females have been proposed [82]. The integration of these models in

the context of a region in which one chromosome is only present in males can lead to

persistent differences in allele frequencies [83]. Only alleles that are beneficial to males

will be selected for on the Y chromosome, whereas selection on the X chromosome

will be stronger for alleles that are beneficial to females. Whether rs312258 does

itself confer any selective advantage to either sex remains unclear. While the SNP is


located in the last intron of gene CD99, we do not find any overlap between the SNP

and experimental functional data generated by the ENCODE consortium (using the

methods described in Chapter 5).

An alternative explanation for the difference at rs312258 would be that the SNP

is not itself under selection, but in linkage disequilibrium with regions undergoing

selection. In particular, given the proximity of the PAB and the observed reduction

in recombination close to the PAB, the probability of a recombination event between

the SNP and the PAB is low. Therefore, it is possible for the minor allele at rs312258

to rise in frequency on the Y chromosome due to positive selection acting anywhere

on the male-specific region of the Y chromosome, which is not subject to any recom-

bination. This effect is similar to the so-called genetic hitchhiking [84] that has been

observed on autosomes: mutations that have no beneficial effect themselves, but hap-

pen to be physically located close to a beneficial mutation rapidly rise in frequency

in a population, since recombination is unlikely to break the correlation between the

two loci. The difference in this case is that a mutation in the PAR that is hitchhiking

due to its correlation with a beneficial mutation on Y will increase in frequency on

the Y chromosome only. Similarly, it is possible that a beneficial mutation that is

located in the non-PAR region of chromosome X but close to the PAB could lead

to an increase in frequency of a mutation in the PAR on the X chromosome only.

Finally it is possible that these effects all happen simultaneously: selection on the

Y chromosome could lead to an increase in the frequency of one allele of a SNP in

the PAR, whereas selection on the X chromosome (which in turn can be different

between males and females) could lead to an increase in the frequency of the other

allele of the same SNP in the PAR.

We do not identify any significant difference in genotype frequencies between males

and females in PAR2. While the two PARs share the basic property of sequence homology between the X and Y chromosomes, there are also major differences between

them. PAR2 appeared much more recently in evolution, and is unique to humans [51,

85]. Recent evidence shows that the mechanism underlying recombination in PAR2

may be significantly different from PAR1 [86]. This work shows that sex chromosomes

are both a highly complex, and not well explored part of the human genome. While


we focus on differences at the level of individual SNPs, recent work has shown the

existence of gene conversion between the X and Y chromosomes [87], and a complex

interplay between gene transmission differences and gene silencing [88].

In previous work, we have studied differences in disease associations between males

and females using the same WTCCC data set [89]. In particular we identified a signifi-

cant sex difference in the association between SNP rs3792106 and Crohn’s disease [90].

This SNP is located in gene ATG16L1 on chromosome 2. We showed that the dif-

ference is due to a difference in genotype frequencies in the controls, but not in the

cases, and that the difference is caused by transmission distortion from the mother to

the child. This result is consistent with transmission distortion observed in the same

region in the Framingham cohort population [91]. While the difference in genotype

frequency for rs3792106 is significant when considering this SNP alone, it does not

reach genome-wide significance. In this work we do not identify any genome-wide sig-

nificant difference in genotype frequencies between males and females on autosomes,

and to our knowledge no difference of the magnitude we observe in PAR1 has been

identified on autosomes. The biology underlying potential differences on autosomes

remains unclear. The model we discuss herein for the PAR1 does not apply to au-

tosomes, since each chromosome can be passed on from a male to a female and vice

versa unless there is transmission distortion.

In this work we identify a genetic difference between males and females using data

collected for the purpose of comparing cases to controls in a GWAS. We show that GWAS

data can be successfully re-purposed to answer new, interesting questions without

incurring any genotyping cost. This is, however, only possible because the relevant

phenotype information, here sex, has been collected in the original study. Most GWAS

data sets only provide a limited amount of additional phenotype information, often

related to possible confounding variables such as age, sex or geographic origin of an

individual. If additional variables had been recorded, then it would be possible to

ask many more interesting questions by simply reanalyzing existing genotype data.

Collecting additional phenotype information such as height or weight would only

marginally increase the cost of the study, and would allow asking new questions,

either together with the question of interest in the original GWAS, or in a completely


orthogonal manner, as done in this work. For example, a recent large meta-analysis

of 46 studies and 133,653 individuals has led to the identification of 180 markers associated with height [92], markedly more than the number of currently known

associations for any other phenotype. Yet the number of individuals that could be

used to identify associations with height would be much larger if a meta-analysis could

be performed across all individuals genotyped in any GWAS so far. While privacy

concerns would likely limit the feasibility of such an endeavor, this example shows

the potential that could be unlocked by asking questions about genotype data that

go beyond the narrow focus of an individual study.

4.4 Conclusion

In this chapter, we show that by re-purposing GWAS data, we can identify significant

differences in genotype frequencies between males and females in the pseudoautosomal

region located on the distal end of the short arm of chromosomes X and Y. We provide

evidence showing that these differences are caused by differences in allele frequencies

between the X and Y chromosome, and we hypothesize that selective pressure acting

on the sex chromosomes may explain these differences.

4.5 Methods

4.5.1 Identifying differences between males and females

For each SNP, we count the number of individuals for each of the three possible geno-

types (homozygote major allele, heterozygote, homozygote minor allele) separately

for each sex, resulting in a 3x2 table. We apply a Chi-square test with two degrees of

freedom in order to assess whether there is a difference in genotype frequency between

males and females. We apply a Bonferroni correction in order to compute a

genome-wide significance level threshold of 0.05 / 459,075 = 1.1 · 10^-7. We then manually

analyze all SNPs that reach a genome-wide significance level. In particular, we man-

ually inspect the genotyping intensity plots in order to detect potential artifacts in


genotype calling. Such artifacts can be caused by the homology between the sequence

in the immediate vicinity of an autosomal SNP and a region of either the X or the

Y chromosome. This scenario is described in more detail in Chapter 2.
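As an illustration of the per-SNP test described at the start of this section, the following minimal Python sketch computes the Chi-square statistic for a 3x2 genotype-by-sex table together with the Bonferroni threshold. The function names and the example counts are hypothetical and are not taken from the original analysis. The P-value uses the closed form exp(-x/2), which holds for two degrees of freedom and therefore assumes that all three genotypes are observed.

```python
import math


def chi_square_3x2(male_counts, female_counts):
    """Pearson Chi-square test on a genotype-by-sex table.

    Each argument holds three counts: (homozygous major, heterozygous,
    homozygous minor). Returns (statistic, p_value) for 2 degrees of freedom.
    """
    table = [list(male_counts), list(female_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    # With 2 degrees of freedom the survival function is exactly exp(-x / 2).
    return statistic, math.exp(-statistic / 2.0)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


# Hypothetical counts for one SNP (not real data).
statistic, p_value = chi_square_3x2((30, 10, 10), (10, 10, 30))
threshold = bonferroni_threshold(0.05, 459075)
is_significant = p_value < threshold
```

With these made-up counts the statistic is 20, which is far from the genome-wide threshold; a SNP would only be carried forward to manual inspection when `is_significant` is true.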

We repeat this analysis for each of the disease populations separately. We then

combine the P-values in order to assess the significance of the difference in the entire

WTCCC data set. We use Fisher’s combined probability test [68]. Given a list

of k P-values p_i (in our case k = 8 since we have seven disease populations and one

control population) we can compute the summary statistic

$X^2 = -2 \sum_{i=1}^{k} \log_e(p_i)$.

This statistic has a Chi-squared distribution with 2 · k degrees of freedom.
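The combination step can be sketched in a few lines of Python. This is an illustrative implementation rather than the code used for the analysis: the function names are hypothetical, and the P-value relies on the closed-form survival function of a Chi-squared distribution with an even number of degrees of freedom, which applies here because the statistic has 2 · k degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Survival function of a Chi-squared distribution with even df.

    For df = 2k: P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def fisher_combined(p_values):
    """Fisher's combined probability test for independent P-values (all > 0)."""
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return statistic, chi2_sf_even_df(statistic, 2 * k)
```

For a single P-value the combined P-value equals the input, which is a convenient sanity check; for k = 8 populations the statistic is compared against a Chi-squared distribution with 16 degrees of freedom.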

We apply the same test to the HapMap 3 CEU individuals. As these individuals

are parts of trios, we can only use the parents because the genotype of the child is

not independent of the genotype of the parents.

4.5.2 Trio-based phasing in PARs

The HapMap 3 data set consists of genotype information for trios (mother, father,

child). We know that for all female individuals, both alleles are present on X chro-

mosomes. We are interested in determining the phasing of male individuals in order

to know whether an allele is present on the X or the Y chromosome. We can use

trios in order to determine the exact phasing of a male child in most cases using the

information of individual SNPs only. If the child is homozygous, then the phasing is

trivial since both the X and the Y chromosome contain the same allele. If the child

is heterozygous and at least one parent is homozygous, then we know which allele

this parent passed on to the child. If the mother is homozygous, then the allele the

mother is homozygous for must be on the X chromosome of the child. If the father is

homozygous, then the allele the father is homozygous for must be on the Y chromo-

some of the child. If all three individuals are heterozygotes, then the exact phasing

cannot be determined using the information provided by the single SNP alone.
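The single-SNP phasing rules for a male child can be sketched as follows. This is a minimal illustration, not the original implementation: genotypes are assumed to be two-letter strings such as "AT", and the function names are hypothetical. The sketch enumerates every maternal/paternal transmission consistent with the child's genotype and reports a phasing only when that transmission is unique.

```python
def transmitted_alleles(mother, father, child):
    """Return (maternal_allele, paternal_allele), or None if ambiguous."""
    options = {
        (m, f)
        for m in mother
        for f in father
        if sorted(m + f) == sorted(child)
    }
    if not options:
        raise ValueError("genotypes are not consistent with Mendelian inheritance")
    if len(options) > 1:
        # Only happens when mother, father and child are all heterozygous.
        return None
    return next(iter(options))


def phase_male_child(mother, father, child):
    """Assign the child's alleles to his X (maternal) and Y (paternal) chromosome."""
    transmitted = transmitted_alleles(mother, father, child)
    if transmitted is None:
        return None
    maternal, paternal = transmitted
    return {"X": maternal, "Y": paternal}
```

For example, `phase_male_child("AA", "AT", "AT")` places A on the X chromosome and T on the Y chromosome, while a trio in which all three individuals are heterozygous yields `None`.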

Furthermore, if we can determine which allele the father passed on to the male

child, then we also know which allele did not get transmitted. It is important to

note that this allele is not necessarily on the X chromosome in the somatic cells


of the father if the father is heterozygous. During meiosis I, the sex chromosomes

undergo recombination in the PARs. The chromosome passed on to the child is the

result of this recombination, and at each SNP, one allele is passed on to the child.

The other allele ends up on a chromosome that becomes part of another gamete.

Therefore, if the physical location of the recombination event is between a given SNP

and the PAB, then an allele that was on the X chromosome of the father will end

up on the Y chromosome of the male child. In this case, we know that there was

an X chromosome in a gamete that contained the allele not passed on to the child,

but it would be incorrect to conclude that the allele that wasn’t passed on to the

child is on the X chromosome of the father. Such a situation is however unlikely

for the specific case of rs312258, which is located only 40 kilobases from the PAB,

whereas PAR1 is 2.6 megabases long. Therefore in most cases the allele that is not

transmitted from the father is on the X chromosome of the father. Furthermore,

even if a recombination event happened and the non-transmitted allele was on the Y

chromosome of the father, there is no specific reason to believe that the gamete which

contained the allele on the X chromosome could not have led to a viable individual,

especially given the high minor allele frequency in both sexes. We choose to use the

information obtained from the non-transmitted allele of the father as well. For each

trio in which the child is a male and at least one individual is homozygous, we can

therefore identify which three independent alleles are on X chromosomes and which

allele is on a Y chromosome.

If the child in the trio is a female, then we know that both alleles are on X

chromosomes. We can however apply an approach similar to that for male children in order

to identify which allele the father passed on to the daughter on the X chromosome,

and which allele ended up on a Y chromosome in a gamete. We can therefore also

identify which three independent alleles are on X chromosomes and which allele is on

a Y chromosome unless all three individuals are heterozygous. Table 4.6 provides an

overview of all possible combinations of genotypes, together with the four alleles we

can assign to an X or a Y chromosome in each case.


Mother  Father  Male Child  Child X  Child Y  Mother other X  Father Gamete X
AA      TT      AT          A        T        A               T
AA      AT      AT          A        T        A               A*
AT      TT      AT          A        T        T               T
AT      AT      TT          T        T        A               A*
TT      AT      TT          T        T        T               A*
AT      TT      TT          T        T        A               T
AT      AT      AT          ?        ?        ?               ?

Mother  Father  Female Child  Child Xm  Child Xp  Mother other X  Father Gamete Y
AA      TT      AT            A         T         A               T
AA      AT      AT            A         T         A               A*
AT      TT      AT            A         T         T               T
AT      AT      TT            T         T         A               A*
TT      AT      TT            T         T         T               A*
AT      TT      TT            T         T         A               T
AT      AT      AT            ?         ?         ?               ?

Table 4.6: Phasing cases in trios

Xm and Xp represent the alleles passed on to the daughter on the X chromosome by the mother and the father, respectively. A star next to the father gamete indicates cases in which the allele is on the other sex chromosome in the somatic cells of the father if the recombination event happened between the SNP and the PAB. Cases that are symmetrical (by swapping the A and T alleles) are not presented. In the cases where all individuals are heterozygous, the only inference that can be made is that all alleles in females are on X chromosomes.
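The assignments listed in Table 4.6 can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than the original code: genotypes are two-letter strings, the function name and the returned dictionary keys are hypothetical, and the "starred" flag simply records that the father is heterozygous, which is the situation in which the recombination caveat described above applies.

```python
def assign_trio_alleles(mother, father, child, child_is_male):
    """Assign the four parental alleles of a PAR SNP to X or Y chromosomes.

    Returns None when the transmission is ambiguous (all three individuals
    heterozygous); otherwise returns three X alleles and one Y allele.
    """
    options = {
        (m, f)
        for m in mother
        for f in father
        if sorted(m + f) == sorted(child)
    }
    if not options:
        raise ValueError("genotypes are not consistent with Mendelian inheritance")
    if len(options) > 1:
        return None
    maternal, paternal = next(iter(options))
    # Remove one copy of the transmitted allele to get the other allele.
    mother_other = mother.replace(maternal, "", 1)
    father_other = father.replace(paternal, "", 1)
    if child_is_male:
        # Son: the father transmitted his Y, so the other gamete carried an X.
        x_alleles = [maternal, mother_other, father_other]
        y_allele = paternal
    else:
        # Daughter: the father transmitted an X, so the other gamete carried a Y.
        x_alleles = [maternal, paternal, mother_other]
        y_allele = father_other
    return {
        "x_alleles": x_alleles,
        "y_allele": y_allele,
        "father_gamete_starred": father[0] != father[1],
    }
```

Calling `assign_trio_alleles("AA", "AT", "AT", True)` corresponds to the second row of the male-child table: three A alleles on X chromosomes, a T allele on the Y chromosome, and a starred father gamete.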

Chapter 5

Integrating Regulatory Information

5.1 Introduction

Although Genome-wide Association Studies (GWAS) provide a list of SNPs that are

statistically associated with a phenotype of interest, they do not offer any direct

evidence about the biological processes that link the associated variant to the pheno-

type. A major challenge in the interpretation of GWAS results comes from the fact

that most detected associations point to larger regions of correlated variants. SNPs

that are located in close proximity in the genome tend to be in linkage disequilib-

rium (LD) with each other [6, 7], and only a few SNPs per linkage disequilibrium

region are measured on a given genotyping platform. Regions of strong linkage dis-

equilibrium can be large, and SNPs associated with a phenotype have been found

to be in perfect linkage disequilibrium with SNPs several hundred kilobases away.

Although sequencing can be used to assess associated regions more precisely [93],

using sequence information alone is insufficient to distinguish among SNPs that are

in perfect linkage disequilibrium with each other in the studied population, and thus

equally associated with the phenotype.

Various approaches have been developed to identify variants that are likely to play

an important biological role. Most of these approaches focus on the interpretation

of coding or other SNPs in transcribed regions [94, 95, 96]. The vast majority of

associated SNPs identified in GWAS, however, are in non-transcribed regions, and it


CHAPTER 5. INTEGRATING REGULATORY INFORMATION 61

is likely that the underlying mechanism linking them to the phenotype is regulatory.

SNPs that influence gene expression (Expression Quantitative Trait Loci, eQTLs) [97,

98] have been shown to be significantly enriched for GWAS associations [99, 100].

Although eQTLs can be used to identify the downstream targets that are likely to

be affected by associations identified in a GWAS, they are still based on genotyping

methods, and therefore also point to regions of linkage disequilibrium rather than

to individual SNPs. Methods for identifying SNPs that overlap regulatory elements,

such as transcription factor binding sites, are therefore necessary. Approaches based

on known transcription factor binding motifs [101, 102] have been successfully used

to refine GWAS results and identify specific loci that have a functional role [103,

104]. However, the presence of a motif does not imply that a transcription factor is

necessarily binding in vivo.

High-throughput functional assays such as chromatin immunoprecipitation assays

followed by sequencing (ChIP-seq) [105, 106] and DNaseI hypersensitive site [107]

identification by sequencing (DNase-seq) [108, 109] can experimentally detect func-

tional regions such as transcription factor binding sites. Experimental evidence shows

that the presence of SNPs in these regions leads to differences in transcription factor

binding between individuals [110]. A SNP that overlaps an experimentally detected

transcription factor binding site and is in strong linkage disequilibrium with a SNP

associated with a phenotype, is thus more likely to play a biological role than other

SNPs in the associated region for which there is no evidence of overlap with any func-

tional data. Several recent analyses of associated regions use these types of functional

data in order to identify functional loci in individual diseases [111, 112, 113, 114]. A

recent study of chromatin marks in nine different cell lines produced a genome-wide

map of regulatory elements, and showed a two-fold enrichment for predicted enhancers

amongst the associated SNPs from GWAS [115]. These examples illustrate the power

of combining statistical associations between a region of the genome and a phenotype

together with functional data in order to generate hypotheses about the mechanism

underlying the association.

The main goal of the Encyclopedia of DNA Elements (ENCODE) project is to


identify all functional elements in the human genome, including coding and non-

coding transcripts, marks of accessible chromatin and protein binding-sites [116, 117,

118]. The datasets generated by the ENCODE consortium are therefore particularly

well suited for the functional interpretation of GWAS results. To date a total of

147 different cell types have been studied using a wide variety of experimental

assays [119]. Chromatin accessibility has been studied using DNase-seq, which led

to the identification of 2.89 million DNaseI hypersensitive sites that may exhibit

regulatory function. DNase footprinting [120, 121, 122] was used to detect binding

between proteins and the genome at a nucleotide resolution. ChIP-seq experiments

were conducted for a total of 119 transcription factors and other DNA-binding pro-

teins. Together these data provide a rich source of information that can be used to

associate GWAS annotations with functional data.

In this work, we show that data generated by the ENCODE consortium can be suc-

cessfully used to functionally annotate associations previously identified in genome-

wide association studies. We combine multiple sources of evidence in order to identify

SNPs that are located in a functional region of the genome and are associated with

a phenotype. We show that a majority of known GWAS associations overlap a func-

tional region, or are in strong linkage disequilibrium with a SNP overlapping a func-

tional region. We find that for a majority of associations, the SNP whose functional

role is most strongly supported by ENCODE data is a SNP in linkage disequilibrium

with the reported SNP, not the genotyped SNP reported in the association study.

We show that there is significant overall enrichment for regulatory function in disease

associated regions, and that combining multiple sources of evidence leads to stronger

enrichment. We use information from RegulomeDB [123], a database designed for

fast annotation of SNPs that combines ENCODE datasets (ChIP-seq peaks, DNaseI

hypersensitivity peaks, DNaseI footprints) with additional data sources (ChIP-seq

data from the NCBI Sequence Read Archive, conserved motifs, eQTLs and experi-

mentally validated functional SNPs). Using these publicly available resources makes

the approach presented herein easily applicable to the analysis of any future GWAS

study.


5.2 Results

We use linkage disequilibrium information in order to integrate GWAS results with

ENCODE data and eQTLs. We call functional SNP any SNP that appears in a

region identified as associated with a biochemical event in at least one ENCODE

cell line. Functional SNPs can be further subdivided into SNPs that overlap coding

or non-coding transcripts, and SNPs that appear in a region identified as potentially

regulatory, such as ChIP-seq peaks and DNaseI hypersensitive sites. We call the SNPs

that are reported to be statistically associated with a phenotype lead SNPs. For each

lead SNP we first determine whether the lead SNP itself is a functional SNP, then

find all functional SNPs that are in strong linkage disequilibrium with the lead SNP.

We integrate eQTL information in a similar way, by checking whether the lead SNP

or a SNP in strong linkage disequilibrium with the lead SNP has been associated with

a change in gene expression.
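The annotation logic described in this paragraph can be sketched as a few set operations. This is a minimal illustration, assuming that LD partners, functional SNPs and eQTL SNPs are already available as Python collections; the function name, the dictionary keys and the SNP identifiers (which mirror the scenario of Figure 5.1) are hypothetical.

```python
def annotate_lead_snp(lead, ld_partners, functional_snps, eqtl_snps):
    """Annotate one lead SNP using its strong-LD partners.

    ld_partners maps a lead SNP to the SNPs in strong LD with it.
    """
    partners = set(ld_partners.get(lead, ())) - {lead}
    return {
        "lead_is_functional": lead in functional_snps,
        "functional_in_ld": sorted(partners & set(functional_snps)),
        "lead_is_eqtl": lead in eqtl_snps,
        "eqtl_in_ld": sorted(partners & set(eqtl_snps)),
    }


# Toy region: SNP1 is the lead SNP, SNP3 an eQTL, SNP6 a functional SNP.
ld_partners = {"SNP1": {"SNP3", "SNP6"}}
annotation = annotate_lead_snp("SNP1", ld_partners, {"SNP6"}, {"SNP3"})
```

In this toy region the lead SNP is neither functional nor an eQTL itself, but it inherits both kinds of evidence through linkage disequilibrium.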

Figure 5.1 illustrates our approach by describing a scenario in which a lead SNP is

in strong linkage disequilibrium with a functional SNP that overlaps a transcription

factor binding site, as well as with a third SNP that is an eQTL. If neither the lead

SNP nor the eQTL SNP overlap a functional region, then the functional SNP is more

likely to be the SNP that plays a biological role in the phenotype than either of the

SNPs that were genotyped. An extreme example would be the case where all three

SNPs are in perfect linkage disequilibrium, but only the associated SNP was present

on the genotyping platform used in the GWAS in which the association was found,

and only the eQTL SNP was present on the genotyping platform used in the eQTL

study. In this scenario, the functional SNP would be associated as strongly with

the disease and with the change in gene expression as the reported association and

eQTL SNPs. In order to show the potential of this approach, we analyze a set of 5694

curated associations from the NHGRI GWAS catalog [11] that represent a total of

4724 distinct SNPs associated with a total of 470 different phenotypes (see Section 5.6

for details).


[Figure 5.1 (image not reproduced): six SNPs along the genome, with SNP 1 labeled as the lead SNP associated with the phenotype, SNP 3 as an eQTL, and SNP 6 as the functional SNP. Tracks show predicted motifs, DNaseI hypersensitivity peaks, and ChIP-seq peaks for TF1. Linkage disequilibrium values of 1.0 connect the SNPs.]

Figure 5.1: Schematic overview of the functional SNP approach.

This figure illustrates the approach we use to identify functional SNPs. Three different types of regulatory data are represented for an area of the genome: motif-based predictions, DNaseI hypersensitivity peaks, and ChIP-seq peaks. This region contains six SNPs. SNP1 is associated with a phenotype in a Genome Wide Association Study. SNP3 is an eQTL associated with changes in gene expression in a different study. SNP6 overlaps a predicted motif, a DNaseI hypersensitivity peak and a ChIP-seq peak. There are therefore multiple sources of evidence that SNP6 is in a regulatory region. Furthermore, SNP6 is in perfect linkage disequilibrium (r2 = 1.0) with SNP1 and SNP3, meaning that there is transitive evidence due to the LD that SNP6 is also associated with the phenotype and is also an eQTL. SNP6 is therefore the most likely functional SNP in this associated region.


5.2.1 Lead SNP annotation

We first annotated each lead SNP with transcription information from GENCODE

v7 and regulatory information from RegulomeDB. Overall, 44.8% of all lead SNPs

overlap with some ENCODE data, making them functional SNPs according to our

definition, and 13.1% of the lead SNPs are supported by more than one type of

functional evidence. Specifically, 223 lead SNPs (4.7%) overlap coding regions, 146

(3.1%) overlap with the non-coding part of an exon, 1714 (36.3%) overlap with a

DNaseI peak in at least one cell line, 355 (7.5%) overlap with a DNaseI footprint, and

938 (19.9%) overlap with a ChIP-seq peak for at least one of the assessed proteins in

at least one cell line. Figure 5.2 shows the fraction of lead SNPs supported by different

sources of evidence. Thus, we find that many GWAS SNPs overlap ENCODE data.
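The percentages quoted above follow directly from the reported counts and the total of 4724 distinct lead SNPs. The short Python check below recomputes them; the dictionary keys are hypothetical labels chosen for illustration.

```python
TOTAL_LEAD_SNPS = 4724

# Counts of lead SNPs per annotation type, as reported in the text.
counts = {
    "coding": 223,
    "exon_noncoding": 146,
    "dnase_peak": 1714,
    "dnase_footprint": 355,
    "chipseq_peak": 938,
}


def percent(count, total):
    """Percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)


percentages = {name: percent(count, TOTAL_LEAD_SNPS) for name, count in counts.items()}
```

Note that the categories are not mutually exclusive, so the percentages are not expected to sum to the overall 44.8%.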

5.2.2 Linkage disequilibrium

For each lead SNP we next located the set of SNPs that are in strong linkage dis-

equilibrium (r2 ≥ 0.8) with the lead SNP in all four HapMap 2 populations, and

annotate each SNP in this set. As expected, the fraction of lead SNPs in strong link-

age disequilibrium with a SNP overlapping each type of functional evidence is larger

than when considering lead SNPs alone (Figure 5.2), and 58% of all associations are

in strong linkage disequilibrium with at least one functional SNP. A similar increase

can be observed for functional SNPs supported by multiple sources of evidence. A

detailed breakdown for each type of functional evidence for multiple linkage disequi-

librium thresholds is provided in Table 5.1. We repeated the same analysis for the

2464 lead SNPs that have been associated with a phenotype in a population of Eu-

ropean descent, using SNPs in strong linkage disequilibrium (r2 ≥ 0.8) with the lead

SNP in the European HapMap population only. A total of 81% of the lead SNPs are

in strong LD with at least one functional SNP, and 59% of the associated SNPs are

in strong linkage disequilibrium with a functional SNP supported by multiple sources

of evidence. A detailed breakdown is provided in Table 5.2.
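The difference between requiring strong LD in all populations and in a single population can be sketched as follows. This is an illustrative sketch only: the data layout (a dictionary of r2 values per population), the function name and the SNP identifiers are assumptions, and the population labels are the four HapMap 2 panels.

```python
HAPMAP2_POPULATIONS = ("CEU", "YRI", "CHB", "JPT")


def strong_ld_partners(r2_by_population, populations=HAPMAP2_POPULATIONS, threshold=0.8):
    """SNPs whose r2 with the lead SNP reaches the threshold in every population.

    r2_by_population maps a population label to {partner SNP: r2 with lead SNP}.
    """
    shared = None
    for population in populations:
        r2_values = r2_by_population.get(population, {})
        passing = {snp for snp, r2 in r2_values.items() if r2 >= threshold}
        shared = passing if shared is None else shared & passing
    return shared if shared is not None else set()


# Made-up r2 values for two SNPs near a hypothetical lead SNP.
r2_values = {
    "CEU": {"rsA": 1.0, "rsB": 0.85},
    "YRI": {"rsA": 0.9, "rsB": 0.4},
    "CHB": {"rsA": 0.95, "rsB": 0.9},
    "JPT": {"rsA": 0.8, "rsB": 0.9},
}
partners_all = strong_ld_partners(r2_values)
partners_ceu = strong_ld_partners(r2_values, populations=("CEU",))
```

In this toy example the second SNP is retained only under the CEU-only criterion, which illustrates why restricting the analysis to one population yields larger LD sets and hence more associations linked to a functional SNP.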


[Figure 5.2 (image not reproduced): horizontal bar charts on a 0% to 100% axis for three SNP sets: lead SNP, r2 ≥ 0.8 in all populations, and r2 ≥ 0.8 in CEU. Panel A shows individual assays: DNaseI peak, DNaseI footprint, ChIP-seq peak. Panel B shows combined categories: Coding; Transcribed, non-coding; ChIP-seq and DNaseI peak, matched motif or footprint; ChIP-seq peak and DNaseI peak and motif; ChIP-seq and DNaseI peak; ChIP-seq peak only; DNaseI peak only; Motif only (no experimental support); No annotation.]

Figure 5.2: Proportions of associations for different types of functional data.

Proportions are shown for individual assays (A), and for all sources of evidence combined (B). Proportions are presented separately for lead SNPs and SNPs in strong linkage disequilibrium (r2 ≥ 0.8) with a lead SNP. For each association we determine which SNP in the LD region is most strongly supported by functional data in order to generate the proportions in panel B. We separately consider SNPs in strong linkage disequilibrium with a lead SNP in all HapMap 2 populations, and SNPs in strong linkage disequilibrium with a lead SNP in the CEU population. For the latter case, we use only associations identified in populations of European descent, and show that we can map 80% of these associations to a functional SNP supported by experimental ENCODE data.


                                        Lead           Perfect LD     r2 ≥ 0.9       r2 ≥ 0.8
                                        Count      %   Count      %   Count      %   Count      %
Total                                    4724           4724           4724           4724
Predicted motif                          1335  28.3%    1656  35.1%    1947  41.2%    2181  46.2%
DNaseI hypersensitivity peak             1714  36.3%    1979  41.9%    2189  46.3%    2374  50.3%
DNaseI footprint                          355   7.5%     471  10.0%     600  12.7%     718  15.2%
ChIP-seq peak                             938  19.9%    1144  24.2%    1318  27.9%    1491  31.6%

Coding                                    223   4.7%     268   5.7%     306   6.5%     345   7.3%
Exon, non-coding                          146   3.1%     165   3.5%     202   4.3%     232   4.9%

RegulomeDB Score 2 (total)                126   2.7%     165   3.5%     197   4.2%     248   5.2%
RegulomeDB Score 2a                        14   0.3%      16   0.3%      21   0.4%      26   0.6%
RegulomeDB Score 2b                       109   2.3%     145   3.1%     169   3.6%     212   4.5%
RegulomeDB Score 2c                         3   0.1%       4   0.1%       7   0.1%      10   0.2%
RegulomeDB Score 3 (total)                 58   1.2%      75   1.6%      84   1.8%      89   1.9%
RegulomeDB Score 3a                        55   1.2%      69   1.5%      78   1.7%      83   1.8%
RegulomeDB Score 3b                         3   0.1%       6   0.1%       6   0.1%       6   0.1%
RegulomeDB Score 4                        436   9.2%     492  10.4%     538  11.4%     571  12.1%
RegulomeDB Score 5 (total)               1110  23.5%    1192  25.2%    1240  26.2%    1254  26.5%
RegulomeDB Score 5a                       218   4.6%     246   5.2%     270   5.7%     289   6.1%
RegulomeDB Score 5b                       892  18.9%     946  20.0%     970  20.5%     965  20.4%
RegulomeDB Score 6                        933  19.8%     908  19.2%     873  18.5%     838  17.7%
RegulomeDB Score 7                       1692  35.8%    1459  30.9%    1284  27.2%    1147  24.3%

eQTL                                      462   9.8%     514  10.9%     556  11.8%     597  12.6%
eQTL + Predicted motif                    113   2.4%     206   4.4%     275   5.8%     329   7.0%
eQTL + DNaseI hypersensitivity peak       201   4.3%     300   6.4%     359   7.6%     413   8.7%
eQTL + DNaseI footprint                    40   0.8%      89   1.9%     129   2.7%     165   3.5%
eQTL + ChIP-seq peak                      118   2.5%     197   4.2%     247   5.2%     306   6.5%

eQTL + RegulomeDB Score 2 (total)          17   0.4%      28   0.6%      34   0.7%      45   1.0%
eQTL + RegulomeDB Score 2a                  1   0.0%       1   0.0%       3   0.1%       6   0.1%
eQTL + RegulomeDB Score 2b                 16   0.3%      27   0.6%      30   0.6%      38   0.8%
eQTL + RegulomeDB Score 2c                  0   0.0%       0   0.0%       1   0.0%       1   0.0%
eQTL + RegulomeDB Score 3 (total)           5   0.1%      11   0.2%      12   0.3%      14   0.3%
eQTL + RegulomeDB Score 3a                  4   0.1%      10   0.2%      11   0.2%      13   0.3%
eQTL + RegulomeDB Score 3b                  1   0.0%       1   0.0%       1   0.0%       1   0.0%
eQTL + RegulomeDB Score 4                  56   1.2%      77   1.6%      87   1.8%      94   2.0%
eQTL + RegulomeDB Score 5 (total)         117   2.5%     131   2.8%     137   2.9%     130   2.8%
eQTL + RegulomeDB Score 5a                 27   0.6%      30   0.6%      33   0.7%      36   0.8%
eQTL + RegulomeDB Score 5b                 90   1.9%     101   2.1%     104   2.2%      94   2.0%
eQTL + RegulomeDB Score 6                 200   4.2%     159   3.4%     144   3.0%     136   2.9%

Table 5.1: Fraction of associations overlapping functional regions for different linkage disequilibrium thresholds.

Only functional SNPs that are in linkage disequilibrium at or above the indicated threshold in all HapMap 2 populations are used.


                                        Lead           Perfect LD     r2 ≥ 0.9       r2 ≥ 0.8
                                        Count      %   Count      %   Count      %   Count      %
Total                                    2461           2461           2461           2461
Predicted motif                           729  29.6%    1470  59.7%    1678  68.2%    1846  75.0%
DNaseI hypersensitivity peak              959  39.0%    1527  62.0%    1709  69.4%    1846  75.0%
DNaseI footprint                          213   8.7%     614  24.9%     755  30.7%     935  38.0%
ChIP-seq peak                             527  21.4%    1107  45.0%    1310  53.2%    1492  60.6%

Coding                                    135   5.5%     266  10.8%     333  13.5%     396  16.1%
Exon, non-coding                           86   3.5%     186   7.6%     223   9.1%     276  11.2%

RegulomeDB Score 2 (total)                 84   3.4%     204   8.3%     240   9.8%     285  11.6%
RegulomeDB Score 2a                         9   0.4%      23   0.9%      23   0.9%      31   1.3%
RegulomeDB Score 2b                        73   3.0%     174   7.1%     207   8.4%     245  10.0%
RegulomeDB Score 2c                         2   0.1%       7   0.3%      10   0.4%       9   0.4%
RegulomeDB Score 3 (total)                 28   1.1%      67   2.7%      87   3.5%      92   3.7%
RegulomeDB Score 3a                        27   1.1%      65   2.6%      85   3.5%      90   3.7%
RegulomeDB Score 3b                         1   0.0%       2   0.1%       2   0.1%       2   0.1%
RegulomeDB Score 4                        245  10.0%     367  14.9%     392  15.9%     397  16.1%
RegulomeDB Score 5 (total)                579  23.5%     609  24.7%     597  24.3%     544  22.1%
RegulomeDB Score 5a                       107   4.3%     175   7.1%     187   7.6%     175   7.1%
RegulomeDB Score 5b                       472  19.2%     434  17.6%     410  16.7%     369  15.0%
RegulomeDB Score 6                        495  20.1%     382  15.5%     310  12.6%     263  10.7%
RegulomeDB Score 7                          0   0.0%     380  15.4%     279  11.3%     208   8.5%

eQTL                                      244   9.9%     373  15.2%     419  17.0%     483  19.6%
eQTL + Predicted motif                     68   2.8%     270  11.0%     333  13.5%     409  16.6%
eQTL + DNaseI hypersensitivity peak       112   4.6%     294  11.9%     361  14.7%     439  17.8%
eQTL + DNaseI footprint                    30   1.2%     168   6.8%     216   8.8%     288  11.7%
eQTL + ChIP-seq peak                       64   2.6%     234   9.5%     305  12.4%     390  15.8%

eQTL + RegulomeDB Score 2 (total)          14   0.6%      32   1.3%      33   1.3%      41   1.7%
eQTL + RegulomeDB Score 2a                  0   0.0%       8   0.3%       6   0.2%       5   0.2%
eQTL + RegulomeDB Score 2b                 14   0.6%      23   0.9%      27   1.1%      36   1.5%
eQTL + RegulomeDB Score 2c                  0   0.0%       1   0.0%       0   0.0%       1   0.0%
eQTL + RegulomeDB Score 3 (total)           2   0.1%      12   0.5%      19   0.8%      23   0.9%
eQTL + RegulomeDB Score 3a                  2   0.1%      11   0.4%      18   0.7%      23   0.9%
eQTL + RegulomeDB Score 3b                  0   0.0%       1   0.0%       1   0.0%       0   0.0%
eQTL + RegulomeDB Score 4                  27   1.1%      46   1.9%      55   2.2%      59   2.4%
eQTL + RegulomeDB Score 5 (total)          60   2.4%      68   2.8%      64   2.6%      43   1.7%
eQTL + RegulomeDB Score 5a                 13   0.5%      18   0.7%      21   0.9%      12   0.5%
eQTL + RegulomeDB Score 5b                 47   1.9%      50   2.0%      43   1.7%      31   1.3%
eQTL + RegulomeDB Score 6                 104   4.2%      56   2.3%      39   1.6%      31   1.3%

Table 5.2: Fraction of associations overlapping functional regions for different linkage disequilibrium thresholds (European populations).

Functional SNPs that are in linkage disequilibrium with the lead SNP at or above the indicated threshold in the HapMap 2 CEU population are used. Only associations that were identified or replicated in populations of European descent are used.


5.2.3 Integrating gene expression data

We integrated data from multiple eQTL studies that identified SNPs associated with

changes in gene expression in several tissues. A total of 462 lead SNPs (9.8%) are

also themselves an eQTL in at least one tissue, and an additional 135 lead SNPs

(2.8%) are in strong LD (r2 ≥ 0.8 in all HapMap 2 populations) with an eQTL.

When considering only associations in populations of European descent, 483 lead

SNPs (19.6%) are either an eQTL, or in strong LD with an eQTL. We observe that

amongst lead SNPs that are also eQTLs, the fraction that overlaps DNaseI peaks (201,

43.5%) and ChIP-seq peaks (118, 25.5%) is significantly higher than when considering

all lead SNPs (P-values of respectively 7.6 × 10^-4 and 1.7 × 10^-3).

5.2.4 SNP comparison within linkage disequilibrium regions

ENCODE data can be used in order to compare multiple functional SNPs that are in

LD with a given lead SNP. We used a two-step approach to compare the functional

annotation of two SNPs. First, if one of the SNPs is in a coding region according to

GENCODE v7 and the other one is not, the coding SNP is considered to be more

likely to be functional. Similarly, a SNP in a non-coding part of an exon is considered

to be more likely to be functional than a SNP in an intergenic region or an intron.

Second, if both SNPs are not in exons, then we compared the amount of evidence

across data sources supporting the functional role of the SNP using a scoring scheme

integrated in RegulomeDB (see Section 5.5.5). We hypothesized that a SNP

supported by multiple types of evidence (e.g. a ChIP-seq peak and a DNaseI footprint) is

more likely to be functional than a SNP supported by a single experimental modality.

We find that for most associations where the lead SNP is in LD with at least one other

SNP, the most strongly supported functional SNP is not the lead SNP

itself, but another SNP in the LD region (22.4% compared to 13.6% when using LD

in all populations, 56.8% compared to 13.6% when considering CEU only,

Table 5.3). These results show that in most cases, the associated SNP reported in a

GWAS is not the most likely to play a biological role in the phenotype according to

ENCODE data.
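The two-step comparison can be sketched in Python as follows. This is a hedged illustration rather than the original implementation: the dictionary layout and the names are hypothetical, and the sketch assumes that a lower RegulomeDB score category indicates more supporting evidence and that two SNPs in the same transcript class are treated as equal, in line with the categories of Table 5.3.

```python
# Step 1 ranking: coding > non-coding part of an exon > intron or intergenic.
TRANSCRIPT_RANK = {"coding": 2, "exon_noncoding": 1, "not_exonic": 0}


def compare_functional_support(snp_a, snp_b):
    """Return "a", "b" or "equal" depending on which SNP is better supported.

    Each SNP is a dict with a "transcript" class and an integer
    "regulome_score" (assumed: lower means more evidence).
    """
    rank_a = TRANSCRIPT_RANK[snp_a["transcript"]]
    rank_b = TRANSCRIPT_RANK[snp_b["transcript"]]
    if rank_a != rank_b:
        return "a" if rank_a > rank_b else "b"
    if rank_a > 0:
        # Both coding, or both in a non-coding part of an exon.
        return "equal"
    # Step 2: neither SNP is in an exon, so compare regulatory evidence.
    score_a = snp_a["regulome_score"]
    score_b = snp_b["regulome_score"]
    if score_a == score_b:
        return "equal"
    return "a" if score_a < score_b else "b"
```

Applying this comparison between the lead SNP and every SNP in its LD region gives the "lead better", "equal" and "SNP in LD better" outcomes summarized in Table 5.3.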


                                                              All populations    CEU only
Only lead SNP coding                                            199   4.21%      87   3.53%
Only lead SNP transcribed, non-coding                           113   2.39%      39   1.58%
Lead SNP supported by more regulatory evidence                  329   6.96%     208   8.44%
Lead better                                                     641  13.56%     334  13.56%

Lead SNP and SNP in LD coding                                    24   0.51%      48   1.95%
Lead SNP and SNP in LD transcribed, non-coding                   21   0.44%      30   1.22%
Lead SNP and SNP in LD have similar regulatory evidence         282   5.97%     193   7.83%
Lead and SNP in LD equal                                        327   6.92%     271  11.00%

Lead SNP transcribed, non-coding, SNP in LD coding               12   0.25%      17   0.69%
Lead SNP not transcribed, SNP in LD coding                      110   2.33%     244   9.90%
Lead SNP not transcribed, SNP in LD transcribed, non-coding      98   2.07%     207   8.40%
SNP in LD supported by more regulatory evidence                 356   7.53%     456  18.51%
SNP in LD annotated, lead SNP not annotated                     483  10.22%     476  19.32%
SNP in LD better                                               1059  22.40%    1400  56.82%

No annotation                                                  1147  24.26%     208   8.44%
Lead SNP annotated, no SNP in LD                               1553  32.85%     251  10.19%

Table 5.3: Comparison of functional evidence between the lead SNP and the best SNP in the linkage disequilibrium region.

Lines in bold represent totals for each case. When considering a linkage disequilibrium threshold in the CEU population alone, only associations that were identified or replicated in populations of European descent are used.


5.2.5 Associations are enriched for regulatory elements

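[Figure 5.3 (image not reproduced): bar chart of fold enrichment (y-axis, 1 to 2.5) for two groups, "All Lead SNPs" and "Lead SNP and eQTLs". Bar categories: eQTL; predicted motif; DNaseI hypersensitivity peak; DNaseI footprint; ChIP-seq peak; ChIP-seq peak and DNaseI peak and motif. Bar labels shown: 1.05, 1.12, 1.22, 1.25, 1.36, 1.33, 1.44, 1.57, 1.63, 1.68, 2.40.]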

Figure 5.3: Enrichment for different combinations of assays.

Enrichments are reported for all lead SNPs associated with a phenotype, and sepa-

rately for lead SNPs that are also eQTLs or in strong linkage disequilibrium with an

eQTL. The enrichment for predicted motifs alone (italic) is not significant. These

results show that combining multiple types of experimental evidence increases the

observed enrichment.

We performed randomizations in order to compare the fraction of lead SNPs that are

functional SNPs or are in linkage disequilibrium with a functional SNP, to the ex-

pected fraction amongst all SNPs. We found that associated regions are significantly

enriched for functional SNPs identified using DNase-seq and ChIP-seq. Furthermore,

enrichments increased, both when integrating multiple ENCODE assays and when

adding eQTL information. We used a subset of 2364 lead SNPs for which sufficient

information is available, and built 100 random matched SNP sets in which each lead

SNP is replaced by a similar SNP (see Section 5.6 for details). We compared the

fraction of lead SNPs overlapping functional regions in the set of actual lead SNPs to

the fractions observed in the random sets, and computed enrichment values in order

to show that the fraction of associated SNPs that overlap functional regions is higher

than expected. Figure 5.3 provides an overview of the enrichment for different types

of functional data.
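The randomization procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the matching criteria are those of Section 5.6 and are not reproduced here, so the sketch takes a hypothetical, precomputed mapping from each lead SNP to its pool of matched SNPs. The function names are illustrative, and the computation of P-values is deliberately left out.

```python
import random


def overlap_fraction(snps, functional_snps):
    """Fraction of SNPs in the list that are functional SNPs."""
    snps = list(snps)
    return sum(1 for snp in snps if snp in functional_snps) / len(snps)


def random_matched_sets(lead_snps, matched_candidates, n_sets=100, seed=0):
    """Build random sets by replacing each lead SNP with a matched SNP.

    matched_candidates maps each lead SNP to a list of similar SNPs
    (hypothetical input; the matching itself is not shown).
    """
    rng = random.Random(seed)
    return [
        [rng.choice(matched_candidates[snp]) for snp in lead_snps]
        for _ in range(n_sets)
    ]


def fold_enrichment(lead_snps, matched_candidates, functional_snps, n_sets=100, seed=0):
    """Observed overlap fraction divided by the mean fraction in random sets."""
    observed = overlap_fraction(lead_snps, functional_snps)
    random_fractions = [
        overlap_fraction(snp_set, functional_snps)
        for snp_set in random_matched_sets(lead_snps, matched_candidates, n_sets, seed)
    ]
    expected = sum(random_fractions) / len(random_fractions)
    return observed / expected
```

A fold enrichment above 1 indicates that lead SNPs overlap functional regions more often than matched random SNPs do. The sketch assumes that at least one random SNP overlaps a functional region, since the expected fraction appears in the denominator.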


When considering lead SNPs only, we observed a 1.12 fold enrichment for DNase

peaks, a 1.22 fold enrichment for DNase footprints and a 1.25 fold enrichment for

ChIP-seq peaks. All enrichments are statistically significant (P-values of

respectively 1.3 · 10^-4, 0.005 and 1.3 · 10^-6). We also observed that combining multiple

types of evidence increases the enrichment: there is a 1.36 fold enrichment for lead

SNPs that overlap with a ChIP-seq peak, a DNase peak, a DNase footprint and a

predicted motif. Similarly, there is a 1.33 fold enrichment for eQTLs, and an even

higher enrichment for eQTLs that also overlap functional regions (up to 2.4 fold). The

enrichments can be compared to the 1.05 fold enrichment (not significant, P-value

0.087) observed when considering overlap with motif-based predictions, which do not

make use of ENCODE data. When extending the set of possible functional SNPs

to SNPs that are in linkage disequilibrium with a lead SNP, we observed a decrease

in the enrichment (Figure 5.4A,B). At an r2 LD threshold of 0.8, enrichments for

most individual modalities are barely significant, but enrichment for functional SNPs

supported by multiple sources of evidence remain significant.
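The decrease in enrichment at lower LD thresholds can be illustrated with the ChIP-seq rows of Tables 5.4 to 5.8. The sketch below recomputes the ratio from the rounded percentages in those tables, so the values can differ slightly from the tabulated enrichments; the variable and function names are illustrative only.

```python
# (LD threshold, observed %, expected %) for ChIP-seq peaks, from Tables 5.4 to 5.8.
chip_seq_rows = [
    ("lead SNP only", 18.9, 15.1),
    ("r2 = 1.0", 23.1, 20.5),
    ("r2 >= 0.9", 27.2, 24.9),
    ("r2 >= 0.8", 30.8, 29.5),
    ("r2 >= 0.5", 41.8, 41.8),
]

def ratios(rows):
    """Return (label, observed %, observed/expected ratio) for each threshold."""
    return [(label, obs, obs / exp) for label, obs, exp in rows]

ld_trend = ratios(chip_seq_rows)
for label, obs, ratio in ld_trend:
    print(f"{label:>14}: {obs:4.1f}% annotated, enrichment ~{ratio:.2f}")
```

As the threshold is relaxed, the annotated fraction rises while the ratio falls towards 1.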

Limiting the set of lead SNPs to the most strongly supported associations (repli-

cation in a different cohort in the original study or in multiple studies) leads to an

increase in enrichment (Figure 5.4C). A total of 1216 lead SNPs were curated from
an original study that included a separate replication population, and 478 of them
(39.3%) overlap DNaseI peaks. A total of 166 lead SNPs were replicated in a different
study, and 85 of them (51.2%) overlap a DNaseI peak. A similar trend can be
observed for ChIP-seq peaks.
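The percentages quoted above follow directly from the reported counts. A short check, with illustrative labels that are not the author's terminology:

```python
# (number of lead SNPs in the subset, number overlapping a DNaseI peak), as reported in the text.
replication_subsets = {
    "separate replication population in the original study": (1216, 478),
    "replicated in a different study": (166, 85),
}

overlap_percent = {
    label: 100 * hits / total for label, (total, hits) in replication_subsets.items()
}
for label, percent in overlap_percent.items():
    print(f"{label}: {percent:.1f}%")
```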

Tables 5.4 (lead SNPs), 5.5 (perfect linkage disequilibrium, r² = 1.0), 5.6
(r² ≥ 0.9), 5.7 (r² ≥ 0.8) and 5.8 (r² ≥ 0.5) present in detail the enrichments for
different types of functional evidence at multiple linkage disequilibrium thresholds,
when considering only functional SNPs that are in linkage disequilibrium with the
lead SNP at or above the indicated threshold in all HapMap 2 populations. Entries
with significant enrichment (P-value < 0.05) are represented in bold. SNPs are
filtered as described in Section 5.6.3, resulting in smaller sets than in Table 5.1.
We separately consider associations that were identified or replicated in populations
of European descent, together with functional SNPs that are in linkage disequilibrium
with the lead SNP at or above the indicated threshold in the HapMap 2 CEU population,
and provide detailed enrichments in Tables 5.9 (lead SNPs), 5.10 (perfect linkage
disequilibrium, r² = 1.0), 5.11 (r² ≥ 0.9), 5.12 (r² ≥ 0.8) and 5.13 (r² ≥ 0.5).

[Figure 5.4. Panel A: Enrichment for different LD thresholds (y-axis: percentage of associations, 0 to 70; x-axis: lead SNP, perfect LD, LD 0.9, LD 0.8, LD 0.7, LD 0.6, LD 0.5; series: DNaseI peaks, ChIP-seq peaks and DNaseI footprints, each with its expected value). Panel B: Enrichments for DNaseI peaks. Panel C: Enrichments for ChIP-seq peaks.]

Figure 5.4: Overview of enrichment.

(A.) Percentage of associated SNPs mapped to a functional SNP overlapping DNa-

seI peaks, DNaseI footprints and ChIP-seq peaks (full lines) compared to expected

percentages in the matched null sets (dotted line) for various linkage disequilibrium

thresholds. As the LD threshold decreases, the fraction of associations that can

be mapped to functional SNPs increases, but the enrichment for functional SNPs

amongst associated SNPs decreases. Comparison of the fraction of SNPs overlap-

ping DNaseI (B.) and ChIP-seq peaks (C.) in various null sets of matched random

SNPs (blue) and sets of associated SNPs (green). The fraction of random SNPs

overlapping DNaseI peaks and ChIP-seq peaks increases when properties of associ-

ated SNPs are matched more closely, and when considering associations supported

by more evidence. The red arrows indicate the most stringent matched null set,

and the set of all associations. We compare those two sets in order to show that

enrichments are significant.

Functional evidence   Observed (count)   Observed (%)   Expected (%)   Enrichment   P-value

Total 2364

Predicted motif 688 29.1% 27.7% 1.05 0.087

DNaseI hypersensitivity peak 810 34.3% 30.5% 1.12 1.3 · 10−4

DNaseI footprint 178 7.5% 6.2% 1.22 0.005

ChIP-seq peak 446 18.9% 15.1% 1.25 1.3 · 10−6

RegulomeDB Score 2 63 2.7% 2.0% 1.36 0.015

RegulomeDB Score ≤ 3 86 3.6% 2.8% 1.32 0.012

RegulomeDB Score ≤ 4 290 12.3% 9.5% 1.29 3.6 · 10−5

RegulomeDB Score ≤ 5 839 35.5% 31.6% 1.12 1.3 · 10−4

RegulomeDB Score ≤ 6 1326 56.1% 51.1% 1.10 3.5 · 10−7

eQTL 191 8.1% 6.1% 1.33 1.0 · 10−4

eQTL + Predicted motif 53 2.2% 1.6% 1.44 0.015

eQTL + DNaseI hypersensitivity peak 91 3.8% 2.5% 1.57 4.7 · 10−5

eQTL + DNaseI footprint 20 0.8% 0.5% 1.63 0.021

eQTL + ChIP-seq peak 52 2.2% 1.3% 1.68 1.3 · 10−4

eQTL + RegulomeDB Score 2 10 0.4% 0.2% 2.40 0.003

eQTL + RegulomeDB Score ≤ 3 13 0.5% 0.2% 2.36 0.002

eQTL + RegulomeDB Score ≤ 4 32 1.4% 0.8% 1.63 0.002

eQTL + RegulomeDB Score ≤ 5 85 3.6% 2.3% 1.54 2.4 · 10−4

eQTL + RegulomeDB Score ≤ 6 162 6.9% 5.2% 1.31 6.5 · 10−4

Table 5.4: Overview of enrichment: lead SNPs only, all populations.

Functional evidence   Observed (count)   Observed (%)   Expected (%)   Enrichment   P-value

Total 2364

Predicted motif 855 36.2% 37.5% 0.97 0.164

DNaseI hypersensitivity peak 942 39.8% 37.2% 1.07 0.009

DNaseI footprint 237 10.0% 9.0% 1.12 0.063

ChIP-seq peak 546 23.1% 20.5% 1.13 0.001

RegulomeDB Score 2 79 3.3% 2.7% 1.22 0.086

RegulomeDB Score ≤ 3 107 4.5% 3.9% 1.15 0.149

RegulomeDB Score ≤ 4 342 14.5% 12.2% 1.19 0.002

RegulomeDB Score ≤ 5 950 40.2% 36.8% 1.09 0.001

RegulomeDB Score ≤ 6 1441 61.0% 57.0% 1.07 7.1 · 10−5

eQTL 213 9.0% 7.1% 1.26 0.001

eQTL + Predicted motif 84 3.6% 3.3% 1.09 0.460

eQTL + DNaseI hypersensitivity peak 129 5.5% 4.0% 1.38 0.001

eQTL + DNaseI footprint 39 1.6% 1.3% 1.30 0.096

eQTL + ChIP-seq peak 84 3.6% 2.6% 1.37 0.007

eQTL + RegulomeDB Score 2 14 0.6% 0.4% 1.63 0.068

eQTL + RegulomeDB Score ≤ 3 19 0.8% 0.5% 1.69 0.045

eQTL + RegulomeDB Score ≤ 4 44 1.9% 1.3% 1.38 0.025

eQTL + RegulomeDB Score ≤ 5 109 4.6% 3.2% 1.45 3.5 · 10−4

eQTL + RegulomeDB Score ≤ 6 169 7.1% 5.6% 1.28 0.002

Table 5.5: Overview of enrichment: perfect LD, all populations.

Functional evidence   Observed (count)   Observed (%)   Expected (%)   Enrichment   P-value

Total 2364

Predicted motif 1011 42.8% 44.3% 0.96 0.119

DNaseI hypersensitivity peak 1066 45.1% 42.3% 1.07 0.009

DNaseI footprint 307 13.0% 11.5% 1.13 0.047

ChIP-seq peak 642 27.2% 24.9% 1.09 0.012

RegulomeDB Score 2 97 4.1% 3.4% 1.22 0.041

RegulomeDB Score ≤ 3 130 5.5% 4.9% 1.13 0.147

RegulomeDB Score ≤ 4 393 16.6% 14.3% 1.16 0.002

RegulomeDB Score ≤ 5 1040 44.0% 40.3% 1.09 0.001

RegulomeDB Score ≤ 6 1514 64.0% 60.4% 1.06 1.3 · 10−4

eQTL 234 9.9% 8.0% 1.24 0.001

eQTL + Predicted motif 114 4.8% 4.5% 1.07 0.482

eQTL + DNaseI hypersensitivity peak 159 6.7% 5.1% 1.32 0.001

eQTL + DNaseI footprint 58 2.5% 2.0% 1.24 0.118

eQTL + ChIP-seq peak 111 4.7% 3.6% 1.30 0.007

eQTL + RegulomeDB Score 2 18 0.8% 0.5% 1.63 0.033

eQTL + RegulomeDB Score ≤ 3 23 1.0% 0.6% 1.53 0.052

eQTL + RegulomeDB Score ≤ 4 57 2.4% 1.7% 1.44 0.005

eQTL + RegulomeDB Score ≤ 5 122 5.2% 3.6% 1.43 2.9 · 10−4

eQTL + RegulomeDB Score ≤ 6 176 7.4% 5.7% 1.30 0.001

Table 5.6: Overview of enrichment: r2 ≥ 0.9, all populations.

Functional evidence   Observed (count)   Observed (%)   Expected (%)   Enrichment   P-value

Total 2364

Predicted motif 1144 48.4% 50.6% 0.96 0.032

DNaseI hypersensitivity peak 1171 49.5% 47.4% 1.05 0.038

DNaseI footprint 368 15.6% 14.3% 1.09 0.077

ChIP-seq peak 727 30.8% 29.5% 1.04 0.168

RegulomeDB Score 2 125 5.3% 4.0% 1.33 0.001

RegulomeDB Score ≤ 3 162 6.9% 5.8% 1.18 0.038

RegulomeDB Score ≤ 4 449 19.0% 16.4% 1.16 0.002

RegulomeDB Score ≤ 5 1098 46.4% 43.7% 1.06 0.005

RegulomeDB Score ≤ 6 1562 66.1% 63.1% 1.05 0.001

eQTL 256 10.8% 8.7% 1.24 7.3 · 10−4

eQTL + Predicted motif 139 5.9% 5.7% 1.03 0.692

eQTL + DNaseI hypersensitivity peak 183 7.7% 6.2% 1.25 0.003

eQTL + DNaseI footprint 73 3.1% 2.7% 1.13 0.300

eQTL + ChIP-seq peak 136 5.8% 4.6% 1.24 0.014

eQTL + RegulomeDB Score 2 23 1.0% 0.6% 1.63 0.015

eQTL + RegulomeDB Score ≤ 3 29 1.2% 0.8% 1.51 0.020

eQTL + RegulomeDB Score ≤ 4 67 2.8% 2.0% 1.41 0.002

eQTL + RegulomeDB Score ≤ 5 130 5.5% 4.0% 1.37 2.6 · 10−4

eQTL + RegulomeDB Score ≤ 6 181 7.7% 5.8% 1.32 1.7 · 10−4

Table 5.7: Overview of enrichment: r2 ≥ 0.8, all populations.

Functional evidence   Observed (count)   Observed (%)   Expected (%)   Enrichment   P-value

Total 2364

Predicted motif 1467 62.1% 64.9% 0.96 0.006

DNaseI hypersensitivity peak 1461 61.8% 60.1% 1.03 0.077

DNaseI footprint 554 23.4% 22.6% 1.04 0.283

ChIP-seq peak 988 41.8% 41.8% 1.00 0.958

RegulomeDB Score 2 174 7.4% 5.9% 1.24 0.004

RegulomeDB Score ≤ 3 231 9.8% 8.7% 1.13 0.063

RegulomeDB Score ≤ 4 573 24.2% 22.2% 1.09 0.022

RegulomeDB Score ≤ 5 1239 52.4% 50.9% 1.03 0.112

RegulomeDB Score ≤ 6 1623 68.7% 67.3% 1.02 0.133

eQTL 305 12.9% 10.8% 1.19 0.001

eQTL + Predicted motif 223 9.4% 8.7% 1.08 0.206

eQTL + DNaseI hypersensitivity peak 258 10.9% 9.1% 1.21 0.001

eQTL + DNaseI footprint 135 5.7% 5.1% 1.11 0.162

eQTL + ChIP-seq peak 207 8.8% 7.6% 1.15 0.030

eQTL + RegulomeDB Score 2 38 1.6% 0.9% 1.71 1.0 · 10−4

eQTL + RegulomeDB Score ≤ 3 46 1.9% 1.3% 1.56 0.002

eQTL + RegulomeDB Score ≤ 4 94 4.0% 2.8% 1.44 8.6 · 10−5

eQTL + RegulomeDB Score ≤ 5 147 6.2% 4.6% 1.34 6.3 · 10−5

eQTL + RegulomeDB Score ≤ 6 181 7.7% 5.8% 1.31 6.6 · 10−5

Table 5.8: Overview of enrichment: r2 ≥ 0.5, all populations.

Functional evidence   Observed (count)   Observed (%)   Expected (%)   Enrichment   P-value

Total 1310

Predicted motif 401 30.6% 28.0% 1.09 0.019

DNaseI hypersensitivity peak 474 36.2% 31.1% 1.16 1.3 · 10−4

DNaseI footprint 114 8.7% 6.4% 1.35 4.2 · 10−4

ChIP-seq peak 266 20.3% 15.6% 1.30 4.9 · 10−7

RegulomeDB Score 2 43 3.3% 2.0% 1.62 0.001

RegulomeDB Score ≤ 3 55 4.2% 2.8% 1.48 0.002

RegulomeDB Score ≤ 4 183 14.0% 9.8% 1.43 2.4 · 10−7

RegulomeDB Score ≤ 5 478 36.5% 32.0% 1.14 0.002

RegulomeDB Score ≤ 6 758 57.9% 51.3% 1.13 3.3 · 10−6

eQTL 107 8.2% 6.1% 1.35 0.002

eQTL + Predicted motif 34 2.6% 1.6% 1.67 0.005

eQTL + DNaseI hypersensitivity peak 56 4.3% 2.5% 1.74 6.5 · 10−5

eQTL + DNaseI footprint 16 1.2% 0.5% 2.32 2.5 · 10−4

eQTL + ChIP-seq peak 33 2.5% 1.4% 1.83 1.4 · 10−4

eQTL + RegulomeDB Score 2 8 0.6% 0.2% 3.56 1.2 · 10−4

eQTL + RegulomeDB Score ≤ 3 10 0.8% 0.2% 3.42 1.1 · 10−4

eQTL + RegulomeDB Score ≤ 4 21 1.6% 0.8% 1.89 0.001

eQTL + RegulomeDB Score ≤ 5 47 3.6% 2.4% 1.52 0.007

eQTL + RegulomeDB Score ≤ 6 87 6.6% 5.2% 1.28 0.022

Table 5.9: Overview of enrichment: lead SNPs only, European populations.

Functional evidence   Observed (count)   Observed (%)   Expected (%)   Enrichment   P-value

Total 1310

Predicted motif 775 59.2% 62.6% 0.95 0.011

DNaseI hypersensitivity peak 774 59.1% 58.8% 1.00 0.850

DNaseI footprint 302 23.1% 22.3% 1.03 0.526

ChIP-seq peak 553 42.2% 40.8% 1.04 0.233

RegulomeDB Score 2 105 8.0% 6.0% 1.34 0.002

RegulomeDB Score ≤ 3 140 10.7% 8.7% 1.23 0.011

RegulomeDB Score ≤ 4 337 25.7% 21.6% 1.19 1.1 · 10−4

RegulomeDB Score ≤ 5 670 51.1% 49.5% 1.03 0.175

RegulomeDB Score ≤ 6 892 68.1% 65.7% 1.04 0.044

eQTL 150 11.5% 10.1% 1.14 0.120

eQTL + Predicted motif 102 7.8% 7.8% 1.00 0.968

eQTL + DNaseI hypersensitivity peak 119 9.1% 8.1% 1.12 0.212

eQTL + DNaseI footprint 64 4.9% 4.4% 1.11 0.385

eQTL + ChIP-seq peak 88 6.7% 6.8% 0.99 0.952

eQTL + RegulomeDB Score 2 14 1.1% 0.8% 1.31 0.288

eQTL + RegulomeDB Score ≤ 3 20 1.5% 1.1% 1.37 0.129

eQTL + RegulomeDB Score ≤ 4 40 3.1% 2.5% 1.24 0.159

eQTL + RegulomeDB Score ≤ 5 68 5.2% 4.3% 1.21 0.125

eQTL + RegulomeDB Score ≤ 6 90 6.9% 5.6% 1.23 0.061

Table 5.10: Overview of enrichment: perfect LD, European populations.

Functional evidence   Observed (count)   Observed (%)   Expected (%)   Enrichment   P-value

Total 1310

Predicted motif 888 67.8% 71.1% 0.95 0.013

DNaseI hypersensitivity peak 884 67.5% 66.7% 1.01 0.512

DNaseI footprint 377 28.8% 28.8% 1.00 0.988

ChIP-seq peak 668 51.0% 49.1% 1.04 0.134

RegulomeDB Score 2 120 9.2% 7.3% 1.25 0.014

RegulomeDB Score ≤ 3 163 12.4% 10.7% 1.16 0.045

RegulomeDB Score ≤ 4 389 29.7% 25.3% 1.18 1.7 · 10−4

RegulomeDB Score ≤ 5 717 54.7% 53.1% 1.03 0.134

RegulomeDB Score ≤ 6 900 68.7% 67.1% 1.02 0.145

eQTL 173 13.2% 11.7% 1.13 0.096

eQTL + Predicted motif 135 10.3% 9.9% 1.04 0.670

eQTL + DNaseI hypersensitivity peak 151 11.5% 10.2% 1.13 0.138

eQTL + DNaseI footprint 89 6.8% 6.4% 1.07 0.480

eQTL + ChIP-seq peak 123 9.4% 8.9% 1.05 0.571

eQTL + RegulomeDB Score 2 14 1.1% 1.0% 1.03 0.916

eQTL + RegulomeDB Score ≤ 3 22 1.7% 1.4% 1.20 0.369

eQTL + RegulomeDB Score ≤ 4 47 3.6% 2.9% 1.24 0.141

eQTL + RegulomeDB Score ≤ 5 73 5.6% 4.6% 1.21 0.106

eQTL + RegulomeDB Score ≤ 6 89 6.8% 5.5% 1.23 0.050

Table 5.11: Overview of enrichment: r2 ≥ 0.9, European populations.

Functional evidence   Observed (count)   Observed (%)   Expected (%)   Enrichment   P-value

Total 1310

Predicted motif 985 75.2% 77.6% 0.97 0.044

DNaseI hypersensitivity peak 962 73.4% 73.3% 1.00 0.931

DNaseI footprint 475 36.3% 34.9% 1.04 0.297

ChIP-seq peak 772 58.9% 56.7% 1.04 0.107

RegulomeDB Score 2 144 11.0% 8.7% 1.27 0.004

RegulomeDB Score ≤ 3 196 15.0% 12.6% 1.19 0.011

RegulomeDB Score ≤ 4 433 33.1% 28.3% 1.17 1.2 · 10−4

RegulomeDB Score ≤ 5 733 56.0% 55.1% 1.02 0.417

RegulomeDB Score ≤ 6 889 67.9% 66.8% 1.02 0.291

eQTL 206 15.7% 13.2% 1.19 0.004

eQTL + Predicted motif 170 13.0% 11.9% 1.09 0.188

eQTL + DNaseI hypersensitivity peak 186 14.2% 12.1% 1.18 0.014

eQTL + DNaseI footprint 125 9.5% 8.2% 1.16 0.040

eQTL + ChIP-seq peak 162 12.4% 10.9% 1.13 0.084

eQTL + RegulomeDB Score 2 17 1.3% 1.2% 1.08 0.730

eQTL + RegulomeDB Score ≤ 3 27 2.1% 1.6% 1.27 0.172

eQTL + RegulomeDB Score ≤ 4 53 4.0% 3.2% 1.28 0.042

eQTL + RegulomeDB Score ≤ 5 71 5.4% 4.7% 1.15 0.199

eQTL + RegulomeDB Score ≤ 6 84 6.4% 5.4% 1.19 0.069

Table 5.12: Overview of enrichment: r2 ≥ 0.8, European populations.

Functional evidence   Observed (count)   Observed (%)   Expected (%)   Enrichment   P-value

Total 1310

Predicted motif 1179 90.0% 90.3% 1.00 0.755

DNaseI hypersensitivity peak 1154 88.1% 87.0% 1.01 0.207

DNaseI footprint 736 56.2% 52.4% 1.07 0.001

ChIP-seq peak 1025 78.2% 74.5% 1.05 0.002

RegulomeDB Score 2 199 15.2% 12.3% 1.23 0.002

RegulomeDB Score ≤ 3 262 20.0% 17.6% 1.14 0.024

RegulomeDB Score ≤ 4 490 37.4% 35.0% 1.07 0.075

RegulomeDB Score ≤ 5 734 56.0% 56.2% 1.00 0.863

RegulomeDB Score ≤ 6 807 61.6% 62.4% 0.99 0.490

eQTL 304 23.2% 18.5% 1.25 1.7 · 10−6

eQTL + Predicted motif 292 22.3% 18.0% 1.24 1.2 · 10−5

eQTL + DNaseI hypersensitivity peak 300 22.9% 18.1% 1.27 7.1 · 10−7

eQTL + DNaseI footprint 250 19.1% 14.5% 1.31 1.3 · 10−7

eQTL + ChIP-seq peak 287 21.9% 17.2% 1.27 8.6 · 10−7

eQTL + RegulomeDB Score 2 36 2.7% 1.9% 1.46 0.027

eQTL + RegulomeDB Score ≤ 3 49 3.7% 2.5% 1.52 0.004

eQTL + RegulomeDB Score ≤ 4 74 5.6% 4.2% 1.34 0.010

eQTL + RegulomeDB Score ≤ 5 88 6.7% 5.4% 1.25 0.030

eQTL + RegulomeDB Score ≤ 6 91 6.9% 5.6% 1.23 0.046

Table 5.13: Overview of enrichment: r2 ≥ 0.5, European populations.

5.2.6 Analysis at the phenotype level

In addition to considering individual associations separately, we can group associated

SNPs in order to search for patterns at the phenotype level. We first assessed whether

there are specific sequence binding proteins that tend to overlap functional SNPs

associated with certain phenotypes more often than expected, using only associations

in populations of European descent (Figure 5.5). We found a strong association
(P-value 9 · 10⁻⁵) between height and CTCF ChIP-seq peaks. A total of 39 SNPs
associated with height overlap a ChIP-seq peak or are in strong linkage disequilibrium
(r² ≥ 0.8 in the CEU population) with a SNP that overlaps a ChIP-seq peak, and
15 of those (38%) overlap a peak for CTCF (Table 5.14), compared to 89 out of
626 SNPs (14%) when considering all phenotypes. We also found an interesting
association between prostate cancer and the androgen receptor (AR), a transcription
factor that was not assessed by ENCODE but was assessed as a control in a separate
study [124].

Of the 9 functional SNPs for prostate cancer that overlap a ChIP-seq peak, 5 overlap

an AR ChIP-seq peak (Table 5.15).
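The statistical test behind the height and CTCF P-value is not stated in this section. One common choice for this kind of over-representation question, assumed here purely for illustration, is a one-sided hypergeometric test: given 626 SNPs overlapping any ChIP-seq peak, of which 89 overlap a CTCF peak, how likely is it that at least 15 of the 39 height SNPs overlap CTCF by chance? The function name is hypothetical.

```python
from math import comb

def hypergeom_tail(k, draws, successes, population):
    """P(X >= k) when drawing `draws` items without replacement from `population`
    items, of which `successes` carry the property of interest."""
    total = comb(population, draws)
    upper = min(draws, successes)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total

# Counts from the text: 15 of 39 height SNPs versus 89 of 626 SNPs overall.
p_height_ctcf = hypergeom_tail(15, 39, 89, 626)
print(f"one-sided hypergeometric P-value: {p_height_ctcf:.2e}")
```

Under this assumed test the result is a small tail probability, but it need not match the reported value exactly, since the original procedure may differ.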

[Figure 5.5: matrix of lead SNP counts, with phenotypes as rows and DNA binding proteins assessed using ChIP-seq as columns.]

Figure 5.5: Phenotype level overview of the overlap between associations and

ChIP-seq binding.

This matrix view shows phenotypes vertically and DNA binding proteins assessed

using ChIP-seq horizontally. Each cell represents the number of lead SNPs for

the respective phenotype that overlap with a ChIP-seq peak for the respective DNA

binding protein or are in strong LD (r2 ≥ 0.8 in the CEU HapMap 2 population) with

a SNP that overlaps such a peak. Only phenotypes with at least 20 lead SNPs, and

DNA binding proteins overlapping at least 30 functional SNPs are shown, but totals

are computed over the entire data set. The significant interaction between height-

associated functional SNPs and CTCF, as well as the association between prostate

cancer-associated functional SNPs and androgen receptor (AR) are represented in

bold font.

Chr.   Associated SNP   Functional SNP   Gene   Study   P-value

chr1 rs6686842 rs11209342 SCMH1 Weedon et al. 2008 [125] 2 · 10−8

chr2 rs2580816 NPPC, DIS3L2 Lango Allen et al. 2010 [92] 6 · 10−22

chr2 rs6724465 rs13419740 NHEJ1 Weedon et al. 2008 [125] 2 · 10−8

chr3 rs9863706 RYBP Lango Allen et al. 2010 [92] 4 · 10−13

chr6 rs6899976 L3MBTL3 Gudbjartsson et al. 2008 [126] 6 · 10−6

chr6 rs7742369 KRT18P9, CYCSL1 Okada et al. 2010 [127] 1 · 10−13

chr7 rs2730245 rs6965685 WDR60 Lettre et al. 2008 [128] 3 · 10−7

chr9 rs7032940 rs7036157 AKAP2, C9orf152 Kim et al. 2010 [129] 3 · 10−6

chr9 rs946053 COL27A1 Gudbjartsson et al. 2008 [126] 2 · 10−7

chr14 rs1950500 rs12590407 RIPK3, NFATC4 Lango Allen et al. 2010 [92] 2 · 10−18

chr15 rs8041863 rs8030631 ACAN Weedon et al. 2008 [125] 8 · 10−8

chr17 rs3760318 ADAP2 Gudbjartsson et al. 2008 [126] 2 · 10−9

chr17 rs2665838 GH2, CSH1 Lango Allen et al. 2010 [92] 5 · 10−25

chr19 rs12986413 rs1015670 DOT1L Lettre et al. 2008 [128] 3 · 10−8

chr22 rs5751614 BCR Gudbjartsson et al. 2008 [126] 6 · 10−6

Table 5.14: Height-associated functional SNPs overlapping CTCF binding

sites

An empty cell in the Functional SNP column indicates that the associated lead SNP

is also functional.

Chr.   Associated SNP   Functional SNP   Gene   Study   P-value

chr4 rs7679673 rs10007915 RPL6P14 - TET2 Eeles et al. 2009 [130] 3 · 10−14

chr6 rs339331 RFX6 Takata et al. 2010 [131] 2 · 10−12

chr7 rs10486567 JAZF1 Thomas et al. 2008 [132] 2 · 10−6

chr8 rs1456315 SRRM1P1,POU5F1B Takata et al. 2010 [131] 2 · 10−29

chr17 rs1859962 CALM2P1,SOX9 Schumacher et al. 2011 [133] 3 · 10−11

Eeles et al. 2009 [130] 2 · 10−16

Eeles et al. 2008 [134] 1 · 10−6

Gudmundsson et al. 2007 [135] 3 · 10−10

Table 5.15: Prostate cancer-associated functional SNPs overlapping AR bind-

ing sites

An empty cell in the Functional SNP column indicates that the associated lead SNP

is also functional.

5.3 Discussion

In this work, we used data generated by the ENCODE consortium to identify regu-

latory and transcribed functional SNPs that are associated with a phenotype, either

directly in a genome wide association study or indirectly through linkage disequilib-

rium with a GWAS association. We further added eQTL information, thus identi-

fying SNPs that are associated with a phenotype, for which there is evidence that

they affect a regulatory region or a transcribed region, and for which a downstream

target affected by the SNP is known. This approach therefore has the potential to

provide putative mechanistic explanations for GWAS associations. We showed that

this method is successful in identifying a functional SNP for a majority of previously

reported GWAS associations (up to 81% when considering association studies per-

formed in populations of European descent, and using the CEU population to obtain

linkage disequilibrium information).

The fraction of associated SNPs for which we can provide a functional annota-

tion is similar to the one reported in the ENCODE integrative analysis paper [119].

The integrative analysis uses both DNase-seq and Formaldehyde-Assisted Isolation of
Regulatory Elements (FAIRE) [136] data to identify regions of open chromatin, and
thus finds a slightly larger fraction of the associated SNPs to overlap or be in LD
with open chromatin regions compared to our approach, which does not use FAIRE data.

We found that GWAS associations are significantly enriched for DNase hypersensi-

tivity peaks, DNaseI footprints and ChIP-seq peaks even when accounting for most

features of associated SNPs. Our results are consistent with chromatin state-based

methods [115], in which a segmentation approach was used in order to identify enrich-

ment for disease associations in predicted enhancers. Segmentation-based approaches

use machine learning methods to predict chromatin state at every position in the

genome based mostly on histone information. These predictions are then compared

to GWAS results, thus showing enrichment for predicted states. A major difference

of our work is that we directly used ChIP-seq and DNase-seq functional data in our
analysis, and showed enrichment for observed ChIP-seq peaks or DNaseI hypersensitive
regions. In this work, we demonstrated that there is significant enrichment of
GWAS associations for these types of data. Furthermore, we found that 1) integrating
multiple types of functional data and expression information identifies more
likely candidate causal SNPs within an LD region, and 2) phenotypic information
from GWAS can be associated with biochemical data.

Existing methods for prioritizing SNPs based on their functional role focused on

transcribed regions [94, 95, 96], whereas we focused on regulatory regions. In the con-

text of regulatory regions, most approaches are based on motif information [101, 102],

and approaches using experimental data have generally been limited to individual as-

sociations [114]. The comprehensive data sets generated by the ENCODE consortium

are the first to offer sufficient information to allow for genome-wide methods that rely

on experimental information. We used enrichment to compare the sensitivity of our

approach with motif-based methods. We found that there is no significant enrichment

for GWAS associations amongst conserved motifs.

5.3.1 Identifying functional SNPs in linkage disequilibrium with lead SNPs

We found that in most cases, there is more evidence supporting another SNP in

strong LD with the lead SNP than the lead SNP itself. This is consistent with re-

sults from fine mapping analyses that indicate that multiple variants in the linkage

disequilibrium region surrounding a lead SNP appear to play a role in the phenotype

of interest [137, 93]. This result is of particular importance for the interpretation of

GWAS results, as LD patterns differ markedly between populations. If the functional

SNP is in strong LD with the lead SNP in the population in which the GWAS was

performed, but not in a different population, then the lead SNP will not be asso-

ciated with the phenotype in this second population. An example of this situation

is functional SNP rs1333047, which lies in a region associated with coronary artery

disease. This SNP is in perfect LD with two lead SNPs in populations of European

descent in which the studies identifying the associations were performed, but not in

populations of African descent. This example is discussed in detail in Section 6.4.
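To make the population dependence concrete, the standard r² measure of linkage disequilibrium between two biallelic SNPs can be computed from haplotype and allele frequencies. The frequencies below are hypothetical and are not taken from HapMap; they only illustrate how the same pair of SNPs can be in perfect LD in one population and in weak LD in another.

```python
def r_squared(p_ab, p_a, p_b):
    """r^2 between two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele a at SNP 1 and allele b at SNP 2
    p_a, p_b: frequencies of allele a and allele b
    """
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# Hypothetical population 1: the two alleles always occur together.
r2_population_1 = r_squared(p_ab=0.30, p_a=0.30, p_b=0.30)

# Hypothetical population 2: the alleles are only weakly correlated.
r2_population_2 = r_squared(p_ab=0.15, p_a=0.30, p_b=0.40)

print(f"population 1: r2 = {r2_population_1:.2f}")
print(f"population 2: r2 = {r2_population_2:.2f}")
```

In the second hypothetical population, a lead SNP chosen by a GWAS in the first population would tag the functional SNP poorly.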

5.3.2 Comparison of functional assays

We integrated data from multiple types of functional assays in order to identify func-

tional SNPs. We found that the highest enrichments are obtained when requiring

functional SNPs to be supported by multiple sources of experimental evidence rather

than only one. The highest enrichments are observed when using both eQTL informa-

tion and ENCODE data, and when considering associations that have been replicated.

A similar trend can be observed when examining individual assays. The more specific
the assay, the higher the enrichment for overlap amongst GWAS associations:
DNase hypersensitivity peaks, which broadly capture regions in which chromatin is
accessible, overlap a large fraction of SNPs in general, thus leading to relatively
weak enrichments, whereas the enrichment is higher for ChIP-seq peaks (1.25-fold
compared to 1.12-fold for lead SNPs, Table 5.4), which experimentally identify the
binding of specific transcription factors and other molecules. There is a clear
trade-off between the more significant enrichment we observe and the lower fraction
of associations annotated with ChIP-seq peaks. The ChIP-seq data generated so far by
the ENCODE consortium only assesses 119 transcription factors, a fraction of the
1,800 known ones [119]. Most transcription factors

are assessed in a small subset of the ENCODE cell lines, whereas DNase-seq has been

performed on most ENCODE cell lines. DNase footprinting, which combines DNase-

seq data with sequence and motif information, is useful to identify potential binding

sites for transcription factors not assessed using ChIP-seq. An example of this situ-

ation is functional SNP rs7163757, which is in LD with a lead SNP associated with

type 2 diabetes. DNaseI footprinting identifies a nuclear factor of activated T-cells

(NFAT) footprint that overlaps rs7163757. NFAT is part of the calcineurin/NFAT

pathway [138], which has been implicated in the regulation of growth and function of the

insulin-producing pancreatic beta cells, and linked to the expression of genes known

to be associated with type 2 diabetes [139].
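The trade-off between enrichment and the fraction of associations annotated can be read off the lead-SNP rows of Table 5.4. The short sketch below simply ranks the four single-assay rows by enrichment; the data structure and names are illustrative only.

```python
# (assay, % of lead SNPs annotated, fold enrichment), from Table 5.4.
assays = [
    ("Predicted motif", 29.1, 1.05),
    ("DNaseI hypersensitivity peak", 34.3, 1.12),
    ("DNaseI footprint", 7.5, 1.22),
    ("ChIP-seq peak", 18.9, 1.25),
]

ranked = sorted(assays, key=lambda row: row[2], reverse=True)
for name, annotated, enrichment in ranked:
    print(f"{name:<30} {annotated:5.1f}% annotated, {enrichment:.2f}-fold")
```

The broadest assay (DNaseI hypersensitivity peaks) annotates the most lead SNPs, while the more specific assays rank higher on enrichment.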

5.3.3 Differences between tissue types

Transcription factor binding patterns are heterogeneous, and differ between tissue

types. Assessing this heterogeneity has been a main motivation for the ENCODE

project. One concern is that the cell lines from which the functional information is

derived do not necessarily correspond to the tissue type that is most relevant to

the phenotype of interest. A similar approach has been successfully used to identify

functional SNPs that play a role in coronary artery disease based on a ChIP-seq assay

performed in the immortalized HeLa cell line [114]. By choosing to use functional

data across all tissues, we purposefully favor sensitivity over specificity. An example

illustrating the benefits of this trade-off is rs2074238, a functional SNP associated with

long QT syndrome. A ChIP-seq experiment identifies the binding of estrogen receptor

alpha at this location in an epithelial cell line. Long QT syndrome is more prevalent

in women [140, 141], menstrual cycle affects the QT interval [142], and estrogen

therapy has been shown to affect the duration of the QT interval in postmenopausal

women [143, 144]. ChIP-seq data for this transcription factor is only available for

two cell lines, neither of cardiac origin. By limiting our approach to functional data

obtained in cardiac tissues, we would have excluded a transcription factor whose

role in the phenotype is supported by extensive prior evidence. When examining

all associations, the significant enrichments we report demonstrate that our current

approach improves specificity compared to using motif information only.

Although the ChIP-seq data generated so far by the ENCODE consortium is

sparse, especially in terms of number of different tissues in which a transcription

factor is assessed, the number of available data sets is growing rapidly. We expect

that it will soon become possible to refine this approach by considering the most

relevant tissue types only, thus further improving its specificity. A remaining challenge

is the identification of specific tissue types that are relevant for a given phenotype.

A specific example is a functional SNP we identify in the context of Alzheimer's
disease: in cell lines of hepatic origin, rs3764650 overlaps a binding site for HNF4A,
a transcription factor known to mainly play a role in the liver. Although Alzheimer's

is a neurodegenerative disease, a recently published study shows that the liver might

play an important role in the disease mechanism as well [145]. This example shows

the benefits of looking broadly at all available experimental data from ENCODE.

5.3.4 Functional SNPs beyond reported associations

In this work, we focused on using ENCODE information in order to identify functional

SNPs in strong LD with previously reported associations. It is however important

to note that these SNPs only represent a small fraction of all the SNPs that overlap

functional regions identified by ENCODE. SNPs that alter transcription factor bind-

ing sites are likely to have some biologically important effect, and have an impact on

some phenotype. Such a SNP will, however, only be found in a GWAS if the spe-

cific phenotype it affects is assessed. Given this fundamental limitation of association

studies, an orthogonal approach would be to study the functional effects of common

SNPs regardless of their association with a phenotype. Furthermore, this effect ex-

plains why the enrichments we observe, while significant, are relatively modest. We

used a stringent null model in which a lead SNP is matched to a random SNP that

is similar to the lead SNP, and in particular located at a similar distance to the
nearest transcription start site. Associated SNPs are located closer to genes than
SNPs in general, and therefore null sets are also biased towards SNPs that are likely

to have some biological effect. Relaxing the null model leads to higher enrichments

(Figure 5.4B,C).
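A matched null set of the kind described above can be sketched as follows. The sketch matches on a single property, the distance to the nearest transcription start site, using hypothetical bin edges; the actual matching criteria are those of Section 5.6, and all identifiers and distances below are invented for illustration.

```python
import random
from collections import defaultdict

def distance_bin(distance_to_tss, edges=(1_000, 10_000, 100_000)):
    """Index of the first (hypothetical) bin edge that the distance falls below."""
    for i, edge in enumerate(edges):
        if distance_to_tss < edge:
            return i
    return len(edges)

def build_matched_set(lead_snps, background_snps, rng):
    """Replace each lead SNP by a random background SNP from the same distance bin.

    Both arguments map a SNP identifier to its distance to the nearest TSS.
    Raises IndexError if a bin has no background SNP.
    """
    by_bin = defaultdict(list)
    for snp_id, dist in background_snps.items():
        by_bin[distance_bin(dist)].append(snp_id)
    return [rng.choice(by_bin[distance_bin(dist)]) for dist in lead_snps.values()]

# Invented example data: identifiers and distances are placeholders.
lead = {"lead1": 500, "lead2": 50_000}
background = {"bg1": 200, "bg2": 900, "bg3": 20_000, "bg4": 70_000, "bg5": 500_000}
matched = build_matched_set(lead, background, random.Random(0))
print(matched)
```

Widening the bins, or dropping the matching altogether, corresponds to the relaxed null models, which draw from SNPs that are on average farther from genes.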

5.3.5 Analysis at the phenotype level

We identify a significant association between height and CTCF, with 15 associated

SNPs either overlapping a CTCF ChIP-seq peak, or in strong linkage disequilibrium

with a CTCF peak. CTCF [146] is a transcription factor that plays a key role in insu-

lators in the human genome [147, 148]. While CTCF plays a role in many biological

processes [149], the association with height that we observe is significant when com-

pared to the fraction of other GWAS associations overlapping CTCF. Methylation of

a CTCF site has been shown to control the expression of IGF2 [147], a gene involved

in embryonic growth regulation [150] that also plays a major role in muscle growth

in pigs [151]. It is possible that disruptions of CTCF binding sites could inactivate

insulators, either due to the genotype at loci in the binding sites, or due to the joint

effect of genotype and methylation, and that these disruptions could have an effect

on tissue growth, and thus height.

We identify five functional SNPs that are associated with prostate cancer and

overlap androgen receptor (AR) binding sites identified using ChIP-seq. The androgen

receptor is activated by testosterone [152] and has been shown to play an important

role in prostate cancer progression [153, 154], and therapy [155]. Our results therefore

indicate that these associations might be related to a known mechanism involved in

prostate cancer.

5.4 Conclusion

We showed that genome wide experimental data sets generated by the ENCODE

consortium can be successfully used to provide putative functional annotations for

the majority of the GWAS associations reported in the literature. The use of these

experimental assays outperforms the use of in-silico binding predictions based on

sequence motifs when trying to identify functional SNPs associated with a phenotype

in a GWAS. We demonstrate that an integrative approach combining genome wide

association studies, gene expression analysis and experimental evidence of regulatory

activity leads to the identification of loci that are involved in common diseases, and

generates hypotheses about the biological mechanism underlying the association. In

the majority of cases, the SNP most likely to play a functional role according to

ENCODE evidence is not the reported association, but a different SNP in strong

linkage disequilibrium with the reported association. Our approach, which builds

directly on the publicly available RegulomeDB database, provides a simple framework

that can be applied to the functional analysis of any genome wide association study.

5.5 Data

5.5.1 GWAS catalog

We use the NHGRI GWAS catalog ([11], downloaded on August 10, 2011) to obtain

a list of GWAS associations. The version of the catalog we use contains 5922 entries.

Each entry associates a SNP with a phenotype in a study. Most studies report

several associated SNPs and each of them has a separate entry. A SNP associated

with two phenotypes in the same study results in two entries. The GWAS

catalog provides additional information for each entry. We use information about

the genotyping technology used (Affymetrix, Illumina or Perlegen), whether the SNP

was directly genotyped or imputed, the statistical support for the association (P-

value), the population(s) in which the study was performed and the presence or not

of a replication cohort. We map each entry in the GWAS catalog to dbSNP 131,

which results in a total of 5694 associations on the autosomes and on chromosome

X. We exclude the single association on chromosome Y from our analysis. The 5694

associations represent a total of 4724 distinct SNPs, 470 phenotypes and 810 studies.

416 SNPs are associated in more than one study.

5.5.2 Linkage disequilibrium

We use HapMap version 2 [7] and version 3 [67] in order to obtain linkage disequilibrium

information between SNPs. HapMap 2 provides a higher SNP density, whereas

HapMap 3 provides information for more populations. The HapMap 2 data we use in-

cludes 2,776,528 SNPs in the CEPH (Utah residents with ancestry from northern and

western Europe, abbreviation: CEU) population, 2,554,939 SNPs in the Han Chinese

in Beijing, China (abbreviation: CHB) population, 3,114,362 SNPs in the Yoruba in

Ibadan, Nigeria (abbreviation: YRI) population, and 2,509,881 SNPs in the Japanese

in Tokyo, Japan (abbreviation: JPT) population. We create an intersection set that

includes all 2,135,736 SNPs that are assessed in all four HapMap 2 populations. We

use HapMap 2 and HapMap 3 data in order to create a list of pairs of SNPs for which

there is some evidence of LD (r² ≥ 0.1) in any HapMap 2 or HapMap 3 population.

5.5.3 Genotyping arrays

We download the list of SNPs that appear on genotyping arrays from the SNP Genotyping

Array track of the UCSC genome browser [30] and use all SNPs that also appear

in dbSNP 132. Arrays include Affymetrix SNP 6.0 (905,283 SNPs), Affymetrix

SNP 5.0 (435,360 SNPs), Affymetrix GeneChip Human Mapping 250K Nsp (257,159

SNPs), Affymetrix GeneChip Human Mapping 250K Sty (233,887 SNPs), Illumina

Human Hap 650v3 (660,388 SNPs), Illumina Human Hap 550v3 (560,972 SNPs), Illumina

Human Hap 300v3 (318,046 SNPs), Illumina Human1M-Duo (1,146,891 SNPs),

Illumina Human CytoSNP-12 (299,358 SNPs), Illumina Human 660W-Quad (593,197

SNPs), and Illumina Human Omni1-Quad (972,372 SNPs). This track does not provide

information for the Perlegen arrays. We compute combined lists of all SNPs that

appear on any Affymetrix array, and of all SNPs that appear on any Illumina array,

remove SNPs that do not appear in all HapMap 2 populations, and remove SNPs on

chromosome Y. We obtain a total of 1,006,273 SNPs for the Illumina arrays, and

920,693 SNPs for the Affymetrix arrays.

5.5.4 SNP properties

We use the function information generated by the UCSC genome browser for each

SNP in dbSNP 132 [156]. The functional role is predicted based on UCSC genes.

Functional classes are: near the 3’ end of a gene (within 500 bases of a transcript),

near the 5’ end of a gene (within 2 kb of a transcript), coding synonymous, coding

non-synonymous (nonsense, missense, frameshift, coding indel or coding unknown), 3’

or 5’ untranslated regions, introns, splice sites or unknown (intergenic regions). These

classifications do not use GENCODE v7 information, or any regulatory information

from ENCODE. We use the UCSC Genes track in order to find the closest transcrip-

tion start site (TSS) for each SNP, and the dbSNP allele frequency information to

determine the minor allele frequency of each SNP. We use a total of 26,561,892 SNPs

from dbSNP 132.

5.5.5 Functional annotations

We use the November 7, 2011 version of RegulomeDB¹ in order to annotate SNPs with

regulatory information. RegulomeDB integrates various types of assays, including

DNaseI-seq peaks and ChIP-seq peaks, both generated by the ENCODE consortium,

DNaseI footprints, conserved motifs, eQTLs curated from several studies in multiple

tissues, and validated functional loci. For each SNP RegulomeDB provides a list

of datasets in which there is evidence of function in a region overlapping the SNP,

as well as a score indicating the confidence that the SNP is functional based on all

the available evidence for the locus. The RegulomeDB ENCODE companion paper

describes the database and scoring metric in more detail [123].

We run RegulomeDB on every SNP in dbSNP 132. There is some evidence of

function for a total of 13,453,666 SNPs (50.7%). In this study, we use a slightly

modified format of the RegulomeDB scores in order to integrate linkage disequilibrium

and eQTL information. A RegulomeDB score of 1a through 1f indicates that a SNP is

an eQTL, with the letter indicating how much other functional evidence

there is for the SNP. Each letter maps to another score (2a through 5), with the

only difference being that the higher scores denote SNPs that are not eQTLs. We

map scores between 1a and 1f back to the corresponding scores between 2a and

5, and handle eQTLs separately. We also create two additional scores, 5a for SNPs

overlapping ChIP-seq peaks only, and 5b for SNPs overlapping DNaseI-seq peaks only,

whereas RegulomeDB uses a score of 5 for both. Table 5.16 provides an overview of

the modified scoring scheme.

¹ http://RegulomeDB.org

Score  Evidence supporting the functional SNP in RegulomeDB
2a     ChIP-seq peak, matched DNaseI footprint, matched motif, DNaseI peak
2b     ChIP-seq peak, DNaseI footprint and peak, and motif
2c     ChIP-seq peak, matched motif and DNaseI peak
3a     ChIP-seq peak, DNaseI peak and motif
3b     ChIP-seq peak and matched motif
4      ChIP-seq and DNaseI peak
5a     ChIP-seq peak only
5b     DNaseI peak only
6      Motif only
7      No annotation

Table 5.16: Modified RegulomeDB scoring scheme.

Lower scores indicate more evidence for the SNP to be in a regulatory region.
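
The rows of Table 5.16 can be read as a cascade from the strongest to the weakest combination of evidence. The following Python sketch illustrates that reading only; it is not the RegulomeDB implementation. The `Evidence` class and its field names are hypothetical, and the handling of combinations that the table does not list (for example a ChIP-seq peak with an unmatched motif but no DNaseI peak) is an assumption: such cases fall through to the next weaker row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical per-SNP evidence flags (names are illustrative only)."""
    chip_peak: bool = False        # overlaps a ChIP-seq peak
    dnase_peak: bool = False       # overlaps a DNaseI-seq peak
    dnase_footprint: bool = False  # overlaps a DNaseI footprint
    motif: bool = False            # overlaps a predicted motif
    matched: bool = False          # ChIP-seq factor matches the motif/footprint


def modified_score(e: Evidence) -> str:
    """Map evidence flags to a modified score, following Table 5.16 row by row."""
    if e.chip_peak and e.dnase_peak and e.motif:
        if e.dnase_footprint:
            return "2a" if e.matched else "2b"
        return "2c" if e.matched else "3a"
    if e.chip_peak and e.motif and e.matched:
        return "3b"
    if e.chip_peak and e.dnase_peak:
        return "4"
    # Assumption: unlisted combinations fall through to the rows below.
    if e.chip_peak:
        return "5a"
    if e.dnase_peak:
        return "5b"
    if e.motif:
        return "6"
    return "7"
```

In this sketch eQTL status is deliberately left out of the score, mirroring the choice above to handle eQTLs separately.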

5.5.6 Transcribed regions

We use GENCODE v7 [157] to identify SNPs that overlap transcribed regions. We

intersect the GENCODE v7 Genes basic set track from the UCSC browser with all

SNPs in dbSNP 132 and determine whether the SNP lies in an exon. If the SNP is

in an exon, we use the coding region start and end information of the browser table

in order to determine whether the SNP is in a coding region or is in a non-coding

part of an exon. We consider introns in the same way as intergenic regions since

regulatory elements can be found in both. This leads to a transcriptional annotation

for each SNP.
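
The exon and coding-region test described above can be sketched as a pair of interval checks. This is a minimal illustration for a single transcript, assuming 0-based half-open coordinates; the function name, the labels it returns and the example coordinates are hypothetical, and a real annotation would have to combine the result over all transcripts overlapping the SNP.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed 0-based, half-open: [start, end)


def classify_snp(pos: int, exons: List[Interval], cds: Interval) -> str:
    """Classify a SNP position against one transcript.

    exons: exon intervals of the transcript
    cds:   (coding start, coding end); an empty interval (start == end)
           is assumed to denote a non-coding transcript
    """
    in_exon = any(start <= pos < end for start, end in exons)
    if not in_exon:
        # Introns and intergenic regions are treated alike.
        return "non-exonic"
    cds_start, cds_end = cds
    if cds_start <= pos < cds_end:
        return "coding"
    return "non-coding exon"


# Hypothetical transcript with two exons and a coding region spanning both.
exons = [(100, 200), (300, 400)]
cds = (150, 350)
print(classify_snp(160, exons, cds))
print(classify_snp(250, exons, cds))
```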

5.6 Methods

5.6.1 Lead SNP annotation

We call the associated SNP reported in a GWAS the lead SNP. For each lead SNP we

retrieve the regulatory annotation from RegulomeDB, and the transcriptional an-

notation from GENCODE v7. We determine the fraction of lead SNPs that are

coding, in non-coding parts of exons, that overlap DNaseI peaks, DNaseI footprints

and ChIP-seq peaks independently of each other. This means that if, for example, a

SNP overlaps both a DNase peak and a ChIP-seq peak, then it will be counted for

both types of assays. We consider that there is an overlap between the SNP and the

type of assay if there is one ENCODE cell line in which there is respectively a DNase

peak, a DNase footprint for at least one motif, or a ChIP-seq peak for at least one

binding protein that overlaps the SNP. In order to determine a score for lead SNPs,

we first assess whether the SNP is in an exon. If the SNP is not in an exon, then we

assign the modified RegulomeDB score to this SNP. We use Fisher’s exact test on

a 2x2 table to compute a P-value for the difference in the fraction of functionally

annotated SNPs between all lead SNPs and lead SNPs that are eQTLs.

5.6.2 Linkage disequilibrium integration

For each lead SNP we compute the set of all SNPs in LD with that lead SNP. We first

use an r² threshold in order to limit the LD set to SNPs in strong LD with the lead

SNP. In order to add a SNP to the LD set, we require that the r² is above the threshold

in all four HapMap 2 populations. We then look separately at associations found in

populations of European descent. For each of these lead SNPs, we obtain a set of

SNPs in LD with the lead SNP when considering the HapMap 2 CEU population only.

We separately analyze the set of all lead SNPs, and the subset of European-descent

lead SNPs.
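
The two LD definitions above differ only in the populations over which the threshold must hold. A minimal sketch, assuming the r² values are available as one dictionary per population keyed by SNP pair (a hypothetical data layout), and assuming that a pair without an r² value in a population is treated as not in LD. The r² values in the example are the ones reported for rs2816316 and rs2984920 in Table 6.3.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]
R2Table = Dict[str, Dict[Pair, float]]  # population -> {(snp_a, snp_b): r2}


def pair_key(a: str, b: str) -> Pair:
    """Order-independent key for a pair of SNP identifiers."""
    return (a, b) if a <= b else (b, a)


def ld_set(lead: str, candidates: Iterable[str], r2: R2Table,
           populations: Iterable[str], threshold: float = 0.8) -> Set[str]:
    """SNPs whose r2 with the lead SNP reaches the threshold in every population."""
    pops = list(populations)
    result = set()
    for snp in candidates:
        key = pair_key(lead, snp)
        # Assumption: a missing pair counts as r2 = 0 (not in LD).
        if all(r2.get(pop, {}).get(key, 0.0) >= threshold for pop in pops):
            result.add(snp)
    return result


key = pair_key("rs2816316", "rs2984920")
r2 = {"CEU": {key: 1.00}, "CHB": {key: 1.00}, "JPT": {key: 1.00}, "YRI": {key: 0.05}}
all_pops = ld_set("rs2816316", ["rs2984920"], r2, ["CEU", "CHB", "JPT", "YRI"])
ceu_only = ld_set("rs2816316", ["rs2984920"], r2, ["CEU"])
```

With these values the pair is retained in the CEU-only analysis but dropped when all four HapMap 2 populations are required.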

In order to compute the fraction of SNPs in LD with a lead SNP that overlap

a type of functional data, we count every lead SNP at most once, namely when

one or more SNPs in the LD set overlaps with the functional data type. In order to

compute a score, we find the best candidate in the LD set corresponding to each lead

SNP. We consider that a coding SNP has more functional evidence than a SNP in a

non-coding part of an exon, and that a SNP in an exon has more functional evidence

than a regulatory SNP. If no SNP in the LD set is transcribed, then we find the SNP

with the best RegulomeDB score. We consider an associated region to be an eQTL

if there is at least one eQTL in the set of SNPs in LD with the lead SNP.
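
The ordering just described (coding, then non-coding exon, then best regulatory score) amounts to a lexicographic ranking. The sketch below is one way to express it; the annotation dictionary layout, the label strings and the SNP identifiers are hypothetical, and ties are resolved by insertion order, which is an arbitrary choice.

```python
from typing import Dict, Optional, Tuple

SCORE_ORDER = ["2a", "2b", "2c", "3a", "3b", "4", "5a", "5b", "6", "7"]
TRANSCRIPT_RANK = {"coding": 0, "non-coding exon": 1}

Annotation = Dict[str, Optional[str]]  # {"transcript": ..., "score": ...}


def candidate_rank(annotation: Annotation) -> Tuple[int, int]:
    """Smaller tuples mean stronger functional evidence."""
    transcript_rank = TRANSCRIPT_RANK.get(annotation["transcript"], 2)
    return (transcript_rank, SCORE_ORDER.index(annotation["score"]))


def best_candidate(ld_annotations: Dict[str, Annotation]) -> str:
    """Pick the best-supported SNP among a lead SNP and the SNPs in LD with it."""
    return min(ld_annotations, key=lambda snp: candidate_rank(ld_annotations[snp]))


# Hypothetical LD set: one regulatory SNP, one exonic SNP, one coding SNP.
example = {
    "snp_a": {"transcript": None, "score": "2a"},
    "snp_b": {"transcript": "non-coding exon", "score": "7"},
    "snp_c": {"transcript": "coding", "score": "6"},
}
print(best_candidate(example))
```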

The use of linkage disequilibrium to identify functional SNPs is based on the assumption

that the linkage disequilibrium structure used in this analysis is the same

as in the population in which the association study was performed. This assumption

is necessary in order to consider that a functional SNP in strong linkage disequilibrium

with a lead SNP is associated with the phenotype. If the two SNPs are not

correlated in the actual study population, then there is no evidence that the func-

tional SNP has an effect on the phenotype. Linkage disequilibrium patterns differ

significantly between populations, and it is therefore challenging to obtain linkage

disequilibrium information that closely matches the population in which the GWAS

was performed. We choose to use a conservative approach in which we consider two

SNPs to be in strong linkage disequilibrium only if they are in strong linkage disequilibrium

(r² ≥ 0.8) in all four HapMap 2 populations. These populations are of

European, African, Japanese and Chinese origin, and thus encompass a large part

of the variation between populations, and in particular represent populations that

diverged early on in recent human evolution. Linkage disequilibrium patterns that

are conserved across all four populations are likely to be conserved in the populations

studied in GWAS as well, and this approach should thus reduce the number of false

positives amongst the functional SNPs we identify. This clearly comes at the cost

of sensitivity: when considering SNPs in strong LD with the lead SNP in all pop-

ulations, 33% of the lead SNPs are annotated but have no SNP in LD with them.

We separately repeat our analysis by considering functional SNPs that are in strong

LD in the CEU HapMap population with a lead SNP associated with a phenotype in

a population of European descent. This increases the number of SNPs assessed for

functional evidence. The fraction of lead SNPs that are mapped to a functional SNP

also increases: 80% of the lead SNPs are found to be in strong LD with a SNP overlapping

a region identified to be functional in at least one ENCODE assay. Recent

analyses of the genetic structure of the European population do however show significant

differences between different sub-populations, and it is therefore important

to keep in mind that further analysis is needed, as the correlation between a lead

SNP and a functional SNP in the HapMap CEU population might be weaker in the

population in which the original association study was performed.

5.6.3 Randomization

We use a conservative approach in order to estimate enrichment. We do so in two

steps, by first filtering lead SNPs and then generating random null sets that closely

match the properties of the lead SNPs. In the filtering step, we ensure that the

lead SNPs that we use when computing enrichment are independent. If two lead

SNPs were in strong linkage disequilibrium, then the set of SNPs in LD would likely

overlap, and a functional region in LD with both SNPs would be double counted.

This is a fairly likely situation given that there are large regions of perfect linkage

disequilibrium, and different chips use different tag SNPs to genotype the same region.

If a phenotype is assessed on two different platforms, then two different SNPs could

be reported as significant even though they are both in the same associated region.

Accurately replicating the fine linkage disequilibrium structure between lead SNPs

when building random null sets would be extremely challenging, and we therefore

decide to consider only SNPs that are not in LD with each other in order to estimate

enrichment. Second, as our method for identifying associated functional SNPs relies

on linkage disequilibrium information from HapMap, we do need to ensure that the

fraction of SNPs that are assessed in HapMap does not differ substantially between

the actual lead SNP set and the random sets. If we had allowed a lead SNP to be

matched to any SNP in dbSNP, then LD information would only be available for a

small subset of the random null sets, which would artificially decrease the number of

functional SNPs identified in those random sets. In order to avoid any subtle difference

that could bias our enrichment estimation, we limit the set of lead SNPs to SNPs that

have been assessed in all four HapMap 2 populations, and similarly limit the set of

random SNPs. While this substantially decreases the set of lead SNPs that are used

to compute enrichment estimates, the fraction of lead SNPs overlapping functional

regions in this smaller set (Tables 5.4 and 5.9) is comparable to using all reported

associations (Tables 5.1 and 5.2). We then compute a random set that matches the

properties of the lead SNPs. A lead SNP is matched to a SNP with similar minor

allele frequency, since minor allele frequency can affect the strength of the statistical

association between a SNP and a phenotype. We then map each lead SNP to another

SNP on the same genotyping array. This corrects for biases that may result from

the choice of SNPs put on the genotyping array. We also map a lead SNP to a SNP

that has the same function (with respect to UCSC genes), such that the fraction of

coding SNPs, for example, is the same in the random sets as amongst lead SNPs.

Finally, we also ensure that each lead SNP is mapped to a random SNP that is at

a similar distance to the nearest transcription start site. This avoids the situation in

which an association that is close to a gene, and thus more likely to overlap with

some functional data, is matched by a SNP in a so-called gene desert in a null set. We

still obtain significant enrichment for functional regions even when taking all these

factors into account, and when comparing the most stringent random set to all lead

SNPs. While using such a conservative approach and still reaching significance is

strong evidence that there is enrichment for functional elements in regions associated

with disease and other phenotypes, one can argue that matching the properties of

lead SNPs this closely actually amounts to overcorrecting. For example, while there

appears to be a bias for associated SNPs to be closer to known genes than random

SNPs on the same genotyping chip, this is actually a biologically interesting property

of disease associations. By requiring the matched null sets to also lie closer to

known genes, we increase the probability that those random SNPs are in regulatory

regions, or in strong LD with a regulatory region. It is likely that the functional

SNPs identified when using random null sets, and in particular those supported by

multiple types of functional data, also affect some phenotype in some way, but that

this association has yet to be discovered in a GWAS.

Filtering

In order to obtain null sets that are similar to the set of associated SNPs, we only

consider SNPs that were assessed in all HapMap 2 populations, for which the minor

allele frequency is known in dbSNP, and that we can map to a genotyping platform.

We use the GWAS catalog to determine whether an associated SNP was found using

an Affymetrix or an Illumina array, and then determine whether the SNP is in the

corresponding list of SNPs we have previously computed. If the SNP is not in the

list, or if the platform is not Affymetrix or Illumina, then the lead SNP is filtered

out. An exception is the case where the SNP was found using imputation, in which

case the SNP is kept in the set as long as it is in HapMap 2. A total of 1160 lead

SNPs are filtered out at this stage. We then search for pairs of lead SNPs that show

some evidence of linkage disequilibrium between them (r² ≥ 0.1 in any HapMap 2

or HapMap 3 population). For each such pair we keep only the lead SNP with the

stronger association (more significant P-value reported in the GWAS catalog), and

repeat this process until no pair of SNPs in linkage disequilibrium with each other

is left in the filtered set. A total of 1200 lead SNPs are filtered out due to LD,

leaving us with a set of 2364 SNPs. We repeat the annotation steps on this new set,

and perform all enrichment comparisons using this set and its subset of European

associations, which is obtained in the same way as for the set of all lead SNPs.
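
One possible way to implement the LD filter is a greedy pass over the lead SNPs in order of increasing P-value, keeping a SNP only if it is not in LD with any SNP already kept. This sketch is an assumption about the procedure, not necessarily the exact one used here: the pairwise process described above may remove additional SNPs depending on the order in which pairs are visited. The function name, input layout and SNP identifiers are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set


def prune_by_ld(p_values: Dict[str, float],
                ld_pairs: Set[FrozenSet[str]]) -> List[str]:
    """Return lead SNPs such that no two retained SNPs form a pair in ld_pairs.

    p_values: lead SNP -> association P-value from the GWAS catalog
    ld_pairs: pairs with some evidence of LD (r2 >= 0.1 in any population)
    """
    kept: List[str] = []
    # Most significant associations are considered first.
    for snp in sorted(p_values, key=p_values.get):
        if all(frozenset((snp, other)) not in ld_pairs for other in kept):
            kept.append(snp)
    return kept


# Hypothetical example: snp_b is in LD with both of its neighbours.
p_values = {"snp_a": 1e-10, "snp_b": 1e-8, "snp_c": 1e-6}
ld_pairs = {frozenset(("snp_a", "snp_b")), frozenset(("snp_b", "snp_c"))}
kept = prune_by_ld(p_values, ld_pairs)
```

In the example, snp_b is removed because of snp_a, after which snp_c no longer conflicts with any retained SNP and is kept.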

Matched null sets

We compute several matched null sets that model the properties of the associated

SNPs increasingly closely. We group SNPs into bins based on their minor allele fre-

quency, with each bin representing a 5% minor allele frequency interval. For all null

sets, a lead SNP is always matched to a random SNP in the same minor allele fre-

quency bin. We also ensure that SNPs in each null set are not in linkage disequilibrium

with each other, using the same criterion as for the lead SNPs. The null-dbSNP set

is obtained by matching each lead SNP to a SNP in dbSNP 132. The null-HapMap2

set is obtained in a similar way, except that the matched SNPs must be amongst

the SNPs assessed in all HapMap 2 populations. The null-Array set is obtained by

matching a lead SNP with a SNP that appears on the same platform (Illumina or

Affymetrix) as the lead SNP, or a SNP that is in HapMap 2 if the lead SNP has been

imputed. If the original study includes both Affymetrix and Illumina platforms, and

the lead SNP is present on both, then the SNP is matched to a random SNP from either

platform. The null-Array-Function set uses the same criteria as the null-Array

set, but in addition also requires that the matched SNP is in the same functional

category (as predicted using UCSC genes, see above) as the lead SNP. Finally, the

null-Array-Function-Distance set is obtained by also requiring that the matched SNP is

at a similar distance to the nearest transcription start site (with respect to UCSC

genes) as the lead SNP if the lead SNP is located in an intergenic region or an

intron. We group SNPs into bins according to the logarithm of their distance to the

nearest transcription start site, and each bin represents a log10(distance) interval of

0.1. All SNPs with a log10(distance) under 3.5 are grouped into one bin, and all SNPs

with a log10(distance) above 13 are grouped into one bin. While we use all these ran-

dom null sets in Figure 5.4B and C, all the remaining enrichments presented in this

work are only computed using the most stringent null-Array-Function-Distance sets.
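
The binning scheme above can be sketched as a matching key shared by a lead SNP and its candidate replacements. This is an illustration under several assumptions: the dictionary fields, the class labels "intron" and "unknown", and the function names are hypothetical; the bin limits (3.5 and 13) are taken from the text as stated; and the sketch omits the array matching and the LD filter between null SNPs, and samples with replacement.

```python
import math
import random
from collections import defaultdict

DISTANCE_MATCHED = {"intron", "unknown"}  # assumed labels: intronic, intergenic


def maf_bin(maf: float, width: float = 0.05) -> int:
    """Index of the 5% minor allele frequency bin (MAF = 0.5 joins the last bin)."""
    return min(math.floor(maf / width + 1e-9), round(0.5 / width) - 1)


def distance_bin(distance_bp: int, width: float = 0.1,
                 low: float = 3.5, high: float = 13.0) -> int:
    """Index of the log10(distance to nearest TSS) bin, pooled at both ends."""
    log_d = math.log10(max(distance_bp, 1))
    if log_d < low:
        return -1
    if log_d > high:
        return math.floor((high - low) / width + 1e-9) + 1
    return math.floor((log_d - low) / width + 1e-9)


def matching_key(snp: dict) -> tuple:
    key = (maf_bin(snp["maf"]), snp["function"])
    if snp["function"] in DISTANCE_MATCHED:
        key += (distance_bin(snp["tss_distance"]),)
    return key


def draw_null_set(lead_snps, pool, rng: random.Random) -> list:
    """Draw one matched SNP per lead SNP; raises IndexError if a bin is empty."""
    by_key = defaultdict(list)
    for snp in pool:
        by_key[matching_key(snp)].append(snp)
    return [rng.choice(by_key[matching_key(lead)]) for lead in lead_snps]
```

Calling `draw_null_set` repeatedly with the same pool yields the collection of null sets on which the annotation steps are then repeated.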

Statistical analysis

We create n = 100 matched null sets and then repeat the annotation steps on each

null set, and obtain an empirical distribution of the fraction of functional SNPs ex-

pected for matched SNPs, and of the score distribution amongst matched SNPs. We

obtain a P-value for the difference between the lead SNPs and the null sets using a

Student’s t distribution with n-1 degrees of freedom and the same mean and standard

deviation as the empirical distribution of the counts overlapping the feature in the

n randomized null sets. This distribution is used to estimate the probability of having

a null set (which is by construction of the same size as the set of lead SNPs) with a

fraction of SNPs overlapping the feature that is as extreme or more extreme than the

fraction observed for the lead SNP set, which results in a two-tailed P-value.
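
Our reading of this procedure is that the observed value is standardized with the mean and standard deviation of the n null sets and compared to a Student’s t distribution with n-1 degrees of freedom. The sketch below follows that reading using only the Python standard library; the function names are hypothetical, and the tail probability is obtained by Simpson integration of the density, which is adequate for illustration but loses precision for very small P-values (a dedicated survival function such as `scipy.stats.t.sf` would be preferable there).

```python
import math
from statistics import mean, stdev


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_two_tailed_p(t: float, df: int, steps: int = 2000) -> float:
    """Two-tailed P-value, 1 - P(|T| <= |t|), via Simpson's rule (steps is even)."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * total * h / 3  # the density is symmetric around zero
    return max(0.0, 1.0 - central)


def enrichment_p_value(observed: float, null_values: list) -> float:
    """P-value of the observed count or fraction against n matched null sets.

    Assumes the null values are not all identical (non-zero standard deviation).
    """
    n = len(null_values)
    t = (observed - mean(null_values)) / stdev(null_values)
    return t_two_tailed_p(t, n - 1)
```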

5.6.4 Analysis at the phenotype level

We group all lead SNPs per phenotype using the GWAS catalog phenotype classi-

fication. We do not further group phenotypes, even though some are similar. We

use only associations identified or replicated in populations of European descent. For

each lead SNP, we count how many times the lead SNP or at least one SNP in strong

LD (r2 ≥ 0.8 in the HapMap 2 CEU population) overlaps with a ChIP-seq peak for a

given DNA binding protein. Each lead SNP is counted at most once for each DNA-

binding protein, and we ensure that no two lead SNPs are in LD with each other.

We then add the totals for all the lead SNPs associated with each phenotype. We use

Fisher’s exact test on a 2x2 table to show that the fraction of lead SNPs associated

with height that are in strong LD with at least one SNP overlapping with a CTCF

ChIP-seq peak is higher than the same fraction for all associated lead SNPs.
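
The counting step can be sketched as follows, with each lead SNP contributing at most one count per DNA-binding protein regardless of how many SNPs in its LD set overlap a peak. The function name, the input layout and the identifiers in the example are hypothetical; the resulting per-phenotype totals are what would be entered into the 2x2 table described above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def count_overlaps(associations: Iterable[Tuple[str, str]],
                   ld_sets: Dict[str, Set[str]],
                   chip_overlaps: Dict[str, Set[str]]) -> Dict[str, Dict[str, int]]:
    """Count, per phenotype and protein, the lead SNPs linked to a ChIP-seq peak.

    associations:  (lead SNP, phenotype) pairs, already filtered for LD
    ld_sets:       lead SNP -> SNPs in strong LD with it
    chip_overlaps: SNP -> proteins with a ChIP-seq peak overlapping the SNP
    """
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for lead, phenotype in associations:
        proteins: Set[str] = set()
        for snp in {lead} | ld_sets.get(lead, set()):
            proteins |= chip_overlaps.get(snp, set())
        for protein in proteins:  # each protein counted once per lead SNP
            counts[phenotype][protein] += 1
    return counts


# Hypothetical identifiers, for illustration only.
associations = [("lead_1", "Height"), ("lead_2", "Height"), ("lead_3", "Other")]
ld_sets = {"lead_1": {"snp_x", "snp_y"}}
chip_overlaps = {"snp_x": {"CTCF"}, "snp_y": {"CTCF"}, "lead_2": {"CTCF", "AR"}}
counts = count_overlaps(associations, ld_sets, chip_overlaps)
```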

Chapter 6

Analysis of Functional SNPs

In this chapter we describe in more detail several functional SNPs identified using

the methods discussed in Chapter 5. In particular, we present examples that show

how ENCODE data can be used in order to generate novel interesting biological

hypotheses.

6.1 Strongly supported functional SNPs

We identified functional SNPs in strong linkage disequilibrium for a large fraction

of all reported associations. A table mapping each association to a list of candidate

functional SNPs is available on the RegulomeDB website¹. Table 6.1 highlights the

lead SNPs supported by the strongest functional evidence. These overlap a ChIP-seq

peak, a DNase peak, a DNase footprint and a predicted motif, and the transcription

factor binding detected using ChIP-seq matches the conserved motif used in DNase

footprinting. Table 6.2 provides a similar list for functional SNPs supported by the

same amount of regulatory evidence, but which are in strong LD with a lead SNP in

all HapMap 2 populations. The lead SNP itself is supported by less or no evidence of

a functional role. Table 6.3 provides a list of functional SNPs supported by the same

amount of regulatory evidence, but which are in strong LD with a lead SNP in the

HapMap 2 CEU population, but not all HapMap 2 populations. Only associations

identified in populations of European descent are used for this table. Functional

SNPs in strong LD with the lead SNP are located as far as 170 kilobase pairs

from the reported association. Each of the functional SNPs we identify is a biological

hypothesis supported by experimental regulatory data, but which still requires further

validation.

¹ http://RegulomeDB.org/GWAS

Lead SNP           Phenotype                                P-value
chr1 rs1967017     Serum urate [158]                        4 · 10^-8
chr5 rs2188962     Crohn’s disease [159]                    1 · 10^-7
                   Crohn’s disease [160]                    2 · 10^-18
chr6 rs9491696     Waist-hip ratio [161]                    2 · 10^-32
chr6 rs9483788     Hematocrit [162]                         3 · 10^-15
                   Other erythrocyte phenotypes [162]       1 · 10^-47
chr11 rs2074238    QT interval [163]                        3 · 10^-17
chr11 rs7940646    Platelet aggregation [164]               1 · 10^-6
chr12 rs902774     Prostate cancer [133]                    5 · 10^-9
chr14 rs1256531    Conduct disorder (symptom count) [165]   4 · 10^-6
chr15 rs17293632   Crohn’s disease [166]                    3 · 10^-19
chr16 rs4788084    Type 1 diabetes [167]                    3 · 10^-13
chr17 rs9303029    Protein quantitative trait loci [168]    4 · 10^-7
chr19 rs10411210   Colorectal cancer [169]                  5 · 10^-9
chr19 rs3764650    Alzheimer’s disease [170]                5 · 10^-17

Table 6.1: Overview of the lead SNPs most strongly supported by functional evidence.

Each of these lead SNPs overlaps a ChIP-seq peak, matched DNase footprint, matched motif and a DNaseI-seq peak (RegulomeDB score of 2a). Rep indicates that the study includes a replication cohort. SNPs in bold are also eQTLs.

6.2 Replication of a previously validated functional

SNP

We show that we can re-identify a previously validated functional SNP. Lead SNP

rs1541160 is associated with Amyotrophic Lateral Sclerosis (ALS) in a GWAS, and

there is no evidence that this SNP overlaps a functional region. However, it is in

perfect LD with rs522444, a functional SNP overlapping DNase hypersensitivity regions

and ChIP-seq peaks in a large number of ENCODE cell lines. The authors of

the original study identified rs522444 due to its position in a putative SP1 binding

site and experimentally validated its functional role [103] in altering the expression

of gene KIFAP3.

Lead SNP          Score  Phenotype                               P-value      Best SNP in LD  Distance to lead SNP (bp)  r² CEU  r² CHB  r² JPT  r² YRI
chr1 rs6686842    6      Height [125]                            2 · 10^-8    rs11209342      61,205                     0.96    0.90    1.00    1.00
chr1 rs380390     6      Age-related macular degeneration [171]  4 · 10^-8    rs381974        8,379                      1.00    0.85    1.00    1.00
chr3 rs6806528    4      Celiac disease [172]                    2 · 10^-7    rs6784841       733                        1.00    1.00    1.00    1.00
chr4 rs1800789    7      Fibrinogen [173]                        2 · 10^-30   rs4333166       1,232                      1.00    1.00    1.00    1.00
chr5 rs3776331    7      Serum uric acid [174]                   8 · 10^-6    rs3893579       4,547                      0.96    1.00    0.95    1.00
chr6 rs7743761    5b     Ankylosing spondylitis [175]            1 · 10^-303  rs6457401       1,248                      0.93    0.95    1.00    1.00
chr6 rs642858     6      Type 2 diabetes [176]                   2 · 10^-6    rs1361248       21,711                     0.94    1.00    1.00    0.92
chr7 rs12700667   4      Endometriosis [177]                     1 · 10^-9    rs1451385       6,918                      0.85    0.92    1.00    1.00
chr9 rs3890182    5b     HDL cholesterol [178]                   5 · 10^-7    rs3847302       940                        1.00    1.00    1.00    0.94
                         HDL cholesterol [179]                   3 · 10^-10
chr9 rs2383207    7      Abdominal aortic aneurysm [180]         2 · 10^-8    rs1333047       8,545                      0.89    0.95    1.00    1.00
chr16 rs7197475   6      Systemic lupus erythematosus [181]      3 · 10^-8    rs7194347       2,777                      1.00    1.00    1.00    0.83
chr16 rs7186852   5a     Systemic lupus erythematosus [181]      3 · 10^-7    rs7194347       9,985                      0.96    1.00    1.00    0.83
chr19 rs12986413  4      Height [182]                            3 · 10^-8    rs1015670       536                        1.00    1.00    1.00    0.96

Table 6.2: Strongly supported functional SNPs in linkage disequilibrium with an associated lead SNP in all populations.

Each best functional SNP in this table overlaps a ChIP-seq peak, matched DNase footprint, matched motif and a DNaseI-seq peak (RegulomeDB score of 2a), and is in strong LD (r² ≥ 0.8) with the lead SNP in all four HapMap 2 populations. SNPs in bold are also eQTLs.

6.3 A new functional SNP for type 2 diabetes

6.3.1 Results

We identify rs7163757 as a novel putative functional SNP associated with type 2

diabetes (Figure 6.1). This SNP is in strong LD with rs7172432, a SNP recently

shown to be associated with type 2 diabetes in the Japanese population and repli-

cated in a European population [206], and associated with insulin response in the

Danish population [207]. This functional SNP is supported by evidence from both

DNaseI hypersensitivity and ChIP-seq assays. DNase footprinting indicates that the

functional SNP overlaps a potential NFAT binding site. Interestingly, the risk allele

CHAPTER 6. ANALYSIS OF FUNCTIONAL SNPS 103

| Chr | Lead SNP | Score | Phenotype | P-value | Best SNP in LD | Distance to lead SNP (bp) | LD (r2) CEU | LD (r2) CHB | LD (r2) JPT | LD (r2) YRI |
|---|---|---|---|---|---|---|---|---|---|---|
| chr1 | rs2816316 | 7 | Celiac disease [172] | 2 · 10^−17 | rs2984920 | 7,982 | 1.00 | 1.00 | 1.00 | 0.05 |
| | | | Celiac disease [183] | 3 · 10^−11 | rs1323296 | 695 | 1.00 | 1.00 | 1.00 | 0.52 |
| chr1 | rs4949526 | 5a | Bipolar disorder and schizophrenia [184] | 4 · 10^−7 | rs4949524 | 9,517 | 0.84 | 0.69 | 0.59 | 0.00 |
| chr4 | rs4234798 | 2b | Insulin-like growth factors [185] | 5 · 10^−10 | rs4234797 | 26 | 1.00 | 0.79 | 1.00 | 1.00 |
| chr6 | rs1361108 | 5b | Menarche (age at onset) [186] | 2 · 10^−8 | rs9388486 | 106,446 | 0.92 | 1.00 | 0.00 | 0.51 |
| chr6 | rs9494145 | 7 | Red blood cell traits [187] | 3 · 10^−15 | rs9483788 | 2,949 | 0.87 | 0.73 | 0.89 | 1.00 |
| chr7 | rs1055144 | 5b | Waist-hip ratio [161] | 1 · 10^−24 | rs1451385 | 23,612 | 0.83 | 0.17 | 0.11 | 0.02 |
| chr8 | rs2019960 | 7 | Hodgkin’s lymphoma [188] | 1 · 10^−13 | rs7826019 | 5,723 | 1.00 | 0.00 | 0.63 | 0.96 |
| chr9 | rs7873102 | 7 | Brain structure [189] | 6 · 10^−7 | rs776010 | 111,304 | 0.96 | 0.79 | 0.75 | 0.00 |
| chr9 | rs1333049 | 7 | Coronary heart disease [190] | 7 · 10^−58 | rs1333047 | 999 | 1.00 | 0.51 | 0.40 | 0.00 |
| | | | Coronary heart disease [191] | 3 · 10^−19 | | | | | | |
| | | | Coronary heart disease [10] | 1 · 10^−13 | | | | | | |
| chr9 | rs4977574 | 2c | Coronary heart disease [192] | 1 · 10^−22 | rs1333047 | 25,930 | 0.89 | 0.47 | 0.36 | 0.00 |
| | | | Coronary heart disease [193] | 2 · 10^−25 | | | | | | |
| | | | Myocardial infarction (early onset) [194] | 3 · 10^−44 | | | | | | |
| chr9 | rs3905000 | 7 | MRI atrophy measures [195] | 9 · 10^−6 | rs3847302 | 8,475 | 1.00 | 1.00 | 1.00 | 0.67 |
| | | | HDL cholesterol [196] | 9 · 10^−13 | | | | | | |
| chr10 | rs1561570 | 4 | Paget’s disease [197] | 4 · 10^−38 | rs10752286 | 4,377 | 0.96 | 1.00 | 1.00 | 0.70 |
| | | | Paget’s disease [198] | 6 · 10^−13 | | | | | | |
| chr10 | rs563507 | 5b | Acute lymphoblastic leukemia (childhood) [199] | 9 · 10^−6 | rs773983 | 38,857 | 1.00 | 0.00 | 0.00 | 0.13 |
| chr11 | rs7127900 | 5b | Prostate cancer [130] | 3 · 10^−33 | rs7123299 | 1,230 | 1.00 | 1.00 | 0.89 | 0.65 |
| chr11 | rs561655 | 5b | Alzheimer’s disease (late onset) [200] | 7 · 10^−11 | rs1237999 | 14,751 | 0.84 | 0.83 | 0.62 | 1.00 |
| chr11 | rs10898392 | 7 | Height [201] | 3 · 10^−6 | rs575050 | 174,791 | 0.81 | 0.95 | 0.42 | 0.73 |
| chr12 | rs2638953 | 7 | Height [202] | 7 · 10^−17 | rs10506037 | 116,731 | 0.84 | 1.00 | 1.00 | 0.00 |
| chr14 | rs7142002 | 7 | Autism [203] | 3 · 10^−6 | rs3993395 | 38,489 | 0.87 | 0.26 | 0.20 | 0.62 |
| chr15 | rs261334 | 5b | HDL cholesterol [178] | 5 · 10^−22 | rs8034802 | 1,952 | 0.83 | 0.27 | 0.48 | 0.57 |
| chr17 | rs12946454 | 5b | Systolic blood pressure [204] | 1 · 10^−8 | rs4792867 | 41,323 | 0.83 | 0.43 | 0.27 | 0.51 |
| | | | | | rs11657325 | 34,573 | 1.00 | 0.93 | 0.81 | 0.51 |
| chr17 | rs6504218 | 7 | Coronary heart disease [193] | 1 · 10^−6 | rs9902260 | 8,427 | 0.92 | 1.00 | 1.00 | 0.56 |
| chr22 | rs738322 | 4 | Cutaneous nevi [205] | 1 · 10^−6 | rs2016755 | 29,402 | 0.89 | 0.71 | 0.44 | 0.71 |

Table 6.3: Strongly supported functional SNPs in linkage disequilibrium with an associated lead SNP in the European population.

Each functional SNP in this table overlaps a ChIP-seq peak, matched DNase footprint, matched motif and a DNaseI-seq peak (RegulomeDB score of 2a), and is in strong LD (r2 ≥ 0.8) with the lead SNP in the HapMap2 CEU population. Associations were identified and replicated in a population of European descent. SNPs in bold are also eQTLs. SNPs in italic are eQTLs in LD with the lead SNP, but do not have a RegulomeDB score of 2a.

at rs7172432 is the common allele in the population (53%), and there is a single
haplotype with frequency above 1% that includes the risk allele between the associated
SNP and the functional SNP, but several haplotypes with high frequency that include the
protective allele.


[Figure 6.1 image, panels A to E. Recoverable content: genome browser views of chr15 (hg19), namely a 50 kb overview (62,370,000 to 62,460,000) showing genes C2CD4A and C2CD4B with the "Digital DNaseI Hypersensitivity Clusters from ENCODE" and "Transcription Factor ChIP-seq from ENCODE" tracks, and a 500-base view (62,391,000 to 62,392,000) showing the "Transcription Factor ChIP-seq from ENCODE" and "Open Chromatin by DNaseI HS from ENCODE/OpenChrom (Duke University)" tracks; the sequence AGTGATTTTTCCATTTTAAGC aligned to the NFAT motif logo (WebLogo 3.2, information content in bits); and the LD plot and haplotypes between functional SNP rs7163757 and lead SNP rs7172432.]

SNPs in the LD plot: rs7163757, rs11635441, rs7164359, rs8037894, rs11858355, rs8037796, rs6494306, rs6494307, rs17271458, rs7167878, rs7167881, rs7172432.

Haplotype frequencies: CCGACGCGCC .508, CGCATAGGAC .192, CCCACGGAAT .100, CCCACGGGAT .075, AGCGTAGGAC .050, CCCACGGAAC .025, CCCACGGGAC .017.

Allele frequencies: C .525, T .475; A .525, G .475.

Figure 6.1: Functional SNP rs7163757

Multiple sources of evidence indicate that SNP rs7163757 is functional. (A.) Overview of the region between genes C2CD4A and C2CD4B. Functional SNP rs7163757 is indicated using a blue vertical line, lead SNP rs7172432 using a green vertical line. Multiple ChIP-seq and DNase-seq peaks can be seen, including one that overlaps rs7163757. (B.) Vicinity of functional SNP rs7163757. ChIP-seq binding is observed for multiple transcription factors in multiple cell lines. Due to space constraints, DNase peaks are represented only for a subset of the peaks overlapping the region. (C.) Sequence around rs7163757 and motif for the NFAT binding site that overlaps the functional SNP. The minor allele is T. (D.) Linkage disequilibrium region between the functional SNP and the lead SNP in the HapMap 2 CEU population. The two SNPs are in perfect LD (r2 = 1.0). (E.) Haplotypes between the functional SNP and the lead SNP. There is a single haplotype with frequency above 1% that carries the identified risk allele (A at rs7172432), whereas there are multiple haplotypes that include the protective allele. Haplotypes with a frequency of less than 1% are not shown.


6.3.2 Discussion

Although the association between lead SNP rs7172432 and type 2 diabetes and insulin
resistance was replicated in multiple populations of both European and Asian origin, the
mechanism underlying the association is unclear. The region is located between genes

C2CD4A and C2CD4B (C2 calcium-dependent domain containing 4A/B), whose bi-

ological role is unknown and which had not been previously implicated in diabetes.

This region contains additional SNPs associated with type 2 diabetes: rs1439955 in a
Chinese population [208], and rs11071657 [209] and rs17271305 [210] in European
populations, which are in linkage disequilibrium with rs7163757 in the respective
populations (r2 of 0.523 in CHB, 0.521 in CEU and 0.283 in CEU, respectively). There is
therefore strong evidence linking this

region to diabetes. DNaseI footprinting identifies a Nuclear factor of activated T-cells

(NFAT) footprint that overlaps rs7163757. NFAT is part of the Calcineurin/NFAT
pathway [138], which has been implicated in the regulation of growth and function of the

insulin-producing pancreatic beta cells, and linked to the expression of genes known

to be associated with type 2 diabetes [139]. Glucose and glucagon-like peptide-1

(GLP-1) together lead to the expression of NFAT, which regulates the transcription

of the insulin gene [211]. The binding site affected by fSNP rs7163757 might thus be

involved in linking glucose levels to the expression of genes in this region. It is
interesting to observe that the haplotype that includes the risk allele identified at
rs7172432 is the major haplotype in the CEU population (50.8%), and that there is
markedly less diversity between the lead SNP and the fSNP for the risk allele (a single
haplotype) than for the protective allele (six different haplotypes, three of which have
a frequency above 5%). Long haplotypes are a sign of positive selection [212], and
positive selection for a type 2 diabetes risk allele would be compatible with the thrifty
gene hypothesis [213], under which alleles that currently cause diabetes were
advantageous in the past history of the human species, as they increased fat storage and
thus survival in times of starvation. In this context, the link between this region and
glucose and GLP-1

through NFAT signaling would provide a putative mechanistic hypothesis for the as-

sociation. Further work is however needed, in particular in determining which genes

have differential expression correlated with SNPs in this region, and how they affect


pathways related to diabetes. This example does highlight the value of DNase foot-

printing, which identified a motif that is very relevant to the associated phenotype,

even though NFAT was not assessed in the current ENCODE ChIP-seq experiments.

ChIP-seq experiments identified additional proteins bound in this region that might

interact with NFAT in order to regulate the expression of genes in this region.

6.4 The 9p21 region in coronary artery disease

The 9p21 region of chromosome 9 contains several SNPs strongly associated with

coronary artery disease in multiple studies across different populations [214]. Risk

alleles in 9p21 lead to a 20-30% per-allele increase in disease risk [215]. The region

containing the SNPs associated with coronary artery disease is adjacent to a region

containing SNPs associated with type 2 diabetes [216]. The 9p21 region is a gene

desert: the closest known gene to 9p21 is located more than one hundred kilobases
away. Furthermore, strong linkage disequilibrium has been observed across the 9p21

region. Figure 6.2 provides an overview of the features of the 9p21 region. This region

is therefore a perfect use case for approaches aimed at detecting variants that play a

biological role in disease. High throughput functional information has also been used

to identify a large number of enhancers in the 9p21 region, and determine that two

SNPs associated with coronary artery disease overlap with an enhancer and disrupt

a STAT1 binding site [114].

In this section we discuss several specific functional SNPs, and in particular pro-

vide evidence indicating that rs1333047, a SNP in perfect linkage disequilibrium with

coronary artery disease associated SNP rs1333049 in the CEU population only, is

likely a functional SNP. This result can explain why the association between rs1333049

and coronary artery disease has not been replicated in populations of African descent.


[Figure 6.2 image: overview of the 9p21 region on chromosome 9, with SNPs rs1333045, rs4977574, rs10757278 and rs1333049 labeled.]

Figure 6.2: Overview of the 9p21 region

SNPs significantly associated with coronary artery disease are indicated. The SNPs shown on Figure 6.3 are highlighted in green. These SNPs are located over 100 kilobases away from the nearest genes CDKN2B and CDKN2A. Exons of a non-coding RNA, CDKN2BAS (also called ANRIL) stretch across the whole region. Strong linkage disequilibrium can be observed across the entire region.


6.4.1 Results

The 9p21 gene desert is a region that contains multiple SNPs that are strongly as-

sociated with coronary artery disease. We consider the functional information avail-

able from ENCODE in order to generate candidate functional SNPs in this region.

The association between rs1333049 and coronary artery disease has been replicated

in multiple studies in populations of European descent [10, 191, 217, 190] as well

as in populations of Japanese and Korean descent [218, 219]. In the HapMap 2

CEU population, this SNP is part of a haplotype block that includes rs10757278 and

rs1333047, both of which are in perfect LD with rs1333049. rs10757278 has also been

itself associated with coronary artery disease in multiple populations of European

descent [214, 220, 217] and in the Chinese Han population [221]. Figure 6.3 provides

an overview of this region. There is no evidence supporting a functional role for

rs1333049. However, both rs10757278 and rs1333047 overlap a DNase hypersensitivity
peak as well as ChIP-seq peaks for STAT1 and STAT3 in HeLa-S3 cells, and are

therefore functional SNPs. Furthermore, rs10757278 lies in a STAT1 binding site, and

rs1333047 lies in a binding site and a DNaseI footprint for Interferon-stimulated gene

factor 3 (ISGF3). The motif is a good match when extending the less specific part

of the motif (positions 8-9) located between the two very specific regions (positions

2-7 and 10-13) by a base pair. This is similar to cases of variable spacer length pre-

viously observed for transcription factor binding motifs [222]. While the functional

role of rs10757278 has been previously reported [114], evidence of the functional role

of rs1333047 is novel. Interestingly, while only 27 base pairs separate the two SNPs,

they are in perfect linkage disequilibrium in the CEU population only. The frequency

of the A allele at rs1333047 in the Yoruba in Ibadan, Nigeria (YRI) HapMap 2 pop-

ulation is only 0.8%, compared to 50.8% in the CEU population. This allele is part

of the protective haplotype found in GWAS performed in populations of European

descent. The A allele is part of the motif for ISGF3 binding, whereas the T allele is

not.
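The population differences in linkage disequilibrium described above can be checked directly from phased haplotype frequencies. The following is a minimal sketch, not the procedure used in this work (which relies on Haploview): it computes D, D' and r2 from the three-SNP haplotype frequencies shown in Figure 6.3. The function name `pairwise_ld` and the dictionary `HAPLOTYPES` are illustrative, and the column order of the haplotype strings (rs10757278, rs1333047, rs1333049) is an assumption inferred from the alleles at each position.

```python
def pairwise_ld(haplotypes, i, j, allele_i, allele_j):
    """Return (D, D', r2) between positions i and j of the haplotype strings.

    haplotypes maps a haplotype string to its population frequency;
    allele_i and allele_j are the alleles used as reference at each site.
    """
    total = sum(haplotypes.values())  # renormalize rounded frequencies
    p_a = sum(f for h, f in haplotypes.items() if h[i] == allele_i) / total
    p_b = sum(f for h, f in haplotypes.items() if h[j] == allele_j) / total
    p_ab = sum(
        f for h, f in haplotypes.items() if h[i] == allele_i and h[j] == allele_j
    ) / total

    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        # LD is undefined when one of the two sites is monomorphic.
        return d, float("nan"), float("nan")
    r2 = d * d / denom
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d, abs(d) / d_max, r2


# Haplotype frequencies transcribed from Figure 6.3 (HapMap 2).
# Assumed position order in each string: rs10757278, rs1333047, rs1333049.
HAPLOTYPES = {
    "CEU": {"AAG": 0.508, "GTC": 0.492},
    "CHB+JPT": {"GTC": 0.489, "AAG": 0.311, "ATG": 0.194, "ATC": 0.006},
    "YRI": {"ATG": 0.800, "ATC": 0.100, "GTC": 0.075, "GTG": 0.017, "AAG": 0.008},
}

for population, haps in HAPLOTYPES.items():
    # Position 1 is rs1333047 (A/T), position 2 is rs1333049 (G/C).
    _, d_prime, r2 = pairwise_ld(haps, 1, 2, "A", "G")
    print(f"{population}: rs1333047 vs rs1333049  D'={d_prime:.2f}  r2={r2:.3f}")
    # Position 0 is rs10757278 (A/G).
    _, d_prime, r2 = pairwise_ld(haps, 0, 2, "A", "G")
    print(f"{population}: rs10757278 vs rs1333049 D'={d_prime:.2f}  r2={r2:.3f}")
```

Because the published frequencies are rounded, the computed values should be expected to match the figure only to within rounding error.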

The most recent meta-analysis in populations of European descent [192] identifies

rs4977574 as the most strongly associated locus in 9p21. We find that this SNP

overlaps DNase hypersensitivity peaks in two ENCODE cell lines (Hah and Lncap),


[Figure 6.3 image. Recoverable content: genome browser view of chr9 (hg19, approximately 22,124,000 to 22,125,500) with the tracks "Transcription Factor Binding Sites by ChIP-seq from ENCODE/Stanford/Yale/USC/Harvard" (track labels "HeLa IFg3 STA1 Sd" and "HeLa STA3 IgR") and "Simple Nucleotide Polymorphisms (dbSNP build 130 - Provisional Mapping to GRCh37)"; WebLogo 3.2 sequence logos (information content in bits) for the STAT1 and ISGF3 motifs aligned to the sequence GTCATTCCGGTAAGCAGCGATGCAGAATCAAGACAGAGTAGTTTCTCCTTCTCTC..G; and the haplotype and linkage disequilibrium data below.]

SNP positions: rs10811656 chr9:22,124,472; rs10757278 chr9:22,124,477; rs1333047 chr9:22,124,504; rs1333049 chr9:22,125,503.

| Population | Haplotypes (frequency) | rs1333049 and rs1333047 | rs1333049 and rs10757278 |
|---|---|---|---|
| CEU | A A G (50.8%), G T C (49.2%) | D’ = 1.0, r2 = 1.0 | D’ = 1.0, r2 = 1.0 |
| CHB+JPT | G T C (48.9%), A A G (31.1%), A T G (19.4%), A T C (0.6%) | D’ = 1.0, r2 = 0.442 | D’ = 1.0, r2 = 0.978 |
| YRI | A T G (80.0%), A T C (10.0%), G T C (7.5%), G T G (1.7%), A A G (0.8%) | D’ = 1.0, r2 = 0.002 | D’ = 0.78, r2 = 0.289 |

Figure 6.3: Evidence supporting the implication of rs1333047 in coronary artery disease.

Functional data (ChIP-seq) generated by the ENCODE consortium shows evidence of STAT1 binding in the 9p21 region associated with coronary artery disease. rs10757278 and rs1333047 are both located in the peak, whereas rs1333049 is a tag SNP that does not overlap any functional region in RegulomeDB. rs10757278 is part of a regulatory motif for STAT1 binding, and rs1333047 is part of a regulatory motif for ISGF3 binding. The star symbol denotes the location at which a gap is inserted into the motif to handle variable linker length. Haplotype frequency and linkage disequilibrium data from the different HapMap2 populations show that all three SNPs are in perfect linkage disequilibrium in the CEU population, but not in the CHB and JPT populations. In the YRI population, the frequency of the A allele at rs1333047 is only 0.8%. Risk alleles for all SNPs are determined using the haplotype associated with coronary artery disease in the CEU population, and represented in red. There is an absence of linkage disequilibrium between rs1333047 and rs1333049 in YRI, and the association between rs1333049 and rs10757278 and coronary artery disease has not been replicated in populations of African descent.


a conserved motif for the Androgen Receptor (AR), and a ChIP-seq peak for AR in a

data set from Wei et al. [124]. The RegulomeDB score for this SNP is 2c. In the YRI

population, the minor allele frequency of this SNP is only 7.5%, and it is in strong

LD with rs10757278 (r2 = 0.803) and in weaker LD with rs1333049 (r2 = 0.382). It

is in strong LD with both in the CEU population (r2 of respectively 0.874 and 0.885).

Previously identified functional candidate rs1333045 [104] obtains a RegulomeDB

score of 3a, as it overlaps a conserved motif (Hand1::Tcfe2a), a ChIP-seq peak for

GATA3 in the T-47D ENCODE cell line, and a DNase peak in the Huvec ENCODE

cell line. This SNP has a minor allele frequency of 45.8% in the YRI population, and

is in weak LD with both rs10757278 (r2 = 0.119) and rs1333049 (r2 = 0.251). In the

CEU population it is in strong LD with both (r2 of respectively 0.808 and 0.815).

6.4.2 Discussion

A new functional SNP in 9p21 could explain the lack of association in populations
of African descent.

We identify rs1333047 as a candidate functional SNP in the 9p21

region. This region is associated with coronary artery disease [223] and several other

diseases, and the risk of coronary artery disease for the 25% of individuals in popula-

tions of European descent that are homozygous for the risk allele is two times higher

than for individuals homozygous for protective alleles [215]. This region is a gene

desert, but the non-coding RNA ANRIL overlaps the SNPs associated with coronary

artery disease [217]. A recent study in mice showed that the deletion of the region

orthologous to 9p21 leads to changes in the expression of the orthologs of the two

human genes closest to 9p21, cyclin-dependent kinase inhibitor genes CDKN2A and
CDKN2B, and has effects on the phenotype [224]. SNPs associated with coronary
artery disease affect the expression of ANRIL, and to a smaller extent CDKN2A and
CDKN2B, in humans [225].

The candidate functional SNP rs1333047 potentially disrupts a binding site for

Interferon-stimulated gene factor 3 (ISGF3). ISGF3 is part of the JAK-STAT (Janus

Activated Kinase - Signal Transducer and Activator of Transcription) cascade. In

Type-I-Interferon signaling, STAT1, STAT2 and IFN-regulatory factor 9 (IRF9) form


the ISGF3 complex [226] that binds to IFN-stimulated response elements (ISRE) in

the nucleus. This contrasts with Type-II-Interferon signaling, in which STAT1-STAT1

homodimers directly bind to IFN-γ-activated sites (GAS). Interferon-γ is the only

Type-II-Interferon. A review of both Interferon signaling pathways can be found in

Platanias 2005. We find two functional SNPs in perfect linkage disequilibrium with

each other and with tag SNP rs1333049 in the HapMap 2 CEU population. The

second functional SNP, rs10757278, has been previously shown to be functional [114],

and is located in a GAS. The experimental evidence supporting the functional role of

rs10757278 does, however, also support rs1333047. Both SNPs are in perfect linkage

disequilibrium in all individuals re-sequenced by Harismendy et al., and would thus

be equally strongly associated with the phenotype. Harismendy et al. compare

lymphoblastoid cell lines (which have a high expression level of STAT1) that are

homozygous for the risk allele to lymphoblastoid cell lines that are homozygous for

the protective allele at rs10757278. Given the perfect LD in this region, they are likely

also homozygous respectively for the risk and protective alleles at rs1333047. They

show that in cell lines homozygous for the protective allele, STAT1 knockdown leads

to a 7-fold up-regulation of the expression of ANRIL, a non-coding transcript located

in the 9p21 region, whereas there is a much smaller effect in cell lines homozygous

for the risk allele. This would, however, also be consistent with an effect caused by

rs1333047 since STAT1 is part of the ISGF3 complex, and a knockdown of STAT1

would therefore also affect ISGF3. Furthermore, they use ChIP to identify that

STAT1 binds at rs10757278 only in cell lines with the protective allele. Binding of

STAT1 in this region does not imply the absence of ISGF3 binding, and given that

STAT1 is part of the ISGF3 complex, it is also possible that ISGF3 binding was
detected rather than binding of the STAT1-STAT1 homodimer. Finally, Harismendy
et al. show that treatment with Interferon-γ leads to a change in expression of
ANRIL in HeLa and HUVEC cell lines. As Interferon-γ is only known to be involved

in the Type-II-Interferon pathway, this cannot be explained by ISGF3 binding to the

ISRE at rs1333047. This change is, however, in the opposite direction to the change

expected from the observation in the STAT1 knockdown experiment, which indicates

that there might be multiple binding sites at play in this region. The evidence is


therefore compatible with a functional role for both rs1333047 and rs10757278.

While both functional SNPs are in perfect linkage disequilibrium in the HapMap 2

CEU population, as well as in the individuals studied by Harismendy et al., this is not

the case in other populations. In particular, the protective allele at rs1333047 is rare

(0.8%) in the HapMap 2 YRI population. Major differences in allele frequency and

linkage disequilibrium structure between populations in the 9p21 region have been

previously identified [227]. Interestingly, multiple GWAS of coronary artery disease

in populations of African descent failed to replicate the association at rs1333049 [228]

and rs10757278 [229, 182]. While these studies did not replicate the most strongly

associated SNPs in European populations, they did identify SNPs that are associated

with coronary artery disease in populations of African descent. Two additional
associated SNPs, rs10757274 and rs2383206, were identified in European [223] and South
Korean [220] populations, but were not replicated in African Americans [223]. rs10757274
has been associated with heart failure in individuals of European descent, but the
association was not significant in African Americans [230]. Therefore there is strong
evidence that associations with coronary artery disease identified in populations of
European origin in 9p21 are not replicated in populations of African origin. If rs10757278 is the

functional SNP that has the largest effect on the phenotype in this region, then the

absence of replication can only be explained by an interaction between the effect at

rs10757278 and some other region in which the populations of African descent differ

from the populations of European descent. If, however, rs1333047 functionally affects

the phenotype, then the lack of replication can be explained by the lack of linkage

disequilibrium between rs1333047 and the genotyped SNPs. Therefore, the lack of

replication of this finding in populations of African descent supports rs1333047 as

a candidate functional SNP in this region. Furthermore, rs1333047 is the strongest
association with coronary artery disease identified in an African American population
using a method that combines association and admixture information [231].

The linkage disequilibrium patterns also differ between the HapMap 2 CEU pop-

ulation and the two Asian populations (CHB and JPT). Linkage disequilibrium does,

however, remain relatively high (r2 of respectively 0.978 and 0.442 between rs1333047


and associated SNPs rs10757278 and rs1333049). A large-scale, gene-centric analysis [232]
showed that the effect size for the association between 9p21 SNP rs1333042
and coronary artery disease was larger in Europeans than in Asians (odds ratios of 1.27
and 1.14, respectively). We analyze the linkage disequilibrium between rs1333047 and

haplotypes identified in previous studies in the Han Chinese population [221], which

includes rs2383206, rs1004638, rs17761446 and rs10757278. Only haplotype AATA

includes the protective A allele at rs1333047, and rs1333047 is in strong linkage dis-

equilibrium (r2 = 0.975) with rs1004638. The AATA haplotype is more frequent in

controls (30.5%) than in cases (27.3%). Similarly, we analyze the haploblock reported

in an association study of the Korean population [220]. Only two SNPs, rs2383206

and rs10757278 are also in HapMap 2. For these SNPs, only haplotype AA includes

the protective A allele at rs1333047, and this haplotype is protective in the Korean

population, with a frequency of 52.1% in controls and 45.1% in cases. Therefore

previous results in populations of Asian ancestry are also compatible with a potential

functional role of rs1333047.
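The direction of effect implied by the case and control haplotype frequencies quoted above can be made explicit with an allelic odds ratio. The following is a minimal sketch under simplifying assumptions: the helper name `haplotype_odds_ratio` is illustrative, the calculation uses only the frequencies quoted in this section, and it ignores sample sizes, so it gives a point estimate without a confidence interval.

```python
def haplotype_odds_ratio(freq_cases, freq_controls):
    """Odds ratio of carrying a haplotype in cases relative to controls.

    Both arguments are haplotype frequencies strictly between 0 and 1.
    A value below 1 indicates a haplotype enriched in controls (protective).
    """
    odds_cases = freq_cases / (1.0 - freq_cases)
    odds_controls = freq_controls / (1.0 - freq_controls)
    return odds_cases / odds_controls


# Frequencies quoted in the text for the haplotypes carrying the A allele
# at rs1333047.
han_chinese_aata = haplotype_odds_ratio(freq_cases=0.273, freq_controls=0.305)
korean_aa = haplotype_odds_ratio(freq_cases=0.451, freq_controls=0.521)

print(f"Han Chinese AATA haplotype: OR = {han_chinese_aata:.2f}")
print(f"Korean AA haplotype:        OR = {korean_aa:.2f}")
```

Both estimates fall below 1, which is the pattern expected if the haplotypes carrying the A allele at rs1333047 are protective.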

The implications of this potential functional association are significant. A re-

cent meta-analysis of 7 independent studies with a total of 9,487 cases and 30,171

controls [190] identifies rs1333049 as the strongest association with coronary artery

disease (P-value 7.12 · 10^−58, odds ratio 1.27). The original study that associates

rs10757278 with coronary artery disease in the Icelandic population and three popu-

lations in the United States [214] shows that the odds ratio for heterozygous carriers

of the risk allele is 1.26, and the odds ratio for the homozygous carriers is 1.64, and

that this association alone might explain up to 21% of the population attributable

risk. This study did not specifically analyze rs1333047. Since all three SNPs are

part of the same haplotype in the CEU population, and are perfectly correlated, the

odds ratios and attributable risks would be similar for rs1333047. The differences

in LD structure however mean that both rs10757278 and rs1333049 are poor proxies

for rs1333047 in other populations. This has important implications in personalized

medicine, as testing those SNPs would lead to incorrect risk predictions if rs1333047 is

indeed the mutation that plays a functional role in the phenotype. Furthermore, a po-

tential functional role for rs1333047 would mean that the Type-I-Interferon signaling


pathway plays a role in the association at 9p21, in addition to the Type-II-Interferon

pathway previously identified by Harismendy et al. Finally, it is important to note

that in YRI the frequency of the protective allele is very low (0.8%), meaning that

most individuals in populations of African descent might be at a higher risk for coro-

nary artery disease if rs1333047 is the mutation that plays a functional role in the

disease. Interestingly, given the low minor allele frequency at rs1333047 in the YRI

population, even genotyping this locus would require a much larger population in

order to reach statistical significance in this population. This example illustrates the

power of combining the results of functional studies such as ENCODE with results
from GWAS in multiple populations, and in particular of considering populations in
which replication of an association was unsuccessful together with linkage
disequilibrium data. Further experimental validation of the binding of ISGF3 to this
region, of the effect of rs1333047 on this binding site, and of the association between
rs1333047 and coronary artery disease across populations is however necessary to
definitively prove that rs1333047 is a functional SNP linked to coronary artery disease. Given

the evidence supporting a functional role for multiple loci in tight linkage disequilib-

rium in 9p21, it appears likely that multiple SNPs in binding sites for transcription

factors that are part of a broad range of pathways together play a role in the biological

process underlying the association of this region with coronary artery disease.
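The order of magnitude of the population attributable risk quoted above can be reproduced with a short calculation. The following is a rough sketch, not the computation performed in [214]. It assumes that the reported odds ratios (1.26 for heterozygous and 1.64 for homozygous carriers) approximate relative risks, that genotypes follow Hardy-Weinberg proportions, and that the risk allele frequency equals the 49.2% risk haplotype frequency observed in the HapMap 2 CEU population (Figure 6.3). The function names are illustrative.

```python
def hardy_weinberg_carriers(risk_allele_freq):
    """Genotype frequencies of risk-allele carriers under Hardy-Weinberg."""
    p = risk_allele_freq
    q = 1.0 - p
    return {"heterozygous": 2.0 * p * q, "homozygous": p * p}


def population_attributable_risk(genotype_freq, relative_risk):
    """Levin's formula extended to several exposure levels.

    PAR = sum_i f_i (RR_i - 1) / (1 + sum_i f_i (RR_i - 1)),
    where non-carriers are the reference group with RR = 1.
    """
    excess = sum(
        genotype_freq[g] * (relative_risk[g] - 1.0) for g in relative_risk
    )
    return excess / (1.0 + excess)


# Assumption: CEU risk haplotype frequency used as the risk allele frequency.
carriers = hardy_weinberg_carriers(0.492)
# Assumption: reported odds ratios are treated as relative risks.
risks = {"heterozygous": 1.26, "homozygous": 1.64}

par = population_attributable_risk(carriers, risks)
print(f"Homozygous carriers: {carriers['homozygous']:.1%}")
print(f"Approximate population attributable risk: {par:.1%}")
```

Under these assumptions the estimate is about 22%, of the same order as the reported value of up to 21%. Repeating the calculation with the allele frequency of another population illustrates why risk predictions based on proxy SNPs do not transfer across populations with different LD structure.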

6.5 Methods

We use Haploview [233] to analyze linkage disequilibrium data and haplotype fre-

quencies in individual regions. We obtain transcription factor binding motifs from

Transfac (STAT1, NFAT) and Jaspar (ISGF3). Motif representations in Figure 6.1

and Figure 6.3 were created using WebLogo 3 [234]. Figures include graphical ele-

ments generated using the UCSC genome browser [30].

Bibliography

[1] E.S. Lander, L.M. Linton, B. Birren, C. Nusbaum, M.C. Zody, J. Baldwin,

K. Devon, K. Dewar, M. Doyle, W. FitzHugh, et al. Initial sequencing and

analysis of the human genome. Nature, 409:860–921, 2001.

[2] J.C. Venter, M.D. Adams, E.W. Myers, P.W. Li, R.J. Mural, G.G. Sutton, H.O.

Smith, M. Yandell, C.A. Evans, R.A. Holt, et al. The sequence of the human

genome. Science, 291:1304–51, 2001.

[3] E.S. Lander. Initial impact of the sequencing of the human genome. Nature,

470:187–97, 2011.

[4] J.G. Taylor, E.H. Choi, C.B. Foster, and S.J. Chanock. Using genetic variation

to study human disease. Trends in Molecular Medicine, 7:507–12, 2001.

[5] R. Sachidanandam, D. Weissman, S.C. Schmidt, J.M. Kakol, L.D. Stein,

G. Marth, S. Sherry, J.C. Mullikin, B.J. Mortimore, D.L. Willey, et al. A map

of human genome sequence variation containing 1.42 million single nucleotide

polymorphisms. Nature, 409:928–33, 2001.

[6] The International HapMap Consortium. A haplotype map of the human

genome. Nature, 437:1299–320, 2005.

[7] The International HapMap Consortium. A second generation human haplotype

map of over 3.1 million SNPs. Nature, 449:851–61, 2007.


[8] D.G. Wang, J.B. Fan, C.J. Siao, A. Berno, P. Young, R. Sapolsky, G. Ghandour,

N. Perkins, E. Winchester, J. Spencer, et al. Large-scale identification, map-

ping, and genotyping of single-nucleotide polymorphisms in the human genome.

Science, 280:1077–82, 1998.

[9] N. Risch and K. Merikangas. The future of genetic studies of complex human

diseases. Science, 273:1516–7, 1996.

[10] The Wellcome Trust Case Control Consortium. Genome-wide association study

of 14,000 cases of seven common diseases and 3,000 shared controls. Nature,

447(7145):661–678, Jun 2007.

[11] L.A. Hindorff, P. Sethupathy, H.A. Junkins, E.M. Ramos, J.P. Mehta, F.S.

Collins, and T.A. Manolio. Potential etiologic and functional implications of

genome-wide association loci for human diseases and traits. Proceedings of the

National Academy of Sciences of the United States of America, 106:9362–7,

2009.

[12] T.A. Manolio. Genomewide association studies and assessment of the risk of

disease. The New England Journal of Medicine, 363:166–76, 2010.

[13] C.E. Jaquish. The Framingham Heart Study, on its way to becoming the gold

standard for Cardiovascular Genetic Epidemiology? BMC Medical Genetics,

8:63, 2007.

[14] J.N. Hirschhorn and M.J. Daly. Genome-wide association studies for common

diseases and complex traits. Nature Reviews. Genetics, 6:95–108, 2005.

[15] D.J. Balding. A tutorial on statistical methods for population association stud-

ies. Nature Reviews. Genetics, 7:781–91, 2006.

[16] C.E. Bonferroni. Il calcolo delle assicurazioni su gruppi di teste. In Studi in

Onore del Professore Salvatore Ortu Carboni, pages 13–60. 1935.


[17] D.E. Reich, M. Cargill, S. Bolk, J. Ireland, P.C. Sabeti, D.J. Richter, T. Lavery,

R. Kouyoumjian, S.F. Farhadian, R. Ward, et al. Linkage disequilibrium in the

human genome. Nature, 411:199–204, 2001.

[18] R.C. Lewontin and K. Kojima. The evolutionary dynamics of complex polymorphisms.
Evolution, 14:458–472, 1960.

[19] P.W. Hedrick. Gametic disequilibrium measures: proceed with caution. Genet-

ics, 117:331–41, 1987.

[20] J. Marchini, B. Howie, S. Myers, G. McVean, and P. Donnelly. A new multipoint

method for genome-wide association studies by imputation of genotypes. Nature

Genetics, 39:906–13, 2007.

[21] P. Donnelly. Progress and challenges in genome-wide association studies in

humans. Nature, 456:728–31, 2008.

[22] J.P. Ioannidis, G. Thomas, and M.J. Daly. Validating, augmenting and refining

genome-wide association signals. Nature Reviews. Genetics, 10:318–29, 2009.

[23] C. Libioulle, E. Louis, S. Hansoul, C. Sandor, F. Farnir, D. Franchimont, S. Vermeire,

O. Dewit, M. de Vos, A. Dixon, et al. Novel Crohn disease locus identified

by genome-wide association maps to a gene desert on 5p13.1 and modulates expression

of PTGER4. PLoS Genetics, 3:e58, 2007.

[24] M. Ghoussaini, H. Song, T. Koessler, A.A. Al Olama, Z. Kote-Jarai, K.E. Driver,

K.A. Pooley, S.J. Ramus, S.K. Kjaer, E. Hogdall, et al. Multiple loci with

different cancer specificities within the 8q24 gene desert. Journal of the National

Cancer Institute, 100:962–6, 2008.

[25] T.A. Manolio, F.S. Collins, N.J. Cox, D.B. Goldstein, L.A. Hindorff, D.J.

Hunter, M.I. McCarthy, E.M. Ramos, L.R. Cardon, A. Chakravarti, et al. Find-

ing the missing heritability of complex diseases. Nature, 461:747–53, 2009.

[26] S.J. Chanock, T. Manolio, M. Boehnke, E. Boerwinkle, D.J. Hunter,

G. Thomas, J.N. Hirschhorn, G. Abecasis, D. Altshuler, J.E. Bailey-Wilson,

et al. Replicating genotype-phenotype associations. Nature, 447:655–60, 2007.

[27] K. Miclaus, M. Chierici, C. Lambert, L. Zhang, S. Vega, H. Hong, S. Yin,

C. Furlanello, R. Wolfinger, and F. Goodsaid. Variability in GWAS analysis: the

impact of genotype calling algorithm inconsistencies. The Pharmacogenomics

Journal, 10:324–35, 2010.

[28] J. Marchini, C. Spencer, Y. Teo, and P. Donnelly. A Bayesian hierarchical

mixture model for genotype calling in a multi-cohort study. (in preparation).

[29] J.M. Korn, F.G. Kuruvilla, S.A. McCarroll, A. Wysoker, J. Nemesh, S. Cawley,

E. Hubbell, J. Veitch, P.J. Collins, K. Darvishi, et al. Integrated genotype

calling and association analysis of SNPs, common copy number polymorphisms

and rare CNVs. Nature Genetics, 40:1253–60, 2008.

[30] W.J. Kent, C.W. Sugnet, T.S. Furey, K.M. Roskin, T.H. Pringle, A.M. Zahler,

and D. Haussler. The human genome browser at UCSC. Genome Research,

12:996–1006, 2002.

[31] W.J. Kent, R. Baertsch, A. Hinrichs, W. Miller, and D. Haussler. Evolution’s

cauldron: duplication, deletion, and rearrangement in the mouse and human

genomes. Proceedings of the National Academy of Sciences of the United States

of America, 100:11484–9, 2003.

[32] D.R. Rhodes, J. Yu, K. Shanker, N. Deshpande, R. Varambally, D. Ghosh,

T. Barrette, A. Pandey, and A.M. Chinnaiyan. Large-scale meta-analysis of

cancer microarray data identifies common transcriptional profiles of neoplas-

tic transformation and progression. Proceedings of the National Academy of

Sciences of the United States of America, 101(25):9309–9314, Jun 2004.

[33] K.I. Goh, M.E. Cusick, D. Valle, B. Childs, M. Vidal, and A.L. Barabasi. The

human disease network. Proceedings of the National Academy of Sciences of the

United States of America, 104(21):8685–8690, May 2007.

[34] E. Zeggini, L.J. Scott, R. Saxena, B.F. Voight, J.L. Marchini, T. Hu, P.I.

de Bakker, G.R. Abecasis, P. Almgren, G. Andersen, et al. Meta-analysis of

genome-wide association data and large-scale replication identifies additional

susceptibility loci for type 2 diabetes. Nature Genetics, 40(5):638–645, May 2008.

[35] R. Chen, A.A. Morgan, J. Dudley, T. Deshpande, L. Li, K. Kodama, A.P.

Chiang, and A.J. Butte. FitSNPs: highly differentially expressed genes are more

likely to have variants associated with disease. Genome Biology, 9(12), Dec

2008.

[36] E.Y. Fung, D.J. Smyth, J.M. Howson, J.D. Cooper, N.M. Walker, H. Stevens,

L.S. Wicker, and J.A. Todd. Analysis of 17 autoimmune disease-associated

variants in type 1 diabetes identifies 6q23/TNFAIP3 as a susceptibility locus.

Genes and Immunity, 10(2):188–191, Mar 2009.

[37] A. Torkamani, E.J. Topol, and N.J. Schork. Pathway analysis of seven common

diseases assessed by genome-wide association. Genomics, 92(5):265–272, Nov

2008.

[38] J.B. Meigs, P. Shrader, L.M. Sullivan, J.B. McAteer, C.S. Fox, J. Dupuis, A.K.

Manning, J.C. Florez, P.W.F. Wilson, R.B. D’Agostino Sr, et al. Genotype

Score in Addition to Common Risk Factors for Prediction of Type 2 Diabetes.

The New England Journal of Medicine, 359(21):2208, 2008.

[39] C.P. Torfs, M.C. King, B. Huey, J. Malmgren, and F.C. Grumet. Genetic in-

terrelationship between insulin-dependent diabetes mellitus, the autoimmune

thyroid diseases, and rheumatoid arthritis. American Journal of Human Ge-

netics, 38(2):170, 1986.

[40] J.P. Lin, J.M. Cash, S.Z. Doyle, S. Peden, K. Kanik, C.I. Amos, S.J. Bale, and

R.L. Wilder. Familial clustering of rheumatoid arthritis with other autoimmune

diseases. Human Genetics, 103(4):475–482, 1998.

[41] S. Nejentsev, J.M.M. Howson, N.M. Walker, J. Szeszko, S.F. Field, H.E.

Stevens, P. Reynolds, M. Hardy, E. King, J. Masters, et al. Localization of

type 1 diabetes susceptibility to the MHC class I genes HLA-B and HLA-A.

Nature, 450(7171):887, 2007.

[42] L. Johannessen, U. Strudsholm, L. Foldager, and P. Munk-Jørgensen. Increased

risk of hypertension in patients with bipolar disorder and patients with anxiety

compared to background population and patients with schizophrenia. Journal

of Affective Disorders, 95(1-3):13–17, 2006.

[43] Y.I. Liu, P.H. Wise, and A.J. Butte. The "etiome": identification and clustering

of human disease etiological factors. BMC Bioinformatics, 10 Suppl 2:S14, 2009.

[44] M. Sirota, M.A. Schaub, S. Batzoglou, W.H. Robinson, and A.J. Butte. Au-

toimmune disease classification by inverse association with SNP alleles. PLoS

Genetics, 5:e1000792, doi:10.1371/journal.pgen.1000792, 2009.

[45] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and

Regression Trees. Wadsworth, Belmont, CA, 1984.

[46] J.R. Quinlan. Simplifying Decision Trees. AI Memo 930, 1986.

[47] A.H. Sinclair, P. Berta, M.S. Palmer, J.R. Hawkins, B.L. Griffiths, M.J. Smith,

J.W. Foster, A.M. Frischauf, R. Lovell-Badge, and P.N. Goodfellow. A gene

from the human sex-determining region encodes a protein with homology to a

conserved DNA-binding motif. Nature, 346:240–4, 1990.

[48] B.T. Lahn and D.C. Page. Four evolutionary strata on the human X chromo-

some. Science, 286:964–7, 1999.

[49] M.C. Simmler, F. Rouyer, G. Vergnaud, M. Nystrom-Lahti, K.Y. Ngo,

A. de la Chapelle, and J. Weissenbach. Pseudoautosomal DNA sequences in the

pairing region of the human sex chromosomes. Nature, 317:692–7, 1985.

[50] D. Freije, C. Helms, M.S. Watson, and H. Donis-Keller. Identification of a sec-

ond pseudoautosomal region near the Xq and Yq telomeres. Science, 258:1784–

7, 1992.

[51] K. Kvaloy, F. Galvagni, and W.R. Brown. The sequence organization of the long

arm pseudoautosomal region of the human sex chromosomes. Human Molecular

Genetics, 3:771–8, 1994.

[52] P.L. Pearson and M. Bobrow. Definitive evidence for the short arm of the Y

chromosome associating with the X chromosome during meiosis in the human

male. Nature, 226:959–61, 1970.

[53] P.E. Polani. Pairing of X and Y chromosomes, non-inactivation of X-linked

genes, and the maleness factor. Human Genetics, 60:207–11, 1982.

[54] P.S. Burgoyne. Genetic homology and crossing over in the X and Y chromosomes

of mammals. Human Genetics, 61:85–90, 1982.

[55] L. Kauppi, M. Barchi, F. Baudat, P.J. Romanienko, S. Keeney, and M. Jasin.

Distinct properties of the XY pseudoautosomal region crucial for male meiosis.

Science, 331:916–20, 2011.

[56] N. Ellis, A. Taylor, B.O. Bengtsson, J. Kidd, J. Rogers, and P. Goodfellow. Pop-

ulation structure of the human pseudoautosomal boundary. Nature, 344:663–5,

1990.

[57] N.A. Ellis, P.J. Goodfellow, B. Pym, M. Smith, M. Palmer, A.M. Frischauf,

and P.N. Goodfellow. The pseudoautosomal boundary in man is defined by an

Alu repeat sequence inserted on the Y chromosome. Nature, 337:81–4, 1989.

[58] N. Ellis, P. Yen, K. Neiswanger, L.J. Shapiro, and P.N. Goodfellow. Evolution

of the pseudoautosomal boundary in Old World monkeys and great apes. Cell,

63:977–86, 1990.

[59] C. Mondello, H.H. Ropers, I.W. Craig, E. Tolley, and P.N. Goodfellow. Physical

mapping of genes and sequences at the end of the human X chromosome short

arm. Annals of Human Genetics, 51:137–43, 1987.

[60] G.A. Rappold. The pseudoautosomal regions of the human sex chromosomes.

Human Genetics, 92:315–24, 1993.

[61] C. Mondello, P.J. Goodfellow, and P.N. Goodfellow. Analysis of methylation of

a human X located gene which escapes X inactivation. Nucleic Acids Research,

16:6813–24, 1988.

[62] J.D. Mann, A. Cahan, A.G. Gelb, N. Fisher, J. Hamper, P. Tippett, R. Sanger,

and R.R. Race. A sex-linked blood group. Lancet, 1:8–10, 1962.

[63] P.J. Goodfellow, C. Pritchard, P. Tippett, and P.N. Goodfellow. Recombination

between the X and Y chromosomes: implications for the relationship between

MIC2, XG and YG. Annals of Human Genetics, 51:161–7, 1987.

[64] N.A. Ellis, T.Z. Ye, S. Patton, J. German, P.N. Goodfellow, and P. Weller.

Cloning of PBDX, an MIC2-related gene that spans the pseudoautosomal

boundary on chromosome Xp. Nature Genetics, 6:394–400, 1994.

[65] N.A. Ellis, P. Tippett, A. Petty, M. Reid, P.A. Weller, T.Z. Ye, J. German,

P.N. Goodfellow, S. Thomas, and G. Banting. PBDX is the XG blood group

gene. Nature Genetics, 8:285–90, 1994.

[66] P.A. Weller, R. Critcher, P.N. Goodfellow, J. German, and N.A. Ellis. The

human Y chromosome homologue of XG: transcription of a naturally truncated

gene. Human Molecular Genetics, 4:859–68, 1995.

[67] The 1000 Genomes Project Consortium. A map of human genome variation

from population-scale sequencing. Nature, 467:1061–73, 2010.

[68] R.A. Fisher. Statistical Methods for Research Workers. Oliver and Boyd, Ed-

inburgh, 1925.

[69] G.H. Hardy. Mendelian proportions in a mixed population. Science, 28:49–50,

1908.

[70] W. Weinberg. Über den Nachweis der Vererbung beim Menschen. Jahreshefte

des Vereins für vaterländische Naturkunde in Württemberg, 64:369–382, 1908.

[71] B.L. Browning and S.R. Browning. A unified approach to genotype imputa-

tion and haplotype-phase inference for large data sets of trios and unrelated

individuals. American Journal of Human Genetics, 84:210–23, 2009.

[72] B.O. Bengtsson and P.N. Goodfellow. The effect of recombination between the

X and Y chromosomes of mammals. Annals of Human Genetics, 51:57–64,

1987.

[73] A.G. Clark. The evolution of the Y chromosome with X-Y recombination.

Genetics, 119:711–20, 1988.

[74] A. Flaquer, G.A. Rappold, T.F. Wienker, and C. Fischer. The human pseu-

doautosomal regions: a review for genetic epidemiologists. European Journal of

Human Genetics, 16:771–9, 2008.

[75] H.J. Cooke, W.R. Brown, and G.A. Rappold. Hypervariable telomeric sequences

from the human sex chromosomes are pseudoautosomal. Nature, 317:687–92,

1985.

[76] F. Rouyer, M.C. Simmler, C. Johnsson, G. Vergnaud, H.J. Cooke, and J. Weis-

senbach. A gradient of sex linkage in the pseudoautosomal region of the human

sex chromosomes. Nature, 319:291–5, 1986.

[77] P.J. Goodfellow, S.M. Darling, N.S. Thomas, and P.N. Goodfellow. A pseu-

doautosomal gene in man. Science, 234:740–3, 1986.

[78] D.C. Page, K. Bieker, L.G. Brown, S. Hinton, M. Leppert, J.M. Lalouel,

M. Lathrop, M. Nystrom-Lahti, A. de la Chapelle, and R. White. Linkage,

physical mapping, and DNA sequence analysis of pseudoautosomal loci on the

human X and Y chromosomes. Genomics, 1:243–56, 1987.

[79] J.F. Hughes, H. Skaletsky, T. Pyntikova, T.A. Graves, S.K. van Daalen, P.J.

Minx, R.S. Fulton, S.D. McGrath, D.P. Locke, C. Friedman, et al. Chimpanzee

and human Y chromosomes are remarkably divergent in structure and gene

content. Nature, 463:536–9, 2010.

[80] J.F. Hughes, H. Skaletsky, L.G. Brown, T. Pyntikova, T. Graves, R.S. Fulton,

S. Dugan, Y. Ding, C.J. Buhay, C. Kremitzki, et al. Strict evolutionary conser-

vation followed rapid gene loss on human and rhesus Y chromosomes. Nature,

483:82–6, 2012.

[81] L.S. Whitfield, R. Lovell-Badge, and P.N. Goodfellow. Rapid sequence evolution

of the mammalian sex-determining gene SRY. Nature, 364:713–5, 1993.

[82] R.M. Cox and R. Calsbeek. Sexually antagonistic selection, sexual dimor-

phism, and the resolution of intralocus sexual conflict. The American Naturalist,

173:176–87, 2009.

[83] S.P. Otto, J.R. Pannell, C.L. Peichel, T.L. Ashman, D. Charlesworth, A.K.

Chippindale, L.F. Delph, R.F. Guerrero, S.V. Scarpino, and B.F. McAllister.

About PAR: the distinct evolutionary dynamics of the pseudoautosomal region.

Trends in Genetics, 27:358–67, 2011.

[84] N.H. Barton. Genetic hitchhiking. Philosophical Transactions of the Royal Society

of London. Series B, Biological Sciences, 355:1553–62, 2000.

[85] F.J. Charchar, M. Svartman, N. El-Mogharbel, M. Ventura, P. Kirby, M.R.

Matarazzo, A. Ciccodicola, M. Rocchi, M. D’Esposito, and J.A. Graves. Com-

plex events in the evolution of the human pseudoautosomal region 2 (PAR2).

Genome Research, 13:281–6, 2003.

[86] S. Sarbajna, M. Denniff, A.J. Jeffreys, R. Neumann, M. Soler Artigas, A. Veselis,

and C.A. May. A major recombination hotspot in the XqYq pseudoautosomal

region gives new insight into processing of human gene conversion events. Hu-

man Molecular Genetics, 21:2029–38, 2012.

[87] Z.H. Rosser, P. Balaresque, and M.A. Jobling. Gene conversion between the X

chromosome and the male-specific region of the Y chromosome at a transloca-

tion hotspot. American Journal of Human Genetics, 85:130–4, 2009.

[88] R.P. Meisel, J.H. Malone, and A.G. Clark. Disentangling the relationship be-

tween sex-biased gene expression and X-linkage. Genome Research, 2012.

[89] L.Y. Liu, M.A. Schaub, M. Sirota, and A.J. Butte. Sex differences in disease

risk from reported genome-wide association study findings. Human Genetics,

131:353–64, 2012.

[90] L.Y. Liu, M.A. Schaub, M. Sirota, and A.J. Butte. Transmission distortion in

Crohn’s disease risk gene ATG16L1 leads to sex difference in disease association.

Inflammatory Bowel Diseases, 18:312–22, 2012.

[91] A.D. Paterson, D. Waggott, A. Schillert, C. Infante-Rivard, S.B. Bull, Y.J. Yoo,

and D. Pinnaduwage. Transmission-ratio distortion in the Framingham Heart

Study. BMC Proceedings, 3 Suppl 7:S51, 2009.

[92] H. Lango Allen, K. Estrada, G. Lettre, S.I. Berndt, M.N. Weedon, F. Ri-

vadeneira, C.J. Willer, A.U. Jackson, S. Vedantam, S. Raychaudhuri, et al.

Hundreds of variants clustered in genomic loci and biological pathways affect

human height. Nature, 467:832–8, 2010.

[93] S. Sanna, B. Li, A. Mulas, C. Sidore, H.M. Kang, A.U. Jackson, M.G. Piras,

G. Usala, G. Maninchedda, A. Sassu, et al. Fine mapping of five loci associated

with low-density lipoprotein cholesterol detects variants that double the explained

heritability. PLoS Genetics, 7:e1002198, doi:10.1371/journal.pgen.1002198, 2011.

[94] P.C. Ng and S. Henikoff. SIFT: Predicting amino acid changes that affect

protein function. Nucleic Acids Research, 31:3812–4, 2003.

[95] I.A. Adzhubei, S. Schmidt, L. Peshkin, V.E. Ramensky, A. Gerasimova, P. Bork,

A.S. Kondrashov, and S.R. Sunyaev. A method and server for predicting dam-

aging missense mutations. Nature Methods, 7:248–249, 2010.

[96] S.F. Saccone, R. Bolze, P. Thomas, J. Quan, G. Mehta, E. Deelman, J.A. Tis-

chfield, and J.P. Rice. SPOT: a web-based tool for using biological databases to

prioritize SNPs after a genome-wide association study. Nucleic Acids Research,

38:W201–9, 2010.

[97] B.E. Stranger, A.C. Nica, M.S. Forrest, A. Dimas, C.P. Bird, C. Beazley, C.E.

Ingle, M. Dunning, P. Flicek, D. Koller, et al. Population genomics of human

gene expression. Nature Genetics, 39:1217–1224, 2007.

[98] E.E. Schadt, C. Molony, E. Chudin, K. Hao, X. Yang, P.Y. Lum, A. Kasarskis,

B. Zhang, S. Wang, C. Suver, et al. Mapping the genetic architecture of gene

expression in human liver. PLoS Biology, 6:e107, 2008.

[99] H. Zhong, J. Beaulaurier, P.Y. Lum, C. Molony, X. Yang, D.J. Macneil, D.T.

Weingarth, B. Zhang, D. Greenawalt, R. Dobrin, et al. Liver and adipose

expression associated SNPs are enriched for association to type 2 diabetes.

PLoS Genetics, 6:e1000932, doi:10.1371/journal.pgen.1000932, 2010.

[100] D.L. Nicolae, E. Gamazon, W. Zhang, S. Duan, M.E. Dolan, and N.J. Cox.

Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery

from GWAS. PLoS Genetics, 6:e1000888, doi:10.1371/journal.pgen.1000888, 2010.

[101] Z. Xu and J.A. Taylor. SNPinfo: integrating GWAS and candidate gene infor-

mation into functional SNP selection for genetic association studies. Nucleic

Acids Research, 37:W600–5, 2009.

[102] G. Macintyre, J. Bailey, I. Haviv, and A. Kowalczyk. is-rSNP: a novel technique

for in silico regulatory SNP detection. Bioinformatics, 26:i524–30, 2010.

[103] J.E. Landers, J. Melki, V. Meininger, J.D. Glass, L.H. van den Berg, M.A. van Es,

P.C. Sapp, P.W. van Vught, D.M. McKenna-Yasek, H.M. Blauw, et al. Reduced

expression of the Kinesin-Associated Protein 3 (KIFAP3) gene increases survival

in sporadic amyotrophic lateral sclerosis. Proceedings of the National Academy

of Sciences of the United States of America, 106:9004–9, 2009.

[104] O. Jarinova, A.F. Stewart, R. Roberts, G. Wells, P. Lau, T. Naing, C. Buerki,

B.W. McLean, R.C. Cook, J.S. Parker, et al. Functional analysis of the chromosome

9p21.3 coronary artery disease risk locus. Arteriosclerosis, Thrombosis,

and Vascular Biology, 29:1671–7, 2009.

[105] G. Robertson, M. Hirst, M. Bainbridge, M. Bilenky, Y. Zhao, T. Zeng, G. Eu-

skirchen, B. Bernier, R. Varhol, A. Delaney, et al. Genome-wide profiles of

STAT1 DNA association using chromatin immunoprecipitation and massively

parallel sequencing. Nature Methods, 4:651–7, 2007.

[106] D.S. Johnson, A. Mortazavi, R.M. Myers, and B. Wold. Genome-wide mapping

of in vivo protein-DNA interactions. Science, 316:1497–502, 2007.

[107] D.S. Gross and W.T. Garrard. Nuclease hypersensitive sites in chromatin.

Annual Review of Biochemistry, 57:159–97, 1988.

[108] G.E. Crawford, I.E. Holt, J. Whittle, B.D. Webb, D. Tai, S. Davis, E.H.

Margulies, Y. Chen, J.A. Bernat, D. Ginsburg, et al. Genome-wide map-

ping of DNase hypersensitive sites using massively parallel signature sequencing

(MPSS). Genome Research, 16:123–31, 2006.

[109] A.P. Boyle, S. Davis, H.P. Shulha, P. Meltzer, E.H. Margulies, Z. Weng, T.S.

Furey, and G.E. Crawford. High-resolution mapping and characterization of

open chromatin across the genome. Cell, 132:311–22, 2008.

[110] M. Kasowski, F. Grubert, C. Heffelfinger, M. Hariharan, A. Asabere, S.M.

Waszak, L. Habegger, J. Rozowsky, M. Shi, A.E. Urban, et al. Variation in

transcription factor binding among humans. Science, 328:232–5, 2010.

[111] H. Lou, M. Yeager, H. Li, J.G. Bosquet, R.B. Hayes, N. Orr, K. Yu, A. Hutchin-

son, K.B. Jacobs, P. Kraft, et al. Fine mapping and functional analysis of a

common variant in MSMB on chromosome 10q11.2 associated with prostate

cancer susceptibility. Proceedings of the National Academy of Sciences of the

United States of America, 106:7933–8, 2009.

[112] L.G. Carvajal-Carmona, J.B. Cazier, A.M. Jones, K. Howarth, P. Broderick,

A. Pittman, S. Dobbins, A. Tenesa, S. Farrington, J. Prendergast, et al. Fine-

mapping of colorectal cancer susceptibility loci at 8q23.3, 16q22.1 and 19q13.11:

refinement of association signals and use of in silico analysis to suggest func-

tional variation and unexpected candidate target genes. Human Molecular Ge-

netics, 20:2879–88, 2011.

[113] D.S. Paul, J.P. Nisbet, T.P. Yang, S. Meacham, A. Rendon, K. Hautaviita,

J. Tallila, J. White, M.R. Tijssen, S. Sivapalaratnam, et al. Maps of open

chromatin guide the functional follow-up of genome-wide association signals:

application to hematological traits. PLoS Genetics, 7:e1002139,

doi:10.1371/journal.pgen.1002139, 2011.

[114] O. Harismendy, D. Notani, X. Song, N.G. Rahim, B. Tanasa, N. Heintzman,

B. Ren, X.D. Fu, E.J. Topol, M.G. Rosenfeld, et al. 9p21 DNA variants associ-

ated with coronary artery disease impair interferon-gamma signalling response.

Nature, 470:264–8, 2011.

[115] J. Ernst, P. Kheradpour, T.S. Mikkelsen, N. Shoresh, L.D. Ward, C.B. Epstein,

X. Zhang, L. Wang, R. Issner, M. Coyne, et al. Mapping and analysis of

chromatin state dynamics in nine human cell types. Nature, 473:43–9, 2011.

[116] The ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA

Elements) Project. Science, 306:636–40, 2004.

[117] The ENCODE Project Consortium. Identification and analysis of functional

elements in 1% of the human genome by the ENCODE pilot project. Nature,

447:799–816, 2007.

[118] The ENCODE Project Consortium. A user’s guide to the encyclopedia of DNA

elements (ENCODE). PLoS Biology, 9:e1001046, doi:10.1371/journal.pbio.1001046,

2011.

[119] The ENCODE Project Consortium. An Integrated Encyclopedia of DNA Ele-

ments in the Human Genome. Nature, 2012 (in press).

[120] J.R. Hesselberth, X. Chen, Z. Zhang, P.J. Sabo, R. Sandstrom, A.P. Reynolds,

R.E. Thurman, S. Neph, M.S. Kuehn, W.S. Noble, et al. Global mapping

of protein-DNA interactions in vivo by digital genomic footprinting. Nature

Methods, 6:283–9, 2009.

[121] A.P. Boyle, L. Song, B.K. Lee, D. London, D. Keefe, E. Birney, V.R. Iyer, G.E.

Crawford, and T.S. Furey. High-resolution genome-wide in vivo footprinting

of diverse transcription factors in human cells. Genome Research, 21:456–64,

2011.

[122] R. Pique-Regi, J.F. Degner, A.A. Pai, D.J. Gaffney, Y. Gilad, and J.K.

Pritchard. Accurate inference of transcription factor binding from DNA se-

quence and chromatin accessibility data. Genome Research, 21:447–55, 2011.

[123] A.P. Boyle, E.L. Hong, M. Hariharan, Y. Cheng, M.A. Schaub, M. Kasowski,

K.J. Karczewski, J. Park, B.C. Hitz, S. Weng, et al. Annotation of Functional

Variation in Personal Genomes Using RegulomeDB. Genome Research, 2012

(in press).

[124] G.H. Wei, G. Badis, M.F. Berger, T. Kivioja, K. Palin, M. Enge, M. Bonke,

A. Jolma, M. Varjosalo, A.R. Gehrke, et al. Genome-wide analysis of ETS-family

DNA-binding in vitro and in vivo. The EMBO Journal, 29:2147–60,

2010.

[125] M.N. Weedon, H. Lango, C.M. Lindgren, C. Wallace, D.M. Evans, M. Mangino,

R.M. Freathy, J.R. Perry, S. Stevens, A.S. Hall, et al. Genome-wide association

analysis identifies 20 loci that influence adult height. Nature Genetics, 40:575–

83, 2008.

[126] D.F. Gudbjartsson, G.B. Walters, G. Thorleifsson, H. Stefansson, B.V. Hall-

dorsson, P. Zusmanovich, P. Sulem, S. Thorlacius, A. Gylfason, S. Steinberg,

et al. Many sequence variants affecting diversity of adult human height. Nature

Genetics, 40:609–15, 2008.

[127] Y. Okada, Y. Kamatani, A. Takahashi, K. Matsuda, N. Hosono, H. Ohmiya,

Y. Daigo, K. Yamamoto, M. Kubo, Y. Nakamura, et al. A genome-wide associ-

ation study in 19 633 Japanese subjects identified LHX3-QSOX2 and IGF1 as

adult height loci. Human Molecular Genetics, 19:2303–12, 2010.

[128] G. Lettre, A.U. Jackson, C. Gieger, F.R. Schumacher, S.I. Berndt, S. Sanna,

S. Eyheramendy, B.F. Voight, J.L. Butler, C. Guiducci, et al. Identification

of ten loci associated with height highlights new biological pathways in human

growth. Nature Genetics, 40:584–91, 2008.

[129] J.J. Kim, H.I. Lee, T. Park, K. Kim, J.E. Lee, N.H. Cho, C. Shin, Y.S. Cho,

J.Y. Lee, B.G. Han, et al. Identification of 15 loci influencing height in a Korean

population. Journal of Human Genetics, 55:27–31, 2010.

[130] R.A. Eeles, Z. Kote-Jarai, A.A. Al Olama, G.G. Giles, M. Guy, G. Severi,

K. Muir, J.L. Hopper, B.E. Henderson, C.A. Haiman, et al. Identification of

seven new prostate cancer susceptibility loci through a genome-wide association

study. Nature Genetics, 41:1116–21, 2009.

[131] R. Takata, S. Akamatsu, M. Kubo, A. Takahashi, N. Hosono, T. Kawaguchi,

T. Tsunoda, J. Inazawa, N. Kamatani, O. Ogawa, et al. Genome-wide asso-

ciation study identifies five new susceptibility loci for prostate cancer in the

Japanese population. Nature Genetics, 42:751–4, 2010.

[132] G. Thomas, K.B. Jacobs, M. Yeager, P. Kraft, S. Wacholder, N. Orr, K. Yu,

N. Chatterjee, R. Welch, A. Hutchinson, et al. Multiple loci identified in a

genome-wide association study of prostate cancer. Nature Genetics, 40:310–5,

2008.

[133] F.R. Schumacher, S.I. Berndt, A. Siddiq, K.B. Jacobs, Z. Wang, S. Lindstrom,

V.L. Stevens, C. Chen, A.M. Mondul, R.C. Travis, et al. Genome-wide associ-

ation study identifies new prostate cancer susceptibility loci. Human Molecular

Genetics, 20:3867–75, 2011.

[134] R.A. Eeles, Z. Kote-Jarai, G.G. Giles, A.A. Olama, M. Guy, S.K. Jugurnauth,

S. Mulholland, D.A. Leongamornlert, S.M. Edwards, J. Morrison, et al. Multi-

ple newly identified loci associated with prostate cancer susceptibility. Nature

Genetics, 40:316–21, 2008.

[135] J. Gudmundsson, P. Sulem, V. Steinthorsdottir, J.T. Bergthorsson, G. Thor-

leifsson, A. Manolescu, T. Rafnar, D. Gudbjartsson, B.A. Agnarsson, A. Baker,

et al. Two variants on chromosome 17 confer prostate cancer risk, and the one

in TCF2 protects against type 2 diabetes. Nature Genetics, 39:977–83, 2007.

[136] P.G. Giresi, J. Kim, R.M. McDaniell, V.R. Iyer, and J.D. Lieb. FAIRE

(Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regu-

latory elements from human chromatin. Genome Research, 17:877–85, 2007.

[137] C.C. Chung, J. Ciampa, M. Yeager, K.B. Jacobs, S.I. Berndt, R.B. Hayes,

J. Gonzalez-Bosquet, P. Kraft, S. Wacholder, N. Orr, et al. Fine mapping of a

region of chromosome 11q13 reveals multiple independent loci associated with

risk of prostate cancer. Human Molecular Genetics, 20:2869–78, 2011.

[138] G.R. Crabtree and E.N. Olson. NFAT signaling: choreographing the social lives

of cells. Cell, 109 Suppl:S67–79, 2002.

[139] J.J. Heit, A.A. Apelqvist, X. Gu, M.M. Winslow, J.R. Neilson, G.R. Crab-

tree, and S.K. Kim. Calcineurin/NFAT signalling regulates pancreatic beta-cell

growth and function. Nature, 443:345–9, 2006.

[140] K. Hashiba. Hereditary QT prolongation syndrome in Japan: genetic analy-

sis and pathological findings of the conducting system. Japanese Circulation

Journal, 42:1133–50, 1978.

[141] E.H. Locati, W. Zareba, A.J. Moss, P.J. Schwartz, G.M. Vincent, M.H.

Lehmann, J.A. Towbin, S.G. Priori, C. Napolitano, J.L. Robinson, et al. Age-

and sex-related differences in clinical manifestations in patients with congenital

long-QT syndrome: findings from the International LQTS Registry. Circulation,

97:2237–44, 1998.

[142] M. Nakagawa, T. Ooie, N. Takahashi, Y. Taniguchi, F. Anan, H. Yonemochi,

and T. Saikawa. Influence of menstrual cycle on QT interval dynamics. Pacing

and Clinical Electrophysiology, 29:607–13, 2006.

[143] A.H. Kadish, P. Greenland, M.C. Limacher, W.H. Frishman, S.A. Daugherty,

and J.B. Schwartz. Estrogen and progestin use and the QT interval in post-

menopausal women. Annals of Noninvasive Electrocardiology, 9:366–74, 2004.

[144] M. Gokce, B. Karahan, R. Yilmaz, C. Orem, C. Erdol, and S. Ozdemir. Long

term effects of hormone replacement therapy on heart rate variability, QT in-

terval, QT dispersion and frequencies of arrhythmia. International Journal of

Cardiology, 99:373–9, 2005.

[145] J.G. Sutcliffe, P.B. Hedlund, E.A. Thomas, F.E. Bloom, and B.S. Hilbush.

Peripheral reduction of beta-amyloid is sufficient to reduce brain beta-amyloid:

implications for Alzheimer’s disease. Journal of Neuroscience Research, 89:808–

14, 2011.

[146] G.N. Filippova, S. Fagerlie, E.M. Klenova, C. Myers, Y. Dehner, G. Goodwin,

P.E. Neiman, S.J. Collins, and V.V. Lobanenkov. An exceptionally conserved

transcriptional repressor, CTCF, employs different combinations of zinc fingers

to bind diverged promoter sequences of avian and mammalian c-myc oncogenes.

Molecular and Cellular Biology, 16:2802–13, 1996.

[147] A.C. Bell and G. Felsenfeld. Methylation of a CTCF-dependent boundary con-

trols imprinted expression of the Igf2 gene. Nature, 405:482–5, 2000.

[148] T.H. Kim, Z.K. Abdullaev, A.D. Smith, K.A. Ching, D.I. Loukinov, R.D. Green,

M.Q. Zhang, V.V. Lobanenkov, and B. Ren. Analysis of the vertebrate insulator

protein CTCF-binding sites in the human genome. Cell, 128:1231–45, 2007.

[149] R. Ohlsson, R. Renkawitz, and V. Lobanenkov. CTCF is a uniquely versatile

transcription regulator linked to epigenetics and disease. Trends in Genetics,

17:520–7, 2001.

[150] Z.Q. Wang, M.R. Fung, D.P. Barlow, and E.F. Wagner. Regulation of embry-

onic growth and lysosomal targeting by the imprinted Igf2/Mpr gene. Nature,

372:464–7, 1994.

[151] A.S. Van Laere, M. Nguyen, M. Braunschweig, C. Nezer, C. Collette, L. Moreau,

A.L. Archibald, C.S. Haley, N. Buys, M. Tally, et al. A regulatory mutation in

IGF2 causes a major QTL effect on muscle growth in the pig. Nature, 425:832–6,

2003.

[152] P.B. Grino, J.E. Griffin, and J.D. Wilson. Testosterone at high concentrations

interacts with the human androgen receptor similarly to dihydrotestosterone.

Endocrinology, 126:1165–72, 1990.

[153] T. Visakorpi, E. Hyytinen, P. Koivisto, M. Tanner, R. Keinanen, C. Palmberg,

A. Palotie, T. Tammela, J. Isola, and O.P. Kallioniemi. In vivo amplification of

the androgen receptor gene and progression of human prostate cancer. Nature

Genetics, 9:401–6, 1995.

[154] N. Craft, Y. Shostak, M. Carey, and C.L. Sawyers. A mechanism for hormone-

independent prostate cancer through modulation of androgen receptor signaling

by the HER-2/neu tyrosine kinase. Nature Medicine, 5:280–5, 1999.

[155] N. Sharifi, J.L. Gulley, and W.L. Dahut. Androgen deprivation therapy for

prostate cancer. JAMA, 294:238–44, 2005.

[156] E.W. Sayers, T. Barrett, D.A. Benson, E. Bolton, S.H. Bryant, K. Canese,

V. Chetvernin, D.M. Church, M. Dicuccio, S. Federhen, et al. Database re-

sources of the National Center for Biotechnology Information. Nucleic Acids

Research, 2011.

[157] T. Derrien, R. Johnson, G. Bussotti, A. Tanzer, S. Djebali, H. Tilgner,

G. Guernec, D. Martin, A. Merkel, D. Gonzalez, et al. The GENCODE v7

catalogue of human long non-coding RNAs: Analysis of their gene structure,

evolution and expression. Genome Research, 2012 (in press).

[158] Q. Yang, A. Kottgen, A. Dehghan, A.V. Smith, N.L. Glazer, M.H. Chen, D.I.

Chasman, T. Aspelund, G. Eiriksdottir, T.B. Harris, et al. Multiple genetic loci

influence serum urate levels and their relationship with gout and cardiovascular

disease risk factors. Circulation: Cardiovascular Genetics, 3:523–30, 2010.

[159] D.P. McGovern, M.R. Jones, K.D. Taylor, K. Marciante, X. Yan, M. Dubinsky,

A. Ippoliti, E. Vasiliauskas, D. Berel, C. Derkowski, et al. Fucosyltransferase 2

(FUT2) non-secretor status is associated with Crohn’s disease. Human Molec-

ular Genetics, 19:3468–76, 2010.

[160] J.C. Barrett, S. Hansoul, D.L. Nicolae, J.H. Cho, R.H. Duerr, J.D. Rioux,

S.R. Brant, M.S. Silverberg, K.D. Taylor, M.M. Barmada, et al. Genome-wide

association defines more than 30 distinct susceptibility loci for Crohn’s disease.

Nature Genetics, 40:955–62, 2008.

[161] I.M. Heid, A.U. Jackson, J.C. Randall, T.W. Winkler, L. Qi, V. Steinthorsdot-

tir, G. Thorleifsson, M.C. Zillikens, E.K. Speliotes, R. Magi, et al. Meta-analysis

identifies 13 new loci associated with waist-hip ratio and reveals sexual dimor-

phism in the genetic basis of fat distribution. Nature Genetics, 42:949–60, 2010.

[162] S.K. Ganesh, N.A. Zakai, F.J. van Rooij, N. Soranzo, A.V. Smith, M.A. Nalls,

M.H. Chen, A. Kottgen, N.L. Glazer, A. Dehghan, et al. Multiple loci influ-

ence erythrocyte phenotypes in the CHARGE Consortium. Nature Genetics,

41:1191–8, 2009.

[163] C. Newton-Cheh, M. Eijgelsheim, K.M. Rice, P.I. de Bakker, X. Yin, K. Estrada,

J.C. Bis, K. Marciante, F. Rivadeneira, P.A. Noseworthy, et al. Common vari-

ants at ten loci influence QT interval duration in the QTGEN Study. Nature

Genetics, 41:399–406, 2009.

[164] A.D. Johnson, L.R. Yanek, M.H. Chen, N. Faraday, M.G. Larson, G. Tofler, S.J.

Lin, A.T. Kraja, M.A. Province, Q. Yang, et al. Genome-wide meta-analyses

identifies seven loci associated with platelet aggregation in response to agonists.

Nature Genetics, 42:608–13, 2010.

[165] D.M. Dick, F. Aliev, R.F. Krueger, A. Edwards, A. Agrawal, M. Lynskey, P. Lin,

M. Schuckit, V. Hesselbrock, J. Nurnberger Jr, et al. Genome-wide association

study of conduct disorder symptomatology. Molecular Psychiatry, 16:800–8,

2011.

[166] A. Franke, D.P. McGovern, J.C. Barrett, K. Wang, G.L. Radford-Smith, T. Ah-

mad, C.W. Lees, T. Balschun, J. Lee, R. Roberts, et al. Genome-wide meta-

analysis increases to 71 the number of confirmed Crohn’s disease susceptibility

loci. Nature Genetics, 42:1118–25, 2010.

[167] J.C. Barrett, D.G. Clayton, P. Concannon, B. Akolkar, J.D. Cooper, H.A. Er-

lich, C. Julier, G. Morahan, J. Nerup, C. Nierras, et al. Genome-wide associa-

tion study and meta-analysis find that over 40 loci affect risk of type 1 diabetes.

Nature Genetics, 41:703–7, 2009.

[168] D. Melzer, J.R. Perry, D. Hernandez, A.M. Corsi, K. Stevens, I. Rafferty, F. Lau-

retani, A. Murray, J.R. Gibbs, G. Paolisso, et al. A genome-wide associa-

tion study identifies protein quantitative trait loci (pQTLs). PLoS Genetics,

4:e1000072, doi:10.1371/journal.pgen.1000072, 2008.

[169] R.S. Houlston, E. Webb, P. Broderick, A.M. Pittman, M.C. Di Bernardo,

S. Lubbe, I. Chandler, J. Vijayakrishnan, K. Sullivan, S. Penegar, et al. Meta-

analysis of genome-wide association data identifies four new susceptibility loci

for colorectal cancer. Nature Genetics, 40:1426–35, 2008.

[170] P. Hollingworth, D. Harold, R. Sims, A. Gerrish, J.C. Lambert, M.M. Car-

rasquillo, R. Abraham, M.L. Hamshere, J.S. Pahwa, V. Moskvina, et al. Com-

mon variants at ABCA7, MS4A6A/MS4A4E, EPHA1, CD33 and CD2AP are

associated with Alzheimer’s disease. Nature Genetics, 43:429–35, 2011.

[171] R.J. Klein, C. Zeiss, E.Y. Chew, J.Y. Tsai, R.S. Sackler, C. Haynes, A.K.

Henning, J.P. SanGiovanni, S.M. Mane, S.T. Mayne, et al. Complement factor

H polymorphism in age-related macular degeneration. Science, 308:385–9, 2005.

[172] P.C. Dubois, G. Trynka, L. Franke, K.A. Hunt, J. Romanos, A. Curtotti, A. Zh-

ernakova, G.A. Heap, R. Adany, A. Aromaa, et al. Multiple common variants

for celiac disease influencing immune gene expression. Nature Genetics, 42:295–

302, 2010.

[173] A. Dehghan, Q. Yang, A. Peters, S. Basu, J.C. Bis, A.R. Rudnicka, M. Kavousi,

M.H. Chen, J. Baumert, G.D. Lowe, et al. Association of novel genetic loci with

circulating fibrinogen levels: a genome-wide association study in 6

population-based cohorts. Circulation. Cardiovascular Genetics, 2:125–33, 2009.

[174] P.F. McArdle, A. Parsa, Y.P. Chang, M.R. Weir, J.R. O’Connell, B.D. Mitchell,

and A.R. Shuldiner. Association of a common nonsynonymous variant in

GLUT9 with serum uric acid levels in Old Order Amish. Arthritis and

Rheumatism, 58:2874–81, 2008.

[175] J.D. Reveille, A.M. Sims, P. Danoy, D.M. Evans, P. Leo, J.J. Pointon, R. Jin,

X. Zhou, L.A. Bradbury, L.H. Appleton, et al. Genome-wide association study

of ankylosing spondylitis identifies non-MHC susceptibility loci. Nature Genet-

ics, 42:123–7, 2010.

[176] X. Sim, R.T. Ong, C. Suo, W.T. Tay, J. Liu, D.P. Ng, M. Boehnke, K.S. Chia,

T.Y. Wong, M. Seielstad, et al. Transferability of type 2 diabetes implicated

loci in multi-ethnic cohorts from Southeast Asia. PLoS Genetics, 7:e1001363,

doi:10.1371/journal.pgen.1001363, 2011.

[177] J.N. Painter, C.A. Anderson, D.R. Nyholt, S. Macgregor, J. Lin, S.H. Lee,

A. Lambert, Z.Z. Zhao, F. Roseman, Q. Guo, et al. Genome-wide associa-

tion study identifies a locus at 7p15.2 associated with endometriosis. Nature

Genetics, 43:51–4, 2011.

[178] D.M. Waterworth, S.L. Ricketts, K. Song, L. Chen, J.H. Zhao, S. Ripatti, Y.S.

Aulchenko, W. Zhang, X. Yuan, N. Lim, et al. Genetic variants influencing

circulating lipid levels and risk of coronary artery disease. Arteriosclerosis,

Thrombosis, and Vascular Biology, 30:2264–76, 2010.

[179] S. Kathiresan, O. Melander, C. Guiducci, A. Surti, N.P. Burtt, M.J. Rieder,

G.M. Cooper, C. Roos, B.F. Voight, A.S. Havulinna, et al. Six new loci asso-

ciated with blood low-density lipoprotein cholesterol, high-density lipoprotein

cholesterol or triglycerides in humans. Nature Genetics, 40:189–97, 2008.

[180] S. Gretarsdottir, A.F. Baas, G. Thorleifsson, H. Holm, M. den Heijer,

J.P. de Vries, S.E. Kranendonk, C.J. Zeebregts, S.M. van Sterkenburg,

R.H. Geelkerken, et al. Genome-wide association study identifies a sequence variant within

the DAB2IP gene conferring susceptibility to abdominal aortic aneurysm. Na-

ture Genetics, 42:692–7, 2010.

[181] J.W. Han, H.F. Zheng, Y. Cui, L.D. Sun, D.Q. Ye, Z. Hu, J.H. Xu, Z.M. Cai,

W. Huang, G.P. Zhao, et al. Genome-wide association study in a Chinese Han

population identifies nine new susceptibility loci for systemic lupus erythemato-

sus. Nature Genetics, 41:1234–7, 2009.

[182] G. Lettre, C.D. Palmer, T. Young, K.G. Ejebe, H. Allayee, E.J. Benjamin,

F. Bennett, D.W. Bowden, A. Chakravarti, A. Dreisbach, et al. Genome-

wide association study of coronary heart disease and its risk factors in 8,090

African Americans: the NHLBI CARe Project. PLoS Genetics, 7:e1001300,

doi:10.1371/journal.pgen.1001300, 2011.

[183] K.A. Hunt, A. Zhernakova, G. Turner, G.A. Heap, L. Franke, M. Bruinenberg,

J. Romanos, L.C. Dinesen, A.W. Ryan, D. Panesar, et al. Newly identified

genetic risk variants for celiac disease related to the immune response. Nature

Genetics, 40:395–402, 2008.

[184] K.S. Wang, X.F. Liu, and N. Aragam. A genome-wide meta-analysis identifies

novel loci associated with schizophrenia and bipolar disorder. Schizophrenia

Research, 124:192–9, 2010.

[185] R.C. Kaplan, A.K. Petersen, M.H. Chen, A. Teumer, N.L. Glazer, A. Doring,

C.S. Lam, N. Friedrich, A. Newman, M. Muller, et al. A genome-wide associa-

tion study identifies novel loci associated with circulating IGF-I and IGFBP-3.

Human Molecular Genetics, 20:1241–51, 2011.

[186] C.E. Elks, J.R. Perry, P. Sulem, D.I. Chasman, N. Franceschini, C. He, K.L.

Lunetta, J.A. Visser, E.M. Byrne, D.L. Cousminer, et al. Thirty new loci for age

at menarche identified by a meta-analysis of genome-wide association studies.

Nature Genetics, 42:1077–85, 2010.

[187] I.J. Kullo, K. Ding, H. Jouni, C.Y. Smith, and C.G. Chute. A genome-wide

association study of red blood cell traits using the electronic medical record.

PLoS One, 5, 2010.

[188] V. Enciso-Mora, P. Broderick, Y. Ma, R.F. Jarrett, H. Hjalgrim, K. Hemminki,

A. van den Berg, B. Olver, A. Lloyd, S.E. Dobbins, et al. A genome-wide

association study of Hodgkin’s lymphoma identifies new susceptibility loci at

2p16.1 (REL), 8q24.21 and 10p14 (GATA3). Nature Genetics, 42:1126–30, 2010.

[189] J.L. Stein, X. Hua, S. Lee, A.J. Ho, A.D. Leow, A.W. Toga, A.J. Saykin,

L. Shen, T. Foroud, N. Pankratz, et al. Voxelwise genome-wide association

study (vGWAS). NeuroImage, 53:1160–74, 2010.

[190] P.S. Wild, T. Zeller, A. Schillert, S. Szymczak, C.R. Sinning, A. Deiseroth,

R.B. Schnabel, E. Lubos, T. Keller, M.S. Eleftheriadis, et al. A genome-wide

association study identifies LIPA as a susceptibility gene for coronary artery

disease. Circulation. Cardiovascular Genetics, 4:403–12, 2011.

[191] N.J. Samani, J. Erdmann, A.S. Hall, C. Hengstenberg, M. Mangino, B. Mayer,

R.J. Dixon, T. Meitinger, P. Braund, H.E. Wichmann, et al. Genomewide

association analysis of coronary artery disease. The New England Journal of

Medicine, 357:443–53, 2007.

[192] H. Schunkert, I.R. Konig, S. Kathiresan, M.P. Reilly, T.L. Assimes, H. Holm,

M. Preuss, A.F. Stewart, M. Barbalic, C. Gieger, et al. Large-scale association

analysis identifies 13 new susceptibility loci for coronary artery disease. Nature

Genetics, 43:333–8, 2011.

[193] Coronary Artery Disease (C4D) Genetics Consortium. A genome-wide associa-

tion study in Europeans and South Asians identifies five new loci for coronary

artery disease. Nature Genetics, 43:339–44, 2011.

[194] S. Kathiresan, B.F. Voight, S. Purcell, K. Musunuru, D. Ardissino, P.M. Man-

nucci, S. Anand, J.C. Engert, N.J. Samani, H. Schunkert, et al. Genome-wide

association of early-onset myocardial infarction with single nucleotide polymor-

phisms and copy number variants. Nature Genetics, 41:334–41, 2009.

[195] S.J. Furney, A. Simmons, G. Breen, I. Pedroso, K. Lunnon, P. Proitsi,

A. Hodges, J. Powell, L.O. Wahlund, I. Kloszewska, et al. Genome-wide associ-

ation with MRI atrophy measures as a quantitative trait locus for Alzheimer’s

disease. Molecular Psychiatry, 16:1130–8, 2011.

[196] Y.S. Aulchenko, S. Ripatti, I. Lindqvist, D. Boomsma, I.M. Heid, P.P. Pram-

staller, B.W. Penninx, A.C. Janssens, J.F. Wilson, T. Spector, et al. Loci in-

fluencing lipid levels and coronary heart disease risk in 16 European population

cohorts. Nature Genetics, 41:47–55, 2009.

[197] O.M. Albagha, S.E. Wani, M.R. Visconti, N. Alonso, K. Goodman, M.L.

Brandi, T. Cundy, P.Y. Chung, R. Dargie, J.P. Devogelaer, et al. Genome-

wide association identifies three new susceptibility loci for Paget’s disease of

bone. Nature Genetics, 43:685–9, 2011.

[198] O.M. Albagha, M.R. Visconti, N. Alonso, A.L. Langston, T. Cundy, R. Dargie,

M.G. Dunlop, W.D. Fraser, M.J. Hooper, G. Isaia, et al. Genome-wide asso-

ciation study identifies variants at CSF1, OPTN and TNFRSF11A as genetic

risk factors for Paget’s disease of bone. Nature Genetics, 42:520–4, 2010.

[199] L.R. Trevino, W. Yang, D. French, S.P. Hunger, W.L. Carroll, M. Devidas,

C. Willman, G. Neale, J. Downing, S.C. Raimondi, et al. Germline genomic

variants associated with childhood acute lymphoblastic leukemia. Nature Ge-

netics, 41:1001–5, 2009.

[200] A.C. Naj, G. Jun, G.W. Beecham, L.S. Wang, B.N. Vardarajan, J. Buros, P.J.

Gallins, J.D. Buxbaum, G.P. Jarvik, P.K. Crane, et al. Common variants at

MS4A4/MS4A6E, CD2AP, CD33 and EPHA1 are associated with late-onset

Alzheimer’s disease. Nature Genetics, 43:436–41, 2011.

[201] K. Estrada, M. Krawczak, S. Schreiber, K. van Duijn, L. Stolk, J.B. van Meurs,

F. Liu, B.W. Penninx, J.H. Smit, N. Vogelzangs, et al. A genome-wide associ-

ation study of northwestern Europeans involves the C-type natriuretic peptide

signaling pathway in the etiology of human height variation. Human Molecular

Genetics, 18:3516–24, 2009.

[202] H. Lango Allen, K. Estrada, G. Lettre, S.I. Berndt, M.N. Weedon,

F. Rivadeneira, C.J. Willer, A.U. Jackson, S. Vedantam, S. Raychaudhuri, et al.

Hundreds of variants clustered in genomic loci and biological pathways affect

human height. Nature, 467:832–8, 2010.

[203] R. Anney, L. Klei, D. Pinto, R. Regan, J. Conroy, T.R. Magalhaes, C. Correia,

B.S. Abrahams, N. Sykes, A.T. Pagnamenta, et al. A genome-wide scan for

common alleles affecting risk for autism. Human Molecular Genetics, 19:4072–

82, 2010.

[204] C. Newton-Cheh, T. Johnson, V. Gateva, M.D. Tobin, M. Bochud, L. Coin, S.S.

Najjar, J.H. Zhao, S.C. Heath, S. Eyheramendy, et al. Genome-wide association

study identifies eight loci associated with blood pressure. Nature Genetics,

41:666–76, 2009.

[205] H. Nan, M. Xu, J. Zhang, M. Zhang, P. Kraft, A.A. Qureshi, C. Chen, Q. Guo,

F.B. Hu, E.B. Rimm, et al. Genome-wide association study identifies nidogen 1

(NID1) as a susceptibility locus to cutaneous nevi and melanoma risk. Human

Molecular Genetics, 20:2673–9, 2011.

[206] T. Yamauchi, K. Hara, S. Maeda, K. Yasuda, A. Takahashi, M. Horikoshi,

M. Nakamura, H. Fujita, N. Grarup, S. Cauchi, et al. A genome-wide association

study in the Japanese population identifies susceptibility loci for type 2 diabetes

at UBE2E2 and C2CD4A-C2CD4B. Nature Genetics, 42:864–8, 2010.

[207] N. Grarup, M. Overvad, T. Sparso, D.R. Witte, C. Pisinger, T. Jorgensen,

T. Yamauchi, K. Hara, S. Maeda, T. Kadowaki, et al. The diabetogenic

VPS13C/C2CD4A/C2CD4B rs7172432 variant impairs glucose-stimulated in-

sulin response in 5,722 non-diabetic Danish individuals. Diabetologia, 54:789–94,

2011.

[208] X.O. Shu, J. Long, Q. Cai, L. Qi, Y.B. Xiang, Y.S. Cho, E.S. Tai, X. Li, X. Lin,

W.H. Chow, et al. Identification of new genetic risk variants for type 2 diabetes.

PLoS Genetics, 6, 2010.

[209] J. Dupuis, C. Langenberg, I. Prokopenko, R. Saxena, N. Soranzo, A.U. Jackson,

E. Wheeler, N.L. Glazer, N. Bouatia-Naji, A.L. Gloyn, et al. New genetic loci

implicated in fasting glucose homeostasis and their impact on type 2 diabetes

risk. Nature Genetics, 42:105–16, 2010.

[210] R. Saxena, M.F. Hivert, C. Langenberg, T. Tanaka, J.S. Pankow, P. Vollen-

weider, V. Lyssenko, N. Bouatia-Naji, J. Dupuis, A.U. Jackson, et al. Genetic

variation in GIPR influences the glucose and insulin responses to an oral glucose

challenge. Nature Genetics, 42:142–8, 2010.

[211] M.C. Lawrence, H.S. Bhatt, and R.A. Easom. NFAT regulates insulin gene

promoter activity in response to synergistic pathways induced by glucose and

glucagon-like peptide-1. Diabetes, 51:691–8, 2002.

[212] P.C. Sabeti, D.E. Reich, J.M. Higgins, H.Z. Levine, D.J. Richter, S.F. Schaffner,

S.B. Gabriel, J.V. Platko, N.J. Patterson, G.J. McDonald, et al. Detecting

recent positive selection in the human genome from haplotype structure. Nature,

419:832–7, 2002.

[213] J.V. Neel. Diabetes mellitus: a “thrifty” genotype rendered detrimental by

“progress”? American Journal of Human Genetics, 14:353–62, 1962.

[214] A. Helgadottir, G. Thorleifsson, A. Manolescu, S. Gretarsdottir, T. Blondal,

A. Jonasdottir, A. Jonasdottir, A. Sigurdsson, A. Baker, A. Palsson, et al. A

common variant on chromosome 9p21 affects the risk of myocardial infarction.

Science, 316:1491–3, 2007.

[215] R. McPherson. Chromosome 9p21 and coronary artery disease. The New Eng-

land Journal of Medicine, 362:1736–7, 2010.

[216] R. Saxena, B.F. Voight, V. Lyssenko, N.P. Burtt, P.I. de Bakker, H. Chen, J.J.

Roix, S. Kathiresan, J.N. Hirschhorn, M.J. Daly, et al. Genome-wide associa-

tion analysis identifies loci for type 2 diabetes and triglyceride levels. Science,

316:1331–6, 2007.

[217] H.M. Broadbent, J.F. Peden, S. Lorkowski, A. Goel, H. Ongen, F. Green,

R. Clarke, R. Collins, M.G. Franzosi, G. Tognoni, et al. Susceptibility to coro-

nary artery disease and diabetes is encoded by distinct, tightly linked SNPs in

the ANRIL locus on chromosome 9p. Human Molecular Genetics, 17:806–14,

2008.

[218] Y. Hiura, Y. Fukushima, M. Yuno, H. Sawamura, Y. Kokubo, T. Okamura,

H. Tomoike, Y. Goto, H. Nonogi, R. Takahashi, et al. Validation of the as-

sociation of genetic variants on chromosome 9p21 and 1q41 with myocardial

infarction in a Japanese population. Circulation Journal, 72:1213–7, 2008.

[219] K. Hinohara, T. Nakajima, M. Takahashi, S. Hohda, T. Sasaoka, K. Nakahara,

K. Chida, M. Sawabe, T. Arimura, A. Sato, et al. Replication of the associ-

ation between a chromosome 9p21 polymorphism and coronary artery disease

in Japanese and Korean populations. Journal of Human Genetics, 53:357–9,

2008.

[220] G.Q. Shen, S. Rao, N. Martinelli, L. Li, O. Olivieri, R. Corrocher, K.G. Abdul-

lah, S.L. Hazen, J. Smith, J. Barnard, et al. Association between four SNPs on

chromosome 9p21 and myocardial infarction is replicated in an Italian popula-

tion. Journal of Human Genetics, 53:144–50, 2008.

[221] H. Ding, Y. Xu, X. Wang, Q. Wang, L. Zhang, Y. Tu, J. Yan, W. Wang, R. Hui,

C.Y. Wang, et al. 9p21 is a shared susceptibility locus strongly for coronary

artery disease and weakly for ischemic stroke in Chinese Han population.

Circulation. Cardiovascular Genetics, 2:338–46, 2009.

[222] G. Badis, M.F. Berger, A.A. Philippakis, S. Talukder, A.R. Gehrke, S.A. Jaeger,

E.T. Chan, G. Metzler, A. Vedenko, X. Chen, et al. Diversity and complexity

in DNA recognition by transcription factors. Science, 324:1720–3, 2009.

[223] R. McPherson, A. Pertsemlidis, N. Kavaslar, A. Stewart, R. Roberts, D.R.

Cox, D.A. Hinds, L.A. Pennacchio, A. Tybjaerg-Hansen, A.R. Folsom, et al. A

common allele on chromosome 9 associated with coronary heart disease. Science,

316:1488–91, 2007.

[224] A. Visel, Y. Zhu, D. May, V. Afzal, E. Gong, C. Attanasio, M.J. Blow, J.C.

Cohen, E.M. Rubin, and L.A. Pennacchio. Targeted deletion of the 9p21 non-

coding coronary artery disease risk interval in mice. Nature, 464:409–12, 2010.

[225] M.S. Cunnington, M. Santibanez Koref, B.M. Mayosi, J. Burn, and B. Keavney.

Chromosome 9p21 SNPs associated with multiple disease phenotypes

correlate with ANRIL expression. PLoS Genetics, 6:e1000899,

doi:10.1371/journal.pgen.1000899, 2010.

[226] X.Y. Fu, D.S. Kessler, S.A. Veals, D.E. Levy, and J.E. Darnell Jr. ISGF3,

the transcriptional activator induced by interferon alpha, consists of multiple

interacting polypeptide chains. Proceedings of the National Academy of Sciences

of the United States of America, 87:8555–9, 1990.

[227] K. Silander, H. Tang, S. Myles, E. Jakkula, N.J. Timpson, L. Cavalli-Sforza,

and L. Peltonen. Worldwide patterns of haplotype diversity at 9p21.3, a locus

associated with type 2 diabetes and coronary heart disease. Genome Medicine,

1:51, 2009.

[228] T.L. Assimes, J.W. Knowles, A. Basu, C. Iribarren, A. Southwick, H. Tang,

D. Absher, J. Li, J.M. Fair, G.D. Rubin, et al. Susceptibility locus for clinical

and subclinical coronary artery disease at chromosome 9p21 in the multi-ethnic

ADVANCE study. Human Molecular Genetics, 17:2320–8, 2008.

[229] B.G. Kral, R.A. Mathias, B. Suktitipat, I. Ruczinski, D. Vaidya, L.R. Yanek,

A.A. Quyyumi, R.S. Patel, A.M. Zafari, V. Vaccarino, et al. A common variant

in the CDKN2B gene on chromosome 9p21 protects against coronary artery

disease in Americans of African ancestry. Journal of Human Genetics, 56:224–

9, 2011.

[230] K. Yamagishi, A.R. Folsom, W.D. Rosamond, and E. Boerwinkle. A genetic

variant on chromosome 9p21 and incident heart failure in the ARIC study.

European Heart Journal, 30:1222–8, 2009.

[231] B. Pasaniuc, N. Zaitlen, G. Lettre, G.K. Chen, A. Tandon, W.H. Kao,

I. Ruczinski, M. Fornage, D.S. Siscovick, X. Zhu, et al. Enhanced statistical

tests for GWAS in admixed populations: assessment using African Americans

from CARe and a Breast Cancer Consortium. PLoS Genetics, 7:e1001371,

doi:10.1371/journal.pgen.1001371, 2011.

[232] The IBC 50K CAD Consortium. Large-scale gene-centric analysis

identifies novel variants for coronary artery disease. PLoS Genetics, 7:e1002260,

doi:10.1371/journal.pgen.1002260, 2011.

[233] J.C. Barrett, B. Fry, J. Maller, and M.J. Daly. Haploview: analysis and visual-

ization of LD and haplotype maps. Bioinformatics, 21:263–5, 2004.

[234] G.E. Crooks, G. Hon, J.M. Chandonia, and S.E. Brenner. WebLogo: a sequence

logo generator. Genome Research, 14:1188–90, 2004.