The role of protozoa-driven selection in shaping human genetic variability

Online Supplementary Material

The Role of Protozoa-Driven Selection in Shaping Human Genetic Variability

Uberto Pozzoli 1, Matteo Fumagalli 1,2, Rachele Cagliani 1, Giacomo P. Comi 3, Nereo Bresolin 1,3, Mario Clerici 4,5, Manuela Sironi 1

1 Scientific Institute IRCCS E. Medea, Bioinformatic Lab, Via don L. Monza 20, 23842 Bosisio Parini (LC), Italy.
2 Bioengineering Department, Politecnico di Milano, P.zza L. da Vinci, 32, 20133 Milan, Italy.
3 Dino Ferrari Centre, Department of Neurological Sciences, University of Milan, IRCCS Ospedale Maggiore Policlinico, Mangiagalli and Regina Elena Foundation, Via F. Sforza 35, 20100 Milan, Italy.
4 Department of Biomedical Sciences and Technologies LITA Segrate, University of Milan, Via F.lli Cervi 93, 20090 Milan, Italy.
5 Don C. Gnocchi ONLUS Foundation IRCCS, Via Capecelatro 66, 20148 Milan, Italy.

Corresponding author: Sironi, M. ([email protected])


Methods

Environmental variables

Protozoa absence/presence matrices for the 21 countries where HGDP-CEPH populations are located were derived from the Gideon database (http://www.gideononline.com) as previously described [1-3]. The diversity of viruses, bacteria and helminths was obtained following the same procedure as for protozoa. Malaria prevalence was obtained from either the Gideon or WHO (http://www.who.int) databases in terms of cases/year per 100,000 inhabitants. In general, the number of autochthonous cases per country obtained from Gideon was averaged over all available surveyed years (which often differ among countries). We considered historical notes when available; for example, information in Gideon indicated that, owing to eradication campaigns, Italy was declared malaria-free in 1970; therefore only surveys dating before that year were included. As for WHO data (WHO Global Health Atlas), the total number of reported malaria cases was averaged over the period from 1989 to 2003 (information is not available for every single year in all countries). Total population estimates per country (in 2005) were derived from the WHO. All climatic variables were retrieved from the NCEP/NCAR database (http://www.ngdc.noaa.gov/ecosys/cdroms/ged_iia/datasets/a04/, Legates and Willmott Average, re-gridded dataset or CDC Derived NCEP Reanalysis Products Surface Level) using the geographic coordinates reported by HGDP-CEPH (http://www.cephb.fr/en/hgdp/table.php) for each population. Since malaria prevalence and protozoa diversity, due to data organization in available epidemiological databases, can only be calculated per country (rather than per population), the same procedure was applied to climatic variables: their values were averaged for populations located in the same country. This ensures that a similar number of ties is maintained in all correlation analyses.
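The two aggregation steps described above (prevalence expressed as cases/year per 100,000 inhabitants, and climatic variables averaged per country) can be illustrated with a minimal Python sketch. The function names, the input layout and the example figures in the usage note are hypothetical and are not taken from the original analysis.

```python
from collections import defaultdict
from statistics import mean


def prevalence_per_100k(yearly_cases, population):
    """Average the reported cases over the surveyed years and scale the
    result to cases/year per 100,000 inhabitants."""
    return mean(yearly_cases) * 100_000 / population


def average_by_country(pop_values, pop_country):
    """Average a climatic variable over all populations in the same country.

    pop_values:  {population name: value of the climatic variable}
    pop_country: {population name: country}  (hypothetical layout)
    """
    grouped = defaultdict(list)
    for pop, value in pop_values.items():
        grouped[pop_country[pop]].append(value)
    return {country: mean(values) for country, values in grouped.items()}
```

For instance, `prevalence_per_100k([100, 300], 1_000_000)` averages the two surveyed years (200 cases) before scaling to the population size.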

Data retrieval and statistical analysis

Data concerning the HGDP-CEPH panel (Table S1) derive from previous work [4]. Atypical or duplicated samples and pairs of close relatives were removed [5]. Following previous indications [2,3], Bantu individuals (South Africa) were considered as one population.

A SNP was ascribed to a specific gene if it was located within the transcribed region or no farther than 500 bp upstream of the transcription start site. In analogy to previous studies [2,3], MAF for any single SNP was calculated as the average over all populations.
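As a rough illustration of the SNP-to-gene assignment rule and of the averaged MAF, a Python sketch might look as follows. The coordinate convention (`tx_start < tx_end`, strand given as '+' or '-') and the choice to average one allele's frequency across populations before folding it to the minor allele are assumptions made for this example; the text does not spell out these details.

```python
from statistics import mean

UPSTREAM_BP = 500  # upstream window stated in the text


def snp_in_gene(pos, tx_start, tx_end, strand):
    """Return True if the SNP lies in the transcribed region or within
    500 bp upstream of the transcription start site (TSS).

    Assumes tx_start < tx_end in genomic coordinates; on the '-' strand
    the TSS is at tx_end, so "upstream" means larger coordinates.
    """
    if strand == "+":
        return tx_start - UPSTREAM_BP <= pos <= tx_end
    return tx_start <= pos <= tx_end + UPSTREAM_BP


def average_maf(allele_freqs):
    """Average one allele's frequency over populations, then fold the
    result so that it refers to the minor allele (assumed reading)."""
    f = mean(allele_freqs)
    return min(f, 1.0 - f)
```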

The malaria resistance gene list (Table S2) includes loci with strong evidence of influencing susceptibility to the disease (i.e. genes identified in single association studies were not included); these were retrieved from a previous review [6] and by inspection of the Online Mendelian Inheritance in Man web site (http://www.ncbi.nlm.nih.gov/omim, OMIM #611162), with the inclusion of PKLR [7]. HA genes were obtained by manually inspecting OMIM entries.

All correlations were calculated by Kendall's rank correlation coefficient (τ), a non-parametric statistic used to measure the degree of correspondence between two rankings. The reason for using this test is that even in the presence of ties, the sampling distribution of τ satisfactorily converges to a normal distribution for values of n larger than 10 [8]. Partial Mantel tests were performed using the “Vegan” R package [9]. Matrices were computed as pairwise Euclidean distances in allele frequency, in geographic distance and in pathogen (protozoa, bacteria, viruses or helminths) diversity or malaria prevalence (from either the WHO or Gideon). Geographic distances were derived from a previous work [10] and refer to a model of human migration from East Africa (the postulated origin of modern humans); geographic distances are computed as the shortest distances along landmasses, avoiding mountain regions with altitude over 2000 m.
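For readers unfamiliar with how ties enter Kendall's statistic, the following pure-Python sketch computes the tie-corrected coefficient (τ-b) by enumerating all pairs of observations. It is an illustration only: the function name is hypothetical, the text does not state which variant of τ was used, and no p value is computed here.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equally long sequences of numbers.

    Pairs tied in both variables are ignored; pairs tied in only one
    variable enter the denominator, which shrinks tau when ties occur.
    """
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    if denom == 0:
        return float("nan")  # one variable is constant
    return (concordant - discordant) / denom
```

With one variable holding allele frequencies and the other a per-country pathogen diversity, populations from the same country share the environmental value, which is why ties arise in these analyses.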

In order to estimate the probability of obtaining n genes carrying at least one significantly associated SNP out of a group of m genes, we applied a re-sampling approach: samples of m genes were randomly extracted from a list of all genes covered by at least one SNP in the HGDP-CEPH panel (number of genes = 15,280) and for each sample the number of genes with at least one significant SNP was counted. The empirical probability of obtaining n genes was then calculated from the distribution of counts deriving from 10,000 random samples.
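The re-sampling procedure can be sketched in a few lines of Python. The function name, the seeded generator and the use of an upper-tail definition (at least n genes) are assumptions made for this illustration; the text does not state whether the probability refers to exactly n or at least n genes.

```python
import random


def empirical_probability(all_genes, significant_genes, m, n,
                          n_samples=10_000, seed=0):
    """Estimate the probability that a random sample of m genes contains
    at least n genes carrying a significantly associated SNP.

    all_genes:         every gene covered by at least one SNP in the panel
    significant_genes: set of genes with at least one significant SNP
    """
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    genes = list(all_genes)
    significant = set(significant_genes)
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(genes, m)  # m genes drawn without replacement
        count = sum(gene in significant for gene in sample)
        if count >= n:
            hits += 1
    return hits / n_samples
```

In the setting described above, `all_genes` would hold the 15,280 genes covered by the panel and `m` the size of the gene group being tested.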

FST was calculated using the R package HIERFSTAT [11]. The numbers of significant and control SNPs in the 5 MAF bins were as follows (significant, control): 420, 144145; 956, 146870; 1348, 133031; 1310, 118775; 1146, 112831.
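To make the quantity concrete, the sketch below computes a textbook FST for a single biallelic SNP from per-population allele frequencies, as the proportional reduction in expected heterozygosity within populations relative to the pooled population. This is a simplified illustration with equal population weights and no sample-size correction; it is not the estimator implemented in HIERFSTAT, and the function name is hypothetical.

```python
from statistics import mean


def simple_fst(allele_freqs):
    """Textbook FST = (HT - HS) / HT for one biallelic SNP.

    allele_freqs: frequency of one allele in each population.
    HT is the expected heterozygosity of the pooled population and HS
    the mean expected heterozygosity within populations (equal weights).
    """
    p_bar = mean(allele_freqs)
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    h_within = mean(2.0 * p * (1.0 - p) for p in allele_freqs)
    if h_total == 0:
        return float("nan")  # SNP is monomorphic across all populations
    return (h_total - h_within) / h_total
```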

Network construction

Biological network analysis was performed with Ingenuity Pathways Analysis (IPA) software (Ingenuity Systems, www.ingenuity.com) using an unsupervised analysis. IPA builds networks by querying the Ingenuity Pathways Knowledge Base for interactions between the identified genes and all other gene objects stored in the knowledge base; it then generates networks with a maximum network size of 35 genes/proteins. We used all genes showing at least one significantly associated SNP as the input set (n = 1,145). All network edges are supported by at least one published reference or by canonical information stored in the Ingenuity Pathways Knowledge Base. To determine the probability that the analyzed genes are found together in a network from the Ingenuity Pathways Knowledge Base by chance alone, IPA applies Fisher's exact test. The network score represents the -log (p value).
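The logic of such a score can be illustrated with a one-sided Fisher's exact test, which is equivalent to the upper tail of a hypergeometric distribution. The sketch below is not IPA's implementation, which is proprietary; the function names, the choice of a right-tailed test and the base-10 logarithm are assumptions made for this example.

```python
from math import comb, log10


def fisher_right_tail(k, n_net, n_input, n_total):
    """P(X >= k) when a network of n_net genes is drawn at random from
    n_total genes, of which n_input belong to the input set.

    k is the number of input genes observed in the network.
    """
    total = comb(n_total, n_net)
    upper = min(n_net, n_input)
    tail = sum(
        comb(n_input, i) * comb(n_total - n_input, n_net - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def network_score(p_value):
    """Score defined as the negative logarithm of the p value
    (base 10 assumed here)."""
    return -log10(p_value)
```

Under these assumptions, a network with a p value of 10^-8 would receive a score of 8.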

References

1. Prugnolle, F. et al. (2005) Pathogen-driven selection and worldwide HLA class I diversity. Curr. Biol. 15, 1022-1027.

2. Fumagalli, M. et al. (2009) Parasites represent a major selective force for interleukin genes and shape the genetic predisposition to autoimmune conditions. J. Exp. Med. 206, 1395-1408.

3. Fumagalli, M. et al. (2009) Widespread balancing selection and pathogen-driven selection at blood group antigen genes. Genome Res. 19, 199-212.

4. Li, J.Z. et al. (2008) Worldwide human relationships inferred from genome-wide patterns of variation. Science 319, 1100-1104.

5. Rosenberg, N.A. (2006) Standardized subsets of the HGDP-CEPH Human Genome Diversity Cell Line Panel, accounting for atypical and duplicated samples and pairs of close relatives. Ann. Hum. Genet. 70, 841-847.

6. Kwiatkowski, D.P. (2005) How malaria has affected the human genome and what human genetics can teach us about malaria. Am. J. Hum. Genet. 77, 171-192.

7. Ayi, K. et al. (2008) Pyruvate kinase deficiency and malaria. N. Engl. J. Med. 358, 1805-1810.

8. Salkind, N.J. (2007) Encyclopedia of measurement and statistics. Sage Publications, Thousand Oaks, CA.

9. R Development Core Team (2008) R: A Language and Environment for Statistical Computing. Vienna, Austria.

10. Handley, L.J. et al. (2007) Going the distance: human population genetics in a clinal world. Trends Genet. 23, 432-439.

11. Goudet, J. (2005) Hierfstat, a package for R to compute and test hierarchical F-statistics. Mol. Ecol. Notes: 184-186.

Supplementary Figures

Figure S1. Manhattan plots showing significance of association of all SNPs with protozoa diversity. SNPs are plotted on the x-axis according to their chromosome location; the y-axis shows Kendall's rank correlation p values (−log10). The threshold value corresponding to a Bonferroni-corrected p < 0.05 is shown.

Figure S2. FST comparison. SNPs are divided into MAF bins and SNPs significantly associated with protozoa diversity are represented by gray bars. Whiskers denote the 25th and 75th percentiles. All comparisons are statistically significant (two-tailed Wilcoxon Rank Sum test p values for the 5 MAF classes are as follows: 0.025 for the first class and < 10^-16 for the remaining ones).

Figure S3. Network analysis of genes associated with protozoa diversity. In addition to the network reported in the main text, IPA identified four additional networks (A-D) with p values < 10^-8. Genes (or complexes) are represented as nodes; edges indicate known interactions between proteins (solid lines depict direct and dashed lines depict indirect interactions). Genes are color coded as follows: orange, genes with at least one SNP significantly associated with protozoa diversity; yellow, genes with at least one SNP that did not withstand genome-wide Bonferroni correction but displayed an rM rank higher than the 95th percentile and a p value lower than 10^-6 (these genes were not included in the input IPA list used to generate networks); gray, genes covered by at least one SNP in the HGDP-CEPH panel that showed no association with protozoa diversity; white, genes with no SNPs in the panel.


Figure S4. Analysis of SNPs located around TLR4 and significantly associated with protozoa diversity. Significantly associated SNPs are shown in red, while the region covered by TLR4 and the closest telomeric gene (DBC1) are shown in blue. The positions of SNPs located in the region and included in the HGDP-CEPH panel are shown in black. The image was generated by using the “add custom track” utility available through the UCSC Genome Browser (http://genome.ucsc.edu/).