The role of protozoa-driven selection in shaping human genetic variability
Online Supplementary Material
The Role of Protozoa-Driven Selection in Shaping Human Genetic Variability
Uberto Pozzoli1, Matteo Fumagalli1,2, Rachele Cagliani1, Giacomo P. Comi3, Nereo Bresolin1,3,
Mario Clerici4,5, Manuela Sironi1
1Scientific Institute IRCCS E. Medea, Bioinformatic Lab, Via don L. Monza 20, 23842 Bosisio
Parini (LC), Italy.
2Bioengineering Department, Politecnico di Milano, P.zza L. da Vinci, 32, 20133 Milan, Italy.
3Dino Ferrari Centre, Department of Neurological Sciences, University of Milan, IRCCS Ospedale
Maggiore Policlinico, Mangiagalli and Regina Elena Foundation, Via F. Sforza 35, 20100 Milan,
Italy.
4Department of Biomedical sciences and Technologies LITA Segrate, University of Milan, Via F.lli
Cervi 93, 20090 Milan, Italy.
5Don C. Gnocchi ONLUS Foundation IRCCS, Via Capecelatro 66, 20148 Milan, Italy.
Corresponding author: Sironi, M. ([email protected])
Methods
Environmental variables
Protozoa absence/presence matrices for the 21 countries where HGDP-CEPH populations are
located were derived from the Gideon database (http://www.gideononline.com) as previously
described [1-3]. The diversity of viruses, bacteria and helminths was obtained following the same
procedure as for protozoa. Malaria prevalence was obtained from either the Gideon or WHO
(http://www.who.int) databases in terms of cases/year per 100,000 inhabitants. In general, the
number of autochthonous cases per country obtained from Gideon was averaged over all available
surveyed years (which often differ among countries). We considered, when available, historical
notes; for example, information in Gideon indicated that due to eradication campaigns, Italy was
declared malaria-free in 1970; therefore only surveys dating before that period were included. As
for WHO data (WHO Global Health Atlas), the total number of reported malaria cases was
averaged over a time period ranging from 1989 to 2003 (information is not available for every
single year in all countries). Total population estimates per country (in 2005) were derived from the
WHO. All climatic variables were retrieved from the NCEP/NCAR database
(http://www.ngdc.noaa.gov/ecosys/cdroms/ged_iia/datasets/a04/, Legates and Willmott Average,
re-gridded dataset or CDC Derived NCEP Reanalysis Products Surface Level) using the geographic
coordinates reported by HGDP-CEPH (http://www.cephb.fr/en/hgdp/table.php) for each population.
Since malaria prevalence and protozoa diversity, due to data organization in available
epidemiological databases, can only be calculated per country (rather than per population), the same
procedure was applied to climatic variables. Therefore the values of climatic variables were
averaged for populations located in the same country. This ensures that a similar number of ties is
maintained in all correlation analyses.
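The two per-country computations described above (prevalence as cases/year per 100,000 inhabitants, and averaging a climatic variable over populations in the same country) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and all input numbers are hypothetical, and missing survey years are assumed to be encoded as `None`.

```python
from collections import defaultdict
from statistics import mean


def prevalence_per_100k(yearly_cases, population):
    """Average the reported yearly cases, then scale to cases/year per 100,000.

    yearly_cases: one entry per surveyed year; None marks a year without data
    (an assumption about how gaps are encoded).
    """
    reported = [c for c in yearly_cases if c is not None]
    return mean(reported) * 100_000 / population


def average_by_country(population_values):
    """Average a climatic variable over populations located in the same country.

    population_values: iterable of (country, value) pairs, one per population.
    """
    grouped = defaultdict(list)
    for country, value in population_values:
        grouped[country].append(value)
    return {country: mean(values) for country, values in grouped.items()}


# Made-up numbers, for illustration only.
print(prevalence_per_100k([1200, None, 800], 2_000_000))
print(average_by_country([("X", 20.0), ("X", 24.0), ("Y", 10.0)]))
```

Computing both kinds of variable at the country level keeps the number of tied values comparable across the correlation analyses.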
Data retrieval and statistical analysis
Data concerning the HGDP-CEPH panel (Table S1) derive from a previous work [4]. Atypical or
duplicated samples and pairs of close relatives were removed [5]. Following previous indications
[2,3], Bantu individuals (South Africa) were considered as one population.
A SNP was ascribed to a specific gene if it was located within the transcribed region or no farther
than 500 bp upstream of the transcription start site. As in previous studies [2,3], the minor allele
frequency (MAF) of each SNP was calculated as the average over all populations.
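The assignment rule and the MAF calculation can be sketched as below. This is a hedged illustration with hypothetical function names. Two points are assumptions not stated in the text: "upstream" is taken relative to the gene's strand, and the MAF is obtained by averaging the allele frequency over populations and then taking the rarer allele.

```python
from statistics import mean

UPSTREAM_BP = 500  # window upstream of the transcription start site (from the text)


def snp_in_gene(snp_pos, tx_start, tx_end, strand):
    """True if the SNP lies in the transcribed region or within 500 bp upstream.

    Positions are on the same chromosome with tx_start <= tx_end;
    strand is "+" or "-" (strand handling is an assumption).
    """
    if strand == "+":
        return tx_start - UPSTREAM_BP <= snp_pos <= tx_end
    return tx_start <= snp_pos <= tx_end + UPSTREAM_BP


def average_maf(allele_freqs_by_population):
    """MAF of one SNP: mean allele frequency over populations, folded to <= 0.5."""
    avg = mean(allele_freqs_by_population)
    return min(avg, 1.0 - avg)


# Made-up coordinates and frequencies, for illustration only.
print(snp_in_gene(600, 1000, 2000, "+"))
print(average_maf([0.9, 0.7, 0.8]))
```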
The malaria resistance gene list (Table S2) includes loci with strong evidence of influencing
susceptibility to the disease (i.e. genes identified in single association studies were not included).
The loci were retrieved from a previous review [6] and by inspection of the Online Mendelian
Inheritance in Man web site (http://www.ncbi.nlm.nih.gov/omim, OMIM #611162), with the
inclusion of PKLR [7]. HA genes were obtained by manually inspecting OMIM entries.
All correlations were calculated by Kendall's rank correlation coefficient (τ), a non-parametric
statistic used to measure the degree of correspondence between two rankings. The reason for using
this test is that even in the presence of ties, the sampling distribution of τ satisfactorily converges to
a normal distribution for values of n larger than 10 [8]. Partial Mantel tests were performed using
the “Vegan” R package [9]. Matrices were computed as pairwise Euclidean distances in allele
frequency, in geographic distance and in pathogen (protozoa or bacteria or viruses or helminths)
diversity or malaria prevalence (either from the WHO or Gideon). Geographic distances were
derived from a previous work [10] and refer to a model of human migration from East Africa (the
postulated origin of modern humans); geographic distances are computed as the shortest distances
along landmasses, avoiding mountain regions with altitudes over 2000 m.
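The two statistics named above can be sketched in plain Python as follows. This is an illustrative sketch, not the code used for the analysis, which relied on R. Assumptions: Kendall's coefficient is computed in its tie-corrected tau-b form, and the partial Mantel test uses Pearson correlations on the upper triangles, permutes rows and columns of the first matrix, and reports a one-sided permutation p value. All helper names and input values are hypothetical.

```python
import math
import random


def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects for ties in either ranking."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both terms
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / math.sqrt((pairs + ties_x) * (pairs + ties_y))


def euclidean_matrix(values):
    """Pairwise Euclidean distances for one-dimensional values."""
    return [[abs(a - b) for b in values] for a in values]


def _upper(m):
    return [m[i][j] for i in range(len(m)) for j in range(i + 1, len(m))]


def _pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / math.sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))


def partial_mantel(x, y, z, permutations=999, seed=0):
    """Correlation between matrices x and y controlling for z, with a permutation p value."""
    yv, zv = _upper(y), _upper(z)
    ryz = _pearson(yv, zv)

    def stat(xm):
        xv = _upper(xm)
        rxy, rxz = _pearson(xv, yv), _pearson(xv, zv)
        return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

    observed = stat(x)
    rng = random.Random(seed)
    idx = list(range(len(x)))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(idx)  # permute rows and columns of x together
        permuted = [[x[i][j] for j in idx] for i in idx]
        if stat(permuted) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy data: allele frequency, pathogen diversity and a one-dimensional stand-in for geography.
freq = euclidean_matrix([0.1, 0.2, 0.4, 0.7, 0.9])
pathogens = euclidean_matrix([3, 5, 9, 14, 20])
geography = euclidean_matrix([0, 7, 3, 12, 5])
print(kendall_tau_b([1, 1, 2], [1, 2, 3]))
print(partial_mantel(freq, pathogens, geography, permutations=199))
```

In the analysis itself the geographic matrix comes from the landmass-based distances of ref. [10], not from one-dimensional coordinates as in this toy example.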
In order to estimate the probability of obtaining n genes carrying at least one significantly
associated SNP out of a group of m genes, we applied a re-sampling approach: samples of m genes
were randomly extracted from a list of all genes covered by at least one SNP in the HGDP-CEPH
panel (number of genes = 15,280), and for each sample the number of genes with at least one
significant SNP was counted. The empirical probability of obtaining n genes was then calculated
from the distribution of counts deriving from 10,000 random samples.
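The re-sampling procedure can be sketched as follows. This is a minimal illustration under stated assumptions: genes are drawn without replacement, and the empirical probability is read as the upper tail (at least n genes with a significant SNP). The function name and the toy gene identifiers are hypothetical.

```python
import random


def empirical_probability(all_genes, significant_genes, m, n, samples=10_000, seed=0):
    """Fraction of random m-gene samples containing at least n significant genes."""
    rng = random.Random(seed)
    pool = list(all_genes)
    significant = set(significant_genes)
    hits = 0
    for _ in range(samples):
        drawn = rng.sample(pool, m)  # m distinct genes, drawn without replacement
        count = sum(1 for gene in drawn if gene in significant)
        if count >= n:
            hits += 1
    return hits / samples


# Toy universe of 1,000 gene identifiers, 75 of them flagged as significant.
genes = [f"gene{i}" for i in range(1000)]
flagged = genes[:75]
print(empirical_probability(genes, flagged, m=30, n=6))
```

With 10,000 samples the smallest non-zero probability this estimate can return is 0.0001.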
The FST was calculated using the R package HIERFSTAT [11]. The numbers of significant and
control SNPs in the 5 MAF bins were as follows (significant and control, respectively): 420 and
144145; 956 and 146870; 1348 and 133031; 1310 and 118775; 1146 and 112831.
Network construction
Biological network analysis was performed with Ingenuity Pathways Analysis (IPA) software
(Ingenuity Systems, www.ingenuity.com) using an unsupervised analysis. IPA builds networks by
querying the Ingenuity Pathways Knowledge Base for interactions between the identified genes and
all other gene objects stored in the knowledge base; it then generates networks with a maximum
network size of 35 genes/proteins. We used all genes showing at least one significantly associated
SNP as the input set (n = 1,145). All network edges are supported by at least one published
reference or by canonical information stored in the Ingenuity Pathways Knowledge Base. To
determine the probability that the analyzed genes are found together in a network from the
Ingenuity Pathways Knowledge Base by chance alone, IPA applies Fisher's exact test. The
network score represents the -log (p value).
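How a score of this kind relates to a Fisher's exact test can be illustrated with a short sketch. IPA's internal computation is proprietary, so everything here is an assumption: the test is taken as the right tail of a hypergeometric distribution over a fixed gene universe, the logarithm is taken in base 10, and the function names and counts are hypothetical.

```python
import math


def fisher_right_tail(k, n_network, n_input, n_total):
    """P(overlap >= k) when n_network genes are drawn from n_total genes,
    of which n_input belong to the input set (hypergeometric right tail)."""
    upper = min(n_network, n_input)
    total = math.comb(n_total, n_network)
    tail = sum(
        math.comb(n_input, i) * math.comb(n_total - n_input, n_network - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def network_score(p_value):
    """Score as the negative base-10 logarithm of the p value (base is an assumption)."""
    return -math.log10(p_value)


# Made-up counts: 12 of 35 network genes belong to a 1,145-gene input set,
# within a hypothetical universe of 15,280 genes.
p = fisher_right_tail(12, 35, 1145, 15280)
print(p, network_score(p))
```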
References
1. Prugnolle, F. et al. (2005) Pathogen-driven selection and worldwide HLA class I diversity. Curr.
Biol. 15, 1022-1027.
2. Fumagalli, M. et al. (2009) Parasites represent a major selective force for interleukin genes and
shape the genetic predisposition to autoimmune conditions. J. Exp. Med. 206, 1395-1408.
3. Fumagalli, M. et al. (2009) Widespread balancing selection and pathogen-driven selection at
blood group antigen genes. Genome Res. 19, 199-212.
4. Li, J.Z. et al. (2008) Worldwide human relationships inferred from genome-wide patterns of
variation. Science 319, 1100-1104.
5. Rosenberg, N.A. (2006) Standardized subsets of the HGDP-CEPH Human Genome Diversity
Cell Line Panel, accounting for atypical and duplicated samples and pairs of close relatives. Ann.
Hum. Genet. 70, 841-847.
6. Kwiatkowski, D.P. (2005) How malaria has affected the human genome and what human
genetics can teach us about malaria. Am. J. Hum. Genet. 77, 171-192.
7. Ayi, K. et al. (2008) Pyruvate kinase deficiency and malaria. N. Engl. J. Med. 358, 1805-1810.
8. Salkind, N.J. (2007) Encyclopedia of measurement and statistics. Sage Publications, Thousand
Oaks, CA.
9. R Development Core Team (2008) R: A Language and Environment for Statistical Computing.
Vienna, Austria.
10. Handley, L.J. et al. (2007) Going the distance: human population genetics in a clinal world. Trends
Genet. 23, 432-439.
11. Goudet, J. (2005) Hierfstat, a package for R to compute and test hierarchical F-statistics. Mol.
Ecol. Notes: 184-186.
Supplementary Figures
Figure S1. Manhattan plots showing significance of association of all SNPs with protozoa
diversity. SNPs are plotted on the x-axis according to their chromosome location; the y-axis shows
Kendall's rank correlation p values (−log10). The threshold value corresponding to a Bonferroni-
corrected p < 0.05 is shown.
Figure S2. FST comparison. SNPs are divided in MAF bins and SNPs significantly associated with
protozoa diversity are represented by gray bars. Whiskers denote the 25th and 75th percentiles. All
comparisons are statistically significant (two-tailed Wilcoxon Rank Sum test p values for the 5
MAF classes are as follows: 0.025 for the first class and < 10^-16 for the remaining ones).
Figure S3. Network analysis of genes associated with protozoa diversity. In addition to the
network reported in the main text, IPA identified four additional networks (A-D) with p values <
10^-8. Genes (or complexes) are represented as nodes; edges indicate known interactions between
proteins (solid lines depict direct and dashed lines indirect interactions). Genes are color
coded as follows: orange, genes with at least one SNP significantly associated with protozoa
diversity; yellow, genes with at least one SNP that did not withstand genome-wide Bonferroni
correction but displayed an rM rank higher than the 95th percentile and a p value lower than 10^-6 (these genes
were not included in the input IPA list used to generate networks); gray, genes covered by at least
one SNP in the HGDP-CEPH panel that showed no association with protozoa diversity; white,
genes with no SNPs in the panel.
Figure S4. Analysis of SNPs located around TLR4 and significantly associated with protozoa
diversity. Significantly associated SNPs are shown in red while the region covered by TLR4 and the
closest telomeric gene (DBC1) are shown in blue. The positions of SNPs located in the region and
included in the HGDP-CEPH panel are shown in black. The image was generated by using the
“add custom track” utility available through the UCSC Genome Browser (http://genome.ucsc.edu/).