A novel strategy for molecular signature discovery based on independent component analysis

Int. J. Data Mining and Bioinformatics, Vol. 9, No. 3, 2014 277

Copyright © 2014 Inderscience Enterprises Ltd.

Hang-Phuong Pham, Nicolas Dérian and Wahiba Chaara
Immunology, Immunopathology, Immunotherapy, UPMC Univ Paris 06, UMR 7211, F-75013 Paris, France

and

Immunology, Immunopathology, Immunotherapy, CNRS, UMR 7211, F-75013 Paris, France
E-mail: [email protected]
E-mail: [email protected]
E-mail: [email protected]

Bertrand Bellier and David Klatzmann
Immunology, Immunopathology, Immunotherapy, UPMC Univ Paris 06, UMR 7211, F-75013 Paris, France

and

Immunology, Immunopathology, Immunotherapy, CNRS, UMR 7211, F-75013 Paris, France

and

Immunology, Immunopathology, Immunotherapy, INSERM, U959, F-75013 Paris, France
E-mail: [email protected]
E-mail: [email protected]

Adrien Six*
Immunology, Immunopathology, Immunotherapy, UPMC Univ Paris 06, UMR 7211, F-75013 Paris, France

and

Immunology, Immunopathology, Immunotherapy, CNRS, UMR 7211, F-75013 Paris, France
E-mail: [email protected]
*Corresponding author

Abstract: Microarray analysis often yields either too many or too few gene candidates to allow meaningful identification of functional signatures. We aimed to overcome this hurdle by combining two algorithms:

i Independent Component Analysis to extract statistically-based potential signatures.

ii Gene Set Enrichment Analysis to produce an enrichment score, with its statistical significance, for each potential signature.

We have applied this strategy to identify regulatory T cell (Treg) molecular signatures from two experiments in mice, with cross-validation. These signatures can detect the ~1% Treg in whole spleen. These findings demonstrate the relevance of our approach as a signature discovery tool.

Keywords: data mining; bioinformatics; statistical modelling; transcriptome; gene expression; microarray data analysis; GSEA; gene set enrichment analysis; ICA; independent component analysis; T lymphocyte; regulatory T cell; Treg; molecular signature; signature discovery.

Reference to this paper should be made as follows: Pham, H-P., Dérian, N., Chaara, W., Bellier, B., Klatzmann, D. and Six, A. (2014) ‘A novel strategy for molecular signature discovery based on Independent component analysis’, Int. J. Data Mining and Bioinformatics, Vol. 9, No. 3, pp.277–304.

Biographical notes: Hang-Phuong Pham, MSc Major in Statistics, is currently a PhD student at the Pierre and Marie Curie University under the supervision of Adrien Six. His research focuses on integration and modelling of high-throughput multi-scale biological data.

Nicolas Dérian is an Engineer in Bioinformatics. He develops transcriptome analysis strategies to identify and validate characteristic signatures of immune responses in vaccination and cancer.

Wahiba Chaara is an Engineer in Bioinformatics. She specialises in immune repertoire diversity analysis, high-throughput biological data analysis and biological network annotation.

Bertrand Bellier received his PhD in immunology at the Pierre and Marie Curie University. He has since specialised in the study of T-cell responses in physiological and pathological situations, and has expertise in novel vaccine development.

David Klatzmann, MD, PhD, head of the I3 lab, is a Professor of immunology at the Pierre and Marie Curie Medical School. His research is devoted to the development of translational and systems biology approaches in immunology.

Adrien Six, head of the I2D3 team, received his PhD in immunology at the Pierre and Marie Curie University. He has since specialised in the study of lymphocyte development and selection in various physiological and pathological situations. He has developed original technological and bioinformatics tools for global TCR and Ig diversity assessment.

Hang-Phuong Pham and Nicolas Dérian contributed equally to this work.

1 Introduction

Analysis of DNA microarray data can be performed by several methods. While data normalisation is a crucial point (Schmid et al., 2010), the difficulties met in interpreting differential gene expression between conditions make it necessary to develop alternatives to classical statistical tests. Indeed, a gene-by-gene comparison between two conditions frequently leads either to a very large number of differentially expressed genes (e.g., >1,500) or to no detectable difference, in particular when using False Discovery Rate (FDR) control as recommended (Dudoit et al., 2003). In both cases, no relevant functional signature can be established, although the underlying data structure may bear such information at a finer level of organisation. This is particularly the case when looking at discrete differences related to a change in the number or state of activation of a defined subpopulation within the overall signal. To address this problem, advanced methods have been proposed, such as knowledge-driven classification (Tseng and Yu, 2011), machine learning (Liu et al., 2005), or matrix factorisation methods (Kossenkov and Ochs, 2010).

The method we propose here addresses these issues, offering an alternative strategy for the identification of molecular signatures. Our original strategy combines two well-known algorithms, each having proven useful for microarray analysis. The first step of our methodology relies upon Independent Component Analysis (ICA). ICA (Comon, 1994) is a statistical model which aims at finding a linear representation of non-Gaussian components. The purpose of this method is to extract components from the original raw data that are statistically independent, or as independent as possible, and that represent the essential structure of the data. ICA is a blind source separation technique widely used in a number of signal processing applications, such as biomedical engineering, medical imaging and communication systems (James and Hesse, 2005). Interestingly, ICA does not assume a priori knowledge about the parameters of filtering and mixing systems. ICA considers that the value of a registered signal is the superposition of several signals which add up. Applied to transcriptome data analysis, the expression value of a gene can thus be decomposed into the sum of the underlying signals, which correspond to the various biological contexts (e.g., cell populations, biological pathways...) in which this gene is expressed. The algorithm thus supplies reduced gene sets (i.e., sources) which are over-expressed in a set of samples as compared to others. Hence, ICA offers a sensitive and unbiased means to identify gene signatures of a particular condition.
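As an illustration of this additive model, the decomposition can be written in the convention used by the fastICA documentation (X = SA). The index names g, j, k and the number of sources K are our notation, introduced here only for illustration:

```latex
X = S A, \qquad x_{gj} = \sum_{k=1}^{K} s_{gk}\, a_{kj}
```

Here x_{gj} is the expression value of probe g in sample j, s_{gk} is the weight of probe g in source k, and a_{kj} is the contribution of source k to sample j. Under this reading, a source with a few large |s_{gk}| values designates a reduced probe set, and the corresponding row of A indicates in which samples that set is active.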

The significance and robustness of such potential signatures are then tested with the Gene Set Enrichment Analysis (GSEA) methodology (Mootha et al., 2003; Subramanian et al., 2005), which supplies an enrichment score indicating the quality of the signature when comparing controlled conditions within experiments. GSEA is a statistical method which determines whether predefined gene sets are differentially expressed in different phenotypes. The advantage of GSEA is the ability to assess globally the gene expression enrichment of a given gene set, rather than looking at each gene individually. This increases the power of discovery, since each individual gene need not be differentially expressed. GSEA is particularly interesting when series of gene sets have been established, either based on their common statistical behaviour, or because they belong to a common biological pathway or ontology term. In this case, it dramatically reduces the dimensionality of the statistical question and provides a rapid means to screen large gene set databases. In turn, it can reveal whether such a biological pathway or cell population is involved in a particular tissue, condition or treatment. In microarray experiments where no single gene shows statistically significant differential expression between phenotypes, GSEA has already proved its power to identify significant signatures in various situations (de la Fuente et al., 2009; Murohashi et al., 2010).

We have developed a strategy based on the successive implementation of ICA and GSEA aiming at the discovery of molecular signatures of a cell population of interest. The cell population under study in our laboratory is the regulatory T lymphocyte subset (Treg), which usually represents about 5% of CD4+ T lymphocytes, and is known to play important immunomodulatory roles in the immune system (Sakaguchi et al., 2007). We have applied the proposed ICA→GSEA methodology to microarray datasets, comparing gene expression in mouse whole spleen vs. sorted spleen Treg samples, as well as whole spleen vs. Treg-depleted spleen samples, and were able to define cross-validated Treg-specific molecular signatures. These findings underline the relevance of the proposed approach as an additional signature discovery tool.

2 Methods

2.1 Biological sample preparation

The overall scheme of experimental group and sample setup is summarised in Figure 1.

Figure 1 Experimental design for microarray dataset production and analysis. Treg: whole spleen-isolated regulatory T lymphocytes

All biological samples prepared in this study were obtained from seven-week-old C57BL/6 female mice purchased from Elevage Janvier, France. In a first experiment (Exp. 1), we used two groups of 30 mice. The mice in one group, which was used as the control, were injected with 100 μL of PBS intravenously. Mice in the other group were vaccinated with 100 μL of GVO43, a recombinant adenovirus (rAd) vector derived from human adenovirus type 5 deleted in early regions 1 and 3 (Desjardins et al., 2009). The spleens were sampled six hours after injection. Spleen cells were isolated using mechanical dissociation and, due to the low frequency of Tregs, five pools of whole splenocytes, each from six mice, were prepared per group. For Treg cell sorting, part of each pool was enriched for CD4+CD25+ cells on LS columns (Miltenyi Biotec, France). Purified cells were then labelled with FITC-conjugated anti-CD4 (clone RM4-5, BD Biosciences) and biotinylated anti-CD25 (clone 7D4, BD Biosciences) revealed with streptavidin/PE-Cy5 (eBioscience) and sorted on a FACSAria (BD Biosciences). Cell purity was checked thereafter by FACS analysis with the DIVA software (BD Biosciences). Treg samples were additionally tested for FoxP3 expression using the FoxP3 Staining Buffer kit (eBioscience). Treg purity ranged from 72 to 83% (Figure 2(A)). RNA was extracted from each of the five splenocyte pools as well as from the five sorted Treg samples using the RNeasy Mini kit (Qiagen).

In a second experiment (Exp. 2), total splenocytes were isolated using mechanical dissociation from six individual mice injected with 100 μL of PBS. Part of each splenocyte preparation was used to obtain Treg-depleted spleen samples as follows: Tregs were depleted using a FACSAria (BD Biosciences) after anti-CD4 and anti-CD25 staining, as described above for Exp. 1. In Exp. 2, sorted cells are those labelled as non-Treg. Purity of each cell preparation was then checked by FACS analysis with the DIVA software (BD Biosciences), and ranged between 99 and 100% (Figure 2(B)). For each of the six total and six Treg-depleted splenocyte samples, RNA was extracted using the RNeasy Mini kit (Qiagen).

2.2 Microarray datasets

The Treg-related microarray datasets for Exp. 1 and Exp. 2 were produced by our campus Illumina platform. Briefly, RNA was extracted from cell preparations, as described above, using the RNeasy Mini kit (Qiagen) and its quality was checked in an Agilent Bioanalyzer. cRNA was synthesised, amplified and purified from 200 ng of RNA using the Illumina TotalPrep RNA Amplification Kit (Ambion Inc.), following the manufacturer's recommendations. After a second strand synthesis, the cDNA was transcribed in vitro, and cRNA was labelled with biotin-16-UTP. Labelled probe hybridisation to Illumina BeadChips Mouse-6v2 (Illumina, USA) was carried out using Illumina’s BeadChip 6v2 protocol. These beadchips contain 48,701 unique 50-mer oligonucleotides in total. Beadchips were scanned on the Illumina BeadArray 500 GX Reader using the Illumina BeadScan image data acquisition software (version 2.3.0.13). The Illumina BeadChip data were processed with the variance-stabilising transformation (VST) and the robust spline normalisation (RSN) method, both developed and implemented in the R package lumi. Probes were filtered on their detection p-value and kept if well detected (detection p-value < 0.01) among the samples of at least one condition. The detailed experimental procedure is provided in the MIAME metadata under codes E-MEXP-2781 and E-MEXP-2783.

2.3 Tools and implementation

During our study, the R software v.2.9.2 was used on an Intel Core 2 quad CPU in a Windows environment, with the following packages installed: gplots 2.7.4, fastICA 1.1-11, lumi 1.1.0, limma 2.18.3. R software and packages are available at http://www.R-project.org and http://www.bioconductor.org. In parallel, we used GSEA 2.0 running with 1 Gb of memory. Installation was performed with the executable installation file available at http://www.broadinstitute.org/gsea. For signature annotation, we implemented an R function, enabling the automatic annotation of batches of molecular signatures by assessing their biological significance with regard to their Gene Ontology (Ashburner et al., 2000) and KEGG pathway (Kanehisa and Goto, 2000) significant terms (p-value < 0.01). This tool uses the following R packages: lumiMouseAll.db 1.8.1, illuminaMousev2.db 1.4.0, annotate 1.24.1, Rgraphviz 1.24.1, KEGGgraph 1.2.2 and GOstats 2.12.0.

Figure 2 Cell sorting strategies for biological sample preparation. (A) Treg cell sorting (Exp. 1): ‘CD4+CD25hi’ Tregs (rectangle in left panel) were sorted from whole splenocytes. Purity (top right panel) and FoxP3 expression (bottom right panel) of ‘CD4+CD25hi’ sorted cells were evaluated. (B) Treg-depleted splenocyte preparation (Exp. 2): Whole splenocytes were negatively sorted against ‘CD4+CD25hi’ Tregs (rectangle in left panel). Purity of Treg-depleted splenocytes was evaluated (rectangle in right panel). Experimental details are provided in Section 2.1

2.4 Independent Component Analysis (ICA)

Figure 3 summarises the ICA analysis schema that we implemented: we used the fastICA algorithm (Hyvärinen and Oja, 2000) available for R, with the input constructed as a filtered and normalised expression matrix with probe IDs in rows and samples in columns. We used the default parameters provided in the fastICA package, except for the tolerance at which the unmixing matrix A⁻¹ is considered to have converged, which was set to 10⁻⁶ for optimal results. The maximum number of iterations for one fastICA run was set to 200, and all of our fastICA runs converged before this limit. Because two different runs of fastICA provide different results, we ran the algorithm repeatedly (100 times for Exp. 1 and 75 times for Exp. 2). Each run provides a number of sources limited to the number of samples in the dataset. As detailed in Chiappetta et al. (2004), we then computed a correlation matrix of the sources and a credibility index. The correlation matrix of sources provided by fastICA allows iterative selection of lists of sources with a correlation greater than 0.95 across iterations, starting with the list of best credibility (the credibility being equal to the proportion of fastICA runs for which a source belongs to the list). Although Chiappetta et al. (2004) note that only sources with a credibility >0.6 should be considered statistically significant, we kept all extracted sources to assess their possible biological significance. At each step, an average source is calculated from the list of correlated sources, which are then removed from the correlation matrix. This procedure was iterated until only two sources remained. This collection of average sources was next used to extract potential signatures. FastICA sources typically exhibit a super-Gaussian shape, a majority of values being grouped around 0 (μ = 0; σ = 1), except for some values that are extremely high in absolute value. For a given independent source, such extreme values often correspond to significantly over- or under-expressed genes, and putting such genes into correspondence with conditions of interest can be extremely informative. Following the recommendation of Chiappetta et al. (2004), we selected, for each average source generated, two potential signatures corresponding to the tails of the super-Gaussian distribution (|Sg| > 3σ).
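The tail selection described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the function name `tail_signatures`, the toy values and the probe labels are hypothetical, and the source is centred and scaled explicitly instead of assuming μ = 0 and σ = 1.

```python
import statistics


def tail_signatures(source, probe_ids, n_sigma=3.0):
    """Return the two probe sets lying beyond +/- n_sigma standard deviations.

    `source` holds one value per probe for a single averaged ICA source;
    `probe_ids` holds the matching probe identifiers (hypothetical labels).
    """
    mu = statistics.fmean(source)
    sigma = statistics.pstdev(source)
    # Probes in the upper tail form one potential signature ...
    upper = [p for p, v in zip(probe_ids, source) if v - mu > n_sigma * sigma]
    # ... and probes in the lower tail form the other.
    lower = [p for p, v in zip(probe_ids, source) if v - mu < -n_sigma * sigma]
    return upper, lower


# Toy source: 98 values close to zero and one extreme value on each tail.
values = [0.5 if i % 2 == 0 else -0.5 for i in range(98)] + [8.0, -8.0]
probes = [f"probe_{i}" for i in range(100)]
up, down = tail_signatures(values, probes)
print(up, down)
```

Each averaged source thus yields at most two probe sets, which is consistent with the roughly two-signatures-per-source counts reported in the Results.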

Figure 3 ICA-based signature discovery process. (1) Input expression matrix X. (2) and (3) fastICA output: mixing (A) and source (S) matrices obtained from fastICA iterations. S̄ is the mean of high-similarity sources across iterations. S̄ column values follow a super-Gaussian distribution (μ = 0; σ = 1). (4) Each source provides two potential signatures, corresponding to probes that are over a ±3σ threshold on both tails of the curve. (5) Example on Exp. 1 (sorted Treg vs. whole Spleen): a 76-probe signature is overexpressed in sorted Treg samples. Details are provided in Section 2.4

2.5 Gene Set Enrichment Analysis (GSEA)

Probe sets were formatted into gmt files, a GSEA input file format. Since sample numbers in each condition were not high enough for the phenotype permutation option, we followed the developers’ recommendations to use the gene set permutation type. The number of permutations was set at 1,000. Probes were sorted by the score of the eBayes algorithm from the limma package, and supplied to the GSEA program as pre-ranked list files. We used the weighted scoring scheme to compute the enrichment score. An FDR q-value, estimating the probability that a given score is a false positive, is computed from the Normalised Enrichment Score (NES). Of note, the FDR and NES were calculated based on a database of 2,700 potential signatures. A detailed explanation of GSEA can be found in Subramanian et al. (2005) and Mootha et al. (2003). Typically, GSEA results are considered significant when p-value < 0.01 and q-value < 0.05.
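To make the weighted scoring scheme concrete, the sketch below computes the running-sum enrichment score of a probe set on a pre-ranked list, following the definition in Subramanian et al. (2005). It is an illustration in Python, not the GSEA program itself: the function name `enrichment_score` and the toy probes are hypothetical, and neither the permutation-based NES nor the FDR q-value is computed.

```python
def enrichment_score(ranked, probe_set, weight=1.0):
    """Weighted running-sum enrichment score on a pre-ranked list.

    `ranked` is a list of (probe_id, score) pairs sorted by decreasing score;
    `probe_set` is the candidate signature; weight=1.0 is the weighted scheme.
    """
    members = set(probe_set)
    hit_weights = [abs(s) ** weight for p, s in ranked if p in members]
    n_miss = len(ranked) - len(hit_weights)
    hit_total = sum(hit_weights)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")
    running, best = 0.0, 0.0
    for probe, score in ranked:
        if probe in members:
            # A hit raises the running sum in proportion to its ranking score.
            running += abs(score) ** weight / hit_total
        else:
            # A miss lowers the running sum by a constant step.
            running -= 1.0 / n_miss
        # The score is the largest deviation from zero, keeping its sign.
        if abs(running) > abs(best):
            best = running
    return best


# Toy pre-ranked list of six probes (hypothetical scores).
ranked = [("a", 3.0), ("b", 2.0), ("c", 1.0), ("d", -1.0), ("e", -2.0), ("f", -3.0)]
es_top = enrichment_score(ranked, {"a", "b"})
es_bottom = enrichment_score(ranked, {"e", "f"})
print(es_top, es_bottom)
```

A set concentrated at the top of the list gives a positive score and a set concentrated at the bottom gives a negative one, which is how the sign of the NES values reported in the Results should be read.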

3 Results

The purpose of our microarray studies is to identify relevant gene expression signatures to characterise the immune response in mice after vaccination with different virus vectors. In particular, we aim at following the modification of the Treg compartment, a rare population in the whole spleen. We present here an alternative method for signature discovery and apply this strategy to two series of Treg-related microarray datasets to identify Treg-specific signatures.

3.1 Signature discovery strategy and implementation

As described below, the direct analysis of these datasets yielded either too many or too few (or no) differentially expressed genes to identify meaningful characteristic signatures of the conditions tested in these experiments. Therefore, we developed an approach combining ICA with GSEA to

• produce numerous potential signatures from each microarray dataset (ICA)

• validate these signatures on the original data (GSEA).

The proposed ICA→GSEA workflow is shown in Figure 4. As detailed in the Methods section, a list of potential signatures is systematically extracted from each dataset (Figure 3). Briefly, average sources were extracted from iterative ICA on normalised Illumina probe intensities. Two probe sets, corresponding to extreme high expression values found at both tails of the source super-Gaussian distribution, were then extracted from each source to constitute potential molecular signatures. We then used the eBayes function from the limma package to produce a sorted list of probes according to eBayes statistic scores between each set of experimental conditions to compare. GSEA was run on this ranked list with the pre-ranked option and 1,000 permutations on gene sets.

This strategy was applied to study two experimental datasets (summarised in Figure 1) for which no signature could be easily established by direct differential gene expression analysis: Experiment 1 (Exp. 1) is composed of four experimental groups of five biological replicates:

i whole spleen samples (‘Spleen PBS’) & ii sorted spleen Tregs (‘Treg PBS’) from control mice.

iii whole spleen samples (‘Spleen Adeno’) & iv sorted spleen Tregs (‘Treg Adeno’) from mice vaccinated with an adenoviral vector.

Figure 4 ICA→GSEA signature discovery strategy. See Section “Signature discovery strategy and implementation” for detailed explanation

Experiment 2 (Exp. 2) is composed of two experimental groups:

i whole spleen samples (‘Spleen’) & ii Treg-depleted spleen samples (‘Spleen minus Treg’) from control mice.

3.2 Signature identification in samples displaying no detectable differential gene expression

3.2.1 Exp. 1 ‘Spleen Adeno’ vs. ‘Spleen PBS’ comparison

We first looked at the Exp. 1 ‘Spleen Adeno’ vs. ‘Spleen PBS’ comparison, for which changes in the vaccinated samples must be subtle, since we looked only 6 h after vaccination. Indeed, application of classical statistical tests (quantile normalisation and the Illumina custom comparison algorithm, with p-values adjusted by the Benjamini-Hochberg method (Benjamini and Hochberg, 1995)) identifies no differentially expressed probes at adjusted p-value < 0.01. Only five genes are differentially expressed at adjusted p-value < 0.05: Ifi203, A130038J17Rik, AW011738, LOC226691 and LOC240921.
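For readers unfamiliar with the adjustment mentioned above, the following sketch shows the Benjamini-Hochberg step-up procedure on a handful of p-values. It is a generic Python illustration, not the Illumina custom algorithm; the function name `benjamini_hochberg` and the input values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Four hypothetical raw p-values.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
print([round(a, 4) for a in adj])
```

With tens of thousands of probes, each raw p-value is multiplied by a large factor m/rank, which is one reason why subtle changes can leave no probe below the adjusted threshold.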

With no relevant differential gene expression identified by direct comparison, we applied ICA source extraction to Exp. 1 ‘Spleen Adeno’ vs. ‘Spleen PBS’ data. One hundred ICA iterations yielded 20 average sources, of which 7 have a credibility >0.6. The 40 deduced potential signatures were then tested for GSEA enrichment in the Exp. 1 ‘Spleen Adeno’ vs. ‘Spleen PBS’ data. Only one signature is significantly enriched in ‘Spleen PBS’ (NES = –2.7671; p-value < 0.0001; q-value < 0.0001), whereas 17 signatures are enriched in ‘Spleen Adeno’ with p-value < 0.01 and q-value < 0.05. We focused our attention on signature Sig-E1.1, showing the highest enrichment score (NES = 4.5151; p-value < 0.0001; q-value < 0.0001) (Table 1 and Figure 5). Sig-E1.1 has 100% credibility, and contains 257 probes, of which 230 contribute to the enrichment score. As shown in Supplementary Table 1, Sig-E1.1 is characterised by significant GO terms related to innate immune response, inflammation and immune response to virus. The KEGG pathway annotation of Sig-E1.1 reveals DNA viral infection-associated signalling pathways such as RIG-I-like receptor and Toll-like receptor pathways (Cao, 2009). This functional characterisation is fully consistent with the induction of an immune response after adenovirus injection. We, therefore, demonstrate that ICA→GSEA applied to this dataset allows identification of a signature of immunisation when no direct differential gene expression was detected.

Figure 5 A typical three-panel GSEA report for Sig-E1.1 showing significant enrichment in ‘Spleen Adeno’ (left) compared to ‘Spleen PBS’ (right). NES = 4.5151, p-value < 0.0001, q-value < 0.0001. (bottom) Ranked probe list according to a moderated t-statistic. (middle) Signature probe position on the ranked list. (top) Enrichment score (ES) is the maximum of the running sum. (see online version for colours)

3.2.2 Exp. 2 ‘Spleen’ vs. ‘Spleen minus Treg’ comparison

Similarly, in Exp. 2, we searched for gene expression differences between whole spleen (‘Spleen’) and Treg-depleted spleen (‘Spleen minus Treg’) samples, Treg cells representing ~1% of all splenocytes. Not surprisingly, direct comparison of this dataset with classical statistical tests (BeadStudio and limma) was not able to detect significant differences between these groups at adjusted p-value < 0.05.

Table 1 GSEA results for signatures identified in Exp. 1 and Exp. 2

Signature  #probes  E1.1_Spleen_Adeno/PBS         E1.2_PBS_Treg/Spleen          E1.3_Adeno_Treg/Spleen        E2.1_Spleen/Spleen_minus_Treg
                    NES      p-value  q-value     NES      p-value  q-value     NES      p-value  q-value     NES      p-value  q-value
Sig-E1.1   257      4.5151   <0.0001  <0.0001     –2.4053  <0.0001  <0.0001     –2.4790  <0.0001  <0.0001     –0.9465  0.6474   0.6532
Sig-E1.2   76       1.8816   <0.0001  0.0035      3.2984   <0.0001  <0.0001     3.2775   <0.0001  <0.0001     2.2980   <0.0001  <0.0001
Sig-E1.3   83       1.7512   0.0021   0.0115      3.3083   <0.0001  <0.0001     3.3871   <0.0001  <0.0001     2.0095   <0.0001  0.0082
Sig-E2.1   65       –1.4292  0.0353   0.1267      1.1263   0.2733   0.5491      1.4645   0.0293   0.1826      2.6266   <0.0001  <0.0001

#probes: Number of probes in the signature; NES: GSEA Normalised Enrichment Score with associated p-value and q-value.

Twenty-eight average sources, of which 12 have a credibility >0.6, were computed after 75 ICA iterations, and yielded 55 probe sets. These potential signatures were tested for GSEA enrichment in Exp. 2 ‘Spleen’ vs. ‘Spleen minus Treg’ data. Nine signatures are significantly enriched in ‘Spleen’ and 19 signatures are enriched in ‘Spleen minus Treg’, with p-value < 0.01 and q-value < 0.05. We focused our attention on signature Sig-E2.1, showing the highest enrichment score in ‘Spleen’ (NES = 2.6266; p-value < 0.0001; q-value < 0.0001) (Table 1 and Supplementary Figure 3(a)). Sig-E2.1 contains 65 probes, of which 39 contribute to the enrichment score. Sig-E2.1 is characterised by significant GO terms related to regulation of the immune response and is associated with the JAK-STAT signalling KEGG pathway (Supplementary Table 4). This characterisation is partly attributable to two well-known Treg-related genes, Foxp3 and Tnfrsf4. Again, although no individual genes are found to be differentially expressed, ICA→GSEA allows establishment of a signature detecting a 1% difference between the two sample groups, corresponding to the Treg population that has been removed from the ‘Spleen minus Treg’ group.

3.2.3 Treg-specific signature identification

We then performed a (PBS or Adeno) ‘Spleen’ vs. ‘Treg’ analysis to identify genes that are up-regulated in Treg-sorted samples and could define a Treg signature. PBS or Adeno ‘Spleen’ vs. ‘Treg’ analysis provided too many differentially expressed genes (>5,000 up-regulated genes at adjusted p-value < 0.05 with BeadStudio and >1,300 with limma) to allow a meaningful analysis. Therefore, we submitted these datasets to ICA→GSEA. For the ‘Spleen PBS’ vs. ‘Treg PBS’ dataset, ICA extraction after 100 iterations produced 20 average sources, of which 10 have a credibility >0.6, leading to 39 probe sets. Similarly, 20 average sources were obtained from the ‘Spleen Adeno’ vs. ‘Treg Adeno’ dataset, of which 8 have a credibility >0.6, leading to 39 probe sets. These potential signatures were tested for GSEA enrichment. Two and one signatures are significantly enriched in ‘Treg PBS’ and ‘Treg Adeno’, respectively (p-value < 0.01 and q-value < 0.05). Ten and 17 signatures are enriched in ‘Spleen PBS’ and ‘Spleen Adeno’, respectively (p-value < 0.01 and q-value < 0.05).

The best signatures, Sig-E1.2 (76 probes) and Sig-E1.3 (83 probes), enriched in ‘Treg PBS’ and ‘Treg Adeno’, respectively, exhibit very similar GSEA profiles, as shown in Supplementary Figure 1(a) and Supplementary Figure 2(a), with 75 and 78 probes contributing to the enrichment score. Sig-E1.2 and Sig-E1.3 are highly significant, with a credibility index of 1, p-value < 0.0001, q-value < 0.0001 (Table 1). Strikingly, they share 64 probes (hypergeometric overlap test p-value = 1.35 × 10⁻¹⁵⁴; Fury et al., 2006), including a number of reported Treg-related genes such as Foxp3, Ctla4, Tnfrsf4, Cish, Folr4, Tnfrsf9. Altogether, Sig-E1.2 and Sig-E1.3 are characterised by significant GO terms related to regulation of the immune system processes, with focus on T lymphocytes, and to the JAK-STAT signalling and cytokine-cytokine receptor interaction KEGG pathways (Supplementary Table 2 and Supplementary Table 3).
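The overlap test quoted above can be illustrated with the upper tail of a hypergeometric distribution. The sketch below is a generic Python version with exact arithmetic; the function name `overlap_pvalue` is hypothetical, and because the size of the probe universe used for the published p-value is not stated here, the example runs on a small toy universe and does not attempt to reproduce 1.35 × 10⁻¹⁵⁴.

```python
from fractions import Fraction
from math import comb


def overlap_pvalue(universe, size_a, size_b, shared):
    """Probability of sharing at least `shared` items by chance.

    Two sets of sizes `size_a` and `size_b` are assumed to be drawn at random
    from `universe` items; the result is the hypergeometric upper tail.
    """
    total = comb(universe, size_b)
    upper = min(size_a, size_b)
    # Sum the probabilities of every overlap size from `shared` upwards.
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(shared, upper + 1)
    )
    return Fraction(tail, total)


# Toy case: two 3-item sets from a 10-item universe sharing all 3 items.
p_toy = overlap_pvalue(10, 3, 3, 3)
print(p_toy)  # 1/120
```

Applied to the two signatures, the call would take the form `overlap_pvalue(N, 76, 83, 64)`, where N stands for the number of probes retained after filtering; that value is an assumption the reader would need to supply.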

We next asked whether Sig-E1.2 and Sig-E1.3 could distinguish Exp. 2 ‘Spleen’ vs. ‘Spleen minus Treg’ data. We obtained very significant GSEA enrichment for both Sig-E1.2 (NES = 2.2980; p-value < 0.0001; q-value < 0.0001; Table 1 and Supplementary Figure 1(b)) and Sig-E1.3 (NES = 2.0095; p-value < 0.0001; q-value = 0.0082; Table 1 and Supplementary Figure 2(b)). This result demonstrates the relevance of this strategy, showing that a signature specific for a discrete population (Treg) can detect the minute (~1%) presence of this population in a whole spleen sample. Interestingly, Sig-E1.2 and Sig-E1.3 signatures are also enriched in ‘Spleen Adeno’ vs. ‘Spleen PBS’ (see Table 1 for enrichment values). This supports the idea that these Treg-specific signatures are able to catch quantitative changes of the spleen Treg compartment following adenovirus injection. Conversely, Sig-E2.1 does not provide good enrichment scores when tested on Exp. 1 ‘Spleen’ vs. ‘Treg’ (PBS or Adeno) (Supplementary Figure 3(b) and Table 1). We interpret this apparent discrepancy in relation to the datasets from which these signatures have been obtained: Sig-E1.2 and Sig-E1.3 derive from ICA sources computed with Exp. 1 datasets that are highly different (‘Spleen’ vs. ‘Treg’), whereas Sig-E2.1 derives from closely related Exp. 2 datasets (‘Spleen’ vs. ‘Spleen minus Treg’). Therefore, Sig-E2.1 probes, which together capture small differences between Exp. 2 ‘Spleen’ and ‘Spleen minus Treg’ datasets, must be scattered across different ICA sources of Exp. 1 ‘Spleen’ vs. ‘Treg’ datasets. It is noteworthy that only three canonical Treg genes (Foxp3, Tnfrsf4 and Gpr83) are shared between Sig-E2.1 and Sig-E1.2/Sig-E1.3.

4 Discussion

Our work stemmed from the need to establish robust molecular signatures that can help to characterise transcriptome datasets obtained under various vaccination schemes and to assess their respective efficiency. In particular, we aim to follow changes in cell population size or state of activation in the whole spleen of vaccinated mice. When performing direct differential gene expression analysis on these datasets using classical statistical tests (e.g., Illumina custom, limma eBayes modified t-test), we typically cannot detect any differential expression at the recommended controlled p-values. Therefore, we chose to implement an alternative strategy, based on ICA→GSEA methods, to identify gene signatures that are globally significantly over-represented in our conditions. Our choice of ICA was guided by its increasing use in microarray analysis (Chen et al., 2009; Kong et al., 2008). For example, it has recently been applied to separate transcriptome signals related to monocyte/macrophage differentiation into biologically relevant signatures (Lutter et al., 2008). Similarly, GSEA has repeatedly proven helpful in identifying gene signatures in various conditions. For example, Haining et al. (2008) identified a cross-species signature of memory T lymphocytes. We have, therefore, combined these two methods in a semi-automatic scheme to extract and validate potential signatures corresponding to the underlying biological processes revealed through iterative ICA source extraction. In doing so, we have successfully identified signatures from microarray datasets for which no differential gene expression was detected with classical methods. One signature detects subtle transcriptome changes in mouse whole spleen only 6 h after immunisation with an adenoviral vector vaccine. It is noteworthy that the GO/KEGG pathway annotation clearly links this signature to biological functions and pathways activated by adenovirus (Cao, 2009; Hartman et al., 2007).
The second signature concerns the comparative transcriptome data of whole spleen samples (containing Treg) and Treg-depleted spleen samples. This signature underlines the sensitivity of the ICA→GSEA strategy, which can detect differences in transcriptome expression pertaining to just ~1% of the contributing data.
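
To give a rough sense of why a ~1% population is difficult to detect probe by probe, the following minimal sketch assumes a simple linear mixing model. This model is an assumption made here for illustration and is not stated in the paper: the whole-spleen signal of a transcript is taken as the cell-fraction-weighted average of its expression in Treg and in all other cells. The function name and the example values are hypothetical.

```python
import math


def expected_log2_fold_change(treg_fraction, treg_expr, other_expr):
    """log2 ratio of 'Spleen' to 'Spleen minus Treg' signal for one transcript.

    Assumption (illustrative only): the bulk signal is the fraction-weighted
    mean of the per-population expression levels, and depleting Treg leaves
    the expression of the remaining cells unchanged.
    """
    if not 0.0 <= treg_fraction < 1.0:
        raise ValueError("treg_fraction must be in [0, 1)")
    if treg_expr < 0 or other_expr <= 0:
        raise ValueError("expression levels must be positive")
    whole = treg_fraction * treg_expr + (1.0 - treg_fraction) * other_expr
    depleted = other_expr  # Treg removed, remaining cells unchanged
    return math.log2(whole / depleted)


# A transcript 10-fold more abundant in Treg, with Treg at ~1% of the sample
shift = expected_log2_fold_change(0.01, treg_expr=10.0, other_expr=1.0)
print(f"expected log2 fold change: {shift:.3f}")
```

Under these assumptions, a transcript that is 10-fold more abundant in Treg raises the whole-spleen signal by only about 9% (a log2 fold change of roughly 0.12). This is consistent with single-probe tests failing, while a coordinated shift across a set of probes may still be detectable.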

We have thus identified two closely related Treg-specific signatures from two independent datasets comparing sorted Treg and whole spleen samples. Both of these signatures were found to be significantly enriched in whole spleen compared to Treg-depleted samples. To assess the discriminatory power of these ICA-derived signatures, we tested in parallel Treg signatures derived from classical differential expression studies (Fontenot et al., 2005; Lin et al., 2007; Pfoertner et al., 2006; Sugimoto et al., 2006): these literature-based signatures are found enriched in the Exp. 1 ‘Treg’ vs. ‘Spleen’ (PBS or Adeno) datasets. However, they cannot distinguish the Exp. 2 ‘Spleen’ and ‘Spleen minus Treg’ datasets (data not shown), highlighting again the sensitivity of the ICA→GSEA strategy.
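
The enrichment tests discussed here rely on GSEA (Subramanian et al., 2005). As a minimal sketch of the underlying idea, and not the GSEA software used in this study, the unweighted running-sum enrichment score (weighting exponent 0, equivalent to a Kolmogorov-Smirnov-like statistic) could be computed as follows. The function name and the toy probe identifiers are hypothetical; normalisation (NES) and permutation-based p-values and q-values are deliberately omitted.

```python
def enrichment_score(ranked_probes, signature):
    """Unweighted running-sum enrichment score of a signature.

    ranked_probes: unique probe identifiers, ordered from most up-regulated
                   in the first condition to most up-regulated in the second.
    signature:     collection of probe identifiers forming the gene set.
    Returns the running-sum value with the largest absolute deviation from 0.
    """
    hits = set(signature) & set(ranked_probes)
    n_total = len(ranked_probes)
    n_hits = len(hits)
    if n_hits == 0 or n_hits == n_total:
        raise ValueError("signature must contain some, but not all, ranked probes")
    hit_step = 1.0 / n_hits                # increment when a signature probe is met
    miss_step = 1.0 / (n_total - n_hits)   # decrement otherwise
    running = 0.0
    best = 0.0
    for probe in ranked_probes:
        running += hit_step if probe in hits else -miss_step
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranking of ten hypothetical probes, signature concentrated at the top
ranking = [f"probe_{i}" for i in range(1, 11)]
print(enrichment_score(ranking, {"probe_1", "probe_2", "probe_3"}))
```

A score close to +1 indicates that the signature probes cluster at the top of the ranking (first condition), a score close to -1 that they cluster at the bottom, and a score near 0 that they are spread throughout the list.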

We are now moving on to the comparative analysis of whole spleen samples of mice vaccinated with different vectors. The ICA→GSEA procedure has already produced a database of more than 8,000 potential signatures with their associated enrichment scores. Similarly to what has been proposed by Chaussabel et al. (2008) and Oldham et al. (2008), this signature database will be used to obtain a comparative composite description of the underlying mechanisms (inflammation, activation, regulation…) or cell populations (Treg…) at work under various vaccination regimens.

Acknowledgement

The authors thank Christophe Huret and Bruno Gouritin (UMR 7211) for sample preparation, Wassila Carpentier (P3S platform, CHU Pitié-Salpêtrière) for microarray data production, and Drs Véronique Thomas-Vaslin, Gilbert Bensimon and Chris McEwan for helpful discussions and critical reading of the manuscript.

Funding: This work was supported by the Université Pierre et Marie Curie, Centre National de la Recherche Scientifique, and by grants from the European Union [grant numbers LSHB-CT-04-005246 (EC-FP6-COMPUVAC), LSBH-CT-06-018933 (EC-FP6-Clinigene)]. HPP is the recipient of a PhD fellowship from the French Ministry of Research and the Université Pierre et Marie Curie.

Conflict of interest: None declared.

References

Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M. and Sherlock, G. (2000) ‘Gene ontology: tool for the unification of biology’, Nature Genetics, Vol. 25, No. 1, pp.25–29.

Benjamini, Y. and Hochberg, Y. (1995) ‘Controlling the false discovery rate: a practical and powerful approach to multiple testing’, Journal of the Royal Statistical Society, Series B (Statistical Methodology), Vol. 57, No. 1, pp.289–300.

Cao, X. (2009) ‘New DNA-sensing pathway feeds RIG-I with RNA’, Nature Immunology, Vol. 10, No. 10, pp.1049–1051.

Chaussabel, D., Quinn, C., Shen, J., Patel, P., Glaser, C., Baldwin, N., Stichweh, D., Blankenship, D., Li, L., Munagala, I., Bennett, L., Allantaz, F., Mejias, A., Ardura, M., Kaizer, E., Monnet, L., Allman, W., Randall, H., Johnson, D., Lanier, A., Punaro, M., Wittkowski, K.M., White, P., Fay, J., Klintmalm, G., Ramilo, O., Palucka, A.K., Banchereau, J. and Pascual, V. (2008) ‘A modular analysis framework for blood genomics studies: application to systemic lupus erythematosus’, Immunity, Vol. 29, No. 1, pp.150–164.

Chen, L., Xuan, J., Wang, C., Wang, Y., Shih, I., Wang, T.L., Zhang, Z., Clarke, R. and Hoffman, E.P. (2009) ‘Biomarker identification by knowledge-driven multilevel ICA and motif analysis’, International Journal of Data Mining and Bioinformatics, Vol. 3, No. 4, pp.365–381.

Chiappetta, P., Roubaud, M.C. and Torresani, B. (2004) ‘Blind source separation and the analysis of microarray data’, Journal of Computational Biology, Vol. 11, No. 6, pp.1090–1109.

Comon, P. (1994) ‘Independent component analysis, a new concept?’, Signal Processing, Vol. 36, pp.287–314.

de la Fuente, H., Lamana, A., Mittelbrunn, M., Perez-Gala, S., Gonzalez, S., Garcia-Diez, A., Vega, M. and Sanchez-Madrid, F. (2009) ‘Identification of genes responsive to solar simulated UV radiation in human monocyte-derived dendritic cells’, PLoS ONE, Vol. 4, No. 8, p.e6735.

Desjardins, D., Huret, C., Dalba, C., Kreppel, F., Kochanek, S., Cosset, F.L., Tangy, F., Klatzmann, D. and Bellier, B. (2009) ‘Recombinant retrovirus-like particle forming DNA vaccines in prime-boost immunization and their use for hepatitis C virus vaccine development’, Journal of Gene Medicine, Vol. 11, No. 4, pp.313–325.

Dudoit, S., Shaffer, J.P. and Boldrick, J.C. (2003) ‘Multiple hypothesis testing in microarray experiments’, Statistical Science, Vol. 18, No. 1, pp.71–103.

Fontenot, J.D., Rasmussen, J.P., Gavin, M.A. and Rudensky, A.Y. (2005) ‘A function for interleukin 2 in Foxp3-expressing regulatory T cells’, Nature Immunology, Vol. 6, No. 11, pp.1142–1151.

Fury, W., Batliwalla, F., Gregersen, P.K. and Li, W. (2006) ‘Overlapping probabilities of top ranking gene lists, hypergeometric distribution, and stringency of gene selection criterion’, Conference Proceedings IEEE Engineering in Medicine and Biology Society, Vol. 1, pp.5531–5534.

Haining, W.N., Ebert, B.L., Subramanian, A., Wherry, E.J., Eichbaum, Q., Evans, J.W., Mak, R., Rivoli, S., Pretz, J., Angelosanto, J., Smutko, J.S., Walker, B.D., Kaech, S.M., Ahmed, R., Nadler, L.M. and Golub, T.R. (2008) ‘Identification of an evolutionarily conserved transcriptional signature of CD8 memory differentiation that is shared by T and B cells’, Journal of Immunology, Vol. 181, No. 3, pp.1859–1868.

Hartman, Z.C., Black, E.P. and Amalfitano, A. (2007) ‘Adenoviral infection induces a multi-faceted innate cellular immune response that is mediated by the toll-like receptor pathway in A549 cells’, Virology, Vol. 358, No. 2, pp.357–372.

Hartman, Z.C., Kiang, A., Everett, R.S., Serra, D., Yang, X.Y., Clay, T.M. and Amalfitano, A. (2007) ‘Adenovirus infection triggers a rapid, MyD88-regulated transcriptome response critical to acute-phase and adaptive immune responses in vivo’, Journal of Virology, Vol. 81, No. 4, pp.1796–1812.

Hyvärinen, A. and Oja, E. (2000) ‘Independent component analysis: algorithms and applications’, Neural Networks, Vol. 13, Nos. 4–5, pp.411–430.

James, C.J. and Hesse, C.W. (2005) ‘Independent component analysis for biomedical signals’, Physiological Measurement, Vol. 26, No. 1, pp.R15–R39.

Kanehisa, M. and Goto, S. (2000) ‘KEGG: Kyoto encyclopedia of genes and genomes’, Nucleic Acids Research, Vol. 28, No. 1, pp.27–30.

Kong, W., Vanderburg, C.R., Gunshin, H., Rogers, J.T. and Huang, X. (2008) ‘A review of independent component analysis application to microarray gene expression data’, BioTechniques, Vol. 45, No. 5, pp.501–520.

Kossenkov, A.V. and Ochs, M.F. (2010) ‘Matrix factorisation methods applied in microarray data analysis’, International Journal of Data Mining and Bioinformatics, Vol. 4, No. 1, pp.72–90.

Lin, W., Haribhai, D., Relland, L., Truong, N., Carlson, M., Williams, C. and Chatila, T. (2007) ‘Regulatory T cell development in the absence of functional Foxp3’, Nature Immunology, Vol. 8, No. 4, pp.359–368.

Liu, Z., Chen, D., Xu, Y. and Liu, J. (2005) ‘Logistic support vector machines and their application to gene expression data’, International Journal of Bioinformatics Research and Applications, Vol. 1, No. 2, pp.169–182.

Lutter, D., Ugocsai, P., Grandl, M., Orso, E., Theis, F., Lang, E.W. and Schmitz, G. (2008) ‘Analyzing M-CSF dependent monocyte/macrophage differentiation: expression modes and meta-modes derived from an independent component analysis’, BMC Bioinformatics, Vol. 9, p.100.

Mootha, V.K., Lindgren, C.M., Eriksson, K.F., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E., Houstis, N., Daly, M.J., Patterson, N., Mesirov, J.P., Golub, T.R., Tamayo, P., Spiegelman, B., Lander, E.S., Hirschhorn, J.N., Altshuler, D. and Groop, L.C. (2003) ‘PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes’, Nature Genetics, Vol. 34, No. 3, pp.267–273.

Murohashi, M., Hinohara, K., Kuroda, M., Isagawa, T., Tsuji, S., Kobayashi, S., Umezawa, K., Tojo, A., Aburatani, H. and Gotoh, N. (2010) ‘Gene set enrichment analysis provides insight into novel signalling pathways in breast cancer stem cells’, British Journal of Cancer, Vol. 102, No. 1, pp.206–212.

Oldham, M.C., Konopka, G., Iwamoto, K., Langfelder, P., Kato, T., Horvath, S. and Geschwind, D.H. (2008) ‘Functional organization of the transcriptome in human brain’, Nature Neuroscience, Vol. 11, No. 11, pp.1271–1282.

Pfoertner, S., Jeron, A., Probst-Kepper, M., Guzman, C.A., Hansen, W., Westendorf, A.M., Toepfer, T., Schrader, A.J., Franzke, A., Buer, J. and Geffers, R. (2006) ‘Signatures of human regulatory T cells: an encounter with old friends and new players’, Genome Biology, Vol. 7, No. 7, p.R54.

Sakaguchi, S., Wing, K. and Miyara, M. (2007) ‘Regulatory T cells – a brief history and perspective’, European Journal of Immunology, Vol. 37, No. S1, pp.116–123.

Schmid, R., Baum, P., Ittrich, C., Fundel-Clemens, K., Huber, W., Brors, B., Eils, R., Weith, A., Mennerich, D. and Quast, K. (2010) ‘Comparison of normalization methods for Illumina BeadChip(R) HumanHT-12 v3’, BMC Genomics, Vol. 11, No. 1, p.349.

Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S. and Mesirov, J.P. (2005) ‘Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles’, Proceedings of the National Academy of Sciences of the United States of America, Vol. 102, No. 43, pp.15545–15550.

Sugimoto, N., Oida, T., Hirota, K., Nakamura, K., Nomura, T., Uchiyama, T. and Sakaguchi, S. (2006) ‘Foxp3-dependent and -independent molecules specific for CD25+CD4+ natural regulatory T cells revealed by DNA microarray analysis’, International Immunology, Vol. 18, No. 8, pp.1197–1209.

Tseng, V.S. and Yu, H-H. (2011) ‘Microarray data classification by multi-information based gene scoring integrated with Gene Ontology’, International Journal of Data Mining and Bioinformatics, Vol. 5, No. 4, pp.402–416.

List of abbreviations

Exp. 1: Experiment 1 ‘Treg vs. Spleen’ (see Figure 1 for details)
Exp. 2: Experiment 2 ‘Spleen vs. Spleen minus Treg’ (see Figure 1 for details)
GO: Gene ontology
GSEA: Gene set enrichment analysis
ICA: Independent component analysis
KEGG: Kyoto encyclopedia of genes and genomes
Sig-E1.1: Molecular signature number 1 from experiment 1
Sig-E1.2: Molecular signature number 2 from experiment 1
Sig-E1.3: Molecular signature number 3 from experiment 1
Sig-E2.1: Molecular signature number 1 from experiment 2
Treg: regulatory T lymphocyte

Supplemental materials

Supplementary Figure 1 GSEA report for Sig-E1.2. (a) Sig-E1.2 (76 probes) was tested with GSEA on Exp.1 “Treg PBS” vs. “Spleen PBS” conditions (displayed on the left and right, respectively). All probes are strongly associated with the Treg condition (see “large black bar” on the left). NES = 3.2984, p-value < 0.0001, q-value < 0.0001. (b) Sig-E1.2 was tested on Exp.2 “Spleen” vs. “Spleen minus Treg” conditions (displayed on the left and right, respectively). NES = 2.2980, p-value < 0.0001, q-value < 0.0001. Chart description is identical to that of Figure 4 (see online version for colours)

Supplementary Figure 2 GSEA report for Sig-E1.3. (a) Sig-E1.3 was tested on Exp.1 “Treg PBS” vs. “Spleen PBS” conditions (displayed on the left and right, respectively). All probes are strongly associated with the Treg condition (see “large black bar” on the left). NES = 3.3871, p-value < 0.0001, q-value < 0.0001. (b) Sig-E1.3 was tested on Exp.2 “Spleen” vs. “Spleen minus Treg” conditions (displayed on the left and right, respectively). NES = 2.0095, p-value < 0.0001, q-value < 0.0001. Chart description is identical to that of Figure 4 (see online version for colours)

Supplementary Figure 3 GSEA report for Sig-E2.1. (a) Sig-E2.1 was tested on Exp.2 “Spleen” vs. “Spleen minus Treg” conditions (displayed on the left and right, respectively). NES = 2.6266, p-value < 0.0001, q-value < 0.0001. (b) Sig-E2.1 was tested on Exp.1 “Treg PBS” vs. “Spleen PBS” conditions (displayed on the left and right, respectively). NES = 1.1263, p-value = 0.2733, q-value = 0.5491. Chart description is identical to that of Figure 4 (see online version for colours)

Supplementary Table 1 Sig-E1.1 annotation on Gene Ontology and KEGG pathway databases

Signature Gene List (with Entrez Gene ID)

Symbol Name

A530023O14Rik RIKEN cDNA A530023O14 gene
Adar adenosine deaminase, RNA-specific
Agrn agrin
AI451617 expressed sequence AI451617
AI607873 expressed sequence AI607873
Aida axin interactor, dorsalization associated
Amica1 adhesion molecule, interacts with CXADR antigen 1
Arg2 arginase type II
Asb13 ankyrin repeat and SOCS box-containing 13
Atf3 activating transcription factor 3
BC006779 cDNA sequence BC006779
BC094916 cDNA sequence BC094916
Bst2 bone marrow stromal cell antigen 2
C3 complement component 3
Car13 carbonic anhydrase 13
Ccl12 chemokine (C-C motif) ligand 12
Ccnd1 cyclin D1
Ccnd2 cyclin D2
Ccrl2 chemokine (C-C motif) receptor-like 2
Cd274 CD274 antigen
Cd33 CD33 antigen
Chi3l1 chitinase 3-like 1
Clec4a3 C-type lectin domain family 4, member a3
Clec4d C-type lectin domain family 4, member d
Cmpk2 cytidine monophosphate (UMP-CMP) kinase 2, mitochondrial
Cxcl10 chemokine (C-X-C motif) ligand 10
Cxcl13 chemokine (C-X-C motif) ligand 13
Cxcl2 chemokine (C-X-C motif) ligand 2
Cxcl9 chemokine (C-X-C motif) ligand 9
D14Ertd668e DNA segment, Chr 14, ERATO Doi 668, expressed
Daxx Fas death domain-associated protein
Ddx58 DEAD (Asp-Glu-Ala-Asp) box polypeptide 58
Ddx60 DEAD (Asp-Glu-Ala-Asp) box polypeptide 60
Dhx58 DEXH (Asp-Glu-X-His) box polypeptide 58
Dtx3l deltex 3-like (Drosophila)
E330016A19Rik RIKEN cDNA E330016A19 gene
Ehd4 EH-domain containing 4
Emilin2 elastin microfibril interfacer 2
F13a1 coagulation factor XIII, A1 subunit
Fcgr1 Fc receptor, IgG, high affinity I
Fcgr4 Fc receptor, IgG, low affinity IV
Fgl2 fibrinogen-like protein 2
Fpr2 formyl peptide receptor 2
Gbp2 guanylate binding protein 2
Gbp3 guanylate binding protein 3
Gbp5 guanylate binding protein 5
Gcnt2 glucosaminyl (N-acetyl) transferase 2, I-branching enzyme
Gm11428 predicted gene 11428
Gm12250 predicted gene 12250
Gm14446 predicted gene 14446
Gm2785 predicted gene 2785
Gm4070 predicted gene 4070
Gm4951 predicted gene 4951
Gm5483 predicted gene 5483
Hdc histidine decarboxylase
Hk2 hexokinase 2
Hmgn3 high mobility group nucleosomal binding domain 3
Hmox2 heme oxygenase (decycling) 2
Hp haptoglobin
Hsh2d hematopoietic SH2 domain containing
I830012O16Rik RIKEN cDNA I830012O16 gene
Ifi203 interferon activated gene 203
Ifi204 interferon activated gene 204
Ifi205 interferon activated gene 205
Ifi35 interferon-induced protein 35
Ifi47 interferon gamma inducible protein 47
Ifih1 interferon induced with helicase C domain 1
Ifit2 interferon-induced protein with tetratricopeptide repeats 2
Ifit3 interferon-induced protein with tetratricopeptide repeats 3
Ifitm1 interferon induced transmembrane protein 1
Ifitm2 interferon induced transmembrane protein 2
Ifitm3 interferon induced transmembrane protein 3
Ifitm6 interferon induced transmembrane protein 6
Igtp interferon gamma induced GTPase
Il15 interleukin 15
Il1r2 interleukin 1 receptor, type II
Il8rb interleukin 8 receptor, beta
Irf7 interferon regulatory factor 7
Irf9 interferon regulatory factor 9
Irg1 immunoresponsive gene 1
Irgm1 immunity-related GTPase family M member 1
Irgm2 immunity-related GTPase family M member 2
Isg15 ISG15 ubiquitin-like modifier
Lcn2 lipocalin 2
Lgals3bp lectin, galactoside-binding, soluble, 3 binding protein
Lgals9 lectin, galactose binding, soluble 9
Lilrb4 leukocyte immunoglobulin-like receptor, subfamily B, member 4
LOC100044206 hypothetical protein LOC100044206
LOC100044430 similar to Interferon activated gene 205
LOC100047963 similar to ADIR1
LOC192690 similar to interferon activated gene 205
LOC435565 interferon-inducible GTPase-like
LOC623121 similar to Interferon-activatable protein 203 (Ifi-203) (Interferon-inducible protein p203)
LOC625360 similar to 2-cell-stage, variable group, member 3
LOC630296 similar to Ig heavy chain V region RF precursor
LOC630837 similar to Ig heavy chain V region 5-84 precursor
LOC634417 similar to fos-like antigen 2
LOC637605 similar to Gamma-interferon-inducible protein Ifi-16 (Interferon-inducible myeloid differentiation transcriptional activator) (IFI 16)
LOC640675 similar to Interferon-activatable protein 204 (Ifi-204) (Interferon-inducible protein p204)
LOC640746 similar to Tripartite motif protein 34
Lrg1 leucine-rich alpha-2-glycoprotein 1
Mitd1 MIT, microtubule interacting and transport, domain containing 1
Mlkl mixed lineage kinase domain-like
Mmp13 matrix metallopeptidase 13
Mnda myeloid cell nuclear differentiation antigen
Mov10 Moloney leukemia virus 10
Mrgpra2 MAS-related GPR, member A2
Mrgpra7 MAS-related GPR, member A7
Ms4a6d membrane-spanning 4-domains, subfamily A, member 6D
Mx1 myxovirus (influenza virus) resistance 1
Mx2 myxovirus (influenza virus) resistance 2
Nampt nicotinamide phosphoribosyltransferase
Nfil3 nuclear factor, interleukin 3, regulated
Niacr1 niacin receptor 1
Nmral1 NmrA-like family domain containing 1
Nrap nebulin-related anchoring protein
Oas1g 2'-5' oligoadenylate synthetase 1G
Oas3 2'-5' oligoadenylate synthetase 3
Oasl1 2'-5' oligoadenylate synthetase-like 1
Oasl2 2'-5' oligoadenylate synthetase-like 2
Ogfr opioid growth factor receptor
Ogfrl1 opioid growth factor receptor-like 1
P2ry13 purinergic receptor P2Y, G-protein coupled 13
P2ry14 purinergic receptor P2Y, G-protein coupled, 14
Parp10 poly (ADP-ribose) polymerase family, member 10
Parp14 poly (ADP-ribose) polymerase family, member 14
Pcgf5 polycomb group ring finger 5
Phf11 PHD finger protein 11
Pkib protein kinase inhibitor beta, cAMP dependent, testis specific
Pml promyelocytic leukemia
Pols polymerase (DNA directed) sigma
Ppa1 pyrophosphatase (inorganic) 1
Ppm1k protein phosphatase 1K (PP2C domain containing)
Prss34 protease, serine, 34
Retnlg resistin like gamma
Rilpl1 Rab interacting lysosomal protein-like 1
Rnf114 ring finger protein 114
Rsad2 radical S-adenosyl methionine domain containing 2
Rtp4 receptor transporter protein 4
Saa3 serum amyloid A 3
Samd9l sterile alpha motif domain containing 9-like
Samhd1 SAM domain and HD domain, 1
Sap30 sin3 associated polypeptide
Serpina3f serine (or cysteine) peptidase inhibitor, clade A, member 3F
Serpina3g serine (or cysteine) peptidase inhibitor, clade A, member 3G
Setdb2 SET domain, bifurcated 2
Sgcb sarcoglycan, beta (dystrophin-associated glycoprotein)
Slfn1 schlafen 1
Slfn5 schlafen 5
Sp100 nuclear antigen Sp100
St6galnac4 ST6 (alpha-N-acetyl-neuraminyl-2,3-beta-galactosyl-1,3)-N-acetylgalactosaminide alpha-2,6-sialyltransferase 4
Stat1 signal transducer and activator of transcription 1
Stat2 signal transducer and activator of transcription 2
Tdrd7 tudor domain containing 7
Tgtp T-cell specific GTPase
Tlr7 toll-like receptor 7
Tmem184b transmembrane protein 184b
Tnfsf10 tumor necrosis factor (ligand) superfamily, member 10
Tor1aip1 torsin A interacting protein 1
Tor1aip2 torsin A interacting protein 2
Tpst1 protein-tyrosine sulfotransferase 1
Trafd1 TRAF type zinc finger domain containing 1
Trex1 three prime repair exonuclease 1
Trim21 tripartite motif-containing 21
Trim30 tripartite motif-containing 30
Upp1 uridine phosphorylase 1
Usp18 ubiquitin specific peptidase 18
Xaf1 XIAP associated factor 1
Xdh xanthine dehydrogenase
Zbp1 Z-DNA binding protein 1
Zeb2 zinc finger E-box binding homeobox 2
Zfp365 zinc finger protein 365
Znfx1 zinc finger, NFX1-type containing 1

Gene Ontology and KEGG pathway database terms were tested for enrichment: Signatures were tested, using a hypergeometric test with a p-value cutoff of 0.01, against the total list of transcripts associated with each term after filtering out transcripts without Entrez Gene ID. For each molecular signature, an Excel file with five tabs is presented:
• “Signature GeneList” tab: List of all genes composing the signature.
• “GO Annotation” tab: Results of the enrichment analysis of Gene Ontology terms.
  • GOBPID: Gene Ontology Biological Process identification number
  • Pvalue: p-value given by the hypergeometric test (p<0.01)
  • OddsRatio: ratio of odds that a GO term is enriched in the selected category
  • ExpCount: expected number of transcripts found associated with the GO term for enrichment
  • Count: real number of transcripts found associated with the GO term
  • Size: population size of transcripts found associated with the GO term within the analysis
  • Term: Gene Ontology Biological Process description term
  • Inf: infinite value
• “GO GenePerTerm” tab: Detailed list of signature genes which are involved in significantly enriched GO terms.
• “KEGG SignificantPathways” tab: Results of the enrichment analysis of KEGG pathway terms.
  • KEGGID: KEGG identification number
  • Pvalue: p-value given by the hypergeometric test (p<0.01)
  • OddsRatio: ratio of odds that a KEGG pathway is enriched in the selected category
  • ExpCount: expected number of transcripts found associated with the KEGG pathway for enrichment
  • Count: real number of transcripts found associated with the KEGG pathway
  • Size: population size of transcripts found associated with the KEGG pathway within the analysis
  • Term: KEGG pathway description term
  • Inf: infinite value
• “KEGG SignificantTerm” tab: Detailed list of signature genes which are involved in significantly enriched KEGG pathways.
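
The Pvalue, ExpCount, Count and Size columns described in these notes can be illustrated with a short sketch of a one-sided hypergeometric over-representation test. This is an illustration under stated assumptions, not the software used to produce the tables: the function name, the argument names and the toy counts are hypothetical, and the OddsRatio column is not reproduced.

```python
from math import comb


def hypergeom_enrichment(universe_size, term_size, signature_size, count):
    """Return (expected_count, p_value) for over-representation of one term.

    universe_size:  transcripts with an Entrez Gene ID kept in the analysis
    term_size:      universe transcripts annotated with the term ('Size')
    signature_size: transcripts in the signature
    count:          signature transcripts annotated with the term ('Count')
    """
    if not (0 < signature_size <= universe_size and 0 <= term_size <= universe_size):
        raise ValueError("inconsistent universe, term or signature size")
    max_overlap = min(term_size, signature_size)
    if not 0 <= count <= max_overlap:
        raise ValueError("count must lie between 0 and min(term_size, signature_size)")
    # Expected overlap if signature transcripts were drawn at random ('ExpCount')
    expected = signature_size * term_size / universe_size
    # Upper tail P(X >= count) of the hypergeometric distribution ('Pvalue')
    total = comb(universe_size, signature_size)
    tail = sum(
        comb(term_size, k) * comb(universe_size - term_size, signature_size - k)
        for k in range(count, max_overlap + 1)
    )
    return expected, tail / total


# Toy example: 20 transcripts in total, 5 annotated with the term,
# a signature of 4 transcripts of which 2 carry the annotation.
expected, p_value = hypergeom_enrichment(20, 5, 4, 2)
print(expected, p_value)
# A term would be reported only if p_value < 0.01, the cutoff stated in the notes.
```

Exact integer arithmetic via `math.comb` is used here for clarity; for universes of many thousands of transcripts a dedicated statistics library would be the more practical choice.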

Supplementary Table 2 Sig-E1.2 annotation on Gene Ontology and KEGG pathway databases

Signature Gene List (with Entrez Gene ID)

Symbol Name

Areg amphiregulin
Arhgap20 Rho GTPase activating protein 20
Bag3 BCL2-associated athanogene 3
Cd28 CD28 antigen
Cish cytokine inducible SH2-containing protein
Coq10b coenzyme Q10 homolog B (S. cerevisiae)
Dgat2 diacylglycerol O-acyltransferase 2
Dos downstream of Stk11
Dusp10 dual specificity phosphatase 10
Dusp4 dual specificity phosphatase 4
Ecm1 extracellular matrix protein 1
ENSMUSG00000068790 predicted gene, ENSMUSG00000068790
Errfi1 ERBB receptor feedback inhibitor 1
Fam110a family with sequence similarity 110, member A
Fam148b family with sequence similarity 148, member B
Folr4 folate receptor 4 (delta)
Foxp3 forkhead box P3
Frmd6 FERM domain containing 6
Gata3 GATA binding protein 3
Gjb2 gap junction protein, beta 2
Gm4956 predicted gene 4956
Gm5514 predicted gene 5514
Gpr83 G protein-coupled receptor 83
Hey1 hairy/enhancer-of-split related with YRPW motif 1
Igf1r insulin-like growth factor I receptor
Ikzf2 IKAROS family zinc finger 2
Ikzf4 IKAROS family zinc finger 4
Il2ra interleukin 2 receptor, alpha chain
Ipcef1 interaction protein for cytohesin exchange factors 1
LOC100046862 similar to chemokine receptor
LOC100047674 similar to solute carrier family 35 (UDP-glucuronic acid/UDP-N-acetylgalactosamine dual transporter), member D1
LOC100048844 similar to cytotoxic T-lymphocyte associated molecule 4
Lrig1 leucine-rich repeats and immunoglobulin-like domains 1
Nfil3 nuclear factor, interleukin 3, regulated
Nr4a3 nuclear receptor subfamily 4, group A, member 3
Nrn1 neuritin 1
Nrp1 neuropilin 1
Pard6g par-6 partitioning defective 6 homolog gamma (C. elegans)
Pde4d phosphodiesterase 4D, cAMP specific
Phf13 PHD finger protein 13
Phlda1 pleckstrin homology-like domain, family A, member 1
Prickle1 prickle like 1 (Drosophila)
Prkch protein kinase C, eta
Psmb1 proteasome (prosome, macropain) subunit, beta type 1
Rasl11b RAS-like, family 11, member B
Rgs16 regulator of G-protein signaling 16
Rnf125 ring finger protein 125
Rora RAR-related orphan receptor alpha
Sdcbp2 syndecan binding protein (syntenin) 2
Socs2 suppressor of cytokine signaling 2
Spty2d1 SPT2, Suppressor of Ty, domain containing 1 (S. cerevisiae)
Synpo synaptopodin
Syp synaptophysin
Tiam1 T-cell lymphoma invasion and metastasis 1
Tnfrsf18 tumor necrosis factor receptor superfamily, member 18
Tnfrsf4 tumor necrosis factor receptor superfamily, member 4
Tnfrsf9 tumor necrosis factor receptor superfamily, member 9

Supplementary Table 3 Sig-E1.3 annotation on Gene Ontology and KEGG pathway databases

Signature Gene List (with Entrez Gene ID)

Symbol Name

1810055G02Rik RIKEN cDNA 1810055G02 gene
Areg amphiregulin
Cd28 CD28 antigen
Cish cytokine inducible SH2-containing protein
Coq10b coenzyme Q10 homolog B (S. cerevisiae)
Dgat2 diacylglycerol O-acyltransferase 2
Dos downstream of Stk11
Dusp10 dual specificity phosphatase 10
Dusp4 dual specificity phosphatase 4
Ecm1 extracellular matrix protein 1
ENSMUSG00000068790 predicted gene, ENSMUSG00000068790
Errfi1 ERBB receptor feedback inhibitor 1
Fam100b family with sequence similarity 100, member B
Fam110a family with sequence similarity 110, member A
Fam148b family with sequence similarity 148, member B
Folr4 folate receptor 4 (delta)
Foxp3 forkhead box P3
Frmd6 FERM domain containing 6
Gadd45b growth arrest and DNA-damage-inducible 45 beta
Gata3 GATA binding protein 3
Gjb2 gap junction protein, beta 2
Gm4956 predicted gene 4956
Gm6273 predicted gene 6273
Gpr83 G protein-coupled receptor 83
Hey1 hairy/enhancer-of-split related with YRPW motif 1
Ikzf2 IKAROS family zinc finger 2
Il2ra interleukin 2 receptor, alpha chain
Ipcef1 interaction protein for cytohesin exchange factors 1
LOC100046862 similar to chemokine receptor
LOC100047674 similar to solute carrier family 35 (UDP-glucuronic acid/UDP-N-acetylgalactosamine dual transporter), member D1
LOC100048844 similar to cytotoxic T-lymphocyte associated molecule 4
LOC634417 similar to fos-like antigen 2
Lrig1 leucine-rich repeats and immunoglobulin-like domains 1
Med14 mediator complex subunit 14
Nfil3 nuclear factor, interleukin 3, regulated
Nfkbia nuclear factor of kappa light polypeptide gene enhancer in B-cells inhibitor, alpha
Nrn1 neuritin 1
Nrp1 neuropilin 1
Nt5e 5' nucleotidase, ecto
Pard6g par-6 partitioning defective 6 homolog gamma (C. elegans)
Pde4d phosphodiesterase 4D, cAMP specific
Phf13 PHD finger protein 13
Phlda1 pleckstrin homology-like domain, family A, member 1
Pim1 proviral integration site 1
Prickle1 prickle like 1 (Drosophila)
Prkch protein kinase C, eta
Psmb1 proteasome (prosome, macropain) subunit, beta type 1
Rasl11b RAS-like, family 11, member B
Rel reticuloendotheliosis oncogene
Rgs16 regulator of G-protein signaling 16
Rnf125 ring finger protein 125
Rora RAR-related orphan receptor alpha
Rrad Ras-related associated with diabetes
Sdcbp2 syndecan binding protein (syntenin) 2
Slc30a1 solute carrier family 30 (zinc transporter), member 1
Snx18 sorting nexin 18
Syp synaptophysin
Tiam1 T-cell lymphoma invasion and metastasis 1
Tnfrsf18 tumor necrosis factor receptor superfamily, member 18
Tnfrsf4 tumor necrosis factor receptor superfamily, member 4
Tnfrsf9 tumor necrosis factor receptor superfamily, member 9
Zfand2a zinc finger, AN1-type domain 2A
Zfp281 zinc finger protein 281

Gene Ontology and KEGG pathway database terms were tested for enrichment: signatures were tested, using a hypergeometric test with a p-value cutoff of 0.01, against the total list of transcripts associated with each term after filtering out transcripts without an Entrez Gene ID. For each molecular signature, an Excel file with five tabs is presented:
• “Signature GeneList” tab: List of all genes composing the signature.
• “GO Annotation” tab: Results of the enrichment analysis of Gene Ontology terms.
    • GOBPID: Gene Ontology Biological Process identification number
    • Pvalue: p value given by the hypergeometric test (p<0.01)
    • OddsRatio: ratio of odds that a GO term is enriched in the selected category
    • ExpCount: expected number of transcripts found associated with the GO term for enrichment
    • Count: real number of transcripts found associated with the GO term
    • Size: population size of transcripts found associated with the GO term within the analysis
    • Term: Gene Ontology Biological Process description term
    • Inf: infinite value
• “GO GenePerTerm” tab: Detailed list of signature genes which are involved in significantly enriched GO terms.
• “KEGG SignificantPathways” tab: Results of the enrichment analysis of KEGG pathway terms.
    • KEGGID: KEGG identification number
    • Pvalue: p value given by the hypergeometric test (p<0.01)
    • OddsRatio: ratio of odds that a KEGG pathway is enriched in the selected category
    • ExpCount: expected number of transcripts found associated with the KEGG pathway for enrichment
    • Count: real number of transcripts found associated with the KEGG pathway
    • Size: population size of transcripts found associated with the KEGG pathway within the analysis
    • Term: KEGG pathway description term
    • Inf: infinite value
• “KEGG SignificantTerm” tab: Detailed list of signature genes which are involved in significantly enriched KEGG pathways.
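To make the Pvalue, OddsRatio, ExpCount, Count and Size columns concrete, the following is a minimal sketch of a hypergeometric over-representation test for a single term. It is not the authors' pipeline: the function name `enrichment`, its argument names, and the toy numbers are illustrative assumptions. The odds ratio is computed here as the plain sample odds ratio of the 2x2 table, which is also an assumption about how that column was derived.

```python
from math import comb, inf


def enrichment(universe, selected, term_size, count):
    """Over-representation test for one GO term or KEGG pathway.

    universe  -- background transcripts that have an Entrez Gene ID
    selected  -- signature transcripts within that background
    term_size -- background transcripts annotated with the term (Size)
    count     -- signature transcripts annotated with the term (Count)
    """
    # Upper tail of the hypergeometric distribution: P(X >= count).
    upper = min(selected, term_size)
    favourable = sum(
        comb(term_size, i) * comb(universe - term_size, selected - i)
        for i in range(count, upper + 1)
    )
    p_value = favourable / comb(universe, selected)

    # Overlap expected if the signature were drawn at random (ExpCount).
    exp_count = selected * term_size / universe

    # 2x2 table: in/out of the signature versus in/out of the term.
    a = count                                     # in signature, in term
    b = selected - count                          # in signature, not in term
    c = term_size - count                         # not in signature, in term
    d = universe - selected - term_size + count   # in neither
    # Simplification: report Inf whenever the denominator is empty.
    odds_ratio = inf if b * c == 0 else (a * d) / (b * c)
    return p_value, odds_ratio, exp_count


# Toy numbers, not taken from the paper.
p, odds, exp = enrichment(universe=10, selected=3, term_size=4, count=2)
print(p, odds, exp)
```

Under this sketch, a term would be kept when `p_value` falls below the 0.01 cutoff, and an infinite odds ratio (the "Inf" entry in the tabs) arises when every signature transcript carries the term or every annotated transcript is in the signature.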

Supplementary Table 3 Sig-E1.3 annotation on Gene Ontology and KEGG pathway databases (continued)


Signature Gene List (with Entrez Gene ID)

Symbol Name

5730494M16Rik RIKEN cDNA 5730494M16 gene
5830418K08Rik RIKEN cDNA 5830418K08 gene
Actn4 actinin alpha 4
AI646023 expressed sequence AI646023
Ambra1 autophagy/beclin 1 regulator 1
Ankzf1 ankyrin repeat and zinc finger domain containing 1
B230380D07Rik RIKEN cDNA B230380D07 gene
B3gnt8 UDP-GlcNAc:betaGal beta-1,3-N-acetylglucosaminyltransferase 8
Chd3 chromodomain helicase DNA binding protein 3
Clec2i C-type lectin domain family 2, member i
Eif4ebp2 eukaryotic translation initiation factor 4E binding protein 2
Fam76a family with sequence similarity 76, member A
Fastk Fas-activated serine/threonine kinase
Foxp3 forkhead box P3
Gm2427 predicted gene 2427
Gpr83 G protein-coupled receptor 83
Grk6 G protein-coupled receptor kinase 6
Il6ra interleukin 6 receptor, alpha
Ints5 integrator complex subunit 5
Kctd14 potassium channel tetramerisation domain containing 14
Kdm6a lysine (K)-specific demethylase 6A
Lcn2 lipocalin 2
LOC100043977 similar to Igh protein
LOC100045519 similar to Tripartite motif protein 21
LOC100047427 similar to thyroid hormone receptor
LOC100047481 similar to SEC24 related gene family, member B (S. cerevisiae)
Lphn1 latrophilin 1
Lrrfip1 leucine rich repeat (in FLII) interacting protein 1
Ltf lactotransferrin
Mtmr9 myotubularin related protein 9
Myo18a myosin XVIIIA
Pkn1 protein kinase N1
Pmvk phosphomevalonate kinase
Prpf8 pre-mRNA processing factor 8
Rad23a RAD23a homolog (S. cerevisiae)
S100a8 S100 calcium binding protein A8 (calgranulin A)
S100a9 S100 calcium binding protein A9 (calgranulin B)
Secisbp2 SECIS binding protein 2
Smchd1 SMC hinge domain containing 1
Sri sorcin

Supplementary Table 4 Sig-E2.1 annotation on Gene Ontology and KEGG pathway databases



Stat3 signal transducer and activator of transcription 3
Stat6 signal transducer and activator of transcription 6
Tef thyrotroph embryonic factor
Tmem81 transmembrane protein 81
Tnfrsf4 tumor necrosis factor receptor superfamily, member 4
Trabd TraB domain containing
Ttc7 tetratricopeptide repeat domain 7
Usf1 upstream transcription factor 1
Vps13a vacuolar protein sorting 13A (yeast)
Zbtb9 zinc finger and BTB domain containing 9
