A Comparative Evaluation of Unsupervised Anomaly Detection ...
Unsupervised Selection of Highly Coexpressed and Noncoexpressed Genes Using a Consensus Clustering...
-
Upload
independent -
Category
Documents
-
view
3 -
download
0
Transcript of Unsupervised Selection of Highly Coexpressed and Noncoexpressed Genes Using a Consensus Clustering...
For Peer Review
OMICS: A Journal of Integrative Biology: http://mc.manuscriptcentral.com/omics
Unsupervised selection of highly coexpressed and non-
coexpressed genes using a consensus clustering approach
Journal: OMICS: A Journal of Integrative Biology
Manuscript ID: OMI-2008-0074
Manuscript Type: Original Article
Date Submitted by the Author:
30-Oct-2008
Complete List of Authors: Nguyen, Tung Nowakowski, Richard Androulakis, Ioannis; Rutgers University, Biomedical Engineering
Keyword: Computational Genomics, Data Mining, Gene Expression, Statistical Analysis
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
For Peer Review
1
Unsupervised selection of highly coexpressed and non-coexpressed
genes using a consensus clustering approach
Tung T. Nguyen1
BioMaPS Institute for Quantitative Biology, Rutgers University
Mailing address: 23885 BPO WAY, Piscataway, New Jersey 08854
Email: [email protected] Tel: 732-447-6750 Fax: 732-445-3753
Richard S. Nowakowski2
Department of Neuroscience and Cell Biology, UMDNJ-RWJMS
Mailing address: 675 Hoes Lane, Piscataway, New Jersey 08854
Email: [email protected] Tel: 732-235-4981 Fax: 732-235-4029
Ioannis P. Androulakis3,∗
Biomedical Engineering Department, Rutgers University
Mailing address: 599 Taylor Road, Piscataway, New Jersey 08854
Email: [email protected] Tel: 732-445-4500 (x6212) Fax: 732-445-3753
Keywords: consensus clustering, agreement matrix, number of clusters, gene selection,
clusterable subset, co-expression, non-coexpression, coexpressed genes.
∗ correspondence author
Page 1 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
2
Abbreviations
AM: Agreement Matrix
AUC: Area Under the CDF Curve
CDF: Cumulative Distribution Function
SPOR: SPORulation
LPS: LipoPolySaccharide
hclust: hierarchical clustering
diana: divisive analysis clustering
fanny: fuzzy analysis clustering
pam: partitioning around medoid
kmeans: k-means
cmeans: fuzzy c-means
som: self-organizing map
mclust: model-based clustering
Page 2 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
3
Abstract
In this paper we explore the concept of consensus clustering to identify, within a set of
differentially expressed genes, a subset of genes that are either highly co-expressed or highly
non-coexpressed based on the hypothesis that this subset would serve as a better starting point
for further analyses. A number of core clustering methods form the basis for the assertion of an
agreement matrix (AM) characterizing the level of co-expression between any two probesets. In
order to overcome the limitations of using a single distance metric, we explore different metrics
and examine the sensitivity of the AM as a function of the input number of clusters to find a
suggestive number of clusters that best describes a particular dataset. The result of this level of
analysis is a systematic framework for eliminating probesets that cannot be clearly characterized
as either coexpressed or non-coexpressed with others, thus eliminating a number of probesets
from further analysis. Subsequently, an agglomerative hierarchical clustering approach is applied
to cluster the selected subset using the agreement metric information as the similarity measure.
Thus the goal of the proposed methodology is two-fold: (a) we opt to identify a more
‘clusterable’ subset of the original set; and (b) we aim at further refining the subset in order to
identify a core of genes which contains genes which are either coexpressed or non-coexpressed
within a certain confidence level.
Page 3 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
4
Introduction
Microarray technology has become a powerful and widely accepted experimental technique in
molecular biology, allowing the study of genome-wide cellular responses under a multitude of
experimental conditions. In order to analyze such a large amount of data, clustering and
classification have been applied successfully as an essential exploratory tool for discovering
biological patterns and class prediction (Androulakis, et al., 2007). However, to assist those
analyses a widely used pre-processing step involves the selection of a reduced subset of
probesets whose recorded responses are further analyzed with a variety of state of the art
techniques. Although a variety of gene selection methods have been proposed, they can be
broadly classified into two main categories: filters and wrappers (Inza, et al., 2004). Filters
evaluate the significance of each gene individually across multiple conditions or time-points
using a variety of statistical tests such as fold-change, t-test (Pan, 2002), information gain (Su, et
al., 2003), ANOVA (Pavlidis, 2003), and very successful implementations are now available
such as SAM (Tusher, et al., 2001) and EDGE (Storey, et al., 2005). Wrappers invoke a
particular model-based algorithm to rank the relevance of a gene compared to others. Although
filter methods are faster, wrapper approaches can find a smaller subset of more interested genes
for next analyses. Wrappers are also divided into two subtypes: unsupervised selection that does
not use any prior information (Ding, 2003; Jornsten and Yu, 2003; Kim and Gao, 2006; Watson,
2006) and supervised selection which is associated tightly with a specific models (Diaz-Uriarte
and Alvarez de Andres, 2006; Wang, et al., 2008; Yousef, et al., 2007). Once genes have been
Page 4 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
5
identified for further analysis, the entire subset is usually subjected to clustering or classification
studies.
The traditional way for performing clustering analyses is using one clustering method to group
all genes in a dataset into a number of clusters given a pre-defined metric of similarity. Those
genes that belong to one cluster can be considered as co-expressed and those that belong to
different clusters are non-coexpressed. However, it is widely accepted that a number of critical
problems associated with clustering remain open: (i) it is not immediately obvious what the
optimal number of clusters is (Epter S, 1999) and it has been recognized that it is difficult to
develop a systematic and generic method for addressing such a question (Bolshakova and
Azuaje, 2006; Dudoit and Fridlyand, 2002; Monti S 2003; Ressom, et al., 2003; Sharan and
Shamir, 2000; Tibshirani, et al., 2001; Yan and Ye, 2007; Yu, et al., 2007). Approaches such as
DBSCAN (Ester M, 1996) showed great promise but the issue associated with high-dimensional
data still remains (Belacel, et al., 2006); (ii) every clustering method relies on the definition of an
appropriate distance metric such as Euclidean, Pearson correlation, Manhattan etc. (Huang and
Pan, 2006; Hunter, et al., 2001; Qian, et al., 2001), however an algorithm’s performance is
highly sensitive to the selection of the metric, particularly for objects lying at the boundaries
between clusters;(iii) all clustering methods express their own bias and assumptions, and their
performance depends highly on the properties of the input dataset (Belacel, et al., 2006).
Therefore, alternative methods have been proposed to attempt to reduce the bias by combining
two or more clustering algorithms (Chipman and Tibshirani, 2006; van der Laan MJ, 2003) or by
incorporating with prior domain-specific knowledge to guide the clustering process. In the
context of microarray analysis it may include gene ontology (Fang, et al., 2006; Kustra R,
Page 5 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
6
2006), gene annotation (Huang and Pan, 2006), gene function (Pan, 2006) etc.; finally (iv) the
significance of a cluster and/or the probability that two genes may belong to one cluster or two
different clusters is also an issue (Munneke, et al., 2005; van der Laan MJ, 2003; Zhang and
Zhao, 2000).
In order to overcome some of the aforementioned complications simultaneously, the concept
‘ensemble’ or ‘consensus’ was introduced into the clustering literature (Strehl and Ghosh, 2002).
By averaging, in some way, the results of multiple runs, one can estimate an ‘agreement matrix’
(AM) and infer a better proxy of what a more ‘correct’ result ought to look like. A number of
approaches have been proposed; for example Monti et al. (Monti S 2003) applied one clustering
algorithm on multiple sub-sampled datasets without replacement based on the original data set
whereas Grotkjaer et al. (Grotkjaer, et al., 2006) used different random initializations of a single
clustering method to generate multiple results from which the agreement matrix was built to
yield the final clustering assignment. Although such approaches offer definite advantages, they
still express a strong bias for a given method and/or metric. Consequently, an alternative strategy
with a meta-clustering step is applied on the agreement matrix as the distance matrix to reach the
final result. Different studies chose different clustering methods based on those frequently used
in the literature for the first level. In the meta-level, although it is based on a single method, it is
still not evident which clustering method should be used and different studies selected different
methods, e.g. simulated annealing (Hirsch, et al., 2007; Swift, et al., 2004), mapping by Jaccard
index (Laderas and McWeeney, 2007), expectation maximization algorithm (Topchy, et al.,
2005).
Page 6 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
7
In this paper, we however do not mainly focus on solving the problems of clustering. Instead we
will explore the concept of consensus clustering to identify, within a set of differentially
expressed genes, a subset of genes that are either highly co-expressed or highly non-coexpressed
with the hypothesis that this subset would serve as a better starting point for further analysis,
such as coregulation. A number of core clustering methods, supported by R packages e.g.
hierarchical clustering (hclust), divisive analysis clustering (diana), fuzzy analysis clustering
(fanny), partitioning around medoid (pam), k-means (kmeans), fuzzy c-means (cmeans), self-
organizing map (som), and model-based clustering (mclust) will be employed in the first-level
(Dimitriadou, et al., 2006; Fraley and Raftery, 2007; Gentleman, et al., 2004; Ihaka and
Gentleman, 1996; Maechler, et al., 2005; Yan, 2004). Additionally, in order to overcome the
limitations of using a single distance metric, we explore different metrics (Euclidean, Pearson
correlation, and Manhattan) that have already been established (Gibbons and Roth, 2002). The
sensitivity of the AM was also examined as a function of the input number of clusters to find a
suggestive number of clusters that best describes a particular dataset. The result of the first-level
analysis is a systematic framework for eliminating all genes that can not be clearly characterized
as either coexpressed or non-coexpressed with others in the ongoing selected subset.
Subsequently, an agglomerative hierarchical clustering approach is applied to cluster the selected
subset using the agreement metric information as the similarity measure. Thus the goal of the
proposed methodology is two-fold: (a) we opt to identify a more ‘clusterable’ subset of the
original set; and (b) we aim at further refining the subset in order to identify a core of genes
which are either coexpressed or non-coexpressed with a certain confidence level.
Page 7 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
8
Materials and Methods
The main problem to be addressed in this paper can be defined as follows. Given a set of n
objects { }niigG 1== , with each described by a list of d numerical attributes
{ } djniRggggg ijidiii ..1,..1,,,...,, 21 ==∈= , we wish to pick out a ‘clusterable’ subset of objects
GG ⊂' with a confidence level δ% . The term ‘clusterable’ subset is used in the sense that
( ) ( ){ }'i j q i j q q i j qg , g G C : P g g C C , P g g C 1⎡ ⎤ ⎡ ⎤∀ ∈ ⇒ ∃ ∋ ∧ ∈ ≥ δ ∨ ∀ ∧ ∈ ≤ − δ⎣ ⎦ ⎣ ⎦ , where Cq
denotes a, yet to be determined, cluster and P is the probability that the two objects belong to the
same cluster.
The definition is quite general and applicable to a large family of problems (Figure 1). However,
in order to maintain consistency with the specific problem at hand (microarray data), objects will
refer to genes, or better yet probesets, with expression levels measured in different experimental
conditions or time-points, and clusters are groups of genes sharing similar expression profiles.
Thus, genes that belong to one cluster are considered to be coexpressed and genes that belong to
different clusters are considered as non-coexpressed with a confidence level δ.
Figure 1
Agreement matrix
The agreement matrix M quantifies the frequency with which two genes belong to the same
cluster (Figure 2). If N clustering runs are performed on the data, each entry Mij (termed
Page 8 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
9
‘agreement level’) shows the fraction of clustering times two genes are assigned to the same
cluster. The AM entries are defined as:
( )
( ) ( )
Nh
ij i jh 1
(h)h i j
i j
1M M (g ,g )N
1 if g and g are clustered together when running method Mwhere M g ,g (1)
0 othewise
==
⎧⎪= ⎨⎪⎩
∑
and N is the number of clustering runs performed with either different methods or distance
metrics. In our work, we are using hclust, diana, fanny, and pam with Pearson correlation and
Manhattan metric, kmeans, cmeans, som, and mclust with Euclidean metric as the core clustering
methods (Dimitriadou, et al., 2006; Fraley and Raftery, 2007; Gentleman, et al., 2004; Ihaka and
Gentleman, 1996; Maechler, et al., 2005; Yan, 2004).
Figure 2
The evaluation of the AM entries requires the identification of a ‘suggestive’ number of clusters
since, as mentioned earlier, clustering results are highly dependent upon this input value.
Motivated by the work of Monti et al. (Monti S 2003), in order to identify a robust estimate for
the suggestive number of clusters of a given dataset, we examined the distribution of the
agreement matrix entries as a function of the number of clusters (k). By definition the AM
entries vary from zero to one whereas the number of entries falling into the zero-end region
always increases as the input number of clusters k increases. Ideally, and assuming that the data
in question do possess a definite underlying structure there should exist an ‘optimal’ number of
clusters (k*). Thus, one would expect that as k varies from 2 to k*, the rate at which the AM
entries shift to the zero-end is faster than that when k>k*. The rationale behind this hypothesis is
that when the optimal number of clusters is reached, each clustering method individually makes
a more appropriate cluster assignment to objects in the dataset and thus the cluster assignments
Page 9 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
10
from various clustering methods are more common. After that, the reassignment rate is reduced,
making the agreement levels between objects change lesser and lesser. As a result, we would
expect the distribution of the AM entries to change rapidly early on and eventually the rate of
change would drop as k>k*.
We tested the hypothesis by observing the histogram of the AM entries (Epter S, 1999) as the
number of putative clusters k changes i.e. k is varied from 2 to some number K and successive
AM matrices are built. The corresponding distribution of the AM entries is represented by an
empirical cumulative distribution function (Monti S 2003)
)2(..1,,)1(
1)(
21
njinn
xMifxCDF ji ij
k ∈−
<=∑ < (Figure 3). The histogram-based area under the CDF
curve (AUC) corresponding to each value of k is evaluated by
( ) ( ) )3(..1,/,1 BlBlxxCDFxxAUC ll lkllk ∈=−=∑ − where B is the number of buckets used to
construct the histogram or numerically define the CDF. As a result, the change of the distribution
of AM entries is reflected by the changes of the AUCk. The hypothesis earlier stated effectively
is to look at the rate at which the successive distributions change when k increases in order to
identify a putative number of clusters. Therefore, and in order to evaluate a more unbiased metric
for determining the rate of change of the successive CDFs, we made use of the gap statistic
metric (Tibshirani, et al., 2001; Yan and Ye, 2007) and redefined it as:
{ } )4(, 13 −=−=∆∆−∆= kkkk
K
kkk AUCAUCmeanGapii
Due to the high computational requirement, we used the mean of all ∆k (excluding ∆2) instead of
calculating the expected value for each ∆k in the first part of (4) from uniform data as originally
suggested. Because of the faster shift to the zero-end region of AM entries early on, the rate of
Page 10 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
11
change of ∆k based on the AUCs is larger at the beginning and decreases gradually. As a result,
the gap quantity ‘Gapk’ varies from negative to positive. We select the k value at which Gapk+1
becomes positive to be the suggestive number of clusters for the dataset since the distribution of
the AM entries seems to be stabilized from that value. Besides, with the above definition the
mean value of all ∆k will be highly dependent on the selection of value K. However, when k is
over some value, the change of the AM distribution is trivial just because of the nature of the
clustering methods. Consequently, value K must be selected to be appropriate with the changing
amount ∆k of the AUCs. The key point here is to select the right ‘elbow’ of the curve of AUCk.
Therefore, we suggest an empirical default value 4 2ndK = which can be considered as a
balance between the number of objects, object attributes and the significant change in the
distribution of AM entries as well as the expected number of clusters in the dataset (see more in
supplemental data for the algorithm).
Figure 3
Gene selection
The analysis of the agreement matrix results reflects the expected relationship between two
genes, i.e., the probability of belonging or not to the same cluster. As such entries associated
with genes at the ‘hypothetical core of a cluster structure will be consistently grouped together
over multiple runs. This should be manifested by high corresponding values in the AM, whereas
genes belonging to the ‘hypothetical’ core of clearly distinct clusters should be associated with
consistently low AM entries. On the contrary, genes around the ‘hypothetical’ boundary between
two clusters would be very sensitive to changes in the clustering method. As a result, a gene at
the cluster boundary should be characterized by relatively moderate agreement levels in relation
to other genes (Figure 4a). Thus, our hypothesis is that eliminating genes associated with
Page 11 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
12
moderate AM entries would create a more ‘clusterable’ subset. It also should be emphasized that
this approach is not aimed at identifying and eliminating ‘outliers’ and thus this is not an outlier
detection procedure. We simply hypothesize on the potential properties of a more clusterable
subset of objects.
In order to quantify the aforementioned observation, we define an AM entry as an ‘inconsistent’
entry if its value is within the interval 1 – δ < Mij < δ, where δ expresses a user-defined
confidence level (Figure 4b). The AM is now transformed into an adjacency matrix where
consistent pairs of genes i.e. genes that are frequently assigned to the same or different cluster(s)
receive a value of ‘0’ and inconsistent entries are assigned value ‘1’. The adjacency matrix is
then converted to an ‘inconsistent’ graph with nodes indicating genes and edges connecting two
nodes (genes) representing the cluster assignments between those two genes over multiple
clustering runs are unclear. The problem now is removing a number of vertices so that the
resulting graph is completely disconnected (Dangalchev, 2006). We called an inconsistent rank
of a vertex is the order of that vertex, i.e. the number of edges at that vertex
( ) )5(..1,11_ njMandMifrankij ijiji =<<−=∑ δδ
Therefore, vertices with many edges or genes with many inconsistent AM entries will get high
inconsistent ranks; the ones with highest inconsistent rank and all of its edges will be removed
first. The inconsistent rank for each vertex is then recalculated and the step is repeated (Figure
4c). If there are some equivalent inconsistent ranks, the removed vertex can be chosen to be the
one with the highest original inconsistent rank or randomly (e.g. vertex with the smallest index in
our implementation, creating the consistency of removed genes over different running times).
The routine is repeated until the ‘inconsistent’ graph becomes completely disconnected i.e. the
Page 12 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
13
selected subset contains no gene with an ambiguous cluster assignment with other exiting genes
with a given confidence level (see supplemental data for the algorithm).
Figure 4
Consensus clustering
Without dependence on any other parameter besides an agreement threshold to form clusters,
hierarchical clustering was selected to perform the final clustering task. The algorithm starts with
every gene as a cluster and tries to group two clusters into a new one at each iteration. Any pair
of genes belonging to that new cluster needs to have an agreement level more than or equal to δ
(δ–rule). A new cluster is formed by joining two clusters Cp and Cq whose total agreement of all
pairs of genes in these two clusters
( ) { }{ }
)6(_ ∑===∧
qp
CingeneslCingenesk klqp MCCagreementtotal
is maximal. This selection assures that large clusters are given priority to join together since the
total agreement between cluster C and a large one will be greater than that between C and a
smaller one (Figure 5). Besides, although new clusters can be formed with the δ–rule, we still
favor those which contain genes more likely to be clustered together. Therefore, we introduce a
cooling rate to replace the role for δ. As a result, instead of satisfying the δ-rule, any pair of two
genes in a new cluster now needs to satisfy the θ-rule (i.e. their agreement level will be greater
than or equal to θ) and θ decreases slowly from 1.0 to δ (see supplemental data for algorithms).
Figure 5
The algorithm produces a list of clusters in which any two genes belonging to one cluster always
have an agreement level greater than or equal to δ. Although δ is a measure of the frequency with
which two genes can be found in the same cluster over a variety of clustering runs, it can be also
be considered as the confidence that the two genes are coexpressed since δ, by construction, aims
Page 13 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
14
at eliminating method-specific biases and assumptions. Furthermore, the gene selection step
assures that the inconsistencies in the AM are minimized since the relationship between any two
genes is evaluated with a confidence level. Therefore, genes that belong to different clusters can
also be considered as highly non-coexpressed with a confidence level δ.
Additionally, we also provide an optional procedure to exclude trivial clusters formed due to the
nature of clustering methods. Each cluster C is assigned with a simple hypothetical quantity
called ‘cluster significance’ which represents how large the cluster is and how coexpressed the
genes in the cluster are. To select significant clusters, we then estimate the distribution of cluster
significance on random data and compute the p-value for each cluster C above. The dataset is
randomly resampled (permutation plus convex-hull (Munneke, et al., 2005)) a number of times
(nr), for each of which the entire process starting from building the AM with k* selected above to
the consensus clustering step is done and random–resulting clusters are returned. Subsequently,
the procedure estimates the cluster significance for these random clusters and builds up a
distribution of cluster significance. The cluster significance of a cluster is defined as its size
times its homogeneity as mentioned above; random clusters can be in large-size depending on
the input number of clusters but these clusters contain arbitrarily objects (genes) and thus their
homogeneity will not be large, leading to the cluster significance remains trivial. As a result, the
number of random clusters with greater values of cluster significance than that of a selected
cluster C over the total number of random-resulting clusters in nr times resampling will be
considered as the p-value of cluster C for selection.
Experimental data
Page 14 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
15
To assess our approach for finding highly coexpressed and non-coexpressed genes, we analyzed
a number of data sets from the open literature. Specifically, we used the synthetic data to
evaluate fundamental concepts of the algorithm since the structure is precisely known. We
therefore utilized five low-noise and five high-noise 20-attrribute sine-format synthetic datasets
from (Yeung, et al., 2003). To demonstrate the effectiveness of the approach, we illustrated the
accuracy and the clusterability on the selected subset as well as the properties on the removed
domain from the synthetic datasets using Rand index (Jiang, et al., 2004; Rand 1971) and
Friedman-Rafsky test (Friedman JH, 1979; Smith SP, 1984). Besides that, in order to visualize
the effect of different cut-off agreement levels on the selection results, we used five two-
dimension (2D) testing sets from (Pei Y, 2006). The capability to find out a suggestive number
of clusters for a dataset is also demonstrated using these synthetic datasets.
Two experimental datasets (Calvano, et al., 2005; Chu, et al., 1998) were used to further
illustrate the approach. The sporulation yeast dataset (SPOR) (Chu, et al., 1998) contains the
mRNA levels of 6,118 genes. After preprocessing for duplicate gene-name removal and
significantly differently expressed filtering with a fold-change greater than 2.0, 2,595 genes were
identified. The next experimental dataset describes a human model of bacterial infections
(lipopolysaccharide - LPS) (Calvano, et al., 2005) originally containing 44,924 probesets over 6
time-points with four replicates at each time point. After filtering by ANOVA (p-value = 10-4)
implemented in R (Gentleman, et al.) and also customized by our package for easy uses, 3,269
probesets were considered for further analysis. Average expression profiles of probesets over
replicates for each time-point were used as the final input data (Yeung, et al., 2003). A summary
Page 15 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
16
of the datasets is presented in Table 1 (see more in supplemental data for datasets and evaluation
methods).
Table 1
Page 16 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
17
Results
Distribution of the AM entries
In order to examine the properties of the AM, we made use of the synthetic datasets where we
could obtain random, and structured, high- and low-noise, data (Table 1). Random data are
generated through resampling (permutation plus convex-hull (Munneke, et al., 2005)) synthetic
datasets, each with 10 times. The AM histogram, CDF, AUC and gap curves were built
independently for each dataset and then the average ones are made for each data type to have a
consensus view (Figure 6).
Figure 6
The expected change of the histograms as k increases is manifested by the shift of the
distribution of AM entries observed with both high- and low noise data (Figure 6b & 6c panels –
top row). In both cases, AM entries shift faster to the zero-end when k is less than k* (k* = 6 in
this case). Therefore, compared to the random data, we observed that the synthetic data (i)
produce distributions that are not normal, and (ii) beyond the hypothetical optimal k* the
distributions changes at a slower rate. Besides, the random AUC curve is also drawn to be compared
to the non-random AUC curves.
Accuracy in predicting a suggestive number of clusters
A most critical parameter characterizing the performance of this, and any clustering, approach is
related to the selection of an appropriate suggestive number of clusters k. The results on the
synthetic data here provide strong evidence for the method, suggesting that this could be used as
a reasonable starting point (Table 2). However, one could attempt to interpret the AUC curve to
Page 17 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
18
suggest alternatives but for consistency purposes, in all our studies here, we made use of the
Gap-based heuristic for estimating the putative value for k* as showed below.
Table 2
The impact of the confidence level δ
The next critical parameter that needs to be user-defined in the overall scheme is the agreement
threshold or confidence level δ as it affects the selection and clustering steps. We examined five
2D instances of the synthetic data sets (Pei Y, 2006) with varying levels of noise and object
selection is performed for different agreement thresholds δ (Figure 7). The selected clusters are
kept intact without applying the trivial-cluster removal procedure. At high confidence levels, the
majority of points lying in the boundaries between core clusters are eliminated and possibly core
clusters are partitioned further due to the strict requirements for membership to the same cluster.
As δ decreases, clusters get bigger but their boundaries appear to become fuzzier.
Figure 7
Consistency and accuracy of clustering and selection results
To assess the accuracy of the selection and the quality of clustering, we applied the approach on
synthetic datasets with a known class-structure distribution of each object. Ten synthetic
datasets, 5 low- and 5 high-noise, (Table 1) with log-transformation (Huber, et al., 2002) were
used for this purpose. Since the question we originally posed was whether selected objects are
either highly similar or non-similar to each other (or highly coexpressed and non-coexpressed in
the context of gene expression data), we do not need to classify all objects into their correct class
structure. However, we need to identify a smaller set of objects for which we would be confident
that the correct assignment can be made. A brief look on how the data look like and what our
approach picked out is presented in Figure 8. To evaluate the accuracy we used the original
Page 18 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
19
Rand index (Jiang, et al., 2004; Rand 1971) to estimate the correctness of the selection and
clustering on the selected subset (Table 3).
Figure 8
Table 3
We further evaluate the accuracy when a single method, a single metric and the entire dataset is
used (Table 4). Even though some clustering methods/metrics can be highly accurate, the
average accuracies still fluctuate around 80% on high-noise datasets whereas the accuracies our
selection and clustering are around 90% (even with the moderate agreement value of δ=70%).
Overall the accuracy is very high in all cases, further confirming the efficacy of the selection.
Table 4
Evaluating the ‘clusterability’ of the selected subset
‘Noisy’ data tend to lack class structure and as a result different clustering methods with
different metrics produce very inconsistent class assignment results. Consequently, the process
of gene selection tries to remove the noise’ from the data and pick out a more clusterable subset
which contains distinguishable patterns. To evaluate this property of the selected subset we
applied the uniformity testing suggested in (Smith SP, 1984) by using Friedman-Rafsky’s
minimum spanning tree test (Friedman JH, 1979; Smith SP, 1984) to estimate the clusterability
of a dataset (see supplemental data at section 3.2).
Table 5 quantifies the ‘clusterability’ of the original, the selected and removed data for each of
the synthetic datasets. The selected subsets have consistently superior clusterability
characteristics compared to both the entire set and removed subset. Furthermore, the removed
subset is consistently the less clusterable, compared even to the entire dataset
Page 19 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
20
Table 5
Case Studies
We test the overall approach in two experimental datasets, SPOR and LPS (Table 1). A
commonly-used data transformation is employed as a pre-processing step for each gene (Jiang, et
al., 2004) to capture the overall shape of the expression profiles by converting the expression
values according to ( )
raw raws i i
i rawi
g gg
g−
=σ
.
Sporulation
Based on the results of hierarchical clustering with Pearson correlation metric on selected
induced genes (Eisen, et al., 1998), Chu et al. tried to infer the biological function of the
sporulation process by classifying these genes into seven separate temporal patterns (Chu, et al.,
1998). We therefore attempted to apply our strategy to the data to understand how our selection
captured this piece of information. As a result, only large patterns are produced, each of which
consists of a few smaller patterns of genes with distinguishable expression profiles among
previous selected patterns (Figure 9). Specifically, our approach picked out three up-regulated
patterns which contain a smaller number of genes in seven temporal patterns of induced
transcription during the sporulation process; the red pattern contains only 24 ‘metabolic’ genes
as the classification of Chu et al. (Chu, et al., 1998); the green one includes 31 ‘early I’, 5 ‘early
II’ genes and 5 first-assigned ‘metabolic’ genes (ACH1, CAR1, POL30, RFA1, and YIL132C);
and the blue one contains 47 ‘early-mid’, 80 ‘middle’, 46 ‘ mid-late’, and 2 ‘late’ genes (totally
240 selected genes over 475 in seven temporal classes from the original work (Chu, et al.,
1998)). Furthermore, there are five genes that might be considered as a misclassification from the
Page 20 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
21
previous work (five red expression profiles in the green pattern (5 ‘metabolic’ genes)) since their
expression profiles show clearly their new pattern.
Figure 9
Bacterial Endotoxin
The analysis identifies a reduced subset of genes which form, initially, five distinct responses
(Figure 10). These include two clusters that exhibit an early and middle up-regulation event 182
and 199 genes respectively), one cluster that is characterized by later up-regulation (284 genes)
and two clusters that exhibit a down-regulation response (1118 and 27 genes respectively). The
smallest of the down-regulated clusters can be eliminated using our cluster elimination procedure
as a non-significant statistic cluster (p-value = 0.05). It must be emphasized that the design of the
study was to evaluate a self-limited inflammatory response in humans injected with endotoxin.
As such, once the infection is cleared the system is expected to return to homeostasis. It is
important therefore to realize that all clusters to show deviations from homeostasis and eventual
return to base line. In order to further evaluate the significance of our selection we characterized
functionally the populations making up the identified clusters using ArrayTrack (Tong, et al.,
2003). Table 6 displays some significant pathways with p-value less than 0.05.
Figure 10
Table 6
Page 21 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
22
Discussion
The AM histograms, especially on k = 2, depict the difference due to data properties between
random, high- and low-noise data (Figure 6a, b, and c – top-left). Different clustering methods
and/or metrics will partition to random data into two clusters a number of different ways. As a
result, the AM will consist of many moderate agreement level values and thus the distribution of
the AM entries is likely to resemble a normal distribution. On the contrary, when the data have
some structure, the various clustering runs will identify the underlying, but still obscured,
structure in the data. As such, the assignment is not random anymore and thus the distribution of
the AM entries will have properties not characteristic of a normal distribution. As previously
noted, edge histogram (Epter S, 1999) or consensus matrix histogram from multi-clustering runs
on sub-sampling datasets (Monti S 2003), the more convex the shape of the histogram is, the
more clusterable the dataset is and otherwise.
The AUC curve contains an ‘elbow’, around the value of the hypothesized ‘optimal’ number of
clusters due to the shift rate in the distribution of AM entries (Figure 6b&c – middle-bottom).
Since the change of the AM entries on random datasets is only based on the nature of clustering
methods without being affected by any data structure, the ‘elbow’ on random AUC curve seems
to be very blurry compared to that on the non-random AUC curves. This may also become
another characteristic that reflects the property of a dataset i.e. the more slippery and smoothly
the AUC curve is, the more likely random the dataset is. Based on this observation and analysis,
we argue that the interpretation of the distribution the AM entries and the properties of the
Page 22 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
23
heuristic function ‘Gap’, may provide clues as to a rationally selected and appropriate value for
k*. The results in Table 2 provided strong evidence for the hypothesis. Also, it must be
emphasized that the number of clusters for a dataset can never be rigorously correct and truly
predictive in any case, especially for real experimental data. Therefore, ones could attempt to
interpret the AUC curve or use the Gap-based heuristic as proposed here for estimating the
putative value for k* as a starting point for the analysis.
The next, but not less important, parameter that affects the selection and clustering results is the
confidence level (δ). Theoretically, higher confidence levels produce ‘better’ results. However, if
δ is very high, only a small number of genes are selected and if δ is low, many outliers or most of
the noise will be included in the analysis (Figure 7). Therefore, much like k*, there is no truly
rigorous way to decide on what an optimal value ought to be especially in high-dimensional,
sparsely populated datasets like the one generated from microarray studies. In addition, almost
all real data never have absolutely accurate values to enter computational analyses and
furthermore each computational technique we applied always has its own advantages and
disadvantages. Combining different computational techniques will make use of reducing the
faults in each as well as attempting to enhance their individual advantages. However, it is hardly
to get one hundred percent the same result over multiple differently applied methods. Along with
these results (Table 2&3, Figure 7&8), we therefore reasonably recommend that ones should
start with the suggestive number of clusters k* and δ at 70% to examine experimental datasets.
But for low dimensional datasets i.e. datasets with a small number of attributes, higher
confidence levels should be considered.
Page 23 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
24
Because of the selection step, the clustering step only analyzes a subset of the original data.
Consequently, only a smaller number of putative clusters (2, 3, 4, or 5) are generated, especially
for high-noise datasets or when high confidence levels are employed (Table 3). From the results
of Figure 8, we can clearly see that four essential patterns have been identified. The remaining
two can not be separated clearly and thus were eliminated. This is not to imply that the removed
data does not possess a structure, it simply means that it gets harder to clearly identify what this
structure is and how these patterns differ from each other.
Besides, it must be emphasized that in this analysis we do not consider auxiliary information,
such as functional ontologies, in order to refine or guide the clustering process (Flintoft, 2007)
since we want to test the hypothesis whether high coexpressed groups do possess special
properties relying exclusively on expression data. A good example in this study is the results on
the SPOR dataset. Although the sporulation process with up-regulated genes can be divided into
four temporal classes or a more refined classification with seven temporal classes (Chu, et al.,
1998), their expression profiles are still clear with only three up- regulated patterns from a large-
scale analysis (Figure 9). However, these divisions can be amended or further divided on a small
interested subset to reach the goal but it will contain some bias. Therefore, this again in some
aspect shows the irrelevant relationship between the expression profile and the function of some
genes (Flintoft, 2007).
Further interpretation of the results validates the importance of the selection. A careful analysis
of the implicated pathways those identified patterns (Table 6) shows how the strategy captures
critical biological events. First, the ‘early-up’ pattern contains genes whose expression levels
Page 24 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
25
increase during the first 2hrs after the administration of endotoxin and then return to the baseline
within the first 24hrs. Such an ‘early-peak’ response consists of genes that are involved in critical
pro-inflammatory signaling pathways (e.g. Toll-like receptor signaling (TNF, CCL4, IL1B,
NFkBIA) and Cytokine-cytokine receptor interaction (C-X-C motifs, CXCL1, CXCL2, CCL20,
IL1A) which play an integral role in the progression of systemic inflammation. For example,
endotoxin when binds to its signaling receptor triggers a signal transduction cascade that
converges to the activation of transcription factors (NF-kB) essential for the transcriptional
synthesis of various pro-inflammatory genes (IL1, TNF, IL8) (Chow, et al., 1999). Therefore, the
expression level of NFkBIA which encodes for the primary inhibitor of NF-kB (Carmody and
Chen, 2007) goes up, coupling with the co-expression of the pro-inflammatory cytokines (TNF,
IL1A, IL1B).
Next, the ‘middle-up’ pattern is characterized by an increased expression of genes that peak at
4hrs post-endotoxin administration and participate in inflammatory relevant signaling pathways
such as Apoptosis (CASP10, CFLAR, FAS) and Toll-like receptor signaling (NFkB1, NFkB2,
RELA). The Toll-like receptor signaling is repeatedly appeared as an enriched pathway in this
pattern compared with the ‘early-up’ one since some inflammatory genes (e.g. members of NF-
kB/RELA family) show increased expression levels during the first 2-4hrs which were already
reported in (Calvano, et al., 2005). In the other hand, recent insight (Hotchkiss and Nicholson,
2006) indicated that there was an excessive death of immune effector cells (apoptotic cells)
during the progression of an aberrant inflammatory response. This fact shows how the apoptosis
is important in this event and thus how efficient the approach captured the biology function with
the fact that the most enriched pathway in this class of genes is Apoptosis (p-value ~ 10-7).
Page 25 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
26
Subsequently, the ‘late-up’ pattern composes of genes with late expression level during the 4-
6hrs post-endotoxin administration and subsequent resolution at 24hrs. Such a temporal pattern
is enriched with genes involved in inflammatory relevant biological pathways as it previously
stated e.g. Apoptosis (CASP8, IRAK4, PIK3G) and Cytokine-cytokine receptor interaction
(IL10RB, IL13RA1, IL8RB). However, herein, JAK-STAT cascade (IL10RB, STAT5B, JAK3,
and IL13RA) is an additional inflammatory relevant pathway that discriminates this pattern from
the aforementioned. From a biological point of view, JAK-STAT cascade is essential to regulate
the expression of target genes that counteract the inflammatory response. In addition to this,
research evidence (Murray, 2007) suggest that a STAT pathway from a receptor signaling system
is a major determinant of key regulatory systems including feedback loops such as SOCS
induction which subsequently suppresses the early induced cytokine signaling. Therefore, genes
that are co-expressed in this pattern participate in anti-inflammatory processes that aim to restore
homeostasis.
Finally, the ‘down’ pattern is the most populated expression motif characterized by a decreased
gene expression level during the time course of the experiment. These genes are involved in
cellular bio-energetic processes with a large array of genes to participate in pathways (p-value ~
10-7) such as Oxidative phosphorylation (ATP5A, COX and NDUF members) and Ribosome
biogenesis and assembly (RPL/RPS family). Other suppressed genes that involve Purine
(PDE4A, PDE8A, PRPS1) and Pyruvate metabolism (GLO1, PDHB, LDHB) participate in TCA
cycle (MDH1, MDH2, ACLY) as well as in metabolic pathways. Endotoxin–induced
inflammation causes the dysregulation of leukocyte bioenergetics and persistent decrease in
Page 26 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
27
mitochondrial activity can lead to reduced cellular metabolism (Singer, et al., 2004). That is to
say, co-expressed genes in this down-regulated pattern indicate the shut-down in cellular
energetic of human blood leukocytes when exposed to an inflammatory stress.
Acknowledgements
The authors acknowledge financial support from from the National Science Foundation under the
NSF-BES 0519563 Metabolic Engineering Grant and the EPA under the EPA GAD R 832721-
010 grant.
Page 27 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
28
References
ANDROULAKIS, I.P., YANG, E. and ALMON, R.R. (2007) Analysis of time-series gene
expression data: methods, challenges, and opportunities, Annu Rev Biomed Eng, 9, 205-228.
BELACEL, N., WANG, Q. and CUPERLOVIC-CULF, M. (2006) Clustering methods for
microarray gene expression data, OMICS, 10, 507-531.
BOLSHAKOVA, N. and AZUAJE, F. (2006) Estimating the number of clusters in DNA
microarray data, Methods Inf Med, 45, 153-157.
CALVANO, S.E., XIAO, W., RICHARDS, D.R., FELCIANO, R.M., BAKER, H.V., CHO, R.J.,
et al. (2005) A network-based analysis of systemic inflammation in humans, Nature, 437,
1032-1037.
CARMODY, R.J. and CHEN, Y.H. (2007) Nuclear factor-kappaB: activation and regulation
during toll-like receptor signaling, Cell Mol Immunol, 4, 31-41.
CHIPMAN, H. and TIBSHIRANI, R. (2006) Hybrid hierarchical clustering with applications to
microarray data, Biostatistics, 7, 286-301.
CHOW, J.C., YOUNG, D.W., GOLENBOCK, D.T., CHRIST, W.J. and GUSOVSKY, F. (1999)
Toll-like receptor-4 mediates lipopolysaccharide-induced signal transduction, J Biol Chem,
274, 10689-10692.
CHU, S., DERISI, J., EISEN, M., MULHOLLAND, J., BOTSTEIN, D., BROWN, P.O., et al.
(1998) The transcriptional program of sporulation in budding yeast, Science, 282, 699-705.
DANGALCHEV, C. (2006) Residual closeness in networks, Physica A, 365, 556-564.
Page 28 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
29
DIAZ-URIARTE, R. and ALVAREZ DE ANDRES, S. (2006) Gene selection and classification
of microarray data using random forest, BMC Bioinformatics, 7, 3.
DIMITRIADOU, E., HORNIK, K., LEISCH, F., MEYER, D. and WEINGESSEL, A. (2006)
e1071: Misc Functions of the Department of Statistics, R packages.
DING, C.H. (2003) Unsupervised feature selection via two-way ordering in gene expression
analysis, Bioinformatics, 19, 1259-1266.
DUDOIT, S. and FRIDLYAND, J. (2002) A prediction-based resampling method for estimating
the number of clusters in a dataset, Genome Biol, 3, RESEARCH0036.
EISEN, M.B., SPELLMAN, P.T., BROWN, P.O. and BOTSTEIN, D. (1998) Cluster analysis
and display of genome-wide expression patterns, Proc Natl Acad Sci U S A, 95, 14863-
14868.
EPTER S, K.M., ZAKI M (1999) Clusterability detection and initial seed selection in large
datasets, Technical report, Rensselaer Polytechnic Institute.
ESTER M, K.H., SANDER J, XU X (1996) A Density-Based Algorithm for Discovering
Clusters in Large Spatial Databases with Noise, Proc. KDD' 96, AAAI Press, Menlo Park,
226-231.
FANG, Z., YANG, J., LI, Y., LUO, Q. and LIU, L. (2006) Knowledge guided analysis of
microarray data, J Biomed Inform, 39, 401-411.
FLINTOFT, L. (2007) Gene regulation: The many paths to coexpression, Nature Reviews
Genetics, 8, 827.
FRALEY and RAFTERY, A. (2007) mclust: Model-Based Clustering / Normal Mixture
Modeling, R packages.
Page 29 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
30
FRIEDMAN JH, R.L. (1979) Multivariate generalization of the Wald-Wolfowitz and Smirnov
two-sample tests, Ann. Statist., 7, 697-717.
GENTLEMAN, R.C., CAREY, V. and HUBER, W. genefilter: methods for filtering genes from
microarray experiments, R packages.
GENTLEMAN, R.C., CAREY, V.J., BATES, D.M., BOLSTAD, B., DETTLING, M.,
DUDOIT, S., et al. (2004) Bioconductor: open software development for computational
biology and bioinformatics, Genome Biol, 5, R80.
GIBBONS, F.D. and ROTH, F.P. (2002) Judging the quality of gene expression-based clustering
methods using gene annotation, Genome Res, 12, 1574-1581.
GROTKJAER, T., WINTHER, O., REGENBERG, B., NIELSEN, J. and HANSEN, L.K. (2006)
Robust multi-scale clustering of large DNA microarray datasets with the consensus
algorithm, Bioinformatics, 22, 58-67.
HIRSCH, M., SWIFT, S. and LIU, X. (2007) Optimal search space for clustering gene
expression data via consensus, J Comput Biol, 14, 1327-1341.
HOTCHKISS, R.S. and NICHOLSON, D.W. (2006) Apoptosis and caspases regulate death and
inflammation in sepsis, Nat Rev Immunol, 6, 813-822.
HUANG, D. and PAN, W. (2006) Incorporating biological knowledge into distance-based
clustering analysis of microarray gene expression data, Bioinformatics, 22, 1259-1268.
HUBER, W., VON HEYDEBRECK, A., SULTMANN, H., POUSTKA, A. and VINGRON, M.
(2002) Variance stabilization applied to microarray data calibration and to the quantification
of differential expression, Bioinformatics, 18 Suppl 1, S96-104.
Page 30 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
31
HUNTER, L., TAYLOR, R.C., LEACH, S.M. and SIMON, R. (2001) GEST: a gene expression
search tool based on a novel Bayesian similarity metric, Bioinformatics, 17 Suppl 1, S115-
122.
IHAKA, R. and GENTLEMAN, R. (1996) R: A Language for Data Analysis and Graphics, J
Comp. Graphical Statistics, 5, 299-314 (http://www.R-project.org).
INZA, I., LARRANAGA, P., BLANCO, R. and CERROLAZA, A.J. (2004) Filter versus
wrapper gene selection approaches in DNA microarray domains, Artif Intell Med, 31, 91-103.
JIANG, D., TANG, C. and ZHANG, A. (2004) Cluster analysis for gene expression data: a
survey, IEEE Transactions on Knowledge and Data Engineering, 16, 1370-1386.
JORNSTEN, R. and YU, B. (2003) Simultaneous gene clustering and subset selection for sample
classification via MDL, Bioinformatics, 19, 1100-1109.
KIM, Y.B. and GAO, J. (2006) Unsupervised Gene Selection For High Dimensional Data, IEEE
Symposium on BionInformatics and BioEngineering, 227-234.
KUSTRA R, Z.A. (2006) Incorporating Gene Ontology in Clustering Gene Expression Data,
IEEE on Computer-Based Medical Systems, CBMS' 06, 555-563.
LADERAS, T. and MCWEENEY, S. (2007) Consensus framework for exploring microarray
data using multiple clustering methods, Omics, 11, 116-128.
MAECHLER, M., ROUSSEEUW, P., STRUYF, A. and HUBERT, M. (2005) cluster: Cluster
Analysis Basics and Extensions, R packages.
MONTI S , T.P., MESIROV J , GOLUB T (2003) Consensus Clustering: A Resampling-Based
Method for Class Discovery and Visualization of Gene Expression Microarray Data, Mach.
Learn., 52, 91-118.
Page 31 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
32
MUNNEKE, B., SCHLAUCH, K.A., SIMONSEN, K.L., BEAVIS, W.D. and DOERGE, R.W.
(2005) Adding confidence to gene expression clustering, Genetics, 170, 2003-2011.
MURRAY, P.J. (2007) The JAK-STAT signaling pathway: input and output integration, J
Immunol, 178, 2623-2629.
PAN, W. (2002) A comparative review of statistical methods for discovering differentially
expressed genes in replicated microarray experiments, Bioinformatics, 18, 546-554.
PAN, W. (2006) Incorporating gene functions as priors in model-based clustering of microarray
gene expression data, Bioinformatics, 22, 795-801.
PAVLIDIS, P. (2003) Using ANOVA for gene selection from microarray studies of the nervous
system, Methods, 31, 282-289.
PEI Y, Z.O. (2006) A Synthetic Data Generator for Clustering and Outlier Analysis, Technical
report, University of Alberta,, TR06-15.
QIAN, J., DOLLED-FILHART, M., LIN, J., YU, H. and GERSTEIN, M. (2001) Beyond
synexpression relationships: local clustering of time-shifted and inverted gene expression
profiles identifies new, biologically relevant interactions, J Mol Biol, 314, 1053-1066.
RAND , W.M. (1971) Objective criteria for the evaluation of clustering methods, J. American
Statistical Association, 66, 846-850.
RESSOM, H., WANG, D. and NATARAJAN, P. (2003) Adaptive double self-organizing maps
for clustering gene expression profiles, Neural Netw, 16, 633-640.
SHARAN, R. and SHAMIR, R. (2000) CLICK: a clustering algorithm with applications to gene
expression analysis, Proc Int Conf Intell Syst Mol Biol, 8, 307-316.
Page 32 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
33
SINGER, M., DE SANTIS, V., VITALE, D. and JEFFCOATE, W. (2004) Multiorgan failure is
an adaptive, endocrine-mediated, metabolic response to overwhelming systemic
inflammation, Lancet, 364, 545-548.
SMITH SP, J.A. (1984) Testing for uniformity in multidimensional data, IEEE Trans. Pattern
Anal. Machine Intell., PAMI-6, 73-80.
STOREY, J.D., XIAO, W., LEEK, J.T., TOMPKINS, R.G. and DAVIS, R.W. (2005)
Significance analysis of time course microarray experiments, Proc Natl Acad Sci U S A, 102,
12837-12842.
STREHL, A. and GHOSH, J. (2002) Cluster Ensembles A Knowledge Reuse Framework for
Combining Multiple Partitions, Journal on Machine Learning Research 3, 583-617.
SU, Y., MURALI, T.M., PAVLOVIC, V., SCHAFFER, M. and KASIF, S. (2003) RankGene:
identification of diagnostic genes based on expression data, Bioinformatics, 19, 1578-1579.
SWIFT, S., TUCKER, A., VINCIOTTI, V., MARTIN, N., ORENGO, C., LIU, X., et al. (2004)
Consensus clustering and functional interpretation of gene-expression data, Genome Biol, 5,
R94.
TIBSHIRANI, R., WALTHER, G. and HASTIE, T. (2001) Estimating the Number of Clusters in
a Data Set via the Gap Statistic, J. Royal Statistical Society, 63, 411-423.
TONG, W., CAO, X., HARRIS, S., SUN, H., FANG, H., FUSCOE, J., et al. (2003) ArrayTrack-
-supporting toxicogenomic research at the U.S. Food and Drug Administration National
Center for Toxicological Research, Environ Health Perspect, 111, 1819-1826.
TOPCHY, A., JAIN, A.K. and PUNCH, W. (2005) Clustering ensembles: models of consensus
and weak partitions, IEEE Trans Pattern Anal Mach Intell, 27, 1866-1881.
Page 33 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
34
TUSHER, V.G., TIBSHIRANI, R. and CHU, G. (2001) Significance analysis of microarrays
applied to the ionizing radiation response, Proc Natl Acad Sci U S A, 98, 5116-5121.
VAN DER LAAN MJ, P.K. (2003) Hybrid clustering of gene expression data with visualization
and the bootstrap, J of Stat. Planning and Inference, 117, 275-303.
WANG, L., ZHU, J. and ZOU, H. (2008) Hybrid huberized support vector machines for
microarray classification and gene selection, Bioinformatics, 24, 412-419.
WATSON, M. (2006) CoXpress: differential co-expression in gene expression data, BMC
Bioinformatics, 7, 509.
YAN, J. (2004) som: Self-Organizing Map, R packages.
YAN, M. and YE, K. (2007) Determining the number of clusters using the weighted gap statistic,
Biometrics, 63, 1031-1037.
YEUNG, K.Y., MEDVEDOVIC, M. and BUMGARNER, R.E. (2003) Clustering gene-
expression data with repeated measurements, Genome Biol, 4, R34.
YOUSEF, M., JUNG, S., SHOWE, L.C. and SHOWE, M.K. (2007) Recursive cluster
elimination (RCE) for classification and feature selection from gene expression data, BMC
Bioinformatics, 8, 144.
YU, Z., WONG, H.S. and WANG, H. (2007) Graph-based consensus clustering for class
discovery from gene expression data, Bioinformatics, 23, 2888-2896.
ZHANG, K. and ZHAO, H. (2000) Assessing reliability of gene clusters from gene expression
data, Funct Integr Genomics, 1, 156-173.
Page 34 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
1
Table 1: Brief description of synthetic and real datasets used in experimental evaluation
Datasets No. of clusters No. of attributes No. of objects References Five 2D synthetic 4-5 2 2,000 [56] Five low-noise synthetic 6 20 400 [51] Five high-noise synthetic 6 20 400 [51] Yeast sporulation - SPOR n/a 7 2,595 (6,118+) [57] LPS n/a 6 (4*) 3,269 (44,927+) [58]
*: number of replicates for each attribute (time-point); +: original number of objects (genes, probesets) before filtering.
Page 35 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
1
Table 2: Prediction the number of clusters by the process automatically
Datasets
2D synthetic synthetic data Real data
true suggestive low-noise high-noise Sporulation LPS true sugg. true sugg. suggestive suggestive
Set 1 4 4 6 6 6 6
7 6 Set 2 4 5 6 6 6 7* Set 3 5 5 6 6 6 6 Set 4 5 5 6 6 6 6 Set 5 5 5 6 6 6 6
* implies that the suggestive numbers of clusters are suitable even though the true ones are different.
Page 36 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
1
Table 3: Accuracy of the selection and clustering on the synthetic class structure*
Confidence δ
Datasets low-noise high-noise
set 1 set 2 set 3 set 4 set 5 set 1 set 2 set 3 set 4 set 5 0.9 262|4|100 317|5|99.69 234|4|99.79 333|5|100 331|5|100 76|2|100 76|2|92.63 98|2|78.29 98|2|100 90|2|95.630.8 266|4|100 324|5|99.37 247|4|98.53 400|6|100 332|5|100 108|4|98.65 106|3|90.3 188|4|83.5 159|3|100 101|3|90.730.7 399|6|100 329|5|99.36 261|4|98.18 400|6|100 399|6|100 228|4|89.32 134|4|92.04 264|4|85.27 196|4|100 161|4|87.280.6 399|6|100 336|5|98.83 316|5|98.34 400|6|100 400|6|100 316|4|86.98 185|5|93.03 316|4|86.73 229|4|99.53 232|4|89.88
*: the format of each cell is ‘number of selected genes | corresponding number of patterns | accuracy’
Page 37 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
1
Table 4: Accuracy of running one clustering element on the entire dataset
clustering methods
Datasets low-noise high-noise
set 1 set 2 set 3 set 4 set 5 set 1 set 2 set 3 set 4 set 5 hclust – Pear 88.43 94.21 88.00 94.38 94.29 74.16 86.04 75.44 83.69 81.88 diana – Pear 88.43 88.51 87.84 94.38 94.14 80.52 83.17 79.80 89.06 82.92 fanny – Pear 99.83 96.64 96.42 100.0 99.83 83.05 63.95 82.32 74.60 64.47 pam – Pear 100.0 96.75 96.42 100.0 100.0 88.36 87.41 84.40 92.37 88.97
hclust – Manh 100.0 94.29 88.67 100.0 100.0 67.37 21.41 66.58 22.76 19.78 diana – Manh 100.0 94.21 93.57 100.0 94.21 87.71 90.70 87.31 87.39 83.84 fanny – Manh 100.0 96.09 98.71 100.0 100.0 66.19 64.28 65.49 64.83 63.68 pam – Manh 100.0 98.45 99.02 100.0 100.0 98.68 92.77 89.02 98.70 96.98
kmeans – Euc 99.83 94.14 95.91 100.0 100.0 95.62 90.75 87.11 96.79 94.65 cmeans – Euc 99.83 97.26 92.23 100.0 100.0 86.08 60.79 86.77 75.00 62.80
som – Euc 88.75 93.72 88.75 94.46 88.75 86.13 84.66 85.88 85.73 86.55 mclust – Euc 100.0 94.37 91.42 100.0 100.0 83.82 84.69 82.33 84.25 84.51
average 97.09 94.89 93.08 98.60 97.60 83.14 75.88 81.04 79.60 75.92
(Pear: Pearson correlation metric; Manh: Manhattan metric; Euc: Euclidean metric)
Page 38 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
1
Table 5: Friedman-Rafsky test* for clusterability on high-noise synthetic sets with δ = 70%
set 1 set 2 set 3 set 4 set 5 Original data 0.54 0.49 0.56 0.56 0.58 Selected domain 0.49 0.35 0.41 0.45 0.56 Removed domain 0.74 0.58 0.86 0.65 0.65 *: the smaller the better
Page 39 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
1
Table 6: Pathway enrichment in four selected patterns (P-value < 0.05) Patterns Map Title P-value Patterns Map Title P-value
Early Up
Toll-like receptor signaling pathway* 0.00039
Late Up
Apoptosis* 0.00040 Type I diabetes mellitus 0.00126 Cytokine-cytokine receptor interaction* 0.00857 Cytokine-cytokine receptor interaction* 0.00155 Limonene and pinene degradation 0.00953
Coumarine and phenylpropanoid biosynthesis 0.00241 Jak-STAT signaling pathway* 0.01241
Hematopoietic cell lineage 0.01448
Apoptosis 0.01309 Epithelial cell signaling in Helicobacter pylori infection 0.02522
Alzheimer's disease 0.03749 Alkaloid biosynthesis II 0.03847
Epithelial cell signaling in Helicobacter pylori infection 0.03816
Down
Oxidative phosphorylation* 0.00000 Ribosome* 0.00000
Glycan structures – degradation 0.03999 Caprolactam degradation 0.00130 Adipocytokine signaling pathway 0.04406 Lysine degradation 0.00147 Fc epsilon RI signaling pathway 0.04877 Fatty acid elongation in mitochondria 0.00191
Middle Up
Apoptosis* 0.00000 Reductive carboxylate cycle (CO2 fixation) 0.00287 Adipocytokine signaling pathway 0.00334 Citrate cycle (TCA cycle) * 0.00514 Toll-like receptor signaling pathway* 0.00743 Folate biosynthesis 0.00716 B cell receptor signaling pathway 0.01715 N-Glycan biosynthesis 0.00825
Epithelial cell signaling in Helicobacter pylori infection 0.02101
Butanoate metabolism 0.01386 Type I diabetes mellitus 0.01386
Pancreatic cancer 0.02531 T cell receptor signaling pathway 0.02075 Chronic myeloid leukemia 0.02810 Antigen processing and presentation 0.02215 Prostate cancer 0.03856 Aminoacyl-tRNA biosynthesis 0.02295 Small cell lung cancer 0.03856 Amyotrophic lateral sclerosis (ALS) 0.02321
Sphingolipid metabolism 0.03910 Pyrimidine metabolism* 0.03201 Pyruvate metabolism* 0.03584
Folate biosynthesis 0.04525 Valine, leucine and isoleucine degradation 0.03966 Galactose metabolism 0.04307
T cell receptor signaling pathway 0.04691 Purine metabolism 0.04504 Cell adhesion molecules (CAMs) 0.04799
*: selected pathways for discussing biological functions
Page 40 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
1
Figure legends
Figure 1: Schematic overview of microarray data analysis using multiple clustering runs to select a ‘clusterable’
subset – the subset which contains genes that are either highly coexpressed or non-coexpressed with a confidence
level δ%. The preprocessing step (filtered by fold-change, ANOVA [4, 5], SAM [6], EDGE[7]) removes as many as
possible genes that are not significantly differentially expressed across conditions or time-points. Data with repeated
measurements can be averaged before clustering [50]. Each clustering method needs an input number of clusters k
as the required input parameter; therefore we examine the agreement matrix (AM) for a number of different k and
try to select one as a suggestive number of clusters for the dataset. Then, the final AM is built and pass through the
process of gene selection which eliminates all genes that have at least one ‘inconsistent’ value with some other gene.
δ is the threshold to say whether the agreement level of two genes belong to one cluster (≥δ) or two clusters (≤ 1-δ)
is consistent or not. The last step is dividing the selected subset into a number of patterns with the agreement
threshold δ based on the remainder of the agreement matrix as the input distance matrix.
Figure 2: An example of the agreement matrix (right). The left is the results from N clustering runs (N = 4 in this
example, represented by M1…M4) with k = 3 as the input number of clusters on n genes (n = 7, represented by
g1…g7). The right shows the corresponding agreement matrix that each entry Mij is the frequency of gene i and gene
j grouped into the same cluster by M1...M4.
Figure 3: Histogram of AM entries (left) and the corresponding CDF curve (right) from the AM in Figure 2.
Assume that five buckets (B=5) are used to build the histogram; each represents the proportion of AM entries that
fall into segment ( )[ ) 5..1,/,/1 =− lBlBl ; the last segment also includes all entries with value one. The CDF curve
is constructed based on the histogram and thus the horizontal axis is the axis of the agreement level as well as the
histogram buckets.
Figure 4: The gene selection process. (a) Genes at boundaries or outliers between clusters will have many moderate
agreement levels; g2 and g3 in cluster I will have a high agreement level whereas g2 and g4 have a low agreement
Page 41 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
2
level; g1 can belong to either cluster I or cluster II among different clustering running times, causing agreement
levels between g1 and other genes e.g. g2, g3, g4 are moderate. (b) The inconsistent region of agreement levels. (c)
The process of disconnecting the inconsistent graph; g1 is selected to remove since it has the highest inconsistent
rank; g2 has the same inconsistent rank with g4 but it is still removed next since it has a higher original inconsistent
rank than g4; then g3, g4, and g5 are eliminated respectively (genes with green color will be remained; red ones are
removed; blue ones are being examined).
Figure 5: Illustration of the consensus clustering on the agreement matrix in Figure 2. (a) List of clusters
corresponding to different agreement thresholds. (b) A hierarchical dendrogram to visually show the way of forming
clusters corresponding to δ. This example also demonstrates the effects of the total agreement and/or the cooling rate
θ: the algorithm always guarantees that large clusters are taken priority and/or that clusters with more pairs of high
agreement genes are joined together first, e.g. in the case of δ = 0.50, g2 and g5 (0.75) are joined first and g4 cannot
join the group although the agreement level between g2 and g4 (0.5) satisfies the δ-rule; this reduces the effect of
breaking down the pattern or non-optimal patterns are formed e.g. (g2∧ g5) compared to (g2∧ g4).
Figure 6: Examining the agreement matrix. (a) Average histograms of the AM on 100 random datasets (random
resampling 10 times from 10 synthetic datasets (Table 1)) for several input numbers of clusters. For each given input
number of clusters k, we have a corresponding histogram of the AM entries for a specific dataset with agreement
level index { }||(||..1 datasetsqrtBl == (B = 20 buckets in this case); the height of each bucket l is proportional to
the number of Mij entries falling into the segment ( )[ ]BlBl /,/1− ; the process is repeated on 100 random datasets to
generate the average histogram for each corresponding k. (b) Histogram of the AM (top) and the CDF lines, AUC
curve, as well as Gap curve on high-noise synthetic sets (Table 1); repeat on five synthetic high-noise sets and take
the corresponding average. (c) Histogram of the AM (top) and the CDF lines, AUC curve, as well as Gap curve on
low-noise synthetic sets (Table 1); repeat on five synthetic low-noise sets and take the corresponding average. k = 6
clusters is the right number of clusters for these datasets. On the AUC graphs, the blue one is the AUC curve on
random datasets obtained from (a).
Page 42 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
3
Figure 7: Illustration of the selection and clustering as well as the effect of varying the confidence level δ. The
objects at the boundaries and the outliers between clusters are eliminated as the agreement threshold δ decreases,
selected clusters become bigger, some more ‘noise’ is added, and thus the confidence of coexpressed and non-
coexpressed also reduces in the case of gene expression data.
Figure 8: An analysis of the synthetic data and selected genes from low-noise set 1 (top) and high-noise set 1
(bottom). Left panel is the original data; middle is the projection of the original data on its two first eigenvectors;
right is the average profiles of selected genes in corresponding patterns (δ=70%). The class structure or patterns of
the datasets can be viewed visually from the low-noise set. The red and the blue patterns (or the green and the cyan
pattern) are close together, implying that one of them could be removed if the confidence level is high. In high-noise
datasets it becomes more difficult to distinguish the patterns.
Figure 9: Selected genes and patterns for the SPOR dataset. The initial number of clusters is seven and six distinct
patterns are identified (3 up- and 3 down- regulation); red – 130 genes, green – 108 genes, blue – 354 genes, cyan –
234 genes, magenta – 246 genes, yellow – 132 genes, and in total 1204 genes are selected out of 2595. Top-left is
the average expression profiles of those six patterns; bottom-left is the corresponding heatmap; the remaining plots
depict average expression profiles of selected genes for each pattern
Figure 10: Selected genes and patterns for the LPS dataset. The initial number of clusters is six and five distinct
patterns are identified(3 up- and 2 down- regulation) in which cyan pattern can be omitted since it is not significant;
early up – 182 genes (red), middle up – 119 genes (green), late up – 284 genes (blue), cyan – 27 genes, down – 1118
genes (magenta), and in total 1730 selected genes out of 3269. Top-left is the average expression profiles of the five
patterns and the remaining plots present the expression profiles of selected genes in the five patterns. The insert in
each figure is the zoom-out of genes in the pattern with expression values from 500 to 1500.
Page 43 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
216x279mm (200 x 200 DPI)
Page 44 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
216x279mm (200 x 200 DPI)
Page 45 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
215x279mm (200 x 200 DPI)
Page 46 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
215x279mm (200 x 200 DPI)
Page 47 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
216x279mm (200 x 200 DPI)
Page 48 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
190x180mm (96 x 96 DPI)
Page 49 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
285x228mm (96 x 96 DPI)
Page 50 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
190x114mm (96 x 96 DPI)
Page 51 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
For Peer Review
190x95mm (96 x 96 DPI)
Page 52 of 53
Mary Ann Liebert, Inc., 140 Huguenot Street, New Rochelle, NY 10801
OMICS: A Journal of Integrative Biology
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960