On Biclustering of Gene Expression Data
Current Bioinformatics, 2010, 5, 204-216
1574-8936/10 $55.00+.00 © 2010 Bentham Science Publishers Ltd.
Anirban Mukhopadhyay*,1, Ujjwal Maulik2 and Sanghamitra Bandyopadhyay3
1Department of Theoretical Bioinformatics, DKFZ (Deutsches Krebsforschungszentrum, German Cancer Research Center), Im Neuenheimer Feld 580, D-69120, Heidelberg, Germany; 2Department of Computer Science and Engineering, Jadavpur University, Kolkata-700032, India; 3Machine Intelligence Unit, Indian Statistical Institute, Kolkata-700108, India
Abstract: Microarray technology enables the simultaneous monitoring of the expression patterns of a huge number of genes across different experimental conditions or time points. Biclustering of microarray data is an important technique for discovering groups of genes that are co-regulated in a subset of experimental conditions. Traditional clustering algorithms find groups of genes/conditions over the complete feature space. Therefore they may fail to discover local patterns in which a subset of genes behaves similarly over a subset of conditions. Biclustering algorithms aim to discover such local patterns from the gene expression matrix, and can thus be thought of as the simultaneous clustering of genes and conditions. In recent years, a large number of biclustering algorithms have been proposed in the literature. This article studies various issues regarding the biclustering problem and gives a comprehensive survey of the available biclustering algorithms. A survey of freely available biclustering software is also included.
Keywords: Microarray, gene expression, biclustering, bicluster types, biclustering algorithms, biclustering software.
1. INTRODUCTION
The classical approach to genomic research was based on the local study and collection of data on single genes. With the advancement of microarray technology, it has now become feasible to have a global and simultaneous view of the expression levels of many thousands of genes across different time points or experimental conditions [1]. In recent years, microarray technology has had a major impact in many fields such as medical diagnosis, biomedicine, characterization of gene functions, understanding of different molecular biological processes, gene expression profiling, etc. [2-5].
The development of microarrays has created new application opportunities for data mining methodologies. Microarray chips measure the expression levels of a large number of genes and hence produce large amounts of data. Because of this volume, computational analysis is essential for extracting knowledge from microarray gene expression data. Clustering is one of the primary approaches for analyzing such large amounts of data to discover groups of co-expressed genes.
Clustering [6], an important microarray analysis tool, has been used to identify sets of genes with similar expression profiles. In some early works, visual analysis was successfully used for grouping genes into functionally relevant classes in Yeast cell cycle [3, 7] and Human large B-cell lymphoma [2] data sets. However, as these methods were very subjective, standard clustering methods, such as K-means [8], fuzzy C-means [9], hierarchical methods [4], Self Organizing Maps (SOM) [10], graph theoretic approach [11], simulated annealing based approach [12] and genetic algorithm (GA) based clustering methods [13] have been utilized for clustering microarray data.
*Address correspondence to this author, on leave from Department of Computer Science and Engineering, University of Kalyani, Kalyani – 741235, India; Tel: +91 33 2580 9618; Fax: +91 33 2582 8282; E-mail: [email protected]
Clustering algorithms have been applied to microarray data either to group the genes across the time points or experimental conditions/samples [10, 13-16] or to group the samples across the genes [17-19]. Clustering techniques that aim to find clusters of genes over all the experimental conditions may fail to discover genes having similar expression patterns over only a subset of conditions. Similarly, a clustering algorithm that groups the conditions/samples across all the genes may not capture a group of samples having similar expression values for a subset of genes. It is often the case that a subset of genes is co-regulated and co-expressed across a subset of experimental conditions and has quite different expression patterns over the remaining conditions. Traditional clustering methods are not able to identify such local patterns, usually termed biclusters.
Thus biclustering can be thought of as the simultaneous clustering of genes and conditions instead of clustering them separately. The aim of biclustering algorithms is to discover subsets of genes that are co-regulated over subsets of experimental conditions. Hence they provide a better reflection of the biological reality.
Although biclustering is a relatively new approach to gene expression data, it has a fast-growing literature. In this article, we discuss several issues of biclustering and give a comprehensive review of the recent literature. The rest of the article is organized as follows: the next section describes the structure of a microarray gene expression data set. In Section 3, the biclustering problem is defined formally and different related definitions are
provided. In Section 4, a discussion is made on the available
biclustering algorithms. Section 5 describes some publicly
available biclustering software. Section 6 concludes the
article.
2. MICROARRAY GENE EXPRESSION DATA
A microarray [20] is a small chip onto which a large number of DNA molecules (probes) are attached in fixed grids. The chip is made of chemically coated glass, nylon, membrane or silicon. Each grid cell of a microarray chip corresponds to a DNA sequence. There are mainly two types of microarrays, viz., two-channel microarrays and single-channel microarrays [21]. In two-channel microarrays (also called two-color microarrays), two mRNA samples are reverse-transcribed into cDNA (targets) labelled using different fluorescent dyes (red-fluorescent dye Cy5 and green-fluorescent dye Cy3). Due to the complementary nature of the base-pairs, the cDNA binds to the specific oligonucleotides on the array. In the subsequent stage, the dyes are excited by a laser so that the amount of cDNA can be quantified by measuring the fluorescence intensities. The log ratio of the intensities of the two dyes is used as the gene expression level:
$$\text{gene expression level} = \log_2 \frac{\text{Intensity}(Cy5)}{\text{Intensity}(Cy3)}. \tag{1}$$
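As a small illustration of Eqn. 1, the following sketch computes the log ratio for a single spot. The function name and the intensity values are hypothetical and chosen only for illustration; background correction and normalization, which a real pipeline would need, are not shown.

```python
import math

def log_ratio(cy5_intensity, cy3_intensity):
    """Two-channel expression level as in Eqn. 1: log2(Cy5 / Cy3)."""
    if cy5_intensity <= 0 or cy3_intensity <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(cy5_intensity / cy3_intensity)

# Hypothetical spot intensities, for illustration only.
print(log_ratio(800.0, 200.0))  # 2.0  (four-fold higher in the Cy5 sample)
print(log_ratio(200.0, 800.0))  # -2.0 (four-fold lower in the Cy5 sample)
print(log_ratio(500.0, 500.0))  # 0.0  (no difference)
```

The logarithm makes up- and down-regulation symmetric around zero, which is why the ratio is not used directly.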
Although absolute levels of gene expression may be determined using two-channel microarrays, the system is more useful for determining relative differences in gene expression within a sample and between samples. Single-channel microarrays (also called one-color microarrays) are prepared to estimate the absolute levels of gene expression, thus requiring two separate single-dye hybridizations for the comparison of two sets of conditions. As only a single dye is used, the data represent absolute values of gene expression. An advantage of single-channel microarrays is that data are more easily compared to arrays from different experiments. However, in a single-channel system, one needs twice as many microarrays to compare the samples within an experiment.
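Mathematically, a microarray data set can be viewed as a $G \times C$ matrix $A(\mathcal{G}, \mathcal{C})$ that represents the expression levels of a set of $G$ genes $\mathcal{G} = \{I_1, I_2, \ldots, I_G\}$ over a set of $C$ conditions $\mathcal{C} = \{J_1, J_2, \ldots, J_C\}$. Each element $m_{ij}$ of matrix $A(\mathcal{G}, \mathcal{C})$ represents the expression level of the $i$th gene at the $j$th condition, where $i \in \mathcal{G}$ and $j \in \mathcal{C}$ (Eqn. 2).
$$A(\mathcal{G}, \mathcal{C}) = \begin{array}{c|cccc} & J_1 & J_2 & \cdots & J_C \\ \hline I_1 & m_{11} & m_{12} & \cdots & m_{1C} \\ I_2 & m_{21} & m_{22} & \cdots & m_{2C} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ I_G & m_{G1} & m_{G2} & \cdots & m_{GC} \end{array} \tag{2}$$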
3. BICLUSTERING PROBLEM AND DEFINITIONS
Given a $G \times C$ microarray data matrix $A(\mathcal{G}, \mathcal{C})$ consisting of a set of $G$ genes $\mathcal{G} = \{I_1, I_2, \ldots, I_G\}$ and a set of $C$ conditions $\mathcal{C} = \{J_1, J_2, \ldots, J_C\}$, a bicluster can be defined as follows:
Definition 1 (Bicluster) A bicluster is a submatrix $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$, of matrix $A(\mathcal{G}, \mathcal{C})$, where $I \subseteq \mathcal{G}$ and $J \subseteq \mathcal{C}$, and the subset of genes in the bicluster are similarly expressed over the subset of conditions.
The problem of biclustering is thus to identify a set of biclusters from a given data matrix according to some coherence criterion that evaluates the quality of the biclusters. In general, the complexity of a biclustering problem depends on the coherence criterion used. However, in almost all cases, the biclustering problem is known to be NP-complete. Therefore, a number of approaches use heuristics for discovering biclusters from a gene expression matrix. Depending on how the genes in a bicluster are similar to each other under the experimental conditions, biclusters can be categorized into different types. The following subsection provides the definitions of the different types of biclusters.
3.1. Types of Biclusters
There are mainly six types of biclusters, viz., (1) biclusters with constant values, (2) biclusters with constant rows, (3) biclusters with constant columns, (4) biclusters with additive pattern, (5) biclusters with multiplicative pattern and (6) biclusters with both additive and multiplicative pattern. The additive and multiplicative patterns are also referred to as shifting and scaling patterns, respectively [22]. The different types of biclusters are defined as follows:
Definition 2 (Biclusters with Constant Values) In a bicluster $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$, with constant values, all the elements have the same value, i.e.,
$$m_{ij} = \mu, \quad i \in I, j \in J. \tag{3}$$
Definition 3 (Biclusters with Constant Rows) In a bicluster $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$, with constant rows, all the elements of each row of the bicluster have the same value. Hence in this type of bicluster, each element is represented using one of the following notations:
$$m_{ij} = \mu + a_i, \quad i \in I, j \in J, \tag{4}$$
$$m_{ij} = \mu \times b_i, \quad i \in I, j \in J, \tag{5}$$
$$m_{ij} = \mu \times b_i + a_i, \quad i \in I, j \in J. \tag{6}$$
Here $\mu$ is a constant value for a bicluster, $a_i$ is the additive (shifting) factor for row $i$ and $b_i$ is the multiplicative (scaling) factor for row $i$.
Definition 4 (Biclusters with Constant Columns) In a bicluster $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$, with constant columns, all the elements of each column of the bicluster have the same value. Hence in this type of bicluster, each element is represented using one of the following notations:
$$m_{ij} = \mu + p_j, \quad i \in I, j \in J, \tag{7}$$
$$m_{ij} = \mu \times q_j, \quad i \in I, j \in J, \tag{8}$$
$$m_{ij} = \mu \times q_j + p_j, \quad i \in I, j \in J. \tag{9}$$
Here $\mu$ is a constant value for a bicluster, $p_j$ is the additive (shifting) factor for column $j$ and $q_j$ is the multiplicative (scaling) factor for column $j$.
Definition 5 (Biclusters with Additive Pattern) In a bicluster $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$, with additive (shifting) pattern, each column and row has only some additive (shifting) factors. Hence in this type of bicluster, each element is represented as:
$$m_{ij} = \mu + a_i + p_j, \quad i \in I, j \in J. \tag{10}$$
Definition 6 (Biclusters with Multiplicative Pattern) In a bicluster $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$, with multiplicative (scaling) pattern, each column and row has only some multiplicative (scaling) factors. Hence in this type of bicluster, each element is represented as:
$$m_{ij} = \mu \times b_i \times q_j, \quad i \in I, j \in J. \tag{11}$$
Definition 7 (Biclusters with both Additive and Multiplicative Patterns) In a bicluster $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$, with both additive (shifting) and multiplicative (scaling) pattern, each column and row has both additive (shifting) and multiplicative (scaling) factors. Hence in this type of bicluster, each element is represented as:
$$m_{ij} = \mu \times b_i \times q_j + a_i + p_j, \quad i \in I, j \in J. \tag{12}$$
Note that these biclusters are the most general form of biclusters. All other types of biclusters are special cases of these biclusters.
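The claim that the other bicluster types are special cases of the general model can be checked with a short sketch. The helper `make_bicluster` and all numeric factors below are hypothetical; the code simply evaluates $m_{ij} = \mu \times b_i \times q_j + a_i + p_j$ and obtains the simpler types by setting additive factors to 0 and multiplicative factors to 1.

```python
def make_bicluster(mu, a, b, p, q):
    """Build m_ij = mu * b_i * q_j + a_i + p_j (the general model, Eqn. 12).

    a, b are per-row factors; p, q are per-column factors.
    """
    return [[mu * b[i] * q[j] + a[i] + p[j] for j in range(len(q))]
            for i in range(len(b))]

n_rows, n_cols = 3, 4
zeros_r, ones_r = [0.0] * n_rows, [1.0] * n_rows
zeros_c, ones_c = [0.0] * n_cols, [1.0] * n_cols

# Constant values: no shifting, no scaling.
constant = make_bicluster(2.0, zeros_r, ones_r, zeros_c, ones_c)

# Additive pattern: only shifting factors (hypothetical values).
additive = make_bicluster(2.0, [0.0, 1.0, 3.0], ones_r,
                          [0.0, 0.5, 1.0, 2.0], ones_c)

# Multiplicative pattern: only scaling factors (hypothetical values).
multiplicative = make_bicluster(2.0, zeros_r, [1.0, 2.0, 0.5],
                                zeros_c, [1.0, 3.0, 2.0, 4.0])

for name, m in [("constant", constant), ("additive", additive),
                ("multiplicative", multiplicative)]:
    print(name)
    for row in m:
        print("  ", row)
```

In the additive example any two rows differ by a constant offset across all columns, while in the multiplicative example they differ by a constant ratio.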
3.2. Some Important Definitions
Here we discuss some important terms regarding
biclusters and the biclustering problem.
Definition 8 (Bicluster Variance) The bicluster variance $VARIANCE(I, J)$ of a bicluster $M(I, J)$ is defined as follows:
$$VARIANCE(I, J) = \sum_{i \in I, j \in J} (m_{ij} - m_{IJ})^2, \tag{13}$$
where $m_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} m_{ij}$, i.e., the mean of the elements in the bicluster.
Fig. (1). Examples of different types of biclusters: (a) Constant, (b) Row-constant, (c) Column-constant, (d) Additive Pattern, (e) Multiplicative Pattern, (f) Both Additive and Multiplicative Patterns.
Definition 9 (Residue) The residue $r_{ij}$ of any element $m_{ij}$ of a bicluster $M(I, J)$ is defined as:
$$r_{ij} = m_{ij} - m_{iJ} - m_{Ij} + m_{IJ}, \tag{14}$$
where $m_{iJ}$ is the mean of the $i$th row, i.e., $m_{iJ} = \frac{1}{|J|} \sum_{j \in J} m_{ij}$, $m_{Ij}$ is the mean of the $j$th column, i.e., $m_{Ij} = \frac{1}{|I|} \sum_{i \in I} m_{ij}$, and $m_{IJ}$ is the mean of all the elements in the bicluster, i.e., $m_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} m_{ij}$.
Definition 10 (Mean Squared Residue) The mean squared residue ($MSR(I, J)$) of a bicluster $M(I, J)$ is defined as:
$$MSR(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} r_{ij}^2. \tag{15}$$
The mean squared residue score of a bicluster represents the level of coherence among the elements of the bicluster. A lower residue score indicates greater coherence and thus better quality of the bicluster.
Definition 11 (Row Variance) The row variance $VAR(I, J)$ of a bicluster $M(I, J)$ is defined as:
$$VAR(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (m_{ij} - m_{iJ})^2. \tag{16}$$
A high row variance indicates that the rows (genes) of the bicluster vary strongly across the conditions. High row variance is sometimes desirable in order to avoid trivial constant biclusters.
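The three measures defined above can be computed directly from their definitions. The sketch below is a minimal, unoptimized illustration; the function name `bicluster_stats` and the tiny example matrices are hypothetical. It shows that an additive bicluster has zero MSR while a multiplicative one does not.

```python
def bicluster_stats(m):
    """Return (VARIANCE, MSR, VAR) of a bicluster given as a list of rows.

    Follows Eqns. 13, 15 and 16; the residue is Eqn. 14.
    """
    n_i, n_j = len(m), len(m[0])
    row_mean = [sum(row) / n_j for row in m]
    col_mean = [sum(m[i][j] for i in range(n_i)) / n_i for j in range(n_j)]
    all_mean = sum(row_mean) / n_i
    cells = [(i, j) for i in range(n_i) for j in range(n_j)]
    variance = sum((m[i][j] - all_mean) ** 2 for i, j in cells)
    msr = sum((m[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
              for i, j in cells) / (n_i * n_j)
    var = sum((m[i][j] - row_mean[i]) ** 2 for i, j in cells) / (n_i * n_j)
    return variance, msr, var

additive = [[1.0, 3.0], [5.0, 7.0]]        # row shift 4, column shift 2
multiplicative = [[1.0, 2.0], [2.0, 4.0]]  # row scale 2, column scale 2
constant = [[5.0, 5.0], [5.0, 5.0]]

print(bicluster_stats(additive))        # (20.0, 0.0, 1.0)
print(bicluster_stats(multiplicative))  # (4.75, 0.0625, 0.625)
print(bicluster_stats(constant))        # (0.0, 0.0, 0.0)
```

The constant bicluster scores zero on every measure, which is why a nonzero row variance is often required in addition to a low MSR.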
4. BICLUSTERING ALGORITHMS
In recent years, a large number of biclustering algorithms
have been proposed for gene expression data analysis. In this
section, we discuss some popular biclustering algorithms in
different categories such as iterative greedy search,
randomized greedy search, evolutionary techniques, graph
based algorithms, fuzzy methods etc.
4.1. Iterative Greedy Search
The concept of biclustering was first introduced by Hartigan [23] in the form of direct clustering. As a coherence measure of a bicluster $M(I, J)$, the bicluster variance $VARIANCE(I, J)$ was used (Eqn. 13). The goal of the algorithm was to extract $K$ biclusters from the given data set while minimizing the sum of the bicluster variances of the $K$ biclusters. In each iteration, the algorithm partitions the data matrix into a set of submatrices, each of which is considered as a bicluster. For a constant bicluster, $VARIANCE(I, J)$ is zero. Since every single element of the data matrix trivially satisfies this zero-variance criterion, the partitioning could degenerate into one bicluster per element; to avoid this, the algorithm was executed only until the data matrix was partitioned into $K$ submatrices. Hartigan's algorithm was able to detect constant biclusters only. However, he proposed using other homogeneity criteria to detect other types of biclusters.
Cheng and Church first introduced the biclustering problem in the case of microarray gene expression data [24]. They introduced the coherence measure called Mean Squared Residue (MSR) (Eqn. 15). Cheng and Church proposed a greedy search heuristic that searches for the largest possible bicluster keeping the MSR under a threshold $\delta$ (called a $\delta$-bicluster). The algorithm has two phases. In the first phase, starting with the complete data matrix, rows and columns are deleted in order to bring the MSR score below $\delta$. In this regard, Cheng and Church suggested a greedy heuristic to rapidly converge to a locally maximal submatrix with MSR score below $\delta$. In the second phase, rows and columns are added as long as the MSR score does not increase. The same procedure is executed for $K$ iterations in order to discover $K$ $\delta$-biclusters. At each iteration, the bicluster found in the previous iteration is masked with random values in order to avoid overlaps. Since the MSR score is zero for biclusters with constant values, constant rows, constant columns and additive patterns, the Cheng and Church algorithm is able to detect these kinds of biclusters only. Moreover, the algorithm is known to often get stuck at local optima and also suffers from random interference due to the masking of biclusters with random values.
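The deletion phase described above can be sketched as follows. This is a simplified illustration, not the full Cheng and Church procedure: it removes one row or column at a time, always the one with the largest mean squared residue, and it omits the multiple-node deletion, the addition phase and the random masking. The function names and the example matrix are hypothetical.

```python
def residue_scores(a, rows, cols):
    """Return the MSR of a[rows][cols] plus per-row and per-column scores."""
    row_mean = {i: sum(a[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(a[i][j] for i in rows) / len(rows) for j in cols}
    mean = sum(row_mean.values()) / len(rows)
    sq = {(i, j): (a[i][j] - row_mean[i] - col_mean[j] + mean) ** 2
          for i in rows for j in cols}
    msr = sum(sq.values()) / (len(rows) * len(cols))
    row_score = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
    col_score = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
    return msr, row_score, col_score

def single_node_deletion(a, delta):
    """Greedily delete rows/columns until the MSR is at most delta."""
    rows = list(range(len(a)))
    cols = list(range(len(a[0])))
    msr, row_score, col_score = residue_scores(a, rows, cols)
    while msr > delta and len(rows) > 1 and len(cols) > 1:
        worst_row = max(rows, key=row_score.get)
        worst_col = max(cols, key=col_score.get)
        # Remove whichever node contributes most to the residue.
        if row_score[worst_row] >= col_score[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
        msr, row_score, col_score = residue_scores(a, rows, cols)
    return rows, cols, msr

# Three additive rows plus one incoherent row (hypothetical data).
data = [[1.0, 2.0, 3.0],
        [2.0, 3.0, 4.0],
        [4.0, 5.0, 6.0],
        [9.0, 1.0, 5.0]]
rows, cols, msr = single_node_deletion(data, delta=0.1)
print(rows, cols)  # [0, 1, 2] [0, 1, 2]
```

Because each step only looks at the current worst node, the result is a local optimum, which matches the limitation noted in the text.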
In [25], the authors extended the concept of the $\delta$-bicluster to cope with the problem of masking the missing values as well as masking the biclusters found in the previous iteration with random values. In this algorithm, the residue of a specified (non-missing) element in a bicluster is computed as per Eqn. 14, but the residue of an unspecified (missing) element is taken to be zero. This algorithm allows the biclusters to overlap and is thus termed FLexible Overlapped biClustering (FLOC). The FLOC algorithm begins with an initial set of biclusters (seeds) and iteratively improves the overall quality of the biclustering. At each iteration, each row and column is moved among the biclusters to yield a better biclustering in terms of lower MSR. The best biclustering obtained during an iteration is used as the initial biclustering seed in the next iteration. The algorithm terminates automatically when the current iteration fails to improve the overall biclustering quality. Thus FLOC is able to evolve $k$ biclusters simultaneously. However, this algorithm also can only identify constant and additive patterns, and fails to detect multiplicative patterns.
In [26], an algorithm called Order Preserving Sub-matrix (OPSM) is proposed. Here a bicluster is defined as a submatrix where the order of the selected conditions is preserved for all of the selected genes. Hence, the expression values of the genes within a bicluster induce an identical linear ordering across the selected conditions. The authors proposed a deterministic iterative algorithm to find large and statistically significant biclusters. The time complexity of this technique is $O(G C^3 k)$ where $G$ and $C$ are the number of genes and conditions of the input data set, respectively
and k is the number of biclusters found. Thus OPSM does
not scale well for high-dimensional data sets.
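The order-preserving property itself is easy to test for a candidate submatrix, even though searching for large OPSMs is expensive. The sketch below only checks the definition and is not the search algorithm of [26]; the function names and data are hypothetical, and ties between expression values are assumed not to occur.

```python
def induced_order(values):
    """Return the column positions sorted by increasing expression value."""
    return tuple(sorted(range(len(values)), key=lambda k: values[k]))

def is_order_preserving(a, rows, cols):
    """True if all selected rows induce the same ordering of the columns."""
    orders = {induced_order([a[i][j] for j in cols]) for i in rows}
    return len(orders) == 1

# Hypothetical expression matrix: rows 0-2 all satisfy col0 < col2 < col1,
# although their magnitudes differ widely; row 3 follows another order.
data = [[1.0, 5.0, 3.0],
        [10.0, 40.0, 20.0],
        [0.2, 9.0, 0.5],
        [7.0, 1.0, 3.0]]

print(is_order_preserving(data, [0, 1, 2], [0, 1, 2]))     # True
print(is_order_preserving(data, [0, 1, 2, 3], [0, 1, 2]))  # False
```

Note that the criterion depends only on ranks, so it tolerates shifting and scaling as well as any other monotone distortion of a row.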
In [27] and [28], the authors proposed the Iterative Signature Algorithm (ISA), where a bicluster is considered to be a transcription module, i.e., a set of co-regulated genes together with the associated set of regulating conditions. The algorithm starts with an initial set of genes, and all samples are scored with respect to this gene set. The samples for which the score exceeds a predefined threshold are chosen. Similarly, all genes are scored with respect to the selected samples and a new set of genes is selected based on another user-defined threshold. This procedure is iterated until the set of genes and the set of samples converge, i.e., do not change anymore. ISA can discover more than one bicluster by starting with different initial gene sets. The choice of the initial reference gene set plays an important role in the quality of the results obtained by ISA. ISA is highly sensitive to the threshold values and often tends to identify the same strong bicluster many times.
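The alternating scoring loop can be sketched as below. This is a deliberately simplified illustration: the score of a sample (or gene) is assumed here to be the plain mean expression over the current gene (or sample) set, whereas the published ISA works on normalized matrices with thresholds expressed in units of standard deviation. The function name, thresholds and data are hypothetical.

```python
def isa_sketch(a, seed_genes, gene_threshold, sample_threshold, max_iter=100):
    """Alternate between selecting samples and genes until both sets are stable."""
    n_genes, n_samples = len(a), len(a[0])
    genes, samples = set(seed_genes), set()
    for _ in range(max_iter):
        # Score every sample against the current gene set.
        new_samples = {
            j for j in range(n_samples)
            if sum(a[i][j] for i in genes) / len(genes) > sample_threshold
        } if genes else set()
        # Score every gene against the newly selected samples.
        new_genes = {
            i for i in range(n_genes)
            if sum(a[i][j] for j in new_samples) / len(new_samples) > gene_threshold
        } if new_samples else set()
        if new_genes == genes and new_samples == samples:
            break  # fixed point reached
        genes, samples = new_genes, new_samples
    return genes, samples

# Hypothetical data: genes 0-2 are strongly expressed in samples 0-1.
data = [[5.0, 5.0, 0.0, 0.0],
        [5.0, 6.0, 0.0, 1.0],
        [6.0, 5.0, 1.0, 0.0],
        [0.0, 1.0, 0.0, 1.0],
        [1.0, 0.0, 1.0, 0.0]]

print(isa_sketch(data, {0}, gene_threshold=3.0, sample_threshold=3.0))
# ({0, 1, 2}, {0, 1})
```

Starting from gene 1 instead of gene 0 converges to the same module in this example, which illustrates how different seeds can rediscover one strong bicluster.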
In xMotif biclustering [29], biclusters containing genes that are almost constantly expressed across the selected conditions are identified. At first, each gene is assigned a set of statistically significant states which define the set of valid biclusters. In xMotif, a bicluster is considered to be a submatrix where each gene is in exactly the same state for all the selected conditions. The aim is to identify the largest valid bicluster. To this end, an iterative search method is proposed that is run on different initial random seeds. It should be noted that the xMotif framework requires pre-identification of the classes of biclusters present in the data, which may not be feasible for most real-life data sets.
In general, greedy search algorithms scale well in large
data sets. However, they mainly suffer from the problem of
getting stuck at local optima depending on the initial
configuration.
4.2. Two-Way Clustering
In [30], the authors present a coupled two-way clustering
(CTWC) approach to gene microarray data analysis. The
main idea is to identify subsets of the genes and samples,
such that when one of these is used to cluster the other,
stable and significant partitions emerge. They present an
algorithm, based on iterative clustering, that performs such a
search. This two-way clustering algorithm repeatedly
performs one-way clustering on the rows and columns of the
data matrix using stable clusters of rows as attributes for
column clustering and vice-versa. Although the authors used
hierarchical clustering, any reasonable choice of clustering
method and definition of stable cluster can be used within
the framework of CTWC. As a preprocessing step, they used
normalization which allowed them to capture biclusters with
constant columns also.
Interrelated Two-Way Clustering (ITWC) [31], an algorithm similar to CTWC, combines the results of one-way clustering on both dimensions of the gene expression matrix for producing biclusters. As a preprocessing step, the rows of the data matrix are first normalized. Thereafter, the vector-angle cosine value between each row and a predefined stable pattern is computed to determine whether the row values vary much among the columns. The rows with very little variation are then removed. After that, the correlation coefficient is used to measure the strength of the linear relationship between two rows or two columns, to perform the two-way clustering. As the correlation coefficient is independent of the magnitude and only depends on the pattern, ITWC is able to detect both additive and multiplicative biclusters.
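The invariance that makes the correlation coefficient suitable for both pattern types can be verified numerically. The sketch below assumes the Pearson correlation coefficient (the text only says "correlation coefficient") and uses hypothetical expression profiles.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

base = [1.0, 2.0, 4.0, 3.0]
shifted = [v + 10.0 for v in base]         # additive (shifting) pattern
scaled = [v * 3.0 for v in base]           # multiplicative (scaling) pattern
both = [v * 3.0 + 10.0 for v in base]      # shifting and scaling together
unrelated = [4.0, 1.0, 2.0, 3.0]

for name, profile in [("shifted", shifted), ("scaled", scaled),
                      ("both", both), ("unrelated", unrelated)]:
    print(name, round(pearson(base, profile), 6))
```

The first three profiles correlate perfectly with the base profile (up to floating-point rounding), whereas the unrelated profile does not.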
The Double Conjugated Clustering (DCC) [32] algorithm is a node-driven algorithm that unifies the two viewpoints of microarray clustering, viz., clustering the samples taking the genes as the features and clustering the genes taking the samples as the features. DCC performs both tasks simultaneously to achieve a unified clustering where the sample clusters are discriminated by subsets of genes. The clusterings in sample space and gene space are synchronized by a projection of nodes between the spaces, mapping the sample clusters to the corresponding gene clusters. The method may utilize any relevant clustering technique such as SOM or K-means. The data does not scatter across all offered nodes due to the projection between the two clustering spaces. The DCC algorithm can provide sharp clusters and empty nodes even when the number of nodes exceeds the number of clusters. However, DCC can only find constant biclusters in the input data set.
The two-way clustering algorithms in general cluster the
data set from both the dimensions (rows and columns) and
finally try to combine the clustering of the two dimensions in
order to obtain the biclusters. However, there is no standard
rule for the choice of the number of clusters in both the gene
and condition dimensions.
4.3. Evolutionary Biclustering
Evolutionary algorithms, like Genetic Algorithms (GA)
[33] and Simulated Annealing (SA) [34] have been used
extensively in the biclustering problem. Some of these
algorithms are described below.
4.3.1. GA Based Biclustering
In [35], a genetic algorithm based biclustering framework has been developed. As an encoding strategy, the authors use a binary string of length $G + C$, where $G$ and $C$ denote the number of genes and the number of conditions/samples/time points, respectively. If a bit position is `1', then the corresponding gene or condition is selected in the bicluster, and if a bit position is `0', the corresponding gene or condition is not selected in the bicluster. Hence, each chromosome encodes one possible bicluster. The following fitness function $F$ is minimized:
$$F = \begin{cases} \dfrac{1}{|I||J|} & \text{if } MSR(I, J) \le \delta, \\ \dfrac{MSR(I, J)}{\delta} & \text{otherwise.} \end{cases} \tag{17}$$
Hence, if the MSR of the bicluster encoded in a chromosome is less than the threshold $\delta$ (i.e., it is a $\delta$-bicluster), the objective is to maximize the volume. Otherwise, the objective is to minimize the MSR. The algorithm employs a special selection operator called environment selection to maintain the diversity of the population in order to identify a set of biclusters in one run. A local search strategy is used to expedite the rate of convergence. As the local search, one iteration of the Cheng and Church node deletion and addition algorithm is executed before computing the fitness value of a chromosome. The chromosome is also updated with the new bicluster obtained after the local search. Standard uniform crossover and bit-flip mutation operators are adopted for generating the next generation.
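The binary encoding and the fitness of Eqn. 17 can be sketched as follows. Two points are assumptions made for this illustration: the second branch of the fitness is taken to be MSR divided by $\delta$, and a chromosome selecting no gene or no condition is simply given the worst possible fitness. The local search and the genetic operators are not shown, and all names and data are hypothetical.

```python
def decode(chromosome, n_genes):
    """Split a binary string of length G + C into selected rows and columns."""
    rows = [i for i in range(n_genes) if chromosome[i] == 1]
    cols = [k - n_genes for k in range(n_genes, len(chromosome))
            if chromosome[k] == 1]
    return rows, cols

def msr(a, rows, cols):
    """Mean squared residue (Eqn. 15) of the submatrix a[rows][cols]."""
    row_mean = {i: sum(a[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(a[i][j] for i in rows) / len(rows) for j in cols}
    mean = sum(row_mean.values()) / len(rows)
    return sum((a[i][j] - row_mean[i] - col_mean[j] + mean) ** 2
               for i in rows for j in cols) / (len(rows) * len(cols))

def fitness(a, chromosome, delta):
    """Fitness to be minimized: favors large biclusters once MSR <= delta."""
    rows, cols = decode(chromosome, len(a))
    if not rows or not cols:
        return float("inf")  # assumption: empty biclusters are never preferred
    score = msr(a, rows, cols)
    if score <= delta:
        return 1.0 / (len(rows) * len(cols))
    return score / delta

# Hypothetical 2 x 3 data matrix; the first two columns form an additive block.
data = [[1.0, 3.0, 9.0],
        [5.0, 7.0, 0.0]]

print(fitness(data, [1, 1, 1, 1, 0], delta=0.5))        # 0.25
print(fitness(data, [1, 1, 1, 1, 1], delta=0.5) > 1.0)  # True
```

With this form, every $\delta$-bicluster has fitness at most 1 and every other bicluster has fitness above 1, so coherent biclusters are always preferred and compete among themselves on volume.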
A similar GA based biclustering approach can be found in [36]. Here, instead of using the Cheng and Church algorithm as a local search strategy in each step of fitness computation, it is used only once, initially. The initial population consists of bicluster seeds generated through K-means clustering in both dimensions and combining the gene and sample clusters. Thereafter these seeds are grown through the Cheng and Church algorithm. Subsequently the normal GA process follows. As the fitness function, the authors minimized the ratio of MSR to the volume of the biclusters in order to capture large yet coherent biclusters.
Another GA based biclustering, called Sequential
Evolutionary BIclustering (SEBI) is proposed in [37]. In this
work also, the authors use binary chromosomes as discussed
above. SEBI minimizes the following fitness function:
$$F = \frac{MSR(I, J)}{\delta} + \frac{1}{VAR(I, J)} + w_d + penalty, \tag{18}$$
where $w_d = w_V \left( w_r \frac{\delta}{|I|} + w_c \frac{\delta}{|J|} \right)$. Here $w_V$, $w_r$ and $w_c$ represent weights on the volume, the number of rows and the number of columns in the bicluster, respectively. Also $penalty = \sum_{i \in I, j \in J} w_p(m_{ij})$, where $w_p(m_{ij})$ is a weight associated with each element $m_{ij}$ of the bicluster and it is defined as:
$$w_p(m_{ij}) = \begin{cases} 0 & \text{if } |COV(m_{ij})| = 0, \\ \dfrac{\sum_{k \in I, l \in J} e^{-|COV(m_{kl})|}}{e^{-|COV(m_{ij})|}} & \text{if } |COV(m_{ij})| > 0. \end{cases} \tag{19}$$
Here $|COV(m_{ij})|$ denotes the number of biclusters containing $m_{ij}$. The weight $w_p(m_{ij})$ is used to control the amount of overlap among the biclusters. Binary tournament selection is used. Three crossover operators, namely one-point, two-point and uniform crossover, have been studied. Three mutation operators, namely standard bit-flip mutation, mutation by adding a row and mutation by adding a column, have also been studied. SEBI does not use any local search strategy for updating the chromosomes.
All the above algorithms use chromosomes of length equal to the number of genes plus the number of conditions. Thus the chromosomes are very large if the data set is large. This may cause operators like crossover and mutation to take longer, thus slowing down the convergence. Taking this into account, a novel encoding strategy is proposed in GA based Biclustering (GABI) [38]. Here each string has two parts: one for clustering the genes, and another for clustering the conditions. If $M$ and $N$ denote the maximum number of gene clusters and the maximum number of condition clusters, respectively, then the length of each string is $M + N$. The first $M$ positions represent the $M$ cluster centers for the genes, and the remaining $N$ positions represent the $N$ cluster centers for the conditions. Thus a string looks like the following: $\{gc_1\ gc_2\ \ldots\ gc_M\ cc_1\ cc_2\ \ldots\ cc_N\}$, where each $gc_i$, $i = 1 \ldots M$, represents the index of a gene that acts as the cluster center of a set of genes, and each $cc_j$, $j = 1 \ldots N$, represents the index of a condition that acts as the cluster center of a set of conditions. For a data set having $n$ points, it is usual to assume that the data set may contain at most $\sqrt{n}$ clusters. Taking this into account, the values of the maximum number of gene clusters ($M$) and the maximum number of condition clusters ($N$) are taken as $\lceil \sqrt{G} \rceil$ and $\lceil \sqrt{C} \rceil$, respectively. Here $G$ and $C$ denote the number of genes and the number of conditions in the data set, respectively. The first $M$ positions can have values in the range $\{0, 1, 2, \ldots, G\}$ and the next $N$ positions can have values in the range $\{0, 1, 2, \ldots, C\}$. Hence the gene and condition cluster centers are represented by indices of the genes and conditions, respectively, while a 0 value at any position means the absence of a cluster center. A string that encodes $M$ gene clusters and $N$ condition clusters represents a set of $M \times N$ biclusters, taking each pair of gene and condition clusters. Each pair $\langle gc_i, cc_j \rangle$, $i = 1 \ldots M$, $j = 1 \ldots N$, represents a bicluster that consists of all genes of the gene cluster centered at gene $gc_i$, and all conditions of the condition cluster centered at condition $cc_j$. During the fitness computation, the gene clusters and condition clusters encoded in the chromosome are updated in a K-means like iteration. The fitness function of a bicluster is defined as follows:
$$F = \frac{MSR(I, J)}{\delta \cdot (1 + VAR(I, J))}. \tag{20}$$
The denominator of $F$ is chosen in such a way as to avoid an accidental divide-by-zero condition when the row variance $VAR(I, J)$ becomes 0. $F$ is minimized to obtain highly coherent yet "interesting" (high variance) biclusters. For each encoded $\delta$-bicluster, the fitness function $F$ is computed. The fitness of a chromosome is then computed as the mean of the fitness values of all the encoded $\delta$-biclusters in it. Conventional roulette wheel selection and uniform crossover operation are used in GABI. The mutation operation works as follows. A random position is chosen from the first $M$ positions and its value is replaced by an index randomly chosen from the range $\{0, 1, 2, \ldots, G\}$, where $G$ is the total number of genes. Similarly, to mutate the condition portion of the string, a random position is selected from the next $N$ positions and its value is substituted using a randomly selected index from the range $\{0, 1, 2, \ldots, C\}$, where $C$ is the total number of conditions. Elitism is used to track the best string found until the current generation.
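How a center-based string expands into a set of biclusters can be sketched as below. The text does not specify how genes and conditions are assigned to centers, so this illustration assumes nearest-center assignment by squared Euclidean distance (one K-means-like assignment step, without the center update). Indices in the string are treated as 1-based with 0 meaning "no center", as implied by the value ranges above. All function names and data are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def assign(profiles, centers):
    """Map each center index to the indices of the profiles closest to it."""
    clusters = {c: [] for c in centers}
    for k, profile in enumerate(profiles):
        nearest = min(centers, key=lambda c: sq_dist(profile, profiles[c]))
        clusters[nearest].append(k)
    return clusters

def decode(a, string, m):
    """Expand a string of M gene centers and N condition centers into biclusters."""
    # Stored indices are 1-based; 0 means no cluster center at this position.
    gene_centers = list(dict.fromkeys(g - 1 for g in string[:m] if g != 0))
    cond_centers = list(dict.fromkeys(c - 1 for c in string[m:] if c != 0))
    if not gene_centers or not cond_centers:
        return []
    columns = [list(col) for col in zip(*a)]
    gene_clusters = assign(a, gene_centers)
    cond_clusters = assign(columns, cond_centers)
    # Every (gene cluster, condition cluster) pair is one bicluster.
    return [(gene_clusters[g], cond_clusters[c])
            for g in gene_centers for c in cond_centers]

# Hypothetical 4 x 4 data set, so M = N = 2.
data = [[1.0, 1.0, 9.0, 9.0],
        [1.0, 2.0, 9.0, 8.0],
        [7.0, 7.0, 0.0, 0.0],
        [8.0, 7.0, 0.0, 1.0]]

for rows, cols in decode(data, [1, 3, 1, 3], m=2):
    print(rows, cols)
```

A string of length $M + N$ thus represents up to $M \times N$ biclusters at once, which is what keeps the chromosomes short compared with the binary encoding.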
4.3.2. SA Based Biclustering
There are many instances in the literature that use Simulated Annealing (SA) for the biclustering problem. A standard representation of a configuration in SA is equivalent to the binary string used in GA based biclustering. In [39], this representation is used. Here the initial configuration consists of all `1's, i.e., it encodes the complete data set. The perturbation is equivalent to the bit-flip mutation used in GA. The energy to be minimized is taken as the MSR of the encoded bicluster.
A similar approach is found in [40], where instead of starting from the complete data matrix, a seed bicluster is first created by clustering the genes and samples and combining them. Thereafter SA is used to grow the seed. Here also, MSR is used as the coherence measure. The perturbation includes only the addition of a random gene and/or condition.
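A minimal sketch of the first scheme (all-ones start, bit-flip perturbation, MSR as energy) is given below. Several details are assumptions made only for this illustration and are not taken from [39]: the geometric cooling schedule, the Metropolis acceptance rule, the parameter values, and a minimum bicluster size that prevents the search from collapsing to a trivial single-row bicluster with zero MSR.

```python
import math
import random

def msr(a, rows, cols):
    """Mean squared residue (Eqn. 15) of the submatrix a[rows][cols]."""
    row_mean = {i: sum(a[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(a[i][j] for i in rows) / len(rows) for j in cols}
    mean = sum(row_mean.values()) / len(rows)
    return sum((a[i][j] - row_mean[i] - col_mean[j] + mean) ** 2
               for i in rows for j in cols) / (len(rows) * len(cols))

def anneal(a, steps=2000, t0=1.0, cooling=0.995, min_size=2, seed=0):
    """Simulated annealing over binary strings of length G + C."""
    rng = random.Random(seed)
    n_g, n_c = len(a), len(a[0])

    def energy(s):
        rows = [i for i in range(n_g) if s[i]]
        cols = [j for j in range(n_c) if s[n_g + j]]
        if len(rows) < min_size or len(cols) < min_size:
            return None  # assumption: too-small biclusters are rejected
        return msr(a, rows, cols)

    state = [1] * (n_g + n_c)      # start from the complete data matrix
    current = energy(state)
    best_state, best = state[:], current
    t = t0
    for _ in range(steps):
        k = rng.randrange(len(state))
        state[k] ^= 1              # bit-flip perturbation
        e = energy(state)
        accept = e is not None and (
            e <= current or rng.random() < math.exp((current - e) / t))
        if accept:
            current = e
            if e < best:
                best_state, best = state[:], e
        else:
            state[k] ^= 1          # undo the rejected move
        t *= cooling               # geometric cooling (assumed schedule)
    return best_state, best

# Hypothetical data: rows 0-2 are additive over the first three columns.
data = [[1.0, 2.0, 3.0, 9.0],
        [2.0, 3.0, 4.0, 1.0],
        [4.0, 5.0, 6.0, 7.0],
        [8.0, 0.0, 5.0, 2.0]]
state, best_energy = anneal(data)
print(state, best_energy)
```

Because MSR alone rewards shrinking the bicluster, some size constraint or volume term is needed in practice; the minimum size used here is only the simplest such device.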
4.3.3. Hybrid Approaches
In [41], a hybrid Genetic Algorithm-Particle Swarm Optimization (GA-PSO) approach, which uses binary strings to encode the biclusters, is proposed. The GA and PSO have their own populations that evolve through the standard GA and PSO processes, respectively. At each iteration, a random set of individual solutions is exchanged between the two populations. As the fitness function, it uses the one described in Eqn. 17.
4.3.4. Multiobjective Biclustering
As the biclustering problem requires several objectives to be optimized, such as MSR, volume, row variance etc., there are some approaches that pose the biclustering problem as a multiobjective optimization problem [42]. The work in [38] has been extended to the multiobjective case in [43]. The algorithm is termed MultiObjective GA based Biclustering (MOGAB). Here the authors used the same encoding strategy consisting of gene clusters and condition clusters. Two objectives, viz., $MSR(I, J)$ and $\frac{1}{1 + VAR(I, J)}$, are optimized simultaneously. This algorithm uses NSGA-II [44] as the underlying multiobjective optimization tool. The crossover and mutation operators are kept the same as in [38].
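Multiobjective methods such as NSGA-II compare solutions through Pareto dominance rather than a single fitness value. The sketch below shows only this comparison, not NSGA-II itself (no non-dominated sorting into ranks, no crowding distance). It assumes that both objectives, MSR and $1/(1 + VAR)$, are to be minimized, and the objective values are hypothetical.

```python
def dominates(u, v):
    """True if u is no worse than v in every objective and better in one.

    All objectives are assumed to be minimized.
    """
    return (all(x <= y for x, y in zip(u, v))
            and any(x < y for x, y in zip(u, v)))

def non_dominated(points):
    """Return the points that no other point dominates (the Pareto front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (MSR, 1 / (1 + VAR)) values of four candidate biclusters.
candidates = [(0.5, 0.2), (0.3, 0.4), (0.6, 0.3), (0.4, 0.4)]
print(non_dominated(candidates))  # [(0.5, 0.2), (0.3, 0.4)]
```

The two surviving candidates represent different trade-offs between coherence and row variance, and neither can be called better without an additional preference.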
In [45], the authors extended their work of [37] to the
multiobjective case. The algorithm is called Sequential
Multi-Objective Biclustering (SMOB). Here also, they used a
binary encoding strategy. Three objective functions, viz.,
mean squared residue, volume and row variance, are
optimized simultaneously. In [46], a Crowding distance
based Multi-Objective Particle Swarm Optimization
Biclustering (CMOPSOB) algorithm is proposed that uses
binary encoding. The algorithm optimizes the MSR, volume
and VAR simultaneously. In [47], a hybrid multiobjective
biclustering algorithm that combines NSGA-II and the
Estimation of Distribution Algorithm (EDA) [48] for
searching biclusters is proposed. The volume and MSR of
the biclusters are optimized simultaneously. A
multiobjective artificial immune system based biclustering algorithm
that is capable of performing a multi-population search,
named MOM-aiNet, is proposed in [49].
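All of these methods rest on the notion of Pareto dominance between candidate biclusters. The following sketch shows a plain non-dominated filter over (MSR, volume, VAR) triples. It illustrates the concept only: the function names are hypothetical, and the ranking, crowding-distance and archive mechanisms of NSGA-II or the other cited algorithms are not reproduced.

```python
def to_minimization(msr, volume, var):
    """MSR is minimized; volume and row variance are maximized, so negate them."""
    return (msr, -volume, -var)


def dominates(a, b):
    """True if objective vector a dominates b (every objective minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points):
    """Keep the points that no other point dominates (the Pareto front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

A small, highly coherent bicluster and a large, less coherent one can both survive this filter, which is the trade-off a multiobjective formulation is meant to expose.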
In general, evolutionary algorithms are known for their
strength in avoiding locally optimum solutions. In particular,
when they are equipped with some local search, they can
converge quickly toward the global optimum. However,
algorithms that optimize MSR as an objective function
fail to discover multiplicative patterns. Also,
evolutionary algorithms are inherently slower than
the greedy iterative algorithms and depend heavily on
parameters such as population size, number of generations,
crossover and mutation rates, and annealing schedule. Still, in
general, it has been found that evolutionary algorithms,
especially the multiobjective ones, perform better than the
greedy search strategies.
4.4. Fuzzy Biclustering
Some recent biclustering algorithms employ fuzzy set
theory in order to
capture overlapping biclusters. In [50], a flexible fuzzy co-clustering
algorithm that incorporates feature-cluster
weighting in the formulation is proposed. The algorithm,
called Flexible Fuzzy Co-clustering with Feature-cluster
Weighting (FFCFW), allows the number of object
clusters to be different from the number of feature clusters.
A feature-cluster weighting scheme is incorporated for each
object cluster generated by FFCFW so that the relationships
between the two types of clusters are manifested in the
feature-cluster weights. This enables FFCFW to generate a
more accurate representation of fuzzy co-clusters. FFCFW
uses an iterative optimization procedure.
In [51], a GA based possibilistic fuzzy biclustering
algorithm, GFBA, is proposed. In GFBA, instead of a binary
chromosome, the authors use a real-valued chromosome of
length $G + C$. Each position in the chromosome has a value
between 0 and 1, representing the degree of membership of
the corresponding gene or condition to the encoded bicluster.
They fuzzified the different coherence and quality metrics
such as MSR, VAR and volume of the biclusters as follows:
The means of each row ($m_{iJ}$), each column ($m_{Ij}$) and all
the elements ($m_{IJ}$) of a bicluster are redefined as:
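$$m_{iJ} = \frac{\sum_{j=1}^{C} f_J(j)^{\mu} \cdot m_{ij}}{\sum_{j=1}^{C} f_J(j)^{\mu}}, \tag{21}$$
$$m_{Ij} = \frac{\sum_{i=1}^{G} f_I(i)^{\mu} \cdot m_{ij}}{\sum_{i=1}^{G} f_I(i)^{\mu}}, \tag{22}$$
and
$$m_{IJ} = \frac{\sum_{i=1}^{G} \sum_{j=1}^{C} f_I(i)^{\mu} \cdot f_J(j)^{\mu} \cdot m_{ij}}{\sum_{i=1}^{G} \sum_{j=1}^{C} f_I(i)^{\mu} \cdot f_J(j)^{\mu}}, \tag{23}$$
where $f_I(i)$ and $f_J(j)$ denote the membership degree of
the $i$th gene and $j$th condition to the bicluster, respectively,
and $\mu$ is the fuzzy exponent. Hence the fuzzy mean squared
residue FMSR is defined as: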
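$$FMSR(I,J) = \frac{1}{|I|\,|J|} \sum_{i=1}^{G} \sum_{j=1}^{C} f_I(i)^{\mu} \cdot f_J(j)^{\mu} \cdot (m_{ij} - m_{iJ} - m_{Ij} + m_{IJ})^2, \tag{24}$$
where $|I| = \sum_{i=1}^{G} f_I(i)$ and $|J| = \sum_{j=1}^{C} f_J(j)$. The objective
function to be minimized is selected as: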
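$$F = FMSR(I,J) + \sum_{i=1}^{G} \frac{\lambda \cdot FMSR(I,J) \cdot (1 - f_I(i))^{\mu}}{|I|} + \sum_{j=1}^{C} \frac{\gamma \cdot FMSR(I,J) \cdot (1 - f_J(j))^{\mu}}{|J|}, \tag{25}$$
where $\lambda$ and $\gamma$ are parameters provided to satisfy different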
requirements on the incoherence and the sizes of the
biclusters. Conventional roulette wheel selection and single
point crossover followed by mutation (increasing or
decreasing a membership value) have been used. GFBA also
uses a bicluster optimization technique at each generation for
faster convergence.
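As a consistency check on Eqns. 21-24 (a short derivation added for illustration, not taken from [51]), consider the special case in which every membership degree is crisp. Under that assumption the fuzzy measure coincides with the ordinary mean squared residue of the selected submatrix:

```latex
% Assumption: crisp memberships, f_I(i), f_J(j) \in \{0, 1\}, and \mu > 0.
% Let I_0 = \{ i : f_I(i) = 1 \} and J_0 = \{ j : f_J(j) = 1 \}.
\begin{align*}
f_I(i)^{\mu} &= f_I(i), \quad f_J(j)^{\mu} = f_J(j), \quad
  |I| = |I_0|, \quad |J| = |J_0|, \\
m_{iJ} &= \frac{1}{|J_0|} \sum_{j \in J_0} m_{ij}, \quad
m_{Ij} = \frac{1}{|I_0|} \sum_{i \in I_0} m_{ij}, \quad
m_{IJ} = \frac{1}{|I_0|\,|J_0|} \sum_{i \in I_0} \sum_{j \in J_0} m_{ij}, \\
FMSR(I,J) &= \frac{1}{|I_0|\,|J_0|} \sum_{i \in I_0} \sum_{j \in J_0}
  (m_{ij} - m_{iJ} - m_{Ij} + m_{IJ})^2 = MSR(I_0, J_0).
\end{align*}
```

Genes and conditions with zero membership drop out of every sum, so the real-valued chromosome generalizes the binary encoding rather than replacing it.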
In [52], an NSGA-II based multiobjective probabilistic
fuzzy biclustering algorithm is proposed which uses
chromosomes encoding a set of gene cluster centers and a set
of condition cluster centers as in [38, 43]. In this case, the
gene and condition cluster centers are updated using one step
of fuzzy K-medoids clustering [53] and for each gene and
condition, fuzzy membership degree to each gene cluster and
condition cluster, respectively is computed. The fuzzy
volume of bicluster $B(I,J)$ corresponding to the
$\langle gc_x, cc_y \rangle$ pair ($\langle$gene cluster $x$, condition cluster
$y\rangle$) is defined as:
$$fvol(I,J) = \sum_{i=1}^{G} \sum_{j=1}^{C} \mu_{xi}^{m} \, \eta_{yj}^{m}. \tag{26}$$
Here $I$ is a fuzzy set corresponding to the fuzzy gene
cluster centered at $gc_x$. It consists of all genes $g_i$ with
membership degree $\mu_{xi}$, $1 \le i \le G$. Similarly, $J$ is a fuzzy
set corresponding to the fuzzy condition cluster centered at $cc_y$.
It consists of all conditions $c_j$ with membership degree $\eta_{yj}$,
$1 \le j \le C$.
Residue of an element $a_{ij}$ of the fuzzy bicluster
$B(I,J)$ is defined as:
$$fr_{ij} = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}, \tag{27}$$
where
$$a_{iJ} = \frac{\sum_{j=1}^{C} \eta_{yj}^{m} \, a_{ij}}{\sum_{j=1}^{C} \eta_{yj}^{m}}, \tag{28}$$
$$a_{Ij} = \frac{\sum_{i=1}^{G} \mu_{xi}^{m} \, a_{ij}}{\sum_{i=1}^{G} \mu_{xi}^{m}}, \tag{29}$$
and
$$a_{IJ} = \frac{\sum_{i=1}^{G} \sum_{j=1}^{C} \mu_{xi}^{m} \, \eta_{yj}^{m} \, a_{ij}}{fvol(I,J)}, \tag{30}$$
where $m$ is the fuzzy exponent. The fuzzy mean squared
residue ($FMSR(I,J)$) of the fuzzy bicluster $B(I,J)$
is defined as:
$$FMSR(I,J) = \frac{1}{fvol(I,J)} \sum_{i=1}^{G} \sum_{j=1}^{C} \mu_{xi}^{m} \, \eta_{yj}^{m} \, fr_{ij}^{2}. \tag{31}$$
Subsequently, fuzzy expression profile variance of
$B(I,J)$ is computed as:
$$fvar(I,J) = \frac{1}{fvol(I,J)} \sum_{i=1}^{G} \sum_{j=1}^{C} \mu_{xi}^{m} \, \eta_{yj}^{m} \, (a_{ij} - a_{iJ})^{2}. \tag{32}$$
For each $\langle gc_x, cc_y \rangle$ pair, representing a fuzzy
bicluster, the above three objectives ($fvol$, $FMSR$ and
$fvar$) are computed. As each chromosome encodes a
number of possible biclusters, the average value of each of
the above three terms, i.e., fuzzy volume, fuzzy MSR and
fuzzy variance are taken as three objectives to be optimized
simultaneously. Note that the first and the third objectives
are to be maximized, while the second is to be minimized. The
other genetic operators used are similar to those used in [43].
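The three quantities of Eqns. 26-32 can be evaluated for one fuzzy bicluster as in the sketch below. It is an illustration and not the code of [52]: the function name, the argument names `mu_x` and `eta_y` for the two membership vectors, and the default exponent are assumptions, and the membership degrees are taken as given rather than obtained from fuzzy K-medoids.

```python
def fuzzy_objectives(a, mu_x, eta_y, m=2.0):
    """Return (fvol, FMSR, fvar) for one <gene cluster, condition cluster> pair.

    a      expression matrix, a list of G rows with C values each
    mu_x   membership of each gene in gene cluster x (length G)
    eta_y  membership of each condition in condition cluster y (length C)
    m      fuzzy exponent (default value is an assumption)
    Assumes at least one gene and one condition have non-zero membership.
    """
    n_g, n_c = len(a), len(a[0])
    wr = [u ** m for u in mu_x]          # gene weights
    wc = [v ** m for v in eta_y]         # condition weights
    fvol = sum(wr) * sum(wc)             # Eqn. 26 (the double sum factorizes)
    a_iJ = [sum(wc[j] * a[i][j] for j in range(n_c)) / sum(wc)
            for i in range(n_g)]         # Eqn. 28
    a_Ij = [sum(wr[i] * a[i][j] for i in range(n_g)) / sum(wr)
            for j in range(n_c)]         # Eqn. 29
    a_IJ = sum(wr[i] * wc[j] * a[i][j]
               for i in range(n_g) for j in range(n_c)) / fvol   # Eqn. 30
    fmsr = sum(wr[i] * wc[j] * (a[i][j] - a_iJ[i] - a_Ij[j] + a_IJ) ** 2
               for i in range(n_g) for j in range(n_c)) / fvol   # Eqn. 31
    fvar = sum(wr[i] * wc[j] * (a[i][j] - a_iJ[i]) ** 2
               for i in range(n_g) for j in range(n_c)) / fvol   # Eqn. 32
    return fvol, fmsr, fvar
```

A gene with membership 0 contributes nothing to any of the three values, so a noisy row is ignored once its membership has been driven to zero.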
In [54], another fuzzy biclustering algorithm called
Fuzzy Biclustering for Microarray Data Analysis (FBMDA)
is proposed. The method employs a combination of the
Nelder-Mead and min-max algorithms to construct a
hierarchically structured biclustering, and can thus represent the
biclustering information at different levels. FBMDA uses
multiobjective optimization that optimizes volume, variance
and fuzzy entropy simultaneously. The Nelder-Mead
algorithm is used to compute a single-objective optimal
solution, and the min-max algorithm is used to trade off
between multiple objectives. FBMDA is not subject to
convexity limitations and does not use derivative
information. FBMDA ensures that the current local optimal
solution is removed and that a higher precision is reached.
Incorporation of fuzziness in biclustering algorithms
enables them to deal with noisy data and overlapping
biclusters efficiently. But as most of the aforementioned
fuzzy algorithms use evolutionary techniques as the
underlying optimization strategy, they suffer from the
fundamental disadvantages of evolutionary methods.
Furthermore, computation of fuzzy membership degrees
takes additional time which adds up to the time taken by the
fuzzy biclustering methods.
4.5. Graph Theoretic Approaches
Graph theoretic concepts and techniques have been
utilized in detecting biclusters. In [55], the authors
introduced SAMBA (Statistical Algorithmic Method for
Bicluster Analysis), a graph-theoretic approach to
biclustering in combination with a statistical data model. In
SAMBA the expression matrix is modeled as a bipartite
graph consisting of two sets of vertices corresponding to
genes and conditions. A bicluster is defined as a subgraph,
and a likelihood score is used in order to assess the
significance of observed subgraphs. SAMBA repeatedly
finds the maximal highly connected subgraph in the bipartite
graph. Then it performs local improvement by adding or
deleting a single vertex until no further improvement is
possible. SAMBA's time complexity is $O(N 2^{d})$, where $d$ is
the upper bound on the degree of each vertex.
The Binary inclusion-Maximal (BiMax) biclustering
algorithm proposed in [56] identifies all biclusters in the
input matrix. BiMax algorithm works on a binary matrix.
The input matrix is first discretized to zeros and ones
according to a user-specified threshold. Based on this binary
matrix, BiMax identifies all maximal biclusters where a
bicluster is defined as a submatrix E containing all 1s. An
inclusion-maximal bicluster means that this bicluster is not
completely contained in any other bicluster. They used an
incremental algorithm to find the inclusion-maximal
biclusters exploiting the fact that the matrix E induces a
bipartite graph. As BiMax works with a binary matrix, it is
suitable only for detecting constant biclusters.
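The notion of an inclusion-maximal bicluster can be made concrete with a small brute-force enumeration. The sketch below is not the incremental BiMax procedure of [56]: it tries every column subset and is therefore exponential in the number of conditions, so it is only suitable for toy matrices. The function names and the `>=` thresholding rule are assumptions.

```python
from itertools import combinations


def binarize(matrix, threshold):
    """Discretize to 0/1; values at or above the threshold become 1 (assumed rule)."""
    return [[1 if v >= threshold else 0 for v in row] for row in matrix]


def inclusion_maximal_biclusters(binary):
    """Enumerate all inclusion-maximal all-ones submatrices by brute force."""
    n_rows, n_cols = len(binary), len(binary[0])
    found = set()
    for size in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            # All rows that contain a 1 in every chosen column.
            rows = tuple(i for i in range(n_rows)
                         if all(binary[i][j] for j in cols))
            if not rows:
                continue
            # Close the column set: every column that is 1 in all those rows.
            closed = tuple(j for j in range(n_cols)
                           if all(binary[i][j] for i in rows))
            found.add((rows, closed))
    return sorted(found)
```

Each returned pair (rows, columns) is an all-ones submatrix to which no further row or column can be added, which is exactly the inclusion-maximality condition.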
In [57], the optimal biclustering problem is posed as a
problem of maximal crossing number reduction
(minimization) in a weighted bipartite graph. In this regard,
an algorithm called cHawk is proposed that employs the
barycenter heuristic and a local search technique. There are
three main steps of the algorithm, viz., construction of a
bipartite graph from the input matrix, bipartite graph
crossing minimization and finally, the bicluster
identification. This approach reorders the matrix so that all
rows and columns belonging to the same bicluster are
brought into the vicinity of each other. cHawk is able to
detect constant, additive and overlapped noisy biclusters.
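The reordering step can be illustrated with a plain barycenter heuristic on an unweighted binary matrix. This is a simplified sketch and not cHawk itself, which works on a weighted bipartite graph and adds local search and a bicluster identification step; the function name and the fixed number of sweeps are assumptions.

```python
def barycenter_order(binary, sweeps=5):
    """Alternately reorder rows and columns by the mean position of their neighbours."""
    n_rows, n_cols = len(binary), len(binary[0])
    row_order = list(range(n_rows))
    col_order = list(range(n_cols))

    def bary(neighbours, position):
        # Mean position of the neighbours; isolated vertices get a neutral key.
        if not neighbours:
            return 0.0
        return sum(position[v] for v in neighbours) / len(neighbours)

    for _ in range(sweeps):
        col_pos = {c: p for p, c in enumerate(col_order)}
        row_order.sort(key=lambda r: bary(
            [c for c in range(n_cols) if binary[r][c]], col_pos))
        row_pos = {r: p for p, r in enumerate(row_order)}
        col_order.sort(key=lambda c: bary(
            [r for r in range(n_rows) if binary[r][c]], row_pos))
    return row_order, col_order
```

On a matrix whose two blocks are interleaved, the returned permutations bring the rows and columns of each block next to each other, after which the biclusters appear as contiguous regions.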
The graph based biclustering algorithms usually model
the input data set as a bipartite graph with two sets of nodes
corresponding to the genes and conditions, respectively. The
edges of the graph represent the level of overexpression or
underexpression of a gene under a certain condition. A
bicluster is a subgraph of the bipartite graph in which the
genes have coherence across the selected conditions. In these
types of algorithms, the genes and conditions are partitioned
into the same number of clusters, which may be impractical.
Moreover, the input data set has to be discretized properly
before applying graph based algorithms. Also, they do not
scale well to large data sets.
4.6. Randomized Greedy Search
In [58], a greedy random walk search technique for the
biclustering problem, enriched by a local search
strategy to escape local optima, is presented. The
algorithm begins with an initial random solution and searches
for a locally optimal solution by successive transformations
(including random moves depending on some probability) that
improve a gain function defined as a combination of mean
squared residue, expression profile variance and the volume
of the biclusters. The algorithm iterates k times to generate
k biclusters.
In [59], the basic concepts of the Greedy
Randomized Adaptive Search Procedure (GRASP) metaheuristic,
namely its construction and local search phases, are reviewed. Also, a
variant of GRASP called Reactive Greedy
Randomized Adaptive Search Procedure (Reactive GRASP)
is proposed to detect significant biclusters from large
microarray datasets. The method has two major steps. First,
high quality bicluster seeds are generated by applying K-means
clustering to both dimensions and combining the
clusters. In the second step, these seeds are grown using
Reactive GRASP. In Reactive GRASP, the basic parameter
that defines the restrictiveness of the candidate list is self-adjusted,
depending on the quality of the solutions found
previously.
Randomized greedy search algorithms try to combine the
advantages of greedy search and randomization, so that they
execute fast and do not get stuck at local optima. However,
these algorithms still depend heavily on the initial choice of
the solution, and there is no clear way to recover from a poor
choice.
4.7. Other Recent Approaches
A number of biclustering algorithms have appeared in
recent literature that follow new methodologies. Some of
them are described here.
In [60], the authors introduce the plaid model, a statistical
model assuming that the expression value $m_{ij}$ in a bicluster
is the sum of the main effect $\mu$, the gene effect $p_i$, the
condition effect $q_j$, and the noise term $\epsilon_{ij}$:
$$m_{ij} = \mu + p_i + q_j + \epsilon_{ij}. \tag{33}$$
Also it is assumed that the expression values of two
overlapping biclusters are the sum of the two module effects.
In plaid model, a greedy search strategy is used, hence errors
can accumulate easily. Moreover, in case of multiple
clusters, the clusters identified by the method tend to overlap
to a great extent.
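For a single bicluster whose genes and conditions are already known, the additive effects of Eqn. 33 can be estimated by simple averaging, as sketched below. This covers only that one step under stated assumptions: the function name is hypothetical, the symbols follow Eqn. 33 with `mu` standing for the main effect, and the layer-by-layer greedy search of [60] that decides which genes and conditions belong to each bicluster is not shown.

```python
def fit_additive_layer(block):
    """Least-squares estimates of main, gene and condition effects for one block."""
    n_r, n_c = len(block), len(block[0])
    mu = sum(map(sum, block)) / (n_r * n_c)                  # main effect
    p = [sum(row) / n_c - mu for row in block]               # gene effects
    q = [sum(block[i][j] for i in range(n_r)) / n_r - mu
         for j in range(n_c)]                                # condition effects
    # What is left after removing the three effects is the noise estimate.
    resid = [[block[i][j] - (mu + p[i] + q[j]) for j in range(n_c)]
             for i in range(n_r)]
    return mu, p, q, resid
```

For noise-free additive data the residuals vanish, and the mean of their squares is the MSR of the block, which links this model to the coherence measure used by the other methods.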
In [61], a biclustering algorithm is proposed based on
probabilistic Gibbs sampling. Gibbs sampling does not suffer
from the problem of local minima that often characterizes
Expectation Maximization. However, when the microarray
data are organized in patient vs. gene fashion and the number
of patients is much lower than the number of genes,
the algorithm faces computational difficulties. Moreover, the
algorithm is only able to identify biclusters with constant
columns.
In [62], the authors developed a spectral biclustering
method that simultaneously clusters genes and conditions,
finding distinctive checkerboard patterns in matrices of gene
expression data, if they exist. The method is based on the
observation that checkerboard structures can be found in
eigenvectors corresponding to the characteristic expression
patterns across the genes or conditions. In addition, these
eigenvectors can be readily identified by commonly used
linear algebra approaches such as singular value
decomposition (SVD), coupled with closely integrated
normalization steps.
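The underlying observation can be checked on a toy checkerboard matrix. The sketch below uses power iteration to approximate the leading pair of singular vectors and shows that they are piecewise constant over the row and column blocks. It is a deliberately reduced illustration: the normalization steps of [62] and the use of further singular vectors are omitted, and the function name and iteration count are assumptions.

```python
def leading_singular_vectors(a, iters=100):
    """Approximate the leading left/right singular vectors by power iteration."""
    n_r, n_c = len(a), len(a[0])
    v = [1.0] * n_c
    for _ in range(iters):
        # u is proportional to A v
        u = [sum(a[i][j] * v[j] for j in range(n_c)) for i in range(n_r)]
        norm = sum(x * x for x in u) ** 0.5
        u = [x / norm for x in u]
        # v is proportional to A^T u
        v = [sum(a[i][j] * u[i] for i in range(n_r)) for j in range(n_c)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return u, v
```

Rows belonging to the same block receive the same entry in `u`, so sorting or clustering the entries of the singular vectors recovers the row and column partitions of the checkerboard.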
In [63], the authors proposed a biclustering method that
employs dynamic programming and a divide-and-conquer
technique, as well as efficient data structures such as the trie
and zero-suppressed binary decision diagrams (ZBDDs). Use of
ZBDDs extends the stability of the method substantially.
In [64], the authors developed MicroCluster, a
deterministic biclustering method. In MicroCluster, only the
maximal biclusters satisfying certain homogeneity criteria
are considered. The clusters can be arbitrarily positioned
anywhere in the input data matrix, and they can have
arbitrary overlapping regions. MicroCluster uses a flexible
definition of a cluster that lets it mine several types of
biclusters. Moreover, MicroCluster can delete or merge
biclusters that have large overlaps. So, it can tolerate some
noise in the data set and let the users focus on the most
important clusters. As MicroCluster relies on extracting
maximal cliques from the constructed range multigraph, it is
computationally demanding. Moreover, there are several
input parameters that are to be tuned properly in order to find
suitable biclusters.
A method based on application of the non-smooth non-
negative matrix factorization technique for discovering local
structures (biclusters) from gene expression datasets is
developed in [65]. This method utilizes non-negative matrix
factorization with non-smoothness constraints for identifying
biclusters in gene expression data for a given factorization
rank.
In [66], biclustering algorithms using basic linear algebra
and arithmetic tools have been developed. The proposed
biclustering algorithms can be used to search for all
biclusters with constant values, biclusters with constant
values on rows, biclusters with constant values on columns,
and biclusters with coherent values from a set of data in a
timely manner and without solving any optimization
problem.
In [67], the authors proposed a biclustering method that
alternately sorts the genes and conditions using the dominant
set. By using a weighted correlation coefficient, they emphasize
the similarities across a subset of the genes/conditions.
Additionally, a coherence measure called Average
Correlation Value (ACV) is proposed, which is effective in
determining both additive and multiplicative patterns. Some
special preprocessing of the input data set is needed for
detecting additive and multiplicative biclusters. To detect
different types of biclusters, different runs are needed.
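The difference between a correlation-based measure and the MSR can be seen on two toy biclusters. In the sketch below, ACV is assumed to be the larger of the average absolute pairwise Pearson correlation between rows and between columns, which is how the measure is usually stated; that formula, the function names and the example data are assumptions and are not taken from this section. Constant rows or columns, for which the correlation is undefined, are not handled.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def avg_abs_corr(vectors):
    """Average absolute correlation over all ordered pairs of distinct vectors."""
    n = len(vectors)
    total = sum(abs(pearson(vectors[i], vectors[j]))
                for i in range(n) for j in range(n) if i != j)
    return total / (n * n - n)


def acv(block):
    """Assumed definition: max of the row-wise and column-wise averages."""
    cols = [list(c) for c in zip(*block)]
    return max(avg_abs_corr(block), avg_abs_corr(cols))


def msr(block):
    """Mean squared residue of the whole block, for comparison."""
    n_r, n_c = len(block), len(block[0])
    rm = [sum(r) / n_c for r in block]
    cm = [sum(block[i][j] for i in range(n_r)) / n_r for j in range(n_c)]
    am = sum(rm) / n_r
    return sum((block[i][j] - rm[i] - cm[j] + am) ** 2
               for i in range(n_r) for j in range(n_c)) / (n_r * n_c)
```

With an additive block such as `[[1, 2, 4], [3, 4, 6], [0, 1, 3]]` both measures report perfect coherence, whereas for a multiplicative block such as `[[1, 2, 4], [3, 6, 12], [2, 4, 8]]` the ACV stays at 1 while the MSR is clearly above zero.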
In [68], a biclustering algorithm that adopts a bucketing
technique to find a raw submatrix is proposed. The algorithm
refines and extends the raw submatrix into a bicluster. The
algorithm is called the Bucketing and Extending Algorithm
(BEA).
A Bayesian BiClustering (BBC) model that uses Gibbs sampling is proposed in
[69]. For a single bicluster, the
same model as in the plaid model is assumed. For
multiple biclusters, the overlapping of biclusters is allowed
either in genes or in conditions. Moreover, the authors used a
flexible error model, which permits the error term of each
bicluster to have a different variance.
In [70] the authors presented a rigorous approach to
biclustering, which is based on the Optimal RE-Ordering
(OREO) of the rows and columns of a data matrix so as to
globally minimize the dissimilarity metric. The physical
permutations of the rows and columns of the data matrix can
be modeled as either a network flow problem or a traveling
salesman problem. Cluster boundaries in one dimension are
used to partition and re-order the other dimensions of the
corresponding submatrices to generate the biclusters. The
reordering of the rows and the columns for large data sets
can be computationally demanding.
The authors in [71] proposed an algorithm that finds and
reports all maximal contiguous column coherent (CCC)
biclusters in time linear in the size of the expression matrix.
The linear time complexity of CCC-Biclustering relies on the
use of a discretized matrix and efficient string processing
techniques based on suffix trees. This algorithm can only
detect biclusters with columns arranged contiguously.
In [72], an iterative density based biclustering algorithm,
called BIDENS is proposed. BIDENS is able to detect a set
of k possibly overlapping biclusters simultaneously. The
algorithm is similar to FLOC, but instead of using the residue
as the objective function, it tries to maximize the overall
density of the biclusters. The input data set needs to be
discretized before the BIDENS algorithm is applied.
5. BICLUSTERING SOFTWARE
A number of free biclustering software packages are
available, either for download and offline use or in the form of
web servers. Here we list some free/open source biclustering
software and discuss them in brief.
5.1. BicAT
BicAT (Biclustering Analysis Toolbox) [73] integrates
various biclustering (Cheng and Church, Bimax, xMotif,
OPSM, ISA) and clustering techniques (K-means,
hierarchical clustering) with a common graphical user
interface. Moreover, BicAT provides different facilities for
data preparation, inspection and postprocessing such as
discretization, filtering of biclusters according to specific
criteria or gene pair analysis for constructing gene
interconnection graphs. The toolbox is described in the
context of gene expression analysis, but is also applicable to
other types of data, e.g. data from proteomics or synthetic
lethal experiments. The BicAT toolbox is freely available at
http://www.tik.ee.ethz.ch/sop/bicat and is platform
independent. The Java source code of the program and a
developer's guide are provided on the website as well. Users
can also add further algorithms or
extensions.
5.2. BiVisu
BiVisu (Bicluster detection and Visualization) [74] is an
open-source biclustering software tool for detection and
visualization of biclusters embedded in a gene expression
matrix. By using appropriate coherence relations, BiVisu
is able to detect constant, row-constant, column-constant,
additive and multiplicative biclusters. The biclustering
results can also be visualized under a 2D setting in the form
of parallel coordinate (PC) plots for each bicluster. From the
PC plots of the biclusters, both objective and subjective
cluster quality evaluation can be performed. BiVisu also
integrates some data preprocessing and postprocessing
techniques. BiVisu has been developed in Matlab and is
available at http://www.eie.polyu.edu.hk/~nflaw/Biclustering/
for free download.
5.3. GEMS
GEMS (Gene Expression Mining Server) [75] is a web-
enabled service for biclustering microarray gene expression
data. Users may upload their gene expression data and
specify a set of criteria. GEMS then performs biclustering
based on a Gibbs sampling paradigm. GEMS web server
provides a useful and flexible platform for the discovery of
co-expressed and potentially co-regulated gene modules.
GEMS is open source software and is available at
http://genomics10.bu.edu/terrence/gems/ for free download.
5.4. EXPANDER
EXPANDER (EXpression Analyzer and DisplayER) [76]
is a Java-based tool for analysis of gene expression data. It is
capable of clustering, visualization, biclustering and
performing downstream analysis of clusters and biclusters
such as functional enrichment and promoter analysis. In
general, EXPANDER can analyze groups of genes for
enrichment of transcription factor binding sites in their
promoters. EXPANDER currently integrates the SAMBA
[55] biclustering algorithm. The software is freely
downloadable from http://acgt.cs.tau.ac.il/expander/.
5.5. BicOverlapper
BicOverlapper [77] is a tool for visualizing biclusters
from gene-expression matrices in a way that helps to
compare biclustering methods, to unravel trends and to
highlight relevant genes and conditions. The technique is
based on a force-directed graph where biclusters are
represented as flexible overlapped groups of genes and
conditions. The BicOverlapper software and supplementary
material are available at http://vis.usal.es/bicoverlapper.
5.6. BiGGEsTS
BiGGEsTS (Biclustering Gene Expression Time Series)
[78] is a free and open source software tool providing an
integrated environment for the biclustering of time series
gene expression data. It offers a set of biclustering
algorithms (CCC [71], e-CCC [71], CC-TSB [79]) for time
series expression data. Moreover, it implements several
visualization techniques such as colored matrices,
expression evolution charts, pattern charts, dendrograms and
gene ontology graphs. BiGGEsTS integrates well known
techniques for preprocessing data: filtering genes, filling
missing values, smoothing, normalization and discretization.
The software is available at http://kdbio.inesc-
id.pt/software/biggests/.
5.7. BICLUST
BICLUST is an R-package for biclustering analysis
which contains a collection of bicluster algorithms,
preprocessing methods (normalization and discretization) for
two way data, and validation and visualization techniques for
bicluster results. The main function biclust provides several
algorithms to find biclusters in two-dimensional data: Cheng
and Church, Spectral, Plaid Model, Xmotifs and BiMax. The
package is available at the following website:
http://crantastic.org/packages/biclust.
CONCLUSION AND FUTURE CHALLENGES
Biclustering is a method for simultaneous clustering of
both genes and conditions of a microarray gene expression
matrix. Unlike clustering, biclustering methods try to capture
local modules, i.e., set of genes that are coregulated and
coexpressed in a subset of conditions. In recent times, there
has been a tremendous growth in biclustering research and a
large number of algorithms have been proposed. In this
article, we have made an attempt to present a comprehensive
review on the biclustering models. Recent biclustering
algorithms of different categories along with their pros and
cons have been discussed. Moreover, an overview of some
freely available biclustering software is provided.
Most of the biclustering algorithms have been applied to
microarray gene expression data sets for identifying
coregulated genes and classifying tissue samples.
Biclustering algorithms have also been applied for detection
of different responses to treatment, and the set of genes to be
used as the most effective probes, mainly in cancer
microarrays, such as Leukemia [29, 30, 61]. Other than gene
expression data sets, biclustering algorithms have also been
successfully applied to e-commerce data and collaborative
filtering [80], marketing data [81], and text mining [82] etc.
Although many publications are appearing in the
biclustering area, many challenges remain to be
addressed by researchers. In many of the papers, the
authors have posed the biclustering as an optimization
problem that optimizes some coherence measures. Many
such algorithms optimize MSR to capture the coherent
biclusters. However, recently it has been proved that MSR
is only able to detect constant and additive patterns and
unable to detect the multiplicative or combined patterns [22].
Therefore, it is a challenge for researchers to devise a
new coherence measure that can capture both additive and
multiplicative patterns and, more desirably, the combined
patterns as well. Moreover, there is still no generally accepted
measure to compare the quality of the biclusters obtained
using different biclustering algorithms. Therefore it is
difficult to judge the superiority of any particular
biclustering algorithm. This issue must be addressed by the
researchers. Furthermore, some studies are to be made for
extending the biclustering algorithms to generate the
triclusters from 3D gene-sample-time microarray data sets.
REFERENCES
[1] Sharan R, Adi M-K, Shamir R. CLICK and EXPANDER: A system for clustering and visualizing gene expression data. Bioinformatics 2003; 19: 1787-99.
[2] Alizadeh AA, Eisen MB, Davis R, et al. Distinct types of diffuse large B-cell lymphomas identified by gene expression profiling. Nature 2000; 403: 503-11.
[3] Chu S, DeRisi J, Eisen M, et al. The transcriptional program of sporulation in budding yeast. Science 1998; 282: 699-705.
[4] Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998; 95(25): 14863-68.
[5] Bandyopadhyay S, Maulik U, Wang JTL. Anal Biol Data: A Soft Comput Approach. World Scientific 2007.
[6] Jain AK, Dubes RC. Algorithms for clustering data. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[7] Cho RJ, Campbell MJ, Winzeler EA, et al. A genome-wide transcriptional analysis of mitotic cell cycle. Mol Cell 1998; 2: 65-73.
[8] Herwig R, Poustka A, Meuller C, Lehrach H, O'Brien J. Large-scale clustering of cDNA fingerprinting data. Genome Res 1999; 9(11): 1093-105.
[9] Dembele D, Kastner P. Fuzzy c-means method for clustering microarray data. Bioinformatics 2003; 19(8): 973-80.
[10] Tamayo P, Slonim D, Mesirov J, et al. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 1999; 96: 2907-12.
[11] Hartuv E, Shamir R. A clustering algorithm based on graph connectivity. Inform Proc Lett 2000; 76(200): 175-81.
[12] Lukashin AV, Fuchs R. Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics 2001; 17(5): 405-19.
[13] Bandyopadhyay S, Mukhopadhyay A, Maulik U. An improved algorithm for clustering gene expression data. Bioinformatics 2007; 23(21): 2859-65.
[14] Maulik U, Mukhopadhyay A, Bandyopadhyay S. Combining pareto-optimal clusters using supervised learning for identifying co-expressed genes. BMC Bioinformatics 2009; 10: 27.
[15] Yeung KY, Haynor DR, Ruzzo WL. Validating clustering for gene expression data. Bioinformatics 2001; 17(4): 309-18.
[16] Qin ZS. Clustering microarray gene expression data using weighted Chinese restaurant process. Bioinformatics 2006; 22(16): 1988-97.
[17] Pan H, Zhu J, Han D. Genetic algorithms applied to multi-class clustering for gene expression data. Genomics Proteomics Bioinformatics 2003; 1: 279-87.
[18] Tasoulis DK, Plagianakos VP, Vrahatis MN. Unsupervised clustering of bioinformatics data. In Eur Symp Int Tech Hybrid Syst implementation Smart Adaptive Syst 2004; pp. 47-53.
[19] Mukhopadhyay A, Maulik U, Bandyopadhyay S. Unsupervised cancer classification through SVM-boosted multiobjective fuzzy clustering with majority voting ensemble. In Proc IEEE Congress on Evolutionary Comput 2009; pp. 255-61.
[20] Causton HC, Quackenbush J, Brazma A. Microarray gene expressions data analysis: A beginner's guide. Blackwell Pub., April 2003.
[21] http://en.wikipedia.org/wiki/dna microarray.
[22] Aguilar-Ruiz JS. Shifting and scaling patterns from gene expression data. Bioinformatics 2005; 21(20): 3840-45.
[23] Hartigan J. Direct clustering of a data matrix. J Am Stat Assoc 1972; 67(337): 123-29.
[24] Cheng Y, Church GM. Biclustering of gene expression data. Proc Int Conf Int Syst Mol Biol 2000; pp. 93-103.
[25] Yang J, Wang W, Wang H, Yu P. Enhanced biclustering on expression data. In Proc 3rd IEEE Conf Bioinform Bioeng 2003; pp. 321-27.
[26] Ben-Dor A, Chor B, Karp R, Yakhini Z. Discovering local structure in gene expression data: The order-preserving sub-matrix problem. In Proc 6th Ann Int Conf Comput Biol 2002; 1-58113-498-3: pp. 49-57.
[27] Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N. Revealing modular organization in the yeast transcriptional network. Nat Genet 2002; 31: 370-7.
[28] Ihmels J, Bergmann S, Barkai N. Defining transcription modules using large-scale gene expression data. Bioinformatics 2004; 20: 1993-2003.
[29] Murali TM, Kasif S. Extracting conserved gene expression motifs from gene expression data. In Proc Pacific Symp Biocomput 2003; 8: 77-88.
[30] Getz G, Levine E, Domany E. Coupled two-way cluster analysis of gene microarray data. Proc Natl Acad Sci USA 2000; 12079-84.
[31] Tang C, Zhang L, Zhang I, Ramanathan M. Interrelated two-way clustering: An unsupervised approach for gene expression data analysis. In Proc Sec IEEE Int Symp Bioinform Bioeng 2001; pp. 41-8.
[32] Busygin S, Jacobsen G, Krämer E, Ag C. Double conjugated clustering applied to leukemia microarray data. In Proc 2nd SIAM ICDM Workshop on clustering high dimensional data 2002.
[33] Goldberg DE. Genetic algorithms in search optimization and machine learning. New York: Addison-Wesley, 1989.
[34] Kirkpatrick S, Gelatt CD, Vecchi MP. Optimization by simulated annealing. Science 1983; 220: 671-80.
[35] Bleuler S, Prelic A, Zitzler E. An EA framework for biclustering of gene expression data. In Proc IEEE Congress on Evolutionary Comput 2004; pp. 166-73.
[36] Chakraborty A, Maka H. Biclustering of gene expression data using genetic algorithm. In Proc IEEE Symp Comput Int Bioinform Comput Biol 2005.
[37] Divina F, Aguilar-Ruiz JS. Biclustering of expression data with evolutionary computation. IEEE Trans Knowl Data Eng 2006; 18: 590-602.
[38] Mukhopadhyay A, Maulik U, Bandyopadhyay S. Evolving coherent and non-trivial biclusters from gene expression data: An evolutionary approach. Proc IEEE Region 10 Conf 2008.
[39] Bryan K, Cunningham P, Bolshakova N. Biclustering of expression
data using simulated annealing. In Proc 18th IEEE Symp Comput Based Medical Syst (Dublin, Ireland) 2005; pp. 383-8.
[40] Chakraborty A. Biclustering of gene expression data by simulated annealing. In Proc Eighth Intl Conf High-Perform Comput Asia-
Paci c Region 2005; pp. 627-32. [41] Xie B, Chen S, Liu F. Biclustering of gene expression data using
PSO-GA hybrid. Proc Int Conf Bioinform Biomed Eng 2007; 302-05.
[42] Deb K. Multi-objective optimization using evolutionary algorithms. England: John Wiley and Sons, Ltd, 2001.
[43] Maulik U, Mukhopadhyay A, Bandyopadhyay S. Finding multiple coherent biclusters in microarray data using variable string length
multiobjective genetic algorithm. IEEE Trans Inform Tech Biomed 2009; 13(6): 969-75.
[44] Deb K, Pratap A, Agrawal S, Meyarivan T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 2002; 6: 182-97.
[45] Divina F, Aguilar-Ruiz JS. A multi-objective approach to discover biclusters in microarray data. In Proc 9th Ann Conf Genetic Evol Comput New York, NY, USA 2007; pp. 385-92, ACM.
[46] Liu J, Li Z, Hu X, Chen Y. Biclustering of microarray data with MOPSO based on crowding distance. BMC Bioinformatics 2008; 10(Suppl 4): S9.
[47] Fei L, Juan L. Biclustering of gene expression data with a new hybrid multi-objective evolutionary algorithm of NSGA-II and EDA. In Proc Int Conf Bioinform Biomed Eng 2008; pp. 1912-5.
[48] Larrañaga P, Lozano JA. Estimation of distribution algorithms: A new tool for evolutionary computation. MA: Kluwer Academic Publishers, 2001.
[49] Coelho GP, Franca FO, Zuben FJ. A multi-objective multipopulation approach for biclustering. In Proc 7th Int Conf Artificial Immune Syst Springer-Verlag 2008; pp. 71-82.
[50] Tjhi W-C, Lihui C. Flexible fuzzy co-clustering with feature-cluster weighting. In Proc 9th Int Conf Control, Automation, Robotics and Vision 2006.
[51] Fei X, Lu S, Pop HF, Liang LR. GFBA: A biclustering algorithm for discovering value-coherent biclusters. In Proc Int Symp Bioinform Res Appl 2007.
[52] Maulik U, Mukhopadhyay A, Bandyopadhyay S, Zhang MQ, Zhang X. Multiobjective fuzzy bi-clustering in microarray data: Method and a new performance measure. In Proc IEEE World Congress Comput Int/IEEE Congress Evol Comput 2008; (Hong Kong), pp. 383-8.
[53] Krishnapuram R, Joshi A, Yi L. A fuzzy relative of the k-medoids algorithm with application to document and snippet clustering. In Proc IEEE Intl Conf Fuzzy Systems (FUZZ-IEEE 99) 1999; (Seoul, South Korea), pp. 1281-6.
[54] Han L, Yan H. Fuzzy biclustering for DNA microarray data analysis. In Proc IEEE Int Conf Fuzzy Syst FUZZ-IEEE 2008 (IEEE World Congress Comput Int) 2008; pp. 1132-8.
[55] Tanay A, Sharan R, Shamir R. Discovering statistically significant biclusters in gene expression data. Bioinformatics 2002; 18: S136-S44.
[56] Prelic A, Bleuler S, Zimmermann P, et al. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 2006; 22(9): 1122-9.
[57] Ahmad W, Khokhar A. cHawk: A highly efficient biclustering algorithm using bigraph crossing minimization. In Proc 2nd Int Workshop Data Mining Bioinform 2007.
[58] Angiulli F, Cesario E, Pizzuti C. Random walk biclustering for microarray data. Inform Sci 2008; 178(6): 1479-97.
[59] Dharan S, Nair AS. Biclustering of gene expression data using reactive greedy randomized adaptive search procedure. BMC Bioinformatics 2009; 10(Suppl 1): S27.
[60] Lazzaroni L, Owen A. Plaid models for gene expression data. Statistica Sinica 2002; 12: 61-86.
[61] Sheng Q, Moreau Y, Moor BD. Biclustering microarray data by Gibbs sampling. Bioinformatics 2003; 19: 196-205.
[62] Kluger Y, Basri R, Chang JT, Gerstein M. Spectral biclustering of
microarray data: coclustering genes and conditions. Genome Res 2003; 13: 703-16.
[63] Yoon S, Nardini C, Benini L, Micheli GD. Discovering coherent biclusters from gene expression data using zero-suppressed binary
decision diagrams. IEEE/ACM Trans Comput Biol Bioinform 2005; 2(4): 339-54.
[64] Zhao L, Zaki MJ. MicroCluster: Efficient deterministic biclustering of microarray data. IEEE Int Syst 2005; 20(6): 40-9.
[65] Carmona-Saez P, Pascual-Marqui RD, Tirado F, Carazo JM, Pascual-Montano A. Biclustering of gene expression data by non-smooth non-negative matrix factorization. BMC Bioinformatics 2006; 7: 366.
[66] Tchagang BA, Tewfik AH. DNA microarray data analysis: a novel biclustering algorithm approach. EURASIP J Appl Signal Proc 2006.
[67] Teng L, Chan L-W. Biclustering gene expression profiles by alternately sorting with weighted correlated coefficient. In Proc IEEE Int Workshop Machine Learning Signal Proc 2006; pp. 289-94.
[68] Liu F, Zhou H. Biclustering of gene expression data based on bucketing technique. In: Proc. 1st International Conference on Bioinformatics and Biomedical Engineering (ICBBE) Wuhan, China 2007; pp. 359-62.
[69] Gu J, Liu JS. Bayesian biclustering of gene expression data. BMC Genomics 2008; 9(Suppl 1): S4.
[70] Dimaggio P, Mcallister S, Floudas C, Feng XJ, Rabinowitz J, Rabitz H. Biclustering via optimal re-ordering of data matrices in systems biology: rigorous methods and comparative studies. BMC Bioinformatics 2008; 9(1): 458.
[71] Madeira SC, Oliveira AL. An efficient biclustering algorithm for finding genes with similar patterns in time-series expression data. In Proc 5th Asia Pacific Bioinformatics Conference Series in Advances in Bioinformatics and Computational Biology 5: (Hong Kong), Imperial College Press, January 2007; pp. 67-80.
[72] Mahfouz MA, Ismail MA. BIDENS: Iterative density based biclustering algorithm with application to gene expression analysis. Proc World Acad Sci Eng Tech 2009; 37: 342-48.
[73] Barkow S, Bleuler S, Prelic A, Zimmermann P, Zitzler E. BicAT: a Biclustering Analysis Toolbox. Bioinformatics 2006; 22(10): 1282-3.
[74] Cheng KO, Law NF, Siu WC, Lau TH. BiVisu: software tool for bicluster detection and visualization. Bioinformatics 2007; 23: 2342-4.
[75] Wu C-J, Kasif S. GEMS: A web server for biclustering analysis of expression data. Nucleic Acids Res 2005; 33: 596-9.
[76] Shamir R, Maron-Katz A, Tanay A, et al. EXPANDER - an integrative program suite for microarray data analysis. BMC Bioinformatics 2005; 6: 232.
[77] Santamaría R, Theron R, Quintales L. BicOverlapper: a tool for bicluster visualization. Bioinformatics 2008; 24(9): 1212-3.
[78] Gonçalves JP, Madeira SC, Oliveira AL. BiGGEsTS: integrated environment for biclustering analysis of time series gene expression data. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa (INESC-ID). Tech Rep 2009; p. 23.
[79] Zhang Y, Zha H, Chu C-H. A time-series biclustering algorithm for revealing co-regulated genes. In Proc Int Conf Inform Tech Coding Comput 2005; 1: pp. 32-7.
[80] de Franca FO, Ferreira HM, Zuben FJ V. Applying biclustering to perform collaborative filtering. In Proc Int Conf Int Syst Design Appl (Los Alamitos, CA, USA) 2007; pp. 421-6, IEEE Computer Society.
[81] Liu S, Chen Y, Yang M, Ding R. Bicluster algorithm and used in market analysis. In Intl Workshop Knowl Dis Data Mining (Los Alamitos, CA, USA), 2009; pp. 504-7, IEEE Computer Society.
[82] de Castro PA D, de Franca FO, Ferreira HM, Zuben FJ V. Applying biclustering to text mining: An immune-inspired approach. In Proc Int Conf Artificial Immune Syst 2007; pp. 83-94.
Received: August 22, 2009 Revised: November 04, 2009 Accepted: January 06, 2010