On Biclustering of Gene Expression Data



Current Bioinformatics, 2010, Vol. 5, No. 3, pp. 204-216

1574-8936/10 $55.00+.00 © 2010 Bentham Science Publishers Ltd.

On Biclustering of Gene Expression Data

Anirban Mukhopadhyay*,1, Ujjwal Maulik2 and Sanghamitra Bandyopadhyay3

1Department of Theoretical Bioinformatics, DKFZ (Deutsches Krebsforschungszentrum, German Cancer Research Center), Im Neuenheimer Feld 580, D-69120, Heidelberg, Germany; 2Department of Computer Science and Engineering, Jadavpur University, Kolkata-700032, India; 3Machine Intelligence Unit, Indian Statistical Institute, Kolkata-700108, India

Abstract: Microarray technology enables the expression patterns of a huge number of genes to be monitored simultaneously across different experimental conditions or time points. Biclustering of microarray data is an important technique for discovering groups of genes that are co-regulated in a subset of experimental conditions. Traditional clustering algorithms find groups of genes or conditions over the complete feature space. They may therefore fail to discover local patterns in which a subset of genes behaves similarly over a subset of conditions. Biclustering algorithms aim to discover such local patterns from the gene expression matrix, and can thus be thought of as clustering genes and conditions simultaneously. In recent years, a large number of biclustering algorithms have been proposed in the literature. This article studies various issues regarding the biclustering problem and presents a comprehensive survey of the available biclustering algorithms. A survey of freely available biclustering software is also included.

Keywords: Microarray, gene expression, biclustering, bicluster types, biclustering algorithms, biclustering software.

1. INTRODUCTION

The classical approach to genomic research was based on the local study and collection of data on single genes. With the advancement in microarray technology, it has now become feasible to have a global and simultaneous view of the expression levels of many thousands of genes across different time points or experimental conditions [1]. In recent years, microarray technology has had a major impact in many fields such as medical diagnosis, bio-medicine, characterizing various gene functions, understanding different molecular biological processes, gene expression profiling etc. [2-5]. The development of microarrays has created new application opportunities for data mining methodologies. Microarray chips measure the expression levels of a large number of genes and hence produce large amounts of data. Because of this volume, computational analysis is essential for extracting knowledge from microarray gene expression data. Clustering is one of the primary approaches for analyzing such large amounts of data to discover groups of co-expressed genes.

Clustering [6], an important microarray analysis tool, has been used to identify sets of genes with similar expression profiles. In some early works, visual analysis was successfully used to group genes into functionally relevant classes in the Yeast cell cycle [3, 7] and Human large B-cell lymphoma [2] data sets. However, as these methods were very subjective, standard clustering methods, such as K-means [8], fuzzy C-means [9], hierarchical methods [4], Self Organizing Maps (SOM) [10], a graph theoretic approach [11], a simulated annealing based approach [12] and genetic algorithm (GA) based clustering methods [13] have been utilized for clustering microarray data.

*Address correspondence to this author, on leave from Department of Computer Science and Engineering, University of Kalyani, Kalyani – 741235, India; Tel: +91 33 2580 9618; Fax: +91 33 2582 8282; E-mail: [email protected]

Clustering algorithms have been applied to microarray data either to group the genes across the time points or experimental conditions/samples [10, 13-16] or to group the samples across the genes [17-19]. Clustering techniques that aim to find clusters of genes over all the experimental conditions may fail to discover genes having similar expression patterns over only a subset of conditions. Similarly, a clustering algorithm that groups the conditions/samples across all the genes may not capture a group of samples having similar expression values for only a subset of genes. It is often the case that a subset of genes is co-regulated and co-expressed across a subset of experimental conditions and has quite different expression patterns over the remaining conditions. Traditional clustering methods are not able to identify such local patterns, usually termed biclusters. Thus biclustering can be thought of as the simultaneous clustering of genes and conditions instead of clustering them separately. The aim of biclustering algorithms is to discover a subset of genes that are co-regulated over a subset of experimental conditions. Hence they provide a better reflection of the biological reality.

Although biclustering is a relatively new approach to gene expression data, it has a fast growing literature. In this article, we discuss several issues of biclustering, including a comprehensive review of the recent literature. The rest of the article is organized as follows: the next section describes the structure of a microarray gene expression data set. In Section 3, the biclustering problem is defined formally and different related definitions are provided. In Section 4, a discussion is made on the available biclustering algorithms. Section 5 describes some publicly available biclustering software. Section 6 concludes the article.

2. MICROARRAY GENE EXPRESSION DATA

A microarray [20] is a small chip onto which a large number of DNA molecules (probes) are attached in fixed grids. The chip is made of chemically coated glass, nylon, membrane or silicon. Each grid cell of a microarray chip corresponds to a DNA sequence. There are mainly two types of microarrays, viz., two-channel microarrays and single-channel microarrays [21]. In two-channel microarrays (also called two-color microarrays), two mRNA samples are reverse-transcribed into cDNA (targets) labelled using different fluorescent dyes (red-fluorescent dye Cy5 and green-fluorescent dye Cy3). Due to the complementary nature of the base-pairs, the cDNA binds to the specific oligonucleotides on the array. In the subsequent stage, the dyes are excited by a laser so that the amount of cDNA can be quantified by measuring the fluorescence intensities. The log ratio of the intensities of the two dyes is used as the gene expression level.

$$\text{gene expression level} = \log_2 \frac{Intensity(Cy5)}{Intensity(Cy3)}. \qquad (1)$$
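The log ratio of Eqn. 1 can be illustrated with a few lines of Python. This is only a sketch: the function name `log_ratio` and the intensity values are hypothetical, and the intensities are assumed to be positive, already background-corrected numbers.

```python
import math

def log_ratio(cy5_intensity, cy3_intensity):
    """Return log2(Cy5 / Cy3) for one spot, as in Eqn. 1."""
    if cy5_intensity <= 0 or cy3_intensity <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(cy5_intensity / cy3_intensity)

# Hypothetical (Cy5, Cy3) intensity pairs for three spots.
spots = [(800.0, 200.0), (150.0, 150.0), (50.0, 400.0)]
ratios = [log_ratio(red, green) for red, green in spots]
```

A positive value means the Cy5-labelled sample is more strongly expressed at that spot, zero means equal expression, and a negative value means the Cy3-labelled sample dominates.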

Although absolute levels of gene expression may be determined using two-channel microarrays, the system is more useful for determining relative differences in gene expression within a sample and between samples. Single-channel microarrays (also called one-color microarrays) are designed to estimate the absolute levels of gene expression, and thus require two separate single-dye hybridizations to compare two sets of conditions. As only a single dye is used, the data represent absolute values of gene expression. An advantage of single-channel microarrays is that the data are more easily compared to arrays from different experiments. However, in a single-channel system, one needs twice as many microarrays to compare the samples within an experiment.

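Mathematically, a microarray data set can be viewed as a $G \times C$ matrix $A(G, C)$ that represents the expression levels of a set of $G$ genes $G = \{I_1, I_2, \ldots, I_G\}$ over a set of $C$ conditions $C = \{J_1, J_2, \ldots, J_C\}$. Each element $m_{ij}$ of the matrix $A(G, C)$ represents the expression level of the $i$th gene at the $j$th condition, where $i \in G$ and $j \in C$ (Eqn. 2).

$$
A(G, C) =
\begin{array}{c|cccc}
 & J_1 & J_2 & \cdots & J_C \\
\hline
I_1 & m_{11} & m_{12} & \cdots & m_{1C} \\
I_2 & m_{21} & m_{22} & \cdots & m_{2C} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
I_G & m_{G1} & m_{G2} & \cdots & m_{GC}
\end{array}
\qquad (2)
$$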

3. BICLUSTERING PROBLEM AND DEFINITIONS

Given a $G \times C$ microarray data matrix $A(G, C)$ consisting of a set of $G$ genes $G = \{I_1, I_2, \ldots, I_G\}$ and a set of $C$ conditions $C = \{J_1, J_2, \ldots, J_C\}$, a bicluster can be defined as follows:

Definition 1 (Bicluster) A bicluster is a submatrix $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$, of the matrix $A(G, C)$, where $I \subseteq G$ and $J \subseteq C$, and the subset of genes in the bicluster are similarly expressed over the subset of conditions.

The problem of biclustering is thus to identify a set of

biclusters from a given data matrix depending on some

coherence criterion to evaluate the quality of the biclusters.

In general, the complexity of a biclustering problem depends

on the coherence criterion used. However, in almost all

cases, the biclustering problem is known to be NP-complete.

Therefore, a number of approaches use heuristics for

discovering biclusters from a gene expression matrix.

Depending on how the genes in a bicluster are similar to

each other under the experimental conditions, biclusters can

be categorized into different types. The following subsection

provides the definitions of different types of biclusters.

3.1. Types of Biclusters

There are mainly six types of biclusters viz., (1)

biclusters with constant values, (2) biclusters with constant

rows, (3) biclusters with constant columns, (4) biclusters

with additive pattern, (5) biclusters with multiplicative

pattern and (6) biclusters with both additive and

multiplicative pattern. The additive and multiplicative

patterns are also referred to as shifting and scaling patterns,

respectively [22]. The different types of biclusters are

defined as follows:

Definition 2 (Biclusters with Constant Values) In a bicluster $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$, with constant values, all the elements have the same value, i.e.,

$$m_{ij} = \mu, \quad i \in I,\ j \in J. \qquad (3)$$

Definition 3 (Biclusters with Constant Rows) In a bicluster $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$, with constant rows, all the elements of each row of the bicluster have the same value. Hence in this type of bicluster, each element is represented using one of the following notations:

$$m_{ij} = \mu + a_i, \quad i \in I,\ j \in J, \qquad (4)$$

$$m_{ij} = \mu \times b_i, \quad i \in I,\ j \in J, \qquad (5)$$

$$m_{ij} = \mu \times b_i + a_i, \quad i \in I,\ j \in J. \qquad (6)$$

Here $\mu$ is a constant value for a bicluster, $a_i$ is the additive (shifting) factor for row $i$ and $b_i$ is the multiplicative (scaling) factor for row $i$.

Definition 4 (Biclusters with Constant Columns) In a bicluster $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$, with constant columns, all the elements of each column of the bicluster have the same value. Hence in this type of bicluster, each element is represented using one of the following notations:

$$m_{ij} = \mu + p_j, \quad i \in I,\ j \in J, \qquad (7)$$

$$m_{ij} = \mu \times q_j, \quad i \in I,\ j \in J, \qquad (8)$$

$$m_{ij} = \mu \times q_j + p_j, \quad i \in I,\ j \in J. \qquad (9)$$

Here $\mu$ is a constant value for a bicluster, $p_j$ is the additive (shifting) factor for column $j$ and $q_j$ is the multiplicative (scaling) factor for column $j$.

Definition 5 (Biclusters with Additive Pattern) In a bicluster $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$, with additive (shifting) pattern, each column and row has only some additive (shifting) factors. Hence in this type of bicluster, each element is represented as:

$$m_{ij} = \mu + a_i + p_j, \quad i \in I,\ j \in J. \qquad (10)$$

Definition 6 (Biclusters with Multiplicative Pattern) In a bicluster $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$, with multiplicative (scaling) pattern, each column and row has only some multiplicative (scaling) factors. Hence in this type of bicluster, each element is represented as:

$$m_{ij} = \mu \times b_i \times q_j, \quad i \in I,\ j \in J. \qquad (11)$$

Definition 7 (Biclusters with both Additive and Multiplicative Patterns) In a bicluster $M(I, J) = [m_{ij}]$, $i \in I$, $j \in J$, with both additive (shifting) and multiplicative (scaling) pattern, each column and row has both additive (shifting) and multiplicative (scaling) factors. Hence in this type of bicluster, each element is represented as:

$$m_{ij} = \mu \times b_i \times q_j + a_i + p_j, \quad i \in I,\ j \in J. \qquad (12)$$

Note that these biclusters are the most general form of biclusters. All other types of biclusters are special cases of these biclusters.
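To make the relationship between the bicluster types concrete, the following Python sketch builds small example biclusters from the general form of Definition 7 and obtains the other types by switching off shifting factors (set to 0) or scaling factors (set to 1). The function name `make_bicluster` and all numeric values are illustrative assumptions, not part of the surveyed methods.

```python
def make_bicluster(mu, a, b, p, q):
    """Build m_ij = mu * b_i * q_j + a_i + p_j as a list of rows.

    a, b: per-row shifting and scaling factors.
    p, q: per-column shifting and scaling factors.
    """
    return [[mu * b_i * q_j + a_i + p_j for p_j, q_j in zip(p, q)]
            for a_i, b_i in zip(a, b)]

n_rows, n_cols = 3, 4
zeros_r, ones_r = [0.0] * n_rows, [1.0] * n_rows
zeros_c, ones_c = [0.0] * n_cols, [1.0] * n_cols

# Constant values: no shifting and no scaling.
constant = make_bicluster(2.0, zeros_r, ones_r, zeros_c, ones_c)
# Constant rows: only row shifts a_i.
const_rows = make_bicluster(2.0, [0.0, 1.0, 5.0], ones_r, zeros_c, ones_c)
# Additive pattern: row shifts a_i and column shifts p_j.
additive = make_bicluster(2.0, [0.0, 1.0, 5.0], ones_r,
                          [0.0, 3.0, 1.0, 2.0], ones_c)
# Multiplicative pattern: row scales b_i and column scales q_j.
multiplicative = make_bicluster(2.0, zeros_r, [1.0, 2.0, 0.5],
                                zeros_c, [1.0, 3.0, 2.0, 4.0])
```

In the additive example any two rows differ by a constant offset, while in the multiplicative example any two rows differ by a constant factor.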

3.2. Some Important Definitions

Here we discuss some important terms regarding

biclusters and the biclustering problem.

Definition 8 (Bicluster Variance) Bicluster variance $VARIANCE(I, J)$ of a bicluster $M(I, J)$ is defined as follows:

$$VARIANCE(I, J) = \sum_{i \in I, j \in J} (m_{ij} - m_{IJ})^2, \qquad (13)$$

where $m_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} m_{ij}$, i.e., the mean of the elements in the bicluster.

Definition 9 (Residue) The residue $r_{ij}$ of any element $m_{ij}$ of a bicluster $M(I, J)$ is defined as:

$$r_{ij} = m_{ij} - m_{iJ} - m_{Ij} + m_{IJ}, \qquad (14)$$

where $m_{iJ}$ is the mean of the $i$th row, i.e., $m_{iJ} = \frac{1}{|J|} \sum_{j \in J} m_{ij}$, $m_{Ij}$ is the mean of the $j$th column, i.e., $m_{Ij} = \frac{1}{|I|} \sum_{i \in I} m_{ij}$, and $m_{IJ}$ is the mean of all the elements in the bicluster, i.e., $m_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} m_{ij}$.

Fig. (1). Examples of different types of biclusters: (a) Constant, (b) Row-constant, (c) Column-constant, (d) Additive Pattern, (e) Multiplicative Pattern, (f) Both Additive and Multiplicative Patterns.

Definition 10 (Mean Squared Residue) The mean squared residue $MSR(I, J)$ of a bicluster $M(I, J)$ is defined as:

$$MSR(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} r_{ij}^2. \qquad (15)$$

The mean squared residue score of a bicluster represents

the level of coherence among the elements of the bicluster.

Lower residue score indicates greater coherence and thus

better quality of the bicluster.
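The residue and mean squared residue can be computed directly from their definitions. The sketch below is a plain Python illustration; the function name `mean_squared_residue` and the two toy biclusters are assumptions made for the example.

```python
def mean_squared_residue(m):
    """Mean squared residue of a bicluster given as a list of equal-length rows."""
    n_rows, n_cols = len(m), len(m[0])
    row_mean = [sum(row) / n_cols for row in m]
    col_mean = [sum(m[i][j] for i in range(n_rows)) / n_rows
                for j in range(n_cols)]
    all_mean = sum(row_mean) / n_rows  # equals the overall mean for equal-length rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Residue of element (i, j): value - row mean - column mean + overall mean.
            r = m[i][j] - row_mean[i] - col_mean[j] + all_mean
            total += r * r
    return total / (n_rows * n_cols)

# Rows differ by constant shifts (additive pattern).
additive = [[1.0, 2.0, 4.0], [3.0, 4.0, 6.0], [0.0, 1.0, 3.0]]
# Rows differ by constant factors (multiplicative pattern).
multiplicative = [[1.0, 2.0, 4.0], [2.0, 4.0, 8.0], [3.0, 6.0, 12.0]]

msr_additive = mean_squared_residue(additive)
msr_multiplicative = mean_squared_residue(multiplicative)
```

For the additive example the score is zero up to floating-point rounding, whereas for the multiplicative example it is 28/27, roughly 1.04. A purely multiplicative pattern is therefore not perfectly coherent under this measure.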

Definition 11 (Row Variance) The row variance $VAR(I, J)$ of a bicluster $M(I, J)$ is defined as:

$$VAR(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (m_{ij} - m_{iJ})^2. \qquad (16)$$

A high row variance indicates that the rows (genes) of

the biclusters have large variance across the conditions.

Sometimes high row variance is desirable in order to escape

from trivial constant biclusters.
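As a small illustration of why row variance helps to separate trivial from interesting biclusters, the following Python sketch computes it for a constant bicluster and for a bicluster whose genes fluctuate across conditions. The function name `row_variance` and the toy data are assumptions for this example.

```python
def row_variance(m):
    """Row variance of a bicluster given as a list of equal-length rows."""
    n_rows, n_cols = len(m), len(m[0])
    total = 0.0
    for row in m:
        mean = sum(row) / n_cols
        # Squared deviations of each element from its own row mean.
        total += sum((x - mean) ** 2 for x in row)
    return total / (n_rows * n_cols)

flat = [[5.0, 5.0, 5.0], [5.0, 5.0, 5.0]]          # constant bicluster
fluctuating = [[1.0, 5.0, 9.0], [2.0, 6.0, 10.0]]  # additive, strongly varying rows

var_flat = row_variance(flat)
var_fluctuating = row_variance(fluctuating)
```

Both examples are perfectly coherent in the sense of the mean squared residue, but only the second has a non-zero row variance (64/6), which is what distinguishes it from a trivial constant bicluster.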

4. BICLUSTERING ALGORITHMS

In recent years, a large number of biclustering algorithms

have been proposed for gene expression data analysis. In this

section, we discuss some popular biclustering algorithms in

different categories such as iterative greedy search,

randomized greedy search, evolutionary techniques, graph

based algorithms, fuzzy methods etc.

4.1. Iterative Greedy Search

The concept of biclustering was first introduced by Hartigan in [23] in the form of direct clustering. As a coherence measure of a bicluster $M(I, J)$, the bicluster variance $VARIANCE(I, J)$ was used (Eqn. 13). The goal of the algorithm was to extract $K$ biclusters from the given data set while minimizing the sum of the bicluster variances of the $K$ biclusters. In each iteration, the algorithm partitions the data matrix into a set of submatrices, each of which is considered as a bicluster. For a constant bicluster, $VARIANCE(I, J)$ is zero. However, every single element of the data matrix, taken as a bicluster on its own, also trivially satisfies this zero-variance criterion. To avoid this degenerate solution, the algorithm is run only until the data matrix has been partitioned into $K$ submatrices. Hartigan's algorithm was able to detect constant biclusters only. However, he proposed using other homogeneity criteria to detect other types of biclusters.

Cheng and Church first introduced the biclustering problem in the case of microarray gene expression data [24]. They introduced the coherence measure called Mean Squared Residue (MSR) (Eqn. 15). Cheng and Church proposed a greedy search heuristic that searches for the largest possible bicluster while keeping the MSR under a threshold $\delta$ (such a bicluster is called a $\delta$-bicluster). The algorithm has two phases. In the first phase, starting with the complete data matrix, rows and columns are deleted in order to bring the MSR score below $\delta$. In this regard, Cheng and Church suggested a greedy heuristic to rapidly converge to a locally maximal submatrix with MSR score below $\delta$. In the second phase, rows and columns are added as long as the MSR score does not increase. The same procedure is executed for $K$ iterations in order to discover $K$ $\delta$-biclusters. At each iteration, the bicluster found in the previous iteration is masked with random values in order to avoid overlaps. Since the MSR score is zero for biclusters with constant values, constant rows, constant columns and additive patterns, the Cheng and Church algorithm is able to detect these kinds of biclusters only. Moreover, the algorithm is known to get stuck at local optima often, and it also suffers from random interference due to the masking of biclusters with random values.
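The deletion phase can be sketched in Python as follows. This is a deliberately simplified illustration and not the authors' implementation: it removes one row or column at a time (whichever has the largest mean squared residue contribution) until the MSR falls to `delta` or below, and it omits the addition phase and the random masking. The names `msr_and_scores` and `single_node_deletion`, the toy matrix and the threshold are all assumptions.

```python
def msr_and_scores(data, rows, cols):
    """Return (MSR, per-row score, per-column score) of the selected submatrix."""
    row_mean = {i: sum(data[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(data[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean.values()) / len(rows)
    # Squared residue of every selected element.
    sq = {(i, j): (data[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
          for i in rows for j in cols}
    msr = sum(sq.values()) / (len(rows) * len(cols))
    row_score = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
    col_score = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
    return msr, row_score, col_score

def single_node_deletion(data, delta):
    """Greedily delete the worst row or column until MSR <= delta."""
    rows = list(range(len(data)))
    cols = list(range(len(data[0])))
    while len(rows) > 1 and len(cols) > 1:
        msr, row_score, col_score = msr_and_scores(data, rows, cols)
        if msr <= delta:
            break
        worst_row = max(rows, key=row_score.get)
        worst_col = max(cols, key=col_score.get)
        if row_score[worst_row] >= col_score[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    return rows, cols

# Rows 0-2 follow an additive pattern; row 3 is noise (hypothetical data).
data = [
    [1.0, 2.0, 4.0],
    [3.0, 4.0, 6.0],
    [0.0, 1.0, 3.0],
    [9.0, 0.0, 5.0],
]
found_rows, found_cols = single_node_deletion(data, delta=0.1)
```

On this toy matrix the noisy fourth row has by far the largest score, so it is deleted first, and the remaining three rows form an additive bicluster with MSR of zero.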

In [25], the authors extended the concept of the $\delta$-bicluster to cope with the problem of masking the missing values as well as masking the biclusters found in the previous iteration with random values. In this algorithm, the residue of a specified (non-missing) element in a bicluster is computed as per Eqn. 14, but the residue of an unspecified (missing) element is taken to be zero. This algorithm allows the biclusters to overlap and is thus termed FLexible Overlapped biClustering (FLOC). The FLOC algorithm begins with an initial set of biclusters (seeds) and iteratively improves the overall quality of the biclustering. At each iteration, each row and column is moved among the biclusters to yield a better biclustering in terms of lower MSR. The best biclustering obtained during an iteration is used as the initial biclustering seed in the next iteration. The algorithm terminates automatically when the current iteration fails to improve the overall biclustering quality. Thus FLOC is able to evolve $k$ biclusters simultaneously. However, this algorithm can also identify only constant and additive patterns, and fails to detect multiplicative patterns.

In [26], an algorithm called Order Preserving Sub-matrix (OPSM) is proposed. Here a bicluster is defined as a submatrix where the order of the selected conditions is preserved for all of the selected genes. Hence, the expression values of the genes within a bicluster induce an identical linear ordering across the selected conditions. The authors proposed a deterministic iterative algorithm to find large and statistically significant biclusters. The time complexity of this technique is $O(GC^3k)$, where $G$ and $C$ are the number of genes and conditions of the input data set, respectively, and $k$ is the number of biclusters found. Thus OPSM does not scale well for high-dimensional data sets.
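The order-preserving property itself is easy to state in code. The Python sketch below only checks whether a given set of rows and columns forms an order-preserving submatrix; it does not implement the search procedure of [26]. The function name `is_order_preserving` and the toy matrix are assumptions, and ties between values are treated as violating the ordering.

```python
def is_order_preserving(data, rows, cols):
    """True if one ordering of cols makes every selected row strictly increasing."""
    if not rows:
        return True
    # The ordering induced by the first selected row must hold for all rows.
    order = sorted(cols, key=lambda j: data[rows[0]][j])
    return all(
        all(data[i][order[k]] < data[i][order[k + 1]]
            for k in range(len(order) - 1))
        for i in rows
    )

# Hypothetical expression values: 3 genes x 4 conditions.
data = [
    [0.2, 1.5, 0.9, 3.0],
    [10.0, 40.0, 20.0, 5.0],
    [1.0, 2.0, 1.5, 9.0],
]
first_three = is_order_preserving(data, [0, 1, 2], [0, 1, 2])
all_four = is_order_preserving(data, [0, 1, 2], [0, 1, 2, 3])
```

Over the first three conditions all genes share the ordering condition 0 < condition 2 < condition 1, even though their magnitudes differ widely; adding the fourth condition breaks the common ordering for the second gene.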

In [27] and [28], the authors proposed Iterative Signature

Algorithm (ISA) where a bicluster is considered to be a

transcription module, i.e., a set of co-regulated genes together

with the associated set of regulating conditions. The algorithm

starts with an initial set of genes and all samples are scored

with respect to this gene set. The samples, for which the score

exceeds a predefined threshold are chosen. Similarly, all

genes are scored regarding the selected samples and a new set

of genes is selected based on another user-defined threshold.

This procedure is iterated until the set of genes and the set of

samples converge, i.e., do not change anymore. ISA can

discover more than one bicluster by starting with different

initial gene sets. The choice of initial reference gene set

plays an important role in ISA in order to obtain good

quality results. ISA is highly sensitive to the threshold values

and often tends to identify a strong bicluster many times.

In xMotif biclustering [29], the biclusters which contain

genes that are almost constantly expressed across the

selected conditions are identified. At first, each gene is

assigned a set of statistically significant states which define

the set of valid biclusters. In xMotif, a bicluster is considered

to be a submatrix where each gene is exactly in the same

state for all the selected conditions. The aim is to identify the

largest bicluster. To identify the largest valid biclusters, an

iterative search method is proposed that is run on different

initial random seeds. It should be noted that xMotif

framework requires pre-identification of the classes of

biclusters present in the data which may not be feasible for

most of the real life data sets.

In general, greedy search algorithms scale well in large

data sets. However, they mainly suffer from the problem of

getting stuck at local optima depending on the initial

configuration.

4.2. Two-Way Clustering

In [30], the authors present a coupled two-way clustering

(CTWC) approach to gene microarray data analysis. The

main idea is to identify subsets of the genes and samples,

such that when one of these is used to cluster the other,

stable and significant partitions emerge. They present an

algorithm, based on iterative clustering, that performs such a

search. This two-way clustering algorithm repeatedly

performs one-way clustering on the rows and columns of the

data matrix using stable clusters of rows as attributes for

column clustering and vice-versa. Although the authors used

hierarchical clustering, any reasonable choice of clustering

method and definition of stable cluster can be used within

the framework of CTWC. As a preprocessing step, they used

normalization which allowed them to capture biclusters with

constant columns also.

Interrelated Two-Way Clustering (ITWC) [31], an

algorithm similar to CTWC, combines the results of one-way

clustering on both dimensions of the gene expression matrix

for producing biclusters. As a preprocessing step, the rows of the data matrix are first normalized. Thereafter, the vector-angle cosine value between each row and a predefined stable

pattern is computed to determine whether the row values

vary much among the columns. The rows with very little

variation are then removed. After that, correlation coefficient

is used to measure the strength of the linear relationship

between two rows or two columns, to perform the two-way

clustering. As correlation coefficient is independent of the

magnitude and only depends on the pattern, ITWC is able to

detect both additive and multiplicative biclusters.
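The invariance of the correlation coefficient to shifting and scaling can be checked numerically. The Python sketch below computes the Pearson correlation of a row with a shifted and a scaled copy of itself; it illustrates only this property, not the ITWC procedure, and the function name `pearson` and the values are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

base = [1.0, 4.0, 2.0, 8.0]
shifted = [v + 10.0 for v in base]   # additive (shifting) relation
scaled = [v * 3.0 for v in base]     # multiplicative (scaling) relation

corr_shifted = pearson(base, shifted)
corr_scaled = pearson(base, scaled)
```

Both correlations equal 1 up to floating-point rounding, so rows related by either a shift or a positive scale factor are treated as perfectly similar.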

The Double Conjugated Clustering (DCC) [32] algorithm is a node-driven algorithm that unifies the two viewpoints of microarray clustering, viz., clustering the samples taking the genes as the features and clustering the genes taking the samples as the features. DCC performs both tasks simultaneously

to achieve a unified clustering where the sample clusters are

discriminated by subsets of genes. The clustering in sample

space and gene space are synchronized by a projection of

nodes between the spaces mapping the sample clusters to the

corresponding gene clusters. The method may utilize any

relevant clustering technique like SOM and K-means. The

data does not scatter across all offered nodes due to the

projection between the two clustering spaces. DCC

algorithm can provide sharp clusters and empty nodes even

in the case of number of nodes exceeding the number of

clusters. However, DCC can only find constant biclusters

from the input data set.

The two-way clustering algorithms in general cluster the

data set from both the dimensions (rows and columns) and

finally try to combine the clustering of the two dimensions in

order to obtain the biclusters. However, there is no standard

rule for the choice of the number of clusters in both the gene

and condition dimensions.

4.3. Evolutionary Biclustering

Evolutionary algorithms, like Genetic Algorithms (GA)

[33] and Simulated Annealing (SA) [34] have been used

extensively in the biclustering problem. Some of these

algorithms are described below.

4.3.1. GA Based Biclustering

In [35], a genetic algorithm based biclustering framework has been developed. As an encoding strategy, the authors use a binary string of length $G + C$, where $G$ and $C$ denote the number of genes and the number of conditions/samples/time points, respectively. If a bit position is `1', then the corresponding gene or condition is selected in the bicluster, and if a bit position is `0', the corresponding gene or condition is not selected in the bicluster. Hence, each chromosome encodes one possible bicluster. The following fitness function $F$ is minimized:

$$
F =
\begin{cases}
\dfrac{1}{|I||J|} & \text{if } MSR(I, J) \le \delta, \\[2mm]
\dfrac{MSR(I, J)}{\delta} & \text{otherwise.}
\end{cases}
\qquad (17)
$$

Hence, if the MSR of the bicluster encoded in a chromosome is within the threshold $\delta$ (i.e., it is a $\delta$-bicluster), the objective is to maximize the volume $|I||J|$. Otherwise, the objective is to minimize the MSR. The algorithm employs a special selection operator called environment selection to maintain the diversity of the population in order to identify a set of biclusters in one run. A local search strategy is used to expedite the rate of convergence. As the local search, one iteration of the Cheng and Church node deletion and addition algorithm is executed before computing the fitness value of a chromosome. The chromosome is also updated with the new bicluster obtained after the local search. Standard uniform crossover and bit-flip mutation operators are adopted for generating the next generation.
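The binary encoding and the two-case fitness can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation of [35]: the second case is taken to be the MSR divided by the threshold, empty biclusters are assigned the worst possible fitness, and the local search step is left out. The names `decode`, `msr` and `fitness`, the toy matrix and the threshold are hypothetical.

```python
def decode(chromosome, n_genes):
    """Split a binary string of length G + C into selected row and column indices."""
    rows = [i for i in range(n_genes) if chromosome[i] == 1]
    cols = [j - n_genes for j in range(n_genes, len(chromosome))
            if chromosome[j] == 1]
    return rows, cols

def msr(data, rows, cols):
    """Mean squared residue of the submatrix selected by rows and cols."""
    row_mean = {i: sum(data[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(data[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean.values()) / len(rows)
    total = sum((data[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
                for i in rows for j in cols)
    return total / (len(rows) * len(cols))

def fitness(chromosome, data, delta):
    """Fitness to be minimized: 1/volume for delta-biclusters, MSR/delta otherwise."""
    rows, cols = decode(chromosome, len(data))
    if not rows or not cols:
        return float("inf")  # assumption: an empty bicluster is the worst case
    score = msr(data, rows, cols)
    if score <= delta:
        return 1.0 / (len(rows) * len(cols))
    return score / delta

# Hypothetical data: rows 0-2 are additive, row 3 is noise.
data = [
    [1.0, 2.0, 4.0],
    [3.0, 4.0, 6.0],
    [0.0, 1.0, 3.0],
    [9.0, 0.0, 5.0],
]
coherent = [1, 1, 1, 0, 1, 1, 1]   # genes 0-2, all three conditions
everything = [1, 1, 1, 1, 1, 1, 1]  # the whole matrix

fit_coherent = fitness(coherent, data, delta=0.1)
fit_everything = fitness(everything, data, delta=0.1)
```

With these definitions every $\delta$-bicluster receives a fitness of at most 1 and every other bicluster a fitness above 1, so any valid bicluster is preferred over any invalid one, and larger valid biclusters are preferred over smaller ones.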

A similar GA based biclustering approach can be found

in [36]. Here, instead of using Cheng and Church algorithm

as a local search strategy in each step of fitness computation,

it is only used once initially. The initial population consists

of bicluster seeds generated through K-means clustering in

both dimensions and combining the gene and sample

clusters. Thereafter these seeds are grown up through Cheng

and Church algorithm. Subsequently the normal GA process

follows. As the fitness function, the authors minimized the

ratio of MSR to the volume of the biclusters in order to

capture large yet coherent biclusters.

Another GA based biclustering, called Sequential

Evolutionary BIclustering (SEBI) is proposed in [37]. In this

work also, the authors use binary chromosomes as discussed

above. SEBI minimizes the following fitness function:

,),(

1),(= penaltyw

JIVAR

JIMSRd +++F

(18)

where )||||

(=J

wI

www crVd + . Here Vw , rw and cw

represent weights on volume, number of rows and number of

columns in the bicluster, respectively. Also

)(=, ijpJjIi

mwpenalty , where )( ijp mw is an weight

associated with each element ijm of the bicluster and it is

defined as:

0.|>)COV(m| if

0|=)COV(m| if0

=)(

ij|)(|

|)(|

,

ij

ijmCOV

klmCOV

JlIkijp

e

emw

(19)

Here $|COV(m_{ij})|$ denotes the number of biclusters containing $m_{ij}$. The weight $w_p(m_{ij})$ is used to control the amount of overlap among the biclusters. Binary tournament selection is used. Three crossover operators (one-point, two-point and uniform crossover) and three mutation operators (standard bit-flip mutation, mutation by adding a row and mutation by adding a column) have been studied. SEBI does not use any local search strategy for updating the chromosomes.
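
The SEBI fitness can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of [37]: it assumes the fitness has the form MSR/delta + 1/VAR + w_d + penalty, with w_d = w_V (w_r delta/|I| + w_c delta/|J|) and delta an MSR threshold. The function names, the `cover` dictionary (number of previously found biclusters containing each element) and the default weights are hypothetical choices made for the example.

```python
import math


def msr(data, rows, cols):
    """Mean squared residue of the bicluster given by row and column indices."""
    row_mean = {i: sum(data[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(data[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean[i] for i in rows) / len(rows)
    total = sum((data[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
                for i in rows for j in cols)
    return total / (len(rows) * len(cols))


def row_variance(data, rows, cols):
    """Mean squared deviation of each element from its row mean."""
    row_mean = {i: sum(data[i][j] for j in cols) / len(cols) for i in rows}
    total = sum((data[i][j] - row_mean[i]) ** 2 for i in rows for j in cols)
    return total / (len(rows) * len(cols))


def penalty(rows, cols, cover):
    """Overlap penalty: only elements already covered elsewhere contribute."""
    norm = sum(math.exp(-cover.get((k, l), 0)) for k in rows for l in cols)
    return sum(norm / math.exp(-cover[(i, j)])
               for i in rows for j in cols if cover.get((i, j), 0) > 0)


def sebi_fitness(data, rows, cols, cover, delta, w_v=1.0, w_r=1.0, w_c=1.0):
    """Assumed SEBI fitness (lower is better); needs a non-zero row variance."""
    w_d = w_v * (w_r * delta / len(rows) + w_c * delta / len(cols))
    return (msr(data, rows, cols) / delta
            + 1.0 / row_variance(data, rows, cols)
            + w_d + penalty(rows, cols, cover))
```

With an empty `cover` the penalty vanishes, so the first bicluster found is judged only on coherence, row variance and size; later biclusters pay for every element they share with earlier ones.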

All the above algorithms use chromosomes whose length equals the number of genes plus the number of conditions, so the chromosomes are very long when the data set is large. This makes operators such as crossover and mutation take longer and thus slows down convergence. Taking this into account, a novel encoding strategy is proposed in GA based Biclustering (GABI) [38]. Here each string has two parts: one for clustering the genes, and another for clustering the conditions. If $M$ and $N$ denote the maximum number of gene clusters and the maximum number of condition clusters, respectively, then the length of each string is $M + N$. The first $M$ positions represent the $M$ cluster centers for the genes, and the remaining $N$ positions represent the $N$ cluster centers for the conditions. Thus a string looks like the following: $\{gc_1\ gc_2\ \ldots\ gc_M\ cc_1\ cc_2\ \ldots\ cc_N\}$, where each $gc_i$, $i = 1 \ldots M$, represents the index of a gene that acts as the center of a cluster of genes, and each $cc_j$, $j = 1 \ldots N$, represents the index of a condition that acts as the center of a cluster of conditions. For a data set having $n$ points, it is usual to assume that the data set may contain at most $\sqrt{n}$ clusters. Taking this into account, the values of $M$ and $N$ are set to $\sqrt{G}$ and $\sqrt{C}$, respectively, where $G$ and $C$ denote the number of genes and the number of conditions in the data set. The first $M$ positions can take values in the range $\{0, 1, 2, \ldots, G\}$ and the next $N$ positions can take values in the range $\{0, 1, 2, \ldots, C\}$. Hence the gene and condition cluster centers are represented by the indices of genes and conditions, respectively, while a value of 0 at any position means the absence of a cluster center. A string that encodes $M$ gene clusters and $N$ condition clusters represents a set of $M \times N$ biclusters, one for each pair of gene and condition clusters. Each pair $\langle gc_i, cc_j \rangle$, $i = 1 \ldots M$, $j = 1 \ldots N$, represents a bicluster that consists of all genes of the gene cluster centered at gene $gc_i$ and all conditions of the condition cluster centered at condition $cc_j$. During the fitness computation, the gene clusters and condition clusters encoded in the chromosome are updated in a K-means like iteration. The fitness function of a bicluster is defined as follows:

$$F = \frac{MSR(I,J)}{\delta \cdot (1 + VAR(I,J))}. \tag{20}$$

The denominator of $F$ is chosen in such a way as to avoid an accidental divide-by-zero condition when the row variance $VAR(I,J)$ becomes 0. $F$ is minimized to obtain highly coherent yet "interesting" (high variance) biclusters. The fitness function $F$ is computed for each encoded $\delta$-bicluster, and the fitness of a chromosome is then computed as the mean of the fitness values of all the $\delta$-biclusters encoded in it. Conventional roulette wheel selection and uniform crossover are used in GABI. The mutation operation works as follows. A random position is chosen from the first $M$ positions and its value is replaced by an index randomly chosen from the range $\{0, 1, 2, \ldots, G\}$, where $G$ is the total number of genes. Similarly, to mutate the condition portion of the string, a random position is selected from the next $N$ positions and its value is substituted by a randomly selected index from the range $\{0, 1, 2, \ldots, C\}$, where $C$ is the total number of conditions. Elitism is used to track the best string found until the current generation.
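
The center-based encoding can be made concrete with a short Python sketch. It is an illustration only, with several assumptions that are not taken from [38]: genes and conditions are assigned to the nearest encoded center by squared Euclidean distance, the K-means like update of the centers is omitted, the per-bicluster fitness is assumed to be MSR / (delta (1 + VAR)), and all function names are hypothetical. At least one gene center and one condition center must be non-zero.

```python
def assign(profiles, centers):
    """Group profile indices around the nearest encoded center.

    centers holds 1-based indices; 0 means "no center at this position".
    """
    active = sorted({c - 1 for c in centers if c != 0})
    clusters = {c: [] for c in active}
    for idx, profile in enumerate(profiles):
        nearest = min(active, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(profile, profiles[c])))
        clusters[nearest].append(idx)
    return [members for members in clusters.values() if members]


def bicluster_fitness(data, rows, cols, delta):
    """Assumed per-bicluster fitness: MSR / (delta * (1 + VAR))."""
    row_mean = {i: sum(data[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(data[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean.values()) / len(rows)
    size = len(rows) * len(cols)
    msr = sum((data[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
              for i in rows for j in cols) / size
    var = sum((data[i][j] - row_mean[i]) ** 2
              for i in rows for j in cols) / size
    return msr / (delta * (1.0 + var))


def chromosome_fitness(data, chromosome, n_gene_slots, delta=1.0):
    """Mean fitness over all gene-cluster/condition-cluster pairs."""
    columns = [list(col) for col in zip(*data)]
    gene_clusters = assign(data, chromosome[:n_gene_slots])
    cond_clusters = assign(columns, chromosome[n_gene_slots:])
    scores = [bicluster_fitness(data, rows, cols, delta)
              for rows in gene_clusters for cols in cond_clusters]
    return sum(scores) / len(scores)
```

For a 4 x 4 matrix, the chromosome `[1, 3, 1, 3]` with two gene slots encodes gene centers 1 and 3 and condition centers 1 and 3, and therefore up to four biclusters, while `[1, 0, 1, 0]` encodes a single bicluster covering the whole matrix.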

4.3.2. SA Based Biclustering

Several works in the literature use Simulated Annealing (SA) for the biclustering problem. A standard representation of a configuration in SA is equivalent to the binary string used in GA based biclustering. This representation is used in [39]. Here the initial configuration consists of all '1's, i.e., it encodes the complete data set. The perturbation is equivalent to the bit-flip mutation used in GA. The energy to be minimized is taken as the MSR of the encoded bicluster.

A similar approach is found in [40], where instead of starting from the complete data matrix, a seed bicluster is first created by clustering the genes and samples and combining them. SA is then used to grow the seed. Here also, MSR is used as the coherence measure. The perturbation includes only the addition of a random gene and/or condition.
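
The binary-string SA scheme can be sketched as follows. This is a minimal illustration rather than the algorithm of [39] or [40]: the geometric cooling schedule, the default parameter values, the rule that a bicluster must keep at least two rows and two columns, and the name `anneal_bicluster` are all assumptions made for the example.

```python
import math
import random


def msr(data, rows, cols):
    """Mean squared residue of the bicluster given by row and column indices."""
    row_mean = {i: sum(data[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(data[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean.values()) / len(rows)
    total = sum((data[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
                for i in rows for j in cols)
    return total / (len(rows) * len(cols))


def anneal_bicluster(data, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Bit-flip simulated annealing with MSR as the energy."""
    rng = random.Random(seed)
    n_rows, n_cols = len(data), len(data[0])

    def energy(bits):
        rows = [i for i in range(n_rows) if bits[i]]
        cols = [j for j in range(n_cols) if bits[n_rows + j]]
        if len(rows) < 2 or len(cols) < 2:
            return float("inf")  # assumption: forbid degenerate biclusters
        return msr(data, rows, cols)

    state = [1] * (n_rows + n_cols)  # all ones encodes the complete data set
    current = energy(state)
    best, best_energy = state[:], current
    temperature = t0
    for _ in range(steps):
        pos = rng.randrange(len(state))
        state[pos] ^= 1  # perturbation: flip one bit
        candidate = energy(state)
        accept = candidate <= current or rng.random() < math.exp(
            (current - candidate) / temperature)
        if accept:
            current = candidate
            if current < best_energy:
                best, best_energy = state[:], current
        else:
            state[pos] ^= 1  # undo the rejected move
        temperature *= cooling
    return best, best_energy
```

Because MSR alone rewards small biclusters, a search of this kind tends to shrink the selection; practical energies usually add a size term, which is left out here to keep the sketch short.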

4.3.3. Hybrid Approaches

In [41], a hybrid Genetic Algorithm-Particle Swarm Optimization (GA-PSO) approach, which uses binary strings to encode the biclusters, is proposed. The GA and PSO have their own populations that evolve through the standard GA and PSO processes, respectively. At each iteration, a random set of individual solutions is exchanged between the two populations. As the fitness function, it uses the one described in Eqn. 17.

4.3.4. Multiobjective Biclustering

As the biclustering problem requires several objectives to be optimized, such as MSR, volume and row variance, some approaches pose biclustering as a multiobjective optimization problem [42]. The work in [38] has been extended to the multiobjective case in [43]. The algorithm is termed MultiObjective GA based Biclustering (MOGAB). Here the authors use the same encoding strategy consisting of gene clusters and condition clusters. Two objectives, viz., $MSR(I,J)$ and $\frac{1}{1 + VAR(I,J)}$, are optimized simultaneously. This algorithm uses NSGA-II [44] as the underlying multiobjective optimization tool. The crossover and mutation operators are kept the same as in [38].

In [45], the authors extended their work of [37] to the multiobjective case. The algorithm is called Sequential Multi-Objective Biclustering (SMOB). Here also a binary encoding strategy is used. Three objective functions, viz., mean squared residue, volume and row variance, are optimized simultaneously. In [46], a Crowding distance based Multi-Objective Particle Swarm Optimization Biclustering (CMOPSOB) algorithm is proposed that uses binary encoding. The algorithm optimizes the MSR, volume and VAR simultaneously. In [47], a hybrid multiobjective biclustering algorithm that combines NSGA-II and the Estimation of Distribution Algorithm (EDA) [48] for searching biclusters is proposed. The volume and MSR of the biclusters are optimized simultaneously. A multiobjective artificial immune system based biclustering algorithm that is capable of performing a multi-population search, named MOM-aiNet, is proposed in [49].
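
What "optimized simultaneously" means can be illustrated with the Pareto dominance test that multiobjective methods such as NSGA-II rely on. The sketch below is generic and not taken from any of the cited algorithms: every objective is assumed to be minimized, so a quantity to be maximized (for example volume) is negated, and the helper names are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated(points):
    """Keep only the objective vectors that no other vector dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]


# Each tuple is (MSR, -volume) for one candidate bicluster.
candidates = [(0.2, -12), (0.5, -30), (0.6, -20), (0.2, -10)]
front = non_dominated(candidates)
```

Here the small coherent bicluster `(0.2, -12)` and the large, less coherent one `(0.5, -30)` both survive, since neither is better in both objectives; the other two candidates are dominated.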

In general, evolutionary algorithms are known for their strength in avoiding locally optimal solutions. Especially when they are equipped with some local search, they can converge fast toward the global optimum. However, the algorithms that optimize MSR as an objective function fail to discover multiplicative patterns. Also, evolutionary algorithms are inherently slower than the greedy iterative algorithms and depend heavily on parameters such as population size, number of generations, crossover and mutation rates, annealing schedule etc. But in general, it has been found that evolutionary algorithms, especially the multiobjective ones, perform better than the greedy search strategies.
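
The limitation of MSR with respect to multiplicative patterns is easy to check numerically. The following Python sketch uses made-up row and column effects: it builds a perfectly additive bicluster and a perfectly multiplicative one from the same effects and computes the MSR of each. The log transform at the end is a standard observation for positive data, not a step described in this survey.

```python
import math


def msr(data):
    """Mean squared residue of a complete matrix."""
    n_rows, n_cols = len(data), len(data[0])
    row_mean = [sum(row) / n_cols for row in data]
    col_mean = [sum(col) / n_rows for col in zip(*data)]
    all_mean = sum(row_mean) / n_rows
    total = sum((data[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
                for i in range(n_rows) for j in range(n_cols))
    return total / (n_rows * n_cols)


row_effect = [1.0, 2.0, 4.0]   # illustrative values only
col_effect = [1.0, 3.0, 5.0]

additive = [[r + c for c in col_effect] for r in row_effect]
multiplicative = [[r * c for c in col_effect] for r in row_effect]
log_multiplicative = [[math.log(v) for v in row] for row in multiplicative]
```

The additive matrix has an MSR of zero (up to rounding), whereas the multiplicative one, although perfectly coherent, has an MSR of 112/27, roughly 4.15. Taking logarithms turns the products into sums, so the MSR of `log_multiplicative` is again zero.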

4.4. Fuzzy Biclustering

Some recent biclustering algorithms employ fuzzy set theory in order to capture overlapping biclusters. In [50], a flexible fuzzy co-clustering algorithm that incorporates feature-cluster weighting in the formulation is proposed. The algorithm, called Flexible Fuzzy Co-clustering with Feature-cluster Weighting (FFCFW), allows the number of object clusters to differ from the number of feature clusters. A feature-cluster weighting scheme is incorporated for each object cluster generated by FFCFW so that the relationships between the two types of clusters are manifested in the feature-cluster weights. This enables FFCFW to generate a more accurate representation of fuzzy co-clusters. FFCFW uses an iterative optimization procedure.

In [51], a GA based possibilistic fuzzy biclustering algorithm, GFBA, is proposed. In GFBA, instead of binary chromosomes, the authors use real valued chromosomes of length $G + C$. Each position in the chromosome has a value between 0 and 1, representing the degree of membership of the corresponding gene or condition to the encoded bicluster. They fuzzified the different coherence and quality metrics such as MSR, VAR and volume of the biclusters as follows. The means of each row ($m_{iJ}$), each column ($m_{Ij}$) and all the elements ($m_{IJ}$) of a bicluster are redefined as:


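$$m_{iJ} = \frac{\sum_{j=1}^{C} f_J(j)^{\mu} \cdot m_{ij}}{\sum_{j=1}^{C} f_J(j)^{\mu}}, \tag{21}$$

$$m_{Ij} = \frac{\sum_{i=1}^{G} f_I(i)^{\mu} \cdot m_{ij}}{\sum_{i=1}^{G} f_I(i)^{\mu}}, \tag{22}$$

and

$$m_{IJ} = \frac{\sum_{i=1}^{G} \sum_{j=1}^{C} f_I(i)^{\mu} \cdot f_J(j)^{\mu} \cdot m_{ij}}{\sum_{i=1}^{G} \sum_{j=1}^{C} f_I(i)^{\mu} \cdot f_J(j)^{\mu}}, \tag{23}$$

where $f_I(i)$ and $f_J(j)$ denote the membership degree of the $i$th gene and $j$th condition to the bicluster, respectively, and $\mu$ is the fuzzy exponent. Hence the fuzzy mean squared residue $FMSR$ is defined as:

$$FMSR(I,J) = \frac{1}{|I||J|} \sum_{i=1}^{G} \sum_{j=1}^{C} f_I(i)^{\mu} \cdot f_J(j)^{\mu} \, (m_{ij} - m_{iJ} - m_{Ij} + m_{IJ})^2, \tag{24}$$

where $|I| = \sum_{i=1}^{G} f_I(i)$ and $|J| = \sum_{j=1}^{C} f_J(j)$. The objective function to be minimized is selected as: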

$$F = FMSR(I,J) + \alpha \cdot FMSR(I,J) \cdot \frac{\sum_{i=1}^{G} (1 - f_I(i))^{\mu}}{|I|} + \beta \cdot FMSR(I,J) \cdot \frac{\sum_{j=1}^{C} (1 - f_J(j))^{\mu}}{|J|}, \tag{25}$$

where $\alpha$ and $\beta$ are parameters provided to satisfy different requirements on the incoherence and the sizes of the biclusters. Conventional roulette wheel selection and single point crossover followed by mutation (increasing or decreasing a membership value) have been used. GFBA also uses a bicluster optimization technique at each generation for faster convergence.
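
The fuzzified residue of Eqs. 21 to 24 can be sketched in Python as follows. The sketch assumes that memberships raised to the fuzzy exponent act as weights in the row, column and overall means, and that the weighted squared residues are divided by the fuzzy sizes |I| and |J| (the plain sums of the memberships). The name `fuzzy_msr` and the default exponent are hypothetical, and at least one gene and one condition must have a non-zero membership.

```python
def fuzzy_msr(data, f_rows, f_cols, mu=2.0):
    """Fuzzy mean squared residue for membership vectors f_rows and f_cols."""
    n_rows, n_cols = len(data), len(data[0])
    w_r = [f ** mu for f in f_rows]   # gene weights
    w_c = [f ** mu for f in f_cols]   # condition weights
    row_mean = [sum(w_c[j] * data[i][j] for j in range(n_cols)) / sum(w_c)
                for i in range(n_rows)]
    col_mean = [sum(w_r[i] * data[i][j] for i in range(n_rows)) / sum(w_r)
                for j in range(n_cols)]
    all_mean = sum(w_r[i] * w_c[j] * data[i][j]
                   for i in range(n_rows)
                   for j in range(n_cols)) / (sum(w_r) * sum(w_c))
    weighted = sum(
        w_r[i] * w_c[j]
        * (data[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
        for i in range(n_rows) for j in range(n_cols))
    return weighted / (sum(f_rows) * sum(f_cols))
```

With crisp memberships (only 0 and 1) the function reduces to the classical MSR of the selected submatrix, which gives a convenient sanity check.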

In [52], an NSGA-II based multiobjective probabilistic fuzzy biclustering algorithm is proposed which uses chromosomes encoding a set of gene cluster centers and a set of condition cluster centers as in [38, 43]. In this case, the gene and condition cluster centers are updated using one step of fuzzy K-medoids clustering [53], and for each gene and condition, the fuzzy membership degree to each gene cluster and condition cluster, respectively, is computed. The fuzzy volume of the bicluster $B(I,J)$ corresponding to the pair $\langle gc_x, cc_y \rangle$ ($\langle$gene cluster $x$, condition cluster $y\rangle$) is defined as:

$$fvol(I,J) = \sum_{i=1}^{G} \sum_{j=1}^{C} \mu_{xi}^{m} \, \mu_{yj}^{m}. \tag{26}$$

Here $I$ is a fuzzy set corresponding to the fuzzy gene cluster centered at $gc_x$. It consists of all genes $g_i$ with membership degree $\mu_{xi}$, $1 \le i \le G$. Similarly, $J$ is a fuzzy set corresponding to the fuzzy condition cluster centered at $cc_y$. It consists of all conditions $c_j$ with membership degree $\mu_{yj}$, $1 \le j \le C$.

The residue of an element $a_{ij}$ of the fuzzy bicluster $B(I,J)$ is defined as:

$$fr_{ij} = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}, \tag{27}$$

where

$$a_{iJ} = \frac{\sum_{j=1}^{C} \mu_{yj}^{m} \, a_{ij}}{\sum_{j=1}^{C} \mu_{yj}^{m}}, \tag{28}$$

$$a_{Ij} = \frac{\sum_{i=1}^{G} \mu_{xi}^{m} \, a_{ij}}{\sum_{i=1}^{G} \mu_{xi}^{m}}, \tag{29}$$

and

$$a_{IJ} = \frac{\sum_{i=1}^{G} \sum_{j=1}^{C} \mu_{xi}^{m} \, \mu_{yj}^{m} \, a_{ij}}{fvol(I,J)}, \tag{30}$$

where $m$ is the fuzzy exponent. The fuzzy mean squared residue $FMSR(I,J)$ of the fuzzy bicluster $B = (I,J)$ is defined as:

$$FMSR(I,J) = \frac{1}{fvol(I,J)} \sum_{i=1}^{G} \sum_{j=1}^{C} \mu_{xi}^{m} \, \mu_{yj}^{m} \, fr_{ij}^{2}. \tag{31}$$

Subsequently, the fuzzy expression profile variance of $B(I,J)$ is computed as:

$$fvar(I,J) = \frac{1}{fvol(I,J)} \sum_{i=1}^{G} \sum_{j=1}^{C} \mu_{xi}^{m} \, \mu_{yj}^{m} \, (a_{ij} - a_{iJ})^{2}. \tag{32}$$

For each $\langle gc_x, cc_y \rangle$ pair, representing a fuzzy bicluster, the above three quantities ($fvol$, $FMSR$ and $fvar$) are computed. As each chromosome encodes a number of possible biclusters, the average values of the fuzzy volume, fuzzy MSR and fuzzy variance are taken as the three objectives to be optimized simultaneously. Note that the first and the third objectives are to be maximized while the second one is minimized. The other genetic operators are similar to those used in [43].
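
The three fuzzy objectives of Eqs. 26 to 32 can be computed together, as in the Python sketch below. It assumes that the membership vectors of one gene cluster and one condition cluster are already available (the fuzzy K-medoids step that would produce them is not shown); the function name and the default fuzzy exponent are hypothetical.

```python
def fuzzy_objectives(data, mu_x, mu_y, m=2.0):
    """Return (fvol, FMSR, fvar) for one fuzzy bicluster."""
    n_rows, n_cols = len(data), len(data[0])
    w_x = [u ** m for u in mu_x]   # gene memberships raised to m
    w_y = [u ** m for u in mu_y]   # condition memberships raised to m
    fvol = sum(w_x) * sum(w_y)     # equals the double sum of w_x[i] * w_y[j]
    a_iJ = [sum(w_y[j] * data[i][j] for j in range(n_cols)) / sum(w_y)
            for i in range(n_rows)]
    a_Ij = [sum(w_x[i] * data[i][j] for i in range(n_rows)) / sum(w_x)
            for j in range(n_cols)]
    a_IJ = sum(w_x[i] * w_y[j] * data[i][j]
               for i in range(n_rows) for j in range(n_cols)) / fvol
    fmsr = sum(w_x[i] * w_y[j]
               * (data[i][j] - a_iJ[i] - a_Ij[j] + a_IJ) ** 2
               for i in range(n_rows) for j in range(n_cols)) / fvol
    fvar = sum(w_x[i] * w_y[j] * (data[i][j] - a_iJ[i]) ** 2
               for i in range(n_rows) for j in range(n_cols)) / fvol
    return fvol, fmsr, fvar
```

For the additive matrix `[[1, 2, 3], [2, 3, 4]]` with memberships `[1, 0.5]` for the genes and `[1, 1, 1]` for the conditions, the sketch gives a fuzzy volume of 3.75, a fuzzy MSR of zero and a fuzzy variance of 2/3.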

In [54], another fuzzy biclustering algorithm called Fuzzy Biclustering for Microarray Data Analysis (FBMDA) is proposed. The method employs a combination of the Nelder-Mead and min-max algorithms to construct a hierarchically structured biclustering, and thus can represent the biclustering information at different levels. FBMDA uses multiobjective optimization that optimizes volume, variance and fuzzy entropy simultaneously. The Nelder-Mead algorithm is used to compute a single objective optimal solution, and the min-max algorithm is used to trade off between multiple objectives. FBMDA is not subject to convexity limitations and does not use derivative information. FBMDA ensures that the current local optimal solution is removed and that a higher precision is reached.

Incorporation of fuzziness in biclustering algorithms enables them to deal with noisy data and overlapping biclusters efficiently. But as most of the aforementioned fuzzy algorithms use evolutionary techniques as the underlying optimization strategy, they suffer from the fundamental disadvantages of evolutionary methods. Furthermore, the computation of fuzzy membership degrees adds to the running time of the fuzzy biclustering methods.

4.5. Graph Theoretic Approaches

Graph theoretic concepts and techniques have been

utilized in detecting biclusters. In [55], the authors

introduced SAMBA (Statistical Algorithmic Method for

Bicluster Analysis), a graph-theoretic approach to

biclustering in combination with a statistical data model. In

SAMBA the expression matrix is modeled as a bipartite

graph consisting of two sets of vertices corresponding to

genes and conditions. A bicluster is defined as a subgraph,

and a likelihood score is used in order to assess the

significance of observed subgraphs. SAMBA repeatedly

finds the maximal highly connected subgraph in the bipartite

graph. Then it performs local improvement by adding or

deleting a single vertex until no further improvement is

possible. SAMBA's time complexity is $O(N2^{d})$, where $d$ is

the upper bound on the degree of each vertex.

The Binary inclusion-Maximal (BiMax) biclustering

algorithm proposed in [56] identifies all biclusters in the

input matrix. BiMax algorithm works on a binary matrix.

The input matrix is first discretized to zeros and ones

according to a user-specified threshold. Based on this binary

matrix, BiMax identifies all maximal biclusters where a

bicluster is defined as a submatrix E containing all 1s. An

inclusion-maximal bicluster means that this bicluster is not

completely contained in any other bicluster. They used an

incremental algorithm to find the inclusion-maximal

biclusters exploiting the fact that the matrix E induces a

bipartite graph. As BiMax works with a binary matrix, it is

suitable only for detecting constant biclusters.
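
The notion of an inclusion-maximal bicluster can be illustrated on a tiny matrix. The Python sketch below is a brute-force enumeration over column subsets, written only to make the definition concrete; it is not the incremental procedure of [56], it is exponential in the number of columns, and the function names and the "greater than or equal" thresholding rule are assumptions.

```python
from itertools import combinations


def binarize(data, threshold):
    """Discretize the expression matrix: 1 if value >= threshold, else 0."""
    return [[1 if v >= threshold else 0 for v in row] for row in data]


def inclusion_maximal_biclusters(binary):
    """All all-ones submatrices that cannot be extended by a row or column."""
    n_rows, n_cols = len(binary), len(binary[0])
    found = set()
    for size in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            # Every row that has a 1 in each chosen column.
            rows = tuple(i for i in range(n_rows)
                         if all(binary[i][j] for j in cols))
            if not rows:
                continue
            # Every column that is all ones on those rows (superset of cols).
            closed = tuple(j for j in range(n_cols)
                           if all(binary[i][j] for i in rows))
            found.add((rows, closed))
    return sorted(found)
```

For the binary matrix `[[1, 1, 0], [1, 1, 1], [0, 1, 1]]` this returns four biclusters, for example rows (0, 1) with columns (0, 1); the pair of row 1 with columns (0, 1) is not reported because it is contained in a larger bicluster.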

In [57], the optimal biclustering problem is posed as a

problem of maximal crossing number reduction

(minimization) in a weighted bipartite graph. In this regard,

an algorithm called cHawk is proposed that employs the

barycenter heuristic and a local search technique. There are

three main steps of the algorithm, viz., construction of a

bipartite graph from the input matrix, bipartite graph

crossing minimization and finally, the bicluster

identification. This approach reorders the matrix so that all

rows and columns belonging to the same bicluster are

brought into the vicinity of each other. cHawk is able to

detect constant, additive and overlapped noisy biclusters.

The graph based biclustering algorithms usually model the input data set as a bipartite graph with two sets of nodes corresponding to the genes and conditions, respectively. The edges of the graph represent the level of overexpression or underexpression of a gene under a certain condition. A bicluster is a subgraph of the bipartite graph in which the genes are coherent across the selected conditions. In these types of algorithms, the genes and conditions are partitioned into the same number of clusters, which may be impractical. Moreover, the input data set has to be discretized properly before applying graph based algorithms. Also, they do not scale well to large data sets.

4.6. Randomized Greedy Search

In [58], a greedy random walk search technique for the biclustering problem, enriched by a local search strategy to escape local optima, has been presented. The algorithm begins with an initial random solution and searches for a locally optimal solution by successive transformations (including random moves taken with some probability) that improve a gain function defined as a combination of mean squared residue, expression profile variance and the volume of the bicluster. The algorithm iterates $k$ times to generate $k$ biclusters.

In [59], the basic concepts of the Greedy Randomized Adaptive Search Procedure (GRASP) metaheuristic, namely its construction and local search phases, are reviewed. A variant of GRASP called Reactive Greedy Randomized Adaptive Search Procedure (Reactive GRASP) is also proposed to detect significant biclusters from large microarray datasets. The method has two major steps. First, high quality bicluster seeds are generated by using K-means clustering in both dimensions and combining the clusters. In the second step, these seeds are grown using Reactive GRASP. In Reactive GRASP, the basic parameter that defines the restrictiveness of the candidate list is self-adjusted, depending on the quality of the solutions found previously.

Randomized greedy search algorithms try to combine the advantages of greedy search and randomization, so that they execute fast and do not get stuck at local optima. However, these algorithms still depend heavily on the initial choice of the solution, and there is no clear way to recover from a poor choice.

4.7. Other Recent Approaches

A number of biclustering algorithms have appeared in

recent literature that follow new methodologies. Some of

them are described here.


In [60], the authors introduce the plaid model, a statistical model which assumes that the expression value $m_{ij}$ in a bicluster is the sum of the main effect $\mu$, the gene effect $p_i$, the condition effect $q_j$ and the noise term $\epsilon_{ij}$:

$$m_{ij} = \mu + p_i + q_j + \epsilon_{ij}. \tag{33}$$

It is also assumed that the expression values in the overlap of two biclusters are the sum of the two module effects. In the plaid model, a greedy search strategy is used, hence errors can accumulate easily. Moreover, in the case of multiple clusters, the clusters identified by the method tend to overlap to a great extent.
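
The additive structure of the plaid model, including the summing of effects where biclusters overlap, can be made concrete with a small generator. This Python sketch only builds data that follow the model; it does not fit the model as in [60]. The noise term is omitted, the background outside all biclusters is assumed to be zero, and the layer representation and the numeric values are hypothetical.

```python
def plaid_matrix(n_rows, n_cols, layers):
    """Build a noise-free matrix as the sum of bicluster layers.

    Each layer is (rows, cols, main, gene_effect, cond_effect), where the
    two effects are dictionaries keyed by row and column index.
    """
    matrix = [[0.0] * n_cols for _ in range(n_rows)]
    for rows, cols, main, gene_effect, cond_effect in layers:
        for i in rows:
            for j in cols:
                matrix[i][j] += main + gene_effect[i] + cond_effect[j]
    return matrix


layers = [
    ([0, 1], [0, 1], 2.0, {0: 0.0, 1: 1.0}, {0: 0.0, 1: 0.5}),
    ([1, 2], [1, 2], 5.0, {1: 0.0, 2: 1.0}, {1: 0.0, 2: 1.0}),
]
matrix = plaid_matrix(3, 3, layers)
```

Element (1, 1) belongs to both layers, so its value is the sum of the two module effects: (2 + 1 + 0.5) + (5 + 0 + 0) = 8.5.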

In [61], a biclustering algorithm based on probabilistic Gibbs sampling is proposed. Gibbs sampling does not suffer from the problem of local minima that often characterizes Expectation Maximization. However, when the microarray data is organized in a patient vs. gene fashion and the number of patients is much lower than the number of genes, the algorithm faces computational difficulties. Moreover, the algorithm is only able to identify biclusters with constant columns.

In [62], the authors developed a spectral biclustering

method that simultaneously clusters genes and conditions,

finding distinctive checkerboard patterns in matrices of gene

expression data, if they exist. The method is based on the

observation that checkerboard structures can be found in

eigenvectors corresponding to the characteristic expression

patterns across the genes or conditions. In addition, these

eigenvectors can be readily identified by commonly used

linear algebra approaches such as singular value

decomposition (SVD), coupled with closely integrated

normalization steps.

In [63], the authors proposed a biclustering method that

employs dynamic programming and a divide-and-conquer

technique, as well as efficient data structures such as the trie

and zero-suppressed decision diagrams (ZBDDs). Use of

ZBDDs extends the stability of the method substantially.

In [64], the authors developed MicroCluster, a

deterministic biclustering method. In MicroCluster, only the

maximal biclusters satisfying certain homogeneity criteria

are considered. The clusters can be arbitrarily positioned

anywhere in the input data matrix, and they can have

arbitrary overlapping regions. MicroCluster uses a flexible

definition of a cluster that lets it mine several types of

biclusters. Moreover, MicroCluster can delete or merge

biclusters that have large overlaps. So, it can tolerate some

noise in the data set and let the users focus on the most

important clusters. As MicroCluster relies on extracting

maximal cliques from the constructed range multigraph, it is

computationally demanding. Moreover, there are several

input parameters that are to be tuned properly in order to find

suitable biclusters.

A method based on application of the non-smooth non-

negative matrix factorization technique for discovering local

structures (biclusters) from gene expression datasets is

developed in [65]. This method utilizes non-negative matrix

factorization with non-smoothness constraints for identifying

biclusters in gene expression data for a given factorization

rank.

In [66], biclustering algorithms using basic linear algebra

and arithmetic tools have been developed. The proposed

biclustering algorithms can be used to search for all

biclusters with constant values, biclusters with constant

values on rows, biclusters with constant values on columns,

and biclusters with coherent values from a set of data in a

timely manner and without solving any optimization

problem.

In [67], the authors proposed a biclustering method that alternately sorts the genes and conditions using the dominant set. By using a weighted correlation coefficient, they emphasize the similarities across a subset of the genes/conditions. Additionally, a coherence measure called Average Correlation Value (ACV) is proposed, which is effective in determining both additive and multiplicative patterns. Some special preprocessing of the input data set is needed for detecting additive and multiplicative biclusters, and different runs are needed to detect different types of biclusters.

In [68], a biclustering algorithm that adopts a bucketing technique to find a raw submatrix is proposed. The algorithm then refines and extends the raw submatrix into a bicluster. It is called the Bucketing and Extending Algorithm (BEA).

A Bayesian BiClustering (BBC) model that uses Gibbs sampling is proposed in [69]. For a single bicluster, the same model as in the plaid model is assumed, whereas for multiple biclusters, overlap is allowed either in genes or in conditions. Moreover, the authors use a flexible error model, which permits the error term of each bicluster to have a different variance.

In [70] the authors presented a rigorous approach to

biclustering, which is based on the Optimal RE-Ordering

(OREO) of the rows and columns of a data matrix so as to

globally minimize the dissimilarity metric. The physical

permutations of the rows and columns of the data matrix can

be modeled as either a network flow problem or a traveling

salesman problem. Cluster boundaries in one dimension are

used to partition and re-order the other dimensions of the

corresponding submatrices to generate the biclusters. The

reordering of the rows and the columns for large data sets

can be computationally demanding.

The authors in [71] proposed an algorithm that finds and

reports all maximal contiguous column coherent (CCC)

biclusters in time linear in the size of the expression matrix.

The linear time complexity of CCC-Biclustering relies on the

use of a discretized matrix and efficient string processing

techniques based on suffix trees. This algorithm can only

detect biclusters with columns arranged contiguously.

In [72], an iterative density based biclustering algorithm called BIDENS is proposed. BIDENS is able to detect a set of $k$ possibly overlapping biclusters simultaneously. The algorithm is similar to FLOC, but instead of using the residue as the objective function, it tries to maximize the overall density of the biclusters. The input data set needs to be discretized before the BIDENS algorithm is applied.

5. BICLUSTERING SOFTWARE

A number of free biclustering software packages are available, either for download and offline use or in the form of a web server. Here we list some free/open source biclustering software and discuss them in brief.

5.1. BicAT

BicAT (Biclustering Analysis Toolbox) [73] integrates

various biclustering (Cheng and Church, Bimax, xMotif,

OPSM, ISA) and clustering techniques (K-means,

hierarchical clustering) with a common graphical user

interface. Moreover, BicAT provides different facilities for

data preparation, inspection and postprocessing such as

discretization, filtering of biclusters according to specific

criteria or gene pair analysis for constructing gene

interconnection graphs. The toolbox is described in the

context of gene expression analysis, but is also applicable to

other types of data, e.g. data from proteomics or synthetic

lethal experiments. The BicAT toolbox is freely available at

http://www.tik.ee.ethz.ch/sop/bicat and it is platform

independent. The Java source code of the program and a

developer's guide are provided on the website as well. There is

provision for the users to add further algorithms or

extensions.

5.2. BiVisu

BiVisu (Bicluster detection and Visualization) [74] is an

open-source biclustering software tool for detection and

visualization of biclusters embedded in a gene expression

matrix. By using appropriate coherence relations, BiVisu

is able to detect constant, row-constant, column-constant,

additive and multiplicative biclusters. The biclustering

results can also be visualized under a 2D setting in the form

of parallel coordinate (PC) plots for each bicluster. From the

PC plots of the biclusters, both objective and subjective

cluster quality evaluation can be performed. BiVisu also

integrates some data preprocessing and postprocessing

techniques. BiVisu has been developed in Matlab and is

available at http://www.eie.polyu.edu.hk/~nflaw/Biclustering/

for free download.

5.3. GEMS

GEMS (Gene Expression Mining Server) [75] is a web-

enabled service for biclustering microarray gene expression

data. Users may upload their gene expression data and

specify a set of criteria. GEMS then performs biclustering

based on a Gibbs sampling paradigm. GEMS web server

provides a useful and flexible platform for the discovery of

co-expressed and potentially co-regulated gene modules.

GEMS is open source software and is available at

http://genomics10.bu.edu/terrence/gems/ for free download.

5.4. EXPANDER

EXPANDER (EXpression Analyzer and DisplayER) [76]

is a Java-based tool for the analysis of gene expression data. It is

capable of clustering, visualization, biclustering and

performing downstream analysis of clusters and biclusters

such as functional enrichment and promoter analysis. In

general, EXPANDER can analyze groups of genes for

enrichment of transcription factor binding sites in their

promoters. EXPANDER currently integrates the SAMBA

[55] biclustering algorithm. The software is freely

downloadable from http://acgt.cs.tau.ac.il/expander/.

5.5. BicOverlapper

BicOverlapper [77] is a tool for visualizing biclusters

from gene-expression matrices in a way that helps to

compare biclustering methods, to unravel trends and to

highlight relevant genes and conditions. The technique is

based on a force-directed graph where biclusters are

represented as flexible overlapped groups of genes and

conditions. The BicOverlapper software and supplementary

material are available at http://vis.usal.es/bicoverlapper.

5.6. BiGGEsTS

BiGGEsTS (Biclustering Gene Expression Time Series)

[78] is a free and open source software tool providing an

integrated environment for the biclustering of time series

gene expression data. It offers a set of biclustering

algorithms (CCC [71], e-CCC [71], CC-TSB [79]) for time

series expression data. Moreover, it implements several

visualization techniques such as colored matrices,

expression evolution charts, pattern charts, dendrograms and

gene ontology graphs. BiGGEsTS integrates well known

techniques for preprocessing data: filtering genes, filling

missing values, smoothing, normalization and discretization.

The software is available at http://kdbio.inesc-

id.pt/software/biggests/.

5.7. BICLUST

BICLUST is an R-package for biclustering analysis

which contains a collection of biclustering algorithms,

preprocessing methods (normalization and discretization) for

two way data, and validation and visualization techniques for

bicluster results. The main function biclust provides several

algorithms to find biclusters in two-dimensional data: Cheng

and Church, Spectral, Plaid Model, Xmotifs and BiMax. The

package is available at the following website:

http://crantastic.org/packages/biclust.

CONCLUSION AND FUTURE CHALLENGES

Biclustering is a method for simultaneous clustering of both genes and conditions of a microarray gene expression matrix. Unlike clustering, biclustering methods try to capture local modules, i.e., sets of genes that are coregulated and coexpressed in a subset of conditions. In recent times, there has been tremendous growth in biclustering research and a large number of algorithms have been proposed. In this article, we have attempted to present a comprehensive review of the biclustering models. Recent biclustering algorithms of different categories, along with their pros and cons, have been discussed. Moreover, an overview of some freely available biclustering software is provided.

Most of the biclustering algorithms have been applied to

microarray gene expression data sets for identifying

coregulated genes and classifying tissue samples.

Biclustering algorithms have also been applied for detection

of different responses to treatment, and the set of genes to be

used as the most effective probes, mainly in cancer

microarrays, such as Leukemia [29, 30, 61]. Other than gene

expression data sets, biclustering algorithms have also been

successfully applied to e-commerce data and collaborative

filtering [80], marketing data [81], and text mining [82] etc.

Although many publications continue to appear in the biclustering area, many challenges remain to be addressed. In many papers, the authors pose biclustering as an optimization problem that optimizes some coherence measure. Many such algorithms optimize the mean squared residue (MSR) to capture coherent biclusters. However, it has recently been shown that MSR can detect only constant and additive patterns and cannot detect multiplicative or combined patterns [22]. It is therefore a challenge to devise a new coherence measure that can capture both additive and multiplicative patterns and, more desirably, combined patterns as well. Moreover, there is still no generally accepted measure for comparing the quality of the biclusters obtained by different biclustering algorithms, so it is difficult to judge the superiority of any particular algorithm. This issue must be addressed. Furthermore, studies are needed on extending biclustering algorithms to generate triclusters from 3D gene-sample-time microarray data sets.
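
The limitation of MSR noted above can be illustrated with a small Python sketch. It assumes the usual Cheng and Church definition of MSR [24], in which the residue of an entry is the entry minus its row mean, minus its column mean, plus the overall mean of the bicluster. The function name and the toy row and column effects are hypothetical and chosen only for illustration; they are not taken from any of the surveyed tools.

```python
def mean_squared_residue(block):
    """Mean squared residue of a bicluster given as a list of equal-length rows."""
    n_rows, n_cols = len(block), len(block[0])
    row_means = [sum(row) / n_cols for row in block]
    col_means = [sum(block[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Residue: entry minus row mean minus column mean plus overall mean.
            residue = block[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue ** 2
    return total / (n_rows * n_cols)


# Hypothetical row (gene) and column (condition) effects for 3 x 3 toy biclusters.
row_effects = [1.0, 2.0, 4.0]
col_effects = [1.0, 3.0, 5.0]

constant = [[5.0] * len(col_effects) for _ in row_effects]
additive = [[r + c for c in col_effects] for r in row_effects]
multiplicative = [[r * c for c in col_effects] for r in row_effects]

msr_constant = mean_squared_residue(constant)
msr_additive = mean_squared_residue(additive)
msr_multiplicative = mean_squared_residue(multiplicative)

print("constant:", msr_constant)
print("additive:", msr_additive)
print("multiplicative:", msr_multiplicative)
```

Under these assumptions, the constant and additive blocks give an MSR of zero up to floating-point rounding, whereas the multiplicative block gives a clearly positive value (112/27, about 4.15), because its residue reduces to the product of the row and column deviations from their means. A perfectly coherent multiplicative bicluster would therefore be penalized by an MSR-based objective.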

REFERENCES

[1] Sharan R, Maron-Katz A, Shamir R. CLICK and EXPANDER: A system for clustering and visualizing gene expression data. Bioinformatics 2003; 19: 1787-99.

[2] Alizadeh AA, Eisen MB, Davis R, et al. Distinct types of diffuse large B-cell lymphomas identified by gene expression profiling. Nature 2000; 403: 503-11.

[3] Chu S, DeRisi J, Eisen M, et al. The transcriptional program of sporulation in budding yeast. Science 1998; 282: 699-705.

[4] Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998; 95(25): 14863-68.

[5] Bandyopadhyay S, Maulik U, Wang JTL. Analysis of Biological Data: A Soft Computing Approach. World Scientific 2007.

[6] Jain AK, Dubes RC. Algorithms for clustering data. Englewood Cliffs, NJ: Prentice-Hall, 1988.

[7] Cho RJ, Campbell MJ, Winzeler EA, et al. A genome-wide transcriptional analysis of mitotic cell cycle. Mol Cell 1998; 2: 65-73.

[8] Herwig R, Poustka A, Müller C, Lehrach H, O'Brien J. Large-scale clustering of cDNA fingerprinting data. Genome Res 1999; 9(11): 1093-105.

[9] Dembele D, Kastner P. Fuzzy c-means method for clustering microarray data. Bioinformatics 2003; 19(8): 973-80.

[10] Tamayo P, Slonim D, Mesirov J, et al. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 1999; 96: 2907-12.

[11] Hartuv E, Shamir R. A clustering algorithm based on graph connectivity. Inform Proc Lett 2000; 76(200): 175-81.

[12] Lukashin AV, Fuchs R. Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics 2001; 17(5): 405-19.

[13] Bandyopadhyay S, Mukhopadhyay A, Maulik U. An improved algorithm for clustering gene expression data. Bioinformatics 2007; 23(21): 2859-65.

[14] Maulik U, Mukhopadhyay A, Bandyopadhyay S. Combining Pareto-optimal clusters using supervised learning for identifying co-expressed genes. BMC Bioinformatics 2009; 10: 27.

[15] Yeung KY, Haynor DR, Ruzzo WL. Validating clustering for gene expression data. Bioinformatics 2001; 17(4): 309-18.

[16] Qin ZS. Clustering microarray gene expression data using weighted Chinese restaurant process. Bioinformatics 2006; 22(16): 1988-97.

[17] Pan H, Zhu J, Han D. Genetic algorithms applied to multi-class clustering for gene expression data. Genomics Proteomics Bioinformatics 2003; 1: 279-87.

[18] Tasoulis DK, Plagianakos VP, Vrahatis MN. Unsupervised clustering of bioinformatics data. In Eur Symp Int Tech Hybrid Syst implementation Smart Adaptive Syst 2004; pp. 47-53.

[19] Mukhopadhyay A, Maulik U, Bandyopadhyay S. Unsupervised cancer classification through SVM-boosted multiobjective fuzzy clustering with majority voting ensemble. In Proc IEEE Congress on Evolutionary Comput 2009; pp. 255-61.

[20] Causton HC, Quackenbush J, Brazma A. Microarray gene expression data analysis: A beginner's guide. Blackwell Pub., April 2003.

[21] http://en.wikipedia.org/wiki/dna_microarray.

[22] Aguilar-Ruiz JS. Shifting and scaling patterns from gene expression data. Bioinformatics 2005; 21(20): 3840-45.

[23] Hartigan J. Direct clustering of a data matrix. J Am Stat Assoc 1972; 67(337): 123-29.

[24] Cheng Y, Church GM. Biclustering of gene expression data. Proc Int Conf Int Syst Mol Biol 2000; pp. 93-103.

[25] Yang J, Wang W, Wang H, Yu P. Enhanced biclustering on expression data. In Proc 3rd IEEE Conf Bioinform Bioeng 2003; pp. 321-27.

[26] Ben-Dor A, Chor B, Karp R, Yakhini Z. Discovering local structure in gene expression data: The order-preserving sub-matrix problem. In Proc 6th Ann Int Conf Comput Biol 2002; 1-58113-498-3: pp. 49-57.

[27] Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N. Revealing modular organization in the yeast transcriptional network. Nat Genet 2002; 31: 370-7.

[28] Ihmels J, Bergmann S, Barkai N. Defining transcription modules using large-scale gene expression data. Bioinformatics 2004; 20: 1993-2003.

[29] Murali TM, Kasif S. Extracting conserved gene expression motifs from gene expression data. In Proc Pacific Symp Biocomput 2003; 8: 77-88.

[30] Getz G, Levine E, Domany E. Coupled two-way cluster analysis of gene microarray data. Proc Natl Acad Sci USA 2000; 12079-84.

[31] Tang C, Zhang L, Zhang I, Ramanathan M. Interrelated two-way clustering: An unsupervised approach for gene expression data analysis. In Proc Sec IEEE Int Symp Bioinform Bioeng 2001; pp. 41-8.

[32] Busygin S, Jacobsen G, Krämer E, Ag C. Double conjugated clustering applied to leukemia microarray data. In Proc 2nd SIAM ICDM Workshop on clustering high dimensional data 2002.

[33] Goldberg DE. Genetic algorithms in search, optimization and machine learning. New York: Addison-Wesley, 1989.

[34] Kirkpatrick S, Gelatt CD, Vecchi MP. Optimization by simulated annealing. Science 1983; 220: 671-80.

[35] Bleuler S, Prelic A, Zitzler E. An EA framework for biclustering of gene expression data. In Proc IEEE Congress on Evolutionary Comput 2004; pp. 166-73.

[36] Chakraborty A, Maka H. Biclustering of gene expression data using genetic algorithm. In Proc IEEE Symp Comput Int Bioinform Comput Biol 2005.

[37] Divina F, Aguilar-Ruiz JS. Biclustering of expression data with evolutionary computation. IEEE Trans Knowl Data Eng 2006; 18: 590-602.

[38] Mukhopadhyay A, Maulik U, Bandyopadhyay S. Evolving coherent and non-trivial biclusters from gene expression data: An evolutionary approach. Proc IEEE Region 10 Conf 2008.

[39] Bryan K, Cunningham P, Bolshakova N. Biclustering of expression data using simulated annealing. In Proc 18th IEEE Symp Comput Based Medical Syst (Dublin, Ireland) 2005; pp. 383-8.

[40] Chakraborty A. Biclustering of gene expression data by simulated annealing. In Proc Eighth Intl Conf High-Perform Comput Asia-Pacific Region 2005; pp. 627-32.

[41] Xie B, Chen S, Liu F. Biclustering of gene expression data using PSO-GA hybrid. Proc Int Conf Bioinform Biomed Eng 2007; pp. 302-05.

[42] Deb K. Multi-objective optimization using evolutionary algorithms. England: John Wiley and Sons, Ltd, 2001.

[43] Maulik U, Mukhopadhyay A, Bandyopadhyay S. Finding multiple coherent biclusters in microarray data using variable string length multiobjective genetic algorithm. IEEE Trans Inform Tech Biomed 2009; 13(6): 969-75.

[44] Deb K, Pratap A, Agrawal S, Meyarivan T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 2002; 6: 182-97.

[45] Divina F, Aguilar-Ruiz JS. A multi-objective approach to discover biclusters in microarray data. In Proc 9th Ann Conf Genetic Evol Comput New York, NY, USA 2007; pp. 385-92, ACM.

[46] Liu J, Li Z, Hu X, Chen Y. Biclustering of microarray data with MOPSO based on crowding distance. BMC Bioinformatics 2008; 10(Suppl 4): S9.

[47] Fei L, Juan L. Biclustering of gene expression data with a new hybrid multi-objective evolutionary algorithm of NSGA-II and EDA. In Proc Int Conf Bioinform Biomed Eng 2008; pp. 1912-5.

[48] Larranaga P, Lozano JA. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. MA: Kluwer Academic Publishers 2001.

[49] Coelho GP, Franca FO, Zuben FJ. A multi-objective multipopulation approach for biclustering. In Proc 7th Int Conf Artificial Immune Syst Springer-Verlag 2008; pp. 71-82.

[50] Tjhi W-C, Lihui C. Flexible fuzzy co-clustering with feature-cluster weighting. In Proc 9th Int Conf Control, Automation, Robotics and Vision 2006.

[51] Fei X, Lu S, Pop HF, Liang LR. GFBA: A biclustering algorithm for discovering value-coherent biclusters. In Proc Int Symp Bioinform Res Appl 2007.

[52] Maulik U, Mukhopadhyay A, Bandyopadhyay S, Zhang MQ, Zhang X. Multiobjective fuzzy bi-clustering in microarray data: Method and a new performance measure. In Proc IEEE World Congress Comput Int/IEEE Congress Evol Comput 2008; (Hong Kong), pp. 383-8.

[53] Krishnapuram R, Joshi A, Yi L. A fuzzy relative of the k-medoids algorithm with application to document and snippet clustering. In Proc IEEE Intl Conf Fuzzy Systems (FUZZ-IEEE 99) 1999; (Seoul, South Korea), pp. 1281-6.

[54] Han L, Yan H. Fuzzy biclustering for DNA microarray data analysis. In Proc IEEE Int Conf Fuzzy Syst FUZZ-IEEE 2008 (IEEE World Congress Comput Int) 2008; pp. 1132-8.

[55] Tanay A, Sharan R, Shamir R. Discovering statistically significant biclusters in gene expression data. Bioinformatics 2002; 18: S136-S44.

[56] Prelic A, Bleuler S, Zimmermann P, et al. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 2006; 22(9): 1122-9.

[57] Ahmad W, Khokhar A. cHawk: A highly efficient biclustering algorithm using bigraph crossing minimization. In Proc 2nd Int Workshop Data Mining Bioinform 2007.

[58] Angiulli F, Cesario E, Pizzuti C. Random walk biclustering for microarray data. Inform Sci 2008; 178(6): 1479-97.

[59] Dharan S, Nair AS. Biclustering of gene expression data using reactive greedy randomized adaptive search procedure. BMC Bioinformatics 2009; 10(Suppl 1): S27.

[60] Lazzeroni L, Owen A. Plaid models for gene expression data. Statistica Sinica 2002; 12: 61-86.

[61] Sheng Q, Moreau Y, Moor BD. Biclustering microarray data by Gibbs sampling. Bioinformatics 2003; 19: 196-205.

[62] Kluger Y, Basri R, Chang JT, Gerstein M. Spectral biclustering of microarray data: coclustering genes and conditions. Genome Res 2003; 13: 703-16.

[63] Yoon S, Nardini C, Benini L, Micheli GD. Discovering coherent biclusters from gene expression data using zero-suppressed binary decision diagrams. IEEE/ACM Trans Comput Biol Bioinform 2005; 2(4): 339-54.

[64] Zhao L, Zaki MJ. MicroCluster: Efficient deterministic biclustering of microarray data. IEEE Int Syst 2005; 20(6): 40-9.

[65] Carmona-Saez P, Pascual-Marqui RD, Tirado F, Carazo JM, Pascual-Montano A. Biclustering of gene expression data by non-smooth non-negative matrix factorization. BMC Bioinformatics 2006; 7: 366.

[66] Tchagang BA, Tewfik AH. DNA microarray data analysis: a novel biclustering algorithm approach. EURASIP J Appl Signal Proc 2006.

[67] Teng L, Chan L-W. Biclustering gene expression profiles by alternately sorting with weighted correlated coefficient. In Proc IEEE Int Workshop Machine Learning Signal Proc 2006; pp. 289-94.

[68] Liu F, Zhou H. Biclustering of gene expression data based on bucketing technique. In Proc 1st Int Conf Bioinform Biomed Eng (ICBBE) Wuhan, China 2007; pp. 359-62.

[69] Gu J, Liu JS. Bayesian biclustering of gene expression data. BMC Genomics 2008; 9(Suppl 1): S4.

[70] Dimaggio P, Mcallister S, Floudas C, Feng XJ, Rabinowitz J, Rabitz H. Biclustering via optimal re-ordering of data matrices in systems biology: rigorous methods and comparative studies. BMC Bioinformatics 2008; 9(1): 458.

[71] Madeira SC, Oliveira AL. An efficient biclustering algorithm for finding genes with similar patterns in time-series expression data. In Proc 5th Asia Pacific Bioinformatics Conference Series in Advances in Bioinformatics and Computational Biology 5: (Hong Kong), Imperial College Press, January 2007; pp. 67-80.

[72] Mahfouz MA, Ismail MA. BIDENS: Iterative density based biclustering algorithm with application to gene expression analysis. Proc World Acad Sci Eng Tech 2009; 37: 342-48.

[73] Barkow S, Bleuler S, Prelic A, Zimmermann P, Zitzler E. BicAT: a Biclustering Analysis Toolbox. Bioinformatics 2006; 22(10): 1282-3.

[74] Cheng KO, Law NF, Siu WC, Lau TH. BiVisu: software tool for bicluster detection and visualization. Bioinformatics 2007; 23: 2342-4.

[75] Wu C-J, Kasif S. GEMS: A web server for biclustering analysis of expression data. Nucleic Acids Res 2005; 33: 596-9.

[76] Shamir R, Maron-Katz A, Tanay A, et al. EXPANDER - an integrative program suite for microarray data analysis. BMC Bioinformatics 2005; 6: 232.

[77] Santamaría R, Theron R, Quintales L. BicOverlapper: a tool for bicluster visualization. Bioinformatics 2008; 24(9): 1212-3.

[78] Gonçalves JP, Madeira SC, Oliveira AL. BiGGEsTS: integrated environment for biclustering analysis of time series gene expression data. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa (INESC-ID). Tech Rep 2009; p. 23.

[79] Zhang Y, Zha H, Chu C-H. A time-series biclustering algorithm for revealing co-regulated genes. In Proc Int Conf Inform Tech Coding Comput 2005; 1: pp. 32-7.

[80] de Franca FO, Ferreira HM, Zuben FJ V. Applying biclustering to perform collaborative filtering. In Proc Int Conf Int Syst Design Appl (Los Alamitos, CA, USA) 2007; pp. 421-6, IEEE Computer Society.

[81] Liu S, Chen Y, Yang M, Ding R. Bicluster algorithm and used in market analysis. In Intl Workshop Knowl Dis Data Mining (Los Alamitos, CA, USA), 2009; pp. 504-7, IEEE Computer Society.

[82] de Castro PA D, de Franca FO, Ferreira HM, Zuben FJ V. Applying biclustering to text mining: An immune-inspired approach. In Proc Int Conf Artificial Immune Syst 2007; pp. 83-94.

Received: August 22, 2009 Revised: November 04, 2009 Accepted: January 06, 2010