Bit-table based biclustering and frequent closed itemset mining in high-dimensional binary data

Research ArticleBit-Table Based Biclustering and Frequent Closed ItemsetMining in High-Dimensional Binary Data

Andraacutes Kiraacutely1 Attila Gyenesei2 and Jaacutenos Abonyi1

1 Department of Process Engineering University of Pannonia Veszprem 8200 Hungary2 Bioinformatics amp Scientific Computing Core Campus Science Support Facilities Vienna Biocenter 1030 Vienna Austria

Correspondence should be addressed to Andras Kiraly kandras85gmailcom

Received 15 August 2013 Accepted 4 December 2013 Published 30 January 2014

Academic Editors Y Blanco Fernandez and Y-B Yuan

Copyright copy 2014 Andras Kiraly et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

During the last decade various algorithms have been developed and proposed for discovering overlapping clusters in high-dimensional dataThe twomost prominent application fields in this research proposed independently are frequent itemset mining(developed for market basket data) and biclustering (applied to gene expression data analysis) The common limitation of bothmethodologies is the limited applicability for very large binary data sets In this paper we propose a novel and efficient method tofind both frequent closed itemsets and biclusters in high-dimensional binary dataThemethod is based on simple but very powerfulmatrix and vectormultiplication approaches that ensure that all patterns can be discovered in a fastmannerTheproposed algorithmhas been implemented in the commonly used MATLAB environment and freely available for researchers

1 Introduction

One of the most important research fields in data miningis mining interesting patterns (such as sequences episodesassociation rules correlations or clusters) in large data setsFrequent itemset mining is one of the earliest such conceptsoriginating from economic market basket analysis with theaim of understanding the behaviour of retail customers or inother words finding frequent combinations and associationsamong items purchased together [1] Market basket data canbe considered as a matrix with transactions as rows anditems as columns If an item appears in a transaction it isdenoted by 1 and otherwise by 0The general goal of frequentitemset mining is to identify all itemsets that contain at leastas many transactions as required referred to as minimumsupport threshold By definition all subsets of a frequentitemset are frequentTherefore it is also important to providea minimal representation of all frequent itemsets withoutlosing their support information Such itemsets are calledfrequent closed itemsets An itemset is defined as closed ifnone of its immediate supersets has exactly the same supportcount as the itemset itself For comprehensive reviews aboutthe efficient frequent itemset mining algorithms see [2 3]

Independently of frequent itemset mining biclusteringanother important data mining concept was proposed tocomplement and expand the capabilities of the standardclustering methods by allowing objects to belong to multipleor none of the resulting clusters purely based on theirsimilarities This property makes biclustering a powerfulapproach especially when it is applied to data with a largenumber of objects During recent years many biclusteringalgorithms have been developed especially for the analysisof gene expression data [4] With biclustering genes withsimilar expression profiles can be identified not only overthe whole data set but also across subsets of experimentalconditions by allowing genes to simultaneously belong toseveral expression patterns For comprehensive reviews onbiclustering see [4ndash6]

One of the most important properties of biclusteringwhen applied to binary (0 1) data is that it provides thesame results as frequent closed itemsets mining (Figure 1)Such biclusters called inclusion-maximal biclusters (or IMBs)were introduced in [7] together with a mining algorithmBiMAX to discover all biclusters in a binary matrix thatare not entirely contained by any other cluster By defaultan IMB can contain any number of genes and samples

Hindawi Publishing Corporatione Scientific World JournalVolume 2014 Article ID 870406 7 pageshttpdxdoiorg1011552014870406

2 The Scientific World Journal

x1

x2

xn

B4

B4B4

B4

B3

B3

B3

B2 B2

B2

y1 y2 y3 y4 y mmiddot middot middot

A

Figure 1 Illustrative representation of biclustersfrequent closeditemsets on binary data

Once additional minimum support threshold is requiredfor discovering clusters having at least as many genes asthe provided minimum support threshold (ie minimumnumber of genes) BiMAX and all frequent closed itemsetmining methods result in the same patterns

In this paper we propose an efficient pattern miningmethod to find frequent closed itemsetsbiclusters whenapplied to binary high-dimensional data The method isbased on simple but very powerful matrix and vector mul-tiplication approaches that ensure that all patterns can bediscovered in a fast manner The proposed algorithm hasbeen implemented in the commonly used MATLAB envi-ronment rigorously tested on both synthetic and real datasets and freely available for researchers (httpprmkuni-pannonhuResearchbit-table-biclustering)

2 Problem Formulation

In this section we will show how bothmarket basket data andgene expression data can be represented as bit-tables beforeproviding a new mining method in the next section In caseof real gene expression data it is a common practice of thefield of biclustering to transform the original gene expressionmatrix into a binary one in such a way that gene expressionvalues are transformed to 1 (expressed) or 0 (not expressed)using an expression cutoff (eg twofold change of the log2expression values) Then the binarized data can be used asclassic market basket data and defined as follows (Figure 2)let 119879 = 119905

1 119905

119899 be the set of transactions and let 119868 =

1198941 119894

119898 be the set of items The Transaction Database can

be transformed into a binary matrix B0 where each rowcorresponds to a transaction and each column correspondsto an item (right side of Figure 2) Therefore the bit-tablecontains 1 if the item is present in the current transaction and0 otherwise [8]

Using the above terminology a transaction 119905119894is said to

support an itemset 119869 if it contains all items of 119869 that is 119869 sube 119905119894

The support of an itemset 119869 is the number of transactions thatsupport this itemset Using 120590 for support count the supportof itemset 119869 is 120590(119869) = |119905

119894| 119869 sube 119905

119894 119905119894isin 119879| An itemset

is frequent if its support is greater than or equal to a user-specified threshold sup(119869) ge minsupp An itemset 119869 is called119896-itemset if it contains 119896 items from 119868 that is |119869| = 119896 Anitemset 119869 is a frequent closed itemset if it is frequent and thereexists no proper superset 1198691015840 sup 119869 such that sup(1198691015840) = sup(119869)

The problem of mining frequent itemsets was introducedby Agrawal et al in [1] and the first efficient algorithmcalled Apriori was published by the same group in [9]The name of the algorithm is based on the fact that thealgorithm uses prior knowledge of the previously determinedfrequent itemsets to identify longer and longer frequentitemsets Mannila et al proposed the same technique inde-pendently in [10] and both works were combined in [11]In many cases frequent itemset mining approaches havegood performance but they may generate a huge number ofsubstructures satisfying the user-specified threshold It can beeasily realized that if an itemset is frequent then all its subsetsare frequent as well (for more details see ldquodownward closurepropertyrdquo in [9]) Although increasing the threshold mightreduce the resulted itemsets and thus solve this problem itwould also remove interesting patterns with low frequencyTo overcome this the problem of mining frequent closeditemsets was introduced by Pasquier et al in 1999 [12] wherefrequent itemsets which have no proper superitemset withthe same support value (or frequency) are searched Themain benefit of this approach is that the set of closed fre-quent itemsets contains the complete information regardingits corresponding frequent itemsets During the followingfew years various algorithms were presented for miningfrequent closed itemsets including CLOSET [13] CHARM[14] FPclose [15] AFOPT [16] CLOSET+ [17] DBV-Miner[18] and STreeDC-Miner [19] The main computational taskof closed itemset mining is to check whether an itemset isa closed itemset Different approaches have been proposedto address this issue CHARM for example uses a hashingtechnique on its TID (transaction identifier) values whileAFOPT FPclose CLOSET CLOSET+ or STreeDC-Minermaintains the identified detected itemsets in an FP-tree-likepattern-tree Further reading about closed itemset miningcan be found in [20]

The formulations above yield the close relationshipbetween closed frequent itemsets and biclusters since thegoal of biclustering is to find biclusters 119861

119896= (119868119896 119869119896) such

that 119868119896sube 119868119897 119869119896sube 119869119897 Therefore while the size restriction for

columns in a bicluster corresponds to the frequency condi-tion of itemsets the ldquomaximalityrdquo of a bicluster correspondsto the closeness of an itemset Thus if itemsets that containless than min rows number of rows are filtered out the setof all closed frequent itemsets will be equal to the set of allmaximal biclusters

3 Mining Frequent Closed Itemsets UsingBit-Table Operations

In this section we introduce a novel frequent closed item-set mining algorithm and propose efficient implementa-tion of the algorithm in the MATLAB environment Notethat the proposed method can also be applied to various

The Scientific World Journal 3

TID Items1

2

3

4

5

Bread milk

Bread diaper beer eggs

Milk diaper beer coke

Bread milk diaper beer

Bread milk diaper coke

0

1

1

1

0

1

1

0

1

1

1

0

1

1

1

0

1

1

1

1

0

1

0

0

0

0

0

1

0

1

T1

T2

T3

T4

T5

Beer

Brea

d

Milk

Dia

per

Eggs

Cok

e

Figure 2 Bit-table representation of market basket data

Data preprocessing Frequent itemset mining

Extraction of frequent closed

itemsets (biclusters)Analysis of results

Figure 3 Schematic view of frequent closed itemset discovery

biclustering application fields such as gene expression dataanalysis after a proper preprocessing (binarization) stepThe schematic view of the proposed pipeline is shown inFigure 3

31 The Proposed Mining Algorithm The mining procedureis based on the Apriori principle Apriori is an iterative algo-rithm that determines frequent itemsets level-wise in severalsteps (iterations) In any step 119896 the algorithm calculates allfrequent 119896-itemsets based on the already generated (119896minus1)-itemsets Each step has two phases candidate generationand frequency counting In the first phase the algorithmgenerates a set of candidate 119896-itemsets from the set offrequent (119896minus1)-itemsets from the previous pass This is car-ried out by joining frequent (119896minus 1)-itemsets together Twofrequent (119896minus1)-itemsets are joinable if their lexicographicallyordered first 119896 minus 2 items are the same and their last items aredifferent Before the algorithm enters the frequency countingphase it discards every new candidate itemset having a subsetthat is infrequent (utilizing the downward closure property)In the frequency counting phase the algorithm scans throughthe database and counts the support of the candidate 119896-itemsets Finally candidates with support not lower thanthe minimum support threshold are added into the set offrequent itemsets

A simplified pseudocode of the Apriori algorithm ispresented in Pseudocode 1 which is extended by extractingonly the closed itemsets in line 9 While the 119869119900119894119899() proceduregenerates candidate itemsets 119862119896 the 119875119903119906119899119890()method (in row5) counts the support of all candidate itemsets and removesthe infrequent ones

The storage structure of the candidate itemsets is crucialto keep both memory usage and running time reasonableIn the literature hash-tree [9 11 21] and prefix-tree [22 23]storage structures have been shown to be efficientThe prefix-tree structure is more common due to its efficiency andsimplicity but naive implementation could be still very spaceconsuming

Our procedure is based on a simple and easily imple-mentable matrix representation of the frequent itemsets Theidea is to store the data and itemsets in vectors Then simplematrix and vector multiplication operations can be applied tocalculate the supports of itemsets efficiently

To indicate the iterative nature of our process we definethe inputmatrix (A

119898times119899) asA119898times119899= B01198730times119899

where b0119895represents

the 119895th column of B01198730times119899

which is related to the occurrenceof the 119894

119895th item in transactions The support of item 119894

119895can be

easily calculated as sup(119883 = 119894119895) = (b0

119895)119879b0119895

Similarly the support of itemset 119883119894119895= 119894119894 119894119895 can be

obtained by a simple vector product of the two related vectorsbecause when both 119894

119894and 119894119895items appear in a given transac-

tion the product of the two related items can be representedby the AND connection of the two items sup(119883

119894119895= 119894119894 119894119895) =

(b0119894)119879b0119895 The main benefit of this approach is that counting

and storing the itemsets are not needed only matrices ofthe frequent itemsets are generated based on the element-wise products of the vectors corresponding to the previouslygenerated (119896 minus 1)-frequent itemsetsTherefore simple matrixand vector multiplications are used to calculate the supportof the potential 119896 + 1 itemsets S119896 = (B119896minus1)119879B119896minus1 where the119894th and 119895th element of the matrix S119896 represent the supportof the 119883

119894119895= L119896minus1119894 L119896minus1119895 itemset where L119896minus1 represents the


(1) L1 = 1minusitemsets(2) 119896 = 2(3)while L119896minus1 = (4) C119896 = 119869119900119894119899(L119896minus1)(5) L119896 = 119875119903119906119899119890(C119896)(6) L = L cup L119896(7) 119896 = 119896 + 1(8) end(9)B = ExtractClosed(L)

Pseudocode 1 Pseudocode of the Apriori-like algorithm

0

0

0

0

0 0 0

0

00

0

0

1 1

1

11

1 1 1 1

1 1 1

1

1

1

1

0

0 0 0

000

0 0 0

0 0

000

11

1

1

11

1

1

1

1

1

1

1 1 1

1

1

114443 322 3 3 3

1

t1t2

t3

t4

t5

k = 2

i1 i2 i3 i4 i5 i6B1

S1

t1t2

t3

t4

t5

B2

S2

Minsupp = 3i1i2 i1i3 i1i4 i2i3 i2i4 i3i4

Figure 4 Mining process example using the bit-table representation

set of (119896minus1)-itemsets As a consequence only matrices of thefrequent itemsets are generated by forming the columns ofthe B119896119873119896times119899119896minus1

as the element-wise products of the columns ofB119896minus1119873119896minus1times119899119896minus1

that is B119896119873119896times119899119896minus1= b119896minus1119894∘ b119896minus1119895

for all 119894 = 119895 where119860 ∘ 119861means the Hadamard product of matrices 119860 and 119861

The concept is simple and easily interpretable and sup-ports compact and effective implementation The proposedalgorithm has a similar philosophy to the Apriori TID[24] method to generate candidate itemsets None of thesemethods have to revisit the original data tableB0

119873times119899 for com-

puting the support of larger itemsets Instead our methodtransforms the table as it goes along with the generation ofthe 119896-itemsets B1

1198731times1198991 B119896

119873119896times119899119896 119873119896lt 119873119896minus1lt sdot sdot sdot lt 119873

1

B11198731times1198991

represents the data related to the 1-frequent itemsetsThis table is generated from B0

119873times119899 by erasing the columns

related to the nonfrequent items to reduce the size of thematrices and improve the performance of the generationprocess

Rows that are not containing any frequent itemsets (thesum of the row is zero) in B119896

119873119896times119899119896are also deleted If a column

remains the index of its original position is written into amatrix that stores only the indices (ldquopointersrdquo) of the elementsof itemsets L1

1198731times1 When L119896minus1

119873119896minus1times119896minus1matrices related to the

indexes of the (119896 minus 1)-itemsets are ordered it is easy to followthe heuristics of the Apriori algorithm as only those L

119896minus1

itemsets will be joinedwhose first 119896minus1 items are identical (theset of these itemsets form the blocks of the B119896minus1

119873119896minus1times119899119896minus1matrix)

Figure 4 represents the second step of the algorithmusing minsupp = 3 in the 119875119903119906119899119890() procedure

32 MATLAB Implementation of the Proposed AlgorithmThe proposed algorithm uses matrix operations to identifyfrequent itemsets and count their support values Herewe provide a simple but powerful implementation of thealgorithm using the user friendly MATLAB environmentTheMATLAB code 2 (Algorithm 1) and code 3 (Algorithm 2)present working code snippets of frequent closed itemsetmining only within 34 lines of code

The first code segment presents the second step of thediscovery pipeline (see Figure 3) Preprocessed data is storedin the variable bM in bit-table format as discussed aboveThefirst and second steps of the iterative procedure are presentedin lines 1 and 2 where S2 and B2 are calculated The Aprioriprinciple is realized in the while loop in lines 4ndash19 Usingthe notation in Pseudocode 1 Cks are generated in lines 10-11 while Lks are prepared in the loop in lines 12ndash16

MATLAB code 3 (Algorithm 2) shows the usually mostexpensive calculation the generation of closed frequentitemsets which is denoted by extraction of frequent closeditemsets in Figure 3 Using the set of frequent items as thecandidate frequent closed itemsets our approach calculatesthe support as the sum of columns (see Section 32) andeliminates nonclosed itemsets from the candidate set (line11) Again an itemset 119869 is a frequent closed itemset if it is


(1) s1=sum(bM) items1=find(s1 gesuppn)1015840s1=s1(items1)(2) dum=bM1015840lowastbM [i1 i2]=find(triu(dum 1)gesuppn) items2=[1198941 1198942](3) k=3(4) while simisempty(itemskminus1)(5) itemsk=[] sk=[] ci=[](6) for i=1size(itemskminus11)(7) vv=prod(bM(itemskminus1(i)) 2)(8) if k==3 s2(i)=sum(vv) end(9) TID=find(vvgt0)(10) pf=(unique(itemskminus1(find(ismember(itemskminus1(1end minus1)(11) itemskminus1(i1end minus1) ldquorowsrdquo)) end)))(12) fi=pf(find(pfgtitemskminus1(i end)))(13) for jj=fi1015840(14) j=find(items1==jj)(15) v=vv(TID)lowastbM(TIDitems1(j)) sv=sum(v)(16) itemsk=[itemsk [itemskminus1(i)items1(j)]] sk=[sk sv](17) end(18) end(19) k=k+1(20) end

Algorithm 1 MATLAB code 2 mining frequent itemsets

(1) for k=1 length(items)minus1(2) Citemsk=[](3) for i=1size(itemsk 1)(4) part=0(5) for j=1size(itemskminus1 1)(6) IS = intersect(itemsk(i) itemsk+1(j))(7) if and((sum(ismember(itemsk(i) IS))==k) sk(i)==sk+1(j))(8) part=part+1 end(9) end(10) if part==0(11) Citemsk=[Citemsk itemsk(i)](12) end(13) end(14) end(15) Citemsk+1=itemsend

Algorithm 2 MATLAB code 3 the generation of closed frequent itemsets

frequent and there exists no proper superset 1198691015840 sup 119869 such thatsup(1198691015840) = sup(119869) This is ensured by the loop in lines 5ndash9

4 Experimental Results

In this section we compare our proposed method to BiMAX[7] which is a highly recognized reference method withinthe biclustering research community As BiMAX is regularlyapplied to binary gene expression data it serves as a goodreference for the comparison Using several biological andvarious synthetic data sets we show that while bothmethodsare able to discover all patterns (frequent closed item-setsbiclusters) our pattern discovery approach outperformsBiMAX

To compare the two mining methods and demonstratethe computational efficiency we applied them to several realand synthetic data sets Real data come from various biolog-ical studies previously used as reference data in biclusteringresearch [25ndash28] For the comparison of the computationalefficiency all biological data sets were binarized For boththe fold-change data (stem cell data sets) and the absoluteexpression data (Leukemia Compendium and Yeast-80)fold-change cutoff 2 is used Results are shown in Table 1(synthetic data) and Table 2 (real data) respectively Bothmethods were able to discover all closed patterns for allsynthetic and real data setsThe results show that ourmethodoutperforms BiMAX and provides the best running times inall cases especially when the number of rows and columns


Table 1 Performance test using synthetic data

Size Density Minsupp Number of closed itemsets Time (s) Number of BiMAX biclusters Time (s)50 times 50 10 2 78 08 78 sim150 times 50 20 4 140 11 140 sim150 times 50 50 15 238 09 238 sim1100 times 100 10 3 337 5 337 sim2100 times 100 20 7 488 7 488 sim2100 times 100 50 30 694 9 694 sim3300 times 300 10 8 437 17 437 sim5300 times 300 20 22 156 6 156 52300 times 300 50 90 1038 40 1038 gt600700 times 700 10 15 1318 120 1318 195700 times 700 20 45 375 33 375 gt300700 times 700 50 210 283 25 283 gt3001000 times 1000 10 20 1496 196 1496 gt6001000 times 1000 20 60 714 92 714 gt6001000 times 1000 50 290 1030 135 1030 gt600

Table 2 Test runs using biological data

Name Size Minsupp Number of closed itemsets Time (s) Number of BiMAX biclusters Time (s)Compendium 6316 times 300 50 2594 12 2594 sim19StemCell-27 45276 times 27 200 7972 27 7972 sim115Leukemia 125336 times 72 400 3643 147 3643 gt600StemCell-9 1840 times 9 2 177 08 177 sim1Yeast-80 6221 times 80 80 3285 8 3285 sim17

is higher Biological validation of the discovered patternstogether with detailed explanations is given in [28]

5 Conclusion

In this paper we have proposed a novel and efficient methodto find both frequent closed itemsets and biclusters in high-dimensional binary data The method is based on a simplebit-table based matrix and vector multiplication approachand ensures that all patterns can be discovered in a fastmanner The proposed algorithm can be successfully appliedto various bioinformatics problems dealingwith high-densitybiological data including high-throughput gene expressiondata

Disclosure

Attila Gyenesei is joint first author

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The financial support TAMOP-422C-111KONV-2012-0004 Project is gratefully acknowledged The research of

Janos Abonyi was realized in the frames of TMOP 424 A2-111-2012-0001 ldquoNational Excellence Program Elaboratingand operating an inland student and researcher personalsupport systemrdquoThe project was subsidized by the EuropeanUnion and cofinanced by the European Social Fund

References

[1] R Agrawal T Imielinski and A Swami ldquoMining associationrules between sets of items in large databasesrdquo in Proceedings ofthe ACM SIGMOD Record vol 22 pp 207ndash216 ACM 1993

[2] ldquoFimirsquo03 workshop on frequent itemset mining implementa-tionsrdquo in Proceedings of the IEEE International Conference onData Mining Workshop on Frequent Itemset Mining Implemen-tations B Gothals and M J Zaki Eds Melbourne Fla USADecember 2003

[3] ldquoFimirsquo04 workshop on frequent itemset mining implementa-tionsrdquo in Proceedings of the IEEE International Conference onData Mining Workshop on Frequent Item set Mining Implemen-tations R Bayardo B Gothals and M J Zaki Eds BrightonUK 2004

[4] S C Madeira and A L Oliveira ldquoBiclustering algorithms forbiological data analysis a surveyrdquo IEEEACM Transactions onComputational Biology and Bioinformatics vol 1 no 1 pp 24ndash45 2004

[5] S Busygin O Prokopyev and P M Pardalos ldquoBiclustering indata miningrdquo Computers amp Operations Research vol 35 no 9pp 2964ndash2987 2008


[6] A Tanay R Sharan and R Shamir ldquoDiscovering statisticallysignificant biclusters in gene expression datardquo Bioinformaticsvol 18 supplement 1 pp S136ndashS144 2002

[7] A Prelic S Bleuler P Zimmermann et al ldquoA systematiccomparison and evaluation of biclustering methods for geneexpression datardquo Bioinformatics vol 22 no 9 pp 1122ndash11292006

[8] P-N Tan M Steinbach and V Kumar Introduction to DataMining Pearson Addison Wesley Boston Mass USA 2006

[9] R Agrawal and R Srikant ldquoFast algorithms for mining associ-ation rulesrdquo in Proceedings of the 20th International Conferenceon Very Large Data Bases (VLDB rsquo94) vol 1215 pp 487ndash499Santiago Chile September 1994

[10] H Mannila H Toivonen and A I Verkamo ldquoEfficient algo-rithms for discovering association rulesrdquo in Proceedings of theAAAI Workshop on Knowledge Discovery in Databases (KDDrsquo94) pp 181ndash192 Seattle Wash USA 1994

[11] R Agrawal H Mannila R Srikant H Toivonen and A IVerkamo ldquoFast discovery of association rulesrdquo in Advances inKnowledge Discovery and Data Mining vol 12 pp 307ndash328American Association for Artificial Intelligence Menlo ParkCalif USA 1996

[12] N Pasquier Y Bastide R Taouil and L Lakhal ldquoDiscoveringfrequent closed itemsets for association rulesrdquo in DatabaseTheorymdashICDTrsquo99 LectureNotes inComputer Science pp 398ndash416 Springer London UK 1999

[13] J Pei J Han and R Mao ldquoCLOSET an efficient algorithm formining frequent closed itemsetsrdquo in Proceedings of the ACMSIGMOD Workshop on Research Issues in Data Mining andKnowledge Discovery vol 4 pp 21ndash30 Dallas Tex USA 2000

[14] J M Zaki and C-J Hsiao ldquoCHARM an efficient algorithm forclosed association rule miningrdquo in Proceedings of the 2nd SIAMInternational Conference on Data Mining pp 457ndash473 1999

[15] G Grahne and J Zhu ldquoEfficiently using prefix-trees in miningfrequent itemsetsrdquo in Proceedings of the Workshop on FrequentItemset Mining Implementations (FIMI rsquo03) pp 123ndash132 2003

[16] G Liu H Lu W Lou and J X Yu ldquoOn computing storingand querying frequent patternsrdquo in Proceeding of the 9th ACMSIGKDD International Conference on Knowledge Discovery andData Mining (KDD rsquo03) pp 607ndash612 Washington DC USAAugust 2003

[17] J Wang J Han and J Pei ldquoCLOSET+ searching for the beststrategies for mining frequent closed itemsetsrdquo in Proceedingsof the 9th ACMSIGKDD International Conference on KnowledgeDiscovery and Data Mining (KDD rsquo03) pp 236ndash245 Washing-ton DC USA August 2003

[18] B Vo T-P Hong and B Le ldquoDBV-miner a dynamic bit-vectorapproach for fast mining frequent closed itemsetsrdquo ExpertSystems with Applications vol 39 no 8 pp 7196ndash7206 2012

[19] A Y Rodriguez-Gonzalez J F Martinez-Trinidad J ACarrasco-Ochoa and J Ruiz-Shulcloper ldquoMining frequentpatterns and association rules using similaritiesrdquoExpert Systemswith Applications vol 40 no 17 pp 6823ndash6836 2013

[20] J Han H Cheng D Xin and X Yan ldquoFrequent patternmining current status and future directionsrdquo Data Mining andKnowledge Discovery vol 15 no 1 pp 55ndash86 2007

[21] J S Park M S Chen and P S Yu An Effective Hash-BasedAlgorithm for Mining Association Rules vol 24 ACM 1995

[22] A Amir R Feldman andR Kashi ldquoA new and versatilemethodfor association generationrdquo Information Systems vol 22 no 6-7pp 333ndash347 1997

[23] R J Bayardo ldquoEfficiently mining long patterns from databasesrdquoin Proceedings of the ACM Sigmod Record vol 27 pp 85ndash93ACM 1998

[24] F P Pach A Gyenesei and J Abonyi ldquoCompact fuzzy associa-tion rule-based classifierrdquo Expert Systems with Applications vol34 no 4 pp 2406ndash2416 2008

[25] A Gyenesei U Wagner S Barkow-Oesterreicher E Stolteand R Schlapbach ldquoMining co-regulated gene profiles for thedetection of functional associations in gene expression datardquoBioinformatics vol 23 no 15 pp 1927ndash1935 2007

[26] G Li Q Ma H Tang A H Paterson and Y Xu ldquoQUBIC aqualitative biclustering algorithm for analyses of gene expres-sion datardquo Nucleic Acids Research vol 37 no 15 article e1012009

[27] D S Rodriguez-Baena A J Perez-Pulido and J S Aguilar-Ruiz ldquoA biclustering algorithm for extracting bit-patterns frombinary datasetsrdquo Bioinformatics vol 27 no 19 pp 2738ndash27452011

[28] A Kiraly J Abonyi A Laiho and A Gyenesei ldquoBiclusteringof high-throughput gene expression data with bicluster minerrdquoin Proceedings of the 12th IEEE International Conference on DataMiningWorkshops (ICDMW rsquo12) pp 131ndash138 Brussels BelgiumDecember 2012

Submit your manuscripts athttpwwwhindawicom

International Journal ofComputer GamesTechnologyHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014


ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014


Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances in Software Engineering



Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications


Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia


Biomedical Imaging


ArtificialNeural Systems

Advances in


RoboticsJournal of


ComputationalIntelligence ampNeuroscience


Industrial EngineeringJournal of


Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in



x1

x2

xn

B4

B4B4

B4

B3

B3

B3

B2 B2

B2

y1 y2 y3 y4 y mmiddot middot middot

A

Figure 1 Illustrative representation of biclustersfrequent closeditemsets on binary data

Once additional minimum support threshold is requiredfor discovering clusters having at least as many genes asthe provided minimum support threshold (ie minimumnumber of genes) BiMAX and all frequent closed itemsetmining methods result in the same patterns

In this paper we propose an efficient pattern miningmethod to find frequent closed itemsetsbiclusters whenapplied to binary high-dimensional data The method isbased on simple but very powerful matrix and vector mul-tiplication approaches that ensure that all patterns can bediscovered in a fast manner The proposed algorithm hasbeen implemented in the commonly used MATLAB envi-ronment rigorously tested on both synthetic and real datasets and freely available for researchers (httpprmkuni-pannonhuResearchbit-table-biclustering)

2 Problem Formulation

In this section we will show how bothmarket basket data andgene expression data can be represented as bit-tables beforeproviding a new mining method in the next section In caseof real gene expression data it is a common practice of thefield of biclustering to transform the original gene expressionmatrix into a binary one in such a way that gene expressionvalues are transformed to 1 (expressed) or 0 (not expressed)using an expression cutoff (eg twofold change of the log2expression values) Then the binarized data can be used asclassic market basket data and defined as follows (Figure 2)let 119879 = 119905

1 119905

119899 be the set of transactions and let 119868 =

1198941 119894

119898 be the set of items The Transaction Database can

be transformed into a binary matrix B0 where each rowcorresponds to a transaction and each column correspondsto an item (right side of Figure 2) Therefore the bit-tablecontains 1 if the item is present in the current transaction and0 otherwise [8]

Using the above terminology a transaction 119905119894is said to

support an itemset 119869 if it contains all items of 119869 that is 119869 sube 119905119894

The support of an itemset 119869 is the number of transactions thatsupport this itemset Using 120590 for support count the supportof itemset 119869 is 120590(119869) = |119905

119894| 119869 sube 119905

119894 119905119894isin 119879| An itemset

is frequent if its support is greater than or equal to a user-specified threshold sup(119869) ge minsupp An itemset 119869 is called119896-itemset if it contains 119896 items from 119868 that is |119869| = 119896 Anitemset 119869 is a frequent closed itemset if it is frequent and thereexists no proper superset 1198691015840 sup 119869 such that sup(1198691015840) = sup(119869)

The problem of mining frequent itemsets was introducedby Agrawal et al in [1] and the first efficient algorithmcalled Apriori was published by the same group in [9]The name of the algorithm is based on the fact that thealgorithm uses prior knowledge of the previously determinedfrequent itemsets to identify longer and longer frequentitemsets Mannila et al proposed the same technique inde-pendently in [10] and both works were combined in [11]In many cases frequent itemset mining approaches havegood performance but they may generate a huge number ofsubstructures satisfying the user-specified threshold It can beeasily realized that if an itemset is frequent then all its subsetsare frequent as well (for more details see ldquodownward closurepropertyrdquo in [9]) Although increasing the threshold mightreduce the resulted itemsets and thus solve this problem itwould also remove interesting patterns with low frequencyTo overcome this the problem of mining frequent closeditemsets was introduced by Pasquier et al in 1999 [12] wherefrequent itemsets which have no proper superitemset withthe same support value (or frequency) are searched Themain benefit of this approach is that the set of closed fre-quent itemsets contains the complete information regardingits corresponding frequent itemsets During the followingfew years various algorithms were presented for miningfrequent closed itemsets including CLOSET [13] CHARM[14] FPclose [15] AFOPT [16] CLOSET+ [17] DBV-Miner[18] and STreeDC-Miner [19] The main computational taskof closed itemset mining is to check whether an itemset isa closed itemset Different approaches have been proposedto address this issue CHARM for example uses a hashingtechnique on its TID (transaction identifier) values whileAFOPT FPclose CLOSET CLOSET+ or STreeDC-Minermaintains the identified detected itemsets in an FP-tree-likepattern-tree Further reading about closed itemset miningcan be found in [20]

The formulations above yield the close relationshipbetween closed frequent itemsets and biclusters since thegoal of biclustering is to find biclusters 119861

119896= (119868119896 119869119896) such

that 119868119896sube 119868119897 119869119896sube 119869119897 Therefore while the size restriction for

columns in a bicluster corresponds to the frequency condi-tion of itemsets the ldquomaximalityrdquo of a bicluster correspondsto the closeness of an itemset Thus if itemsets that containless than min rows number of rows are filtered out the setof all closed frequent itemsets will be equal to the set of allmaximal biclusters

3 Mining Frequent Closed Itemsets UsingBit-Table Operations

In this section we introduce a novel frequent closed item-set mining algorithm and propose efficient implementa-tion of the algorithm in the MATLAB environment Notethat the proposed method can also be applied to various


TID Items1

2

3

4

5

Bread milk





0

1

1

1

0

1

1

0

1

1

1

0

1

1

1

0

1

1

1

1

0

1

0

0

0

0

0

1

0

1

T1

T2

T3

T4

T5

Beer

Brea

d

Milk

Dia

per

Eggs

Cok

e

















119895can be


119895)119879b0119895





119894119895= 119894119894 119894119895) =







0

0

0

0

0 0 0

0

00

0

0

1 1

1

11

1 1 1 1

1 1 1

1

1

1

1

0

0 0 0

000

0 0 0

0 0

000

11

1

1

11

1

1

1

1

1

1

1 1 1

1

1

114443 322 3 3 3

1

t1t2

t3

t4

t5

k = 2

i1 i2 i3 i4 i5 i6B1

S1

t1t2

t3

t4

t5

B2

S2










1198731times1198991 B119896


1

B11198731times1198991










119896minus1






















5 Conclusion


Disclosure




Acknowledgments



References



































Advances in

FuzzySystems


Volume 2014













Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in




TID Items1

2

3

4

5

Bread milk





0

1

1

1

0

1

1

0

1

1

1

0

1

1

1

0

1

1

1

1

0

1

0

0

0

0

0

1

0

1

T1

T2

T3

T4

T5

Beer

Brea

d

Milk

Dia

per

Eggs

Cok

e

















119895can be


119895)119879b0119895





119894119895= 119894119894 119894119895) =







0

0

0

0

0 0 0

0

00

0

0

1 1

1

11

1 1 1 1

1 1 1

1

1

1

1

0

0 0 0

000

0 0 0

0 0

000

11

1

1

11

1

1

1

1

1

1

1 1 1

1

1

114443 322 3 3 3

1

t1t2

t3

t4

t5

k = 2

i1 i2 i3 i4 i5 i6B1

S1

t1t2

t3

t4

t5

B2

S2










1198731times1198991 B119896


1

B11198731times1198991










119896minus1






















5 Conclusion


Disclosure




Acknowledgments



References



































Advances in

FuzzySystems


Volume 2014













Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in






0

0

0

0

0 0 0

0

00

0

0

1 1

1

11

1 1 1 1

1 1 1

1

1

1

1

0

0 0 0

000

0 0 0

0 0

000

11

1

1

11

1

1

1

1

1

1

1 1 1

1

1

114443 322 3 3 3

1

t1t2

t3

t4

t5

k = 2

i1 i2 i3 i4 i5 i6B1

S1

t1t2

t3

t4

t5

B2

S2










1198731times1198991 B119896


1

B11198731times1198991










119896minus1






















5 Conclusion


Disclosure




Acknowledgments



References



































Advances in

FuzzySystems


Volume 2014













Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in


















5 Conclusion


Disclosure




Acknowledgments



References



































Advances in

FuzzySystems


Volume 2014













Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in
































Advances in

FuzzySystems


Volume 2014













Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in



Bit-table based biclustering and frequent closed itemset mining in high-dimensional binary data

Documents

Transcript of Bit-table based biclustering and frequent closed itemset mining in high-dimensional binary data