Efficient mining of emerging patterns: discovering trends and differences



Guozhu Dong
Department of CSE
Wright State University
[email protected]

Jinyan Li
Department of CSSE
The University of Melbourne
[email protected]

Abstract

We introduce a new kind of patterns, called emerging patterns (EPs), for knowledge discovery from databases. EPs are defined as itemsets whose supports increase significantly from one dataset to another. EPs can capture emerging trends in timestamped databases, or useful contrasts between data classes. EPs have been proven useful: we have used them to build very powerful classifiers, which are more accurate than C4.5 and CBA, for many datasets. We believe that EPs with low to medium support, such as 1%-20%, can give useful new insights and guidance to experts, in even "well understood" applications.

The efficient mining of EPs is a challenging problem, since (i) the Apriori property no longer holds for EPs, and (ii) there are usually too many candidates for high dimensional databases or for small support thresholds such as 0.5%. Naive algorithms are too costly. To solve this problem, (a) we promote the description of large collections of itemsets using their concise borders (the pair of sets of the minimal and of the maximal itemsets in the collections), and (b) we design EP mining algorithms which manipulate only borders of collections (using our multi-border-differential algorithm), and which represent discovered EPs using borders. All EPs satisfying a constraint can be efficiently discovered by our border-based algorithms, which take as inputs the borders, derived by Max-Miner, of large itemsets. In our experiments on large and high dimensional datasets including the US census and Mushroom datasets, many EPs, including some with large cardinality, are found quickly. We also give other algorithms for discovering general or special types of EPs.

1 Introduction

In this paper we introduce a new kind of patterns, called emerging patterns (EPs), which capture significant changes and differences between datasets. EPs are defined as itemsets whose supports increase significantly from one dataset, D1, to another, D2. More specifically, EPs are itemsets whose growth rates (the ratios of the two supports) are larger than a given threshold ρ. When applied to timestamped databases, EPs can capture emerging trends in business or demographic data. When applied to datasets with classes (male vs female, poisonous vs edible, cured vs not cured), EPs can capture useful contrasts between the classes.

Example 1.1 Many EPs were found in the Mushroom Data (from the UCI repository) for the growth rate threshold 2.5. The following are two typical EPs consisting of 3 items:

X = {(Odor = none), (Gill Size = broad), (Ring Number = one)}
Y = {(Bruises = no), (Gill Spacing = close), (Veil Color = white)}

EP   supp. in poisonous   supp. in edible   growth rate
X    0%                   63.9%             ∞
Y    81.4%                3.8%              21.4

Those EPs with very large growth rates are notable differentiating characteristics between the edible and poisonous Mushrooms, and they have been useful for building powerful classifiers [7, 13]. Interestingly, none of the following singleton itemsets is an EP: {Odor = none}, {Gill Size = broad}, and {Ring Number = one}. Moreover, among the discovered EPs, some contain more than 8 items, and they played a key role in building the accurate classifiers cited above.

Example 1.2 We discovered about 120 EP groups containing up to 13 items from the U.S. census dataset, the PUMS dataset (from www.census.gov, more details in Section 7). These EPs are derived from the population of Texas to that of Michigan using the growth rate threshold 1.2. A typical one is: {Disabl1:2, Lang1:2, Means:1, Mobilili:2, Perscar:2, Rlabor:1, Travtim:[1..59], Work89:1}; the items are about disability, language at home, means of transport, personal care, employment status, travel time to work, and working or not in 1989. Such EPs can describe differences of population characteristics between different social groups. Clearly, domain experts can analyze similar EPs, and select the useful ones for further attention in their applications.

Example 1.3 Suppose in 1985 there were 1000 purchases of {Computer, Modems, Edu-softwares} out of 20 million transactions, and in 1986 there were 2100 such purchases out of 21 million transactions. This purchase pattern has a growth rate of 2 from 1985 to 1986; so it is an EP if the growth rate threshold is set to any value less than 2. Observe that the support of this itemset is very small even in 1986. If companies understood the significance of these EPs and took the opportunities, then they would have benefited greatly in the long run.

We believe that EPs with low to medium support, such as 1%-20%, can give very useful new insights and guidance to experts, in even "well understood" applications. This is because, using traditional statistics and computation methods, scientists have been confined to discovering EPs with very few (e.g. one to four) variables, or confined to known EPs which are typically folklore and have very high supports; in other words, the applications have been well understood only with respect to contrasts which are either strong folklore or short in number of items.

Example 1.4 Consider an application about cancer patients, where one dataset contains records of patients who were cured and another dataset of patients who were not cured. A hypothetical useful EP {S1, S2, T1, T2, T3}, with growth rate of 9 from the not-cured to the cured, may say that, among all cancer patients who had both symptoms S1 and S2 and who had received all treatments T1, T2, and T3, the number of cured patients is 9 times the number of patients who were not cured; this may suggest that the treatment combination should be applied whenever the symptom combination occurs (if there are no better plans). The EP may be low in support, such as 1% only; such EPs may be new knowledge to the medical field because there were no efficient methods to find EPs with such low support and such long length. This EP may even contradict the prevailing knowledge about the effect of each treatment Tj on each symptom Si. A selected set of such EPs can be a useful guide to doctors in deciding what treatment should be used for a given medical situation.

1.1 Difficulties and challenges

Since EPs with large supports are perhaps folklore already, an interesting problem is to discover EPs with small support (e.g. 5% or even 0.1%). This is a challenge for two reasons: (i) the useful Apriori property no longer holds for EPs (as can be seen from Example 1.1); (ii) there are usually too many candidates. These make naive algorithms too costly. Consider the naive algorithm which examines all itemsets of the PUMS dataset. Since there are more than 350 items, this naive algorithm would need to process 2^350 itemsets (to find their supports in both D1 and D2 and then determine their growth rates). Obviously, this is an impossible task. We will see later that even a clever naive algorithm, which conducts an exhaustive search over a reduced (but still too large, e.g. having 2^40 ≈ 10^12 itemsets [1]) candidate collection of itemsets, will also take too long.

[Footnote 1: Bayardo got large itemsets of length 38 at 10% support threshold and of length 40 at 5%, on a housing dataset extracted from PUMS.]

1.2 Main contributions

In addition to introducing the new data mining problem of EPs, we make the following main contributions to solving this problem. (Some other algorithms for discovery of general or special EPs are also given here.)

Set intervals and their border description: In the mining of EPs, we frequently need to deal with very large collections of sets. As shown above, naive handling of such collections usually requires unbearably long processing time. However, these collections of itemsets may have some nice properties that can be utilized in devising ways to efficiently process them. Previous data mining work observed and utilized some aspects of such nice properties, for example [1, 3, 16]. We go one step further by formalizing the notion of set intervals, defined as collections S of sets that are interval closed: if X and Z are in S and Y is a set such that X ⊆ Y ⊆ Z, then Y is in S. For example, collections of all large itemsets for given thresholds are interval closed. Importantly, we describe large interval-closed collections of itemsets using their borders, defined as the pair of the sets of the minimal itemsets and of the maximal ones. (Previous data mining research [1, 3, 16] only used one-sided borders on subset-closed collections; see the discussion on related work.) Clearly, borders are usually much smaller than the collections they represent. The concept of borders lays the foundation for our next contribution.

EP discovery by border manipulation: We propose a suite of EP mining algorithms which discover a class of EPs by manipulating only the borders of some two collections. (The backbone operation is our border differential procedure.) This avoids the long process required by naive algorithms to get the counts of all itemsets in a large collection of candidates. Our algorithms can derive, from borders of large itemsets, all EPs whose support satisfies a minsup threshold in D2, where minsup is the smallest possible threshold at which Bayardo's Max-Miner succeeds. The discovered EPs are also represented using borders, and thus our algorithms may finish quickly even when the number of EPs is large. (For mushroom, there are about 2^28 EPs for the growth rate threshold of 2.5; these are represented by about half a million borders.)

1.3 Related work and paper organization

Although EPs are similar to discriminant rules [10] (assertions true on instances of a given class but untrue on other instances) and evolution rules [10] in that they are all about different datasets/classes, EPs are different because they are not limited by the exclusiveness constraint and because the extra information of growth rate is attached. Because our EPs are not restricted by the exclusiveness constraints, the value-merge based induction method of [10] is not applicable. We note that jumping EPs (cf. Section 5) are special types of discriminant rules.

Our definition and ways of using borders are different from those used in most previous investigations. Our borders are defined to represent interval-closed collections by their boundary elements, whereas the borders of [16] are limited to subset-closed collections. The two bounds of our borders are subcollections of the given collections, consisting respectively of their minimal (maximal) elements; one bound of the borders of [16], the negative border, consists however of minimal elements not in the original collections. To eliminate the need to examine too many candidates, we introduce and use novel algorithms such as border differential, whereas [16] uses borders directly to control level-wise search over the candidate space. Max-Miner [3] only uses one bound (the right-hand one) of our large borders.

We obtained all our results without knowing [8], which is concerned with efficiency issues of ATMSs (assumption-based truth maintenance systems). Interestingly, that paper contained some ideas similar to ours, including the representation of interval-closed collections (called convex spaces there) using borders (called boundaries there), and some operations for extracting the border of the difference of two collections from their borders. We have made several main new contributions to borders: (a) Regarding the algebra of borders, our backbone operation of border differential is new; for the difference operation, [8] only discussed how to find the maximal (respectively, minimal) elements of the difference of two subset-closed (respectively, superset-closed) collections from the maximal (respectively, minimal) elements of the two collections. It is interesting to see whether our border-differential operation can be applied to ATMSs. (b) We have brought the use of (two-sided) borders to the field of data mining, and we hope that this tool will be used more widely in this field. We are really happy to see that our investigations and [8], from very different areas, share the use of the tool of borders, which indicates that these tools are really powerful.

Our work is also related to the mining of regularities in time series [9, 11, 18, 2, 17, 14, 4, 12]. Our work is different in that we look for abnormal growth, instead of regularities.

The rest of the paper is organized as follows: Section 2 formally defines the EP mining problem and gives a decomposition of it. Section 3 discusses the border description of interval-closed collections of itemsets, and the efficient discovery of borders using Max-Miner. Section 4 presents our main algorithms for discovering EPs by border manipulation. Section 5 gives an overview of other algorithms. Section 6 presents a performance evaluation of the algorithms. Finally, Section 7 offers some concluding remarks.

2 The EP mining problem and its decomposition

Let I = {i1, i2, ..., iN} be a set of items. A transaction is a subset T of I. A dataset is a set D of transactions. A subset X of I is called a k-itemset (or simply an itemset), where k = |X|. We say a transaction T contains an itemset X if X ⊆ T. The support of an itemset X in a dataset D, denoted supp_D(X), is count_D(X)/|D|, where count_D(X) is the number of transactions in D containing X. Given a positive number θ, we say an itemset X is θ-large in D if supp_D(X) ≥ θ, and X is θ-small in D otherwise. Let Large_θ(D) (resp. Small_θ(D)) denote the collection of all θ-large (resp. θ-small) itemsets.

Assume that we are given an ordered pair of datasets D1 and D2, and let supp_i(X) denote supp_{Di}(X).
The growth rate of an itemset X from D1 to D2, denoted GrowthRate(X), is defined as

  GrowthRate(X) = 0,                   if supp1(X) = 0 and supp2(X) = 0;
  GrowthRate(X) = ∞,                   if supp1(X) = 0 and supp2(X) ≠ 0;
  GrowthRate(X) = supp2(X)/supp1(X),   otherwise.

Definition 2.1 Given ρ > 1 as a growth-rate threshold, an itemset X is said to be a ρ-emerging pattern (ρ-EP, or simply EP) from D1 to D2 if GrowthRate(X) ≥ ρ.

Example EPs are given in Section 1. Our interest in EPs is mainly in the degree of change in supports, not in their actual supports.
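To make the definitions concrete, the support and growth-rate computations can be sketched as follows (our illustration, not the paper's code; the toy transactions are hypothetical):

```python
def support(itemset, dataset):
    """supp_D(X) = count_D(X) / |D|: fraction of transactions containing X."""
    itemset = frozenset(itemset)
    return sum(itemset <= set(t) for t in dataset) / len(dataset)

def growth_rate(itemset, d1, d2):
    """GrowthRate(X) from D1 to D2, per the piecewise definition above."""
    s1, s2 = support(itemset, d1), support(itemset, d2)
    if s1 == 0 and s2 == 0:
        return 0.0
    if s1 == 0:
        return float("inf")
    return s2 / s1

# Hypothetical toy datasets (items are strings).
d1 = [{"a", "b"}, {"b", "c"}, {"a", "c"}, {"c"}]
d2 = [{"a", "b"}, {"a", "b", "c"}, {"a", "b"}, {"b"}]

print(growth_rate({"a", "b"}, d1, d2))  # 3.0 (support rose from 1/4 to 3/4)
```

With a growth-rate threshold ρ = 2, the itemset {"a", "b"} would thus be an EP from d1 to d2.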

[Figure 1: The support plane. Its axes are the supports supp1(X) and supp2(X), both ranging over [0, 1]; line l2 is supp1(X) = θ_min and line l3 is supp2(X) = δ_min; the labeled points A through H include E, with both supports equal to 1, and G, the intersection of l2 and l3.]

Definition 2.2 The EP mining problem is, for a given growth-rate threshold ρ, to find all ρ-EPs.

A 2-D pictorial support plane (Figure 1) is helpful for describing our decomposition of the EP mining problem. We identify a point (σ1, σ2) in this plane with all itemsets X such that (supp1(X), supp2(X)) = (σ1, σ2). Observe that, given ρ, the supports of all ρ-EPs from D1 to D2 must fall onto (the region enclosed by) the triangle ACE. The point G is important to our analysis. We require that θ_min × ρ = δ_min, and that this point should be as near the origin as possible. The exact values of θ_min and δ_min are chosen so that the best algorithm can discover all θ_min-large itemsets in D1 and all δ_min-large itemsets in D2 within the allowed time. (Actually, we do not need to find all these large itemsets; we only need to find their descriptions using "borders".) For many applications, the best algorithm should be Max-Miner, which was designed to work efficiently even when long patterns are present. Now, the EP mining problem can be divided into three sub-problems.

Finding EPs falling into (or simply in) the BCDG rectangle. In this paper, we will concentrate our effort on this subproblem. From Figure 1, we see that the EPs in the BCDG rectangle are precisely those itemsets whose supports in D2 are ≥ δ_min but in D1 are < θ_min. The basic ideas of our novel algorithms are as follows: We use the border of Large_{θ_min}(D1) and the border of Large_{δ_min}(D2), instead of Large_{θ_min}(D1) and Large_{δ_min}(D2) themselves, as inputs to our algorithms. The algorithms derive the EPs by manipulating only the two given borders, and produce the border representation of the derived EPs. The high efficiency of these algorithms comes from their avoiding the handling of exponentially many candidates and avoiding the printing of a large number of EPs. As a result, our algorithms are efficient even for discovering long EPs which the naive algorithms and Apriori-like algorithms may fail to find.

We now briefly describe a semi-naive algorithm and analyze why it is usually inefficient. This semi-naive algorithm first finds the supports, in D1 and in D2, of all itemsets in Large_{δ_min}(D2), and checks whether their growth rates are greater than ρ. While this is smarter than the naive algorithm given earlier, and it may be quite fast for "small" applications, it is still too inefficient to be of use in "large and wide" applications. Indeed, for reasonable δ_min, Large_{δ_min}(D2) can contain around 2^40 itemsets in the PUMS dataset. When the datasets are big, finding the counts of those large itemsets will also take too long.

Finding emerging patterns in △GDE. A candidate set of EPs in this region is the set of all itemsets whose supports in D1 are ≥ θ_min and in D2 are ≥ δ_min. (The supports of these EPs lie in the big rectangle just above the segment GD in Figure 1.) This set of candidates is exactly Large_{θ_min}(D1) ∩ Large_{δ_min}(D2). When the intersection is relatively small, we can find the EPs by checking the supports of all candidates in the intersection. Observe that the approximate size of this candidate set can be quickly estimated from its border description. When the intersection is large, we solve this subproblem by recursively applying the border algorithm used for the BCDG rectangle within △ACE to the new triangle △GDE (by creating a new corresponding rectangle in △GDE), until we have found all EPs (or almost all). The details are in the full paper.

Finding emerging patterns in △ABG. This is a very hard case, as EPs in this region may have very small supports in D1, in D2, or both. There are simply too many itemsets X such that supp1(X) < θ_min (or supp2(X) < δ_min). In the full paper we offer some methods, which reduce the number of candidates using border manipulations, to find some EPs in this region. It is still a challenge to discover all EPs in this triangle.

In summary, we present algorithms to find all EPs in all but the △ABG triangle in this paper, based on efficient algorithms for the BCDG rectangle.

We conclude this section with a few words on preprocessing. In real datasets collected from different periods of time or different regions, there can be items which occur in one dataset but not the other. All itemsets containing such items will have a growth rate of ∞ or 0, depending on whether these are new items in D2 or discontinued items in D1. Such information can be collected in two passes over the two datasets. Henceforth, we will assume that all such items are removed from the transactions.
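The two-pass preprocessing just described can be sketched as follows (a minimal illustration of ours, not the paper's code; the toy transactions are hypothetical):

```python
def remove_one_sided_items(d1, d2):
    """Drop items appearing in only one of the two datasets.

    Itemsets containing such items trivially have growth rate
    infinity (new in D2) or 0 (discontinued in D1), so they are
    handled separately, as described above.
    """
    # Pass 1: collect the item alphabet of each dataset.
    items1 = set().union(*d1) if d1 else set()
    items2 = set().union(*d2) if d2 else set()
    common = items1 & items2
    # Pass 2: keep only the shared items in every transaction.
    return ([set(t) & common for t in d1],
            [set(t) & common for t in d2])

d1 = [{"a", "b"}, {"b", "x"}]  # "x" never occurs in d2
d2 = [{"a", "b", "y"}, {"a"}]  # "y" never occurs in d1
cleaned1, cleaned2 = remove_one_sided_items(d1, d2)
print(cleaned1)  # "x" has been dropped from every transaction
```

After this step, every remaining item occurs in both datasets, so all growth rates of 0 or ∞ that survive are due to itemset combinations rather than one-sided items.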

3 Borders and their efficient discovery

The key notion of borders is introduced for efficiently representing large collections of itemsets (or other types of sets), and it plays a central role in our main algorithms for discovering EPs. Collections of large itemsets for a given support threshold can be concisely described by borders, and such borders can be efficiently discovered by Max-Miner [3].

3.1 Using borders to represent large collections

In the mining of EPs, we frequently need to deal with very large collections of itemsets. These collections may have some nice properties that can be utilized in devising ways to efficiently process them. Previous data mining research observed and utilized mainly one such nice property, namely the subset-closedness of collections of large itemsets; representatives of such research are Apriori's candidate generation [1], Max-Miner's look-ahead technique [3] and levelwise search [16]. We observe that collections of large itemsets are interval-closed, and will make use of this property in the efficient mining of EPs.

A collection S of sets is called interval closed if, for all X, Z ∈ S and for all Y, it is the case that Y ∈ S whenever X ⊆ Y ⊆ Z. A concrete example of an interval-closed collection is

S = {{1,2}, {2,3}, {1,2,3}, {1,2,4}, {2,3,4}, {1,2,3,4}}.

Proposition 3.1 The collection of all large itemsets w.r.t. any fixed threshold is interval closed.

Borders are introduced for succinct representation of very large interval-closed collections.

Definition 3.2 An ordered pair <L, R> is called a border, with L the left-hand bound of this border and R the right-hand bound, if (a) each of L and R is an antichain [2] collection of sets, and (b) each element of L is a subset of some element in R and each element of R is a superset of some element in L. The collection of sets represented by, or the set interval of, a border <L, R> is [L, R] = {Y | ∃X ∈ L, ∃Z ∈ R such that X ⊆ Y ⊆ Z}. The collection [L, R] is said to have <L, R> as border.

[Footnote 2: A collection S of sets is an antichain if X and Y are incomparable sets (i.e. X ⊈ Y and Y ⊈ X) for all X, Y ∈ S.]

Example 3.3 The set interval of <{{1}, {2,3}}, {{1,2,3}, {2,3,4}}> is {{1}, {1,2}, {1,3}, {1,2,3}, {2,3}, {2,3,4}}. The set interval of <{{1,2}}, {{1,2,3,4,5}, {1,2,4,5,6}}> consists of 12 itemsets: all sets that are both supersets of {1,2} and subsets of either {1,2,3,4,5} or {1,2,4,5,6}. There are many simple borders whose associated set intervals are very large collections; e.g. the set interval of <{{1}}, {{1,2,3,4,5,6,7,8,9,10,11,12}}> has 2^11 itemsets.

A border <L, R> is a syntactic object consisting of the two collections L and R, and its semantics is [L, R], the interval of sets bounded by the sets in L (R) from below (above). Conditions (a) and (b) in the definition ensure that borders are minimal in size. Observe that <∅, ∅> is a valid border, and its set interval is [∅, ∅] = ∅, the empty collection. A set interval can be viewed as a generalization of intervals over linearly ordered domains (e.g. the reals) to partially ordered domains, especially the partially ordered domain of sets.

There is a one-to-one correspondence between borders and interval-closed collections:

Proposition 3.4 Each interval-closed collection S of sets has a unique border <L, R>, where L is the collection of minimal sets in S and R is the collection of maximal sets in S.

In this paper, we prefer "rooted" borders to non-rooted ones, for their conceptual simplicity. A border <L, R> is called left-rooted if L is a singleton set, right-rooted if R is a singleton set, and rooted if it is left-rooted or right-rooted or both. It is easy to see that the set interval of a border is the union of the set intervals of a number of borders which are all left rooted (or all right rooted). Our algorithms will aim to produce such outputs as representations of emerging patterns. The following result makes borders useful in this work.

Proposition 3.5 The collection of all large (resp. small) itemsets, Large_θ(D) (resp. Small_θ(D)), with respect to any fixed threshold θ in a dataset D has a left- (resp. right-) rooted border. More specifically, the left-hand bound of the border of Large_θ(D) is {∅} and its right-hand bound is the set of maximal large itemsets. Similarly, the left-hand bound of the border of the θ-small itemsets in D is the set of minimal small itemsets and its right-hand bound is {I}.

We will call the border of Large_θ(D) a large border, denoted LargeBorder_θ(D), and the border of Small_θ(D) a small border. As the left-hand bound of all large borders is always {∅}, any two large borders differ only in their right-hand bounds.

The sizes of borders <L, R> of such collections of large (or small) itemsets, in practical situations, measured as |L| + |R|, are bounded by some polynomial in N = |I|, although the worst-case bound can be the exponential function C(N, N/2) (which happens when all itemsets of cardinality N/2 are in L or R).
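The semantics of a border can be made concrete with a small membership test; this sketch (ours, not the paper's) represents each bound of a border as a list of frozensets:

```python
def in_set_interval(y, left, right):
    """Test whether itemset y lies in [L, R], i.e. some X in L
    with X ⊆ y and some Z in R with y ⊆ Z."""
    y = frozenset(y)
    return (any(x <= y for x in left) and
            any(y <= z for z in right))

# The first border of Example 3.3.
L = [frozenset({1}), frozenset({2, 3})]
R = [frozenset({1, 2, 3}), frozenset({2, 3, 4})]

print(in_set_interval({1, 2}, L, R))  # True: {1} ⊆ {1,2} ⊆ {1,2,3}
print(in_set_interval({2, 4}, L, R))  # False: no element of L is a subset of {2,4}
```

Enumerating all subsets of {1,2,3,4} against this test yields exactly the six itemsets listed in Example 3.3, while the border itself stores only four sets.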

3.2 Discovering large borders

We will use Max-Miner for the efficient discovery of large borders. Max-Miner was introduced in [3]. It derives the set of maximal large itemsets of a dataset for a given support threshold, which is, in our notation, the right-hand bound of the large border of all large itemsets. Compared to other mining algorithms such as Apriori or its variants, Max-Miner is able to discover longer patterns in high dimensional data because it uses a look-ahead technique in addition to the Apriori candidate generation technique. Roughly speaking, the look-ahead technique finds the counts of some candidate itemsets that are not generated by Apriori's candidate generation method (from the currently known large itemsets), and uses those counts to eliminate the need to count many of the candidate itemsets that Apriori would require. This technique is useful since, once a look-ahead itemset is known to be large, all of its sub-itemsets are known to be large as well, and hence there is no need to find their counts. Max-Miner uses the set-enumeration trees (SE-trees) of [19] as the framework for this look-ahead search strategy.

4 Border-based discovery of emerging patterns

Our border-based algorithms can discover all EPs in the BCDG rectangle of Figure 1, and they do this without enumerating the elements of very large collections. In fact, the entire process of discovering EPs in this rectangle only needs to deal with borders: the algorithms take borders, representing collections of large itemsets, as inputs; they manipulate only borders in the internal process; and they produce as output borders that represent EPs. As a consequence, the running time of this process is short in most practical situations, even when the number of EPs in the BCDG rectangle is huge.

We concentrate our discussion on the situation where two large borders, one from D1 and one from D2, are available. We will briefly discuss the other combinations in the next section.
The differential procedure, called Border-Diff, is given in the first subsection. Then the main algorithm, MBD-LLborder, which uses Border-Diff as a subroutine, is given in the second subsection for discovering the EPs themselves.

4.1 Differential between borders

Border-Diff aims to derive the differential between a pair of borders of a special form: given a pair of borders <{∅}, {U}> and <{∅}, R1>, Border-Diff derives another border <L2, {U}> such that [L2, {U}] = [{∅}, {U}] − [{∅}, R1] (see Figure 2). Importantly, Border-Diff achieves this by manipulating only the itemsets in the borders. We give two versions of the algorithm; the first version is more declarative and thus

easier to understand, and the second version is more procedural and more efficient.

[Figure 2: Border differential. Given <{∅}, {U}> and <{∅}, R1>, find <L2, {U}>.]

[Figure 3: The lattice of subsets of {1,2,3,4}, illustrating [{∅}, {{1,2,3,4}}] − [{∅}, {{3,4}, {2,4}, {2,3}}] = [{{1}, {2,3,4}}, {{1,2,3,4}}].]

Border-Diff(<{∅}, {U}>, <{∅}, {S1, S2, ..., Sk}>)
;; first version
;; return border of [{∅}, {U}] − [{∅}, {S1, S2, ..., Sk}]
1) L ← {{s1, s2, ..., sk} | si ∈ U − Si, 1 ≤ i ≤ k};
2) remove from L all non-minimal itemsets;
3) return <L, {U}>;

Example 4.1 We now illustrate how Border-Diff works on the arguments <{∅}, {{1,2,3,4}}> and <{∅}, {{3,4}, {2,4}, {2,3}}>. (See Figure 3.) Border-Diff first derives the set

L = {{1}, {1,4}, {1,3}, {1,3,4}, {1,2}, {1,2,4}, {1,2,3}, {2,3,4}}.

(The set {1} is the result of "removing duplicates from the multiset {1,1,1}".) Then it removes the non-minimal sets in L to produce <{{1}, {2,3,4}}, {{1,2,3,4}}>. In essence, this algorithm generates the whole "cartesian product" (U − S1) × ... × (U − Sk) before eliminating the non-minimal elements.

We argue that the algorithm is correct using Example 4.1. (A formal proof is given in the full paper.) Let U = {1,2,3,4}, S1 = {3,4}, S2 = {2,4}, and S3 = {2,3}. Let I denote the set interval [{∅}, {{1,2,3,4}}] − [{∅}, {{3,4}, {2,4}, {2,3}}]. Observe that I consists of all itemsets that are subsets of U but not subsets of any of S1, S2, and S3. For each such itemset X, therefore, its elements cannot all lie in any single Si. Thus, there must exist at least one element of X in each of U − S1, U − S2, and U − S3. Therefore, itemset X must be a superset (possibly a proper superset) of some set in {{x1, x2, x3} | xi ∈ U − Si, i = 1, 2, 3}. On the other hand, it is obvious that any set {x1, x2, x3}, with xi ∈ U − Si for i = 1, 2, 3, must be a subset of U but not a subset of any of S1, S2, and S3. Together with the definition of borders, the correctness of Border-Diff is thus verified.

As can be seen from the above example, the way the first version of Border-Diff finds L may not be very efficient if k is large and many of the U − Si (say ℓ of them) have two or more elements (since it then needs to enumerate at least 2^ℓ sets). The following improved Border-Diff algorithm uses a more efficient method for deriving L.

Border-Diff(<{∅}, {U}>, <{∅}, {S1, S2, ..., Sk}>)
;; improved version
;; return border of [{∅}, {U}] − [{∅}, {S1, S2, ..., Sk}]
1) initialize L to {{x} | x ∈ U − S1};
2) for i = 2 to k do
3)   L ← {X ∪ {x} | X ∈ L, x ∈ U − Si};
4)   remove all Y in L that are not minimal;
5) return <L, {U}>;

Example 4.1 (cont'd) The improved Border-Diff algorithm works as follows: It first initializes L to {{1}, {2}}, since U − S1 = {1, 2}. Then it updates L to {{1}, {1,2}, {1,3}, {2,3}}, since U − S2 = {1, 3}, and reduces it to {{1}, {2,3}}. Finally it updates L to {{1}, {1,2,3}, {1,4}, {2,3,4}}, since U − S3 = {1, 4}, and reduces it to {{1}, {2,3,4}}.

The improved Border-Diff is more efficient than the first version because it iteratively removes the non-minimal elements from the intermediate result for U − S1, ..., U − Si before processing U − S_{i+1}, thus avoiding generating large intermediate results in general, and avoiding generating the whole cartesian product (U − S1) × ... × (U − Sk) in particular.

We can also see that Border-Diff is more efficient than the naive algorithms. Using naive algorithms, we would need to enumerate all itemsets covered by the border <{∅}, {{1,2,3,4}}> and remove those itemsets covered by the border <{∅}, {{3,4}, {2,4}, {2,3}}>. Border-Diff, on the other hand, is more clever: it examines the border bounds only.

4.2 Emerging patterns by multiple Border-Diff (MBD-LLborder)

We are now ready to present the main algorithm of this paper, namely MBD-LLborder. This algorithm can discover all EPs in the BCDG rectangle

of Figure 1, and it achieves this goal by manipulating only its input borders. Procedurally, this algorithm derives the EPs in the BCDG rectangle by calling Border-Diff multiple times. Each call uses one itemset from the right-hand bound of the large border of D2, together with the whole right-hand bound of the large border of D1, to form its two arguments.

Assume that we have found LargeBorder_θ(D1) and LargeBorder_δ(D2) for some θ and δ satisfying δ = θ × ρ. Suppose

LargeBorder_θ(D1) = <{∅}, {C1, C2, ..., Cm}>,
LargeBorder_δ(D2) = <{∅}, {D1, D2, ..., Dn}>.

We first describe the basis of this algorithm. Since all EPs in the BCDG rectangle must have support ≥ δ in D2 but < θ in D1, they are exactly the elements of the following set [3]:

∪_j PowSet(Dj) − ∪_{i=1..m} PowSet(Ci) = ∪_j (PowSet(Dj) − ∪_{i=1..m} PowSet(Ci)).

[Footnote 3: PowSet(S) (resp. SupSet(S)) denotes the collection of all subsets (resp. supersets) of S.]

Given j, 1 ≤ j ≤ n, the collection PowSet(Dj) − ∪_{i=1..m} PowSet(Ci) consists of those sets that are subsets of Dj but not subsets of any Ci (1 ≤ i ≤ m); equivalently, the collection consists of those sets that are subsets of Dj but not subsets of any C'i (1 ≤ i ≤ m), where C'i denotes Ci ∩ Dj; again equivalently, the collection consists of those sets that are subsets of Dj but not subsets of any of the maximal C'i's among C'1, ..., C'm. Suppose {C'1, C'2, ..., C'k} is an enumeration of these maximal itemsets (k ≤ m). Now, PowSet(Dj) − ∪_{i=1..k} PowSet(C'i) is the collection of all itemsets covered by a border, and that border is precisely the one derived by

Border-Diff(<{∅}, {Dj}>, <{∅}, {C'1, C'2, ..., C'k}>).

Therefore, the collection of all EPs in the BCDG rectangle is the union of up to n set intervals of the borders derived by calling Border-Diff in this way for all j.

The above approach is now formulated as our algorithm MBD-LLborder.

MBD-LLborder(LargeBorder_θ(D1), LargeBorder_δ(D2))
;; return all EPs in the BCDG rectangle
1) EPborders ← {};
2) for j from 1 to n do
3)   if some Ci is a superset of Dj then continue;
4)   {C'1, ..., C'm} ← {C1 ∩ Dj, ..., Cm ∩ Dj};
5)   RightBound ← the set of all maximal itemsets in {C'1, ..., C'm};
6)   add Border-Diff(<{∅}, {Dj}>, <{∅}, RightBound>) into EPborders;
7) return EPborders;
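The improved Border-Diff and MBD-LLborder above can be sketched in Python as follows (our illustration, not the authors' implementation; each large border is passed as its right-hand bound only, since the left-hand bound is always {∅}):

```python
def border_diff(u, s_list):
    """Improved Border-Diff: return the left bound L of the border
    <L, {U}> representing [{∅},{U}] − [{∅},{S1,...,Sk}]."""
    u = frozenset(u)
    s_list = [frozenset(s) for s in s_list]
    l = [frozenset({x}) for x in u - s_list[0]]
    for s in s_list[1:]:
        l = [x | {e} for x in l for e in u - s]
        # keep only the minimal itemsets at each step
        l = [x for x in set(l) if not any(y < x for y in set(l))]
    return l

def mbd_llborder(right_bound_d1, right_bound_d2):
    """MBD-LLborder: the EPs in the BCDG rectangle, as a list of
    borders (left_bound, right_bound_itemset)."""
    cs = [frozenset(c) for c in right_bound_d1]
    ep_borders = []
    for dj in map(frozenset, right_bound_d2):
        if any(c >= dj for c in cs):  # the difference is empty
            continue
        inter = {c & dj for c in cs}
        right = [c for c in inter if not any(c < d for d in inter)]
        ep_borders.append((border_diff(dj, right), dj))
    return ep_borders

# The inputs of Example 4.2.
borders = mbd_llborder(
    [{2, 3, 5}, {3, 4, 6, 7, 8}, {2, 4, 5, 8, 9}],
    [{1, 2, 3, 4}, {6, 7, 8}])
for left, u in borders:
    print(sorted(map(sorted, left)), "->", sorted(u))
    # [[1], [2, 3, 4]] -> [1, 2, 3, 4]
```

Run on the data of Example 4.2, the sketch produces the single border <{{1}, {2,3,4}}, {{1,2,3,4}}> and skips the second candidate, matching the hand trace in the text.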

Example 4.2 For

  LargeBorder_δ(D1) = <{∅}, {{2,3,5}, {3,4,6,7,8}, {2,4,5,8,9}}> and
  LargeBorder_θ(D2) = <{∅}, {{1,2,3,4}, {6,7,8}}>,

the EPs in the BCDG rectangle are the union of the two sets represented by the following two differences:

  PowSet({1,2,3,4}) − PowSet({2,3,5}) − PowSet({3,4,6,7,8}) − PowSet({2,4,5,8,9})

and

  PowSet({6,7,8}) − PowSet({2,3,5}) − PowSet({3,4,6,7,8}) − PowSet({2,4,5,8,9}).

The (intersection step of the) algorithm now updates the first difference to

  PowSet({1,2,3,4}) − PowSet({2,3}) − PowSet({3,4}) − PowSet({2,4}).

It then calls Border-Diff(<{∅}, {{1,2,3,4}}>, <{∅}, {{3,4}, {2,4}, {2,3}}>) to produce the border

  <{{1}, {2,3,4}}, {{1,2,3,4}}>

(discussed above). Since {6,7,8} ⊆ {3,4,6,7,8} (this Ci is a superset of Dj), MBD-LLborder does not call Border-Diff for the second difference, which is empty.

One can estimate the number of discovered EPs from their border description. If there are too many EPs, it is more sensible to examine the border representation, as this avoids the tedious process of enumerating the EPs. One can also use this representation for a membership test, to decide whether a given itemset is an EP; this can be done very efficiently, needing at most 2 × n × m set containment tests. If there are only a small number of EPs in the rectangle, the EPs themselves can be easily enumerated, either through the SE-tree enumeration method [19] or the dec-SE-tree enumeration method given in our full paper.

In our experiments the MBD-LLborder algorithm is very fast in producing all EPs in the BCDG rectangle in many practical situations. In the full paper we also give alternative techniques for finding some EPs in this rectangle in the remaining rare cases.

5 Other algorithms

In the previous section we only discussed how to find EPs from two large borders. In this section we give sketches (due to space limitations) of other

algorithms detailed in the full paper or elsewhere. These algorithms consider other border combinations, discover "strong" EPs, or discover "jumping" EPs.

Recall that Small_θ(D) denotes the set of θ-small itemsets in D, and that we call the border of Small_θ(D) a small border. Like large itemsets, small itemsets are interval closed.

In the full paper we discuss how to find EPs from other border combinations: one large border and one small border, and two small borders. There we present border conversion algorithms, which can translate a large border to the corresponding small border and vice versa. (Recall that the large border w.r.t. a support threshold θ and the small border w.r.t. θ are complementary.) Since Max-Miner and SE-trees cannot be used to discover small borders, in the full paper we also introduce Min-Miner and decreasing SE-trees for discovering small borders directly.

The discovery of strong EPs (defined as those EPs all of whose subsets are also EPs) can be done in a way similar to Apriori, by using the subset closure property.

Another method was introduced in [6] for discovering a special type of EPs, called jumping EPs: EPs whose supports increase abruptly from zero support in one dataset to non-zero support in the other. In that method, we do not use Max-Miner to find the needed large borders. Instead, we use a new algorithm, Horizon-Miner, to find the large border of all itemsets with non-zero support. Then we use MBD-LLborder to find the jumping EPs, taking the two large borders derived by Horizon-Miner as input. These jumping EPs were used to build very powerful classifiers [7, 13].

6 Performance evaluation

We now report a performance evaluation of the algorithms proposed in this paper, showing their efficiency. We carried out experiments on many datasets, including two high dimensional datasets, namely the mushroom dataset and the U.S. Census dataset.
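For intuition, the core of the Horizon-Miner step described in Section 5 can be sketched in a few lines. Under our reading of the text, the large border of all itemsets with non-zero support is <{∅}, M>, where M is the set of maximal transactions; the function name below is ours:

```python
def horizon_miner(transactions):
    # Right-hand bound of the large border of all non-zero-support itemsets:
    # an itemset occurs in the data iff it is a subset of some transaction,
    # so the maximal transactions bound the collection from above.
    distinct = {frozenset(t) for t in transactions}
    return [t for t in distinct if not any(t < u for u in distinct)]
```

Feeding the two per-class borders produced this way into MBD-LLborder then yields the jumping EPs.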
The performance report concentrates on our MBD-LLborder algorithm, and shows that it is very efficient on these datasets. The large borders can be discovered very quickly⁴, either using Bayardo's Max-Miner (even for very small support thresholds such as 0.1%) or using our Horizon-Miner [6]; consequently, we do not include the time taken by Max-Miner or Horizon-Miner to discover the large borders.

All experiments were conducted on a 150MHz Sun SPARCstation-10 machine with 160M bytes of RAM,

⁴ See Figure 9 of [3]: CPU time is 10 seconds for a support threshold of 10%, 20 seconds for 5%, 40 seconds for 2.5%, and 100 seconds for 0.1%.

shared by many users at the University of Melbourne.

6.1 MBD-LLborder on PUMS

The U.S. census dataset is publicly accessible at the US Census Bureau's homepage (www.census.gov), through its data extraction system under the entry "(PUMS) Decennial Census Public Use Microdata Samples". PUMS contains a 5-percent sample of records on persons from the 1981-1990 decennial census. We intended to discover the regional differences hidden in these data.

In our experiment, 10,000 person records of the Texas dataset were selected as D1 and 10,000 of the Michigan dataset as D2. Each person record of the raw dataset from the Web site consists of 75 attributes. We discretized the attributes by partitioning quantitative attributes into intervals of varying lengths. After discretizing, each transaction still contains 75 items. After some encoding, where we discarded those items which are the smallest value of their corresponding attributes, there are a total of 295 items (now binary attributes), and the average size of the transactions is around 35.

Max-Miner was used to find the large borders for the support thresholds of 25%, 30%, 40%, and 50% in D1, and for the support thresholds of 20%, 30%, 40%, and 50% in D2. For each pair of large borders with different supports (one from D1, one from D2), we used the MBD-LLborder algorithm to discover EPs from the dataset with the smaller support threshold to the dataset with the larger one. MBD-LLborder found the EPs almost instantly. With the above large borders in D1 and D2, EPs were found for only 3 of the pairs (and none for the other combinations).
These 3 pairs are listed in the table below.

  From       To             C      NB   MS
  Texas 25%  Michigan 30%   20s    122  13
  Texas 30%  Michigan 40%   0.28s  8    8
  Texas 40%  Michigan 50%   0.08s  9    7

  C: CPU time; NB: number of borders of EPs; MS: max EP size

The following is a right-rooted border of the 1.2-EPs between the populations of Texas and Michigan (obtained from the large border of the 25%-large itemsets of Texas and that of the 30%-large itemsets of Michigan):

  <{{26, 116, 125, 147, 158, 216, 272, 278}, {26, 116, 125, 158, 272, 278, 280}},
   {{26, 28, 116, 125, 147, 158, 164, 166, 216, 272, 278, 280, 287}}>

These numbers correspond to the notations in PUMS as follows: 26 for Disabl1:2, 28 for Disabl2:2, 116 for Lang1:2,

125 for Means:1, 147 for Mobilili:2, 158 for Perscar:2, 164 for Powpuma:[1..99899], 166 for Powstat:[1..56], 272 for Travtim:[1..59], 278 for Work89:1, 280 for Worklwk:1, 287 for Yearwrk:1.

6.2 MBD-LLborder on Mushroom

The mushroom dataset (available from the UCI Machine Learning Repository) consists of two classes of data: the edible class, containing 4208 instances, and the poisonous class, containing 3916 instances. The encoded dataset contains 121 binary items after each attribute (22 in total) is mapped to a certain number of items. The large border in the edible class w.r.t. the support threshold of 1/4500 has a right-hand bound consisting of 4208 itemsets, and the large border in the poisonous class has a right-hand bound consisting of 3916 itemsets. These were discovered using Horizon-Miner. Taking these two large borders, MBD-LLborder found a huge number of EPs in about 30 minutes. We need 299811 borders to represent all such EPs from the poisonous class to the edible class, and 271715 borders to represent all such EPs from the edible class to the poisonous class. On average, each border represents around 2^18 = 262144 sets. Obviously, without the border mechanism, it is impossible to enumerate the complete set of EPs, and it would be very time consuming to mine them. Even using the semi-naive algorithm, we would need to examine approximately 2^32 itemsets, requiring at least 2^9 passes of I/O over the data (residing on disk). (Each itemset needs 2^5 bytes, assuming one byte per integer. 160M bytes is roughly 2^28 bytes, so 160M can hold 2^23 itemsets; hence 2^32 itemsets would require 2^9 iterations. The CPU time, which is dominant for high dimensional datasets, would be extremely long.)
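The back-of-envelope I/O estimate in the parenthetical can be replayed directly (assuming, as the figures suggest, 2^5 bytes per itemset, about 2^28 bytes of RAM, and 2^32 candidate itemsets):

```python
# Rough capacity arithmetic for the semi-naive approach (illustrative only)
bytes_per_itemset = 2 ** 5        # one byte per integer, up to 32 items
ram_bytes = 2 ** 28               # ~160M bytes of RAM, rounded to a power of 2
candidates = 2 ** 32              # approximate number of itemsets to examine

itemsets_in_ram = ram_bytes // bytes_per_itemset   # itemsets held per pass
passes = candidates // itemsets_in_ram             # full scans of the data
```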
Among the discovered EPs, some are singleton itemsets; some reach a cardinality of 22; many have a cardinality around 18.

6.3 Other datasets

In our investigations into building EP-based classifiers [7], we applied MBD-LLborder to large borders extracted from a large number of datasets, including breast-w, iris, pima, sonar, and wine from the UCI machine learning repository. The time needed by MBD-LLborder is always very short (in fact, less than 30 seconds).

7 Concluding remarks

We have introduced the data mining problem of emerging patterns (EPs). EPs can capture emerging trends in timestamped databases, or capture differentiating characteristics between classes of data. EPs have been

useful: we have used EPs to build very powerful classifiers (including for the Mushroom dataset) which are more accurate than C4.5 and CBA [15]. We believe that they are useful in many other applications.

These patterns can be large in size, and may have very small support (e.g. a trend at the forming stage). We observed that naive algorithms are too costly, because the useful Apriori property no longer holds for EPs and because there are usually too many candidates. Indeed, since there are large itemsets with 40 items even for the 5% support threshold in some PUMS datasets, even EPs with such a support threshold cannot be found using naive algorithms.

We studied the efficient mining of EPs and made the following major contributions: (a) We promoted the description of large collections of itemsets using their borders, which are usually much smaller. (b) We designed EP mining algorithms which manipulate only the borders of two collections, and which represent the discovered EPs using borders. The major border operation we used is our border differential operation. Analysis and experiments show that our approach works well for large datasets and for dense high dimensional datasets. These algorithms can find all EPs in the BCDG rectangle and all those in the GDE triangle. We believe that the border-based algorithms are useful in many other areas of research and applications, besides data mining.

Regarding future work, it is worthwhile to parallelize the MBD-LLborder algorithm, and to discover other classes of EPs in the △AGB triangle.

Acknowledgements

We are very grateful to Xiuzhen Zhang for her tremendous help with the implementation of the algorithms proposed in this paper; Richard Tse helped in the early stages of this project.
We are grateful to Limsoon Wong for bringing the paper [8] to our attention, to Roberto Bayardo for sharing information about experiments with his Max-Miner on the housing dataset, and to members of Jiawei Han's Data Mining Lab at Simon Fraser University for feedback received when the first author gave a seminar there.

References

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. Int. Conf. Very Large Data Bases (VLDB), 1994.
[2] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 1995 Int. Conf. Data Engineering (ICDE), 1995.
[3] R. J. Bayardo. Efficiently mining long patterns from databases. In Proc. 1998 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD), 1998.

[4] C. Bettini, X. S. Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21:32-38, 1998.
[5] G. Dong and J. Li. Interestingness of discovered association rules in terms of neighborhood-based unexpectedness. In X. Wu and K. Ramamohanarao, editors, Pacific-Asia Conference on Knowledge Discovery from Databases, LNAI, 1998.
[6] G. Dong, J. Li, and X. Zhang. Discovering jumping emerging patterns and experiments on real datasets. In Proc. 9th International Database Conference on Heterogeneous and Internet Databases (IDC99), Hong Kong, 1999.
[7] G. Dong, X. Zhang, L. Wong, and J. Li. CAEP: Classification by aggregating emerging patterns. Technical report, March 1999.
[8] C. A. Gunter, T.-H. Ngair, and D. Subramanian. The common order-theoretic structure of version spaces and ATMS's. Artificial Intelligence, 95(2):357-407, 1997.
[9] J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. In ICDE, 1999.
[10] J. Han and Y. Fu. Exploration of the power of attribute-oriented induction in data mining. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 399-421. AAAI/MIT Press, 1996.
[11] J. Han, W. Gong, and Y. Yin. Mining segment-wise periodic patterns in time-related databases. In Proc. 1998 Int. Conf. on Knowledge Discovery and Data Mining (KDD), 1998.
[12] B. Lent, R. Agrawal, and R. Srikant. Discovering trends in text databases. In KDD, 1997.
[13] J. Li, G. Dong, and K. Ramamohanarao. JEP-Classifier: Classification by aggregating jumping emerging patterns. Technical report, February 1999.
[14] H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional inter-transaction association rules. In Proc. 1998 SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'98), 1998.
[15] B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In KDD, 1998.
[16] H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3):241-258, November 1997.
[17] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering frequent episodes in sequences. In KDD, 1995.
[18] B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. In ICDE, 1998.

[19] R. Rymon. Search through systematic set enumeration. In Proc. Third International Conference on Principles of Knowledge Representation and Reasoning, 1992.
