CS490D: Introduction to Data Mining Chris Clifton


Transcript of CS490D: Introduction to Data Mining Chris Clifton


CS 5331 by Rattikorn Hewett

Texas Tech University

1

Data Mining

Primitives

2

Outline

Motivation

Data mining primitives

Data mining query languages

Designing GUI for data mining systems

Architectures

3

Motivations: Why primitives?

Data mining systems uncover a large set of patterns –

not all are interesting

Data mining should be an interactive process

User directs what to be mined

Users need data mining primitives to communicate with

the data mining system by incorporating them in a data

mining query language

Benefits:

More flexible user interaction

Foundation for design of graphical user interface

Standardization of data mining industry and practice

4

Data mining primitives

Data mining tasks can be specified in the form of data

mining queries by five data mining primitives:

Task-relevant data – input

The kinds of knowledge to be mined – function & output

Background knowledge – interpretation

Interestingness measures – evaluation

Visualization of the discovered patterns – presentation


5

Task-relevant data

Specify data to be mined

Database, data warehouse, relation, cube

Condition for selection & grouping

Relevant attributes

6

Knowledge to be mined

Specify data mining “functions”:

Characterization/discrimination

Association

Classification/prediction

Clustering

7

Background Knowledge

Typically given in the form of concept hierarchies:

Schema hierarchy

E.g., street < city < state < country

Set-grouping hierarchy

E.g., {low, high} ⊂ all, {30..49} ⊂ low, {50..100} ⊂ high

Operation-derived hierarchy

E.g., email address: [email protected]

login-name < department < university < organization

Rule-based hierarchy

E.g., 87 ≤ temperature < 90 → normal_temperature
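A small illustrative sketch (not from the slides) of how a set-grouping hierarchy and an operation-derived hierarchy might be encoded; the value ranges and the sample e-mail address are invented for illustration.

```python
# Hypothetical encodings of two of the hierarchy types listed above.

def generalize_score(score: int) -> str:
    """Set-grouping hierarchy: {30..49} -> low, {50..100} -> high, else all."""
    if 30 <= score <= 49:
        return "low"
    if 50 <= score <= 100:
        return "high"
    return "all"

def email_hierarchy(address: str) -> list[str]:
    """Operation-derived hierarchy: login-name < department < university < organization."""
    login, domain = address.split("@")
    return [login] + domain.split(".")

print(generalize_score(42))                 # low
print(email_hierarchy("jane@cs.bigu.edu"))  # ['jane', 'cs', 'bigu', 'edu']
```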

8

Interestingness

Objective measures:

Simplicity: simpler rules are easier to understand and likely to be interesting

(association) rule length, (decision) tree size

Certainty: validity of the rule. Rule A => B has confidence P(B|A) = #(A and B) / #(A) (see the sketch after this list)

classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.

Utility: potential usefulness. Rule A => B has support #(A and B) / sample size

noise threshold (description)

Novelty: not previously known, surprising (used to remove redundant rules)
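A minimal sketch of the two objective measures above (support and confidence), computed over a toy transaction list; the itemsets and transactions are invented for illustration.

```python
# Support and confidence of a rule A => B over a tiny, made-up transaction set.
transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"},
    {"bread"}, {"milk", "butter"},
]
A, B = {"milk"}, {"bread"}

n_A_and_B = sum(1 for t in transactions if A <= t and B <= t)
n_A = sum(1 for t in transactions if A <= t)

support = n_A_and_B / len(transactions)   # utility:  #(A and B) / sample size
confidence = n_A_and_B / n_A              # certainty: P(B|A) = #(A and B) / #(A)
print(support, confidence)                # 0.5 0.666...
```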


9

Visualization of Discovered Patterns

Specify the form to view the patterns

E.g., rules, tables, chart, decision trees, cubes, reports etc.

Specify operations for data exploration in

multiple levels of abstraction

E.g., drill-down, roll-up etc.

10

DMQL (Data Mining Query Language)

A DMQL can provide the ability to support ad-hoc and

interactive data mining

By providing a standardized language

Hope to achieve an effect similar to the one SQL has had on

relational databases

Foundation for system development and evolution

Facilitate information exchange, technology transfer,

commercialization and wide acceptance

DMQL is designed with the primitives described earlier

11

Languages & Standardization Efforts

Association rule language specifications:

MSQL (Imielinski & Virmani ’99)

MineRule (Meo, Psaila & Ceri ’96)

Query flocks based on Datalog syntax (Tsur et al. ’98)

OLE DB for DM (Microsoft 2000)

Based on OLE, OLE DB, OLE DB for OLAP

Integrating DBMS, data warehouse and data mining

CRISP-DM (CRoss-Industry Standard Process for Data Mining)

Providing a platform and process structure for effective data mining

Emphasizing the deployment of data mining technology to solve business problems

12

Designing GUI based on DMQL

What tasks should be considered in the design of

GUIs based on a data mining query language?

Data collection and data mining query composition

Presentation of discovered patterns

Hierarchy specification and manipulation

Manipulation of data mining primitives

Interactive multilevel mining

Other information


13

Architectures

Coupling data mining system with DB/DW system

No coupling - Flat file processing, not recommended

Loose coupling - Fetching data from DB/DW

Semi-tight coupling - Enhanced DM performance

Provide efficient implementation of a few data mining primitives in

a DB/DW system, e.g., sorting, indexing, aggregation, histogram

analysis, multiway join, precomputation of some stat functions

Tight coupling - A uniform information processing environment

DM is smoothly integrated into a DB/DW system; mining queries are

optimized based on mining query analysis, indexing, query processing

methods, etc.

14

Concept Description

15

Outline

Review terms

Characterization

Summarization

Hierarchical generalization

Attribute relevance analysis

Comparison/discrimination

Descriptive statistical measures

16

Review terms

Descriptive vs. predictive data mining

Descriptive: describes the data set in concise,

summarative, informative, discriminative forms

Predictive: constructs models representing the data set,

and uses them to predict behaviors of unknown data

Concept description: involves

Characterization: provides a concise and succinct

summarization of the given collection of data

Comparison (discrimination): provides descriptions

comparing two or more collections of data


17

Concept Description vs. OLAP

Concept description:

can handle complex data types (e.g., text,

image) of the attributes and their aggregations

a more automated process

OLAP:

restricted to a small number of dimension and

measure data types

user-controlled process

18

Outline

Review terms

Characterization

Summarization

Hierarchical generalization

Attribute relevance analysis

Comparison/discrimination

Descriptive statistical measures

19

Characterization methods

One approach to characterization is to transform data from low conceptual levels to high ones: “data generalization”

E.g., daily sales → annual sales

Biology → Science

Two Methods:

Summarization – as in Data Cube’s OLAP

Hierarchical generalization – Attribute-oriented induction

Data generalization?

20

Summarization by OLAP

Data are stored in data cubes

Identify summarization computations e.g., count( ), sum( ), average( ), max( )

Perform computations and store results in data cubes

Generalization and specialization can be performed on a data cube by roll-up and drill-down

An efficient implementation of data generalization

Limitations:

Can handle only simple non-numeric data types for dimensions

Can handle only summarization of numeric data

Does not guide users on which dimensions to explore or which levels to reach
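As a rough illustration of cube-style summarization, the following pandas sketch (with made-up sales data) computes count/sum/average at a (city, month) level and then rolls up to city.

```python
# Cube-style summarization on an assumed sales fact table, then a roll-up.
import pandas as pd

sales = pd.DataFrame({
    "city":   ["Richmond", "Richmond", "Burnaby", "Burnaby"],
    "month":  ["Jan", "Feb", "Jan", "Feb"],
    "amount": [100, 150, 80, 120],
})

by_city_month = sales.groupby(["city", "month"])["amount"].agg(["count", "sum", "mean"])
by_city = sales.groupby("city")["amount"].agg(["count", "sum", "mean"])  # roll-up
print(by_city_month, by_city, sep="\n\n")
```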


21

Outline

Review terms

Characterization

Summarization

Hierarchical generalization

Attribute relevance analysis

Comparison/discrimination

Descriptive statistical measures

22

Attribute-Oriented Induction

Proposed in 1989 (KDD ‘89 workshop)

Not confined to categorical data nor particular measures.

How is it done?

Collect the task-relevant data (initial relation) using a relational database query

Perform generalization by attribute removal or attribute generalization.

Apply aggregation by merging identical, generalized tuples and accumulating their respective counts

Interactive presentation with users

23

Basic Elements

Data focusing: task-relevant data, including dimensions, and the result is the initial relation.

Attribute-removal and Attribute-generalization:

If attribute A has a large set of distinct values:

If there is no generalization operator on A, or A’s higher-level concepts are expressed in terms of other attributes (giving redundancy): remove A

If there exists a set of generalization operators on A: select an operator and generalize A

Generalization threshold controls:

Attribute generalization threshold: controls the number of distinct values an attribute may keep before it is generalized or removed (~ 2-8, specified or default)

Relation generalization threshold: controls the final relation/rule size (~ 10-30)

24

General Steps

InitialRel: Query processing of task-relevant data, deriving the initial relation.

PreGen: Based on the analysis of the number of distinct

values in each attribute, determine generalization plan for

each attribute: removal? or how high to generalize?

PrimeGen: Based on the PreGen plan, perform

generalization to the right level to derive a “prime

generalized relation”, accumulating the counts.

Presentation: User interaction: (1) adjust levels by

drilling, (2) pivoting, (3) mapping into rules, cross tabs,

visualization presentations.
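A minimal sketch of these steps on a made-up initial relation (attribute names, operators and data are illustrative, not the algorithm verbatim): each attribute with a generalization operator is generalized, attributes without one are removed, and identical generalized tuples are merged while their counts are accumulated.

```python
# Attribute-oriented induction: generalize, then merge identical tuples with counts.
from collections import Counter

initial_relation = [
    {"name": "Jim",   "major": "CS",      "birth_place": "Vancouver, BC, Canada", "gpa": 3.67},
    {"name": "Scott", "major": "CS",      "birth_place": "Montreal, Que, Canada", "gpa": 3.70},
    {"name": "Laura", "major": "Physics", "birth_place": "Seattle, WA, USA",      "gpa": 3.83},
]

generalizers = {                                  # attribute -> generalization operator
    "major": lambda m: "Science" if m in {"CS", "Physics"} else "Other",
    "birth_place": lambda p: p.split(",")[-1].strip(),       # generalize to country
    "gpa": lambda g: "Excellent" if g >= 3.75 else "Very-good",
}                                                 # "name" has no operator -> removed

prime_relation = Counter(
    tuple((attr, op(t[attr])) for attr, op in generalizers.items())
    for t in initial_relation
)
for generalized_tuple, count in prime_relation.items():
    print(dict(generalized_tuple), "count =", count)
```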


25

Example

DMQL: Describe general characteristics of graduate students in the Big-University database:

use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in “graduate”

Transform to the corresponding SQL statement:

Select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in {“Msc”, “MBA”, “PhD”}

26

Example (cont.)

Initial Relation (task-relevant data, first rows shown):

Name            Gender  Major    Birth_Place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83

Attribute handling: Name removed; Gender retained; Major generalized to {Sci, Eng, Bus}; Birth_Place generalized to country; Birth_date generalized to age range; Residence generalized to city; Phone # removed; GPA generalized to {Excl, VG, …}.

Prime Generalized Relation:

Gender  Major    Birth_country  Age_range  Residence  GPA        Count
M       Science  Canada         20-25      Richmond   Very-good  16
F       Science  Foreign        25-30      Burnaby    Excellent  22
…       …        …              …          …          …          …

Presentation (cross tabulation of Birth_Region by Gender):

Gender  Canada  Foreign  Total
M       16      14       30
F       10      22       32
Total   26      36       62
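A short pandas sketch (assumed data layout, not part of the slides) showing how such a cross tabulation could be produced from the count-weighted generalized relation.

```python
# Cross tabulation of Birth_Region by Gender from the generalized relation's counts.
import pandas as pd

prime = pd.DataFrame({
    "gender":       ["M", "M", "F", "F"],
    "birth_region": ["Canada", "Foreign", "Canada", "Foreign"],
    "count":        [16, 14, 10, 22],
})

crosstab = pd.pivot_table(prime, values="count", index="gender",
                          columns="birth_region", aggfunc="sum", margins=True)
print(crosstab)
```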

27

Presentation of results

Generalized relation: Relations where some or all attributes are generalized, with

counts or other aggregation values accumulated.

Cross tabulation: Mapping results into cross tabulation

Visualization techniques:

Pie charts, bar charts, curves, cubes, and other visual forms.

Quantitative characteristic rules: Mapping generalized result into characteristic rules with

quantitative information associated with it, e.g., t = typical

grad(x) ∧ male(x) ⇒ birth_region(x) = “Canada” [t: 53%] ∨ birth_region(x) = “foreign” [t: 47%]
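The t-weights in this rule follow directly from the male row of the cross tabulation on the previous slide; a tiny sketch of the computation:

```python
# t-weights for male graduates, from the crosstab counts above.
counts = {"Canada": 16, "foreign": 14}
total = sum(counts.values())                                   # 30
t_weights = {region: round(100 * c / total, 1) for region, c in counts.items()}
print(t_weights)   # {'Canada': 53.3, 'foreign': 46.7}
```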

28

Outline

Review terms

Characterization

Summarization

Hierarchical generalization

Attribute relevance analysis

Comparison/discrimination

Descriptive statistical measures


29

Analysis of Attribute Relevance

To filter out statistically irrelevant attributes or rank

attributes for mining

Irrelevant attributes → inaccurate or unnecessarily complex

patterns

An attribute is highly relevant for classifying/predicting a class, if it is

likely that its values can be used to distinguish the class from others

E.g., to describe cheap vs. expensive cars

Is “color” a relevant attribute?

What about using “color” to compare banana and apple?

30

Methods

Idea: Compute a measure that quantifies the relevance of an attribute with respect to a given class or concept

These measures can be:

Information gain

The Gini index

Uncertainty

Correlation coefficients

31

Example

Relevance measure: Information gain

Review formulae: For a set S of samples, each labeled with a class in C, where p_i is the probability that a sample in S belongs to class i:

Ent(S) = – Σ_{i ∈ C} p_i log2(p_i)

Expected information needed to classify a sample if S is partitioned into subsets S_i, one for each value i of attribute A:

I(A) = Σ_{i ∈ dom(A)} (|S_i| / |S|) · Ent(S_i)

Information gain: Gain(A) = Ent(S) – I(A)

32

Example (cont)

Gender Major Birth_country Age_range GPA Count

M Science Canada 20-25 Very-good 16

F Science Foreign 25-30 Excellent 22

M Eng Foreign …. 18

F Science Foreign 25

M Science Canada …… ….. 21

F Eng Canada 18

M Science Foreign 18

F Business Canada … ….. 20

M Business Canada 22

F Science Canada ….. …… 24

M Eng Foreign 22

F Eng Canada 24

How relevant is the attribute “Major” to the classification

of graduate vs. undergraduate students?

120 Graduates

130 Undergraduates

Dom(Major) = {Science, Eng, Business}

Partition the data into Sc, Eng, Bus, the sets of data points

whose “Major” is Science, Eng and Business, respectively


33

Example (cont.)

Gender Major Birth_country Age_range GPA Count

M Science Canada 20-25 Very-good 16

F Science Foreign 25-30 Excellent 22

M Eng Foreign …. 18

F Science Foreign 25

M Science Canada …… ….. 21

F Eng Canada 18

M Science Foreign 18

F Business Canada … ….. 20

M Business Canada 22

F Science Canada ….. …… 24

M Eng Foreign 22

F Eng Canada 24

120 Graduates: Science = 84 (= 16+22+25+21), Eng = 36, Business = 0

130 Undergraduates: Science = 42, Eng = 46, Business = 42

Ent(S) = –120/250 log2(120/250) – 130/250 log2(130/250) = 0.9988

Ent(Sc) = –84/126 log2(84/126) – 42/126 log2(42/126) = ….

I(Major) = 126/250 · Ent(Sc) + 82/250 · Ent(Eng) + 42/250 · Ent(Bus) = 0.7873

Gain(Major) = Ent(S) – I(Major) = 0.9988 – 0.7873 = 0.2115

Class information captured from S: Ent(S) = – Σ_{i ∈ C} p_i log2(p_i)

Expected class information induced by attribute “Major”: I(Major) = Σ_{i ∈ dom(Major)} (|S_i| / |S|) · Ent(S_i)

Ent(Eng) = –36/82 log2(36/82) – 46/82 log2(46/82) = ….

Ent(Bus) = –0/42 log2(0/42) – 42/42 log2(42/42) = ….
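For reference, a short sketch that reproduces the numbers above from the per-Major class counts given on this slide (it also evaluates the entropy terms left as “….”):

```python
# Entropy and information gain for attribute "Major" (graduate vs. undergraduate).
from math import log2

def ent(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

per_major = {"Science": (84, 42), "Eng": (36, 46), "Business": (0, 42)}
n = sum(sum(c) for c in per_major.values())                       # 250 students

ent_S = ent((120, 130))                                            # 0.9988
i_major = sum(sum(c) / n * ent(c) for c in per_major.values())     # 0.7873
gain_major = ent_S - i_major                                       # 0.2115
print(round(ent_S, 4), round(i_major, 4), round(gain_major, 4))
```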

34

Example (cont.)

Gender Major Birth_country Age_range GPA Count

M Science Canada 20-25 Very-good 16

F Science Foreign 25-30 Excellent 22

M Eng Foreign …. 18

F Science Foreign 25

M Science Canada …… ….. 21

F Eng Canada 18

M Science Foreign 18

F Business Canada … ….. 20

M Business Canada 22

F Science Canada ….. …… 24

M Eng Foreign 22

F Eng Canada 24

Gain(Major) = Ent(S) – I(Major) = 0.9988 – 0.7873 = 0.2115

Similarly, find

Gain(gender), Gain(Birth_country), Gain(Age_range), Gain(GPA)

• We can rank “importance” or degree of “relevance” by Gain values

• We can use a threshold to prune out attributes that are less “relevant”

120 Graduates: Science = 84 (= 16+22+25+21), Eng = 36, Business = 0

130 Undergraduates: Science = 42, Eng = 46, Business = 42

35

Outline

Review terms

Characterization

Summarization

Hierarchical generalization

Attribute relevance analysis

Comparison/discrimination

Descriptive statistical measures

36

Class comparison

Goal: mine properties (or rules) to compare a target class with a contrasting class

The two classes must be comparable

E.g., address and gender are not comparable

store_address and home_address are comparable

CS students and Eng students are comparable

Comparable classes should be generalized to the same conceptual level

Approaches:

Use attribute-oriented induction or a data cube to generalize data for the two contrasting classes and then compare the results

Pattern recognition approach: approximate discriminating rules from a data set and repeatedly fine-tune them until errors are small enough


37

Outline

Review terms

Characterization

Summarization

Hierarchical generalization

Attribute relevance analysis

Comparison/discrimination

Descriptive statistical measures

38

Descriptive statistical measures

Data characteristics that can be computed:

Central tendency

mean (when is the “mean” not an appropriate measure?)

median (for a very large data set, how do we compute the median?)

Dispersion

five-number summary: Min, Quartile1, Median, Quartile3, Max

variance, standard deviation (spread about the mean; what does var = 0 mean?)

Outliers

Detected by a rule of thumb: values falling at least 1.5 × (Q3 – Q1) above Q3 or below Q1 (see the sketch below)

Useful displays

Boxplots, quantile-quantile plot (q-q plot), scatter plot, loess curve
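A small numpy sketch (with made-up data) of the measures listed above: the five-number summary, variance/standard deviation, and the 1.5 × IQR rule of thumb for outliers.

```python
# Five-number summary, dispersion measures, and the 1.5 * IQR outlier rule.
import numpy as np

x = np.array([2, 4, 4, 5, 6, 7, 8, 9, 30])            # 30 is an outlier
q1, med, q3 = np.percentile(x, [25, 50, 75])
print("five-number summary:", x.min(), q1, med, q3, x.max())
print("mean:", x.mean(), "variance:", x.var(ddof=1), "std:", x.std(ddof=1))

iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print("outliers:", outliers)                            # [30]
```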

39

References

E. Baralis and G. Psaila. Designing templates for mining association rules. Journal of Intelligent Information Systems, 9:7-32, 1997.

Microsoft Corp., OLEDB for Data Mining, version 1.0, http://www.microsoft.com/data/oledb/dm, Aug. 2000.

J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane, “DMQL: A Data Mining Query Language for Relational Databases”, DMKD'96, Montreal, Canada, June 1996.

T. Imielinski and A. Virmani. MSQL: A query language for database mining. Data Mining and Knowledge Discovery, 3:373-408, 1999.

M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM’94, Gaithersburg, Maryland, Nov. 1994.

R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, pages 122-133, Bombay, India, Sept. 1996.

A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Trans. on Knowledge and Data Engineering, 8:970-974, Dec. 1996.

S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98, Seattle, Washington, June 1998.

D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, Seattle, Washington, June 1998.