CS 5331 by Rattikorn Hewett
Texas Tech University
Data Mining
Primitives
Outline
Motivation
Data mining primitives
Data mining query languages
Designing GUI for data mining systems
Architectures
Motivations: Why primitives?
Data mining systems uncover a large set of patterns –
not all are interesting
Data mining should be an interactive process
User directs what to be mined
Users need data mining primitives to communicate with
the data mining system by incorporating them in a data
mining query language
Benefits:
More flexible user interaction
Foundation for the design of graphical user interfaces
Standardization of data mining industry and practice
Data mining primitives
Data mining tasks can be specified in the form of data
mining queries by five data mining primitives:
Task-relevant data (input)
The kinds of knowledge to be mined (function & output)
Background knowledge (interpretation)
Interestingness measures (evaluation)
Visualization of the discovered patterns (presentation)
Task-relevant data
Specify data to be mined
Database, data warehouse, relation, cube
Condition for selection & grouping
Relevant attributes
Knowledge to be mined
Specify data mining “functions”:
Characterization/discrimination
Association
Classification/prediction
Clustering
Background Knowledge
Typically, in the form of concept hierarchies:
Schema hierarchy
E.g., street < city < state < country
Set-grouping hierarchy
E.g., {low, high} ⊂ all, {30..49} ⊂ low, {50..100} ⊂ high
Operation-derived hierarchy
E.g., email address: [email protected]
login-name < department < university < organization
Rule-based hierarchy
E.g., if 87 ≤ temperature < 90 then normal_temperature
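The hierarchy styles above can be made concrete with a small sketch. This is illustrative only: the value ranges mirror the set-grouping example on this slide, and the email address and level names are hypothetical, not from any real data mining system.

```python
# Sketch of two concept-hierarchy styles (illustrative assumptions).

def set_grouping_level(value):
    """Set-grouping hierarchy: {low, high} below 'all'."""
    if 30 <= value <= 49:
        return "low"
    if 50 <= value <= 100:
        return "high"
    return "all"  # fallback: top of the hierarchy

def operation_derived_levels(email):
    """Operation-derived hierarchy from an email address:
    login-name < department < university < organization."""
    login, domain = email.split("@")
    parts = domain.split(".")  # e.g. ['dept', 'univ', 'org']
    return {
        "login-name": login,
        "department": parts[0],
        "university": parts[1],
        "organization": parts[-1],
    }

print(set_grouping_level(35))                       # low
print(operation_derived_levels("jdoe@cs.ttu.edu"))  # hypothetical address
```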
Interestingness
Objective measures:
Simplicity: simpler rules are easier to understand and more likely to be interesting
(association) rule length, (decision) tree size
Certainty: validity of the rule. Rule A ⇒ B has confidence P(B|A) = #(A and B) / #(A)
classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
Utility: potential usefulness. Rule A ⇒ B has support #(A and B) / sample size
noise threshold (description)
Novelty: not previously known, surprising (used to remove redundant rules)
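The support and confidence measures above are easy to compute directly. A minimal sketch over a toy transaction set (the items and transactions are hypothetical, made up for illustration):

```python
# Objective interestingness measures for a rule A => B over toy transactions.

transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
]

def support(A, B):
    """Utility: fraction of transactions containing both A and B."""
    both = sum(1 for t in transactions if A <= t and B <= t)
    return both / len(transactions)

def confidence(A, B):
    """Certainty: P(B|A) = #(A and B) / #(A)."""
    n_a = sum(1 for t in transactions if A <= t)
    both = sum(1 for t in transactions if A <= t and B <= t)
    return both / n_a

print(support({"milk"}, {"bread"}))     # 2/4 = 0.5
print(confidence({"milk"}, {"bread"}))  # 2/3
```

Note the asymmetry: confidence divides by #(A), the antecedent count, which is why A ⇒ B and B ⇒ A generally have different confidence but the same support.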
Visualization of Discovered Patterns
Specify the form to view the patterns
E.g., rules, tables, chart, decision trees, cubes, reports etc.
Specify operations for data exploration in
multiple levels of abstraction
E.g., drill-down, roll-up etc.
DMQL (Data Mining Query Language)
A DMQL can provide the ability to support ad-hoc and
interactive data mining
By providing a standardized language
Hope to achieve an effect similar to that of SQL on
relational databases
Foundation for system development and evolution
Facilitate information exchange, technology transfer,
commercialization and wide acceptance
DMQL is designed with the primitives described earlier
Languages & Standardization Efforts
Association rule language specifications:
MSQL (Imielinski & Virmani ’99)
MineRule (Meo, Psaila & Ceri ’96)
Query flocks based on Datalog syntax (Tsur et al. ’98)
OLEDB for DM (Microsoft 2000)
Based on OLE, OLE DB, OLE DB for OLAP
Integrating DBMS, data warehouse and data mining
CRISP-DM (CRoss-Industry Standard Process for Data Mining)
Providing a platform and process structure for effective data mining
Emphasizes deploying data mining technology to solve business problems
Designing GUI based on DMQL
What tasks should be considered in the design of
GUIs based on a data mining query language?
Data collection and data mining query composition
Presentation of discovered patterns
Hierarchy specification and manipulation
Manipulation of data mining primitives
Interactive multilevel mining
Other information
Architectures
Coupling data mining system with DB/DW system
No coupling - Flat file processing, not recommended
Loose coupling - Fetching data from DB/DW
Semi-tight coupling - Enhanced DM performance
Provide efficient implementation of a few data mining primitives in
a DB/DW system, e.g., sorting, indexing, aggregation, histogram
analysis, multiway join, precomputation of some stat functions
Tight coupling - A uniform information processing environment
DM is smoothly integrated into the DB/DW system; mining queries are
optimized using mining query analysis, indexing, query processing
methods, etc.
Concept Description
Outline
Review terms
Characterization
Summarization
Hierarchical generalization
Attribute relevance analysis
Comparison/discrimination
Descriptive statistical measures
Review terms
Descriptive vs. predictive data mining
Descriptive: describes the data set in concise,
summarative, informative, discriminative forms
Predictive: constructs models representing the data set,
and uses them to predict behaviors of unknown data
Concept description: involves
Characterization: provides a concise and succinct
summarization of the given collection of data
Comparison (discrimination): provides descriptions
comparing two or more collections of data
Concept Description vs. OLAP
Concept description:
can handle complex data types (e.g., text,
image) of the attributes and their aggregations
a more automated process
OLAP:
restricted to a small number of dimension and
measure data types
user-controlled process
Outline
Review terms
Characterization
Summarization
Hierarchical generalization
Attribute relevance analysis
Comparison/discrimination
Descriptive statistical measures
Characterization methods
One approach to characterization is to transform data from
low conceptual levels to higher ones: “data generalization”
E.g., daily sales → annual sales
Biology → Science
Two methods:
Summarization – as in a data cube’s OLAP
Hierarchical generalization – attribute-oriented induction
Summarization by OLAP
Data are stored in data cubes
Identify summarization computations, e.g., count(), sum(), average(), max()
Perform computations and store results in data cubes
Generalization and specialization can be performed on a data cube by roll-up and drill-down
An efficient implementation of data generalization
Limitations:
Can handle only simple, non-numeric data types for dimensions
Can handle only summarization of numeric data
Does not guide users on which dimensions to explore or which levels to reach
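Roll-up, the generalization operation above, amounts to mapping each dimension value to a higher level and re-aggregating the measure. A minimal sketch in plain Python, with a hypothetical day/city sales fact table (a real system would do this inside the cube engine):

```python
# Sketch of roll-up (generalization) on a tiny fact table.
from collections import defaultdict

# (day, city, sales) facts - hypothetical data
facts = [
    ("2024-01-01", "Lubbock", 100),
    ("2024-01-02", "Lubbock", 150),
    ("2024-01-01", "Austin",  200),
]

def roll_up_to_month(rows):
    """Generalize day -> month along the time dimension, summing sales."""
    out = defaultdict(int)
    for day, city, sales in rows:
        month = day[:7]  # '2024-01-01' -> '2024-01'
        out[(month, city)] += sales
    return dict(out)

print(roll_up_to_month(facts))
# {('2024-01', 'Lubbock'): 250, ('2024-01', 'Austin'): 200}
```

Drill-down is the inverse: returning to the finer-grained facts, which is why cube systems must keep (or be able to recompute) the lower-level data.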
Outline
Review terms
Characterization
Summarization
Hierarchical generalization
Attribute relevance analysis
Comparison/discrimination
Descriptive statistical measures
Attribute-Oriented Induction
Proposed in 1989 (KDD ‘89 workshop)
Not confined to categorical data or particular measures.
How is it done?
Collect the task-relevant data (initial relation) using a relational database query
Perform generalization by attribute removal or attribute generalization.
Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
Interactive presentation with users
Basic Elements
Data focusing: task-relevant data, including dimensions, and the result is the initial relation.
Attribute-removal and Attribute-generalization:
Attribute A has a large set of distinct values
If there is no generalization operator on A, or
A’s higher level concepts are expressed in terms of other attributes (giving redundancy)
Remove A
If there exists a set of generalization operators on A
Select an operator to generalize A
Generalization thresholds control the process:
Attribute generalization threshold: limits the number of distinct values an attribute
may keep (~2-8, specified or default); if exceeded, generalize further or remove
Relation generalization threshold: controls the final relation/rule size (~10-30).
General Steps
InitialRel: Query processing of task-relevant data, deriving the initial relation.
PreGen: Based on the analysis of the number of distinct
values in each attribute, determine generalization plan for
each attribute: removal? or how high to generalize?
PrimeGen: Based on the PreGen plan, perform
generalization to the right level to derive a “prime
generalized relation”, accumulating the counts.
Presentation: User interaction: (1) adjust levels by
drilling, (2) pivoting, (3) mapping into rules, cross tabs,
visualization presentations.
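The steps above can be sketched in a few lines. This is a simplified illustration, not the full algorithm: the concept hierarchies, threshold, and sample tuples are hypothetical, and a real implementation would generalize iteratively level by level.

```python
# Sketch of attribute-oriented induction: generalize attributes via concept
# hierarchies, remove attributes with many distinct values and no hierarchy,
# then merge identical generalized tuples, accumulating counts.
from collections import Counter

hierarchy = {  # attribute -> one-level generalization map (illustrative)
    "major": {"CS": "Science", "Physics": "Science", "EE": "Eng"},
    "birth_place": {"Vancouver": "Canada", "Montreal": "Canada",
                    "Seattle": "Foreign"},
}
THRESHOLD = 3  # max distinct values an attribute may keep

def aoi(rows, attrs):
    generalized = []
    for row in rows:
        new = []
        for a, v in zip(attrs, row):
            if a in hierarchy:
                new.append(hierarchy[a].get(v, "other"))  # attribute generalization
            elif len({r[attrs.index(a)] for r in rows}) > THRESHOLD:
                continue  # attribute removal: no generalization operator
            else:
                new.append(v)
        generalized.append(tuple(new))
    return Counter(generalized)  # merged tuples with accumulated counts

rows = [("Jim", "CS", "Vancouver"), ("Scott", "CS", "Montreal"),
        ("Laura", "Physics", "Seattle"), ("Ann", "CS", "Vancouver")]
print(aoi(rows, ("name", "major", "birth_place")))
# Counter({('Science', 'Canada'): 3, ('Science', 'Foreign'): 1})
```

Here "name" has four distinct values and no hierarchy, so it is removed; "major" and "birth_place" are generalized, and the four tuples collapse into two with counts, just as in the prime generalized relation of the next slide.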
Example
DMQL: Describe general characteristics of graduate students in the Big-University database

use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in “graduate”

Transform to the corresponding SQL statement:

Select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in {“Msc”, “MBA”, “PhD”}
Example (cont.)
Initial Relation:

Name | Gender | Major | Birth_Place | Birth_date | Residence | Phone # | GPA
Jim Woodman | M | CS | Vancouver, BC, Canada | 8-12-76 | 3511 Main St., Richmond | 687-4598 | 3.67
Scott Lachance | M | CS | Montreal, Que, Canada | 28-7-75 | 345 1st Ave., Richmond | 253-9106 | 3.70
Laura Lee | F | Physics | Seattle, WA, USA | 25-8-70 | 125 Austin Ave., Burnaby | 420-5232 | 3.83
… | … | … | … | … | … | … | …

Generalization plan: Name removed; Gender retained; Major generalized to {Sci, Eng, Bus}; Birth_Place generalized to country; Birth_date generalized to age range; Residence generalized to city; Phone # removed; GPA generalized to {Excl, VG, …}

Prime Generalized Relation:

Gender | Major | Birth_country | Age_range | Residence | GPA | Count
M | Science | Canada | 20-25 | Richmond | Very-good | 16
F | Science | Foreign | 25-30 | Burnaby | Excellent | 22
… | … | … | … | … | … | …

Presentation (cross tabulation of Gender × Birth_Region):

Gender | Canada | Foreign | Total
M | 16 | 14 | 30
F | 10 | 22 | 32
Total | 26 | 36 | 62
Presentation of results
Generalized relation: Relations where some or all attributes are generalized, with
counts or other aggregation values accumulated.
Cross tabulation: Mapping results into cross tabulation
Visualization techniques:
Pie charts, bar charts, curves, cubes, and other visual forms.
Quantitative characteristic rules: Mapping generalized result into characteristic rules with
quantitative information associated with it, e.g., t = typical
grad(x) ∧ male(x) ⇒ birth_region(x) = “Canada” [t: 53%] ∨ birth_region(x) = “foreign” [t: 47%]
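The t (typical) weights in a quantitative characteristic rule are just row percentages from the cross tabulation on the earlier Example slide (male graduates: 16 born in Canada, 14 foreign, 30 total). A quick check:

```python
# t-weights as row percentages of the male-graduate crosstab row.
canada, foreign = 16, 14        # counts from the cross tabulation
total = canada + foreign        # 30 male graduates
t_canada = round(100 * canada / total)    # 53
t_foreign = round(100 * foreign / total)  # 47
print(t_canada, t_foreign)  # 53 47
```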
Outline
Review terms
Characterization
Summarization
Hierarchical generalization
Attribute relevance analysis
Comparison/discrimination
Descriptive statistical measures
Analysis of Attribute Relevance
To filter out statistically irrelevant attributes or rank
attributes for mining
Irrelevant attributes → inaccurate or unnecessarily complex
patterns
An attribute is highly relevant for classifying/predicting a class, if it is
likely that its values can be used to distinguish the class from others
E.g., to describe cheap vs. expensive cars
Is “color” a relevant attribute?
What about using “color” to compare banana and apple?
Methods
Idea: Compute a measure that quantifies the relevance of an attribute with respect to a given class or concept
These measures can be:
Information gain
The Gini index
Uncertainty
Correlation coefficients
Example
Relevance measure: Information gain
Review formulae: For a data set S, in which each sample is labeled with a class
in C, and p_i is the probability that a sample in S belongs to class i:

Ent(S) = − Σ_{i ∈ C} p_i log₂ p_i

Expected information needed to classify a sample if S is partitioned into
subsets S_i, one for each value i of attribute A:

I(A) = Σ_{i ∈ dom(A)} (|S_i| / |S|) · Ent(S_i)

Information gain: Gain(A) = Ent(S) − I(A)
Example (cont.)

Gender | Major | Birth_country | Age_range | GPA | Count
M | Science | Canada | 20-25 | Very-good | 16
F | Science | Foreign | 25-30 | Excellent | 22
M | Eng | Foreign | … | … | 18
F | Science | Foreign | … | … | 25
M | Science | Canada | … | … | 21
F | Eng | Canada | … | … | 18
M | Science | Foreign | … | … | 18
F | Business | Canada | … | … | 20
M | Business | Canada | … | … | 22
F | Science | Canada | … | … | 24
M | Eng | Foreign | … | … | 22
F | Eng | Canada | … | … | 24

How relevant is the attribute “Major” to the classification of
graduate vs. undergraduate students?
120 Graduates
130 Undergraduates
Dom(Major) = {Science, Eng, Business}
Partition the data into S_Sc, S_Eng, S_Bus, representing the sets of data points
whose “Major” is Science, Eng and Business, respectively
Example (cont.)

120 Graduates: Science = 84 (= 16+22+25+21), Eng = 36, Business = 0
130 Undergraduates: Science = 42, Eng = 46, Business = 42

Class information captured from S:
Ent(S) = −(120/250) log₂(120/250) − (130/250) log₂(130/250) = 0.9988

Ent(S_Sc) = −(84/126) log₂(84/126) − (42/126) log₂(42/126) = 0.9183
Ent(S_Eng) = −(36/82) log₂(36/82) − (46/82) log₂(46/82) = 0.9892
Ent(S_Bus) = −(0/42) log₂(0/42) − (42/42) log₂(42/42) = 0

Expected class information induced by attribute “Major”:
I(Major) = (126/250) Ent(S_Sc) + (82/250) Ent(S_Eng) + (42/250) Ent(S_Bus) = 0.7873

Gain(Major) = Ent(S) − I(Major) = 0.9988 − 0.7873 = 0.2115
Example (cont.)

Gain(Major) = Ent(S) − I(Major) = 0.9988 − 0.7873 = 0.2115

Similarly, find Gain(Gender), Gain(Birth_country), Gain(Age_range), Gain(GPA)

We can rank “importance” or degree of “relevance” by Gain values
We can use a threshold to prune out attributes that are less “relevant”
Outline
Review terms
Characterization
Summarization
Hierarchical generalization
Attribute relevance analysis
Comparison/discrimination
Descriptive statistical measures
Class comparison
Goal: mine properties (or rules) to compare a target class with a contrasting class
The two classes must be comparable
E.g., address and gender are not comparable
store_address and home_address are comparable
CS students and Eng students are comparable
Comparable classes should be generalized to the same conceptual level
Approaches:
Use attribute-oriented induction or a data cube to generalize data
for the two contrasting classes, then compare the results
Pattern recognition approach: approximate discriminating rules
from a data set, repeatedly fine-tuning until errors are small enough
Outline
Review terms
Characterization
Summarization
Hierarchical generalization
Attribute relevance analysis
Comparison/discrimination
Descriptive statistical measures
Descriptive statistical measures
Data characteristics that can be computed:
Central tendency: mean, median
When is “mean” not an appropriate measure?
For a very large data set, how do we compute the median?
Dispersion: five-number summary (Min, Quartile1, Median, Quartile3, Max);
variance, standard deviation (spread about the mean; what does var = 0 mean?)
Outliers: detected by rules of thumb, e.g., values falling at
least 1.5 × (Q3 − Q1) above Q3 or below Q1
Useful displays: boxplots, quantile-quantile plot (q-q plot), scatter plot, loess curve
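The five-number summary and the 1.5 × IQR outlier rule can be sketched directly. The data are made up, and the quartile convention (median of each half) is one of several in use:

```python
# Five-number summary and 1.5*IQR outlier rule on toy data.
import statistics

data = sorted([2, 4, 4, 5, 7, 8, 9, 11, 12, 40])

def quartiles(xs):
    """Q1/Q3 as medians of the lower/upper halves (one common convention)."""
    half = len(xs) // 2
    q1 = statistics.median(xs[:half])
    q3 = statistics.median(xs[-half:])
    return q1, statistics.median(xs), q3

q1, med, q3 = quartiles(data)
iqr = q3 - q1
# Outliers: beyond 1.5 * (Q3 - Q1) below Q1 or above Q3
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(min(data), q1, med, q3, max(data))  # five-number summary
print(outliers)
```

On this data the summary is (2, 4, 7.5, 11, 40) and the value 40 falls above Q3 + 1.5 × IQR, so it is flagged as an outlier, which is exactly what a boxplot whisker would show.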
References
E. Baralis and G. Psaila. Designing templates for mining association rules. Journal of Intelligent Information Systems, 9:7-32, 1997.
Microsoft Corp., OLEDB for Data Mining, version 1.0, http://www.microsoft.com/data/oledb/dm, Aug. 2000.
J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane, “DMQL: A Data Mining Query Language for Relational Databases”, DMKD'96, Montreal, Canada, June 1996.
T. Imielinski and A. Virmani. MSQL: A query language for database mining. Data Mining and Knowledge Discovery, 3:373-408, 1999.
M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM’94, Gaithersburg, Maryland, Nov. 1994.
R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, pages 122-133, Bombay, India, Sept. 1996.
A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Trans. on Knowledge and Data Engineering, 8:970-974, Dec. 1996.
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98, Seattle, Washington, June 1998.
D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, Seattle, Washington, June 1998.