Data mining techniques
Business Intelligence Technologies – Data Mining
Lecture 1 Introduction
What is covered in this course
  Theories/Methods
    Data mining cycle/process/methodology, evaluation
    Association rules, decision trees, clustering, nearest neighbor, neural networks, link analysis, Web mining, etc.
  Applications
    Market basket analysis, customer segmentation, CRM, personalization, financial analysis, etc.
  Business Cases
  Hands-on Experience
    SAS – Enterprise Miner
Course Objectives
  Understand data mining theories
  Learn popular data mining methods
  Enable you to solve special business applications
  Master a data mining package
Course Logistics
  Qing Li
  TA: Jia Wang [email protected]
  Office hours:
    Walk-in
    By appointment
    Before and after class
    Call me
Class Resources
  Class homepage: http://liqing.cai.swufe.edu.cn/
    Post slides, announcements, downloads
  Text Book + Cases + Handouts
Text Book
  Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, Second Edition
  Michael Berry and Gordon Linoff, 2004, Wiley, ISBN 0471-470643
Class Schedule
  Topic
  1 Course Overview, Intro to Data Mining
  2 Market Basket Analysis & Association Rules, CRM
  3 Market Segmentation & Clustering, Prepare data
  4 Prediction & Classification – Decision Tree
  5 Personalization & Nearest Neighbor
  6 Financial Forecasting & Neural Networks
  7 Link Analysis & Web mining
  8 Misc. Topics
  9 Guest Speaker
  10 Term project presentations
Group Term Project
  Group of 2-3 (3 is better).
    Due one week from now
  Identify a company to study
    Focus: Data and Business Intelligence
      Current practice
      Your recommendations
  Two phases
    Phase 1: Identify the company and brief description (Due 3 weeks from now)
    Phase 2: Final report + class presentation
Software
  SAS – Enterprise Miner
    Used for homework assignments
    Needs Windows XP Professional or Mac OS9
    I’ll demo SAS in most classes.
    Tutorial available on the course website
    Every student is encouraged to have a copy in order to follow the class demos.
  Alternative for Vista users - WEKA
Grading
  15% Participation
    3: Excellent
    2: Good
    1: OK
    0: Absent with good reason and advance notification
    -3: Absent with no reason
  50% Homework
    2 big assignments
    Problem solving, data analysis and/or case discussion.
    25% each
  35% Term Project
    Phase 1 report --- 5%
    Final report --- 20%
    Class presentation --- 5%
    Peer evaluation --- 5%
  (No Curve)
Misc. Issues
  Slides are available before class
    Download or print them before class
  Lectures may be different from the text book
    Some materials in the lectures may not be in the book, so please pay attention in class
    The book is a great reference book, not a bible
  Finish assigned case readings before each class
  Attendance is required
Case 1: Bank of America
  Discussion Questions:
  1. What is BoA trying to achieve?
  2. What are the alternative solutions? Pros and cons of each?
  3. What are the stages of data mining? Describe each.
  4. What are the data mining techniques used, and what are the findings from each technique?
Case 2: A Wireless Company
  Discussion Questions:
  1. What is the company trying to achieve?
  2. How can data mining help?
  3. Where did the data come from, and how are the data processed?
  4. How is the data mining approach evaluated?
Case 3: SUV
  Discussion Questions:
  1. What is the company trying to achieve?
  2. How can data mining help?
  3. What data files are used? What information is contained in these files?
  4. How are the two data mining techniques combined, and why is the combination more powerful?
What is data mining?
  Informal definition: Finding patterns in data
  More formal definition: Non-trivial process of identifying valid, novel, potentially useful, and understandable patterns in data
  Business Intelligence: a process for increasing the competitive advantage of a business by intelligent use of available data in decision making. (one definition)
What is a pattern?
  Informal definition: Any structure that can be found in the data, e.g.
    People with good credit ratings have fewer accidents
    Risk = 0.93*prior_default + 0.23*num_cards - 1.3*employed
    On Friday nights male customers who buy diapers also tend to buy beer
  Not every pattern is desirable
    People with high income buy expensive cars
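The risk formula on this slide is a linear scoring pattern. As a minimal sketch of how such a pattern would be applied, assume prior_default and employed are 0/1 indicators and num_cards is a count (the slide does not define the variables, and the two applicants below are made up for illustration):

```python
def risk_score(prior_default, num_cards, employed):
    """Linear risk pattern from the slide; the variable encodings are assumed."""
    return 0.93 * prior_default + 0.23 * num_cards - 1.3 * employed

# Hypothetical applicants, for illustration only.
applicants = {
    "A": {"prior_default": 1, "num_cards": 4, "employed": 0},
    "B": {"prior_default": 0, "num_cards": 2, "employed": 1},
}

scores = {name: risk_score(**attrs) for name, attrs in applicants.items()}

# List applicants from highest to lowest risk.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

A higher score means higher risk under this pattern: a prior default and more cards push the score up, being employed pulls it down.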
Why Data Mining?
  Wherever there is data, there is and should be data mining!
  Because data mining affects virtually all data-intensive industries
  Marketing
    Which customers are likely to respond to this campaign?
    What other products or services should be offered to a customer? (cross-selling)
    What types of customers are loyal?
  Telecommunications
    Which customers will switch to competitors?
    Which calls are fraudulent?
  Finance and Insurance
    What types of customers have high credit risks / insurance risks?
    What interest rate or insurance premium should be given to different customers?
    Which stocks are likely to perform well in the next 3 months?
  Healthcare
    Which patients may take longer to recover?
    What is the likely cause of an illness?
  Retail
    Which products do customers buy together (or in sequence)?
  Customer Support
    Which customer service representative should be assigned to a task?
    When a customer calls, the customer representative’s screen shows exactly where to lead the conversation.
Why Data Mining? – Some Real Examples
  Safeway:
    Shopper cards capture point-of-sale data and personal information.
    Arrange products on shelves: Beer & Diaper
    Sell names to suppliers so that manufacturer coupons can be targeted.
  Pfizer pharmaceuticals:
    Construct a predictive model which tells patients their cholesterol risk score.
    High risk patients can request Lipitor, Pfizer’s cholesterol medication.
  Fidelity:
    Cross selling, when a customer calls, know what other services to offer
    Build models to figure out what makes a loyal customer
    These models saved a marginally profitable bill-paying service
  Amazon:
    Recommendations
  Capital One:
    What terms should be offered to different customers?
    The lowest loan loss rates in the industry
Why Data Mining Now?
  Drivers of DM:
    Better and cheaper computing power
    Mature data mining technology
    Improved data collection & storage
  Plus:
    Data is being produced at a tremendous speed.
    Competitive pressures are enormous
Descriptive vs. Predictive Data Mining
  Descriptive DM is used to learn about and understand the data.
    What items are purchased together?
    Identify and describe groups of customers with common buying behavior
  Predictive DM aims to build models in order to predict unknown values of interest.
    A model that given a customer’s characteristics predicts how much the customer will spend on the next catalog order.
    Predicting which customers are likely to leave
    Which direction is Stock X going to move tomorrow?
  Most predictive models are also descriptive
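To make the descriptive question "What items are purchased together?" concrete, here is a minimal sketch that counts co-occurring items and computes support and confidence for one pair. The baskets are invented for illustration, and the helper names support and confidence are choices made for this sketch, not something defined in the slides:

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, for illustration only.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    # Sort so each unordered pair is always stored under the same key.
    pair_counts.update(combinations(sorted(basket), 2))

def support(a, b):
    """Fraction of all baskets that contain both items."""
    return pair_counts[tuple(sorted((a, b)))] / len(baskets)

def confidence(a, b):
    """Of the baskets containing a, the fraction that also contain b."""
    return pair_counts[tuple(sorted((a, b)))] / item_counts[a]

print(f"support(diapers, beer)     = {support('diapers', 'beer'):.2f}")
print(f"confidence(diapers -> beer) = {confidence('diapers', 'beer'):.2f}")
```

This only describes the data at hand. Turning it into a predictive model would mean using such patterns to score baskets or customers that have not been seen yet.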
Data Mining Software
  Big Names:
    IBM Intelligent Miner
    SPSS Clementine
    Microsoft SQL Server 2000 Analysis Services
    Oracle 9i Data Mining
    SAS Enterprise Miner
  Smaller Companies:
    ANGOSS KnowledgeStudio
    XLMiner
    Megaputer PolyAnalyst
    DBMiner
  Free or Open Source:
    Weka
    Lots of free programs on the Internet supporting individual data mining techniques.
  A good portal for data mining related stuff: http://www.kdnuggets.com
Virtuous Cycle of Data Mining
  1, Identify the business problem
  2, Mining data to transform the data into actionable information
  3, Acting on the information
  4, Measuring the results
  Finding patterns is not enough
  Must respond to the patterns by taking action
  Turning:
    Data into Information
    Information into Action
    Action into Value
1, Identify the Business Opportunity
  Many business processes are good candidates:
    New product introduction
    Direct marketing campaign
    Understanding customer attrition/churn
    Evaluating the results of a test market
  Or more specific problems
    What types of customers responded to our last campaign?
    Where do the best customers live?
    Are long waits in check-out lines a cause of customer attrition?
    What products should be promoted with our XYZ product?
  TIP: When talking with business users about data mining opportunities, make sure you focus on the business problems/opportunities and not on technology and algorithms.
  Another goal of this course is for you to think strategically about what business opportunities can be addressed by data mining techniques.
2, Mining the Data to Transform it into Actionable Information
  Success is making business sense of the data
    Need to figure out the specific data mining tasks used to address the business opportunities identified in the first step.
  Deal with messy data
    Don’t expect clean data. Data cleaning accounts for 70% of the effort
  Implementation problems:
    What techniques to use?
    How to use the techniques?
    Selecting the right model
  Other problems
    Data privacy issues
3, Take Action
  Taking action is the whole purpose of data mining
    Now with discovered patterns (from mining data), we can make better-informed decisions.
  Examples
    Contact targeted customers
    Prioritizing customer service
      Cingular and AT&T were fined $1.5 million on Sept. 10, 2004 for discriminating in their services based on customers’ credit ratings.
    Adjusting inventory levels
    Rearrange products on the shelves
    Verizon sends out 40k mails to selected customers per month
4, Measuring Results
  Assess the impact of the action taken
  Often overlooked, ignored, skipped
  Planning for the measurement should begin when analyzing the business opportunity, not after it is “all over”
  Assessment questions (examples):
    Did this campaign do what we hoped?
    Did some offers work better than others?
    Lower cost, increase profit?
    Tons of others…
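One simple way to answer "Did some offers work better than others?" is to compare response rates against a baseline. The sketch below assumes a held-out control group that received no offer, which is one common setup rather than something prescribed in this lecture. All figures are hypothetical:

```python
# Hypothetical campaign results, for illustration only.
results = {
    "control (no offer)": {"contacted": 1000, "responded": 20},
    "offer A": {"contacted": 1000, "responded": 35},
    "offer B": {"contacted": 1000, "responded": 50},
}

def response_rate(group):
    """Share of contacted customers who responded."""
    return group["responded"] / group["contacted"]

baseline = response_rate(results["control (no offer)"])

# Lift = how many times better a group did than the control group.
lift = {name: response_rate(group) / baseline for name, group in results.items()}

for name, value in lift.items():
    print(f"{name:20s} rate={response_rate(results[name]):.1%}  lift={value:.2f}")
```

Deciding on the control group and the metric before the campaign runs is what "planning for the measurement" amounts to in practice.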
Data Mining General Guidelines
  The DM virtuous cycle (4 steps) is iterative
  No steps should be skipped
    Common sense prevails with respect to how rigorously each step is carried out
  The 4 steps of the virtuous cycle expand to become an 11-step methodology --- more rigorous
Detailed Data Mining Process – 11 Steps
  1, Translate the business problem into a data mining problem
  2, Select appropriate data
  3, Get to know the data
  4, Create a model set
  5, Fix problems with the data
  6, Transform data to bring information to the surface
  7, Build models
  8, Assess models
  9, Deploy models
  10, Assess results
  11, Begin again
Step 1: Transforming Business Problems into DM Tasks
  Business problems can often be big and vague
  Data mining tasks need to be more concrete
  Sample business problems:
    How to improve response to a direct marketing campaign?
    Which ads to place on web pages in order to improve click-through rate?
  How do we transform these into DM tasks?
Step 2-6: Data Preparation
  Get data
    Different (heterogeneous) sources
    Need to collect additional data?
    Credit card charge records, point-of-sale, web logs, etc.
  Clean/correct data
    Correct errors
    Fill in missing values
    Discard garbage, remove outliers
  Transform data if needed
    Derived attributes --- bring information to the surface
    Income → Income bracket when model requires categorical data
    DOB → Age
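The two derived attributes on this slide (income to income bracket, DOB to age) can be sketched in a few lines. The bracket cut-offs, the customer records and the reference date below are assumptions chosen for illustration:

```python
from datetime import date

def age_on(dob, today):
    """Whole years between a date of birth and a reference date."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)

def income_bracket(income):
    """Map a numeric income to a category; the cut-offs are assumed."""
    if income < 30_000:
        return "low"
    if income < 80_000:
        return "medium"
    return "high"

# Hypothetical customer records.
customers = [
    {"id": 1, "dob": date(1980, 6, 15), "income": 25_000},
    {"id": 2, "dob": date(1995, 12, 1), "income": 95_000},
]

reference_date = date(2024, 1, 1)  # fixed so the result is reproducible
for c in customers:
    c["age"] = age_on(c["dob"], reference_date)
    c["income_bracket"] = income_bracket(c["income"])
    print(c["id"], c["age"], c["income_bracket"])
```

Age is usually more useful to a model than a raw birth date, and a bracket turns a continuous value into the categorical form that some techniques require.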
Step 7-9: Model Building
  Choice of model, model building and model assessment
  Decide what model type to use
    Descriptive or predictive model?
    Which specific technique?
    Often can try different techniques
    Things to consider:
      Computational issues
      Implementation issues
      Availability and amount of relevant data
      Do we have the necessary expertise?
  Assess Models
    Accuracy on testing data
    Small is beautiful
      Easier to understand
  Step 9 is more about scoring or ranking on real data
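"Accuracy on testing data" means the model is judged on records it did not see while being built. A minimal sketch, assuming a deliberately tiny model (a single spend cut-off) and made-up data; the model type and the split are illustrative choices, not the course's method:

```python
# Hypothetical labelled data: (monthly_spend, responded_to_offer)
data = [(10, 0), (15, 0), (22, 0), (30, 0), (35, 1), (41, 1), (48, 1), (55, 1),
        (12, 0), (28, 1), (33, 0), (50, 1)]

# Simple holdout split: build on the first 8 rows, assess on the last 4.
train, test = data[:8], data[8:]

def accuracy(rows, threshold):
    """Share of rows where 'spend >= threshold' matches the label."""
    hits = sum((x >= threshold) == bool(y) for x, y in rows)
    return hits / len(rows)

def fit_threshold(rows):
    """Pick the spend cut-off with the best accuracy on the given rows."""
    candidates = sorted(x for x, _ in rows)
    return max(candidates, key=lambda t: accuracy(rows, t))

threshold = fit_threshold(train)
print("threshold:", threshold)
print("train accuracy:", accuracy(train, threshold))
print("test accuracy:", accuracy(test, threshold))
```

The cut-off fits the training rows perfectly but misses one of the four test rows, which is why the testing figure is the one to report.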
Step 10: Assess the Result
  It’s not model accuracy any more
    It’s more about achieving the business goal
  It’s closely related to business decisions
    E.g., if a data mining model is expensive to deploy, a mass mailing may be more cost-effective than a targeted one.
  But it’s often hard to isolate the effect of a solution.
    Indirect benefits may be hard to see.
    Do a market test
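The mass-mailing example comes down to simple arithmetic. The sketch below compares expected profit for the two options. Every figure is hypothetical, and fixed_cost stands in for an assumed cost of building and deploying the model:

```python
def campaign_profit(n_mailed, response_rate, profit_per_response,
                    cost_per_mail, fixed_cost=0.0):
    """Expected profit = responses * margin - mailing cost - fixed cost."""
    responses = n_mailed * response_rate
    return responses * profit_per_response - n_mailed * cost_per_mail - fixed_cost

# Mass mailing: everyone is contacted, low response rate, no model cost.
mass = campaign_profit(n_mailed=100_000, response_rate=0.01,
                       profit_per_response=100, cost_per_mail=0.5)

# Targeted mailing: fewer letters and a higher response rate,
# but the model itself has a (hypothetical) deployment cost.
targeted = campaign_profit(n_mailed=20_000, response_rate=0.03,
                           profit_per_response=100, cost_per_mail=0.5,
                           fixed_cost=30_000)

print(f"mass mailing profit:     {mass:,.0f}")
print(f"targeted mailing profit: {targeted:,.0f}")
```

With these numbers the targeted model triples the response rate and still loses to the mass mailing, because the deployment cost outweighs the savings on postage. With a cheaper model or a larger customer base the comparison could go the other way.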
Common Data Mining Mistakes
  Learn things that aren’t true
    Patterns may not represent any underlying rule
      Tall candidates win presidential elections
      True in data, but has no predictive power
    The data may not reflect the relevant population
      The sample should not be biased
      Otherwise, the result cannot be extended
      E.g. Your existing customers are not like the customers you want to acquire
    Data may be at the wrong level of detail
      Refer to Simpson’s paradox (next slide)
  Learn things that are true, but not useful
    Things that are already known
      Majority of rules learned are normal business rules
      E.g. Retired employees don’t respond to retirement plan promotion
    Things that can’t be used (AT&T/Cingular example)
      Inability to act upon patterns because of political, legal and ethical reasons
Simpson’s Paradox
  Business School
            Admit        Deny         Total
    Male    480 (80%)    120 (20%)    600
    Female  180 (90%)    20 (10%)     200

  Law School
            Admit        Deny         Total
    Male    10 (10%)     90 (90%)     100
    Female  100 (33%)    200 (67%)    300

  Business and Law Schools (combined)
            Admit        Deny         Total
    Male    490 (70%)    210 (30%)    700
    Female  280 (56%)    220 (44%)    500
Simpson’s Paradox refers to the reversal of the direction of a comparison or an association when data from several groups are combined to form a single group.
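This happens because the two schools have very different admission rates and men and women apply to them in different proportions: most men applied to the business school, which admits most applicants, while most women applied to the law school, which admits few. The two tables really shouldn't be combined.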
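The reversal can be checked directly from the admission counts on this slide. A minimal sketch, where the dictionary layout and the helper name rate are choices made for this example:

```python
# (admitted, applicants) per group, taken from the slide's tables.
business = {"Male": (480, 600), "Female": (180, 200)}
law = {"Male": (10, 100), "Female": (100, 300)}

def rate(admitted, total):
    """Admission rate as a fraction."""
    return admitted / total

# Pool the two schools by adding admitted counts and applicant counts.
combined = {
    group: (business[group][0] + law[group][0],
            business[group][1] + law[group][1])
    for group in business
}

for name, table in [("Business", business), ("Law", law), ("Combined", combined)]:
    male, female = rate(*table["Male"]), rate(*table["Female"])
    print(f"{name:9s} male {male:.0%}  female {female:.0%}")
```

Women have the higher admission rate within each school, yet men have the higher rate once the schools are pooled.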