The Next Step in Data Analysis: Predictive Analytics Barber, R.
+Predictive Analytics: The Next Step in Data Analysis
Dr. Rebecca Barber
Senior Director, Management Analysis
Arizona State University
+Who I am
- Rebecca Barber, PhD
- Senior Director of Management Analysis, Arizona State University Office of Planning and Budget
- Consulting Statistician, PAR Framework
- Statistics Faculty, Rio Salado College
- Former Data Warehouse Architect and BI Expert
+Predictive Analytics
- Uses historical data to make a prediction about data you don’t have
- Examples you may be familiar with:
  - Your credit score
  - Insurance prices
  - Why your neighbor gets different junk mail than you do
Diagram labels: Data Mining/Machine Learning, Statistics, Business Intelligence, Modeling
+As it applies to education
- Example: in admissions decisions, SAT scores can predict 1st year GPA
- BUT there are a lot more applications
  - Retention
  - Enrollment Management
  - Course Management
  - Mass Customization of Interventions
  - Revenue Optimization
- Your institution’s leadership wants and needs this
+You already know something about predictive analytics
- If you took a basic statistics course: Linear Regression
- BUT
  - Requires independence of predictor variables (hard in the real world)
  - Assumes a linear relationship between the predictors and the outcome (rare)
  - Doesn’t deal well with dichotomous (pass/fail, yes/no, male/female) outcomes
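The trouble with dichotomous outcomes can be seen in a small sketch. The data below are made up for illustration (hours of early course activity against pass/fail), and `fit_line` and `logistic` are hypothetical helper names, not anything from the talk. A least-squares line fitted to a 0/1 outcome can predict values below 0 or above 1, which cannot be read as probabilities, whereas the logistic function that logistic regression is built on always stays between 0 and 1.

```python
import math

# Made-up illustration data: hours of early course activity and pass (1) / fail (0).
hours = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


intercept, slope = fit_line(hours, passed)

# The straight line gives "probabilities" outside [0, 1] at the extremes.
linear_predictions = [intercept + slope * h for h in (0, 12)]

# The logistic function never leaves (0, 1), whatever its input.
logistic_outputs = [logistic(z) for z in (-5.0, 0.0, 5.0)]

print(linear_predictions)
print(logistic_outputs)
```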
+Two projects
University of Phoenix
- Predict students who won’t complete their current course
  - Multiple iterations
  - Political struggle to pilot
  - Eventual full implementation
PAR Framework (POC phase)
- Predict students who will still be active at the end of a 1 year period
  - Primarily a research effort to identify relevant variables
+Tools for Predictive Analytics
(an incomplete list)
Statistical Tools
- SAS
- SPSS
- Stata
- R (open source and rapidly expanding in capabilities)
- Simpler capabilities, but you may already have these
Data Mining Tools
- SAS Enterprise Miner
- SPSS Modeler
- Oracle Data Miner
- Rapid Insight
- RapidMiner (open source)
- WEKA (open source)
- Orange (open source)
+Step 1: Business Understanding
- What question(s) are you trying to answer? What problems are you trying to solve?
- What variables are suggested in the literature? Institutional knowledge? Other institutions?
- How will an answer impact the organization? What process and procedures will need to change?
+Step 1: Business Understanding
University of Phoenix
- Improve retention and identify at-risk students early
  - Internal task force came up with a list of potential predictors
  - Literature review
  - Discussion with other institutions
PAR Framework (POC phase)
- Understand what the risk factors are for attrition and which predict attrition
  - All 6 institutions contributed what they had seen to the list of variables of interest
  - Literature review
+Step 2: Data Understanding
- What have you got?
  - How many missing values and is there a pattern to them?
  - Can you actually get your hands on every variable you need to answer your question(s)? If not, do proxies exist?
- Explore the data
  - Descriptive statistics
  - Frequencies
  - Means/Standard deviations
  - VISUALIZE the data
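A first pass at these questions can be sketched in a few lines. The records and field names below are made up for illustration, and `missing_counts` and `describe` are hypothetical helpers; a statistics package would do the same job on real data.

```python
import statistics

# Made-up student records; None marks a missing value.
records = [
    {"transfer_credits": 12, "prior_withdrawals": 0, "gender": None},
    {"transfer_credits": 0, "prior_withdrawals": 2, "gender": "F"},
    {"transfer_credits": None, "prior_withdrawals": 1, "gender": None},
    {"transfer_credits": 30, "prior_withdrawals": 0, "gender": "M"},
]


def missing_counts(rows):
    """Count missing (None) values per field."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts[field] = counts.get(field, 0) + (1 if value is None else 0)
    return counts


def describe(rows, field):
    """Count, mean, and sample standard deviation of the non-missing values."""
    values = [row[field] for row in rows if row[field] is not None]
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


print(missing_counts(records))
print(describe(records, "transfer_credits"))
```

A field with many missing values, like `gender` in this toy data, is the kind of pattern worth checking before any modeling starts.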
+Step 2: Data Understanding
- Degree Level
- DevEd Courses
- Transfer Credits
- Prior Term Withdrawals
- Courses completed at this institution
- Military Status
- Prior Degrees (turned out to be circularly defined)
- Demographics (high levels of missing values)
University of Phoenix PAR Framework (POC phase)
+Step 3: Data Preparation
- Deal with missing data (drop row, multiple imputation)
- Calculate derived variables (GPA, prior term withdrawals)
- Standardize/Normalize the data (scale, interpretability)
- Transform data where appropriate (z-scores, log/power)
- Consider changing continuous variables to categories/ranged buckets
- Reduce predictor variables (factor/principal components analysis)
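Three of these preparation steps can be sketched briefly: dropping rows with missing values, converting a variable to z-scores, and bucketing a continuous variable. The function names and the GPA cut points are assumptions chosen for illustration; real cut points should come from the data and the business question.

```python
import statistics


def drop_incomplete(rows):
    """Simplest missing-data strategy: drop any row that has a missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # All values identical: every z-score is 0.
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]


def bucket_gpa(gpa):
    """Convert a continuous GPA into a ranged category (hypothetical cut points)."""
    if gpa < 2.0:
        return "below 2.0"
    if gpa < 3.0:
        return "2.0-2.99"
    return "3.0 and above"


rows = [{"gpa": 2.0}, {"gpa": None}, {"gpa": 3.0}, {"gpa": 4.0}]
complete = drop_incomplete(rows)
gpas = [row["gpa"] for row in complete]
print(z_scores(gpas))
print([bucket_gpa(g) for g in gpas])
```

Dropping rows is the bluntest option listed on the slide; multiple imputation keeps more data but needs a statistics package.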
+Step 3: Data Preparation
University of Phoenix
- Converted many continuous variables to categorical
- Calculated variables
  - Credit Ratio (Completed/Attempted)
  - Change in % of points earned over the same week in the last course taken
  - Days to 1st login to LMS
PAR Framework (POC phase)
- Converted many continuous variables to categorical
- Calculated variables
  - Credit Ratio (Completed/Attempted)
  - New student indicator
  - Results of factor analysis
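Two of the calculated variables are simple enough to sketch. The slides give Credit Ratio as Completed/Attempted; how the projects handled students with no attempted credits is not stated, so returning `None` is an assumption here. Reading the change in points as a percentage-point difference against the same week of the last course is also an assumption, and both function names are hypothetical.

```python
def credit_ratio(completed, attempted):
    """Credit Ratio = credits completed / credits attempted.

    Assumption: a student with no attempted credits has no ratio (None).
    """
    if attempted == 0:
        return None
    return completed / attempted


def change_in_points_pct(current_week_pct, same_week_last_course_pct):
    """Percentage-point change in points earned versus the same week of the
    last course taken (assumed definition)."""
    return current_week_pct - same_week_last_course_pct


print(credit_ratio(9, 12))           # 9 of 12 attempted credits completed
print(credit_ratio(0, 0))            # no attempted credits
print(change_in_points_pct(80, 88))  # earned 80% this week, 88% last course
```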
+Step 4: Modeling
- Randomly subdivide the data into at least 2 sets
  - Training
  - Validation
  - Hold-out
- Look at the different TYPES of models and pick one to start with
  - Generally I have tested 2-4 different types of models on each project, sometimes more, to see which provides the most accurate results
- Run model → Evaluate → Refine → Iterate
- Run another model → Evaluate → Refine → Iterate
- Compare results between the different models
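The random subdivision can be sketched as follows. The 60/20/20 proportions, the fixed seed, and the name `split_three_ways` are assumptions for illustration; the slides do not say what proportions were used.

```python
import random


def split_three_ways(rows, train=0.6, validation=0.2, seed=42):
    """Randomly split rows into training, validation, and hold-out sets.

    Whatever is left after the training and validation shares becomes
    the hold-out set. A fixed seed makes the split repeatable.
    """
    shuffled = list(rows)  # copy so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    training = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    hold_out = shuffled[n_train + n_validation:]
    return training, validation_set, hold_out


student_ids = list(range(100))
training, validation_set, hold_out = split_three_ways(student_ids)
print(len(training), len(validation_set), len(hold_out))
```

The hold-out set stays untouched while models are being refined, so it can give an honest final check for over-fitting.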
+Model Types
Task: Predicting a discrete attribute
  Algorithms: Logistic Regression; Decision Trees (CHAID, CART, Random Forest); Naïve Bayes; Support Vector Machine; Survival Analysis; Neural Networks
  Applications: 6-year graduation rate; Attrition
Task: Predicting a continuous attribute
  Algorithms: Multiple Linear Regression; Support Vector Machine; Time series; Decision Trees (CHAID, CART, Random Forest)
  Applications: GPA in a future term
Task: Finding common groups (Clustering)
  Algorithms: Hierarchical; k-nearest neighbors; Neural Networks; Decision Trees (CHAID, CART, Random Forest)
  Applications: Grouping types of students to optimize interventions
+Step 4: Model
University of Phoenix
- Looked at
  - Logistic Regression
  - Random Forest (multiple decision trees)
  - Naïve Bayes
- MULTIPLE ITERATIONS
  - Switched variables to categorical
  - Added variables found in new studies
PAR Framework (POC phase)
- Looked at
  - Decision Tree/CHAID
  - Logistic Regression
- MULTIPLE ITERATIONS
  - Switched variables to categorical
  - Performed factor analysis
+Step 5: Evaluate the results
- Each different type of model has different metrics/accuracy BUT…
- Examine
  - Confusion Matrix
  - R-squared for regression-type models
  - Lift Chart (how much improvement)
- Look for over-fitting of the model to the training data set

                 Pass   Fail
Predicted Pass   1756     21
Predicted Fail     98   1013
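The confusion matrix above can be turned into summary metrics with a few lines of arithmetic. This sketch assumes the columns are actual outcomes and the rows are predictions, and it treats "fail" as the class of interest (the at-risk students); the variable names are illustrative.

```python
# Counts from the confusion matrix on the slide.
# Assumption: columns are actual outcomes, rows are predicted outcomes.
pred_pass_actual_pass = 1756
pred_pass_actual_fail = 21
pred_fail_actual_pass = 98
pred_fail_actual_fail = 1013

total = (pred_pass_actual_pass + pred_pass_actual_fail
         + pred_fail_actual_pass + pred_fail_actual_fail)

# Accuracy: share of all students whose outcome was predicted correctly.
accuracy = (pred_pass_actual_pass + pred_fail_actual_fail) / total

# Recall for "fail": of the students who actually failed, how many were flagged.
fail_recall = pred_fail_actual_fail / (pred_fail_actual_fail + pred_pass_actual_fail)

# Precision for "fail": of the students flagged, how many actually failed.
fail_precision = pred_fail_actual_fail / (pred_fail_actual_fail + pred_fail_actual_pass)

print(f"accuracy={accuracy:.3f} recall={fail_recall:.3f} precision={fail_precision:.3f}")
```

With these counts, 2,769 of 2,888 predictions are correct, an accuracy of about 95.9%.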
+Step 5: Evaluate the results
University of Phoenix
- 85% correct predictions at week 0 (before the course started)
- 95% correct by week 3
PAR Framework (POC phase)
- Because of the exploratory nature, most focus was on the amount of lift each variable gave
+Key findings
University of Phoenix
- Early engagement in the course is indicative of success
- Change from prior behavior, even as small as 5%, can be an indicator of trouble
- Demographics aren’t necessary for high levels of accuracy
PAR Framework (POC phase)
- Credit Ratio is critical
- New students are at significantly more risk
- DevEd course attempts are also highly predictive of future problems
- Withdrawals are addictive
- Demographics rarely matter
+Step 6: Deploy and train users
- Integrate into existing systems
  - The more manual, the less people will use the results
- Teach users what the result means and how they can apply it
  - Consider simplifying the results: rather than a percentage risk score, consider using something like red/yellow/green or 1 to 10; eliminates false precision
- Get executive support for implementation
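Simplifying a percentage risk score into red/yellow/green or a 1 to 10 scale can be sketched like this. The thresholds (0.3 and 0.6) and the function names are hypothetical; in practice the cut points would be chosen with the counselors who use the result.

```python
def risk_band(probability, yellow_at=0.3, red_at=0.6):
    """Map a predicted risk (0.0 to 1.0) to red/yellow/green.

    The default thresholds are illustrative assumptions, not values
    from either project.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= red_at:
        return "red"
    if probability >= yellow_at:
        return "yellow"
    return "green"


def risk_score_1_to_10(probability):
    """Map a predicted risk (0.0 to 1.0) to a whole-number score from 1 to 10."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return min(10, int(probability * 10) + 1)


for p in (0.1, 0.25, 0.72, 1.0):
    print(p, risk_band(p), risk_score_1_to_10(p))
```

Either form hides the difference between, say, 71% and 73% risk, which is the false precision the slide warns about.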
+Step 6: Deploy
University of Phoenix
- Pilot implemented for a subset of academic counselors
- Initial result showed a small but statistically significant improvement in retention
- Additionally, counselors felt the risk score qualitatively improved their ability to work with their students
- Rolled out to all counselors
PAR Framework (POC phase)
- Not deployed, but informing the next phase of the project
- Join us at 4pm in room 102A to learn more and preview the dashboards with the model integrated
+Learning more about Predictive Analytics
- However, you can use what you’ve learned as a base for learning more:
  - University degree and certificate programs (more appearing all the time)
  - Statistics and Data Mining courses (teach the algorithms)
  - Vendor or Tool-based training
  - MOOCs and Open Courseware
  - Self-education/Reading
  - TALK TO YOUR PEERS
+Questions?
Contact information:
Rebecca Barber, PhD
Arizona State University
[email protected] or 480-965-5980