The Next Step in Data Analysis: Predictive Analytics Barber, R.

Predictive Analytics: The Next Step in Data Analysis
Dr. Rebecca Barber
Senior Director, Management Analysis
Arizona State University


Who I am

- Rebecca Barber, PhD
- Senior Director of Management Analysis, Arizona State University, Office of Planning and Budget
- Consulting Statistician, PAR Framework
- Statistics Faculty, Rio Salado College
- Former Data Warehouse Architect and BI Expert

Predictive Analytics

- Uses historical data to make a prediction about data you don’t have
- Examples you may be familiar with:
  - Your credit score
  - Insurance prices
  - Why your neighbor gets different junk mail than you do

[Diagram labels: Data Mining/Machine Learning, Statistics, Business Intelligence, Modeling]

As it applies to education

- Example: admissions decisions, where SAT scores can predict 1st-year GPA
- BUT there are a lot more applications
  - Retention
  - Enrollment Management
  - Course Management
  - Mass Customization of Interventions
  - Revenue Optimization
- Your institution’s leadership wants and needs this

You already know something about predictive analytics

- If you took a basic statistics course: Linear Regression
- BUT
  - Requires independence of predictor variables (hard in the real world)
  - Assumes a linear relationship between the predictors and the outcome (rare)
  - Doesn’t deal well with dichotomous (pass/fail, yes/no, male/female) outcomes
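To illustrate the last limitation, here is a minimal sketch, assuming a made-up data set of weekly LMS hours and pass/fail outcomes (none of these numbers come from the projects described in this talk). A straight line fitted to a 0/1 outcome can produce "probabilities" below 0 or above 1, while the logistic function keeps any score between 0 and 1.

```python
import math

# Hypothetical data: hours of LMS activity in week 1 and pass (1) / fail (0).
hours = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]


def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


intercept, slope = fit_line(hours, passed)

# A straight line is unbounded, so extreme inputs give impossible values.
linear_low = intercept + slope * 0
linear_high = intercept + slope * 15
print(f"linear prediction at 0 hours:  {linear_low:.2f}")
print(f"linear prediction at 15 hours: {linear_high:.2f}")

# Passing a score through the logistic function keeps it between 0 and 1.
# Reusing the linear coefficients here only illustrates the bounding; a real
# logistic regression estimates its own coefficients by maximum likelihood.
print(f"bounded scores: {sigmoid(linear_low):.2f}, {sigmoid(linear_high):.2f}")
```

This is why models such as logistic regression are usually preferred over plain linear regression when the outcome is yes/no.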

Two projects

University of Phoenix
- Predict students who won’t complete their current course
  - Multiple iterations
  - Political struggle to pilot
  - Eventual full implementation

PAR Framework (POC Phase)
- Predict students who will still be active at the end of a 1-year period
  - Primarily a research effort to identify relevant variables

Tools for Predictive Analytics (an incomplete list)

Statistical Tools
- SAS
- SPSS
- Stata
- R (open source and rapidly expanding in capabilities)
- Simpler capabilities, but you may already have these

Data Mining Tools
- SAS Enterprise Miner
- SPSS Modeler
- Oracle Data Miner
- Rapid Insight
- RapidMiner (open source)
- WEKA (open source)
- Orange (open source)

The Process

Step 1: Business Understanding

- What question(s) are you trying to answer? What problems are you trying to solve?
- What variables are suggested in the literature? Institutional knowledge? Other institutions?
- How will an answer impact the organization? What processes and procedures will need to change?

Step 1: Business Understanding

University of Phoenix
- Improve retention and identify at-risk students early
- Internal task force came up with a list of potential predictors
- Literature review
- Discussion with other institutions

PAR Framework (POC phase)
- Understand what the risk factors are for attrition and which predict attrition
- All 6 institutions contributed what they had seen to the list of variables of interest
- Literature review

Step 2: Data Understanding

- What have you got?
  - How many missing values and is there a pattern to them?
  - Can you actually get your hands on every variable you need to answer your question(s)? If not, do proxies exist?
- Explore the data
  - Descriptive statistics
  - Frequencies
  - Means/Standard deviations
  - VISUALIZE the data
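A first pass over the data can be sketched in a few lines. This is only an illustration on made-up records (the field names and values are assumptions, not the project data): it counts missing values per variable, builds a frequency table, and computes a mean and standard deviation.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical student records; None marks a missing value.
records = [
    {"degree_level": "Bachelor", "transfer_credits": 12, "military": "N"},
    {"degree_level": "Bachelor", "transfer_credits": None, "military": "Y"},
    {"degree_level": "Associate", "transfer_credits": 0, "military": None},
    {"degree_level": "Master", "transfer_credits": 30, "military": "N"},
    {"degree_level": "Associate", "transfer_credits": None, "military": None},
]


def missing_counts(rows):
    """Count missing (None) values per variable."""
    counts = Counter()
    for row in rows:
        for name, value in row.items():
            if value is None:
                counts[name] += 1
    return dict(counts)


def frequencies(rows, name):
    """Frequency table for a categorical variable, ignoring missing values."""
    return Counter(row[name] for row in rows if row[name] is not None)


def describe(rows, name):
    """Count, mean, and sample standard deviation of the non-missing values."""
    values = [row[name] for row in rows if row[name] is not None]
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}


print(missing_counts(records))
print(frequencies(records, "degree_level"))
print(describe(records, "transfer_credits"))
```

Looking at which rows carry the missing values (for example, whether they cluster by degree level) is the next step in checking for a pattern.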

Step 2: Data Understanding

University of Phoenix / PAR Framework (POC phase)
- Degree Level
- DevEd Courses
- Transfer Credits
- Prior Term Withdrawals
- Courses completed at this institution
- Military Status
- Prior Degrees (turned out to be circularly defined)
- Demographics (high levels of missing values)

Step 3: Data Preparation

- Deal with missing data (drop row, multiple imputation)
- Calculate derived variables (GPA, prior term withdrawals)
- Standardize/Normalize the data (scale, interpretability)
- Transform data where appropriate (z-scores, log/power)
- Consider changing continuous variables to categories/ranged buckets
- Reduce predictor variables (factor/principal components analysis)
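Two of these preparation steps, standardizing to z-scores and converting a continuous variable into ranged buckets, can be sketched as follows. The GPA values, bucket boundaries, and labels are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1.

    Assumes the values are not all identical (otherwise the standard
    deviation is 0 and the division fails).
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def bucket(value, edges, labels):
    """Return the label of the first range whose upper edge exceeds value.

    `labels` has one more entry than `edges`; the last label covers
    everything at or above the final edge.
    """
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]


# Hypothetical GPA values and bucket boundaries.
gpas = [1.8, 2.4, 3.0, 3.6, 3.2]
gpa_edges = [2.0, 3.0]  # below 2.0, 2.0 up to 3.0, 3.0 and above
gpa_labels = ["low", "mid", "high"]

print([round(z, 2) for z in z_scores(gpas)])
print([bucket(g, gpa_edges, gpa_labels) for g in gpas])
```

Bucketing throws away some information, but the resulting categories are often easier to explain to the people who will act on the model.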

Step 3: Data Preparation

University of Phoenix
- Converted many continuous variables to categorical
- Calculated variables
  - Credit Ratio (Completed/Attempted)
  - Change in % of points earned over the same week in the last course taken
  - Days to 1st login to LMS

PAR Framework (POC phase)
- Converted many continuous variables to categorical
- Calculated variables
  - Credit Ratio (Completed/Attempted)
  - New student indicator
  - Results of factor analysis
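The calculated variables named on this slide are simple to derive once the raw fields are available. The sketch below is an assumption about how they might be computed: the function names, the handling of zero attempted credits, and the reading of "change in %" as a difference in percentage points are illustrative choices, not the projects' actual definitions.

```python
from datetime import date


def credit_ratio(completed, attempted):
    """Credits completed divided by credits attempted; None if none attempted."""
    if attempted == 0:
        return None
    return completed / attempted


def days_to_first_login(course_start, first_login):
    """Days from course start to first LMS login; None if never logged in."""
    if first_login is None:
        return None
    return (first_login - course_start).days


def change_in_points_pct(current_pct, prior_pct):
    """Percentage-point change versus the same week of the last course taken."""
    return current_pct - prior_pct


# Hypothetical student values.
print(credit_ratio(24, 30))
print(days_to_first_login(date(2013, 9, 3), date(2013, 9, 6)))
print(change_in_points_pct(78.0, 85.0))
```

Returning None for undefined cases keeps "no data" distinct from a true zero, which matters when the missing values are examined later.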

Step 4: Modeling

- Randomly subdivide the data into at least 2 sets
  - Training
  - Validation
  - Hold-out
- Look at the different TYPES of models and pick one to start with
  - Generally I have tested 2-4 different types of models on each project, sometimes more, to see which provides the most accurate results
- Run model → Evaluate → Refine → Iterate
- Run another model → Evaluate → Refine → Iterate
- Compare results between the different models
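The random subdivision can be sketched as below. The 60/20/20 proportions and the fixed seed are assumptions made for the example; the talk does not specify split sizes.

```python
import random


def split_dataset(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into training, validation, and hold-out sets.

    Whatever is left after the training and validation shares becomes the
    hold-out set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


student_ids = list(range(1000))  # stand-in for real student records
train_set, validation_set, holdout_set = split_dataset(student_ids)
print(len(train_set), len(validation_set), len(holdout_set))
```

The model is fitted on the training set, tuned against the validation set, and judged once on the hold-out set, which it has never seen.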

Model Types

Task: Predicting a discrete attribute
Algorithms: Logistic Regression; Decision Trees (CHAID, CART, Random Forest); Naïve Bayes; Support Vector Machine; Survival Analysis; Neural Networks
Applications: 6-year graduation rate; Attrition

Task: Predicting a continuous attribute
Algorithms: Multiple Linear Regression; Support Vector Machine; Time series; Decision Trees (CHAID, CART, Random Forest)
Applications: GPA in a future term

Task: Finding common groups (Clustering)
Algorithms: Hierarchical; k-nearest neighbors; Neural Networks; Decision Trees (CHAID, CART, Random Forest)
Applications: Grouping types of students to optimize interventions
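As one concrete instance of predicting a discrete attribute, here is a minimal sketch of logistic regression with a single predictor, fitted by plain gradient descent. The data (credit ratio versus attrition), the learning rate, and the number of passes are all made up for illustration; in practice one of the statistical or data mining tools listed earlier would do the fitting.

```python
import math


def sigmoid(z):
    """Logistic function: maps a real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p(y=1) = sigmoid(b0 + b1*x) by batch gradient descent on log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / n
        grad1 = sum(e * x for e, x in zip(errors, xs)) / n
        b0 -= learning_rate * grad0
        b1 -= learning_rate * grad1
    return b0, b1


# Hypothetical data: credit ratio and whether the student left (1) or stayed (0).
credit_ratios = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.35]
attrited = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

b0, b1 = fit_logistic(credit_ratios, attrited)


def attrition_risk(ratio):
    """Predicted probability of attrition for a given credit ratio."""
    return sigmoid(b0 + b1 * ratio)


print(f"risk at credit ratio 0.2: {attrition_risk(0.2):.2f}")
print(f"risk at credit ratio 1.0: {attrition_risk(1.0):.2f}")
```

In this toy data the fitted slope is negative, so a higher credit ratio corresponds to a lower predicted risk of attrition.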

Step 4: Model

University of Phoenix
- Looked at
  - Logistic Regression
  - Random Forest (multiple decision trees)
  - Naïve Bayes
- MULTIPLE ITERATIONS
  - Switched variables to categorical
  - Added variables found in new studies

PAR Framework (POC phase)
- Looked at
  - Decision Tree/CHAID
  - Logistic Regression
- MULTIPLE ITERATIONS
  - Switched variables to categorical
  - Performed factor analysis

Step 5: Evaluate the results

- Each different type of model has different metrics/accuracy BUT…
- Examine
  - Confusion Matrix
  - R-squared for regression-type models
  - Lift Chart (how much improvement)
- Look for over-fitting of the model to the training data set

Confusion matrix:

                  Pass    Fail
Predicted Pass    1756      21
Predicted Fail      98    1013
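The usual summary measures follow directly from the four counts on this slide. The sketch below assumes that the Pass/Fail columns are the actual outcomes and the rows are the predictions; the slide does not label the axes explicitly.

```python
# Counts from the slide's confusion matrix. Treating the columns as actual
# outcomes and the rows as predictions is an assumption.
pred_pass_actual_pass = 1756
pred_pass_actual_fail = 21
pred_fail_actual_pass = 98
pred_fail_actual_fail = 1013

total = (pred_pass_actual_pass + pred_pass_actual_fail
         + pred_fail_actual_pass + pred_fail_actual_fail)

# Share of all students whose outcome was predicted correctly.
accuracy = (pred_pass_actual_pass + pred_fail_actual_fail) / total

# Of the students predicted to pass, the share who actually passed.
precision_pass = pred_pass_actual_pass / (
    pred_pass_actual_pass + pred_pass_actual_fail)

# Of the students who actually passed, the share the model predicted to pass.
recall_pass = pred_pass_actual_pass / (
    pred_pass_actual_pass + pred_fail_actual_pass)

# Of the students who actually failed, the share the model flagged as failing.
recall_fail = pred_fail_actual_fail / (
    pred_fail_actual_fail + pred_pass_actual_fail)

print(f"total students:   {total}")
print(f"accuracy:         {accuracy:.3f}")
print(f"precision (pass): {precision_pass:.3f}")
print(f"recall (pass):    {recall_pass:.3f}")
print(f"recall (fail):    {recall_fail:.3f}")
```

For an early-warning use, recall on the Fail class is often the number to watch, since a missed at-risk student gets no intervention.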

Step 5: Evaluate the results

University of Phoenix
- 85% correct predictions at week 0 (before the course started)
- 95% correct by week 3

PAR Framework (POC Phase)
- Because of the exploratory nature, most focus was on the amount of lift each variable gave
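One common way to express the lift of a yes/no variable is the outcome rate among flagged students divided by the overall outcome rate. The sketch below uses that definition with made-up counts; the slide does not say which definition of lift the project used, so treat both the formula and the data as assumptions.

```python
def lift(rows, flag, outcome):
    """Outcome rate among rows where `flag` is set, divided by the overall rate."""
    overall_rate = sum(r[outcome] for r in rows) / len(rows)
    flagged = [r for r in rows if r[flag]]
    flagged_rate = sum(r[outcome] for r in flagged) / len(flagged)
    return flagged_rate / overall_rate


# Hypothetical records: 1 = true, 0 = false.
students = (
    [{"new_student": 1, "attrited": 1}] * 30
    + [{"new_student": 1, "attrited": 0}] * 20
    + [{"new_student": 0, "attrited": 1}] * 20
    + [{"new_student": 0, "attrited": 0}] * 130
)

print(round(lift(students, "new_student", "attrited"), 2))
```

A lift above 1 means the flagged group has the outcome more often than students overall; a lift near 1 means the variable adds little.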

Key findings

University of Phoenix
- Early engagement in the course is indicative of success
- Change from prior behavior, even as small as 5%, can be an indicator of trouble
- Demographics aren’t necessary for high levels of accuracy

PAR Framework (POC Phase)
- Credit Ratio is critical
- New students are at significantly more risk
- DevEd course attempts also highly predictive of future problems
- Withdrawals are addictive
- Demographics rarely matter

Step 6: Deploy and train users

- Integrate into existing systems
  - The more manual the process, the less people will use the results
- Teach users what the result means and how they can apply it
  - Consider simplifying the results: rather than a percentage risk score, consider using something like red/yellow/green or 1 to 10, which eliminates false precision
- Get executive support for implementation
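Simplifying a percentage risk score into bands takes only a few lines. The cut-offs below (0.3 and 0.6) and the function names are hypothetical; real thresholds would be set with the people who act on the results.

```python
def risk_band(probability, yellow_at=0.3, red_at=0.6):
    """Map a predicted risk probability to a red/yellow/green band.

    The default thresholds are illustrative assumptions only.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= red_at:
        return "red"
    if probability >= yellow_at:
        return "yellow"
    return "green"


def risk_score_1_to_10(probability):
    """Map a predicted risk probability to a whole number from 1 to 10."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return min(10, int(probability * 10) + 1)


for p in (0.10, 0.45, 0.85):
    print(p, risk_band(p), risk_score_1_to_10(p))
```

A counselor sees "yellow" or "5" instead of "45.37%", which avoids implying more precision than the model has.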

Step 6: Deploy

University of Phoenix
- Pilot implemented for a subset of academic counselors
- Initial result showed a small but statistically significant improvement in retention
- Additionally, counselors felt that the risk score qualitatively improved their ability to work with their students
- Rolled out to all counselors

PAR Framework (POC Phase)
- Not deployed, but informing the next phase of the project
- Join us at 4pm in room 102A to learn more and preview the dashboards with the model integrated

Learning more about Predictive Analytics

- However, you can use what you’ve learned as a base for learning more:
  - University degree and certificate programs (more appearing all the time)
  - Statistics and Data Mining courses (teach the algorithms)
  - Vendor or tool-based training
  - MOOCs and Open Courseware
  - Self-education/Reading
  - TALK TO YOUR PEERS

Questions?

Contact information:
Rebecca Barber, PhD
Arizona State University
[email protected] or 480-965-5980