Enhanced feature mining and classifier models to predict ...

Graduate Theses and Dissertations Iowa State University Capstones, Theses andDissertations

2016

Enhanced feature mining and classifier models topredict customer churn for an e-retailerKarthik B. SubramanyaIowa State University

Follow this and additional works at: https://lib.dr.iastate.edu/etd

Part of the Computer Engineering Commons

This Thesis is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University DigitalRepository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University DigitalRepository. For more information, please contact [email protected].

Recommended CitationSubramanya, Karthik B., "Enhanced feature mining and classifier models to predict customer churn for an e-retailer" (2016). GraduateTheses and Dissertations. 16023.https://lib.dr.iastate.edu/etd/16023

brought to you by COREView metadata, citation and similar papers at core.ac.uk

provided by Digital Repository @ Iowa State University

https://core.ac.uk/display/141671507?utm_source=pdf&utm_medium=banner&utm_campaign=pdf-decoration-v1

http://lib.dr.iastate.edu/?utm_source=lib.dr.iastate.edu%2Fetd%2F16023&utm_medium=PDF&utm_campaign=PDFCoverPages

http://lib.dr.iastate.edu/?utm_source=lib.dr.iastate.edu%2Fetd%2F16023&utm_medium=PDF&utm_campaign=PDFCoverPages

https://lib.dr.iastate.edu/etd?utm_source=lib.dr.iastate.edu%2Fetd%2F16023&utm_medium=PDF&utm_campaign=PDFCoverPages

https://lib.dr.iastate.edu/theses?utm_source=lib.dr.iastate.edu%2Fetd%2F16023&utm_medium=PDF&utm_campaign=PDFCoverPages

https://lib.dr.iastate.edu/theses?utm_source=lib.dr.iastate.edu%2Fetd%2F16023&utm_medium=PDF&utm_campaign=PDFCoverPages

https://lib.dr.iastate.edu/etd?utm_source=lib.dr.iastate.edu%2Fetd%2F16023&utm_medium=PDF&utm_campaign=PDFCoverPages

http://network.bepress.com/hgg/discipline/258?utm_source=lib.dr.iastate.edu%2Fetd%2F16023&utm_medium=PDF&utm_campaign=PDFCoverPages

https://lib.dr.iastate.edu/etd/16023?utm_source=lib.dr.iastate.edu%2Fetd%2F16023&utm_medium=PDF&utm_campaign=PDFCoverPages

mailto:[email protected]

Enhanced feature mining and classifier models to predict customer churn for an

e-retailer

by

Karthik Subramanya

A thesis submitted to the graduate faculty

in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Major: Electrical and Computer Engineering

Program of Study Committee:

Arun Somani, Major Professor

Srikanta Tirthapura

Glenn Luecke

Iowa State University

Ames, Iowa

2016

Copyright c© Karthik Subramanya, 2016. All rights reserved.

ii

DEDICATION

I would like to dedicate this thesis to my parents and all my teachers who are largely

responsible to have groomed us into individuals we are today. I would also like to thank my

friends and family for their loving guidance and encouragement all through my life. Last but

not the least, I would sincerely thank my adviser for providing an opportunity to work under

him and constantly provide guidance and motivation to come up with this Work.

iii

TABLE OF CONTENTS

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

CHAPTER 1. OVERVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 The E-commerce Business Model . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Motivation and Previous Contributions . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Organization of Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

CHAPTER 2. OUR SOLUTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.1 Customer Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2 Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.3 Data Mining and Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.4 Data Mining and Machine Learning Pipeline . . . . . . . . . . . . . . . . . . . 9

2.4.1 Machine learning libraries . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.4.2 Customer churn model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

CHAPTER 3. FEATURE SELECTION . . . . . . . . . . . . . . . . . . . . . . 12

3.1 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.2 Data Cleaning and Data Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.3 Feature Selection and Feature Reduction . . . . . . . . . . . . . . . . . . . . . . 17

3.3.1 Features with low threshold . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.3.2 Features with high correlation: Pearson correlation coefficient . . . . . . 18

iv

3.3.3 Imbalance in output class labels . . . . . . . . . . . . . . . . . . . . . . 18

3.3.4 Uni-variate feature selection . . . . . . . . . . . . . . . . . . . . . . . . . 19

CHAPTER 4. BINOMIAL CLASSIFICATION . . . . . . . . . . . . . . . . . . 23

4.1 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.2 Naive Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4.3 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.3.1 Logistic regression with regularization . . . . . . . . . . . . . . . . . . . 27

4.4 Support Vector Machines (SVM) . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.5 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.6 Random Forest Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.6.1 Ensemble learning : gradient boosting . . . . . . . . . . . . . . . . . . . 30

CHAPTER 5. CLASSIFIER EVALUATION AND RESULTS . . . . . . . . . 31

5.1 Classifier Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 31

5.1.1 Confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

5.1.2 Reverse operating characteristics (ROC) curve . . . . . . . . . . . . . . 33

5.2 K-fold Strategy for Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . 35

5.3 Results from Classifier Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5.4 ROC Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5.4.1 Regularized logistic regression: L1-norm . . . . . . . . . . . . . . . . . . 36

5.4.2 SVM classifier with F-anova feature selection . . . . . . . . . . . . . . . 36

5.5 Best Performing Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

CHAPTER 6. CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

6.1 Summarization of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

v

LIST OF TABLES

Table 2.1 Volume of Customer data . . . . . . . . . . . . . . . . . . . . . . . . . 8

3.1 Feature variables for Customer feature matrix . . . . . . . . . . . . . . 14

Table 3.2 Result of L1 regularization on Feature Matrix . . . . . . . . . . . . . . 22

Table 5.1 Classification without Feature Selection . . . . . . . . . . . . . . . . . . 38

Table 5.2 Classification with F-Anova Feature Selection . . . . . . . . . . . . . . 38

vi

LIST OF FIGURES

Figure 2.1 Data pipeline framework from Apache . . . . . . . . . . . . . . . . . . 10

Figure 2.2 Customer churn model life cycle . . . . . . . . . . . . . . . . . . . . . . 11

Figure 3.1 Customer feature Scoring through Univariate F-score . . . . . . . . . . 21

Figure 4.1 Rexer Analytics Survey on most Commonly used Algorithms . . . . . . 23

Figure 4.2 Plot of the Logistic Losses . . . . . . . . . . . . . . . . . . . . . . . . . 27

Figure 4.3 Plot of Binomial Classification using SVM . . . . . . . . . . . . . . . . 28

Figure 5.1 Confusion Matrix :Binomial Classification (Source : Wikipedia) . . . . 31

Figure 5.2 Example of an ROC Curve . . . . . . . . . . . . . . . . . . . . . . . . . 34

Figure 5.3 Confusion Matrix for Gradient Boost Classifier . . . . . . . . . . . . . 38

Figure 5.4 ROC Curve for Logistic Regression with L1 Regularization . . . . . . . 39

Figure 5.5 ROC Curve for SVM classifier and F-score based Feature Selection . . 40

vii

ACKNOWLEDGEMENTS

I would like to take this opportunity to express my gratitude to those who helped me

with various aspects of conducting research and the writing of this thesis. First and foremost,

Dr. Arun Somani for his guidance, patience and support throughout this research and the

writing of this thesis. His technical prowess and words of encouragement have always inspired

me all through my graduate study. He also motivated me to take up the thesis track in

Graduate education. I would also like to thank my committee members for their kind guidance:

Dr.Srikanta Thirthapura and Dr.Greg Luecke. I would additionally thank my employer for

providing me with an opportunity to come up with this work through constant encouragement.

viii

ABSTRACT

The evolution of large scale distributed computing, sustained progress in augmenting the

technical expertise in algorithms and data sciences in the recent past have opened new avenues

to address several large problems in science and commerce which were previously not feasible

due to lack of computing infrastructure. Developments in algorithms and advanced statisti-

cal modeling i.e, machine learning, has provided the required intelligence methods to handle

large volume of data, or BIG Data [MCB+11] for applications such as basic sciences, further

advanced topics like Climate patterns, Bio-informatics, GIS, Infrastructure planning, finance,

E-commerce, Social networking, Policy planning, etc.

We address an important problem, Customer churn [HHR10] faced across all industries

who depend on customer loyalty for growing their businesses. Customer churn is formally

defined as a customer abandoning an established relation with a organization. It is also called

as customer attrition, customer turnover or customer defection according to the wikipedia.

Predicting customer churn is prioritized by businesses to save their businesses as the cost of

retaining an existing customer is far less than acquiring a new one [FP08].

Customer churn models are applicable in many industries, like financial, telecom and au-

tomobile industries to name a few [XLNY09, AKR08, WC02]. We develop our own customer

churn predictive model for E-commerce industry that leverages some of the advantages a Big

Data infrastructure brings to the table. Our work is well tailored to suit the industry model.

To the best of our knowledge there is no published work on customer churn prediction for an

e-retailer that is similar to our model in terms of Data mining and model building.

We use a binomial classifier approach [Alp14] by first deriving a customer feature matrix

using customer data. This model is often used by researchers in the field of medicine, drug

discovery, disease diagnosis, sports, etc. We model our entire customer base as a feature matrix

[DL97] with each customer representing a feature vector containing a combination of features

ix

that influences his/her churn. We then apply a suitable feature selection algorithm [MBN02] to

choose the best subset of features from the feature vector. Next, we apply classifier algorithms

[KZP07] on the resultant data and cross-validate the results from the predictions.

1

CHAPTER 1. OVERVIEW

The paradigm of distributed parallel computing is increasingly enabling progress in big

data technologies and enabling new avenues for large number of data-centric and compute

intensive applications. Availability of supercomputers with thousands of nodes and high speed

communication also allows solving complex problem involving large data and computation that

was once outside the realm of possibilities.

Hadoop technology [SKRC10], provides commercially scalable big data technology that is

completely open source (Apache creative common license) and is embraced by industry and

academia all across the world. Several technologies that enabled solutions include high perfor-

mance computation using Message passing Interface (MPI) for distributed computing [GL99],

OpenMP for shared memory computing, distributed and redundant file systems, hadoop stack,

NoSql [SKRC10] Databases etc. The advances in data sciences, machine learning, neural net-

works, deep learning, etc. have been developed to run on the hadoop stack. Some of the

real world problems benefiting include but are not limited to Predictive analytics, Prescrip-

tive analytics, Recommender systems, Financial forecasting, Sports analytics, DNA modelling,

Climatic predictions and Cancer prediction.

1.1 The E-commerce Business Model

E-commerce [Gef00] (electronic commerce or EC) include the buying and selling of goods

and services and the transmitting of funds or data, over an electronic network. These business

transactions occur between business-to-business (B2B), business-to-consumer (B2C), consumer-

to-consumer (C2C) or consumer-to-business (C2B). The terms e-commerce and e-retailer are

often used interchangeably in this work. We are primarily interested in e-retail business which

2

is a form of electronic commerce that allows consumers to directly buy goods or services from

a seller over the Internet using a web browser or a mobile app. It is projected that in the year

2017 the online e-retail industry will grow upwards of 600 billion dollars. Some of the household

e-retail names are Amazon, Alibaba, Walmart, e-bay, Staples, Macy’s, Apple, etc. While most

of these e-retailers operate on a B2C business model, a B2B model or a combination of both is

also common.

Many businesses have migrated from owning a Brick mortar shop alone to include e-retail

business to cater to the needs of the customer and to keep up with the competition while others

like Amazon, Alibaba follow only the e-commerce route.

Customer loyalty [SAP02] is an important driver to many E-retailers as the cost of acquiring

a new customer is a significant effort in comparison to the cost of retaining one. Unlike a

brick and mortar shopping experience that involves a look and feel, location advantage and

human interaction component among others, the e-retail business model comes packaged in a

single website from the landing page to exit. Therefore it is the most important priority of

these companies to entice the customer with great line of products, pricing, attractive offers,

recommendations, personalization, etc to create a desirable shopping experience.

1.2 Motivation and Previous Contributions

A review of Customer churn prediction approaches across various industry verticals and

their efficiency motivates us to develop a new framework for churn in e-commerce customers.

We surveyed their methodologies for data collections and algorithms, varied data sources that

the researchers used for selecting features, their approach for feature selection algorithms,

classification models, cross-validation techniques, etc were surveyed.

The features driving the algorithm can be used for other data science initiatives within

the organization as it makes a rich set of features available for every customers that can be

re-purposed to solve other problems in area of predictive & prescriptive customer analytics.

This work also lays foundation for future work to drive other models like propensity to buy,

customer segmentation, cross-sell, up-sell among these customers.

3

Our work follows closely around the model building techniques proposed in [XLNY09,

YKG10, YGGH11]. Since we deal with a totally different industry vertical and a business

problem to solve, the source and mixture of attributes that make up the data are different.

The authors in [YKG10] predict customer churn of Google Adword customers. They first

build a feature matrix for these customers and then further employ classification algorithms

like Random Forests, Gradient Boost Method (GBM), etc. The authors in [XLNY09] predict

customer churn in Bank’s. They employ an Inverse Balance Random Forest classifier (IBRF),

a technique that they use to prevent classifier algorithms from mis-classifying a minor class

label on account of imbalance in the label distribution.

[YGGH11] proposes an enhanced Singular vector Machine (SVM) called the (ESVM) frame-

work that claims to scale well over large scale data and ability to handle non-linear data ef-

fectively. [MCeC13] uses Multivariate regression Splines (MARS) as classification technique to

detect customer churn. The authors in [CVdP09] and [CFS12] using several user behavioral

data like email sentiment mining and longitudinal behavioral data to aid classifiers to make

accurate predictions. The model proposed in [SR15] also uses significant qualitative customer

behavior data to drive fraud detection in insurance claims using One class SVM (OCSVM) for

classification task and K-reverse nearest neighborhood that handles class imbalances. Linear

and non-linear classifiers like SVM, Logistic Regression, Artificial Neural Networks (ANN) and

Tree based Ensemble classifiers and their variants are predominant choice of classifiers used by

researchers to solve the customer churn problem. We observe significant gap in feature mining

process in previously published work. They all fail to effectively represent the e-commerce

business model. The choice of feature-set to drive churn prediction is mostly restricted to a

list of conventional feature-set which has a huge share of static features and features generated

through sales. Although some of the recent work [CFS12] published on customer churn for E-

commerce customer focuses on behavioral features, this feature set is still narrow and restricted

to only a handful of metrics like recency and frequency factors which by no means are able to

capture the complete behavioral footprint for a customer during his life-cycle. The reason for

this limitation in feature mining process can partly be attributed on technology limitations in

data capturing and the rest on volume of the data that it entails to deal with.

4

The E-commerce business drives mainly on digital channels which are centered around,

but not limited to online website and mobile applications. These channels act as a single

window between the organization and the customer during his entire tenure. The shopping

activity or the online interaction which can be labeled as a browse session generate valuable

metrics and footprints for customer interaction with the online retailer. The sales generated

by these online sessions, categories of products brought, etc that are more easily contained

in volume, mainly qualify as explicit features for our feature matrix. The user click activity,

browse path behavior and overall web interaction generates several terabytes of data every

single day for an E-commerce retailer operating on a large scale. This valuable user behavior

data that was previously ignored by researchers for feature mining owing to the lack of large

scale data ingestion, storage and computing technology. This data can now be easily extracted

through a feature mining process involving a big data pipeline. The proposed work aims to

lay a firm foundation to develop a comprehensive feature mining process that starts from

definition, extraction and the study of impact these features have on customer churn. We aim

to capture a complete list of implicit and explicit customer footprints through feature matrix

that enables better prediction of customer churn. We finally intend to come up with an end-

to-end framework and make it available for the organization and the academic community for

predicting customer churn for e-commerce business model.

1.3 Organization of Work

The thesis is divided into six chapters that follow a natural progression of our approach to

the solution. Chapter one introduces the E-commerce business model and discusses the existing

techniques and algorithms published for the customer churn. The Second chapter discusses

about the customer engagement model for the e-retailer in consideration, we briefly discuss the

analysis of the types of the segmentation applied for these customers, custom segmentation

[Mah00] that are created on the fly. We then discuss how we mine individual features for

the customer feature matrix for the sample size (segment) of customers. This is done using

various customer channels within the organization that are housed across different systems.

5

The most important are the traditional order management system, Customer data warehouse,

Clickstream logs [SLLM02], etc.

In chapter three, we discuss techniques such as feature selection [GE03] and dimensionality

reduction algorithms to ensure that the feature matrix contains the best contenders to predict

the output accurately and the ones that do not influence the output of the classifier are removed.

This helps our model to increase speed, avoid issues like over fitting and to increase the accuracy

of prediction.

In the fourth chapter, we discuss all the classification and regression models we consider

for predicting customer churn for the chosen feature-set. We try a wide variety of linear and

non-linear models, tree based ensemble models for deciding the best model.

In chapter five, we discuss the results of all the classification algorithms we explored in

chapter four using cross validation techniques [AC+10] employing ROC Curve and Confusion

Matrices. We also discuss an evaluation plan for deciding the best Classifier model based on

maximizing the area under the ROC curve[Bra97] where we empirically tweak the decision

making probability thresholds of the algorithm to get the best possible area under the curve.

In chapter six, we present our conclusions and set the stage for future work in this area.

6

CHAPTER 2. OUR SOLUTION

Keeping e-commerce in mind, we develop a churn model to be able apply to any business.

We use data from a particular source to experiment our model. But we choose to anonymize

the name and background of the data of customers. Our data are from a popular e-retailer who

has a large e-commerce presence across the north American market, whose e-retail website sells

a broad range of products across multiple categories. We use the data to target the prediction

of the churn rates for their B2B customers.

We model the customer churn problem as a binary classifier problem where the output of

the classifier is a boolean output. A ”1” indicates churn and a ”0” indicates being active. This

problem appears like one of the most common machine learning problem that has been solved

with help of classifier algorithms. However, this has rarely been applied to the application

and data we are interested in. To solve the problem, we choose a sample set of customers

for our study who are similar to each other when referring to their size, spending, behavior,

demographics, etc. This is to ensure that our predictive models is applied to the right set

of data. The available data is divided into subsets train, test respectively. The ratio for the

division is empirically decided to be 7:3.

2.1 Customer Attributes

We start with all possible customer attributes as prospective candidate for features in

the feature vector. We make use of data mining tools within the Hadoop Stack to look into

conventional and non-conventional channels to mine customer data, sales, behavioral data

through interaction that the customer have during his interaction with the e-retailer’s website.

Bringing the right features inside the feature vector for training the model and ensuring that

7

they have a definite influence on the independent variable is the goal of feature selection.

Selection of some features are intuitive and mandatory as they have been used by previous

works [YGGH11]. A few others which do not directly impact customer churn may have to

be accounted for by the feature selection method to measure their impact on the independent

variable. for example, features like spending slope of a customer is certainly a feature that has

impact on customer churn and has been used before. Similarly frequency of visits of a customer

to the web page, cart abandonment ratio, etc can also be considered a feature that has sizable

impact on the problem, but has never been used before. Features like customer spending on a

particular product category do not strike at first as a feature that can predict churn, but may

indicate if a particular category of products is driving customer churn across the organization.

Our goal is to best evolve a classifier algorithm that can most optimally classify all of the

existing data points to lead to an effective prediction.

2.2 Data Sources

Enterprise Data warehouse

All major e-retailers maintain large data warehouses that contain present and historic cus-

tomer data, marketing data, sales and promotions data. This warehouse is a master database

with many replications containing all customer data. Multiple teams across the organization

would then use this data to run reports, create business views to suit their requirement. There

exists complex relationship hierarchies that define different shipping and billing address and

points of contacts and ordering privileges of users. This data warehouse also maintains day-

to-day sales data that provides important information about the customer spending behaviors,

their periodic purchase patterns that help the customer engagement managers or CRM tools

like Salesforce to better interact with the customer to drive sales and win their loyalty.

Data Volume

Table 2.1 shows the size of the data we use to develop our model. As stated before, the

e-retailer data we use has a huge customer base. A flavor for volume of this data can be inferred

from this table.

8

Table 2.1 Volume of Customer data

Data Size

Total B2B Customer base 0.5 Million

Customer Segment considered for model 86K

Total Orders / Day 60K

Total Products Sold / Day 0.35 Million

Total Number of Clicks / Day 8 Million

Big Data Lake

While a lot of structured data like the ones we discussed above is stored in Enterprise

data warehouses, there are many other unconventional sources of data like user generated

clickstream, social media data, chat data, product review data that is not readily stored in

enterprise warehouse system as the volume of this data may be huge to fit into a relational

database. Such data can be used by an open source technology for querying and computing on

a non-proprietary hardware that is easily scalable.

2.3 Data Mining and Data Preparation

Hadoop Stack

Hadoop [SKRC10], is defined as a framework that allows for distributed processing of large

data set across multiple clusters of machines. It encompasses a set of software library modules

that provide end to end capabilities for big data processing. Hadoop and its modules are

licensed under Apache commons for open source applications for both commercial and academic

implementations.

Hadoop uses MapReduce [DG08] as a software framework to develop applications to process

a large amount of data in-parallel on large clusters in a reliable, fault-tolerant manner. The

framework involves shuffling data as a key-value pair to accomplish most of the operations in

a distributed fashion.

The data mining modules from the Hadoop framework discussed below are useful for our

application.

9

Hive

Hive [TSA+10] is defined as a data warehouse infrastructure that creates a high level

database layer over existing data residing on Hadoop distributed file system (HDFS) to provide

data summarization and ad-hoc querying capabilities to programmers using SQL languages.

These queries are in turn converted into optimized Map reduce programs that are spawned

across the cluster for execution. Hive is popularly used for data aggregation, Expand Trans-

form Load (ETL) operations, analytics, etc on the hadoop cluster.

Pig

Pig [ORS+08] is defined as a high level data flow language for developing one or more rounds

of map reduce jobs that are highly optimized. This tool is mainly used by inexperienced

java programmers to develop analytic computation. Pig is popularly used to create custom

transformation of data and implement machine learning algorithms.

2.4 Data Mining and Machine Learning Pipeline

Figure 2.1 describes a data pipeline framework which we use to build our model in the

desired form using various big data ETL tools like Pig and Hive. The model building is an

iterative process (not shown in the figure) that continuously monitors the output of the model

and ensures changes in the data mining and data transformation process. The indicative image

for the big data lake is a courtesy of www.zdatainc.com. The logos for the apache hive, apache

Spark and sci-kit learn are sourced copyrighted under creative common license.

2.4.1 Machine learning libraries

Machine learning research is largely driven by a community of researchers and organizations

who have embraced the open source model to drive innovation and collaboration. Thus we have

a good set of libraries and packages readily available to experiment different classification models

such as Python, Spark (Mllib), R, Weka, etc. We use Sci-kit learn [BLB+13] that provides a

number of machine learning modules in the area of regression, clustering, classification, feature

selection, cross validation, etc. This library is available in Python.

10

Figure 2.1 Data pipeline framework from Apache

2.4.2 Customer churn model

The customer churn model is shown in the figure 2.2. We split the model building process

into three discreet phases.

Phase 1 : Data Preparation

The purpose of data preparation step is to process the data to enable the classifier algorithm

to handle all variables effectively. We can use a flat file, or a CSV file, to output the processed

data for the next stage in the pipeline.

Phase 2 : Data Science model building

The task of applying additional intelligence to the data and bring about meaningful predic-

tion models is the purpose of this process. More details of this process are discussed in chapters

three and four.

Phase 3 : Cross validation, Business action and performance observation

The final step in process is to evaluate the accuracy of prediction results and make a

comparison between models. Once an optimal percentage of accuracy in prediction is achieved,

11

Figure 2.2 Customer churn model life cycle

the model may be used in the production systems. Details of this step are discussed in chapter

five.

12

CHAPTER 3. FEATURE SELECTION

Feature identification and feature selection [GE03, KR92] are important steps for all super-

vised machine learning algorithms. Domain expertise and past experience help in identifying

a set of features that play a role in any outcome prediction including customer churn. Feature

Selection prunes the data set by selecting a subset of relevant features from a large pool, thus

preventing problems like overfitting [Haw04], poor efficiency, etc. A small number of features

would make an algorithm to do a poor job in identifying independent variables and result in

high bias, while a large number of features would result in overfitting.

3.1 Previous Work

A study of other industry verticals in customer churn predictive modeling reveals how the

features are identified. Earlier customer churn modeling work published revolved around the

Mobile Telecom industries. Some of the factors traditionally considered by researchers are

demographics, call durations of the customer, spending behaviors, types of plans enrolled,

split up of long distance/short distance calling, etc. These features correspond to data that

are generated at point of sales or checkout and order confirmation page with respect to an

E-commerce or retail industry. These are direct metrics that establish customer behavior

generated through a completed transaction. These are explicit factors. The feature mining

process has undergone improvements over the years as researchers are now considering at

several other behavioral attributes of the customers like the number of calls dropped, network

quality experienced by the customer, etc. These are features where the metrics does not flow

explicitly through a conventional data channel. Domain experience, aggressive data capturing

13

and active customer feedback channels within the organization help to capture such data and

bring them out as new factors into the feature building process.

With respect to the e-retail industry we organize our feature collection process into four

broad categories.

Customer Demographics

Customer demographics refers to factors that define the type, scale and other attributes

concerned with the customer alone which are independent of the e-retailer. They could de-

pend on the size of the company, the number of employees in the company, financial worth,

geographical demography, the type of the industry the customer belongs to, etc. Some of the

features concerning customer demography are categorical like geographical location, industry,

etc while the other features like size of the company, the number of employees, etc are Ordinal

variables. If a particular industry vertical is on the verge of decline, it is not very surprising

to see that customers have reduced their spending leading to churn, similar reasons may be

attributed to region, etc. Thus customer demographic information plays a vital role.

Enterprise Sales Data

The sales metrics and customer buying pattern are captured from point of sales system or

order management systems. Some of the important features part of enterprise sales data are

total sales, recency of sales, frequency of sales, categories of products bought, year to date buy

ratio, etc.

Customer Interaction Data

These metrics are captured from channels that handle and store customer interaction data,

customer survey data, chat data, email marketing, marketing campaign outcomes, etc.

Customer Behavior Data

These metrics are captured from clickstream logs which captures the overall customer in-

teraction with the organization’s e-commerce platform. Some of the important metrics that

are valuable features to consider are session lengths, cart activity, cart abandonment’s, user

navigation experience, User Product finding experience, visit to conversion ratio, response to

marketing emails, etc.

14

A few important features that were considered for building the feature set are included

below. Owing to the confidential nature of this data we use for the model, we refrain from

discussing in detail the individual features that are identified to build feature vector for every

customer. Much of it can be inferred from table 3.1.

Table 3.1: Feature variables for Customer feature matrix

Category Feature Source Type Feature Description

Demographic

Features

Customer Size Customer

Warehouse

Static Size of Customer

Vertical Customer

Warehouse

Static Type of Industry that Customer be-

longs

Location Customer

Warehouse

Static Billing address of the Customer

Customer

Info

Age Customer

Warehouse

Quantitative Age of Customer with the organiza-

tion

Customer Tier Customer

Warehouse

Static Business Tier which the customer is

enrolled in

No. of Registered

Users

Customer

Warehouse

Quantitative Total number of Registered users

enrolled by customer account

Customer

Sales

Annual Sales Customer

Warehouse

Quantitative Avg. annual sales done by the Cus-

tomer

YTD Sales Customer

Warehouse

Quantitative Year To Date Sales done by the Cus-

tomer

Spending Slope Customer

Warehouse

Quantitative Plot of Spend over time

Total Returns Customer

Warehouse

Quantitative Total value of goods returned by

customer

Total Orders Customer

Warehouse

Quantitative Total number of orders placed by

Customer

15

Table 3.1 – Continued


Total Rebate Customer

Warehouse

Quantitative Total rebates offered to the Cus-

tomer

YOY Sales Customer

Warehouse

Quantitative Year over Year drop/rise in Sales

Product

Sales

Total products Customer

Warehouse

Quantitative Total count of unique products sold

Cat 1 sales Customer

Warehouse

Quantitative Percentage of wallet spent on Cat 1

products

Cat 2 sales Customer

Warehouse

Quantitative Percentage of wallet spent on Cat 2

products

. .. Cat n sales Customer

Warehouse

Quantitative Percentage of wallet spent on Cat n

products

Frequency Frequency orders Customer

Warehouse

Quantitative Avg. Frequency at which orders are

placed

Frequency visits Clickstream

Logs

Quantitative Avg. Frequency at which users visit

the site

Days since visit Clickstream

Logs

Quantitative Num. of Elapsed days since last

visit

Avg. visits per

month

Clickstream

Logs

Quantitative Avg. number of visits per month

per user.

Behavioral No. of Active

Users

Clickstream

Logs

Quantitative Number of active users from the ac-

count.

Active User Ratio Clickstream

Logs

Quantitative Ratio of Active/Registered users

Avg. Page Visits Clickstream

Logs

Quantitative Avg. number of page visits in a ses-

sion

16

Table 3.1 – Continued


Avg. product

Views

Clickstream

Logs

Quantitative Avg. number of Product Viewed in

a session

Avg. Session

Length

Clickstream

Logs

Quantitative Avg. length of sessions by users

Cart/View Ratio Clickstream

Logs

Quantitative Ratio of Cart Addition over Product

Views

Cart/Buy Ratio Clickstream

Logs

Quantitative Ratio of Cart Addition over Pur-

chases

Avg. Abandoned

Cart

Clickstream

Logs

Quantitative Avg. worth of Products abandoned

in Cart

Abandoned/Buy

Ratio

Clickstream

Logs

Quantitative Ratio of worth of Cart Abandoned

over Purchases

No. of Futile Ses-

sions

Clickstream

Logs

Quantitative Number of Sessions with no orders

Experience Out of Stock Clickstream

Logs

Quantitative Number of times user had a product

go out of stock

Difficulty at

Checkout

Clickstream

Logs

Quantitative Number of times user had an issue

at checkout

Null results Clickstream

Logs

Quantitative Number of times product Search

yielded null results

3.2 Data Cleaning and Data Pruning

In the process of building a customer feature matrix ingesting many different customer

parameters into the system, there are often certain missing values for few features. As an

example, a feature like Abandoned Cart Worth for a traditional retail customer who might

17

not have an online presence is null, however it may not be correct to substitute this value as 0.

To deal with the missing values it is a common practice to often substitute them with a more

dependable value which could be either the mean, median, most frequently occurring value, etc

in the distribution. This activity, called imputing, handles the missing values. In our work we

experimented with all of these techniques and choose suitably to account for a missing value.

3.3 Feature Selection and Feature Reduction

Feature selection refers to the process of selecting a subset of relevant features from a pool

of features that are initially available. This process reduces the number of features as input

to the model, and therefore, reduces the data acquisition and computation cost. Secondly, it

yields more accurate results. As described in [KR92], Feature selection, as a preprocessing

step to machine learning, has been very effective in reducing dimensionality and irrelevant

data, increasing learning accuracy and improving result comprehensibility.” Feature selection

includes individual or subset selection. Individual feature selection ranks features separately

according to a particular metric where the subset selection takes into account the interaction

and correlation among features. The final goal of feature selection is to have a minimum number

of features that is good enough to capture all of the trends and variations in the output. It is

important to select the right feature set before implementing an effective algorithm.

The important factors to consider when removing a feature from the feature vector include

the noisy nature of the feature, variance, correlation among features, F-anova scores, Regular-

ization, etc. The target of building a feature matrix is not solely to accumulate a a number

of features, but to actually gather features that have sizable impact on the outcome of the

classifier.

3.3.1 Features with low threshold

The features with low variance can be removed [YL04] if a feature fails to satisfy a preset

threshold, as they have no impact on the classification. This approach can be applied to both

supervised and unsupervised learning. A feature is only considered effective when its Variance

is non-zero and exceeds a certain threshold.

18

High Variance leads to low SSE (Sum of Square Errors) while high bias leads to simplicity.

High variance may lead to better performance in the test data set, it leads to a complicate

model with a higher model building time. High bias leads to poor performance on the test

data set even though it may offer a simplistic model. Finding an optimum fit between the two

characteristics is an ideal fit for the feature set.

3.3.2 Features with high correlation: Pearson correlation coefficient

Pearson Correlation [Hal99] measures the linear relationship in any two distributions as-

suming that they are a normal distribution. The correlation results vary from −1 to +1 with 0

implying that there is no correlation. We compute the Pearson correlation and remove features

that have high correlation between them since one of them is redundant. This redundancy

certainly impacts the accuracy of the classification algorithms.

The Pearson Correlation coefficient between two random variables X and Y is defined as

ρ(X,Y ) =Cov(X,Y )√

Var(X)Var(Y ). (3.1)

Multicollinearity occurs when there is high correlation between the predictor variables lead-

ing to errors like unstable estimates of regression co-efficients. Researches use tests like Variance

inflation factors (VIF) to test if multilinearity can be safely ignored. These tests are beyond

the scope of this work and hence ignored. We adopt a baseline approach of looking at instances

of high correlation between predictor variables and found no significant correlation between the

predictor variables.

3.3.3 Imbalance in output class labels

An important observation from our data-sets is the imbalance in data. On an average

about 5-10% customers churn year-on-year basis depending on the segment we are looking at.

This imbalance in distribution consisting returning/non-returning customers is a good recipe

for learning algorithms to classify a large number of customers under returning and still attain

high overall accuracy. There are several works carried out in the past [BVdP09] that specifically

focus on handling imbalance in the data leading to skew the predictions of the model. Thus

19

we employ an in depth cross-validation technique based on confusion matrix, threshold shift

and ROC curve to arrive at the best algorithm to rule out such a bias in our model. We also

assign weights to the feature vectors that are inversely proportional to class frequencies in the

input data as shown in equation 3.2. Here, n samples is the total number of samples in the

data-set, n classes represents the total possible class outcomes from the output label, which

in this case is two. The count of occurrence Y i represents the total number of occurrences of

samples belonging to a given class whose weights we are interested in calculating.

Weight class Yi ∝n samples

(n classes)(count of occurence Yi)(3.2)

3.3.4 Uni-variate feature selection

We consider a few classic feature selection and feature reduction techniques after employ-

ing the baseline methods discussed in the preceding section. The uni-variate feature selec-

tion techniques [SIL07] select the best features based on uni-variate statistical tests. An

important assumption these techniques make is that they consider all the features as inde-

pendent of each other. Some of the most popular techniques of uni-variate feature selection

are Chi − Square tests, F − ANOV A classification tests. While Chi-Squared tests are best

suited in dealing with non-negative features, categorical or sparse data, it is less suited to han-

dle our feature matrix without several modifications. Therefore we consider ANOV A based

F-Classification test [SIL07] to identify important features from the feature matrix.

3.3.4.1 ANOVA F-Classification Test

We use the ANOVA F-test [SIL07] for scoring individual features to the transformed feature

matrix after applying threshold variance and pearson correlation techniques. The F-ANOVA

considers one feature at a time to see how well each continuous variable predicted the class

label. The importance value of each variable is calculated as F score of F-statistic test of

association with the predictor and class label which can also be said target variable.

The F-value is defined as follows:

20

F =MSRMSE

(3.3)

here, MSR is the ”Mean square regression” and MSE is the ”Mean square error”. While

MSR indicates the between group variability. MSE represents the within group variability. The

statistical tests concluding feature significance with the independent variability determines if

the between group variability is a higher than the within group variability. The sample variance

of predictor X for the target class Y = J is given by the following equation:

S2j =

Nj∑i=1

((xij − xj)2)/(Nj − 1) (3.4)

here Nj is the number of cases with Y = j. xj is the sample mean of predictor X for target

class Y = j. x referred to as a grand mean of predictor X given by the following

x =

j∑j=1

(Nj xj)/N (3.5)

F =

∑Jj=1(Nj(xj − x)2)/(J − 1)∑Jj=1(s

2j (Nj − 1))/(N − 1)

(3.6)

Once the F Value for all of the independent predictors are calculated, we employ K%

percentile approach or the K best feature approach to select the features that have the most

impact on the independent variable.

It is evident from Figure 3.1 that many conventional feature variable were scored by the

algorithm as features that influence customer churn. Some of the these include days between

purchase, spending ratio for Year-on-Year, total sales, number of online visits, number of cus-

tomer cross-shopping (dotcom Visits), etc. However, it is definitely important to notice that

there are several unconventional features brought out through this work like futile online ses-

sions, Total worth of cart abandoned online, several product categories like cat 11, cat 12,

cat 13 also influencing customer churn. The above features not only increases the prediction

accuracy of our model, but also provides a valuable input to Business to look into reasons why

certain product categories, prices or customer experience contribute to drive customer churn.

21

Figure 3.1 Customer feature Scoring through Univariate F-score

We rank these features by their scores and iterate the final classifier with variable number of

features (Top N) to arrive at the best solution.

3.3.4.2 Principal Component Analysis

PCA [Jol02] is a multivariate feature reduction technique that reduces the dimension of

the data by finding the first ′s′ orthogonal linear combinations of the original variables with

the largest variance. PCA is defined in such a way that the first principal component has the

largest possible variance. Each succeeding component in turn has the next highest variance

possible under the constraint that it is orthogonal to the preceding components.

Employing PCA algorithm to select the first N features orthogonal to the original feature

set had poor results when this transformed feature set was applied to a classifier in predicting

the customer outcome. Hence we decided not to pursue PCA further.

3.3.4.3 Regularization based Feature Selection

Adding regularization [Ng04] to learning algorithm is one of the ways to do feature selection

and avoid the problem of over-fitting specially when we are handling a lot of sparse features.

Since Regularization penalizes the complexity of learning model using L1, L2 norm. Having

22

Table 3.2 Result of L1 regularization on Feature Matrix

Coefficients Feature Variables

0.164565497 CAT 2

0.345285729 CAT 5

0.796459688 CAT 6

-2.066845099 CAT 12

-1.316095705 CAT 13

0.098768507 CAT 16

-1.16188276 CAT 17

-0.145851299 FUTILE SESSIONS

-0.145851299 NUM VISITS

0.1245 TOTAL REBATE

0.285193414 DOTCOM VISITS

4.470811284 DAYS BETWEEN PURCHASES

0.93979978 RATIO CART ABANDONED

0.5564536 SPENDING SLOPE

sparse solutions decreases the complexity, reduces the number of features and yields better

prediction.

Previous studies [Ng04] have shown that L1-based regularization is superior than L2-based

when there are many features. The complexity of L1-regularization logistic regression is loga-

rithmic in the number of features. The sample complexity of L2-regularization logistic regres-

sion is linear in the number of features.

We discuss more details on regularization for a classification problem in chapter four when

we discuss logistic regression in detail. Table 3.2 shows how the coefficients of each feature

stack up against others when L1 regularization is applied. The features with lower co-efficients

or close to zero co-efficients have marginal impact on the outcome of the classifier and hence

ignored from this table.

23

CHAPTER 4. BINOMIAL CLASSIFICATION

In this chapter, we discuss the most important component of machine learning application

pipeline for our problem, that is the Binomial Classification to predict if a customer would

abandon or stay with the organization.

Through empirically derived assumptions, grid and randomized searches to optimize the

parameters of the model, cross validation techniques and prior understanding of the nature of

algorithms in handling data, we arrive at the best algorithm to experiment and finalize. We

discuss this task in deeper detail in the following sections.

4.1 Previous Work

The most commonly used algorithms used by Data scientists and data analysts are listed

by Rexer Analytics survey through survey polls, KDD cup submissions, etc. Figure 4.1 shows

the results of these analysis.

Figure 4.1 Rexer Analytics Survey on most Commonly used Algorithms

24

It is evident from the above figure that Logistic Regression, Decision Trees, Ensemble meth-

ods, Bayesian, SVM, etc are some of the most popular supervised algorithms in the decreasing

order of their popularity used for classification problems. Upon studying customer churn stud-

ies carried out earlier, we find that these algorithms were consistently used in all these studies

as well.

4.2 Naive Bayes Classifier

Naive Bayes classifiers [Ris01] belongs to the family of probabilistic classifiers based on

applying Bayes theorem with assumptions of independence between the features. Naive Bayes

learning generates a probabilistic model given a training set of instances. Each data point is

represented as a vector of features [x1, x2, x3, ....., xd]. The task is to learn from the data to be

able to predict the most probable class yi ε Classi of a new instance whose class is unknown.

We first introduce the Bayes Theorem which describes the probability of an event, based on

conditions that might be related to the event defined by Eqn. 4.1.

p(yj | x) =p(x | yj)p(yj)

p(x)(4.1)

where p(yj | x) is the probability of an instance x being in class yj

p(x | yj) is the probability of generating instance x given class yj

p(yj) is the probability of occurrence of class yj

p(x) is the probability of x occurring.

Equation 4.1 serializes to

P (yi | x1, x2, ....., xd) =P (yi)P (x1, x2, x3, ...xd | yi)

P (x1, x2, x3, ...., xd)(4.2)

Naive Bayes employs the Bayess theorem to estimate the probabilities of the classes. Here,

P (yi) is the predetermined probability of class which is estimated as its occurrence frequency

in the training data, while P (yi | x1, x2, ....., xd) is the posterior probability of class after

observing the data. P (x1, x2, x3, ...xd | yi) denotes the conditional probability of observ-

ing an instance with the feature vector [x1, x2, x3, ....., xd] among those having class yi. Fi-

25

nally, P (x1, x2, x3, ...., xd) is the probability of observing an instance with the feature vector

[x1, x2, x3, ....., xd] regardless of the class.

Since the sum of the posterior probabilities,∑

yjε classcP (yi | x1, x2, ....., xd) = 1, the de-

nominator on Eqn. 4.2 right hand side is a normalizing factor and thus can be omitted.

P (yi | x1, x2, ....., xd) = P (yi)P (x1, x2, x3, ...xd | yi) (4.3)

A data point is labeled as a particular class if it has the highest posterior probability y(class)

for a given class among all available classes. This is given by

arg maxyiεclass

P (yi)P (x1, x2, x3, ...xd | yi) (4.4)

In order to estimate the term P (yi)P (x1, x2, x3, ...xd | yi) by counting frequencies, one needs

to have a huge training set where every possible combinations [x1, x2, x3, ....., xd] appear many

times to obtain reliable estimates. Naive Bayes solves this problem by its Naive assumption

that features that define instances are conditionally independent given the class. Therefore,

the probability of observing the combination [x1, x2, x3, ....., xd] is simply the product of the

probabilities of observing each individual feature value P (x1, x2, x3, ...xd | yi)∏di=1 P (xi | yi).

Substituting this approximation into the main equation above to derive the Naive Bayes clas-

sification rule.

arg maxyiεclass

P (yi)

d∏i=1

P (xi | yi) (4.5)

As discussed above, for a nominal feature, the probability is estimated as the frequency

over the training data. For continuous feature, there are two solutions. The first one is to

perform discretization on those continuous features, transferring them to nominal ones. The

second solution is to assume that they to follow a normal distribution.

4.3 Logistic Regression

Logistic regression [HJL04] is a representative of discriminative classifier that learns a direct

map from input x to output y by modeling the posterior probability P (y | x) directly. The

26

parametric model proposed by logistic regression is of the form explained below. One of the

simplest equations with which we can represent a logistic regression is a sigmoid function given

by eq. 4.6.

σ(x) =1

1 + e−z(4.6)

We then define the loss function that defines the 0-1 losses for the model.

Loss0/1(z) =

1, if z < 0

0, otherwise

(4.7)

let yε{−1, 1} and z = y.wTx. Here z is positive if y and wTx have same sign, else negative

otherwise.

P (y = −1 | x) =1

1 + exp(w0 +∑d

i=1wixi)(4.8)

P (y = 1 | x) = 1− P (y = −1 | x) (4.9)

The main task of logistic regression is minimizing w so that the average 0 − 1 loss is

minimized over the training points.

minw

n∑i=1

l0/1(y(i).wT .xi) (4.10)

w = [w0, w1, w2, ....., wd]← arg maxw

∏k

P (y(k) | x(k), w) (4.11)

Upon plotting the 0/1 loss function we transform the regression model into a logistic function

whose values vary from 0 to 1 as z goes from −∞ to +∞.

llog(z) = log (1 + e−z) (4.12)

We further solve for w using gradient descent rule. Although designed for continuous

features, logistic regression can still handle nominal feature and missing values effectively.

27

Figure 4.2 Plot of the Logistic Losses

4.3.1 Logistic regression with regularization

Adding Regularization to the learning algorithm avoids over-fitting by removing the un-

wanted features from the data set. The most common ways to achieve regularization is based

on L1 and L2 norm resulting in sparseness thus reducing complexity.

Regularization based logistic regression learns mapping (w) that minimizes logistic loss

on training data with regularization term. For Regularization in Logistic Regression, we use

maximum likelihood function as given in 4.13.

minw

n∑i=1

llog(y(i).wT .xi) + λ‖w‖22 (4.13)

Equation 4.13 has two components to it, the training log-loss function and the model com-

plexity. The λ from the model complexity component is the regularization parameter. This

determines how much of w parameters are inflated. Using Eqn. 4.13 as the cost function, we

can smoothen the output of our hypothesis function to reduce over-fitting. If w is chosen to be

too large, it may smooth out the function too much and cause under-fitting. We frequently ob-

serve that L1 regularization in many models causes many parameters to reduce to 0, resulting

in the parameter vector to be sparse.

28

4.4 Support Vector Machines (SVM)

Another popular classification algorithm used in supervised machine learning is the Support

Vector Machines (SVM) [Joa98]. The goal of SVM is to find the optimal separating hyper

plane which maximizes the margin of the training data by dividing the n-dimensional space

representation of the data into two regions using a hyperplane [YM07]. The SVM methodology

also has solid underpinnings in statistical learning theory. The methodology can be applied

successfully to many linear and non-linear classification problems.

There are many kernel-based functions such as linear kernel function, the normalized poly

kernel, polynomial kernel function, Radial Basis Function (RBF) or Gaussian Kernel and Hy-

perbolic Tangent (Sigmoid) Kernel sigmoid function that can be implemented in SVM [SS01].

SVM’s output a class label, either positive or negative for each sample in our case of binomial

classification: In order to compute metrics like ROC curve, etc. We can also find the distance

between from hyper-plane that separate classes. SVM has many advantages such as obtaining

the best result when deal with the binary representation, able to dealing with low number of

features.

Figure 4.3 Plot of Binomial Classification using SVM

SVM classifiers [EIO14] utilize the hyper-plane to separate classes. Every hyper-plane is

characterized by its direction (w) and (b) which is its exact position in space or threshold. Let

29

us consider xi is an input vector of dimension N. We would have a set of training data along

with the labeled output yiε(−1, 1).

(x1, y1), (x2, y2), (x3, y3), .....(xk, yk) (4.14)

The decision function for the above problem is of the form f(x,w, b) = sign((wxi) +

b), wεRd, bεR where d is the dimension of the class output. The region between the hyper-

plane that separates the two classes in this case is called the margins. Width of the margin is

equal to 12‖w‖ and the underlying goal is to maximize the margin between the hyper-plane.

To satisfy this maximization, we need to minimize f(w, b) = 12‖w‖

2. Minimizing the cost

is a trade-off issue between a large margin and a small number of margin errors. The final

solution to this optimization problem can be formulated as below

w =N∑

(i=1)

λiγiχi (4.15)

Equation 4.15 shows the weighted average of the training features. Here λi is a lagrange

multiplier of the optimization task and γi is a class label. Values of λ’s are non zero for all

points lying inside the margin and on the correct side of the classifier.

To prevent over-fitting by permitting some degree of miss-classifications, a cost parameter

C controls the trade off between allowing training errors and rigid margins. Increasing the

value of C increases the cost of miss-classifying points and may result in a model that may not

generalize well. For our experiments we use a SVM classifier with linear kernel (SVM-L).

4.5 Decision Trees

Decision trees [Qui86] was first introduced in 1966 [HMS66] and currently has become one of

the most widely used and researched machine learning methods especially for applications like

image recognition, artificial intelligence and multi-label classification. As white boxes, decision

trees generate interpretable and understandable models. Induction of decision tree involves

building a tree top-down using divide and conquers strategy. The ultimate goal is recursively

30

partition the training set, choosing one feature to split each time until all or most of instances

in each partition belong to the same class.

A decision tree consists of following main elements:

1. Root

2. Branches and leaves

The branches correspond to possible outcomes of feature value and finally leaves that specify

expected value of the class. Each leaf is assigned to the class that has the majority of instances

inside it. To classify a new instance, one starts at the root and follow a path lead by the nodes

and branches downward, end at a particular leaf and the instance is assigned a class specified

by the leaf.

4.6 Random Forest Classifier

Random Forest classifier [Bre01] is a popular choice for both linear and non-linear classifica-

tion problems that is relatively new. It belongs to a larger class of machine learning algorithms

called ensemble methods.

4.6.1 Ensemble learning : gradient boosting

Ensemble learning, refers to a technique which involves combination of several models to

solve a single predictor. It works by generating multiple classifiers/models which learn which

all make independent prediction. Those predictions are then combined into a single prediction

that should be as good or better than the prediction made by any one classifier. Random forest

is a type of ensemble learning which uses an ensemble of decision trees.

Gradient boosting [Fri01] uses a set of weak learners and delivers improved prediction

accuracy. The outcome of the model at an instance ′t′ is weighed based on the outcome

of previous instant ′t-1′. In Gradient descent shortcomings in predictions are identified by

negative gradients. At each step, a new tree is fit to negative gradients of the previous tree.

31

CHAPTER 5. CLASSIFIER EVALUATION AND RESULTS

5.1 Classifier Performance Evaluation

Although much has been inferred about how different algorithms work to solve the classi-

fication problem, it is only after observing results, one can make accurate assumptions about

the best algorithm that can be used for the given prediction problem. Some algorithms are

well suited for a few types of domains while that may not hold true in all cases. It is mostly

the underlying data that drives the results of different algorithms. To decide a good algorithms

we essentially look at the metrics like accuracy, generality and confidence of prediction. For a

classification problem, researches have been for long using confusion matrix [DG06] for studying

possible outcomes when a classifier is applied on a set of class instances. Since the customer

churn is a binomial classification, we are presented with four possible outcomes of prediction.

5.1.1 Confusion matrix

Figure 5.1 Confusion Matrix :Binomial Classification (Source : Wikipedia)

32

From the confusion matrix shown in Fig 5.1, there are exactly 4 possible outcomes from a

binomial classifier model. A correctly classified instance is counted as a true positive (TP) or

a true negative (TN) if its actual class is positive or negative respectively. A positive instance

which is wrongly classified as negative is counted as a false negative (FN). A negative instance

which is wrongly classified as positive is counted as a false positive (FP). The total number of

positive instances in the dataset is T = FN +TP and the total number of negative instances is

F = TN +FN . Based on a confusion matrix, the most common evaluation metrics are overall

accuracy, true positive rate and false positive rate.

Accuracy =TP + TN

N + P

The true positive rate (also known as hit rate or the Precision) is the proportion of positive

instances that a classifier captures.

Precision =TP

P

The Recall is the ratio of number positive instances(TP) over the sum of true positives

(TP) and False negatives (FN).

Recall =TP

TP + FN

The false positive rate (also known as false alarm rate) is the proportion of negative instances

that a classifier wrongly flagged as positive.

FPRate =FP

N

More than the accuracy, we are interested in increasing the TP rate of our classifier. A

customer being a returning Customer wrongly classified as non-returning by the classifier thus

falling in False Positive quadrant has lesser impact than an abandoning customer wrongly

classified as returning customer thus falling in False Negative quadrant. In this case, we ignore

a potential customer who might abandon the company in the near future.

33

5.1.2 Reverse operating characteristics (ROC) curve

When TP rate is plotted as against FP rate as seen in fig 5.2, one obtains a receiver

operating characteristics (ROC) graph [DG06]. Each classifier is represented by a point on

ROC graph. A perfect classifier is represented by point (0, 1) on ROC graph which classifies

all positive and negative instances correctly with 100% TP rate and 0% FP rate. The diagonal

line demonstrates classification that is based completely on random guesses. In that case, one

can achieve the desired TP rate but unfortunately also gain equally high FP rate. The major

goal of churn prediction is to detect churn. Therefore, a suitable classifier is the one having

high TP rate and low FP rate given that churn is the positive class. Such classifier is located

at the upper left corner of ROC graph.

A classifier provides output in probabilistic form, with exceptions for a few algorithm

Pr(churn | x), the probability that an instance belongs to the positive class. If this probabil-

ity is above the predefined threshold Pr(churn | x) > Θ, an instance is classified as positive,

otherwise negative. A classifier using high value for Θ is considered conservative. It classifies

positive instances only with strong evidence so it makes few FP mistakes but at the same time

has low TP rate. A classifier using low value for threshold is considered liberal. It classifies

positive instances with weak evidence so it achieve high TP rate but also makes many FP

mistakes. When the performance of a classifier is plotted on ROC graph with value of varied

from 0 to 1, an ROC curve will be formed. It demonstrates the trade-off between TP rate and

FP rate.

Figure 5.2 shows an example of ROC curve. The red points are random guess classifiers.

The pink and yellow points represent the performance of two classifiers using different values

of Θ. The higher is Θ, the more conservative a classifier becomes. Conservative classifiers

locate at the lower part of ROC graph. In contrast, the lower is Θ, the more liberal a classifier

becomes. Liberal classifiers locate at the upper part of ROC graph. Given two ROC curves, the

one that is further to the left of the random diagonal is preferred. For this reason, area under

ROC curve (AUC), a quantity that measures the overall average performance of a classifier is

introduced. The advantage of AUC is unlike many other evaluation metrics such as the overall

34

Figure 5.2 Example of an ROC Curve

accuracy, AUC is not affected by the class distribution, ratio. In the case of unbalanced class

distribution such as in churn prediction data, AUC yields a fair measure for model comparison.

Based on the classification result, the marketing teams focuses on customers that are classified

as positive or at risk of churn. However, the business will probably be unable to react to all

positive classified instances due to the lack of resources. Besides that, quality is more important

than quantity. The question is not only how many percent of defaulters, in this case inactive

customers a model can covers but also with what reliability. A classifier which covers 30%

of defaulters with 90% reliability may be more preferable than the one which covers 50% of

churners with 60% reliability. The choice is up to us to evaluate the cost of ignoring customers

in churn risk versus the cost of offering unnecessary special treatment for customers that will

are not likely to leave. [TXH+04] suggests a formula to calculate the confidence level of a

prediction.

35

Confidence of Class Y i =Pr(churn | x)− 0.5

0.5(5.1)

5.2 K-fold Strategy for Cross Validation

We used 5-fold CV by randomly splitting the training dataset (D) into five mutually exclu-

sive subsets (D1, D2, D3, D4, D5) of approximately equal size. Each classification model was

trained and tested five times, where each time (t ∈ 1, 2, 3, 4, 5), it was trained on all except one

fold (D Dt) and tested on the remaining fold (Dt). The accuracy and AUC measures were

averaged over the particular measures of the five individual test folds which we shall see in the

further section

5.3 Results from Classifier Models

We experiment with a mixture of both parametric and non-parametric classifiers as dis-

cussed in Chapter four. Logistic Regression with L1 regularization, SVD and Gradient Boost

classifier have the highest accuracy in predicting the customer churn from the data-set. More

importantly their precision scores are at the highest compared to other classifiers.

An example of a bad classifier, miss-classifying the entire distribution of abandoning cus-

tomers as FN may still achieve an overall accuracy greater than 90% owing to the imbalance

in the distribution of data. But such a classifier is of no use to us as we are interested in our

precision rates that determines how effectively does a model predicts churn.

Let us consider how a variant of Ensemble family of classifiers like Gradient boost classifier

performs on a given sample test data-set. The confusion Matrix for the prediction is as shown

in table 5.3. The overall accuracy for this classifier is 90.61%. Although, this accuracy is

unusually high for a learning classifier in predictive analytics, since we prune the noisy features

using feature selection algorithm, apply weights for the learning algorithm and tune the hyper

parameters for the model continuously, the classifier is able to attain this level of accuracy. The

precision for the model is close to 75% which indicates that we are able to identify every three

out of four churning customer.

36

We observe from Table 5.1 and 5.2 that the accuracy and other indicators for several

classifiers used to predict customer churn. As is evident from the tables, Gradient boost

classifier outperforms SVM and regularized logistic regression to give the highest accuracy

and precision. Running this algorithm on a feature-set which has already undergone feature

selection increases the predicting accuracy further as seen in 5.2.

5.4 ROC Curve

We have discussed that a larger area under the curve (AUC) is indicates a better estimator .

The ”steepness” of ROC curves is also important, since it is ideal to maximize the true positive

rate while minimizing the false positive rate.

The below plots for Logistic Regression and SVM classifiers shows the ROC response for

different datasets, created from K-fold cross-validation techniques. Taking all of these curves,

it is possible to calculate the mean area under curve, and see the variance of the curve when

the training set is split into different subsets. This roughly shows how the classifier output

is affected by changes in the training data, and how different the splits generated by K-fold

cross-validation are from one another.

5.4.1 Regularized logistic regression: L1-norm

The graph 5.4 shows the ROC curve for L1-norm regularized logistic regression with K-fold

cross validation technique. Except Fold 0 which has an area of 0.71, all of the other folds

have fairly consistent AUC indicating that our dataset is fairly robust. The mean ROC as

indicated by the plot is 0.77. It is apparent from the graph that ROC fold 2 strategy has the

best efficiency for the model.

5.4.2 SVM classifier with F-anova feature selection

The graph 5.5 shows the ROC curve for SVM Classifier with K-fold cross validation tech-

nique. The probabilistic estimation of classes for an SVM was made available not until recently

by the work proposed in [Pla99], called the Platt scaling to optimize internal variables to also

produce a probabilistic score. As with the earlier case, all of the folds for different combination

37

of Test, Training have fairly consistent AUC. The mean ROC as indicated by the plot is again

0.77. Once the model is finalized by tuning the hyper-parameters, the area under the ROC

Curve can be used to tweak the threshold probability such that we can tune the classifier to

return the best predictions for a given quadrant in the confusion matrix as desired by us. This

tweak in threshold probability identified by TP Rate and FP Rate populates the graph for the

ROC curve.

We can experiment bringing down the threshold to lower level than default value which is

0.5 which helps us decide the optimal point for the model depending on how accurately we

want to identify the churning customers at the cost of mis-classifying non-churning customers.

As we bring down the threshold probability, our model becomes more liberal which increases

the recall value for the classifier.

5.5 Best Performing Classifiers

The following classifiers are the best performing classifiers in the order of their appearance

at predicting the customer churn for the e-retailer in consideration.

Gradient Boost Ensemble Classifier

SVM Classifier with linear Kernel

Regularized Logistic Regression with L1 norm

Some of the other classifiers which were experimented on this data include Naive Bayes,

Artificial Neural Networks, KNN classifier, Random Forests, Decision Tree, etc.

38

Figure 5.3 Confusion Matrix for Gradient Boost Classifier

Table 5.1 Classification without Feature Selection

Classifier Algorithm TN FN FP TP Accuracy Precision Recall

Naive Bayes 6004 484 246 86 0.892 0.259 0.1508

Support Vector Machines 5575 913 114 218 0.849 0.656 0.192

Random Forest 6428 5 326 6 0.943 0.018 0.545

Gradient Boost 5875 613 68 264 0.900 0.795 0.301

Logistic Regression 5423 921 142 190 0.835 0.572 0.208

Table 5.2 Classification with F-Anova Feature Selection

Classifier Algorithm TN FN FP TP Accuracy Precision Recall

Naive Bayes 6372 121 286 41 0.940 0.125 0.253

Support Vector Machines 5520 968 68 259 0.850 0.79 0.211

Random Forest 6481 12 326 1 0.950 0.003 0.077

Gradient Boost 6903 590 53 274 0.917 0.838 0.317

39

Figure 5.4 ROC Curve for Logistic Regression with L1 Regularization

40

Figure 5.5 ROC Curve for SVM classifier and F-score based Feature Selection

41

CHAPTER 6. CONCLUSIONS

6.1 Summarization of Results

We present how a Big Data infrastructure can drive an end-to-end pipeline for predict-

ing customer attrition in E-commerce world. The results we derive from this study and the

contributions we make can be summarized below.

• We discuss how e-commerce organizations can design their data warehouse and big data

infrastructure such that they can readily be used to drive data science and analytic

applications with minimal effort in purposing this data.

• We discuss some of the popular tools that are available through the hadoop stack which

can be used to transform raw data at huge scale, perform aggregations, filter and update

continuously so that a data science pipeline can be built.

• We prove our novel proposal on how implicit features obtained through through click-

stream/web logs and marketing campaign data mining, etc act as significant features

in establishing customer behavior, experience and hence can be used as features to find

customer churn.

• Through feature Selection methodologies, we establish how several product categories

( CAT 10, CAT 11, CAT 12, etc ), web channel experience ( Futile Sessions,Dotcom

shopping, Time spent, etc ), cart activity (Carts Abandoned, etc) all play significant role

in driving customer churn.

• Through cross validation techniques we establish that Gradient Boost Ensemble classi-

fier, SVD Classifier, Logistic Regression with L1 regularization are the best models in

predicting customer churn.

42

• Lastly but not least, the most important contribution from our work is an end to end

generic Customer churn model consisting of both data and algorithmic pipeline that is

applicable in a large E-commerce industry or similar industry using Big Data.

6.2 Future Work

There is a continous scope for improving the Feature Engineering process and the model

building process from where we have currently stand through this work. This activity is contin-

uous and iterative in nature. An immediate addition to improve the current results are using

Grid search functionality to do hyper-parameter tuning to Gradient Boosting Classifier which

happens to be our best classifier. Another important avenue worth exploring for addressing

customer churn is the application of Time series analysis to this problem. As the businesses

grow and get more complex there are more additional data sources, channels that continuously

open up and may hold valuable information. This needs to be captured and harnessed for such

or similar applications.

43

BIBLIOGRAPHY

[AC+10] Sylvain Arlot, Alain Celisse, et al. A survey of cross-validation procedures for model

selection. Statistics surveys, 4:40–79, 2010.

[AKR08] Dudyala Anil Kumar and V Ravi. Predicting credit card customer churn in banks

using data mining. International Journal of Data Analysis Techniques and Strate-

gies, 1(1):4–28, 2008.

[Alp14] Ethem Alpaydin. Introduction to machine learning. MIT press, 2014.

[BLB+13] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller,

Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grob-

ler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gael Varo-

quaux. API design for machine learning software: experiences from the scikit-learn

project. In ECML PKDD Workshop: Languages for Data Mining and Machine

Learning, pages 108–122, 2013.

[Bra97] Andrew P Bradley. The use of the area under the roc curve in the evaluation of

machine learning algorithms. Pattern recognition, 30(7):1145–1159, 1997.

[Bre01] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.

[BVdP09] Jonathan Burez and Dirk Van den Poel. Handling class imbalance in customer

churn prediction. Expert Systems with Applications, 36(3):4626–4636, 2009.

[CFS12] Zhen-Yu Chen, Zhi-Ping Fan, and Minghe Sun. A hierarchical multiple kernel

support vector machine for customer churn prediction using longitudinal behavioral

data. European Journal of operational research, 223(2):461–472, 2012.

44

[CVdP09] Kristof Coussement and Dirk Van den Poel. Improving customer attrition predic-

tion by integrating emotions from client/company interaction emails and evaluating

multiple classifiers. Expert Systems with Applications, 36(3):6127–6134, 2009.

[DG06] Jesse Davis and Mark Goadrich. The relationship between precision-recall and roc

curves. In Proceedings of the 23rd international conference on Machine learning,

pages 233–240. ACM, 2006.

[DG08] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on

large clusters. Communications of the ACM, 51(1):107–113, 2008.

[DL97] Manoranjan Dash and Huan Liu. Feature selection for classification. Intelligent

data analysis, 1(3):131–156, 1997.

[EIO14] Nadir Omer Fadl Elssied, Othman Ibrahim, and Ahmed Hamza Osman. A novel fea-

ture selection based on one-way anova f-test for e-mail spam classification. Research

Journal of Applied Sciences, Engineering and Technology, 7(3):625–638, 2014.

[FP08] Jillian Dawes Farquhar and Tracy Panther. Acquiring and retaining customers

in uk banks: An exploratory study. Journal of Retailing and Consumer Services,

15(1):9–21, 2008.

[Fri01] Jerome H Friedman. Greedy function approximation: a gradient boosting machine.

Annals of statistics, pages 1189–1232, 2001.

[GE03] Isabelle Guyon and Andre Elisseeff. An introduction to variable and feature selec-

tion. The Journal of Machine Learning Research, 3:1157–1182, 2003.

[Gef00] David Gefen. E-commerce: the role of familiarity and trust. Omega, 28(6):725–737,

2000.

[GL99] William Gropp and Ewing Lusk. Reproducible measurements of mpi performance

characteristics. In Jack Dongarra, Emilio Luque, and Toms Margalef, editors, Re-

cent Advances in Parallel Virtual Machine and Message Passing Interface, volume

45

1697 of Lecture Notes in Computer Science, pages 11–18. Springer Berlin Heidel-

berg, 1999.

[Hal99] Mark A Hall. Correlation-based feature selection for machine learning. PhD thesis,

The University of Waikato, 1999.

[Haw04] Douglas M Hawkins. The problem of overfitting. Journal of chemical information

and computer sciences, 44(1):1–12, 2004.

[HHR10] Muzammil Hanif, Sehrish Hafeez, and Adnan Riaz. Factors affecting customer

satisfaction. International Research Journal of Finance and Economics, 60(1):44–

52, 2010.

[HJL04] David W Hosmer Jr and Stanley Lemeshow. Applied logistic regression. John Wiley

& Sons, 2004.

[HMS66] Earl B Hunt, Janet Marin, and Philip J Stone. Experiments in induction. 1966.

[Joa98] Thorsten Joachims. Text categorization with support vector machines: Learning

with many relevant features. Springer, 1998.

[Jol02] Ian Jolliffe. Principal component analysis. Wiley Online Library, 2002.

[KR92] Kenji Kira and Larry A Rendell. A practical approach to feature selection. In

Proceedings of the ninth international workshop on Machine learning, pages 249–

256, 1992.

[KZP07] Sotiris B Kotsiantis, I Zaharakis, and P Pintelas. Supervised machine learning: A

review of classification techniques, 2007.

[Mah00] Balasubramaniam Mahadevan. Business models for internet-based e-commerce: An

anatomy. California management review, 42(4):55–69, 2000.

[MBN02] Luis Carlos Molina, Lluıs Belanche, and Angela Nebot. Feature selection algo-

rithms: a survey and experimental evaluation. In Data Mining, 2002. ICDM 2003.

Proceedings. 2002 IEEE International Conference on, pages 306–313. IEEE, 2002.

46

[MCB+11] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs,

Charles Roxburgh, and Angela H Byers. Big data: The next frontier for inno-

vation, competition, and productivity. 2011.

[MCeC13] Vera L Migueis, Ana Camanho, and Joao Falcao e Cunha. Customer attrition in

retailing: an application of multivariate adaptive regression splines. Expert Systems

with Applications, 40(16):6225–6232, 2013.

[Ng04] Andrew Y Ng. Feature selection, l 1 vs. l 2 regularization, and rotational invariance.

In Proceedings of the twenty-first international conference on Machine learning,

page 78. ACM, 2004.

[ORS+08] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew

Tomkins. Pig latin: a not-so-foreign language for data processing. In Proceedings of

the 2008 ACM SIGMOD international conference on Management of data, pages

1099–1110. ACM, 2008.

[Pla99] John C. Platt. Probabilistic outputs for support vector machines and comparisons

to regularized likelihood methods. In ADVANCES IN LARGE MARGIN CLASSI-

FIERS, pages 61–74. MIT Press, 1999.

[Qui86] J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986.

[Ris01] Irina Rish. An empirical study of the naive bayes classifier. In IJCAI 2001 workshop

on empirical methods in artificial intelligence, volume 3, pages 41–46. IBM New

York, 2001.

[SAP02] Srini S Srinivasan, Rolph Anderson, and Kishore Ponnavolu. Customer loyalty in e-

commerce: an exploration of its antecedents and consequences. Journal of retailing,

78(1):41–50, 2002.

[SIL07] Yvan Saeys, Inaki Inza, and Pedro Larranaga. A review of feature selection tech-

niques in bioinformatics. bioinformatics, 23(19):2507–2517, 2007.

47

[SKRC10] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The

hadoop distributed file system. In Mass Storage Systems and Technologies (MSST),

2010 IEEE 26th Symposium on, pages 1–10. IEEE, 2010.

[SLLM02] Mark Sweiger, Jimmy Langston, Howard Lombard, and Mark R Madsen. Click-

stream data warehousing. John Wiley & Sons, Inc., 2002.

[SR15] G Ganesh Sundarkumar and Vadlamani Ravi. A novel hybrid undersampling

method for mining unbalanced datasets in banking and insurance. Engineering

Applications of Artificial Intelligence, 37:368–377, 2015.

[SS01] Bernhard Scholkopf and Alexander J Smola. Learning with kernels: support vector

machines, regularization, optimization, and beyond. MIT press, 2001.

[TSA+10] Ashish Thusoo, Zheng Shao, Suresh Anthony, Dhruba Borthakur, Namit Jain, Joy-

deep Sen Sarma, Raghotham Murthy, and Hao Liu. Data warehousing and analytics

infrastructure at facebook. In Proceedings of the 2010 ACM SIGMOD International

Conference on Management of data, pages 1013–1020. ACM, 2010.

[TXH+04] Weida Tong, Qian Xie, Huixiao Hong, Leming Shi, Hong Fang, and Roger Perkins.

Assessment of prediction confidence and domain extrapolation of two structure-

activity relationship models for predicting estrogen receptor binding activity. En-

vironmental health perspectives, pages 1249–1254, 2004.

[WC02] Chih-Ping Wei and I-Tang Chiu. Turning telecommunications call details to churn

prediction: a data mining approach. Expert systems with applications, 23(2):103–

112, 2002.

[XLNY09] Yaya Xie, Xiu Li, EWT Ngai, and Weiyun Ying. Customer churn prediction using

improved balanced random forests. Expert Systems with Applications, 36(3):5445–

5449, 2009.

48

[YGGH11] Xiaobing Yu, Shunsheng Guo, Jun Guo, and Xiaorong Huang. An extended support

vector machine forecasting framework for customer churn in e-commerce. Expert

Systems with Applications, 38(3):1425–1430, 2011.

[YKG10] Sangho Yoon, Jim Koehler, and Adam Ghobarah. Prediction of advertiser churn

for google adwords. 2010.

[YL04] Lei Yu and Huan Liu. Efficient feature selection via analysis of relevance and

redundancy. The Journal of Machine Learning Research, 5:1205–1224, 2004.

[YM07] Seongwook Youn and Dennis McLeod. A comparative study for email classifica-

tion. In Advances and innovations in systems, computing sciences and software

engineering, pages 387–391. Springer, 2007.

Enhanced feature mining and classifier models to predict ...

Documents

Transcript of Enhanced feature mining and classifier models to predict ...