MV5: A Clinical Decision Support Framework for Heart Disease Prediction Using Majority Vote Based...

Arabian Journal for Science and Engineering ISSN 1319-8025 Arab J Sci Eng DOI 10.1007/s13369-014-1315-0 MV5: A Clinical Decision Support Framework for Heart Disease Prediction Using Majority Vote Based Classifier Ensemble Saba Bashir, Usman Qamar, Farhan Hassan Khan & M. Younus Javed

RESEARCH ARTICLE - COMPUTER ENGINEERING AND COMPUTER SCIENCE

MV5: A Clinical Decision Support Framework for Heart Disease Prediction Using Majority Vote Based Classifier Ensemble

Saba Bashir · Usman Qamar · Farhan Hassan Khan · M. Younus Javed

Received: 11 October 2013 / Accepted: 28 February 2014 © King Fahd University of Petroleum and Minerals 2014

Abstract The medical diagnosis process can be interpreted as a decision-making process during which the physician induces the diagnosis of a new and unknown case from an available set of clinical data and using his/her clinical experience. This process can be computerized in order to present medical diagnostic procedures in a rational, objective, accurate and efficient way. In the last few decades, many researchers have focused on developing effective methods for intelligent heart disease prediction and decision support systems. For such a system, high accuracy of prediction is paramount. In this research, an ensemble classifier is proposed, which uses a majority vote-based scheme for heart disease data classification and prediction. The five heterogeneous classifiers used to construct the ensemble model are as follows: Naïve Bayes, decision tree based on Gini Index, decision tree based on information gain, memory-based learner and support vector machine. Five datasets from different data repositories are employed for testing the effectiveness of the ensemble model. Each dataset has different types of attributes, for instance binary, real, continuous, categorical, etc. Experimental results with stratified cross validation show that the proposed MV5 framework deals with all the attribute types. MV5 has achieved an accuracy of 88.52 % with 86.96 % sensitivity, 90.83 % specificity and 88.85 % f-measure. Comparison of the proposed MV5 model with individual classifiers shows an increase in average accuracy, sensitivity, specificity and f-measure of about 14, 11, 17 and 18 %, respectively.

Electronic supplementary material The online version of this article (doi:10.1007/s13369-014-1315-0) contains supplementary material, which is available to authorized users.

S. Bashir · U. Qamar · F. H. Khan (B) · M. Y. Javed
Department of Computer Engineering, College of Electrical and Mechanical Engineering, National University of Sciences and Technology (NUST), Islamabad, Pakistan
e-mail: [email protected]

S. Bashir
e-mail: [email protected]

U. Qamar
e-mail: [email protected]

M. Y. Javed
e-mail: [email protected]

Keywords Ensemble · Majority vote · Cross validation · Heterogeneous classifiers · Naïve Bayes · Decision tree · Gini Index · Information gain · Memory-based learner · Support vector machine


1 Introduction

Data mining is a process of analyzing and identifying previously unknown and hidden patterns, relationships and knowledge from large datasets, which was not possible with traditional techniques [1]. According to recent research, data mining techniques are very helpful in the diagnosis of several diseases such as cancer [2], stroke [3], diabetes [4] and heart disease [5].

The major challenge that healthcare organizations are facing is the provision of quality services. Quality service corresponds to diagnosing a patient accurately and then providing proper treatment. Poor clinical decisions could result in adverse consequences and unacceptable results. An intelligent computer-based information and decision support system can achieve great accuracy in heart disease classification and prediction. The main motivation of this research is to transform data into useful information, which can help practitioners to make intelligent clinical decisions, reduce medical errors and enhance patient safety.

In the last few decades, multiple data mining techniques have been widely accepted for intelligent heart disease prediction. Various machine learning algorithms play a vital role in clinical decision support. Table 1 shows a comparison of different data mining techniques used on heart disease datasets. It can be seen from Table 1 that existing work largely focuses on applying a single classifier for heart disease prediction. For example, Pattekari and Parveen [6] presented a heart disease prediction system based on the Naïve Bayes algorithm to predict the hidden patterns in a given dataset. The proposed technique only works on categorical data and uses only one classifier. Other data types should be considered, and other data mining techniques such as time series, clustering and association mining can be incorporated to improve the results.

Peter and Somasundaram [7] used pattern recognition and data mining techniques for risk prediction in the medical field, specifically for cardiovascular disease. They concluded that Naïve Bayes has better accuracy than the other algorithms. The proposed technique requires the input attribute set in ASCII file format and uses only numerical attributes. Ghumbre et al. [8] presented a heart disease prediction system using a radial basis function network structure and a support vector machine. The analysis shows that the results obtained from the support vector machine algorithm are as good as those of the radial basis function network. This technique is affected by the data acquisition method used for input of the dataset. Chitra and Seenivasagam [9] adopted a supervised learning algorithm to predict heart disease in a patient at an early stage. The proposed classifier is a cascaded neural network (CNN) with hidden neurons. The high specificity of CNN shows that it can predict a healthy individual, whereas the high sensitivity of SVM indicates that it has a high probability of disease prediction.

Many researchers have used decision trees and their combination with other algorithms to solve various biological problems [10]. Srinivas et al. [11] analyzed various data mining techniques including rule-based, decision tree, Naïve Bayes and artificial neural network and stated that their proposed system can easily answer complex queries for the prediction of heart attack. The proposed technique used only 15 attributes for prediction, and its accuracy varies from dataset to dataset. Shouman et al. [12] suggested diagnosis of heart disease using a decision tree algorithm with information gain, Gini Index and gain ratio. Other discretization techniques such as chi merge, entropy, equal partitioning and equal width are used to prune the trees, resulting in accurate results. The analysis shows that equal frequency discretization in combination with gain ratio produces better results and enhances the accuracy of heart disease prediction. However, the technique uses a small training dataset, whereas a larger dataset can further enhance the testing results. Anbarasi et al. [13] predicted the presence of heart disease more accurately with a reduced set of attributes. Three classification algorithms, decision tree, Naïve Bayes and classification by clustering, are applied on the reduced dataset. The results show that the decision tree outperforms the other two techniques, with a high model construction time. However, the intensity of the disease cannot be predicted. A fuzzy learning model can be used to identify the intensity of cardiac disease.

Table 1 Data mining techniques used on heart disease datasets

| Author/year/reference | Technique | Specificity (%) | Sensitivity (%) | Accuracy (%) |
|---|---|---|---|---|
| Das et al. 2009 [5] | Neural network ensemble | 95.91 | 80.95 | 89.01 |
| Ghumbre et al. 2011 [8] | Support vector machine | 88.50 | 84.06 | 85.05 |
| | Radial basis function | 82.10 | 82.40 | 82.24 |
| Chitra et al. 2013 [9] | Cascaded neural network | 87 | 83 | 85 |
| | Support vector machine | 77.5 | 85.5 | 82 |
| Shouman et al. 2011 [12] | Nine voting equal frequency discretization gain ratio decision tree | 85.2 | 77.9 | 84.1 |
| Chen et al. 2011 [15] | Artificial neural network | 70 | 85 | 80 |
| Tu et al. 2009 [20] | J4.8 decision tree | 84.48 | 72.01 | 78.9 |
| | Bagging algorithm | 86.64 | 74.93 | 81.41 |
| Shouman et al. 2012 [21] | K mean clustering with Naïve Bayes algorithm | 76.59 | 69.93 | 78.62 |
| Shouman et al. 2013 [22] | Gain ratio decision tree | 81.6 | 75.6 | 79.1 |
| | Naïve Bayes | 80.8 | 78 | 83.5 |
| | K nearest neighbor | 85.1 | 76.7 | 83.2 |

Chen et al. [15] proposed a heart disease prediction system for identifying heart disease in patients using the learning vector quantization (LVQ) classification algorithm. The results are displayed using an ROC curve and show 80 % accuracy. The proposed technique can be enhanced by adding text mining along with the data mining technique. Text mining can mine the vast amount of unstructured data available in heart disease databases. Jabbar et al. [16] discussed an associative classification and genetic algorithm for heart disease prediction, which results in higher accuracy and interestingness value. The proposed technique uses the entire input attribute set for prediction, which can be improved by using a reduced number of attributes and only those attributes which contribute more toward the diagnosis of disease.

Although extensive research has been conducted in the area of heart disease prediction, there is still room for improvement in achieving higher levels of prediction accuracy. One way to overcome the limitations of a single classifier is to use an ensemble model. An ensemble model is a multi-classifier combination model that is precise in its decision because the same problem is solved by different trained classifiers, which reduces the variance of error estimation [17].

Many ensemble approaches have been introduced in recent research. Das et al. [5] introduced an ensemble-based method for heart disease diagnosis, which combines individual neural networks trained on the same task and results in increased generalization. The proposed technique uses only one UCI heart disease database, and more accurate results can be obtained by incorporating more datasets. Helmy et al. [18] used an ensemble model based on SVM, ANN and ANFIS to perform more accurate predictions. A bagging sample approach is used to train the individual models. The results reveal that the heterogeneous ensemble approach outperforms the individual models and results in more accurate prediction. The proposed approach does not use any cross validation technique and is applied on only two datasets.

This research paper presents an intelligent heart disease prediction model using an ensemble classifier based on different data mining techniques, namely Naïve Bayes (NB), decision tree induction using information gain (DT–IG), support vector machine (SVM), decision tree induction using Gini Index (DT–GI) and memory-based learner (MBL). The proposed model (MV5) integrates these classifiers using a majority voting scheme [19]. MV5 performs better than the individual classifiers on most of the datasets, which contain different sets of attributes, and results in reliable performance and high prediction accuracy. The best performance results achieved by MV5 are accuracy 88.52 %, sensitivity 86.96 %, specificity 90.83 % and f-measure 88.85 %. The MV5 ensemble framework predicts healthy/sick individuals with an accuracy improvement of around 14 % on average as compared to individual classifiers. Average sensitivity, specificity and f-measure improvements are around 11, 17 and 18 %, respectively.

The rest of the paper is organized as follows: Sect. 2 is related to materials and methods used in the proposed research, while the proposed approach is defined in Sect. 3. Section 4 presents the evaluation measures, whereas Sect. 5 corresponds to experimental results and discussion. Finally, Sect. 6 summarizes the work which has been done.

2 Dataset Information

The data used in the proposed research consist of five different datasets taken from different data repositories. Four datasets (SPECT, SPECTF, heart disease and Statlog) are taken from UCI [23], and the fifth dataset (Eric) is taken from ricco [24]. Each dataset has a certain feature set that will ultimately determine whether a person has heart disease or not. The final prediction depends on the class label. In order to maintain consistency, the class labels of each dataset are replaced with 0 and 1, where 0 represents the absence of disease and 1 indicates the presence of heart disease. The sample instances from the five datasets are shown in Online Resource 1 (supplementary file). Each dataset is explained as follows:

2.1 SPECT Dataset

The dataset is composed of cardiac single photon emission computed tomography (SPECT) images. There are two categories for each instance: normal and abnormal. The dataset consists of 267 instances having 44 continuous feature patterns for each instance. The feature pattern is further refined to obtain a set of 22 binary features and 1 class label attribute. The overall diagnosis is in the form of (0, 1). The training data have 240 instances, whereas the testing data are based on 27 instances. A sample from the SPECT dataset is given in Table 1 (Online Resource 1).

2.2 SPECTF Dataset

The SPECTF dataset consists of cardiac SPECT images. It has 267 instances, and each instance has 45 attributes. Forty-four attributes are continuous valued, and 1 attribute represents the class. There are two categories for patients: normal and abnormal. 0 represents the absence of disease, and 1 shows the presence of disease. The training dataset has 240 instances, whereas the testing set is composed of 27 instances. A sample dataset is shown in Table 2 (Online Resource 1), where columns indicate the instances and rows show the feature set for each instance.

2.3 Statlog Dataset

The Statlog dataset consists of 270 instances, with 243 training instances and 27 testing instances. The dataset consists of 13 attributes that are extracted from a larger set of 75. Attributes are of different types such as real, ordered, binary and nominal. The prediction is based on two values (1, 2): 1 indicates the absence of heart disease, whereas 2 shows the presence of the disease. The goal state (G) therefore has two classes (1, 2), which are replaced with (0, 1) for consistency. The updated dataset is shown in Table 3 (Online Resource 1).

2.4 Heart Disease Dataset

The UCI heart disease dataset is composed of healthy as well as heart disease patient records. There are 303 records in total in the given dataset. The training set is composed of 272 instances, whereas the test set consists of 31 instances. The feature space contains 14 attributes, where 13 attributes represent vital signs and one attribute is the goal class (0, 1); 0 represents the absence of heart disease, and 1 shows the presence of disease. A sample of the heart disease dataset from the UCI repository is shown in Table 4 (Online Resource 1).

2.5 Eric Dataset

The Eric heart disease dataset belongs to the ricco database. There are 209 instances in the dataset. The training set has 188 instances, and the testing set has 21 instances. The Eric dataset has eight attributes in total. Seven attributes correspond to the symptoms related to heart disease, and the last attribute shows the class label. It has two class labels (positive and negative). A positive value shows the presence of heart disease, whereas a negative value identifies the absence of disease. A sample set is shown in Table 5 (Online Resource 1). The categorical attributes have been replaced by numeric values before training.

3 MV5 Approach

The proposed MV5 approach is based on five heterogeneous classifiers: NB, DT-GI, DT-IG, MBL and SVM. These classifiers are then combined using a majority vote-based ensemble scheme.

3.1 Base Classifiers

Each base classifier used in the proposed ensemble approach is described as follows:

3.1.1 Naïve Bayes (NB) Classifier

The Naïve Bayes classifier is based on the assumption that the presence or absence of a particular feature is independent of the other features, given the class [6]. The probability model of Naïve Bayes can be efficiently trained using supervised learning algorithms [25]. Recent analyses show that the Naïve Bayes algorithm achieves a classification efficacy that appears theoretically unreasonable given this strong independence assumption [6,26], and Zhang [26] proposed theoretical reasons for this efficacy. Manning et al. [27] showed that although the probability estimates of Naïve Bayes are of low quality, its classification decisions are quite good.

Moreover, in [28], a comprehensive comparison of Naïve Bayes with other classification methods shows that the Naïve Bayes algorithm outperformed some of the other techniques such as random forest and boosted tree. Huang et al. [29] showed that Naïve Bayes is superior to other algorithms in terms of CPU and memory consumption. Another advantage of Naïve Bayes is that it requires a very small amount of training data and training time to estimate the parameters.

Even though the probability of the selected class is overestimated by NB, here NB is used to make the classification decision and not to estimate the probability of a particular class correctly. Owing to its classification power and the advantages discussed above, NB adds value to the model.

Let $X$ be a vector that needs to be classified and $C_k$ be a possible class. It is necessary to determine the probability that vector $X$ belongs to class $C_k$. The probability $P(C_k \mid X)$ can be calculated using the Bayesian rule:

$$P(C_k \mid X) = \frac{P(C_k)\, P(X \mid C_k)}{P(X)} \tag{1}$$

The class probability $P(C_k)$ is obtained from training data, but due to sparseness of data, the direct estimation of $P(X \mid C_k)$ is not possible. Therefore, $P(X \mid C_k)$ is decomposed as follows:

$$P(X \mid C_k) = \prod_{j=1}^{d} P(X_j \mid C_k) \tag{2}$$

where $X_j$ represents the $j$th element of vector $X$. Combining Eqs. (1) and (2), we have

$$P(C_k \mid X) = \frac{P(C_k) \prod_{j=1}^{d} P(X_j \mid C_k)}{P(X)} \tag{3}$$

The main advantage of the Naïve Bayes classifier is that it requires only a small amount of training data to estimate its parameters, such as the central tendency and spread of the input attributes. As the attributes are assumed to be independent of each other, only the per-class statistics of each attribute need to be estimated instead of the entire covariance matrix [30]. The posterior probabilities are calculated using the following formula:

$$P(C \mid F_1, \ldots, F_n) = \frac{P(C)\, P(F_1, \ldots, F_n \mid C)}{P(F_1, \ldots, F_n)} \tag{4}$$

The testing tuple is classified into the class which has the greater probability.

Fig. 1 DT-GI for UCI heart disease dataset
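The decision rule of Eq. (3) can be illustrated with a minimal sketch for categorical attributes. It is not the authors' implementation: the function names `train_nb` and `predict_nb` are hypothetical, probabilities are plain relative frequencies without smoothing, and the denominator P(X) is dropped because it is the same for every class.

```python
from collections import Counter

def train_nb(rows, labels):
    """Estimate P(C_k) and P(X_j = v | C_k) by relative frequency (no smoothing)."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}
    # cond[(j, v, c)] = number of rows of class c with value v in column j
    cond = Counter()
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            cond[(j, v, c)] += 1
    likelihood = {key: cnt / class_counts[key[2]] for key, cnt in cond.items()}
    return priors, likelihood

def predict_nb(x, priors, likelihood):
    """Return the class maximizing P(C_k) * prod_j P(X_j | C_k), plus all scores."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for j, v in enumerate(x):
            # Unseen (attribute value, class) pairs get probability 0 in this sketch.
            score *= likelihood.get((j, v, c), 0.0)
        scores[c] = score
    return max(scores, key=scores.get), scores
```

Because unseen value/class pairs receive probability zero here, a practical version would usually add some form of smoothing; that choice is outside what the text specifies.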

3.1.2 Decision Tree Induction Based on Gini Index (DT–GI)

Decision tree is one of the most important algorithms for data mining and pattern recognition [12]. A database may contain a large feature space, and some of the features do not contribute to the research process. Thus, the number of candidate item sets can be reduced using the Gini Index [12]. An informative attribute will produce a rule that is more compact. Therefore, the attributes which have the lowest Gini Index are used for rule generation:

$$\mathrm{Gini}(t) = 1 - \sum_{i=0}^{c-1} [p(i, t)]^2 \tag{5}$$

where $c$ denotes the number of classes and $p(i, t)$ denotes the probability of class $i$ in node $t$. A crisp set of rules is generated along with the decision tree using the Gini Index as the split criterion. This algorithm is applied on all the datasets. A sample decision tree (based on Gini Index) for the UCI heart disease dataset is shown in Fig. 1.
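As a small illustration of Eq. (5), the sketch below computes the Gini Index of a node from its class labels. The helper `gini_split`, which scores a categorical attribute by the size-weighted Gini Index of the child nodes it induces, is an assumption borrowed from common decision tree practice; the text does not state how the per-node values are aggregated.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((cnt / n) ** 2 for cnt in Counter(labels).values())

def gini_split(values, labels):
    """Size-weighted Gini index of the partition induced by a categorical attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return sum(len(g) / n * gini(g) for g in groups.values())
```

An attribute that separates the classes perfectly yields a split value of 0, which is why the attribute with the lowest value is preferred.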

3.1.3 Decision Tree Induction Using Information Gain (DT–IG)

The decision tree based on information gain follows the entropy approach, whose purpose is to minimize the entropy by identifying a split attribute. As a result, the information gain is maximized [31]. The splitting attribute is determined by calculating the information gain for each attribute and then selecting the attribute which has the maximum information gain. The information gain for each attribute is calculated from the entropy [32]:

$$E = -\sum_{i=1}^{k} P_i \log_2 P_i \tag{6}$$

where $k$ corresponds to the number of target attribute classes and $P_i$ represents the probability of class $i$ occurring.
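The following sketch shows how the entropy of Eq. (6) leads to an information gain value for a categorical attribute. The gain is taken here as the parent entropy minus the size-weighted entropy of the child nodes, which is the usual definition and is assumed rather than quoted from the text; the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy E = -sum_i P_i log2 P_i over the class proportions of a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

The splitting attribute would then be the one for which `information_gain` is largest.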

A decision tree is well suited to exploratory knowledge discovery. Repetition and replication are the two main issues in tree construction. A suitable attribute selection can enhance the performance of a decision tree [31]. Rules are generated from the tree, and test data are labeled using these rules. A sample decision tree (based on information gain) is shown in Fig. 2.

Fig. 2 DT-IG for UCI heart disease dataset

3.1.4 Memory-Based Learner (MBL)

The memory-based learner, also known as an instance-based or case-based learner, works on the principle of the k nearest neighbor (kNN) approach. It is a supervised learning algorithm that classifies an instance based on the labels of its k nearest training instances in the feature space. A distance function is used to identify the nearest neighbors. The type of distance function depends on the data type of the selected attributes. The distance between two feature vectors can be calculated using the following formula [14]:

$$d(x_i, x_j) = \sum_{q \in Q} (x_{iq} - x_{jq})^2 + \sum_{c \in C} L_c(x_{ic}, x_{jc}) \tag{7}$$

where the first sum runs over the quantitative attributes $Q$, the second sum runs over the categorical attributes $C$, and $L_c$ represents the $M \times M$ matrix used to describe the distance between two values of a categorical attribute. Let $N_i$ represent the set of k nearest neighbors of an instance $x_i$ under the distance $d$. The majority voting scheme determines the total votes $V_i(t)$ of the neighbors of $x_i$ having the label $t$. Formally, it can be written as [14]:

$$V_i(t) = \sum_{k \in N_i} I(t, y_k) \tag{8}$$

where $I$ symbolizes the indicator function, with $I(t, y_k) = 1$ if $t = y_k$ and 0 otherwise. Let the target space be $T = \{t_1, t_2\}$; then the weighted majority voting scheme is as follows:

$$y_i = \arg\max_{t \in T} w_t V_i(t) \tag{9}$$

In the proposed technique, the value of k and the weights of class voting are selected empirically on the training set. The following formulation of class weights is used in this research:

$$w_0 = 1, \quad w_1 = [k/2] + 1 \tag{10}$$

where $w_0$ is the weight of the majority class, whereas $w_1$ represents the minority class weight.
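A minimal sketch of the weighted vote in Eqs. (8) to (10) is given below, under several assumptions that the text does not confirm: only quantitative attributes are used (so the distance reduces to the first sum of Eq. (7)), class 1 is treated as the minority class, the bracket in Eq. (10) is read as integer division, and ties go to class 0. The function names are hypothetical.

```python
def squared_distance(a, b):
    """First term of Eq. (7): sum of squared differences over numeric attributes."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def weighted_knn_predict(x, train_X, train_y, k):
    """Weighted majority vote over the k nearest training instances."""
    order = sorted(range(len(train_X)),
                   key=lambda i: squared_distance(x, train_X[i]))
    neighbors = order[:k]
    # V(t): number of neighbors carrying label t, as in Eq. (8)
    votes = {0: 0, 1: 0}
    for i in neighbors:
        votes[train_y[i]] += 1
    # Eq. (10), reading [k/2] as floor(k/2) (an assumption)
    weights = {0: 1, 1: k // 2 + 1}
    # Eq. (9); max() returns class 0 when both weighted votes are equal
    return max((0, 1), key=lambda t: weights[t] * votes[t])
```

With k = 5 the minority weight is 3, so two minority neighbors (weighted vote 6) outweigh three majority neighbors (weighted vote 3), which shows how the scheme compensates for class imbalance.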

3.1.5 Support Vector Machine (SVM)

The SVM classifier performs binary classification of the feature space by hyperplane-based linear separation of the data. Generally, SVM is used for pattern classification and nonlinear regression. The linear separation is defined by a weight vector $w$ and a bias $b$, which determines the distance of the hyperplane from the origin. SVM follows the classification rule [33]:

$$\operatorname{sgn}(f(x, w, b)) \tag{11}$$

$$f(x, w, b) = \langle w \cdot x \rangle + b \tag{12}$$

where $x$ is the example to be classified. Finding the maximum margin hyperplane $(w, b)$ is an optimization problem having a unique solution: minimize $\|w\|$ subject to the constraints

$$y_i (\langle w \cdot x_i \rangle + b) \ge 1 \tag{13}$$

In the basic SVM framework, the two classes are separated by a hyperplane, written as:

$$(w \cdot x) + b = 0, \quad w \in \mathbb{R}^n,\ b \in \mathbb{R} \tag{14}$$

We have used two attributes from each dataset, and they are selected on the basis of information gain.
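The classification rule of Eqs. (11) and (12) and the margin constraint of Eq. (13) can be checked with a few lines of code. This is only a sketch of how a given hyperplane is applied, not a training procedure; the labels are assumed to be coded as -1 and +1 as Eq. (13) requires, a decision value of exactly zero is mapped to +1 by convention, and all function names are hypothetical.

```python
def decision_value(w, x, b):
    """f(x, w, b) = <w . x> + b, as in Eq. (12)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, x, b):
    """Sign of the decision value, Eq. (11); zero is mapped to +1 by convention."""
    return 1 if decision_value(w, x, b) >= 0 else -1

def satisfies_margin(w, b, X, y):
    """True if every training pair meets y_i (<w . x_i> + b) >= 1, Eq. (13)."""
    return all(yi * decision_value(w, xi, b) >= 1 for xi, yi in zip(X, y))
```

For instance, with the hypothetical hyperplane w = (1, 1), b = -3, the point (1, 1) falls on the negative side and (3, 2) on the positive side, and both satisfy the margin constraint.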

3.2 The MV5 Ensemble Classifier

The MV5 approach involves data acquisition, preprocessing, classifier training and an ensemble model for heart disease prediction. The data acquisition process obtains data from different repositories and performs data partition and variable selection. The preprocessing steps involve missing value imputation, outlier detection, feature selection and class label identification.

There are two steps involved in the creation of an ensemble approach: first, generate the individual classifier decisions, and second, combine the individual classifiers' decisions appropriately to create a new model. Each member classifier in the ensemble method is trained using selected parameters from the given datasets. Let $N$ be the number of classifiers, denoted by $C_1, \ldots, C_N$, and $M$ be the number of output classes (healthy/sick). In this research, $N = 5$ and $M = 2$. The classifier ensemble method is defined as follows: find the boolean array $V$ of size $N \times M$ that represents the binary vote-based ensemble, where $V(i, j)$ signifies whether the $i$th classifier has voted in favor of the $j$th class. $V(i, j)$ = True/1 means that the $i$th classifier has voted for the $j$th class, whereas $V(i, j)$ = False/0 means that the $i$th classifier has voted for the other class. In essence, if at least 3 out of 5 classifiers predict the class sick, then the final decision of the ensemble is sick, and healthy otherwise.
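The vote array and the 3-out-of-5 rule described above can be sketched as follows. The function names are hypothetical, and the class coding (0 = healthy, 1 = sick) follows the labeling convention used for the datasets.

```python
def vote_matrix(predictions, n_classes=2):
    """Build V, where V[i][j] is 1 if classifier i voted for class j, else 0."""
    return [[1 if p == j else 0 for j in range(n_classes)] for p in predictions]

def majority_vote(predictions):
    """Return 1 (sick) if more than half of the classifiers voted sick, else 0."""
    V = vote_matrix(predictions)
    sick_votes = sum(row[1] for row in V)
    return 1 if sick_votes > len(predictions) / 2 else 0
```

With five base classifiers and two classes a tie cannot occur, so `majority_vote([1, 1, 1, 0, 0])` yields sick and `majority_vote([1, 0, 0, 1, 0])` yields healthy.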

The main focus of the proposed heterogeneous ensemble model is to integrate different types of classifiers that differ in their decisions. Such methods vary the training process in order to create classifier models that produce different classification decisions.

There are three main modules in the proposed system. The first module is data acquisition and preprocessing, a process of obtaining heart disease data from different data repositories, preprocessing it and then refining the feature space into a form that is suitable for subsequent analysis. The data are then divided into a training set and a testing set. The second module is used to train the five classifiers (NB, DT-IG, DT-GI, MBL and SVM) on the training sets. The last component is the prediction and evaluation module, which performs prediction on the testing set using the pretrained classifiers. The ensemble-based majority voting is applied on the result, and then the final prediction is obtained. The evaluation component evaluates the final results on the basis of accuracy, sensitivity, specificity and f-measure. The framework of the proposed approach is given in Fig. 3.

3.2.1 Data Acquisition and Preprocessing Module

The fundamental role of the data acquisition process is to obtain the data from different heart disease data repositories. It holds the features that are used to characterize the data into healthy individuals and heart disease patients. Each dataset has different attribute types and feature space. The data partition component then splits the data into two subsets for each database: a training set and a testing set. Partitioning produces mutually exclusive subsets and uses tenfold stratified cross validation to partition the data. The given datasets share no observations with each other. In order to avoid bias, the records from each database are randomly selected for partition. The main advantage of partitioning is the reduction of computation time for preliminary model runs.
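One way to obtain the mutually exclusive, stratified tenfold partition described above is sketched below. It is an illustration under assumptions, not the partitioning tool used by the authors: records of each class are shuffled with a fixed seed and then dealt round-robin to the folds so that every fold keeps roughly the overall class ratio. The function name `stratified_folds` is hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=10, seed=0):
    """Split record indices into n_folds disjoint folds with similar class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)  # random selection within each class
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)  # deal records round-robin
    return folds
```

Each fold then serves once as the testing set while the remaining nine folds form the training set.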

The feature selection process involves reduction of attributes by selecting only those features which contribute toward the final prediction and setting the others as rejected. The rejected variables are not used in subsequent modules and analysis. Multiple steps are involved in the process of attribute/feature selection [34]. First, the generation procedure generates a subset of features, and second, the selection procedure evaluates these features based on different evaluation criteria. The generation procedure can start with an empty set, with a subset of features selected randomly, or with the set of all attributes. In the case of an empty set, the attributes are added iteratively through forward selection, and in the case of the set of all attributes, the features can be removed through backward elimination. The evaluation criterion for selecting relevant attributes is based on wrapper approaches that focus on the classification accuracy rate [35]. They calculate the estimated accuracy for each feature set that is added to or removed from the dataset. Cross validation is used for the accuracy estimation for each feature set of the training set. The feature selection process continues until a prespecified number of features is selected or some thresholding criterion is achieved [35].
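The forward selection variant of this wrapper procedure can be sketched as a greedy loop. The function `forward_selection` and its parameters are hypothetical; `score` stands for any callable that returns an accuracy estimate for a list of features (in the setting described above it would be a cross-validated accuracy), and `max_features` and `min_gain` stand in for the prespecified feature count and the thresholding criterion.

```python
def forward_selection(features, score, max_features=None, min_gain=0.0):
    """Greedily add the feature that most improves score(subset) until no gain."""
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        # Evaluate every candidate extension of the current subset.
        candidate, cand_score = max(
            ((f, score(selected + [f])) for f in remaining),
            key=lambda pair: pair[1],
        )
        if cand_score - best_score <= min_gain:
            break  # thresholding criterion: no sufficient improvement
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = cand_score
    return selected, best_score
```

Backward elimination follows the same pattern in reverse, starting from the full attribute set and removing the feature whose removal hurts the score least.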

Feature selection is performed for each dataset individually. The heart disease datasets used in the proposed research are benchmark datasets and do not contain any irrelevant features because they are already processed by the respective publishers. Therefore, we have used the entire feature set of each dataset for further analysis. For any other dataset, attributes that are irrelevant to the domain will be removed using the defined feature selection method.

Fig. 3 Detailed architecture of the proposed MV5 framework. Data sources (UCI SPECT, UCI SPECTF, UCI Statlog, UCI heart disease, Eric) feed data acquisition and cleaning (feature selection, missing value handling with a 50 % threshold, outlier detection with ±1.5 × IQR), followed by classifier training (NB, DT-IG, DT-GI, MBL, SVM) on the training set, and prediction (majority voting) and evaluation (accuracy, sensitivity, specificity, F-measure) on the testing set.

After feature selection, data cleaning is performed on the selected features. This involves cleaning the raw data and transforming it into a form that can be easily processed by the subsequent modules. The processing steps are performed on each attribute value individually, and the result is then passed to the base classifiers. This involves the following steps:

• Missing value imputation: missing values in medical datasets are a serious problem that is encountered during preliminary analysis and interpretation of data. The proposed preprocessing technique identifies the missing attribute values and replaces them with the mean for continuous attributes and the mode for categorical attributes. This procedure is performed for those attributes where values are missing in less than 50 % of the instances. If the proportion of instances with missing values is 50 % or more, the attribute is rejected and not used further.

• Noisy data: the random error or variance in a measured attribute is referred to as noise. Noise can be removed from data using binning, regression analysis or clustering. We have removed the noise from the datasets using the binning technique, and the data are then passed on to the next process for further cleaning.

• Outlier detection: outliers are the values that fall above or below a particular range. The outliers for each attribute in the input datasets are removed. Outliers are detected using the interquartile range (IQR): values outside the ±1.5 × IQR range are treated as outliers and are replaced by the mean value for continuous attributes and the mode for categorical attributes.
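The imputation and outlier steps listed above can be sketched for a single attribute column as follows. Several details are assumptions rather than statements from the text: missing values are represented by `None`, the ±1.5 × IQR range is read as the usual fences Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, quartiles come from `statistics.quantiles` with its default method, and the replacement mean is computed from the in-range values only. The function names are hypothetical.

```python
import statistics

def impute_column(values, categorical=False, max_missing=0.5):
    """Fill None with the mean (or mode); return None if the attribute is rejected."""
    present = [v for v in values if v is not None]
    missing_fraction = 1 - len(present) / len(values)
    if missing_fraction >= max_missing:
        return None  # 50 % or more missing: attribute is not used further
    fill = statistics.mode(present) if categorical else statistics.mean(present)
    return [fill if v is None else v for v in values]

def replace_outliers(values, factor=1.5):
    """Replace continuous values outside the IQR fences with the in-range mean."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    inliers = [v for v in values if low <= v <= high]
    mean = statistics.mean(inliers)
    return [v if low <= v <= high else mean for v in values]
```

For a categorical attribute, the same pattern would apply with the mode in place of the mean.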

3.2.2 Classifier Training Module

The preprocessed data are passed to the classifier training module for further computation. The classifier training module trains the individual classifiers using each dataset in order to make them useful for prediction. Each trained classifier has the knowledge about the feature space of each dataset and the resultant class label. These trained classifiers will predict healthy/sick individuals.

3.2.3 Prediction Algorithm and Evaluation Procedure

The prediction and evaluation module first predicts the class labels of new data (test data) using the trained classifiers and then performs majority voting (the ensemble technique) to obtain the final result. The final outcome is then compared with the result of each base classifier to perform the evaluation.
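The majority voting step can be illustrated with a minimal sketch. It assumes that each base classifier has already produced one hard label per test instance (0 for healthy, 1 for sick). The function names and the example votes are hypothetical and are not taken from the experiments.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label chosen by the largest number of base classifiers."""
    label, _count = Counter(votes).most_common(1)[0]
    return label


def ensemble_predict(predictions_by_classifier):
    """Combine per-classifier label lists into one label per test instance."""
    per_instance = zip(*predictions_by_classifier.values())
    return [majority_vote(votes) for votes in per_instance]


# Hypothetical predictions of the five base classifiers on three instances.
example = {
    "NB":    [1, 0, 1],
    "DT-GI": [1, 0, 0],
    "DT-IG": [0, 0, 1],
    "MBL":   [1, 1, 0],
    "SVM":   [0, 0, 1],
}
final = ensemble_predict(example)
```

With five voters and two classes, a tie cannot occur, so no tie-breaking rule is needed in this setting.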

The ensemble component creates a new model by combining the posterior probabilities (for the target class) or the predicted values (for interval targets) from the preceding steps. The new ensemble model scores the test data with higher f-measure and accuracy than the individual classifiers.

The proposed research involves the diagnosis of two classes: healthy (0) and sick (1). The literature survey shows that several techniques have been proposed for the diagnosis of heart disease. The performance of the proposed ensemble classifier is analyzed using different evaluation measures: accuracy, sensitivity, specificity and f-measure. Classification and prediction results are also displayed using confusion matrices. Each cell in a confusion matrix holds the raw number of exemplars for one combination of predicted class and actual class. These measures are defined in detail in the next section.

4 Evaluation Measures

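We have evaluated the accuracy of the proposed model on five heart disease datasets. Sensitivity is the proportion of actual positives that are correctly identified, and specificity is the proportion of actual negatives that are correctly identified. With sick (1) as the positive class, these are the proportion of heart disease patients correctly classified as sick and the proportion of healthy persons correctly classified as healthy, respectively. Mathematically,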

\[ \text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]

\[ \text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}} \]

F-measure is the harmonic mean of sensitivity and specificity. Mathematically,

\[ \text{F-Measure} = \frac{2 \times \text{Sensitivity} \times \text{Specificity}}{\text{Sensitivity} + \text{Specificity}} \]

Accuracy measures the percentage of correct predictions made by the proposed model in comparison with the actual measurements of the test data. Mathematically,

\[ \text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{True Positives} + \text{False Positives} + \text{True Negatives} + \text{False Negatives}} \]
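The four measures can be computed directly from the cells of a confusion matrix. The sketch below follows the formulas of this section, so the F-measure is the harmonic mean of sensitivity and specificity. The function name `evaluate` and the example counts are hypothetical.

```python
def evaluate(tp, fp, tn, fn):
    """Compute the evaluation measures of this section from raw counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f_measure = 2 * sensitivity * specificity / (sensitivity + specificity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
scores = evaluate(tp=40, fp=5, tn=45, fn=10)
```

As a consistency check on the definition, the harmonic mean of the best reported sensitivity (86.96 %) and specificity (90.83 %) is approximately 88.85 %, which agrees with the reported f-measure.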

5 Results and Discussion

The experimental data consist of five different heart disease datasets with different sets of attributes. Experiments are conducted on test sets, and the results are evaluated. Each dataset is evaluated using tenfold cross validation to alleviate the insufficiency of samples. Cross validation divides the labeled data into training and test sets. Each classifier is learned on the training set and then applied to the test set. Individual classification rules are generated for the decision tree-based classifiers (DT-GI and DT-IG), and these rules are then used to classify the test data. The remaining three classifiers (NB, SVM and MBL) are first trained on the training set and then tested on a separate, unseen test set.

The statistics of each dataset are shown in Table 2.

Table 2 Datasets distribution

| Dataset | Name | Total instances | Training (tenfold cross validation) | Testing (tenfold cross validation) |
|---|---|---|---|---|
| Dataset 1 | Heart disease | 303 | 272 | 31 |
| Dataset 2 | SPECT | 267 | 240 | 27 |
| Dataset 3 | SPECTF | 267 | 240 | 27 |
| Dataset 4 | Statlog | 270 | 243 | 27 |
| Dataset 5 | Eric | 209 | 188 | 21 |
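The training and testing counts in Table 2 match the sizes produced by the largest fold of a tenfold split. The following sketch shows one way to obtain them. It is a plain, unshuffled and unstratified splitter written for illustration, and `kfold_indices` is a hypothetical helper rather than the tool used in the experiments.

```python
def kfold_indices(n_instances, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds.

    The first n_instances % k folds receive one extra instance each.
    """
    base, extra = divmod(n_instances, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_instances))
        yield train, test
        start += size


# Total instances per dataset, taken from Table 2.
totals = {"Heart disease": 303, "SPECT": 267, "SPECTF": 267,
          "Statlog": 270, "Eric": 209}

# Sizes of the first (largest) fold for each dataset.
first_fold_sizes = {}
for name, n in totals.items():
    train, test = next(kfold_indices(n))
    first_fold_sizes[name] = (len(train), len(test))
```

When the total is not divisible by ten, the remaining folds are one instance smaller; for example, the heart disease dataset yields three test folds of 31 instances and seven of 30.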


Fig. 4 Accuracy comparison: proposed MV5 and individual classifiers

Table 3 Performance improvement of proposed MV5

| Performance measure | NB (%) | DT-GI (%) | DT-IG (%) | MBL (%) | SVM (%) | MV5 (%) | Average increase by MV5 (%) |
|---|---|---|---|---|---|---|---|
| Avg. accuracy | 76.63 | 74.17 | 74.01 | 64.35 | 78.56 | 86.82 | 13.27 |
| Avg. sensitivity | 77.51 | 78.31 | 75.70 | 67.42 | 74.95 | 85.10 | 10.32 |
| Avg. specificity | 76.82 | 71.00 | 70.56 | 61.41 | 87.94 | 90.09 | 16.54 |
| Avg. F-Measure | 70.80 | 67.21 | 55.26 | 60.07 | 58.19 | 79.63 | 17.32 |

In the original table, improved results are highlighted in green.

The ensemble model runs on the test datasets and processes each instance, classifying it as healthy or sick. Confusion matrices, accuracy, sensitivity, specificity and f-measure are calculated for the proposed MV5 ensemble as well as for the individual NB, DT-IG, DT-GI, MBL and SVM classifiers. The final confusion matrix is the sum of the ten confusion matrices obtained at each fold of cross validation. The prediction results are calculated and averaged over all subsets. The results are analyzed to verify the superiority of the proposed MV5. Figure 4 compares MV5 with the individual classifier techniques on all datasets. The analysis shows that the proposed MV5 achieves higher accuracy than the individual classifiers on all datasets: 85.81 % on the heart disease dataset, 82.40 % on the SPECT dataset, 80.15 % on the SPECTF dataset, 88.52 % on the Statlog dataset and 86.12 % on the Eric dataset. The average performance of MV5 is also compared with the average performance of the individual classifiers, and the overall result analysis is summarized in Table 3.
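Summing the per-fold confusion matrices is a cell-by-cell addition. The sketch below shows this pooling step. The dictionary layout and the three fold matrices are hypothetical and serve only to illustrate the arithmetic; they are not results from the experiments.

```python
def sum_confusion_matrices(fold_matrices):
    """Add up the cells of the confusion matrices obtained at each fold."""
    total = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for matrix in fold_matrices:
        for cell in total:
            total[cell] += matrix[cell]
    return total


# Three hypothetical folds (a tenfold run would supply ten such matrices).
folds = [
    {"tp": 12, "fp": 2, "tn": 14, "fn": 3},
    {"tp": 11, "fp": 1, "tn": 15, "fn": 3},
    {"tp": 13, "fp": 3, "tn": 13, "fn": 1},
]
pooled = sum_confusion_matrices(folds)
pooled_accuracy = (pooled["tp"] + pooled["tn"]) / sum(pooled.values())
```

Accuracy computed from the pooled matrix weights every instance equally, so it can differ slightly from the mean of the per-fold accuracies when the folds have different sizes.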

Table 3 shows that, compared with the mean of the five individual classifiers, MV5 improves average accuracy by 13.27 percentage points, average sensitivity by 10.32, average specificity by 16.54 and average f-measure by 17.32. Graphical comparisons of the proposed MV5 for accuracy, sensitivity, specificity and f-measure are given in Figs. 4, 5, 6 and 7, indicating high performance on each dataset.
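The last column of Table 3 can be reproduced as the MV5 value minus the mean of the five individual classifiers. The short sketch below performs this calculation on the values from Table 3. The variable and function names are illustrative, and reading the column as a difference in percentage points is an interpretation that the arithmetic supports to within 0.01.

```python
# Values copied from Table 3: (NB, DT-GI, DT-IG, MBL, SVM), MV5, reported increase.
table3 = {
    "accuracy":    ([76.63, 74.17, 74.01, 64.35, 78.56], 86.82, 13.27),
    "sensitivity": ([77.51, 78.31, 75.70, 67.42, 74.95], 85.10, 10.32),
    "specificity": ([76.82, 71.00, 70.56, 61.41, 87.94], 90.09, 16.54),
    "f_measure":   ([70.80, 67.21, 55.26, 60.07, 58.19], 79.63, 17.32),
}


def average_increase(individual, mv5):
    """MV5 score minus the mean score of the individual classifiers."""
    return mv5 - sum(individual) / len(individual)


increases = {
    measure: average_increase(individual, mv5)
    for measure, (individual, mv5, _reported) in table3.items()
}
```

For accuracy, the mean of the individual classifiers is 73.544 %, so the increase is 86.82 - 73.544 = 13.276 percentage points, which Table 3 lists as 13.27.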

In the proposed research, we have achieved significant results. (1) We obtained a best accuracy of 88.52 %, which is a substantial improvement for diagnosing heart disease patients; no individual classifier achieves this accuracy. (2) Our results show that our approach outperforms the state of the art. The best results of MV5 are 86.96 % sensitivity, 90.83 % specificity and 88.85 % f-measure, which are statistically significant. The proposed framework can handle any kind of data, including interval-based, categorical, continuous and real types. Each dataset contains different types of attributes, and the proposed algorithm handles all of them well, as shown in the results.

Fig. 5 Sensitivity comparison: proposed MV5 and individual classifiers

Fig. 6 Specificity comparison: proposed MV5 and individual classifiers

Fig. 7 F-Measure comparison: proposed MV5 and individual classifiers

The diversity among classifiers is achieved by combining entirely different types of classifiers, known as a heterogeneous ensemble. The diversity of the proposed ensemble model can be determined by the extent to which the member classifiers disagree about the probability distributions for the test datasets. The proposed model can handle dependencies and relations between attributes instead of considering each attribute independently as in Naïve Bayes. The decision tree classifiers in the proposed ensemble model perform correlation analysis between sets of attributes in order to calculate information gain or Gini index. One limitation of decision tree induction is overfitting, which is resolved by tree pruning: tree branches that do not contribute to instance classification are removed from the tree. This also reduces tree complexity and increases classification accuracy.

The limitations of the MBL algorithm are that it requires a lot of storage and is computationally intensive. The storage problem can be resolved by choosing only the necessary or useful features for heart disease analysis and prediction. The proposed MV5 model resolves this issue with the SVM classifier, which performs prediction using only a subset of the data chosen based on information gain. The SVM classifier in MV5 increases the accuracy of classification and prediction and reduces overfitting. The classifiers complement each other well: where one classifier has a limitation, another performs well, which gives better overall performance.
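The notion of diversity as disagreement between member classifiers can be made concrete with a simple pairwise measure. The sketch below uses the fraction of test instances on which two classifiers predict different hard labels. This is a simpler proxy than disagreement about probability distributions and is not the measure used in the experiments; the function names and example predictions are hypothetical.

```python
from itertools import combinations


def disagreement(preds_a, preds_b):
    """Fraction of instances on which two classifiers predict different labels."""
    differing = sum(a != b for a, b in zip(preds_a, preds_b))
    return differing / len(preds_a)


def mean_pairwise_disagreement(predictions_by_classifier):
    """Average the disagreement over all pairs of member classifiers."""
    pairs = list(combinations(predictions_by_classifier.values(), 2))
    return sum(disagreement(a, b) for a, b in pairs) / len(pairs)


# Hypothetical predictions of three classifiers on four instances.
example = {
    "A": [1, 0, 1, 0],
    "B": [1, 1, 1, 0],
    "C": [0, 0, 1, 0],
}
diversity = mean_pairwise_disagreement(example)
```

A value of 0 means that all members always agree, in which case voting cannot improve on any single member; larger values indicate more room for the members to correct each other's errors.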

6 Conclusion and Future Work

This paper has proposed an ensemble classifier for intelligent heart disease prediction and analysis. It uses five heterogeneous machine learning classifiers, combines the results of each classifier in an ensemble model and generates the prediction information for heart disease diagnosis. Five datasets with different sets of attributes of different types were used, and the classifier outputs were combined by majority voting for ensemble training and testing. The experimental results with tenfold stratified cross validation show that the proposed ensemble model predicts heart disease patients with relatively high accuracy compared to other state-of-the-art techniques. MV5 achieved an accuracy of 88.52 % with 86.96 % sensitivity, 90.83 % specificity and 88.85 % f-measure. Future research directions include a weighted voting-based classifier ensemble and the application of the proposed algorithm to other diseases, such as diabetes and cancer, for classification and prediction.

References

1. Rajkumar, M.; Reena, G.S.: Diagnosis of heaer disease using datamining algorithm. Glob. J. Comput. Sci. Technol. 10(10) (2010)

2. Porter, T.; Green, B.: Identifying diabetic patients: a data min-ing approach. In: Americas Conference on Information Systems(2009)

3. Panzarasa, S.; Quaglini, S.; et al.: Data mining techniques for ana-lyzing stroke care processes. In: Proceedings of the 13th WorldCongress on Medical Informatics (2010)

4. Li, L.; Tang, H.; Wu, Z.; Gong, J.; Gruidl, M.; Zou, J.; Tockman,M.; Clark, RA.: Data mining techniques for cancer detection usingserum proteomic profiling. Artif. Intell. Med. 32(2), 71–83 (2004)

5. Das, R.; Turkoglu, I.; Sengur, A.: Effective diagnosis of heart dis-ease through neural networks ensembles. Expert Syst. Appl. 36(4),7675–7680 (2009)

6. Pattekari, S.A.; Parveen, A.: Prediction system for heart diseaseusing Naïve Bayes. Int. J. Adv. Comput. Math. Sci. 3(3), 290–294 (2012)

7. Peter, T.J.; Somasundaram, K.: An empirical study on prediction ofheart disease using classification data mining techniques. In: IEEE-International Conference on Advances in Engineering, Science andManagement (ICAESM -2012) March (2012)

8. Ghumbre, S.; Patil, C.; Ghatol, A.: Heart disease diagnosis usingsupport vector machine. In: International Conference on Com-puter Science and Information Technology (ICCSIT’) Pattaya(2011)

9. Chitra, R.; Seenivasagam, V.: Heart disease prediction system usingsupervised learning classifier. Bonfring Int. J. Softw. Eng. SoftComput. 3(1), 1–7 (2013)

10. Abuhaiba, I.S.I.: Efficient OCR using simple features and decisiontrees with backtracking. Arab. J. Sci. Eng. 31(2), 223–244 (2006)

11. Srinivas, K.; Rani, B.K.; Govrdhan, A.: Applications of data min-ing techniques in healthcare and prediction of heart attacks. Int. J.Comput. Sci. Eng. (IJCSE) 02(02), 250–255 (2010)

12. Shouman, M.; Turner, T.; Stocker, R.: Using decision tree fordiagnosing heart disease patients. In: Proceedings of the 9th Aus-tralasian Data Mining Conference, Ballarat, Australia (2011)

13. Anbarasi, M.; Anupriya, E.; Sniyengar, N.CH.: Enhanced predic-tion of heart disease with feature subset selection using geneticalgorithm. Int. J. Eng. Sci. Technol. 2(10), 5370–5376 (2010)

14. Uguroglu, S.; Carbonell, J.; Doyle, M.; Biederman, R.: Cost-sensitive risk stratification in the diagnosis of heart disease. In:Proceedings of the Twenty-Fourth Innovative Applications of Arti-ficial Intelligence Conference (2012)

15. Chen, A.H.; Huang, S.Y.; Hong, P.S.; Cheng, C.H.; Lin, E.J.:HDPS: Heart disease prediction system. In: Computing in Car-diology, 2011, pp. 557–560. IEEE (2011)

16. Jabbar, M.A.; Chandra, P.; Deekshatulu, B.L.: Heart disease pre-diction system using associative classification and genetic algo-rithm. In: International Conference on Emerging Trends in Electri-cal, Electronics and Communication Technologies-ICECIT (2012)

17. Eom, J.; Kim, S.; Zhang, B.: AptaCDSS-E: a classifier ensemble-based clinical decision support system for cardiovascular diseaselevel prediction. Expert Syst. Appl. 34, 2465–2479 (2008)

18. Helmy, T.; Rahman, S.M.; Hossain, M.I.; Abdelraheem, A.: Non-linear heterogeneous ensemble model for permeability predictionof oil reservoirs. Arab. J. Sci. Eng. 38, 1379–1395 (2013)

19. Shouman, M.; Turner, T.; Stocker, R.: Applying k-nearest neigh-bour in diagnosing heart disease patients. Int. J. Inf. Commun.Technol. Educ. 2(3), 220–223 (2012)

20. Tu, M.C.; Shin, D.; Shin, D.: Effective diagnosis of heart dis-ease through bagging approach. In: 2nd International Conferenceon Biomedical Engineering and Informatics, 2009. BMEI’09, pp.1–4. IEEE (2009)

21. Shouman, M.; Turner, T.; Stocker, R.: Integrating Naive Bayes andK-means clustering with different initial centroid selection methodsin the diagnosis of heart disease patients. CS & IT-CSCP, pp. 125–137 (2012)

22. Shouman, M.; Turner, T.; Stocker, R.: Integrating clustering withdifferent data mining techniques in the diagnosis of heart disease.J. Comput. Sci. Eng. 20(1) (2013)

23. http://archive.ics.uci.edu/ml/datasets.html. Last Accessed 25 Sep2013

24. http://www.eric.univ-lyon2.fr/~ricco/dataset/heart_disease_male.xls. last Accessed 25 Sep 2013)

25. Sotelsek-Margalef, A.; Villena-Román, J.: MIDAS: aninformation-extraction approach to medical text classification. In:Procesamientodellenguaje Natural, pp. 97–104 (2008)

26. Zhang, H.: The Optimality of Naïve Bayes. American Associationof Artificial Intelligence, Menlo Park (2004)

27. Manning, D.; Raghavan, P.; Schutze, H.: Introduction to Informa-tion retrieval. Cambridge university, Cambridge (2008)

28. Naïve Bayes Classifier, http://en.wikipedia.org/wiki/Naive_Bayes_classifier#cite_note-0. Accessed 16 Dec 2013

29. Huang, J.; Lu, J.; Ling, C.X.: Comparing Naive Bayes, decisiontrees, and SVM with AUC and accuracy. In: Third IEEE Interna-tional Conference on Data Mining (2003)

123

Author's personal copy

Arab J Sci Eng

30. Palaniappan, S.; Awang, R.: Intelligent heart disease predictionsystem using data mining techniques- 978-1-4244-1968-5/08/2008IEEE (2008)

31. Han, J.; Kamber, M.: Data Mining Concepts and Techniques. Mor-gan Kaufmann Publishers, Burlington (2006)

32. Bramer, M.: Principles of data mining. Springer, Berlin (2007)33. Wang, S.; Mathew, A.; Chen, Y.; Xi, L.; Ma, L.; Lee, J.: Empiri-

cal Analysis of Support Vector Machine Ensemble Classifiers. In:Expert Systems with applications (2009)

34. Mokeddem, S.; Atmani, B.; Mokaddem, M.: Supervised featureselection for diagnosis of coronary artery disease based on geneticalgorithm (2013). arxiv:1305.6046 [Preprint]

35. Kohavi, R.; John, G.H.: Wrappers for feature subset selection. Artif.Intell. 273–324 (1997)

123

Author's personal copy