
Big Data Analytics for Preventive Medicine

Imran Razzak, Muhammad Imran, and Guandong Xu

Abstract—Medical data is among the most rewarding and yet most complicated data to analyze. How can healthcare providers use modern data analytics tools and technologies to analyze and create value from such complex data? Data analytics promises to efficiently discover valuable patterns by analyzing large amounts of unstructured, heterogeneous, non-standard and incomplete healthcare data. It not only forecasts but also supports decision making, and it is increasingly seen as a breakthrough whose goal is to improve the quality of patient care and reduce healthcare costs. The aim of this study is to provide a comprehensive and structured overview of the extensive research on data analytics methods for disease prevention. This review first introduces disease prevention and its challenges, followed by traditional prevention methodologies. We then summarize state-of-the-art data analytics algorithms used for disease classification, clustering (detecting unusually high incidence of a particular disease), anomaly detection (detection of disease) and association analysis, together with their respective advantages, drawbacks and guidelines for selecting a specific model, followed by a discussion of recent developments and successful applications of disease prevention methods. The article concludes with open research challenges and recommendations.

Keywords: Disease prevention, data analytics, healthcare, knowledge discovery, prevention methodologies.

I. INTRODUCTION

Due to rising healthcare expenditures, early disease prevention has never been as important as it is today. This is particularly due to the increased threat of new disease variants and bio-terrorism, as well as recent developments in data collection and computing technology. The growing amount of healthcare data increases the demand for efficient, sensitive and cost-effective solutions for disease prevention. Traditional preventive measures mainly focus on the promotion of health benefits and lack methods to process huge amounts of data. Using IT to promote healthcare quality can serve to improve health promotion and disease prevention. It is a truly inter-disciplinary challenge that requires many types of expertise across different research areas and really big data. It raises some fundamental questions:

• How do we reduce the increasing number of patients through effective disease prevention?
• How do we cure a disease or slow down its progression?
• How do we reduce healthcare costs while providing quality care?
• How do we maximize the role of IT in identifying and curing risk at an early stage?

A clear answer to these questions is the use of intelligent data analytics methods to extract information from the glut of healthcare data. Data analytics researchers are poised to deliver hugely beneficial advancements in patient care, and there is vast potential for data analytics applications in the healthcare sector. Data analytics, machine learning and data mining have already made early disease identification and treatment possible. Early monitoring and detection of disease is in practice in many countries, e.g. BioSense (USA), CDPAC (Canada), SAMSS and AIHW (Australia), and SentiWeb (France).

Imran Razzak and Guandong Xu are with the Advanced Analytics Institute, University of Technology Sydney, Australia. Muhammad Imran is with the College of Applied Computer Science, King Saud University, Saudi Arabia.

This paper discusses IT-based methods for disease prevention. We chose to focus on data mining based prevention methodologies because recent developments in data mining approaches have led researchers to develop a number of prevention systems. Tremendous progress has been made in early disease identification and the management of its complications.

A. What are Data Mining and Data Analytics

The exponential growth of data over time has made it difficult to extract useful information from it. Traditional methods perform reasonably well, but their predictive power is limited because traditional analysis deals only with primary analysis, whereas data analytics deals with secondary analysis. Data mining is the digging or mining of data from many dimensions or perspectives using data analysis tools to find previously unknown patterns and relationships in data that can serve as valid information; moreover, it uses this extracted information to build predictive models. It has been used intensively and extensively by many organizations, especially in the healthcare sector.

Data mining is not a magic wand but in fact a giant tool that does not discover solutions without guidance. Data mining is useful for the following purposes:

• Exploratory analysis – examining the data to summarize its main characteristics.
• Descriptive modeling – partitioning the data into subgroups based on its properties.
• Predictive modeling – forecasting information from existing data.
• Discovering patterns – finding patterns that occur frequently.
• Retrieval by content – retrieving items or patterns of interest from the data.

Big data and machine learning hold great potential for healthcare providers to systematically use data and analytics to discover interesting, previously unknown patterns and to uncover inefficiencies in vast data stores in order to build predictive models of best practices that improve the quality of healthcare as well as reduce its cost. EHR systems produce huge amounts of data on a daily basis, a rich source of information that healthcare organizations can use to explore interesting facts and findings that help improve patient care. Fig. 1 shows a generic data analytics architecture for healthcare applications.


Fig. 1. Architecture of Healthcare Data Analytics

As health sector data moves toward really big data, better tools and techniques are required than traditional data analytics tools. Traditional analytics tools are user friendly and transparent, whereas big data analytics tools are complex, programming intensive and require a variety of skills. Some well-known big data analytics tools are summarized in Table I.
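To make the MapReduce entry in Table I concrete, the following is a minimal, self-contained sketch of the map/shuffle/reduce idea applied to counting diagnosis codes; the record layout and field names are hypothetical, and a real deployment would run on Hadoop or Spark rather than in memory.

```python
# Minimal illustration of the MapReduce idea from Table I: counting diagnosis
# codes across patient records. Record format and field names are hypothetical;
# a real deployment would use Hadoop/Spark rather than this in-memory sketch.
from collections import defaultdict

records = [
    {"patient_id": 1, "diagnosis": "E11"},   # type-2 diabetes (ICD-10)
    {"patient_id": 2, "diagnosis": "I10"},   # hypertension
    {"patient_id": 3, "diagnosis": "E11"},
]

def map_phase(record):
    # Emit (key, value) pairs, one per record.
    yield record["diagnosis"], 1

def reduce_phase(key, values):
    # Aggregate all values emitted for the same key.
    return key, sum(values)

# Shuffle: group intermediate pairs by key, then reduce each group.
groups = defaultdict(list)
for rec in records:
    for key, value in map_phase(rec):
        groups[key].append(value)

counts = dict(reduce_phase(k, v) for k, v in groups.items())
print(counts)  # {'E11': 2, 'I10': 1}
```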

B. What is Disease Prevention and its Challenges

Every year millions of people die preventable deaths [166]. In 2012, about 56 million people died worldwide and two-thirds of these deaths were due to non-communicable diseases including diabetes, cardiovascular disease and cancer. Moreover, 5.9 million children died in 2015 before reaching their fifth year of life, and most of these deaths were due to infection (e.g. diarrhea, malaria, birth asphyxia, pneumonia); this number could be at least halved by treating or preventing these conditions through access to simple, affordable interventions [8]. The core problem in the healthcare sector is to reduce this huge number of casualties as well as the cost. The goal is to reduce the prevalence of disease and help people live longer and healthier lives while also reducing cost. One of the main interests in disease prevention is driven by the need to reduce cost: lifetime medical expenditures have increased from $14k to $83k per person, and this rises to $160k after the age of 65.

Thus the proportion of average world GDP devoted to the healthcare sector increased from 5.28% in 1995 to 5.99% in 2014 and is expected to keep increasing (e.g. from 17.1% of US GDP in 2014 to 19.9% by 2022) [6] [4]. This increase in medical expenditure is mainly due to aging and growing populations, the rising prevalence of chronic diseases, and infrastructure improvement. Cost-saving and cost-effective preventive solutions are therefore required to reduce the burden on the economy. Traditional preventive measures mainly focus on the promotion of health benefits. A cost-effectiveness ratio is said to be unfavorable when the incremental cost of an intervention is large relative to its health benefits. The USA spends about 90% of its health budget on treating diseases and their complications rather than on prevention (only 2-3%), even though many of these diseases could be prevented at an early stage [36] [173]. Spending more on health does not guarantee health system efficiency. Investment in prevention can help reduce cost as well as improve health quality and efficiency. The health industry faces considerable challenges in promoting and protecting health at a time of huge pressure from budget and resource constraints in many countries. Early detection and prevention of disease plays a very important role in reducing deaths as well as healthcare cost. Thus the core question is: how can data help to reduce the number of patients or the effect of disease in the population?

1) Concept and Traditional Methodology: Disease prevention focuses on strategies that minimize future hazards to health through early detection and prevention of disease. An effective disease management strategy reduces the risks from disease, slows its progression and reduces symptoms. It is the most efficient and affordable way to reduce the risk of disease. Preventive strategies are divided into stages: primary, secondary and tertiary, and prevention can be applied at any level along the disease history, with the goal of preventing further progression. Primary: seeks to reduce the occurrence of new cases, e.g. stress management, exercise, smoking cessation to prevent lung cancer, and immunization against communicable diseases. It is therefore most applicable at the suspected (pre-disease) stage. Primary prevention strategies include risk factor reduction, general health promotion and other protective measures, achieved by encouraging healthier lifestyles and environmental health approaches through health education and promotion programs. Secondary: the purpose of secondary prevention is to cure the disease, slow its progression, or reduce its impact, and it is most appropriate for those in the early or pre-symptomatic stage of disease. It attempts to reduce the number of cases through early detection and by reducing or halting progression, e.g. detection of coronary heart disease patients after their first heart attack, blood tests for lead exposure, eye tests for glaucoma, and lifestyle and dietary modification. A common approach to secondary prevention is screening to detect and treat preclinical pathological changes early, e.g. mammography for early-stage breast cancer detection. Tertiary: the key aim of tertiary prevention is to enhance the patient's quality of life. Once the disease is firmly established and has been treated in its acute clinical phase, tertiary prevention seeks to soften the impact of the disease on the patient through therapy and rehabilitation, e.g. tight control of type-1 diabetes, helping a cardiac patient to lose weight, and improving the functioning of a stroke patient through a rehabilitation program. The prevention levels are summarized in Table II.

Effective primary prevention to avert new cases, secondary prevention for early detection and treatment, and tertiary prevention for better disease management not only improve quality of life but also help reduce unnecessary healthcare utilization. Extensive medical knowledge and clinical expertise are required to predict the probability of a patient contracting a disease.

2) Challenges: Manual analysis of huge and complex volumes of data is expensive as well as impractical. Data mining provides great benefits for disease identification and treatment; however, there are several limitations and challenges in adopting data mining analysis techniques. Successful prevention depends upon knowledge of disease causation, transmission dynamics, risk factor and risk group identification, early detection and treatment methods, implementation of these methods, and continuous evaluation and development of prevention and treatment methods. Additionally, data accessibility (data integration) and data constraints (missing, unstructured, corrupted, non-standardized and noisy data) add further challenges. Given the huge number of patients, it is impossible to consider all of these parameters manually when developing a cost-effective and efficient prevention system. The expansion of medical record databases and the increased linkage between physician, patient and health record have led researchers to develop efficient prevention systems.


TABLE I. BIG DATA ANALYTICS TOOLS

Platform / Tool | Description
Advanced Data Visualization (ADV) | Can reduce quality problems that occur when retrieving medical data for further analysis.
Presto | Distributed SQL query engine used to analyze the huge amounts of data collected every single day.
Hadoop Distributed File System (HDFS) | Provides the underlying storage for the Hadoop cluster and enhances healthcare data analytics systems by dividing large amounts of data into smaller blocks distributed across various servers/nodes.
MapReduce | Breaks a task into subtasks and gathers their outputs; efficient for large amounts of data.
Mahout | An Apache project whose goal is to provide free implementations of distributed and scalable machine learning algorithms that support healthcare data analytics on Hadoop systems.
Jaql | Functional, declarative query language aimed at processing large data sets; facilitates parallel processing by converting high-level queries into low-level ones.
Pig and Pig Latin | Configured to assimilate all types of data (structured, unstructured, etc.).
Avro | Facilitates data encoding and serialization and improves data structure by specifying data types, meaning and schema.
ZooKeeper | Allows a centralized infrastructure with various services, providing synchronization across a cluster of servers.
Hive | A runtime Hadoop support architecture that permits Hive Query Language (HQL) statements akin to typical SQL statements.

TABLE II. PREVENTION LEVELS (LEAVELL'S LEVELS OF PREVENTION)

Stage of Disease | Prevention Level | Type of Response
Pre-disease | Primary prevention | Specific protection and health promotion
Latent disease | Secondary prevention | Pre-symptomatic diagnosis and treatment
Symptomatic disease | Tertiary prevention | Disability limitation for early symptomatic disease

Healthcare applications generate mounds of complex data. To transform this data into information for decision making, traditional approaches lag behind, and they barely adopt advanced information technologies such as data mining, data analytics and big data. Tremendous advancements in hardware, software and communication technologies open up opportunities for innovative prevention, providing cost-saving and cost-effective solutions by improving health outcomes, properly analyzing risk and avoiding duplicated effort. Barriers to developing such systems include non-standard (interoperability), heterogeneous, unstructured, missing or incomplete, and noisy or incorrect data.

Disease prevention depends heavily on data interchange across different healthcare systems, so interoperability plays a major role in the success of a prevention system, yet the healthcare sector is still on the way there. ISO/TC 215 includes standards for disease prevention and promotion. Standards are a critical component but are not yet mature in the healthcare sector. Many stakeholders (HL7, ISO and IHTSDO, the organization that maintains SNOMED CT), aiming at a common data representation, are working to address semantic interoperability. Healthcare data is diverse and comes in different formats. Moreover, the rapid use of wearable sensors in healthcare has led to a tremendous increase in the size of heterogeneous data. For effective prevention methods, data integration is required. For years, clinical documentation practice has trained clinicians to record data in the most convenient way, irrespective of how this data could later be aggregated and analyzed. Electronic health record systems attempt to standardize data collection, but clinicians are often reluctant to adopt them for documentation.

The accuracy of data analysis depends significantly on the correctness and completeness of the database. It is a big challenge to find problems in data and even harder to correct them; moreover, data is often missing. Using incorrect data will certainly produce incorrect results, whereas ignoring incorrect data, or the issue of missing data, introduces bias into the analysis and leads to inaccurate conclusions. To extract useful knowledge from large volumes of complex data containing missing and incorrect values, we need sophisticated methods for data analysis and association. Data privacy and liability, capital cost and technical issues are other factors. Data privacy is a major hurdle in the development of prevention systems. Most healthcare organizations have HIPAA certification; however, this does not guarantee the privacy and security of data, as HIPAA addresses security and policy rather than implementation [145]. The increased popularity of wearable devices and mobiles, and the online availability of healthcare data, expose it to emerging threats. In addition, such systems may increase racial and ethnic disparities because they may not be equally available due to economic barriers.

The rest of the paper is organized as follows: Section 2 describes existing prevention methodologies, categorized into three subsections: nutrients, policies and HIT. Section 3 presents data mining developments for disease prevention, followed by data analytics based disease prevention applications in Section 4. Finally, some openly available medical datasets are discussed in Section 5, and open issues and research challenges are presented in Section 6.

II. EXISTING DISEASE PREVENTION METHODOLOGIES

Although chronic diseases are among the most common and costly health problems, they are also the most preventable. Early identification and prevention is the most effective and affordable way to reduce morbidity and mortality, and it helps to improve quality of life [191]. Beyond data mining, several other prevention methods are in use to reduce risk factors.

A. Nutrients, Foods, and Medicine

Diet acts as a medical intervention to maintain health and to prevent and treat disease. It is a major lifestyle factor that contributes extensively to the prevention of diseases such as diabetes, cancer, cardiovascular disease, metabolic syndrome and obesity. Poor diet and an inactive lifestyle are a lethal combination. The joint WHO/FAO expert consultation on diet, nutrition and the prevention of chronic diseases states that chronic diseases are preventable and that developing countries are facing the consequences of nutritionally compromised diets [199]. Individuals have the power to reduce the risk of chronic disease by making positive changes in lifestyle and diet. Tobacco use, unhealthy diet and lack of physical activity are associated with many chronic conditions. Evidence shows that a healthy diet and physical activity not only influence present health but also help to decrease morbidity and mortality. Specific diet and lifestyle changes and their benefits are summarized in Table III. Food and nutrition interventions can be effective at any prevention stage. At the primary stage, food and nutrition therapy can be used to prevent the occurrence of diseases such as obesity. What if a disease is already identified, and how can diet help to reduce its effect? Recent studies show that food and nutrition interventions as secondary and tertiary prevention are also effective preventive strategies that reduce risk factors, slow progression and mitigate symptoms and complications. Thus, at the secondary stage, they can be used to reduce the impact of a disease, and at the tertiary stage they help to reduce complications, e.g. stomach ulcer. Dietitians play a critical role in disease prevention, e.g. a change in lifestyle can help to delay or prevent type-2 diabetes.


B. Policy, Systems and Environmental Change

After many years of focus on the individual, policy, systems and environmental changes are a new way of thinking about improving the quality of the healthcare sector. They affect large segments of the world population simultaneously. Disease prevention is much easier if we develop an environment that helps the community adopt a healthy lifestyle, proper nutrition, appropriate medication and so on. Developed nations are promoting social, environmental, policy and systems approaches to support healthy living, such as low-fat options at restaurants, smoking-restricted areas, increased prices of tobacco items, healthy food requirements for students, and urban infrastructure design that leads to lifestyle change, e.g. increased physical activity. United Nations Economic Commission for Europe (UNECE) member states have committed to implementing health policies that ensure increased longevity with quality of life [2]. Some good examples are the 5 A DAY programme to increase healthy nutrition in the UK, the WHO Age-friendly Cities initiative, Lifetime Homes in the UK, and a support programme for dementia patients living alone in Germany [3].

Policy change includes new rules and regulations at the legislative or organizational level, e.g. taxes on tobacco items and soft drinks, time off during office hours for physical activity, or changing community park laws to allow fruit trees. Systems change includes changes in strategies within organizations, such as improving school infrastructure or transportation systems. Environmental change includes changes to the physical environment, such as incorporating sidewalks and recreation areas in communities and offering healthy food in restaurants.

C. Information Technologies

The Health Information Technology (HIT) strategy is to put information technology to work in the healthcare sector in order to reduce healthcare cost and increase efficiency. HIT makes it possible to obtain maximum benefit for patients, healthcare organizations and government through intelligent processing of patient information. The marriage of healthcare and information technology includes a variety of electronic approaches that are used not only to manage information but also to improve the quality of clinical and other preventive services, e.g. disease prevention, early disease detection, risk factor reduction and complication management. Healthcare transactions generate huge amounts of data. This expansion of medical record databases and the increased linkage between physician, patient and health record have led researchers to develop efficient disease prevention systems. Thus, there is a need to transform health data into information through the use of innovative, collaborative and cost-effective informatics and information technology. IT provides strategic value for achieving health impact and health quality by transforming these mounds of data into information. Computer-based disease control and prevention is an ongoing area of interest to the healthcare community. Disparities in access to health information can affect preventive services. Increased access to technology (e.g. handheld devices and fast communication) and the availability of online health information reduce disease risk, enabling users to access health information and make good decisions. Increased use of HIT tools (e.g. reminders, virtual reality applications and decision support systems) helps to reduce risk factors.

Effective disease prevention requires identifying and treating individuals at risk. Several preventive measures are used to improve therapy adherence. Relevant information such as blood pressure, cholesterol measurements and fasting plasma glucose is recorded electronically, which makes automatic risk prediction possible. Moreover, the rapid growth of cellular networks provides an opportunity to stay in touch with patients [169].


TABLE III. CONVINCING AND PROBABLE RELATIONSHIPS BETWEEN DIETARY AND LIFESTYLE FACTORS AND CHRONIC DISEASES [215]

Diseases covered: CVD, type-2 diabetes, cancer, dental disease, fracture, cataract, birth defect, obesity, metabolic syndrome, depression, sexual dysfunction.
Lifestyle factors: avoiding smoking, physical activity, avoiding overweight.
Dietary factors: using healthy fats, fruits and vegetables, using whole grains, reducing sugar, reducing calories, reducing sodium.
For each factor, the table marks whether it decreases (↓) or increases (↑) the risk of each disease.

Reminder-based preventive services involve continuous risk assessment for several life-threatening diseases, e.g. cardiac disease [81, 55, 213, 214] and diabetes [56, 97, 61, 53], through vital signs [81, 55, 56, 97, 61], reminders that specific preventive services such as vaccination, follow-up appointments or weight-loss sessions are due [64, 35, 109, 88], or medication reminders [64, 139, 35]. This communication can use any electronic medium, e.g. calls, SMS or email. Studies have found that automated reminder services are very effective in boosting patient adherence [149, 174]. Healthcare databases contain cardiac disease data, but practical methods are required to identify patients at risk and continuously monitor them through electronic vital-sign data, e.g. blood pressure, cholesterol and glucose. Several studies have used automated reminders based on collected primary care data to reduce risk [81, 55]. Text message based interventions for supporting diabetic patients help to improve self-management behaviors and achieve better glycemic control [56, 97, 61]. Research shows that reminders (through email, SMS, calls, social media, wearable devices, etc.) based on individual data can be leveraged to reduce risk. Integrating wearable sensors, EHRs and expert systems with automatic reminders would yield an efficient disease prevention system.
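To illustrate the reminder-based pattern described above, here is a minimal sketch of a risk check that flags out-of-range vital signs and issues a reminder; the thresholds, field names and send_reminder stub are hypothetical, not part of any system surveyed here.

```python
# Illustrative sketch of a reminder-based risk check. Thresholds, field names
# and the send_reminder stub are hypothetical; real systems would use
# clinically validated risk scores and an actual messaging gateway.
from dataclasses import dataclass

@dataclass
class Vitals:
    patient_id: str
    systolic_bp: float      # mmHg
    fasting_glucose: float  # mg/dL
    phone: str

def assess_risk(v: Vitals) -> list[str]:
    """Return a list of risk flags for one patient's latest vital signs."""
    flags = []
    if v.systolic_bp >= 140:
        flags.append("elevated blood pressure")
    if v.fasting_glucose >= 126:
        flags.append("elevated fasting glucose")
    return flags

def send_reminder(v: Vitals, flags: list[str]) -> None:
    # Stand-in for an SMS/email gateway call.
    message = f"Please book a follow-up: {', '.join(flags)}."
    print(f"SMS to {v.phone}: {message}")

for v in [Vitals("p1", 150, 110, "+61-000-0001"),
          Vitals("p2", 120, 98, "+61-000-0002")]:
    flags = assess_risk(v)
    if flags:
        send_reminder(v, flags)
```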

The wish for efficient healthcare is coming within reach with the advent of smart wearable technology. These sensors are reliable and efficient for real-time data collection and analysis [185, 231, 9] and are thus a very good basis for preventive methods for several threatening diseases, e.g. cardiovascular [28, 144, 211], cardiopulmonary [230, 212], diabetes [110, 85] and neurological conditions [54]. Several companies (e.g. Medtronic, BottomLine, Medisafe, ImPACT, Actismile, Allianz and AiQ) provide low-cost wearable solutions for concussion management, cardiovascular monitoring, reminders, diabetes, overweight, metabolism and more. Recent developments in intelligent data analysis methods, nano-electronics, communication and sensor technology have made possible the development of small wearable devices for healthcare monitoring. Wearable sensors are able to collect most of the useful medical data required for early identification and prevention, but they are still costly, and most healthcare systems are not yet able to process this data. In the near future, wearable devices will become cheaper (a few dollars), smaller (almost invisible, e.g. sensor-clad smart garments and implanted devices), more accurate (no approximation of data) and more powerful (low battery consumption, high communication and processing power). Moreover, with the recent development of high-processing sensors, research is shifting from simple reasoning to high-level data processing by applying pattern recognition and data mining methodologies to provide more valuable information, e.g. fall detection or heart attack detection via data processing, with messages sent to emergency services or family members for immediate action.

III. ALGORITHMS AND METHODS

Biomedical data is complicated and is getting larger day by day. Efficient analysis of this huge amount of data provides a large amount of useful information that can be used for health promotion and prevention. Traditional methods deal with primary data and fail to analyze big data. Thus the question arises: how do we analyze it efficiently? The answer is that we need much smarter algorithms. The major objective of data analytics is to find hidden patterns and interesting relations in the data. The significance of such approaches is to provide timely identification with a small number of clinical attributes [104]. This secondary data is then used for important decision making.

Healthcare data analytics is a multidisciplinary area spanning data mining, data analytics, big data, machine learning and pattern recognition [225]. Intelligent disease control and prevention is an ongoing area of interest to the healthcare community. The basic goal of data analytics based disease prevention is to take real-world patient data and help reduce the number of patients at risk. Recent developments in machine learning and data analytics for handling complex data open new opportunities for cost-effective and efficient prevention methods that can handle really big data. Data mining provides the methodology and technology to transform these mounds of data into useful information for decision making by exploring and modelling big data [218]. This section is divided into four subsections based on data mining functionalities in healthcare. Our goal here is not to go into great detail but to show the basic concepts and differences between algorithms.

• Classification: finding a set of variables to classify data into various types, e.g. disease identification or medication.
• Clustering: grouping or distinguishing data for decision making, e.g. finding patterns in EHRs or predicting re-admission.
• Association analysis: discovering interesting, hidden relations in medical data, e.g. frequently co-occurring diseases.
• Anomaly detection: identifying abnormalities that do not follow any expected pattern.

A. Classification

The success of disease prevention rests on early detection of disease. Accurate disease identification is necessary to help physicians treat it with proper medication. The goal of classification is to accurately predict the target class for unseen data; classification is the process of assigning a given object to one of a set of predefined classes. There has always been debate about the selection of the best classification algorithm. Several statistical and machine learning based contributions have been made to disease prediction, and a wide range of classifiers (shown in Figure 2) have been used for disease classification, e.g. decision trees [99, 203, 22, 90], SVM [121, 23, 207], Naive Bayes [99, 198], neural networks [205], k-nearest neighbour [99, 106] and PCA [164, 168].

Fig. 2. Classification Methodologies

1) Decision Tree: A decision tree is used as a prediction model that maps a sequence of observations to classified items. The classification method uses a tree-like graph and is based on sorting feature values. Each internal node represents a feature of the instance to be classified, and each branch represents a value that the node can take. Nodes may have two or more children; internal nodes are drawn as rectangles and labeled with input features; leaf nodes are drawn as ovals and labeled with a class or a probability distribution over classes; classification rules are represented by paths from the root through internal nodes to a leaf. Decision trees are of two types: regression trees (the predicted outcome is a real number) and classification trees (the predicted outcome is a class). Popular decision tree algorithms used for data mining include ID3 (C4.5, SPRINT), AID (CHAID), CART and MARS.

ID3, or Iterative Dichotomiser 3, is used to generate a decision tree from a dataset and is one of the simplest learning methods [152]. C4.5 is the extension of ID3 and uses gain ratio as its splitting criterion, whereas ID3 uses binary splitting [151]. Unlike ID3, it can handle data with missing values and attributes with differing costs, and it handles both discrete and continuous attributes. It became quite popular and was ranked No. 1 in a list of the top 10 algorithms in data mining. C5.0 is the extension of C4.5 and introduces new features such as misclassification costs (weighting classification errors by their importance) and case weight attributes (quantifying the importance of each case). Boosting with misclassification costs leads to improved predictive accuracy [63]. It also supports cross-validation and sampling, and adds several new data types, including 'not applicable' for missing data values.

ID3 and its descendants construct a flowchart-like tree structure and are based on a recursive, divide-and-conquer algorithm:

• If the set S is small, or all examples in S belong to the same class C, return a new leaf and label it with class C.
• Otherwise, run a test based on a single attribute with two or more outcomes.
• Make this test the root of the tree, with one branch for each outcome.
• Partition S into subsets S1, S2, ... according to the outcomes, and apply the same procedure recursively to each subset.
• Stop when all instances in a subset have the same class.
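As a concrete illustration of tree induction on clinical attributes, the sketch below trains a small classification tree with scikit-learn, which implements an optimized CART variant rather than ID3/C4.5; the toy features and labels are hypothetical.

```python
# Minimal decision-tree sketch using scikit-learn's CART implementation
# (not ID3/C4.5, but the induction idea is the same). Features and labels
# below are made up purely for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: age, BMI, fasting glucose (mg/dL)
X = [[45, 31.0, 150],
     [34, 22.5, 90],
     [60, 28.4, 180],
     [25, 20.1, 85],
     [52, 35.2, 200],
     [41, 24.0, 95]]
y = [1, 0, 1, 0, 1, 0]  # 1 = diabetic, 0 = non-diabetic (toy labels)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(X, y)

# Inspect the learned splits and classify a new patient.
print(export_text(tree, feature_names=["age", "bmi", "glucose"]))
print(tree.predict([[50, 27.0, 160]]))  # predicted class for an unseen case
```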

2) Support Vector Machine: SVM is a supervised learning approach based on statistical learning theory [52]. It is one of the simplest, most robust, most accurate and most successful classification methods and has delivered high performance on complex classification problems, especially in healthcare. Due to its discriminative power, particularly for small datasets with large numbers of variables, and its data-driven, model-free nature, it has been widely used for disease identification and classification problems. Unlike logistic regression, which relies on a pre-determined model to predict a binary event by fitting data to a logistic curve, SVM produces a binary classifier that discriminates between two classes by separating them with a hyperplane obtained through a nonlinear mapping of the data into a high-dimensional feature space. Unlike regression-based methods, SVM allows more input features than samples, so it is particularly suitable for classifying high-dimensional data. Moreover, some kernels even allow SVMs to be considered a dimensionality reduction technique.

SVM hyperplanes are the decision boundaries between two different classes. SVM uses support vectors to find the hyperplane with maximum margin from the set of possible hyperplanes. In the ideal case two classes can be separated by a linear hyperplane, but real-world data is rarely that simple. Due to limitations of SVM such as binary-only classification, memory requirements and computational complexity, several variations have been presented, such as GVSM, RSVM, LSVM [127], TWSVM (twin support vector machine), VaR-SVM (value-at-risk support vector machine), SMM [122], SSMM [229], RSMM [165, 163] and cooperative-evolution SVM [162]. Using kernel functions (polynomial, linear, sigmoid and radial) to add dimensions to a low-dimensional space, a two-class problem can become separable in a higher-dimensional space. Compared to other classification methods, SVM training is extremely slow and computationally complex; however, it consistently ranks among the best classifiers.
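The kernel point above can be shown with a small sketch: on synthetic ring-shaped data (purely illustrative), a linear kernel cannot separate the classes, while a radial kernel can.

```python
# Sketch of kernel choice for an SVM on a toy two-class problem where a
# linear hyperplane is not enough. Data is synthetic and purely illustrative.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)  # ring-shaped classes

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

# The radial kernel maps the data into a space where the classes separate.
print("linear kernel accuracy:", linear_svm.score(X, y))
print("rbf kernel accuracy:   ", rbf_svm.score(X, y))
```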

3) Machine Learning: Machine learning has emerged as an advanced data analytics tool and has recently been successfully applied in a variety of applications, including medical diagnosis assistance for physicians. Why is machine learning important for data analytics when state-of-the-art analytics algorithms are already available? The essence of the argument is that, in cases where other analytics methods do not produce satisfactory predictive results, machine learning can improve the generalization ability of the system by training on data annotated by human experts. The predictive accuracy obtained using machine learning is often slightly higher than that of other analytics methods or human experts [102], and it is highly tolerant of noisy data. Despite these facts, neural networks were long not the preferred choice for data analytics because they are quite slow to train and classify, making them impractical for large data, and trained neural networks are usually not comprehensible [167]. However, due to recent advances in computational power, neural networks are increasingly favored for data analytics applications. Important neural network algorithms include the backpropagation neural network (BPNN), associative memory networks and recurrent networks.

Deep learning, also known as hierarchical learning, is a branch of machine learning based on algorithms with one or more hidden layers that automatically extract high-level, complex abstractions as data representations [135]. Data is passed through multiple layers in a hierarchical manner and each layer applies a nonlinear transformation to it. With its striking empirical results over the past several years, it is one of the best predictive approaches for big data. The main advantage of deep learning is the automatic analysis and learning of huge amounts of unlabeled data, typically learning data representations in a greedy layer-wise fashion [80] [135]. Several attempts have been made at disease prevention using deep learning methods [148, 159]. An important question is whether to use deep learning for medical data analytics at all. Since most healthcare data is unstructured and unlabeled, the automatic high-level representation learning of deep networks can be used effectively for prediction problems, but it needs a huge volume of data for training.
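A minimal sketch of a feed-forward network on tabular clinical features follows, using scikit-learn's MLPClassifier; the two hidden layers stand in for the stacked nonlinear transformations described above, and the data is synthetic.

```python
# Small feed-forward network on synthetic "clinical" features, illustrating
# the stacked nonlinear layers described above. Real deep learning work would
# use far more data and a framework such as TensorFlow or PyTorch.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))                  # 8 hypothetical clinical features
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)  # synthetic outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = MLPClassifier(hidden_layer_sizes=(64, 32),  # two hidden layers
                      activation="relu",
                      max_iter=1000,
                      random_state=1)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```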

4) Bayesian Networks: A Bayesian network, Bayes network or Bayes model is a probabilistic directed acyclic graph that encodes probabilistic relationships among variables of interest. Nodes represent random variables, and their conditional dependencies are represented by directed edges. Unconnected nodes represent conditionally independent variables; an example use is modelling the probabilistic relationships between diseases and their symptoms. Each node is associated with a probability distribution function that takes a set of parent values as input and outputs the probability of the variable represented by the node. Advantages of Bayesian networks are model interpretability, efficiency for complex data, and the need for only small training datasets; for example, a Bayes classifier can diagnose the correct disease based on a patient's observed symptoms. Important Bayesian network algorithms include naive Bayes, semi-naive Bayes, selective naive Bayes, one-dependence Bayesian classifiers, unrestricted Bayesian classifiers, Bayesian multinets, Bayesian network-augmented naive Bayes and K-dependence Bayesian classifiers.

Naive Bayes has proved to be a powerful supervised learning algorithm for classification problems and has been used extensively for healthcare data analysis, especially for disease prevention. It is built upon the strong assumption that features are independent of each other (so that the classifier can be simple and fast): it assumes that the effect of a variable value on a given class is independent of the other variable values. Typically, Naive Bayes uses maximum likelihood for parameter estimation, and classification is done by taking the class with the highest posterior probability. Given a set of variables belonging to known classes, the aim is to construct rules for classifying unknown data. A limitation of Naive Bayes is its sensitivity to correlated features. Selective naive Bayes uses only a subset of the given attributes in making predictions [117]; it reduces the strong bias of the naive independence assumption through variable selection [38]. The objective is to find, among all variables, the best classifier compliant with the naive Bayes assumption. Several selection methods have been presented [150] [117] [37] [38]. The selective naive approach works well for datasets with a reasonable number of variables, but it does not scale to large datasets with many variables and instances [38].
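The following sketch shows the highest-posterior decision rule on toy symptom measurements; the features, units and labels are hypothetical.

```python
# Gaussian Naive Bayes on toy symptom/measurement data: the classifier treats
# each feature as conditionally independent given the class, as described
# above. Features and labels are hypothetical.
from sklearn.naive_bayes import GaussianNB

# Columns: body temperature (°C), white blood cell count (x10^9/L)
X = [[36.7, 6.0], [39.1, 14.5], [38.6, 13.0], [36.5, 5.5], [39.4, 15.2]]
y = ["healthy", "infection", "infection", "healthy", "infection"]

nb = GaussianNB()
nb.fit(X, y)

# Posterior probabilities for a new patient; the predicted class is the
# one with the highest posterior.
print(nb.predict([[38.9, 12.8]]))
print(nb.predict_proba([[38.9, 12.8]]))
```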

5) K-Nearest Neighbour: The k-nearest neighbour (k-NN) algorithm, a non-parametric classification and regression method, is based on nearest-neighbour search and is one of the top 10 data mining algorithms [217]. The aim is to output a class membership. k-NN finds the group of k objects in the training dataset that are closest to the test object and assigns a label according to the predominant class in this neighborhood. The k nearest neighbours are identified by computing the distance from the object to the labeled objects, and the calculated distances are used to assign the class. The main advantages of k-NN are its simple implementation, robustness with regard to the search space, online updating of the classifier and few parameters to tune.

There are many improvements to the traditional k-NN algorithm, such as Wavelet-Based K-Nearest Neighbor Partial Distance Search, Equal-Average Nearest Neighbor Search, Equal-Average Equal-Norm Nearest Neighbor Codeword Search, Equal-Average Equal-Variance Equal-Norm Nearest Neighbor Search and several other variations.
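A short sketch of the majority-vote rule on toy clinical features follows; the data is hypothetical.

```python
# k-NN sketch: a new case is labeled by majority vote of its k closest
# training cases. Toy clinical features, purely for illustration.
from sklearn.neighbors import KNeighborsClassifier

# Columns: age, BMI
X = [[45, 31.0], [34, 22.5], [60, 28.4], [25, 20.1], [52, 35.2], [41, 24.0]]
y = [1, 0, 1, 0, 1, 0]  # toy disease labels

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)

# The 3 nearest training cases to this patient decide the prediction.
print(knn.predict([[50, 27.0]]))
print(knn.kneighbors([[50, 27.0]], return_distance=True))
```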

Guidelines: The complexity in classifying medical data arises from its uncertainty and high dimensionality. Studies [216, 227, 47] show that there is a large amount of redundancy, many missing values and many irrelevant variables in healthcare data, all of which can affect the accuracy of disease prediction. Conventional wisdom is that larger corpora yield better predictive accuracy, but data redundancy and irrelevance introduce bias rather than benefit and distort the learned models. Thus, before applying any prediction model, the dataset needs to be processed carefully by removing redundant data (e.g. age together with date of birth, or repeated lab reports) as well as irrelevant data (e.g. a gender attribute in the case of gestational diabetes). However, little is known about the redundancy that exists in the data or about what type of redundancy is beneficial as opposed to harmful. Statistical methods (e.g. correlation analysis) can be used to identify irrelevant variables; an alternative is feature selection based on the importance of specific variables. To deal with missing values, it is recommended to use analysis methods that are robust to missingness, which are appropriate when we are confident that mild to moderate violations of the technique's key assumptions will produce little or no bias in the resulting conclusions. As healthcare data is highly sensitive, a drawback of these methods is that dropping missing or apparently irrelevant information may distort important relationships between dependent and independent variables. For example, smoking, drinking and helicobacter pylori individually might not affect stomach cancer significantly, whereas together they could have a significant effect [225]. Thus, efficient approaches are required for data preprocessing to avoid ignoring such associated data, as deleting such dependent variables will significantly affect predictive power.

As discussed above, no generic classifier works for all types of data; the choice varies with the data and the nature of the problem. For small, labelled data, a classifier with high bias (e.g. Naive Bayes) is a good choice. If the data is very large, the particular classifier matters less, and selection is based on scalability and run-time efficiency. Classifier performance can be significantly improved through effective feature selection. For big data, deep learning is a good choice as it learns features automatically. For better prediction, different feature sets and a couple of classifiers should be selected and tested, and the classifier that beats all the others can then be selected for prediction. A sketch of this kind of comparison is given below.
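The sketch below follows the guideline just stated: impute missing values, select features, and compare two candidate classifiers with cross-validation before committing to one. The dataset is synthetic and the pipeline settings are illustrative assumptions, not recommendations from the surveyed work.

```python
# Impute missing values, select features, and compare a couple of classifiers
# with cross-validation before committing to one. Synthetic data only.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X_full = rng.normal(size=(300, 10))                       # 10 candidate clinical features
y = (X_full[:, 0] + 0.5 * X_full[:, 3] > 0).astype(int)   # synthetic outcome
X = X_full.copy()
X[rng.random(X.shape) < 0.05] = np.nan                    # inject ~5% missing values

candidates = {
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=2),
}

for name, clf in candidates.items():
    pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # handle missing values
        ("select", SelectKBest(f_classif, k=5)),       # drop irrelevant features
        ("clf", clf),
    ])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy {scores.mean():.3f}")
```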

B. Clustering

Clustering is a descriptive data analysis task that partitions data into homogeneous groups based on the information found in the data, so that the groups best describe the data and its relationships, e.g. classifying patients into groups. Organizing data into clusters reveals the internal structure of the data. The goal is to group individuals or objects that resemble each other for the purpose of improved understanding. The greater the dissimilarity between groups and the greater the similarity within a group, the better the clustering; a new observation is assigned to the cluster with the closest match and is assumed to have properties similar to the others in that cluster. Clustering is unsupervised learning that observes only independent variables; unlike classification, it has no training stage and does not use class labels.

The heart of clustering analysis is the selection of the clustering algorithm. Selecting a proper clustering method is important because different methods tend to find different types of cluster structure, and the choice depends on the type of data. A number of clustering methods have been proposed and widely used across disciplines, e.g. for model fitting, data exploration, data reduction, grouping similar entities and prediction based on groups. They can be categorized into partitional (unnested), hierarchical (nested), grid-based, density-based and subspace clustering, plus other approaches such as fuzzy and conceptual clustering, as shown in Table IV.

Partitional clustering requires a preset number of clusters and decomposes a set of N observations into k disjoint clusters by moving observations from one cluster to another, starting from an initial partitioning. It classifies the data into k clusters subject to two requirements: each observation belongs to exactly one cluster, and each cluster contains at least one observation. An observation may belong to more than one cluster in fuzzy partitioning [186]. One of the most important issues in clustering is the selection of the number of clusters, which is complicated when no prior information exists. Other issues are the selection of proper parameters and of a proper clustering algorithm: an inappropriate choice of algorithm or a wrong choice of parameters may mean the clustering does not reflect the desired data partition [186]. Not all methods are applicable to all types of problem, so algorithm selection depends on the problem. Several contributions have been made to address this issue [92] [143] [91] [131] [44] [186]. One of the most common partitional algorithms is k-means. It uses an iterative refinement technique and attempts to minimize the dissimilarity between each element and the center of its cluster: k-means partitions the data into k clusters represented by their centers, where each cluster center is computed as the mean of all instances belonging to that cluster. k-medoids, k-medians, k-means++, Minkowski weighted k-means, bisecting k-means and spherical k-means are some variations of k-means; a short k-means sketch follows Table IV.

TABLE IV. CLUSTERING ALGORITHMS

Category | Algorithms
Partition | k-means, k-medoids, weighted k-means, CLARA, CLARANS, PAM, etc.
Hierarchy | Agglomerative algorithms (CURE, CHAMELEON, ROCK); divisive algorithms (average-link divisive, PDDP, etc.)
Distribution | DBCLASD, GMM, etc.
Density | DBSCAN, OPTICS, mean-shift, etc.
Fuzzy | FCS, FCM, MM, etc.
Grid | STING, CLIQUE, etc.
Graph | CLICK, MST, etc.
Fractal | FC, etc.
Model | GMM, SOM, ART, genetic algorithms, etc.
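The k-means sketch referenced above is shown here on synthetic patient measurements; the two groups, the feature names and the choice of k are all illustrative assumptions.

```python
# k-means sketch on synthetic patient measurements: each point is assigned to
# the nearest of k cluster centers, and centers are recomputed as cluster
# means until assignments stabilize.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Two synthetic patient groups (e.g. systolic BP, fasting glucose)
group_a = rng.normal(loc=[120, 90], scale=[8, 6], size=(50, 2))
group_b = rng.normal(loc=[155, 160], scale=[10, 15], size=(50, 2))
X = np.vstack([group_a, group_b])

# Standardize so no single measurement dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=3)
labels = kmeans.fit_predict(X_scaled)

print("cluster sizes:", np.bincount(labels))
print("cluster centers (scaled units):", kmeans.cluster_centers_)
```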

Hierarchical clustering, also called hierarchical cluster analysis (HCA), is a clustering method that seeks to build a hierarchy of clusters (permitting clusters to have sub-clusters) and can be viewed as a sequence of partitional clusterings. Based on pairwise distances between sets of observations, HCA successively merges (hierarchically groups) the most similar observations until a termination condition holds. Unlike partitional methods, it does not assume a particular value of k. The first step in hierarchical clustering is to find the most similar (closest) pair, which is then joined into a clustering tree. Division or merging of clusters is performed according to a similarity measure based on an optimality criterion. Strategies for hierarchical clustering generally fall into two types, agglomerative and divisive, also known as bottom-up and top-down respectively. Divisive clustering starts with a single cluster containing all observations and recursively considers all possible ways to split it into sub-clusters, whereas agglomerative clustering starts with one-point clusters and recursively merges two or more clusters into a new cluster in a bottom-up fashion. The main disadvantages of hierarchical methods are their inability to scale and the absence of back-tracking [170]; standard hierarchical approaches suffer from high computational complexity. Typical hierarchical algorithms include BIRCH, CURE and ROCK, and several approaches have been proposed to improve their performance. In fuzzy clustering, also called soft clustering, each data point can belong to more than one cluster: a continuous membership value in the interval [0, 1] is used instead of a discrete value in order to describe the relationship more faithfully. The most widely used fuzzy clustering algorithms are fuzzy C-means (FCM), fuzzy C-shells and the mountain method (MM); fuzzy C-means obtains the membership of each data point in every cluster by optimizing an objective function. Distribution-based clustering is an iterative, fast and natural way to cluster large datasets. It automatically determines the number of clusters and produces complex models for clusters that capture the dependences and correlations of data attributes: data generated from the same distribution belongs to the same cluster when several distributions exist in the original data [219]. GMMs fitted with expectation maximization, and DBCLASD, are among the most prominent methods. Density-based clustering identifies distinctive clusters as high-density regions (contiguous regions of high density separated from other clusters by low-density regions). Typical density-based methods include DBSCAN, OPTICS and mean-shift; DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the best-known density-based algorithm and does not require the number of clusters as a parameter. Further detail on clustering algorithms can be found in [219].
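The sketch below contrasts the two families just described on synthetic data: agglomerative clustering cut by a distance threshold (so no preset k) and DBSCAN, which discovers dense regions and marks sparse points as noise. Thresholds and parameters are illustrative assumptions.

```python
# Agglomerative (hierarchical) clustering vs DBSCAN on synthetic data.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN

rng = np.random.default_rng(4)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(40, 2)),
    rng.normal(loc=[3, 3], scale=0.3, size=(40, 2)),
    [[10, 10]],  # an outlier
])

# Agglomerative: merge bottom-up, stop merging at a distance threshold.
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=2.0,
                              linkage="average")
print("agglomerative labels:", np.unique(agg.fit_predict(X)))

# DBSCAN: clusters are dense regions; label -1 marks noise (the outlier).
db = DBSCAN(eps=0.8, min_samples=5)
print("dbscan labels:", np.unique(db.fit_predict(X)))
```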

Guidelines: Clustering is normally an option when little or no labelled data is available. Moreover, clustering methods do not concentrate on all requirements simultaneously, which makes the result uncertain. Clustering effectiveness can be affected by outliers, missing attributes, skewness of the data, high dimensionality, distributed data and the choice of clustering method with respect to the application context. Missing values and outliers can significantly affect similarity measurements and computational complexity, so it is recommended to eliminate outliers before clustering. Grouping the data into clusters provides significant information about the objects, but without prior knowledge of the number of clusters, or other information about their composition, cluster analysis is hard to set up. The performance of distance-based clustering is also affected by the distance function used. High dimensionality is another issue: the number of features can be very high and may even exceed the number of samples, and a fixed number of data points becomes increasingly sparse as the dimensionality increases. Moreover, the data may be so skewed that it cannot safely be modelled by normal distributions; in that case, data standardization together with model-based or density-based clustering may perform better.

Selecting a clustering algorithm for a particular problem is quite difficult, and different algorithms may produce different results. When no or little prior information is available, hierarchical clustering can be a good option because it does not require the number of clusters to be predetermined. However, hierarchical clustering methods are computationally expensive and unscalable, and thus not a good option for large data. When the number of samples is high, algorithms have to scale well; successful clustering methods generally scale linearly or log-linearly, and quadratic or cubic scaling is acceptable only for modest data sizes. Healthcare data is generally not purely numerical, so conversion is required to make it usable. Conversion into numerical values can distort the data and affect accuracy: for example, if attribute values A, B and C are converted to the numerical values 1, 2 and 3, the encoding implies that the distance between A and C is 2 while the distance between B and C is 1, a structure that may not exist in the original attribute. A small sketch of this encoding issue follows.
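The sketch below illustrates the categorical-encoding distortion just described; the category values are hypothetical, and one-hot encoding is shown only as one common way to keep categories equidistant.

```python
# Ordinal codes impose artificial distances between category values, while
# one-hot encoding keeps all categories equidistant.
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

values = np.array([["A"], ["B"], ["C"]])

ordinal = OrdinalEncoder().fit_transform(values)
onehot = OneHotEncoder().fit_transform(values).toarray()

# Ordinal: A=0, B=1, C=2, so dist(A, C)=2 but dist(B, C)=1 -- an artifact.
print("ordinal distances:",
      abs(ordinal[0] - ordinal[2]), abs(ordinal[1] - ordinal[2]))

# One-hot: every pair of distinct categories is the same distance apart.
print("one-hot distances:",
      np.linalg.norm(onehot[0] - onehot[2]),
      np.linalg.norm(onehot[1] - onehot[2]))
```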

C. Association Analysis

Association analysis is an unsupervised approach for discovering hidden and interesting relations between variables in large datasets. It is intended to identify strong rules that predict the occurrence of an item based on the occurrence of other items; for example, which drug combinations are associated with adverse events? An association rule is an implication of the form {X} → {Y}, where X ∩ Y = ∅.

Fig. 3. Association Mining Methodologies

The strength of an association rule is measured in terms of its support and confidence (rules are required to satisfy user-specified minimum support and confidence at the same time). Support indicates how frequently a rule applies to a given dataset, and confidence indicates how frequently the rule has been found to be true. Uncovered relations can be represented as association rules or as sets of frequent items. The following rules show, for example, a strong relationship between pregnancy and gestational diabetes, in the same way that diabetes may be associated with retinopathy or heart block with hypertension. The interestingness of association rules is measured through support and confidence: a support of 6% means that pregnancy and gestational diabetes occur together in 6% of all the transactions in the database, and a confidence of 90% means that, of the patients who are pregnant, 90% also suffer from gestational diabetes.

{Pregnancy} → {Gestational diabetes}   [support = 6%, confidence = 90%]

{Pregnancy} → {Type-II diabetes}   [support = 1%, confidence = 8%]

Support and confidence are thus the measures commonly used when extracting association rules. However, it is well known that even rules with strong support and confidence can be uninteresting.
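For reference, support and confidence can be written as follows (standard definitions; the derived figure below simply follows from the 6% and 90% values quoted above):

support(X → Y) = (number of transactions containing X ∪ Y) / (total number of transactions)
confidence(X → Y) = support(X ∪ Y) / support(X)

For the pregnancy rule above, confidence = 90% = 6% / support({Pregnancy}), which implies that roughly 6.7% of all transactions contain pregnancy.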

In healthcare, this kind of association rule can be used to assist physicians in treating patients, or to find relationships between diseases and drugs or the occurrence of other associated diseases, since the occurrence of one disease can lead to several others. Association rules are created by analyzing data for frequent if/then patterns and using the support and confidence criteria to identify the most important relationships. The association rule mining problem is therefore divided into two phases: frequent itemset generation (find all frequent patterns that satisfy the minimum support threshold) and rule generation (extract all high-confidence rules from the frequent patterns). The computational requirements of frequent pattern generation are much higher than those of rule generation, which is straightforward. Finding all frequent itemsets involves searching all possible itemsets in the database and is therefore difficult and computationally expensive; the cost can be reduced by reducing the number of candidates (the Apriori approach) or reducing the number of comparisons (FP-growth, tree generation, etc.).

To design efficient algorithms for association rule computation, several efforts have been presented over time, i.e. sequential (Apriori, AprioriDP, tree-based partitioning), parallel (FP-growth, partition-based approaches such as Par-CSP [50]) and hybrid (Eclat) approaches. The most common algorithm for association rule mining is the Apriori algorithm, an algorithm for frequent itemset mining and association rule learning over transactional databases. It uses breadth-first search and a hash-tree strategy to count the support of itemsets, and uses a candidate generation function that exploits the downward closure property of support. AprioriDP is an extension of Apriori that utilizes dynamic programming in frequent itemset mining. Finding the frequent sets requires several passes, so performance depends on the size of the data; processing high-density data entirely in primary memory is not feasible, but accesses to secondary memory can be reduced by effective partitioning. Tree-based partitioning approaches organise the data into tree structures that can be processed independently [11]. To relieve sequential bottlenecks, parallel frequent pattern discovery algorithms exploit parallel and distributed computing resources. FP-growth counts item occurrences and stores them in a header table, then builds an FP-tree structure by inserting instances. The core of FP-growth is the frequent-pattern tree (FP-tree), a special compact data structure that retains the itemset association information and stores quantitative information about frequent patterns. Further detail on association algorithms can be found in [231, 41].
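The two phases can be illustrated with the following self-contained Python toy (the transactions and thresholds are made up; a real Apriori implementation would generate candidates level by level and prune them with the downward closure property, rather than brute-force enumerating itemsets as done here for brevity).

# Toy illustration of the two phases: frequent itemset generation, then rule
# generation. Brute-force enumeration is only feasible for this tiny example.
from itertools import combinations

transactions = [
    {"pregnancy", "gestational diabetes"},
    {"pregnancy", "gestational diabetes"},
    {"pregnancy", "hypertension"},
    {"hypertension", "heart block"},
]
items = sorted(set().union(*transactions))
min_support, min_confidence = 0.5, 0.6

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Phase 1: frequent itemset generation.
frequent = [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if support(set(c)) >= min_support]

# Phase 2: rule generation from each frequent itemset of size > 1.
for itemset in (f for f in frequent if len(f) > 1):
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            consequent = itemset - antecedent
            conf = support(itemset) / support(antecedent)
            if conf >= min_confidence:
                print(set(antecedent), "->", set(consequent),
                      f"support={support(itemset):.2f} confidence={conf:.2f}")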

Guidelines: Traditionally, association analysis was considered an unsupervised problem and has been applied in knowledge discovery; recent studies have shown that it can also be applied for prediction in classification problems. Association rule generation algorithms must be tailored to the particularities of the prediction task to build effective classifiers. Computational cost can be reduced by sampling the database, adding extra constraints, parallelization and reducing the number of passes over the database. Much progress has been made in recent years on expediting the search for multilevel association rules while avoiding excessive computation. Other issues in association mining are the selection of non-relevant rules, the huge number of generated rules and the selection of a proper algorithm. A small number of rules is used by a domain expert, whereas all rules are used by a classification system. For good accuracy, it is better if non-relevant rules are eliminated by domain experts (e.g. the trivially true rule gestational diabetes → female) and if raw association rules are organized using a concept hierarchy.

D. Anomaly Detection

Anomaly detection is a widely researched area and provides useful and actionable information in a variety of real-world scenarios, such as the timely detection of an epidemic, which can help save human lives, or the monitoring of ECG signals and other body sensors of critical patients to detect critical, possibly life-threatening situations [79]. It is quite challenging to develop a generic framework for anomaly detection due to the unavailability of labelled data and the domain-specific notion of normality; most anomaly detection methods have therefore been developed for a specific domain [124]. It is an important problem with a significant impact on the efficiency of any data mining system: the presence of anomalies in data can compromise data quality and reduce the effectiveness of learning algorithms, so detection is critical and requires a high degree of accuracy. Anomaly detection is the detection of items, events or observations in the data that do not conform to an expected pattern [42]. Broadly speaking, based on their nature, anomalies can be classified into three categories: point anomalies (an individual data instance is anomalous with respect to the rest of the data), contextual anomalies (a data instance is anomalous in a specific context) and collective anomalies (a collection of related data instances is anomalous with respect to the rest of the data set).

Based on the availability of data labels, three broad categories of anomaly detection techniques exist: unsupervised, supervised and semi-supervised anomaly detection. Since fully labelled datasets for anomaly detection are rarely available, most approaches are based on unsupervised or semi-supervised learning, where the purpose is to find abnormal behaviour. Supervised methods use fully labelled data; semi-supervised anomaly detection uses an anomaly-free dataset for training and flags instances that deviate from the trained model; unsupervised detection methods use intrinsic information to detect anomalous values based on their deviation from the majority of the data. Why are supervised learning methods not that successful for anomaly detection? First, not all supervised classification methods suit this task, e.g. C4.5 cannot be applied to unbalanced data [70]. Secondly, anomalies represent abnormal behaviour and may not be known in advance. Semi-supervised detection is a one-class (normal data) approach that is trained on normal data without anomalies, after which deviations from that data are considered anomalies. Unsupervised detection is the most commonly used approach and does not require data labelling; distances or densities are used to estimate abnormalities based on intrinsic properties of the dataset. Clustering is the simplest unsupervised method to identify anomalies in data.

Fig. 4. Anomaly detection methods based on dataset

The performance of a detection method depends on the dataset and on parameter selection. Several anomaly detection methods have been presented in the literature; some of the most popular are SVM, replicator neural networks, correlation-based detection, density-based techniques (k-nearest neighbour, local outlier factor), deviations from association rules and ensemble techniques (using feature bagging or score normalization). The output of an anomaly detection system can be a score or a label [73]. A score (commonly used in unsupervised or semi-supervised settings) yields a ranked list of anomalies, with each instance assigned a score reflecting the degree to which it is considered anomalous, whereas labelling (used in supervised settings) produces a binary output (anomaly or not).

k-NN based unsupervised anomaly detection (not to be confused with k-nearest-neighbour classification) is a straightforward approach; it detects only global anomalies and is not able to detect local anomalies. The anomaly score is computed from the k-nearest-neighbour distances, either using the distance to the kth neighbour alone (known as kth-NN) [158] or the average distance over the k neighbours (known as k-NN) [21]. The local outlier factor (LOF) is a well-known method for finding anomalous data points by measuring the local deviation of a given data point with respect to its neighbours. LOF works in three steps: a k-NN search is applied to all data points; the local density is estimated using the local reachability density (LRD); and finally the LOF score is computed by comparing the LRD of each point with the LRDs of its neighbours. In effect, the LOF is a ratio of local densities. The connectivity-based outlier factor (COF) differs from LOF in the way the local density is estimated, using a connectivity-based (chaining) distance rather than the spherical k-NN estimate. Unlike LOF and COF, the cluster-based local outlier factor (CBLOF) performs density estimation for clusters: clustering is performed first, and the anomaly score is then calculated from the distance of each data point to its cluster centre. LDCOF (local density cluster-based outlier factor) extends CBLOF and unweighted-CBLOF by also considering the local cluster density: clusters are segmented into small and large clusters, and their densities are estimated assuming a spherical distribution of the cluster members. LDCOF is a local score since it is based on the distance of an instance to its cluster centre. LOF and COF output scores but do not determine a threshold; outlier probability methods resolve this issue. The clustering-based multivariate Gaussian outlier score (CMGOS) is another extension of cluster-based anomaly detection: k-means clustering is performed first, followed by computation of the covariance matrix; the local density is estimated with a multivariate Gaussian model, and the Mahalanobis distance is used for the distance computation. The histogram-based outlier score (HBOS) is an unsupervised anomaly detection method [69] that is faster than multivariate approaches because it treats features independently: a histogram is computed for each feature, and the score of each instance is obtained by multiplying the inverse heights of the bins in which it falls. Owing to this feature independence, HBOS can process a dataset in under a minute where other approaches may take hours.
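The following minimal sketch (scikit-learn assumed; the data and neighbourhood sizes are illustrative) computes both an LOF-style local score and a simple kth-nearest-neighbour distance score on synthetic data with two injected anomalies.

# LOF and a simple k-NN distance score on synthetic data; values illustrative.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),      # normal observations
               [[6.0, 6.0], [7.0, -5.0]]])       # two injected anomalies

# LOF: local density of each point relative to its neighbours.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                      # -1 marks predicted outliers
lof_scores = -lof.negative_outlier_factor_       # larger => more anomalous

# Simple k-NN anomaly score: distance to the k-th nearest neighbour.
nn = NearestNeighbors(n_neighbors=5).fit(X)
dist, _ = nn.kneighbors(X)
knn_scores = dist[:, -1]

print("points flagged by LOF:", int((labels == -1).sum()))
print("highest k-NN scores at indices:", np.argsort(knn_scores)[-2:])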

The one-class SVM [175] is a commonly used semi-supervised anomaly detection method that is trained on anomaly-free data [68]: the model learns the boundary of the normal data and then classifies new instances as normal or anomalous. In the supervised one-class SVM scenario, the model is trained on the dataset and each instance is scored by its normalized distance to the decision boundary [18]. Conditional anomaly detection differs from standard anomaly detection [193]: the goal is to detect unusual patterns relating input attributes X to output attributes y in an example, whereas standard anomaly detection identifies data that deviate from the rest of the data. Amer et al. presented an eta one-class SVM to deal with the sensitivity of the one-class SVM to outliers in the training data [18], and an improved one-class SVM (OC-SVM) exploits the inner-class structure of the training set by minimizing the within-class scatter of the training data [19].
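A minimal one-class SVM sketch for the semi-supervised setting described above is shown below (scikit-learn assumed; the anomaly-free training data, test points and hyperparameters are illustrative).

# Train on (assumed) anomaly-free data, flag deviations at test time.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (300, 4))             # anomaly-free training set
X_test = np.vstack([rng.normal(0, 1, (10, 4)),   # normal test points
                    rng.normal(6, 0.5, (3, 4))]) # anomalous test points

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
pred = ocsvm.predict(X_test)                     # +1 normal, -1 anomaly
scores = ocsvm.decision_function(X_test)         # signed distance to boundary

print("predicted labels:", pred)
print("lowest (most anomalous) score:", scores.min())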

Fig. 5. Anomaly detection methods

Guidelines: Automated anomaly detection is not easy to solve, neither for generalized healthcare data nor for a specific disease such as thyroid disease. Successful approaches need to use a range of techniques to deal with real-world problems, especially when the data are large and complicated. A large number of anomaly detection methods have been presented in recent years, and the choice of method depends on the type of anomaly and on the data itself; before selecting a method, it is important to know which technique is best suited to the given anomaly detection problem and data. When the number of dimensions is high, nearest-neighbour and clustering based methods are not a good option, as distance measures in high-dimensional spaces cannot differentiate between anomalous and normal data. To deal with high-dimensional data, spectral techniques can be used, as they explicitly address the dimensionality issue through a high-to-low dimensional projection mapping (see the sketch below). If fully or partially labelled data are available, supervised or semi-supervised methods can be used even for high-dimensional data. Semi-supervised methods are preferable to supervised methods, since supervised methods handle label distribution imbalance (normal vs. specific anomaly) poorly, so the data need to be preprocessed to overcome this bias. Because of the sensitivity of healthcare findings, one main challenge is to find the feature vector that provides maximum discriminative power, keeps false positives to a minimum and yields high accuracy; it is therefore necessary to find an efficient feature vector that characterizes the anomalous occurrence, its location and its causal agent in the time or frequency domain, using domain transformations or wavelets. Another issue to consider is computational complexity, since anomaly detection methods can be expensive, especially when dealing with real-time data such as that generated by ICU monitoring sensors. Most anomaly detection methods are unsupervised or semi-supervised, and these techniques require expensive testing phases, which can limit their use in real settings; to overcome this, one should consider the model size and its complexity. Another big challenge affecting the performance of anomaly detection is the data itself: data should be accurate, complete and consistent. In the healthcare sector, data contain noise which tends to resemble anomalies or to affect detection; data should therefore be cleaned carefully before applying anomaly detection.
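The projection idea can be sketched as follows (scikit-learn assumed; PCA is used here merely as a stand-in for the high-to-low dimensional mapping, and the dimensions, component count and detector settings are illustrative).

# Reduce a high-dimensional feature vector before a distance-based detector.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (500, 200))                 # 200-dimensional records
X[:3] += 5.0                                     # a few anomalous records

X_low = PCA(n_components=10, random_state=1).fit_transform(X)
labels = LocalOutlierFactor(n_neighbors=30).fit_predict(X_low)
print("flagged indices:", np.where(labels == -1)[0][:10])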

IV. APPLICATIONS

Today, abundant data is collected in the medical sciences that can be utilized by healthcare organizations to obtain a wide range of benefits, i.e. descriptive analytics (what has happened, based on symptoms), predictive analytics (what will happen, based on the current situation) and prescriptive analytics (how to deal with the situation, based on best practice and the most effective treatments). As discussed earlier, traditional methods are not able to process such mounds of data. Much of the focus in recent development has been on the early detection and prevention of diseases, and data mining and data analytics interventions in the health sector are increasingly being used for this purpose. By extracting hidden information from medical data using data analytics and machine learning techniques, intelligent systems can be designed that complement the physician's knowledge and can be used as expert systems for early diagnosis and treatment. Several intelligent approaches have been presented to promote the early detection and prevention of disease. Based on the types of applications, we have divided the following discussion into subsections.

A. Examples/Case Studies

Despite the extensive need for data analytics applications in healthcare, the work done so far is limited in efficiency; nevertheless, data analytics is driving vast improvements in the healthcare sector, and in the future we will see rapid, widespread implementation across healthcare organizations. Below are some case studies of disease prevention from the past few years.

ScienceSoft, Truven, Artelnies, TechVantage and Xerox Research Centre India (XRCI) are some of the solution-provider companies applying data analytics to clinical data efficiently. ScienceSoft provides specific data analysis services to help healthcare providers make efficient and timely decisions and better understand their activities. Truven Health Analytics provides solutions for healthcare data analytics. Artelnies applies artificial intelligence methods to analyze clinical data and provides solutions such as medical diagnosis, medical prognosis, microarray analysis and protein design, and has designed several applications for early diagnosis and disease prevention, i.e. breast cancer hazard assessment (predicting whether a patient could have a recurrence of cancer), heart disease diagnosis, and early prognosis in patients with liver disorders (investigating early-stage liver disorder before it becomes serious). XRCI performs predictive analytics to identify high-risk patients in hospitals and delivers several clinical decision support systems, i.e. ICU admission prediction, complication prediction in ICUs, ICU mortality prediction, and stroke severity and outcome prediction.

Kaiser Permanente was one of the first medical organizations to implement an EHR system and is the largest integrated health system serving mainly the western U.S. Kaiser primarily uses SAS analytics tools and SAP's Business Objects business intelligence software to support data analytics activities against the EHR system. Uncertainty in clinical decisions often misleads clinicians in dealing with patients (e.g. the treatment of newborns with antibiotics: only 0.05% of newborns have an infection confirmed by blood culture [180], yet around 11% of newborns received antibiotics). The Kaiser Permanente system allows clinicians to steer high-cost interventions to deal with such issues. The system can accurately predict which antibiotics are required for a baby based on the mother's clinical data and the baby's condition immediately after birth, which reduces costs as well as side effects among newborns. Optum Labs, USA, has collected the EHRs of more than 30 million patients to create a database for predictive analytics in order to improve the quality of care. The goal is to use predictive analytics to help physicians improve patient care by aiding their decisions, especially for patients with complex medical histories suffering from multiple conditions. Healthcare Analytics by McKinsey & Company combines strategy, big data and advanced analytics, and implementation processes.

Prediction of the Zika virus has been challenging for public health officials. NASA is assisting public health officials, scientists and communities to limit the spread of the Zika virus and identify its causes. Researchers developed a computational model based on historical patterns of mosquito-borne diseases (i.e. chikungunya and dengue) to predict the spread of the virus, which is characterized by slow growth and high spatial and seasonal heterogeneity attributable to the dynamics of the mosquito vector and to the characteristics and mobility of the human populations [132]. Researchers used temperature, rainfall and socioeconomic factors to identify Zika transmission and to predict the number of cases for the next year by combining real-world data on population, human mobility and climate.

Harvard Medical School and Harvard Pilgrim Health Care applied analytical methods to EHR data to identify patients with diabetes and classify them into groups (Type I and Type II diabetes). Four years' worth of data based on numerous indicators from multiple sources was analyzed. Patients could be grouped into high-risk disease groups and the risk could be minimized by preventive care, i.e. new preventive treatment protocols could be introduced among patient groups with high cholesterol. The Rizzoli Orthopaedic Institute, Italy, is using advanced analytics to gain a more granular understanding of the dynamics of clinical variability within families in order to improve care and reduce treatment costs for hereditary bone diseases. Results showed significant improvements in healthcare efficiency, i.e. 30% reductions in annual hospitalizations and over 60% reductions in the number of imaging tests. The hospital is planning to gain further benefit from insights into the complex interplay of genetic factors and to identify cures. Columbia University Medical Center is using analytical techniques to analyze brain-injured patients and detect complications earlier. To deal with complications proactively rather than reactively, physiological data from patients who have suffered a bleeding stroke from a ruptured brain aneurysm are analyzed to provide critical information to medical professionals. Analytical techniques are applied to real-time as well as persistent data to detect hidden patterns that indicate the occurrence of complications. North York General Hospital implemented a scalable, real-time analytics solution to improve patient outcomes and develop a deeper understanding of the operational factors driving its business; the system provides analytical findings to physicians and administrators to improve patient outcomes.

The Intensive Care Unit (ICU) is one of the main settings where analytics methods can be applied to real-time data for prediction to improve quality of care. A number of organizations are working on integrating data analytics methods with body sensors and other medical devices to detect plummeting vital signs hours before humans have a clue. SickKids hospital, Toronto, is the largest centre working on advancing children's health and improving outcomes for infants susceptible to life-threatening nosocomial infections. The hospital applied analytical methods to vital signs and other data from real-time monitoring (captured from monitoring devices up to 1,000 times per second) to improve child health. Potential signs of an infection are predicted before it happens (24 hours earlier than with previous methods), and researchers believe that in future this will also be significantly beneficial for other medical diagnoses. The University of California, Davis applied analytical methods to EHRs to identify sepsis early. The QPID analytical system is used by Massachusetts General Hospital to predict surgical risk, to help patients choose the right course of action and to ensure that critical patient data is not missed during admission and treatment.

Even though some advanced health organizations have implemented data analytics technologies, there is still considerable room for analytics to grow.

B. Classification

Probably the most important and common use of data analytics and data mining in healthcare involves predictive modeling. One key problem is the selection of an effective classifier: no single general classifier works for all types of problems, and the choice depends on the type of problem and the data itself. In the literature, several classification methods have been employed for the prediction of different diseases.

Cardiovascular disease (CVD) can lead to a heart attack, chest pain or stroke due to the blockage of blood vessels. It is one of the leading causes of death (31.5%), although 90% of cardiovascular diseases are preventable [129]. A lot of research addresses the early identification and treatment of CVD using data mining tools. Jonnagaddala et al. presented a system for heart risk factor identification and progression for diabetic patients from unstructured EHR data [96]. The system consists of three modules: a core NLP module, a risk factor recognition module and an attribute assignment module. The NLP module assigns POS tags, identifies noun phrases and forwards them to the risk factor recognition phase, which identifies medications, disease and disorder mentions, family history, smoking history and heart risk factors. Identified risk factors are then forwarded to the attribute assignment module, which assigns indicator and time attributes to the risk factors. In another study, Jonnagaddala et al. presented a rule-based system to extract risk factors using clinical text mining [95]: risk factors are extracted from unstructured electronic health records, and rules were developed to remove records that do not contain age and gender information. Alneamy and Alnaish used hybrid teaching-learning-based optimization (TLBO) and a fuzzy wavelet neural network (FWNN) for the identification of heart disease [17]. TLBO is used to update the parameters while training the FWNN: training data consisting of 13 attributes is forwarded to the FWNN, the mean square error is computed, and the weights are updated using TLBO. The TLBO-FWNN based system achieved 90.29% accuracy on the Cleveland heart disease dataset.

Rajeswari et al. presented feature selection using a feed-forward neural network for ischemic heart disease identification [126]. The feature set was reduced from 17 to the 12 most discriminative features, which increased the accuracy from 82.2% to 89.4%. In another study, Arslan et al. compared SVM, stochastic gradient boosting (SGB) and penalized logistic regression (PLR) for ischemic stroke identification [26]; SVM provided slightly higher accuracy compared to SGB and PLR. Anooj used a weighted fuzzy rule-based system based on Mamdani fuzzy inference for the diagnosis of heart disease [111]. Attribute selection and an attribute-weighting method were used for the development of the fuzzy rules: a mining procedure is applied to generate appropriate attributes, which are used to develop fuzzy rules, and frequency-based weights are added to the attributes in the learning process for effective learning.

As discussed in Section III, the decision tree is one of the best methods and is extensively used for disease prevention [183, 182, 100, 90, 204, 136]. NaliniPriya and Anandhakumar extended the decision tree to handle multivariate data sets [136]. Choudhary and Bajaj utilized k-NN and decision trees for the prediction of root canal treatment needs [100]. For cases with little change in the ECG, Alizadehsani et al. presented a system to specify the lesioned vessel based on paraclinical data, risk factors and physical examination [14]. C4.5, Naive Bayes and k-NN were used for coronary artery stenosis identification on a set of 303 random visitors (the Z-Alizadeh Sani dataset) with additional effective features (function class, dyspnoea, Q wave, ST elevation, ST depression and T inversion) and no missing values. C4.5 achieved the best accuracy: 74.20% for the left anterior descending, 63.76% for the left circumflex and 68.33% for the right coronary artery. With C4.5 and bagging algorithms [16], the system achieved 79.54%, 61.46% and 68.96% for the left anterior descending, left circumflex and right coronary artery respectively. Karaolis et al. investigated three events, myocardial infarction (MI), percutaneous coronary intervention (PCI) and coronary artery bypass graft surgery (CABG), using the C4.5 decision tree algorithm on 528 cases [103]. Using five splitting criteria (including age, smoking, hypertension history and family history), the most important risk factors extracted were age, smoking and hypertension history for MI; family history, hypertension history and history of diabetes for PCI; and age, hypertension history and smoking for CABG. The system achieved accuracies of 66%, 75% and 75% for the MI, PCI and CABG models respectively. Alizadehsani [15] compared different classifiers (SMO, naive Bayes, neural network) for the identification of coronary artery disease on the Z-Alizadeh Sani dataset, with a weighted SVM used for feature selection; a set of 34 features with weights higher than 0.6 was selected. The study showed that SMO provides better accuracy (94.08%) compared to 75.51% and 88.11% for naive Bayes and the neural network respectively. El-Bialy et al. applied a fast decision tree and a pruned C4.5 tree to the classification of coronary artery disease, resolving missing, incorrect and inconsistent data problems through the integration of different data sets [192]. The strategy is to select attributes from different datasets for the same medical problem: four coronary artery datasets (the Cleveland, Hungarian, V.A. and Statlog project heart datasets) were integrated, C4.5 and the fast decision tree were applied to each dataset, and the five features common to all decision trees were selected. These five most common attributes were then used to build a new integrated dataset consisting of only the selected common features; results showed that the new integrated dataset increased accuracy from 75.48% to 78.06%. Fuzzy rules together with decision trees have also been used for disease risk prediction [108, 34], e.g. by Kim et al. [108, 112], to overcome problems associated with uncertainty. Cerebrovascular risk factors are old age, hypertension, heart disease, diabetes mellitus, transient cerebral ischemia seizure and cerebrovascular disease history, with subordinate risk factors including hyperlipaemia, obesity, polycythemia, smoking, drinking, family heredity, oral contraceptives and other medicines; elderly people are particularly vulnerable to cerebrovascular disease. Yeh et al. utilized a decision tree, a Bayesian classifier and a back-propagation neural network (BPNN) for the prediction of cerebrovascular disease [224]. In total, 29 variables (9 out of 24, 12 out of 29 and 8 out of 10 for physical examination, blood tests [161, 160, 180] and disease diagnoses respectively) were selected based on clinician recommendation. The system achieved accuracy/sensitivity (%) of 98.01/95.29, 91.30/87.10 and 97.87/94.82 for the decision tree, Bayesian classifier and BPNN respectively on 493 patients. The study identified 8 major factors (diabetes mellitus, hypertension, myocardial infarction, cardiogenic shock, hyperlipemia, arrhythmia, ischemic heart disease and BMI) for accurately predicting cerebrovascular disease; classification accuracy/sensitivity (%) increased to 99.59/99.48 based on 16 extracted diagnostic rules. Adnan et al. presented a hybrid approach using a decision tree and Naive Bayes for childhood obesity prediction [133]: CART is used to select eight important variables, and the outputs of the hybrid method are clustered into positive and negative groups.
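As a hedged illustration of the kind of decision-tree risk classification these studies report, the sketch below trains a depth-limited tree with 10-fold cross-validation on synthetic data shaped like a 13-attribute heart-disease table; it is not the Cleveland dataset or any published pipeline.

# Decision-tree classification with 10-fold cross-validation; data synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0)
acc = cross_val_score(clf, X, y, cv=10).mean()
print(f"mean 10-fold accuracy: {acc:.3f}")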

Diabetes is one of the major health problems all over the world. It is a serious disease that can lead to severe complications if not treated in time, and several other diseases are related to diabetes. The cause of diabetes is not fully understood; it is due to low insulin production by the pancreas or to the body not responding properly to the insulin produced. As it is a lifelong disease, a massive amount of data is available to interpret even for an individual patient [31]. Breault et al. applied CART to 15,902 diabetes patients and concluded that age is the most important attribute associated with bad glycemic control [39]; using the age information, clinicians can target a specific set of people. Temporal abstraction methods can be integrated with data mining algorithms to support data analysis. In another study, Yamaguchi et al. used data forest software for type 1 diabetes mellitus prediction, predicting next-morning FBG based on FBG, metabolic rate, food intake and physical condition [220]; the study concluded that physical condition is highly correlated with FBG. Complex temporal abstractions have been used for diabetes mellitus and blood glucose management [31]. Cho et al. compared different classification methods (logistic regression, SVM, and SVM with a cost-sensitive learning method) for diabetic nephropathy, using several feature selection methods to remove redundant features [46]; the study showed that linear SVM classifiers with embedded feature selection methods give promising results. Huang et al. applied Naive Bayes, the IB1 classifier and CART to a dataset of 2,064 patients and identified five important factors (age, insulin needs, diagnosis duration, diet treatment and random blood glucose) that affect blood glucose control; using these five attributes, the system achieved 95% accuracy and 98% sensitivity [87].

Parkinson's disease (PD) is a multisystem neurodegenerative disorder affecting the motor system; it results in dopaminergic deafferentation of the basal ganglia, giving rise to characteristic motor disturbances that include slowing of movement, muscular rigidity and resting tremor [13]. It is the second most common movement disorder. As there is no curative medical treatment [187], PD patients have to rely on clinical intervention. Das compared different classifiers (DMneural, neural networks, regression trees and decision trees) on the Max Little dataset [120] for effective diagnosis of Parkinson's disease; the study showed that the neural network obtained the best result with 92.9% classification accuracy. In another study, Geetha compared several methods (k-NN, SVM, C4.5, random forest, classification and regression trees, partial least squares regression, linear discriminant analysis, etc.) for the classification of PD patients on the Max Little dataset [157]; the study showed that the random tree classifier yields 100% accuracy. Khan used k-NN, AdaBoost and random forest for the identification of PD patients on the Max Little dataset [113]; results showed that k-NN provided the best accuracy, 90.26%, using k=10 fold validation. Suganya and Sumathi used metaheuristic algorithms for the detection and classification of Parkinson's disease [200]; the study reported that the ABO algorithm provided the best accuracy (97%) compared to SCFW, FCM, ACO and PSO.

The cost of caring for end-stage hemodialysis patients is very high, so reducing the cost of dialysis is an important factor. More than 50 parameters are monitored on a regular basis in kidney dialysis treatment, and multiple factors can influence patient survival [115]. To make mining results more comprehensible, domain knowledge can be added prior to data analysis by incorporating temporal abstraction methods. Yeh et al. used temporal abstraction with data mining techniques (decision trees and multiple-minimum-supports association rules) to analyze the biochemical data of dialysis patients and develop a decision support system that finds temporal patterns resulting in hospitalization [94]. Association rules and the probability of patient hospitalization are used to decrease patient hospitalization. Monthly biochemical test data is transformed into basic temporal abstractions (TAs) through the basic TA algorithm, and these are then used as input to the complex TA algorithm and transformed into complex TAs.

Kusiak et al. used two different methods for knowledge extraction in the form of decision-making rules that are used to predict patient survival [115]. Pressure ulcers are a debilitating complication, and identifying pressure ulcer patients at an early stage is important. Raju et al. compared the performance of different classifiers (logistic regression, decision trees, random forests and multivariate adaptive regression splines) on the risk factors associated with pressure ulcers [156]. The decision tree was split based on a mobility sub-scale value of 2.5; compared to the other methods, random forests provided better accuracy.

Stroke ranks third in disease burden and is considered to be a leading cause of hospitalization, disability and death. Stroke risk and severity prediction can contribute significantly to its prevention and early treatment. The literature shows that there are few post-stroke predictive models, and a wide range of techniques has been applied for risk identification related to particular conditions. Stroke risk factors are age, smoking, diabetes, anti-hypertensive therapy, cardiovascular disease, systolic blood pressure and left ventricular hypertrophy on electrocardiogram [142]. Khosla et al. presented an automatic feature selection method that selects robust features and evaluated three automatic feature selection approaches: forward feature selection (FSS), L1-regularized logistic regression (RLR) and conservative mean feature selection (CM) [107]. FSS greedily adds one feature at a time and uses cross-validation to select the best features; CM selects relevant and robust features using a conservative mean over variations, considering monotonic prediction functions over a single feature. SVM and margin-based censored regression (MCR) are used for classification; the study showed that MCR provided better accuracy on the set of features selected using the conservative mean method. Sung et al. applied MLR, k-nearest neighbour and regression trees to 3,577 patients with acute ischemic stroke (approximately one-third of them prior stroke patients) [201]. A three-step feature selection process (frequency cut-off, correlation-based feature selection and expert opinion) was used, and the study identified seven predictive features, with a panel of stroke neurologists identifying the final set of features by consensus. The study identifies stroke severity indices, which represent proxy measures of neurologic impairment, that could be used to adjust for stroke severity in ischemic stroke patients. Easton et al. developed post-stroke mortality predictive models for very short term and short/intermediate term mortality [7]; the study identified certain risk factors that differentiate between very short term and intermediate term mortality, e.g. age is highly relevant for the intermediate term. He et al. presented a framework (D-ECG) for the accurate detection of cardiac arrhythmia [79] that builds a results regulator using different sets of features in order to refine the results.

Liver disorders, also called hepatic diseases, are disturbances of liver function that cause illness; they normally do not cause any obvious symptoms until the liver is damaged and the disease is advanced. If it is detected at an early stage, acute liver failure caused by an overdose of acetaminophen can sometimes be treated and its effects reversed. Seker et al. applied k-NN, SVM, MLP and decision trees to liver disorder data. Baitharu and Pani applied the J48 decision tree, Naive Bayes, ANN, ZeroR, IBk and VFI algorithms to identify liver disorders (cirrhosis, bile duct disease, chronic hepatitis, liver cancer and acute hepatitis) from liver function test (LFT) data [29]. For hepatitis, Yasin et al. extracted the top seven PCA features (steroid, antiviral, fatigue, malaise, anorexia, liver big and liver firm) from a set of 19 [223]; the system obtained 89% accuracy for hepatitis C prediction using a regression model. Zayed et al. used a decision tree (C4.5) to predict the therapeutic outcome of antiviral therapy in HCV patients [226].

MERS-CoV is a viral airborne disease caused by a novel coronavirus (MERS-CoV) which spreads easily and has a high death rate; it causes severe acute respiratory illness, including fever, cough and shortness of breath. Yang et al. presented non-orthogonal decision trees for mining SARS-CoV protease cleavage data [222]: bio-mapping is used to transform the k-mers into a high-dimensional representation that is used for the construction of the decision tree, and prediction accuracy is significantly improved by using the bio-mapping based high-dimensional templates and the non-orthogonal decision tree. Lee et al. used SVMs (normal, sigmoid, RBF and polynomial kernels) to predict the SARS virus [179] and obtained 97.43% accuracy with the polynomial kernel. Sandhu et al. presented a Bayesian belief network that predicts MERS-CoV-infected patients and provides geographic risk assessment [172]; possibly infected users are quarantined using GPS and marked on Google Maps.

Seminal quality prediction is a binary classification problem and is quite useful for the early diagnosis of seminal disorders as well as for donor selection. Wang et al. presented a three-stage ensemble learning framework, called clustering-based decision forests, to tackle the unbalanced class learning problem for the prediction of seminal quality [83]. The first stage deals with class imbalance (the Tomek links method is used for data cleaning); at the second stage a balanced dataset is formed by combining the majority subset and bagging of the minority class; finally, random subspaces are used to generate a diverse forest of decision trees. Comparison results on the UCI fertility dataset showed that the clustering-based decision forest outperforms decision trees, SVM, MLP, logistic regression and random forests. To identify lifestyle and environmental features that affect seminal quality and fertility rate, Sahoo and Kumar applied MLP, SVM, evolutionary logistic regression, PCA, SVM+PSO, chi-square, correlation and T-test methods [171]. To evaluate seminal quality and fertility rate, decision trees, Naive Bayes, SVM, SVM+PSO and MLP were applied to the UCI fertility dataset; results showed that SVM+PSO provides better accuracy than the other classifiers, and the study also concluded that age, season, surgical interventions, and smoking and alcoholic habits have more impact on the reduction of the fertility rate. In another study, Gil et al. used SVM, MLP and decision trees to predict seminal quality from environmental factors and lifestyle data [66]. Naeem used a Bayesian belief network (BBN) on the UCI fertility dataset to identify the affecting parameters [125]. Girela et al. presented an MLP-based study on data collected by questionnaire to predict the results of semen analysis [98]. The studies of Sahoo et al. [171], Naeem [125], Gil et al. [66] and Girela et al. [98] showed that lifestyle, environmental features and health status strongly affect semen quality and could be used as a measure for the early diagnosis of male fertility issues. IVF is one of the most effective and expensive treatments for addressing the causes of infertility; it requires the selection of healthy embryos and continuous patient observation. Predictive modelling could help to reduce the chance of multiple pregnancy and increase the success rate by finding the relevant parameters. The first study on IVF outcome prediction was presented in 1997 by Kaufmann et al. [190] using a neural network. Corani et al. [51] and Uyar et al. [27] used Bayesian networks for predicting pregnancy after IVF. Corani et al. conclude that embryo viability (top-graded embryos) is a more critical factor than the uterus and helps to decide how many and which embryos to transfer; a simulation on 5000 IVF cycles shows that decision-based transfer increases both the single-pregnancy and double-pregnancy rates while also reducing the average number of transferred embryos from 3 to 2.8. In another study, Uyar et al. compared different classifiers (Naive Bayes, decision tree, SVM, k-NN, MLP and a radial basis function (RBF) network) on a dataset of 2275 instances (15 variables) in order to discriminate embryos according to their implantation potential [206]; the study reported that Naive Bayes and RBF perform better than the other methods. Siristatidis et al. used an LVQ-based neural network to predict the outcome early [189] and presented a web-based system to predict IVF outcome [188]. Durairaj and Kumar used MLP [58] and rough set theory [59] to identify the influential parameters that affect the success of the IVF outcome; fertility analysis was performed on 27 different attributes of 250 patients. Güvenir et al. presented the RIMARC ranking algorithm to predict the chance of success for IVF treatment on a dataset of 1456 patients [72]; it assigns a score based on past cases, and results were compared with random forest and Naive Bayes. BPNN [20] and SOM [146] based approaches have also been used to predict IVF treatment results. As the success rate in IVF is quite low, the selection of appropriate parameters, preconditions, postconditions and embryo viability could help to improve the chance of conception, and intelligent techniques could help to increase the chance of conception in IVF treatment. Nevertheless, more efforts are required to improve efficacy; moreover, the studies done so far are based on small datasets, which are not sufficient for efficient prediction.

C. Clustering

Cluster analysis has been widely applied in various applications, especially in preventive healthcare, where it helps to reveal hidden structures and clusters in large data sets. The availability of extensive amounts of healthcare data requires powerful data analysis tools to arrive at meaningful information. Along with classification techniques, clustering has been extensively used in the healthcare sector, and numerous cluster analysis applications, especially in disease prevention, have been reported in the literature, such as characterizing diabetic patients on the basis of clusters of symptoms, grouping cardiovascular patients based on their symptom experience, and identifying causes of mortality.

K-means and its variations have been extensively used for disease prediction. Luo et al. performed cluster analysis on the prevention and treatment experience of famous herbalist doctors with infectious diseases (influenza, dysentery, tuberculosis, viral hepatitis and other infectious diseases) [123]. The study showed that cluster analysis is helpful for summarizing the common understanding of this experience, including the concept of syndrome differentiation, the regulation of diagnosis and treatment, and prescription characteristics, based on the prevention and treatment of influenza, viral hepatitis and other infectious diseases; association rules were used to find the relationships between etiology, syndrome, symptoms and herbal prescriptions of infectious diseases. In another work, Kabir et al. presented a system to detect epileptic seizures from EEG signals using SVM, Naive Bayes and logistic regression [101]; results showed that k-means clustering can handle EEG data efficiently in order to detect epileptic seizures. Kaushik et al. applied enhanced genetic clustering to diagnose and predict diseases based on patient history, symptoms and existing medical condition [105]: a new patient's profile is analysed against existing patterns and mapped to a particular cluster based on the patient's medical parameters, followed by prediction of the medical condition and of the probability of being prone to that disease in the future. Vijayarani and Sudha applied fuzzy C-means, k-means and weight-based k-means algorithms to predict diseases from hemogram blood test samples [210]; the results showed that weighted k-means performed better than fuzzy C-means and k-means. Norouzi et al. used fuzzy C-means clustering to predict renal failure progression in chronic kidney disease based on weight, diastolic blood pressure, diabetes mellitus as underlying disease, and current status [137]; the number of fuzzy rules equals the number of membership functions of the input variables. Feng et al. applied a k-center clustering method to analyze the clinical data and syndrome information of 154 cases of prethrombotic state. The study showed that traditional clinical syndrome differentiation presented 12 syndrome patterns (blood stasis, qi deficiency, damp turbidity, yin deficiency, yang deficiency, phlegm turbidity, damp-heat (toxicity), qi stagnation, blood deficiency, phlegm heat, and cold accumulation) [60]; the results showed that blood stasis and qi deficiency are the most commonly seen syndromes (49.1%), whereas cold accumulation is the most rarely seen. Yang et al. performed fuzzy cluster analysis of Alzheimer's disease related gene sequences [221]; the results indicate that gene sequences interrelated within one group consistently have a closer relationship within that group than with another group. Shaukat et al. compared k-means, k-medoids, DBSCAN and OPTICS to determine the population of dengue fever infected cases [178]; results showed that OPTICS outperformed the other methods.

As healthcare data is very sensitive, hybrid approaches have recently been considered to address this sensitivity, especially clustering used together with classification techniques. Shouman et al. presented a hybrid method applying Naive Bayes and k-means clustering with different initial centroid selection methods for the diagnosis of heart disease patients [184]. The integration of unsupervised (k-means) and supervised (Naive Bayes) learning with different initial centroid selection enhanced the prediction accuracy of Naive Bayes compared to the standalone method; for centroid selection, different methods (range, inlier, outlier, random attribute values, and random row methods) were applied. In another work, Zheng et al. presented a hybrid of k-means and support vector machine (K-SVM) algorithms for breast cancer diagnosis [228]: k-means is used to recognize the hidden patterns of benign and malignant tumours separately, and the membership of each tumour to these patterns is calculated and treated as a new feature for SVM training. The study showed that K-SVM improves the accuracy to 97.38% on the WDBC dataset. PCA and LDA have also been used for feature reduction followed by LS-SVM, PNN and GRNN.
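The hybrid idea behind K-SVM can be sketched roughly as follows (an illustrative reconstruction under our own assumptions, not the published implementation): cluster each class separately, use the distances to all cluster centres as new features, and train an SVM on those features.

# Rough sketch of a cluster-features-plus-SVM hybrid; data and parameters
# are illustrative and do not reproduce the cited study.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Cluster each class separately; distances to all centres become new features.
centres = np.vstack([KMeans(n_clusters=3, n_init=10, random_state=0)
                     .fit(X_tr[y_tr == c]).cluster_centers_ for c in (0, 1)])

def to_features(A):
    """Distance of every sample in A to every cluster centre."""
    return np.linalg.norm(A[:, None, :] - centres[None, :, :], axis=2)

svm = SVC(kernel="rbf").fit(to_features(X_tr), y_tr)
print("accuracy:", accuracy_score(y_te, svm.predict(to_features(X_te))))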

D. Association Analysis

Association rule mining can detect factors which contribute to disease outbreaks, and association analysis is also used together with classification techniques to enhance analysis capability. Relational analysis can help make healthcare data more useful, since the occurrence of one disease can lead to several other associated diseases. Association rule mining (ARM) has been applied to find links between associated diseases, especially cardiac disease, diabetes, cancer and various forms of tumours.

Ogasawara et al. performed association analysis to find the relationship between lifestyles, family medical histories and medical abnormalities [138]. Lifestyle variables (overweight, drinking, smoking, meals, physical exercise and sleeping time), medical history attributes (hypertension, diabetes, cardiovascular disease, cerebrovascular disease and liver disease) and medical abnormalities (high blood pressure, hypercholesterolemia, hypertriglyceridemia, high blood sugar, hyperuricemia and liver dysfunction) from 5 years of data on 5350 patients were considered for the analysis, and 4371 rules were extracted. The study concludes that association analysis performs significantly better at predicting risk factors than conventional modelling. Semenova et al. developed an association rule algorithm focused on delivering knowledge from large databases [176]: instead of focusing on frequent itemsets, they consider itemsets that provide knowledge and useful insight. The Polydict algorithm is used to find frequent itemset patterns; an initial dictionary stores all the distinct sequences with their counts, which supports the search for frequent patterns.

Doddi et al. presented random sampling and Apriori to obtain association rules indicating relationships between procedures performed on a patient and the reported diagnoses, analysed on a large database containing medical-record data [57]. Due to the large data set, random sampling is used to overcome the computational issue: a small sample is collected from the large collection of transactions using random sampling, and the Apriori algorithm is applied to the sampled data. The transaction set is partitioned into five disjoint sets, each consisting of about 250K transactions; a random sample of size 5K is generated from each subset, and the Apriori algorithm is applied to these samples. Payus et al. used the Apriori algorithm for respiratory illness caused by air pollution: 6 attributes were considered, and 42 rules were generated with a minimum support of 0.1 and a minimum confidence of 0.1 [147]. Concaro et al. presented temporal association rule mining (TARM) for the analysis of the care delivery flow of diabetes mellitus by extracting temporal associations between diagnostic and therapeutic treatments [49], with an Apriori variant used for mining the association rules; in another work, Concaro et al. applied TARM to the analysis of costs related to diabetes mellitus [48]. Jabbar et al. presented heart attack prediction using a boolean matrix (HAPBM) algorithm [93]; HAPBM is a variant of the Apriori algorithm in which the discretized dataset is converted into a boolean matrix and frequent itemsets are then generated from the boolean matrix. Raheja and Rajan comparatively analysed ARM and MiSTIC (Mining Spatio-Temporally Invariant Core regions) approaches for extracting the spatio-temporal occurrence pattern of Salmonellosis [154]; the Apriori algorithm is used for ARM with minimum support and minimum confidence values of 1 and 40 percent respectively, and for more effective rules, economic, demographic and environmental data are also used, with Apriori applied to the consolidated dataset. Lee et al. utilized a variant of the Apriori algorithm and the standard support-confidence framework for frequent itemset and association rule generation to discover medical knowledge from acute myocardial infarction (AMI) patients [118]. Pruning was performed using the lift, leverage and conviction interestingness measures (defined below). Twelve risk factors related to blood factors were selected out of the total 141 risk factors, to find associations between blood factors and disease history in young AMI patients.
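For reference, the interestingness measures mentioned above are commonly defined as follows (standard definitions, not taken from [118] or [141]):

lift(X → Y) = confidence(X → Y) / support(Y)
leverage(X → Y) = support(X ∪ Y) − support(X) · support(Y)
conviction(X → Y) = (1 − support(Y)) / (1 − confidence(X → Y))

A lift greater than 1 indicates that X and Y occur together more often than would be expected if they were independent.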

Nahar et al. applied Apriori, Predictive Apriori and Tertius for rule generation to investigate the sick and healthy factors which contribute to heart disease in males and females [134]. Attributes indicating healthy and sick conditions were identified, and two experiments (rules for healthy and sick conditions, and rules based on gender) were performed: all sick individuals were regarded as one class and healthy individuals as another, and Apriori, Predictive Apriori and Tertius were applied. Payus et al. mined an air pollution database to extract the reasons behind respiratory illness [147]: seven attributes were selected and the Apriori algorithm was applied with minimum support and minimum confidence values of 0.1; in total, 42 rules were generated (17 for normal hospitalized patients, 24 for moderately hospitalized patients and 1 rule for highly hospitalized patients). Anwar and Naseer presented 69 association rules to identify lifestyle and environmental factors affecting men's seminal fertility and quality [25]; the rules were mined using an Apriori-based approach in the XLMiner tool, experiments and analysis were performed with multiple support and confidence values, and interesting association rules were reported. Sharma and Om applied the Apriori algorithm with a minimum support value of 10% and a minimum confidence value of 90% [177]; ARM was applied to patients' clinical examination and history data for oral cancer detection. Huang presented a data cutting and sorting method (DCSM), rather than the Apriori algorithm, that reduces the time needed to scan immense databases, i.e. health examination and outpatient medical records [87].

Berka and Rauch applied meta-learning techniques to association rules in the atherosclerosis risk domain [33]: they applied ARM to the association rules extracted from atherosclerosis data to generate rules about rules for more effective comprehension. Apriori is used first, and the extracted association rules are then post-processed by applying ARM to the rules obtained in the prior step. McCormick et al. presented a hierarchical association rule mining (HARM) technique to predict sequential events [128]. HARM is applied for automated symptom prediction; results showed that the predictive performance of HARM was highest with partially observed patients rather than with new patients. The idea is that, based on the history of the patient, the next symptom to be faced by the patient can be predicted. Srinivas et al. presented the DAST algorithm for mining frequent as well as rare itemsets and positive and negative association rules for disease prediction [196].

Association rule mining algorithms may generate an extremely large number of rules, especially with low minimum support and minimum confidence values, so reduction techniques are necessary for effective analysis. To discover significant association rules, summarize rules having the same consequent and accelerate the search process, Ordonez et al. applied a greedy algorithm [141]. Rules were constrained using an association rule size threshold, restrictions on which itemsets may appear, restrictions on itemset combinations, and the lift measure. For evaluation, 25 out of 113 attributes for 655 individuals were used, with a minimum support of 1%, a minimum confidence of 70%, a maximum rule size of 4, a minimum lift value of 1.20, and a minimum lift value of 2.0 for cover rules. Ohsaki et al. performed a detailed analysis of the practicability of rule interestingness measures [140]: in total, 40 interestingness measures were analysed, and accuracy, uncovered negative, peculiarity, relative risk and chi-square were found to be the measures most useful from the medical domain experts' point of view; moreover, medical knowledge discovery from databases could be advanced by the utilization of other interestingness measures by medical domain experts. The approach of Kuo et al. reduces the number of generated association rules by considering domain-relevant rules selected by a domain expert [114]; to handle rule selection, a domain expert is therefore required.

Soni and Vyas integrated association rule mining and classification rule mining [195]. The integration focuses on mining a special subset of association rules and then performing classification using these rules. The rules are refined, and different associative classifiers (weighted associative classifiers, positive and negative rules, fuzzy and temporal) are used to increase accuracy. Later, Soni and Vyas extended their work, used weighted associative classifiers for heart disease prediction and presented an intelligent heart disease prediction system (IHDPS) based on a weighted associative classifier; the study reported that IHDPS achieved the highest average accuracy compared to other systems [194]. Concaro et al. presented a method to integrate both clinical and administrative data and addressed the issue of developing an automated strategy for filtering the output rules, exploiting the taxonomy underlying the drug coding system and considering the relationships between clinical variables and drug effects [48]. Rajendran and Madheswaran presented a hybrid association rule classifier (HARC) based on ARM and the decision tree (DT) algorithm [155]: a transactional database containing texture feature data was mined using Frequent Pattern Tree (FP-Tree) based ARM, and a DT was then applied to the transactional database instances covered by each of the rules generated in the ARM step. In another work, Shirisha et al. presented a heparin-induced thrombocytopenia algorithm-based automated weight calculation approach for weighted associative classifiers [181]; weights are assigned based on the different attributes, and itemsets already found to be frequent are extended with new items according to the algorithm. Chin et al. presented a framework for early RA assessment that integrates efficient associative classifiers to mine RA risk patterns from a large clinical database [45]. Risk patterns that occur frequently and classify the disease status are extracted, and a weighted approach that integrates correlation and popularity information into the measure is used to prevent important rules from being ignored and to preserve the objectivity of the results.
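As a rough illustration of how an associative classifier of the weighted kind described above can combine rules at prediction time, the following sketch scores a new record by accumulating weight x confidence over the rules whose antecedents it satisfies. The rule set, labels and weights are entirely hypothetical and are not taken from the cited systems.

```python
# Minimal sketch of weighted associative classification: each mined rule carries
# an antecedent itemset, a predicted class, a confidence and an attribute weight.
from collections import defaultdict

# Hypothetical rules of the form (antecedent items, class label, confidence, weight).
rules = [
    ({"chest_pain", "high_bp"}, "heart_disease", 0.92, 0.8),
    ({"high_cholesterol"},      "heart_disease", 0.75, 0.6),
    ({"normal_ecg"},            "healthy",       0.88, 0.9),
]

def classify(record_items, rules):
    """Score each class by summing weight * confidence over matching rules."""
    scores = defaultdict(float)
    for antecedent, label, confidence, weight in rules:
        if antecedent.issubset(record_items):
            scores[label] += weight * confidence
    # Fall back to "unknown" when no rule fires.
    return max(scores, key=scores.get) if scores else "unknown"

print(classify({"chest_pain", "high_bp", "high_cholesterol"}, rules))  # heart_disease
```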

E. Anomaly Detection

Today, electronic medical records contain a vast amount of patient information about conditions, treatments and procedures. Anomaly detection for disease prevention and health promotion typically works with patient records and has high potential for mining hidden patterns and forecasting disease. EHRs contain anomalies for several reasons, such as physician error, instrumentation error and patient condition [42]. One particular advantage health organizations look for is timely hot-spotting of data. Anomaly detection is therefore a key task in supporting disease prevention and requires a high degree of accuracy. Typical approaches focus on detecting data instances that deviate from the majority of the data, and several techniques have been presented for the detection of disease outbreaks. Despite the critical need, anomaly detection in health records is challenging due to dataset complexity (non-standard and poor quality data, privacy issues, etc.). Supervised anomaly detection methods require completely labelled data for both anomalous and normal cases. In healthcare, however, it is rarely possible to acquire completely labelled data: labelling is done manually by experts and requires substantial effort, and anomalies can occur spontaneously as novelties at run time. A further issue is the data itself, since anomalies are few compared to normal instances. Supervised methods are therefore not a preferable choice for anomaly detection, and we do not focus on them.

Due to domain-specific normality and the unavailability of labelled data, most existing studies are based on unsupervised or semi-supervised methods [40] and mainly focus on disease-specific anomaly detection, although some generic studies exist as well. Hu et al. presented a generic framework for utilization analysis that can be applied to any patient data [86]. Carvalho et al. presented an effective and generic score-based method for anomaly detection [40]. Three proximity-based algorithms (KNN, reverse KNN and Local Outlier Factor) are used, and detection proceeds in two steps, score assignment and score transference: in the first phase, outlier analysis is performed to calculate the degree of anomaly, and in the second phase score transference assigns the final score. Typical outlier detection methods identify unusual data instances that deviate from the majority of examples in the dataset. Mezger et al. used a logistic regression model to detect deviations from usual medical care [130]. In another work, Hauskrecht et al. presented statistical (evidence-based) anomaly detection [76], in which Bayesian networks are used as probabilistic models to compute the statistics.
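The sketch below illustrates the proximity-based scoring idea (k-NN distance and Local Outlier Factor) on synthetic data using scikit-learn; it is a minimal stand-in, not the exact two-phase score-transference procedure of [40].

```python
# Hedged sketch: two common proximity-based anomaly scores on synthetic data.
import numpy as np
from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),      # "normal" patient records
               rng.normal(6, 0.5, size=(5, 2))])     # a few anomalous records

# k-NN score: distance to the k-th nearest neighbour (larger = more anomalous).
k = 10
nn = NearestNeighbors(n_neighbors=k).fit(X)
knn_score = nn.kneighbors(X)[0][:, -1]

# Local Outlier Factor: density-based score (larger = more anomalous).
lof = LocalOutlierFactor(n_neighbors=k)
lof.fit(X)
lof_score = -lof.negative_outlier_factor_

print("Most suspicious indices (kNN):", np.argsort(knn_score)[-5:])
print("Most suspicious indices (LOF):", np.argsort(lof_score)[-5:])
```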

Goldstein and Uchida evaluated 19 different unsupervised methods on 10 different datasets from multiple application domains [70]. Lie et al. used three different methods for anomaly detection (standard support vector data description, density-induced support vector data description and a Gaussian mixture model) on the UCI liver disorder dataset.
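As a rough illustration of the boundary- and density-based detectors mentioned here, the sketch below applies a one-class SVM (an SVDD-like method) and a Gaussian mixture likelihood score to synthetic stand-in data with the same shape as the UCI liver disorder table; it is not the setup of the cited study.

```python
# Hedged sketch: one-class SVM (SVDD-like boundary) and Gaussian mixture scoring.
# Synthetic features stand in for the UCI liver-disorder attributes.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(345, 6))          # 345 records x 6 features (BUPA-like shape)
X = StandardScaler().fit_transform(X)

# One-class SVM: nu bounds the fraction of records treated as outliers.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X)
svm_flags = ocsvm.predict(X)           # +1 = inlier, -1 = outlier

# Gaussian mixture: low log-likelihood indicates a potentially anomalous record.
gmm = GaussianMixture(n_components=2, random_state=1).fit(X)
log_lik = gmm.score_samples(X)
gmm_flags = log_lik < np.quantile(log_lik, 0.05)

print("SVM outliers:", int((svm_flags == -1).sum()),
      "GMM outliers:", int(gmm_flags.sum()))
```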

Several rule-based anomaly detection approaches have been developed, such as drug-allergy checking, automated dosing guidelines, identification of drug-drug interactions, detection of potential adverse drug reactions, and detection of events in chronic conditions such as congestive heart failure and diabetes.

In anomaly detection it is important to ensure that the anomalies reported to a user are in fact interesting. Traditional anomaly detection methods are unconditional: they look for outliers with respect to all data. Conditional anomaly detection is a relatively new approach that detects unusual data for a subset of variables. Valko et al. presented instance-based methods for detecting conditional anomalies in two real-world problems (unusual pneumonia patient admissions and unusual orders of an HPF4 test) that rely on distance metrics to identify the examples most critical for anomaly detection [208]. To assess conditional anomalies and analyze the deviation of an example, a one-dimensional projection is used; a case is considered anomalous if the value of P(output | input) falls below a certain threshold. In another study, Hauskrecht et al. presented a conditional-outlier detection method to identify unusual patient-management actions (omissions of medication or laboratory orders) based on the patient's past data [77]. Rather than only identifying data instances that deviate, this approach identifies outliers where individual patient-management actions strongly depend on the patient's condition; an SVM is used to predict each type of action, i.e. medication and laboratory orders. Valko et al. also presented a non-parametric approach based on the soft harmonic solution for anomaly detection. In another work, Hauskrecht presented a framework for post-cardiac-surgical patient anomaly detection that detects conditional anomalies and raises an alert when an anomaly is found [75]. Hong and Hauskrecht presented a multivariate conditional model that considers both context and correlations for outlier identification [82]. A tree-structured model within a multi-dimensional ensemble framework [43] is used to build the multivariate conditional model. The posteriors of individual and joint responses are computed using the decomposable structure of the multi-dimensional models, and to compute the outlier score for each instance the data is transformed from the original space to a d-dimensional conditional probability space.

Conventional PCA is used as an indicator of the existence or absence of anomalies [34]; its disadvantage, however, is its limited effectiveness for small anomalies. The multivariate cumulative sum (MCUSUM) can effectively detect small anomalies in highly cross-correlated variables and is used as an alternative to PCA. Harrou et al. integrated PCA and MCUSUM for anomaly detection with high sensitivity to small anomalies [74]: MCUSUM is applied to the PCA output to detect anomalies when the data does not fit the PCA model. Evidence showed that the PCA-MCUSUM integration provides better fault detection on a pediatric emergency dataset.
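The sketch below illustrates, under simplifying assumptions, the idea behind combining PCA with cumulative-sum monitoring: a PCA model captures the dominant correlation structure, and a cumulative-sum statistic is run over the reconstruction residuals so that small, persistent shifts accumulate into a detectable signal. For brevity it uses a univariate CUSUM on the residual norm rather than a full MCUSUM, and all data and thresholds are synthetic.

```python
# Hedged sketch: PCA residual monitoring with a simple CUSUM on the residual norm.
# This is a simplification of the PCA-MCUSUM scheme described above.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
train = rng.normal(size=(500, 8))                 # in-control historical data
stream = rng.normal(size=(200, 8))
stream[120:] += 0.8                               # small persistent shift to detect

pca = PCA(n_components=3).fit(train)

def residual_norm(X):
    """Squared reconstruction error (distance to the PCA subspace)."""
    recon = pca.inverse_transform(pca.transform(X))
    return np.sum((X - recon) ** 2, axis=1)

mu, sigma = residual_norm(train).mean(), residual_norm(train).std()
k, h = 0.5, 5.0                                   # slack and decision threshold
cusum, alarms = 0.0, []
for t, q in enumerate(residual_norm(stream)):
    z = (q - mu) / sigma
    cusum = max(0.0, cusum + z - k)               # one-sided upper CUSUM
    if cusum > h:
        alarms.append(t)
        cusum = 0.0                               # reset after an alarm
print("First alarms at stream indices:", alarms[:5])
```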

Lee et al. used a multilayer perceptron to distinguish cardiac anomalies (ventricular tachyarrhythmia, congestive heart failure, malignant ventricular ectopy, supraventricular arrhythmia) from normal cardiac status. Features are extracted from the ECG using PCA and the histogram of oriented gradients (HOG) [89].

Pattern mining techniques are used to extract frequent patterns, and a comparison between frequent patterns and domain knowledge is performed for anomaly detection [24]. Frequent pattern matching identifies frequent patterns that deviate from guidelines, as well as anomalies that deviate from the frequent patterns, by comparing the domain knowledge of a disease with the frequent patterns of its treatment. This allows two types of anomalies to be detected: anomalous cases that deviate from the frequent patterns, and frequent patterns that deviate from the accepted guidelines.

V. DATASETS

PhysioBank databases: PhysioBank is a large and growing archive of physiological data [67]. It consists of over 90,000 recordings, or over 4 terabytes of digitized physiologic signals and time series, organized in over 80 databases. The PhysioBank archives include clinical databases, waveform databases, multi-parameter databases, ECG databases, interbeat interval databases, other cardiac databases, Computing in Cardiology Challenge datasets, synthetic data, gait and balance databases, and neuroelectric and myoelectric databases (EEG, EHG, and more). Each database is placed into a class (completed reference databases, archival copies of raw data, and other contributed collections of data). Further details on PhysioBank are available in [67].

Breast Cancer Wisconsin Diagnostic (WDBC): Features of the breast-cancer dataset are computed from digitized images of a fine needle aspirate (FNA) of a breast mass and describe the characteristics of the cell nuclei [62]. The dataset consists of 32 attributes (30 real-valued input features) for 569 instances (357 benign, 212 malignant). The task of this UCI dataset is to separate cancer patients from healthy patients. Wisconsin Prognostic Breast Cancer (WPBC) consists of 34 attributes (32 real-valued input features) for 198 instances (151 non-recurrent, 47 recurrent); each record represents follow-up data for one breast cancer case.

Cardiovascular datasets: The Cardiovascular Health Study (CHS) is a study of risk factors for the development and progression of CHD and stroke in people over the age of 65. It is one of the most widely used benchmark datasets for cardiovascular disease risk factors, including stroke [116]. It consists of 5,201 instances with a significant fraction of missing values and a large number of features. Another dataset, the post-surgical cardiac patients dataset, consists of 4,486 patients collected from archived EHRs at a large teaching hospital in the Pittsburgh area [78]. A further multivariate cardiac disease dataset from UCI consists of 76 attributes (categorical, integer, real) and 303 instances; its purpose is to determine the presence of heart disease in the patient, expressed as an integer from 0 (no presence) to 4. The congenital heart disease (CHD) dataset consists of 30-day outcomes (alive or dead) for congenital heart disease treatment; figures are provided for patients aged under 16 (paediatric) or over 16 (adult congenital heart disease).
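For readers who want to experiment, the WDBC data described above is also distributed with scikit-learn; the short snippet below uses that convenience loader (an assumption about the reader's toolchain, not part of the original UCI release) and confirms the instance and feature counts quoted here.

```python
# The Wisconsin Diagnostic Breast Cancer (WDBC) data ships with scikit-learn.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target                # 569 instances, 30 real-valued features
print(X.shape)                               # (569, 30)
print({name: int((y == i).sum()) for i, name in enumerate(data.target_names)})
# {'malignant': 212, 'benign': 357}
```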

Thyroid Disease dataset: This is another dataset from the UCI machine learning repository in the medical domain, also known as the annthyroid dataset. It is suited to ANN training, has 3 classes (normal: not hypothyroid, hyperfunction, and subnormal functioning), 3,772 training instances and 3,428 testing instances, and consists of 15 categorical and 6 real attributes. The raw patient measurements contain categorical attributes as well as missing values. For outlier detection, the 3,772 training instances are used with only the 6 real attributes; the hyperfunction class is treated as the outlier class and the other two classes as inliers. Another thyroid dataset from the UCI repository is a binary-class (thyroid or non-thyroid) dataset consisting of 7,547 instances (776 thyroid and 6,771 non-thyroid) and 29 attributes (mostly numeric or Boolean). It contains both hypothyroid and hyperthyroid data, and the raw patient measurements contain categorical attributes as well as several missing values. Another thyroid dataset, gathered from Imam Khomeini hospital, consists of 28 attributes. A further dataset, released by the Intelligent System Laboratory of K.N. Toosi University of Technology, consists of 1,538 patients with 21 features each (sex, age, T3, T3RU, T4, FT4, TSH, palpitation, drowsiness, exophthalmia, diarrhea, constipation, edema, menstruation, diaphoresis, heat intolerance, cold intolerance, weight change, appetite, tremor, and nervousness), 15 of them binary (x1, x8-x21) and 6 continuous (x2-x7) [152]; 331 patients belong to the Hyper class, 648 to the Hypo class and 559 to the Normal class. The Diabetes dataset is a multivariate dataset consisting of 75 attributes for 100K instances with missing values [197]. It has been prepared to analyze factors related to readmission as well as other outcomes pertaining to patients with diabetes [65, 12].

The Diabetes dataset extracted from the Health Facts national database of 130 US hospitals is used to analyze factors related to readmission as well as other outcomes pertaining to patients with diabetes [30]; it consists of 55 (integer) attributes for 100K instances, with missing values. The Fertility dataset consists of 10 real-valued attributes for 100 instances (18 to 36 years old), released for classification and regression purposes [66]; sperm concentration is related to socio-demographic data, environmental factors, health status, and life habits. The IVF dataset of 1,456 patients, released by the IVF Unit at Etlik Zubeyde Hanim Women's Health Teaching and Research Hospital, Ankara, Turkey, consists of demographic and clinical parameters as independent features (52 of 64 related to the female and 12 to the male) [72]; it has 43 categorical features, 21 numerical features and 13.5% missing values. Hepatitis: the UCI hepatitis dataset consists of 155 samples with 19 features (13 binary and 6 discrete) [119]; the purpose of the dataset is to predict the presence of the hepatitis virus given the results of various medical tests carried out on a patient. Liver: the ILPD (Indian Liver Patient Dataset) consists of 10 (integer and real) attributes for 583 instances (416 liver patients and 167 non-liver patients) for classification purposes [32]. The Heparin-induced thrombocytopenia (HIT) dataset was collected from 4,273 records of treated post-surgical cardiac patients [209].

VI. OPEN RESEARCH ISSUES AND FUTURE DIRECTIONS

Data mining and data analytics have tremendous potential in the development of preventive healthcare applications; however, their success hinges on the availability of quality data, and there is no magic recipe for successfully applying data analytics methods in a healthcare organisation. The successful development of a prevention application thus depends upon how data is stored, prepared and mined. Healthcare analytics poses a series of challenges when dealing with mounds of complex healthcare data. These challenges involve data complexity, access to data, regulatory compliance, information security and efficient analytics methods able to analyze this data in a reliable manner.

A. Monitoring and Incorporating Multi-Source Information

The main goal of preventive healthcare analysis is to take real-world medical data and use it to help prevent disease; the successful development of a preventive application therefore depends upon how data is stored, prepared and mined. In the health sector, large volumes of complex, heterogeneous, distributed and dynamic data are coming in: in the US alone, healthcare data reached 150 exabytes in 2011 and is expected to reach the zettabyte scale soon. Despite the rapid increase in EHR adoption, there are several challenges around making that information useful, readable and relevant to the physicians and patients who need it most. One of the key challenges in the healthcare industry is how to manage, store and exchange all of this data; interoperability is considered one solution to this problem. Another challenge is data privacy, which limits the sharing of data by blocking out significant patient identification information such as the MRN and SSN.

Most patients visit multiple clinics to try to find the reason for their disease. For example, a patient may visit clinic "A" for urgent care after a heart attack, follow up with his primary care physician at clinic "B", and be referred to a cardiologist at clinic "C". At each clinic, the physician examines the patient in depth, orders some tests and prescribes some drugs. Healthcare data is therefore often fragmented, and improving interoperability in the health sector would not only improve patient care but also save $30 billion a year [5]. Hospitals have yet to achieve this level of interoperability, and without it, it is almost impossible to improve patient care. The US health department wants interoperability between disparate EHRs by 2024. All medical stakeholders (physicians, administrators, patients, etc.) agree that interoperability will improve patient care, reduce medical errors and save budget. Imagine having the insights and opinions of hundreds of IVF/PGD patients to ease your decision before undergoing treatment, rather than relying directly on physician recommendations. Given the importance of data integration, healthcare organizations are turning to the implementation of interoperability.

Interoperability is the backbone of critical improvements in the health sector and has yet to become a reality. Similar to the concept of an ATM network, health data should be standardized and shared between providers. It may seem quite easy and straightforward to integrate EHRs (the integration of different electronic databases), and it could be, if EHRs had a common structure for data collection, but this is not the case in the real world. In practice, EHRs are much more complex and were not designed as open systems where information can be shared with multiple providers; in fact, EHRs were developed with the aim of replacing paper-based work and coordinating patient care within a single hospital. Interoperability is one of the most urgent needs in the quest for healthcare improvement, yet its implementation is slow due to several factors. One real issue in exchanging data between providers is that most EHRs are unable to interface with each other due to unstructured data. Another issue is data standardization, i.e. varied vocabularies and multiple interpretations. Like banking data, medical data needs to be standardized so that EHRs can share it automatically and physicians can interpret it regardless of the EHR. Data quality issues, i.e. missing values, multiple records, etc., also make data exchange difficult.

To achieve interoperability, HL7 and HITECH, together with HIPAA and several other standardization bodies, have defined standards and guidelines. The review process for interoperability includes standardized test scripts and exchange tests of standardized data. To assess how well an organization has met the interoperability and security standards, a third-party opinion on the EHR by an authorized testing and certifying body (ATCB) is considered. CCHIT and ARRA are the two types of certification used to evaluate systems.

B. P4 Medicine

Healthcare practice has largely been reactive: the patient waits until the onset of disease, which is then treated and, ideally, cured. Our efforts to treat disease are often imprecise, ineffective and unpredictable because we do not know the genetic and environmental factors of the disease. Moreover, the drugs and treatments we are advised to take are tested on broad populations and prescribed using statistical averages, so they may work for part of the population but not all of it. Overall, P4 medicine (defined as personalized, predictive, preventive and participatory) [84] has the potential to revolutionize care by customizing healthcare to the patient, enabling providers to match drugs to patients based on their profile (the right dose of the right drug for the right person at the right time), to identify which health conditions a patient is susceptible to and to determine how the patient responds to a particular therapy; however, current challenges and concerns need to be addressed to enhance its uptake and funding to benefit patients. One important purpose of P4 medicine is to enable disease prevention among healthy individuals through early detection of risk factors. Initiatives towards P4 care have been taken via personalized genomics, mobile health technology and pilot projects [71]. In the near future, physicians will be able to examine the unique biology of each individual to assess the probability of developing diseases such as cancer or diabetes, and will treat perturbations in healthy individuals before symptoms appear, thus optimizing the health of individuals and preventing disease.

P4 medicine technologies, however, do not fit exactly into existing healthcare technology assessment and reimbursement processes, and the future landscape of P4 care is being dictated by current decisions in this space. Currently, one of the main challenges is that both healthcare service providers and patients are reluctant to embrace advancements in the healthcare industry; to overcome this and provide customized services, healthcare providers need to build personalized medical care. The rapid growth of P4 care makes it difficult for physicians to understand, interpret and apply new findings, and it requires clinicians to recognize that patient engagement means much more than compliance. Medical education curricula have generally not incorporated personalized medicine concepts. Another of the biggest challenges is societal acceptance of P4 deployment, which is more daunting than the scientific and technological challenges facing P4 medicine. These societal challenges include ethics, legal, privacy, data accessibility and data ownership issues. Moreover, P4 medicine will require new standards and new policies for handling an individual's biological and healthcare information. To deal with these issues, industrial partners need to be brought into the consortium to help transfer P4 medicine to the patient population. Medical colleges must include personalized medicine in their curricula. Patient interest and demand are essential, too.


C. Security and Privacy

Every stakeholder in the health industry has a role to play in the privacy and security of patient information; it is truly a shared responsibility. Patient privacy and information security are fundamental components of a well-functioning healthcare system that helps achieve better health outcomes, smarter spending and healthier people. For example, a patient may not disclose, or may ask the physician not to record, his health information due to a lack of trust and a feeling that the information might not remain confidential. This kind of attitude puts the patient at risk, may deprive physicians and researchers of important findings, and puts the organization at risk with respect to clinical outcome and operational efficiency analysis. To reap the promise, providers and individuals must trust that a patient's health information is private and secure. On the other hand, providers face several challenges in implementing privacy and security to the patient's satisfaction, e.g. efficient data analysis without providing access to precise data in specific patient records. For example, a reminder sent to a patient from a drug store requires access to patient-specific information in the EHR that the patient may not want to share with the drug store; in this case, does the use of patient information for drug reminders outweigh the potential misuse of patient data? Security and privacy issues are magnified by data growth, non-standard data, and the variety and diversity of sources (different formats, multiple sources, the nature of the data, etc.). Traditional data security mechanisms, which were designed to secure small-scale data, are therefore inadequate and fail [10]. Security and privacy in data analytics, especially when information is drawn from multiple sources, poses several challenges.

The privacy of patient data is both a technical and a sociological issue, and it must be addressed jointly from both perspectives. The Health Insurance Portability and Accountability Act of 1996 (HIPAA) is a widely accepted set of general requirements and security standards to protect information in the health sector. HIPAA provides legal rights regarding personally identifiable sensitive patient information and establishes obligations for healthcare providers to protect its use and avoid its misuse. With the exponential rise in healthcare data, remote monitoring and smart devices, researchers face big challenges in anonymizing patients' sensitive information in order to prevent its misuse or disclosure. Discarding sensitive personal information such as the SSN (social security number), MRN (medical record number), name, age, etc. makes it very complex and challenging to link patient data to a unique individual; yet even with such information hidden, an attacker can still identify some sensitive patient information through association. Another way to hide sensitive patient information is differential privacy, which restricts an organization's data access based on its requirements. These privacy challenges lead to situations where data analytics researchers face issues from both a legal and an ethical perspective, and the privacy challenges associated with healthcare data analytics go beyond those of traditional data. For example, how can we share patient data while limiting disclosure and still ensuring sufficient data utility? Year of birth, gender and a 3-digit zip code are unique for 0.04% of US citizens, while date of birth, gender and a 5-digit zip code are unique for 87% of US citizens [202]. However, limiting data access results in the unavailability of important information that could be needed for certain data analytics tasks. Furthermore, real data is not static but changes over time; thus, previously defined techniques may end up blocking some of the most needed information or sharing sensitive patient information.
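To make the differential privacy idea mentioned above concrete, the toy sketch below adds Laplace noise calibrated to the sensitivity of a counting query before releasing it; the cohort size, the epsilon values and the scenario are illustrative assumptions only.

```python
# Hedged sketch: releasing a patient count under epsilon-differential privacy
# using the Laplace mechanism (the sensitivity of a counting query is 1).
import numpy as np

def dp_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

true_count = 1342             # e.g. number of diabetic patients in a cohort (illustrative)
for eps in (0.1, 1.0, 10.0):  # smaller epsilon -> stronger privacy, noisier answer
    print(f"epsilon={eps:>4}: released count = {dp_count(true_count, eps):.1f}")
```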

D. Advanced Analytics Techniques

Technological advancements (wearable devices, patient-centered care, etc.) are transforming the entire healthcare industry. The nature of health data has evolved: EHRs have simplified data acquisition with the help of the latest technology, but they lack the ability to aggregate, transform, or perform analytics on the data, and intelligence is restricted to retrospective reporting, which is insufficient for data analysis. A plethora of algorithms, techniques and tools are available for the analysis of complex data. Traditional machine learning uses statistical analysis based on a sample of the total dataset; using such methods against this type of data is inefficient and computationally infeasible. The combination of big healthcare data and computational power lets analysts focus on analytics techniques scaled up to accommodate the volume, velocity and variety of complex data. During the last decade there has been a dramatic change in the size and complexity of data, and several emerging data analysis techniques have been presented. At a minimum, data analytics platforms for big healthcare data must support the key functions necessary for analyzing big data in real environments. The criteria for evaluating a platform may include availability, scalability, continuity, quality assurance, ease of use, the ability to manipulate granularity at different levels, and privacy and security enablement [153]. Analyzing healthcare data in real time is a key requirement, but it faces many challenges that require researchers' attention, such as the lag between data collection and processing and the dynamic availability of numerous analytics algorithms.

Innovative analytics techniques need to be developed to interrogate healthcare data and gain insight into hidden patterns, trends and associations in the data. Moreover, there will be a shortage of more than 100K data analytics researchers through 2020, which could mean 50-60% of data analytics positions may go unfilled [1]. Thus, there is a need for data analytics scientists with advanced technical skills. Deep learning, a set of machine learning algorithms based on neural networks, is still evolving but has shown great potential for solving complex problems. It deduces relationships without the need for a specific model and enables the machine to identify patterns of interest in huge unstructured datasets. For example, a deep learning based approach learned on its own from Wikipedia data that California and Texas are U.S. states; it did not need to be given the concept of a country and its states [153]. This reflects how powerful deep learning is compared to other machine learning approaches.

E. Data Quality and Dataset

Gone are the days when healthcare data was small, structured and collected exclusively in electronic health records. Due to tremendous advancements in IT, wearable technology and other body sensors, the data is increasingly large (moving to big data), unstructured (80% of electronic health data is unstructured), non-standard and in multimedia format. This variety makes the data both challenging and interesting to analyze. The quality of healthcare data is a concern for any successful prevention system, for four main reasons: incompleteness (missing data), inconsistency (data mismatch within or between EHR sources), inaccuracy (non-standard, incorrect or imprecise data) and data fragmentation. Data quality comprises a group of techniques: data standardization, verification, validation, monitoring, profiling and matching. The problem of poor data in the health industry has reached epidemic proportions and has several pernicious effects, in particular on disease prevention. The problems with dirty data are mostly related to missing values, duplication, outliers and stale records. Thus, the first step before data analytics is cleaning, and analytics should be held off until the data is of high quality. Not all healthcare data is collected directly through sensors; a large fraction is collected through data entry (patient history, symptoms, physician recommendations, lab report data entry, etc.), which is not as reliable as automatic entry and introduces human error into the data. Incomplete and inaccurate data can also lead to missed or spurious associations that can be wasteful or even harmful to patient care. The opportunities for such data-entry errors are relatively limited and are easily identified by algorithms.
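A minimal cleaning pass of the kind argued for above might look like the following sketch, which handles duplicates, missing values and implausible entries before any analytics is run. The table, column names and plausibility range are hypothetical; real pipelines would use clinically validated rules.

```python
# Hedged sketch: a first-pass cleaning routine for a small tabular EHR extract.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "patient_id":  [1, 1, 2, 3, 4],
    "age":         [54, 54, 61, np.nan, 47],
    "systolic_bp": [130, 130, 145, 150, 990],   # 990 is an implausible entry error
})

df = df.drop_duplicates()                         # remove exact duplicate records
df["age"] = df["age"].fillna(df["age"].median())  # impute missing age (one simple choice)

# Flag values outside a simple plausibility range instead of trusting raw entries.
df["bp_suspect"] = ~df["systolic_bp"].between(70, 250)

print(df)
```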

Three million preventable adverse events result in 98,000 deaths and an extra $17 billion in costs per year, and most of the causes of medical errors could be overcome by improving device interoperability. Currently, only about a third of the devices hospitals use (such as defibrillators, electrocardiographs, vital signs monitors, ventilators and infusion pumps) can be integrated with EHRs [5]. This lack of automatic data acquisition creates significant sources of waste and risk to patients. Because of incomplete or stale information, data analytics methods may not be reliable for decision making. The solution to interoperability is not as straightforward as that for missing and inaccurate data, and over the last two decades little progress has been made. Currently, the hardware industry lacks the imperative to offer interoperability, as most hospitals have to bear the additional integration cost themselves and cannot compel hardware development companies to follow specific standards.

Although real-time data monitors (especially in the ICU) are partially used in most hospitals, real-time data analytics is not yet in practice. Hospitals are moving to real-time data collection, and in the near future real-time data analytics will revolutionize the healthcare industry: early identification of infections, continuous monitoring of treatment progress, selection of the right drugs, and so on could help reduce morbidity and mortality. To achieve real-time data processing, we need data standardization and device interoperability.

Another common issue is data standardization. Structuring even 20 percent of the data has shown its value, but clinical notes are still in use and are created in the billions, because free text lets the physician best explain the clinical encounter. Empowering physicians while maintaining data quality is quite challenging. So far, this data has been excluded from data analytics because it is available only in natural language and is not discrete. Transforming this unstructured data into discrete form requires efficient intelligent technology and has remained a hard problem in medical IT. The only way to use this unstructured and non-standard data is to apply NLP to translate it into discrete data using ICD or SNOMED CT. The Systrue system provides a clinical NLP platform with advanced intelligence techniques that allows healthcare organizations to power the content and data layer of their applications. AWARE (Ambient Warning and Response Evaluation) is a viewer and clinical decision support tool designed to reduce the risk of error. It works with multiple EHR systems and bedside monitors to present only relevant information on a single-screen dashboard. The development of AWARE focused on reducing information overload, improving efficiency and eliminating medical error in the ICU.

Dataset development in the health sector is not as simple as in other areas. Supervised learning requires labelled data, and the real problem with datasets is the ground truth; this is even more challenging for big data and deep learning, where very large datasets are needed for training. It is easy to obtain labelled images separating men from women, or cars of different colors and shapes, but annotating medical data requires experts. Thus, expensive medical expertise is needed for high-quality annotation of medical data, and data privacy makes this more challenging than in other domains. Furthermore, annotation by a single expert is not enough due to human error, so consensus annotations by multiple expert observers are required. In the case of unsupervised learning, on the other hand, millions of examples are needed, which are not easy to obtain either.

VII. CONCLUSION

In the health sector, accurate diagnosis and assessment depend upon data acquisition and data interpretation. Data acquisition has improved substantially over recent years; the data interpretation process has only recently begun to benefit. One area of healthcare where data analytics has a major impact is preventive care, helping clinicians to ward off diseases before they have a chance to take hold. To deliver these "prevention services", organizations must have access to data analytics experts, healthcare experts, data, and advanced analytics algorithms and tools. Intelligent healthcare data analytics has great potential to transform the way the health sector uses sophisticated, state-of-the-art technologies to gain deeper insight into data for disease prevention. In the near future we will witness the rapid and widespread implementation and use of P4 medicine across healthcare organizations; to reach that level, the challenges highlighted above need to be addressed. P4 medicine is at a nascent stage of development, but rapid advancement in tools, algorithms and wearable healthcare sensors will accelerate its maturation. We have highlighted the challenges and identified the assumptions that should be considered before implementing any method, and presented details of the algorithms with guidelines for the implementation of each section.

REFERENCES

[1] Institute for Health Technology Transformation (iHT2).
[2] Policy brief on ageing no. 3, older persons as consumers. 2009.
[3] Policy brief on ageing no. 6, health promotion and disease prevention. 2010.
[4] National health expenditure projections 2012-2022. 2012.
[5] The value of medical device interoperability. 2013.
[6] Health expenditure, total (% of GDP): World Health Organization Global Health Expenditure Database. 2014.
[7] Risk factors and prediction of very short term versus short/intermediate term post-stroke mortality: A data mining approach. Computers in Biology and Medicine, 54:199-210, 2014.

[8] Children: reducing mortality. 2015: Updated September2016.

[9] Ejaz Ahmed, Ibrar Yaqoob, Ibrahim Abaker TargioHashem, Imran Khan, Abdelmuttlib Ibrahim AbdallaAhmed, Muhammad Imran, and Athanasios V Vasi-lakos. The role of big data analytics in internet of things.Computer Networks, 129:459–471, 2017.

[10] Khandakar Ahmed, H Wang, and S Chenthara. Securityand privacy in big data environment. 2018.

[11] Shakil Ahmed, Frans Coenen, and Paul H. Leng. Tree-based partitioning of data for association rule mining. Knowl. Inf. Syst, 10(3):315-331, 2006.

[12] Zainab Ali Al-Qarny, Riyad Alshammari, and Muham-mad Imran Razzak. Impact of sharing health infor-mation related to diabetes through the social medianetwork: ontology. International Journal of Behaviouraland Healthcare Research, 5(3-4):162–171, 2015.

[13] Garrett E. Alexander. Biology of parkinsons disease:Pathogenesis and pathophysiology of a multisystemneurodegenerative disorder. Dialogues in Clinical Neu-roscience, 6(3/):259–280, 2004.

[14] Roohallah Alizadehsani, Jafar Habibi, Behdad Bahado-rian, Hoda Mashayekhi, Asma Ghandeharioun, Reihane


Boghrati, and Zahra Alizadeh Sani. Diagnosis of coro-nary arteries stenosis using data mining. J Med SignalsSens., 2(3/):153–159, 2012.

[15] Roohallah Alizadehsani, Jafar Habibi, Moham-mad Javad Hosseini, Hoda Mashayekhi, ReihaneBoghrati, Asma Ghandeharioun, Behdad Bahadorian,and Zahra Alizadeh Sani. A data mining approachfor diagnosis of coronary artery disease. ComputerMethods and Programs in Biomedicine, 111(1):52 –61, 2013.

[16] Roohallah Alizadehsani, Jafar Habibi, Zahra AlizadehSani, Hoda Mashayekhi, Reihane Boghrati, Asma Ghan-deharioun, Fahime Khozeimeh, and Fariba Alizadeh-Sani. Diagnosing coronary artery disease via data min-ing algorithms by considering laboratory and echocar-diography features. Res Cardiovasc Med., 2(3/):133–139, August 2013.

[17] Jamal Salahaldeen Majeed Alneamy and Rahma Abdul-wahid Hameed Alnaish. Heart disease diagnosis utiliz-ing hybrid fuzzy wavelet neural network and teachinglearning based optimization algorithm. Adv. ArtificialNeural Systems, 2014:796323:1–796323:11, 2014.

[18] Abdennadher S. Amer M, Goldstein M. Enhanc-ing one-class support vector machines for unsuper-vised anomaly detection. In: Proceedings of the ACMSIGKDD Workshop on Outlier Detection and Descrip-tion (ODD13)New York, NY, USA, pages 8–15, 2013.

[19] Wenjuan An, Mangui Liang, and He Liu. An improvedone-class support vector machine classifier for outlierdetection. Proceedings of the Institution of MechanicalEngineers, Part C: Journal of Mechanical EngineeringScience, 229(3):580–588, 2015.

[20] Giselle Pulice de Barros Larissa Sampaio de Paula AnaKarina Bartmann, Milton Faria and Nathalia RodriguesBettini1. The number of embryos obtained can offsetthe age factor in ivf results according to an artificialintelligence system. Womens Health Gynecology, 2(5),2013.

[21] Angiulli and Pizzuti. Fast outlier detection in highdimensional spaces. In European Conference on Princi-ples of Data Mining and Knowledge Discovery, PKDD,LNCS, volume 6, 2002.

[22] Padmakumari K. N. Anooj. Clinical decision supportsystem: risk level prediction of heart disease usingweighted fuzzy rules and decision tree rules. CentralEurop. J. Computer Science, 1(4):482–498, 2011.

[23] S. Anto, S. Chandramathi, and S. Aishwarya. An expertsystem based on LS-SVM and simulated annealing forthe diagnosis of diabetes disease. IJICT, 9(1):88–100,2016.

[24] Dario Antonelli, Giulia Bruno, and Silvia Chiusano.Anomaly detection in medical treatment to discoverunusual patient management. IIE Transactions onHealthcare Systems Engineering, 3(2):69–77, 2013.

[25] M. A. Anwar and Naseer Ahmed. Analyzing lifestyleand environmental factors on semen fertility using as-sociation rule mining. Information and KnowledgeManagement, 3(4):79 – 86, 2014.

[26] Ahmet Kadir Arslan, Cemil Colak, and Mehmet EdizSarihan. Different medical data mining approachesbased prediction of ischemic stroke. Computer Methodsand Programs in Biomedicine, 130:87 – 92, 2016.

[27] H. Nadir Ciray Mustafa Bahceci Asli Uyar, Ayse Bener.Predicting implantation outcome from imbalanced ivfdataset. In Proceedings of the World Congress on En-gineering and Computer Science, San Francisco, USA,pages 562–567. WCECS, 2009.

[28] Piotr Augustyniak. Personal wearable monitor of theheart rate variability. Bio-Algorithms and Med-Systems,7(1):5–10, 2011.

[29] Tapas Ranjan Baitharu and Subhendu Kumar Pani.Analysis of data mining techniques for healthcare deci-sion support system using liver disorder dataset. Proce-dia Computer Science, 85:862 – 870, 2016. InternationalConference on Computational Modelling and Security(CMS 2016).

[30] Chris Gennings Juan L. Olmo Sebastian VenturaKrzysztof J. Cios and John N. Clore Beata Strack,Jonathan P. DeShazo. Impact of hba1c measurement onhospital readmission rates: Analysis of 70,000 clinicaldatabase patient records. BioMed Research Interna-tional, 2014, 2014.

[31] Riccardo Bellazzi and Ameen Abu-Hanna. Data miningtechnologies for blood glucose and diabetes manage-ment. Journal of diabetes science and technology, 3(3),2009.

[32] Bendi Venkata Ramana, M. S. Prasad Babu, and N. B. Venkateswarlu. A critical comparative study of liver patients from USA and India: An exploratory analysis. International Journal of Computer Science, 2012.

[33] Petr Berka and Jan Rauch. Mining and post-processingof association rules in the atherosclerosis risk domain. InInformation Technology in Bio-and Medical Informatics,ITBAM 2010, pages 110–117. Springer, 2010.

[34] Nidhi Bhatla and Kiran Jyoti. A novel approach forheart disease diagnosis using data mining and fuzzylogic. International Journal of Computer Applications,54(17), 2012.

[35] Romanelli RJ Block TJ Hopkins D Carpenter HADolginsky MS Hudes ML Palaniappan LP Block CHBlock G, Azar KM. Diabetes prevention and weightloss with a fully automated behavioral intervention byemail, web, and mobile phone: A randomized controlledtrial among persons with prediabetes. J Med InternetRes, 17(10), 2015.

[36] Hayden Bosworth. Improving patient treatment adher-ence: A clinician’s guide. Springer Science & BusinessMedia, 2010.

[37] Marc Boulle. Regularization and averaging of theselective naive bayes classifier. In IJCNN, pages 1680–1688. IEEE, 2006.

[38] Marc Boulle. Compression-based averaging of selectivenaive bayes classifiers. Journal of Machine LearningResearch, 8, 2007.

[39] Joseph L Breault, Colin R Goodall, and Peter J Fos.Data mining a diabetic data warehouse. Artificial


intelligence in medicine, 26(1-2):37–54, 2002.[40] Luiz F. M. Carvalho, Carlos H. C. Teixeira, Ester C.

Dias, Wagner Meira Jr., and Osvaldo F. S. Carvalho.A simple and effective method for anomaly detectionin healthcare. In 4th Workshop on Data Mining forMedicine and Healthcare, 2015 SIAM InternationalConference on Data Mining, Vancouver, Canada, 2015.

[41] Aaron Ceglar and John F. Roddick. Association mining.ACM Computing Surveys, 38(2):5:1–5:42, March 2006.

[42] Varun Chandola, Arindam Banerjee, and Vipin Kumar.Anomaly detection: A survey. ACM Comput. Surv,41(3), 2009.

[43] Iyad Batal Charmgil Hong and Milos Hauskrecht. Ageneralized mixture framework for multi-label classifi-cation. In SIAM Data Mining Conference (SDM). SIAM,2015.

[44] Peter Cheeseman and John C. Stutz. Bayesian classifi-cation (autoclass): Theory and results. In Advances inKnowledge Discovery and Data Mining, pages 153–180.1996.

[45] Chu Yu Chin, Meng Yu Weng, Tzu Chieh Lin,Shyr Yuan Cheng, Yea Huei Kao Yang, and Vincent STseng. Mining disease risk patterns from nationwideclinical databases for the assessment of early rheumatoidarthritis risk. PloS one, 10(4):e0122508, 2015.

[46] Baek Hwan Cho, Hwanjo Yu, Kwang-Won Kim,Tae Hyun Kim, In Young Kim, and Sun I. Kim. Applica-tion of irregular and unbalanced data to predict diabeticnephropathy using visualization and feature selectionmethods. 42(1):37 – 53, 2008.

[47] Raphael Cohen, Michael Elhadad, and Noemie Elhadad.Redundancy in electronic health record corpora: analy-sis, impact on text mining performance and mitigationstrategies. BMC Bioinformatics, 14(1):10, 2013.

[48] Stefano Concaro, Lucia Sacchi, Carlo Cerra, PietroFratino, and Riccardo Bellazzi. Mining healthcaredata with temporal association rules: Improvements andassessment for a practical use. In Conference onArtificial Intelligence in Medicine in Europe, pages 16–25. Springer, 2009.

[49] Stefano Concaro, Lucia Sacchi, Carlo Cerra, Mario Ste-fanelli, Pietro Fratino, and Riccardo Bellazzi. Temporaldata mining for the assessment of the costs related todiabetes mellitus pharmacological treatment. In AMIA,2009.

[50] Shengnan Cong, Jiawei Han, and David A. Padua.Parallel mining of closed sequential patterns. In RobertGrossman, Roberto J. Bayardo, and Kristin P. Bennett,editors, KDD, pages 562–567. ACM, 2005.

[51] G. Corani, C. Magli, A. Giusti, L. Gianaroli, and L.M.Gambardella. A bayesian network model for predictingpregnancy after in vitro fertilization. Computers inBiology and Medicine, 43(11):1783 – 1792, 2013.

[52] Vapnik Valdimir Cortes Corinna. Support-vector net-works. Machine Learning, (3):273–297, 21995.

[53] Tahani Daghistani, Riyad Al Shammari, and Muham-mad Imran Razzak. Discovering diabetes complications:an ontology based model. Acta Informatica Medica,

23(6):385, 2015.[54] Patrick Esser Dax SteinsEmail author, Helen Dawes

and Johnny Collett. Wearable accelerometry-basedtechnology capable of assessing functional activitiesin neurological populations in community settings: asystematic review. 11(36), 2014.

[55] J.B. Dasah P. Kuranchie A.G.B. Amoah D.N. Adjei,C. Agyemang. The effect of electronic reminders on riskmanagement among diabetic patients in low resourcedsettings. Journal of Diabetes and its Complications,,29(6):818 – 821, 2015.

[56] Jiang Y Shepherd M Maddison R Carter K Cutfield RMcNamara C Khanolkar M Murphy Dobson R, Whit-taker R. Text message-based diabetes self-managementsupport (sms4bg): study protocol for a randomised con-trolled trialth. Trials,, 17(179), 2016.

[57] SS Ravi David C. Torney Srinivas Doddi,Achla Marathe. Discovery of association rules inmedical data. Medical informatics and the Internet inmedicine, 26(1):25–33, 2001.

[58] M. Durairaj and R. NandhaKumar. Data mining applica-tion on ivf data for the selection of influential parameterson fertility. International Journal of Engineering andAdvanced Technology (IJEAT), 6(2):262 – 266, 2013.

[59] M. Durairaj and R. Nandhakumar. An integratedmethodology of artificial neural network and rough settheory for analyzing ivf data. In International Con-ference on Intelligent Computing Applications (ICICA),2014 , Coimbatore, pages 126–129. IEEE, 214.

[60] Yan Feng, Yixin Wang, Fang Guo, and Hao Xu. Ap-plications of data mining methods in the integrativemedical studies of coronary heart disease: progress andprospect. Evidence-Based Complementary and Alterna-tive Medicine, 2014, 2014.

[61] Henry H. Fischer, Ilana P. Fischer, Rocio I. Pereira,Anna L. Furniss, Jeanne M. Rozwadowski, Susan L.Moore, Michael J. Durfee, Silvia G. Raghunath,Adam G. Tsai, and Edward P. Havranek. Text messagesupport for weight loss in patients with prediabetes: Arandomized clinical trial. Diabetes Care, 2016.

[62] Andrew Frank and Arthur Asuncion. Uci machine learn-ing repository [http://archive. ics. uci. edu/ml]. irvine,ca: University of california. School of Information andComputer Science, 213, 2010.

[63] Kong L Wiederhold BK Gao K, Wiederhold MD. Clin-ical experiment to assess effectiveness of virtual realityteen smoking cessation program. Stud Health TechnolInform, 19:58–62, 2013.

[64] Ben S. Gerber, Melinda R. Stolley, Allison L. Thomp-son, Lisa K. Sharp, and Marian L. Fitzgibbon. Mobilephone text messaging to promote healthy behaviors andweight loss maintenance: a feasibility study. HealthInformatics Journal, 15(1):17–25, 2009.

[65] Huda Al Ghamdi, Riyad Alshammari, and Muham-mad Imran Razzak. An ontology-based system topredict hospital readmission within 30 days. Interna-tional Journal of Healthcare Management, 9(4):236–244, 2016.


[66] David Gil, Jose Luis Girela, Joaquin De Juan, M. JoseGomez-Torres, and Magnus Johnsson. Predicting sem-inal quality with artificial intelligence methods. ExpertSystems with Applications, 39(16):12564 – 12573, 2012.

[67] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M.Hausdorff, P. Ch. Ivanov, R. G. Mark, J. E. Mietus,G. B. Moody, C.-K. Peng, and H. E. Stanley. Phys-ioBank, PhysioToolkit, and PhysioNet: Components of anew research resource for complex physiologic signals.Circulation, 101(23):e215–e220, 2000 (June 13).

[68] Uchida S. Goldstein, M. A comparative evaluation ofunsupervised anomaly detection algorithms for multi-variate data. PLoS ONE, 1(4), 2016.

[69] Markus Goldstein and Andreas Dengel. Histogram-based outlier score (hbos): A fast unsupervised anomalydetection algorithm. Poster and Demo Track of the 35thGerman Conference on Artificial Intelligence (KI-2012),pages 59–63, 2012.

[70] Markus Goldstein and Seiichi Uchida. A compara-tive evaluation of unsupervised anomaly detection al-gorithms for multivariate data. PLoS ONE, 11(4), 2016.

[71] Sara Green and Henrik Vogt. Personalizing medicine: Disease prevention in silico and in socio. HumanaMente, 2016.

[72] Dilbaz S Ozdegirmenci O Demir B Dilbaz BGvenir HA, Misirli G. Estimating the chance of successin ivf treatment using a ranking algorithm. MedicalBiological Engineering Computingt, 53:911 – 920,2015.

[73] Riyaz Ahamed Ariyaluran Habeeb, Fariza Nasaruddin,Abdullah Gani, Ibrahim Abaker Targio Hashem, EjazAhmed, and Muhammad Imran. Real-time big data pro-cessing for anomaly detection: A survey. InternationalJournal of Information Management, 2018.

[74] Fouzi Harrou, Farid Kadri, Sondes Chaabane, ChristianTahon, and Ying Sun. Improved principal componentanalysis for anomaly detection: Application to an emer-gency department. Computers & Industrial Engineering,88:63–77, 2015.

[75] Milos Hauskrecht, Michal Valko, Iyad Batal, GillesClermont, Shyam Visweswaran, and Gregory Cooper.Conditional outlier detection for clinical alerting. InAMIA annual symposium proceedings, volume 2010,pages 286–90, 2010.

[76] Milos Hauskrecht, Michal Valko, Branislav Kveton,Shyam Visweswaran, and Gregory Cooper. Evidence-based anomaly detection in clinical domains. In AnnualAmerican Medical Informatics Association Symposium,pages 319–324, 2007.

[77] Milos Hauskrecht, Shyam Visweswaran, Gregory F.Cooper, and Gilles Clermont. Conditional outlier ap-proach for detection of unusual patient care actions. InAAAI (Late-Breaking Developments), volume WS-13-17of AAAI Workshops. AAAI, 2013.

[78] Kveton B Visweswaran S Cooper GF. Hauskrecht M,Valko M. Evidence-based anomaly detection. In In:Proceedings of annual American Medical InformaticsAssociation symposium), pages 319 – 324. AAAI, 2007.

[79] Jinyuan He, Jia Rong, Le Sun, Hua Wang, YanchunZhang, and Jiangang Ma. D-ecg: a dynamic frameworkfor cardiac arrhythmia detection from iot-based ecgs. InInternational Conference on Web Information SystemsEngineering, pages 85–99. Springer, 2018.

[80] Geoffrey E. Hinton, Simon Osindero, and Yee WhyeTeh. A fast learning algorithm for deep belief nets.Neural Computation, 18(7):1527–1554, 2006.

[81] Griffiths F Munday S Friede T Stables D. Holt TA,Thorogood M. Automated electronic reminders tofacilitate primary cardiovascular disease prevention: ran-domised controlled trial. Br J Gen Pract,, 60(573):137– 143, 2010.

[82] Charmgil Hong and Milos Hauskrecht. Multivariateconditional outlier detection and its clinical application.In Dale Schuurmans and Michael P. Wellman, editors,AAAI, pages 4216–4217. AAAI Press, 2016.

[83] Hong Wang, Qingsong Xu, and Lifeng Zhou. Seminal quality prediction using clustering-based decision forests. Algorithms.

[84] L Hood and D Galas. P4 medicine: Personalized, pre-dictive, preventive, participatory a change of view thatchanges everything. Computing community consortium,2008.

[85] Vahid Hosseini. Algorithm and related application forsmart wearable devices to reduce the risk of death andbrain damage in diabetic coma. J Diabetes Sci TechnolMay 2016 vol. 10 no. 3 802-803, 2015.

[86] Jianying Hu, Fei Wang, Jimeng Sun, Robert Sorrentino,and Shahram Ebadollahi. A healthcare utilization anal-ysis framework for hot spotting and contextual anomalydetection. In AMIA. AMIA, 2012.

[87] Yi Chao Huang. Mining association rules between ab-normal health examination results and outpatient med-ical records. Health Information Management Journal,42(2):23–30, 2013.

[88] Morgan P Callister R Collins C. Hutchesson MJ1,Tan CY. Enhancement of self-monitoring in a web-basedweight loss program by extra individualized feedbackand reminders: Randomized trial. J Med Internet Res.,18(4), 2015.

[89] Chang-Beom Kwon Hye-Jin Lee, Jaeho Oh and Sang-Woo Ban. Emergent cardiac anomaly classificationusing cascaded autoassociative multilayer perceptronsfor bio-healthcare systems. International Journal of Bio-Science and Bio-Technology, 2016.

[90] Persi Pamela. I, Gayathri. P, and N. Jaisankar. A fuzzyoptimization technique for the prediction of coronaryheart disease using decision tree. 2013.

[91] H.B. Silva I. Lerman, J. Costa. Validation of verylarge data sets clustering by means of a nonparametriclinear criterion, in: Classification, clustering and dataanalysis, pages 147–157. Springer Berlin Heidelberg,2006.

[92] F.J. Cortijo-R. Molina J.A. Garcia, J. dezValdivia. Adynamic approach for clustering data. 44(2), 1994.

[93] MA Jabbar, P Chandra, and BL Deekshatulu. Knowl-edge discovery from mining association rules for heart


disease prediction. J Theor Appl Inf Technol, 41(2):45–53, 2012.

[94] Chuan-Wei Tsao Jinn-Yi Yeh, Tai-Hsi Wu. Usingdata mining techniques to predict hospitalization ofhemodialysis patients. Decision Support Systems,50(2):439 – 448, 2011.

[95] Jitendra Jonnagaddala, Siaw-Teng Liaw, Pradeep Ray,Manish Kumar, Nai-Wen Chang, and Hong-Jie Dai.Coronary artery disease risk assessment from unstruc-tured electronic health records using text mining. Jour-nal of Biomedical Informatics, 58, Supplement:S203 –S210, 2015. Proceedings of the 2014 i2b2/UTHealthShared-Tasks and Workshop on Challenges in NaturalLanguage Processing for Clinical Data.

[96] Jitendra Jonnagaddala, Siaw-Teng Liaw, Pradeep Ray,Manish Kumar, Hong-Jie Dai, and Chien-Yeh Hsu.Identification and progression of heart disease risk fac-tors in diabetic patients from longitudinal electronichealth records. August 25 2015.

[97] Alaleh Zamanbin Markus W. Hollmann Frits HollemanBenedikt Preckel Jeroen Hermanides Jorinde A.W. Pol-derman, Fleur A. de Groot. An automated reminderfor perioperative glucose regulation improves protocolcompliance. Diabetes Research and Clinical Practice,,116:80 – 82, 2016.

[98] Magnus Johnsson Mara Jos Gomez-Torres JoseL. Girela, David Gil and Joaqun De Juan. Semen param-eters can be predicted from environmental factors andlifestyle using artificial intelligence methods. Biology ofReproduction, 88(4), 2014.

[99] Sujata Joshi and Mydhili K. Nair. Prediction of HeartDisease Using Classification Based Data Mining Tech-niques, pages 503–511. Springer India, New Delhi,2015.

[100] P. Bajaja K. Choudhary. Automated prediction of rct(root canal treatment) using data mining techniques:Ict in health care. In International Conference onInformation and Communication Technologies (ICICT2014), Procedia Computer Science 46 ( 2015 ) 682 688,volume 26, pages 682–688, 2015.

[101] Enamul Kabir, Siuly Siuly, Jinli Cao, and Hua Wang. Acomputer aided analysis scheme for detecting epilepticseizure from eeg data. International Journal of Compu-tational Intelligence Systems, 11:663–671, 2018.

[102] S. M. Kamruzzaman and A. M. Jehad Sarkar. A new data mining scheme using artificial neural networks. Sensors (Basel, Switzerland), 11(5):4622–4647, 2013.

[103] M. A. Karaolis, J. A. Moutiris, D. Hadjipanayi, and C. S. Pattichis. Assessment of the risk factors of coronary heart events based on data mining with decision trees. IEEE Transactions on Information Technology in Biomedicine, 14(3):559–566, May 2010.

[104] Noreen Kausar, Sellapan Palaniappan, Brahim Belhaouari Samir, Azween Abdullah, and Nilanjan Dey. Systematic analysis of applied data mining based optimization algorithms in clinical attribute extraction and classification for diagnosis of cardiac patients. In Aboul Ella Hassanien, Crina Grosan, and Mohamed Fahmy Tolba, editors, Applications of Intelligent Optimization in Biology and Medicine, volume 96 of Intelligent Systems Reference Library, pages 217–231. Springer, 2016.

[105] Kulvaibhav Kaushik, Divya Kapoor, Vijayaraghavan Varadharajan, and Rajarathnam Nallusamy. Disease management: clustering-based disease prediction. International Journal of Collaborative Enterprise, 4(1-2):69–82, 2014.

[106] Sajid Ullah Khan. Classification of Parkinson's disease using data mining techniques. Parkinsons Dis Alzheimer Dis, 2, 2015.

[107] Aditya Khosla, Hsu-Kuang Chiu, Yu Cao, Junling Hu, Cliff Chiung-Yu Lin, and Honglak Lee. An integrated machine learning approach to stroke prediction, May 12, 2012.

[108] Jaekwon Kim, Jongsik Lee, and Youngho Lee. Data-mining-based coronary heart disease risk prediction model using fuzzy logic and decision tree. Healthcare Informatics Research, 21(3):167–174, 2015.

[109] J. Y. Kim, S. Oh, S. Steinhubl, S. Kim, W. K. Bae, J. S. Han, J. H. Kim, K. Lee, and M. J. Kim. Effectiveness of 6 months of tailored text message reminders for obese male participants in a worksite weight loss program: randomized controlled trial. JMIR Mhealth Uhealth, 3(1), 2015.

[110] Helen S. Koo, Dawn Michaelson, Karla Teel, Dong-Joo Kim, Hyejin Park, and Minseo Park. Design preferences on wearable e-nose systems for diabetes. International Journal of Clothing Science and Technology, 28(2):216–232, 2016.

[111] Anooj Padmakumari Kunjunninair. Clinical decision support system: Risk level prediction of heart disease using weighted fuzzy rules. Journal of King Saud University - Computer and Information Sciences, 24:27–40, 2012.

[112] Anooj Padmakumari Kunjunninair. Erratum to: "Clinical decision support system: risk level prediction of heart disease using weighted fuzzy rules and decision tree rules". Central Europ. J. Computer Science, 2(1):86, 2012.

[113] Anooj Padmakumari Kunjunninair. Classification of Parkinson's disease using data mining techniques. J Parkinsons Dis Alzheimer Dis, 2(1), 2015.

[114] Yen-Ting Kuo, Andrew Lonie, Adrian R Pearce, and Liz Sonenberg. Mining surprising patterns and their explanations in clinical data. Applied Artificial Intelligence, 28(2):111–138, 2014.

[115] Andrew Kusiak, Bradley Dixon, and Shital Shah. Predicting survival time for kidney dialysis patients: a data mining approach. Computers in Biology and Medicine, 35(4):311–327, 2005.

[116] L. P. Fried, N. O. Borhani, P. Enright, C. D. Furberg, J. M. Gardin, R. A. Kronmal, L. H. Kuller, T. A. Manolio, M. B. Mittelmark, A. Newman, D. H. O'Leary, B. Psaty, P. Rautaharju, R. P. Tracy, and P. G. Weiler. The cardiovascular health study: design and rationale. Annals of Epidemiology, 1(3):263–276, 1991.

[117] Pat Langley and Stephanie Sage. Induction of selective Bayesian classifiers. CoRR, abs/1302.6828, 2013.

[118] Dong Gyu Lee, Kwang Sun Ryu, Mohamed Bashir, Jang-Whan Bae, and Keun Ho Ryu. Discovering medical knowledge using association rule mining in young adults with acute myocardial infarction. Journal of Medical Systems, 37(2):1–10, 2013.

[119] M. Lichman. UCI machine learning repository, 2013.

[120] M. A. Little, P. E. McSharry, S. J. Roberts, D. A. Costello, and I. M. Moroz. Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection. BioMedical Engineering OnLine, 6(23), 2007.

[121] Xiaoyong Liu and Hui Fu. PSO-based support vector machine with cuckoo search technique for clinical disease diagnoses. May 25, 2014.

[122] Luo Luo, Yubo Xie, Zhihua Zhang, and Wu-Jun Li. Support matrix machines. In International Conference on Machine Learning, pages 938–947, 2015.

[123] Yi Luo, Ji-Qiang Li, Dan-Wen Zheng, Zhan-Peng Tan, Hong Zhou, Qing-Ping Deng, Yun-Tao Liu, Ai-Hua Ou, and Jian Yin. Application of data mining technology in excavating prevention and treatment experience of infectious diseases from famous herbalist doctors. In Bioinformatics and Biomedicine Workshops (BIBMW), 2011 IEEE International Conference on, pages 784–790. IEEE, 2011.

[124] Jiangang Ma, Le Sun, Hua Wang, Yanchun Zhang, and Uwe Aickelin. Supervised anomaly detection in uncertain pseudoperiodic data streams. ACM Transactions on Internet Technology (TOIT), 16(1):4, 2016.

[125] Muhammad Maeemr. Etiological evaluation of seminal traits using Bayesian belief. International Journal of Bio-Science and Bio-Technology, 6(6):79–86, 2014.

[126] Ahmad Khushairy Makhtar, Hanafiah Yussof, Hayder Al-Assadi, Low Cheng Yee, K. Rajeswari, V. Vaithiyanathan, and T. R. Neelakantan. Feature selection in ischemic heart disease identification using feed forward neural networks. In International Symposium on Robotics and Intelligent Sensors 2012 (IRIS 2012), Procedia Engineering, 41:1818–1823, 2012.

[127] O. L. Mangasarian and David R. Musicant. Lagrangian support vector machine classification. Technical Report 00-06, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, June 2000. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/00-06.ps.

[128] Tyler McCormick, Cynthia Rudin, and David Madigan. A hierarchical model for association rule mining of sequential events: An approach to automated medical symptom prediction. 2011.

[129] H. C. McGill Jr, C. A. McMahan, and S. S. Gidding. Preventing heart disease in the 21st century: implications of the Pathobiological Determinants of Atherosclerosis in Youth (PDAY) study. Circulation, 117(9):1216–1227, 2008.

[130] J Mezger, S Visweswaran, M Hauskrecht, G Clermont,and GF Cooper. A statistical approach for detectingdeviations from usual medical care. In AMIA... AnnualSymposium proceedings/AMIA Symposium. AMIA Sym-posium, page 1051, 2007.

[131] G. W. Milligan and M. C. Cooper. An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(1):159–179, 1985.

[132] Andrew J Monaghan, Cory W Morin, Daniel F Steinhoff, Olga Wilhelmi, Mary Hayden, Dale A Quattrochi, Michael Reiskind, Alun L Lloyd, Kirk Smith, Chris A Schmidt, et al. On the seasonal occurrence and abundance of the Zika virus vector mosquito Aedes aegypti in the contiguous United States. PLoS Currents, 8, 2016.

[133] Muhamad Hariz Muhamad Adnan, Wahidah Husain, and Nur'Aini Abdul Rashid. Hybrid approaches using decision tree, naive Bayes, means and Euclidean distances for childhood obesity prediction. International Journal of Software Engineering and Its Applications, 6(3), 2012.

[134] Jesmin Nahar, Tasadduq Imam, Kevin S Tickle, and Yi-Ping Phoebe Chen. Association rule mining to detect factors which contribute to heart disease in males and females. Expert Systems with Applications, 40(4):1086–1093, 2013.

[135] Maryam M Najafabadi, Flavio Villanustre, Taghi M Khoshgoftaar, Naeem Seliya, Randall Wald, and Edin Muharemagic. Deep learning applications and challenges in big data analytics. February 24, 2015.

[136] G. NaliniPriya, Arputharaj Kannan, and P. Anandhakumar. A knowledgeable decision tree classification model for multivariate heart disease data - a boon to healthcare. In Zhenhua Li, Xiang Li, Yong Liu, and Zhihua Cai, editors, ISICA, volume 316 of Communications in Computer and Information Science, pages 459–467. Springer, 2012.

[137] Jamshid Norouzi, Ali Yadollahpour, Seyed Ahmad Mirbagheri, Mitra Mahdavi Mazdeh, and Seyed Ahmad Hosseini. Predicting renal failure progression in chronic kidney disease using integrated intelligent fuzzy expert system. Computational and Mathematical Methods in Medicine, 2016, 2016.

[138] Mitsuhiro Ogasawara, Hiroki Sugimori, Yukiyasu Iida, and Katsumi Yoshida. Analysis between lifestyle, family medical history and medical abnormalities using data mining method - association rule analysis. In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pages 161–171. Springer, 2005.

[139] Jason S. O'Grady, Tom D. Thacher, and Rajeev Chaudhry. The effect of an automated clinical reminder on weight loss in primary care. The Journal of the American Board of Family Medicine, 26(6):745–750, 2013.

[140] Miho Ohsaki, Hidenao Abe, Shusaku Tsumoto, Hideto Yokoi, and Takahira Yamaguchi. Evaluation of rule interestingness measures in medical knowledge discovery in databases. Artificial Intelligence in Medicine, 41(3):177–196, 2007.

[141] Carlos Ordonez, Norberto Ezquerra, and Cesar A Santana. Constraining and summarizing association rules in medical data. Knowledge and Information Systems, 9(3):1–2, 2006.

[142] P. A. Wolf, R. B. D'Agostino, A. J. Belanger, and W. B. Kannel. Probability of stroke: a risk profile from the Framingham Study. Stroke, 22:312–318, 1991.

[143] P. Cheeseman, J. Kelly, M. Self, and J. Stutz. AutoClass: a Bayesian classification system. In Proceedings of the Fifth International Conference on Machine Learning, pages 54–64. Morgan Kaufmann, Los Altos, CA, 1988.

[144] Jakub Parak, Adrian Tarniceriu, Philippe Renevey, Mattia Bertschi, Ricard Delgado-Gonzalo, and Ilkka Korhonen. Evaluation of the beat-to-beat detection accuracy of PulseOn wearable optical heart rate monitor. In EMBC, pages 8099–8102. IEEE, 2015.

[145] H. Kupwade Patil and R. Seshadri. Big data security and privacy issues in healthcare. In 2014 IEEE International Congress on Big Data, pages 762–765, June 2014.

[146] Pawel Malinowski, Robert Milewski, Piotr Ziniewicz, Anna Justyna Milewska, Jan Czerniecki, and Slawomir Wolczynski. The use of data mining methods to predict the result of infertility treatment using the IVF ET method. Studies in Logic, Grammar and Rhetoric, 39(52):67–74, 2014.

[147] Carolyn Payus, Norela Sulaiman, Mazrura Shahani, and Azuraliza Abu Bakar. Association rules of data mining application for respiratory illness by air pollution database. Int J Basic Appl Sci, 13(3):11–16, 2013.

[148] NhatHai Phan, Dejing Dou, Hao Wang, David Kil, and Brigitte Piniewski. Ontology-based deep learning for human behavior prediction in health social networks. In BCB, pages 433–442. ACM, 2015.

[149] John D. Piette, Justin List, Gurpreet K. Rana, Whitney Townsend, Dana Striplin, and Michele Heisler. Mobile health devices as tools for worldwide cardiovascular risk reduction and disease management. March 2015.

[150] Foster Provost. Well-trained PETs: Improving probability estimation trees, 2000.

[151] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1992.

[152] R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.

[153] Wullianallur Raghupathi and Viju Raghupathi. Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2(1):3, 2014.

[154] Vipul Raheja and KS Rajan. Comparative study of association rule mining and MiSTIC in extracting spatio-temporal disease occurrences patterns. In 2012 IEEE 12th International Conference on Data Mining Workshops, pages 813–820. IEEE, 2012.

[155] P Rajendran and M Madheswaran. An improved image mining technique for brain tumour classification using efficient classifier. arXiv preprint arXiv:1001.1988, 2010.

[156] Dheeraj Raju, Xiaogang Su, Patricia A. Patrician, Lori A. Loan, and Mary S. McCarthy. Exploring factors associated with pressure ulcers: A data mining approach. International Journal of Nursing Studies, 52(1):102–111, 2015.

[157] R. Geetha Ramani and G. Sivagami. Parkinson disease classification using data mining algorithms. (9/), 2011.

[158] Ramaswamy, Rastogi, and Shim. Efficient algorithms for mining outliers from large data sets. ACM SIGMOD Record, 29, 2000.

[159] Imran Razzak, Muhammad Imran, and Guandong Xu. Efficient brain tumor segmentation with multiscale two-pathway-group conventional neural networks. IEEE Journal of Biomedical and Health Informatics, 2018.

[160] Muhammad Imran Razzak. Malarial parasite classification using recurrent neural network. International Journal of Image Processing (IJIP), 9(2):69, 2015.

[161] Muhammad Imran Razzak and Bandar Alhaqbani. Automatic detection of malarial parasite using microscopic blood images. Journal of Medical Imaging and Health Informatics, 5(3):591–598, 2015.

[162] Muhammad Imran Razzak, Michael Blumenstein, and Guandong Xu. Cooperative evolution multiclass support matrix machines for single trial EEG classification. arXiv preprint arXiv:1003.2019.

[163] Muhammad Imran Razzak, Michael Blumenstein, and Guandong Xu. Multi-class support matrix machines by maximizing the inter-class margin for single trial EEG classification. arXiv preprint arXiv:1002.2019.

[164] Muhammad Imran Razzak, Michael Blumenstein, and Guandong Xu. Robust 2D joint sparse PCA. arXiv preprint arXiv:1001.2019.

[165] Muhammad Imran Razzak, Michael Blumenstein, and Guandong Xu. Robust support matrix machine. arXiv preprint arXiv:1001.2019.

[166] Muhammad Imran Razzak and Saeeda Naz. Microscopic blood smear segmentation and classification using deep contour aware CNN and extreme machine learning. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 801–807. IEEE, 2017.

[167] Muhammad Imran Razzak, Saeeda Naz, and Ahmad Zaib. Deep learning for medical image processing: Overview, challenges and the future. In Classification in BioApps, pages 323–350. Springer, 2018.

[168] Muhammad Imran Razzak, Raghib Abu Saris, Michael Blumenstein, and Guandong Xu. Robust 2D joint sparse principal component analysis with F-norm minimization for sparse modelling: 2D-RJSPCA. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2018.

[169] W. T. Riley, D. E. Rivera, A. A. Atienza, W. Nilsen, S. M. Allison, and R. Mermelstein. Health behavior models in the age of mobile interventions: are our theories up to the task? Translational Behavioral Medicine, 1(1):57–71, 2011.

[170] Lior Rokach and Oded Maimon. Chapter 15: Clustering Methods, December 28, 2009.

[171] Anoop J. Sahoo and Yugal Kumar. Seminal quality prediction using data mining methods. Technology and Health Care, 22(4):531–545, 2014.

[172] Rajinder Sandhu, Sandeep K. Sood, and Gurpreet Kaur. An intelligent system for predicting and preventing MERS-CoV infection outbreak. The Journal of Supercomputing, 72(8):3033–3056, 2016.

[173] D. Satcher and E. J. Higginbotham. The public health approach to eliminating disparities in health. American Journal of Public Health, 98(3):400–403, 2008.

[174] Catherine Hylas Saunders. Automated phone and mail notices increase medication adherence, Kaiser Permanente study uncovers effective and inexpensive intervention. November 26, 2012.

[175] Bernhard Scholkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.

[176] Tatiana Semenova, Markus Hegland, Warwick Graco, and Graham Williams. Effectiveness of mining association rules for identifying trends in large health databases. In Workshop on Integrating Data Mining and Knowledge Management, ICDM, volume 1, page 19, 2001.

[177] Neha Sharma and Hari Om. Significant patterns for oral cancer detection: association rule on clinical examination and history data. Network Modeling Analysis in Health Informatics and Bioinformatics, 3(1):1–13, 2014.

[178] Kamran Shaukat, Nayyer Masood, Ahmed Bin Shafaat, Kamran Jabbar, Hassan Shabbir, and Shakir Shabbir. Dengue fever in perspective of clustering algorithms. arXiv preprint arXiv:1511.07353, 2015.

[179] Shinyoung Lee, Yoonjoo Kim, Jisue Kang, Jiwoo Oh, Jungwon Baek, and Taeseon Yoon. Prediction of SARS coronavirus main protease by support vector machine. pages 185–189, 2014.

[180] Syed H Shirazi, Arif Iqbal Umar, Saeeda Naz, and Muhammad I Razzak. Efficient leukocyte segmentation and recognition in peripheral blood image. Technology and Health Care, 24(3):335–347, 2016.

[181] Y Shirisha, S Siva Shankar Rao, and D Sujatha. Data mining with predictive analysis for healthcare sector: An improved weighted associative classification approach. Global Journal of Computer Science and Technology, 11(22), 2012.

[182] M. Shouman, T. Turner, and R. Stocker. Using data mining techniques in heart disease diagnosis and treatment. In Electronics, Communications and Computers (JEC-ECC), 2012 Japan-Egypt Conference on, pages 173–177, March 2012.

[183] Mai Shouman, Tim Turner, and Rob Stocker. Using decision tree for diagnosing heart disease patients. In AusDM, volume 121 of CRPIT, pages 23–30. Australian Computer Society, 2011.

[184] Mai Shouman, Tim Turner, and Rob Stocker. Integrating naive Bayes and k-means clustering with different initial centroid selection methods in the diagnosis of heart disease patients. CS & IT-CSCP, pages 125–137, 2012.

[185] Shyamal Patel, Hyung Park, Paolo Bonato, Leighton Chan, and Mary Rodgers. A review of wearable sensors and systems with application in rehabilitation. Journal of NeuroEngineering and Rehabilitation, 9(21), 2012.

[186] Helena Bras Silva, Paula Brito, and Joaquim Pinto da Costa. A partitional clustering algorithm validated by a clustering tendency index based on graph theory. Pattern Recognition, 39(5):776–788, 2006.

[187] Neha Singh, Viness Pillay, and Yahya E. Choonara. Advances in the treatment of Parkinson's disease. Progress in Neurobiology, 81(1):29–44, 2007.

[188] C. Siristatidis, P. Vogiatzi, A. Pouliakis, M. Trivella, N. Papantoniou, and S. Bettocchi. Predicting IVF outcome: A proposed web-based system using artificial intelligence. In Vivo, 30(4):507–512, 2016.

[189] C. Siristatidis, A. Pouliakis, C. Chrelias, and D. Kassanos. Artificial intelligence in IVF: a need. Syst Biol Reprod Med, 57(4):179–185, 2011.

[190] S. J. Kaufmann, J. L. Eastaugh, S. Snowden, S. W. Smye, and V. Sharma. The application of neural network in predicting the outcome of in-vitro fertilization. Human Reproduction, 12(7):1454–1457, 1997.

[191] Deborah Leachman Slawson, Nurgul Fitzgerald, and Kathleen T. Morgan. Position of the Academy of Nutrition and Dietetics: The role of nutrition in health promotion and chronic disease prevention. Journal of the Academy of Nutrition and Dietetics, 113(7):972–979, 2013.

[192] Vaclav Snasel, Marcelo Alancar, Dorota Jeloneck, Abdel Badeeh Salem, Luigi Valanderau, Yacoub Saleh, Senthil Kumar, Evgenia Theodoutou, Randa El-Bialy, Mostafa A. Salamay, Omar H. Karam, and M. Essam Khalifa. Feature analysis of coronary artery heart disease data sets. Procedia Computer Science, 65:459–468, 2015.

[193] Xiuyao Song, Mingxi Wu, Christopher Jermaine, and Sanjay Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19(5):631–645, May 2007.

[194] Jyoti Soni, Uzma Ansari, Dipesh Sharma, and Sunita Soni. Intelligent and effective heart disease prediction system using weighted associative classifiers. International Journal on Computer Science and Engineering, 3(6):2385–2392, 2011.

[195] Sunita Soni and OP Vyas. Using associative classifiers for predictive analysis in health care data mining. International Journal of Computer Applications, 4(5):33–37, 2010.

[196] Sujatha Srinivasan and Sivakumar Ramakrishnan. Evolutionary multi objective optimization for rule mining: a review. Artificial Intelligence Review, 36(3):205–248, 2011.

[197] Beata Strack, Jonathan P DeShazo, Chris Gennings, Juan L Olmo, Sebastian Ventura, Krzysztof J Cios, and John N Clore. Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. BioMed Research International, 2014, 2014.

[198] G. Subbalakshmi, K. Ramesh, and M. Chinna Rao. Decision support in heart disease prediction system using naive Bayes. 2011.

[199] P. Suganya and C. P. Sumathi. Diet, nutrition, and the prevention of chronic diseases. 2003.

[200] P. Suganya and C. P. Sumathi. A novel metaheuristic data mining algorithm for the detection and classification of Parkinson disease. (8/), 2014.

[201] Sheng-Feng Sung, Cheng-Yang Hsieh, Yea-Huei Kao Yang, Huey-Juan Lin, Chih-Hung Chen, Yu-Wei Chen, and Ya-Han Hu. Developing a stroke severity index based on administrative data was feasible using data mining techniques. Journal of Clinical Epidemiology, 68(11):1292–1300, 2015.

[202] Latanya Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):571–588, 2002.

[203] Smitha T. and V. Sundaram. Classification rules by decision tree for disease prediction. (8/), 2012.

[204] Smitha T. and V. Sundaram. Classification rules by decision tree for disease prediction. (8/), 2012.

[205] Wan-Ting Tseng, Wei-Fan Chiang, Shyun-Yeu Liu, Jinsheng Roan, and Chun-Nan Lin. The application of data mining techniques to oral cancer prognosis. Journal of Medical Systems, 39(5):1–7, 2015.

[206] Asli Uyar, Ayse Bener, H. Nadir Ciray, and Mustafa Bahceci. ROC Based Evaluation and Comparison of Classifiers for IVF Implantation Prediction, pages 108–111. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.

[207] Deepti Vadicherla and Sheetal Sonawane. Decision support system for heart disease based on sequential minimal optimization in support vector machine. (8), 2013.

[208] Michal Valko, Gregory Cooper, Amy Seybert, Shyam Visweswaran, Melissa Saul, and Milos Hauskrecht. Conditional anomaly detection methods for patient-management alert systems. July 9, 2008.

[209] Michal Valko and Milos Hauskrecht. Distance metric learning for conditional anomaly detection. In FLAIRS Conference, pages 684–689, 2008.

[210] S Vijayarani and S Sudha. An efficient clustering algorithm for predicting diseases from hemogram blood test samples. Indian Journal of Science and Technology, 8(17), 2015.

[211] Elena Villalba, Dario Salvi, Manuel Ottaviano, Ignacio Peinado, Maria Teresa Arredondo, and A. Akay. Wearable and mobile system to manage remotely heart failure. IEEE Trans. Information Technology in Biomedicine, 13(6):990–996, 2009.

[212] Zhihua Wang, Hanjun Jiang, Kai Yang, Lingwei Zhang, Jianjun Wei, Fule Li, Baoyong Chi, Chun Zhang, Shouhao Wu, Qingliang Lin, and Wen Jia. Lifetime tracing of cardiopulmonary sounds with ultra-low-power sound sensor stick connected to wireless mobile network. In NEWCAS, pages 1–4. IEEE, 2013.

[213] Mohammad Waqialla, Riyad Alshammari, and Muhammad Imran Razzak. An ontology for remote monitoring of cardiac implantable electronic devices. In Computer, Communications, and Control Technology (I4CT), 2015 International Conference on, pages 520–523. IEEE, 2015.

[214] Mohammad Waqialla and Muhammad Imran Razzak. An ontology-based framework aiming to support cardiac rehabilitation program. Procedia Computer Science, 96:23–32, 2016.

[215] Walter C. Willett, Jeffrey P. Koplan, Rachel Nugent, Courtenay Dusenbury, Pekka Puska, and Thomas A. Gaziano. Prevention of Chronic Disease by Means of Diet and Lifestyle Changes.

[216] Jesse O Wrenn, Daniel M Stein, Suzanne Bakken, and Peter D Stetson. Quantifying clinical narrative redundancy in an electronic health record.

[217] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1–37, 2007.

[218] Xindong Wu, Xingquan Zhu, Gong-Qing Wu, and Wei Ding. Data mining with big data. IEEE Trans. Knowl. Data Eng, 26(1):97–107, 2014.

[219] Dongkuan Xu and Yingjie Tian. A comprehensive survey of clustering algorithms. Annals of Data Science, 2(2):165–193, 2015.

[220] Masaki Yamaguchi, C Kaseda, Katsuya Yamazaki, and Masashi Kobayashi. Prediction of blood glucose level of type 1 diabetics using response surface methodology and data mining. Medical and Biological Engineering and Computing, 44(6):451–457, 2006.

[221] Jing Yang, Jiarui Si, Xiaoxuan Gu, and Ouyan Shi. Fuzzy cluster analysis of Alzheimer's disease-related gene sequences. Engineering, 5(10):530, 2013.

[222] Zheng Rong Yang. Mining SARS-CoV protease cleavage data using non-orthogonal decision trees: a novel method for decisive template selection. Bioinformatics, 21(11):2644–2650, 2005.

[223] Huda Yasin, Tahseen A. Jilani, and Madiha Danish. Hepatitis C classification using data mining techniques. (3/), 2011.

[224] Duen-Yian Yeh, Ching-Hsue Cheng, and Yen-Wen Chen. A predictive model for cerebrovascular disease using data mining. Expert Syst. Appl, 38(7):8970–8977, 2011.

[225] Changwon Yoo, Luis Ramirez, and Juan Liuzzi. Big data analysis using modern statistical and machine learning methods in medicine. June 26, 2014.

[226] Naglaa Zayed, Abu Bakr Awad, Wafaa El-Akel, Wahid Doss, Tahany Awad, Amr Radwan, and Mahasen Mabrouk. The assessment of data mining for the prediction of therapeutic outcome in 3719 Egyptian patients with chronic hepatitis C. Clinics and Research in Hepatology and Gastroenterology, 37(3):254–261, 2013.

[227] R. Zhang, S. Pakhomov, B. T. McInnes, and G. B. Melton. Evaluating measures of redundancy in clinical texts. AMIA Annu Symp Proc, 2011.

[228] Bichen Zheng, Sang Won Yoon, and Sarah S Lam. Breast cancer diagnosis based on feature extraction using a hybrid of k-means and support vector machine algorithms. Expert Systems with Applications, 41(4):1476–1482, 2014.


[229] Qingqing Zheng, Fengyuan Zhu, Jing Qin, Badong Chen, and Pheng-Ann Heng. Sparse support matrix machine. Pattern Recognition, 76:715–726, 2018.

[230] J. Zheng, C. Ha, and Z. Zhang. Design and evaluation of a ubiquitous chest-worn cardiopulmonary monitoring system for healthcare application: a pilot study. Med Biol Eng Comput, 2016.

[231] Huiyu Zhou, Huosheng Hu, and Nigel D. Harris. Wearable inertial sensors for arm motion tracking in home-based rehabilitation. In Tamio Arai, Rolf Pfeifer, Tucker R. Balch, and Hiroshi Yokoi, editors, IAS, pages 930–937. IOS Press, 2006.