
Computer Methods and Programs in Biomedicine 101 (2011) 44–61

journal homepage: www.intl.elsevierhealth.com/journals/cmpb

A novel data mining mechanism considering bio-signal and environmental data with applications on asthma monitoring☆

Chao-Hui Lee a, Jessie Chia-Yu Chen a, Vincent S. Tseng a,b,∗

a Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, ROC
b Institute of Medical Informatics, National Cheng Kung University, Tainan, Taiwan, ROC

Article info

Article history:

Received 29 August 2009

Received in revised form 24 February 2010

Accepted 29 April 2010

Keywords:

Data mining

Bio-signal analysis

Environmental factors

Patient monitoring

Asthma attacks

Abstract

Chronic asthma sufferers need to be observed constantly to prevent sudden attacks. To improve the efficiency and effectiveness of patient monitoring, we propose in this paper a novel data mining mechanism for predicting attacks of chronic diseases by considering both the bio-signals of patients and environmental factors. We propose two data mining methods, namely Pattern Based Decision Tree (PBDT) and Pattern Based Class-Association Rule (PBCAR). Both methods integrate the concept of sequential pattern mining to extract features of asthma attacks, and then build classifiers with the concepts of decision tree mining and rule-based methods, respectively. Besides the general clinical data of patients, we consider environmental factors, which are related to many chronic diseases. For experimental evaluation, we adopted a children's asthma allergic dataset collected from a hospital in Taiwan, together with environmental factors such as weather and air pollutant data. The experimental results show that PBCAR delivers 86.89% accuracy and 84.12% recall, and PBDT delivers 87.52% accuracy and 85.59% recall. These results indicate that our methods achieve high accuracy and recall in predicting chronic disease attacks. The readable rules of both classifiers can provide patients and healthcare workers with insights into essential illness-related information. The additional environmental factors in the input data are also shown to be valuable in predicting attacks.

☆ This paper is an extended version of the paper entitled “An Integrated Data Mining System for Patient Monitoring with Applications on Asthma Care” [20], by Vincent S. Tseng, Chao-Hui Lee, Jessie Chia-Yu Chen, which appeared in Proceedings of IEEE International Symposium on Computer-Based Medical Systems (CBMS), Finland, June 17–19, 2008.
∗ Corresponding author at: Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 701, Taiwan, ROC. Tel.: +886 6 2757575x62536; fax: +886 6 2747076.
E-mail address: [email protected] (V.S. Tseng).
URL: http://idb.csie.ncku.edu.tw/tsengsm (V.S. Tseng).
0169-2607/$ – see front matter © 2011 Elsevier Ireland Ltd. All rights reserved.
doi:10.1016/j.cmpb.2010.04.016

1. Introduction

Chronic diseases are major illnesses and are increasingly becoming part of modern life. Irregular lifestyles, malnutrition, and congenital genetic problems are some of the major causes of chronic diseases. Medical technology has advanced rapidly in recent years, but predicting chronic disease attacks is still difficult. Patient monitoring is therefore a necessary part of treatment for these patients. However, with the increasing number of chronic cases, medical human resources are insufficient. In recent years, many medical institutions have tried to solve this problem by digitizing their medical equipment. Many physiological detection instruments are applied in patient monitoring, and doctors and healthcare workers can give a diagnosis based on the information detected by this equipment. Many chronic patients can therefore use these tools to keep their data for later treatment. However, this mechanism is not sensitive enough to detect sudden attacks of chronic diseases: patients understand their condition only after returning for medical treatment. For chronic diseases that can cause fatal attacks, this mechanism is clearly insufficient.

With the prevalence of the Internet, this problem was rapidly addressed. Many studies have linked patients and doctors through various Internet services. Patients upload their data or symptoms to a specific server, whereupon doctors can access and analyze the data from the server. These services allow doctors to take care of their patients more readily and to inform patients of crucial knowledge for preventing further attacks.

However, such systems still require doctors or medical experts to monitor constantly, and they cannot give real-time responses when attacks occur. Our motivation is to construct an automatic system that can warn patients through alarms and give basic medical instructions. Several researchers have built patient monitoring systems to predict attacks, but most of them considered only a single factor of the target disease. Under this type of consideration, pivotal information for diagnosis may be lost, and attacks may not be predicted or alerts given effectively. We believe that each kind of disease-related data contains information about attacks, and that the more information is available, the more reliable the analysis and mining become.
In addition, many related studies built systems for chronic diseases using various methods. Some predicted chronic disease attacks through statistical methods. Others analyzed environmental data and built a patient monitoring system by training a neural network. Apart from the attack prediction itself, these frameworks did not inform patients and healthcare workers about the patients' condition.

For example, the work of Stockes et al. [18] was based mainly on patient data. Numerous disease-related physiological profiles of patients were taken into consideration, and mathematical models were used to show a connection between these profiles and disease attacks, which could then be used to predict when attacks would occur. In [7], Cure assembled a multi-layer perceptron (MLP) neural network as the prediction module, partly using environmental factors. Both methods predicted attacks from different disease-related data and worked effectively. In terms of the integrity of the information, however, both considered the factors related to attacks separately. Their outputs provide only the prediction results, without any readable or beneficial information for patients or healthcare workers.

In this paper, we propose a novel data mining mechanism with applications to patient monitoring systems for chronic diseases such as asthma. We take a variety of factors related to chronic diseases into consideration, including both environmental factors and patient bio-signals. This design enables the system to produce highly accurate predictions of chronic disease attacks. In addition to predicting attacks, it also offers patients and healthcare workers comprehensive information for preventing attacks. Fig. 1 shows the architecture of our proposed mechanism, in which environmental pollutant data, climatic and atmospheric data, and all disease-related data are collected through the Internet and telecommunication. The system for chronic patient monitoring is then built by analyzing the collected data. Two methods are developed in this system, namely Pattern Based Decision Tree (PBDT) and Pattern Based Class-Association Rule (PBCAR). In both methods, the concept of sequential pattern mining is adopted for feature extraction as the first step. The classifiers of the two methods are then built with tree-based and rule-based mining methods, respectively. The classifier of PBDT explains each prediction result intuitively with a single tree path over the feature set, so the primary reasoning behind a prediction is straightforward to follow, which makes it suitable for easy diagnosis. The classifier of PBCAR shows all matched rules and the prediction result with its scoring method. The matched rules provide more facts about an illness and are appropriate for advanced studies by doctors and other medical experts.

Fig. 1 – System architecture.

In the experimental evaluation, we appraised our system with an asthma-related dataset, which consists of asthma allergic data, climatic and atmospheric data, and air pollutant data for the city where the patients reside. The patient clinical data were obtained from a hospital in southern Taiwan. The environmental data can be obtained from [24,25]. In the experiments, both PBDT and PBCAR achieved high accuracy and high recall on this dataset. The matched rules of their classifiers not only predict asthma attacks but also inform patients and healthcare workers of the patients' present condition. In addition, the experimental results show that accuracy and recall can be improved by taking into account more disease-related factors, including bio-signals and environmental factors.

The rest of this paper is organized as follows. Related work is discussed in Section 2. Section 3 describes the proposed bio-signal mining system in detail. Experimental evaluations are discussed in Section 4. Finally, Section 5 summarizes the key points of this research and proposes some possible future work.

2. Related works

Due to the rapid development of society and technology, medical services have become ever more universal. Vast numbers of patients can now be treated, but the increased pressure on existing medical resources has left traditional medical services unable to cope efficiently with this influx of patients. Digitalization of medical services is therefore crucial. In [21], Várady et al. described the process of digitalization as three stages. In the first stage, the Hospital Information System (HIS) integrates all medical information in the hospital, and doctors and nurses can access all medical data through the Patient Data Management System (PDMS). In the second stage, a Decision Support System is integrated with the HIS to assist doctors in analyzing patients' bio-signals. In recent years, the rapid development of technology has brought medical systems into the third stage, in which the systems integrate Internet technology and extend their service range to become Home-Care Monitoring Systems.

Asthma is a common chronic disease, especially in children, who cannot take care of themselves. The number of children with asthma in Taiwan grows by at least 1% per year. Many hospitals have digitalized their asthma monitoring systems to reduce their workload. This situation has initiated substantial asthma-related research. Some research concentrated on patients' DNA profiles. O'Leary et al. [15] addressed the allergens of asthma. Asthma has been detected from children's lung function using neural network approaches [3]. Finkelstein et al. [8] further assembled a communication platform between asthma sufferers and emergency rooms for advanced research.

In [10], Jan et al. constructed an Internet-based interactive tele-monitoring system for asthmatic children, named the Blue Angel Asthma Home-Care System. Through that system, they expected to enhance the generalization, convenience, and efficiency of the home-care tool. According to the description of that research, healthcare workers would inform asthma patients of basic medical information, for example, important symptoms of asthma and the usage of emergency asthma drugs. At the same time, patients were taught how to take care of themselves by using a peak expiratory flow (PEF) meter. In addition, a network platform was introduced, through which patients could upload their daily monitoring records and obtain asthma-related information. It was posited that the mechanism makes disease control more efficient and effective, and reduces the probability of asthma attacks.

Chu et al. [6] proposed a ubiquitous warning system for asthma inducement, which took the Global Positioning System (GPS) into consideration. Using the GPS data, the system could obtain and display daily averaged temperatures, humidity levels, and atmospheric pressure from the Central Weather Bureau [24] and air pollution reports from the Environmental Protection Administration, Executive Yuan [25], and build a decision module with an MLP neural network. Their system could therefore assist healthcare workers in making a diagnosis and help to locate patients quickly through GPS information when asthma attacks occur. However, the system did not take a patient's individual information into consideration, and could not offer patients any related alarm to prevent attacks. Applying statistical methodologies is also a popular way of processing medical data. In [18], Stockes et al. explored the importance of asthma bio-signals using statistical methods. As a result, they proposed the Entelos® Asthma PhysioLab™ System (EAPS), a module for asthma patients that considers molecular cells, organs, allergy symptoms, respiratory flow, and patients' drug reactions by applying mathematical functions and principles. Through this system, they further found a relationship between the immune system and asthma attacks.

In the field of data mining, association rule mining [1] and sequential pattern mining [2] are the major methods for discovering frequent relationships in a dataset. Association rule mining finds frequent connections among the item sets in a dataset. Under the same concept, sequential pattern mining mines the relationships between items or item sets in chronological order. The discovered patterns are usually able to represent features of a dataset. Decision tree mining [16,17], on the other hand, is an important data mining method for classification; many studies have addressed it and applied it in many applications. However, the types of datasets have changed quickly in recent years, as copious amounts of information need to be recorded. Many integrated mining methods have therefore been proposed for handling such datasets. In [13], classification was performed with the concept of association rule mining. It took the relationships between items as attributes for classification, which made the method more suitable for item-based data classification. For a similar purpose, Lesh et al. [12] applied the concept of sequential pattern processing to time series classification. Some advanced techniques have also been applied in medical studies, such as the application in an expert system [5] and the classification mechanism for electrocardiogram (ECG) data [19]. In this paper, we integrate several data mining techniques in order to demonstrate that our proposed method can provide an effective alert mechanism for attacks of chronic diseases such as asthma by considering more critical factors; further details are described in the following section. This is an extended paper of our preliminary work in [20].
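The difference between an unordered item set and an ordered sequential pattern can be illustrated with a minimal Python sketch. It is not the mining algorithm of [1] or [2]; it only computes the support of one candidate pattern, and the function names and toy records are illustrative assumptions.

```python
from typing import Iterable, List, Sequence, Set


def itemset_support(itemset: Set[str], records: List[Sequence[str]]) -> float:
    """Fraction of records that contain every item of `itemset`, in any order."""
    return sum(itemset <= set(record) for record in records) / len(records)


def is_subsequence(pattern: Sequence[str], record: Iterable[str]) -> bool:
    """True if the items of `pattern` occur in `record` in the same order (gaps allowed)."""
    remaining = iter(record)
    # `item in remaining` consumes the iterator up to the first match,
    # so each later item must be found after the previous one.
    return all(item in remaining for item in pattern)


def sequence_support(pattern: Sequence[str], records: List[Sequence[str]]) -> float:
    """Fraction of records that contain `pattern` as an ordered sub-sequence."""
    return sum(is_subsequence(pattern, record) for record in records) / len(records)


# Hypothetical toy records: daily symptom grades of three observation windows.
records = ["gyg", "ggy", "ygg"]

unordered = itemset_support({"g", "y"}, records)  # order of items is ignored
ordered = sequence_support("yg", records)         # "y" must occur before "g"
print(unordered, round(ordered, 3))
```

In this toy data every record contains both grades, so the item set is fully supported, whereas the ordered pattern is supported only by the records in which "y" actually precedes "g". This is the extra chronological information that sequential patterns carry.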

3. Integrated bio-signal data mining mechanism

In this research, we designed a mechanism to predict asthma attacks automatically by mining asthma physiological symptoms and environmental factors together. We assume that these data contain useful features of asthma symptoms. We therefore used data mining methods to discover features of asthma attacks and to construct a classification mechanism on those features, in order to establish an alarm system and to aid healthcare practitioners. For most patients with chronic diseases who need to be monitored, hospitals usually record their daily disease-related data, so that doctors or healthcare practitioners can make long-term observations and advanced studies to find more suitable treatments. These recorded data consist of patients' clinical data and their bio-signals, which are contiguously recorded values and can be regarded as time series data. In addition, the outbreaks of many chronic diseases are often associated with environmental factors. Asthma, for example, involves not only bronchial contraction but also a series of inflammatory reactions. This means that attacks can occur at any time when a patient's airway is stimulated continuously by environmental allergens. Consequently, our research took most environmental factors into account. As with patients' daily records, we regarded the long-term records of environmental factors as time series datasets. In the following sub-sections, the problem definition is described first. After the proposed system architecture is illustrated, the procedures for constructing the Pattern Based Decision Tree and for deriving the Pattern Based Class-Association Rules are presented.

3.1. Problem definition

To realize our mechanism, two major challenges must be overcome. The first is how to find or design suitable processing for these complex datasets, which consist of many contiguous records of different types. The second is to make our prediction mechanism for asthma attacks interpretable. We therefore applied format transformation to the different types of data to overcome these two challenges. Fig. 2 shows the basic structure of our problem. Precisely speaking, we want to know whether a patient will suffer an asthma attack in the near future when all of his or her asthma-related information is available, which comprises the patient's asthma allergic symptoms and the related environmental conditions. Under our assumption, the historic data of all asthma patients can be regarded as a knowledge base, and these data can be integrated into one large multivariate time series dataset in which each instance is the data of one patient.

Fig. 2 – Illustration of one patient's data.

Based on the descriptions above, we formulate our problem as follows. Consider a dataset $D$ in which all the related data of the monitored patients are collected. The data of each patient contain all the items mentioned above. Thus, $D$ is composed of the related datasets $\{d_c \mid c \text{ is an integer index of the related datasets in } D\}$, such as the datasets of weather, air pollution, and patient allergy. Each related dataset $d_c$ consists of time series data $\{T_{cr} \mid r \text{ is an integer index of the factors in } d_c\}$, where $T_{cr}$ represents the records of one factor in this dataset. The factors of a related dataset are generally recorded as time series, such as CO2 or NO in the air pollution dataset. Each factor $T_{cr}$ is formed by a sequence of contiguous recorded values $\{V_{cr1}, V_{cr2}, \ldots, V_{crm}\}$. Our purpose is to select some $T_t$ among all $T_{cr}$ as the target and to build a classifier $C$ for $T_t$ with the dataset $D$. For a chronic disease, $D$ contains all disease-related datasets $d_c$, such as the daily records of patients, weather conditions, and air pollutants; $T_{cr}$ represents one factor of a related dataset, such as the concentration of CO2 in the air pollutant dataset; and $T_t$ is a disease state or a diagnosis result for a patient. For asthma monitoring, $T_t$ is the daily data recording whether the patient suffered an attack on that day. In our proposed methods, we first extract classifiable sequence features $\{V_{q_1}, V_{q_2}, \ldots, V_{q_x} \mid 1 \le x \le m,\ cr1 \le q_x \le crm, \text{ and } q_{j-1} < q_j \text{ for } 2 \le j \le x\}$ from all $T_{cr}$. We then build a classification model for $T_t$ from these features using data mining methods, and $T_t$ can be predicted with the resulting model.
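The nesting of $D$, $d_c$, $T_{cr}$, and $T_t$ may be easier to follow in code. The Python sketch below is only an illustration under stated assumptions: the dictionary layout, the helper `sequence_feature`, the dataset and factor names, and all values are hypothetical and are not taken from the paper's data.

```python
from typing import Dict, List, Sequence

# D maps each related dataset d_c to its factors T_cr; every factor is a daily series.
Dataset = Dict[str, Dict[str, List[str]]]


def sequence_feature(series: Sequence[str], positions: Sequence[int]) -> List[str]:
    """Return the values at `positions`, which must be strictly increasing (q_{j-1} < q_j)."""
    if any(later <= earlier for earlier, later in zip(positions, positions[1:])):
        raise ValueError("positions must be strictly increasing")
    return [series[q] for q in positions]


# Hypothetical, already symbolized records of one patient over five days.
D: Dataset = {
    "air_pollution": {"CO2": ["low", "low", "high", "high", "low"]},
    "weather": {"temperature": ["warm", "warm", "warm", "cool", "cool"]},
    "patient_allergy": {"symptom": ["g", "g", "y", "r", "g"]},
}
T_t = [0, 0, 0, 1, 0]  # target series: 1 if an attack was recorded on that day

# One candidate sequence feature: the symptom values on days 1, 2 and 3.
feature = sequence_feature(D["patient_allergy"]["symptom"], [1, 2, 3])
print(feature)
```

Keeping every factor as a series of the same length as the target makes it straightforward to cut aligned segments from all datasets later on.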

3.2. System architecture

Fig. 3 shows the architecture of our proposed mechanism. Patients' bio-signal data and the environmental data of the patients' locations are considered in this mechanism. Through the preprocessing elements, these datasets are integrated and transformed, using medical and environmental knowledge, into a specific dataset for mining. Two alternative methods were then designed for the data mining segment. Finally, the disease-related diagnosis rules are discovered or the prediction model is built. These results can aid healthcare practitioners in disease diagnosis and can be used to construct an advanced predictive alarm mechanism.

Fig. 3 – Work flow of the two-phase design for the proposed mechanism.

Most research on disease prediction and prevention has focused only on physiological symptom data, and in general provides nothing more than the prediction results. However, the attacks of many chronic diseases are closely related to environmental factors. Patients' bio-signal data and environmental factors are therefore considered together in our mechanism. These data comprise daily physiological data, past disease attack records, daily climatic conditions, daily weather statuses, and daily records of air pollutant concentrations. Mining substantial disease-related information from these data allows the disease prediction to be built more completely. Moreover, our prediction model offers more thorough information and rationalization for the prediction results, and can be used as a reference in diagnosis or further advanced studies.

Traditional predictive methods often cannot be applied to a dataset as complex as the one considered in this paper. The dataset consists of time series data and attribute-based data, which are two different formats, and most values of the bio-signal data and environmental data are numerical. Consequently, we integrated a mining method for time series data, to extract data features, with a classification method for attribute-based data, to build our mechanism for disease attack prediction.

In order to reduce the problem complexity and to increase the medical significance and readability of the outputs, we symbolized these numerical values according to the meaning assigned to each value in related studies. For example, the concentration of the environmental factor CO is a numerical attribute, but its value can be symbolized into different levels of air pollution. This discretization of numerical values is convenient to carry out and makes the result of any analysis easier to comprehend. We therefore reviewed the related studies for each kind of data and chose appropriate policies for data preprocessing, such as common level definitions for some air pollutants. In our proposed mechanism, if no suitable expertise is available for a numerical factor, common data mining methods such as piecewise aggregate approximation (PAA) can be applied directly.

In addition, feature selection can be considered depending on the input datasets. In our proposed mechanism, the purpose of feature extraction is mainly to derive characteristics from multivariate time series datasets. All extracted features can be meaningful and can be applied directly in the subsequent classification mining. However, feature selection can also be performed after feature extraction under some conditions, such as a very large input dataset or a space limitation on the final classifier. We suggest choosing feature selection methods that do not destroy or rebuild features, such as Relief [11], information gain (IG) [9], or mutual information (MI) [4].

As shown in Fig. 3, we adopted a two-phase concept to perform the integrated data mining, consisting of a feature extraction phase and a rule mining phase. After preprocessing, the data may still differ in format. We extract all disease-related features in the feature extraction phase, after which the features can be rearranged into a specified format. Finally, the rules or the model for disease attack prediction are constructed from the features in the last part of the mechanism. Based on the proposed architecture, we present two policies for asthma attack prediction. The first is named Pattern Based Decision Tree (PBDT), which integrates sequential pattern mining and decision tree mining; the second is named Pattern Based Class-Association Rule (PBCAR), which merges sequential pattern mining and classification-by-association mining.

3.3. PBDT: Pattern Based Decision Tree

We designed a tree-based method, named PBDT, for asthma high-risk prediction, which follows the two-phase design and the architecture shown in Fig. 3. After data preprocessing, the concept of sequential pattern mining is adopted in the first phase to retrieve the representative sequences of the time series data as static features. In the second phase, the decision tree is built from these extracted features. The pseudo-code of the PBDT algorithm is shown in Fig. 4, and its application to all the asthma-related datasets is described in the following sub-sections.

3.3.1. Phase 1 of PBDT: feature extraction

Most of the prepared data are time series datasets, such as the asthma symptom dataset and the environmental datasets. These types of datasets are more difficult for users to comprehend and have to be processed by mining methods. The PBDT therefore segments all these data into two datasets. The first consists of all segments of the time series datasets with the risk of asthma attacks; the other comprises all segments of the time series datasets without the risk of asthma attacks. These two datasets are represented as Sa and Sn in the proposed algorithm in Fig. 4. Each segment of both datasets contains time series data for asthma symptoms and for environmental factors. It is expected that the characteristics of these two datasets differ noticeably. Therefore, after all segments in both datasets are processed, sequential patterns are extracted as features from each dataset. Lines 5–7 of the PBDT algorithm in Fig. 4 perform this extraction.

Fig. 4 – PBDT algorithm.

In the implementation of this phase of the PBDT, a parameter � is defined as the observation interval; it represents the number of days of observation. The value of � must be evaluated by experiments or suggested by healthcare practitioners. Through the patients' clinical data and diagnostic records, the date of each asthma attack can be identified, and the data within � days before an asthma attack are cut out as a high-risk segment. Each dataset, i.e., the asthma dataset, the weather dataset, and the pollution dataset, must be treated with the same extraction method so as to obtain the data intervals that contain high-risk features. Fig. 5 shows how high-risk segments of asthma attacks are cut from one patient's record in the asthma dataset with � = 5. The sequence in Fig. 5 is one symptom in the asthma allergic data. The symptom is recorded as “g”, “y”, or “r”, the asthma risk light grades used in general clinical diagnoses. Three high-risk segments before attacks are extracted in this example. After the extraction of high-risk segments, normal segments also need to be retrieved as a control dataset. First, all high-risk segments and the time points of all patients' asthma attacks are removed from the asthma dataset. The remaining data are then cut into contiguous sequences of length � as normal segments using a sliding window approach. Fig. 6 illustrates how normal segments are obtained from the asthma dataset with � = 5.

Fig. 5 – The extraction of high-risk segments.

Fig. 6 – The extraction of normal segments.

After the above procedures, the datasets of high-risk and normal segments are prepared, and sequential pattern mining is performed on these two segment sets to find their high-frequency sub-sequences. Fig. 7 depicts a simple example of sequential pattern mining on normal segments, in which frequent sub-sequences such as “g” and “gyg” are discovered. These high-frequency sub-sequences are asthma-related features and have the following format:

High-risk pattern P: {allergic symptom i, allergic symptom j, allergic symptom k}.
Normal pattern N: {allergic symptom x, allergic symptom y}.

High-risk patterns represent frequent value chains in the high-risk segments, which are extracted before asthma attacks. Normal patterns represent frequent value chains in the normal segments.

Most research on chronic diseases has focused on the accessible bio-signals of patients, yet many environmental factors that could affect chronic disease attacks are ignored. The major goal of this study is to predict asthma attacks.
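The segment extraction illustrated in Figs. 5 and 6 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the PBDT pseudo-code of Fig. 4: the function names are hypothetical, the sliding window is assumed to advance one day at a time, windows that would start before the first recorded day are skipped, and the 14-day record is invented for illustration.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # half-open day range [start, end)


def high_risk_intervals(attack_days: Sequence[int], interval: int) -> List[Interval]:
    """One window of `interval` days ending just before each attack day."""
    return [(day - interval, day) for day in sorted(attack_days) if day - interval >= 0]


def normal_intervals(n_days: int, attack_days: Sequence[int],
                     high_risk: Sequence[Interval], interval: int) -> List[Interval]:
    """Slide a window (step of one day, an assumption) over days that are neither
    attack days nor part of a high-risk window."""
    excluded = set(attack_days)
    for start, end in high_risk:
        excluded.update(range(start, end))
    return [(start, start + interval)
            for start in range(n_days - interval + 1)
            if excluded.isdisjoint(range(start, start + interval))]


# Hypothetical 14-day record of one patient with an attack on day 6.
symptom = "gggyyrrggggggg"
temperature = [28.1, 28.4, 28.4, 27.8, 28.0, 25.3, 24.9,
               26.0, 26.5, 27.0, 27.2, 27.5, 27.9, 28.0]
attacks = [6]
interval = 5  # observation interval in days

risky = high_risk_intervals(attacks, interval)
normal = normal_intervals(len(symptom), attacks, risky, interval)

# The same intervals slice every factor, so segments stay aligned across datasets.
risky_symptoms = [symptom[s:e] for s, e in risky]
risky_temperatures = [temperature[s:e] for s, e in risky]
print(risky, normal, risky_symptoms)
```

Returning day intervals rather than the sliced values is a deliberate choice in this sketch: the intervals found on the asthma dataset can be reused unchanged on the weather and pollution series, which matches the requirement that every dataset be cut with the same extraction method.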

Table 1 – The time intervals of all segments.

Time interval High risk?

Segment 1 patient 1: d11, d12, . . ., d1� Y...

......

Segment k patient e: dk1, dk2, . . ., dk� Y

In this phase, a disease classifier was built with all theextracted information from the last phase. After phase 1 ofthe PBDT, the all asthma related high-risk and normal frequentfeatures were prepared. Firstly, the high-risk and normal seg-

Fig. 7 – The feature extraction of normal segments.

PBDT takes all accessible environmental factors of a patient’sresidence and immediate surroundings into consideration. Allenvironmental factor data are numerical, and the PBDT cansymbolize each value to simplify the following procedures inthe preprocessing stage. Subsequently, the data should be cutinto high-risk and normal segments following the setting ofthe parameter �. In order to extract environmental factor seg-ments corresponding to the asthma related features, all timeintervals of high-risk and normal segments of each patient inasthma dataset must be collected firstly. Then, all data in thesetime intervals of high-risk segments are extracted as high-risksegments for each environmental factor dataset and normalsegments are also obtained using the same processing man-ner. Finally, two segment sets are prepared respectively foreach environmental factor. There are numerous environmen-tal factors, such as daily averaged temperature, and the dailyconcentrations of carbon dioxide, so we can get high-risk andnormal datasets for each environmental factor. Fig. 8 showsa high-risk segment of a daily averaged temperature datasetis extracted by matching the time interval of a high-risk seg-
ment of the asthma dataset. The value sequence {28.4, 28.4,27.8, 28, 25.3}which represents the temperature situations canbe accessed before this patient suffer from the asthma attack.
Fig. 8 – Extraction of a high-risk segment from dailytemperature data.

Segment (j + 1) patient 1: d(j+h)1, d(j+h)2, . . ., d(j+h)� N...

......

Segment (j + h) patient e: d(j+h), d(i+h)2, . . ., d(j+h) N

According to the design of phase 1, the PBDT performs sequential pattern mining to discover the features of each environmental factor in its high-risk and normal segment sets. Thus, highly frequent sub-sequences of the high-risk segments and normal segments of each environmental factor can be extracted, and they have the following format:

Pattern Pe from high-risk segments in environmental factor M: {state i, state j}.
Pattern Ne from normal segments in environmental factor M: {state x, state y, state z}.
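To make the notion of a frequent sub-sequence concrete, the following is a minimal brute-force sketch, not the sequential pattern mining algorithm used by the authors. It assumes that a pattern matches a segment when its states occur in the same order, possibly with gaps, and that support is the fraction of segments containing the pattern; the names `is_subsequence`, `frequent_patterns` and `max_len` are hypothetical.

```python
from itertools import combinations

def is_subsequence(pattern, sequence):
    """True if the states of pattern occur in sequence in the same order (gaps allowed)."""
    it = iter(sequence)
    return all(state in it for state in pattern)

def frequent_patterns(segments, min_support, max_len=3):
    """Enumerate candidate sub-sequences and keep those whose support reaches min_support."""
    candidates = set()
    for seg in segments:
        for n in range(1, max_len + 1):
            candidates.update(combinations(seg, n))
    result = {}
    for pattern in candidates:
        support = sum(is_subsequence(pattern, seg) for seg in segments) / len(segments)
        if support >= min_support:
            result[pattern] = support
    return result

# Hypothetical symbolized high-risk segments of one environmental factor.
high_risk_segments = [
    ["up", "up", "down"],
    ["up", "down", "flat"],
    ["flat", "up", "down"],
]
patterns = frequent_patterns(high_risk_segments, min_support=1.0)
print(patterns)
```

Exhaustive enumeration is only practical for short segments such as the few-day windows considered here; a dedicated sequential pattern mining algorithm would be needed for longer sequences.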

All features of each asthma related dataset are now prepared; these datasets comprise the patient clinical data, asthma symptom dataset, weather factor dataset, and air pollution dataset. The features are integrated into an indexed table for the second phase of the PBDT. Note that two important data tables must be generated. The first table contains the time intervals of high-risk segments and normal segments; all high-risk and normal segments of patients are identified by these time intervals. Table 1 presents a sketch of this table. The second table lists all features extracted by sequential pattern mining from the high-risk and normal segments of each sub-dataset. A sketch of this table is shown in Table 2.

3.3.2. Phase 2 of PBDT: rule mining

Table 2 – All features of all asthma related factors.

| Dataset | Factors | The features of high-risk segments and normal segments |
|---|---|---|
| Asthma symptom data | Allergic (1–n) | Pattern Pa1, ..., Pattern Pan |
| Environmental data | Factor 1 (1–q) | Pattern Pe11, ..., Pattern Pe1q |
| | ... | ... |
| | Factor m (1–r) | ..., Pattern Pemr |

dx.doi.org/10.1016/j.cmpb.2010.04.016


Table 3 – An attribute table of all segments.

| | Factor X1: Pattern 1 | ··· | Factor X1: Pattern n1 | ··· | ··· | Factor Xm: Pattern nm | High risk? |
|---|---|---|---|---|---|---|---|
| Segment 1 | 0 | ··· | 0 | ··· | ··· | 1 | Y |
| ... | ... | ··· | ... | ··· | ··· | ... | ... |
| Segment j | 1 | ··· | 1 | ··· | ··· | 0 | Y |
| Segment (j + 1) | 0 | ··· | 1 | ··· | ··· | 0 | N |
| ... | ... | ··· | ... | ··· | ··· | ... | ... |
| Segment (j + h) | 1 | ··· | 1 | ··· | ··· | 1 | N |

ments of each factor data were transformed into the attribute table with their own frequency features. This processing is represented by lines 8 to 18 of the PBDT algorithm and makes it possible to perform traditional decision tree mining on datasets containing time series data. Subsequently, as in the last line of the PBDT algorithm in Fig. 4, decision tree mining is performed to obtain a classification model. The tree-based model can predict asthma attacks, and the features related to asthma attacks can be read from its matching paths. In practice, the process consists of the following two steps:

1. In the first step, a classification model is built with decision tree mining. The tables prepared in phase 1 of the PBDT must be integrated and transformed into an attribute table. Each segment contains many data sequences of different asthma related factors, and it is processed as a transaction in the attribute table. All features of all asthma related factors are attributes. The table keeps all connections between transactions and attributes, so that all categories and types of features and transactions can be accessed from it. Table 3 presents a sketch of the attribute table. The cell of segment p and factor pattern q is 1 when the data in the time interval of segment p contain the pattern q; otherwise, it is 0. The cells of the high-risk attribute follow the high-risk values of the segments.

2. After the attribute table is established, all relations between features, high-risk segments and normal segments are labeled. Therefore, decision tree mining can be performed on this table, with the “High risk?” attribute set as the target, to generate a decision tree. The decision tree is a binary tree with a structure like the one shown in Fig. 9. Each internal node of the tree represents one feature, and the two sub-trees of a node correspond to the conditions with the feature and without the feature, respectively. Every path in the tree leads to one result, so whether an unknown segment is high-risk can be predicted by tracing the tree with its feature set. Fig. 9 shows a sketch of the tree with more details.

Fig. 9 – Example of decision tree for asthma attack classification.
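The construction of the attribute table in the first step can be illustrated as follows. The sketch assumes that each segment is stored as a mapping from factor name to its symbolized sequence and that a pattern is contained when its states appear in order; the factor names, patterns and helper names are hypothetical and serve only as an example of the 0/1 encoding of Table 3.

```python
def contains(pattern, sequence):
    """True if the states of pattern occur in sequence in the same order (gaps allowed)."""
    it = iter(sequence)
    return all(state in it for state in pattern)

def attribute_row(segment, features):
    """One row of the attribute table: 1 if the segment contains the feature, else 0."""
    return [1 if contains(pattern, segment.get(factor, [])) else 0
            for factor, pattern in features]

# Hypothetical features from phase 1: (factor, pattern) pairs.
features = [
    ("temperature", ("up", "down")),
    ("PM10", ("high",)),
]

# Hypothetical segments, each with its "High risk?" label.
segments = [
    ({"temperature": ["up", "up", "down"], "PM10": ["low", "high"]}, "Y"),
    ({"temperature": ["flat", "flat", "up"], "PM10": ["low", "low"]}, "N"),
]

table = [attribute_row(seg, features) + [label] for seg, label in segments]
for row in table:
    print(row)
```

Any decision tree learner that accepts binary attributes and a class column could then be trained on `table`, with the last column as the target.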

According to the PBDT classifier, after a patient completes his or her daily routine records, the system obtains all related data of the last � days, which contain asthma allergic symptoms and environmental factors. By matching the sequential patterns, the feature set of these data is extracted. Subsequently, the “High risk” result of the next day is predicted by tracing the tree with this feature set. At each reached node, the sub-tree matching the condition of the feature set is chosen for the rest of the tracing. The tracing stops at a leaf node, and the label of that leaf node is the prediction result. Fig. 10 presents the flow chart of the prediction method of our proposed PBDT algorithm.
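The tracing procedure can be written down as a short loop. The sketch below assumes a binary tree whose internal nodes test the presence of one feature and whose leaves are class labels; the `Node` class and the example tree are hypothetical and are not taken from the trees reported in the paper.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Node:
    feature: str                          # feature tested at this internal node
    with_feature: Union["Node", str]      # sub-tree (or leaf label) if the feature is present
    without_feature: Union["Node", str]   # sub-tree (or leaf label) if it is absent

def predict(tree, feature_set):
    """Trace the tree with the feature set until a leaf label is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.with_feature if node.feature in feature_set else node.without_feature
    return node

# Hypothetical tree: leaves are "Y" (high risk) or "N" (normal).
tree = Node(
    feature="fever",
    with_feature="Y",
    without_feature=Node(feature="PM10 high", with_feature="Y", without_feature="N"),
)

print(predict(tree, {"PM10 high", "temperature down"}))
```

Because only one root-to-leaf path is followed, a prediction needs at most as many membership tests as the tree is deep.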

3.4. PBCAR: Pattern Based Class-Association Rule

In the PBDT, asthma attacks are predicted using traditional decision tree mining. However, [22] showed that the rule set of a decision tree can be covered by the rule set of an association-based method such as Classification Based on Associations (CBA) [13]. The reason is that decision tree mining adopts a greedy algorithm as its policy, so several rules must be omitted during tree building. Therefore, we enhanced the PBDT algorithm and proposed a new method, named Pattern Based Class-Association Rule (PBCAR), which integrates the concepts of sequential pattern mining and association-based classification. The system architecture of the PBCAR is based on the architecture of Fig. 3, and the corresponding algorithm is shown in Fig. 11. The two major phases of the PBCAR are described in detail in the following sections.

Fig. 10 – Asthma attack prediction with PBDT.

3.4.1. Phase 1 of PBCAR: feature extraction

In phase 1 of the PBCAR, the same processing as in phase 1 of the PBDT is performed. In practice, the same two tables are acquired as described for phase 1 of the PBDT. Therefore, phase 1 of the PBCAR algorithm is the same as phase 1 of the PBDT algorithm.

3.4.2. Phase 2 of PBCAR: rule mining

In this phase, a classifier is built following the concept of the CBA. Our PBCAR algorithm therefore transforms the whole dataset into a transaction dataset with the feature set F in lines 8 to 17 of Fig. 11. The CBA-like algorithm shown in lines 18 to 24 can then be performed on the transaction dataset directly. The support and the confidence for each class are counted for all generated item sets. Subsequently, redundant item sets are filtered out with a support threshold and a confidence threshold. The remaining item sets are collected as an association rule set, in which each rule has a score for its own class. Finally, the asthma attack prediction is performed using these scored association rules.

In practice, after all related data have been prepared in phase 1 of the PBCAR, a classifier is built with a CBA-like method in this phase. Several factors need to be defined beforehand, such as the thresholds, the scoring function and the scoring method for classification. We explain the whole method in the following four steps.

1. The two tables of phase 1 are integrated into a transaction table in this step. Table 4 shows the result, in which the features and the class label of each segment are listed. In this table, each segment is a transaction, and each feature is an item of the segment. To fill the table, every feature of every factor must be checked for containment in each segment. Each data segment consists of data sequences of different factors, and each data sequence may contain several features of its factor. Subsequently, the table collects all contained features of all factors as the item set of each segment. After the item sets of all transactions are collected, the high-risk attribute of each segment is appended to this transaction table.

Table 4 – A transaction based data table.

| | Features {Factor, Pattern} | High risk? |
|---|---|---|
| Segment 1 | {X1, P1}, {X1, P3}, . . ., {Xm, P4} | Y |
| ... | ... | ... |
| Segment j | {X2, P3}, {X3, P1}, . . ., {Xm−2, P1} | Y |
| Segment (j + 1) | {X6, P2}, {X1, P3}, . . ., {Xm−1, P3} | N |
| ... | ... | ... |
| Segment (j + k) | {X3, P1}, {X5, P3}, . . ., {Xm, P5} | N |
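Given a transaction table like Table 4, the counting of support and per-class confidence mentioned above can be sketched as follows. This is a brute-force illustration under stated assumptions, not the CBA-DMII tool used in the experiments: support is taken as the fraction of transactions containing the item set, confidence as the share of those transactions belonging to each class, and the function name, `max_items` limit and example data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def mine_class_itemsets(transactions, min_support, max_items=2):
    """Return {itemset: (support, {label: confidence})} for all frequent item sets."""
    n = len(transactions)
    labels = sorted({label for _, label in transactions})
    counts = Counter()        # itemset -> number of transactions containing it
    class_counts = Counter()  # (itemset, label) -> number of such transactions with that label
    for items, label in transactions:
        for size in range(1, max_items + 1):
            for itemset in combinations(sorted(items), size):
                counts[itemset] += 1
                class_counts[(itemset, label)] += 1
    result = {}
    for itemset, count in counts.items():
        support = count / n
        if support >= min_support:
            confidence = {lab: class_counts[(itemset, lab)] / count for lab in labels}
            result[itemset] = (support, confidence)
    return result

# Hypothetical transactions: a set of (factor, pattern) items plus the "High risk?" label.
transactions = [
    ({("X1", "P1"), ("X2", "P3")}, "Y"),
    ({("X1", "P1")}, "Y"),
    ({("X1", "P1"), ("X2", "P3")}, "N"),
    ({("X3", "P2")}, "N"),
]
itemsets = mine_class_itemsets(transactions, min_support=0.5)
for itemset, (support, confidence) in itemsets.items():
    print(itemset, support, confidence)
```

Each surviving item set keeps one confidence per class, which is the information the later scoring step relies on.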



2. In this step, the classifiable frequent item sets are extracted. Association rule mining is first adopted to extract the frequent item sets. After the frequent item sets are collected, the confidence of each item set for each class is counted. Finally, the classifiable frequent item sets are prepared, each of which is frequent and keeps a confidence for each class. The CBA method [13] can predict the class of a new transaction with these item sets. However, the classification policy of the CBA method is too simple to achieve good accuracy, so we further modify the classification method in the next step.

3. The purpose of this step is to build a scoring mechanism. Scoring methods are widely applied in rule-based classification methods. Basically, the CBA classifies a new case by the matched class-association rule that has the highest confidence for its result class. Though this method produces the result rapidly, it does not consider enough factors. In [14], Liu et al. enhanced the CBA and proposed a scoring method that performs classification by considering all rules. Following their concept, each rule keeps different scores for different classes. For each rule, the score of a class is defined as the confidence of that class multiplied by the support. Thus, in the PBCAR a new case is classified with the cumulated scores of all classes from the matched rules.

4. The scoring method of the PBCAR takes more factors into consideration, but there are still some irregularities. Most datasets for classification have an imbalanced distribution of the target attribute, especially disease prediction datasets. The asthma dataset also has this problem, that is, there are fewer high-risk (positive) cases than normal (negative) ones. The skewed dataset leads to a huge difference between the quantities of positive features and negative features. This biases the results of the scoring method toward the negative class and reduces the recall of the positive class. Therefore, we referred to the work in [14] and defined a balance parameter k to compensate for this imbalance in the two-class dataset. Fig. 12 shows the definition of the modified score function in the PBCAR. All normal scores of rules are divided by k, so that the final cumulated scores of the two classes are more appropriate for classification. The value of the parameter k generally needs to be set by experts or determined by experimental evaluation. We therefore conducted experiments to evaluate its value and its effects, which are discussed in detail in Section 4.

Fig. 11 – PBCAR algorithm.

Fig. 12 – Scoring function of class-association rules of the PBCAR.

After the four steps of building the classifier, the generated rule set appears as represented in Fig. 13. According to the patient’s daily record, the PBCAR gathers all data sequences of the related factor datasets in � days. The data sequences are transformed into a feature set, like a transaction in Table 4, by matching them against the features in Table 2. Then, the PBCAR checks the rule set of Fig. 13 and marks all matched rules. Finally, the PBCAR classifier accumulates the normal and high-risk scores of these matched rules and gives the patient the class with the higher score as the prediction result for the following day.

Fig. 13 – Class-association rule of the proposed PBCAR for classification.

Fig. 14 – Prediction model of the PBCAR.

If the second rule in Fig. 13 is marked, the patient’s asthma related data of the past � days contain a high temperature difference, a high PM10 value in the air pollutant data, and a fever. According to these features, if this rule has the highest confidence of all marked rules, the CBA predicts that the patient will suffer an asthma attack on the (� + 1)th day. Our PBCAR method, however, takes the scores of all marked rules into consideration; this rule in Fig. 13 merely contributes a score of 0.45 × 0.3/2 = 0.0675 for normal and a score of 0.45 × 0.7 = 0.315 for high-risk.
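Since Fig. 12 is not reproduced in the text, the score function can only be restated from the prose description; the notation below (sup, conf, R for the set of matched rules) is an assumption introduced for this restatement. For a rule r with support sup(r) and per-class confidences, and with balance parameter k:

```latex
\begin{align*}
\mathrm{score}_{\text{high-risk}}(r) &= \mathrm{conf}_{\text{high-risk}}(r)\cdot \mathrm{sup}(r), &
\mathrm{score}_{\text{normal}}(r) &= \frac{\mathrm{conf}_{\text{normal}}(r)\cdot \mathrm{sup}(r)}{k}, \\
S_{c} &= \sum_{r \in R} \mathrm{score}_{c}(r), &
\text{prediction} &= \arg\max_{c}\, S_{c}.
\end{align*}
% Worked example from the text: sup = 0.45, conf_normal = 0.3, conf_high-risk = 0.7, k = 2
\begin{align*}
\mathrm{score}_{\text{normal}} &= \frac{0.45 \times 0.3}{2} = 0.0675, &
\mathrm{score}_{\text{high-risk}} &= 0.45 \times 0.7 = 0.315 .
\end{align*}
```

The arithmetic of the worked example implies that k = 2 is used in that illustration, whereas the experiments use a much larger default value.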

These classification steps are shown as a flow chart in Fig. 14. The classification method differs from that of the PBDT in two respects: (1) input data must be transformed into the transaction format instead of the attribute format, and (2) more rules are matched. These differences make a PBCAR classifier fundamentally different from a PBDT classifier for users. In Fig. 14, the patient’s asthma related data of the last � days are transformed into an item set, which represents all matched features discovered in phase 1. These items may be allergic features or environmental ones. Subsequently, each rule generated by the PBCAR is checked to determine whether all its items are contained in the patient’s feature set. Finally, the prediction result of the patient status is the class label with the highest cumulated score over the matched rules.
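The matching and score accumulation described above can be sketched as follows. The rule representation (item set, support, per-class confidence), the feature names and the function name `classify` are assumptions for illustration; the first rule reuses the support and confidences of the worked example (0.45, 0.3 and 0.7) with k = 2.

```python
def classify(feature_set, rules, k):
    """Accumulate the class scores of all matched rules; normal scores are divided by k."""
    scores = {"high-risk": 0.0, "normal": 0.0}
    for items, support, confidence in rules:
        if items <= feature_set:  # a rule matches when all its items are in the feature set
            scores["high-risk"] += confidence["high-risk"] * support
            scores["normal"] += confidence["normal"] * support / k
    # On a tie, max() returns "high-risk" because it is listed first (an arbitrary choice here).
    label = max(scores, key=scores.get)
    return label, scores

# Hypothetical rule set; only the numbers of the first rule come from the text.
rules = [
    (frozenset({"temperature difference high", "PM10 high", "fever"}),
     0.45, {"normal": 0.3, "high-risk": 0.7}),
    (frozenset({"humidity up"}), 0.20, {"normal": 0.9, "high-risk": 0.1}),
]

patient_features = {"temperature difference high", "PM10 high", "fever", "SO2 low"}
label, scores = classify(patient_features, rules, k=2)
print(label, scores)
```

Unlike the single-path tree tracing of the PBDT, every rule is tested against the feature set, so the cost of one prediction grows with the size of the rule set.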

Both proposed methods first mine normal and high-risk sequential patterns as features. They then build different classifiers through the concepts of tree-based mining and rule-based mining, respectively. As a result, the data formats and the usage of the two classifiers offer different benefits. The PBDT follows the concept of decision tree mining, which tries to cover all cases in the training data with a few major attributes in the tree architecture. Decision tree mining therefore adopts a greedy algorithm to select a few rules for tree building. This policy makes the classification easy to perform and to comprehend, but it may lose considerable information for classification. On the other hand, the tree architecture needs less storage. It is suitable for embedding in mobile equipment, which has less memory but must give adequate and immediate alarms. In our PBCAR method, we adopted the concept of the CBA to build a class-association rule set. All possible features for classification are enumerated as rules. Therefore, when new data need to be classified, all rules are considered, and all matched rules determine the classification result through the scoring method. Although this policy makes the matched rules more complex, they offer more detailed information about a classification, especially for healthcare workers.

4. Experimental results

4.1. Data

In this work, we adopted a real asthma dataset to evaluate our proposed architecture and methods. The dataset contains two sub-datasets. The first one holds the patients’ asthma related data, which consist of asthma data and patients’ diagnostic records. The second one includes several environmental factors, namely weather data and air pollutant data. The first sub-dataset was collected by the Blue Angel 24 h Monitoring System of Tainan Allergic Asthma Infants Health Association [26]. Its asthma data contain the largest respiratory flow of each morning and evening, fever symptoms, daytime asthma symptoms, nighttime asthma symptoms, and the corresponding risk level of an asthma attack; the asthma symptoms include the allergic states of the eyes, nose and skin. The patients’ diagnostic records comprise patient id, age, height, asthma medicine instructions and doctors’ recommendations. Each daily record of the first sub-dataset therefore consists of 23 attributes, and all records are the daily conditions of 33 asthmatic children in 2005. With regard to the second sub-dataset, the adopted environmental factors were formed from air pollution data and weather data. The air pollution data were obtained from the dataset of the Environmental Protection Administration Executive Yuan, R.O.C. [25]. It contains 15 kinds of air pollutants, which are PSI (Pollutant Standards Index), PM10 (particulate matter of 10 µm or less), SO2, CO, O3, NO2, indicators of the quality of air pollutants, and daily concentration of pollutants reports. The weather data include the daily average temperature, the average relative humidity, and the daily highest and lowest temperature values; these four important attributes were collected by the Central Weather Bureau [24] in Tainan, Taiwan in 2005. Table 5 lists all attributes of our real datasets in detail.

According to the design of our proposed methods, we must symbolize all numerical values of the dataset to simplify the following operations.

Table 6 shows the relationship between patient respiratory flow and the risk level of asthma attacks. In the asthma dataset, there are two important indicators: (1) peak expiratory flow rate (PEFR) and (2) asthma medication dosage instructions. The PEFR is the value measured from a patient’s breath every morning using a peak expiratory flow meter. According to the patient’s physiological conditions and the proper normal respiratory flow rate for those conditions, patients’ circumstances can be classified into three levels, red, yellow and green lights, in the proposed mechanism. The light level and patients’ allergic symptoms indicate which medication instructions should be adopted. Therefore, we symbolized a patient’s physiological conditions by the values of these two indicators. Combinations of their possible values are represented by six categorical symbols, which we used to symbolize patients’ asthma statuses in our experiments.

Table 5 – Attribute list of our dataset.

| Sub-dataset | Category | Attributes | Count |
|---|---|---|---|
| Asthma allergic dataset | Patients’ diagnostic record | ssn, d ssn, age, reference, date | 5 |
| Asthma allergic dataset | Asthma symptoms | day record, day pefr, night record, night pefr, fever, night symptom, day symptom, nose symptom1, nose symptom2, nose symptom3, nose symptom4, eye symptom, skin symptom, grade, asthma medicine instructions, a rhinitis, a conjunctivitis, a dermatitis | 18 |
| Environmental dataset | Air pollutants | Date, PSI, indicators of the quality of air pollutants, daily concentration of pollutants reports, SO2, NO2, O3, CO, PM10, NO, NitroOxy, NHHC, HydraCarbon, HydraCarbon2, Temperature, pH | 16 |
| Environmental dataset | Weather data | Relative humidity (%), temperature (°C), absolute maximum temperature (°C), absolute minimum temperature (°C) | 4 |

Table 6 – PEFR value vs. its corresponding medication instructions and lights.

| PEFR value (peak expiratory flow rate) | Medication instruction | Light |
|---|---|---|
| The value is between 80% and 100% of personal optimal value or normal estimates. Variability: <20%. Meaning: the state of asthma is under control | Do not need medication | Green |
| The value is between 60% and 80% of personal optimal value or normal estimates. Variability: 20–30%. Meaning: asthma symptom: the night attack, activity reduction, cough, stridor, chest tightness | To use short-acting inhaled expansion of tracheal twice each 20 min. If there is no improvement of respiratory flow rate after using 20–60 min, please go for medical treatment as soon as possible | Yellow |
| The value is between 60% and 80% of personal optimal value or normal estimates. Variability: >30%. Meaning: the state of asthma would affect patient’s normal routines and activities | To use short-acting inhaled expansion of tracheal immediately. If the respiratory flow is still lower than 60% after using, please go for medical treatment as soon as possible | Red |

In the environmental dataset, the values are numeric. It is necessary to symbolize them to increase the efficiency of classification model building and the readability of prediction results. Therefore, we refer to the deputy index values of each air pollutant, which are defined by proficient practitioners. The symbolized values reduce the complexity of the data and are easily comprehensible. Table 7 shows a part of the deputy index values of air pollutants.

Table 7 – Concentration of air pollutants and deputy index values.

Pollutant Standards Index (PSI) and concentration of air pollutants

| PSI | PM10: the average value of 24 h (unit: µg/m³) | SO2: the average value of 24 h (unit: ppb) | CO: the maximum value of the average values of 8 h (unit: ppm) | O3: the maximum value in 24 h (unit: ppb) | NO2: the maximum value in 24 h (unit: ppb) |
|---|---|---|---|---|---|
| 50 | 50 | 30 | 4.5 | 60 | |
| 100 | 150 | 140 | 9 | 120 | |
| 200 | 350 | 300 | 15 | 200 | 600 |
| 300 | 420 | 600 | 30 | 400 | 1200 |
| 400 | 500 | 800 | 40 | 500 | 1600 |
| 500 | 600 | 1000 | 50 | 600 | 2000 |

Pollutant Standards Index (PSI) and health

| PSI | 0–50 | 51–100 | 101–199 | 200–299 | ≥300 |
|---|---|---|---|---|---|
| Effect | Good | Moderate | Unhealthy | Very unhealthy | Hazardous |

Table 8 – Level of daily extreme temperature difference.

| Extreme temperature difference (unit: °C) | Level |
|---|---|
| <2 | 0 |
| >2 and <3 | 1 |
| >3 and <5 | 2 |
| >5 and <8 | 3 |
| >8 and <13 | 4 |
| >13 and <21 | 5 |
| >21 | 6 |

Table 9 – All parameters for our conducted experiments.

| Parameter | Symbol | Default value |
|---|---|---|
| Observation interval | � | 5 |
| Balance parameter | k | 400 |
| Support of sequential pattern | P1 support | 0.01 |
| Support of CBA | PBC P2 Support | 0 |
| Confidence of CBA | PBC P2 Confidence | 10 |
| Rule number | PBC P2 RuleNum | 8000 |
| Item number | Rule ItemNum | 8 |

In addition to the air pollutant dataset, the environmental database contains a weather dataset. The weather dataset holds daily information, such as the extreme high and low temperatures, the averaged temperature and the humidity. Each value is labeled as “up”, “down” or “flat” according to its difference from the value of the previous day. Furthermore, the difference between the daily extreme high temperature and the daily extreme low temperature was computed. We call this value the daily extreme temperature difference and list its symbolization in Table 8.
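The symbolization of the environmental values can be sketched as below, using the PSI health bands of Table 7 and the levels of Table 8. Several details are assumptions: PSI values are treated as integers, a temperature difference lying exactly on a boundary of Table 8 (which the table leaves undefined) is assigned to the higher level, and the function names and the `tolerance` argument are hypothetical.

```python
from bisect import bisect_right

# Upper PSI bound of each health band in Table 7; 300 and above is "Hazardous".
PSI_BANDS = [(50, "Good"), (100, "Moderate"), (199, "Unhealthy"), (299, "Very unhealthy")]

def psi_effect(psi):
    """Map an (integer) PSI value to its health effect category."""
    for upper, effect in PSI_BANDS:
        if psi <= upper:
            return effect
    return "Hazardous"

# Boundaries between the levels 0..6 of Table 8, in degrees Celsius.
TEMP_DIFF_BOUNDS = [2, 3, 5, 8, 13, 21]

def temperature_difference_level(diff):
    """Map a daily extreme temperature difference to its level (boundary goes to the higher level)."""
    return bisect_right(TEMP_DIFF_BOUNDS, diff)

def trend(today, yesterday, tolerance=0.0):
    """Label a daily value as "up", "down" or "flat" relative to the previous day."""
    if today - yesterday > tolerance:
        return "up"
    if yesterday - today > tolerance:
        return "down"
    return "flat"

print(psi_effect(75), temperature_difference_level(6.0), trend(27.8, 28.4))
```

Applying `trend` to consecutive days of a segment turns a numeric sequence such as the temperature example of Fig. 8 into the state sequence on which the sequential patterns are mined.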

4.2. Experimental design

Several experiments were designed to evaluate our proposed methods, and they fall into two parts. In the first part, the experiments were designed to evaluate the parameters of our methods. We examined the impact of the length of the observation interval � on accuracy. Subsequently, we conducted experiments to determine a suitable value of the balance parameter of a PBCAR classifier. In the second part, the experiments were designed to verify the accuracy and performance of both methods. We first ran them on the whole asthma related dataset, which consists of environmental data and patient allergic data. Then, we examined them on the different asthma related datasets separately.

In all experiments, we adopted two indicators to measure the experimental results: accuracy and recall. Accuracy is the rate of data transactions that are classified correctly, while recall is the rate of data transactions within one specific class that are classified correctly. In the asthma case, we focused our recall evaluation on the high-risk class only. Fig. 15 shows the definitions of accuracy and recall with a confusion matrix.

Fig. 15 – Definitions of accuracy and recall.
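The two indicators can be computed directly from the actual and predicted labels. The sketch below follows the standard confusion-matrix definitions, with recall restricted to the high-risk class as stated above; the label values "Y" and "N" and the function name are assumptions, and the sample data are invented.

```python
def accuracy_and_recall(actual, predicted, positive="Y"):
    """Accuracy over all transactions and recall of the positive (high-risk) class."""
    pairs = list(zip(actual, predicted))
    correct = sum(a == p for a, p in pairs)
    true_positive = sum(a == positive and p == positive for a, p in pairs)
    false_negative = sum(a == positive and p != positive for a, p in pairs)
    accuracy = correct / len(pairs)
    positives = true_positive + false_negative
    recall = true_positive / positives if positives else 0.0
    return accuracy, recall

# Invented labels: "Y" marks a high-risk segment, "N" a normal one.
actual = ["Y", "Y", "N", "N", "N"]
predicted = ["Y", "N", "N", "N", "Y"]
accuracy, recall = accuracy_and_recall(actual, predicted)
print(accuracy, recall)
```

On an imbalanced dataset the two values can diverge strongly, which is why recall of the high-risk class is reported alongside accuracy.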

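Table 9 (continued).

| Parameter | Symbol | Default value |
|---|---|---|
| Pruning | – | No |
| Using short rule | – | Yes |
| Training rate | Training Rate | 70 |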

4.3. Evaluation of main parameters

Table 9 shows all parameters and their default values used in the experiments. The parameter � is the observation interval for asthma attack prediction. Its default value is set to five days following doctors’ suggestions, and its experimental evaluation is presented in Section 4.3.1. The second parameter k is the balance parameter, which is applied to reduce the impact of excessive negative class rules. How its default value was determined experimentally is shown in Section 4.3.2. The third to seventh parameters are the thresholds for both methods. The third parameter is the support of feature extraction, which ensures that the discovered features are frequent. The next two parameters are defined in the mining tool we adopted, named CBA-DMII [23] and originally proposed by Liu et al. [13]. The default values of these parameters were determined through many simulated experiments in order to ensure that the discovered rules of the PBCAR are significant for asthma attack prediction. The last parameter is the training rate of a dataset. As in the general data mining case, we set 70% of the whole dataset as the training portion and 30% as the testing portion. Our experimental evaluations also showed that this training rate produces the best result on the asthma dataset.

4.3.1. Evaluation of observation interval �

Since all related data we considered are time series datasets, the observation interval � for asthma attack prediction becomes an important factor. We discussed this issue with doctors, and they suggested that the observation interval � should be set to about five days. In other words, doctors consider that factors occurring more than five days earlier do not affect an asthma attack. Thus, we conducted this experiment to check whether the observation interval can be made longer in order to collect more features for asthma attack prevention. Fig. 16 presents the accuracy of asthma attack predictions under different observation interval lengths. The accuracy of our method decreases as the length of the observation interval increases. The likely reason is that a longer observation interval introduces more irrelevant information and makes the classification inaccurate. The accuracy is over 90% when the parameter is set to less than five days, and the accuracy curve goes down when � exceeds 5. Even though the experimental result with � = 3 is the best one, this interval may be too short to discover the complete reasons for asthma attacks. Discovering all possible features for asthma attack prevention is the purpose of our proposed mechanism. A longer � is therefore preferable, and the accuracy with � = 5 is acceptable. Consequently, we chose � = 5 as the default setting of the observation interval in our approach.

Fig. 16 – Accuracy results by varying observation interval �.

4.3.2. The experimental results of the balance parameter k of PBCAR

Our PBCAR method follows [14] in performing classification by a scoring method. In our dataset, the number of high-risk cases is much smaller than the number of normal cases. Therefore, the proper value of the balance parameter k of the scoring method must be evaluated carefully. Fig. 17 shows the accuracy and recall values for varying settings of k on the asthma dataset. The recall goes up with increasing k. The accuracy increases slightly when the value of k is small, but it stays almost the same throughout the experiment. Once k is set higher than 400, neither indicator changes any further. Consequently, the normal score is not important in the asthma dataset, and classification can be performed by the abnormal (high-risk) score alone. In summary, the features of normal segments are general features of the whole dataset, and they can also be extracted from abnormal segments. In contrast, we can predict an asthma attack when any abnormal feature appears in a record.

Fig. 17 – Accuracy results of our PBCAR by varying balance parameter k.

4.4. Comparisons of PBDT and PBCAR

The whole asthma related dataset consists of the weather, air pollutant and asthma allergic datasets. In this sub-section, we evaluate the recall and accuracy of both proposed methods on different types of datasets and endeavor to demonstrate that each factor dataset considered is useful for asthma attack prediction. In other words, we ran both methods on the weather dataset, the air pollutant dataset, the asthma allergic dataset and the asthma related dataset, which integrates all single factor datasets. Fig. 18 shows the accuracy and the recall for each dataset. Overall, the results on the integrated dataset are better than the results on the single factor datasets. The recall of the PBDT is an exception; we infer that the greedy architecture of the decision tree makes its result more easily confused and unstable when more information is input. We therefore discuss two major problems of these experimental results in the following. Firstly, we compare the classifiers of both proposed methods in detail to ascertain what makes the experimental results differ from the ideal inference. Secondly, we discuss why the results on the integrated dataset are not better than those on the asthma allergic dataset when using the PBDT method.

Fig. 18 – Experimental results of both proposed methods on different datasets.

In Fig. 18, the PBCAR shows 86.89% of accuracy and 84.12%of recall, and the PBDT shows 87.52% and 85.59% of themrespectively. As we can observe, the accuracy and the recall ofthe PBDT are slightly better than those of the PBCAR. This resultis contrary to what we expected. In general, the rule-basedclassifier, like the CBA, can take more information into con-sideration than the decision tree-based classifier and performbetter on doing classification. This concept is also describedclearly in [22]. Tree architecture and greedy algorithm makesdecision tree-based classifier lose some information. In thesimulated experiments, we found that the number of rules
resulted by the PBCAR is really more than the ones for the PBDT.Hence this condition may be caused by the applied dataset.Fig. 19 depicts the partial rules of the PBCAR classifier in whicheach rule consists of feature set and its support and confi-
dx.doi.org/10.1016/j.cmpb.2010.04.016


l rul
Fig. 19 – Partia
dence. Fig. 20 shows a classifier of the PBDT, which is in treearchitecture. The tree shown in Fig. 20 is skewed meaningsome features keep very high weights for classification andthat additional information would cause more classificationerrors.

4.4.1. Analysis of trees generated by PBDTIn this section, we discussed the experimental results of thePBDT in detail and try to discover the reasons for useless envi-
ronmental information. From studying the results of the PBDTpresented in Fig. 18, we observe two major conditions. Firstly,the accuracy of two environmental datasets is larger than 60%,but the recalls for high-risk are very low. This situation indi-
Fig. 20 – Conducted r

es of the PBC.

cates both environmental datasets are not particularly relatedto asthma attacks but can still provide some features aboutasthma attack prediction. Secondly, the result of asthma aller-gic dataset is better than the result of the integrated one. Ina generalized case, the classifier considers more related infor-mation and can perform better, but this condition shows somevarying results. We checked the detail of the classifier to findreasons for this occurrence. Fig. 21 shows the classifier of theasthma allergic dataset and the classifier of the integrated
one. We endeavored to compare the difference between bothclassifiers. There are just few differences in their sub-treesand they are located near the leaves. According to this evi-dence, we can infer that environmental factors contain some
information for asthma attack prediction, but their effect is too insignificant to make a great improvement in performance. With regard to recall, this is because the classifier of the asthma allergic dataset consists of high-risk features. In addition, we define the recall indicator based on the high-risk class. Thus, the second unexpected condition is possible.

Fig. 21 – Trees of two classifiers for asthma allergic dataset and integrated dataset.

The integrated bio-signal mining mechanism is designed for chronic diseases. Its purpose is to integrate diverse datasets to predict attacks of a chronic disease. The experimental evaluations demonstrated that our proposed design can perform well. The environmental data significantly enhanced the accuracy and the recall for asthma attack prediction. However, we adopted the daily weather and air pollutant records of a whole city as the environmental data of each patient. This kind of data may be too vague and imprecise to enhance the performance of the prediction system effectively. Sometimes, the conditions of the patient’s location are very different from the environmental conditions of a whole city; for example, a building with air conditioning. These cases deterred the mechanism from effectively enhancing the performance of the asthma attack prediction system. Overall, the proposed mechanism does perform effectively and efficiently by integrating more related information.

4.5. Application for asthma attack alert

As an application, we collaborated with Tainan Asthma Allergic Children Health Association [26] and applied our proposed mechanism on the Blue Angel Asthma Home-Care System. In this system, patients record their personal daily conditions through the web service of the system. The data in the previous five days are taken by the system for analysis, and then the possibility of asthma attacks on the next day is predicted. In addition, we provided patients who are likely to suffer from an asthma attack with helpful information alerting them of possible future attacks. According to doctors’ observations in real implementations, many patients felt that our proposed mechanism is really helpful and that the risk prediction can effectively help them to reduce the incidence of dangers.
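The alert workflow described above (take the records of the previous five days, then predict the risk for the next day) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the code of the Blue Angel system: the record format, the event labels, the toy rules, and the function names `last_n_days`, `is_subsequence`, and `predict_risk` are all hypothetical, and the simple rule matcher merely stands in for a classifier built from sequential patterns.

```python
from datetime import date, timedelta

WINDOW_DAYS = 5  # the text states that the previous five days are analyzed


def last_n_days(records, today, n=WINDOW_DAYS):
    """Return the daily event sets for the n days before `today`, oldest first.

    `records` maps a date to a set of event labels (hypothetical format).
    Days without a record yield an empty set.
    """
    days = [today - timedelta(days=k) for k in range(n, 0, -1)]
    return [records.get(d, set()) for d in days]


def is_subsequence(pattern, window):
    """True if each itemset of `pattern` is contained, in order, in some day of `window`."""
    pos = 0
    for itemset in pattern:
        # Skip days that do not contain every event of the current itemset.
        while pos < len(window) and not itemset <= window[pos]:
            pos += 1
        if pos == len(window):
            return False
        pos += 1  # the next itemset must match a strictly later day
    return True


def predict_risk(window, rules, default="low"):
    """Return the class of the highest-confidence rule whose pattern matches."""
    matches = [(conf, label) for pattern, label, conf in rules
               if is_subsequence(pattern, window)]
    return max(matches)[1] if matches else default


# Toy data, invented for illustration only.
records = {
    date(2010, 1, 1): {"cough"},
    date(2010, 1, 3): {"wheeze", "high_pm10"},
    date(2010, 1, 5): {"night_symptoms"},
}
rules = [  # (sequential pattern, predicted class, confidence)
    ([{"cough"}, {"wheeze"}], "high", 0.8),
    ([{"night_symptoms"}, {"cough"}], "high", 0.9),
]
window = last_n_days(records, date(2010, 1, 6))
print(predict_risk(window, rules))
```

Only the first toy rule matches the window (a cough followed on a later day by wheezing), so the sketch reports the high-risk class; with no matching rule it falls back to the default class.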

5. Conclusions and future works

In this paper, we proposed an integrated bio-signals mining mechanism, which considers patients’ clinical data and the environmental information where patients are domiciled. With respect to implementation, two methods named PBDT and PBCAR were proposed. Their mechanisms consist of two phases, which make processing these different datasets together possible. The PBDT integrates the concepts of sequential pattern mining and decision tree mining to build a decision tree with all asthma related features for asthma attack prediction. The tree architecture makes the prediction rules easier to understand for general users. In addition, the PBCAR merges the concepts of sequential pattern mining and class-association rule mining. It elicits the association rules between the prefix features of all asthma attacks of the integrated dataset. The PBCAR predicts risks with these association rules. The corresponding rules are more detailed and are suitable for medical healthcare practitioners for in-depth studies of chronic diseases. Through experimental evaluations, these two methods were shown to work effectively. The proposed mechanism evaluated the concentration of air pollutants, climatic changes in the weather, and patients’ asthma related allergies to provide an asthma attack prediction and some medication instructions instantaneously. In conclusion, these outputs help patients and doctors prevent asthma attacks effectively and efficiently. There are several novel characteristics in our proposed mechanism, which can be summarized as follows:

1. Method for integrating multi-format information: We proposed the PBDT and PBCAR methods, which integrate the concepts of traditional mining techniques. The designs make the implementation more intuitive and make our mechanism suitable for datasets with a complex format. At the same time, the results of our proposed mechanism are easier for users to apply and to understand.

2. Consideration of compound factors in patient monitoring: We not only discussed patients’ asthma symptoms, but also took the air pollutant database and the weather dataset into consideration. More useful information makes the proposed mechanism more comprehensive.

3. Accurate prediction mechanism: Through evaluation of all related data, our proposed methods can predict the probability of asthma attacks in the near future. Experimental results showed that our prediction mechanism is highly accurate. This mechanism evaluates patients’ daily health statuses from their daily records and effectively alerts patients to possible asthma attacks in the near future.

4. Provision of readable reference information: Our mechanism integrates several benefits of data mining methods. Thus, both methods can discover readable features and build classification rules. These classification rules are easy to read and can serve patients and healthcare workers as a reference for patients’ illnesses.
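Because the evaluation reports both accuracy and recall, with recall defined on the high-risk class, it may help to see how the two indicators differ. The sketch below is illustrative only: the class names `high` and `low`, the function name `accuracy_and_recall`, and the toy labels are assumptions and are not taken from the experiments.

```python
def accuracy_and_recall(y_true, y_pred, positive="high"):
    """Compute overall accuracy and the recall of the `positive` class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    # True positives: high-risk days that were predicted as high-risk.
    tp = sum(t == positive and p == positive for t, p in pairs)
    # False negatives: high-risk days that the classifier missed.
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = correct / len(pairs) if pairs else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Toy labels, invented for illustration only.
y_true = ["high", "high", "low", "low", "low"]
y_pred = ["high", "low", "low", "low", "high"]
acc, rec = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={acc:.2f}, recall={rec:.2f}")
```

Accuracy counts every correct prediction, whereas recall looks only at the days that were truly high-risk, which is why a classifier built from high-risk features can score well on recall without a matching gain in accuracy.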

For future work, we propose to explore and apply the proposed mechanism to other kinds of chronic diseases. In addition, we will take more chronic disease related factors into consideration, such as periodic weather changes and the relationships between the features of different factor datasets. Furthermore, we will take into account personal physiological profiles such as DNA data and eating habits of patients, with the aim of providing personalized preventive medical care.

Conflict of interest

None.

Acknowledgement

This research was supported by the Landmark Project of National Cheng Kung University, Taiwan, ROC.

r e f e r e n c e s

[1] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: The 20th International Conference on Very Large Databases, Santiago, Chile, September, 1994.

[2] R. Agrawal, R. Srikant, Mining sequential patterns, in: Eleventh International Conference on Data Engineering, 1995.

[3] M. Barua, H. Nazeran, P. Nava, B. Diong, M. Goldman, Classification of impulse oscillometric patterns of lung function in asthmatic children using artificial neural networks, Engineering in Medicine and Biology Society, 2005.

[4] B.V. Bonnlander, A.S. Weigend, Selecting input variables using mutual information and nonparametric density estimation, in: Proceedings of the International Symposium on Artificial Neural Networks (ISANN’94), 1994.

[5] T.H. Cheng, C.P. Wei, V.S. Tseng, Feature selection for medical data mining: comparisons of expert judgment and automatic approaches, in: Proc. 19th IEEE Int’l Conf. on Computer-based Medical Systems (CBMS’06), USA, June, 2006.

[6] H.T. Chu, C.C. Huang, Z.H. Lian, J.J.P. Tsai, A ubiquitous warning system for asthma-inducement, in: IEEE Sensor Networks, Ubiquitous, and Trustworthy Computing, 2006.

[7] O. Cure, Designing patient-oriented systems with semantic web technologies, in: IEEE Symposium on Computer-Based Medical Systems (CBMS), 2003, p. 195.

[8] J. Finkelstein, M.R. Cabrera, G. Hripcsak, Web-based monitoring of asthma severity: a new approach to ambulatory management, Proceedings of IEEE International Conference on Information Technology Applications in Biomedicine (1998) 139–143.

[9] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, The Journal of Machine Learning Research Archive 3 (2003) 1157–1182.

[10] R.L. Jan, J.Y. Wang, M.C. Huang, S.M. Tseng, H.J. Su, L.F. Liu, An Internet-based interactive telemonitoring system for improving childhood asthma outcomes in Taiwan, Telemedicine and e-Health (2006) 1–28.

[11] K. Kira, L.A. Rendell, A practical approach to feature selection, in: Proc. 9th International Workshop on Machine Learning, 1992, pp. 249–256.

[12] N. Lesh, M.J. Zaki, M. Ogihara, Mining features for sequence classification, in: Proc. 5th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, San Diego, CA, USA, 1999, pp. 342–346.

[13] B. Liu, W. Hsu, Y. Ma, Integrating classification and association rule mining, in: The Fourth International Conference on Knowledge Discovery and Data Mining, New York, USA, 1998.

[14] B. Liu, Y. Ma, C.K. Wong, P.S. Yu, Scoring the data using association rules, Applied Intelligence 18 (2) (2003) 119–135.

[15] M. O’Leary, D. Koolpiruck, W. Balachandran, J. Emberlin, R. Lewis, The role of electrostatic charge accumulated by respirable sized allergens with regard to thunderstorm asthma, in: Industry Applications Conference, 2005.

[16] J.R. Quinlan, Induction of decision trees, Machine Learning 1 (1) (1986) 81–106 (Kluwer Academic Publishers).

[17] S. Ruggieri, Efficient C4.5, IEEE Transactions on Knowledge and Data Engineering 14 (2) (2002) 438–444.

[18] C.L. Stokes, A.K. Lewis, T. Paterson, C.C. Leong, N. Defranoux, G. Bub Entelos, Asthma Physiolab: a dynamic, computer-based mathematical model of acute and chronic asthma, in: The First Joint BMES/EMBS Conference Serving Humanity, Advancing Technology, Atlanta, GA, USA, 1999.

[19] V.S. Tseng, L.C. Chen, C.H. Lee, J.-S. Wu, Y.-C. Hsu, Development of a vital sign data mining system for chronic patient monitoring, in: Proc. 2008 International Workshop on Intelligent Informatics in Biology and Medicine (IIBM-08), Spain, March, 2008.

[20] V.S. Tseng, C.H. Lee, J.C. Chen, An integrated data mining system for patient monitoring with applications on asthma care, in: Proc. of the 21st IEEE International Symposium on Computer-Based Medical Systems (CBMS 2008), Finland, June 17–19, 2008.

[21] P. Varady, Z. Benyo, B. Benyo, An open architecture patient monitoring system using standard technologies, IEEE Transactions on Information Technology in Biomedicine 6 (1) (2002).

[22] A. Veloso, W. Meira Jr., M.J. Zaki, Lazy associative classification, in: The Sixth International Conference on Data Mining, 2006, pp. 645–654.

[23] CBA-DMII, Developed by National University of Singapore. [http://www.comp.nus.edu.sg/~dm2/index.html], 2005.

[24] Central Weather Bureau, R.O.C. [http://www.cwb.gov.tw/], 2005.

[25] Environmental Protection Administration Executive Yuan, R.O.C. [http://edb.epa.gov.tw/], 2005.

[26] Tainan Asthma Allergic Children Health Association, R.O.C. [http://140.116.58.191/asthma/index.php], 2005.
