
Classification Models Based-on Incremental Learning Algorithm and Feature Selection on Gene Expression Data

Phayung Meesad 1, Sageemas Na Wichian 2, Unger Herwig 3, and Patharawut Saengsiri 4, Non-members

ABSTRACT

Gene expression data illustrates the levels at which DNA is encoded into proteins, such as those of muscle or brain cells. However, some abnormal cells may evolve from unnatural expression levels. Therefore, finding a subset of informative genes would be beneficial to biologists because it can identify discriminative genes. Unfortunately, the number of genes grows rapidly into the tens of thousands, which makes classification difficult through problems such as the curse of dimensionality and misclassification. This paper proposes classification models based on an incremental learning algorithm and feature selection on gene expression data. Three feature selection methods, Correlation based Feature Selection (Cfs), Gain Ratio (GR), and Information Gain (Info), are combined with the Incremental Learning Algorithm based-on Mahalanobis Distance (ILM). The experimental results show that the proposed models CfsILM, GRILM, and InfoILM not only reduce the dimensions from 2001, 7130, and 4026 to 26, 135, and 135, saving time and resources, but also improve the accuracy rates from 64.52%, 34.29%, and 8.33% to 90%, 97.14%, and 83.33%, respectively. In particular, CfsILM outperforms the other models on the three public gene expression datasets.

Keywords: Feature Selection, Incremental Learning Algorithm, Gene Expression, Classification

1. INTRODUCTION

The Central Dogma of Molecular Biology creates gene expression data, in which Deoxyribonucleic Acid (DNA) is translated into protein. However, some abnormal cells may evolve from unnatural expression levels.

Manuscript received on July 31, 2011; revised on December 1, 2011.
1 The author is with the Department of Information Technology, Faculty of Information Technology, KMUTNB, Bangkok, Thailand. E-mail: [email protected]
2 The author is with the Department of Applied Science and Social, College of Industrial Technology, KMUTNB, Bangkok, Thailand. E-mail: [email protected]
3 The author is with the Department of Communication Networks, Faculty of Mathematics and Computer Science, Fern University in Hagen, Germany. E-mail: [email protected]
4 The author is with the Division of Information and Communication Technology, Thailand Institute of Scientific and Technological Research, Pathum Thani, Thailand. E-mail: [email protected]

Nowadays, microarray techniques are used as a simple method to evaluate gene expression levels. The method applies hybridization of nucleic acids, which measures tens of thousands of gene expressions at the same time [1]. Nevertheless, the number of genes grows rapidly, which leads to the curse of dimensionality and misclassification. Thus, biologists need more time to calculate and search for discriminative genes.

Motivation for the incremental learning concept comes from the idea of a shape growing with each new observation; keeping continuous data is therefore the basic key to responding to new driving forces [2]. In contrast, the objective of conventional learning is to separate the data into training and testing sets: the training set is learned according to the target classes, the model is created through the training process, and the testing set is then used to evaluate the model. A drawback of this technique is the lack of a process for new learning, for example on streams of input data from the Internet [2],[3]. Since the number of genes depends on the microarray experiment and the data are very large, an incremental learning procedure is suitable for creating the classification model. Nevertheless, most genes are not related to diseases; hence, feature selection techniques can recognize the genes that have discriminative power. This not only finds a subset of genes but also decreases time consumption.

A basic method for gene selection is searching for the subset of genes with the greatest classification ability. Feature selection currently has two main techniques, the filter and wrapper approaches. The first considers the discriminative power of each gene one by one but does not consider the induction algorithm. The second is associated with an induction algorithm that determines the accuracy of a chosen subgroup of genes. In terms of time consumption, the wrapper approach takes longer than the filter method, although the correctness of the filter method is lower than that of the wrapper approach. In addition, feature transformation methods such as Singular Value Decomposition (SVD), Independent Component Analysis (ICA), and Principal Component Analysis (PCA) are not appropriate for gene selection because they cannot truly discard any dimension, do not safeguard against unrelated features, and are hard


to interpret with respect to the significance of genes [4].

Referring to the above, gene expression data contains a large number of features, while most algorithms have low efficiency when working with high-dimensional data. Hence, this paper proposes an incremental learning algorithm combined with a feature selection method on gene expression data, which provides a solution for developing biological dimension-reduction applications. The experimental results demonstrate the accomplishment of these classification methods: they can remove many dimensions and improve accuracy. This paper is organized as follows: Section 2 presents a summary of the literature review; in Section 3, the proposed method is explained in detail; the results of the experiment are presented in Section 4; finally, conclusions are given in Section 5.

The problem of analyzing high-dimensional data can be found in many domains, such as document clustering, multimedia, and molecular biology data. This problem is well known to be very sensitive to the curse of dimensionality: if several attributes or variables are not in the same direction, the data will be sparse, that is, some regions of the space may hardly be populated at all. These properties can lead to 1) expensive costs, such as memory and storage usage, and 2) difficulty in understanding the data, for instance when data of low and high dimension are interpreted as the same group. Therefore, feature selection techniques are beneficial for producing a good feature subset that makes classification algorithms effective and efficient. The objectives of feature selection are as follows [5]:

1. Reducing the feature set: predicting and forecasting results based on searching for the expected knowledge.

2. Decreasing learning time: several algorithms spend more time on classification when many attributes are involved.

3. Increasing the accuracy rate: noise and irrelevant features should be eliminated because they are correlated with prediction efficiency.

4. Keeping the average values of features between low and high the same.

2. LITERATURE REVIEW

2.1 Feature Selection (FS)

The key objectives of FS are selecting the best subset and reducing the number of features compared with the traditional dataset. This method differs from feature extraction and transformation algorithms because it does not encode the selected features into another form. In contrast, feature extraction and transformation techniques create a new feature subset in a new format, which introduces the drawback that specialists find the result hard to interpret physically. FS methods can fall into supervised and unsupervised learning, as shown in Fig. 1.

In general, feature selection in supervised learning cooperates with a searching method for exploring the space of feature subsets. On the other hand, the objective of unsupervised feature selection is less clear; two issues of the unsupervised problem are prior knowledge about the number of clusters and the complexity of finding a subset of each feature. Nevertheless, this paper focuses only on the supervised learning feature selection approach.

2.1.2 Filtering Approach

As mentioned before, the filtering approach tries to remove irrelevant features from the original feature set before sending it to the learning algorithm. Typically, the traditional dataset is analyzed to identify an influential subset of dimensions used to describe and relate to its structure. Hence, the filtering process does not depend on the efficiency of the classification algorithm; examples include Correlation Based Feature Selection (Cfs), Information Gain (IG), Chi-Square, and Gain Ratio (GR).

2.1.3 Wrapper Approach

In other words, the relevant feature subset is chosen using the power of a classifier algorithm. In that case, the searching method of the wrapper is based on the feature space, and subset values are evaluated using the estimated accuracy of the predicted classification. The goal of the method is to find the feature subset closest to the criterion, for example Forward Selection (FS), Backward Elimination, Genetic Search, and Simulated Annealing.

2.1.4 Embedded Approach

The Embedded Approach integrates the classification algorithm and the feature selection technique, classifying and searching for features at the same time. For example, the Random Forest technique is based on decision trees and assigns scores to crucial features.

However, combining the filter and wrapper approaches is very interesting in the gene expression data domain; for example, a comparison of hybrid feature selection models is proposed by [6]. This research consists of four steps: 1) ranking feature subsets using the filter approach; 2) sending the result of the first step into the wrapper approach based on the Support Vector Machine (SVM) classifier algorithm; 3) creating hybrid feature selection models such as CFSSVMGA and GRSVMGS; and 4) comparing the accuracy rates of the hybrid feature selection models.

In the first place, feature selection on gene expression data was performed using ranking systems such as the difference of means and t-statistics. T-statistics has higher efficiency than the difference of means, for which the variance is assumed to be equal. Significant Analysis of Microarray (SAM) is proposed by [7]. SAM goes beyond a range of restrictions that affirm genes: it produces an individual score for each gene based on the cooperation of gene expression, associated with the standard deviation of repeated evaluations.

Fig.1: Supervised and Unsupervised Learning of Feature Selection.

Currently, feature selection methods focus on the filter and wrapper approaches [8]. In the filter approach, the correlation technique is very simple and widely used in the gene selection domain. For instance, if the correlation value of each gene is higher than a threshold, these genes are integrated together; a gene ranking is then created using the correlation values, and the top-ranked genes are selected from it. In contrast, a gene whose correlation value is less than the threshold depicts the best gene. A correlation method combined with fuzzy clustering for classification is proposed in [9]; however, its disadvantage is how to define the number of clusters. Another feature selection method for gene expression data was proposed by [8], composed of two phases: 1) the Cfs algorithm is used to select genes; 2) Binary Particle Swarm Optimization (BPSO) is used to select features from the previous phase. Nevertheless, the parameter setting is very difficult for general users.

A gene boosting technique, which joins the filter and wrapper approaches, was proposed by [10]. First, the filter approach selects a subset of genes from the top ranks; next, the wrapper approach chooses from the output of the first step. The method stops when the accuracy rate on the training set is acceptable or the iterations are completed. On the other hand, an overfitting problem may occur when this technique is boosted. Gene expression data contains a large number of features and few instances; thus, [11] proposed integrating two techniques, BPSO and K-nearest neighbour (K-NN), evaluated by leave-one-out cross-validation (LOOCV). Nonetheless, the primary problem of BPSO is falling into local minima on a large dimension of genes.

2.2 Classification Model for Gene Expression Data

Fuzzy clustering by Local Approximation of Membership (FLAME) uses a different weight structure to assign the membership degree of each gene for clustering. Actually, the membership function of Fuzzy C-Means (FCM) is used to weight each gene for clustering, the factor being the similarity of a gene compared with the mean value of the cluster. In the FLAME algorithm, a gene is assigned a suitable membership degree using its neighbours' memberships, and this process is repeated until convergence [12]. However, it is very sensitive to parameter adjustment and the order of the input. An incremental combination of the filter and wrapper approaches is proposed by [13], consisting of two steps: 1) genes are evaluated using the Info method, and then 2) a ranked list is created using the wrapper method. Three algorithms are used in the process: Naïve Bayes, Instance-based Learner (IB1), and Decision Tree (C4.5). Discriminative genes are determined using high scores of the gene expression level. Classification of microarrays to nearest centroids (ClaNC) is proposed by [14], based on Linear Discriminant Analysis (LDA) and t-statistics values. ClaNC was later updated by [15] using suitable scores: the distance between each object and a centroid is evaluated using an approximate score, and the gene nearest the centroid is chosen using the Mahalanobis distance. A multi-objective strategy based on a genetic algorithm to select genes is proposed by [16]: since single-objective GASVM is not suitable when the number of genes is less than 1,000, the concept of multi-objective optimization (MOO) based on the relevance between many objectives and classes is applied instead, called MOGASVM.

Achieving a high accuracy rate with a classification algorithm alone is possible on several datasets, but the three examples below show that ignoring feature selection techniques may be a disadvantageous decision, because a typical gene expression dataset contains many irrelevant attributes that lead to misclassification. For example, on the original dimensions of three gene expression datasets, 1) Colon Tumor, 2) Leukemia, and 3) Lymphoma, the ILM algorithm achieved low accuracy rates of 64.52%, 34.29%, and 8.33%, respectively (Fig. 2).

Fig.2: The Accuracy Rate of the Traditional ILM Algorithm without Feature Selection Technique.


3. METHODS

3.1 Correlation Based Feature Selection (Cfs)

The concept of the Cfs technique is the relevance of features to the target class based on a heuristic operation [17]. The method was created by Mark Hall in 1999. The evaluation process of the Cfs technique favours feature subsets that are highly related to the target class while ignoring features that are highly inter-correlated. The Cfs equation is represented in (1):

$$M_s = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \quad (1)$$

where $M_s$ is the heuristic "merit" of a feature subset $S$ containing $k$ features, $\overline{r_{cf}}$ is the average feature-class correlation ($f \in S$), and $\overline{r_{ff}}$ is the mean feature-feature inter-correlation.
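To make the merit computation concrete, the following is a minimal Python sketch of equation (1). The function name cfs_merit and the use of Pearson correlation as the feature-class and feature-feature measure are illustrative assumptions, not the paper's implementation (Cfs as published evaluates correlations on discretized data).

```python
import numpy as np

def cfs_merit(X, y):
    """Heuristic merit of a feature subset, eq. (1):
    M_s = k * r_cf / sqrt(k + k(k-1) * r_ff).

    X: (n_samples, k) matrix holding only the candidate subset's features.
    y: (n_samples,) numeric class labels.
    Pearson correlation stands in for the correlation measure (a simplification).
    """
    k = X.shape[1]
    # average absolute feature-class correlation (r_cf)
    r_cf = np.mean([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(k)])
    # average absolute feature-feature inter-correlation (r_ff)
    if k > 1:
        pairs = [abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                 for i in range(k) for j in range(i + 1, k)]
        r_ff = np.mean(pairs)
    else:
        r_ff = 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)
```

A search procedure would call cfs_merit on candidate subsets and keep the subset with the highest merit.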

3.2 Information Gain (Info)

The Info algorithm is the most popular technique for finding node impurity, based on the crucial concept of the splitting power of a gene [18]. The Info algorithm decides on a feature via entropy estimation. The entropy at a given node t is given in (2):

$$\mathrm{Entropy}(t) = -\sum_{j} p(j|t)\,\log_2 p(j|t) \quad (2)$$

where $p(j|t)$ is the relative frequency of category $j$ at node $t$.

$$\mathrm{Gain} = \mathrm{Entropy}(t) - \sum_{i=1}^{k} \frac{n_i}{n}\,\mathrm{Entropy}(i) \quad (3)$$

The gain equation is depicted in (3), where the parent node $t$ is divided into $k$ partitions and $n_i$ is the number of records in partition $i$. A disadvantage is the bias of splitting, which can happen with many classes.
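As a minimal sketch of equations (2) and (3) in Python, the helper names entropy and information_gain below are illustrative, not from the paper; partitions are assumed to be passed as arrays of child-node labels.

```python
import numpy as np

def entropy(labels):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t), eq. (2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, partitions):
    """Gain = Entropy(parent) - sum_i (n_i/n) * Entropy(partition_i), eq. (3).

    labels: class labels at the parent node t.
    partitions: list of label arrays, one per child partition.
    """
    n = len(labels)
    weighted = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(labels) - weighted
```

For example, splitting ten samples into child partitions of seven and three labels returns the entropy drop achieved by that split.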

3.3 Gain Ratio (GR)

The GR technique remedies the problem of Info. The structure of the method is created using a top-down design. GR was developed by Quinlan in 1986 and is based on the evaluation of information theory: generally, if $P(v_i)$ is the probability of answer $v_i$, then the information, or entropy $I$, of the answer is given in [19]. SplitINFO, presented in (4), resolves the bias in Info; in (5), Info is adapted using the entropy of the partitioning (SplitINFO), so that higher-entropy partitioning is adjusted.

$$\mathrm{SplitINFO} = -\sum_{i=1}^{k} \frac{n_i}{n}\,\log \frac{n_i}{n} \quad (4)$$

$$\mathrm{GR} = \frac{\Delta_{\mathrm{info}}}{\mathrm{SplitINFO}} \quad (5)$$
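Equations (4) and (5) extend the same sketch, reusing entropy and information_gain from the previous block; split_info and gain_ratio are again hypothetical helper names, and at least two non-empty partitions are assumed.

```python
import numpy as np

def split_info(partitions, n):
    """SplitINFO = -sum_i (n_i/n) * log2(n_i/n), eq. (4).

    Base-2 logarithm is assumed here for consistency with eq. (2);
    partitions are assumed non-empty so every ratio is positive.
    """
    ratios = np.array([len(part) / n for part in partitions])
    return -np.sum(ratios * np.log2(ratios))

def gain_ratio(labels, partitions):
    """GR = delta_info / SplitINFO, eq. (5)."""
    si = split_info(partitions, len(labels))
    # guard against a degenerate single-partition split (SplitINFO = 0)
    return information_gain(labels, partitions) / si if si > 0 else 0.0
```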

3.4 An Incremental Learning Algorithm based on Mahalanobis Distance (ILM)

The objectives of ILM are incremental learning, covering supervised and unsupervised learning, hard and soft decisions [3][20], Mahalanobis distance measurement, and a Gaussian function [3]. ILM contains two steps, a learning step and a predicting step, as shown in Fig. 3. First, the learning step consists of two learning algorithms (covering supervised and unsupervised learning) and an incremental learning algorithm. The learning algorithm is used to create a classification model and a learning model using the Mahalanobis distance and a Gaussian membership function. A new target class label can always be learned by the incremental learning algorithm; thus, the system prototype is updated regularly. The prototype is constructed using two user-defined parameters: the largest degree of the membership function and the distance threshold of each target class label. The distance threshold is represented by dth, where 0 < dth < 1, and dm is the measured distance between an instance and a cluster. If dm > dth, the instance is dissimilar to the cluster; conversely, the instance is more similar to the cluster if dm ≤ dth. The learning model consists of system prototypes, which are the WP and WT weights for supervised and unsupervised learning, respectively.

Next, unknown data are input into the predicting step. Here, the similarity or dissimilarity between an unseen pattern and the system prototype is measured using the Mahalanobis distance. In the case of incremental learning, a new prototype is added if the distance of the unknown data is very high. The distance is calculated using the Mahalanobis distance and a Gaussian membership function. The soft decision characteristic is realized using a Mahalanobis Gaussian RBF to estimate the membership degree of each target class. The winning node is the one with the highest degree of membership; this is called a hard decision. The target class (WP) of the unknown data is determined using the WP label of the winning node. The distance between an input (p) and a target class is measured using the nonsingular covariance matrix (K) and the Mahalanobis distance (d), shown in (6) and (7).

$$K_{j,\mathrm{new}} = K_{j,\mathrm{old}} + a I_n \quad (6)$$

$$d = \sqrt{(p - W_p)^T K^{-1} (p - W_p)} \quad (7)$$

Fig.3: Structure of the ILM Algorithm.

The membership degree of each cluster is calculated by the Mahalanobis RBF, which is represented in (8). The maximum membership degree of the winner cluster is found by the fuzzy "OR" operator (e.g., the max operator) in (9).

$$\mathrm{mem}(p, W_p) = \exp\!\left(-\frac{(p - W_p)^T K^{-1} (p - W_p)}{2}\right) \quad (8)$$

$$\mathrm{winner} = \arg\max_i(\mathrm{mem}_i) = \arg\max_i(\mathrm{mem}_1 \vee \mathrm{mem}_2 \vee \ldots \vee \mathrm{mem}_n) \quad (9)$$
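The core ILM computations in equations (6) to (9) can be sketched in Python as follows. The variable names (W_p as a class prototype, K as the class covariance, the regularization constant a) follow the text, while the helper names and the default value of a are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def regularize_cov(K_old, a=1e-3):
    """Eq. (6): K_new = K_old + a*I_n keeps the covariance nonsingular."""
    return K_old + a * np.eye(K_old.shape[0])

def mahalanobis(p, W_p, K_inv):
    """Eq. (7): d = sqrt((p - W_p)^T K^{-1} (p - W_p))."""
    diff = p - W_p
    return np.sqrt(diff @ K_inv @ diff)

def membership(p, W_p, K_inv):
    """Eq. (8): Gaussian (Mahalanobis RBF) membership degree."""
    diff = p - W_p
    return np.exp(-(diff @ K_inv @ diff) / 2.0)

def winner(p, prototypes, K_invs):
    """Eq. (9): fuzzy OR (max operator) over membership degrees.

    Returns the index of the winning prototype and its membership degree.
    """
    degrees = [membership(p, W, Ki) for W, Ki in zip(prototypes, K_invs)]
    return int(np.argmax(degrees)), max(degrees)
```

In the incremental step described above, an instance whose distance to every prototype exceeds the threshold dth would trigger the creation of a new prototype instead of a winner assignment.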

4. EXPERIMENTS AND RESULTS

The combination of ILM and feature selection is used to validate the classification of gene expression data; hence, this research focuses only on ILM supervised learning methods. Firstly, three public gene expression datasets are prepared: Colon Tumor (62×2001), Leukemia (72×7130), and Lymphoma (96×4026), shown in Table 1. Secondly, the discriminative genes of each dataset are selected using the filtering approaches Cfs, GR, and Info. Thirdly, the output from the previous step is transferred to the ILM algorithm for classification. The evaluation is based on the accuracy rate. Fig. 4 represents the experiment design. The details of the operation are as follows (a code sketch of the pipeline appears after the list):

1. Data preprocessing: replace missing values and normalize the data. Two of the public gene expression datasets, Lymphoma and Leukemia, contain missing values; the Lymphoma dataset has more missing values than Leukemia, as shown in Fig. 5 and Fig. 6.

2. Filtering approach: select the top-ranked genes according to the three feature selection methods (Cfs, GR, and Info) on the Colon Tumor, Leukemia, and Lymphoma datasets, as shown in Table 2.

3. Training and testing datasets: each dataset is separated into two groups, training (50%) and testing (50%). The training and testing datasets are used for learning and prediction, respectively.

4. Learning model: transfer the output of the previous step (training dataset only) into the ILM algorithm to create a learning model. In this step the covariance matrix and the distance threshold are defined; the result of each filter approach combined with ILM gives the models CfsILM, GRILM, and InfoILM.

5. Prediction: evaluate the performance of CfsILM, GRILM, and InfoILM based on the accuracy rate of predicting the target class from the small subset of genes with discriminative power.

Table 1: Detail of Gene Expression Datasets.

Datasets      Instances   Attributes   Class Values
Colon Tumor   62          2001         Positive and Negative
Leukemia      72          7130         ALL and AML
Lymphoma      96          4026         GCL, ACL, DLBCL, GCB, NIL, ABB, RAT, TCL, FL, RBB, and CLL
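The five steps above could be wired together roughly as follows. This is a schematic Python sketch in which preprocess, select_features, and the ilm_model object (with fit/predict methods) are hypothetical stand-ins for the tools actually used (e.g., the Cfs/GR/Info filters and the ILM implementation), not the authors' code.

```python
import numpy as np

def preprocess(X):
    """Step 1: impute missing values with column means and min-max normalize."""
    col_mean = np.nanmean(X, axis=0)
    X = np.where(np.isnan(X), col_mean, X)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def run_experiment(X, y, select_features, ilm_model, seed=0):
    """Steps 2-5: filter, 50/50 split, learn, predict; returns accuracy."""
    X = preprocess(X)
    keep = select_features(X, y)       # step 2: e.g. a Cfs/GR/Info filter (assumed callable
    X = X[:, keep]                     #         returning the indices of selected genes)
    rng = np.random.default_rng(seed)  # step 3: 50% training / 50% testing split
    idx = rng.permutation(len(y))
    half = len(y) // 2
    tr, te = idx[:half], idx[half:]
    ilm_model.fit(X[tr], y[tr])        # step 4: build the ILM learning model
    acc = np.mean(ilm_model.predict(X[te]) == y[te])  # step 5: accuracy rate
    return acc
```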

Fig.4: Experiment Design for Classification Models Based on ILM Algorithm.


Fig.5: Missing Value of Lymphoma Dataset.

Fig.6: Missing Value of Leukemia Dataset.

Table 2: The Result of Three FS Approaches.

Datasets      Attributes   Cfs   GR     Info
Colon Tumor   2001         26    135    135
Leukemia      7130         75    874    874
Lymphoma      4026         269   1785   1785

4.1 The Result of Colon Tumor Dataset

The experimental results of feature selection on the Colon Tumor gene expression dataset are reported in Table 2: using the Cfs, GR, and Info feature selections, 26, 135, and 135 attributes were chosen, respectively. Combining feature selection with the ILM algorithm yields the models CfsILM, InfoILM, and GRILM, which achieve higher accuracy than the original ILM. As Fig. 7 shows, these techniques produce a higher accuracy rate than the traditional ILM technique.

4.2 The Result of Leukemia Dataset

Likewise, a comparison of the efficiency of the original ILM with CfsILM, InfoILM, and GRILM on the Leukemia dataset is depicted in Fig. 7. The numbers of features chosen using the Cfs, GR, and Info feature selections were 75, 874, and 874, as presented in Table 2. The performance of the integrated CfsILM, InfoILM, and GRILM models is shown in Fig. 7.

4.3 The Result of Lymphoma Dataset

For the Lymphoma dataset, combining feature selection with the ILM algorithm likewise gives higher efficiency than the original ILM, as represented in Fig. 7. The combined models CfsILM, InfoILM, and GRILM used 269, 1785, and 1785 attributes, respectively, as reported in Table 2.

Table 3: Time Consumption of the Classification Models (seconds).

Datasets      ILM         InfoILM    GRILM      CfsILM
Colon Tumor   218.51      0.98       0.94       0.05
Leukemia      81,479.80   87.77      86.58      0.35
Lymphoma      2,000.00    1,876.09   1,989.46   9.30
Average       27,899.44   654.95     692.33     3.23

4.4 Comparison of Accuracy Rate and Time Efficiency

The experimental results show that all classification models increased the accuracy rate over the traditional ILM, as can be seen in Fig. 7. Clearly, CfsILM selected the fewest attributes (Table 2) and achieved the highest accuracy rate (Fig. 7), making it appropriate for classifying gene expression data. On the other hand, GRILM yielded lower efficiency when a subset of genes was selected. This shows CfsILM to have high performance while GRILM has restricted capability. Nevertheless, many attributes of the three gene expression datasets were eliminated before being fed to the classification models; overall, GRILM and InfoILM still retained several dimensions of each dataset, whereas CfsILM selected only a few attributes.

In the case of time efficiency, the three classification models (CfsILM, InfoILM, and GRILM) save more time than the traditional ILM technique, as shown in Table 3. In particular, the average time of CfsILM is lower than that of the other classification techniques.

Fig.7: The Result of Accuracy Rate Based on Classification Models.


5. CONCLUSIONS

Genes can grow into the tens of thousands, but most of them carry no discriminative power. Therefore, finding a subset of informative genes is very important for biological processing. This paper proposed classification models based on an incremental learning algorithm and feature selection on gene expression data. Three feature selection methods, Cfs, GR, and Info, were combined with the Incremental Learning Algorithm based-on Mahalanobis Distance (ILM), and three public gene expression datasets were used in the experiments: Colon Tumor (62×2001), Leukemia (72×7130), and Lymphoma (96×4026). With feature selection, the classification models can eliminate many dimensions; for example, the feature subsets for Colon Tumor are decreased from 2001 to 26, 135, and 135 using CfsILM, GRILM, and InfoILM, respectively. Meanwhile, the accuracy rate of the classification models is higher than that of the original ILM.

For instance, the accuracy rates on the three datasets rise from 64.52%, 34.29%, and 8.33% to 90%, 97.14%, and 83.33%, respectively, with the CfsILM model. The experimental results therefore lead to the conclusion that the CfsILM classification model, joining Cfs and ILM, not only outperformed the other models but also achieved an average time consumption of only 3.23 seconds.

6. ACKNOWLEDGEMENT

This work received the whole-hearted support of Miss Maytiyanin Komkhao. Thank you very much for your source code and knowledge.

References

[1] Hongbo Xie, Uros Midic, Slobodan Vucetic, and Zoran Obradovic, Handbook of Applied Algorithms, John Wiley and Sons, Inc., New York, 2008, ch. 5.

[2] Douglas H. Fisher, "Knowledge Acquisition Via Incremental Conceptual Clustering," Machine Learning, vol. 2, no. 2, Kluwer Academic Publishers, Boston, pp. 139-172, 1987.

[3] M. Komkhao, S. Sodsee, and P. Meesad, "An Incremental Learning Algorithm Based-on Mahalanobis Distance for Unsupervised Learning," Proceedings of the 3rd National Conference on Computing and Information Technology (NCCIT2007), pp. 20-25, 2007.

[4] P. Lance, H. Ehtesham, and L. Huan, "Subspace Clustering for High Dimensional Data: A Review," SIGKDD Explorations Newsletter, vol. 6, pp. 90-105, 2004.

[5] Padraig Cunningham, Dimension Reduction, Technical Report UCD-CSI-2007-7, University College Dublin, pp. 1-17, 2007.

[6] P. Saengsiri, Sageemas Na Wichian, Phayung Meesad, and Herwig Unger, "Comparison of Hybrid Feature Selection Models on Gene Expression Data," Proceedings of the 8th IEEE International Conference on Knowledge Engineering, pp. 13-18, 2010.

[7] S. Mukherjee and S. J. Roberts, "A Theoretical Analysis of Gene Selection," Proceedings of the Computational Systems Bioinformatics Conference (CSB), pp. 131-141, 2004.

[8] Cheng-San Yang, Li-Yeh Chuang, Chao-Hsuan Ke, and Cheng-Hong Yang, "A Hybrid Approach for Selecting Gene Subsets Using Gene Expression Data," IEEE Conference on Soft Computing in Industrial Applications (SMCia'08), pp. 159-164, 2008.

[9] J. Jaeger, R. Sengupta, and W. L. Ruzzo, "Improved Gene Selection for Classification of Microarrays," 8th Pacific Symposium on Biocomputing, pp. 53-64, 2003.

[10] Jin-Hyuk Hong and Sung-Bae Cho, "Cancer Classification Incremental Gene Selection Based on DNA Microarray Data," IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pp. 70-74, 2008.

[11] Cheng-San Yang, Li-Yeh Chuang, Jung-Chike Li, and Cheng-Hong Yang, "A Novel BPSO Approach for Gene Selection and Classification of Microarray Data," IEEE International Joint Conference on Neural Networks (IJCNN2008), World Congress on Computational Intelligence, pp. 2147-2152, 2008.

[12] G. Kerr, H. J. Ruskin, M. Crane, and P. Doolan, "Techniques for Clustering Gene Expression Data," Computers in Biology and Medicine, vol. 38, no. 3, pp. 283-293, 2008.

[13] R. Ruiz, Jose C. Riquelme, and Jesus S. Aguilar-Ruiz, "Incremental Wrapper-based Gene Selection from Microarray Data for Cancer Classification," Pattern Recognition, vol. 39, pp. 2383-2392, 2006.

[14] R. Dabney, "Classification of Microarrays to Nearest Centroids," Bioinformatics, vol. 21, no. 22, pp. 4148-4154, 2005.

[15] Q. Shen, W.-M. Shi, and W. Kong, "New Gene Selection Method for Multiclass Tumor Classification by Class Centroid," Journal of Biomedical Informatics, vol. 42, pp. 59-65, 2009.

[16] M. Mohamad, S. Omatu, S. Deris, M. Misman, and M. Yoshioka, "A Multi-Objective Strategy in Genetic Algorithms for Gene Selection of Gene Expression Data," Artificial Life and Robotics, vol. 13, pp. 410-413, 2009.

[17] Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Doctor of Philosophy thesis, Department of Computer Science, The University of Waikato, New Zealand, pp. 69-71, 1999.

[18] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison Wesley, 2006, ch. 5.

[19] J. R. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, pp. 81-106, 1986.

[20] G. G. Yen and P. Meesad, "An Effective Neuro-Fuzzy Paradigm for Machinery Condition Health Monitoring," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 31, pp. 523-536, 2001.

Phayung Meesad received his M.S. and Ph.D. degrees in Electrical Engineering from Oklahoma State University, U.S.A. He is an assistant professor and the associate dean for academic affairs and research at the Faculty of Information Technology, King Mongkut's University of Technology North Bangkok. His research focuses on data mining, hybrid intelligent systems, optimization, and fuzzy logic algorithms.

Sageemas Na Wichian received her Ph.D. in Educational Research Methodology from Chulalongkorn University. She has been an educator for over ten years and has taught in universities in the areas of diversity, psychology, and research for information technology. She is currently working as a lecturer at King Mongkut's University of Technology North Bangkok, Thailand. Her research focuses on advanced research methodology and industrial and organizational psychology.

Herwig Unger teaches Mathematics and Information Technology at the University of Hagen, Germany. He has 17 years of experience in many areas of IT, e.g. simulation models, communication and networking, information management, grid computing, business intelligence, and SCM. His research focuses on communication and information distribution in networks, especially decentralized storage and data processing, which have featured in various of his research projects.

Patharawut Saengsiri received his Ph.D. in Information Technology from King Mongkut's University of Technology North Bangkok. He is currently working as an Information Technology Officer at the Thailand Institute of Scientific and Technological Research (TISTR), Thailand. He was a scholarship student of the Science and Technology Ministry of the Thai Government. He is interested in the development of an incremental hierarchical clustering algorithm for gene expression data.