
Soft Computing
https://doi.org/10.1007/s00500-018-3122-0

METHODOLOGIES AND APPLICATION

Identification of topology-preserving, class-relevant feature subsets using multiobjective optimization

Sriparna Saha1 · Mandeep Kaur1

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Abstract
In the current work, a multiobjective-based feature selection technique is proposed which utilizes different quality measures to evaluate the goodness of a reduced feature set. Two different perspectives are incorporated in the feature selection process: (1) the selected subset of features should not destroy the geometric distribution of the sample space, i.e., the neighborhood topology should be preserved in the reduced feature space; (2) the selected feature subset should have minimal redundancy and high correlation with the classes. To capture the second goal, several information theory-based quality measures like normalized mutual information, correlation with the class attribute, information gain and entropy are utilized. To capture the first aspect, concepts of shared nearest-neighbor distance are utilized. A multiobjective framework is employed to optimize all these measures, individually and in different combinations, to reduce the feature set. The approach is evaluated on six publicly available data sets with respect to different classifiers, and the results conclusively demonstrate the potency of utilizing both types of objective functions in reducing the feature set. Several performance metrics like accuracy, redundancy and Jaccard score are used for measuring the quality of the selected feature subset in comparison with several state-of-the-art techniques. Experimental results on several data sets illustrate that there is no universal model (optimization of a particular set of objective functions) which performs well over all the data sets with respect to the different quality measures. But, in general, optimization of all objective functions (the PMCI model) consistently performs well for all the data sets.

Keywords Feature selection · Shared nearest-neighbor distance · Normalized mutual information · Correlation · Entropy ·Information gain · Multiobjective optimization · Jaccard score · Redundancy reduction

1 Introduction

Dimensionality reduction is an important stage in the preprocessing of large data sets (Liu and Yu 2005). It is the process of removing irrelevant and redundant features of the data set in order to decrease the computational complexity of performing some operations on the given data set. Selection of an optimal feature subset having cardinality d (d < D, where D is the cardinality of the original feature space) based on a predetermined evaluation criterion is termed feature selection. The search space for feature selection algorithms is of size 2^D. When

Communicated by V. Loia.

B Sriparna Saha
[email protected]

Mandeep Kaur
[email protected]

1 Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, India

D is large, it becomes a challenge for the underlying search technique to efficiently explore the search space to determine the optimal feature subset. Thus, the development of an efficient search technique to better traverse the search space is highly required. Heuristic-based or random search can help in such situations (Garg and Sharapov 2002).

Based on the evaluation criteria used in selecting the feature subset, feature selection algorithms can be broadly categorized into filter and wrapper methods. Filter models calculate the relevance of a feature subset by considering its intrinsic properties. Low-scoring features are removed from the optimal feature subset. These models are independent of classifiers and hence execute fast (Sánchez-Maroño et al. 2007). In wrapper models, some feature subsets are randomly selected and evaluated with the help of some classifier. The process sequentially adds and removes features from the subset with the objective of increasing the accuracy of a given


classifier. This approach is highly prone to over-fitting (Kohavi and John 1997).

Feature selection approaches can be further categorized into three types depending on the availability of labeled information during the search for the optimal feature subset: supervised, semi-supervised and unsupervised (Molina et al. 2002; Dash and Liu 1997). In recent years, some unsupervised feature selection approaches have become very popular. In Kundu and Mitra (2015), a shared nearest-neighbor-based distance measure was utilized for feature selection. An objective function was designed utilizing the shared nearest-neighbor distance to preserve the neighborhood information of samples in the reduced feature space. This objective function, along with the cardinality of the feature subset, is optimized simultaneously using the search capability of a multiobjective evolutionary algorithm. But this approach does not consider any objective function measuring redundancy among the features in determining the optimal feature subset; its objective lies in determining the optimal feature subset which preserves the pairwise sample similarity. In Bhadra and Bandyopadhyay (2015), some information theory-based measures are utilized in a differential evolution-based framework to determine the appropriate feature subset. But the preservation of sample similarity was not examined in this framework. ReliefF is another popular feature selection approach (Kira and Rendell 1992) utilizing the concept of nearest neighbors. Another well-known feature selection approach is SPEC (Zhao and Liu 2007), where a ranking strategy based on the eigenvectors of the pairwise similarity matrix is developed for selecting the optimal feature subset. But these algorithms consider the features individually, ignoring the correlation present between groups of features. To remove the associated problem, a sequential forward feature selection approach was developed by Zhao et al. (2013).

The current work attempts to combine the effectiveness of a structure-preserving objective function with some redundancy measures of the feature subset. Different feature quality measures, proposed by exploiting concepts of information theory, are available in the literature. These measures are capable of determining redundancy between features and correlation with respect to individual classes, which can further participate in selecting a good feature subset. In the current work, we have utilized several quality measures based on information-theoretic concepts like entropy, information gain, mutual information and correlation with respect to classes, along with an objective function defined using the shared nearest-neighbor-based distance to preserve the geometric structure of the original sample set in the reduced space. A large number of objective functions exploiting different feature quality measures are further optimized simultaneously using a multiobjective optimization (MOO) technique. For this purpose the popular multi-

objective evolutionary algorithm NSGA-II (Deb et al. 2002) is employed. A large set of experiments optimizing different combinations of objective functions is conducted on six data sets. Results are compared with those obtained by an existing feature selection technique (Kundu and Mitra 2015).

The rest of the article is organized as follows: a brief introduction to the preliminaries of MOO and the different feature subset quality measures is provided in Sect. 2. The proposed methodology of multiobjective-based feature selection is described in Sect. 3. Experimental results and a comparative study are described in Sect. 4. Finally, the article is concluded in Sect. 5.

2 Preliminaries

In this section, we briefly discuss some preliminaries of multiobjective optimization (Deb and Kalyanmoy 2001) and the different quality measures used to evaluate a feature subset.

2.1 Multiobjective optimization

Unlike single-objective optimization, an MOO algorithm optimizes a set of M conflicting objective functions over a vector x of z decision variables. MOO tries to minimize or maximize all these objective functions simultaneously. Finally, a set of trade-off solutions is generated, termed the Pareto optimal solution set. These are the solutions for which no objective can be improved without degrading at least one of the other objectives; such solutions are called non-dominated with respect to each other. Solution x1 dominates solution x2 if the following two conditions are satisfied:

1. The solution x1 is no worse than x2 in all M objectives.
2. The solution x1 is strictly better than x2 in at least one objective.

If either condition is not satisfied, solution x1 does not dominate solution x2. Solutions that are not dominated by any other solution constitute the Pareto optimal front.
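The two dominance conditions above translate directly into a small predicate. This is an illustrative sketch (the function name is ours, not from the paper); objective vectors are assumed to be minimized, matching the models used later:

```python
def dominates(f1, f2):
    """True if objective vector f1 dominates f2 (all objectives minimized).

    f1 dominates f2 when it is no worse in every objective and
    strictly better in at least one.
    """
    no_worse = all(a <= b for a, b in zip(f1, f2))
    strictly_better = any(a < b for a, b in zip(f1, f2))
    return no_worse and strictly_better
```

For example, `dominates([1, 2], [2, 3])` holds, while `[1, 3]` and `[2, 2]` are mutually non-dominated.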

2.2 Shared nearest-neighbor distance (Kundu and Mitra 2015)

Let X be a set of data points with |X| = N, and let s ∈ N+. NN_s(x) ⊂ X denotes the set of s-nearest neighbors of x ∈ X determined by any primary distance metric, e.g., Euclidean or cosine distance. The size of the set of common neighbors between data points x and y is denoted SNN_s(x, y) = |NN_s(x) ∩ NN_s(y)|.

This ‘overlap’ between the neighborhoods of two points x, y is an alternative similarity measure. A new similarity


measure based on this ‘overlap’ is proposed in Kundu and Mitra (2015), which is similar to the cosine of the angle between the zero-one set-membership vectors for NN_s(x) and NN_s(y).

Then, the secondary similarity measure based on overlap is defined as

simcos_s(x, y) = SNN_s(x, y) / s

The above measure can be termed a secondary similarity measure, as it is based on the rankings induced by a specified primary similarity measure.

Transforming this into distance form, we get

dacos_s(x, y) = arccos(simcos_s(x, y))

This distance is symmetric and satisfies the triangle inequality.
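The definitions of simcos_s and dacos_s above can be sketched in a few lines. This sketch assumes the s-nearest-neighbor sets have already been computed under some primary metric (the function name is illustrative, not from the paper):

```python
import math

def snn_distance(nn_x, nn_y, s):
    """dacos_s(x, y) computed from the s-nearest-neighbor sets of x and y.

    nn_x, nn_y: sets of the s nearest neighbors of x and y under
    any primary metric (e.g., Euclidean or cosine distance).
    """
    simcos = len(nn_x & nn_y) / s      # simcos_s(x, y) = SNN_s(x, y) / s
    return math.acos(simcos)           # dacos_s(x, y)
```

Identical neighborhoods give distance 0; disjoint ones give arccos(0) = π/2.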

2.3 Entropy

In information theory, the entropy of a random variable measures the impurity associated with it. Let X be a discrete random variable; its entropy is defined as

ENTROPY(X) = − ∑_{x ∈ X} p(x) log_b p(x)

Here p(x) denotes the probability mass function of X, and the base b is usually taken as 2.

2.4 Information gain

Information gain measures the goodness of a feature in discriminating the classes of a given training data set. It is defined as

INFO-GAIN(attr) = (entropy of the data set before the split) − (entropy of the data set after the split over the attribute attr).

2.5 Normalized mutual information

Mutual information (MI) measures the dependency between two random variables. Let X and Y be two discrete random variables; their mutual information is defined as

MI(X, Y) = ∑_{x ∈ X} ∑_{y ∈ Y} p(x, y) log_b ( p(x, y) / (p(x) p(y)) )

Mutual information between two variables satisfies the following properties:

(i) MI(X, Y) ≥ 0 (nonnegativity),
(ii) MI(X, Y) = 0 if X and Y are completely independent, and
(iii) MI(X, Y) = MI(Y, X) (symmetry).

When two variables are independent, the mutual information equals 0, but there is no upper bound on its value when the variables are completely dependent. For this reason, the normalized form of MI is considered in the current study, which further helps in the comparison process. There are several normalized forms of mutual information; in the current paper, the following form is used:

NMI(X, Y) = MI(X, Y) / √( ENTROPY(X) × ENTROPY(Y) )
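For discrete variables given as lists of observed values, the three quantities above can be estimated from empirical frequencies. A minimal sketch with b = 2 (plug-in estimates; not the Weka/scikit-learn implementations used later in the paper):

```python
from collections import Counter
from math import log2, sqrt

def entropy(xs):
    """ENTROPY(X) = -sum_x p(x) log2 p(x), from empirical frequencies."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """MI(X, Y) = sum_{x,y} p(x, y) log2( p(x, y) / (p(x) p(y)) )."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in Counter(zip(xs, ys)).items())

def nmi(xs, ys):
    """NMI(X, Y) = MI(X, Y) / sqrt(ENTROPY(X) * ENTROPY(Y))."""
    return mutual_information(xs, ys) / sqrt(entropy(xs) * entropy(ys))
```

Note that NMI(X, X) = 1 and MI = 0 for independent variables, matching properties (i)–(iii).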

3 Methodology

In this section, we describe in detail the proposed multiobjective-based feature selection technique. The problem is mathematically stated as follows:

Given a data set with F features in total, select a subset F′ in such a way that the following two conditions are satisfied:

– F′ ⊆ F
– several objective functions Obj_1(F′), Obj_2(F′), Obj_3(F′), ..., Obj_M(F′) are optimized simultaneously, where M is the total number of objective functions.

Obj_i(F′), i = 1, ..., M, are feature quality measures calculated on the subset of features F′. Here we have used several different feature quality measures: PDM (a pairwise secondary-distance-matrix-based quality measure which preserves the topological information/structure in the reduced feature space), as used in Kundu and Mitra (2015); mutual information (MI) present in the reduced feature space; correlation (COR) of the features with the classes in the reduced feature space; and information gain and entropy (INFO) present in the reduced feature space.

As the proposed feature selection technique aims to optimize the set of objective functions simultaneously, we have devised a multiobjective-based solution framework in the current work. The search capability of a popular multiobjective optimization technique, the non-dominated sorting genetic algorithm II (NSGA-II) (Deb et al. 2002), is utilized for the optimization. NSGA-II involves three basic steps in its operation: mutation, crossover and selection. The algorithm operates on a set of chromosomes called the population. Each chromosome represents a solution of the optimization problem; in our case, each chromosome encodes a feature combination. In the following, we discuss the representation


scheme used in the current work and the different objective functions.

3.1 Solution representation

Here a chromosome is a binary vector of length F, where F is the total number of features present in the data set. Each position of the chromosome holds the value 0 or 1. A value of 0 at the ith position indicates that the ith feature does not participate in the classification task, and a value of 1 indicates that the corresponding feature takes part in the classification task. The chromosomes of the initial population are generated randomly: for each chromosome and each position, a random value (0 or 1) is generated and placed at that position.
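The random initialization described above amounts to sampling binary vectors. A brief sketch (function name and seeding are illustrative assumptions):

```python
import random

def random_population(pop_size, n_features, seed=0):
    """Initial population: each chromosome is a 0/1 vector of length F,
    where 1 at position i means feature i takes part in classification."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n_features)]
            for _ in range(pop_size)]
```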

3.2 Other operators

Now crossover and mutation operations of NSGA-II are applied on the initial population to create new offspring, with probabilities pc and pm, respectively, where pc is the probability of crossover and pm the probability of mutation. The new solutions generated by the mutation and crossover operations are stored in the new population, and their objective function values are calculated (as described in Sect. 3.3). The new population and the old population are then merged. The non-dominated sorting approach (Deb et al. 2002) is applied on the combined population to rank the solutions into different non-domination levels. The next population is then built from solutions of the different non-domination levels, with rank 0 having the highest priority, moving toward lower-ranked solutions. Suppose solutions up to rank i have already been copied into the new population. The remaining number of positions in the new population is

Rem = P − ∑_{j=0}^{i} |S_j|

where |S_j| denotes the number of solutions at rank j and P is the population size. If |S_{i+1}| > Rem, all the solutions of rank (i + 1) cannot be accommodated in the new population. In this case, to select Rem good solutions from the rank-(i + 1) solutions, the crowding distance operation (Deb et al. 2002) is utilized: the solutions of rank (i + 1) are sorted by their crowding distances, and the best Rem solutions are selected and added to the new population. Thus, selection is performed on the basis of lowest rank and, within the last accommodated rank, largest crowding distance.

The process of mutation, crossover and selection is executed for a large number of generations, and finally a set of solutions is obtained on the final Pareto optimal front. Each of these solutions provides an optimal feature subset which can then take part in the classification of the given data set.
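The crowding-distance computation used in the selection step can be sketched as follows (following Deb et al. 2002; boundary solutions get infinite distance so they are always retained):

```python
def crowding_distance(front):
    """Crowding distances for a list of objective vectors in one rank."""
    n, m = len(front), len(front[0])
    dist = [0.0] * n
    for k in range(m):
        order = sorted(range(n), key=lambda i: front[i][k])
        lo, hi = front[order[0]][k], front[order[-1]][k]
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi == lo:
            continue
        # interior solutions accumulate the normalized gap between neighbors
        for a, b, c in zip(order, order[1:], order[2:]):
            dist[b] += (front[c][k] - front[a][k]) / (hi - lo)
    return dist
```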

3.3 Different objective functions

To reduce the complexity of the MOO step, we select a subset of instances X_sel (where X_sel ⊂ X) using Algorithm I, as done in Kundu and Mitra (2015). Here |X_sel| = N_sel, and N is the original number of instances. The PDM matrix stores the pairwise secondary distance values of all pairs of points in the data set. The PDM distance between two points x and y is calculated as

PDM_s(x, y) = arccos(simcos_s(x, y))

PDM_s is a distance matrix of size N × N.

Algorithm I
Input: PDM_s and X
Output: sample subset X_sel

1. Find the minimum entry (> 0) of each row of PDM_s and store it as min_row_i, i = 1, ..., N.
2. Sort min_row_i in ascending order along with the indices.
3. Select the top N_sel index values of min_row_i.
4. Generate the instance subset X_sel for these selected indices.
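Algorithm I above can be sketched directly from its steps. Here the PDM matrix is given as a list of rows, and the function name is illustrative:

```python
def select_samples(pdm, n_sel):
    """Algorithm I: keep the n_sel points whose smallest positive
    secondary distance to any other point is lowest."""
    min_row = [min(v for v in row if v > 0) for row in pdm]
    ranked = sorted(range(len(pdm)), key=lambda i: min_row[i])
    return ranked[:n_sel]
```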

The various objective functions considered in the current study for evaluating the quality of the selected feature subset are described below.

3.4 PDM

This model is created along the lines of the objective functions introduced in Kundu and Mitra (2015). In this model, we use two objective functions for feature selection. PDM_s^{f_redu} represents the pairwise secondary distance computed with the feature subset f_redu, where f_redu consists of the features present in the particular chromosome. This objective function evaluates the quality of the feature subset in terms of its effectiveness in preserving the neighborhood, or topological, structure. The summation of the differences between the PDM values in the reduced feature space and in the original feature space, computed over all pairs of points, is a measure of topological similarity. Thus, the feature quality measure is given as follows:

F1 = ∑_{x, y ∈ X_sel, x ≠ y} |PDM_s(x, y) − PDM_s^{f_redu}(x, y)|

Here PDM_s^{f_redu}(x, y) is the pairwise secondary distance with s neighbors evaluated using the reduced set of features (f_redu), and PDM_s(x, y) is the pairwise secondary distance with s neighbors evaluated with the original set of features (f). The objective is to minimize this function in order to determine the optimal feature subset that preserves the s-sized neighborhood in the reduced space.


In this connection, in order to avoid the extreme solution, i.e., the full feature set, which corresponds to the minimum value of F1, the following objective function is also minimized. This objective conflicts with the first one by minimizing the number of features in the optimal subset.

F2 = | fredu|

Objective F1 aims to keep the structure of the data set unaltered, while objective F2 favors lowering the cardinality of the optimal feature subset.
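Under the same neighbor-set representation as in Sect. 2.2, the two PDM-model objectives can be sketched as below. This is an assumption-laden sketch: `nn_full` and `nn_redu` map each sample to its s-nearest-neighbor set in the full and reduced feature spaces, respectively, and are assumed precomputed:

```python
import math

def secondary_distance(nn, x, y, s):
    """PDM_s(x, y) = arccos(|NN_s(x) & NN_s(y)| / s)."""
    return math.acos(len(nn[x] & nn[y]) / s)

def f1(nn_full, nn_redu, samples, s):
    """F1: summed change in pairwise secondary distance after reduction."""
    return sum(abs(secondary_distance(nn_full, x, y, s)
                   - secondary_distance(nn_redu, x, y, s))
               for x in samples for y in samples if x != y)

def f2(chromosome):
    """F2: cardinality of the selected feature subset."""
    return sum(chromosome)
```

If the reduced space preserves every neighborhood exactly, F1 is zero.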

3.5 Mutual information (MI)-basedmodel

Mutual information measures the dependency between two features. In this objective function, feature subset quality is measured in terms of the mutual information contained in that particular subset. Mutual information between selected features should be low; otherwise, the selected feature subset will contain redundant information. Conversely, the mutual information between a selected feature and a non-selected feature should be high. To incorporate this concept, we define objective F3, which is based on the functions α and β. Let F denote the complete feature set of a data set, and let SF and NSF denote the selected feature subset and the subset of non-selected features, respectively. These two subsets of F satisfy the following properties:

(a) F = SF ∪ NSF
(b) SF ∩ NSF = ∅

The function α calculates the average normalized mutual information between all possible pairs of distinct selected features:

α = ( ∑_{f_i, f_j ∈ SF, f_i ≠ f_j} NMI(f_i, f_j) ) / ( |SF| (|SF| − 1) )

The function β is defined as the average normalized mutual information of every non-selected feature with respect to its first nearest-neighbor feature in the selected feature subset:

β = ( ∑_{f_i ∈ NSF, f_j ∈ SF, f_j = 1NN(f_i)} NMI(f_i, f_j) ) / |NSF|

In this model, we use three different objective functions. Apart from the two feature quality measures of the PDM model, the third objective function used in the current model is given below:

F3 = α − β

This combination of objectives aims at selecting a less interdependent feature subset along with structural preservation of the patterns. The purpose is to minimize the value of F3.
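A sketch of F3 under one reading of the definitions: the ‘first nearest-neighbor’ of a non-selected feature within SF is taken here to be the selected feature with the highest NMI to it (this interpretation is our assumption; `nmi` is passed in as any normalized-MI estimate):

```python
def f3(sel, non_sel, nmi):
    """F3 = alpha - beta (to be minimized).

    alpha: mean NMI over ordered pairs of distinct selected features.
    beta:  mean NMI of each non-selected feature with its nearest
           (highest-NMI) selected feature — an assumed reading of 1NN.
    """
    alpha = (sum(nmi(fi, fj) for fi in sel for fj in sel if fi != fj)
             / (len(sel) * (len(sel) - 1)))
    beta = (sum(max(nmi(fi, fj) for fj in sel) for fi in non_sel)
            / len(non_sel))
    return alpha - beta
```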

3.6 Correlation coefficient-basedmodel

The selected feature subset should contain those features which are most helpful in predicting the class labels. Hence, the correlation of the features with the class attribute is taken into consideration. To obtain a feature subset which is highly correlated with the class attribute and also has low inter-correlation among its members, we define objective F4 as follows:

F4 = ( ∑_{f_i, f_j ∈ SF, f_i ≠ f_j} COR(f_i, f_j) ) / ( ∑_{f ∈ SF} COR(f, c) )

where c is the class attribute. In this model, objective F4 is used along with the two objectives of the PDM model. In order to select a good feature subset, the value of F4 should be minimized.

3.7 Information gain-basedmodel

The information gain and the impurity (entropy) contained in individual features affect the classification process. Objective F5 measures the uncertainty level of the feature subset:

F5 = ( ∑_{f ∈ SF} ENTROPY(f) ) / ( ∑_{f ∈ SF} INFO-GAIN(f) )

In order to select a good feature subset, the value of the objective function F5 should be minimized.
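F5 is a simple ratio of aggregate entropy to aggregate information gain over the selected features. A one-function sketch, with the per-feature entropy and information gain supplied externally (e.g., from Weka, as in the paper):

```python
def f5(sel, entropy_of, info_gain_of):
    """F5 (to be minimized): total entropy of the selected features
    divided by their total information gain."""
    return (sum(entropy_of(f) for f in sel)
            / sum(info_gain_of(f) for f in sel))
```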

3.8 Combinations of objective functions

Various combinations of the objective functions are considered in the current multiobjective-based framework. The different models developed and their corresponding objective functions are summarized in Table 1.

Algorithm II is used to generate the best feature subsets.

Algorithm II
Input: observation set X, having N samples, each of dimension D.
Output: feature set f_final.

1. Construct the pairwise dissimilarity matrix PDM_s.
2. Generate X_sel using Algorithm I.
3. Construct PDM_{s1} with s1 shared nearest neighbors on the set X_sel.
4. For each model, select the feature subsets that simultaneously minimize all of its objectives using the multiobjective approach.


For each model, Algorithm II returns a set of feature subsets on the final Pareto optimal front. After decoding each feature subset, three classification techniques, namely the k-nearest-neighbor classifier, support vector machine and Naive Bayes classifier, are evaluated on the different data sets using tenfold cross-validation. Different performance measures are calculated on the classification results obtained by the different classification techniques, and these are compared for the purpose of comparing the models.

3.9 Performancemeasures

Besides the accuracies of the classifiers, there are a few other measures to evaluate the obtained feature space. We have calculated two such measures to evaluate the obtained feature subset, as listed below:

Jaccard score (JAC) This calculates the effectiveness of a feature subset in preserving the pairwise sample similarity and is defined as

JAC(M_f, M, m) = (1/N) ∑_{i=1}^{N} |NN(i, m, M_f) ∩ NN(i, m, M)| / |NN(i, m, M_f) ∪ NN(i, m, M)|

Here M_f = X_f X_f^T is the similarity matrix computed over the feature space f, and M is the similarity matrix computed over the original feature space. NN(i, m, M_f) and NN(i, m, M) represent the m nearest neighbors of observation i in feature space f and in the original space, respectively. JAC measures the average overlap of the neighborhoods in the two different spaces. A higher value of JAC corresponds to a better feature subset.

Redundancy (RED) This is the average pairwise linear correlation of the feature space F. The larger the value of RED, the more redundant features are present in the feature space.

RED(F) = ( 1 / (d(d − 1)) ) ∑_{f_i, f_j ∈ F, i > j} ζ_{f_i, f_j}

Here ζ_{f_i, f_j} denotes Pearson's correlation coefficient between features f_i and f_j, and d is the cardinality of the feature space F. A lower value of RED corresponds to a better feature subset.
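Both performance measures can be sketched compactly. Here the m-neighborhoods are again precomputed sets, `corr` is a full correlation matrix, and the RED normalization follows the paper's formula (sum over unordered pairs divided by d(d − 1)):

```python
def jaccard_score(nn_redu, nn_full):
    """JAC: mean neighborhood overlap between reduced and full spaces.

    nn_redu[i], nn_full[i]: sets of the m nearest neighbors of sample i
    in the reduced and original feature spaces.
    """
    n = len(nn_full)
    return sum(len(nn_redu[i] & nn_full[i]) / len(nn_redu[i] | nn_full[i])
               for i in range(n)) / n

def redundancy(corr):
    """RED: pairwise Pearson correlations summed over pairs i > j,
    normalized by d * (d - 1) as in the paper's formula."""
    d = len(corr)
    total = sum(corr[i][j] for i in range(d) for j in range(i))
    return total / (d * (d - 1))
```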

4 Experimental results

Six real-valued data sets are examined to evaluate the effectiveness of the different combinations of the above-mentioned objective functions. They are listed below:

Table 1 Objective functions used in the different models (all of minimization type)

Model   Objectives
PDM     F1, F2
MI      F1, F2, F3
COR     F1, F2, F4
INFO    F1, F2, F5
PMI     F1, F2, F3, F5
PMC     F1, F2, F3, F4
PCI     F1, F2, F4, F5
PMCI    F1, F2, F3, F4, F5

1. Multiple feature (MF) It consists of 2000 sample points of handwritten numerals (0–9) and has 10 classes.1

2. USPS It consists of 9298 images (16×16 pixels) from a handwritten digit database and has 10 classes.2

3. ORL It consists of 400 images (sampled to 56×46 pixels) of 40 human faces and has 40 classes.3

4. COIL20 It consists of 1440 images (sampled down to 32×32 pixels) of 20 objects and has 20 classes.4

5. Ozone-Eighthr It is ground-level ozone data from 8-h peak values and has 2 classes.5

6. Ozone-onehr It is ground-level ozone data measured at the 1-h peak value and has 2 classes.6

The six data sets are examined to evaluate the effectiveness of the different combinations of the above-mentioned objective functions in the proposed multiobjective optimization-based framework. From each data set, 80 instances are selected using Algorithm I; following the discussion in Kundu and Mitra (2015), we set N_sel = 80. Note that since s << N (where N is the actual number of points present in the data set), the performance of SNN is reasonably robust to the choice of s (Houle et al. 2010; Kundu and Mitra 2015). Inspired by Kundu and Mitra (2015) and Houle et al. (2010), the following parameter values are kept: s = 50 and s1 = s, with s1 < N_sel < 2 × s1. These values provide N_sel = 80.

The NSGA-II-based approach, implemented in C,7 is then applied on the individual data sets with different combinations of objective functions. For each data set, we have executed NSGA-II eight times with the different objective function combinations listed in Table 1 and obtained eight different

1 https://archive.ics.uci.edu/ml/datasets.html
2 https://archive.ics.uci.edu/ml/support/Optical+Recognition+of+Handwritten+Digits
3 http://www.face-rec.org/databases/
4 http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
5 https://archive.ics.uci.edu/ml/datasets/ozone+level+detection
6 https://archive.ics.uci.edu/ml/datasets/ozone+level+detection
7 http://www.iitk.ac.in/kangal/codes.shtml


Table 2 Parameters of the proposed NSGA-II-based feature selection approach for the different data sets used

Data set        s    i    pc    pm
MF              50   80   0.8   0.01
USPS            50   80   0.8   0.1
ORL             50   80   0.8   0.01
COIL20          50   80   0.8   0.01
Ozone-onehr     50   80   0.8   0.1
Ozone-Eighthr   50   80   0.8   0.1

Pareto optimal fronts. The following parameter settings are used for the NSGA-II-based feature selection approach: population size = 100, number of generations = 100. The other parameters are given in Table 2. Here, pc (crossover probability), s (shared nearest neighbors) and i (number of observations, or data points) are the same for all six data sets. pm (mutation probability) varies from data set to data set because it has been observed that with small values of pm, NSGA-II does not evaluate feature subsets with a large variance in cardinality; it then converges to the randomly generated initial feature subsets, which decreases the diversity of the best feature subsets. For each data set, we varied the value of pm in the range [0.01, 0.2] with a step size of 0.01, and the pm corresponding to the best results is reported here.

With the set of optimal feature subsets obtained by the proposed multiobjective-based approach for a given data set, different classifiers like KNN (with varying K values; K = 1, 3 and 5 are used in the current study), SVM and Naive Bayes are trained. We train the SVM using SMO (sequential minimal optimization), an algorithm for solving QP problems. The accuracy values from tenfold cross-validation are reported in the current paper. Weka's Java APIs for these classifiers are called with all default parameters. The information gain and entropy of the different classification models are also obtained from Weka's Java API, and normalized mutual information is obtained from the Python library.8 Note that the PDM model of our proposed approach mimics the approach proposed in Kundu and Mitra (2015).

The proposed feature selection architecture is executed on the different data sets with different objective function combinations. For each data set, eight different models (as listed in Table 1) are evaluated. Each of them produces a Pareto optimal front containing multiple alternative feature selection solutions. The solutions encode different feature combinations. For each feature combination, a set of classifiers, namely KNN, SVM and Naive Bayes, is executed using tenfold cross-validation. The average accuracy values with respect to different feature cardinalities are presented in Figs. 1a, 2, 3, 4, 5 and 6f. After considering the various

8 http://scikit-learn.org/stable/modules/generated/sklearn.metrics.

parameters of the above-mentioned models, we found that no single objective function combination is good for all data sets. Different combinations of objective functions worked better for different data sets. Different models exhibit different feature combinations on their Pareto optimal fronts. The number of features present in the feature subset is a key factor in determining the accuracy of the classifiers. In order to make a fair comparison, we have selected a minimum value of the number of features (p) for which the majority of the Pareto optimal fronts have some solutions. For the other models, we select the value (p′) closest to p. The results of the different models available on the Pareto optimal front corresponding to p or p′ are shown in Tables 3, 4, 5, 6, 7 and 8.
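The selection of p and p′ described above can be sketched directly; the function names below are illustrative, not from the paper:

```python
from collections import Counter

def common_cardinality(fronts):
    """Smallest feature-set size p that appears on a majority of the
    Pareto fronts (each front is a list of subset cardinalities)."""
    counts = Counter(size for front in fronts for size in set(front))
    majority = len(fronts) // 2 + 1
    candidates = [size for size, c in counts.items() if c >= majority]
    return min(candidates) if candidates else None

def closest_cardinality(front, p):
    """For a model whose front lacks p, the size p' closest to p."""
    return min(front, key=lambda size: abs(size - p))

fronts = [[10, 20, 30], [20, 40], [15, 20], [25, 35]]
p = common_cardinality(fronts)
print(p)                                  # 20: on 3 of the 4 fronts
print(closest_cardinality([25, 35], p))   # 25: nearest size on the last front
```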

In the case of the USPS data set, the INFO and COR models outperformed the PDM model in the lower cardinality range of the feature set for all classifiers, but for higher cardinalities their accuracies are lower than PDM's. Similarly, the PMC model gives much higher accuracies than the PDM model for a few feature sets. PMCI gives good results for a small range of feature sets only for the Naive Bayes classifier.

In the case of the Coil20 data set, the COR and PCI models outperform all other models with the KNN classifier, whereas the INFO and PMI models work well with the SVM and Naive Bayes classifiers for higher cardinalities. The PMCI model surpasses PDM with respect to the KNN and Naive Bayes classifiers.

Only the INFO model gives better accuracies than the PDM model over a comparable feature set range for the MF data set. The remaining models perform poorly compared to the PDM model for this data set.

Further, in order to show the efficacy of the proposed techniques with respect to state-of-the-art techniques, we have reported the following results:

– We have selected some feature sets returned by each model. The accuracy values of different classifiers using these feature combinations, along with other quality measures like Jaccard score (JAC) and redundancy (RED), are reported for the different models in Tables 3, 4, 5, 6, 7 and 8.

– Three state-of-the-art feature selection techniques, namely SPFS-SFS (Zhao and Liu 2007), ReliefF (Kira and Rendell 1992) and SPEC (Zhao et al. 2013), are also executed with the same number of features selected by the different models for a given data set. The best results over the different feature combinations used are reported in Tables 3, 4, 5, 6, 7 and 8.

– Note that our PDM model is exactly the same as the approach proposed in Kundu and Mitra (2015). Thus, the results of the approach proposed in Kundu and Mitra (2015) are also reported in Tables 3, 4, 5, 6, 7 and 8.
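The two auxiliary quality measures can be computed along the following lines. This is an illustrative sketch under assumed definitions (the paper's exact formulas are in Sect. 3.3): JAC is treated as a plain Jaccard index between two sets, and RED as the mean absolute pairwise Pearson correlation among the selected feature columns.

```python
import math

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def mean_pairwise_redundancy(columns):
    """Mean absolute Pearson correlation over all pairs of feature columns
    (an assumed redundancy measure, lower is better)."""
    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        vx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
        vy = math.sqrt(sum((yi - my) ** 2 for yi in y))
        return cov / (vx * vy) if vx and vy else 0.0
    pairs = [(i, j) for i in range(len(columns))
             for j in range(i + 1, len(columns))]
    return sum(abs(pearson(columns[i], columns[j]))
               for i, j in pairs) / len(pairs)

print(jaccard({1, 2, 3}, {2, 3, 4}))   # 0.5
cols = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 1, 3, 2]]
print(mean_pairwise_redundancy(cols))
```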


Fig. 1 Cardinality vs. accuracy of the KNN-1, SVM and Naive Bayes classifiers, respectively, for the USPS data set. a, b Accuracies of the KNN-1 classifier, c, d accuracies of the SVM classifier, and e, f accuracies of the Naive Bayes classifier, each over different feature subsets on the final Pareto front for the USPS data set


Fig. 2 Cardinality vs. accuracy of the KNN-1, SVM and Naive Bayes classifiers, respectively, for the COIL20 data set. a, b Accuracies of the KNN-1 classifier, c, d accuracies of the SVM classifier, and e, f accuracies of the Naive Bayes classifier, each over different feature subsets on the final Pareto front for the COIL20 data set


Fig. 3 Cardinality vs. accuracies of the KNN-1, SVM and Naive Bayes classifiers, respectively, for the Ozone-Eighthr data set. a, b Accuracies of KNN-1, c, d accuracies of SVM, and e, f accuracies of Naive Bayes, each over different feature subsets on the final Pareto front for the Ozone-Eighthr data set


Fig. 4 Cardinality vs. accuracies of the KNN-1, SVM and Naive Bayes classifiers, respectively, for the Ozone-Onehr data set. a, b Accuracies of KNN-1, c, d accuracies of SVM, and e, f accuracies of Naive Bayes, each over different feature subsets on the final Pareto front for the Ozone-Onehr data set


Fig. 5 Cardinality vs. accuracy of the KNN-1, SVM and Naive Bayes classifiers, respectively, for the MF data set. a, b Accuracies of KNN-1, c, d accuracies of SVM, and e, f accuracies of Naive Bayes, each over different feature subsets on the final Pareto front for the MF data set


Fig. 6 Cardinality vs. accuracy of the KNN-1, SVM and Naive Bayes classifiers, respectively, for the ORL data set. a, b Accuracies of KNN-1, c, d accuracies of SVM, and e, f accuracies of Naive Bayes, each over different feature subsets on the final Pareto front for the ORL data set


Table 3 Results on the USPS data set by different feature selection algorithms

Approach  d  Accuracy (%): KNN1  KNN3  KNN5  SVM  NB  JAC  RED

PDM 89 96.99 96.9 96.87 96.7 93.4 0.675 0.07

MI 88 96.54 96.53 96.5 96.35 92.9 0.65 0.069

INFO 88 96.9 96.8 96.7 96.6 93.49 0.65 0.078

COR 91 96.11 96 95.89 95.8 92.99 0.63 0.073

PMI 89 96.46 96.55 96.48 96.34 92.89 0.66 0.70

PCI 88 96.94 96.74 96.69 96.54 93.04 0.63 0.064

PMC 89 97.16 97.0 96.9 96.8 93.8 0.67 0.05

PMCI 89 96.98 97.05 97 96.84 93.84 0.67 0.059

Relief 89 93.78 93.71 93.65 93.52 91.30 0.49 0.08

SPEC 89 76.33 77.11 77.33 76.78 71.69 0.12 0.14

SPFS-SFS 89 76.78 77.12 77.38 76.47 71.13 0.12 0.15

Best results are marked in bold

Table 4 Results on the ORL data set by different feature selection algorithms

Approach  d  Accuracy (%): KNN1  KNN3  KNN5  SVM  NB  JAC  RED

PDM 842 97.5 96.125 94.667 95.563 94.3 0.87 0.071

MI 841 98 96.625 94.833 95.75 94.55 0.87 0.072

INFO 888 98 96.875 95 95.938 94.95 0.85 0.07

COR 844 98 96.75 95 95.938 94.75 0.869 0.068

PMI 922 97.25 96 94.667 95.563 94.5 0.83 0.071

PCI 918 97.75 96.75 95.333 96.188 94.9 0.87 0.069

PMC 836 97.5 96.375 95.083 96 94.7 0.87 0.07

PMCI 881 98 96.875 95.333 96.125 94.85 0.86 0.068

Relief 836 97.5 95.5 93.75 94.625 94 0.49 0.096

SPEC 836 88.75 83.625 80.917 83.5 83.5 0.41 0.14

SPFS-SFS 836 90.25 84 81 83.69 83.65 0.47 0.12

Best results are marked in bold

Table 5 Results on the MF data set by different feature selection algorithms

Approach  d  Accuracy (%): KNN1  KNN3  KNN5  SVM  NB  JAC  RED

PDM 68 94.75 94.62 94.6 95.2 94.6 0.577 0.027

MI 86 96.15 96.25 96.25 96.47 96.2 0.5 0.013

INFO 83 95.7 95.8 95.7 96.3 96.18 0.547 0.015

COR 81 95.85 95.8 95.88 96.27 95.7 0.58 0.031

PMI 106 96.65 96.6 96.73 97.02 96.66 0.45 0.010

PCI 103 96.15 96.07 95.98 96.34 96.06 0.54 0.0189

PMC 88 96.55 96.75 96.76 97.03 96.55 0.42 0.01

PMCI 105 95.65 95.55 95.67 96.21 95.76 0.28 0.012

Relief 81 95.8 95.9 95.783 95.575 94.54 0.26 0.022

SPEC 81 91.95 92 92 93.112 91.25 0.33 0.066

SPFS-SFS 81 91.8 91.975 92.1 93.037 91.38 0.35 0.055

Best results are marked in bold

– Analysis of the results reported in Tables 3, 4, 5, 6, 7 and 8 evidently illustrates the potency of our proposed approach. For all the data sets, the best values of the different measures are highlighted in bold in Tables 3, 4, 5, 6, 7 and 8. This clearly shows that, for different data sets, different proposed models perform well in terms of the different quality measures used. There is no universal model which performs best for all the data sets with


Table 6 Results on the COIL20 data set by different feature selection algorithms

Approach  d  Accuracy (%): KNN1  KNN3  KNN5  SVM  NB  JAC  RED

PDM 225 99.93 99.86 99.49 99.54 98.3 0.16 0.082

MI 226 99.7 99.51 98.9 99.15 97.08 0.77 0.07

INFO 224 100 99.9 99.4 99.6 98.5 0.79 0.086

COR 251 100 99.79 99.3 99.5 98.27 0.78 0.067

PMI 257 99.8 99.7 99.3 99.5 97.88 0.8 0.079

PCI 234 100 99.9 99.35 99.51 98.43 0.78 0.082

PMC 224 99.9 99.7 99.2 99.35 97.54 0.78 0.069

PMCI 235 100 99.7 99.1 99.35 98.34 0.77 0.08

Relief 225 99.09 98.54 97.76 98.06 95.43 0.59 0.18

SPEC 225 99.93 99.83 99.31 99.29 97.92 0.45 0.18

SPFS-SFS 225 99.93 99.72 98.98 99.08 97.89 0.48 0.20

Best results are marked in bold

Table 7 Results on the Ozone-Onehr data set by different feature selection algorithms

Approach  d  Accuracy (%): KNN1  KNN3  KNN5  SVM  NB  JAC  RED

PDM 27 95.03 95.9 96.26 96.47 90.52 0.32 0.20

MI 27 95.29 95.9 96.2 96.46 91.38 0.43 0.21

INFO 27 95.58 96.22 96.47 96.68 92.55 0.26 0.23

COR 26 94.96 95.84 96.18 96.42 91.47 0.345 0.22

PMI 26 95.97 96.28 96.45 96.61 92.06 0.42 0.18

PCI 27 95.3 96.03 96.29 96.5 92.39 0.26 0.16

PMC 23 95.66 96.14 96.43 96.6 91.5 0.32 0.21

PMCI 27 95.7 96.25 96.47 96.65 92.37 0.22 0.19

Relief 27 94.99 95.88 96.16 96.40 88.91 0.10 0.53

SPEC 27 95.51 96.14 96.40 96.58 88.97 0.29 0.47

SPFS-SFS 27 95.31 96.02 96.29 96.5 88.33 0.33 0.71

Best results are marked in bold

Table 8 Results on the Ozone-Eighthr data set by different feature selection algorithms

Approach  d  Accuracy (%): KNN1  KNN3  KNN5  SVM  NB  JAC  RED

PDM 17 90.92 91.85 92.35 92.68 87.27 0.45 0.188

MI 17 91.43 92.6 93.06 93.22 88.6 0.254 0.221

INFO 17 91.63 92.44 92.84 93.05 88.88 0.325 0.171

COR 17 92.73 93.41 93.57 93.59 88.19 0.212 0.193

PMI 17 92. 92.87 93.18 93.3 88.9 0.36 0.28

PCI 17 91.98 92.44 92.8 93.02 88.45 0.29 0.15

PMC 16 92.10 92.66 93.05 93.21 88.24 0.24 0.18

PMCI 17 92.14 92.99 93.37 93.44 89.15 0.35 0.18

Relief 17 89.54 90.71 91.44 91.999 85.28 0.098 0.52

SPEC 17 90.81 91.91 92.45 92.76 86.54 0.45 0.29

SPFS-SFS 17 89.62 91.24 91.95 92.38 85.84 0.46 0.33

Best results are marked in bold

respect to all the quality measures. But in general, the PMC and PMCI models consistently perform well for all the data sets. This supports the fact that optimization of all the different feature quality measures described in Sect. 3.3 leads to the generation of good-quality feature subsets. Thus, the objective functions defined in Sect. 3.3 are all useful in the feature selection process.


– For almost all data sets, with respect to the different feature quality measures, the proposed models perform better than the existing feature selection techniques. Since the PDM model is similar to the approach proposed in Kundu and Mitra (2015), and the models optimizing all the objective functions (both the structure-preserving and the information theory-based ones), PMC and PMCI, perform better than PDM for almost all the data sets with respect to all the quality measures, the utility of optimizing the information theory-based measures in the feature selection process is established. This further validates the use of information theory-based feature quality measures in the optimization process.

– Note that our approach is a filter-based feature selection technique. It does not utilize any classification algorithm for determining the best subset of features. Filter-based feature selection techniques are in general unbiased toward any classifier and have the following advantages: (1) Fast execution: filters generally involve a non-iterative computation on the data set, which can execute much faster than a classifier training session. (2) Generality: since filters evaluate the intrinsic properties of the data, rather than their interactions with a particular classifier, their results exhibit more generality: the solution will be good for a larger family of classifiers. But they suffer from the following drawback: (1) Tendency to select large subsets: since filter objective functions are generally monotonic, the filter tends to select the full feature set as the optimal solution. This forces the user to select an arbitrary cutoff on the number of features to be selected. The proposed approach is a filter-based method as it utilizes a set of feature quality measures for selecting the appropriate feature subset. The results reported in Tables 3, 4, 5, 6, 7 and 8 also support this fact, as different classifiers perform differently with the different feature subsets selected by our proposed models. The feature subsets selected by the different models are not over-fitted toward any particular classifier. The results obtained support this behavior.

– Note that the results in Tables 3, 4, 5, 6, 7 and 8 evidently reveal that the combination of all objective functions (model PMCI) does not always perform best over all other models for all data sets. The reasons for this behavior can be attributed to the following: (1) a careful observation of the results suggests that even though PMCI does not perform the best for many of the data sets, its performance with respect to the different quality measures is very near to the corresponding best results; in some cases, the differences between the results obtained by the PMCI-based model and the best model are insignificant; (2) for the purpose of comparison, we have reported the results of only one solution per model from the Pareto optimal front; it could happen that the solution chosen for PMCI is optimal with respect to a particular objective function but not with respect to other objective functions like F1 (responsible for increasing the value of JAC) or F4 (responsible for decreasing the value of RED); (3) as the proposed feature selection technique is a filter-based method, it does not utilize any classification technique during the search for the optimum feature subset. Thus, the features returned by PMCI are sometimes good with respect to some classifiers but not for all of them; this follows the behavior of any filter-based feature selection technique.

4.1 Time complexity analysis

The time complexity of the proposed MOO-based feature selection technique is analyzed below:

– Let the population size = P, the number of points in the data set = N, the total number of features = d and the total number of generations = G.

– Initialization of the population takes O(P × d) time, as the chromosome length = d.

– Calculation of the first objective function (PDM-based) requires computing the neighborhoods of all the samples. This takes O(N² × d + N² log N) ≈ O(N² × d) time. Further, the computation of the shared nearest-neighbor distances takes O(N²) time. Thus, the computation of the F1 objective function takes O(N² × d) time.

– Calculation of the other objective functions takes O(d² × P) time if the different probability values are calculated beforehand.
– Thus, the overall time complexity of the objective function calculation phase is O(N² × d × P).

– The genetic operators take O(P × d) time.
– The non-dominated sorting of NSGA-II takes O(P log P) time.
– Thus, the overall time complexity of the proposed approach is O(N² × log N × d × P × G) ≈ O(log N × d × N² × P × G).

– In general, P and G are much smaller than N, so the overall time complexity becomes O(log N × d × N²).

The time complexity of the approach proposed in Kundu and Mitra (2015) is also O(log N × d × N²). The time complexity of SPEC is O(d × N²). Thus, the overall time complexity of the proposed approach is similar to those of the other approaches.
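The dominant F1 cost discussed above (k-NN neighborhoods in O(N² × d), plus shared-neighbor overlaps in O(N²)) can be illustrated with a small brute-force sketch; `knn_sets` and `shared_nn_similarity` are illustrative names, not the paper's code:

```python
import math

def knn_sets(data, k):
    """Index sets of the k nearest neighbors (Euclidean) of each point.
    The all-pairs distances plus per-point sort give the
    O(N^2 * d + N^2 log N) cost noted in the analysis."""
    n = len(data)
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist(data[i], data[j]))
        neighbors.append(set(order[:k]))
    return neighbors

def shared_nn_similarity(neighbors, i, j):
    """Shared nearest-neighbor similarity: overlap of the two k-NN sets
    (computed for all pairs, this is the O(N^2) step)."""
    return len(neighbors[i] & neighbors[j])

data = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
nbrs = knn_sets(data, k=2)
print(shared_nn_similarity(nbrs, 0, 1))  # points 0 and 1 share one neighbor
```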


5 Conclusion

In this article, several models of multiobjective-based feature selection are proposed by varying the objective function combinations. The study aims to investigate the effect of simultaneous optimization of different objective functions capturing different feature quality measures, like preservation of topology/neighborhood information in the reduced feature space and different information theory-based criteria like mutual information, correlation, entropy and information gain. The novelty of the work lies in determining a particular feature set which not only preserves the neighborhood similarity in the reduced space but also contains fewer redundant features which are more relevant for class prediction. In order to simultaneously optimize this set of objective functions, a multiobjective-based solution framework is proposed. Several models are developed by varying the objective function combinations. A comparative study between these different models is conducted in the current paper. Results on a large collection of data sets using different classification techniques show that a combination of objective functions possessing the structure-preservation property as well as information theory-based quality measures performs better than the individual objective functions.

Acknowledgements No funding is involved in this work. The authors would like to acknowledge the help from the Indian Institute of Technology Patna in conducting this research.

Compliance with ethical standards

Conflict of interest All the authors declare that they do not have any conflict of interest.

Human and animal rights We have not performed any experiments which involve animals or humans.

References

Bhadra T, Bandyopadhyay S (2015) Unsupervised feature selection using an improved version of differential evolution. Expert Syst Appl 42(8):4042–4053

Dash M, Liu H (1997) Feature selection for classification. Intell Data Anal 1(3):131–156. http://dl.acm.org/citation.cfm?id=2639279.2639281

Deb K, Kalyanmoy D (2001) Multi-objective optimization using evolutionary algorithms. Wiley, New York

Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 6(2):182–197

Garg RP, Sharapov I (2002) Techniques for optimizing applications: high performance computing. Prentice Hall Professional Technical Reference, Upper Saddle River, NJ

Houle ME, Kriegel H, Kröger P, Schubert E, Zimek A (2010) Can shared-neighbor distances defeat the curse of dimensionality? In: 22nd international conference on scientific and statistical database management, SSDBM 2010, Heidelberg, Germany, 30 June–2 July 2010. Proceedings, pp 482–500. https://doi.org/10.1007/978-3-642-13818-8_34

Kira K, Rendell LA (1992) A practical approach to feature selection. In: Proceedings of the 9th international workshop on machine learning, ML92. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 249–256. http://dl.acm.org/citation.cfm?id=141975.142034

Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1–2):273–324. https://doi.org/10.1016/S0004-3702(97)00043-X

Kundu PP, Mitra S (2015) Multi-objective optimization of shared nearest neighbor similarity for feature selection. Appl Soft Comput 37:751–762

Liu H, Yu L (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng 17(4):491–502

Molina LC, Belanche L, Nebot A (2002) Feature selection algorithms: a survey and experimental evaluation. In: Proceedings of the 2002 IEEE international conference on data mining, ICDM '02. IEEE Computer Society, Washington, DC, USA, p 306. http://dl.acm.org/citation.cfm?id=844380.844722

Sánchez-Maroño N, Alonso-Betanzos A, Tombilla-Sanromán M (2007) Filter methods for feature selection: a comparative study. In: Proceedings of the 8th international conference on intelligent data engineering and automated learning, IDEAL'07. Springer, Berlin, pp 178–187. http://dl.acm.org/citation.cfm?id=1777942.1777962

Zhao Z, Liu H (2007) Spectral feature selection for supervised and unsupervised learning. In: Proceedings of the 24th international conference on machine learning, ICML '07. ACM, New York, NY, USA, pp 1151–1157. https://doi.org/10.1145/1273496.1273641

Zhao Z, Wang L, Liu H, Ye J (2013) On similarity preserving feature selection. IEEE Trans Knowl Data Eng 25(3):619–632. https://doi.org/10.1109/TKDE.2011.222
