
Novel Feature Selection Method using Mutual Information and Fractal Dimension

D. T. Pham, M. S. Packianather, M. S. Garcia and M. Castellani

The Manufacturing Engineering Centre, Cardiff University, Queen’s Buildings, The Parade, Newport Road, Cardiff CF24 3AA, UK.

[email protected] , [email protected], [email protected], and [email protected]

Abstract- In this paper, a novel feature selection method using Mutual Information (MI) and Fractal Dimension (FD) to measure the relevance and the redundancy of features is presented. The proposed algorithm maximises the relevance and minimises the redundancy of the attributes simultaneously. The new framework allows a more efficient method for the selection of features without using any search technique. The performance of the proposed algorithm is compared with three different feature selection methods on three different datasets. The results obtained confirm the comparable efficiency and effectiveness of the features selected through the proposed algorithm.

I. INTRODUCTION

Understanding feature-structured data is a complex task which entails the identification of a good data representation. The first step is to select a minimal subset of informative and relevant features from the initial batch of data attributes. Irrelevant and redundant attributes must be detected and removed in order to maximise the descriptiveness of the data representation. A compact data description makes it possible to build more practical and better understood learning models, and improves the performance of the data analysis procedure.

This paper focuses on ranking features according to their combined individual relevance and redundancy. The Mutual Information (MI) and Fractal Dimension (FD) techniques are used to carry out both analyses. A feature filter function is developed to return an index of usefulness for each of the features contained in the dataset.

The MI criterion is useful to measure accurately the information content of two variables. In this work, it is used to measure the relevance of the features to the output class. The FD of a dataset is not affected by redundant features [1]. The ratio defined in this paper is proposed to measure the level of redundancy of each feature without using any greedy search technique.

The proposed algorithm integrates the two techniques, namely, MI and FD, to mimic the minimal-redundancy-maximal-relevance criterion (mRMR) [2]. The algorithm performs both analyses in parallel. In this way, the use of predictors in the middle of both analyses is avoided and a pure filter configuration is maintained.

The proposed algorithm was tested using an MLP (Multi-Layer Perceptron) classifier on one real dataset and two benchmark datasets, namely wood, satellite and image segmentation data respectively. The experimental results obtained for MI-FD are compared with the results obtained using Optimal Cell Damage (OCD) and ReliefF feature selection techniques.

The literature survey is given in Section II. Section III introduces the relevance and the redundancy criteria. The MI and FD techniques are given in Sections IV and V respectively. Section VI describes the proposed algorithm, and the computational implementation is given in Section VII. Section VIII presents the experimental results. Finally, the conclusions are given in Section IX.

II. LITERATURE REVIEW

A. Feature Selection Approaches

Feature selection algorithms can be divided into three types: filter, wrapper and embedded methods. The first type comprises feature selection algorithms that are independent of any learning method. The filter method performs without any feedback from the predictors and uses the data as the only source of performance evaluation. These characteristics make the filter method, the one used in this work, the least computationally expensive approach and the one least prone to generating over-fitted models [3].

The wrapper approach is the second type of feature selection method. This approach receives direct feedback (an accuracy measure) from the predictors. Such a connection to the classifier is used to drive a search technique to find an optimal subset of features. The wrapper approach requires a very high computational effort due to the exploration of the search space. The most common search techniques used are greedy backward elimination, forward selection and the nested approach [4]. These search methods suffer from two main drawbacks. The first is their tendency towards sub-optimal convergence. The second is their inability to detect possible interactions among attributes. Despite these limitations, wrappers tend to produce better results in terms of classification accuracy.
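To make the wrapper idea concrete, the following is an illustrative sketch (not part of the proposed method) of a greedy forward-selection wrapper; the scikit-learn MLP classifier and cross-validation routine are assumptions made purely for illustration.

```python
# An illustrative greedy forward-selection wrapper (not the method proposed in
# this paper): each step keeps the feature whose addition gives the best
# cross-validated accuracy of a generic MLP classifier.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def forward_selection(data, labels, n_keep, cv=3):
    selected, remaining = [], list(range(data.shape[1]))
    while remaining and len(selected) < n_keep:
        scores = [cross_val_score(MLPClassifier(max_iter=500),
                                  data[:, selected + [f]], labels, cv=cv).mean()
                  for f in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```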

The embedded method creates a specific interaction between the classifier learning and the feature selection processes. This interaction combines the two procedures into one single entity. The inclusion of the feature selection process into the classifier learning procedure makes embedded methods specific to the type of classifier considered [4]. In contrast, wrapper approaches are independent of the kind of classifier used, since any induction machine can be used to measure the quality of the feature subset selected. The main advantage of the embedded method is the reduction of the computational effort in comparison with the wrapper method: it avoids repeating the whole training process for the evaluation of every possible solution.

B. Relevant Features

Typically, feature selection methods focus on detecting relevant features. These features present a high correlation with the target and form the basis of a descriptive subset of features. The definition of relevance was elaborated in [5], which classifies features into three levels of importance.

The first level is called strong relevance. This category includes those features which must be retained by the feature selection process. If any of these attributes is removed, the descriptive power of the feature subset is negatively affected. Weak relevance is the second level in terms of importance. This level includes features which are not essential to build a good feature subset. However, a weakly relevant feature can still be useful, especially in combination with other features. The third level of importance is called irrelevance, and includes features which are not useful at all. In this work, a continuous range from zero to one is used to grade features from the least to the most relevant.

C. Redundant features

Redundancy amongst features appears when their values are correlated. In theory, such features do not provide any extra information to characterise the dataset. However, features with a certain level of redundancy may sometimes be complementary, and make a feature subset more descriptive [6].

III. RELEVANCE AND REDUNDANCY ANALYSIS

A subset of features f ∈ F is relevant if it is useful for an induction algorithm to build a learning model. This learning model must be able to predict a label C given a new set of instances different from the training set.

Relevance analysis is usually carried out by ranking the attributes according to some pre-defined index. The lowest-ranking features are then filtered out (maximal relevance). Two common techniques to characterise the relevance of a feature are the Pearson correlation coefficient [7] and the MI index [2].
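As an illustration of this kind of ranking, the sketch below scores each feature by the absolute value of its Pearson correlation with the class labels; it is a generic example, not the criterion adopted in this paper (which uses MI, Section IV).

```python
# An illustrative relevance ranking using the Pearson correlation coefficient
# mentioned above (the MI alternative is sketched in Section IV).
import numpy as np

def pearson_relevance_ranking(data, labels):
    """Rank features by the absolute correlation of each column with the labels."""
    scores = np.array([abs(np.corrcoef(data[:, i], labels)[0, 1])
                       for i in range(data.shape[1])])
    return np.argsort(scores)[::-1], scores   # highest-ranking (most relevant) first
```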

When there is correlation amongst features, relevance analysis alone is not sufficient. The addition of a correlated albeit relevant feature may in fact not provide any extra useful information. This places a burden on the learning process and generates unnecessarily complex learning models [8].

Even when all the features of a subset are highly relevant, it is possible that some of them are redundant. These redundant features can be eliminated without the classifier losing any predictive power. Measures of feature redundancy are usually based on the level of correlation among features. It is said that one feature is redundant if it is completely correlated to another.

Ignoring useless features increases the general quality of the dataset. Subsets that include highly relevant features with low redundancy form good dataset representations.

IV. MUTUAL INFORMATION

MI is a criterion that is often used in feature selection to characterise the general dependence between two variables. MI is a non-parametric and non-linear technique that makes no assumption about the distribution of the data. It is capable of successfully detecting different types of feature dependencies without relying on transformations of the different variables. MI is an ideal way of quantifying how much information two variables share. Due to these reasons, MI is a popular method to measure the dependency between the features and their class labels [2].

The mutual information I(x, y) between two random variables x and y is defined in terms of their probability density functions p as follows:

I(x, y) = ∫∫ p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy        (1)
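In practice, equation (1) has to be estimated from finite data. The following is a minimal sketch, assuming equal-width discretisation of a continuous feature, of a histogram-based estimate of I(x, c) between a feature x and a discrete class label c; it is illustrative only and not the estimator of [2]/[14] used in the experiments.

```python
# A minimal sketch (not the authors' code): histogram-based estimate of I(x, c)
# from equation (1) for a continuous feature x and a discrete class vector c.
import numpy as np

def mutual_information(x, c, bins=10):
    """Estimate I(x, c) in nats by discretising x into equal-width bins."""
    edges = np.histogram_bin_edges(x, bins=bins)
    x_bins = np.digitize(x, edges[1:-1])                # bin index of each sample
    mi = 0.0
    for xb in np.unique(x_bins):
        p_x = np.mean(x_bins == xb)                     # marginal p(x)
        for cl in np.unique(c):
            p_c = np.mean(c == cl)                      # marginal p(c)
            p_xc = np.mean((x_bins == xb) & (c == cl))  # joint p(x, c)
            if p_xc > 0.0:
                mi += p_xc * np.log(p_xc / (p_x * p_c))
    return mi
```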

V. FRACTAL DIMENSION

Fractal theory is based on various dimension theories and geometrical concepts. There are many definitions of a fractal [9, 10]. In this paper, a fractal is defined as a mathematical set with a high degree of geometrical complexity. This complexity is useful to model numeric sets such as data and images.

One of the characteristics of fractals is self-similarity, as shown in Figure 1 by the Sierpinski pyramid and triangle. This property defines the geometrical or statistical likeness between the parts of an entity (set) and the whole entity (set). To quantify the self-similarity of an object, its fractal dimension, defined by equation (2), needs to be calculated. This measure describes how the object fills up the space, giving information about its length, area and volume. Its value can be an integer or a fraction; the fractional form is used in this work to quantify the redundancy among features.


Fig. 1 a) Sierpinski diamond and b) Sierpinski triangle

Mathematically, the FD of a given set A is defined as follows [11]:

D = ∂ log N(A, r) / ∂ log r,    r ∈ [r1, r2]        (2)

where D is the FD of the set A, N denotes the number of boxes used to cover the object and r the length of the box side. Due to its relative simplicity and good accuracy, the box-counting method was chosen in this work to calculate the fractal dimension of an object. The implementation presented in [1] was used, since its computational complexity grows only linearly with the number of instances in the dataset.
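A minimal sketch of the box-counting idea is given below: the data are gridded at several box sizes, the occupied boxes are counted, and D is taken as the slope of log N(r) against log(1/r). This is illustrative only; it does not reproduce the linked-list C++ implementation of [1].

```python
# An illustrative box-counting sketch (not the C++ implementation of [1]):
# D is estimated as the slope of log N(r) versus log(1/r).
import numpy as np

def box_counting_dimension(points, box_sizes=(0.5, 0.25, 0.125, 0.0625)):
    """points: (n_samples, n_features) array, assumed scaled to the unit hypercube."""
    counts = []
    for r in box_sizes:
        cells = np.floor(points / r).astype(int)       # grid cell of each point at side r
        counts.append(len(np.unique(cells, axis=0)))   # number of occupied boxes N(r)
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(box_sizes)), np.log(counts), 1)
    return slope
```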

A. The number of levels of resolution analysis

One of the main difficulties in employing the FD method is that natural objects exhibit self-similarity only over a limited range of scales. This difficulty was shown when the English mathematician Lewis Fry Richardson attempted to estimate the length of the coastline of Great Britain. The estimate was obtained by multiplying the number of strides taken to walk around the coast back to the starting point by the stride length. Richardson realised that the length of the coastline was indeterminate because it depended on the resolution (i.e. the stride length) at which the measurements were made. He found that the plot of the logarithm of the estimated coastline length versus the logarithm of the stride length used generated a straight line. These plots over different resolutions are known as “Richardson plots”. When an object presents a high level of self-similarity, the plot is a series of straight line segments. On the contrary, when an object is not self-similar enough, curved line segments are produced [12]. To automate the selection of the data analysis resolution, the following threshold T and ratio R are proposed:

T = μ(m) / m_max + C,    R_i = m_i / m_max        (3)

where m is the slope vector of the line segments, m_max is the maximum slope in the vector, μ(m) is the average slope of the different segments (resolutions) analysed, and C is a constant bias that is used to tune the threshold T. The value of C can vary from 0 to 0.1. For the datasets used in this study, the value of C was set to 0.08. When the ratio R_i in the ith level of resolution becomes bigger than the threshold T, the Richardson plot cannot be considered straight anymore at that point. This indicates that from the ith resolution the dataset stops being self-similar. Consequently, the fractal dimension is calculated only up to that scale. When a dataset presents high levels of self-similarity, the fractal dimension remains stable over all the levels of resolution. In this case, it is possible to choose any level of resolution analysis without significant changes in the quantification of the FD.
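The following sketch applies the reconstructed rule of equation (3) to the segment slopes of a Richardson plot, returning the number of resolution levels over which the dataset can be treated as self-similar; the function name and interface are illustrative assumptions.

```python
# A sketch of the reconstructed rule in equation (3): stop at the first
# resolution level whose slope ratio R_i exceeds the threshold T.
import numpy as np

def usable_resolution_levels(segment_slopes, C=0.08):
    """segment_slopes: slope of each straight segment of the Richardson plot."""
    m = np.asarray(segment_slopes, dtype=float)
    m_max = m.max()
    T = m.mean() / m_max + C        # threshold: average slope over maximum slope, plus bias C
    for i, m_i in enumerate(m):
        if m_i / m_max > T:         # R_i > T: the plot is no longer straight here
            return i                # self-similarity holds only up to this level
    return len(m)                   # self-similar over all levels analysed
```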

VI. THE PROPOSED ALGORITHM

The aim of the algorithm proposed in this work is to select a subset with the most relevant and least redundant features in order to obtain high accuracy results in supervised learning tasks. To achieve this goal, different frameworks have been proposed in the literature [7, 13]. The sequential analysis structure of these algorithms has two main steps, as shown in Figure 2. The first step is carried out to obtain a ranked subset of relevant features from the original dataset. The second step is subsequently carried out on the subset of relevant features to measure their level of redundancy.

Fig. 2 Traditional framework of feature selection

This sequential feature selection approach has two disadvantages. The first is that redundancy analysis is not applied over the whole range of features in the dataset, but only to a selected subset of relevant features. As a consequence, attributes of low redundancy that could be useful to describe the dataset may be removed because they are only mildly correlated to the target. The second disadvantage is the need to set the number of relevant features to retain after the relevance analysis, which burdens the process.

To overcome these disadvantages, a novel filter algorithm based on parallel feature analysis is proposed. This algorithm simultaneously evaluates the relevance and redundancy of the attributes, and balances the feature selection criterion accordingly.

Fig. 3 Proposed framework of feature selection

The first step is to apply relevance analysis to the original set of features using MI. The MI index between every individual feature and the target is calculated using equation (1). The feature which obtains the highest value is ranked top in terms of relevance.

The second step is to quantify the level of redundancy of the whole dataset using FD analysis. This is done by first calculating the FD D of the whole dataset and then the N PFDs (Partial Fractal Dimensions). The ith PFD is calculated by eliminating the ith attribute and computing the fractal dimension Di of the remaining N-1 features. Given that the FD of the dataset is relatively unaffected by redundant attributes [1], redundancy can be quantified as follows:

R_i = ΔpD_min / ΔpD_i,    0 ≤ R_i ≤ 1        (4)

where ΔpD_i is the absolute difference between the ith partial fractal dimension and the fractal dimension D, and ΔpD_min refers to the minimum of these differences among all the partial fractal dimensions.

When ΔpD_i tends to be small and approaches ΔpD_min, the ratio R_i takes a value close to its upper extreme, which indicates the redundancy of the ith attribute. When ΔpD_i tends to be large, the ratio R_i approaches zero, which means that the ith attribute has low redundancy.
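A hedged sketch of this redundancy measure is given below; it reuses the box_counting_dimension sketch from Section V (assumed to be in scope) and computes one partial fractal dimension per attribute before forming the ratios of equation (4).

```python
# A sketch of equation (4); box_counting_dimension from the Section V sketch
# is assumed to be in scope.
import numpy as np

def redundancy_ratios(data, box_sizes=(0.5, 0.25, 0.125, 0.0625)):
    """data: (n_samples, n_features) array scaled to the unit hypercube."""
    D = box_counting_dimension(data, box_sizes)              # FD of the whole dataset
    delta = np.empty(data.shape[1])
    for i in range(data.shape[1]):
        reduced = np.delete(data, i, axis=1)                 # drop the i-th attribute
        pd_i = box_counting_dimension(reduced, box_sizes)    # i-th partial fractal dimension
        delta[i] = abs(pd_i - D)                             # delta pD_i
    delta = np.maximum(delta, 1e-12)                         # guard against division by zero
    return delta.min() / delta                               # R_i = delta pD_min / delta pD_i
```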

The final step is to verify the following condition:

max_{x_i ∈ F} [ I(x_i, c) − R_i(F) ]        (5)

where I(x_i, c) is the dependency of the ith feature on the target c calculated using the MI, and R_i(F) is the level of redundancy of the ith feature calculated using the FD and the PFDs. This function is developed by mimicking the mRMR criterion using a different condition for redundancy analysis that avoids the use of incremental search techniques.

The objective of the function defined in (5) is to set a measure of usefulness for each of the features in the dataset. The purely mathematical nature of the MI and the FD allows a pure filter analysis and avoids evaluating feature subsets using heuristic search strategies. The computational complexity of the condition given in (5) is linear in the number of instances and quadratic in the number of features.
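Putting the two analyses together, the following sketch ranks the features by the index of equation (5), combining the MI estimate of Section IV with the redundancy ratios above; names and interfaces are illustrative assumptions rather than the authors' implementation.

```python
# A sketch of the ranking induced by equation (5); mutual_information (Section IV)
# and redundancy_ratios (Section VI) are assumed to be in scope.
import numpy as np

def mi_fd_ranking(data, labels, bins=10):
    relevance = np.array([mutual_information(data[:, i], labels, bins)
                          for i in range(data.shape[1])])    # I(x_i, c)
    redundancy = redundancy_ratios(data)                     # R_i(F)
    usefulness = relevance - redundancy                      # criterion (5)
    return np.argsort(usefulness)[::-1], usefulness          # most useful features first
```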

VII. ALGORITHM IMPLEMENTATION

The algorithm to calculate the fractal dimension D was implemented in C++ using linked lists instead of arrays to improve memory performance, and connected to MATLAB through MEX files for visualisation purposes. The MI computation was performed using the algorithm proposed in [2], through a self-contained and cross-platform package available from Mathworks [14]. The proposed feature selection algorithm using MI and FD with the mRMR criterion is shown in Figure 4.

Fig. 4 The proposed MI-FD feature selection algorithm

A. The number of features to be selected

Filter methods usually rank the features according to a pre-defined criterion of desirability. The decision on how many features to remove may be taken either by using a threshold of importance, or by evaluating each feature's contribution using a predictor. The latter option is a combined methodology called the “filtrapper” approach [4]. This is the only approach that has been proven to work optimally for any kind of scenario. The filtrapper is computationally less expensive than the original wrapper approach due to the smaller number of feature combinations that have to be evaluated using classifiers [4].
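The two stopping rules described above can be sketched as follows, assuming a usefulness score per feature and, for the filtrapper variant, a generic scikit-learn MLP as the predictor; both functions are illustrative assumptions, not the procedure used in the experiments.

```python
# Illustrative sketches of the two stopping rules: a fixed usefulness threshold,
# and a "filtrapper"-style evaluation of growing prefixes of the ranking.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def select_by_threshold(ranking, usefulness, threshold):
    return [i for i in ranking if usefulness[i] >= threshold]

def select_by_filtrapper(ranking, data, labels, cv=3):
    best_k, best_score = 1, -np.inf
    for k in range(1, len(ranking) + 1):
        score = cross_val_score(MLPClassifier(max_iter=500),
                                data[:, ranking[:k]], labels, cv=cv).mean()
        if score > best_score:
            best_k, best_score = k, score
    return list(ranking[:best_k])
```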

VIII. EXPERIMENTS AND RESULTS

The results obtained using the proposed algorithm are compared with two state-of-the-art feature selection methods, Optimal Cell Damage (OCD) and ReliefF. In this experiment, all the feature selection techniques use an MLP classifier as a common learning machine to verify the goodness of the final data representation. For the wood dataset, the algorithm is also compared with the approach based on inter-class variation, intra-class variation and feature correlation proposed in [15]. The architecture of the MLPs used for the various datasets given in Table 1 was empirically determined. The topology of the MLPs and the learning parameters used in the experiments are given in Table 2.

Table 1 Datasets used in the experiments

Dataset        # Instances   # Attributes   # Classes
LandSat        6435          36             6
Segmentation   2310          19             7
Wood           232           17             13

The statistical significance of the differences between the classification accuracy results obtained using the full feature sets and those obtained using the reduced feature sets was evaluated using the Welch method.
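As a usage note, Welch's unequal-variance t-test is available in SciPy, so significance values of this kind can be computed along the following lines (illustrative sketch; the original analysis tool is not stated in the paper).

```python
# Welch's unequal-variance t-test via SciPy (illustrative; the original
# analysis tool is not stated in the paper).
from scipy.stats import ttest_ind

def welch_p_value(full_set_accuracies, reduced_set_accuracies):
    _, p = ttest_ind(full_set_accuracies, reduced_set_accuracies, equal_var=False)
    return p
```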


Table 2. Parameter setting of MLP and learning algorithms

MLP Settings                      Satellite   Segmentation   Wood
Hidden layers                     1           1              1
Hidden nodes (fixed structures)   30          30             45
Activation function of neurons    Hyper-tangent (hidden nodes), Sigmoidal (output nodes)

Algorithm Settings                BP rule     ReliefF        OCD
Learning coefficient              0.1         0.1            0.1
Momentum term                     0.01        0.01           0.01
Training subset (% tr. data)      n.a.        n.a.           80%
Validation subset (% tr. data)    n.a.        n.a.           20%
OCD saliency threshold            n.a.        n.a.           *
ReliefF cycles                    n.a.        30             n.a.
ReliefF neighbourhood size        n.a.        20% **         n.a.
ReliefF threshold                 n.a.        *              n.a.

* depending upon data set
** percentage of the size of the smallest class
n.a. not applicable
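For reference, a roughly equivalent MLP can be configured in scikit-learn as sketched below; note that scikit-learn applies a single activation to all hidden layers and a softmax/logistic output, so this only approximates the hyper-tangent/sigmoidal arrangement of Table 2 and is not the network used in the experiments.

```python
# A rough scikit-learn equivalent of the Table 2 topology for the satellite and
# segmentation sets (one hidden layer, 30 tanh nodes, learning coefficient 0.1,
# momentum 0.01); this only approximates the original hyper-tangent/sigmoidal MLP.
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(30,), activation='tanh', solver='sgd',
                    learning_rate_init=0.1, momentum=0.01, max_iter=10000)
```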

A. Filter Approach – ReliefF Algorithm Settings

The following δ(·) function is chosen as the normalised metric to evaluate the difference between attribute values:

δ(r(j), n(j)) = w(r(j), n(j)) · diff(r(j), n(j))        (6)

where r(j) is the jth attribute of a randomly sampled pattern, n(j) is the jth attribute of a nearest neighbour of r(j) (belonging either to the same class or a different one), and diff(r(j), n(j)) and w(r(j), n(j)) are defined as follows:

diff(r(j), n(j)) = (r(j) − n(j)) / (max(X) − min(X))        (7)

w(r(j), n(j)) = exp[ −(rank(n(j)) / (N/3))² ] / Σ_{k=1}^{N} exp[ −(rank(n_k(j)) / (N/3))² ]        (8)

where r(j) ∈ X and n(j) ∈ X, rank(n(j)) is the rank of data pattern n(j) in the sequence of nearest neighbours of r(j), and N is the number of nearest neighbours of r(j) considered for each class. Following experimental trial and error, the number N of nearest neighbours is set to 20% of the size of the smallest class, and the total number m of randomly sampled data instances is set to 30. The latter value is in good accord with the study of Robnik-Sikonja and Kononenko [16]. The threshold for feature retention was experimentally optimised for each data set as given in Table 3. The solutions generated by ReliefF were evaluated using the basicBP procedure. The basicBP algorithm uses the learning parameters presented in Table 2. The number of learning cycles for the basicBP algorithm was experimentally set for each classification problem to maximise the learning results.
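A minimal sketch of the weighted difference metric of equations (6)-(8), for a single attribute j of a sampled pattern and one of its ranked nearest neighbours, is given below; the function and argument names are illustrative assumptions.

```python
# A sketch of equations (6)-(8) for one attribute j: the rank-weighted,
# range-normalised difference between a sampled pattern and one neighbour.
import numpy as np

def relieff_delta(r_j, n_j, rank, N, x_min, x_max):
    """rank: 1-based position of the neighbour among the N nearest neighbours."""
    all_ranks = np.arange(1, N + 1)
    weights = np.exp(-(all_ranks / (N / 3.0)) ** 2)
    w = np.exp(-(rank / (N / 3.0)) ** 2) / weights.sum()   # equation (8)
    diff = (r_j - n_j) / (x_max - x_min)                    # equation (7)
    return w * diff                                         # equation (6)
```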

B. Standard Embedded Approach – Optimal Cell Damage Settings

The OCD algorithm was implemented closely following the guidelines of Cibas [17]. The saliency threshold for feature selection was empirically determined for each data set as given in Table 4.

Table 3. Variation of optimal ReliefF feature retention threshold for the 3 datasets

            Satellite   Segmentation   Wood
Threshold   0.40        0.30           0.25

Table 4. Variation of optimal OCD saliency threshold for the 3 datasets

            Satellite   Segmentation   Wood
Threshold   0.95        0.97           0.90

The performance of OCD, ReliefF and the proposed fractal-based algorithm was calculated over 20 independent trials, and the results obtained for the LandSat, segmentation and wood sets are given in Tables 5, 6 and 7 respectively. The results show that in all cases the proposed fractal-based algorithm performed consistently (i.e. with the least standard deviation) using a smaller feature set compared to OCD and ReliefF.

IX. CONCLUSIONS

The proposed algorithm generates subsets of useful features for classification applications. The usefulness of each feature is efficiently measured through a new, simpler, non-heuristic feature selection framework. Comparable accuracy is obtained for all of the datasets, showing the effectiveness of the subsets selected. The standard deviation calculated over 20 independent trials shows that the proposed algorithm has comparable reliability to other state-of-the-art algorithms. The number of features is reduced while effectively keeping the integrity of the information in all the datasets. The similar nature of MI and FD, both based on mathematical principles, makes the proposed algorithm more suitable for a straightforward analysis of the data. The effect of varying levels of noise in the data on the performance of the proposed fractal-based algorithm will be considered in future work.

Table 5. MLP’s classification accuracy results for the Satellite data

Satellite              Full Set   OCD      ReliefF   Proposed Fractal Algorithm
Learning cycles        8800       n.a.     7100      7300
Accuracy (Mean) %      89.06      84.51    87.74     88.83
Accuracy (Median) %    89.13      84.36    88.03     89.03
Std dev                0.47       1.16     1.70      0.58
Welch Test*            -          0.000    0.003     0.175
Features               36         24.70    22.65     18

* Statistical significance P<0.05; n.a. not applicable

Table 6. MLP’s classification accuracy results for the image segmentation data

Segmentation           Full Set   OCD      ReliefF   Proposed Fractal Algorithm
Learning cycles        10000      n.a.     9600      9900
Accuracy (Mean) %      96.93      93.23    96.21     97.23
Accuracy (Median) %    96.75      93.40    96.21     97.08
Std dev                0.89       1.39     0.93      0.84
Welch Test*            -          0.000    0.017     0.275
Features               19         14.00    11.00     12

* Statistical significance P<0.05; n.a. not applicable

Table 7. MLP’s classification accuracy results for the wood data

Wood                   Full Set   ReliefF   OCD      Manual   Proposed Fractal Algorithm
Learning cycles        1500       6000      n.a.     6000     1500
Accuracy (Mean) %      82.23      82.02     72.24    82.13    82.13
Accuracy (Median) %    81.91      81.91     74.47    82.98    82.98
Std dev                5.58       5.32      8.52     6.78     5.83
Welch Test*            -          0.902     0.000    0.957    0.953
Features               17         11.70     12.40    11       11

* Statistical significance P<0.05; n.a. not applicable

ACKNOWLEDGEMENT

The authors would like to thank I*PROMS NoE and CONACYT Mexico for supporting this work.

REFERENCES

[1] C. Traina Jr., A. Traina, L. Wu and C. Faloutsos, “Fast feature selection using fractal dimension,” in XV Brazilian Symposium on Databases (SBBD), 2000.

[2] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226-1238, August 2005.

[3] A. Y. Ng, “On feature selection: Learning with exponentially many irrelevant features as training examples,” in 15th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, 1998, pp. 404-412.

[4] I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, Feature Extraction: Foundations and Applications, Springer-Verlag: Heidelberg, 2006.

[5] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial Intelligence, vol. 97, pp. 273-324, 1997.

[6] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.

[7] L. Yu and H. Liu, “Efficient feature selection via analysis of relevance and redundancy,” Journal of Machine Learning Research, vol. 5, pp. 1205-1224, 2004.

[8] R. Battiti, “Using mutual information for selecting features in supervised neural net learning,” IEEE Trans. on Neural Networks, vol. 5, no. 4, pp. 537-550, July 1994.

[9] B. B. Mandelbrot, Fractals: Form, Chance and Dimension, W. H. Freeman and Co, San Francisco, 1977.

[10] B. B. Mandelbrot, The Fractal Geometry of Nature, W. H. Freeman and Co, San Francisco, 1982.

[11] M. Barnsley, Fractals Everywhere, Academic Press Inc., London, 1988.

[12] A. Flook, “Features fractals,” Sensor Review, vol. 16, pp. 42-47, 1996.

[13] C. Deisy, B. Subbulakshmi, S. Baskar, and N. Ramaraj, “Efficient dimensionality reduction approaches for feature selection,” in Proceedings of the International Conference on Computational Intelligence and Multimedia Applications (ICCIMA 2007), vol. 2, pp. 121-127, December 2007.

[14] Mathworks. http://www.mathworks.com/matlabcentral.

[15] M. S. Packianather, P. R. Drake, and D. T. Pham, “Feature selection method for neural network for the classification of wood veneer defects,” WAC 2008, Big Island, Hawaii, USA, pp. 1-6, 28 September to 2 October, 2008.

[16] M. Robnik-Sikonja and I. Kononenko, “Theoretical and empirical analysis of ReliefF and RReliefF,” Machine Learning, vol. 53, pp. 23-69, 2003.

[17] T. Cibas et al., “Variable selection with optimal cell damage,” in Proceedings of the International Conference on Artificial Neural Networks, vol. 1, M. Marinaro and P. G. Morasso, Eds., Springer-Verlag, 1994, pp. 727-730.
