Mutual information-based feature selection for intrusion detection systems



Journal of Network and Computer Applications 34 (2011) 1184–1199



1084-8045/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.jnca.2011.01.002

* Corresponding author. Tel.: +98 21 6111 4181; fax: +98 21 88778690.
E-mail addresses: [email protected], [email protected] (F. Amiri), [email protected] (M. Rezaei Yousefi), [email protected] (C. Lucas), [email protected] (A. Shakery), [email protected] (N. Yazdani).



Fatemeh Amiri a,*, MohammadMahdi Rezaei Yousefi a, Caro Lucas a, Azadeh Shakery b, Nasser Yazdani b

a Center of Excellence, Control and Intelligent Processing, School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
b School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran

Article info

Article history:

Received 5 September 2009

Received in revised form 15 December 2010

Accepted 3 January 2011
Available online 14 January 2011

Keywords:

Intrusion detection

Least squares support vector machines (LSSVM)

Mutual information (MI)

Linear correlation coefficient

Feature selection algorithm


Abstract

As network-based technologies become omnipresent, threat detection and prevention for these systems become increasingly important. One of the effective ways to achieve higher security is to use intrusion detection systems, which are software tools used to detect abnormal activities in a computer or network. One technical challenge in intrusion detection systems is the curse of high dimensionality. To overcome this problem, we propose a feature selection phase, which can be implemented in any intrusion detection system. In this work, we propose two feature selection algorithms and compare their performance to a mutual information-based feature selection method. These feature selection algorithms require a feature goodness measure; we investigate both a linear and a non-linear measure, the linear correlation coefficient and mutual information, for the feature selection. Further, we introduce an intrusion detection system that uses an improved machine learning based method, the Least Squares Support Vector Machine. Experiments on the KDD Cup 99 data set show that our proposed mutual information-based feature selection method detects intrusions with higher accuracy, especially for remote-to-local (R2L) and user-to-root (U2R) attacks.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

With the rapid progress in network-based technology and applications, the threat of spammers, attackers and criminal enterprises has also grown accordingly. The 2005 annual computer crime and security survey showed that the total financial losses caused by all kinds of network viruses/intrusions for respondent companies were about US $130 million (C.S. Institute and F.B.O. Investigation, 2005). Furthermore, according to other studies, an average of twenty to forty new vulnerabilities in networking and computer products were detected every month (Patcha and Park, 2007).

The traditional prevention techniques such as user authentication, data encryption, avoidance of programming errors and firewalls are used as the first line of defense for computer security (Lazarevic et al., 2003). Lee et al. (2009) proposed a security vulnerability evaluation and patch framework, which evaluates the computer programs installed on a host to detect known vulnerabilities. After evaluation, each vulnerable program is patched with the latest patch code. However, intruders can bypass these preventive security tools; thus, a second level of defense is necessary, constituted by tools such as anti-virus software and intrusion detection systems (IDS).

Security products like anti-virus software have several limitations. They can protect network users only from malware (viruses, Trojan horses, worms and spyware) whose signature is stored in their database. Signature files of many anti-virus products are updated only on a weekly or daily basis; therefore, computer users are unsafe against new intrusions in the intervals between updates. This is particularly problematic, because new threats can spread across the Internet in a few hours. Furthermore, anti-virus solutions are reactive and do not ensure the safety of the first few computers infected: the signature of a threat must first be detected by anti-virus companies and diagnosed, and finally a remedy must be deployed, and the time of this detection is not predictable.

As opposed to anti-virus programs that detect infected computer programs (Morin and Mé, 2007), an IDS gathers and analyzes information from various areas within a computer or a network (users, processes) in order to identify the subset of activities that violates the security policy. It is designed to give notice that an intruder is trying to get into the system. Traditionally, IDSs have been classified into two categories: signature-based detection and anomaly detection systems. In signature-based systems, attack patterns or behaviors of intruders are modeled, and the system alerts once a match is detected. Such a system is able to detect all known attacks with a low false positive rate. Similar to anti-virus programs, this type of IDS requires frequent attack signature updates to keep the signature database up-to-date. Anomaly detection systems, on the other hand, first create a baseline profile of the normal behavior of the network; afterwards, any activity that deviates from the normal profile is treated as an intrusion. Anomaly detection systems are able to detect previously unknown attacks without the use of signatures.

Machine learning and data mining techniques have recently been used in research to remove the manual and ad hoc elements from the process of building an IDS. Machine learning techniques have the ability to build a system that improves its performance based on newly acquired information. Some data mining detection techniques have been very successful at discovering unknown attacks, since these techniques are data driven (Patcha and Park, 2007).

The number of features extracted from raw network data, which an IDS needs to examine, is usually large even for a small network (Chou et al., 2008; Mukkamala and Sung, 2006). Many researchers have tried to improve the detection rate of IDSs by proposing new classifiers, but improving the effectiveness of classifiers is not an easy task, though feature selection can be used to optimize the existing classifiers. To eliminate unimportant features, feature selection methods have been introduced to the intrusion detection domain. Feature selection helps to reduce computational complexity (shorter training and utilization times), remove information redundancy, increase the accuracy of the learning algorithm, facilitate data understanding and improve generalization. In this line of research, some techniques have been used in developing a lightweight IDS, such as the Markov blanket model (Chebrolu et al., 2005), decision tree analysis (Chebrolu et al., 2005), the flexible neural tree model (Chena et al., 2006) and the hidden Markov model (HMM) (Cho, 2002), explained in Section 2.

Feature selection algorithms can be classified into wrapper and filter methods. While wrapper methods try to optimize some predefined criteria with respect to the feature set as part of the selection process, filter methods rely on the general characteristics of the training data to select features that are independent of each other and are highly dependent on the output. Feature selection methods use a search algorithm to explore the feature space and evaluate possible subsets. To evaluate these subsets, they require a feature goodness measure that grades any subset of features. In general, a feature is good if it is relevant to the output but is not redundant with other relevant features. A feature goodness measure can be the dependency between two features. Two of the most important goodness metrics for selecting features are the correlation coefficient and mutual information (MI).

In this paper, we develop PLSSVM, a novel IDS which uses a reformulation of the Support Vector Machine (SVM), called least squares SVM (LSSVM), for modeling the system. Experiments have shown that an SVM is a good candidate for intrusion detection because of its training speed and scalability (Mukkamala et al., 2005). In general, an SVM may lead to heavy computational challenges for large data sets. It has been shown that an LSSVM with an RBF kernel has excellent generalization performance and low computational cost (Suykens and Vandewalle, 1999). An LSSVM uses equality instead of inequality constraints and uses a least squares cost function. In this work, we propose to add a preprocessing phase, feature selection, to improve the performance of PLSSVM. The selected features can also be used with other IDSs.

Furthermore, to build a lightweight IDS, two dependency-based feature selection algorithms with different evaluation functions are proposed in this work: linear correlation-based feature selection (LCFS) and modified mutual information-based feature selection (MMIFS). For the search strategy, Jain and Zongker (1997) showed that forward search performs best on large data sets, so the search approach of our feature selection algorithms is greedy forward selection. The common benefit of these methods is that they can be applied before the learning phase, so they are independent of the learning process (filter methods).

LCFS is based on the linear correlation coefficient and MMIFS is based on mutual information. Linear correlation helps in detecting features with a near-linear correlation to the system output, but in the real world, correlations are not always linear. A method based on the linear correlation coefficient cannot detect an arbitrary relation between the input features and the outputs, whereas MI can measure arbitrary relations between features (Battiti, 1994).
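As a small illustration of this limitation, consider a feature that fully determines the output through a purely quadratic relation: the linear correlation coefficient is zero, even though the dependence is perfect. This is a minimal sketch with illustrative data, not an example from our experiments:

```python
import numpy as np

# A feature x that fully determines the output y, but non-linearly:
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

r = np.corrcoef(x, y)[0, 1]  # Pearson linear correlation coefficient
# r is 0 here: the linear measure reports no dependence at all,
# although y is a deterministic function of x. An MI-based measure
# would assign this pair a strictly positive score.
```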

We verify the effectiveness of our feature selection methods through several experiments on the benchmark KDD Cup 1999 intrusion detection data set. We study the performance of LCFS and MMIFS compared to a mutual information-based forward feature selection method called the forward feature selection algorithm (FFSA). The experimental results show that our proposed LCFS method is superior when there is a linear correlation between all input features and the output. If all input features are independent of each other, no redundancy takes place and FFSA yields good results. However, the proposed MMIFS is an appropriate method for feature selection with maximum relevance and minimum redundancy. We further compared the performance of our proposed MMIFS method with two MI-based feature selection methods: conditional mutual information maximization (CMIM) (Fleuret, 2004) and max-relevance min-redundancy (mRMR) (Peng et al., 2005). The performance of MMIFS is comparable with mRMR and better than CMIM, except for the DoS attack. Moreover, our experimental results reveal that an IDS with feature selection performs better than one without, both in computational cost and in detection accuracy.

The rest of the paper is organized as follows: in Section 2, related work in this field is reviewed. We describe mutual information in Section 3 and the feature selection algorithms in Section 4. In Section 5, the LSSVM is introduced. The intrusion data set is introduced in Section 6. Experiments and results are presented in Section 7. Then, discussion and conclusions are presented in Sections 8 and 9, respectively.

2. Related works

In the past decades, a great number of intrusion detection systems have been proposed to detect anomalies (Tsai et al., 2009). The next-generation intrusion detection expert system (NIDES) was one of the few intrusion detection systems that could operate in real-time for continuous monitoring of user activity or could run in batch mode for periodic analysis of the audit data (Anderson et al., 1994; Anderson et al., 1995). It generates profiles using statistical measurements. Audit Data Analysis and Mining (ADAM) is one of the known data mining projects in intrusion detection, which uses a module to classify an abnormal event as a false alarm or a real attack (Barbará et al., 2001). It is an online network-based IDS, which used two data mining techniques, association rules and classification. ADAM was one of the seven systems tested in the 1999 DARPA evaluation (Lippmann et al., 2000).

Boukerche et al. (2004) used the natural human immune system to detect anomalies in computer networks. The authors applied the proposed scheme to extract significant features of the human immune system and then mapped these features within a software package designed to provide security for a computer system and to identify irregular activities according to the usage log files.

Current IDSs use many techniques. Some of the techniques that are widely used for intrusion detection are statistics (Lazarevic et al., 2003), hidden Markov models (Ye and Borror, 2004), artificial neural networks (Debar et al., 1992; Novikov et al., 2006; Ramadas and Tjaden, 2003; Fisch et al., 2010), fuzzy logic (Saniee Abadeh et al., 2007; Chimphlee et al., 2006; Toosi and Kahani, 2007), rule learning (Cohen, 1995; Xuren et al., 2006) and outlier detection schemes (Lazarevic et al., 2003).

Researchers have found that the SVM can be used effectively for intrusion detection (Mukkamala et al., 2005; Khan et al., 2007; Yu et al., 2003; Fisch et al., 2010). Mukkamala et al. (2005) examined the performance of the support vector machine, multivariate adaptive regression splines (MARS) and artificial neural networks (ANN). It has been demonstrated that an ensemble of ANN, MARS and SVM is preferable to the individual approaches for intrusion detection in terms of classification accuracy. Zhang and Shen (2005) formulated intrusion detection as a text processing problem, which can be solved by an SVM. Additionally, this system can employ text processing techniques based on the characterization of the frequency of system calls executed by the privileged program. Horng et al. (2010) proposed an SVM-based network intrusion detection system and applied BIRCH hierarchical clustering as data preprocessing. The BIRCH hierarchical clustering reduced and abstracted the data set; thus, training time was reduced and the resultant SVM classifiers showed better performance.

Likewise, due to the high dimensionality of network data, several IDSs that use feature selection as a pre-processing phase have been developed. Mukkamala and Sung (2006) investigated the performance of an IDS based on the SVM, multivariate adaptive regression splines and linear genetic programs. They used a novel significant feature selection algorithm, which is independent of the modeling tools being applied. One input feature is removed from the data at a time; the remaining data set is then used for training and testing the classifier. Afterwards, the classifier's performance is compared to that of the original classifier in terms of relevant performance criteria. Finally, the features are ranked according to a set of rules based on the performance comparison.

Chebrolu et al. (2005) identified important input features for building an IDS that is computationally efficient for real-world detection systems. In the feature selection phase, the Markov blanket model and decision tree analysis were used. Bayesian network (BN) classifiers and classification and regression trees (CART) were used to construct the intrusion detection model.

Chena et al. (2006) used a flexible neural tree (FNT) model for the intrusion detection system. The FNT model can reduce the number of features. Using 41 features, the best accuracy for the DoS and U2R classes is given by the FNT model. The decision tree classifier supplied the best accuracy for the normal and probe classes, slightly better than the FNT classifier.

Sung and Mukkamala (2003) deleted one feature at a time to perform experiments on the SVM and neural networks. The KDD Cup 1999 data set was used to examine this technique. In terms of the five-class classification, it was found that by using only 19 of the most significant features, rather than the full 41-feature set, the change in the performance of intrusion detection was statistically insignificant.

Cho (2002) reports a work where fuzzy logic and the hidden Markov model have been deployed together to detect intrusions. In this approach, the hidden Markov model is used for dimensionality reduction.

Li et al. (2009) proposed a wrapper-based feature selection algorithm to build a lightweight IDS. They applied modified RMHC as the search strategy and a modified linear SVM as the evaluation criterion. Their approach speeds up the process of selecting features and yields high detection rates for an IDS.

The efforts to use mutual information in the feature selection problem have led to a series of algorithms. Peng et al. (2005) proposed a minimal-redundancy-maximal-relevance criterion (mRMR) for selecting features incrementally. This criterion makes it possible to select features at very low cost. They compared their proposed method with the maximal relevance criterion, using three different classifiers. The results confirm that mRMR feature selection can improve classification accuracy. The method can use both continuous and discrete data sets, but the experiments show that discretization in many cases gives better features.

Fleuret (2004) proposed a simple, fast and efficient feature selection method based on conditional mutual information. He compared his proposed method with other feature selection techniques like C4.5 binary trees and the fast correlation-based filter. In order to apply this method, binary input features are required. He showed that CMIM along with a naïve Bayesian classifier is comparable to, and better than, methods like the support vector machine and boosting.

3. Mutual information

In information theory, the mutual information (MI) can be applied to evaluate any arbitrary dependency between random variables. In fact, the MI between two random variables X and Y is a measure of the amount of knowledge on Y supplied by X (or, conversely, the amount of knowledge on X supplied by Y). If X and Y are independent, i.e. X contains no information about Y and vice versa, then their mutual information is zero.

The MI of two random variables X and Y is defined as

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)    (1)

where H(.) is the entropy, H(X|Y) and H(Y|X) are conditional entropies, and H(X,Y) is the joint entropy of X and Y, defined as

H(X) = -\int_x p_X(x) \log p_X(x) \, dx    (2)

H(Y) = -\int_y p_Y(y) \log p_Y(y) \, dy    (3)

H(X,Y) = -\int_x \int_y p_{X,Y}(x,y) \log p_{X,Y}(x,y) \, dx \, dy    (4)

where p_{X,Y}(x,y) is the joint probability density function and p_X(x) and p_Y(y) are the marginal density functions of X and Y, respectively. The marginal density functions are

p_X(x) = \int_y p_{X,Y}(x,y) \, dy    (5)

p_Y(y) = \int_x p_{X,Y}(x,y) \, dx    (6)

By substituting Eqs. (2)-(4) into Eq. (1), the MI equation becomes

I(X;Y) = \int_x \int_y p_{X,Y}(x,y) \log \frac{p_{X,Y}(x,y)}{p_X(x) p_Y(y)} \, dx \, dy    (7)

In discrete form, the integration is replaced by summation over all possible values that appear in the data. Therefore, it is only required to estimate p_{X,Y}(x,y) in order to estimate the MI between X and Y.

Kraskov et al. (2004) proposed to use k-nearest neighbor statistics to estimate the entropies and compute the MI. They estimated the MI between two random variables of any multi-dimensional space. The basic idea is to estimate the entropy based on the average distance to the k-nearest neighbors. See Appendix A for more details.
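For discrete data, the summation form can be implemented directly from the empirical joint and marginal distributions. The following is a minimal plug-in (histogram) estimator of Eq. (7) in nats; it is only a sketch of the discrete case, not the k-nearest-neighbor estimator of Kraskov et al.:

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for paired discrete samples:
    the integrals of Eq. (7) become sums over the observed (x, y) values."""
    n = len(xs)
    p_xy = Counter(zip(xs, ys))          # empirical joint distribution
    p_x, p_y = Counter(xs), Counter(ys)  # empirical marginals
    return sum(c / n * log(c * n / (p_x[x] * p_y[y]))
               for (x, y), c in p_xy.items())
```

For identical variables this reduces to the entropy H(X), and for independent samples it approaches zero, matching the properties stated above.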


4. Feature selection algorithms

4.1. Forward feature selection algorithm

Feature selection is the process of selecting the most relevant features for building appropriate models. In the forward feature selection algorithm (FFSA), each single input feature is added to the selected feature set based on maximizing the MI between the selected inputs and the output. This procedure is performed until n input features have been selected, where n is determined a priori. The algorithm can be described by the following procedure:

1) Initialization: set F ← 'initial set of all features', S ← 'empty set', y ← 'class outputs'.
2) Computation of the mutual information of the features with the class outputs: for each feature f_i ∈ F, compute I(f_i; y).
3) Selection of the first feature: find the feature f_i that maximizes I(f_i; y); set F ← F \ {f_i}, S ← {f_i}.
4) Greedy selection: repeat until the desired number of features is selected:
   a. Computation of the mutual information between feature sets and class outputs: for all features f_i ∈ F, if it is not already available, compute I(S ∪ {f_i}; y).
   b. Selection of the next feature: choose the feature f_i that maximizes I(S ∪ {f_i}; y); set F ← F \ {f_i}, S ← S ∪ {f_i}.
5) Output the set containing the selected features: S.
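For discrete features, the procedure above can be sketched as follows. The plug-in joint-MI estimator treats the candidate group of columns as one tuple-valued variable; it is an illustrative stand-in for whatever MI estimator is actually used, and the variable names are ours:

```python
from collections import Counter
from math import log

def joint_mi(columns, y):
    """Plug-in estimate of I(columns; y) for discrete data: the group
    of columns is treated as a single tuple-valued variable."""
    xs = list(zip(*columns))
    n = len(y)
    p_xy, p_x, p_y = Counter(zip(xs, y)), Counter(xs), Counter(y)
    return sum(c / n * log(c * n / (p_x[x] * p_y[v]))
               for (x, v), c in p_xy.items())

def ffsa(features, y, n_select):
    """Greedy forward selection (steps 1-5): at each step add the
    feature whose inclusion maximizes I(S ∪ {f_i}; y)."""
    remaining = dict(features)   # name -> column of discrete values
    selected = []
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda f: joint_mi(
            [remaining[f]] + [features[s] for s in selected], y))
        selected.append(best)
        del remaining[best]
    return selected
```

A feature identical to the output is picked first, since it alone already carries the full MI with the class labels.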

4.2. Modified mutual information-based feature selection algorithm

The mutual information-based feature selection algorithm was proposed by Battiti (1994). The objective is to maximize the relevance between the input features and the output and to minimize the redundancy of the selected features. The algorithm computes I(f_i; y) and I(f_i; f_s), where f_i and f_s are single input features. The algorithm selects one feature at a time: the one maximizing the information with the output. This mutual information expression is adjusted by subtracting a quantity proportional to the average mutual information within the selected features. Our proposed MMIFS algorithm can be described as follows:

1) Initialization: set F ← 'initial set of all features', S ← 'empty set', y ← 'class outputs'.
2) Computation of the mutual information of the features with the class outputs: for each feature f_i ∈ F, compute I(f_i; y).
3) Selection of the first feature: find the feature f_i that maximizes I(f_i; y); set F ← F \ {f_i}, S ← {f_i}.
4) Greedy selection: repeat until the desired number of features is selected:
   a. Computation of the mutual information between features: for all pairs of features (f_i, f_s), where f_i ∈ F and f_s ∈ S, compute I(f_i; f_s) if it is not already computed.
   b. Selection of the next feature: choose the feature f_i that maximizes I(f_i; y) - (β/|S|) Σ_{f_s ∈ S} I(f_i; f_s); set F ← F \ {f_i}, S ← S ∪ {f_i}.
5) Output the set containing the selected features: S.

Battiti (1994) proposed I(f_i; y) - β Σ_{f_s ∈ S} I(f_i; f_s) as the evaluation function in the fourth step of the algorithm. It does not consider the effect of the number of selected inputs, and thus the effect of the first term, I(f_i; y), decreases as the number of selected inputs increases. In order to avoid this, we suggest I(f_i; y) - (β/|S|) Σ_{f_s ∈ S} I(f_i; f_s) in step (4). β is a parameter determined empirically. If β = 0, the algorithm only attempts to maximize the mutual information with the output, so the dependency between features is not considered. Battiti (1994) proposed a value between 0.5 and 1 for β. We use β = 0.5 in our experiments.
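The modified criterion of step 4b can be sketched for discrete features as follows. Here `mi` is an illustrative plug-in estimate of the discrete form of Eq. (7), and β defaults to the value 0.5 used in our experiments; the helper names are ours:

```python
from collections import Counter
from math import log

def mi(a, b):
    # Plug-in discrete estimate of I(a; b) in nats.
    n = len(a)
    p_ab, p_a, p_b = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum(c / n * log(c * n / (p_a[x] * p_b[y]))
               for (x, y), c in p_ab.items())

def mmifs(features, y, n_select, beta=0.5):
    """Greedy MMIFS: relevance I(f_i; y) minus beta/|S| times the
    summed redundancy with already-selected features (step 4b)."""
    remaining = dict(features)   # name -> column of discrete values
    selected = []
    while remaining and len(selected) < n_select:
        def score(f):
            relevance = mi(remaining[f], y)
            if not selected:
                return relevance
            redundancy = sum(mi(remaining[f], features[s]) for s in selected)
            return relevance - (beta / len(selected)) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected
```

With y = a OR b and c a duplicate of a, the redundancy term pushes the duplicate column below the complementary one, so {a, b} is selected rather than {a, c}.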

4.3. Linear correlation-based feature selection algorithm

The linear correlation coefficient is the most popular measure of dependence between two random variables. It is able to capture correlations that are linear. For a feature X with values x and a class Y with values y, treated as random variables, it is defined as

corr(X;Y) = r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \sum_{i=1}^{N} (y_i - \bar{y})^2}}    (14)

where \bar{x} and \bar{y} are the expected values of X and Y. corr(X;Y) is equal to ±1 if X and Y are linearly dependent and zero if they are completely independent. The evaluation function of the proposed LCFS has the same form as that of MMIFS, but in order to reduce the computational complexity of the proposed MMIFS, the LCFS method uses the linear correlation coefficient as the goodness measure. We use β = 0.5 in our experiments for this method. The following algorithm is proposed based on this coefficient:

1) Initialization: set F ← 'initial set of all features', S ← 'empty set', y ← 'class outputs'.
2) Computation of the correlation coefficient of the features with the class outputs: for each feature f_i ∈ F, compute corr(f_i; y).
3) Selection of the first feature: find the feature f_i that maximizes corr(f_i; y); set F ← F \ {f_i}, S ← {f_i}.
4) Greedy selection: repeat until the desired number of features is selected:
   a. Computation of the correlation coefficient between features: for all pairs of features (f_i, f_s) with f_i ∈ F and f_s ∈ S, compute corr(f_i; f_s) if it is not already available.
   b. Selection of the next feature: choose the feature f_i that maximizes corr(f_i; y) - (β/|S|) Σ_{f_s ∈ S} corr(f_i; f_s); set F ← F \ {f_i}, S ← S ∪ {f_i}.
5) Output the set containing the selected features: S.
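The same greedy loop can be sketched with the linear correlation coefficient of Eq. (14) as the goodness measure. This is only an illustrative sketch for continuous features; the column-index interface and the β default are our choices:

```python
import numpy as np

def lcfs(X, y, n_select, beta=0.5):
    """Greedy LCFS on an (N, d) feature matrix X: step 4b scores each
    remaining column j by corr(f_j; y) - beta/|S| * sum of corr(f_j; f_s)."""
    corr = lambda u, v: np.corrcoef(u, v)[0, 1]   # Eq. (14)
    remaining = list(range(X.shape[1]))
    selected = []
    while remaining and len(selected) < n_select:
        def score(j):
            rel = corr(X[:, j], y)
            if not selected:
                return rel
            red = sum(corr(X[:, j], X[:, s]) for s in selected)
            return rel - beta / len(selected) * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Note that constant columns make Eq. (14) undefined (zero variance), so such features should be dropped beforehand.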

5. Least squares support vector machine

Support vector machines (SVM) are a supervised learning method used for classification and regression problems. An SVM can be trained with a large number of patterns (lssvmlab, www.esat.kuleuven.ac.be/sista/lssvmlab/). An LSSVM is a regularized reformulation of the standard SVM. A linear system has to be solved in the optimization stage, which not only simplifies the process but is also effective in avoiding the local minima of SVM problems (lssvmlab, www.esat.kuleuven.ac.be/sista/lssvmlab/). We apply the LSSVM as a classifier that detects normal and attack data.

The LSSVM model is defined by

y(x) = ω^T φ(x) + b    (15)

where x_i is the i-th p-dimensional vector of features, ω and b are the parameters of the model and φ(.) is a mapping of the feature space into a higher-dimensional space (lssvmlab, www.esat.kuleuven.ac.be/sista/lssvmlab/). Given N training pairs {y_i, x_i}_{i=1}^{N} ∈ R × R^p, compute the parameters ω, b, e of the hyperplane so as to

min_{ω,b,e} J(ω, b, e) = (1/2) ω^T ω + γ (1/2) Σ_{i=1}^{N} e_i^2    (16)

subject to y_i [ω^T φ(x_i) + b] = 1 - e_i, i = 1, ..., N.

This optimization problem can be solved in a dual space. To solve it, the following Lagrangian is defined:

L(ω, b, e; α) = J(ω, b, e) - Σ_{i=1}^{N} α_i { y_i [ω^T φ(x_i) + b] - 1 + e_i }    (17)

F. Amiri et al. / Journal of Network and Computer Applications 34 (2011) 1184–11991188

where $\alpha_i$ are the Lagrange multipliers. The conditions for optimality,

$$\frac{\partial L}{\partial \omega} = 0 \;\rightarrow\; \omega = \sum_{i=1}^{N} \alpha_i y_i \varphi(x_i) \tag{18}$$

$$\frac{\partial L}{\partial b} = 0 \;\rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0 \tag{19}$$

$$\frac{\partial L}{\partial e_i} = 0 \;\rightarrow\; \alpha_i = \gamma e_i, \quad i = 1, \ldots, N \tag{20}$$

$$\frac{\partial L}{\partial \alpha_i} = 0 \;\rightarrow\; y_i[\omega^T \varphi(x_i) + b] - 1 + e_i = 0, \quad i = 1, \ldots, N \tag{21}$$

can be written as the solution to the following set of linear equations:

$$\begin{bmatrix} I & 0 & 0 & -Z^T \\ 0 & 0 & 0 & -Y^T \\ 0 & 0 & \gamma I & -I \\ Z & Y & I & 0 \end{bmatrix} \begin{bmatrix} \omega \\ b \\ e \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vec{1} \end{bmatrix} \tag{22}$$

Table 1
Lists of features in the KDD Cup 99 data set.

Category 1
1. Duration (continuous): length (number of seconds) of the connection
2. Protocol-type (discrete): type of the protocol, e.g., tcp, udp, etc.
3. Service (discrete): network service on the destination, e.g., http, telnet, etc.
4. Flag (discrete): normal or error status of the connection
5. Src-bytes (continuous): number of data bytes from source to destination
6. Dst-bytes (continuous): number of data bytes from destination to source
7. Land (discrete): 1 if connection is from/to the same host/port; 0 otherwise
8. Wrong-fragment (continuous): number of "wrong" fragments
9. Urgent (continuous): number of urgent packets

Category 2
10. Hot (continuous): number of "hot" indicators (hot: number of directory accesses, create and execute program)
11. Num-failed-logins (continuous): number of failed login attempts
12. Logged-in (discrete): 1 if successfully logged in; 0 otherwise
13. Num-compromised (continuous): number of "compromised" conditions (compromised condition: number of file/path not-found errors and jumping commands)
14. Root-shell (discrete): 1 if root shell is obtained; 0 otherwise
15. Su-attempted (discrete): 1 if "su root" command attempted; 0 otherwise
16. Num-root (continuous): number of "root" accesses
17. Num-file-creations (continuous): number of file creation operations
18. Num-shells (continuous): number of shell prompts
19. Num-access-files (continuous): number of operations on access control files
20. Num-outbound-cmds (continuous): number of outbound commands in an ftp session
21. Is-host-login (discrete): 1 if the login belongs to the "hot" list; 0 otherwise
22. Is-guest-login (discrete): 1 if the login is a "guest" login; 0 otherwise

Category 3
23. Count (continuous): number of connections to the same host as the current connection in the past 2 s
24. Srv-count (continuous): number of connections to the same service as the current connection in the past 2 s (same-host connections)
25. Serror-rate (continuous): % of connections that have "SYN" errors (same-host connections)
26. Srv-serror-rate (continuous): % of connections that have "SYN" errors (same-service connections)
27. Rerror-rate (continuous): % of connections that have "REJ" errors (same-host connections)
28. Srv-rerror-rate (continuous): % of connections that have "REJ" errors (same-service connections)
29. Same-srv-rate (continuous): % of connections to the same service (same-host connections)
30. Diff-srv-rate (continuous): % of connections to different services (same-host connections)
31. Srv-diff-host-rate (continuous): % of connections to different hosts (same-service connections)

Category 4
32. Dst-host-count (continuous): count for destination host
33. Dst-host-srv-count (continuous): srv_count for destination host
34. Dst-host-same-srv-rate (continuous): same_srv_rate for destination host
35. Dst-host-diff-srv-rate (continuous): diff_srv_rate for destination host
36. Dst-host-same-src-port-rate (continuous): same_src_port_rate for destination host
37. Dst-host-srv-diff-host-rate (continuous): diff_host_rate for destination host
38. Dst-host-serror-rate (continuous): serror_rate for destination host
39. Dst-host-srv-serror-rate (continuous): srv_serror_rate for destination host
40. Dst-host-rerror-rate (continuous): rerror_rate for destination host
41. Dst-host-srv-rerror-rate (continuous): srv_rerror_rate for destination host

where $Z = [\varphi(x_1)^T y_1; \ldots; \varphi(x_N)^T y_N]$, $Y = [y_1, \ldots, y_N]$, $\vec{1} = [1, \ldots, 1]$, $e = [e_1, \ldots, e_N]$ and $\alpha = [\alpha_1, \ldots, \alpha_N]$. The solution is

$$\begin{bmatrix} 0 & Y^T \\ Y & ZZ^T + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ \vec{1} \end{bmatrix} \tag{23}$$

Mercer's condition can be applied to the matrix $\Omega = ZZ^T$, where

$$\Omega_{il} = y_i y_l \, \varphi(x_i)^T \varphi(x_l), \quad i, l = 1, \ldots, N \tag{24}$$

Thus, the classifier (15) is determined by solving the linear equations (23)-(24). Details about the estimation of the parameters are given in Suykens and Vandewalle (1999).
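Training therefore reduces to a single linear solve over the dual variables. The following NumPy sketch illustrates this (our own illustration, not the LS-SVMlab implementation; the RBF kernel choice and all names are assumptions):

```python
import numpy as np

def lssvm_train(X, y, gamma=1.0, sigma=1.0):
    """Solve the LSSVM linear system for (b, alpha), using an RBF kernel."""
    n = len(y)
    # kernel matrix K(x_i, x_l); Mercer's condition lets us work with K directly
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma ** 2))
    Omega = np.outer(y, y) * K                   # Omega_il = y_i y_l K(x_i, x_l)

    # block system: [0, Y^T; Y, Omega + I/gamma] [b; alpha] = [0; 1]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]

    def predict(Xnew):
        sq_new = ((Xnew[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        Knew = np.exp(-sq_new / (2 * sigma ** 2))
        # y(x) = sign( sum_i alpha_i y_i K(x, x_i) + b )
        return np.sign(Knew @ (alpha * y) + b)
    return predict
```

On a small separable toy set this recovers the training labels; in the paper's pipeline, the LS-SVMlab toolbox is used instead.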

6. Intrusion data set

The data set used in these experiments is the "KDD Cup 1999 data" (kdd.ics.uci.edu/databases/kddcup99/kddcup99.html), a well-known set of intrusion evaluation data. The raw training data was processed into about five million connection records. A connection is a sequence of TCP packets starting and ending at some well-defined times. Each record is



unique in the data set, with 41 continuous and nominal features plus one class label. In this paper, the nominal features, such as protocol type (tcp/udp/icmp), service type (http/ftp/telnet/...) and TCP status flag (sf/rej/...), have been converted into numeric features, simply by replacing the values of the categorical attributes with numeric values. For example, for the protocol-type attribute in KDD Cup 99, the value tcp is replaced with 1, udp with 2 and icmp with 3.

The features can be classified into four different categories. The first category, containing the features labeled 1-9, holds the basic features of individual TCP connections. The next category, labeled 10-22, corresponds to content features. The third category, labeled 23-31, contains traffic features computed using a two-second time window, and the fourth category, labeled 32-41, contains traffic features computed over connections to the same destination host. The labels of the features and their corresponding network data features, along with the categories, are shown in Table 1. Each row represents a feature: the first column is the category of the feature, and the other columns show the feature label/name, feature type and description.
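The nominal-to-numeric conversion described above amounts to a simple lookup table; a minimal sketch (the mappings shown are illustrative, not the full KDD Cup 99 value lists):

```python
# Illustrative mapping for the protocol-type attribute, as described above.
PROTOCOL_MAP = {"tcp": 1, "udp": 2, "icmp": 3}

def encode_protocol(value):
    return PROTOCOL_MAP[value]

def build_encoding(values):
    """Map each distinct nominal value to a small integer,
    in order of first appearance (a generic version of the same idea)."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping) + 1)
    return mapping
```

`build_encoding` can be applied in the same way to the service and flag attributes, whose value lists are longer.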

The KDD Cup 99 data set contains 24 attack types that have been categorized into four groups: Probe, Denial of Service (DOS), User to Root (U2R) and Remote to User (R2L). Details of these categories are as follows (Mukkamala et al., 2005).

Probe is a class of attacks in which an attacker scans a network to gather information about the target host. The next class of attacks is DOS, in which an attacker consumes some computing or memory resource to prevent legitimate behaviors of users.

U2R attacks are a class of attacks in which an attacker has local access to the system and is able to exploit vulnerabilities to gain root permissions. The last class of attacks is the R2L attack, where an

Table 2
Confusion matrix for performance evaluation (rows: actual class; columns: predicted class).

Normal classifier (negative class: Normal; positive class: non-Normal):
  Actual negative: true negative (TN), false positive (FP)
  Actual positive: false negative (FN), true positive (TP)

DOS classifier (negative class: non-DOS; positive class: DOS):
  Actual negative: TN, FP
  Actual positive: FN, TP

Probe classifier (negative class: non-Probe; positive class: Probe):
  Actual negative: TN, FP
  Actual positive: FN, TP

R2L classifier (negative class: non-R2L; positive class: R2L):
  Actual negative: TN, FP
  Actual positive: FN, TP

U2R classifier (negative class: non-U2R; positive class: U2R):
  Actual negative: TN, FP
  Actual positive: FN, TP

attacker does not have an account on the victim machine, hencesends packets to it over a network to illegally gain access as alocal user.

7. Experiments and results

7.1. Our proposed system

In order to classify the records, we need to employ five LSSVMs, because an SVM performs binary classification. These LSSVMs identify five classes: one representing normal activity (Normal) and four representing attacks on the system (DOS, Probe, R2L and U2R). For example, the Normal classifier separates normal from non-normal data (all types of attacks), the Probe classifier separates Probe from non-Probe data (including Normal, DOS, U2R and R2L instances), and so on. Based on the confusion matrices in Table 2, we use a numerical evaluation to quantify the performance of our proposed IDS (Wu and Banzhaf, 2010).

- True positive rate (TPR): TP/(TP+FN), also known as the detection rate (DR), sensitivity or recall.
- False positive rate (FPR): FP/(TN+FP), also known as the false alarm rate.
- Accuracy: (TN+TP)/(TN+TP+FN+FP).
- Classification rate: for each class, the ratio between the number of test instances correctly classified and the total number of test instances of that class.
- Cost per example (CPE): $(1/N) \sum_{i=1}^{m} \sum_{j=1}^{m} CM(i, j)\, C(i, j)$
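The first three measures follow directly from the four counts of a confusion matrix in Table 2; a minimal sketch (our own helper, not from the paper):

```python
def binary_metrics(tp, fn, fp, tn):
    """TPR (detection rate), FPR (false alarm rate) and accuracy, as defined above."""
    tpr = tp / (tp + fn)
    fpr = fp / (tn + fp)
    acc = (tn + tp) / (tn + tp + fn + fp)
    return tpr, fpr, acc
```

For example, 90 detected attacks out of 100 with 5 false alarms on 100 normal records gives TPR 0.9, FPR 0.05 and accuracy 0.925.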



where CM and C are the confusion and cost matrices, N represents the number of test samples and m is the number of classes in

Table 5
The sample distributions on the test data with the corrected labels of the KDD Cup 99 data set.

Class    Total number of samples    Number of novel attack samples
Normal   60,593                     -
DOS      229,853                    6555
Probe    4166                       1789
R2L      16,189                     10,196
U2R      228                        189
Total    311,029                    18,729

Table 4
Sample distributions of the first training data, randomly selected from the "10% data of KDD Cup 99 data set". The U2R instances are resampled by the bootstrap technique for the U2R class.

Model         Normal   DOS    Probe   R2L    U2R
Normal class  2000     3790   300     350    32
DOS class     1410     4340   340     350    32
Probe class   1300     3390   1450    300    32
R2L class     1000     4240   200     1000   32
U2R class     600      3330   300     250    2000

classification. A confusion matrix is a square matrix in which the rows show actual classes, while the columns correspond to the predicted classes. The entry at row i and column j, CM(i, j), represents the number of misclassified records that originally belong to class i but were incorrectly identified as members of class j; CM(i, i) shows the number of properly detected records. In the cost matrix, entry C(i, j) represents the cost penalty for misclassifying a record belonging to class i into class j (Toosi and Kahani, 2007). Table 3 shows the cost matrix values employed for the KDD'99 classifier learning contest. Lower values of the CPE measure indicate better detection in the IDSs.
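Given a full 5x5 confusion matrix and the cost matrix of Table 3, CPE is a single weighted sum. A sketch (our own illustration; the class order Normal, DOS, Probe, U2R, R2L follows Table 3):

```python
import numpy as np

# KDD'99 cost matrix from Table 3 (rows: actual class, columns: predicted class;
# class order: Normal, DOS, Probe, U2R, R2L).
COST = np.array([
    [0, 2, 1, 2, 2],
    [2, 0, 1, 2, 2],
    [1, 2, 0, 2, 2],
    [3, 2, 2, 0, 2],
    [4, 2, 2, 2, 0],
])

def cost_per_example(cm, cost=COST):
    """CPE = (1/N) * sum_ij CM(i, j) * C(i, j), with cm a matrix of record counts."""
    return float((cm * cost).sum() / cm.sum())
```

A perfect classifier has CPE 0, since only the zero-cost diagonal of the cost matrix is touched.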

Our experiments over each classifier have four phases: data normalization, data reduction, training and testing. After all attribute values of each training and test data set are scaled to the range [0, 1] by dividing every attribute value by its own maximum value, we utilize the feature selection algorithms to select important features for each class. To achieve the best performance, the results of the three feature selection methods are compared. After data scaling and reduction, the next step is to construct the LSSVM-based intrusion detection model using the selected features. The test data are then passed through the saved trained model to detect intrusions in the testing phase.
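The normalization phase can be sketched as follows (our own illustration of the per-attribute max scaling described above):

```python
import numpy as np

def scale_by_max(data):
    """Scale every attribute (column) to [0, 1] by dividing by its maximum value."""
    col_max = data.max(axis=0).astype(float)
    col_max[col_max == 0] = 1.0          # leave all-zero columns unchanged
    return data / col_max
```

The same function is applied independently to the training and test matrices, mirroring the description above.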

7.2. The data sources

Since the KDD Cup 99 training set contains more than four million data points, and such a large data set cannot be fed to an SVM in the training phase, we randomly selected 6480 records from the five classes as the training data and 6703 records as the evaluation data. The distribution of the samples in the subsets that were used for the training is listed in Table 4. The number of samples of each class in the subsets is selected in proportion to its size in the "10% data of KDD Cup 99 training set". In the KDD Cup training set, 79% of the records (391,458) are DOS traffic and the remainder are Normal, R2L, Probe and U2R traffic. There are 52 records of the U2R attack in the data set, of which 20 are used for evaluation and the remaining records are used for training. As shown in Table 4 for the U2R class, we have used a bootstrap technique in order to resample the U2R attacks.
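The bootstrap step mentioned above simply samples the rare class with replacement; a minimal sketch (our own illustration, not the authors' code):

```python
import random

def bootstrap_resample(records, n, seed=0):
    """Draw n records with replacement: the bootstrap step used to inflate
    the rare U2R class in the training subsets (illustrative)."""
    rng = random.Random(seed)
    return [rng.choice(records) for _ in range(n)]
```

With only 32 genuine U2R training records, resampling with replacement is what produces the 2000 U2R rows shown in Table 4 for the U2R classifier.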

Moreover, in order to evaluate the performance of our feature selection algorithms, we applied the KDD Cup 1999 labeled test data set. The test data set contains some new attacks that do not appear in the training data set (Table 5).

7.3. Results

All experiments were performed on a Windows platform with an Intel Core 2 Duo CPU at 2.49 GHz and 3 GB of RAM.

Table 3
Characteristics of the cost matrix for the KDD'99 classifier learning contest. The columns correspond to predicted classes; the rows correspond to actual classes.

Actual \ Predicted   Normal   DOS   Probe   U2R   R2L
Normal               0        2     1       2     2
DOS                  2        0     1       2     2
Probe                1        2     0       2     2
U2R                  3        2     2       0     2
R2L                  4        2     2       2     0

We have used the open toolbox LS-SVMlab (www.esat.kuleuven.ac.be/sista/lssvmlab/) to implement the LSSVM and model the IDS. The feature selection algorithms proposed in this paper can only rank features in terms of their importance; they cannot indicate the optimal number of features. So, to determine the best number of features, we begin with the best feature and incrementally add features to an LSSVM in order of their importance. The optimal number of features in each algorithm is the one that has shown the best accuracy on the training data. Figs. 1-3 plot the effect of the selected features in the proposed feature selection algorithms for the Normal class. The detection and false positive rates are depicted at each step of the feature selection algorithm; in the ith step, all features selected up to that step are applied to build an IDS. For MMIFS, in the sixth step, the difference between the detection and false positive rates is maximal and the detection rate is above 90%, so the first six selected features are chosen as the most important ones (Fig. 1); likewise, the first six features selected by FFSA and the first fifteen features selected by LCFS are chosen (Figs. 2 and 3).
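The stopping rule described above, picking the step with the largest gap between detection and false positive rates, can be sketched as follows (a hypothetical helper; the minimum detection-rate threshold is our assumption based on the 90% figure mentioned above):

```python
def best_feature_count(detection_rates, false_positive_rates, min_dr=0.9):
    """Pick the step where DR - FPR is largest, subject to DR staying above a floor.
    Inputs are the rates observed after adding the 1st, 2nd, ... ranked feature.
    Returns None if no step clears the floor."""
    best_k, best_gap = None, float("-inf")
    for k, (dr, fpr) in enumerate(zip(detection_rates, false_positive_rates), start=1):
        gap = dr - fpr
        if dr >= min_dr and gap > best_gap:
            best_k, best_gap = k, gap
    return best_k
```

Applied to the per-step curves of Figs. 1-3, this kind of rule reproduces the choices of six features for MMIFS and FFSA and fifteen for LCFS.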

We use our feature selection algorithms to select the best feature subsets for all classifiers; the selected feature subsets are shown in Tables 6a-e. In each table, each row shows the number of important features selected by a feature selection algorithm and the labels of these important features.

It is clear that after the feature selection algorithm, the important feature subsets for each class are greatly reduced. Experiments show that the features labeled src_byte, dst_byte, service and count play an important role in discriminating abnormal activities.

We conducted several experiments to compare the time of the feature selection process between the different feature selection methods. Table 7 shows the time consumed by the feature selection process using the three different algorithms for all classes. It demonstrates that LCFS is the fastest method and FFSA is the slowest one.

Their evaluation function requires O

We compared the evaluation results of the models using the selected feature subsets with those using all 41 features in two respects: Receiver Operating Characteristic (ROC) curves and confidence intervals.

ROC curves are shown in Figs. 4-8 in terms of adding the features ranked by our proposed feature selection algorithms. The best performance in each class can be obtained by utilizing

Fig. 2. The effect of the selected features in each step of the forward feature selection algorithm (FFSA) is shown for the Normal class. The difference between the false positive and detection rates is maximal in the 6th step, so the first six features are selected for the modeling system. (a) Detection rate at each step; (b) false positive rate at each step.

Fig. 1. The effect of the selected features in each step of the modified mutual information feature selection algorithm (MMIFS) is shown for the Normal class. The difference between the false positive and detection rates is maximal in the 6th step, so the first six features are selected for the modeling system. (a) Detection rate at each step; (b) false positive rate at each step.


Fig. 3. The effect of the selected features in each step of the linear correlation feature selection algorithm (LCFS) is shown for the Normal class. The difference between the false positive and detection rates is maximal in the 15th step, so the first fifteen features are selected for the modeling system. (a) Detection rate at each step; (b) false positive rate at each step.

Table 6a
Selected features for the Normal class (each row shows the feature selection algorithm, the optimal number of features and the labels of the important features).

Method   #Features   Selected features
FFSA     6           5, 3, 1, 4, 34, 6
MMIFS    6           5, 23, 3, 6, 35, 1
LCFS     15          12, 34, 33, 3, 23, 27, 29, 40, 39, 28, 2, 41, 26, 35, 10

Table 6b
Selected features for the DOS class (each row shows the feature selection algorithm, the optimal number of features and the labels of the important features).

Method   #Features   Selected features
FFSA     3           5, 38, 3
MMIFS    8           5, 23, 6, 2, 24, 41, 36, 3
LCFS     36          32, 27, 23, 38, 41, 24, 13, 2, 40, 22, 30, 25, 28, 35, 26, 37, 12, 36, 39, 1, 10, 14, 11, 17, 33, 16, 19, 18, 9, 5, 34, 31, 6, 3, 29, 3

Table 6c
Selected features for the Probe class (each row shows the feature selection algorithm, the optimal number of features and the labels of the important features).

Method   #Features   Selected features
FFSA     24          40, 5, 41, 11, 2, 22, 9, 27, 37, 28, 14, 19, 31, 18, 1, 17, 16, 13, 25, 39, 26, 6, 30, 32
MMIFS    13          40, 5, 33, 23, 28, 3, 41, 35, 27, 32, 12, 24, 28
LCFS     7           27, 12, 40, 41, 34, 28, 35

Table 6d
Selected features for the R2L class (each row shows the feature selection algorithm, the optimal number of features and the labels of the important features).

Method   #Features   Selected features
FFSA     10          3, 6, 4, 11, 9, 33, 37, 38, 22, 25
MMIFS    15          3, 13, 22, 23, 10, 5, 35, 24, 6, 33, 37, 32, 1, 37, 39
LCFS     4           22, 38, 10, 3


the proposed MMIFS or FFSA, because of their lower false positive rates and higher detection rates. These feature selection methods use mutual information for their decisions, so mutual information is a better feature goodness measure than the correlation coefficient on these data. MMIFS takes into account not only the relevance of the candidate feature to the output classes, but also the redundancy between the candidate feature and the already selected features.

Results of classification with 99% confidence intervals are shown in Table 8. For each class, Normal or attack, each row shows the performance of each feature selection algorithm in


detecting intrusions. Our experimental results show that feature selection improves the classification accuracy in comparison to omitting this phase.

Table 6e
Selected features for the U2R class (each row shows the feature selection algorithm, the optimal number of features and the labels of the important features).

Method   #Features   Selected features
FFSA     27          5, 1, 19, 18, 39, 2, 22, 9, 29, 7, 8, 15, 30, 16, 20, 21, 6, 3, 26, 31, 33, 14, 4, 17, 32, 12, 25
MMIFS    10          5, 1, 3, 24, 23, 2, 33, 6, 32, 4, 14, 21
LCFS     3           14, 17, 13

Table 7
Average time consumed by the selection process for the different feature selection algorithms (measured in minutes).

Method   Preprocessing time (min)
LCFS     0.4
MMIFS    3
FFSA     58

Fig. 4. ROC curves using different feature selection algorithms in the Normal class, in terms of adding the features ranked by the proposed algorithms. (a) IDS with the modified mutual information feature selection (MMIFS); (b) IDS with the forward feature selection algorithm (FFSA); (c) IDS with the linear correlation feature selection (LCFS).

Fig. 5. ROC curves using different feature selection algorithms in the DOS class, in terms of adding the features ranked by the proposed algorithms. (a) IDS with the modified mutual information feature selection (MMIFS); (b) IDS with the forward feature selection algorithm (FFSA); (c) IDS with the linear correlation feature selection (LCFS).

In the Normal class, studying the input features with regard to the output shows that there is no linear relation between the input features and the output; hence, MMIFS and FFSA have resulted in significant improvements in the accuracy of the classification models compared to LCFS.

It seems that the input features in the Probe and R2L classes are neither independent nor linearly correlated, so these classes achieve better results when MMIFS is applied. In the U2R, DOS and Normal classes, we obtained almost the same results using either MMIFS or FFSA; therefore, the input features in these classes are almost independent from one another.

Some researchers have selected their training and testing data from the KDD Cup 99 training data set. So, using our evaluation results, we compare the accuracy of our PLSSVM against some SVM-based and feature-selection-based approaches introduced in Section 2, such as SVM, Bayesian networks and FNT (Table 9), all of which have been shown to be effective classification techniques. An FNT uses a flexible tree model as the neural network's structure, which allows input variable selection and different activation functions for different nodes; optimization methods determine its parameters. A Bayesian network represents the inter-relationships among the data set features. An SVM constructs an optimal separating hyperplane between the positive and negative classes, while an LSSVM involves equality


Fig. 6. ROC curves using different feature selection algorithms in the Probe class, in terms of adding the features ranked by the proposed algorithms. (a) IDS with the modified mutual information feature selection (MMIFS); (b) IDS with the forward feature selection algorithm (FFSA); (c) IDS with the linear correlation feature selection (LCFS).

Fig. 7. ROC curves using different feature selection algorithms in the R2L class, in terms of adding the features ranked by the proposed algorithms. (a) IDS with the modified mutual information feature selection (MMIFS); (b) IDS with the forward feature selection algorithm (FFSA); (c) IDS with the linear correlation feature selection (LCFS).

Fig. 8. ROC curves using different feature selection algorithms in the U2R class, in terms of adding the features ranked by the proposed algorithms. (a) IDS with the modified mutual information feature selection (MMIFS); (b) IDS with the forward feature selection algorithm (FFSA); (c) IDS with the linear correlation feature selection (LCFS).


constraints only and is comparable to an SVM in terms of thegeneralization performance.

Table 9 presents the published results from Mukkamala and Sung (2006), Chebrolu et al. (2005), Chena et al. (2006) and Mukkamala et al. (2005). The comparison is based on the accuracy rate (%) for anomaly detection (normal-attack separation) and misuse detection (detection of the four main intrusion categories). As evident from Table 9, the proposed PLSSVM seems to be promising.

Table 8
Performance of classification for attack and Normal classes on the evaluation data (Method: feature selection algorithm; DR: detection rate; FPR: false positive rate; DE: number of detection errors).

Class    Method         DR (%)           FPR (%)        Accuracy (%)    DE (#records)
Normal   FFSA           99.843 ± 0.143   0.25 ± 0.22    99.80 ± 0.1     13
         MMIFS          99.92 ± 0.062    0.08 ± 0.062   99.89 ± 0.035   7
         LCFS           99.73 ± 0.257    2.24 ± 2.96    99.37 ± 0.66    51
         All features   99.743 ± 0.452   10.48 ± 9.72   96.81 ± 2.19    210
DOS      FFSA           90.02 ± 0.2      0.02 ± 0.3     99.00 ± 0.25    70
         MMIFS          85.81 ± 0.09     0.03 ± 0.2     98.9 ± 0.15     91
         LCFS           87.84 ± 0.5      1.1 ± 1.5      98.83 ± 1.0     130
         All features   85.81 ± 1.0      10.0 ± 6.1     97.64 ± 2.1     230
Probe    FFSA           92.9 ± 0.5       0.19 ± 0.21    99.09 ± 0.3     61
         MMIFS          99.97 ± 0.05     0.19 ± 0.12    99.83 ± 0.045   11
         LCFS           57.15 ± 1.0      0.2 ± 0.9      95.8 ± 1.1      310
         All features   78.58 ± 3.2      11.1 ± 0.5     95.1 ± 1.8      350
R2L      FFSA           99.72 ± 0.09     0.2 ± 0.1      99.79 ± 0.19    14
         MMIFS          99.98 ± 0.08     0.3 ± 0.1      99.91 ± 0.07    6
         LCFS           93.57 ± 1.08     0.07 ± 0.8     99.61 ± 1.3     77
         All features   99.7 ± 0.29      62.01 ± 10.0   84.24 ± 9.02    1056
U2R      FFSA           90 ± 3.6         7.1 ± 0.32     93.16 ± 0.5     458
         MMIFS          95 ± 1.01        9.66 ± 0.4     90.32 ± 0.5     648
         LCFS           50 ± 5.00        5.69 ± 1.21    94.20 ± 2.0     388
         All features   95 ± 6.2         5.46 ± 0.65    94.56 ± 0.1     364

Table 9
Comparison of the evaluation results with the other approaches according to accuracy rate (each row represents an intrusion detection method; PLSSVM is our proposed system).

Method                                                  Normal (%)   DOS (%)   Probe (%)   R2L (%)   U2R (%)
PLSSVM                                                  99.80        99.00     99.83       99.91     93.16
SVM (Mukkamala et al., 2005)                            99.55        99.25     99.70       99.78     99.87
Bayesian (Chebrolu et al., 2005)                        98.78        98.95     99.57       98.93     48.00
FNT (Chena et al., 2006)                                99.19        98.75     98.39       99.09     99.70
SVM with feature selection (Mukkamala and Sung, 2006)   99.59        99.22     99.38       99.78     99.87

Table 10
Detection rate, false positive rate, accuracy, building time and testing time for the different classes on the test data set with corrected labels of the KDD Cup 99 data set (time expressed in minutes).

Class    Features fed to model     DR (%)   FPR (%)   Accuracy (%)   Building time (min)   Testing time (min)
Normal   MMIFS selected features   95.15    0.65      99.1           25                    11
         All features              94.11    0.65      99.00          53                    20
DOS      MMIFS selected features   78.69    0.73      84.11          19                    8
         All features              79.00    0.78      84.3           53                    20
Probe    MMIFS selected features   86.46    13.87     86.12          35                    13
         All features              87.56    13.3      86.15          54                    20
R2L      MMIFS selected features   84.85    0.53      98.70          5                     4
         All features              88       0.79      98.82          54                    21
U2R      MMIFS selected features   30.70    0.47      99.47          23                    10
         All features              18.42    0.39      99.46          53                    20


The PLSSVM model is able to select the good features for all classesand outperforms the SVM, Bayesian and FNT in detecting Normal,Probe and R2L classes. Accuracy of PLSSVM in DOS and U2R is alsohigh. In these two classes, the difference in accuracy with otherproposed methods is small.

As MMIFS is the most promising feature selection method in terms of lowest computational complexity and highest accuracy, we compare the IDS using the MMIFS method with one using no feature selection algorithm on "the test set with corrected labels of KDD Cup 99" in Table 10. We can see that, for each class, the accuracy of the model with the selected features is close to, and in some cases even better than, that of the model with all 41 features, with smaller building and testing times. Feature selection improves the detection rate, false positive rate and accuracy, which means that the feature selection process can help build lightweight IDSs.

The test results of PLSSVM using MMIFS have been compared with some other machine learning methods tested on the KDD Cup 99 test set, as shown in Table 11. It can be stated that all the machine learning algorithms tested on this data set offered an acceptable level of detection performance for the Normal, DOS and

Table 11
Classification rate and cost per example (CPE) for the different algorithms on the "test data set with corrected labels of KDD Cup 99".

Model                                     Normal (%)   DOS (%)   Probe (%)   R2L (%)   U2R (%)   CPE
PLSSVM                                    95.69        78.76     86.46       84.85     30.7      0.1807
Clustering feature (Horng et al., 2010)   99.3         99.5      97.5        28.8      19.7      Not reported
ESC-IDS (Toosi and Kahani, 2007)          98.2         99.5      84.1        31.5      14.1      0.1579
KDD'99 winner (Pfahringer, 2000)          99.5         97.1      83.3        8.4       13.2      0.2331
KDD'99 runner-up (Levin, 2000)            99.4         97.5      84.5        7.3       11.8      0.2356

Table 12
Confusion matrix using CMIM feature selection as preprocessing (each cell contains the number of records and the classification rate). Rows: actual class; columns: predicted class.

Actual   Normal            DOS                R2L             Probe           U2R
Normal   38,836 (64.09%)   4115 (6.79%)       8314 (13.72%)   9323 (15.38%)   57 (0.094%)
DOS      20,318 (8.83%)    203,684 (88.61%)   441 (0.19%)     5406 (2.35%)    4 (0.001%)
R2L      9751 (60.23%)     704 (4.34%)        1187 (7.33%)    4133 (25.52%)   396 (2.44%)
Probe    963 (23.09%)      2218 (53.2%)       199 (4.77%)     768 (18.42%)    18 (0.43%)
U2R      24 (9.75%)        89 (36.17%)        71 (28.86%)     62 (25.2%)      0 (0%)

Table 13
Confusion matrix using mRMR feature selection as preprocessing (each cell contains the number of records and the classification rate). Rows: actual class; columns: predicted class.

Actual   Normal            DOS                R2L               Probe            U2R
Normal   59,170 (97.65%)   729 (1.2%)         114 (0.18%)       580 (0.95%)      0 (0%)
DOS      62,017 (26.98%)   129,471 (56.32%)   30,432 (13.23%)   7933 (3.45%)     0 (0%)
R2L      13,179 (81.4%)    2279 (14.07%)      658 (4.06%)       55 (0.33%)       0 (0%)
Probe    290 (6.96%)       260 (6.24%)        59 (1.66%)        3548 (85.165%)   0 (0%)
U2R      66 (26.82%)       135 (54.87%)       8 (3.25%)         37 (15.04%)      0 (0%)

Table 14
Confusion matrix using MMIFS feature selection as preprocessing, with β = 0.5 (each cell contains the number of records and the classification rate). Rows: actual class; columns: predicted class.

Actual   Normal            DOS                R2L             Probe             U2R
Normal   58,853 (97.12%)   10 (0.016%)        823 (1.35%)     204 (0.33%)       703 (1.16%)
DOS      46,564 (20.25%)   129,472 (56.32%)   5252 (2.28%)    48,565 (21.12%)   0 (0%)
R2L      11,465 (70.81%)   0 (0%)             4510 (27.85%)   25 (0.15%)        171 (1.05%)
Probe    1357 (32.57%)     3 (0.07%)          55 (1.32%)      2751 (66.03%)     0 (0%)
U2R      153 (62.19%)      0 (0%)             30 (12.19%)     30 (12.19%)       33 (13.41%)


Probe attacks, but they did not perform well on the R2L and U2R types. Our proposed system demonstrates better performance for R2L and U2R attacks, and the CPE of the system is close to the best one. However, the system is not successful in detecting DOS and R2L, so many of these attacks are not detected correctly. There are 18,729 samples of various new attacks in the KDD Cup 99 test data that never appear in the training set; these records make it hard for an intrusion detection system trained on the training set to achieve good performance on the test data. Our experiments show that our proposed IDS's detection rate for new attacks in the Normal class is only 43.64%, and the worst detection is on new R2L attacks: many snmpgetattack, xsnoop and xlock records are not detected correctly. In fact, since the features of snmpgetattack and a Normal connection are the same, it is impossible for an IDS to distinguish these two from each other, as also shown in Bouzida and Cuppens (2006).

7.4. Comparison of MMIFS with two other MI based techniques

We compare the performance of our proposed MMIFS method with two other mutual information-based feature selection techniques introduced in Section 2: mRMR and CMIM. In order to compare the methods, we discretize our data set with the equal-depth technique, and then the binary features selected by MMIFS, CMIM and mRMR are applied in the LSSVM model. Tables 12-14 show the results of modeling using fifteen features selected by CMIM, mRMR and MMIFS. As can be seen, the classification rates on the main diagonal of Table 14 (using MMIFS) are comparable to or better than those using mRMR (Table 13) and CMIM (Table 12), except for the DOS class, and the misdetected records in Table 14 are comparable to or lower than those using mRMR and CMIM, again except for the DOS class. The results reveal that,

Table 15
Confusion matrix using MMIFS feature selection as preprocessing, with β = 1 (each cell contains the number of records and the classification rate). Rows: actual class; columns: predicted class.

Actual   Normal            DOS                R2L             Probe             U2R
Normal   53,802 (88.79%)   5045 (8.32%)       823 (1.35%)     216 (0.35%)       0 (1.16%)
DOS      80,437 (34.99%)   129,604 (56.38%)   5252 (2.28%)    14,560 (6.33%)    0
R2L      10,937 (67.55%)   546 (3.37%)        451 (27.85%)    3 (0.01%)         175 (1.089%)
Probe    832 (19.97%)      597 (14.33%)       55 (1.32%)      2682 (64.37%)     0
U2R      133 (54.065%)     34 (13.82%)        30 (12.19%)     19 (7.72%)        30 (12.19%)


excluding the DOS records, MMIFS selects more important features than mRMR and CMIM. MMIFS is not successful at detecting DOS records; this may be because of the mechanisms of DOS attacks, which have widely varying behaviors. Studying the features selected by the three algorithms reveals that the rate of detected Normal records using MMIFS is comparable to mRMR and better than CMIM, and an LSSVM using MMIFS outperforms the other methods in detecting U2R and R2L. We also computed the CPE of the LSSVM results using the three feature selection methods: the CPE using CMIM is 0.4406, using mRMR is 0.8146 and using MMIFS is 0.6552. Since many DOS records are misdetected using MMIFS, the CPE using MMIFS lies between the CPE using CMIM and the CPE using mRMR. Comparing Table 14 with Table 10 reveals that MMIFS using continuous features produces much better results than using discrete features, and that discretizing the features results in losing some information and a decreased detection rate. In this set of experiments, we had to discretize the input features in order to be comparable with the other methods.
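Equal-depth (equal-frequency) discretization, as used above, assigns roughly the same number of samples to each bin; a minimal sketch (our own illustration, using quantile-based bin edges):

```python
import numpy as np

def equal_depth_bins(values, n_bins):
    """Equal-depth (equal-frequency) discretization: each bin receives
    roughly the same number of samples. Returns integer bin indices."""
    values = np.asarray(values, dtype=float)
    # interior bin edges placed at evenly spaced quantiles
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(edges, values, side="right")
```

Unlike equal-width binning, this keeps bins populated even for the heavily skewed traffic counters in KDD Cup 99.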

The three algorithms employ different methods for estimating the mutual information, but beyond that, MMIFS and mRMR use similar search strategies for selecting features. Consequently, the number of common features selected by MMIFS and mRMR is higher than the number selected by both MMIFS and CMIM. When MMIFS uses β = 1, MMIFS and mRMR have very similar search strategies. We therefore experimented with MMIFS using β = 1; the classification rates are shown in Table 15. For R2L, Probe and U2R, MMIFS with β = 1 still has higher classification rates than mRMR. When β = 0.5, the classification rates (Table 14) are better than when β = 1 (Table 15), and fewer records are miss-detected. In our experiments, we have not tuned β for optimal performance; we will study the tuning of β in future research.
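The shared search strategy is a greedy forward selection over a relevance-minus-redundancy score. The sketch below is our own illustrative rendering of this family of criteria (the function and its signature are hypothetical, not the paper's implementation); with the redundancy term averaged over the already-selected set, β = 1 recovers mRMR's relevance-redundancy score, matching the observation above:

```python
def greedy_mi_select(features, target, mi, n_select, beta=0.5):
    """Greedy forward selection in the MIFS/mRMR family.

    At each step, pick the candidate feature maximizing
        I(C; f) - beta * (1/|S|) * sum_{s in S} I(f; s),
    where S is the already-selected set. `mi(a, b)` is any mutual-information
    estimator (e.g. a k-NN or histogram-based one); beta = 1 reduces the score
    to mRMR's relevance-minus-average-redundancy form.
    """
    remaining = list(range(len(features)))
    selected = []
    while remaining and len(selected) < n_select:
        def score(i):
            relevance = mi(features[i], target)
            if not selected:
                return relevance          # first pick: relevance only
            redundancy = sum(mi(features[i], features[s]) for s in selected)
            return relevance - beta * redundancy / len(selected)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected                        # indices in order of selection
```

Because only the weighting of the redundancy term differs, two runs of this loop with different β (or with a different MI estimator plugged in) can select largely overlapping feature sets, which is consistent with the overlap between MMIFS and mRMR reported above.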

8. Discussion

We have analyzed the features selected by our proposed MMIFS method and their relationship with different attack types. In the KDD Cup 99 data set, the first category of features is useful for detecting attacks, since some attacks scan the hosts (or ports) over a long time interval. The second feature category captures suspicious behavior in the data portions of packets, such as the number of failed login attempts.

Probe attacks have limited variance, because they all involve many connections to a large number of hosts or ports in a short period of time. Studying these attacks has shown that dst_host_rerror_rate, dst_host_srv_rerror_rate and srv_rerror_rate are important features for detecting Probe attacks. MMIFS has placed these among the top selected features, indicating that it has been successful in selecting the important features (Table 6c).

In terms of the type of mechanism exploited by the DOS attacks, the important features selected by MMIFS are service, dst_byte, dst_host_srv_rerror_rate and dst_host_same_src_port_rate. dst_host_srv_serror_rate has been shown to be an important feature for detecting DOS attacks, specifically for Neptune attacks, but using our method, it did not rank highly in the feature ranking.

In an R2L attack, there are many possible ways an attacker can gain unauthorized access to a local account on a machine. The warezmaster and warezclient attacks, two important R2L attacks, involve uploading and downloading data from hidden directories during an FTP connection. Thus, the relevant features that can be observed are hot, service, src_byte, dst_byte, is_guess_login and duration, which are all selected by MMIFS (Table 6d).

Since the outcome of all U2R attacks is a root shell obtained without legitimate means, and root-relevant features therefore seem useful for detection, MMIFS has selected some root-relevant features, e.g., root_shell and is_hot_login.

Since the DOS and Probe attacks involve many connections to some host(s), the important features for observing these types are src_byte and srv_count. There are no sequential patterns that are frequent in records of R2L and U2R attacks. These attacks change the data portions of packets, and normally involve only a single connection. Therefore, the second category of features plays a minor role in the best result of the DOS and Probe classes, and the third and fourth categories of features play a minor role in the best result of the R2L and U2R classes. The labels of the features selected by MMIFS point to these facts as well.

On the other hand, DOS and R2L attacks have varied behaviors. They exploit the weaknesses of network or system services, and their dynamics are very similar to normal network behavior. Consequently, our detection models miss a large number of DOS and R2L attacks.

9. Conclusions and future work

In this paper, we introduced a new intrusion detection system, which applies information-theoretic and statistical criteria for feature selection and the LSSVM method for classification. The modified mutual information-based feature selection method (MMIFS) is a feature selection method with maximum relevance and minimum redundancy. In this work, the effect of changing the feature goodness measure and the evaluation function has been investigated via linear correlation-based feature selection (LCFS), the forward feature selection algorithm (FFSA) and the modified mutual information feature selection algorithm (MMIFS). Experiments on the KDD Cup 99 data set demonstrate that feature selection algorithms can greatly improve the classification accuracy. The KDD Cup 99 data set contains four groups of attacks: DOS, Probe, R2L and U2R. Our proposed MMIFS is the most effective of the three in detecting Probe and R2L attacks, by selecting features with low dependency, while FFSA has comparable performance to MMIFS in detecting U2R and DOS attacks and the Normal profile.

While LCFS and FFSA can be useful in particular cases, our proposed MMIFS is able to measure a general dependency between features and to rank them. Since MMIFS and FFSA use more information to select features, these techniques usually lead to more effective detection than the suggested LCFS method. When the input features of a class are almost independent of one another, FFSA and MMIFS produce results close to each other.

In the future, we intend to study using our proposed feature selection methods as a preprocessing step in other learning methods, and to carry out a thorough comparison of our proposed feature selection methods with existing ones.

Acknowledgements

The authors would like to thank the reviewers for their comments, which helped improve the paper significantly, and Dr. Rouhollah Rahmani for his help in editing some sections of the paper.

Appendix A

How to estimate mutual information

In practice, we have a set of N input–output pairs z_i = (x_i, y_i), i = 1, ..., N, which are assumed to be realizations of a random variable Z = (X, Y) with density p_{X,Y}(x, y). X and Y take values in \mathbb{R} or \mathbb{R}^p, and the algorithm uses the Euclidean norm in those spaces.

Input–output pairs are compared through the maximum norm:

\| z - z' \| = \max\{ \| x - x' \|, \| y - y' \| \}    (8)

Let k be a fixed positive integer, and let z_{k(i)} = (x_{k(i)}, y_{k(i)}) denote the k-th nearest neighbor of z_i (with respect to the maximum norm). Denote

\varepsilon_i / 2 = \| z_i - z_{k(i)} \|    (9)

\varepsilon_i^x / 2 = \| x_i - x_{k(i)} \|, \qquad \varepsilon_i^y / 2 = \| y_i - y_{k(i)} \|    (10)

Here \varepsilon_i / 2 is the distance from z_i to its k-th neighbor, and \varepsilon_i^x / 2 and \varepsilon_i^y / 2 are the distances between the same points projected into the X and Y subspaces. Obviously, \varepsilon_i = \max\{ \varepsilon_i^x, \varepsilon_i^y \}. Let n_i^x and n_i^y be the numbers of sample points with \varepsilon_i^x / 2 \ge \| x_i - x_j \| and \varepsilon_i^y / 2 \ge \| y_i - y_j \|, respectively. The estimate of the MI is then

I(X; Y) = \psi(k) - \frac{1}{k} - \frac{1}{N} \sum_{i=1}^{N} \left[ \psi(n_i^x) + \psi(n_i^y) \right] + \psi(N)    (11)

where \psi is the digamma function

\psi(n) = \Gamma(n)^{-1} \frac{d\Gamma(n)}{dn}, \qquad \psi(1) \approx -0.5772156    (12)

with

\Gamma(n) = \int_0^{\infty} u^{n-1} e^{-u} \, du    (13)

For a small value of k, this estimator has a large variance and a small bias, whereas a large value of k leads to a small variance and a large bias. In this paper, we use the 6th nearest neighbor (k = 6) to estimate the MI.

References

Anderson D, Frivold T, Tamaru A, Valdes A. Next-generation intrusion detection expert system (NIDES). Software Users Manual, Beta-Update release. Menlo Park, CA, USA: Computer Science Laboratory, SRI International; 1994. Technical Report SRI-CSL-95-0.

Anderson D, Lunt TF, Javitz H, Tamaru A, Valdes A. Detecting unusual program behavior using the statistical component of the Next-generation Intrusion Detection Expert System (NIDES). Menlo Park, CA, USA: Computer Science Laboratory, SRI International; 1995. SRI-CSL-95-06.

Barbará D, Couto J, Jajodia S, Wu N. ADAM: a testbed for exploring the use of data mining in intrusion detection. ACM SIGMOD Record, special section on data mining for intrusion detection and threat analysis 2001;30:15–24.

Battiti R. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks 1994;5:537–50.

Boukerche A, Lemos Juca KR, Sobral JB, Sechi Moretti Annoni Notare M. An artificial immune based intrusion detection model for computer and telecommunication systems. Parallel Computing 2004;30(5–6):629–46.

Bouzida Y, Cuppens F. Neural networks vs. decision trees for intrusion detection. http://www.rennes.enst-bretagne.fr/~fcuppens/articles/monam06.pdf; 2006.

C.S. Institute and F.B.O. Investigation. In: Proceedings of the 10th Annual Computer Crime and Security Survey 2005;10:1–23.

Chebrolu S, Abraham A, Thomas P. Feature deduction and ensemble design of intrusion detection systems. Computers and Security 2005;24(4):295–307.

Chen Y, Abraham A, Yang B. Feature selection and classification using flexible neural tree. Journal of Neurocomputing 2006;70:305–13.

Chimphlee W, Abdullah AH, Md Sap MN, Srinoy S, Chimphlee S. Anomaly-based intrusion detection using fuzzy rough clustering. In: Proceedings of the international conference on hybrid information technology (ICHIT'06); 2006.

Cho SB. Incorporating soft computing techniques into a probabilistic intrusion detection system. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 2002;32:154–60.

Chou TS, Yen KK, Luo J. Network intrusion detection design using feature selection of soft computing paradigms. International Journal of Computational Intelligence 2008;4(3):196–208.

Cohen WW. Fast effective rule induction. In: Proceedings of the 12th International Conference on Machine Learning. Tahoe City, CA; 1995. p. 115–23.

Debar H, Becker M, Siboni D. A neural network component for an intrusion detection system. In: Proceedings of the IEEE Computer Society Symposium on Research in Security and Privacy; 1992. p. 240–50.

Fisch D, Hofmann A, Sick B. On the versatility of radial basis function neural networks: a case study in the field of intrusion detection. Information Sciences 2010;180:2421–39.

Fleuret F. Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research (JMLR) 2004;5:1531–55.

Horng S-J, Su M-Y, Chen Y-H, Kao T-W, Chen R-J, Lai J-L, Perkasa CD. A novel intrusion detection system based on hierarchical clustering and support vector machines. Expert Systems with Applications 2010, doi:10.1016/j.eswa.2010.06.066.

Jain AK, Zongker D. Feature selection: evaluation, application, and small sample performance. IEEE Transactions on Pattern Analysis and Machine Intelligence 1997;19(2):153–8.

KDD Cup. Data available at: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html; 1999.

Khan L, Awad M, Thuraisingham B. A new intrusion detection system using support vector machines and hierarchical clustering. The International Journal on Very Large Databases 2007;16(4):507–21.

Kraskov A, Stogbauer H, Grassberger P. Estimating mutual information. Physical Review E 2004;69:066138.

Lazarevic A, Ertoz L, Kumar V, Ozgur A, Srivastava J. A comparative study of anomaly detection schemes in network intrusion detection. In: Proceedings of the Third SIAM Conference on Data Mining; 2003.

Lee JH, Sohn SG, Chang BH, Chung TM. PKG-VUL: security vulnerability evaluation and patch framework for package-based systems. ETRI Journal 2009;31(5):554–64. http://etrij.etri.re.kr/Cyber/BrowseAbstract.jsp?vol=31&num=5&pg=554.

Levin I. KDD-99 classifier learning contest: LLSoft's results overview. SIGKDD Explorations 2000;1(2):67–75.

Li Y, Wang J, Tian Z, Lu T, Young C. Building lightweight intrusion detection system using wrapper-based feature selection mechanisms. Computers and Security 2009;28:466–75.

Lippmann R, Haines JW, Fried DJ, Korba J, Das K. The 1999 DARPA off-line intrusion detection evaluation. Computer Networks: The International Journal of Computer and Telecommunications Networking 2000;34:579–95.

LSSVMlab. http://www.esat.kuleuven.ac.be/sista/lssvmlab/.

Morin B, Me L. Intrusion detection and virology: an analysis of differences, similarities and complementariness. Journal of Computational Virology 2007;3:39–49.

Mukkamala S, Sung AH. Significant feature selection using computational intelligent techniques for intrusion detection. Berlin Heidelberg: Springer; 2006. p. 285–306.

Mukkamala S, Sung A, Abraham A. Intrusion detection using an ensemble of intelligent paradigms. Journal of Network and Computer Applications 2005;28(2):167–82.

Novikov D, Yampolskiy RV, Reznik L. Anomaly detection based intrusion detection. In: Proceedings of the third international conference on information technology: new generations (ITNG'06); 2006.

Patcha A, Park J-M. An overview of anomaly detection techniques: existing solutions and latest technological trends. Computer Networks 2007, doi:10.1016/j.comnet.2007.02.001.

Peng H, Long F, Ding C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 2005;27(8):1226–38.

Pfahringer B. Winning the KDD99 classification cup: bagged boosting. SIGKDD Explorations 2000;1(2):65–6.


Ramadas M, Ostermann S, Tjaden B. Detecting anomalous network traffic with self-organizing maps. In: Proceedings of the sixth international symposium on recent advances in intrusion detection. Pittsburgh, PA, USA; 2003. p. 36–54.

Saniee Abadeh M, Habibi J, Lucas C. Intrusion detection using a fuzzy genetics-based learning algorithm. Journal of Network and Computer Applications 2007;30(1):414–28.

Sung A, Mukkamala S. Identifying important features for intrusion detection using support vector machines and neural networks. In: Proceedings of the international symposium on applications and the internet (SAINT 2003); 2003. p. 209–17.

Suykens JAK, Vandewalle J. Least squares support vector machine classifiers. Neural Processing Letters 1999;9:293–300.

Toosi AN, Kahani M. A new approach to intrusion detection based on an evolutionary soft computing model using neuro-fuzzy classifiers. Computer Communications 2007;30:2201–12.

Tsai C-F, Hsu Y-F, Lin C-Y, Lin W-Y. Intrusion detection by machine learning: a review. Expert Systems with Applications 2009;36:11994–12000.

Xuren W, Famei H, Rongsheng X. Modeling intrusion detection system by discovering association rule in rough set theory framework. In: Proceedings of the international conference on computational intelligence for modelling control and automation, and international conference on intelligent agents, web technologies and internet commerce (CIMCA-IAWTIC'06); 2006.

Ye N, Zhang Y, Borror CM. Robustness of the Markov-chain model for cyber-attack detection. IEEE Transactions on Reliability 2004;53:116–23.

Yu H, Yang J, Han J, Li X. Classifying large data sets using SVM with hierarchical clusters. In: Proceedings of the international conference on knowledge discovery in databases (KDD'03); 2003.

Zhang Z, Shen H. Application of online-training SVMs for real-time intrusion detection with different considerations. Computer Communications 2005;28(12):1428–42.

Wu SX, Banzhaf W. The use of computational intelligence in intrusion detection systems: a review. Applied Soft Computing 2010;10:1–35.