Applications of Neural Networks in Network Intrusion Detection



8th Seminar on Neural Network Applications in Electrical Engineering, NEUREL-2006, Faculty of Electrical Engineering, University of Belgrade, Serbia, September 25-27, 2006


Aleksandar Lazarevic, Dragoljub Pokrajac, Jelena Nikolic

Abstract - In this paper, we discuss the application of multilayer perceptrons to the classification of network intrusion detection data characterized by skewed class distributions. We compare several methods for learning from such skewed distributions by manipulating data records. The investigated methods include oversampling, undersampling and generating artificial data records using the SMOTE technique. The presented methods are tested on the KDDCup99 network intrusion dataset and compared using various classification performance metrics. In addition, the influence of the decision margin on recall and misclassification rates is also examined.

Keywords - Neural networks, rare class, network intrusion detection.

I. INTRODUCTION

With the growing number of attacks on network infrastructures, the need for techniques to detect and prevent attacks is becoming urgent. Intrusion detection refers to a broad range of techniques that defend against malicious attacks. The most widely deployed methods for intrusion detection employ signature-based detection techniques. These methods extract features from various audit streams, and detect intrusions by comparing the feature values to a set of attack signatures provided by human experts. Such methods can only detect previously known intrusions that have a corresponding signature. The signature database has to be manually revised for each new type of attack that is discovered. Limitations of signature-based methods have led to an increasing interest in intrusion detection techniques based upon data mining [1, 2, 3, 4, 5].

Data mining based intrusion detection techniques generally fall into one of two categories: misuse detection and anomaly detection.

Dragoljub Pokrajac has been partially supported by NIH (2P20 RR016472-04), DoD/DoA 45395-MA-ISP and NSF (0320991, HRD-0310163, HRD-0630388) grants.

A. Lazarevic is with United Technologies Research Center, 411 Silver Lane, East Hartford, CT 06108, USA (phone: +1-860-610-7560, fax: +1-860-660-9334, e-mail: lazarea@utrc.utc.com).

D. Pokrajac is with Delaware State University, Applied Mathematics Research Center, 1200 N DuPont Hwy, Dover, DE 19904, USA (phone: +1-302-857-6640, fax: +1-302-857-6552, e-mail: dpokraja@desu.edu).

J. Nikolic is with the Faculty of Electronic Engineering, University of Nis, Aleksandra Medvedeva 14, 18000 Nis, Serbia (phone: +381-18-529-105, fax: +381-18-588-399, e-mail: njelena@elfak.ni.ac.yu).

In misuse detection approaches, each instance in a data set is labeled as normal or intrusion (attack) and a learning algorithm is trained over the labeled data. These approaches are able to automatically retrain intrusion detection models on different input data that include new types of attacks, as long as they have been labeled appropriately. The main advantage of misuse detection is that it can accurately detect known attacks, while its drawback is its inability to detect novel, previously unseen attacks.

Traditional anomaly detection approaches, on the other hand, build models of normal data and detect deviations from the normal model in observed data. Anomaly detection applied to intrusion detection and computer security has been an active area of research since it was originally proposed by Denning [6]. Anomaly detection algorithms have the advantage that they can detect new types of intrusions as deviations from normal usage [6, 7]. In this problem, given a set of normal data to train from and a new piece of test data, the goal of the intrusion detection algorithm is to determine whether the test data belong to "normal" or to anomalous behavior. We refer to this problem as supervised anomaly detection, since the models are built only according to the normal behavior on the network. In contrast, unsupervised anomaly detection attempts to detect anomalous behavior without using any knowledge about the training data. However, both types of anomaly detection schemes suffer from a high rate of false alarms. This occurs primarily because previously unseen (yet legitimate) system behaviors are also recognized as anomalies, and hence flagged as potential intrusions.

This paper presents the scope and status of our research work in misuse detection. The paper first gives a brief overview of our research in building predictive models for learning from rare classes and then proposes several techniques for detecting network intrusions. We present experimental results on the publicly available KDDCup'99 data set [20]. These results demonstrate that rare class predictive models are much more efficient in the detection of intrusive behavior than standard classification techniques.

II. BACKGROUND

In misuse detection related problems, standard data mining techniques are not directly applicable due to several specific issues, which include dealing with skewed class distributions, learning from data streams (intrusions are sequences of events) and proper labeling of network connections.


The problem of skewed class distribution is very pronounced in network intrusion detection, since the intrusion class of interest is much smaller, i.e. rarer, than the class representing normal network behavior. In such scenarios, when normal behavior may typically represent 98-99% of the entire population, a trivial classifier that labels everything with the majority class can achieve 98-99% accuracy. It is apparent that in this case classification accuracy is not a sufficient performance measure. ROC analysis and metrics such as precision, recall and F-value [8, 9, 10] have been used to understand the performance of learning algorithms on the minority class. A confusion matrix, shown in Table 1, is used to define these metrics. From Table 1, precision, recall and F-value may be defined as follows:

Precision = TP / (TP + FP)   (1)

Recall = TP / (TP + FN)   (2)

F-value = (1 + β²) · Recall · Precision / (β² · Recall + Precision)   (3)

where β corresponds to the relative importance of precision vs. recall and is usually set to 1. The main focus of all learning algorithms is to improve the recall without sacrificing the precision. However, the recall and precision goals are often conflicting, and attacking them simultaneously may not work well, especially when one class is rare. The F-value incorporates both precision and recall, and the "goodness" of a learning algorithm for the minority class can be measured by the F-value. While ROC curves represent the trade-off between values of TP and FP, the F-value basically incorporates the relative effects/costs of recall and precision into a single number.

TABLE 1: STANDARD METRICS FOR EVALUATION OF INTRUSIONS

Actual \ Predicted connection label | Normal              | Intrusions (Attacks)
Normal                              | True Negative (TN)  | False Alarm (FP)
Intrusions (Attacks)                | False Negative (FN) | Correctly detected attacks (TP)
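As a worked illustration of Eqs. (1)-(3) and the counts in Table 1, the short Python sketch below computes the three metrics from raw confusion-matrix counts; the function name f_value and the example counts are ours, chosen purely for illustration.

```python
def f_value(tp, fp, fn, beta=1.0):
    """Precision, recall and F-value (Eqs. 1-3) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta ** 2) * recall * precision / (beta ** 2 * recall + precision)
    return precision, recall, f

# Hypothetical counts: 50 detected attacks, 10 false alarms, 30 missed attacks.
print(f_value(tp=50, fp=10, fn=30))   # precision ~0.833, recall 0.625, F-value ~0.714
```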

III. METHODOLOGY

Researchers have dealt with class imbalance using different techniques, such as manipulating data records (e.g. over-sampling the minority class with replacement, under-sampling the majority class [11-14], or generating artificial examples from the minority class), designing new algorithms suitable for learning rare classes (e.g. SHRINK, PN-rule, CREDOS), case-specific feature/rule weighting, boosting based algorithms (SMOTEBoost, RareBoost), and cost sensitive classification (MetaCost, AdaCost, CSB, SSTBoost). In this paper we focus only on techniques based on manipulating data records.

The method based on over-sampling the rare classes simply duplicates rare class records until the data set contains as many examples of each rare class as of the majority class, so that the classes are balanced. Over-sampling does not increase information, but it increases the misclassification cost of the rare class. However, the effect of over-sampling is to identify more specific decision regions of the minority class in the feature space. This can lead to over-fitting, with the minority class decision region becoming very specific.

The technique based on under-sampling (down-sizing) samples the data records of the majority class, choosing records completely at random, by the "near miss" technique, or by selecting records that are far from the minority class examples (far from the decision boundaries). The under-sampling technique introduces the sampled records into the data set in place of the original majority class records, and typically results in a general loss of information and potentially overly general rules.
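The two resampling strategies described above can be sketched roughly as follows in Python/numpy; the function names, the fixed random seed and the target counts are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def oversample(X, y, rare_class, n_target):
    """Duplicate rare-class records (sampling with replacement) until n_target copies exist."""
    idx = np.flatnonzero(y == rare_class)
    extra = rng.choice(idx, size=max(n_target - idx.size, 0), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

def undersample(X, y, majority_class, n_target):
    """Randomly discard majority-class records so that only n_target of them remain."""
    maj = np.flatnonzero(y == majority_class)
    keep = np.concatenate([np.flatnonzero(y != majority_class),
                           rng.choice(maj, size=n_target, replace=False)])
    return X[keep], y[keep]
```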

SMOTE (Synthetic Minority Oversampling Technique) was proposed to counter the effect of having few instances of the minority class in a data set [15]. It creates synthetic instances of the minority class by operating in the "feature space" rather than the "data space". By synthetically generating more instances of the minority class, inductive learners, such as decision trees (e.g. C4.5 [16]) or rule-learners (e.g. RIPPER [17]), are able to broaden their decision regions for the minority class. Nominal (or discrete) and continuous attributes are treated differently in SMOTE. In the nearest neighbor computations for the minority classes, the Euclidean distance is used for the continuous features and the Value Difference Metric (with the Euclidean assumption) for the nominal features [15, 18, 19]. The new synthetic minority samples are created as follows:

* For the continuous features:
  o Take the difference between a feature vector (minority class sample) and one of its k nearest neighbors (minority class samples).
  o Multiply this difference by a random number between 0 and 1.
  o Add this difference to the feature values of the original feature vector, thus creating a new feature vector.

* For the nominal features:
  o Take a majority vote between the feature vector under consideration and its k nearest neighbors for the nominal feature value. In the case of a tie, choose at random.
  o Assign that value to the new synthetic minority class sample.

Using this technique, a new minority class sample is created along the line segment joining a minority class sample and one of its nearest neighbors (see Fig. 1). Hence, using SMOTE, more general regions are learned for the minority class, allowing the classifiers to better predict unseen examples belonging to the minority class. According to [8], the combination of SMOTE and under-sampling creates potentially optimal classifiers, as a majority of points from the SMOTE and under-sampling combination lie on the convex hull of the family of ROC curves.
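A minimal sketch of the continuous-feature branch of the procedure above (the nominal-feature handling via majority vote and the Value Difference Metric [15] is omitted); the function name smote_continuous and its parameters are our own, and this is not the reference implementation from [15].

```python
import numpy as np

def smote_continuous(X_min, n_synthetic, k=5, rng=None):
    """Generate synthetic minority-class samples along segments joining
    a minority sample and one of its k nearest minority-class neighbors."""
    rng = rng or np.random.default_rng(0)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class (fine for a sketch).
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                      # exclude each sample itself
    neighbors = np.argsort(d, axis=1)[:, :k]         # indices of the k nearest neighbors
    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for s in range(n_synthetic):
        i = rng.integers(n)                          # pick a minority sample
        j = rng.choice(neighbors[i])                 # and one of its neighbors
        gap = rng.random()                           # random point on the joining segment
        synthetic[s] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic
```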


Fig. 1. Illustration of the SMOTE technique. Point i is among the k=5 nearest neighbors of a minority class example; a synthetic example is generated between points i and p.

To perform classification (i.e., to distinguish between examples belonging to normal behavior and those related to different attempts of network intrusion), we used artificial neural networks [22]. Artificial neural networks model multivariate non-linear functions using special processing nodes called neurons. Each neuron receives signals from the neurons it is connected to, weighted by the synaptic weights, and generates an output according to its transfer function. In turn, the neuron's output is fed to other neurons. Artificial neural networks have a layered structure, where a neuron in the i-th layer receives signals from neurons in the (i-1)-th layer and sends signals to the (i+1)-th layer of the network. In this paper, we use the multilayer perceptron, a type of layered artificial neural network where the synaptic weights are learned so that the expected squared difference between the desired output signal of the network (outputs of the last, output layer) and the actual outputs is minimized. Typically, classification using neural networks consists of the following two steps: a) learning, where a training set is submitted to the network and the synaptic weights are adapted according to a learning algorithm; b) testing, where the class of an unknown example is decided based on the outputs of the network. In this paper we deal with a multiclass problem, so the classification is performed as follows: the two largest outputs o_i and o_j of the output neurons are observed, and class i is decided if:

o_i - o_j > threshold,   (4)

where threshold is a parameter to be determined. When threshold = 0, this reduces to the ordinary neural network classifier. When threshold > 0 and condition (4) is not satisfied, we refrain from classification and pronounce the data sample as not classified.
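The decision rule in Eq. (4) might be implemented roughly as follows; here outputs is assumed to hold the five output-neuron activations for one test example, and classify_with_margin is a hypothetical helper name, not code from the paper.

```python
import numpy as np

def classify_with_margin(outputs, threshold=0.0):
    """Return the index of the winning class, or None if the margin between the
    two largest outputs does not exceed the threshold (example left unclassified)."""
    order = np.argsort(outputs)                     # ascending order of activations
    o_i, o_j = outputs[order[-1]], outputs[order[-2]]
    if o_i - o_j > threshold:
        return int(order[-1])
    return None

# With threshold = 0 this reduces to the ordinary "largest output wins" classifier.
print(classify_with_margin(np.array([0.1, 0.7, 0.65, 0.2, 0.05]), threshold=0.1))  # None
```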

IV. EXPERIMENTS

A. Datasets

Our experiments were performed on the KDD Cup-99 intrusion detection data set from the KDD Cup 1999 competition [20].

In the KDD Cup 1999 competition, the task was to build a network intrusion detector, a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" connections. The data set is a modification of the DARPA 1998 Intrusion Detection Evaluation Data [21] prepared by MIT Lincoln Lab, and it contains a wide variety of intrusions simulated in a military network environment. The entire data set contains original training data and original test data. The original raw training data corresponds to seven weeks of network traffic and contains around five million network connections. For our experiments we have used a 10% sample originally created by the KDDCup99 organizers [20]. A network connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows to and from a source IP address to a target IP address under some well defined protocol. Each connection is labeled either as normal or as an attack, with exactly one specific attack type. The original test data corresponds to two weeks of network traffic and contains 311,029 network connections. In addition to the normal network connections, the data contains four main categories of attacks:

+ DoS (Denial of Service), for example ping-of-death, teardrop, smurf, SYN flood, etc.;
+ R2L (Remote to Local), unauthorized access from a remote machine, for example guessing a password;
+ U2R (User to Root), unauthorized access to local super-user privileges by a local unprivileged user, for example various buffer overflow attacks;
+ Probe, surveillance and probing, for example port-scan, ping-sweep, etc.

The distribution of network connections in the test data set is given in Table 2.

TABLE 2: SUMMARY OF THE KDDCUP99 DATA SET USED IN THE EXPERIMENTS

Data set: KDDCup-99 Intrusion                       Number of classes: 5
Majority class instances:  DoS 231,455;  Normal 60,593
Minority class instances:  U2R 246;  Probe 4,166;  R2L 14,569

B. Results

The data were pre-processed as follows. Categorical attributes were encoded using "one of c" encoding, where c is the number of distinct attribute values. Attributes with zero standard deviation on the training set were excluded from the training and test data. Subsequently, each attribute on the training set was normalized to have zero mean and unit standard deviation, and the attributes on the test set were normalized accordingly. The class label c_i of the i-th example is encoded using the vector v = [v_1 v_2 v_3 v_4 v_5], where v_j = 1 for j = c_i and v_j = 0.1 for j ≠ c_i, j = 1, ..., 5. This way we prevented the problems related to the saturation of the output neurons [22].
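A sketch of the preprocessing pipeline described above, under the assumption that the data sit in pandas DataFrames with an integer-coded label column; the column handling and the function name preprocess are our own choices, while the one-of-c encoding, the zero-variance filter, the training-statistics normalization and the 1/0.1 target coding follow the text.

```python
import numpy as np
import pandas as pd

def preprocess(train_df, test_df, label_col="label", n_classes=5):
    """One-of-c encode categorical attributes, drop zero-variance columns,
    z-score normalize with training statistics, and encode targets as 1/0.1 vectors."""
    X_train = pd.get_dummies(train_df.drop(columns=[label_col]), dtype=float)  # "one of c"
    X_test = pd.get_dummies(test_df.drop(columns=[label_col]), dtype=float)
    X_test = X_test.reindex(columns=X_train.columns, fill_value=0.0)           # align columns
    keep = X_train.std() > 0                                                   # drop zero-variance attributes
    X_train, X_test = X_train.loc[:, keep], X_test.loc[:, keep]
    mu, sigma = X_train.mean(), X_train.std()
    X_train, X_test = (X_train - mu) / sigma, (X_test - mu) / sigma            # train statistics only

    def targets(labels):
        # labels assumed to be integers 0 .. n_classes-1
        v = np.full((len(labels), n_classes), 0.1)
        v[np.arange(len(labels)), labels] = 1.0                                # 1 for the true class
        return v

    return (X_train.to_numpy(), targets(train_df[label_col].to_numpy()),
            X_test.to_numpy(), targets(test_df[label_col].to_numpy()))
```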

For classification, we used a three-layer non-linear perceptron [22]. The first layer consists of input neurons, where the number of neurons is equal to the number of input attributes.


The second (hidden) layer consists of S1 neurons with the tansig transfer function, described by the following equation:

tansig(x) = (e^x - e^(-x)) / (e^x + e^(-x)),

where the number of neurons S1 is varied in the experiments. The output layer consists of 5 neurons with the logistic sigmoid transfer function:

logsig(x) = 1 / (1 + e^(-x)).
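Written out directly in numpy, the two transfer functions above are as follows (a direct transcription of the formulas, not code from the paper):

```python
import numpy as np

def tansig(x):
    # hyperbolic-tangent sigmoid used in the hidden layer (equivalently np.tanh(x))
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def logsig(x):
    # logistic sigmoid used by the 5 output neurons
    return 1.0 / (1.0 + np.exp(-x))
```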

The neural networks were trained using the Conjugate Gradient with Powell/Beale Restarts algorithm [23]. To improve neural network generalization, an early stopping criterion was used: a validation set consisting of 20% of the data available for training was used to evaluate the mean-square error (MSE) of the trained network, and training stops if the validation MSE starts to increase. The maximal number of epochs for neural network training was set to 200.

To evaluate the performance of the neural networks, for each setting (training set, parameter choice) we repeated the classification experiments ten times and merged the confusion matrices. The precision, recall and F-value were estimated from the combined confusion matrix. In addition, we estimated the misclassification rate, i.e. the relative number of examples classified to other classes (excluding the examples that are not classified).
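The original networks were trained in a MATLAB-style setup (conjugate gradient with Powell/Beale restarts [23]); the sketch below only approximates that protocol with scikit-learn's MLPClassifier (tanh hidden units, adam solver, early stopping on a 20% validation split, at most 200 epochs) and shows how the ten per-run confusion matrices can be merged before computing per-class recall and precision. The solver choice and function names are our assumptions, not the authors' code.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix

def evaluate(X_train, y_train, X_test, y_test, hidden=10, repeats=10):
    """Train the MLP several times and merge the test confusion matrices."""
    classes = np.unique(np.concatenate([y_train, y_test]))
    combined = np.zeros((len(classes), len(classes)), dtype=int)
    for seed in range(repeats):
        net = MLPClassifier(hidden_layer_sizes=(hidden,), activation="tanh",
                            solver="adam", early_stopping=True,
                            validation_fraction=0.2, max_iter=200,
                            random_state=seed)
        net.fit(X_train, y_train)
        combined += confusion_matrix(y_test, net.predict(X_test), labels=classes)
    recall = np.diag(combined) / combined.sum(axis=1)      # per-class recall
    precision = np.diag(combined) / combined.sum(axis=0)   # per-class precision
    return combined, recall, precision
```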

We experimented with four training datasets. Trainset1 was the standard KDDCup99 training dataset. Trainset2 was obtained by oversampling the rare classes from the KDDCup99 dataset, resulting in 22,611 samples from each class. Using the SMOTE technique, we oversampled the rare classes U2R, Probe and R2L, resulting in 5,000 samples per class in trainset3; in this training set, all original samples from the DoS and Normal classes were retained. Finally, trainset4 consists of the same samples from trainset3 for the rare classes, while DoS and Normal were downsampled such that each contains 5,000 examples.

In the first experiment, we varied the number of hidden neurons (S1 = 5, 10, 15, 20, 25, 30) and estimated the classification accuracy on trainset1 with threshold = 0 (ordinary NN classification). Beyond S1 = 10, there was no significant improvement of the classification accuracy; hence, the rest of the experiments were performed with 10 hidden neurons.

In the second experiment, we compared recall, precision and F-value on the test set when the neural networks were trained on the four training sets discussed above. The results are reported in Table 3. The best recall rates on the frequent classes (DoS and Normal) were obtained using the original KDDCup99 data, but acceptable results were also obtained using oversampling and the SMOTE technique (trainsets 2 and 3). On the rare classes Probe and R2L, the best performance was obtained using oversampling (trainset2). Class U2R was the most difficult to detect: the best recall obtained was 17.7% (using SMOTE combined with undersampling). The achieved precision on the DoS class was good using any of the training sets, but the best results were obtained on trainset2 and trainset1. The best precision for the Normal class was also achieved using oversampling. For the rare class Probe, trainset1 and trainset2 provided the best precision. However, for the rare classes U2R and R2L the achieved precision was rather small using any of the training sets.

TABLE 3: PERFORMANCE METRICS ON KDDCUP99 TEST DATA USING DIFFERENT TRAINING SETS, S1=10 HIDDEN NEURONS AND ZERO DECISION THRESHOLD

Recall
Training set   DoS     R2L     U2R     Probe   Normal
Trainset1      96.7%   2.56%   1.72%   72.3%   97.2%
Trainset2      96.7%   49.7%   1.33%   76.7%   96.7%
Trainset3      90.9%   40.6%   2.20%   75.7%   93.3%
Trainset4      73.4%   53.7%   17.7%   75.0%   45.1%

Precision
Training set   DoS     R2L     U2R     Probe   Normal
Trainset1      99.1%   5.04%   17.5%   75.2%   73.8%
Trainset2      99.4%   6.09%   7.90%   74.6%   75.7%
Trainset3      99.0%   2.60%   12.3%   23.1%   72.3%
Trainset4      91.5%   0.81%   22.3%   5.48%   67.1%

F-value
Training set   DoS     R2L     U2R     Probe   Normal
Trainset1      97.9%   3.40%   3.14%   73.7%   83.9%
Trainset2      98.0%   11.8%   2.28%   75.8%   84.9%
Trainset3      94.8%   4.88%   3.73%   35.4%   81.5%
Trainset4      81.5%   1.60%   19.7%   10.2%   53.9%

Overall performance, measured by the F-value, was best using oversampling, with the exception of class U2R, where classification was best when trainset4 was used for training. Surprisingly, techniques based on the SMOTE method were not particularly successful.

To gain better insight into the character of the misclassifications, we examined the combined confusion matrices (obtained by merging the confusion matrices from the 10 experimental repetitions), see Table 4. As can be seen, the main reason why classification was poor on the rare classes R2L and U2R using the original training set was the wrong assignment of the "Normal" class label. Oversampling increased the number of samples from the R2L class in the training set and helped more of the examples from this class to be properly classified.

To examine the influence of introducing a decision margin (represented by a threshold value), we plot the dependence of the misclassification rate on the recall rate for different values of the threshold (0; 0.01; 0.05; 0.1; 0.2; 0.3; 0.4), see Fig. 2. For the same threshold, better performance is achieved when the misclassification rate is small and the recall is high (lower right part of the diagrams). Generally, a decision threshold > 0 results in a smaller recall rate (some samples correctly classified with threshold = 0 may not be classified) but also in a smaller number of misclassified examples (examples wrongly classified when the threshold is zero are not assigned to any class). The classifier with a margin is hence useful if, for the specific value of the threshold, the drop in recall is smaller than the drop in misclassification rate.


In such cases, the examples that are not classified may be subject to additional classification by another learning algorithm (or, in a practical setting, by a human expert). For example, for class U2R, when the margin was set to 0.1, the recall rate dropped from 17.7% to 14.5% while the misclassification rate decreased from 82.3% to 70.5% (when the networks were trained on trainset4).
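The recall/misclassification trade-off plotted in Fig. 2 can be reproduced, for one class, by sweeping the margin over the stored network outputs; outputs (one row of output-neuron activations per test example), y_true and the function name margin_curve are hypothetical, intended only to illustrate the bookkeeping, not the authors' analysis code.

```python
import numpy as np

def margin_curve(outputs, y_true, target_class,
                 thresholds=(0, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4)):
    """Recall and misclassification rate on one class as the decision margin grows;
    examples whose top-two outputs differ by less than the margin stay unclassified
    and count toward neither quantity."""
    rows = np.asarray(outputs)[np.asarray(y_true) == target_class]
    curve = []
    for t in thresholds:
        correct = wrong = 0
        for o in rows:
            order = np.argsort(o)
            if o[order[-1]] - o[order[-2]] > t:          # Eq. (4): confident decision
                if order[-1] == target_class:
                    correct += 1
                else:
                    wrong += 1
        curve.append((t, correct / len(rows), wrong / len(rows)))
    return curve
```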

TABLE 4: COMBINED CONFUSION MATRICES ON KDDCUP99 TEST DATA USING DIFFERENT TRAINING SETS, S1=10 HIDDEN NEURONS AND ZERO DECISION THRESHOLD

Trainset1                      Predicted connection label
Actual           DoS        R2L      U2R      Probe    Normal
DoS              2236960    994      99349    22558    65589
R2L              100        63       249      336      1712
U2R              4774       540      2510     519      137347
Probe            4811       482      1888     30121    4358
Normal           9712       72       352      6543     589251

Trainset2 (oversampling)       Predicted connection label
Actual           DoS        R2L      U2R      Probe    Normal
DoS              2235609    42266    17343    2306     55066
R2L              1 2 3 623             79      383
U2R              4166       10654    1944     164      128761
Probe            1210       1751     2821     32061    3816
Normal           7709       2236     1870     8360     585756

Fig. 2. Recall (percentage of correctly classified examples) versus misclassification rate (percentage of examples assigned to a wrong class) for the classes DoS, R2L, U2R, Probe and Normal, and for different values of the decision margin (0; 0.01; 0.05; 0.1; 0.2; 0.3; 0.4). Curves are shown for networks trained on trainset1 (KDDCup99 training data), trainset2 (oversampling), trainset3 (SMOTE) and trainset4 (SMOTE with undersampling).

Similarly, for the class Probe, the introduction of threshold = 0.3 on networks trained on trainset2 (oversampling) leads to a decrease of the misclassification rate by 10.5%, while the recall dropped only by 4.2%. However, the introduction of the margin was not always beneficial to performance: e.g., on class Probe, networks trained on trainset4 performed better than networks trained on trainset1 with threshold = 0, but for thresholds larger than 0.05, the latter networks had better performance (higher recall for the same misclassification rates). Similarly, for class R2L, the performance of networks trained on trainset2 and trainset3 approached each other for higher values of the decision margin (threshold = 0.4).

V. CONCLUSIONS AND FUTURE WORK

In this paper, we discussed the application of multilayer perceptrons to network intrusion detection. We considered various techniques for dealing with class imbalance in datasets with skewed class distributions. We also considered the influence of the decision margin (where classification is not performed if the difference between the two largest outputs of the neural network is smaller than a threshold) on the prediction accuracy. All the experiments were performed on the KDDCup99 network intrusion dataset. The best classification performance was obtained using the over-sampling technique. However, the U2R class of attacks remains difficult to predict (all the examined techniques typically assign samples belonging to this group to the Normal class). Hence, it would be interesting to construct a classifier that would discriminate between class U2R and normal samples. We believe that the use of such a classifier, combined with the networks considered here, could lead to further improvements in prediction performance.

The experiments discussed in this paper did not perform feature selection: instead, all attributes from the dataset (with non-zero variance) were used to train and test the prediction model. Further improvements in accuracy are possible when feature selection/feature extraction techniques are introduced, and this is part of our work in progress.

The SMOTE technique, as proposed in [15], considers all data from each class to belong to a single compact cluster. A part of our ongoing research effort is to examine techniques which first perform unsupervised learning (clustering) on each class and then apply the SMOTE technique to the resulting clusters.

REFERENCES

[1] W. Lee and S. J. Stolfo, "Data Mining Approaches for Intrusion Detection," Proceedings of the 1998 USENIX Security Symposium, 1998.
[2] E. Bloedorn, et al., "Data Mining for Network Intrusion Detection: How to Get Started," MITRE Technical Report, August 2001.
[3] J. Luo, "Integrating Fuzzy Logic With Data Mining Methods for Intrusion Detection," Master's thesis, Department of Computer Science, Mississippi State University, 1999.
[4] D. Barbara, N. Wu, and S. Jajodia, "Detecting Novel Network Intrusions Using Bayes Estimators," First SIAM Conference on Data Mining, Chicago, IL, 2001.
[5] S. Manganaris, M. Christensen, D. Serkle, and K. Hermix, "A Data Mining Analysis of RTID Alarms," Proceedings of the 2nd International Workshop on Recent Advances in Intrusion Detection (RAID 99), West Lafayette, IN, September 1999.


[6] D. E. Denning, "An Intrusion Detection Model," IEEE Transactions on Software Engineering, SE-13:222-232, 1987.
[7] H. S. Javitz and A. Valdes, "The NIDES Statistical Component: Description and Justification," Technical Report, Computer Science Laboratory, SRI International, 1993.
[8] F. Provost and T. Fawcett, "Robust Classification for Imprecise Environments," Machine Learning, vol. 42/3, pp. 203-231, 2001.
[9] M. Joshi, R. Agarwal, and V. Kumar, "Mining Needles in a Haystack: Classifying Rare Classes via Two-Phase Rule Induction," Proceedings of the ACM SIGMOD Conference on Management of Data, May 2001.
[10] M. Joshi, R. Agarwal, and V. Kumar, "Predicting Rare Classes: Can Boosting Make Any Weak Learner Strong?," Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada, 2002.
[11] M. Kubat, R. Holte, and S. Matwin, "Machine Learning for the Detection of Oil Spills in Satellite Radar Images," Machine Learning, vol. 30, pp. 195-215, 1998.
[12] N. Japkowicz, "The Class Imbalance Problem: Significance and Strategies," Proceedings of the 2000 International Conference on Artificial Intelligence (IC-AI'2000): Special Track on Inductive Learning, Las Vegas, Nevada, 2000.
[13] D. Lewis and J. Catlett, "Heterogeneous Uncertainty Sampling for Supervised Learning," Proceedings of the Eleventh International Conference on Machine Learning, San Francisco, CA, pp. 148-156, 1994.
[14] C. Ling and C. Li, "Data Mining for Direct Marketing: Problems and Solutions," Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY, 1998.
[15] N. Chawla, K. Bowyer, L. Hall, and P. Kegelmeyer, "SMOTE: Synthetic Minority Over-Sampling Technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
[16] J. Quinlan, C4.5: Programs for Machine Learning, San Mateo, CA: Morgan Kaufmann, 1992.
[17] W. Cohen, "Fast Effective Rule Induction," Proceedings of the Twelfth International Conference on Machine Learning, Lake Tahoe, CA, pp. 115-123, 1995.
[18] C. Stanfill and D. Waltz, "Toward Memory-based Reasoning," Communications of the ACM, vol. 29, no. 12, pp. 1213-1228, 1986.
[19] S. Cost and S. Salzberg, "A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features," Machine Learning, vol. 10, no. 1, pp. 57-78, 1993.
[20] KDD-Cup 1999 Task Description, http://kdd.ics.uci.edu/databases/kddcup99/task.html
[21] R. P. Lippmann, D. J. Fried, I. Graf, J. W. Haines, K. P. Kendall, D. McClung, D. Weber, S. E. Webster, D. Wyschogrod, R. K. Cunningham, and M. A. Zissman, "Evaluating Intrusion Detection Systems: The 1998 DARPA Off-line Intrusion Detection Evaluation," Proceedings of the DARPA Information Survivability Conference and Exposition (DISCEX) 2000, vol. 2, pp. 12-26, IEEE Computer Society Press, Los Alamitos, CA, 2000.
[22] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd edn, Prentice Hall, 1998.
[23] M. J. D. Powell, "Restart procedures for the conjugate gradient method," Mathematical Programming, vol. 12, pp. 241-254, 1977.
