Exploiting unlabeled data to improve peer-to-peer traffic classification using incremental tri-training method

Bijan Raahemi & Weicai Zhong & Jing Liu

Received: 7 August 2008 / Accepted: 5 December 2008 / Published online: 9 January 2009
© Springer Science + Business Media, LLC 2009

Abstract Unlabeled training examples are readily available in many applications, but labeled examples are fairly expensive to obtain. For instance, in our previous work on classification of peer-to-peer (P2P) Internet traffic, we observed that only about 25% of examples can be labeled as “P2P” or “NonP2P” using a port-based heuristic rule. We also expect that even fewer examples can be labeled in the future as more and more P2P applications use dynamic ports. This fact motivates us to investigate techniques which enhance the accuracy of P2P traffic classification by exploiting the unlabeled examples. In addition, Internet data flows dynamically in large volumes (streaming data). In P2P applications, new communities of peers often join and old communities of peers often leave, requiring the classifiers to be capable of updating the model incrementally and dealing with concept drift. Based on these requirements, this paper proposes an incremental Tri-Training (iTT) algorithm. We tested our approach on a real data stream with 7.2 million labeled examples and 20.4 million unlabeled examples. The results show that the iTT algorithm can enhance the accuracy of P2P traffic classification by exploiting unlabeled examples. In addition, it can effectively deal with the dynamic nature of streaming data to detect changes in communities of peers. We extracted attributes only from the IP layer, eliminating the privacy concern associated with techniques that use deep packet inspection.

Keywords Stream data mining . Concept drift . Windowing technique . Tri-training . Unlabeled data . Peer-to-peer traffic . IP traffic identification

1 Introduction

Peer-to-Peer (P2P) is a type of Internet application that allows a group of interested users to share their files and computing resources. P2P applications utilize significant bandwidth and network resources, resulting in network congestion, affecting the availability, reliability and quality of services, and potentially reducing customer satisfaction. Some studies show that as much as 70% of broadband traffic is P2P [1, 2]. While allocating equipment for such significant network usage, telecom carriers and service providers do not gain proportional profits from the services they offer through their infrastructure. As such, telecommunication equipment vendors and Internet Service Providers are interested in efficient solutions to identify and filter P2P traffic for further control and regulation.

However, with the growth of Internet traffic, in terms of the number and type of applications, traditional identification techniques such as port matching, protocol decoding or packet payload analysis are no longer effective. In particular, P2P applications may use randomly selected non-standard ports to communicate, which makes it difficult to distinguish them from other types of traffic by inspecting only port numbers [3]. Thus, in recent years, several data mining techniques were proposed to identify Internet traffic

Peer-to-Peer Netw Appl (2009) 2:87–97
DOI 10.1007/s12083-008-0022-6

B. Raahemi (*) : W. Zhong
Telfer School of Management, University of Ottawa,
55 Laurier Ave., Ottawa, ON K1N 6N5, Canada
e-mail: [email protected]

W. Zhong
e-mail: [email protected]

J. Liu
Institute of Intelligent Information Processing, Xidian University,
No.2 South Taibai Road, Xi’an, Shaanxi 710071, P.R. China
e-mail: [email protected]

based on statistical characteristics [4–7]. These approaches assume that P2P applications typically send data in some sort of pattern, and these patterns can be used as a means of identification. Thus, identification problems are equivalent to classification problems defined as follows: a set of N training examples of the form (X, y) is given, where y is a discrete class label and X is a vector of d attributes, each of which may be symbolic or numeric. The goal is to produce a model y = f(X) from these examples which will predict the classes y of future examples X with high accuracy. This allows identification techniques to adapt more easily to the dynamic nature of Internet traffic. For instance, P2P applications can be identified without knowing the port number in advance.

Internet data flows dynamically in large volumes, and as such, it is a typical type of streaming data. In our previous work [6, 7], in order to detect new communities of peers joining and old communities of peers leaving (a phenomenon called concept drift in data mining and machine learning), we used a window-based classification method that divides stream data into fixed-size windows of examples, builds a model on a window, and then re-evaluates the model on each upcoming window. When the accuracy drops below 90%, the model is rebuilt using the subsequent window. This method, however, is not computationally efficient because it rebuilds the model even when no concept drift occurs. To overcome this problem, we introduce a statistical test to determine when concept drift occurs, and then decide whether or not to rebuild the model. Thus, the computational cost can be reduced and concept drifts can be detected effectively.

Before applying data mining techniques, we first perform pre-processing on the data stream to label the data into two classes, namely “P2P” and “NonP2P”, using a port-based heuristic rule. However, we found that only about 25% of the data can be labeled, and the remaining 75% are still unlabeled. Furthermore, with more and more P2P applications using dynamic ports, even fewer examples can be labeled. Thus, we are left with a large volume of unlabeled data. In data mining and machine learning, techniques that exploit both labeled and unlabeled examples are called semi-supervised learning. In this work, we use the Tri-Training (TT) method [8] due to its simplicity and effectiveness. Since Tri-Training is an offline method, we extend it with a windowing technique and a statistical test, and name it the incremental Tri-Training (iTT) algorithm. Thus, it can be used to process streaming data and track concept drifts.

We present the results of our study, where we captured an Internet traffic data stream at the campus gateway of the University of Ottawa, performed preprocessing on the data stream, and prepared a training dataset to which the iTT algorithm can be applied. The attributes are extracted only from IP layer data streams. That is to say, our approach relies only on the IP header of the packets, eliminating the privacy concern associated with techniques that use deep packet inspection. In the experiments, we apply our approach to a P2P traffic data stream with 7.2 million examples labeled as “P2P” and “NonP2P” traffic, and 20.4 million unlabeled ones. The results show that our approach improves classification accuracy by using unlabeled examples and effectively detects changes in communities of peers.

The rest of the paper is organized as follows. The next section describes related works on both P2P traffic classification and semi-supervised learning methods. Section 3 describes how to label the original P2P traffic data. Section 4 describes the iTT algorithm. Experiments are given in Section 5, and the last section presents the conclusions.

2 Related works

2.1 Related works on P2P traffic classification

P2P traffic identification has recently gained much attention in both academic and industrial research communities. Various solutions have been developed for P2P traffic classification. A popular approach is TCP port-based analysis, where tools such as Netflow [9] and cflowd [10] are configured to read the service port numbers in the TCP/UDP packet headers and compare them with the known (default) port numbers of P2P applications. The packets are then classified as P2P if a match occurs. Although P2P applications have default port numbers, newer versions allow the user to change the port numbers, or choose a random port number within a specified range. Hence, port-based analysis becomes inefficient and misleading.

A method using application signatures was developed by Sen et al. in [11], exploiting the fact that Internet applications have a unique string (signature) located in the data portion of the packet (payload). They used the available information in the proprietary P2P protocol specifications in conjunction with information extracted from packet-level trace analysis to identify the signatures, and classified the packets accordingly. This signature detection approach is process intensive, and performs deep packet inspection, which might not be possible in cases where privacy is required. Also, most P2P applications can encrypt the data, making it impossible to detect the signature.

Karagiannis et al. [12] proposed a P2P traffic identification method based on transport layer analysis. This approach relies on the connection-level patterns of P2P traffic by observing the behavior of P2P applications’ communications. Although this method was able to detect 95% of P2P flows on an OC48 (2.4 Gbps) backbone link, it also has some limitations. First, the approach can be misled if a P2P application uses port numbers of applications with the


same behavior. Second, the approach is misled if a user runs different P2P applications at the same time, or runs the same P2P application to download different files from different peers.

Researchers have also considered the behavioral and statistical characteristics of Internet traffic to identify P2P applications. Zander et al. in [4] proposed a framework for IP traffic classification based on a flow’s statistical properties using an unsupervised machine learning technique. While the authors planned to evaluate their approach using a larger number of flows and more applications, they indicated that the accuracy and performance of the resulting classifier had not yet been evaluated.

Zuev et al. proposed a supervised machine learning approach in [5] to classify network traffic. They started by allocating flows of traffic to one of several predefined categories: Bulk, DataBase, Interactive, Mail, WWW, P2P, Service, Attack, Games and Multimedia. They then utilized 248 per-flow discriminators (characteristics) to build their model using Naive Bayes analysis. They evaluated the performance of the solution in terms of accuracy (the raw count of flows that were classified correctly divided by the total number of flows) and trust (the probability that a flow that has been classified into a class is in fact from that class). Although this approach is promising, there is a question about its scalability, as it involves too many discriminators, and it takes too much time to prepare the data (with many attributes) and assign the traffic flows to predefined categories. To overcome this, Moore and Zuev used a Fast Correlation-Based Filter and a variation of a wrapper method to reduce the number of discriminators [13]. Furthermore, Auld et al. classified Internet traffic using a Bayesian neural network to improve classification accuracy [14].

Raahemi et al. applied supervised machine learning techniques in [6, 7], namely neural networks and decision trees, to classify P2P traffic. They pre-processed and labeled the data, and built several models using a combination of different attributes for various ratios of P2P/NonP2P in the training data set.

2.2 Related works on semi-supervised learning

In some practical applications, unlabeled examples are readily available but labeled ones are fairly expensive to obtain because they require human intervention and effort. Semi-supervised learning addresses this problem by using a large amount of unlabeled data, together with the labeled data, to build better classifiers. In the case of self-training [20], a classifier is first trained with the small amount of labeled data. It is then used to classify the unlabeled data. The most confidently classified unlabeled data, together with their predicted labels, are added to the training dataset. The classifier is re-trained and the procedure is repeated. Because semi-supervised learning requires less human effort and gives higher accuracy, it has become a hot topic both in theory and in practice. A number of semi-supervised learning algorithms [8, 15–19] were proposed in the past years. Please see [20] for a detailed survey of semi-supervised learning methods.
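The self-training loop described above can be sketched as follows. This is a minimal illustration, not the implementation used in [20]: the function name, the `threshold` parameter, and the estimator interface with `fit`/`predict_proba` are our own illustrative choices.

```python
def self_train(clf, L_X, L_y, U_X, threshold=0.9, rounds=5):
    """Iteratively move confidently predicted unlabeled examples into the
    labeled set and re-train; `clf` is any fit/predict_proba estimator."""
    L_X, L_y, U_X = list(L_X), list(L_y), list(U_X)
    for _ in range(rounds):
        if not U_X:
            break
        clf.fit(L_X, L_y)
        probs = clf.predict_proba(U_X)
        # indices of unlabeled examples predicted with high confidence
        confident = [i for i, p in enumerate(probs) if max(p) >= threshold]
        if not confident:
            break
        for i in confident:
            L_X.append(U_X[i])
            # predicted label = argmax of the class-probability vector
            L_y.append(max(range(len(probs[i])), key=probs[i].__getitem__))
        U_X = [x for i, x in enumerate(U_X) if i not in set(confident)]
    clf.fit(L_X, L_y)
    return clf
```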

A prominent achievement in this area is the co-training paradigm proposed by Blum and Mitchell [15], which trains two classifiers separately on two different views, i.e. two independent sets of attributes, and uses the predictions of each classifier on unlabeled examples to augment the training set of the other. The co-training paradigm has already been used in many areas [21–23].

The standard co-training algorithm [15] requires two sufficient and redundant views; that is, the attributes are naturally partitioned into two sets, each of which is sufficient for learning and conditionally independent of the other given the class label. Unfortunately, such a requirement can hardly be met in most scenarios. Goldman and Zhou [16] proposed an algorithm which does not exploit attribute partitions. However, it requires using two different supervised learning algorithms that partition the instance space into a set of equivalence classes, and employs a time-consuming cross-validation technique to determine how to label the unlabeled examples and how to produce the final hypothesis. Zhou and Li [8] proposed a new co-training style algorithm named tri-training. Tri-training does not require sufficient and redundant views, nor does it require the use of different supervised learning algorithms whose hypotheses partition the instance space into a set of equivalence classes. It can be easily applied to common data mining scenarios. As such, we select tri-training to exploit unlabeled P2P data in this paper.

3 Labeling the P2P traffic data streams

To identify P2P traffic using data mining techniques, we first need a training data stream. However, not all of the information in the original IP headers is useful for identifying P2P traffic. Also, there is no class information in the original IP headers. Thus, we transform the original IP headers into the training data stream using the following process.

Using Tcpdump, we captured the IP packet headers of two-way Internet traffic at the campus gateway over 5 days at different time periods in April 2006. In total, 37 files were generated with different sizes. Using Windump, sample entries were extracted from the captured files and transformed from binary format into a readable text format. Sample full IP headers with the protocol being TCP or UDP are shown in Table 1.

Peer-to-Peer Netw Appl (2009) 2:87–97 8989

To make the IP headers suitable for data mining techniques, we consider each IP header as one example, and label all examples into three classes, namely “P2P”, “NonP2P”, and “Unknown”, based on their “source port” and “destination port” numbers. The labeling rule is:

The port numbers used in the above rule represent the default port numbers of the most popular P2P applications. It is worth noting that in our previous work [6, 7] we inspected several P2P applications to confirm that they all allow users to randomly select port numbers only in the range of 1024–65535.

After all examples are labeled into three classes, we then select the most useful information in the IP headers as the attributes, since irrelevant and redundant information may degrade the performance of the data mining method and increase the computational cost. First, we remove the “tos”, “offset”, “flags” and “cksum” fields, which are almost-unary attributes (more than 95% of the values are the same). The “id” and “ack” fields are nearly random numbers and contain no information to differentiate records. Since “length”1 can be calculated from the two “sequence number” values in the TCP header and, in addition, “packet length” and “length” can be deduced from each other, the two “sequence number” fields and “length” are redundant and thus can be removed. In addition, “arrival time” is implicitly considered in our analysis since IP header records are fed to the algorithm in sequence. Accordingly, we select “ttl”, “protocol”, “packet length”, “source IP”, “destination IP”, and “win” as attributes. The attributes “source IP” and “destination IP” were originally captured in dotted-decimal notation, and were then binned into 256 bins according to the value of their first octet. It is worth noting that we do not use “source port” and “destination port” in the mining process because they are already used to label the examples. Thus, our approach can find the relation between the attributes and the classes without considering the port numbers.

Using the above process, we can transform IP header streams into labeled example streams with six symbolic attributes, which can be directly used to train the iTT model. In turn, the trained model can be used to predict whether an Internet application is P2P traffic or not, according to the IP headers.
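As an illustration, the attribute extraction described above might be sketched as follows. The function and field names are our own, and we assume the IP header has already been parsed into a dictionary; the binning of IPs by first octet follows the text.

```python
def extract_attributes(hdr: dict) -> dict:
    """Map a parsed IP header to the six symbolic training attributes."""
    return {
        "ttl": hdr["ttl"],
        "protocol": hdr["protocol"],
        "packet_length": hdr["packet_length"],
        # dotted-decimal IPs are binned into 256 bins by first octet
        "source_ip_bin": int(hdr["source_ip"].split(".")[0]),
        "destination_ip_bin": int(hdr["destination_ip"].split(".")[0]),
        "win": hdr["win"],
    }
```

Note that “source port” and “destination port” are deliberately absent: they were consumed by the labeling rule and must not leak into the attributes.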

4 Incremental tri-training for P2P traffic classification

In this section, we first give a brief introduction of the tri-training algorithm and extend it to an incremental one using a

1 “length” denotes the length field after “>” in Table 1.

Table 1 Sample records of full IP header extracted by Windump

Protocol IP header

TCP 15:39:54.369946 IP (tos 0x0, ttl 127, id 35950, offset 0, flags [DF], proto: TCP (6), length: 603) 137.122.72.6.3684 > 137.122.14.100.80:P 0:563(563) ack 82 win 17439

UDP 15:39:54.369535 IP (tos 0x0, ttl 127, id 19203, offset 0, flags [none], proto: UDP (17), length: 129) 137.122.69.220.59155 > 83.50.166.156.25307: UDP, length 101

If (source port OR destination port) < 1024
    Then Class ← “NonP2P”
Else If (source port OR destination port) ∈ {well-known standard port numbers, including 1214, 6881, 6889, 6699, 6700, 6701, 4661, 4665, 4672, 4662, 6346, 6347, 6348, 6349, 6257, 1044, 1045, 1337, 2340, 2705, 4500, 4329, 5190, 5500, 5501, 5502, 5503, 6666, 6667, 7668, 7788, 8038, 8080, 28864, 8311, 8888, 8889, 41170, 3074, 3531}
    Then Class ← “P2P”
Else Class ← “Unknown”

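A minimal sketch of this labeling rule in Python (the function name is ours; the port set is copied from the rule above):

```python
# Default ports of popular P2P applications, from the rule above.
P2P_PORTS = {1214, 6881, 6889, 6699, 6700, 6701, 4661, 4665, 4672, 4662,
             6346, 6347, 6348, 6349, 6257, 1044, 1045, 1337, 2340, 2705,
             4500, 4329, 5190, 5500, 5501, 5502, 5503, 6666, 6667, 7668,
             7788, 8038, 8080, 28864, 8311, 8888, 8889, 41170, 3074, 3531}

def label_packet(src_port: int, dst_port: int) -> str:
    """Label one IP header as NonP2P, P2P, or Unknown from its ports."""
    if src_port < 1024 or dst_port < 1024:
        return "NonP2P"      # a well-known (reserved) service port
    if src_port in P2P_PORTS or dst_port in P2P_PORTS:
        return "P2P"         # default port of a popular P2P application
    return "Unknown"         # dynamic/random port: the example stays unlabeled
```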


windowing technique. Then a statistical test is presented toautomatically detect concept drifts.

4.1 Tri-training

Tri-training [8] is a variant of the co-training algorithm which uses three classifiers instead of two. In tri-training, the three classifiers are trained first. They are then used to classify the unlabeled data. If two of them agree on the classification of an unlabeled example, that example with its predicted label is used to re-train the third classifier in the next round. This approach thus avoids the need for explicitly measuring the labeling confidence of any classifier. It can be applied to datasets without different views (sufficient and redundant views), or different types of classifiers.

Let L denote the labeled example set and U denote the unlabeled example set. In the first step, three classifiers, namely h1, h2, and h3, are initially trained from L. Then, for any classifier, an unlabeled example can be labeled for it as long as the other two classifiers agree on the labeling of this example. For instance, if h2 and h3 agree on the labeling of an example x in U, then x can be labeled for h1. In each round, the classifiers h2 and h3 choose some examples in U to label for h1. Since the classifiers are refined in the learning process, the number of unlabeled examples chosen to label may differ in different rounds. Let Lt and Lt−1 denote the sets of examples that are labeled for h1 in the t-th round and the (t−1)-th round, respectively. Then the training sets for h1 in the t-th round and the (t−1)-th round are L ∪ Lt and L ∪ Lt−1, respectively. Note that the unlabeled examples labeled in the (t−1)-th round, i.e. Lt−1, will not be put into the original labeled example set L. Instead, in the t-th round all the examples in Lt−1 will be regarded as unlabeled and put into U again. A more detailed description of tri-training can be found in [8].
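The agreement mechanism above can be sketched as a single, simplified round. This is an illustration only: the error-rate acceptance conditions of the full tri-training algorithm [8] are omitted, and any classifier objects with `fit`/`predict` methods are assumed.

```python
def tri_training_round(h, L_X, L_y, U_X):
    """h = [h1, h2, h3]; returns the newly labeled set (Lt_X, Lt_y)
    chosen for each classifier by the agreement of the other two."""
    preds = [clf.predict(U_X) for clf in h]
    new_sets = []
    for i in range(3):
        j, k = [m for m in range(3) if m != i]
        # an unlabeled example is labeled for h_i when h_j and h_k agree
        agree = [n for n in range(len(U_X)) if preds[j][n] == preds[k][n]]
        new_sets.append(([U_X[n] for n in agree],
                         [preds[j][n] for n in agree]))
    for i, (Lt_X, Lt_y) in enumerate(new_sets):
        # re-train h_i on L plus Lt; Lt is not merged into L permanently
        # (in the next round these examples return to U)
        h[i].fit(L_X + Lt_X, L_y + Lt_y)
    return new_sets
```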

Since tri-training is an offline algorithm, it is not suitable for streaming data. Thus, we extend the tri-training algorithm to an incremental version using a windowing technique, explained in the following subsection.

4.2 Windowing technique

The windowing technique is a scalable technique used in [24, 25]. The basic scheme is that the examples in data streams are divided into disjoint windows of size w, and the tri-training algorithm is built on the first window. Then, the trained model is kept in memory and its accuracy is periodically re-evaluated on upcoming windows. The tri-training algorithm is rebuilt only if the accuracy drops significantly; that is, when concept drift occurs. This method is easy to implement and avoids repetitive learning while providing stable performance for the trained model. The question is how to select a proper window size as a compromise between fast adaptation and acceptable generalization when there is no concept drift. Fortunately, the rate of concept drift (communities of peers leaving and joining) in P2P traffic is relatively slow. Therefore, it is not necessary to maintain a window with an adaptive size.

4.3 Automatically detecting concept drift

The trained model may lose accuracy over a period of time because of changes in communication patterns, namely concept drift. Therefore, periodic re-evaluation is required to keep it accurate and up to date. We keep the model in memory and rebuild it using new instances. The new model assesses the accuracy of the classifier according to the recent observations. It is not necessary to rebuild the model with each upcoming window. We rely on the model as long as it exhibits accuracy higher than a threshold on periodically selected windows; otherwise, tri-training is re-executed on the subsequent windows.

In general, when concept drift occurs, the performance of the trained models may drop significantly. The faster the algorithms detect such a drop and take actions to adjust the trained models, the better the algorithms are. Let the current trained model be h. The model h can be used to predict the class label of the incoming examples, and the change in the accuracy of h can be used to detect the occurrence of concept drift. Accordingly, we use the following method, which was also used with Bayesian classifiers in [25].

Suppose the accuracy of h is p0, and let h predict the w incoming examples. If the performance of h has not changed, the number of correct predictions has the binomial distribution b(w, p0). The null hypothesis, H0, states that there is no change in the performance of h, which means that the current mean accuracy p has not dropped compared to p0. The alternative hypothesis, H1, says that the mean accuracy has dropped, that is, p < p0. The decision rule is that we reject H0 and accept H1 if the mean accuracy p for the w incoming examples drops with a significance level of α. Under H0, μ (= wp) has the distribution N(wp0, wp0(1 − p0)) after the binomial distribution is approximated by the normal. The critical region is then calculated using the following formula:

(μ − μ0) / (σ/√w) ≤ −z(α)    (1)

where μ0 = wp0 and σ = √(wp0(1 − p0)).

Peer-to-Peer Netw Appl (2009) 2:87–97 9191

Equation (1) is the standardized test statistic. If the observed value z = (μ − μ0)/(σ/√w) is smaller than −z(α), we reject H0 in favor of H1: μ < μ0; that is, concept drift occurs. Otherwise, if it is larger than −z(α), there is not enough evidence to reject H0, which means there is no concept drift.
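For illustration, the drift check might be sketched as follows. This is a hypothetical helper, not the paper's code; it uses the textbook binomial z-statistic (μ − μ0)/σ for the one-sided test, whose scaling differs from Eq. (1) by a factor involving √w.

```python
from math import sqrt
from statistics import NormalDist

def concept_drift(correct: int, w: int, p0: float, alpha: float = 0.1) -> bool:
    """True if accuracy over the w-example window dropped significantly below p0."""
    mu0 = w * p0                        # expected correct predictions under H0
    sigma = sqrt(w * p0 * (1.0 - p0))   # binomial standard deviation
    z = (correct - mu0) / sigma         # normal approximation of b(w, p0)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)  # one-sided critical value z(alpha)
    return z < -z_alpha                 # reject H0: performance unchanged
```

For example, a model with reference accuracy p0 = 0.9 that gets only 800 of 1000 window examples right falls far into the critical region, while 900 of 1000 does not.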

4.4 iTT: Incremental tri-training algorithm

Accordingly, the iTT algorithm is summarized in Algorithm 1. From Algorithm 1, it can be seen that the tri-training algorithm is first built on the window W1, resulting in the model Λ1. In addition, the classification accuracy pΛ and the standard deviation σΛ of model Λ1 are computed using W1 (if there exists a validation set V1, then pΛ and σΛ calculated on V1 will be more accurate). Then we calculate the classification accuracy pi on each upcoming window Wi to see whether pi drops significantly compared to pΛ, namely, whether the current model Λ1 is still accurate or not, by using the standardized test statistic. If the current model Λ1 is no longer accurate, then a new model Λ2 is built on Wi, and pΛ and σΛ are re-calculated.

5 Experiments

We use a real data set to test the performance of the iTT algorithm in identifying P2P traffic. Following the procedure explained in Section 3, 7.2 million (7200 K) examples are labeled as P2P and NonP2P, and 20.4 million examples are unlabeled, about three times the number of labeled ones. They are divided into windows according to the number of labeled examples within them. In our experiments, we first investigate the improvement of learning performance obtained by exploiting unlabeled examples on P2P traffic datasets of different sizes. Then, we apply iTT to the P2P traffic data stream. Finally, we analyze the effect of parameters on iTT’s performance. The iTT algorithm used in our experiments is implemented in Java, based on the tri-training

Algorithm 1: iTT(S, w, α)

input:
  S: labeled example stream, split into windows W1, W2, …
  w: the number of labeled examples in each Wi, i = 1, 2, …
  α: the significance level

1. Train the initial model Λ1 using tri-training(W1,L, W1,U, J4.8), where W1,L and W1,U are the
   labeled and unlabeled example sets in W1, respectively, and W1 = W1,L ∪ W1,U.
2. Compute the accuracy pΛ and standard deviation σΛ of the model Λ1; k ← 1.
for i = 2, … do
   3. Measure the accuracy pi of the current model Λk on window Wi.
   4. Compute z = (pi·w − pΛ·w) / (σΛ·√w).
   if z < −z(α) then
      5. Train the model Λk+1 using tri-training(Wi,L, Wi,U, J4.8).
      6. Compute the accuracy pΛ and standard deviation σΛ of the model Λk+1; k ← k + 1.
   end if
end for
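An illustrative driver for Algorithm 1 might look as follows. The names are ours: `tri_train` stands in for tri-training with J4.8 base learners, and `evaluate` is assumed to return the accuracy and standard deviation of a model on a window.

```python
from math import sqrt

def itt(windows, tri_train, evaluate, z_alpha):
    """Process windows in order; rebuild the model only when drift is detected."""
    model = tri_train(windows[0])                 # step 1: initial model
    p_model, sigma = evaluate(model, windows[0])  # step 2: reference accuracy
    rebuilds = 0
    for W in windows[1:]:
        p_i, _ = evaluate(model, W)               # step 3: accuracy on window
        w = len(W)
        z = (p_i * w - p_model * w) / (sigma * sqrt(w))  # step 4: test statistic
        if z < -z_alpha:                          # significant drop: drift
            model = tri_train(W)                  # step 5: rebuild
            p_model, sigma = evaluate(model, W)   # step 6: new reference
            rebuilds += 1
    return model, rebuilds
```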

Table 2 Four different prediction outputs in the confusion matrix

                          Predicted class
                          P2P     NonP2P
Actual class    P2P       TP      FN
                NonP2P    FP      TN


software from Zhou and Li.2 It runs on a Windows®-based PC with a 2.4 GHz CPU and 1 GB RAM.

We use False Negative Rate, False Positive Rate, and Error Rate to measure the performance of classifiers. These measures are defined as follows. A single example has four different prediction outputs; two are correct, namely True Positive (TP), where an example is actually P2P and is classified as P2P, and True Negative (TN), where an example is actually NonP2P and is classified as NonP2P. Accordingly, there are two false predictions, namely False Positive (FP), where an example is classified as P2P but is actually NonP2P, and False Negative (FN), where an example is classified as NonP2P but is actually P2P. They are visually represented in the confusion matrix of Table 2.

According to the four different prediction outputs, False Negative Rate, False Positive Rate, and Error Rate are defined as:

False Negative Rate = FN / (FN + TP)    (2)

False Positive Rate = FP / (FP + TN)    (3)

Error Rate = (FP + FN) / (TP + TN + FP + FN)    (4)

The lower the values of False Negative Rate, False Positive Rate, and Error Rate, the more accurate the classifier is.
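The three measures can be computed directly from the confusion-matrix counts, for example:

```python
def rates(tp: int, tn: int, fp: int, fn: int):
    """Return (False Negative Rate, False Positive Rate, Error Rate)."""
    fnr = fn / (fn + tp)                   # Eq. (2)
    fpr = fp / (fp + tn)                   # Eq. (3)
    err = (fp + fn) / (tp + tn + fp + fn)  # Eq. (4)
    return fnr, fpr, err
```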

In the first experiment, in order to investigate the effectiveness of using unlabeled data in the classification of P2P traffic, we compare the performance of tri-training (the core algorithm of iTT) with that of J4.8 (the standard algorithm and the baseline classifier in tri-training). Figures 1, 2 and 3 compare False Negative Rate, False Positive Rate, and Error Rate for tri-training and J4.8 using datasets of various sizes from 2K to 20K, applying 10-fold cross validation. First, the overall trend of all three measures decreases with dataset size, because more labeled and unlabeled examples become available as the dataset grows. Further, the False Negative Rate of tri-training decreases from about 9.2% to 2.3%, while that of J4.8 decreases from about 10.8% to 2.8%, and tri-training always outperforms J4.8. In particular, at a dataset size of 6K, the results of tri-training and J4.8 are 3.8% and 6.2%, respectively; that is, the False Negative Rate of tri-training is 38.7% lower than that of J4.8. The False Positive Rate of tri-training is comparable to that of J4.8, but the overall Error Rate of tri-training is, on average, 10% lower than that of J4.8. This clearly demonstrates that exploiting unlabeled examples is effective and useful in P2P traffic classification.
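The gain over J4.8 comes from tri-training's co-labeling rule: for each of the three classifiers, an unlabeled example is adopted as extra training data when the other two classifiers agree on its label. A minimal sketch of that rule (the function name is illustrative; tri-training additionally gates these additions by estimated error, which is omitted here):

```python
def colabel_for(i, predictions):
    """Given predictions[0..2] of three classifiers over the unlabeled pool,
    return (index, label) pairs that classifier i may add to its training
    set: the examples on which the other two classifiers agree."""
    j, k = [c for c in range(3) if c != i]
    return [(idx, predictions[j][idx])
            for idx in range(len(predictions[0]))
            if predictions[j][idx] == predictions[k][idx]]

# Three classifiers' votes over five unlabeled flows ("P2P"/"Non")
preds = [["P2P", "Non", "P2P", "Non", "P2P"],
         ["P2P", "P2P", "P2P", "Non", "Non"],
         ["Non", "P2P", "P2P", "Non", "P2P"]]
print(colabel_for(0, preds))  # classifiers 1 and 2 agree on indices 1, 2, 3
```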

In the second experiment, in order to investigate the performance of iTT on the data stream, we apply iTT to the training data stream. The significance level (α) is set to 0.1, and the window size (w) is 10K. The minimal, average, and maximal False Negative Rate, False Positive Rate, and Error Rate are given in Table 3. The average False Negative Rate, False Positive Rate, and Error Rate are smaller than

2 http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/annex/TriTrain.htm

Fig. 1 Comparison of false negative rates

Fig. 2 Comparison of false positive rates

Peer-to-Peer Netw Appl (2009) 2:87–97 93

4%, 8%, and 6%, respectively. This illustrates that iTT can detect P2P traffic fairly accurately. Figures 4, 5 and 6 show these three measures over the examples. It can be observed that iTT detects 46 concept drifts out of 719 possible points, as denoted by the top subfigures of these three figures.

Figure 4 shows that the False Negative Rate is quite small and stable, which means that iTT can detect most of the P2P traffic; it also validates that the windowing technique and the statistical test method for tracking concept drifts are effective and useful, especially when the rate of change of the communities of P2P peers is relatively slow.
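The statistical test itself is not spelled out in this excerpt; one plausible form, consistent with a significance level α and per-window error rates, is a one-sided two-proportion z-test on consecutive windows (an illustrative stand-in; the exact test used by iTT may differ):

```python
from statistics import NormalDist

def drift_detected(err_old, err_new, n_old, n_new, alpha=0.1):
    """One-sided two-proportion z-test: has the error rate risen
    significantly between two windows? Illustrative stand-in for the
    paper's test; alpha matches the experiment's setting of 0.1."""
    p = (err_old * n_old + err_new * n_new) / (n_old + n_new)  # pooled error
    se = (p * (1 - p) * (1 / n_old + 1 / n_new)) ** 0.5
    z = (err_new - err_old) / se
    return z > NormalDist().inv_cdf(1 - alpha)  # reject H0: no change

# Error jumps from 5% to 12% over two 10K windows -> drift, rebuild the model
print(drift_detected(0.05, 0.12, 10_000, 10_000))   # True
print(drift_detected(0.05, 0.051, 10_000, 10_000))  # False
```

A higher α lowers the rejection threshold, so drifts are declared (and the model rebuilt) more often.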

Figure 5 shows that the False Positive Rate fluctuates more than the False Negative Rate, because the communities of NonP2P peers change faster than those of P2P peers. However, iTT can rapidly detect such changes and rebuild the model quickly. Thus, the curve of False Positive Rate is sharp on the windows where the concept drifts occur (Fig. 5). Figure 6 shows the overall performance, the Error Rate, of iTT. It can be observed that the Error Rate fluctuates less than the False Negative Rate, and its average value is below 6%. In addition, the curve of Error Rate is also bursty, but its bursts are lower. This demonstrates that iTT detects concept drifts and quickly adjusts the trained model.

To demonstrate the effect of the two parameters, significance level and window size, on the performance, the Accuracy and running time over five independent runs, for various significance levels and window sizes, are given in

Fig. 3 Comparison of error rates

Fig. 4 False negative rates of iTT for various numbers of examples

Fig. 5 False positive rates of iTT for various numbers of examples

Table 3 Performance of iTT

                      Min     Average  Max
False negative rate   0.008   0.037    0.117
False positive rate   0.006   0.077    0.439
Error rate            0.011   0.058    0.283


Figures 7 and 8, respectively. The Accuracy is equal to (1 − Error Rate) and is measured as the percentage of correctly classified examples out of the total number of examples. Figure 7 shows that Accuracy increases with the significance level, since a higher significance level rejects the null hypothesis H0 (no change of performance, i.e., no concept drift) with a higher probability, triggering more model rebuilds. Consequently, this incurs a higher computation cost, as seen in Fig. 8. In addition, Accuracy increases with window size, because the iTT algorithm has more examples with which to build the model. Looking at the details, there are two local minima of running time, at window sizes of 8K and 16K; these are caused by the tradeoff between the time to build a model and the number of times the model is rebuilt (due to concept drifts). Thus, combining Accuracy and running time, we suggest selecting a significance level in [0.05, 0.1] and a window size in [8K, 16K].
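The tradeoff in Figs. 7 and 8 follows from the structure of the stream loop: a larger window gives each model more data but makes each rebuild costlier, while a higher significance level triggers rebuilds more often. A schematic of that loop (names are illustrative; `build_model`, `evaluate`, and `drift_test` stand in for iTT's actual components):

```python
def itt_stream(windows, build_model, evaluate, drift_test):
    """Schematic incremental loop: train on the first window, then for each
    new window test for concept drift and rebuild only when drift is flagged."""
    model = build_model(windows[0])
    prev_err = evaluate(model, windows[0])
    rebuilds = 0
    for w in windows[1:]:
        err = evaluate(model, w)
        if drift_test(prev_err, err):   # significant change -> concept drift
            model = build_model(w)      # retrain on the current window
            err = evaluate(model, w)
            rebuilds += 1
        prev_err = err
    return model, rebuilds
```

Total running time is roughly (cost of one rebuild) × (number of rebuilds), which explains the local minima of running time at intermediate window sizes.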

6 Conclusions

Motivated by the observations that only about 25% of the examples can be labeled (and even fewer as more P2P applications use random ports), while unlabeled examples are readily available, we use the tri-training algorithm to enhance the performance of P2P traffic classification by exploiting the unlabeled examples. Furthermore, to handle the dynamic nature of the P2P traffic data stream, we propose the incremental tri-training algorithm, which uses a windowing technique and a statistical test method. Experimental results show that our approach can effectively detect P2P traffic, as well as changes in the communities of peers.

The tri-training algorithm belongs to a family of semi-supervised learning methods that use an iterative technique to handle unlabeled examples, which may result in a high computation cost. As future work, we will focus on reducing the computation cost, and we will also explore incremental semi-supervised learning algorithms that do not use windowing techniques.

Acknowledgement This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Ontario Research Networks on E-Commerce (ORNEC), the National Natural Science Foundations of China under Grants 60872135 and 60502043, and the Program for New Century Excellent Talents in University of China under Grant NCET-06-0857.

Fig. 7 Accuracy versus significance level and window size

Fig. 6 Error rate of iTT for various numbers of examples

Fig. 8 Time versus significance level and window size



Bijan Raahemi is an assistant professor at the Telfer School of Management, University of Ottawa, Canada, with a cross-appointment with the School of Information Technology and Engineering. He received his Ph.D. in Electrical and Computer Engineering from the University of Waterloo, Canada, in 1997. Prior to joining the University of Ottawa, Dr. Raahemi held several research positions in the telecommunications industry, including Nortel Networks and Alcatel-Lucent, focusing on computer network architectures and services, dynamics of Internet traffic, systems modeling, and performance analysis of data networks. His current research interests include knowledge discovery and data mining, information systems, and data communications networks. Dr. Raahemi's work has appeared in several peer-reviewed journals and conference proceedings. He also holds 10 patents in data communications. He is a Senior Member of the Institute of Electrical and Electronics Engineers (IEEE) and a member of the Association for Computing Machinery (ACM).


Jing Liu is an Associate Professor with Xidian University, China. She received a B.S. degree in computer science and technology from Xidian University, Xi'an, China, in 2000, and a Ph.D. in circuits and systems from Xidian University in 2004. Her research interests include data mining, evolutionary computation, and multiagent systems. She is a member of the Institute of Electrical and Electronics Engineers (IEEE).

Weicai Zhong is a post-doctoral fellow at the Telfer School of Management, University of Ottawa, Canada. He received a B.S. degree in computer science and technology from Xidian University, Xi'an, China, in 2000, and a Ph.D. in pattern recognition and intelligent systems from Xidian University in 2004. Prior to joining the University of Ottawa, Dr. Zhong was a senior statistician at SPSS Inc. from Jan. 2005 to Dec. 2007. His current research interests include Internet traffic identification, data mining, and evolutionary computation. He is a member of the Institute of Electrical and Electronics Engineers (IEEE).
