A nonlinear, recurrence-based approach to traffic classification

Francesco Palmieri *, Ugo Fiore
Università degli Studi di Napoli Federico II, CSI, Complesso Universitario Monte S. Angelo, Via Cinthia 5, 80126 Napoli, Italy

Computer Networks 53 (2009) 761–773. doi:10.1016/j.comnet.2008.12.015
Article history: Available online 29 December 2008
Keywords: Recurrence plots; Recurrence quantification analysis; Nonlinear analysis; Traffic classification

* Corresponding author. Tel.: +39 0812537054; fax: +39 081 676628. E-mail addresses: [email protected] (F. Palmieri), ufi[email protected] (U. Fiore).

Abstract

The ability to accurately classify and identify the network traffic associated with different applications is a central issue for many network operation and research topics including Quality of Service enforcement, traffic engineering, security, monitoring and intrusion detection. However, traditional approaches for mapping traffic to higher-level applications, such as those based on port or payload analysis, are highly inaccurate for many emerging applications and hence useless in actual networks. This paper presents a recurrence plot-based traffic classification approach based on the analysis of non-stationary "hidden" transition patterns of IP traffic flows. Such nonlinear properties cannot be affected by payload encryption or dynamic port change and hence cannot be easily masqueraded. In performing a quantitative assessment of the above transition patterns, we used recurrence quantification analysis, a nonlinear technique widely used in many fields of science to discover the time correlations and the hidden dynamics of statistical time series. Our model proved to be effective for providing a deterministic interpretation of recurrence patterns derived by complex protocol dynamics in end-to-end traffic flows, and hence for developing qualitative and quantitative observations that can be reliably used in traffic classification.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

The accurate classification of traffic flows traversing a specific network perimeter and the reliable determination of the applications associated with each flow's endpoints is an essential task for network security and traffic engineering, in order to protect network resources and enforce institutional policies: for example, limiting bandwidth for sharing of music files or gaming, or detecting intrusions, malicious software, or simply new network-hungry applications which may impact the future provisioning of communication resources. The most common identification technique, based on the inspection of "known port numbers", is no longer accurate because many applications no longer use fixed, predictable port numbers. The Internet Assigned Numbers Authority (IANA) assigns the well-known ports from 0 to 1023 and registers port numbers in the range from 1024 to 49151. But many applications have no IANA assigned or registered ports and only utilize 'well known' default ports. Often these ports overlap with IANA ports and an unambiguous identification is no longer possible. Furthermore, some applications (e.g. passive FTP or video/voice communication) use dynamic ports unknowable in advance, and some others (e.g. systems for peer-to-peer information sharing) use a combination of dynamic port numbers, masquerading techniques and encryption to bypass firewalls and avoid detection.

A more reliable technique used in many current industry products involves stateful reconstruction of session and application information from packet content. Although this technique avoids reliance on fixed port numbers, it requires the inspection of the payload of every packet and hence raises privacy issues and imposes significant processing complexity and load on the traffic identification device. It must be kept up-to-date with extensive knowledge of applications and protocols, which can continuously evolve and be poorly documented, and must be powerful enough to perform concurrent analysis of a potentially large number of flows.


This approach can be difficult when dealing with proprietary protocols or encrypted traffic. The strong limitations of port-based and payload-based analysis motivate the success of new traffic classification paradigms based on the study of flow properties that are more difficult to masquerade, such as transport layer statistics. These classification techniques rely on the fact that different applications typically have distinct behavior patterns when communicating on a network. For instance, a large file transfer using FTP would have a longer connection duration and larger average packet size than an instant messaging client sending short occasional messages to other clients. Similarly, some peer-to-peer (P2P) applications such as BitTorrent can be distinguished from FTP data transfers because these P2P connections typically are persistent and bidirectional; FTP data transfer connections are non-persistent (a separate connection is opened for each data transfer operation) and send data only unidirectionally. Transport layer statistics such as the total number of packets sent, the ratio of the bytes sent in each direction, the duration of the connection, and the average size of the packets characterize these behaviors. The statistical flow characteristics that can be considered (possibly in combination) include the total number of packets, mean packet size, mean payload size excluding headers, number of bytes transferred (in each direction and combined), and mean inter-arrival time of packets. Note that analysis and comparison of the distribution of packet sizes in both directions may not always be possible, due to asymmetric routing.

Accordingly, we propose a novel traffic classification scheme, particularly suitable for IP networks, based on nonlinear statistical analysis and, more precisely, on the evaluation of the non-stationary transition patterns in end-to-end traffic flow time series.
That is, in the IP network environment, traffic flow features and hence the characteristics of the probability distributions of their IP-layer packets change dynamically in the time domain [1,2]. It follows that, since power laws apply to changes in traffic density, the statistical characteristics of traffic flows change with the phase transition patterns, and their fractal-like behaviors can be affected by the packet density and its time-variation trend [3]. In addition, a self-organizing model can be considered in order to assess the non-stationary time-variation patterns of end-to-end traffic flows [4]. The dynamic transitional patterns, that is, the fractal-related characteristics of the involved traffic, can be used to precisely describe some aggregated flow properties and hence offer an interesting way to discriminate their characteristics and classify the individual flows. To study the evolution dynamics and specific non-stationary features of the traffic flowing between a pair of hosts, we used recurrence plots (RP), which are a very simple and effective tool for visualizing the variation patterns of such dynamical systems [5,6]. In addition, to obtain quantitative information associated with each generated RP, we applied recurrence quantification analysis (RQA) [7,8], which has been used for measuring the degree of non-stationarity in the above patterns.

The strength of our approach is twofold. First, we base our classification strategy on the nonlinear characteristics of the traffic flows, which are not affected by payload encryption or dynamic port change and hence cannot be easily masqueraded. Second, nonlinear approaches are now considered a prominent alternative to linear time series analysis of traffic data. This is because linear methods cannot account for all the irregular phenomena observed in the network traffic flowing end-to-end between two hosts and, if a nonlinear process underlies the involved time series, the use of linear rules to describe it is conceptually false and can lead to largely erroneous results [9]. The converse does not hold: modeling the irregularities of a traffic series by nonlinear methods does not indicate the existence of nonlinearities either in the time series or in the underlying process generating them. As such, the ability to identify the nature of the series is essential in improving the understanding of the process involved and in providing an approximation, as accurate as possible, of complex traffic data structures such as Internet traffic flows.

Our approach also differs from the majority of the statistics-based classification schemes in its theoretical perspective: we chose a method (recurrence analysis) that does not make any specific assumption on the mathematical structure of the data, does not rely on assumptions of stationarity and does not need to consider the studied traffic data as the output of a linear dynamic system. We demonstrated the possibility of a purely operational use of concepts and techniques derived from complex systems dynamics for developing deterministic qualitative and quantitative observations that can be reliably used in traffic classification. To the best of our knowledge, this is a first attempt at classifying traffic by leveraging only the nonlinear properties of network dynamics.

2. Related work

The idea of using statistical properties to classify traffic flows, or at least to model their behavior, is not new. Some pioneering works by Paxson et al. on Internet traffic characterization [10,11] focus on the relationship between the observed statistical flow properties and the associated applications/protocols. However, such works show that analytical models based on random variables such as packet length, inter-arrival times and flow duration are suitable to express the behavior of only a few protocols, and they do not make any attempt to classify flows according to application layer protocol features. Other works that have evidenced the relationship between the class of traffic and its observed statistical properties include those due to Dewes et al. [12], making effective use of the packet-size profile of particular applications, and to Claffy [13], who observes that DNS traffic is easily identifiable using the joint distribution of flow duration and the number of packets transferred.

A trained approach for class discrimination has been proposed with a supervised machine learning technique by Moore et al. [14]. Although based on full and deterministic payload analysis, Moore et al. [15] also try to identify classes of traffic, instead of focusing on the classification of specific application layer protocols. The above work has been extended by Auld et al. [16], who proposed a supervised machine learning approach based on a Bayesian trained neural network. In contrast, McGregor et al. [17] seek to identify traffic with similar observable properties and apply an untrained classifier to this problem. The untrained classifier has the advantage of identifying groups/classes of traffic with similar properties but does not directly assist in understanding what applications have been grouped this way, or why.

Two recent works of Bernaille et al. [18] and Crotti [19] propose, respectively, the use of clustering techniques and statistical fingerprinting to achieve fine-grained classification based on the size and direction of the first few packets of a TCP session. Conceptually, their approach is similar to payload-based approaches that look for characteristic signatures during protocol handshakes to identify the application, and it is unsuccessful when classifying application types with variable-length packets in their protocol handshakes, such as Gnutella. A detailed survey of data mining techniques applied to traffic classification can be found in [20].

Other traffic identification approaches that rely on heuristics derived from the analysis of communication patterns between hosts have been proposed in [21–23]. For example, Karagiannis et al. studied the multi-level behavior of traffic by analyzing interactions between hosts, protocol usage and per-flow average packet size, and developed a method that leverages the social, functional, and application behaviors of hosts to identify traffic classes [22]. Concurrent to [22], Xu et al. [23] developed a methodology, based on data mining and information theoretic techniques, to discover functional and application behavioral patterns of hosts and the services used by the hosts. They subsequently use these patterns to build general traffic profiles.
In contrast, our approach uses only the nonlinear characteristics of each traffic flow to reveal hidden periodicities and non-masquerable properties that cannot be observed by other means, and hence achieves comparable or better accuracies when classifying traffic, including traffic originating from P2P applications.

3. Recurrence plots

Our classification strategy starts from the study of specific nonlinear characteristics, such as recurrence phenomena and hidden non-stationary transition patterns, in the time series associated with the traffic classes that we want to explicitly distinguish. This requires a preliminary study, also defined as the "supervised" learning phase, on reference pre-classified traffic flows to determine the most discriminating features of each traffic flavor. This is a very slow and complex task requiring a lot of computing effort and human expertise, but fortunately it has to be performed only once, in the initial "knowledge construction" phase of our classification model. Once the "qualitative" discrimination schemes have been built, all the following activities consist of a "quantitative" recurrence assessment that can realistically be performed on-line on special-purpose, and hence resource-constrained, network devices.

The above preliminary study, which is the most significant contribution of our work, has been performed by using recurrence plot (RP) analysis. By using RPs, we can gain significant insights into the non-stationary variation patterns of time-series data. The main idea is to reconstruct the (unknown) system dynamics in the phase space by using time-delay embedding, and then to compute the distances between all pairs of embedded vectors, generating a symmetric two-dimensional square matrix. The RP visualizes the distance matrix.

3.1. Non-stationarity in network traffic

The concept of stationarity is of utmost importance in traditional linear and nonlinear time series analyses, in particular when coping with time series of traffic variables, which are, most of the time, non-stationary [24]. The term stationarity is used to describe an assumed regularity in a series of data [25], and a time series is defined as non-stationary [26] when, for some $q$, the joint probability distribution of $x_i, x_{i+1}, \ldots, x_{i+(q-1)}$ depends on the time index $i$. Traffic engineering practices regarding traffic volume analysis are tightly related to the notion of non-stationarity. Hence detecting non-stationarity is important, as it identifies the shifting points in the temporal statistical behavior of the underlying process. In many dynamic phenomena, it is of fundamental importance to trace these points [24]. Sudden changes in the statistical characteristics of traffic variables, for example volume or packet size, can lead to an understanding of the different dynamics associated with the specific behavior of the involved end applications.

3.2. Reconstructing the phase space: delay-coordinate embedding

Usually, a detailed analysis of a dynamic system is possible when the equations of motion and all the $n$ degrees of freedom are known. The changing state of such a dynamic system can indeed be represented by sequences of "state vectors", or vectors of state variables, in the phase space [27]. Unfortunately, only a few quantities can usually be observed in a system. However, it is possible to reconstruct the entire dynamics of a system from a relatively small number of observable variables. In the analysis of a complex nonlinear system such as an end-to-end traffic flow, whose evolution is determined by the interaction of a very complex set of state variables, it is unusual that the complete description of such variables is known. All that is available to the analyst is a time series resulting from sampling at a single observation point. The method of delay-coordinate embedding makes use of past values to reconstruct a useful version of the internal dynamics. Takens' theorem [28–30] states that we can recreate a topologically equivalent picture of the behavior of the original multi-dimensional system (in other words, we can reconstruct a phase space trajectory) using the time series of a single observable variable, by means of the method of time delays: starting from the scalar time series $\{x_t\}_{t=1}^{T}$ we generate a sequence of (embedded) vectors $y(i) = (x_i, x_{i+\tau}, x_{i+2\tau}, \ldots, x_{i+(m-1)\tau})$. The set of all embedded vectors $y(i)$, $i = 1, \ldots, T-(m-1)\tau$, constitutes a trajectory in $\mathbb{R}^m$, where $m$ is the embedding dimension and $\tau$ is the time delay. Each unknown point of the phase space at time $i$ is reconstructed by the delayed vector $y(i)$ in an $m$-dimensional space called the reconstructed phase space. The sequence of embedded vectors recreates the original dynamics only if the values of $m$ and $\tau$ are chosen properly. In particular, for Takens' theorem to hold, the choice of $m$ must ensure that $m > 2d + 1$, where $d$ is the dimension of the original (unknown) system. An estimate for $d$ is provided by the correlation dimension $D_2$ of Grassberger–Procaccia [31].

The most natural question is, then, how to choose an appropriate value for the time delay $\tau$ and the embedding dimension $m$. Several methods have been developed to best guess $m$ and $\tau$. The most often used methods are the average mutual information function (AMI) for the time delay, as introduced by Fraser and Swinney [32], and the false nearest neighbors (FNN) method for the embedding dimension, developed by Kennel et al. [33].
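The time-delay construction described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `delay_embed` and the toy series are assumptions, and indices are 0-based here whereas the text counts from 1.

```python
import numpy as np

def delay_embed(x, m, tau):
    """Return the delay vectors of a scalar series as rows of a matrix.

    Row i is (x[i], x[i + tau], ..., x[i + (m - 1) * tau]), so the result
    has len(x) - (m - 1) * tau rows and m columns.
    """
    x = np.asarray(x, dtype=float)
    n = x.size - (m - 1) * tau
    if n <= 0:
        raise ValueError("series too short for the chosen m and tau")
    # Column k holds the series shifted forward by k * tau samples.
    return np.column_stack([x[k * tau : k * tau + n] for k in range(m)])

# Toy series 0, 1, ..., 9 embedded with dimension 3 and delay 2.
series = np.arange(10.0)
Y = delay_embed(series, m=3, tau=2)
print(Y.shape)  # (6, 3)
print(Y[0])     # [0. 2. 4.]
```

Each row of `Y` is one reconstructed phase-space point, so the later distance computations operate on pairs of rows.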

The first minimum of the average mutual information (AMI) is a good estimate for the time delay $\tau$, on the grounds that uncorrelated variables tend to produce uncorrelated values. Suppose that the time series domain is partitioned into equiprobable bins. Let $p_i$ be the probability of finding a time series value in the $i$th bin, and let $p_{ij}(\tau)$ be the joint probability of finding a time series value in the $i$th bin and, after a time $\tau$, a time series value in the $j$th bin, i.e. the probability of a transition from the $i$th to the $j$th bin in time $\tau$. The average mutual information function is

$$S(\tau) = -\sum_{i,j} p_{ij}(\tau) \ln\left(\frac{p_{ij}(\tau)}{p_i\, p_j}\right). \tag{1}$$
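As a rough illustration of how the delay could be estimated, the sketch below computes a histogram-based mutual information between the series and its delayed copy, using quantile (roughly equiprobable) bins, and returns the first local minimum over a range of delays. The function names, the bin count and the search range are assumptions made for this example. The function returns the usual non-negative mutual information, so its sign convention may differ from the one printed in Eq. (1), and the marginals are taken from the joint histogram rather than from the full series.

```python
import numpy as np

def average_mutual_information(x, tau, bins=16):
    """Histogram estimate of the mutual information between x(t) and x(t + tau)."""
    if tau < 1:
        raise ValueError("tau must be at least 1")
    x = np.asarray(x, dtype=float)
    a, b = x[:-tau], x[tau:]
    # Quantile edges give approximately equiprobable bins.
    edges = np.quantile(x, np.linspace(0.0, 1.0, bins + 1))
    ia = np.searchsorted(edges[1:-1], a, side="right")
    ib = np.searchsorted(edges[1:-1], b, side="right")
    joint = np.zeros((bins, bins))
    np.add.at(joint, (ia, ib), 1.0)      # count transitions bin i -> bin j
    joint /= joint.sum()
    p_i = joint.sum(axis=1)              # marginal of the current value
    p_j = joint.sum(axis=0)              # marginal of the delayed value
    nz = joint > 0                       # skip empty cells (0 * log 0 = 0)
    ratio = joint[nz] / np.outer(p_i, p_j)[nz]
    return float(np.sum(joint[nz] * np.log(ratio)))

def first_ami_minimum(x, max_tau=50, bins=16):
    """Smallest delay at which the AMI curve has a local minimum, or None."""
    ami = [average_mutual_information(x, t, bins) for t in range(1, max_tau + 1)]
    for k in range(1, len(ami) - 1):
        if ami[k - 1] > ami[k] <= ami[k + 1]:
            return k + 1                 # ami[k] belongs to delay k + 1
    return None

# Hypothetical test signal: a noisy oscillation.
rng = np.random.default_rng(0)
signal = np.sin(0.3 * np.arange(2000)) + 0.05 * rng.normal(size=2000)
print(first_ami_minimum(signal, max_tau=30))
```

Histogram estimates are biased for short series, so the bin count should be kept small relative to the number of samples.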

A similar argument supports the proposal to use the first zero-crossing of the autocorrelation function [31]. Other authors suggest instead that the first AMI maximum should be selected, since it relates to natural periods of the system.

Good values for the embedding dimension $m$ can be found by using methods like false nearest neighbors. This method analyzes, with respect to $m$, the number of false nearest neighbors, that is, points in the data set that were nearest neighbors in lower embedding dimensions but become distant from one another in the current dimension. For example, two points on a circle can appear close to each other, even though they are not, if the circle is seen sideways (as a projection), thus appearing like a line segment. Increasing the dimension $m$ of the reconstructed space by one often permits differentiating between the points of the orbit, i.e. between those which are true neighbors and those which are not. Let $y$ be a point of the reconstructed space. Let $y^{(r)}$ be the $r$th nearest neighbor of $y$ and compute the (squared) Euclidean distance $D^2$ between them (as usual, $y_k$ denotes the $k$th component of $y$):

$$D_m^2\left(y, y^{(r)}\right) = \sum_{k=1}^{m} \left[ y_k - y_k^{(r)} \right]^2. \tag{2}$$

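Next, increase $m$ to $m + 1$ and compute the new distance, i.e. $D_{m+1}^2\left(y, y^{(r)}\right)$. The point $y^{(r)}$ is said to be a false nearest neighbor if

$$\frac{D_{m+1}^2\left(y, y^{(r)}\right) - D_m^2\left(y, y^{(r)}\right)}{D_m^2\left(y, y^{(r)}\right)} > D_{TS}, \tag{3}$$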

where $D_{TS}$ is a predefined threshold. Note that the number of false nearest neighbors depends on $D_{TS}$. In practice, the percentage of false nearest neighbors (FNN) is computed for each $m$ in a set of values; the embedding dimension is taken to be the first $m$ for which the percentage of FNN drops to zero. Note that with real-world, noisy data this percentage never reaches zero, so the embedding dimension providing the lowest FNN percentage is usually chosen.
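The false nearest neighbors test can be sketched as follows, under some simplifying assumptions: only the first nearest neighbor of each point is considered, all pairwise distances are computed by brute force (so it is meant for short series only), and the default threshold value is arbitrary. The names `delay_embed`, `fnn_fraction` and `d_ts` are introduced for this illustration. The sketch uses the fact that going from $m$ to $m + 1$ dimensions adds exactly one squared coordinate difference to the squared distance.

```python
import numpy as np

def delay_embed(x, m, tau):
    """Rows are the delay vectors (x[i], x[i + tau], ..., x[i + (m - 1) * tau])."""
    x = np.asarray(x, dtype=float)
    n = x.size - (m - 1) * tau
    if n <= 0:
        raise ValueError("series too short for the chosen m and tau")
    return np.column_stack([x[k * tau : k * tau + n] for k in range(m)])

def fnn_fraction(x, m, tau, d_ts=10.0):
    """Fraction of first nearest neighbors in dimension m that are false.

    A neighbor is counted as false when the relative growth of the squared
    distance, after adding the (m + 1)-th coordinate, exceeds d_ts.
    """
    Y1 = delay_embed(x, m + 1, tau)      # points that exist in both dimensions
    Y = Y1[:, :m]                        # the same points seen in dimension m
    n = len(Y)
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)         # a point is not its own neighbor
    nn = d2.argmin(axis=1)               # index of each point's nearest neighbor
    d2_m = d2[np.arange(n), nn]
    # Increase of the squared distance caused by the extra coordinate.
    extra = (Y1[np.arange(n), m] - Y1[nn, m]) ** 2
    valid = d2_m > 0                     # avoid division by zero for exact repeats
    return float(np.mean(extra[valid] / d2_m[valid] > d_ts))

# Hypothetical example: a sine wave needs two dimensions to unfold.
wave = np.sin(0.3 * np.arange(600))
for m in (1, 2, 3):
    print(m, fnn_fraction(wave, m, tau=5))
```

In practice the fraction would be evaluated over a range of dimensions and the smallest dimension with a negligible fraction retained.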

3.3. Building the recurrence plots

The next step in RP analysis is to calculate the mutual distances between embedded vectors to build the recurrence plot. Thus, a norm has to be chosen. Three frequently used norms are the $L_1$-norm (minimum norm), the $L_2$-norm (Euclidean norm) and the $L_\infty$-norm (maximum norm) [31]. Essentially, an RP is a two-dimensional graphical representation of the distance matrix $D = \{d_{i,j}\}$, where the pixel located at coordinates $(i, j)$ is shaded according to the distance between the $i$th and $j$th vectors. More precisely, if the distance $d_{i,j}$ is lower than a fixed cutoff value $\varepsilon$ (the two points are sufficiently close to each other), a dot is plotted at $(i, j)$. As each coordinate $i$ represents a point in time, the RP provides information about the temporal correlation of phase space points. Indeed, each horizontal coordinate $i$ in the RP refers to the state of the system at time $i$ and each vertical coordinate $j$ refers to the state at time $j$. Thus, a recurrent point at $(i, j)$ means that the interaction between the observed quantities at instant $i$ is almost the same as at instant $j$, i.e. the interaction is recurring. So if the point $(i, j)$ is marked as recurrent, the state $j$ belongs to the neighborhood of size $\varepsilon$ centered on $i$; this means that the state of the system at time $i$ has some 'similarity' with the state of the system at time $j$. In other words, we can say that the system is staying on nearby "orbits". More formally, the recurrence of a state $x$ from time $i$ at a different time $j$ is given by the following equation [34]:

$$r_{i,j} = \theta\left(\varepsilon - \|x_i - x_j\|\right), \quad i, j = 1, \ldots, N, \tag{4}$$

where $r_{i,j}$ is an element of the recurrence matrix $R$, $N$ is the number of states $x_i$ in the time window of study, $\varepsilon$ is the threshold for the distances, $\|\cdot\|$ is a norm, and $\theta(x)$ is the Heaviside function, defined as $\theta(x) = 0$ for $x < 0$ and $\theta(x) = 1$ for $x \ge 0$. In other words, $r_{i,j}$ assumes a value of 1 if $(i, j)$ is recurrent, and a value of 0 otherwise. Note that the RP is symmetric (given a constant value of $\varepsilon$), since $D_{i,j} = D_{j,i}$. Moreover, as $r_{i,i} = 1$ ($i = 1, 2, 3, \ldots, N$), an RP will always contain a diagonal line angled at 45° called the Line of Identity (LOI). Each recurrent point indicates an isolated recurrence of the phase relationship between the time series. Line segments parallel to the main diagonal come from points close to each other successively forward in time, $(i, j), (i + 1, j + 1), \ldots, (i + l, j + l)$, such that $x_j, x_{j+1}, \ldots, x_{j+l}$ are (respectively) close to $x_i, x_{i+1}, \ldots, x_{i+l}$. Thus, a diagonal line indicates a stable recurrence of the phase relationship for a time interval corresponding to the length of the diagonal ($l$). The time interval separating different diagonals is the recurrence period. Recurrent points organized in rows and columns do not convey any information about the timing of the periodicity, since they are not successively forward in time. However, horizontal and vertical lines are associated with stationary states. If the time series is deterministic, the orbit in the phase space will revisit some points sometime in the future, forming a picture of the system's attractor. Then, the RP will show short upward line segments parallel to the main diagonal. Those segments correspond to sequences. The RP of a purely random series, instead, will not show any structure at all.

The purpose of RP analysis is to identify in the RP several patterns that indicate statistical properties of the time series such as non-stationarity, drifts in the data and so on. Any non-stationary process yields a patterned RP. If the process has a trend, the RP will exhibit a fading towards the corners. In case of abrupt changes, the RP contains disruptions (dark colored bands). The cyclicity of an oscillating process causes diagonal lines and a checkerboard structure in the RP. At a more detailed level, there are some essential patterns, often called small-scale structures, such as single dots, diagonal lines as well as vertical and horizontal lines (the combination of vertical and horizontal lines obviously forms rectangular clusters) that are strictly related to deterministic structure and nonlinearity. These small-scale structures are the basis for a quantitative analysis of the RPs. For example, vertical or horizontal lines on an RP ($r_{i,j+k} = 1$ for $k = 1, \ldots, v$, where $v$ is the length of the vertical line) denote that the system state does not change, or changes very slowly, in time. It seems that the system is trapped in a state for some time. This is a typical behavior of laminar states, often called intermittency.
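A minimal sketch of the recurrence matrix computation follows, assuming the embedded vectors are stored as rows of a NumPy array. The function name, the `norm` keyword values and the toy data are illustrative assumptions. A point is marked recurrent when the distance does not exceed the threshold, which matches a Heaviside function that equals 1 at zero.

```python
import numpy as np

def recurrence_matrix(Y, eps, norm="euclidean"):
    """Binary recurrence matrix: entry (i, j) is 1 when ||Y[i] - Y[j]|| <= eps."""
    Y = np.asarray(Y, dtype=float)
    if Y.ndim == 1:
        Y = Y[:, None]                       # treat a scalar series as 1-D states
    diff = Y[:, None, :] - Y[None, :, :]     # all pairwise differences
    if norm == "euclidean":
        dist = np.sqrt((diff ** 2).sum(axis=-1))
    elif norm == "max":
        dist = np.abs(diff).max(axis=-1)
    elif norm == "l1":
        dist = np.abs(diff).sum(axis=-1)
    else:
        raise ValueError("unknown norm: " + norm)
    return (dist <= eps).astype(np.uint8)

# Toy example with four one-dimensional states.
states = np.array([0.0, 1.0, 0.1, 5.0])
R = recurrence_matrix(states, eps=0.5)
print(R)
# [[1 0 1 0]
#  [0 1 0 0]
#  [1 0 1 0]
#  [0 0 0 1]]
```

The matrix is symmetric and has ones on the main diagonal, as expected for a fixed threshold. Plotting it as an image (for example with an image-display routine of a plotting library) gives the recurrence plot itself.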

Diagonal lines ($r_{i+k,j+k} = 1$ for $k = 1, \ldots, l$, where $l$ is the length of the diagonal line) correspond to trajectories passing through the same region of the phase space at different times. Therefore, lines parallel and perpendicular to the main diagonal appear when the series presents some determinism or periodicity. More specifically, diagonal lines parallel to the LOI occur when states evolve deterministically at different times. Diagonal structures perpendicular to the LOI represent phase discordances (this is often a hint of an inappropriate embedding). The lengths of diagonal lines are directly related to the ratio of determinism or predictability inherent to the system. Suppose that the states at times $i$ and $j$ are neighbors, i.e. $r_{i,j} = 1$. If the system behaves predictably, similar situations lead to a similar future, i.e. the probability that $r_{i+1,j+1} = 1$ is high. For perfectly predictable systems, this leads to infinitely long diagonal lines (as in the RP of the sine function). In contrast, if the system is stochastic, the probability that $r_{i+1,j+1} = 1$ will be small and we only find single points or short lines. If the system is chaotic, initially neighboring states will diverge exponentially. The faster the divergence, i.e. the higher the Lyapunov exponent, the shorter the diagonals. The length of lines parallel to the main diagonal of the RP indicates how fast the trajectories diverge in phase space.
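To make the notion of diagonal line length concrete, the following sketch scans each diagonal above the main diagonal of a binary recurrence matrix and counts runs of consecutive recurrent points. The function name and the toy periodic series are assumptions made for illustration; because the matrix is symmetric, only the upper triangle is scanned.

```python
from collections import Counter

import numpy as np

def diagonal_line_lengths(R):
    """Histogram {length: count} of diagonal lines above the main diagonal."""
    R = np.asarray(R)
    n = R.shape[0]
    hist = Counter()
    for offset in range(1, n):               # offset 0 is the line of identity
        run = 0
        for value in np.diagonal(R, offset=offset):
            if value:
                run += 1                     # extend the current line
            elif run:
                hist[run] += 1               # a line just ended
                run = 0
        if run:
            hist[run] += 1                   # line reaching the matrix border
    return hist

# A strictly periodic toy series (period 3) gives long unbroken diagonals.
x = np.array([0, 1, 2] * 3)
R = (x[:, None] == x[None, :]).astype(np.uint8)
print(dict(diagonal_line_lengths(R)))  # {6: 1, 3: 1}
```

For a periodic series the diagonals span the whole plot and are separated by the period, whereas a stochastic series would mostly yield lines of length one.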

On the other hand, single isolated points in an RP can occur if states are recurrent but rare, if they do not persist for any time or if they fluctuate heavily, pointing towards a stochastic process. However, they are not a unique sign of randomness or noise. Finally, square-like structures occur when data change slowly (sojourn points).

3.4. Recurrence quantification analysis

While RPs are visually appealing, their interpretation requires some degree of subjectivity and, of course, expertise. Since recurrence plots can contain subtle patterns that are not easily ascertained by visual inspection, Zbilut and Webber [35] introduced the concept of recurrence quantification analysis (RQA), which is an efficient and deterministic way to identify non-stationarity features in traffic flows. The power of such analysis is that it uncovers time correlations between data that are not based on linear or nonlinear assumptions and cannot be distinguished through the direct study of one-dimensional series of traffic flow volumes. Moreover, traffic patterns, that is, the manner in which traffic states propagate through time, can be revealed by studying the evolution of some recurrence statistics in time. This approach is based on the diagonal line structures found in recurrence plots. The standard first step in the analysis is to choose the embedding parameters: the time delay $\tau$ and the embedding dimension $m$. Then, to enable the computation of distances between points, a norm must be selected. The threshold radius $\varepsilon$, determining the number of recurring points, is another parameter whose value must be chosen before the diagram can be drawn. Analogously, the minimum number of consecutive darkened points that constitutes a diagonal line, $l_{min}$, is a parameter that has to be determined before RQA can take place.

At the core of RQA is the computation of several statistics providing the identification and the quantification of transient recurrent patterns characterizing the behavior of the time series under investigation. These measures can be computed in successive time windows (epochs) along the main diagonal in an RP. This allows us to study their time dependence and can be used for the detection of transitions [35]. Another possibility is to define these measures for each diagonal parallel to the main diagonal separately. This approach enables the study of time delays and, when applied to RPs, the assessment of similarities between processes [36]. In both cases the quantification analysis provides information about the deterministic structure and the complexity of the dynamics of the observed patterns. The main variables for evaluating the deterministic content of a time series are: percentage of Recurrence (%REC), percentage of Determinism (%DET), Entropy (ENT), Divergence (DIV), RATIO, Laminarity (LAM) and TREND.

The first variable, %REC, corresponding to the correlation sum, measures the percentage of recurrent points in the phase space, excluding the main diagonal, whose points are trivially recurrent. Embedded processes that are periodic have higher percent recurrence values. REC is given by Marwan and Kurths as

\[ \mathrm{REC} = \frac{1}{N^2} \sum_{i,j=1}^{N} r_{i,j}, \tag{5} \]

where r_{i,j} is the recurrence estimated by Eq. (4).

The second variable, %DET, is the percentage of recurrent points that are included in line segments parallel to the upward diagonal and whose length meets or exceeds the minimum length threshold l_min. It allows distinguishing between dispersed recurrent points and those that are organized in diagonal patterns, representing strings of vectors (deterministically) repeating themselves

Table 1. General workload dimensions of the traces.

Trace      A             B             C
Duration   2 h           12 h          48 h
Flows      1.8 × 10^6    1.1 × 10^7    5.2 × 10^7
Packets    8.8 × 10^7    5.9 × 10^8    2.5 × 10^9
Bytes      7.7 × 10^10   5.3 × 10^11   3.1 × 10^12

766 F. Palmieri, U. Fiore / Computer Networks 53 (2009) 761–773

\[ \mathrm{DET} = \frac{\sum_{l=l_{\min}}^{N} l\,P(l)}{\sum_{i,j=1}^{N} r_{i,j}}, \tag{6} \]

where l is the length of the diagonal line parallel to the Line of Identity, P(l) is the frequency distribution of the diagonal lines (that is, the histogram of their lengths l) and r_{i,j}, as above, is the recurrence of a traffic state. High values of %DET show that traffic exhibits a deterministic structure. Deterministic structures may have a high degree of complexity or the opposite.
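As an illustration of Eqs. (5) and (6), the following sketch computes REC and DET from a binary recurrence matrix. It assumes, following the text, that the main diagonal is left out of both counts while the N² normalization of Eq. (5) is kept. The helper names are hypothetical, and results are returned as fractions, to be multiplied by 100 for percentages.

```python
import numpy as np

def diagonal_line_lengths(r):
    """Lengths of all runs of 1s on diagonals parallel to the LOI (LOI excluded)."""
    n = r.shape[0]
    lengths = []
    for k in range(-(n - 1), n):
        if k == 0:
            continue                      # skip the Line of Identity
        run = 0
        for value in np.diagonal(r, offset=k):
            if value:
                run += 1
            elif run:
                lengths.append(run)
                run = 0
        if run:
            lengths.append(run)
    return lengths

def rec_and_det(r, l_min=2):
    """Return (REC, DET) as fractions for a square binary recurrence matrix."""
    r = np.asarray(r)
    n = r.shape[0]
    recurrent = int(r.sum() - np.trace(r))            # off-diagonal recurrent points
    rec = recurrent / (n * n)
    in_lines = sum(l for l in diagonal_line_lengths(r) if l >= l_min)
    det = in_lines / recurrent if recurrent else 0.0
    return rec, det
```

On a strictly periodic matrix every off-diagonal recurrent point lies on a diagonal line, so DET evaluates to 1, which matches the interpretation given above.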

Analogously, the number of recurrence points which form vertical lines can be quantified in the same way

\[ \mathrm{LAM} = \frac{\sum_{v=v_{\min}}^{N} v\,P(v)}{\sum_{v=1}^{N} v\,P(v)}, \tag{7} \]

where P(v) is the frequency distribution of the lengths v of the vertical lines, which have at least a length of v_min. This measure, evidencing chaotic transitions, is called Laminarity and is related to the amount of laminar phases in the system (intermittency).
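Eq. (7) can be sketched in the same spirit. The code below is only an illustration: it scans every column of a binary recurrence matrix, keeps the main diagonal points (an assumption, since Eq. (7) does not say otherwise), and uses hypothetical function names.

```python
import numpy as np

def vertical_line_lengths(r):
    """Lengths of all runs of consecutive 1s found in the columns of r."""
    lengths = []
    for column in np.asarray(r).T:
        run = 0
        for value in column:
            if value:
                run += 1
            elif run:
                lengths.append(run)
                run = 0
        if run:
            lengths.append(run)
    return lengths

def laminarity(r, v_min=2):
    """Fraction of recurrent points lying on vertical lines of length >= v_min."""
    lengths = vertical_line_lengths(r)
    total = sum(lengths)                  # equals sum over v of v * P(v)
    if total == 0:
        return 0.0
    return sum(v for v in lengths if v >= v_min) / total
```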

The ENT variable is computed as the Shannon entropy of the frequency distribution of the diagonal line lengths distributed over integer bins in a histogram, and quantifies the complexity of recurrence plots. More precisely, the entropy gives a measure of how much information one needs in order to recover the system. A low entropy value indicates that little information is needed to identify the system; in contrast, high entropy indicates that much information is required. Upward line segment lengths are counted and distributed over integer bins of a histogram. Shannon entropy is computed according to the formula

\[ \mathrm{ENT} = -\sum_{l=l_{\min}}^{N} P_l \log_2(P_l), \tag{8} \]

where N is the number of bins and P_l is the probability distribution of the diagonal line lengths, which can also be viewed as the percentage of all line lengths falling into the l-th bin. As the logarithms are in base 2, the entropy can be interpreted as a number of bits. Increasing entropy reflects “disorder” and is connected to decreased predictability. Simply stated, a low entropy is typical of periodic behavior while high entropy indicates chaotic behavior. The more complex the deterministic structure of the recurrence plot, the larger the value of the entropy becomes.
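A small sketch of Eq. (8) may help fix ideas. It assumes that the diagonal line lengths have already been extracted from the RP and that P_l is estimated as the relative frequency of each length among the lines of length at least l_min. The function name is hypothetical.

```python
import math
from collections import Counter

def line_length_entropy(lengths, l_min=2):
    """Shannon entropy (in bits) of the distribution of diagonal line lengths."""
    kept = [l for l in lengths if l >= l_min]
    if not kept:
        return 0.0
    counts = Counter(kept)                # one integer bin per line length
    total = len(kept)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

With two equally frequent lengths the result is 1 bit, while a plot whose lines all share one length yields 0 bits, consistent with the periodic versus chaotic reading given above.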

The DIV variable is equal to the reciprocal of the maximum line length (l_max). This quantity is proportional to the largest positive Lyapunov exponent. The basic idea in the calculation of the maximum exponent is to find a pair of spatially nearby points in the attractor and follow their evolution in time, measuring the rate of divergence, until the points can no longer be considered close. Short horizontal lines in the recurrence plot correspond to large local exponents. Consequently, a periodic signal produces long line segments, while short lines indicate chaos.

The RATIO variable is defined as the quotient of %DET divided by %REC. It is useful to detect transitions between states: this ratio increases during transitions but settles down when a new quasi-steady state is achieved.

The TREND variable is the slope of the least squares regression of local recurrence as a function of the orthogonal displacement from the main diagonal. It quantifies the degree of system stationarity. In other words it represents the measure of the positioning of recurrent points away from the central diagonal, that is the paling of the RP towards its edges. A “flat” diagram indicates stationarity, whereas drift in the signal will result in the overall increase or reduction of distances as we move away from the main diagonal. TREND is computed as follows:

(1) compute the percentage of recurrent points in diagonals parallel to the central line,

(2) fit by least squares the relationship

\[ d_j = a + b\,g_j + u_j, \tag{9} \]

where d_j is the percentage of recurrent points, and g_j is the distance away from the central diagonal. The trend is the value of b. If there is no drift in a dynamical system, there is no fading of the recurrence plot away from the central diagonal, leading to low values (near zero) of b; large values (positive or negative) of b, however, are evidence of a system exhibiting drift. Therefore, RATIO and TREND are RQA variables especially suited for the detection of non-stationarity.
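The two-step TREND procedure above can be sketched as follows. This is a simplified illustration: it uses only the diagonals above the central line, includes every one of them in the fit, and applies no rescaling, whereas RQA packages typically trim the outermost diagonals and rescale the slope. The function name is hypothetical.

```python
import numpy as np

def trend(r):
    """Least-squares slope b of recurrence density against distance from the LOI."""
    r = np.asarray(r)
    n = r.shape[0]
    offsets = np.arange(1, n)             # g_j: distance from the central diagonal
    # d_j: percentage of recurrent points on the diagonal at distance g_j
    density = np.array([100.0 * np.diagonal(r, offset=k).mean() for k in offsets])
    slope, _intercept = np.polyfit(offsets, density, 1)
    return slope
```

A matrix that is uniformly recurrent gives a slope close to zero, whereas a matrix whose recurrent points concentrate near the central diagonal gives a negative slope, i.e. the "paling" towards the edges.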

4. Implementation details

In our analysis we used data from three packet traces collected at the Federico II University of Napoli. The traces contain all the traffic going through the Federico II University 1 Gbps link to the Internet on March 8, 2008 from 2 to 3 pm, on March 14, 2008 from 9 pm to 9 am and from 10 am on March 21, 2008 to 10 am on March 23, 2008 (see Table 1). Our traces cover some typical cases such as the noticeable differences in usage between morning and evening hours, and between weekdays and weekends. We collected the traces through port mirroring from our Cisco 6509 border router to an HP® DL380 Dual Processor (Intel® Xeon® 2.5 GHz) monitoring server running the FreeBSD® operating system. Such a choice is perfectly suited to a modern network scenario where the actions of packet classification are delegated to edge gateways.

For all traces, we captured the first 68 bytes of each packet, which includes the IP and TCP/UDP headers and enough payload bytes to perform payload analysis to be used as a reference. For the sake of privacy, the monitoring system anonymized the IP addresses in the traces using the Cryptography-based Prefix-preserving Anonymization algorithm (Crypto-PAn) [37].


4.1. Initial flow and statistic feature calculation

We used NetMate (NETwork Measurement and AccounTing systEm) [38] to process packet traces, organize packets into traffic flows and compute feature values. Specifically, end-to-end traffic flows have been identified by source IP and source port, destination IP and destination port and protocol. Flows are bidirectional and their first packet determines the forward direction. They are also characterized by a limited duration: UDP flows are terminated by a flow timeout. TCP flows are terminated upon proper connection teardown (TCP state machine) or after a timeout (whichever occurs first). We only consider UDP and TCP flows that have at least one packet in each direction and transport at least one byte of payload. This excludes flows without payload (e.g. failed TCP connection attempts) or “unsuccessful” flows (e.g. requests without responses commonly found in scans). We used a 10 s flow sampling interval. The choice of such a flow sampling interval results from a tradeoff between sensitivity on one side and accuracy plus memory usage on the other side. A short sampling interval increases the sensitivity to transient phenomena and the quick processing of IP addresses appearing in the link for the first time. On the other hand, large sampling intervals increase the profiling accuracy while at the same time reducing the memory footprint. Nevertheless, the self-similarity properties imply that traffic characteristics exist across many time scales, i.e. aggregated traffic does not necessarily get steadier. Consequently, the chosen 10 s sampling interval resulted in the best compromise for our analysis.
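The flow definition above can be illustrated with a short sketch. It is not NetMate's implementation: the `Packet` record, the function names and the omission of timeouts and TCP teardown handling are all simplifying assumptions made for this example.

```python
from collections import namedtuple

# Hypothetical packet record; only the fields needed for flow grouping.
Packet = namedtuple("Packet", "src_ip src_port dst_ip dst_port proto payload_len")

def flow_key(pkt):
    """Direction-independent 5-tuple key, so both directions map to one flow."""
    a = (pkt.src_ip, pkt.src_port)
    b = (pkt.dst_ip, pkt.dst_port)
    lo, hi = (a, b) if a <= b else (b, a)
    return (pkt.proto, lo, hi)

def group_flows(packets):
    """Group packets (in arrival order) into bidirectional flows."""
    flows = {}
    for pkt in packets:
        flows.setdefault(flow_key(pkt), []).append(pkt)
    return flows

def is_usable(flow):
    """Keep flows with a packet in each direction and at least one payload byte."""
    forward = (flow[0].src_ip, flow[0].src_port)   # first packet sets the direction
    seen_reverse = any((p.src_ip, p.src_port) != forward for p in flow)
    has_payload = any(p.payload_len > 0 for p in flow)
    return seen_reverse and has_payload
```

In this sketch a scan probe that never receives a reply forms a one-directional flow and is therefore filtered out by `is_usable`.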

The flow features considered in our model are any statistics that can be calculated from the trace information at hand (in our case packets within a flow). Traffic volume data, the minimum/maximum/average length of observed IP packets per export interval or the variance of packet inter-arrival times are all valid features. As network flows can be bidirectional, features can also be calculated for both directions of the flow. When choosing the flow features, the ‘kitchen-sink’ method of using as many features as possible was eschewed in favor of a constraint-based approach. The main limitation in choosing features was that calculation should be realistically possible within a resource constrained IP network device. Thus the considered features needed to fit the following criteria:

• Complete packet payload independence.
• No implicit dependence on the transport layer.
• The context must be limited to a single flow (i.e. no features spanning multiple flows).
• Simple to compute.

The following features were found to match the above criteria and became the base feature set for our experiments:

• Packet length (minimum, mean, maximum). Lengths are based on the IP length excluding link layer overhead.
• Inter-arrival time between packets (minimum, mean, maximum and variance) with at least microsecond precision and accuracy.

The features that proved to be most selective are the average packet length and, less impressively, the inter-arrival time variance. Note that recurrence quantification analysis assumes a scalar time series. Thus, the above features have been studied one at a time.
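The base feature set described above could be computed, for the packets observed in one sampling interval, along the following lines. This is only a sketch: the function name and dictionary keys are hypothetical, at least two packets are assumed, and the population variance is used because the text does not say which variance estimator was adopted.

```python
import statistics

def flow_features(lengths, timestamps):
    """Packet-length and inter-arrival statistics for one flow interval.

    lengths    -- IP packet lengths in bytes
    timestamps -- packet arrival times in seconds, in increasing order
    """
    gaps = [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]
    return {
        "len_min": min(lengths),
        "len_mean": statistics.fmean(lengths),
        "len_max": max(lengths),
        "iat_min": min(gaps),
        "iat_mean": statistics.fmean(gaps),
        "iat_max": max(gaps),
        "iat_var": statistics.pvariance(gaps),   # assumption: population variance
    }
```

Collecting, say, `len_mean` or `iat_var` once per sampling interval yields the scalar time series on which the recurrence analysis operates.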

In choosing the type of pre-classified flows to be used in our supervised learning phase, our primary objective was to distinguish among the main traffic classes that are currently associated with the most common and widely used Internet services, such as the World-Wide-Web, Electronic Mail, the Domain Name System and peer-to-peer file sharing networks. For completeness, the SSH protocol, now widely used for remote terminal sessions, has also been chosen. Accordingly, we extracted from our traces some sample flows associated with the DNS, SSH, HTTP, SMTP, POP3 and eDonkey2000 protocols. These flows account respectively for about 17%, 1%, 2%, 1%, 0.5% and 4% of the total number of flows. Note that we need not differentiate between HTTP/1.0 and HTTP/1.1, since, as stated above, our analysis is based on inter-arrival time and packet length, and those features are not significantly affected by the specific HTTP protocol characteristics (i.e. connection persistence). Also note that eDonkey2000 has been chosen as the most representative peer-to-peer protocol, known to contribute to the vast majority of P2P traffic in Europe, and actually used by eDonkey, eMule, MLDonkey and Shareaza clients. All the extracted flows have been reliably pre-classified through host/port checking and payload inspection, and then aggregated into homogeneous “training” sets to be used to determine the most discriminating nonlinear features associated with each specific type of traffic. Each set has been chosen to contain a sufficiently large number of aggregated flows needed to completely describe the associated traffic characteristics. Recurrence Analysis has been performed on these sets to complete the “knowledge construction” phase of our classification model.

4.2. Recurrence analysis tools and methods

The nonlinear features used in our model are not easy to interpret. A complicating issue is the fact that the reconstructed state space is multi-dimensional, making it impossible to visualize in a traditional two-dimensional graph plot. Nevertheless, visual recurrence analysis, performed through colored recurrence plots, is a powerful descriptive tool. It is intuitive, quick, and provides a robust method for tentatively classifying and characterizing the traffic data time series. It can be used as an ad-hoc methodology to quickly set up hypotheses that can, later on, be tested with the more rigorous RQA tools. The resulting quantitative observations, performed on all the training sets and processed through data mining methodologies to extract the more selective features, produce the classification knowledge base on which our model is built. The software used for RP and RQA analysis is Visual Recurrence Analysis (VRA) version 5.01 coded by Kononov, freely available on the web [39]. We skipped the problem of the threshold by using a color scale for distances; on each plot, white represents a short distance while black corresponds to a long distance, as can be seen in the

Fig. 1. AMI for HTTP inter-arrival time variance, trace B.

Fig. 2. AMI for eDonkey2000 average packet length, trace B.

Fig. 3. FNN for HTTP inter-arrival time variance.


legend on the right side of each plot, in which each color indicates a range of distances between states. VRA also includes tools for finding the embedding parameters. The same ideas are implemented in the TISEAN software package for nonlinear time series analysis, freely available on the web [40], with the notable characteristic that they appear as command-line utilities, easily integrated with other software for further processing and hence suited to automated on-line or offline implementation of our classification paradigm. In our prototype implementation we specifically used the TISEAN version 3.0.1 code by Hegger and Schreiber [41].

4.3. Determining the embedding parameters

It should be recalled that the first step in our nonlinear analysis is the determination of the embedding parameters suitable for almost all the recognizable traffic types in our supervised learning phase. Hence a search for the best τ and m values must be made first. This has been done, in TISEAN, through the routine “mutual”, which computes the “mutual information”, while the routine “false-nearest” computes the percentage of false nearest neighbors. The optimum time delay τ has been chosen for all the aggregated flows as the one which minimizes the average mutual information of Eq. (1) in each training set, while the embedding dimension m has been chosen as the value for which the percentage of false nearest neighbors reaches its lowest value. Consider that each delayed vector in the state space is reconstructed along a period of (m − 1)τ s, given a set of τ and m values. In so doing, each vector y(i) of the so reconstructed space somewhat represents the evolution of the flow during this time. This kind of phase space reconstruction methodology can be expected to allow one to directly determine the maximum time needed to recognize specific traffic classes. Larger parameter values imply longer time intervals and in fact induce less sensitivity to changes on a short time scale. Furthermore, due to the inherent model complexity, and to the high number of variables involved, overfitting is likely to occur in cases where the flow sampling interval is too long, or flow data are insufficient or too short in duration. This basically means that many different solutions/discriminating criteria may be consistent with the pre-classified training flow samples used in the initial supervised learning activity, but disagree on unseen data. Hence, when presenting new traffic samples to the developed classification logic, the predictions will not be reliable. In order to avoid overfitting as much as possible, we performed result validation by working on the above aggregates built from several different pre-classified flows collected at different times and from different hosts.
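The delay selection described above can be illustrated with a histogram-based estimate of the average mutual information. The sketch below is not the TISEAN routine: the number of bins, the function names and the first-local-minimum rule (a common convention, falling back to the global minimum over the scanned lags) are assumptions made for the example.

```python
import numpy as np

def average_mutual_information(series, lag, bins=16):
    """Histogram estimate (in bits) of the mutual information between x(t) and x(t+lag)."""
    x = np.asarray(series, dtype=float)
    a, b = x[:-lag], x[lag:]
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)     # marginal of x(t)
    p_b = p_ab.sum(axis=0, keepdims=True)     # marginal of x(t+lag)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log2(p_ab[mask] / (p_a @ p_b)[mask])))

def first_minimum_lag(series, max_lag=20, bins=16):
    """Smallest lag at which the AMI curve has a local minimum."""
    ami = [average_mutual_information(series, lag, bins) for lag in range(1, max_lag + 1)]
    for i in range(1, len(ami) - 1):
        if ami[i] < ami[i - 1] and ami[i] <= ami[i + 1]:
            return i + 1
    return int(np.argmin(ami)) + 1            # fallback: global minimum
```

A strongly structured series keeps a high AMI at small lags, while an uncorrelated one drops to a small value immediately, which is the kind of rapid decay that motivates a short delay.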

Figs. 1 and 2 detail the AMI calculations for the inter-arrival time variance and average packet length for, respectively, HTTP and eDonkey traffic. The AMI results for the other protocols/training sets have not been shown because they are very similar to those reported here.

From the above figures it can be seen that the AMI value rapidly decreases for all the considered protocols. The abrupt change that can be observed near 1 makes us choose 1 as a trustworthy tentative value for the time delay τ. Similarly, the sample FNN plots reported in Figs. 3–6 reveal, besides some differences in curve sharpness, a nearly common FNN minimum at 6, thus suggesting the use of that value as a good tentative estimate for the common embedding dimension m.

The instability of the FNN percentages is remarkably evident in the case of DNS, whose nature seems closer to randomness. This may be due to the smaller amount of data involved in each transaction as compared to other protocols, leading to a higher burstiness. On the other hand, DNS only accounts for approximately 7% of the whole traffic volume. Finally, no significant difference has been evidenced between the three traces.

The effectiveness of the determined embedding dimensions has also been verified on all the pre-classified training sets by analyzing the corresponding RPs. We used ideas from the theory of smooth dynamical systems to identify the type of patterns that the recurrence plots should and should not contain, and we distinguished good and bad embeddings by their corresponding plots. Cleaning a recurrence plot from non-horizontal patterns is a first step in the determination of the correct embedding parameters, but usually is not sufficient. Two other undesirable

Fig. 4. FNN for eDonkey2000 inter-arrival time variance, trace C.

Fig. 5. FNN for DNS average packet length, trace B.

Fig. 6. FNN for FTP inter-average packet length, trace C.

Fig. 7. RP for eDonkey average packet length, m = 6, τ = 1, L1-norm.


features are isolated points, or very short lines, and short gaps frequently interrupting line segments. We would then expect a “clean” RP to represent a better reconstruction of the phase space dynamics that can be used for a reliable quantification analysis.

The following figures (colors are available in the online version) show some of the RPs corresponding to the most used protocols present in our traffic samples. A massive presence of hot colors (red, yellow, orange; see footnote 1) denotes small distances between vectors.

Visual inspection of these plots immediately reveals the presence of small-scale structures, more evident for HTTP. For DNS traffic, the RP shows a very fine-grained organization, so that Fig. 10 is drawn with the L2-norm to enhance its visual appeal. This reflects a high level of burstiness and randomness that gives further confirmation to our previous considerations (see the FNN analysis).

Furthermore, the plots in Figs. 7–10 are characterized by some regularity in the distribution of colors. In fact, after estimating appropriate time delays and embedding dimensions, the associated short-term series reveal faint regularities, indicating that there is some kind of deterministic process driving them. This kind of process can be immediately associated with the specific (hidden or explicit) protocol dynamics.

4.4. Quantification analysis for traffic differentiation

Once the most suitable embedding dimensions common to all the sampled traffic features and interesting flow types have been determined, it is time to perform the quantification measurements and analyze the results in order to determine discriminating properties of each traffic class. Two complementary studies have been made: in the first case, the RQA variables have been computed for the whole time series; in the second, we divided the whole time series into subseries and computed the variables in each subinterval, called an epoch. Each epoch is 60 min long and regularly shifted by 15 min, in such a way that each epoch overlaps the next one by 45 min. If N_e is the length of each epoch and d_e the shift, epoch i corresponds in the time series to the interval starting at t = (i − 1)d_e + 1 and ending at t = (i − 1)d_e + N_e + 1. The first investigation has been devised in view of comparing global effects due to structures in subseries, while the computation for various epochs has been made to emphasize the changes in state inside the whole time series. The average quantification results computed for inter-arrival time variance and packet length are, respectively, summarized in Tables 2 and 3.
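The epoch subdivision can be sketched as a sliding window over the sample index. The example assumes one sample per 10 s sampling interval, so that a 60 min epoch spans 360 samples and a 15 min shift spans 90 samples. It uses zero-based, half-open index pairs rather than the one-based convention of the text, and the function name is hypothetical.

```python
def epoch_windows(n_samples, epoch_len, shift):
    """Return (start, stop) index pairs of the overlapping epochs."""
    windows = []
    start = 0
    while start + epoch_len <= n_samples:
        windows.append((start, start + epoch_len))   # covers samples start..stop-1
        start += shift
    return windows

# A 2 h series sampled every 10 s has 720 samples.
print(epoch_windows(720, epoch_len=360, shift=90))
```

Each RQA variable would then be evaluated on `series[start:stop]` for every pair, giving one value per epoch.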

Construction of the discriminating feature vectors is the key step in building the knowledge base in our classification model. Here, each recognizable traffic type must be represented in terms of some RQA properties that are informative when distinguishing one class of sequences from the others. The chosen RQA descriptors were combined, for each training set, as a pair of 7-dimensional

1 For interpretation of color in Figs. 1–9 the reader is referred to the webversion of this article.

Table 2. Average RQA measurements for inter-arrival time variance.

         HTTP      eDonkey   DNS       SMTP      POP3      SSH
REC      15.464    3.764     1.804     1.414     81.039    16.501
DET      35.349    8.929     2.270     1.184     98.605    85.030
ENT      3.390     2.277     1.860     1.864     6.798     5.757
DIV      0.012     0.038     0.063     0.056     0.001     0.001
RATIO    2.286     2.430     1.259     0.837     1.217     5.154
LAM      1.746     0.009     0.000     0.000     95.277    77.358
TREND    −6.940    0.351     −0.107    0.063     −6.230    −11.213

Table 3. Average RQA measurements for packet lengths.

         HTTP      eDonkey   DNS       SMTP      POP3      SSH
REC      19.505    2.343     41.577    5.677     81.039    18.560
DET      29.359    5.005     48.092    5.329     98.518    83.578
ENT      3.115     2.046     3.621     2.018     6.707     5.957
DIV      0.023     0.045     0.017     0.053     0.001     0.001
RATIO    1.505     2.136     1.157     0.939     1.216     4.503
LAM      0.867     0.000     2.938     0.000     95.196    76.805
TREND    0.181     −0.180    −1.441    −0.075    −6.279    −11.979

Fig. 8. RP for HTTP average packet length, m = 6, τ = 1, L1-norm.

Fig. 9. RP for SMTP average packet length, m = 6, τ = 1, L1-norm.

Fig. 10. RP for DNS average packet length, m = 6, τ = 1, L2-norm.


feature vectors, respectively associated with the inter-arrival time variance and average packet length measurements. Our main objective is to build a decision tree for traffic classification, choosing the most promising RQA attribute to split on at each point in our decision process and branching accordingly. To do this we have to search the attribute space for the subset that is most likely to predict the traffic class best. Because irrelevant attributes are known to degrade the performance of the classification process, the RQA attribute pairs we considered have to be screened, to identify and exclude useless or redundant ones. The selected attributes generate a different set of rules, one rule for every discriminating threshold value. To select the most informative features for traffic flow classification, all the features were subjected to selection by calculating the mutual information gain for each feature, ranking them and selecting the best features for building the final classification model. The InfoGain algorithm with the ranker method [42] was applied using WEKA 3.5.8, and cross validation on each training set was used to identify the features that perform best for each given traffic type. The attributes resulting in the best InfoGain ranking scores are, in order, DIV, LAM and REC for both the feature vectors (inter-arrival time variance and average packet length). Different classification models were built with the best discriminating features, in addition to a globally scoped one built using all the features together. Finally, the classifier resulting from J48, that is the WEKA implementation of C4.5 decision trees [43], is described by the following decision tree.

Fig. 11. J48 decision tree.

Table 5. Confusion matrix.


An interesting remark that can be made by observing Fig. 11 is that only the inter-arrival time variance (itv mark) RQA measures are selected for classification of both interactive and non-interactive traffic. It should be noted that for non-interactive traffic, packet sizes are also a very good indicator of the protocol in use, but the relatively unstructured nature of interactive traffic makes the inter-arrival time variance a significantly better discriminator.
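The attribute ranking that precedes the construction of the tree rests on information gain. The sketch below shows the quantity for a single numeric attribute split at a given threshold. It is not WEKA's InfoGain evaluator, which handles numeric attributes through its own discretization, and the function names and sample values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(values, labels, threshold):
    """Reduction in class entropy obtained by splitting on value <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder

# Illustrative values only: a DIV-like attribute for two classes of flows.
div_values = [0.001, 0.001, 0.050, 0.060]
classes = ["interactive", "interactive", "bulk", "bulk"]
print(info_gain(div_values, classes, threshold=0.01))
```

A threshold that separates the classes perfectly recovers the full class entropy, which is why attributes with well-separated value ranges rank highest.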

5. Performance evaluation

In this section, we present the classifier’s performance evaluation, starting from the knowledge base built as detailed in the previous paragraphs. Since our model relies on non-stationary traffic properties that can only be assessed when a significant amount of data is available in the time series, we empirically determined a minimum time series size threshold (about 100 samples). Furthermore, since our classification paradigm aims to be totally port-independent, we have explicitly chosen to ignore port information in selecting the traffic end-to-end sessions to be used in our evaluation, and worked only on significantly long conversations between host pairs by filtering out from our sample traces all the traffic sessions falling below the above threshold.

5.1. Metrics

The most significant metrics that can be used to assess the effectiveness and accuracy of our classification scheme are defined by the entries in the taxonomy below (see Table 4).

Table 4. A taxonomy of the accuracy metrics.

Classified as →   X      not X
X                 TPR    FPR
not X             FNR    TNR

The True Positive Rate (TPR) and True Negative Rate (TNR) represent the percentage of elements that were correctly identified to, respectively, belong or not belong to traffic class X. The False Negatives Rate (FNR) is the percentage of members of class X classified as not belonging to class X. Correspondingly, the False Positives Rate (FPR) is the percentage of members of other classes classified as belonging to class X, expressing, in some sense, the trustworthiness of the classifier. The ideal confusion matrix is, therefore, a multiple of the identity matrix. A good traffic classifier aims to minimize the FNR and FPR, although the relative importance of each of these metrics heavily depends on the intended use of the classification results. A low FNR guarantees that only a small fraction of class X flows will be discarded, whereas a low FPR means that the set of flows classified as belonging to traffic class X will not contain non-X flows.
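The four rates defined above can be derived from a confusion matrix of raw counts, as in the following sketch. It assumes that rows hold the actual class and columns the assigned class; the function name and the sample counts are hypothetical.

```python
def class_rates(confusion, labels, target):
    """TPR, FNR, FPR and TNR for one class of a count-based confusion matrix.

    confusion[i][j] is the number of flows of actual class labels[i]
    that were classified as labels[j].
    """
    k = labels.index(target)
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                   # class X flows assigned elsewhere
    fp = sum(row[k] for row in confusion) - tp    # other flows assigned to class X
    tn = total - tp - fn - fp
    return {
        "TPR": tp / (tp + fn),
        "FNR": fn / (tp + fn),
        "FPR": fp / (fp + tn),
        "TNR": tn / (fp + tn),
    }

# Illustrative counts only: 10 flows of class X and 10 flows of other classes.
print(class_rates([[8, 2], [1, 9]], ["X", "other"], "X"))
```

Note that a matrix already normalized by row, with each row expressed in percent, directly exposes TPR and FNR, whereas FPR also requires the relative sizes of the classes.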

5.2. Experimental results

The results of our simple tests are summarized by the confusion matrix below (Table 5), effectively illustrating how well our model works and showing at the same time how and when it fails.

When compared against the most effective classification techniques known in the literature, the results obtained through our paradigm, while not impressive in overall accuracy, are promising in terms of flexibility in recognizing less structured (i.e. disguised peer-to-peer and highly interactive) protocols. In particular, the Bayesian neural network-based approach proposed in [16] classifies HTTP and SMTP traffic with much higher accuracy (respectively 89% and 97%), but is not so effective (47%) for peer-to-peer applications, where our scheme is more discriminating because it exploits the hidden peer-to-peer traffic dynamics. Furthermore, the algorithm in [16] is based on the use of a large number of features (including the Fourier transform of packet inter-arrival times for each direction), many of which are computationally challenging. On the other hand, the BLINC [22] behavior-based approach, once properly tuned, classifies WWW, DNS and SMTP flows with an FNR lower than 10%. However, since it tries to associate traffic-originating hosts with the services they provide or use, this method needs to gather information from several flows for each host before it can decide on the host role. Such a requirement might prevent its usage in real-time applications. The unsupervised machine learning algorithm for early identification presented in [18] correctly identifies more than 80% of total flows, for a number of applications, by only using the first five packets of each TCP flow, with one notable exception: the classifier labels 86% of POP3 flows as NNTP and 12.6% as SMTP, because POP3 flows always belong to clusters where POP3 is not

Classified as →   HTTP (%)   eDonkey (%)   DNS (%)   SMTP (%)   POP3 (%)   SSH (%)
HTTP              73.8       2.9           6.7       3.9        10.9       1.8
eDonkey           0.2        77.5          10.3      10.4       0.2        1.4
DNS               10.8       9.5           51.8      23.7       1.5        2.7
SMTP              3.8        9.1           25.4      45.8       11.2       4.7
POP3              8.5        0.2           3.2       12.6       75.2       0.3
SSH               3.3        0.4           2.3       3.4        1.2        89.4


the dominant application. In addition, it does not handle UDP traffic and requires the enforcement of the packet order of arrival on the first four packets of each flow.

5.3. Analysis and discussion

First of all, we note that inter-arrival time variance appears to be a good feature to be used for building traffic profiles. However, we find it somewhat surprising that our single-feature test classifier, using only inter-arrival times, performs quite satisfactorily on several such different protocols. This can be due to the fact that, in our analysis, we have examined nonlinear characteristics of dynamic systems and such nonlinear dynamic effects seem to manifest themselves more evidently in inter-arrival time variance than in other features. This was reflected in the results of the information gain estimation performed to detect the most discriminating RQA traffic properties for classification.

We can easily see that our model characterizes neither DNS nor SMTP flows well, as can be seen from the associated percentage of false positives. In fact, the two classes seem to almost merge with one another, whereas the other classes are more evidently separated. This effect may be caused by the selection of inter-arrival time variance alone as a discriminating feature. In fact, since many SMTP control connections are human-driven (although likely through a mail client or other graphical interface), it may be plausible that the inter-arrival time dynamics of the SMTP transactions can be easily confused with the DNS ones, which during SMTP sessions exhibit a very similar information exchange pattern (the SMTP speaker needs to contact DNS servers for each contacted party, to obtain the IP address of the destination mail server, by checking the MX records, etc.), and may also be easily confused with other mail-related protocols such as POP3. Also, mail or DNS servers share the property of communicating with other similar servers by using the same well-known service port, according to the same social behavior that can be expressed through the formation of communities or clusters between sets of addresses and ports. For such protocols, naive port-based classification techniques still achieve better results, since no single site could arbitrarily change the ports on which mail or DNS information is exchanged without effectively cutting itself off from exchanges with other Internet sites. Finally, the two protocols are both strictly transactional and, in a sense, highly predictable. These characteristics may adversely affect the sensitivity of the RQA statistics upon which our classifier is based.

5.4. Known limitations

Any supervised approach can only classify traffic for which it has labeled training data, and cannot discover new applications. Furthermore, the supervised learning phase, while needed only once, when the model's initial knowledge is constructed, is complex and slow, and requires considerable expertise in determining the optimal embedding dimension to be used in the subsequent recurrence quantification process. Also, it should be noted that, to achieve satisfactory results, the “training set” associated with each traffic type should be built on a significantly large number of pre-classified flows, taking into consideration the widest possible spectrum of specific traffic features. Finally, to be effective, the classification process needs to work on flows with a sufficient number of samples, so that specific non-stationary properties and hidden transition patterns in traffic can be detected and quantified reliably. This requirement makes our model better suited to an offline classification scenario.
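The dependence on flow length can be made concrete with a small sketch. A time-delay embedding of dimension m and delay τ turns a series of N samples into N − (m − 1)τ state vectors, so short flows leave very few points from which recurrence statistics can be estimated. The helper names and the `min_vectors` cut-off below are hypothetical and chosen only for illustration; no specific minimum is prescribed here.

```python
def inter_arrival_times(timestamps):
    """Differences between consecutive packet arrival times."""
    return [later - earlier
            for earlier, later in zip(timestamps, timestamps[1:])]


def embedded_vector_count(n_samples, dim, delay):
    """Number of delay vectors obtained from a series of n_samples points."""
    return max(0, n_samples - (dim - 1) * delay)


def has_enough_samples(timestamps, dim, delay, min_vectors):
    """True if the flow yields at least min_vectors embedded states."""
    n_samples = len(inter_arrival_times(timestamps))
    return embedded_vector_count(n_samples, dim, delay) >= min_vectors


# Hypothetical flow of 12 packets, i.e. 11 inter-arrival times.
flow = [0.0, 0.1, 0.3, 0.35, 0.6, 0.9, 1.0, 1.4, 1.5, 1.9, 2.0, 2.2]
vectors = embedded_vector_count(len(inter_arrival_times(flow)), dim=3, delay=2)
```

Such a check could be used to discard flows that are too short before any recurrence quantification is attempted.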

6. Conclusions and future work

We presented a new approach to traffic classification based on recurrence analysis and performed a detailed study of the nonlinear dynamics of some specific traffic flow types, to determine recurrence phenomena and hidden non-stationary transition patterns in the time series associated with the traffic classes that we would like to distinguish. The results show that this approach can be effective in evaluating and recognizing the complex dynamics of several widely used network protocols, such as HTTP, SMTP and several peer-to-peer file exchange protocols, starting from the corresponding traffic flows and transforming qualitative classification criteria into a quantitative discrimination strategy. Tools and techniques such as RP and RQA can be valuable for gaining insights into the hidden statistical characteristics of network traffic, which can be reliably used for classification. Because both techniques were originally conceived for nonlinear analysis and chaos theory, they prove particularly effective for traffic flow time series, due to the inherent fractal behavior of network traffic data.

Some directions open for further investigation are the analysis of IP addresses and IP address pairs, to single out machines hosting P2P software, and the study of sensitivity to various parameters, such as the embedding dimension, time delay, metric, minimum diagonal and vertical line lengths, and the sampling rate. Finally, additional research is needed to progress towards a classification technique that can be applied to real-time traffic.
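To show where these parameters enter the computation, the following self-contained sketch builds a recurrence matrix from a scalar series and derives two standard RQA measures, the recurrence rate and the determinism. It is a simplified illustration and not the toolchain used in this work: the function names, the `radius` threshold and the toy series are assumptions, and vertical line measures are omitted for brevity.

```python
import math


def embed(series, dim, delay):
    """Time-delay embedding: one dim-dimensional vector per start index."""
    count = len(series) - (dim - 1) * delay
    return [[series[i + k * delay] for k in range(dim)] for i in range(count)]


def max_norm(a, b):
    """Maximum (Chebyshev) distance between two state vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean distance, an alternative choice of metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recurrence_matrix(vectors, radius, metric=max_norm):
    """R[i][j] is 1 when states i and j lie within `radius` of each other."""
    return [[1 if metric(a, b) <= radius else 0 for b in vectors]
            for a in vectors]


def diagonal_line_lengths(matrix):
    """Lengths of diagonal runs of recurrent points above the main diagonal."""
    n = len(matrix)
    lengths = []
    for offset in range(1, n):
        run = 0
        for i in range(n - offset):
            if matrix[i][i + offset]:
                run += 1
            else:
                if run:
                    lengths.append(run)
                run = 0
        if run:
            lengths.append(run)
    return lengths


def recurrence_rate(matrix):
    """Fraction of recurrent points off the main diagonal (needs n >= 2)."""
    n = len(matrix)
    recurrent = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)
    return recurrent / (n * (n - 1))


def determinism(matrix, min_line=2):
    """Share of recurrent points lying on diagonal lines >= min_line long."""
    lengths = diagonal_line_lengths(matrix)
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(length for length in lengths if length >= min_line) / total


# Hypothetical, strictly periodic toy series.
series = [0, 1, 0, 1, 0, 1, 0, 1]
states = embed(series, dim=2, delay=1)
matrix = recurrence_matrix(states, radius=0.1, metric=max_norm)
rr = recurrence_rate(matrix)
det = determinism(matrix, min_line=2)
```

Changing `dim`, `delay`, `metric`, `radius` or `min_line` alters both measures, which is why a systematic sensitivity study over these parameters would be informative.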

F. Palmieri, U. Fiore / Computer Networks 53 (2009) 761–773

Francesco Palmieri holds two Computer Science degrees from Salerno University, Italy. Since 1989, he has worked for several international companies on a variety of networking-related projects, concerned with nation-wide communication systems, network management, transport protocols, and IP networking. Since 1997 he has led the network management/operation centre of the Federico II University, in Napoli, Italy. He has been closely involved with the development of the Internet in Italy in recent years, particularly within the academic and research sector, as a member of the Technical Scientific Committee and of the Computer Emergency Response Team of the Italian Academic and Research Network GARR. He is an active researcher in the fields of high performance/evolutionary networking and network security.

Ugo Fiore (Italian Physics degree, 1989) was with the Italian National Council for Research at the beginning of his career. He has worked for more than 10 years in industry, developing software support systems for telco operators. He is currently with the network management/operation centre of the Federico II University, in Napoli, Italy. His research interests focus on optimization techniques and algorithms aiming at improving the performance of high-speed core networks. He is also actively investigating security-related algorithms and protocols.