WWW Client/Server Traffic Characterization: A Proxy Server Point of View

J.C. Cano, T. Nachiondo, J. Sahuquillo, A. Pont and J. A. Gil
Departamento de Informática de Sistemas y Computadores
Universidad Politécnica de Valencia
Cno. de Vera s/n, Valencia, Spain
{jucano, tnachion, jsahuqui, apont, jagil}@disca.upv.es

This work was supported by Spanish Grant GV98-14-47.

Proceedings of the 33rd Hawaii International Conference on System Sciences - 2000
0-7695-0493-0/00 $10.00 (c) 2000 IEEE

Abstract

When performance studies of proxy cache server systems are made, one of the most common difficulties is identifying and obtaining representative workloads. Traces have traditionally been used as workloads, but gathering traces takes a large amount of time. If a self-similar traffic generator could be used instead, this problem would be solved, and evaluation studies would become faster and more flexible. This work contains two main parts. First, we study the self-similar property of several characteristics of the collected traces, such as the arrival pattern, the response size pattern, the elapsed request time pattern, and so on. Secondly, we model a source and develop a self-similar traffic arrival pattern generator.

1. Introduction

The World Wide Web (WWW) gives quick and easy access to an enormous variety of information in remote locations. Sometimes, however, retrieving Web pages takes a long time. One of the problems arises because the same document can be requested by many users at the same time, so identical copies of many documents pass through the same network links. As a result, network administrators see a growing utilization that requires bandwidth upgrades, and Web site administrators see a growing server utilization that requires upgrading or replacing servers. The key performance factors to consider are how to reduce the volume of network traffic produced by Web clients and servers, and how to improve the mean response time for WWW users. Mechanisms such as mirroring and caching have been proposed to rescue the Internet by reducing the page waiting time and WWW traffic. Mirroring involves cooperation between the page owners and the mirror sites and thus requires prior arrangements. Caching can be applied more generally: a cache server can be set up to provide closer-to-user services for users who wish to reduce the page access time by selecting the cache server as their proxy server. Improving Web performance with these mechanisms therefore depends on a deep knowledge of WWW workload characterization.

One of the main problems in obtaining a representative WWW workload is the need to collect WWW traffic over a long time, months or years; consequently, the traces obtained occupy a great deal of space. The Poisson process has long been used to model arrivals at networks. The work done by Leland et al. [1] suggested that the Poisson process was inadequate as a model of the arrival process. They showed that network traffic was much more closely modeled by self-similar processes.

In this paper we show that client workloads have self-similar behavior, and we describe how an initial arrival pattern generator of WWW traffic can be built. The paper has two parts. First, we establish the self-similarity of the WWW requests performed by the different clients of the Valencia Polytechnic University. To do so, a large and anonymous population of clients was chosen to get a big set of WWW traffic data. We then forced every WWW client request to pass through a proxy cache server, in which we tried different cache sizes. Big traces were obtained from the WWW traffic information generated by the client requests. The traffic contains information related to the most frequently used browsers. In order to achieve more representative traces, the experiment was done without notifying the clients that some information about their WWW requests was being collected. Each trace line contains several variables for every WWW request: the arrival time, the response time, the file size, the URL address, the file type (html, gif, jpeg, ...), and so on. We use different statistical methods (R/S, variance-time, and periodogram analysis) to obtain the Hurst parameter of the traces collected from the Squid tool [2]. Secondly, we present a WWW arrival pattern traffic generator. In this way, we provide an alternative to real WWW traffic for our simulations.

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 describes the theory behind self-similarity and the statistical methods used in the study. An overview of the experimental environment is presented in section 4. Section 5 contains the trace analysis. Section 6 presents the arrival pattern traffic generator, and finally section 7 presents the conclusions and future work.

2. WWW Self-Similarity related work

Since the first self-similarity studies related to computer networks, self-similarity has been widely applied in computer engineering fields, e.g. ATM networks [3], variable-bit-rate traffic [4], file systems, SPLASH2 benchmarks, wide-area traffic [5], and World Wide Web traffic [6].

Independently of the field, self-similarity can be attributed to the ON/OFF behavior of traffic sources within the system [7]. These ON/OFF periods have high variability or infinite variance, and the aggregate workload (i.e. traffic) they produce is self-similar or long-range dependent. This establishes a relation between the parameters describing the high variability (Noah Effect) and the self-similarity (Joseph Effect) [7], which motivated the study of different aspects of the implications of long-range dependence for traffic modeling and network performance evaluation.

The traffic due to the WWW is a particular subset of wide-area traffic. Factors that influence the characterization of this traffic include the distributions of WWW document sizes, the effects of caching and user preference in file transfer, and the effects of user "think time". One of the most striking aspects of this issue is the influence of the heavy-tailed nature of transmission. Idle time is not primarily a result of network protocols or user preferences, but rather stems from more basic properties of information storage and processing: both file sizes and user "think time" are themselves strongly heavy-tailed. The self-similarity of WWW traffic has been studied from different viewpoints. The main difference between them is the span of time used. In [6] the authors studied the self-similar behavior of the four busiest hours in their logs and showed that traffic due to WWW transfers can be self-similar when demand is high. In [1], however, the self-similarity of network traffic was demonstrated using many large datasets taken over a multi-year span. They showed that the burstiness of LAN traffic typically intensifies as the number of active traffic sources increases, contrary to commonly held views.

3. Self-Similar stochastic processes

In this section we define the self-similarity characteristic [7], discuss the mathematical definition of long-range dependence (Hurst effect [8]), and show some classes of stationary stochastic processes which are able to account for long-range dependence.

3.1. Definition of Self-Similarity

3.1.1. Intuitive description of self-similarity. The intuitive definition of self-similarity is that the process looks similar across all time scales. Figure 1 shows six time series plots of WWW accesses derived from the collected trace, with time units ranging from 100 seconds down to 2 seconds (a factor of 50). Successive plots are refinements of the previous plots: each refined plot covers an arbitrarily chosen interval of the coarser plot. There is significant burstiness.

The most outstanding characteristic of these plots is that it is difficult to distinguish among them. They display a characteristically bursty behavior that is independent of the time scale used. This burstiness provides strong evidence that the process is self-similar, because self-similar processes have heavy-tailed distributions where burstiness can be observed at all time scales.

Figure 1. Visual demonstration of self-similarity. Six panels plot the number of WWW accesses per time unit: (a) time unit = 100 sec, (b) time unit = 60 sec, (c) time unit = 20 sec, (d) time unit = 10 sec, (e) time unit = 4 sec, (f) time unit = 2 sec.

3.1.2. Mathematical description of self-similarity. A covariance stationary stochastic process $X = (X_t : t \geq 0)$ is defined as a process with constant mean $\mu = E[X_t]$, finite variance $\sigma^2 = E[(X_t - \mu)^2]$, and an autocorrelation function

$$r(k) = \frac{E[(X_t - \mu)(X_{t+k} - \mu)]}{E[(X_t - \mu)^2]}, \quad k \geq 0,$$

that depends only on $k$. In particular, we assume that $X$ has an autocorrelation function of the form

$$r(k) \sim k^{-\beta} L_1(k), \quad \text{as } k \to \infty,$$

where $0 < \beta < 1$ and $L_1$ is slowly varying at infinity, i.e. $\lim_{t \to \infty} L_1(tx)/L_1(t) = 1$ for all $x > 0$.

For each $m > 0$, let $X^{(m)} = (X_k^{(m)} : k > 0)$ denote a new time series obtained by averaging the original series $X$ over non-overlapping blocks of size $m$. That is, for each $m > 0$, $X^{(m)}$ is given by

$$X_k^{(m)} = \frac{1}{m}\left(X_{km-m+1} + \dots + X_{km}\right), \quad k > 0.$$

Note that for each $m$, the aggregated time series $X^{(m)}$ defines a covariance stationary process; $r^{(m)}$ denotes the corresponding autocorrelation function.

The process $X$ is called self-similar with self-similarity parameter $H = 1 - \beta/2$ if the corresponding aggregated processes $X^{(m)}$ have the same correlation structure as $X$, i.e., $r^{(m)}(k) = r(k)$ for all $m > 0$ ($k > 0$).
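The block averaging that defines the aggregated series can be sketched in a few lines of Python. This is a minimal illustration rather than code from the paper: the function name `aggregate` and the sample counts are hypothetical, and an incomplete trailing block is simply dropped, which is an assumption since the definition only covers complete blocks.

```python
def aggregate(series, m):
    """Average `series` over non-overlapping blocks of size m (the series X^(m))."""
    if m <= 0:
        raise ValueError("block size m must be positive")
    n_blocks = len(series) // m  # an incomplete trailing block is dropped
    return [sum(series[k * m:(k + 1) * m]) / m for k in range(n_blocks)]


# Hypothetical per-second access counts
counts = [4, 0, 2, 6, 1, 3, 5, 7]
print(aggregate(counts, 2))  # [2.0, 4.0, 2.0, 6.0]
print(aggregate(counts, 4))  # [3.0, 4.0]
```

Applying the same function with increasing m gives the family of aggregated series whose correlation structure is compared in the definition above.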

3.2. Estimating the Hurst parameter

This subsection provides a brief description of the statistical methods for assessing self-similarity: R/S analysis [1], Variance-Time analysis [12], and Periodogram-based analysis [7].

3.2.1. R/S Analysis. For given observations $(X_k : k = 1, 2, \dots, n)$ with partial sum $Y(n) = \sum_{i=1}^{n} X_i$, sample mean $\bar{X}(n)$, and sample variance

$$S^2(n) = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \left(\frac{1}{n}\right)^2 Y(n)^2,$$

the R/S statistic, or rescaled adjusted range, is given by:

$$\frac{R(n)}{S(n)} = \frac{1}{S(n)}\left[\max_{0 \leq t \leq n}\left(Y(t) - \frac{t}{n}Y(n)\right) - \min_{0 \leq t \leq n}\left(Y(t) - \frac{t}{n}Y(n)\right)\right] \qquad (1)$$

Hurst found that many naturally occurring time series appear to be well represented by the relation

$$E\left[\frac{R(n)}{S(n)}\right] \sim C_H\, n^H, \quad \text{as } n \to \infty,$$

for fractional Gaussian noise or fractional ARIMA, with Hurst parameter $H$ "typically" about 0.73, and $C_H$ a positive, finite constant not dependent on $n$.

To determine $H$ using the R/S statistic, proceed as follows. For a given set of observations of length $N$, subdivide the series into $K$ non-overlapping blocks, each of size $N/K$, and compute the rescaled adjusted range $R(k_i, n)/S(k_i, n)$ for each of the new "starting points" $k_i = iN/K$, $i = 1, 2, \dots$, which satisfy $k_i + n \leq N$. For values of $n$ smaller than $N/K$, one gets $K$ different estimates of $R(n)/S(n)$. For values of $n$ close to $N$, one gets fewer values, as few as 1 when $n \geq N - N/K$. Next, one takes logarithmically spaced values of $n$, starting with $n \approx 10$. Plotting $\log(R(k_i, n)/S(k_i, n))$ versus $\log(n)$ results in the rescaled adjusted range plot (pox diagram of R/S). The parameter $H$ can be estimated by fitting a line to the points in the pox plot. Since any short-range dependence in the series typically results in a transient zone at the low end of the plot, set a cut-off point and do not use the low end of the plot for purposes of estimating $H$.

Usually, the very high end of the plot is not used either, because there are too few points to make reliable estimates. The values of $n$ that lie between the lower and higher cut-off points are used to estimate $H$. For practical purposes, the most useful and attractive feature of R/S analysis is its relative robustness against changes of marginal distributions.
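The R/S procedure described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, block starting points are taken as the zero-based Python indices i * (N // K), values of n grow by a factor of about 1.5, and no low-end or high-end cut-off is applied, so a caller would still have to trim the pox plot as the text recommends.

```python
import math
import random


def rescaled_range(block):
    """Return R(n)/S(n) from equation (1) for one block, or None if S(n) = 0."""
    n = len(block)
    total = sum(block)  # Y(n)
    variance = sum(x * x for x in block) / n - (total / n) ** 2  # S^2(n)
    if variance <= 0:
        return None  # constant block: the statistic is undefined
    partial = 0.0
    deviations = [0.0]  # the t = 0 term, Y(0) - 0 = 0
    for t, x in enumerate(block, start=1):
        partial += x  # Y(t)
        deviations.append(partial - t / n * total)
    return (max(deviations) - min(deviations)) / math.sqrt(variance)


def least_squares_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def hurst_rs(series, num_blocks=8, min_n=10):
    """Estimate H as the slope of log10(R/S) against log10(n)."""
    length = len(series)
    starts = [i * (length // num_blocks) for i in range(num_blocks)]
    log_n, log_rs = [], []
    n = min_n
    while n <= length:
        for k in starts:
            if k + n <= length:
                value = rescaled_range(series[k:k + n])
                if value:
                    log_n.append(math.log10(n))
                    log_rs.append(math.log10(value))
        n = max(n + 1, int(n * 1.5))  # roughly logarithmic spacing
    if len(set(log_n)) < 2:
        raise ValueError("series too short to fit a line")
    return least_squares_slope(log_n, log_rs)


# Usage on synthetic independent noise (no long-range dependence expected)
rng = random.Random(1)
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]
print(round(hurst_rs(noise), 2))
```

For independent data the estimate should fall near 0.5, although R/S is known to be biased upwards for short blocks, which is one reason the low end of the plot is discarded in practice.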

3.2.2. Variance-Time Analysis. From a statistical viewpoint, for self-similar processes, the variances of the aggregated processes $X^{(m)}$ ($m = 1, 2, \dots$) decrease linearly (for large $m$) in log-log plots against $m$, with slopes arbitrarily flatter than $-1$.

In order to determine $H$ using the variance statistic, we consider the aggregated series, obtained by dividing a given series of length $N$ into non-overlapping blocks of size $m$ (we assume that both $N$ and $N/m$ are large) and averaging the series over each block. That is, for $m = 1, 2, \dots$, $X^{(m)}$ is given by

$$X^{(m)}(k) = \frac{1}{m}\sum_{i=(k-1)m+1}^{km} X(i), \quad k = 1, 2, \dots, [N/m] \qquad (2)$$

We compute its sample variance,

$$\widehat{\mathrm{Var}}\, X^{(m)} = \frac{1}{N/m}\sum_{k=1}^{N/m}\left(X^{(m)}(k) - \bar{X}\right)^2 \qquad (3)$$

where $\bar{X}$ is the sample mean.

The series $X^{(m)}$ scales like $m^{H-1}$; thus, if the series is Gaussian or at least has finite variance, the sample variance will be asymptotically proportional to $m^{2H-2}$ for large $N/m$ and $m$.

If $(X_1, X_2, \dots)$ are independent and identically distributed with finite mean and variance, then the variance decays as $m^{-1}$. From (2) we obtain

$$\mathrm{Var}(X^{(m)}) \sim a\, m^{-1}, \quad \text{as } m \to \infty.$$

We say that a process has slowly decaying variance if the aggregated variance follows equation (4). In (4) either the inter-event times cannot be independent or their variance is not finite.

$$\mathrm{Var}(X^{(m)}) \sim a\, m^{-\beta}, \quad 0 < \beta < 1, \quad \text{as } m \to \infty \qquad (4)$$

One plots the sample variance of the aggregated series versus $m$ on a log-log plot for successive values of $m$. The result should be a straight line with a slope of $2H - 2$. In practice, the slope is estimated by fitting a least-squares line to the points of the plot. If the series has no long-range dependence and finite variance, then $H = 0.5$ and the slope of the fitted line should be $-1$. Short-range effects can distort the estimates of $H$ if the low end of the plot is used, and at the very high end of the plot there are too few blocks to get reliable estimates of the variance. Thus, in practice, these points are not used.
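The variance-time method lends itself to a compact sketch. The code below is an illustration under stated assumptions, not the paper's tool: the function names are hypothetical, the caller chooses which block sizes to include (and therefore where the low and high cut-offs fall), and the series must not be constant at any aggregation level, since the logarithm of a zero variance is undefined.

```python
import math
import random


def aggregated_variance(series, m):
    """Sample variance of the block means X^(m)(k), as in equations (2) and (3)."""
    blocks = len(series) // m
    means = [sum(series[k * m:(k + 1) * m]) / m for k in range(blocks)]
    grand_mean = sum(means) / blocks
    return sum((x - grand_mean) ** 2 for x in means) / blocks


def hurst_variance_time(series, block_sizes):
    """Fit log10(variance) against log10(m); the slope is 2H - 2, so H = 1 + slope / 2."""
    xs = [math.log10(m) for m in block_sizes]
    ys = [math.log10(aggregated_variance(series, m)) for m in block_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return 1.0 + (sxy / sxx) / 2.0


# Usage on synthetic independent noise: the slope should be close to -1
rng = random.Random(7)
noise = [rng.gauss(0.0, 1.0) for _ in range(20000)]
print(round(hurst_variance_time(noise, [1, 2, 4, 8, 16, 32, 64, 128]), 2))
```

Passing only the middle range of block sizes is how this sketch would honor the advice to ignore both ends of the plot.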

3.2.3. Periodogram-Based Analysis. The periodogram is defined as

$$I(\nu) = \frac{1}{2\pi N}\left|\sum_{j=1}^{N} X(j)\, e^{ij\nu}\right|^2 \qquad (5)$$

with the frequency index running up to $q = (N-1)/2$, where $\nu$ is the frequency, $N$ is the length of the series, and $X$ is the time series. In the finite variance case, $I(\nu)$ is an estimator of the spectral density of $X$, and a series with long-range dependence will have a spectral density proportional to $|\nu|^{1-2H}$ for frequencies close to the origin. We thus expect a log-log plot of the periodogram versus the frequency to display a straight line with a slope of $1 - 2H$.
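A direct, unoptimized evaluation of the periodogram can make the slope relation concrete. The sketch below is an assumption-laden illustration: it evaluates equation (5) at the standard Fourier frequencies 2*pi*k/N, fits only the lowest 10% of them (the `fraction` parameter is a hypothetical choice for "close to the origin"), and uses plain Python instead of an FFT for clarity. The function names are not from the paper.

```python
import cmath
import math
import random


def periodogram(series, k):
    """Return (nu_k, I(nu_k)) from equation (5) at nu_k = 2*pi*k/N."""
    length = len(series)
    nu = 2.0 * math.pi * k / length
    total = sum(x * cmath.exp(1j * j * nu) for j, x in enumerate(series, start=1))
    return nu, abs(total) ** 2 / (2.0 * math.pi * length)


def hurst_periodogram(series, fraction=0.1):
    """Fit log10 I(nu) against log10 nu at low frequencies; slope = 1 - 2H."""
    q = (len(series) - 1) // 2
    count = max(2, int(q * fraction))
    xs, ys = [], []
    for k in range(1, count + 1):
        nu, value = periodogram(series, k)
        if value > 0:
            xs.append(math.log10(nu))
            ys.append(math.log10(value))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (1.0 - sxy / sxx) / 2.0


# Usage on synthetic independent noise: a flat spectrum, so H near 0.5
rng = random.Random(11)
noise = [rng.gauss(0.0, 1.0) for _ in range(2048)]
print(round(hurst_periodogram(noise), 2))
```

At these frequencies a constant offset in the series contributes nothing to the sum, so the mean does not need to be removed first. Individual periodogram values are very noisy, which is why the estimate from a short series is only approximate.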

4. Experimental environment

We evaluate the self-similar property of the proxy server infrastructure used in the Spanish data network for research and development. This network, named RedIRIS, depends directly on the Scientific Research Council.

RedIRIS offers a proxy-cache coordination service to its affiliated institutions from the academic and research community in Spain. The service intends to promote the installation of HTTP proxies in governmental and academic institutions connected directly to this network, and tries to find the best way for them to cooperate so that the final user of the global community gets the highest quality of service when accessing the World Wide Web.

Figure 2. RedIRIS community map of caches

As we can see in Figure 2, the network implements a distributed proxy cache in which every node caches information as a function of the global policy. Also, each node routes a request to the node that probably has the requested document. In the present study we analyze the WWW traffic at the proxy server of the Polytechnic University of Valencia (UPV).

4.1. Data collection

RedIRIS uses Squid [2], which originates from the Harvest system [10]. This system offers high-performance proxy caching for web clients. The traces captured by Squid consist of a sequence of WWW file requests, each of which has several fields: time stamp, elapsed time, client IP address, URL, and size in bytes. Once all the data has been captured, the next phase is to filter the trace collection in order to reduce it and obtain the different patterns that feed the statistical analysis discussed in section 3. The different phases followed in that process can be seen in Figure 3.
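The filtering step can be illustrated with a small parser that turns log lines into per-second patterns. This is a sketch under assumptions: it supposes Squid's native access.log field order (timestamp, elapsed milliseconds, client address, result code, size, method, URL, ident, hierarchy, content type), the sample lines and client addresses are invented for the example, and the function names are hypothetical rather than taken from the authors' filter.

```python
from collections import Counter


def parse_line(line):
    """Extract the fields named in the text from one native-format log line."""
    fields = line.split()
    return {
        "timestamp": float(fields[0]),   # seconds since the epoch
        "elapsed_ms": int(fields[1]),    # elapsed time of the request
        "client": fields[2],             # client IP address
        "size_bytes": int(fields[4]),    # response size
        "url": fields[6],
    }


def pattern_per_unit(records, field=None, unit_seconds=1.0):
    """Count requests per time unit, or sum `field` per time unit if given."""
    start = min(r["timestamp"] for r in records)
    totals = Counter()
    for r in records:
        index = int((r["timestamp"] - start) // unit_seconds)
        totals[index] += 1 if field is None else r[field]
    return [totals[i] for i in range(max(totals) + 1)]


# Hypothetical sample lines (values invented for illustration)
sample = [
    "946684800.100 120 10.0.0.1 TCP_MISS/200 2048 GET http://www.disca.upv.es/ - DIRECT/www.disca.upv.es text/html",
    "946684800.900 80 10.0.0.2 TCP_HIT/200 512 GET http://www.disca.upv.es/a.gif - NONE/- image/gif",
    "946684803.000 300 10.0.0.1 TCP_MISS/200 4096 GET http://www.disca.upv.es/b.html - DIRECT/www.disca.upv.es text/html",
]
records = [parse_line(line) for line in sample]
print(pattern_per_unit(records))                # arrival pattern: [2, 0, 1]
print(pattern_per_unit(records, "size_bytes"))  # response size pattern: [2560, 0, 4096]
print(pattern_per_unit(records, "elapsed_ms"))  # elapsed time pattern: [200, 0, 300]
```

The resulting lists are the kind of time series that the estimators of section 3 take as input.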

Figure 3. From Data Collection to parameter H. The diagram shows the processing stages: clients at upv.es, Squid proxy log (trace collection), proxy filter (trace reduction), arrival pattern, response size pattern, ... (trace processing), and statistical analysis (Hurst parameter). It includes a sample log line with result code TCP_MISS, method GET, URL http://www.disca.upv.es/~jucano, hierarchy DIRECT/www.disca.upv.es and content type text/html.

Table 1 shows the number of clients, the total number of requests, and the size of the two traces used in the rest of the work.

Table 1. Traces Characteristics

Trace   Number of clients   Requests   Size (Mbytes)
PW      1481                871695     6233
LW      764                 358114     2797

5. Statistical Self-Similarity Analysis

Using the data obtained by filtering the proxy cache traces, we applied the statistical methods described in section 3 and reached a consistent conclusion: the proxy WWW traffic presents the self-similar property on different time scales.

We analyze the self-similar property in several patterns: the arrival pattern, the completion pattern, the elapsed request time pattern and the response size pattern. As a summary of our experiments, Figure 4 plots the arrival pattern distribution for the two traces used in the analysis; one trace is from a low workload day (LW) and the other from a peak load day (PW).

Figure 4. Histogram for both traces grouped by hours. Two panels, one for the LW day and one for the PW day; x-axis: hours of the day (0 to 25), y-axis: number of accesses (scale x 10^4).

5.1. Arrival pattern Analysis

Initially, we applied the different methods mentioned above to analyze the Hurst parameter of the arrival pattern for different trace periods. We analyzed a full day (24 hours), a work day period (12 hours), and a peak (PW) and a low (LW) three-hour period. All the results obtained exhibit the self-similar property. The study of the different traces indicates that, as the activity level increases, the estimated Hurst parameter moves closer to 1.

We have obtained the Hurst parameter using different statistical approaches. As an example, Figure 5 plots the results for the busiest three-hour period of the PW trace obtained with the following methods: R/S analysis, aggregated variance analysis and periodogram.

Table 2. Estimated Hurst parameter for different periods of the traces

Hurst exponent H
Measure   Session (Hours)   R/S method   Variance method   Period. method
LW        00:00 – 24:00     0.84         0.98              0.74
LW        08:00 – 20:00     0.78         0.93              0.69
LW        11:00 – 14:00     0.72         0.69              0.78
LW        21:00 – 24:00     0.73         0.74              0.70
PW        00:00 – 24:00     0.85         0.98              0.80
PW        08:00 – 20:00     0.88         0.92              0.80
PW        11:00 – 14:00     0.78         0.79              0.88
PW        21:00 – 24:00     0.73         0.75              0.80

Figure 5. Graphical Analysis for the three busiest hours of the PW trace. Three panels: R/S Analysis (log10(R/S) versus log10(n); fitted line 0.782603 x - 0.303210), Variance Time Analysis (log10(Var(X^(m))) versus log10(n); fitted line -0.439429 x + 1.863396; Hurst parameter = 0.788610), and Periodogram Analysis (log10(periodogram); fitted line -0.773082 x + 0.848020; Hurst parameter = 0.886541).
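As a quick check of how fitted slopes map to the Hurst parameter, the relations from section 3.2 can be applied to two of the slopes reported in Figure 5. This is only a worked illustration of the stated formulas (the R/S slope estimates H directly, and the periodogram slope equals 1 - 2H), not an additional result:

$$H_{R/S} = 0.782603 \approx 0.78, \qquad H_{period} = \frac{1 - (-0.773082)}{2} = 0.886541 \approx 0.88$$

Both values agree with the PW 11:00 – 14:00 row of Table 2.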

R/S analysis can also be used, through a graphical "truth test", to identify short-range dependencies. If one plots the R/S results for different time series $X^{(t)}$, obtained by aggregating the original series over non-overlapping blocks of size $t$, and there are no short-range dependencies, then the calculated Hurst parameters remain close to the original value ($t = 1$). On the other hand, if the original series presents short-range dependencies, the slope of the R/S plot decreases towards 0.5 as the aggregation size increases. For example, Figure 6 shows the results obtained from the PW trace.

Figure 6. Plots of R/S with different aggregation levels for the PW trace. The levels are shown in increasing aggregation sizes from a) to d). Each panel plots log10(R/S) and is annotated with its estimated Hurst parameter.

Table 3. Estimated Hurst parameter for the completion pattern

Hurst exponent H
Measure   Session (Hours)   R/S method   Variance method   Period. method
LW        00:00 – 24:00     0.85         0.98              0.74
LW        08:00 – 20:00     0.78         0.93              0.69
LW        11:00 – 14:00     0.71         0.70              0.79
LW        21:00 – 24:00     0.73         0.84              0.68
PW        00:00 – 24:00     0.86         0.98              0.85
PW        08:00 – 20:00     0.87         0.92              0.79
PW        11:00 – 14:00     0.78         0.78              0.88
PW        21:00 – 24:00     0.74         0.76              0.63

Usually, when experiments in open systems are performed, flow balance is assumed [11], but as far as the authors know, no results about its self-similarity had been presented in the open literature at the time this paper was written. As the Internet may be considered an open system, a similar study was also carried out for the completion pattern. The results obtained are very similar to those of the arrival pattern. Table 3 shows the H parameter obtained.

5.2. Elapsed Request Time pattern Analysis

From the client's point of view, the most important improvement in a proxy cache system is to reduce the global amount of time needed to serve a document. So, in order to assess the system performance, we have evaluated the self-similar property of the elapsed time pattern. For this purpose we group the total elapsed time required by the requests that happen in every unit of time (1 second), which gives 86400 time units per day. Table 4 shows the Hurst parameter obtained for the elapsed time.

Table 4. Hurst parameter of the Elapsed time for several busy periods

Hurst exponent H
Measure   Session (Hours)   R/S method   Variance method   Period. method
LW        00:00 – 24:00     0.61         0.88              0.55
LW        08:00 – 20:00     0.62         0.70              0.55
LW        11:00 – 14:00     0.60         0.53              0.54
LW        21:00 – 24:00     0.55         0.56              0.56
PW        00:00 – 24:00     0.75         0.92              0.64
PW        08:00 – 20:00     0.69         0.79              0.61
PW        11:00 – 14:00     0.60         0.55              0.60
PW        21:00 – 24:00     0.54         0.60              0.52

Figure 7 shows the relation between the number of requests and the elapsed time per request for the two selected traces. As we can see, the elapsed time per request of the two traces (LW and PW) depends much more on the proxy cache configuration than on the number of requests.

Figure 7. Number of Requests (solid line), and elapsed time in seconds / Request (dashed line) of the proxy log LW (right) and PW (left), grouped per hour. x-axis: hours of the day; y-axis scale x 10^4.

In the three busiest hours the mean request response time in LW was 0.9918 seconds/request, while in PW it was 0.7086 seconds/request. As we can observe, the response time will affect the arrival pattern of each particular source, but the shape of the elapsed time is very similar from one busy day to another.

5.3. Response Size pattern Analysis

In addition to the elapsed response time, we have also studied the size of the requested documents. Table 5 shows the self-similar nature of the response size pattern by means of the H parameter value. As occurs with the arrival pattern, if the traffic increases the Hurst parameter also increases.

Table 5. Response size Hurst parameter for different busy periods of the traces

Hurst exponent H
Measure   Session (Hours)   R/S method   Variance method   Period. method
LW        00:00 – 24:00     0.54         0.58              0.52
LW        08:00 – 20:00     0.53         0.59              0.53
LW        11:00 – 14:00     0.57         0.51              0.60
LW        21:00 – 24:00     0.56         0.51              0.51
PW        00:00 – 24:00     0.71         0.85              0.56
PW        08:00 – 20:00     0.61         0.65              0.54
PW        11:00 – 14:00     0.55         0.54              0.59
PW        21:00 – 24:00     0.55         0.56              0.54

Having found self-similar behavior in all the cases analyzed, we studied the hypothetical relation between cache size and the H parameter. For this study, we changed both the memory cache size and the hard disk cache size, in order to know whether any mathematical relation exists between the cache size and the Hurst parameter. The conclusion was that the self-similar property appears in the same way when the cache size changes, but no direct relation between the cache size and the Hurst parameter was obtained.

6. Synthesizing Self-Similar Traffic

In this section an explanation of the self-similar property is given, and a model to generate an arrival pattern is built.

6.1. ON / OFF Sources

Willinger et al. [9] proposed an explanation of the self-similar property observed in Ethernet LAN traffic, and designed techniques to model a self-similar workload. The theory, developed by Taqqu and Levy [12], explains that the aggregation of several ON/OFF sources within the system results in self-similar traffic. An individual source can be classified as an ON/OFF source [9] if it alternates ON and OFF periods of highly variable length. That is, the distributions of the ON and OFF period lengths are heavy-tailed with parameters α1 and α2. As we describe in section 3, a heavy-tailed distribution has infinite variance, and the portion of the tail distribution depends on the α parameter.

To explain self-similarity in Web traffic, ON times correspond to the transmission times of individual Web files, and OFF periods correspond to the intervals between transmissions [13].

An example of a heavy-tailed distribution is the Pareto distribution. The general form of the Pareto distribution is:

$$F(x) = 1 - \left(\frac{a}{x}\right)^{\alpha},$$

where $a, \alpha \geq 0$, and $x \geq a$. If the ON and OFF period length distributions are heavy-tailed, they satisfy the property:

$$P[X > x] \sim c\, x^{-\alpha}, \quad \text{as } x \to \infty, \quad \text{with } 1 < \alpha < 2$$

[9]. If, in addition, the activity is uniform within an ON period of length $X$, then aggregating many such sources results in a self-similar process with Hurst parameter

$$H = \frac{3 - \alpha}{2} \qquad (6)$$
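Equation (6) and the Pareto distribution function translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and sampling by inverting the distribution function is a standard technique chosen here, not one the paper states it used.

```python
import random


def alpha_from_hurst(h):
    """Invert equation (6), H = (3 - alpha) / 2, to get alpha = 3 - 2H."""
    return 3.0 - 2.0 * h


def pareto_sample(alpha, a=1.0, rng=random):
    """Draw one value from F(x) = 1 - (a / x)^alpha by inverting F."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids division by zero
    return a / u ** (1.0 / alpha)


# Usage: period lengths for a target Hurst parameter of 0.8
alpha = alpha_from_hurst(0.8)
rng = random.Random(42)
lengths = [pareto_sample(alpha, a=1.0, rng=rng) for _ in range(5)]
print(round(alpha, 2), all(x >= 1.0 for x in lengths))
```

Because the tail is heavy, occasional samples are far larger than the minimum value a, which is what produces long ON or OFF periods.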

���0RGHOLQJ�D�6RXUFH

The properties of the ON/OFF source aggregation cane used to develop a simple model to generate a self-imilar arrival pattern by aggregating several sources andodeling the ON and OFF length periods of each sourceith a Pareto distribution. Parameter α in equation (6) is

alculated by using the estimated Hurst parameterbtained in section 5. The number of sources as well ashe ON period traffic of each source is calculated to adjusthe amount of accesses to the synthesized workloadowards the original one. Figure 8 shows the algorithmsed to model every source of the generator, where WLPH ishe actual simulation time and W�VLPXODWLRQ is the totalimulation time.

This model generates the arrival pattern. An ideal model should also be able to synthesize other attributes of every request, such as the client making the request, the URL, and the number of bytes generated. The generator must be able to produce a large number of self-similar arrival pattern workloads whose behavior is similar to that obtained from real traces. Furthermore, by modifying the number of ON/OFF processes, the inter-arrival time, and the Pareto parameters, it must be able to generate a great variety of synthetic workloads to estimate the benefits or disadvantages of several proposed proxy cache configurations.

While (time < t_simulation)
    Switch (state)
        Case state_OFF:
            Wait next ON state
            state = state_ON
            time = time + Pareto(alpha_ON)
            Next state_OFF period = time
        Case state_ON:
            While (time < Next state_OFF period)
                Generate_arrival
                Wait inter-arrival time
            End_While
            state = state_OFF
            time = time + Pareto(alpha_OFF)
            Next state_ON period = time
    End_Switch
End_While

Figure 8. ON / OFF Source Generator Algorithm
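The algorithm of Figure 8 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the source is assumed to start in the OFF state, activity inside an ON period is modeled with a constant inter-arrival time, and the names `on_off_source`, `min_on`, `min_off` and `inter_arrival` (as well as their default values) are hypothetical.

```python
import random


def pareto(a, alpha, rng):
    """Pareto-distributed length with minimum a and shape alpha."""
    return a / (1.0 - rng.random()) ** (1.0 / alpha)


def on_off_source(t_simulation, alpha_on, alpha_off,
                  min_on=1.0, min_off=1.0, inter_arrival=0.1, rng=None):
    """Return the arrival times produced by one ON/OFF source."""
    if rng is None:
        rng = random.Random()
    arrivals = []
    time = 0.0
    state = "OFF"
    # Assumption: the first OFF period is drawn at time zero.
    next_on = pareto(min_off, alpha_off, rng)
    next_off = 0.0
    while time < t_simulation:
        if state == "OFF":
            time = next_on                      # wait for the next ON state
            state = "ON"
            next_off = time + pareto(min_on, alpha_on, rng)
        else:
            # Uniform activity inside the ON period: one arrival every
            # inter_arrival seconds until the period (or the run) ends.
            while time < next_off and time < t_simulation:
                arrivals.append(time)           # Generate_arrival
                time += inter_arrival           # wait inter-arrival time
            state = "OFF"
            next_on = time + pareto(min_off, alpha_off, rng)
    return arrivals


if __name__ == "__main__":
    times = on_off_source(3600.0, alpha_on=1.4, alpha_off=1.4,
                          rng=random.Random(7))
    print(len(times), "arrivals in one simulated hour")
```

Passing a seeded `random.Random` instance makes a run repeatable, which is convenient when several cache configurations are compared against the same synthetic workload.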

In order to validate the model, we have synthesized WWW requests for 50 clients and aggregated them to obtain the overall traffic in the proxy server. The model has been adjusted to generate workloads with the properties of the different peak loads of the LW and PW traces.
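The aggregation step described above can be sketched as merging the per-client arrival lists into one stream and counting requests per time unit. The helper names `merge_sources` and `counts_per_unit` are assumptions for this example; each per-client list is assumed to be sorted in increasing time order.

```python
import heapq
import math


def merge_sources(per_source_arrivals):
    """Merge sorted per-client arrival lists into one sorted stream."""
    return list(heapq.merge(*per_source_arrivals))


def counts_per_unit(arrivals, t_simulation, time_unit):
    """Count arrivals in consecutive bins of width time_unit."""
    n_bins = int(math.ceil(t_simulation / time_unit))
    counts = [0] * n_bins
    for t in arrivals:
        if 0.0 <= t < t_simulation:
            # min() guards against a rounding edge case at the last bin.
            index = min(int(t // time_unit), n_bins - 1)
            counts[index] += 1
    return counts


if __name__ == "__main__":
    clients = [[0.5, 2.5], [1.0, 2.6, 9.0]]    # two toy clients
    merged = merge_sources(clients)
    print(merged)
    print(counts_per_unit(merged, t_simulation=10.0, time_unit=2.0))
```

The resulting count series is the kind of input that Hurst estimators operate on, and changing `time_unit` yields the same traffic viewed at different time scales.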

Table 6, Figure 9, and Figure 10 show the Hurst parameter of the synthetic workloads obtained with the generator for the PW trace (period 11:00 – 14:00) with a parameter H ≈ 0.8, and for the same period of the LW trace with H ≈ 0.7.

Table 6. Hurst parameter for different synthesized workloads

Measure    Session (hours)    Hurst exponent H
                              R/S method    Variance method
PW         11:00 – 14:00      0.80          0.78
LW         11:00 – 14:00      0.71          0.70

Clearly, the aggregated synthesized traffic is self-similar, with a Hurst parameter close to the one estimated for the real traffic. Furthermore, as mentioned above, we adjust the synthesized traffic to obtain an aggregate request rate close to that of the real traffic. Figure 11 shows the request pattern of the real traffic and the synthesized one.
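The variance method reported in Table 6 can be sketched as follows: the series of request counts is averaged over blocks of size m, the variance of the block means is plotted against m on log-log axes, and the Hurst parameter follows from the fitted slope as H = 1 + slope/2. This is a minimal pure-Python sketch; the function names and the choice of block sizes are assumptions for this example, not the authors' tooling.

```python
import math
import random


def least_squares_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def hurst_from_vt_slope(slope):
    """Variance-time relation: slope = -beta and H = 1 - beta / 2."""
    return 1.0 + slope / 2.0


def variance_time_hurst(counts, block_sizes):
    """Estimate H from the variance of block means at several block sizes."""
    xs, ys = [], []
    for m in block_sizes:
        n_blocks = len(counts) // m
        if n_blocks < 2:
            continue
        means = [sum(counts[i * m:(i + 1) * m]) / m for i in range(n_blocks)]
        mu = sum(means) / n_blocks
        var = sum((v - mu) ** 2 for v in means) / (n_blocks - 1)
        if var > 0.0:
            xs.append(math.log10(m))
            ys.append(math.log10(var))
    if len(xs) < 2:
        raise ValueError("need at least two usable block sizes")
    return hurst_from_vt_slope(least_squares_slope(xs, ys))


if __name__ == "__main__":
    # Slope reported for the synthesized LW workload in Figure 9.
    print(hurst_from_vt_slope(-0.581237))
    # Independent noise has no long-range dependence, so H should be near 0.5.
    rng = random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in range(8192)]
    print(variance_time_hurst(noise, [1, 2, 4, 8, 16, 32, 64]))
```

Applied to the fitted slopes printed in the figures, the relation reproduces the reported values: a slope of -0.581237 corresponds to H of about 0.709, and a slope of -0.431301 to H of about 0.784.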

[Figure 9 plots. R/S Analysis: log10(R/S) versus log10(n), fitted line 0.717884 x − 0.156740. Variance Time Analysis: log10(Var(X^m)) versus log10(n), fitted line −0.581237 x + 0.884558, Hurst parameter = 0.709382.]

Figure 9. Synthesized Hurst parameter obtained with the generator for the LW workloads with estimated H ≈ 0.7 (period 11:00 – 14:00 h)

[Figure 10 plots. R/S Analysis: log10(R/S) versus log10(n), fitted line 0.807169 x − 0.257697. Variance Time Analysis: log10(Var(X^m)) versus log10(n), fitted line −0.431301 x + 1.422317, Hurst parameter = 0.784349.]

Figure 10. Synthesized Hurst parameter obtained with the generator for the PW workloads with estimated H ≈ 0.8 (period 11:00 – 14:00 h)


Conclusion and Future Work

In this paper, a detailed study of World Wide Web transfers in a distributed proxy cache system has been carried out. The study shows that this workload presents characteristics that are consistent with self-similarity. We have examined this statistical property for the main patterns that characterize this kind of traffic, namely the arrival pattern, the elapsed request time pattern, and the request size. The results obtained are very encouraging for all the cases studied.

This statistical behavior has been tested using the typical statistical methods found in the open literature: the R/S, the variance, and the periodogram methods. In addition, we have used the aggregate of the R/S for a more complete analysis. In all the methods used, the Hurst parameter remains inside the permitted interval and depends directly on the network traffic. We have also looked into a possible mathematical relation between the cache size and the Hurst parameter, and reached a negative conclusion.

Having tested the self-similar property, we have validated that it is feasible to generate a self-similar workload using the ON/OFF method. An individual client model with ON/OFF behavior has been developed. By combining several of these synthesized clients, a synthesized arrival generator for the World Wide Web has been implemented. The analysis of the resulting trace gives a Hurst parameter close to that obtained from the original trace.

[Figure 11 plots: six panels of request counts over time. (a) Time unit = 100 sec, (b) Time unit = 60 sec, (c) Time unit = 20 sec, (d) Time unit = 10 sec, (e) Time unit = 4 sec, (f) Time unit = 1 sec.]

Figure 11. Real proxy cache request traffic of the PW trace, period 11:00 – 14:00 h (dashed line), versus the synthesized one

The most important conclusion of this paper is that a complete self-similar generator of World Wide Web transfers can be built in order to perform comparison and evaluation studies of proposed proxy system configurations. Such a generator would give us a more flexible tool for producing a wide variety of traces.


As for future work, we plan to improve the generator in order to produce traces that include other synthesized parameters, such as the request size, the URL, the clients, and others. We also plan to study whether the WWW workload characterization might be affected by parameters such as country, type of users, or protocol characteristics, which could be factors to include in the generator.

Acknowledgements

The authors would like to express their thanks to Vicente Benet, from Valencia Polytechnic Data Process Center, who supplied the traces used in this study and helped us with the Squid program tool.

References

[1] W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson, “On the Self-Similar Nature of Ethernet Traffic,” IEEE/ACM Transactions on Networking, Vol. 2, 1994.

[2] “Squid Internet Object Cache,” http://squid.nlanr.net/.

[3] N. D. Georganas, “Self-similar (“fractal”) traffic in ATM networks,” Proceedings of the 2nd International Workshop on Advanced Teleservices and High-Speed Communications Architectures (IWACA '94), pp. 1-7, 1994.

[4] J. Beran, R. Sherman, M. S. Taqqu, and W. Willinger, “Long-range dependence in variable-bit-rate video traffic,” IEEE Transactions on Communications, Vol. 43, pp. 1566-79, 1995.

[5] V. Paxson and S. Floyd, “Wide-Area Traffic: The Failure of Poisson Modeling,” IEEE/ACM Transactions on Networking, 3(3), pp. 226-244, 1995.

[6] M. E. Crovella and A. Bestavros, “Explaining world wide web traffic self-similarity,” Tech. Rep. TR-95-015, Computer Science Department, Boston University, 1995.

[7] W. Willinger, M. S. Taqqu, R. Sherman, and D. V. Wilson, “Self-Similarity Through High-Variability: Statistical Analysis of Ethernet LAN Traffic at the Source Level,” IEEE/ACM Transactions on Networking, Vol. 5, pp. 71-86, 1997.

[8] B. B. Mandelbrot, “The Fractal Geometry of Nature,” Freeman, New York, 1983.

[9] W. Willinger, M. S. Taqqu, R. Sherman, and D. V. Wilson, “Self-Similarity Through High-Variability: Statistical Analysis of Ethernet LAN Traffic at the Source Level,” IEEE/ACM Transactions on Networking, Vol. 5, pp. 71-86, 1997.

[10] P. B. Danzig, R. S. Hall, and M. F. Schwartz, “A Case for Caching File Objects Inside Internetworks,” Proceedings of the SIGCOMM '93, pp. 239-248, 1993.


[11] D. Ferrari, G. Serazzi, and A. Zeigner, “Measurement and Tuning of Computer Systems,” Prentice Hall, 1983.

[12] M. Taqqu and J. Levy, “Using renewal processes to generate long-range dependence and high variability,” Dependence in Probability and Statistics, E. Eberlein and M. Taqqu, Eds., pp. 73-89, (Boston, MA, 1986).

[13] M. E. Crovella and A. Bestavros, “Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes,” IEEE/ACM Transactions on Networking, Vol. 5, pp. 835-845, 1997.

[14] S. D. Gribble, G. S. Manku, and E. A. Brewer, “Self-similarity in high-level file system: Measurement and Application,” ACM SIGMETRICS '98, Madison, Wisconsin, June 1998.