WWW Client/Server Traffic Characterization: A Proxy Server Point of View
Proceedings of the 33rd Hawaii International Conference on System Sciences - 2000
J.C. Cano, T. Nachiondo, J. Sahuquillo, A. Pont and J. A. Gil
Departamento de Informática de Sistemas y Computadores
Universidad Politécnica de Valencia
Cno. de Vera s/n, Valencia, Spain
{jucano, tnachion, jsahuqui, apont, jagil}@disca.upv.es
1 This work was supported by Spanish Grant GV98-14-47
Abstract

When performance studies of proxy cache server systems are made, one of the most common difficulties is to identify and obtain representative workloads. Traces have traditionally been used as the workload, but gathering traces takes a large amount of time. If a self-similar traffic generator could be used, this problem would be solved, and evaluation studies would become faster and more flexible. This work has two main parts. First, we study the self-similar property of several characteristics of the collected traces, such as the arrival pattern, the response size pattern, the elapsed request time pattern, and so on. Secondly, we model a source and develop a self-similar traffic arrival pattern generator.
1. Introduction
The World Wide Web (WWW) gives quick and easy access to an enormous variety of information in remote locations. Retrieving Web pages, however, can sometimes take a long time. One cause is that the same document may be requested by many users at the same time, so that identical copies of many documents pass through the same network links. As a result, network administrators see a growing utilization that requires bandwidth upgrades, and Web site administrators see a growing server utilization that requires upgrading or replacing servers. The key performance factors to consider are how to reduce the volume of network traffic produced by Web clients and servers, and how to improve the mean response time for WWW users. Mechanisms such as mirroring and caching have been proposed to rescue the Internet by reducing the page waiting time and WWW traffic. Mirroring involves cooperation between the page owners and the mirror sites and thus requires prior arrangements. Caching can be applied more generally: a cache server can be set up to provide closer-to-user services, and users who wish to reduce the page access time select it as their proxy server. Improving Web performance therefore depends on a deep knowledge of WWW workload characterization.

0-7695-0493-0/00 $10.00 (c) 2000 IEEE
One of the main problems in obtaining a representative WWW workload is the need to collect WWW traffic over a long period, months or years; consequently, the resulting traces occupy a great deal of space. The Poisson process has long been used to model arrivals at networks. The work of Leland et al. [1] suggested that the Poisson process is inadequate as a model of the arrival process and showed that network traffic is much more closely modeled by self-similar processes.
In this paper we show that client workloads have self-similar behavior, and we describe how an initial arrival pattern generator of WWW traffic can be built. The paper has two parts. First, we establish the self-similarity of the WWW requests issued by the different clients of the Valencia Polytechnic University. To do so, a large and anonymous population of clients was chosen in order to obtain a big set of WWW traffic data. We then forced every WWW client request to pass through a proxy cache server, in which we tried different cache sizes. Large traces were obtained from the WWW traffic information generated by the client requests. The traffic contains information related to the most frequently used browsers. In order to obtain more representative traces, the experiment was carried out without notifying the clients that information about their WWW requests was being collected. Each trace line contains several variables for every WWW request: the arrival time, the response time, the file size, the URL address, the file type (html, gif, jpeg, ...), and so on. We use different statistical methods (R/S, variance-time, and periodogram analysis) to obtain the Hurst parameter of the traces collected with the Squid tool [2]. Secondly, we present a WWW arrival pattern traffic generator. In this way, we provide an alternative to real WWW traffic for our simulations.
The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 describes the theory behind self-similarity and the statistical methods used in the study. An overview of the experimental environment is presented in section 4. Section 5 contains the trace analysis. Section 6 presents the arrival pattern traffic generator, and finally section 7 presents the conclusions and future work.
2. WWW Self-Similarity related work

Since the first self-similarity studies related to computer networks, self-similarity has been widely applied in computer engineering fields, e.g. ATM networks [3], variable-bit-rate video [4], file systems, SPLASH2 benchmarks, wide-area traffic [5], and World Wide Web traffic [6].

Independently of the field, self-similarity can be attributed to the ON/OFF behavior of the traffic sources within the system [7]. When these ON/OFF periods have high variability or infinite variance, the resulting aggregate workload (i.e. traffic) is self-similar or long-range dependent. There is thus a relation between the parameters describing the high variability (Noah Effect) and the self-similarity (Joseph Effect) [7]. This result gave rise to the study of different aspects of the implications of long-range dependence for traffic modeling and network performance evaluation.
WWW traffic is a particular subset of wide-area traffic. Factors influencing the characterization of this traffic include the distribution of WWW document sizes, the effects of caching and user preference in file transfer, and the effects of user “think time”. One of the most striking aspects is the influence of the heavy-tailed nature of transmission. Idle time is not primarily a result of network protocols or user preferences, but rather stems from more basic properties of information storage and processing: both file sizes and user “think time” are themselves strongly heavy-tailed. The self-similarity of WWW traffic has been studied from different viewpoints; the main difference between them is the time span used. In [6] the authors studied the self-similar behavior of the four busiest hours in their logs and showed that traffic due to WWW transfers can be self-similar when demand is high. In [1], however, the authors demonstrated the self-similarity of network traffic using many large datasets taken over a multi-year span. They showed that the burstiness of LAN traffic typically intensifies as the number of active traffic sources increases, contrary to commonly held views.
3. Self-Similar stochastic processes

In this section we define the self-similarity characteristic [7], discuss the mathematical definition of long-range dependence (Hurst effect [8]), and show some classes of stationary stochastic processes which are able to account for long-range dependence.
3.1. Definition of Self-Similarity

3.1.1. Intuitive description of self-similarity. The intuitive definition of self-similarity is that the process looks similar across all time scales. Figure 1 shows six time series plots of WWW accesses, with a total difference of 50 seconds, derived from the collected trace. Successive plots are refinements of the previous plots. The scope of each refined plot was chosen by selecting an arbitrary interval of the previous plot. There is significant burstiness.

The most outstanding characteristic of these plots is that it is difficult to distinguish among them. They display a characteristically bursty behavior that is independent of the time scale used. This burstiness provides strong evidence that the process is self-similar, because self-similar processes have heavy-tailed distributions where burstiness can be observed at all time scales.
[Figure 1: six time series panels of WWW accesses. (a) Time unit = 100 sec; (b) Time unit = 60 sec; (c) Time unit = 20 sec; (d) Time unit = 10 sec; (e) Time unit = 4 sec; (f) Time unit = 2 sec.]

Figure 1. Visual demonstration of self-similarity
3.1.2. Mathematical description of self-similarity. A covariance stationary stochastic process $X = (X_t : t \geq 0)$ is defined as a process with constant mean $\mu = E[X_t]$, finite variance $\sigma^2 = E[(X_t - \mu)^2]$, and an autocorrelation function

$$r(k) = \frac{E[(X_t - \mu)(X_{t+k} - \mu)]}{E[(X_t - \mu)^2]}, \quad k \geq 0,$$

that depends only on $k$. In particular, we assume that $X$ has an autocorrelation function of the form

$$r(k) \sim k^{-\beta} L_1(k), \quad \text{as } k \to \infty,$$

where $0 < \beta < 1$ and $L_1$ is slowly varying at infinity, i.e. $\lim_{t \to \infty} L_1(tx)/L_1(t) = 1$ for all $x > 0$.

For each $m > 0$, let $X^{(m)} = (X^{(m)}_k : k > 0)$ denote the new time series obtained by averaging the original series $X$ over non-overlapping blocks of size $m$, that is,

$$X^{(m)}_k = \frac{1}{m}\left(X_{km-m+1} + \cdots + X_{km}\right), \quad k > 0.$$

For each $m$, the aggregated time series $X^{(m)}$ defines a covariance stationary process; let $r^{(m)}$ denote the corresponding autocorrelation function.

The process $X$ is called self-similar with self-similarity parameter $H = 1 - \beta/2$ if the corresponding aggregated processes $X^{(m)}$ have the same correlation structure as $X$, i.e., $r^{(m)}(k) = r(k)$ for all $m > 0$ ($k > 0$).
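The block averaging that defines the aggregated series can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the study: the function name `aggregate` and the sample counts are assumptions, and trailing samples that do not fill a complete block are simply dropped.

```python
from typing import List, Sequence


def aggregate(series: Sequence[float], m: int) -> List[float]:
    """Return the aggregated series: means over consecutive,
    non-overlapping blocks of size m (incomplete last block dropped)."""
    if m < 1:
        raise ValueError("block size m must be >= 1")
    n_blocks = len(series) // m
    return [sum(series[k * m:(k + 1) * m]) / m for k in range(n_blocks)]


counts = [2, 4, 6, 8, 1, 3, 5]   # made-up per-second request counts
print(aggregate(counts, 2))      # block means for m = 2
```

Comparing the correlation structure of `aggregate(series, m)` for growing `m` with that of the original series is the idea behind the estimators described in the next subsection.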
3.2. Estimating the Hurst parameter

This subsection briefly describes the statistical methods used to assess self-similarity: R/S analysis [1], variance-time analysis [12], and periodogram-based analysis [7].
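3.2.1. R/S Analysis. For given observations $(X_k : k = 1, 2, \ldots, n)$ with partial sum $Y(n) = \sum_{i=1}^{n} X_i$, sample mean $\bar{X}(n)$, and sample variance

$$S^2(n) = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \left(\frac{1}{n}\right)^2 Y(n)^2,$$

the R/S statistic, or rescaled adjusted range, is given by:

$$\frac{R}{S}(n) = \frac{1}{S(n)}\left[\max_{0 \leq t \leq n}\left(Y(t) - \frac{t}{n}Y(n)\right) - \min_{0 \leq t \leq n}\left(Y(t) - \frac{t}{n}Y(n)\right)\right] \qquad (1)$$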
Hurst found that many naturally occurring time series appear to be well represented by the relation $E[R(n)/S(n)] \sim C_H n^H$, as $n \to \infty$, for fractional Gaussian noise or fractional ARIMA, with Hurst parameter $H$ “typically” about 0.73, and $C_H$ a positive, finite constant not dependent on $n$.
To determine $H$ using the R/S statistic, proceed as follows. For a given series of observations of length $N$, subdivide the series into $K$ non-overlapping blocks, each of size $N/K$, and compute the rescaled adjusted range $R(k_i, n)/S(k_i, n)$ for each of the new “starting points” $k_i = iN/K + 1$, $i = 1, 2, \ldots$, which satisfy $k_i + n \leq N$. For values of $n$ smaller than $N/K$, one gets $K$ different estimates of $R(n)/S(n)$. For values of $n$ close to $N$, one gets fewer values, as few as 1 when $n \geq N - N/K$. Next, one takes logarithmically spaced values of $n$, starting with $n \approx 10$. Plotting $\log(R(k_i, n)/S(k_i, n))$ versus $\log(n)$ results in the rescaled adjusted range plot (pox diagram of R/S). The parameter $H$ can be estimated by fitting a line to the points in the pox plot. Since any short-range dependence in the series typically results in a transient zone at the low end of the plot, set a cut-off point and do not use the low end of the plot for purposes of estimating $H$.
Usually, the very high end of the plot is not used either, because there are too few points to make reliable estimates. The values of $n$ that lie between the lower and higher cut-off points are used to estimate $H$. For practical purposes, the most useful and attractive feature of R/S analysis is its relative robustness against changes of marginal distributions.
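The R/S procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names, the number of blocks, the upper cut-off at half the series length, and the use of 0-based starting points are all choices made for the example. The white-noise input is synthetic and only exercises the code.

```python
import math
import random
from typing import List, Sequence, Tuple


def rescaled_range(x: Sequence[float]) -> float:
    """R(n)/S(n) from equation (1) for a single window x of length n."""
    n = len(x)
    mean = sum(x) / n
    variance = sum(v * v for v in x) / n - mean * mean
    if variance <= 0.0:
        return float("nan")          # constant window: statistic undefined
    deviations = [0.0]               # Y(t) - (t/n) Y(n) at t = 0
    partial = 0.0
    for t, v in enumerate(x, start=1):
        partial += v
        deviations.append(partial - t * mean)
    return (max(deviations) - min(deviations)) / math.sqrt(variance)


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def hurst_rs(series: Sequence[float], num_blocks: int = 8,
             n_min: int = 10, num_sizes: int = 12) -> float:
    """Estimate H as the slope of the pox plot between two cut-offs."""
    total = len(series)
    n_max = total // 2               # assumed upper cut-off
    starts = [i * total // num_blocks for i in range(num_blocks)]  # 0-based
    log_n: List[float] = []
    log_rs: List[float] = []
    for j in range(num_sizes):
        # logarithmically spaced window lengths from n_min to n_max
        n = round(n_min * (n_max / n_min) ** (j / (num_sizes - 1)))
        for k in starts:
            if k + n <= total:
                rs = rescaled_range(series[k:k + n])
                if rs > 0.0:         # also skips NaN
                    log_n.append(math.log10(n))
                    log_rs.append(math.log10(rs))
    slope, _ = fit_line(log_n, log_rs)
    return slope


rng = random.Random(0)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(4096)]
print(round(hurst_rs(white_noise), 2))   # independent samples, no long-range dependence
```

Raising `n_min` moves the lower cut-off and is one way to keep the short-range transient zone out of the fit.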
3.2.2. Variance-Time Analysis. From a statistical viewpoint, for self-similar processes, the variances of the aggregated processes $X^{(m)}$ ($m = 1, 2, \ldots$) decrease linearly (for large $m$) in log-log plots against $m$ with slopes arbitrarily flatter than –1.
In order to determine $H$ using the variance statistic, we consider the aggregated series, obtained by dividing a given series of length $N$ into non-overlapping blocks of size $m$ (we assume that both $N$ and $N/m$ are large) and averaging the series over each block. That is, for $m = 1, 2, \ldots$, $X^{(m)}$ is given by

$$X^{(m)}(k) = \frac{1}{m}\sum_{i=(k-1)m+1}^{km} X(i), \quad k = 1, 2, \ldots, [N/m] \qquad (2)$$
We compute its sample variance,

$$\widehat{\mathrm{Var}}\, X^{(m)} = \frac{1}{N/m}\sum_{k=1}^{N/m}\left(X^{(m)}(k) - \bar{X}\right)^2 \qquad (3)$$
The series $X^{(m)}$ scales like $m^{H-1}$; thus, if the series is Gaussian or at least has finite variance, the sample variance will be asymptotically proportional to $m^{2H-2}$ for large $N/m$ and $m$.
If $(X_1, X_2, \ldots)$ are independent and identically distributed with finite mean and variance, then the variance of the aggregated series decays as $m^{-1}$. From (2) we obtain

$$\mathrm{Var}(X^{(m)}) \sim a\, m^{-1}, \quad \text{as } m \to \infty.$$

We say that a process has slowly decaying variance if the aggregated variance follows equation (4). In (4), either the inter-event times cannot be independent or their variance is not finite.

$$\mathrm{Var}(X^{(m)}) \sim a\, m^{-\beta}, \quad 0 < \beta < 1, \quad \text{as } m \to \infty \qquad (4)$$

One plots the sample variance of the aggregated series versus $m$ on a log-log plot for successive values of $m$. The result should be a straight line with a slope of $2H - 2$. In practice, the slope is estimated by fitting a least-squares line to the points of the plot. If the series has no long-range dependence and finite variance, then $H = 0.5$ and the slope of the fitted line should be –1. Short-range effects can distort the estimates of $H$ if the low end of the plot is used, and at the very high end of the plot there are too few blocks to get reliable estimates of the variance. Thus, in practice, these points are not used.
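A compact sketch of the variance-time estimator follows. It is an illustration only: the function name `hurst_variance_time`, the choice of block sizes, and the synthetic white-noise input are assumptions made for the example. It computes the block means of equation (2), the sample variance of equation (3), and converts the fitted slope to $H$ through slope $= 2H - 2$.

```python
import math
import random
from typing import List, Sequence


def hurst_variance_time(series: Sequence[float],
                        block_sizes: Sequence[int]) -> float:
    """Estimate H from the slope of log10(variance) against log10(m)."""
    total = len(series)
    overall_mean = sum(series) / total
    log_m: List[float] = []
    log_var: List[float] = []
    for m in block_sizes:
        blocks = total // m
        if blocks < 2:
            continue                 # too few blocks for a variance estimate
        # equation (2): block means of the aggregated series
        aggregated = [sum(series[k * m:(k + 1) * m]) / m for k in range(blocks)]
        # equation (3): sample variance around the overall mean
        variance = sum((a - overall_mean) ** 2 for a in aggregated) / blocks
        if variance > 0.0:
            log_m.append(math.log10(m))
            log_var.append(math.log10(variance))
    # least-squares slope of the log-log plot
    n = len(log_m)
    mx, my = sum(log_m) / n, sum(log_var) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(log_m, log_var))
             / sum((a - mx) ** 2 for a in log_m))
    return 1.0 + slope / 2.0         # slope = 2H - 2


rng = random.Random(0)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(16384)]
sizes = [2 ** j for j in range(2, 9)]    # m = 4 ... 256, an arbitrary range
print(round(hurst_variance_time(white_noise, sizes), 2))
```

The same slope-to-$H$ conversion can be checked against a fitted line reported later in the paper: a slope of -0.581237 gives $H = 1 - 0.581237/2 \approx 0.709$.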
3.2.3. Periodogram-Based Analysis. The periodogram is defined as

$$I(\nu) = \frac{1}{2\pi N}\left|\sum_{j=1}^{N} X(j)\, e^{ij\nu}\right|^2, \quad i = 1, 2, \ldots, q, \quad q = \frac{N-1}{2} \qquad (5)$$
where $\nu$ is the frequency, $N$ is the length of the series, and $X$ is the time series. In the finite variance case, $I(\nu)$ is an estimator of the spectral density of $X$, and a series with long-range dependence will have a spectral density proportional to $|\nu|^{1-2H}$ for frequencies close to the origin. We thus expect a log-log plot of the periodogram versus the frequency to display a straight line with a slope of $1 - 2H$.
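The periodogram estimator can be sketched in pure Python as below. This is a hedged illustration: the function names, the evaluation at the Fourier frequencies $2\pi i/N$, and the decision to fit only the lowest 10% of frequencies are assumptions of the example, not details given in the text. A direct sum is used instead of an FFT to keep the block dependency-free, so it is only suitable for short series.

```python
import cmath
import math
import random
from typing import List, Sequence


def periodogram(series: Sequence[float], index: int) -> float:
    """I(nu) of equation (5) at the assumed frequency nu = 2*pi*index/N."""
    total = len(series)
    nu = 2.0 * math.pi * index / total
    acc = sum(x * cmath.exp(1j * j * nu) for j, x in enumerate(series, start=1))
    return abs(acc) ** 2 / (2.0 * math.pi * total)


def hurst_periodogram(series: Sequence[float], fraction: float = 0.1) -> float:
    """Estimate H from the log-log slope of the periodogram near the origin."""
    total = len(series)
    q = (total - 1) // 2
    highest = max(2, int(q * fraction))   # keep only frequencies close to zero
    log_nu: List[float] = []
    log_i: List[float] = []
    for index in range(1, highest + 1):
        value = periodogram(series, index)
        if value > 0.0:
            log_nu.append(math.log10(2.0 * math.pi * index / total))
            log_i.append(math.log10(value))
    n = len(log_nu)
    mx, my = sum(log_nu) / n, sum(log_i) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(log_nu, log_i))
             / sum((a - mx) ** 2 for a in log_nu))
    return (1.0 - slope) / 2.0            # slope = 1 - 2H


rng = random.Random(0)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(1024)]
print(round(hurst_periodogram(white_noise), 2))
```

Because individual periodogram ordinates are very noisy, the estimate from a short series fluctuates considerably; longer series or more frequencies tighten it.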
4. Experimental environment

We evaluate the self-similar property on the proxy server infrastructure of the Spanish data network for research and development. This network, named RedIRIS, depends directly on the Scientific Research Council.

RedIRIS offers a proxy-cache coordination service to its affiliated institutions from the academic and research community in Spain. The service is intended to promote the installation of HTTP proxies in governmental and academic institutions connected directly to the network, and tries to find the best way for them to cooperate so as to offer end users of the global community the highest quality of service when accessing the World Wide Web.
[Figure 2: map of the RedIRIS community caches; node labels include RedIRIS, CICA, CESCA, UPV, UV, UJI and others.]

Figure 2. RedIris community map caches
As Figure 2 shows, the network implements a distributed proxy cache in which every node caches information according to the global policy. In addition, each node routes a request to the node that probably holds the requested document. In the present study we analyze the WWW traffic at the proxy server of the Polytechnic University of Valencia (UPV).
4.1. Data collection

RedIRIS uses Squid [2], which originates from the Harvest system [10] and offers high-performance proxy caching for web clients. The traces captured by Squid consist of a sequence of WWW file requests, each of which has several fields, such as the time stamp, the elapsed time, the client IP address, the URL, and the size in bytes. Once all the data have been captured, the next phase is to filter the trace collection in order to reduce it and obtain the different patterns that feed the statistical analysis discussed in section 3. The phases followed in that process are shown in Figure 3.
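The trace reduction step can be illustrated with a small parser. This sketch assumes the whitespace-separated layout of Squid's native access log (time stamp, elapsed milliseconds, client address, result code, size in bytes, method, URL, ...); the field positions, the `Request` type, the helper names, and the sample lines are assumptions for illustration and are not taken from the traces used in the study.

```python
from typing import Callable, Iterable, List, NamedTuple, Optional


class Request(NamedTuple):
    timestamp: float     # seconds since the epoch
    elapsed_ms: int      # time taken to serve the request
    client: str          # client IP address
    size_bytes: int      # size of the reply
    url: str


def parse_line(line: str) -> Optional[Request]:
    """Parse one log line; return None if it does not fit the assumed layout."""
    fields = line.split()
    if len(fields) < 7:
        return None
    try:
        return Request(float(fields[0]), int(fields[1]), fields[2],
                       int(fields[4]), fields[6])
    except ValueError:
        return None


def per_second(requests: Iterable[Request],
               value: Callable[[Request], int] = lambda r: 1) -> List[int]:
    """Sum value(request) over each one-second time unit of the trace."""
    reqs = sorted(requests, key=lambda r: r.timestamp)
    if not reqs:
        return []
    first = int(reqs[0].timestamp)
    bins = [0] * (int(reqs[-1].timestamp) - first + 1)
    for r in reqs:
        bins[int(r.timestamp) - first] += value(r)
    return bins


# Made-up sample lines using documentation-only addresses and hosts.
sample = [
    "946684800.120 350 192.0.2.10 TCP_MISS/200 4096 GET http://example.org/a.html - DIRECT/example.org text/html",
    "946684800.900 120 192.0.2.11 TCP_HIT/200 1024 GET http://example.org/b.gif - NONE/- image/gif",
    "946684802.300 800 192.0.2.10 TCP_MISS/200 2048 GET http://example.org/c.html - DIRECT/example.org text/html",
]
requests = [r for r in map(parse_line, sample) if r is not None]
print(per_second(requests))                            # arrival pattern
print(per_second(requests, lambda r: r.size_bytes))    # response size pattern
print(per_second(requests, lambda r: r.elapsed_ms))    # elapsed time pattern
```

Each resulting list is a per-second time series of the kind the estimators of section 3 take as input.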
[Figure 3: processing pipeline diagram. Boxes: Clients at upv.es, Squid, Proxy Log, Proxy Filter, Arrival pattern / Response size pattern / ..., Statistical Analysis. Phases: Trace Collection, Trace Reduction, Trace Processing, Hurst Parameter. The figure also shows a sample Squid log line for a TCP_MISS GET request served DIRECT from www.disca.upv.es with type text/html.]

Figure 3. From Data Collection to parameter H
Table 1 shows the number of clients, the total number of requests, and the size of the two traces used in the rest of the work.
Table 1. Traces Characteristics

Trace   Number of clients   Requests   Size (Mbytes)
PW      1481                871695     6233
LW      764                 358114     2797
5. Statistical Self-Similarity Analysis

Using the data obtained by filtering the proxy cache traces, we applied the statistical methods described in section 3 and reached a consistent conclusion: the proxy WWW traffic exhibits the self-similar property on different time scales.

We analyze the self-similar property of several patterns: the arrival pattern, the completion pattern, the elapsed request time pattern, and the response size pattern. As a summary of our experiments, Figure 4 plots the arrival pattern distribution for the two traces used in the analysis; one trace is from a low workload day (LW) and the other from a peak load day (PW).
[Figure 4: two histograms of the number of accesses (vertical axis, scale x 10^4) against the hour of the day, one for the LW day and one for the PW day.]

Figure 4. Histogram for both traces grouped by hours
5.1. Arrival pattern Analysis

Initially, we applied the methods mentioned above to estimate the Hurst parameter of the arrival pattern for different trace periods. We analyzed a full day (24 hours), a working day (12 hours), and peak (PW) and low (LW) three-hour periods. All the results obtained exhibit the self-similar property. The study of the different traces shows that, as the activity level increases, the estimated Hurst parameter moves closer to 1.
We obtained the Hurst parameter using different statistical approaches. As an example, Figure 5 plots the results for the three busiest hours of the PW trace obtained with R/S analysis, aggregated variance analysis, and the periodogram.
Table 2. Estimated Hurst parameter for different periods of the traces

                                    Hurst exponent H
Measure   Session (Hours)   R/S method   Variance method   Period. method
LW        00:00 – 24:00     0.84         0.98              0.74
LW        08:00 – 20:00     0.78         0.93              0.69
LW        11:00 – 14:00     0.72         0.69              0.78
LW        21:00 – 24:00     0.73         0.74              0.70
PW        00:00 – 24:00     0.85         0.98              0.80
PW        08:00 – 20:00     0.88         0.92              0.80
PW        11:00 – 14:00     0.78         0.79              0.88
PW        21:00 – 24:00     0.73         0.75              0.80
[Figure 5: three log-log panels. R/S Analysis: log10(R/S) versus log10(n), fitted line 0.782603 x - 0.303210. Variance Time Analysis: log10(Var(Xm)) versus log10(n), fitted line -0.439429 x + 1.863396, Hurst parameter = 0.788610. Periodogram Analysis: log10(Periodogram), fitted line -0.773082 x + 0.848020, Hurst parameter = 0.886541.]

Figure 5. Graphical Analysis for the three busiest hours of the PW trace
R/S analysis can also be used as a graphical test to identify short-range dependencies. The R/S method is applied to different time series $X^{(t)}$, obtained by aggregating the original series over non-overlapping blocks of size $t$. If there are no short-range dependencies, the calculated Hurst parameters remain close to the original value ($t = 1$). If, on the other hand, the original series presents short-range dependencies, the slope of the R/S plot decreases towards 0.5 as the aggregation size increases. As an example, Figure 6 shows the results obtained from the PW trace.
[Figure 6: four panels, a) to d), plotting log10(R/S); the panel annotations report Hurst parameter values of 0.85, 0.90, 0.97 and 0.98.]

Figure 6. Plots of R/S with different aggregation levels [...] for PW trace. The levels are shown in increasing aggregation sizes from a) to d).
Table 3. Estimated Hurst parameter for the completion pattern

                                    Hurst exponent H
Measure   Session (Hours)   R/S method   Variance method   Period. method
LW        00:00 – 24:00     0.85         0.98              0.74
LW        08:00 – 20:00     0.78         0.93              0.69
LW        11:00 – 14:00     0.71         0.70              0.79
LW        21:00 – 24:00     0.73         0.84              0.68
PW        00:00 – 24:00     0.86         0.98              0.85
PW        08:00 – 20:00     0.87         0.92              0.79
PW        11:00 – 14:00     0.78         0.78              0.88
PW        21:00 – 24:00     0.74         0.76              0.63
Usually, when experiments on open systems are performed, flow balance is assumed [11], but as far as the authors know, no results about the self-similarity had been presented in the open literature at the time this paper was written. As the Internet may be considered an open system, a similar study was also carried out for the completion pattern. The results obtained are very similar to those of the arrival pattern. Table 3 shows the $H$ parameter obtained.
5.2. Elapsed Request Time pattern Analysis

The most important improvement, from the client’s point of view in a proxy cache system, is to reduce the global amount of time needed to serve a document. So, in order to assess the system performance, we have evaluated the self-similar property of the elapsed time pattern. For this purpose we group the total elapsed time required by the requests that occur in every unit of time (1 second); we therefore have 86400 time units per day. Table 4 shows the Hurst parameter obtained for the elapsed time.
Table 4. Hurst parameter of the Elapsed time for several busy periods

                                    Hurst exponent H
Measure   Session (Hours)   R/S method   Variance method   Period. method
LW        00:00 – 24:00     0.61         0.88              0.55
LW        08:00 – 20:00     0.62         0.70              0.55
LW        11:00 – 14:00     0.60         0.53              0.54
LW        21:00 – 24:00     0.55         0.56              0.56
PW        00:00 – 24:00     0.75         0.92              0.64
PW        08:00 – 20:00     0.69         0.79              0.61
PW        11:00 – 14:00     0.60         0.55              0.60
PW        21:00 – 24:00     0.54         0.60              0.52
Figure 7 shows the relation between the number of requests and the elapsed time per request of the two selected traces. As we can see, the elapsed time per request of the two traces (LW and PW) depends much more on the proxy cache configuration than on the number of requests.
[Figure 7: two panels plotting, against the hours of the day, the number of requests (solid line, scale x 10^4) and the elapsed time per request in seconds (dashed line).]

Figure 7. Number of Request (solid line), and elapsed time in second / Request (dashed line) of the proxy log LW (right) and PW (left), grouped per hour
In the three busiest hours the mean request response time was 0.9918 seconds/request in LW and 0.7086 seconds/request in PW. The response time affects the arrival pattern of each particular source, but the shape of the elapsed time curve is very similar from one busy day to another.
5.3. Response Size pattern Analysis

In addition to the elapsed response time, we have also studied the size of the requested documents. Table 5 shows the self-similar nature of the response size pattern by means of the $H$ parameter value. As with the arrival pattern, the Hurst parameter increases as the traffic increases.
Table 5. Response size Hurst parameter for different busy periods of the traces

                                    Hurst exponent H
Measure   Session (Hours)   R/S method   Variance method   Period. method
LW        00:00 – 24:00     0.54         0.58              0.52
LW        08:00 – 20:00     0.53         0.59              0.53
LW        11:00 – 14:00     0.57         0.51              0.60
LW        21:00 – 24:00     0.56         0.51              0.51
PW        00:00 – 24:00     0.71         0.85              0.56
PW        08:00 – 20:00     0.61         0.65              0.54
PW        11:00 – 14:00     0.55         0.54              0.59
PW        21:00 – 24:00     0.55         0.56              0.54
Having found self-similar behavior in all the cases analyzed, we studied the hypothetical relation between cache size and the $H$ parameter. For this study, we changed both the memory cache size and the hard disk cache size, in order to find out whether any mathematical relation exists between the cache size and the Hurst parameter. The conclusion was that the self-similar property appears in the same way as the cache size changes, but no direct relation between the cache size and the Hurst parameter was found.
6. Synthesizing Self-Similar Traffic

This section gives an explanation of the self-similar property and builds a model to generate an arrival pattern.

6.1. ON / OFF Sources
Willinger et al. [9] proposed an explanation of the self-similar property observed in Ethernet LAN traffic and designed techniques to model a self-similar workload. The theory, developed by Taqqu and Levy [12], explains that the aggregation of several ON/OFF sources within the system results in self-similar traffic. An individual source can be classified as an ON/OFF source [9] if it alternates ON and OFF periods of highly variable length. That is, the distributions of the ON and OFF period lengths are heavy-tailed with parameters α1 and α2. As described in section 3, a heavy-tailed distribution has infinite variance, and the portion of the distribution in the tail depends on the α parameter.
To explain self-similarity in Web traffic, ON times correspond to the transmission times of individual Web files, and OFF periods correspond to the intervals between transmissions [13].
An example of a heavy-tailed distribution is the Pareto distribution. The general form of the Pareto distribution is

$$F(x) = 1 - \left(\frac{a}{x}\right)^{\alpha},$$

where $a, \alpha \geq 0$ and $x \geq a$. If the ON and OFF period lengths are heavy-tailed, they satisfy the property

$$P[X > x] \sim c\, x^{-\alpha}, \quad \text{as } x \to \infty, \quad \text{with } 1 < \alpha < 2$$

[9] for a period of length $X$. If, in addition, the activity is uniform within an ON period, then aggregating many such sources results in a self-similar process with Hurst parameter

$$H = \frac{3 - \alpha}{2} \qquad (6)$$
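As a small worked illustration, equation (6) can be inverted to obtain α from a target Hurst parameter, and Pareto-distributed period lengths can be drawn by inverting $F(x)$. The function names, the scale $a = 1$, the seed, and the sample size are assumptions of this sketch.

```python
import random


def alpha_from_hurst(h: float) -> float:
    """Invert equation (6): H = (3 - alpha) / 2."""
    return 3.0 - 2.0 * h


def pareto_sample(alpha: float, a: float, rng: random.Random) -> float:
    """Draw from F(x) = 1 - (a/x)^alpha by inverse transform sampling."""
    u = rng.random()                 # uniform on [0, 1)
    return a / (1.0 - u) ** (1.0 / alpha)


rng = random.Random(42)
alpha = alpha_from_hurst(0.8)        # target H of about 0.8
lengths = [pareto_sample(alpha, 1.0, rng) for _ in range(20000)]

# Empirical tail P[X > 2a] compared with the theoretical value 2^(-alpha).
tail = sum(1 for x in lengths if x > 2.0) / len(lengths)
print(round(alpha, 2), round(tail, 3), round(2.0 ** -alpha, 3))
```

For a target $H$ between 0.5 and 1 the resulting α lies between 1 and 2, which is the range required by the heavy-tail property above.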
6.2. Modeling a Source

The properties of ON/OFF source aggregation can be used to develop a simple model that generates a self-similar arrival pattern by aggregating several sources and modeling the ON and OFF period lengths of each source with a Pareto distribution. Parameter α in equation (6) is calculated from the estimated Hurst parameter obtained in section 5. The number of sources, as well as the ON period traffic of each source, is calculated so that the number of accesses in the synthesized workload approaches that of the original one. Figure 8 shows the algorithm used to model every source of the generator, where time is the current simulation time and t_simulation is the total simulation time.

This model generates the arrival pattern. An ideal model should be able to synthesize other attributes of every request, such as the client making the request, the URL, the number of bytes generated, etc. The generator must be able to generate a large number of self-similar arrival pattern workloads, with a behavior similar to that obtained from real traces. Furthermore, by modifying the number of ON/OFF processes, the inter-arrival time, and the Pareto parameters, it must be able to generate a great variety of synthetic workloads to estimate the benefits or disadvantages of several proposed proxy cache configurations.
    While (time < t_simulation)
        Switch (state)
            Case state_OFF:
                Wait next ON state
                state = state_ON
                Time = Time + Pareto(alpha_ON)
                Next state_OFF period = Time
            Case state_ON:
                While time < Next state_OFF period
                    Generate_arrival
                    Wait inter-arrival time
                End_While
                state = state_OFF
                time = Time + Pareto(alpha_OFF)
                Next state_ON period = Time
        End_Switch
    End_While

Figure 8. ON / OFF Source Generator Algorithm
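One possible reading of the source algorithm of Figure 8 is sketched below in Python. It is an interpretation rather than the authors' implementation: the function names, the fixed inter-arrival time, the Pareto scale of 1, the initial OFF period used to desynchronize the sources, and the parameter values in the usage lines are all assumptions of this example.

```python
import math
import random
from typing import List


def pareto(alpha: float, scale: float, rng: random.Random) -> float:
    """Pareto-distributed period length (inverse transform sampling)."""
    return scale / (1.0 - rng.random()) ** (1.0 / alpha)


def onoff_source(t_simulation: float, alpha_on: float, alpha_off: float,
                 rng: random.Random, scale: float = 1.0,
                 inter_arrival: float = 0.1) -> List[float]:
    """Arrival times of one ON/OFF source over [0, t_simulation)."""
    arrivals: List[float] = []
    time = pareto(alpha_off, scale, rng)     # assumed initial OFF period
    state = "OFF"
    next_off = 0.0
    while time < t_simulation:
        if state == "OFF":
            # switch to ON and schedule the end of the ON period
            state = "ON"
            next_off = time + pareto(alpha_on, scale, rng)
        else:
            # emit evenly spaced arrivals until the ON period ends
            while time < next_off and time < t_simulation:
                arrivals.append(time)
                time += inter_arrival
            state = "OFF"
            time += pareto(alpha_off, scale, rng)   # wait out the OFF period
    return arrivals


def aggregate_counts(sources: List[List[float]], t_simulation: float,
                     unit: float = 1.0) -> List[int]:
    """Superimpose all sources and count arrivals per time unit."""
    bins = [0] * int(math.ceil(t_simulation / unit))
    for arrivals in sources:
        for t in arrivals:
            bins[int(t // unit)] += 1
    return bins


rng = random.Random(1)
alpha = 3.0 - 2.0 * 0.8                      # equation (6) with H of about 0.8
clients = [onoff_source(3600.0, alpha, alpha, rng) for _ in range(50)]
counts = aggregate_counts(clients, 3600.0)
print(len(counts), sum(counts))
```

The list `counts` is the synthesized per-second arrival pattern; in this sketch the number of sources and `inter_arrival` are the knobs that adjust the overall request rate.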
In order to validate the model, we have synthesized WWW requests for 50 clients and aggregated them to obtain the overall traffic at the proxy server. The model has been adjusted to generate workloads with the properties of different peak load periods of the LW and PW traces.

Table 6, Figure 9, and Figure 10 show the synthetic Hurst parameter obtained with the generator for the PW trace (period 11:00 – 14:00) with a parameter H ~= 0.8, and for the same period of the LW trace with H ~= 0.7.
Table 6. Hurst parameter for different synthesised workloads

                                Hurst exponent H
Measure   Session (Hours)   R/S method   Variance method
PW        11:00 – 14:00     0.80         0.78
LW        11:00 – 14:00     0.71         0.70
Clearly, the aggregated synthesized traffic is self-similar, with a Hurst parameter close to the one estimated for the real traffic. Furthermore, as mentioned above, we adjust the synthesized traffic to obtain a superimposed request traffic rate close to that of the real traffic. Figure 11 shows the request pattern of the real traffic and the synthesized one.
[Figure 9: two log-log panels. R/S Analysis: log10(R/S) versus log10(n), fitted line 0.717884 x - 0.156740. Variance Time Analysis: log10(Var(Xm)) versus log10(n), fitted line -0.581237 x + 0.884558, Hurst parameter = 0.709382.]

Figure 9. Synthesizing Hurst parameter obtained with the generator for the LW workloads with estimated H ~= 0.7 (period 11:00 – 14:00 h)
[Figure 10: two log-log panels. R/S Analysis: log10(R/S) versus log10(n), fitted line 0.807169 x - 0.257697. Variance Time Analysis: log10(Var(Xm)) versus log10(n), fitted line -0.431301 x + 1.422317, Hurst parameter = 0.784349.]

Figure 10. Synthesizing Hurst parameter obtained with the generator for the PW workloads with estimated H ~= 0.8 (period 11:00 – 14:00 h)
7. Conclusion and Future Work

In this paper we have presented a detailed study of World Wide Web transfers in a distributed proxy cache system. The study shows that this workload presents characteristics that are consistent with self-similarity. We have examined this statistical property for the main patterns that characterize this kind of traffic, such as the arrival pattern, the elapsed request time pattern, and the request size. The results obtained are very encouraging for all the cases studied.
This statistical behavior has been tested using the typical statistical methods found in the open literature: the R/S, variance, and periodogram methods. In addition, we have applied R/S analysis to aggregated series for a more complete analysis. In all the methods used, the Hurst parameter remains inside the permitted interval and depends directly on the network traffic. We have also looked into a possible mathematical relation between the cache size and the Hurst parameter, reaching a negative conclusion.

Having tested the self-similar property, we have verified that it is feasible to generate a self-similar workload using the ON/OFF method. An individual client model with ON/OFF behavior has been developed. By combining several of these synthesized clients, a synthesized arrival generator for the World Wide Web has been implemented. Analysis of the resulting trace gives approximately the same Hurst parameter as that obtained from the original trace.
[Figure 11: six time series panels of request traffic. (a) Time unit = 100 sec; (b) Time unit = 60 sec; (c) Time unit = 20 sec; (d) Time unit = 10 sec; (e) Time unit = 4 sec; (f) Time unit = 1 sec.]

Figure 11. Real proxy cache request traffic of PW trace (period [...] h) (dashed line), versus the synthesized one
The most important conclusion of this paper is that a complete self-similar generator of World Wide Web transfers can be built in order to perform comparison and evaluation studies of proposed proxy system configurations. Such a generator would give us a more flexible tool for producing a wide variety of traces.

As future work, we plan to improve the generator so that it produces traces that include other synthesized parameters, such as the request size, the URL, the clients, and others. We also plan to study whether the WWW workload characterization might be affected by parameters such as country, type of user, or protocol characteristics, which could be factors to include in the generator.
Acknowledgements

The authors would like to express their thanks to Vicente Benet, from the Valencia Polytechnic Data Process Center, who supplied us with the traces used in this study and helped us with the Squid program tool.
8. References

[1] W. E. Leland, M. S. Taqqu, W. Willinger, D. V. Wilson, “On the Self-Similar Nature of Ethernet Traffic,” IEEE/ACM Transactions on Networking, Vol. 2, 1994.

[2] “Squid Internet Object Cache,” http://squid.nlanr.net/.

[3] N. D. Georganas, “Self-similar (“fractal”) traffic in ATM networks,” Proceedings of the 2nd International Workshop on Advanced Teleservices and High-Speed Communications Architectures (IWACA '94), pp. 1-7, 1994.

[4] J. Beran, R. Sherman, M. S. Taqqu and W. Willinger, “Long-range dependence in variable-bit-rate video traffic,” IEEE Transactions on Communications, Vol. [...], pp. 1566-79, 1995.

[5] V. Paxson and S. Floyd, “Wide-Area Traffic: The Failure of Poisson Modeling,” IEEE/ACM Transactions on Networking, 3(3), pp. 226-244, 1995.

[6] M. E. Crovella and A. Bestavros, “Explaining world wide web traffic self-similarity,” Tech. Rep. TR-[...], Computer Science Department, Boston University, 1995.

[7] W. Willinger, M. S. Taqqu, R. Sherman and D. V. Wilson, “Self-Similarity Through High-Variability: Statistical Analysis of Ethernet LAN Traffic at the Source Level,” IEEE/ACM Transactions on Networking, Vol. 5, pp. 71-86, 1997.

[8] B. B. Mandelbrot, “The Fractal Geometry of Nature,” Freeman, New York, 1983.

[9] W. Willinger, M. S. Taqqu, R. Sherman and D. V. Wilson, “Self-Similarity Through High-Variability: Statistical Analysis of Ethernet LAN Traffic at the Source Level,” IEEE/ACM Transactions on Networking, Vol. 5, pp. 71-86, 1997.

[10] P. B. Danzig, R. S. Hall, and M. F. Schwartz, “A Case for Caching File Objects Inside Internetworks,” Proceedings of SIGCOMM '93, pp. 239-248, 1993.

[11] D. Ferrari, G. Serazzi, and A. Zeigner, “Measurement and Tuning of Computer Systems,” Prentice Hall, 1983.

[12] M. Taqqu and J. Levy, “Using renewal processes to generate long-range dependence and high variability,” Dependence in Probability and Statistics, E. Eberlein and M. Taqqu, Eds., pp. 73-89, Boston, MA, 1986.

[13] M. E. Crovella and A. Bestavros, “Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes,” IEEE/ACM Transactions on Networking, Vol. 5, pp. 835-845, 1997.

[14] S. D. Gribble, G. S. Manku, E. A. Brewer, “Self-similarity in high-level file system: Measurement and Application,” ACM SIGMETRICS '98, Madison, Wisconsin, June 1998.