The Pennsylvania State University
The Graduate School
SYSTEMS INFRASTRUCTURE FOR SUPPORTING UTILITY
COMPUTING IN CLOUDS
A Dissertation in
Computer Science and Engineering
by
Byung Chul Tak
© 2012 Byung Chul Tak
Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy
May 2012
The dissertation of Byung Chul Tak was reviewed and approved∗ by the following:
Bhuvan Urgaonkar
Associate Professor of Computer Science and Engineering
Dissertation Advisor, Chair of Committee
Anand Sivasubramaniam
Professor of Computer Science and Engineering
Trent Jaeger
Associate Professor of Computer Science and Engineering
Qian Wang
Associate Professor of Mechanical and Nuclear Engineering
Rong N. Chang
Research Staff Member & Manager at IBM Research
Special Member
Raj Acharya
Professor of Computer Science and Engineering
Head of the Department of Computer Science and Engineering
∗Signatures are on file in the Graduate School.
Abstract
The recent emergence of cloud computing is considered an important enabler of the long-cherished paradigm of utility computing. Utility computing represents the desire to have computing resources delivered, used, paid for, and managed with assured quality, similar to other commoditized utilities such as electricity. The principal appeal of utility computing lies in the systematized framework it creates for the interaction between providers and consumers of computing resources. While current clouds are undoubtedly making progress towards this goal, they lack some of the crucial features necessary to realize a mature utility. First, one foundational feature of a utility is the ability to accurately measure and manage the usage of its resources by its various consumers. In modern VM-based cloud platforms, providers of cloud services face significant difficulties in obtaining an accurate picture of resource consumption by their consumers. Second, consumers of a utility expect to have systematic ways to infer their resource needs so that they can make cost-effective resource procurement decisions. However, current consumers of clouds are ill-equipped to make their resource procurement decisions because of a lack of information regarding resource quantities and their implications for application performance.
In the first part of the dissertation, we consider provider-side issues of resource accounting. It is nontrivial to correctly apportion the usage of a shared resource in the cloud to its various users. Towards achieving an accurate understanding of the overall resource usage within the cloud, we develop a technique for dynamically discovering the various resources that are directly or indirectly being used by a consumer's application. This, in turn, enables us to build techniques for accurately accounting the resource usage. The benefits of our approach are explained by comparing with, and illustrating the problems of, state-of-the-art methods in resource accounting. In the next part of the dissertation, we focus on the problem of understanding the cost of application deployments to the cloud from the consumer's perspective. Employing empirical approaches to estimate the resource requirements of the target application, we present how to systematically incorporate important systems characteristics, such as workload intensity, growth, and variance, as well as a comprehensive set of hosting options, into determining the economic feasibility of application deployment in the cloud.
Table of Contents

List of Figures

List of Tables

Acknowledgments

Chapter 1  Introduction
    1.1  Motivation
    1.2  Scope and Outline of Dissertation
        1.2.1  Provider-end Resource Accounting
            1.2.1.1  Dependency Discovery
            1.2.1.2  Resource Usage Inference
        1.2.2  Consumer-end Decision Making

Chapter 2  Related Work
    2.1  Provider-end Resource Usage Inference and Accounting
        2.1.1  Statistical Inference-based Technique
        2.1.2  System-dependent Instrumentation
        2.1.3  Resource Accounting
    2.2  Consumer-end Decision Making

Chapter 3  Provider-end Dependency Discovery
    3.1  Introduction and Background
    3.2  Dependency Discovery: Problem Statement and Requirements
    3.3  Inadequacy of Existing Approaches
    3.4  Proposed Solution: vPath
        3.4.1  Design and Implementation of our Dependency Discovery Technique
            3.4.1.1  Implementation
        3.4.2  Applicability to Other Software Architectures
        3.4.3  Usefulness of Proposed Solution
    3.5  Evaluation
        3.5.1  Applications
        3.5.2  Overhead of vPath
        3.5.3  Dependency Discovery for vApp
        3.5.4  Dependency Discovery for TPC-W
        3.5.5  Dependency Discovery for RUBiS and MediaWiki
        3.5.6  Discussion on Benchmark Applications
    3.6  Summary and Conclusion

Chapter 4  Provider-end Resource Usage Inference
    4.1  Introduction
        4.1.1  Usefulness of Resource Accounting Information
    4.2  Problem Definition
    4.3  Our Approaches
    4.4  Design and Implementation: sAccount
        4.4.1  Local Monitoring
            4.4.1.1  Identification of S and T
            4.4.1.2  Identifying Resource Principals & Scheduling Events
        4.4.2  Collective Inference
        4.4.3  Implementation of sAccount
    4.5  Evaluation
        4.5.1  Accounting Accuracy for a Synthetic Shared Service
        4.5.2  Accounting for Real-world Services
            4.5.2.1  Clustered MySQL as the Shared Service
            4.5.2.2  HBase as the Shared Service
    4.6  Summary and Conclusion

Chapter 5  Consumer-end Decision Making
    5.1  Introduction
    5.2  Background
        5.2.1  Net Present Value
        5.2.2  Cost Components
        5.2.3  Application Hosting Choices
        5.2.4  Determining Hardware Size Requirement
            5.2.4.0.1  In-house Provisioning
            5.2.4.0.2  Cloud-based Provisioning
    5.3  Analysis
        5.3.1  Data Transfer, Vertical Partitioning
        5.3.2  Storage Capacity, Software Licenses
        5.3.3  Workload Variance and Cloud Elasticity
        5.3.4  Horizontal Partitioning
        5.3.5  Indirect Cost Components
    5.4  Summary and Conclusion

Chapter 6  Conclusion and Future Work

Bibliography
List of Figures

1.1  Cloud-based hosting of a consumer's application through the Virtual Cluster interface. Virtual Cluster is a generalized version of the interface exposed to the consumer, through which they can specify the desired quantities of the various resources they need, such as computing power, storage, and networks. The figure shows one example mapping of virtual components to actual physical resources, as determined by the provider's management algorithms.

1.2  N-to-m relationship between consumers and providers. Each consumer has their own type of application they wish to deploy in the cloud. Many questions, as noted in the figure, can arise on the consumer side. Providers offer various hosting choices to the consumers, all with different sets of virtual resources and different pricing and performance characteristics.

3.1  Example deployment of VCs from chargeable entities CEA and CEB. The two VCs share the database server instance. This sharing is transparent to the chargeable entities, since deciding who shares the service instance is the provider's decision. The abstractions of "Set of Used Servers" and "Resource Accounting Tree" are also labeled.

3.2  The principal idea behind how our proposed solution finds causality.
3.3  Multi-threaded server architecture.
3.4  Event-driven and SEDA model architectures.
3.5  The topology of the TPC-W benchmark set-up.
3.6  The topology of vApp used in evaluation.
3.7  TPC-W response time and CPU utilization.
3.8  Examples of vApp's paths discovered by vPath. The circled numbers correspond to VM IDs in Figure 3.6.
3.9  CDF of vApp's response time, as estimated by vPath and as actually measured by the vApp client.
3.10  Typical paths discovered by the vPath technique.
4.1  A portion of a platform that hosts two applications, each a CE, and the servers hosting their components. Arrows indicate communication between components. Also shown is a shared service, a database used by both CEs. The shared service itself consists of multiple software components, some of which are exercised by the CEs indirectly (e.g., the "Data Store"), i.e., via requests made to other components (e.g., the "Front-end").

4.2  Overall architecture of the sAccount implementation.

4.3  Solution concept. The start and end of CPU accounting are determined by the arrival of request messages and the departure of response messages. As thread_x of VM2 sends a message to thread_A of VM1, VM1 starts to account the CPU usage of thread_A to CE1. This binding stops when thread_A sends the reply back. CPU usage of thread_B is not charged to CE1 in between. This requires us to be able to detect thread scheduling events.

4.4  Design and configuration of our synthetic shared service and the CEs exercising it.

4.5  Impact of burstiness and shared-service resource utilization on the accuracy offered by sAccount versus LR. We use our synthetic shared service along with three chargeable entities and compare the percentage error in CPU accounting for sAccount and LR. We label the errors for our three chargeable entities with LR as CE1, CE2, and CE3, respectively, and label their average as "LR Average." In all cases, the accounting information offered by sAccount shows less than 1% error (we plot the average error for the three CEs).

4.6  Impact of caching and the number of CEs on the accuracy of resource accounting of (a) network and (b) CPU for our synthetic shared service. The number of CEs is three in (a). We plot the average error across all CEs and the standard deviation.

4.7  Shared MySQL cluster setting. Three CEs, labeled CE1, CE2, and CE3, share this database service.
4.8  Network traffic and individual CPU utilization time series. Graph (a) shows the network traffic exchanged between SN and each of our three CEs, and forms part of the input to LR. Graph (b) presents the CPU usage at SN induced by each of the CEs when it runs separately from the other CEs, as part of the offline profiling that we do. These usages serve as our estimate of the ground truth for the CPU usage each of the CEs induces in the actual experiment. The resource accounting results of LR and sAccount should be compared with (b) to see how far they are from this estimated ground truth.

4.9  Comparison of CPU accounting results. The CPU usage of the MySQL Cluster SQL node is being accounted. In (a) the accounting starts at time 200 since LR needs to collect some amount of data. By comparing the areas of equivalent color we can see the rank order determined by each technique as well as the accuracies. Compare with Figure 4.8(b) to see the true CPU consumption. The result of sAccount includes the 'unaccountable' portion, which can be divided among chargeable entities by any reasonable policy.

4.10  Response time change of the RUBiS application. This graph shows the development of RUBiS response time for two cases: throttled by LR, and controlled by sAccount. Since LR picks the wrong CE (CE3) as the culprit for performance degradation, throttling the request rate of CE3 is ineffective. In contrast, the sAccount technique shows noticeable effects. The moving-average response time under sAccount indicates that sAccount can contain the performance interference.

4.11  Our setup for using HBase as a shared service. Our CEs are based on client programs that use the YCSB workload generator.

4.12  Evolution of the network traffic incoming to the region server from the two CEs during the run. Both CEA and CEB start out by sending similar requests to HBase during t=0 to t=100s, implying that the network bandwidth and CPU usage of the region server should be accounted equally to them for this period. CEA changes its behavior at t=100s, whereas CEB changes its behavior at t=200s.

4.13  Profiling measurement of CPU and network resources at the region server for CEA and CEB. These are obtained from individual (not combined) runs of each chargeable entity. They are intended to serve as an estimate of the true resource usage quantities when analyzing the resource accounting results.
4.14  Comparison of resource accounting results between LR and sAccount on the outbound network traffic from the data node to the region server. The contour of (a), drawn in a thick line, is obtained via iptables. Note the resemblance of this to the overall traffic size of (b), as a quick sanity check of the sAccount technique.

4.15  Resource accounting by sAccount at various nodes of HBase. Result (b) demonstrates sAccount's capability to perform resource accounting multiple hops away from the front-end of the shared service. Note that the data node of HBase does not communicate directly with any of the chargeable entities.

5.1  Taxonomy of costs involved in in-house and cloud-based application hosting. Costs can be classified according to quantifiability and directness. Quantifiable costs are grouped into material, labor, and expenses. The material category roughly corresponds to capital expenses (Cap-Ex); the labor and expenses categories correspond to operational expenses (Op-Ex). In this study we focus on the quantifiable category.

5.2  Marginal throughput measurements. Both graphs show how much throughput is gained by adding one more unit of resource. For the JBoss server (a), we observe the marginal gain from adding one more single-core server. For MySQL (b), we add one more CPU core and observe the marginal gain.

5.3  EC2 instance CPU microbenchmark results. Graph (a) compares the latency of finishing the same number of arithmetic operations; roughly, the CPU of an EC2 instance is one third of our reference machine. Graph (b) shows that the distribution of EC2 CPU bandwidth is bimodal.

5.4  NPV over a 10-year time horizon for TPC-W. We consider three different workload intensities: small (20 tps at t=0), medium (100 tps), and high (500 tps). We also consider two different workload growth rates of 0% and 20% per year.

5.5  Closer look at the cost components for four cloud-based application deployment options in the 5th year. The initial workload is 100 tps and the annual growth rate is 20%.

5.6  Cost breakdown of TPC-E in the 6th year.

5.7  Two sets of TPC-E results at initial workloads of 300 tps and 900 tps.

5.8  Effect of workload variance on the cost of in-house hosting for TPC-W. The workload is 100 tps and the growth rate is 20% per year.
5.9  Time series x_t and workload distribution f(x) for a fixed τ.

5.10  Cost behavior of horizontal partitioning as a function of the varying threshold value. Although not shown in the graph, at PAR=11 (at 5.5K on the x-axis) the cost is $625K.

5.11  Impact of labor cost for the medium workload intensity using TPC-W. The cases for small and large workload intensities are not shown: for small workloads, cloud-based hosting is always cheaper; for large workloads, in-house hosting is always cheaper.

5.12  (a) Effect of labor costs on the hosting decision in relation to workload intensity and business size. (b) Stacked view of the in-house cost at year 5.
List of Tables

1.1  Summary of the problems addressed in the following chapters.

3.1  Response time and throughput of TPC-W. "App Logging" represents a log-based tracking technique that turns on logging in all tiers of TPC-W.

3.2  Performance impact of vPath on RUBiS.

3.3  Worst-case overhead of vPath and breakdown of the overhead. Each row represents the overhead of the previous row plus the overhead of the additional operation on that row.

4.1  Usage pattern of AWS shared services. The total number of applications is 120. RDS in AWS is not a shared service as we define it here, since consumers own separate VM instances. 'Custom DB' means that the user has installed their own database within EC2 instances.

4.2  Description of how the workload imposed by the three CEs is varied over the course of our experiment with the MySQL cluster as our shared service.

5.1  Labor cost per server. We differentiate three different ratios of IT staff to servers. The average cost per server per hour (3rd column) is based on an average IT staff salary of $33/h.
Acknowledgments
This dissertation could not have been written without support from many people. First of all, I am deeply indebted to my dissertation advisor, Professor Bhuvan Urgaonkar, for his guidance. His insights and advice were most helpful in shaping and developing this dissertation. I would also like to thank my doctoral committee members, Professor Anand Sivasubramaniam, Professor Trent Jaeger, Professor Qian Wang, and Dr. Rong N. Chang, for accepting the role of committee members and providing useful feedback. I feel fortunate to have spent three summers at IBM T.J. Watson Research Center as an intern under the supervision of Dr. Rong N. Chang. It was a wonderful experience to be able to work with Dr. Chunqiang Tang and Dr. Chun Zhang. I would also like to express thanks to the CSL members: Shiva Chaitanya, Sriram Govindan, Arjun Nath, Dharani Sankar Vijayakumar, Niranjan Soundara, Youngjin Kwon, Aayush Gupta, Di Wang, Chuangang Ren, Jeonghwan Choi, Euiseong Seo, Srinath Sridharan, Cheng Wang, and Bo Zhao. They have been wonderful colleagues and friends, and I have really enjoyed working and spending time with them.
I cannot express my gratitude enough to my family for their unconditional love and support over such a long period of time. I would like to thank my parents for their encouragement. I am also grateful to my father-in-law and mother-in-law for their support. My gratitude extends to my sister, my sister-in-law, and their newborn son as well. Finally, and most importantly, I deeply thank my wife, Sekyoung Huh, for being supportive and encouraging all along without losing faith in me. Without her support, I cannot imagine how I could have finished this dissertation.
Chapter 1

Introduction
1.1 Motivation
Cloud computing has emerged as a novel model for IT hosting, impacting many sec-
tors ranging from industry to academia. Several industry giants and mid-size/small
enterprises as well as academic/research institutions are conducting or considering
major restructuring of their IT infrastructure to avail themselves of cloud computing offerings.
Among these cloud offerings are those based on infrastructure-as-a-service (e.g.,
Amazon’s Elastic Compute Cloud (EC2) [1] and Simple Storage Service [2], Sun
Grid [3], Rackspace [4]), software/platform-as-a-service (e.g., Microsoft Azure [5],
Google App Engine [6]), and a number of others. As is often the case with a
newly emergent technology, different views exist on the meaning and scope of
cloud computing [7, 8]. Despite the difficulty this poses, the defining feature of
cloud computing, cross-cutting all these views, is easily identified as the separation
it offers between the usage and management of IT infrastructure. In its most basic
form, cloud computing can be viewed as creating two distinct sets of entities, the
provider and the consumer of IT. The provider owns and manages IT resources,
thereby relieving the consumer of these responsibilities and allowing them to fo-
cus solely on using these resources. Myriad views of cloud computing that have
emerged can be seen as different takes on the details of how the partitioning of
these responsibilities occurs and what the interfaces offered to the consumers are.
Regardless of these different kinds of offerings, the growing popularity of cloud
computing, slated to continue its momentum based on many indicators [9, 10], can
be attributed to two main reasons:
• The growing complexity of managing IT, and the ensuing need for special-
ization, is claimed to have rendered out-sourcing of many IT management
responsibilities to experts desirable and cost-effective. This trend has ex-
panded to apply to a wide spectrum of consumers ranging from vendors of
Internet-scale services and managers of enterprise/department-scale IT en-
vironments to even individual users. As an example of the latter, Jeff Barr,
Senior Manager of Cloud Computing Solutions at Amazon, reported that at the
end of 2011 there were 762 billion objects stored in Amazon S3, a growth of
192% over the previous year, with a peak processing rate of over 500,000
requests per second.
• During the past decade, we have witnessed a proliferation of large data
centers with substantial investments in IT infrastructure. The resulting
economies-of-scale, gains derived from statistical multiplexing, and concen-
tration of expertise imply that the owners of such facilities are well-positioned
to take up the role of providers of cloud-hosted services of various kinds.
The Problem Cloud computing is viewed by many as one promising realization
of the long-cherished utility computing paradigm [11, 12, 13]. Utility computing
represents the desire to have IT acquired, delivered, used, paid for and managed
in a similar way we use other commoditized utilities such as electricity, telephone
service, cable television, etc. The principal appeal of utility computing lies in the
systematized framework it could create for the interaction between providers and
consumers of IT resources.
Current cloud computing offerings lack some of the crucial features necessary
to realize a mature utility. First, one of the foundational features of a utility is its
ability to accurately measure and charge the allocation and usage of the commodity
being exchanged between a provider and a consumer: we refer to this as account-
ing of resources. Resource accounting may be crucial to a provider for conducting
accurate billing. It also provides valuable information for a variety of resource allo-
cation and tuning decisions. However, it is well-known that resource accounting is
non-trivial even within a single server that consolidates multiple applications [14].
Often the workloads emerging from different user-level software entities are mul-
tiplexed in complex ways. It is often unclear how to “charge” different user-level
applications for the resources used on their behalf by other software including
systems software (e.g., operating system or virtual machine monitors), or runtime
(e.g., garbage collector). Similarly, at a larger scale, accounting the resource
usage of shared services (e.g., a shared file system or a shared database server)
across a group of distinct resource-consuming entities is not a straightforward
problem. In particular, resource virtualization and the complex, distributed
nature of modern consumer applications make it non-trivial. One major challenge
of performing resource accounting stems from the fact that the granularity at
which resource consumption is measured within the system does not match the
granularity at which computing resources are consumed. This implies that we need
to process and transform the measured resource usage data into the desired
granularity, a transformation in which accuracy may be lost if certain conditions
are not met. While problems very closely related to
have shortcomings in their generality and accuracy. Note that accountability has
other connotations (e.g., guarantees related to executing the right software [25])
and we only use it in the specific sense related to resource allocation described
above.
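As a toy illustration of this granularity mismatch, consider a server whose CPU time is measured only in aggregate while charges must be apportioned per chargeable entity. The sketch below is our own illustration, not the dissertation's mechanism; the function name, interval data, and proportional policy are all hypothetical:

```python
# Illustrative sketch: apportioning a coarse-grained CPU measurement among
# chargeable entities (CEs) in proportion to the time intervals during which
# each CE's requests were being serviced. All names and numbers are made up.

def apportion_cpu(total_cpu_seconds, service_intervals):
    """Split an aggregate CPU measurement across CEs in proportion to the
    time each CE's requests occupied the server.

    service_intervals: dict mapping CE name -> list of (start, end) tuples.
    """
    # Total service time attributable to each CE.
    busy = {ce: sum(end - start for start, end in ivals)
            for ce, ivals in service_intervals.items()}
    total_busy = sum(busy.values())
    if total_busy == 0:
        return {ce: 0.0 for ce in service_intervals}
    # Charge each CE its proportional share of the measured CPU time.
    return {ce: total_cpu_seconds * t / total_busy for ce, t in busy.items()}

intervals = {
    "CE1": [(0.0, 0.4), (1.0, 1.2)],   # 0.6 s of service time
    "CE2": [(0.4, 1.0)],               # 0.6 s
    "CE3": [(1.2, 2.0)],               # 0.8 s
}
shares = apportion_cpu(total_cpu_seconds=1.5, service_intervals=intervals)
```

Even this trivial policy requires knowing which CE the server was working for at each instant, which is exactly the information that is lost when only aggregate measurements are available.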
Second, as cloud-based offerings mature, the interfaces exposed to consumers
grow richer, and multiple providers compete with each other, we envision a cloud-
based utility where consumers would desire being able to conduct complex decision-
making, involving trade-offs between the cost it spends towards procuring re-
sources and the value it receives. If desirable, a consumer might procure resources
from multiple providers. Additionally, it might dynamically modulate this set of
providers as well as the resources it procures from them. Such decision-making
is exercised, to different degrees of complexity, by consumers of existing utilities.
Examples range from (a) the extremely simple choice between multiple providers
and monthly subscription plans for the telephone service to (b) the more complex
case of certain electricity consumers being able to adapt (e.g., via re-scheduling
of tasks) their usage to time-varying power prices that are exposed to them by
their electricity provider [26]. How could a consumer of the IT utility navigate a
decision-space comprising such trade-offs? Specifically, it should be able to incor-
Figure 1.1. Cloud-based hosting of a consumer's application through the Virtual Cluster interface. Virtual Cluster is a generalized version of the interface exposed to the consumer, through which they can specify the desired quantities of the various resources they need, such as computing power, storage, and networks. The figure shows one example mapping of virtual components to actual physical resources, as determined by the provider's management algorithms.
porate into its decision-making multiple features affecting its cost versus revenue,
especially, the multiple hosting options that open up in a cloud-based utility envi-
ronment: in-house IT hosting, cloud-based hosting, and myriad “hybrid” hosting
options spanning in-house and cloud. In this dissertation, we develop techniques
and underlying enabling systems mechanisms to address these problems.
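To make the flavor of such decision-making concrete, consider the net present value (NPV) comparison used later in Chapter 5 to weigh hosting options. The sketch below is purely illustrative; the cost figures and the 5% discount rate are invented for this example:

```python
# Hypothetical sketch of a consumer's hosting-cost comparison: NPV of yearly
# costs under a discount rate, for in-house vs. cloud-based hosting.

def npv(cash_flows, discount_rate):
    """NPV of a list of yearly costs; year 0 is not discounted."""
    return sum(c / (1 + discount_rate) ** t for t, c in enumerate(cash_flows))

# Made-up 10-year cost profiles: in-house has a large up-front purchase and
# lower upkeep; cloud is a steady pay-as-you-go charge.
in_house = [200_000] + [50_000] * 9
cloud = [80_000] * 10

print(f"in-house NPV: ${npv(in_house, 0.05):,.0f}")
print(f"cloud NPV:    ${npv(cloud, 0.05):,.0f}")
```

Under these invented numbers, in-house hosting comes out cheaper; a different workload growth rate or pricing model could easily flip the conclusion, which is precisely why systematic decision support is needed.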
Current cloud-based offerings come in a variety of forms. For example, the
“Infrastructure as a Service” (IaaS) model exposes virtualized hardware (e.g., in
its most general form, a virtualized data center) on which the consumer can run
its entire software stack (including applications and systems software) [1, 27, 4, 3].
The consumer is allowed partial control over resource management via these virtual
resources, which are securely multiplexed with those of other consumers on the
provider’s infrastructure. The “Software as a Service” (SaaS) model, on the other
hand, only allows the consumer access to certain software hosted by the provider,
which collaborates with the other components of the consumer's application, along
with the facility to run this application on the provider's infrastructure; the
consumer is provided little
exposure to and participation in resource management decisions [6]. Between SaaS
and IaaS lies the “Platform as a Service” (PaaS) model, in which consumers are
given tools and APIs they must use to write their applications. In PaaS, consumers
can partially control some aspects of resource management by specifying options
such as how much to replicate and how much storage to use, etc. [5]. Another
distinction worth noting is that between public and private clouds: the former
is intended as a general utility, while the latter is meant to cater to a
restricted set of consumers, such as those within an organization. However, the
problems we address in this dissertation are not limited to any specific type of
cloud and are equally relevant to both.
We assume a model of provider-consumer interaction that is general enough
to encompass these various options. Figure 1.1 provides an overview of vari-
ous relevant entities and abstractions using an illustrative consumer application
being hosted by a cloud-based provider. We assume that underlying physical
infrastructure of the cloud being managed by the provider is a state-of-the-art
data center. Such a data center consists of a large cluster of high-end servers
with multi-core processors, several GB of memory, and some local storage inter-
connected by a high bandwidth network for communication. In addition, these
servers are also connected to a consolidated high capacity storage device/utility
through one or more SANs, each of which facilitates data sharing and migration
of applications between servers without explicit movement of data. These servers
are connected via some gateway to the Internet to service end-user requests from
clients of consumer applications. We assume that the provider implements dis-
tributed resource management software spanning its physical infrastructure, which
is responsible for securely partitioning and multiplexing resources among appli-
cations belonging to different consumers; such mechanisms have been researched
extensively [28, 29, 30, 31, 32, 33]. Current providers offer numerous ways for a
consumer to specify the resource needs of its application. We expect these interfaces
to become richer and more expressive as the cloud computing model becomes
more popular and competition between providers grows. In this dissertation, we
assume that the provider exposes a very general interface, called a virtual cluster
(VC), via which a provider allows a consumer to specify its resource needs. A
VC consists of: (i) a set of virtual servers, (ii) a virtual network amongst them
as well as connectivity to the outside world, and (iii) virtual shared storage and
bandwidth to it. Virtual servers come with their CPU, main memory, and lo-
cal secondary storage specifications. The owner of a VC becomes what we call
a chargeable entity. The chargeable entity is a portion or a collection of software
applications hosted by the cloud provider whose resource usage needs to be tracked
and accounted separately from other such entities. Often it would be the same as
a consumer application or the owner of them.
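The VC abstraction described above can be pictured as a small data structure. The following is an illustrative sketch only; the class and field names are ours, not an actual provider API.

```python
from dataclasses import dataclass

# Hypothetical encoding of a virtual cluster (VC) specification: (i) virtual
# servers, (ii) virtual network, (iii) virtual shared storage and its bandwidth.
@dataclass
class VirtualServer:
    vcpus: int          # CPU specification
    memory_gb: int      # main memory
    local_disk_gb: int  # local secondary storage

@dataclass
class VirtualCluster:
    servers: list                  # (i) set of virtual servers
    network_gbps: float            # (ii) virtual network bandwidth
    shared_storage_gb: int         # (iii) virtual shared storage
    storage_bandwidth_gbps: float  # ... and bandwidth to it

vc = VirtualCluster(
    servers=[VirtualServer(4, 16, 100), VirtualServer(2, 8, 50)],
    network_gbps=10.0,
    shared_storage_gb=1000,
    storage_bandwidth_gbps=4.0,
)
total_vcpus = sum(s.vcpus for s in vc.servers)  # 6
```

The owner of such a `VirtualCluster` object would be the chargeable entity whose usage the provider must track.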
We use a generic consumer application hosted within such a data center, as
shown in Figure 1.1, to drive our discussion. In general, a consumer may choose to
partition components of its distributed application across multiple such providers
(not shown in the figure). The execution of this application involves the partici-
pation of: (i) software components of the application itself, as well as (ii) certain
provider-owned software components securely shared by the provider’s resource
management software among hosted applications (e.g., shared relational database
service [5, 34], key-value store service [35, 36, 37]). A consumer is assumed to sup-
ply the provider with data and executables needed by its application. The provider
would multiplex its physical resources among VCs for various hosted applications
in ways that optimize some metric meaningful to it while adhering to certain con-
straints, an extensively-studied problem [38, 39, 29, 30, 40, 28, 41, 42, 43, 33].
Such multiplexing would be constrained by the nature of the guarantees provided
to the consumers as well as the properties of their workloads and resource needs.
We assume a general enough billing model that depends both on the resource al-
location sought by the application and on its actual usage. Such a billing scheme
implies an incentive for the consumer to estimate well its resource needs, and also
to control well its usage of actual resources via various knobs available to it. At
the same time, it also captures the incentive that the provider has to efficiently
utilize and allocate its physical resources.
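A billing model of this kind might be sketched as follows. The linear form and all rates are hypothetical, chosen only to illustrate the allocation-plus-usage structure assumed above.

```python
# Illustrative billing function combining an allocation-based charge with a
# usage-based charge. The rates and the linear combination are assumptions,
# not any provider's actual pricing.
def bill(alloc_hours, usage_units,
         rate_alloc=0.10,   # $ per allocated instance-hour (assumed)
         rate_usage=0.02):  # $ per unit of measured resource usage (assumed)
    """Total charge = allocation component + actual-usage component."""
    return alloc_hours * rate_alloc + usage_units * rate_usage

# A consumer that over-provisions pays for idle allocation; one that controls
# its actual usage well pays less, so both estimation and control matter.
cost = bill(alloc_hours=24, usage_units=100)  # 24*0.10 + 100*0.02 = 4.4
```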
Taking a more expansive view, the interaction between multiple providers and
consumers takes the form of the many-to-many relationship as shown in Figure 1.2.
Consumers have different types of applications they wish to host in the cloud and/or
in their in-house facilities, including possibly private clouds (depicted as three different
applications in the figure), and the cloud providers have offerings with different
pricing policies and performance characteristics.

[Figure 1.2: diagram omitted; its recoverable labels include the consumer-side questions "Migrate to Cloud, or use In-house?", "Which Cloud to choose?", and "Does cloud deliver sufficient performance?"]

Figure 1.2. N-to-m relationship between consumers and providers. Each consumer has
their own type of applications they wish to deploy in the cloud. Many questions, as noted
in the figure, can arise on the consumer side. Providers offer various hosting choices to
the consumers. They all have different sets of virtual resources with different pricing and
performance characteristics.
be interested in maximizing their own utility functions. For consumers, the utility
function might be defined in terms of factors such as usage costs, performance
and/or service convenience levels. Consumers will try to collect relevant information
about available providers and take the necessary actions to maximize this
function. For example, they may choose to alternate between several clouds regularly
if there are certain time periods (e.g., within a day cycle) in
which one cloud's performance is significantly better or one cloud charges a lower
rate than the others. One consumer strategy might be to use a cloud service only when
the workload exceeds a certain level. On the other hand, the provider's utility function
might be defined in terms of its revenue and costs (which determine its profit).
Providers may try to draw more consumers into their service in order to increase
revenue. They may also apply various systems optimizations to minimize their
operational and capital expenditures.
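The threshold-based strategy mentioned above can be sketched as follows. The capacities, prices, and the greedy per-hour provider choice are all hypothetical, for illustration only.

```python
# Sketch of one consumer strategy: serve load in-house up to a capacity
# threshold, and burst the remainder to whichever cloud is cheapest in
# that period. All numbers below are made up.
def plan(workload_by_hour, inhouse_capacity, cloud_prices):
    """Return a (cloud_choice_or_None, overflow) pair per hour."""
    schedule = []
    for hour, load in enumerate(workload_by_hour):
        overflow = max(0, load - inhouse_capacity)
        if overflow == 0:
            schedule.append((None, 0))
        else:
            # pick the provider with the lowest price in this hour
            cheapest = min(cloud_prices, key=lambda p: cloud_prices[p][hour])
            schedule.append((cheapest, overflow))
    return schedule

prices = {"cloudA": [0.10, 0.30], "cloudB": [0.20, 0.15]}  # $/unit-hour
result = plan([50, 120], inhouse_capacity=100, cloud_prices=prices)
# hour 0 fits in-house; hour 1 bursts 20 units to cloudB (cheaper that hour)
```

A real consumer would of course fold migration overheads and data-transfer costs into such a decision, which is part of what Chapter 5 studies.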
In one possible evolution of these relationships, being explored by some re-
searchers [44, 45], each provider and consumer is a “selfish” agent, interested in
optimizing its own utility/satisfaction. Other possibilities also exist for how this
Chapter    Addressed Problem
Chapter 3  Dependency discovery problem: the problem of discovering causal dependencies established via message exchanges between application components.
Chapter 4  Resource accounting problem: the problem of determining the accurate resource usage of participating entities or groups of entities, called chargeable entities.
Chapter 5  Consumer-side application deployment decision problem: the problem of selecting the most cost-effective cloud-deployment option for consumer applications.

Table 1.1. Summary of the problems addressed in the following chapters.
“cloud world” might evolve, such as a subset of partially cooperating providers.
Similarly, more sophisticated pricing schemes might evolve in the future than the
current instance usage based billing. Amazon EC2 already offers spot pricing for
some of its instances, for example [46]. Whereas the exact resource management
techniques desirable for providers and consumers would closely depend on how the
interactions between providers and consumers evolve, the two problems we choose
to address in this dissertation are universally applicable and, hence, a useful set of
contributions to cloud-based utility computing.
1.2 Scope and Outline of Dissertation
In order for the vision of utility computing to be realized through cloud computing
model described above, problems on both the provider and consumer sides must be
addressed. For the provider, resource accounting capabilities must be significantly
improved over the current state of the art. For the consumer, an intelligent decision-
making framework for cloud-based application deployment must be established.
Toward these ends, this dissertation studies and develops systems facilities with
supporting mechanisms. We describe these below, along with our approach for
addressing them. Table 1.1 summarizes the problems addressed in each chapter of
this dissertation.
1.2.1 Provider-end Resource Accounting
As a solution for the provider-end problem, we design and build techniques that
can deliver the desired level of correctness and accuracy in resource accounting.
We formulate our solution to the resource accounting problem as the construction
of two data structures. The first is what we call the Set of Used Servers, which
captures which server nodes are consuming resources on behalf of each chargeable
entity at a certain time granularity. We call this relationship, formed by resource
consumption, a dependency. The second is the Resource Accounting Tree, which
captures which chargeable entities are consuming how much of each resource within
a server node; it, too, is a time-varying data structure. More detailed explanations
and illustrations of these data structures are given in Chapter 3. We obtain the
final resource usage information by combining the information from these two data
structures.
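In outline, the combination of the two data structures might look as follows. The dictionary layout and all numbers are illustrative only; they merely show that a chargeable entity's total usage of a resource is the sum of its attributed usage across the servers in its Set of Used Servers.

```python
# Hedged sketch: for chargeable entity c and resource r, total usage over
# [t, t+D] is the sum, over servers in the Set of Used Servers S_c(t), of the
# usage attributed to c in each server's Resource Accounting Tree.
def total_usage(c, r, used_servers, rat):
    """used_servers: {ce: set(server)}; rat: {(server, r): {ce: usage}}."""
    return sum(rat.get((s, r), {}).get(c, 0) for s in used_servers[c])

used = {"CE_A": {"s1", "s9"}}                     # S_A(t): servers used by CE_A
rat = {("s1", "cpu"): {"CE_A": 30.0},             # CE_A's own VM
       ("s9", "cpu"): {"CE_A": 12.5, "CE_B": 7.5}}  # shared service node
usage = total_usage("CE_A", "cpu", used, rat)     # 30.0 + 12.5 = 42.5
```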
1.2.1.1 Dependency Discovery
To conduct accounting at a particular server, the first logical step is to identify
which chargeable entities cause resource consumption on this server. We refer to
the problem of determining at which servers each chargeable entity causes resource
consumption (directly or indirectly) as dependency discovery. Dependency discovery
is equivalent to finding the set of used servers. The study of dependency
discovery in this dissertation focuses on building the 'set of used servers' data
structure via a novel causality discovery technique. This research is the topic
of Chapter 3.
1.2.1.2 Resource Usage Inference
In the resource usage inference step, our focus is to construct the 'resource
accounting tree' data structure for each resource type (CPU, network, or disk I/O) at
each server node. The scope of this study is, first, to develop a VM-transparent (i.e.,
implemented at the hypervisor layer) and accurate resource accounting technique.
Second, we study how much more effective it is than the state-of-the-art
technique. The baseline for comparison is a technique that uses common monitoring
utilities with a well-known linear regression method. Finally, we demonstrate the
usefulness of our resource accounting solution by constructing a scenario of an SLA
violation due to resource usage imbalance. We show that our solution correctly
identifies the culprit and initiates targeted resource throttling for fairness whereas
the baseline technique fails. This study is presented in Chapter 4.
1.2.2 Consumer-end Decision Making
Consumer-end decision making is concerned with determining whether a consumer
application should migrate to a cloud and, if so, in what way. In this study, we employ
empirical approaches to estimate the resource requirements of the consumer application
and analyze the costs of a comprehensive set of hosting options, including in-house
and multiple cloud-based choices. We present how to systematically incorporate
important systems properties such as workload growth, intensity, and variability, as
well as subjective costs, into determining the economic feasibility of cloud-based hosting.
We present this study in Chapter 5.
Chapter 2
Related Work
2.1 Provider-end Resource Usage Inference and
Accounting
First, we discuss existing work related to provider-end dependency
discovery and resource accounting. Dependency discovery techniques fall
into two classes: statistical inference-based techniques, in which data mining is applied
to measurements to infer properties of interest, and instrumentation-based techniques,
in which information is intentionally inserted into the software to aid in tracking
its behavior.
2.1.1 Statistical Inference-based Technique
Aguilera et al. [15] proposed two algorithms for debugging distributed systems.
The first algorithm finds nested RPC calls and uses a set of heuristics to infer
the causality between nested RPC calls, e.g., by considering time difference be-
tween RPC calls and the number of potential parent RPC calls for a given child
RPC call. The second algorithm only infers the average response time of compo-
nents; it does not build request-processing paths. WAP5 [21] intercepts network
related system calls by dynamically re-linking the application with a customized
system library. It statistically infers the causality between messages based on
their timestamps. By contrast, our method is intended to be precise. It monitors
thread activities in order to accurately infer event causality. Anandkumar et al.
[16] assume that a request visits distributed components according to a known
semi-Markov process model. Their method infers the execution paths of individual
requests by probabilistically matching them to footprints (e.g., timestamped request
messages) using the maximum likelihood criterion. It requires synchronized clocks
across the distributed components. Their technique, Spaghetti, is evaluated through
simulation on simple hypothetical process models, and its applicability to complex
real systems remains an open question. Sengupta et al. [22] proposed a method that
takes application logs and a prior model of requests as inputs. However, manually building
a request-processing model is non-trivial and in some cases prohibitive. In some
sense, the request-processing model is in fact the information that we want to ac-
quire through monitoring. Moreover, there are difficulties with using application
logs as such logs may not follow any specific format and, in many cases, there may
not even be any logs available.
2.1.2 System-dependent Instrumentation
Magpie [17] is a tool-chain that analyzes event logs to infer a request’s processing
path and resource consumption. It can be applied to different applications but its
inputs are application dependent. The user needs to modify middleware, applica-
tion, and monitoring tools in order to generate the needed event logs. Moreover,
the user needs to understand the syntax and semantics of the event logs in order
to manually write an event schema that guides Magpie to piece together events of
the same request. Magpie does kernel-level monitoring for measuring resource con-
sumption, but not for discovering request-processing paths. Pip [20] detects prob-
lems in a distributed system by finding discrepancies between actual behavior and
expected behavior. A user of Pip adds annotations to application source code to
log messaging events, which are used to reconstruct request-processing paths. The
user also writes rules to specify the expected behaviors of the requests. Pip then
automatically checks whether the application violates the expected behavior. Pin-
point [19] modifies middleware to inject end-to-end request IDs to track requests.
It uses clustering and statistical techniques to correlate the failures of requests to
the components that caused the failures. Chen et al. [18] used request-processing
paths as the key abstraction to detect and diagnose failures, and to understand the
evolution of a large system. They studied three examples: Pinpoint, ObsLogs, and
SuperCall. All of them do intrusive instrumentation in order to discover request-
processing paths. Stardust [23] uses source code instrumentation to log application
activities. An end-to-end request ID helps recover request-processing paths. Star-
dust stores event logs into a database, and uses SQL statements to analyze the
behavior of the application.
2.1.3 Resource Accounting
Earlier work by Banga et al. [14] addressed the issue of correct resource
accounting within a single host. They introduced a new abstraction, called resource
containers, to be used as the resource principal within the kernel. Distributed
resource containers [47] extend resource containers to the distributed
environment: local resource containers are bound together by exchanging
identifiers in order to coordinate resource consumption across hosts. In
that work, the goal was to throttle energy consumption per application. Our
work advances this thread of research by enabling resource accounting at various
locations within the distributed application hierarchy, exploiting message
causality tracking and thread-level monitoring.
Recall that there are two aspects to a resource accounting solution: local
monitoring and collective inference. Some monitoring techniques do not require any
modification to application software, and we label all of them as non-intrusive.
These techniques span a spectrum of the effort involved in modifying the underly-
ing systems software. Among the least intrusive techniques are popular user-level
monitoring tools such as top, iostat, vmstat, sar, netstat, etc. These tools
either rely upon OS system calls or read certain OS-provided information (such
as /proc/stat) to find resource usage. Some techniques insert hooks within the
systems/runtime software for data collection. While still non-intrusive according
to our classification, they entail different amounts of additional effort in their de-
sign and use. For example, the tool OProfile [48] requires insertion of a kernel
module, or kernel recompilation with reconfiguration. Chopstix [49] adds a data
structure, sketches, to the kernel in order to monitor low-level OS operations such
as page allocation, mutex/semaphore locking and CPU utilization. Kprobes [50]
allows one to insert probes at kernel functions or addresses. Since it operates by
inducing breakpoint exceptions, its performance degradation is known to be high. Using
Kprobes may require enabling CONFIG_KPROBES and other configuration options and
rebuilding the kernel, depending on the Linux distribution. It is intended for kernel
debugging. Xenprobes [51] can be used to inject breakpoints at the entry and
exit points of any kernel function within a guest domain, similar to Kprobes.
Although Xenprobes is designed with testing and debugging in mind, sAccount could
certainly make use of it to collect richer monitoring information. Unfortunately,
its code is not publicly available as of now, preventing us from
adopting it or investigating its feasibility. An intrusive technique, on the other
hand, requires modifying the application itself. The additional information re-
sulting from these modifications often allows more accurate/detailed monitoring.
However, this comes at the cost of added programming effort of modifying the
application (which may be difficult or even impossible in many scenarios) and a
possible run-time slowdown. For example, IBM ARM [52] requires recompilation
after instrumenting the application with certain calls for monitoring the travel
path of user requests through servers.
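As a concrete illustration of the kind of OS-provided information these user-level tools consume, the following sketch parses two samples in the format of the aggregate "cpu" line of Linux's /proc/stat and computes a utilization fraction. The sample strings are made up; on a real system one would read /proc/stat twice, some interval apart.

```python
# Parse a /proc/stat "cpu" line: fields are cumulative jiffies in the order
# user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice.
def cpu_fields(stat_line):
    parts = stat_line.split()
    assert parts[0] == "cpu"
    return [int(x) for x in parts[1:]]

def utilization(before, after):
    """Fraction of non-idle time between two /proc/stat samples."""
    b, a = cpu_fields(before), cpu_fields(after)
    deltas = [y - x for x, y in zip(b, a)]
    total = sum(deltas)
    idle = deltas[3] + deltas[4]  # idle + iowait, by /proc/stat field order
    return (total - idle) / total

s1 = "cpu 100 0 50 800 50 0 0 0 0 0"   # hypothetical first sample
s2 = "cpu 180 0 90 860 70 0 0 0 0 0"   # hypothetical second sample
util = utilization(s1, s2)  # 120 busy ticks out of 200 -> 0.6
```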
Monitoring itself is seldom the final goal and there are usually domain-specific
higher-level goals (e.g., while the goal of inference in this work is resource account-
ing, frequently occurring goals in existing work are capacity planning [53, 54],
debugging [15, 20, 21, 19], performance management [55, 52, 56]). Once monitor-
ing data is collected, the next step is to apply suitable inference techniques that
process this data to derive information needed for achieving such goals. The body
of work on inference techniques is, of course, vast (see [54, 53, 57, 58] for some
surveys) and spans the entire spectrum of statistical sophistication. For example,
a non-intrusive tool mentioned earlier that needs to interpret application-specific
logs (e.g., the access log of the popular Apache web server) can be considered a
simple inference technique. At the other end, the application of a TAN model to the
classification of SLO violations [58] falls into the group of sophisticated
inference techniques.
2.2 Consumer-end Decision Making
Questions related to cloud economics have been raised by several researchers and
are drawing increasing attention. Gray [59] examined economics in the context
of distributed computing and quantified how much of each resource a user could buy
with one dollar in the year 2003. One of his conclusions was that since data transfer
costs are non-negligible for Web-based applications, it is economical to optimize
an application towards reducing data transfer even at the cost of increased CPU
consumption. We find that this argument still holds in today's cloud environment
according to the results of our cost analysis. Economics has also been examined
in the grid computing domain. Thanos et al. [60] identified economic factors
that could stimulate the adoption of grid computing by business
sectors. There have also been several efforts to promote the commercialization of
grids [61, 62].
Armbrust et al. [63] extended the cost analysis of Gray's data to the year
2008 and showed how the cost of each resource type has evolved at a different rate.
They have also pointed out that cost analysis can be complicated due to cost factors
such as power, cooling and operational costs, which are in many cases difficult to
quantify. Their arguments are in line with our observations and treatment of
cost factors since those costs correspond to the class of “less quantifiable” and/or
“indirect” costs from our cost taxonomy.
Walker [64] has looked at issues related to the economics of purchasing or leas-
ing CPU hours using the NPV concept. The focus of his work is to provide a
methodology that can aid in deciding whether to buy or lease the CPU capac-
ity from the organization’s perspective. His analysis ignores application-specific
intricacies. For example, in calculating the cost of leasing the CPU hours from
Amazon EC2, required total CPU hours is assumed to be statically fixed. In real-
ity CPU requirement (and usage) will vary depending on the type of application
and the amount of injected workloads. Similarly, Walker et al. [65] also studied
the problem of using storage cloud vs. purchasing hard disks. Both studies bring
many useful insights about the cost of procuring a known fixed set of hardware
resources. However, in order to assess the feasibility of moving specific applications
into the cloud, additional variables must be taken into account. Our study differs
from these in the sense that we address the question at the level of individual
applications.
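The NPV-style buy-vs-lease comparison that Walker performs can be illustrated with a toy calculation. The figures, the three-year horizon, and the discount rate below are all hypothetical.

```python
# Net present value of a stream of yearly cashflows (year 0 first): a purchase
# is a single upfront cost, while leasing is a stream of payments discounted
# back to present value. All numbers are illustrative.
def npv(cashflows, rate):
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(cashflows))

buy_cost = npv([9000], rate=0.05)            # pay once, up front
lease_cost = npv([3500, 3500, 3500], 0.05)   # pay each year for three years
print(round(buy_cost), round(lease_cost))    # -> 9000 10008
# Here buying wins, but the comparison flips if usage (and hence the lease
# stream) is lower -- which is why assuming statically fixed CPU hours matters.
```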
Harms and Yamartino [66] explain how the emergence of the cloud impacts the
economics of IT businesses. They show that cloud infrastructure benefits from
economies of scale in three areas and that this provides incentives for organizations and businesses
to migrate their IT infrastructures into the cloud. Although they do not develop
detailed analysis for application migration, they mention the workload variability
and growth patterns of an application as key properties, which we incorporate into
our analysis. In our prior work we have also addressed some economic issues of
cloud migration as they apply to digital library and search systems such as Cite-
Seer [67, 68]. Klems et al. [69] propose a framework for the valuation of clouds in order
to enable cost comparisons; however, their emphasis is on the procedural aspects
of the problem rather than on cost analysis. Wang et al. [70] conduct preliminary
studies on several aspects of current cloud pricing schemes. They discuss the
relationship between cost and performance optimization, and its consequences, from
the viewpoints of both the user and the provider, but they do not address the costs
related to migrating applications. There is also a study on how to minimize the cost
of map-reduce applications using transient VM instances such as Amazon Spot In-
stances [71]. Campbell et al. [72] carry out simple calculations to determine the
break-even utilization point for owning vs. renting the system infrastructure for a
medium-sized organization. There are also tools that aid in calculating the cost of
using cloud services [73, 74]. Overall, although there have been attempts to analyze
the cost of application migration, most of them are preliminary assessments
or limited to specific applications under many simplifying assumptions. Our work is
an effort to broaden these insights by identifying a comprehensive
set of critical factors and incorporating them into the cost analysis.
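The break-even calculation of the kind Campbell et al. carry out can be shown in a few lines. All numbers below are hypothetical.

```python
# Back-of-the-envelope owning-vs-renting comparison: owning costs accrue
# around the clock, while renting accrues only for utilized hours, so the
# break-even utilization is the ratio of the two hourly costs.
def break_even_utilization(own_cost_per_hour, rent_cost_per_hour):
    """own_cost = rent_cost * utilization  =>  utilization = own / rent."""
    return own_cost_per_hour / rent_cost_per_hour

u = break_even_utilization(own_cost_per_hour=0.05, rent_cost_per_hour=0.20)
# u == 0.25: above 25% average utilization, owning is cheaper in this toy case
```

As the surrounding discussion argues, a per-application analysis must go beyond such fixed-rate arithmetic by accounting for workload variability and growth.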
Chapter 3
Provider-end Dependency Discovery
3.1 Introduction and Background
This chapter is devoted to the study of a dependency discovery technique that
forms the foundation of our overall resource accounting solution. In general, the
term “dependency” in the context of distributed systems may represent some kind
of reliance of one component on another. We define dependency in the following
specific way: a dependency is said to exist between a chargeable entity c and
a server node s during time [t, t + ∆] if c causes consumption of resources of s
during this period. The chargeable entity c and the server node s can be located
anywhere in the cloud infrastructure, with c (or parts of it) not necessarily being
hosted within s or “adjacent to” s.
We use the illustration shown in Figure 3.1 to define some of the abstractions
and to describe the role of dependency discovery within our overall resource ac-
counting technique. Figure 3.1 depicts the deployment scenario of two VCs from
two distinct chargeable entities that make use of a shared database service. VMs
v1, ..., v5 are VM instances owned exclusively by the chargeable entities CE_A and CE_B.
VMs v6,...,v9 together comprise the shared database service. Labels s1,...,s9 refer
to the physical servers hosting these VMs. We employ two abstractions, Set of
Used Servers and Resource Accounting Tree to formalize the resource accounting
problem and to capture the distributed information that our accounting solution
must keep track of.
[Figure 3.1: diagram omitted; it depicts VMs v1-v5 owned by CE_A and CE_B, VMs v6-v9 forming the shared database service, the physical servers s1-s9 hosting them, and the labeled "Set of Used Servers" and "Resource Accounting Tree" abstractions.]

Figure 3.1. Example deployment of VCs from chargeable entities CE_A and CE_B. The two
VCs share the database service instance. This sharing is transparent to the chargeable
entities, since the provider decides who will share the service instance. The abstractions
of "Set of Used Servers" and "Resource Accounting Tree" are also labeled.
• Set of Used Servers (S_c(t)): For each CE c, the accounting solution maintains
S_c(t), the set of servers whose resources are used for c during the time interval
[t, t+∆]. This usage may be either: (i) direct, i.e., by one or more components
of c, or (ii) indirect, i.e., by a shared service on behalf of c. In Figure 3.1,
there are two chargeable entities, CE_A and CE_B, owning v1, v2, v3 and v4, v5,
respectively, as part of their VCs. The Set of Used Servers for each of them
is marked with a colored boundary as well as a label. In this example, note
that server s8 happens to belong only to S_A(t).
• Resource Accounting Tree (RAT) T_s^r(t): For resource r within a server s, the
accounting solution must maintain resource accounting information during
the interval [t, t+∆] in the form of a resource accounting tree T_s^r(t). We use
the example shown in Figure 3.1 to explain a resource accounting tree for
the CPU resource on server s9. The entire usage of the CPU resource within
s9 is represented by the root of the tree. The next level of nodes in the tree
represents a breakdown of this overall CPU usage among three entities: (i)
the VM hypervisor or VMM, (ii) CE_A, and (iii) CE_B. The usage of the VMM
may be further broken down into portions attributed to the applications CE_A
and CE_B, as captured by the nodes of the tree below the node for the VMM's
CPU usage. We denote the sum of the usage corresponding to all leaf nodes
associated with the CE c as u_c^r(t).
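One possible encoding of such a tree for the CPU resource on s9, with the leaf-sum u_c^r(t), might look like this. The tree layout and all usage numbers are ours, for illustration only.

```python
# Illustrative resource accounting tree T_s^r(t) for CPU on server s9: the
# root is total usage, its children split it among the VMM and the chargeable
# entities, and the VMM's usage is further attributed to the CEs it served.
tree = ("s9:cpu", [
    ("VMM", [("CE_A", []), ("CE_B", [])]),  # VMM work done on behalf of CEs
    ("CE_A", []),                           # CE_A's direct usage
    ("CE_B", []),                           # CE_B's direct usage
])
leaf_usage = {("VMM", "CE_A"): 3.0, ("VMM", "CE_B"): 2.0,
              ("s9:cpu", "CE_A"): 20.0, ("s9:cpu", "CE_B"): 15.0}

def u(ce, parent, node):
    """u_c^r(t): sum usage over all leaves of the subtree attributed to ce."""
    name, children = node
    if not children:  # leaf: count it only if it is tagged with this CE
        return leaf_usage.get((parent, name), 0) if name == ce else 0
    return sum(u(ce, name, child) for child in children)

usage_a = u("CE_A", None, tree)  # 20.0 direct + 3.0 via the VMM = 23.0
```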
3.2 Dependency Discovery: Problem Statement
and Requirements
The goal of our dependency discovery is to construct the Set of Used Servers S_c(t)
for each chargeable entity c. For a chargeable entity c, S_c(t) can vary for several
reasons. Properties of the requests from c to the shared service may change in many
ways. Requests may transition from read-intensive to write-intensive, which may
cause any caching within the shared service to behave differently (e.g.,
more traffic from the front-end v6 to the data store node v9 in Figure 3.1). Another
possibility is that c may start to request previously unaccessed data, in which
case S_c(t) may grow to include another server node because of the new access pattern.
The granularity of time t also affects S_c(t). A large time granularity tends
to encompass a large number of server nodes in the set and to stay stable. As the
time granularity gets smaller, S_c(t) may shrink or grow depending on the workload
patterns of c. The choice of time granularity depends on the inference method to
be applied during resource usage inference, the topic of Chapter 4.
Knowing the currently dependent set of chargeable entities for a
given server node offers important benefits. By targeting resource accounting at
the minimal set of chargeable entities, we expect to improve both
the efficiency of the accounting algorithms and the accuracy of their results. Regardless
of which algorithm we employ in the resource usage inference step, a
smaller input data set always reduces the amount of computation. A smaller number
of chargeable entities also improves accuracy because there is less uncertainty in
the input data. It is well known in the data mining field that separating a
larger number of component distributions out of their mixture degrades
accuracy, because more work has to be done with a fixed amount of input
information. A similar difficulty is known in signal processing as
the blind source separation problem [75]. Obtaining the minimal dependent set of
chargeable entities can be crucial when there is a large number of chargeable entities.
It is not uncommon to have hundreds or thousands of chargeable entities in
a large-scale cloud environment. Without knowledge of the minimal dependency set,
we must use all potential entities as input to the accounting algorithm when, in
fact, only a tenth of them may be involved in the actual resource consumption.
There are several possible ways to discover the dependency set. One way is to
infer it from a set of measurements chosen to suit a selected
inference algorithm. Another is to modify the software stack to insert
information that eases discovery through post-processing of logs. We strive,
however, to develop a technique that, unlike the inference approach, delivers an
accurate set, and that, unlike the instrumentation approach, requires no software
modification. We focus on tracking the messages between application components to
find the causal path, or trail, of messages. By 'causal', we mean that one message
exchange between two components, say c1 and c2, triggers or is directly responsible
for the generation of another set of messages between c2 and c3. It is important
to discover such causality: the messages between c2 and c3 in this example may
be merely coincidental, in which case no activity within
c3 can be attributed to c1. Declaring c3 dependent on c1 in that case would
spuriously add c3 to the dependency set. Therefore, the key to a
dependency discovery technique for resource accounting is to find the causality
between components.
In order for a solution to be practical, we set the following requirements.
• Transparency: The technique should not require any user-level knowledge.
Acquisition of such information usually mandates intrusive modification to
the user applications or guest kernels. The technique must work only with
the information available from the hypervisor side.
• Accuracy: The accuracy of dependency information directly affects the qual-
ity of resource accounting and any other optimizations built on it.
• Generality: The technique should ideally work regardless of software archi-
tecture running in the user virtual machines.
• Efficiency: In order for the technique to be deployable in an online fashion,
its overhead must remain at a tolerable level.
3.3 Inadequacy of Existing Approaches

We classify the existing dependency discovery techniques into two categories: (i)
instrumentation-based techniques and (ii) statistical inference techniques. The
instrumentation-based approach modifies middleware or applications to record
events (e.g., request messages and their end-to-end identifiers) that can be used to
reconstruct the paths of messages. These techniques can provide accurate
information, but are not generally applicable: they require knowledge (and often
source code) of the specific middleware or applications in order to perform the
instrumentation. This class of techniques fails to meet the requirements of
generality and transparency. The statistical inference approach takes readily
available information (e.g., timestamps of network packets) as input, and infers
dependencies in a “most likely” way. Statistical approaches are general but not
accurate; their accuracy can degrade for multiple reasons in different
circumstances. Some administrative actions cannot be taken if the information is
inaccurate, due to the danger of misoperation. Another drawback is that statistical
techniques often require heavy computation for model construction, which may
impact performance. This class of techniques fails to meet the accuracy and
efficiency requirements. We discuss techniques of both kinds, as well as ones that
combine them, in detail in Chapter 2.
3.4 Proposed Solution: vPath
We propose a solution, named vPath, that approaches the dependency discovery
problem from a new direction. Our solution focuses on tracking the messages between
application components to find the causal path, or trail, of messages. By ‘causal’, we
mean that one message exchange between two components, say c1 and c2, triggers
or is directly responsible for the generation of another set of messages between c2
and c3. It is important to discover such causality. The messages between c2 and c3,
in this example, may be merely coincidental and, if that is the case, this means that
Figure 3.2. The principal idea of finding causality in our proposed solution.
any activity within c3 cannot be attributed to c1. Declaring c3 as dependent on c1
in this case would end up adding c3 spuriously into the dependency set. Therefore,
the key to the dependency discovery technique for resource accounting is to find
the causality between components. In distributed applications, the resources of
a server node are consumed upon the arrival of request messages from other
components. This implies that identifying the Set of Used Servers of a
chargeable entity c is equivalent to determining the set of servers touched by the
messages originating from c during [t, t + ∆]. In turn, this is equivalent to tracking
the passage of each message across the components of the virtual clusters. Therefore,
in order to be able to construct the Set of Used Servers, we need to develop a
technique that can track the passage/path of the messages.
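The equivalence above can be sketched in a few lines. The following Python fragment is illustrative only, not dissertation code: the path representation, the function name, and the sample data are assumptions made for the example. Each tracked path is a timestamped list of servers visited by a message originating from its first element.

```python
# Hypothetical sketch: deriving the Set of Used Servers of a chargeable
# entity over a window [t, t + delta] from tracked message paths.
# Each path is (timestamp, [servers visited]); all names are illustrative.

def set_of_used_servers(paths, origin, t, delta):
    """Servers touched by messages originating from `origin` in [t, t+delta]."""
    used = set()
    for ts, visited in paths:
        if t <= ts <= t + delta and visited and visited[0] == origin:
            used.update(visited)
    return used

paths = [
    (10.0, ["c1", "c2", "c3"]),   # request from c1 reaching c2 and c3
    (11.5, ["c1", "c2"]),
    (30.0, ["c4", "c2"]),         # different origin, different window
]
print(sorted(set_of_used_servers(paths, "c1", 9.0, 5.0)))
```

The point of the sketch is that once per-message paths are available, the accounting set falls out of a simple filter-and-union; the hard part, addressed next, is obtaining those paths.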
3.4.1 Design and Implementation of our Dependency Discovery Technique
To reconstruct paths of messages across virtual cluster components, we need to
find two types of causality. Intra-node causality captures the behavior that, within
one component, processing an incoming message i triggers sending an outgoing
message j. Inter-node causality captures the behavior that, an application-level
message j sent by one component corresponds to message j′ received by another
component. Our thread-pattern assumption enables the inference of intra-node
causality, while the communication-pattern assumption enables the inference of
inter-node causality.
Specifically, we reconstruct the path of the messages in Figure 3.2 as follows.
Inside component 1, the synchronous-communication assumption allows us
to match the first incoming message over ‘TCP Connection 1’ with the first
outgoing message over ‘TCP Connection 1’, the second incoming message with
the second outgoing message, and so forth. (Note that one application-level
message may be transmitted as multiple network-level packets.) Therefore, ‘Receive
Request i’ can be correctly matched with ‘Send Reply i’. Similarly, we can match
component 1’s ‘Send Request j’ with ‘Receive Reply j’, and also match component
2’s ‘Receive Request j’ with ‘Send Reply j’.
Between two components, we can match component 1’s first outgoing message
over ‘TCP Connection 2’ with component 2’s first incoming message over ‘TCP
Connection 2’, and so forth, hence correctly matching component 1’s ‘Send
Request j’ with component 2’s ‘Receive Request j’.
The only missing link is that, in component 1, ‘Receive Request i’ triggers
‘Send Request j’. From the thread-pattern assumption, we can indirectly infer this
causality. Recall that we have already matched ‘Receive Request i’
with ‘Send Reply i’. Between the times of these two operations, we observe that
the same thread performs ‘Send Request j’. It follows from
our thread-pattern assumption that ‘Receive Request i’ triggers ‘Send Request
j’. This completes the construction of all the causality needed to
discover the dependency.
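The intra-node inference just described can be condensed into a small sketch. The Python fragment below is illustrative only; the event format and names are assumptions, not vPath's actual log format. It applies the thread-pattern rule: a request sent by a thread while that thread is serving an incoming request is attributed to that incoming request.

```python
# Illustrative reconstruction of intra-node causality from a per-component
# event log, following the thread-pattern assumption described above.
# Each event is (thread_id, operation, message_label); all hypothetical.

def infer_causality(events):
    """Attribute each outgoing request to the incoming request being served."""
    current = {}    # thread -> incoming request it is currently processing
    caused_by = {}  # outgoing request -> incoming request that triggered it
    for thread, op, label in events:
        if op == "recv_request":
            current[thread] = label            # thread starts serving `label`
        elif op == "send_request":
            caused_by[label] = current.get(thread)
        elif op == "send_reply":
            current.pop(thread, None)          # the reply closes the request
    return caused_by

log = [
    ("T1", "recv_request", "i"),
    ("T1", "send_request", "j"),   # issued while serving i -> caused by i
    ("T1", "recv_reply",  "j"),
    ("T1", "send_reply",  "i"),
]
causality = infer_causality(log)
print(causality)  # {'j': 'i'}
```

Per-connection FIFO matching across components (the inter-node half) is analogous: the n-th send over a TCP connection is paired with the n-th receive on the other end.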
3.4.1.1 Implementation
The toolset of our proposed solution, vPath, consists of an online monitor and an
offline log analyzer. The online monitor continuously logs which thread performs a send
or recv system call over which TCP connection. The offline log analyzer parses
logs generated by the online monitor to discover the paths and the performance
characteristics at each step along these paths. The online monitor tracks network-
related thread activities. This information helps infer the intra-node causality of
the form “processing an incoming message X triggers sending an outgoing message
Y .” It also tracks the identity of each TCP connection, i.e., the four-element tuple
(source IP, source port, dest IP, dest port) that uniquely identifies a live TCP con-
nection at any moment in time. This information helps infer inter-node causality,
i.e., message Y sent by a component corresponds to message Y ′ received by an-
other component. The online monitor is implemented in Xen 3.1.0 [76] running on
x86 32-bit architecture. The guest OS is Linux 2.6.18. Xen’s para-virtualization
technique modifies the guest OS so that privileged instructions are handled prop-
erly by the VMM. Xen uses hypercalls to hand control from guest OS to the VMM
when needed. Hypercalls are inserted at various places within the modified guest
OS. In Xen’s terminology, a VM is called a domain. Xen runs a special domain
called Domain0, which executes management tasks and performs I/O operations
on behalf of other domains.
Monitoring Thread Activities: vPath needs to track which thread performs
a send or recv system call over which TCP connection. If thread scheduling
activities were visible to the VMM, it would be easy to identify the running threads.
However, unlike process switching, thread context switching is transparent to the
VMM. For a process switch, the guest OS has to update the CR3 register to
reload the page table base address. This is a privileged operation and generates
a trap that is captured by the VMM. By contrast, a thread context switch is not
a privileged operation and does not result in a trap. As a result, it is invisible to
the VMM.
Luckily, this is not a problem for vPath, because vPath’s task is actually simpler.
We only need to know the currently active thread when a network
send or receive operation occurs (as opposed to fully discovering thread-schedule
orders). Each thread has a dedicated stack within its process’s address space. It is
unique to the thread throughout its lifetime. This suggests that the VMM could
use the stack address in a system call to identify the calling thread. The x86
architecture uses the EBP register for the stack frame base address. Depending
on the function call depth, the content of the EBP may vary on each system call,
pointing to an address in the thread’s stack. Because the stack has a limited size,
only the lower bits of the EBP register vary. Therefore, we can get a stable thread
identifier by masking out the lower bits of the EBP register.
Specifically, vPath tracks network-related thread activities as follows:
• The VMM intercepts all system calls that send or receive TCP messages.
Relevant system calls in Linux are read(), write(), readv(), writev(),
recv(), send(), recvfrom(), sendto(), recvmsg(), sendmsg(), and
sendfile(). Intercepting system calls of a para-virtualized Xen VM is possible
because they use “int 80h”, and this software trap can be intercepted
by the VMM.
• On system call interception, vPath records the current DomainID, the con-
tent of the CR3 register, and the content of the EBP register. DomainID
identifies a VM. The content of CR3 identifies a process in the given VM.
The content of EBP identifies a thread within the given process. vPath uses
a combination of DomainID/CR3/EBP to uniquely identify a thread.
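The identification scheme above can be sketched in a few lines. This is an illustrative Python fragment, not the actual VMM code; in particular, the 8 KB stack size (and hence the 13-bit mask width) is an assumption made for the example, since the real mask depends on the guest's thread stack size.

```python
# Sketch of deriving a stable thread identifier from the register values
# recorded at each intercepted system call. ASSUMPTION: 8 KB thread stacks,
# so only the low 13 bits of EBP vary between calls by the same thread.

STACK_MASK = ~0x1FFF  # mask out the low 13 bits (call-depth variation)

def thread_id(domain_id, cr3, ebp):
    """DomainID picks the VM, CR3 the process, masked EBP the thread."""
    return (domain_id, cr3, ebp & STACK_MASK)

# Two system calls by the same thread at different call depths map to the
# same identifier, because only the low bits of EBP differ.
a = thread_id(3, 0x1A2B3000, 0xBF801F40)
b = thread_id(3, 0x1A2B3000, 0xBF801A10)
print(a == b)  # True
```

The same thread at different call depths thus yields one stable identifier, while a different domain, process, or stack yields a different one.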
By default, system calls in Xen 3.1.0 are not intercepted by the VMM. Xen
maintains an IDT (Interrupt Descriptor Table) for each guest OS and the 0x80th
entry corresponds to the system call handler. When a guest OS boots, the 0x80th
entry is filled with the address of the guest OS’s system call handler through
the set_trap_table hypercall. In order to intercept system calls, we prepare our
custom system call handler, register it into the IDT, and disable direct registration of
the guest OS’s system call handler. On a system call, vPath checks the type of the
system call, and logs the event only if it is a network send or receive operation.
Contrary to the common perception that system call interception is expensive,
it actually has negligible impact on performance. This is because system calls
already cause a user-to-kernel mode switch. vPath code is only executed after this
mode switch and does not incur this cost.
Monitoring TCP Connections: On a TCP send or receive system call, in
addition to identifying the thread that performs the operation, vPath also needs to
log the four-element tuple (source IP, source port, dest IP, dest port) that uniquely
identifies the TCP connection. This information helps match a send operation in
the message source component with the corresponding receive operation in the
message destination component. The current vPath prototype adds a hypercall
in the guest OS to deliver this information down to the VMM. Upon entering a
system call of interest, the modified guest OS maps the socket descriptor number
into (source IP, source port, dest IP, dest port), and then invokes the hypercall to
inform the VMM.
This approach works well in the current prototype, and it modifies fewer than
100 lines of source code in the guest OS (Linux). However, our end goal is to
implement a pure VMM-based solution that does not modify the guest OS at all.
Such a pure solution would be easier to deploy in a Cloud Computing platform
such as EC2 [77], because it only modifies the VMM, over which the platform
service provider has full control.
As part of our future work, we are exploring several techniques to avoid mod-
ifying the guest OS. Our early results show that, by observing TCP/IP packet
headers in Domain0, four-element TCP identifiers can be mapped to socket de-
scriptor numbers observed in system calls with high accuracy. Another alternative
technique we are exploring is to have the VMM keep track of the mapping from
socket descriptor numbers to four-element TCP identifiers, by monitoring sys-
tem calls that affect this mapping, including bind(), accept(), connect(), and
close().
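The second alternative can be sketched as a simple bookkeeping routine. The event format below is hypothetical and only illustrates the idea: watching the system calls that establish or tear down connections, such as accept(), connect(), and close(), suffices to keep the descriptor-to-four-tuple map current.

```python
# Illustrative sketch of the VMM maintaining the mapping from socket
# descriptors to TCP four-tuples by watching connection-related system
# calls. The event dictionaries are a hypothetical log format.

def track_connections(events):
    fd_map = {}  # (domain, process, fd) -> (src_ip, src_port, dst_ip, dst_port)
    for ev in events:
        key = (ev["domain"], ev["process"], ev["fd"])
        if ev["call"] in ("connect", "accept"):
            fd_map[key] = ev["tuple"]   # connection established
        elif ev["call"] == "close":
            fd_map.pop(key, None)       # descriptor no longer a live connection
    return fd_map

events = [
    {"domain": 1, "process": 7, "fd": 4, "call": "connect",
     "tuple": ("10.0.0.1", 43210, "10.0.0.2", 3306)},
    {"domain": 1, "process": 7, "fd": 5, "call": "accept",
     "tuple": ("10.0.0.1", 8080, "10.0.0.9", 51000)},
    {"domain": 1, "process": 7, "fd": 4, "call": "close"},
]
live = track_connections(events)
print(live)
```

With such a map in the VMM, a send or receive on a descriptor can be tagged with its four-tuple without any guest modification, which is the goal stated above.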
3.4.2 Applicability to Other Software Architectures
The proposed technique relies on the two assumptions described above. These
assumptions work well with the predominant form of software architecture, namely
the multi-threaded architecture. However, additional facilities are needed
to cover other forms of software architecture. In this section, we explain what other
software architectures exist and what needs to be done to apply the principles of
our proposed technique to them.
Multi-threaded Dispatcher & Worker Model: Figure 3.3 (a) shows the
dispatcher-worker model, which is arguably the most widely used threading model
for server applications. In the front-end, one or more dispatcher threads use the
select() system call or the accept() system call to detect new incoming TCP
connections or new incoming requests over existing TCP connections, respectively.
Once a user request is identified, the request is handed over to a worker thread for
further processing. This single worker thread is responsible for executing all activ-
ities triggered by the request (e.g., reading HTML files from disk, making JDBC
calls to retrieve data from database servers, calling back-end CICS mainframe
applications, doing local computation, etc.) and finally sending a response message
back to the user. After the worker thread finishes processing the request, it is put
Figure 3.3. Multi-threaded server architecture: (a) simple multi-threaded server model; (b) finite state machine of a worker thread.
back into the worker thread pool, waiting to be picked to process another incoming
request. The finite-state machine describing the behavior of such a worker
thread is depicted in Figure 3.3 (b).
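A toy version of this model can be written in a few lines. The sketch below is illustrative (request names, the "processing", and the pool size are invented): a dispatcher loop enqueues requests, and each request is handled end-to-end by the single worker thread that picks it up, which is exactly the property vPath's thread-pattern assumption exploits.

```python
# Toy dispatcher-worker sketch: one worker thread performs all processing
# for a given request. Names and workloads are illustrative only.
import queue
import threading

requests = queue.Queue()
replies = []
lock = threading.Lock()

def worker():
    while True:
        req = requests.get()      # block until the dispatcher hands over work
        if req is None:
            break                 # shutdown signal
        with lock:
            replies.append(f"reply-to-{req}")  # one thread does all the work
        requests.task_done()

pool = [threading.Thread(target=worker) for _ in range(4)]
for t in pool:
    t.start()

for i in range(8):                # the "dispatcher": enqueue incoming requests
    requests.put(f"req-{i}")
requests.join()                   # wait until every request is processed
for _ in pool:
    requests.put(None)
for t in pool:
    t.join()

print(len(replies))  # 8
```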
Event-Driven Model: Figure 3.4(a) shows the architecture of an application’s
component built using an event-driven programming model. Unlike other thread-
ing models, the event-driven model uses a relatively small number of threads, typi-
cally equal to or slightly larger than the number of CPUs on the server hosting the
component. When processing a request R, a thread T1 always uses non-blocking
system calls. If it cannot make progress on processing the request R because a
non-blocking I/O operation on behalf of R has not yet completed, the thread T1
records the current status of R in a finite state machine (FSM) maintained for R,
and moves on to process another request. When the I/O operation on behalf of
R finishes, an event is created in the event queue, and eventually a thread T2 re-
trieves the event and resumes the request R. Note that T1 and T2 may be different
threads, both participating in the processing of the same request, but at different
times during its lifetime.
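A toy single-loop illustration of this model follows; the request names, states, and event kinds are invented for the example. The key behavior it captures is the one that breaks the thread-pattern assumption: the loop fetches whichever event is next and resumes that request's FSM, so consecutive operations by one thread may belong to different requests.

```python
# Toy event-driven sketch: one loop multiplexes many requests, each tracked
# by its own finite state machine, resumed when an I/O event names it.
from collections import deque

fsm = {}                      # request id -> current state
event_queue = deque()

def start(req):
    fsm[req] = "waiting_io"   # record state; non-blocking I/O issue elided
    event_queue.append(("io_done", req))

def run():
    finished = []
    while event_queue:
        kind, req = event_queue.popleft()  # fetch = switch to another request
        if kind == "io_done" and fsm[req] == "waiting_io":
            fsm[req] = "done"
            finished.append(req)
    return finished

for r in ("i", "j", "k"):
    start(r)
done = run()
print(done)  # ['i', 'j', 'k']
```

Intercepting the `popleft()`-style event fetch, as discussed next, is what would let a tracker notice each switch of context.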
In order for our proposed technique to extend to the event-driven model, one
feature must be implementable. Since a small number of threads multiplex over
Figure 3.4. Event-driven and SEDA model architectures: (a) event-driven model; (b) SEDA model.
many communications, we need to be able to detect the moment when a thread
switches to processing another communication. In theory, this can be done
by intercepting the event fetch operation, since an event fetch signifies such a switch
of context. We have not delved into accomplishing this engineering task, but believe
that it can be done with a reasonable amount of effort if desired.
Staged Event-Driven Architecture: Shown in Figure 3.4(b) is the architec-
ture of one component within a SEDA-based application [78]. SEDA partitions the
request processing pipeline into stages and each stage has its own dedicated pool
of threads. Any two neighboring stages are connected by an event queue. Threads
in stage i put events in the queue qi, and threads in stage i+1 remove events from
this queue and process them, which may further trigger the generation of events
in the queue for the subsequent stage. One advantage of SEDA, claimed in [78], is
that the size of each stage’s thread pool can be adjusted independently based on
the observed workload.
Whether the proposed vPath technique applies to SEDA follows the same
reasoning presented for the event-driven model. SEDA is an extension of the
event-driven model with stages and, for the same reason, it is possible
to apply our proposed technique only if we can intercept the event fetching. We
likewise do not look into implementing this in our study.
3.4.3 Usefulness of Proposed Solution
Applications that adopt the event-driven model cannot be handled by vPath. However,
the pure event-driven model is rarely used in real applications. The
Flash Web server [79] is often considered a notable example of the
event-driven model, but Flash actually uses a hybrid between the event-driven and
multi-threaded programming models. In Flash, a single main thread does all
non-blocking network I/O operations and a set of worker threads perform blocking
disk I/O operations. The event-driven model is not yet popular in real applications,
and there is considerable consensus in the research community that programming
and debugging applications based on a pure event-driven model is difficult.
Furthermore, even the frequently-cited performance advantages of the event-driven
model are questionable in practice as it is extremely hard to ensure that a thread
actually never blocks. For example, the designers of Flash themselves observed
that the supposedly never-blocking main thread actually blocks unexpectedly in
the “find file” stage of HTTP request processing, and subsequently published mul-
tiple research papers [80, 81] to describe how they solved the problem by changing
the operating system. Considering the excellent expertise of the Flash researchers
on this subject, it is hard to imagine that regular programmers have a better chance
of getting the implementation right. Similar sentiments were expressed by Behren
et al. who have had extensive experience programming a variety of applications
using event-driven approaches [82].
3.5 Evaluation
Our experimental testbed consists of Xen VMMs (v3.1.0) hosted on Dell servers
connected via Gigabit Ethernet. Each server has dual Xeon 3.4 GHz CPUs with 2
MB of L1 cache and 3 GB RAM. Each of our servers hosts several virtual machines
Figure 3.5. The topology of the TPC-W benchmark set-up.
(VMs) with each VM assigned 300 MB of RAM. We use the xentop utility in
Domain0 to obtain the CPU utilization of all the VMs running on that server.
3.5.1 Applications
To demonstrate the generality of vPath, we evaluate vPath using a diverse set
of applications written in different programming languages (C, Java, and PHP),
developed by communities with very different backgrounds.
TPC-W: To evaluate the applicability of vPath for realistic workloads, we use a
three-tier implementation of the TPC-W benchmark [83], which represents an on-
line bookstore developed at New York University [84]. Our chosen implementation
of TPC-W is a fully J2EE compliant application, following the “Session Facade”
design pattern. The front-end is a tier of Apache HTTP servers configured to
load balance the client requests among JBoss servers in the middle tier. JBoss
3.2.8SP1 [85] is used in the middle tier. MySQL 4.1 [86] is used for the back-end
database tier. The topology of our TPC-W setup is shown in Figure 3.5. We
use the workload generator provided with TPC-W to simulate multiple concurrent
clients accessing the application.
This setup is a heterogeneous test environment for vPath. The Apache HTTP
server is written in C and is configured to use a multi-process architecture. JBoss
is written in Java and MySQL is written in C.
RUBiS: RUBiS [87] is an e-Commerce benchmark developed for academic re-
search. It implements an online auction site loosely modeled after eBay, and adopts
Figure 3.6. The topology of vApp used in the evaluation.
a two-tier architecture. A user can register, browse items, sell items, or make a bid.
It is available in three different versions: Java Servlets, EJB, and PHP. We use the
PHP version of RUBiS in order to differentiate from TPC-W, which is written in
Java and also does e-Commerce. Our setup uses one VM to run a PHP-enabled
Apache HTTP server and another VM to run MySQL.
MediaWiki: MediaWiki [88] is a mainstream open source application. It is the
software behind the popular Wikipedia site (wikipedia.org), which ranks in the
top 10 among all Web sites in terms of traffic. As mature software, it has a large
set of features, e.g., support for rich media and a flexible namespace. Because
it is used to run Wikipedia, one of the highest traffic sites on the Internet, its
performance and scalability have been highly optimized. It is interesting to see
whether the optimizations violate the assumptions of vPath (i.e., synchronous
remote invocation and event causality observable through thread activities) and
hence would fail our technique. MediaWiki adopts a two-tier architecture and is
written in PHP. Our setup uses one VM to run PHP-enabled Apache and another
VM to run MySQL.
vApp: vApp is our own prototype application. It is an extreme test case we
designed for vPath, capable of exercising vPath with arbitrarily complex message
paths. It is a custom multi-tier multi-threaded application written in C. Fig-
ure 3.6 shows an example of a three-tier vApp topology. vApp can form various
topologies, with the desired number of tiers and the specified number of servers at
each tier.

Table 3.1. Response time and throughput of TPC-W. “App Logging” represents a
log-based tracking technique that turns on logging on all tiers of TPC-W.

                 Response time in seconds             Throughput (req/sec)
                 (degradation in %)                   (degradation in %)
Configuration    Average          90th percentile     Average
Vanilla Xen      4.45             11.58               4.88
vPath            4.72 (6%)        12.28 (6%)          4.59 (6%)
App Logging      10.31 (132%)     23.95 (107%)        4.10 (16%)

When a server in one tier receives a request, it either returns a reply, or sends
another request to one of the servers in the downstream tier. When a server
receives a reply from a server in the downstream tier, it either sends another
request to a server in the downstream tier, or returns a reply to the upstream
tier. All decisions are made based on specified stochastic processes so that it can
generate complex paths with different structures and path lengths.
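A sketch of how such stochastic paths might be generated is shown below. It is a simplification of the behavior described above (this version only forwards straight down the tiers, whereas the real vApp can also issue further downstream requests after receiving replies); the tier layout, forwarding probability, and seed are invented for the example.

```python
# Simplified sketch of vApp-style stochastic path generation: at each hop a
# server either replies or forwards to a random downstream server.
import random

def generate_path(tiers, forward_prob, rng):
    """Return the list of servers a request visits across the tiers."""
    path = [rng.choice(tiers[0])]
    depth = 0
    while depth + 1 < len(tiers) and rng.random() < forward_prob:
        depth += 1
        path.append(rng.choice(tiers[depth]))  # forward downstream
    return path                                # then replies unwind upstream

rng = random.Random(42)                        # fixed seed for repeatability
tiers = [["VM1"], ["VM2", "VM3"], ["VM4", "VM5"]]
paths = [generate_path(tiers, 0.7, rng) for _ in range(5)]
for p in paths:
    print(" -> ".join(p))
```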
We also developed a vApp client to send requests to the front tier of the vApp
servers. The client can be configured to emulate multiple concurrent sessions. As
request messages travel through the components of the vApp server, the identi-
fiers of visited components are appended to the message. When a reply is finally
returned to the client, it reads those identifiers to precisely reconstruct the path,
which serves as the ground truth to evaluate vPath. The client also tracks the re-
sponse time of each request, which is compared with the response time estimated
by vPath.
3.5.2 Overhead of vPath
We first quantify the overhead of vPath, compared with both vanilla (unmodified)
Xen and log-based tracking techniques [24, 22]. For the log-based techniques, we
turn on logging on all tiers of TPC-W. The experiment below uses the TPC-W
topology in Figure 3.5.
Overhead of vPath for TPC-W. Table 3.1 presents the average and 90th
percentile response time of the TPC-W benchmark as seen by the client, serving
100 concurrent user sessions. For all configurations, 100 concurrent sessions cause
nearly 100% CPU utilization at the database tier. Table 3.1 shows that vPath has
low overhead. It affects throughput and average response time by only 6%. By
contrast, “App Logging” decreases throughput by 16% and increases the average
Figure 3.7. TPC-W response time and CPU utilization: (a) CDF (cumulative distribution function) comparison of TPC-W response time; (b) CDF comparison of the TPC-W JBoss tier’s CPU utilization.
response time by as high as 132%. The difference in response time is more clearly
shown in Figure 3.7(a), where vPath closely follows “vanilla Xen”, whereas “App
Logging” significantly trails behind.
Figure 3.7(b) shows the CPU utilization of the JBoss tier when the database
tier is saturated. vPath has negligible CPU overhead whereas “App Logging” has
significant CPU overhead. For instance, vPath and “vanilla Xen” have almost
identical 90th percentile CPU utilization (13.6% vs. 14.4%), whereas the 90th per-
centile CPU utilization of “App Logging” is 29.2%, more than twice that of vPath.
Thus, our technique, by eliminating the need for using application logging to trace
paths, improves application performance and reduces CPU utilization (and hence
Table 3.2. Performance impact of vPath on RUBiS.

                Response time (ms)       Throughput (req/sec)
                (degradation in %)       (degradation in %)
Vanilla Xen     597.2                    628.6
vPath           681.8 (14.13%)           593.4 (5.60%)

Table 3.3. Worst-case overhead of vPath and breakdown of the overhead. Each row
represents the overhead of the previous row plus the overhead of the additional
operation on that row.

                        Response time (sec)       Throughput (req/sec)
Configuration           Avg (Std.)   Overhead     Avg (Std.)      Overhead
Vanilla Xen             1.69 (.053)               2915.1 (88.9)
(1) Intercept Syscall   1.70 (.063)  0.7%         2866.6 (116.5)  1.7%
(2) Hypercall           1.75 (.050)  3.3%         2785.2 (104.6)  4.5%
(3) Transfer Log        2.02 (.056)  19.3%        2432.0 (58.9)   16.6%
(4) Disk Write          2.10 (.060)  23.9%        2345.4 (62.3)   19.1%
power consumption) for data centers. Moreover, vPath eliminates the need to re-
peatedly write custom log parsers for new applications. Finally, vPath can even
work with applications that cannot be handled by log-based discovery methods
because those applications were not developed with this requirement in mind and
do not generate sufficient logs.
Overhead of vPath for RUBiS. Due to space limitations, we report only
summary results on RUBiS. Table 3.2 shows the performance impact of vPath on
RUBiS. We use the client emulator of RUBiS to generate workload. We set the
number of concurrent user sessions to 900 and set user think time to 20% of the
original value in order to drive the CPU of the Apache tier (which runs PHP) to
100% utilization. vPath imposes low overhead on RUBiS, decreasing throughput
by only 5.6%.
Worst-case Overhead of vPath. The relative overhead of vPath depends on
the application. We are interested in knowing the worst-case overhead (even if the
worst case is unrealistic for practical systems).
The relative overhead of vPath can be calculated as v/(v + p), where v is vPath’s
processing time for monitoring a network send or receive operation, and p is the
application’s processing time related to this network operation, e.g., converting
data retrieved from the database into HTML and passing the data down the OS
kernel’s network stack. vPath’s relative overhead is highest for an application that
has the lowest processing time p. We use a tiny echo program to represent such a
worst-case application, in which the client sends a one-byte message to the server
and the server echoes the message back without any processing. In our experiment,
the client creates 50 threads to repeatedly send and receive one-byte messages in
a busy loop, which fully saturates the server’s CPU.
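Plugging illustrative numbers into this expression shows why the echo server is the worst case. The timings below are invented for the example, not measurements from our testbed: the smaller the application's own processing time p relative to vPath's per-operation cost v, the larger the relative overhead.

```python
# Worked example of the overhead expression v / (v + p) from the text.
# v: vPath's monitoring time per network operation; p: the application's
# own processing time for that operation. Values are illustrative only.

def relative_overhead(v, p):
    return v / (v + p)

echo = relative_overhead(v=0.02, p=0.08)       # echo-like, almost no work
realistic = relative_overhead(v=0.02, p=1.98)  # request with real processing
print(f"{echo:.0%} vs {realistic:.0%}")
```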
When the application invokes a network send or receive system call, vPath
performs a series of operations, each of which introduces some overhead: (1) in-
tercepting system call in VMM, (2) using hypercall to deliver TCP information
(src IP, src port, dest IP, dest port) from guest OS to VMM, (3) transferring log
data from VMM to Domain0, and (4) Domain0 writing log data to disk. These
operations correspond to different rows in Table 3.3, where each row represents
the overhead of the previous row plus the overhead of the additional operation on
that row.
Table 3.3 shows that intercepting system calls actually has negligible overhead
(1.7% for throughput). The biggest overhead is due to transferring log data from
VMM to Domain0. This step alone degrades throughput by 12.1%. Our current
implementation uses VMM’s printk() to transfer log data to Domain0, and we
are exploring a more efficient implementation. Combined together, the operations
of vPath degrade throughput by 19.1%. This is the worst case, for a contrived tiny
“application.” For real applications, throughput degradation is much lower: only
6% for TPC-W and 5.6% for RUBiS.
3.5.3 Dependency Discovery for vApp
Our custom application vApp is a test case designed to exercise vPath with arbi-
trarily complex paths. We configure vApp to use the topology in Figure 3.6. The
client emulates 10-30 concurrent user sessions. In our implementation, as a request
message travels through the vApp servers, it records the actual path, which serves
as the ground truth to evaluate vPath.
Figure 3.8. Examples of vApp’s paths discovered by vPath: (a) a simple path; (b) a complex path. The circled numbers correspond to VM IDs in Figure 3.6.
Figure 3.9. CDF of vApp’s response time, as estimated by vPath and as actually measured by the vApp client.
The message paths of vApp, as described in Section 3.5.1, are designed to be
random. To illustrate the ability of our technique to discover sophisticated paths,
we present two discovered paths in Figure 3.8. The simple path consists of 2 remote
invocations in a linear structure, while the complex path consists of 7 invocations
and visits some components more than once.
In addition to discovering the message paths, vPath can also accurately cal-
culate the end-to-end response times as well as the time spent on each tier along
a path. This information is helpful in debugging distributed systems, e.g., iden-
tifying performance bottlenecks and abnormal requests. Figure 3.9 compares the
end-to-end response time estimated by vPath with that actually measured by the
vApp client. The response time estimated by vPath is almost identical to that
observed by the client, but slightly lower. This small difference is due to message
delay between the client and the first tier of vApp, which is not tracked by vPath
Figure 3.10. Typical paths discovered by the vPath technique: (a) TPC-W, where a large number of requests and replies are exchanged between JBoss and MySQL and partial replies are pipelined back through Apache; (b) RUBiS, with exactly 3 round trips between PHP and MySQL and about 50 consecutive recv() calls.
because the client runs on a server that is not monitored by vPath.
We executed a large number of requests at different session concurrency levels.
We also experimented with topologies much larger than that in Figure 3.6, with
more tiers and more servers in each tier. All the results show that vPath precisely
discovers the path of each and every executed request.
3.5.4 Dependency Discovery for TPC-W
The three-tier topology (see the top of Figure 3.10(a)) of the TPC-W testbed is
static, but its message paths are dynamic and can vary, depending on which JBoss
server is accessed and how many queries are exchanged between JBoss and MySQL.
The TPC-W client generates logs that include the total number of requests, current
session counts, and individual response time of each request, which serve as the
ground truth for evaluating vPath. In addition to automated tests, for the purpose
of careful validation, we also conduct eye-examination on some samples of complex
paths discovered by vPath and compare them with information in the application
logs.
vPath is able to correctly discover all the message paths with 100% complete-
ness and 100% accuracy. We started out without knowing how the paths of TPC-
W would look. From the results, we were able to quickly learn the path structure
without any knowledge of the internals of TPC-W. Typical paths of TPC-W have
the structure in Figure 3.10(a).
We observe two interesting things that we did not anticipate. First, when
processing one request, JBoss makes a large number of invocations to MySQL.
Most requests fall into one of two types. One type makes about 20 invocations
to MySQL, while the other type makes about 200 invocations. These two types
represent radically different TPC-W requests.
The second interesting observation with TPC-W is that both JBoss and
Apache send out replies in a pipelined fashion (see Figure 3.10(a)). For example,
after making the last invocation to MySQL, JBoss reads in partial reply from
MySQL and immediately sends it to Apache. JBoss then reads and sends the
next batch of replies, and so forth. This pipeline model is an effort to reduce
memory buffer, avoid memory copy, and reduce user-perceived response time. In
this experiment, once JBoss sends the first partial reply to Apache, it no longer
makes invocations to MySQL (it only reads more partial replies from MySQL
for the previous invocation). vPath is general enough to handle an even more
complicated case, where JBoss sends the first partial reply to Apache, and then
makes more invocations to MySQL in order to retrieve data for constructing more
replies. Even for this complicated, hypothetical case, all the activities will still be
correctly assigned to a single path.
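The pipelined partial-reply behavior described above can be illustrated with a small relay loop: a middle tier reads each partial reply from its backend and forwards it downstream immediately, instead of buffering the whole response. This sketch is our own reconstruction for illustration, not JBoss code; the function name and buffer size are assumptions.

```python
import socket

BUF_SIZE = 8192  # assumed forwarding buffer size

def relay_partial_replies(upstream: socket.socket, downstream: socket.socket) -> int:
    """Forward each partial reply from the backend (upstream) to the next
    tier (downstream) as soon as it arrives, rather than buffering the
    whole reply. Returns the total number of bytes relayed."""
    total = 0
    while True:
        chunk = upstream.recv(BUF_SIZE)   # read one partial reply
        if not chunk:                     # upstream closed: reply complete
            break
        downstream.sendall(chunk)         # forward immediately (pipelining)
        total += len(chunk)
    return total
```

This structure explains why memory use and user-perceived latency drop: no tier ever holds more than one buffer's worth of the reply.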
3.5.5 Dependency Discovery for RUBiS and MediaWiki
Unlike TPC-W, which is a benchmark intentionally designed to exercise a breadth
of system components associated with e-Commerce environments, RUBiS and Me-
diaWiki are designed with practicality in mind, and their paths are actually shorter
and simpler than those of TPC-W.
Figure 3.10(b) shows the typical path structure of RUBiS. With vPath, we are
able to make some interesting observations without knowing the implementation
details of RUBiS. We observe that a client request first triggers three rounds of
messages exchanged between Apache and MySQL, followed by the fourth round in
which Apache retrieves a large amount of data from MySQL. The path ends with
a final round of messages exchanged between Apache and MySQL. The pipeline-
style partial message delivery in TPC-W is not observed in RUBiS. RUBiS and
TPC-W also differ significantly in their database access patterns. In TPC-W,
JBoss makes many small database queries, whereas in RUBiS, Apache retrieves a
large amount of data from MySQL in a single step (the fourth round). Another
important difference is that, in RUBiS, many client requests finish at Apache
without triggering database accesses. These short requests are about eight times
more frequent than the long ones. Finally, in RUBiS, Apache and MySQL make
many DNS queries, which are not observed in TPC-W.
For MediaWiki, the results of vPath show that very few requests actually reach
all the way to MySQL, while most requests are directly returned by Apache. This is
because much of the content is static, and even for dynamic content, MediaWiki
is heavily optimized for effective caching. For a typical request that changes a wiki
page, the PHP module in Apache makes eight accesses to MySQL before replying
to the client.
3.5.6 Discussion on Benchmark Applications
We started our experiments with little knowledge of the internals of TPC-W,
RUBiS and MediaWiki. During the experimentation, we did not read their manuals
or source code. We did not modify their source code, bytecode, or executable
binary. We did not try to understand their application logs or write parsers for
them. We did not install any additional application monitoring tools such as IBM
Tivoli or HP OpenView. In short, we did not change anything in the user space.
Yet, with vPath, we were able to make many interesting observations about the
applications. In particular, the different behaviors of the applications made us
wonder how, in general, to select “representative” applications for evaluating systems
performance research. TPC-W is a widely recognized de facto e-Commerce benchmark,
but its behavior differs radically from the more practical RUBiS and MediaWiki.
This discrepancy could result from the difference in application domain, but it is
not clear whether the magnitude of the difference is justified. We leave it as an
open question rather than a conclusion.
This question is not specific to TPC-W benchmark. For example, the Trade6
benchmark [89] developed by IBM models an online stock brokerage Web site.
We have intimate knowledge of this application. As both a benchmark and a
testing tool, it is intentionally developed with certain complexity in mind in order
to fully exercise the rich functions of WebSphere Application Server. It would
be interesting to know, to what degree the conclusions in systems performance
research are misguided by the intentional complexity in benchmarks such as TPC-
W and Trade6.
3.6 Summary and Conclusion
In this chapter we proposed a novel technique for discovering the dependencies
among distributed application components. Specifically, the goal was to discover
the end-to-end causal dependency established by message traversal across compo-
nents. Our proposed technique is hypervisor-based, meaning that it is transpar-
ent to the VMs running on top of the hypervisor and delivers accurate causality
for every message. We presented our prototype implementation using the Xen
virtualization platform and tested its effectiveness on various software settings
including a synthetic TCP/IP-based multi-threaded program and the TPC-W, RUBiS, and
MediaWiki applications. Our evaluation shows that our dependency discovery
technique, i.e., one based on the assumptions of a multi-threaded architecture
and a synchronous communication pattern, is an effective method of dependency
discovery.
Chapter 4
Provider-end Resource Usage Inference
4.1 Introduction
Recall that the overall resource accounting problem breaks down into two subproblems:
dependency discovery and resource usage inference. We presented our
study of dependency discovery and our solution in Chapter 3. In this chapter,
we move on to the resource usage inference problem to complete the
overall goal of resource accounting. The first main goal of this study is to under-
stand the shortcomings of common approaches to resource accounting
and to develop a technique that overcomes them. A popular technique for
resource accounting, which we select as the baseline for comparison against our solu-
tion, is to apply well-established linear regression methods to the monitoring data
collected via commonly available system monitoring tools. We then set up several
applications with appropriate workloads to evaluate the efficacy of our solution.
We also present the result of applying our overall resource accounting solution
to one example system management scenario to demonstrate its advantages over
existing solutions.
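The linear-regression baseline mentioned above can be sketched as follows: fit the aggregate resource usage reported by standard monitoring tools against per-CE request rates, then attribute usage via the fitted per-request costs. This is an illustrative sketch of the baseline with synthetic data, not our proposed solution; all variable names and numbers are made up.

```python
import numpy as np

# Per-interval request rates observed for two CEs (columns) over 6 intervals.
request_rates = np.array([
    [10.0,  5.0],
    [20.0,  5.0],
    [10.0, 15.0],
    [30.0, 10.0],
    [ 5.0, 25.0],
    [15.0, 20.0],
])
# Aggregate CPU utilization per interval, as a tool like top would report it.
# Here we synthesize it from assumed per-request costs plus a fixed overhead.
cpu_total = request_rates @ np.array([1.5, 2.0]) + 4.0

# Least-squares fit: cpu_total ~ request_rates @ w + intercept.
X = np.hstack([request_rates, np.ones((len(request_rates), 1))])
w, *_ = np.linalg.lstsq(X, cpu_total, rcond=None)

# Attribute CPU to each CE as (its rate) x (its fitted per-request cost).
per_ce_cpu = request_rates * w[:2]
```

The sketch also hints at the baseline's weakness discussed later: it can only recover per-CE usage indirectly through a fitted model, so any effect the model does not capture (caching, background work, interference) shows up as attribution error.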
The first major reference to the resource accounting problem is in the work on
resource container [14]. This work identified the mismatch between the notion of
a scheduling entity from the perspective of the OS kernel and an accounting entity
(akin to CE we define), and proposed to introduce a new abstraction called resource
container to correctly account resource usage. There have been follow-up works
that extended the notion of resource container to distributed environments [47].
Our resource accounting problem has some similarity with the problem addressed
in the work on distributed resource containers in that it also tries to determine
the resource usage at a target server node incurred by some set of remote entities.
However, there is one significant difference that prevents us from simply borrowing
the work on distributed resource containers. This difference arises due to the
presence of what we call shared services and the manner in which the resource
usage of each chargeable entity is interleaved within them.
Shared services are software components that are run and managed by the
cloud provider to deliver certain services or functionalities to user applications.
Instances of a shared service, whether virtualized or not, are not
owned by any chargeable entity. Software components within CEs’ VMs see only
an interface through which they can request services. Behind this interface
lies an arbitrary number of shared service instances that multiplex CEs’ requests
according to a suitable admission policy. There are many real-world examples of
shared services that follow this architecture. One notable example is the SQL Azure
relational database service. In SQL Azure, subscribers of the service are given
logical database servers, which are actually formed from multiple physical database
server instances located on several physical servers. Each database server
instance on each physical server harbors data for multiple subscribers [90].
The presence of shared services poses several difficulties for resource accounting.
• First, unlike software installed within CE’s exclusive VM instances, a shared
service may only be exercised by a CE indirectly, making it more difficult to
ascertain this dependency between the shared service and chargeable entities.
For example, in Figure 4.1, the front-end component of the shared database
is invoked directly by both the CEs. Existing work, including ours [91], can
be easily leveraged for identifying this dependency (if it is not already well-
known for some reason). However, the data store is exercised via requests
made to the front-end and not directly by the CEs. One approach for in-
ferring such dependency may be to instrument the messaging S/W to inject
tracking identifiers, which is not generally applicable and may be prohibitive.

Figure 4.1. A portion of a platform that hosts two applications, each a CE, and the servers hosting their components. Arrows indicate communication between components. Also shown is a shared service, a database used by both the CEs. The shared service itself consists of multiple software components, some of which are exercised by the CEs indirectly (e.g., the “Data Store”), i.e., via requests made to other components (e.g., the “Front-end”).
• Second, application-owned S/W components are contained within resource
principals that are likely to be easily identifiable by underlying resource man-
agement software (e.g., separate VMs on top of the virtual machine monitors
(VMMs) in Figure 4.1). This implies that an existing local accounting solution
such as resource containers can be readily used by this management software
to associate these resource principals with the corresponding CE. E.g., in
Figure 4.1, the web/app servers of CEA and CEB are hosted within their
own virtual machines (VMs) created by the underlying VMM software; by
associating containers with these VMs, existing accounting functionality can
be leveraged.[1] On the other hand, a shared service’s software design and
configuration may not be amenable to easy adaptation of existing solutions for
local accounting. For example, the data store component highlighted in Figure 4.1
multiplexes the resources assigned to its internal schedulable entities
(e.g., threads) in highly application-specific (and possibly unknown) ways
among the activities it carries out on behalf of CEA and CEB, rendering a
solution such as resource containers difficult to adapt.

[1] In fact, this is the essential idea behind distributed resource containers [47]: individual servers use resource containers for local accounting, and the network stacks within server operating systems are modified to embed tokens within messages sent to/by components that uniquely identify their CEs.

Services     S3     SimpleDB  EBS    ELB    SQS  RDS    Custom DB
Percentage   73.3%  8.3%      38.3%  16.6%  15%  21.6%  18.3%

Table 4.1. Usage pattern of AWS shared services. The total number of applications is 120. RDS in AWS is not a shared service as we define it here, since consumers own separate VM instances. ‘Custom DB’ means that the user has installed their own database within the EC2 instances.
Increasingly, for the cloud provider, accounting the resource usage of such
shared services is becoming important. Our investigation of current application
deployment patterns reveals that the majority of application deployments in cur-
rently popular clouds involve one or more shared services. In the case of Amazon’s
AWS cloud [77], more than 87% of the applications out of 120 deployments cur-
rently rely on the shared services Amazon provides. Only in 12% of the cases do
they use only the raw VM instances and install necessary software themselves. Ta-
ble 4.1 presents a detailed break-down of shared service usage. The usage pattern
of shared services is similar for Windows Azure [92]. About 73% of a total of 56
applications make use of SQL Azure [93]. SQL Azure’s architecture conforms
well to the shared service model we define, with each instance servicing multiple CEs’
workloads [90]. These facts indicate that shared services are an increasingly critical
part of cloud offerings that should not be disregarded in an accounting solution.
As we will show empirically, accurate accounting of resources used by such shared
services can lead to improved efficiency for the cloud provider.
4.1.1 Usefulness of Resource Accounting Information
Accurate resource accounting information helps providers gain a clearer view
of what is going on within their cloud infrastructure. This allows them to im-
prove their management decisions or even enables new management actions that
were previously difficult to apply. We consider several areas where resource accounting
information can be of aid.
• Cost Optimization: Consolidation of IT infrastructure is an important issue
in industry. Owners of data centers are interested in the “right sizing”
of system resources for cost optimization. In order to carry out right
sizing, one must know how much computing resource is needed so
that the required number of base hardware units can be estimated.
Resource accounting information can be used to reason about various con-
solidation configurations given a certain number of applications to operate.
Accuracy in resource accounting will lead to better estimation of costs.
• Dynamic Resource Management: Resource accounting can be used to dynam-
ically adjust system resources on the fly. Knowledge of resource usage
per chargeable entity allows providers to enforce fairness constraints or
load balancing schemes. Coupled with the flexibility in resource assignment
afforded by virtualization, providers can also dynamically control resource assignment
(e.g., the number of replicas assigned to a certain front-end shared service node) to
adjust system capacity and performance. In this scenario, simple reactive
provisioning based on reading immediate server conditions may be too slow
and inadequate. Our resource accounting information tells us the accurate rela-
tionship between the inbound request rate from chargeable entities and the
magnitude of resource usage per chargeable entity of arbitrary shared service
server nodes. This allows us to calculate how early and by how much to adjust
system resources, even for server nodes lying deep within the shared
service infrastructure.
• Modeling: The numbers obtained from resource accounting can be used
to construct models by relating them to input measurements. In many
cases, the inbound network traffic volume from chargeable entities to the front-
end shared service can serve as the input measurement. The models built
here can be used to describe how resources are shared among chargeable
entities. For example, they could be used to explain the caching effects
when two specific workloads are combined on a single shared service.
• Diagnosis: Resource accounting can also be used for system diagnosis. In partic-
ular, it is useful for detecting anomalies in resource consumption. When
a shared service node approaches resource saturation, resource account-
ing can be used to identify who is causing the saturation. It may
turn out that the resource consumption is due to some internal activity
with no causal link to any chargeable entity. Such discoveries can ease
subsequent diagnosis efforts by narrowing down the possible causes.
• Charging/Billing: One direct application of resource accounting is charg-
ing and billing, which is important in both public and private
clouds. In public clouds, resource usage information forms the basis for charg-
ing users for their usage. In private clouds, cloud providers can also utilize
it to enforce quotas for certain groups or to constrain usage.
However, in this study, we do not emphasize the charging/billing aspect of re-
source accounting. Although resource accounting is vital for charging/billing,
billing usually involves a much wider range of factors that are often subjective.
A complete treatment of the issues involved in charging/billing is out of the
scope of this dissertation.
4.2 Problem Definition
We refer readers to Figure 3.1 in Chapter 3 for key abstractions in defining the
resource usage inference problem. We show an illustrative portion of a public
cloud (a representative multi-tenant IT platform) that hosts software applications
for its clients. As mentioned earlier, we refer to a platform user’s application
whose resource usage must be separately tracked and accounted as a chargeable
entity (CE). It shows two such CEs (labeled CEA and CEB) each a multi-tier
e-commerce site. Each of these CEs supplies the cloud provider with a set of
software components that the cloud runs on its behalf. Our platform runs each
of these components within a VM, and this set of VMs is accommodated within
physical servers s1, ..., s5. Each of these servers runs a VMM/hypervisor layer that
multiplexes its server resources among overlying VMs.
Also shown in the figure is a shared service - a SaaS database - that the platform
offers to its CEs. This shared service itself has multiple components that span
servers s6, ..., s9. Although this database service is not a CE itself, we require
our accounting solution to keep track of the resources it consumes on behalf of the
CEs. Unlike for servers s1, ..., s5, where an accounting-capable VMM (e.g., using an
existing accounting solution such as resource containers) could associate the virtual
machines v1, v2, and v3 with CEA and v4, v5 with CEB, existing solutions cannot
be directly adapted for accounting within servers s6, ..., s9 where the VMM-visible
resource principals do not have a fixed association with any CE. Additionally, since
the back-end tier of the database service is only exercised by the CEs indirectly,
i.e., via work generated during processing of requests that are made by CEs to
the front-end, additional thought is needed to identify what portion of its resource
usage should be attributed to which CE.
Problem Definition: Given a set of CEs and an accounting granularity ∆, the
goal of sAccount is to infer, for each CE c, the time-series Uc(t).
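One simple way to picture the target output is as a per-CE map from ∆-sized intervals to resource usage. The representation below is a hypothetical sketch, not taken from the actual sAccount code; the ∆ value and units are assumptions.

```python
from collections import defaultdict

DELTA = 5.0  # accounting granularity (seconds), an assumed value

# usage[c][k] = resource usage attributed to CE c during
# interval [k*DELTA, (k+1)*DELTA); together these form U_c(t).
usage = defaultdict(lambda: defaultdict(float))

def charge(ce: str, timestamp: float, amount: float) -> None:
    """Attribute `amount` of resource usage observed at `timestamp` to CE `ce`."""
    interval = int(timestamp // DELTA)
    usage[ce][interval] += amount

charge("CE_A", 1.2, 0.8)   # falls in interval 0
charge("CE_A", 6.0, 0.4)   # falls in interval 1
charge("CE_B", 1.9, 0.5)   # falls in interval 0
```

Every monitoring and inference mechanism described in the rest of this chapter ultimately feeds attributions of this form.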
4.3 Our Approaches
Any accounting solution must have two elements: (i) local monitoring and (ii)
collective inference. We use the phrase “local monitoring” to refer to facilities
within each server that record events and statistics pertaining to the resource us-
age of (or on behalf of) each CE. E.g., in resource containers, local monitoring is
carried out by the server operating system that is modified to identify resource
allocation/scheduling events (e.g., when threads are scheduled/descheduled on the
CPU) and using this information to charge their usage to appropriate contain-
ers [14]. We use the phrase “collective inference” to refer to functionality that is
needed to combine the pieces of information offered by local monitoring to create a
correct overall picture of accounting. Since resource containers are only concerned
with a single server, collective inference is trivially realized from the monitored
data. Distributed resource containers must address a more complicated version
of collective inference, and it does this by augmenting the locally monitored data
within each server with the identity of the distributed container (carried within
messages exchanged between container components) that they correspond to [47].
As argued in Section 4.2, both local monitoring and collective inference need to
be reconsidered for servers running shared services. There exist a large number of
techniques and tools for local monitoring that one could choose from. In particular,
these existing techniques span a wide spectrum of the “level of detail” they offer at
the cost of generality, application intrusiveness, and overheads posed. At one end
of this spectrum are techniques that can instrument user-space and OS/VMM code
to create a very detailed record of a shared service’s resource usage that contains
sufficient information for collective inference [94]. At the other end of the spectrum
are CE-oblivious resource usage reporting tools that rely on information available
within the server’s OS and VMM, e.g., top and iostat. As we will empirically
show in Section 4.5, collective inference that relies on data offered by these tools
can have significant inaccuracies in accounting. Furthermore, as we will find, such
inference can be extremely sensitive to a variety of system properties and environ-
mental conditions, an undesirable feature. Although our results will be based on
a specific inference technique, we argue that the root cause of these inaccuracies
is the inadequacy of information contained in the monitoring information offered
by these tools, and even more sophisticated inference techniques relying on such
information would falter.
Generally speaking, collective inference is a statistical learning problem that
must derive models that can meaningfully tie together the data provided by lo-
cal monitors, possibly filling in any gaps or discrepancies within these data. The
efficacy of such inference crucially depends upon the resource usage phenomena
collected by local monitoring elements. Existing monitoring tools that are not
application-intrusive have been designed for information collection at the gran-
ularity of OS/VMM-relevant abstractions (e.g., threads, TCP connections) that
may not coincide with the needs of our accounting. Consequently we identify the
following design principle that underlies our accounting solution: our local mon-
itoring must explicitly capture information pertaining to resource usage on behalf
of CEs to allow accurate accounting by our collective inference.
Although we believe our design principle is the right direction to follow, we do
not dismiss the possibility that there might exist some advanced technique, yet
unknown to us, that performs resource accounting as well as or better than our
solution. However, finding such a technique is not easy and requires strong ex-
pertise in the related fields. The effort of finding and applying such a technique may
easily outweigh the benefits of using it.

Figure 4.2. Overall architecture of the sAccount implementation.

We regard the resource accounting
problem as involving a trade-off of effort between the “local monitoring” and the
“collective inference” stages. Our approach can be viewed as investing more effort
in “local monitoring” to gain simplicity at the “collective inference” stage
as well as accuracy of resource accounting. A yet-unknown technique, on the
other hand, can be considered as paying a large effort in the “collective inference”
stage in return for relatively small effort during the “local monitoring” stage. We
believe that our trade-off of effort between the two stages is more
beneficial than the inference-heavy alternative. Even if such an advanced col-
lective inference technique were found, our approach would still benefit it by
delivering monitoring data that are fine-grained and richer in information than the
monitoring data obtained from common system monitoring tools.
4.4 Design and Implementation: sAccount
Throughout this section, we first present the general ideas underlying our solu-
tion. We follow the description of each key idea with details of how we implement
it within our prototype accounting system. Our prototype, called sAccount, com-
prises a cluster of up to 10 Dell PowerEdge SC1425 servers, each with 2GB RAM and
a 1Gbps network. Each server runs a modified Xen 3.1.4 hypervisor within which we
implement our local monitoring facilities. Additionally, a dedicated server receives
monitored data from all others and runs our collective inference that yields the
accounting information. Figure 4.2 presents the overall schematic for our sAccount
prototype.
4.4.1 Local Monitoring
There are two key aspects to the local monitoring that we need to perform at
each server: (i) recording information needed to identify the sets of used servers
for each CE and the structure of resource accounting trees within each server, and
(ii) identifying and recording information about resource principals and scheduling
events of interest. In what follows, we discuss these two issues.
4.4.1.1 Identification of S and T
General Design Considerations: We need to answer the following basic ques-
tion: for a given pair of a CE c and a server machine m, does c make use of any
resources on m during the monitoring interval of interest? Recall from Section 4.2,
that the real challenge in answering this question arises when c uses resources
on m indirectly, i.e., when a shared service component s running on m consumes
resources on behalf of c. How accurately the local monitoring on server m can
identify such indirect usage depends on its accuracy in recognizing the underlying
causation (i.e., some activities of c caused certain activities of s which consumed
some resources of m). When c is only “one hop away from s, the presence of direct
communication between c and s can yield this causation information. E.g., the
front-end component of the shared database service in Figure 3.1 is one hop away
from the CEs. However, identifying causation becomes trickier when the compo-
nent s is “more than one hop away” from c. An example of this is seen for the data
store component in Figure 3.1 which is two hops away from the CEs. Solving this
problem, in general, requires some form of statistical inference based on building
a probabilistic model to capture this causation, and closely related examples can
be seen in some recent work [16, 18].
Realization in sAccount: If a CE c is only one hop away from a shared service
component s, and is using its resources, the Xen hypervisor on the machine m
running s simply recognizes that m ∈ Sc(t) if it observes an IP address belonging
to c on any of its incoming messages during the interval [t, t + ∆].

Figure 4.3. Solution concept. The start and end of CPU accounting are determined by the arrival of request messages and the departure of response messages. As thread_x of VM2 sends a message to thread_A of VM1, VM1 starts to account the CPU usage of thread_A to CE1. This binding stops when thread_A sends the reply back. The CPU usage of thread_B is not charged to CE1 in between. This requires us to be able to detect thread scheduling events.

To recognize a
CE c that is more than one hop away, we rely upon ideas from our prior work on
vPath [91]. Very briefly, if one assumes that software components are constructed
using a multi-threaded architecture where: (i) a given thread is only associated
with acting on behalf of one CE at a given time, and (ii) all threads only employ
synchronous communication, then the problem of identifying causation can be
solved exactly (rather than only probabilistically as in the general case) [91]. A
more general realization could employ statistical techniques mentioned above and
is interesting future work.
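A minimal sketch of the one-hop case above: the hypervisor-level monitor keeps a map from CE-owned IP addresses to CE identifiers and marks the local server as used by a CE whenever a message from one of those addresses arrives during the interval. The names and data structures here are illustrative assumptions, not the actual sAccount implementation.

```python
# Map from IP addresses owned by each CE's VMs to the CE identifier
# (assumed to be available to the provider from VM placement records).
ip_to_ce = {
    "10.0.0.11": "CE_A",
    "10.0.0.12": "CE_A",
    "10.0.1.21": "CE_B",
}

def observe_message(src_ip: str, active_ces: set) -> None:
    """Called by the local monitor for each incoming message on server m
    during interval [t, t+DELTA]; records that the sender's CE uses m,
    i.e., that m belongs to S_c(t) for that CE."""
    ce = ip_to_ce.get(src_ip)
    if ce is not None:          # one-hop case: sender is a CE-owned VM
        active_ces.add(ce)

active = set()
observe_message("10.0.0.11", active)
observe_message("192.168.0.9", active)  # unknown sender: nothing recorded
```

The more-than-one-hop case cannot be resolved by this lookup alone, which is exactly why the vPath-style causation tracking is needed.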
4.4.1.2 Identifying Resource Principals & Scheduling Events
General Design Considerations: This aspect of local monitoring is concerned
with collecting information about when a schedulable entity begins to use a re-
source on behalf of a certain CE and when it stops doing so. The local monitoring
must record such information solely based on what the hypervisor can observe or
discover about the resource principals on that server, and the events correspond-
ing to their scheduling. For a server that is being directly used by a CE, this
may be relatively straightforward. E.g., the hypervisor can simply use the per-
VM scheduling information that it has access to. Additionally, one may consider
collecting information to enable accounting of resources consumed by systems soft-
ware (e.g., the hypervisor itself, privileged VMs that handle significant portions
of IO virtualization in many systems, etc.) on behalf of the CEs, similar to such
accounting in resource containers [14].
For a server machine m indirectly used by a CE c (i.e., running a shared service
component s exercised by c), we need to identify CPU (de-)scheduling events within
the software of s that correspond to durations for which s was using the CPU on
behalf of c. Identification of any IO activities initiated during these same periods
allows for accounting IO bandwidth usage by s on behalf of c. Figure 4.3 gives
illustrative examples of these ideas.
Realization in sAccount: How completely and accurately the ideal local mon-
itoring described above can be realized depends crucially on certain aspects of the
software architecture employed by the shared service component s in question. In
the sAccount prototype, we assume that shared service components employ the
prevalent multi-threaded concurrency architecture. In this architecture, subsets of
existing threads cater to each CE using s. Furthermore, each individual thread
caters only to a single CE at any given time, although this mapping itself can be dy-
namically changed by the application’s scheduling policy. With this architecture, it
becomes possible to observe and record relevant scheduling events accurately from
within the Xen hypervisor in the following manner: (i) context switching points
within the VM hosting s correspond exactly to events when processing on behalf
of a particular CE begins/end, (ii) context switches, despite being performed by
the guest/VM kernel, trap to the hypervisor due to the paravirtualized nature of
the Xen that we use, allowing its local monitoring facility to precisely record them,
(iii) our causation establishment technique, described earlier, allows us to correctly
keep track of dynamically evolving binding between threads comprising s and the
CEs that they act on behalf of. Note that the reliance on paravirtualization for
(ii) is not a significant shortcoming and can be overcome even in a system with a
fully virtualized Xen hypervisor (e.g., written for Intel VT). An example technique
for this is based on the following modification to such a hypervisor: the hypervisor
clears the PRESENT bit in the PTE corresponding to the stack of the thread whose
context switching it wishes to intercept, so that the resulting page fault traps to the
hypervisor whenever the guest kernel switches to that thread's stack.
Once CPU intervals used on behalf of a CE are identified, IO activities initiated
during these intervals are marked as corresponding to that CE. Network-related
monitoring is based on tracking the system calls related to network activity. System
calls such as READ/WRITE and RECV/SEND trigger network usage. The byte counts
returned by these system calls are interpreted as network bandwidth usage and
accumulated to the corresponding CEs. Disk I/O-related monitoring follows a
similar principle to network accounting. The system calls to track are READ and
WRITE; note that these two system calls are also used for network reads and writes.
4.4.2 Collective Inference
Given the extensive information that our local monitoring gathers, collective infer-
ence for accounting CPU and network/disk IO bandwidth essentially boils down
to simple aggregation of the resource usage information collected by various local
monitoring units.
CPU Accounting: CPU accounting is done by measuring CPU cycle counts
between the start and end of each thread segment. Cycle counts are accumulated
into the corresponding CE's CPU accounting variables. A thread segment that is
not labeled is accounted as ‘unaccountable’ (see Figure 4.9(b) and Figure 4.15(a) for
examples). The ‘unaccountable’ quantities tell us the possible range of errors in
CPU accounting.
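This aggregation can be sketched as below. The encoding of a segment as a (label, cycles) pair is an illustrative assumption, not the implementation's data layout:

```python
def account_cpu(segments):
    """segments: list of (ce_label_or_None, cycle_count).
    Labeled cycles accumulate into per-CE totals; unlabeled segments go into
    an 'unaccountable' bucket that bounds the error of the per-CE totals."""
    totals, unaccountable = {}, 0
    for ce, cycles in segments:
        if ce is None:
            unaccountable += cycles
        else:
            totals[ce] = totals.get(ce, 0) + cycles
    return totals, unaccountable
```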
Network Accounting: Any network-related activity between the start and end of
a thread segment is accounted to the currently identified CE. We observe network-
related system calls such as recv or send and add up the returned byte counts to
determine how much network bandwidth has been consumed. Note that this quantity
does not include bandwidth consumed by protocol-specific overheads
such as retransmissions and the various header/trailer portions added across the
protocol stack.
Disk I/O Accounting: Unfortunately, disk I/O accounting has limitations.
Due to nondeterminism introduced by the page cache and block I/O handling
mechanisms within the kernel, it is not possible to accurately identify the block
traffic caused by each thread segment. From the application system
calls, we can only know how many bytes are requested to be read or written. How-
ever, we cannot determine exactly which part of those translates to actual requests
to the device. This forces us to resort to inference techniques based on the information
collected by sAccount.
Disk READ traffic: When a thread issues a disk READ I/O request, it results in
either a page cache hit or a miss. In the case of a hit, the system call returns quickly.
In the case of a miss, the system call has to block until the data is fetched from the
physical device, so its latency is significantly higher. By measuring the latency
of individual system calls, we are able to identify the disk READ I/O requests that
triggered actual disk I/O traffic. We count the reads that missed the
page cache issued by each thread segment and use this ratio among CEs to divide
the actual (read) block traffic observed at the storage device under the control of
the hypervisor.
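The read-accounting inference might look like the following sketch. The latency threshold and data layout here are hypothetical; a real deployment would calibrate the hit/miss threshold to the system at hand:

```python
def account_disk_reads(read_latencies_us, device_read_bytes, miss_threshold_us=100.0):
    """read_latencies_us: {ce: [READ system call latencies in microseconds]}.
    Reads slower than the threshold are treated as page cache misses; the block
    read traffic observed at the device (under hypervisor control) is divided
    among CEs in proportion to their miss counts."""
    misses = {ce: sum(1 for lat in lats if lat > miss_threshold_us)
              for ce, lats in read_latencies_us.items()}
    total_misses = sum(misses.values())
    if total_misses == 0:
        return {ce: 0.0 for ce in read_latencies_us}
    return {ce: device_read_bytes * m / total_misses for ce, m in misses.items()}
```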
Disk WRITE traffic: Disk WRITE I/O requests do not exhibit a latency differ-
ence between page cache hits and misses. All write I/Os hit the page cache (unless
page cache eviction is triggered) and are destaged in bursts later. Because of this
delayed destaging and block I/O coalescing, sAccount is unable to precisely account
for disk WRITE I/O requests. Division by ratio, as used in the READ case, is not
possible because the locality of each thread's WRITEs may differ. In this case, we
have no choice but to use inference techniques.
4.4.3 Implementation of sAccount
Our environment is based on Xen [95] virtualization. We have deployed paravirtualized
Xen on 30 dual-CPU blade servers divided over two racks. The dis-
tributed applications we use in our evaluations are all hosted within indi-
vidual virtual machines on these blade servers, and they communicate over 1 Gbps
Ethernet. In our environment we focus on performing resource accounting for
three resource types (CPU, network bandwidth, and disk bandwidth). However,
our resource accounting is not limited to these resources. The basic principles of
the technique, based on observing resource consumption at thread granularity,
can be extended to other types of resources that servers might use. For example,
memory bandwidth, memory space, and/or SAN networks/storage space can be
subjected to the sAccount framework.
Figure 4.2 depicts the overall architecture of sAccount and its relevant components.
At each physical host, we have modified the Xen hypervisor to add functionality
for system call entry/exit interception, kernel thread switch interception, and VM
scheduling events. This information is delivered to Dom-0 and recorded through
a modified version of xentrace. The output of xentrace from multiple hosts is
transferred to a central location, labeled as ‘Accounting Node’ in the figure, and this
node runs the parsing and analysis algorithm to generate a time series of resource
usage per physical host.
Overhead of Running sAccount: It is important that sAccount incurs small
runtime overhead in order to be practical. In our current implementation, the
overhead that can potentially impact the guest VM or user application comes
from the part where we collect the trace information. Once trace information is
collected at each individual hypervisor, it is transferred to a separate server for
processing, so the execution overhead of running the accounting algorithm does
not impact the guest VMs.
In order to assess how much overhead the tracing mechanism creates, we have
conducted the measurements in the following way. First, we created a user-level
application that would generate 8 million system calls of one type in a tight loop.
Then, we measured total running time of the application with and without having
the xentrace mechanism enabled. The average elapsed time was measured to be
5.62 seconds when xentrace was not enabled, and 5.91 seconds when enabled. This
shows us that the overhead of the xentrace mechanism is about 5.2%. Note that this
level of overhead is observed only when the application does nothing but issue system
calls back-to-back. Since real applications execute many other instructions
between system calls, we expect the effective overhead to be significantly
less than 5.2%.
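As a quick check of the reported figure, the overhead follows directly from the two measured times:

```python
# Relative slowdown of the syscall microbenchmark with xentrace on vs off.
baseline_s = 5.62   # xentrace disabled
traced_s = 5.91     # xentrace enabled
overhead_pct = (traced_s - baseline_s) / baseline_s * 100
print(round(overhead_pct, 1))  # about 5.2
```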
4.5 Evaluation
In this section we evaluate the efficacy of sAccount and compare it against a
baseline called LR (described below). In the interest of space, we do not report
results about the aspect of sAccount dealing with inferring and keeping track of
dynamically changing sets of used servers for different CEs. Instead, we restrict
our attention only to the most interesting aspects of the resource accounting tree
within one particular server belonging to the shared service in question.
Our Baseline Accounting Technique (LR): Our baseline is based on a linear
regression model relating the per-CE resource usage to the inbound network traffic
volume from each CE. In order to account the CPU usage of a server to n different
CEs, one can use the volume of inbound network traffic from each CE as the X input
and the CPU utilization of the server reported by top as the Y. Assuming linearity
between X and Y, we can solve AX=Y for the coefficient vector A. The coefficients can
then be interpreted as the contribution of each client to the server's CPU usage.
For one measurement data point we can think of forming the following equation:

a_1 x_i^1 + a_2 x_i^2 + \cdots + a_n x_i^n = y_i    (4.1)

where x_i^n represents the measured volume of input traffic to the front-end of the
shared service cluster induced by CE_n at time i, and y_i the aggregate resource
usage measured by a system utility for the target resource (e.g., CPU utilization
from top). At time i, the coefficients a_1, a_2, ..., a_n are interpreted as how much each
CE has contributed to the resource usage. Therefore, negative coefficient values
are undefined.
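For illustration, the LR baseline can be sketched as an ordinary least-squares solve of the normal equations. This is a self-contained sketch under our own naming (`fit_lr`); the actual baseline may use a standard statistics package, and negative recovered coefficients would be discarded as undefined:

```python
def fit_lr(X, y):
    """Least-squares fit of y ~ X*a via the normal equations (X^T X) a = X^T y.
    Each row of X holds per-CE traffic volumes at one time step; y holds the
    aggregate resource usage (e.g., CPU utilization) at that step."""
    n = len(X[0])
    # Build the normal-equation system A a = b.
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    a = [0.0] * n
    for i in range(n - 1, -1, -1):
        a[i] = (b[i] - sum(A[i][j] * a[j] for j in range(i + 1, n))) / A[i][i]
    return a
```

With noise-free inputs the recovered coefficients equal the per-CE contributions exactly; with real measurements, the quality of the input-output correlation determines the accuracy, as the evaluation illustrates.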
4.5.1 Accounting Accuracy for a Synthetic Shared Service
Experimental Setup: Fig. 4.4 shows the design and configuration of a synthetic
shared service we employ. We use a two-tiered design for the shared service with
the front-end acting as a caching tier. Cache misses in the front-tier result in work
generated at the back-end. Multiple clients send requests to the front-end during
long-lasting sessions and correspond to our synthetic CEs.

Figure 4.4. Design and configuration of our synthetic shared service and the CEs exercising it.

We define operation types offered by the server whose resource consumption we extensively measure
offline to construct the “ground truth” about their resource needs.
Experiment Design and Key Findings:
Bursty vs. Non-Bursty Workload: Figures 4.5(a)-(b) compare the effect of bursti-
ness/variance in workload on the accuracy of CPU accounting. For different values
of the average request rate imposed on the shared service by a group of three charge-
able entities CE1, CE2, CE3 (which create different CPU utilization levels at the
server), we pick a ”non-bursty” scenario where the requests are uniformly spaced
in time, and a ”bursty” scenario where the request inter-arrival times follow log-
normal(0,1.0) distribution. We find that the efficacy of LR varies depending upon
the extent of variation within the imposed workload. This is in line with results
known in existing work that find non-stationarity in workloads useful for certain
kinds of prediction and modeling [57]. Intuitively, better accuracy is achieved with
bursty workloads because the higher variety/dynamism in the input data supplies
more information to LR; we expect this basic insight to apply to any statistical
inference technique for accounting. For a less bursty workload, a large part of
the input data may be redundant and not offer new information to an inference
technique. On the other hand, by virtue of its direct measurement of relevant phe-
nomena, sAccount is able to achieve accurate accounting that is robust to changes
in such workload conditions. We also find the accuracy of LR to be sensitive to
the overall CPU utilization level at the shared server, although we do not see a
clear pattern. sAccount performs well in all utilization regions, offering less than
1% error, which may be particularly desirable in the high-utilization regions.
(a) Steady (less bursty) workload  (b) Highly bursty workload
(c) Correlation of steady workload  (d) Correlation of bursty workload

Figure 4.5. Impact of burstiness and shared service resource utilization on the accuracy offered by sAccount versus LR. We use our synthetic shared service along with three chargeable entities. We compare the percentage error in CPU accounting for sAccount and LR. We label the errors for our three chargeable entities with LR as CE1, CE2, and CE3, respectively, and label their average as “LR Average.” In all cases, the accounting information offered by sAccount shows less than 1% error (we plot the average of error for the three CEs).
Caching/Buffering: We find that caching and buffering - both valuable and preva-
lent performance enhancement techniques - affect the accuracy of accounting of
a technique like LR. As is well-known in general, a cache within or in front of a
service can destroy/distort correlations between its incoming request/traffic events
and the workload imposed on its underlying server in complex ways. Buffering of
requests/traffic can also have a similar effect by modifying the time lag between an
event (e.g., the issuance of a request) and its cause (e.g., the actual servicing) in
complicated ways. We carry out experiments where we vary several factors affect-
ing the degree and nature of caching within the front-tier of our shared service (see
(a) Effect of Caching  (b) Effect of number of CEs

Figure 4.6. Impact of caching and number of CEs on the accuracy of resource accounting of (a) network and (b) CPU for our synthetic shared service. The number of CEs is three in (a). We plot the average error across all CEs and the standard deviation.
Figure 4.4): request size (fixed or varying), read/write ratio (from 10:1 to 1:1), tem-
poral locality (non-existent to very high), and the extent of common/overlapping
content requested by the chargeable entities. Of the large parameter space, we
present results for the following four kinds of workloads imposed on the shared
service: (i) w1, with a choice of factors that we expect to offer minimal caching
gains, (ii) w2, with a choice of factors that we expect to offer significant caching,
(iii) w3, moderate gains between the last two, and (iv) w4, a workload that uses
these three in succession.
Figure 4.6(a) shows results for the accuracy of accounting the network band-
width used for communication between the front-tier (cache) and back-tier of our
shared service. The chargeable entities impose a bursty workload (as described
earlier) that imposes an average CPU load of 50% on the server hosting the front-
tier. The x-axis of Figure 4.6(a) indicates the length of the data window fed into
LR. For example, ‘1 min’ means that we used 60 data points, since we collected
measurements every second. If the total run is 20 minutes, then we perform
20 LR computations for the ‘1 min’ case and 4 LR computations for the ‘5 min’ case. The
longer the data window used for LR, the better the accuracy becomes. The graph also
shows that caching has a large impact on accuracy. Accuracy gains from the burstiness
of the workloads can easily be offset if the application happens to employ some form
of caching structure internally.
Varying Number of Chargeable Entities: Figure 4.6(b) shows the effect of the
number of chargeable entities on the accuracy of LR and sAccount. We observe that
the accuracy of LR (both average and variance) deteriorates as the number of
chargeable entities grows whereas the accuracy of sAccount is unaffected.
Summary of Key Findings: To summarize, we find that the efficacy of LR
relies upon both the quality of the data it gathers and the presence/extent of cor-
relation between its inputs and outputs. Even when accurate data can be obtained
(as with our implementation of LR), several factors including (i) inherent workload
properties (e.g., variance, temporal locality, intensity), (ii) system mechanisms and
algorithms (e.g., caching or buffering), and (iii) environmental conditions (e.g., de-
gree of resource interference from other software) can weaken such correlation and
degrade the accuracy of the accounting technique. We find empirical evidence that,
owing to its ability to directly measure relevant phenomena accurately, sAccount
is robust to such effects, and offers high-accuracy accounting information across a
wide range of operating conditions.
4.5.2 Accounting for Real-world Services
In general, we cannot expect to determine a real-world application’s actual resource
usage on behalf of different chargeable entities without resorting to extensive ap-
plication and OS instrumentation. Consequently, unlike for our synthetic shared
service, we cannot obtain/present a direct comparison of the efficacy of our tech-
niques, i.e., distance of the accounting information offered by sAccount versus that
offered by LR from the ”ground truth.” We present results for the accounting of
the most bottlenecked resource for the shared service, which we find to be CPU
cycles for MySQL and network bandwidth for HBase.
4.5.2.1 Clustered MySQL as the Shared Service
Experimental Setup: Figure 4.7 shows the set-up of our MySQL cluster that
is used as a shared service by three CEs. Two of these CEs use the TPC-W
benchmark [83] to generate workload for the database, while the third CE uses
RUBiS [87]. The cluster consists of a front-end SQL node that interacts with
the chargeable entities, three data nodes, and a management node; each node is
Figure 4.7. Shared MySQL cluster setting. Three CEs labeled CE1, CE2, and CE3 share this database service.
hosted within its dedicated server. One interesting aspect of the cluster’s operation
is that even in the absence of any workload imposed by the CEs, a large number of
small messages are exchanged between all pairs of nodes within the cluster. These
messages are liveness checks. The CEs house separate/non-overlapping data
within the database, which is spread across the three data nodes, and the cluster uses a
replication degree of 1.
Experiment Design and Key Findings: Given that exact accuracy numbers are
elusive, we compare the efficacy of sAccount and LR in the following online re-
source control situation: we wish to ensure that when the aggregate workload
imposed upon the MySQL cluster causes its server CPUs to saturate, we identify
the contribution of various CEs to this “overload,” and then enforce targeted CPU
throttling only to the CE causing the overload.
We implement a CPU policing mechanism within the Xen hypervisors of the
MySQL cluster servers which manipulates the rate at which timer interrupts are
delivered, similar to the idea of time dilation [96], to the guest VM only when the
thread serving the CE causing the overload is to be scheduled. Two parts are
implemented in the Xen hypervisor to achieve this CPU-controlling mechanism.
First, there is a set of variables, one per chargeable entity, for accumulating
CPU usage. Second, there is a mechanism for enforcing the desired amount of
resource consumption by the chargeable entities. The method we use to control the
per-CE CPU consumption is based on manipulating the rate of timer interrupts
being delivered to the guest VM. Delivering timer interrupts to a VM (specifically
the guest kernel) has the effect of speeding up the notion of time flow from the
guest kernel's viewpoint. This, in turn, triggers scheduler invocation at a
faster pace. Our strategy is to deliver timer interrupts faster when the VM is
about to schedule a thread that is currently serving the chargeable entity whose
resource consumption we want to limit. When any other innocent thread is
about to gain the CPU, we switch back to the normal rate of timer interrupts.
This strategy requires us to detect which thread is scheduled to run next,
which we have also implemented in the Xen hypervisor. Our version of Xen,
32-bit paravirtualized Xen, supplies a point at which every thread context switch
occurring within the guest kernel is intercepted. A thread context switch in the
guest kernel traps because the stack switch requires modifying the kernel stack pointer
stored in the TSS (Task State Segment), which is a privileged operation. The guest
kernel, running in ring 1, cannot modify the TSS; that is only allowed in ring 0.
Executing this strategy requires a systematic way of determining the timer rate.
The default interval between timer interrupts is 10 ms in our settings, and the
guest VM expects to receive one timer interrupt every 10 ms as well. By setting this
interval to a smaller value, we can issue timer interrupts at a faster rate. For
example, setting the interval to 1 ms would generate timer interrupts 10 times faster,
and the guest VM would invoke the thread scheduler every 1 ms, thinking that 10
ms had already passed. In order to decide on a value for this timer interval,
we adopt an approach similar to those used in FAST TCP [97] and PARDA [98]. Let this
timer interval be denoted by w_t^i at time t for CE_i, with 1 < w_t^i < 10. Additionally,
let L^i be the desired resource usage level of CE_i, and let L_t^i represent the observed
resource usage level of CE_i at time t. The equation below is used to determine the timer
interval w_t^i from the observed resource usage of CE_i, where \gamma is a smoothing
factor:

w_t^i = (1 - \gamma) w_{t-1}^i + \gamma (L^i / L_t^i) w_{t-1}^i    (4.2)
Our choice of this policing mechanism is only for demonstration purposes, and in
practice a more sound technique would be desirable.
We configure our CEs to impose a dynamically changing workload (consisting
of three phases) on MySQL as described at a high-level in Table 4.2. In phase 1,
all CEs generate a low-intensity workload, whose aggregate does not saturate the
Phase    Time Window   Workload                                    Top User
Phase 1  0-400s        All 3 CEs generate light loads              CE2
Phase 2  400-600s      CE2 starts to issue CPU-heavy requests      CE2
Phase 3  600-1200s     CE2's workload overwhelms CPU,              CE2
                       load increases every 100s

Table 4.2. Description of how the workload imposed by the three CEs is varied over the course of our experiment with the MySQL cluster as our shared service.
MySQL servers. During phase 2, starting at t=400s, CE2 starts issuing more CPU-
intensive requests. We are interested in observing how LR and sAccount handle
this sudden change of behavior. Finally, in phase 3, starting at t=600s, CE2 issues
continually increasing workload that causes the CPUs to saturate. Here, we are
interested in observing how our simple resource control performs based on the
accounting information offered by LR and sAccount.
Since we do not have precise knowledge about true resource consumption, we
engineer the workloads so that the CPU consumptions imposed by the CEs are
significantly different from each other, allowing us to rank their contributions with-
out ambiguity. For example, we make the CPU consumption of CE2 much larger
than that of the others starting at t=400s so that other CEs cannot be mistaken for heavy CPU
consumers. We begin by taking an in-depth look at the CPU accounting informa-
tion offered by LR and sAccount at one of the MySQL servers (SN) and how it
evolves during phases 1-3 (results for CPU accounting of other MySQL nodes are
qualitatively similar and we do not present them in the interest of space).
In Figures 4.8(a),(b), we depict the inputs for LR (per-CE network traffic and
aggregate CPU usage at SN’s server) and its output (accounting information for
each CE), respectively. These figures are helpful to consider in combination with
the following discussion of LR’s accounting and its comparison with sAccount.
Figures 4.9(a),(b) show CPU accounting for SN’s server as carried out by LR
and sAccount, respectively. We use a ”stacked” representation, where the area
under the curve corresponding to a CE represents the CPU usage charged to it.
During phase 1, both LR and sAccount produce correct rank orders of CEs, al-
though LR slightly overestimates the CPU consumption for CE2. However, during
phase 2, LR starts to report incorrect rank order: it determines CE3 to be the
(a) Inputs to LR accounting.  (b) CPU usage pattern of 3 CEs from separate runs.

Figure 4.8. Network traffic and individual CPU utilization time series. Graph (a) shows the network traffic exchanged between SN and each of our three CEs, and forms part of the input to LR. Graph (b) presents the CPU usage at SN induced by each of the CEs when it runs separately from the other CEs as part of the offline profiling that we do. These usages serve as our estimate of the ground truth for the CPU usage each of the CEs induces in the actual experiment. The resource accounting results of LR and sAccount should be compared with (b) to see how far from this estimated ground truth they are.
cause of the increased CPU usage. Upon investigating the reason for this mistake
by LR, we find the following. While CE2 issues CPU-heavy requests and waits for
MySQL’s response, CE3 continues to issue requests at a relatively high rate that
are not CPU-heavy. However, the higher rate of requests coming from CE3 causes
LR to infer spurious positive correlation between CE3’s requests and SN’s CPU
usage. In fact, LR is unable to correct this throughout phase 2.
As we show in Figure 4.9(b), besides correctly identifying the correct rank order
(a) Result of CPU accounting using LR. This is a stacked graph.
(b) Result of CPU accounting using sAccount. This is a stacked graph.

Figure 4.9. Comparison of CPU accounting results. CPU usage of the MySQL Cluster SQL node is being accounted. In (a) the accounting starts at time 200 since LR needs to collect some amount of data. By comparing the areas of equivalent color we can see the rank order determined by each technique as well as the accuracies. Please compare with Figure 4.8(b) to see the true CPU consumption. The result of sAccount includes the ‘unaccountable’ portion. This can be divided among chargeable entities by any reasonable policy.
in its accounting, sAccount also reports what portion of the CPU usage of SN’s
server it finds unaccountable. This amount indicates that sAccount’s algorithm was
unable to charge the given thread’s resource usage to any of the chargeable entities
because no direct association was found. This can happen if some thread is
spawned independently of input requests from the chargeable entities and performs
maintenance jobs. Or, it could be due to the nature of a thread that is created to
service other running threads. In any case, sAccount reports this resource usage
to the user, and it is up to the user to divide it up among the chargeable entities. The
most reasonable division would be to split the ‘unaccountable’ portion according
to the proportion of resource usage by each chargeable entity within that time
window.
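The proportional split suggested above amounts to the following sketch (the policy choice itself is left to the user; the function name is ours):

```python
def divide_unaccountable(per_ce_usage, unaccountable):
    """Charge the 'unaccountable' quantity to CEs in proportion to their
    accounted usage within the same time window."""
    total = sum(per_ce_usage.values())
    if total == 0:  # no accounted usage at all: fall back to an even split
        share = unaccountable / len(per_ce_usage)
        return {ce: usage + share for ce, usage in per_ce_usage.items()}
    return {ce: usage + unaccountable * usage / total
            for ce, usage in per_ce_usage.items()}
```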
During phase 3, starting at t=600s, CE2 starts to saturate the CPU by drastically
increasing the workload it imposes as described in Table 4.2.

Figure 4.10. Response time change of RUBiS application. This graph shows the development of RUBiS response time for two cases - throttled by LR, and controlled by sAccount. Since LR picks the wrong CE (CE3) as the culprit for performance degradation, throttling the request rate of CE3 is ineffective. However, the sAccount technique shows noticeable effects. The moving average of response time under sAccount indicates that sAccount can contain the performance interference.

As portions
of Figures 4.9(a) and (b) for this phase show, LR continues to perform incorrect
accounting. This has a detrimental effect on our CPU policing based resource
control. Figure 4.10 shows the change of response time for CE3 whose RUBiS ap-
plication is accessing the shared MySQL Cluster service. Starting from time 600,
the response time increases. We have set the response time of 300 ms as the initial
warning level and 600 ms as the SLA violation level. The CPU saturation caused
by CE2 continues to degrade the response time of RUBiS and eventually it violates
the SLA. Since LR determines that CE3, not CE2, is the source of overload (see
Figure 4.9 (a)), CE2 is not marked for any counter actions. However, sAccount is
able to identify the true cause of the overload and, starting at time 730, it initiates the
CPU throttling for CE2. Figure 4.10 indicates that the moving average of response
time under the control of sAccount is able to contain the response time below the
SLA limit. This demonstrates one promising capability of sAccount (namely, its
thread-level monitoring technique) in the critical resource management of such shared
services.
4.5.2.2 HBase as the Shared Service
Experimental Setup: Our second real-world shared service is HBase, a key-
value storage offering an open-source implementation of Google’s Bigtable [36],
Figure 4.11. Our setup for using HBase as a shared service. Our CEs are based on client programs that use the YCSB workload generator.
that has significantly different resource usage characteristics from a database such
as MySQL. An HBase cluster consists of “region servers” and the HBase “master”
that manages these region servers. HBase stores its data in a Hadoop cluster,
and this is a great example of a shared service (HBase) relying upon another
shared service (Hadoop) to cater to the needs of CEs using it. The region servers
act as in-memory caches for the contents of the data nodes. A region server
employs an HDFS client [99] to communicate with the Hadoop cluster that stores
data persistently. HBase employs Zookeeper [100] for coordinating distributed
operations and locating region servers. We configure our HBase with a single region
server. HBase operation involves significant data transfer from the data nodes to
the region server, whereas the CPU load imposed by most requests is small as most
requests are for simple data retrieval or inversions. Since the network bandwidth
available to the region server becomes the bottleneck resource well before the CPU,
we highlight accounting results for network bandwidth more. Our HBase caters to
requests from two CEs derived from the YCSB workload generator [101], which we
refer to as CEA and CEB.
Experiment Design and Key Findings: We run an experiment lasting 500
seconds, during which the loads offered by CEA and CEB are varied as follows:
(i) during t=0 to t=100s, both CEA and CEB generate identical workloads which
contain 5% update requests, (ii) at t=100s, CEB changes to a read-intensive mode
with good temporal locality, which incurs high hits in the region server causing its
Figure 4.12. Evolution of the network traffic incoming to the region server from the two CEs during the run. Both CEA and CEB start out by sending similar requests to HBase during t=0 to t=100s, implying the network bandwidth and CPU usage of the region server should be accounted equally to them for this period. CEB changes its behavior at t=100s, whereas CEA changes its behavior at t=200s.
CPU usage to increase proportionally with the network traffic sent to CEB, (iii)
at t=200s, CEA starts to issue CPU-intensive insert-type requests that cause the
CPU usage at the region server to increase. Figure 4.12 shows the network traffic
size inbound to the region server from the two CEs under the described workload
scenario.
Figure 4.13 shows the profiling results of individual runs for both the CEA and
CEB chargeable entities. Unlike MySQL Cluster, where there can be several different
request types, each with varying resource demands in terms of CPU and network,
HBase behaves in a much simpler way. HBase's request types are not diverse in
terms of resource consumption, because HBase is a simple key-value store that does
not usually carry out complex logic such as joins and sub-queries. As a result, for
both CPU and network, the usage shows trends very similar to the client input
traffic.
Summary of Results: Figure 4.14 (a) and (b) show the results of accounting the network bandwidth by LR and sAccount at the region server of HBase. The inputs to LR are the two time series of inbound network traffic from the two chargeable entities shown in Figure 4.12. We configured LR to use 100 seconds of data as its input window for this HBase resource accounting, so it produces no accounting results
(a) CPU usage at the region server for the two CEs. (b) Network input traffic from the two data nodes to the region server.

Figure 4.13. Profiling measurements of CPU and network resources at the region server for CEA and CEB. These are obtained from individual (not combined) runs of each chargeable entity. They are intended to serve as an estimate of the true resource usage when analyzing the accuracy of the resource accounting results.
for the first 100 seconds of the run. The accounting results from sAccount are presented in Figure 4.14 (b). In both stacked graphs, the red area (the topmost area) corresponds to the portion of network bandwidth used by CEB, and the blue area (the second from the top) to the portion used by CEA. From Figure 4.13, we know that the network bandwidth usage of CEA should drop significantly from t=200s, making CEB the heavier consumer of the network bandwidth. Unfortunately, the result of LR does not reflect these true values: LR reports that CEA remains the dominant consumer of the network bandwidth throughout the run. In contrast, sAccount correctly reflects
(a) Result of LR on incoming traffic from the data nodes to the region server. This is a stacked graph. (b) Result obtained from sAccount. This is a stacked graph.

Figure 4.14. Comparison of resource accounting results between LR and sAccount on the outbound network traffic from the data nodes to the region server. The contour of (a), drawn in a thick line, is obtained from iptables. Note its resemblance to the overall traffic size of (b), which serves as a quick sanity check of the sAccount technique.
this resource usage by the CEs (see the size of the red area in Figure 4.14 (b) after t=200s). The misjudgment by LR can be explained by caching effects. From t=200s, the input network traffic from CEB to the region server doubles until the end of the run (see Figure 4.12), whereas the incoming traffic from the data nodes increases by only 20% (see Figure 4.13(b)). We believe this is due to internal caching at the region server, and the HBase documentation supports this conjecture. Earlier, in Figure 4.6, we saw that caching can impact the performance of LR.
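This caching effect can be reproduced with a small numerical sketch (the traffic figures and the function name below are ours, purely illustrative): once a cache decouples CEB's input from back-end traffic, a least-squares fit over the whole trace attributes all back-end traffic to CEA, mirroring LR's misjudgment.

```python
# Sketch with hypothetical numbers: why least-squares accounting misattributes
# back-end traffic when a cache decouples a tenant's input from back-end load.
def fit_two_coeffs(xa, xb, y):
    """Solve min ||a*xa + b*xb - y||^2 via the 2x2 normal equations."""
    saa = sum(x * x for x in xa)
    sbb = sum(x * x for x in xb)
    sab = sum(p * q for p, q in zip(xa, xb))
    say = sum(p * q for p, q in zip(xa, y))
    sby = sum(p * q for p, q in zip(xb, y))
    det = saa * sbb - sab * sab
    a = (say * sbb - sby * sab) / det
    b = (saa * sby - sab * say) / det
    return a, b

# Phase 1: both CEs send 10 KB/s; back-end traffic is 1 KB per input KB each.
# Phase 2: CEB doubles its input, but cache hits keep its back-end share flat.
in_a = [10.0] * 8
in_b = [10.0] * 4 + [20.0] * 4
backend = [20.0] * 8   # true split: 10 from CEA and 10 from CEB throughout

a, b = fit_two_coeffs(in_a, in_b, backend)
print(a, b)            # LR assigns all back-end traffic to CEA (a=2, b=0)
```

Because CEB's doubled input produces no change in back-end traffic, the regression finds CEB's coefficient to be zero and inflates CEA's, even though CEB truly consumed half the bandwidth in every phase.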
In addition to the network accounting results, we also present selected accounting results using sAccount. Figure 4.15 (a) shows the CPU accounting at the region server for CEA and CEB. According to our preplanned workload scenario, CEB should consume more CPU than CEA after t=200s.
(a) CPU accounting by sAccount at the region server. This is a stacked graph. (b) Network accounting by sAccount at a data node. This is a stacked graph.

Figure 4.15. Resource accounting by sAccount at various nodes of HBase. Result (b) demonstrates sAccount's ability to perform resource accounting multiple hops away from the front-end of the shared service; note that the data nodes of HBase do not communicate directly with any of the chargeable entities.

The accounting result indicates this behavior. Notice the significant portion of unaccountable CPU usage in the bottom region of the stacked graph in Figure 4.15(a) (orange color).
It includes CPU cycles consumed for communicating with Zookeeper, the HBase master, and the Hadoop Namenode. In particular, we noticed two high peaks at around 80s and 470s. From the HBase documentation, we concluded that these are most likely due to compaction activity at the region server. This nondeterminism disturbs the correct operation of LR, whereas sAccount correctly identifies these cycles as not being correlated with any chargeable entity.
Figure 4.15 (b) presents the accounting results for network usage at one of the data nodes. We present this result to provide evidence that sAccount can perform resource accounting at nodes that lie more than one hop away from the chargeable entities. This shows that our implementation can successfully propagate the causality of messages across nodes and use it for resource accounting.
4.6 Summary and Conclusion
In this chapter we have presented our solution to the provider-end resource accounting problem. Our solution, called sAccount, operates entirely within the hypervisor, giving us high transparency, and accounts for resource usage at thread-level granularity. As motivation, we provided evidence that existing solutions based on statistical inference suffer from several weaknesses, and we showed that sAccount is robust to such conditions. We evaluated our solution on synthetic TCP/IP software as well as two real-world distributed applications, MySQL Cluster and HBase. MySQL Cluster represents the popular SaaS offerings of relational databases found in today's cloud services; HBase, an open-source counterpart of Google's Bigtable, represents simple key-value storage services. Using these real-world applications, we observed that statistical techniques can mislead cloud platform operators by producing incorrect results under certain conditions. We also applied the online version of sAccount to a scenario in which sudden workload changes cause SLA violations. Compared side-by-side with the statistical inference technique on the same workload scenario, sAccount proved more responsive and more accurate in critical real-time system management scenarios.
Chapter 5

Consumer-end Decision Making
5.1 Introduction
As utility computing matures, consumers are presented with numerous service providers to choose from, each offering its own set of service options. Consumers face the problem of incorporating all of this information about cloud offerings and deciding how to host their applications in the most cost-effective way. In such decision making, it is important that consumers be aware of the cost components that potentially have a large impact on the overall costs. Consumers also need a systematic methodology, or tools, that can incorporate the various cost factors and present useful analyses within a reasonable margin of error. This chapter investigates such a methodology to aid decision making for the problem of selecting optimal hosting configurations. We first discuss a taxonomy of cost components and current cloud-based hosting options. Then we use the Net Present Value concept to analyze the cost of hosting a consumer's application over a given period of time.
The quintessential question when considering a move to the cloud is: should I migrate my application to the cloud? Although there have been several studies of this question [63, 72, 64, 102], there is no consensus yet on whether the cost of cloud-based hosting is attractive enough compared to in-house hosting. There are several aspects of this basic question that must be considered. First, although many potential benefits of migrating to the cloud can be enumerated for the general case, some benefits may not apply to my application. For example, benefits
related to lowered entry barrier may not apply as much to an organization with
a pre-existing infrastructural and administrative base. As another example, the
benefits of pay-per-use are less attractive for a well-provisioned application whose workload does not exhibit much variation. Second, there can be
multiple ways in which an application might make use of the facilities offered by
a cloud provider. For example, using the cloud need not preclude a continued use
of in-house infrastructure. The most cost-effective approach for an organization
might, in fact, involve a combination of cloud and in-house resources rather than
choosing one over the other. Third, not all elements of the overall cost consider-
ation may be equally easy to quantify. For example, the hardware resource needs
and associated costs may be reasonably straightforward to estimate and compare
across different hosting options. On the other hand, labor costs may be signifi-
cantly more complicated: e.g., how should the overall administrators’ salaries in
an organization be apportioned among various applications that they manage? As
another example, in a cloud-based hosting, how much effort and cost is involved
in migrating an application to the cloud? Answering these questions requires an
in-depth understanding of the cost implications of all the possible choices spe-
cific to my circumstances. This chapter presents our findings on the economic
aspects of application hosting in a cloud-based environment using cost analysis for
representative e-commerce benchmarks. Although we restrict our attention to a
single cloud provider, it will become clear that our methodology readily extends
to scenarios where multiple cloud providers are available to a consumer.
5.2 Background
5.2.1 Net Present Value
In financial analysis, investigating the suitability of an investment involves assess-
ing the overall costs expected to be incurred over its lifetime. From the financial
point of view, the decision of whether to migrate an application to the cloud can be viewed as an investment decision problem. The concept of Net
Present Value (NPV) [103] is popularly used in financial analysis to calculate the
profitability of an investment decision over its expected lifetime considering all the
cash inflows and outflows. Borrowing existing notation, we define the NPV of an
investment choice spanning T years into the future as:
NPV_T = \sum_{t=0}^{T-1} \frac{C_t}{(1+r)^t}    (5.1)
where r is the discount rate and C_t is the expenditure during the t-th year. The role of
the discount rate is to capture the phenomenon that the value of a dollar today is
worth more than a dollar in the future, with its value decreased by a factor (1+ r)
per year.
As a simple example to understand NPV, assuming r = 5%, consider two
choices to purchase 10 items, each costing $1,000 over a one year span: (i) buy all
today: NPV=$10,000, and (ii) buy half today, and half next year: NPV = $5,000
+ $5,000 / 1.05 = $9,761. The latter is the preferred choice here since it allows
us to spend a lower amount than with choice (i) for the same overall purchase.
Whereas the NPV definition can be enhanced to also incorporate the effect of
inflation (e.g., in case (ii) we might need more than $5,000 to buy 5 items a year
from now), we assume it to be small and ignore it in our present work.
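The calculation in Eq. (5.1) and the purchase example above can be sketched as follows (the function name npv is ours):

```python
# Minimal sketch of Eq. (5.1): yearly expenditures C_t over T years,
# discounted at rate r, with C_0 spent today (no discount).
def npv(costs, r):
    """Net present value of a list of yearly expenditures."""
    return sum(c / (1.0 + r) ** t for t, c in enumerate(costs))

# The two purchase choices from the text, with r = 5%:
buy_all_now   = npv([10_000], 0.05)        # $10,000
buy_half_each = npv([5_000, 5_000], 0.05)  # $5,000 + $5,000/1.05 ≈ $9,761.90
```

The second choice is preferred because deferring half the spending reduces its present value, exactly as the text describes.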
5.2.2 Cost Components
Hosting an application incurs various types of costs, including hardware, software, and operational costs. Each cost type has its own idiosyncrasies, requiring us to scrutinize them one by one and understand in what circumstances, and how, each cost is incurred. To better ground the following discussion, we first develop a taxonomy of costs in the context of in-house and cloud-based application hosting. We classify costs along two dimensions: quantifiability and directness of contribution to the overall costs. Quantifiability refers to whether the cost is easily representable as a dollar amount. Directness refers to whether the cost is incurred solely for the product or service under consideration.
Figure 5.1 presents our classification of costs. Certain cost components are
less easy to quantify than others, and we use the phrases “quantifiable” and “less
quantifiable” to make this distinction. Examples of less quantifiable costs include
Figure 5.1. Taxonomy of costs involved in in-house and cloud-based application hosting. Costs can be classified according to quantifiability and directness. Quantifiable costs are grouped into material, labor, and expenses. The material category roughly corresponds to capital expenses (Cap-Ex); the labor and expenses categories correspond to operational expenses (Op-Ex). In this study we focus on the quantifiable category.
effort of migrating an application to the cloud, porting an application to the pro-
gramming API exposed by a cloud (e.g., as required with Windows Azure), time
spent doing the migration/porting, any problems/vulnerabilities that arise due to
such porting or migration, etc. “Quantifiable costs”, on the other hand, can be ac-
curately quantified and we further divide these into three sub-categories: material,
labor and expenses.
Adhering to well-regarded convention in financial analysis, we also employ the
classification of costs into the “direct” and “indirect” categories based on their
ease of traceability and accounting. If a cost can be clearly traced and accounted
to a product, service, or personnel, it is a direct cost; otherwise it is an indirect cost. As
shown in Figure 5.1, examples of direct cost include hardware and software costs;
examples of indirect cost include staff salaries. It should be noted that certain
costs may be “less quantifiable” yet direct (e.g., porting an application in case
of using Platform-as-a-service cloud). Similarly, certain costs may be quantifiable
yet indirect (e.g., staff salaries, cooling, etc.). Since less quantifiable costs can be ambiguous to interpret, requiring excessive effort to quantify into dollar amounts, we focus mainly on the quantifiable costs in this work. Within the quantifiable category, we first carry out the cost analysis using the direct costs only. Next, we add indirect costs into the picture to understand what effect they have on the results of the previous analysis (Section 5.3.5). Accounting for
the indirect costs often requires organization-specific knowledge. In our analysis,
we use ranges rather than exact numbers for indirect costs to capture a wide
spectrum of scenarios.
5.2.3 Application Hosting Choices
Besides pure in-house and pure cloud-based hosting, a number of intermediate
and/or hybrid options have been suggested, and are worth considering [104]. We
view these schemes as combinations of different degrees of “vertical” and “hori-
zontal” partitioning of the application. Vertical partitioning splits an application
into two subsets (not necessarily mutually exclusive) of components - one is hosted
in-house and the other migrated to the cloud - and may be challenging if any
porting is required [104]. Horizontal partitioning replicates some components of
the application (or the entire application) on the cloud along with suitable work-
load distribution mechanisms. Such partitioning is already being used as a way to
handle unexpected traffic bursts by some businesses (e.g., KBB.com and Domino’s
Pizza [105]). Such a partitioning scheme must employ mechanisms to maintain
consistency among replicas of stateful components (e.g., databases) with associ-
ated overheads. Given myriad cloud providers and hosting models (we consider
IaaS and SaaS), there can be multiple choices for how a component is migrated
to the cloud, each with its own cost implications. In this work, we choose three
such options (in addition to pure in-house and pure cloud-based) that we describe next.
5.2.4 Determining Hardware Size Requirement
The first step in performing the cost analysis is determining the size of the hardware base for the given workload intensity. Knowing the required number of hardware
units allows us to calculate other dependent costs such as OS licenses or power
consumption. This is an important problem that is starting to receive attention
from researchers [102]. Consumers need such techniques in order to come up
with accurate cost estimates, especially when there are many service providers
to choose from. This problem is closely tied to the general problem of application
performance modeling. Once a performance model is established, it can be used to
(a) JBoss server; (b) MySQL server.

Figure 5.2. Marginal throughput measurements. Both graphs show how much throughput is gained by adding one more unit of resource. For the JBoss server (a), we observe the marginal gain from adding one more single-core server. For MySQL (b), we add one more CPU core and observe the marginal gain.
answer the question of what hardware size is required for a given performance target by exploring combinations of model parameters. Performance modeling has been extensively studied for traditional hosting environments [106, 107, 108] and is being actively studied for virtualized environments as well [109, 110]. Since the goal of this chapter is to observe and understand the effects of various cost factors on the overall costs, we employ simple techniques based on empirical measurements and interpolation, as described below. We find that inaccuracies in such simplistic techniques do not affect the findings and insights we draw from this cost analysis.
5.2.4.0.1 In-house Provisioning: We employ a cluster of servers in our lab as our in-house hosting platform, all of which have an Intel Xeon 3.4GHz dual-processor with 2GB DRAM and are connected via 1 Gbps Ethernet. To determine the number of machines required to meet a desired throughput, we empirically obtain the marginal throughput gain offered by adding an extra unit (at the granularity of a CPU core as well as a single machine) when all other tiers are well-provisioned. Assuming each unit eventually operates at relatively low utilization (i.e., it is sufficiently over-provisioned), we can use these marginal gains to predict the capacity needs of each tier for a given workload intensity. Figure 5.2(a) and (b) show the marginal throughput offered to TPC-W (i) by an extra machine for JBoss, and (ii) by an extra CPU for MySQL, respectively. From these empirical results, we obtain the capacity of each of our machines for the JBoss and MySQL tiers to be 146.6
(a) Performance comparison; (b) Latency distribution on EC2.

Figure 5.3. EC2 instance CPU microbenchmark results. Graph (a) compares the latency of finishing the same number of arithmetic operations: 3.85 s on average (std 0.014) on the 3.4GHz reference CPU versus 11.9 s (std 1.86) on EC2, i.e., the EC2 instance's CPU is roughly one third as fast as our reference machine. Graph (b) shows that the distribution of EC2 CPU bandwidth is bimodal, with modes near 9.5 s and 13.4 s.
transactions/sec (tps, henceforth) and 311.5 tps per machine, respectively. As for the Apache tier, its resource consumption was insignificant, and we estimated from observed CPU utilization that one machine could handle about 4,000 tps.
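The sizing rule implied by these measurements can be sketched as follows; the per-unit capacities are the empirical numbers above, and the ceiling-of-ratio rule assumes near-linear scaling while each unit stays at low utilization:

```python
import math

# Per-unit capacities measured empirically above: ~4,000 tps per Apache
# machine, 146.6 tps per JBoss machine, 311.5 tps per MySQL CPU.
PER_UNIT_TPS = {"apache": 4000.0, "jboss": 146.6, "mysql": 311.5}

def units_needed(tier, target_tps):
    """Units required for a tier, assuming near-linear marginal gains."""
    return math.ceil(target_tps / PER_UNIT_TPS[tier])

print(units_needed("jboss", 500))  # 4 machines
print(units_needed("mysql", 500))  # 2 CPUs
```

For a 500 tps target this yields four JBoss machines and two MySQL CPUs, while a single Apache machine suffices.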
5.2.4.0.2 Cloud-based Provisioning: We would like to find cloud-offered resources likely to offer TPC-W performance comparable with that offered in-house. The hosting options that we consider require us to do this exercise for the following: (i) Amazon EC2 instances (IaaS), (ii) Amazon RDS (SaaS) for the database, and (iii) SQL Azure (SaaS) for the database. Based on existing results, we assume that for (ii) and (iii) the cloud provider internally employs techniques to scale resources to match workload needs, and that this is reflected in the payments [111]. For (i), we must determine the resource needs and procure a sufficient number of instances. We describe our methodology for estimating this number for each tier of TPC-W, where we employ Amazon EC2's small instance type (we did not find any improvement in performance per dollar offered by the large and extra-large instances, hence we restrict our attention to small instances). Amazon EC2's small instance type claims to provide computing power equivalent to a 1.0-1.2 GHz CPU. Interestingly, the /proc/cpuinfo of such an instance shows an Intel Xeon 2.6GHz CPU. It is known that Amazon EC2 multiplexes two VM instances on one physical core, making it effectively 1.3GHz. We ran a simple microbenchmark to verify this and to establish its computing power relative to our machines. Our microbenchmark performed increment operations on an integer
variable in a loop from a single thread. We set the loop count to be 2 × 10^9
times and measured the elapsed time on both the reference machine (our lab machine with a 3.4GHz CPU) and the EC2 small instance (2.6GHz CPU). The EC2 small instance is found to operate at an effective 1.09GHz, about one third the speed of our reference machine, which matches the claim of 1.0-1.2 GHz. Using this benchmark information, we set the throughput limit of a single EC2 small instance for the JBoss and MySQL tiers to be 57.34 tps and 121.86 tps, respectively.
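The methodology can be sketched as below. The function names are ours, the loop count is reduced from the 2 × 10^9 used above so the sketch runs quickly, and a Python loop stands in for what was presumably compiled code, so only the relative-speed arithmetic, not the absolute timings, carries over:

```python
import time

# Sketch of the microbenchmark methodology: time a fixed number of integer
# increments on each machine, then scale the reference clock speed by the
# ratio of elapsed times.
def busy_loop_seconds(n):
    x = 0
    start = time.perf_counter()
    for _ in range(n):
        x += 1
    return time.perf_counter() - start

def effective_ghz(ref_ghz, ref_seconds, target_seconds):
    """Effective clock speed of the target relative to the reference."""
    return ref_ghz * ref_seconds / target_seconds

# Measured averages from Figure 5.3: 3.85 s on the 3.4 GHz reference
# machine versus 11.9 s on the EC2 small instance.
print(round(effective_ghz(3.4, 3.85, 11.9), 2))  # ≈ 1.1 GHz
```

Plugging in the measured times reproduces the ~1.09-1.10 GHz effective speed reported above.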
5.3 Analysis
We first study the effect of workload intensity and growth rate over varying operational periods. We include only the “quantifiable” and “direct” cost components in this discussion, which enables us to better understand the roles of the basic cost components and helps us avoid drawing premature or incorrect conclusions from possibly unnoticed effects of “indirect” costs. We extend our analysis to the domain of “indirect” costs in Section 5.3.5.
Figure 5.4 presents NPV cost calculations for up to a 10 year time horizon for TPC-W. We present results for three initial workload intensities: (i) 20 tps, (ii) 100 tps, and (iii) 500 tps, which represent “small”, “medium”, and “large” in the overall spectrum. Although the choice of 500 tps as “large” may seem arbitrary, the analysis results do not change beyond that point. We also present two intensity growth scenarios: (i) 20% increase per year and (ii) stagnant. The former represents a thriving business and the latter a stabilized one. A business might also grow at a much faster rate, but the analysis results do not change qualitatively.
For a specific intended operational period, Figure 5.4 can be used to identify the most economical hosting option. For example, in the case of “medium” intensity and stagnant growth (Figure 5.4(e)), if an operational period of 3 years is expected, the best hosting option is “Fully EC2”, whereas the winner becomes “Fully Inhouse” if an operational period of 8 years is assumed. It is therefore important to identify such cross-over points for decision making.
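Identifying a cross-over point can be sketched as a small search over horizons. The cost streams below are illustrative round numbers, not the measured TPC-W costs:

```python
# Sketch: find the first operational period at which one hosting option's
# cumulative NPV drops below another's (hypothetical cost figures).
def npv(costs, r):
    return sum(c / (1.0 + r) ** t for t, c in enumerate(costs))

def crossover_year(costs_a, costs_b, r):
    """First horizon T (years) at which option A is cheaper than option B."""
    for T in range(1, min(len(costs_a), len(costs_b)) + 1):
        if npv(costs_a[:T], r) < npv(costs_b[:T], r):
            return T
    return None   # A never becomes cheaper within the given horizon

# Illustrative streams: in-house pays $30K up front then $3K/year;
# cloud pays a flat $10K/year.  Discount rate r = 5%.
inhouse = [30_000] + [3_000] * 7
cloud   = [10_000] * 8
print(crossover_year(inhouse, cloud, 0.05))  # in-house wins from year 5
```

With these numbers the up-front in-house investment is recovered in the fifth year, after which in-house remains the cheaper option, the same shape of cross-over seen in Figure 5.4(e).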
From Figure 5.4 we observe several interesting points. We find that in-house
provisioning is cost-effective for medium to large workloads, whereas cloud-based
(a) 20 tps, 20% growth; (b) 100 tps, 20% growth; (c) 500 tps, 20% growth; (d) 20 tps, 0% growth; (e) 100 tps, 0% growth; (f) 500 tps, 0% growth. Each panel plots NPV cost ($) against years 1-8 for the options Fully Inhouse, Fully EC2, EC2+RDS, Inhouse+RDS, and Inhouse+SQL Azure.

Figure 5.4. NPV over a 10 year time horizon for TPC-W. We consider three different initial workload intensities: small (20 tps at t=0), medium (100 tps), and high (500 tps). We also consider two workload growth rates: 0% and 20% per year.
options suit small workloads. For small workloads, the servers procured for in-house provisioning end up having significantly more computational power than needed (and remain severely under-utilized), since they are the lowest-granularity servers available in the market today. The cloud, on the other hand, can offer instances with computational power matching small workload needs (due to the statistical multiplexing and virtualization it employs). For medium workload intensities, cloud-based options are cost-effective only if the application needs to be supported (i.e., operated temporarily) for 2-3 years, and become expensive for longer-lasting scenarios. These workload intensities are able to utilize well-provisioned servers, making in-house procurement cost-effective.
5.3.1 Data Transfer, Vertical Partitioning
Consistently in all cases of Figure 5.4, the most economical cloud-based deploy-
ment option turns out to be “Fully EC2”, closely followed by “EC2+RDS”. Both
82
������
�����
������
������
����������
����������
���� �
����������
���� ������
�������������
��� �
�����������
���� � �����
������
������ �����
�����
��������
�������������
��� �
����� ��� ����
������������
����� ��� ����
�����
����������������
(a) Fully EC2 ($22k) (b) EC2+RDS ($29k)
������
�����
�������
�� ���
����
����
�����������
��������
��� ����
�� �
�����
������
���
��������� �
�����
������
������� ��
�����
����������������������
����
����
�����
���
(c) In-house+RDS ($70k)(d) In-house+SQL Azure ($63k)
Figure 5.5. Closer look at cost components for four cloud-based application deploymentoptions in the 5th year. Initial workload is 100 tps and the annual growth rate is 20%.
“Inhouse+RDS” and “Inhouse+SQL Azure” are significantly more expensive than either “Fully EC2” or “EC2+RDS”. To explain why, Figure 5.5 provides detailed breakdowns of the NPV of five-year hosting of TPC-W for the options involving the cloud. For “Fully EC2” and “EC2+RDS”, the cost of purchasing cloud instances (including the RDS instance) makes up about 60% or more, and the rest is mostly the charge for data traffic in and out of the cloud. For “Inhouse+RDS” and “Inhouse+SQL Azure”, however, the data traffic cost dominates, reaching more than 70% of the overall cost. Data transfer (in and out) costs in Figure 5.5(c),(d) are larger than those in Figure 5.5(a),(b) because the traffic per transaction between JBoss and MySQL (16KB/tr) is larger than that between the clients and Apache (3KB/tr). From this observation, we find that data transfer is a significant contributor to the costs of cloud-based hosting, between 30% and 70% for TPC-W.
As a corollary, this also suggests that vertical partitioning choices may not be
(a) Fully In-house ($482K); (b) Fully EC2 ($348K); (c) EC2+SQL Server ($279K).

Figure 5.6. Cost break-down of TPC-E at the 6th year.
appealing for applications that exchange data with the external world and/or across
components that fall across the boundary of partitioning.
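The data-transfer arithmetic behind this observation can be sketched as follows. The per-transaction traffic figures (3 KB/tr between clients and Apache, 16 KB/tr between JBoss and MySQL) come from the text; the $0.10/GB transfer price is an assumed round number for illustration, not a quoted provider rate:

```python
# Sketch of annual data-transfer cost for a given partitioning boundary.
SECONDS_PER_YEAR = 365 * 24 * 3600

def annual_transfer_cost(tps, kb_per_tx, dollars_per_gb):
    gb = tps * kb_per_tx * SECONDS_PER_YEAR / (1024 * 1024)
    return gb * dollars_per_gb

# At 100 tps, partitioning at the JBoss/MySQL boundary pushes more than
# five times the traffic across the billed boundary compared with
# partitioning at the client/Apache boundary:
front = annual_transfer_cost(100, 3, 0.10)    # ≈ $902/year
mid   = annual_transfer_cost(100, 16, 0.10)   # ≈ $4,812/year
```

Even at this modest workload, the choice of partitioning boundary changes the annual transfer bill by thousands of dollars, which is why vertical partitioning between chatty tiers fares poorly above.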
5.3.2 Storage Capacity, Software Licenses
Storage capacity can be a key factor that might overturn the decision of cloud-
based hosting vs. in-house hosting. Whereas TPC-W poses relatively small costs
for storage capacity (its database only needs a few GB and its storage capacity
costs do not even show up in Figure 5.5), TPC-E has significant data storage needs
(its database requires about 4.5TB). Figure 5.7 presents the NPV cost evolution
for TPC-E for two initial intensities - 300 tps (medium) and 900 tps (high). The
annual growth rate in both cases is 20%. We only present fully in-house and two
cloud (Fully EC2 and EC2+SQL server) options since we have already established
the high costs of vertical partitioning. We find that in-house provisioning for TPC-E has to make significant investments in high-end RAID arrays, which constitute about 75% of the overall costs. For an initial workload intensity of 300 tps, these costs go down substantially with fully EC2 (i.e., renting storage from EC2 is cheaper than the amortized cost of procuring this much storage in-house), causing the overall costs to improve by 50% (year 1, shown as gap A) and 28% (year 6, shown as gap B in Figure 5.7).
The software licensing fee for SQL Server and Windows can also be a significant
contributor to TPC-E costs: 2nd largest (17.4% of overall) and largest (67%)
(a) Initial workload: 300 tps, 20% growth rate; (b) Initial workload: 900 tps, 20% growth rate.

Figure 5.7. Two sets of TPC-E results at initial workloads of 300 tps and 900 tps.
contributor, respectively, for the fully in-house and EC2 options, as shown in Figure 5.6. Using a pay-per-use SaaS database eliminates the SQL Server licensing fees (shown as gap C in Figure 5.7) and results in even better costs. SaaS options can be cost-effective for applications built on software with high licensing/maintenance fees. Note that these concerns did not arise with TPC-W, which employed open-source software, implying a different ordering of cost-efficacy among the options.

It is also worth comparing the cost evolution for the two intensities in Figure 5.7. At medium intensity (300 tps), the in-house option is less attractive than the cloud-based options for the entire 10 year period, without ever reaching a cross-over point. At higher intensity (900 tps), however, cloud-based hosting quickly becomes more expensive than in-house (after 2 years for fully EC2 and after 4 years for EC2+SQL Server). This is qualitatively similar to the observations for TPC-W. However, cloud-based options remain attractive over a larger range of workload intensities than for TPC-W (compare Figure 5.7(a) with Figure 5.4(b), both of which have the same growth rate but differ in intensity by a factor of 9); the key reasons for this difference are gaps B and C, i.e., the higher storage costs for in-house TPC-E as well as the contribution of software licenses in non-SaaS options.
A final interesting phenomenon arises when buying cloud instances for the TPC-E
database: we find machines that offer the required computational power per VM
but not the requisite degree of parallelism. Combined with the current pricing
policy of major database vendors regarding virtual cores (explained below), this
negatively affects the overall cost of cloud-based hosting. For example, the
High-Memory Quadruple Extra Large Instance of Amazon EC2 offers 8 virtual cores,
each with 3.25 EC2 compute units. A virtual core of 3.25 EC2 compute units delivers
computing power equivalent to a 3.6 GHz CPU, which is on par with typical processor
speeds. However, since the maximum number of cores is limited to 8 for cloud
instances, whereas the in-house server used for our TPC-E analysis has 12 (6 cores
× 2 sockets), the cloud-based options are forced to procure more instances than
in-house. Additionally, using a commercial database in cloud-based hosting requires
purchasing a separate license for each virtual core. Given the prevalent policy of
per-socket (instead of per-core) charging for traditional (non-virtualized) servers,
this puts commercial databases in the cloud at a significant disadvantage and, at
the same time, encourages the use of SaaS database services in the cloud. This
suggests that a reconsideration of software licensing structures, particularly as
applicable to large-scale parallel machines, may be worthwhile for making
cloud-based hosting more appealing.
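To make the licensing penalty concrete, the following sketch contrasts per-socket and per-core charging for the configuration above. The license fee per charging unit is a hypothetical placeholder, not an actual vendor price.

```python
# Hypothetical license fee per charging unit (socket or virtual core);
# actual vendor prices differ.
LICENSE_PER_UNIT = 10_000  # $

# In-house: the 12-core TPC-E server (6 cores x 2 sockets), charged per socket.
inhouse_license = 2 * LICENSE_PER_UNIT

# Cloud: 8 virtual cores per instance forces two instances (16 vcores total)
# to match the 12-core server, and each virtual core needs its own license.
cloud_license = 2 * 8 * LICENSE_PER_UNIT

print(inhouse_license, cloud_license)  # per-core charging costs 8x more here
```

Even with identical aggregate compute, the combination of limited per-instance parallelism and per-core licensing multiplies the license bill.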
5.3.3 Workload Variance and Cloud Elasticity
Our cost analysis so far has not taken into account variance in workload
intensity. One potential cost benefit of cloud-based hosting comes from the ability
to dynamically match resource capacity to the workload at a finer time-scale
than in-house hosting can. Given the high burstiness (i.e., high peak-to-average
ratio, or PAR) of many real workloads, it is common in practice to provision close
to the peak. Whereas in-house provisioning must continue this practice, the
usage-based charging and elasticity offered by the cloud open new opportunities for
savings (for both in-cloud and horizontal partitioning). We investigate the costs of
variance-aware provisioning for three degrees of burstiness corresponding to
time-of-day effects and flash crowds. Researchers have reported the magnitude of
daily fluctuation (ratio of peak to average) to be 40-50% [112, 113] for social
networking applications, and about 70% (min: 40, max: 135 tps) for an e-commerce
Web site [114]. Flash crowds can cause peaks orders of magnitude higher than the
average and are a particularly appealing motivation for considering the (perhaps
partial) use of the cloud. The logs of the 1998 World Cup Web site show a 70-fold
increase in web requests due to flash crowds [115]. There have been many efforts
to handle flash crowds
Figure 5.8. Effect of workload variance on the cost of in-house hosting for TPC-W. The workload is 100 tps and the growth rate is 20% per year.
for enterprise applications [116, 117, 118]. We represent workload variance
using the peak-to-average ratio (PAR), which we define as max(xt)/E(xt), where xt
is the time series representing workload intensity (e.g., number of arrivals/sec).
Borrowing from [114], we choose a PAR of 1.54 to represent daily variations and
PAR values of 11 and 51 to represent two flash crowd scenarios (i.e., peaks of 10
and 50 times the average, respectively).
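The PAR definition above is straightforward to compute from a workload trace. A minimal sketch, using a simplified two-point stand-in for the e-commerce trace of [114] (the real trace is continuous):

```python
def peak_to_average_ratio(xt):
    """PAR = max(x_t) / E(x_t) for a workload time series x_t."""
    return max(xt) / (sum(xt) / len(xt))

# Two-point stand-in for the e-commerce trace of [114] (min 40, max 135 tps);
# it reproduces the PAR of 1.54 used in the text.
daily = [40, 135]
print(round(peak_to_average_ratio(daily), 2))  # 1.54
```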
For cloud-based hosting, we make the following assumptions. The cloud provides
a mechanism to monitor and detect the occurrence of sudden workload bursts, and
a mechanism to scale out at run time to match the observed workload. These
assumptions are safe to make since such features are already supported by
mainstream cloud providers. Therefore, regardless of PAR, the cost of cloud-based
hosting is theoretically proportional to the overall average workload intensity.
In practice, however, since most clouds charge for usage at the granularity of
hours, the actual cost can be slightly higher than the theoretical cost.
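The gap between the theoretical and actual cloud cost comes from rounding usage up to whole billing hours. A sketch, with a hypothetical hourly instance rate:

```python
import math

HOURLY_RATE = 0.34  # hypothetical $/hour for one extra instance

def billed_cost(usage_hours):
    """Clouds bill in whole hours: a partial hour is charged as a full one."""
    return math.ceil(usage_hours) * HOURLY_RATE

# A daily burst needing one extra instance for 2.2 hours:
theoretical = 2.2 * HOURLY_RATE   # pay exactly for usage
actual = billed_cost(2.2)         # 3 billed hours
print(theoretical, actual)        # actual is slightly higher
```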
Figure 5.8 illustrates the effect of the three levels of burstiness on the in-house
provisioning cost. We select the case of in-house hosting with a medium, increasing
workload (Figure 5.4(b)). Provisioning for the diurnal fluctuation of 70%
(PAR=1.54) does not impact the cost, whereas flash crowds noticeably increase it.
Provisioning for PAR=11 shifts the cross-over point with “Fully EC2” from year 2.5
to year 6.5 (see annotations in Figure 5.8). Provisioning for PAR=51 is
uncompetitive compared to “Fully EC2” over the entire 10-year period. Note that
the effect of diurnal fluctuation is minuscule: the provisioned servers already
have enough capacity to absorb the peak of the diurnal fluctuation.
5.3.4 Horizontal Partitioning
We explore the benefits offered by a horizontal partitioning scheme that sets a
threshold of workload intensity above which a replica is created in the cloud to
handle the excess.
The total cost of horizontal partitioning is the sum of two costs, one from
in-house provisioning and the other from cloud usage:
C = CH(τ) + CC(τ) (5.2)
where C denotes cost and τ is the workload-intensity threshold. The in-house
provisioning cost CH(τ) is computed the same way as in previous sections, with τ
being the required capacity to provision. In both of the horizontal partitioning
schemes described in Figure ?? of Section 5.3.4, the in-house provisioning cost
is determined using the same steps. That is:
CH(τ) = CH/W (τ) + CS/W (τ) + Copex(τ) (5.3)
where Copex represents the sum of operational expenses such as electricity and
labor costs.
Calculating the cost of cloud usage requires us to assume a suitable distribution
of workload intensities. Since we are interested in the effect of workload
burstiness on the cost of horizontal partitioning, we select a heavy-tailed
distribution, the lognormal distribution. Let us denote the distribution
f(x : θ), where θ denotes the distribution parameters. In order to determine
CC(τ), we first find two values from the distribution.

• The average workload intensity over all times t where xt ≥ τ (xt is the
workload at time t). Pictorially, this is the dotted line in Figure 5.9(b).

• The proportion of time during which xt ≥ τ.
(a) Time series xt (b) PDF f(x)
Figure 5.9. Timeseries xt and workload distribution f(x) for a fixed τ .
These two values allow us to calculate how many VM instances are used for
how many hours. The average workload intensity I is given by:

I(τ, f(x : θ)) = Σ_{x > τ} x · f(x : θ) / (1 − F(τ : θ)) (5.4)

where f is divided by 1 − F(τ : θ) since Σ_{x > τ} f(x : θ) = 1 − F(τ : θ), as
shown in Figure 5.9(b). The proportion of time T during which xt ≥ τ is:

T(τ, f(x : θ)) = 1 − F(τ : θ) (5.5)

These values of I and T are fed into the procedure we used to calculate the
“Fully In-cloud” cost. This yields CC(τ), completing the calculation of the total
cost C.
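Equations 5.4 and 5.5 can be evaluated numerically for the lognormal distribution. A sketch, assuming the closed-form lognormal CDF (via the error function) and approximating the sum in Eq. 5.4 by a truncated Riemann sum over (τ, x_max]:

```python
import math

def lognorm_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-(math.log(x) - mu) ** 2 / (2 * sigma ** 2)) \
        / (x * sigma * math.sqrt(2 * math.pi))

def lognorm_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def overflow_stats(tau, mu=0.0, sigma=1.0, x_max=100.0, n=100_000):
    """Return (I, T): mean intensity over times with x_t >= tau (Eq. 5.4)
    and the fraction of time with x_t >= tau (Eq. 5.5)."""
    T = 1.0 - lognorm_cdf(tau, mu, sigma)                       # Eq. 5.5
    dx = (x_max - tau) / n                                      # midpoint rule
    num = sum((tau + (i + 0.5) * dx) *
              lognorm_pdf(tau + (i + 0.5) * dx, mu, sigma) * dx
              for i in range(n))                                # sum of x * f(x)
    return num / T, T                                           # Eq. 5.4

I, T = overflow_stats(tau=1.0)  # lognormal(mu=0, sigma=1), as in this section
```

As the threshold τ rises, T shrinks (less time spent in the cloud) while I grows (the overflow that does occur is more intense), which is exactly the trade-off driving the cost curves discussed next.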
Figure 5.10 shows how cost changes over a range of thresholds for TPC-W with an
average workload of 500 tps. We assume a lognormal (µ: 0, σ: 1) distribution to
simulate bursty traffic. Applying the equations above, we can observe the behavior
of costs as a function of the threshold when horizontal partitioning is used. The
blue dotted line in Figure 5.10 is the total cost, the sum of the two separate
costs from the in-house hosting infrastructure and the usage charge of the replica
in the cloud. As the threshold moves to higher workload intensity, the in-house
cost rises and the cloud cost shrinks. The in-house cost shows a stepwise
increasing pattern because server capacity increases in discrete steps. The cloud
cost is a smooth decreasing function because an application deployed in the cloud can grow
[Figure: total, in-house, and cloud costs ($) vs. workload-intensity threshold (req/s); annotations mark the optimal threshold (1,100 req/s, cost $248K) and the average (500 req/s, cost $309K).]
Figure 5.10. Cost behavior of horizontal partitioning as a function of the threshold value. Although not shown in the graph, the cost of provisioning for PAR=11 (at 5.5K on the x-axis) is $625K.
and shrink to match the current workload intensity. The cloud cost diminishes
because the probability of overflowing the in-house server capacity becomes
smaller as the threshold moves higher. Due to the inherent burstiness of the
workload (captured by its heavy-tailedness), the cloud cost does not diminish
rapidly. The cost-minimizing threshold is found at a workload intensity of 1,100
tps, as marked in Figure 5.10. If the threshold is instead set to the average,
the cost is 25% higher than the minimum. Also, if horizontal partitioning is not
employed and the in-house server is provisioned for a PAR of 11 (not shown in
Figure 5.10 because the data point is beyond the plotted range), then the overall
cost becomes 2.5 times the minimum. This suggests that horizontal partitioning
can effectively eliminate the cost increase that comes from provisioning for the
peak.
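The shape of Figure 5.10 can be reproduced qualitatively by sweeping the threshold over a simple cost model. All constants below (server capacity, server cost, cloud rate, workload scale) are hypothetical, chosen only to illustrate the stepwise-plus-smooth trade-off, not the dissertation's actual cost inputs.

```python
import math

SERVER_CAPACITY = 300   # req/s handled by one in-house server (assumed)
SERVER_COST = 40_000    # lifetime cost of one in-house server ($, assumed)
CLOUD_RATE = 60_000     # lifetime cloud cost per unit of overflow probability ($, assumed)

def lognorm_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def total_cost(tau, scale=500.0):
    # In-house part: provision ceil(tau / capacity) servers -> stepwise curve.
    in_house = math.ceil(tau / SERVER_CAPACITY) * SERVER_COST
    # Cloud part: shrinks smoothly with the overflow probability (Eq. 5.5).
    overflow_prob = 1.0 - lognorm_cdf(tau / scale)
    return in_house + CLOUD_RATE * overflow_prob

# Sweep the threshold to find the cost minimum, as in Figure 5.10.
best_tau = min(range(100, 3000, 50), key=total_cost)
```

With these toy constants, the minimum lands at a step boundary of the in-house curve: raising τ within a step only shrinks the cloud part, while crossing a step adds a whole server.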
5.3.5 Indirect Cost Components
In this section we study the effect of one representative indirect cost, namely
labor. Since indirect costs do not depend entirely on the application under
consideration, we cannot treat them as a single quantity that is a function of
application size. Instead, we consider different cases of labor cost per server,
with numbers borrowed from the literature.
Size of Business      Staff:Server Ratio   Average $/Server/hour   Labor Cost for 1 Server/Year
SMB                   1:10                 $3.30                   $6,336
Large Enterprise      1:100                $0.33                   $633
Data Center class     1:1000               $0.033                  $63

Table 5.1. Labor cost per server. We differentiate three ratios of IT staff to
servers. The average cost per server per hour (3rd column) is based on an
average IT staff salary of $33/hour.
In order to assess the impact of labor cost, we need to determine the value
ranges of the following two metrics.

• Yearly labor cost per server: This is the contribution of labor to the
overall cost of a single server. It allows us to quantify the labor cost as
a function of the system size.

• Ratio of the number of manageable instances, VM vs. physical
server: This value indicates how many VM instances an IT staff member can
manage with the same effort as one physical server. It is used to calculate
the labor cost for a given number of cloud-hosted VM instances.
In a typical large enterprise and in a data center, the IT staff-to-server ratio
is reported to be 1:100 and 1:1000, respectively [119]. In addition, we consider
a case for small and medium-sized businesses (SMB) with an IT staff-to-server
ratio of 1:10, as shown in the 2nd column of Table 5.1. From here on, we use the
terms small, medium, and large to refer to SMB, Large Enterprise, and Data Center
class, respectively. The average salary of a typical IT staff member was between
$61,924 and $66,196 as of 2010 [120], which translates to about $33 per hour.
Combining this with the staff-to-server ratio, we can determine the yearly labor
cost of one server, shown in the last column of Table 5.1. As for the second
metric, the actual value is difficult to obtain, so we approximate it using the
average VM density per physical server of about 7 in a virtualized cluster
environment [121]. Note that this is a conservative estimate: the number of
cloud-based VM instances an IT staff member can manage is likely higher, since
cloud-based VMs do not require caring for the underlying physical servers.
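The per-server figures in Table 5.1 follow directly from the staff-to-server ratios and the $33/hour wage. A sketch that also applies the 7:1 VM-density approximation; the 1,920-hour working year is our assumption, chosen to match the table's numbers:

```python
HOURLY_WAGE = 33             # average IT staff wage ($/hour) [120]
HOURS_PER_YEAR = 8 * 5 * 48  # ~1,920 working hours/year (assumed)

STAFF_TO_SERVER = {"SMB": 10, "Large Enterprise": 100, "Data Center class": 1000}

def yearly_labor_cost_per_server(servers_per_staff):
    """Last column of Table 5.1: one staff salary spread over the servers."""
    return HOURLY_WAGE * HOURS_PER_YEAR / servers_per_staff

def yearly_labor_cost_per_vm(servers_per_staff, vm_density=7):
    """Cloud VMs: a staff member manages ~7x more VMs than physical servers [121]."""
    return yearly_labor_cost_per_server(servers_per_staff) / vm_density

for name, ratio in STAFF_TO_SERVER.items():
    # $6,336, $633.60, and $63.36 per year (Table 5.1 rounds the last two)
    print(name, yearly_labor_cost_per_server(ratio))
```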
[Figure: four panels plotting NPV cost ($) over 8 years for Fully In-house, Fully EC2, EC2+RDS, In-house+RDS, and In-house+SQL Azure: (a) without labor cost, (b) large business size, (c) medium business size, (d) small business size.]
Figure 5.11. Impact of labor cost for the medium workload intensity using TPC-W. The cases for small and large workload intensities are not shown. For a small workload, cloud-based hosting is always cheaper. For a large workload, in-house hosting is always cheaper.
From Table 5.1 we can expect the cost of an SMB's in-house hosting to be higher,
because SMBs have the highest yearly labor cost per server. However, the cost of
cloud-based hosting also rises as the number of cloud instances increases. If the
rate of increase of cloud instances is sufficiently high, the labor cost of
managing cloud instances can exceed the labor cost of in-house hosting. Thus, the
impact of labor cost on the overall decision of in-house vs. cloud-based hosting
is not straightforward to anticipate. Figure 5.11 shows the effect of labor cost
for a medium workload (100 tps) of TPC-W. We show only the medium-workload case
since this is the region where the decision of in-house vs. cloud-
Figure 5.12. (a) Effect of labor costs on the hosting decision in relation to the workload intensity and business size. (b) Stacked view of in-house cost at year 5.
based hosting is reversed. For a small workload (20 tps), the best option is
already cloud-based hosting even without considering labor cost. Likewise, the
best option for a high workload (500 tps) is in-house hosting, and the gap widens
as the business size shrinks from large to small. From Figure 5.11 (b) and (c)
to (d), the in-house cost can be seen to rise rapidly. Although the cost of
cloud-based hosting rises as well, the magnitude is smaller. As a result, the
in-house option loses competitiveness in (d). Figure 5.12(b) shows the
composition of the in-house cost at year 5 from Figure 5.11 in more detail.
The effect of labor cost on the hosting decision is summarized in Figure 5.12(a).
Looking along the y-axis, if the workload is small, cloud-based hosting is
preferable for all business sizes; if the workload is large, in-house hosting is
more economical. However, there is a range of medium workloads in which the
hosting decision shifts from cloud-based to in-house as the business size grows.
In this range, the higher labor cost of a smaller business manifests as a higher
cost of in-house hosting. Looking along the x-axis, the diagram shows that
cloud-based hosting makes more sense for smaller businesses most of the time. For
a small business to consider the in-house hosting option, the workload has to be
significantly high. For a large enterprise, it is important to analyze the cost
of the available hosting options, because large enterprises are more likely to be
on the boundary of the decision spectrum.
5.4 Summary and Conclusion
In this chapter we have studied how to incorporate various cost factors into
determining the cost of hosting an application under several cloud-based hosting
options as well as in-house hosting. As representative applications, we used
TPC-W and TPC-E, both of which are multi-tiered, web-based service applications
that use a database as a back-end. We presented a classification of costs related
to application hosting, and explained what horizontal and vertical partitioning
are and how they can be used for cloud-based hosting of a consumer's applications.
We considered several important application characteristics such as workload
intensity, growth rate, and workload variance, as well as the impact of
uncertainty in some of the cost factors. We summarize below the key lessons we
have learned from our study.
• Cloud-based hosting makes sense for applications whose workload intensity
and growth rate are relatively small (the exact range is application-specific;
we provided illustrative numbers for TPC-W and TPC-E).
• Data transfer (network usage) cost is one of the major costs of using the
cloud, and it can make vertical partitioning infeasible for some applications.
• Costs such as storage and licensing can have a big impact on the choice of
the best hosting option.
• Small workload variances such as diurnal patterns are unlikely to impact
the hosting costs, but flash crowds can have a huge impact.
• Indirect costs such as labor have a range of values depending on external
factors like the size of the organization and the skill level of the staff.
We find that they can render the economic benefit of cloud-based hosting
meaningless for smaller organizations.
Complementary to our cost analysis in this chapter is the problem of predicting
the performance an application is likely to experience once it is (partially)
migrated to a certain cloud (or clouds). This forms the subject of our future work.
Chapter 6
Conclusion and Future Work
In this dissertation, we have addressed the problem of realizing a mature utility
computing paradigm from currently emerging clouds. We argued that important
functionality is missing from current cloud computing infrastructure if it is to
evolve into a mature utility computing platform.
From the provider's point of view, we identified the need for an accurate
resource accounting methodology. Such a solution enables cloud providers to gain
a clear understanding of resource usage on behalf of chargeable entities. We
found that existing methods suffer under various runtime conditions of the shared
services and can even lead to incorrect operation of cloud management. The
fundamental reason for this shortcoming is that traditional methods of
monitoring-data collection are inherently ill-suited to resource accounting. To
this end, we developed a novel resource accounting technique that is
implementable within the hypervisor layer and that operates on thread-level
monitoring data. One sub-problem we had to solve was discovering causal,
end-to-end dependencies among application components; we presented the solution
and its evaluation in Chapter 3. Our complete resource accounting solution, which
incorporates the dependency discovery technique, has been shown to be robust to
various system and workload conditions that impact traditional approaches. While
traditional approaches suffered error rates of more than 100%, our technique
consistently maintained an error rate below 1% for all tested conditions. The
evaluation using two real-world applications revealed that our method is superior
to traditional statistical
approaches. In the scenario of response time management via resource allocation
control, statistical methods failed to avoid SLA violations due to incorrect
decisions about which chargeable entity is the dominant resource consumer,
whereas our method succeeded in avoiding the SLA violations.
From the consumer's point of view, we identified that consumers need to
understand the complex interactions of the various cost factors related to using
cloud services in order to make informed decisions about how to deploy their
applications in the cloud. Using well-regarded multi-tiered applications, we
studied the impact of factors such as workload intensity, burstiness, growth
rate, time of operation, storage and license costs, and labor costs. We learned
that cloud-based hosting tends to make sense for low workload intensities and low
growth rates. Data transfer costs can become one of the major cost factors and
can render the hybrid hosting option of using in-house and cloud resources
together very uneconomical.
Future Work: There are many interesting directions for future work on both the
provider and consumer sides of this dissertation's topic. On the provider side,
our solution to the resource accounting problem can be used to address the
question of performance interference among VMs or application components. Since
our solution provides an accurate picture of how much of each resource is
consumed, we are now able to predict the combined resource usage if several
components are co-located and serviced by the same shared services. The challenge
would be to develop a sound technique that translates this total combined
resource usage into expected performance numbers. Accurate prediction of
performance (or of the range of possible performance) would be an invaluable tool
for optimizing cloud management. On the consumer side, we can consider the
research direction of predicting the performance of a target application if it
were deployed into a specific cloud. Since there is a variety of cloud services
with different pricing and characteristics, information about the expected
performance of a consumer's application is important for making an optimal
decision of which cloud to use and how. Deploying the application directly to
multiple target clouds for performance testing is prohibitive because such a task
is time-consuming, error-prone, and sometimes impossible. We need a technique
that enables us to accurately estimate performance without having to actually
deploy the application every time. An additional requirement is that, since the
best choice of cloud can change over time, the method must be lightweight so that
it can be applied repeatedly.
Bibliography
[1] “Amazon Elastic Compute Cloud (EC2),” http://www.amazon.com/ec2/.
[2] “Amazon Simple Storage Service (S3),” http://aws.amazon.com/s3/.
[3] “Sun Grid,” http://www.sun.com/service/sungrid/.
[4] “The Rackspace Cloud,” http://www.rackspacecloud.com.
[5] “Windows Azure Platform,” http://www.microsoft.com/azure/.
[6] “Google App Engine,” http://code.google.com/appengine.
[7] Hamm, S. (2008), “Confusion Over Cloud Computing,” Business Week.
[8] (2009), “A NIST Notional Definition of Cloud Computing,” csrc.nist.gov/groups/SNS/cloud-computing/cloud-def-v14.doc.
[9] (2009), “Google Trends: Cloud Computing Surpasses Virtualization in Popularity,” http://www.elasticvapor.com/2009/04/google-trends-cloud-computing-surpasses.html.
[10] Horrigan, J. (2008), “Use of Cloud Computing Applications and Services,”Pew Internet and American Life Project.
[11] Rappa, M. A. (2004) “The utility business model and the future of com-puting services,” IBM Syst. J., 43, pp. 32–42.URL http://dx.doi.org/10.1147/sj.431.0032
[12] Eilam, T., K. Appleby, J. Breh, G. Breiter, H. Daur, S. A.
Fakhouri, G. D. H. Hunt, T. Lu, S. D. Miller, L. B. Mummert,J. A. Pershing, and H. Wagner (2004) “Using a utility computing frame-work to develop utility systems,” IBM Syst. J., 43, pp. 97–120.URL http://dx.doi.org/10.1147/sj.431.0097
[13] Paleologo, G. A. (2004) “Price-at-Risk: A methodology for pricing utilitycomputing services,” IBM Syst. J., 43, pp. 20–31.URL http://dx.doi.org/10.1147/sj.431.0020
[14] Banga, G., P. Druschel, and J. C. Mogul (1999) “Resource containers:a new facility for resource management in server systems,” in Proceedings ofthe third symposium on Operating systems design and implementation, OSDI’99, USENIX Association, Berkeley, CA, USA, pp. 45–58.URL http://portal.acm.org/citation.cfm?id=296806.296810
[15] Aguilera, M. K., J. C. Mogul, J. L. Wiener, P. Reynolds, andA. Muthitacharoen (2003) “Performance debugging for distributed sys-tems of black boxes,” in SOSP’03: Proceedings of the 19th Symposium onOperating Systems Principles, ACM, New York, NY, USA, pp. 74–89.
[16] Anandkumar, A., C. Bisdikian, and D. Agrawal (2008) “Tracking in aspaghetti bowl: monitoring transactions using footprints,” in SIGMETRICS’08: Proceedings of the 2008 ACM SIGMETRICS international conferenceon Measurement and modeling of computer systems, ACM, New York, NY,USA, pp. 133–144.
[17] Barham, P., A. Donnelly, R. Isaacs, and R. Mortier (2004) “UsingMagpie for Request Extraction and Workload Modeling,” in OSDI’04: Pro-ceedings of the 6th conference on Symposium on Opearting Systems Design& Implementation, USENIX Association, Berkeley, CA, USA.
[18] Chen, M. Y., A. Accardi, E. Kiciman, J. Lloyd, D. Patterson,A. Fox, and E. Brewer (2004) “Path-based faliure and evolution manage-ment,” in NSDI’04: Proceedings of the 1st conference on Networked SystemsDesign and Implementation, USENIX Association, Berkeley, CA, USA.
[19] Chen, M. Y., E. Kiciman, E. Fratkin, A. Fox, and E. Brewer (2002)“Pinpoint: Problem Determination in Large, Dynamic Internet Services,” inDSN ’02: Proceedings of the 2002 International Conference on DependableSystems and Networks, IEEE Computer Society, Washington, DC, USA, pp.595–604.
[20] Reynolds, P., C. Killian, J. L. Wiener, J. C. Mogul, M. A. Shah,and A. Vahdat (2006) “Pip: detecting the unexpected in distributed sys-tems,” in NSDI’06: Proceedings of the 3rd conference on Networked SystemsDesign & Implementation, USENIX Association, Berkeley, CA, USA.
[21] Reynolds, P., J. L. Wiener, J. C. Mogul, M. K. Aguilera, andA. Vahdat (2006) “WAP5: black-box performance debugging for wide-area
systems,” in WWW ’06: Proceedings of the 15th international conference onWorld Wide Web, ACM, New York, NY, USA, pp. 347–356.
[22] Sengupta, B. and N. Banerjee (2008) “Tracking Transaction Footprintsfor Non-intrusive End-to-End Monitoring,” Autonomic Computing, Interna-tional Conference on, 0, pp. 109–118.
[23] Thereska, E., B. Salmon, J. Strunk, M. Wachs, M. Abd-El-Malek,J. Lopez, and G. R. Ganger (2006) “Stardust: tracking activity in a dis-tributed storage system,” in SIGMETRICS ’06/Performance ’06: Proceed-ings of the joint international conference on Measurement and modeling ofcomputer systems, ACM, New York, NY, USA.
[24] Wang, T., C. shing Perng, T. Tao, C. Tang, E. So, C. Zhang,R. Chang, and L. Liu (2008) “A Temporal Data-Mining Approach forDiscovering End-to-End Transaction Flows,” in 2008 IEEE InternationalConference on Web Services (ICWS08)., Beijing, China.
[25] Yumerefendi, A. and J. Chase (2004) “Trust but Verify: Accountabilityfor Internet Services,” in Proceedings of the Eleventh ACM SIGOPS Euro-pean Workshop.
[26] (2006), “Time of Use Electricity Billing: How Puget Sound Energy Reduced Peak Power Demands (Case Study),” http://energypriorities.com/entries/2006/02/pse_tou_amr_case.php.
[27] “Open Cirrus the HP/Intel/Yahoo! Open Cloud Computing ResearchTestbed,” http://opencirrus.org.
[28] Benani, M. and D. Menasce (2005) “Resource Allocation for AutonomicData Centers Using Analytic Performance Models,” in Proceedings of IEEEInternational Conference on Autonomic Computing, Seattle (ICAC-05) .
[29] Chase, J. and R. Doyle (2001) “Balance of Power: Energy Managementfor Server Clusters,” in Proceedings of the Eighth Workshop on Hot Topicsin Operating Systems (HotOS-VIII), Elmau, Germany.
[30] Chen, Y., A. Das, W. Qin, A. Sivasubramaniam, Q. Wang, andN. Gautam (2005) “Managing Server Energy and Operational Costs inHosting Centers,” in Proceedings of the ACM International Conference onMeasurement and Modeling of Computer Systems (SIGMETRICS 2005),Banff, Canada, June 2005.
[31] Urgaonkar, B., P. Shenoy, and T. Roscoe (2002) “Resource Overbook-ing and Application Profiling in Shared Hosting Platforms,” in Proceedings
of the 5th USENIX Symposium on Operating Systems Design and Implemen-tation (OSDI), Boston.
[32] Waldspurger, C. (2002) “Memory Resource Management in VMWareESX Server,” in Proceedings of the Fifth Symposium on Operating SystemDesign and Implementation (OSDI’02).
[33] Xu, M. and C. Xu (2004) “Decay Function Model for Resource Configu-ration and Adaptive Allocation on Internet Servers,” in Proceedings of theTwelfth International Workshop on Quality-of-Service (IWQoS 2004).
[34] Amazon Relational Database Service, http://aws.amazon.com/rds/.
[35] Apache HBase, http://hbase.apache.org/.
[36] Chang, F., J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach,M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber (2006)“Bigtable: a distributed storage system for structured data,” in Proceedingsof the 7th USENIX Symposium on Operating Systems Design and Implemen-tation - Volume 7, USENIX Association, Berkeley, CA, USA, pp. 15–15.URL http://portal.acm.org/citation.cfm?id=1267308.1267323
[37] Amazon SimpleDB, http://aws.amazon.com/simpledb/.
[38] Abdelzaher, T., K. G. Shin, and N. Bhatti (2002) “PerformanceGuarantees for Web Server End-Systems: A Control-Theoretical Approach,”IEEE Transactions on Parallel and Distributed Systems, 13(1).
[39] Appleby, K., S. Fakhouri, L. Fong, G. Goldszmidt, M. Kalantar,S. Krishnakumar, D. Pazel, J. Pershing, and B. Rochwerger (2001)“Oceano-SLA Based Management of a Computing Utility,” in Proceedingsof the IEEE/IFIP Integrated Network Management.
[40] Doyle, R., J. Chase, O. Asad, W. Jin, and A. Vahdat (2003) “Model-Based Resource Provisioning in a Web Service Utility,” in Proceedings of theFourth USITS.
[41] Slothouber, L. (1996) “A Model of Web Server Performance,” in Proceed-ings of the Fifth International World Wide Web Conference.
[42] Urgaonkar, B., G. Pacifici, P. Shenoy, M. Spreitzer, andA. Tantawi (2005) “An Analytical Model for Multi-tier Internet Servicesand its Applications,” in Proceedings of the ACM International Conferenceon Measurement and Modeling of Computer Systems (SIGMETRICS 2005),Banff, Canada.
[43] Urgaonkar, B. and P. Shenoy (2004) “Sharc: Managing CPU and Net-work Bandwidth in Shared Clusters,” 15(1), pp. 2–17.
[44] Lee, S. C. M. and J. C. S. Lui (2008) “On the Interaction and Competi-tion among Internet Service Providers,” IEEE Journal on Selected Areas inCommunications, 26(7), pp. 1277–1283.
[45] Shakkottai, S. and R. Srikant (2006) “Economics of network pricingwith multiple ISPs,” IEEE/ACM Trans. Netw., 14, pp. 1233–1245.URL http://dx.doi.org/10.1109/TNET.2006.886393
[46] Amazon EC2 Spot Instances, http://aws.amazon.com/ec2/spot-instances/.
[47] Weissel, A. and F. Bellosa (2004) “Dynamic Thermal Management forDistributed Systems,” in Proceedings of the First Workshop on Temperatur-Aware Computer Systems (TACS’04), Munich, Germany.
[48] John Levon, “Oprofile,” http://oprofile.sourceforge.net/credits/.
[49] Bhatia, S., A. Kumar, M. E. Fiuczynski, and L. Peterson (2008)“Lightweight, high-resolution monitoring for troubleshooting production sys-tems,” in Proceedings of the 8th USENIX conference on Operating systemsdesign and implementation, OSDI’08, USENIX Association, Berkeley, CA,USA, pp. 103–116.URL http://portal.acm.org/citation.cfm?id=1855741.1855749
[50] “Kprobes,” http://sourceware.org/systemtap/kprobes/.
[51] Quynh, N. A. and K. Suzaki “Xenprobes, a lightweight user-space probingframework for Xen virtual machine,” in 2007 USENIX Annual TechnicalConference on Proceedings of the USENIX Annual Technical Conference,Berkeley, CA, USA, pp. 2:1–2:14.URL http://dl.acm.org/citation.cfm?id=1364385.1364387
[52] Johnson, M. W. (1998) “Monitoring and Diagnosing Application ResponseTime with ARM,” in Proceedings of the IEEE Third International Workshopon Systems Management, IEEE Computer Society, Washington, DC, USA,pp. 4–.URL http://portal.acm.org/citation.cfm?id=829512.830401
[53] Wood, T., L. Cherkasova, K. Ozonat, and P. Shenoy (2008) “Profil-ing and modeling resource usage of virtualized applications,” in Proceedingsof the 9th ACM/IFIP/USENIX International Conference on Middleware,Middleware ’08, Springer-Verlag New York, Inc., New York, NY, USA, pp.
366–387.URL http://portal.acm.org/citation.cfm?id=1496950.1496973
[54] Zhang, Q., L. Cherkasova, G. Mathews, W. Greene, andE. Smirni (2007) “R-Capriccio: a capacity planning and anomaly detec-tion tool for enterprise services with live workloads,” in Proceedings of theACM/IFIP/USENIX 2007 International Conference on Middleware, Middle-ware ’07, Springer-Verlag New York, Inc., New York, NY, USA, pp. 244–265.URL http://portal.acm.org/citation.cfm?id=1516124.1516142
[55] Gupta, D., L. Cherkasova, R. Gardner, and A. Vahdat (2006) “Enforcing performance isolation across virtual machines in Xen,” in Proceedings of the ACM/IFIP/USENIX 2006 International Conference on Middleware, Middleware ’06, Springer-Verlag New York, Inc., New York, NY, USA, pp. 342–362.
URL http://portal.acm.org/citation.cfm?id=1515984.1516011
[56] Ren, G., E. Tune, T. Moseley, Y. Shi, S. Rus, and R. Hundt (2010) “Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers,” IEEE Micro, 30, pp. 65–79.
URL http://dx.doi.org/10.1109/MM.2010.68
[57] Stewart, C., T. Kelly, and A. Zhang (2007) “Exploiting nonstationarity for performance prediction,” in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys ’07, ACM, New York, NY, USA, pp. 31–44.
URL http://doi.acm.org/10.1145/1272996.1273002
[58] Cohen, I., M. Goldszmidt, T. Kelly, J. Symons, and J. S. Chase (2004) “Correlating instrumentation data to system states: a building block for automated diagnosis and control,” in Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6, USENIX Association, Berkeley, CA, USA, pp. 16–16.
URL http://portal.acm.org/citation.cfm?id=1251254.1251270
[59] Gray, J. (2008) “Distributed Computing Economics,” Queue, 6, pp. 63–68.URL http://doi.acm.org/10.1145/1394127.1394131
[60] Thanos, G., C. Courcoubetis, and G. Stamoulis (2007) “Adopting the Grid for Business Purposes: The Main Objectives and the Associated Economic Issues,” in Grid Economics and Business Models (D. Veit and J. Altmann, eds.), vol. 4685 of Lecture Notes in Computer Science, pp. 1–15.
[61] Kenyon, C. and G. Cheliotis (2004) “Grid resource management,” chap. Grid resource commercialization: economic engineering and delivery scenarios, Kluwer Academic Publishers, Norwell, MA, USA, pp. 465–478.
URL http://portal.acm.org/citation.cfm?id=976113.976142
[62] Cheliotis, G. and C. Kenyon (2003) “Autonomic Economics: Why Self-Managed e-Business Systems Will Talk Money,” in IEEE Conference on E-Commerce (CEC’03).
URL http://www.zurich.ibm.com/grideconomics/refs.html
[63] Armbrust, M., A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia (2009) “Above the Clouds: A Berkeley View of Cloud Computing.”
URL http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.html
[64] Walker, E. (2009) “The Real Cost of a CPU Hour,” Computer, 42.URL http://portal.acm.org/citation.cfm?id=1550393.1550432
[65] Walker, E., W. Brisken, and J. Romney (2010) “To Lease or Not toLease from Storage Clouds,” Computer, 43, pp. 44–50.URL http://dx.doi.org/10.1109/MC.2010.115
[66] Harms, R. and M. Yamartino (2010) The Economics of the Cloud, Tech. rep., Microsoft.
[67] Teregowda, P., B. Urgaonkar, and C. L. Giles (2010) “Cloud Computing: A Digital Libraries Perspective,” in Proceedings of the 2010 IEEE 3rd International Conference on Cloud Computing, CLOUD ’10, IEEE Computer Society, Washington, DC, USA, pp. 115–122.
URL http://dx.doi.org/10.1109/CLOUD.2010.49
[68] Teregowda, P., B. Urgaonkar, and L. Giles (2010) “CiteSeerX: ACloud Perspective,” in Proceedings of the Second USENIX Workshop on HotTopics in Cloud Computing.
[69] Klems, M., J. Nimis, and S. Tai (2009) “Do Clouds Compute? A Framework for Estimating the Value of Cloud Computing,” in Designing E-Business Systems. Markets, Services, and Networks, vol. 22 of Lecture Notes in Business Information Processing, Springer Berlin Heidelberg, pp. 110–123.
[70] Wang, H., Q. Jing, R. Chen, B. He, Z. Qian, and L. Zhou (2010) “Distributed systems meet economics: pricing in the cloud,” in Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, HotCloud’10, USENIX Association, Berkeley, CA, USA, pp. 6–6.
URL http://portal.acm.org/citation.cfm?id=1863103.1863109
[71] Chohan, N., C. Castillo, M. Spreitzer, M. Steinder, A. Tantawi, and C. Krintz (2010) “See spot run: using spot instances for mapreduce workflows,” in Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, HotCloud’10, USENIX Association, Berkeley, CA, USA, pp. 7–7.
URL http://portal.acm.org/citation.cfm?id=1863103.1863110
[72] Campbell, R., I. Gupta, M. Heath, S. Y. Ko, M. Kozuch, M. Kunze, T. Kwan, K. Lai, H. Y. Lee, M. Lyons, D. Milojicic, D. O’Hallaron, and Y. C. Soh (2009) “Open Cirrus™ cloud computing testbed: federated data centers for open source systems and services research,” in Proceedings of the 2009 conference on Hot topics in cloud computing, HotCloud’09, USENIX Association, Berkeley, CA, USA, pp. 1–1.
URL http://portal.acm.org/citation.cfm?id=1855533.1855534
[73] “AWS Simple Monthly Calculator,” http://calculator.s3.amazonaws.com/calc5.html.
[74] “Cloud Price Calculator,” http://cloudpricecalculator.com/.
[75] Shi, Z., H. Tang, and Y. Tang (2005) “Blind source separation of more sources than mixtures using sparse mixture models,” Pattern Recogn. Lett., 26, pp. 2491–2499.
URL http://dx.doi.org/10.1016/j.patrec.2005.05.006
[76] Barham, P., B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield (2003) “Xen and the Art of Virtualization,” in Proceedings of the 19th Symposium on Operating Systems Principles (SOSP).
[77] Amazon Elastic Compute Cloud, http://aws.amazon.com/ec2/.
[78] Welsh, M., D. Culler, and E. Brewer (2001) “SEDA: an architecture for well-conditioned, scalable internet services,” SIGOPS Oper. Syst. Rev., 35(5), pp. 230–243.
[79] Pai, V. S., P. Druschel, and W. Zwaenepoel (1999) “Flash: an efficient and portable web server,” in ATEC ’99: Proceedings of USENIX Annual Technical Conference, USENIX Association, Berkeley, CA, USA.
[80] Ruan, Y. and V. Pai (2004) “Making the box transparent: system call performance as a first-class result,” in Proceedings of the USENIX Annual Technical Conference 2004, USENIX Association, Berkeley, CA, USA.
[81] ——— (2006) “Understanding and Addressing Blocking-Induced Network Server Latency,” in Proceedings of the USENIX Annual Technical Conference 2006, USENIX Association, Berkeley, CA, USA.
[82] Behren, R. V., J. Condit, and E. Brewer (2003) “Why Events Are A Bad Idea (for high-concurrency servers),” in Proceedings of HotOS IX.
[83] Smith, W., “TPC-W: Benchmarking An Ecommerce Solution,” http://www.tpc.org/information/other/techarticles.asp.
[84] “NYU TPC-W,” http://www.cs.nyu.edu/pdsg/.
[85] “The JBoss Application Server,” http://www.jboss.org.
[86] “MySQL,” http://www.mysql.com.
[87] “RUBiS,” http://rubis.objectweb.org/.
[88] MediaWiki, http://www.mediawiki.org.
[89] Yu, H., J. Moreira, P. Dube, I. Chung, and L. Zhang (2007) “Performance Studies of a WebSphere Application, Trade, in Scale-out and Scale-up Environments,” in Third International Workshop on System Management Techniques, Processes, and Services (SMTPS), IPDPS.
[90] Wayne Walter Berry and Vitor Tomaz, “Inside SQL Azure,” http://social.technet.microsoft.com/wiki/contents/articles/1695.inside-sql-azure.aspx.
[91] Tak, B. C., C. Tang, C. Zhang, S. Govindan, B. Urgaonkar, and R. N. Chang “vPath: precise discovery of request processing paths from black-box observations of thread and network activities,” in Proceedings of the 2009 conference on USENIX Annual technical conference, USENIX’09, Berkeley, CA, USA, pp. 19–19.
URL http://dl.acm.org/citation.cfm?id=1855807.1855826
[92] Windows Azure, http://www.microsoft.com/windowsazure/windowsazure/.
[93] Microsoft SQL Azure, http://www.microsoft.com/en-us/sqlazure/default.aspx.
[94] Barham, P., A. Donnelly, R. Isaacs, and R. Mortier (2004) “Using Magpie for request extraction and workload modelling,” in OSDI’04: Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation, USENIX Association, Berkeley, CA, USA.
[95] Barham, P., B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield (2003) “Xen and the art of virtualization,” in SOSP ’03: Proceedings of the 19th ACM Symposium on Operating Systems Principles, ACM, New York, NY, USA, pp. 164–177.
[96] Gupta, D., K. Yocum, M. McNett, A. C. Snoeren, A. Vahdat, and G. M. Voelker (2006) “To infinity and beyond: time-warped network emulation,” in Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3, NSDI’06, USENIX Association, Berkeley, CA, USA, pp. 7–7.
URL http://dl.acm.org/citation.cfm?id=1267680.1267687
[97] Wei, D. X., C. Jin, S. H. Low, and S. Hegde (2006) “FAST TCP: motivation, architecture, algorithms, performance,” IEEE/ACM Trans. Netw., 14(6), pp. 1246–1259.
[98] Gulati, A., I. Ahmad, and C. A. Waldspurger (2009) “PARDA: proportional allocation of resources for distributed storage access,” in FAST ’09: Proceedings of the 7th conference on File and storage technologies, USENIX Association, Berkeley, CA, USA, pp. 85–98.
[99] “HDFS,” http://hadoop.apache.org/hdfs/.
[100] Hunt, P., M. Konar, F. P. Junqueira, and B. Reed (2010) “ZooKeeper: wait-free coordination for internet-scale systems,” in Proceedings of the 2010 USENIX conference on USENIX annual technical conference, USENIXATC’10, USENIX Association, Berkeley, CA, USA, pp. 11–11.
URL http://dl.acm.org/citation.cfm?id=1855840.1855851
[101] Cooper, B. F., A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears (2010) “Benchmarking cloud serving systems with YCSB,” in Proceedings of the 1st ACM symposium on Cloud computing, SoCC ’10, ACM, New York, NY, USA, pp. 143–154.
URL http://doi.acm.org/10.1145/1807128.1807152
[102] Li, A., X. Yang, S. Kandula, and M. Zhang (2010) “CloudCmp: comparing public cloud providers,” in Proceedings of the 10th annual conference on Internet measurement, New York, NY, USA.
URL http://doi.acm.org/10.1145/1879141.1879143
[103] Johnson, R. W. and W. G. Lewellen (1972) “Analysis of the Lease-or-Buy Decision,” Journal of Finance, 27(4), pp. 815–23.URL http://ideas.repec.org/a/bla/jfinan/v27y1972i4p815-23.html
[104] Hajjat, M., X. Sun, Y.-W. E. Sung, D. Maltz, S. Rao, K. Sripanidkulchai, and M. Tawarmalani “Cloudward bound: planning for beneficial migration of enterprise applications to the cloud,” in Proceedings of the ACM SIGCOMM 2010 conference.
URL http://doi.acm.org/10.1145/1851182.1851212
[105] Windows Azure Lessons Learned, http://channel9.msdn.com/Blogs/benriga/.
[106] Ipek, E., S. A. McKee, R. Caruana, B. R. de Supinski, and M. Schulz (2006) “Efficiently exploring architectural design spaces via predictive modeling,” in Proceedings of the 12th international conference on Architectural support for programming languages and operating systems, ASPLOS-XII, ACM, New York, NY, USA, pp. 195–206.
URL http://doi.acm.org/10.1145/1168857.1168882
[107] Lee, B. C. and D. M. Brooks (2006) “Accurate and efficient regression modeling for microarchitectural performance and power prediction,” SIGOPS Oper. Syst. Rev., 40, pp. 185–194.
URL http://doi.acm.org/10.1145/1168917.1168881
[108] Stewart, C., T. Kelly, A. Zhang, and K. Shen (2008) “A dollar from 15 cents: cross-platform management for internet services,” in USENIX 2008 Annual Technical Conference on Annual Technical Conference, USENIX Association, Berkeley, CA, USA, pp. 199–212.
URL http://portal.acm.org/citation.cfm?id=1404014.1404029
[109] Xu, J., M. Zhao, J. Fortes, R. Carpenter, and M. Yousif (2008) “Autonomic resource management in virtualized data centers using fuzzy logic-based approaches,” Cluster Computing, 11, pp. 213–227.
URL http://portal.acm.org/citation.cfm?id=1395064.1395067
[110] Kundu, S., R. Rangaswami, K. Dutta, and M. Zhao (2010) “Application performance modeling in a virtualized environment,” in HPCA ’10, pp. 1–10.
[111] Kossmann, D., T. Kraska, and S. Loesing (2010) “An evaluation of alternative architectures for transaction processing in the cloud,” in Proceedings of the 2010 international conference on Management of data, SIGMOD ’10, ACM, New York, NY, USA.
[112] Chen, G., W. He, J. Liu, S. Nath, L. Rigas, L. Xiao, and F. Zhao (2008) “Energy-aware server provisioning and load dispatching for connection-intensive internet services,” in Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, NSDI’08, USENIX Association.
URL http://portal.acm.org/citation.cfm?id=1387589.1387613
[113] Guha, S., N. Daswani, and R. Jain (2006) “An Experimental Study ofthe Skype Peer-to-Peer VoIP System,” in Proceedings of the 5th InternationalWorkshop on Peer-to-Peer Systems, 2006.
[114] Wang, Q., D. Makaroff, H. K. Edwards, and R. Thompson (2003) “Workload characterization for an E-commerce web site,” in Proceedings of the 2003 conference of the Centre for Advanced Studies on Collaborative research, CASCON ’03, IBM Press.
URL http://portal.acm.org/citation.cfm?id=961322.961372
[115] Arlitt, M. and T. Jin (1999) Workload Characterization of the 1998 World Cup Web Site, Tech. rep., IEEE Network.
[116] Chen, X. and J. Heidemann (2005) “Flash crowd mitigation via adaptive admission control based on application-level observations,” ACM Trans. Internet Technol., 5, pp. 532–569.
URL http://doi.acm.org/10.1145/1084772.1084776
[117] Ramamurthy, P., V. Sekar, A. Akella, B. Krishnamurthy, and A. Shaikh (2008) “Remote profiling of resource constraints of web servers using mini-flash crowds,” in USENIX 2008 Annual Technical Conference on Annual Technical Conference, USENIX Association, Berkeley, CA, USA, pp. 185–198.
URL http://portal.acm.org/citation.cfm?id=1404014.1404028
[118] Urgaonkar, B., P. Shenoy, A. Chandra, P. Goyal, and T. Wood (2008) “Agile dynamic provisioning of multi-tier Internet applications,” ACM Trans. Auton. Adapt. Syst., 3, pp. 1:1–1:39.
URL http://doi.acm.org/10.1145/1342171.1342172
[119] Greenberg, A., J. Hamilton, D. A. Maltz, and P. Patel (2008) “The cost of a cloud: research problems in data center networks,” SIGCOMM Comput. Commun. Rev., 39, pp. 68–73.
URL http://doi.acm.org/10.1145/1496091.1496103
[120] 2010 IT Salary Survey, http://ejobdescription.com/Salary.htm#epm1_1.
[121] Microsoft War on Cost Study, http://download.microsoft.com/download/1/F/8/1F8BD4EF-31CC-4059-9A65-4A51B3B4BC98/Hyper-V-vs-VMware-ESX-and-vShpere-WP.pdf.
Vita
Byung Chul Tak
Byung Chul Tak is a Ph.D. candidate in the Department of Computer Science and Engineering at The Pennsylvania State University. He entered the doctoral program in September 2006. He received his BS in computer science from Yonsei University, Korea, in 2000, and his MS in computer science from KAIST, Korea, in 2003. From 2004 to 2006, he was a research engineer in the embedded software center at ETRI (Electronics and Telecommunications Research Institute), the national research laboratory in Korea. His research interests include virtual machines, cloud computing, operating systems, and distributed systems. He has coauthored papers published in USENIX ATC, HotCloud, ICDCS, and ISPASS. He spent three summers at IBM Research, Hawthorne as a research intern.