
The Pennsylvania State University

The Graduate School

SYSTEMS INFRASTRUCTURE FOR SUPPORTING UTILITY

COMPUTING IN CLOUDS

A Dissertation in

Computer Science and Engineering

by

Byung Chul Tak

© 2012 Byung Chul Tak

Submitted in Partial Fulfillment

of the Requirements

for the Degree of

Doctor of Philosophy

May 2012

The dissertation of Byung Chul Tak was reviewed and approved∗ by the following:

Bhuvan Urgaonkar

Associate Professor of Computer Science and Engineering

Dissertation Advisor, Chair of Committee

Anand Sivasubramaniam

Professor of Computer Science and Engineering

Trent Jaeger

Associate Professor of Computer Science and Engineering

Qian Wang

Associate Professor of Mechanical and Nuclear Engineering

Rong N. Chang

Research Staff Member & Manager at IBM Research

Special Member

Raj Acharya

Professor of Computer Science and Engineering

Head of the Department of Computer Science and Engineering

∗Signatures are on file in the Graduate School.

Abstract

The recent emergence of cloud computing is considered an important enabler of the long-cherished paradigm of utility computing. Utility computing represents the desire to have computing resources delivered, used, paid for, and managed with assured quality, much like other commoditized utilities such as electricity. The principal appeal of utility computing lies in the systematized framework it creates for the interaction between providers and consumers of computing resources. While current clouds are undoubtedly progressing towards this goal, they lack some of the crucial features necessary to realize a mature utility. First, one foundational feature of a utility is the ability to accurately measure and manage the usage of its resources by its various consumers. In modern VM-based cloud platforms, providers of cloud services face significant difficulties in obtaining an accurate picture of resource consumption by their consumers. Second, consumers of a utility expect to have systematic ways to infer their resource needs so that they can make cost-effective resource procurement decisions. However, current cloud consumers are ill-equipped to make such decisions because of a lack of information regarding resource quantities and their implications for application performance.

In the first part of the dissertation, we consider provider-side issues of resource accounting. It is nontrivial to correctly apportion the usage of a shared resource in the cloud to its various users. Towards achieving an accurate understanding of the overall resource usage within the cloud, we develop a technique for dynamically discovering the various resources that are directly or indirectly being used by a consumer's application. This, in turn, enables us to build techniques for accurately accounting the resource usage. The benefits of our approach are explained by comparing with, and illustrating the problems of, state-of-the-art methods in resource accounting. In the next part of the dissertation, we focus on the problem of understanding the cost of application deployments to the cloud from the consumer's perspective. Employing empirical approaches to estimate the resource requirements of the target application, we present how to systematically incorporate important systems characteristics, such as workload intensity, growth, and variance, as well as a comprehensive set of hosting options, into determining the economic feasibility of application deployment in the cloud.


Table of Contents

List of Figures

List of Tables

Acknowledgments

Chapter 1  Introduction
  1.1  Motivation
  1.2  Scope and Outline of Dissertation
       1.2.1  Provider-end Resource Accounting
              1.2.1.1  Dependency Discovery
              1.2.1.2  Resource Usage Inference
       1.2.2  Consumer-end Decision Making

Chapter 2  Related Work
  2.1  Provider-end Resource Usage Inference and Accounting
       2.1.1  Statistical Inference-based Technique
       2.1.2  System-dependent Instrumentation
       2.1.3  Resource Accounting
  2.2  Consumer-end Decision Making

Chapter 3  Provider-end Dependency Discovery
  3.1  Introduction and Background
  3.2  Dependency Discovery: Problem Statement and Requirements
  3.3  Inadequacy of Existing Approaches
  3.4  Proposed Solution: vPath
       3.4.1  Design and Implementation of our Dependency Discovery Technique
              3.4.1.1  Implementation
       3.4.2  Applicability to Other Software Architectures
       3.4.3  Usefulness of Proposed Solution
  3.5  Evaluation
       3.5.1  Applications
       3.5.2  Overhead of vPath
       3.5.3  Dependency Discovery for vApp
       3.5.4  Dependency Discovery for TPC-W
       3.5.5  Dependency Discovery for RUBiS and MediaWiki
       3.5.6  Discussion on Benchmark Applications
  3.6  Summary and Conclusion

Chapter 4  Provider-end Resource Usage Inference
  4.1  Introduction
       4.1.1  Usefulness of Resource Accounting Information
  4.2  Problem Definition
  4.3  Our Approaches
  4.4  Design and Implementation: sAccount
       4.4.1  Local Monitoring
              4.4.1.1  Identification of S and T
              4.4.1.2  Identifying Resource Principals & Scheduling Events
       4.4.2  Collective Inference
       4.4.3  Implementation of sAccount
  4.5  Evaluation
       4.5.1  Accounting Accuracy for a Synthetic Shared Service
       4.5.2  Accounting for Real-world Services
              4.5.2.1  Clustered MySQL as the Shared Service
              4.5.2.2  HBase as the Shared Service
  4.6  Summary and Conclusion

Chapter 5  Consumer-end Decision Making
  5.1  Introduction
  5.2  Background
       5.2.1  Net Present Value
       5.2.2  Cost Components
       5.2.3  Application Hosting Choices
       5.2.4  Determining Hardware Size Requirement
              5.2.4.0.1  In-house Provisioning
              5.2.4.0.2  Cloud-based Provisioning
  5.3  Analysis
       5.3.1  Data Transfer, Vertical Partitioning
       5.3.2  Storage Capacity, Software Licenses
       5.3.3  Workload Variance and Cloud Elasticity
       5.3.4  Horizontal Partitioning
       5.3.5  Indirect Cost Components
  5.4  Summary and Conclusion

Chapter 6  Conclusion and Future Work

Bibliography

List of Figures

1.1  Cloud-based hosting of a consumer's application through the Virtual Cluster interface. Virtual Cluster is a generalized version of the interface exposed to the consumer through which they can specify desired quantities of the various resources they need, such as computing power, storage, and networks. The figure shows one example mapping of virtual components to actual physical resources as determined by the provider's management algorithms.

1.2  N-to-m relationship between consumer and provider. Each consumer has its own types of applications it wishes to deploy in the cloud. Many questions, as noted in the figure, can arise on the consumer side. Providers offer various hosting choices to the consumers. They all have different sets of virtual resources with different pricing and performance characteristics.

3.1  Example deployment of VCs from chargeable entities CEA and CEB. Two VCs share the database server instance. This sharing is transparent to the chargeable entities, since the decision of who shares the service instance rests with the provider. The abstractions of "Set of Used Servers" and "Resource Accounting Tree" are also labeled.

3.2  The principal idea of finding causality in our proposed solution.

3.3  Multi-threaded server architecture.

3.4  Event-driven and SEDA model architectures.

3.5  The topology of the TPC-W benchmark set-up.

3.6  The topology of vApp used in evaluation.

3.7  TPC-W response time and CPU utilization.

3.8  Examples of vApp's paths discovered by vPath. The circled numbers correspond to VM IDs in Figure 3.6.

3.9  CDF of vApp's response time, as estimated by vPath and actually measured by the vApp client.

3.10 Typical paths discovered by the vPath technique.

4.1  A portion of a platform that hosts two applications, each a CE, and the servers hosting their components. Arrows indicate communication between components. Also shown is a shared service: a database used by both the CEs. The shared service itself consists of multiple software components, some of which are exercised by the CEs indirectly (e.g., the "Data Store"), i.e., via requests made to other components (e.g., the "Front-end").

4.2  Overall architecture of the sAccount implementation.

4.3  Solution concept. The start and end of CPU accounting is determined by the arrival of messages and the departure of response messages. As thread_x of VM2 sends a message to thread_A of VM1, VM1 starts to account the CPU usage of thread_A to CE1. This binding stops when thread_A sends the reply back. CPU usage of thread_B is not charged to CE1 in between. This requires us to be able to detect thread scheduling events.

4.4  Design and configuration of our synthetic shared service and the CEs exercising it.

4.5  Impact of burstiness and shared-service resource utilization on the accuracy offered by sAccount versus LR. We use our synthetic shared service along with three chargeable entities. We compare the percentage error in CPU accounting for sAccount and LR. We label the errors for our three chargeable entities with LR as CE1, CE2, and CE3, respectively, and label their average as "LR Average." In all cases, the accounting information offered by sAccount shows less than 1% error (we plot the average error for the three CEs).

4.6  Impact of caching and the number of CEs on the accuracy of resource accounting of (a) network and (b) CPU for our synthetic shared service. The number of CEs is three in (a). We plot the average error across all CEs and the standard deviation.

4.7  Shared MySQL cluster setting. Three CEs labeled CE1, CE2, and CE3 share this database service.

4.8  Network traffic and individual CPU utilization time-series. Graph (a) shows the network traffic exchanged between SN and each of our three CEs, and forms part of the input to LR. Graph (b) presents the CPU usage at SN induced by each of the CEs when it runs separately from the other CEs, as part of the offline profiling that we do. These usages serve as our estimate of the ground truth for the CPU usage each of the CEs induces in the actual experiment. The resource accounting results of LR and sAccount should be compared with (b) to see how far from this estimated ground truth they are.

4.9  Comparison of CPU accounting results. The CPU usage of the MySQL Cluster SQL node is being accounted. In (a) the accounting starts at time 200 since LR needs to collect some amount of data. By comparing the areas of equivalent color we can see the rank order determined by each technique as well as the accuracies. Please compare with Figure 4.8(b) to see the true CPU consumption. The result of sAccount includes the 'unaccountable' portion. This can be divided among chargeable entities by any reasonable policy.

4.10 Response time change of the RUBiS application. This graph shows the development of RUBiS response time for two cases: throttled by LR, and controlled by sAccount. Since LR picks the wrong CE (CE3) as the culprit for the performance degradation, throttling the request rate of CE3 is ineffective. However, the sAccount technique shows noticeable effects. The moving-average response time under sAccount indicates that sAccount can contain the performance interference.

4.11 Our setup for using HBase as a shared service. Our CEs are based on client programs that use the YCSB workload generator.

4.12 Evolution of the network traffic incoming to the region server from the two CEs during the run. Both CEA and CEB start out by sending similar requests to HBase during t=0 to t=100s, implying the network bandwidth and CPU usage of the region server should be accounted equally to them for this period. CEA changes its behavior at t=100s, whereas CEB changes its behavior at t=200s.

4.13 Profiling measurement of CPU and network resources at the region server for CEA and CEB. These are obtained from individual (not combined) runs of each chargeable entity. They are intended to serve as an estimate of the true resource usage quantity when analyzing the performance of the resource accounting results.

4.14 Comparison of resource accounting results between LR and sAccount on the outbound network traffic size from the data node to the region server. The contour of (a), drawn in a thick line, is obtained via iptables. Note the resemblance of this to the overall traffic size of (b) as a quick sanity check of the sAccount technique.

4.15 Resource accounting by sAccount at various nodes of HBase. Result (b) provides proof of sAccount's capability to perform resource accounting multiple hops away from the front-end of the shared service. Note that the data node of HBase does not have direct communication with any of the chargeable entities.

5.1  Taxonomy of costs involved in in-house and cloud-based application hosting. Costs can be classified according to quantifiability and directness. Quantifiable costs are grouped into material, labor, and expenses. The material category roughly corresponds to capital expenses (Cap-Ex). The labor and expenses categories correspond to operational expenses (Op-Ex). In this study we focus on the quantifiable category.

5.2  Marginal throughput measurements. Both graphs show how much throughput gain there is from adding one more unit of resource. For the JBoss server (a), we observe the marginal gain from adding one more server that has a single core. For MySQL (b), we add one more CPU core and observe the marginal gain.

5.3  EC2 instance CPU microbenchmark results. Graph (a) compares the latency of finishing the same number of arithmetic operations. Roughly, the CPU of an EC2 instance is one third of our reference machine. Graph (b) shows that the distribution of EC2 CPU bandwidth is bimodal.

5.4  NPV over a 10-year time horizon for TPC-W. We consider three different workload intensities: small (20 tps at t=0), medium (100 tps), and high (500 tps). We also consider two different workload growth rates of 0% and 20% per year.

5.5  Closer look at the cost components for four cloud-based application deployment options in the 5th year. The initial workload is 100 tps and the annual growth rate is 20%.

5.6  Cost break-down of TPC-E at the 6th year.

5.7  Two sets of TPC-E results at initial workloads of 300 tps and 900 tps.

5.8  Effect of workload variance on the cost of in-house hosting for TPC-W. The workload is 100 tps and the growth rate is 20% per year.

5.9  Timeseries x_t and workload distribution f(x) for a fixed τ.

5.10 Cost behavior of horizontal partitioning as a function of the varying threshold value. Although not shown in the graph, at PAR=11 (at 5.5K on the x-axis) the cost is $625K.

5.11 Impact of labor cost for the medium workload intensity using TPC-W. The cases for small and large workload intensities are not shown. For small workloads, cloud-based hosting is always cheaper. For large workloads, in-house hosting is always cheaper.

5.12 (a) Effect of labor costs on the hosting decision in relation to workload intensity and business size. (b) Stacked view of the in-house cost at year 5.

List of Tables

1.1  Summary of the problems addressed in the following chapters.

3.1  Response time and throughput of TPC-W. "App Logging" represents a log-based tracking technique that turns on logging on all tiers of TPC-W.

3.2  Performance impact of vPath on RUBiS.

3.3  Worst-case overhead of vPath and breakdown of the overhead. Each row represents the overhead of the previous row plus the overhead of the additional operation on that row.

4.1  Usage pattern of AWS shared services. The total number of applications is 120. RDS in AWS is not a shared service as we define it here, since consumers own separate VM instances. 'Custom DB' means that the user has installed their own database within EC2 instances.

4.2  Description of how the workload imposed by the three CEs is varied over the course of our experiment with the MySQL cluster as our shared service.

5.1  Labor cost per server. We differentiate three different ratios of IT staff to servers. The average cost per server per hour (3rd column) is based on an average IT staff salary of $33/h.

Acknowledgments

This dissertation could not have been written without support from many people. First of all, I am deeply indebted to my dissertation advisor, Professor Bhuvan Urgaonkar, for his guidance. His insights and advice were most helpful in shaping and developing this dissertation. I would also like to thank my doctoral committee members, Professor Anand Sivasubramaniam, Professor Trent Jaeger, Professor Qian Wang, and Dr. Rong N. Chang for accepting the role of committee members and providing useful feedback. I feel fortunate to have spent three summers at the IBM T.J. Watson Research Center as an intern under the supervision of Dr. Rong N. Chang. It was a wonderful experience to be able to work with Dr. Chunqiang Tang and Dr. Chun Zhang. I would also like to express thanks to the CSL members, Shiva Chaitanya, Sriram Govindan, Arjun Nath, Dharani Sankar Vijayakumar, Niranjan Soundara, Youngjin Kwon, Aayush Gupta, Di Wang, Chuangang Ren, Jeonghwan Choi, Euiseong Seo, Srinath Sridharan, Cheng Wang, and Bo Zhao. They have been wonderful colleagues and friends, and I have really enjoyed working and spending time with them.

I cannot express my gratitude enough to my family for their unconditional love and support over such a long period of time. I would like to thank my parents for their encouragement. I am also grateful to my father-in-law and mother-in-law for their support. My gratitude extends to my sister, my sister-in-law, and their newly-born son as well. Finally and most importantly, I deeply thank my wife, Sekyoung Huh, for being supportive and encouraging all along without losing faith in me. Without her support, I cannot imagine how I could have finished my dissertation.


Chapter 1
Introduction

1.1 Motivation

Cloud computing has emerged as a novel model for IT hosting, impacting many sectors ranging from industry to academia. Several industry giants and mid-size/small enterprises, as well as academic/research institutions, are conducting or considering major restructuring of their IT infrastructure to take advantage of cloud computing offerings. Among these cloud offerings are those based on infrastructure-as-a-service (e.g., Amazon's Elastic Compute Cloud (EC2) [1] and Simple Storage Service [2], Sun Grid [3], Rackspace [4]), software/platform-as-a-service (e.g., Microsoft Azure [5], Google App Engine [6]), and a number of others. As is often the case with a newly emergent technology, different views exist on the meaning and scope of cloud computing [7, 8]. Despite the difficulty this poses, the defining feature of cloud computing, cross-cutting all these views, is easily identified as the separation it offers between the usage and management of IT infrastructure. In its most basic form, cloud computing can be viewed as creating two distinct sets of entities, the provider and the consumer of IT. The provider owns and manages IT resources, thereby relieving the consumer of these responsibilities and allowing them to focus solely on using these resources. The myriad views of cloud computing that have emerged can be seen as different takes on the details of how the partitioning of these responsibilities occurs and what the interfaces offered to the consumers are. Regardless of these different kinds of offerings, the growing popularity of cloud computing, slated to continue its momentum based on many indicators [9, 10], can be attributed to two main reasons:

• The growing complexity of managing IT, and the ensuing need for specialization, is claimed to have rendered out-sourcing of many IT management responsibilities to experts desirable and cost-effective. This trend has expanded to apply to a wide spectrum of consumers, ranging from vendors of Internet-scale services and managers of enterprise/department-scale IT environments to even individual users. As an example of the latter, Jeff Barr, Senior Manager of Cloud Computing Solutions at Amazon, reported that at the end of 2011 there were 762 billion objects stored in Amazon S3, and the peak number of objects being processed reached over 500,000; compared with the previous year, this was a growth of 192%.

• During the past decade, we have witnessed a proliferation of large data centers with substantial investments in IT infrastructure. The resulting economies of scale, gains derived from statistical multiplexing, and concentration of expertise imply that the owners of such facilities are well-positioned to take up the role of providers of cloud-hosted services of various kinds.

The Problem. Cloud computing is viewed by many as one promising realization of the long-cherished utility computing paradigm [11, 12, 13]. Utility computing represents the desire to have IT acquired, delivered, used, paid for, and managed in a way similar to how we use other commoditized utilities such as electricity, telephone service, cable television, etc. The principal appeal of utility computing lies in the systematized framework it could create for the interaction between providers and consumers of IT resources.

Current cloud computing offerings lack some of the crucial features necessary to realize a mature utility. First, one of the foundational features of a utility is its ability to accurately measure and charge for the allocation and usage of the commodity being exchanged between a provider and a consumer: we refer to this as accounting of resources. Resource accounting may be crucial to a provider for conducting accurate billing. It also provides valuable information for a variety of resource allocation and tuning decisions. However, it is well known that resource accounting is non-trivial even within a single server that consolidates multiple applications [14]. Often the workloads emerging from different user-level software entities are multiplexed in complex ways. It is often unclear how to "charge" different user-level applications for the resources used on their behalf by other software, including systems software (e.g., operating systems or virtual machine monitors) or runtimes (e.g., garbage collectors). Similarly, at a larger scale, accounting of the resource usage of shared services (e.g., a shared file system or a shared database server) by a group of distinct resource-consuming entities is not a straightforward problem. In particular, resource virtualization and the complex, distributed nature of modern consumer applications make it non-trivial. One major challenge in performing resource accounting stems from the fact that the granularity at which resource consumption is measured within the system does not match the granularity at which computing resources are consumed. This implies that we need to process and transform the measured resource usage data into the desired granularity, where accuracy may be lost if certain conditions are not met. While problems very closely related to this have received attention [15, 16, 17, 18, 19, 20, 21, 22, 23, 24], existing solutions have shortcomings in their generality and accuracy. Note that accountability has other connotations (e.g., guarantees related to executing the right software [25]), and we only use it in the specific sense related to resource allocation described above.

Second, as cloud-based offerings mature, the interfaces exposed to consumers grow richer, and multiple providers compete with each other, we envision a cloud-based utility where a consumer would desire to conduct complex decision-making, involving trade-offs between the cost it spends towards procuring resources and the value it receives. If desirable, a consumer might procure resources from multiple providers. Additionally, it might dynamically modulate this set of providers as well as the resources it procures from them. Such decision-making is exercised, to different degrees of complexity, by consumers of existing utilities. Examples range from (a) the extremely simple choice between multiple providers and monthly subscription plans for telephone service to (b) the more complex case of certain electricity consumers being able to adapt (e.g., via re-scheduling of tasks) their usage to time-varying power prices that are exposed to them by their electricity provider [26]. How could a consumer of the IT utility navigate a decision-space comprising such trade-offs?


Figure 1.1. Cloud-based hosting of a consumer's application through the Virtual Cluster interface. Virtual Cluster is a generalized version of the interface exposed to the consumer through which they can specify desired quantities of the various resources they need, such as computing power, storage, and networks. The figure shows one example mapping of virtual components to actual physical resources as determined by the provider's management algorithms.

Specifically, it should be able to incorporate into its decision-making multiple features affecting its cost versus revenue, especially the multiple hosting options that open up in a cloud-based utility environment: in-house IT hosting, cloud-based hosting, and myriad "hybrid" hosting options spanning in-house and cloud. In this dissertation, we develop techniques and underlying enabling systems mechanisms to address these problems.

Current cloud-based offerings come in a variety of forms. For example, the "Infrastructure as a Service" (IaaS) model exposes virtualized hardware (e.g., in its most general form, a virtualized data center) on which the consumer can run its entire software stack, including applications and systems software [1, 27, 4, 3]. The consumer is allowed partial control over resource management via these virtual resources, which are securely multiplexed with those of other consumers on the provider's infrastructure. The "Software as a Service" (SaaS) model, on the other hand, only allows the consumer access to certain software hosted by the provider for collaborating with other components of the consumer's application, along with the facility to run this application on the provider's infrastructure; the consumer is given little exposure to and participation in resource management decisions [6]. Between SaaS and IaaS lies the "Platform as a Service" (PaaS) model, in which consumers are given tools and APIs they must use to write their applications. In PaaS, consumers can partially control some aspects of resource management by specifying options such as how much to replicate and how much storage to use [5]. Another distinction worth noting is that between public and private clouds, where the former are intended as a general utility while the latter are meant to cater to a restricted set of consumers, such as those within an organization. However, the problems we address in this dissertation are not limited to any specific type of cloud, and are equally relevant to both.

We assume a model of provider-consumer interaction that is general enough to encompass these various options. Figure 1.1 provides an overview of the various relevant entities and abstractions using an illustrative consumer application being hosted by a cloud-based provider. We assume that the underlying physical infrastructure of the cloud being managed by the provider is a state-of-the-art data center. Such a data center consists of a large cluster of high-end servers with multi-core processors, several GB of memory, and some local storage, interconnected by a high-bandwidth network for communication. In addition, these servers are also connected to a consolidated high-capacity storage device/utility through one or more SANs, each of which facilitates data sharing and the migration of applications between servers without explicit movement of data. These servers are connected via some gateway to the Internet to service end-user requests from clients of consumer applications. We assume that the provider implements distributed resource management software spanning its physical infrastructure that is responsible for securely partitioning and multiplexing resources among applications belonging to different consumers; such mechanisms have been researched extensively [28, 29, 30, 31, 32, 33]. Current providers offer numerous ways for a consumer to specify the resource needs of its application. We expect these interfaces to become richer and more expressive as the cloud computing model becomes more popular and competition between providers grows. In this dissertation, we assume that the provider exposes a very general interface, called a virtual cluster (VC), via which it allows a consumer to specify its resource needs. A VC consists of: (i) a set of virtual servers, (ii) a virtual network amongst them as well as connectivity to the outside world, and (iii) virtual shared storage and bandwidth to it. Virtual servers come with their CPU, main memory, and local secondary storage specifications. The owner of a VC becomes what we call a chargeable entity. A chargeable entity is a portion or a collection of software applications hosted by the cloud provider whose resource usage needs to be tracked and accounted separately from other such entities. Often it is the same as a consumer application or its owner.
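To make the VC interface concrete, here is a minimal sketch of what a virtual cluster specification might look like in code. It is purely illustrative: the dissertation prescribes no concrete schema, and every class and field name below is our own invention.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class VirtualServer:
        # Per-server specification: CPU, main memory, local secondary storage.
        cpu_cores: int
        memory_gb: float
        local_storage_gb: float

    @dataclass
    class VirtualCluster:
        # (i) a set of virtual servers
        servers: List[VirtualServer] = field(default_factory=list)
        # (ii) a virtual network amongst them and to the outside world
        internal_bw_gbps: float = 1.0
        external_bw_gbps: float = 0.1
        # (iii) virtual shared storage and bandwidth to it
        shared_storage_gb: float = 0.0
        storage_bw_gbps: float = 0.0
        # The VC's owner is the chargeable entity for accounting purposes.
        chargeable_entity: str = ""

    vc = VirtualCluster(
        servers=[VirtualServer(2, 4.0, 100.0), VirtualServer(4, 8.0, 250.0)],
        shared_storage_gb=500.0,
        storage_bw_gbps=0.5,
        chargeable_entity="CE_A",
    )
    print(len(vc.servers), vc.chargeable_entity)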

We use a generic consumer application hosted within such a data center, as shown in Figure 1.1, to drive our discussion. In general, a consumer may choose to partition components of its distributed application across multiple such providers (not shown in the figure). The execution of this application involves the participation of: (i) software components of the application itself, as well as (ii) certain provider-owned software components securely shared by the provider's resource management software among hosted applications (e.g., a shared relational database service [5, 34] or a key-value store service [35, 36, 37]). A consumer is assumed to supply the provider with the data and executables needed by its application. The provider would multiplex its physical resources among the VCs of various hosted applications in ways that optimize some metric meaningful to it while adhering to certain constraints, an extensively-studied problem [38, 39, 29, 30, 40, 28, 41, 42, 43, 33]. Such multiplexing would be constrained by the nature of the guarantees provided to the consumers as well as the properties of their workloads and resource needs. We assume a general billing model that depends both on the resource allocation sought by the application and on its actual usage. Such a billing scheme implies an incentive for the consumer to estimate its resource needs well, and also to control its usage of actual resources well via the various knobs available to it. At the same time, it also captures the incentive that the provider has to efficiently utilize and allocate its physical resources.
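To illustrate (this formalization is ours, not the dissertation's), such a two-part billing model for a chargeable entity $c$ might take the form

$$\mathrm{Bill}_c \;=\; \sum_{r} \left( \alpha_r \, \mathrm{alloc}_c^{\,r} \;+\; \beta_r \, \mathrm{usage}_c^{\,r} \right),$$

where $r$ ranges over resource types, $\alpha_r$ prices the allocation reserved by the consumer, and $\beta_r$ prices metered usage. The first term rewards estimating resource needs well; the second rewards controlling actual usage.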

Taking a more expansive view, the interaction between multiple providers and consumers takes the form of the many-to-many relationship shown in Figure 1.2. Consumers have different types of applications they wish to host in the cloud and/or in their in-house facilities, including possibly private clouds (depicted as three different applications in the figure), and the cloud providers have offerings with different pricing policies and performance characteristics.

[Figure 1.2 appears here. Consumer-side questions shown in the figure: "Migrate to cloud, or use in-house?"; "Which cloud to choose?"; "Does the cloud deliver sufficient performance?"]

Figure 1.2. N-to-m relationship between consumer and provider. Each consumer has its own types of applications it wishes to deploy in the cloud. Many questions, as noted in the figure, can arise on the consumer side. Providers offer various hosting choices to the consumers. They all have different sets of virtual resources with different pricing and performance characteristics.

Both kinds of parties would be interested in maximizing their own utility functions. For consumers, the utility function might be defined in terms of factors such as usage costs, performance, and/or service convenience levels. Consumers will try to collect relevant information about available providers and take the necessary actions to maximize this function. For example, they may choose to alternate between several clouds regularly if there are certain time periods (e.g., within a day cycle) in which one cloud's performance is significantly better or one cloud charges a lower rate than the others. One consumer strategy might be to use a cloud service only when workload exceeds a certain level. On the other hand, the provider's utility function might be defined in terms of its revenue and costs (which determine its profit). Providers may try to draw more consumers into their service in order to increase revenue. They may also apply various systems optimizations to minimize their operational and capital expenditures.

In one possible evolution of these relationships, being explored by some researchers [44, 45], each provider and consumer is a "selfish" agent interested in optimizing its own utility/satisfaction. Other possibilities also exist for how this "cloud world" might evolve, such as a subset of partially cooperating providers. Similarly, more sophisticated pricing schemes might evolve in the future than the current instance-usage-based billing; Amazon EC2 already offers spot pricing for some of its instances, for example [46]. Whereas the exact resource management techniques desirable for providers and consumers will closely depend on how the interactions between providers and consumers evolve, the two problems we choose to address in this dissertation are universally applicable and, hence, a useful set of contributions to cloud-based utility computing.

Table 1.1. Summary of the problems addressed in the following chapters.

  Chapter 3: Dependency discovery problem. The problem of discovering causal dependencies established via message exchanges between application components.
  Chapter 4: Resource accounting problem. The problem of determining the accurate resource usage of participating entities or groups of entities, called chargeable entities.
  Chapter 5: Consumer-side application deployment decision problem. The problem of selecting the most cost-effective deployment option for consumer applications in the cloud.

1.2 Scope and Outline of Dissertation

In order for the vision of utility computing to be realized through the cloud computing model described above, problems on both the provider and consumer sides must be addressed. For the provider, resource accounting capabilities must be significantly improved over the current state of the art. For the consumer, an intelligent decision-making framework for cloud-based application deployment must be established. Toward these ends, this dissertation studies and develops systems facilities with supporting mechanisms. We describe these below, as well as our approach for addressing them. Table 1.1 summarizes the problems addressed in each chapter of this dissertation.


1.2.1 Provider-end Resource Accounting

As a solution for the provider-end problem, we design and build techniques that can deliver the desired level of correctness and accuracy in resource accounting. We formulate our solution to the resource accounting problem as the construction of two data structures. The first data structure is what we call the Set of Used Servers. The 'Set of Used Servers' captures which server nodes are consuming resources on behalf of each chargeable entity at a certain time granularity. We call this relationship, formed by resource consumption, a dependency. The second data structure is the Resource Accounting Tree. It captures which chargeable entities are consuming how much of each resource within a server node. The 'Resource Accounting Tree' is also a time-varying data structure. More detailed explanation and illustrations of these data structures are given in Chapter 3. We obtain the final resource usage information by combining the information from these two data structures.
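To fix intuition, here is a small Python sketch of how the two data structures compose. The data shapes and numbers are our own toy illustration rather than the dissertation's implementation: the Set of Used Servers maps each chargeable entity (CE) to the servers it uses, the Resource Accounting Tree is flattened to its per-CE leaf entries, and a CE's total usage of a resource is the sum of its leaves over its used servers.

    # A toy illustration of combining the two accounting data structures.

    # Set of Used Servers: for each CE, the servers whose resources it used
    # (directly or indirectly) during the interval [t, t + delta].
    used_servers = {
        "CE_A": {"s1", "s2", "s3", "s8", "s9"},
        "CE_B": {"s4", "s5", "s9"},
    }

    # Resource Accounting Tree, flattened to its leaves:
    # rat[server][resource] -> {principal: usage}, where a principal is either
    # a CE itself or a system component working on a CE's behalf (e.g., VMM).
    rat = {
        "s9": {"cpu": {"CE_A": 32.0, "CE_B": 18.0,
                       "vmm:CE_A": 3.0, "vmm:CE_B": 2.0}},
        # ... entries for the other servers would appear here ...
    }

    def usage_for(ce: str, resource: str) -> float:
        """Total usage of `resource` attributable to `ce` across its servers."""
        total = 0.0
        for server in used_servers.get(ce, ()):
            leaves = rat.get(server, {}).get(resource, {})
            # A leaf belongs to the CE if it is the CE's own usage or usage
            # performed on its behalf (marked here with a ":<CE>" suffix).
            total += sum(v for k, v in leaves.items()
                         if k == ce or k.endswith(":" + ce))
        return total

    print(usage_for("CE_A", "cpu"))  # 35.0 in this toy example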

1.2.1.1 Dependency Discovery

To conduct accounting at a particular server, the first logical step is to identify which chargeable entities cause resource consumption on this server. We refer to the problem of determining at which servers each chargeable entity causes resource consumption (directly or indirectly) as dependency discovery. Dependency discovery is equivalent to finding the set of used servers. The study of dependency discovery in this dissertation is focused on building the 'set of used servers' data structure via a novel technique for causality discovery. This research is the topic of Chapter 3.

1.2.1.2 Resource Usage Inference

In the resource usage inference step, our focus is to construct the 'resource accounting tree' data structure for each resource type (CPU, network, or disk I/O) at each server node. The scope of this study is, first, to develop a VM-transparent (i.e., implemented at the hypervisor layer) and accurate resource accounting technique. Second, we study how much more effective it is compared to the state-of-the-art technique. The baseline for comparison is a technique that uses common monitoring utilities with the well-known linear regression method. Finally, we demonstrate the usefulness of our resource accounting solution by constructing a scenario of SLA violation due to resource usage imbalance. We show that our solution correctly identifies the culprit and initiates targeted resource throttling for fairness, whereas the baseline technique fails. This study is presented in Chapter 4.
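For a feel of that baseline, the sketch below shows one plausible form of regression-based accounting. It is our own simplified illustration of the idea rather than the exact baseline of Chapter 4: per-CE request rates observed with common monitoring utilities are regressed against the server's aggregate CPU usage, and the fitted coefficients are then used to apportion new usage among the CEs.

    import numpy as np

    # Monitoring samples: each row is a time window; columns are per-CE
    # request rates observed at the shared server (e.g., from netstat-style
    # counters). y holds the server's aggregate CPU utilization (%) from a
    # utility such as top or sar over the same windows.
    X = np.array([
        [120.0,  40.0,  10.0],
        [ 90.0,  80.0,  15.0],
        [ 60.0, 120.0,  20.0],
        [ 30.0, 160.0,  25.0],
    ])
    y = np.array([23.0, 27.0, 31.0, 35.0])

    # Fit per-CE cost coefficients with ordinary least squares (a real
    # system would typically constrain the coefficients to be >= 0).
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Apportion a new window's aggregate CPU among the CEs in proportion
    # to their predicted contributions.
    x_new = np.array([100.0, 100.0, 18.0])
    contrib = coef * x_new
    shares = contrib / contrib.sum()
    print(dict(zip(["CE1", "CE2", "CE3"], shares.round(3))))

The weakness targeted in Chapter 4 is visible even in this toy: the regression can only attribute usage through correlations in coarse aggregate counters, so bursty or cached workloads can easily mislead it.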

1.2.2 Consumer-end Decision Making

Consumer-end decision making is concerned with determining whether a consumer application should migrate to a cloud and in what way. In this study, we employ empirical approaches to estimate the resource requirements of the consumer application and analyze the costs of a comprehensive set of hosting options, including in-house and multiple cloud-based choices. We present how to systematically incorporate important systems properties, such as workload growth, intensity, and variability, as well as subjective costs, into determining the economic feasibility of cloud-based hosting. We present this study in Chapter 5.

Chapter 2
Related Work

2.1 Provider-end Resource Usage Inference and Accounting

First, we review existing work related to provider-end dependency discovery techniques and resource accounting. Dependency discovery techniques fall into two classes: statistical inference-based techniques, in which data mining is applied to measurements to infer properties of interest, and instrumentation-based techniques, in which information is intentionally inserted into the software to aid in tracking its behavior.

2.1.1 Statistical Inference-based Technique

Aguilera et al. [15] proposed two algorithms for debugging distributed systems. The first algorithm finds nested RPC calls and uses a set of heuristics to infer the causality between nested RPC calls, e.g., by considering the time difference between RPC calls and the number of potential parent RPC calls for a given child RPC call. The second algorithm only infers the average response time of components; it does not build request-processing paths. WAP5 [21] intercepts network-related system calls by dynamically re-linking the application with a customized system library. It statistically infers the causality between messages based on their timestamps. By contrast, our method is intended to be precise: it monitors thread activities in order to accurately infer event causality. Anandkumar et al. [16] assume that a request visits distributed components according to a known semi-Markov process model. Their method infers the execution paths of individual requests by probabilistically matching them to footprints (e.g., timestamped request messages) using the maximum likelihood criterion. It requires synchronized clocks across distributed components. Spaghetti is evaluated through simulation on simple hypothetical process models, and its applicability to complex real systems remains an open question. Sengupta et al. [22] proposed a method that takes application logs and a prior model of requests as inputs. However, manually building a request-processing model is non-trivial and in some cases prohibitive. In some sense, the request-processing model is in fact the information that we want to acquire through monitoring. Moreover, there are difficulties with using application logs, as such logs may not follow any specific format and, in many cases, there may not even be any logs available.
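To convey the flavor of such timestamp-based heuristics, here is a simplified sketch; it is our own illustration of the general style, not the algorithm of [15] or [21]. A child call is attributed to the candidate parent whose active interval encloses it and whose start time is nearest:

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Call:
        name: str
        start: float  # timestamp of the request message
        end: float    # timestamp of the response message

    def infer_parent(child: Call, calls: List[Call]) -> Optional[Call]:
        """Guess the parent of `child` among `calls` by timestamps alone.

        Heuristic: among calls whose [start, end] interval encloses the
        child's, pick the one that started most recently. With many
        plausible enclosing parents this guess can be wrong, which is why
        precise thread-level tracking is preferable.
        """
        enclosing = [p for p in calls
                     if p is not child and p.start <= child.start
                     and child.end <= p.end]
        return max(enclosing, key=lambda p: p.start) if enclosing else None

    calls = [Call("frontend", 0.0, 10.0), Call("app", 1.0, 9.0),
             Call("db", 2.0, 3.0)]
    parent = infer_parent(calls[2], calls)
    print(parent.name if parent else None)  # prints "app"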

2.1.2 System-dependent Instrumentation

Magpie [17] is a tool-chain that analyzes event logs to infer a request's processing path and resource consumption. It can be applied to different applications, but its inputs are application-dependent. The user needs to modify the middleware, application, and monitoring tools in order to generate the needed event logs. Moreover, the user needs to understand the syntax and semantics of the event logs in order to manually write an event schema that guides Magpie in piecing together the events of the same request. Magpie does kernel-level monitoring for measuring resource consumption, but not for discovering request-processing paths. Pip [20] detects problems in a distributed system by finding discrepancies between actual behavior and expected behavior. A user of Pip adds annotations to application source code to log messaging events, which are used to reconstruct request-processing paths. The user also writes rules to specify the expected behaviors of the requests. Pip then automatically checks whether the application violates the expected behavior. Pinpoint [19] modifies middleware to inject end-to-end request IDs to track requests. It uses clustering and statistical techniques to correlate the failures of requests with the components that caused the failures. Chen et al. [18] used request-processing paths as the key abstraction to detect and diagnose failures, and to understand the evolution of a large system. They studied three examples: Pinpoint, ObsLogs, and SuperCall. All of them use intrusive instrumentation in order to discover request-processing paths. Stardust [23] uses source-code instrumentation to log application activities. An end-to-end request ID helps recover request-processing paths. Stardust stores event logs in a database, and uses SQL statements to analyze the behavior of the application.

2.1.3 Resource Accounting

Earlier work by Banga et al. [14] addressed the issue of correct resource accounting within a single host. They introduced a new abstraction, called resource containers, to be used as the resource principal within the kernel. The distributed resource container [47] extends the resource container to distributed environments, in which local resource containers are bound together by exchanging identifiers in order to coordinate resource consumption across hosts. In that work, the goal was to throttle energy consumption per application. Our work advances this thread of research by enabling resource accounting at various locations within the distributed application hierarchy, by exploiting a message causality tracking technique and thread-level monitoring capability.

Recall that there are two aspects to a resource accounting solution: local monitoring and collective inference. Some monitoring techniques do not require any modification to application software, and we label all of them as non-intrusive. These techniques span a spectrum of the effort involved in modifying the underlying systems software. Among the least intrusive techniques are popular user-level monitoring tools such as top, iostat, vmstat, sar, netstat, etc. These tools either rely upon OS system calls or read certain OS-provided information (such as /proc/stat) to find resource usage. Some techniques insert hooks within the systems/runtime software for data collection. While still non-intrusive according to our classification, they entail different amounts of additional effort in their design and use. For example, the tool OProfile [48] requires the insertion of a kernel module, or kernel recompilation with reconfiguration. Chopstix [49] adds a data structure, sketches, to the kernel in order to monitor low-level OS operations such as page allocation, mutex/semaphore locking, and CPU utilization. Kprobes [50] allows one to insert probes into kernel functions or addresses. Since it induces breakpoint exceptions, its performance degradation is known to be high. Using Kprobes may require turning on CONFIG_KPROBES and other configuration options and rebuilding the kernel, depending on the Linux distribution. It is intended to be used for kernel debugging. Xenprobes [51] can be used to inject breakpoints into the entry and exit points of any kernel function within a guest domain, similar to Kprobes. Although Xenprobes is designed with testing and debugging in mind, sAccount could certainly make use of Xenprobes to collect richer monitoring information. Unfortunately, the code is not publicly available as of now, preventing us from adopting it or investigating its feasibility. An intrusive technique, on the other hand, requires modifying the application itself. The additional information resulting from these modifications often allows more accurate/detailed monitoring. However, this comes at the cost of the added programming effort of modifying the application (which may be difficult or even impossible in many scenarios) and a possible run-time slowdown. For example, IBM ARM [52] requires recompilation after instrumenting the application with certain calls for monitoring the travel path of user requests through servers.

Monitoring itself is seldom the final goal; there are usually domain-specific higher-level goals (e.g., while the goal of inference in this work is resource accounting, frequently occurring goals in existing work are capacity planning [53, 54], debugging [15, 20, 21, 19], and performance management [55, 52, 56]). Once monitoring data is collected, the next step is to apply suitable inference techniques that process this data to derive the information needed for achieving such goals. The body of work on inference techniques is, of course, vast (see [54, 53, 57, 58] for some surveys) and spans the entire spectrum of statistical sophistication. For example, a non-intrusive tool mentioned earlier that needs to interpret application-specific logs (e.g., the access log for the popular Apache web server) can be considered a simple inference technique. On the other hand, the application of a TAN model to the classification problem of SLO violations [58] falls into the group of sophisticated inference techniques.


2.2 Consumer-end Decision Making

Questions related to cloud economics have been raised by several researchers and are drawing increasing attention. Gray [59] looked at economics in the context of distributed computing and computed the amount of resources a user could buy with one dollar in the year 2003. One of his conclusions was that, since data transfer costs are non-negligible for Web-based applications, it is economical to optimize an application towards reducing data transfer even at the cost of increased CPU consumption. We find that this argument still holds in today's cloud environment, according to the results of our cost analysis. Economics has also been examined in the grid computing domain. Thanos et al. [60] identified economic factors that could stimulate the adoption of grid computing by business sectors. There have also been several efforts to promote the commercialization of grids [61, 62].

Armbrust et al. [63] extended the cost analysis of Gray's data to the year 2008 and showed how the cost of each resource type evolved at a different rate. They also pointed out that cost analysis can be complicated by cost factors such as power, cooling, and operational costs, which are in many cases difficult to quantify. Their arguments are in line with our observations and treatment of cost factors, since those costs correspond to the classes of "less quantifiable" and/or "indirect" costs in our cost taxonomy.

Walker [64] examined the economics of purchasing versus leasing CPU hours using the NPV concept. The focus of his work is to provide a methodology that can aid in deciding whether to buy or lease CPU capacity from an organization's perspective. His analysis ignores application-specific intricacies. For example, in calculating the cost of leasing CPU hours from Amazon EC2, the required total CPU hours are assumed to be statically fixed. In reality, the CPU requirement (and usage) will vary depending on the type of application and the amount of injected workload. Similarly, Walker et al. [65] studied the problem of using a storage cloud vs. purchasing hard disks. Both studies bring many useful insights about the cost of procuring a known, fixed set of hardware resources. However, in order to assess the feasibility of moving specific applications into the cloud, additional variables must be taken into account. Our study differs from these in the sense that we try to address the question at the level of individual applications.
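Since Chapter 5's analysis (like Walker's) rests on the Net Present Value concept, it is worth recalling its standard definition; this is textbook finance, not a formula specific to the works cited. A stream of cash flows $C_t$ over a horizon of $T$ years, discounted at annual rate $r$, has

$$\mathrm{NPV} \;=\; \sum_{t=0}^{T} \frac{C_t}{(1+r)^t},$$

which allows hosting options whose costs arrive on different schedules (a large up-front purchase vs. pay-as-you-go fees) to be compared on a common footing.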

Harms and Yamartino [66] explain how the emergence of the cloud impacts the economics of IT businesses. They show that cloud infrastructure benefits from economies of scale in three areas, and that this provides incentives for organizations and businesses to migrate their IT infrastructures into the cloud. Although they do not develop a detailed analysis of application migration, they mention an application's workload variability and growth patterns as key properties, which we incorporate into our analysis. In our prior work we have also addressed some economic issues of cloud migration as they apply to digital library and search systems such as CiteSeer [67, 68]. Klems et al. [69] propose a framework for the valuation of clouds in order to enable cost comparison. However, the emphasis is on the procedural aspect of the problem, rather than on cost analysis. Wang et al. [70] conduct preliminary studies of several aspects of current cloud pricing schemes. They discuss the relationship between cost and performance optimization, and its consequences, from both the user's and the provider's viewpoints. They do not address the problem of the cost of migrating applications. There is also a study on how to minimize the cost of map-reduce applications using transient VM instances such as Amazon Spot Instances [71]. Campbell et al. [72] carry out simple calculations to determine the break-even utilization point for owning vs. renting system infrastructure for a medium-sized organization. There are also tools that aid in calculating the cost of using cloud services [73, 74]. Overall, although there have been attempts to analyze the cost of application migration, most of them are preliminary assessments or limited to specific applications with many simplifying assumptions. Our work is an effort to broaden these insights by identifying and incorporating a comprehensive set of critical factors into the cost analysis.

Chapter 3
Provider-end Dependency Discovery

3.1 Introduction and Background

This chapter is devoted to the study of a dependency discovery technique that forms the foundation of our overall resource accounting solution. In general, the term "dependency" in the context of distributed systems may represent some kind of reliance of one component on another. We define dependency in the following specific way: a dependency is said to exist between a chargeable entity c and a server node s during time [t, t + ∆] if c causes consumption of resources of s during this period. The chargeable entity c and the server node s can be located anywhere in the cloud infrastructure, with c (or parts of it) not necessarily being hosted within s or "adjacent to" s.

We use the illustration shown in Figure 3.1 to define some of the abstractions

and to describe the role of dependency discovery within our overall resource ac-

counting technique. Figure 3.1 depicts the deployment scenario of two VCs from

two distinct chargeable entities that make use of a shared database service. VMs

v1,...,v5 are VM instances owned exclusively by chargeable entities, CEA and CEB.

VMs v6,...,v9 together comprise the shared database service. Labels s1,...,s9 refer

to the physical servers hosting these VMs. We employ two abstractions, Set of Used Servers and Resource Accounting Tree, to formalize the resource accounting

problem and to capture the distributed information that our accounting solution

must keep track of.


[Figure: two virtual clusters, owned by CEA (VMs v1, v2, v3) and CEB (VMs v4, v5) on servers s1, ..., s5, sharing a database service (VMs v6, ..., v9) on servers s6, ..., s9; the Set of Used Servers of each CE and a Resource Accounting Tree for the CPU of server s9 are marked.]

Figure 3.1. Example deployment of VCs from chargeable entities CEA and CEB. Two VCs share the database server instance. This sharing is transparent to the chargeable entities, since the provider decides who shares a service instance. The abstractions of “Set of Used Servers” and “Resource Accounting Tree” are also labeled.

• Set of Used Servers (Sc(t)): For each CE c, the accounting solution maintains

Sc(t), the set of servers whose resources are used for c during the time interval

[t, t+∆]. This usage may be either: (i) direct, i.e., by one or more components

of c, or (ii) indirect, i.e., by a shared service on behalf of c. In Figure 3.1,

there are two chargeable entities CEA and CEB owning v1, v2, v3 and v4, v5,

respectively, as part of their VCs. The Set of Used Servers for each of them is marked with colored boundaries as well as labels. In this example, note

that server s8 happens to belong only to SA(t).

• RAT (Resource Accounting Tree) T_s^r(t): For resource r within a server s, the accounting solution must maintain resource accounting information during the interval [t, t + ∆] in the form of a resource accounting tree T_s^r(t). We use the example shown in Figure 3.1 to explain a resource accounting tree for the CPU resource on server s9. The entire usage of the CPU resource within s9 is represented by the root of the tree. The next level of nodes in the tree represents a breakdown of this overall CPU usage among three entities: (i) the VM Hypervisor or VMM, (ii) CEA, and (iii) CEB. The usage of the VMM may further be broken down into portions attributed to the applications of CEA and CEB, as captured by the nodes of the tree below the node for the VMM's CPU usage. We denote the sum of the usage corresponding to all leaf nodes associated with the CE c as u_c^r(t).
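To make the abstraction concrete, the following is a minimal C sketch of a resource accounting tree and of the leaf-summation that yields u_c^r(t); the type and function names here are illustrative assumptions, not part of our implementation.

#include <stddef.h>

/* Minimal sketch of a Resource Accounting Tree (RAT) for one resource r
 * on one server s over an interval [t, t + delta]. All names are
 * illustrative; they do not come from our actual implementation. */
struct rat_node {
    int    ce_id;              /* owning entity: a CE id, or -1 for the VMM */
    double usage;              /* resource usage attributed to this node    */
    struct rat_node **child;   /* breakdown of this node's usage            */
    size_t nchild;             /* number of children (0 for a leaf)         */
};

/* u_c^r(t): the sum of the usage at all leaf nodes associated with CE c. */
static double usage_of_ce(const struct rat_node *n, int ce_id)
{
    if (n->nchild == 0)
        return (n->ce_id == ce_id) ? n->usage : 0.0;

    double sum = 0.0;
    for (size_t i = 0; i < n->nchild; i++)
        sum += usage_of_ce(n->child[i], ce_id);
    return sum;
}

For the CPU tree of server s9 described above, usage_of_ce adds the leaf under CEA's own second-level node to the CEA-attributed leaf under the VMM's node, yielding the u_c^r(t) of CEA for the CPU of s9.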

3.2 Dependency Discovery: Problem Statement and Requirements

The goal of our dependency discovery is to construct the Set of Used Servers Sc(t)

for each chargeable entity c. For a chargeable entity c, Sc(t) can vary due to several

reasons. Properties of the requests from c to the shared service may change in many

ways. Requests may transition from read-intensive to write-intensive, which may cause any type of caching within the shared service to behave differently (e.g., more traffic to the data store node v9 from the front-end v6 in Figure 3.1). Another possibility is that c may start to request previously unaccessed data, in which case Sc(t) may grow to include another server node because of the new access pattern. The granularity of time t also affects Sc(t). A large time granularity tends to encompass a large number of server nodes in the set, which then stays stable. As the

time granularity gets smaller, Sc(t) may shrink or grow depending on the workload

patterns of c. The choice of time granularity depends on the inference method to

be applied during the resource usage inference, the topic of Chapter 4.
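Note that this behavior follows directly from the definition of a dependency: c depends on s over a window exactly when it causes consumption of s's resources somewhere in that window, so the set over a coarse window is the union of the sets over its subwindows. Writing the window explicitly, for ∆ = mδ:

S_c\big([t,\, t+\Delta]\big) \;=\; \bigcup_{k=0}^{m-1} S_c\big([t+k\delta,\; t+(k+1)\delta]\big)

Coarser granularities therefore yield larger, more stable sets, while finer granularities let the set track the workload of c more closely.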

The information about the currently dependent set of chargeable entities for a given server node offers important benefits. By having the minimal set of chargeable entities as the target of resource accounting, we expect to improve the efficiency of the accounting algorithms as well as the accuracy of the results. Regardless of which algorithm we employ at the resource usage inference step, having a smaller input data set always reduces the amount of computation. A smaller number of chargeable entities also improves the accuracy because there is less uncertainty in the input data. It is well known in the data mining field that separating a larger number of component distributions out of their mixture degrades accuracy, because more has to be inferred from a fixed amount of input information. A similar difficulty is known in signal processing as the blind source separation problem [75]. Obtaining the minimal dependent set of chargeable entities can be crucial when there is a large number of chargeable entities. It is not uncommon to have hundreds or thousands of chargeable entities in a large-scale cloud environment. Without knowledge of the minimal dependency set, we must use all the potential entities as input to the accounting algorithm when, in fact, only one tenth of them may be involved in the actual resource consumption.

There can be several ways to discover the dependency set. One way is to

infer from the set of chosen measurements that are appropriate to the selected

inference algorithm. Another way could be to modify the software stacks to insert

information for easier discovery through post processing of logs. However, we strive

to develop a technique that delivers an accurate set, unlike the inference approach, and that does not require any software modification, unlike the instrumentation approach. We focus on tracking the messages between application components to find the causal path, or trail, of messages. By ‘causal’, we imply that one message exchange between two components, say c1 and c2, triggers, or is directly responsible for, the generation of another set of messages between c2 and c3. It is important to discover such causality. The messages between c2 and c3 in the example may be merely coincidental; if that is the case, any activity within c3 cannot be attributed to c1. Declaring c3 as dependent on c1 in this case would spuriously add c3 to the dependency set. Therefore, the key to the

dependency discovery technique for resource accounting is to find the causality

between components.

In order for a solution to be practical, we set the following requirements.

• Transparency: The technique should not require any user-level knowledge.

Acquisition of such information usually mandates intrusive modification to

the user applications or guest kernels. The technique must work only with

the information available from the hypervisor side.

• Accuracy: The accuracy of dependency information directly affects the qual-

ity of resource accounting and any other optimizations built on it.

• Generality: The technique should ideally work regardless of software archi-

tecture running in the user virtual machines.


• Efficiency: In order for the technique to be deployable in an online fashion,

the overhead must be within a tolerable level.

3.3 Inadequacy of Existing Approaches

We classify the existing dependency discovery techniques into two categories: (i)

instrumentation-based techniques and (ii) statistical inference techniques. The

instrumentation-based approach modifies middleware or applications to record

events (e.g., request messages and their end-to-end identifiers) that can be used to

reconstruct paths of the messages. They can provide accurate information, but are not generally applicable: instrumentation requires knowledge (and often source code) of the specific middleware or applications. This class of techniques fails to meet the requirements of generality and transparency. The statistical inference technique is an approach that takes readily available information (e.g., timestamps of network packets) as input and infers the dependency in a “most likely” way. Statistical approaches are general but not accurate. Their accuracy can degrade for multiple reasons in different circumstances. Some administrative actions cannot be applied if the information is inaccurate, due to the danger of misoperation. Another drawback is that statistical techniques often require heavy computation for model construction, which may impact performance. This class of techniques fails to meet the accuracy and efficiency requirements. We discuss techniques of both these kinds as

well as ones that combine them in detail in Chapter 2.

3.4 Proposed Solution: vPath

We propose a solution, named vPath, that approaches the dependency discovery problem from a new direction. As explained in Section 3.2, our solution tracks the messages between application components to find the causal path, or trail, of messages; the key is to establish true causality between components, so that merely coincidental message exchanges do not spuriously enlarge the dependency set.

[Figure: message timelines of two components. In component 1, ‘Receive Request i’ and ‘Send Reply i’ occur on TCP Connection 1, while ‘Send Request j’ and ‘Receive Reply j’ occur on TCP Connection 2; component 2 shows the matching ‘Receive Request j’ and ‘Send Reply j’.]

Figure 3.2. The principal idea of finding causality in our proposed solution.

In distributed applications, the resources of a server node are consumed upon the arrival of request messages from other components. This implies that the identification of the Set of Used Servers of a

chargeable entity c is equivalent to determining the set of servers touched by the

messages originating from c during [t, t + ∆]. In turn, this is equivalent to tracking

the passage of each message across components of the virtual clusters. Therefore,

in order to be able to construct the Set of Used Servers, we need to develop a

technique that can track the passage/path of the messages.

3.4.1 Design and Implementation of our Dependency Discovery Technique

To reconstruct paths of messages across virtual cluster components, we need to

find two types of causality. Intra-node causality captures the behavior that, within

one component, processing an incoming message i triggers sending an outgoing

message j. Inter-node causality captures the behavior that an application-level message j sent by one component corresponds to message j′ received by another component. Our thread-pattern assumption (the causally related activities of a request within a component are performed by the same thread between receiving the request and sending its reply) enables the inference of intra-node causality, while the communication-pattern assumption (remote invocations are synchronous) enables the inference of inter-node causality.

Specifically, we reconstruct the paths of the messages in Figure 3.2 as follows. Inside component 1, the synchronous-communication assumption allows us

to match the first incoming message over ‘TCP Connection 1’ with the first out-

going message over ‘TCP Connection 1’, match the second incoming message with

the second outgoing message, and so forth. (Note that one application-level mes-

sage may be transmitted as multiple network-level packets.) Therefore, ‘Receive

Request i’ can be correctly matched with ‘Send Reply i’. Similarly, we can match

component 1’s ‘Send Request j’ with ‘Receive Reply j’, and also match component

2’s ‘Receive Request j’ with ‘Send Reply j’.

Between two components, we can match component 1’s first outgoing message

over ‘TCP Connection 2’ with component 2’s first incoming message over ‘TCP

Connection 2’, and so forth, hence, correctly matching component 1’s ‘Send Re-

quest j’ with component 2’s ‘Receive Request j’.

The only missing link is that, in component 1, ‘Receive Request i’ triggers

‘Send Request j’. From the thread-pattern assumption, we can indirectly infer this

causality between them. Recall that we have already matched ‘Receive Request i’

with ‘Send Reply i’. Between the time of these two operations, we observe that

the same thread performs ‘Send Request j’ and ‘Send Reply i’. It follows from

our thread-pattern assumption that ‘Receive Request i’ triggers ‘Send Request

j’. This completes the construction of all the causality needed to

discover the dependency.
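The walk-through above maps directly onto a simple matching procedure over a log of (thread, connection, direction) records. The C sketch below uses an illustrative record layout, not the actual log format of our toolset: it pairs the k-th send over a TCP connection with the k-th receive over the same connection (inter-node causality), and attributes every send a thread performs between receiving request i and sending reply i to that request (intra-node causality).

#include <stdio.h>
#include <string.h>

enum dir { SND, RCV };

/* Illustrative log record, stored in arrival order. */
struct rec {
    unsigned long thread;   /* DomainID/CR3/EBP combination           */
    const char   *conn;     /* 4-tuple naming the live TCP connection */
    enum dir      d;
    int           peer;     /* index of the matched record, or -1     */
};

/* Inter-node causality: with in-order, connection-oriented delivery,
 * the k-th send over a connection at one component matches the k-th
 * receive over the same connection at the other component. */
static void match_inter_node(struct rec *r, int n)
{
    for (int i = 0; i < n; i++) {
        if (r[i].d != SND || r[i].peer >= 0)
            continue;
        for (int j = 0; j < n; j++) {
            if (r[j].d == RCV && r[j].peer < 0 &&
                r[j].thread != r[i].thread &&        /* other component */
                strcmp(r[j].conn, r[i].conn) == 0) {
                r[i].peer = j;
                r[j].peer = i;
                break;      /* first unmatched receive keeps FIFO order */
            }
        }
    }
}

/* Intra-node causality: sends performed by the thread of request i
 * between 'Receive Request i' (index rx) and 'Send Reply i' (index tx)
 * are triggered by request i. */
static void attribute_intra_node(const struct rec *r, int rx, int tx)
{
    for (int k = rx + 1; k < tx; k++)
        if (r[k].d == SND && r[k].thread == r[rx].thread)
            printf("send #%d triggered by request received at #%d\n", k, rx);
}

Applied to the events of Figure 3.2, the first procedure pairs component 1's ‘Send Request j’ with component 2's ‘Receive Request j’, and the second attributes ‘Send Request j’ to ‘Receive Request i’.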

3.4.1.1 Implementation

Our proposed solution vPath’s toolset consists of an online monitor and an offline

log analyzer. The online monitor continuously logs which thread performs a send

or recv system call over which TCP connection. The offline log analyzer parses

logs generated by the online monitor to discover the paths and the performance

characteristics at each step along these paths. The online monitor tracks network-

related thread activities. This information helps infer the intra-node causality of

the form “processing an incoming message X triggers sending an outgoing message

Y .” It also tracks the identity of each TCP connection, i.e., the four-element tuple

(source IP, source port, dest IP, dest port) that uniquely identifies a live TCP con-

nection at any moment in time. This information helps infer inter-node causality,

i.e., message Y sent by a component corresponds to message Y ′ received by an-


other component. The online monitor is implemented in Xen 3.1.0 [76] running on

x86 32-bit architecture. The guest OS is Linux 2.6.18. Xen’s para-virtualization

technique modifies the guest OS so that privileged instructions are handled prop-

erly by the VMM. Xen uses hypercalls to hand control from guest OS to the VMM

when needed. Hypercalls are inserted at various places within the modified guest

OS. In Xen’s terminology, a VM is called a domain. Xen runs a special domain

called Domain0, which executes management tasks and performs I/O operations

on behalf of other domains.

Monitoring Thread Activities: vPath needs to track which thread performs

a send or recv system call over which TCP connection. If thread scheduling activities were visible to the VMM, it would be easy to identify the running threads.

However, unlike process switching, thread context switching is transparent to the

VMM. For a process switch, the guest OS has to update the CR3 register to

reload the page table base address. This is a privileged operation and generates

a trap that is captured by the VMM. By contrast, a thread context switch is not

a privileged operation and does not result in a trap. As a result, it is invisible to

the VMM.

Luckily, this is not a problem for vPath, because vPath’s task is actually sim-

pler. We only need information about currently active thread when a network

send or receive operation occurs (as opposed to fully discovering thread-schedule

orders). Each thread has a dedicated stack within its process’s address space. It is

unique to the thread throughout its lifetime. This suggests that the VMM could

use the stack address in a system call to identify the calling thread. The x86

architecture uses the EBP register for the stack frame base address. Depending

on the function call depth, the content of the EBP may vary on each system call,

pointing to an address in the thread’s stack. Because the stack has a limited size,

only the lower bits of the EBP register vary. Therefore, we can get a stable thread

identifier by masking out the lower bits of the EBP register.
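As a concrete illustration, assuming fixed-size, aligned per-thread stacks (the 2 MB size below is an assumed example value, not one taken from our prototype), the identifier can be computed as follows:

/* Hypothetical sketch: derive a stable thread identifier from the EBP
 * value captured at system-call entry. Assumes fixed-size, aligned
 * per-thread stacks; ASSUMED_STACK_SIZE is an illustrative value. */
#define ASSUMED_STACK_SIZE (2UL << 20)          /* 2 MB, for illustration */

static inline unsigned long thread_id_from_ebp(unsigned long ebp)
{
    return ebp & ~(ASSUMED_STACK_SIZE - 1);     /* mask call-depth bits */
}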

Specifically, vPath tracks network-related thread activities as follows:

• The VMM intercepts all system calls that send or receive TCP messages.

Relevant system calls in Linux are read(), write(), readv(), writev(),

recv(), send(), recvfrom(), sendto(), recvmsg(), sendmsg(), and


sendfile(). Intercepting system calls of a para-virtualized Xen VM is pos-

sible because they use “int 80h” and this software trap can be intercepted

by the VMM.

• On system call interception, vPath records the current DomainID, the con-

tent of the CR3 register, and the content of the EBP register. DomainID

identifies a VM. The content of CR3 identifies a process in the given VM.

The content of EBP identifies a thread within the given process. vPath uses

a combination of DomainID/CR3/EBP to uniquely identify a thread.

By default, system calls in Xen 3.1.0 are not intercepted by the VMM. Xen

maintains an IDT (Interrupt Descriptor Table) for each guest OS and the 0x80th

entry corresponds to the system call handler. When a guest OS boots, the 0x80th

entry is filled with the address of the guest OS’s system call handler through

the set trap table hypercall. In order to intercept system calls, we prepare our

custom system call handler, register it into the IDT, and disable direct registration of

the guest OS system call handler. On a system call, vPath checks the type of the

system call, and logs the event only if it is a network send or receive operation.

Contrary to the common perception that system call interception is expensive,

it actually has negligible impact on performance. This is because system calls

already cause a user-to-kernel mode switch. vPath code is only executed after this

mode switch and does not incur this cost.
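A minimal sketch of this interception path is shown below. The helper functions (current_domain_id, guest_cr3, guest_ebp, vpath_log) are hypothetical placeholders for this illustration; the system-call numbers are the 32-bit x86 Linux values, where the socket-specific calls (recv, send, and their variants) arrive through the socketcall multiplexer.

/* Sketch of vPath's system-call check inside the custom int 0x80
 * handler. All helper functions are hypothetical placeholders. */
#define NR_read        3     /* 32-bit x86 Linux system-call numbers */
#define NR_write       4
#define NR_socketcall  102   /* recv/send/recvfrom/... multiplexer   */
#define NR_readv       145
#define NR_writev      146
#define NR_sendfile    187

extern unsigned long current_domain_id(void);              /* assumed */
extern unsigned long guest_cr3(void);                      /* assumed */
extern unsigned long guest_ebp(void);                      /* assumed */
extern void vpath_log(unsigned long dom, unsigned long cr3,
                      unsigned long ebp, unsigned int nr); /* assumed */

static int is_network_io(unsigned int nr)
{
    switch (nr) {
    case NR_read: case NR_write: case NR_readv:
    case NR_writev: case NR_sendfile: case NR_socketcall:
        return 1;   /* reads/writes on non-socket descriptors can be
                       filtered out later, during log analysis */
    default:
        return 0;
    }
}

/* Called on every intercepted system call, before control is forwarded
 * to the guest OS's own system call handler. */
void vpath_on_syscall(unsigned int nr)
{
    if (is_network_io(nr))
        vpath_log(current_domain_id(), guest_cr3(), guest_ebp(), nr);
}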

Monitoring TCP Connections: On a TCP send or receive system call, in

addition to identifying the thread that performs the operation, vPath also needs to

log the four-element tuple (source IP, source port, dest IP, dest port) that uniquely

identifies the TCP connection. This information helps match a send operation in

the message source component with the corresponding receive operation in the

message destination component. The current vPath prototype adds a hypercall

in the guest OS to deliver this information down to the VMM. Upon entering a

system call of interest, the modified guest OS maps the socket descriptor number

into (source IP, source port, dest IP, dest port), and then invokes the hypercall to

inform the VMM.
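In outline, the guest-side step behaves as sketched below; every name here (struct conn_tuple, lookup_socket_tuple, HYPERVISOR_vpath_op) is a hypothetical stand-in for the small guest patch, not an interface of Xen or Linux.

/* Guest-kernel sketch (hypothetical names throughout): on entry to a
 * network send/receive system call, map the socket descriptor to its
 * TCP 4-tuple and push the tuple down to the VMM via a hypercall. */
struct conn_tuple {
    unsigned int   src_ip, dst_ip;
    unsigned short src_port, dst_port;
};

extern int  lookup_socket_tuple(int fd, struct conn_tuple *t); /* assumed */
extern long HYPERVISOR_vpath_op(const struct conn_tuple *t);   /* assumed */

static void vpath_report_fd(int fd)
{
    struct conn_tuple t;

    if (lookup_socket_tuple(fd, &t) == 0)  /* fd names a connected socket */
        HYPERVISOR_vpath_op(&t);           /* deliver the 4-tuple to VMM  */
}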

This approach works well in the current prototype, and it modifies fewer than


100 lines of source code in the guest OS (Linux). However, our end goal is to

implement a pure VMM-based solution that does not modify the guest OS at all.

Such a pure solution would be easier to deploy in a Cloud Computing platform

such as EC2 [77], because it only modifies the VMM, over which the platform

service provider has full control.

As part of our future work, we are exploring several techniques to avoid mod-

ifying the guest OS. Our early results show that, by observing TCP/IP packet

headers in Domain0, four-element TCP identifiers can be mapped to socket de-

scriptor numbers observed in system calls with high accuracy. Another alternative

technique we are exploring is to have the VMM keep track of the mapping from

socket descriptor numbers to four-element TCP identifiers, by monitoring sys-

tem calls that affect this mapping, including bind(), accept(), connect(), and

close().

3.4.2 Applicability to Other Software Architectures

The proposed technique relies on two assumptions as described above. These

assumptions work well with the predominant form of software architecture, namely the multi-threaded architecture. However, additional facilities are needed

to cover other forms of software architectures. In this section we explain what other

software architectures exist and what needs to be done to apply the principles of

our proposed technique.

Multi-threaded Dispatcher & Worker Model: Figure 3.3 (a) shows the

dispatcher-worker model, which is arguably the most widely used threading model

for server applications. In the front-end, one or more dispatcher threads use the accept() system call or the select() system call to detect new incoming TCP connections or new incoming requests over existing TCP connections, respectively.

Once a user request is identified, the request is handed over to a worker thread for

further processing. This single worker thread is responsible for executing all activ-

ities triggered by the request (e.g., reading HTML files from disk, making JDBC

calls to retrieve data from database servers, calling back-end CICS mainframe ap-

plications, doing local computation, etc.) and finally sending a response-message

back to the user. After the worker thread finishes processing the request, it is put back into the worker thread pool, waiting to be picked to process another incoming request. The finite-state machine describing the behavior of such a worker thread is depicted in Figure 3.3 (b).

[Figure: (a) a dispatcher thread hands each incoming request to one of a pool of worker threads within a component; (b) the finite state machine of a worker thread.]

Figure 3.3. Multi-threaded Server Architecture

Event-Driven Model: Figure 3.4(a) shows the architecture of an application’s

component built using an event-driven programming model. Unlike other thread-

ing models, the event-driven model uses a relatively small number of threads, typi-

cally equal to or slightly larger than the number of CPUs on the server hosting the

component. When processing a request R, a thread T1 always uses non-blocking

system calls. If it cannot make progress on processing the request R because a

non-blocking I/O operation on behalf of R has not yet completed, the thread T1

records the current status of R in a finite state machine (FSM) maintained for R,

and moves on to process another request. When the I/O operation on behalf of

R finishes, an event is created in the event queue, and eventually a thread T2 re-

trieves the event and resumes the request R. Note that T1 and T2 may be different

threads, both participating in the processing of the same request, but at different

times during its lifetime.

In order for our proposed technique to extend to the Event-driven model, one

feature must be implementable. Since a small number of threads multiplexes over many communications, we need to be able to detect the moment when a thread switches to processing another communication. In principle, this can be done by intercepting the event fetch operation, since an event fetch signifies such a switch of context. We have not delved into this engineering task, but believe that it can be accomplished with a reasonable amount of effort if desired.

[Figure: (a) the event-driven model, in which a small pool of threads fetches I/O events from an event queue and resumes the per-request finite state machines; (b) the SEDA model, in which stages with dedicated thread pools are connected by event queues.]

Figure 3.4. Event-driven and SEDA model architecture

Staged Event-Driven Architecture: Shown in Figure 3.4(b) is the architec-

ture of one component within a SEDA-based application [78]. SEDA partitions the

request processing pipeline into stages and each stage has its own dedicated pool

of threads. Any two neighboring stages are connected by an event queue. Threads

in stage i put events in the queue qi, and threads in stage i+1 remove events from

this queue and process them, which may further trigger the generation of events

in the queue for the subsequent stage. One advantage of SEDA, claimed in [78], is

that the size of each stage’s thread pool can be adjusted independently based on

the observed workload.

The argument about whether the proposed vPath technique applies to SEDA follows the same reasoning presented for the event-driven model. SEDA is an extension of the event-driven model with stages and, for the same reason, it is possible to apply our proposed technique only if we can intercept the event fetching. We do not look into implementing this in our study either.

3.4.3 Usefulness of Proposed Solution

Applications that adopt the event-driven model cannot be handled by vPath. How-

ever, the pure event-driven model is rarely used in real applications. The

Flash Web server [79] is often considered as a notable example that adopts the

event-driven model, but Flash actually uses a hybrid between event-driven and

multi-threaded programming models. In Flash, a single main thread does all

non-blocking network I/O operations and a set of worker threads do blocking

disk I/O operations. The event-driven model is not yet popular in real appli-

cations and there is considerable consensus in the research community that pro-

gramming/debugging applications based on a pure event-driven model is difficult.

Furthermore, even the frequently-cited performance advantages of the event-driven

model are questionable in practice as it is extremely hard to ensure that a thread

actually never blocks. For example, the designers of Flash themselves observed

that the supposedly never-blocking main thread actually blocks unexpectedly in

the “find file” stage of HTTP request processing, and subsequently published mul-

tiple research papers [80, 81] to describe how they solved the problem by changing

the operating system. Considering the excellent expertise of the Flash researchers

on this subject, it is hard to imagine that regular programmers have a better chance

of getting the implementation right. Similar sentiments were expressed by Behren

et al. who have had extensive experience programming a variety of applications

using event-driven approaches [82].

3.5 Evaluation

Our experimental testbed consists of Xen VMMs (v3.1.0) hosted on Dell servers

connected via Gigabit Ethernet. Each server has dual Xeon 3.4 GHz CPUs with 2

MB of L1 cache and 3 GB RAM. Each of our servers hosts several virtual machines


[Figure: a client drives an Apache front-end, which load-balances across two JBoss servers (JBoss1, JBoss2), which in turn query MySQL; the four VMs run on separate Xen VMMs, each with a Linux Domain0.]

Figure 3.5. The topology of the TPC-W benchmark set-up.

(VMs) with each VM assigned 300 MB of RAM. We use the xentop utility in

Domain0 to obtain the CPU utilization of all the VMs running on that server.

3.5.1 Applications

To demonstrate the generality of vPath, we evaluate vPath using a diverse set

of applications written in different programming languages (C, Java, and PHP),

developed by communities with very different backgrounds.

TPC-W: To evaluate the applicability of vPath for realistic workloads, we use a

three-tier implementation of the TPC-W benchmark [83], which represents an on-

line bookstore developed at New York University [84]. Our chosen implementation

of TPC-W is a fully J2EE compliant application, following the “Session Facade”

design pattern. The front-end is a tier of Apache HTTP servers configured to

load balance the client requests among JBoss servers in the middle tier. JBoss

3.2.8SP1 [85] is used in the middle tier. MySQL 4.1 [86] is used for the back-end

database tier. The topology of our TPC-W setup is shown in Figure 3.5. We

use the workload generator provided with TPC-W to simulate multiple concurrent

clients accessing the application.

This setup is a heterogeneous test environment for vPath. The Apache HTTP

server is written in C and is configured to use a multi-process architecture. JBoss

is written in Java and MySQL is written in C.

RUBiS: RUBiS [87] is an e-Commerce benchmark developed for academic re-

search. It implements an online auction site loosely modeled after eBay, and adopts


[Figure: a client drives a three-tier vApp topology of five VMs (VM1 through VM5) spread across Tier 1, Tier 2, and Tier 3.]

Figure 3.6. The topology of vApp used in evaluation.

a two-tier architecture. A user can register, browse items, sell items, or make a bid.

It is available in three different versions: Java Servlets, EJB, and PHP. We use the

PHP version of RUBiS in order to differentiate from TPC-W, which is written in

Java and also does e-Commerce. Our setup uses one VM to run a PHP-enabled

Apache HTTP server and another VM to run MySQL.

MediaWiki: MediaWiki [88] is a mainstream open source application. It is the

software behind the popular Wikipedia site (wikipedia.org), which ranks in the

top 10 among all Web sites in terms of traffic. As mature software, it has a large

set of features, e.g., support for rich media and a flexible namespace. Because

it is used to run Wikipedia, one of the highest traffic sites on the Internet, its

performance and scalability have been highly optimized. It is interesting to see

whether the optimizations violate the assumptions of vPath (i.e., synchronous

remote invocation and event causality observable through thread activities) and

hence would fail our technique. MediaWiki adopts a two-tier architecture and is

written in PHP. Our setup uses one VM to run PHP-enabled Apache and another

VM to run MySQL.

vApp: vApp is our own prototype application. It is an extreme test case we

designed for vPath: it can exercise vPath with arbitrarily complex message paths. It is a custom multi-tier, multi-threaded application written in C. Fig-

ure 3.6 shows an example of a three-tier vApp topology. vApp can form various

topologies, with the desired number of tiers and the specified number of servers at

each tier. When a server in one tier receives a request, it either returns a reply,

or sends another request to one of the servers in the downstream tier. When a

server receives a reply from a server in the downstream tier, it either sends an-


Configuration    Response time in seconds            Throughput (req/sec)
                 (Degradation in %)                  (Degradation in %)
                 Average        90th percentile      Average
Vanilla Xen      4.45           11.58                4.88
vPath            4.72 (6%)      12.28 (6%)           4.59 (6%)
App Logging      10.31 (132%)   23.95 (107%)         4.10 (16%)

Table 3.1. Response time and throughput of TPC-W. “App Logging” represents a log-based tracking technique that turns on logging on all tiers of TPC-W.

other request to a server in the downstream tier, or returns a reply to the upstream

tier. All decisions are made based on specified stochastic processes so that it can

generate complex paths with different structures and path lengths.

We also developed a vApp client to send requests to the front tier of the vApp

servers. The client can be configured to emulate multiple concurrent sessions. As

request messages travel through the components of the vApp server, the identi-

fiers of visited components are appended to the message. When a reply is finally

returned to the client, it reads those identifiers to precisely reconstruct the path,

which serves as the ground truth to evaluate vPath. The client also tracks the re-

sponse time of each request, which is compared with the response time estimated

by vPath.

3.5.2 Overhead of vPath

We first quantify the overhead of vPath, compared with both vanilla (unmodified)

Xen and log-based tracking techniques [24, 22]. For the log-based techniques, we

turn on logging on all tiers of TPC-W. The experiment below uses the TPC-W

topology in Figure 3.5.

Overhead of vPath for TPC-W. Table 3.1 presents the average and 90th

percentile response time of the TPC-W benchmark as seen by the client, catering to

100 concurrent user sessions. For all configurations, 100 concurrent sessions cause

near 100% CPU utilization at the database tier. Table 3.1 shows that vPath has

low overhead. It affects throughput and average response time by only 6%. By

contrast, “App Logging” decreases throughput by 16% and increases the average


[Figure: (a) CDF comparison of TPC-W response time (sec); (b) CDF comparison of the TPC-W JBoss tier’s CPU utilization (%). Each panel compares Vanilla Xen, vPath, and App Logging.]

Figure 3.7. TPC-W Response Time and CPU Utilization.

response time by as much as 132%. The difference in response time is more clearly

shown in Figure 3.7(a), where vPath closely follows “vanilla Xen”, whereas “App

Logging” significantly trails behind.

Figure 3.7(b) shows the CPU utilization of the JBoss tier when the database

tier is saturated. vPath has negligible CPU overhead whereas “App Logging” has

significant CPU overhead. For instance, vPath and “vanilla Xen” have almost

identical 90th percentile CPU utilization (13.6% vs. 14.4%), whereas the 90th per-

centile CPU utilization of “App Logging” is 29.2%, more than twice that of vPath.

Thus, our technique, by eliminating the need for using application logging to trace

paths, improves application performance and reduces CPU utilization (and hence


Configuration   Response Time in millisec   Throughput in req/sec
                (Degradation in %)          (Degradation in %)
Vanilla Xen     597.2                       628.6
vPath           681.8 (14.13%)              593.4 (5.60%)

Table 3.2. Performance impact of vPath on RUBiS.

Configuration           Response time (in sec)       Throughput (req/sec)
                        Avg (Std.)    Overhead       Avg (Std.)       Overhead
Vanilla Xen             1.69 (.053)   -              2915.1 (88.9)    -
(1) Intercept Syscall   1.70 (.063)   .7%            2866.6 (116.5)   1.7%
(2) Hypercall           1.75 (.050)   3.3%           2785.2 (104.6)   4.5%
(3) Transfer Log        2.02 (.056)   19.3%          2432.0 (58.9)    16.6%
(4) Disk Write          2.10 (.060)   23.9%          2345.4 (62.3)    19.1%

Table 3.3. Worst-case overhead of vPath and breakdown of the overhead. Each row represents the overhead of the previous row plus the overhead of the additional operation on that row.

power consumption) for data centers. Moreover, vPath eliminates the need to re-

peatedly write custom log parsers for new applications. Finally, vPath can even

work with applications that cannot be handled by log-based discovery methods

because those applications were not developed with this requirement in mind and

do not generate sufficient logs.

Overhead of vPath for RUBiS. Due to space limitations, we report only sum-

mary results on RUBiS. Table 3.2 shows the performance impact of vPath on

RUBiS. We use the client emulator of RUBiS to generate workload. We set the

number of concurrent user sessions to 900 and set user think time to 20% of the

original value in order to drive the CPU of the Apache tier (which runs PHP) to

100% utilization. vPath imposes low overhead on RUBiS, decreasing throughput

by only 5.6%.

Worst-case Overhead of vPath. The relative overhead of vPath depends on

the application. We are interested in knowing the worst-case overhead (even if the

worst case is unrealistic for practical systems).

The relative overhead of vPath can be calculated as v/(v + p), where v is vPath's


processing time for monitoring a network send or receive operation, and p is the

application’s processing time related to this network operation, e.g., converting

data retrieved from the database into HTML and passing the data down the OS

kernel’s network stack. vPath’s relative overhead is highest for an application that

has the lowest processing time p. We use a tiny echo program to represent such a

worst-case application, in which the client sends a one-byte message to the server

and the server echoes the message back without any processing. In our experiment,

the client creates 50 threads to repeatedly send and receive one-byte messages in

a busy loop, which fully saturates the server’s CPU.
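To illustrate the formula with assumed numbers (chosen for illustration, not measured): if monitoring costs v = 10 µs per network operation, then

\frac{v}{v+p} = \frac{10\,\mu s}{10\,\mu s + 150\,\mu s} \approx 6\%
\qquad\text{vs.}\qquad
\frac{v}{v+p} = \frac{10\,\mu s}{10\,\mu s + 40\,\mu s} = 20\%

for an application that performs p = 150 µs of processing per network operation versus an echo-like workload whose only per-operation work is the kernel's network path (say p = 40 µs). The first regime mirrors the roughly 6% degradation observed for TPC-W, the second the worst-case breakdown in Table 3.3.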

When the application invokes a network send or receive system call, vPath

performs a series of operations, each of which introduces some overhead: (1) intercepting the system call in the VMM, (2) using a hypercall to deliver TCP information

(src IP, src port, dest IP, dest port) from guest OS to VMM, (3) transferring log

data from VMM to Domain0, and (4) Domain0 writing log data to disk. These

operations correspond to different rows in Table 3.3, where each row represents

the overhead of the previous row plus the overhead of the additional operation on

that row.

Table 3.3 shows that intercepting system calls actually has negligible overhead

(1.7% for throughput). The biggest overhead is due to transferring log data from

VMM to Domain0. This step alone degrades throughput by 12.1%. Our current

implementation uses the VMM’s printk() to transfer log data to Domain0, and we

are exploring a more efficient implementation. Combined together, the operations

of vPath degrade throughput by 19.1%. This is the worst-case for a contrived tiny

“application.” For real applications, throughput degradation is much lower, only

6% for TPC-W and 5.6% for RUBiS.

3.5.3 Dependency Discovery for vApp

Our custom application vApp is a test case designed to exercise vPath with arbi-

trarily complex paths. We configure vApp to use the topology in Figure 3.6. The

client emulates 10-30 concurrent user sessions. In our implementation, as a request

message travels through the vApp servers, it records the actual path, which serves

as the ground truth to evaluate vPath.


[Figure: (a) a simple path with two remote invocations in a linear structure; (b) a complex path with seven invocations that visits some components more than once.]

Figure 3.8. Examples of vApp's paths discovered by vPath. The circled numbers correspond to VM IDs in Figure 3.6.

[Figure: CDF of vApp response time (ms), comparing vPath's estimate with the vApp client's measurement.]

Figure 3.9. CDF of vApp's response time, as estimated by vPath and actually measured by the vApp client.

The message paths of vApp, as described in Section 3.5.1, are designed to be random. To

illustrate the ability of our technique to discover sophisticated paths, we present

two discovered paths in Figure 3.8. The simple path consists of 2 remote invoca-

tions in a linear structure, while the complex path consists of 7 invocations and

visits some components more than once.

In addition to discovering the message paths, vPath can also accurately cal-

culate the end-to-end response times as well as the time spent on each tier along

a path. This information is helpful in debugging distributed systems, e.g., iden-

tifying performance bottlenecks and abnormal requests. Figure 3.9 compares the

end-to-end response time estimated by vPath with that actually measured by the

vApp client. The response time estimated by vPath is almost identical to that

observed by the client, but slightly lower. This small difference is due to message

delay between the client and the first tier of vApp, which is not tracked by vPath


[Figure: (a) TPC-W: a client request enters Apache, which forwards it to a JBoss server (JBoss1 or JBoss2); JBoss exchanges a large number of requests and replies with MySQL and streams partial replies back through Apache to the client. (b) RUBiS: a client request to the PHP-enabled Apache triggers exactly three round trips to MySQL, then a round of about 50 consecutive recv() calls (possibly retrieving large data), before the reply is returned.]

Figure 3.10. Typical paths discovered by the vPath technique.

because the client runs on a server that is not monitored by vPath.

We executed a large number of requests at different session concurrency levels.

We also experimented with topologies much larger than that in Figure 3.6, with

more tiers and more servers in each tier. All the results show that vPath precisely

discovers the path of each and every executed request.

3.5.4 Dependency Discovery for TPC-W

The three-tier topology (see the top of Figure 3.10(a)) of the TPC-W testbed is

static, but its message paths are dynamic and can vary, depending on which JBoss

server is accessed and how many queries are exchanged between JBoss and MySQL.

The TPC-W client generates logs that include the total number of requests, current

session counts, and individual response time of each request, which serve as the

ground truth for evaluating vPath. In addition to automated tests, for the purpose of careful validation, we also visually examine samples of the complex paths discovered by vPath and compare them with information in the application

logs.

vPath is able to correctly discover all the message paths with 100% complete-

ness and 100% accuracy. We started out without knowing how the paths of TPC-

W would look. From the results, we were able to quickly learn the path structure

without any knowledge of the internals of TPC-W. Typical paths of TPC-W have

the structure in Figure 3.10(a).

We observe two interesting things that we did not anticipate. First, when


processing one request, JBoss makes a large number of invocations to MySQL.

Most requests fall into one of two types. One type makes about 20 invocations

to MySQL, while the other type makes about 200 invocations. These two types

represent radically different TPC-W requests.

The second interesting observation with TPC-W is that, both the JBoss and

Apache send out replies in a pipeline fashion (see Figure 3.10(a)). For example,

after making the last invocation to MySQL, JBoss reads in partial reply from

MySQL and immediately sends it to Apache. JBoss then reads and sends the

next batch of replies, and so forth. This pipeline model is an effort to reduce memory buffering, avoid memory copies, and reduce user-perceived response time. In

this experiment, once JBoss sends the first partial reply to Apache, it no longer

makes invocations to MySQL (it only reads more partial replies from MySQL

for the previous invocation). vPath is general enough to handle an even more

complicated case, where JBoss sends the first partial reply to Apache, and then

makes more invocations to MySQL in order to retrieve data for constructing more

replies. Even for this complicated, hypothetical case, all the activities will still be

correctly assigned to a single path.

3.5.5 Dependency Discovery for RUBiS and MediaWiki

Unlike TPC-W, which is a benchmark intentionally designed to exercise a breadth

of system components associated with e-Commerce environments, RUBiS and Me-

diaWiki are designed with practicality in mind, and their paths are actually shorter

and simpler than those of TPC-W.

Figure 3.10(b) shows the typical path structure of RUBiS. With vPath, we are

able to make some interesting observations without knowing the implementation

details of RUBiS. We observe that a client request first triggers three rounds of

messages exchanged between Apache and MySQL, followed by the fourth round in

which Apache retrieves a large amount of data from MySQL. The path ends with

a final round of messages exchanged between Apache and MySQL. The pipeline-

style partial message delivery in TPC-W is not observed in RUBiS. RUBiS and

TPC-W also differ significantly in their database access patterns. In TPC-W,

JBoss makes many small database queries, whereas in RUBiS, Apache retrieves a


large amount of data from MySQL in a single step (the fourth round). Another

important difference is that, in RUBiS, many client requests finish at Apache

without triggering database accesses. These short requests are about eight times

more frequent than the long ones. Finally, in RUBiS, Apache and MySQL make

many DNS queries, which are not observed in TPC-W.

For MediaWiki, the results of vPath show that very few requests actually reach

all the way to MySQL, while most requests are directly returned by Apache. This is

because much of the content is static, and even for dynamic content, MediaWiki

is heavily optimized for effective caching. For a typical request that changes a wiki

page, the PHP module in Apache makes eight accesses to MySQL before replying

to the client.

3.5.6 Discussion on Benchmark Applications

We started our experiments with little knowledge of the internals of TPC-W,

RUBiS and MediaWiki. During the experimentation, we did not read their manuals

or source code. We did not modify their source code, bytecode, or executable

binary. We did not try to understand their application logs or write parsers for

them. We did not install any additional application monitoring tools such as IBM

Tivoli or HP OpenView. In short, we did not change anything in the user space.

Yet, with vPath, we were able to make many interesting observations about the

applications. In particular, the different behaviors of the applications made us wonder how, in general, to select “representative” applications for evaluating systems performance research. TPC-W is a widely recognized de facto e-Commerce benchmark,

but its behavior differs radically from the more practical RUBiS and MediaWiki.

This discrepancy could result from the difference in application domain, but it is

not clear whether the magnitude of the difference is justified. We leave it as an

open question rather than a conclusion.

This question is not specific to TPC-W benchmark. For example, the Trade6

benchmark [89] developed by IBM models an online stock brokerage Web site.

We have intimate knowledge of this application. As both a benchmark and a

testing tool, it is intentionally developed with certain complexity in mind in order

to fully exercise the rich functions of WebSphere Application Server. It would


be interesting to know, to what degree the conclusions in systems performance

research are misguided by the intentional complexity in benchmarks such as TPC-

W and Trade6.

3.6 Summary and Conclusion

In this chapter we proposed a novel technique for discovering the dependency

among distributed application components. Specifically, the goal was to discover

the end-to-end causal dependency established by message traversal across compo-

nents. Our proposed technique is hypervisor-based, meaning that it is transpar-

ent to the VMs running on top of the hypervisor and delivers accurate causality

of every message. We have presented our prototype implementation using the Xen virtualization platform and tested its effectiveness in various software settings, including a synthetic TCP/IP-based multi-threaded program and the TPC-W, RUBiS, and MediaWiki applications. Our evaluation shows that the dependency discovery technique of our design, i.e., one based on the assumptions of the multi-threaded paradigm and a synchronous communication pattern, is an effective method of dependency discovery.

Chapter 4

Provider-end Resource Usage Inference

4.1 Introduction

Recall that the overall resource accounting problem breaks down into two subproblems:

the dependency discovery and the resource usage inference. We have presented our

study on the dependency discovery and the solution in Chapter 3. In this chapter,

we move on to the study of the resource usage inference problem to complete the

overall goal of resource accounting. The first main goal of this study is to under-

stand what the shortcomings of common approaches are in resource accounting

and to develop a technique that overcomes them. A popular choice of technique for resource accounting, which we select as the baseline for comparison with our solution, is to apply the well-established linear regression method to the monitoring data collected via commonly available system monitoring tools. Then, we set up several

applications with appropriate workloads to evaluate the efficacy of our solution.

We also present the result of applying our overall resource accounting solution

to one example system management scenario to demonstrate its advantages over

existing solutions.

The first major reference to the resource accounting problem is in the work on

resource container [14]. This work identified the mismatch between the notion of

a scheduling entity from the perspective of the OS kernel and an accounting entity


(akin to the CE we define), and proposed to introduce a new abstraction called resource

container to correctly account resource usage. There have been follow-up works

that extended the notion of resource container to distributed environments [47].

Our resource accounting problem has some similarity with the problem addressed

in the work on the distributed resource container in that it also tries to determine

the resource usage at a target server node incurred by some set of remote entities.

However, there is one significant difference that prevents us from simply borrowing

the work on the distributed resource container. This difference arises due to the

presence of what we call shared services, and the manner in which the resource usage of each chargeable entity is interleaved within them.

Shared services are the software components that are run and managed by the

cloud providers to deliver certain services or functionalities to the user applications.

The instances of the shared services, whether they be virtualized or not, are not

owned by any chargeable entities. Software components within CEs’ VMs only

see an interface through which they can request services. Behind this interface

lies arbitrary number of shared service instances that multiplex CEs’ requests

according to suitable admission policy. There are many real-world examples of

shared services that follow this architecture. One notable example is the SQL Azure

relational database service. In SQL Azure, subscribers of the service are given

logical database servers, which are actually formed from multiple physical database

server instances that are located on several physical servers. Each database server

instance on each physical server harbors data for multiple subscribers [90].

The presence of shared services poses several difficulties for resource accounting.

• First, unlike software installed within a CE’s exclusive VM instances, a shared

service may only be exercised by a CE indirectly, making it more difficult to

ascertain this dependency between the shared service and chargeable entities.

For example, in Figure 4.1, the front-end component of the shared database

is invoked directly by both the CEs. Existing work, including ours [91], can

be easily leveraged for identifying this dependency (if it is not already well-

known for some reason). However, the data store is exercised via requests

made to the front-end and not directly by the CEs. One approach for in-

ferring such dependency may be to instrument the messaging S/W to inject


[Figure: the virtual clusters of CEA and CEB on servers s1, ..., s5 and a shared database service, with a front-end and a data store, on servers s6, ..., s9; arrows indicate communication between components.]

Figure 4.1. A portion of a platform that hosts two applications, each a CE, and the servers hosting their components. Arrows indicate communication between components. Also shown is a shared service - a database used by both the CEs. The shared service itself consists of multiple software components, some of which are exercised by the CEs indirectly (e.g., the “Data Store”), i.e., via requests made to other components (e.g., the “Front-end”).

tracking identifiers, which is not generally applicable and may be prohibitive.

• Second, application-owned S/W components are contained within resource

principals that are likely to be easily identifiable by underlying resource man-

agement software (e.g., separate VMs on top of the virtual machine monitors

(VMMs) in Figure 4.1). This implies an existing local accounting solution

such as resource container can be readily used by this management software

to associate these resource principals with the corresponding CE. E.g., in

Figure 4.1, the web/app servers of CEA and CEB are hosted within their

own virtual machines (VMs) created by the underlying VMM software; by

associating containers with these VMs, existing accounting functionality can

be leveraged 1. On the other hand, a shared service’s software design and

configuration may not be amenable to easy adaptation of existing solutions for

1 In fact, this is the essential idea behind distributed resource containers [47]: individual servers use resource containers for local accounting, and the network stacks within server operating systems are modified to embed tokens within messages sent to/by components that uniquely identify their CEs.


Services      S3      SimpleDB   EBS     ELB     SQS    RDS     Custom DB
Percentage    73.3%   8.3%       38.3%   16.6%   15%    21.6%   18.3%

Table 4.1. Usage pattern of AWS shared services. The total number of applications is 120. RDS in AWS is not a shared service as we define it here, since consumers own separate VM instances. ‘Custom DB’ means that the user has installed their own database within the EC2 instances.

local accounting. For example, the data store component highlighted in Fig-

ure 4.1 multiplexes the resources assigned to its internal schedulable entities

(e.g., threads) in highly application-specific (and possibly unknown) ways

among the activities it carries out on behalf of CEA and CEB, rendering a

solution such as resource containers difficult to adapt.

Increasingly, for the cloud provider, accounting the resource usage of such

shared services is becoming important. Our investigation of current application

deployment patterns reveals that the majority of application deployments in cur-

rently popular clouds involve one or more shared services. In the case of Amazon’s

AWS cloud [77], more than 87% of the applications out of 120 deployments cur-

rently rely on the shared services Amazon provides. Only in 12% of the cases do

they use only the raw VM instances and install necessary software themselves. Ta-

ble 4.1 presents a detailed break-down of shared service usage. The usage pattern

of shared services is similar for Windows Azure [92]. About 73% of a total of 56 applications make use of SQL Azure [93]. SQL Azure's architecture conforms

well to the shared service we define, having each instance service multiple CEs’

workloads [90]. These facts indicate that shared services are an increasingly critical

part of cloud offerings that should not be disregarded in an accounting solution.

As we will show empirically, accurate accounting of resources used by such shared

services can lead to improved efficiency for the cloud provider.

4.1.1 Usefulness of Resource Accounting Information

Accurate resource accounting information helps providers gain a clearer view of what is going on within their cloud infrastructure. This allows them to im-

prove their management decisions or even enable new management actions that

were difficult to apply. We can consider various areas where resource accounting


information can be of aid.

• Cost Optimization: Consolidation of IT infrastructure is an important issue

in the industry. Owners of data centers are interested in the “right sizing”

of system resources for cost optimization. In order to carry out the “right sizing,” one must know how much computing resources are needed so that the required number of base hardware units can be estimated.

Resource accounting information can be used to reason about various con-

solidation configurations given certain number of applications to operate.

Accuracy in resource accounting will lead to better estimation of costs.

• Dynamic Resource Management: Resource accounting can be used to dynam-

ically adjust system resources on the fly. The knowledge of resource usage

per chargeable entity allows providers to enforce certain fairness constraint or

load balancing scheme. Coupled with flexibility in resource assignment due

to virtualization, providers can also dynamically control resource assignment

(e.g., number of replica assigned to certain front-end shared service node) to

adjust system capacity and performance. In this scenario, simple reactive

provisioning based on reading immediate server conditions may be too slow

and inadequate. Our resource accounting information tells us the accurate relationship between the inbound request rate from chargeable entities and the magnitude of per-chargeable-entity resource usage at arbitrary shared service server nodes. This allows us to calculate how early and by how much to adjust system resources, even for server nodes lying deep within the shared service infrastructure.

• Modeling: The numbers obtained from resource accounting can be used to construct models by relating them to input measurements. In many cases, the inbound network traffic volume from chargeable entities to the front-end shared service can serve as the input measurement. The models built here can describe how resources are shared among chargeable entities. For example, a model could explain the caching effects when two specific workloads are combined on a single shared service.

• Diagnosis: Resource accounting can also be used for system diagnosis. It is particularly useful for detecting anomalies in resource consumption. When a certain shared service node approaches resource saturation, resource accounting can identify who is causing the saturation. It may turn out that the resource consumption is due to some internal activities with no causal link to any chargeable entity. Such discoveries ease subsequent diagnosis efforts by narrowing down the possible causes.

• Charging/Billing: One direct application of resource accounting is charging and billing, which is important in both public and private clouds. In public clouds, resource usage information forms the basis for charging users for their usage. In private clouds, providers can utilize it to enforce quotas for certain groups or otherwise constrain usage. However, in this study, we do not emphasize the charging/billing aspect of resource accounting. Although resource accounting is vital for charging/billing, the latter usually involves a much wider range of factors that are often subjective. A complete treatment of the issues involved in charging/billing is out of the scope of this dissertation.

4.2 Problem Definition

We refer readers to Figure 3.1 in Chapter 3 for key abstractions in defining the

resource usage inference problem. We show an illustrative portion of a public

cloud (a representative multi-tenant IT platform) that hosts software applications

for its clients. As mentioned earlier, we refer to a platform user’s application

whose resource usage must be separately tracked and accounted as a chargeable

entity (CE). The figure shows two such CEs (labeled CEA and CEB), each a multi-tier

e-commerce site. Each of these CEs supplies the cloud provider with a set of

software components that the cloud runs on its behalf. Our platform runs each

of these components within a VM, and this set of VMs is accommodated within

physical servers s1, ..., s5. Each of these servers runs a VMM/hypervisor layer that

multiplexes its server resources among overlying VMs.

Also shown in the figure is a shared service - a SaaS database - that the platform

offers to its CEs. This shared service itself has multiple components that span


servers s6, ..., s9. Although this database service is not a CE itself, we require

our accounting solution to keep track of the resources it consumes on behalf of the

CEs. Unlike for servers s1, ..., s5, where an accounting-capable VMM (e.g., using an

existing accounting solution such as resource containers) could associate the virtual

machines v1, v2, and v3 with CEA and v4, v5 with CEB, existing solutions cannot

be directly adapted for accounting within servers s6, ..., s9 where the VMM-visible

resource principals do not have a fixed association with any CE. Additionally, since

the back-end tier of the database service is only exercised by the CEs indirectly,

i.e., via work generated during processing of requests that are made by CEs to

the front-end, additional thought is needed to identify what portion of its resource

usage should be attributed to which CE.

Problem Definition: Given a set of CEs and an accounting granularity ∆, the

goal of sAccount is to infer, for each CE c, the time-series Uc(t).

4.3 Our Approaches

Any accounting solution must have two elements: (i) local monitoring and (ii)

collective inference. We use the phrase “local monitoring” to refer to facilities

within each server that record events and statistics pertaining to the resource us-

age of (or on behalf of) each CE. E.g., in resource containers, local monitoring is

carried out by the server operating system that is modified to identify resource

allocation/scheduling events (e.g., when threads are scheduled/descheduled on the

CPU) and using this information to charge their usage to appropriate contain-

ers [14]. We use the phrase “collective inference” to refer to functionality that is

needed to combine the pieces of information offered by local monitoring to create a

correct overall picture of accounting. Since resource containers are only concerned

with a single server, collective inference is trivially realized from the monitored

data. Distributed resource containers must address a more complicated version

of collective inference, and it does this by augmenting the locally monitored data

within each server with the identity of the distributed container (carried within

messages exchanged between container components) that they correspond to [47].

As argued in Section 4.2, both local monitoring and collective inference need to


be reconsidered for servers running shared services. There exist a large number of

techniques and tools for local monitoring that one could choose from. In particular,

these existing techniques span a wide spectrum of the “level of detail” they offer at

the cost of generality, application intrusiveness, and overheads posed. At one end

of this spectrum are techniques that can instrument user-space and OS/VMM code

to create a very detailed record of a shared service’s resource usage that contains

sufficient information for collective inference [94]. At the other end of the spectrum

are CE-oblivious resource usage reporting tools that rely on information available

within the server’s OS and VMM. E.g., top, and iostat. As we will empirically

show in Section 4.5, collective inference that relies on data offered by these tools

can have significant inaccuracies in accounting. Furthermore, as we will find, such

inference can be extremely sensitive to a variety of system properties and environ-

mental conditions, an undesirable feature. Although our results will be based on

a specific inference technique, we argue that the root cause of these inaccuracies

is the inadequacy of information contained in the monitoring information offered

by these tools, and even more sophisticated inference techniques relying on such

information would falter.

Generally speaking, collective inference is a statistical learning problem that

must derive models that can meaningfully tie together the data provided by lo-

cal monitors, possibly filling in any gaps or discrepancies within these data. The

efficacy of such inference crucially depends upon the resource usage phenomena

collected by local monitoring elements. Existing monitoring tools that are not

application-intrusive have been designed for information collection at the gran-

ularity of OS/VMM-relevant abstractions (e.g., threads, TCP connections) that

may not coincide with the needs of our accounting. Consequently we identify the

following design principle that underlies our accounting solution: our local mon-

itoring must explicitly capture information pertaining to resource usage on behalf

of CEs to allow accurate accounting by our collective inference.

Although we believe our design principle is the right direction to follow, we do not dismiss the possibility that there might exist some advanced technique, yet unknown to us, that can perform resource accounting as well as or better than our solution does. However, finding such a technique is not easy and requires strong expertise in the related fields. The effort of finding and applying such a technique may


Figure 4.2. Overall architecture of sAccount implementation.

easily outweigh the benefits of using it. We regard the resource accounting problem as involving a trade-off of effort between the "local monitoring" and the "collective inference" stages. Our approach can be viewed as investing more effort in local monitoring to gain simplicity at the collective inference stage as well as accuracy of resource accounting. A yet-unknown advanced technique, on the other hand, would pay a large effort at the collective inference stage in exchange for relatively little effort during local monitoring. We believe that our approach's trade-off of effort between the two stages is more beneficial than the inference-heavy alternative. Even if such an advanced collective inference technique were found, our approach would still benefit it by delivering monitoring data that are finer-grained and richer in information than the data obtained from common system monitoring tools.

4.4 Design and Implementation: sAccount

In this section, we first present the general ideas underlying our solution. We follow the description of each key idea with details of how we implement it within our prototype accounting system. Our prototype, called sAccount, comprises a cluster of up to 10 Dell PowerEdge SC1425 servers, each with 2 GB RAM, connected by a 1 Gbps network. Each server runs a modified Xen 3.1.4 hypervisor within which we

implement our local monitoring facilities. Additionally, a dedicated server receives

monitored data from all others and runs our collective inference that yields the ac-


counting information. Figure 4.2 presents the overall schematic for our sAccount

prototype.

4.4.1 Local Monitoring

There are two key aspects to the local monitoring that we need to perform at

each server: (i) recording information needed to identify the sets of used servers

for each CE and the structure of resource accounting trees within each server, and

(ii) identifying and recording information about resource principals and scheduling

events of interest. In what follows, we discuss these two issues.

4.4.1.1 Identification of S and T

General Design Considerations: We need to answer the following basic ques-

tion: for a given pair of a CE c and a server machine m, does c make use of any

resources on m during the monitoring interval of interest? Recall from Section 4.2,

that the real challenge in answering this question arises when c uses resources

on m indirectly, i.e., when a shared service component s running on m consumes

resources on behalf of c. How accurately the local monitoring on server m can

identify such indirect usage depends on its accuracy in recognizing the underlying

causation (i.e., some activities of c caused certain activities of s which consumed

some resources of m). When c is only “one hop away” from s, the presence of direct

communication between c and s can yield this causation information. E.g., the

front-end component of the shared database service in Figure 3.1 is one hop away

from the CEs. However, identifying causation becomes trickier when the compo-

nent s is “more than one hop away” from c. An example of this is seen for the data

store component in Figure 3.1 which is two hops away from the CEs. Solving this

problem, in general, requires some form of statistical inference based on building

a probabilistic model to capture this causation, and closely related examples can

be seen in some recent work [16, 18].

Realization in sAccount: If a CE c is only one hop away from a shared service

component s, and is using its resources, the Xen hypervisor on the machine m

running s simply recognizes that m ∈ Sc(t) if it observes an IP address belonging


Figure 4.3. Solution concept. The start and end of CPU accounting are determined by the arrival of request messages and the departure of response messages. As threadx of VM2 sends a message to threadA of VM1, VM1 starts to account the CPU usage of threadA to CE1. This binding stops when threadA sends the reply back. The CPU usage of threadB is not charged to CE1 in between. This requires us to be able to detect thread scheduling events.

to c on any of its incoming messages during the interval [t, t + ∆]. To recognize a

CE c that is more than one hop away, we rely upon ideas from our prior work on

vPath [91]. Very briefly, if one assumes that software components are constructed

using a multi-threaded architecture where: (i) a given thread is only associated

with acting on behalf of one CE at a given time, and (ii) all threads only employ

synchronous communication, then the problem of identifying causation can be

solved exactly (rather than only probabilistically as in the general case) [91]. A

more general realization could employ statistical techniques mentioned above and

is interesting future work.
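To make this concrete, the following is a minimal Python sketch of the bookkeeping that such exact causation tracking implies under the two assumptions above. The class and method names (ThreadBinder, on_message_arrival, on_reply_sent) are illustrative choices of ours, not the actual sAccount interfaces.

# Minimal sketch of thread-to-CE binding under the vPath assumptions:
# a thread serves one CE at a time, and communication is synchronous.
# All names here are illustrative; this is not the sAccount implementation.

class ThreadBinder:
    def __init__(self):
        self.binding = {}       # thread id -> CE currently being served
        self.used_servers = {}  # CE -> set of servers in S_c(t)

    def on_message_arrival(self, server, thread_id, ce):
        # An inbound message from (or on behalf of) CE c binds the
        # receiving thread to c and marks this server as used by c.
        self.binding[thread_id] = ce
        self.used_servers.setdefault(ce, set()).add(server)

    def on_reply_sent(self, thread_id):
        # Sending the response ends the thread segment; resource usage
        # recorded in between was charged to the bound CE.
        self.binding.pop(thread_id, None)

    def ce_for(self, thread_id):
        # Segments with no binding are later reported as 'unaccountable'.
        return self.binding.get(thread_id)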

4.4.1.2 Identifying Resource Principals & Scheduling Events

General Design Considerations: This aspect of local monitoring is concerned

with collecting information about when a schedulable entity begins to use a re-

source on behalf of a certain CE and when it stops doing so. The local monitoring

must record such information solely based on what the hypervisor can observe or

discover about the resource principals on that server, and the events correspond-

ing to their scheduling. For a server that is being directly used by a CE, this

may be relatively straightforward. E.g., the hypervisor can simply use the per-

VM scheduling information that it has access to. Additionally, one may consider

collecting information to enable accounting of resources consumed by systems soft-


ware (e.g., the hypervisor itself, privileged VMs that deal with significant portions of IO virtualization in many systems, etc.) on behalf of the CEs, similar to such

accounting in resource containers [14].

For a server machine m indirectly used by a CE c (i.e., running a shared service

component s exercised by c), we need to identify CPU (de-)scheduling events within

the software of s that correspond to durations for which s was using the CPU on

behalf of c. Identification of any IO activities initiated during these same periods

allows for accounting IO bandwidth usage by s on behalf of c. Figure 4.3 gives

illustrative examples of these ideas.

Realization in sAccount: How completely and accurately the ideal local mon-

itoring described above can be realized depends crucially on certain aspects of the

software architecture employed by the shared service component s in question. In

the sAccount prototype, we assume that shared service components employ the

prevalent multi-threaded concurrency architecture. In this architecture, subsets of

existing threads cater to each CE using s. Furthermore, each individual thread

caters only to a single CE at any given time, although this mapping itself can be dy-

namically changed by the application’s scheduling policy. With this architecture, it

becomes possible to observe and record relevant scheduling events accurately from

within the Xen hypervisor in the following manner: (i) context switching points

within the VM hosting s correspond exactly to events when processing on behalf

of a particular CE begins/ends, (ii) context switches, despite being performed by

the guest/VM kernel, trap to the hypervisor due to the paravirtualized nature of

the Xen that we use, allowing its local monitoring facility to precisely record them,

(iii) our causation establishment technique, described earlier, allows us to correctly

keep track of dynamically evolving binding between threads comprising s and the

CEs that they act on behalf of. Note that the reliance on paravirtualization for

(ii) is not a significant shortcoming and can be overcome even in a system with a

fully virtualized Xen hypervisor (e.g., written for Intel VT). An example technique

for this is based on the following modification to such a hypervisor: the hypervisor uses the PRESENT bits in the PTEs corresponding to the stack of the thread whose context switching it wishes to intercept, so that switching to that stack faults into the hypervisor.

Once CPU intervals used on behalf of a CE are identified, IO activities ini-


tiated during these intervals are marked as corresponding to those CEs. Network-related monitoring is based on tracking the system calls related to network activity. System calls such as READ/WRITE and RECV/SEND trigger network usage; the byte counts they return are interpreted as network bandwidth usage and accumulated for the corresponding CEs. Disk I/O monitoring follows a similar principle to network accounting. The system calls to track are READ and WRITE; these two system calls are also used for network reads and writes.

4.4.2 Collective Inference

Given the extensive information that our local monitoring gathers, collective infer-

ence for accounting CPU and network/disk IO bandwidth essentially boils down

to simple aggregation of the resource usage information collected by various local

monitoring units.

CPU Accounting: CPU accounting is done by measuring CPU cycle counts

between start and end of the thread segments. Cycle counts are accumulated

into the corresponding CE's CPU accounting variables. A thread segment that is not labeled is accounted as 'unaccountable' (see Figure 4.9(b) and Figure 4.15(a) for examples). The 'unaccountable' quantities tell us the possible range of errors in CPU accounting.
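As a concrete illustration of this aggregation step, the following Python sketch accumulates the cycle counts of labeled thread segments per CE, with unlabeled segments falling into the 'unaccountable' bucket. The record layout (start, end, CE label) is an assumption made for illustration.

# Sketch of per-CE CPU aggregation over labeled thread segments.
# Each segment is (start_cycles, end_cycles, ce), with ce = None when
# the binding logic could not associate the segment with any CE.

from collections import defaultdict

def account_cpu(segments):
    usage = defaultdict(int)
    for start, end, ce in segments:
        usage[ce if ce is not None else "unaccountable"] += end - start
    return dict(usage)

# e.g., account_cpu([(0, 90, "CE1"), (90, 120, None), (120, 200, "CE2")])
# -> {'CE1': 90, 'unaccountable': 30, 'CE2': 80}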

Network Accounting: Any network-related activity between the start and end of a thread segment is accounted to the currently identified CE. We observe network-related system calls such as recv or send and add up the returned byte counts to determine how much network bandwidth has been consumed. Note that this quantity does not include bandwidth consumed by protocol-specific overheads such as retransmissions and the various headers/trailers added across the protocol stack.
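A sketch of this byte-count accumulation follows, assuming the monitor emits one record per intercepted network system call (thread id, call name, returned byte count); again, the interfaces are illustrative only.

# Sketch of network accounting: returned byte counts of network-related
# system calls observed during a labeled thread segment are credited to
# the CE currently bound to that thread (see ThreadBinder above).

from collections import defaultdict

def account_network(syscall_records, binder):
    usage = defaultdict(int)
    for thread_id, call, ret_bytes in syscall_records:
        if call in ("recv", "send", "read", "write") and ret_bytes > 0:
            ce = binder.ce_for(thread_id) or "unaccountable"
            usage[ce] += ret_bytes
    return dict(usage)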

Disk I/O Accounting: Unfortunately, disk I/O accounting has limitations.

Due to nondeterminism introduced by the page cache and block I/O handling

mechanisms within the kernel, it is not possible to accurately identify the block

traffic caused by each thread segment. From the application system calls, we can only know how many bytes were requested to be read or written; we cannot determine exactly which part of those translates into actual requests to the device. This forces us to resort to inference over the information collected by sAccount.

Disk READ traffic: When a thread issues a disk READ I/O request, there can be either a page cache hit or a miss. In the case of a hit, the system call returns quickly. In the case of a miss, the system call must block until the data is fetched from the physical device, so its latency is significantly higher. By measuring the latency of individual system calls, we can identify the disk READ requests that triggered actual device traffic. We count the number of reads that missed the page cache per thread segment and use this ratio among CEs to divide up the actual read block traffic observed at the storage device under the control of the hypervisor.
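The sketch below illustrates this miss-ratio apportioning; the latency threshold separating cache hits from misses and the record layout are assumptions for illustration, not measured constants.

# Sketch of disk READ apportioning: READs whose system-call latency
# exceeds a threshold are counted as page-cache misses, and the per-CE
# miss counts divide the read block traffic seen at the device.

from collections import defaultdict

CACHE_MISS_LATENCY_US = 500  # assumed hit/miss separation threshold

def apportion_reads(read_records, device_read_bytes):
    """read_records: iterable of (ce, latency_us) per READ system call."""
    misses = defaultdict(int)
    for ce, latency_us in read_records:
        if latency_us > CACHE_MISS_LATENCY_US:
            misses[ce] += 1
    total = sum(misses.values())
    if total == 0:
        return {}
    return {ce: device_read_bytes * n / total for ce, n in misses.items()}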

Disk WRITE traffic: Disk WRITE I/O requests do not exhibit a latency difference between page cache hits and misses. All write I/Os hit the page cache (unless page cache eviction is triggered) and are destaged in bursts later. Because of this delayed destaging and block I/O coalescing, sAccount is unable to precisely account disk WRITE requests. Division by ratio, as used in the READ case, is not possible because the locality of each thread's WRITEs may differ. In this case, we have no choice but to use inference techniques.

4.4.3 Implementation of sAccount

Our environment is based on Xen [95] virtualization. We employ paravirtualized Xen on 30 dual-CPU blade servers divided over two racks. The distributed applications we use in our evaluations are all hosted within individual virtual machines on these blade servers, and they communicate over 1 Gbps Ethernet. In our environment we focus on performing resource accounting for three resource types (CPU, network bandwidth, and disk bandwidth). However, our resource accounting is not limited to these resources. The basic principle of the technique, observing resource consumption at thread granularity, can be extended to other types of resources that servers might use. For example, memory bandwidth, memory space, and/or SAN networks and storage space could be subjected to the sAccount framework.

Figure 4.2 depicts the overall architecture of sAccount and its relevant components. At each physical host, we modify the Xen hypervisor to add functionality for intercepting system call entries/exits, kernel thread switches, and VM scheduling events. This information is delivered to Dom-0 and recorded through a modified version of xentrace. The xentrace output from multiple hosts is transferred to a central location, labeled 'Accounting Node' in the figure, which runs the parsing and analysis algorithms to generate time series of resource usage for each physical host.

Overhead of Running sAccount: It is important that sAccount incurs small

runtime overhead in order to be practical. In our current implementation, the

overhead that can potentially impact the guest VM or user application comes

from the part where we collect the trace information. Once trace information is collected at each individual hypervisor, it is transferred to a separate server for processing, so the execution overhead of running the accounting algorithm does not impact the guest VMs.

To assess how much overhead the tracing mechanism creates, we conducted the following measurement. First, we created a user-level application that generates 8 million system calls of one type in a tight loop. Then, we measured the total running time of the application with and without the xentrace mechanism enabled. The average elapsed time was 5.62 seconds with xentrace disabled and 5.91 seconds with it enabled, indicating that the overhead of the xentrace mechanism is about 5.2%. Note that this level of overhead is observed only when the application does nothing but issue system calls back-to-back. Since real applications execute many other instructions between system calls, we expect the effective overhead to be significantly less than 5.2%.


4.5 Evaluation

In this section we evaluate the efficacy of sAccount and compare it against a

baseline called LR (described below). In the interest of space, we do not report

results about the aspect of sAccount dealing with inferring and keeping track of

dynamically changing sets of used servers for different CEs. Instead, we restrict

our attention only to the most interesting aspects of the resource accounting tree

within one particular server belonging to the shared service in question.

Our Baseline Accounting Technique (LR): Our baseline is based on a linear regression model relating per-CE resource usage to the inbound network traffic volume from each CE. To account the CPU usage of a server to n different CEs, one can use the volume of inbound network traffic from each CE as the input X and the CPU utilization of the server reported by top as the output Y. Assuming linearity between X and Y, we can solve AX = Y for the coefficients, which can then be interpreted as the contribution of each client to the server's CPU usage. For one measurement data point we can form the following equation:

For one measurement data point we can think of forming the following equation:

a1x1i + a2x

2i + ... + anx

ni = yi (4.1)

where $x_{ni}$ represents the measured input traffic volume to the front-end of the shared service cluster induced by CE$_n$ at time $i$, and $y_i$ the aggregate usage of the target resource measured by system utilities (e.g., CPU utilization from top). The coefficients $a_1, a_2, \ldots, a_n$ are interpreted as how much each CE has contributed to the resource usage at time $i$. Therefore, negative coefficient values are undefined.
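For illustration, the following Python sketch implements such a baseline; since negative coefficients are undefined, it uses non-negative least squares, which may differ in detail from the exact solver used in our experiments.

# Sketch of the LR baseline: regress the aggregate CPU utilization
# (e.g., from top) on per-CE inbound traffic volumes, constraining the
# coefficients to be non-negative since negative values are undefined.

import numpy as np
from scipy.optimize import nnls

def lr_accounting(X, y):
    """X: (samples x n_CEs) traffic volumes; y: (samples,) CPU utilization."""
    a, _ = nnls(X, y)   # coefficients a_1..a_n, one per CE
    return a, X * a     # per-sample usage attributed to each CE

# Example with synthetic data:
# X = np.random.rand(60, 3)
# y = X @ np.array([0.5, 1.0, 0.2])
# coeffs, per_ce_usage = lr_accounting(X, y)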

4.5.1 Accounting Accuracy for a Synthetic Shared Service

Experimental Setup: Fig. 4.4 shows the design and configuration of a synthetic

shared service we employ. We use a two-tiered design for the shared service with

the front-end acting as a caching tier. Cache misses in the front-tier result in work

generated at the back-end. Multiple clients send requests to the front-end during

long-lasting sessions and correspond to our synthetic CEs. We define operation


Figure 4.4. Design and configuration of our synthetic shared service and the CEs exercising it.

types offered by the server, and we extensively measure their resource consumption offline to construct the “ground truth” about their resource needs.

Experiment Design and Key Findings:

Bursty vs. Non-Bursty Workload: Figures 4.5(a)-(b) compare the effect of bursti-

ness/variance in workload on the accuracy of CPU accounting. For different values

of the average request rate imposed on the shared service by a group of three charge-

able entities CE1, CE2, CE3 (which create different CPU utilization levels at the

server), we pick a “non-bursty” scenario where the requests are uniformly spaced in time, and a “bursty” scenario where the request inter-arrival times follow a lognormal(0, 1.0) distribution. We find that the efficacy of LR varies depending upon

the extent of variation within the imposed workload. This is in line with results

known in existing work that find non-stationarity in workloads useful for certain

kinds of prediction and modeling [57]. Intuitively, better accuracy is achieved with

bursty workloads because the higher variety/dynamism in the input data supplies

more information to LR; we expect this basic insight to apply to any statistical

inference technique for accounting. For a less bursty workload, a large part of

the input data may be redundant and not offer new information to an inference

technique. On the other hand, by virtue of its direct measurement of relevant phe-

nomena, sAccount is able to achieve accurate accounting that is robust to changes

in such workload conditions. We also find the accuracy of LR to be sensitive to

the overall CPU utilization level at the shared server, although we do not see a

clear pattern. sAccount performs well in all utilization regions, offering less than

1% error, which may be particularly desirable in the high utilization regions.


(a) Steady (less bursty) workload. (b) Highly bursty workload. (c) Correlation of steady workload. (d) Correlation of bursty workload.
Figure 4.5. Impact of burstiness and shared service resource utilization on the accuracy offered by sAccount versus LR. We use our synthetic shared service along with three chargeable entities and compare the percentage error in CPU accounting for sAccount and LR. We label the errors for our three chargeable entities with LR as CE1, CE2, and CE3, respectively, and label their average as “LR Average.” In all cases, the accounting information offered by sAccount shows less than 1% error (we plot the average of error for the three CEs).

Caching/Buffering: We find that caching and buffering - both valuable and preva-

lent performance enhancement techniques - affect the accuracy of accounting of

a technique like LR. As is well-known in general, a cache within or in front of a

service can destroy/distort correlations between its incoming request/traffic events

and the workload imposed on its underlying server in complex ways. Buffering of

requests/traffic can also have a similar effect by modifying the time lag between an

event (e.g., the issuance of a request) and its cause (e.g., the actual servicing) in

complicated ways. We carry out experiments where we vary several factors affect-

ing the degree and nature of caching within the front-tier of our shared service (see


(a) Effect of caching. (b) Effect of number of CEs.
Figure 4.6. Impact of caching and number of CEs on the accuracy of resource accounting of (a) network and (b) CPU for our synthetic shared service. The number of CEs is three in (a). We plot the average error across all CEs and the standard deviation.

Figure 4.4): request size (fixed or varying), read/write ratio (from 10:1 to 1:1), temporal locality (non-existent to very high), and the extent of common/overlapping content requested by the chargeable entities. From this large parameter space, we present results for the following four kinds of workloads imposed on the shared service: (i) w1, with a choice of factors that we expect to offer minimal caching gains, (ii) w2, with a choice of factors that we expect to offer significant caching gains, (iii) w3, with moderate gains between the two, and (iv) w4, a workload that uses these three in succession.

Figure 4.6(a) shows results for the accuracy of accounting the network band-

width used for communication between the front-tier (cache) and back-tier of our

shared service. The chargeable entities impose a bursty workload (as described

earlier) that imposes an average CPU load of 50% on the server hosting the front-

tier. The x-axis of Figure 4.6(a) indicates the length of the data segment fed into LR. For example, ‘1 min’ means that we used 60 data points, since we collected measurements every second. If the total run is 20 minutes, then we perform 20 LR computations for the ‘1 min’ case and 4 for the ‘5 min’ case. The longer the data segment used for LR, the better the accuracy becomes. The graph also shows that caching has a large impact on accuracy: accuracy gains from workload burstiness can easily be offset if the application happens to employ some form of internal caching.

Varying Number of Chargeable Entities: Figure 4.6(b) shows the effect of the num-


ber of chargeable entities on the accuracy of LR and sAccount. We observe that

the accuracy of LR (both average and variance) deteriorates as the number of

chargeable entities grows, whereas the accuracy of sAccount is unaffected.

Summary of Key Findings: To summarize, we find that the efficacy of LR

relies upon both the quality of the data it gathers and the presence/extent of cor-

relation between its inputs and outputs. Even when accurate data can be obtained

(as with our implementation of LR), several factors including (i) inherent workload

properties (e.g., variance, temporal locality, intensity), (ii) system mechanisms and

algorithms (e.g., caching or buffering), and (iii) environmental conditions (e.g., de-

gree of resource interference from other software) can distort such correlation and degrade the accuracy of the accounting technique. We find empirical evidence that,

owing to its ability to directly measure relevant phenomena accurately, sAccount

is robust to such effects, and offers high-accuracy accounting information across a

wide range of operating conditions.

4.5.2 Accounting for Real-world Services

In general, we cannot expect to determine a real-world application’s actual resource

usage on behalf of different chargeable entities without resorting to extensive ap-

plication and OS instrumentation. Consequently, unlike for our synthetic shared

service, we cannot obtain a direct comparison of the efficacy of our techniques, i.e., the distance from the “ground truth” of the accounting information offered by sAccount versus that offered by LR. We present results for the accounting of

the most bottlenecked resource for the shared service, which we find to be CPU

cycles for MySQL and network bandwidth for HBase.

4.5.2.1 Clustered MySQL as the Shared Service

Experimental Setup: Figure 4.7 shows the set-up of our MySQL cluster that

is used as a shared service by three CEs. Two of these CEs use the TPC-W

benchmark [83] to generate workload for the database, while the third CE uses

RUBiS [87]. The cluster consists of a front-end SQL node that interacts with

the chargeable entities, three data nodes, and a management node; each node is


Figure 4.7. Shared MySQL cluster setting. Three CEs labeled CE1, CE2, and CE3 share this database service.

hosted within its dedicated server. One interesting aspect of the cluster’s operation

is that even in the absence of any workload imposed by the CEs, a large number of

small messages are exchanged between all pairs of nodes within the cluster; these messages are liveness checks. The CEs house separate, non-overlapping data within the database, which is spread across the three data nodes, and the cluster uses a replication degree of 1.

Experiment Design and Key Findings: Since exact accuracy numbers are

elusive, we compare the efficacy of sAccount and LR in the following online re-

source control situation: we wish to ensure that when the aggregate workload

imposed upon the MySQL cluster causes its server CPUs to saturate, we identify

the contribution of various CEs to this “overload,” and then enforce targeted CPU

throttling only on the CE causing the overload.

We implement a CPU policing mechanism within the Xen hypervisors of the MySQL cluster servers. Similar in spirit to time dilation [96], it manipulates the rate at which timer interrupts are delivered to the guest VM, but only when the thread serving the CE causing the overload is about to be scheduled. Two parts are implemented in the Xen hypervisor to achieve this CPU control. First, there is a set of variables, one per chargeable entity, for accumulating CPU usage. Second, there is a mechanism for enforcing the desired amount of resource consumption for each chargeable entity. The method we use to control per-CE CPU consumption is based on manipulating the rate of timer interrupts delivered to the guest VM. Delivering timer interrupts to a VM (specifically


the guest kernel) has the effect of speeding up the flow of time from the guest kernel's viewpoint. This, in turn, triggers scheduler invocations at a faster pace. Our strategy is to deliver timer interrupts faster when the VM is about to schedule a thread that is currently serving the chargeable entity whose resource consumption we want to limit. When any other, innocent thread is about to gain the CPU, we switch back to the normal timer interrupt rate. This strategy requires detecting which thread is about to be scheduled next, which we also implement within the Xen hypervisor. Our version of Xen, 32-bit paravirtualized Xen, supplies a point where every thread context switch occurring within the guest kernel is intercepted. A thread context switch in the guest kernel traps because the stack switch requires modifying the kernel stack pointer stored in the TSS (Task State Segment), which is a privileged operation; the guest kernel, running in ring 1, cannot modify the TSS, which is only allowed at ring 0.

Executing this strategy requires a systematic way of determining the timer rate. The default interval between timer interrupts is 10 ms in our setting, and the guest VM expects to receive one timer interrupt every 10 ms as well. By setting this interval to a smaller value, we can issue timer interrupts at a faster rate. For example, setting the interval to 1 ms would generate timer interrupts 10 times faster, and the guest VM would invoke the thread scheduler every 1 ms, thinking that 10 ms had already passed. To decide on a value for this timer interval, we adopt an approach similar to those used in FAST TCP [97] and PARDA [98]. Let this

we adopt similar approach used in FAST TCP [97] and PARDA [98]. Let this

timer interval be denoted by wit at time t for CEi and 1 < wi

t < 10. Additionally,

let Li be the desired resource usage level of the CEi and, Lit represents the observed

resource usage level of CEi. The equation below is used to determine the timer

interval, wit, from the observed resource usage of CEi, where γ is a smoothing

factor.

wit = (1 − γ)wi

t−1 + γLi

Lit

wit−1 (4.2)

Our choice of this policing mechanism is only for demonstration purposes, and in

practice a more sound technique would be desirable.
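For concreteness, a minimal Python sketch of the update in Equation 4.2 is shown below; the value of the smoothing factor and the clamping to the (1, 10) ms range are assumptions matching the setting described above.

# Sketch of the timer-interval controller of Equation 4.2.
# w_prev: previous interval w^i_{t-1} (ms); L_target: desired usage L^i;
# L_observed: measured usage L^i_t; gamma: smoothing factor in (0, 1].

def next_timer_interval(w_prev, L_target, L_observed, gamma=0.3):
    w = (1 - gamma) * w_prev + gamma * (L_target / L_observed) * w_prev
    return min(max(w, 1.0), 10.0)  # keep 1 < w^i_t < 10 ms

# A CE exceeding its target (L_observed > L_target) gets a smaller
# interval, i.e., faster timer interrupts and hence earlier preemption.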

We configure our CEs to impose a dynamically changing workload (consisting

of three phases) on MySQL as described at a high-level in Table 4.2. In phase 1,

all CEs generate a low-intensity workload, whose aggregate does not saturate the


Phase     Time Window   Workload                                       Top User
Phase 1   0-400s        All 3 CEs generate light loads                 CE2
Phase 2   400-600s      CE2 starts to issue CPU-heavy requests         CE2
Phase 3   600-1200s     CE2's workload overwhelms CPU,                 CE2
                        load increases every 100s

Table 4.2. Description of how the workload imposed by the three CEs is varied over the course of our experiment with the MySQL cluster as our shared service.

MySQL servers. During phase 2, starting at t=400s, CE2 starts issuing more CPU-

intensive requests. We are interested in observing how LR and sAccount handle

this sudden change of behavior. Finally, in phase 3, starting at t=600s, CE2 issues

a continually increasing workload that causes the CPUs to saturate. Here, we are

interested in observing how our simple resource control performs based on the

accounting information offered by LR and sAccount.

Since we do not have precise knowledge about true resource consumption, we

engineer the workloads so that the CPU consumptions imposed by the CEs are

significantly different from each other, allowing us to rank their contributions without ambiguity. For example, we make the CPU consumption of CE2 much larger than the others' starting at t=400s so that the other CEs cannot be mistaken for heavy CPU

consumers. We begin by taking an in-depth look at the CPU accounting informa-

tion offered by LR and sAccount at one of the MySQL servers (SN) and how it

evolves during phases 1-3 (results for CPU accounting of other MySQL nodes are

qualitatively similar and we do not present them in the interest of space).

In Figures 4.8(a),(b), we depict the inputs for LR (per-CE network traffic and

aggregate CPU usage at SN’s server) and its output (accounting information for

each CE), respectively. These figures are helpful to consider in combination with

the following discussion of LR’s accounting and its comparison with sAccount.

Figures 4.9(a),(b) show CPU accounting for SN’s server as carried out by LR

and sAccount, respectively. We use a ”stacked” representation, where the area

under the curve corresponding to a CE represents the CPU usage charged to it.

During phase 1, both LR and sAccount produce correct rank orders of CEs, al-

though LR slightly overestimates the CPU consumption for CE2. However, during

phase 2, LR starts to report incorrect rank order: it determines CE3 to be the


(a) Inputs to LR accounting. (b) CPU usage pattern of 3 CEs from separate runs.
Figure 4.8. Network traffic and individual CPU utilization time series. Graph (a) shows the network traffic exchanged between SN and each of our three CEs, and forms part of the input to LR. Graph (b) presents the CPU usage at SN induced by each of the CEs when it runs separately from the other CEs, as part of the offline profiling that we do. These usages serve as our estimate of the ground truth for the CPU usage each of the CEs induces in the actual experiment. The resource accounting results of LR and sAccount should be compared with (b) to see how far they are from this estimated ground truth.

cause of the increased CPU usage. Upon investigating the reason for this mistake

by LR, we find the following. While CE2 issues CPU-heavy requests and waits for

MySQL’s response, CE3 continues to issue requests at a relatively high rate that

are not CPU-heavy. However, the higher rate of requests coming from CE3 causes

LR to infer spurious positive correlation between CE3’s requests and SN’s CPU

usage. In fact, LR is unable to correct this throughout phase 2.

As we show in Figure 4.9(b), besides determining the correct rank order


(a) Result of CPU accounting using LR. This is a stacked graph. (b) Result of CPU accounting using sAccount. This is a stacked graph.
Figure 4.9. Comparison of CPU accounting results. The CPU usage of the MySQL Cluster SQL node is being accounted. In (a) the accounting starts at time 200 since LR needs to collect some amount of data. By comparing the areas of equivalent color we can see the rank order determined by each technique as well as the accuracies. Please compare with Figure 4.8(b) to see the true CPU consumption. The result of sAccount includes the ‘unaccountable’ portion, which can be divided among chargeable entities by any reasonable policy.

in its accounting, sAccount also reports what portion of the CPU usage of SN’s

server it finds unaccountable. This amount indicates that sAccount’s algorithm was

unable to charge the given thread’s resource usage to any of the chargeable entities

because no direct association was found. This can happen if some thread is spawned independently of input requests from the chargeable entities and performs maintenance jobs, or if a thread is created to service other running threads. In any case, sAccount reports this resource usage to the user, and it is up to the user to divide it among chargeable entities. The most reasonable division would be to split the ‘unaccountable’ portion according to each chargeable entity's proportion of resource usage within that time window.

During phase 3, starting at t=600s, CE2 starts to saturate the CPU by dras-


Figure 4.10. Response time change of the RUBiS application. This graph shows the development of the RUBiS response time for two cases - throttled by LR, and controlled by sAccount. Since LR picks the wrong CE (CE3) as the culprit for the performance degradation, throttling the request rate of CE3 is ineffective. However, the sAccount technique shows noticeable effects. The moving-average response time under sAccount indicates that sAccount can contain the performance interference.

tically increasing the workload it imposes as described in Table 4.2. As portions

of Figures 4.9(a) and (b) for this phase show, LR continues to perform incorrect

accounting. This has a detrimental effect on our CPU policing based resource

control. Figure 4.10 shows the change of response time for CE3 whose RUBiS ap-

plication is accessing the shared MySQL Cluster service. Starting from time 600,

the response time increases. We have set the response time of 300 ms as the initial

warning level and 600 ms as the SLA violation level. The CPU saturation caused

by CE2 continues to degrade the response time of RUBiS and eventually it violates

the SLA. Since LR determines that CE3, not CE2, is the source of overload (see

Figure 4.9 (a)), CE2 is not marked for any counter actions. However, sAccount is

able to identify true cause of the overload and, starting at time 730, it initiates the

CPU throttling for CE2. Figure 4.10 indicates that the moving average of response

time under the control of sAccount is able to contain the response time below the

SLA limit. This demonstrates one promising capability of sAccount (which is the

thread-level monitoring technique) in critical resource managements of such shared

resources.

4.5.2.2 HBase as the Shared Service

Experimental Setup: Our second real-world shared service is HBase, a key-

value store offering an open-source implementation of Google's Bigtable [36]


Figure 4.11. Our setup for using HBase as a shared service. Our CEs are based on client programs that use the YCSB workload generator.

that has significantly different resource usage characteristics from a database such

as MySQL. An HBase cluster consists of “region servers” and the HBase “master”

that manages these region servers. HBase stores its data in a Hadoop cluster,

and this is a great example of a shared service (HBase) relying upon another

shared service (Hadoop) to cater to the needs of CEs using it. The region servers

act as in-memory caches for the contents of the data nodes. A region server

employs an HDFS client [99] to communicate with the Hadoop cluster that stores

data persistently. HBase employs Zookeeper [100] for coordinating distributed

operations and locating region servers. We configure our HBase with a single region

server. HBase operation involves significant data transfer from the data nodes to

the region server, whereas the CPU load imposed by most requests is small, as most requests are simple data retrievals or insertions. Since the network bandwidth available to the region server becomes the bottleneck resource well before the CPU, we focus more on accounting results for network bandwidth. Our HBase caters to requests from two CEs derived from the YCSB workload generator [101], which we refer to as CEA and CEB.

Experiment Design and Key Findings: We run an experiment lasting 500

seconds, during which the loads offered by CEA and CEB are varied as follows:

(i) during t=0 to t=100s, both CEA and CEB generate identical workloads containing 5% update requests, (ii) at t=100s, CEB changes to a read-intensive mode with good temporal locality, which incurs a high hit rate in the region server, causing its


Figure 4.12. Evolution of the network traffic incoming to the region server from the two CEs during the run. Both CEA and CEB start out by sending similar requests to HBase during t=0 to t=100s, implying the network bandwidth and CPU usage of the region server should be accounted equally to them for this period. CEB changes its behavior at t=100s, whereas CEA changes its behavior at t=200s.

CPU usage to increase proportionally with the network traffic sent to CEB, (iii)

at t=200s, CEA starts to issue CPU-intensive insert-type requests that cause the

CPU usage at the region server to increase. Figure 4.12 shows the network traffic

size inbound to the region server from the two CEs under the described workload

scenario.

Figure 4.13 shows the profiling results of individual runs for both the CEA and CEB chargeable entities. Unlike MySQL Cluster, where several different request types can each have varying resource demands in terms of CPU and network, HBase behaves in a much simpler way. HBase's request types are not diverse in terms of resource consumption, because HBase is a simple key-value store that does not carry out complex logic such as joins and sub-queries. As a result, both CPU and network usage show trends very similar to the client input traffic.

Summary of Results: Figures 4.14(a) and (b) show the results of accounting the network bandwidth by LR and sAccount at the region server of HBase. The inputs to LR are the two time series of inbound network traffic from the two chargeable entities shown in Figure 4.12. We configured LR to use 100-second-long data as the input length in HBase's resource accounting, producing no accounting results


(a) CPU usage at the region server for two CEs. (b) Network input traffic from two data nodes to the region server.
Figure 4.13. Profiling measurement of CPU and network resources at the region server for CEA and CEB. These are obtained from individual (not combined) runs of each chargeable entity. They are intended to serve as an estimate of the true resource usage when analyzing the resource accounting results.

for the first 100 seconds of the run. The accounting results from sAccount are

presented in Figure 4.14(b). In both of the stacked graphs, the red area (the topmost area) corresponds to the portion of network bandwidth used by CEB, and the blue area (the second area from the top) to the portion used by CEA. From Figure 4.13, we know that the network bandwidth usage of CEA should drop significantly from time 200, making CEB the heavier consumer of network bandwidth. We observe that the result of LR unfortunately does not reflect these true values: LR reports that CEA remains the dominant consumer of network bandwidth throughout the run. In contrast, sAccount correctly reflects


(a) Result of LR on incoming traffic from data nodes to the region server. This is a stacked graph. (b) Result obtained from sAccount. This is a stacked graph.
Figure 4.14. Comparison of resource accounting results between LR and sAccount on the outbound network traffic from the data nodes to the region server. The contour of (a), drawn in a thick line, is obtained by iptables. Note the resemblance of this to the overall traffic size of (b) as a quick sanity check of the sAccount technique.

this resource usage by the CEs (see the size of the red area in Figure 4.14(b) after time 200). The misjudgment by LR can be explained in terms of caching effects. After time 200, the input network traffic from CEB to the region server doubles until the end of the run (see Figure 4.12), whereas the incoming traffic from the data nodes increases by only 20% (see Figure 4.13(b)). We believe this is due to internal caching at the region server, and the HBase documentation supports this conjecture. Earlier, in Figure 4.6, we saw that caching can impact the performance of LR.

In addition to the network accounting results, we also present selected additional accounting results from sAccount. Figure 4.15(a) shows the CPU accounting at the region server for CEA and CEB. According to our preplanned workload scenario, CEB should consume more CPU than CEA after time 200 sec, and the accounting result indicates this behavior. Notice the significant portion of un-


(a) CPU accounting by sAccount at the region server. This is a stacked graph. (b) Network accounting by sAccount at a data node. This is a stacked graph.
Figure 4.15. Resource accounting by sAccount at various nodes of HBase. Result (b) provides evidence of sAccount's capability to perform resource accounting multiple hops away from the front-end of the shared service. Note that the data node of HBase does not have direct communication with any of the chargeable entities.

accountable CPU usage at the bottom region of the stacked graph (orange color).

It includes CPU cycles consumed for communicating with Zookeeper, the HBase master, and the Hadoop NameNode. In particular, we noticed two high peaks at around 80 sec and 470 sec. From studying the HBase documentation, we concluded that these are most likely due to I/O compaction at the region server. This nondeterminism disturbs the correct operation of LR, whereas sAccount correctly identifies these activities as not being correlated with any chargeable entity. Figure 4.15(b) presents the accounting result for network usage at one of the data nodes. We present this result to provide evidence that sAccount can perform resource accounting at a node that lies “more than one hop” away from the chargeable entities. This shows that our implementation can successfully propagate the causality of messages across nodes and make use of it in resource accounting.


4.6 Summary and Conclusion

In this chapter we have presented our solution to the provider-end resource accounting problem. Our solution, called sAccount, operates entirely within the hypervisor, giving us high transparency, and collects data at thread-level granularity. As motivation, we provided evidence that existing solutions based on statistical inference may suffer from several weaknesses, and showed that our sAccount solution is robust to such conditions. We evaluated our solution on a synthetic shared service as well as two real-world distributed applications - MySQL Cluster and HBase. MySQL Cluster represents the popular SaaS offering of a relational database found in today's cloud services; HBase represents the key-value type of simple storage service and is an open-source version of Google's Bigtable. Using these real-world applications, we observed that a statistical technique can actually mislead cloud platform operators by providing incorrect results under certain conditions. We also applied an online version of sAccount to a scenario in which sudden workload changes cause an SLA violation. Comparing side-by-side with the statistical inference technique on the same workload scenario, we showed that sAccount can be more responsive and correct in critical real-time system management scenarios.

Chapter 5

Consumer-end Decision Making

5.1 Introduction

As utility computing develops into maturity, consumers are presented with numer-

ous service providers to choose from and sets of service options in each of them.

Consumers are faced with the problem of incorporating all information of cloud of-

ferings and making a decision of hosting their applications in the most cost-optimal

way. In such decision making, it is important that consumers are aware of impor-

tant cost components that potentially have large impact to the overall costs. Also,

consumers need a systematic methodology or tools that can incorporate various

cost factors and present useful analysis results with reasonable margin of errors.

This chapter is concerned with investigating such a methodology that can aid in

the decision making for the problem of selecting optimal hosting configurations.

We first discuss the taxonomy of cost components and current cloud-based host-

ing options. Then we utilize the Net-Present-Value concept to analyze the cost of

hosting a consumer’s application for a given period of time.

The quintessential question when considering a move to the cloud is: should

I migrate my application to the cloud? Whereas there have been several studies

into this question [63, 72, 64, 102], there is no consensus yet on whether the cost

of cloud-based hosting is attractive enough compared to in-house hosting. There

are several aspects to this basic question that must be considered. First, although

many potential benefits of migrating to the cloud can be enumerated for the gen-

eral case, some benefits may not apply to my application. For example, benefits


related to a lowered entry barrier may not apply as much to an organization with a pre-existing infrastructural and administrative base. As another example, the benefits of pay-per-use are less attractive for a well-provisioned application whose workload does not exhibit much variation. Second, there can be

multiple ways in which an application might make use of the facilities offered by

a cloud provider. For example, using the cloud need not preclude a continued use

of in-house infrastructure. The most cost-effective approach for an organization

might, in fact, involve a combination of cloud and in-house resources rather than

choosing one over the other. Third, not all elements of the overall cost consider-

ation may be equally easy to quantify. For example, the hardware resource needs

and associated costs may be reasonably straightforward to estimate and compare

across different hosting options. On the other hand, labor costs may be signifi-

cantly more complicated: e.g., how should the overall administrators’ salaries in

an organization be apportioned among various applications that they manage? As

another example, in a cloud-based hosting, how much effort and cost is involved

in migrating an application to the cloud? Answering these questions requires an

in-depth understanding of the cost implications of all the possible choices spe-

cific to my circumstances. This chapter presents our findings on the economic

aspects of application hosting in a cloud-based environment using cost analysis for

representative e-commerce benchmarks. Although we restrict our attention to a

single cloud provider, it will become clear that our methodology readily extends

to scenarios where multiple cloud providers are available to a consumer.

5.2 Background

5.2.1 Net Present Value

In financial analysis, investigating the suitability of an investment involves assess-

ing the overall costs expected to be incurred over its lifetime. From the financial

point of view, the decision of whether to migrate an application to the cloud can be viewed as an investment decision problem. The concept of Net

Present Value (NPV) [103] is popularly used in financial analysis to calculate the

profitability of an investment decision over its expected lifetime considering all the


cash inflows and outflows. Borrowing existing notation, we define the NPV of an

investment choice spanning T years into the future as:

NPV_T = \sum_{t=0}^{T-1} \frac{C_t}{(1 + r)^t} \qquad (5.1)

where r is the discount rate and C_t the expenditure during the t-th year. The role of the discount rate is to capture the phenomenon that a dollar today is worth more than a dollar in the future, with its value decreased by a factor of (1 + r) per year.

As a simple example to understand NPV, assuming r = 5%, consider two

choices to purchase 10 items, each costing $1,000 over a one year span: (i) buy all

today: NPV=$10,000, and (ii) buy half today, and half next year: NPV = $5,000

+ $5,000 / 1.05 = $9,761. The latter is the preferred choice here since it allows

us to spend a lower amount than with choice (i) for the same overall purchase.

Whereas the NPV definition can be enhanced to also incorporate the effect of

inflation (e.g., in case (ii) we might need more than $5,000 to buy 5 items a year

from now), we assume it to be small and ignore it in our present work.
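To make Equation 5.1 concrete, the following minimal Python sketch (our own illustration, not part of the dissertation's tooling) reproduces the two-choice example above:

def npv(costs, r=0.05):
    # Net present value (Eq. 5.1) of a stream of yearly expenditures C_0..C_{T-1}.
    return sum(c / (1 + r) ** t for t, c in enumerate(costs))

print(npv([10_000]))        # choice (i), buy all today: 10000.0
print(npv([5_000, 5_000]))  # choice (ii), half today, half next year: ~9761.90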

5.2.2 Cost Components

Hosting an application incurs various types of costs, including hardware, software, and operational costs. Each cost type has its own idiosyncrasies, requiring us to scrutinize them one by one and understand in what circumstances and how the cost is incurred. To better understand and facilitate the following discussion, we first develop a taxonomy of costs in the context of in-house and cloud-based application hosting. We classify costs along two dimensions: quantifiability and directness of contribution to the overall costs. Quantifiability refers to whether the cost is easily representable as a dollar amount. Directness of contribution refers to whether the cost is incurred solely for the product/service under consideration or not.

Figure 5.1 presents our classification of costs. Certain cost components are

less easy to quantify than others, and we use the phrases “quantifiable” and “less

quantifiable” to make this distinction. Examples of less quantifiable costs include



Figure 5.1. Taxonomy of costs involved in in-house and cloud-based application hosting. Costs can be classified according to quantifiability and directness. Quantifiable costs are grouped into material, labor, and expenses. The material category roughly corresponds to capital expenses (Cap-Ex); the labor and expenses categories correspond to operational expenses (Op-Ex). In this study we focus on the quantifiable category.

effort of migrating an application to the cloud, porting an application to the pro-

gramming API exposed by a cloud (e.g., as required with Windows Azure), time

spent doing the migration/porting, any problems/vulnerabilities that arise due to

such porting or migration, etc. “Quantifiable costs”, on the other hand, can be ac-

curately quantified and we further divide these into three sub-categories: material,

labor and expenses.

Adhering to well-regarded convention in financial analysis, we also employ the

classification of costs into the “direct” and “indirect” categories based on their

ease of traceability and accounting. If a cost can be clearly traced and accounted

to a product, service, or personnel, it is a direct cost, else it is an indirect cost. As

shown in Figure 5.1, examples of direct cost include hardware and software costs;

examples of indirect cost include staff salaries. It should be noted that certain

costs may be “less quantifiable” yet direct (e.g., porting an application in case

of using Platform-as-a-service cloud). Similarly, certain costs may be quantifiable

yet indirect (e.g., staff salaries, cooling, etc.). Since less quantifiable costs can contain ambiguity in their interpretation, requiring excessive effort to quantify in dollar amounts, we mainly focus on the quantifiable costs in this work. Within the quantifiable category, we first carry out the cost analysis using the direct costs only. Next, we add indirect costs into the picture to understand what effect they have on the result of the previous analysis (Section 5.3.5). Accounting


the indirect costs often requires organization-specific knowledge. In our analysis,

we use ranges rather than exact numbers for indirect costs to capture a wide

spectrum of scenarios.

5.2.3 Application Hosting Choices

Besides pure in-house and pure cloud-based hosting, a number of intermediate

and/or hybrid options have been suggested, and are worth considering [104]. We

view these schemes as combinations of different degrees of “vertical” and “hori-

zontal” partitioning of the application. Vertical partitioning splits an application

into two subsets (not necessarily mutually exclusive) of components - one is hosted

in-house and the other migrated to the cloud - and may be challenging if any

porting is required [104]. Horizontal partitioning replicates some components of

the application (or the entire application) on the cloud along with suitable work-

load distribution mechanisms. Such partitioning is already being used as a way to

handle unexpected traffic bursts by some businesses (e.g., KBB.com and Domino’s

Pizza [105]). Such a partitioning scheme must employ mechanisms to maintain

consistency among replicas of stateful components (e.g., databases) with associ-

ated overheads. Given myriad cloud providers and hosting models (we consider

IaaS and SaaS), there can be multiple choices for how a component is migrated

to the cloud, each with its own cost implications. In this work, we choose three

such options (in addition to pure in-house and pure cloud-based), which we describe next.

5.2.4 Determining Hardware Size Requirement

The first step in performing the cost analysis is determining the size of hardware

base for the given workload intensity. Knowing the required number of hardware

units allows us to calculate other dependent costs such as OS licenses or power

consumption. This is an important problem that is starting to receive attention

from researchers [102]. Consumers need such techniques in order to come up

with accurate cost estimates, especially when there are many service providers

to choose from. This problem is closely tied to the general problem of application

performance modeling. Once a performance model is established, it can be used to


[Two plots of throughput (req/s) vs. sessions (# of clients): (a) one node vs. two nodes; (b) one CPU vs. two CPUs.]

(a) Jboss server (b) MySQL server

Figure 5.2. Marginal throughput measurements. Both graphs show how much throughput gain there is from adding one more unit of resource. For the Jboss server (a), we observe the marginal gain from adding one more single-core server. For Mysql (b), we add one more CPU core and observe the marginal gain.

answer the question of what the required hardware size is for a given performance

by exploring combinations of model parameters. Performance modeling has been

extensively studied for the traditional hosting environment [106, 107, 108] and is

being actively studied for the virtualized environment as well [109, 110]. The goal

of this chapter is to observe and understand the effects of various cost factors on

the overall costs. In this chapter, we employ simple techniques based on empirical

measurements and interpolations as described below. We find that inaccuracies

in such simplistic techniques do not affect the findings and insights we learn from

this cost analysis.

5.2.4.0.1 In-house Provisioning: We employ a cluster of servers in our lab as our in-house hosting platform, all of which have an Intel Xeon 3.4GHz dual-processor

with 2GB DRAM and are connected via a 1 Gbps Ethernet. In order to determine

the number of machines required to meet the desired throughput, we empirically

obtain the marginal throughput gain offered by adding an extra unit (granularity of

CPU core as well as single machine) to it when all other tiers are well-provisioned.

Assuming each unit is eventually operated at relatively low utilization (i.e., it

is sufficiently over-provisioned), we can use these marginal gains to predict the

capacity needs of each tier for a given workload intensity. Figure 5.2(a) and (b)

show the marginal throughput offered to TPC-W (i) by an extra machine for JBoss,

and (ii) by an extra CPU for Mysql, respectively. From these empirical results, we

obtain the capacity of each of our machines for the Jboss and Mysql tiers to be 146.6


[Plots: (a) elapsed seconds for the same work on the reference 3.4GHz CPU (avg: 3.85 s, std: 0.014) and the EC2 instance (avg: 11.9 s, std: 1.86); (b) probability distribution of EC2 latency with modes near 9.5 sec and 13.4 sec.]

(a) Performance Comparison (b) Latency Distribution of EC2

Figure 5.3. EC2 Instance's CPU Microbenchmark Results. Graph (a) compares the latency of finishing the same number of arithmetic operations; roughly, the CPU of an EC2 instance is one third of our reference machine. Graph (b) shows that the distribution of EC2 CPU bandwidth is bimodal.

transactions/sec (tps, henceforth) and 311.5 tps per machine, respectively. As for the Apache tier, the resource consumption was insignificant, and we estimated from observed CPU utilization that one machine could handle about 4K tps.
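Under the assumption that these marginal gains scale roughly linearly, the sizing step can be sketched in Python as follows (the capacity table and function names are ours, for illustration only):

import math

# Per-unit capacities (tps) from the measurements above; the Apache figure is
# the rough estimate derived from its observed CPU utilization.
CAPACITY_TPS = {"apache": 4000.0, "jboss": 146.6, "mysql": 311.5}

def units_needed(workload_tps, tier):
    # Units (machines or CPU cores) to provision for one tier, assuming the
    # other tiers are kept well-provisioned.
    return math.ceil(workload_tps / CAPACITY_TPS[tier])

plan = {tier: units_needed(500, tier) for tier in CAPACITY_TPS}
# -> {'apache': 1, 'jboss': 4, 'mysql': 2} for a 500 tps workload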

5.2.4.0.2 Cloud-based Provisioning: We would like to find cloud-offered

resources that are likely to offer performance to TPC-W comparable with that

offered in-house. The hosting options that we consider require us to do this exercise

for the following: (i) Amazon EC2 instances (IaaS), (ii) Amazon RDS (SaaS) for

database, and (iii) SQL Azure (SaaS) for database. Based on existing results, we

assume that for (ii) and (iii) the cloud provider internally employs techniques to

scale resources to match workload needs and this reflects in the payments [111].

For (i), we must determine the resource needs and procure sufficient number of

instances. We describe our methodology for estimating this number for each tier of

TPC-W where we employ Amazon EC2’s small instance type (we did not find any

improvements in performance/dollar offered by large and extra large instances,

hence we restrict our attention to only small instances). Amazon EC2’s small

instance type claims to provide a computing power equivalent to 1.0-1.2 GHz

CPU. Interestingly, the /proc/cpuinfo of such an instance shows an Intel(R) Xeon 2.6GHz CPU. It is known that Amazon EC2 multiplexes two

VM instances on one physical core, making it effectively 1.3GHz. We ran a simple microbenchmark to verify this and to establish computing power relative to our

machines. Our microbenchmark performed increment operations on an integer


variable in a loop from a single thread. We set the loop count to be 2 × 10^9

times and measured the elapsed time for both the reference machine (our lab

machine) that had 3.4GHz CPU and the EC2 small instance that had 2.6GHz

CPU. The EC2 small instance is found to operate at an effective 1.09GHz, about 1/3 the speed of our

reference machine, which matches the claim of 1.0-1.2 GHz. Using this benchmark

information, we set the throughput limit of a single EC2 small instance for the Jboss

and Mysql tiers to be 57.34 tps and 121.86 tps, respectively.
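A Python rendering of this microbenchmark is sketched below; the original benchmark was presumably native code, so this version only illustrates the method of comparing effective clock speeds via a fixed amount of single-threaded work:

import time

def elapsed_increment_loop(n=2 * 10**9):
    # Single-threaded integer increments; for this workload the elapsed time
    # is roughly inversely proportional to the effective clock speed.
    start = time.perf_counter()
    x = 0
    for _ in range(n):
        x += 1
    return time.perf_counter() - start

# Measured on each machine, the ratio of elapsed times gives relative speed:
# effective_ghz = 3.4 * (t_reference / t_ec2)  # ~3.4 * (3.85 / 11.9) = 1.10 GHz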

5.3 Analysis

We first conduct our study on the effect of workload intensity and growth rates over

varying operational periods. We include only the "quantifiable" and "direct" cost components in this discussion, which will enable us to better understand the roles of the basic cost components. This will help us avoid drawing premature or incorrect conclusions from possibly unnoticed effects of "indirect" costs. We extend our

analysis to the domain of “indirect” costs in Section 5.3.5.

Figure 5.4 presents NPV cost calculations for up to a 10 year time horizon for

TPC-W. We present results with three initial workload intensities: (i) 20 tps, (ii)

100 tps, and (iii) 500 tps, which represent "small", "medium" and "large" in the overall spectrum. Although the choice of 500 tps as "large" may seem arbitrary, we adopt it since the analysis results do not change beyond that point. We also present

two intensity growth scenarios: (i) 20% increase per year and (ii) stagnant. The

former represents a thriving business and, the latter, a stabilized one. Also, for

the growth rates, there can be a business growing at an astronomical rate, but the

analysis results do not change qualitatively.

For a specific intended operational period, Figure 5.4 can be used to identify

the most economical hosting option. For example, in the case of "medium" and stagnant growth (Figure 5.4(e)), if an operational period of 3 years is expected, the best hosting option would be "Fully EC2", whereas the winner becomes "Fully Inhouse" if an operational period of 8 years is assumed. Therefore it is important to identify

such cross-over points for decision making.
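Programmatically, identifying cross-over points amounts to finding the years at which the cheapest option changes; a minimal sketch (with hypothetical option names and NPV curves supplied by the caller) is:

def cheapest_option(npv_by_option, year):
    # npv_by_option maps each hosting option to its list of NPVs for years 1..T.
    return min(npv_by_option, key=lambda opt: npv_by_option[opt][year - 1])

def crossover_years(npv_by_option, horizon):
    # Years at which the most economical option changes hands.
    winners = [cheapest_option(npv_by_option, y) for y in range(1, horizon + 1)]
    return [y for y in range(2, horizon + 1) if winners[y - 1] != winners[y - 2]]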

From Figure 5.4 we observe several interesting points. We find that in-house

provisioning is cost-effective for medium to large workloads, whereas cloud-based


[Six plots of NPV cost ($) over years 1-8; hosting options: Fully Inhouse, Fully EC2, EC2+RDS, Inhouse+RDS, Inhouse+SQL Azure.

(a) 20 tps, 20% growth (b) 100 tps, 20% growth (c) 500 tps, 20% growth

(d) 20 tps, 0% growth (e) 100 tps, 0% growth (f) 500 tps, 0% growth]

Figure 5.4. NPV over a 10 year time horizon for TPC-W. We consider three different workload intensities: small (20 tps at t=0), medium (100 tps), and high (500 tps). We also consider two different workload growth rates of 0% and 20% per year.

options suit small workloads. For small workloads, the servers procured for in-house

provisioning end up having significantly more computational power than needed

(and they remain severely under-utilized) since they are the lowest granularity

servers available in the market today. On the other hand, the cloud can offer instances

with computational power matching the small workload needs (due to the statis-

tical multiplexing and virtualization it employs). For medium workload intensity,

cloud-based options are cost-effective only if the application needs to be supported

(i.e. operate temporarily) for 2-3 years, and become expensive for longer-lasting

scenarios. These workload intensities are able to utilize the provisioned servers well, making in-house procurement cost-effective.

5.3.1 Data Transfer, Vertical Partitioning

Consistently in all cases of Figure 5.4, the most economical cloud-based deploy-

ment option turns out to be “Fully EC2”, closely followed by “EC2+RDS”. Both



(a) Fully EC2 ($22k) (b) EC2+RDS ($29k)


(c) In-house+RDS ($70k) (d) In-house+SQL Azure ($63k)

Figure 5.5. Closer look at cost components for four cloud-based application deployment options in the 5th year. Initial workload is 100 tps and the annual growth rate is 20%.

“Inhouse+RDS” and “Inhouse+SQL Azure” are significantly more expensive than

either "Fully EC2" or "EC2+RDS". To explain why, we provide detailed breakdowns of the NPV for five-year hosting of TPC-W for the hosting options involving the cloud in Figure 5.5. For "Fully EC2" and "EC2+RDS", the cost of

purchasing cloud instances (including the RDS instance) takes up about 60% or more

and the rest is mostly the charge for the data traffic in and out of the cloud. How-

ever, for the “Inhouse+RDS” and “Inhouse+SQL Azure”, the data traffic cost

dominates the overall cost by reaching more than 70%. Data transfer (in & out)

costs in Figure 5.5(c),(d) are larger than those in Figure 5.5(a),(b) because traf-

fic per transaction between Jboss and MySQL (16KB/tr) is larger than between

clients and Apache (3KB/tr). From this observation, we find that data transfer is

a significant contributor to the costs of cloud-based hosting, between 30% and 70% for TPC-W.

As a corollary, this also suggests that vertical partitioning choices may not be



(a) Fully In-house ($482K) (b) Fully EC2 ($348K) (c) EC2+SQL Server ($279K)

Figure 5.6. Cost break-down of TPC-E at the 6th year.

appealing for applications that exchange data with the external world and/or across

components that fall across the boundary of partitioning.

5.3.2 Storage Capacity, Software Licenses

Storage capacity can be a key factor that might overturn the decision of cloud-

based hosting vs. in-house hosting. Whereas TPC-W poses relatively small costs

for storage capacity (its database only needs a few GB and its storage capacity

costs do not even show up in Figure 5.5), TPC-E has significant data storage needs

(its database requires about 4.5TB). Figure 5.7 presents the NPV cost evolution

for TPC-E for two initial intensities - 300 tps (medium) and 900 tps (high). The

annual growth rate in both cases is 20%. We only present fully in-house and two

cloud (Fully EC2 and EC2+SQL server) options since we have already established

the high costs of vertical partitioning. We find that in-house provisioning for

TPC-E has to make significant investments in high-end RAID arrays (gap A), which constitute about 75% of overall costs. For an initial workload intensity of 300

tps, these costs go down substantially with fully EC2 (i.e., renting storage from

EC2 is cheaper than the amortized cost of procuring this much storage in-house),

causing the overall costs to improve by 50% (year 1, shown as gap A) and 28%

(year 6, shown as gap B in Figure 5.7).

The software licensing fee for SQL Server and Windows can also be a significant

contributor to TPC-E costs: the 2nd largest (17.4% of overall) and the largest (67%)



(a) Initial workload: 300 tps, 20% Growth rate (b) Initial workload: 900 tps, 20% Growth rate

Figure 5.7. Two sets of TPC-E results at initial workloads of 300 tps and 900 tps.

contributor, respectively, for fully in-house and EC2 options as shown in Figure 5.6.

Using pay-per-use SaaS DB allows the elimination of SQL Server licensing fees

(shown as gap C in Figure 5.7) and results in even better costs. SaaS options can be

cost-effective for applications built using software with high licensing/maintenance

fees. Note that these concerns did not arise with TPC-W, which employed open-

source software, implying a different ordering of cost-efficacy among options.

It is also worth comparing the cost evolution for two intensities in Figure 5.7.

With medium intensity (300 tps), the in-house option is less attractive than the cloud-based options for the entire 10 year period, without ever having a cross-over point. However, at higher intensity (900 tps), cloud-based hosting quickly (after 2 years

for fully EC2 and after 4 years for EC2+SQL server) becomes more expensive than

in-house. This is qualitatively similar to the observations for TPC-W. However,

cloud-based options remain attractive for a larger range of workload intensity than

for TPC-W (compare Figure 5.7(a) with Figure 5.4(b) both of which have the

same growth rate but differ in intensity by a factor of 9) - the key reasons for this

difference are gaps B and C, i.e., the higher storage costs for in-house TPC-E as

well as the contribution of software licenses in non-SaaS options.

A final interesting phenomenon arises due to the following: when buying cloud

instances for the TPC-E database, we do find machines that offer the required computational power per VM but not the requisite degree of parallelism. Combined with the current database pricing policy of major vendors regarding virtual cores (explained a few sentences below), this negatively affects the overall cost of cloud-based hosting. For example, the High-Memory Quadruple Extra Large Instance of Amazon EC2 offers 8 virtual cores, each with 3.25 EC2 compute units. A virtual core of 3.25 EC2 compute units provides computing power equivalent to a 3.6 GHz CPU, which is on par with typical processor speeds. However, since the maximum number of cores is limited to 8 for cloud instances, whereas the in-house server used for the TPC-E analysis has 12 (6 cores × 2 sockets), this forces the cloud-based options to procure more instances than in-house. Additionally, using a commercial database in cloud-based hosting requires the purchase of separate licenses for each virtual core. Considering the prevalent pricing policy of per-socket (instead of per-core) charging for traditional (non-virtualized) servers, this puts the use of a commercial database in the cloud at a relatively high disadvantage. At the same time, it

encourages the use of SaaS database service when using the cloud. This suggests

that a reconsideration of software licensing structures, particularly as applicable

to large-scale parallel machines, may be worthwhile for making cloud-based hosting

more appealing.

5.3.3 Workload Variance and Cloud Elasticity

Our cost analysis so far has not taken into account the variances in workload

intensity. One potential cost benefit of cloud-based hosting comes from the ability to dynamically match resource capacity to the workload at a finer time-scale than in-house hosting. Given high burstiness (i.e., high peak-to-average ratio or

PAR) in many real workloads, it is common in practice to provision close to the

peak. Whereas in-house provisioning must continue this practice, the usage-based

charging and elasticity offered by the cloud open new opportunities for savings (for

both in-cloud and horizontal partitioning). We investigate costs of variance-aware

provisioning for three degrees of burstiness corresponding to time-of-day effects and

flash crowds. Researchers have reported the magnitude of daily fluctuation (ratio

of peak to average) to be 40-50% [112, 113] for social networking applications,

and about 70% (min:40, max:135 tps) for an e-commerce Web site [114]. Flash

crowds can cause orders of magnitude higher peaks than the average and become

a particularly appealing motivation for considering the (perhaps partial) use of the cloud. The logs of the 1998 World Cup have shown a 70-fold increase in web requests due to flash crowds [115]. There have been many efforts to handle flash crowds



Figure 5.8. Effect of workload variance on the cost of in-house hosting for TPC-W. The workload is 100 tps and the growth rate is 20% per year.

for enterprise applications [116, 117, 118]. We represent the workload variance using the peak-to-average ratio (PAR), which we define as max(x_t)/E(x_t), where x_t is the time series representing workload intensity (e.g., number of arrivals/sec).

Borrowing from [114], we choose PAR of 1.54 to represent daily variations and

PAR values of 11 and 51 to represent two flash crowd scenarios (i.e., peak of 10

and 50 times the average, respectively).
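For reference, PAR is trivial to compute from a workload trace; a one-function sketch (ours, not part of the original study) is:

import numpy as np

def peak_to_average_ratio(x):
    # PAR = max(x_t) / E(x_t) for a workload time series x_t (e.g., arrivals/sec).
    x = np.asarray(x, dtype=float)
    return x.max() / x.mean()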

For cloud-based hosting, we assume the following. The cloud provides a mechanism to monitor and detect the occurrence of sudden bursts of workload. The cloud also provides a mechanism to scale out at run time to match changes in the observed workload. These assumptions are safe to make since those features are already supported by mainstream cloud providers. Therefore, regardless of PAR, the cost of cloud-based hosting is theoretically determined by the overall average workload intensity. In the real world, however, since most clouds charge for usage at the granularity of hours, the actual cost can be slightly higher than the theoretical cost.
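The effect of hour-granularity billing can be sketched as follows (an assumption-laden illustration: a single instance class with fixed capacity, billed per whole instance-hour):

import math

def instance_hours_billed(hourly_intensity, capacity_tps):
    # Within each hour the required instance count is rounded up, so the bill
    # slightly exceeds what the average intensity alone would suggest.
    return sum(math.ceil(x / capacity_tps) for x in hourly_intensity)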

Figure 5.8 illustrates the effect of three levels of burstiness on the in-house pro-

visioning cost. We select the case of in-house with medium & increasing workloads

(Figure 5.4(b)). Provisioning for the diurnal fluctuation of 70% (PAR=1.54) does

not impact the cost, whereas flash crowds noticeably increase costs. Provisioning

for PAR=11 shifts the cross-over point with “Fully EC2” from year 2.5 to year 6.5


(See annotations in Figure 5.8). Provisioning for PAR=51 becomes uncompetitive

compared to “Fully EC2” over the entire 10 year period. Note that the effect

of diurnal fluctuation is minuscule. This is because the provisioned servers already have enough capacity to absorb the peak of the diurnal fluctuation.

5.3.4 Horizontal Partitioning

We explore the benefits offered by a horizontal partitioning scheme that sets a

threshold of workload intensity over which we create a replica in the cloud to

handle the excess.

The total cost of horizontal partitioning is the sum of two costs, one from in-house provisioning and the other from the usage of the cloud:

C = C_H(\tau) + C_C(\tau) \qquad (5.2)

where C denotes "cost" and τ is the threshold of workload intensity. The cost of in-house provisioning, C_H(τ), is computed the same way as in previous sections, with the value of τ being the required capacity to provision. In both horizontal partitioning schemes described in this section, the cost of the in-house provisioning part is determined using the same steps. That is:

C_H(\tau) = C_{H/W}(\tau) + C_{S/W}(\tau) + C_{opex}(\tau) \qquad (5.3)

where C_{opex} represents the sum of operational expenses such as electricity and labor costs.

Calculating the cost of cloud usage requires us to assume a suitable distribu-

tion for workload intensities. Since we are interested in observing the effect of workload burstiness on the cost of horizontal partitioning, we select a heavy-tailed distribution, the lognormal distribution. Let us denote the distribution as f(x : θ), where θ denotes the distribution parameters. In order to determine C_C(τ), we first find

two values from the distribution.

• The average workload intensity over all times t where x_t ≥ τ (x_t is the workload at time t). Pictorially, it is the dotted line in Figure 5.9(b).

• The proportion of time where x_t ≥ τ.



(a) Time series x_t (b) PDF f(x)

Figure 5.9. Time series x_t and workload distribution f(x) for a fixed τ.

These two values allow us to calculate how many VM instances are used for

how many hours. The average workload intensity I is given by:

I(\tau, f(x:\theta)) = \sum_{x > \tau} x \cdot \frac{f(x:\theta)}{1 - F(\tau:\theta)} \qquad (5.4)

where f is divided by 1 - F(\tau:\theta) since \sum_{x > \tau} f(x:\theta) = 1 - F(\tau:\theta), as shown in Figure 5.9(b). The proportion of time T where x_t ≥ τ is:

T(\tau, f(x:\theta)) = 1 - F(\tau:\theta) \qquad (5.5)

These I and T are fed into the procedure we used to calculate the "Fully In-cloud" cost. This provides C_C(τ), completing the cost calculation for the total cost C.
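A minimal sketch of this step, computing I and T empirically from samples of the assumed workload distribution rather than from the closed-form CDF, is:

import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1_000_000)  # sampled intensities x_t

def overflow_stats(x, tau):
    # Empirical counterparts of Eqs. 5.4 and 5.5: the average intensity I of
    # the overflow (x_t > tau) and the proportion of time T spent above tau.
    above = x[x > tau]
    T = above.size / x.size                   # Eq. 5.5: 1 - F(tau)
    I = above.mean() if above.size else 0.0   # Eq. 5.4
    return I, T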

Figure 5.10 shows the cost change over a range of thresholds for TPC-W with an average workload of 500 tps. We assume a lognormal (µ: 0, σ: 1) distribution to simulate the bursty traffic. Applying the equations described above, we are able to observe the behavior of costs as a function of the threshold when horizontal partitioning is used. The blue dotted line in Figure 5.10 is the total cost, the sum of two separate costs: the in-house hosting infrastructure and the usage charge of the replica within the cloud. As the threshold moves to higher workload intensity, the in-house cost rises and the cloud cost shrinks. The in-house cost shows a stepwise increasing pattern because the server capacity increases in discrete steps. The cloud cost is shown as a

smooth decreasing function because an application deployed in the cloud can grow


[Plot of cost ($) vs. workload intensity threshold (requests/sec), with curves for the in-house part, the cloud part, and the total. Annotated: optimal threshold at 1100 req/s (cost: $248K); average at 500 req/s (cost: $309K).]

Figure 5.10. Cost behavior of horizontal partitioning as a function of the threshold value. Although not shown in the graph, the cost at PAR=11 (at 5.5K on the x-axis) is $625K.

and shrink to match the current workload intensity. The cloud cost diminishes

since the probability of overflowing the in-house server capacity becomes smaller

as the threshold moves to the higher region. Due to inherent burstiness in the

workload (captured by the heavy-tailedness), the cloud cost does not diminish rapidly. The equilibrium point where the cost becomes minimal is found at the workload intensity of 1100 tps, as marked in Figure 5.10. If the threshold is set to the average, the cost will be 25% higher than the minimum. Also, if horizontal partitioning is not employed and the in-house server is provisioned for a PAR of 11 (not shown in Figure 5.10 because the data point is beyond the

range), then the overall cost becomes 2.5 times higher than the minimum. This

suggests that horizontal partitioning can be effectively used to eliminate the cost

increase from provisioning for the peak.
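The optimal threshold itself can be located by a simple sweep; a sketch (reusing overflow_stats from above, with the two costing routines left as assumed callables) is:

import numpy as np

def optimal_threshold(x, taus, inhouse_cost, cloud_cost):
    # Minimize C = C_H(tau) + C_C(tau) (Eq. 5.2) over candidate thresholds.
    def total(tau):
        I, T = overflow_stats(x, tau)
        return inhouse_cost(tau) + cloud_cost(I, T)
    costs = [total(t) for t in taus]
    best = int(np.argmin(costs))
    return taus[best], costs[best]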

5.3.5 Indirect Cost Components

In this section we study the effect of one representative indirect cost, namely the labor cost. Since indirect costs do not depend entirely on the application under consideration, we are unable to treat them as a single quantity that is a function of application size. Instead, we consider different cases for the labor cost per server, with numbers borrowed from the literature.


Size of Business      Ratio of Staff:Svr    Average $/Svr/hour    Labor Cost for 1 Server/Year
SMB                   1:10                  $3.3                  $6,336
Large Enterprise      1:100                 $.33                  $633
Data Center class     1:1000                $.033                 $63

Table 5.1. Labor cost per server. We differentiate three different ratios of IT staff to servers. The average cost per server per hour (3rd column) is based on an average IT staff salary of $33/h.

In order to assess the impact of labor cost, we need to determine the value

range of the following two metrics.

• Yearly labor cost per server: This is the contribution of labor to the

overall cost for a single server. This allows us to quantify the labor cost as

a function of the system size.

• Ratio of optimal manageable number of instances between VM

and physical server: This value indicates how many VM instances an

IT staff member can manage with the same effort as managing one physical server.

This is used to calculate the labor cost for a given number of cloud-hosted

VM instances.

In a typical large enterprise and in a data center, the ratio of IT staff vs. server is

reported to be 1:100 and 1:1000, respectively [119]. In addition to this, we consider

a case for the small and medium-sized businesses (SMB) having the IT staff vs.

server ratio of 1:10, as shown in the 2nd column of Table 5.1. From here on, we

use the term small, medium and large to refer to SMB, Large Enterprise and Data

Center class, respectively. The average salary of a typical IT staff member was between $61,924 and $66,196 as of 2010 [120]. This translates to about $33 per hour per staff member. Combining this with the ratio of IT staff vs. servers, we are able to determine

the yearly cost of one server as shown in the last column of Table 5.1. As for the

second metric, it is difficult to obtain the actual value and, thus, we approximate

it using the average VM density per physical server of about 7 in a virtualized

cluster environment [121]. Note that this value is a conservative estimate because

the number of cloud-based VM instances an IT staff member can manage is likely to be higher than this, since cloud-based VMs do not require care of the underlying physical

servers.
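The labor-cost arithmetic behind Table 5.1 and the VM adjustment can be sketched as follows (the 1,920 work-hours/year figure is our own inference from the table, i.e., $3.3/server-hour yielding $6,336/server-year):

AVG_STAFF_HOURLY = 33.0       # $/hour, from the salary range in [120]
WORK_HOURS_PER_YEAR = 1920    # inferred: $3.3/server/hour -> $6,336/server/year
VMS_PER_SERVER_EFFORT = 7     # VM density per physical server, from [121]

def labor_cost_per_server_year(servers_per_staff):
    # Yearly labor cost attributed to one in-house server (Table 5.1, last column).
    return AVG_STAFF_HOURLY / servers_per_staff * WORK_HOURS_PER_YEAR

def labor_cost_for_vms(n_vms, servers_per_staff):
    # Cloud side: one physical server's worth of effort covers about 7 VMs.
    return labor_cost_per_server_year(servers_per_staff) * n_vms / VMS_PER_SERVER_EFFORT

# labor_cost_per_server_year(10)   -> 6336.0 (SMB)
# labor_cost_per_server_year(100)  -> 633.6  (Large Enterprise)
# labor_cost_per_server_year(1000) -> 63.36  (Data Center class)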


[Four plots of NPV cost ($) over years 1-8; hosting options as in Figure 5.4: Fully Inhouse, Fully EC2, EC2+RDS, Inhouse+RDS, Inhouse+SQL Azure.]

(a) Without labor cost (b) Large business size


(c) Medium business size (d) Small business size

Figure 5.11. Impact of labor cost for the medium workload intensity using TPC-W. The cases for small and large workload intensities are not shown. For small workloads, cloud-based hosting is always cheaper. For large workloads, in-house hosting is always cheaper.

From Table 5.1 we can expect that the cost of SMB’s in-house hosting will

become higher because it has the highest yearly labor cost per server. However,

the cost of cloud-based hosting also increases as the number of cloud instances

increases. If the rate of increase in cloud instances is sufficiently high, the labor

cost of managing cloud instances can exceed the labor cost of in-house hosting.

Thus, it is not straightforward to anticipate the impact of labor cost on the overall

decision of in-house vs. cloud-based hosting. Figure 5.11 shows the effect of labor

cost for a medium workload (100 tps) of TPC-W. We show only the case for the

medium workload since this is the region where the decision of in-house vs. cloud-



Figure 5.12. (a) Effect of labor costs on the hosting decision in relation to the workload intensity and business size. (b) Stacked view of in-house cost at year 5.

based hosting is reversed. For the case of a small workload (20 tps), the best option is already cloud-based hosting, even without considering the labor cost. Likewise, the best option for a high workload (500 tps) is in-house hosting, and the gap widens

as the business size shrinks from large to small. From Figure 5.11 (b), (c) to (d), it

can be seen that the in-house cost rises rapidly. Although the cost of cloud-based

hosting rises as well, the magnitude is smaller. As a result, the in-house cost loses

competitiveness in (d). Figure 5.12(b) shows the composition of in-house cost at

the 5th year from Figure 5.11 in more detail.

The effect of labor cost on the hosting decision can be summarized as in Fig-

ure 5.12(a). Looking at the y-axis, if the workload is small, cloud-based hosting is

preferable for all business sizes. If the workload is large, in-house hosting is more

economical. However, there is a range of medium workloads in which the host-

ing decision shifts from cloud-based to in-house as the business size grows. In this

range, the higher labor cost of smaller businesses manifests as a higher cost of in-house hosting. Looking at the x-axis, the diagram shows that cloud-based hosting makes more sense for smaller businesses most of the time. In order for a small business to consider the in-house hosting option, the workload has to be significantly

high. For a large enterprise, it is important to analyze the cost of available hosting

options because they are more likely to be on the boundary of the decision spectrum.


5.4 Summary and Conclusion

In this chapter we have studied how to incorporate various cost factors into determining the cost of hosting an application under several cloud-based hosting options as well as in-house hosting. As representative applications, we have used TPC-W and TPC-E, both of which are multi-tiered, web-based service applications that use a database as a back-end. We have presented a classification of costs related to application hosting and explained what horizontal and vertical partitioning are and how they can be used for the cloud-based hosting of a consumer's applications. We have considered several important application characteristics such as workload intensity, growth rate, and workload variance, and the impact of uncertainty in some

of the cost factors. We summarize below the key lessons we have learned from our

study.

• Cloud-based hosting makes sense for applications with workload intensity

and growth rate that are relatively small (the exact range is specific to the

application; we provided illustrative numbers for TPC-W and TPC-E).

• Data transfer (network usage) cost is one of the major costs of using the

cloud and this can make vertical partitioning infeasible for some applica-

tions.

• Costs such as storage and licenses can have a big impact on the choice of the best hosting option.

• Small workload variances such as diurnal patterns are unlikely to impact

the hosting costs, but flash crowds can have a huge impact.

• Indirect costs such as labor costs have a range of values depending on external factors like the size of the organization and the skill level of the staff. We learn that labor cost can render the economic benefit of cloud-based hosting meaningless for smaller organizations.

Complementary to our cost analysis in this chapter is the problem of predicting the

performance an application is likely to experience once it is (partially) migrated

to a certain cloud(s). This forms the subject of our investigation in future work.

Chapter 6

Conclusion and Future Work

In this dissertation, we have addressed the problem of realizing a mature utility computing paradigm from the currently emerging clouds. We have argued that some important functionality is missing from the current cloud computing infrastructure if it is to evolve into a mature utility computing platform.

From the provider's point of view, we have identified that an accurate resource accounting methodology is needed. Such a solution would enable cloud providers to gain a clear understanding of the resource usage incurred on behalf of chargeable entities. We have found that existing methods suffered under various runtime conditions of the shared services and could even lead to incorrect operation of cloud management. The fundamental reason for this shortcoming was that traditional methods of monitoring data collection are inherently ill-suited to the purpose of resource accounting. To this end, we have developed a novel resource accounting technique that is implementable within the hypervisor layer and that operates at the granularity of thread-level monitoring data. One sub-problem that we had to solve was discovering causal, end-to-end dependencies among application components. We have presented the solution and its evaluation in Chapter 3. Our complete resource accounting solution, which incorporates the dependency discovery technique, has been shown to be robust to various system and workload conditions that impacted the traditional approaches. While traditional approaches suffered error rates of more than 100%, our technique consistently maintained an error rate of less than 1% for all tested conditions. The evaluation using two real-world applications revealed that our method was superior to traditional statistical


approaches. In the scenario of response time management via resource allocation

control, statistical methods failed to avoid an SLA violation due to an incorrect decision about which chargeable entity was the dominant resource consumer, whereas our method succeeded in avoiding the SLA violation.

From the consumer's point of view, we have identified that consumers need to understand the complex interactions of the various cost factors related to using cloud services in order to make informed decisions about how to deploy their applications in the cloud. Using well-regarded multi-tiered applications, we have

studied the impact of factors such as workload intensity, burstiness, growth rates,

time of operation, storage and license costs as well as labor costs. We have learned

that cloud-based hosting tends to make sense for a low workload intensity and

low growth rates. Data transfer costs can become one of the major cost factors, and this can render the hybrid hosting option of using in-house and cloud resources together very uneconomical.

Future Work: There are many interesting directions for future work on both the provider and consumer sides of the dissertation topic. On the provider side, our solution to the resource accounting problem can be used to address the question of performance interference among VMs or application components. Since our solution provides an accurate picture of how many resources are consumed, we are now able to predict what the combined resource usage would be if several components were combined and serviced by the same shared services. The challenge would be to develop a sound technique that could translate this total combined resource usage into expected performance numbers. Accurate prediction of performance (or of the range of possible performance) would be an invaluable tool for optimizing cloud management. On the consumer side, on the other hand, we can consider the research direction of predicting what the performance of a target application would be if it were deployed into a specific cloud. Since there is a variety of cloud services with different pricing and characteristics, information about the expected performance of a consumer's application is important in order to make an optimal decision about which cloud to use and how. Deploying the application directly to multiple target clouds for performance testing is prohibitive because such a task is often time-consuming, if not impossible, and error-prone. We need a technique that enables us to accurately estimate the performance without having to actually deploy the application every time. An additional requirement is that, since the best choice of cloud could change over time, the method must be lightweight so that it can be applied repeatedly.

Bibliography

[1] “Amazon Elastic Compute Cloud (EC2),” http://www.amazon.com/ec2/.

[2] “Amazon Simple Storage Service (S3),” http://aws.amazon.com/s3/.

[3] “Sun Grid,” http://www.sun.com/service/sungrid/.

[4] “The Rackspace Cloud,” http://www.rackspacecloud.com.

[5] “Windows Azure Platform,” http://www.microsoft.com/azure/.

[6] “Google App Engine,” http://code.google.com/appengine.

[7] Hamm, S. (2008), “Confusion Over Cloud Computing,” Business Week.

[8] (2009), "A NIST Notional Definition of Cloud Computing," csrc.nist.gov/groups/SNS/cloud-computing/cloud-def-v14.doc.

[9] (2009), "Google Trends: Cloud Computing Surpasses Virtualization in Popularity," http://www.elasticvapor.com/2009/04/google-trends-cloud-computing-surpasses.html.

[10] Horrigan, J. (2008), "Use of Cloud Computing Applications and Services," Pew Internet and American Life Project.

[11] Rappa, M. A. (2004) "The utility business model and the future of computing services," IBM Syst. J., 43, pp. 32–42. URL http://dx.doi.org/10.1147/sj.431.0032

[12] Eilam, T., K. Appleby, J. Breh, G. Breiter, H. Daur, S. A. Fakhouri, G. D. H. Hunt, T. Lu, S. D. Miller, L. B. Mummert, J. A. Pershing, and H. Wagner (2004) "Using a utility computing framework to develop utility systems," IBM Syst. J., 43, pp. 97–120. URL http://dx.doi.org/10.1147/sj.431.0097


[13] Paleologo, G. A. (2004) "Price-at-Risk: A methodology for pricing utility computing services," IBM Syst. J., 43, pp. 20–31. URL http://dx.doi.org/10.1147/sj.431.0020

[14] Banga, G., P. Druschel, and J. C. Mogul (1999) "Resource containers: a new facility for resource management in server systems," in Proceedings of the third symposium on Operating systems design and implementation, OSDI '99, USENIX Association, Berkeley, CA, USA, pp. 45–58. URL http://portal.acm.org/citation.cfm?id=296806.296810

[15] Aguilera, M. K., J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen (2003) "Performance debugging for distributed systems of black boxes," in SOSP'03: Proceedings of the 19th Symposium on Operating Systems Principles, ACM, New York, NY, USA, pp. 74–89.

[16] Anandkumar, A., C. Bisdikian, and D. Agrawal (2008) "Tracking in a spaghetti bowl: monitoring transactions using footprints," in SIGMETRICS '08: Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, ACM, New York, NY, USA, pp. 133–144.

[17] Barham, P., A. Donnelly, R. Isaacs, and R. Mortier (2004) "Using Magpie for Request Extraction and Workload Modeling," in OSDI'04: Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation, USENIX Association, Berkeley, CA, USA.

[18] Chen, M. Y., A. Accardi, E. Kiciman, J. Lloyd, D. Patterson, A. Fox, and E. Brewer (2004) "Path-based failure and evolution management," in NSDI'04: Proceedings of the 1st conference on Networked Systems Design and Implementation, USENIX Association, Berkeley, CA, USA.

[19] Chen, M. Y., E. Kiciman, E. Fratkin, A. Fox, and E. Brewer (2002) "Pinpoint: Problem Determination in Large, Dynamic Internet Services," in DSN '02: Proceedings of the 2002 International Conference on Dependable Systems and Networks, IEEE Computer Society, Washington, DC, USA, pp. 595–604.

[20] Reynolds, P., C. Killian, J. L. Wiener, J. C. Mogul, M. A. Shah, and A. Vahdat (2006) "Pip: detecting the unexpected in distributed systems," in NSDI'06: Proceedings of the 3rd conference on Networked Systems Design & Implementation, USENIX Association, Berkeley, CA, USA.

[21] Reynolds, P., J. L. Wiener, J. C. Mogul, M. K. Aguilera, and A. Vahdat (2006) "WAP5: black-box performance debugging for wide-area systems," in WWW '06: Proceedings of the 15th international conference on World Wide Web, ACM, New York, NY, USA, pp. 347–356.

[22] Sengupta, B. and N. Banerjee (2008) "Tracking Transaction Footprints for Non-intrusive End-to-End Monitoring," Autonomic Computing, International Conference on, 0, pp. 109–118.

[23] Thereska, E., B. Salmon, J. Strunk, M. Wachs, M. Abd-El-Malek, J. Lopez, and G. R. Ganger (2006) "Stardust: tracking activity in a distributed storage system," in SIGMETRICS '06/Performance '06: Proceedings of the joint international conference on Measurement and modeling of computer systems, ACM, New York, NY, USA.

[24] Wang, T., C. Shing Perng, T. Tao, C. Tang, E. So, C. Zhang, R. Chang, and L. Liu (2008) "A Temporal Data-Mining Approach for Discovering End-to-End Transaction Flows," in 2008 IEEE International Conference on Web Services (ICWS08), Beijing, China.

[25] Yumerefendi, A. and J. Chase (2004) "Trust but Verify: Accountability for Internet Services," in Proceedings of the Eleventh ACM SIGOPS European Workshop.

[26] (2006), "Time of Use Electricity Billing: How Puget Sound Energy Reduced Peak Power Demands (Case Study)," http://energypriorities.com/entries/2006/02/pse_tou_amr_case.php.

[27] "Open Cirrus the HP/Intel/Yahoo! Open Cloud Computing Research Testbed," http://opencirrus.org.

[28] Benani, M. and D. Menasce (2005) "Resource Allocation for Autonomic Data Centers Using Analytic Performance Models," in Proceedings of IEEE International Conference on Autonomic Computing, Seattle (ICAC-05).

[29] Chase, J. and R. Doyle (2001) "Balance of Power: Energy Management for Server Clusters," in Proceedings of the Eighth Workshop on Hot Topics in Operating Systems (HotOS-VIII), Elmau, Germany.

[30] Chen, Y., A. Das, W. Qin, A. Sivasubramaniam, Q. Wang, and N. Gautam (2005) "Managing Server Energy and Operational Costs in Hosting Centers," in Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2005), Banff, Canada, June 2005.

[31] Urgaonkar, B., P. Shenoy, and T. Roscoe (2002) "Resource Overbooking and Application Profiling in Shared Hosting Platforms," in Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Boston.

[32] Waldspurger, C. (2002) "Memory Resource Management in VMWare ESX Server," in Proceedings of the Fifth Symposium on Operating System Design and Implementation (OSDI'02).

[33] Xu, M. and C. Xu (2004) "Decay Function Model for Resource Configuration and Adaptive Allocation on Internet Servers," in Proceedings of the Twelfth International Workshop on Quality-of-Service (IWQoS 2004).

[34] Amazon Relational Database Service, http://aws.amazon.com/rds/.

[35] Apache HBase, http://hbase.apache.org/.

[36] Chang, F., J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber (2006) "Bigtable: a distributed storage system for structured data," in Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7, USENIX Association, Berkeley, CA, USA, pp. 15–15. URL http://portal.acm.org/citation.cfm?id=1267308.1267323

[37] Amazon SimpleDB, http://aws.amazon.com/simpledb/.

[38] Abdelzaher, T., K. G. Shin, and N. Bhatti (2002) "Performance Guarantees for Web Server End-Systems: A Control-Theoretical Approach," IEEE Transactions on Parallel and Distributed Systems, 13(1).

[39] Appleby, K., S. Fakhouri, L. Fong, G. Goldszmidt, M. Kalantar, S. Krishnakumar, D. Pazel, J. Pershing, and B. Rochwerger (2001) "Oceano-SLA Based Management of a Computing Utility," in Proceedings of the IEEE/IFIP Integrated Network Management.

[40] Doyle, R., J. Chase, O. Asad, W. Jin, and A. Vahdat (2003) "Model-Based Resource Provisioning in a Web Service Utility," in Proceedings of the Fourth USITS.

[41] Slothouber, L. (1996) "A Model of Web Server Performance," in Proceedings of the Fifth International World Wide Web Conference.

[42] Urgaonkar, B., G. Pacifici, P. Shenoy, M. Spreitzer, and A. Tantawi (2005) "An Analytical Model for Multi-tier Internet Services and its Applications," in Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2005), Banff, Canada.


[43] Urgaonkar, B. and P. Shenoy (2004) "Sharc: Managing CPU and Network Bandwidth in Shared Clusters," 15(1), pp. 2–17.

[44] Lee, S. C. M. and J. C. S. Lui (2008) "On the Interaction and Competition among Internet Service Providers," IEEE Journal on Selected Areas in Communications, 26(7), pp. 1277–1283.

[45] Shakkottai, S. and R. Srikant (2006) "Economics of network pricing with multiple ISPs," IEEE/ACM Trans. Netw., 14, pp. 1233–1245. URL http://dx.doi.org/10.1109/TNET.2006.886393

[46] Amazon EC2 Spot Instances, http://aws.amazon.com/ec2/spot-instances/.

[47] Weissel, A. and F. Bellosa (2004) "Dynamic Thermal Management for Distributed Systems," in Proceedings of the First Workshop on Temperature-Aware Computer Systems (TACS'04), Munich, Germany.

[48] John Levon, “Oprofile,” http://oprofile.sourceforge.net/credits/.

[49] Bhatia, S., A. Kumar, M. E. Fiuczynski, and L. Peterson (2008) "Lightweight, high-resolution monitoring for troubleshooting production systems," in Proceedings of the 8th USENIX conference on Operating systems design and implementation, OSDI'08, USENIX Association, Berkeley, CA, USA, pp. 103–116. URL http://portal.acm.org/citation.cfm?id=1855741.1855749

[50] “Kprobes,” http://sourceware.org/systemtap/kprobes/.

[51] Quynh, N. A. and K. Suzaki "Xenprobes, a lightweight user-space probing framework for Xen virtual machine," in 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference, Berkeley, CA, USA, pp. 2:1–2:14. URL http://dl.acm.org/citation.cfm?id=1364385.1364387

[52] Johnson, M. W. (1998) "Monitoring and Diagnosing Application Response Time with ARM," in Proceedings of the IEEE Third International Workshop on Systems Management, IEEE Computer Society, Washington, DC, USA, pp. 4–. URL http://portal.acm.org/citation.cfm?id=829512.830401

[53] Wood, T., L. Cherkasova, K. Ozonat, and P. Shenoy (2008) "Profiling and modeling resource usage of virtualized applications," in Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware, Middleware '08, Springer-Verlag New York, Inc., New York, NY, USA, pp. 366–387. URL http://portal.acm.org/citation.cfm?id=1496950.1496973

[54] Zhang, Q., L. Cherkasova, G. Mathews, W. Greene, and E. Smirni (2007) “R-Capriccio: a capacity planning and anomaly detection tool for enterprise services with live workloads,” in Proceedings of the ACM/IFIP/USENIX 2007 International Conference on Middleware, Middleware ’07, Springer-Verlag New York, Inc., New York, NY, USA, pp. 244–265. URL http://portal.acm.org/citation.cfm?id=1516124.1516142

[55] Gupta, D., L. Cherkasova, R. Gardner, and A. Vahdat (2006) “Enforcing performance isolation across virtual machines in Xen,” in Proceedings of the ACM/IFIP/USENIX 2006 International Conference on Middleware, Middleware ’06, Springer-Verlag New York, Inc., New York, NY, USA, pp. 342–362. URL http://portal.acm.org/citation.cfm?id=1515984.1516011

[56] Ren, G., E. Tune, T. Moseley, Y. Shi, S. Rus, and R. Hundt (2010) “Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers,” IEEE Micro, 30, pp. 65–79. URL http://dx.doi.org/10.1109/MM.2010.68

[57] Stewart, C., T. Kelly, and A. Zhang (2007) “Exploiting nonstationarity for performance prediction,” in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys ’07, ACM, New York, NY, USA, pp. 31–44. URL http://doi.acm.org/10.1145/1272996.1273002

[58] Cohen, I., M. Goldszmidt, T. Kelly, J. Symons, and J. S. Chase (2004) “Correlating instrumentation data to system states: a building block for automated diagnosis and control,” in Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6, USENIX Association, Berkeley, CA, USA, pp. 16–16. URL http://portal.acm.org/citation.cfm?id=1251254.1251270

[59] Gray, J. (2008) “Distributed Computing Economics,” Queue, 6, pp. 63–68. URL http://doi.acm.org/10.1145/1394127.1394131

[60] Thanos, G., C. Courcoubetis, and G. Stamoulis (2007) “Adopting the Grid for Business Purposes: The Main Objectives and the Associated Economic Issues,” in Grid Economics and Business Models (D. Veit and J. Altmann, eds.), vol. 4685 of Lecture Notes in Computer Science, pp. 1–15.

[61] Kenyon, C. and G. Cheliotis (2004) “Grid resource management,” chap. Grid resource commercialization: economic engineering and delivery scenarios, Kluwer Academic Publishers, Norwell, MA, USA, pp. 465–478. URL http://portal.acm.org/citation.cfm?id=976113.976142

[62] Cheliotis, G. and C. Kenyon (2003) “Autonomic Economics: Why Self-Managed e-Business Systems Will Talk Money,” in IEEE Conference on E-Commerce (CEC’03). URL http://www.zurich.ibm.com/grideconomics/refs.html

[63] Armbrust, M., A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia (2009) “Above the Clouds: A Berkeley View of Cloud Computing,” Tech. rep. UCB/EECS-2009-28, University of California, Berkeley. URL http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.html

[64] Walker, E. (2009) “The Real Cost of a CPU Hour,” Computer, 42. URL http://portal.acm.org/citation.cfm?id=1550393.1550432

[65] Walker, E., W. Brisken, and J. Romney (2010) “To Lease or Not to Lease from Storage Clouds,” Computer, 43, pp. 44–50. URL http://dx.doi.org/10.1109/MC.2010.115

[66] Harms, R. and M. Yamartino (2010) The Economics of the Cloud, Tech. rep., Microsoft.

[67] Teregowda, P., B. Urgaonkar, and C. L. Giles (2010) “Cloud Computing: A Digital Libraries Perspective,” in Proceedings of the 2010 IEEE 3rd International Conference on Cloud Computing, CLOUD ’10, IEEE Computer Society, Washington, DC, USA, pp. 115–122. URL http://dx.doi.org/10.1109/CLOUD.2010.49

[68] Teregowda, P., B. Urgaonkar, and L. Giles (2010) “CiteSeerX: A Cloud Perspective,” in Proceedings of the Second USENIX Workshop on Hot Topics in Cloud Computing.

[69] Klems, M., J. Nimis, and S. Tai (2009) “Do Clouds Compute? A Framework for Estimating the Value of Cloud Computing,” in Designing E-Business Systems. Markets, Services, and Networks, vol. 22 of Lecture Notes in Business Information Processing, Springer Berlin Heidelberg, pp. 110–123.

[70] Wang, H., Q. Jing, R. Chen, B. He, Z. Qian, and L. Zhou (2010) “Distributed systems meet economics: pricing in the cloud,” in Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, HotCloud’10, USENIX Association, Berkeley, CA, USA, pp. 6–6. URL http://portal.acm.org/citation.cfm?id=1863103.1863109

[71] Chohan, N., C. Castillo, M. Spreitzer, M. Steinder, A. Tantawi, and C. Krintz (2010) “See spot run: using spot instances for mapreduce workflows,” in Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, HotCloud’10, USENIX Association, Berkeley, CA, USA, pp. 7–7. URL http://portal.acm.org/citation.cfm?id=1863103.1863110

[72] Campbell, R., I. Gupta, M. Heath, S. Y. Ko, M. Kozuch, M. Kunze, T. Kwan, K. Lai, H. Y. Lee, M. Lyons, D. Milojicic, D. O’Hallaron, and Y. C. Soh (2009) “Open Cirrus™ cloud computing testbed: federated data centers for open source systems and services research,” in Proceedings of the 2009 conference on Hot topics in cloud computing, HotCloud’09, USENIX Association, Berkeley, CA, USA, pp. 1–1. URL http://portal.acm.org/citation.cfm?id=1855533.1855534

[73] “AWS Simple Monthly Calculator,” http://calculator.s3.amazonaws.com/calc5.html.

[74] “Cloud Price Calculator,” http://cloudpricecalculator.com/.

[75] Shi, Z., H. Tang, and Y. Tang (2005) “Blind source separation of more sources than mixtures using sparse mixture models,” Pattern Recogn. Lett., 26, pp. 2491–2499. URL http://dx.doi.org/10.1016/j.patrec.2005.05.006

[76] Barham, P., B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield (2003) “Xen and the Art of Virtualization,” in Proceedings of the 19th Symposium on Operating Systems Principles (SOSP).

[77] Amazon Elastic Compute Cloud, http://aws.amazon.com/ec2/.

[78] Welsh, M., D. Culler, and E. Brewer (2001) “SEDA: an architecture for well-conditioned, scalable internet services,” SIGOPS Oper. Syst. Rev., 35(5), pp. 230–243.

[79] Pai, V. S., P. Druschel, and W. Zwaenepoel (1999) “Flash: an efficient and portable web server,” in ATEC ’99: Proceedings of USENIX Annual Technical Conference, USENIX Association, Berkeley, CA, USA.

[80] Ruan, Y. and V. Pai (2004) “Making the box transparent: system call performance as a first-class result,” in Proceedings of the USENIX Annual Technical Conference 2004, USENIX Association, Berkeley, CA, USA.

[81] ——— (2006) “Understanding and Addressing Blocking-Induced Network Server Latency,” in Proceedings of the USENIX Annual Technical Conference 2006, USENIX Association, Berkeley, CA, USA.

[82] Behren, R. V., J. Condit, and E. Brewer (2003) “Why Events Are A Bad Idea (for high-concurrency servers),” in Proceedings of HotOS IX.

[83] Smith, W., “TPC-W: Benchmarking An Ecommerce Solution,” http://www.tpc.org/information/other/techarticles.asp.

[84] “NYU TPC-W,” http://www.cs.nyu.edu/pdsg/.

[85] “The JBoss Application Server,” http://www.jboss.org.

[86] “MySQL,” http://www.mysql.com.

[87] “RUBiS,” http://rubis.objectweb.org/.

[88] MediaWiki, http://www.mediawiki.org.

[89] Yu, H., J. Moreira, P. Dube, I. Chung, and L. Zhang (2007) “Performance Studies of a WebSphere Application, Trade, in Scale-out and Scale-up Environments,” in Third International Workshop on System Management Techniques, Processes, and Services (SMTPS), IPDPS.

[90] Wayne Walter Berry and Vitor Tomaz, “Inside SQL Azure,” http://social.technet.microsoft.com/wiki/contents/articles/1695.inside-sql-azure.aspx.

[91] Tak, B. C., C. Tang, C. Zhang, S. Govindan, B. Urgaonkar, and R. N. Chang “vPath: precise discovery of request processing paths from black-box observations of thread and network activities,” in Proceedings of the 2009 conference on USENIX Annual technical conference, USENIX’09, Berkeley, CA, USA, pp. 19–19. URL http://dl.acm.org/citation.cfm?id=1855807.1855826

[92] Windows Azure, http://www.microsoft.com/windowsazure/windowsazure/.

[93] Microsoft SQL Azure, http://www.microsoft.com/en-us/sqlazure/default.aspx.

[94] Barham, P., A. Donnelly, R. Isaacs, and R. Mortier (2004) “Using Magpie for request extraction and workload modelling,” in OSDI’04: Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation, USENIX Association, Berkeley, CA, USA.

[95] Barham, P., B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield (2003) “Xen and the art of virtualization,” in SOSP ’03: Proceedings of the 19th ACM Symposium on Operating Systems Principles, ACM, New York, NY, USA, pp. 164–177.

[96] Gupta, D., K. Yocum, M. McNett, A. C. Snoeren, A. Vahdat, and G. M. Voelker (2006) “To infinity and beyond: time-warped network emulation,” in Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3, NSDI’06, USENIX Association, Berkeley, CA, USA, pp. 7–7. URL http://dl.acm.org/citation.cfm?id=1267680.1267687

[97] Wei, D. X., C. Jin, S. H. Low, and S. Hegde (2006) “FAST TCP: motivation, architecture, algorithms, performance,” IEEE/ACM Trans. Netw., 14(6), pp. 1246–1259.

[98] Gulati, A., I. Ahmad, and C. A. Waldspurger (2009) “PARDA: proportional allocation of resources for distributed storage access,” in FAST ’09: Proceedings of the 7th conference on File and storage technologies, USENIX Association, Berkeley, CA, USA, pp. 85–98.

[99] “HDFS,” http://hadoop.apache.org/hdfs/.

[100] Hunt, P., M. Konar, F. P. Junqueira, and B. Reed (2010) “ZooKeeper: wait-free coordination for internet-scale systems,” in Proceedings of the 2010 USENIX conference on USENIX annual technical conference, USENIXATC’10, USENIX Association, Berkeley, CA, USA, pp. 11–11. URL http://dl.acm.org/citation.cfm?id=1855840.1855851

[101] Cooper, B. F., A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears (2010) “Benchmarking cloud serving systems with YCSB,” in Proceedings of the 1st ACM symposium on Cloud computing, SoCC ’10, ACM, New York, NY, USA, pp. 143–154. URL http://doi.acm.org/10.1145/1807128.1807152

[102] Li, A., X. Yang, S. Kandula, and M. Zhang (2010) “CloudCmp: comparing public cloud providers,” in Proceedings of the 10th annual conference on Internet measurement, New York, NY, USA. URL http://doi.acm.org/10.1145/1879141.1879143

[103] Johnson, R. W. and W. G. Lewellen (1972) “Analysis of the Lease-or-Buy Decision,” Journal of Finance, 27(4), pp. 815–23. URL http://ideas.repec.org/a/bla/jfinan/v27y1972i4p815-23.html

[104] Hajjat, M., X. Sun, Y.-W. E. Sung, D. Maltz, S. Rao, K. Sripanidkulchai, and M. Tawarmalani “Cloudward bound: planning for beneficial migration of enterprise applications to the cloud,” in Proceedings of the ACM SIGCOMM 2010 conference. URL http://doi.acm.org/10.1145/1851182.1851212

[105] Windows Azure Lessons Learned, http://channel9.msdn.com/Blogs/benriga/.

[106] Ipek, E., S. A. McKee, R. Caruana, B. R. de Supinski, and M. Schulz (2006) “Efficiently exploring architectural design spaces via predictive modeling,” in Proceedings of the 12th international conference on Architectural support for programming languages and operating systems, ASPLOS-XII, ACM, New York, NY, USA, pp. 195–206. URL http://doi.acm.org/10.1145/1168857.1168882

[107] Lee, B. C. and D. M. Brooks (2006) “Accurate and efficient regression modeling for microarchitectural performance and power prediction,” SIGOPS Oper. Syst. Rev., 40, pp. 185–194. URL http://doi.acm.org/10.1145/1168917.1168881

[108] Stewart, C., T. Kelly, A. Zhang, and K. Shen (2008) “A dollar from 15 cents: cross-platform management for internet services,” in Proceedings of the USENIX 2008 Annual Technical Conference, USENIX Association, Berkeley, CA, USA, pp. 199–212. URL http://portal.acm.org/citation.cfm?id=1404014.1404029

[109] Xu, J., M. Zhao, J. Fortes, R. Carpenter, and M. Yousif (2008) “Autonomic resource management in virtualized data centers using fuzzy logic-based approaches,” Cluster Computing, 11, pp. 213–227. URL http://portal.acm.org/citation.cfm?id=1395064.1395067

[110] Kundu, S., R. Rangaswami, K. Dutta, and M. Zhao (2010) “Application performance modeling in a virtualized environment,” in HPCA ’10, pp. 1–10.

[111] Kossmann, D., T. Kraska, and S. Loesing (2010) “An evaluation of alternative architectures for transaction processing in the cloud,” in Proceedings of the 2010 international conference on Management of data, SIGMOD ’10, ACM, New York, NY, USA.

[112] Chen, G., W. He, J. Liu, S. Nath, L. Rigas, L. Xiao, and F. Zhao (2008) “Energy-aware server provisioning and load dispatching for connection-intensive internet services,” in Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, NSDI’08, USENIX Association. URL http://portal.acm.org/citation.cfm?id=1387589.1387613

[113] Guha, S., N. Daswani, and R. Jain (2006) “An Experimental Study of the Skype Peer-to-Peer VoIP System,” in Proceedings of the 5th International Workshop on Peer-to-Peer Systems, 2006.

[114] Wang, Q., D. Makaroff, H. K. Edwards, and R. Thompson (2003) “Workload characterization for an E-commerce web site,” in Proceedings of the 2003 conference of the Centre for Advanced Studies on Collaborative research, CASCON ’03, IBM Press. URL http://portal.acm.org/citation.cfm?id=961322.961372

[115] Arlitt, M. and T. Jin (1999) Workload Characterization of the 1998 World Cup Web Site, Tech. rep., IEEE Network.

[116] Chen, X. and J. Heidemann (2005) “Flash crowd mitigation via adaptive admission control based on application-level observations,” ACM Trans. Internet Technol., 5, pp. 532–569. URL http://doi.acm.org/10.1145/1084772.1084776

[117] Ramamurthy, P., V. Sekar, A. Akella, B. Krishnamurthy, and A. Shaikh (2008) “Remote profiling of resource constraints of web servers using mini-flash crowds,” in Proceedings of the USENIX 2008 Annual Technical Conference, USENIX Association, Berkeley, CA, USA, pp. 185–198. URL http://portal.acm.org/citation.cfm?id=1404014.1404028

[118] Urgaonkar, B., P. Shenoy, A. Chandra, P. Goyal, and T. Wood (2008) “Agile dynamic provisioning of multi-tier Internet applications,” ACM Trans. Auton. Adapt. Syst., 3, pp. 1:1–1:39. URL http://doi.acm.org/10.1145/1342171.1342172

[119] Greenberg, A., J. Hamilton, D. A. Maltz, and P. Patel (2008) “The cost of a cloud: research problems in data center networks,” SIGCOMM Comput. Commun. Rev., 39, pp. 68–73. URL http://doi.acm.org/10.1145/1496091.1496103

[120] 2010 IT Salary Survey, http://ejobdescription.com/Salary.htm#epm1_1.

[121] Microsoft War on Cost Study, http://download.microsoft.com/download/1/F/8/1F8BD4EF-31CC-4059-9A65-4A51B3B4BC98/Hyper-V-vs-VMware-ESX-and-vShpere-WP.pdf.

Vita

Byung Chul Tak

Byung Chul Tak is a Ph.D. candidate in the Department of Computer Science and Engineering at The Pennsylvania State University. He entered the doctoral program in September 2006. He received his BS in computer science from Yonsei University, Korea, in 2000, and his MS in computer science from KAIST, Korea, in 2003. From 2004 to 2006, he was a research engineer in the embedded software center at ETRI (Electronics and Telecommunications Research Institute), the national research laboratory in Korea. His research interests include virtual machines, cloud computing, operating systems, and distributed systems. He has coauthored papers published in USENIX ATC, HotCloud, ICDCS, and ISPASS. He spent three summers at IBM Research, Hawthorne, as a research intern.