
Main Effects Screening: A Distributed Continuous Quality Assurance Process for Monitoring Performance Degradation in Evolving Software Systems

Cemal Yilmaz, Arvind S. Krishna, Atif Memon, Adam Porter, Douglas C. Schmidt, Aniruddha Gokhale, Balachandran Natarajan

Dept. of Computer Science, University of Maryland, College Park, MD 20742
Dept. of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37235

Abstract

Developers of highly configurable performance-intensive software systems often use a type of in-house performance-oriented “regression testing” to ensure that their modifications have not adversely affected their software’s performance across its large configuration space. Unfortunately, time and resource constraints often limit developers to in-house testing of a small number of configurations and unreliable extrapolation from these results to the entire configuration space, which allows many performance bottlenecks and sources of QoS degradation to escape detection until systems are fielded. To improve performance assessment of evolving systems across large configuration spaces, we have developed a distributed continuous quality assurance (DCQA) process called main effects screening that uses in-the-field resources to execute formally designed experiments to help reduce the configuration space, thereby allowing developers to perform more targeted in-house QA. We have evaluated this process via several feasibility studies on several large, widely-used performance-intensive software systems. Our results indicate that main effects screening can detect key sources of performance degradation in large-scale systems with significantly less effort than conventional techniques.

1. Introduction

The quality of performance-intensive software systems, such as high-performance scientific computing systems and distributed real-time and embedded (DRE) systems, depends heavily on their infrastructure platforms, such as the hardware, operating system, middleware, and language processing tools. Developers of such systems often need to tune the infrastructure and their software applications to accommodate the (often changing) platform environments and performance requirements. This tuning is commonly done by (re)adjusting a large set (10’s-100’s) of compile- and run-time configuration options that record and control variable software parameters, such as different operating systems, resource management strategies, middleware and application feature sets; compiler flags; and/or run-time optimization settings. For example, SQL Server 7.0 has 47 configuration options, Oracle 9 has 211 initialization parameters, and Apache HTTP Server Version 1.3 has 85 core configuration options.

Although these software parameters promote flexibility and portability, they also require that the software be tested in an enormous number of configurations. This creates serious challenges for developers who must ensure that their decisions, additions, and modifications work across this large (and often changing) configuration space:

• Settings that maximize performance for a particular platform/context may not be suitable for different ones and certain groups of option settings may be semantically invalid due to subtle dependencies between options.

• Limited QA budgets and rapidly changing code bases mean that developers’ QA efforts are often limited to just a few software configurations, forcing them to extrapolate their findings to the entire configuration space.

• The configurations that are tested are often selected in an ad hoc manner, so quality is not evaluated systematically and many quality problems escape detection until systems are fielded.

Since exhaustively testing all configurations is infeasible under the circumstances listed above, developers need a quick way to estimate how their changes and decisions affect software performance across the entire configuration space. To do this, we have developed and evaluated a new hybrid (i.e., partially in-the-field and partially in-house) distributed continuous quality assurance (DCQA) process that improves software quality iteratively, opportunistically, and efficiently by executing QA tasks continuously across a grid of computing resources provided by end-users and distributed development teams.

In prior work [12], we created a prototype DCQA support environment called Skoll that helps developers create, execute, and analyze their own DCQA processes, as described in Section 2. To make it easier to implement DCQA processes, we also integrated model-based software development tools with Skoll, which help developers capture the variant and invariant parts of DCQA processes and the software systems they are applied to within high-level models that can be processed to automatically generate configuration files and other supporting code artifacts [7]. Some model-based tools integrated with Skoll include the Options Configuration Modeling language (OCML) [14] that models configuration options and inter-option constraints and the Benchmark Generation Modeling Language (BGML) [8] that composes benchmarking experiments to observe QoS behavior under different configurations and workloads.

This paper extends our earlier work by developing a new model-based hybrid DCQA process that leverages the extensive (albeit less dedicated) in-the-field computing resources provided by the Skoll grid, weeding out unimportant options to reduce the configuration space, thereby allowing developers to perform more targeted QA using their very limited (but dedicated) in-house resources. This hybrid DCQA process first runs formally designed experiments across the Skoll grid to identify a subset of important performance-related configuration options. Whenever the software changes thereafter, this process exhaustively explores all configurations of the important options using in-house computing resources to estimate system performance across the entire configuration space. This hybrid approach is feasible because the new configuration space is much smaller than the original, and hence more tractable using in-house resources.

This paper presents an evaluation of our new hybrid DCQA process on ACE, TAO, and CIAO (deuce.doc.wustl.edu/Download.html), which are widely-used, production-quality, performance-intensive middleware frameworks. Our results indicate that (1) hybrid model-based DCQA tools and processes can correctly identify the subset of options that are important to system performance, (2) monitoring only these selected options helps to quickly detect key sources of performance degradation at an acceptable level of effort, and (3) alternative strategies with equivalent effort give less reliable results.

2. The Model-based Skoll DCQA Environment

To improve the quality of performance-intensive software across large configuration spaces, we are exploring distributed continuous quality assurance (DCQA) processes [12] that evaluate various software qualities, such as portability, performance characteristics, and functional correctness, “around-the-world, around-the-clock.” To accomplish this, DCQA processes are divided into multiple subtasks, such as running regression tests on a particular system configuration, evaluating system response time under different input workloads, or measuring usage errors for a system with several alternative GUI designs. As illustrated in Figure 1, these subtasks are then intelligently and continuously distributed to – and executed by – clients across a grid of computing resources contributed largely by end-users and distributed development teams. The results of these evaluations are returned to servers at central collection sites, where they are fused together to guide subsequent iterations of the DCQA processes.

To support this effort we have developed Skoll, a model-based DCQA environment described at www.cs.umd.edu/projects/skoll. For completeness, this section describes some of Skoll’s components and services, which include languages for modeling system configurations and their constraints, algorithms for scheduling and remotely executing tasks, and planning technology that analyzes subtask results and adapts the DCQA process in real time.

Figure 1. The Skoll Architecture (Skoll client groups and a central collection site exchanging DCQA models – configuration model, automatic characterization, adaptation strategies, ISA – and configuration artifacts generated from the models)

The cornerstone of Skoll is its formal model of a DCQA process’s configuration space, which captures different configuration options and their settings. Since in practice not all combinations of options make sense (e.g., feature X may not be supported on operating system Y), we define inter-option constraints that limit the setting of one option based on the settings of others. A valid configuration is one that violates no inter-option constraints (for the feasibility study in Section 4, we used the OCML modeling tool to visually define the configuration model and to generate the low-level formats used by other Skoll components). Skoll uses this configuration space model to help plan global QA processes, adapt these processes dynamically, and aid in analyzing and interpreting results from various types of functional and performance regression tests.
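To make this idea concrete, the following is a minimal sketch (in Python, with hypothetical option names and constraints that are ours, not Skoll’s or OCML’s) of how a configuration space with inter-option constraints can be represented and enumerated:

```python
from itertools import product

# Hypothetical configuration model: each option maps to its legal settings.
options = {
    "OS": ["linux", "win32"],
    "FeatureX": ["on", "off"],
    "Concurrency": ["reactive", "thread-per-connection"],
}

# Inter-option constraints: predicates every valid configuration must satisfy.
# Example constraint: feature X is not supported on win32.
constraints = [
    lambda cfg: not (cfg["OS"] == "win32" and cfg["FeatureX"] == "on"),
]

def valid_configurations(options, constraints):
    """Enumerate all configurations that violate no inter-option constraint."""
    names = list(options)
    for settings in product(*(options[n] for n in names)):
        cfg = dict(zip(names, settings))
        if all(check(cfg) for check in constraints):
            yield cfg

for cfg in valid_configurations(options, constraints):
    print(cfg)
```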

Since the configuration spaces of performance-intensive software can be quite large, Skoll has an Intelligent Steering Agent (ISA) that uses AI planning techniques to control DCQA processes by deciding which valid configuration to allocate to each incoming Skoll client request. When a client is available, the ISA decides which subtask to assign it by considering many factors, including (1) the configuration model, which characterizes the subtasks that can legally be assigned, (2) the results of previous subtasks, which capture what tasks have already been done and whether the results were successful, (3) global process goals, such as testing popular configurations more than rarely used ones or testing recently changed features more heavily than unchanged features, and (4) client characteristics and preferences, e.g., the selected configuration must be compatible with the OS running on the client machine or configurations must run with user-level – rather than superuser-level – protection modes.


After a valid configuration is chosen, the ISA packages the corresponding QA subtask into a job configuration, which consists of the code artifacts, configuration parameters, build instructions, and QA-specific code (e.g., developer-supplied regression/performance tests) associated with a software project. Each job configuration is then sent to a Skoll client, which executes the job configuration and returns the results to the ISA (for the feasibility studies described in Section 4, we used the BGML modeling tools [9] to generate most of the code that comprises a job configuration). The ISA can learn from the results and adapt the process, e.g., if some configurations fail to work properly, developers may either want to pinpoint the source of the problems or refocus on other unexplored parts of the configuration space. To control the ISA, Skoll DCQA process designers can develop customized adaptation strategies that monitor the global process state, analyze it, and use the information to modify future subtask assignments in ways that improve process performance.

Since DCQA processes can be complex, Skoll users often need help to interpret and leverage process results. Skoll therefore supports a variety of pluggable analysis tools, such as Classification Tree Analysis (CTA) [1]. In previous work [12, 16], for example, we used CTA to diagnose options and settings that were the likely causes of specific test failures. For this work, we developed statistical tools to analyze data from the formally-designed experiments described in the following section.

3. Performance-Oriented Regression Testing

As software systems change, developers often run regression tests to detect unintended functional side effects. Developers of performance-intensive systems must also be wary of unintended side effects on their end-to-end QoS. To detect such performance problems, developers often run benchmarking regression tests periodically. As described in Section 1, however, in-house QA efforts can be confounded by the enormous configuration space of highly configurable performance-intensive systems, where time and resource constraints (and often high change frequencies) severely limit the number of configurations that can be examined. For example, our earlier experience with applying Skoll to the ACE+TAO middleware [12] found that only a small number of default configurations are benchmarked routinely by the core ACE+TAO development team, who thus get a very limited view of their middleware’s QoS. Problems not readily seen in these default configurations therefore often escape detection until systems based on ACE+TAO are fielded by end-users.

This section describes how we address this problem by using the model-based Skoll environment (Section 2) to develop and implement a new hybrid DCQA process called main effects screening. We also describe the formal foundations of our approach, which is based on design of experiments theory, and give an example that illustrates key aspects of our approach.

3.1. The Main Effects Screening Process

Main effects screening is a technique for rapidly detecting performance degradation across a large configuration space as a result of system changes. Our approach relies on a class of experimental designs called screening designs [15], which are highly economical and can reveal important low-order effects (such as individual option settings and option pairs/triples) that strongly affect performance. We call these most influential option settings “main effects.”

At a high level, main effects screening involves the following steps: (1) compute a formal experimental design based on the system’s configuration model, (2) execute that experimental design across fielded computing resources in the Skoll DCQA grid by running and measuring benchmarks on specific configurations dictated by the experimental design devised in step 1, (3) collect, analyze and display the data so that developers can identify the main effects, (4) estimate overall performance whenever the software changes by evaluating all combinations of the main effects (while defaulting or randomizing all other options), and (5) recalibrate the main effects options by restarting the process periodically since the main effects can change over time, depending on how fast the system changes.

The assumption behind this five-step process is that since main effects options are the ones that affect performance most, evaluating all combinations of these option settings (which we call the “screening suite”) can reasonably estimate performance across the entire configuration space. If this assumption is true, testing the screening suite should provide much the same information as testing the entire configuration space, but at a fraction of the time and effort since it is much smaller than the entire configuration space.
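As a concrete illustration of step 4 above, the sketch below (Python; the option names and defaults are hypothetical, not taken from Skoll) enumerates a screening suite: all combinations of the identified main-effects options, with every other option held at a default setting.

```python
from itertools import product

def screening_suite(all_options, defaults, main_effects):
    """All combinations of the main-effects options; other options stay at defaults.

    all_options : dict mapping option name -> list of legal settings
    defaults    : dict mapping option name -> default setting
    main_effects: list of option names identified as main effects
    """
    suite = []
    for combo in product(*(all_options[name] for name in main_effects)):
        cfg = dict(defaults)                  # start from the default configuration
        cfg.update(zip(main_effects, combo))  # override only the important options
        suite.append(cfg)
    return suite

# Hypothetical example: two important binary options yield a 4-configuration suite.
opts = {"A": ["-", "+"], "B": ["-", "+"], "C": ["-", "+"]}
dflt = {"A": "-", "B": "-", "C": "-"}
print(screening_suite(opts, dflt, main_effects=["A", "B"]))
```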

3.2. Technical Foundations of Screening Designs

For main effects screening to work we need to identify the main effects, i.e., the subset of options whose settings account for a large portion of performance variation across the system’s configuration space. One obvious approach is to test every configuration exhaustively. Since exhaustive testing is infeasible for large-scale, highly configurable, performance-intensive software systems, however, developers often do some kind of random or ad hoc sampling based on their knowledge of the system. Since our experience indicates that these approaches can be unreliable [12, 8], we need an approach that samples the configuration space, yet produces reasonably precise and reliable estimates of overall performance.

The approach we chose for this paper uses formally-designed experiments, called screening designs, that are highly economical and whose primary purpose is to identify important low-order effects, i.e., first-, second-, or third-order effects, where an nth-order effect is an effect caused by the simultaneous interaction of n factors. For instance, for certain web server applications, a 1st-order effect might be that performance slows considerably when logging is turned on and another might be that it also slows when few server threads are used. A 2nd-order effect involves the interaction of two options, e.g., web server performance may slow down when caching is turned off and the server performs blocking reads.

There are many ways to compute screening designs. The one we use is based on traditional factorial designs. Consider a full factorial design involving k binary factors. Such a design exhaustively tests all combinations of the factors. Therefore, the design’s run size (number of experimental observations) is 2^k. Although this quickly becomes expensive, it does allow one to compute all effects.

To reduce costs statisticians have developed fractional factorial designs. These designs use only a carefully selected fraction (such as 1/2 or 1/4) of a full factorial design. This saves money, but does so by giving up the ability to measure some higher-order effects. This is because the way observations are chosen aliases the effects of some lower-order interactions with some higher-order ones. That is, it lumps together certain high- and low-order effects on the assumption that the high-order effects are negligible.

Screening designs push this tradeoff to an extreme. Roughly speaking, at their smallest, screening designs require only as many observations as the number of effects one wishes to calculate (i.e., k observations to compute k 1st-order effects). Of course, experimenters will often use more than the minimum number of observations to improve precision or to deal with noisy processes. As before, this is only possible because the design aliases some low-order with some high-order effects.

3.3. Screening Designs in Action

To show how screening designs are computed, we now present a hypothetical example of a software system with 4 binary configuration options, A through D, with no inter-option constraints. To compute a specific screening design, developers must (a) decide how many observations they can afford, (b) determine which effects they want to analyze, and (c) select an aliasing strategy consistent with these constraints. Note that in practice screening designs are usually computed by automated tools.

The configuration space for our sample system has 2^4 = 16 configurations. Therefore, the corresponding full factorial design involves 16 observations. Let’s assume that our developers, however, can afford to run only 8 observations. Obviously, the full factorial design would therefore be unacceptable. Let’s also assume that our developers are mostly interested in capturing the four 1st-order effects (i.e., the effect of each option by itself). Given this goal the developers decide to use a screening design.

The first step in computing the screening design is to create a 2^3 full factorial design over 3 (arbitrarily selected) options, in this case A, B, and C. This design is shown in Table 1(a) with the binary option settings encoded as (-) or (+). This is the starting point because 2^3 = 8 is the maximum number of observations the developers can afford to run.

(a)
A B C
- - -
+ - -
- + -
+ + -
- - +
+ - +
- + +
+ + +

(b)
A B C D
- - - -
+ - - +
- + - +
+ + - -
- - + +
+ - + -
- + + -
+ + + +

Table 1. (a) 2^3 Design and (b) 2^{4-1}_{IV} Design

This initial design would be fine for 3 options. But it can’t handle our developers’ 4-option system since they’ve already decided that the full factorial design would be too expensive. We need a way to stretch the current 3-option design to cover the 4th option. As mentioned previously, this is done by aliasing some effects.

Here the developers must specify which effects can be safely aliased. They do this by choosing a desired resolution for the screening design. In resolution R designs, no effects involving j factors are aliased with effects involving fewer than R - j factors. Since our developers are only interested in computing 1st-order effects, they choose a resolution IV design. With this design, 1st-order effects will be aliased with 3rd-order or higher effects and 2nd-order effects will be aliased with 2nd-order or higher effects. Our developers, therefore, are assuming that 3rd-order or higher effects are negligible and that they are not interested in computing the 2nd-order effects. If the developers are unwilling to make these assumptions, then screening designs may be inappropriate for them (or they may need to use the screening designs in an iterative, exploratory fashion).

The final step in the process is to compute the specific values of option D consistent with the developers’ desire for a resolution IV design. To do this a function called a design generator must be determined. This function computes the setting of option D based on the values of options A, B and C. In our hypothetical example, automated tools chose D = ABC as the design generator. This means that for each observation, D’s setting is computed by multiplying the settings for options A, B, and C (think of + as 1 and - as -1). In general there is no closed-form solution for choosing a design generator. In fact, in some situations, none may exist. For this paper we identified design generators using the factex function of the SAS QC package (see http://www.math.wpi.edu/saspdf/qc/chap14.pdf). This search-based function essentially looks through numerous possible design generators until an acceptable one is found (or until the algorithm gives up, if none can be found).

Table 1(b) gives the final design, which is identified uniquely as a 2^{4-1}_{IV} design with the design generator D = ABC. The 2^{4-1} designation means that the total number of options is 4, that we will examine a 1/2 (2^{-1} = 1/2) fraction of the full factorial design, that the design is a resolution IV screening design, and that the aliasing pattern is D = ABC.
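The construction of Table 1(b) can be reproduced mechanically. The sketch below (Python, written by us rather than the SAS factex tool used in the paper) builds the 2^3 full factorial over A, B, and C and derives D from the design generator D = ABC.

```python
from itertools import product

# 2^3 full factorial over options A, B, C, with settings coded as -1/+1 (Table 1(a)).
runs = [dict(zip("ABC", levels)) for levels in product([-1, +1], repeat=3)]

# Design generator D = ABC: each run's D setting is the product of its A, B, and C
# settings, which yields the 8-run 2^(4-1) resolution IV design of Table 1(b).
for run in runs:
    run["D"] = run["A"] * run["B"] * run["C"]

for run in runs:
    print(" ".join("+" if run[option] > 0 else "-" for option in "ABCD"))
```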

After defining the screening design, developers can execute it across the Skoll grid. For our process, each observation involves measuring a developer-supplied benchmarking regression test while the system runs in a particular configuration. Once the data is collected we would analyze it to calculate the effects. For binary options (with settings - or +), the main effect of option A, ME(A), is

ME(A) = z(A-) - z(A+)    (1)

where z(A-) and z(A+) are the mean values of the observed data over all runs where option A is (-) and where option A is (+), respectively.

If appropriate, 2nd-order effects can be calculated in a similar way. The interaction effect of options A and B, INT(A, B), is:

INT(A, B) = 1/2 [ME(B|A+) - ME(B|A-)]    (2)
          = 1/2 [ME(A|B+) - ME(A|B-)]    (3)

Here ME(B|A+) is called the conditional main effect of B at the + level of A. The effect of one factor (e.g., B) therefore depends on the level of the other factor (e.g., A). Similar equations exist for higher-order effects.
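A minimal sketch (Python/NumPy, with made-up observations) of how Eqs. (1)-(3) can be evaluated from collected benchmark data; the function names are ours, not part of Skoll.

```python
import numpy as np

def main_effect(z, option):
    """ME per Eq. (1): mean response at the option's '-' setting minus the mean at '+'."""
    z, option = np.asarray(z, float), np.asarray(option)
    return z[option == -1].mean() - z[option == +1].mean()

def interaction_effect(z, a, b):
    """INT(A, B) per Eq. (2): half the difference between B's conditional main
    effects at the '+' and '-' levels of A."""
    z, a, b = np.asarray(z, float), np.asarray(a), np.asarray(b)
    me_b_at_a_plus = main_effect(z[a == +1], b[a == +1])
    me_b_at_a_minus = main_effect(z[a == -1], b[a == -1])
    return 0.5 * (me_b_at_a_plus - me_b_at_a_minus)

# Made-up latency observations and the -1/+1 settings of options A and B per run.
latency = [100, 98, 120, 119, 101, 99, 122, 121]
A = [-1, +1, -1, +1, -1, +1, -1, +1]
B = [-1, -1, +1, +1, -1, -1, +1, +1]
print(main_effect(latency, A), interaction_effect(latency, A, B))
```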

Once the effects are computed developers will want to determine which of these effects are important and which are not. There are several ways to determine this, including using standard hypothesis testing approaches. For this paper we opted not to use formal hypothesis tests primarily because they require strong assumptions about the standard deviation of the experimental samples. In future work we will avoid this problem by simply replicating observations in the experimental design. For this work, however, we display the effects graphically and ask developers to use their expert judgement to decide which effects they consider important.

4. Feasibility Study

This section describes a feasibility study that assesses the implementation cost and the effectiveness of the main effects screening process described in Section 3 on a suite of large, performance-intensive software systems.

4.1. Experimental Design

Hypotheses. Our feasibility study explores the following hypotheses: (1) our model-based Skoll environment cost-effectively supports the definition, implementation and execution of our main effects screening process described in Section 3, (2) the screening design used in main effects screening correctly identifies a small subset of options whose effect on performance is important, and (3) exhaustively examining just the options identified by the screening design gives performance data that (a) is representative of the system’s performance across the entire configuration space, but less costly to obtain and (b) is more representative than a similarly-sized random sample.

Subject applications. The experimental subject applications for this study were based on a suite of performance-intensive software: ACE v5.4 + TAO v1.4 + CIAO v0.4. ACE provides reusable C++ wrapper facades and framework components that implement core concurrency and distribution patterns [13] for communication software. TAO is a high-performance, highly configurable Real-time CORBA ORB built atop ACE to meet the demanding QoS requirements of DRE systems. CIAO is QoS-enabled middleware that extends TAO to support components, which enables developers to declaratively provision QoS policies end-to-end when assembling a DRE system.

ACE, TAO, and CIAO are ideal subjects for our feasibility study since they share many characteristics with other highly configurable performance-intensive software systems. For example, they collectively contain over 2M lines of source code, functional regression tests, and performance benchmarks contained in ~4,500 files that average over 300 CVS commits per week. They also run on a wide range of OS platforms, including all variants of Windows, most versions of UNIX, and many real-time operating systems, such as LynxOS and VxWorks.

Application scenario. Due to recent changes made to the message queuing strategy, the developers of ACE+TAO+CIAO were concerned with measuring two performance criteria: (1) the latency for each request and (2) total message throughput (events/second) between the ACE+TAO+CIAO client and server. For this version of ACE+TAO+CIAO, the developers identified 14 binary run-time options they felt affected latency and throughput (see Table 2). Thus, the entire configuration space has 2^14 = 16,384 different configurations.

Experimental process. Our experimental process used Skoll’s model-driven tools to implement the main effects screening process and evaluate our three hypotheses. To do this, we executed the main effects screening process across a prototype Skoll grid of dual-processor Xeon machines running Red Hat 2.4.21 with 1GB of memory in the real-time scheduling class. The experimental task involved running a benchmark application in a particular configuration, which evaluated the application scenario outlined above by creating an ACE+TAO+CIAO client and server. For each task we measured message latency and overall throughput between the client and the server.


Index   Option Name                              Option Settings
A       ORBReactorThreadQueue                    {FIFO, LIFO}
B       ORBClientConnectionHandler               {RW, MT}
C       ORBReactorMaskSignals                    {0, 1}
D       ORBConnectionPurgingStrategy             {LRU, LFU}
E       ORBConnectionCachePurgePercentage        {10, 40}
F       ORBConnectionCacheLock                   {thread, null}
G       ORBCorbaObjectLock                       {thread, null}
H       ORBObjectKeyTableLock                    {thread, null}
I       ORBInputCDRAllocator                     {thread, null}
J       ORBConcurrency                           {reactive, thread-per-connection}
K       ORBActiveObjectMapSize                   {32, 128}
L       ORBUseridPolicyDemuxStrategy             {linear, dynamic}
M       ORBSystemidPolicyDemuxStrategy           {linear, dynamic}
N       ORBUniqueidPolicyReverseDemuxStrategy    {linear, dynamic}

Table 2. Some ACE+TAO Options and their Settings

Figure 2. Option effects based on full data (half-normal probability plots of option effects against half-normal quantiles for latency and throughput, full-factorial data; options B and J stand apart from the remaining options)

The client sends 300K requests to the server, where after each request it waits for a response from the server and records the latency measure. At the end of 300K requests, the client computes the throughput achieved in terms of number of requests served per second. We finally analyzed the resulting data to evaluate our hypotheses. Section 6 describes the limitations with our current experimental process.
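A simplified sketch (Python) of the measurement loop described above; the real benchmark is an ACE+TAO+CIAO C++ client and server, so the send_request callable below is only a stand-in for the actual remote invocation.

```python
import time

def run_benchmark(send_request, num_requests=300_000):
    """Issue requests one at a time, recording per-request latency and overall throughput."""
    latencies = []
    start = time.perf_counter()
    for _ in range(num_requests):
        t0 = time.perf_counter()
        send_request()                        # blocks until the server's response arrives
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return latencies, num_requests / elapsed  # throughput in requests per second

# Example with a dummy request standing in for the remote operation invocation.
latencies, throughput = run_benchmark(lambda: time.sleep(0.0001), num_requests=1000)
print(f"mean latency = {sum(latencies) / len(latencies):.6f} s, throughput = {throughput:.0f} req/s")
```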

4.2. The Full Data Set

To evaluate our approach, we generated performance data for all 16,000+ valid configurations, which we refer to as the “full suite” and the performance data as the “full data set.” We then examined the effect of each option and judged whether they had important effects on performance using a graphical method called half-normal probability plots, which show each option’s effect against its corresponding coordinate on the half-normal probability scale. If |θ|_(1) ≤ |θ|_(2) ≤ ... ≤ |θ|_(I) are the ordered set of effect estimations, the half-normal plot then consists of the points

(Φ^{-1}(0.5 + 0.5(i - 0.5)/I), |θ|_(i)),   for i = 1, ..., I    (4)

where Φ is the cumulative distribution function of a standard normal random variable.

The rationale behind half-normal plots is that unimportant options will have effects whose distribution is normal and centered near 0. Important effects will also be normally distributed, but with means different from 0. If no effects are important, the resulting plot will show a set of points on a rough line near y = 0. Options whose effects deviate substantially from 0 are considered important.
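A small sketch (Python/SciPy, with illustrative effect values) of how the plotted points of Eq. (4) can be computed from a set of estimated effects:

```python
import numpy as np
from scipy.stats import norm

def half_normal_points(effects):
    """Return (half-normal quantile, |effect|) pairs as defined in Eq. (4)."""
    ordered = np.sort(np.abs(np.asarray(effects, float)))  # |theta|_(1) <= ... <= |theta|_(I)
    I = len(ordered)
    i = np.arange(1, I + 1)
    quantiles = norm.ppf(0.5 + 0.5 * (i - 0.5) / I)        # Phi^{-1}(0.5 + 0.5(i - 0.5)/I)
    return list(zip(quantiles, ordered))

# Illustrative effect estimates: most near zero, two standing out from the rest.
effects = [0.2, -0.1, 0.3, 0.05, -0.4, 18.0, 22.5]
for q, e in half_normal_points(effects):
    print(f"{q:5.2f}  {e:6.2f}")
```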

Note that “importance” is not defined formally and differs in spirit from the traditional notion of statistical significance. In particular, developers must decide for themselves how large effects must be to warrant their attention. While this has some downsides (see Section 6), even with traditional statistical tests that measure statistical significance developers still must make judgments as to the magnitude of effects.

Figure 2 plots the effect across the full data set of each of the 14 ACE+TAO+CIAO options on latency and throughput. We see that options B and J are clearly important, whereas options I, C and F are arguably important, and the remaining options are not important.

4.3. Evaluating Screening Designs

We now evaluate whether the remotely executed screening designs can correctly identify the same important options discovered in the full data set. To do this, we calculated and executed three different screening designs, whose specifications appear in Appendix A. These designs examined all 14 options using increasingly larger run sizes (32, 64, or 128 observations). We refer to the screening designs as Scr32, Scr64 and Scr128, respectively.

Figure 3 shows the half-normal probability plots obtained from our screening designs. The figures show that all screening designs correctly identify options B and J as being important (as is the case in the full-factorial experiment). Scr128 also identifies the possibly important effect of options C, I, and F. Due to space considerations in the paper we only present data on latency. Throughput analysis shows identical results unless otherwise stated.

These results suggest that (1) screening designs can detect important options at a small fraction of the cost of exhaustive testing, (2) the smaller the effect, the larger the run size needed to identify it, and (3) developers should be cautious when dealing with options that appear to have an important, but relatively small effect, as they may actually be seeing normal variation (Scr32 and Scr64 both have examples of this).


Figure 3. Option Effects Based on Screening Designs (half-normal probability plots for latency from the resolution IV 32-run, 64-run, and 128-run designs; options B and J stand apart in all three)

4.4. Estimating Performance with Screening Suites

We now evaluate whether screening all the combinations of the most important options can be used to estimate performance quickly across the entire configuration space we are studying. The estimates are generated by examining all combinations of the most important options, while defaulting the settings of the unimportant options. In the previous section, we determined that options B and J were clearly important and that options C, I, and F were arguably important. Developers will therefore make the estimates based on benchmarking either 4 (all combinations of options B and J) or 32 (all combinations of options B, J, C, I, and F) configurations. We will refer to the set of 4 configurations as the top-2 screening suite and the set of 32 configurations as the top-5 screening suite.

Figure 4 shows the distributions of latency for the full suite vs. the top-5 screening suite and for the full suite vs. the top-2 screening suite. From the figure, we see that the distributions of the top-5 and top-2 screening suites closely track the overall performance data. Such plots, called quantile-quantile (Q-Q) plots, are used to see how well two data distributions correlate. To do this they plot the quantiles of the first data set against the quantiles of the second data set. If the two sets share the same distribution, the points should fall approximately on the x = y line. In addition we performed Mann-Whitney non-parametric tests to determine whether each set of screening data (top-2 and top-5 suites) appears to come from the same distribution as the full data. In both cases we were unable to reject the null hypothesis that the top-2 and top-5 screening suite data come from the same distribution as the full suite data. This data suggests that the screening suites computed at step 4 of the main effects screening process (Section 3) can be used to estimate overall performance in-house at extremely low time/effort, i.e., running 4 benchmarks takes 40 seconds, running 32 takes 5 minutes, and running 16,000+ takes 2 days.
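Both comparisons can be reproduced with standard statistical tools. The sketch below (Python/SciPy, with made-up latency samples in place of the real measurements) computes matching quantiles for a Q-Q plot and runs a Mann-Whitney test of the same null hypothesis.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Made-up latency samples standing in for the full suite and a screening suite.
rng = np.random.default_rng(0)
full_suite = rng.normal(loc=95, scale=10, size=16384)
screening_suite = rng.normal(loc=95, scale=10, size=32)

# Q-Q comparison: matching quantiles of the two samples; points near the x = y line
# indicate that the two distributions agree.
probs = np.linspace(0.01, 0.99, 99)
qq_points = list(zip(np.quantile(full_suite, probs), np.quantile(screening_suite, probs)))

# Mann-Whitney test of the null hypothesis that both samples come from the same
# distribution.
statistic, p_value = mannwhitneyu(full_suite, screening_suite, alternative="two-sided")
print(f"U = {statistic:.1f}, p = {p_value:.3f}")
```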

Figure 4. Q-Q plots for the top-2 and top-5 screening suites (latency quantiles of each screening suite plotted against the corresponding full-suite quantiles)

4.5. Screening Suites vs. Random Sampling

Another question is whether our main effects screening process was any better than other low-cost estimation processes.


Figure 5. Latency Distributions from full, top-2, and random suites (box plots of latency for the full suite, the top-2 screening suite, and eight random suites of 4 configurations each)

In particular, we compared the latency distributions of several random samples of 4 configurations to that of the top-2 screening suite found by our process. The results of this test are summarized in Figure 5. These box plots show the distributions of the latency metric obtained from exhaustive testing, top-2 screening suite testing, and random testing. These graphs suggest the obvious weakness of random sampling, i.e., while sampling distributions tend toward the overall distribution as the sample size grows, individual small samples may show wildly different distributions.
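For comparison, here is a sketch (Python/NumPy, with made-up latency data) of the random-sampling baseline: repeatedly drawing 4 configurations at random and summarizing each sample's latency distribution alongside a fixed 4-configuration suite, which makes the sample-to-sample variability visible.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up latency value for every configuration in a 16,384-configuration space.
full_latencies = rng.normal(loc=95, scale=10, size=16384)

# A fixed 4-configuration suite standing in for the top-2 screening suite
# (the indices here are arbitrary, chosen only for illustration).
fixed_suite = full_latencies[:4]

# Eight random 4-configuration samples, as in Figure 5.
for k in range(8):
    sample = rng.choice(full_latencies, size=4, replace=False)
    print(f"rand{k + 1}: median = {np.median(sample):6.1f}, range = {np.ptp(sample):5.1f}")

print(f"fixed : median = {np.median(fixed_suite):6.1f}, range = {np.ptp(fixed_suite):5.1f}")
print(f"full  : median = {np.median(full_latencies):6.1f}")
```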

4.6. Dealing with Evolving Systems

A primary goal of main effects screening is to detect performance degradations in evolving systems quickly. So far we have not addressed whether – or for how long – screening suites remain useful as a system evolves. To better understand this issue, we measured latency on the top-2 screening suite, once a day, using CVS snapshots of ACE+TAO+CIAO. We used historical snapshots for two reasons: (1) the versions are from the time period for which we already calculated the main effects and (2) developer testing and in-the-field usage data have already been collected and analyzed for this time period (see www.dre.vanderbilt.edu/Stats/), allowing us to assess the system’s performance without having to exhaustively test all configurations for each system change.

Figure 6 depicts the data distributions for the top-2 screening suites broken down by date (higher latency measures are worse). We see that the distributions were stable the first two days, crept up somewhat for 3 days and then shot up on the sixth day (12/14/03). They were brought back under control for several more days, but then moved up again on the last day. Developer records and problem reports indicate that problems were noticed on 12/14/03, but not before then.

Another interesting finding was that the limited testing done by ACE+TAO+CIAO developers measured a performance drop of only around 5% on 12/14/03. In contrast, our screening process showed a much more dramatic drop – closer to 50%. Further analysis by system developers indicated that their unsystematic testing failed to evaluate configurations where the degradation was much more pronounced.

Figure 6. Performance estimates across time (latency distributions for the top-2 screening suite, “Screening 2 Options (4 cfgs) Over a Time Period,” for daily snapshots from 2003.12.09 through 2003.12.21)

4.7. Higher-Order Effects

The analyses done so far only calculated first-order effects, which worked well for our subject application and scenario, but might not be sufficient for other situations. Figure 7 shows the effects of all pairs of options in the subject systems based on the full data set and on a screening design. We used a resolution VI design here (rather than resolution IV as in the previous sections) and increased the run size to 2,048 to capture the second-order effects. From the figure we see several things. First, the important interaction effects involve only options that are already considered important by themselves, which supports the idea that monitoring only first-order effects was sufficient for our subject systems. Second, we see that the screening design correctly identifies the 5 most important pairwise interactions at 1/8 the cost of exhaustive testing.

5. Related Work

Applying DoE to software engineering. As far as we can tell, no one has used screening designs for software performance assessment.


Figure 7. Pairwise Effects Based on Full and Screening Suite (half-normal probability plots of second-order effects for latency from the full-factorial data and from the resolution VI, 2048-run design; interactions BJ, CJ, BC, FJ, and BF stand apart)

The use of design of experiments (DoE) theory within software engineering has mostly been limited to interaction testing, largely to compute and sometimes generate minimal test suites that cover all combinations of specified program inputs. Some examples of this work include Dalal et al. [4], Burr et al. [2], Dunietz et al. [5], and Kuhn et al. [10]. Yilmaz et al. [16] used covering arrays as a configuration space sampling technique to support the characterization of failure-inducing option settings. Other relevant literature on performance monitoring includes:

• Offline analysis, which has been applied to program analysis to improve compiler-generated code. For example, the ATLAS [6] numerical algebra library uses an empirical optimization engine to decide the values of optimization parameters by generating different program versions that are run on various hardware/OS platforms. The output from these runs is used to select parameter values that provide the best performance. Mathematical models are also used to estimate optimization parameters based on the underlying architecture, though empirical data is not fed into these models to refine them.

• Online analysis, where feedback control is used to dynamically adapt QoS measures. An example of online analysis is the ControlWare middleware [17], which uses feedback control theory by analyzing the architecture and modeling it as a feedback control loop. Actuators and sensors then monitor the system and affect server resource allocation. Real-time scheduling based on feedback loops has also been applied to Real-time CORBA middleware [11] to automatically adjust the rate of remote operation invocation transparently to an application.

• Hybrid analysis, which combines aspects of offline and online analysis. For example, the continuous compilation strategy [3] constantly monitors and improves application code using code optimization techniques.

6. Concluding Remarks

This paper presents a new distributed continuous quality assurance (DCQA) process called main effects screening that is designed to detect performance degradation efficiently in performance-intensive software systems that have large configuration spaces. To evaluate this process, we conducted a formally-designed experiment across a grid of in-house and in-the-field computers in the Skoll environment. The results of this experiment helped in estimating performance across the large configuration space of the ACE, TAO, and CIAO software systems.

All empirical studies suffer from threats to their internal and external validity. For this work, we were primarily concerned with threats to external validity since they limit our ability to generalize the results of our experiment to industrial practice. One potential threat is that several steps in our process require human decision making and input, e.g., developers must provide reasonable benchmarking applications and must also decide when they consider effects to be important.

Another possible threat to external validity concerns the representativeness of the ACE+TAO+CIAO subject applications, which though large are still just one suite of software systems. A related issue is that we have focused on a relatively simple and small subset of the entire configuration space of ACE+TAO+CIAO that only has binary options and has no inter-option constraints. While these issues pose no theoretical problems (screening designs can be created for much more complex situations), we need to apply our approach to larger, more realistic configuration spaces in future work to understand how well it scales.

Another potential threat is that for the time period we studied, the ACE+TAO+CIAO subject application was in a fairly stable phase, i.e., changes were made mostly to fix bugs and reduce memory footprint, but the system’s functionality was relatively stable. For situations where a system’s basic functionality is in greater flux, it may be harder to distinguish significant performance degradation from normal variation.

Despite these limitations, we believe our study supports our basic hypotheses. We reached this conclusion by noting that our studies showed that: (1) screening designs can correctly identify important options, (2) these options can be used to quickly produce reliable estimates of performance across the entire configuration space at a fraction of the cost of exhaustive testing, (3) the alternative approach of random or ad hoc sampling can give highly unreliable results, (4) the main effects process detected performance degradation on a large and evolving software system, and (5) the screening suite estimates were more precise than the ad hoc process currently used by the developers of the subject system. Main effects screening can also be a defect detection aid, e.g., if the screened options change unexpectedly, developers can reexamine the software to identify possible problems with software updates.

We believe that this line of research is novel and interesting, but much work remains to be done. We are therefore continuing to develop enhanced model-based Skoll capabilities and using them to create and validate new, more sophisticated DCQA processes that overcome existing limitations and threats to external validity. In particular, we are exploring the connection between design of experiments theory and the quality assurance of systems with large configuration spaces. We are also working to incorporate Skoll services into software repositories, such as ESCHER (www.escherinstitute.org). Finally, we are conducting a much larger case study using Skoll to conduct the ACE+TAO+CIAO daily build and regression test process with 100+ machines contributed by users and developers worldwide.

Acknowledgements. We thank the anonymous reviewers for their helpful comments. This material is based on work supported by the US National Science Foundation under NSF grants CCR-0312859, CCR-0205265, CCR-0098158 and CCF-0447864 as well as funding from BBN Technologies, Lockheed Martin, Raytheon, and Siemens.

References

[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Monterey, CA, 1984.

[2] K. Burr and W. Young. Combinatorial test techniques: Table-based automation, test generation and code coverage. In Proc. of the Intl. Conf. on Software Testing Analysis & Review, 1998.

[3] B. Childers, J. Davidson, and M. Soffa. Continuous Compilation: A New Approach to Aggressive and Adaptive Code Transformation. In Proceedings of the International Parallel and Distributed Processing Symposium, Apr. 2003.

[4] S. R. Dalal, A. Jain, N. Karunanithi, J. M. Leaton, C. M. Lott, G. C. Patton, and B. M. Horowitz. Model-based testing in practice. In Proc. of the Intl. Conf. on Software Engineering (ICSE), pages 285–294, 1999.

[5] I. S. Dunietz, W. K. Ehrlich, B. D. Szablak, C. L. Mallows, and A. Iannino. Applying design of experiments to software testing. In Proc. of the Intl. Conf. on Software Engineering (ICSE ’97), pages 205–215, 1997.

[6] K. Yotov, X. Li, G. Ren, et al. A Comparison of Empirical and Model-driven Optimization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2003.

[7] G. Karsai, J. Sztipanovits, A. Ledeczi, and T. Bapty. Model-Integrated Development of Embedded Software. Proceedings of the IEEE, 91(1):145–164, Jan. 2003.

[8] A. S. Krishna, D. C. Schmidt, A. Porter, A. Memon, and D. Sevilla-Ruiz. Improving the Quality of Performance-intensive Software via Model-integrated Distributed Continuous Quality Assurance. In Proceedings of the 8th International Conference on Software Reuse, Madrid, Spain, July 2004. ACM/IEEE.

[9] A. S. Krishna, N. Wang, B. Natarajan, A. Gokhale, D. C. Schmidt, and G. Thaker. CCMPerf: A Benchmarking Tool for CORBA Component Model Implementations. In Proceedings of the 10th Real-time Technology and Application Symposium (RTAS ’04), Toronto, CA, May 2004. IEEE.

[10] D. Kuhn and M. Reilly. An investigation of the applicability of design of experiments to software testing. In Proc. 27th Annual NASA Goddard/IEEE Software Engineering Workshop, pages 91–95, 2002.

[11] C. Lu, J. A. Stankovic, G. Tao, and S. H. Son. Feedback Control Real-Time Scheduling: Framework, Modeling, and Algorithms. Real-Time Systems Journal, 23(1/2):85–126, July 2002.

[12] A. Memon, A. Porter, C. Yilmaz, A. Nagarajan, D. C. Schmidt, and B. Natarajan. Skoll: Distributed Continuous Quality Assurance. In Proceedings of the 26th IEEE/ACM International Conference on Software Engineering, Edinburgh, Scotland, May 2004. IEEE/ACM.

[13] D. C. Schmidt, M. Stal, H. Rohnert, and F. Buschmann. Pattern-Oriented Software Architecture: Patterns for Concurrent and Networked Objects, Volume 2. Wiley & Sons, New York, 2000.

[14] E. Turkay, A. Gokhale, and B. Natarajan. Addressing the Middleware Configuration Challenges using Model-based Techniques. In Proceedings of the 42nd Annual Southeast Conference, Huntsville, AL, Apr. 2004. ACM.

[15] C. F. J. Wu and M. Hamada. Experiments: Planning, Analysis, and Parameter Design Optimization. Wiley, 2000.

[16] C. Yilmaz, M. B. Cohen, and A. Porter. Covering arrays for efficient fault characterizations in complex configuration space. In Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2004.

[17] R. Zhang, C. Lu, T. Abdelzaher, and J. Stankovic. ControlWare: A Middleware Architecture for Feedback Control of Software Performance. In Proceedings of the International Conference on Distributed Systems, July 2002.

A. Actual Screening Designs

The screening designs used in Section 4.3 were calculated using the SAS statistical package (www.sas.com). Scr32 is a 2^{14-9}_{IV} design, Scr64 is a 2^{14-8}_{IV} design, and Scr128 is a 2^{14-7}_{IV} design. In each case the design generators, found with the SAS factex function described in Section 3.3, define each added option as the product of a subset of the other options (for example, the first generator of Scr32 computes one option’s settings as the product of the settings of options A, B, and C).
