
The Impact of I/O on Program Behavior and Parallel Scheduling

Emilia Rosti*, Giuseppe Serazzi†, Evgenia Smirni‡, Mark S. Squillante§

* Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Milano, Italy. [email protected]
† Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy. [email protected]
‡ Department of Computer Science, College of William and Mary, Williamsburg, VA, USA. [email protected]
§ IBM Research Division, T. J. Watson Research Center, Yorktown Heights, NY, USA. [email protected]

(This work was partially supported by the PQE2000 project and the Italian CNR CESTIA center. Portions of this research were conducted during an academic visit by M. S. Squillante to the Dipartimento di Elettronica e Informazione, Politecnico di Milano.)

Abstract

In this paper we systematically examine various performance issues involved in the coordinated allocation of processor and disk resources in large-scale parallel computer systems. Models are formulated to investigate the I/O and computation behavior of parallel programs and workloads, and to analyze parallel scheduling policies under such workloads. These models are parameterized by measurements of parallel programs, and they are solved via analytic methods and simulation. Our results provide important insights into the performance of parallel applications and resource management strategies when I/O demands are not negligible.

1 Introduction

Parallel computing has emerged as an indispensable tool for solving problems in various scientific/engineering domains, as many classical disciplines are increasingly dependent upon powerful parallel systems for the execution of both computation and data intensive applications. The allocation and management of resources in these system environments is a particularly complex problem that is fundamental to sustaining and improving the benefits of parallel computing. The large number of users attempting to simultaneously execute work on the system, the parallelism of the applications and their respective computational and secondary storage needs, and the satisfying of execution deadlines for certain applications are all issues that exacerbate the resource scheduling problem.

Most of the previous research in this area has focused on the scheduling of processors in parallel systems. This includes various forms of space-sharing where the processors are partitioned among different parallel jobs (e.g., [5, 2, 12, 16]), and to a lesser extent combinations of space-sharing and time-sharing (e.g., [7, 18, 3]). Some studies have also considered the coordinated allocation of processors and memory [9, 13, 8]. However, an important dimension that has been essentially ignored in previous parallel scheduling research concerns the impact of the I/O demands of the parallel applications.

For many applications, and specifically those in the computational science and engineering fields, manipulating large volumes of data is a necessity, and high-performance I/O becomes critical for performance. Some early characterizations of I/O-intensive parallel applications suggest that the I/O requirements of a parallel job can significantly affect the job's efficiency and scalability [15]. The application scalability is in fact a function of both the number of assigned processors and the configuration of the I/O system (i.e., the number of available disks and the distribution of the application data on the disks). Furthermore, if the application has to compete with other jobs for secondary storage resources, its efficiency may decrease due to the increase in time spent by the application waiting for its I/O requests to complete. As a result, there may be significant perturbations in the expected I/O service times and consequently in the job execution times.

The problem of coordinated allocation of processor and disk resources has two fundamental aspects that are strongly interconnected. One concerns the behavior of parallel programs and workloads with respect to computation and I/O as a function of the number of processors and disks on which they are executed. The other concerns the allocation and management of resources among the parallel jobs comprising the system workload to satisfy various objective functions, such as minimizing response time and maximizing throughput. In this paper we develop a systematic approach to investigate these key issues of the general resource scheduling problem in parallel systems, and we exploit our formal framework to gain a better understanding of this complex problem.

First, we formulate a detailed yet compact model of parallel program behavior that effectively captures the I/O and computation characteristics of parallel applications, which we observe from measurement data obtained as part of this study. A functional form for application speedup with respect to both processor and disk resources is derived, leading to a generalization of Amdahl's law [1] that extends the function to three dimensions to also include the disks. This analysis is then extended to address the computation and I/O behavior of parallel workloads. The models and analysis are validated against measurements of parallel applications and shown to be in excellent agreement. Our analysis is used to examine various performance measures of parallel applications as a function of the number of processors and disks assigned to them. In addition to this formal analysis, our methodology provides several practical benefits, including significantly more accurate predictions than from measurements alone and considerable simplifications in acquiring (as well as more accurate) measurements.

We then exploit our formal models and analysis of parallel program and workload behavior as the foundation for our analysis of coordinated processor and disk resource allocation strategies. A set of scheduling policies is evaluated, which spans the range from simple space-sharing schemes to combinations of space-sharing and time-sharing where the I/O requests of some jobs are overlapped with the computation of other jobs. Our formal analysis includes the derivation of a generalization of standard flow-equivalent service centers to capture the contention for disk resources in our analytic scheduling models. Our results demonstrate that when there is contention for the disk resources, the trends in system performance can be significantly different from those appearing in the research literature when I/O behavior is negligible or not explicitly considered. In particular, the trends for the best partition size under static space-sharing do not decrease with increasing system load in most cases, and many of the static policies outperform a favorable version of dynamic partitioning (i.e., one with no reconfiguration overhead). We show that the I/O behavior of parallel workloads can have a dominating effect on system performance under various resource allocation policies, in a manner somewhat analogous to earlier results demonstrating that memory overhead can dominate the execution time of a parallel job when it is allocated too few processors [9, 13, 8]. Our results further show the potential for significant performance benefits from combining space-sharing with time-sharing to overlap the I/O of some jobs with the computation of other jobs.

§2 presents our models and analysis of parallel program and workload behavior. §3 presents our models and analysis of several parallel scheduling policies under workloads based on the results in §2. A performance comparison of these policies is provided in §4, and our concluding remarks are given in §5. Throughout this paper, we omit some of the technical details and results of our study in the interest of space, and we refer the interested reader to [11].

2 Program and Workload Behavior Models and Analysis

A general model of program and workload behavior is formulated that captures the computation and I/O characteristics of parallel applications. We derive an analysis of the model, including a generalization of Amdahl's law [1], which is then used to characterize the behavior of parallel programs based on actual measurements. Throughout the remainder of the paper, we use $\mathbb{Z}^+$ and $\mathbb{R}^+$ to denote the sets of positive integers and positive reals, respectively.

2.1 Parallel Program Behavior

The I/O properties of many parallel programs result in execution behavior that can be naturally partitioned into disjoint intervals, each of which consists of a single burst of I/O activity (possibly with a minimal amount of computation) followed by a single burst of computation (possibly with a minimal amount of I/O). We use the term phase to refer to each such interval composed of an I/O burst followed by a computation burst, and we refer to a sequence of consecutive (sets of) phases that are statistically identical as an I/O working set. The execution behavior of a program is therefore comprised of a sequence of I/O working sets. This general model of program behavior is consistent with results from recent measurement studies (e.g., [14, 15]). As a simple example, Fig. 1 plots measurement data for a parallel implementation of the Schwinger Multichannel method for calculating low-energy electron-molecule collisions (ESCAT).

Figure 1: Measurements of program behavior for ESCAT (I/O operation cost, in seconds, on a log scale, versus execution time, in seconds).

Consider such a program executed on one processor using one disk (or any minimum resource allocation required by an application). Let $N \in \mathbb{Z}^+$ denote the number of phases for the program of interest, and define $T_n \in \mathbb{R}^+$ to be the duration, or length, of the $n$th phase, $1 \le n \le N$.

The relative length of the $n$th phase is then given by $L_n = T_n/T$, where $T \equiv \sum_{n=1}^{N} T_n$ is the total execution time of the program. Let $T^{IO}_n$ and $T^{CPU}_n$ be the lengths of the I/O and computation bursts of the $n$th phase, respectively, and thus $T_n = T^{IO}_n + T^{CPU}_n$. We then define $F_n = T^{IO}_n/T_n$ to be the fraction of the $n$th phase that represents the length of the phase's I/O burst; hence, $1 - F_n$ is the fraction of the phase that represents computation. Note that during program initialization and termination, either $T^{IO}_i$ or $T^{CPU}_i$ can be (essentially) zero, $i \in \{1, N\}$.

A simple example of these model parameter definitions is graphically illustrated in Fig. 2, where (a) and (b) show the phase behavior of a program with respect to absolute and relative execution time, respectively. We refer to the relative execution time diagram as the phase signature for the program. This example consists of an initial phase composed solely of a small amount of computation (i.e., $F_1 = 0$), followed by a phase that primarily reads the problem data from disk (i.e., $F_2 \approx 0.67$). The program then repeats three phases that are computationally intensive (i.e., $F_i \approx 0.33$, $3 \le i \le 5$), which together comprise the first of two multi-phase I/O working sets in this example. The subsequent I/O working set is comprised of two balanced phases (i.e., $F_i \approx 0.5$, $i = 6, 7$), representing the second multi-phase working set. The final phase consists of writing the results to disk (i.e., $F_8 \approx 0.8$).

Figure 2: Example of program behavior with $N = 8$ phases: (a) absolute execution time and (b) phase signature.
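To make the phase-signature quantities concrete, the following minimal sketch computes $L_n$ and $F_n$ from per-phase burst lengths. The absolute durations below are hypothetical values chosen only to reproduce the fractions quoted above for the Fig. 2 example.

```python
# Hypothetical (T_IO, T_CPU) burst lengths, in seconds, for the 8-phase
# example of Fig. 2; only the ratios F_n = T_IO / T_n come from the text.
bursts = [(0.0, 5.0),                          # phase 1: computation only, F_1 = 0
          (8.0, 4.0),                          # phase 2: reads problem data, F_2 ~ 0.67
          (4.0, 8.0), (4.0, 8.0), (4.0, 8.0),  # phases 3-5: F_i ~ 0.33
          (5.0, 5.0), (5.0, 5.0),              # phases 6-7: balanced, F_i ~ 0.5
          (8.0, 2.0)]                          # phase 8: writes results, F_8 ~ 0.8

T = sum(io + cpu for io, cpu in bursts)        # total execution time T
for n, (io, cpu) in enumerate(bursts, start=1):
    T_n = io + cpu
    print(f"phase {n}: L_n = {T_n / T:.3f}, F_n = {io / T_n:.2f}")
```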

We now consider a parallel system consisting of $P$ identical processors and $D$ identical disks. When the program of interest is executed on $p$ processors and $d$ disks, the computation and the I/O requirements of the program have both parallel and sequential components, as previously observed for the computational aspects of parallel programs [1]. Let $o^{CPU}_{n,p,d}$ and $o^{IO}_{n,p,d}$ respectively denote the durations of the sequential components of the computation and I/O bursts of the $n$th phase when executed on $p$ processors and $d$ disks, $1 \le p \le P$, $1 \le d \le D$. Define $s^{CPU}_{n,p,d}$ and $s^{IO}_{n,p,d}$ to be the respective scaling factors for the lengths of the parallel components of the computation and I/O bursts of the $n$th phase of the program when it is executed on $p$ processors using $d$ disks. The execution time for each component of the $n$th phase can then be expressed as

$$T^x_{n,p,d} = s^x_{n,p,d}\,\big(T^x_n - o^x_{n,1,1}\big) + o^x_{n,p,d}, \qquad (1)$$

where $s^x_{n,1,1} \equiv 1$, $s^x_{n,p,d} \ge 0$, $x \in \{IO, CPU\}$. (All subsequent uses of the indices $n$, $p$, $d$ and $x$ are intended to be under the constraints $1 \le n \le N$, $1 \le p \le P$, $1 \le d \le D$ and $x \in \{IO, CPU\}$, respectively, unless noted otherwise.)

Each of the above model parameters is assumed to be a random variable (r.v.) with a general probability distribution function (PDF) and finite expectation. For a given program with a particular data set executed on a specific system configuration, these variables may be deterministic. When the program is executed on different data sets, however, the PDFs for these r.v.s capture the differences in execution times. Moreover, an entire class of applications, or an entire application workload, can be represented via the PDFs for the r.v.s in eq. (1).

Let $\widehat{N}$ denote the maximum number of phases in the class of applications of interest, i.e., $\widehat{N} \equiv \sup\{n : P[N \ge n] > 0,\ n \in \mathbb{Z}^+\}$. The expected execution time $E[T_{p,d}]$ of this class of applications when executed on $p$ processors and $d$ disks can then be expressed as $E[T_{p,d}] = \sum_{n=1}^{\widehat{N}} E[T^{IO}_{n,p,d} \cdot I_{\{N \ge n\}}] + E[T^{CPU}_{n,p,d} \cdot I_{\{N \ge n\}}]$, where $I_A$ denotes the indicator function for event $A$, having the value 1 if $A$ occurs and the value 0 if event $A$ does not occur. Since the r.v. $N$ is a stopping time, $T^x_{n,p,d}$ and $I_{\{N \ge n\}}$ are independent for each $n$, and thus

$$E[T_{p,d}] = \sum_{n=1}^{\widehat{N}} E\big[T^{IO}_{n,p,d}\big]\,\bar{F}_n + E\big[T^{CPU}_{n,p,d}\big]\,\bar{F}_n, \qquad (2)$$

where $\bar{F}_n \equiv P[N \ge n]$.

This expression can be simplified considerably, with the exact form depending upon the properties of $N$, $T^x_{n,p,d}$ and the elements of $T^x_{n,p,d}$ for the class of applications of interest. As a specific example, upon substituting eq. (1) into (2) with $Z^x_n \equiv T^x_n - o^x_{n,1,1}$ and the r.v.s $s^x_{n,p,d}$ and $Z^x_n$ independent, we can simplify eq. (2) to

$$E[T_{p,d}] = \sum_{n=1}^{\widehat{N}} \bar{F}_n \Big( E\big[s^{IO}_{n,p,d}\big] E\big[Z^{IO}_n\big] + E\big[o^{IO}_{n,p,d}\big] + E\big[s^{CPU}_{n,p,d}\big] E\big[Z^{CPU}_n\big] + E\big[o^{CPU}_{n,p,d}\big] \Big). \qquad (3)$$

This case is chosen here because its assumptions hold for, and thus it is used to obtain, the results presented in §2.5. The simplified forms of eq. (2) for all other cases can be found in [11], together with the details of all derivations.
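To illustrate how eq. (3) is applied in practice, the sketch below evaluates $E[T_{p,d}]$ from per-phase expectations for a fixed allocation $(p, d)$. All parameter values are hypothetical placeholders rather than measured data.

```python
# A minimal sketch of eq. (3), assuming the per-phase expectations
# E[s_x], E[Z_x] and E[o_x] (x in {IO, CPU}) are already available for the
# allocation (p, d) of interest; the numbers here are hypothetical.
phases = [
    # (E[s_IO], E[Z_IO], E[o_IO], E[s_CPU], E[Z_CPU], E[o_CPU]) per phase n
    (0.25, 8.0, 0.5, 0.10, 40.0, 1.0),
    (0.25, 6.0, 0.5, 0.10, 30.0, 1.0),
]
Fbar = [1.0, 0.8]   # Fbar_n = P[N >= n], non-increasing in n

E_T = sum(fb * (s_io * z_io + o_io + s_cpu * z_cpu + o_cpu)
          for fb, (s_io, z_io, o_io, s_cpu, z_cpu, o_cpu) in zip(Fbar, phases))
print(f"E[T_(p,d)] = {E_T:.2f} seconds")
```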

2.2 Performance Metrics

An important and standard measure to characterize the behavior of a parallel program is speedup, which in the present context is defined as the ratio of the execution time on a single processor and disk to the execution time on $p$ processors and $d$ disks. In the stochastic setting, speedup can be expressed in a number of different ways. One particular definition of interest here is given by

$$S_{p,d} = \frac{E[T_{1,1}]}{E[T_{p,d}]}, \qquad (4)$$

which describes the speedup surface in the space $(S_{p,d}, p, d)$. Note that this expression for speedup is consistent with that used in [6], where speedup is defined to be the ratio of the expected response time in a queueing system with one processor to the expected response time in this system with multiple processors. We also note, however, that alternative expressions for speedup are considered in [11].

The efficiency of a program is typically defined as the speedup per processor allocated to the application. Since the resources allocated in the present context include the disks used by the application, we express program efficiency as $E_{p,d} = S_{p,d}/g(p,d) = E[T_{1,1}]/(g(p,d)\,E[T_{p,d}])$, where $g(p,d)$ is a general function that can be used to scale the speedup according to any desired weightings of the two resource allocations. Note that the traditional upper bound of unity for efficiency holds when $d = 1$, but in general the upper bound depends upon the specific form of the function $g(p,d)$.

The speedup can also be scaled by a general cost function of the resources allocated, $C(p,d)$, which we define here to be $C(p,d) = g(p,d)/S_{p,d}$. This yields a particular measure called program efficacy, as follows: $\eta_{p,d} = S^2_{p,d}/g(p,d) = E[T_{1,1}]^2/(g(p,d)\,E[T_{p,d}]^2)$.

2.3 Generalized Amdahl's Law

Our model and analysis of §2.1 and §2.2 can be used to derive several fundamental properties of the behavior of parallel programs. We now present one such derivation that extends Amdahl's law to three-dimensional space to include processors and disks.

Upon substituting (2) into eq. (4) and multiplying both the numerator and denominator by $E[T_{1,1}]^{-1}$, we obtain an expression that under the assumptions of eq. (3) leads to

$$S_{p,d} = \left\{ E[T_{1,1}]^{-1} \sum_{n=1}^{\widehat{N}} \Big( E\big[s^{IO}_{n,p,d}\big] E\big[Z^{IO}_n\big] + E\big[o^{IO}_{n,p,d}\big] + E\big[s^{CPU}_{n,p,d}\big] E\big[Z^{CPU}_n\big] + E\big[o^{CPU}_{n,p,d}\big] \Big)\,\bar{F}_n \right\}^{-1}. \qquad (5)$$

This expression, together with the corresponding equations in [11] for different assumptions, can be viewed as a general form of Amdahl's law, with simpler forms obtained by making additional assumptions.

Define $f^x_{par} \equiv \sum_{n=1}^{\widehat{N}} E[Z^x_n]\,\bar{F}_n / E[T_{1,1}]$ and $f^x_{seq} \equiv \sum_{n=1}^{\widehat{N}} E[o^x_{n,p,d}]\,\bar{F}_n / E[T_{1,1}]$ to be the cumulative parallel and sequential fractions of the $x$-component of a parallel program, respectively. Under the assumption that the r.v.s $s^{IO}_{n,p,d}$ and $s^{CPU}_{n,p,d}$ are identical for all phases $n$, i.e., $s^x_{n,p,d} = h^{-1}_x(p,d)$, we can express eq. (5) as

$$S_{p,d} = \Big[ f^{IO}_{par}\,E\big[h^{-1}_{IO}(p,d)\big] + f^{IO}_{seq} + f^{CPU}_{par}\,E\big[h^{-1}_{CPU}(p,d)\big] + f^{CPU}_{seq} \Big]^{-1}, \qquad (6)$$

where $h^{-1}_x(p,d)$ are general functions that can be used to scale the speedup according to any desired weightings of the two resource allocations.

An even simpler form can be obtained by setting $E[h^{-1}_{IO}(p,d)] = d^{-1}$ and $E[h^{-1}_{CPU}(p,d)] = p^{-1}$ in (6), which yields

$$S_{p,d} = \Big[ f^{IO}_{seq} + f^{CPU}_{seq} + f^{IO}_{par}/d + f^{CPU}_{par}/p \Big]^{-1}. \qquad (7)$$

Note that eq. (7) reduces to the original version of Amdahl's law upon ignoring the I/O component of the parallel program and observing that $f^{CPU}_{seq} + f^{CPU}_{par} = 1$.

2.4 Workload Model and Analysis

A general model of parallel workloads can be formulated as a mixture of applications, each of which is represented by the model and analysis of §2.1–2.3. Although we omit the details here, our analysis leads quite naturally to an expression of the form given in (2) where the PDFs for the variables on the RHS capture the details of the parallel workload as a whole in addition to individual (classes of) applications. Given this form, one can follow the derivation in the previous sections to obtain a series of expressions analogous to those in (3) through (7) to characterize a general parallel workload.

2.5 Program Behavior Results

We now apply our models and analysis to characterize the behavior of real parallel applications, demonstrating that our methodology is an extremely accurate and powerful abstraction for application scalability as a function of its resource requirements. A representative sample is provided here. Our experiments were conducted on a 512-node Intel Paragon XP/S with 64 4-GB Seagate disks, using Pablo [10] to collect performance data. Each application was executed in single-user mode on the assigned processor and disk partitions, the sizes of which were varied from 1 to 64. Application data were distributed on the disk partition according to the Intel Parallel File System strategy using the default striping unit (i.e., 64 KB), starting from a randomly selected disk, so as to keep the load balanced across the disks. Alternative striping parameters were not considered because our interest is in analyzing application resource requirements rather than investigating parallel file system performance.

The applications used in our experiments consisted of those in the Scalable Input/Output Initiative (SIO) suite that have non-negligible resource requirements. Each application consists of a number of programs, or stages, executed in a pipeline fashion. The measurements for the individual stages of each application were used to parameterize the program behavior model described in §2.1–2.3. We first note that comparisons between the estimates from the equations in §2.2 and the corresponding measurement data for the entire application show that the results from our model are in excellent agreement with the measurements. We further note that our model can significantly simplify the measurement process by allowing one to measure each unique phase in isolation. This can also lead to more accurate performance data by reducing the likelihood and impact of external events on the measured times for long-running applications.

Our program behavior model can also be used to accurately estimate the value of performance metrics for the cases where measurements are not available. In particular, we used our model parameterized by the measurements for smaller values of $p$ and $d$ to extrapolate the value of performance metrics for larger values of $p$ and $d$. We then took these model predictions for the largest $p$ and $d$ values for which we also had total execution time measurement data, and compared them with the corresponding results obtained directly by extrapolating from the overall execution time measurements for smaller values of $p$ and $d$. Our results demonstrate that the model estimates for these cases are essentially identical to the measurement data, and they are significantly more accurate than those obtained from the overall execution time measurements. The estimates based on our model can lead to considerably more accurate predictions than those obtained purely from total execution times because the model captures the computation and I/O characteristics of the parallel applications in detail.

We next consider the three-dimensional speedup surfaces obtained from our modeling analysis parameterized by the measurement data. In particular, Fig. 3 illustrates the speedup functions from our models for the QCRD (quantum chemical reaction dynamics) and MESSKIT (electronic structure calculations via the Hartree-Fock algorithm) applications with respect to both the processor and disk partition sizes. These results clearly illustrate that speedup does not necessarily increase as the number of allocated disks and processors increases. In fact, the quantitative computation and I/O requirements of the application, and their interactions, are among the important factors determining its ideal number of processors and disks.

To examine more carefully the relative computation and I/O requirements of the applications, we focus on the phase behavior of QCRD's second stage and MESSKIT's SCF stage. The phase behavior and phase signature for each program are obtained via the analysis of §2.1. QCRD's phase signature is illustrated in Fig. 4 and SCF's phase signature is illustrated in Fig. 5. As indicated by both figures, the I/O activity can be characterized as repetitive and bursty, with 12 distinct phases for QCRD and 5 distinct phases for SCF. Since for both programs the overall signature is roughly uniform, each application is characterized by a single I/O working set. We ignore the negligible sequential I/O and computation activity at the beginning and at the end of each program. Once the I/O working set is identified, it is trivial to apply our methodology and derive the expected application execution time.
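To illustrate how a speedup surface such as those in Fig. 3 follows from the generalized Amdahl's law of eq. (7), the sketch below evaluates $S_{p,d}$ over a grid of allocations. The four workload fractions are hypothetical, not the measured QCRD or MESSKIT values.

```python
# Generalized Amdahl's law, eq. (7): S_{p,d} is driven by the sequential and
# parallel fractions of both the CPU and the I/O components. The fractions
# below are hypothetical and sum to 1.
f_io_seq, f_cpu_seq = 0.02, 0.03
f_io_par, f_cpu_par = 0.35, 0.60

def speedup(p, d):
    return 1.0 / (f_io_seq + f_cpu_seq + f_io_par / d + f_cpu_par / p)

for d in (1, 4, 16, 64):
    row = [f"{speedup(p, d):6.2f}" for p in (1, 4, 16, 64)]
    print(f"d={d:2d}:", " ".join(row))
# With few disks, speedup saturates near 1/(f_seq + f_io_par/d) no matter how
# many processors are added, mirroring the flat regions of the Fig. 3 surfaces.
```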


Figure 4: Phase behavior and signature of QCRD's stage 2.


Figure 3: Speedup surface of QCRD (stages two and four) and MESSKIT (stages ARGOS and SCF).


Figure 5: Phase behavior and signature of MESSKIT's SCF.

3 Scheduling Models and Analysis

Exploiting the analysis of §2 to characterize parallel application and workload behavior, in this section we develop a formal framework to investigate the coordinated allocation of processor and disk resources in parallel systems. First, we present our parallel system model, and then we derive a general analysis of the sojourn time through a disk partition when multiple processor partitions share these disks, based on decomposition and a generalization of standard flow-equivalent service centers [4]. Some of the scheduling policies considered in our study, and specific aspects of their respective scheduling models, are briefly described in turn.

3.1 Parallel System Model

We consider a parallel system consisting of $P$ identical processors and $D$ identical disks, both of which are allocated according to a particular scheduling policy. An FCFS infinite-capacity queue is used to hold all jobs that are waiting to be allocated resources. Let $M_P$ and $M_D$ denote the minimum numbers of processors and disks allocated to any parallel job, respectively; therefore the maximum numbers of processor and disk partitions are respectively given by $N_P = \lfloor P/M_P \rfloor$ and $N_D = \lfloor D/M_D \rfloor$, and hence $p \ge M_P$ and $d \ge M_D$ throughout.

The interarrival times of parallel jobs are modeled as i.i.d. $\mathbb{R}^+$-valued r.v.s having PDF $A(\cdot)$, whereas the parallel job service times depend upon the number of processors and disks allocated to the job. Following our model of §2 for the entire workload, the per-phase service times are modeled as independent sequences of i.i.d. $\mathbb{R}^+$-valued r.v.s having PDFs $B^{IO}_{n,p,d}(\cdot)$ and $B^{CPU}_{n,p,d}(\cdot)$. The PDFs for all parameters of our scheduling models are assumed to be of PH-type. In particular, $A = PH(\alpha, S_A)$ of order $m_A$ with mean $\lambda^{-1} = -\alpha S_A^{-1}\mathbf{e} < \infty$, and $B^x_{n,p,d} = PH(\beta^x_{n,p,d}, S^x_{B,n,p,d})$ of order $m^x_{B,n,p,d}$ with mean $(\mu^x_{n,p,d})^{-1} = -\beta^x_{n,p,d}\,(S^x_{B,n,p,d})^{-1}\mathbf{e} < \infty$, where $\mathbf{e}$ denotes the column vector of all ones. It follows directly from the convolution properties of PH-type PDFs that the total execution time of a job, when exclusively executed on a processor partition of size $p$ and a disk partition of size $d$, is a PH-type PDF of order $\sum_{n=1}^{N} m^{IO}_{B,n,p,d} + m^{CPU}_{B,n,p,d}$ [11].

The use of PH-type PDFs is of theoretical importance in that we exploit their mathematical properties to derive tractable solutions to aspects of our general analysis while capturing the fundamental elements of the parallel scheduling systems of interest. It is also of practical importance in that a number of algorithms have been developed for fitting PH-type PDFs to empirical data.
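As a concrete illustration of these PH-type manipulations, the sketch below computes the mean $-\alpha S^{-1}\mathbf{e}$ of a PH distribution and builds the standard PH representation of the sum of two independent PH random variables, which is the order-additivity used above for the total execution time. The two-stage parameters are hypothetical stand-ins, not fitted values.

```python
import numpy as np

def ph_mean(alpha, S):
    """Mean of PH(alpha, S): -alpha S^{-1} e."""
    return -alpha @ np.linalg.solve(S, np.ones(len(alpha)))

def ph_convolve(alpha, S, beta, T):
    """PH representation of the sum of independent PH(alpha, S) and PH(beta, T).

    The exit rates of the first distribution feed the initial vector of the
    second, so the order of the result is the sum of the two orders
    (assuming alpha places no mass at absorption).
    """
    m1, m2 = len(alpha), len(beta)
    s0 = -S @ np.ones(m1)                      # exit-rate vector of PH(alpha, S)
    gamma = np.concatenate([alpha, np.zeros(m2)])
    U = np.block([[S, np.outer(s0, beta)],
                  [np.zeros((m2, m1)), T]])
    return gamma, U

# Hypothetical two-stage examples standing in for a CPU and an I/O burst PDF.
a, S = np.array([1.0, 0.0]), np.array([[-2.0, 2.0], [0.0, -3.0]])  # Erlang-like
b, T = np.array([0.7, 0.3]), np.array([[-1.0, 0.0], [0.0, -4.0]])  # hyperexponential
g, U = ph_convolve(a, S, b, T)
print(ph_mean(a, S) + ph_mean(b, T), ph_mean(g, U))  # the two means agree
```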

3.2 Decomposition of I/O Center

The parallel system is modeled as an open queueing network with a multi-server processor center and a multi-server disk center. However, we can significantly simplify our analysis of the stochastic processes for the space-sharing scheduling models by using a single composite center to represent the processor and disk partitions. In particular, if the processor and disk partitions are not shared by multiple jobs, an equivalent queueing system is obtained by using the convolution of the per-phase computation and I/O PDFs $B^{CPU}_{n,p,d}(\cdot)$ and $B^{IO}_{n,p,d}(\cdot)$ for the single-center service time PDF. Otherwise, when there is contention for a disk partition, we use decomposition to analyze the corresponding processor partitions and disk partition in isolation and obtain the corresponding sojourn-time PDF(s) for a generalized flow-equivalent center. The flow-equivalent PDF(s) for the time spent in each I/O phase is then used together with the per-phase computation PDFs to characterize the total execution time at the composite center.

We now derive the details of our approach for the case where the phases basically have the same PDF. This simplifies the presentation of our approach, and this assumption holds for the applications considered in our numerical experiments presented in §4. Note, however, that our general approach can address the case where different phases have different PDFs by using a multiple-class queueing network in the analysis of the aggregate network representing the processor partitions and disk partition in isolation, which is viable in practice due to our efficient solution.

Consider a closed two-center queueing network with a single delay center, representing the set of $Y \ge 2$ processor partitions sharing the disk partition, and a single queueing center, representing the disk partition. The service times at the computation center follow an order $m_P$ PH-type PDF $B_P = PH(\psi, S^P_B)$ with mean $Z_P = -\psi\,(S^P_B)^{-1}\mathbf{e}$, which is constructed from the PDFs $B^{CPU}_{n,p,d}(\cdot)$ for the processor partitions being considered. Similarly, the service times at the disk center follow an order $m_D$ PH-type PDF $B_D = PH(\phi, S^D_B)$ with mean $Z_D = -\phi\,(S^D_B)^{-1}\mathbf{e}$, which is constructed from the PDFs $B^{IO}_{n,p,d}(\cdot)$ for the disk partition of interest.

We represent this queueing network as a CTMC $\{X(t)\,;\ t \ge 0\}$ defined on a finite state space $\Omega$, whose elements are given by $(i, j^D, j^P)$ where $j^z \equiv (j^z_1, j^z_2, \ldots, j^z_{m_z})$, $z \in \{D, P\}$, $j^D_\ell \ge 0$, $\sum_{\ell=1}^{m_D} j^D_\ell = i$, $j^P_\ell \ge 0$, $\sum_{\ell=1}^{m_P} j^P_\ell = Y - i$. The state vector variable $i \in \{0, \ldots, Y\}$ denotes the number of jobs at the disk center, $j^D_\ell \in \{0, \ldots, i\}$ denotes the number of jobs at the disk center whose service time process is in stage $\ell$, and $j^P_\ell \in \{0, \ldots, Y-i\}$ denotes the number of jobs at the computation center whose service time process is in stage $\ell$. For convenience, we also partition the state space $\Omega$ into $Y+1$ disjoint sets $\Omega_i \equiv \{(i, j^D, j^P) \mid (i, j^D, j^P) \in \Omega\}$, $0 \le i \le Y$.

Let $y_{i,w} \in \Omega_i$, $1 \le w \le D_i \equiv |\Omega_i|$, $0 \le i \le Y$, be a lexicographical ordering of the elements of $\Omega_i$. We then define $\pi \equiv (\pi_0, \ldots, \pi_Y)$, $\pi_i \equiv (\pi(y_{i,1}), \pi(y_{i,2}), \ldots, \pi(y_{i,D_i}))$ and $\pi(y_{i,w}) \equiv \lim_{t \to \infty} P[X(t) = y_{i,w}]$, $y_{i,w} \in \Omega_i$, $1 \le w \le D_i$, $0 \le i \le Y$. Assuming the stochastic process $\{X(t);\ t \ge 0\}$ to be irreducible and positive recurrent, the invariant probability vector is uniquely determined by solving the global balance equations $\pi Q = 0$ together with the normalizing constraint $\pi\mathbf{e} = 1$, where $Q$ is the infinitesimal generator matrix for the process.
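A small sketch of the state-space construction just described, enumerating the sets $\Omega_i$ by listing all stage-occupancy vectors $j^D$ and $j^P$; the helper names are illustrative, not from the paper.

```python
from itertools import combinations

def occupancy_vectors(total, stages):
    """All vectors of `stages` nonnegative integers summing to `total`
    (stars-and-bars enumeration)."""
    for cuts in combinations(range(total + stages - 1), stages - 1):
        prev, vec = -1, []
        for c in cuts:
            vec.append(c - prev - 1)
            prev = c
        vec.append(total + stages - 2 - prev)
        yield tuple(vec)

def build_state_space(Y, m_D, m_P):
    """Return the sets Omega_0, ..., Omega_Y of states (i, j_D, j_P)."""
    return [[(i, jD, jP)
             for jD in occupancy_vectors(i, m_D)
             for jP in occupancy_vectors(Y - i, m_P)]
            for i in range(Y + 1)]

omega = build_state_space(Y=3, m_D=2, m_P=2)
print([len(omega_i) for omega_i in omega])   # the block sizes D_0, ..., D_Y
```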

The generator $Q$, arranged in the same order as the elements of the stationary probability vector $\pi$, is block tridiagonal:

$$Q = \begin{bmatrix}
A_{0,1} & A_{0,0} & 0 & \cdots & 0 & 0 \\
A_{1,2} & A_{1,1} & A_{1,0} & \cdots & & \\
0 & A_{2,2} & A_{2,1} & \cdots & & \\
\vdots & & & \ddots & A_{Y-2,0} & 0 \\
& & & & A_{Y-1,1} & A_{Y-1,0} \\
0 & 0 & 0 & \cdots & A_{Y,2} & A_{Y,1}
\end{bmatrix}, \qquad (8)$$

where $A_{i,2}$, $A_{i,1}$ and $A_{i,0}$ are finite matrices of dimensions $D_i \times D_{i-1}$, $D_i \times D_i$ and $D_i \times D_{i+1}$, respectively, $0 \le i \le Y$. A simple yet general method for constructing these matrices for all instances of our model can be found in [11].

The invariant vector $\pi$ then satisfies the equations

$$\pi_0 A_{0,1} + \pi_1 A_{1,2} = 0, \qquad (9)$$
$$\pi_{k-1} A_{k-1,0} + \pi_k A_{k,1} + \pi_{k+1} A_{k+1,2} = 0, \qquad (10)$$
$$\pi_{Y-1} A_{Y-1,0} + \pi_Y A_{Y,1} = 0, \qquad (11)$$

and $\pi\mathbf{e} = 1$, $1 \le k \le Y-1$. Define $R(Y) \equiv A_{Y,1}$, $R(k) \equiv A_{k,1} - A_{k,2}\,R(k-1)^{-1} A_{k-1,0}$, $0 \le k \le Y-1$, and $R(-1) \equiv 0$. Substituting this into eqs. (9)–(11), we obtain

$$\pi_k = -\pi_{k+1} A_{k+1,2}\,R(k)^{-1}, \quad 0 \le k \le Y-1, \qquad (12)$$
$$\pi_Y = -\pi_{Y-1} A_{Y-1,0}\,R(Y)^{-1}, \qquad (13)$$

where $R(k)$ has dimension $D_k \times D_k$ and is non-singular. The vector $\pi_{Y-1}$ can then be determined up to a multiplicative constant by solving $\pi_{Y-1}\Phi = 0$ and $\pi_{Y-1}\mathbf{e} = 1$, where $\Phi = A_{Y-1,1} - A_{Y-1,2}\,R(Y-2)^{-1} A_{Y-2,0} - A_{Y-1,0}\,R(Y)^{-1} A_{Y,2}$, for $Y \ge 2$ (note that $\Phi = A_{1,1} - A_{1,2}\,R(0)^{-1} A_{0,0}$ when $Y = 1$). Upon substitution into eqs. (12) and (13), we obtain the remaining components $\pi_0, \ldots, \pi_{Y-2}, \pi_Y$ up to that same constant, from which the vector $\pi$ is uniquely determined by the normalizing equation $\pi\mathbf{e} = 1$. Note that this approach for calculating $\pi$ significantly reduces the computational complexity over that of numerically solving the global balance and normalizing equations directly, and thus makes it possible to solve large systems that are otherwise prohibitively expensive.

Given the stationary vector $\pi$, the $k$th moments of the number of jobs at the two centers are given by

$$E[U^k_P] = \sum_{i=0}^{Y-1} (Y-i)^k\,\pi_i\mathbf{e}, \qquad E[U^k_D] = \sum_{i=1}^{Y} i^k\,\pi_i\mathbf{e}, \qquad (14)$$

$Y \ge 2$. From Little's law [4], the throughput of the aggregate network can be expressed as $X_A = E[U_P]/Z_P$, and then we can obtain the mean sojourn time through the disk center as $E[Z'_D] = E[U_D]/X_A$. Higher moments of the sojourn time $Z'_D$ can be obtained from (14), $B_P$, $\pi$ and generalizations of Little's law (see [11] for the technical details and references). We then construct an order $m'_D$ PH-type PDF $B'_D = PH(\phi', S^{D'}_B)$ with mean $E[Z'_D] = -\phi'\,(S^{D'}_B)^{-1}\mathbf{e}$, to match as many moments of $Z'_D$ as desired. This PDF(s) for the per-phase sojourn time through the disk partition is then used together with the per-phase computation PDF(s) $B_P = PH(\psi, S^P_B)$ and convolution to characterize the total execution time at the composite center in our analytic space-sharing models.
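The backward recursion of eqs. (9)–(14) translates directly into code. The sketch below is a minimal numpy implementation for $Y \ge 2$, assuming the blocks $A_{i,2}$, $A_{i,1}$, $A_{i,0}$ of eq. (8) have already been constructed (e.g., per the method in [11]); function names are illustrative.

```python
import numpy as np

def stationary_vector(A2, A1, A0, Y):
    """Solve pi Q = 0, pi e = 1 for the block-tridiagonal Q of eq. (8), Y >= 2.

    A1[i] is the diagonal block A_{i,1}; A2[i] the sub-diagonal A_{i,2}
    (i >= 1); A0[i] the super-diagonal A_{i,0} (i <= Y-1).
    """
    # R(k) recursion of eqs. (12)-(13); R(-1) = 0 gives R(0) = A_{0,1}.
    R = {Y: A1[Y], 0: A1[0]}
    for k in range(1, Y):
        R[k] = A1[k] - A2[k] @ np.linalg.inv(R[k - 1]) @ A0[k - 1]
    # pi_{Y-1} is a left null vector of Phi, fixed by the normalization.
    Phi = (A1[Y - 1] - A2[Y - 1] @ np.linalg.inv(R[Y - 2]) @ A0[Y - 2]
           - A0[Y - 1] @ np.linalg.inv(R[Y]) @ A2[Y])
    M = Phi.T.copy()
    M[-1, :] = 1.0                       # replace one balance eq. by pi e = 1
    rhs = np.zeros(M.shape[0]); rhs[-1] = 1.0
    pi = {Y - 1: np.linalg.solve(M, rhs)}
    pi[Y] = -pi[Y - 1] @ A0[Y - 1] @ np.linalg.inv(R[Y])         # eq. (13)
    for k in range(Y - 2, -1, -1):                               # eq. (12)
        pi[k] = -pi[k + 1] @ A2[k + 1] @ np.linalg.inv(R[k])
    c = sum(v.sum() for v in pi.values())                        # renormalize
    return {k: v / c for k, v in pi.items()}

def mean_queue_lengths(pi, Y):
    """First moments of eq. (14): E[U_P] and E[U_D]."""
    E_UP = sum((Y - i) * pi[i].sum() for i in range(Y))
    E_UD = sum(i * pi[i].sum() for i in range(1, Y + 1))
    return E_UP, E_UD
```

From these, the throughput $X_A = E[U_P]/Z_P$ and the mean disk-center sojourn time $E[Z'_D] = E[U_D]/X_A$ follow as in the text.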

3.3 Static Space-Sharing

The static space-sharing strategy SP($K_P$, $K_D$) divides the $P$ processors into $K_P$ partitions of size $\lfloor P/K_P \rfloor$ or $\lceil P/K_P \rceil$ such that $\lfloor P/K_P \rfloor \ge M_P$, and similarly divides the $D$ disks into $K_D$ partitions of size $\lfloor D/K_D \rfloor$ or $\lceil D/K_D \rceil$ such that $\lfloor D/K_D \rfloor \ge M_D$. Processor partitions are used exclusively, whereas disk partitions may be shared. Each job arrival is allocated one such pair of partitions if one is available; otherwise the job waits in the FCFS queue until a partition pair becomes free. Every parallel job is executed to completion without interruption, and all of the resources in the processor partition are reserved by the application throughout this duration. Upon job departure, the available partition pair is allocated to the job at the head of the system queue.

3.4 Dynamic Space-Sharing

The dynamic space-sharing strategy DP($K_D$) attempts to equally allocate the $P$ processors among the parallel jobs in the system, whereas the disks are divided into $K_D$ partitions as in SP($K_P$, $K_D$). In particular, upon the arrival of a job when the system is executing $i$ jobs, $0 \le i < N_P$, the $P$ processors are repartitioned among the $i+1$ jobs such that each job is allocated a processor partition either of size $\lfloor P/(i+1) \rfloor$ or of size $\lceil P/(i+1) \rceil$. A job arrival that finds $i \ge N_P$ jobs in the system is placed in the FCFS queue to wait until a processor partition becomes available. When one of the $i$ jobs in the system departs, $2 \le i \le N_P$, the system reconfigures the processor allocations among the remaining $i-1$ jobs such that there are only partitions of sizes $\lfloor P/(i-1) \rfloor$ and $\lceil P/(i-1) \rceil$. A departure when the system contains more than $N_P$ jobs causes the job at the head of the queue to be allocated the available processor partition, and no reconfiguration is performed. The times required to repartition the processors among the $i$ jobs to be executed are modeled as i.i.d. r.v.s having an order $m_{C,i}$ PH-type PDF $C_i = PH(\sigma_i, S_{C,i})$ with mean $\nu_i^{-1} = -\sigma_i S_{C,i}^{-1}\mathbf{e} < \infty$, $1 \le i \le N_P$.

3.5 Static Space-Sharing with Time-Sharing

The static space-sharing with time-sharing strategy TS($K_P$, $K_D$) divides the processors and disks into scheduling pairs as in SP($K_P$, $K_D$). The key feature of this policy is that each processor-disk pair may be shared up to a maximum multiprogramming level of $M$ jobs. The policy takes advantage of the bursty CPU and I/O behavior of the jobs by interleaving $M$ jobs on each processor and disk partition pair. In this manner, it attempts to keep both resources simultaneously busy. Essentially, the computation bursts of some of the jobs assigned to a partition are overlapped with the I/O bursts of the partition's other jobs. The extent to which such an overlap is advantageous depends upon the phase characteristics of each individual job scheduled on the partition pair.

A processor-disk partition pair is available as long as there is a processor partition whose current multiprogramming level is less than $M$. When a job arrives at the system, one of the following four cases is possible (the sketch below illustrates the selection rule). If there is a processor-disk pair where both partitions are idle, then the newly arrived job is assigned to this pair. Otherwise, the policy selects a pair where at least one of the two partitions is idle. If there is a scheduling slot where both partitions are already busy with fewer than $M$ jobs each, then the job is scheduled onto it. Finally, if all partition pairs are assigned $M$ jobs each, then there are no available scheduling slots and the job waits in the queue. The maximum number of jobs simultaneously executing in the system is $M \cdot \min\{K_P, K_D\}$.
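A minimal sketch of the four-case admission rule just described, assuming each partition pair tracks whether its processor and disk partitions are currently busy and how many jobs it holds; the field and function names are illustrative.

```python
def select_pair(pairs, M):
    """Pick a scheduling slot for an arriving job under TS(K_P, K_D).

    `pairs` is a list of dicts with boolean 'cpu_busy' and 'disk_busy'
    flags and an integer 'jobs' multiprogramming level.
    """
    # Case 1: a pair whose processor and disk partitions are both idle.
    for pair in pairs:
        if not pair["cpu_busy"] and not pair["disk_busy"]:
            return pair
    # Case 2: a pair where at least one of the two partitions is idle.
    for pair in pairs:
        if (not pair["cpu_busy"] or not pair["disk_busy"]) and pair["jobs"] < M:
            return pair
    # Case 3: a pair with both partitions busy but fewer than M jobs.
    for pair in pairs:
        if pair["jobs"] < M:
            return pair
    # Case 4: every pair already holds M jobs; the arrival waits in the FCFS queue.
    return None
```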

4 Scheduling Results

In this section we present a representative sample of the results from our scheduling analysis based on §3, under a workload based on the models of §2 parameterized by measurements of seven programs from the SIO suite. A parallel system with $P = 128$, $D = 32$, $M_P = 8$, $M_D = 1$, $K_P \in \{1, 2, 4, 8, 16\}$ and $K_D \in \{1, 2, 4, 8, 16, 32\}$ is considered. The job interarrival and service times have coefficients of variation that cover the range of statistics reported in previous measurement studies of workloads at supercomputing centers. The performance measures for our space-sharing models are obtained using the exact analysis (with state-dependent PDFs) derived in [17] together with the analysis of §3.2, whereas the scheduling systems with time-sharing are analyzed via simulation. Throughout, we use $\rho$ to denote the traffic intensity of the parallel system, $0 < \rho < 1$.

4.1 Space-Sharing Response Time

We first consider the response time measures of the system under space-sharing strategies. Previous research on parallel space-sharing policies, when only the scheduling of processors is considered and I/O behavior is essentially ignored, has basically shown that the optimal static partitioning scheme at light system loads consists of fewer (thus larger) partitions, and that the optimal static partitioning scheme consists of increasingly more (hence smaller) partitions as the traffic intensity rises, with the effects of memory-constrained environments being an exception [9]. This results in a family of response time curves for the various possible partition sizes whose relative ordering, with respect to providing the best performance, typically follows the natural order on the number of partitions, from 1 (i.e., the entire system) to $N_P$ (i.e., the maximum number of processor partitions). Furthermore, dynamic partitioning policies have often been shown to provide a lower mean response time than all of the static space-sharing policies across the entire range of system loads.

We now exploit the space-sharing models in §3, which explicitly consider a workload with non-negligible I/O requirements, to examine the performance of coordinated processor and disk allocation policies and compare the corresponding trends with previous scheduling results. The mean job response times for the space-sharing systems with different numbers of processor and disk partitions are plotted in Fig. 6. We observe that when $K_D$ is greater than or equal to the entire set of $K_P$ values considered, our results are basically consistent with previous scheduling results in the research literature. Specifically, for $K_D \ge 16$, a single processor partition provides the best performance among the static policies at very light loads. When $\rho$ is quite small, job response times are essentially dominated by their processing requirements given the number of allocated processors, and hence the benefits of maximizing the parallel execution of jobs outweigh its associated costs. As $\rho$ increases from near 0 towards 1, however, $K_P = 2$ provides the best static partitioning performance, followed by $K_P = 4$ and then $K_P = 8$. This is because queueing delays have an increasingly larger relative effect on job response times as the traffic intensity rises, which together with the non-linear speedups of the applications comprising the parallel workload tends to cause the optimal static partition size to decrease with increasing values of $\rho$.

The only exception to these trends is the case of $K_P = 16$, which provides poorer mean response times than the SP(8, 16) system across all traffic intensities. This is due to a particular artifact of one of the parallel applications in our workload, in which the execution time as a function of the number of allocated processors and disks is discontinuous.


Figure 6: Mean response times under SP($K_P$, $K_D$) and DP($K_D$) for $K_P \in \{1, 2, 4, 8, 16\}$ and $K_D \in \{1, 4, 8, 16\}$.

Note that in each of the above cases there is no contention for the disk resources, since each processor partition has its own private disk partition. The above performance trends and observations tend to also hold for $K_D = 8$, although with changes in the range of $\rho$ values over which a particular value of $K_P$ is best.

On the other hand, as the value of $K_D$ is reduced further, the performance trends previously observed in the research literature for static partitioning appear to be no longer valid. In particular, when there is a single partition consisting of all the disks, we observe that there is a strict ordering among the static space-sharing strategies for all traffic intensities, where SP(1, 1) provides the best performance, followed in order by SP(2, 1), SP(4, 1), SP(8, 1) and SP(16, 1), $0 < \rho < 1$. A primary cause of these performance trends is contention among the $K_P$ processor partitions for the single disk partition, which tends to have a dominating effect over the queueing delays and non-linear speedups mentioned above. The quantitative results from our analysis of §3.2 clearly demonstrate the performance impact of these effects.

Of course, this contention for the disk resources declines as the number of processor partitions decreases. Similar observations can be made with respect to the mean response times for $K_D = 2$ and $K_D = 4$, with the contention for the disk partitions decreasing as the value of $K_D$ increases. It is important to note that the results presented in this section are based on the use of a processor-sharing discipline at the disk partitions (although our analysis in §3 is not limited to this assumption and in fact supports general scheduling at the disk partitions), and thus our model at worst underestimates the contention for the disk resources.

Turning to the results for dynamic partitioning, i.e., the dashed curves in Fig. 6, we observe that DP($K_D$) provides the best performance among all space-sharing strategies for small to relatively moderate values of $\rho$, depending upon the value of $K_D$. By adjusting scheduling decisions in response to workload changes, the DP($K_D$) system provides the most efficient utilization of the processors among the various space-sharing policies when the traffic intensity and disk contention are sufficiently low. On the other hand, as shown in [17, 16], the parallel system under dynamic partitioning converges toward the corresponding static partitioning system with $K_P = N_P$ in the limit as $\rho \to 1$.

Hence, while DP($K_D$) typically provides the best performance at relatively light loads, it eventually starts to suffer from the disk contention problems described above and thus tends to saturate together with the corresponding SP($N_P$, $K_D$) system. It is important to note that the dynamic partitioning results in Fig. 6 are based on the assumption of no repartitioning overhead, i.e., $\nu_i^{-1} = 0$, and thus our results are at worst biased in favor of dynamic space-sharing.

4.2 Time-Sharing Response Time

Our analysis of parallel program and workload behavior in §2 clearly identifies an important characteristic of I/O-intensive parallel applications, namely the repeated alternation between computation bursts and I/O bursts. This, together with the performance trends from the above scheduling results, suggests that the overlapping of the I/O demands of some jobs with the computation demands of other jobs may offer a potential improvement in performance, even with limited multiprogramming levels. We now exploit the space-sharing with time-sharing models in §3 to examine these performance issues in more detail.

The mean job response times for the TS($K_P$, $K_D$) systems with $M = 2$ and different numbers of processor and disk partitions are plotted in Fig. 7.

Figure 7: Mean response times under TS($K_P$, $K_D$) for $K_P \in \{1, 2, 4, 8, 16\}$ and $K_D \in \{1, 4, 8, 16\}$.

We first observe that the overlapping of computation and I/O bursts from different jobs results in the most efficient utilization of the resources among the scheduling policies considered in our study. This also yields a significant increase in the traffic intensities accommodated by the time-sharing counterparts of the scheduling policies examined in the previous section. Moreover, the performance benefits of this effective resource utilization outweigh the impact of contention among the $K_P$ processor partitions for the $K_D$ disk partitions, which had a considerable effect on the performance of the space-sharing policies. Note further that this contention can be even worse in TS($K_P$, $K_D$) than in the space-sharing systems when $K_P \ge K_D$, as up to $M \cdot K_P/K_D$ jobs share each disk partition.

The multiprogramming level that provides the best performance depends in part upon the ratio of the per-phase computation and I/O requirements. In the ideal case, the scheduling strategy will be able to interleave the execution of applications such that this ratio is maintained very close to 1, thus achieving a total overlapping of computation and I/O. The problem of coordinated allocation of processor and disk resources consists of a complex tradeoff that depends upon a number of key factors, including the speedup surfaces of the application workload as a function of the number of allocated resources, the system load, and the balance of the computation and I/O requirements present in the parallel workload.

5 Conclusions

In this paper we systematically investigated performance issues involved in the coordinated allocation of processor and disk resources in parallel computer systems, with the goal of gaining a better understanding of the impact of I/O on program behavior and resource allocation. Models were formulated to examine the I/O and computation behavior of parallel programs and workloads, and to analyze parallel resource scheduling policies under such workloads. Our analysis includes derivations of a generalization of Amdahl's law and a generalization of flow-equivalent service centers.

The models and analysis were validated against measurements of parallel applications and shown to be in excellent agreement.

Our results provide important insights into the performance of parallel applications and resource scheduling strategies. In particular, our scheduling results demonstrate that, when there is contention for the disk resources, there can be significant differences in the performance trends from what has been previously observed for space-sharing policies when I/O behavior is negligible or not explicitly considered. Our results further show the potentially significant performance benefits of combining space-sharing with time-sharing to overlap the I/O of some jobs with the computation of other jobs in these environments. We are continuing to investigate the impact of workload variability and the impact of workload I/O intensity on the effectiveness of coordinated scheduling decisions and performance.

Acknowledgements. We thank the Caltech CACR for providing access to the Intel Paragon XP/S where all of our experiments were run. We thank Mark Wu and Aaron Kupperman (Caltech), and Rick Kendall and David Bernholdt (PNL) for providing the QCRD and MESSKIT codes used in our experiments.

References

[1] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proc. AFIPS Spring Joint Comp. Conf., Vol. 30, pp. 483-485, 1967.

[2] S.-H. Chiang, R. K. Mansharamani, M. K. Vernon. Use of application characteristics and limited preemption for run-to-completion parallel processor scheduling policies. In Proc. ACM Sigmetrics Conf., pp. 33-44, 1994.

[3] N. Islam, A. Prodromidis, M. S. Squillante, L. L. Fong, A. S. Gopal. Extensible resource management for cluster computing. In Proc. Int. Conf. Dist. Comp. Sys., 1997.

[4] E. D. Lazowska, J. Zahorjan, G. S. Graham, K. C. Sevcik. Quantitative System Performance: Computer System Analysis Using Queueing Network Models. Prentice Hall, 1984.

[5] C. McCann, R. Vaswani, J. Zahorjan. A dynamic processor allocation policy for multiprogrammed shared-memory multiprocessors. ACM Trans. Comp. Sys., 11(2):146-178, 1993.

[6] R. D. Nelson. A performance evaluation of a general parallel processing model. In Proc. ACM Sigmetrics Conf., pp. 13-26, 1990.

[7] J. K. Ousterhout. Scheduling techniques for concurrent systems. In Proc. Int. Conf. Dist. Comp. Sys., pp. 22-30, 1982.

[8] E. W. Parsons, K. C. Sevcik. Coordinated allocation of memory and processors in multiprocessors. In Proc. ACM Sigmetrics Conf., pp. 57-67, 1996.

[9] V. G. Peris, M. S. Squillante, V. K. Naik. Analysis of the impact of memory in distributed parallel processing systems. In Proc. ACM Sigmetrics Conf., pp. 5-18, 1994.

[10] D. A. Reed, R. A. Aydt, R. J. Noe, P. C. Roth, K. A. Shields, B. W. Schwartz, L. F. Tavera. Scalable performance analysis: The Pablo performance analysis environment. In Proc. Scalable Parallel Libraries Conf., pp. 104-113, 1993.


[11] E. Rosti, G. Serazzi, E. Smirni, M. S. Squillante. The impact of I/O on program behavior and parallel scheduling. Tech. Rep., IBM Research Division, Sep. 1997; revised Mar. 1998.

[12] E. Rosti, E. Smirni, L. W. Dowdy, G. Serazzi, B. M. Carlson. Robust partitioning policies of multiprocessor systems. Perf. Eval., 19:141-165, 1994.

[13] S. K. Setia. The interaction between memory allocation and adaptive partitioning in message-passing multicomputers. In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), Lecture Notes in Comp. Sci. Vol. 949, pp. 146-164, 1995.

[14] E. Smirni, R. A. Aydt, A. A. Chien, D. A. Reed. I/O requirements of scientific applications: An evolutionary view. In Proc. IEEE Int. Symp. High Perf. Dist. Comp., pp. 49-59, 1996.

[15] E. Smirni, D. A. Reed. Lessons from characterizing the Input/Output behavior of parallel scientific applications. To appear, Perf. Eval., 1998.

[16] M. S. Squillante. On the benefits and limitations of dynamic partitioning in parallel computer systems. In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), Lecture Notes in Comp. Sci. Vol. 949, pp. 219-238, 1995.

[17] M. S. Squillante. Matrix-analytic methods in stochastic parallel-server scheduling models. To appear, Advances in Matrix-Analytic Methods for Stochastic Models, S. R. Chakravarthy and A. S. Alfa (eds.), Lecture Notes in Pure and Applied Mathematics, 1998.

[18] M. S. Squillante, F. Wang, M. Papaefthymiou. Stochastic analysis of gang scheduling in parallel and distributed systems. Perf. Eval., 27&28:273-296, 1996.