Cluster Comput (2007) 10: 339–350
DOI 10.1007/s10586-007-0028-5

ORIGINAL PAPER

Personal adaptive clusters as containers for scientific jobs

Edward Walker · Jeffrey P. Gardner · Vladimir Litvin · Evan L. Turner

Published online: 14 June 2007
© Springer Science+Business Media, LLC 2007

Abstract We describe a system for creating personal clusters in user-space to support the submission and management of thousands of compute-intensive serial jobs to the network-connected compute resources on the NSF TeraGrid. The system implements a robust infrastructure that submits and manages job proxies across a distributed computing environment. These job proxies contribute resources to personal clusters created dynamically for a user on-demand. The personal clusters then adapt to the prevailing job load conditions at the distributed sites by migrating job proxies to sites expected to provide resources more quickly. Furthermore, the system allows multiple instances of these personal clusters to be created as containers for individual scientific experiments, allowing the submission environment to be customized for each instance. The version of the system described in this paper allows users to build large personal Condor and Sun Grid Engine clusters on the TeraGrid. Users then manage their scientific jobs, within each personal cluster, with a single uniform interface using the feature-rich functionality found in these job management environments.

E. Walker (✉)
Texas Advanced Computing Center, University of Texas at Austin, Austin, TX 78758, USA
e-mail: [email protected]

J.P. Gardner
Pittsburgh Supercomputing Center, Pittsburgh, PA 15213, USA
e-mail: [email protected]

V. Litvin
High Energy Physics Group, California Institute of Technology, Pasadena, CA 91125, USA
e-mail: [email protected]

E.L. Turner
Texas Advanced Computing Center, University of Texas at Austin, Austin, TX 78758, USA
e-mail: [email protected]

Keywords Cooperative systems · Distributed computing · Resource management

1 Introduction

The TeraGrid is a multi-year, multi-million dollar National Science Foundation (NSF) project to deploy one of the world's largest distributed infrastructures for open scientific research [1]. The project links eight large resource provider sites with a high speed 10–30 Gbit/s dedicated network, providing, in aggregate, over 100 teraflops of computing power and 4 petabytes of storage capacity, with high-end facilities for visualization and data analysis of computation results. In particular, the compute resources on the TeraGrid are composed of a heterogeneous mix of compute clusters running different operating systems on different instruction set architectures. Furthermore, most sites deploy different workload management systems for users to interact with the compute cluster. Even when the same workload management system is used, different queue names and job submission limits result in subtly different job submission and management semantics across the different sites on the TeraGrid.

Some run-time uniformity exists across sites in the availability of a guaranteed software stack, called the Common TeraGrid Software Stack (CTSS), and a common set of environment variables for locating well-known file directory locations. However, the TeraGrid is essentially a large distributed system of computing resources with different capabilities and configurations.


The TeraGrid can therefore benefit from middleware tools that provide a common seamless environment for users to submit and manage jobs across sites, and allow its aggregate capabilities to be harnessed transparently. These tools also have to preserve resource ownership across sites, allowing sites to continue leveraging local and global partnerships outside of the project. For example, many sites support local university users, national users through NSF programs, and possibly industrial users through partnership programs. Sites therefore need the autonomy to configure their systems to meet the different needs of these different communities.

The middleware tool described in this paper supports ensemble job submissions across the distributed computational resources on the TeraGrid. A recent TeraGrid user survey [2] indicated that over half the respondents had the requirement to submit, manage, and monitor many hundreds, or even thousands, of jobs at once. In particular, two large scientific projects provided initial motivation for our research: the Compact Muon Solenoid (CMS) project [3] and the NSF National Virtual Observatory (NVO) project [4, 13]. Both projects have large compute time allocations on the TeraGrid, and both projects have a requirement to submit and manage many thousands of serial jobs through the TeraGrid clusters.

In this paper, we describe the production infrastructure that was developed to support the execution of these two projects. The system we describe builds personal clusters for users to submit, manage and monitor computational jobs across sites on the TeraGrid. These personal clusters can be managed by Condor [18, 19] or Sun Grid Engine (SGE) [23], providing a single uniform job management interface for users of the TeraGrid.

The rest of the paper is organized as follows. Section 2 describes the TeraGrid applications that provided the initial motivation for our research. Section 3 defines our problem statement and examines related work supporting large ensemble job submissions on distributed computing environments. Section 4 describes our proposed solution, introducing the concept of a virtual login session and the use of job proxies to acquire resources for creating personal clusters on-demand. Section 5 measures the overhead of executing user jobs through the job proxies created by the system, and Sect. 6 examines the efficiency of three different job proxy allocation strategies, investigating some of their trade-offs. Finally, Sect. 7 concludes our paper with a discussion of future work.

2 Motivating applications

Two scientific groups, with large allocations on the TeraGrid, provided the initial motivating applications for our research. The two scientific groups were from the NSF National Virtual Observatory (NVO) project and the Compact Muon Solenoid (CMS) project. This section describes these motivating applications and the computational challenge they pose to current high performance computational infrastructures.

2.1 National virtual observatory (NVO)

Astronomy is facing a data avalanche that grows larger each year. Breakthroughs in telescope and detector technology have allowed astronomical surveys to produce a volume of images and catalogs that is growing much faster than serial computing power. The NSF National Virtual Observatory (NVO) is a multiyear effort to build tools, services, registries, protocols, and standards that can extract the full knowledge content of these massive, multi-frequency data sets.

Many types of services required by the NVO operate as large numbers of serial jobs. For example, an astronomer may need to apply the same analysis algorithm to millions of independent objects such as galaxies or quasars. Or they may require a parameter sweep calculation in which the parameters of a model are varied millions of times and compared against a given result or dataset.

The specific NVO application that provided requirements for our research involved cosmologists attempting to calculate the dark energy content of the Universe by measuring the Integrated Sachs-Wolfe (ISW) Effect. The ISW effect tells us that in the presence of dark energy, primordial photons from the Cosmic Microwave Background Radiation (CMBR) are affected by the gravitational pull of galaxies and dark matter as they journey towards us over the lifetime of the Universe. This effect can be used to determine the dark energy content of the Universe by cross-correlating maps of the CMBR with the foreground galaxy density.

In this application, the Wilkinson Microwave Anisotropy Probe dataset (WMAP) [14] provided the map of the CMBR, and the Sloan Digital Sky Survey dataset (SDSS) [15] provided the galaxy positions. The correlation between the maps was measured by calculating the covariance matrix of a large number of statistically equivalent CMBR realizations and the galaxy maps. For the project, this translated to a requirement for submitting over 50,000 jobs to the TeraGrid clusters, with each job having fairly short run-times of approximately 10 minutes in duration.

2.2 Compact muon solenoid (CMS)

Two of the principal goals of the Large Hadron Collider (LHC) physics program are to find the Higgs particles, thought to be responsible for mass, and Supersymmetry (SUSY), a necessary element of String Theories that may unify all of nature's fundamental interactions. The Compact Muon Solenoid (CMS) detector at CERN's Large Hadron Collider (LHC) will be a critical instrument used in pursuit of both these quests when it begins operation in 2008.

Scientists from Caltech, UCSD, and across Europe are collaborating to optimize their ability to enable these discoveries when data from the CMS becomes available in 2008. To this end, the team is using the TeraGrid to generate 6 million events for a full detector simulation as part of their advanced study of CMS electromagnetic calorimeter stand-alone η-calibration, and 10 million pre-selected events of different hadronic backgrounds for the Higgs boson diphoton decay mode, to investigate methods for improving their ability to discover the Higgs boson [16]. Concretely, the group's computational needs translated to a requirement for submitting over 500,000 jobs to multiple IA-64 clusters on the TeraGrid, with each job having variable and potentially long run-times of approximately 1 to 24 hours in duration.

3 Related work

We examine the question of how to efficiently support multiple users in the submission and management of many thousands of simultaneous serial jobs across a heterogeneous mix of large compute clusters connected in a distributed environment.

An approach to submitting ensemble jobs across a distributed infrastructure, like the TeraGrid, is to do so directly through a gateway node at each participating site. This approach uses a tool such as Globus [5, 6] or UNICORE [8] to provide a common job submission interface to the different job managers at each site. For many applications, this is a very good strategy. However, despite the scalable nature of these software solutions, our requirement of submitting thousands of jobs by multiple users simultaneously can cause the single gateway resource at each site to become overloaded. Furthermore, application requirements to submit many “small” jobs that execute for only a few minutes can further exacerbate this problem.

An alternative approach to directly submitting jobs to a gateway node is to submit jobs to an intelligent central agent that schedules jobs to a site based on heuristics that minimize the cost (time and resource consumption) of the entire ensemble execution. These central agents develop job schedules that help throttle the dispatch of jobs to the gateway nodes, alleviating the problem as a side-effect. Tools that do this include Pegasus/Condor-G [9, 25], Nimrod/G [10], and APST [11]. All three systems have been successfully used to schedule job ensembles across a Grid, and the scheduling technologies they developed have been shown to enable job ensembles to complete more efficiently than simply flooding each gateway node with as many jobs as possible.

However, this approach of maintaining jobs at a central agent and dispatching them through a remote execution adaptor has a number of issues in the context of our specific problem. Firstly, a central agent has to deal with remote outages at a gateway node where the only service exposed may be a job execution service like Globus GRAM [7]. This is difficult because during a gateway outage, a central meta-scheduler can only guess at the current state of its submitted jobs at a site. Secondly, and more importantly, this approach can result in non-optimal use of the cluster resources on the TeraGrid due to the serial nature of our job submission requirements. TeraGrid clusters often have a limit on the number of jobs a user can submit to a site's local job queue, and some local site schedulers even explicitly favor large parallel jobs over serial jobs [12].

Due to these local scheduler policies, some users have resorted to repackaging their serial jobs into a “parallel” job where multiple CPUs are requested for each submission. However, repackaging serial jobs into a parallel job submission introduces other problems for the user: the user now loses the ability to monitor and control individual jobs in an experiment. This ability to monitor and control individual jobs is an important requirement, especially in the case of job failures. Users often need to be notified of a single job failure, requiring the ability to diagnose the problem by examining the error output of the job, and the ability to resubmit individual jobs on problem resolution.

4 Creating adaptive clusters on-demand

The system we propose automatically submits and manages parallel job proxies across cluster sites, creating personal clusters from resources contributed by these proxies. These personal clusters can be created on a per-user and per-experiment basis. The approach we propose provides a single job management interface over the heterogeneous distributed environment and leverages existing feature-rich cluster workload management systems. Figure 1 conceptually shows what our proposed solution does.

Our system builds Condor or SGE clusters when a user creates a virtual login session. Within this virtual login session, users can submit, monitor and manage jobs through a single job management interface, emulating the experience of a traditional cluster login session. Furthermore, multiple virtual login sessions can be created as containers for individual experiments, allowing the scientist to customize the submission needs of each experiment within each cluster.

In the system, job proxies, submitted and managed by a cluster builder agent infrastructure, start commodity cluster job-starter daemons (currently from Condor and SGE) which call back to a dynamically created cluster at the user's workstation. The GridShell framework [20, 21] is leveraged to provide a transparent environment to execute the cluster builder agents developed for our system.

Fig. 1 Distributed proxy agents “pull” resources into personal clusters created on-demand. Multiple clusters may be created as containers for ensemble job runs

The system delegates the responsibility of submitting the job proxies to a single semi-autonomous agent spawned, through Globus, at each TeraGrid site during a virtual login session. This agent translates the job proxy submission to the local batch submission syntax, maintains some number of job proxies throughout the lifetime of a login session, and may negotiate the migration of job proxies between sites based on prevailing job load conditions across sites.
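
As an illustration of this translation step, the sketch below renders a generic job-proxy request as a PBS-style submission script. The wrapper command, queue name, and helper function are hypothetical placeholders, not part of the system described here; a site running a different batch system would need an analogous template.

```python
PROXY_WRAPPER = "gridshell-task-manager --callback {callback}"   # hypothetical wrapper command

PBS_TEMPLATE = """#!/bin/sh
#PBS -q {queue}
#PBS -l nodes={nodes}:ppn={ppn},walltime={walltime}
{command}
"""

def render_proxy_script(queue, cpus, ppn, walltime_hours, callback):
    """Translate a generic job-proxy request into a PBS submission script."""
    nodes = max(1, cpus // ppn)
    return PBS_TEMPLATE.format(
        queue=queue,
        nodes=nodes,
        ppn=ppn,
        walltime="{:d}:00:00".format(walltime_hours),
        command=PROXY_WRAPPER.format(callback=callback),
    )

# A 32-CPU proxy on 16-core nodes with a 24-hour wall-clock limit:
print(render_proxy_script("normal", cpus=32, ppn=16, walltime_hours=24,
                          callback="workstation.example.org:5000"))
```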

Our approach has the following advantages:

• Scalability: user jobs are directly routed to the compute nodes at each site and not through the gateway node. Only a single agent is started at a gateway node, which submits and maintains job proxies in the local queue.

• Fault-tolerance: each semi-autonomous agent at a gateway node maintains the job proxy submission locally, allowing transient network outages to be tolerated, and gateway node reboots to be handled in isolation from the rest of the system.

• Technology inheritance: the entire commodity Condor and SGE infrastructure, with all their associated add-on tools, is leveraged to provide a single job management environment for the user. Future versions of the system will allow other cluster workload management systems to be the user-selectable interface to the TeraGrid, e.g. the Portable Batch System (PBS) [22].

4.1 The virtual login session

Figures 2 and 3 show Condor and SGE virtual login sessions. A number of features are highlighted in these figures. First, a vo-login command is invoked by the user. The command reads a configuration file, located in the user's home directory, listing the participating sites in each session. Second, the command prompts the user for a GSI [24] password. This creates a valid GSI credential to spawn an agent at the participating sites. The command also allows the user to use an existing credential. Third, a command-line prompt is returned to the user, who is then able to issue Condor or SGE commands. Fourth, the user can query the status of all remote job proxies with the agent_jobs command. Fifth, the user can detach from the login shell and reattach to it at some future time from a different terminal. Finally, sixth, the system automatically cleans up the environment and shuts down all remote agents when the shell is exited.

Fig. 2 Formatted screenshot of a virtual login session creating a personal Condor pool

Fig. 3 Formatted screenshot of a virtual login session creating a personal SGE cluster

Figure 4 shows the process architecture of our system. When a user invokes vo-login on a client workstation, an Agent Manager process is started on the local machine. The agent manager remotely executes a Proxy Manager process at each participating site using Globus GRAM. The agent manager at the client workstation also forks a Master Node Manager process within a GNU Screen session, which then starts the Condor/SGE master daemons in user space if needed.

Fig. 4 Component architecture of the system

The proxy managers send periodic heart-beat messages back to the central agent manager to allow each to monitor the health of the other. Based on missed heart-beats, the agent manager may reallocate job proxies to other proxy managers, or a proxy manager may voluntarily shut down if it has not heard from the agent manager for a period of time.
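
A minimal sketch of this mutual heart-beat monitoring is shown below; the interval, the missed-beat threshold, and the class name are assumptions for illustration, not values taken from the system.

```python
import time

HEARTBEAT_INTERVAL = 60   # seconds between heart-beats (assumed)
MISSED_LIMIT = 10         # missed beats tolerated before acting (assumed)

class HeartbeatMonitor:
    """Tracks the last heart-beat time seen from each peer."""

    def __init__(self):
        self.last_seen = {}

    def record(self, peer_id):
        self.last_seen[peer_id] = time.time()

    def silent_peers(self):
        """Peers whose heart-beats have been missing for too long."""
        cutoff = time.time() - HEARTBEAT_INTERVAL * MISSED_LIMIT
        return [peer for peer, seen in self.last_seen.items() if seen < cutoff]

# The agent manager would reallocate job proxies away from silent proxy
# managers; a proxy manager, symmetrically, would shut itself down if the
# agent manager has been silent for too long.
```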

The proxy manager at each site in a virtual login session is responsible for submitting job proxies through a local GridShell Submission Agent. The proxy manager invokes this submission agent by wrapping the Condor or SGE job-starter daemon executable in a GridShell script and executing it. The submission agent submits and maintains the job proxy in the local job queue using the GridShell infrastructure [20].

When the job proxy is finally run by the local site scheduler, a Task Manager process is started. The task manager is responsible for reading the slave node configuration and starting Slave Node Manager processes in parallel on the nodes allocated by the local site scheduler. The task manager then calls back to the proxy manager, allowing the proxy manager to begin monitoring the state of the running job proxy.

The slave node managers are responsible for starting the Condor or SGE job-starter daemons. These daemons connect back to the master processes on the client workstation, or to a pre-existing departmental cluster, depending on the virtual login configuration. Job proxies then appear as available compute resources in the user's expanding personal Condor or SGE cluster.

4.2 Authentication framework

The TCP communication between the Proxy and Agent Managers is not encrypted. However, all TCP connections used in the system are authenticated. The authentication framework used by the system is shown in Fig. 5.

When a user first invokes vo-login, a secret 64-bit key, auth_key, is randomly generated and given to the Agent Manager process as an input parameter on startup. The vo-login command then authenticates with each participating site using the GSI authentication mechanism. Once authenticated, the system spawns a GridShell script at the remote site, within which the Proxy Manager is started with the same auth_key as an input parameter. Subsequently, all communications between the Proxy Manager and the Agent Manager are preceded by an auth_key-encrypted challenge string. The challenge string consists of a fixed, previously agreed-upon section and a variable section. The system ensures that the variable section is unique for every new connection, allowing the Agent Manager to discard connections that repeat an encrypted challenge string. This is done to prevent replay attacks from intruders who have successfully sniffed a previous connection.
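
The following sketch illustrates the idea of the per-connection challenge string under stated assumptions: an HMAC keyed with auth_key stands in for whatever encryption the system actually uses, and the message layout is invented. Only the overall scheme, a shared secret plus a unique variable section that lets repeated challenges be discarded, follows the description above.

```python
import hashlib
import hmac
import os

FIXED_PART = b"vo-login-challenge"   # previously agreed-upon fixed section (assumed value)

def make_challenge(auth_key: bytes) -> bytes:
    """Proxy Manager side: build a challenge with a unique variable section."""
    nonce = os.urandom(16)                                         # variable section
    tag = hmac.new(auth_key, FIXED_PART + nonce, hashlib.sha256).digest()
    return nonce + tag

class ChallengeVerifier:
    """Agent Manager side: accept each valid challenge at most once."""

    def __init__(self, auth_key: bytes):
        self.auth_key = auth_key
        self.seen = set()

    def accept(self, challenge: bytes) -> bool:
        nonce, tag = challenge[:16], challenge[16:]
        expected = hmac.new(self.auth_key, FIXED_PART + nonce, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            return False       # wrong shared secret
        if challenge in self.seen:
            return False       # replayed connection is discarded
        self.seen.add(challenge)
        return True
```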


Fig. 5 Authenticating connections between the Agent and Proxy Managers with an encrypted challenge string

4.3 Job proxy states and recovery

Figure 6 shows the complete job proxy state transitions during a normal fault-free run and during recovery.

For the fault-free run, the normal job proxy state transition is PEND → RUN → EXIT. A job proxy is in the PEND state if it is in the local queue and is waiting to be run. The job proxy is in the RUN state when it is started by the local site scheduler, and in the EXIT state when it terminates and is removed from the local queue. In a fault-free run, a job proxy in the RUN state goes to the EXIT state for either of two reasons:

1. The proxy is killed by the local batch system if its local wall-clock limit is reached.

2. The proxy exits voluntarily if it remains idle for some configurable time period. This can happen when there are insufficient user jobs that can be scheduled by the system.

The job proxy may also transition between the PEND and RELEASE states. A job proxy is in the RELEASE state when it is being considered for migration to another site. This is explained further in Sect. 6.

When the gateway node reboots, the proxy manager is expected to accurately recover the state of its submitted proxies. The proxy manager is restarted by a crontab entry that the system creates when the proxy manager is first started up.

When the proxy manager is restarted on recovery, it first checks a job-info file containing the last known job proxy states. If a proxy is logged in the RUN state, the proxy manager will attempt to connect to the host (and port) of its task manager, which is also logged in the job-info file. If it connects successfully, the proxy transitions back to the RUN state; otherwise the proxy is transitioned to the EXIT state.

If a job proxy is logged in the PEND state before recovery, the proxy manager has to consider four possible proxy state transitions that may have occurred while it was down. These state transitions are shown in Table 1.

Fig. 6 Job proxy state transition diagram

Table 1 Events generated from job proxy transitions

State transitions           Recovery events
1. PEND → RUN → EXIT        No job-tag
2. PEND → EXIT              Found job-tag ∧ not in queue
3. PEND (no change)         Found job-tag ∧ in queue
4. PEND → RUN               Found job-tag ∧ connected to task starter

All proxies submitted by the proxy manager have a unique job-tag logged in a cache directory. This job-tag is removed from the cache directory by the task manager when the proxy exits. Using this job-tag, the state transitions for cases 1 to 3 (shown in the table) can be distinguished by the absence or the presence of a proxy's job-tag when the proxy manager recovers. If a job-tag is found, the proxy manager transitions the proxy state to FOUND_TAG. The proxy manager then checks the local queue for the job proxy. This accommodates the edge case where the job proxy is manually removed from the local queue while the proxy manager was down. If the job proxy is still in the queue, the job proxy state is transitioned to PEND; otherwise its state is transitioned to EXIT.

Finally, for case 4, a job proxy in the PEND state will transition to the RUN state when the task manager for the job proxy reconnects with the proxy manager. This will eventually occur after recovery because the task manager will periodically attempt to reconnect with its proxy manager until successful.
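
The recovery decision for a proxy logged in the PEND state can be summarized by the sketch below, which follows the cases in Table 1; the function name and its arguments are illustrative, not the system's code.

```python
def recover_pending_proxy(job_tag_exists, in_local_queue):
    """Recovered state for a proxy that was logged as PEND before a restart.

    Cases 1-3 of Table 1: no job-tag means the proxy ran and exited while the
    proxy manager was down; a job-tag plus a queue entry means it is still
    pending; a job-tag without a queue entry means it was removed manually.
    Case 4 (PEND -> RUN) is handled separately, when the proxy's task manager
    reconnects on its own.
    """
    if not job_tag_exists:
        return "EXIT"        # case 1: completed while the manager was down
    # Job-tag found: intermediate FOUND_TAG state, then check the local queue.
    if in_local_queue:
        return "PEND"        # case 3: nothing changed
    return "EXIT"            # case 2: removed from the queue
```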

4.4 Agent manager failure and recovery mode

The agent manager keeps minimal state. Information about remote proxy managers is not persisted, because we assume they will eventually reconnect with the agent manager when the client recovers. In our system the agent manager only persists its listening port, for re-binding on recovery.


4.5 Interposing domain name services system calls

The gateway and compute nodes on all the clusters on the TeraGrid are multi-homed, i.e. there are multiple network interface cards (NICs), with multiple hostnames and IP addresses associated with each node. Therefore, in order to ensure that the Condor and SGE job-starter daemons use the hostname associated with the correct NIC, i.e. the IP with public network connectivity, we interpose the domain name services (DNS) uname() and gethostname() system calls with a version that uses the name specified in the environment variable _GRID_MYHOSTNAME if it is defined.

The _GRID_MYHOSTNAME environment variable is set by our system to the resolved hostname of the correct NIC for the Condor and SGE daemons to use prior to their startup. Our system discovers the correct NIC by having each slave node manager connect to its known remote agent manager on startup. When a connection is successful, the slave node manager checks the locally bound IP address of the successful outbound TCP connection. Once this IP address is discovered, the environment variable _GRID_MYHOSTNAME is set to its resolved hostname.
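
A sketch of this NIC-discovery step is shown below, assuming the agent manager's address is known. The environment variable name comes from the text; everything else is illustrative, and the interposition of uname() and gethostname() itself is not shown.

```python
import os
import socket

def discover_public_hostname(agent_manager_host, agent_manager_port):
    """Connect outbound to the agent manager and record which local NIC
    the operating system actually used for the connection."""
    with socket.create_connection((agent_manager_host, agent_manager_port)) as sock:
        local_ip = sock.getsockname()[0]          # IP bound on the outbound NIC
    hostname = socket.gethostbyaddr(local_ip)[0]  # resolve it to a hostname
    # The interposed uname()/gethostname() calls would return this value.
    os.environ["_GRID_MYHOSTNAME"] = hostname
    return hostname
```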

This procedure causes the Condor and SGE daemons to always use the hostname associated with the correct IP address. We have found this to be a very successful strategy for dealing with the complexities of starting Condor and SGE daemons in multi-homed environments, without the need to modify technology-specific configuration files.

5 Overhead of executing jobs through proxies

In this section we examine the overhead of executing jobs through the job proxies created by our system. For our experiment we create a one-CPU Condor and SGE cluster at NCSA, using a single long-running job proxy. We then submit 30 jobs, each with a run time of 60 seconds, to these clusters from TACC. The total turnaround time, from the time of first job execution to the time of last job completion, is then measured.

For the Condor case, we examined seven different job submission scenarios. In scenario one, we manually stage the required executable, and tell Condor not to stage the executable before job execution. In scenario two, we allow Condor to stage the executable prior to job startup (the executable size was 10 KB). In scenarios three to seven, we use the Condor file transfer mechanism to stage data files of size 10 MB, 20 MB, 30 MB, 40 MB and 50 MB respectively, prior to job startup.

For the SGE case, we only measured the turnaround time for scenario one. This is because SGE does not have a built-in mechanism for staging binaries and input files. We therefore submitted a shell script, which SGE stages at the remote site, within which the pre-staged binary is invoked. For this scenario, we also submitted the 30 jobs using the SGE job-array submission feature.

Fig. 7 Total turnaround times for executing 30 jobs, with run times of 60 seconds each, submitted from TACC and executed through a single job proxy at NCSA

Table 2 Execution overhead through Condor and SGE job proxies

Scenario    1      2     3     4        5        6      7
Condor      5%     9%    9%    17.5%    28.1%    29%    29.8%
SGE         25%    –     –     –        –        –      –

The turnaround timings for all scenarios and cases are shown in Fig. 7.

The overhead introduced by our job proxy for each scenario is shown in Table 2, where we assume an ideal turnaround time of 30 × 60 = 1800 seconds.
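
For reference, the percentages in Table 2 follow from this definition of overhead; the measured turnaround value used below is only an illustrative example.

```python
# Ideal turnaround: 30 jobs x 60 s each, executed back-to-back on one CPU.
IDEAL_SECONDS = 30 * 60

def overhead_percent(measured_seconds):
    """Overhead of a scenario relative to the ideal 1800-second turnaround."""
    return 100.0 * (measured_seconds - IDEAL_SECONDS) / IDEAL_SECONDS

# An illustrative measured turnaround of 1890 s corresponds to a 5% overhead,
# matching the Condor scenario-one entry in Table 2.
print(overhead_percent(1890))   # -> 5.0
```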

The Condor job proxy introduces a much smaller overhead than the SGE job proxy. The execution overhead in both cases is still very acceptable. However, where the scientist has no interface preference, we recommend the Condor job proxy over the SGE job proxy, based on its lower overhead in our experimental findings.

6 Adaptive job proxy allocation strategies

Our system executes user jobs on CPU resources obtained from job proxies submitted to sites on the TeraGrid. Figure 8 illustrates a scenario where five user jobs are mapped to two job proxies. The two job proxies have been submitted to a single site, and each proxy has a CPU count of two.

Depending on the prevailing load conditions, all jobs may be able to run in their own individual CPU acquired by the job proxies, or multiple jobs may have to run consecutively on fewer CPUs. If there are few user jobs, some proxies may never be used, and are therefore wasted if run. Also, resources may be wasted if only a subset of a proxy's CPU allocation is used, since the entire job proxy remains running until no more jobs are assigned to it. It is therefore important to carefully consider job proxy allocation strategies to maximize the user's job throughput and minimize wasted CPU resources.


Fig. 8 Mapping five user jobs to two job proxies of CPU count two each. Both proxies run on one site

Our choice of job proxy allocation strategy needs to consider two pertinent questions: how large should job proxies be, and how many job proxies should the system submit?

Smaller job proxies are advantageous because the local site scheduler may find CPU resources for them more quickly. Also, smaller job proxies may result in fewer wasted CPUs close to the end of a job ensemble run, as fewer jobs remain in the system. However, the disadvantage of smaller job proxies is that the system needs to submit and manage more job proxies across the sites.

Submitting many more job proxies across sites than there are user jobs has the advantage of increasing the system's chance of finding resources quickly for the user. However, the disadvantage is that this may result in wasted resources. A job proxy is a wasted resource if it is never used during a virtual login session. These wasted job proxies consume resources when they are added to and deleted from the local queue, and when the scheduler has to consider them for scheduling amongst other jobs in the queue. In our system, wasted job proxies manifest as job proxies in the PEND state, because job proxies voluntarily terminate when not used and are re-submitted by the system in the PEND state for future consideration.

In this section we only consider a subset of the issues discussed above: we examine strategies that minimize the submission of wasted job proxies while maximizing the user job throughput. Three allocation strategies are examined: (A) over-allocating, (B) under-allocating and expanding the allocation over time, and (C) sharing a fixed allocation between sites.

6.1 Experimental setup

We evaluate the proposed job proxy allocation strategies by dynamically creating personal clusters using the clusters at the Texas Advanced Computing Center (TACC), National Center for Supercomputing Applications (NCSA), San Diego Supercomputing Center (SDSC), and Center for Advanced Computing Research (CACR). We then simulate a scientist submitting 1000 jobs to the personal cluster, with each job run-time uniformly distributed between 30 and 60 minutes.

Fig. 9 CPUs provisioned to personal cluster over time with strategy A

We examine the CPU profile of the dynamically generated personal clusters over time as these 1000 user jobs are processed through them. The turnaround time of the entire experiment is also measured to quantify the impact of the different strategies on the user. Twelve experiments were conducted by alternating between the proposed strategies over the course of three consecutive days. The remainder of the section examines some representative results from the best-performing runs using each of the proposed strategies.

6.2 Strategy A: over-allocating job proxies

In this strategy the total number of CPUs requested by the job proxies is approximately twice the number of user jobs submitted. The job proxy allocation is divided equally across the proxy managers and is fixed for the duration of the experiment.

In the experiment we configured the proxy managers to submit 15 job proxies, each requesting 32 CPUs, at every site. This equates to requesting 1920 CPUs in total across all the sites for the duration of the experiment.

The CPU profile of the personal cluster generated over the course of the experiment is shown in Fig. 9. The profile shows an average personal cluster size of 933 processors.

The number of job proxies in the PEND state during the lifetime of the experiment is shown in Fig. 10. On average, 29 job proxies were in the PEND state throughout the lifetime of the experiment.

Fig. 10 Number of PEND job proxies over time with strategy A

6.3 Strategy B: under-allocating and expanding the allocation over time

Fig. 11 Job proxies assigned to sites over time with strategy B

In this strategy, the total number of CPUs requested by the job proxies is initially fewer than the number of user jobs submitted. However, as time progresses, the proxy managers independently increase their proxy allocation as soon as all of their current proxies are running and in use by our system. The rationale is to grow the job proxy allocation on sites which have the potential of running more proxies successfully.
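
A sketch of this expansion rule is given below: a proxy manager grows its allocation only when every proxy it currently holds is both running and in use. The increment size, the cap, and the data representation are assumptions, not values from the experiments.

```python
def next_allocation(proxies, current_allocation, increment=1, cap=15):
    """Return the proxy manager's new allocation target for its site.

    proxies: list of dicts with 'state' and 'busy' fields, an illustrative
    representation of the proxies this manager currently maintains.
    """
    all_running_and_used = all(p["state"] == "RUN" and p["busy"] for p in proxies)
    if all_running_and_used and len(proxies) >= current_allocation:
        return min(current_allocation + increment, cap)
    return current_allocation

# e.g. five running, busy proxies against an allocation of five -> grow to six:
print(next_allocation([{"state": "RUN", "busy": True}] * 5, 5))   # -> 6
```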

In the experiment, the proxy manager at each site is initially allocated 5 job proxies. Figure 11 shows the expanding job proxy allocation at each of the four sites over time (at 5-minute time step intervals).

The graph shows the proxy manager at SDSC acquiring job proxies over the course of the experiment, reaching a peak of 14 job proxies. The proxy manager at TACC also acquired some additional job proxies, reaching a peak of 9 allocated job proxies: 1 pending and 8 running job proxies of 32 processors each. Interestingly, this corresponds to the 256-processor per-user limit set in the local batch queuing system at TACC.

Figure 12 shows the number of job proxies in the PEND state during the experiment. The number of job proxies in the PEND state fluctuated between 4 and 7 over the course of the experiment, which is an improvement over strategy A.

The CPU profile of the personal cluster generated over time during the experiment is shown in Fig. 13. The pool grows from an initial size of 400 processors to over 800 processors. The average size is shown to be 744 processors.

6.4 Strategy C: sharing a fixed allocation between sites

In this strategy, the total number of CPUs requested by the job proxies is the same as the number of user jobs submitted. The initial job proxy allocation is divided equally across the proxy managers on all the participating sites.

Fig. 12 Number of PEND job proxies over time with strategy B

Fig. 13 CPUs provisioned to personal cluster over time with strategy B

Fig. 14 UML sequence diagram of the job proxy sharing strategy between sites

As time progresses, the proxy managers adjust their proxy allocation by migrating job proxies in the PEND state to sites where all proxies are currently running in the local system. The rationale is to migrate job proxies to sites with shorter queue wait times, adjusting the initial job proxy allocation to maximize the user job throughput.

Figure 14 shows the UML sequence diagram for the job proxy sharing strategy. When all the job proxies at a site are in the RUN state, the proxy manager can inform the agent manager that it is ready to accept more proxies. The central agent manager may then give it a proxy from its stash (containing proxies it may have acquired when a proxy manager shuts down) or from another site with available job proxies in the PEND state.
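
The agent manager's side of this sharing decision might look like the sketch below, which follows the sequence in Fig. 14; the data structures, the preference for the largest donor, and the function name are assumptions for illustration.

```python
def donate_proxy(stash, site_pending, requesting_site):
    """Pick a proxy for a site that reports all of its proxies in RUN.

    Prefer the agent manager's stash (proxies reclaimed from shut-down proxy
    managers); otherwise take a PEND proxy from the other site with the most
    pending proxies, nudging allocation toward sites with shorter queue waits.
    """
    if stash:
        return stash.pop()
    donors = {site: pend for site, pend in site_pending.items()
              if site != requesting_site and pend}
    if not donors:
        return None                     # nothing to share right now
    donor = max(donors, key=lambda site: len(donors[site]))
    return site_pending[donor].pop()

# e.g. donate_proxy([], {"NCSA": ["p1", "p2"], "SDSC": []}, "SDSC") -> "p2"
```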

Fig. 15 Job proxies assigned to sites over time with strategy C

Fig. 16 Number of PEND job proxies over time with strategy C

Fig. 17 CPUs provisioned to personal cluster over time with strategy C

Figure 15 shows the total number of job proxies allocated to each of the four clusters during the experiment. This time the graph shows the proxy manager at NCSA pulling job proxies over the course of the experiment, reaching a peak allocation of 14 job proxies. The proxy managers at SDSC and TACC also acquired additional job proxies but gave them up when they decided the additional job proxies could not be run. The proxy manager at CACR is seen to give up most of its job proxies, resulting in only 3 allocated job proxies over time, probably due to a higher local job load at CACR during this experiment.

Figure 16 shows the number of job proxies in the PEND state during the experiment's lifetime. The number of job proxies in the PEND state is dramatically reduced with this strategy: during the peak of the experiment all the job proxies were running. The increase in job proxies in the PEND state close to the end of the experiment is explained by the fewer jobs remaining in the system, resulting in job proxies being terminated and resubmitted back into the cluster in the PEND state.

The CPU profile of the personal cluster generated over the course of the experiment is shown in Fig. 17. The average cluster size was 952 processors.

Table 3 The MIN/MAX/MEAN turnaround times (secs) using the different strategies and running at a single site

        Strategy A   Strategy B   Strategy C   Single site (TACC)
MIN     4705         7371         6306         12976
MAX     5918         7487         6395         13090
MEAN    5423         7429         6362         13020

6.5 Experiment turnaround times

The turnaround times of the experiments using the three different proxy allocation strategies are shown in Table 3. We also provide the turnaround time of the same experiment conducted at just one site (TACC), without using our system, as a reference. The turnaround time is defined as the time from first job submission to the completion of the 1000th job in our experiment. All results are derived from four experimental runs each. The single-site run was conducted a week after the cross-site runs through the system were completed, hence the prevailing load conditions at each site may have changed in the interim. However, we feel a comparison between the results is still instructive.

Strategy A performs the best, as expected. Strategy C compares well with the overly greedy strategy A. Strategy B performs the worst amongst the three strategies. However, all three strategies provide a 40%–60% improvement in experiment turnaround time compared to simply submitting the user jobs to a single site.

Our system allows users to pick any of the three strategies studied in this section. However, strategy C (sharing a fixed allocation between sites) provides the best trade-off between space and job turnaround time. Strategy A is acceptable if sites are not impacted by the system queuing many, possibly unneeded, job proxies in the local scheduler. Where a large number of queued job proxies can affect the local schedulers, or have a psychological impact on other users of the system, strategy C is recommended. Strategy B could possibly perform better with jobs of different run times, but we can assume strategy C to be appropriate in those scenarios too.

7 Conclusion and future work

The Condor version of the system is currently in production deployment on the NSF TeraGrid [17]. Up to 100,000 jobs have been submitted through the system to date, mainly through the Caltech CMS group and the NVO project, enabling approximately 900 teraflops of scientific computation. The system enables users to benefit from a learn-once-run-anywhere submission environment, and resource providers benefit by not needing to accommodate disruptive changes in their local cluster configuration. Future versions of the system will support other workload management interfaces like PBS. Interesting open problems also remain, such as how the system can automatically determine and adjust the job proxy CPU size request to optimize personal cluster creation times and reduce resource wastage, and how parallel job ensembles can be supported, especially across sites with non-uniform interconnect latencies. Finally, the scheduling technologies developed by the Pegasus/Condor-G, Nimrod/G and APST systems can be leveraged in the future to intelligently schedule job proxies across the TeraGrid. We leave these topics as potential areas for future work.

Acknowledgements We would like to thank the anonymous reviewers for their helpful and constructive comments. We would also like to thank the TeraGrid Grid Infrastructure Group (GIG), and the resource providers' system administration staff, for useful feedback and help in getting the system into production. This work was supported in part by the U.S. National Science Foundation under Grant No. 0503697.

References

1. The NSF TeraGrid, http://www.teragrid.org
2. Walker, E., et al.: TeraGrid Scheduling Requirement Analysis Team Final Report, http://www.tacc.utexas.edu/~ewalker/sched-rat.pdf
3. The Compact Muon Solenoid Experiment, http://cmsinfo.cern.ch
4. NSF National Virtual Observatory TeraGrid Utilization Proposal to NRAC, 2004, http://us-vo.org/pubs/files/teragrid-nvo-final.pdf
5. Foster, I., Kesselman, C.: Globus: a metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11(2), 115–128 (1997)
6. Globus Toolkit, http://www.globus.org/toolkit
7. Grid Resource Allocation and Management (GRAM) component, http://www.globus.org/toolkit/gram/
8. UNICORE forum, http://www.unicore.org
9. Frey, J., Tannenbaum, T., Foster, I., Livny, M., Tuecke, S.: Condor-G: a computation management agent for multi-institutional grids. In: Proceedings 10th IEEE International Symposium on High Performance Distributed Computing, San Francisco, California, August 2001
10. Buyya, R., Abramson, D., Giddy, J.: Nimrod/G: an architecture of a resource management and scheduling system in a global computational grid. In: Proceedings of HPC Asia, pp. 283–289, May 2000
11. Casanova, H., Obertelli, G., Berman, F., Wolski, R.: The AppLeS parameter sweep template: user-level middleware for the grid. In: Proceedings of Supercomputing'00 (SC00), pp. 75–76, Nov 2000
12. TeraGrid site scheduling policies, http://www.teragrid.org/userinfo/guide_tgpolicy.html
13. National Virtual Observatory, http://www.us-vo.org
14. Wilkinson Microwave Anisotropy Probe Dataset, http://map.gsfc.nasa.gov
15. Sloan Digital Sky Survey, http://www.sdss.org
16. Litvin, V.A., Newman, H., Shevchenko, S., Wisniewski, N.: QCD jet simulation with CMS at LHC and background studies to H → γγ process. In: Proceedings of 10th International Conf. on Calorimetry in High Energy Physics (CALOR2002), Pasadena, Cal., March 2002, pp. 25–30
17. TeraGrid GridShell/Condor System, http://www.teragrid.org/userinfo/guide_jobs_gridshell.html
18. Condor, High Throughput Computing Environment, http://www.cs.wisc.edu/Condor/
19. Litzkow, M., Livny, M., Mutka, M.: Condor—a hunter of idle workstations. In: Proceedings of the International Conference on Distributed Computing Systems, June 1988, pp. 104–111
20. GridShell, http://www.gridshell.net
21. Walker, E., Minyard, T.: Orchestrating and coordinating scientific/engineering workflows using GridShell. In: Proceedings 13th IEEE International Symposium on High Performance Distributed Computing, Honolulu, Hawaii, June 2004, pp. 270–271
22. Portable Batch System, http://www.openpbs.org
23. Sun Grid Engine, http://gridengine.sunsource.net/
24. The Globus Alliance: Overview of the Grid Security Infrastructure, http://www.globus.org/security/overview.html
25. Deelman, E., Blythe, J., Gil, Y., Kesselman, C., Mehta, G., Vahi, K., Blackburn, K., Lazzarini, A., Arbree, A., Cavanaugh, R., Koranda, S.: Mapping abstract complex workflows onto grid environments. J. Grid Comput. 1(1), 25–29 (2003)

Edward Walker is a Research Associate with the Texas Advanced Computing Center at The University of Texas at Austin. He received his PhD in Computer Science from the University of York (UK) in 1994. His current research interests include high performance grid/distributed computing, parallelizing compiler design, network protocol design, and parallel programming language construction.

Jeffrey P. Gardner is a Sr. Scientific Specialist at the Pittsburgh Supercomputing Center and Adjunct Assistant Professor of Physics and Astronomy at the University of Pittsburgh. He received his PhD in Astronomy from the University of Washington in 2000. His current research ranges from computational astrophysics, with a focus on cosmology and galaxy formation, to massively parallel algorithms and application design strategies.


Vladimir Litvin is currently a Senior Engineer in High Energy Physics at the California Institute of Technology in Pasadena. He received his MSci in High Energy Physics from the Moscow Institute of Physics and Technology (Russia) in 1993. His current research interests include grid/distributed computing, hadron physics on CMS at the Large Hadron Collider, calibration of the electromagnetic calorimeter in CMS at the LHC, Higgs searches, and high mass diphoton resonance studies in CMS at the LHC.

Evan L. Turner is an HPC Specialist at the Texas Advanced Computing Center. Evan received a BSc in Computer Science from Texas A&M University, Corpus Christi. His research interests are genetic algorithms, nature-inspired computing, and distributed computing. Prior to TACC, he was with Texas A&M University, Corpus Christi, where he managed the High Performance Computing Development Center under an NSF grant.