SESSION
GRID SERVICES, SCHEDULING, AND RESOURCE MANAGEMENT + RELATED ISSUES
Chair(s)
TBA
Int'l Conf. Grid Computing and Applications | GCA'08 | 1
Network-aware Peer-to-Peer Based Grid Inter-Domain Scheduling
Agustín Caminero1,∗, Omer Rana2, Blanca Caminero1, Carmen Carrion1
1Albacete Research Institute of Informatics, University of Castilla La Mancha, Albacete (Spain); 2Cardiff School of Computer Science, Cardiff University, Cardiff (UK)
∗Corresponding author ([email protected])
Abstract — Grid technologies have enabled the aggregation of geographically distributed resources, in the context of a particular application. The network remains an important requirement for any Grid application, as entities involved in a Grid system (such as users, services, and data) need to communicate with each other over a network. The performance of the network must therefore be considered when carrying out tasks such as scheduling, migration or monitoring of jobs. Moreover, the interactions between different domains are a key issue in Grid computing, thus their effects should be considered when performing the scheduling task. In this paper, we enhance an existing framework that provides scheduling of jobs to computing resources to allow multi-domain scheduling based on peer-to-peer techniques.
Keywords: Grid computing, interdomain scheduling, peer-to-peer, network
I. Introduction
Grid computing enables the aggregation of dispersed heterogeneous resources for supporting large-scale parallel applications in science, engineering and commerce [12]. Current Grid systems are highly variable environments, made of a series of independent organizations that share their resources, creating what is known as Virtual Organizations (VOs) [13]. This variability makes Quality of Service (QoS) highly desirable, though often very difficult to achieve in practice [21]. One of the reasons for this limitation is the lack of control over the network that connects various components of a Grid system. Achieving end-to-end QoS is often difficult, as without resource reservation any guarantees on QoS are often hard to satisfy. However, for applications that need a timely response (such as collaborative visualization [15]), the Grid must provide users with some kind of assurance about the use of resources – a non-trivial subject when viewed in the context of network QoS. In a VO, entities communicate with each other using an interconnection network – resulting in the network playing an essential role in Grid systems [21].
As a VO is made of different organizations (or domains), the interactions between different domains become important when executing jobs. Hence, a user
Fig. 1. Several administrative domains.
wishing to execute a job with particular QoS constraints (such as response time) may contact a resource broker to discover suitable resources – which would need to look across multiple domains if local resources cannot be found.
Metrics related to network QoS (such as latency, bandwidth, packet loss and packet jitter) are important when performing scheduling of jobs to computing resources – in addition to the capabilities of the computing resources themselves. As mentioned above, the lack of suitable local (in the user's administrative domain) resources requires access to those from a different domain to run a job. However, the connectivity between the two domains now becomes important, and is the main emphasis of this work. Figure 1 depicts a number of administrative domains connected with each other by means of network connections. Each connection between two peers has an effective bandwidth, whose calculation will be explained in this paper. Each pair of neighbor peers may have different network paths linking them, thus we rely on networking protocols, such as the Border Gateway Protocol (BGP) [20], to decide the optimal path between two destination networks.
The main contribution of this paper is a proposal for inter-domain scheduling, which makes use of techniques used in Peer-to-Peer (P2P) systems. Also, an analytical evaluation has been performed showing the behavior of our proposal under normal network and computing resource workloads. This paper is structured as follows: Section II explains current proposals on network QoS in Grid computing and the lack of attention paid to
inter-domain scheduling. Also, existing proposals for inter-domain scheduling are reviewed. Section III explains our proposal for inter-domain scheduling. Section IV provides an evaluation, demonstrating the usefulness of our work, and Section V presents guidelines for our future work.
II. Related work
The proposed architecture supports the effective management of network QoS in a Grid system, and focuses on the interactions between administrative domains when performing the scheduling of jobs to computing resources. P2P techniques are used to decide which neighboring domain a query should be forwarded to, in the absence of suitable local resources. We will first provide a brief overview of existing proposals for managing network QoS in Grids.
The General-purpose Architecture for Reservation and Allocation (GARA) [21] provides programmers and users with convenient access to end-to-end QoS for computer applications. It provides mechanisms for making QoS reservations for different types of resources, including computers, networks, and disks. These uniform mechanisms are integrated into a modular structure that permits the development of a range of high-level services. Regarding multi-domain reservations, GARA must exist in all the traversed domains, and the user (or a broker acting on their behalf) has to be authenticated with all the domains. This makes GARA difficult to scale.
The Network Resource Scheduling Entity (NRSE) [5] suggests that signalling and per-flow state overhead can cause end-to-end QoS reservation schemes to scale poorly to a large number of users and multi-domain operations – as observed when using IntServ and RSVP, and also with GARA [5]. This has been addressed in NRSE by storing the per-flow/per-application state only at the end-sites involved in the communication. Although NRSE has demonstrated its effectiveness in providing DiffServ QoS, it is not clear how a Grid application developer would make use of this capability – especially as the application programming interface is not clearly defined [3].
Grid Quality of Service Management (G-QoSM) [3] is a framework to support QoS management in computational Grids in the context of the Open Grid Services Architecture (OGSA). G-QoSM is a generic modular system that, conceptually, supports various types of resource QoS, such as computation, network and disk storage. This framework aims to provide three main functions: 1) support for resource and service discovery based on QoS properties; 2) provision of QoS guarantees at application, middleware and network levels, and the establishment of Service Level Agreements (SLAs) to enforce QoS parameters; and 3) support for QoS management of allocated resources, on three QoS levels: 'guaranteed', 'controlled load' and 'best effort'. G-QoSM also supports adaptation strategies to share resource capacity between these three user categories.
The Grid Network-aware Resource Broker (GNRB) [2] is an entity that enhances the features of a Grid Resource Broker with the capabilities provided by a Network Resource Manager. This leads to the design and implementation of new mapping/scheduling mechanisms that take into account both network and computational resources. The GNRB, using network status information, can reserve network resources to satisfy the QoS requirements of applications. The architecture is centralized, with one GNRB per administrative domain – potentially leading to the GNRB becoming a bottleneck within the domain. Also, GNRB is a framework, and does not enforce any particular algorithm for scheduling jobs to resources.
Many of the above efforts do not take network capability into account when scheduling tasks. GARA schedules jobs by using DSRT and PBS, whilst G-QoSM uses DSRT. These schedulers (DSRT and PBS) only pay attention to the workload of the computing resource, thus a powerful but unloaded computing resource with an overloaded network could be chosen to run jobs, which decreases the performance received by users, especially when the job requires high network I/O.
Finally, VIOLA [24] provides a meta-scheduling framework with co-allocation support for both computational and network resources. It is able to negotiate with the local scheduling systems to find, and to reserve, a common time slot to execute various components of an application. The meta-scheduling service in VIOLA has been implemented via the UNICORE Grid middleware for job submission, monitoring, and control. This allows a user to describe the distribution of the parallel MetaTrace application and the requested resources using the UNICORE client, while the allocation and reservation of resources are undertaken automatically. A key feature of VIOLA is its network reservation capability, which allows the network to be treated as a resource within a meta-scheduling application. In this context, VIOLA is somewhat similar to our approach, in that it also considers the network as a key part of the job allocation process. However, the key difference is VIOLA's focus on co-allocation and reservation – which is not always possible when the network is under the ownership of a different administrator.
Choosing the most useful domain is a key issue when propagating a query to another administrative domain.
DIANA [4] performs global meta-scheduling in a local environment, typically a LAN, and utilizes meta-schedulers that work in a P2P manner. Each site has a meta-scheduler that communicates with the meta-schedulers of all other sites. DIANA has been developed to make decisions based on global information, which makes it unsuitable for realistic Grid testbeds – such as the LHC Computing Grid [1].
The Grid Distribution Manager (GridDM) is part of the e-Protein Project [18], a P2P system that performs inter-domain scheduling and load balancing within a cluster – utilizing schedulers such as SGE, Condor, etc. Similarly, Xu et al. [25] present a framework for the QoS-aware discovery of services, where the QoS is based on feedback from users. Gu et al. [14] propose a scalable aggregation model for P2P systems to automatically aggregate services into a distributed application, enabling the resulting application to meet user-defined QoS criteria.
Our proposal is based on the architecture presented in [6] and extended in [7]. This architecture provides scheduling of jobs to computing resources within one or more administrative domains. A key component is the Grid Network Broker (GNB), which provides scheduling of jobs to computing resources, taking account of network characteristics.
III. Inter-domain scheduling
The proposed architecture is shown in Figure 2 and has the following entities: users, each one with a number of jobs; computing resources, e.g. clusters of computers; routers; the GNB (Grid Network Broker), a job scheduler; the GIS (Grid Information Service), such as [11], which keeps a list of available resources; a resource monitor (for example, Ganglia [16]), which provides detailed information on the status of the resources; and the BB (Bandwidth Broker), such as [22], which is in charge of the administrative domain and has direct access to routers. The BB can be used to support reservation of network links, and can keep track of the interconnection topology between two end points within a network. A more in-depth description of the functionality of the architecture can be found in [7].
We make the following assumptions in the architecture: (1) each domain must provide the resources it announces – i.e. when a domain publishes X machines with Y speed, those machines are physically located within the domain (as opposed to the domain merely holding a pointer to where the machines are). This is used to calculate the number of hops between the user and the domain providing the resource(s); (2) the resource monitor should provide
Fig. 2. One single administrative domain.
exactly the same measurements in all the domains. Otherwise, no comparison can be made between domains.
We use Routing Indices (RIs) [10] to enable nodes to forward queries to neighbors that are more likely to have suitable resources. A node forwards the query to a subset of its neighbors, based on its local RI, rather than selecting neighbors at random or flooding the network (i.e. forwarding the query to all neighbors). This minimizes the amount of traffic generated within a P2P system.
A. Routing Indices
Routing Indices (RIs) [10] were initially developed to support document discovery in P2P systems, and they have also been used to implement a Grid information service in [19]. The goal of RIs is to help users efficiently find documents with content of interest across potential P2P nodes. The RI represents the availability of data of a specific type in the neighbor's information base. We use a version of RI called the Hop-Count Routing Index (HRI) [10], which considers the number of hops needed to reach a datum. Our implementation of HRI calculates the aggregate capability of a neighbor domain, based on the resources it contains and the effective bandwidth of the link between the two domains. More precisely, Equation (1) is applied:
I_lp = ( Σ_{i=0}^{num_machines_p} max_num_processes_i / current_num_processes_i ) × eff_bw(l, p)    (1)
where I_lp is the information that the local domain l keeps about the neighbor domain p; num_machines_p is the number of machines domain p has; current_num_processes_i is the current number of processes running in machine i; max_num_processes_i is the maximum number of processes that can be run in that machine; and eff_bw(l, p) is the effective bandwidth of the network connection between the local domain l and the peer domain p, calculated as follows. At every interval, GNBs forward a query along the path to their neighbor GNBs, asking for the number of transmitted bytes for each interface the query goes
through (the OutOctets parameter of SNMP [17]). By using two consecutive measurements (m1 and m2, where m1 shows X bytes and m2 shows Y bytes), considering the moments when they were collected (m1 at time t1 seconds and m2 at t2 seconds), and the capacity C of the link, we can calculate the effective bandwidth of each link as follows:

eff_bw(l, p) = C − (Y − X) / (t2 − t1)    (2)
The effective bandwidth of a path is the smallest effective bandwidth of the links in that path. Also, predictions of the values of the resource power and the effective bandwidth can be used, for example calculated as pointed out in [7]. As we can see, the network plays an important role when calculating the quality of a domain. Because of space limitations, we cannot provide an in-depth explanation of the formulas; see [8] for details on the terms in Equation (1).
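As an illustration, Equations (1) and (2) can be sketched in a few lines of Python. The machine counts, link capacities and counter readings below are invented for the example, and the OutOctets readings (bytes) are converted to bits so they are comparable with the link capacity:

```python
def eff_bw(capacity, x_octets, y_octets, t1, t2):
    """Equation (2): effective bandwidth (bit/s) of one link, from two
    consecutive SNMP OutOctets readings (bytes) taken at t1 and t2 (s)."""
    used = (y_octets - x_octets) * 8 / (t2 - t1)  # bits/s currently in use
    return capacity - used

def path_eff_bw(links):
    """The effective bandwidth of a path is its bottleneck link."""
    return min(eff_bw(*link) for link in links)

def information(machines, link_eff_bw):
    """Equation (1): I_lp, the information domain l keeps about neighbor p.
    machines is a list of (max_num_processes, current_num_processes)."""
    return sum(mx / cur for mx, cur in machines) * link_eff_bw

# A 1 Gbit/s link that carried 25 MB during a 2-second measurement window:
bw = eff_bw(1e9, 0, 25_000_000, 0.0, 2.0)
print(bw)  # 1e9 - 2e8/2 = 9e8 bit/s

# Neighbor p with two 100-process machines, currently running 20 and 50:
print(information([(100, 20), (100, 50)], bw))  # (5 + 2) * 9e8 = 6.3e9
```

A lightly loaded link and a powerful, lightly loaded domain both push I_lp up, which is exactly what the HRI aggregates.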
We use HRIs as described in [10]: in each peer, the HRI is represented as an M × N table, where M is the number of neighbors and N is the horizon (maximum number of hops) of our index. The n-th position in the m-th row is the quality of the domain(s) that can be reached going through neighbor m, within n hops. As an example, the HRI of peer P1 looks as shown in Table I (for the topology depicted in Figure 1), where S_x.y is the information for peers that can be reached through peer x, and are y hops away from the local peer (P1).
Peer | 1 hop | 2 hops | 3 hops
P2   | S_2.1 | S_2.2  | S_2.3
P3   | S_3.1 | S_3.2  | S_3.3

TABLE I
HRI for peer P1.
So, S_2.2 is the quality of the domain(s) which can be reached through peer P2, whose distance from the local peer is 2 hops. Each S_x.y is calculated by means of Formula (3). In this formula, d(Px, Pi) is the distance (in number of hops) between peers Px and Pi. S_x.y is calculated differently based on the distance from the local peer. When the distance is 1, then S_x.y = I^{Pl}_{Px}, because the only peer that can be reached from the local peer Pl through Px within 1 hop is Px itself. Otherwise, for those peers Pi whose distance from the local peer is y, we have to add the information that each peer Pt (which is a neighbor of Pi) keeps about them. So, the HRI of peer P1 is calculated as shown in Table II.
S_x.y = { I^{Pl}_{Px},                                                              when y = 1
        { Σ_i I^{Pt}_{Pi},  ∀Pi : d(Pl, Pi) = y ∧ d(Pl, Pt) = y − 1 ∧ d(Pt, Pi) = 1,  otherwise    (3)
Peer | 1 hop       | 2 hops                    | 3 hops
P2   | I^{P1}_{P2} | I^{P2}_{P4} + I^{P2}_{P5} | I^{P4}_{P8} + I^{P4}_{P9} + I^{P5}_{P10} + I^{P5}_{P11}
P3   | I^{P1}_{P3} | I^{P3}_{P6} + I^{P3}_{P7} | I^{P6}_{P12} + I^{P6}_{P13} + I^{P7}_{P14} + I^{P7}_{P15}

TABLE II
HRI for peer P1.
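To make the aggregation of Equation (3) concrete, the following sketch builds peer P1's HRI entries over the tree topology of Figure 1 (as reflected in Table II). The per-link information values I are invented for illustration; in the real system they would come from Equation (1) and from neighbors' advertisements:

```python
# Hypothetical tree topology rooted at P1 (as in Figure 1 / Table II).
neighbors = {
    "P1": ["P2", "P3"], "P2": ["P4", "P5"], "P3": ["P6", "P7"],
    "P4": ["P8", "P9"], "P5": ["P10", "P11"],
    "P6": ["P12", "P13"], "P7": ["P14", "P15"],
}

# I[(a, b)]: information that peer a keeps about its neighbor b (made up).
I = {("P1", "P2"): 2.0, ("P1", "P3"): 1.0,
     ("P2", "P4"): 3.0, ("P2", "P5"): 1.5,
     ("P3", "P6"): 0.5, ("P3", "P7"): 2.5,
     ("P4", "P8"): 1.0, ("P4", "P9"): 2.0,
     ("P5", "P10"): 0.5, ("P5", "P11"): 1.0,
     ("P6", "P12"): 1.5, ("P6", "P13"): 0.5,
     ("P7", "P14"): 2.0, ("P7", "P15"): 1.0}

def S(x, y):
    """Equation (3): quality of domains reachable through neighbor x of P1,
    exactly y hops away from P1."""
    if y == 1:
        return I[("P1", x)]
    # Walk down to the peers at distance y-1 through x, then sum what
    # each of them keeps about its own neighbors (distance y).
    frontier = [x]
    for _ in range(y - 2):
        frontier = [c for p in frontier for c in neighbors.get(p, [])]
    return sum(I[(p, c)] for p in frontier for c in neighbors.get(p, []))

print(S("P2", 1))  # I[P1,P2] = 2.0
print(S("P2", 2))  # I[P2,P4] + I[P2,P5] = 4.5
```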
In order to use RIs, a key component is the goodness function [10]. The goodness function decides how good each neighbor is by considering the HRI and the distance between neighbors. More concretely, our goodness function can be seen in Equation (4):
goodness(p) = Σ_{j=1..H} S_p.j / F^{j−1}    (4)
In Equation (4), p is the peer domain to be considered; H is the horizon for the HRIs; and F is the fanout of the topology. As [10] explains, the horizon is the limit distance: peers whose distance from the local peer is greater than the horizon are not considered. Meanwhile, the fanout of the topology is the maximum number of neighbors of a peer.
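As an illustration, Equation (4) is a one-line computation; the S values, horizon and fanout below are invented for the example:

```python
def goodness(S_row, F):
    """Equation (4): S_row[j-1] holds S_p.j for j = 1..H; F is the fanout.
    Nearer hops are weighted more heavily (divided by F^(j-1))."""
    return sum(s / F ** (j - 1) for j, s in enumerate(S_row, start=1))

# Horizon H = 3, fanout F = 2, illustrative S_p.1, S_p.2, S_p.3 values:
print(goodness([4.0, 6.0, 8.0], 2))  # 4/1 + 6/2 + 8/4 = 9.0
```

Dividing by F^(j−1) discounts distant domains: an attractive domain 3 hops away contributes less than an equally attractive direct neighbor.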
B. Search technique
In the literature, several techniques are used for searching in P2P networks, including flooding (e.g. Gnutella) and centralized servers (e.g. Napster). More effective searches are performed by systems based on distributed indices. In these configurations, each node holds a part of the index. The index optimizes the probability of quickly finding the requested information by keeping track of the availability of data at each neighbor.
Algorithm 1 shows the way our architecture performs the scheduling of jobs to computing resources. In our system, when a user wants to run a job, he/she submits a query to the GNB of the local domain. This query is stored (line 7) when it first arrives at a GNB. Subsequently, the GNB looks for a computing resource in the local domain matching the requirements of the query (line 9). If the GNB finds a computing resource in the local domain that matches the requirements, then it tells the user to use that resource to run the job (line 22). Otherwise, the GNB will forward the query to the GNB of one of the neighbor domains.
This neighbor domain will be chosen based on the Hop-Count Routing Index, HRI, explained before (line 13). The parameter ToTry is used to decide which neighbor should be contacted next (in Figure 3, p3 will contact p6); if the query is bounced back, then the 2nd best neighbor will be contacted (p3 will contact peer p7), and so on. Hence, a neighbor domain will only be contacted if there are no local computing resources available to fulfill the query (e.g. able to finish the job before the deadline expires).
Algorithm 1 Searching algorithm.
1: Let q = new incoming query
2: Let LocalResource = a resource in the local domain
3: Let NextBestNeighbor = a neighbor domain selected by the goodness function
4: Let ToTry = the next neighbor domain to forward the query to
5: for all q do
6:   LocalResource := null
7:   if (QueryStatus(q) = not present) then
8:     QueryStatus(q) := 1
9:     LocalResource := MatchQueryLocalResource(q)
10:  end if
11:  if (LocalResource == null) then
12:    ToTry := QueryStatus(q)
13:    NextBestNeighbor := HRI(q, ToTry)
14:    if (NextBestNeighbor == null) then
15:      Recipient := Sender(q)
16:    else
17:      Recipient := NextBestNeighbor
18:      QueryStatus(q) += 1
19:    end if
20:    ForwardQueryToRecipient(q, Recipient)
21:  else
22:    SendResponseToRequester(q)
23:  end if
24: end for
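The control flow of Algorithm 1 can be sketched as runnable Python. The entities are simplified stand-ins, not the framework's real API: a resource is reduced to a CPU speed, a query to a minimum-speed requirement, and the HRI ranking to a best-first list of neighbors:

```python
def handle_query(q, gnb, query_status):
    """One GNB step of Algorithm 1. Returns ('respond', resource),
    ('forward', peer) or ('bounce', sender)."""
    if q["id"] not in query_status:                   # lines 7-9: first arrival
        query_status[q["id"]] = 1
        local = next((r for r in gnb["resources"] if r >= q["min_speed"]), None)
        if local is not None:
            return ("respond", local)                 # line 22
    to_try = query_status[q["id"]]                    # line 12
    neighbors = gnb["ranked_neighbors"]               # best-first, via the HRI
    if to_try > len(neighbors):
        return ("bounce", q["sender"])                # lines 14-15: no one left
    query_status[q["id"]] += 1                        # line 18: next-best later
    return ("forward", neighbors[to_try - 1])         # lines 13, 20

status = {}
gnb = {"resources": [1.0], "ranked_neighbors": ["p6", "p7"]}
q = {"id": "q1", "min_speed": 2.0, "sender": "user"}
print(handle_query(q, gnb, status))  # ('forward', 'p6')
print(handle_query(q, gnb, status))  # bounced back: ('forward', 'p7')
print(handle_query(q, gnb, status))  # ('bounce', 'user')
```

As in the text, p6 is tried first; each bounce advances ToTry to the next-best neighbor until the query is sent back to its sender.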
IV. Evaluation
In order to illustrate the behavior of our design, we present an evaluation showing how our HRIs evolve when varying the measurements. We use the topology presented in Figure 3. Only data from peer p1 are shown. For simplicity and ease of explanation, we assume that the bandwidth of all links is 1 Gbps, and that all the peers manage a single computational resource with 4 GB of memory and a CPU speed of 1 GHz.
For Equation (1), we have approximated the values of the current number of processes as a uniform distribution between 10 and 100, and the maximum number of processes as 100. Regarding eff_bw(l, p), we have considered a Poisson distribution for those links that are heavily loaded, and a Weibull distribution for those links which are not so loaded, as [9] suggests. In Figure 3,
Fig. 3. A query (Q) is forwarded from p1 to the best neighbors (p3,p6, and p7).
links with even-numbered labels are heavily used, and are depicted with a thicker line.
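A sketch of how such synthetic measurements could be drawn (the paper's exact simulation setup may differ, and the Weibull shape parameter below is an assumption): the current number of processes is sampled uniformly, while packet inter-arrival times are exponential (a Poisson arrival process) on loaded links and Weibull on lightly loaded ones, following [9]:

```python
import random

random.seed(1)  # reproducible example

def sample_processes():
    """Current number of processes, approximated as Uniform(10, 100)."""
    return random.uniform(10, 100)

def sample_inter_arrival(heavily_loaded):
    """Packet inter-arrival time (seconds) for one link."""
    if heavily_loaded:
        return random.expovariate(1 / 0.000015)   # Poisson process, mean mu
    return random.weibullvariate(0.00012, 1.5)    # scale beta; shape assumed

# 7 days of measurements, one every 30 minutes = 336 samples per series.
samples = [sample_processes() for _ in range(7 * 48)]
print(len(samples))  # 336
```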
Fig. 4. Use of a not heavily loaded link (Weibull distribution).

Fig. 5. Use of a heavily loaded link (Poisson distribution).
To calculate the parameters for these distributions (the mean μ for the Poisson distribution, and the scale β and shape α for the Weibull distribution), we have considered that the level of use of heavily used links is 80%, whilst not heavily used links exhibit 10% usage. This way, if a heavily used link transmits 800 Mb in 1 second, and the maximum transfer unit of the links is 1500 bytes, the inter-arrival time for packets is 0.000015 seconds. This is the value for the μ of the Poisson distribution. In the same way, we calculate the value of the β parameter of the Weibull distribution, obtaining 0.00012 seconds. We can now calculate the inter-arrival time for packets and the effective bandwidth.
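This parameter calculation can be worked through directly (1500-byte packets on 1 Gbit/s links, at 80% and 10% load respectively):

```python
MTU_BITS = 1500 * 8   # maximum transfer unit, in bits
CAPACITY = 1e9        # link capacity, bit/s

def inter_arrival(load_fraction):
    """Mean packet inter-arrival time (s) for a link at the given load."""
    return MTU_BITS / (CAPACITY * load_fraction)

mu = inter_arrival(0.80)    # heavily used link  -> Poisson mean
beta = inter_arrival(0.10)  # lightly used link  -> Weibull scale
print(mu, beta)  # 1.5e-05 0.00012
```

These match the values quoted in the text: 0.000015 s for μ and 0.00012 s for β.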
We have simulated a measurement period of 7 days, with measurements collected every 30 minutes. Figures 4, 5 and 6 present the variation in the use of links and the number of processes, following the
Fig. 6. Variation of the number of processes (Uniform distribution).
mathematical distributions explained before. Figures 4 and 5 represent the level of use of links compared to the actual bandwidth (1 Gbps), per measurement. Heavily used links exhibit higher used bandwidth than not heavily used links. The data shown in these figures are used by our HRIs in order to decide where to forward a query.
Fig. 7. S_2.1 = I^{p1}_{p2} (link p1−p2 is not heavily loaded).
Fig. 8. S_3.1 = I^{p1}_{p3} (link p1−p3 is heavily loaded).
Fig. 9. S_2.2 (S_3.2 would also look like this).
Figures 7, 8 and 9 present the variation of S_x.y for both heavily and not heavily loaded links. These figures have been calculated by means of the formulas explained in Section III-A, applied to the mathematical distributions mentioned above. As explained in Tables I and II, S_2.1 = I^{p1}_{p2} and S_3.1 = I^{p1}_{p3}. We can see that the network performance affects the HRI, as was expected. Recall that the higher the HRI, the better, because it means that the peer is powerful and well connected. Also, we see that when the link is not heavily loaded, S takes higher values, which are more scattered across the figure. By contrast, when the link is heavily loaded, more values are grouped together at the bottom of the figure. Also, for Figure 9, S_2.2 = I^{P2}_{P4} + I^{P2}_{P5} and S_3.2 = I^{P3}_{P6} + I^{P3}_{P7}, which means that both heavily and not heavily used links are involved in calculating S_2.2 and S_3.2.
Figures 10 and 11 show the variation of the goodness function for both neighbors of peer p1. Recall that the link between p1 and p2 is unloaded, and the link between p1 and p3 is loaded. These facts are reflected in both goodness functions: in the case of p2, the function shows higher values than the goodness function for p3. It can also be seen that the function of p2 has fewer values grouped near the zero axis. To summarize, a job originating in p1 will more likely be scheduled through peer p2 than through peer p3, as expected given the link conditions.
Fig. 10. Goodness function for peer p2 (link p1−p2 unloaded).

Fig. 11. Goodness function for peer p3 (link p1−p3 loaded).
V. Conclusions and future work
The network remains an important requirement for any Grid application, as entities involved in a Grid system (such as users, services, and data) need to communicate with each other over a network. The performance of the network must therefore be considered when carrying out tasks such as scheduling, migration or monitoring of jobs. Also, inter-domain relations are key in Grid computing.
We propose an extension to an existing scheduling framework to allow network-aware multi-domain scheduling based on P2P techniques. More precisely,
our proposal is based on Routing Indices (RIs). In this way we allow nodes to forward queries to neighbors that are more likely to have answers. If a node cannot find a suitable computing resource for a user's job within its domain, it forwards the query to a subset of its neighbors, based on its local RI, rather than selecting neighbors at random or flooding the network by forwarding the query to all neighbors.
Our approach will be evaluated further using the GridSim simulation toolkit [23]. In this way, we will be able to study how the proposed technique behaves in complex scenarios, in a repeatable and controlled manner.
Acknowledgement
Work jointly supported by the Spanish MEC and European Commission FEDER funds under grants "Consolider Ingenio-2010 CSD2006-00046" and "TIN2006-15516-C04-02"; jointly by JCCM and Fondo Social Europeo under grant "FSE 2007-2013"; and by JCCM under grants "PBC-05-007-01" and "PBC-05-005-01".
References
[1] LCG (LHC Computing Grid) Project. Web Page, 2008. http://lcg.web.cern.ch/LCG.
[2] D. Adami et al. Design and implementation of a grid network-aware resource broker. In Intl. Conf. on Parallel and Distributed Computing and Networks, Innsbruck, Austria, 2006.
[3] R. Al-Ali et al. Network QoS Provision for Distributed Grid Applications. Intl. Journal of Simulations Systems, Science and Technology, Special Issue on Grid Performance and Dependability, 5(5), December 2004.
[4] A. Anjum, R. McClatchey, H. Stockinger, A. Ali, I. Willers, M. Thomas, M. Sagheer, K. Hasham, and O. Alvi. DIANA scheduling hierarchies for optimizing bulk job scheduling. In Second Intl. Conference on e-Science and Grid Computing, Amsterdam, Netherlands, 2006.
[5] S. Bhatti, S. Sørensen, P. Clark, and J. Crowcroft. Network QoS for Grid Systems. The Intl. Journal of High Performance Computing Applications, 17(3), 2003.
[6] A. Caminero, C. Carrion, and B. Caminero. Designing an entity to provide network QoS in a Grid system. In 1st Iberian Grid Infrastructure Conference (IberGrid), Santiago de Compostela, Spain, 2007.
[7] A. Caminero, O. Rana, B. Caminero, and C. Carrion. An Autonomic Network-Aware Scheduling Architecture for Grid Computing. In 5th Intl. Workshop on Middleware for Grid Computing (MGC), Newport Beach, USA, 2007.
[8] A. Caminero, O. Rana, B. Caminero, and C. Carrion. Providing network QoS support in Grid systems by means of peer-to-peer techniques. Technical Report DIAB-08-01-1, Dept. of Computing Systems, Univ. of Castilla La Mancha, Spain, January 2008.
[9] J. Cao, W. Cleveland, D. Lin, and D. Sun. Nonlinear Estimation and Classification, chapter Internet traffic tends toward Poisson and independent as the load increases. Springer Verlag, New York, USA, 2002.
[10] A. Crespo and H. Garcia-Molina. Routing Indices For Peer-to-Peer Systems. In Intl. Conference on Distributed Computing Systems (ICDCS), Vienna, Austria, 2002.
[11] S. Fitzgerald, I. Foster, C. Kesselman, G. von Laszewski, W. Smith, and S. Tuecke. A directory service for configuring high-performance distributed computations. In 6th Symposium on High Performance Distributed Computing (HPDC), Portland, USA, 1997.
[12] I. Foster and C. Kesselman. The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 2nd edition, 2003.
[13] I. T. Foster. The anatomy of the Grid: Enabling scalable virtual organizations. In 1st Intl. Symposium on Cluster Computing and the Grid (CCGrid), Brisbane, Australia, 2001.
[14] X. Gu and K. Nahrstedt. A Scalable QoS-Aware Service Aggregation Model for Peer-to-Peer Computing Grids. In 11th Intl. Symposium on High Performance Distributed Computing (HPDC), Edinburgh, UK, 2002.
[15] F. T. Marchese and N. Brajkovska. Fostering asynchronous collaborative visualization. In 11th Intl. Conference on Information Visualization, Washington DC, USA, 2007.
[16] M. L. Massie, B. N. Chun, and D. E. Culler. The Ganglia distributed monitoring system: design, implementation, and experience. Parallel Computing, 30(5-6):817–840, 2004.
[17] K. McCloghrie and M. T. Rose. Management Information Base for Network Management of TCP/IP-based internets: MIB-II. Internet proposed standard RFC 1213, March 1991.
[18] A. O'Brien, S. Newhouse, and J. Darlington. Mapping of Scientific Workflow within the E-Protein Project to Distributed Resources. In UK e-Science All-hands Meeting, Nottingham, UK, 2004.
[19] D. Puppin, S. Moncelli, R. Baraglia, N. Tonellotto, and F. Silvestri. A Grid Information Service Based on Peer-to-Peer. In 11th Intl. Euro-Par Conference, Lisbon, Portugal, 2005.
[20] Y. Rekhter, T. Li, and S. Hares. A Border Gateway Protocol 4 (BGP-4). Internet proposed standard RFC 4271, January 2006.
[21] A. Roy. End-to-End Quality of Service for High-End Applications. PhD thesis, Dept. of Computer Science, Univ. of Chicago, 2001.
[22] S. Sohail, K. B. Pham, R. Nguyen, and S. Jha. Bandwidth Broker Implementation: Circa-Complete and Integrable. Technical report, School of Computer Science and Engineering, The University of New South Wales, 2003.
[23] A. Sulistio, G. Poduval, R. Buyya, and C.-K. Tham. On incorporating differentiated levels of network service into GridSim. Future Generation Computer Systems, 23(4), May 2007.
[24] O. Waldrich, P. Wieder, and W. Ziegler. A Meta-scheduling Service for Co-allocating Arbitrary Types of Resources. In 6th Intl. Conference on Parallel Processing and Applied Mathematics (PPAM), Poznan, Poland, 2005.
[25] D. Xu, K. Nahrstedt, and D. Wichadakul. QoS-Aware Discovery of Wide-Area Distributed Services. In 1st Intl. Symp. on Cluster Comp. and the Grid (CCGrid), Brisbane, Australia, 2001.
Using a Web-based Framework to Manage Grid Deployments.
Georgios Oikonomou1 and Theodore Apostolopoulos1
1Department of Informatics, Athens University of Economics and Business, Athens, Greece
Abstract - WebDMF is a Web-based Framework for the Management of Distributed services. It is based on the Web-based Enterprise Management (WBEM) standards family and introduces a middleware layer of entities called “Representatives”. Details related to the managed application are detached from the representative logic, making the framework suitable for a variety of services. WebDMF can be integrated with existing WBEM infrastructures and is complementary to web service-based management efforts. This paper describes how the framework can be used to manage grids without modifications to existing installations. It compares the proposed solution with other research initiatives. Experiments on an emulated network topology indicate its viability.
Keywords: WebDMF, Grid Management, Distributed Services Management, Web-based Enterprise Management, Common Information Model.
1 Introduction

During the past decades, the landscape of computing and
networking has undergone revolutionary changes. From the era of single, centralised systems we are steadily moving to an era of highly decentralised, interconnected nodes that share resources in order to provide services transparently to the end user.
Traditionally, legacy management approaches such as the Simple Network Management Protocol (SNMP) [1] targeted single nodes. The current paradigm presents new challenges and increases complexity in the area of network and systems management. There is a need for solutions that view a distributed deployment as a whole, instead of as a set of isolated hosts.
The Web-based Distributed Management Framework (WebDMF) is the result of our work detailed in [2]. It is a framework for the management of distributed services and uses standard web technologies. Its core is based on the Web-based Enterprise Management (WBEM) family of specifications [3], [4], [5]. It is not limited to monitoring but is also capable of modifying the run-time parameters of
the managed service. Finally, it has a wide target group. It can perform the management of a variety of distributed systems, such as distributed file systems, computer clusters and computational or data grids. However, multiprocessor, multi-core, parallel computing and similar systems are considered out of the scope of our work, even though they are very often referred to as “distributed”. The main contribution of this paper is three-fold:
• We demonstrate how a WebDMF deployment can be used for the management of a grid, without any modification to existing WBEM management infrastructures.
• We provide indications for the viability of the approach through a preliminary performance evaluation.
• We show that WebDMF does not compete with emerging Web Service-based grid management initiatives; instead, it is a step in the same direction.

Section 2 summarizes some recent approaches in the field of grid management and compares our work with those efforts. In order to familiarize the reader with some basic concepts, section 3 presents a short introduction to the WBEM family of standards. In section 4 we briefly describe WebDMF's architecture and some implementation details. In the same section we demonstrate how the framework can be used to manage grids. Finally, we discuss the relationship between WebDMF and Web Service-based management and present some preliminary evaluation results. Section 5 presents our conclusions.
2 Related Work – Motivation

In this section we outline some of the research initiatives in the field of grid management. The brief review is limited to the most recent ones.
2.1 Related Work

An important approach is the one proposed by the Open Grid Forum (OGF). OGF's Grid Monitoring Architecture (GMA) uses an event producer – event consumer model to monitor grid resources [6]. However, as the name suggests, GMA is limited to monitoring. It lacks active management and configuration capabilities.
gLite is a grid computing middleware developed as part of the Enabling Grids for E-sciencE (EGEE) project. gLite implements an "Information and Monitoring Subsystem" called R-GMA (Relational GMA), a modification of OGF's GMA; it therefore also serves only monitoring purposes [7].
The Unified Grid Management and Data Architecture (UGanDA) is an enterprise level workflow and grid management system [8]. It contains a grid infrastructure manager called MAGI. MAGI has many features but is limited to the management of UGanDA deployments.
MRF is a Multi-layer resource Reconfiguration Framework for grid computing [9]. It has been implemented on a grid-enabled Distributed Shared Memory (DSM) system called Teamster-G [10].
MonALISA stands for “Monitoring Agents using a Large Integrated Services Architecture”. It “aims to provide a distributed service architecture which is used to collect and process monitoring information” [11]. Many Globus deployments use MonALISA to support management tasks. Again, the lack of capability to modify the running parameters of the managed resource is notable.
Finally, we should mention emerging service-based management initiatives, such as the Web Services Distributed Management (WSDM) [12] standard and the Web Services for Management (WS-Man) specification [13]. Due to their importance, they are discussed in greater detail in section 4 of this paper.
2.2 Motivation

Table I compares WebDMF with the solutions presented above. For this comparison we consider three factors:

• The ability to perform monitoring.
• Whether the approach can actively modify the grid's run-time parameters.
• Whether the approach is generic or focuses on infrastructures implemented using a specific technology.

TABLE I. COMPARING WEBDMF WITH OTHER GRID MANAGEMENT SOLUTIONS.

Name               Monitoring  Set  Target Group
OGF's GMA          Y                Wide
gLite – R-GMA      Y                Focused
UGanDA – MAGI      Y           Y    Focused
MRF – Teamster-G   Y           Y    Focused
MonALISA           Y                Wide
WebDMF             Y           Y    Wide

Our motivation in designing WebDMF was to provide a framework generic enough to manage grid deployments regardless of the technology used to implement their infrastructure. At the same time, it should not be limited to monitoring but also provide "set" capabilities. Other advantages are:
• It is based on WBEM, a family of open standards.
• WBEM allows easy integration with web service-based management approaches.
• WBEM has been considered adequate for the management of applications, as opposed to other approaches (e.g. SNMP) that focus on the management of devices.
• It provides interoperability with existing WBEM-based management infrastructures.
3 Web-based Enterprise Management

Web-Based Enterprise Management (WBEM) is a set of specifications published by the Distributed Management Task Force (DMTF). A large number of companies are also involved in this ongoing management initiative. This section presents a brief introduction to the WBEM family of standards.
Fig. 1 displays the three core WBEM components. The “Common Information Model” (CIM) is a set of specifications for the modeling of management data [3]. It is an object-oriented, platform-independent model maintained by the DMTF. It includes a “core schema” with definitions that apply to all management areas. It also includes a set of “common models” that represent common management areas, such as networks, hardware, software and services. Finally, the CIM allows manufacturers to define technology-specific “extension schemas” that directly suit the management needs of their implementations.
Fig. 1. The three core WBEM components, collectively known as CIM-XML: the Common Information Model (data model), CIM in XML (encoding) and CIM over HTTP (transport).
For the interaction between WBEM entities (clients and managed elements), WBEM uses a set of well-defined
request and response data packets. CIM elements are encoded in XML in accordance with the xmlCIM specification [4]. The resulting XML document is then transmitted over a network as the payload of an HTTP message. This transport mechanism is called “CIM Operations over HTTP” [5].
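The layering just described (CIM elements encoded per xmlCIM, then shipped as the payload of an HTTP POST) can be illustrated with a short sketch. The XML below is deliberately simplified and abbreviated, not a literal DSP0201/DSP0200 payload; the function name and exact attribute set are illustrative assumptions.

```python
# Schematic sketch of the CIM-XML-over-HTTP layering. The element names
# follow the general shape of xmlCIM documents, but this is NOT a
# verbatim DSP0201 payload -- attributes are abbreviated for clarity.
from xml.sax.saxutils import escape

def build_cim_xml_request(method, namespace, class_name):
    """Encode a CIM intrinsic method call as a (simplified) XML document."""
    return (
        '<?xml version="1.0" encoding="utf-8"?>'
        '<CIM CIMVERSION="2.0" DTDVERSION="2.0">'
        '<MESSAGE ID="1001" PROTOCOLVERSION="1.0"><SIMPLEREQ>'
        f'<IMETHODCALL NAME="{escape(method)}">'
        f'<LOCALNAMESPACEPATH>{escape(namespace)}</LOCALNAMESPACEPATH>'
        f'<IPARAMVALUE NAME="ClassName"><CLASSNAME NAME="{escape(class_name)}"/>'
        '</IPARAMVALUE></IMETHODCALL></SIMPLEREQ></MESSAGE></CIM>'
    )

# The resulting document would then travel to the CIMOM as the body of an
# HTTP POST (e.g. to /cimom), which is the "CIM Operations over HTTP" step.
payload = build_cim_xml_request("EnumerateInstances", "root/cimv2",
                                "Linux_OperatingSystem")
```

The point of the sketch is only the separation of concerns: the model (CIM), the encoding (XML), and the transport (HTTP) are independent layers.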
WBEM follows the client-server paradigm. The WBEM client corresponds to the term “management station” used in other management architectures. A WBEM server is made up of components as portrayed in Fig. 2.
Fig. 2. WBEM instrumentation.
The WBEM client does not have direct access to the managed resources. Instead, it sends requests to the CIM Object Manager (CIMOM), using CIM over HTTP. The CIMOM handles all communication with the client, delegates requests to the appropriate providers and returns responses.
Providers act as plugins for the CIMOM. They are responsible for the actual implementation of the management operations for a managed resource; therefore, providers are implementation-specific. The repository is the part of the WBEM server that stores the definitions of the core, common and extension CIM schemas.
A significant number of vendors have started releasing WBEM products. The SBLIM open source project offers a suite of WBEM-related tools. Furthermore, OpenPegasus, OpenWBEM and WBEMServices are some noteworthy, open source CIMOM implementations. There are also numerous commercial solutions.
4 WebDMF: Web-based Management of Distributed Services

In this section we introduce the concept and design of the WebDMF management framework and present some implementation details. Due to length restrictions we cannot provide in-depth technical design details. We explain how the framework can be used to manage grid deployments. The section continues with a discussion of the relationship between WebDMF and Web Service-based management, and concludes with a preliminary performance evaluation indicating the viability of the approach.
4.1 Design

WebDMF stands for Web-based Distributed Management Framework. It treats a distributed system as a number of host nodes, interconnected over a network, that share resources to provide services to the end user. The framework aims to provide management facilities for these nodes; by managing them, we manage the entire deployment.
The architecture is based on the WBEM family of technologies. Nodes function as WBEM entities: clients, servers, or both, depending on their role in the deployment. The messages exchanged between nodes are CIM-XML messages.
WebDMF’s design introduces a middleware layer of entities that we call “Management Representatives”. They act as peers and form a management overlay network. This new layer of nodes is integrated with the existing WBEM-based management infrastructure. Representatives act as intermediaries between existing WBEM clients and CIM Object Managers. In our work we use the terms “Management” and “Service” node when referring to those entities. This resembles the “Manager of Managers” (MoM) approach. However, in MoM there is no direct communication between domain managers. Representatives are aware of the existence of their peers. Therefore, WebDMF adopts the “Distributed Management” approach. By distributing management over several nodes throughout the network, we can increase reliability, robustness and performance, while network communication and computation costs decrease [14]. Fig. 3 displays the three management entities mentioned above, forming a very simple topology.
A “Management Node” is a typical WBEM client. It is used to monitor and configure the various operational parameters of the distributed service. Any existing WBEM client software can be used without modifications.
A “Service Node” is the term used when referring to any node – member of the distributed service. For instance, in the case of a data grid, the term would be used to describe a storage device. Similarly, in a computational grid, the term can describe an execution host. As stated
previously, the role of a node in a particular grid deployment does not affect the design of our framework.
Fig. 3. Management entities.
Typically, a Service Node executes an instance of the (distributed) managed service. As displayed in Fig. 4 (a), a WBEM request is received by the CIMOM on the Service Node. A provider specifically written for the service handles the execution of the management operation. The existence of such a provider is a requirement. In other words, the distributed service must be manageable through WBEM. Alternatively, a service may be manageable through SNMP, as shown in Fig. 4 (b). In such a case the node may still participate in WebDMF deployments but some functional restrictions will apply.
Fig. 4. Service node.
The framework introduces an entity called the "Management Representative". This entity receives requests from a WBEM client and performs management actions on the relevant service nodes. After a series of message exchanges, it responds to the initial request. A representative is more than a simple proxy that receives and forwards requests; it performs a number of other operations, including the following:
• Exchanges messages with other representatives regarding the state of the system as a whole.
• Keeps a record of Service Nodes that participate in the deployment.
• Redirects requests to other representatives.

Fig. 5 displays the generic case of a distributed deployment. Communication between representatives is also performed over WBEM.
Fig. 5. A generic deployment.
The initial requests do not state explicitly which service nodes are involved in the management task. The decision about the destination of the intermediate message exchange is part of the functionality implemented in the representative. The message exchange is transparent to the management node and the end user.
In order to achieve the above functionality, a representative is further split into building blocks, as shown in Fig. 6. It can act as a WBEM server as well as a client. Initial requests are received by the CIMOM on the representative. They are delegated to the WebDMF provider module for further processing. The module performs the following functions:
Fig. 6. WebDMF representative.
• Determines whether the request can be served locally.
• If the node cannot serve the request directly, it selects the appropriate representative and forwards the request to it.
• If the request can be served locally, the representative creates a list of service nodes that should be contacted and issues intermediate requests.
• It processes intermediate responses and generates the final response.
• Finally, it maintains information about the distributed system's topology.

In some situations, a service node does not support
WBEM but is only manageable through SNMP. In this case, the representative attempts to perform the operation using SNMP methods, based on a set of WBEM-to-SNMP mapping rules. This approach has limitations, since not all methods can be mapped; even so, the legacy service node can still participate in the deployment.
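The routing behaviour described above (serve a request locally or forward it to the responsible peer, with an SNMP fallback for legacy nodes) might be sketched as follows. All names and data layouts here are hypothetical illustrations, not WebDMF's actual interfaces.

```python
# Hypothetical sketch of a representative's dispatch logic: forward
# out-of-domain requests to the responsible peer representative, otherwise
# fan out intermediate requests to the domain's service nodes, falling
# back to SNMP for nodes without WBEM instrumentation.

def handle_request(request, local_domain, peers, service_nodes):
    """Serve a request locally, or forward it to the responsible peer."""
    if request["domain"] != local_domain:
        # Request targets another domain: redirect to its representative.
        return {"forwarded_to": peers[request["domain"]]}
    intermediate = []
    for node in service_nodes:
        if node["wbem"]:
            intermediate.append(("CIM-XML", node["name"]))   # WBEM request
        else:
            # Legacy node: apply the WBEM-to-SNMP mapping rules instead.
            intermediate.append(("SNMP", node["name"]))
    # Intermediate responses would be aggregated into the final response.
    return {"served_locally": intermediate}
```

For example, a request tagged for another domain returns a forwarding decision, while an in-domain request yields one intermediate (protocol, node) action per service node.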
In a WebDMF deployment, a representative is responsible for the management of a group of service nodes; we use the term "Domain" for such groups. Domains are organized in a hierarchical structure whose top level (the root node of the tree) corresponds to the entire deployment. The domain hierarchy of each individual deployment can be designed according to a variety of criteria; for example, a system might be separated into domains based on the geographical location of nodes. WebDMF defines two categories of management operations: i) Horizontal (Category A) and ii) Vertical (Category B).
Horizontal Operations enable management of the WebDMF overlay network itself. Those functions can, for example, be used to perform topology changes. The message exchange that takes place does not involve Service Nodes. Therefore, the managed service is not affected in any way.
On the other hand, vertical operations read and modify the CIM schema on the Service Node, thus achieving management of the target application. Typical examples include:
• Setting new values on CIM objects of many service nodes.
• Reading operational parameters from service nodes and reporting an aggregate (e.g. sum or average).

In line with the above, we have designed two CIM
Schemas for WebDMF, the core schema (“WebDMF_Core”) and the request factory. They both reside on the representatives’ repositories. The former schema models the deployment’s logical topology, as discussed earlier. It corresponds to horizontal functions.
The latter schema is represented by the class diagram in Fig. 7 and corresponds to vertical functions. Users
can call WBEM methods on instances of this schema. In doing so, they can define the management operations that they wish to perform on the target application. Each request towards the distributed deployment is treated as a managed resource itself. For example, a user can create a new request. They can execute it periodically and read the results. They can modify it, re-execute it and finally delete it. Each request is mapped by the representative to intermediate WBEM requests issued to service nodes.
Fig. 7. Request Factory CIM Schema.
Request factory classes are generic: they are not related in any way to the CIM schema of the managed application. This makes WebDMF appropriate for managing a wide variety of services. Furthermore, it needs no re-configuration when the target schema is modified.
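The "request as managed resource" lifecycle described above (create a request, execute it, read its results, modify or delete it) can be sketched in miniature. All class and method names here are hypothetical, chosen to mirror WBEM's CreateInstance/DeleteInstance vocabulary; they are not the actual WebDMF request-factory schema.

```python
# Illustrative sketch: each management request is itself a managed
# resource with its own lifecycle, held by the representative.
import itertools

class RequestFactory:
    _ids = itertools.count(1)   # instance identifiers for created requests

    def __init__(self):
        self.requests = {}      # rid -> request state in the repository

    def create_instance(self, target_class, aggregate):
        """CreateInstance(): register a new vertical request."""
        rid = next(self._ids)
        self.requests[rid] = {"class": target_class,
                              "aggregate": aggregate, "results": []}
        return rid

    def execute(self, rid, reported_values):
        """Fan out intermediate requests (simulated) and store the aggregate."""
        self.requests[rid]["results"].append(sum(reported_values))

    def delete_instance(self, rid):
        """DeleteInstance(): remove the request resource."""
        del self.requests[rid]
```

A user would create a request once, execute it periodically (each execution appending a new aggregate result), and delete it when done.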
4.2 Implementation

The WebDMF representative is implemented as a single shared object library file (.so). It comprises a set of WBEM providers, each of which implements the management operations for a class of the WebDMF schemas. The interface between the CIMOM and the providers complies with the Common Manageability Programming Interface (CMPI). The providers themselves are written in C++; this does not break CIMOM independence, as described in [15]. The representative was developed on Linux 2.6.20 machines, using gcc 4.1.2 and version 2.17.50 of binutils, and has been tested with version 2.7.0 of the OpenPegasus CIMOM.
4.3 Using WebDMF to Manage Grids

In a grid environment, a service node can potentially be an execution host, a scheduler, a meta-scheduler or a
resource allocation host. This list is non-exhaustive; the role of a node does not affect the design of our framework.
What we need is a CIM schema and the relevant providers that can implement WBEM management for the service node. Such schemas and providers do exist. For example, an architecture for flexible monitoring of various WSRF-based grid services is presented in [16]. This architecture uses a WBEM provider that communicates with WSRF hosts and gathers status data. In a WebDMF deployment, we could have many such providers across various domains. Each would reside on a service node and monitor the managed application.
4.4 WebDMF and Web Services

The grid community has been working for more than
five years to transform grid computing systems into a group of web service-based building blocks. In line with this effort, management of the resulting infrastructures has also moved towards web service-based approaches. The recent OASIS Web Services Distributed Management (WSDM) [12] standard and the DMTF Web Services for Management (WS-Man) specification [13] have been considered enablers of this vision.
WebDMF adopts a resource-centric approach. This may seem a step in the opposite direction; it is not. We consider web service-based approaches a necessary and valuable effort. However, service-oriented management approaches are model-agnostic: they do not define the properties, operations, relationships, and events of managed resources [12]. Two important reasons why we chose WBEM for the resource layer are the following:
• WS-Management exposes CIM resources via web services, as defined in [17]. CIM is an inherent part of WBEM, as explained earlier in this paper.
• DMTF members are working on publishing a standard for the mapping between WS-Man operations and WBEM Generic Operations [18].

Furthermore, in order to implement a WS-Man
operation, a Web Service endpoint needs to delegate requests to instrumentation that can operate on the managed resource. In current, open source WS-Man implementations, management requests are eventually served by a WBEM Server and the appropriate providers. WS-Man and WBEM are related and complementary to each other.
The WebDMF representative has been implemented as a WBEM provider. Therefore, if the CIMOM operating on the representative node provides WS-Man client interfaces, the WebDMF provider will operate normally.
4.5 Performance Evaluation

In this section we present a preliminary evaluation of
WebDMF’s performance. Results presented here are not simulation results. They have been obtained from actual code execution and are used as an indication of the solution’s viability.
In order to perform measurements, we set up a testbed environment using ModelNet [19]. Our topology emulates a wide-area network consisting of 250 virtual nodes in 3 LANs. Each LAN has its own gateway to the WAN, and the 3 gateways are interconnected via a backbone network with high-bandwidth, low-delay links. We also installed two WebDMF representatives (nodes R1 and R2). This setup is portrayed in Fig. 8.
Fig. 8. The emulated topology and test scenario.
We assume that for this network deployment, we wish to obtain the total amount of available physical memory for the 200 nodes hosted in one of the LANs. Among those, 50 do not support basic WBEM instrumentation. They only offer SNMP-based management facilities.
In this scenario, the client forms a WBEM CreateInstance() request for class WebDMF_RequestWBEM of the request factory. It is initially sent to WebDMF representative R1, which forwards it to R2. R2 then collects data from the 200 service nodes as follows:
• R2 sends intermediate requests to the 150 WBEM-enabled nodes. Those requests invoke the EnumerateInstances() operation for class Linux_OperatingSystem. Responses are sent back to R2 from the service nodes.
• As stated previously, in this scenario there are 50 SNMP-enabled nodes. R2 sends SNMP-Get packets to
those hosts, requesting the value of the hrMemorySize object. This object is part of the HOST-RESOURCES-MIB defined in RFC 1514 [20]. The transformation is based on the mapping rules mentioned in a previous section.

After collecting the responses, R2 calculates the
aggregate (sum) of the reported values. This value becomes part of the response that is sent to R1. R1 sends the final response to the client.
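The scenario above can be reproduced in miniature. The per-node memory figures below are invented for illustration; only the fan-out over 150 WBEM-enabled and 50 SNMP-enabled nodes and the final summation mirror the text.

```python
# Toy reproduction of the evaluation scenario: R2 fans out to 150
# WBEM-enabled nodes and 50 SNMP-only nodes, sums the reported memory
# sizes, and returns a single aggregate (which R1 relays to the client).

def collect_total_memory(wbem_nodes, snmp_nodes):
    total = 0
    for node in wbem_nodes:
        # EnumerateInstances() on Linux_OperatingSystem would report
        # the node's physical memory here.
        total += node["memory_kb"]
    for node in snmp_nodes:
        # SNMP Get on hrMemorySize (HOST-RESOURCES-MIB, RFC 1514),
        # reached via the WBEM-to-SNMP mapping rules.
        total += node["memory_kb"]
    return total

wbem = [{"memory_kb": 2048} for _ in range(150)]   # invented figures
snmp = [{"memory_kb": 1024} for _ in range(50)]
total = collect_total_memory(wbem, snmp)
```

With these made-up values the aggregate is 150*2048 + 50*1024 = 358400 KB; in the real experiment the aggregation happens on R2 after 200 intermediate responses.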
We repeated the above experiment 200 times. Table II summarizes the results; times are in seconds. Note that this scenario involves 204 request-response exchanges among various nodes, and that the packets crossing the network are small (a few bytes). The total execution time includes the following:
TABLE II. EVALUATION RESULTS.

Metric                                   Value
Repetitions (N)                          200
Central Tendency: Arithmetic Mean        6.237139
Central Tendency: Median                 6.193212
Dispersion: Variance                     0.015187
Dispersion: Standard Deviation           0.123237
95% Confidence Interval for the Mean     6.220059 to 6.254218
• Communication delays during request-response exchanges. This includes TCP connection setup for all WBEM message exchanges; it does not apply to SNMP, which uses UDP at the transport layer and is therefore connectionless.
• Processing overheads on R1 and R2. This is imposed by WebDMF’s functionality.
• Processing at the service nodes to calculate the requested value and generate a response.

The absolute value of the average completion time
may seem rather high. However, processing times are minimal compared to TCP connection setup and message exchange; with that in mind, each of the 204 request-response exchanges completes in 30.57 milliseconds on average, which is reasonable. After 200 repetitions we observe low statistical dispersion (variance and standard deviation), indicating that the measured values are not widely spread around the mean. We draw the same conclusion from the 95% confidence interval for the mean: the same experiment will complete in approximately the same time under similar network traffic conditions.
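The per-exchange figure follows directly from Table II and can be checked with a two-line calculation:

```python
# Sanity check of the per-exchange figure: the mean completion time from
# Table II divided by the 204 request-response exchanges in the scenario.
mean_total_s = 6.237139    # arithmetic mean from Table II (seconds)
exchanges = 204            # request-response exchanges per experiment
per_exchange_ms = mean_total_s / exchanges * 1000
# per_exchange_ms is roughly 30.57 ms, matching the figure in the text
```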
5 Conclusions

Ideally, a management framework should support grid deployments without the need for major modifications to the existing infrastructure. It should not be limited by the technology used to implement the grid, and should be generic enough to support future changes. In this paper we introduce WebDMF, a Web-based Distributed Management Framework, and present how it can be used to manage grids. We discuss its generality and demonstrate its viability through a performance evaluation. Finally, the paper presents its advantages compared to alternative approaches and shows how it is complementary to emerging Web Service-based management approaches.
6 References

[1] W. Stallings, SNMP, SNMPv2, SNMPv3, RMON 1 and 2. Addison Wesley, 1999.
[2] G. Oikonomou and T. Apostolopoulos, "WebDMF: A Web-based Management Framework for Distributed Services," in Proc. The 2008 International Conference of Parallel and Distributed Computing (ICPDC 08), to be published.
[3] CIM Infrastructure Specification, DMTF Standard DSP0004, 2005.
[4] Representation of CIM in XML, DMTF Standard DSP0201, 2007.
[5] CIM Operations over HTTP, DMTF Standard DSP0200, 2007.
[6] A Grid Monitoring Architecture, Open Grid Forum GFD.7, 2002.
[7] A. W. Cooke, et al., "The Relational Grid Monitoring Architecture: Mediating Information about the Grid," Journal of Grid Computing, vol. 2, no. 4, pp. 323-339, 2004.
[8] K. Gor, D. Ra, S. Ali, L. Alves, N. Arurkar, I. Gupta, A. Chakrabarti, A. Sharma, and S. Sengupta, "Scalable enterprise level workflow and infrastructure management in a grid computing environment," in Proc. Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGrid'05), Cardiff, UK, 2005, pp. 661-667.
[9] P.-C. Chen, J.-B. Chang, T.-Y. Liang, C.-K. Shieh, and Y.-C. Zhuang, "A multi-layer resource reconfiguration framework for grid computing," in Proc. 4th International Workshop on Middleware for Grid Computing (MGC'06), Melbourne, Australia, 2006, p. 13.
[10] T.-Y. Liang, C.-Y. Wu, J.-B. Chang, and C.-K. Shieh, "Teamster-G: a grid-enabled software DSM system," in Proc. Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGrid'05), Cardiff, UK, 2005, pp. 905-912.
[11] I. C. Legrand, H. B. Newman, R. Voicu, C. Cirstoiu, C. Grigoras, M. Toarta, and C. Dobre, "MonALISA: An Agent based, Dynamic Service System to Monitor, Control and Optimize Grid based Applications," in Proc. Computing in High Energy and Nuclear Physics (CHEP), Interlaken, Switzerland, 2004.
[12] An Introduction to WSDM, OASIS Committee Draft, 2006.
[13] Web Services for Management (WS-Management), DMTF Preliminary Standard DSP0226, 2006.
[14] M. Kahani and P. H. W. Beadle, "Decentralised approaches for network management," ACM SIGCOMM Computer Communication Review, vol. 27, iss. 3, pp. 36-47, 1997.
[15] Common Manageability Programming Interface, The Open Group, C061, 2006.
[16] L. Peng, M. Koh, J. Song, and S. See, "Performance Monitoring for Distributed Service Oriented Grid Architecture," in Proc. The 6th International Conference on Algorithms and Architectures (ICA3PP-2005), 2005.
[17] WS-CIM Mapping Specification, DMTF Preliminary Standard DSP0230, 2006.
[18] WS-Management CIM Binding Specification, DMTF Preliminary Standard DSP0227, 2006.
[19] A. Vahdat, K. Yocum, K. Walsh, P. Mahadevan, D. Kostic, J. Chase, and D. Becker, "Scalability and Accuracy in a Large-Scale Network Emulator," in Proc. 5th Symposium on Operating Systems Design and Implementation (OSDI), December 2002.
[20] Host Resources MIB, IETF Request for Comments 1514, 1993.
SEMM: Scalable and Efficient Multi-Resource Management in Grids

Haiying Shen
Department of Computer Science and Computer Engineering, University of Arkansas, Fayetteville, AR 72701
Abstract - Grids connect resources to enable world-wide collaboration. Conventional centralized or hierarchical approaches to grid resource management are inefficient in large-scale grids. Distributed Hash Table (DHT) middleware overlays have been applied to grids as a mechanism for providing scalable multi-resource management, but direct DHT overlay adoption breaks the physical locality relationship between nodes. This paper presents a Scalable and Efficient Multi-resource Management mechanism (SEMM). It collects resource information based on the physical locality relationship among resource hosts as well as the resource attributes. Simulation results demonstrate the effectiveness of SEMM in locality-awareness and overhead reduction in comparison with another approach.
Keywords: Resource management, Resource discovery, Grid, Peer-to-Peer, Distributed Hash Table
1 Introduction
Grids enable the sharing, selection, and aggregation of a wide variety of resources for world-wide collaboration. Therefore, scalable and efficient resource management is vital to the performance of grids.
As a successful model that achieves high scalability in distributed systems, Distributed Hash Table (DHT) middleware overlays [1, 2, 3, 4, 5] facilitate resource management in large-scale grid environments. However, direct DHT overlay adoption breaks the physical locality relationship of nodes in the underlying IP-level topology. Since resource sharing and communication among physically close nodes enhance resource management efficiency, it is desirable that DHT middleware preserve the locality relationship of grid nodes. Most current DHT-based approaches for resource management are not sufficiently scalable and efficient: they let resources be shared at a system-wide scale, so a node may need to ask a node very far away for resources, resulting in inefficiency. Since a grid may have a very large scale, neglecting resource host locality in resource management prevents the system from achieving higher scalability.
Locality-aware resource management is critical to the scalability and efficiency of a grid system. To meet these requirements, we propose a Scalable and Efficient Multi-resource Management mechanism (SEMM), which is built on a DHT structure. SEMM provides locality-aware resource management by mapping physically close resource requesters and providers. Thus, resources can be shared between physically close nodes, and the efficiency of resource sharing is significantly improved.
The rest of this paper is structured as follows. Section 2 presents a concise review of representative resource management approaches for grids. Section 3 introduces SEMM, focusing on its architecture and algorithms. Section 4 shows the performance of SEMM in comparison with another approach in terms of a variety of metrics. Section 5 concludes this paper.
2 Related Work
Over the past years, the immense popularity of grids has significantly stimulated the development of grid resource management approaches such as Condor-G [6], the Globus toolkit [7], Condor [8], Entropia [9], AppLes [10], and Javelin++ [11]. However, relying on centralized or hierarchical policies, these systems have limitations in a large-scale, dynamic, multi-domain environment with variation in resource availability.
To cope with these problems, more and more grids resort to DHT middleware overlays for resource management. DHT overlays are an important class of peer-to-peer overlay networks that map keys to the nodes of a network based on a consistent hashing function [12]. Some DHT-based approaches adopt one DHT overlay for each resource, and process multi-resource queries in
parallel in corresponding DHT overlays [13]. However, depending on multiple DHT overlays for multi-resource management leads to high structure maintenance overhead. Another group of approaches [14, 15, 16, 17] organize all grid resources into one DHT overlay, and assign all information about a type of resource to one node. Such an approach results in an imbalanced load distribution among nodes, caused by information maintenance and resource scheduling. It also leads to a high cost for searching resource information among the huge volume of information held by a node. Moreover, few current approaches are able to deal with the locality feature of grids.
Unlike most existing approaches, SEMM preserves the physical locality relationship between nodes and achieves locality-aware resource management. This feature contributes to the high scalability and efficiency of SEMM in grid resource management.
3 Scalable and Efficient Multi-Resource Management
3.1 Overview
SEMM is developed on the Cycloid DHT overlay [5]. We first briefly describe the Cycloid DHT middleware overlay, followed by a high-level view of the SEMM architecture. Cycloid is a lookup-efficient constant-degree overlay with n = d * 2^d nodes, where d is the dimension. It achieves a time complexity of O(d) per lookup request by using O(1) neighbors per node. Each Cycloid node is represented by a pair of indices (k, a_(d-1) a_(d-2) ... a_0), where k is a cyclic index and a_(d-1) a_(d-2) ... a_0 is a cubical index. The cyclic index is an integer ranging from 0 to d - 1, and the cubical index is a binary number between 0 and 2^d - 1. Nodes with the same cubical index are ordered by their k mod d on a small cycle, which we call a cluster. The node with the largest cyclic index in a cluster is called the primary node of the cluster. All clusters are ordered by their cubical indices mod 2^d on a large cycle. For a given key or node, the cyclic index is set to the hash value of the key or IP address modulo d, and the cubical index is set to the hash value divided by d. A key is assigned to the node whose ID is closest to the key's ID. Briefly, the cubical index identifies the cluster in which a node or an object is located, and the cyclic index identifies its position within the cluster. The overlay network provides two main functions: Insert(key, object), which stores an object at the node responsible for the key, and Lookup(key), which retrieves it. For more information about Cycloid, please refer to [5].
3.2 Locality-aware Middleware Construction
Before presenting the details of SEMM, we introduce a landmarking method that represents node closeness on the network by indices. Landmark clustering has been widely adopted to generate proximity information [18, 19, 20, 21]. We assume m landmark nodes randomly scattered over the Internet. Each node measures its physical distances to the m landmarks and uses the vector of distances <d1, d2, ..., dm> as its coordinate in an m-dimensional Cartesian space. Two physically close nodes will have similar vectors. We use space-filling curves [22], such as the Hilbert curve [19], to map m-dimensional landmark vectors to real numbers, such that the closeness relationship among the nodes is preserved. We call this number the Hilbert number of a node, denoted by H; it indicates the physical closeness of nodes on the Internet.
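The landmark scheme can be illustrated with a small sketch. The 2-D coordinates, landmark positions, and Euclidean "latency" below are synthetic stand-ins for real network measurements, and the final Hilbert-curve mapping to a single number is omitted; only the key property is shown: physically close nodes get similar landmark vectors.

```python
import math

def landmark_vector(node, landmarks):
    """Distance vector <d1, ..., dm> from a node to each of m landmarks.
    'Distance' here is Euclidean over synthetic 2-D coordinates, standing
    in for measured network latencies."""
    return [math.dist(node, lm) for lm in landmarks]

landmarks = [(0, 0), (100, 0), (0, 100)]
a = landmark_vector((10, 10), landmarks)
b = landmark_vector((12, 9), landmarks)   # physically close to a
c = landmark_vector((90, 80), landmarks)  # physically far from a

# Physically close nodes yield similar vectors; a space-filling curve
# would then fold such vectors into one proximity-preserving number H.
assert math.dist(a, b) < math.dist(a, c)
```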
SEMM builds a locality-aware Cycloid architecture on a grid. Specifically, it uses grid node i's Hilbert number, Hi, as its cubical index, and the consistent hash value of node i's IP address, denoted H′i, as its cyclic index, generating the node ID (H′i, Hi). Recall that in a Cycloid ID the cubical indices differentiate clusters and the cyclic indices differentiate node positions within a cluster. Therefore, physically close nodes with the same H will be in one cluster, and those with similar H will be in nearby clusters in Cycloid. As a result, a locality-aware Cycloid is constructed, in which the logical proximity abstraction derived from the overlay matches the physical proximity in reality.
3.3 Resource Reporting and Query
We define resource information, represented by Ir, as the information about available resources and resource requests. It includes the resource host, the resource ID (denoted IDr), etc.
In DHT overlay networks, objects with the same key are stored in the same node. Based on this principle and the node ID determination policy, SEMM lets node i compute the consistent hash value of its resource r, denoted by Hr, and uses (Hr, Hi) as IDr. The node then uses the DHT overlay function Insert(IDr, Ir) to store the resource information at a node in its cluster. As a result, information about the same type of resource held by physically close nodes is stored in the same repository node, and different nodes in one cluster are responsible for different types of resources within the cluster. Furthermore, resources whose Ir is stored in clusters near node i are located physically close to node i. A repository node periodically conducts resource scheduling between resource providers and requesters.
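The reporting scheme can be mimicked with a toy in-memory overlay. `MockOverlay`, `chash`, and the numeric Hilbert number below are illustrative stand-ins for Cycloid's Insert/Lookup primitives and the paper's hash functions; the point is that reports keyed by (Hr, Hi) land in one repository slot.

```python
import hashlib

def chash(s: str, space: int = 2 ** 16) -> int:
    """Consistent hash into a fixed identifier space (SHA-1 stand-in)."""
    return int.from_bytes(hashlib.sha1(s.encode()).digest(), "big") % space

class MockOverlay:
    """Stand-in for Cycloid's Insert/Lookup: objects with the same key
    end up at the same repository node (here, the same dict slot)."""
    def __init__(self):
        self.store = {}

    def insert(self, key, obj):
        self.store.setdefault(key, []).append(obj)

    def lookup(self, key):
        return self.store.get(key, [])

overlay = MockOverlay()
H_i = 42  # assumed Hilbert number shared by a cluster of close nodes
for host in ("node-a", "node-b"):  # two physically close providers of 'cpu'
    overlay.insert((chash("cpu"), H_i), {"host": host, "type": "cpu"})

# Both reports of the same resource type reach one repository slot,
# which is where a repository node would schedule providers vs. requesters.
assert len(overlay.lookup((chash("cpu"), H_i))) == 2
```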
18 Int'l Conf. Grid Computing and Applications | GCA'08 |
Figure 1. Communication cost of different resource management approaches: (a) CDF of the percentage of resource amount assigned versus physical distance in hops; (b) logical communication cost for requests versus the number of resources in each request (Mercury vs. SEMM).
When node i queries for different resources, it sends out a request Lookup(Hr, Hi) for each resource r. Each request is forwarded to the corresponding repository node in node i's cluster, which replies to node i if it holds information about the requested resource.
4 Performance Evaluation
We designed and implemented a simulator in Java to evaluate SEMM. We compared the performance of SEMM with Mercury [13]. Mercury uses multiple DHT overlays and makes each overlay responsible for one resource. We used Chord for the attribute hubs in Mercury. We assumed 11 types of resources, and used a Bounded Pareto distribution to generate the resource amount owned and requested by a node. This distribution reflects the real world, where available resource amounts vary by orders of magnitude. In the experiment, we generated 1000 requests and varied the number of resources in a request from 1 to 5 with a step size of 1. We used a transit-stub topology, "ts5k-large", generated by GT-ITM [23] with approximately 5,000 nodes: 5 transit domains, 3 transit nodes per transit domain, 5 stub domains attached to each transit node, and 60 nodes in each stub domain on average. "ts5k-large" represents a situation in which a grid consists of nodes from several large stub domains.
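A Bounded Pareto generator of the kind described can be sketched via inverse-CDF sampling; the parameter values below are illustrative, not those used in the paper's experiments.

```python
import random

def bounded_pareto(alpha: float, low: float, high: float, rng=random) -> float:
    """Inverse-CDF sample from a Bounded Pareto(alpha) on [low, high]:
    small amounts are common, amounts orders of magnitude larger are rare."""
    u = rng.random()
    ratio = (low / high) ** alpha
    # Inverting F(x) = (1 - low**alpha * x**(-alpha)) / (1 - (low/high)**alpha)
    return low * (1.0 - u * (1.0 - ratio)) ** (-1.0 / alpha)

rng = random.Random(1)
samples = [bounded_pareto(1.5, 1.0, 1000.0, rng) for _ in range(10000)]
assert all(1.0 <= s <= 1000.0 for s in samples)
assert sorted(samples)[len(samples) // 2] < 10  # median stays near the low end
```

Setting u = 0 yields `low` and u → 1 yields `high`, so samples are guaranteed to stay inside the bounds while preserving the heavy tail.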
4.1 Efficiency of Resource Management
In this experiment, we measured the cumulative distribution function (CDF) of the percentage of allocated resources, which reflects the effectiveness of a resource management mechanism in mapping physically close resource requesters and providers. We randomly generated 5000 resource requests and recorded the distance between the resource provider and the resource requester of each request. Figure 1(a) shows the CDF of the percentage of allocated resources versus distance for the different resource management approaches in "ts5k-large". We can see that SEMM locates 97% of the total resources requested within 11 hops, while Mercury locates only about 15% within 10 hops. Almost all allocated resources are located within 15 hops of the requesters in SEMM, versus 19 hops in Mercury. The results show that SEMM allocates most resources a short distance from requesters, whereas Mercury allocates most resources far from the requesters. The more resources located at short distances, the better the locality-aware performance of a resource management mechanism. By using physically close resources, a node achieves higher efficiency in distributed operations such as distributed computing and data sharing. In addition, communicating with physically close nodes saves node communication cost. The results indicate that SEMM outperforms Mercury in terms of locality-aware resource management, which helps achieve higher efficiency and scalability in a grid system.
A resource node needs to communicate with repository nodes to obtain requested resources. Its request is forwarded over a number of hops determined by the DHT overlay routing algorithm; communication cost therefore constitutes a major part of the resource management cost. In this test, we evaluated the communication cost of resource requesting. We define the logical communication cost as the product of the message size and the logical path length, in hops, travelled by the message. It represents resource management efficiency in terms of the numbers of messages and nodes involved in message forwarding during resource queries. The size of a message is assumed to be 1 unit. Figure 1(b) plots the logical communication cost versus the number of resource types in a request. In the experiment, resource searching stops once the requested resources are discovered. We can observe that SEMM incurs less cost than Mercury. The lookup path length is O(log n) in Chord, which is longer than the O(d) lookup path length in Cycloid. A request with m resources needs m lookups, which amplifies the difference in communication cost between Mercury and SEMM. Hence, by relying on the Cycloid DHT as its underlying structure, SEMM greatly reduces node communication cost in resource management in a grid system.

Figure 2. Overhead of different resource management approaches: average maintained outlinks per node versus number of nodes (Mercury vs. SEMM).
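The asymptotic argument above can be checked with a back-of-envelope calculation; the concrete d, n, and m values are illustrative choices, not the paper's.

```python
import math

def cycloid_lookup_hops(d: int) -> int:
    """Cycloid lookup cost: O(d) hops in an overlay of n = d * 2**d nodes."""
    return d

def chord_lookup_hops(n: int) -> int:
    """Chord lookup cost: O(log n) hops."""
    return math.ceil(math.log2(n))

d = 8
n = d * 2 ** d   # 2048 nodes
m = 5            # resources per request -> m lookups (unit message size)
# At the same node count, m Cycloid lookups cost fewer hop-units
# than m Chord lookups, matching the trend of Figure 1(b).
assert cycloid_lookup_hops(d) * m < chord_lookup_hops(n) * m
```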
4.2 Overhead of Resource Management
Since the resource management mechanisms depend on DHT overlays as middleware, the maintenance overhead of the DHT overlays constitutes a major part of the resource management overhead. In a DHT overlay, a node needs to maintain its neighbors in its routing table; these neighbors play an important role in guaranteeing successful message routing. We define the number of outlinks of a node as the number of the node's neighbors in its routing table, i.e., the routing table size maintained by the node. In this experiment, we measured the number of outlinks per node, which represents the overhead of maintaining the DHT resource management middleware. Figure 2 plots the average number of outlinks maintained by a node in the different resource management approaches. The results show that each node in Mercury maintains dramatically more outlinks than in SEMM. Recall that Mercury has multiple DHT overlays, each responsible for one resource; therefore, a node has a routing table for each DHT overlay, and its number of outlinks equals the product of the routing table size and the number of DHT overlays. The results show that SEMM incurs less maintenance overhead than Mercury, which implies that SEMM scales well, with low DHT overlay maintenance cost in a large-scale grid.
5 Conclusions
The rapid development of grids requires a scalable and efficient resource management approach for high performance. This paper presents a Scalable and Efficient Multi-resource Management mechanism (SEMM), built on a DHT overlay. SEMM maps physically close resource requesters and providers to achieve locality-aware resource management, in which resource allocation and node communication are conducted among physically close nodes, leading to higher scalability and efficiency. Simulation results show the superiority of SEMM over another resource management approach in terms of locality-aware resource management, node communication cost, and maintenance cost of the underlying DHT structure.
Acknowledgements
This research was supported in part by the Acxiom Corporation.
References
[1] I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, M. F. Kaashoek, F. Dabek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup protocol for Internet applications. IEEE/ACM Transactions on Networking, 11(1):17–32, 2003.
[2] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In Proc. of ACM SIGCOMM, pages 161–172, 2001.
[3] A. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In Proc. of the IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), pages 329–350, 2001.
[4] B. Y. Zhao, L. Huang, J. Stribling, S. C. Rhea, A. D. Joseph, and J. Kubiatowicz. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. IEEE Journal on Selected Areas in Communications, 22(1):41–53, 2004.
[5] H. Shen, C. Xu, and G. Chen. Cycloid: A scalable constant-degree P2P overlay network. Performance Evaluation, 63(3):195–216, 2006. An early version appeared in Proc. of the International Parallel and Distributed Processing Symposium (IPDPS), 2004.
[6] J. Frey, T. Tannenbaum, I. Foster, M. Livny, and S. Tuecke. Condor-G: A computation management agent for multi-institutional grids. In Proc. of the 10th IEEE Symposium on High Performance Distributed Computing, 2001.
[7] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. International Journal of High Performance Computing Applications, 11(2):115–128, 1997.
[8] M. Mutka and M. Livny. Scheduling remote processing capacity in a workstation-processor bank computing system. In Proc. of the 7th International Conference on Distributed Computing Systems, September 1987.
[9] A. Chien, B. Calder, S. Elbert, and K. Bhatia. Entropia: Architecture and performance of an enterprise desktop grid system. Journal of Parallel and Distributed Computing, 63(5), May 2003.
[10] F. Berman, R. Wolski, H. Casanova, W. Cirne, H. Dail, M. Faerman, S. Figueira, J. Hayes, G. Obertelli, J. Schopf, G. Shao, S. Smallen, N. Spring, A. Su, et al. Adaptive computing on the grid using AppLeS. IEEE Transactions on Parallel and Distributed Systems, 14(4), April 2003.
[11] M. O. Neary, S. P. Brydon, P. Kmiec, S. Rollins, and P. Cappello. Javelin++: Scalability issues in global computing. Future Generation Computer Systems, 15(5–6):659–674, 1999.
[12] D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin, and R. Panigrahy. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proc. of the 29th Annual ACM Symposium on Theory of Computing (STOC), pages 654–663, 1997.
[13] A. R. Bharambe, M. Agrawal, and S. Seshan. Mercury: Supporting scalable multi-attribute range queries. In Proc. of ACM SIGCOMM, pages 353–366, 2004.
[14] M. Cai, M. Frank, J. Chen, and P. Szekely. MAAN: A multi-attribute addressable network for grid information services. Journal of Grid Computing, 2004. An early version appeared in Proc. of GRID'03.
[15] A. Andrzejak and Z. Xu. Scalable, efficient range queries for grid information services. In Proc. of the 2nd International Conference on Peer-to-Peer Computing (P2P), pages 33–40, 2002.
[16] M. Cai and K. Hwang. Distributed aggregation algorithms with load-balancing for scalable grid resource monitoring. In Proc. of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2007.
[17] S. Suri, C. Toth, and Y. Zhou. Uncoordinated load balancing and congestion games in P2P systems. In Proc. of the Third International Workshop on Peer-to-Peer Systems, 2004.
[18] S. Ratnasamy, M. Handley, R. Karp, and S. Shenker. Topologically-aware overlay construction and server selection. In Proc. of IEEE Conference on Computer Communications (INFOCOM), 2002.
[19] Z. Xu, M. Mahalingam, and M. Karlsson. Turning heterogeneity into an advantage in overlay routing. In Proc. of IEEE Conference on Computer Communications (INFOCOM), 2003.
[20] H. Shen and C.-Z. Xu. Hash-based proximity clustering for efficient load balancing in heterogeneous DHT networks. Journal of Parallel and Distributed Computing, 2008.
[21] H. Shen and C. Xu. Locality-aware and churn-resilient load balancing algorithms in structured peer-to-peer networks. IEEE Transactions on Parallel and Distributed Systems, 2007. An early version appeared in Proc. of ICPP'05.
[22] T. Asano, D. Ranjan, T. Roos, E. Welzl, and P. Widmayer. Space filling curves and their use in geometric data structures. Theoretical Computer Science, 181(1):3–15, 1997.
[23] E. Zegura, K. Calvert, and S. Bhattacharjee. How to model an internetwork. In Proc. of IEEE Conference on Computer Communications (INFOCOM), 1996.
A QoS Guided Workflow Scheduling Algorithm for the Grid
Fangpeng Dong and Selim G. Akl
School of Computing, Queen’s University
Kingston, ON Canada K7L3N6 {dong, akl}@cs.queensu.ca
Abstract - Resource performance in the Computational Grid is not only heterogeneous, but also changing dynamically. However, scheduling algorithms designed for traditional parallel and distributed systems, such as clusters, only consider the heterogeneity of the resources. In this paper, a workflow scheduling algorithm is proposed. This algorithm uses given QoS guidance to allocate tasks to distributed computer resources whose performance is subject to dynamic change and can be described by predefined probability distribution functions. This new algorithm works in an offline way that allows it to be easily set up and run with less cost. Simulations have been done to test its performance with different workflow and resource settings. The new algorithm can also be easily expanded to accommodate Service Level Agreement (SLA) based workflows.
Keywords: Workflow scheduling algorithm, Grid, Quality of Services, stochastic
1 Introduction
The development of Grid infrastructures now enables workflow submission and execution on remote computational resources. To exploit the non-trivial power of Grid resources, effective task scheduling approaches are necessary. In this paper, we consider the problem of scheduling workflows that can be represented by directed acyclic graphs (DAGs) in the Grid. The objective function of the scheduling is to minimize the total completion time of all tasks (also known as the makespan) in a workflow.
As the performance of Grid resources fluctuates dynamically, it is difficult to apply scheduling algorithms that were designed for traditional systems and treat performance as a known static parameter. Therefore, some countermeasures are introduced to capture relevant information about resource performance fluctuation (e.g., performance prediction [1]), or to provide some guaranteed performance to users (e.g., resource reservation [2]). These approaches make it possible for Grid schedulers to get relatively accurate resource information. In this paper, we assume that resource performance can be described by some
probability mass functions (PMF), which can be derived from records of past task executions (e.g., log files) [8]. Since the performance information is not deterministic, the proposed algorithm takes an input parameter as a resource selection criterion (QoS). This algorithm is a list heuristic and consists of two phases: the task ranking phase and the task-to-resource mapping phase. In the ranking phase, tasks are ordered according to their priorities. The rank of a task is determined by the task's computational demand, the mathematical expectation of resource performance, and the communication cost for data transfer. In the mapping phase, the scheduler picks unscheduled tasks in the order of their priorities and assigns them to resources according to the performance objective and the QoS guidance.
The rest of this paper is organized as follows: in Section 2, related work is introduced; Section 3 presents the application and resources models used by the proposed algorithm; Section 4 describes the algorithm in detail; Section 5 presents simulation results and analysis; finally, conclusions are given in Section 6.
2 Related Work
The DAG-based task graph scheduling problem in parallel and distributed computing systems is an interesting research area, and algorithms for this problem keep evolving with computational platforms, from the age of homogeneous systems, to heterogeneous systems, to today's computational Grids [6]. Due to its NP-complete nature [7], most algorithms are heuristic-based, such as the widely cited HEFT algorithm [3]. In [9], the authors proposed a list DAG scheduling algorithm based on deterministic resource performance prediction. In [12], a robust resource allocation algorithm is proposed, which uses the same resource performance model as this paper does; however, it only considers the scheduling of independent tasks. In [11], an SLA-based workflow scheduling algorithm is proposed; however, that algorithm does not use resource performance modeling explicitly and works in an online manner, meaning it has to monitor the execution of tasks in a workflow continuously.
3 Models and Definitions

The targeted system of this paper consists of a group of computational nodes r1, ..., rn distributed in a Grid. Two nodes ri and rj can communicate with each other via a network connection wi,j. It is assumed that the available performance of both computational nodes and network connections is stochastic and follows some probability mass functions (PMF). Fig. 1 presents an example of a PMF. The PMF can be obtained by investigating historical application running records using statistical measures; in this paper, it is assumed that such functions are already known by the scheduler. The PMF of the performance Pi of a computational node ri is denoted as PPi(x), and the PMF of the performance Bi,j of a network connection wi,j between ri and rj is denoted as PBi,j(x). It is assumed that for all 1 ≤ i, j ≤ n, the random variables Pi and Bi,j are independent.
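A discrete PMF of this kind can be represented directly as a value-to-probability map; the numbers below are illustrative (they echo the performance range of Fig. 1), not measured data.

```python
# A discrete PMF as {performance value: probability} for one node.
pmf_perf = {10: 0.1, 15: 0.3, 20: 0.4, 25: 0.2}

def expectation(pmf: dict) -> float:
    """Mathematical expectation E(X) of a discrete random variable."""
    return sum(x * p for x, p in pmf.items())

assert abs(sum(pmf_perf.values()) - 1.0) < 1e-9  # a valid PMF sums to 1
assert abs(expectation(pmf_perf) - 18.5) < 1e-9
```

Expectations of this form are exactly what the ranking phase (Section 4) averages over resources.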
Fig. 1: Performance probability mass function of a computational resource in the Grid.
In this paper, a workflow is assumed to be represented by a DAG G. An example DAG is presented in Fig. 2. A circular node ti in G represents a task, where 1 ≤ i ≤ v and v is the number of tasks in the workflow. The number qi (1 ≤ i ≤ v) shown below ti represents the computational power consumed to finish ti. For example, in Fig. 2, q1 = 5. An edge e(i, j) from ti to tj means that tj needs an intermediate result from ti, so that tj ∈ succ(ti), where succ(ti) is the set of all immediate successor tasks of ti. Similarly, ti ∈ pred(tj), where pred(tj) is the set of immediate predecessors of tj. The weight of e(i, j) gives the size of the intermediate results (or communication volume, for simplicity) transferred from ti to tj. For example, the communication volume from t1 to t2 is one in Fig. 2.
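The DAG notation can be captured with plain maps; the small graph below is a hypothetical example in the same notation, not the Fig. 2 workflow.

```python
# q[i] is the computational demand of task i; e[(i, j)] is the
# communication volume on edge (i, j). Hypothetical values.
q = {1: 5, 2: 8, 3: 10}
e = {(1, 2): 1, (1, 3): 3, (2, 3): 2}

def succ(i):
    """Immediate successors of task i."""
    return {j for (a, j) in e if a == i}

def pred(i):
    """Immediate predecessors of task i."""
    return {a for (a, j) in e if j == i}

assert succ(1) == {2, 3}
assert pred(3) == {1, 2}
```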
Following the above performance and workflow models, the completion time Ci,k of task ti on resource rk and the communication cost Di,j,k,l for data transfer between task ti on resource rk and task tj on resource rl are also random variables, given by Eq. (1) and Eq. (2) respectively:

Ci,k = qi / Pk    (1)

Di,j,k,l = e(i, j) / Bk,l    (2)

According to (1) and (2), Ci,k and Di,j,k,l are independent random variables, as Pk and Bk,l are independent, and their PMFs, PCi,k(x) and PDi,j,k,l(x), can easily be obtained from PPk(x) and PBk,l(x).

Fig. 2: A DAG depicting a workflow, in which a node represents a task and a labelled directed edge represents a precedence order with a certain volume of intermediate result transportation.

4 The QoS Guided Scheduling Algorithm
The primary objective of the proposed algorithm is to map tasks to proper computational resources such that the makespan of the whole workflow can be as short as possible. As an instance of list heuristics, the proposed algorithm has two phases: the task ranking phase and task-to-resource mapping phase.
In the task ranking phase, the priority of a task ti in a given DAG is computed iteratively from the exiting node of the DAG upwards to the entrance node as Eq. (3) shows.
rank(ti) = avg_ci + max over tj ∈ succ(ti) of ( avg_di,j + rank(tj) )    (3)

avg_ci = Avg over k of E(Ci,k)

avg_di,j = Avg over k, l of E(Di,j,k,l)
In Eq. (3), avg_ci is the average over all computational resources rk of the mathematical expectation of Ci,k (denoted E(Ci,k)). Similarly, avg_di,j is the average of the mathematical expectation of Di,j,k,l over every pair of network-connected resources rk and rl to which ti and tj could be mapped, respectively. According to Eq. (3), the priority value of a task is an estimate of the time from the start of ti to the completion of the whole workflow, based on the average expected performance of computational resources and network connections. Once the priorities of the tasks are known, the scheduler puts the tasks in a queue in non-ascending order (ties are broken randomly).
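The upward-rank recursion of Eq. (3) can be sketched as follows; the avg_c and avg_d values and the four-task graph are hypothetical.

```python
from functools import lru_cache

# Hypothetical inputs in the paper's notation: avg_c[i] is the average
# expected completion time of task i, avg_d[(i, j)] the average expected
# transfer time on edge (i, j), and succ the successor lists.
avg_c = {1: 4.0, 2: 3.0, 3: 5.0, 4: 2.0}
avg_d = {(1, 2): 1.0, (1, 3): 2.0, (2, 4): 1.5, (3, 4): 0.5}
succ = {1: [2, 3], 2: [4], 3: [4], 4: []}

@lru_cache(maxsize=None)
def rank(i: int) -> float:
    """Eq. (3): computed iteratively from the exiting task upwards."""
    if not succ[i]:
        return avg_c[i]  # exiting task: its own expected completion time
    return avg_c[i] + max(avg_d[(i, j)] + rank(j) for j in succ[i])

# rank(4)=2, rank(2)=6.5, rank(3)=7.5, rank(1)=13.5: a task always
# outranks its successors, so the priority queue never deadlocks.
assert rank(1) > rank(2) > rank(4)
```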
In the mapping phase, the scheduler will fetch
unscheduled tasks from the head of the priority queue formed in the ranking phase and map each one to a selected resource. Since the priorities are computed upwards, a task is guaranteed to have a higher priority than all of its successors; therefore, it will be mapped before any of its successors. This ordering eliminates the case in which a successor task occupies a resource while its predecessor is waiting for that resource, so deadlocks are avoided. For tasks that are unrelated to each other, this approach lets those farther from the exiting task get resource allocations earlier, which in turn gives them a better chance of starting earlier and produces a shorter makespan.
If resource performance is deterministic, a popular and easy way to schedule a task in a heterogeneous environment is to choose the resource that can complete the task the earliest, as the HEFT and PFAS [9] algorithms do. However, in the resource model used here, performance is not deterministic. If only the best performance of a resource is considered, the schedule may suffer a long makespan in the real world, because the probability of that performance being achieved is small. To overcome this difficulty, the mathematical expectation of the random performance can be used, as in the ranking phase. However, in a non-deterministic system, providing only an estimated mean value might not be sufficient, because the real situation might be quite different. From the users' point of view, more concern may be placed on the probability of achieving a certain performance rather than on a given static mean value. Therefore, this paper uses a flexible and adaptive approach that allows the user to provide a QoS requirement to guide the resource selection phase. To simplify the presentation, a binary mapping function M(ti, rk) is defined in Eq. (4):
M(ti, rk) = 1 if ti is mapped to rk, and 0 otherwise    (4)
In a workflow, the earliest start time EST(ti, rk) of a task ti on a resource rk depends on the data ready time DRT(ti, rk) and the resource ready time RRT(ti, rk):
EST(ti, rk) = max( DRT(ti, rk), RRT(ti, rk) )    (5)

The data ready time DRT(ti, rk) is determined by the time when ti receives its last input data from its predecessors. It can be expressed by Eq. (6):

DRT(ti, rk) = max over tj ∈ pred(ti) of RT(tj, rk)    (6)

RT(tj, rk) = CT(tj) + Dj,i,l,k, where M(tj, rl) = 1    (7)

RT(tj, rk) is the intermediate data ready time from predecessor tj of ti. CT(tj) is the completion time of tj, and Dj,i,l,k is the intermediate result transfer time from rl to rk, where rl is the resource to which tj is mapped.
As all tasks mapped to the same resource are executed sequentially, the resource ready time RRT(ti, rk) is determined by the completion time of the last task in the job queue of rk. Let t′q be the last task currently in rk's job queue; then RRT(ti, rk) can be written as
RRT(ti, rk) = CT(t′q)    (8)

Finally, the estimated completion time ECT(ti, rk) of ti on rk is given by

ECT(ti, rk) = EST(ti, rk) + Ci,k    (9)

To meet a QoS requirement, we need to know the PMF
of ECT(ti, rk), which depends on the PMF of CT(tj), RT(tj) and DRT(ti, rk). Since all predecessors of ti have been scheduled by the time ti is being scheduled, the PMF of CT(tj) is already known (see Eq. (17)), so is the PMF of RRT(ti, rk). According to probability theory, the PMF of the sum of two independent discrete random variables is the discrete convolution of their PMF. Therefore, according to Eq. (7), the PMF of RT(tj, rk) can be expressed as:
PRTj,k(x) = Σ over i = 0..x of PCTj(i) · PDj,i,l,k(x − i)    (10)
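The discrete convolution used here (and again in Eq. (15)) can be implemented in a few lines; the example PMFs below are hypothetical.

```python
def convolve(pmf_a: dict, pmf_b: dict) -> dict:
    """PMF of the sum of two independent discrete random variables:
    the discrete convolution of Eq. (10) and Eq. (15)."""
    out = {}
    for xa, pa in pmf_a.items():
        for xb, pb in pmf_b.items():
            out[xa + xb] = out.get(xa + xb, 0.0) + pa * pb
    return out

ct = {3: 0.5, 4: 0.5}   # hypothetical PMF of CT(tj)
d = {1: 0.8, 2: 0.2}    # hypothetical PMF of the transfer time Dj,i,l,k
rt = convolve(ct, d)    # PMF of RT(tj, rk) = CT(tj) + Dj,i,l,k
assert abs(rt[4] - 0.4) < 1e-9           # 0.5 * 0.8
assert abs(sum(rt.values()) - 1.0) < 1e-9
```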
Again, by probability theory, the probability distribution function (also known as the cumulative distribution function (CDF)) of the maximum value of a set of independent variables is the product of the probability distribution functions of these variables. Let FESTi,k, FDRTi,k and FRRTi,k be the CDF of EST(ti, rk) , DRT(ti, rk), RRT(ti, rk) respectively. The following equation can be obtained according to Eq. (5):
FESTi,k(x) = FDRTi,k(x) · FRRTi,k(x)    (11)

Similarly, FDRTi,k can be obtained from Eq. (12):

FDRTi,k(x) = FRT(t′1, rk)(x) · ... · FRT(t′m, rk)(x), where t′1, ..., t′m ∈ pred(ti)    (12)
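The CDF-product rule for the maximum of independent discrete variables, together with the PMF/CDF conversions of Eq. (13) and Eq. (14), can be sketched as follows; the example PMFs are hypothetical.

```python
def cdf(pmf: dict) -> list:
    """Eq. (13): CDF points (x, F(x)) from a PMF, in increasing x."""
    acc, out = 0.0, []
    for x in sorted(pmf):
        acc += pmf[x]
        out.append((x, acc))
    return out

def pmf_of_max(*pmfs: dict) -> dict:
    """Eqs. (11)/(12) + Eq. (14): PMF of the max of independent variables,
    obtained by multiplying their CDFs and then differencing."""
    xs = sorted(set().union(*pmfs))
    def F(pmf, x):  # CDF of one variable evaluated at x
        return sum(p for v, p in pmf.items() if v <= x)
    prev, out = 0.0, {}
    for x in xs:
        cur = 1.0
        for pmf in pmfs:
            cur *= F(pmf, x)       # product of CDFs (independence)
        if cur > prev:
            out[x] = cur - prev    # Eq. (14): difference back to a PMF
        prev = cur
    return out

drt = {2: 0.5, 3: 0.5}             # hypothetical PMF of DRT(ti, rk)
rrt = {1: 0.5, 3: 0.5}             # hypothetical PMF of RRT(ti, rk)
est = pmf_of_max(drt, rrt)         # PMF of EST = max(DRT, RRT), Eq. (5)
assert abs(est[3] - 0.75) < 1e-9
assert abs(sum(est.values()) - 1.0) < 1e-9
```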
For discrete random variable X, its CDF F(x) can be obtained from its PMF P(x) by Eq. (13).
F(x) = Pr(X ≤ x) = Σ over xi ≤ x of P(xi)    (13)
On the other hand, if F(x) is known, the PMF P(x) can also be obtained as
P(xi) = F(xi) − F(xi−1)    (14)

By Eq. (13), the CDF of RT(tj, rk) can be acquired from the results of Eq. (10), and so can the CDF of RRT(ti, rk). The PMF of DRT(ti, rk) can be obtained from Eq. (10), Eq. (12) and Eq. (14). Following the same procedure, the PMF of EST(ti, rk), denoted PESTi,k(x), can be obtained. According to Eq. (9), the PMF of ECT(ti, rk) can then be expressed as:
PECTi,k(x) = Σ over i = 0..x of PESTi,k(i) · PCi,k(x − i)    (15)
Let FECTi,k be the CDF of ECT(ti, rk). FECTi,k can be obtained from the PMF given in Eq. (15).
Now, given a QoS requirement Q as a percentage, the scheduler first finds a value T in the CDF of ECT whose cumulative probability is greater than Q (Fig. 3). Let ect(ti, rk)l be the l-th possible value of ECT and pl its probability; the mathematical expectation over the values to the left of T (including T itself), denoted R(ti, rk), is given by Eq. (16). By this means, it covers at least the lower Q percent of the ECT value distribution.
R(ti, rk) = Σ over ect(ti, rk)l ≤ T of pl · ect(ti, rk)l    (16)
Fig. 3: An example of the CDF (A) and PMF (B) of EST. Given the QoS requirement Q, the ceiling point is the left end of the first CDF interval above Q. Only the ECT instances and their probabilities to the left of the ceiling point (shaded bars in (B)) are considered when the scheduler selects a resource for the current task.
Fig. 4: Pseudo code of the QoS guided workflow scheduling algorithm.

The scheduler then chooses the lowest value among all R(ti, rk), say R(ti, rk′), and maps task ti to rk′. At this point, the PMF of CT(ti) is known as:
PCTi(x) = PECTi,k′(x)    (17)

When the exiting node tv of the whole graph is scheduled, the algorithm stops. From the PMF PCTv(x), we can tell the probability of different makespans of the workflow. The pseudo code of the above procedures is given in Fig. 4.
5 Experiments

In the simulation, the basic resource topology is generated
by a toolkit called GridG 1.0 [4], which can generate heterogeneous computational resources and network connections. Based on the basic resource setup, two groups of experiments are run: one assumes the resource performance follows a uniform distribution, and the other assumes it follows a discrete normal distribution. The workflow graphs used in the simulation are generated by a program called Task Graph For Free (TGFF) [5]. TGFF can generate a variety of task graphs according to different configuration parameters, e.g., the number of task nodes in each graph. Two groups of graphs are used in the simulation. One group is randomly generated, and the other is generated to simulate real workflow applications in the Grid, such as BLAST [10], which has balanced parallel task chains in its DAGs.
The performance of the algorithms tested is measured by the scheduled length ratio (SLR), which is a normalized value of the makespan:

SLR = Real Makespan / Estimated Minimum Critical Path Length
In each experiment, five groups of task graphs are used, with 40, 80, 160, 320 and 640 task nodes, respectively. On each task group, the HEFT algorithm and the QoS guided algorithm are tested. For the HEFT algorithm, the mathematical expectation of the resource performance is used. For the QoS algorithm, two QoS values are applied, 80% and 50%, denoted QoS 1 and QoS 2 respectively in Fig. 5–Fig. 8.
The results of the experiments on randomly generated graphs are presented in Fig. 5 and Fig. 6. In Fig. 5, the resource performance follows a uniform distribution. It can be observed that all algorithms perform worse as the number of tasks in a workflow increases. Due to the nature of these heuristic algorithms, the longer the critical path in a graph, the more cumulative errors they make when computing the priorities of tasks, and the higher the probability that they choose sub-optimal mappings.
In Fig. 5, QoS 1 achieves the best performance among the three strategies. The HEFT algorithm trails QoS 1 by a small margin and is closely followed by QoS 2, which uses 50% as the selection criterion. In Fig. 6, HEFT and QoS 1 obtain almost the same results, while the performance of QoS 2 degrades significantly. By filtering out the 20% worst performance cases in a uniform distribution, the expected performance improves noticeably, and the resources selected in this way have a good chance (a probability of 80%) of delivering better performance than the mean value used by HEFT. This explains why QoS 1 can perform better than HEFT in Fig. 5. On the other hand, since QoS 2 sets the selection criterion at 50%, it may cut too much of the random domain and therefore suffer a higher probability of inaccurate estimates in reality. The
Input: G, Q, PMF of ri and Wi, j, 1≤i≤n. Output: a schedule of G to r1, …, rn. 1. Compute rank of each task in G, using Eq. (3), and order the
tasks into a queue J in non-ascending order of their ranks. 2. while (J is not empty) do 3. Pop the first task t from J; 4. for every resource r 5. Compute PMF of RRT(t, r) ; //Eq. (8) 6. for every t′∈pred(t) 7. Compute PMF of RT(t′, r);//Eq. (10) 8. end for 9. Compute PMF of DRT(t, r);//Use results of Line 7,
Eq. (12) and (14). 10. Compute PMF of ECT(t, r); // Eq. (15) 11. Compute R(t, r), using Q and PMF of ECT(t, r); //Eq. (16) 12. end for 13. Find the resource r′ that R(t, r′) = min(R(ti, rk)) and insert t to
the job queue of r'; 14. end of while
Int'l Conf. Grid Computing and Applications | GCA'08 | 25
shortcoming of a too optimistic criterion (low QoS value) is even more obvious in Fig. 6, where the resource performance follows a normal distribution. In this kind of distribution, the mean value of a PMF happens to be the one that has the highest probability. Therefore, the HEFT algorithm can perform well in this situation.
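To illustrate how a QoS value such as 80% (QoS 1) or 50% (QoS 2) can steer resource selection, the following sketch picks the resource whose QoS-quantile completion time is smallest. The PMFs, resource names, and helper functions here are hypothetical; the paper's actual criterion R(t, r) is defined by Eq. (16).

```python
def qos_completion_time(pmf, qos):
    """Smallest completion time t with P(completion <= t) >= qos,
    for a PMF given as a {time: probability} dictionary."""
    cumulative = 0.0
    for t in sorted(pmf):
        cumulative += pmf[t]
        if cumulative >= qos:
            return t
    return max(pmf)

def pick_resource(ect_pmfs, qos):
    """Choose the resource whose qos-quantile estimated completion
    time (ECT) is smallest."""
    return min(ect_pmfs, key=lambda r: qos_completion_time(ect_pmfs[r], qos))

# Two hypothetical resources: r1 is fast on average but has a long tail,
# r2 is slower but predictable.
ect = {
    "r1": {10: 0.5, 20: 0.25, 40: 0.25},
    "r2": {15: 0.75, 18: 0.25},
}
print(pick_resource(ect, 0.8))  # QoS 1-style selection -> r2
print(pick_resource(ect, 0.5))  # QoS 2-style selection -> r1
```

The 80% criterion avoids r1's long tail, while the 50% criterion gambles on it, mirroring the contrasting behavior of QoS 1 and QoS 2 in the experiments.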
Fig. 5: Simulation results for a uniform performance distribution and randomly generated graphs.
Fig. 6: Simulation results for a discrete normal performance distribution and randomly generated graphs.
Fig. 7: Simulation results for a uniform performance distribution and multi-parallel-way graphs.
Fig. 7 and Fig. 8 show the results for task graphs with multiple balanced parallel task chains. The three scheduling approaches behave similarly to the previous experiments: the HEFT algorithm again performs best under the normal distribution, QoS 1 performs close to HEFT, and QoS 2 suffers from its too optimistic resource selection criterion. It is worth noting that, in all cases, the SLR is lower than in Fig. 5 and Fig. 6. This is due to the balanced structure of the task graphs, which makes the lengths of all paths from the entry task to the exit task roughly identical, so the probability of sub-optimal task ranking decreases.
Fig. 8: Simulation results for a discrete normal performance distribution and multi-parallel-way graphs.
6 Conclusions
In this paper, a QoS guided workflow scheduling algorithm is proposed and tested by simulation. The algorithm can be applied in Grid computing scenarios in which resource performance is not deterministic but varies according to certain probability mass functions. The contribution of this work is twofold. First, the procedure to obtain the PMF of the makespan of a workflow is presented in detail. Since the probabilities of different completion times are known, more sophisticated algorithms can easily be developed (although this is not covered in this paper). For example, if the user specifies a deadline for a workflow, the scheduler can report the probability of meeting that deadline under different schedules and react accordingly. This is important, as SLAs are becoming a popular basis for resource allocation in the Grid. Second, the proposed algorithm uses QoS guidance to find the task-to-resource mapping, and the effects of different QoS settings under different resource performance distributions are tested. Our future work includes developing new algorithms that consider SLA scenarios and testing the QoS guided method with more probability distributions.
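The deadline scenario mentioned above reduces to summing the makespan PMF up to the deadline. A minimal sketch, with an invented PMF:

```python
def prob_meet_deadline(makespan_pmf, deadline):
    """P(makespan <= deadline) for a PMF given as {time: probability}."""
    return sum(p for t, p in makespan_pmf.items() if t <= deadline)

# Hypothetical makespan PMF of one candidate schedule:
pmf = {50: 0.25, 60: 0.5, 80: 0.25}
print(prob_meet_deadline(pmf, 60))  # -> 0.75
```

A scheduler could compute this value for each candidate schedule and pick the one maximizing the probability of meeting an SLA deadline.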
References:
[1] L. Yang, J. M. Schopf and I. Foster. "Conservative Scheduling: Using Predicted Variance to Improve Scheduling Decisions in Dynamic Environments". Proc. of Supercomputing 2003, pp. 31-46, November 2003.
[2] G. Mateescu. "Quality of Service on the Grid via Metascheduling with Resource Co-Scheduling and Co-Reservation". International Journal of High Performance Computing Applications, Vol. 17, No. 3, pp. 209-218, 2003.
[3] H. Topcuoglu, S. Hariri and M.Y. Wu. "Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing". IEEE Trans. on Parallel and Distributed Systems, Vol. 13, No. 3, pp. 260-274, 2002.
[4] D. Lu and P. Dinda. "Synthesizing Realistic Computational Grids". Proc. of ACM/IEEE Supercomputing 2003, Phoenix, AZ, USA, 2003.
[5] R.P. Dick, D.L. Rhodes and W. Wolf. "TGFF: Task Graphs for Free". Proc. of the 6th International Workshop on Hardware/Software Co-design, 1998.
[6] F. Dong and S. G. Akl. "Grid Application Scheduling Algorithms: State of the Art and Open Problems". Technical Report No. 2006-504, School of Computing, Queen's University, Canada, Jan 2006.
[7] H. El-Rewini, T. Lewis, and H. Ali. Task Scheduling in Parallel and Distributed Systems, ISBN 0130992356, PTR Prentice Hall, 1994.
[8] H. Li, D. Groep and L. Wolters. "Mining Performance Data for Metascheduling Decision Support in the Grid". Future Generation Computer Systems, Vol. 23, pp. 92-99, Elsevier, 2007.
[9] F. Dong and S. Akl. "PFAS: A Resource-Performance-Fluctuation-Aware Workflow Scheduling Algorithm for Grid Computing". Proc. of the 16th Heterogeneity in Computing Workshop (IPDPS), Long Beach, California, USA, March 2007.
[10] D. Sulakhe, A. Rodriguez, et al. "GNARE: an Environment for Grid-Based High-Throughput Genome Analysis". Proc. of CCGrid'05, pp. 455-462, Cardiff, UK, May 2005.
[11] D.M. Quan and J. Altmann. "Mapping of SLA-Based Workflows with Light Communication onto Grid Resources". Proc. of the 4th International Conference on Grid Service Engineering and Management, Leipzig, Germany, Sept. 2007.
[12] V. Shestak, J. Smith, et al. "A Stochastic Approach to Measuring the Robustness of Resource Allocations in Distributed Systems". Proc. of the International Conference on Parallel Processing, pp. 459-470, Columbus, Ohio, USA, Aug. 2006.
A Grid Resource Broker with Dynamic Loading Prediction Scheduling Algorithm in Grid Computing Environment
Yi-Lun Pan, Chang-Hsing Wu, and Weicheng Huang
National Center for High-Performance Computing, Hsinchu, Taiwan
Abstract - In a Grid Computing environment there are various important issues, including information security, resource management, routing, and fault tolerance. Among these, job scheduling is a major problem, since it is a fundamental and crucial step in achieving high performance. Scheduling in a Grid environment can be regarded as an extension of the scheduling problem on local parallel systems. The NCHC research team developed a resource broker with the proposed scheduling algorithm, which uses adaptive resource allocation and dynamic loading prediction functions to improve the performance of dynamic scheduling. The resource broker is based on the framework of GridWay, a meta-scheduler for the Grid, so it can be deployed in a real Grid Computing environment. Furthermore, the algorithm considers several properties of the Grid Computing environment, including heterogeneity, dynamic adaptation, and the dependency relationships of jobs.
Keywords: Job Scheduling, Resource Broker, Meta-scheduler, Grid Computing.
1 Introduction
In the beginning of the 1990s, Internet technology saw brilliant success with the birth of a new computing environment, the Grid, composed of heterogeneous platforms, geographically distributed resources, and dynamic networked resources [1]. The intention of the Grid Computing infrastructure is to offer seamless and unified access to geographically distributed resources connected via the Internet. Thus, the facilities can be utilized more efficiently to help application scientists and engineers perform their work, such as the so-called "grand challenge problems". The distributed resources involved in the Grid environment are loosely coupled to form a virtual machine, and each resource is managed independently by its own local authority rather than centrally controlled. The ultimate target is to provide a mechanism such that, once users specify their resource requirements, the virtual computing sites allocate the most appropriate physical computing sites to carry out the execution of the application.
Therefore, the research team developed a Grid Resource Broker (GRB) based on the framework of GridWay [2], a meta-scheduler for the Grid. GridWay in turn drives the Globus Toolkit [3] to enable actual Grid Computing. A scheduling algorithm and its implementation, the Scheduling Module, are developed. According to the criteria required by each job, the proposed scheduling algorithm uses dynamic loading prediction and adaptive resource allocation functions to meet users' requirements. The major task of the proposed Grid Resource Broker is to dynamically identify and characterize the available resources and to correctly monitor the queue status of each local scheduler. Finally, the presented scheduling algorithm helps to select and allocate the most appropriate resources for each given job. Its aim is to minimize the total time to delivery for individual users' requirements and criteria.
These issues motivated our development. Specifically, the design and implementation of the proposed resource broker include dynamic loading prediction and adaptive resource selection functions. Moreover, the scheduling algorithm considers not only the general features of a resource broker but also its "dynamic" feature: the proposed resource broker monitors the status of each local queue and provides resources dynamically according to users' criteria. It can therefore select resources efficiently and automatically. Furthermore, an effective job scheduling algorithm can improve performance and integrate resources to serve remote users efficiently in the heterogeneous and dynamic Grid Computing environment [4].
2 State of the Art
2.1 Grid computing
The Grid is a computing infrastructure that provides users with the ability to dynamically combine distributed resources into an ensemble to support the execution of applications that need resources at a large scale. As Ian Foster redefined it, "Grid Computing is concerned with coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations" [5]. The key concept of grid computing is its ability to deal with resource sharing and resource management [6]. This technology can deliver powerful computing ability by integrating computing resources from all over the world. Grid software is not a single, monolithic package, but rather a collection of interoperating software packages. The characteristics of the Grid computing environment are: 1) the environment is heterogeneous: applications submitted by all kinds of users and web sites can share the resources of the entire Grid computing environment; 2) the resources are dynamic and complex. These characteristics make scheduling a major issue in the environment.
This paper focuses on the scheduling of dependent jobs, i.e., jobs that may have a reciprocal effect on, or correlation with, each other's execution order. Scheduling helps Grid Computing to increase and integrate the utilization of local and remote computing resources, and can thereby improve performance in the dynamic grid computing environment.
2.2 Existing Resource Broker and Scheduling Algorithm
A general scheduling architecture for a resource broker is defined as the process of making scheduling decisions involving resources over multiple administrative domains [7]. Previous research identifies three important features of a grid resource broker: resource discovery, resource selection, and job execution. Much research on resource brokers is ongoing to provide access to resources for different applications, such as Condor-G, the EDG Resource Broker, AppLeS, and so on [8], [9]. These resource brokers can also monitor computing resource information and perform resource selection. Nevertheless, they neither monitor dynamic queuing information nor use it to make precise and effective scheduling policies. On the scheduling algorithm side, dynamic job scheduling is a crucial and fundamental issue in the Grid Computing environment. The purpose of job scheduling is to find a dynamic and optimal method of resource allocation. Most researchers apply traditional job scheduling strategies that allocate computing resources statically, such as list heuristics and list scheduling (LS) [10]. These algorithms focus on the static allocation of machines, whereas the presented research focuses on dynamic job scheduling for each job according to users' requirements and criteria.
Definition 1: The list heuristic scheduling algorithms have three variants. First-Come-First-Serve (FCFS): the scheduler starts jobs in the order of their arrival time. Random: the next job to be scheduled is randomly selected among all jobs; no jobs are preferred. Backfilling [11]: an out-of-order extension of FCFS scheduling that tries to prevent unnecessary idle time. There are two kinds of backfilling: EASY backfilling and conservative backfilling.
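As a rough illustration of the backfilling idea (not the EASY or conservative variants in full, which also track runtimes and reservations), the sketch below starts jobs in FCFS order and lets a later job jump ahead only when the head job does not fit the free nodes; all names and numbers are invented:

```python
def backfill_order(jobs, total_nodes):
    """jobs: FCFS-ordered list of (name, nodes_needed) tuples.
    Returns the names of jobs started in this scheduling pass."""
    free = total_nodes
    started, waiting = [], list(jobs)
    while waiting:
        name, need = waiting[0]
        if need <= free:
            free -= need
            started.append(name)
            waiting.pop(0)
        else:
            # Head job blocked: backfill the first later job that fits now.
            fitting = next((j for j in waiting[1:] if j[1] <= free), None)
            if fitting is None:
                break  # nothing fits; a real scheduler would wait here
            free -= fitting[1]
            waiting.remove(fitting)
            started.append(fitting[0])
    return started

# On 8 free nodes: A (4) starts, B (6) is blocked, C (2) backfills past B.
print(backfill_order([("A", 4), ("B", 6), ("C", 2)], 8))  # -> ['A', 'C']
```

A conservative variant would additionally check that the backfilled job cannot delay any earlier-arriving job, not just the head of the queue.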
Furthermore, most existing research assumes that jobs are executed independently and statically [12], [13], [14]. In fact, these assumptions are not appropriate in a Grid Computing environment, since jobs are often dependent and dynamic. In this research, the proposed algorithm is designed for scheduling dependent jobs dynamically. The Performance Evaluation section below uses dependent jobs, namely Computational Fluid Dynamics (CFD) Message Passing Interface (MPI) programs, as the test workload.
3 Proposed Scheduling Algorithm of Grid Resource Broker (GRB)
3.1 Research Object
The heterogeneous and dynamic properties of the Grid are considered when designing the job scheduling algorithm. The proposed algorithm therefore schedules jobs to achieve the minimum makespan (defined in Definition 2), i.e., to minimize the total time to delivery for individual users' requirements and criteria. The main contribution of this work is a Grid Resource Broker (GRB) with the designed job scheduling algorithm. The GRB provides a mechanism such that, once users specify their resource requirements, the virtual computing sites allocate the most appropriate physical computing sites to carry out the execution of the application.
Definition 2: The completion time is defined as the time from a job being assigned to a machine until the job is finished. The completion time is also called the makespan.
3.2 Model and Architecture
To simplify Grid Computing, each distributed computing resource is assumed to be connected by a high speed network. One important middleware component, the resource broker, plays an essential role in such an environment. The responsibilities of a resource broker are to discover where computing resources are located, store that information, and satisfy users' requirements for computing resources. Therefore, the dynamic loading prediction job scheduling algorithm should utilize the available computational resources fairly and efficiently in the Grid.
The proposed algorithm is designed for the resource brokers of the whole Grid. Its purpose is to help improve the performance of job scheduling. The proposed scheduling algorithm preserves good scheduling sequences of optimal or near-optimal solutions, which generate the best (or better) host candidates, and the presented Scheduling Module discards unfit scheduling sequences during the search process. The resource broker architecture implemented at NCHC is shown in FIG. 1.
FIG. 1 System Architecture
Grid users can submit jobs through the Grid Portal using the Grid Resource Broker, as shown in FIG. 1. There are five related functional components and two databases in the system architecture. First, the Grid Portal serves as an interface to the users. Grid jobs are submitted via the Grid Portal, which passes them to the Resource Broker to drive the backend resources. The resource and job status are displayed back to the portal for users via either the Resource Monitor module or the Job Monitor module. Second, the Resource Monitor (RM) queries the status of resources from the InformationDB and displays the results on the Grid Portal, so the users' knowledge of resource status is kept up to date. Third, the Job Monitor is similar to the Resource Monitor: it accesses job information from the JobsDB, which maintains updated information about Grid jobs. Fourth, the Grid Resource Broker (GRB) is the core of the system architecture. It takes the requirements of Grid jobs and compares them with the resource information provided by the Information System. The GRB can therefore automatically select the most appropriate resources to meet the jobs' requirements, and further dispatches jobs to local schedulers for processing. Detailed information about the mechanism of the GRB is given later. The last component, the Information System, collects the dynamic information of local computing resources periodically and
updates the status of the resources in the InformationDB, which in turn is queried by the RM to serve the users. The Information System also responds to queries from the GRB and provides the resource status. The dynamic information of local resources, in XML format, is queried from Ganglia, the Network Weather Service (NWS), MDS2, MDS4, and so on.
The first database is the JobsDB, which is updated by the Job Monitor module at a pre-defined time interval. This information is used by the Job Monitor component to respond to queries from the GRB. The last database is the InformationDB, which stores all data provided by the Information System.
To handle the processes related to job submission, the GRB adopts the presented Scheduling Module to find an appropriate scheduling sequence and then dispatches jobs to the local schedulers. The most important part, the core of the resource broker, is the proposed new Scheduling Module, illustrated in FIG. 2. Its major characteristic is the presented scheduling policy, the Dynamic Loading Prediction Scheduling (DLPS) algorithm, which provides dynamic loading prediction and adaptive resource allocation functions. The scheduling policy is described in a later section.
FIG. 2 Grid Resource Broker Architecture
3.3 Grid Resource Broker Architecture
This section explains the whole architecture of the GRB and the function of each component. As shown in FIG. 2, upon receiving a job request from a user, the Request Manager manages the job submission with the proper job attributes. The user retains control over jobs submitted through the Request Manager. Next, the Dispatch Manager invokes the Scheduling Module to acquire the resource list, which is based on the criteria posted by the job as well as the resource status. With the suggestion from the Scheduling Module, the job is then dispatched. The Dispatch Manager also provides the ability to reschedule jobs and reports pending status back to the Request Manager.
In the Scheduling Module there are three main functions, namely the Job Priority Policy, the Resource Selection, and the Dynamic Resource Selection. The Job Priority Policy is responsible for initializing the selection process with an existing policy, such as the presented scheduling policy DLPS, FCFS, Round-Robin, Backfill, Small Job First, Big Job First, and so on. The Resource Selection provides resource recommendations based on static information such as hardware specification, specific software, and High-Performance Linpack Benchmark results. The Dynamic Resource Selection issues suggestions based on dynamic information such as the users' application requirements, network status, and the workload of individual machines. With the combined efforts of the three components, the Scheduling Module provides automatic scheduling and a re-scheduling mechanism. Automatic scheduling chooses the most appropriate resource candidate, followed by the second best choice and so on, while re-scheduling compensates for the mis-selection of resources by users. Once the Scheduling Module provides the best selection of resources, the process is passed to the Execution Manager, the Transfer Manager, and the Information Manager, which drive the Middleware Access Driver (MAD) layer to submit, control, and monitor the resource fabric layer of the Grid Computing environment.
3.4 Scheduling Policy - Dynamic Loading Prediction Scheduling Algorithm (DLPS)
The proposed job scheduling algorithm of the Scheduling Policy in the GRB is called Dynamic Loading Prediction Scheduling (DLPS). The objective function of DLPS is to achieve the minimized makespan, described by the following equation (1):

M* = Min[ max(d_k) - min(s_k) ]    (1)

Definition 3: M* is the minimized makespan. To predict precisely, the equation uses d_k and s_k: d_k is the maximum job ending time of the k-th job, i.e., the time at which the job completes, and s_k is the minimum job submitting time of the k-th job, i.e., the time stamp at which the user submits the k-th job.
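Equation (1) can be computed directly from the submit and end times of the scheduled jobs; a minimal sketch with invented job data:

```python
def makespan(jobs):
    """Makespan per equation (1): the maximum job ending time d_k
    minus the minimum job submitting time s_k.
    jobs: list of (submit_time, end_time) pairs."""
    return max(d for s, d in jobs) - min(s for s, d in jobs)

# Three hypothetical jobs as (submit, end) pairs:
print(makespan([(0, 30), (5, 42), (10, 25)]))  # -> 42
```

DLPS searches for the schedule that minimizes this value.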
The inputs come from users' requirements, such as a Resource Specification Language (RSL) document, a job template, or the Grid Portal form. The operations of DLPS are to select, schedule, reschedule, and submit jobs to the most appropriate resources.
The logical flow of DLPS is illustrated in FIG. 3. First, DLPS retrieves the resource information from the Information System and filters out unsuitable resources with the adaptive resource allocation function. DLPS then compares the free nodes with the required nodes. If the current free nodes are sufficient, DLPS assigns a higher weight (defined in Definition 4). Otherwise, the dynamic loading prediction function is entered, using the EstBacklog and minimum Job Expansion Factor methods (defined in Definitions 5 and 6) to predict which computing resources will respond and execute the job quickly, and the weight is calculated accordingly. DLPS finally ranks all available resources and selects the most appropriate resources to dispatch the job.
FIG. 3 The Logical Flow Chart of DLPS
Each step of the DLPS algorithm is described as follows.
Step 1: Process the users' criteria and job requirements from the RSL, job template, or Portal form specification, including the High-Performance Linpack Benchmark, data set, execution programs, queue type, etc.
Step 2: Communicate with the GRIS of each resource to obtain static and dynamic resource information.
Step 3: (1) Store the features and status of each cluster into the InformationDB through the Information System; (2) filter out unsuitable resources with the adaptive resource allocation function.
Step 4: Compare the free nodes with the required nodes. If the current free nodes are sufficient, DLPS calls the weighted function to calculate weight_a and weight_b (defined in Definition 4).
Definition 4: When the free nodes fulfill the required nodes, the designed weighted function is

weight = (f_n / R_n) × M_capability

where R_n is the number of required nodes, f_n the number of free nodes, and M_capability the capability of each computing resource.
Step 5: If the current free nodes are not sufficient, DLPS calls the dynamic loading prediction function with two methods, EstBacklog and minimum Job Expansion Factor, to calculate Weight_c (defined in Definition 7).
Definition 5: EstBacklog is the estimated backlog of queued work, in hours. Its general form is shown in the following equation (2):

EBL_i* = (QueuePS × CPUAccuracy / TotalJobsCompleted) × (TotalProcHours × AvailableProcHours / (DedicatedProcHours × 3600))    (2)

EBL_i* is the i-th EstBacklog time. QueuePS is the idle time of queued jobs, CPUAccuracy is the actual run time of jobs, TotalJobsCompleted is the number of jobs completed, TotalProcHours is the total number of proc-hours required to complete running jobs, AvailableProcHours is the total proc-hours available to the scheduler, and DedicatedProcHours is the total proc-hours made available to jobs.

Some of the above values come from historical statistics of the queuing system loading, and the others from the real-time queuing situation. The output is divided into two categories, Running and Completed. The Running statistics include information about jobs that are currently running, while the Completed statistics are compiled using historical information from both running and completed jobs. With this information, EBL_i* can forecast the backlog of each computing site.
Definition 6: The job expansion factor subcomponent has an effect similar to the queue time factor, but favors shorter jobs based on their requested wallclock run time. In its canonical form, the job expansion factor metric is calculated from the local queuing system information as in equation (3):

JEF_i = (QueuedTime + RunTime) / WallClockLimit    (3)
Definition 7: After obtaining the EstBacklog and job expansion factor, the Weight_c metric is calculated by the following equation (4):

Weight_c = λ × (EBL_i* / Σ_{i=1..n} EBL_i*) + (1 - λ) × (JEF_i / Σ_{i=1..n} JEF_i)    (4)

where λ is a system-modulated parameter obtained from numerous trials. Since EstBacklog generally better reflects the dynamic situation of the queuing system, a higher λ value is normally used.
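Assuming the reconstructed forms of equations (3) and (4), that is, JEF_i = (QueuedTime + RunTime) / WallClockLimit and Weight_c as a λ-blend of each site's normalized EstBacklog and normalized job expansion factor, the computation can be sketched as follows. All input values are invented, and EBL_i* would come from equation (2):

```python
def jef(queued_time, run_time, wallclock_limit):
    """Job expansion factor, equation (3): favors shorter jobs."""
    return (queued_time + run_time) / wallclock_limit

def weight_c(ebl, jefs, i, lam=0.75):
    """Weight_c for site i, equation (4): a lambda-weighted blend of
    site i's share of the total EstBacklog and of the total expansion
    factor. lam is the system-modulated parameter (0.75 is an arbitrary
    choice, reflecting the paper's preference for a higher lambda)."""
    return lam * ebl[i] / sum(ebl) + (1 - lam) * jefs[i] / sum(jefs)

ebl = [4.0, 8.0, 4.0]   # hypothetical EstBacklog hours per site
jefs = [1.0, 1.0, 2.0]  # hypothetical job expansion factors per site
weights = [weight_c(ebl, jefs, i) for i in range(3)]
print(weights)  # -> [0.25, 0.4375, 0.3125]
```

DLPS would rank the sites by these weights and dispatch the job accordingly.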
Step 6: Calculate the minimum total time to delivery (response time).
4 Performance Evaluation
4.1 Experimental Environment
To test the efficiency of the presented GRB with the developed scheduling algorithm, we execute Computational Fluid Dynamics (CFD) Message Passing Interface (MPI) programs on a heterogeneous research testbed. We adopt the NCHC testbed, comprising the Nacona, Opteron, and Formosa Extension 2 (Formosa Ext. 2) clusters. The main environment characteristics are summarized in Table 1.
We also measured the High-Performance Linpack Benchmark of these clusters. Rmax is the largest performance in Gflop/s achieved for a problem run on a machine, and Rpeak is the machine's theoretical peak performance in Gflop/s. Users can therefore choose the cluster with the higher Rmax value, or set criteria on computing power when submitting jobs, according to the information in Table 2.
Table 1 Summary of Environment Characteristics of NCHC Grid Resources

Resource        CPU Model                      Memory (GB)  CPU Speed (MHz)  #CPUs  Nodes  Job Manager
Nacona          Intel(R) Xeon(TM) CPU 3.20GHz  4            3200             16     8      Torque
Opteron         AMD Opteron(tm) Processor 248  4            2200             16     8      Moab
Formosa Ext. 2  Intel(R) Xeon(TM) CPU 3.06GHz  4            3060             22     11     Maui
Table 2 High-Performance Linpack Benchmark of NCHC Grid Resources

High-Performance Linpack Benchmark  Nacona Cluster  Opteron Cluster  Formosa Ext. 2 Cluster
Rmax (Gflops)                       46.791424       34.08            75.447
Rpeak (Gflops)                      102             70               134.64
Number of CPUs                      16              16               22
Efficiency of CPU (Gflops/CPU)      2.924           2.13             3.429
4.2 Experimental Scenario
Several experimental preliminaries must be set up, including the start times of jobs, the convergence of the MPI matrix, and the number of required computing CPUs. These preliminaries are generated by a normal distribution. We therefore generate several MPI programs requiring 2, 4, and 8 CPUs at random. The evaluation is performed on the three clusters with three experiment models: a 4096*4096 matrix computed with 2 CPUs, a 4096*4096 matrix computed with 4 CPUs, and an 8192*8192 matrix computed with 8 CPUs.
The next experiment compares the performance of the DLPS job scheduling algorithm with several algorithms: Round-Robin, Short-Job-First (SJF), Big-Job-First (BJF), and First-Come-First-Serve (FCFS). We submitted randomly generated test jobs from the synthetic models, as shown in FIG. 4. The vertical axis is the minimum makespan (seconds) and the horizontal axis is the number of jobs. The makespan of the presented GRB with the DLPS job scheduling algorithm is notably smaller than that of the other algorithms, especially when large numbers of jobs are submitted. The objective function of DLPS thus approaches the minimized makespan, and the dynamic loading prediction characteristic of the presented GRB is shown to be beneficial in this experiment.
FIG. 4 Compare Makespan of DLPS with Other Algorithms
Finally, the last experiment discusses the efficiency of all algorithms (defined in Definition 8), shown in FIG. 5, FIG. 6, FIG. 7, and FIG. 8.
Definition 8: The efficiency of each algorithm is measured by the following equation (5):

Efficiency = (Makespan_Algorithm - Makespan_DLPS) / Makespan_Algorithm × 100%    (5)
Equation (5) compares each algorithm with DLPS. If the efficiency is positive, DLPS is more efficient; otherwise, it is less efficient.
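Equation (5) amounts to the relative makespan improvement of DLPS; a minimal sketch with invented makespans:

```python
def efficiency(makespan_algorithm, makespan_dlps):
    """Equation (5): relative improvement of DLPS over another algorithm,
    in percent. Positive means DLPS produced the shorter makespan."""
    return (makespan_algorithm - makespan_dlps) / makespan_algorithm * 100.0

# DLPS finishing in 400 s versus another algorithm finishing in 500 s:
print(efficiency(500.0, 400.0))  # -> 20.0
```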
When small numbers of jobs are submitted, the efficiency of DLPS may be worse than that of the other algorithms, especially SJF and FCFS. This is reasonable, because small jobs are easily consumed by SJF and FCFS. As the number of jobs increases, the developed DLPS becomes clearly better than SJF and FCFS, because the notable drawback of SJF and FCFS appears: large numbers of jobs suffer starvation, as shown in FIG. 5 and FIG. 8. Considering the efficiency figures overall, the best efficiency of DLPS occurs under extremely full usage of each cluster.
FIG. 5 Compare the Efficiency of DLPS with SJF
FIG. 6 Compare the Efficiency of DLPS with Round-Robin
FIG. 7 Compare the Efficiency of DLPS with BJF
FIG. 8 Compare the Efficiency of DLPS with FCFS
5 Conclusion and Future Work
The Grid Resource Broker takes a further step toward establishing virtual computing sites for the Computing Grid service. The presented GRB can satisfy users' requirements, including the hardware specification, specific software, and the High-Performance Linpack Benchmark results, and can then automatically select the most appropriate physical computing resource. With the automatic scheduling and dynamic loading prediction features provided by the Scheduling Module, Grid users are no longer required to select the execution resource for a computing job. Instead, the Grid Resource Broker provides an automatic selection mechanism that integrates both static and dynamic resource information to meet the demands of Grid jobs.
According to the previous experiments, dynamic loading prediction job scheduling has better efficiency and performance than the other algorithms, especially when large numbers of jobs are submitted to the Grid computing sites. We can therefore conclude that the algorithm is well suited to handling large amounts of jobs in a Grid computing environment.
6 References

[1] R. Al-Khannak, and B. Bitzer, "Load Balancing for Distributed and Integrated Power Systems using Grid Computing," International Conference on Clean Electrical Power (ICCEP), 22-26 May 2007, pp. 123-127.
[2] http://www.gridway.org/
[3] http://www.globus.org/
[4] V. Hamscher, U. Schwiegelshohn, A. Streit, and R. Yahyapour, "Evaluation of Job-Scheduling Strategies for Grid Computing," Proceedings of 1st IEEE/ACM International Workshop on Grid Computing (Grid 2000) at 7th International Conference on High Performance Computing (HiPC-2000), Bangalore, India, LNCS 1971, 2002, pp. 191-202.
[5] I. Foster and C. Kesselman, The Grid 2: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, San Francisco, CA, 2004.
[6] Yi-Lun Pan, Yuehching Lee, Fan Wu, "Job Scheduling of Savant for Grid Computing on RFID EPC Network," IEEE International Conference on Services Computing (SCC), July 2005, 75-84.
[7] J. M. Alonso, V. Hernandez, and G. Molto, "Towards On-Demand Ubiquitous Metascheduling on Computational Grids," the 15th Euromicro Conference on Parallel, Distributed and Network-based Processing (PDP), February 2007, pp 5.
[8] J. Schopf, "A General Architecture for Scheduling on the Grid, "Journal of Parallel and Distributed Computing, special issue, April 2002, p. 17.
[9] A. Othman, P. Dew, K. Djemame and I. Gourlay, "Toward an Interactive Grid Adaptive Resource Broker," Proceedings of the UK e-Science All Hands Meeting, Nottingham, UK, September 2003, pp. 4.
[10] M. Grajcar, "Strengths and Weakness of Genetic List Scheduling for Heterogeneous Systems," Application of Concurrency to System Design, 2001. Proceedings. 2001 International Conference, 25-29 June 2001, pp. 123-132.
[11] Barry G. Lawson, and E. Smirni, "Multiple-queue Backfilling Scheduling with Priorities and Reservations for Parallel Systems," ACM SIGMETRICS Performance Evaluation Review, vol. 29, Issue 4, March 2002, pp. 40-47.
[12] N. Fujimoto, and K. Hagihara, "A Comparison among Grid Scheduling Algorithms for Independent Coarse-Grained Tasks," Symposium on Applications and the Internet-Workshops, Tokyo, Japan, 26 – 30, January, 2004, p. 674.
[13] N. Fujimoto, and K. Hagihara, "Near-optimal Dynamic Task Scheduling of Independent Coarse-grained Tasks onto a Computational Grid," In 32nd Annual International Conference on Parallel Processing (ICPP-03), 2003, pp. 107–131.
[14] P. Lampsas, T. Loukopoulos, F. Dimopoulos, and M. Athanasiou, "Scheduling Independent Tasks in Heterogeneous Environments under Communication Constraints," the 7th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), 4-7 December 2006, Taipei, Taiwan.
Resource Consumption in Heterogeneous Environments

Niko Zenker
Business Informatics
Department of Technical and Business Information Systems
Otto-von-Guericke Universität Magdeburg
P.O. Box 4120, D-39016 Magdeburg, GERMANY
+49 (0)391 67 [email protected]−magdeburg.de

Martin Kunz, Steffen Mencke
Software Engineering
Department of Distributed Systems
Otto-von-Guericke Universität Magdeburg
P.O. Box 4120, D-39016 Magdeburg, GERMANY
+49 (0)391 67 12662, [makunz, mencke]@ivs.cs.uni-magdeburg.de
Abstract—Depending on the server, services need different resources. This paper tries to identify measurable resources and the dimensions these resources need to have. Once they are measurable, a guideline to measure and evaluate these services is given.
Keywords: Resources, Heterogeneous Environment
I. INTRODUCTION
An international discussion about global warming and CO2 reduction led to a trend called GreenIT. Each individual computer center can contribute to this trend, but to do so it is necessary to measure the resource consumption of the computer center. The sum of all the services executed in the computer center is responsible for the overall resource consumption. Therefore it is necessary to have a closer look at each individual service and its resource consumption. This helps to redesign the computer center for a resource-saving environment.
How much CO2 is used by a Google search? On the internet several figures are discussed. Let us assume that the overall consumption of the data centers, the human resources necessary for the operation of these data centers, the network infrastructure and of course the individual computer used for the Google search sums up to 5 grams¹ of CO2. Once the user hits the "Search" button the Google servers start working and present a result within seconds. With respect to the individual bandwidth of each user it is probably not necessary to get the result as fast as possible. Performing the search on a "slower" node² in the data center can reduce the overall usage of CO2. The result is presented to the user and, due to the lower bandwidth, the different search method is not even noticed.
Many other scenarios are imaginable where a different node can reduce the overall consumption of CO2 in the data center. But for all these scenarios it is necessary to measure the available resources. This paper will break the data center down to a place where different services are executed. The authors
¹ Due to insufficient proof of the numbers presented by http://blogs.sun.com/rolfk/entry/your co2 footprint when using, a smaller amount is assumed.

² Assuming that a slower node is using less power and therefore consuming less CO2.
assume that each node in the data center is able to execute all services and therefore different results are comparable. The paper is structured as follows. At the beginning different resources are defined and an effort to describe these resources in standardized terms is shown. An approach to measure the resources is presented. Once measured, all the values have to be summarized for the whole chain of services executed. Therefore a formula to create a resource-concerned model is motivated. Finally the outlook presents ideas and projects where the service measurement is needed, and the conclusion ends the paper.
II. RESOURCES
The term resource is used throughout every field of the scientific community. The W3C sees resources as entities³, but gives no further description. Following the intentions of the Resource Description Framework presented by the W3C, resources are websites, or in general information on the internet. In the same manner the Uniform Resource Identifier (URI) is just a description of the place where a single website can be found. An economist describes resources as current assets, a sociologist identifies the social status of a person as a resource, and the psychologist enriches the term resource with social skills and talents of a patient. For this paper a resource is defined as a physical entity that is measurable. In addition, only resources in a data center are taken into consideration. Human resources are not considered.
Before the measurement can start, it is necessary to redefine the dimension of each resource. Current attempts to compare different systems and environments like [ZBP06] and [BBS07] deliver only limited information like CPU time (measured in MIPS) or disk performance (measured in MB/s). Especially [BBS07] suggests an overall resource classification for IT, making it possible to compare between different environments. The authors assume that a multi-dimensional index filled with different measurable values is able to compare these environments. Therefore each dimension is described with
³ http://www.w3.org/TR/rdf-mt/
multiple values. The described dimension points can be used as a basic structure for measurement but also as a base for comparison of a single service between different servers. A non-exhaustive list of values for that index is shown below:
RAM
  String Name = "RAM"
  String Type = "RDRAM" | "SDRAM"
  Int Size = 1024
  String Unit = "B" | "MB" | "GB"

HD
  String Name = "HD"
  Int Size = 30
  String Unit = "MB" | "GB"
  Int Partition = 1
  String PartitionType = "NTFS" | "EXT3"

CPU
  String Name = "CPU"
  String Type = "Pentium4" | "AMD64-X2"
  Float MHZ = 4.2
  Int Flops = 54638255
  Int NumberOfCores = 2
  Int CurrentCore = 0
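As a rough sketch, the index above could be represented as plain data records. The field names follow the listing; the `ResourceIndex` container class and the Python representation are our own assumptions, not part of the original proposal.

```python
from dataclasses import dataclass, field

@dataclass
class RAM:
    name: str = "RAM"
    type: str = "RDRAM"           # "RDRAM" | "SDRAM"
    size: int = 1024
    unit: str = "MB"              # "B" | "MB" | "GB"

@dataclass
class HD:
    name: str = "HD"
    size: int = 30
    unit: str = "GB"              # "MB" | "GB"
    partitions: int = 1
    partition_type: str = "EXT3"  # "NTFS" | "EXT3"

@dataclass
class CPU:
    name: str = "CPU"
    type: str = "AMD64-X2"        # "Pentium4" | "AMD64-X2"
    mhz: float = 4.2
    flops: int = 54638255
    number_of_cores: int = 2
    current_core: int = 0

@dataclass
class ResourceIndex:
    """One entry per measurable dimension of a node (hypothetical container)."""
    ram: RAM = field(default_factory=RAM)
    hd: HD = field(default_factory=HD)
    cpu: CPU = field(default_factory=CPU)

node = ResourceIndex()
print(node.cpu.number_of_cores)  # 2
```

Keeping every dimensional value in such a record matches the authors' point that all dimensional values should be saved even when a combined metric is derived from them.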
After identifying the base-measures (the single values derived from each dimension), the goal should be to create a combined metric, or a set of metrics. This is especially important when this resource metric is used to compare and evaluate services between different server environments. Such an evaluation also demands thresholds and quality models for an analysis of the measured values. A single metric is much easier to compare than several different values derived from the single dimensions. Nevertheless it is very important to save all dimensional values in order to have a closer look at individual services if necessary.
For the combined metric it is important to assess the base-measures as to their priority for the metric. Such an assessment should be transparent, to preserve the information entropy and to avoid errors due to wrong priority weighting. Furthermore, research has to take a closer look at a run-time based metric with elements from the list above. It may be possible that there is also a correlation between source code, design, or structural description and values derived from a metric.
Performance properties are measured and compared in [RKSD07]. This approach can deliver good results, especially for the first way of measurement presented later in this paper.
Besides the definition of metrics and indicators, another important issue is the evaluation and analysis of measured values. Therefore an adaptation of the web service measurement presented in [Sch04] is proposed. The values are measured in a distinct time frame. Thus a diagram of all values can be created, the quality of the service is directly viewable, and actions on a service, like restart or rearrangement, can take place. Figure 1 gives an example of service measurement over this distinct time frame.
In a service oriented architecture not every service is utilized in the same way as others. The definition of the time frame
Fig. 1. Diagram of Ping and Invocation Times
is difficult, because it needs a detailed look at each service to determine its utilization span. Some services are used every other minute (e.g. sending a document to a printer), but some are utilized just once a year (e.g. the creation of the annual financial statements). Both services are vital to the profit of the company, but the second is only necessary once a year. The time frame for the measurement has to be adapted to get an informative result, useful for evaluation and comparison between different servers.
Using techniques like BPEL it is easy to define certain services and transactions in a process. The resources are directly connected to each step in the process. After the knowledge about the resources is extracted, a combination of all resources used in the process is possible. This means that at design time the estimated amount of resources is predictable. Using the resource information at run-time, several scenarios of rearranging services are imaginable. In this manner the resource information can support virtualisation in a way that the virtual environment can be configured precisely according to the resource consumption of the service that will run in that environment.
III. MEASUREMENT OF RESOURCES
Identifying a server in a computer center is an easy task. The same applies to measuring the CPU load on this server. The result is a number showing an administrator how well or poorly the server is utilized. It is also easy to look at the capacity of the hard disks and to make a prediction how long the remaining space will last before maintenance has to take place. Breaking it down to an individual service requires a more complex measurement scenario. This fact derives not only from the more detailed view on the services; it is also motivated by the commingling of the services itself. This means that a service is built up from sub-services, with special resource demands, needed for its execution. In order to evaluate the whole service, all these values have to be considered.
The measurement itself can take place at the underlying operating system. But a measurement-service needs to know
what services should be measured. Values will be collected according to the demands of the metric. Each individual value will then be integrated into the measurement result provided to the measurement service. Working with interfaces makes it possible to implement different measurement methods, e.g. for different operating systems. The measurement service is currently under development, therefore no explicit operating scheme can be presented. The class diagram presented in figure 2 is based on [Mar2003] and shows a class diagram for an ontology for software metrics and indicators. Martin identifies in his class diagram objects that are measurable, but it also defines how the objects will be measured, who is measuring them and of course the place where the results are stored.
The adaptation from software metrics toward hardware metrics is not yet done. For now the authors assume that a transformation process of all hardware metrics is possible.
But before the creation of a metric it is necessary to define three stages of service measurement. Each stage has a different priority. The first is a measurement of the service with randomly selected values, done by the developer of the service. The second is a measurement in a special testing mode at runtime, and the third is an outside view of the service at runtime.
Based on the quality of Web services, these measurement methods are already discussed and motivated. This approach can be adopted for the measurement of resources. [SB03] describes all three methods. [Rud05] describes the testing mode and the outside view in greater detail. In [RKSD07] the simulation approach is described. All methods produce values suitable for an estimate of resource consumption, but in order to generate reliable values all three methods have to be combined in a measurement procedure. This procedure uses the values derived from the initial simulation, done by the developer. In this way a starting value for the service is created. Once installed in the desired computer environment, the test-mode will produce new values suitable for a distinguished proposition. The third method will then refine the simulated and tested values.
This ensures an acceptable measurement for proper values. The simulation approach is necessary to have a starting value for a new environment. A developer can influence the results of the measurement. Using special values for the simulation phase produces a first estimate of resource consumption, with little liability for an environment at runtime. Nevertheless these values are critical for a method to compare actual resource consumption and the maximum consumption for a singular service.
The test-mode in the distinguished environment refines the initial values with a detailed immersion of each individual node existing in the computer center. The test-mode will enumerate all computers in the environment suitable for the service. The service is computed on each computer under different circumstances, like heavy load of the CPU or a congested network interface. In the end a detailed map of each node and its resource consumption for the service is created.
TABLE I
PRIORITIES FOR THE THREE STAGES OF SERVICE MEASUREMENT

Stage | Priority
I     | 0.33
II    | 0.5
III   | 1
This information is also necessary for an estimation method of resource consumption and current load. The priority, and for that matter the importance, is higher than the one recovered in the simulation phase.
Using test-values selected by the developer for the first stage and for the special test-mode in the actual data center can be dangerous to the output of the measurement process. Well-chosen test-values can create a measurement result favorable to the service⁴. Such values are of course not desired, and they badly influence the overall outcome of the measurement process. Therefore another stage of measurement needs to be considered.
Therefore the authors demand a stage that measures values when the service is in productive mode at runtime. The measurement itself is a collection of needed information from the service itself, or an outside view on the service. The outside view is more complex, because influences of other services have to be considered. These influences are welcome, in order to create a complex view on resource consumption correlating with other services. Due to constant changes in the environment, each value creates an immersion for the resource measurement useful in an adaptive computer center.
In figure 3 all three stages are shown. The values derived from all three stages have to be part of the metric. As described above, the priority of each stage is not the same. Finding the right coefficient for each stage is a hard task. Depending on the service these coefficients can vary; practical research has to refine them. For now the authors assume that the highest priority has to be appointed to values derived from stage III, and that values from stage II are better than values from stage I. Therefore the priorities (prior) shown in table I are assumed.
For the overall value used for the metric, the sum over all stages has to be created, with each stage's value multiplied by its priority:

Value_Service = Σ_Stage (prior_Stage × Value_Stage)

This value is now a combination of three values measured in different environments. Each environment is different from the others, therefore heterogeneous systems influence the metric for a service.
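A minimal sketch of this weighted combination, using the provisional priorities from Table I (the function and variable names are our own):

```python
# Stage priorities from Table I (assumed, provisional coefficients).
PRIORITIES = {"I": 0.33, "II": 0.5, "III": 1.0}

def service_value(stage_values):
    """Combine per-stage measurements into one value:
    Value_Service = sum over stages of prior_stage * Value_stage."""
    return sum(PRIORITIES[stage] * value
               for stage, value in stage_values.items())

# Simulation (I), test-mode (II), and productive runtime (III) measurements:
print(service_value({"I": 10.0, "II": 8.0, "III": 6.0}))
# 0.33*10 + 0.5*8 + 1*6, i.e. about 13.3
```

The sketch makes the authors' weighting explicit: a runtime (stage III) measurement contributes roughly three times as much to the combined value as a developer simulation (stage I).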
All three stages should be used for the measurement, but it is still necessary to integrate the time and other circumstances into the values. A service executed on a server with heavy load reacts differently, and the performance metric is not as good as
⁴ The service may be designed to create a good output with regard to performance. This can be done by special switches inside the source code.
Fig. 2. A UML diagram to the ontology for software (and web) metrics and indicators [Mar2003]
on a server with no load. Especially with the third stage of measurement the circumstances are accessible.
IV. AGGREGATION OF SERVICES
In a heterogeneous environment many services are combined to fulfill the desired goal of the computer system. In order to measure and estimate all these services, their aggregation has to be considered. Especially in a service oriented environment many heterogeneous services are executed. Each service contains sub-services necessary for its execution. Depending on the orchestration of the (sub-)services, the overall consumption of resources has to be calculated in different ways. Four types of orchestration methods are described by [CDPEV05]. The parameters are used as quality of service parameters, suitable for different projects. These parameters are listed in table II and table III for the different ways of orchestration.
TABLE II
AGGREGATION FUNCTIONS FOR QOS PARAMETERS FOR SEQUENCE AND SWITCH [CDPEV05]

QoS Attr.        | Sequence             | Switch
Time (T)         | Σ_{i=1..m} T(t_i)    | Σ_{i=1..n} pa_i × T(t_i)
Cost (C)         | Σ_{i=1..m} C(t_i)    | Σ_{i=1..n} pa_i × C(t_i)
Availability (A) | Π_{i=1..m} A(t_i)    | Σ_{i=1..n} pa_i × A(t_i)
Reliability (R)  | Π_{i=1..m} R(t_i)    | Σ_{i=1..n} pa_i × R(t_i)
It is of course possible that each individual service is executed on a different server. The orchestration is then influenced by the delay the network produces. Therefore not only
Fig. 3. Three stages of service measurement
TABLE III
AGGREGATION FUNCTIONS FOR QOS PARAMETERS FOR FLOW AND LOOP SERVICES AGGREGATION [CDPEV05]

QoS Attr.        | Flow                      | Loop
Time (T)         | max{T(t_i), i ∈ {1..p}}   | k × T(t)
Cost (C)         | Σ_{i=1..p} C(t_i)         | k × C(t)
Availability (A) | Π_{i=1..p} A(t_i)         | A(t)^k
Reliability (R)  | Π_{i=1..p} R(t_i)         | R(t)^k
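The aggregation functions of tables II and III can be sketched for the time and availability attributes (the function names and `mode` strings are our own; cost follows the time pattern and reliability follows the availability pattern):

```python
import math

def aggregate_time(times, mode, probs=None, k=1):
    """Aggregate execution times per the [CDPEV05] composition patterns."""
    if mode == "sequence":
        return sum(times)
    if mode == "switch":          # probs: branch probabilities pa_i
        return sum(p * t for p, t in zip(probs, times))
    if mode == "flow":            # parallel branches: slowest one dominates
        return max(times)
    if mode == "loop":            # k iterations of a single task
        return k * times[0]
    raise ValueError(mode)

def aggregate_availability(avails, mode, probs=None, k=1):
    """Aggregate availabilities; all parts of a sequence/flow must be up."""
    if mode in ("sequence", "flow"):
        return math.prod(avails)
    if mode == "switch":
        return sum(p * a for p, a in zip(probs, avails))
    if mode == "loop":
        return avails[0] ** k
    raise ValueError(mode)

print(aggregate_time([2.0, 3.0], "sequence"))          # 5.0
print(aggregate_time([2.0, 3.0], "flow"))              # 3.0
print(aggregate_availability([0.9, 0.9], "sequence"))  # about 0.81
```

The multiplicative rules make the cost of long service chains visible: two services that are each 90% available yield a sequence that is only about 81% available.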
one value has to be taken into account as base-measurement. In the chapter "Resources" a motivation for a multi-dimensional index was given. The inclusion of the network delay supports exactly this theory.
The re-arrangement of the services is also possible when they are executed in a virtual environment. The performance of the service is then of course influenced by the virtual environment and its resource consumption. In [MEN05] concepts, problems, and metrics for virtualization are discussed.
V. OUTLOOK
The expressive power of the metric, motivated by figure 2, is still unclear. Further research will extend the metric with more results and of course the correlating influences.
A current project is developing and implementing the infrastructure to measure services within a SOA. The results of the measurement will be published to prove that there is a difference in resource consumption, depending on the server used for execution.
For an automation of service resource management, the described approach contains a semantic description of the defined metrics. Historically, ontologies have possessed the capability to describe information in a machine accessible manner. Existing solutions in this area, for example the ontology for object-oriented metrics presented in [KKSD06], form a framework for an ontology about resource metrics for services.
Once the automated resource management exists, a framework can be created that works in a heterogeneous environment. This environment is equipped with different servers, most likely running different operating systems. All services can be executed within the environment. The framework can then re-arrange these services to fulfill the desired outcome. One of these outcomes is CO2 reduction, but it is also imaginable to re-arrange the services according to other demands like cost saving, better performance, or secure execution according to the demands of the customer. This leads toward an automatic data center as motivated in [BEME05].
Furthermore, the desired resource measurement should be implemented in service development and maintenance process standards to ensure low resource consumption throughout the service life cycle. An adaptation of the measurement service to international standards like CMMI or Six Sigma is desired.
The desired framework is flexible for all current operating systems, and no expensive hardware has to be acquired. The usage of the "old" equipment saves costs, and due to a better distribution of services the overall performance of the system will rise.
REFERENCES
[BBS07] R. Brandl, M. Bichler, and M. Strobel. Cost Accounting for Shared IT Infrastructures - Estimating Resource Utilization in Distributed IT Architectures. Wirtschaftsinformatik, 49(2):83-94, 2007.
[BEME05] M. Bennani, D. Menasce. Resource Allocation for Autonomic Data Centers using Analytic Performance Models. Proceedings of the 2005 IEEE International Conference on Autonomic Computing, 2005.
[CDPEV05] G. Canfora, M. Di Penta, R. Esposito, and M.L. Villani. An approach for QoS-aware service composition based on genetic algorithms. Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, pages 1069-1075, 2005.
[Mar2003] Maria de los Angeles Martin, Luis Olsina. Towards an Ontology for Software Metrics and Indicators as the Foundation for a Cataloging Web System. Proceedings of the First Latin American Web Congress (LA-Web03), pages 103-113, 2003.
[MEN05] D.A. Menasce. Virtualization: Concepts, applications, and performance modeling. Proceedings of the 31st Int. Computer Measurement Group Conf., pages 407-414, 2005.
[KKSD06] M. Kunz, S. Kernchen, R. Dumke, A. Schmietendorf. Ontology-based web-service for object-oriented metrics. Proceedings of the International Workshop on Software Measurement and DASMA Software Metrik Kongress, pages 99-106, 2006.
[RKSD07] D. Rud, M. Kunz, A. Schmietendorf, R. Dumke. Performance Analysis in WS-BPEL-Based Infrastructures. Proceedings of the 23rd Annual UK Performance Engineering Workshop (UKPEW 2007), pages 130-141, 2007.
[Rud05] D. Rud. Qualität von Web Services. VDM Verlag, 2005.
[SB03] S. Battle. Boxes: black, white, grey and glass box views of web-services. Technical Report HPL-2003-30, HP Laboratories Bristol, 2003.
[Sch04] A. Schmietendorf, R. Dumke, D. Rud. Ein Measurement Service zur Bewertung der Qualitätseigenschaften von im Internet angebotenen Web Services. MMB-Mitteilungen Nr. 45, pages 6-16, 2004.
[ZBP06] Rüdiger Zarnekow, Walter Brenner, and Uwe Pilgram. Integrated Information Management - Applying Successful Industrial Concepts in IT. Springer Berlin Heidelberg, 2006.
Experience in testing the Grid based Workload Management System of a LHC experiment

V. Miccio (a,b,c), A. Fanfani (c,d), D. Spiga (e,b), M. Cinquilli (e), G. Codispoti (d), F. Fanzago (b,c), F. Farina (f,b), S. Lacaprara (g), E. Vaandering (h), A. Sciaba (b), S. Belforte (i)
(a) Corresponding/Contact Author: CERN, BAT.28-1-019, 1211 Geneve 23, [email protected], Office: +41 (0)22 76 77215, Fax: +41 (0)22 76 79330
(b) CERN, Geneva, Switzerland
(c) INFN/CNAF, Bologna, Italy
(d) University of Bologna, Italy
(e) University and INFN, Perugia, Italy
(f) INFN, Milano-Bicocca, Italy
(g) INFN, Legnaro, Italy
(h) FNAL, Batavia, Illinois, USA
(i) INFN, Trieste, Italy
The computational problem of large-scale distributed collaborative scientific simulation and analysis of experiments is one of the many challenges represented by the construction of the Large Hadron Collider (LHC) at the European Laboratory for Particle Physics (CERN). The main motivation for LHC to use the Grid is that CERN alone can supply only part of the needed computational resources, which instead have to be provided by tens of institutions and sites. The key issue of coordinating and integrating such spread resources leads to the building of the largest computing Grid on the planet.
Within such a complex infrastructure, testing activities represent one of the major critical factors for deploying the whole system. Here we will focus on the computing model of the Compact Muon Solenoid (CMS) experiment, one of the four main experiments that will run on the LHC, and will give an account of our testing experience concerning the analysis job workflow.
Keywords: Grid Computing, Distributed Computing, Grid Application Deployment, High Energy Physics Computing
Conference: GCA’08 − The 2008 International Conference on Grid Computing and Applications
1. Introduction
CMS represents one of the four particle physics experiments that will collect data at LHC starting in 2008 at CERN, and one of the two largest collaborations. The outstanding amount of produced data, something like 2 PB per year, should be available for analysis to world-wide distributed physicists.
The CMS computing system itself relies on geographically distributed resources, interconnected via high throughput networks and controlled by means of Grid services and toolkits, whose building blocks are provided by the Worldwide LHC Computing Grid (WLCG, [1]). CMS builds application layers able to interface with several different Grid flavors (LCG-2, Grid-3, EGEE, NorduGrid, OSG). A WLCG-enabled hierarchy of computing tiers is depicted in the CMS computing model [2], and their role, required functionality and responsibilities are specified (see Figure 1).

Figure 1: CMS computing model and its tiers hierarchy
CERN constitutes the so-called Tier-0 center: here data from the detector will be collected, the first processing and storage of the data will take place, and raw/reconstructed data will be transferred to Tier-1 centers. Besides the Tier-0 center, CERN will also host the CMS Analysis Facility (CAF), which will have access to the full raw data and will be focused on latency-critical detector, trigger and calibration activities. It will also provide some CMS central services like the storage of conditions data and calibrations.
There are then two levels of tiers for quite different purposes: organized mass data processing and custodial storage is performed at about 7 Tier-1 sites located at large regional computing centres, while a larger number of Tier-2 sites are dedicated to computing.
For what concerns custodial storage, Tier-1 centers receive simulated data produced within the Tier-2 centers, and will receive reconstructed data together with the corresponding raw data from the Tier-0. Regarding organized mass data processing activities, Tier-1 centers will be in charge of calibration, re-processing, data skimming and other organized intensive analysis tasks.

The Tier-2 centres are essentially devoted to the production of simulated data and to the user distributed analysis of data imported from Tier-1 centers. In this sense, the Tier-2 activities are much more "chaotic" with respect to the higher tiers: analyses are not centrally planned, and resource utilization decisions are closer to end users, which can leverage a wider set of them. So the claim for high flexibility and robustness of the workflow management leads to an extremely compound infrastructure, which inevitably entails a large effort in the testing and deployment phase.
2. Workload Management System forAnalysis
The Workload and Data Management Systems have been designed to make use of the existing Grid services as much as possible, building CMS-specific services on top of them.
In particular, the CMS Workload Management System (WMS) relies on the Grid WMS provided by the WLCG project for job submission and scheduling onto resources according to the CMS Virtual Organization (VO) policy and priorities. Using the Grid Information System, it knows the available resources and their usage. It performs matchmaking to determine the best site to run the job and submits it to the Computing Element (CE) of the selected site, which in turn schedules it in the local batch system. The Worker Node (WN) machines where jobs run have POSIX-IO-like access to the data stored in the local Storage Element (SE).
On top of the Grid WMS, CMS has built the CMS Remote Analysis Builder (CRAB, [3-5]), an advanced client-server architecture for the workflow management of specific CMS-software (CMSSW) analysis jobs. It is based on independent components communicating through an asynchronous and persistent message service, which can provide for the strong requirement of extreme flexibility. Such a server is placed between the user and the Grid to perform a set of actions on the user's behalf through a delegation service which handles user proxy certificates. The main goal of such an intermediate server is to automate the whole distributed analysis workflow as much as possible and to improve the scalability of the system, in order to fulfill the target requirement rate of more than 100 thousand jobs handled per day when LHC will be fully operational.
Anyway, the client-server implementation is transparent to the end users. From this point of view CRAB simply looks like a dedicated front end for specific CMSSW analysis jobs. It enables the user to process datasets and Monte Carlo (MC) samples taking care of CMSSW specific features and requirements, provides the user with a simple and easy to use interface, hides from the user the direct interaction with the Grid, and reduces the user load by automating most of the actions and looking after error handling and resubmissions.
CRAB's own functionalities and its integrated interaction with the underlying Grid environment need a dedicated test and deployment activity.
3. Test Experiences
The CMS experiment is getting ready for real LHC data handling by building and testing its Computing Model through daily experience on production-quality operations as well as in challenges of increasing complexity. The capability to simultaneously address both these complex tasks relies on the quality of the developed tools and related know-how, and on their capability to manage switches between testbed-like and production-like infrastructures.
Such intermediate activities between development and production are crucial, due to the large number of different services running on different layers, using different technologies, within the multiple operational scenarios that will apply when the LHC works at full capacity. Past experience [6,7] has shown that this training activity is one of the biggest challenges of such a system.
The main issues concern both functionality tests and the scalability of the whole infrastructure, which must maintain the needed performance and robustness up to the expected full job-flow rates and under realistic usage conditions.
Workflow tests were performed both at the Grid level and at the more CMS-specific CRAB level.
3.1. Grid Workflow
WMS testing was aimed at probing the load limits of the service, both from the hardware and from the software point of view. A quasi-automated test suite was set up to steadily submit jobs at an adjustable rate. A few closely controlled instances of the WMS were used and continuously tested, patched, and re-deployed, with tight feedback between testers and developers.
The tests involved the submission of large numbers of jobs to the WLCG production infrastructure, both using simple "hello world" scripts and real experiment applications. Problems encountered were reported to the developers, who provided bug fixes, in an iterative process. Acceptance criteria were defined to assess the compliance of the WMS with the requirements of the CMS and ATLAS experiments and of WLCG operations: uninterrupted submission of at least 10^4 jobs per day for a period of at least five days; no service restart required during this period; no degradation in performance at the end of this period; and a number of stale jobs below 1% of the total at the end of the test.
Figure 2: The Grid WMS is capable of sustaining a rate of about 20 kjobs/day for several days
Figure 3: 5-day non-stop 10 kjobs/day submission to a Grid CE, with 5k jobs always active
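The acceptance criteria above lend themselves to a simple automated check. The following Python sketch is purely illustrative (it is not part of the actual WLCG test suite; the function name and signature are assumptions):

```python
# Illustrative encoding of the WMS acceptance criteria (not the actual
# WLCG test-suite code; names and signature are assumptions).
def meets_acceptance_criteria(jobs_per_day, stale_jobs, total_jobs, restarts,
                              min_rate=10_000, min_days=5,
                              max_stale_fraction=0.01):
    """True iff: >= min_rate jobs/day sustained for >= min_days days,
    no service restarts, and stale jobs below 1% of the total."""
    if len(jobs_per_day) < min_days or restarts > 0:
        return False
    if any(rate < min_rate for rate in jobs_per_day):
        return False
    return stale_jobs < max_stale_fraction * total_jobs

# A run resembling the figures quoted in the next paragraph:
# roughly 16k jobs/day for 7 days, no restarts, no stale jobs.
daily = [16_000] * 7
print(meets_acceptance_criteria(daily, stale_jobs=0,
                                total_jobs=sum(daily), restarts=0))  # → True
```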
A successful test on the gLite 3.1 middleware fully met the acceptance criteria: 115,000 jobs were submitted over 7 days (16,000 jobs/day), with 320 jobs (0.3%) aborted due to WMS problems and a negligible delay between job submission and arrival on the CE. Further tests proved that the gLite WMS is able to sustain a higher rate of 20,000 jobs/day for a week, without degradation in performance and with no stale jobs (see Figure 2).
The performance and reliability of the CE were tested by submitting at a well-specified rate (10k jobs per day) in order to always keep at least 5k jobs active in the CE, according to the criteria defined for the CE acceptance tests. Figure 3 shows the results of a first 5-day non-stop submission test to verify these criteria. About 60k jobs were submitted; only 119 jobs aborted (< 0.2%), and not due to a CE error; no performance degradation was observed, and the CE service was never restarted.
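The submission policy of this test (a fixed daily rate while keeping a target number of jobs active) can be sketched as a small decision function. This is a hedged reconstruction for illustration only, not the actual test-driver code:

```python
# Hedged reconstruction of the CE-test submission policy described above
# (the real tests drove the WLCG infrastructure; this helper is illustrative).
def jobs_to_submit(active_now, target_active, remaining_daily_budget):
    """Number of jobs to submit in this cycle so that the CE keeps at least
    `target_active` jobs active, without exceeding the day's rate budget."""
    deficit = max(0, target_active - active_now)
    return min(deficit, remaining_daily_budget)

print(jobs_to_submit(4_200, 5_000, 3_000))  # → 800
print(jobs_to_submit(5_100, 5_000, 3_000))  # → 0
```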
3.2. CRAB Workflow
Testing the CRAB infrastructure is a more structured task, since the goal is now to probe not only the sustainability of a high job submission rate, but also the reliability of the whole system of services and functionalities.
For this purpose the so-called JobRobot was developed: an automated expert system, based on agents and dropboxes, which simulates a massive, complete user analysis activity, from job creation to submission, monitoring, and output retrieval. The goal is to understand and solve potential race-condition criticalities that only realistic, "chaotic" usage can bring to light.
Very exploratory first rounds of tests were already performed, using an instance of a preliminary version of the CRAB server attached to a dedicated Grid WMS machine. Through the CRAB server, the JobRobot kept spreading collections of CMSSW jobs over about 30 different sites, using realistic requirements, with a growing submission rate and over a growing number of days. An initial submission rate of 10k jobs per day was quite easily sustained for 4 days, showing no overload either on the Grid WMS side or on the CRAB server side. In a more stress-oriented test, a rate of 18k jobs per day was maintained for the first 24 hours and then raised to 30k jobs per day for the succeeding 24 hours. As Figure 4 shows, at the lower rate jobs complete their workflow (green line) at the same rate at which they are submitted (black line).
Figure 4: Starting a stress test of the CRAB server workflow at a rate of 18 kjobs/day
Figure 5: The bottleneck of having only one Grid WMS
The high submission rate was then sustained for even longer (see Figure 5), but the single Grid WMS instance was not able to efficiently handle the overall job flow generated in this way: jobs were dispatched more and more slowly to sites, piling up in WMS queues and bringing a rapid degradation of WMS performance and, as a consequence, of the performance of the whole system (e.g. the lowering of the black curve in Figure 5). The expected message is that increasing the scale further requires additional dedicated WMSs.
On the contrary, the load on the CRAB server machine was very reasonable, proving that a single server can still handle such a high submission rate fairly well. Moreover, these first tests already gave the developers valuable feedback concerning improvements in some CRAB server components. They also show that the server can serve as a further testing instrument for the underlying Grid WMS services, allowing a fine tuning of their configuration parameters in a way much more tailored to CMS-specific use cases.
The next planned steps involve the setup of the forthcoming major release of the CRAB server: it includes important development upgrades (a client-server re-engineering and a refactoring of the Grid interaction framework) and needs an early test of the overall functionalities. A new scale test with the server pointing at more than one Grid WMS is then scheduled.
4. Conclusion
The Grid infrastructure is already working at production level, and everyday CMS activity can no longer do without it. So far, testing and integration activities have represented a central part of the work needed to bring the workload management infrastructure to a quality level which lets users take full advantage of it in a production context.
Challenges nevertheless remain in order to be ready for when the LHC is fully operative. The testing process is still ongoing, and the improvements achieved during these months have already had a big impact on the amount of effort required to run the workflow services in production activities.
REFERENCES
1. LHC Computing Grid project: http://lcg.web.cern.ch/LCG/
2. CMS Computing Technical Design Report,CERN-LHCC-2005-023, June 2005
3. D. Spiga, S. Lacaprara, W. Bacchi, M. Cinquilli, G. Codispoti, M. Corvo, A. Dorigo, F. Fanzago, F. Farina, O. Gutsche, C. Kavka, M. Merlo, L. Servoli, A. Fanfani (2007). CRAB: the CMS distributed analysis tool development and design. Nuclear Physics B - Proceedings Supplements, Hadron Collider Physics Symposium 2007, La Biodola, Isola d'Elba, Italy, 20-26 May 2007, vol. 177-178C, pp. 267-268.
4. A. Fanfani, D. Spiga, S. Lacaprara, W. Bacchi, M. Cinquilli, G. Codispoti, M. Corvo, A. Dorigo, F. Fanzago, F. Farina, M. Merlo, O. Gutsche, L. Servoli, C. Kavka (2007). The CMS Remote Analysis Builder (CRAB). Lecture Notes in Computer Science, High Performance Computing - HiPC 2007, 14th International Conference, Goa, India, 18-21 December 2007, vol. 4873, pp. 580-586.
5. F. Farina, S. Lacaprara, W. Bacchi, M. Cinquilli, G. Codispoti, M. Corvo, A. Dorigo, A. Fanfani, F. Fanzago, O. Gutsche, C. Kavka, M. Merlo, L. Servoli, D. Spiga (2007). Status and evolution of CRAB. PoS Proceedings of Science, XI International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT07), Amsterdam, 23-27 April 2007, vol. ACAT20, pp. ACAT020.
6. A. Sciaba, S. Campana, A. Di Girolamo, E. Lanciotti, N. Magini, P. M. Lorenzo, V. Miccio, R. Santinelli. Testing and integrating the WLCG/EGEE middleware in the LHC computing. International Conference on Computing in High Energy and Nuclear Physics (CHEP07), Victoria BC, Canada, 2-7 September 2007.
7. V. Miccio, S. Campana, A. Sciaba. Experience in testing the gLite workload management system and the CREAM computing element. EGEE'07 International Conference, 1-5 October 2007.
A Rate Based Auction Algorithm for Optimum Resource Allocation using Grouping of Gridlets
G. T. Dhadiwal1, G. P. Bhole1, S. A. Patekar 2
1 Computer Technology Department, VJTI, Mumbai-19, India. 2Vidyalankar Institute of Technology, Mumbai-37, India.
Abstract - The problem of allocating resources to a set of independent subtasks (gridlets) under constraints of time and cost has attracted a great deal of attention in grid environments. This paper proposes criteria for resource allocation based on rate (cost per MIPS) and on grouping of the gridlets. The rate minimizes the cost, and the grouping reduces communication overheads by fully utilizing a resource in one go. A comprehensive balance is thus achieved between cost and time within the framework of the grid economy model. The proposed algorithm is compared with the single-round First-price sealed and Classified algorithms from the literature. The results obtained using the GridSim Toolkit 4.0 demonstrate that the proposed algorithm has merit over them.
Keywords: Rate based algorithm, Auction algorithm, Gridlets, Gridsim.
1 Introduction and Related Work
Allocating independent subtasks (gridlets) in a grid environment to resources which are geographically distributed, heterogeneous, dynamic, and owned by various agencies, and which thus have different costs and capabilities, is one of the key problems addressed by various researchers [1-4].
The Grid Classified algorithm [5] and the First-price sealed algorithm [1, 3], which are based on market mechanisms, focus on single-round auctions. The objective of the latter algorithm is to obtain the smallest makespan (the time required to complete the task), but it neglects the user's grade-of-service demand. The Classified algorithm [5] proposes an optimized scheduling algorithm under the limitations of time and cost. However, it attempts to complete the task as quickly as possible up to the granularity time (the user's expected completion time), and after this lapses it resorts to cost minimization, looking at these two aspects independently.
The proposed algorithm allocates resources based on rate, i.e. the cost-to-MIPS ratio, and identifies a group of gridlets which are submitted as a single bunch, keeping the allocated resource engaged up to the granularity time and thereby reducing communication overhead. Thus the proposed algorithm balances time and cost comprehensively.
2 The Basis of the Proposed Rate Based Algorithm
The Rate based algorithm considers the rate of a resource, i.e. the ratio of the cost of the resource to its MIPS. Resources are sorted in increasing order of rate. From the sorted list, the resource with the least rate that is within the user budget is selected. For the specified granularity time, the million instructions performed by the selected resource are computed. A group of gridlets (assumed to be independent of each other) is then identified such that its computing requirements match the selected resource, and the group is allocated to it. By doing this we do not revisit the allocated resource within the granularity time, thereby substantially reducing the communication overhead. The next resource with the minimum rate from the sorted list is then considered, and the above procedure is repeated. If during this process all resources are exhausted and some gridlets still remain to be processed, we resort to a trade-off by allocating all the remaining gridlets to the resource with the minimum rate.
2.1 Algorithmic Steps
Let n be the total number of gridlets, termed a task; MIi the Million Instructions (MI) of gridleti; GranTime the user-expected time to complete processing of all gridlets, in seconds; overhead the communication time for each allocation of a resource to the gridlets; budget the user's budget in Indian Rupees (INR) for the task; and m the total number of resources. Rj is resource j, RMIPSj is the number of MI processed by resource j in one second, and RCostj is the cost in INR which the resource charges on a per-second basis.
Step 1: Read resource information
  For each resource Rj, j < m
    Get the following data of the available resource: Rj, RMIPSj, RCostj
  End for
Step 2: Selection of resources based on rate and budget
  For each resource Rj
    Rratej = RCostj / RMIPSj
  End for
  Sort Rratej in increasing order
  // Find the required MIPS for the entire task
  For each gridleti
    Req_MIPS = Req_MIPS + MIi
  End for
  While Req_MIPS > 0 and budget > 0
    Let j be the next resource from the sorted list (j = 1, 2, ..., m)
    Req_time = (Req_MIPS / RMIPSj) rounded up to the next integer
    If (Req_time < GranTime and budget > (Req_time * RCostj))
      Select resource j; add j to selected_resource_list
      budget = budget - (Req_time * RCostj)
      Req_MIPS = Req_MIPS - (Req_time * RMIPSj)
    Else if (budget > (GranTime * RCostj))
      Select resource j; add j to selected_resource_list
      budget = budget - (GranTime * RCostj)
      Req_MIPS = Req_MIPS - (GranTime * RMIPSj)
    End if
  End while
At the end of Step 2 a list of selected resources has been formed.
Step 3: Grouping of gridlets
  For each gridleti, i < n
    Gridlet_sent[i] = false
  End for
  While (all gridlets are not sent)
    Let k be the group number (k = 1, 2, ...)
    If Flag_GranTime_over = false
      Select the next resource j from selected_resource_list
    Else  // granularity time over, hence select the least-rate resource
      Select the first resource (j = 1) from selected_resource_list
    End if
    Total_MIj = RMIPSj * GranTime
    MIofGroupk = 0
    For each gridleti, i < n
      If (Gridlet_sent[i] == false and (Total_MIj > MIofGroupk + MIi))
        MIofGroupk = MIofGroupk + MIi
        Add gridleti to group k
        Gridlet_sent[i] = true
      End if
    End for
    Send group k to resource j
    k = k + 1
    j = j + 1
    // When all gridlets are not executed within the granularity time,
    // the remaining gridlets are allocated to the least-rate resource
    If resource list exhausted
      Flag_GranTime_over = true
    End if
  End while
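As an illustration, the steps above can be sketched in Python. This is a simplified version under stated assumptions: the paper's own implementation uses the GridSim toolkit, whereas this sketch forms one group per selected resource, ignores the per-group communication overhead, and does not reproduce the paper's charging model, so it does not reproduce the costs and makespans of Table 2:

```python
import math

def rate_based_allocation(gridlet_mi, resources, gran_time, budget):
    """gridlet_mi: list with the MI of each gridlet.
    resources: list of (name, mips, cost_per_sec) tuples.
    Returns {resource_name: [gridlet indices]} (one group per resource)."""
    # Step 2: sort resources by rate = cost / MIPS, select within budget.
    by_rate = sorted(resources, key=lambda r: r[2] / r[1])
    req_mi = sum(gridlet_mi)
    selected = []
    for name, mips, cost in by_rate:
        if req_mi <= 0 or budget <= 0:
            break
        req_time = math.ceil(req_mi / mips)
        if req_time < gran_time and budget > req_time * cost:
            selected.append((name, mips))
            budget -= req_time * cost
            req_mi -= req_time * mips
        elif budget > gran_time * cost:
            selected.append((name, mips))
            budget -= gran_time * cost
            req_mi -= gran_time * mips
    # Step 3: group gridlets up to each resource's MI capacity within
    # the granularity time; leftovers go to the least-rate resource.
    allocation = {name: [] for name, _ in selected}
    sent = [False] * len(gridlet_mi)
    for name, mips in selected:
        capacity, used = mips * gran_time, 0
        for i, mi in enumerate(gridlet_mi):
            if not sent[i] and used + mi < capacity:
                allocation[name].append(i)
                sent[i] = True
                used += mi
    if selected:  # trade-off: remaining gridlets to the least-rate resource
        for i, done in enumerate(sent):
            if not done:
                allocation[selected[0][0]].append(i)
    return allocation

# The Table 1 resources applied to Task1 (5 gridlets of 200 MI each):
resources = [("R1", 42, 100), ("R2", 180, 200), ("R3", 256, 300),
             ("R4", 225, 250), ("R5", 384, 400), ("R6", 39, 50),
             ("R7", 66, 60), ("R8", 450, 500)]
print(rate_based_allocation([200] * 5, resources, gran_time=5, budget=7000))
# → {'R7': [0], 'R5': [1, 2, 3, 4]}
```

As expected, the sketch favours the resources with the lowest cost-to-MIPS ratio (R7, then R5) and sends the gridlets in groups rather than one at a time.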
3 Illustrative Example - Comparison of the Proposed Algorithm with the Others [1, 5]
The problem is to execute a task, say Task1, in a grid environment using the GridSim toolkit 4.0 [6, 7]. The task consists of 5 gridlets, each requiring 200 MI. The communication overhead is 0.2 sec, the budget is 7000 INR, and the granularity time is 5 sec. Information regarding the resources used is listed in Table 1.
Table 1: Detailed Information of Resources

Name | MIPS | Operating System | Cost (INR)
R1   | 42   | UNIX             | 100
R2   | 180  | LINUX            | 200
R3   | 256  | LINUX            | 300
R4   | 225  | WINDOWS          | 250
R5   | 384  | LINUX            | 400
R6   | 39   | WINDOWS          | 50
R7   | 66   | LINUX            | 60
R8   | 450  | WINDOWS          | 500
Similarly, Task2 to Task8 are considered for execution. The proposed rate based algorithm, the First-price sealed algorithm, and the Classified algorithm are employed to solve the problem. The results are tabulated in Table 2 and graphically represented in Fig 1 and Fig 2. Resources are allocated during the execution of each algorithm. Table 3 shows the final allocation of resources to the gridlets for all three algorithms; the grouping of gridlets produced by the proposed algorithm can also be seen there. For brevity only Task4 is considered.
Table 2: Comparison of the Proposed Algorithm with the First-price and Classified Algorithms

                    Proposed Rate Based      Classified               First-price
Task | Gridlets | Makespan (s) | Cost (INR) | Makespan (s) | Cost (INR) | Makespan (s) | Cost (INR)
1    | 5        | 3.23         | 1440       | 1.09         | 1950       | 3.22         | 1500
2    | 10       | 4.89         | 2240       | 1.96         | 3950       | 6.44         | 5000
3    | 15       | 6.46         | 3480       | 3.92         | 4990       | 9.67         | 7500
4    | 20       | 5.73         | 4740       | 8.83         | 5480       | 12.89        | 10000
5    | 25       | 4.89         | 5990       | 18.51        | 6350       | 16.11        | 12500
6    | 30       | 5.09         | 6990       | 159.85       | 9000       | 19.33        | 15000
7    | 35       | 12.16        | 8710       | 196.49       | 10500      | 22.56        | 17500
8    | 40       | 16.78        | 10040      | 217.13       | 12000      | 25.78        | 20000
Fig 1: Comparison of time required for task completion by all three algorithms
Fig 2: Comparison of cost required for task completion by all three algorithms
Table 3: Allocation of resources for Task4 (gridlet numbers assigned to each resource)

Resource   | R1 | R2    | R3             | R4         | R5   | R6                  | R7 | R8 | R9
First      | -  | -     | -              | -          | -    | -                   | -  | -  | 1, 2, 3, ..., 19, 20
Classified | -  | -     | 1, 3, 5, 8, 10 | 2, 4, 7, 9 | -    | 11, 12, 13, ..., 20 | 6  | -  | -
Rate Based | -  | 16-19 | -              | 1-5, 20    | 7-15 | -                   | 6  | -  | -
Remarks: The following observations are made, with reasoning, on the results obtained in Table 2.
1) Irrespective of the magnitude of the task, the cost required by the proposed algorithm is the minimum, because rate based selection of resources is employed.
2) As the magnitude of the task increases (see after Task 3 in Table 2), the time required to complete the task reduces. This is due to the reduced communication overhead resulting from the grouping of gridlets.
The superiority of the proposed method is thus apparent.
4 Conclusion
The paper proposes an algorithm for resource allocation in a grid environment with a single-round auction. The proposed algorithm considers the ratio of cost to MIPS, i.e. the rate of the resource, in its allocation, and incorporates appropriate grouping of gridlets to minimize the transmission time of each gridlet to the resource, thus balancing time and cost comprehensively. The rate based algorithm is compared with the First-price and Classified algorithms from the literature using the GridSim Toolkit 4.0, and the results demonstrate that the proposed algorithm has merit.
5 References
[1] Daniel Grosu and Anubhav Das. Auction-Based Resource Allocation Protocols in Grids. In Proc. of the 16th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS 2004), November 9-11, 2004, MIT, Cambridge, Massachusetts, USA, pp. 20-27.
[2] Mathias Dalheimer, Franz-Josef Pfreundt and Peter Merz. Agent-based Grid Scheduling with Calana. In Parallel Processing and Applied Mathematics (PPAM 2005), vol. 3911, Lecture Notes in Computer Science, Springer, pp. 741-750, 2005.
[3] Marcos Dias de Assunção and Rajkumar Buyya. An Evaluation of Communication Demand of Auction Protocols in Grid Environments. Proceedings of the 3rd International Workshop on Grid Economics & Business (GECON 2006), World Scientific Press, May 16, 2006, Singapore.
[4] Nithiapidary Muthuvelu, Junyang Liu and Nay Lin Soe. Task Scheduling for Coarse-Grained Grid Applications. Grid Computing and Distributed Systems Laboratory, Department of Computer Science and Software Engineering, The University of Melbourne, Australia.
[5] Li Kenli, Tang Xiaoyong and Zhaohuan. Grid Classified Optimization Scheduling Algorithm under the Limitation of Cost and Time. In IEEE Proceedings of the Second International Conference on Embedded Software and Systems (ICESS'05), December 2005, pp. 496-500.
[6] URL of the GridSim simulator: http://sourceforge.net/projects/gridsim
[7] Anthony Sulistio, Uros Cibej, Srikumar Venugopal, Borut Robic and Rajkumar Buyya. A Toolkit for Modeling and Simulating Data Grids: An Extension to GridSim. Concurrency and Computation: Practice and Experience (CCPE), Wiley Press, New York, USA (in press, accepted on Dec. 3, 2007).
Appendix
The GridSim toolkit 4.0 is used. The First-price sealed, Classified, and proposed rate based algorithms are implemented using the flow chart given in Fig 3.
Fig 3: Process flow used in the GridSim simulation for implementing the algorithms
A Coordination Framework for Sharing Grid Resources
D. R. Aremu1, M. O. Adigun2
1Department of Computer Science, University of Zululand, KwaDlangezwa, KwaZulu-Natal, South Africa
ABSTRACT: Distributed market-based models are frequently used for resource allocation on the computational grid. But as the grid size grows, it becomes more difficult for a customer to negotiate directly with all the grid resource providers. Middle agents are introduced to mediate between providers and customers so as to facilitate an effective resource allocation process. This paper presents a new market-based model, called the Cooperative Modeler, for mediating resource negotiation. By pooling resources together in a cooperative manner, resource providers can combine sales returns and operating expenses and distribute sales efficiently among members in proportion to the volume of their contribution over a specified time. The paper discusses the framework designed for implementing the Cooperative Modeler.
1. Introduction
Utility Grid computing is emerging as a new paradigm for solving large-scale problems in science, engineering, and commerce [1][2][3]. Grid technologies enable the creation of Virtual Enterprises (VE) for resource sharing by logically coupling millions of geographically distributed resources across multiple organizations, administrative domains, and policies. The Grid comprises heterogeneous resources (PCs, workstations, clusters, and supercomputers), fabric management systems (single-system-image OS, queuing systems, etc.) and policies, and applications (scientific, engineering, and commercial) with varied CPU, I/O, memory, and/or network-intensive requirements. The users, the producers (also called resource owners), and the consumers have different goals, objectives, strategies, and demand patterns. More importantly, both resources and end-users are geographically distributed across different time zones [3]. This leads to a complex control and coordination problem, which is frequently solved with an economic model based on real or virtual currency. The nature of economic models, which imply self-interested participants with some level of autonomy, makes an agent-based approach the preferred choice for this study. This paper presents a new middleware agent, called the Cooperative Modeler, for mediating resource negotiation in grid computing environments, and discusses the architecture designed for its implementation. The presented model, enabled by the Utility Grid computing infrastructure, allows providers of resources to come together in a cooperative relationship to do business.
A cooperative is an autonomous association of agents united voluntarily to meet their common economic, social, and cultural needs and aspirations through a jointly owned and democratically controlled enterprise. A cooperative business is a business owned and controlled by the people who use its services. By pooling resources together in a cooperative manner, providers of resources combine sales returns and operating expenses, and pro-rate or distribute sales among members in proportion to the volume each provides through the cooperative over a specified time. A cooperative may operate a single resource pool or multiple resource pools. In a pool operation, members bear the risks and gains of changes in market prices. The advantages of pooling are: (i) it spreads market risks; (ii) it permits management to merchandise resources according to a program it deems most desirable and that can be planned with considerable precision in advance; (iii) it permits management to use caution in placing and timing shipments to market demands and in developing new markets (i.e., orderly marketing); and (iv) it helps finance the cooperative. The rest of the paper is organized as follows: Section 2 discusses related work; Section 3 presents the Cooperative Modeler and the architecture designed for its implementation; Section 4 presents the software specification, requirements, and design used for implementing the Cooperative Modeler; and Section 5 concludes the paper.
2.0 Related Work
Economics-based models have been used for coordinating resource allocation in grid computing environments. The economic approach provides a fair basis for successfully coordinating the decentralization and heterogeneity present in grid economies. The models investigated are classified as auction-based, bilateral negotiation, and other negotiation models.
2.1 Auction Models
Auctions are highly structured forms of negotiation between several agents. An auction is a market institution with an explicit set of rules determining resource allocation and prices on the basis of bids from the market participants. Auctions enable the sale of resources without a fixed price or a standard value. The typical purpose of an auction is for the seller to obtain a price which lies as close as possible to the highest valuation among potential buyers. The auction model supports one-to-many negotiation between a grid resource provider (seller) and many grid resource consumers (buyers), and reduces negotiation to a single value (i.e. price). The three key players involved in an auction are the grid resource owners, the auctioneer (mediator), and the buyers. In a Grid environment, providers can use an auction protocol for deciding service value/price. The steps involved in the auction process are: (i) a Grid Service Provider announces its services and invites bids; (ii) brokers offer their bids, and they can see what other consumers offer depending on whether the auction protocol is open or closed; (iii) the broker and the Grid Service Provider communicate privately and use the resource (R). The contents of the deal template used for work announcements in an auction include the addresses of the users, the eligibility requirements specification, the task/service abstraction, an optional price the user is willing to invest, the bid specification (what the offer should contain), and the expiration time (the deadline for receiving bids). From a Grid Service Provider's perspective, the process in an auction is: (i) receive tender announcements/advertisements (in the Grid Market Directory); (ii) evaluate the service capability; (iii) respond with a bid; (iv) deliver the service if the bid is accepted; (v) report the result and bill the user as per the usage and agreed bid.
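As an illustration, the core of a single-round first-price sealed-bid auction, the simplest instance of the auction model described above, can be sketched in a few lines of Python (broker names and bid values are hypothetical):

```python
# Minimal sketch of a single-round first-price sealed-bid auction:
# the auctioneer collects sealed bids and awards the resource to the
# highest bidder at its own bid price. Names and values are illustrative.
def first_price_sealed_bid(bids):
    """bids: {broker_name: price}. Returns (winner, price_paid)."""
    winner = max(bids, key=bids.get)
    return winner, bids[winner]

print(first_price_sealed_bid({"brokerA": 120, "brokerB": 150, "brokerC": 90}))
# → ('brokerB', 150)
```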
2.2 Bilateral Negotiation Model
Unlike the auction models, which support one-to-many negotiation between a grid service provider (seller) and many grid service consumers (buyers) and reduce negotiation to a single issue (i.e. price), we also investigated the bilateral negotiation model, which involves a two-party, multiple-issue value scoring model as discussed in [4]. In a two-party negotiation sequence, called a negotiation thread, offers and counter-offers are generated by linear combinations of simple functions called tactics. Tactics generate an offer or counter-offer considering multiple criteria such as price, quantity, quality of service, delivery time, etc. To achieve flexibility in negotiation, agents may wish to change their ratings of the importance of the different criteria, and their tactics may vary over time. Strategy is the term used to denote the way in which an agent changes the weights of tactics over time, combining tactics depending on the negotiation history.
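The linear combination of tactics described above can be sketched as follows. The tactic functions and weights here are toy examples for illustration, not the actual tactics of [4]:

```python
def combined_offer(tactics, weights, t):
    """Offer at negotiation time t as a linear combination of tactic
    functions; a strategy would adjust `weights` as negotiation proceeds."""
    return sum(w * tactic(t) for tactic, w in zip(tactics, weights))

# Two toy price tactics: one concedes linearly over time, one stays firm.
linear_concession = lambda t: 100 - 5 * t
firm = lambda t: 100
print(combined_offer([linear_concession, firm], [0.6, 0.4], t=4))
```

Shifting weight from `firm` to `linear_concession` over successive rounds is one simple way a strategy could make the agent more conciliatory as a deadline approaches.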
2.3 Tender/Contract-Net Model
The Tender/Contract-Net model is one of the most widely used models for service negotiation in distributed problem-solving environments [5]. It is modeled on the contracting mechanism used by businesses to govern the exchange of goods and services, and it helps in finding an appropriate service provider for a given task. A user/resource broker asking for a task to be solved is called the manager, and a resource that might be able to solve the task is called a potential contractor. From a manager's perspective, the process in the Tender/Contract-Net model is: (i) the consumer (broker) announces its requirement (using a deal template) and invites bids from Grid Service Providers; (ii) interested Grid Service Providers evaluate the announcement and respond by submitting their bids; (iii) the broker evaluates the bids and awards the contract to the most appropriate Grid Service Provider(s); (iv) step (ii) continues until no one is willing to bid a higher price, or the auctioneer stops if the minimum price is not met; (v) the Grid Service Provider offers the service to the winner; (vi) the consumer uses the resource.
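The manager-side steps above can be sketched as follows. The provider names, the bid functions, and the lowest-price award rule are illustrative assumptions; a real Contract-Net award may use richer criteria than price alone:

```python
# Illustrative sketch of a Tender/Contract-Net exchange: the manager
# announces a task, contractors bid (or decline), and the contract is
# awarded to the most appropriate bidder (here, simply the cheapest).
def contract_net(announcement, contractors):
    """contractors: {name: bid_fn}; each bid_fn returns a price, or None
    if the provider is not interested in the announced task."""
    bids = {name: fn(announcement) for name, fn in contractors.items()}
    bids = {name: p for name, p in bids.items() if p is not None}
    if not bids:
        return None  # no provider willing to bid
    return min(bids, key=bids.get)  # award to the lowest-price bid

providers = {
    "siteA": lambda task: 30 if task["cpu"] <= 8 else None,
    "siteB": lambda task: 25 if task["cpu"] <= 4 else None,
}
print(contract_net({"cpu": 8}, providers))  # → 'siteA'
```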
2.4 Bid-Based Proportional Resource Sharing
Market-based proportional sharing systems are quite popular in cooperative problem-solving environments such as clusters (in a single administrative domain). In this model, the percentage of the resource share allocated to a user application is proportional to that user's bid value in comparison to other users' bids. Users are allocated credits or tokens, which they can use to access resources. The value of each credit depends on the resource demand and on the value that other users place on the resource at the time of usage. Consider two users wishing to access a resource with similar requirements: if the first user is willing to spend 2 tokens and the second user is willing to spend 4 tokens, the first user gets one third of the resource share whereas the second user gets twice as much as the first (i.e. two thirds of the resource share), which is proportional to the value each user places on the resource for executing their applications. This strategy is a good way of managing a large shared resource in an organization; a resource owned by multiple individuals can have a credit allocation mechanism depending on the investment each made, and owners can specify how much credit they are willing to offer for running their applications on the resource.
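The two-user example above follows directly from the proportional-share rule, which can be stated in a few lines (an illustrative sketch; the function name is an assumption):

```python
# Bid-proportional sharing: each user's share of the resource equals
# its bid divided by the sum of all bids. Fractions keep shares exact.
from fractions import Fraction

def proportional_shares(bids):
    """bids: {user: tokens}. Returns {user: exact fractional share}."""
    total = sum(bids.values())
    return {user: Fraction(bid, total) for user, bid in bids.items()}

# The example from the text: 2 tokens vs. 4 tokens.
shares = proportional_shares({"user1": 2, "user2": 4})
print(shares["user1"], shares["user2"])  # → 1/3 2/3
```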
2.5 Monopoly/Oligopoly
Unlike the previously discussed auction models, which assume a competitive market where several Grid Service Providers and brokers/consumers determine the market price, there exist cases where a single Grid Service Provider dominates the market and is therefore the single provider of a particular service. In economic theory this model is known as a monopoly. Users cannot influence the prices of services and have to accept the service at the price given by the single Grid Service Provider who monopolizes the Grid marketplace. An example is where a single site puts its prices into the Grid Market Directory or information services and brokers consult it without any possibility of negotiating prices. A monopoly's offer of resources is usually decoupled from the price at which it acquired the resource. The classical problem of a monopoly is that it sets a price higher than marginal cost, which distorts the trade-off in the grid economy and moves it away from Pareto efficiency. The fact that a monopoly does not face the discipline of competition means that it may operate inefficiently without being corrected by the grid marketplace. Competitive markets are one extreme and monopolies are the other. In most cases, the market situation is an oligopoly, which lies between these two extremes: a small number of Grid Service Providers dominate the market and set the prices.
3.0 Proposed Cooperative Middleware for Resource Negotiation
This section discusses the Cooperative Modeler and the
architecture designed for its implementation. The Cooperative
Modeler adopts the concept of utility Grid computing, under which
users can request resources whenever needed (i.e. on demand) and
are charged only for the amount used. Individual members of a
cooperative group need not own or purchase expensive grid
resources for a specific project but can instead choose to "rent"
or share them among trusted parties, the members of the
cooperative. The Cooperative Modeler is a dynamic alliance of
autonomous resource providers, distributed across organizations
and administrative domains, who bring in complementary
competencies and resources that are collectively available to
each other through a virtualized pool called the Cooperative
Market, with the objective of delivering products or services to
the market as a collective effort. The Cooperative Modeler adopts
a bilateral negotiation model to enable consumers of services to
negotiate for services in real time.
3.1 Architecture of the Cooperative Modeler
The architecture of the Cooperative Modeler is made up of three
components: the Client component, the Cooperative Middleware
component, and the Provider component. The architecture supports
a business situation involving three stakeholders with three major
business roles: (i) End-user, played by a stakeholder (client) who
consumes services; (ii) Mediator, the key role, played by the
Cooperative Middleware agents; and (iii) Service Provider, played
by a service owner who offers his services to the end user (the
client). The Client component is made up of a set of n clients,
each of which has at least one task, of varying length and
resource requirements, to execute. The Provider component is
composed of a set of m providers, who form the cooperative group.
The dynamic nature of this model makes it possible for a provider
of resources to simultaneously be a client requesting resources.
The Cooperative Middleware component is made up of five
interacting agents charged with coordinating resource-sharing
negotiation at the grid resource pool: the Liaising-Agent, the
Information-Agent, the Marketing-Agent, the Resource-Control-
Agent, and the Execution-Agent. Table I describes these agents.
Table I: Description of the agents at the Cooperative resource pool

Client-Agent: Acts on behalf of the clients (the end users) to
negotiate for resources.

Liaising-Agent: Acts as the controller ("manager of managers") of
the other agents at the resource pool. Requests for resources are
always directed to it; it liaises between the clients, the
providers, and the agents at the resource pool.

Information-Agent: The manager in charge of the resource pool. It
maintains a knowledge base of the resources at the pool and
interacts closely with the pool and the Marketing-Agent.

Marketing-Agent: The expert (manager) in charge of resource
negotiation. It is equipped with negotiation tactics so as to
optimize the objectives of the resource providers.

Resource-Control-Agent: Responsible for sales recording and
documentation, and for the issuance of resource ids. It is the
manager (accountant) in charge of record keeping at the resource
pool.

Execution-Agent: Responsible for the execution of the clients'
tasks.
4. Implementation Design for the Proposed Cooperative Modeler
Table II specifies the negotiation protocol between the
Marketing-Agent and the Client-Agent of the Cooperative Modeler.
In this protocol the Client-Agent plays the role of a buyer of
resources, while the Marketing-Agent plays the role of a seller.
The communication between the two agents consists of an exchange
of messages: buyers query sellers for resources that meet their
task requirements and for the prices of the available resources,
followed by an exchange of proposals ending in a proposal being
accepted or countered. If the two agents agree on a deal, a
sale-accepted message must be reported, followed by a
sale-confirmed message. The execution protocol (Table III)
likewise consists of an exchange of requests and responses between
the Client and the Execution-Agent over task execution, with the
Execution-Agent acting as task executor. The protocol allows the
Client to request the status of the task during the execution
period; the end of task execution is reported immediately after
execution finishes.
Table III: Specification of the Execution Protocol

Roles:
    Client as Client, Execution-Agent as Executor
Messages:
    ExecutionRequest   (Client → Execution-Agent)
    ExecutionAccept    (Client ← Execution-Agent)
    ExecutionReject    (Client ← Execution-Agent)
    ExecutionFinished  (Client ← Execution-Agent)
    ExecutionQuery     (Client → Execution-Agent)
    ExecutionStatus    (Client ← Execution-Agent)
Contract:
    Execution must be started immediately if the resource id is
    valid; the end of execution must be reported immediately.

Table II: Specification of the Negotiation Protocol between the
Client-Agent and the Marketing-Agent

Roles:
    Client-Agent as Buyer, Marketing-Agent as Seller
Messages:
    ResourceQuery        (Client-Agent → Marketing-Agent)
    PriceQuery           (Client-Agent → Marketing-Agent)
    PriceOffer           (Client-Agent ← Marketing-Agent)
    NoOffer              (Client-Agent ← Marketing-Agent)
    SaleAccept           (Client-Agent → Marketing-Agent)
    CounterOffer         (Client-Agent → Marketing-Agent)
    TerminateNegotiation (Client-Agent → Marketing-Agent)
    SaleConfirm          (Client-Agent → Marketing-Agent)
Contract:
    SaleAccept must be followed by a sale confirmation.
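The negotiation exchange of Table II can be sketched as follows. This is a minimal sketch, not the authors' implementation: the message names follow the protocol, but the class interface, the floor-price concession rule, and the example prices are our own assumptions.

```python
class SimpleMarketingAgent:
    """Toy seller with a floor price below which it will not concede."""
    def __init__(self, prices, floors):
        self.prices, self.floors = prices, floors
        self.sold = None

    def price_query(self, resource):          # PriceQuery -> PriceOffer / NoOffer
        return self.prices.get(resource)

    def counter_offer(self, resource, buyer_budget):  # CounterOffer
        if buyer_budget >= self.floors.get(resource, 0.0):
            self.prices[resource] = buyer_budget      # concede to the budget
        return self.prices[resource]

    def sale_accept(self, resource, price):   # SaleAccept
        self.sold = (resource, price)

    def sale_confirm(self, resource):         # SaleConfirm (record keeping elided)
        pass

def negotiate(seller, resource, budget, max_rounds=3):
    """Buyer side: query the price, then counter-offer until a deal or timeout."""
    price = seller.price_query(resource)
    if price is None:
        return None                           # NoOffer: nothing to buy
    for _ in range(max_rounds):
        if price <= budget:
            seller.sale_accept(resource, price)
            seller.sale_confirm(resource)
            return price
        price = seller.counter_offer(resource, budget)
    return None                               # TerminateNegotiation

seller = SimpleMarketingAgent({"cpu-hour": 12.0}, {"cpu-hour": 8.0})
deal = negotiate(seller, "cpu-hour", budget=10.0)
```

Here the seller's asking price of 12 exceeds the buyer's budget of 10, so one CounterOffer round concedes to 10.0 (still above the floor of 8.0) and the sale is accepted and confirmed; a budget below the floor would end in no deal.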
Figure 1: The class diagram describing the interaction pattern for
the implementation of the Cooperative Modeler. [The diagram shows
the Client and the agent classes, with String-typed operations:
Client-Agent (resourceQuery, negotiate), Liaising-Agent
(coordinateRequest), Marketing-Agent (process, negotiate),
Information-Agent (process, matchTask1WithResource ...
matchTaskNWithResource), Resource-Control-Agent (documentSale,
issueResourceID), and Execution-Agent (executeRequest,
executeStatus).]
5.0 Conclusion
In this paper, a survey of economic models for negotiating grid
resources was carried out. The paper discussed the architecture of
a new middleware, called the Cooperative Modeler, created to
mediate resource negotiation between providers and consumers of
resources, and presented the design framework for its
implementation. The presented model, enabled by a utility Grid
computing infrastructure, allows providers of grid resources to
collaborate by coming together in a cooperative relationship and
to contribute their core competencies and share resources such as
information, knowledge, and market access in order to exploit
fast-changing market opportunities.
A Scheduling Algorithm Using Static Information of Grid
Resources
Oh-Han Kang1, Sang-Seong Kang2
1Dept. of Computer Education, Andong National University, Andong, Kyungbuk, Korea
2Dept. of Educational Technology, Andong National University, Andong, Kyungbuk, Korea
Abstract - In this paper, we propose a new algorithm that revises
the logic of the WQR (Workqueue Replication) algorithm to reflect
static information, and we show through simulation that it
provides better performance than the existing method.
Keywords: Grid, Scheduling, Static information,
Workqueue replication
1 Introduction
Scheduling algorithms for grid systems can be classified into a
number of groups according to the characteristics of resources and
tasks, the scheduling time, and the intended goal. We focus our
study on algorithms that aim to minimize the completion time of
mutually independent tasks scheduled in batch mode. With this
goal, related work is reviewed below, grouped by the type of
information the algorithms use.
1.1 Algorithms using information for performance evaluation
Studies on algorithms that utilize information about the length of
assigned tasks, the preparation time of resources, and processing
capacity date back to conventional parallel and distributed
environments. Typical examples are the Min-Min and Max-Min
algorithms. From the unassigned tasks, the Min(Max)-Min algorithm
selects the one with the minimum (maximum) completion time and
assigns it to the resource that is expected to yield that minimum
completion time.
Since the Min(Max)-Min algorithm is simple to implement,
applications of it are easily found in other settings. He [6]
suggested a modification of the Min-Min algorithm to complete
tasks that require QoS in grid computing in the shortest amount of
time. Furthermore, Wu [7] separated the sorted tasks into segments
and applied the Min-Min algorithm to each segment.
In contrast to Min(Max)-Min, algorithms developed specifically
for grid systems were introduced by Buyya [1] and Muthuvelu [2].
Buyya suggested an algorithmic model that searches for the optimal
combination of resources within a budget, since the financial cost
of resources is important in a grid system. Because his model
includes cost optimization, time optimization, and cost-time
optimization, an appropriate algorithm can be chosen for the
particular priorities of the tasks and the specific budget of the
grid user. Moreover, Muthuvelu proposed a strategy in which tasks
composed of a number of small fragments are grouped together and
dispatched to resources in batches of a specific size.
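The Min-Min heuristic described above can be sketched compactly: repeatedly pick the unassigned task whose best achievable completion time is smallest, and assign it to the resource that gives that time. The task lengths and resource speeds below are illustrative, not taken from any cited experiment.

```python
def min_min(task_lengths, speeds):
    """Min-Min heuristic; speeds are instructions processed per second.
    Returns per-resource task schedules and the resulting makespan."""
    ready = [0.0] * len(speeds)            # time at which each resource frees up
    schedule = [[] for _ in speeds]
    unassigned = list(range(len(task_lengths)))
    while unassigned:
        best = None                        # (completion_time, task, resource)
        for t in unassigned:
            for r, s in enumerate(speeds):
                ct = ready[r] + task_lengths[t] / s
                if best is None or ct < best[0]:
                    best = (ct, t, r)
        ct, t, r = best                    # commit the globally best pairing
        ready[r] = ct
        schedule[r].append(t)
        unassigned.remove(t)
    return schedule, max(ready)

# Three tasks on a fast (300) and a slow (100) resource: all three end up on
# the fast resource, since even queued there they finish sooner.
schedule, makespan = min_min([1200, 600, 300], [300, 100])
```

Note how the heuristic can leave slow resources idle entirely, which is exactly the behavior the load-sensitive results later in the paper probe.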
1.2 Algorithms not using information for performance evaluation
Evaluating the measured capacity of a resource and the exact
length of a task to be allocated, taking the current load into
consideration, is not an easy problem. In particular, because of
the characteristics of a grid system, it is difficult both to
maintain capacity and status information (including load) for each
resource in real time and to predict completion times from it, so
the results of such evaluations are often ineffectual. Even if
perfectly precise evaluation were possible, the relatively long
processing time of tasks in a grid system leads to continual
change in the load of resources, so the scope of its application
remains limited. For these reasons, grid scheduling algorithms
that do not use information for performance evaluation have been
suggested [3, 4].
The simplest such algorithm is Workqueue. Workqueue allocates
tasks to resources one by one, in order, and immediately assigns
another task to each resource that returns a result. Because more
tasks are thereby allocated to fast resources and fewer to slow
ones, this algorithm keeps the task completion time low.
Subramani [4] allocated each task to a number of resources
repeatedly and canceled the processing of the remaining replicas
when the task completed on one of the resources. In this case,
because a task waiting queue is located at each resource, the
utilization of resources is enhanced and the total completion time
is decreased as well.
The two algorithms mentioned above are simple to implement.
However, because some resources may be excessively slow or
unavailable for various reasons, extremely long total completion
times can arise frequently. To solve this problem, the WQR
(Workqueue Replication) [3] algorithm was suggested. WQR is
similar to Workqueue in its basic distribution method, but it does
not stop after distributing each task once: until all tasks are
completed, it repeatedly re-distributes incomplete tasks, up to a
certain replication limit. It can therefore remain stable under
excessive load or faults at certain resources. However, when many
users employ the same scheduling strategy on the same grid system,
overhead can arise from running the same replicated task on a
number of resources.
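Plain Workqueue, the base that WQR extends, can be sketched as a small event simulation: tasks are handed out in arrival order, and each resource receives a new task as soon as it returns a result, so faster resources naturally process more tasks. The lengths and speeds below are illustrative assumptions; WQR would additionally re-dispatch unfinished tasks once the queue empties.

```python
import heapq

def workqueue(task_lengths, speeds):
    """Simulate Workqueue; returns tasks run per resource and the makespan."""
    tasks = list(range(len(task_lengths)))
    # Priority queue of (time_free, resource); all resources free at t = 0.
    events = [(0.0, r) for r in range(len(speeds))]
    heapq.heapify(events)
    counts = [0] * len(speeds)
    makespan = 0.0
    while tasks:
        now, r = heapq.heappop(events)     # next resource to return a result
        t = tasks.pop(0)                   # hand it the next waiting task
        done = now + task_lengths[t] / speeds[r]
        counts[r] += 1
        makespan = max(makespan, done)
        heapq.heappush(events, (done, r))
    return counts, makespan

# Eight equal tasks on a fast (300) and a slow (100) resource: the resource
# that is three times faster ends up running three times as many tasks.
counts, makespan = workqueue([300] * 8, [300, 100])
```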
2 Simulation environment
WGridSP [5], a tool for the performance evaluation and comparison
of algorithms that employs the Java-based simulation tool GridSim
as its engine, constitutes the grid by selecting the features of
each resource randomly from bounded distributions.
The maximum load of a resource can be designated by the user; at
the minimum load of 0%, the full processing capacity can be
utilized. The load can also be varied over time by a weighting
factor applied in each time period.
For the simulation, the number of resources is set to 50 and the
minimum load of each resource to 0%, while the maximum load is
varied among 10%, 30%, 50%, 70%, and 90%. The reason for comparing
the performance of algorithms according to load is that load is
considered the most critical factor in turnaround time in a grid
system. When the load of a resource remains excessively high, the
resource is either being used by its local system or has a
considerable number of tasks concentrated on it; sometimes it can
be viewed as temporarily unusable. The load of each resource
varies between its minimum and maximum, with the maximum load
multiplied by the time-dependent weighting factor to obtain the
current load. To prevent every resource from holding the same load
simultaneously, the GMT-based time period of each resource is
randomly assigned from the uniform distribution [0, 23].
To analyze the efficacy of the information used in each algorithm,
two types of simulation are performed. The first simulation fixes
the CPU processing capacity to 300 and the number of CPUs to 1.
The number expressing CPU processing capacity is the number of
instructions processed per second; for example, a task of length
1200 can be processed in 4 seconds by a resource with one CPU of
capacity 300. The second simulation selects the CPU processing
capacity and the number of CPUs arbitrarily, so that each resource
has a different capacity and CPU count: the processing capacity is
selected randomly from the uniform distribution [100, 500], and
the number of CPUs from the uniform distribution [1, 8].
The application to be processed consists of 200 tasks in total,
with task lengths chosen arbitrarily from the uniform distribution
[1,000,000, 5,000,000]. As a result, when a task is assigned to a
resource with 0% load, its processing time ranges from 2,000
seconds to 5,000 seconds. To make the communication time needed to
distribute a task to a resource and return the result negligible,
the input and output data of each task are set to 0. Every
simulation is performed 10 times under the same conditions, and
the arithmetic mean is used as the total makespan.
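The experimental setup above can be reproduced roughly as follows. The random draws mirror the stated uniform distributions; the variable names, dictionary structure, and fixed seed are our own assumptions, not taken from WGridSP.

```python
import random

random.seed(42)  # fixed seed so a single run is reproducible

NUM_RESOURCES, NUM_TASKS = 50, 200

# Second-simulation setting: capacity and CPU count vary per resource.
resources = [
    {"capacity": random.uniform(100, 500),   # instructions/second per CPU
     "cpus": random.randint(1, 8),           # uniform over [1, 8], inclusive
     "tz_offset": random.randint(0, 23)}     # GMT-based load time period
    for _ in range(NUM_RESOURCES)
]

# Task lengths drawn uniformly from [1,000,000, 5,000,000] instructions.
tasks = [random.uniform(1_000_000, 5_000_000) for _ in range(NUM_TASKS)]
```

For the first simulation setting, every resource would instead carry `capacity = 300` and `cpus = 1`.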
3 Suggesting a new algorithm
3.1 Points to improve in existing algorithms
From the type of information and the distribution of resources
used in grid scheduling, the following points to improve in
existing algorithms are deduced.
Avoidance of excessively loaded or incapable resources
From the performance evaluation of the four algorithms, the
increase in task completion time when the maximum load exceeds 70%
is smallest for WQR. This advantage is mostly derived from its
task duplication strategy. A scheduling algorithm for a grid
system should include a strategy to evade resources that are of
extremely low capacity or in an unusable state.
The exclusion of dynamic information
The Min-Min algorithm, which utilizes real-time dynamic
information, showed the worst performance on the resource set
where the static information was identical across resources. This
defect is attributed to the time lag that prevents it from coping
with changes in resource utilization. In other words, dynamic
information, i.e. real-time information about resources, appears
to be of no help in task scheduling for a grid system. Moreover,
when real-time information is refreshed at each task distribution,
the overhead further decreases the performance of the algorithm.
Therefore, unless there is a mechanism that can constantly monitor
dynamic information and cancel or replace running tasks
accordingly, the use of dynamic information should be restrained.
The application of static information
Although the WQR algorithm showed good performance in a grid whose
resources shared the same static information, it performed worse
in the multiprocessor setting when the maximum load was below 70%.
Ultimately, the inability to detect static information such as the
number of processors and the processing capacity of resources
leads to inefficient use of resources. Since static information is
unlikely to change over time, actively applying it in a scheduling
algorithm can help reduce the completion time.
3.2 New algorithm
In this paper, we have developed WQRuSI (Workqueue Replication
using Static Information), which incorporates the three
improvements above: the avoidance of excessively loaded or
incapable resources, the exclusion of dynamic information such as
real-time load information, and the application of unchanging
static information.
To account for these factors, the resources are first sorted in
order of CPU processing capacity, the basic static information.
When tasks are distributed, each resource receives as many tasks
as the number of CPUs it holds. Because the basic allocation
strategy repeatedly assigns the same task to two or more
resources, the WQR mechanism that guards against faulty or slow
resources is retained. The detailed logic of the WQRuSI algorithm
is described in [Figure 1].
Sort the available resources in descending order of processor
    capacity;
Duplicate every task MaxReplication times;
Save the duplicated tasks to TaskManager;
for i := 1 to ResourceList.size do
    for j := 1 to ResourceList[i].PEList.size do
        Take out a task from TaskManager such that the task is not
            already allocated to the same resource;
        Allocate the selected task to ResourceList[i];
        if TaskManager is empty then break out of both loops;
    end for
end for
while TaskManager is not empty do
    Wait for a completed task from the resources;
    if the completed task is still running on other resources then
        Cancel those replicas;
        Take out as many tasks from TaskManager as were canceled,
            such that a task is not already allocated to the same
            resource;
        Allocate the selected tasks to the resources whose replicas
            were canceled;
    end if
    Take out a task from TaskManager such that the task is not
        already allocated to the same resource;
    Allocate the selected task to the resource that just returned a
        result;
end while
Wait for the de-allocation of tasks;
[Figure 1] WQRuSI Algorithm using Static Information of
Grid Resources
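The initial allocation phase of the algorithm in the figure can be rendered as runnable Python. This is a simplified, assumption-laden sketch: the resource dictionaries, field names, and the two-resource example are our own, and the completion-driven while loop (cancellation and re-dispatch) is elided.

```python
from collections import deque

def initial_allocation(resources, num_tasks, max_replication=2):
    """resources: list of dicts with 'name', 'capacity', 'cpus'.
    Returns {resource name: [task ids]} for the first dispatch round."""
    # 1. Sort resources in descending order of processor capacity.
    ordered = sorted(resources, key=lambda r: r["capacity"], reverse=True)
    # 2. Duplicate every task MaxReplication times into the task manager.
    task_manager = deque(t for t in range(num_tasks)
                         for _ in range(max_replication))
    allocation = {r["name"]: [] for r in resources}
    for r in ordered:
        for _ in range(r["cpus"]):          # one task slot per CPU
            # 3. Pick the next replica not already on this resource.
            for _ in range(len(task_manager)):
                t = task_manager.popleft()
                if t not in allocation[r["name"]]:
                    allocation[r["name"]].append(t)
                    break
                task_manager.append(t)      # skip: replica already here
            if not task_manager:
                return allocation
    return allocation                       # leftovers wait for completions

resources = [{"name": "slow", "capacity": 100, "cpus": 1},
             {"name": "fast", "capacity": 400, "cpus": 2}]
alloc = initial_allocation(resources, num_tasks=3, max_replication=2)
```

With three tasks replicated twice, the two-CPU fast resource is served first and receives two distinct tasks, the slow resource one; the remaining replicas stay in the task manager for dispatch as results come back.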
3.3 Performance of the Algorithm
To evaluate the suggested algorithm, WQRuSI was simulated together
with WQR and Min-Min, the algorithms that performed well in the
two earlier situations. The simulation environment was set up in
the same way as for the preceding analysis. The results are shown
in [Figure 2] and [Figure 3].
[Figure 2] Performance of WQRuSI - Consistent Static Information
[Figure 3] Performance of WQRuSI - Variable Static Information
While WQRuSI showed performance similar to WQR in the setting with
consistent static information, it showed the best performance in
the environment where the static information of the resources was
selected randomly. When the static information is identical
everywhere, accounting for it does not bring much improvement;
however, in the environment where the number of processors and the
processing capacity vary, the improvement in performance was
clear.
4 Conclusion
In this paper, we proposed the WQRuSI algorithm. Simulated in the
same environment as the preceding experiments, WQRuSI showed
performance similar to WQR when the static information of the
resources was fixed. In contrast, when the static information
varied, WQRuSI demonstrated a clear improvement over the existing
algorithms. Because resources in an actual grid have diverse
capacities and numbers of processors, WQRuSI is expected to
contribute much to reducing task completion times.
5 References
[1] R. Buyya, "Economic-based Distributed Resource Management and
Scheduling for Grid Computing", Ph.D. Thesis, Monash University,
Melbourne, Australia, 2002.
[2] N. Muthuvelu, J. Liu, N. L. Soe, S. Venugopal, A. Sulistio and
R. Buyya, "A Dynamic Job Grouping-Based Scheduling for Deploying
Applications with Fine-Grained Tasks on Global Grids", AusGrid
2005, Vol. 44, 2005.
[3] D. P. da Silva, W. Cirne and F. V. Brasileiro, "Trading Cycles
for Information: Using Replication to Schedule Bag-of-Tasks
Applications on Computational Grids", in Proc. of Euro-Par 2003,
pp. 169-180, 2003.
[4] V. Subramani, R. Kettimuthu, S. Srinivasan and P. Sadayappan,
"Distributed Job Scheduling on Computational Grids Using Multiple
Simultaneous Requests", in Proc. of the 11th IEEE Symposium on
HPDC, pp. 359-366, 2002.
[5] O. H. Kang, S. S. Kang, "Web-based Dynamic Scheduling Platform
for Grid Computing", IJCSNS International Journal of Computer
Science and Network Security, Vol. 6, No. 5B, May 2006.
[6] X. He, X. Sun and G. von Laszewski, "A QoS Guided Min-Min
Heuristic for Grid Task Scheduling", Journal of Computer Science
and Technology, Special Issue on Grid Computing, Vol. 18, No. 4,
pp. 442-451, July 2003.
[7] M. Wu, W. Shu and H. Zhang, "Segmented Min-Min: A Static
Mapping Algorithm for Meta-Tasks on Heterogeneous Computing
Systems", in Proc. of the 9th HCW, pp. 375-385, 2000.
e-Science Models and the Research Life Cycle
How will it affect the Philippine Community?
Junseok Hwang*, Emilie Sales Capellan*, Roy Rayel Consulta*
International Information Technology Policy Program (ITPP)
Seoul National University, Seoul, South Korea
{capellan, roycons2001}@tepp.snu.ac.kr
Abstract - In digital information processes, life cycle models are
shaping the methods and the ways in which learners study. In a
larger system, part of what is represented in the life course is a
model of a process, such as the research process, as a chain of
sequentially interconnected stages or phases in which information
is manipulated or produced. This paper presents and discusses ways
in which the life cycle approach offers insight into the
relationships among the stages and activities of research,
especially in evolving technological fields such as e-Science. The
paper also presents an idea of how this research life cycle in
e-Science will affect the Philippine community. An understanding
of this viewpoint may contribute further insight into the function
of e-Science in the larger picture of methodical and scientific
research.
Keywords: e-Science, Life Cycle, Grid Computing,
Philippines
* Corresponding Authors
1 Introduction
The utilization of computers generates many challenges as it
expands the field of the possible in methodical and scientific
research, and many of these challenges are common to researchers
in diverse areas. The insights achieved in one area may catalyze
change and accelerate discovery in many others. It is no longer
possible to do science without doing computing [1]. Computing in
the sciences and humanities has developed a great deal over the
past decades. The life cycle approach makes us more sensitive to
possible information loss in the gaps between stages. The
transition and evolution points in the life cycle are essential
junctions for further important activities in research fields such
as e-Science. Many issues and streams of activity flow throughout
the life cycle of research, including project administration,
grant procurement, data management, knowledge creation, ethical
judgments, intellectual property supervision, and technology
management in the way e-Science is being implemented. Linking
activities across stages requires harmonization, coordination, and
a sense of continuity in the overall process [2]. In the
Philippines, the research undertaken in the Sustainable
Technologies Group of De La Salle University uses a highly
interdisciplinary approach to providing effective solutions to
environmental problems [3]. These problems require an intelligent,
integrated approach to yield solutions that are beneficial on a
life cycle basis. The group applies the life cycle framework in
most of its projects and therefore makes use of advanced computing
techniques such as:
• Knowledge-based and rule-based decision support
systems
• Monte Carlo and fuzzy sets
• Pinch analysis
• Artificial neural networks
• Swarm intelligence
To be open and responsive to e-Science, researchers must evaluate
and assess the services it provides for both research outcomes and
data. Given the stages of the life cycle associated with
e-Science, the services to be provided by research libraries, and
the partnerships required to implement and sustain them, must be
determined. There is barely a scientist or scholar remaining who
does not use a computer for research purposes, and distinctive
terms are in use for the fields that are particularly oriented to
computing in specific disciplines. In the natural and technical
sciences, the term "e-Science" has recently become popular, where
the "e" of course stands for "electronic" [4]. Science is ever
more done through distributed worldwide collaborations enabled by
the Internet, using very large data collections, tera-scale
computing resources, and high-performance visualization.
With today's technology, a very powerful infrastructure is
required to support and sustain e-Science, and the Grid is an
architecture intended to bring all these issues together and make
such a vision for e-Science a reality. In Grid computing, this
architecture treats Grid technology as a standard and generic
integration mechanism assembled from Grid Services (GS), an
extension of Web Services (WS) that complies with additional Grid
requirements. The principal extensions from WS to GS are the
management of state, identification, sessions, and life cycles,
and the introduction of a notification mechanism in conjunction
with Grid service data elements [5]. The term e-Science is
intended to capture a vision of the future of scientific research
based on distributed resources, especially data-gathering
instruments and research groups.
E-Science is scientific investigation performed through
distributed global collaborations between scientists and their
resources, and the computing infrastructure that enables this.
Scientific progress increasingly depends on pooling know-how and
results; making connections between ideas, people, and data; and
finding and interpreting knowledge generated by strangers in ways
other than those intended at its time of collection. E-Science
offers a promising vision of how computer and communication
technology can support and enhance the scientific process, by
enabling scientists to generate, analyze, share, and discuss their
insights, experiments, and results more effectively. In the
Philippines, as the technology has evolved, the agency called
ASTI1, mandated to conduct scientific research and development in
the advanced fields of Information and Communications Technology
and Microelectronics, undertakes projects committed to the
development of its people and the country as a whole. ASTI's
PSIGrid program will establish the necessary infrastructure and
community linkages to operate its grid facility throughout the
country [6]. ASTI will deploy a reliable and secure grid
management system for managing users, nodes, and software to
ensure the reliability and security of the entire grid.
2 Life Cycle Model of Research
This section describes the variety of methods used in
communicating and coordinating research outcomes. The research
outcomes, and the data upon which they are based, collectively
document the knowledge for an area of study. The life cycle model
helps monitor both the digital objects bound within a stage and
those that flow across stages; this is represented in the figure
by the lightly shaded box around Data and Research Outcomes [7].
Figure 1 shows how the life cycle model of research knowledge is
created.
1 Advanced Science and Technology Institute, the R&D agency arm of
the Department of Science and Technology (DOST) in the Philippines
Figure 1. Life cycle model of research knowledge (stages: Study
Concept & Design, Data Collection, Data Processing, Data Access &
Dissemination, Analysis, Research Outcomes, with Data Discovery
and Data Repurposing feeding a KT cycle). Source: Humphrey,
Charles; e-Science and the Life Cycle of Research, IASSIST
Communiqué, 2006
Every chevron in the above model symbolizes a stage in
the life cycle of research knowledge creation. The spaces
between chevrons indicate the transitions between stages.
These transitions tend to be vulnerable points in the
documentation of a project’s life cycle: when a stage is
completed, its information may not be systematically
preserved, and may instead end up dead-ended (most often
on someone’s hard drive). Shifts in responsibility for the
objects of research also tend to occur at these points of
transition. For example, the data collection stage passes
completed interviews or questionnaires along to the data
processing stage; the data processing stage passes one or
more clean data files to the data access and dissemination
stage. In each transition, someone else usually becomes
responsible for the outcomes of the previous stage. These
transition points become important areas in negotiating the
digital curation plan for a project, as partners in the life cycle
of research identify who is responsible for the digital objects
created at each stage.
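The stage-by-stage handoff of responsibility described above can be made concrete with a small sketch. This is purely illustrative: the stage names follow Figure 1, but the custodian assigned to each stage is a hypothetical example, not part of the original model.

```python
# Illustrative sketch of the research life cycle as a pipeline of stages.
# Stage names follow Figure 1; the custodians are hypothetical examples.
STAGES = [
    ("Study Concept & Design", "principal investigator"),
    ("Data Collection", "field team"),
    ("Data Processing", "data manager"),
    ("Data Access & Dissemination", "archive/repository"),
    ("Analysis", "research team"),
    ("Research Outcomes", "publisher/repository"),
]

def transition_plan(stages):
    """List every stage boundary and the responsibility handoff across it."""
    plan = []
    for (stage_a, owner_a), (stage_b, owner_b) in zip(stages, stages[1:]):
        plan.append(f"{stage_a} -> {stage_b}: {owner_a} hands off to {owner_b}")
    return plan

for line in transition_plan(STAGES):
    print(line)
```

Enumerating the transitions this way mirrors the negotiation the text describes: each boundary between chevrons is exactly one entry in the plan, and each entry names who takes over the digital objects.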
In e-Science, the knowledge life cycle can be viewed as
a set of challenges as well as a sequence of stages, and each
stage has at times been seen as a blockage. Knowledge
acquisition was one bottleneck recognized early [8], but so
too are modeling, retrieval, reuse, publication and
maintenance. In this section, we examine the nature of the
challenges at each stage in the knowledge life cycle and
review the various methods and techniques at our disposal.
Although we often suffer from a deluge of data and too much
information, all too often what we have is still insufficient or
too poorly specified to address our problems, goals, and
objectives. In short, we have insufficient knowledge.
Knowledge acquisition sets the challenge of getting hold of
the information that is around, and turning it into knowledge
by making it functional. This might involve, for instance,
making implicit knowledge explicit, identifying gaps in the
knowledge already held, acquiring and integrating
knowledge from multiple sources
(e.g. different experts, or distributed sources on the Web), or
acquiring knowledge from unstructured media (e.g. natural
language or diagrams).
A variety of techniques and methods have since been
developed to facilitate knowledge acquisition. Much of this
work has been carried out in the context of attempts to build
knowledge-based or expert systems. Techniques include
varieties of interview, different forms of observation of
expert problem-solving, methods of building conceptual
maps with experts, various forms of document and text
analysis, and a range of machine learning methods [9]. Each
of these techniques has been found to be suited to the
elicitation of different forms of knowledge, and to have
different consequences in terms of the effort required to
capture and model the knowledge [10, 11]. Specific software
tools have also been developed to support these various
techniques [12], and increasingly these are Web-enabled
[13]. However, the process of explicit knowledge acquisition
from human experts remains a costly and resource-intensive
exercise. Hence the increasing interest in methods that can
(semi-)automatically elicit and acquire knowledge that is
often implicit or else distributed on the Web [14].
A variety of information extraction tools and methods
are being applied to the huge body of textual documents now
available [15]. Another style of automated acquisition
consists of systems that observe user behavior and infer
knowledge from that behavior. Examples include
recommender systems that might look at the papers
downloaded by a researcher and then detect themes by
analyzing the papers using methods such as term frequency
analysis [16]. The recommender system then searches other
literature sources and suggests papers that might be relevant
or of interest to the user. Methods that can engage in this
sort of background knowledge acquisition are still in their
infancy, but with the proven success of pattern-directed
methods in areas such as data mining, they are likely to
assume greater prominence in our attempts to overcome the
knowledge acquisition blockage.
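As a rough illustration of the term-frequency approach mentioned above, the following sketch profiles a downloaded paper by term frequencies and ranks candidate papers by term overlap. This is a minimal sketch, not the system of [16]; the sample documents and the unnormalized overlap score are invented for illustration.

```python
# Minimal sketch of term-frequency profiling for paper recommendation.
# Not the system described in [16]; documents and scoring are illustrative.
from collections import Counter

def term_frequencies(text):
    """Bag-of-words term frequencies for a lowercase, whitespace-tokenized text."""
    words = text.lower().split()
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

def similarity(profile, candidate):
    """Overlap score: sum of shared-term frequency products (unnormalized)."""
    return sum(f * candidate.get(w, 0.0) for w, f in profile.items())

downloaded = "grid scheduling and grid resource management"
candidates = {
    "paper A": "grid resource scheduling heuristics",
    "paper B": "protein folding dynamics",
}
profile = term_frequencies(downloaded)
ranked = sorted(candidates,
                key=lambda k: similarity(profile, term_frequencies(candidates[k])),
                reverse=True)
print(ranked)
```

A real recommender would add stop-word removal, stemming and a normalized measure such as cosine similarity, but the theme-detection idea is the same: papers sharing high-frequency terms with the user's downloads rank first.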
3 Research trends – e-Science in transparency
The fascinating concept of e-Science illustrates the
changes that information technology is bringing to the
methodology of scientific research [17]. e-Science is a
relatively new expression that has become widely accepted
since the launch of the major United Kingdom initiative [18].
e-Science describes a new approach to science involving
distributed global and international collaborations enabled
by the Internet and using very large data collections,
terascale computing resources and high-performance
visualizations. e-Science is about global collaboration in key
areas of science, and the next generation of infrastructure,
namely the Grid, that will enable it. Figure 2 summarizes the
e-Scientific method.
Fig. 2. Computational science and information technology
merge in e-Science
Put simply, the last decade can be characterized as
delivering simulation and its integration with science and
engineering – this is computational science. e-Science builds
on this by adding data from all sources, together with the
information technology needed to analyze the data and
incorporate it into the simulations.
Over the past fifty years, scientific practice has evolved
to reflect the growing power of communication and the
importance of collective wisdom in scientific discovery.
Originally scientists collaborated by sailing ship and carrier
pigeon; at present, aircraft, phone, e-mail and the Web have
greatly enhanced communication, and therefore the quality
and real-time nature of scientific collaboration. The
cooperation can be both “real” and electronically enabled;
see [19, 20] for early influential work on scientific collaboration.
e-Science and hence the Grid is the infrastructure that
enables collaborative science. The Grid can provide the
basic building blocks to support real-time distance
interaction, which has been exploited in distance education.
Particularly important is the infrastructure to support shared
resources – this includes many key services including
security, scheduling and management, registration and search
services and the message-based interfaces of Web services to
allow powerful sharing (collaboration) mechanisms. All of
the basic Grid services and infrastructure provide a critical
venue for collaboration and will be highly important to the
community.
From the Philippine perspective, researchers have
created what they say is the first generic system for Grid
computing that utilizes an industry-standard Web service
infrastructure. The system, called Bayanihan Computing
.NET [21], is a generic Grid computing framework based on
Microsoft .NET that uses Web services to harness computing
resources through “volunteer” computing, similar to projects
such as SETI@Home [22], and to make those computing
resources easily accessible through easy-to-use,
interoperable computational Web services. As mentioned in
the preceding section, the ASTI agency in the Philippines is
managing a project called the Philippine e-Science Grid
Program (PSiGrid).
This emerging computing model provides the ability to
perform higher-throughput computing by taking advantage
of many networked computers, modeling a virtual computer
architecture that can distribute process execution across a
parallel infrastructure. The establishment and planning of
PSiGrid is expected to foster collaboration among local
research groups as they share computing resources to further
their research efforts. It is also expected to enable efficient
utilization of local computing resources.
4 e-Science Practical Model
Application
With the global advancement of technology, new
advances in networking and computing have produced
explosive growth in networked applications and information
services. Applications are becoming more complex,
heterogeneous and dynamic. A recently concluded forum on
a national e-Science development strategy, held on August
24 at the Westin Chosun Seoul under the supervision of
KISTI and the joint auspices of the Ministry of Science and
Technology (MOST) and the Korea e-Science Forum,
reported on the significance of R&D activity shifting to an
e-Science system.
National e-Science is becoming more and more
important for several reasons: it offers a new research
method for tackling huge applications and research in
limited environments; it improves research productivity by
letting researchers utilize resources at remote places and
collaborate with one another; it supports learning by
enabling diverse learning equipment to be utilized in a
networked studying environment; and, through cutting-edge
technology innovation, it can act as a new growth engine for
economic development [23].
One major impact has been on the medical field, for
instance in reducing the period needed for drug development.
Others include enabling global research projects in fields
such as aerospace development, nuclear fusion research, and
tsunami and SARS prevention, and boosting national science
and technology competitiveness by developing a new
methodological model in which IT and science technologies
converge, by securing the core technology of convergence
research, and through cooperation and collaboration among
nations, regions and fields, so that researchers can have
access to cutting-edge equipment, data and research
manpower. By means of cutting-edge technology innovation,
national e-Science can serve as a new growth engine for
economic development, with potentially enormous economic
ripple effects.
Aside from R&D applications, e-Science has also
proven its importance through its introduction to the
classroom. In the UK, a pilot project has begun to explore
the potential benefits of collecting and sharing scientific data
within and across schools, and of closer collaborations
between schools and research scientists, with a view to
running a national project involving multiple schools [24].
The pilot has begun to reveal the educational potential of
such activities: teachers and students input, manipulate and
share their collected data using Grid-like technologies. A
larger-scale project would have the potential to feed school-
sampled local pollution data into a more significant Grid-
based data set, which scientists could use to build up a
picture of pollution levels across the country.
Another major contribution came from the UK, the first
country to develop a national e-Science Grid. In one of its
pilot projects, researchers developed a digital mammographic
archive, together with an intelligent medical decision-support
system, for use in the diagnosis and treatment of breast
cancer; an individual hospital without supercomputing
facilities could access it through the grid. This project is
called e-DiaMoND and Integrative Biology [25].
In Australia, the world’s first degrees in e-Science were
introduced. Two Australian computer science departments,
at the Australian National University (ANU) and RMIT,
worked together to establish a program called the “Science
Lectureships Initiative”, designed to foster linkages between
academia and industry with the idea of attracting students
into science-related areas that would benefit emerging
industries [26]. At RMIT, the eScience Graduate Diploma
started with only 10 students in its first year and thereafter
struggled to gain enough extra students to become self-
sustaining as a separate program, while at ANU there was a
large influx of overseas students, particularly from the
Indian subcontinent and East Asia. These initiatives can
provide guidance and attract other universities to set up
similar education programs.
5 Definition and Relevance of
e-Science in the Philippine
Perspective
As many studies by different authors around the world
have noted, e-Science has its own role, function and
relevance in modern society. Many developed countries have
gone far in this field; in the Philippines, however, it is only
in the introduction phase. Thus, e-Science could be defined
as a solution that, through international collaboration, can
help the Philippines improve its technological innovation in
research and discovery within an applied technological
approach.
This paper addresses three of the most important
applications of ICT: education, health and governance. With
its direct connectivity to a number of international research and
education networks, such as the Asia Pacific Advanced
Network (APAN) and the Trans-Eurasia Information
Network 2 (TEIN2), this will enable researchers in the
academic sector to collaborate in the global research
community.
6 Research Life Cycle Model in the
Philippines
In the model shown in Fig. 3, PREGINET2 will be the
network backbone supporting the key players in the whole
system flow of scientific research arising from the academe
and the government’s research and development institutes.
In this research life cycle model, it serves as the highway
carrying the applications, which in this research is e-Science.
As shown in the model, e-Science is at the heart of these
important research areas, becoming the central application
for researchers from the academe and other R&D
institutions, the e-Library and distance learning. This
platform will allow linkages among its partners in the
network, locally and globally.
Fig. 3. Development Model and Work Processes. (Components linked through e-Science over the PREGINET backbone: Researches from the Academe, Researches from DOST R&D institutions, Distance Learning, e-Library, and Data Storage.)
This will be linked to a central repository managed and
controlled by a policy-making body or technical working
group. As the development model and work processes show,
the Department of Science and Technology (DOST) has
provided funding, under its Grants-in-Aid (GIA) Program,
to implement the Philippine e-Science Grid (PSiGrid)
Program. This three-year program (2008-2011) aims to
establish a grid infrastructure in the Philippines to improve
research collaboration among educational and research
institutions, both locally and
2 PREGINET is a nationwide broadband research and education
network that interconnects academic, research and government institutions. It is the first government-led initiative to establish a National Research and Education Network (NREN) in the country. PREGINET utilizes existing infrastructure of the Telecommunications Office (TELOF) of the Department of Transportation and Communications (DOTC) as redundant links.
abroad. It covers three projects: (1)
Boosting Grid Computing Using Reconfigurable Hardware
Technology (Grid Infrastructure); (2) Developing a
Federated Geospatial Information System (FedGIS) for
Hazard Mapping and Assessment; and (3) Boosting Social
and Technological Capabilities for Bioinformatics Research
(PBS). The Program will be implemented by the Advanced
Science and Technology Institute (ASTI), an attached
institute of the DOST focusing on R&D in ICT and
Microelectronics, as noted in the previous section.
Moreover, the four components of e-Science shown in
Figure 3 emphasize the importance of the following elements
in the Philippine e-Science perspective. First, Researches
from the Academe stresses collaboration between the
academic and R&D institutions or sectors that may function
within the e-Science framework. Second, in association with
the first component, Researches from the R&D Sectors
accentuates linking R&D to the e-Science work processes.
Third, given the PREGINET infrastructure, Distance
Learning is emphasized within the collaborative e-Science
framework, since the framework itself may be a special tool
for delivering interactive, real-time education. Lastly, the
e-Library takes advantage of the potential of distributed
computing.
7 Effect of Research Life Cycle
Utilizing Grid Technology –
the e-Science in the Philippines
Understanding the full research life cycle allows us to
identify gaps in services, technologies and partnerships that
affect the eventual utilization of Grid technology in an
e-Science framework. There is also a need to understand the
process of collaboration in e-Science in order to fully and
accurately define requirements for next-generation Access
Grids [27]. The emergence of e-Science systems also raises
challenging issues concerning the design and usability of
representations of information, knowledge or expertise
across the variety of potential users, which could lead to
scientific discovery [28].
Discussion of e-Science frequently focuses on hardware,
user interfaces, storage capacity and other technical issues.
In the end, however, the capability of e-Science to serve the
needs of scientific research teams boils down to people: the
ability of the builders of the infrastructure to communicate
with its users and to understand their needs and the realities
of their work cultures [29].
The builders and implementers of e-Science
infrastructure need to focus more on fostering the
infrastructure than on merely building it. There are social
features of research that must be recognized, from
understanding how research teams work and interact to
realizing that research often does not involve the kinds of
large, interdisciplinary projects engaged in by virtual
organizations, but rather individual work and unplanned or
ad-hoc, flexible forms of collaboration within wider
communities.
The grid is transforming science and business: e-Science
research, business and commerce will benefit significantly
from grid-based technologies, which can potentially increase
capability, efficiency and effectiveness through leading-edge
applications and by solving large scientific and business
computing problems. On the socio-economic side, issues
such as ethics, privacy, liability, risk and responsibility will
demand investigation to inform future public policy. In
addition, the envisaged new forms of business models raise
economic and legal issues that will require interdisciplinary
research.
In the long run, the lasting and permanent effects of
high-speed networks, data stores, computing systems, sensor
networks, and collaborative technologies that make e-
Science possible will be up to the people who create it and
use it.
For e-Science projects like the PSiGrid program in the
Philippines, the majority (if not all) of the funding comes
from government sources. For this cooperation to be
sustainable, however, especially in commercial or
government settings, participants need to have an economic
incentive. Thus, as stated in the preceding sections, PSiGrid
aims to establish a grid infrastructure in the Philippines to
maximize and improve the potential of research
collaboration among educational and research institutions,
both locally and abroad. With this start, and given the
promising vision of e-Science, there is a great chance for the
PSiGrid program to participate globally and to come up with
technologies that would benefit its citizens.
8 Conclusion
To conclude, given the above viewpoints on lifecycle
and e-Science models, there have been important changes
how technology especially on scientific researches can be
successfully managed.
The trend in technology is toward increasingly global
collaborations for scientific research. Every country that has
begun implementing its vision for e-Science can be seen to
have its own strategy for facing the challenges, not only
technical issues such as dependability, interoperability and
resource management, but also people-centric ones relating
to collaboration and the sharing of resources and data. In the
case of the United Kingdom (UK), for example, nine
e-Science centres and eight other regional centres were
established, covering most of the UK. These primarily aimed
to allocate substantial computing and data resources and run
a standard set of Grid middleware to form the basis for the
construction of a UK Grid testbed, to generate a portfolio of
industrial Grid middleware and tools, and to disseminate
information and experience of the Grid [30].
The ideas presented in this paper on both the e-Science
models and the life cycle approach should provide insights,
direction and encouragement for policy makers, along with a
valuable contribution to serving the Filipino people,
especially the scientists and researchers working toward
technological breakthroughs. Further study should evaluate
the life cycles that co-exist in global trends and in the
Philippine perspective, and e-Science tools must become
more intuitive so that communities such as the biomedical
community can use them in collaborative R&D.
9 References
[1] The 2007 Microsoft e-Science Workshop at RENCI,
https://www.mses07.net/main.aspx
[2] Conceptualizing the Digital Life Cycle,
http://iassistblog.org/?p=26
[3] Sustainable Technologies Research Group,
http://www.dlsu.edu.ph/research/centers/cesdr/strg.asp
[4] Boonstra, O; Breure, L; Doorn, P; Past, Present and
Future of Historical Information Science, Netherlands
Institute for Scientific Information Royal Netherlands
Academy of Arts and Sciences, 2004.
[5] Berman, F; Hey, A. J. G; Fox, G. C; Grid Computing –
Making the Global Infrastructure a Reality, Wiley
Series of Communications Networking & Distributed
Systems, 2003
[6] http://www.psigrid.gov.ph/index.php
[7] Humphrey, Charles; e-Science and the Life Cycle of
Research, IASSIST Communiqué, 2006
[8] Hayes-Roth, F.; Waterman, D. A; Lenat, D. B; Building
Expert Systems, Reading, Addison-Wesley, 1983
[9] Shadbolt, N. R; Burton, M, Knowledge elicitation: A
systematic approach, in evaluation of human work, and
Wilson, J. R; Corlett, E. N. (eds) A Practical
Ergonomics Methodology, London, UK: Taylor &
Francis, 1995
[10] Hoffman, R.; Shadbolt, N. R.; Burton, A. M; Klein, G.,
Eliciting knowledge from experts: A methodological
analysis. Organizational Behavior and Decision
Processes, 1995
[11] Shadbolt, N. R.; O’Hara, K.; Crow, L., The
experimental evaluation of knowledge acquisition
techniques and methods: history, problems and new
directions, International Journal of Human Computer
Studies, 1999
[12] Milton, N.; Shadbolt, N.; Cottam, H.; Hammersley, M.;
Towards a knowledge technology for knowledge
management. International Journal of Human Computer
Studies, 1999
[13] Shaw, M. L. G.; Gaines, B. R., WebGrid-II: Developing
hierarchical knowledge structures from flat grids.
Proceedings of the 11th Knowledge Acquisition
Workshop (KAW ’98), 1998 Banff, Canada, April,
1998, http://repgrid.com/reports/KBS/WG/.
[14] Crow, L.; Shadbolt, N. R., Extracting focused
knowledge from the semantic web. International Journal
of Human Computer Studies, 2001
[15] Ciravegna, F.; Adaptive information extraction from text
by rule induction and generalization, Proceedings of
17th International Joint Conference on Artificial
Intelligence (IJCAI2001), Seattle, 2001
[16] Middleton, S. E.; De Roure, D.; Shadbolt, N. R.,
Capturing knowledge of user preferences: Ontologies in
recommender systems, Proceedings of the First
International Conference on Knowledge Capture, K-
CAP2001. New York: ACM Press, 2001
[17] Fox, G; e-Science meets computational science and
information technology. Computing in Science and
Engineering, 2002,
http://www.computer.org/cise/cs2002/c4toc.htm.
[18] Taylor, J. M.; e-Science, http://www.e-science.clrc.ac.uk
and http://www.escience-grid.org.uk/
[19] W. Wulf; The National Collaboratory – A White Paper
in Towards a National Collaboratory, Unpublished
report of a NSF workshop, Rockefeller University, New
York, 1989
[20] Kouzes, R. T.; Myers, J. D.; Wulf, W. A.;
Collaboratories: Doing science on the Internet, IEEE
Computer August 1996, IEEE Fifth Workshops on
Enabling Technology: Infrastructure for Collaborative
Enterprises (WET ICE ’96), 1996,
http://www.emsl.pnl.gov:2080/docs/collab/presentations
/papers/IEEECollaboratories.html.
[21] Sarmenta, L. F. G; Bayanihan Computing .NET: Grid
Computing with XML Web Services, 2002,
http://bayanihancomputing.net/
[22] Search for Extraterrestrial Intelligence,
http://setiathome.berkeley.edu/sah_about.php
[23] Survival Strategy for Securing International
Competitiveness, Korea IT Times, 2007
[24] Tallyn, E., et al.; Introducing e-Science to the
Classroom,
http://www.equator.ac.uk/var/uploads/Ella2004.pdf
[25] Prof. Iversen, Oxford University e-Science Open Day,
2004
[26] Gardner, H., et al.; eScience curricula at two
Australian universities, Australian National University
and RMIT, Melbourne, Australia, 2004
[27] De Roure, D., et al.; A Future e-Science
Infrastructure, 2001
[28] Usability Research Challenges in e-Science, UK e-
Science Usability Task Force
[29] Voss, A.; Features: e-Science – It’s Really About
People, HPCWire – High Productivity Computing
website, RENCI, 2007
[30] Tony Hey and Anne E. Trefethen, The UK e-Science
Core Programme and the Grid, ICCS, Springer-Verlag
Berlin Heidelberg, 2002
An Agent-based Service Discovery Algorithm Using Agent Directors for Grid Computing
Leila Khatibzadeh1, Hossein Deldari2 1Computer Department, Azad University, Mashhad, Iran
2Computer Department, Ferdowsi University, Mashhad, Iran
Abstract - Grid computing has emerged as a viable method to solve computational and data-intensive problems in domains ranging from business computing to scientific research. However, grid environments are largely heterogeneous, distributed and dynamic, all of which increases the complexity involved in developing grid applications. Several software systems have been developed to provide programming environments that hide these complexities and simplify grid application development. Since agent technologies have been studied for more than a decade, and given the flexibility demanded by the complexity of the grid software infrastructure, multi-agent systems are one way to overcome the challenges in grid development. In this paper, we consider the needs of programs running on the grid and present a three-layer agent-based parallel programming model for grid computing. The model is based on interactions among agents, and we have implemented a service discovery algorithm for its application layer. To support agent-based programs, we have extended the GridSim toolkit and implemented our model in it.
Keywords: Agents, Grid, Java, Parallel Programming, Service Discovery Algorithm
1 Introduction Grid applications are the next-generation network
applications for solving the world’s computational and data-intensive problems. Grid applications support the integrated and secure use of a variety of shared and distributed resources, such as high-performance computers, workstations, data repositories and instruments. The heterogeneous and dynamic nature of the grid requires its applications to deliver high performance while also being robust and fault-tolerant [1]. Grid applications run on different types of resources whose configurations may change during run-time. These dynamic configurations may be motivated by changes in the environment, e.g., performance changes, hardware failures, or the need to flexibly compose virtual organizations from available grid resources [2]. Grids are also used for large-scale, high-performance computing. High performance requires a balance of computation and communication among all resources involved. Currently this
is achieved by managing computation, communication and data locality using message-passing or remote method invocation [3]. Designing and implementing applications that possess such features from the ground up is often difficult. As such, several programming models have been presented.
In this paper, we propose a new programming model for the Grid. Numerous research projects have already introduced class libraries or language extensions for Java to enable parallel and distributed high-level programming. There are many advantages to using Java for Grid computing, including portability, easy deployment of Java bytecode, the component architecture provided through JavaBeans, and a wide variety of class libraries that include additional functionality such as secure socket communication or complex message passing. For these reasons, Java has been chosen for this model. Agent technology will be of great help in pervasive computing, and because multi-agent systems in such environments pose important challenges, the use of agents has become a necessity. In agent-based software engineering, programs are written as software agents communicating with each other by exchanging messages through a communication language [4]. How the agents communicate with each other differs in each method. In this paper, a three-layer agent-based model is presented based on interactions between the agents, and a service discovery algorithm is implemented in this model.
The rest of the paper is organized as follows: In Section 2, a brief review of related work on programming models for the Grid and a review of service discovery algorithms are given. Section 3 presents the proposed three-layer agent-based parallel programming model in detail, including the model in which the service discovery algorithm is implemented. Simulation results are presented in Section 4. Finally, the paper is concluded in Section 5.
2 Related works There are different kinds of programming models, each
of which has been implemented in different environments. These programming models are briefly summarized below.
Superscalar is a common concept in parallel computing [5]. Sequential applications composed of tasks of a certain granularity are automatically converted into a parallel application, and the tasks are then executed on different servers of a computational Grid.
MPICH-G2 is a Grid-enabled implementation of the Message Passing Interface (MPI) [6]. MPI defines standard functions for communication between processes and groups of processes. Using the Globus Toolkit, MPICH-G2 provides extensions to MPICH. This gives users familiar with MPI an easy way of Grid enabling their MPI applications. The following services are provided by the MPICH-G2 system: co-allocation, security, executable staging and results collection, communication and monitoring [5].
Grid-enabled RPC is a Remote Procedure Call (RPC) model and an API for Grids [5]. Besides providing standard RPC semantics, it offers a convenient, high-level abstraction whereby many interactions with a Grid environment can be hidden. GridRPC seeks to combine the standard RPC programming model with asynchronous, coarse-grained parallel tasking.
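The asynchronous tasking style that GridRPC combines with the RPC model can be sketched locally. This is an illustration of the call-handle pattern only, using Python futures in a local thread pool in place of real remote GridRPC calls; the "remote" procedure is a hypothetical stand-in.

```python
# Sketch of GridRPC-style asynchronous calls: submit a coarse-grained
# procedure, receive a call handle immediately, collect the result later.
# The "remote" procedure runs in a local thread pool purely for illustration.
from concurrent.futures import ThreadPoolExecutor

def remote_sum_of_squares(n):
    """Stand-in for a coarse-grained remote procedure (hypothetical)."""
    return sum(i * i for i in range(n))

with ThreadPoolExecutor(max_workers=4) as pool:
    # Analogue of an asynchronous grpc call: returns at once with a handle.
    handles = [pool.submit(remote_sum_of_squares, n) for n in (10, 100)]
    # Analogue of waiting on the handles: block until each result is ready.
    results = [h.result() for h in handles]

print(results)
```

The key point the sketch shows is the decoupling: the caller keeps working between submission and collection, which is what lets GridRPC overlap many coarse-grained tasks across Grid servers.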
Gridbus Broker is a software resource that transparently permits users to access heterogeneous Grid resources [5]. The Gridbus Broker Application Program Interface (API) provides users a straightforward means to Grid-enable their applications with minimal extra programming. Implemented in Java, the Gridbus Broker provides a variety of services including resource discovery, transparent access to computational resources, job scheduling and job monitoring. The broker transforms user requirements into a set of jobs that are scheduled on the appropriate resources; it manages them and collects the results.
ProActive is a Java-based library that provides an API for the creation, execution and management of distributed active objects [7]. Proactive is composed of only standard Java classes and requires no changes to the Java Virtual Machine (JVM). This allows Grid applications to be developed using standard Java code. In addition, ProActive features group communication, object-oriented Single Program Multiple Data (OO SPMD), distributed and hierarchical components, security, fault tolerance, a peer-to-peer infrastructure, a graphical user interface and a powerful XML-based deployment model.
Alchemi is a Microsoft .NET Grid computing framework, consisting of service-oriented middleware and an application program interface (API) [8,9]. Alchemi features a simple and familiar multithreaded programming model. Alchemi is based on the master-worker parallel programming paradigm and implements the concept of Grid threads.
Grid Thread Programming Environment (GTPE) is a programming environment, implemented in Java, utilizing the Gridbus Broker API [1]. GTPE further abstracts the task of Grid application development and automates Grid management while providing a finer level of logical program control through the use of distributed threads. The GTPE is architected with the following primary design objectives: usability and portability, flexibility, performance, fault tolerance and security. GTPE provides additional functionality to minimize the effort necessary to work with grid threads [1].
Open Grid Services Architecture (OGSA) is an ongoing project that aims to enable interoperability among heterogeneous resources by aligning Grid technologies with established Web services technology [5]. The concept of a Grid service is likened to a Web service that provides a set of well-defined interfaces that follow specific conventions. These Grid services can be composed into more sophisticated services to meet the needs of users. The OGSA is an architecture specification defining the semantics and mechanisms governing the creation, access, use, maintenance and destruction of Grid services. The following specifications are provided: Grid service instances, upgradability and communication, service discovery, notification, service lifetime management and higher level capabilities.
Dynamic service discovery is not a new issue. There are several solutions proposed for fixed networks, all with different levels of acceptance [10]. We will now briefly review some of them: SLP, Jini, Salutation and UPnP’s SSDP.
The Service Location Protocol (SLP) is an Internet Engineering Task Force standard for enabling IP network-based applications to automatically discover the location of a required service [11]. The SLP defines three “agents”, User Agents (UA) that perform service discovery on behalf of client software, Service Agents (SA) that advertise the location and attributes on behalf of services, and Directory Agents (DA) that store information about the services announced in the network. SLP has two different modes of operation. When a DA is present, it collects all service information advertised by SAs. The UAs unicast their requests to the DA, and when there is no DA, the UAs repeatedly multicast these requests. SAs listen to these multicast requests and unicast their responses to the UAs [10].
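The two SLP modes described above can be sketched in Java. This is an illustrative model only, not the real SLP wire protocol: class and method names are hypothetical, and the multicast fallback is modeled as an unresolved lookup rather than an actual network exchange.

```java
import java.util.Map;
import java.util.Optional;

// Sketch of SLP's two operating modes: when a Directory Agent (DA) is
// known, the User Agent (UA) unicasts its request to the DA's registry;
// without a DA, the UA would repeatedly multicast the request and gather
// unicast replies from listening Service Agents (SAs).
public class SlpModeSketch {

    // directoryAgent models the DA's registry: service type -> location.
    static Optional<String> lookup(String serviceType,
                                   Optional<Map<String, String>> directoryAgent) {
        if (directoryAgent.isPresent()) {
            // DA mode: unicast the query to the DA's collected registry.
            return Optional.ofNullable(directoryAgent.get().get(serviceType));
        }
        // No DA: fall back to multicast discovery (not modeled here).
        return Optional.empty();
    }
}
```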
Jini is a technology developed by Sun Microsystems [12]. Its goal is to enable truly distributed computing by representing hardware and software as Java objects that can adapt themselves to communities and allow objects to access services on a network in a flexible way. Similar to the Directory Agent in SLP, service discovery in Jini is based on a directory service, named the Jini Lookup Service (JLS). Since Jini relies on clients always being able to discover services, the JLS is essential to its operation.
Salutation is an architecture for searching, discovering, and accessing services and information [13]. Its goal is to solve the problems of service discovery and utilization among a broad set of applications and equipment in an environment of widespread connectivity and mobility. Salutation architecture defines an entity called the Salutation Manager (SLM). This functions as a directory of applications, services and devices, generically called Networked Entities. The SLM allows networked entities to discover and use the capabilities of other networked entities [10].
Simple Service Discovery Protocol (SSDP) was created as a lightweight discovery protocol for the Universal Plug-and-Play (UPnP) initiative [14]. It defines a minimal protocol for multicast-based discovery. SSDP can work with or without its central directory service, called the Service Directory. When a service intends to join the network, it first sends an announcement message to notify its presence to the rest of the devices. This announcement may be sent by multicast, so that all other devices will see it, and the Service Directory, if present, will record the announcement. Alternatively, the announcement may be sent by unicast directly to the Service Directory. When a client wishes to discover a service, it may ask the Service Directory about it or it may send a multicast message asking for it [10].
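The announcement-and-lookup behavior just described can be sketched as follows. This is an illustrative model, not real SSDP (which runs HTTP-formatted messages over UDP multicast); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of SSDP-style discovery: a joining service multicasts an
// announcement that every device sees and that an optional Service
// Directory records; clients may then ask the directory instead of
// multicasting their search.
public class SsdpSketch {
    final List<String> multicastListeners = new ArrayList<>();     // all devices
    final Map<String, String> serviceDirectory = new HashMap<>();  // optional directory

    void announceByMulticast(String serviceId, String location) {
        multicastListeners.add(serviceId);           // every device sees the notice
        serviceDirectory.put(serviceId, location);   // the directory, if present, records it
    }

    // A client asks the directory first; a real client would fall back to
    // a multicast search when the directory is absent or has no answer.
    String discover(String serviceId) {
        return serviceDirectory.getOrDefault(serviceId, "multicast-search");
    }
}
```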
In this paper, drawing on the models presented above, we integrate the benefits of message passing through explicit communication. In addition, we adopt a Java-based model because its portability is similar to that of the distributed object approach. Agents that act like distributed threads are also utilized, and a service discovery algorithm is implemented as the third layer of this model.
3 The three-layer agent-based parallel programming model
In this section, a three-layer agent-based parallel programming model for the Grid is presented, and the communications among agents are classified into three layers:
The lower level is the transfer layer which defines the way of passing messages among agents. In this layer, the sending and receiving of messages among agents are based on the UDP protocol. A port number is assigned to each agent which, in turn, sends and receives messages.
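A minimal sketch of this transfer layer is shown below, assuming UDP datagrams carrying UTF-8 strings between per-agent sockets. The class and method names are illustrative (the paper's actual implementation details are not shown), and for testability the example sends a message to a second socket on the loopback interface.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Each agent owns a DatagramSocket bound to its assigned port; a message
// is a UTF-8 string addressed to the receiving agent's (host, port) pair.
public class AgentTransportSketch {
    static String sendAndReceiveLocally(String message) {
        try (DatagramSocket receiver = new DatagramSocket(0);   // receiving agent's port
             DatagramSocket sender = new DatagramSocket(0)) {   // sending agent's port
            byte[] out = message.getBytes(StandardCharsets.UTF_8);
            sender.send(new DatagramPacket(out, out.length,
                    InetAddress.getLoopbackAddress(), receiver.getLocalPort()));
            byte[] buf = new byte[2048];
            DatagramPacket in = new DatagramPacket(buf, buf.length);
            receiver.setSoTimeout(2000);                        // avoid blocking forever
            receiver.receive(in);
            return new String(in.getData(), 0, in.getLength(), StandardCharsets.UTF_8);
        } catch (Exception e) {
            return "error: " + e.getMessage();
        }
    }
}
```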
The middle level is the communication layer which defines the manner of communicating among agents. There are different communication methods.
Direct communication, such as contract-net and specification sharing, has some disadvantages [4]. One problem with this form of communication is its cost: if the agent community is as large as a Grid, the overhead of broadcasting messages among agents is quite high. Another problem is the complexity of implementation. Therefore, an indirect communication method using Agent Directors (ADs) has been adopted. An AD is a manager agent which directs communication among its own agents, as well as communication between these agents and agents under other ADs. Figure 1 shows the scheme of this communication. Message passing and the transfer rate are reduced in this model. Depending on the agent's behavior (requester/replier), two AMSs (Agent Management Systems) are maintained in each AD. Each AMS holds information about other agents and their ADs, which is used to direct a request to the desired AD as necessary. With two AMSs, the time spent searching the agent platform decreases.
Based on the AD’s needs and the type of request, the AD’s table of information on other agents in other ADs is updated in order to produce a proper decision. This update reduces the amount of transactions among ADs.
Figure 1. The scheme of communication among agents in the presented model
The following parameters are considered for each AMS when being stored in the AMS table:
• AID: A unique name for each agent, composed of an ID and the name of the machine where the agent was created.
• State: Shows the state of the agent. Three states are possible in our model:
I. Active State: When the agent is fully initiated and ready to use.
II. Waiting State: When the agent is blocked and waiting for a reply from other agents.
III. Dead State: When the agent has completed its job. In this state, information about this agent is removed from AMS.
• Address: Shows the location of an agent. If an agent migrates to another machine, this is announced to the AD.
• AD name: Shows the name of the AD associated with the agent. This parameter is essential as communication among agents is performed through ADs.
• Message: Shows the message sent from an agent to the AD. According to the agent’s behavior, different kinds of messages are generated. The content of the message shows the interaction that is formed between the AD and the agent. Based on FIPA [15], we have considered ACL (Agent Communication Language) with some revisions. Below is the list of the different message types:
- If the agent acts as a requester, the message content, based on the program running on the Grid, describes the request, e.g., a required service in the service discovery algorithm.
- When the agent is created, the message content includes the agent's specifications for registering in the AMS.
- If the agent acts as a replier, the message content, based on the program running on the Grid, describes the reply to the request.
- When the job related to the agent is completed, the message content instructs the AD to remove this record from the AMS.
- While a program is running, if an error occurs, the message content informs the AD.
We have considered a history table which keeps information about agents that have completed their jobs. The jobs done through an AD are reported to the user. In this model, the AD decides whether to send a request to its own agents or to agents on other platforms. This decision is made according to the information stored in the replier AMS and the condition of the agents.
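The AMS records and the AD's routing decision can be sketched as follows. This is a simplified model with illustrative names; the actual data layout in the paper's implementation is not shown.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an AD's replier-side AMS table and routing decision: DEAD
// agents are dropped from the table, and a request is served locally
// only when the registered agent belongs to this AD; otherwise it is
// forwarded to the agent's AD.
public class AdSketch {
    enum State { ACTIVE, WAITING, DEAD }

    // One AMS record: unique AID (ID plus creation machine), state,
    // current address (updated on migration), and the owning AD's name.
    record Entry(String aid, State state, String address, String adName) {}

    final Map<String, Entry> replierAms = new HashMap<>();

    void register(Entry e) {
        if (e.state() == State.DEAD) replierAms.remove(e.aid());  // job completed
        else replierAms.put(e.aid(), e);
    }

    // Route a request for a replier agent: serve locally, or name the AD
    // that should receive the forwarded request.
    String route(String aid, String myAdName) {
        Entry e = replierAms.get(aid);
        if (e == null) return "unknown";
        return e.adName().equals(myAdName) ? "local:" + e.address()
                                           : "forward-to:" + e.adName();
    }
}
```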
The higher level is the user application layer, which defines the application running on top of the software agents. The application used in this model for testing is a service discovery algorithm, based on SLP and Jini. We have considered three kinds of agents similar to those of SLP:
1- User Agents (UA) that run service discovery on behalf of client software.
2- Service Agents (SA) that announce location and attributes on behalf of services. In the implemented algorithm, each SA may own different kinds of services. The creation and termination of services are dynamic, so the list of services that each SA owns changes dynamically. The behavior of SAs in our model is similar to that of process groups in MPI.
3- Directory Agents (DA) that store information about services announced in the network, similar to the JLS in Jini. In this model, each AD contains one DA. Because the structure of ADs includes histories of the processes that each agent performs, this model is effective for algorithms such as service discovery. With two different histories implemented for each AD, searching for services on the Grid is facilitated.
4 Simulation Results
This model has been implemented in Java using the GridSim toolkit [16], which we extended to support agent-based programs. The gridsim.agent package has been added, and three different models have been implemented. The three-layer agent-based parallel programming model presented in this paper has been compared with these models through a service discovery algorithm.
The first model involves message passing among nodes that form the service discovery algorithm. The second, called Blackboard, is a simple model for communication among agents [4]. In this communication model, information is made available to all agents in the system through an agent named “Blackboard”. The Blackboard agent acts like a central server. The third model is our own and was fully explained in Section 3.
In order to evaluate our model, we measured the number of messages which were sent by agents until the time all user agents obtained their services. Three different methods were considered for this algorithm:
1- Message passing: In this method, there is no difference between the types of agents. All agents act like nodes in the message passing method.
2- Blackboard: In this method, the Blackboard agent acts like a Directory Agent.
3- Agent Director: In this method, each AD acts like a DA.
Figure 2. The comparison of the average number of messages received by agents
In Figure 2, the vertical axis represents the average number of messages and the horizontal axis represents the number of agents. In the first method, the number of agents means the number of nodes. In the other methods, the number of agents is the sum of the three types of agents explained earlier.
Figure 3. Cost Vs. Number of Agents
The results show that the average number of messages increases as the number of agents increases. This is to be expected: as the number of agents grows, so does the communication among them.
It is observed from Figure 3 that if the number of agents (nodes in message passing) exceeds 40, the cost of communication in the message passing implementation rises suddenly. As the number of nodes increases, the number of messages used for service discovery grows, along with the cost of communication. In other words, due to the lack of a database storing the specifications of the nodes that own services, the cost of message passing rises suddenly when the number of agents exceeds 40. Moreover, AD performs better than Blackboard. We estimate the cost of each method as follows:
- In message passing, the cost is estimated to be between 10 and 100.
- In Blackboard, the cost between agents and the Blackboard agent is estimated to be between 10 and 50.
- In AD, because of the two existing types of communications, two different costs were calculated. One is for ADs and is estimated to be between 50 and 100. The other is between AD and agents, which is estimated to be between 10 and 20.
These estimates are derived from the communication behavior of each method.
Figure 4. Time for different service discovery algorithms Vs. number of agents
It is obvious from Figure 4 that with an increase in the number of agents, execution time also increases. Due to the agent-based nature of Blackboard and AD, the time in which all user agents reach their services is shorter.
5 Conclusion and Future Work
In this paper, we have studied different programming models previously presented for the Grid. Java was chosen for this research because of its advantages for Grid computing, among them portability, easy deployment of bytecode, a component architecture and a wide variety of class libraries. As agent technology is very effective in pervasive computing, and the use of multi-agent systems in pervasive environments poses great challenges, agents are a natural choice. In agent-based software engineering, programs are written as software agents that communicate with each other by exchanging messages through a communication language; how the agents communicate differs in each method. In this paper a three-layer agent-based programming model has been presented that is based on
interactions among agents. We have integrated the benefits of message passing through explicit communication and made use of a Java-based model, whose portability is similar to that of the distributed object method. In addition, agents acting like distributed threads were utilized, forming a three-layer agent-based programming model for the Grid. We have extended the GridSim toolkit simulator and added the gridsim.agent package for agent-based parallel programming. A service discovery algorithm has been implemented as the third layer of the model. Measurements of parameters such as the number of messages, cost and execution time show that our model performs better than the other methods.
6 References
[1] H. Soh, S. Haque, W. Liao, K. Nadiminti, and R. Buyya, "GTPE: A Thread Programming Environment for the Grid", Proceedings of the 13th International Conference on Advanced Computing and Communications, Coimbatore, India, 2005.
[2] I. Foster, C. Kesselman, and S. Tuecke. “The anatomy of the grid: Enabling scalable virtual organizations”. Intl. J. Supercomputer Applications, 2001.
[3] D. Talia, C. Lee, “Grid Programming Models: Current Tools, Issues and Directions”, Grid Computing, G. F. Fran Berman, Tony Hey, Ed., pp. 555–578, Wiley Press, USA, 2003.
[4] C. F. Ngolah, “A tutorial on agent communication and knowledge sharing”, University of Calgary, SENG609.22 Agent-based software engineering, 2003.
[5] H. Soh, S. Haque, W. Liao and R. Buyya, "Grid programming models and environments", in: Advanced Parallel and Distributed Computing, ISBN 1-60021-202-6.
[6] N. T. Karonis, B. Toonen, and I. Foster, "MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface", Journal of Parallel and Distributed Computing (JPDC), vol. 63, pp. 551-563, 2003.
[7] ProActive Team, "ProActive Manual, revised 2.2", INRIA, April 2005.
[8] A. Luther, R. Buyya, R. Ranjan, and S. Venugopal, “Alchemi: A .NET-Based Enterprise Grid Computing System”, Proceedings of the 6th International Conference on Internet Computing (ICOMP’05), June 27-30, 2005, Las Vegas, USA.
[9] A. Luther, R. Buyya, R. Ranjan, and S. Venugopal, “Peer-to-Peer Grid Computing and a .NET-based Alchemi Framework”, High Performance Computing: Paradigm and Infrastructure, L. Y. a. M. Guo, Ed.: Wiley Press, 2005.
[10] C. Campo, "Service Discovery in Pervasive Multi-Agent Systems", Workshop on Ubiquitous Agents on Embedded, Wearable, and Mobile Devices (AAMAS 2002), Bologna, Italy, 2002.
[11] IETF Network Working Group. “Service Location Protocol”, 1997.
[12] Sun Microsystems, "Jini Architectural Overview", White paper, Technical report, 1999.
[13] Salutation Consortium, "Salutation Architecture Overview", Technical report, 1998.
[14] Y. Y. Goland, T. Cai, P. Leach, and Y. Gu. “Simple service discovery protocol/1.0”. Technical report, 1999.
[15] FIPA, The Foundation for Intelligent Physical Agents, http://www.fipa.org/
[16] R. Buyya and M. Murshed, "GridSim: A Toolkit for the Modeling and Simulation of Distributed Resource Management and Scheduling for Grid Computing", The Journal of Concurrency and Computation: Practice and Experience (CCPE), Volume 14, Issue 13-15, pp. 1175-1220, Wiley Press, USA, November-December 2002.
Optimization of Job Super Scheduler Architecture in Computational Grid Environments
M. Shiraz+, M. A. Ansari*
+ Allama Iqbal Open University, Islamabad ([email protected], +920339016430)
* Federal Urdu University of Arts, Sciences & Technology, Islamabad ([email protected], +9203215285504)
Abstract - Distributed applications running over a distributed system communicate through inter process communication (IPC) mechanisms. These mechanisms may operate either within a system or between two different systems. The complexities of IPC adversely affect the performance of the system. Load balancing is an important feature of distributed systems. This research work is focused on the optimization of the Superscheduler architecture, a load balancing algorithm designed for sharing workload on a computational grid. It has two perspectives, i.e. local scheduling and grid scheduling. Some unnecessary inter process communication has been identified in the local scheduling mechanism of the job Superscheduler architecture. The critical part of this research work is the interaction between the grid scheduler and the autonomous local scheduler. In this paper an optimized Superscheduler architecture with an optimal local scheduling mechanism is proposed. Performance comparisons with the earlier architecture are conducted on workloads in a simulation environment. Several key metrics demonstrate that substantial performance gains in local scheduling can be achieved via the proposed Superscheduling architecture in a distributed computing environment.
Keywords: Inter Process Communication (IPC), Distributed Computing, Grid Scheduler, Local Scheduler, Grid Middleware, Grid Queue, Local Queue, Superscheduler (SSCH).
I. INTRODUCTION
Distributed computing has been defined in a number of different ways [1]. Different types of distributed systems are deployed worldwide, e.g. the Internet, intranets and mobile computing. Distributed applications run over distributed systems; they are the main communicating entities at the application layer, e.g. video conferencing, web applications, email and chat software. Each application has its own architecture and requires a specific protocol for its implementation. All distributed applications run over middleware and use its services for IPC. Cluster computing [1] and Grid computing [8][10] are two different forms of distributed system. Load balancing is a challenging feature in a distributed computing environment: the objective is to find underloaded systems and share the processing workload dynamically, so as to efficiently utilize network resources and increase throughput. A job Superscheduler architecture for load balancing in a grid environment has been proposed earlier [9]. This architecture has two schedulers, i.e. an autonomous local scheduler and a grid scheduler, so job scheduling has two perspectives: local scheduling and grid scheduling. Local scheduling is used to schedule jobs on local hosts, while grid scheduling is used to schedule jobs on remote hosts for sharing workload. This research work optimizes a specific aspect of the Superscheduler architecture: in local scheduling, different components of the Superscheduler get involved, i.e. the Grid Scheduler, Grid Middleware, Grid Queue, Local Scheduler and Local Queue. Interaction between different components involves inter process communication (IPC). IPC involves the complexities of context switching and domain transition [7][3]; therefore a large number of IPC operations adversely affects the performance of the system [2]. Some unnecessary IPC has been identified in the local scheduling of the job Superscheduler architecture. This research work is focused on this specific context of the Superscheduler architecture. An optimized architecture with the minimum possible IPC is proposed. Processing a workload in a simulation environment evaluates the performance of both architectures.
Several key metrics demonstrate that substantial performance gains in local scheduling can be achieved via the proposed Superscheduling architecture in a distributed computing environment.
II. RELATED WORK
There are different policies available for job scheduling in a distributed grid environment [4][5][6][11]. The Superscheduler architecture [9] is a load balancing technique for sharing workload in a distributed grid environment. There are three major processes and two data structures used in this architecture: the processes are the Grid Middleware (GM), Grid Scheduler (GS) and Local Scheduler (LS); the data structures are the Grid Queue (GQ) and the Local Queue (LQ). The architecture is illustrated below.
Fig. I. Distributed Architecture of the Grid Superscheduler [9]
During grid Superscheduling, interaction between different components of the architecture occurs as follows. A newly arriving job enters the Grid Queue; the Grid Scheduler computes its resource utilization requirements and queries the Local Scheduler through the Grid Middleware for the Approximate Waiting Time (AWT) of that job on the local system. The job waits in the Grid Queue before beginning execution on the local system. The Local Scheduler computes the AWT based on the local scheduling policy and the Local Queue status; if the local resources cannot fulfill the requirements of the job, an AWT of infinity is returned. If the AWT is below a threshold value, the job is moved directly from the Grid Queue to the Local Queue without any external network communication. If the AWT is greater than the threshold, or infinite, then one of the three migration policies [9] is invoked to select an appropriate underloaded remote host and migrate the job. The processor always processes jobs from a local queue (whether on the local system or on a remote system, i.e. a grid partner). Once a job enters the Local Queue, the Local Scheduler (independent of the Grid Scheduler) monitors its execution; the Grid Scheduler has no control over the Local Scheduler. Analysis of the flow among the different components of the architecture reveals some unnecessary inter process communication, and it is expected that minimizing this communication may improve performance. This work is focused on performance optimization of the local scheduling policy. In the earlier scheduling process, bi-directional communication among the Grid Scheduler, Grid Middleware and Local Scheduler is identified as unnecessary in the case of local job processing. An optimized Superscheduler architecture with minimum IPC is proposed in this paper; in the modified architecture the focus is on minimizing inter process communication as much as possible so that the Superscheduler algorithm is optimized.
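The placement step of this flow can be sketched as a single decision. The threshold value and names below are illustrative, not taken from [9]:

```java
// Sketch of the original Superscheduler's placement step: the Grid
// Scheduler obtains the AWT from the Local Scheduler (infinity when the
// local resources cannot satisfy the job) and either moves the job from
// the Grid Queue to the Local Queue or invokes a migration policy.
public class AwtDecisionSketch {
    static String place(double awt, double threshold) {
        if (awt < threshold) return "GQ->LQ";   // run locally, no external network traffic
        return "migrate";                       // AWT above threshold, or infinite
    }
}
```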
III. PROPOSED ARCHITECTURE
The proposed architecture contains the same components in the same positions. The main change is the sequence of flow in the initial process of job scheduling. The proposed sequence is as follows: a newly arrived job enters the Local Queue (instead of the Grid Queue), and the Local Scheduler (instead of the Grid Scheduler) computes the processing requirements of the newly entered job. If the Approximate Waiting Time (AWT) on the local system is less than a threshold φ, the Local Scheduler schedules the job using the local scheduling policy, without involving the Grid Queue, Grid Scheduler or Grid Middleware at all. None of these components is needed in local scheduling; they are needed only when the local system is overloaded and cannot execute the task as efficiently as other systems in the grid environment. In that situation the Local Scheduler communicates with the Grid Scheduler through the Grid Middleware, sending the processing requirements of the newly arrived job. The job is then moved from the Local Queue to the Grid Queue. The Grid Scheduler then initiates one of the job migration policies [9], as in the earlier architecture, and migrates the job to the best available host on the grid
environment. The proposed architecture is shown in the following figure.
Fig. II Proposed Architecture.
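The reordered flow above can be sketched as a trace of steps. The component names follow the text; the trace format and the method name are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed sequence: a new job enters the Local Queue
// first, and the grid-side components (GM, GS, GQ, migration) appear in
// the trace only when the local AWT reaches the threshold φ (phi).
public class OptimizedFlowSketch {
    static List<String> trace(double awt, double phi) {
        List<String> steps = new ArrayList<>();
        steps.add("job -> LQ");                    // direct submission, no IPC delay
        steps.add("LS computes AWT");
        if (awt < phi) {
            steps.add("LS schedules locally");     // no GS/GM/GQ involvement
        } else {
            steps.add("LS -> GM -> GS");           // escalate with job requirements
            steps.add("LQ -> GQ");                 // move the job to the Grid Queue
            steps.add("GS invokes migration policy");
        }
        return steps;
    }
}
```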
IV. RESULTS AND DISCUSSION
The workload has been processed in a simulation environment. Table I shows the workload processed through the simulator.
TABLE I Workload Processed through Simulator
Table I is composed of the following attributes:
1. Job Number: a counter field, starting from 1.
2. Input Time: in seconds. It represents the time at which the gridlet (a single job) is submitted for processing. The earliest time the log refers to is zero, the submittal time of the first job. The lines in the log are sorted by ascending submittal times.
3. Run Time: in seconds. The time for which the gridlet will use a single processing element (PE) of the CPU.
4. Number of Allocated Processors: an integer. In most cases this is also the number of processors the job uses. A job may require more than one PE.
TABLE II
Comparison of simulation output of the Superscheduler Architecture (SSCH) vs. the Proposed Optimized Superscheduler Architecture (OSCH)
Columns: GridletId, Input Time, Local Queue Arrival Time in SSCH, Local Queue Arrival Time in OSCH, IPC Delay in SSCH, IPC Delay in OSCH, Total Cost in SSCH, Total Cost in OSCH
1 5 7.25 6.25 1 0 124 123
2 8 11.25 9.25 2 0 185 183
3 10 14.25 11.25 3 0 216 213
4 12 17.25 13.25 4 0 247 243
5 15 21.25 16.25 5 0 278 273
6 18 25.25 19.25 6 0 307 303
7 20 28.25 21.25 7 0 337 333
8 23 32.25 24.25 8 0 367 363
9 27 37.25 28.25 9 0 400 393
10 29 40.25 30.25 10 0 429 423
11 33 45.25 34.25 11 0 462.75 453
12 34 47.25 35.5 11.75 0 490 483
13 39 48.5 40.25 8.25 0 519.25 513
14 41 53.25 42.25 11 0 552.75 543
15 42 56.25 43.5 12.75 0 584.5 573
16 43 58.25 44.75 13.5 0 615.25 603
17 44 60.25 46 14.25 0 644 633
18 48 62.25 49.25 13 0 674.75 663
19 49 69.25 50.5 18.75 0 710.5 693
20 50 71.25 51.75 19.5 0 741.25 723
21 51 73.25 53 20.25 0 772 753
22 52 75.25 54.25 21 0 802.75 783
23 53 77.25 55.5 21.75 0 833.5 813
24 54 79.25 56.75 22.5 0 864.25 843
25 55 81.25 58 23.25 0 896.25 873
TABLE I data (workload processed through the simulator)
Columns: Job ID, Input Time, Run Time, Number of PEs Required
1 5 20 2
2 8 30 2
3 10 35 2
4 12 40 2
5 15 45 2
6 18 50 2
7 20 55 2
8 23 60 2
9 27 65 2
10 29 70 2
11 33 75 2
12 34 80 2
13 39 85 2
14 41 90 2
15 42 95 2
16 43 100 2
17 44 105 2
18 48 110 2
19 49 115 2
20 50 120 2
21 51 125 2
22 52 130 2
23 53 135 2
24 54 140 2
25 55 145 2
The simulation output of the workload processed through both Job Superscheduling architectures is compared in Table II. This table has the following attributes:
1. Gridlet Id: This attribute represents the job id.
2. Status: This attribute represents the status of the processed job: if it has been processed successfully, the value of this field is successful; otherwise the value is unsuccessful. As all the values of this field are successful, all jobs were processed successfully.
3. Input time: This attribute shows the time at which a job enters the system initially.
4. Local Queue Arrival Time: It shows the time at which the jobs enter the local queue for local processing.
5. Inter Process Communication Delay (IPCD): This attribute indicates the IPC delay experienced by each job before entering the local queue. Its value is derived by subtracting the local queue arrival time of each job in the optimized architecture from its local queue arrival time in the earlier scheduling architecture. In the earlier architecture each job experiences an IPC delay before entering the local queue, while in the optimized architecture each job is submitted directly to the local queue and therefore experiences no IPC delay.
6. Execution Start Time: This attribute represents the time at which a job is picked up from the local queue by the processor.
7. Finish Time: This attribute represents the time at which a job leaves the processor.
8. Processing Cost: The difference between finish time and execution start time is called the processing cost of each job.
9. Total Cost: This attribute shows the total cost of each processed job, computed from the processing cost and the IPC delay cost of the job.
The comparison of workload processing through both architectures shows a difference between the local queue arrival times of a job under the two techniques. For example, job 1 arrives at the local queue 1 second later in the earlier scheduling technique (7.25) than in the proposed technique (6.25), while the input time is the same for both (5). Similarly, job 2 is submitted at time 8 in both techniques, but arrives at the local queue 2 seconds later in the earlier technique (11.25) than in the proposed optimized technique (9.25). Job 25 is submitted at time 55 and arrives 23.25 seconds later in the earlier technique (81.25) than in the proposed technique (58). This is because of the extra communication among the different components of the Superscheduler architecture. The total cost comparison in Table II indicates that the total cost of a gridlet depends on two parameters: the processing cost and the IPC delay cost. The processing cost depends on the run time and the number of CPUs needed by the gridlet. Simulation results show that the processing cost remains the same for both scheduling techniques, while the earlier technique experiences extra IPC delay, which increases the total cost. If a gridlet experiences n seconds of IPC delay, its total cost increases by n. For example, in Table II, gridlet 1 has a total cost of 123 in the proposed optimized technique; it experiences an IPC delay of one second in the earlier technique, so its total cost there is 124, an increase of one. Similarly, the total cost of each gridlet increases according to its IPC delay. In the proposed architecture IPC has been minimized in local scheduling, so the performance of the local processing scenario is improved. The results of the comparison in Table II are elaborated through the following charts.
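The two cost relations described in the text can be written out directly and checked against a row of Table II. Only the table's published values for gridlet 1 are used here; the method names are illustrative.

```java
// IPC delay = (LQ arrival time in SSCH) - (LQ arrival time in OSCH), and
// total cost in SSCH = total cost in OSCH + IPC delay. For gridlet 1 of
// Table II: delay = 7.25 - 6.25 = 1, and 123 + 1 = 124.
public class CostSketch {
    static double ipcDelay(double sschArrival, double oschArrival) {
        return sschArrival - oschArrival;
    }
    static double sschTotalCost(double oschTotalCost, double ipcDelay) {
        return oschTotalCost + ipcDelay;
    }
}
```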
[Chart: IPC delay (y-axis) versus gridlet ID (x-axis); series: IPC Delay in SSCH, IPC Delay in OSCH]
Fig. III. IPC delay comparison for each job in both scheduling techniques
As stated earlier, in the optimized architecture each job enters the local queue directly instead of the grid queue, and therefore experiences no IPC delay. Fig. III compares the IPC delay experienced by each job in both architectures: the blue line shows the trend of increasing IPC delay in the earlier architecture, while the pink line shows the zero unnecessary IPC delay for each job in the optimized one. The graph is based on the simulation output.
[Chart: local queue arrival time (y-axis) versus gridlet ID (x-axis); series: Local Queue Arrival Time in SSCH, Local Queue Arrival Time in OSCH]
Fig. IV: Comparison of local queue arrival time in both architectures.
Fig. IV compares the local queue arrival time of each job in both architectures. The purple line shows the arrival time in the earlier architecture; the red line shows the arrival time in the optimized architecture. The figure clearly shows that jobs take longer to enter the local queue in the earlier architecture. This is because in the earlier architecture a job first enters the grid queue and then experiences IPC delay, caused by inter-process communication among the components of the architecture, before entering the local queue. In the optimized architecture each job enters the local queue directly for processing, so there is no unnecessary inter-process communication and hence no IPC delay.
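The arrival times reported in Table II can be reproduced with a simple model. This is an assumed decomposition inferred from the reported numbers (a fixed 1.25 s submission overhead plus any IPC delay), not the simulator's internals.

```python
# Assumed decomposition of local-queue arrival time, inferred from the
# reported numbers: submit time + fixed overhead + IPC delay (earlier
# architecture only). Illustrative only, not the simulator's code.

SUBMIT_OVERHEAD = 1.25  # assumed constant, seen under both schedulers

def arrival_time(submit_time: float, ipc_delay: float = 0.0) -> float:
    return submit_time + SUBMIT_OVERHEAD + ipc_delay

# Job 1 (submitted at t=5): 6.25 optimized vs 7.25 earlier (1 s IPC delay).
# Job 2 (submitted at t=8): 9.25 optimized vs 11.25 earlier (2 s IPC delay).
```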
[Chart: total cost (= processing cost + IPC delay cost, y-axis) versus gridlet ID (x-axis); series: Total Cost in SSCH, Total Cost in OSCH]
Fig. V: Comparison of the Superscheduler architecture with the optimized scheduler architecture.
Fig. V compares the total cost of each job processed through both architectures. The blue line shows the total cost of each job in the earlier architecture; the red line shows the total cost of each job in the optimized architecture. The total cost of each job is computed from its processing cost and its IPC delay cost. Each job has the same processing cost in both architectures, but in the earlier architecture there is an additional IPC delay cost for each job, so the total costs differ. The total cost in the earlier architecture shows an increasing trend because of the unnecessary IPC delay.
V. CONCLUSION
The simulation results support the conclusions of this work. The extra inter-process communication involved in the earlier Superscheduler architecture during local job processing adversely affects the performance of the local system: a large number of IPC operations means a large number of domain transitions and context switches, so each job experiences unnecessary IPC delay before reaching the local queue. This unnecessary communication has been eliminated in the proposed architecture, and the results show a substantial performance improvement.
VI. FUTURE WORK
This work does not consider grid-level scheduling, which is another perspective of the Superscheduler architecture. The work may be extended to external scheduling as well, to further improve total system performance. Future work may also consider job migration policies for external job migration; the current policies do not account for many of the complexities involved in network communication, so a more optimized architecture is expected to result.
SESSION
GRID COMPUTING APPLICATIONS + ALGORITHMS
Chair(s)
TBA
Rapid Prototyping Capabilities for Conducting Research of Sun-Earth System
T. Haupt, A. Kalyanasundaram, I. Zhuk
High Performance Computing Collaboratory, Mississippi State University, USA
Abstract - This paper describes the requirements, design, and implementation progress of an e-Science environment that enables rapid evaluation of potential uses of NASA research products and technologies to improve future operational systems for societal benefits. The project is intended to be a low-cost effort focused on integrating existing open source, public domain, and/or community-developed software components and tools. Critical for success is a carefully designed implementation plan allowing for incremental enhancement of the scale and functionality of the system while keeping the system operational and hardening its implementation. This has been achieved by rigorously following the principles of separation of concerns, loose coupling, and service-oriented architectures, employing Portlet (GridSphere), Service Bus (ServiceMix), and Grid (Globus) technologies, as well as introducing a new layer on top of the THREDDS data server. At the current phase, the system provides data access through a data explorer that allows the user to view the metadata and provenance of the datasets; invoke data transformations such as subsampling, re-projection, format translation, and de-clouding of selected data sets or collections; and generate simulated data sets approximating data feeds from future NASA missions.
Keywords: Science Portals, Grid Computing, Rich Interfaces, Data Repository, Online tools
1 Introduction
1.1 Objectives of Rapid Prototyping Capability
The overall goal of the National Aeronautics and Space Administration's (NASA) initiative to create a Rapid Prototyping Capability (RPC) is to speed the evaluation of potential uses of NASA research products and technologies to improve future operational systems by reducing the time to access, configure, and assess the effectiveness of NASA products and technologies. The developed RPC infrastructure will accomplish this goal and contribute to NASA's Strategic Objective to advance scientific knowledge of the Earth system through space-based observation, assimilation of new observations, and development and deployment of enabling technologies, systems, and capabilities, including those with potential to improve future operational systems.
Figure 1: The RPC concept as an integration platform for composing, executing, and analyzing numerical experiments for Earth-Sun System Science supporting the location transparency of resources.
The infrastructure to support Rapid Prototyping Capabilities (RPC) is thus expected to provide the capability to rapidly evaluate innovative methods of linking science observations. To this end, the RPC should provide the capability to integrate the tools needed to evaluate the use of a wide variety of current and future NASA sensors and research results, model outputs, and knowledge, collectively referred to as “resources”. It is assumed that the resources are geographically distributed and thus RPC will provide the support for the location transparency of the resources.
This paper describes a particular implementation of an RPC system under development by the Mississippi Research Consortium, in particular Mississippi State University, under a NASA/SSC contract as part of the NASA Applied Sciences Program. This is a work in progress, about one year from the project's inception.
1.2 RPC experiments
Results of NASA research (including NASA partners) provide the basis for candidate solutions that demonstrate the capacity to improve future operational systems through activities administered by NASA’s Applied Sciences Program. Successfully extending NASA research results to operational organizations requires science rigor and capacity throughout the pathways from research to operations. A framework for the extension of applied sciences activities involves a Rapid Prototyping Capability (RPC) to accelerate
Figure 2: Two major categories of experiments and subsequent analysis to be supported by the RPC.
the evaluation of research results in an effort to identify candidate configurations for future benchmarking efforts. The results from the evaluation activity are verified and validated in candidate operational configurations through RPC experiments. The products of RPC studies will be archived and made accessible to all customers, users, and stakeholders via the Internet, with the purpose of being utilized in competitively selected experiments proposed by the applied sciences community through NASA's "Decisions" solicitation process [1].
Examples of currently funded RPC experiments (through the NASA grant to the Mississippi Research Consortium (MRC)) include: Rapid Prototyping of new NASA sensor data into the SEVIR system, Rapid prototyping of hyperspectral image analysis algorithms for improved invasive species decision support tools, an RPC evaluation of the watershed modeling program HSPF to NASA existing data, simulated future data streams, and model (LIS) data products, and Evaluation of the NASA Land Information System (LIS) using rapid prototyping capabilities.
2 System Requirements
The requirements for the infrastructure to support RPC experiments fall into two categories: (1) a computational platform seamlessly integrating geographically distributed resources into a single system to perform RPC experiments, and (2) a collaborative environment for the dissemination of research results enabling a peer-review process.
2.1 Enabling RPC Experiments
The RPC is expected to support at least two major categories of experiments (and subsequent analysis): comparing results of a particular model as fed with data coming from different sources, and comparing different models using the data coming from the same source, as depicted in Fig. 2.
Despite being conceptually simple, the two use cases defined in Fig. 2 in fact entail significant technical challenges. The barriers currently faced by researchers include inadequate data access mechanisms, a lack of simulated data approximating feeds from sensors to be deployed by future NASA missions, a plethora of data formats and metadata systems, complex multi-step data pre-processing, and rigorous statistical analysis of results (comparisons between results obtained using different models and/or data).
The data from NASA and other satellite missions are distributed by Distributed Active Archive Centers (DAACs) operated by NASA and its partners. The primary focus of the DAACs is to feed post-processed data (e.g., calibrated, corrected for atmospheric effects, etc.), referred to as data products, for operational use by the US government and organizational users around the world. Access to the data by an individual researcher is currently cumbersome (requests are processed asynchronously, as the data in most cases are not readily available online), and the pre-processing performed by the DAACs usually does not meet the researcher's needs. In particular, the purpose of many RPC experiments is to define new pre-processing procedures that, if successful, can later be employed by the DAACs to generate new data products.
The pre-processing of the data takes many steps, and the steps to be performed depend on the technical details of the sensor and the nature of the research. For the sake of brevity, only Moderate Resolution Imaging Spectroradiometer (MODIS) [2] data are discussed here as a representative example. MODIS sensors are deployed on two platforms, Aqua and Terra, which view the entire Earth's surface every 1 to 2 days, acquiring data in 36 spectral bands. The data (the planetary reflectance) are captured in swaths 2330 km wide (cross track) by 10 km (along track at nadir). The post-processing of MODIS data may involve the selection of the region of interest (which may require combining several swaths taken at different times, and possibly merging data from Terra and Aqua), sub-sampling, re-projection, band selection or the computation of vegetation or moisture indices by combining data from different spectral bands, noise removal and de-clouding, feature detection, correlation with GIS data, and more. The post-processed data are then fed into computational models and/or compared with in situ observations (changes in vegetation, changes in soil moisture, fires, etc.). Of particular interest for the RPC experiments currently performed by the MRC is the time evolution of the Normalized Difference Vegetation Index (NDVI), defined as (NIR-RED)/(NIR+RED), where RED and NIR stand for the spectral reflectance measurements acquired in the red and near-infrared regions, respectively. Different algorithms are being tested for eliminating gaps in the data caused by cloud cover, by fusing data collected by Aqua and Terra and by weighted spatial and temporal interpolation. Finally, the comparison of data coming from different sources (and the corresponding model predictions) requires handling differences in spatial and temporal resolution, satellite orbits, spectral bands, and other sensor characteristics.
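The NDVI formula quoted above is straightforward to implement. The following is an illustrative per-pixel sketch, not the MRC processing pipeline; the zero-denominator convention is an assumption.

```python
# Minimal sketch of the NDVI computation described above (illustrative;
# real pipelines operate on whole rasters, not single pixels).

def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index: (NIR - RED) / (NIR + RED)."""
    if nir + red == 0:
        return 0.0  # assumed convention for undefined (zero-reflectance) pixels
    return (nir - red) / (nir + red)

# Healthy vegetation reflects strongly in the near-infrared and absorbs
# red light, so its NDVI is close to 1; bare soil or clouds score lower.
ndvi(0.6, 0.1)  # dense vegetation, roughly 0.71
```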
Enabling RPC experiments, in this context, thus means a radical simplification of access to both actual and simulated data, as well as to tools for data pre- and post-processing. The tools must be interoperable, allowing the user to create computational workflows with the data seamlessly transferred as needed, including third-party transfers to high-performance computing platforms. In addition, the provenance of the data must be preserved in order to document the results of different what-if scenarios and to enable collaboration and data sharing between users.
The development of the RPC system does not involve developing the tools for data processing. These tools are expected to be provided by the researchers performing experiments, projects focused on the tool development, and the community at large. Indeed, many tools for handling Earth science data are available from different sources, including NASA, USGS, NOAA, UCAR/Unidata, and numerous universities. Instead, the RPC system is expected to be an integration platform supporting adding (“plugging in”) tools as needed.
Enabling Community-Wide Peer-Review Process
The essence of the RPC process is to provide an evaluation of the feasibility of transferring research capabilities into routine operations for societal benefits. The evaluation should result in a recommendation for the NASA administrators to pursue or abandon the topic under investigation. Since making an evaluation requires narrow expertise in a given field (e.g., invasive species, crop prediction, fire protection, etc.), the results presented by a particular researcher need to be peer-reviewed. One way of doing that is publishing papers in professional journals and conferences. However, this introduces latency into the process, and the information given in a paper is not always sufficient for a conclusive evaluation of the research results. The proposed remedy is to provide means for publishing the results electronically, that is, giving the community access not only to the final reports and publications but also to the data used and/or produced during the analysis, as well as to the tools used to derive the conclusions of the evaluation. The intention is not to have peer scientists repeat a complete experiment, which may involve processing voluminous data on high-performance systems, but rather to provide means for testing the new procedures, tools, and final analysis developed in the course of performing the experiment.
Design Considerations
The development of an RPC system satisfying all the requirements described above is an immense task. Consequently, one of the most important design decisions was to prioritize the system features and select the sequence of actions that would lead towards the implementation of the full functionality. Taking into account the particular needs of the experiments carried out by the MRC, the following implementation roadmap has been agreed upon [3].
Phase I: Interactive web site for describing the experiments and gathering feedback from the community. All experiments are performed outside the RPC infrastructure.
Phase II: RPC data server acting as a cache for experimental data (Unidata's THREDDS server [4]). In the prototype deployment, a small amount (~6 TBytes) of disk space is made available to the experimenters, with support for transfers of the data between the RPC data server and a hierarchical storage facility at the High Performance Computing Collaboratory (HPCC) at Mississippi State University via a 2 Mbytes/s link. The experimenters obtain the data from the DAACs "the old way" (through asynchronous requests) and store them at HPCC, or generate them using computational models run on HPCC Linux clusters. Once transferred to the RPC data server, the data sets are available online. This is a transitional step, and the experiments are still executed outside the RPC infrastructure. However, since the data are online, they can be accessed by various standalone tools, such as Unidata's Integrated Data Viewer (IDV) [5].
Phase III: Online tools for data processing ("transformations"). The tools are deployed as web services and integrated with the RPC data server. Through a web interface, the user sets the transformation parameters and selects the input data sets by browsing or searching the RPC data server. The results of the transformations (together with the available provenance information) are stored back on the RPC data server at the location specified by the user. The provenance information depends on the tool: in some cases it is just the input parameter files and standard output, while other tools generate additional log files and metadata. Since the THREDDS server "natively" handles data in the netCDF [6] format, the primary focus is given to tools for transforming NASA's HDF-EOS [7] format (for example, MODIS data are distributed in this format), including the HDF-EOS to GeoTIFF Conversion Tool (HEG) [8], which supports reformatting, band selection, subsetting, and stitching, and the MODIS re-projection tools (MRT [9] and MRTSwath [10]). The second class of tools integrated with the RPC system comprises the Applications Research Toolbox (ART) and the Time Series Product Tool (TSPT), developed specially for the RPC system by the Institute for Technology Development (ITD) [11] located at the NASA Stennis Space Center. The ART tool is used for generating simulated Visible/Infrared Imager/Radiometer Suite (VIIRS) data. VIIRS is a part of the National Polar-orbiting Operational Environmental Satellite System (NPOESS) program, and it is expected to be deployed in 2009; the VIIRS data will replace MODIS. The TSPT tool generates layer stacks of various data products to assist in time series analysis (including de-clouding); in particular, TSPT operates on MODIS and simulated VIIRS data. Finally, the RPC system integrates the Performance Metrics Workbench tools for data visualization and statistical analysis. These tools have been developed at the GeoResources Institute at Mississippi State University and the Geoinformatics Center at the University of Mississippi. At this phase, the experimenters can use the RPC portal for rapid prototyping of experimental procedures, using online interactive tools on data uploaded to the RPC data server. Furthermore, peer researchers can test the proposed methods using the same data sets and the same tools.
Phase IV: Support for batch processing. The actual data analysis needed to complete an experiment usually requires processing huge volumes of data (e.g., a year's worth), which is impractical to perform interactively through web interfaces. Instead, support for asynchronous ("batch") processing is provided. The tools are still deployed as web services; however, they delegate the execution to remote high-performance computational resources. The user selects a range of files (or a folder), sets the transformation parameters, and submits the processing of all selected files by clicking a single submit button. Since the data pre-processing is usually embarrassingly parallel (the same operation is repeated for each file, or for each pixel across a group of files in TSPT), the user automatically gains by using the Portal, as the system seamlessly makes all necessary data transfers and parallelizes the execution. Since batch execution is asynchronous, the Portal provides tools for monitoring the progress of the task. Furthermore, even very complex computational models (as opposed to relatively simple data transformation tools) can easily be converted into a Web service, and thus all of the computational needs of the user can be satisfied through the Portal. At this phase the user may actually perform the experiment using the RPC infrastructure, assuming that the input data sets are "prefetched" to the RPC data server or the HPCC storage facility, all computational models are installed on HPCC systems, and all tools are wrapped as Web services.
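The "embarrassingly parallel" batch pattern described above can be sketched as follows. This is an illustrative stand-in, not the RPC system's code; the transformation, worker count, and file names are all hypothetical.

```python
# Hedged sketch of the batch pattern: the same transformation applied
# independently to each file, so the runs can proceed in parallel.
# Names (transform, FILES) are illustrative, not the RPC system's API.

from concurrent.futures import ThreadPoolExecutor

def transform(filename: str) -> str:
    # stand-in for a per-file data transformation (e.g., a re-projection)
    return filename.upper()

FILES = ["jan.hdf", "feb.hdf", "mar.hdf"]

# Each file is processed independently; no coordination is needed
# between tasks, which is what makes the problem embarrassingly parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(transform, FILES))
```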
Phase V: The RPC system is deployed at NASA Stennis Space Center, and it becomes a seed for a Virtual Organization (VO). Each deployment comes with its own portal, creating a network of RPC points of access. Each Portal deploys a different set of tools that are accessible through a distributed Service Bus. Each site contributes storage and computational resources that are shared across the VO. In collaboration with DAACs the support for online access is developed.
Implementation
Grid Portal
The functionality of the RPC Portal naturally splits into several independent modules, such as the interactive Web site, the data server, tool interfaces, and the monitoring service. Each such module is implemented as an independent portlet [12]. The RPC Portal aggregates the different contents provided by the portlets into a single interface, employing the popular GridSphere [13] open source portlet container. GridSphere,
while fully JSR-168 compliant, also provides out-of-the-box support for user authentication and maintaining user credentials (X509 certificates, MyProxy [14]), vital when providing access to remote resources.
Access to full functionality of the Portal, which includes the right to modify the contents served by the Portal, is granted only to the registered users who must explicitly login to start a Portal session. To access remote resources, in addition, the user must upload his or her certificate to the myProxy server associated with the Portal, using a portlet developed by the GridSphere team. In phases II - IV of the RPC system deployment, the only remote resources available to the RPC users are those provided by HPCC. Remote access to the HPCC resources is granted to registered users with certificates signed by the HPCC Certificate Authority (CA).
Phase V of the deployment calls for establishing a virtual organization allowing the users to access the resources made available by the VO participants, perhaps including the NASA Columbia project and TeraGrid. To simplify the user's task of obtaining and managing certificates, the Grid Account Management Architecture (GAMA) [15] will be adopted. It remains to be determined, though, which CA(s) will be recognized.
Interactive Web Site
It is imperative for the RPC system to provide an easy way for the experimenters to update the contents of the web pages describing their experiments, and in particular to avoid intermediaries such as a webmaster. A ready-to-use solution for this problem is a wiki: a collaborative website which can be directly edited by anyone with access to it [16]. From the several available open source implementations, MediaWiki [17] has been chosen for the RPC portal, as the RPC developers are impressed by the robustness of an implementation proven by the multi-million-page Wikipedia [18].
MediaWiki is deployed as a portlet managed by GridSphere. The only (small) modification introduced to MediaWiki in order to integrate it with the RPC Portal is replacing the direct MediaWiki login with an automatic login for users who have successfully logged in to GridSphere. With this modification, by a single login to the RPC Portal the user not only gets access to the RPC portlets and to the remote resources accessible to RPC users (through automatic retrieval of the user certificate from the MyProxy server) but also acquires the rights to modify the Wiki contents.
The rights to modify the Wiki contents are group-based. Each group is associated with a namespace, and only members of the group can make modifications to the pages in the associated namespace. For example, only participants of an RPC experiment can create and update pages describing that experiment, while anyone can contribute to the blog area and participate in the discussion of the experimental pages. In addition, each group is associated with a private namespace, not accessible to nonmembers at all, which enables collaborative development of confidential contents.
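The group-to-namespace edit rule described above can be sketched as a simple permission check. This is an assumed illustration of the policy, not how MediaWiki's permission hooks are actually implemented; the group and namespace names are hypothetical.

```python
# Hedged sketch of the group-based edit rule: a user may edit pages in a
# namespace only if one of their groups owns it; the main (unprefixed)
# namespace - blog and discussion pages - is open to everyone.
# Group/namespace names below are hypothetical examples.

GROUP_NAMESPACE = {"experiment-hspf": "HSPF", "experiment-lis": "LIS"}

def can_edit(user_groups: set, namespace: str) -> bool:
    if namespace == "":
        return True  # main namespace: anyone may contribute
    return any(GROUP_NAMESPACE.get(g) == namespace for g in user_groups)

can_edit({"experiment-hspf"}, "HSPF")  # member of the owning group
can_edit({"experiment-lis"}, "HSPF")   # member of a different group
```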
Data Server
The science of the Sun-Earth system is notorious for collecting an incredible amount of observational data that come from different sources, in a variety of formats, and with inconsistent and/or incomplete metadata. The solution of the general problem of managing such data collections is the subject of numerous research efforts and goes far beyond the scope of this project. Instead, for the purpose of the RPC system it is desirable to adopt an existing solution representing the community's common practice. Even though such a solution is necessarily incomplete, by virtue of being actually used by researchers it is useful enough and robust enough to be incorporated into the RPC infrastructure.
From the available open source candidates, Unidata's THREDDS Data Server (TDS) has been selected and deployed as a portlet. In order to better integrate it with the rest of the RPC Portal functionality, and in particular to provide user-friendly interfaces to the data transformations, a thin layer of software on top of TDS, referred to as the TDS Explorer, has been developed.
THREDDS Data Server
THREDDS (Thematic Real-time Environmental Distributed Data Services) [4] is middleware developed to simplify the discovery and use of scientific data and to allow scientific publications and educational materials to reference scientific data. Catalogs are the heart of the THREDDS concept. They are XML documents that describe on-line datasets and can contain arbitrary metadata. The THREDDS Catalog Generator produces THREDDS catalogs by scanning or crawling one or more local or remote dataset collections. Catalogs can be generated periodically or on demand, using configuration files that control which directories get scanned and how the catalogs are created.
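To make the catalog idea concrete, the sketch below parses a deliberately simplified THREDDS-style catalog with the Python standard library. The XML here is illustrative only: real THREDDS catalogs use an XML namespace and a richer schema than shown.

```python
# Hedged illustration: a minimal THREDDS-style catalog (simplified from
# the real schema, which is namespaced) listing one dataset, parsed with
# the standard library. Names and urlPath are hypothetical.
import xml.etree.ElementTree as ET

catalog_xml = """
<catalog name="RPC data server">
  <dataset name="MODIS NDVI 2007" urlPath="modis/ndvi_2007.nc">
    <metadata><serviceName>odap</serviceName></metadata>
  </dataset>
</catalog>
"""

root = ET.fromstring(catalog_xml)
# Collect the dataset names advertised by the catalog.
names = [d.get("name") for d in root.iter("dataset")]
```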
THREDDS Data Server (TDS) actually serves the contents of the datasets, in addition to providing catalogs and metadata for them. The TDS uses the Common Data Model to read datasets in various formats, and serves them through OPeNDAP, OGC Web Coverage Service, NetCDF subset, and bulk HTTP file transfer services. The first three allow the user to obtain subsets of the data, which is crucial for large datasets. Unidata’s Common Data Model (CDM) is a common API for many types of data including OPeNDAP, netCDF, HDF5, GRIB 1 and 2, BUFR, NEXRAD, and GINI. A pluggable framework allows other developers to add readers for their own specialized formats.
Out of the box, TDS provides most of the functionality needed to support data sets commonly used in climatology applications (e.g., weather forecasting, climate change) and GIS applications, because of the supported file formats. It is possible to create CDM-based modules to handle other data formats, in particular HDF4-EOS, which is critical for many RPC experiments; however, that would possibly lead to loss of the metadata embedded in the HDF headers. Furthermore, while TDS supports subsetting of CDM-based data sets, it does not allow other operations often performed on HDF4-EOS data, such as re-projections. To minimize the modifications and extensions to TDS needed to integrate it with the RPC infrastructure, the new functionality needed for the RPC is developed as a separate package (a web application) that acts as an intermediary between the user interface and TDS. Requests for services that TDS can render are forwarded to TDS, while the others are handled by the intermediary: the TDS Explorer.
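The request-forwarding pattern just described can be sketched as a simple dispatcher. The service names and return strings below are hypothetical, chosen only to illustrate the split between TDS-handled and intermediary-handled requests.

```python
# Hedged sketch of the intermediary pattern: requests TDS can serve are
# forwarded to it, everything else is handled by the TDS Explorer layer.
# Service names are illustrative, not the actual RPC implementation.

TDS_SERVICES = {"catalog", "subset", "opendap"}

def route(request: str) -> str:
    if request in TDS_SERVICES:
        return f"forwarded to TDS: {request}"
    return f"handled by TDS Explorer: {request}"

route("subset")        # a subsetting request goes straight to TDS
route("reprojection")  # re-projection is not a TDS service
```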
TDS Explorer
The TDS native interface allows browsing the TDS catalog one page at a time, which makes the navigation of a hierarchical catalog a tedious process. To remedy that, a new interface inspired by the familiar Microsoft File Explorer (MSFE) has been developed. The structure of the catalog is now represented as an expandable tree in a narrow pane (iframe) on the left-hand side of the screen. Selecting a tree node or leaf displays the catalog page of the corresponding data collection or data set, respectively, in an iframe occupying the rest of the screen. The TDS Explorer not only makes the navigation of the data repository more efficient, it also simplifies the development of interfaces to other services not natively supported by TDS. Among these services are:
• Creating containers for new data collections and uploading new data sets. The creation of a new container is analogous to creating a new folder in MSFE: select a parent folder, select the "new collection" menu option, type the name of the new collection in a pop-up iframe, and click OK (or Cancel). There are two modes of uploading files: from the user workstation using HTTP, and from a remote location using GridFTP (until Phase V of the deployment, only the HPCC storage facility). In either case, select the destination collection, select the "uploadHTTP" or "uploadGridFTP" menu option, select the file(s) in the file chooser pop-up, and click OK (or Cancel).
• Renaming and deleting datasets and collections: select the dataset or data collection and use the corresponding menu option.
• Downloading the data, either to the user desktop using HTTP or by transferring it to a remote location using GridFTP.
• Displaying the provenance of a dataset. By choosing this option, the list of files is displayed (instead of the
TDS catalog page) that were generated when creating the dataset, if any. Typically, the provenance files are generated when an RPC tool is used to create a dataset, and the list may include the standard output, the input parameter file, a log file, a metadata record, or others, depending on the tool.
• Invoking tools for the selected dataset(s) or collection. Some tools operate on a single dataset (e.g., the multispectral viewer), others may be invoked for several datasets (e.g., the HEG tool), and yet others operate on data collections (e.g., TSPT). The tool GUIs pop up as new iframes. The tools are described in Section 4 below.
The user interface of the TDS Explorer is implemented using JavaScript, including AJAX. The server-side functionality is a web application using JSP technology. The file operations (upload, delete, rename) are performed directly on the file system. Changes in the file system are propagated to the TDS Explorer tree by forcing TDS to recreate the catalog by rescanning the file system (with optimizations that avoid rescanning folders that have not changed). Finally, the TDS Explorer web application invokes the TDS API for translating dataset logical names into physical URLs, as needed.
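The rescan optimization mentioned above (skipping folders that have not changed) can be sketched as follows. The function and cache structure are hypothetical illustrations, not TDS's actual implementation; only the mtime-based skip logic is shown:

```python
import os

def rescan(root, mtime_cache):
    """Revisit the catalog tree, rebuilding entries only for folders whose
    modification time changed since the last scan. mtime_cache maps
    directory paths to the mtime recorded at the previous scan.
    (Illustrative sketch; TDS implements its own catalog rescan.)"""
    changed = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        mtime = os.stat(dirpath).st_mtime
        if mtime_cache.get(dirpath) != mtime:
            changed.append(dirpath)   # a real scan would rebuild entries here
            mtime_cache[dirpath] = mtime
    return changed
```

On a repeated scan with no file-system changes, the function returns an empty list, so no catalog entries are rebuilt.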
RPC Tools From the perspective of the integration, there are three types of tools. The first type is standalone tools capable of connecting to the RPC data server to browse and select datasets, but otherwise performing all operations independently of the RPC infrastructure. The Unidata IDV is an example of such a tool. Such tools require no support from the RPC infrastructure except for making the RPC data server conform to accepted standards (such as DODS). One of the advantages of TDS is that it supports many commonly used data protocols; consequently, the data served by the RPC data server may be accessed by many existing community-developed tools, immediately enhancing the functionality of the RPC system.
The second type is transformations that take a dataset or a collection as input and output the transformed files. Examples of such transformations are HEG, MRT, ART, and TSPT. They come with a command line interface (a MatLab executable in the case of ART and TSPT) and are controlled by a parameter input file. The integration of such tools with the RPC infrastructure is done in two steps. First, a Web-based GUI is developed (using JavaScript and AJAX as needed to create lightweight but rich interfaces) to produce the parameter file. The GUI is integrated with the TDS Explorer to simplify the user's task of selecting the input files and defining the destination of the output files. The second step is to install the executable on the system of choice and convert it into a service. To this end, the open source ServiceMix [19], which implements the JSR-208 Java Business Integration specification [20], is used; implementations of JBI are usually referred to as a “Service Bus”. Depending on the user-chosen
target machine, the service forks a new process on one of the servers associated with the RPC system, or submits the job to the remote machine using Globus GRAM [21]. In the case of remote submission, the service registers itself as the listener for GRAM job status change notifications. The notifications are forwarded to a Job Monitoring Service (JMS). JMS stores the information on the jobs in a database (MySQL). A separate RPC portlet provides the user interface for querying the status of all jobs submitted by the user. The request for a job submission (local or remote) contains an XML job descriptor that specifies all information needed to submit the job: the location of the executable, values of environment variables, files to be staged in and out, etc. Consequently, the same ServiceMix service is used to submit any job, with the job descriptor generated by the transformation GUI (or a supporting JSP page). Furthermore, a new working directory is created for each instance of a job. Once the job completes, the result of the transformation is transferred to the TDS server to the location specified by the user, while “byproducts” such as standard output and log files, if created, are transparently moved to a location specified by the RPC server: a folder whose name is created automatically by hashing the physical URL of the transformation result. This approach eliminates unnecessary clutter in the TDS catalog. Using the TDS Explorer, the user navigates only the actual datasets. If the provenance information is needed, the TDS Explorer recreates the hash from the dataset URL and shows the contents of that directory, providing the user with access to all files there.
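The byproduct-folder naming scheme (hashing the physical URL of the transformation result) can be illustrated as below. The paper does not name the hash function or directory layout, so MD5 and the base path are assumptions:

```python
import hashlib

def provenance_dir(result_url: str, base: str = "/data/provenance") -> str:
    # The same URL always hashes to the same folder, so the TDS Explorer
    # can later recreate the name from the dataset URL to locate the
    # byproducts (stdout, logs, parameter files) without cataloguing them.
    # (MD5 and the base path are assumptions for illustration.)
    digest = hashlib.md5(result_url.encode("utf-8")).hexdigest()
    return f"{base}/{digest}"
```

Because the mapping is deterministic, no extra bookkeeping is needed to associate a dataset with its provenance files.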
Finally, the data viewers and statistical analysis tools do not produce new datasets; in this regard, they are similar to standalone tools. The advantage of integrating them with the RPC infrastructure is that the data can be preprocessed on the server side, reducing the volume of the necessary data transfers. Because of their interactivity and rich functionality (visualizations), they are implemented as Java applets.
Summary This paper describes the requirements, design and implementation progress of an e-Science environment to enable rapid evaluation of innovative methods of processing science observations, in particular data gathered by sensors deployed on NASA-launched satellites. This project is intended to be a low-cost effort focused on integrating existing open source, public domain, and/or community developed software components and tools. Critical for success is a carefully designed implementation plan allowing for incremental enhancement of the functionality of the system, including incorporating new tools per user requests, while maintaining an operational system and hardening its implementation. This has been achieved by rigorously following the principles of separation of concerns, loose coupling, and service oriented architectures employing Portlet (GridSphere), Service Bus (ServiceMix), and Grid (Globus) technologies, as well as introducing a new layer on top of the
THREDDS data server (TDS Explorer). At the time of writing this paper, the implementation is well into phase IV, while continuing to add new tools. The already deployed tools allow for subsampling, reprojections, format translations, and de-clouding of selected data sets and collections, as well as for generating simulated VIIRS data approximating data feeds from future NASA missions.
References
[1] NASA Science Mission Directorate, Applied Sciences Program. Rapid Prototyping Capability (RPC) Guidelines and Implementation Plan, http://aiwg.gsfc.nasa.gov/esappdocs/RPC/RPC_guidelines_01_07.doc
[2] http://modis.gsfc.nasa.gov
[3] T. Haupt and R. Moorhead, “The Requirements and Design of the Rapid Prototyping Capabilities System”, 2006 Fall Meeting of the American Geophysical Union, San Francisco, USA, December 2006.
[4] http://www.unidata.ucar.edu/projects/THREDDS/
[5] http://www.unidata.ucar.edu/software/idv/
[6] http://en.wikipedia.org/wiki/NetCDF
[7] http://hdf.ncsa.uiuc.edu/hdfeos.html
[8] http://newsroom.gsfc.nasa.gov/sdptoolkit/HEG/HEGHome.html
[9] http://edcdaac.usgs.gov/landdaac/tools/modis/index.asp
[10] http://edcdaac.usgs.gov/news/mrtswath_update_020106.asp
[11] http://www.iftd.org/rapid_prototyping.php
[12] JSR-168 Portlet Specification, http://jcp.org/aboutJava/communityprocess/final/jsr168/
[13] www.gridsphere.org
[14] MyProxy Credential Management Service, http://grid.ncsa.uiuc.edu/myproxy/
[15] K. Bhatia, S. Chandra, K. Mueller, "GAMA: Grid Account Management Architecture," First International Conference on e-Science and Grid Computing (e-Science'05), pp. 413-420, 2005.
[16] Howard G. "Ward" Cunningham, http://www.wiki.org/wiki.cgi?WhatIsWiki
[17] http://en.wikipedia.org/wiki/MediaWiki
[18] http://en.wikipedia.org/wiki/Wikipedia:About
[19] http://incubator.apache.org/servicemix/home.html
[20] http://jcp.org/en/jsr/detail?id=208
[21] http://www.globus.org
The PS3® Grid-resource model
Martin Rehr and Brian Vinter
eScience Center, University of Copenhagen, Copenhagen, Denmark
Abstract—This paper introduces the PS3® Grid-resource model, which allows any Internet-connected Playstation 3 to become a Grid node without any software installation. The PS3® is an interesting Grid resource, as each of the over 5 million sold worldwide contains a powerful heterogeneous multi-core vector processor well suited for scientific computing. The PS3® Grid node provides a native Linux execution environment for scientific applications. Performance numbers show that the model is usable when the input and output data sets are small. The resulting system is in use today, and freely available to any research project.
Keywords: Grid, Playstation 3, MiG
1. Introduction
The need for computation power is growing daily, as an increasing number of scientific areas use computer modeling as a basis for their research. This evolution has led to a whole new research area called eScience. The increasing need for scientific computational power has been known for years, and several attempts have been made to satisfy the growing demand. In the 90's, systems evolved from vector-based supercomputers to cluster computers built of commodity hardware, leading to a significant price reduction. In the late 90's, a concept called Grid computing [7] was developed, which describes the idea of combining different cluster installations into one powerful computation unit.
A huge computation potential beyond the scope of cluster computers is represented by machines located outside the academic perimeter. While traditional commodity machines are usually PCs based on the X86 architecture, a whole new target has turned up with the development and release of the Sony Playstation 3 (PS3®). The heart of the PS3® is the Cell processor. The Cell Broadband Engine Architecture (Cell BE) [4] is a new microprocessor architecture developed in a joint venture between Sony, Toshiba and IBM, known as STI. Each company has its own purpose for the Cell processor: Toshiba uses it as a controller for their flat panel televisions, Sony uses it for the PS3®, and IBM uses it for their High Performance Computing (HPC) blades. The development of the Cell started in the year 2000, involved around 400 engineers for more than four years, and consumed close to half a billion dollars. The result is a powerful heterogeneous multi-core vector processor well suited for gaming and High Performance Computing (HPC) [8].
1.1. Motivation
The theoretical peak performance of the Cell processor in the PS3® is 153.6 GFLOPS in single precision and 10.98 GFLOPS in double precision [4]1. According to the press, more than 5 million PS3s had been sold worldwide by October 2007. This gives a theoretical peak performance of more than 768.0 peta-FLOPS in single precision and 54.9 peta-FLOPS in double precision, if one could combine them all in a Grid infrastructure. This paper describes two scenarios for transforming the PS3® into a Grid resource: firstly the Native Grid Node (NGN), where full control of the PS3® is obtained; secondly the Sandboxed Grid Node (SGN), where several issues have to be considered to protect the PS3® from faulty code, as the machine is used for purposes other than Grid computing.
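The aggregate figures above follow directly from the per-console numbers (see footnote 1 for the per-SPE rates):

```python
# Per-console theoretical peak of the PS3 Cell: 6 SPEs usable by applications.
single_gflops = 6 * 25.6          # 153.6 GFLOPS single precision
double_gflops = 6 * 1.83          # 10.98 GFLOPS double precision

consoles = 5_000_000              # PS3s sold worldwide by October 2007

# Aggregate theoretical peak in peta-FLOPS (1 PFLOPS = 10**6 GFLOPS).
total_single_pflops = single_gflops * consoles / 10**6   # 768.0
total_double_pflops = double_gflops * consoles / 10**6   # 54.9
```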
Folding@Home [6] is a scientific distributed application for folding proteins. The application has been embedded into the Sony GameOS of the PS3® and is limited to protein folding. This makes it Public Resource Computing, as opposed to our model, which aims at Grid computing, providing a complete Linux execution environment suited to all types of scientific applications.
2. The Playstation 3
The PS3® is interesting in a Grid context due to the powerful Cell BE processor and the fact that the game console has official support for operating systems other than the default Sony GameOS.
2.1. The Cell BE
The Cell processor is a heterogeneous multi-core processor consisting of 9 cores. The primary core is an IBM 64-bit Power processor (PPC64) with 2 hardware threads. This core is the link between the operating system and the 8 powerful working cores, called SPEs (Synergistic Processing Elements). The Power processor is called the PPE (Power Processing Element); figure 1 shows an overview of the Cell architecture. The cores are connected by an Element Interconnect Bus (EIB) capable of transferring up to 204 GB/s at 3.2 GHz. Each SPE is dual-pipelined, has a 128x128-bit register file and 256 kB of on-chip memory called the local store. Data is transferred asynchronously between main memory and the local store through DMA calls handled by a dedicated Memory Flow Controller (MFC). An overview of the SPE is shown in figure 2.

1. The PS3® Cell has 6 SPEs available for applications. Each SPE runs at 3.2 GHz and is capable of performing 25.6 GFLOPS in single precision and 1.83 GFLOPS in double precision.

[Figure 1. An overview of the Cell architecture: the PPE and SPE0–SPE7 connected by the Element Interconnect Bus (EIB), together with the Memory Interface Controller (MIC, Rambus XDR I/O) and the Cell Broadband Engine Interface (BEI) with its I/O interfaces IOIF_0 and IOIF_1.]

[Figure 2. An overview of the SPE: even and odd pipelines, register file, instruction prefetch and issue unit, local store, and the Memory Flow Controller (MFC) connecting to the Element Interconnect Bus (EIB).]
By using the PPE as the primary core, the Cell processor can be used out of the box, since many existing operating systems support the PPC64 architecture. It is thereby possible to boot a PPC64 operating system on the Cell processor and execute PPC64 applications; however, these will only use the PPE core. To use the SPE cores, it is necessary to develop code specifically for the SPEs, which includes setting up a memory communication scheme using DMA through the MFC.
2.2. The game console
Contrary to other game consoles, the PS3® officially supports alternative operating systems besides the default Sony GameOS. Even though other game consoles can be modified to boot alternative operating systems, this requires either an exploit of the default system or a replacement of the BIOS. Replacing the BIOS is intrusion at the highest level, expensive at large volume, and not usable beyond the academic perimeter. Security exploits are most likely to be patched in the next firmware update, which makes that approach unusable in any scenario. Beyond the difficulty of modifying other game consoles for our purposes, the processors used by the game consoles currently on the market, except for the PS3®, are not of any interest for scientific computing.

[Figure 3. An overview of the PS3® hypervisor structure for the Grid-resource model: the PS3 Linux kernel reaches the PS3 hardware (PPU, SPU, GPU, audio, GbE, WiFi, ATA HDD/CD, USB HID, Bluetooth) through the Sony hypervisor virtualization layer, via drivers such as SPUFS, ALSA audio, the frame buffer, the PS3 VRAM MTD, GbE networking, SCSI storage, and OHCI/EHCI USB; some Linux drivers are provided by Sony, others are not included on the PS3-LIVECD.]

The fact that the PS3® is low priced from an HPC point of view, equipped with a high performance vector processor, and supports alternative operating systems makes it interesting both as an NGN node and as an SGN node. All sold PS3s can be transformed into a powerful Grid resource with little effort from the owner of the console. Third party operating systems run on top of the Sony GameOS, which acts as a hypervisor for the guest operating system (see figure 3). The hypervisor controls which hardware components are accessible from the guest operating system. Unfortunately the GPU is not accessible by guest operating systems2, which is a pity, as it is in itself a powerful vector computation unit with a theoretical peak performance of 1.8 tera-FLOPS in single precision. However, 252 MB of the 256 MB of GDDR3 RAM located on the graphics card can be accessed through the hypervisor. The hypervisor reserves 32 MB of main memory and 1 of the 7 SPEs available in the PS3® version of the Cell processor3. This leaves 6 SPEs and 224 MB of main memory for guest operating systems. Lastly, a hypervisor model always introduces a certain amount of performance decrease, as the guest operating system does not have direct access to the hardware.
3. The PS3® Grid resource

The PS3® supports alternative operating systems, making the transformation into a Grid resource rather trivial, as a suitable Linux distribution and an appropriate Grid client are the only requirements. However, if a large number of PS3s are targeted, this becomes cumbersome. Furthermore, if the PS3s located beyond the academic perimeter are to be reached, minimal administrational work from the donator of the PS3® is a vital requirement. Our approach minimizes the work required to transform a PS3® into a powerful Grid resource by using a LIVECD. Using this CD, the PS3® is booted directly into a Grid-enabled Linux system. The NGN version of the LIVECD is targeted at PS3s used as dedicated Grid nodes and uses all the available hardware of the PS3®, whereas the SGN version uses the machine without making any change4 to it, and is targeted at PS3s used as entertainment devices as well as Grid nodes.

2. It is not clear whether this is to prevent games from being played outside Sony GameOS, due to DRM issues, or due to the exposure of the GPU's register-level information.

3. The Cell processor consists of 8 SPEs, but in the PS3® one is removed for yield purposes: if one is defective it is removed, and if none is defective a good one is removed, to assure that all PS3s have exactly 6 SPEs available for applications and to preserve architectural consistency.
3.1. The PS3-LIVECD
Several requirements must be met by the Grid middleware to support the described LIVECD. First of all, the Grid middleware must support resources that can only be accessed through a pull-based model, which means that all communication is initiated by the resource, i.e. the PS3-LIVECD. This is required because the PS3s targeted by the LIVECD are most likely located behind a NAT router. Secondly, the Grid middleware needs a scheduling model where resources are able to request specific types of jobs; e.g., a resource can specify that only jobs targeted at the PS3® hardware model can be executed.
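A resource-side request satisfying both requirements (outbound-only HTTPS and job requests qualified by hardware type) might be built as in this sketch. The endpoint path and parameter names are hypothetical, not MiG's actual protocol:

```python
import urllib.parse

def build_job_request(server: str, resource_id: str) -> str:
    """Build the URL a PS3 resource would poll for a job. All traffic is
    initiated by the resource, so it works from behind a NAT router with
    no inbound ports open; the query parameters let the scheduler hand
    out only jobs targeted at this hardware model. (Hypothetical
    endpoint and parameter names, for illustration only.)"""
    params = urllib.parse.urlencode({
        "resource": resource_id,     # unique resource identifier
        "arch": "cell-ps3",          # request only PS3-targeted jobs
        "memory_mb": 221,            # main memory left for jobs
        "disk_mb": 245,              # VRAM block-device capacity
    })
    return f"https://{server}/cgi-bin/request_job?{params}"
```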
In this work the Minimum intrusion Grid [11], MiG, is used as the Grid middleware. The MiG system is presented next, before presenting how the PS3-LIVECD and MiG work together.
3.2. Minimum intrusion Grid
MiG is a stand-alone Grid platform which does not inherit code from any earlier Grid middleware. The philosophy behind the MiG system is to provide a Grid infrastructure that imposes as few requirements on both users and resources as possible. The overall goal is to ensure that a user is only required to have an X.509 certificate signed by a source that is trusted by MiG, and a web browser that supports HTTP, HTTPS and X.509 certificates. A fully functional resource only needs to create a local MiG user on the system and to support inbound SSH. A sandboxed resource, using the pull-based model, only needs outbound HTTPS [1].
Because MiG keeps the Grid system disjoint from both users and resources, as shown in Figure 4, the Grid system appears as a centralized black box [11] to both users and resources. This allows all middleware upgrades and troubleshooting to be carried out locally within the Grid without any intervention from either users or resource administrators. Thus, all functionality is placed in a physical Grid system that, though it appears centralized, is in reality distributed. The basic functionality of MiG starts with a user submitting a job to MiG and a resource sending a request for a job to execute. The resource then receives an appropriate job from MiG, executes it, and sends the result to MiG, which can then inform the user of the job completion. Since the user and the resource are never in direct contact, MiG provides full anonymity for both users and resources; any complaints have to be made to the MiG system, which can then consult the logs that record the relationship between users and resources.

4. One has to install a boot loader to be able to boot from CDs.

[Figure 4. The abstract MiG model: clients and resources connect only to the central Grid system, never directly to each other.]
3.2.1. Scheduling. The centralized black-box design of MiG makes it capable of strong scheduling, which implies full control of the jobs being executed and the resources executing them. Each job has an upper execution time limit, and when the execution time exceeds this limit the job is rescheduled to another resource. This makes the MiG system very well suited to host SGN resources, as they are by nature very dynamic and frequently join and leave the Grid without notifying the Grid middleware.
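The time-limit rule can be sketched as a periodic check on the scheduler side; the Job class and function names are illustrative, not MiG code:

```python
import time

class Job:
    def __init__(self, job_id: str, time_limit_s: int):
        self.job_id = job_id
        self.time_limit_s = time_limit_s   # upper execution time limit
        self.started_at = None             # set when a resource takes the job

def jobs_to_reschedule(running_jobs, now=None):
    """Return jobs whose execution time exceeded their upper limit; a
    central scheduler (as in MiG's black-box design) would resubmit
    these to other resources. (Illustrative sketch only.)"""
    now = time.time() if now is None else now
    return [job for job in running_jobs
            if job.started_at is not None
            and now - job.started_at > job.time_limit_s]
```

Because the check needs only the recorded start time, it also covers resources that silently leave the Grid: their jobs simply time out and get rescheduled.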
4. The MiG PS3-LIVECD
The idea behind the LIVECD is to boot the PS3® by inserting a CD containing the Linux operating system and the appropriate Grid clients. Upon boot, the PS3® connects to the Grid and requests Grid jobs without any human interference. Several issues must be dealt with. First of all, the PS3® must not be harmed by flaws in the Grid middleware nor by exploits through the middleware. Secondly, the Grid jobs may not harm the PS3®, neither by intention nor through faulty jobs. This is especially true for SGN resources, where an exploit may cause exposure of personal data.
4.1. Security
To keep faulty Grid middleware and jobs from harming the PS3®, both the NGN and SGN models use the operating system as a security layer. The Grid client software and the executed Grid jobs both run as a dedicated user who does not have administrative rights on the operating system. The MiG system logs all relations between jobs and resources, thus providing the possibility to track down any job.
4.2. Sandboxing
The SGN version of the LIVECD operates in a sandboxed environment to protect the donated PS3® from faulty middleware and jobs. This is done by excluding the device driver for the PS3® HDD controller from the Linux kernel used, and keeping the execution environment in memory instead. Furthermore, support for loadable kernel modules is excluded, which prevents Grid jobs from loading modules into the kernel even if the OS is compromised and root access is achieved.
4.3. File access
File access for the Grid client and jobs, without access to the PS3's hard drive, is enabled by using the graphics card's VRAM as a block device. Main memory is a limited resource5; therefore, using the VRAM as a block device is a great advantage compared to the alternative of using a RAM disk, which would decrease the amount of main memory available for the Grid jobs. However, the total amount of VRAM is 252 MB, and therefore Grid jobs requiring input/output files larger than 252 MB are forced to use a remote file access framework [2].
4.4. Memory management
The PS3® has 6 SPE cores and a PPE core, all capable of accessing main memory at the same time through their MFC controllers. This results in a potential bottleneck in the TLB, which in the worst case ends up thrashing, a known problem in multi-core processor architectures. TLB thrashing can be eliminated by adjusting the page size to fit the TLB, which means that all pages have an entry in the TLB. This is called huge pages, as the page size grows significantly. The use of huge pages has several drawbacks, one of them being swapping: swapping a huge page in or out results in a longer execution halt, as a larger amount of data has to be moved between main memory and the hard drive.
The Linux operating system implements huge pages as a memory-mapped file; this results in a static memory division between traditional pages and huge pages, using different memory allocators. The operating system and standard shared libraries use the traditional pages, which means the memory footprint of the operating system and the shared libraries has to be estimated in order to allocate the right amount of memory for the huge pages. Contrary to a cluster setup, where the execution environment and applications are customized to the specific cluster, this can't be achieved in a Grid context6. Therefore a generic way of addressing the memory is needed. Furthermore, future SPE programming libraries will most likely use the default memory allocator. This, and the fact that no performance measurements clarifying the actual gain of using huge pages could be found, led to the decision to skip huge pages for the PS3-LIVECD.
At last it’s believed by the authors that the actual applica-tions which could gain a performance increase by using hugepages is rather insignificant, as the the majority of applicationswill be able to hide the TLB misses by using double- ormulti buffering, as memory transfers through the MFC areasynchronous.
5. The PS3® only has 224 MB of main memory for the OS and applications.

6. Especially in MiG, where the users and resources are anonymous to each other.
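The double-buffering idea above overlaps computation on one buffer with the asynchronous transfer of the next. The sketch below mimics that control flow with a worker thread standing in for the MFC's DMA engine; it illustrates the pattern only, not SPE code:

```python
from concurrent.futures import ThreadPoolExecutor

def process_stream(chunks, fetch, compute):
    """While chunk i is being computed, chunk i+1 is already in flight:
    fetch() plays the role of an asynchronous MFC DMA transfer, and the
    pending future is the 'other' buffer of the double-buffer pair.
    (Illustrative sketch of the pattern, not actual Cell code.)"""
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, chunks[0])      # prime the first buffer
        for nxt in chunks[1:]:
            current = pending.result()              # wait for the transfer
            pending = dma.submit(fetch, nxt)        # start the next one
            results.append(compute(current))        # overlaps the transfer
        results.append(compute(pending.result()))   # drain the last buffer
    return results
```

As long as compute() takes at least as long as fetch(), the transfer latency is fully hidden, which is the effect the authors rely on to dismiss the TLB-miss cost.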
5. The execution environment
The PS3-LIVECD is based on the Gentoo Linux [9] PPC64 distribution with a customized kernel [5] capable of communicating with the PS3® hypervisor. Gentoo Catalyst [3] was used as the build environment; this provides the possibility of configuring exactly which packages to include on the LIVECD, as well as the possibility to apply a custom-made kernel and initrd script. The kernel was modified in different ways. Firstly, loadable module support was disabled to prevent potential evil jobs which manage to compromise the OS security from modifying the kernel modules. Secondly, the frame-buffer driver has been modified to make the VRAM appear as a memory technology device (MTD), which means that the VRAM can be used as a block device. The modification of the frame-buffer driver also included freeing 18 MB of main memory occupied by the frame-buffer used in the default kernel7.
The modified kernel ended up consuming 7176 kB of the total 229376 kB of main memory for code and internal data structures, leaving 222200 kB for the Grid client and jobs. Upon boot, the modified initrd script detects the block device to be used as the root file system8 and formats the detected device with the ext2 filesystem, reserving 2580 kB for the superuser and leaving 251355 kB for the Grid client and jobs9. When the block device has been formatted, the initrd script sets up the root file system by copying writable directories and files from the CD to the root file system. Read-only directories, files, and binaries are left on the CD and linked symbolically into the root filesystem, keeping as much of the root filesystem free for Grid jobs as possible. The result is that the root file system only consumes 1.6 MB of the total space provided by the block device used.
When the Linux system is booted, the LIVECD initiates the communication with MiG through HTTPS. This is done by sending a unique key identifying the PS3® to the MiG system; if this is the first time the resource connects to the Grid, a new profile is created dynamically. The response to the initial request is the Grid resource client scripts, which are generated dynamically upon the request. Using this method, it is guaranteed that the resource always has the newest version of the Grid resource client scripts, removing the need for downloading a new CD upon a Grid middleware update. When the Grid resource client script is executed, the request for Grid jobs is initiated through HTTPS. Within that request a unique resource identifier is provided, giving the MiG scheduler the necessary information about the resource, such as architecture, memory, disc space and an upper time limit. Based on these parameters, the MiG scheduler finds a job suited for the PS3® and places it in a job folder on the MiG system. From this location the PS3® is able to retrieve the job, consisting of
7. As the hypervisor isolates the GPU from the operating system, the display is operated by having the frame-buffer write the data to be displayed to an array in main memory, which is then copied to the GPU by the hypervisor.

8. The SGN version uses the VRAM; the NGN version uses the real hard drive provided through the hypervisor.

9. This is true for the SGN version; the NGN version uses the total disc space available, which is specified through the Sony GameOS.
job description files, input files, and executables. The location of these files is returned in the result of the job request, and is an HTTPS URL including a 32-character random string generated upon the job request and deleted when the job terminates. At job completion, the result is delivered to the MiG system, which verifies that it is the correct resource (by the unique resource key) delivering the result of the job. If it is a false delivery10 the result is discarded; otherwise it is accepted, and the PS3® resource requests a new job once the result of the previous one has been delivered.
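The 32-character random URL component can be generated as in this sketch; the paper does not specify the alphabet or generator, so both are assumptions:

```python
import secrets
import string

def job_session_token(length: int = 32) -> str:
    # Unguessable path component for the per-job HTTPS URL; created when
    # the job is requested and discarded when the job terminates.
    # (Alphabet and generator are assumptions; the paper only specifies
    # a 32-character random string.)
    alphabet = string.ascii_lowercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))
```

Using a cryptographically strong generator matters here: the token is the only thing protecting the job files between request and termination.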
6. Experiments
Testing the PS3® Grid-resource model was done by establishing a controlled test scenario consisting of a MiG Grid server and 8 PS3s. The experiments performed included a model overhead check, a file system benchmark using the VRAM as a block device, and application performance tests, using a protein folding and a ray tracing application.
6.1. Job overhead and file performance
The total overhead of the model was tested by submitting 1000 empty jobs to the Grid with only one PS3® connected. The 1000 jobs completed in 12366 seconds, which translates to an overhead of approximately 13 seconds per job. The performance of the VRAM used as a block device was tested by writing a 96 MB file sequentially. This was achieved in 1.5 seconds, resulting in a bandwidth of 64 MB/s. Reading the written file took 9.6 seconds, resulting in a bandwidth of 10 MB/s. This shows that writing to the VRAM is a factor of approximately 6.5 faster than reading from it, which was an expected result, as VRAM is by nature written from main memory rather than read back.
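Expressed as arithmetic, the reported figures are consistent:

```python
# Job overhead: 1000 empty jobs on one connected PS3.
overhead_s = 12366 / 1000        # ~12.4 s, i.e. roughly 13 s per job

# VRAM block device, sequential access of a 96 MB file.
write_mb_s = 96 / 1.5            # 64 MB/s
read_mb_s = 96 / 9.6             # 10 MB/s
ratio = write_mb_s / read_mb_s   # 6.4, "approximately 6.5"
```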
6.2. Protein folding
Protein folding is a compute-intensive application. It requires a small input, generates a small output, and is embarrassingly parallel, which makes it very suitable for Grid computing. In this experiment, a protein of length 27 was folded on one PS3®, resulting in a total execution time of 57 minutes and 16 seconds. The search space was then divided into 17 different subspaces using standard divide-and-conquer techniques. The 17 search spaces were then submitted as jobs to the Grid, which adds up to 4 jobs for each of the 4 nodes used in the experiment plus one extra job to ensure unbalanced execution. Equivalently, the 17 jobs were distributed among 8 nodes, yielding 2 jobs per node plus one extra job. The execution finished in 18 minutes and 50 seconds using 4 nodes, giving a speedup of 3.04. The 8-node setup finished the execution in 10 minutes and 56 seconds, giving a speedup of 5.23; this is shown in figure 5. These results are considered quite useful in a Grid setup, as opposed to a cluster setup, where they would be considered bad.
10. The resource keys doesn’t match, the time limit has been violated, oranother resource is executing the job, due to a rescheduling
[Figure 5. The speedup achieved using the PS3-LIVECD for protein folding with 4 and 8 nodes (speedup vs. number of nodes).]

[Figure 6. The speedup achieved using the PS3-LIVECD for ray tracing with 4 and 8 nodes (speedup vs. number of nodes).]
6.3. Ray tracing
Ray tracing is compute intensive, requires a small amount of input and generates a large amount of output. This experiment uses a ray tracing code written by Eric Rollings [10], modified from a real-time ray tracer to one which writes the rendered frames to files at a resolution of 1920x1080 (Full HD). The final images are JPEG-compressed to reduce the size of the output. A total of 5000 frames were rendered in 78 minutes and 6 seconds on a single PS3®. The search space was then divided into 25 equally large subspaces. These were submitted as jobs to the Grid, resulting in a total of 25 jobs, which adds up to 6 jobs per node plus one extra in the 4-node setup, and 3 jobs per node plus one extra in the 8-node setup. The execution time using 4 nodes was 32 minutes and 23 seconds, giving a speedup of 2.41, and the execution time using 8 nodes was 25 minutes and 12 seconds, giving a speedup of 3.09; this is sketched in figure 6. While the speedup achieved with 4 nodes is quite useful in a Grid context, the speedup gained using 8 nodes is quite disappointing. The authors believe this is due to network congestion when the rendered frames are sent to the MiG storage upon job termination.
7. Conclusion
In this work we have demonstrated a way to use the Sony Playstation 3 as a Grid computing device without the need to install any client software on the PS3®. The use of the Linux operating system provides a native execution environment suitable for the majority of scientific applications; the advantage of this is that existing Cell applications can be executed without any modifications. A sandboxed version of the execution environment has been presented which denies access to the hard drive of the PS3®. The advantage of this is that donated PS3s cannot be compromised by faulty or malicious jobs; the disadvantage is the lack of file access, which is solved by using the VRAM of the PS3 as a block device.
The Minimum intrusion Grid supports the required pull-job model for retrieving and executing Grid jobs on a resource located behind a firewall, without the need to open any incoming ports. Using the PS3-LIVECD approach, any PS3® connected to the Internet can become a Grid resource by booting it from the LIVECD. When a Grid-connected PS3® is shut down, the MiG system detects this event by a timeout and resubmits the job to another resource.
Experiments show that the ray tracing application does not scale well, due to the large amount of output data, which results in network congestion. In contrast, a considerable speedup is reached when folding proteins, despite the model overhead of 13 seconds applied to each job.
References
[1] Rasmus Andersen and Brian Vinter. Harvesting idle Windows CPU cycles for Grid computing. In Hamid R. Arabnia, editor, GCA, pages 121-126. CSREA Press, 2006.
[2] Rasmus Andersen and Brian Vinter. Transparent remote file access in the Minimum intrusion Grid. In WETICE '05: Proceedings of the 14th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprise, pages 311-318, Washington, DC, USA, 2005. IEEE Computer Society.
[3] Gentoo Catalyst. http://www.gentoo.org/proj/en/releng/catalyst.
[4] Thomas Chen, Ram Raghavan, Jason Dale, and Eiji Iwata. Cell Broadband Engine architecture and its first implementation. IBM developerWorks, 2005. http://www.ibm.com/developerworks/power/library/pa-cellperf.
[5] PS3 Linux extensions. ftp://ftp.uk.linux.org/pub/linux/Sony-PS3.
[6] Folding@home. http://folding.stanford.edu.
[7] Ian Foster. The Grid: A new infrastructure for 21st century science. Physics Today, 55(2):42-47, 2002.
[8] Mohammad Jowkar. Exploring the Potential of the Cell Processor for High Performance Computing. Master's thesis, University of Copenhagen, Denmark, August 2007.
[9] Gentoo Linux. http://www.gentoo.org.
[10] Eric Rollins. Ray tracer. http://eric rollins.home.mindspring.com/ray/ray.html.
[11] Brian Vinter. The Architecture of the Minimum intrusion Grid (MiG). In Communicating Process Architectures 2005, September 2005.
Numerical Computational Solution of Fredholm Integral Equations of the Second Kind by Using Multiwavelet

K. Maleknejad^a, T. Lotfi^b, and K. Nouri^a
a Department of Mathematics, Iran University of Science and Technology, Narmak, Tehran 1684613114, Iran
b Department of Mathematics, I. A. U. H. (Hamadan Unit), Hamadan, Iran
Abstract The main purpose of this paper is to develop a multiwavelet Galerkin method for obtaining numerical solutions of Fredholm integral equations of the second kind. We use a class of multiwavelets which constructs bases for L2(R) and leads to sparse matrices with high precision in numerical methods such as the Galerkin method. Because multiwavelets are able to offer a combination of orthogonality, symmetry, higher order of approximation and short support, methods using multiwavelets frequently outperform those using comparable scalar wavelets. Since spline bases have maximal approximation order with respect to their length, we use a family of spline multiwavelets that are symmetric and orthogonal as our basis. Finally, numerical examples show that our estimates achieve a good degree of accuracy.

Keywords: Integral Equation, Multiwavelet, Galerkin System, Orthogonal Bases
1 Introduction
This section provides an overview of the topics that we need in this paper. The use of wavelet-based algorithms in numerical analysis is superficially similar to other transform methods: instead of representing a vector or an operator in the usual way, it is expanded in a wavelet basis, or its matrix representation is computed in that basis.
1.1 Multiwavelet
The multiwavelet is more general than the scalar wavelet. The recursion coefficients are now matrices, the symbols are trigonometric matrix polynomials, and so on. This change is responsible for most of the extra complication. We now consider a dilation factor of m rather than 2. The multiscaling function is still φ; the multiwavelets are ψ^(1), . . . , ψ^(m−1). Likewise, the recursion coefficients are H_k and G^(1), . . . , G^(m−1), and so on.

Definition 1 A refinable function vector is a vector-valued function

φ(x) = (φ_1(x), . . . , φ_r(x))^T, φ_n : R → C,

which satisfies a two-scale matrix refinement equation of the form

φ(x) = √m Σ_{k=k0}^{k1} H_k φ(mx − k), k ∈ Z. (1)

r is called the multiplicity of φ; the integer m ≥ 2 is the dilation factor. The recursion coefficients H_k are r × r matrices.
2 Construction of Multiwavelet Bases

We begin with the construction of a class of bases for L2[0, 1]. The class is indexed by p ∈ Z+, which denotes the number of vanishing moments of the basis functions; we say a basis {b1, b2, b3, . . .} from this class is of order p if

∫_0^1 b_i(x) x^j dx = 0, j = 0, . . . , p − 1,

for each b_i with i > p.
2.1 Multiwavelet Bases for L2[0, 1]

We employ the multiresolution analysis framework of Keinert [1]. For m = 0, 1, 2, . . . and i = 0, 1, . . . , 2^m − 1, we define a half-open interval I_{m,i} ⊂ [0, 1) as

I_{m,i} = [2^{−m} i, 2^{−m}(i + 1)). (2)

For fixed m, the dyadic intervals I_{m,i} are disjoint and their union is [0, 1); also I_{m,i} = I_{m+1,2i} ∪ I_{m+1,2i+1}. Now suppose that p ∈ Z+; for m = 0, 1, . . . and i = 0, 1, . . . , 2^m − 1, we define a space V^p_{m,i} of piecewise polynomial functions,

V^p_{m,i} = {f | f : R → R, f = P_p χ_{I_{m,i}}}, (3)

where P_p is a polynomial of degree less than p and χ_{I_{m,i}} is the characteristic function of I_{m,i}, and we define

V^p_m = V^p_{m,0} ⊕ V^p_{m,1} ⊕ V^p_{m,2} ⊕ · · · ⊕ V^p_{m,2^m−1}.

It is apparent that for each m and i the space V^p_{m,i} has dimension p, the space V^p_m has dimension 2^m p, and

V^p_{m,i} ⊂ V^p_{m+1,2i} ⊕ V^p_{m+1,2i+1};

thus

V^p_0 ⊂ V^p_1 ⊂ · · · ⊂ V^p_m ⊂ · · · .
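The containment relations above rest on the nesting of the dyadic intervals, which is easy to verify numerically; a minimal sketch:

```python
# The dyadic intervals of Eq. (2): each I_{m,i} is tiled exactly by its two
# children I_{m+1,2i} and I_{m+1,2i+1}, and for fixed m the intervals
# partition [0, 1). Dyadic endpoints are exact in floating point for small m.
def interval(m, i):
    return (2.0**-m * i, 2.0**-m * (i + 1))

m, i = 3, 5
lo, hi = interval(m, i)
l0, h0 = interval(m + 1, 2 * i)
l1, h1 = interval(m + 1, 2 * i + 1)
assert (lo, hi) == (l0, h1) and h0 == l1   # I_{m,i} = I_{m+1,2i} ∪ I_{m+1,2i+1}

ints = [interval(m, j) for j in range(2**m)]   # disjoint cover of [0, 1)
assert ints[0][0] == 0.0 and ints[-1][1] == 1.0
assert all(ints[j][1] == ints[j + 1][0] for j in range(2**m - 1))
print("nesting verified")
```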
For m = 0, 1, 2, . . . and i = 0, 1, . . . , 2^m − 1, we define the p-dimensional space W^p_{m,i} to be the orthogonal complement of V^p_{m,i} in V^p_{m+1,2i} ⊕ V^p_{m+1,2i+1},

V^p_{m,i} ⊕ W^p_{m,i} = V^p_{m+1,2i} ⊕ V^p_{m+1,2i+1}, W^p_{m,i} ⊥ V^p_{m,i},

and we define

W^p_m = W^p_{m,0} ⊕ W^p_{m,1} ⊕ W^p_{m,2} ⊕ · · · ⊕ W^p_{m,2^m−1}.

Now we have V^p_m ⊕ W^p_m = V^p_{m+1}, so we inductively obtain the decomposition

V^p_m = V^p_0 ⊕ W^p_0 ⊕ W^p_1 ⊕ · · · ⊕ W^p_{m−1}. (4)

Suppose that the functions ψ_1, ψ_2, . . . , ψ_p : R → R form an orthogonal basis for W^p_0. Since W^p_0 is orthogonal to V^p_0, the first p moments of ψ_1, . . . , ψ_p vanish,

∫_0^1 ψ_i(x) x^j dx = 0, j = 0, 1, . . . , p − 1.

The space W^p_{m,i} then has an orthogonal basis consisting of the p functions ψ_1(2^m x − i), . . . , ψ_p(2^m x − i), which are non-zero only on the interval I_{m,i}; furthermore, each of these functions has p vanishing moments. Introducing the notation ψ^j_{m,i}, for j = 1, . . . , p, m = 0, 1, 2, . . . , and i = 0, 1, . . . , 2^m − 1, by the formula

ψ^j_{m,i}(x) = ψ_j(2^m x − i), x ∈ R,

we obtain from decomposition (4) the formula

V^p_m = V^p_0 ⊕ linear span{ψ^j_{m′,i} : j = 1, . . . , p; m′ = 0, 1, . . . , m − 1; i = 0, 1, . . . , 2^{m′} − 1}. (5)

An explicit construction of ψ_1, . . . , ψ_p is given in Walter and Shen [3].
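While the explicit analytic construction is in [3], the defining property of W^p_0 can also be realised purely numerically. The sketch below (our own illustration, not the construction from [3]) builds an orthonormal basis of W^p_0 for p = 2 on a fine grid, as the orthogonal complement of V^p_0 inside V^p_1, and checks the vanishing moments:

```python
import numpy as np

# Numerically construct the p wavelet functions spanning W^p_0 (here p = 2)
# as the orthogonal complement of V^p_0 inside V^p_1 on a midpoint grid,
# then verify the vanishing-moment property stated above.
p, N = 2, 2000
x = (np.arange(N) + 0.5) / N              # midpoint grid on [0, 1]
w = 1.0 / N                               # quadrature weight
left, right = x < 0.5, x >= 0.5

# V^p_1: polynomials of degree < p on each half; V^p_0: degree < p on [0, 1]
V1 = np.array([x**j * left for j in range(p)] + [x**j * right for j in range(p)])
V0 = np.array([x**j for j in range(p)])

Q, _ = np.linalg.qr((V1 * np.sqrt(w)).T)      # orthonormal basis of V^p_1
P0, _ = np.linalg.qr((V0 * np.sqrt(w)).T)     # orthonormal basis of V^p_0
W = Q - P0 @ (P0.T @ Q)                       # project out the V^p_0 component
U, sv, _ = np.linalg.svd(W, full_matrices=False)
psi = U[:, :p].T / np.sqrt(w)                 # basis of W^p_0 sampled on the grid

for j in range(p):                            # ∫ ψ_k(x) x^j dx = 0 for j < p
    moments = psi @ (x**j * w)
    assert np.all(np.abs(moments) < 1e-8)
print("vanishing moments verified")
```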
We define the space V^p to be the union of the V^p_m,

V^p = ⋃_{m=0}^{∞} V^p_m, (6)

and observe that the closure of V^p, taken with respect to the L2 norm, is L2[0, 1]. In particular, V^1 contains the Haar basis for L2[0, 1], which consists of functions piecewise constant on each of the intervals I_{m,i}. We let {φ_1, . . . , φ_p} denote any orthogonal basis for V^p_0; in view of (5) and (6), the orthogonal system

B^p = {φ_j}_j ∪ {ψ^j_{m,i}}_{i,j,m}

spans L2[0, 1]; we refer to B^p as the multiwavelet basis of order p for L2[0, 1]. Resnikoff and Wells [4] show that B^p may be readily generalized to bases for L2(R), L2(R^d), and L2([0, 1]^d).
3 Second Kind Integral Equations

The matrix representations of integral operators in multiwavelet bases are sparse. We begin this section by introducing some notation for integral equations. A linear Fredholm integral equation of the second kind is an expression of the form

f(x) = g(x) + ∫_a^b K(x, t) f(t) dt, (7)

where we assume that the kernel K is in L2([a, b]^2), the function g is given in L2[a, b], and f is the unknown. For simplicity, we let [a, b] = [0, 1] and write

(Kf)(x) = ∫_0^1 K(x, t) f(t) dt.
Suppose that {θ_i}_{i=1}^{∞} is an orthonormal basis for L2[0, 1]. The expansion of K in this basis is given by the formula

K(x, t) = Σ_{i=1}^{∞} Σ_{j=1}^{∞} K_{ij} θ_i(x) θ_j(t), (8)

where the coefficient K_{ij} is given by the expression

K_{ij} = ∫_0^1 ∫_0^1 K(x, t) θ_i(x) θ_j(t) dx dt, i, j = 1, 2, . . . . (9)

Similarly, the functions f and g have the expansions

f(x) = Σ_{i=1}^{∞} f_i θ_i(x), g(x) = Σ_{i=1}^{∞} g_i θ_i(x),

where the coefficients f_i and g_i are given by

f_i = ⟨f, θ_i⟩ = ∫_0^1 f(x) θ_i(x) dx,
g_i = ⟨g, θ_i⟩ = ∫_0^1 g(x) θ_i(x) dx, i = 1, 2, . . . .

With this notation, the integral equation (7) can be written as an infinite system of equations,

f_i − Σ_{j=1}^{∞} K_{ij} f_j = g_i, i = 1, 2, . . . .

We can truncate the expansion of K at a finite number n of terms and denote the corresponding integral operator by T,

(Tf)(x) = ∫_0^1 Σ_{i=1}^{n} Σ_{j=1}^{n} K_{ij} θ_i(x) θ_j(t) f(t) dt, f ∈ L2[0, 1], x ∈ [0, 1],

which approximates K. Therefore the integral equation (7) can be approximated by the system

f_i − Σ_{j=1}^{n} K_{ij} f_j = g_i, i = 1, . . . , n, (10)

which is a linear system of n equations in the n unknowns f_i. Equations (10) may be solved numerically for an approximate solution of equation (7). In this case the approximate solution is

f_T(x) = Σ_{i=1}^{n} f_i θ_i(x).
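As a concrete illustration of system (10), the sketch below solves Example 1 of Section 4.1 (kernel K(s, t) = st on [0, π/2], exact solution sin s) by a Galerkin discretization. For simplicity it uses normalized shifted Legendre polynomials as the orthonormal basis θ_i, rather than the multiwavelet basis of Section 2:

```python
import numpy as np
from numpy.polynomial import Legendre
from numpy.polynomial.legendre import leggauss

# Galerkin solution of f(x) = g(x) + ∫_a^b K(x,t) f(t) dt with
# g(s) = sin s - s, K(s,t) = s*t, [a,b] = [0, pi/2]; exact solution sin s.
# Normalised shifted Legendre polynomials stand in for the basis θ_i.
a, b = 0.0, np.pi / 2
n = 6                                     # number of basis functions
nodes, weights = leggauss(40)             # Gauss-Legendre rule on [-1, 1]
t = 0.5 * (b - a) * nodes + 0.5 * (b + a)
w = 0.5 * (b - a) * weights

def theta(i, s):
    # i-th orthonormal shifted Legendre polynomial on [a, b]
    u = 2 * (s - a) / (b - a) - 1
    c = np.zeros(i + 1); c[i] = 1
    return Legendre(c)(u) * np.sqrt((2 * i + 1) / (b - a))

T = np.array([theta(i, t) for i in range(n)])   # basis sampled at the nodes
g = np.sin(t) - t

# K_ij = ∫∫ K(s,τ) θ_i(s) θ_j(τ) ds dτ  (Eq. 9), g_i = ∫ g(s) θ_i(s) ds
Kmat = (T * w) @ np.outer(t, t) @ (T * w).T
gvec = T @ (w * g)

f = np.linalg.solve(np.eye(n) - Kmat, gvec)     # system (10): (I - K) f = g

s = np.linspace(a, b, 200)
fT = sum(f[i] * theta(i, s) for i in range(n))  # approximate solution f_T
err = np.max(np.abs(fT - np.sin(s)))
print(err)                                      # very small for this smooth example
```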
We now estimate the error e_T = f − f_T, following the derivation of Delves and Mohamed [5]. Let g_T(x) = Σ_{i=1}^{n} g_i θ_i(x); rewriting equations (7) and (10) in terms of the operators K and T, we have

(I − K)f = g, (I − T)f_T = g_T.

Therefore

(I − K)e_T = (K − T)f_T + (g − g_T).

Provided that (I − K)^{−1} exists, we obtain the error bound

‖e_T‖ ≤ ‖(I − K)^{−1}‖ · ‖(K − T)f_T + (g − g_T)‖. (11)
4 Numerical Performance

To show the efficiency of the numerical method, we consider the following examples. Following Delves and Mohamed [5], the error is measured by

‖e_N‖ = ( ∫_{−1}^{1} e_N^2(t) dt )^{1/2} ≈ ( (1/N) Σ_{i=0}^{N} e_N^2(x_i) )^{1/2},

where

e_N(s_i) = x(s_i) − x_N(s_i), i = 0, 1, . . . , N,

and x_N(s_i) and x(s_i) are, respectively, the approximate and exact solutions of the integral equation.
4.1 Examples

Example 1. x(s) = sin s − s + ∫_0^{π/2} s t x(t) dt, with exact solution x(s) = sin s.

Example 2. x(s) = e^s − (e^{s+1} − 1)/(s + 1) + ∫_0^1 e^{st} x(t) dt, with exact solution x(s) = e^s.

Example 3. x(s) = s + ∫_0^1 K(s, t) x(t) dt, where

K(s, t) = s for s ≤ t, and K(s, t) = t for s ≥ t,

with exact solution x(s) = sec(1) sin s.

The following table shows the computed errors ‖e_N‖ for the above examples.
Table 1: Errors ‖e_N‖ at m = 6 for the multiwavelet method

N   Example 1      Example 2      Example 3
2   5.2 × 10^−2    3.3 × 10^−2    3.5 × 10^−2
3   5.5 × 10^−3    7.2 × 10^−3    1.5 × 10^−3
4   4.6 × 10^−6    9.8 × 10^−5    8.5 × 10^−4
5   5.3 × 10^−9    3.2 × 10^−7    2.8 × 10^−7
6   3.6 × 10^−12   8.9 × 10^−10   1.0 × 10^−9
5 Conclusion
The main advantage of multiwavelets over scalar wavelets in numerical methods lies in their short support, which makes boundaries much easier to handle. If symmetric/antisymmetric multiwavelets are used, it is even possible to use only the antisymmetric components of the boundary function vector for problems with zero boundary conditions. The characteristics of the multiwavelet bases which lead to a sparse matrix representation are that

1. the basis functions are orthogonal to low-order polynomials (they have vanishing moments), and

2. most basis functions have a small interval of support.
References
[1] F. Keinert, "Wavelets and Multiwavelets." Chapman and Hall/CRC, 2004.
[2] C.S. Burrus, R.A. Gopinath, H. Guo, "Introduction to Wavelets and Wavelet Transforms." Prentice Hall, 1998.
[3] G.G. Walter, X. Shen, "Wavelets and Other Orthogonal Systems." Chapman and Hall/CRC, Second Edition, 2001.
[4] H.L. Resnikoff, R.O. Wells, "Wavelet Analysis." Springer, 1998.
[5] L.M. Delves, J.L. Mohamed, "Computational Methods for Integral Equations." Cambridge University Press, Cambridge, 1985.
[6] I. Daubechies, "Ten Lectures on Wavelets." SIAM, Philadelphia, PA, 1992.
[7] I. Daubechies, "Orthonormal Bases of Compactly Supported Wavelets II, Variations on a Theme." SIAM J. Math. Anal., 24(2), pp. 499-519, 1993.
[8] K. Maleknejad, S. Rahbar, "Numerical Solution of Fredholm Integral Equations of the Second Kind by Using B-Spline Functions." Int. J. Eng. Sci., 13(5), pp. 9-17, 2000.
[9] K. Maleknejad, F. Mirzaee, "Using Rationalized Haar Wavelet for Solving Linear Integral Equations." Applied Mathematics and Computation (AMC), 160(2), pp. 579-587, 2005.
[10] K. Maleknejad, H. Mesgarani, T. Nikzad, "Wavelet-Galerkin Solution for Fredholm Integral Equations of the Second Kind." Int. J. Eng. Sci., 13(5), pp. 75-80, 2002.
A Grid-based Context-aware Recommender System for Mobile Healthcare Applications

Mohammad Mehedi Hassan^1, Ki-Moon Choi^2, Seungmin Han^2, Youngsong Mun^3 and Eui-Nam Huh^1
1,2 Department of Computer Engineering, Kyung Hee University, Global Campus, South Korea
3 Department of Computer Engineering, Soongsil University, South Korea
Abstract - In recent years, with their small form factor and ubiquitous connectivity, mobile devices such as smart phones and PDAs offer interesting opportunities for novel services and applications. In this paper, we propose a context-aware doctor recommender system called CONDOR, which recommends suitable doctors for a patient or user at the right time and in the right place, based on his/her preferences and current context information (location, time, weather, distance, etc.). Existing centralized recommender systems (CRSs) cannot resolve the contradiction between good recommendation quality and timely response, which is essential in mobile healthcare; CRSs are also prone to single-point failure and vulnerable to privacy and security threats. We therefore propose a framework that integrates Grid technology with a context-aware recommender system to alleviate these problems. We also present the construction process of the context-aware recommendation, as well as the performance of our architecture compared to existing CRSs.

Keywords: Context-aware recommender system, Grid, Mobile healthcare.
1 Introduction
In recent years mobile computing, where users equipped
with small and portable devices, such as mobile phones,
PDA’s or laptops are free to move while staying connected to
service networks, has proved to be a true revolution [1].
Applications have begun to be developed for these devices to
offer online services to people whenever and wherever they
are. One of the most popular tools provided in e-commerce to accommodate user shopping needs with vendor offers is the recommender system (RS) [2].
However, it is becoming clear that the use of mobile technologies will become quite pervasive in our lives and that we need to support the development of applications in different areas. In particular, we have recently been involved in the development of a context-aware recommender system in a mobile healthcare setting. A patient or user carrying a mobile phone or PDA may move to places he/she has never been to before, and may find it difficult to locate good doctors in those unknown places for emergency healthcare.
Therefore, in this research, we propose a context-aware doctor recommender system called CONDOR (CONtext-aware DOctor Recommender) to recommend suitable doctors for a patient or user at the right time and in the right place in a mobile computing environment. This time-critical recommendation requires system architectures that provide support infrastructure for wireless connectivity, network security and parallel processing of multiple sources of information.
Moreover, unlike stationary desktop machines (PCs), mobile devices (smart phones, PDAs) are constrained by their shape, size and weight. Due to their limited size, these devices tend to be extremely resource constrained in terms of their processing power, available memory, battery capacity and screen size, among others [3]. These portable devices need to access various distributed computing powers and data repositories to support intelligent deployment of the proposed RS.
Furthermore, most of today's recommender systems are centralized, which is suitable for single websites but not for large-scale distributed recommendation applications. Centralized recommender systems (CRSs) cannot resolve the contradiction between good recommendation quality and timely response. In terms of performance, centralized architectures are prone to single-point failure and cannot ensure the low latency and high reliability that are essential in mobile healthcare. CRSs are also vulnerable to privacy and security threats [4].
In this paper, we propose an architecture that combines a context-aware recommender system with Grid technology for mobile healthcare service; very little existing research integrates recommender systems with the Grid. Traditional recommendation mechanisms such as collaborative and content-based approaches are not suitable in this environment. We therefore present a recommendation mechanism that analyzes a user's demographic profile, the user's current context information (i.e., location, time, and weather), doctor information (i.e., education, availability, location, reputation, visit fee, etc.) and the user's position, so that doctors can be ranked according to the match with the user's preferences. The performance of our architecture is evaluated against existing CRSs.
The paper is structured as follows: Section 2 briefly
reviews related works. Section 3 presents the proposed
system architecture. Section 4 describes the recommendation
process. Section 5 shows the analytical performance of our
architecture and finally Section 6 concludes the paper.
2 Related Works
2.1 Decentralized Recommender Systems
Modern distributed technologies need to be incorporated into recommender systems to realize distributed recommendation. Chuan-Feng Chiu et al. proposed a mechanism for community recommendation based on a generic agent framework designed on top of the peer-to-peer (P2P) computing architecture [5]. Peng Han et al. proposed a distributed hash table (DHT) based technique for efficient user database management and retrieval in a decentralized collaborative filtering system [6]. Pouwelse et al. [7] proposed a P2P recommender system capable of social content discovery, called TRIBLER, which uses an algorithm based on an epidemic protocol. However, the above systems did not consider context-awareness, which is very important in mobile computing. Moreover, in healthcare services, parallel processing of multiple sources of information is very important.
Today Grid computing promises the accessibility of vast
computing and data resources across geographically
dispersed areas. This capability is significantly enhanced by
establishing support for mobile wireless devices to access
and perform on-demand service delivery from the Grid [8].
Integration of recommender system with Grid can enable
portable devices (mobile phones, PDA’s) to perform complex
reasoning and computation efficiently with various context
information exploiting the capabilities of distributed resource
integration (computing resources, distributed databases etc.).
The author in [4] proposed a knowledge-Grid-based intelligent electronic commerce recommender system called KGBIECRS, in which the recommendation task is defined as a knowledge-based workflow and the knowledge Grid is exploited as the platform for knowledge sharing and knowledge services. However, context-awareness for mobility support is again not considered. We therefore propose a framework that combines a context-aware recommender system with Grid technology in a mobile environment.
2.2 Context-aware Recommender Systems
Context is any information that can be used to
characterize the situation of an entity. An entity is any
person, place or object that is considered relevant to the
interaction between a user and an application, including the
user and application themselves [9]. Examples of contextual
information are location, time, proximity, and user status and
network capabilities. The key goal of context-aware systems
is to provide a user with relevant information and/or services
based on his current context.
There are many studies in the literature that use context-aware recommender systems in different application areas such as travel, shopping, movies and music [10-13]. All of these RSs are centralized; thus they are prone to single-point failure and vulnerable to security threats. Existing popular recommendation mechanisms such as collaborative filtering, content-based filtering and their hybrids cannot be used in our application area, as they cannot handle both the user's situation and personalization at the same time. We therefore propose an efficient recommendation mechanism that effectively handles a user's current context information and recommends appropriate doctors in normal or emergency conditions.
3 Proposed System Architecture
To provide recommendations, the proposed CONDOR system has the following functional requirements:
(i) Appropriate data preparation/collection
(ii) Creation of personalization method of recommendation
Data preparation/collection includes the following:
a) User Demographic Information: This includes age, gender,
income range, own car, health insurance etc.
b) User Context Information: This includes current location,
distance, time, and weather information.
c) User Preference Information: This is the user’s tendency
toward selecting certain doctor among others.
d) Doctor Information: This includes specialty, board
certification, price, service hour etc. The system will have
access to different hospitals and healthcare databases to
collect doctor information. Doctors can also register their information with our system; they are encouraged to do so to attract more patients and to improve their healthcare quality according to user feedback. Users can also register good doctors in their area with the system.
The overall CONDOR architecture is shown in figure 1.
Our architecture consists of a user interface, middleware and data services. The major components of our architecture are described as follows:
a) Web Portal: This is the web interface for the user, accessible from the user's mobile phone or PDA Internet browser. It also provides the Grid Security Infrastructure (GSI) to keep out unauthorized or malicious users. Users/patients and doctors register their profiles or information through this web portal, users can submit information about doctors through it, and a user submits his recommendation request (query) through it, as shown in figure 2.
b) Context Manager: The context manager retrieves information about the user's current context by contacting the appropriate context information services (see figure 1) and sends the context information to the recommendation web service. Location information is collected via two major positioning technologies, the Global Positioning System (GPS) and the Mobile Positioning System (MPS). The distance is the Euclidean distance between the user location and the doctor location. Time and day information is provided by the computer system, and weather information is obtained from the weather bureau website.
c) OGSA-DAI: OGSA-DAI (Open Grid Services Architecture - Data Access and Integration) [11] is an extensible framework for data access and integration. It exposes heterogeneous data resources so that they can be accessed via stateful web services; no additional code is required to connect to a database or to query data. OGSA-DAI supports an interface integrating various databases, such as XML databases and relational databases, and provides three basic activities: querying data, transforming data, and delivering the results using FTP, e-mail, etc. All information regarding users, doctors, hospitals, recommendation results, doctors' rating information and so on is saved to the distributed databases through OGSA-DAI.

Figure 1: Our Proposed CONDOR System Architecture

Figure 2: User query interface in mobile browser
d) Recommendation Generation Web Service (RGWS): This generates the recommendation using our recommendation technique; usually it returns the top 5 doctors suitable in the user's current context.
e) Map Service: Using map service, the doctor’s location
map is displayed on the user’s mobile phone browser.
The workflow of our architecture is as follows:
(1) When a user or patient needs the recommendation service, a recommendation request is sent to the system from the user's mobile phone web browser.
(2) A recommendation web service is then invoked from the web service factory for that user. It collects the necessary information, such as the user's profile data and current doctors' information for that location through OGSA-DAI, and the user's current context information from the context manager.
(3) The service broker then schedules the recommendation service to run on different machines; a list of recommended doctors is generated using our recommendation algorithm and passed to the user's mobile phone browser through the web portal.
(4) When the user selects a doctor, his location map is
displayed through the map service on the mobile phone’s
web browser display.
(5) The user is also asked to rate doctors he/she has already visited, so that the system can produce better recommendations.
4 Recommendation Process

4.1 Identification of Appropriate Doctors Using a Bayesian Network
Identifying appropriate doctors for an individual mobile patient or user in any location, based on his/her current context, requires an effective representation of the relationships among several kinds of information, as shown in figure 3. Because many uncertainties exist among the different kinds of information and their relationships, a probabilistic approach such as a Bayesian Network (BN) can be utilized.
Bayesian networks (BNs), which constitute a probabilistic framework for reasoning under uncertainty, have in recent years been representative models for context inference [14].
With the Bayesian network, we can formulate a user/patient's interest in a doctor in his/her current situation in the form of a joint probability distribution. (The query fields shown in figure 2 are: Patient/User Id; Doctor Specialty, e.g. Cardiology, medicine; Visit Fee Range: Average, Low, High; Condition: Normal/Emergency; Parking Area: Yes/No.)

Figure 3: A simple recommendation space (user demographic information, user context information and doctor information, with the distance to each doctor)

For
constructing the structure of the Bayesian network, we need knowledge of the user's preferences in choosing a good doctor in any location. A survey by the US National Institutes of Health [15] showed that board certification, rating (reputation), type of insurance accepted, location (distance), visit fee, service hours and the existence of lab test facilities are, in descending order, the most important factors to a
user/patient in choosing a new doctor. Users also value a car parking facility at the doctor's location. Therefore, we
design the BN structure considering user preference
information as shown in figure 4. From the BN structure we
can easily calculate the probability of Interest (u, d) – interest
of user u on a doctor d, that is,
Interest(u, d) = p(interest | user_age, user_gender, doctor_specialty, user_current_location, time, user_income) (1)
Figure 4: A BN structure for finding Interest (u, d)
Bayesian networks built by an expert cannot reflect a change of environment. To overcome this problem, we apply a parameter learning technique: based on the collected data, the CPTs (Conditional Probability Tables) are learned using the EM (Expectation Maximization) algorithm, as follows.

Let V denote the variable set {x_1, x_2, . . . , x_n} and S the Bayesian network structure. The values of each variable x_i in structure S are {x_i^1, x_i^2, . . . , x_i^r}, and D = {C_1, C_2, . . . , C_m} is the sample data set. π_i denotes the parent set of x_i. Then p(x_i^k | π_i^j) is the probability that x_i takes its k-th value while its parents π_i take their j-th configuration; we denote it by θ_ijk. The purpose of Bayesian network parameter learning is to evaluate the conditional probability density p(θ | D, S) using prior knowledge, when the network structure S and the sample set D are given.

The main task in the EM algorithm is calculating the conditional probability p(x_i, π_i | D_l, θ^(t)) for every sample D_l and every variable x_i. When the data set D is given, the log likelihood is

l(θ | D) = Σ_l ln p(D_l | θ) = Σ_{ijk} f(x_i^k, π_i^j) ln θ_ijk, (2)

where f(x_i^k, π_i^j) denotes the number of cases in the data set with x_i = k and π_i = j. The maximum-likelihood estimate of θ is obtained by

θ_ijk = f(x_i^k, π_i^j) / Σ_k f(x_i^k, π_i^j). (3)

The EM algorithm initializes an estimate θ^(0) and improves it iteratively. There are two steps from the current θ^(t) to the next θ^(t+1): expectation calculation and maximization. The expectation step calculates the expectation of the log likelihood under the current parameters when D is given,

l(θ | θ^(t)) = Σ_l Σ_{X_l} ln p(D_l, X_l | θ) p(X_l | D_l, θ^(t)), (4)

where X_l denotes the unobserved variables in sample l; for all θ, the update satisfies l(θ^(t+1) | θ^(t)) ≥ l(θ^(t) | θ^(t)). From equation (2),

l(θ | θ^(t)) = Σ_{i,j,k} f_t(x_i^k, π_i^j) ln θ_ijk,

where

f_t(x_i, π_i) = Σ_l p(x_i, π_i | D_l, θ^(t)). (5)

The maximization step chooses the next θ^(t+1) so as to maximize the expectation of the current log likelihood:

θ_ijk^(t+1) = f_t(x_i^k, π_i^j) / Σ_k f_t(x_i^k, π_i^j). (6)

Equation (5) calculates the expectation and equation (6) the maximization. When the EM algorithm converges slowly, we use the improved E and M steps following the procedure in [16].
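As a toy illustration of the expected-count updates in (5) and (6), the sketch below learns the CPT of a two-node network A → B by EM when some records have A missing. The data, structure and initial values are invented for illustration only; this is not CONDOR's actual Bayesian network:

```python
import numpy as np

# Toy EM for CPT learning in a two-node network A -> B with missing A.
# Records are (a, b) pairs; a may be None (unobserved).
data = [(1, 1), (1, 1), (0, 0), (0, 1), (None, 1), (None, 0), (1, 0), (0, 0)]

pA = 0.5                                  # current estimate of P(A = 1)
theta = np.full((2, 2), 0.5)              # theta[b, a] = P(B = b | A = a)

for _ in range(50):
    # E-step: expected counts f(a) and f(b, a), Eq. (5); a missing A is
    # replaced by its posterior P(A | B = b) under the current parameters.
    fA = np.zeros(2)
    fBA = np.zeros((2, 2))
    for a_val, b_val in data:
        if a_val is not None:
            fA[a_val] += 1
            fBA[b_val, a_val] += 1
        else:
            post = np.array([(1 - pA) * theta[b_val, 0], pA * theta[b_val, 1]])
            post /= post.sum()
            fA += post
            fBA[b_val, :] += post
    # M-step: normalise the expected counts, Eq. (6)
    pA = fA[1] / fA.sum()
    theta = fBA / fBA.sum(axis=0, keepdims=True)

print(np.round(theta, 3))                 # learned CPT P(B | A)
```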
4.2 Calculation of Final Ranking Score Considering User Sensitivity to Location
The CONDOR system takes the user's sensitivity to location into account. We posit that the likelihood of a doctor being visited by a user/patient depends not only on his/her interest in the doctor but also on the distance between them, measured as the Euclidean distance between the user location and the doctor location. Usually a user is most likely to choose a doctor with the highest similarity and the minimum distance; in an emergency, distance gets more priority, and the user will choose a doctor with minimum distance and moderate similarity. We therefore use a distance weight variable (DWV) to measure the user's sensitivity to distance. The DWV for a user u with respect to the different doctors d_i is calculated as follows:
DWV(u, d_i) = log2[ distance_max(u, d) / (distance(u, d_i) + 1) ] / log2[ distance_max(u, d) ], (7)

where i = 1, 2, . . . , n (the number of doctors in the similarity list), distance_max(u, d) is the maximum or farthest distance of a doctor's location from the user's current location among the doctors in the preferred list, and distance(u, d_i) is the distance between the user and doctor d_i.

In formula (7), the DWV is reduced when distance(u, d_i) increases, and the DWV is normalized: if distance(u, d_i) = distance_max(u, d) then DWV ≈ 0, and if distance(u, d_i) = 0 then DWV = 1. Therefore, the final score for any doctor d_i for a user u in the user's present position, with the current user context, is calculated as follows:
$Score(u, d_i) = W_1 \cdot Interest(u, d_i) + W_2 \cdot DWV(u, d_i)$   (8)

where W_1 and W_2 are two weighting factors ($W_1 + W_2 = 1$; $0 \le W_1 \le 1$; $0 \le W_2 \le 1$) reflecting the relative
importance of similarity/interest (u, d) and distance. Based on
the highest score, the doctors will be ranked and
recommended to that particular user/patient.
Initially, W1 = 0.5 and W2 = 0.5, giving equal importance to the interest value and the distance. If an emergency situation arises, W1 = 0.1 and W2 = 0.9. Figure 5 illustrates the scenario of the recommendation process.
Figure 5: CONDOR's recommendation construction process — from the recommendation input space (doctors' information, user preference, user context) to the DRS output in the user's mobile Internet browser
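As a concrete sketch, equations (7) and (8) can be implemented as follows. The interest values and distances are taken from the worked example in Section 5.1; clamping negative DWV values to zero is our assumption, made so that the farthest doctor gets DWV = 0 as in the paper's example.

```python
import math

def dwv(dist_i, dist_max):
    """Distance weight variable, equation (7); clamped to [0, 1]."""
    val = math.log2(dist_max / (dist_i + 1)) / math.log2(dist_max)
    return max(0.0, min(1.0, val))

def score(interest, dist_i, dist_max, w1=0.5, w2=0.5):
    """Final ranking score, equation (8)."""
    return w1 * interest + w2 * dwv(dist_i, dist_max)

# (interest, distance in km) per doctor, as in the example of Section 5.1
doctors = {"D1": (0.8, 2), "D2": (1.0, 8), "D3": (0.65, 5), "D4": (0.6, 3)}
dist_max = max(d for _, d in doctors.values())

normal = {k: score(i, d, dist_max) for k, (i, d) in doctors.items()}
emergency = {k: score(i, d, dist_max, w1=0.1, w2=0.9)
             for k, (i, d) in doctors.items()}
```

Sorting `normal` by score reproduces the ranking D1 > D2 > D4 > D3 of the normal case, while `emergency` yields D1 > D4 > D3 > D2.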
5 Evaluation
5.1 Effectiveness of the Recommendation
Algorithm
In order to find the effectiveness of the recommendation
process, some case studies are presented in this section. We
have created different sample datasets and applied our
recommendation mechanism.
Suppose user x wants to use the CONDOR system to find a doctor of cardiology. He registers his demographic information with the system. Since a Bayesian network is used, all data should be in discrete form. Table 1 shows the preprocessed data set consisting of the demographic information of user x required for registration. Suppose the system finds four cardiology specialists in the user's current location. The user's current context information is shown in Table 2, and Table 3 shows the information of the four doctors (D1, D2, D3 and D4) in the user's current location.
Table 1: Demographic information of user x
  Gender:                         Male
  Age:                            30-39
  Income (1,000 won):             200-250
  Health Insurance No. (if any):  H-0012
  Own Car:                        Yes
Table 2: Context information of user x
  Time:      10am - 11am
  Weather:   Sunny
  Day:       Weekday
  Location:  Differ Distance
Table 3: Doctors' information in current location
                           D1            D2            D3            D4
  Specialty                Card.         Card.         Card.         Card.
  Board Certification      Yes           Yes           No            No
  Overall Rating           0.8           1             0.3           0.3
  Accept Health Insurance  Yes           Yes           Yes           Yes
  Visit Fee                Avg.          Avg.          Low           Avg.
  Service Hour (Weekday)   10a.m.-9p.m.  10a.m.-9p.m.  9a.m.-10p.m.  9a.m.-10p.m.
  Service Hour (Weekend)   10a.m.-3p.m.  11a.m.-4p.m.  10a.m.-3p.m.  10a.m.-3p.m.
  Lab Test Facility        Yes           Yes           No            No
  Distance (km)            2             8             5             3
  Parking Area             Yes           Yes           Yes           No
To calculate the final score of each doctor for recommendation, the probability of interest of user x in each doctor d_i is first calculated with the formula in equation (1)
using the BN structure shown in figure 4. We used Hugin Lite [17] for the calculation. The DWV of each doctor's location is calculated using equation (7). The resulting Interest(x, Di) and DWV values are shown in figure 6.
Figure 6: Interest(x, Di) and DWV values of the four doctors
If user x selects the normal condition, the final ranking score for each doctor is calculated using equation (8) with W1 = 0.5 and W2 = 0.5:

  RankingScore(D1) = 0.5*0.8  + 0.5*0.47 = 0.635
  RankingScore(D2) = 0.5*1.0  + 0.5*0    = 0.5
  RankingScore(D3) = 0.5*0.65 + 0.5*0.14 = 0.395
  RankingScore(D4) = 0.5*0.6  + 0.5*0.33 = 0.465

The final doctors list is displayed in the user's mobile phone web browser as shown in figure 7.
Figure 7: Recommendation results in normal condition
If, at the same position, user x selects the emergency condition, the result is calculated with W1 = 0.1 and W2 = 0.9, as shown in figure 8. Doctor D2 drops out of the list because of its long distance:

  RankingScore(D1) = 0.1*0.8  + 0.9*0.47 = 0.503
  RankingScore(D2) = 0.1*1.0  + 0.9*0    = 0.1
  RankingScore(D3) = 0.1*0.65 + 0.9*0.14 = 0.191
  RankingScore(D4) = 0.1*0.6  + 0.9*0.33 = 0.356
Figure 8: Recommendation results in emergency case
5.2 Effectiveness of the Architecture
In this experiment we concentrate on the on-line workload of the CONDOR system in a Grid environment. Let us model the centralized CONDOR system as an M/G/1 queue. An M/G/1 queue consists of a FIFO buffer into which requests arrive randomly according to a Poisson process at rate λ, and a server that retrieves requests from the queue for servicing. User requests are serviced in first-come-first-served (FCFS) order with mean service rate µ. We use the term 'task' for a recommendation request arriving for service, denote the processing requirement of a request as the 'task size', and let the service time follow a general distribution.
It has been observed that Internet workloads are heavy-tailed in nature [18], characterized by the function $\Pr\{X > x\} \sim x^{-\alpha}$, where $0 \le \alpha \le 2$. In the CONDOR system, the processing requirement of a recommendation request varies with the number of doctors in the requested category: in internal medicine there may be 10,000 doctors, while in cardiology there may be 2,000. Based on the number of doctors, the processing requirement (i.e. the task size) also varies. Thus, the task size on a given CONDOR service capacity follows a Bounded Pareto distribution. The probability density function of the Bounded Pareto $B(k, p, \alpha)$ is:
$f(x) = \dfrac{\alpha k^{\alpha}}{1 - (k/p)^{\alpha}}\, x^{-\alpha-1}$
where $\alpha$ represents the task size variation, $k$ is the smallest possible task size and $p$ is the largest possible task size ($k \le x \le p$). By varying $\alpha$ we can observe distributions that exhibit moderate variability ($\alpha = 2$) to high variability ($\alpha = 1$). We now derive the waiting time $E(W)$, where $W$ is the time a user has to wait for service, $E(N_q)$ is the number of waiting customers, and $E(X)$ is the mean service time. By Little's law, the mean queue length $E(N_q)$ can be expressed in terms of the waiting time: $E(N_q) = \lambda E(W)$, and the load on the server is $\rho = \lambda E(X)$. Let $E(X^j)$ be the j-th moment of the service distribution of the tasks. We have:
$E(X^j) = \dfrac{\alpha k^{\alpha}\,(k^{\,j-\alpha} - p^{\,j-\alpha})}{(\alpha - j)\left(1 - (k/p)^{\alpha}\right)}$ if $j \ne \alpha$;  $E(X^j) = \dfrac{\alpha k^{\alpha}\ln(p/k)}{1 - (k/p)^{\alpha}}$ if $j = \alpha$   (9)

Hence, using the Pollaczek-Khinchine (P-K) formula, we obtain the expected waiting time in the CONDOR system's queue:

$E(W) = \dfrac{\lambda E(X^2)}{2(1-\rho)}$   (10)

Now we want to measure the
expected waiting time with respect to varying server load and task sizes. In the centralized CONDOR system, $E(W)$ increases as the service time and the load on the server increase. In the Grid environment, however, the load is distributed across different machines, and the new load will
be $\rho' = (1 - P_{redirect})\,\rho$ on the primary server, and the new arrival rate will be $\lambda' = (1 - P_{redirect})\,\lambda$. Figure 9 shows the effectiveness of the CONDOR system in a Grid environment compared to a centralized RS (CRS) in terms of expected waiting time, considering task variability $\alpha = 1.5$ and redirection probability $P_{redirect} = 0.5$.
Figure 9 shows a reasonable improvement in expected waiting time in the distributed environment. Without resource sharing, as the system load approaches 1.0, the user-perceived response time of the recommendation service grows sharply.
Figure 9: Effectiveness of the Grid-based CONDOR system architecture compared to a centralized RS
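The queueing analysis above can be checked numerically with a small sketch. The task-size bounds k = 1 and p = 1000 and the offered load ρ = 0.8 are illustrative assumptions; α = 1.5 and P_redirect = 0.5 are the values used for figure 9.

```python
import math

def bp_moment(j, k, p, alpha):
    """j-th moment of the Bounded Pareto B(k, p, alpha), equation (9)."""
    if j == alpha:
        return alpha * k**alpha * math.log(p / k) / (1 - (k / p)**alpha)
    return (alpha * k**alpha * (k**(j - alpha) - p**(j - alpha))
            / ((alpha - j) * (1 - (k / p)**alpha)))

def expected_wait(lam, k, p, alpha):
    """P-K formula, equation (10): E(W) = lam * E(X^2) / (2 * (1 - rho))."""
    rho = lam * bp_moment(1, k, p, alpha)
    assert rho < 1, "the queue must be stable"
    return lam * bp_moment(2, k, p, alpha) / (2 * (1 - rho))

k, p, alpha, p_redirect = 1.0, 1000.0, 1.5, 0.5
lam = 0.8 / bp_moment(1, k, p, alpha)      # choose lambda so that rho = 0.8

w_centralized = expected_wait(lam, k, p, alpha)
w_grid = expected_wait((1 - p_redirect) * lam, k, p, alpha)  # redirected load
```

With half of the requests redirected, the effective load drops from 0.8 to 0.4 and the expected waiting time falls accordingly, which is the qualitative behavior shown in figure 9.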
6 Conclusion
In this paper, we presented a novel context-aware doctor recommender system architecture, called CONDOR, for a Grid environment. The system helps a user/patient find suitable doctors at the right time and in the right place. We also discussed the recommendation process, which efficiently recommends appropriate doctors in both normal and emergency cases. Built on Grid technology, the CONDOR system offers higher performance, stability and quality than centralized systems, which are prone to a single point of failure and lack the capability to improve recommendation quality and privacy. We are currently implementing the architecture and will evaluate our framework using real-world test data in the future.
Acknowledgement
This research was supported by the MKE (Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Advancement) (IITA-2008-C1090-0801-0002).
References
[1] T. F. Stafford and M. L. Gillenson. “Mobile commerce:
what it is and what it could be”. Communications of the
ACM, 46(12), 2003
[2] Special issue on information filtering. Communications
of the ACM, 35(12), 1992
[3] S. Goyal and J. Carter. “A lightweight secure cyber foraging infrastructure for resource-constrained devices”. The Sixth IEEE Workshop on Mobile Computing Systems and Applications, 2004, pp. 186-195
[4] P. Liu, G. Nie, D. Chen and Z. Fu. “The Knowledge Grid-Based Intelligent Electronic Commerce Recommender Systems”. IEEE Intl. Conf. on SOCA, 2007, pp. 223-232
[5] C. Chin, T. K. Shiih and U. Wang. “An Integrated
Analysis Strategy and Mobile Agent Framework for
Recommendation System in EC over Internet”. Tamkang
Journal of Science and Engineering, 2002, 5(3):159-174
[6] P. Han, B. Xie, F. Yang and R. Shen et al. “A scalable
P2P recommender system based on distributed collaborative
filtering”. Expert Systems with Applications, 2004, 27(2)
[7] J. A. Pouwelse, P. Garbacki et al. “Tribler: A social-based peer-to-peer system”. Proceedings of the 5th International P2P Conference (IPTPS 2006), 2006
[8] D. C. Chu and M. Humphrey. “Mobile OGSI.NET: Grid Computing on Mobile Devices”. The 5th IEEE/ACM Int. Workshop on Grid Computing, 2004, Pittsburgh, PA
[9] A.K. Dey. “Understanding and Using Context”,
Personal and Ubiquitous Computing, Vol.5, pp.20-24, 2001
[10] M. V. Setten, S. Pokraev and J. Koolwaaij. “Context-Aware Recommendation in the Mobile Tourist Application: COMPASS”. AH 2004, 26-29 August, Eindhoven, The Netherlands, LNCS 3137, pp. 235-244
[11] W. Yang, H. Cheng and J. Dia, “A Location-aware
Recommender System for Mobile Shopping Environments”.
Journal of Expert System with Applications, 2006
[12] H. Park, J. Yoo and S. Cho. “A Context-aware Music
Recommendation System Using Fuzzy Bayesian Networks
with Utility Theory”. FSKD 2006, LNAI 4223, pp. 970-979
[13] C. Ono, M. Kurokawa, Y. Motomura and H. Asoh. “A
Context-Aware Movie Preference Model Using a Bayesian
Network for Recommendation and Promotion”, UM 2007,
LNAI 4511, pp. 247-257
[14] P. Korpipaa, et al.”Bayesian Approach to Sensor-based
Context Awareness”. Personal and Ubiquitous Computing,
Vol. 7, pp. 113-124, 2003
[15] http://www.niapublications.org/agepages/choose.asp
[16] S. Zhang, Z. Zhang, N. Yang, J. Zhang and X. Wang. “An Improved EM Algorithm for Bayesian Networks Parameter Learning”. Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, 26-29 August 2004
[17] http://www.hugin.com/Products_Services/Products/De
mo/Lite/
[18] M. E. Crovella, M. S. Taqqu and A. Bestavros. “A
Heavy Tailed Probability Distribution in the World Wide
Web”. A Practical Guide To Heavy Tails, Birkhauser Boston
Inc., Cambridge, MA, USA, pp. 3-26, 1998
A Two-way Strategy for Replica Placement in Data Grid

Qaisar Rasool, Jianzhong Li, Ehsan Ullah Munir, and George S. Oreku
School of Computer Science and Technology, Harbin Institute of Technology, China
Abstract – In large Data Grid systems the main objective of replication is to enhance data availability by placing replicas in the proximity of users so that the user-perceived response time is minimized. For a hierarchical Data Grid, replicas are usually placed in either a top-down or a bottom-up way. We put forward a Two-way replica placement scheme that places replicas of the most popular files close to the requesting clients and of less popular files one tier below the Data Grid root. We allow data requests to be serviced by the sibling nodes as well as by the parent. Experimental results show the effectiveness of the Two-way replica placement scheme against no replication.
Keywords: Replication, Replica placement, Data Grid.
1 Introduction

Grid computing [5] is a wide-area distributed
computing environment that involves large-scale resource sharing among collaborations, often referred to as Virtual Organizations, of individuals or institutes located in geographically dispersed areas. Data grids [2] are grid infrastructure with specific needs to transfer and manage massive amounts of scientific data for analysis purposes.
Data replication is an important technique used in distributed systems for improving data availability and fault tolerance. Replication schemes are divided into static and dynamic. While static replication is user-centered and does not adapt to the changing behavior of the system, dynamic replication is more suitable for environments like P2P and Grid systems. In general, a replication mechanism determines which files should be replicated, when to create new replicas, and where the new replicas should be placed.
There are many techniques proposed in research for dynamic replication in Grid [10, 7, 11, 13]. These strategies differ by the assumptions made regarding underlying grid topology, user request patterns, dataset sizes and their distribution, and storage node capacities. Other distinctive features include data request path and the manner in which replicas are placed on the Grid nodes. Two common approaches for replica placement in a tree topology Data Grid are top-down [10, 7] and bottom-up [11]. In both cases, the root of Data Grid tree is considered as the central repository for all datasets to be replicated.
For a Data Grid tree, clients at the leaf nodes usually generate the data requests. A request travels from the client to its parent node in search of a replica until it reaches the root node. In this paper we propose a Two-way replication scheme that takes a different path for data requests. It is assumed that the children under the same parent in the Data Grid tree are linked in a P2P-like manner. For any client request, if the desired data is not available at the client's parent node, the request moves to the sibling nodes one by one until it finds the required data. If none of the siblings can fulfill the request, the request moves to the parent node one level up. There, too, all the siblings are probed, and if the data is not found the request moves to the next parent and ultimately to the root node.
In the Two-way replication scheme we use both bottom-up and top-down approaches to place the data replicas in order to enhance the availability of requested data in the Data Grid. The more frequently requested files are placed close to the clients, and the less frequent files are placed close to the root, one tier below it. The simulation studies show the benefit of the Two-way replication strategy over the case when no replication is used. We perform experiments with data files of uniform size and of variable sizes separately.
2 Data Grid Model
Several Grid activities such as [3, 8] have been launched since the early years of this century. Many practical Grids, for example GriPhyN [12], employ a topology which is hierarchical in nature. The High Energy Physics (HEP) community seeks to take advantage of Grid technology to provide physicists with access to real as well as simulated LHC [8] data from their home institutes. Data replication and management is hence considered to be one of the most important aspects of HEP Data Grids. In this paper we use the hierarchical Data Grid model.
A tree T is used to represent the topology of the Data Grid, which is composed of a root, intermediate nodes and leaf nodes. We hereafter refer to the intermediate nodes as cache nodes and the leaf nodes as client nodes. All client nodes are local sites issuing requests for data stored at the root or cache nodes of the Data Grid. For any parent node, all its children are linked in a P2P-like manner (i.e. are siblings) and can transfer replicas to each other when required. The only exception is the client tier, since the storage space of a client node is very limited and can hold only one file.
Unlike most previous hierarchical Data Grid models, in which the data request sequence follows the path from child to parent up to the root of the tree, our hierarchical model has another data request path. A request moves upward to the parent node only after all the sibling nodes have been searched for the required data. The process is as follows:
1. A client c requests a file f. If the file is available in the client's cache, it is served locally. Otherwise, go to step 2.
2. The request is forwarded to the parent of client c. If the data is found there, it is transferred to the client. Otherwise, the request is forwarded to a sibling node.
3. Probe all sibling nodes one after another in search of the data. If the data is found, it is transferred to the client via the shortest path.
4. If the data is not found at any sibling node, the request is forwarded to the parent node and step 3 is repeated.
5. Step 4 continues until the request reaches the root.
The Data Grid model and example data access paths are shown in Fig.1.
Fig.1. Data access operation in hierarchical Grid model
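The five steps above can be sketched as a lookup routine over the tree. The `parent`, `children` and `holds` maps below are illustrative data structures for a toy topology, not the paper's implementation.

```python
def locate(f, client, parent, children, holds, root):
    """Return the node holding file f, probing siblings before moving up."""
    if f in holds.get(client, set()):            # step 1: client cache
        return client
    node = parent[client]
    while True:
        if f in holds.get(node, set()):          # step 2: current (parent) node
            return node
        siblings = children[parent[node]] if node != root else []
        for sib in siblings:                     # step 3: probe all siblings
            if sib != node and f in holds.get(sib, set()):
                return sib
        if node == root:                         # searched up to the root
            return None
        node = parent[node]                      # step 4: go one level up
```

For a three-level tree, a file held by a sibling cache node is found without the request ever reaching the root.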
3 Two-way Replication Scheme
Replication techniques increase the availability of data and facilitate load sharing. With respect to data access operations, Data Grids are read-only environments in which either new data is introduced or existing data is replicated. There are two ways to satisfy the latency constraints in the system: one is to vary the speed of data transfer, and the other is to shorten the transfer distance. Since bandwidth and CPU speed are usually expensive to change, shortening the transfer distance by placing replicas of data objects closer to the requesting clients is the cheapest and most practical way to ensure faster response times.
The Grid Replication Scheduler (GRS) is the managing entity of the two-way replication scheme. Each cache node and the root hold a Replica Manager that stores information about the requested files and the times when the requests were made. This information is accumulated and communicated to the GRS, which maintains it in a global workload table G with attributes (reqArrivalTime, clientID, fileID), where reqArrivalTime is the time when the request arrived in the system. Replicas are registered in the Replica Catalog when they are created and placed on the nodes of the Grid.
3.1 Replica Creation

A replica management system should be able to handle a large number of replicas and their creation and placement. Like most previous dynamic replication strategies, we base the decision of replica creation on the data access frequency. Over time the GRS accumulates the access request history in the global workload table G. The table G is processed to get a cumulative workload table W with attributes (clientID, fileID, NoA), where NoA stands for Number of Accesses. This table W is used to trigger replication of the requested files. The GRS maintains all the necessary information about the replicas in the Replica Catalog. Whenever the decision is made to initiate replication, the GRS registers the newly created replicas in the Replica Catalog along with their creation times and hosting nodes.
3.2 Replica Placement

As stated, for each newly created replica the GRS decides where to place it and registers it in the Replica Catalog. The placement decision is made in the following way. The GRS sorts the cumulative workload table W by fileID in ascending and NoA in descending order. Then a projection over fileID and NoA is applied to the sorted table SW to get a table F(fileID, NoA). Since the table F may have many NoA entries for a single fileID, it is further processed into AF, containing the aggregate NoA value for each individual file, so that the GRS knows how many times each file was accessed in the system.
Further, the table AF is sorted by NoA to get SF. The upper half of SF contains the entries of the more-frequent files (MFFs) and the lower half the entries of the less-frequent files (LFFs). We simply divide the number of entries in SF by 2 and round the result to an integer. For example, there would be 389 MFFs and 388 LFFs in a table of 777 entries (Fig.2).
A data file may be requested by many clients in the Grid, and the client which generates the maximum number of requests for a file may be termed the best client node to host the replica of that file. We can easily get the information about the best client for any requested file from the table SW. The GRS extracts and stores the entries of the best clients for all files into the table BC. The procedure of obtaining MFFs and LFFs from the sorted workload table SW is depicted in Fig.2.

Fig.2. Procedure to find the MFFs and LFFs

  W:  (c1, a, 20) (c2, b, 30) (c3, a, 45) (c1, c, 40) (c4, c, 10)
  SW: (c3, a, 45) (c1, a, 20) (c2, b, 30) (c1, c, 40) (c4, c, 10)
  F:  (a, 45) (a, 20) (b, 30) (c, 40) (c, 10)
  AF: (a, 65) (b, 30) (c, 50)
  SF: (a, 65) (c, 50) (b, 30)
  BC: a -> c3, b -> c2, c -> c1
  MFFs: a, c   LFFs: b
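The workload processing of Fig.2 can be sketched as follows. This is a simplified sketch; the table names follow the text, and the half-split rounds up, matching the 389/388 example.

```python
from collections import Counter, defaultdict

def split_workload(W):
    """W: list of (clientID, fileID, NoA) records from the cumulative
    workload table. Returns (MFFs, LFFs, best-client table BC)."""
    totals = Counter()                            # AF: aggregate NoA per file
    per_client = defaultdict(Counter)
    for client, f, noa in W:
        totals[f] += noa
        per_client[f][client] += noa
    bc = {f: per_client[f].most_common(1)[0][0] for f in totals}
    sf = [f for f, _ in totals.most_common()]     # SF: files by NoA, descending
    half = (len(sf) + 1) // 2                     # upper half, rounded up
    return sf[:half], sf[half:], bc

# The example workload of Fig.2:
W = [("c1", "a", 20), ("c2", "b", 30), ("c3", "a", 45),
     ("c1", "c", 40), ("c4", "c", 10)]
mff, lff, bc = split_workload(W)
```

On this workload the sketch reproduces Fig.2: MFFs = [a, c], LFFs = [b], and the best clients are a -> c3, b -> c2, c -> c1.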
The storage capacity of Grid nodes is an important consideration while replicating the data as the size may be huge. The client nodes comparatively have the lowest storage capacity in the Data Grid hierarchy. Therefore it is not suitable to choose a best client node as a replica server. However, we can get benefit from the principle of locality by placing replicas very close to the best client. We resolve to place replicas at the parent node of the best client. Thus all MFFs are replicated in the immediate parent nodes of the best clients. All the LFFs are replicated at the children of the root node along the path to the best clients.
Two-wayReplicaPlacement(W)
  t ← Get-Time()
  SW ← Sort(W) over fileID ASC, NoA DESC
  BC ← Extract-BestClient(SW)
  F ← Project(SW) over fileID, NoA
  AF ← Aggregate-NoA(F) over fileID
  SF ← Sort(AF) over NoA
  MFF ← Upper-Half(SF)
  LFF ← Lower-Half(SF)
  concurrently: Replicate-MFF(MFF, BC, t); Replicate-LFF(LFF, BC, t)
Fig.3. Two-way Replica Placement Algorithm
After replication, the GRS flushes the workload tables in order to calculate the access statistics afresh for the next interval. If, while placing a replica, the GRS finds that the desired replica is already present at the selected node, it just updates the replica creation time in the catalog. The steps of the Two-way replica placement algorithm are given in Fig.3.
The replication of the MFFs and LFFs is performed by the functions Replicate-MFF and Replicate-LFF respectively. These functions are executed concurrently to spread the replicas to the selected locations. In each case, the replicas are placed along the way to the best client. Fig.4 depicts the steps of the Replicate-MFF algorithm. For each file entry in the table MFF, the GRS finds its best client by consulting the table BC and then replicates that file at the parent node of its best client. The steps of the Replicate-LFF algorithm are shown in Fig.5.
Replicate-MFF(MFF, BC, t)
  for all records r in MFF do
    f ← r.fileID
    c ← BC[f].nodeID
    p ← Parent(c)
    if Exist-in(f, p)
      Update-CT(f, p, t)
      skip to end    // continue with the next record
    end if
    if AvailableSpace(p, t) < size(f)
      Evacuate(p, t)
    end if
    Replicate(f, p, t)
  end for
Fig.4. Algorithm for Replicating MFF files
If the free space of the replica server is less than the size of the new replica, then some replicas may be deleted to make room for the new one. Replication proceeds only once the available space of the node is greater than or equal to the size of the requested data file. In each case, the function first reserves the storage space for file f at the selected node and then invokes the transmission of file f to the candidate node in the background. The replica is transferred to the selected destination from the closest node that holds a replica of file f. After the transmission is completed, the new replica's creation time is set to t.
Replicate-LFF(LFF, BC, t)
  for all records r in LFF do
    f ← r.fileID
    n ← BC[f].nodeID
    while n ≠ Root do
      n ← Parent(n)
    end while
    c ← Child(n)   // the child of the root on the path to the best client
    if Exist-in(f, c)
      Update-CT(f, c, t)
      skip to end
    end if
    if AvailableSpace(c, t) < size(f)
      Evacuate(c, t)
    end if
    Replicate(f, c, t)
  end for
Fig.5. Algorithm for Replicating LFF files
3.3 Replica Replacement Policy

A replica replacement policy is essential to decide which of the stored replicas should be replaced by a new replica when there is a shortage of storage space at the selected node. At a specific time, the available space of a replica server is equivalent to its remaining free space plus the space occupied by any redundant replicas. A replica is considered redundant if it was created before the current time session and is currently not active or referenced, meaning there is no request for it in the current session.
The function Evacuate in Replicate-MFF and Replicate-LFF is used for this purpose. For a given node, it checks the creation times of all present replicas and continues to remove redundant replicas until the storage space is sufficient to host the new replica.
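A minimal sketch of this eviction policy follows. The record layout `(file, size, created)` and the `active` set of files referenced in the current session are illustrative assumptions, not the paper's data structures.

```python
def evacuate(store, active, t, need):
    """Remove redundant replicas (created before session t and not referenced
    in the current session) until at least `need` units are freed.
    store: list of (file, size, created) replicas held at the node."""
    freed, kept = 0, []
    for f, size, created in store:
        if freed < need and created < t and f not in active:
            freed += size                 # redundant replica: evict it
        else:
            kept.append((f, size, created))
    return kept, freed
```

Replicas that are either referenced in the current session or created in it are never evicted, so an active replica survives even under space pressure.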
4 Experiments
In order to evaluate the proposed Two-way replica placement scheme, we conducted experiments using the Data Grid model shown in Fig.1. The link bandwidths were set according to the estimates shown in Table 1. We consider the data to be read-only, so there are no consistency issues involved.
The simulation runs in sessions, each session having a random set of requests generated by various clients in the system. When a client generates a request for a file, the replica of that file is fetched from the nearest replica server, transported, and saved in the local cache of the client. Initially all data is held only at the root node. As time progresses, access statistics are gathered and used for the decisions on replica creation and placement. When a replica is being transferred from one node to another, the link is considered busy for the duration of the transfer and cannot be used for any other transfer simultaneously. We used an access pattern with a medium degree of temporal locality; that is, some files were requested more frequently than others, and such requests made up 60% of the whole.
Table 1. Grid tier link bandwidths

  Data Grid tiers   Bandwidth   Scaling
  Tier 0 - Tier 1   2.5 Gbps    1 MB
  Among Tier 1      7.0 Gbps    2.8 MB
  Tier 1 - Tier 2   2.5 Gbps    1 MB
  Among Tier 2      7.0 Gbps    2.8 MB
  Tier 2 - Tier 3   622 Mbps    0.24 MB
We ran experiments in two categories. In the first category we used files of variable sizes ranging from 500 MB to 1 GB, and in the second category files of a uniform size of 1 GB. For convenience, we scaled the file sizes and bandwidth values uniformly, reducing a 1 GB file to 3.2 MB. The scaling of the bandwidth values between tiers is given in Table 1.
[Figure: average response time (sec) for five request patterns with a medium degree of temporal locality, comparing No Replication and the Two-way Strategy.]
Fig.6. Experiment with data files of variable sizes

[Figure: average response time (sec) for five request patterns with a medium degree of temporal locality, comparing No Replication and the Two-way Strategy.]
Fig.7. Experiment with data files of uniform size
For each client node, we kept a record of how much time it took for each requested file to be transported to it. The average of this time over various simulation runs was calculated. Compared to the no-replication case, the Two-way replication technique shows better performance, as can be seen in Fig.6 and Fig.7. Since the file size is uniform in the second category of experiments, Fig.7 shows a straight line for the no-replication case.
5 Related Work
An initial work on dynamic data replication in Grid environments was done by Ranganathan and Foster [10], who proposed six replication strategies: No Replication or Caching, Best Client, Cascading, Plain Caching, Cascading plus Caching and Fast Spread. Their analysis reveals that among these top-down schemes, Fast Spread shows relatively consistent performance across various access patterns and performs best when access patterns are random, whereas Cascading offers good results when locality is introduced. In order to find the nearest replica, each technique selects the replica server site that is the least number of hops from the requesting client.
In [1] an improvement of the Cascading technique, the Proportional Share Replication policy, is proposed. The method is a heuristic that places replicas at optimal locations when the number of sites and the total number of replicas to be distributed are known. Work on dynamic replication algorithms is presented by Tang et al. [11], who show an improvement over the Fast Spread strategy while keeping the size of the data files uniform. Their technique places replicas on the Data Grid tree nodes in a bottom-up fashion.
In [9, 13], replica placement problem is formulated mathematically followed by theoretical proofs of the solution methods.
A hybrid topology is used in [6], where ring and fat-tree replica organizations are combined into multi-level hierarchies. Replication of a dataset is triggered when the requests for it at a site exceed some threshold. A cost model evaluates the data access costs and the performance gains of creating replicas. The replication strategy places a replica at the site that minimizes the total access costs, including both read and write costs, for the datasets. The experiments reflect the impact of the sizes of different data files and the storage capacities of the replica servers. The same authors proposed a decentralized data management middleware for Data Grids in [7]. Among the various components of the proposed middleware, the replica management layer is responsible for the creation of new replicas and their transfer between Grid nodes. The experiments consider the top-down and bottom-up methods separately, with the data repository located at the root of the Data Grid and at the bottom (the clients) respectively.
Each of these techniques uses either the top-down or the bottom-up method for replica placement. In our work we use both methods together in order to enhance data availability.
6 Conclusions

The management of the huge amounts of data generated by scientific applications is a challenge. By replicating frequently used data to selected locations, we can enhance data availability. In this paper we proposed a two-way replication strategy for Data Grid environments. The most frequent files are placed very close to the users, and the less frequent files are replicated one tier below the root of the Grid hierarchy. The experimental results show the effectiveness of the two-way replica placement scheme.
7 References

[1] J. H. Abawajy. “Placement of File Replicas in Data Grid Environments”. Proceedings of the International Conference on Computational Science, LNCS 3038, 66-73, 2004.
[2] A. Chervenak, I. Foster, C. Kesselman, C. Salisbury and S. Tuecke. “The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets”; Journal of Network and Computing Appl., v23, 3, 187-200, 2000.
[3] EGEE, http://www.eu-egee.org/
[4] Q. Fan, Q. Wu, Y. He, and J. Huang. “Transportation Strategies of the Data Grid” International Conference on Semantics, Knowledge, and Grid (SKG), 2006.
[5] Ian Foster and Carl Kesselman. “The Grid2: Blueprint for a New Computing Infrastructure” Morgan Kaufmann, 2003.
[6] H. Lamehamedi, B.K. Szymanski, Z. Shentu and E. Deelman. “Data Replication Strategies in Grid Environments” Proceedings of International Conf on Antennas and Propagation (ICAP), IEEE Computer Science Press, Los Alamitos, CA, 378-383, Oct 2002.
[7] H. Lamehamedi and B.K. Szymanski. “Decentralized Data Management Framework for Data Grids”; FGCS (Elsevier), v23, 1, 109-115, 2007.
[8] LHC Computing Grid, http://www.cern.ch/LHCgrid/
[9] Y.F. Lin, P. Liu and J.J. Wu. “Optimal Placement of Replicas in Data Grid Environment with Locality Assurance” International Conference on Parallel and Distributed Systems (ICPADS), 2006.
Int'l Conf. Grid Computing and Applications | GCA'08 | 111
[10] Kavitha Ranganathan and Ian Foster. “Identifying Dynamic Replication Strategies for a High-performance Data Grid” Proceedings of International Grid Computing Workshop, LNCS 2242, 75-86, 2001.
[11] Ming Tang, Bu-Sung Lee, Chai-Kiat Yeo and Xueyan Tang. “Dynamic Replication Algorithms for the Multi-tier Data Grid”; FGCS (Elsevier), v21, 775-790, 2005.
[12] Grid Physics Network, http://www.griphyn.org/
[13] Y. Yuan, Y. Wu, G. Yang and F. Yu. “Dynamic Data Replication based on Local Optimization Principle in Data Grid” 6th International Conference on Grid and Cooperative Computing (GCC), 2007.
The Influence of Sub-Communities on a Community-Based Peer-to-Peer
System with Social Network Characteristics
Amir Modarresi1, Ali Mamat2, Hamidah Ibrahim2, Norwati Mustapha2
Faculty of Computer Science and Information Technology, University Putra Malaysia
[email protected], 2{ali,hamidah,norwati}@fsktm.upm.edu.my
Abstract
The objective of this paper is to investigate the effect of sub-communities on Peer-to-Peer systems which have social network characteristics. We propose a general Peer-to-Peer model based on social networks to illustrate these effects. In this model the whole system is divided into several communities based on the contents of the peers, and each community is divided into several sub-communities. A computer-based model is created and social network parameters are calculated in order to show the effect of sub-communities. The results confirm that a large community with many highly connected nodes can be substituted with many sub-communities of normal nodes.
Keywords: Peer-to-peer computing, social network,
community
1. Introduction
Many systems form a network structure: a set of vertices connected by edges. Among them we can mention social networks, such as collaboration networks and acquaintance networks; technical networks, such as the World Wide Web and electrical networks; and biological networks, such as food webs and neural networks. The concepts of social networks are applicable in many technical networks, especially those which connect people. These concepts help designers capture more information about the group of people using the network, and the result is better services for the group according to their interests and needs. Peer-to-Peer (P2P) systems also fit this structure: since people are behind the nodes, social network concepts are applicable to them.
From a theoretical point of view, P2P systems create a graph in which each node is a vertex and each neighborhood relation between two nodes is an edge. When no criterion is considered for choosing a neighbor, this graph is a random graph [1]; however, two important factors [2] change this characteristic in P2P: 1) the principle of limited interest, which states that each peer is interested in only a few of the contents of other peers, and 2) the law of spatial locality. Since each node represents one user in the system, a P2P system is a group of users with different interests who try to find similar users. Such a structure creates a social network. On the other hand, Barabási [3] has shown that in real social networks the probability of a node with a high degree occurring is very low; in other words, the higher the degree, the less likely it is to occur. This relation is captured by a power-law distribution of node degrees, i.e. p(d) = d^(-k), where k > 0 is the parameter of the distribution. The network model defined by the characteristics in [3] has a short characteristic path length and a large clustering coefficient, as well as a degree distribution that approaches a power law. The characteristic path length is a global property which measures the separation between two vertices, whereas the clustering coefficient is a local property which measures the cliquishness of
a typical neighborhood.
As an example, we envision the scenario of sharing knowledge among researchers. Since each researcher has a limited number of interests, he can communicate with other researchers who work in the same areas of interest. Because of many limitations, such as distance and resources, researchers usually work with their colleagues in the same institute or college. Sometimes these connections are extended to other places in order to obtain more cooperation. This behavior defines a social network with some dense clusters, where the clusters are connected by a few connections, as in figure 1. If each researcher is represented by one node, a P2P system is created which obeys social network characteristics.
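The power-law degree distribution p(d) = d^(-k) described above can be sampled as a quick sanity check; this is an illustrative sketch, where the exponent k, the degree cut-off, and the sample size are assumed parameters, not values from the paper.

```python
import random
from collections import Counter

# Sample node degrees from a truncated power law p(d) ~ d^(-k): low degrees
# dominate, high-degree "hub" nodes are rare. k and max_degree are
# illustrative assumptions.

def sample_degrees(n: int, k: float = 2.0, max_degree: int = 100,
                   seed: int = 42) -> list[int]:
    rng = random.Random(seed)
    support = list(range(1, max_degree + 1))
    weights = [d ** (-k) for d in support]       # unnormalized p(d) = d^(-k)
    return rng.choices(support, weights=weights, k=n)

degrees = sample_degrees(10_000)
counts = Counter(degrees)
# Degree 1 is far more frequent than degree 10, mirroring figure 3's shape.
```

With k = 2, roughly 60% of sampled nodes have degree 1, while degree 10 appears about a hundred times less often.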
The remainder of this paper is organized as follows. In section 2 some related works are reviewed. In section 3 the community concepts and the structure of the model are explained. In section 4 the setup of a computer-based model and the results are presented, and section 5 concludes the paper.
2. Related Works
Different structures and strategies have been introduced for P2P systems to obtain better performance and scalability. This section mainly reviews those approaches which focus on communities and peer clustering.
Locality-proximate clusters have been used to connect all peers with the same proximity in one cluster. Hop count and time zone are some of the criteria for detecting such proximity. In [4] general clusters were introduced, which support an unfixed number of clusters. Two kinds of links, local and global, connect each node to other nodes in its own cluster or to nodes in other clusters. This clustering system is not concerned with the content of nodes; physical attributes are the main criteria for forming clusters. In [5] a Semantic Overlay Network (SON) is created based on common characteristics in an unstructured model. Peers with the same contents are connected to each other and form a SON, which is in effect a semantic cluster. The whole system can be considered as a set of SONs with different interests. If a peer in SON S1, for example, searches for content unrelated to its own group, finding a proper peer is not always efficient: if there is no connection between S1 and the proper SON, flooding must be used.
Common interest is another criterion for building a proper overlay. In [6] all peers with the same interest make connections with each other, but the locality of the peers within an interest group is not considered. In [7] peers with the same interests are recognized after many proper answers, based on those interests, have been received. Such peers make shortcuts, i.e. logical connections, to each other. After a while a group of peers with the same interests is created, and the peer richest in connections becomes the leader of the group. Since this structure is based on an unstructured system and on receiving proper answers within the range of the issued queries, we cannot expect all peers with the same interests in the system to be gathered in one group.
In [8] communities are considered. The authors describe a community as gregariousness in a P2P network. Each community is created by one or more peers that have several things in common. The main concern of that paper is connectivity among peers in communities; neither the criteria for creating a community nor its size is explained. In [9] communities are modeled on human communities and can overlap. For each peer three main groups of interest attributes are considered, namely personal, claimed, and private. The interests of each peer and the communities in the system are defined as collections of those attribute values, and peers whose attributes conform to a specific community join it. Since 25 different attributes are used in the model, finding a peer which has the same values for all of these attributes is not easy; that is why a peer may join different communities with only a partial match of its attributes. Although the concept of communities is the same as in our work, in our model a shared ontology defines the whole environment and one community is a part of that environment. There is also a bootstrapping node in each domain in order to prevent node isolation. Our model also uses such nodes, but their main role is controlling sub-communities. [10] uses a shared ontology in an unstructured P2P system for peer clustering. Each peer advertises its expertise to all of its neighbors, and each neighbor can accept or reject this advertisement according to its own expertise. The expertise of each peer is identified by the contents of the files the peer has stored. Since an ontology is used, a generic definition of the whole environment of the model is provided, which is better than using a set of specific attributes.
Super peers have also been used for controlling peer clustering and for storing global information about the system. In [11] super peers are used in a partially centralized model for indexing. All peers who obey specific system-known rules can connect to a designated super peer. This creates a cluster in which all peers have some common characteristics. Search in each cluster is done by flooding, but sending a query to just one group of peers produces better performance. Under these rules, super peers who control common rules must maintain a larger index; therefore they need more disk space and CPU power. In [12] elements of an ontology are used for indexing instead of rules. In this structure each cluster is created based on the indexed ontology, which is similar to our method: all peers with the same attribute are indexed. Our model also uses super peers and elements of an ontology for indexing, but instead of
Figure 1: Many related clusters create a community
referring to each node in the cluster, the super peers refer to the representative of that cluster, which controls the sub-communities of a specific community. This reduces the size of the index to the number of elements in the ontology, which is usually smaller than the number of peers in a large system, and provides better scalability.
3. Overview and Basic Concepts of the
Proposed Model
We investigate the effect of sub-communities based on a proposed social-network P2P model. First we briefly state the community concept, and then we introduce the model. The model uses an ontology for defining the environment of the system and creating communities. It also uses super peers for referring to these communities. Sub-communities are considered for better locality inside each community.
3.1. Community Concepts
A social network can be represented by a graph G(V, E), where V denotes a finite set of actors (simply, people) in the network and E ⊆ V × V denotes the relationships between connected actors. Milgram [13] has shown that the world around us seems to be small: he showed experimentally that the average shortest path between any two persons has length six. Although this is an experimental result, we experience it in the real world; we often meet persons who are unknown to us but turn out to be a friend of one of our friends. People usually form social clusters, of different sizes, based on their interests. Such clusters, which are usually dense in connections, are connected to each other by a few paths. All of these clusters with similar characteristics create a community. In each community: 1) each person must be reachable in a reasonable number of steps (what Milgram called the small world), and 2) each person must have some connections to others, which are captured by the clustering coefficient. With such characteristics, structures like trees or lattices cannot show the behavior of a social network. As stated in section 1, each dense cluster in the network is connected to a few other clusters. In each cluster, some individuals, called hubs, are more important than others, because they have more knowledge or more connections than other individuals. In order to join a cluster, a new member must address either a known person or a member of the cluster.
We summarize an example as an instantiation of our
model. A computer scientist regularly has to search for publications or correct bibliographic metadata. The scenario we explain here is a community of researchers who share bibliographic data via a peer-to-peer system. Such scenarios have been described in [14] and [10]. The whole data environment can be defined by the ACM ontology [15]. Each community in the system is defined by an element of the ontology and represented by a representative node. Each community comprises many sub-communities, or clusters, each gathered around a hub. We show the effectiveness of sub-communities with the model. Figure 2 depicts this example.
From a social network point of view, the hubs in our model have the same characteristics as in a social network: a knowledgeable person with rich connections. They define the sub-communities, or clusters, inside a community, and they may also make connections with other hubs. Representatives and super peers work as bridges in the model among communities; the difference is that representatives work as bridges among communities which are closer in the ontology, while super peers work as bridges among all communities. Peers have a capacity for making connections which is defined based on a power-law distribution. This prevents the model from degenerating into a tree-like structure.
3.2. Definition of the Model
We define our P2P model M (as in figure 2) as follows.
M has a set of peers P = {p1, p2, ..., pn}. Each peer pi can have d direct neighbors, which form the set Ni = {di1, di2, ..., did}, where dij denotes the jth neighbor of peer pi. A direct neighbor pk is one logical hop away from pi, although the physical connection between pi and pk may consist of more than one hop.
M uses a shared ontology O to define the environment of the model; therefore, by using a proper ontology, this model can be applied to different environments. For simplicity, all peers pi have the same ontology.
O can define several logical communities which have many peers with a common interest.
Figure 2: Principal elements of the model (a community, its sub-communities, hubs, ordinary peers, a representative, and a super peer; communities are labeled with ACM ontology elements such as ACMTopic/Software and ACMTopic/Information_Systems)
Each
community cl contains at least one known member who is the representative of that community. This role is usually granted to the first peer who defines the new community cl, and this peer is identified by rl.
Since each community in the real world is a set of clusters, or sub-communities, and the members of each cluster usually obey some kind of proximity, such a structure must be considered in the model. Good criteria for measuring proximity in a network are the number of hops, or the IP address as a less precise metric. Since all peers in one community have similar interests, placing peers separated by fewer hops close together may provide shorter distances among peers. Such a configuration gives better response times for queries whose answers are in one community; in other words, locality of interest is established in a better form inside the community. Popular peers, or hubs, are defined in order to provide such locality. Hubs are peers that are rich in content and connections in a specific place. Each hub creates a sub-community, or cluster, in one area of a community. All other peers with the same interest which are near the hub connect to it and complete the sub-community. Since hubs establish many connections to other peers, they are a good point of reference for searching for documents. Hubs in one community can be connected to each other in order to create non-dense connections among sub-communities, as in figure 1. They create the set ppl for cl. Formally, ∀ cl: ppl = {p1, p2, ..., ps}, where s is the number of sub-communities (hubs) in the community, determined by the policy of the system. Each hub in the community cl is also referred to by the representative rl.
Once a peer pi joins the community, pi asks the representative rl about the sub-communities inside the community. rl sends the addresses of all hubs in ppl of the community. pi communicates with each of them and calculates its distance from each member of ppl. The shortest distance to a member of ppl identifies the cluster which pi must join.
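The join step just described (probe every hub, join the closest) can be sketched as below. The distance function, the peer identifiers, and the hop-count table are illustrative assumptions, not part of the paper's protocol specification.

```python
# Minimal sketch of the cluster-selection step: the joining peer measures
# its distance to each hub in pp_l and picks the closest one. The hop-count
# table and all names are made up for illustration.

def choose_subcommunity(new_peer, hubs, distance):
    """Return the hub whose sub-community the new peer should join."""
    return min(hubs, key=lambda hub: distance(new_peer, hub))

# Example with a hypothetical hop-count table for a joining peer "p9":
hops = {("p9", "h1"): 4, ("p9", "h2"): 2, ("p9", "h3"): 7}
closest = choose_subcommunity("p9", ["h1", "h2", "h3"],
                              lambda p, h: hops[(p, h)])
# closest == "h2"
```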
The contents of the files shared by each pi identify the interests of pi. These interests can be defined by the shared ontology O, which is stored in each peer to make the relationships among communities understandable. If pi has different kinds of files which indicate different interests, pi can contribute to different communities cl; as a result, two communities can be connected to each other via pi.
M also has a set of super peers SP = {sp1, sp2, ..., spm}, where m ≪ n.
At join time, pi announces its interest to the closest super peer spk. According to the announcement, spk identifies the community with a similar interest; in effect, spk identifies rl, the representative of the community cl. Since the number of communities is as small as the number of elements in the shared ontology, the size of the index is smaller than in similar models in which super peers index files or even peers in the system. This feature lets the system be more scalable than similar ones.
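The super-peer index described above can be sketched as a simple mapping from ontology elements to representatives; its size grows with the ontology, not with the number of peers. The class shape and the peer identifiers ("r1", "r2") are illustrative assumptions; the topic strings follow the ACM ontology example from the paper.

```python
# Sketch of a super peer's index: ontology element -> community
# representative. However many peers join each community, the index holds
# one entry per ontology element.

class SuperPeer:
    def __init__(self):
        self.index = {}  # ontology element -> representative peer id

    def register_community(self, topic, representative):
        self.index[topic] = representative

    def lookup(self, topic):
        """Return the representative of the community with this interest."""
        return self.index.get(topic)

sp = SuperPeer()
sp.register_community("ACMTopic/Software", "r1")
sp.register_community("ACMTopic/Information_Systems", "r2")
# sp.lookup("ACMTopic/Software") == "r1"; the index has 2 entries
# regardless of how many peers each community contains.
```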
The structure of the model M creates interconnected communities. This lets any peer search the network for any interest, even if that interest is not the same as its own. Super peers spk provide such connections at the highest level of the model, while the representative rl of community cl creates interconnected sub-communities at the lower level. Sub-community interconnection provides a path to nearly any peer inside the community. This means that a piece of data can be retrieved with high probability and with low overhead, because only one part of the system, a single community, is searched for a specific query.
4. Simulation Setup
We wrote a simulator to create a computer-based community model, to show the behavior of sub-communities and to what extent they are close to a social network. We ran two different experiments. In the first experiment we simulate just one community, assuming that all peers with the same contents based on a shared ontology are gathered in a specific community; social network parameters are then calculated in order to show the efficiency of the model. Since in each community all peers have the same characteristics, showing the behavior of one community is enough to show the characteristics of the whole system, although simulating more than one community is also possible.
In the second experiment we define two different communities with different interests and create queries in the system; we then calculate the number of successful queries and the recall of the answers.
In both experiments, we define the number of peers in the model in advance and assign each peer a capacity for making connections with other peers based on a power-law distribution. The distribution of connections among peers, as defined by the simulator, is shown in figure 3. During the simulation, the joining and leaving of peers are not considered. The first peer who joins a community is chosen as the representative of that community. Following the definition of the model, peers that are richer in connections are chosen as hubs. When a new peer joins the community, it communicates with the representative, which returns the addresses of the hubs in the community. The new peer then sends a message to each hub and calculates its distance from them; this is modeled by defining random hop counts between the new peer and the hubs. The hub with the
smallest path is chosen. Since hubs are normal peers with a higher capacity for accepting connections, another hub is chosen if all of the first hub's connections are already in use. There are several reasons for such a connection limit. First, it allows controlling the connection distribution in the system. Second, once all hubs are full, a new peer must connect to other normal peers; this mimics the behavior of a member joining a community through another member. If the new peer has a capacity of more than one connection, its other neighbors are chosen randomly. First, members inside the same sub-community are chosen, because they may be at a shorter distance; then, if those peers cannot accept any more connections, peers from other sub-communities are chosen. These kinds of connections create potential bridges among sub-communities which, together with the representative of the community, keep the different sub-communities connected. Since locality is the main concern, such connections are established only if the target peer is rich in favored content.
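The simulated join procedure above (nearest hub first, fallback when capacities are exhausted, remaining slots filled with other peers) can be sketched as follows. This is not the authors' simulator; the data structures (capacity and links dictionaries) and the random-fill policy are illustrative assumptions.

```python
import random

# Sketch of the join step in the simulation: connect the new peer to the
# closest hub that still has free capacity, then fill its remaining
# connection slots with other peers that can still accept connections.

def join(new_peer, hubs_by_distance, capacity, links, rng=None):
    rng = rng or random.Random(0)
    links.setdefault(new_peer, set())
    # 1) Connect to the closest hub with free capacity (fallback to next).
    for hub in hubs_by_distance:
        if len(links.setdefault(hub, set())) < capacity[hub]:
            links[hub].add(new_peer)
            links[new_peer].add(hub)
            break
    # 2) Fill remaining slots with randomly chosen peers that have capacity.
    candidates = [p for p in links
                  if p != new_peer and len(links[p]) < capacity[p]]
    rng.shuffle(candidates)
    for p in candidates:
        if len(links[new_peer]) >= capacity[new_peer]:
            break
        if p not in links[new_peer]:
            links[p].add(new_peer)
            links[new_peer].add(p)
    return links
```

For example, if hub "h1" is already full, a new peer falls back to "h2" and then fills its second slot with an ordinary peer.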
Figure 3: Distribution of connection among nodes
Watts [16] has shown that a small-world graph lies between regular and random graphs: it has a characteristic path length as low as that of a random graph and a clustering coefficient as high as that of a regular graph. The highest clustering coefficient belongs to the fully connected graph, whose shortest path length is obviously 1. We therefore calculate the clustering coefficient and the characteristic path length for the model.
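The two small-world metrics used in this evaluation can be computed on a plain adjacency dictionary as below; this is a pure-Python sketch (no graph library assumed), with the example graph made up for illustration.

```python
from collections import deque
from itertools import combinations

def clustering_coefficient(adj):
    """Average local clustering coefficient over all nodes of an
    undirected graph given as {node: set_of_neighbors}."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with < 2 neighbors contribute 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def characteristic_path_length(adj):
    """Average shortest-path length over all connected node pairs (BFS)."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(d for n, d in dist.items() if n != src)
        pairs += len(dist) - 1
    return total / pairs

# Toy example: a triangle (0, 1, 2) plus a pendant node 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

On this toy graph the clustering coefficient is (1 + 1 + 1/3 + 0) / 4 = 7/12 and the characteristic path length is 16/12 = 4/3.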
Table 1 shows the clustering coefficient for a community with 500 nodes, while the maximum number of connections a peer can accept and the number of hubs (sub-communities) are varied.
As expected, defining sub-communities increases the clustering coefficient, even with just one sub-community. With a small number of hubs and a low capacity for accepting connections, many peers are connected to each other without any connection to a hub; this effect produces the longer characteristic path lengths in table 2. When the number of connections is increased, the clustering coefficient also increases; moreover, there is more chance for other peers to connect to hubs, which decreases the characteristic path length. When the capacity for accepting connections is high, larger than the number of peers in the community, the graph of the model moves toward a complete graph, which explains the larger values of the clustering coefficient. Moreover, the existence of many points of reference (hubs) in the model decreases the characteristic path length. Needless to say, when peers have a high capacity for accepting connections, many other clusters are created inside each sub-community; since they are implicit, reaching them is not very fast, except through their explicit sub-community.
Table 1: Clustering coefficient for different numbers of sub-communities and maximum connections
Figure 4: Clustering coefficient as the maximum number of connections changes, for different numbers of sub-communities
Since the results in table 2 show the path length within one community, the value for the whole model is calculated by adding 2 extra steps. This is the average path length when a peer in one community tries to reach a peer in a different community through the available super peers in the model.
In the second experiment we define two different communities and 1000 peers in advance. Each peer chooses one of the defined communities and joins it during network construction. This selection is based on the interest of the peer, which is related to the files loaded at that peer: 60 percent of the files loaded at each peer match the peer's own interest, and the other 40 percent are unrelated to it. Moreover, we do not consider any special shared data capacity for hubs: a peer chosen as a hub has a larger data capacity drawn from the same distribution that the other peers obey. Considering extra storage would produce better results.

Table 1 (clustering coefficient):

Max. connections   40 hubs   20 hubs   10 hubs   5 hubs   1 hub   No hub
10                 0.39      0.26      0.19      0.14     0.11    0.096
50                 0.41      0.39      0.42      0.29     0.15    0.11
100                0.43      0.40      0.42      0.46     0.25    0.15
500                0.57      0.56      0.53      0.54     0.59    0.48
1000               0.69      0.64      0.64      0.62     0.64    0.48
Table 2: Characteristic path length for different numbers of sub-communities and maximum connections
Figure 5: Characteristic path length as the maximum number of connections changes, for different numbers of sub-communities
Throughout the simulation, a peer is chosen randomly and poses a query. The query may match the peer's own interest or any of the other interests defined in the model. If the query matches the peer's interest, answers are found in the same community; otherwise the query is sent to the proper community. We define a successful query as one that retrieves at least one result. As expected, increasing the number of sub-communities, and with it the clustering coefficient, also increases the number of successful queries; figure 6 shows this result. The recall rate is likewise increased by these changes; figure 7 shows the recall values.
Figure 6: Success rate as the maximum number of connections changes, for different numbers of sub-communities
Figure 7: Recall values for different numbers of sub-communities as the maximum number of connections changes
In an unstructured network where flooding is used for answering queries, the number of neighbors has a direct effect on the recall rate and also on network traffic. By defining many sub-communities we can obtain the same recall rate with fewer network connections. Moreover, while answering queries, only one part of the network is affected by the traffic produced.
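The two metrics reported above can be stated precisely: a query is successful when it retrieves at least one result, and recall is the fraction of all relevant documents in the system that a query retrieves. A minimal sketch, with made-up document identifiers:

```python
# Success rate: fraction of queries that returned at least one result.
def success_rate(results_per_query):
    return sum(1 for r in results_per_query if r) / len(results_per_query)

# Recall for one query: |retrieved ∩ relevant| / |relevant|.
def recall(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(relevant)

# Example: three queries, one of which found nothing.
rate = success_rate([["d1"], [], ["d2", "d3"]])   # 2/3
r = recall(["d2", "d3"], ["d2", "d3", "d4"])      # 2/3
```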
5. Conclusion
We conclude that when the sub-community concept is used, powerful nodes in P2P systems can be substituted with normal nodes and a few connections. Choosing a trade-off between the maximum number of connections in the system and the number of sub-communities can reduce resource consumption in the nodes and index size in the representatives. In other words, by using sub-communities a P2P model like ours can be constructed with regular nodes instead of powerful nodes.
References
[1] Erdős, P. and Rényi, A. "On Random Graphs", 1956.
[2] Chen, H., Z., Huang and Gong, Z., "Efficient Content
Location in Peer-to-Peer Systems.", In proceedings of the
2005 IEEE International Conference on e-Business
Engineering (ICEBE’05), 2005.
[3] Barabási, Albert-László and Albert, Réka, "Emergence of Scaling in Random Networks.", Science, vol. 286, pp. 509-512, 1999.
[4] Hu, T.-H. and Seneviratne, A., "General Clusters in Peer-to-Peer Networks.", ICON, 2003.
[5] Crespo, A. and Garcia-Molina, H., "Semantic Overlay
Networks for P2P Systems.", USA : Agents and Peer-to-
Peer Computing (AP2PC), 2004.
[6] Sripanidkulchai, K., Maggs, B. M. and Zhang, H.,
"Efficient Content Location Using Interest-Based Locality
in Peer-to-Peer Systems", INFOCOM, 2003.
[7] Chen, Wen-Tsuen, Chao, Chi-Hong and Chiang, Jeng-
Long., "An Interested-based Architecture for Peer-to-Peer
Network Systems.", AINA 2006, 2006.
Table 2 (characteristic path length):

Max. connections   40 hubs   20 hubs   10 hubs   5 hubs   1 hub   No hub
10                 3.08      3.75      4.14      4.24     4.36    4.42
50                 2.64      2.67      2.76      2.83     2.88    2.92
100                2.48      2.47      2.54      2.55     2.52    2.61
500                2.04      2.01      2.06      2.11     1.91    1.96
1000               1.95      1.91      1.91      1.93     1.88    2.01
[8] Shijie, Z., et al., "Interconnected Peer-to-Peer Network:
A Community Based Scheme.", AICT/ICIW 2006, 2006.
[9] Khambatti, M., Dong Ryu, K. and and Dasgupta, P.,
"Structuring Peer-to-Peer Networks Using Interest-Based
Communities.", DBISP2P 2003, Springer LNCS 2944,
2003.
[10] Haase, P., et al., "Bibster - A Semantics-Based
Bibliographic Peer-to-Peer System.", In international
Semantic Web Conference 2004, 2004.
[11] Nejdl, W., et al., "Super Peer-Based Routing and
Clustering Strategies for RDF-Based Peer-to-Peer
Networks.", In proceedings of WWW 2003, 2003.
[12] Schlosser, M., et al., "HyperCuP—Hypercubes,
Ontologies and Efficient Search on P2P networks.",
Bologna, Italy : In International Workshop on Agents and
Peer-to-Peer Computing, 2002.
[13] Milgram, S., "The Small World Problem.", Psychology Today, vol. 1, no. 1, pp. 61-67, 1967.
[14] Ahlborn, B., Nejdl, W. and Siberski, W., "OAI-P2P: A
Peer-to-Peer Network for Open Archives.", In 2002
International Conference on Parallel Processing Workshops
(ICPPW’02), 2002.
[15] ACM. 1998 ACM Computing Classification System.
[Online] http://www.acm.org/class/1998.
[16] Watts, D. and Strogatz, S., "Collective Dynamics of 'Small-World' Networks.", Nature, vol. 393, pp. 440-442, 1998.
SESSION
SECURITY AND RELIABILITY ISSUES + MONITORING STRATEGIES
Chair(s)
TBA
Estimating Reliability of Grid Systems using Bayesian Networks
O. Doguc, J. E. Ramirez-Marquez
Department of Systems Engineering and Engineering Management, Stevens Institute of Technology, Hoboken, New Jersey 07030, USA
Abstract - With the rapid spread of computational environments for large-scale applications, computational grid systems are becoming popular in various areas. In general, a grid service needs to use a set of resources to complete certain tasks; thus, in order to provide a grid service, it is necessary to allocate the required resources to the grid. The conventional reliability models have some common assumptions that cannot be applied to grid systems. This paper discusses the use of Bayesian networks as an efficient tool for estimating grid service reliability.
Keywords: Bayesian networks, K2, Grid system reliability
1 Introduction
With the rapid spread of computational environments for large-scale applications, computational grid systems are becoming popular in various areas. However, most of these applications require the utilization of various geographically or logically distributed resources, such as mainframes, clusters and data sources owned by different organizations. In such circumstances, grid architecture offers an integrated system in which communication between two nodes is available. The ability to share resources is a fundamental concept for grid systems; therefore, resource security and integrity are prime concerns [1]. Traditionally, the function of computer networks has been to exchange files between two remote computers, but in grid systems networks are expected to provide all kinds of services, such as computing, management and storage [2].
Grid system reliability becomes an important issue for system users due to excessive system requirements. As an example, the Internet is a large-scale computational grid system. Due to the large number of Internet users and the resources shared through the Internet, interactions between users and resources cannot be easily modeled. Moreover, Internet users can share resources through other users, and the same resources can be shared by multiple users, which makes it difficult to understand the overall system behavior.
As a recent topic, there are a number of studies on estimating grid system reliability in the literature [3-6]. In these studies, grid system reliability is estimated by focusing on the reliabilities of the services provided in the grid system. For this purpose, the grid system components involved in a grid service are organized into spanning trees, and each tree is studied separately. However, these studies mostly focus on understanding grid system topology rather than estimating actual system reliability; thus, for simplification, they make certain assumptions about component failure rates, such as their satisfying a probabilistic distribution.
Bayesian networks (BN) have been proposed as an efficient method for reliability estimation [7-9]. For systems engineers, BN provide significant advantages over traditional frameworks, mainly because they are easy to interpret and can be used in interaction with domain experts in the reliability field [10]. From a reliability perspective, a BN is a directed graph whose nodes represent system components and whose edges show the relationships among them. Each edge is assigned a probabilistic value that expresses the degree (or strength) of the relationship it represents. Using the BN structure and these probabilistic values, system reliability can be estimated with the help of Bayes' rule. Several recent studies on reliability estimation using BN [7, 9, 11-13] require specialized networks designed for a specific system. That is, the BN used for analyzing system reliability must be known beforehand, so that it can be built by an expert with adequate knowledge of the system under consideration. However, human intervention is always open to unintentional mistakes that can cause discrepancies in the results [14].
To address these issues, this paper introduces a methodology for estimating grid system reliability by combining several techniques: BN construction from raw component and system data, association rule mining, and evaluation of conditional probabilities. Based on an extensive literature review, this is the first study to incorporate these methods for estimating grid system reliability. Unlike previous studies, the proposed methodology does not rely on any assumptions about component failure rates in grid systems. Our methodology automates the process of BN construction using the K2 algorithm (a commonly used association rule mining algorithm), which identifies the associations among grid system components using a predefined scoring function and a heuristic. The K2 algorithm has proven efficient and accurate for finding associations [15] from a dataset of historical data about the system, reducing the algorithmic complexity of finding associations from exponential to quadratic [15] with respect to the number of components. Once the BN is constructed, the reliabilities of grid services can be estimated with the help of Bayes' rule.

2 Grid Systems
Unlike typical distributed systems, computational grid systems require large-scale sharing of resources across different types of components. A service request in a grid system involves a set of nodes and links through which the service can be provided. In a grid system, Resource Managers (RM) control and share resources, while Root Nodes (RN) request service from RM (an RN may also share resources) [5]. The reliability of a grid system can be estimated from the reliabilities of the services provided through the system.

Several studies in the literature focus on the reliability of grid systems, but many rely on certain assumptions [3-6, 16], as discussed below. Dai and Wang present a methodology to optimally allocate the resources in a grid system in order to maximize grid service reliability [5]; they use a genetic algorithm to search the large solution space efficiently. Later, Levitin and Dai propose dividing grid services into smaller tasks and subtasks, and assigning the same tasks to different RM for parallel processing [16].
An example grid system is displayed in Figure 1. RM are shown as single circles and RN as double circles. In order to evaluate the reliability of a grid service, the links and nodes involved in that service must be identified. Dai and Wang show that the links and nodes in each grid service form a spanning tree [5]. The resource spanning tree (RST) is defined as a tree that starts from the requesting RN (as its root) and covers all resources required for the requested grid service [5].

For example, in the grid system in Figure 1, when the RN G1 requests the resources R1 and R5, there are several paths to provide this service: {G1, G2}, {G1, G3, G5}, {G1, G4, G6}, {G1, G3, G5, G6} and {G1, G2, G5, G6} are some of the RST that include the requested resources. The number of RST for a grid service can be quite large. A minimum resource spanning tree (MRST) is defined as an RST that does not contain any redundant components; thus, when a component is removed from an MRST, it no longer spans all the requested resources [5]. For example, in the grid system in Figure 1, {G1, G3, G5} is an MRST but {G1, G2, G5} is not, because {G1, G2} already covers all the requested resources. The reliability of a service in a grid system can be evaluated using the reliabilities of services through MRST, and it has been shown that the MRST in a grid system can be discovered efficiently [5].
Figure 1: A sample grid system
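The minimality check described above can be sketched in a few lines. This is an illustrative encoding, not the paper's algorithm: the assignment of resources to nodes (G2 holding R1 and R5, G3 holding R1, G5 holding R5) is a hypothetical reading of the Figure 1 example, and tree connectivity is assumed rather than checked.

```python
# Hypothetical resource-to-node assignment for the Figure 1 example.
RESOURCES = {"G2": {"R1", "R5"}, "G3": {"R1"}, "G5": {"R5"}}

def covers(nodes, requested):
    """True if the node set collectively holds every requested resource."""
    held = set()
    for n in nodes:
        held |= RESOURCES.get(n, set())
    return requested <= held

def is_mrst(nodes, root, requested):
    """An RST is minimal if removing any non-root node breaks coverage."""
    if not covers(nodes, requested):
        return False
    return all(not covers(nodes - {n}, requested) for n in nodes - {root})
```

Under this encoding, `is_mrst({"G1", "G3", "G5"}, "G1", {"R1", "R5"})` holds, while `{G1, G2, G5}` fails the check because `{G1, G2}` already covers both resources, matching the example in the text.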
3 Bayesian Networks
Estimation of system reliability using BN dates back to 1988, when it was first described by Barlow [17]. The use of BN in systems reliability has mainly gained acceptance because of the simplicity with which they represent systems and the efficiency of obtaining component associations. The concept of BN has been discussed in several earlier studies [18-20]. More recently, BN have found applications in software reliability [21, 22], fault-finding systems [19], and general reliability modeling [23].
A BN can be summarized as an approach that represents the interactions among the components of a system from a probabilistic perspective. This representation takes the form of a directed acyclic graph, where the nodes represent variables and the links between pairs of nodes represent causal relationships between those variables. From a system reliability perspective, the variables of a BN are the components of the system, while the links represent the interactions of components leading to system "success" or "failure". In a BN this interaction is represented as a directed link between two components, forming a parent-child relationship, such that the dependent component is called the child of the other. The success probability of a child node is therefore conditional on the success probabilities of each of its parents. The conditional probabilities of the child nodes are calculated using Bayes' theorem with the probability values assigned to the parent nodes. The absence of a link between two nodes of a BN indicates that those components do not interact for system failure/success; they are considered independent of each other, and their probabilities are calculated separately. As discussed in detail in Section 3.2, calculations for the independent nodes are skipped during the process of system reliability estimation, reducing the total amount of computational work.

3.1 Bayesian Network Construction Using Historical Data
The K2 algorithm for constructing a BN was first defined by Cooper and Herskovits as a greedy heuristic search method [24]. The algorithm searches for the parent set of a node that has the maximum association with it. The K2 algorithm is composed of two main elements: a scoring function to quantify associations and rank parent sets according to their scores, and a heuristic to reduce the search space when looking for the parent set with the highest degree of association [24]. Without the heuristic, the K2 algorithm would need to examine all possible parent sets, i.e., starting from the empty set, it would have to consider all subsets. Even with a restriction on the maximum number of parents (u), the search space would be as large as 2^u (the total number of subsets of a set of size u), requiring an exponential-time search to find the optimal parent set. With the heuristic, the K2 algorithm does not need to consider the whole search space: it starts from the assumption that the node has no parents and incrementally adds the parent whose addition most increases the scoring function. When no single additional parent can increase the score, the algorithm stops adding parents to the node. The heuristic thus reduces the size of the search space from exponential to quadratic.
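The greedy parent search just described can be sketched as follows. This is a generic skeleton, not the authors' implementation: the scoring function `f` is passed in as a parameter, and the function names are illustrative.

```python
# Sketch of the K2 greedy parent search for a single node: start from
# the empty parent set and repeatedly add the single candidate parent
# that most increases the score, stopping when no addition helps.

def k2_parents(node, predecessors, f, max_parents):
    """Greedy parent-set search for `node` among its `predecessors`."""
    parents = set()
    best = f(node, parents)
    while len(parents) < max_parents:
        candidates = [p for p in predecessors if p not in parents]
        # Score each candidate added to the current parent set.
        scored = [(f(node, parents | {p}), p) for p in candidates]
        if not scored:
            break
        top_score, top = max(scored, key=lambda t: t[0])
        if top_score <= best:
            break  # no single addition improves the score
        parents.add(top)
        best = top_score
    return parents, best
```

With a toy scoring function that rewards a known "good" parent set, the loop adds the good parents one at a time and then stops, mirroring the quadratic behavior described above.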
With the help of the K2 algorithm, we develop a methodology that uses historical system and component data to construct a BN model. As stated above, the BN model is very efficient at representing and calculating the interactions between system components.
3.2 Estimating Reliability Using Bayesian Networks
BN are known to be useful for assessing probabilistic relationships and identifying probabilistic mappings between system components [3]. Within the BN, each component is assigned an individual conditional probability table (CPT). The CPT of a given component X contains p(X|S), where S is the set of X's parents. In X's CPT, each of its parents is instantiated as either "Success" or "Failure"; for m parents there are therefore 2^m different parent-set instantiations, and thus 2^m entries in the CPT. The BN is complete when all the conditional probabilities have been calculated and represented in the model.
To illustrate these concepts, the BN shown in Figure 2 presents an expert's perspective on how the five components of a system interact. The child-parent relationships of the components can be observed in this BN; on the quantitative side [13], the degrees of these relationships (associations) are expressed as probabilities, and each node is associated with a CPT.

In Figure 2, the topmost nodes (G1, L1 and L2) have no incoming edges and are therefore conditionally independent of the rest of the components in the system. The prior probabilities assigned to these nodes should be known beforehand, with the help of a domain expert or from historical data about the system. Based on these prior probabilities, the success probability of a dependent node such as G3 can be calculated using Bayes' theorem, as shown in Equation (1):
p(G3 | G1, L1) = p(G1, L1 | G3) p(G3) / p(G1, L1)    (1)
As shown in Equation (1), the probability of node G3 depends only on its parents, G1 and L1, in the BN shown in Figure 2. The total number of computations needed to calculate this probability is reduced from 2^n (where n is the number of nodes in the network) to 2^m, where m is the number of parents of the node (and m << n). As with the prior probabilities, the CPT can be computed from historical data about the system. Equation (1) can be applied similarly to node G5, using L2 as input. The overall system reliability (i.e., the success probability of the bottom-most node) can also be calculated using the same equation.
Figure 2: A sample Bayesian network
4 Our Approach – Estimating Grid System Reliability Using BN
There are many studies [7-11, 25-27] in the literature that define reliability estimation methods for traditional small-scale systems. However, these studies mostly rely on certain assumptions about the system topology and the operational probabilities of the components (links and nodes). In dynamic grid systems, these assumptions may be invalid, as links can be destroyed or established on the fly. Moreover, due to the dynamic creation and modification of nodes and links, logical connections between nodes may exist [5]; thus the operational probabilities of nodes and links cannot be assigned constant values. There are several recent studies on estimating grid service reliability using MRST [3-5], as discussed in Section 2; however, these studies rely on the assumption that node and link failures satisfy a given probabilistic distribution.
In this paper we present a methodology that uses the K2 algorithm to construct the BN model for each MRST without the need for a human domain expert. In our methodology, historical grid system data is used to estimate node and link reliabilities with the help of the BN model. As explained in Section 2, we focus on the reliabilities of the MRST in a grid system in order to evaluate the overall grid service reliability. Each MRST can be thought of as a smaller grid system [5] and can therefore be modeled and evaluated separately. For example, {G1, G3, G5} forms an MRST with three nodes and two links in the example grid system of Figure 1. Thus it can be considered a system with five components and modeled with a BN, in which each component of the MRST is represented by a separate node. Once a historical system dataset (as shown in Table 1) is available, we can use the K2 algorithm to construct a BN for each MRST.
Table 1: Example historical dataset

Observation | G1 | L1 | G3 | L2 | G5 | MRST Behavior
     1      |  0 |  0 |  1 |  1 |  0 |       0
     2      |  1 |  0 |  0 |  0 |  0 |       0
     3      |  1 |  0 |  0 |  1 |  1 |       0
     4      |  1 |  1 |  1 |  1 |  1 |       1
     5      |  1 |  1 |  1 |  1 |  1 |       1
     6      |  0 |  1 |  0 |  0 |  1 |       0
     7      |  0 |  0 |  0 |  0 |  0 |       0
     8      |  1 |  1 |  1 |  1 |  1 |       1
     9      |  0 |  1 |  0 |  1 |  0 |       0
    10      |  1 |  0 |  0 |  1 |  1 |       0
In Table 1, the five components of the example MRST are represented in separate columns (Gs represent nodes and Ls represent links). Each row shows the state of the components at the time instant ti when the observation was made. For simplicity and without loss of generality, our methodology assumes that the component failure data is binary: for each component in the MRST, the value 0 represents failure and the value 1 represents full functionality. The last column of Table 1 gives the overall MRST Behavior (0 represents failure, 1 represents availability of the grid service through the MRST).
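The historical dataset of Table 1 can be transcribed directly into code, and the priors of the parentless nodes computed as relative frequencies. This is a plain transcription of the table; the dictionary layout and function name are illustrative.

```python
# Table 1 observations as parallel binary columns
# (1 = functional, 0 = failed).
DATA = {
    "G1":   [0, 1, 1, 1, 1, 0, 0, 1, 0, 1],
    "L1":   [0, 0, 0, 1, 1, 1, 0, 1, 1, 0],
    "G3":   [1, 0, 0, 1, 1, 0, 0, 1, 0, 0],
    "L2":   [1, 0, 1, 1, 1, 0, 0, 1, 1, 1],
    "G5":   [0, 0, 1, 1, 1, 1, 0, 1, 0, 1],
    "MRST": [0, 0, 0, 1, 1, 0, 0, 1, 0, 0],
}

def prior(component):
    """Relative frequency of success (value 1) for a component."""
    col = DATA[component]
    return sum(col) / len(col)
```

On this data, `prior("G1")` gives 0.6 and `prior("L1")` gives 0.5, which are the prior values used later in Section 4.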
The K2 algorithm finds the associations between these components and outputs the BN structure. In the first step of our proposed method, the K2 algorithm starts with the first component in the dataset, G1. As G1 has no preceding components, i.e., no candidate parents in the BN, the K2 algorithm skips it and picks the second component in the dataset, L1.
For L1, there are two candidate parent sets: the empty set φ and {G1}. The K2 algorithm therefore computes the scoring function f for each candidate parent set and compares the results; the candidate set with the highest f score is chosen as the parent set of L1. At the end of this iteration the values 1/1320 and 1/4200 are computed and compared, and the former, the score of the empty set φ, is selected. The K2 algorithm thus decides that L1 has no parents, which means there is no association between G1 and L1.
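The scoring function can be sketched using the standard Cooper-Herskovits metric for binary variables. A caveat: applied to the Table 1 data, this standard metric reproduces the single-parent scores reported for G3 (1/2800 for {G1} and 1/1800 for {L1}) but gives slightly different values for the empty set and the two-parent set, so the sketch is illustrative of the mechanism rather than a reproduction of the paper's exact formulation.

```python
from fractions import Fraction
from itertools import product
from math import factorial

# Table 1 columns needed for scoring G3 (1 = functional, 0 = failed).
DATA = {
    "G1": [0, 1, 1, 1, 1, 0, 0, 1, 0, 1],
    "L1": [0, 0, 0, 1, 1, 1, 0, 1, 1, 0],
    "G3": [1, 0, 0, 1, 1, 0, 0, 1, 0, 0],
}

def ch_score(node, parents):
    """Cooper-Herskovits score for binary variables (r = 2):
    product over parent instantiations j of
    (r-1)! / (N_j + r - 1)! * N_j0! * N_j1!"""
    rows = range(len(DATA[node]))
    score = Fraction(1)
    for inst in product([0, 1], repeat=len(parents)):
        # Observations matching this parent instantiation.
        match = [t for t in rows
                 if all(DATA[p][t] == v for p, v in zip(parents, inst))]
        n1 = sum(DATA[node][t] for t in match)
        n0 = len(match) - n1
        score *= Fraction(factorial(n0) * factorial(n1),
                          factorial(n0 + n1 + 1))
    return score
```

For example, `ch_score("G3", ["G1"])` evaluates to 1/2800 and `ch_score("G3", ["L1"])` to 1/1800 on this data, matching the corresponding entries of Table 2.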
In the following iterations of the K2 algorithm, the number of candidate parent sets to consider, and thus the amount of computation needed for the f scores, increases. Skipping the details, the f scores of the candidate parent sets for component G3 are given in Table 2. Because the K2 algorithm iterates over the components in their dataset order, components L2 and G5 are not considered as candidate parents of G3. The K2 algorithm selects {G1, L1} as the parent set of G3, because it has the highest f score. The number of computations grows with the position of the component in the ordering, and when the K2 algorithm finishes with the last column (MRST Behavior in Table 1), it outputs the BN structure displayed in Figure 2.
Table 2: f scores for all possible candidate parent sets for G3

Parent Set | f score
φ          | 1/1320
{G1}       | 1/2800
{L1}       | 1/1800
{G1, L1}   | 1/640
The next step of the proposed method estimates system reliability using the BN constructed by the K2 algorithm. Besides the associations discovered in the previous step, the inference rules described in Section 3 are used to calculate the conditional probabilities. The conditional probabilities are calculated and stored in CPT: each component with a non-empty parent set in the BN is associated with a CPT, while components with no parents are independent of the others and are associated with prior probabilities, as explained in Section 3. Each CPT has 2^u entries, where u is the number of parents of that component in the network.
The probability values in the CPT are calculated from the raw data in Table 1 and can be expressed as probabilities conditioned on an instantiation of the parent set. For example, the probability of G3 being 0 given the parent instantiation G1=0 and L1=0 is 0.5: in two of the ten observations the parents are both instantiated as 0, and in one of those two cases G3 is 0. With the help of the CPT and the prior probabilities of G1 and L1, the success probability of G3 can then be calculated. According to the BN structure in Figure 2, components G1 and L1 are independent of the others; their success probabilities can therefore be inferred directly from the observation dataset in Table 1, which gives p(G1=1)=0.6 and p(L1=1)=0.5.
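The CPT lookups and the priors described above can be sketched directly from the Table 1 data. The final marginal for G3 is a derived illustration, not a figure from the paper: it assumes the parents G1 and L1 are independent, as the Figure 2 structure implies, and the function names are illustrative.

```python
# CPT entries and priors estimated as relative frequencies from Table 1.
DATA = {
    "G1": [0, 1, 1, 1, 1, 0, 0, 1, 0, 1],
    "L1": [0, 0, 0, 1, 1, 1, 0, 1, 1, 0],
    "G3": [1, 0, 0, 1, 1, 0, 0, 1, 0, 0],
}

def cpt(child, value, **parent_vals):
    """p(child = value | parent instantiation), from relative frequencies."""
    rows = [t for t in range(len(DATA[child]))
            if all(DATA[p][t] == v for p, v in parent_vals.items())]
    return sum(1 for t in rows if DATA[child][t] == value) / len(rows)

def prior(c):
    return sum(DATA[c]) / len(DATA[c])

# Marginalize the success probability of G3 over the joint parent
# instantiations, assuming G1 and L1 are independent.
p_g3 = sum(cpt("G3", 1, G1=g, L1=l)
           * (prior("G1") if g else 1 - prior("G1"))
           * (prior("L1") if l else 1 - prior("L1"))
           for g in (0, 1) for l in (0, 1))
```

Here `cpt("G3", 0, G1=0, L1=0)` reproduces the 0.5 entry discussed in the text, and the priors come out as 0.6 and 0.5 as stated.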
Continuing the computations for the other components in the network, the success probabilities of the remaining components of the sample MRST can be evaluated: p(L2=1)=0.6 and p(G5=1)=0.75. In the last step, the MRST reliability is calculated from these probability values and the CPT of the MRST Behavior node in the BN structure of Figure 2. The success probability of the MRST Behavior node is 0.35, or 35%, which is the reliability of the MRST used in this section. Note that this reliability value is based on only 10 observations of the sample system. With more observations available, the K2 algorithm provides more accurate estimates of the degrees of association between the system components and computes more precise values for the CPT of the nodes, which in turn increases the accuracy of the calculated system reliability.
Extending this methodology, BN can be constructed from historical data to calculate the reliabilities of each of the MRST, and these reliability values can be combined using Bayes' rule to calculate the overall grid service reliability [5]. Unlike other studies on grid systems, our methodology does not rely on any assumptions about node and link failures in the system. Moreover, with the help of the K2 algorithm [24] and the BN model [17-19, 21-23], our methodology provides efficient and accurate reliability values.

5 Conclusions
A grid system is a recently developed architecture for large-scale distributed systems. In a grid system there can be many nodes that are logically and physically distributed, and large-scale sharing of resources between these nodes is essential. There are mainly two types of nodes in a grid system: RM share resources, and RN request service from them. Identification of the links and nodes between RN and RM is essential for estimating the reliability of a requested service. Due to their special and complex nature, traditional reliability estimation methods cannot be used for grid systems. Dai and Wang [5] show that these nodes and links form RST, define MRST as RST without redundant components, and provide an efficient algorithm to find the MRST for a given grid service. In this paper we combine the BN model with MRST to estimate grid service reliability. Unlike previous studies on grid system reliability, our methodology does not rely on any assumptions about component failure rates. Instead, using historical system data, we represent each MRST of a grid service with a BN and, as shown in earlier studies, we can efficiently and accurately estimate the reliability of the MRST using BN.

6 References

[1] Butt, A. R., Adabala, S., Kapadia, N. H., Figueiredo, R. and Fortes, J. A. B., (2002), "Fine-Grain Access Control for Securing Shared Resources in Computational Grids", International Parallel and Distributed Processing Symposium Proceedings, pp.22-29.
[2] Wenjie, L., Guochang, G. and Hongnan, W., (2004), "Communication and Recovery Issues in Grid Environment", In Proceedings of the 3rd International Conference on Information Security, pp.82-86.
[3] Dai, Y. S. and Levitin, G., (2007), "Optimal Resource Allocation for Maximizing Performance and Reliability in Tree-Structured Grid Services", IEEE Transactions on Reliability, vol.56(3), pp.444-453.
[4] Dai, Y. S., Pan, Y. and Zou, X., (2007), "A Hierarchical Modeling and Analysis for Grid Service Reliability", IEEE Transactions on Computers, vol.56, pp.681-691.
[5] Dai, Y. S. and Wang, X., (2006), "Optimal Resource Allocation on Grid Systems for Maximizing Service Reliability Using a Genetic Algorithm", Reliability Engineering & System Safety, vol.91(9), pp.1071-1082.
[6] Dai, Y. S., Xie, M. and Poh, K. L., (2006), "Reliability of Grid Service Systems", Computers and Industrial Engineering, vol.50, pp.130-147.
[7] Amasaki, S., Takagi, Y., Mizuno, O. and Kikuno, T., (2003), "A Bayesian Belief Network for Assessing the Likelihood of Fault Content", In Proceedings of the 14th International Symposium on Software Reliability Engineering, p.125.
[8] Boudali, H. and Dugan, J. B., (2006), "A Continuous-Time Bayesian Network Reliability Modeling and Analysis Framework", IEEE Transactions on Reliability, vol.55(1), pp.86-97.
[9] Gran, B. A. and Helminen, A., (2001), "A Bayesian Belief Network for Reliability Assessment", Safecomp 2001, vol.2187, pp.35-45.
[10] Sigurdsson, J. H., Walls, L. A. and Quigley, J. L., (2001), "Bayesian Belief Nets for Managing Expert Judgment and Modeling Reliability", Quality and Reliability Engineering International, vol.17, pp.181-190.
[11] Hugin Expert, (2007), Aalborg, Denmark, http://www.hugin.dk
[12] Dahll, G. and Gran, B. A., (2000), "The Use of Bayesian Belief Nets in Safety Assessment of Software Based Systems", Special Issues of International Journal on Intelligent Information Systems, vol.24(2), pp.205-229.
[13] Langseth, H. and Portinale, L., (2005), "Bayesian Networks in Reliability", Technical Report, Dept. of Computer Science, University of Eastern Piedmont "Amedeo Avogadro", Alessandria, Italy.
[14] Inamura, T., Inaba, M. and Inoue, H., (2000), "User Adaptation of Human-Robot Interaction Model Based on Bayesian Network and Introspection of Interaction Experience", In Proceedings of the 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp.2139-2144.
[15] Kardes, O., Ryger, R. S., Wright, R. N. and Feigenbaum, J., (2005), "Implementing Privacy-Preserving Bayesian-Net Discovery for Vertically Partitioned Data", In Proceedings of the Workshop on Privacy and Security Aspects of Data Mining, Houston, TX.
[16] Levitin, G. and Dai, Y. S., (2007), "Performance and Reliability of a Star Topology Grid Service with Data Dependency and Two Types of Failure", IIE Transactions, vol.39(8), p.783.
[17] Barlow, R. E., (1988), "Using Influence Diagrams", in Accelerated Life Testing and Experts' Opinions in Reliability, pp.145-150.
[18] Cowell, R. G., Dawid, A. P., Lauritzen, S. L. and Spiegelhalter, D. J., (1999), Probabilistic Networks and Expert Systems, Springer-Verlag, New York, NY.
[19] Jensen, F. V., (2001), Bayesian Networks and Decision Graphs, Springer-Verlag, New York, NY.
[20] Pearl, J., (1988), Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, San Francisco, CA.
[21] Fenton, N., Krause, P. and Neil, M., (2002), "Software Measurement: Uncertainty and Causal Modeling", IEEE Software, vol.10(4), pp.116-122.
[22] Gran, B. A., Dahll, G., Eisinger, S., Lund, E. J., Norstrøm, J. G., Strocka, P. and Ystanes, B. J., (2000), "Estimating Dependability of Programmable Systems Using BBNs", In Proceedings of Safecomp 2000, Springer, pp.309-320.
[23] Bobbio, A., Portinale, L., Minichino, M. and Ciancamerla, E., (2001), "Improving the Analysis of Dependable Systems by Mapping Fault Trees into Bayesian Networks", Reliability Engineering and System Safety, vol.71(3), pp.249-260.
[24] Cooper, G. F. and Herskovits, E., (1992), "A Bayesian Method for the Induction of Probabilistic Networks from Data", Machine Learning, vol.9(4), pp.309-347.
[25] Chen, D. J., Chen, R. S. and Huang, T. H., (1997), "A Heuristic Approach to Generating File Spanning Trees for Reliability Analysis of Distributed Computing Systems", Computers and Mathematics with Application, vol.34, pp.115-131.
[26] Dai, Y. S., Xie, M., Poh, K. L. and Liu, G. Q., (2003), "A Study of Service Reliability and Availability for Distributed Systems", Reliability Engineering and System Safety, vol.79, pp.103-112.
[27] Kumar, A. and Agrawal, D. P., (1993), "A Generalized Algorithm for Evaluating Distributed-Program Reliability", IEEE Transactions on Reliability, vol.42, pp.416-424.
A Novel Proxy Certificate Issuance on Grid Portal
Without Using MyProxy
Ng Kang Siong, Sea Chong Seak, Galoh Rashidah Haron, Tan Fui Bee and Azhar Abu Talib
Cyberspace Security Center, MIMOS Berhad, Kuala Lumpur, Malaysia
Abstract - MyProxy was proposed to solve the problem of extending the proxy certificate issuance capability supported by the Globus Toolkit to a Grid Portal running on a web server. However, the introduction of MyProxy has introduced several security vulnerabilities into computing grids. This paper describes a novel method to allow direct proxy certificate issuance from an end entity certificate on a smart card to a Grid Portal via a web browser, without using MyProxy. The method can also be used for proxy certificate issuance in any web-based application.
Keywords: proxy certificate, MyProxy, grid, single sign-on,
PKI.
1 Introduction
In the quest to build computers with more computational power, distributed computing, or grid computing, is gaining popularity. The primary advantage of grid computing is that each node can be purchased as commodity hardware, which, when combined, can provide computing resources similar to a many-CPU supercomputer, but at lower cost.
Ian Foster et al. defined the 'grid problem' [1] as the challenge of enabling coordinated resource sharing among dynamic collections of individuals, institutions and resources. In an effort to address the 'grid problem', Ian Foster initiated the open-source Globus Toolkit (GTK), a middleware for building computing grids from commodity computers. Even though other software toolsets are available for constructing computing grids, GTK remains the most widely used and continuously maintained and upgraded middleware toolkit for computing grids.
Since computing nodes share resources over the open Internet, a set of libraries and toolkits called the Grid Security Infrastructure (GSI) is included in the standard GTK to provide the security measures needed to support computing grids. GSI relies on Public Key Infrastructure (PKI) to provide strong mutual authentication and message protection through the TLS/SSL protocol [3]. X.509 digital certificates are used extensively to identify users, services and hosts [4]. A proxy certificate [5] that conforms to IETF RFC 3820 can be used by GSI for job delegation and single sign-on to multiple computing nodes.
By providing a web interface to GTK, a Grid Portal allows end users to interact with computing grids using a common and mature user interface technology. Because the Grid Portal acts as a man-in-the-middle between GTK and the end user's web browser, the proxy certificate issuance capability provided by GTK is not usable in this browser-based situation.
MyProxy [7] was proposed, and several implementations involving MyProxy have been carried out. This paper examines MyProxy's architectural weaknesses. The contribution of this paper is a novel method that addresses the inherent weaknesses of MyProxy by enabling direct proxy certificate creation via a web browser interface, using a private key residing on a smart card, without the need for a MyProxy server.
2 Background
The Globus Security Infrastructure (GSI) is the portion of GTK that provides the fundamental security services needed to support grid computing. The primary motivations [2] behind GSI are to provide secure communication between computing nodes, decentralized security management, and "single sign-on" for users of the Grid.
Public Key Infrastructure became the natural choice for authentication, which can be bound to the establishment of secure communication between computing nodes through the Transport Layer Security (TLS) protocol [3]. However, when a web portal sits in the pathway between a client and a server, this security mechanism does not allow the server to authenticate the client, because the private key is required during the challenge-response mechanism of TLS mutual authentication. Copying the private key to the portal would violate the need to keep the private key secret.
2.1 Proxy Certificate
A proxy credential solves this problem by allowing the user to use his private key to sign a temporarily generated public key, forming a proxy certificate. The proxy credential resides in the portal, and the private key of the proxy credential is used for authentication between the proxy and the server.
The proxy certificate contains the user's identity, with additional information indicating that it is a proxy [5] acting on behalf of the user. The new certificate is signed by the user's private key, rather than by a Certification Authority (CA), as depicted in Figure 1. This establishes a chain of trust from the CA to the proxy certificate through the user certificate [8].
Figure 1: Chain of trust from the CA to the proxy certificate.
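The chain of trust in Figure 1 can be illustrated as a plain data structure. This is a sketch only: real verification would use an X.509 library and RFC 3820 path-validation rules, and the `Cert` type, names and `signer` links here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Cert:
    subject: str
    signer: Optional["Cert"]  # None stands in for the self-signed CA root

def chain_to_root(cert):
    """Follow signer links from a (proxy) certificate up to the CA root."""
    chain = []
    while cert is not None:
        chain.append(cert.subject)
        cert = cert.signer
    return chain

ca = Cert("CA Root", None)
user = Cert("User", ca)
proxy1 = Cert("Proxy1", user)    # signed by the user's private key
proxy2 = Cert("Proxy2", proxy1)  # second-level proxy
```

Walking `proxy2` up its signer links yields Proxy2, Proxy1, User, CA Root, mirroring the Figure 1 chain.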
2.2 MyProxy
While Globus Toolkit (GTK) version 4 natively supports the creation and use of multi-level proxy certificates, the situation is different when the user interacts with GTK via a web interface. MyProxy [7] allows the user to create a proxy credential, in the form of a private key and proxy certificate, using a dedicated MyProxy Client program. The user can then access this credential from a standard web browser using a userID and passphrase, and can instruct MyProxy, through the Grid Portal, to generate a proxy-proxy certificate (second-level proxy certificate) based on the proxy certificate residing on the MyProxy Server.
The operation of MyProxy is depicted in Figure 2. In step 1, the user interacts with the MyProxy Server through the MyProxy Client software; a userID and passphrase are required to gain access to the MyProxy Server. The user instructs the MyProxy Client running on the user's computer to initiate Proxy1 Certificate creation using the user's certificate residing on the same computer. The Proxy1 Certificate acts on behalf of the user, so that the user certificate is not required at all times.
In step 2, the user accesses the Grid Portal using a standard web browser. At this stage, the user can request MyProxy to issue a Proxy2 Certificate based on the Proxy1 Certificate created on the MyProxy Server in step 1. The preconditions for this operation to succeed are:

a) The Proxy1 Certificate created in step 1 has not expired.
b) The user provides the userID and passphrase needed to access the MyProxy Server.
In step 3, the MyProxy Server responds to the request from the Grid Portal by creating a Proxy2 Certificate (a level-2 proxy certificate acting on behalf of the level-1 proxy certificate) on the Grid Portal, to be used by the portal as the proxy credential with which the user accesses computing resources through the GTK middleware.

In step 4, the Proxy2 Certificate is used by the Grid Portal, via GTK, to establish digital-certificate-based mutual authentication with the GTK computing grids.
Figure 2 Proxy certificate issuance using MyProxy.
3 Problems
Based on Figure 2, a few weaknesses can be identified:
A secondary path is required for the user to create the proxy certificate on the MyProxy Server. The user needs a dedicated MyProxy Client to establish the connection to the MyProxy Server, and this secondary path opens up potential exploits.
A userID and passphrase are used instead of the stronger mutual authentication based on TLS/SSL that GTK employs. Even though communication between computing nodes within GTK computing grids relies on digital-certificate-based mutual authentication, introducing a Grid Portal with a MyProxy Server does not extend the mutual authentication to the end user who uses a web browser. The problem is compounded because the Grid Portal is generally exposed to a public network such as the Internet. The weakest link in the entire system therefore coincides with the part of the system that is open to the public and invites attacks from a potentially large number of Internet users. Anyone who obtains the right combination of userID and passphrase can impersonate the legitimate user and access the computing grids.
4 Our Contributions
4.1 Architecture
We describe a novel method that allows a proxy certificate to be issued by a user certificate, via a standard web browser, to the Grid Portal without using MyProxy. The architecture of the system is depicted in Figure 3. The steps involved are:
In step 1, the user logs in to the Grid Portal, hosted on a standard web (HTTP) server, using a standard web browser such as Firefox, via strong TLS/SSL digital-certificate-based mutual authentication. The Proxy1 Certificate is issued by the user certificate through a browser extension program and its corresponding CGI program running on the Grid Portal.
In step 2, the Grid Portal can use the newly generated Proxy1 Certificate to engage Globus Toolkit (GTK) based computing grids, via digital-certificate-based mutual authentication with the GTK Client running on the Grid Portal.
Figure 3 Proxy certificate issuance via web browser without MyProxy.
4.2 Detailed Process
A detailed sequence diagram is depicted in Figure 4. The entities involved are:
CGI Program – A program, developed by the authors of this paper, that runs on the Grid Portal. It is activated via the Common Gateway Interface by the web server running on the machine. Its main functions are to generate the private-public key pair of the proxy certificate and to construct the HTML file containing the embedded tag that activates the Browser Extension Program on the user's computer.
Grid Portal – A portal for computing grids running on a web server that acts as a HyperText Transfer Protocol (HTTP) server. It provides a web interface for the user to interact with computing grids.
Web Browser – An HTTP browser that runs on the user's computer and interacts with a web server. For the purpose of this paper, the web browsers that have been proven to work are Microsoft Internet Explorer and Mozilla Firefox.
Browser Extension Program – A program, developed by the authors, that can be initiated by the Web Browser to carry out proxy certificate creation based on the parameters in the embedded tag. It interfaces to the PKCS#11 or CSP library that links with private key storage devices; CSP provides the interface for Microsoft Internet Explorer, while PKCS#11 integrates with Mozilla Firefox.
PKCS#11/CSP – PKCS#11 [8] is a cryptographic token interface library that can be loaded into Mozilla Firefox, while CSP serves the same purpose for Microsoft Internet Explorer. These cryptographic token interface libraries allow the browsers and browser extension programs to interact with cryptographic tokens to perform RSA private key operations that involve the use of a smart card or virtual memory storage.
Smart Card – A cryptographic smart card capable of performing RSA private key operations using the stored private key. The Malaysian national identity card, MyKAD, is capable of performing such operations and therefore can also be used for the purpose stated in this paper. The smart card can also be replaced by virtual memory storage of the private key, with the necessary cryptographic functions to perform the same private key operations.
The user initiates the web browser to submit an HTTP POST request to the Grid Portal running on the web server, activating the relevant CGI program, which then initiates public-private key pair generation. Upon successful key pair generation, the CGI program stores the key pair in appropriate storage and replies to the HTTP POST request by sending back the public key, encoded in base64 format and embedded in an HTML file. An example of the embedded tag included in the HTML file is as follows:
<EMBED Type="application/x-mgt"
mKEY="AAAAB3NzaC1kc3MAAACBAPY8ZOHY2yFSJA6XYC9HRwNHxa.. ">
</EMBED>
One parameter, called mKEY, is included in the embedded tag. The base64-encoded data that follows mKEY is the public key generated by the CGI program.
When the web browser receives the HTML file with the embedded tag containing the public key, the appropriate browser extension program, which has been pre-installed on the computer and configured to associate itself with the x-mgt application, is activated.
The first task executed by the browser extension program is to construct a partial X.509 proxy certificate that complies with the proxy certificate format required by IETF RFC 3820 [5]. The browser extension program reads the user certificate from the smart card via PKCS#11 or CSP and extracts the necessary information from the user certificate, which becomes the issuing certificate or End Entity Certificate (EEC). The public key is extracted from the mKEY field of the embedded tag. The partial X.509 proxy certificate is constructed from the above information.
A certificate digest, or hash value, is calculated from the partial X.509 proxy certificate. This hash value is sent to the smart card via the PKCS#11 or CSP interface to be signed using the private key on the smart card. The signed hash value is returned to the browser extension program and is combined with the partial proxy certificate to form a complete proxy certificate.
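The digest-and-sign exchange can be sketched as below. This is a simplified illustration: the signing function is a stand-in for the RSA operation performed inside the smart card via PKCS#11 or CSP, the byte concatenation in the final step stands in for the real DER re-encoding of the certificate, and all function names are ours.

```python
import hashlib

def digest_partial_certificate(tbs: bytes) -> bytes:
    # Hash of the partial (to-be-signed) proxy certificate body.
    return hashlib.sha256(tbs).digest()

def card_sign(digest: bytes) -> bytes:
    # Stand-in for the private key operation performed inside
    # the smart card via the PKCS#11 or CSP interface.
    return b"SIG:" + digest

def assemble_proxy_certificate(tbs: bytes, signature: bytes) -> bytes:
    # Simplified: a real implementation re-encodes the certificate
    # in DER with the signature field filled in.
    return tbs + signature

partial = b"partial-x509-proxy-certificate"
signed = card_sign(digest_partial_certificate(partial))
cert = assemble_proxy_certificate(partial, signed)
```

Only the digest crosses the PKCS#11/CSP boundary; the private key never leaves the card.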
The final task of the browser extension program is to instruct the web browser to send a POST command delivering the proxy certificate to the Grid Portal via HTTP. This POST command and its proxy certificate payload are received by the CGI program running on the Grid Portal, which stores the certificate in the appropriate proxy certificate storage location.
This concludes the proxy certificate issuance process. The proxy certificate and its corresponding private key are then used by the GTK client to initiate digital-certificate-based mutual authentication with other GTK computing nodes.
Figure 4 Proxy certificate issuance sequence diagram.
4.3 Advantages
The immediate advantage of this approach is that it eliminates the need for MyProxy as an intermediary for proxy certificate storage and issuance. There is no need to maintain a secondary communication path to MyProxy.
In doing so, we have also reduced the number of cascading proxy certificates required to access the computing grids. Only one proxy certificate is required by our method, while two stages of proxy certificates are required by the MyProxy implementation.
Strong digital-certificate-based authentication using TLS/SSL is maintained for the entire communication chain between the end user's browser and the computing nodes in the GTK computing grids. No userID and passphrase are used in this method, eliminating this glaring weakness of systems using MyProxy.
While session management between computing nodes is maintained by the GTK software, session management between the web browser and the Grid Portal can be tied to the user certificate that establishes the TLS/SSL session. This is a feature that cannot be provided by a system using the MyProxy implementation.
Using standard cryptographic token interface libraries such as PKCS#11 and CSP makes our method generic. It works with all PKCS#11 or CSP libraries supplied by the respective smart card vendors, and also with user certificates stored in other forms, such as on a USB drive or hard disk, provided that a matching PKCS#11 or CSP library is available.
4.4 Variation
Beyond the method described above, there is another variation of the solution: the construction of the partial X.509 proxy certificate can be done by the CGI program instead of by the browser extension program. To ensure compliance with IETF RFC 3820 [5], which requires that the subject name of the proxy certificate be derived from the subject name of the issuer (user) certificate, the CGI program must be able to access the user certificate in order to extract its subject name.
Bear in mind that the user certificate is on the smart card and, in the solution above, is accessed by the browser extension program running on the user's computer via PKCS#11 or CSP. Fortunately, the user certificate has already been transmitted to the web server during the TLS/SSL digital certificate mutual authentication process, so the CGI program on the Grid Portal can access it from the web server's SSL session variables. This makes it possible for the CGI program to generate the partial X.509 proxy certificate and, subsequently, the hash value.
The information exchange between the CGI program and the browser extension program remains the same as in the previous method. The hash value can be sent to the browser extension program using the same mechanism of placing a parameter in the embedded tag, and the signed hash value is returned to the CGI program in the same manner; the hash value and signed hash value are exchanged instead of the public key and the complete proxy certificate.
Upon receipt of the signed hash value, the CGI program constructs the complete proxy certificate on the machine running the Grid Portal. This variant achieves the same goal of direct proxy certificate issuance via a web browser.
5 Conclusions and Future Works
We have described a novel method that replaces the use of MyProxy for proxy certificate issuance, thereby avoiding the inherent security weakness that the MyProxy implementation imposes on the Globus Security Infrastructure (GSI). The advantages of this new method have been outlined, and we have also described a variant process that achieves the same goal.
While this paper describes the generic case of using a PKI-enabled smart card for GSI, MyKAD, a national identity smart card with PKI capability, can be used as a national authentication token to generate proxy certificates via a standard web browser for accessing any web-based application using this method.
Potential future work is to analyze the two variant processes in detail and to fine-tune them so that the CGI program and the browser extension program can be adopted as a standard software library for GSI.
6 Acknowledgement
This novel method of direct proxy certificate issuance on a Grid Portal via web browser, without using MyProxy, was developed to extend MyKAD PKI security to the experimental grid for trusted computing project funded by the Ninth Malaysia Plan (9MP), in support of the long-term national grid computing initiative.
7 References
[1] Ian Foster, Carl Kesselman, and Steven Tuecke, “The
Anatomy of the Grid: Enabling Scalable Virtual
Organizations”, International Journal of High Performance
Computing Applications, vol. 15, no. 3, pp. 200-222, August
2001.
[2] University of Chicago, “GT4.0: Credential
Management: MyProxy”, July 2007.
http://www.globus.org/toolkit/docs/4.0/security/myproxy/ind
ex.pdf
[3] T. Dierks, and C. Allen, “The TLS protocol version
1.0”, IETF RFC 2246, January 1999.
http://www.ietf.org/rfc/rfc2246.txt
[4] Von Welch, “Globus Toolkit Version 4 Grid Security
Infrastructure: A Standard Perspective”, Sept 12, 2005.
http://www.globus.org/toolkit/docs/4.0/security/GT4-GSI-
Overview.pdf
[5] S. Tuecke, V. Welch, D. Engert, L. Pearlman, and M. Thompson, “Internet X.509 Public Key Infrastructure (PKI) Proxy Certificate Profile”, IETF RFC 3820, June 2004. http://www.ietf.org/rfc/rfc3820.txt
[6] Ian Foster, Carl Kesselman, Gene Tsudik, and Steven
Tuecke, “A Security Architecture for Computational Grids”,
Proceedings of the 5th ACM Conference on Computer and
Communications Security, pp. 83-92, 1998.
[7] Jason Novotny, and Steven Tuecke, “An Online
Credential Repository for the Grid: MyProxy”, 10th IEEE
International Symposium on High Performance Distributed
Computing, pp. 104, 2001.
[8] RSA Laboratories, “PKCS#11: Cryptographic Token
Interface Standard”, June 2004.
ftp://ftp.rsasecurity.com/pub/pkcs/pkcs-11/v2-20/pkcs-11v2-
20.pdf
Anomaly Detection and its Application in Grid
Thuy T. Nguyen1,2, Thanh D. Do1, Dung T. Nguyen2, Hiep C. Nguyen2, Tung T. Doan2, Giang T. Trinh3
1 Department of Information System, Hanoi University of Technology
2 High Performance Computing Center, Hanoi University of Technology
3 Hanoi National University
Hanoi, Vietnam
Abstract - The size and complexity of grid computing systems are increasing rapidly, and failures can have an adverse effect on applications executing on the system. Failure information should therefore be provided to the scheduler so that it can make scheduling decisions that reduce the effect of failures. In this paper, we survey anomaly detection methods and present the window-based method in detail, together with experiments to evaluate it. The experiments show that the window-based method is efficient at detecting anomalies and may be implemented in a monitoring service. We introduce an architecture of services that makes failure information available to other services in the grid. Using these services, grid schedulers and job/execution managers can obtain additional information for matching and handling task failures.
Keywords: anomaly detection, Grid monitoring, job monitoring, window-based detection method.
1 Introduction
In recent years, grid computing systems have developed rapidly to satisfy various scientific computing demands such as weather forecasting, drug simulation, and high-energy physics. The complex hardware and software on these systems make failures difficult to predict: they may occur in any component and have different effects on the performance of the system and the running application. Statistical data about failures is very useful in performance prediction, since the real computing capacity of a machine partly depends on its failure rate and is often less than its theoretical capacity. Avoiding unreliable machines increases the overall performance of computing systems. Therefore, it is necessary to have a mechanism to monitor application performance and notify the scheduler or handler of failures.
Although many grid monitoring systems [1] [2] [3] have been developed, to our knowledge little effort has focused on aggregating anomaly information into monitoring systems. In existing anomaly detection solutions [4] [5], the obtained information is not provided as a service, which makes it difficult to exploit anomaly detection broadly and efficiently across the grid. In this paper, we present a survey of anomaly detection methods. Among them, we recommend integrating the window-based method into a monitoring system; experiments show that the window-based approach is simple but efficient. We also propose a grid monitoring architecture that supports the process of making scheduling/handling decisions.
The remainder of this paper is organized as follows. In Section 2, we introduce various methods used to detect anomalies. In Section 3, we present the window-based method in detail. Experiments with the window-based method under different parameters are presented in Section 4. In Section 5, we introduce the architecture of an information service for anomalies on grid nodes. Finally, we conclude the paper in Section 6.
2 Anomaly detection methods
Anomaly detection is the process of finding values that deviate significantly from the remaining values. It has a wide range of applications, from intrusion detection to failure monitoring. The values considered for anomaly detection may be system metrics such as CPU usage, free memory, and bandwidth. When a metric deviates significantly from its normal pattern, it is considered an anomaly. For example, if the CPU usage of an application process at a particular computing node suddenly drops for a while, the application process might have entered a zombie state or been preempted by higher-priority processes. If the decrease in CPU usage lasts a long period, application performance may degrade considerably, leading to an anomaly. There are various methods for detecting anomalies, each with its pros and cons. In the following sections, a brief description of popular methods is presented.
2.1 Value-based change detection
In this method, the value of a monitored metric is considered normal if it lies in a set of normal values, often defined by lower and upper bounds. If a new value is greater than the upper bound or less than the lower bound, it is considered an anomaly. This is the simplest and fastest method. However, it is not well suited to detecting application anomalies, since in a complicated environment like the grid a range of values is not always enough to describe the normal behavior of the system or application. Different jobs require different resources and behave differently: for example, a matrix multiplication application requires a lot of CPU, while a data transmission application requires a large amount of bandwidth.
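The bounds check described above amounts to a one-line test; the bounds and sample values here are invented for illustration.

```python
def value_anomaly(x: float, lower: float, upper: float) -> bool:
    # Value-based check: abnormal if the new value falls
    # outside the [lower, upper] range of normal values.
    return x < lower or x > upper

print(value_anomaly(95.0, lower=10.0, upper=90.0))  # outside the range
print(value_anomaly(50.0, lower=10.0, upper=90.0))  # inside the range
```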
2.2 Model-based change detection
In this approach, it is assumed that the data is produced by a single source that can be described by a model: a collection of parameters representing the source. An anomalous change in one of these parameters is considered an anomaly, and an abnormal change in the correlations among parameters can also be a sign of anomaly. For example, if the data is generated by a Gaussian source, the mean and standard deviation characterize the model. The better we understand the source, the more properties we can use for anomaly detection. We describe some of the most popular models for detecting application anomalies in the following.
Operational Model. This model is based on the operational assumption that normal values are quite stable on a given system. The first N values are used to estimate the normal range; each new value is then compared with this range to decide whether it is an anomaly. This approach can adapt well to systems running in real time.
Mean and standard deviation model. From the N newest values x1, ..., xN, the mean and standard deviation are calculated as follows:

mean = (1/N) * sum_{i=1..N} x_i    (1)

stdev = sqrt( (1/(N-1)) * sum_{i=1..N} (x_i - mean)^2 )    (2)

The observation x_{N+1} is defined as abnormal if it falls outside a confidence interval that is d standard deviations from the mean. For example, if d = 2, a value that lies within mean ± 2*stdev is considered normal; otherwise, it is an anomaly.
Multivariate Model. This model is similar to the mean and standard deviation model, except that it is based on correlations among two or more metrics. It is useful for monitoring applications where there are relationships among system metrics such as CPU usage, data transfer, and file I/O. For example, in an MPI application, when data transmission among running processes increases, the CPU usage of each process decreases because it has to wait until the data transmission is completed. When a process has received enough data, it starts to compute intensively [6], leading to a significant increase in CPU usage.
Markov Process Model. In this model, events are classified into types called state variables, and a state transition matrix characterizes the transition probabilities between states. A new observation is defined as abnormal if its probability, as determined by the previous state and the transition matrix, is too low. To apply this model, observed values must be classified into states and the transition matrix must be defined, which is often the most difficult task in practice. Moreover, addressing relationships between metrics, which are very diverse, may require multi-dimensional states [7]. This makes the Markov Process Model less popular than the others.
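A toy illustration of the transition-matrix test described above; the states, probabilities, and threshold are invented for the example.

```python
def transition_anomaly(matrix, prev_state, new_state, threshold=0.05):
    # An observation is abnormal if the probability of the transition
    # from the previous state, as given by the transition matrix,
    # falls below the threshold.
    prob = matrix.get(prev_state, {}).get(new_state, 0.0)
    return prob < threshold

# Hypothetical matrix over three CPU-usage states.
matrix = {
    "low":    {"low": 0.70, "medium": 0.29, "high": 0.01},
    "medium": {"low": 0.20, "medium": 0.60, "high": 0.20},
    "high":   {"low": 0.10, "medium": 0.30, "high": 0.60},
}
print(transition_anomaly(matrix, "low", "high"))    # rare jump
print(transition_anomaly(matrix, "low", "medium"))  # common transition
```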
Neural network-based model. A neural network is a non-linear statistical model for representing complex relationships between inputs and outputs. Through learning, the network updates the weights among its elements to fit the desired output, so it performs well when the learning set is large. However, the neural network-based method requires much processing time, and building the learning set is time-consuming [8].
2.3 Rule-based change detection
In the rule-based approach, normal activities and relationships between metrics are described as rules, and any conflict with these rules triggers an alarm. Ideally, a complete set of rules characterizing all the behavior of the system would be constructed, but in a complex system such as a distributed computing environment this may be impossible. Rules are often obtained by machine learning techniques [9] [10].
3 Window-based method
The window-based method is perhaps the simplest, yet an efficient and flexible, approach. In this technique, a window moves across the system metrics of interest, and any value that violates a specified threshold within the window is considered an anomaly. A simple modification that improves its performance is to signal an alarm only after a minimum number of consecutive violations of the threshold.
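The moving-window check with a minimum-consecutive-violation rule can be sketched as follows; the data and parameters are invented for illustration, and the threshold is expressed as d standard deviations, as in the mean and standard deviation model.

```python
import math

def detect(values, window_size, d, min_violations):
    # Slide a window over the metric; raise an alarm only after
    # min_violations consecutive values fall outside mean +/- d*stdev
    # of the preceding window.
    alarms, streak = [], 0
    for i in range(window_size, len(values)):
        window = values[i - window_size:i]
        mean = sum(window) / window_size
        var = sum((x - mean) ** 2 for x in window) / (window_size - 1)
        if abs(values[i] - mean) > d * math.sqrt(var):
            streak += 1
            if streak >= min_violations:
                alarms.append(i)
        else:
            streak = 0
    return alarms

data = [50.0, 51, 49, 50, 52, 50, 49, 51, 95, 96, 97, 50, 51]
print(detect(data, window_size=8, d=2, min_violations=2))
```

Note that once an anomalous value enters the window it inflates the standard deviation, which is exactly the weakness of the moving window method discussed below.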
In the window-based method, the periodic activities of system processes must be addressed. For example, the Network File System (NFS) client process periodically utilizes an amount of CPU at a specified interval to listen for changes from the NFS server. These periodic behaviors may change the statistical characteristics of the data and should be filtered out before the anomaly detection phase; otherwise, they may cause false alarms. The Fast Fourier Transform (FFT) technique can be exploited to eliminate periodic patterns in the data [4]. In the FFT approach, the discrete data is transformed into the frequency domain, the signals with the greatest amplitudes are filtered out, and the remaining signals are transformed back into the time domain.
Various parameters affect the performance of the window-based method, such as the window size, sampling period, and anomaly threshold. For example, the larger the window size, the more accurate the detection results, but also the longer the processing time. We believe that the ability to choose and customize operating parameters should be offered in a complex environment like the grid.
There are two ways to check whether the latest value is an anomaly: the moving window method and the operational model method. The moving window method uses only the N newest values and thus reflects the current activity of the application. A disadvantage of the moving window method is that it is not very efficient when the number of anomalous values among the N latest samples is high. In that case, the operational model can be applied: the first N samples are considered normal, and from this normal pattern the mean and standard deviation are calculated. To compensate for the disadvantage of using the first N values as the normal pattern, a heuristic approach can be used in which the current monitored value is added to the normal pattern data set if it is not an anomaly, in contrast to the static normal pattern approach.
TABLE II. ANOMALY DETECTION RESULTS USING THE MOVING WINDOW METHOD

  Options    HIT    FP
  nCnSnF      41    13
  nCnSF       30    11
  nCSnF       41    13
  nCSF        30    11
  CnSnF      100   102
  CnSF        52    60
  CSnF       100   102
  CSF         52    60
TABLE I. ANOMALY DETECTION RESULTS USING THE OPERATIONAL MODEL METHOD

  Options    HIT     FP
  nCnSnFnH   100    163
  nCnSnFH    100   2211
  nCnSFnH     76    174
  nCnSFH      86    760
  nCSnFnH    100    163
  nCSnFH     100   2211
  nCSFnH      76    174
  nCSFH       86    760
  CnSnFnH    100    163
  CnSnFH     100   3398
  CnSFnH      88    449
  CnSFH      100   3066
  CSnFnH     100    163
  CSnFH      100   3398
  CSFnH       88    449
  CSFH       100   3066
Another consideration is the distribution function of the samples. With a threshold of one standard deviation, the probability that a sample is flagged as abnormal is 22.8% if the samples follow a normal distribution; this probability differs if they follow an exponential distribution. Another approach is to use the Cumulative Distribution Function (CDF) to describe the set of samples.
Moreover, in order to reduce the effect of outliers, a smoothing option is added that smooths the values using the exponentially weighted moving average function, which takes a single parameter α. For each new value x_i, the smoothed value s_i is calculated as s_i = α * x_i + (1 - α) * s_{i-1}, where s_{i-1} is the previous smoothed value. α = 1 means no smoothing at all; α = 0.2 is a generally accepted choice [11].
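The smoothing step is a standard exponentially weighted moving average, combining each new sample with the previous smoothed value; in this sketch the first sample seeds the series.

```python
def ewma(values, alpha=0.2):
    # s_i = alpha * x_i + (1 - alpha) * s_{i-1}, seeded with
    # the first sample; small alpha damps outliers more strongly.
    smoothed = [float(values[0])]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

print(ewma([10, 10, 100, 10]))  # the spike at 100 is damped
```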
We therefore have many options. We can choose between the moving window method and the operational model method. In each method, we can choose to use the FFT option (F) or not (nF), and the CDF (C) or not (nC); we can also choose whether the smoothing function is applied (S) or not (nS). In the operational model method, two ways of updating the normal pattern can be applied: the static approach (no update, nH) and the heuristic one (H).
4 Evaluation
In this section, we evaluate the performance of the window-based detection method. We used it to detect anomalies in the CPU usage of processes, with two metrics: the number of successful detections (HIT) and the number of false positives (FP).
We carried out the experiment over 25 hours, deliberately injecting 100 CPU usage anomalies at random times using a CPU consumption tool consisting of N simultaneous compute-intensive processes. We also ran some processes periodically to emulate periodic patterns.
We set the threshold to 2 standard deviations, and to 10 percent with the CDF option. The window size was 256, as was the number of samples in the normal profile. We ran both the operational model method and the moving window method with various options. The experimental results are shown in Tables I and II. It can be seen that the smoothing option does not affect the detection result. The most influential option is CDF: using the CDF raises the HIT rate, because the CDF reflects the probability of each actual value, so the detection result does not depend on a hypothesis about the distribution.
The second most noticeable option affecting the detection result is the FFT option, which yields a lower HIT rate and a higher FP/HIT ratio. There are two main reasons. First, the periodic patterns are not clear. Second, the FFT option not only eliminates periodic patterns but also alters the observed values (as illustrated in Figure 1). In the experiment, any frequency whose amplitude is greater than the average is cut, so the FFT option creates distortion in the observed values: some normal values are altered into anomalies and vice versa.
The heuristic option makes the number of alarms increase dramatically. Heuristic selection of the normal profile makes the values converge, shrinking the range of normal values and thus increasing the number of values flagged as anomalies.
5 Anomaly detection in grid
The monitoring service is considered an important part of grid information systems [12] such as Globus MDS [1] and R-GMA [2]. At the cluster level, Ganglia [3] is a scalable and widely used cluster monitoring system. However, this approach does not support monitoring and detecting anomalies in the jobs running on the grid. A number of solutions have been proposed for grids to monitor applications: systems such as NetLogger [13], OCM-G [14] and Autopilot [15] focus on application monitoring [16]. However, detecting anomalies in grid jobs requires interoperation between the monitoring system and the grid services. When the user submits a job to the grid, the job is managed by a job manager. Each running job in the grid consists of multiple processes, and the relation between the job id (in the grid) and the processes running on each host is identified by the job management service. It is therefore essential to set up a monitoring system that can be integrated into the grid. The monitoring system collects information about running jobs and uses this information to detect anomalies with the window-based algorithm. The overall architecture is illustrated in Figure 2. It consists of the following components:
Fig. 1. Application performance value (CPU %) before (upper) and after (lower) the FFT cut.
1) Grid Job Information Service. This service provides information about the jobs running in the grid. It communicates with the job manager to get the job manifest, a document that contains both the description of the job and the specification of the requested local resources. An example of such a job manager is Globus GRAM [17]. This service registers itself with the Directory Service, indicating the type of information it supports. This data can be queried by the Local Monitoring Manager on each host to identify the state of the job on that host and the processes belonging to it.
2) Sensor. Each host in the grid has a sensor attached to it. The sensor consists of two components. The Local Monitoring Manager is responsible for starting and stopping the Information Collector and Anomaly Detector, and identifies the processes to be monitored on each host via the Grid Job Information Service. The Information Collector and Anomaly Detector monitor significant resource usage of these processes and detect anomalies when they occur.
3) Directory Service. The Directory Service is used to publish the locations of the Grid Job Information Service, the Anomaly Event Producer, and the Monitoring Service. This allows the Monitoring Service to discover which Anomaly Event Producer is currently active and contact it to obtain a given sensor's output. Query-optimized directory services such as LDAP [18] and Globus MDS provide fundamental mechanisms for this purpose; LDAP is a simple, standard solution, and LDAP servers can be hierarchical. In the Grid Monitoring Architecture [19], Producers register themselves with the Directory Service and describe the type and structure of the information they want to make available to the grid. A Consumer can query the Directory Service to find out what type of information is available and locate the relevant Producers. Once the Producers are known, the Consumer contacts them directly to obtain the appropriate data.
4) Producer Service. A Consumer may be unable to connect directly to a target host because of that system's policies. In that case, an intermediate Producer can be installed on a host that has connectivity to both sides and act as a proxy between the Consumer and the target host. Proper authentication and authorization policies at this proxy keep the system secure and manageable.
5) Monitoring Service. The primary role of the Monitoring Service is to retrieve dynamic status information from the appropriate Local Monitoring Manager. The Monitoring Service gathers only the most recent value published by a Producer, communicating with it through a request-reply transaction. The collected information can then be used by any Grid Information Service. In the GMA, Consumers register with the Directory Service to be notified about changes occurring in the relevant Producers.
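The register, lookup and query interactions described above (Producers register with the Directory Service, Consumers discover them there and then contact them directly) can be sketched as follows. All class and method names here are illustrative assumptions, not part of the GMA specification or of any actual implementation.

```python
# Illustrative sketch of the GMA-style register/lookup/query flow.
# DirectoryService and Producer are hypothetical names for this example.

class DirectoryService:
    def __init__(self):
        self._registry = {}  # info type -> list of registered producers

    def register(self, info_type, producer):
        # Producers describe the type of information they publish.
        self._registry.setdefault(info_type, []).append(producer)

    def lookup(self, info_type):
        # Consumers discover which producers offer a given info type.
        return self._registry.get(info_type, [])


class Producer:
    def __init__(self, name):
        self.name = name
        self._latest = None

    def publish(self, value):
        self._latest = value

    def query(self):
        # Request-reply transaction: return the most recent published value.
        return self._latest


# Usage: a sensor publishes CPU data; a monitoring service finds and queries it.
directory = DirectoryService()
sensor = Producer("host-42/cpu")
directory.register("cpu.utilization", sensor)
sensor.publish(0.37)

for producer in directory.lookup("cpu.utilization"):
    latest = producer.query()  # the consumer contacts the producer directly
```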
6 Conclusion
Grid monitoring plays an important role in building a robust grid computing infrastructure. The monitoring task provides information not only about system resources but also about the status of running jobs. Early warning of anomaly occurrences has significant benefits: when an anomaly is detected, suitable actions such as cancelling or restarting the job can be taken. In this paper, we presented a variety of anomaly detection methods and described the window-based method in detail, with a comprehensive set of options. Evaluation results indicated that a suitable combination of options yields good results. To exploit the anomaly-detection system on grid nodes, we introduced the architecture of a grid monitoring service that handles anomaly information. The architecture covers the production of anomaly information and provides the protocol for integrating the grid monitoring system with a grid scheduler. In the future, we plan to implement this service and deploy it at the High Performance Computing Center at Hanoi University of Technology.
Acknowledgement
This research is conducted at the High Performance Computing Center, Hanoi University of Technology. It is supported in part by the VN-Grid, SAREC 90 RF2, and KHCB 20.10.06 projects.
Fig. 2. Components of the Anomaly Detection System in Grid
References
[1] K. Czajkowski, I. Foster et al.: Grid Information Services for Distributed Resource Sharing. In 10th IEEE International Symposium on High Performance Distributed Computing, San Francisco, California, August 2001.
[2] S. Fisher, R. Byrom et al.: R-GMA: A Relational Grid Information and Monitoring System. In 2nd Cracow Grid Workshop, Cracow, Poland, 2003.
[3] M. L. Massie, D. E. Culler and B. N. Chun: The Ganglia distributed monitoring system: design, implementation, and experience. In Parallel Computing, 2004.
[4] L. Yang, I. Foster et al.: Anomaly detection and diagnosis in grid environments. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'07), Reno, Nevada, USA, 2007.
[5] J. Bresnahan, D. Gunter et al.: Log Summarization and Anomaly Detection for Troubleshooting Distributed Systems. In Proceedings of the 8th IEEE/ACM International Conference on Grid Computing, Austin, TX, USA, 2007, pp. 226-234.
[6] W. Gropp and E. Lusk: Reproducible Measurements of MPI Performance Characteristics. In PVM/MPI, Lecture Notes in Computer Science, Vol. 1697, pp. 11-18, Springer, 1999.
[7] L. R. Rabiner: A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, Vol. 77, pp. 257-286, 1989.
[8] R. P. W. Duin: Learned from Neural Networks. In Proc. ASCI 2000, 6th Annual Conf. of the Advanced School for Computing and Imaging (Lommel, Belgium, June 14-16), ASCI, Delft, pp. 9-13, 2000.
[9] P. Langley and H. A. Simon: Applications of Machine Learning and Rule Induction. In Communications of the ACM, 38(11), pp. 54-64, November 1995.
[10] S. Hariri and B. U. Kim: Anomaly-based fault detection system in distributed systems. In Proceedings of the Fifth International Conference on Software Engineering Research, Management and Applications, 2007.
[11] W. G. Hunter, J. S. Hunter et al.: Statistics for Experimenters. Wiley-Interscience, 1978.
[12] I. Foster, C. Kesselman and S. Tuecke: The Anatomy of the Grid: Enabling Scalable Virtual Organizations. International Journal of Supercomputer Applications, 2001.
[13] B. L. Tierney, D. Gunter et al.: Dynamic Monitoring of High Performance Distributed Applications. In Proc. 11th IEEE International Symposium on High Performance Distributed Computing, Edinburgh, Scotland, July 2002.
[14] B. Baliś, M. Bubak et al.: An Infrastructure for Grid Application Monitoring. In Proc. 9th European PVM/MPI Users' Group Meeting, Linz, Austria, September/October 2002.
[15] J. Vetter, R. Ribler et al.: Autopilot: Adaptive Control of Distributed Applications. In Proc. 7th IEEE Symposium on High Performance Distributed Computing, Chicago, Illinois, July 1998.
[16] G. Gombás and Z. Balaton: Resource and Job Monitoring in the Grid. In Euro-Par 2003 Parallel Processing: 9th International Euro-Par Conference, Klagenfurt, Austria, August 26-29, 2003.
[17] I. Foster, K. Czajkowski et al.: A Resource Management Architecture for Metacomputing Systems. In IPPS/SPDP '98 Workshop on Job Scheduling Strategies for Parallel Processing, 1998.
[18] T. Howes, M. Wahl and S. Kille: Lightweight Directory Access Protocol (v3). RFC 2251, December 1997.
[19] R. Aydt, D. Gunter et al.: A Grid Monitoring Architecture. Global Grid Forum, Performance Working Group, Grid Working Document GWD-Perf-16-1, 2001.
“Secure Thyself”: Securing Individual Peers in Collaborative Peer-to-Peer Environments
Ankur Gupta1 and Lalit K. Awasthi2
1Computer Science & Engineering, Model Institute of Engineering & Technology, Jammu, Jammu & Kashmir, India
2Computer Science & Engineering, National Institute of Technology, Hamirpur, Himachal Pradesh, India
Abstract - P2P networks and the computations that they enable hold great potential for designing the next generation of very-large-scale distributed applications. However, the P2P phenomenon has largely left untouched large organizations and businesses, which have stringent security requirements and are uncomfortable with the anonymity and the lack of centralized control and censorship that are the norm in P2P systems. Hence, there is an urgent need to address the security concerns in deploying P2P systems, which could leverage the under-utilized resources in millions of organizations across the world. This paper proposes a novel containment-based security model for cycle-stealing P2P applications, based on the Secure Linux (SE Linux) operating system, which alleviates existing security concerns and allows peers to host untrusted or even hostile applications. The model is based on securing individual peers rather than computing trust/reputation metrics for the thousands of peers in a typical P2P network, which involves significant computational overheads.
Keywords: Peer-to-Peer Computing, Peer-to-Peer Security, Containment-Based Security Model, Secure Linux
1 Introduction
P2P networks are self-organizing in nature, extremely fault-tolerant, and designed to provide acceptable connectivity and performance in the face of a highly transient population of nodes. These desirable characteristics have led to several P2P-based applications being proposed or built for a wide variety of domains, such as distributed computing (SETI@HOME [1], Condor [2], Avaki [3]), file/information sharing (Kazaa [4], Facebook [5]), collaboration (Groove [6]) and middleware/platforms (JXTA [7]) that enable further development and deployment of P2P applications. However, business organizations have largely been left untouched by the P2P phenomenon, citing lack of security as the biggest reason for staying away from potentially beneficial collaborations, sharing of information and the other financial benefits that cross-organizational P2P interactions could enable. The lack of centralized control and censorship and the anonymity offered by P2P systems, coupled with the potential for malicious activity, could compromise the confidential data of an organization, besides putting its compute resources at risk. Nevertheless, the potential benefits of enabling cross-organization P2P interactions [8] necessitate that the research community attempt to mitigate the security threats posed by the P2P system model.
Several schemes for ensuring the security of peers participating in P2P networks have been proposed. Most of this research has focused on the security of content-sharing P2P applications and networks, while P2P applications involving remote task execution have not received the desired attention. A majority of these schemes rely on trust- and reputation-based mechanisms [10-12] in an attempt to isolate untrusted or malicious peers from the rest of the peers. However, such schemes require computing trust values for each peer interaction and then communicating the computed values to all peers in the network. Clearly, such schemes are not fool-proof: besides forging identities, groups of malicious peers can act together to circumvent the trust management schemes. Moreover, the computational overheads of such schemes render them ineffective at ensuring bullet-proof security for participating peers, a major requirement for all organizations. Other schemes [13-16] incorporating elements of access control, authentication/authorization and encryption have also been proposed, but they rely on introducing centralized security servers into an otherwise decentralized P2P environment. This introduces a single point of failure in the P2P network, besides constituting a performance bottleneck. Also, in resource-exchanging P2P environments, these approaches are silent on containing the potential damage that can occur once the remote code is resident on the host peer. What is really needed is to secure individual peers while dealing with untrusted peers and to reduce or mitigate the impact of hosting
malicious remote code, especially for cycle-stealing P2P applications, which rely on harvesting idle compute cycles.
This paper proposes a simple containment-based security model for cycle-stealing P2P applications, which creates a virtual sandbox for remote code, thereby mitigating the security threat posed by any malicious peer. The model uses the fine-grained privileges and rule-based access control mechanisms provided by Secure Linux [9], and deploys a custom-built Remote Application Monitor (RAM), to create a secure computing environment for cycle-stealing P2P applications without the overheads of managing trust and reputation values. This work is part of a larger effort to enable secure P2P interactions across organizations, and some of the issues addressed arise from specific requirements of the architecture proposed by the authors in [8].
The rest of the paper is organized as follows: Section 2 provides an overview of various flavors of secure operating systems, which form the basis for the proposed solution. Section 3 discusses the system model of the proposed solution, while Section 4 provides the implementation details on Secure Linux. Section 5 analyses the effectiveness of the proposed solution, and finally Section 6 concludes the paper.
2 Secure Operating Systems
Several operating systems [17-19] now provide inbuilt security mechanisms based on virtualization [20], fine-grained privileges, role-based access control, and the concepts of containment and compartmentalization. The basic idea is to allow the system to host a wide variety of users and applications requiring access to specific resources, while insulating them from the other users and applications on the system, thereby creating a "sandbox" effect. Moreover, virtualization has been employed by many hardware vendors to create logical partitions on the same computer system. Some of these solutions even allow physical faults to be isolated within a partition, ensuring that the other partitions continue to operate normally. However, these features are primarily designed for specific applications installed on the system, not for the varied applications that could be uploaded to a remote peer for execution. Moreover, such security features are typically available only on high-end servers, not on the typical systems found in a P2P network. A security framework needs to be more general and work with all possible hardware configurations. Hence, we decided to use SE Linux for its easy availability (it is open-source); however, our implementation can be applied to any of the operating systems discussed above.
3 System Model
For P2P systems, we need to provide security at two levels: security of shared content, when multiple remote peers access the data, and security of the host, when multiple remote peers and applications make use of the idle CPU cycles and disk space on the host peer. Any security model also needs to be flexible and customizable, since different organizations have different security policies. We therefore introduce a Remote Application Monitor (RAM) module, which shall monitor the application for any suspicious activity such as heavy CPU utilization, increased outbound/inbound network traffic, increased disk space usage, or increased memory utilization. If the application exceeds its pre-defined quotas (which could be pre-negotiated based on the local peer's characteristics and resource availability, or set via an organization-wide policy), the RAM can promptly terminate it. By using the access control features provided by SE Linux in conjunction with our custom application monitor, the local peer can be secured and malicious remote code can be contained before it causes serious damage within the P2P network. The proposed containment-based solution shall be implemented on individual peers hosting remote applications and shall work as follows:
a. Create an area reserved for remote P2P applications.
b. Create a security policy for Secure Linux, which shall control access to the resources of the local peer. This policy is pre-configured.
c. Allow remote applications/code to reside in the reserved area and use idle compute and system resources, as per the defined security policy. The SE Linux security features shall ensure that access to critical resources is denied to any malicious code.
d. Configure the Remote Application Monitor (RAM) with pre-defined thresholds or quotas. These quotas provide the upper limit on peer resource usage and shall be based on the peer's local requirements and resource availability.
e. Monitor the resource usage of the remote application and take corrective action if usage crosses a threshold, say if CPU utilization crosses a particular level or the application stays active after its allotted time is over.
f. Monitor the application for any potentially malicious activity, say a spike in incoming/outbound network traffic, and terminate the application if needed.
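The monitoring steps above amount to comparing observed resource usage against pre-configured quotas and terminating the application on any violation. A minimal sketch of this check follows; the quota names and values are illustrative assumptions, not the actual RAM configuration.

```python
# Illustrative quota check in the spirit of the Remote Application Monitor.
# The quota names and threshold values below are hypothetical examples.

QUOTAS = {
    "cpu_percent": 80.0,     # maximum CPU utilization
    "disk_mb": 500,          # maximum disk usage in the reserved area
    "net_connections": 20,   # maximum simultaneous network connections
    "cpu_seconds": 3600,     # allotted CPU time
}

def violations(usage, quotas=QUOTAS):
    """Return the names of the quotas the remote application has exceeded."""
    return [name for name, limit in quotas.items()
            if usage.get(name, 0) > limit]

def enforce(usage, terminate):
    """Terminate the application if any quota is crossed."""
    exceeded = violations(usage)
    if exceeded:
        terminate(exceeded)
    return exceeded

# Usage: a connection count above threshold triggers termination.
sample = {"cpu_percent": 35.0, "disk_mb": 120,
          "net_connections": 57, "cpu_seconds": 40}
exceeded = enforce(sample, terminate=lambda names: None)
```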
Fig. 1 provides an overview of the system model for the proposed solution.
4 Implementation
The implementation of the system has two components: the security policy enabling individual peers to securely host remote applications, and a custom-built Remote Application Monitor (RAM). We have used SE Linux features to define a security policy suited to the security requirements of cycle-stealing P2P applications. Although the system is still under development, some aspects of it have been realized and tested. Due to the lack of user support for SE Linux, writing security policies is a trial-and-error process at best; hence, considerably more testing shall be needed to fine-tune the system and make it work in a live environment.
4.1 The SE Linux Security Policy
This section provides details of the policy designed to secure individual peers, enabling them to host untrusted remote applications. To host the remote application, a user "remote_user" is created on the host machine. The security policies are defined in the context of the "remote_user" and the remote application "remote_app", which shall make use of the resources of the host peer. Figs. 2 and 3 provide indicative security policies on Secure Linux. The syntax and semantics for writing SE Linux security policies are available at [9], along with sample policies. It is assumed that the relevant P2P middleware shall place the remote application in remote_user's home directory and assign its ownership to "remote_user". Moreover, for the RAM, the executable name is assumed to be fixed (remote_app) at present; it is possible to generate configuration policies at runtime at a later stage to cater to different applications.
Fig. 1 System Model for the Proposed Security Scheme; uses SE Linux access control features with a custom Remote Application Monitor to ensure the security of the local peer.
Fig. 2 SE Linux Security Policy for a Remote User Accessing the Resources of the Host Peer.

# Security policy for a remote user. remote_user_t is an unprivileged
# user domain. Unprivileged users cannot access system resources unless
# a domain transition is explicitly specified.
type remote_user_t, domain, userdomain, unpriv_userdomain;

# Grant permissions within the domain.
general_domain_access(remote_user_t);

# Define the domain remote_user_t. Allow the remote_user_t domain to send
# signals to processes running in unprivileged domains such as remote_user_t,
# allow remote_user_t to run ps and see processes in the unprivileged user
# domains, and allow remote_user_t to access the attributes of any terminal
# device.
full_user_role(remote_user)
allow remote_user_t unpriv_userdomain:process signal_perms;
can_ps(remote_user_t, unpriv_userdomain)
allow remote_user_t { ttyfile ptyfile tty_device_t }:chr_file getattr;

# Don't allow access to the root directory (ls, cd etc.)
deny unpriv_userdomain sysadm_home_dir_t:dir { getattr search };

# Don't grant the privilege of setting/changing the gid or owner
deny remote_user_t self:capability { setgid chown fowner };
audit remote_user_t self:capability { sys_nice fsetid };

# Types for the home directory.
type remote_user_home_dir_t, file_type, sysadmfile, home_dir_type, user_home_dir_type, home_type, user_home_type;
type remote_user_home_t, file_type, sysadmfile, home_type, user_home_type;

# Define types for home directories and tmp_domain (the macro for /tmp file
# access). Set the default type for newly created files in the home
# directory. This prevents access to the /root dir.
file_type_auto_trans(privhome, remote_user_home_dir_t, remote_user_home_t)
tmp_domain(remote_user, `user_tmpfile')

# Give permissions to create, access and remove files in the home directory.
file_type_auto_trans(remote_user_t, remote_user_home_dir_t, remote_user_home_t)
allow remote_user_t remote_user_home_t:dir_file_class_set { relabelfrom relabelto };

# Allow remote_user to bind to sockets in the /tmp directory
allow remote_user_t remote_user_tmp_t:unix_stream_socket name_bind;
Fig. 3 SE Linux Security Policy for Remote Applications Executing on the Host Peer.

# Security policy for a remote executable.
# Types for the files created by remote_app
type remote_app_file_t, file_type;

# Allow remote_app to create files and directories of type remote_app_file_t
create_dir_file(remote_app_t, remote_app_file_t)

# Allow user domains to read files and directories of these types
r_dir_file(userdomain, { remote_app_file_t })
can_exec(remote_app_t, { remote_app_exec_t bin_t })

# Allow network access
can_network(remote_app_t)

# Allow socket operations
can_unix_send( { remote_app_t sysadm_t }, { remote_app_t sysadm_t } )
allow remote_app_t self:unix_dgram_socket create_socket_perms;
allow remote_app_t self:unix_stream_socket create_stream_socket_perms;
can_unix_connect(innd_t, self)

# Allow PIPE operations and binding to a socket
allow remote_app_t self:fifo_file rw_file_perms;
allow remote_app_t innd_port_t:tcp_socket name_bind;
allow remote_app_t innd_var_run_t:sock_file create_file_perms;

# Deny the privilege to set gid and uid
deny remote_app_t self:capability { dac_override kill setgid setuid };
deny remote_app_t self:process setsched;

# Deny access to key system directories
deny remote_app_t { bin_t sbin_t }:dir search;
deny remote_app_t usr_t:lnk_file read;
deny remote_app_t usr_t:file { getattr read ioctl };
deny remote_app_t lib_t:file ioctl;
deny remote_app_t { etc_t resolv_conf_t }:file { getattr read };
deny remote_app_t { proc_t etc_runtime_t }:file { getattr read };
deny remote_app_t urandom_device_t:chr_file read;
4.2 The Remote Application Monitor
The RAM is a collection of scripts which uses the psmon [21] freeware tool to enforce quotas such as the maximum number of application instances, the time to live for the remote application, maximum CPU utilization and maximum memory utilization. psmon has the ability to slay processes which exceed the configured limits. Other quotas, such as the maximum number of network connections, are implemented using scripts based on the output of the netstat command on Linux. Monitoring covers % CPU utilization, the number of instances of the remote application, disk usage, the number of network connections, network usage and time-based usage (CPU time). The RAM allows the user to specify quotas on these parameters, and if the application exceeds a quota, it is terminated by the RAM. Although the custom security policy on SE Linux denies access to critical system resources, it does not prevent the remote application from initiating malicious activity, such as Distributed Denial-of-Service (DDoS) attacks on other peers in the P2P network, by pumping out invalid packets or queries. Also, SE Linux is unable to specify the quantum of resource usage, a requirement for collaborative P2P environments, where resources are exchanged frequently. Hence, we need the RAM to strictly enforce resource usage limits on the peer hosting the remote application, besides monitoring the number of network connections established by the remote application and the traffic it generates. It is planned to extend
the RAM to monitor several other application-specific parameters in the future.
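A connection-count quota of the kind described above can be enforced by parsing netstat output. The sketch below works on a canned sample string; the column layout shown and the process name remote_app are assumptions for illustration, since real netstat output varies across distributions and options.

```python
# Sketch of counting a process's network connections from netstat-style
# output. SAMPLE_NETSTAT and the process name "remote_app" are hypothetical.

SAMPLE_NETSTAT = """\
tcp  0  0 10.0.0.5:43210  192.168.1.9:8080  ESTABLISHED 2412/remote_app
tcp  0  0 10.0.0.5:43211  192.168.1.7:8080  ESTABLISHED 2412/remote_app
tcp  0  0 10.0.0.5:22     10.0.0.9:51515    ESTABLISHED 1022/sshd
"""

def count_connections(netstat_output, process_name):
    """Count ESTABLISHED connections owned by the given process."""
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # The last column is "PID/program name"; match on the program name.
        if "ESTABLISHED" in fields and fields and \
                fields[-1].endswith("/" + process_name):
            count += 1
    return count

# In a live deployment the output would come from running netstat itself,
# e.g. subprocess.run(["netstat", "-tnp"], capture_output=True, text=True).
CONNECTION_QUOTA = 1  # illustrative threshold
over_quota = count_connections(SAMPLE_NETSTAT, "remote_app") > CONNECTION_QUOTA
```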
5 Analysis
Fig. 4 provides details of the responses of the proposed security framework to some potentially malicious activities by a sample application.
Fig. 4 Possible malicious behavior by a remotely hosted application and the responses of the CBSM

S.No | Malicious Activity | Result
1. | Establishing multiple network connections and generating traffic above threshold | RAM terminates the application.
2. | Attempt to access files in a directory outside the reserved area | Access declined by SE Linux.
3. | Attempt to fork repeatedly, resulting in multiple instances of the remote application | RAM terminates the application after it crosses the num_instances quota for the remote app.
4. | Attempt to generate network traffic and propagate Distributed-Denial-of-Service attacks | RAM terminates the application after monitoring the network traffic generated by it.
5. | Application performs compute-intensive tasks leading to high CPU utilization | RAM terminates the application after it crosses the CPU utilization quota.
6. | Attempt to execute system calls with admin privileges | SE Linux security policy denies access to the application.
The overheads introduced by the Remote Application Monitor (RAM) are incurred only while the remote application is running. Although SE Linux does introduce some overhead in making access decisions at run-time, optimizations such as caching access decisions for future use keep the impact on application performance at negligible levels. A complete analysis of the performance overheads of SE Linux can be found in [22]. It is evident that by combining the rule-based access control mechanisms provided by SE Linux with the custom Remote Application Monitor (RAM), the security of the peer hosting remote work is enhanced significantly. By providing more flexible security configuration settings, it is possible to safely enable peer interactions across organizations and host untrusted remote applications.
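The caching optimization mentioned above (SE Linux's access vector cache) amounts to memoizing per-(domain, type, permission) decisions so that repeated checks skip the slower policy lookup. The following is a simplified illustration with a hypothetical policy table, not the kernel implementation.

```python
from functools import lru_cache

# Simplified illustration of caching access decisions, in the spirit of
# SE Linux's access vector cache. The policy table below is hypothetical.

POLICY = {
    ("remote_app_t", "remote_app_file_t", "write"): True,
    ("remote_app_t", "etc_t", "read"): False,
}

LOOKUPS = []  # records how often the (slow) policy lookup actually runs

@lru_cache(maxsize=None)
def access_allowed(domain, target_type, permission):
    # This body only executes on a cache miss.
    LOOKUPS.append((domain, target_type, permission))
    return POLICY.get((domain, target_type, permission), False)  # default deny

# Repeated checks for the same triple hit the cache, not the policy lookup.
access_allowed("remote_app_t", "etc_t", "read")
access_allowed("remote_app_t", "etc_t", "read")
```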
6 Conclusions and Future Work
Our research shows that by using standard off-the-shelf components combined with a custom application monitor, it is possible to build a secure environment for cycle-stealing P2P applications. Very little of the research community's attention has been focused on this important area. Hence, this research is significant, since it allows individual peers to be secured so that they can host any remote application without putting their resources or confidential information at risk. Moreover, the proposed scheme does not require any expensive trust/reputation management schemes to ensure security. It is hoped that this research can form the basis for more comprehensive security mechanisms, promoting the adoption and deployment of P2P applications across organizations and helping realize the enormous potential of cross-organization P2P interactions.
Future work shall involve identifying as many scenarios of malicious activity by remote applications as possible and tweaking the system to provide fool-proof security. A detailed analysis of the effectiveness of the proposed model and its performance overheads shall be available shortly, once the system has been tested extensively against several flavors of malicious code and compared with existing security approaches.
7 References
[1] SETI@HOME: http://setiathome.berkeley.edu
[2] AVAKI: http://www.avaki.com
[3] CONDOR: http://www.wisc.com/condor
[4] KAZAA: http://kazaa.com/us/index.htm
[5] FACEBOOK: http://www.facebook.com
[6] GROOVE: http://office.microsoft.com/en-us/groove/FX100487641033.aspx
[7] JXTA: http://www.sun.com/jxta
[8] Ankur Gupta and Lalit K. Awasthi: "Peer Enterprises: Possibilities, Challenges and Some Ideas Towards Their Realization", Proceedings of the 1st International Workshop on Peer-to-Peer Networks, Nov 26, 2007, OTM Workshops, Part II, LNCS 4806, pp. 1011-1020, Springer-Verlag.
[9] SE Linux: http://www.nsa.gov/selinux/info/docs.cfm
[10] Sergio Marti and Hector Garcia-Molina: "Taxonomy of Trust: Categorizing P2P Reputation Systems", COMNET Special Issue on Trust and Reputation in Peer-to-Peer Systems, 2005.
[11] S. D. Kamvar, M. T. Schlosser and H. Garcia-Molina: "The EigenTrust Algorithm for Reputation Management in P2P Networks", Proc. WWW 2003.
[12] A. Singh and L. Liu: "TrustMe: Anonymous Management of Trust Relationships in Decentralized P2P Systems", Proceedings of the Third International Conference on Peer-to-Peer Computing, September 2003.
[13] Y. Kim, D. Mazzocchi and G. Tsudik: "Admission Control in Peer Groups", Second IEEE International Symposium on Network Computing and Applications, p. 131, Apr. 2003.
[14] J. S. Park and J. Hwang: "Role-based Access Control for Collaborative Enterprise in Peer-to-Peer Computing Environments", Proceedings of the Eighth ACM Symposium on Access Control Models and Technologies, pp. 93-99, 2003.
[15] L. P. Gaspary, M. P. Barcellos, A. Detsch and R. S. Antunes: "Flexible security in peer-to-peer applications: Enabling new opportunities beyond file sharing", Computer Networks 51, 17 (Dec. 2007).
[16] J. S. Park, G. An and D. Chandra: "Trusted P2P computing environments with role-based access control", IET Information Security, Volume 1, Issue 1, March 2007, pp. 27-35.
[17] HP-UX 11i Security Containment White Paper: http://h20338.www2.hp.com/hpux11i/downloads/SecurityContainmentExecutiveWhitepaper1.pdf
[18] Sun Trusted Solaris: http://www.sun.com/software/solaris/trustedsolaris/documentation/index.xml
[19] VMWare: http://www.vwware.com
[20] Mendel Rosenblum and Tal Garfinkel: "Virtual Machine Monitors: Current Technology and Future Trends", Computer, vol. 38, no. 5, pp. 39-47, May 2005.
[21] psmon (Process Table Monitoring Script): http://www.psmon.com
[22] Peter Loscocco and Stephen Smalley: "Integrating Flexible Support for Security Policies into the Linux Operating System", Proceedings of the FREENIX Track: 2001 USENIX Annual Technical Conference, pp. 29-42, June 25-30, 2001.
Behavioral Trust Management System
S. Thamarai Selvi1, R. Kumar1, and K. Rajendar1
1Information Technology, CARE, MIT, Anna University, Chennai, Tamil Nadu, India
Abstract - A Grid environment provides the capability to access a variety of heterogeneous resources across multiple domains or organizations. The real challenge in such a distributed environment is the selection of an appropriate Resource Provider. Effective utilization of a Resource Provider can be achieved by considering the trust relationships among the entities. In this paper, a Trust Management System for resource selection in the Grid is introduced, which considers trust as one of the selection parameters. We propose a methodology for computing the trustworthiness of a Resource Provider based on its past behavior and the belief in its current execution.
Keywords: Trust Management, Resource provider.
1 Introduction
Trust is a complicated concept, involving the ability to generate, understand and build relationships. Trust is defined as the quantified belief by a trustor with respect to the competence, honesty, security and dependability of a trustee within a specified context [28]. Based on our literature survey, trust models can be classified according to their scope, the principle used for computing trust, the transaction, and the nature of the trust metrics. Trust relationships exist between various entities in a computational grid environment, and the Resource Management System (RMS) plays a significant role in determining the trustworthiness of a Resource Provider (RP). Although there is a great need to evaluate the trust relationships among all these entities, the main focus and challenge lies in determining the trust of the Resource Provider. In a computational Grid environment, with varied and distributed Resource Providers and users, the RMS plays a vital role in determining the QoS of a Resource Provider and the security considerations while selecting it. The integration of trust into such a Resource Management System has been proposed by various researchers [4]. This objective of using trust as one of the parameters for resource selection motivates our proposal of a generic Trust Management System.
The rest of the paper is organized as follows: Section 2 describes our proposed Trust Management System. The mathematical model is derived in Section 3. A trust computation example is explained in Section 4. Section 5 concludes our work.
2 Trust Management System
Various proposals have been made for evaluating the trustworthiness of a resource provider. PET, a PErsonalized Trust model [29], aims at determining the trustworthiness of a P2P system using factors such as reputation and risk. This model uses recommendations and interaction-derived information to derive reputation and risk. A recommendation is the opinion of peers about a target peer, and it involves collecting feedback from the peers. Although this model classifies services into four levels ranging from Good to Byzantine, the constant (negative or positive) values associated with these levels are left to the user's choice. Further, the model lacks concrete criteria for distinguishing the four levels of service. In [30], trust for multi-agent systems is proposed. In that model, trust is treated as the combination of self-esteem, reputation and familiarity within a specific context. The computation methodology used to determine reputation involves the user's opinion about the service, so the result is subjective in nature, as expressed by the user. Further, that model lacks a grade for reviewing the user's level of trust.
We propose a solution to the problems discussed above by considering both the resource provider's past experience with the Resource Management System and the present salient features of the resource provider, which is useful in a computational Grid environment. A Trust Management System that identifies the trust of any resource provider in the Grid is discussed below.
Most research papers deal with trust computation using methodologies that include subjective measures among the parameters. Further, they focus on a reputation factor that depends entirely on the opinions of the entities involved. We propose a new and generic Trust Management System (TMS) which follows a trust management life cycle in determining the trust of an RP. Our model has the advantage of using trust metrics that are objective in nature: we compute the service entity's reputation and quality of service indirectly. The model identifies the trust metrics of any resource provider in the Grid environment, and these metrics are important evidence of its quality of service. It also considers both the historic value and the current resource capability, which is significant in a computational environment. The TMS follows a specific life cycle, whose phases are as follows.
Int'l Conf. Grid Computing and Applications | GCA'08 | 147
Figure 1. Life cycle of a Trust Management System
The various phases of the TMS are:
Trust Metric Identification,
Trust Metric Evaluation,
Trust Metric Calculation,
Trust Value Updation, and
Trust Integration.
2.1 Trust Metric Identification
Trust metric identification is the first and most important phase in the life cycle. It is essential to identify trust metrics that reflect the nature of the system involved.
2.2 Trust Metric Evaluation
The next phase is trust metric evaluation, in which a suitable methodology is applied to determine the values of the metrics identified in the previous phase.
2.3 Trust Metric Calculation
The trust metrics thus evaluated are then combined using a suitable mathematical model. Once the values for all the metrics are obtained, the overall trust value of a resource provider is determined. This requires a formalization of the trust model expressed in terms of the identified metrics.
2.4 Trust Value updation
To reflect the dynamic nature of the Grid environment, where trust values change rapidly as resource providers and consumers interact, it is mandatory to monitor and compute the metrics periodically and to update the trust value.
2.5 Trust Integration
The trust value thus obtained is then used for making decisions about job scheduling, service access and other purposes, depending on the type of trust system established.
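The five phases above can be sketched as a minimal control loop. This is an illustrative sketch only: the class and method names (TrustManagementSystem, evaluate_metrics, and so on) are ours, not part of any existing implementation, and the combination rule is a plain product mirroring the historic function defined later.

```python
# Illustrative sketch of the TMS life cycle; all names are hypothetical.

class TrustManagementSystem:
    def __init__(self):
        # Phase 1: Trust Metric Identification -- metrics chosen for this system.
        self.metrics = ["availability", "success_rate"]
        self.trust = {}  # resource provider -> trust value

    def evaluate_metrics(self, observations):
        # Phase 2: Trust Metric Evaluation -- measure each identified metric.
        return {m: observations[m] for m in self.metrics}

    def calculate_trust(self, values):
        # Phase 3: Trust Metric Calculation -- combine metrics with a model
        # (here, a plain product of the metric values).
        result = 1.0
        for v in values.values():
            result *= v
        return result

    def update(self, rp, observations):
        # Phase 4: Trust Value Updation -- recompute as new observations arrive.
        self.trust[rp] = self.calculate_trust(self.evaluate_metrics(observations))

    def select(self):
        # Phase 5: Trust Integration -- use trust in the scheduling decision.
        return max(self.trust, key=self.trust.get)

tms = TrustManagementSystem()
tms.update("rp1", {"availability": 0.8, "success_rate": 0.9})
tms.update("rp2", {"availability": 0.6, "success_rate": 0.5})
print(tms.select())  # rp1 has the higher trust value (0.72 vs 0.30)
```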
Azzedin [4] proposed a mathematical model that depends on reputation and trust level. That model computes trust using a direct trust factor and reputation values obtained from other entities. The direct trust is classified into levels A through F, with A assumed to be the highest trust level and F the lowest; however, no concrete values are assigned to those levels. Also, the reputation earned by a system is fully subjective in nature, and the model does not address the identification of misbehaving entities. Our model has the advantage of considering only the objective nature of trust; no subjective measures are involved in the computation of the trust metrics. Although many kinds of failure are possible in a Grid environment, we focus only on the quality of service provided by the resource provider, taking into account the failures that occur at the resource provider. We regard the trust of a resource provider as the combined effect of its reputation and the degree of fulfilment belief. Reputation is the opinion of the resource provider's proven track record from the user's perspective. Since we compute reputation indirectly, without obtaining the user's opinion, the two main trust metrics we have identified are success rate and availability. Availability is the ratio of the number of times the resource is in a working state to the total number of attempts made to verify the resource's existence; this ratio depicts the resource provider's presence and dedication to the Grid environment. Success rate reflects the resource provider's capability in job execution; it is the ratio of the number of jobs executed successfully to the total number of jobs submitted to the resource provider. These two parameters, availability and success rate, thus reflect the reputation earned by the resource provider over a period of time.
We define a function to represent the historic trust value of a resource provider r_i with respect to the parameters availability and success rate. We represent trust as the combined effect of the resource provider's past experience and the present features of its system. Thus trust can be represented as the weighted sum of the function f_H(r_i, n) and the function f_P(r_i, t_s). The function f_H(r_i, n) represents the historic value of the resource provider r_i with respect to availability and success rate over a range of n interactions with the Resource Broker. The function f_P(r_i, t_s) represents the comparative ratio of the resource provider with respect to its system configuration and network bandwidth. We define the function f_H(r_i, n) as follows:

f_H(r_i, n) = A_v(r_i) * S_v(r_i)     ----- (1)
The function f_H(r_i, n) is thus the product of the availability rate and the success rate. The value A_v(r_i) represents the availability rate of resource provider r_i, and S_v(r_i) represents its success rate. A_v(r_i) can be computed as follows:
A_v(r_i) = (1/n) * Σ_{j=1}^{n} A_j,              if 1 ≤ n < c_r
A_v(r_i) = (1/c_r) * Σ_{j=n-c_r+1}^{n} A_j,      if n ≥ c_r ≥ 1     ------ (2)
The value S_v(r_i) of a resource provider r_i can be expressed as follows:
S_v(r_i) = (1/m) * Σ_{k=1}^{m} S_k,              if 1 ≤ m < c_r
S_v(r_i) = (1/c_r) * Σ_{k=m-c_r+1}^{m} S_k,      if m ≥ c_r ≥ 1     ----- (3)
The values n and m are positive integers starting from 1. The functions A_v(r_i) and S_v(r_i) are simple averages of the availability and the success rate while the number of interactions is below the constant c_r; beyond c_r interactions, a moving average over a window of c_r is applied. The constant c_r may vary depending on the nature of the Grid application: in a Grid environment where resources are used frequently, c_r can be fixed to a higher value, and otherwise to a smaller value. Here c_r does not refer to a specific time span but only to the number of times the resources have communicated with the Resource Broker; it mainly reflects how much a resource has been utilized, and has proven successful in job execution, over a window of c_r interactions. The values A_j and S_k can each take the value 0 or 1. If the resource is available at the time of scheduling, A_j is 1, and if it is not reachable, A_j is 0. Similarly, if resource provider r_i executes a job successfully, S_k is 1, and if the job fails during execution on r_i, S_k is 0. Thus, for any resource provider r_i to have a past historic value, it must have at least one successful job execution. The value obtained through the function f_H(r_i, n) therefore represents the past behavior of resource provider r_i with respect to a Resource Broker (RB).
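The historic function can be sketched directly from equations (1) through (3). This is a minimal sketch under the stated assumptions: availability and success observations are 0/1 lists, and the helper names (windowed_average, f_H) are ours, since the paper defines only the mathematics.

```python
# Sketch of f_H(r_i, n): product of availability rate and success rate,
# each averaged simply until c_r observations exist and with a moving
# average over the last c_r observations afterwards.

def windowed_average(bits, c_r):
    # First case of eq. (2)/(3): simple average over all observations.
    if len(bits) < c_r:
        return sum(bits) / len(bits)
    # Second case: moving average over the last c_r observations.
    recent = bits[-c_r:]
    return sum(recent) / c_r

def f_H(availability_bits, success_bits, c_r):
    # Equation (1): product of availability rate and success rate.
    A_v = windowed_average(availability_bits, c_r)
    S_v = windowed_average(success_bits, c_r)
    return A_v * S_v

# A provider probed in 6 scheduling attempts (available 5 times) that
# completed 4 of the 5 jobs submitted to it, with a window of c_r = 4:
print(f_H([1, 1, 0, 1, 1, 1], [1, 1, 0, 1, 1], 4))  # 0.75 * 0.75 = 0.5625
```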
The function f_P(r_i, t_s) computes the current capability of a resource with respect to its CPU (GHz) and network bandwidth (Mbps). This function takes the product of the normalized CPU and network bandwidth terms as follows:
f_P(r_i, t_s) = (Cr_i / CMax + C1) * (Nr_i / NMax + C2)     ----- (4)
In equation (4), CMax and NMax are computed using the following functions:

CMax = max(Cr_i), where i = 1 .. N     ----- (5)
NMax = max(Nr_i), where i = 1 .. N     ----- (6)
N in equations (5) and (6) represents the total number of resources considered at the time of scheduling. The function in equation (5) computes the maximum CPU speed, in GHz, among the selected resources; similarly, the function in equation (6) computes the maximum network bandwidth, in Mbps, from the list of selected resource providers. The main significance of these two functions is to enable a comparative analysis among the list of resource providers and to select the best one: more weight is given to a resource provider r_i whose CPU and bandwidth are higher. The function f_P(r_i, t_s) thus computes the present system capability of a resource provider r_i. Here we introduce the arbitrary constants C1 and C2, which may vary depending on the CMax and NMax values obtained. The trust of a resource provider r_i can then be represented as follows:
T(r_i) = α * f_H(r_i, n) + β * f_P(r_i, t_s)     ----- (7)
The constants α and β are the weights of the two functions. They can be set to any values, depending on whether the system should give more importance to the historic value or to the present system capability of an entity.
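Equations (4) through (7) can be sketched as follows. The concrete constants (C1, C2, α, β) and the two example providers are illustrative choices of ours, not values fixed by the model.

```python
# Sketch of the present-capability term f_P (eq. 4) and the weighted
# trust value T(r_i) (eq. 7), with CMax and NMax from eqs. (5) and (6).

def f_P(cpu, bw, cmax, nmax, c1=0.0, c2=0.0):
    # Equation (4): product of the CPU and bandwidth terms, each
    # normalized by the maximum among the candidate resources.
    return (cpu / cmax + c1) * (bw / nmax + c2)

def trust(f_h, cpu, bw, cmax, nmax, alpha=0.5, beta=0.5):
    # Equation (7): weighted sum of historic value and present capability.
    return alpha * f_h + beta * f_P(cpu, bw, cmax, nmax)

# Two hypothetical providers: r1 has the better history, r2 the better hardware.
resources = {
    "r1": {"f_h": 0.72, "cpu": 1.5, "bw": 570.0},
    "r2": {"f_h": 0.30, "cpu": 3.5, "bw": 890.0},
}
cmax = max(r["cpu"] for r in resources.values())  # equation (5)
nmax = max(r["bw"] for r in resources.values())   # equation (6)

for name, r in resources.items():
    print(name, trust(r["f_h"], r["cpu"], r["bw"], cmax, nmax))
```

Shifting weight from α toward β favours providers like r2 with stronger current hardware over providers like r1 with a stronger history, which is the trade-off explored in the experiments below.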
3 Trust Computation Example
To illustrate how our trust model works, we provide an example of evaluating the trust of a service provider using the proposed mathematical model. Assume there are N service providers and that each entity has built some level of trust with the Grid environment, i.e. each has some previous transactions. We describe how our mathematical model helps identify the suitable resources available in the Grid and then updates the trust value of every resource that took part in the service life cycle. As an example, we take 30 service providers, each with a different past historic trust and different system features, as tabulated in Table 1.
Table 1. Simulation results of the trust parameters
Let us assume that all these resources are available and suitable for the user's requirements. The various trust metrics, namely success rate, availability rate, CPU (GHz) and bandwidth (Mbps), and their associated resources are shown in Table 1. By applying our mathematical model, the trust of all the resources is computed. This trust computation is based on the constraint that equal weight is given to the historic value and the present system capability. The trust values thus computed are depicted in Figure 2.
Figure 2. Simulation results of Trust value
As Figure 2 shows, Resource5 has the highest trust value. Resource5 has the highest availability and success rates, 0.8 and 0.9 respectively, while its bandwidth and CPU power are 570 and 1.5 respectively. This confirms that our selection criterion is satisfied: the resource with the highest values in both factors is selected.
Similarly, experiments were carried out with priority-based trust computation. In this experiment, we gave more weight to the CPU power and bandwidth configuration than to the historic value. The same resource characteristics as in Table 1 were used, and the trust values were computed with the mathematical model in the same fashion. The resulting trust values are shown in Figure 3.
Figure 3. Simulation results of Trust value
From Figure 3, the resource with the highest trust value is Resource22. Since this resource selection gives more weight to the system capability than to the historic value, the resource with the highest system capability should be selected, and applying our mathematical model indeed yields Resource22. Resource22 has success and availability rates of 0.5 and 0.6 respectively, and the highest bandwidth and CPU power, 890 and 3.5 respectively.
After each transaction, the availability of all the involved resources is updated. Similarly, the transaction result, success or failure, is recorded for the particular resource entity. These updated values then serve as the source of the historic trust when it is computed for the next transaction.
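The update step described above can be sketched as follows. The dictionary layout and the function name are illustrative assumptions of ours; the point is only that each transaction appends 0/1 observations, so the next f_H computation reflects the new history.

```python
# Sketch of trust value updation: after each transaction, append the
# availability observation (and, if a job was submitted, the success
# observation) for the provider involved.

history = {"r1": {"availability": [1, 1, 0, 1], "success": [1, 0, 1]}}

def record_transaction(rp, reachable, job_succeeded=None):
    h = history.setdefault(rp, {"availability": [], "success": []})
    # Availability is updated on every scheduling attempt (0/1).
    h["availability"].append(1 if reachable else 0)
    # Success or failure is recorded only when a job was actually submitted.
    if job_succeeded is not None:
        h["success"].append(1 if job_succeeded else 0)

record_transaction("r1", reachable=True, job_succeeded=True)
record_transaction("r1", reachable=False)  # unreachable: no job submitted
print(history["r1"]["availability"])  # [1, 1, 0, 1, 1, 0]
print(history["r1"]["success"])       # [1, 0, 1, 1]
```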
4 Conclusion
In this paper, we have proposed a trust model architecture that manages and computes the trust of any service entity available in the Grid. The trust value thus obtained is fully objective in nature, which is an advantage over other models that rely on subjective measures. We have also proposed a mathematical model for computing the trust value of any service entity. Furthermore, this trust architecture can be integrated into any Grid meta-scheduler to improve its resource selection module and hence increase the reliability of the meta-scheduler.
5 Acknowledgement
The authors would like to thank the Department of Information Technology, Ministry of Communication and Information Technology of India, for their financial support and encouragement in pursuing this research through its sponsored Centre for Advanced Computing Research and Education (CARE).
References
[1] Foster, I., Kesselman, C., Tuecke, S., "The Anatomy of the Grid: Enabling Scalable Virtual Organizations," International Journal of Supercomputer Applications, 2001.
[2] Foster, I., Kesselman. C., Nick, J.M., Tuecke, S., “The
Physiology of the Grid: An Open Grid Services Architecture
for Distributed Systems Integration”, Open Grid Service
Infrastructure WG, Global Grid Forum 2002.
[3] Foster, I., Kesselman, C. (editors), The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann: San Francisco, CA, 1999.
[4] Farag Azzedin and Muthucumaru Maheswaran, "Integrating Trust into Grid Resource Management Systems", International Conference on Parallel Processing, Vol. 1, pp. 47-54, 2002.
[5] Ramin Yahyapour, Philipp Wieder (editors), “Grid
Scheduling Use Cases,” GFD-I.064, Grid Scheduling
Architecture Research Group (GSA- RG), March 26, 2006.
[6] Eduardo Huedo, Ruben S. Montero and Ignacio M.
Llorente, “An Experimental Framework for Executing
Applications in Dynamic Grid Environments,” ICASE Nov
2002.
[7] R. Yahyapour and Ph. Wieder, "Grid Scheduling Use Cases v1.5," Grid Working Draft, Global Grid Forum, 2005.
https://forge.gridforum.org/projects/gsa-rg/document /
GridScheduling Use Cases V1.5.doc
[8] Abramson D, Giddy J, Kotler L, “High performance
parametric modeling with nimrod/g: Killer application for the
global grid?,” Proceedings of the 14th International Parallel
and Distributed Processing Symposium (IPDPS 2000), April
2000; 520–528.
[9] Buyya R, Abramson D, Giddy J. Nimrod/G: An
architecture for a resource management and scheduling
system in a global computational Grid. Proceedings of the
International Conference on High Performance Computing
in Asia–Pacific Region (HPC Asia 2000), 2000.
[10] U. Schwiegelshohn, R. Yahyapour, Ph. Wieder,
“Resource Management for Future Generation Grids,”
Technical Report TR-0005, Institute on Scheduling and
Resource Management, CoreGRID – Network of Excellence,
May 2005
[11] T. Grandison and M. Sloman, “A Survey of Trust in
Internet Applications”, IEEE Communications Surveys and
Tutorials, Vol. 4, No. 4, pp. 2-16, 2000.
[12] R. Dingledine, et al., "Reputation in p2p anonymity systems", Proc. of the 1st Workshop on Economics of P2P Systems, June 2003.
[13] B. Gross, et al., "Balances of power on eBay: Peers or unequals?", Workshop on Economics of P2P Systems, June 2003.
[14] K. Aberer, et al., "Managing trust in peer-to-peer information systems", Proceedings of the 10th International Conference on Information and Knowledge Management, 2001.
[15] Indrajit Ray and Sudip Chakraborty, "A Vector Model of Trust for Developing Trustworthy Systems", 1999, pp. 259-278.
[16] Bin Yu and Munindar P. Singh, "Distributed Reputation Management for Electronic Commerce", First International Joint Conference on Autonomous Agents and Multiagent Systems, Bologna, Italy, 2002.
[17] A. Abdul-Rahman, "The PGP Trust Model", EDI-Forum, April 1997.
[18] M. Blaze, J. Feigenbaum, and J. Lacy, "Decentralized trust management," IEEE Conference on Security and Privacy, 1996.
[19] M. Blaze, “Using the KeyNote trust management
system,” AT&T Research Labs,1999.
[20] P. Resnick, R. Zeckhauser, E. Friedman and K. Kuwabara, "Reputation systems", Communications of the ACM 43(12):45-48, 2000.
[21] R. Dingledine, N. Mathewson, and P. Syverson, "Reputation in p2p anonymity systems", Proc. of the 1st Workshop on Economics of Peer-to-Peer Systems, June 2003.
[22] Matt Blaze, Joan Feigenbaum, and Jack Lacy, "Decentralized Trust Management", IEEE Symposium on Security and Privacy, Oakland, CA, May 6-8, 1996. IEEE Press.
[23] Ernesto Damiani, Sabrina De Capitani di Vimercati, Stefano Paraboschi, Pierangela Samarati and Fabio Violante, "A reputation-based approach for choosing reliable resources in peer-to-peer networks", 9th ACM Conference on Computer and Communications Security, ACM Press, November 2002.
[24] Sepandar D. Kamvar, Mario T. Schlosser, and Hector Garcia-Molina, "The EigenTrust Algorithm for Reputation Management in P2P Networks", Twelfth International World Wide Web Conference, Budapest, Hungary, May 20-24, 2003. ACM Press.
[25] Michael Brinklov and Robin Sharp, “Incremental Trust
in Grid Computing”, Seventh IEEE International Symposium
on Cluster Computing and the Grid (CCGrid’07), March 2007
[26] Runfang Zhou and Kai Hwang, "Trust Overlay Networks for Global Reputation Aggregation in P2P Grid Computing", IEEE International Parallel and Distributed Processing Symposium (IPDPS 2006), Rhodes Island, Greece, April 2006.
[27] Muhammad Hanif Durad and Yuanda Cao, "A Vision for the Trust Managed Grid", Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid Workshops (CCGRIDW'06), 2006.
[28] Tyrone Grandison and Morris Sloman, "Specifying and Analysing Trust for Internet Applications", 2nd IFIP Conference on e-Commerce, e-Business, e-Government (I3e 2002), Lisbon, October 2002.
[29] Zhengqiang Liang and Weisong Shi, "PET: A
PErsonalized Trust model with Reputation and Risk
Evaluation for P2P Resource Sharing", Proceedings of the
38th Hawaii International Conference on System Sciences
(2005).
[30] Gabriel Queiroz Lana and Carlos Becker Westphall, "User Maturity Based Trust Management for Grid Computing", Seventh International Conference on Networking, 2008.
[31] Mahantesh Hosamani, Harish Narayanappa and Hridesh Rajan, "Monitoring the Monitor: An Approach towards Trustworthiness in Service Oriented Architecture", Technical Report TR #07-07, 2007.
[32] P.Varalakshmi, S. Thamarai Selvi and M.Pradeep, “A
multi-broker trust management framework for resource
selection in grid”, IAMCOM, 2007.
Research on Security Resource Management Architecture for Grid Computing System
Tu Guoqing1
1Computer School, Wuhan University, Wuhan, Hubei, China
Abstract- In a grid computing environment, large heterogeneous resources are shared among geographically distributed virtual organization members, each having their own resource management policies and different access and cost models. Many projects have designed and implemented resource management systems with a variety of architectures and services. Grid applications have increasingly demanding functional and security requirements. However, current techniques mostly protect only the resource provider from attacks by the user, while leaving the user comparatively dependent on the good behavior of the resource provider. In this paper, we analyze the security problems existing in grid computing systems, describe their security mechanisms, and propose a domain-based trustworthy resource management architecture for grid computing systems.
Keywords: grid computing, resource management
1 Introduction
Grid applications are distinguished from traditional client-server applications by their simultaneous use of large numbers of resources, dynamic resource requirements, use of resources from multiple administrative domains, complex communication structures, and stringent performance requirements, among others [1]. Many of these applications rely on the ability of message-processing intermediaries to forward messages, and controlling access to applications through robust security protocols and security policy is paramount to controlling access to VO resources and assets. Thus, authentication mechanisms are required so that the identity of individuals and services can be established, and service providers must implement authorization mechanisms to enforce policy over how each service can be used. The security challenges faced in a Grid environment can be grouped into three categories: integration with existing systems and technologies, interoperability with different "hosting environments", and trust relationships among interacting hosting environments.
The security of a grid computing system must address the following problems: user masquerading, server masquerading, data wiretapping and tampering, remote attacks, resource abuse, malicious programs, and system integrity. A grid computing system is a complicated, dynamic and wide-area system, and adding restricted authorization for users cannot be achieved with current technologies alone, so developing a new security architecture is necessary.
Some well-known security models have already been put into application, such as the GT2 Grid Security Model and the GT3 Security Model for OGSA [2]. Based on a deep analysis and comparison of many kinds of security models for resource management in grid computing systems, this paper presents a domain-based security model for a resource management architecture as an effective trustworthy security model for grid computing systems.
This paper is organized as follows: Section 2 reviews background and related work, Section 3 describes the architecture of the domain-based model and the advance reservation algorithm, and Section 4 concludes.
2 Related work
Security is a major concern in Grid architectures because of the sharing inherent in Grid environments, which goes beyond sharing data or basic computing resources within large organizations. Primarily, Grid environments aim at direct access to computers, software, data, and other resources, as required by a range of collaborative problem-solving and resource-brokering strategies. Because it crosses physical and organizational boundaries, a Grid environment demands solutions for supporting security policies and managing credentials, and support for remote access to computing and data resources must be provided. Further, Grid deployments include a wide variety of mobile devices, gateways, proxies, load balancers, globally distributed data centers, demilitarized zones, etc.
2.1 Grid Security Challenges
The security challenges in a Grid environment can be grouped into three main categories: integration, interoperability, and trust relationships [3].
Integration: it is unreasonable to expect that a single security technology can be defined to address all Grid security challenges and be adoptable in every hosting environment. Legacy infrastructure cannot be changed rapidly, and hence the security architecture in a Grid environment should integrate with existing security infrastructure and models. For example, each domain in a Grid environment is likely to have one or more registries in which user accounts are maintained; such registries are unlikely to be shared with other organizations or domains. Similarly, authentication mechanisms deployed in an existing environment that is considered secure and reliable will continue to be used. Each domain typically has its own authorization infrastructure that is deployed, managed and supported, and it will typically not be acceptable to replace any of these technologies in favor of a single model or mechanism.
Interoperability: Grid technology is designed to operate services that traverse multiple domains and hosting environments. To interact correctly and efficiently with other systems, interoperability is needed at multiple levels, such as the protocol, policy and identity levels.
The trust relationship problem is made more difficult in a Grid environment by the need to support the dynamic, user-controlled deployment and management of transient services. Trust relationships among the participating domains in a Grid environment are important for end-to-end traversals.
This combination of dynamic policy overlays and dynamically created entities drives the need for three key functions in a Grid security model.
1. Multiple security mechanisms. Organizations participating in a VO often have significant investment in existing security mechanisms and infrastructure. Grid security must interoperate with, rather than replace, those mechanisms.
2. Dynamic creation of services. Users must be able to create new services (e.g., "resources") dynamically without administrator intervention. These services must be coordinated and must interact securely with other services. Thus, we must be able to name the service with an assertable identity and to grant rights to that identity without contradicting the governing local policy.
3. Dynamic establishment of trust domains. In order to coordinate resources, VOs need to establish trust not only among users and resources in the VO but also among the VO's resources, so that they can be coordinated. These trust domains can span multiple organizations and must adapt dynamically as participants join, are created, or leave the VO.
In summary, security challenges in a Grid environment can be addressed by categorizing the solution areas: (1) integration solutions where existing services need to be used, with interfaces abstracted to provide an extensible architecture; (2) interoperability solutions, so that services hosted in different virtual organizations that have different security mechanisms and policies can invoke each other; and (3) solutions to define, manage and enforce trust policies within a dynamic Grid environment. The dependency between these three categories is illustrated in Fig.1.
Fig.1 Categories of security challenges
2.2 Grid Security Requirements
Security is one of the characteristics of an OGSA-compliant component. The basic requirements of an OGSA security model are that security mechanisms be pluggable and discoverable by a service requestor from a service description. OGSA security must be seamless from the edge of the network to the application and data servers, and must allow the federation of security mechanisms not only at intermediaries, but also on the platforms that host the services being accessed. The basic OGSA security model must address the following security disciplines:
(1) Authentication. Provide plug points for multiple authentication mechanisms and the means for conveying the specific mechanism used in any given authentication operation. The authentication mechanism may be a custom authentication mechanism or an industry-standard technology. The authentication plug point must be agnostic to any specific authentication technology.
(2) Delegation. Provide facilities to allow for delegation of access rights from requestors to services, as well as to allow for delegation policies to be specified. When dealing with delegation of authority from one entity to another, care should be taken so that the authority transferred through delegation is scoped only to the tasks intended to be performed and within a limited lifetime, to minimize the misuse of delegated authority.
(3) Single Logon. Relieve an entity that has successfully completed the act of authentication once from the need to participate in re-authentications upon subsequent accesses to OGSA-managed resources for some reasonable period of time. This must take into account that a request may span security domains and hence should factor in federation between authentication domains and mapping of identities. This requirement is important from two perspectives:
a) It places a secondary requirement on an OGSA-compliant implementation to be able to delegate an entity’s rights, subject to policy
b) If the credential material is delegated to intermediaries, it may be augmented to indicate the identity of the intermediaries, subject to policy.
(4) Credential Lifespan and Renewal. In many scenarios, a job initiated by a user may take longer than the life span of the user's initially delegated credential. In those cases, the user needs the ability to be notified prior to expiration of the credentials, or the ability to refresh those credentials so that the job can be completed.
(5) Authorization. Allow for controlling access to OGSA services based on authorization policies attached to each service. Also allow service requestors to specify invocation policies. Authorization should accommodate various access control models and implementations.
(6) Privacy. Allow both a service requestor and a service provider to define and enforce privacy policies, for instance taking into account things like personally identifiable information (PII), purpose of invocation, etc. (Privacy policies may be treated as an aspect of authorization policy addressing privacy semantics such as information usage rather than plain information access.)
(7) Confidentiality. Protect the confidentiality of the underlying communication (transport) mechanism, and the confidentiality of the messages or documents that flow over the transport mechanism in an OGSA-compliant infrastructure. The confidentiality requirement includes point-to-point transport as well as store-and-forward mechanisms.
(8) Message integrity. Ensure that unauthorized changes made to messages or documents can be detected by the recipient. The use of message- or document-level integrity checking is determined by policy, which is tied to the offered quality of service (QoS).
(9) Policy exchange. Allow service requestors and providers to exchange security (among other) policy information dynamically to establish a negotiated security context between them. Such policy information can contain authentication requirements, supported functionality, constraints, privacy rules, etc.
(10) Secure logging. Provide all services, including security services themselves, with facilities for time-stamping and securely logging any kind of operational information or event over time. Securely here means reliably and accurately, i.e. such that the collection can be neither interrupted nor altered by adverse agents. Secure logging is the foundation for addressing requirements for notarization, non-repudiation, and auditing.
(11) Assurance. Provide means to qualify the security assurance level that can be expected of a hosting environment. This can be used to express the protection characteristics of the environment, such as virus protection, firewall usage for Internet access, internal VPN usage, etc. Such information can be taken into account when deciding which environment to deploy a service in.
(12) Manageability. Explicitly recognize the need for manageability of security functionality within the OGSA security model: for example, identity management, policy management, key management, and so forth. The need for security management also includes higher-level requirements such as anti-virus protection and intrusion detection and protection, which are requirements in their own right but are typically provided as part of security management.
(13) Firewall traversal. A major barrier to dynamic, cross-domain Grid computing today is the existence of firewalls. As noted above, firewalls provide limited value within a dynamic Grid environment. However, it is also the case that firewalls are unlikely to disappear anytime soon. Thus, the OGSA security model must take them into account and provide mechanisms for cleanly traversing them, without compromising local control of firewall policy.
(14) Securing the OGSA infrastructure. The core Grid service specification (OGSI) presumes a set of basic infrastructure services, such as handleMap, registry, and factory services. The OGSA security model must address the security of these components. In addition, securing the lower-level components that OGSI relies on would enhance the security of the OGSI environment.

2.3 GT2 Grid Security Model

The Globus Toolkit version 2 (GT2) includes services for Grid Resource Allocation and Management (GRAM), Monitoring and Discovery (MDS), and data movement (GridFTP). These services use a common Grid Security Infrastructure (GSI) to provide security functionality [2]. GSI defines a
common credential format based on X.509 identity certificates and a common protocol based on transport layer security. Each GSI certificate is issued by a trusted party known as a certificate authority (CA), usually run by a large organization or commercial company. In order to trust the X.509 certificate presented by an entity, one must trust the CA that issued the certificate. A single entity in an organization can decide to trust any CA, without necessarily involving the organization as a whole. This feature is key to the establishment of VOs that involve only some portion of an organization, where the organization as a whole may provide little or no support for the VO. The Community Authorization Service (CAS) allows VOs to express policy, and it allows resources to apply policy that is a subset of VO and local policy. This process comprises three steps: (1) The user authenticates to CAS and receives assertions from CAS expressing the VO's policy in terms of how that user may use VO resources. (2) The user then presents the assertion to a VO resource along with the usage request. (3) In evaluating whether to allow the request, the resource checks both local policy and the VO policy expressed in the CAS assertion.
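The three-step CAS flow just described can be sketched as follows. This is a minimal illustration under our own assumptions; the dictionary-based policy structures and function names are ours, not the real CAS/GSI API.

```python
# Minimal sketch of the CAS authorization flow described above.
# All names and data structures are illustrative, not the actual CAS/GSI API.

def cas_issue_assertion(vo_policy, user):
    """Step 1: the user authenticates to CAS and receives the VO's policy for them."""
    rights = vo_policy.get(user, set())
    return {"user": user, "rights": rights}

def resource_check(local_policy, assertion, user, action):
    """Step 3: the resource checks both local policy and the CAS assertion."""
    allowed_locally = action in local_policy.get(user, set())
    allowed_by_vo = assertion["user"] == user and action in assertion["rights"]
    return allowed_locally and allowed_by_vo

vo_policy = {"alice": {"read", "write"}}   # what the VO allows alice to do
local_policy = {"alice": {"read"}}         # what the resource's own policy allows

assertion = cas_issue_assertion(vo_policy, "alice")               # step 1
# step 2: alice presents the assertion together with her request
print(resource_check(local_policy, assertion, "alice", "read"))   # True: both policies allow
print(resource_check(local_policy, assertion, "alice", "write"))  # False: local policy denies
```

Note how the effective right is the intersection of VO policy and local policy, which is exactly the "subset" property stated above.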
2.4 GT3 Grid Security Model

Version 3 of the Globus Toolkit (GT3) and its accompanying Grid Security Infrastructure (GSI3) provide the first implementation of OGSA mechanisms. GT3's security model seeks to allow applications and users to operate on the Grid in as seamless and automated a manner as possible. Security mechanisms should not have to be instantiated in an application but instead should be supplied by the surrounding Grid infrastructure, allowing the infrastructure to adapt on behalf of the application to meet the application's requirements. The application should need to deal with only application-specific policy. GT3 uses the following powerful features of OGSA and Web services security to work toward this goal:
(1) Casts security functionality as OGSA services, so that it can be located and used as needed by applications.
(2) Uses sophisticated hosting environments to handle security for applications and allow security to adapt without having to change the application.
(3) Publishes service security policy so that clients can discover dynamically what credentials and mechanisms are needed to establish trust with the service.
(4) Specifies standards for the exchange of security tokens to allow for interoperability.
In order to establish trust, two entities need to be able to find a common set of security mechanisms that both understand. The use of hosting environments and security services, as described previously, enables OGSA applications and services to adapt dynamically and use different security mechanisms. The published policy in OGSA can express requirements for mechanisms, acceptable trust roots, token formats, and other security parameters. An application wishing to interact with the service can examine this published policy and gather the needed credentials and functionality by contacting appropriate OGSA security services.
The security of a request can be described in the following steps. First, the client's hosting environment retrieves and inspects the security policy of the target service to determine what mechanisms and credentials are required to submit a request. Second, if the client's hosting environment determines that the needed credentials are not already present, it contacts a credential conversion service to convert existing credentials to the needed format, mechanism, and/or trust root. Third, the client's hosting environment uses a token processing and validation service to handle the formatting and processing of authentication tokens for exchange with the target service. This service relieves the application and its hosting environment from having to understand the details of any particular mechanism. Fourth, on the server side, the hosting environment likewise uses a token processing service to process the authentication tokens presented by the client. Finally, after authentication and the determination of client identity and attributes, the target service's hosting environment presents the details of the request and client information to an authorization service for a policy decision.
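These five steps can be sketched as a single client-to-server pipeline. Every service below is a stand-in stub passed as a parameter; the names and signatures are our own assumptions, not the GT3 API.

```python
# Illustrative sketch of the five-step GT3 request flow described above.
# All services are injected stubs; nothing here is the real GT3 interface.

def secure_request(client_creds, target_policy, convert, make_token,
                   validate, authorize, request):
    # 1. inspect the target service's published security policy
    needed = target_policy["credential_format"]
    # 2. convert credentials if they are not already in the needed format
    creds = client_creds if client_creds["format"] == needed else convert(client_creds, needed)
    # 3. client side: format an authentication token for the exchange
    token = make_token(creds)
    # 4. server side: process and validate the presented token
    identity = validate(token)
    # 5. present request plus client identity to the authorization service
    return authorize(identity, request)

policy = {"credential_format": "x509"}
creds = {"format": "kerberos", "subject": "bob"}   # needs conversion in step 2
result = secure_request(
    creds, policy,
    convert=lambda c, fmt: {"format": fmt, "subject": c["subject"]},
    make_token=lambda c: {"fmt": c["format"], "subject": c["subject"]},
    validate=lambda t: t["subject"],
    authorize=lambda who, req: (who == "bob", req),
    request="read dataset",
)
print(result)  # (True, 'read dataset')
```

The point of the sketch is the division of labor: the application supplies only the request, while credential conversion and token handling are delegated to surrounding services, as the text argues.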
3 Domain-based security architecture for grid computing system
Based on the five-layered security architecture, and considering the design and implementation of Grid security projects, we propose a domain-based trustworthy resource management architecture for Grid computing systems. The security architecture, which we briefly presented at GCC2002, is shown in Fig. 2.
The domain-based security architecture assumes that each group of VOs is protected by special security VOs that trust each other. These VOs are responsible for authorizing access to services/resources within that group. All delegations are stored by the security agent, which has the ability to reason about them. A requester can execute a right or access a resource by providing its identity and/or authorization information to the security VO. The security VO checks this information for validity and reads its policies to verify that the requester has the right. If the requesting VO does not have the right, the security VO returns an error message; otherwise, it forwards the request to the VO in charge of the resource (the access VO), along with a message saying that the request is authorized by the security agent. As the security VO is trusted by every other VO in the system, the requesting VO is granted access. If the access VO has the computing power to reason about certificates, rights and delegations, the request can be sent directly to it, instead of via the security VO.
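The mediation path above (requester, security VO, access VO) can be sketched as follows; the class and policy encoding are our own simplification, not part of the architecture's specification.

```python
# Sketch of the security-VO mediation described above. The policy is a
# simple map from requester to the set of resources it may access.

class SecurityVO:
    def __init__(self, policy):
        self.policy = policy  # requester -> set of authorized resources

    def handle(self, requester, resource, access_vo):
        # validate the requester's authorization information against policy
        if resource not in self.policy.get(requester, set()):
            return "error: not authorized"
        # forward to the VO in charge of the resource with an approval note
        return access_vo(requester, resource, approved=True)

def access_vo(requester, resource, approved):
    # the access VO trusts the security VO's approval and grants access
    return f"granted {requester} -> {resource}" if approved else "denied"

sec = SecurityVO({"vo_a": {"db5"}})
print(sec.handle("vo_a", "db5", access_vo))   # granted vo_a -> db5
print(sec.handle("vo_b", "db5", access_vo))   # error: not authorized
```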
Fig.2 Security architecture for grid computing system
3.1 Definitions of Elements of the Security Architecture

(1) An Object is a resource or process of the Grid Computing System. An Object is protected by security policy. A resource may be a file, memory, a CPU, equipment, etc. A process may be a process running on behalf of a user, a process running on behalf of a resource, etc. "O" denotes Object.
(2) A Subject is a user, resource or process of the Grid Computing System. A Subject may destroy an Object. A resource may be a file, memory, a CPU, equipment, etc. A process may be a process running on behalf of a user, a process running on behalf of a resource, etc. "S" denotes Subject.
(3) A Trust Domain is a logical, administrative region of the Grid Computing System. A Trust Domain has a clear border. "D" denotes Trust Domain.
(4) Representation of Object: There are two kinds of Object in the Grid Computing System: Global Objects and Local Objects. A Global Object is the abstraction of one or many Local Objects. Global Objects and Local Objects exist in the Grid Computing System at the same time.
(5) Representation of Subject: There are two kinds of Subject in the Grid Computing System: Global Subjects and Local Subjects. A Global Subject is the abstraction of one or many Local Subjects. Global Subjects and Local Subjects exist in the Grid Computing System at the same time.
(Fig. 2 residue: the security architecture of the Grid computing system comprises five layers: the Grid security application layer, the Grid security protocol layer, the Grid security basic layer, the system and network security tech layer, and the node and interconnection layer.)
(6) Representation of Trust Domain: There are two kinds of Trust Domain in the Grid Computing System: the Global Trust Domain DG and Local Trust Domains DL. The Global Trust Domain is the abstraction of all Local Trust Domains. The Global Trust Domain and Local Trust Domains exist in the Grid Computing System at the same time. A Trust Domain of the Grid Computing System consists of three elements: the Objects existing in this Trust Domain, the Subjects existing in this Trust Domain, and the Security Policy which protects Objects against Subjects. A Trust Domain can be denoted by D = ({O}, {S}, P), where D denotes the Trust Domain, {O} denotes the set of all Objects existing in this Trust Domain, {S} denotes the set of all Subjects existing in this Trust Domain, and P denotes the Security Policy of this Trust Domain. The Global Trust Domain can be denoted by DG = ({OG}, {SG}, PG), and a Local Trust Domain can be denoted by Di = ({Oi}, {Si}, Pi), i = 1, 2, 3, …

Let us assume that there are two domains, DOM1 and DOM2, that are collaborating on a certain project. If Bob, an administrator at DOM1, wants to access the database of the client, DOM2, and if Bob has the permission to do so, he sends a Request for Action to his own security agent. The security agent returns an authorization certificate, which Bob uses to access the database. We also assume that Bob has the permission to access the database and that this permission can be delegated. Bob wants all users to access the database as well, and so he sends a certificate containing a delegate statement to the security VO. The architecture of domain-based security for the Grid Computing System is shown in Fig. 3.
Fig.3 Architecture of domain-based security
3.2 Policy and Implementation of the domain-based security architecture
The Grid Computing System is abstracted into elements such as Objects, Subjects, Security Policies, Trust Domains, Operations, Authorizations, etc. The Grid Computing System is composed of four parts: the Global Trust Domain, the Local Trust Domains, Operations and Authorizations [5,6]. It can be denoted by

G = (DG, {Di}, {Oj}, {Ak}), i = 1, 2, 3, … j = 1, 2, 3, … k = 1, 2, 3, …

where G denotes the Grid Computing System, DG denotes the Global Trust Domain, {Di} denotes the set of all Local Trust Domains, {Oj} denotes the set of all Operations, and {Ak} denotes the set of all Authorizations. The security of the Grid Computing System can be regarded as the relationship among these basic elements. That is to say, "users access and use resources" can be abstracted as "Subjects operate on Objects", which can be denoted by S —OP→ O. By checking the relationship of Subject, Object and Security Policy, we can examine whether a Subject can operate on an Object, and hence whether a user can access a resource.
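The formal model above can be encoded directly: a domain D = ({O}, {S}, P), and the check "Subject operates Object" succeeds only when the domain's policy P permits that (subject, operation, object) triple. The Python encoding below is our own illustration of the definitions, not part of the original formalism.

```python
# Our encoding of the Trust Domain model D = ({O}, {S}, P) defined above.
from dataclasses import dataclass, field

@dataclass
class TrustDomain:
    objects: set                          # {O}: protected Objects
    subjects: set                         # {S}: Subjects in the domain
    policy: set = field(default_factory=set)  # P: allowed (subject, op, obj) triples

    def permits(self, subject, op, obj):
        """Check S --OP--> O against this domain's Security Policy."""
        return (subject in self.subjects
                and obj in self.objects
                and (subject, op, obj) in self.policy)

d1 = TrustDomain(objects={"file1"}, subjects={"alice"},
                 policy={("alice", "read", "file1")})
print(d1.permits("alice", "read", "file1"))   # True
print(d1.permits("alice", "write", "file1"))  # False: policy has no such triple
```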
This policy consists of authorization and delegation policies. Authorization policies deal with the rules for checking the validity of requests for actions. An example of a rule for authorization would be checking the identity certificate of an agent and verifying that the agent has an axiomatic right. Delegation policies describe rules for the delegation of rights. A rule for delegation would be checking that an agent has the ability to delegate before allowing the delegation to be approved. A policy also contains basic or axiomatic rights, and rights associated with roles. We introduce the concept of primitive or axiomatic rights, which are rights that all individuals possess and that are stored in the global policy. For example, there are basic rights that are not often expressed, but used implicitly. A policy can be viewed as a set of rules for a particular domain that defines what permissions a user has and what permissions she/he can obtain.
The domain-based security policy operates as follows.
(1) SA-DOM2 loads the domain policy for DOM2 and loads a global shared policy.
(2) SA-DOM1 loads the domain policy for DOM1 and loads a global shared policy.
(Fig. 3 residue: security model components include intrusion detection; anti-virus management; policy management; user management; key management; transport, protocol and message security; policy expression and exchange; end-point policy; mapping rules; authorization policy; privacy policy; secure conversations; access control enforcement; and the trust model.)
(3) SA-DOM2 sends a message to SA-DOM1 saying that SA-DOM1 has the right to delegate access to db5, which is a database in DOM2, to all users:
a) tell(sa-DOM2, sa-DOM1, idelegate(StartTime, EndTime, sa-Dom2, sa-Dom1, canDo(X, accessDB(db5), user(X, DOM1)), true, true).
b) SA-DOM1 asserts the proposition: delegate(IssueTime, StartTime, EndTime, sa-Dom2, sa-Dom1, canDo(X, accessDB(db5), user(X, DOM1)), true, true).
c) SA-DOM1 gives all administrators the right to access db5, but not the ability to delegate: tell(sa-Dom1, sa-Dom1, idelegate(StartTime, EndTime, sa-Dom1, X, canDo(X, accessDB(db5), true), role(X, administrator), false).
d) A delegate statement to be inserted into the knowledge base: delegate(IssueTime, StartTime, EndTime, sa-Dom1, X, canDo(Z, accessDB(db5), true), role(X, administrator), false)
(4) Bob requires some information from the database db5 at DOM2. He sends a request to SA-DOM1 along with his certificate: request(Bob, accessDB(db5)).
(5) SA-DOM1 knows that the request is from Bob because of his certificate. It then checks the rules to see if Bob, as an administrator, has access to db5. As this is true, SA-DOM1 creates an authorization certificate and sends it back to Bob.
(6) Bob sends a request to SA-DOM2 with his ID certificate and the authorization certificate: request(Bob, accessDB(db5)).
(7) SA-DOM2 verifies both the certificates and checks its policy. SA-DOM2 approves the access and the request is sent to the agent controlling access to the database.
(8) If Eric, a user, tries to access the database db5, his request will fail because SA-DOM1 has given the right only to administrators.
If all these steps complete successfully, the target hosting environment then presents the authorized request to the target service application. The application, knowing that the hosting environment has already taken care of security, can focus on application-specific request processing steps.
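The delegate/canDo statements in steps (1)-(8) can be mimicked with a toy knowledge base. The predicate encoding below is our own simplification of the certificate syntax above, intended only to show why Bob's request succeeds and Eric's fails.

```python
# Toy knowledge base mirroring the delegate/canDo statements in steps (1)-(8).
# The encoding (role, action, can_redelegate) is our simplification.

class KnowledgeBase:
    def __init__(self):
        self.delegations = []  # list of (grantee_role, action, can_redelegate)

    def assert_delegation(self, role, action, can_redelegate):
        self.delegations.append((role, action, can_redelegate))

    def can_do(self, user, roles, action):
        # a user may perform the action if any of their roles was granted it
        return any(r in roles.get(user, set()) and a == action
                   for r, a, _ in self.delegations)

kb = KnowledgeBase()
# step (3c): administrators get accessDB(db5), without the ability to delegate
kb.assert_delegation("administrator", "accessDB(db5)", False)

roles = {"Bob": {"administrator"}, "Eric": {"user"}}
print(kb.can_do("Bob", roles, "accessDB(db5)"))   # True: step (5) succeeds
print(kb.can_do("Eric", roles, "accessDB(db5)"))  # False: step (8) fails
```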
4 Conclusions

Grid computing presents a number of security challenges that are met by the Globus Toolkit's Grid Security Infrastructure (GSI). Version 3 of the Globus Toolkit (GT3) implements the emerging Open Grid Services Architecture; its GSI implementation (GSI3) takes advantage of this evolution to improve on the security model used in earlier versions of the toolkit. GSI3 remains compatible (in terms of credential formats) with GT2, while eliminating privileged network services and making other improvements. Its development provides a basis for a variety of future work. In particular, we have proposed a domain-based trustworthy resource management architecture for Grid computing systems, which we believe will be very useful in such systems.
5 References

[1] I. Foster, C. Kesselman, G. Tsudik, S. Tuecke. A Security Architecture for Computational Grids. Proc. 5th ACM Conference on Computer and Communications Security, pp. 83-92, 1998.
[2] I. Foster and C. Kesselman. Globus: A Toolkit-Based Grid Architecture. In I. Foster and C. Kesselman, eds., The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 1999, pp. 259-278.
[3] OGSA-SEC-WG Draft, "Security Architecture for Open Grid Services", https://forge.gridforum.org, 2007.
[4] M. Blaze, J. Feigenbaum, J. Lacy. "Decentralized Trust Management", Proc. 17th IEEE Symposium on Security and Privacy, 1996.
[5] C. M. Ellison et al., "SPKI Certificate Theory", RFC 2693, Internet Society, 1999.
[6] W. Johnston, C. Larsen. "A use-condition centered approach to authenticated global capabilities: security architectures for large-scale distributed collaboratory environments". Available at: http://www-itg.lbl.gov/Security/Arch/publications.html
[7] A. Herzberg, Y. Mass, J. Michaeli, D. Naor, Y. Ravid. "Access control meets public key infrastructure, or: assigning roles to strangers". Available at: http://www.hrl.il.ibm.com/TrustEstablishment/paper.htm
[8] IBM, Microsoft, RSA Security and VeriSign. Web Services Trust Language (WS-Trust), 2002.
SESSION
GRID UTILITIES, SYSTEMS, TOOLS, AND ARCHITECTURES
Chair(s)
TBA
On Architecture of the Economic-aware Data Grid
Thuy T. Nguyen1,2, Thanh D. Do1, Tung T. Doan2, Tuan D. Nguyen2, Trong Q. Duong2, Quan M. Dang3
1Department of Information Systems, Hanoi University of Technology, Hanoi, Vietnam 2High Performance Computing Center, Hanoi University of Technology, Hanoi, Vietnam
3School of Information Technology, International University, Germany {thuynt-fit, thanhdd-fit, tungdt-hpc}@mail.hut.edu.vn
Abstract – Data Grid has been adopted as the next-generation platform by many scientific communities to share, access, transport, process, and manage large data collections distributed worldwide. The increasing popularity of Data Grid as a solution for large-dataset issues as well as large-scale processing problems promises its adoption by many organizations that currently have a great demand for it. This requires applying an innovative business model to the conventional Data Grid. In this paper, we propose a business model based on an outsourcing approach and a framework, called the Economic-aware Data Grid, that takes responsibility for coordinating operations. The framework works in an economic-aware way to minimize the cost of the organization.
Keywords: Data Grid, economic-aware, business model, grid computing, outsourcing.
1 Introduction
Inspired by the need to share, access, transport, process and manage large data collections distributed worldwide between universities, laboratories and High Performance Computing Centers (HPCCs), Data Grid has appeared and evolved as the next-generation distributed storage platform for many scientific communities [1],[2] as well as industrial companies. Many enterprises show great demand to handle business operations that deal with distributed data sets, such as a large, multinational financial services group with a range of activities covering banking, brokerage, and insurance. The amount of data that has to be retained and manipulated has been growing rapidly. Those organizations have to deal with the problems of organizing massive volumes of data and running data mining applications under certain conditions, such as a limited budget and resources.
Among related existing technologies, the scientific Data Grid seems to be the most suitable one to solve the two problems above. Its model usually includes many HPCCs with enormous storage capacity and computing power. However, it has some drawbacks. First, building a Data Grid requires a lot of money. The financial sources for building such a Data Grid are governments or scientific funding foundations.
Hence, researchers within the Data Grid can use those resources freely. Second, the resources might be used unfairly or inefficiently. Because of these major drawbacks, only a few business applications are built around these Grid solutions.
In this paper, we integrate the Software as a Service (SaaS) [3] business model into the Data Grid to overcome the disadvantages of the scientific Data Grid. In particular, we propose a business model which consists of three main participants: the resource provider, the software provider and the organization deploying the economic-enhanced Data Grid for its business operations. The Data Grid will work in an economic-aware way by complementing the necessary scenarios and components.
The rest of this paper is organized as follows. In Section 2, we explain in detail our motivation for the economic-aware Data Grid. The proposed business model is presented in Section 3. In Section 4, we describe the high-level system design. We discuss related work in Section 5. After discussing some problems that need to be considered in our framework in Section 6, we present future work and conclude the paper.
2 Motivation
The model of the scientific Data Grid does not fully meet the requirements of enterprises. Thus, an economic-enhanced Data Grid is a clear choice over other possibilities. It is worth describing here, in more detail, why enterprises should make this choice. An economic-aware Data Grid can save a lot of money in resource investment. It not only provides the capability of using resources efficiently but also ensures fairness among participants.
Consider the case of an investment bank: it has many geographically distributed branches, each of which has its own business. Each branch usually runs data mining applications over the set of collected financial data. Over time, this activity becomes more important and needs to be extended. This leads to two challenges. First, the computing tasks need more computing power and storage capacity. Second, the data source is not just the branch itself but also other branches. Because all the branches belong to the investment bank, the data can be shared among branches with a suitable authorization policy. Thus, the bank needs a technology to share large data sets effectively. To deal with both problems, it is necessary to have a Data Grid in the investment bank. We can see many similar scenarios in the real world.
To build such a Data Grid, one solution is to apply the approach of the scientific Data Grid, in which each branch invests to build its own computer center. Those computer centers are then connected together to form a Data Grid. Users in each branch can use the Grid freely. However, this approach has many disadvantages. First, it costs a lot of money for hardware and software investment. The branches also need money to operate and maintain the computer centers, such as money for electric power, human resources and so on. The initial investment and maintenance cost could take most of the budget of a branch. Second, the resource utilization is not efficient. Usually, data mining applications are executed when all the financial data are collected. This happens at the end of a month, quarter or year. In those periods, all computing resources are employed and the workload is always 100%. In normal periods, the workload is much lower. Thus, many computers run wastefully. Finally, there may be unfair resource usage on the Grid. Because individual departments within an investment bank are very competitive, the notion of freely sharing their resources is very difficult to accept. Some branches may contribute few resources to the Grid but use a lot.
Another approach is outsourcing. This means each branch does not invest to build a computer center itself but hires resources from resource providers and pays per use. In other words, the investment bank builds a Data Grid over the business Grid. This approach overcomes the disadvantages discussed above. It brings many benefits to the investment bank and its branches, as follows:
• Economy and efficiency. The users can obtain resources whenever they need them. The degree of need is expressed under user control and can be described with high precision for resource use, both now and for future times of interest, such as before deadlines. Thanks to the pay-per-use characteristic, users are sure to get great benefits by saving the large amount of money otherwise invested in their own computing center. Additionally, by hiring the necessary resources at run-time, they avoid the wasteful use of computing resources discussed above.
• Fair sharing. To use resources, users have to pay through the accounting and charging service. Thus, the more resources a branch uses, the more it has to pay.
However, up to now, there has been no business model and technical solution to realize this approach. The work in this paper is the first attempt to solve this issue. In particular, the main contributions of the paper are the proposal of a business model and the design of an economic-aware Data Grid framework. In the next sections, we present our business model and explain how it meets the expectations above.
3 Proposed business model
Figure 1. The business model of the system
3.1 Participants
The business model includes three main participants, as illustrated in Figure 1. They are the resource provider, the software provider, and the organization deploying the Data Grid.
3.1.1 Resource providers

The resource provider provides server, storage, and network resources. The providers already have their own accounting, charging, and billing modules as well as job deployment modules. They offer storage capacity and computing power services. We assume that the price of using resources is published. To ensure quality of service (QoS), the resource providers should have advanced resource reservation mechanisms. The users can be charged for the storage capacity, the number of computing nodes, and the amount of bandwidth they use.
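A pay-per-use charge over the three billable quantities named above (storage, computing nodes, bandwidth) might look like the sketch below; all prices are made-up illustrations, not published provider rates.

```python
# Illustrative pay-per-use charge for the three billable quantities above.
# The unit prices are invented for the example.

def charge(storage_gb, node_hours, bandwidth_gb,
           price_storage=0.25, price_node=1.5, price_bw=0.5):
    """Return the total cost for one billing period."""
    return (storage_gb * price_storage      # storage capacity used
            + node_hours * price_node       # computing-node hours used
            + bandwidth_gb * price_bw)      # bandwidth consumed

print(charge(storage_gb=100, node_hours=20, bandwidth_gb=40))  # 75.0
```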
3.1.2 Software provider
Software providers are business entities that provide software services. In particular, they provide software and its license to ensure that the software can work under the negotiated conditions. The income of the software provider comes from selling software licenses.
3.1.3 Organization deploying the Data Grid
The organization consists of many branches distributed worldwide. Instead of building a computer center for each branch, the organization has a central Information Technology (IT) department. This department runs an economic-aware Data Grid middleware responsible for coordinating internal data sharing among branches.
The Economic-aware Data Grid should belong to the organization deploying the Data Grid for several reasons. First, it saves the organization the cost of using a broker service. Second, it is easier for the organization to apply a cost optimization policy when it has its own control system. Finally, entrusting the data management task to a third party is less trustworthy than managing it in-house. The goal of the economic-aware Data Grid is to manage the Data Grid in a way that minimizes the cost of the organization.
3.2 Working mechanism
To use the Data Grid's services, users in each branch have to join the system. This can be done by requesting and obtaining certificates through browsers from the Grid administrators. Then, users can use the economic-aware Data Grid.
The Economic-aware Data Grid performs two main closely related tasks. The first task is the data management service. It includes data transfer service, replication service, authorization and so on. The second task is the job execution service. It receives the requirement from users, gets the software, locates the data, reserves computing resources, deploys, runs the software and returns the result. The output data must be stored or replicated somewhere on the Data Grid. The contract with software providers and resource providers is realized with Service Level Agreement Negotiation (SLA Negotiation)[4, 5]. Obviously, the pay-per-use model brings advantages of saving money and efficiency as analyzed in Section 2.
Regarding fair sharing, we can look at the scenario in which the user in each branch puts, gets and finds data, and runs jobs on the Data Grid. The branch has to pay the cost of using storage and computing power to the resource providers. It also has to pay the cost of using software to the software providers. Storage service cost includes data transfer in/out cost. Thus, if a user in branch 2 conducts many transfers from the data storage of branch 1, letting branch 1 pay for the transfer cost is unfair. It is therefore necessary to have payment among branches to ensure fair sharing: the more resources a branch uses, the more it has to pay.
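The inter-branch payment argument can be made concrete: the branch that initiates a transfer pays for it, not the branch that hosts the data. The sketch below is our own illustration; the transfer records and the unit price are invented.

```python
# Sketch of inter-branch payment for cross-branch transfers, as argued above:
# the initiating branch pays, not the hosting branch. Prices are invented.
from collections import defaultdict

def allocate_transfer_costs(transfers, price_per_gb=0.5):
    """transfers: list of (initiating_branch, hosting_branch, gigabytes)."""
    owed = defaultdict(float)
    for initiator, _host, gb in transfers:
        owed[initiator] += gb * price_per_gb  # the initiator pays, the host does not
    return dict(owed)

transfers = [("branch2", "branch1", 100),
             ("branch2", "branch1", 60),
             ("branch1", "branch3", 10)]
charges = allocate_transfer_costs(transfers)
print(charges)  # branch2 pays for the 160 GB it pulled from branch1; branch1 only for its own 10 GB
```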
4 High-level system design
Figure 2. High-level system design
In this section, we show a design of the economic-aware Data Grid, which coordinates the operation of our business model. The high-level system design is illustrated in Figure 2. It consists of various services, which are described as follows.
4.1 Data manipulation
Like any other Data Grid, data manipulation is a basic service in our system. It helps users put files onto the Grid as well as find, access and download the files they want. As each branch has separate storage on the Grid, the file should be put to that storage. As illustrated in Figure 3, the system serves the user in the following order: (1) The Grid receives the requirements through the Grid Portal; (2) The Grid Portal invokes the Metadata Catalog Service (MCS) to find the appropriate information depending on the user's request. If the request is 'put', the MCS will return the data storage location (the store service address); if the request is 'find', 'download' or 'delete', the MCS will return the data location. (3) Based on the information provided by the MCS, the Grid Portal invokes services provided by the service providers to handle the request. (4) When the request completes or fails, the Grid Portal notifies the user. If the request completes successfully, the Grid Portal stores the accounting information in the accounting service (5) and stores relevant metadata in the MCS as well as the Replica Location Service (RLS) (6).
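The six-step 'put' scenario can be sketched as a small orchestration. Every service below is a stub standing in for the real Grid Portal, MCS, storage, accounting and RLS services; the method names are our own assumptions.

```python
# Sketch of the six-step 'put' scenario above, with stub services.

def put_file(user, filename, mcs, storage, accounting, rls):
    # (1)-(2) the portal asks the MCS for the branch's storage location
    location = mcs.lookup_storage(user)
    # (3) the portal invokes the provider's store service
    ok = storage.store(location, filename)
    # (4) notify the user; on success record accounting (5) and metadata (6)
    if ok:
        accounting.record(user, "put", filename)
        mcs.register(filename, location)
        rls.register(filename, location)
    return "completed" if ok else "failed"

class Stub:
    """Minimal stand-in for the portal's collaborating services."""
    def __init__(self): self.log = []
    def lookup_storage(self, user): return f"storage-of-{user}"
    def store(self, location, filename): self.log.append((location, filename)); return True
    def record(self, *args): self.log.append(args)
    def register(self, *args): self.log.append(args)

mcs, storage, accounting, rls = Stub(), Stub(), Stub(), Stub()
print(put_file("branch1", "data.csv", mcs, storage, accounting, rls))  # completed
```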
Figure 3. Scenario of putting a file on Grid
4.2 Replication
The replication service is used to reduce access latency, improve data locality, and increase the robustness, scalability and performance of distributed applications. The system should analyze the pattern of previous file requests, and replicate files toward sites that show a correspondingly increased frequency of file access requests [6]. The replication function is performed by the Data Replication Service module.
Figure 4. Scenario of replication on Grid
The operation of the Data Replication Service is shown in Figure 4 and can be described as follows: (1) The Data Replication Service receives a request, reads it and interprets it. (2) The Data Replication Service invokes the scheduling service to find a suitable replication location. The scheduling service discovers candidate resources, matches the user's requirements and the candidate resources in an optimal way, and then returns the selected resources to the Data Replication Service. (3) The Data Replication Service reserves bandwidth with resource providers through an SLA. (4) The Data Replication Service invokes the file transfer service of the determined resource provider to transfer the data. (5) The Data Replication Service invokes the monitoring module to monitor the QoS. (6-7) If the operation completes successfully, the Data Replication Service stores the data information in the MCS and RLS. (8) It also stores the accounting information in the accounting service.
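The eight replication steps can be compressed into a sketch with stub services; the module boundaries follow the text, but the code and method names are our own assumptions.

```python
# Sketch of the eight replication steps above, with one stub standing
# in for all collaborating services (monitoring omitted for brevity).

def replicate(request, scheduler, provider, mcs, rls, accounting):
    site = scheduler.select_site(request)            # (1)-(2) pick a location
    sla = provider.reserve_bandwidth(site)           # (3) SLA for bandwidth
    ok = provider.transfer(request["file"], site)    # (4) move the data
    if ok:                                           # (6)-(8) record the outcome
        mcs.register(request["file"], site)
        rls.register(request["file"], site)
        accounting.record(request["file"], sla)
    return ok

class Stub:
    def select_site(self, request): return "site-A"
    def reserve_bandwidth(self, site): return {"site": site, "mbps": 100}
    def transfer(self, filename, site): return True
    def register(self, *args): pass
    def record(self, *args): pass

s = Stub()
print(replicate({"file": "exp.dat"}, s, s, s, s, s))  # True
```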
4.3 Job Execution
When a user wants to run a job, he provides the software's name, the names of the input/output data, the resource requirements, and a deadline for completing the job. The Job Execution scenario is illustrated in Figure 5 and can be described as follows: (1) The Grid Portal receives the user's request. (2) The Grid Portal invokes the SaaS service. (3) The SaaS invokes the Software Discovery Service to find the location of the software provider. (4) The SaaS invokes the MCS and RLS to find the location of the data file. (5) The SaaS invokes the Scheduling Service to find a suitable resource provider. (6) The SaaS signs SLA contracts to hire software, computing resources and bandwidth from software and resource providers. (7) The SaaS transfers the software and data to the execution site and executes the job. (8) During execution, the monitoring module is invoked to observe the QoS. (9) If an error occurs, the SaaS invokes the Error Recovery module. (10) When the execution finishes, the SaaS moves the output data to the defined places and updates the MCS and RLS. (11) The SaaS also stores accounting information in the accounting service.
Figure 5. Scenario of job execution on Grid
We emphasize that, unlike general SaaS settings, the number of software packages in our system is small, so the Software Discovery module is relatively simple.
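The SaaS orchestration above can be sketched as a single dispatch function. The catalogs and scheduler below are hypothetical stand-ins for the Software Discovery, MCS/RLS and Scheduling services; none of the names come from the actual system.

```python
# Hypothetical sketch of SaaS job execution (steps 3-5, with 6-7 elided).

def execute_job(request, software_catalog, data_catalog, schedule):
    sw_site = software_catalog.get(request["software"])   # (3) software discovery
    data_site = data_catalog.get(request["input"])        # (4) MCS/RLS lookup
    resource = schedule(request["deadline"])              # (5) scheduling service
    if None in (sw_site, data_site, resource):
        return None        # request cannot be satisfied; error path (9)
    # (6) SLA contracts and (7) staging/execution are elided in this sketch
    return {"software_from": sw_site, "data_from": data_site, "runs_on": resource}

plan = execute_job(
    {"software": "solver", "input": "mesh.dat", "deadline": 3600},
    software_catalog={"solver": "swprov-1"},
    data_catalog={"mesh.dat": "store-A"},
    schedule=lambda deadline: "resprov-2" if deadline >= 600 else None,
)
print(plan)
```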
5 Related work
Most current research on economy Grids pays little attention to effectively sharing large internal data sets. Instead, it develops open Grid architectures that allow several providers and consumers to interconnect and trade services, or develops and deploys business models aimed at selling the providers' own products and
services, such as GridASP [7], GRASP [8], GRACE [9], BIG [10] and the EU-funded project GridEcon [11]. These models usually do not involve several providers. For instance, the Sun Utility Grid [12] and Amazon EC2 [13] provide on-demand computing resources, while VPG [14] and WebEx [15] provide certain on-demand applications (music, video on demand, web conferencing, etc.) that do not involve data sharing.
Current large-scale Data Grid projects such as the Biomedical Informatics Research Network (BIRN) [16], the Southern California Earthquake Center (SCEC) [17], and the Real-time Observatories, Applications, and Data management Network (ROADNet) [18] make use of the San Diego Supercomputer Center (SDSC) Storage Resource Broker as the underlying Data Grid technology. These applications require widely distributed access to data by many people in many places. The Data Grid creates virtual collaborative environments that support distributed but coordinated scientific and engineering research. The economic aspects are not considered in those projects.
In [19], a cost model for distributed and replicated data over a wide-area network is presented. The cost factors are network, data server, and application-specific costs. Furthermore, job execution is discussed from the viewpoint of sending the job to the required data (code mobility) or sending the data to a local site and executing the job locally (data mobility). However, in this model the cost is not money but job execution time; under this assumption the system is only pseudo economy-aware. Moreover, the infrastructure works on a best-effort basis: QoS and resource reservation are not considered, so the model does not suit a business environment.
Heiser et al. [20] proposed a commodity market for storage space within the Mungi operating system. The proposed system focuses on the extra accounting system used for backing store management; all accounting of operations on storage objects can be done asynchronously without slowing down those operations. It is based on bank accounts from which rent is collected for the storage occupied by objects. Rent automatically increases as available storage runs low, forcing users to release unneeded storage, and a taxation system prevents excessive build-up of funds in underutilized accounts. However, the system considers only storage resources and its scope is limited to a single organization.
Buyya [21], [22] discussed the possible use of economy in a scientific Data Grid environment. Specifically, a token-exchange approach is proposed to regulate demand for data access from the servers of the Data Grid. For example, a token may correspond to 10 KB of data volume; by default, a user may only access as much data as he has tokens for, giving other users a chance to access data. However, the amount of data accessible per token depends on various parameters such as demand, system load and QoS, and users can trade off QoS against tokens. The negotiation and redistribution of tokens after their expiration, their mapping to real money, and the pricing policies of storage servers are not discussed. Moreover, this work focuses on the resource provider level, whereas we focus on a system built on top of commercial resource providers.
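The token-exchange idea can be illustrated with a small sketch: a token buys a fixed data volume (10 KB in the example above), and under higher load a token may buy less, making the same request cost more tokens. The linear load scaling below is our own assumption for illustration; [21], [22] leave the exact trade-off open.

```python
import math

# Sketch of token-based access regulation; parameters are illustrative.
TOKEN_VOLUME_KB = 10            # data volume one token buys at normal load

def tokens_needed(volume_kb, load_factor=1.0):
    # assumed model: under load, each token buys proportionally less data
    effective_kb_per_token = TOKEN_VOLUME_KB / load_factor
    return math.ceil(volume_kb / effective_kb_per_token)

def request_access(balance, volume_kb, load_factor=1.0):
    """Debit the user's token balance, or refuse if it cannot cover the request."""
    cost = tokens_needed(volume_kb, load_factor)
    if cost > balance:
        return balance, False   # request denied; other users get a chance
    return balance - cost, True

print(tokens_needed(95))        # 95 KB costs 10 tokens at normal load
print(request_access(12, 95))   # balance 12 -> (2, True)
```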
6 Discussion
According to a survey by the EU-funded project GridEcon [11], investment banks have been using Grid computing for at least 3-4 years. Most of them have already linked their heterogeneous Grids and developed limited resource sharing; the next step is to create an internal shared computing platform for utility computing. Further planned evolutions include: (1) following a self-service approach, in which users submit jobs directly to the global Grid without going through the IT department; (2) applying the SaaS model and open source to reduce cost, especially software-licensing costs; and (3) adding SLA monitoring, policy management, and charge-back across heterogeneous resources.
Our solution combines the last two. This brings great advantages (Section 2), but the biggest drawbacks are the dependence of data-intensive applications on network performance and the limitation of the user's local access to the software. In the near future, these drawbacks should fade with the rapid development of high-speed Internet access. Our work focuses on one specific part of this promising trend: we integrate the SaaS model with the Data Grid rather than with the general Grid. Further research includes two issues: what is the scheduling mechanism for single jobs and workflow jobs, and what is the hiring strategy for storage resources?
The scheduling mechanism is one of the most important problems in any distributed system. In scientific Grids, a scheduling component (such as Legion [23], Condor [24], AppLeS [25, 26], NetSolve [27] or PUNCH [28]) decides which jobs are executed at which site based on certain cost functions. In the economic approach, the scheduling decision is made flexibly according to end users' requirements. Whereas a conventional model often deals with the software and hardware costs of running applications, the business model primarily charges end users for the services they consume, based on the value they derive from them. Pricing based on users' demand and the supply of resources is the main driver in the competitive, economic market model.
The second issue is the hiring strategy for storage resources and software. Our system is based on a business Grid and works in an economy-aware way; we therefore address the benefit trade-off between consumers and providers: consumers must pay for usage, and providers earn money for their storage and software resources. Nevertheless, how the
consumers hire the storage resources and software depends on the organization deploying the system. They could hire on demand, paying only when they use the resources; in many cases, however, users want to hire a resource long before using it. Providers could charge for storage usage, for data transfer, or for the quality of storage and QoS provided. The hiring strategy deserves careful attention in any economy-aware system.
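The two hiring modes just discussed can be compared with a toy cost model. The rates and the 30% reservation discount below are invented purely for illustration; real pricing would be set by each provider.

```python
# Toy comparison of on-demand hiring versus advance reservation.
RATE_PER_GB_HOUR = 0.10   # hypothetical storage price

def on_demand_cost(gb, hours_used):
    # pay only for the hours the storage is actually used
    return gb * hours_used * RATE_PER_GB_HOUR

def reserved_cost(gb, hours_reserved, discount=0.30):
    # pay for the whole reserved period, used or not, at a discounted rate
    return gb * hours_reserved * RATE_PER_GB_HOUR * (1 - discount)

# A consumer using 100 GB for 600 of the 720 hours in a month:
print(on_demand_cost(100, 600))   # 6000.0
print(reserved_cost(100, 720))    # 5040.0 -> reservation is cheaper here
```

The break-even point depends on the utilization ratio and the discount, which is exactly the kind of organization-specific trade-off the text describes.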
7 Conclusion and future work
In this paper, we identified a set of problems that current research has not yet solved completely, and proposed the Economic-aware Data Grid as an appropriate solution. Our solution is based on two ideas: first, integrating SaaS into the Data Grid; second, basing the system on a business Grid and operating it in an economy-aware way. We presented the high-level design and three basic operating scenarios of our system to demonstrate how it tackles these problems. The proposed system has many advantages: it brings economic benefits to business organizations, gives providers a chance to earn money from their resources and software, and reduces the complexity of resource management compared with scientific Data Grids. We strongly believe that the Economic-aware Data Grid in particular, and the economy-aware Grid in general, will play an increasingly important role in the development of Grid computing in the near future.
We have specified all components at the class level, including their working mechanisms, interfaces with other components and input/output parameters, and have designed UML diagrams and the portal interface. As next steps, we intend to build a prototype of the system in Java. We have built a test bed of 12 computers at our HPC Center, and we plan to deploy the prototype and run experiments to verify our approach.
Acknowledgements
This research is conducted at High Performance Computing Center, Hanoi University of Technology, Hanoi, Vietnam. It is supported in part by VN-Grid, SAREC 90 RF2, and KHCB 20.10.06 projects.
References
[1]. Venugopal, S., R. Buyya, and R. Kotagiri, A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing. 2006.
[2]. Allcock, B., et al., Data management and transfer in high-performance computational grid environments, in Parallel Computing. 2002.
[3]. Konary, E.T.a.A., Software-as-a-Service Taxonomy and Research Guide. 2005.
[4]. Quan, D.M. and O. Kao, SLA negotiation protocol for Grid-based workflows. 2005.
[5]. Quan, D.M. and O. Kao, On Architecture for SLA-aware workflow in Grid environments. Journal of Interconnection Networks, World Scientific Publishing Company, 2005.
[6]. Bell, W.H., et al. Evaluation of an Economy-Based File Replication Strategy for a Data Grid. In Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003). 2003. Tokyo, Japan: IEEE CS Press.
[7]. GridASP Website. Available from: http://www.gridasp.org/wiki/.
[8]. GRASP Website. Available from: http://eu-grasp.net/.
[9]. GRACE Website. Available from: http://www.buyya.com/ecogrid/.
[10]. Weishupl, T., et al. Business In the Grid: The BIG Project. In GECON 2005, the 2nd International Workshop on Grid Economics and Business Models. 2005. Seoul.
[11]. GridEcon Website. Available from: http://gridecon.eu/html/deliverables.shtml.
[12]. Sun Grid Website. Available from: http://www.sun.com/service/sungrid/.
[13]. Amazon Web Services. Available from: http://www.amazon.com/.
[14]. Falcon, F., GRID - A Telco perspective: The BT Grid Strategy, in the 2nd International Workshop on Grid Economics and Business Models. 2005: Seoul.
[15]. WebEx Connect: First SaaS Platform to Deliver Mashup Business Applications for Knowledge Workers. 2007. Available from: http://www.webex.com/pr/pr428.html.
[16]. Biomedical Informatics Research Network. Available from: http://www.nbirn.net/.
[17]. Southern California Earthquake Center. Available from: http://www.scec.org/.
[18]. Real-time Observatories, Applications, and Data management Network. Available from: http://roadnet.ucsd.edu/.
[19]. Stockinger, H., et al. Towards a Cost Model for Distributed and Replicated Data Stores. In Proceedings of the 9th Euromicro Workshop on Parallel and Distributed Processing (PDP 2001). 2001. Italy.
[20]. Heiser, G., et al. Resource Management in the Mungi Single-Address-Space Operating System. In Proceedings of the Australasian Computer Science Conference. 1998. Perth, Australia.
[21]. Buyya, R., et al., Economic models for resource management and scheduling in Grid computing. 2002.
[22]. Buyya, R., D. Abramson, and S. Venugopal, The grid economy. 2005.
[23]. Chapin, S., J. Karpovich, and A. Grimshaw, The Legion resource management system, in the 5th Workshop on Job Scheduling Strategies for Parallel Processing. 1999: San Juan, Puerto Rico.
[24]. Litzkow, M., M. Livny, and M. Mutka, Condor - A hunter of idle workstations, in the 8th Int. Conf. on Distributed Computing Systems (ICDCS 1988). 1988: San Jose, CA.
[25]. Berman, F. and R. Wolski, The AppLeS project: A status report, in the 8th NEC Research Symposium. 1997: Berlin, Germany.
[26]. Casanova, H., et al., The AppLeS parameter sweep template: User-level middleware for the grid, in the IEEE Supercomputing Conference (SC2000). 2000: Dallas, TX.
[27]. Casanova, H. and J. Dongarra, NetSolve: A network server for solving computational science problems. Int. J. Supercomputer Applications and High Performance Computing, 1997.
[28]. Kapadia, N. and J. Fortes, PUNCH: An architecture for web-enabled wide-area network-computing. Cluster Computing, 1999.
Grid-enabling complex system applications with QosCosGrid: An architectural perspective
Valentin Kravtsov1, David Carmeli1, Werner Dubitzky2, Krzysztof Kurowski3,4, and Assaf Schuster1
1Technion - Israel Institute of Technology, Technion City, Haifa, Israel
2University of Ulster, Coleraine, Northern Ireland, UK
3Poznan Supercomputing and Networking Center, Poznan, Poland
4University of Queensland, St. Lucia, Brisbane, Australia
Abstract - Grids are becoming mission-critical components in research and industry, offering sophisticated solutions for leveraging large-scale computing and storage resources. Grid resources are usually shared among multiple organizations in an opportunistic manner. However, an opportunistic or "best effort" quality-of-service scheme may be inadequate in situations where a large number of resources need to be allocated, or for applications that rely on static, stable execution environments. The goal of this work is to implement what we refer to as quasi-opportunistic supercomputing. A quasi-opportunistic supercomputer facilitates demanding parallel computing applications on the basis of massive, non-dedicated resources in grid computing environments. Within the EU-supported project QosCosGrid we are developing such a quasi-opportunistic supercomputer. In this work we present the results obtained from studying and identifying the requirements a grid needs to meet in order to facilitate quasi-opportunistic supercomputing. Based on these requirements we have designed an architecture for a quasi-opportunistic supercomputer, which the paper presents and discusses.
Keywords: Grid, Quasi-Opportunistic Supercomputing.
1 Introduction
Supercomputers are dedicated, special-purpose multiprocessor computing systems that provide close-to-best achievable performance for demanding parallel workloads [12]. Supercomputers possess a set of characteristics that enable them to process such workloads efficiently. First, all the high-end hardware components, such as CPUs, memory, interconnects and storage devices, are characterized not only by considerable capacity but also by a high degree of reliability and performance predictability. Second, supercomputer middleware provides a convenient abstraction of a homogeneous computational and networking environment, automatically allocating resources according to the underlying networking topology [2]. Third, the resources of a conventional supercomputer are managed exclusively by a single centralized system. This enforces global resource utilization policies, thus maximizing hardware utilization while minimizing the turnaround time of individual applications. Together, these features give supercomputers their unprecedented performance, stability and dependability characteristics.
The vision of grids becoming powerful virtual supercomputers can be attained only if their performance and reliability limitations can be overcome. Due to the considerable differences between grids and supercomputers, the realization of this vision poses considerable challenges. Some of the main challenges are briefly discussed below.
The co-allocation of a large number of participating CPUs. In conventional supercomputers, where all CPUs are exclusively controlled by a centralized resource management system, the simultaneous allocation (co-allocation) and invocation (co-invocation) of processing units is handled by suitable co-allocation and co-invocation components [10]. In grid systems, however, inherently distributed management coupled with the non-dedicated (opportunistic) nature of the underlying resources makes co-allocation very hard to accomplish. Previous research has focused on co-allocation in grids of supercomputers and dedicated clusters [17], [15], [3]; beyond this, the co-allocation problem has received little attention in the high-performance grid computing community. While co-allocation issues arise in other situations (e.g., co-allocation of processors and memory, co-allocation of CPUs and networks, setup of reservations along network paths), the dynamic, non-dedicated nature of grids presents special challenges [6], such as the potential for failure and the heterogeneous nature of the underlying resource pool.
Synchronous communications. Typically, synchronous communications form a specific communication topology pattern (e.g., stencil exchange in MM5 [15] [13] and local structures in complex systems [7]). Supercomputers satisfy this via special-purpose, low-latency, high-throughput hardware, as well as optimized allocation by the resource management system to ensure that the underlying networking topology matches the application's communication pattern [2]. In grids, however, synchronous communication over a wide area network (WAN) is slow, and topology-aware allocation is typically not available despite existing support in communication libraries [8].
Allocation of resources does not change during runtime. While always true in supercomputers, this requirement is difficult to satisfy in grids, where the low reliability of resources and WANs, as well as the uncoordinated management of different parts of the grid, contribute to extreme fluctuations in the number of available resources.
Fault tolerance. In large-scale synchronous computation, the high sensitivity of individual processes to failures usually leads to termination of the entire parallel run when such failures occur. While rare in supercomputers, thanks to the high specifications of their hardware, system and component failures are very common in grid systems.
We define a quasi-opportunistic supercomputer as a grid system that addresses the challenges mentioned above while still hiding many of the grid-related complexities from applications and users.
In this paper we present some of the early results coming out of the QosCosGrid project (www.QosCosGrid.com), which is aimed at developing a quasi-opportunistic supercomputer. The main contributions of this paper are that we:
• Introduce and motivate the concept of a quasi-opportunistic supercomputer.
• Summarize the main requirements of a quasi-opportunistic supercomputer.
• Present a detailed system architecture designed for the QosCosGrid quasi-opportunistic supercomputer.
2 Requirements of a quasi-opportunistic supercomputer
Many real-world systems involve large numbers of highly interconnected heterogeneous elements. Such structures, known as complex systems, typically exhibit non-linear behavior and emergence [4]. The methodologies used to understand the properties of complex systems involve modeling and simulation, and often require computational resources that only supercomputers can deliver. However, many organizations wanting to model and simulate complex systems lack the resources to deploy or maintain such a computing capability. This was the motivation that prompted the initiation of the QosCosGrid project, whose aim is to develop core grid technology capable of providing quasi-opportunistic supercomputing grid services. Modeling and simulation of complex systems provide a huge range of applications requiring supercomputer or supercomputer-like capabilities. The requirements derived from the analysis of nine diverse complex systems applications are summarized below.
Co-allocation. Complex systems simulations require simultaneous execution of code on very large numbers of CPUs. In this context, co-allocation also means that resources for a certain task are allocated in advance: they must be negotiated in advance and guaranteed to be available when the task's time slot arrives. This implies the need for a sophisticated distributed negotiation protocol supported by advance reservation mechanisms.
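The all-or-nothing character of this requirement can be sketched as follows: a slot is committed only if the full CPU count can be guaranteed across providers, otherwise nothing is reserved. The data structures are illustrative; the actual QosCosGrid negotiation protocol is far more elaborate.

```python
# Sketch of all-or-nothing advance reservation across several providers.

def co_allocate(free_cpus, cpus_needed, slot):
    """free_cpus: {provider: {slot: free CPU count}}. Returns the grant
    list [(provider, cpus)] on success, or None without reserving anything."""
    granted, still_needed = [], cpus_needed
    for provider, slots in free_cpus.items():
        take = min(slots.get(slot, 0), still_needed)
        if take:
            granted.append((provider, take))
            still_needed -= take
        if still_needed == 0:
            for p, n in granted:          # commit: deduct the reserved CPUs
                free_cpus[p][slot] -= n
            return granted
    return None                           # guarantee impossible; abort, no commit

pool = {"cluster-A": {"t1": 64}, "cluster-B": {"t1": 48}}
print(co_allocate(pool, 100, "t1"))   # [('cluster-A', 64), ('cluster-B', 36)]
print(pool["cluster-B"]["t1"])        # 12 CPUs remain free in slot t1
```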
Topology-aware resource management. A complex system simulation is usually composed of multiple agents performing tasks of different complexity. The agents are arranged in a dynamic topology with different patterns of communication. To execute such a simulation, appropriate computational and network resources must be allocated. To perform this task, resource co-allocation algorithms must consider the topological structure of resource requests and offers, and match them appropriately.
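One greedy way to picture such request/offer matching: map each required inter-group communication link onto an offered inter-cluster link with enough spare bandwidth. This sketch is purely illustrative and much simpler than the algorithms a real co-allocator would need.

```python
# Greedy sketch of topology-aware matching of requests to offers.

def topology_match(required_links, offered_links):
    """required_links: {(group_i, group_j): needed bandwidth};
    offered_links: {(cluster_x, cluster_y): available bandwidth}.
    Returns a link assignment, or None if some requirement cannot be met."""
    spare = dict(offered_links)          # work on a copy; offers stay intact
    assignment = {}
    # place the most demanding links first
    for link, bw in sorted(required_links.items(), key=lambda kv: -kv[1]):
        target = next((o for o, cap in spare.items() if cap >= bw), None)
        if target is None:
            return None                  # no offered link can carry this load
        assignment[link] = target
        spare[target] -= bw
    return assignment

offers = {("A", "B"): 100, ("B", "C"): 40}
print(topology_match({("g1", "g2"): 80, ("g2", "g3"): 30}, offers))
```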
Economics-based grid. In "best-effort" grids, local cluster administrators are likely to increase the priorities of local users, possibly disallowing remote jobs completely and thus disassembling the grid back into individual clusters. This is because administrators lack suitable incentives to share resources. We believe that resource co-allocation must be backed by an economic model that motivates resource providers to honor their guarantees to the grid user and forces the user to carefully weigh the cost of resource utilization. This model is also intended to address the "free-rider" problem [1].
Service-level agreements. The economic model should be supported by formal agreements whose fulfillment can later be confirmed. Thus, an expressive language is required to describe such agreements, along with monitoring, accounting and auditing systems that can understand that language.
Cross-domain fault-tolerant MPI and Java RMI communication. The majority of current fault-tolerant MPI and Java RMI implementations provide transparent fault tolerance mechanisms for clusters. However, to provide reliable connections within a grid, a cross-domain, fault-tolerant and grid-middleware-aware communication library is needed.
Distributed checkpoints. In grids, node and network failures will inevitably occur. To ensure that an entire application is not aborted after a single failure, distributed checkpoint and restart protocols must be used to stop and migrate the whole application or part of it.
Scalability, extensibility, ease of use. In order to be widely accepted by the complex systems community, the QosCosGrid system must offer easy interfaces yet still allow extensions and further development. Additionally, for real-world grids, the system must be scalable in terms of computing resources and supported users.
Interoperability. The system must facilitate seamless interoperation and sharing of computing and data resources.
Standardization. To facilitate interoperation and evolution of the QosCosGrid system, the design and implementation should be based on existing and emerging grid and grid-related standards and open technology.
3 QosCosGrid System Architecture
The working hypothesis of the QosCosGrid project is that a quasi-opportunistic supercomputer (as characterized in Section 1) can be built by means of a collaborative grid that facilitates sophisticated resource sharing across multiple administrative domains (ADs). Loosely following Krauter [16], a collaborative grid consists of several organizations participating in a virtual organization (VO) and sharing resources. Each organization contributes its resources for the benefit of the entire VO, while controlling its own administrative domain and its own resource allocation/sharing policies. The organizations agree to connect their resource pools to a trusted "grid-level" middleware that tries to achieve optimal resource utilization. In exchange, partners gain access to very large amounts of computational resources.
The QosCosGrid architecture is depicted in Figure 1. The diagram shows a simplified scenario involving three administrative domains (labeled Administrative Domain 1, 2 and 3). Administrative Domain 3 consists of two resource pools, each of which is connected to an AD-level service, located in the center of the diagram. The AD-level services of all the administrative domains are connected to a trusted, distributed grid-level service. The grid-level services are designed to maximize the global welfare of the users in the entire grid.
Figure 1: QosCosGrid system architecture.
3.1 Grid fabric
The basic physical layer consists of computing and storage nodes connected in the form of a computing cluster. The cluster is managed by a local resource management system (LRMS) – in our case the Platform Load Sharing Facility (LSF) – but this may be replaced by other advanced job scheduling systems such as SGE or PBS Pro. The LSF cluster runs batch or interactive jobs, selecting execution nodes based on current load conditions and the resource requirements of the application. The current load of CPUs, network connections, and other monitoring statistics are collected by the cluster and network monitoring system, which is tightly integrated with the LRMS and external middleware monitoring services. In order to execute cluster-to-cluster parallel applications, the QosCosGrid system supports the advance reservation of computing resources, in addition to basic job execution and monitoring features. Advance reservation is critical to the QosCosGrid system, as it enables the QosCosGrid middleware to deliver resources on demand with significantly improved quality of service.
3.1.1 FedStage Open DRMAA Service Provider and advance reservation APIs
A key component of the QosCosGrid architecture is an LRMS offering job submission, monitoring, and advance reservation features. However, for years LRMSs provided either only proprietary script-based interfaces for application integration, or nothing at all, in which case the command-line interface was used. Consequently, no standard mechanisms existed for programmers to integrate grid middleware services and applications with local resource management systems. Thanks to the Open Grid Forum and its Distributed Resource Management Application API (DRMAA) working group [19], the DRMAA 1.0 specification has recently been released. It offers a standardized API for application integration, with C, Java, and Perl bindings. Today, DRMAA implementations that adopt the latest specification version are available for many local resource management systems, including SGE, Condor, PBS Pro, and Platform LSF, as well as for other systems such as GridWay or XGrid. In QosCosGrid we have successfully used FedStage DRMAA for LSF (http://www.fedstage.com/wiki/) and integrated those APIs with the Open DRMAA Service Provider (OpenDSP). OpenDSP is an open implementation of SOAP Web Service multi-user access and policy-based job control using DRMAA routines implemented by the LRMS. As a lightweight and highly efficient software component, OpenDSP allows easy and fast remote access to computing resources. Moreover, as it is based on standard Web Services technology, it integrates well with higher-level grid middleware services. It uses a request-response communication protocol with standard JSDL XML and SOAP schemas protected by
transport-level security mechanisms such as SSL/TLS or GSI. However, neither DRMAA nor OpenDSP provides the standard advance reservation and resource synchronization APIs required by cross-domain parallel applications. Therefore, in the QosCosGrid project, we have extended DRMAA and proposed standard advance reservation APIs that are suited to the various APIs of the underlying local resource management systems.
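The text states that DRMAA was extended with advance reservation APIs but does not give their signatures. The sketch below shows one hypothetical shape such an extension could take, mimicking only the general DRMAA session style; it is not the actual QosCosGrid/OpenDSP interface.

```python
# Hypothetical advance-reservation extension of a DRMAA-like session.
# Every name here is an assumption for illustration.

class ARSession:
    """Session offering plain job submission plus advance reservation."""
    def __init__(self):
        self._next = 0
        self.reservations = {}     # reservation id -> (start, end, cpus)

    def reserve(self, start, end, cpus):
        """Negotiate CPUs for a future time window; returns a reservation id."""
        rid = f"ar-{self._next}"
        self._next += 1
        self.reservations[rid] = (start, end, cpus)
        return rid

    def run_job(self, command, reservation_id=None):
        """Submit a job, optionally bound to an existing advance reservation."""
        if reservation_id is not None and reservation_id not in self.reservations:
            raise KeyError("unknown reservation")
        return f"job:{command}@{reservation_id or 'best-effort'}"

s = ARSession()
rid = s.reserve(start=1000, end=2000, cpus=128)
print(s.run_job("mpirun solver", rid))   # job:mpirun solver@ar-0
```

The key design point, reflected in the sketch, is that reservation and submission are separate calls: a job can be bound to a previously negotiated slot instead of being scheduled best-effort.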
3.1.2 QosCosGrid parallel cross-domain execution environments
The QosCosGrid Open MPI (QCG-OMPI) is an implementation of the message passing interface that enables users to transparently deploy and use MPI applications in the QosCosGrid testbed, and to take advantage of local interconnect technology (http://www.open-mpi.org/). QCG-OMPI supports all the standard high-speed network technologies that Open MPI supports, including TCP/Ethernet, shared memory, Myrinet/GM, Myrinet/MX, Infiniband/OpenIB, and Infiniband/mVAPI. In addition, it supports inter-cluster communication using relay techniques in a manner transparent to users, and can be integrated with the LSF cluster. QCG-OMPI relies on a checkpointing interface that provides a coordinated checkpointing mechanism on demand. To the best of our knowledge, no other MPI solution provides a fault-tolerant mechanism in a transparent grid deployment environment. Our intention is that the QCG-OMPI implementation will be fully compliant with the MPI 1.2 specification from the MPI Forum (http://www.mpi-forum.org/).
QosCosGrid ProActive (QCG-ProActive) is a Java-based grid middleware for parallel, distributed and multi-threaded computing, integrated with OpenDSP. It is based on ProActive, which provides a comprehensive framework and programming model to simplify the programming and execution of parallel applications. ProActive uses the standard Java RMI library as its default portable communication layer, supporting the following communication protocols: RMI, HTTP, Jini, RMI/SSH, and Ibis [5].
3.2 Administrative domain- and grid-level services
Grid fabric software components, in particular OpenDSP, QCG-OMPI and QCG-ProActive, must be deployed on physical computing resources at each administrative domain and be integrated with the AD-level services. AD-level services, in turn, are connected to the grid-level services in order to share and receive information about the entire grid, as well as for tasks that cannot be performed within a single administrative domain. We distinguish five main high-level types of services:
3.2.1 Grid Resource Management System
The Grid Resource Management System (GRMS) is a grid meta-scheduling framework which allows developers to build and deploy resource management systems for large-scale distributed computing infrastructures at both the administrative domain and grid levels. The core GRMS broker module has been improved in QosCosGrid to provide dynamic resource selection and mapping, along with advance resource reservation mechanisms. As a core service for all resource management processes in the QosCosGrid system, the GRMS supports load balancing among LRMSs, workflow execution, remote job control, file staging and advance resource reservation. At the administrative domain level, the GRMS communicates with OpenDSP services to expose remote access to the underlying computing resources controlled by the LRMS. The administrative domain-level GRMS is synchronized with the grid-level GRMS during the job submission, job scheduling and execution processes.
At the grid level, the GRMS offers much more advanced co-allocation and topology-aware scheduling mechanisms. From the user's perspective, all parallel applications and their requirements (including complex resource topologies) are formally expressed in an XML-based job specification language called the QCG Job Profile. These job requirements, together with resource offers, are provided to the GRMS during its scheduling and decision-making processes.
3.2.2 Accounting and economic services
Accounting services support the needs of users and
organizations with regard to allocated budgets, credit
transactions, auditing, etc. These services are responsible for
(a) Monitoring: Capturing resource usage records across the
administrative domains, according to predefined metrics; (b)
Usage record storage: Aggregation and storage of the usage
records gathered by the monitoring system; (c) Billing:
Assigning a cost to operations and charging the user, taking
into account the quality of service actually received by the
user; (d) Credit transaction: The ability to transfer credits
from one administrative domain to another as means of
payment for received services and resources; (e) VO
management: Definition of user groups, authorized users,
policies and priorities; (f) Accounting: Management of user
groups' credit accounts, tracking budget, economical
obligations, etc.; (g) Budget planning: Cost estimations for
aggregation of resources according to the pricing model; and
(h) Usage analysis: Analysis of the provided quality of
service using information from usage records, and
comparison of this information to the guaranteed quality of
service.
3.2.3 Resource Topology Information Service (RTIS)
The RTIS provides information on the resource
topology and availability. Information is provided by means
of the Resource Topology Graph (RTG) schema, instances
of which depict the properties of the resources and their
interconnections. For a simpler description process, the RTG
does not contain a "point-to-point" representation of the
desired connections but is based instead on the
communication groups concept, which is quite similar to the
MPI communicator definition. The main goals of the RTIS
are to facilitate topology-aware services to discover the grid
resources picture as well as to disclose information about
those resources, on a "need-to-know" basis.
3.2.4 Grid Authorization System
Currently, the most common solution for mutual
authentication and authorization of grid users and services is
the Grid Security Infrastructure (GSI). The GSI is a part of
the Globus Toolkit and provides fundamental security
services needed to support grids [9]. In many GSI-based grid
environments, the user's identity is mapped to a
corresponding local user identity, and further authorization
depends on the internal LRMS mechanisms. The
authorization process is relatively simple and static.
Moreover, it requires that the administrator manually modify
appropriate user mappings in the gridmap file every time a
new user appears or has to be removed. If there are many
users in many administrative domains whose access must be
controlled dynamically, the maintenance and
synchronization of various gridmap files becomes an
important administrative issue. We believe that more
advanced mechanisms for authorization control are required
to support dynamic changes in security policy definition and
enforcement over a large number of middleware services.
Therefore, in QosCosGrid we have adopted the Grid
Authorization Service (GAS), an authorization system
integrated with various grid middleware such as the Globus
Toolkit, GRMS or OpenDSP. GAS offers dynamic fine-
grained access control and enforcement for shared
computing services and resources. Taking advantage of the
strong authentication mechanisms implemented in PKI and
GSI, it provides crucial security mechanisms in the
QosCosGrid system. From the QosCosGrid architecture
perspective, the GAS can also be treated as a trusted single
logical point for defining security policies.
3.2.5 Service-Level Agreement Management System
In order to enforce the rules of the economic system,
we employ a service-level agreement (SLA) protocol [18]. A
SLA defines a dynamically established and managed
relationship between the resource providers and resource
consumers. Both parties are committed to the negotiated
terms. These commitments are backed up by organization-
wide policies, incentives, and penalties to encourage each
party to fulfill its obligations. For each scheduled task, a set
of SLAs is signed by the administrative domain of the task
owner, and by each of the provider administrative domains.
The SLA describes the service time interval, and the
provided QoS – resources, topology, communication, and
mapping of user processes to provider's resources. SLAs are
represented using the RTG model, and are stored in RTIS.
The SLA Compliance Monitor analyzes the provided quality
of service for each time slot, and calculates a weighted
compliance factor for the whole execution. The compliance
factor is used by the pricing service (which is a part of
accounting services) to calculate the costs associated with
the service if it is provided successfully, or the penalties that
arise when a guarantee is violated.
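As a sketch of how such a duration-weighted compliance factor might be computed (our own illustration; the paper does not give the actual QosCosGrid formula, and all names below are invented), each time slot's delivered QoS can be compared against the agreed level and weighted by the slot's length:

```c
#include <stddef.h>

/* Per-slot record: agreed vs. delivered QoS (both normalized to [0,1])
   and the slot's duration in seconds. Names are illustrative only. */
struct slot {
    double agreed;
    double delivered;
    double seconds;
};

/* Duration-weighted compliance factor over the whole execution:
   1.0 means every slot met or exceeded the agreed QoS. */
double compliance_factor(const struct slot *s, size_t n)
{
    double weighted = 0.0, total = 0.0;
    for (size_t i = 0; i < n; i++) {
        double ratio = s[i].delivered / s[i].agreed;
        if (ratio > 1.0)
            ratio = 1.0;          /* over-delivery earns no bonus */
        weighted += ratio * s[i].seconds;
        total    += s[i].seconds;
    }
    return total > 0.0 ? weighted / total : 0.0;
}
```

A pricing service could then scale the agreed cost by this factor, or apply SLA penalties whenever it falls below an agreed threshold.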
4 Related Work
One of the largest European grid projects, Enabling Grids
for E-SciencE (EGEE) [11], has developed a grid system
that facilitates the execution of scientific applications within
a production level grid environment. To the best of our
knowledge, EGEE does not support advance reservation,
checkpoint and restart protocols, and cannot guarantee the
desired level of quality of service for long executions. One
of the major drawbacks of the EGEE system stems from the
presence of large numbers of small misconfigured sites. This
results in considerable delays. To some extent, this problem
is caused by the sheer scale of the system, but also by the
lack of an appropriate incentive for the participating
administrative domain administrators.
The HPC4U [14] project is arguably closest to the
objectives of QosCosGrid. Its goal is to expand the potential
of the grid approach to complex problem solving. This is
envisaged to be done through the development of software
components for dependable and reliable grid environments,
combined with service-level agreements and commodity-
based clusters providing quality of service. The QosCosGrid
project differs from HPC4U mainly in its "grid orientation".
QosCosGrid assumes multi-domain, parallel executions (in
contrast to within-cluster parallel execution in HPC4U) and
applies different MPI and checkpoint/restart protocols that
are grid-oriented and highly scalable.
VIOLA (Vertically Integrated Optical Testbed for Large
Applications in DFN) [20] is a German national project
intended for the execution of large-scale scientific
applications. The project emphasizes the provision of high
quality of service for execution node interconnection.
VIOLA uses UNICORE grid middleware as an
implementation of an operational environment and offers a
newly developed meta-scheduler component, which supports
co-allocation on optical networks. The main goals of the
VIOLA project include testing of advanced network
equipment and architectures, development and test of
software tools for user-driven dynamical provision of
bandwidth, interworking of network equipment from
different manufacturers, and enhancement and test of new
advanced applications. The VIOLA project is mainly
targeted at supporting a new generation of networks, which
provide many advanced features that are not present in the
majority of up-to-date clusters and of course not at the
Internet level.
5 Conclusions
The main objective of the QosCosGrid project is to
address some of the key technical challenges to enable
development and execution of large scale parallel
experiments across multiple administrative, resource and
security domains. In this paper we presented the
main requirements and the key software components that
make up a consistent software architecture. This high-level
architecture perspective is intended to give readers the
opportunity to understand a design concept without the need
to know the many intricate technical details related in
particular to Web services, WSRF technologies, remote
protocol design, and security in distributed systems. The
QosCosGrid architecture is designed to address key quality-
of-service, negotiation, synchronization, advance reservation
and access control issues by providing well-integrated grid
fabric, administrative domain- and grid-level services. In
contrast to existing grid middleware architectures, all
QosCosGrid system APIs have been created on top of
carefully selected third-party services that needed to meet
the following requirements: open standards, open source,
high performance, and security and trust. Moreover, the
QosCosGrid pluggable architecture has been designed to
enable the easy integration of parallel development tools and
supports fault-tolerant cluster-to-cluster message passing
communicators and libraries that are well known in high-
performance computing domains, such as Open MPI,
ProActive and Java RMI. Finally, we believe that without
extra support for administrative tools, it would be difficult to
deploy, control, and maintain such a big system. Therefore,
as a proof of concept, we have developed various client
tools to help administrators connect sites from Europe,
Australia and the USA. After the initial design and
deployment stage, we have begun carrying out many
performance tests of the QosCosGrid system and cross-
domain application experiments. Collected results and
analysis will be taken into account during the next phase of
system re-design and deployment.
Acknowledgments. The work described in this paper was
supported by the EC grant QosCosGrid IST FP6 033883.
6 References
[1] N. Andrade, F. Brasileiro, W. Cirne, and M. Mowbray.
"Discouraging free riding in a peer-to-peer CPU-sharing
grid". In Proceedings of the 13th IEEE International
Symposium on High Performance Distributed Computing
(HPDC '04), 2004.
[2] Y. Aridor, T. Domany, O. Goldshmidt, J.E. Moreira,
and E. Shmueli. "Resource allocation and utilization in the
Blue Gene/L supercomputer". IBM Journal of Research and
Development, 49(2-3):425–436, 2005.
[3] F. Azzedin, M. Maheswaran, and N. Arnason. "A
synchronous co-allocation mechanism for grid computing
systems". Cluster Computing, 7(1):39–49, 2004.
[4] Complex systems. Science, 284(5411), 1999.
[5] J. Cunha and O. Rana. "Grid computing: Software
environments and tools, chapter 9, Programming,
Composing, Deploying for the Grid". Springer Verlag, 2006.
[6] K. Czajkowski, I. Foster, and C. Kesselman. "Resource
co-allocation in computational grids". In Proceedings of the
8th IEEE International Symposium on High Performance
Distributed Computing (HPDC '99), page 37, 1999.
[7] R. Bagrodia et al. "Parsec: A parallel simulation
environment for complex systems". Computer, 31(10):77–
85, 1998.
[8] I. Foster and N.T. Karonis. "A grid-enabled MPI:
Message passing in heterogeneous distributed computing
systems". IEEE/ACM Conference on Supercomputing,
pages 46–46, Nov. 1998.
[9] I. Foster and C. Kesselman. "The grid: Blueprint for a
new computing infrastructure", chapter 2, The Globus
Toolkit. Morgan Kaufmann, 1999.
[10] E. Frachtenberg, F. Petrini, J. Fernandez, S. Pakin, and
S. Coll. "STORM: Lightning-fast resource management". In
Proceedings of the 2002 ACM/IEEE Conference on
Supercomputing, pages 1–26, USA, 2002. IEEE Computer
Society Press.
[11] F. Gagliardi, B. Jones, F. Grey, M.-E. Bégin, and M.
Heikkurinen. "Building an infrastructure for scientific grid
computing: Status and goals of the EGEE project".
Philosophical Transactions A of the Royal Society:
Mathematical, Physical and Engineering Sciences,
363(833):1729–1742, 2005.
[12] S. L. Graham, C. A. Patterson, and M. Snir. "Getting
Up to Speed: The Future of Supercomputing". National
Academies Press, USA, 2005.
[13] G.A. Grell, J. Dudhia, and D. R. Stauffer. "A
Description of the Fifth Generation Penn State/NCAR
Mesoscale Model (MM5)". NCAR, USA, 1995.
[14] M. Hovestadt. "Operation of an SLA-aware grid
fabric". IEEE Trans. Neural Networks, 2(6):550–557, 2006.
[15] Y.S. Kee, K. Yocum, A.A. Chien, and H. Casanova.
"Improving grid resource allocation via integrated selection
and binding". In Proceedings of the 2006 ACM/IEEE
Conference on Supercomputing (SC'06), page 99, New
York, USA, 2006, ACM.
[16] K. Krauter, R. Buyya, and M. Maheswaran. "A
taxonomy and survey of grid resource management systems
for distributed computing". Software – Practice and
Experience, 32:135–164, 2002.
[17] D. Kuo and M. Mckeown. "Advance reservation and
co-allocation protocol for grid computing". In Proceedings
of the 1st Intl. Conf. On e-Science and Grid Computing,
pages 164–171, Washington, DC, USA, 2005. IEEE
Computer Society.
[18] H. Ludwig, A. Keller, A. Dan, and R. King. "A service
level agreement language for dynamic electronic services".
WECWIS, page 25, 2002.
[19] P. Troeger, H. Rajic, A. Haas, and P. Domagalski.
"Standardization of an API for distributed resource
management systems". In Seventh IEEE International
Symposium on Cluster Computing and the Grid, pages 619–
626, 2007.
[20] VIOLA: Vertically Integrated Optical Testbed for
Large Application in DFN, 2005. http://www.viola-testbed.de/.
The Scientific Byte Code Virtual Machine
Rasmus Andersen, University of Copenhagen, eScience Centre, 2100 Copenhagen, Denmark (Email: [email protected])
Brian Vinter, University of Copenhagen, eScience Centre, 2100 Copenhagen, Denmark (Email: [email protected])
Abstract—Virtual machines constitute an appealing technology for Grid Computing and have proved a promising mechanism that greatly simplifies and enforces the employment of grid computer resources.
While existing sandbox technologies to some extent provide secure execution environments for applications deployed in a heterogeneous platform such as the Grid, they suffer from a number of problems, including performance drawbacks and specific hardware requirements.
This project introduces a virtual machine capable of executing platform-independent byte codes specifically designed for scientific applications. Native libraries for the most prevalent application domains mitigate the performance penalty. As such, grid users can view this machine as a basic grid computing element and thereby abstract away the diversity of the underlying real compute elements.
Regarding security, which is of great concern to resource owners, important aspects include stack isolation by using a Harvard memory architecture, and support for neither I/O nor system calls to the host system.
Keywords: Grid Computing, virtual machines, scientific applications.
I. Introduction
Although virtualization was first introduced several decades ago, the concept is now more popular than ever and has revived in a multitude of computer system aspects that benefit from properties such as platform independence and increased security. One of those applications is Grid Computing [5], which seeks to combine and utilize distributed, heterogeneous resources as one big virtual supercomputer. Regarding utilization of the public's computer resources for grid computing, virtualization, in the sense of virtual machines, is a necessity for fully leveraging the true potential of grid computing. Without virtual machines, experience shows that people are, with good reason, reluctant to put their resources on a grid where they have to not only install and manage a software code base, but also allow native execution of unknown and untrusted programs. All these issues can be eliminated by introducing virtual machines.
As mentioned, virtualization is by no means a new concept. Many virtual machines exist and many of them have been combined with grid computing. However, most of these were designed for other purposes and suffer from a few problems when it comes to running high performance scientific applications on a heterogeneous computing platform. Grid computing is tightly bonded to eScience, and while standard jobs may run perfectly and satisfactorily in existing virtual
Fig. 1. Relationship between VMs, the Grid, and eScience
machines, 'gridified' eScience jobs are better suited for a dedicated virtual machine in terms of performance.
Hence, our approach addresses these problems by developing a portable virtual machine specifically designed for scientific applications: the Scientific Byte Code Virtual Machine (SciBy VM).
The machine implements a virtual CPU capable of executing platform independent byte codes corresponding to a very large instruction set. An important feature to achieve performance is the use of optimized native libraries for the most prevalent algorithms in scientific applications. Security is obviously very important for resource owners. To this end, virtualization provides the necessary isolation from the host system, and several aspects that have made other virtual machines vulnerable have been left out. For instance, the SciBy VM supports neither system calls nor I/O.
The following section (II) motivates the usage of virtual machines in a grid computing context and why they are beneficial for scientific applications. Next, we describe the architecture of the SciBy VM in Section III, the compiler in Section IV, related work in Section VI, and conclusions in Section VII.
II. Motivation
The main building blocks in this project arise from properties of virtual machines, eScience, and a grid environment in a combined effort, as shown in Figure 1.
The individual interactions impose several effects from the viewpoint of each end, described next.
A. eScience in a Grid Computing Context
eScience, modelling computationally intensive scientific problems using distributed computer networks, has driven the development of grid technology, and as the simulations get more and more accurate, the amount of data and needed compute power increase equivalently. Many research projects have already made the transition to grid platforms to accommodate the immense requirements for data and computational processing. Using this technology, researchers gain access to many networked computers at the cost of a highly heterogeneous computing platform. Obviously, maintaining application versions for each resource type is tedious and troublesome, and results in a deploy-port-redeploy cycle. Further, different hardware and software setups on computational resources complicate the application development drastically. One never knows to which resource a job is submitted in a grid, and while it is possible to assist each job with a detailed list of hardware and software requirements, researchers are better off with a virtual workspace environment that abstracts a real execution environment.
Hence, a virtual execution environment spanning the heterogeneous resource platform is essential in order to fully leverage the grid potential. From the view of applications, this would render resource access uniform and thus enable the much easier "compile once, run anywhere" strategy: researchers can write their applications, compile them for the virtual machine, and have them executed anywhere in the Grid.
B. Virtual Machines in a Grid Computing Context
Due to the renewed popularity of virtualization over the last few years, virtual machines are being developed for numerous purposes and therefore exist in many designs, each of them in many variants with individual characteristics. Despite the variety of designs, the underlying technology encompasses a number of properties beneficial for Grid Computing [4]:
1) Platform Independence: In a grid context, where application code must move around as freely as application data, it is highly profitable to enable applications to be executed anywhere in the grid. Virtual machines bridge the architectural boundaries of computational elements in a grid by raising the level of abstraction of a computer system, thus providing a uniform way for applications to interact with the system. Given a common virtual workspace environment, grid users are provided with a compile-once-run-anywhere solution.
Furthermore, a running virtual machine is not tied to a specific physical resource; it can be suspended, migrated to another resource, and resumed from where it was suspended.
2) Host Security: To fully leverage the computational power of a grid platform, security is just as important as application portability. Today, most grid systems enforce security by means of user and resource authentication, a secure communication channel between them, and authorization in various forms. However, once access and authorization are granted, securing the host system from the application is left to the operating system.
Ideally, rather than handling the problems after system damage has occurred, harmful grid applications (intentional or not) should not be able to compromise a grid resource in the first place.
Virtual machines provide stronger security mechanisms than conventional operating systems, in that a malicious process running in an instance of a virtual machine is only capable of destroying the environment in which it runs, i.e. the virtual machine.
3) Application Security: Conversely to disallowing host system damage, other processes, local or running in other virtualized environments, should not be able to compromise the integrity of the processes in the virtual machine.
System resources, for instance the CPU and memory, of a virtual machine are always mapped to underlying physical resources by the virtualization software. The real resources are then multiplexed between any number of virtualized systems, giving the impression to each of the systems that they have exclusive access to a dedicated physical resource. Thus, grid jobs running in a virtual machine are isolated from other grid jobs running simultaneously in other virtual machines on the same host, as well as from possible local users of the resources.
4) Resource Management and Control: Virtual machines enable increased flexibility for resource management and control in terms of resource usage and site administration. First of all, the middleware code necessary for interacting with the Grid can be incorporated in the virtual machine, thus relieving the resource owner from installing and managing the grid software. Secondly, a process's usage of physical resources like memory, disk, and CPU is easily controlled with a virtual machine.
5) Performance: As a virtual machine architecture interposes a software layer between the traditional hardware and software layers, in which a possibly different instruction set is implemented and translated to the underlying native instruction set, performance is typically lost during the translation phase. Despite recent advances in virtualization and translation techniques, and the introduction of hardware-assisted capabilities, virtual machines usually introduce performance overhead, and achieving near-native performance remains the goal. The impact depends on system characteristics and the applications intended to run in the machine.
To summarize, virtual machines are an appealing technology for Grid Computing because they resolve the conflict between the grid users at one end of the system and the resource providers at the other end. Grid users want exclusive access to as many resources as possible, as much control as possible, secure execution of their applications, and they want to use certain software and hardware setups.
At the other end, introducing virtual machines on resources enables resource owners to service several users at once, to isolate each application execution from other users of the system and from the host system itself, to provide a uniform execution environment, and managed code is easily incorporated in the virtual machine.
C. A Scientific Virtual Machine for Grid Computing
Virtualization can occur at many levels of a computer system and take numerous forms. Generally, as shown in Figure 2, virtual machines are divided into two main categories: system virtual machines and process virtual machines, each branched into finer divisions based on whether the host and guest instruction sets are the same or different. Virtual machines with the same instruction set as the hardware they virtualize do exist in multiple grid projects, as mentioned in Section VI. However, since full cross-platform portability is of major importance, we only consider emulating virtual machines, i.e. machines that execute another instruction set than the one executed by the underlying hardware.
Fig. 2. Virtual machine taxonomy. Similar to Figure 13 in [9]
System virtual machines allow a host hardware platform to support multiple complete guest operating systems, all controlled by a virtual machine monitor that acts as a layer between the hardware and the operating systems. Process virtual machines operate at a higher level in that they virtualize a given platform for user applications. A detailed description of virtual machines can be found in [9].
The overall problem with system virtual machines that emulate the hardware for an entire system, including applications as well as an operating system, is the performance loss incurred by converting all guest system operations to equivalent host system operations, and the implementation complexity in developing a machine for every platform type, each capable of emulating an entire hardware environment for essentially all types of software.
Since the application domain in focus is scientific applications only, there is really no need for full-featured operating systems. As shown in Figure 3, process level virtual machines are simpler because they only execute individual processes, each interfaced to the hardware resources through a virtual instruction set and an Application Binary Interface.
Fig. 3. System VMs (left) and Process VMs (right)
Using the process level virtual machine approach, the virtual machine is designed in accordance with a software development framework. Developing a virtual machine for which there is no corresponding underlying real machine may sound counterintuitive, but this approach has proved successful in several cases, best demonstrated by the power and usefulness of the Java Virtual Machine. Tailored to the Java programming language, it has provided a platform independent computing environment for many application domains, yet there is no commonly used real Java machine¹.
Similar to Java, applications for the SciBy VM are compiled into a platform independent byte code which can be executed on any device equipped with the virtual machine. However, applications are not tied to a specific programming language. As noted earlier, researchers should not be forced to rewrite their applications in order to use the virtual machine. Hence, we produce a compiler based on a standard ANSI C compiler.
D. Enabling Limitations
While the outlined work at hand may seem comprehensive, especially the implementation burden of virtual machines for different architectures, there are some important limitations that greatly simplify the project. Firstly, the implementation burden is lessened drastically by only giving support for running a single sequential application. Giving support for entire operating systems is much more complex in that it must support multiple users in a multi-process environment, and hardware resources such as networking, I/O, the graphics processor, and 'multimedia' components of currently used standard CPUs are also typically virtualized.
Secondly, a virtual machine allows fine-grained control over the actions taken by the code running in the machine. As mentioned in Section VI, many projects use sandbox mechanisms in which they, by various means, check all system instructions. The much simpler approach taken in this project is to simply disallow system calls. The rationale for this decision is that:
• scientific applications perform basic calculations only
• using a remote file access library, only files from the grid can be accessed
• all other kinds of I/O are not necessary for scientific applications and thus prohibited
• indispensable system calls must be routed to the grid
III. Architectural Overview
The SciBy Virtual Machine is an abstract machine executing platform independent byte codes on a virtual CPU, either by translation to native machine code or by interpretation. However, in many aspects it is designed similarly to conventional architectures; it includes an Application Binary Interface, an Instruction Set Architecture, and is able to manipulate memory components. The only thing missing in defining the architecture is the hardware. As the VM is supposed to be run on a variety of grid resources, it must be designed to be as portable as possible, thereby supporting many different physical hardware architectures.
Based on the previous sections, the SciBy VM is designed to have 3 fundamental properties:
• Security
• Portability
• Performance
That said, all architectural decisions presented in the following sections rely solely on providing portability. Security is
¹ The Java VM has been implemented in hardware in the Sun PicoJava chips
obtained by isolation through virtualization, and performance is solely obtained by the use of optimized native libraries for the intended applications, taking advantage of the fact that scientific applications spend most of their time in these libraries. The byte code is as such not designed for performance. Therefore, the architectural decisions do not necessarily seek to minimize code density, minimize code size, reduce memory traffic, reduce the average number of clock cycles per instruction, or meet other architectural evaluation measurements, but aim more for simplicity and portability.
A. Application Binary Interface
The SciBy VM ABI defines how compiled applications interface with the virtual machine, thus enabling platform independent byte codes to be executed without modification on the virtual CPU.
At the lowest level, the architecture defines the following machine types, arranged in big endian order:
• 8-bit byte
• 16-, 32-, or 64-bit halfword
• 32-, 64-, or 128-bit word
• 64-, 128-, or 256-bit doubleword
In order to support many different architectures, the machine exists in multiple variations with different word sizes. Currently, most desktop computers are either 32- or 64-bit architectures, and it probably won't be long before we see desktop computers with 128-bit architectures. By letting the word size be user-defined, we capture most existing and near-future computers.
Fundamental primitive data types include, all in signed two's complement representation:
• 8-bit character
• integers (1 word)
• single-precision floating point (1 word)
• double-precision floating point (2 words)
• pointer (1 word)
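A host-side implementation of these word-size-parameterized types could be sketched roughly as follows (the type and macro names are our own illustration, not part of the SciBy VM specification; only the 32- and 64-bit word variants are shown):

```c
#include <stdint.h>

/* Hypothetical build-time selection of the virtual word size;
   the 32-bit variant is active here. */
#define SCIBY_WORD_BITS 32

#if SCIBY_WORD_BITS == 32
typedef uint8_t  sciby_byte;      /* 8-bit byte                       */
typedef uint16_t sciby_halfword;  /* halfword = half a word           */
typedef uint32_t sciby_word;      /* 1 word: integers, pointers       */
typedef uint64_t sciby_dword;     /* doubleword = 2 words             */
#elif SCIBY_WORD_BITS == 64
typedef uint8_t     sciby_byte;
typedef uint32_t    sciby_halfword;
typedef uint64_t    sciby_word;
typedef __uint128_t sciby_dword;  /* GCC/Clang extended integer type  */
#endif
```

The 128-bit word variant would need an extended integer type (or a two-limb struct) for its doubleword, which is one reason the byte code, not the host, fixes the semantics of each type.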
The machine contains a register file of 16384 registers, all1 word long. This number only serves as a value for havinga potentially unlimited amount of registers. The reasons forthis are twofold. First of all due to forward compatibility,since the virtual register usage has to be translated to nativeregister usage, in which one cannot tell the upper limit onregister numbers. So basically, in a virtual CPU, one shouldbe sure to have more registers than the host system CPU.Currently, 16384 registers should be more than enough, butnew architectures tend to have more and more registers.Secondly, for the intended applications, the authors believethat a register-based architecture will outperform a stack-basedone[8]. Generally, registers have proved more successful thanother types of internal storage and virtually every architecturedesigned in the last few decades uses a register architecture.
Register computers exist in 3 classes depending on where ALU instructions can access their operands: register-register architectures, register-memory architectures and memory-memory architectures. The majority of computers shipped nowadays implement one of these classes in a 2- or 3-operand format. In order to capture as many computers as possible, the SciBy VM supports all of these variants in a 3-operand format, thereby including 2-operand format architectures by letting the destination address be the same as one of the sources.
B. Instruction Set Architecture
One key element that separates the SciBy VM from conventional machines is the memory model: the machine defines a Harvard memory architecture with separate memory banks for data and instructions. The majority of conventional modern computers use a von Neumann architecture with a single memory segment for both instructions and data. These machines are generally more vulnerable to the well-known buffer overflow exploits and similar exploits derived from 'illegal' pointer arithmetic to executable memory segments. Furthermore, the machine will support hardware setups that have separate memory pathways, thus enabling simultaneous data and instruction fetches. All instructions are fetched from the instruction memory bank, which is inaccessible to applications: all memory accesses from applications are directed to the data segment. The data memory segment is partitioned into a global memory section, a heap section for dynamically allocated structures, and a stack for storing local variables and function parameters.
1) Instruction Format: The instruction format is based on byte codes to simplify the instruction stream. The format is as follows: each instruction starts with a one-byte operation code (opcode), possibly followed by more opcodes, and ends with zero or more operands, see Figure 4. In this sense, the machine is a multi-opcode multi-address machine. Having only a single one-byte opcode limits the instruction set to only 256 different instructions, whereas multiple opcodes allow for nested instructions, thus increasing the number of instructions exponentially. A multi-address design is chosen to support more types of hardware.
Fig. 4. Examples of various instruction formats on register operands.
2) Addressing Modes: Based on the popularity of addressing modes found in recent computers, we have selected 4 addressing modes for the SciBy VM, all listed below.
• Immediate addressing: The operand is an immediate, for instance MOV R1 4, which moves the number 4 to register 1.
• Displacement addressing: The operand is an offset and a register pointing to a base address, for instance ADD R1 R1 4(R2), which adds to R1 the value found 4 words from the address pointed to by R2.
• Register addressing: The operand is a register, for instance MOV R1 R2.
• Register indirect addressing: The address part is a register containing the address of an operand, for instance ADD R1, R1, (R2), which adds to R1 the value found at the address pointed to by R2.
3) Instruction Types: Since the machine defines a Harvard architecture, it is important to note that data movement is carried out by LOAD and STORE operations, which operate on words in the data memory bank. PUSH and POP operations are available for accessing the stack.
Table I summarizes the most basic instructions available in the SciBy VM. Almost all operations are simple 3-address operations with operands, and they are chosen to be simple enough to be directly matched by native hardware operations.
TABLE I
BASIC INSTRUCTION SET OF THE SCIBY VM

Instruction group   Mnemonics
Moves               load, store
Stack               push, pop
Arithmetic          add, sub, mul, div, mod
Boolean             and, or, xor, not
Bitwise             and, or, shl, shr, ror, rol
Compare             tst, cmp
Control             halt, nop, jmp, jsr, ret, br, br_eq, br_lt, etc.
While these instructions are found in virtually every computer, they exist in many different variations using various addressing modes for each operand. To accommodate this and assist the compiler as much as possible, the SciBy VM provides regularity by making the instruction set orthogonal across operations, data types, and addressing modes. For instance, the 'add' operation exists in all 16 combinations of the 4 addressing modes on the two source registers, for both integers and floating point. Thus, the encoding of an 'add' instruction on two immediate source operands takes up 1 byte for choosing arithmetic, 1 byte to select the 'add' on two immediates, 2 bytes to address one of the 16384 registers as destination register, and then 16 bytes for each of the immediates, yielding a total instruction length of 36 bytes.
C. Libraries
In addition to the basic instruction set, the machine implements a number of basic libraries for standard operations like floating-point arithmetic and string manipulation. These are extensions to the virtual machine and are provided on a per-architecture basis as statically linked native libraries optimized for specific hardware.
As explained above, virtual machines introduce a performance overhead in the translation phase from virtual machine object code to the native hardware instructions of the underlying real machine. The all-important observation here is that scientific applications spend most of their running time executing 'scientific instructions' such as string operations, linear algebra, fast Fourier transforms, or other library functions. Hence, by providing optimized native libraries, we can take advantage of the synergy between the algorithms, the compiler translating them, and the hardware executing them.
Equipping the machine with native libraries for the most prevalent scientific algorithms, and enabling future support for new libraries, increases the number of potential instructions drastically. To address this problem, multiple opcodes allow for nested instructions as shown in Figure 5. The basic instructions are accessible using only one opcode, whereas a floating point operation is accessed using two opcodes, i.e. FP_lib FP_sub R1 R2 R3, and finally, if one wishes to use the WFTA instruction from the FFT_2 library, 3 opcodes are necessary: FFT_lib FFT_2 WFTA args.
Fig. 5. Native libraries as an extension to the instruction set: the first opcode level holds the basic instructions (Halt, Load, Store, Push, Pop) and the library selectors (Str_lib, FP_lib, FFT_lib); the next level holds library functions such as FP_add, FP_sub, String_move, String_cmp and the FFT_1, FFT_2, FFT_3 sub-libraries, with FFT_2 containing the WFTA and PFA instructions.
A special library is provided to enable file access. While most grid middlewares use a staging strategy that downloads all input files prior to job execution and uploads output files afterwards, the MiG-RFA [1] library accesses files directly on the file server on an on-demand basis. Using this strategy, an application can start immediately, and only the needed fragments of the files it accesses are transferred.
Ending the discussion of the architecture, it is important to re-emphasize that all focus in this part of the machine is on portability. For instance, when evaluating the architecture, one might find that:
• Having a 3-operand instruction format may give unnecessarily large code size in some circumstances
• Studies may show that the displacement addressing mode is typically used with nearby addresses, thereby suggesting that these instructions only need a few bits for the operand
• Using register-register instructions may give an unnecessarily high instruction count in some circumstances
• Using byte codes increases the code density
• Variable instruction encoding decreases performance
Designing an architecture involves many trade-offs, and even though many of these issues are neutralized by the interpreter or translator, the proposed byte code is far from optimal by normal architecture metrics. However, the key point is that we target only a special type of applications on a very broad hardware platform.
IV. COMPILATION AND TRANSLATION
While researchers do not need to rewrite their scientific applications for the SciBy VM, they do need to compile their application using a SciBy VM compiler that can translate the high level language code to SciBy VM code. While developing a new compiler from scratch is of course a possibility, it is also a significant amount of work which may prove unprofitable, since many compilers designed to be retargetable to new architectures already exist.
Generally, retargetable compilers are constructed using the same classical modular structure: a front end that parses the source file and builds an intermediate representation, typically in the shape of a parse tree, used for machine-independent optimizations, and a back end that translates this parse tree to assembly code of the target machine.
When choosing between open source retargetable compilers, the set of possibilities quickly narrows down to only a few candidates: GCC and LCC. Despite the pros of being the most popular and widely used compiler, with many supported source languages in the front end, GCC was primarily designed for 32-bit architectures, which greatly complicates the retargeting effort. LCC, however, is a light-weight compiler, specifically designed to be easily retargetable to a new architecture.
Once compiled, a byte code file containing assembly instruction mnemonics is ready for execution in the virtual machine, either by interpretation or by translation, where instructions are mapped to the instruction set of the host machine using either a load-time or run-time translation strategy. Results remain to be seen, yet the authors believe that in case a translator is preferable to an interpreter, the best solution would be a load-time translator, based on two observations about scientific applications:
• their total running time is fairly long, which means that the load-time penalty is easily amortized
• they contain a large number of tight loops, where run-time translation is guaranteed to be inferior to load-time translation
V. EXPERIMENTS
To test the proposed ideas, a prototype of the virtual machine has been developed, in the first stage as a simple interpreter implemented in C. There is no compiler yet, so all sample programs are hand-written in assembly code, with the only goal of giving preliminary results that will show whether development of the complete machine can be justified.
The first test is a typical example of the scientific applications the machine targets: a Fast Fourier Transform (FFT). The program first computes 10 transforms on a vector of varying sizes, then checksums the transformed vector to verify the result. In order to test the performance of the virtual machine, the program is also implemented in C to get the native baseline performance, and in Java to compare the results of the SciBy VM with an existing widely used virtual machine.
The C and SciBy VM programs make use of the fftw library [6], while the Java version uses an FFT algorithm from the SciMark suite [7].

TABLE II
COMPARISON OF THE PERFORMANCE OF AN FFT APPLICATION ON A 1.86 GHZ INTEL PENTIUM M PROCESSOR, 2 MB CACHE, 512 MB RAM

Vector size    Native    SciBy VM    Java
524288          1.535      1.483      7.444
1048576         3.284      3.273     19.174
2097152         6.561      6.656     41.757
4194304        14.249     14.398     93.960
8388608        29.209     29.309    204.589

TABLE III
COMPARISON OF THE PERFORMANCE OF AN FFT APPLICATION ON A DUAL CORE 2.2 GHZ AMD ATHLON 4200 64-BIT, 512 KB CACHE PER CORE, 4 GB RAM

Vector size    Native    SciBy VM    Java
524288          0.879      0.874      4.867
1048576         1.857      1.884     10.739
2097152         3.307      3.253     23.520
4194304         6.318      6.354     50.751
8388608        13.045     12.837    110.323

Obviously, this test is highly unfair in disfavor of the Java version for several reasons. Firstly, the fftw library is well known to give the best performance, and comparing hand-coded assembly with compiler-generated high-level language performance is a common pitfall. However, even though Java wrappers for the fftw library exist, it is essential to put these comparisons in a grid context. If the grid resources were to run the scientific applications in a Java Virtual Machine, the programmers - the grid users - would not be able to take advantage of the native libraries, since allowing external library calls breaks the security of the JVM. Thereby, the isolation level between the executing grid job and the host system is lost². In the proposed virtual machine, these libraries are an integrated part of the machine, and using them is perfectly safe.
As shown in Table II, the FFT application is run on the 3 machines using different vector sizes, 2^19, ..., 2^23. The results show that the SciBy VM is on par with native execution, and that the Java version is clearly outperformed.
Since the fftw library is multithreaded, we repeat the experiment on a dual core machine and on a quad dual-core machine. The results are shown in Table III and Table IV.
TABLE IV
COMPARISON OF THE PERFORMANCE OF AN FFT APPLICATION ON A QUAD DUAL-CORE INTEL XEON CPU, 1.60 GHZ, 4 MB CACHE PER CORE, 8 GB RAM

Vector size    Native    SciBy VM    Java
524288          0.650      0.640      4.955
1048576         1.106      1.118     12.099
2097152         1.917      1.944     27.878
4194304         3.989      3.963     61.423
8388608         7.796      7.799    134.399

From these results it is clear that for this application there is no overhead in running it in the virtual machine. It has immediate support for multi-threaded libraries, and therefore the single-threaded Java version is even further outperformed on multi-core architectures.

² In fact there is a US Patent (#6862683) on a method to protect native libraries.
VI. RELATED WORK
GridBox [3] aims at providing a secure execution environment for grid applications by means of a sandbox environment and Access Control Lists. The execution environment is restricted by the chroot command, which isolates each application in a separate file system space. In this space, all system calls are intercepted and checked against pre-defined Access Control Lists which specify a set of allowed and disallowed actions. In order to intercept all system calls transparently, the system is implemented as a shared library that gets preloaded into memory before the application executes. The drawbacks of the GridBox library are that it requires a UNIX host system and application, and that it does not work with statically linked applications. Further, this kind of isolation can be broken if an intruder gains system privileges, leaving the host system unprotected.
Secure Virtual Grid (SVGrid) [10] isolates each grid application in its own instance of a Xen virtual machine, whose file system and network access requests are forced to go through the privileged virtual machine monitor where the restrictions are checked. Since each grid virtual machine is securely isolated from the virtual machine monitor from which it is controlled, many levels of security have to be breached in order to compromise the host system, and the system has proved its effectiveness against several malicious software tests. The performance of the system is also acceptable, with a very low overhead. The only drawback is that while the model can be applied to operating systems other than Linux, it still makes use of platform-dependent virtualization software.
The MiG-SSS system [2] seeks to combine Public Resource Computing and Grid Computing by using sandbox technology in the form of a virtual machine. The project uses a generic Linux image customized to act as a grid resource, and a screen saver that can start any type of virtual machine capable of booting an ISO image, for instance VMware Player and VirtualBox. The virtual machine then boots the Linux image, which in turn retrieves a job from the Grid and executes the job in the isolated sandbox environment.
Java and the Microsoft Common Language Infrastructure are similar solutions trying to enable applications written in the Java programming language or the Microsoft .Net framework, respectively, to be used on different computer architectures without being rewritten. They both introduce an intermediate platform-independent code format (Java byte code and the Common Intermediate Language, respectively) executable by hardware-specific execution environments (the Java Virtual Machine and the Virtual Execution System, respectively). While these solutions have proved suitable for many application domains, performance problems and their requirement of a specific class of programming languages rarely used for scientific applications disqualify the use of these virtual machines for this project.
VII. CONCLUSIONS AND FUTURE WORK
Virtual machines can solve many problems related to using desktop computers for Grid Computing. Most importantly, for resource owners, they enforce security by means of isolation, and for researchers using the Grid, they provide a level of homogeneity that greatly simplifies application deployment in an extremely heterogeneous execution environment.
This paper presented the basic ideas behind the Scientific Byte Code Virtual Machine and proposed a virtual machine architecture specifically designed for executing scientific applications on any type of real hardware architecture. To this end, efficient native libraries for the most prevalent scientific software packages are an important issue, which the authors believe will greatly minimize the performance penalty normally incurred by virtual machines.
An interpreter has been developed to give preliminary results, which have justified the ideas of the machine. The machine is on par with native execution, and on the intended application types it outperforms the Java virtual machines deployed in a grid context.
After the proposed initial virtual machine has been implemented, several extensions to the machine are planned, including threading support, debugging, profiling, an advanced library for a distributed shared memory model, and support for remote memory swapping.
REFERENCES
[1] Rasmus Andersen and Brian Vinter, Transparent remote file access in the Minimum intrusion Grid, WETICE '05: Proceedings of the 14th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprise (Washington, DC, USA), IEEE Computer Society, 2005, pp. 311–318.
[2] Rasmus Andersen and Brian Vinter, Harvesting idle Windows CPU cycles for grid computing, GCA (Hamid R. Arabnia, ed.), CSREA Press, 2006, pp. 121–126.
[3] Evgueni Dodonov, Joelle Quaini Sousa, and Helio Crestana Guardia, GridBox: securing hosts from malicious and greedy applications, MGC '04: Proceedings of the 2nd Workshop on Middleware for Grid Computing (New York, NY, USA), ACM Press, 2004, pp. 17–22.
[4] Renato J. Figueiredo, Peter A. Dinda, and José A. B. Fortes, A case for grid computing on virtual machines, ICDCS '03: Proceedings of the 23rd International Conference on Distributed Computing Systems (Washington, DC, USA), IEEE Computer Society, 2003.
[5] Ian Foster, The Grid: A new infrastructure for 21st century science, Physics Today 55 (2002), no. 2, 42–47.
[6] Matteo Frigo and Steven G. Johnson, The design and implementation of FFTW3, Proceedings of the IEEE 93 (2005), no. 2, 216–231, special issue on "Program Generation, Optimization, and Platform Adaptation".
[7] Roldan Pozo and Bruce Miller, SciMark 2.0, http://math.nist.gov/scimark2/.
[8] Yunhe Shi, David Gregg, Andrew Beatty, and M. Anton Ertl, Virtual machine showdown: stack versus registers, VEE '05: Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments (New York, NY, USA), ACM, 2005, pp. 153–163.
[9] J. E. Smith and R. Nair, Virtual machines: Versatile platforms for systems and processes, Morgan Kaufmann, 2005.
[10] Xin Zhao, Kevin Borders, and Atul Prakash, SVGrid: a secure virtual environment for untrusted grid applications, MGC '05: Proceedings of the 3rd International Workshop on Middleware for Grid Computing (New York, NY, USA), ACM Press, 2005, pp. 1–6.
Interest-oriented File Replication in P2P File Sharing Networks

Haiying Shen
Department of Computer Science and Computer Engineering
University of Arkansas, Fayetteville, AR 72701
Abstract - In peer-to-peer (P2P) file sharing networks, file replication avoids overloading the file owner and improves file query efficiency. Most current methods replicate a file along the query path from a client to a server. These methods lead to a large number of replicas and low replica utilization. Aiming to achieve high replica utilization and efficient file query, this paper presents an interest-oriented file replication mechanism. It clusters nodes based on node interest. Replicas are shared by nodes with a common interest, leading to fewer replicas, less overhead, and enhanced file query efficiency. Simulation results demonstrate the effectiveness of the proposed mechanism in comparison with another file replication method: it dramatically reduces the overhead of file replication and improves replica utilization.
Keywords: File replication, peer-to-peer, distributed hash table
1. Introduction
Over the past years, the immense popularity of the Internet has produced a significant stimulus to peer-to-peer (P2P) file sharing networks [1, 2, 3, 4, 5]. A popular file with a very frequent visit rate in a node will overload the node, leading to slow response to file requests and low query efficiency. File replication is an effective method to deal with the problem of overload due to hot files. Most current file replication methods [6, 7, 8, 9, 10, 11] replicate a file along the query path between a requester and a file owner to increase the possibility that a query encounters a replica node during routing. We use Path to denote this class of methods. In Path, a file query still needs to be routed until it encounters a replica node or the file owner; however, these methods cannot guarantee that a request meets a replica node. To enhance the effectiveness of file replication on query efficiency, this paper presents an interest-oriented file replication mechanism, namely Cluster. It groups nodes with a common interest into a cluster. Cluster is novel in that replicas are shared by nodes with a common interest, leading to fewer replicas, less overhead, and enhanced file query efficiency.
The rest of this paper is structured as follows. Section 2 presents a concise review of representative file replication approaches for P2P file sharing networks. Section 3 presents the Cluster file replication mechanism, including its structure and algorithms. Section 4 shows the performance of Cluster in comparison to other approaches with a variety of metrics, and analyzes the factors affecting file replication performance. Section 5 concludes this paper.
2. Related Work
File replication in P2P systems is designed to relieve the load on hot spots while decreasing file query time. PAST [12] replicates each file on a set number of nodes whose IDs match the file owner's ID most closely. It has a load balancing algorithm for non-uniform storage node capacities and file sizes, and uses caching along the lookup path for non-uniform popularity of files, to minimize fetch distance and to balance the query load. Similarly, CFS [6] replicates blocks of a file on nodes immediately after the block's successor on the Chord ring [1]. Stading et al. [7] proposed to replicate a file in locality-close nodes near the file owner. LAR [13] and Gnutella [14] replicate a file in overloaded nodes at the file requester. Backslash [7] pushes cache one hop closer to requester nodes as soon as nodes are overloaded. Freenet [8] replicates objects both on insertion and on retrieval on the path from the requester to the target. CFS [6], PAST [12], LAR [13], CUP [9] and DUP [10] perform caching along the query path. Cox et al. [11] studied providing DNS service over a P2P network. They cache index entries, which are DNS mappings, along search query paths.
Ghodsi et al. [15] proposed a symmetric replication scheme in which a number of IDs are associated with each other, and any item with an ID can be replicated in the nodes responsible for the IDs in this group. HotRoD [16] is a DHT-based architecture with a replication scheme. An arc of nodes (i.e., successive nodes on the ring) is "hot" when at least one of these nodes is hot. In the scheme, "hot" arcs of nodes are replicated and rotated over the ID space. By tweaking the degree of replication, it can trade off replication cost for load balance.
Path methods still require file request routing, and they cannot ensure that a request meets a replica node during routing. Thus, they cannot significantly improve file query efficiency. Rather than replicating a file at a single requester, Cluster replicates a file for nodes with a common interest. Consequently, the file replication overhead is reduced, and replica utilization and query efficiency are increased.
3. Interest-oriented File Replication
3.1. Overview
We use the Chord Distributed Hash Table (DHT) P2P system [1] as an example to explain file replication in P2P file sharing networks. Without loss of generality, we assume that nodes have interests and that these interests can be uniquely identified. A node's interests are described by a set of attributes with globally known string descriptions such as "image", "music" and "book". The interest attributes are fixed in advance for all participating nodes. Each interest corresponds to a category of files, and a node frequently requests the files it is interested in. The strategies that allow content in a node to be described with metadata [17, 18, 19, 20, 21] can be used to derive the interests of each node. Due to space limits, we do not explain the details of these strategies.
3.2. Cluster Structure Construction
Consistent hash functions such as SHA-1 are widely used in DHT networks for node or file IDs due to their collision-resistant nature. With such a hash function, it is computationally infeasible to find two different messages that produce the same message digest. Therefore, a consistent hash function is effective for clustering interest attributes based on their differences: the same interest attributes will have the same consistent hash value, while different interest attributes will have different hash values.
Next, we introduce how to use the consistent hash function to cluster nodes based on interest. To facilitate such clustering, the information of nodes with a common interest should be marshaled in one node in the DHT network, so that these nodes can locate each other in order to constitute a cluster. Although logically close nodes do not necessarily have a common interest, Cluster enables common-interest nodes to report their information to the same node.
In a DHT overlay [1], an object with a DHT key is allocated to a node by the interface Insert(key,object), and the object can be found by Lookup(key). In Chord, the object is assigned to the first node whose ID is equal to or follows the key in the ID space. If two objects have similar keys, they are stored in the same node. We use H to denote the consistent hash value of a node's interest, and Info to denote the information of the node, including its IP address and ID. Because H distinguishes node interests, if nodes report their information to the DHT overlay with their H as the key by Insert(H,Info), the information of common-interest nodes with similar H will reach the same node, which is called the repository node. As a result, a group of information in a repository node is the information of nodes with a common interest. The repository node can further classify the information of the nodes based on their locality, which can be included in the reported information. The work in [22] introduced a method to get the locality information of a node; please refer to [23] for the details of methods to measure a node's locality.
Therefore, a node can find other nodes with the same interest via its repository node by Lookup(H). In each cluster, the highest-capacity node is elected as the server of the other nodes in the cluster (i.e., the clients). Thus, each client has a link connecting to its server, and a server connects to a group of clients in its cluster. The server has an index of all files and file replicas in its clients. Every time a client accepts or deletes a file replica, it reports to the server. A server uses broadcasting to communicate with its clients.
A P2P overlay is characterized by dynamism, in which nodes join, leave and fail continually. The structure maintenance mechanism in our previous work [24] is adopted for the Cluster structure maintenance. These techniques are orthogonal to our study in this paper.
3.3. File Replication and Query Algorithms
Cluster reduces the number of replicas, increases replica utilization and significantly improves file query efficiency. Rather than replicating a file to all nodes in a query path, it considers the request frequency of a group of nodes, and makes replicas for nodes with a common interest. Since common-interest nodes are grouped in a cluster, a node can get its frequently-requested file in its own cluster without request routing in the entire system. This significantly improves file query efficiency.

Figure 1. Average path length vs. number of replicating operations when overloaded, for Cluster and Path.
In addition to file owners, we assume other nodes can also replicate files. Therefore, file owners and servers are responsible for file replication. For simplicity, in the following, we use server to represent both.
When overloaded, a server replicates the file at a node in its cluster. Recall that the information of common-interest nodes is further classified into sub-groups based on node locality or query visit rate. The server chooses the sub-group with the highest file query frequency, and then chooses the node with the highest file query frequency within it.
Unlike Path methods, which replicate a file in all nodes in a query path, Cluster avoids unnecessary file replications by replicating a file only for a group of frequent requesters. It guarantees that file replicas are fully utilized. Requesters that frequently query for a file f can get it from themselves or their cluster without query routing. Thus, the replication algorithm improves file query efficiency and saves file replication overhead.
Considering that file popularity is non-uniform and time-varying and that node interest varies over time, some file replicas become unnecessary when there are few queries for them. To cope with this situation, Cluster lets each server keep track of the file visit rate of replica nodes, and periodically remove under-utilized replicas. If a file is no longer requested frequently, there will be few replicas of it. The adaptation to cluster query rate ensures that all file replicas are worthwhile and that no overhead is wasted on unnecessary file replica maintenance.
When node i requests a file, if the file is not in the requester's interests, the node uses the DHT Lookup(key) function to query the file. Otherwise, node i first queries the file in its cluster among nodes interested in the file. Specifically, in the first step, node i sends a request to its server in the cluster of this interest. The server searches the index for the requested file in its cluster. Searching among interested nodes has a high probability of finding a replica of the file. If the query still fails after these steps, node i resorts to the Lookup(key) function.
4. Performance Evaluation
This section presents the performance evaluation of Cluster in the average case in comparison with Path. We use replica hit rate to denote the percentage of queries that are resolved by replica nodes among total queries. The experiment results demonstrate the superiority of Cluster over Path in terms of average lookup path length, replica hit rate, and the number of replicas. In the experiment, when overloaded, a file owner or a client's server conducts a file replicating operation. In a file replicating operation, Cluster replicates a file to a single node, while Path replicates a file to a number of nodes along a query path. In the experiments, the number of nodes was set to 2048. We assumed there were 200 interest attributes, and each attribute had 500 files. We assumed a bounded Pareto distribution for the capacity of nodes: the shape of the distribution was set to 2, the lower bound of a node's capacity was set to 500, and the upper bound was set to 50000. The number of queried files was set to 50, and the number of queries per file was set to 1000. This distribution reflects real-world situations where machine capacities vary by different orders of magnitude. The file requesters and the queried files were randomly chosen in the experiments.
4.1 Effectiveness of File Replication
Figure 1 plots the average path length of Cluster and Path. We can see that Cluster leads to a shorter path length than Path. Unlike Cluster, which replicates a file at only one node in each file replicating operation, Path
Figure 2. Replica hit rate.

Figure 3. Number of replicas.
replicates a file at nodes along a query path. This increases the replica hit rate and produces shorter path lengths, yet it cannot guarantee that every query will encounter a replica. Cluster achieves much higher lookup efficiency with far fewer replicas, which illustrates the effectiveness of replicating files for a group of nodes: a node can get a file directly from a node in its own cluster. Letting a replica be shared within a group of nodes increases the utilization of replicas and reduces the lookup path length.
Figure 2 shows the replica hit rate of the two approaches. We can observe that Cluster leads to a higher hit rate than Path. As in Figure 1, Path replicates files at nodes along the routing path, and more replica nodes yield a higher probability that a file request meets a replica node; however, this benefit is outweighed by the far larger number of replicas, and it still cannot ensure that each request is resolved by a replica node. Cluster replicates a file for a group of common-interest nodes, which improves the probability that a file query is resolved by a replica node, leading to a higher hit rate.
4.2 Efficiency of File Replication
Figure 3 illustrates the total number of replicas in Cluster and Path. The number of replicas increases with the number of replicating operations, since more replicating operations for a file produce more replicas in total. Path generates far more replicas than Cluster: in each file replicating operation, Path replicates a file at multiple nodes along a routing path, whereas Cluster replicates it at a single node. In Cluster, a replica is fully utilized through being shared by a group of nodes, generating a high replica hit rate and reducing the possibility that a file owner becomes overloaded. The file owner therefore performs fewer replicating operations, which leads to fewer replicas and less overhead for replica maintenance.
5 Conclusions
Most current file replication methods for P2P file sharing networks incur prohibitively high overhead by replicating a file at all nodes along a query path from a client to a server. This paper proposes an interest-oriented file replication mechanism that generates a replica for a group of nodes with the same interest. The mechanism reduces the number of file replicas while guaranteeing high query efficiency and high utilization of replicas. Simulation results demonstrate the superiority of the proposed mechanism over another file replication approach: it dramatically reduces the overhead of file replication and produces significant improvements in lookup efficiency and replica hit rate.
Acknowledgements
This research was supported in part by the Acxiom Corporation.
CRAB: an Application for Distributed Scientific Analysis in Grid Projects
D. Spiga (1,3,6), S. Lacaprara (2), M. Cinquilli (3), G. Codispoti (4), M. Corvo (5), A. Fanfani (4), F. Fanzago (5,6), F. Farina (6,7), C. Kavka (8), V. Miccio (6), and E. Vaandering (9)

1 University of Perugia, Perugia, Italy
2 INFN Legnaro, Padova, Italy
3 INFN Perugia, Perugia, Italy
4 INFN and University of Bologna, Bologna, Italy
5 CNAF, Bologna, Italy
6 CERN, Geneva, Switzerland
7 INFN Milano-Bicocca, Milan, Italy
8 INFN Trieste, Trieste, Italy
9 FNAL, Batavia, Illinois, USA
Abstract - Starting from 2008 the CMS experiment will produce several petabytes of data each year, to be distributed over many computing centers located in many different countries. The CMS computing model defines how the data is to be distributed so that CMS physicists can efficiently run their analyses over it. CRAB (CMS Remote Analysis Builder) is the tool, designed and developed by the CMS collaboration, that facilitates access to the distributed data in a completely transparent way. The tool's main feature is the ability to distribute and parallelize the local CMS batch data analysis processes over different Grid environments. CRAB interacts with the local user environment, the CMS Data Management services, and the Grid middleware.
Keywords: Grid Computing, Distributed Computing, Grid Application, High Energy Physics Computing.
Introduction
The Compact Muon Solenoid (CMS)[1] is one of the two large general-purpose particle physics detectors integrated into the proton-proton Large Hadron Collider (LHC)[2] at CERN (European Organization for Nuclear Research) in Switzerland. The CMS detector has 15 million channels, through which data will be taken at a rate of TB/s and selected by an on-line selection system (trigger) that reduces the frequency of data taking from 40 MHz (the LHC frequency) to 100 Hz (the data-writing frequency), which corresponds to 100 MB/s and 2 PB of data per year. This challenging experiment is a collaboration of about 2600 physicists from 180 scientific institutes all over the world. The quantity of data to analyze (and to simulate) requires substantial resources to satisfy the experiment's computational requirements: large disk space to store the data and many CPUs on which to run the physicists' algorithms. A way is also needed to make all data and shared resources accessible to everyone in the collaboration. This environment has encouraged the CMS collaboration to define an ad-hoc computing model that addresses these problems. It relies on Grid computing resources, services, and toolkits as basic building blocks, making realistic requirements on Grid services. CMS decided to use a combination of tools provided by the LCG (LHC Computing Grid)[3] and OSG (Open Science Grid)[4] projects, together with specialized CMS tools. In this environment, the computing system has been arranged in tiers (Figure 1), where the majority of computing resources are located away from the host lab. The system is geographically distributed, consistent with the nature of the CMS collaboration itself:
• The Tier-0 computing centre is located at
CERN and is directly connected to the experiment for the initial processing,
Figure 1. Multi-tier architecture based on distributed resources and Grid services.
reconstruction and data archiving.
• A set of large Tier-1 centers to which the Tier-0 distributes data (processed and raw); these centers also provide considerable services for different kinds of data reprocessing.
• A typical Tier-1 site distributes the processed data to smaller Tier-2 centers, which have powerful CPU resources to run analysis tasks and Monte Carlo simulations.

Distributed analysis model in CMS

For the CMS experiment in the Grid computing environment there are many problematic points due to the large amount of dispersed resources. The CMS Workflow Management System (CMS WM) manages the large-scale data processing and reduction that is the principal focus of experimental HEP computing. The CMS WM is the main path for managing and accessing the data, giving the user a single interface to the generic Grid services and the experiment-specific services as one common environment. The Grid services mainly consist of the Grid WMS, which accepts jobs, performs match-making, and distributes the jobs to Computing Elements (CEs); a CE manages local queues that point to a set of resources located at a specific site, such as the Worker Nodes where the jobs run; finally, Storage Elements (SEs) are logical entities that guarantee uniform access to a storage area where the data is stored. As part of its computing model, CMS has chosen a baseline in which the bulk of the experiment-wide data is pre-located at sites, so that the Workload Management (WM) system submits jobs to the correct CE. The Dataset Bookkeeping System (DBS) allows various forms of event data to be discovered and accessed in a distributed computing environment. The analysis model[5] is batch-like and consists of the following main steps: the user runs interactively on small samples of the data to develop and test the code; once the code is ready, the user selects a larger dataset and submits the very same code to analyze it. The results are then made available to the user for interactive analysis. The analysis can be done in steps, saving the intermediate results and iterating over the latest ones.
The distributed analysis workflow over the Grid relies on the Workflow Management System, which is not directly user oriented. The analysis flow in the distributed environment specified above is a more complex computing task, because it assumes knowledge of which data are available, where the data are stored and how to access them, and which resources are available and able to meet the analysis requirements, in addition to the Grid and CMS infrastructure details described above.
The CMS Remote Analysis Builder
Users do not want to deal with the issues described above; they want to analyze data in a simple way. The CMS Remote Analysis Builder (CRAB)[6] is the application designed
and deployed by the CMS collaboration that, following the CMS WM, allows end physicists transparent access to distributed resources over the Grid. CRAB performs three main operations:
• interaction with the CMS analysis framework
(CMSSW) used by the users to develop the applications that run over the data;
• the data discovery step, which interacts with the CMS data management infrastructure to find and locate the required data;
• the Grid-specific steps, which are fully handled from submission to output retrieval.
The typical workflow (Figure 2) involves the concepts of task and job. The task corresponds to the high-level objective of a user (running an analysis over a defined dataset). The job is the traditional queue-system concept, corresponding to a single instance of an application started on a worker node with a specific configuration and output. A task is generally composed of many jobs. A typical analysis workflow in this context consists of:
• data discovery to determine the Storage Elements of the sites storing the data (using DBS);
• preparation of the input sandbox: a package with the user application and related libraries;
• job preparation, which creates a wrapper over the user executable, prepares the environment in which the user application has to run (at the WN level), and at the end handles the output produced;
• job splitting, which takes into account the specific data information, the data distribution, and the granularity requested by the user;
• Grid job configuration, which consists of a file written in the Job Description Language (JDL), interpreted by the WMS, containing the job requirements;
• task (jobs) submission to the Grid;
• monitoring of the submitted jobs, which involves the WMS checking the jobs' progress;
• when a job is finished from the Grid point of view, the final operation is output retrieval, which handles the job output (possibly including copying it to a Storage Element) through the output sandbox.
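As a toy illustration of the job-splitting step, the sketch below divides a requested event range into fixed-size jobs; the real CRAB splitter also honors file and block boundaries reported by DBS, which this version ignores, and the function name is our own.

```python
def split_jobs(total_events, events_per_job):
    """Divide the requested events of a dataset into per-job
    (first_event, n_events) chunks."""
    jobs = []
    first = 0
    while first < total_events:
        n = min(events_per_job, total_events - first)
        jobs.append({"first_event": first, "n_events": n})
        first += n
    return jobs
```

Each entry would then be turned into one JDL file and one wrapped executable invocation on a worker node.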
CRAB is used on the User Interface (UI), which is the access point to the Grid and where the middleware client is available. The user interacts with CRAB via a simple configuration file divided into main sections, and then via the CLI. The configuration file contains all the specific parameters of the task and the jobs. After the user has developed and tested
Figure 2. CRAB in the CMS WM.
his own analysis code locally, he specifies which application he wishes to run, the dataset to analyze, the general requirements on the input dataset such as the job-splitting parameter, and how to treat the output, which can be retrieved back to the UI or copied directly to an existing Storage Element. There are also post-output-retrieval operations that can be executed by the users, including data publication, which registers user data in a local DBS instance to allow easy access to the user-registered data.
CRAB Architecture and Implementation
The programming language used to develop CRAB is Python, which reduces development time, improves maintainability, and does not require compilation. CRAB can perform three main kinds of job submission, each completely transparent to the user:
• direct submission to the Grid, interacting directly with the middleware;
• direct submission to local resources and their queues via the batch system, in an LSF (Load Sharing Facility) environment;
• submission to the Grid using the CRAB server as a layer to which the interaction with the middleware and the task/job management is delegated.

Currently the main development effort is devoted to the client-server implementation. The client resides on the User Interface, while the server can be located anywhere on the Grid. The CRAB client is used directly by the users and performs the operations involved in task/job preparation and creation: data discovery, input-sandbox preparation, job preparation (including job splitting), and requirement definition. The client then makes a request, completely transparent to the user, to the CRAB server. The server fulfills each request, handling the task and performing the related workflow for every kind of Grid interaction: job submission to the Grid; automatic job monitoring; automatic job output retrieval; resubmission of failed jobs following rules specific to different kinds of job failure; any specific command requested by the user, such as job killing; and notifying the user by e-mail when the task reaches a specified level of completion and when it is fully ended (the output of each job is ready). This partitioning of operations between client and server automates the interaction with the Grid as much as possible, reducing unnecessary human load by keeping all possible actions on the server side (and the minimum on the client side), centralizing the Grid interaction, and allowing any kind of trouble to be handled in a single place. This also improves the scalability of the whole system[7]. The communication between client and server is based on SOAP[8], chosen for clear reasons: it is a de facto standard in the Grid service development community, it uses the HTTP protocol, and it provides interoperability across institutions and implementation languages. The client has to assume nothing about the implementation details of the server and vice versa. In this case
the SOAP-based communication is developed using gSOAP[9]. gSOAP provides a cross-platform toolkit for developing server and client applications, avoiding the maintenance of any custom protocol. It does not require a pre-installed runtime environment; using WSDL (Web Services Description Language), it generates code in ANSI C/C++. The internal CRAB server architecture (Figure 3) is based on components implemented as independent agents communicating through an asynchronous and persistent message service (a publish/subscribe model) based on a MySQL[10] database. Each agent takes charge of specific operations, allowing a modular approach from a logical point of view. The current server implementation provides the following components:
• CommandManager: the endpoint of the SOAP service that handles commands sent by the client;
• CrabWorker: performs direct job submission to the Grid;
• TaskTracking: updates information about tasks under execution by polling the database;
• Notification: notifies the user by e-mail when his task has ended and the output has already been retrieved; it also notifies the server administrator of special warning situations;
• TaskLifeManager: manages the task life on the server, cleaning up ended tasks;
• JobTracking: tracks the status of every job;
• GetOutput: retrieves the output of ended jobs;
• JobKiller: kills single or multiple jobs on request;
• ErrorHandler: performs basic error handling that allows jobs to be resubmitted;
• RSSFeeder: provides RSS channels to forward information about the server status;
• AdminComponent: executes specific server maintenance operations.

Many of the components listed above are implemented with a multithreading approach, using safe connections to the database. This makes it possible to manage many tasks at the same time, shortening and often entirely removing the delay of a single operation that has to be carried out on many tasks. The use of threaded components is especially important when interacting with the Grid middleware, where some operations (e.g., on a bulk of jobs at the same moment) take a non-negligible amount of time. Two important entities in the CRAB server architecture are the WS-Delegation service and a dedicated area on an existing Storage Element. WS-Delegation is a
Figure 3. CRAB Server
compliant service for user proxy delegation from the client to the server; this allows the server to perform each Grid operation for a given task with the corresponding user proxy. The SE allows the input/output sandboxes to be transferred between the User Interface and the Grid, working as a sort of drop-box area. The server has a dedicated interface made up of a set of APIs and a core with hierarchical classes implementing different protocols, allowing transparent interaction with the associated remote area independently of the transfer protocol. The server's ability to interact with any associated storage server, independently of the protocol, makes it a portable and scalable component, where the Storage Element hosting the job sandboxes is completely independent of the server. The CRAB server is thus readily adaptable to different environments and configurations. It is also possible to use a local disk area mounted on the local CRAB server instance, with an associated GridFTP server, for the sandbox transfers (instead of a remote Storage Element). The interaction with the Grid is performed using the BossLite framework included in the server core. This framework can be considered a thin layer between the CRAB server and the Grid, used to interact with the middleware and to maintain specific information about tasks and jobs. BossLite consists of a set of APIs that act as an interface to its core, which maps the database objects (e.g., task, job) and executes specific Grid operations over database-loaded objects.
Conclusions
CRAB has been in production for three years and is the only computing tool in CMS used by generic physicists. It is widely used inside the collaboration, with more than 600 distinct users during 2007 and about 50 distinct Tier-2 sites involved in the Grid analysis activities. The CRAB tool is continuously evolving, and the current architecture makes it simple to add new components to the structure to support new use cases as they arise.
References

1. The CMS experiment. http://cmsdoc.cern.ch
2. The Large Hadron Collider Conceptual Design Report, CERN/AC/95-05.
3. LCG Project: LCG Technical Design Report, CERN TDR-01, CERN-LHCC-2005-024, June 2005.
4. The Open Science Grid project. http://www.opensciencegrid.org
5. The CMS Technical Design Report. http://cmsdoc.cern.ch/cms/cpt/tdr
6. D. Spiga, S. Lacaprara, W. Bacchi, M. Cinquilli, G. Codispoti, M. Corvo, A. Dorigo, A. Fanfani, F. Fanzago, F. Farina, M. Merlo, O. Gutsche, L. Servoli, C. Kavka (2007). "The CMS Remote Analysis Builder (CRAB)". High Performance Computing - HiPC 2007, 14th International Conference, Goa, India, vol. 4873, pp. 580-586.
7. D. Spiga, S. Lacaprara, W. Bacchi, M. Cinquilli, G. Codispoti, M. Corvo, A. Dorigo, F. Fanzago, F. Farina, O. Gutsche, C. Kavka, M. Merlo, L. Servoli, A. Fanfani (2007). "CRAB: the CMS distributed analysis tool development and design". Hadron Collider Physics Symposium 2007, La Biodola, Isola d'Elba, Italy, vol. 177-178C, pp. 267-268.
8. SOAP Messaging Framework. http://www.w3.org/TR/soap12-part1
9. The gSOAP Project. http://www.cs.fsu.edu/~engelen/soap.html
10. MySQL Open Source Database. http://www.mysql.com
Fuzzy-based Adaptive Replication Mechanism in Desktop Grid Systems
HongSoo Kim, EunJoung Byun
Dept. of Computer Science & Engineering, Korea University
{hera, vision}@disys.korea.ac.kr

JoonMin Gil
Dept. of Computer Education, Catholic University of Daegu

JaeHwa Chung, SoonYoung Joung
Dept. of Computer Science Education, Korea University
{bigbearian, jsy}@comedu.korea.ac.kr
Abstract
In this paper, we discuss the design of a replication mechanism that guarantees correctness and supports deadline tasks in desktop grid systems. Both correctness and performance are important issues in the design of such systems. To guarantee the correctness of results, voting-based and trust-based sabotage-tolerance mechanisms are generally used. However, these mechanisms suffer from two potential shortcomings: waste of resources due to running redundant replicas of a task, and increased turnaround time due to the inability to deal with dynamic and heterogeneous environments. In this paper, we propose a Fuzzy-based Adaptive Replication Mechanism (FARM) for sabotage tolerance with deadline tasks, based on a fuzzy inference process over each volunteer's trust and result-return probability. Using these two parameters, our desktop grid system can provide both sabotage tolerance and a reduction in turnaround time. In addition, simulation results show that, compared to existing mechanisms, FARM can reduce resource waste in replication without increasing the turnaround time.
1. Introduction
Desktop grid computing is a means of carrying out high-throughput scientific applications using the idle time of desktop computers (PCs) connected to the Internet [5]. It has been used in massively parallel applications composed of numerous instances of the same computation. The applications usually involve scientific problems that require large amounts of sustained processing capacity over long periods. In recent years, there has been increased interest in desktop grid computing because of the success of its most popular examples, such as SETI@Home [10] and distributed.net [16]. There have been a number of studies of desktop grid systems that provide an underlying platform, such as BOINC [15], Entropia [17], Bayanihan [18], XtremWeb [19], and Korea@Home [6].
One of the main characteristics of desktop grid computing is that computing resources, referred to as volunteers, are free to leave or join, which results in a great deal of node volatility. Thus, desktop grid systems (DGSs) lack reliability due to uncontrolled and unspecified computing resources, and cannot avoid exposure to sabotage through the erroneous results of malicious resources. When a malicious volunteer submits bad results to a server, this may invalidate all other results. For example, it has been reported that SETI@Home suffered from malicious behavior by some of its volunteers, who faked the number of tasks completed; other volunteers faked their results by using different or modified client software [2, 10]. Consequently, DGSs should be equipped with a sabotage-tolerance mechanism to protect them from intentional attacks by malicious volunteers [22, 23].
In previous studies, the verification of work results was accomplished by voting- and trust-based mechanisms. In voting-based sabotage-tolerance (VST) mechanisms [14], when a task is distributed in parallel to n volunteers, k or more of these volunteers (where k ≤ n) must return the same result to guarantee result verification for the task. This mechanism is often called the k-out-of-n system. In DGSs, it can be assumed that all n volunteers are stochastically identical and functionally independent. This mechanism is simple and
straightforward, but it is inefficient because it wastes resources. On the other hand, trust-based sabotage-tolerance (TST) mechanisms [3, 4, 7, 9, 21] have lower redundancy for result verification than voting mechanisms. Instead, a lightweight task for which the correct result is already known is distributed periodically to volunteers. In this way, a server can obtain the trust value of each volunteer by counting how many lightweight tasks are returned correctly. This trust value is used as a key factor in the scheduling phase. However, these mechanisms are based on first-come first-served (FCFS) scheduling, which typically allocates tasks to resources as they become available, without any consideration of applications whose tasks must be completed before a certain deadline. From the viewpoint of result verification, FCFS scheduling results in a high turnaround time because it cannot cope effectively with dynamic environments, in which volunteers leave or join the system due to interference from other priorities or hardware failures. If a task is allocated to a highly dynamic volunteer, it is susceptible to failure and must be reallocated to other volunteers, increasing the task's turnaround time. Furthermore, if a task must be completed within a specific time (i.e., a deadline), its turnaround time will be high.
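The k-out-of-n voting check described above amounts to the following sketch (the function name and return convention are our own):

```python
from collections import Counter

def verify_result(results, k):
    """k-out-of-n voting: accept a task result once at least k of the
    n replicas agree on the same value; otherwise verification fails
    and the result is returned as None."""
    value, count = Counter(results).most_common(1)[0]
    return value if count >= k else None
```

The resource waste the text criticizes is visible here: every verification consumes n volunteer executions even when the first k already agree.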
To provide DGSs with result verification that supports task deadlines, this paper proposes the Fuzzy-based Adaptive Replication Mechanism (FARM) for sabotage tolerance, based on the trust and result-return probability of volunteers. FARM combines the benefits of the voting-based mechanism with those of the trust-based mechanism. First, we devised autonomous sampling with mobile agents to evaluate the trust and result-return probability of each volunteer. In this scheme, volunteers receive a sample whose result is already known; the result computed by a volunteer is compared to the original result of the sample to estimate the volunteer's trust. In addition, the volunteer's result-return probability is calculated from the availability and performance of the volunteer. These trust and result-return probability values are mapped to fuzzy sets through a fuzzy inference process. The characteristic function of a fuzzy set is allowed to take values between 0 and 1, denoting the degree of membership of an element in a given set. For the transformation from the requirements of an application (i.e., correctness and deadline) to fuzzy sets, we provide empirical membership functions. In this paper, the fuzzy inference process determines the replication number from the trust probability and result-return probability of each volunteer. Simulation results show that our mechanism can reduce the turnaround time compared with the voting-based and trust-based mechanisms. FARM is also superior to the other two mechanisms in terms of the efficient use of resources.

Figure 1. Desktop grid environments.

The rest of this paper is organized as follows. Section 2 describes the desktop grid environment. Section 3 describes the fuzzy-based adaptive replication mechanism for sabotage tolerance with deadline tasks. Section 4 presents the implementation and performance evaluation. Finally, our conclusions are given in Section 5.
2. Desktop Grid Environment
Figure 1 shows the overall architecture of our DGS model. As shown in Fig. 1, our DGS model consists of clients, an application management server, a storage server, coordinators, and volunteers. A client submits its application to the application management server. A coordinator looks after scheduling, computation group management, and agent management. A volunteer is a resource provider that contributes its own resources to process large-scale applications during CPU idle time. In this model, volunteers and a coordinator are organized into a computation group, which is the unit of scheduling. Within this group, the coordinator reorganizes work groups, the units in which a task is executed, according to the properties of each volunteer when the coordinator's scheduler allocates tasks to the volunteers.
Our DGS includes several phases:
1. Registration Phase: Volunteers register their static information (e.g., CPU speed, memory capacity, OS type) with the application management server. The application management server then sends this information to coordinators.

2. Job Submission Phase: A client submits a large-scale application to the application management server.

3. Task Allocation Phase: The application management server splits the application into a set of tasks and allocates them to coordinators.
4. Load Balancing Phase: Each coordinator inspects the number of tasks in its task pool and balances the load, either periodically or on demand, by transferring some tasks to other coordinators.

5. Adaptive Scheduling Phase: Each coordinator assigns tasks in its task pool according to the properties of available resources, using the fuzzy inference process.

6. Result Collection Phase: Each coordinator collects results from volunteers and performs result verification.

7. Job Completion Phase: Each coordinator returns a set of correct results to the application management server.
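Phases 4 through 6 above can be sketched as a single coordinator-side round. Everything in this sketch (function names, the surplus-shedding rule, the callables) is illustrative and not the paper's actual implementation:

```python
def coordinator_round(task_pool, peers, target_load, schedule, collect):
    """One hypothetical coordinator round mirroring phases 4-6.

    task_pool   -- list of pending tasks in this coordinator's pool
    peers       -- task pools (lists) of peer coordinators
    target_load -- desired local pool size after load balancing
    schedule    -- callable: task -> volunteer result (phase 5)
    collect     -- callable: verify and gather one result (phase 6)
    """
    # Phase 4: load balancing -- transfer surplus tasks to peer coordinators.
    while len(task_pool) > target_load and peers:
        peers[len(task_pool) % len(peers)].append(task_pool.pop())
    # Phase 5: adaptive scheduling -- assign the remaining tasks.
    results = [schedule(t) for t in task_pool]
    task_pool.clear()
    # Phase 6: result collection and verification.
    return [collect(r) for r in results]
```

In a real coordinator, `schedule` would embody the fuzzy inference process described in Section 3 rather than a plain callable.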
The application submitted by a client is divided into a sequence of batches, each consisting of mutually independent tasks in the set W = {w1, w2, ..., wn}, where n is the total number of tasks. This type of application follows the single program multiple data (SPMD) model, which runs the same code on different data. It is the typical application model in most DGSs, and thus we used an SPMD-type application in the present study.
3. Fuzzy-based Adaptive Replication
This section describes the fuzzy-based adaptive replication mechanism (FARM) for sabotage tolerance in applications with specific deadlines.
3.1. System Model
We assume that the application has been divided into tasks, that each task is independent, and that the tasks have to be returned within a deadline. FARM assesses correctness based on a volunteer's trust probability (TP). It also uses a result-return probability (RP), based on availability and dynamic information such as current CPU performance, to estimate a volunteer's ability to complete a task within a certain deadline. In FARM, the scheduler preferentially assigns tasks to volunteers with high TP and high RP. For volunteers with low TP and low RP, the scheduler applies a replication policy to the tasks according to the fuzzy inference process.
In the desktop grid computing environment, overall system performance is influenced by the dynamic nature of volunteers [12]. In order to classify volunteers into fuzzy sets according to this dynamic nature, we use a fuzzy inference process based on TP and RP. Both TP and RP are defined in Table 1:
Trust Probability (TP). The TP is a factor determined by the correctness of the computation results executed by a volunteer. The trust value TPi of the ith volunteer is

TPi = 1 - f/n, if n > 0
TPi = 1 - f,   if n = 0    (1)

In Eq. (1), TPi represents the trust value of volunteer vi, n is the number of correct results returned in our sampling scheme, and f is the probability that a volunteer chosen at random returns incorrect results.
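A minimal sketch of Eq. (1), assuming `n` counts the correct sample results returned by the volunteer and `f` is the estimated probability of an incorrect result (the function name is illustrative):

```python
def trust_probability(n: int, f: float) -> float:
    """Trust value TP_i of a volunteer, following Eq. (1).

    n -- number of correct sample results returned by the volunteer
    f -- probability that a randomly chosen volunteer returns an
         incorrect result
    """
    if n > 0:
        return 1.0 - f / n   # trust grows as more correct samples come back
    return 1.0 - f           # no samples yet: fall back to the prior 1 - f
```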
Result-return Probability (RP). The RP is the probability that a volunteer will complete a task within a given time in the presence of computation failures.

In a desktop grid environment, the task completion rate depends on both the availability and the performance (e.g., the number of floating point operations per second) of individual volunteers [12]. Therefore, we have defined RP as the probability that a volunteer will return a result before a specified deadline under computation failure. The average completion time ACTi of each volunteer is calculated by
ACTi = (Σk γk) / K    (2)

where γk represents the completion time of the kth sample and K is the number of samples completed by volunteer i.
To determine the time taken by a volunteer to perform a task through sampling by an agent, we define the estimation time ETi of each volunteer as follows:

ETi = µ · t    (3)

where µ represents the number of floating point operations for a sample by a dedicated volunteer, and t is the time required by that dedicated volunteer to execute one floating point operation. Using Eq. (3), we can estimate the completion time of volunteer vi in the absence of computation failures.
Then, the average failure ratio AFRi of volunteer i can be calculated using Eqs. (2) and (3):

AFRi = 1 - ETi / ACTi    (4)
If the time taken by volunteer i to complete a task follows an exponential distribution with rate AFRi, then the probability of volunteer i completing task j before the deadline dj is calculated as follows:

RPi(D ≤ dj) = ∫0^dj AFRi e^(-AFRi t) dt = 1 - e^(-AFRi dj)    (5)
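Eqs. (2) through (5) can be combined into a small sketch (function and variable names are illustrative; the exponential-distribution assumption is the one stated above):

```python
import math

def average_completion_time(sample_times):
    """ACT_i, Eq. (2): mean completion time over the K completed samples."""
    return sum(sample_times) / len(sample_times)

def estimation_time(mu, t):
    """ET_i, Eq. (3): failure-free completion time estimate.
    mu -- floating point operations in the sample
    t  -- seconds per floating point operation on this volunteer
    """
    return mu * t

def average_failure_ratio(et, act):
    """AFR_i, Eq. (4): AFR_i = 1 - ET_i / ACT_i."""
    return 1.0 - et / act

def result_return_probability(afr, deadline):
    """RP_i(D <= d_j), Eq. (5): 1 - exp(-AFR_i * d_j), assuming
    exponentially distributed completion times with rate AFR_i."""
    return 1.0 - math.exp(-afr * deadline)
```

For example, with sample times of 10, 12 and 14 seconds, ACT = 12; if the failure-free estimate is ET = 6, then AFR = 0.5 and RP for a deadline of 4 is 1 - e^(-2).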
Table 1. Parameters.

TP (Trust Probability): a factor determining the correctness of the computation results executed by a volunteer.
ACT (Average Completion Time): the mean completion time of the samples completed by the volunteer.
ET (Estimation Time): the estimated failure-free completion time, i.e., the number of floating point operations of a sample multiplied by the computing time of one floating point operation.
AFR (Average Failure Ratio): the average failure ratio, calculated from ET and ACT.
RP (Result-return Probability): the probability that a volunteer will complete a task within a given time under computation failures.
RD (Replication Degree): the number of replicas assigned to a task, determined by the fuzzy inference process.
where D represents the actual computation time of the volunteer.
3.2. Fuzzy Inference Process
A fuzzy set expresses the degree to which an element belongs to a set. The characteristic function of a fuzzy set takes values between 0 and 1, which denote the degree of membership of an element in a given set. To transform the requirements of an application (i.e., correctness and deadline) into fuzzy sets, we provide the empirical membership functions shown in Figure 2. In Fig. 2(a), the fuzzy sets for trust probability are determined by grouping the trust range [0, 1] of a volunteer: a TP value approaching 0 indicates nearly malicious behavior, while 1 denotes a fully trusted resource. Fig. 2(b) shows the membership functions for the five levels of RPi. If RP is near 0, the volunteer is very unlikely to return a result within a given time; on the contrary, if RP is almost 1, it will return a result within the deadline.
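Since the paper's exact breakpoints appear only graphically in Fig. 2, the following sketch assumes evenly spaced triangular membership functions over [0, 1] (all breakpoints and names are assumptions):

```python
def triangular(x, a, b, c):
    """Triangular membership: rises from a, peaks at b, falls to c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Five evenly spaced levels over [0, 1], loosely mirroring Fig. 2.
LEVELS = {
    "very low":  (-0.25, 0.0, 0.25),
    "low":       (0.0, 0.25, 0.5),
    "medium":    (0.25, 0.5, 0.75),
    "high":      (0.5, 0.75, 1.0),
    "very high": (0.75, 1.0, 1.25),
}

def fuzzify(x):
    """Membership degree of x in each of the five levels."""
    return {name: triangular(x, *abc) for name, abc in LEVELS.items()}
```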
In this paper, the fuzzy inference process determines the replication number for each volunteer from TPi and RPi. These two parameters are combined by fuzzy rules with the replication degree in order to return correct results within a given time. We set RDi in the range [1, 5] based on our previous experience; we chose 5 as the maximum number of replicas because the number of replicas rarely exceeded 5 in an experiment we conducted over the course of a month. As shown in Figure 3, we can infer the fuzzy set for each volunteer's replication degree (RD), which represents the degree of redundancy. Here, each set represents a computational redundancy. For example, tasks assigned to volunteers belonging to the medium set are executed with a redundancy of three. The set is determined according to the fuzzy inference rules, which are given as follows:
RULE 1: IF TPi is very high and RPi is very high, THEN RDi is very good.
RULE 2: IF TPi is high and RPi is high, THEN RDi is good.
...
The fuzzy inference rules determine the degree of redundancy from the TP and the RP in order to guarantee correctness and completion within a given deadline. In a Grid system, the membership functions and rules should be chosen according to the application and user requirements.
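The paper derives RDi by full fuzzy inference over the rule base. As a crisp approximation of the RULE 1 / RULE 2 pattern (the breakpoints, the worst-level rule, and the level-to-RD mapping are all assumptions, not the paper's method):

```python
LEVELS = ["very low", "low", "medium", "high", "very high"]

def level(x):
    """Quantize a probability in [0, 1] to one of five levels
    (evenly spaced breakpoints, chosen for illustration)."""
    return LEVELS[min(4, int(x * 5))]

def replication_degree(tp, rp):
    """Map the worse of the two levels to RD in [1, 5]: very high TP
    and RP -> 1 replica; very low TP or RP -> 5 replicas."""
    worse = min(LEVELS.index(level(tp)), LEVELS.index(level(rp)))
    return 5 - worse
```

This captures the monotone intent of the rules: the more trustworthy and reliable a volunteer, the fewer replicas its tasks need.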
4. Simulation and Performance Evaluation
In this section, we evaluate the performance of our FARM. To show the efficiency of the mechanism, we analyze it in terms of correctness and turnaround time, and compare its performance with that of both the voting-based (VST) and trust-based (TST) mechanisms.
4.1. Distribution of Volunteers
For the performance evaluation, we obtained the distribution shown in Fig. 4 from one week of log data for each volunteer in the Korea@Home DGS. In this figure, the trust probability on the x-axis is the probability of correct results returned by each volunteer to the coordinator through autonomous sampling. The y-axis represents the result-return
Figure 2. Membership functions for different levels of TPi and RPi: (a) membership function for five levels of TPi; (b) membership function for five levels of RPi (very low, low, medium, high, very high over the range [0, 1]).

Figure 4. Volunteer distribution according to result-return probability and trust probability. This figure shows each fuzzy set (very good, good, medium, bad, very bad) according to the fuzzy inference process.
probability, which is the probability of returning the result before a deadline. The boxes in Fig. 4 represent the fuzzy sets, i.e., the redundancy groups of volunteers determined by the fuzzy rules (see the fuzzy inference process in Section 3). As shown in the figure, the volunteers were classified into five fuzzy sets: very good, good, medium, bad, and very bad.
4.2. Comparison of Result Verification Mechanisms
We have compared FARM with the other result verification mechanisms (VST and TST). For the VST mechanism, we used five replicas per task in a work group. In the TST mechanism, a volunteer is assigned a randomly selected task on the basis of its trust set. For this comparison, turnaround time and resource utilization were measured for different numbers of tasks.
Figure 5(a) shows the resource utilization of the three result verification mechanisms. In this figure, we can observe that
Figure 5. Comparison of result verification mechanisms: (a) resource utilization; (b) turnaround time.

Figure 3. Membership function for different levels of RDi (very good, good, medium, bad, very bad over the range [1, 5]).
our FARM uses slightly fewer resources than TST and is far more efficient than VST. This means that our mechanism reduces the reallocation cost because it selects resources with high RP and TP in the computation group. As a result, FARM achieves a remarkable improvement in resource utilization compared with the VST and TST mechanisms.
Meanwhile, Figure 5(b) shows the turnaround time of the three result verification mechanisms. We observe that as the number of tasks increases, the turnaround time grows. This is not surprising, because the VST and TST mechanisms do not use list scheduling based on factors such as the result-return probability and trust probability. Our FARM has the fastest turnaround time of the three mechanisms. This is expected, as the scheduler in FARM allocates a task to resources that can complete it within a given deadline. As described previously, the other mechanisms do not consider the deadline in the resource selection phase. Accordingly, the results of some tasks may not be returned within the deadline, and such tasks must be reallocated to other resources, which increases the turnaround time.
From the results in Fig. 5, we can see that FARM achieves a faster turnaround time than the other result verification mechanisms, with relatively low resource redundancy.
5. Conclusion and Future Work
We have proposed a fuzzy-based adaptive replication mechanism that supports sabotage tolerance for deadline tasks in desktop grid systems. In this mechanism, the concept of replication groups was introduced to deal with the dynamic nature of volunteers during result verification. The result-return probability and trust probability were used as the criteria for organizing replication groups. Based on the fuzzy inference process, five fuzzy sets were presented, which are applied differently according to the volatility and trustworthiness of the volunteers. Using these concepts, our result verification mechanism helps guarantee that tasks return correct results within their deadlines.
Performance was evaluated through simulation of the FARM, VST, and TST mechanisms from the viewpoints of turnaround time and resource utilization. The results showed that FARM is superior to the other two mechanisms in terms of turnaround time, with relatively low resource redundancy.
Acknowledgment
This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD, Basic Research Promotion Fund).
References
[1] M. O. Neary and P. Cappello, "Advanced Eager Scheduling for Java-Based Adaptively Parallel Computing," Concurrency and Computation: Practice and Experience, Vol. 17, Iss. 7-8, pp. 797-819, Feb. 2005.
[2] D. Molnar, "The SETI@home Problem," http://turing.acm.org/crossroads/columns/onpatrol/september2000.html
[3] L. Sarmenta, "Sabotage-Tolerance Mechanisms for Volunteer Computing Systems," Future Generation Computer Systems, Vol. 18, No. 4, pp. 561-572, Mar. 2002.
[4] C. Germain-Renaud and N. Playez, "Result Checking in Global Computing Systems," Proc. of the 17th Annual Int. Conf. on Supercomputing, pp. 226-233, June 2003.
[5] C. Germain, G. Fedak, V. Neri, and F. Cappello, "Global Computing Systems," Lecture Notes in Computer Science, Vol. 2179, pp. 218-227, 2001.
[6] Korea@Home homepage, http://www.koreaathome.org/eng
[7] F. Azzedin and M. Maheswaran, "A Trust Brokering System and Its Application to Resource Management in Public-Resource Grids," Proc. of the 18th Int. Parallel and Distributed Processing Symposium, pp. 22-31, April 2004.
[8] W. Du, J. Jia, M. Mangal, and M. Murugesan, "Uncheatable Grid Computing," Proc. of the 24th Int. Conf. on Distributed Computing Systems, pp. 4-11, 2004.
[9] S. Zhao, V. Lo, and C. G. Dickey, "Result Verification and Trust-Based Scheduling in Peer-to-Peer Grids," Proc. of the 5th IEEE Int. Conf. on Peer-to-Peer Computing, pp. 31-38, Sept. 2005.
[10] SETI@home homepage, http://setiathome.ssl.berkeley.edu
[11] S. Choi, M. Baik, H. Kim, E. Byun, and C. Hwang, "Reliable Asynchronous Message Delivery for Mobile Agent," IEEE Internet Computing, Vol. 10, Iss. 6, pp. 16-25, Dec. 2006.
[12] D. Kondo, M. Taufer, C. L. Brooks, H. Casanova, and A. Chien, "Characterizing and Evaluating Desktop Grids: An Empirical Study," Proc. of the 18th Int. Parallel and Distributed Processing Symposium, pp. 26-35, April 2004.
[13] J. Dongarra, "Performance of Various Computers Using Standard Linear Equations Software," ACM SIGARCH Computer Architecture News, Vol. 20, pp. 22-44, June 1992.
[14] M. Castro and B. Liskov, "Practical Byzantine Fault Tolerance," Proc. of the Symposium on Operating Systems Design and Implementation, pp. 173-186, Feb. 1999.
[15] D. P. Anderson, "BOINC: A System for Public-Resource Computing and Storage," Proc. of the 5th IEEE/ACM Int. Workshop on Grid Computing, pp. 4-10, Nov. 2004.
[16] distributed.net homepage, http://www.distributed.net
[17] Entropia homepage, http://www.entropia.com
[18] L. F. G. Sarmenta and S. Hirano, "Bayanihan: Building and Studying Volunteer Computing Systems Using Java," Future Generation Computer Systems, Vol. 15, Iss. 5-6, pp. 675-686, Oct. 1999.
[19] O. Lodygensky, G. Fedak, F. Cappello, V. Neri, M. Livny, and D. Thain, "XtremWeb & Condor: Sharing Resources Between Internet-Connected Condor Pools," Proc. of the 3rd IEEE/ACM Int. Symposium on Cluster Computing and the Grid: Workshop on Global and Peer-to-Peer Computing on Large Scale Distributed Systems, pp. 382-389, May 2003.
[20] A. Baratloo, M. Karaul, Z. Kedem, and P. Wijckoff, "Charlotte: Metacomputing on the Web," Future Generation Computer Systems, Vol. 15, Iss. 5-6, pp. 559-570, Oct. 1999.
[21] J. Sonnek, M. Nathan, A. Chandra, and J. Weissman, "Reputation-Based Scheduling on Unreliable Distributed Infrastructures," Proc. of the 26th Int. Conf. on Distributed Computing Systems (ICDCS'06), pp. 30, July 2006.
[22] D. Kondo, F. Araujo, P. Malecot, P. Domingues, L. M. Silva, G. Fedak, and F. Cappello, "Characterizing Result Errors in Internet Desktop Grids," Lecture Notes in Computer Science, Vol. 4641, pp. 361-371, Aug. 2007.
[23] M. Taufer, D. Anderson, P. Cicotti, and C. L. Brooks III, "Homogeneous Redundancy: A Technique to Ensure Integrity of Molecular Simulation Results Using Public Computing," Proc. of the 19th IEEE Int. Parallel and Distributed Processing Symposium - Workshop 1, pp. 119a, April 2005.
Implementation of a Grid Performance Analysis and Tuning Framework using Jade Technology
Ajanta De Sarkar, Dibyajyoti Ghosh, Rupam Mukhopadhyay and Nandini Mukherjee
Department of Computer Science and Engineering, Jadavpur University, Kolkata 700 032, West Bengal, India.
Abstract – The primary objective in a Computational Grid
environment is to maintain performance-related Quality of
Services (QoS) at the time of executing the jobs. For this
purpose, it is required to adapt to the changing resource
usage scenario based on analysis of real time performance
data. The objectives of this paper are (i) to present a
framework for performance analysis and adaptive execution
of jobs, (ii) to focus on the object-oriented implementation
of a part of the framework and (iii) to demonstrate the
effectiveness of the framework by carrying out some
experiments. The framework deploys autonomous agents,
each of which can carry out its responsibilities
independently.
Keywords: Grid, hierarchical agent framework, performance
properties, Jade technology.
1 Introduction
Performance monitoring of any application in Grid is
always complex and challenging. In the absence of
precise knowledge about the availability of resources at any
point of time and due to the dynamic resource usage
scenario, prediction-based analysis is not possible. Thus,
performance analysis in Grid must be characterized by
dynamic data collection (as performance problems must be
identified during run-time), data reduction (as the amount of
monitoring data is large), low-cost data capturing (as
overhead due to instrumentation and profiling may
degrade the application performance), and adaptability
to heterogeneous environments.
In order to address the above issues, a hierarchical agent
framework has already been proposed in [3] and [4]. The
novelty of the framework is that it considers execution
performances of multiple jobs (possibly components of an
application or multiple applications) running concurrently in
Grid and aims at maintaining the overall performance of
these jobs at a predefined QoS level. Moreover, unlike other
traditional Grid performance monitoring systems, this
framework supports adaptation of the jobs to the changing
resource usage scenarios, either by enabling local tuning
actions or by migrating them onto a different resource provider
(host). This paper refines the design of the framework
presented in [3] and [4] and uses the concept of performance
properties that have been introduced in [5].
The paper also presents an object-oriented implementation
of a part of the agent framework. Implementation has been
done using the Jade framework. Jade agents can be active at the
same time on different nodes in Grid, and they can interact
with each other without incurring much overhead in real
time. The current implementation of the agent framework
deals with parallel Java programs written in JOMP 1.
Organization of this paper is as follows. Related work is
discussed in Section 2. The hierarchical organization of
analysis agents is briefed in Section 3. The concepts related
to performance properties and their relevance in the current
work are discussed in Section 4. Severity computation for
each property is introduced in Section 5. Section 6 presents
Jade framework-based implementation of a part of the agent-
based system. Section 7 and Section 8 give details of the
experimental setup and results. Section 9 concludes with a
direction of the future work.
2 Related Work
Grid performance tools, such as SCALEA-G [15] and
ASKALON [16], are usually based on the Grid Monitoring
Architecture (GMA). GMA provides an infrastructure based
on OGSA and supports performance analysis of a variety of
Grid services including computational resources, networks,
and applications. Performance tools associated with ICENI
[7] and GrADS [9] also focus on performance monitoring of
the applications running in Grid environment. The ICENI
project highlights a component framework, in which
performance models are used to improve the scheduling
decisions. The GrADS project focuses on building a
framework for both preparing and executing applications in
Grid environment [10]. Each application has an application
manager, which monitors the performance of that application
for QoS achievement. GrADS monitors resources using
NWS [17] and uses Autopilot for performance prediction
[12].
Unlike the previous systems, our work takes into account a
situation where multiple jobs are executing concurrently on
different resource providers. Overall performance of all these
jobs needs to be considered and monitored in order to
maintain a predefined QoS level. Thus, we use a hierarchical
agent structure, which is comparable to that of the Peridot
project [6]. However, unlike the framework in Peridot, here
1 JOMP implements an OpenMP-like set of directives and
library routines for shared memory parallel programming in
Java.
different categories of Analysis Agents (with various sub-goals)
are used. Moreover, if performance degrades, a
suffering job is locally tuned (or migrated) in order to achieve
the QoS level.
3 Hierarchical Organizations of
Analysis Agents
The hierarchical agent framework is part of a multi-agent
system [14], which supports performance-based resource
management for multiple concurrent jobs executing in Grid
environment. Within the system, a group of interacting,
autonomous agents work together towards a common goal.
Altogether, six types of agents are used; these are: Broker
Agent, ResourceProvider Agent, JobController Agent,
JobExecutionManager Agent, Analysis Agent, and Tuning
Agent. The functions of these agents and the interactions
among them have been thoroughly discussed in [14]. This
work deals with the last three agents, namely
JobExecutionManager Agent, Analysis Agent, and Tuning
Agent.
As envisaged in [11], Grid may be considered to have a
hierarchical structure with different types of Grid resources
like clusters, SMPs and even workstations of dissimilar
configurations positioned at its lowest level. All of these are
tied together through a middleware layer. A Grid site
comprises a collection of all these local resources, which are
geographically located in a single site. All these Grid sites,
which mutually agree to share resources located in several
sites, form an enterprise Grid. Enterprise Grid provides
support for multiple Grid resource registries and Grid
security services through mutually distrustful administrative
domains. In order to monitor the Grid at all these different
levels (which is necessary for overall monitoring and
execution planning of multiple concurrent jobs), a
hierarchical organization of the Analysis Agents is proposed.
Existing architectures (including GMA) do not provide
support for such a hierarchical organization. However, in our
work, we make use of multiple Analysis Agents and organize
them in a hierarchy. Each Analysis Agent at its own level of
deployment, however, resembles the consumer in
GMA, while the JobExecutionManager Agent (JEM Agent)
is the producer of the events in which the Analysis Agents
are interested.
In accordance with the above understanding, the Analysis
Agents are divided into the following four logical levels of
deployment in descending order: (1) Grid Agent (GA), (2)
Grid Site Agent (GSA), (3) Resource Agent (RA) and (4)
Node Agent (NA). At the four levels of the agent hierarchy, each
agent has some specific responsibilities [3]. A block diagram
presenting all the agents is shown in Figure 1. Among all the
agents at different levels, the current work concentrates on
the lowest level agents of the hierarchy, i.e. on Node Agents
(NAs) and the Tuning Agents (TAs).
4 Performance Properties and Agents
The main focus of our work is to capture the runtime
behaviour of a job and modify its behaviour during execution
so that the QoS requirement of the job as laid down in the
SLA is met. Design of an NA and TA therefore depends on
the concepts related to performance properties, which have
been thoroughly discussed in [5]. A performance property
characterizes specific performance behaviour of a program
and can be checked by a set of conditions. Every
performance property is associated with a severity figure,
which shows how important the performance property is in
the context of the performance of an application. When the
severity figure crosses a pre-defined threshold value, the
property becomes a performance problem and the
performance property with highest severity value is
considered to be a performance bottleneck.
Figure 1: Block diagram of Agents
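The severity rule described above (above-threshold properties become problems, the highest-severity problem is the bottleneck) can be sketched as follows; the function and parameter names are hypothetical:

```python
def find_bottleneck(severities, threshold):
    """Given {property_name: severity}, return (problems, bottleneck).

    Properties whose severity crosses the threshold are performance
    problems; the highest-severity problem is the bottleneck (None if
    no property crosses the threshold).
    """
    problems = {p: s for p, s in severities.items() if s > threshold}
    bottleneck = max(problems, key=problems.get) if problems else None
    return problems, bottleneck
```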
In our system, the resource provider publishes the policies
and defines the specific performance properties to be
checked for detecting any kind of performance bottleneck at
the time of execution of a job. For each performance
property, the evaluation process is also defined and stored
for later use by the NA. In addition to the OpenMP
properties described in [5], we have defined a new
performance property, namely Inadequate Resources. This
property identifies the problem of executing a
portion of a job sequentially (or on fewer processors)
when additional processors are required in order to
maintain the QoS.
An NA monitors all the jobs running on a particular resource
provider (an SMP or workstation). When execution of a job
starts, NA begins collecting monitoring data. On the basis of
this data and the performance property specifications (which
define the condition, confidence and severity for each property), it
evaluates the severities of specified performance properties
(using the process specifications) and generates Performance
Details Report (Figure 2). In order to evaluate the severity of
each performance property, the NA consults the SLA and the
JobMetaData (discussed later) sent by the client.
A Tuning Agent (TA), which is invoked by the NA, is
responsible for performing any local tuning action, so that
any performance bottleneck or performance problem can be
removed. The TA accepts the Performance Details Report
from the NA, and decides what actions to be taken after
consulting the Performance Property Specification and the
Performance Tuning Process Specification. Performance
Tuning Process Specification stores recommendations
regarding actions, which may be taken for every
performance problem depending upon their severity. The
process specifications are basically an expert’s knowledge
base, which may be created and stored on a particular
resource provider. The TA generates a Performance Tuning
Report (Figure 3) and sends it to the GSA for future reference.
Figure 2: Design for an Integrated Node Agent
We create subclasses of the process specifications for
every subclass of performance property. Thus, in our current
implementation, InadequateResourcesProcessSpecification
is a subclass of the class
PerformancePropertyProcessSpecification, which specifies
the analysis process of the Inadequate Resources property,
and LoadImbalanceProcessSpecification is another subclass,
which specifies the analysis process of the Load Imbalance
property. Similarly,
InadequateResourcesTuningProcessSpecification is a subclass
of the class PerformanceTuningProcessSpecification, which
specifies the tuning process for the property Inadequate
Resources, and LoadImbalanceTuningProcessSpecification is
another subclass of the same class, which specifies the tuning
process for the property Load Imbalance. A class diagram
containing the agent classes and the specification classes is
shown in Figure 4.
5 Severity Computation
At the time of allocating a job onto a specific resource
provider, a service level agreement (SLA) is established
between the resource provider and the client. During this
process, first the client sends Task Service Level Agreement
(TSLA) to the Broker Agent and the Resource Provider
Agent sends Resource Service Level Agreement (RSLA) to
the Broker Agent. When a specific resource provider is
selected for a job [13], the SLA is finalized and the client
sends Binding Service Level Agreement (BSLA) to the
Resource Provider Agent. The NA in this environment uses
the BSLA and the JobMetaData, which also comes as a part
of the BSLA. BSLA provides detailed information about the
requirements of the job and availability of resources, while
JobMetaData contains information regarding significant
parts (such as loops) of the job [3].
Figure 3: Design for an Integrated Tuning Agent
When a job is submitted onto a host, a JEM Agent,
which is a mobile agent is associated with it and is deployed
on the same host [14]. For each performance property, the
JEM Agent appropriately instruments the job, gathers
performance data at the time of its execution and sends the
data to the NA for computation of the severity figures.
A BSLA contains the expected completion time (Tect) of
a job. The JobMetaData contains information about each
loop, such as the start line and end line of a loop, its
proportionate execution time (Lfrac) with respect to total
execution time etc. These are all based on some pre-
execution analysis or historical information about the
execution of a job. These data are later used for computing
the severity of each performance property and deciding
which one of these is a problem.
In the case of the Inadequate Resources property, the severity
figure for a specific loop Li is given by

sev_resr(Li) = Tact_f(Li) / Tect_f(Li)    (1)

where the expected completion time of the f portion of the
specific loop Li is given by Tect_f(Li), which is computed
using the information in the BSLA and the JobMetaData. Thus,

Tect_f(Li) = Lfrac_i * Tect * f    (2)

The actual completion time of the f portion of the specific
loop Li (as measured by the JEM Agent during execution) is
given by Tact_f(Li).
In the case of the Load Imbalance Property, the execution times
on each processor are measured, and the severity figure for a
specific loop Li is given by

    sev_load(Li) = [(Tmax^f(Li) - Tavg^f(Li)) * 100] / Tmax^f(Li)    (3)

where Tmax^f(Li) is the maximum time spent by a processor
while executing the f portion of the loop and Tavg^f(Li) is the
average taken over all the processors executing the f portion of
the loop.
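As an illustration, the severity computations in equations (1)-(3) can be sketched in Java; the class and method names below are ours, not the framework's actual API:

```java
// Sketch of the severity computations in equations (1)-(3).
// Class and method names are illustrative, not the framework's actual API.
public class Severity {

    // Eq. (2): expected completion time of the f portion of loop Li,
    // from the BSLA (Tect) and the JobMetaData (Lfrac of loop Li).
    public static double expectedTime(double lfrac, double tect, double f) {
        return lfrac * tect * f;
    }

    // Eq. (1): severity of the Inadequate Resources Property for loop Li,
    // the ratio of measured to expected completion time of the f portion.
    public static double sevResr(double tact, double lfrac, double tect, double f) {
        return tact / expectedTime(lfrac, tect, f);
    }

    // Eq. (3): severity of the Load Imbalance Property for loop Li,
    // computed from the per-processor execution times of the f portion.
    public static double sevLoad(double[] perProcessorTimes) {
        double max = 0.0, sum = 0.0;
        for (double t : perProcessorTimes) {
            max = Math.max(max, t);
            sum += t;
        }
        double avg = sum / perProcessorTimes.length;
        return (max - avg) * 100.0 / max;
    }

    public static void main(String[] args) {
        // A loop expected to take 40% of a 100 s job; its f = 0.05 portion
        // was expected in 2 s but actually took 4 s -> severity 2.0.
        System.out.println(sevResr(4.0, 0.4, 100.0, 0.05));
        // Four processors with imbalanced times -> severity in percent.
        System.out.println(sevLoad(new double[]{10.0, 10.0, 10.0, 20.0}));
    }
}
```

In this sketch, a value above 1.0 in equation (1) means the measured fraction took longer than expected, and equation (3) expresses the imbalance as a percentage of the slowest processor's time.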
6 Jade Implementation of the Agents
The entire multi-agent framework [14] has been
implemented using Java RMI technology, and part of our
hierarchical analysis agent framework has been implemented
using the Jade framework [1]. Jade is a Java Agent
Development Environment built on Java RMI registry
facilities that provides standard agent technologies. The
agent platform can be distributed over several hosts, each of
which executes one Java Virtual Machine. Jade follows the
FIPA standard, so its agent communication platform is FIPA-
compliant; it uses the Agent Communication Language (ACL)
for efficient communication in a distributed environment. The
Jade framework supports agent mobility, and agents can execute
independently and in parallel on different network hosts.
Jade uses one thread per agent. Agent tasks and agent
interactions are implemented through logical execution
threads; these threads, or behaviours, of the agents can be
initialized, suspended and spawned at any time.
In Jade, multiple agents can interact with each other
using their own containers [1]. Containers are the actual
execution environments for the agents. Typically, multiple
agents are active at the same time on different nodes with
various containers, but there is only one central agent, and it
needs to start first. In our implementation, the NA is initiated
first as the central agent, and it coordinates with the other
agents. In a Grid environment, it is important that performance
analysis is done at run-time and that tuning actions are taken
at run-time without incurring much overhead. Thus, the active
Jade agents on a Grid resource cooperate and interact with
each other in order to detect performance problems in real time.
Initially, the performance properties that will be checked by the
NA, and the priority of checking them, are decided on the
basis of the nature of the job and the kind of resource
provider. A job is instrumented depending on the types of
data the Analysis Agent needs to collect. As soon as the
job starts its execution, the JEM Agent communicates with the
corresponding NA residing at that particular node.
The NA receives the BSLA and the JobMetaData from the JEM
Agent along with a ‘Ready’ message. It then sends a ‘Query’
message to the JEM Agent with a fraction value (e.g. 0.05)
indicating the portion of a significant block of the job (in the
current implementation, a significant loop) to be executed
before any performance data is collected. The job starts and
continues its execution up to the specified fraction of the
significant block. After completing this portion, the JEM Agent
sends the execution performance data to the NA and suspends
the job.
Figure 4: Class Diagram for Node Agent and Tuning Agent
The NA immediately starts analyzing the data on the basis of
the commitment included in the BSLA. While doing this, the NA
computes the severity of a specific performance property. If the
computed severity is greater than some threshold value, the NA
sends a ‘Warning’ message to the JEM Agent and invokes the
TA. The NA also sends a ‘Performance_Details’ message (an XML
form of the Performance Details Report) to the TA, mentioning
the identified performance problem and its severity along with
other details. The TA decides on a tuning action and directs the
JEM Agent to resume the job after the tuning action has been
applied. If no performance problem is detected by the NA, the
job simply resumes its execution. The next performance
property (according to the priority list) is checked after the job
executes another fraction of the significant block and data is
collected as before.
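The decision logic of this analysis cycle can be sketched as a simple control loop. The sketch below is our own simplified model, not the actual Jade implementation or its ACL message handling; the class, types and threshold values are illustrative assumptions:

```java
// Simplified sketch of the NA's analysis cycle: for each property in the
// priority list, compare its computed severity against a threshold and
// either let the job resume or warn the JEM Agent and invoke the TA.
// Types and thresholds are illustrative, not the framework's actual code.
import java.util.List;

public class NodeAgentSketch {

    public enum Decision { RESUME, WARN_AND_TUNE }

    // One analysis step for the current performance property.
    public static Decision analyze(double severity, double threshold) {
        return severity > threshold ? Decision.WARN_AND_TUNE : Decision.RESUME;
    }

    // Walk the priority-ordered list of (severity, threshold) pairs and
    // return the index of the first property that is a problem, or -1.
    public static int firstProblem(List<double[]> severityAndThreshold) {
        for (int i = 0; i < severityAndThreshold.size(); i++) {
            double[] st = severityAndThreshold.get(i);
            if (analyze(st[0], st[1]) == Decision.WARN_AND_TUNE) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Property 0 (e.g. Inadequate Resources) is within its threshold;
        // property 1 (e.g. Load Imbalance) exceeds it and triggers tuning.
        System.out.println(firstProblem(List.of(
                new double[]{1.1, 1.5},   // severity 1.1 vs threshold 1.5
                new double[]{37.5, 25.0}  // severity 37.5 vs threshold 25.0
        )));
    }
}
```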
It is often possible that multiple jobs execute on the
same host (particularly on an SMP system). Consequently,
multiple JEM Agents are associated with these jobs,
although there is only one NA. In our system, communication
between the JEM Agents and the sole NA continues until the NA
starts analyzing the data for a particular job. Since a job remains
suspended during the analysis, it is desirable that the analysis
time be as short as possible, so the NA does not accept any
communication during this time; communication resumes when
the analysis is over. Interactions among the agents during the
analysis and local tuning process are depicted in the
sequence diagram in Figure 5.
In the current implementation, we have implemented the
specifications, computations and tuning related to the Inadequate
Resources Property and the Load Imbalance Property. The
next sections provide experimental results demonstrating the
analysis and tuning of these properties.
7 Experimental Setup
A local Grid test bed has been set up that consists of
heterogeneous nodes running Linux. The computational
nodes of the test bed include an HP NetServer LH 6000 with 2
processors, an HP ProLiant ML570 G4 with 4 Intel Core 2 Duo
processors, Intel Core 2 Duo PCs and Intel Pentium 4 PCs.
The nodes communicate through a fast local area network.
The Grid test bed is built with Globus Toolkit 4.0 (GT4) [8],
and the multi-agent system (which includes the hierarchical
analysis agent framework) is implemented on top of GT4.
The agents (NA, JEM Agent and TA) are deployed on every
node.
This paper demonstrates the results of local performance
tuning of multiple applications running on the same node: a
single NA is responsible for analyzing the performance of
multiple jobs submitted to that node and running
simultaneously. Experiments have been carried out on an HP
ProLiant ML570 G4 with 4 Intel Core 2 Duo processors
(referred to as the HPserver) in the Grid environment. We have
used Java codes with JOMP [2] directives as test codes and
executed them in the Jade framework. When a particular
performance problem is identified by the NA, the TA decides to
tune the code locally (if possible) or otherwise to migrate it.
Because we have considered only two performance properties,
the TA either provides additional processors to execute the job
(in order to overcome the Inadequate Resources performance
problem) or changes the scheduling strategy of the parallel
regions of the job (in the case of the Load Imbalance
performance problem).
For example, if the job initially executes on p (>= 1)
processors, a performance enhancement may be achieved by
providing more processors at run-time. The TA decides whether
the job will continue with p processors or with q (> p)
processors, and resumes the job after allocating the additional
resources to it.
8 Results
Two sets of experiments have been carried out. The first
set detects the Inadequate Resources performance property,
and the second detects both the Inadequate Resources and the
Load Imbalance performance properties according to a
predefined order. The following two subsections present the
results of these experiments.
Figure 5: Agent Interaction
8.1 Detecting Inadequate Resources Property
In this case, we experimented with two different test codes.
In order to demonstrate the effectiveness of our system, we
submitted multiple instances of the same code as separate jobs,
initiated at different times. Two experiments were carried out,
each executing two Matrix Multiplication jobs and two Gauss
Elimination jobs. These jobs started at different times, but major
portions of them executed concurrently. The data sizes differ
between jobs (1000 and 2000 in Case I, and 3000 and 4000 in
Case II). In both cases, experimental data were collected in
three scenarios: Scenario 1 occurs if the system continues with
the job as submitted by the client, Scenario 2 occurs if tuning
actions are taken at run-time (based on our algorithm), and
Scenario 3 is the result of running the job in the situation that
best fits the specific job and resource provider.
Scenario 1: The test codes execute on the HPserver with one
processor, without any interaction with the NA and TA.
Scenario 2: The test codes execute initially on one processor and
interact with the NA. After computing a certain fraction of the
significant loop (here 0.05), the NA detects an Inadequate
Resources performance problem and the TA tunes each job to
run (its remaining part) on four processors of the HPserver.
Scenario 3: The test codes execute entirely on four processors of
the HPserver, without interaction with the NA and TA.
Figures 6(a) and 6(b) compare the execution times of the
Matrix Multiplication test codes in the above three scenarios.
The results demonstrate that the performance obtained in
Scenario 2 is almost the same as in Scenario 3, which signifies
that the overhead of run-time analysis and tuning is nominal.
Similar results were obtained for Gauss Elimination.
Figures 7(a) and 7(b) show the overheads associated with the
execution of all four jobs (two Matrix Multiplications and two
Gauss Eliminations) in Scenario 2. The times required for
run-time analysis, tuning and communication among the agents
were measured and are shown in the figures. These overheads
are negligible compared to the performance improvement of the
jobs, even though only one NA is responsible for the
performance analysis of multiple jobs running concurrently on
the same resource.
Figure 6(a): Performance Improvement for Test Case I
Figure 6(b): Performance Improvement for Test Case II
8.2 Periodic Measurements for Detection of Properties
In this experiment, we demonstrate periodic measurement of
performance data and the detection of more than one property
according to a given priority. When the job starts executing, the
first f portion of its significant block is executed and
performance data related to a specific performance property is
gathered. If a performance problem is detected, the TA takes a
tuning action and resumes the job. After the next f portion of
the same block has executed, performance data is again
collected and the second property is checked. If a performance
problem is detected, the TA takes another tuning action. Thus,
the job continues with periodic measurements of performance
data and tuning of the job based on the analysis of these data.
An LU factorization job has been used for this experiment.
Here the NA decides to check the Inadequate Resources
Property first and then the Load Imbalance Property. The
experiment compares the following three scenarios. As before,
Scenario 1 occurs if the system continues with the job as
submitted by the client, Scenario 2 occurs if tuning actions are
taken at run-time (based on our algorithm), and Scenario 3 is
the result of running the job in the situation that best fits the
specific job and resource provider.
Figure 7(a): Overhead Calculation for Test Case I (chart legend: 5%, Proc=1; total overhead; 95%, Proc=4)
Figure 7(b): Overhead Calculation for Test Case II (chart legend: 5%, Proc=1; total overhead; 95%, Proc=4)
Scenario 1: The test code executes entirely on the HPserver with
one processor and a static scheduling strategy, without any
interaction with the NA and TA.
Scenario 2: The test code executes initially on one processor
with static scheduling and interacts with the NA. After
computing a certain fraction of the significant loop (here 0.05),
the NA detects an Inadequate Resources problem and the TA
tunes the job to run (its remaining portion) on four processors
of the HPserver. After computing the next 0.05 portion of the
significant loop, the NA detects a Load Imbalance problem and
the TA tunes the job to run (its remaining portion) on four
processors of the HPserver with a dynamic scheduling strategy.
Scenario 3: The test code executes entirely on the HPserver with
four processors and dynamic scheduling, without any interaction
with the NA and TA.
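The static and dynamic scheduling strategies referred to in these scenarios can be illustrated with plain Java threads. This is a self-contained sketch of the two strategies only; the actual test codes rely on JOMP directives, and the class and method names here are ours:

```java
// Illustration of static vs. dynamic loop scheduling using plain Java
// threads. The actual test codes use JOMP directives; this sketch is shown
// only to contrast the two strategies.
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class SchedulingSketch {

    private static void joinAll(Thread[] pool) {
        for (Thread t : pool) {
            try { t.join(); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Static schedule: iterations are split into fixed contiguous chunks,
    // one per thread, regardless of how long each iteration takes.
    public static long runStatic(int iterations, int threads, int[] work) {
        AtomicLong total = new AtomicLong();
        Thread[] pool = new Thread[threads];
        int chunk = (iterations + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            int lo = t * chunk, hi = Math.min(iterations, lo + chunk);
            pool[t] = new Thread(() -> {
                long s = 0;
                for (int i = lo; i < hi; i++) s += work[i];
                total.addAndGet(s);
            });
            pool[t].start();
        }
        joinAll(pool);
        return total.get();
    }

    // Dynamic schedule: each thread repeatedly grabs the next iteration
    // from a shared counter, so uneven iterations balance automatically.
    public static long runDynamic(int iterations, int threads, int[] work) {
        AtomicLong total = new AtomicLong();
        AtomicInteger next = new AtomicInteger();
        Thread[] pool = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            pool[t] = new Thread(() -> {
                long s = 0;
                int i;
                while ((i = next.getAndIncrement()) < iterations) s += work[i];
                total.addAndGet(s);
            });
            pool[t].start();
        }
        joinAll(pool);
        return total.get();
    }

    public static void main(String[] args) {
        int[] work = new int[100];
        for (int i = 0; i < work.length; i++) work[i] = i;  // uneven "work"
        // Both schedules compute the same sum; they differ only in how the
        // iterations are distributed among the threads.
        System.out.println(runStatic(work.length, 4, work));
        System.out.println(runDynamic(work.length, 4, work));
    }
}
```

Scenario 2 corresponds to switching from the static to the dynamic strategy at run-time once the Load Imbalance severity of equation (3) crosses its threshold.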
The results of running the job in the above three scenarios
are depicted in Figure 8. It is clear from the figure that there
is a significant improvement in Scenario 2 compared to
Scenario 1, while little overhead is incurred (compared to
Scenario 3) for the performance analysis and tuning of the job.
Figure 8: Performance Improvement of LU
9 Conclusion
In this paper, we have presented the design and object-
oriented specification for the implementation of the Node Agent
in a Hierarchical Analysis Agent framework for Grid
environments. This framework is used for performance analysis
of applications executing concurrently on distributed systems
(such as Grids) and also for dynamically improving their
execution performance. The paper highlights the interaction and
exchange of information among the agents for collecting data,
analyzing it and improving performance through the application
of local tuning actions, and it discusses the implementation of
some of these agents. The results presented here demonstrate
the effectiveness of the framework by showing the performance
improvement achieved through tuning and by showing that the
agent control overheads are negligible even when multiple jobs
are submitted concurrently to the same resource. In future work
we shall experiment with more complex systems, and we will
focus on incorporating more categories of performance
properties and implementing the other analysis agents in the
hierarchy.
10 References
[1] Bellifemine, F., Poggi, A. and Rimassa, G., “Developing
multi-agent systems with a FIPA-compliant agent framework”,
Software – Practice & Experience, 31: 103-128.
[2] Bull, J. M. and Kambites, M. E., “JOMP – an OpenMP-like
interface for Java”, Proceedings of the ACM 2000 Java Grande
Conference, pp. 44-53, June 2000.
[3] De Sarkar, A., S. Kundu and N. Mukherjee, “A
Hierarchical Agent Framework for Tuning Application
Performance in Grid Environment”, in Proceedings of the
2nd IEEE APSCC 2007, Tsukuba, Japan, December 11-14,
2007, pp. 296-303.
[4] De Sarkar A., S. Roy, S. Biswas and N. Mukherjee, “An
Integrated Framework for Performance Analysis and Tuning
in Grid Environment” in Web Proceedings of the
International Conference on High Performance Computing
(HiPC ’06), December 2006.
[5] Fahringer, T., Gerndt, M., Riley, G. D., and Träff, J. L.,
“Formalizing OpenMP Performance Properties with ASL”,
in Proceedings of the Third International Symposium on
High Performance Computing (October 16-18, 2000),
ISHPC, pp. 428-439.
[6] Furlinger, K., “Scalable Automated Online Performance
Analysis of Applications using Performance Properties”,
Ph.D. Thesis, Technical University of Munich, Germany,
2006.
[7] Furmento, N., A. Mayer, S. McGough, S. Newhouse, T.
Field, and J. Darlington, “ICENI: Optimization of
Component Applications within a Grid Environment”,
Parallel Computing, 28(12): 1753-1772, 2002.
[8] Globus Toolkit 4.0 – available at
www.globus.org/toolkit.
[9] GrADS: Grid Application Development Software
Project, http://www.hipersoft.rice.edu/grads/.
[10] Kennedy K., et al, “Toward a Framework for Preparing
and Executing Adaptive Grid Programs”, Proceedings of the
International Parallel and Distributed Processing Symposium
Workshop (IPDPS NGS), IEEE Computer Society Press,
April 2002.
[11] Kesler J. Charles, “Overview of Grid Computing”,
MCNC, April 2003.
[12] Ribler, R. L., H. Simitci and D. A. Reed, “The Autopilot
Performance-Directed Adaptive Control System”, Future
Generation Computer Systems 18(1), pp. 175-187,
September 2001.
[13] Roy, S., M. Sarkar and N. Mukherjee, “Optimizing
Resource Allocation for Multiple Concurrent Jobs in Grid
Environment”, accepted for publication, in Proceedings of the
Third International Workshop on Scheduling and Resource
Management for Parallel and Distributed Systems (SRMPDS
’07), Hsinchu, Taiwan, December 5-7, 2007.
[14] Roy, S. and N. Mukherjee, “Utilizing Jini Features to
Implement a Multiagent Framework for Performance-based
Resource Allocation in Grid Environment”, Proceedings of
GCA’06 – The 2006 International Conference on Grid
Computing and Applications, pp. 52-58.
[15] Truong H.-L. , T. Fahringer, "SCALEA-G: A Unified
Monitoring and Performance Analysis System for the Grid",
Scientific Programming, 12(4): 225-237, IOS Press, 2004.
[16] Wieczorek M., R. Prodan and T. Fahringer, “Scheduling
of Scientific Workflows in the ASKALON Grid
Environment”, SIGMOD Record, Vol. 34, No. 3, September
2005.
[17] Wolski, R., N.T. Spring and J. Hayes, “The Network
Weather service: A distributed performance forecasting
service for metacomputing”, Future Generation Computer
Systems 15(5), pp. 757-768, October 1999.
Optimizing Database Resources on the Grid
Prof. MSc. Celso Henrique Poderoso de Oliveira1 and Prof. Dr. Maurício Almeida Amaral2
1FIAP – Faculdade de Informática e Administração Paulista, Av. Lins de Vasconcelos, 1264 – Aclimação – São Paulo – SP
2CEETEPS – Centro Paula Souza, Rua dos Bandeirantes, 169 – Bom Retiro – São Paulo – SP
[email protected], [email protected]
Abstract. Research and development activities relating to
Grid computing are growing in the academic field and
have reached corporate organizations. It is now possible to
join desktop computers in a grid environment, increasing
the processing and storage capability of application
systems. Although this has allowed rapid progress in
building various aspects of Grid infrastructure, the
integration of different resources, including databases, is
fundamental. The use of relational databases, and the
distribution of queries among them, is vital for developing
consistent grid applications. This paper shows the planning,
distribution and parallelization of database queries on a
grid, and how a single complex query can optimize the use
of database resources.
Keywords: Database, Grid Computing, Query,
Distribution, OGSA-DAI
1 Introduction
One of the main obstacles to using grid environments
outside academia is that flat files are used more than
relational databases, and little middleware exists to integrate
database resources into the grid. Most corporate organizations
use relational or object-relational databases to store and
manage data.
This paper presents the results of research that used an
algorithm to plan, distribute and parallelize SELECT
statements over database management systems (DBMSs) on
the grid. To achieve this goal, we used the main middleware
that provides access to databases on the grid: Open Grid
Services Architecture – Data Access and Integration
(OGSA-DAI). OGSA-DAI receives a user request and submits
it to the available database resources on the grid. The module
we created intercepts the user request and, before OGSA-DAI
submits it, parses the complex query into simple ones. It then
asks OGSA-DAI for the available resources and finally
submits the simple queries to the available databases. After the
execution of each query, the module joins all results into a
single response file, which is sent back to the user. The
process is transparent to the end-user.
The paper is divided into four sections. The second
presents the fundamentals of grids, planning and databases.
The third section presents the results, and the last section
lists the conclusions.
2 Grid, Plan and Database Fundamentals
Distributed processing is an important research area
because the evolution of individual components will not keep
up with the needs of data processing [3]. Distributed storage
is important as well, for security reasons and because of the
amount of data to be stored.
Grid computing is a form of distributed processing that
uses distributed resources – computers, storage, databases –
that can be integrated and shared as if they were a single
environment [4]. It uses middleware that balances the data
and processing load and provides access security and failure
control. The main aspects of a grid computing environment
include decentralization and the heterogeneity of resources
and services. The services and resources are used to provide
data management and processing through the network.
As resources are shared in a network, it is important to
keep access under control and management. A group of
organizations or people that share the same interests in a
controlled way is called a Virtual Organization (VO). A VO
generally uses its resources to achieve a specific goal [4].
Databases are important resources to use and share in a
VO, since companies commonly store their data in databases.
Grid middleware can identify, locate and submit queries to
databases, whether local or remote. One of the most important
middleware systems used to manage databases on the grid
is OGSA-DAI.
Watson [11] establishes a proposal for integrating
databases into the grid. Some acceptable parameters are grid
and database independence, and the use of existing databases
instead of creating new ones. Databases use some of the
existing grid services, and it is important to use the same
services [12]: security, database programming, web services
integration, scheduling, monitoring and metadata.
2.1 Grid Database Services
Database services are grid services that implement a
database interface [5]. An Open Grid Services Architecture
(OGSA) grid data service is a grid service that implements one
or more interfaces for access to, or management of, distributed
resources [5]. A grid data service uses one of four basic
interfaces to control the different behaviors of a database.
These interfaces use specific Web Services Description
Language (WSDL) ports, and when a service implements one
of these interfaces it is known as an Open Grid Services
Infrastructure (OGSI) Web Service.
The Data Access and Integration Services (DAIS) working
group defines the key database services that should be used on
the grid, and Data Access and Integration (DAI) implements
these definitions on OGSA. The main goal of OGSA-DAI is to
create an infrastructure for deploying high-quality grid
database services [1].
2.2 Plan
A planning problem has an initial state description, a
partial description of the goal, and the tasks that map the
transition of elements from one state to another [6].
Algorithms, simple or complex, can be used to achieve the
goal of a plan, and it is possible to include temporal aspects,
uncertainties and optimization properties. In Artificial
Intelligence terms, applying tasks in a workflow and allocating
resources to each task is a planning problem [2]. Each task
component is modeled as a planning operator whose
preconditions and effects describe its input and output data
relations.
A partial-order plan is created from at least two tasks in
a plan [9]. The order of task execution is not important.
Sometimes there can be an ordering restriction if some data
must be used in the next step of the plan, but this does not
affect task parallelization, because the link between data
input and output is preserved. The great advantage of this
planning technique is that basic tasks that are independent of
each other can be executed in parallel. When the independent
tasks have finished, their results are joined to return the
final result.
Heuristics can be used to determine the best execution
plan. In a VO there are different heterogeneous and limited
resources that can be used. Some important decisive factors
for scheduling database resources are resource utilization,
response time, global and local access policy, and
scalability [8].
In this paper, the planning operator is the SQL SELECT
command. The planning determines the submission sequence
of this command to different databases through a grid
database service. As the complex SQL SELECT command is
parsed into several simple queries, they can be submitted in
parallel.
3 Results
The goal was to create a service that makes an execution
plan to select the available relational databases on the grid
and submit queries based on the SELECT command. The
OGSA-DAI middleware was chosen because it is the most
widely used; it is installed on the Globus Toolkit. The
proposal was to develop a grid data service to be used in a
virtual organization. The developed service uses extraction
techniques and heuristics to solve the query planning problem.
The metrics used to define the heuristics for a database
management system were CPU, available memory, network
bandwidth, and the I/O volume of stored data. These data
were extracted from the databases' metadata.
Most databases can parallelize and distribute queries, but
this is done within the same environment and using the same
product. When different databases are used, many limitations
must be observed. Clustered databases can do the same job,
but they are centralized and supervised by a single control
center. Grid computing aims to bypass this limitation: small
databases can be installed on heterogeneous resources and
ordinary desktop computers.
3.1 Planning Phases
The planning system was divided into five main parts: (1)
the user request is intercepted by the service; (2) the service
asks the middleware for the available databases; (3) the
complex SQL query is parsed into simple queries; (4) each
simple query is distributed to the middleware; and (5) the
results are joined and sent back to the user.
Figure 1 shows the planning inputs and outputs.
OGSA-DAI identifies and communicates with the databases.
The first information needed is the available resources and
tables for executing each query command. After the parsing,
the planning system establishes the execution plan, which is
then submitted to the middleware. The middleware executes
each query on the databases. The service keeps track of the
submissions; in the final phase it receives all rows, joins them
and sends the result to the user.
Figure 1 – Service Planning Inputs and Outputs
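As a minimal illustration of the parsing phase (part 3 above), the sketch below decomposes a multi-table SELECT into one single-table SELECT per table. The class is our own illustrative code, not the actual service; it assumes a simple "select ... from ... where ..." shape and omits the resource-metric heuristics and the final join of partial results (part 5):

```java
// Simplified sketch of part (3): parsing a complex SELECT over several
// tables into one simple single-table SELECT per table. The real service
// also applies heuristics based on resource metrics, omitted here.
import java.util.ArrayList;
import java.util.List;

public class QuerySplitter {

    // Split "select ... from t1 a, t2 b where ..." into one
    // "select <columns of alias> from <table> <alias>" per table.
    // Assumes the query starts with "select " and uses table aliases.
    public static List<String> split(String complexQuery) {
        String lower = complexQuery.toLowerCase();
        int from = lower.indexOf(" from ");
        int where = lower.indexOf(" where ");
        String selectList = complexQuery.substring(7, from).trim();
        String fromList = (where < 0 ? complexQuery.substring(from + 6)
                                     : complexQuery.substring(from + 6, where)).trim();
        List<String> simple = new ArrayList<>();
        for (String ref : fromList.split(",")) {
            String[] parts = ref.trim().split("\\s+");   // "table alias"
            String alias = parts[parts.length - 1];
            List<String> cols = new ArrayList<>();
            for (String col : selectList.split(",")) {
                if (col.trim().startsWith(alias + ".")) cols.add(col.trim());
            }
            simple.add("select " + String.join(", ", cols) + " from " + ref.trim());
        }
        return simple;
    }

    public static void main(String[] args) {
        // Query (i) from the test set: two tables -> two simple queries.
        for (String q : split("select a.gene_id, a.accession, b.gene_id, "
                + "b.sample_id from lymph_map a, lymph_exp b "
                + "where a.gene_id = b.gene_id")) {
            System.out.println(q);
        }
    }
}
```

Note that the join condition is dropped from the subqueries; in the service it is applied when the partial results are joined in part (5).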
The tests were made over the Shipp [10] genetic database.
It is a complex model that stores a large amount of DNA data.
There are many attributes in each table, and analyzing the data
is a good challenge for our service. On the other hand, the
model can be easily understood and could be used in
non-academic institutions. Figure 2 shows the
entity-relationship model of this database.
Figure 2 – Entity-Relationship Model
There are four tables in this model: LYMPH_OUTCOME,
LYMPH_EXP, LYMPH_MAP and LYMPH_TEMP. The first
stores clinical result data and contains 58 rows. The second
stores the genetic values for each different model; DNA
sequences are represented in this table, which contains 7,129
rows. The third also has 7,129 rows and stores the gene
identities. The fourth stores 413,482 rows and is the pivot of
the data in the other tables; it stores the original genetic
expressions.
The queries used to test the algorithms are:
i) select a.gene_id, a.accession, b.gene_id, b.sample_id
   from lymph_map a, lymph_exp b
   where a.gene_id = b.gene_id

ii) select a.gene_id, a.accession, b.gene_id, b.sample_id, c.sample_id, c.status
    from lymph_map a, lymph_exp b, lymph_outcome c
    where a.gene_id = b.gene_id and b.sample_id = c.sample_id

v) select a.gene_id, a.accession, b.gene_id, b.sample_id, d.dlbc1
   from lymph_map a, lymph_exp b, lymph_temp d
   where a.gene_id = b.gene_id and a.gene_id = d.gene_id

vi) select a.gene_id, a.accession, b.gene_id, b.sample_id, c.sample_id, c.status, d.dlbc1
    from lymph_map a, lymph_exp b, lymph_outcome c, lymph_temp d
    where a.gene_id = b.gene_id and b.sample_id = c.sample_id and a.gene_id = d.gene_id
The main goal of these queries was to identify the behavior
of the service planning. With these queries it was possible to
test the parsing phase of the complex SQL SELECT command,
the parallel distribution of the resulting simple queries onto
the resources, and the response time. The queries use regular
joins to test the ability to distribute queries over the available
databases. Query (i) uses two tables, queries (ii) and (v) use
three tables, and query (vi) uses four tables.
There were only three available databases on the grid.
Two servers were connected by a local network: Oracle 10g R2
was installed on one of them, and the other server ran one
MySQL 5.0 and one Oracle 10g R1 database. OGSA-DAI and
Globus were installed on the same server. The best metrics
were for the databases on the OGSA-DAI server, because
there was no need to use the network; this server also had
more available memory and the faster processor.
All databases were populated with the same tables and
rows. We used replicated data because our goal was only to
test the ability to distribute queries on the grid. Figure 3 shows
the test results.
Figure 3: Query Distribution on the available resources and processing time
Figure 3 shows that there was resource usage optimization
in the virtual organization when the service was used. Based
on this result, the more complex the query and the more
databases available on the grid, the better the query
distribution. As shown, the planning system used all available
databases, depending on the query. When the SQL SELECT
command used only two tables, the planning service used two
databases. When the query used three tables, the service used
the three available databases. When the query used four
tables, the service used the maximum number of available
databases (three) and waited for the first one to become
available to submit the last query.
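Under the simplifying assumption that all subqueries take similar time (so "first free database" reduces to round-robin), this distribution behavior can be sketched as follows. The class is illustrative only; the real service delegates query submission to OGSA-DAI:

```java
// Sketch of distributing n subqueries over m available databases: each
// subquery goes to a distinct database, and when there are more subqueries
// than databases the remainder waits for the first database to free up.
// Assumes equal subquery times, so "first free" reduces to round-robin.
import java.util.ArrayList;
import java.util.List;

public class Distributor {

    // Returns, for each subquery index, the database index it runs on.
    public static List<Integer> assign(int subqueries, int databases) {
        List<Integer> assignment = new ArrayList<>();
        for (int q = 0; q < subqueries; q++) {
            assignment.add(q % databases);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Query (vi): four subqueries over three databases -> the fourth
        // subquery reuses database 0 once it becomes free.
        System.out.println(assign(4, 3));  // [0, 1, 2, 0]
        // Query (i): two subqueries, so only two databases are used.
        System.out.println(assign(2, 3));  // [0, 1]
    }
}
```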
Response times were better when all available databases
were used, i.e., for queries (ii) and (v). When there were more
available databases than tables in the query, the response time
was not better. When there were more tables in the query than
available databases (vi), the response time was almost the
same as for (ii). This shows that, under the virtual organization
concept, it is important to use as many resources as possible,
as this improves the response time of complex queries.
3.2 Related Work
There is a similar work that distributes queries on
OGSA-DAI. The goal of Distributed Query Processing
(OGSA-DQP) is to implement a declarative, high-level
language for accessing and analyzing data. It tries to simplify
resource management on the grid using some services. The
main problem with this solution is that it does not use all
OGSA-DAI services. There are other problems: some
databases cannot be used with DQP, and it relies on the Polar*
partitioning service [7], which is executed outside OGSA-DAI.
It also uses the Object Query Language (OQL) instead of
SQL, the standard language for relational databases.
4 Conclusion
It is possible to use a simple service to plan, parallelize, and distribute complex SQL queries on the grid. The use of OGSA-DAI is an important feature because many researchers are establishing the services that will be developed to integrate databases on the grid. Thus, it will be possible to aggregate new services into the middleware.
Some important issues were addressed, such as the use of heterogeneous servers, operating systems, and databases. In our tests we used standard SQL queries with table joins and row restrictions.
The main contribution of this paper is to show the use of different remote databases in a virtual organization. With this approach, resources are better utilized and queries return faster than when using a single database. Ordinary desktop computers can be used to carry out tasks that would otherwise overload the database server.
5 References
[1] Alpdemir, M. N.; Mukherjee, A.; Paton, N. W.; Watson, P.; Fernandes, A. A. A.; Gounaris, A. and Smith, J. "Service-Based Distributed Querying on the Grid", 2003.
[2] Blythe, J.; Deelman, E. and Gil, Y. "Planning and Metadata on the Computational Grid". In: AAAI Spring Symposium on Semantic Web Services, 2004.
[3] Dantas, M. "Computação Distribuída de Alto Desempenho". Axcel, Rio de Janeiro, 2005.
[4] Foster, I.; Kesselman, C. and Tuecke, S. "The Anatomy of the Grid: Enabling Scalable Virtual Organizations". In: International J. Supercomputer Applications, vol. 15(3), 2001.
[5] Foster, I.; Tuecke, S. and Unger, J. "OGSA Data Services". In: http://www.ggf.org, 2003. Accessed July 2005.
[6] Nareyek, A. N.; Freuder, E. C.; Fourer, R.; Giunchiglia, E.; Goldman, R. P.; Kautz, H.; Rintanen, J. and Tate, A. "Constraints and AI Planning". IEEE Intelligent Systems 20(2): 62-72, 2005.
[7] OGSA-DAI. http://www.ogsadai.org.uk/about/ogsa-dqp/. Accessed February 2006.
[8] Ranganathan, K. and Foster, I. "Simulation Studies of Computation and Data Scheduling Algorithms for Data Grids". In: Journal of Grid Computing, vol. 1, pp. 53-62, 2003.
[9] Russell, S. and Norvig, P. "Inteligência Artificial". Campus, Rio de Janeiro, 2004.
[10] Shipp, M. A. et al. "Diffuse Large B-Cell Lymphoma Outcome Prediction by Gene Expression Profiling and Supervised Machine Learning". In: Nature Medicine, vol. 8, no. 1, pp. 68-74, 2002.
[11] Watson, P.; Paton, N.; Atkinson, M.; Dialani, V.; Pearson, D. and Storey, T. "Database Access and Integration Services on the Grid". In: Fourth Global Grid Forum (GGF-4), 2002. http://www.nesc.ac.uk/technical_papers/dbtf.pdf. Accessed July 2005.
[12] Watson, P. "Databases and the Grid". In: Grid Computing: Making the Global Infrastructure a Reality. John Wiley & Sons, 2003.
An Efficient Load Balancing Architecture for Peer-to-Peer Systems
Brajesh Kumar Shrivastava, Symantec Corporation
brajeshkumar [email protected]
Sandeep Kumar Samudrala, Guruprasad Khataniar, Diganta Goswami, Department of Computer Science and Engineering
Indian Institute of Technology Guwahati, Guwahati - 781039, Assam, India
{samudrala,gpkh,dgoswami}@iitg.ernet.in
Abstract - Structured peer-to-peer systems like Chord are popular because of their deterministic nature in routing, which bounds the number of messages needed to find the desired data. In these systems, the DHT abstraction, heterogeneity in node capabilities, and the distribution of queries in the data space lead to load imbalance.
We present a 2-tier hierarchical architecture that deals with the churn rate and load imbalance, thereby reducing the cost of load balancing and improving the response time. The proposed load balancing approach is decentralized in nature and exploits the physical proximity of nodes in the overlay network during load balancing.
Keywords: P2P network, overlay network, load balancing, DHT, super node, normal node.
1 Introduction
Peer-to-Peer (P2P) systems offer a different paradigm for resource sharing. In P2P systems every node has equal functionality: every node provides services to and takes services from the network. The main objective of P2P systems is to efficiently utilize resources without overloading any of the nodes. Structured P2P systems such as Chord [1] and CAN [2], also called DHT-based systems, map data and node ids to a number space, and each node is responsible for a part of the data space.
Since each node is mapped to a part of the data space, finding data in these systems is straightforward. This deterministic nature and the bounded number of messages needed to search for data have made these systems popular.
The DHT abstraction assumes homogeneity of node capabilities, which does not hold in practice because each node possesses different capabilities. DHTs make use of consistent hashing to map data items to nodes. However, consistent hashing produces an O(log N) imbalance of keys among the nodes in the system, where N is the number of nodes [3].
The data shared on the network is not uniformly distributed, and queries are never uniformly distributed in the network. Thus, the assumption of homogeneous node capabilities, the DHT abstraction, the data shared on the network, and the queries in the network lead to a considerable amount of load imbalance in the system.
We present a 2-tier hierarchical architecture that achieves proximity-aware load balancing to improve the response time, handles the churn rate to reduce message traffic, and deals with free-riding by ensuring that each node provides sufficient resources, countering greedy nodes that take services from the network without sharing anything in return.
The rest of this paper is organized as follows: Section 2 presents related work, Section 3 describes the proposed overlay system, and Section 4 concludes the paper.
2 Related work
DHT-based P2P systems [1][2][4][5] address the load balancing issue in a simple way by relying on the uniformity of the hash function used to generate object ids. This random choice of object ids does not produce perfect load balance. These systems also ignore the node heterogeneity that has been observed in [6].
Chord [1] was the first to propose the concept of virtual servers to address the load balancing issue, by having each node simulate a logarithmic number of virtual servers. Virtual servers do not completely solve the load balancing issue.
Rao et al. [7] proposed three simple load balancing schemes for DHT-based P2P systems: One-to-One, One-to-Many, and Many-to-Many. The basic idea behind their schemes is that virtual servers are moved from heavy nodes to light nodes for load balancing.
Byers et al. [8] addressed load balancing from a different perspective. They proposed using the "power of two choices" paradigm to achieve load balance: each data item is hashed to a small number of different ids and is then stored on the least loaded node among the nodes responsible for those ids.
Karger et al. [9] proposed two load balancing protocols: address-space balancing and item balancing. Proximity information has been used in both topologically-aware DHT construction [10] and proximity neighbor selection in P2P routing tables [11][12]. The primary purpose of using proximity information in both cases is to improve the performance of DHT overlays. We use proximity for load balancing.
Haiying Shen et al. [13] presented a new structure called Cycloid to deal with load balancing by considering proximity.
Zhenyu et al. [14] proposed a distributed load balancing algorithm based on virtual servers, in which each node aggregates the load information of its connected neighbors and moves virtual servers from heavily loaded nodes to lightly loaded nodes.
Our architecture exploits node capabilities and the proximity of nodes in the network for load balancing. This architecture uses Chord in the upper level of the hierarchy and deals with free-riding by constraining the minimum amount of sharing of the nodes in the lower level of the hierarchy.
3 System Model
In this model, we broadly categorize the nodes into two types, Super node and Normal node, based on their capabilities such as bandwidth, processing power, storage, and the number of nodes they can connect to.
i. Super node: A Super node is exposed to the upper-level DHT-based network (Chord). It manages the group of Normal nodes connected to it. The data that is mapped onto a Super node is stored on the Normal nodes belonging to its group. It represents a group of Normal nodes in the DHT-based network and routes the queries of the Normal nodes in its group.
ii. Normal node: A Normal node is a node that joins the network in the second level of the hierarchy. It maintains the data mapped onto the Super node to which it is connected. It is not exposed to the upper-level DHT-based network.
These are in turn classified into two types:
i. Stable-Normal node.
ii. Unstable-Normal node.
Fig. 1. Categorization of Normal Nodes (Unstable-Normal nodes become Stable-Normal nodes after Tstab time units)
i. Stable-Normal node: A Normal node that has been in the overlay network for Tstab or more time units. These nodes are expected to remain in the network with acceptable probability, and they are considered for load balancing by a Super node. In this paper, whenever a Normal node is referred to, it means a Stable-Normal node unless explicitly specified otherwise.
ii. Unstable-Normal node: These are recently joined nodes that can leave the network at any time. They are not considered by the Super node for load balancing.
3.1 Bootstrapping
We assume that at the startup of the network, nodes that join the network are Super nodes and are directly added
Fig. 2. 2-Tier Hierarchical Overlay Structure
in the upper level (Chord). If N is the complete data space, then as soon as the number of nodes in the network reaches log N, node addition in the upper level is stopped. Initially the number of groups is G = log N. Thereafter, nodes join in the second level of the network by connecting to the nearest Super node. Each Super node manages a group of Normal nodes. The maximum number of nodes in a group is Gmax. The moment a Super node detects that the number of nodes in its group has gone beyond Gmax, it splits the group into two groups. Splitting a group invokes a node addition in the upper level of the hierarchy. At the time of division of a group, the number of nodes and the data space are divided equally. The moment a Super node detects that the number of nodes in its group has fallen below Gmin, the nodes and data space of the group are divided equally between the predecessor and successor groups. This invokes a node deletion in the upper level of the hierarchy.
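The split logic above can be sketched as follows. The Group class, the threshold value, and the halving of the data space are illustrative assumptions of ours, not the paper's implementation; merging a group below Gmin would be the symmetric operation.

```python
GMAX = 8  # maximum group size (the paper's Gmax; value chosen for illustration)

class Group:
    """A Super node's group: a list of Normal nodes and a data range [lo, hi)."""
    def __init__(self, nodes, lo, hi):
        self.nodes, self.lo, self.hi = list(nodes), lo, hi

    def add(self, node):
        """Add a node; split the group when it exceeds GMAX.

        On a split, nodes and data space are divided equally, and the new
        group would be registered as a node addition in the Chord level.
        """
        self.nodes.append(node)
        if len(self.nodes) > GMAX:
            mid = (self.lo + self.hi) / 2        # equal division of data space
            half = len(self.nodes) // 2          # equal division of nodes
            new = Group(self.nodes[half:], mid, self.hi)
            self.nodes, self.hi = self.nodes[:half], mid
            return new
        return None
```

Adding a ninth node to an eight-node group returns a new group covering the upper half of the original data range.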
3.2 Data Distribution
Every Super node divides its data space into K sets and distributes each set to a group of Normal nodes. At startup, nodes are assigned to the data regions in round-robin fashion. For instance, consider a Super node S1, and let the data space maintained by it range from X1 to
X2. The space is divided into K sets. Let there be 2K nodes in the group. Data is assigned in the following fashion:
TABLE I
Data distribution in a group

Data region    Node ids
X1 − P1        N1, Nk+1
P1 − P2        N2, Nk+2
...            ...
Pk−1 − X2      Nk, N2k
At the start of group formation, nodes are assigned to the data sub-regions in round-robin fashion. As more nodes join the group, each sub-range gets more nodes. Since the nodes in a group are physically near each other, this facilitates load balancing without any additional cost. As the load pertaining to a data sub-range changes dynamically, more nodes are added to the corresponding sub-range.
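As an illustration, the round-robin assignment of nodes to the K sub-regions can be sketched as follows (the function name and the list representation are our own assumptions):

```python
def assign_regions(node_ids, k):
    """Round-robin assignment of nodes to K data sub-regions, reproducing
    the pattern of Table I: N1 and N(k+1) share region 0, and so on."""
    regions = [[] for _ in range(k)]
    for i, node in enumerate(node_ids):
        regions[i % k].append(node)
    return regions

# With 2K = 6 nodes and K = 3 regions, each region gets two nodes.
regions = assign_regions(["N1", "N2", "N3", "N4", "N5", "N6"], 3)
```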
3.3 Query Processing
In the proposed architecture, only Super nodes are exposed in the DHT-based network. Normal nodes make use of the Super node to which they are connected when searching for data. For instance, consider two Super nodes S1 and S2. Let S11, S12 be the Normal nodes under S1 and, likewise, S21, S22 the Normal nodes under S2. Let the data being searched for by S11 belong to node S21. S11 sends the query to S1, and S1 runs the regular data searching algorithm in the upper-level Chord network. S1 finds that the data is in group S2. S2 gives the id of S21 to node S1, which in turn sends it to S11. Then S11 directly downloads the data from S21. In this way Super nodes mediate the data search for Normal nodes.
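The mediated lookup can be sketched as below. The dictionary standing in for the Chord lookup and the hash_key helper are illustrative assumptions of ours, not the paper's protocol.

```python
def hash_key(key):
    # Stand-in for the DHT hash: maps a key to one of two Super node ids.
    return sum(ord(c) for c in key) % 2

def lookup(key, super_nodes):
    """A Normal node sends the query to its Super node, which resolves the
    key in the upper level (a dict stands in for the Chord search) and
    returns the id of the Normal node holding the data; the requester
    then downloads directly from that node."""
    owner_group = super_nodes[hash_key(key)]   # e.g. S1 runs the Chord search
    return owner_group["holders"][key]         # e.g. S2 reports holder S21

# Two groups; the key "data" hashes to group 0, whose holder is S21.
supers = {0: {"holders": {"data": "S21"}}, 1: {"holders": {}}}
```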
3.4 Grading of nodes
Each Super node periodically collects the load of the Normal nodes in its group and grades them based on the following definition of load.
The load of a node (LN) is defined as the ratio of the amount of data requests it receives to the amount of data it can serve. Let Ar be the amount of data requests received and As the amount of data it can serve. Then:
LN = Ar / As
The main objective is to ensure that the load of each node is between 0 and 1, i.e., 0 < LN < 1.
The different types of nodes in a group are:
i. Lightly loaded nodes: their load varies from 0 to 0.5, i.e., 0 ≤ LN < 0.5.
ii. Normal loaded nodes: their load varies from 0.5 to 1, i.e., 0.5 ≤ LN < 1.
iii. Heavily loaded nodes: their load is greater than or equal to 1, i.e., LN ≥ 1.
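This grading follows directly from the definition LN = Ar / As and the thresholds above; the function names below are our own.

```python
def load(requests_received, capacity):
    """LN = Ar / As: ratio of data requests received to data the node can serve."""
    return requests_received / capacity

def grade(ln):
    """Grade a node by the thresholds of Section 3.4."""
    if ln < 0.5:
        return "light"       # 0 <= LN < 0.5
    elif ln < 1:
        return "normal"      # 0.5 <= LN < 1
    return "heavy"           # LN >= 1
```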
3.5 Load Balancing
In this model load balancing is applied at two levels:
i. Intra-group load balancing.
ii. Inter-group load balancing.
3.5.1 Intra-group load balancing: This is applied only when a Super node has a sufficient number of lightly loaded nodes. The data space of a Super node is divided into K intervals and uniformly distributed among the nodes in the group at the time of group creation. As the load on a particular interval increases, lightly loaded nodes associated with other intervals are reassigned to the loaded interval. For instance, consider a Super node S that has 5 nodes in its group, and let its data space (X1 − X2) be divided at point a.
TABLE II
Data distribution of Super node S

Data region    Node ids
X1 − a         N1, N3, N5
a − X2         N2, N4
Now if the region (a − X2) is queried more, the load on nodes N2 and N4 increases compared to that of the nodes in region (X1 − a). Hence, nodes from that region are moved to region (a − X2).
Since the nodes in a group are physically near each other, the cost of moving load from one node to another is negligible.
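A minimal sketch of this reassignment, under the assumption (ours) that the Super node tracks per-interval load in a simple mapping:

```python
def rebalance(intervals):
    """Move one node from the least loaded interval to the most loaded one.

    `intervals` maps each data sub-range to {"nodes": [...], "load": float};
    this bookkeeping structure is an illustrative assumption, not the
    paper's data structure."""
    hot = max(intervals, key=lambda r: intervals[r]["load"])
    cold = min(intervals, key=lambda r: intervals[r]["load"])
    if hot != cold and len(intervals[cold]["nodes"]) > 1:
        node = intervals[cold]["nodes"].pop()
        # Moving cost is negligible: the nodes are physically near.
        intervals[hot]["nodes"].append(node)
    return intervals
```

Applied to Table II with region (a − X2) hot, one node (e.g. N5) moves from (X1 − a) to (a − X2).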
3.5.2 Inter-group load balancing: This is applied when a Super node does not have a sufficient number of lightly loaded nodes. There are two alternatives for this procedure:
i. When a Super node detects that there are not enough lightly loaded nodes in the group, it sends a "GET-NODE" message to all the nodes in the group. "GET-NODE" is a simple message to get more nodes into the group. Nodes in the group search for lightly loaded nodes in their proximity, as shown in Fig. 3, and bring them into the group that is in need of lightly loaded nodes. This creates a flow of ping messages in the proximity of the group. Since the nodes are searched for in the nearby proximity, the cost of getting nodes from other groups is negligible.
ii. a. If none of the nodes can find lightly loaded nodes in the nearby proximity, this alternative is applied: the Super node sends a multicast message to the log G Super nodes whose information is maintained in its finger table.
b. If none of the log G nodes respond to the message, then the logically connected neighbors send the message along the DHT ring to get
Fig. 3. Inter-group load balancing (GET-NODE messages)
new lightly loaded nodes. A Super node always chooses the node that is physically nearest when selecting nodes for load balancing.
3.6 Cost of Load balancing
i. The cost of intra-group load balancing is negligible, because lightly loaded nodes are available within the group and the only cost is that of moving the load.
ii. The cost of inter-group load balancing is as follows:
a. In the first alternative, the cost of load balancing is negligible, since lightly loaded nodes are searched for in the nearby proximity.
b. In the second alternative, the load of a node depends on the data it maintains, the rate at which its data space is being queried, and the availability of node resources. Hence, when a Super node asks the connected Super nodes for lightly loaded nodes, those nodes may or may not have any. The probability of a Super node providing lightly loaded nodes is taken to be 1/2. When a Super node sends a message to the connected Super nodes, the probability of getting at least one response is (2^log G − 1) / 2^log G, which is approximately 1. Hence, the cost of load balancing is O(log G). In the worst case, if none of the nodes respond to the message, the message is passed along the Chord ring in search of lightly loaded nodes. Here the cost is O(G). But, as described in Section 3.1, the number of groups is G = log N. Hence, the cost of load balancing is O(log N).
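Written out, the argument above (with the paper's assumption of a 1/2 success probability per contacted Super node, independent across the log G contacts) is:

```latex
P(\text{at least one response})
  = 1 - \left(\tfrac{1}{2}\right)^{\log G}
  = \frac{2^{\log G} - 1}{2^{\log G}}
  \approx 1 \quad \text{for large } G.
```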
3.7 Load of a Group
Every Super node ensures that there are at least 10% lightly loaded nodes in the group, using the mechanisms described in Section 3.5, in order to handle sudden changes in the load. Each lightly loaded node shares the load of a heavily loaded node, and both nodes then become normal loaded nodes. Hence the expected proportion of normal loaded nodes is 50%, and heavily loaded nodes form the rest of the group. The percentage of lightly loaded nodes ranges from 10% to 50%; similarly, heavily loaded nodes form 0% to 40% of the group.
The load of a group is defined as the average load of all the nodes in the group. Let the number of nodes in a group be M.
3.7.1 Lower bound:
Number of lightly loaded nodes: 0.5 × M.
Number of normal loaded nodes: 0.5 × M.
Number of heavily loaded nodes: 0 × M.
Load of lightly loaded nodes: 0.1.
Load of normal loaded nodes: 0.5.
Load of heavily loaded nodes: 1.
Load of the group:
LG = (0.5M × 0.1 + 0.5M × 0.5 + 0 × M × 1) / M = (0.05M + 0.25M) / M = 0.3.
3.7.2 Upper bound:
Number of lightly loaded nodes: 0.1 × M.
Number of normal loaded nodes: 0.5 × M.
Number of heavily loaded nodes: 0.4 × M.
Load of lightly loaded nodes: 0.4.
Load of normal loaded nodes: 0.9.
Load of heavily loaded nodes: 1.
LG = (0.1M × 0.4 + 0.5M × 0.9 + 0.4M × 1) / M = (0.04M + 0.45M + 0.4M) / M = 0.89.
The average load of a group thus varies between these bounds:
0.3 ≤ LG ≤ 0.89,
so on average a group never becomes heavily loaded.
3.8 Node Joining/Failure
Normal nodes join the network at the lower level of the hierarchy by connecting to the nearest Super node. They are not exposed to the upper level of the hierarchy. Hence, the addition or failure of a Normal node does not affect the system.
Super nodes are exposed to the upper level of the hierarchy. When a Super node fails, with high probability one of the Normal nodes will replace it. When a
group joins or fails, the finger tables of the other Super nodes need to be updated.
Let Rj: rate at which nodes join the network;
Rl: rate at which nodes leave the network;
Gj: rate of getting nodes from nearby groups;
Gl: rate of giving nodes to other groups;
Gmax: maximum group size;
G: number of groups.
The effective number of nodes joining the system per unit time is (Rj + Gj − Rl − Gl).
The effective number of nodes joining per group is (Rj + Gj − Rl − Gl) / G.
A group is created only if the number of nodes in a group crosses Gmax, so the number of new groups created by a group is (Rj + Gj − Rl − Gl) / (Gmax × G).
Addition of a group is equivalent to the addition of a node in Chord. Chord [1] takes at most O(log^2 N) messages, where N is the number of nodes in the system. Here the number of groups is G, so the addition of a group takes O(log^2 G) messages.
The number of messages routed because of the new groups created by a group is (Rj + Gj − Rl − Gl) / (Gmax × G) × O(log^2 G), and the total number of messages routed in the upper-level network is (Rj + Gj − Rl − Gl) / Gmax × O(log^2 G).
Since group joining and group deletion depend on the parameters given above, these events do not occur very often. Hence the churn rate does not degrade the system performance.
3.9 Data Insertion
When a node shares data on the network, the Super node onto which this data is mapped needs to be identified. This is similar to searching for data in Chord. [1] proved that in a Chord network, the number of steps to find data is O(log N). In the proposed network, there are G groups in the upper-level hierarchy; hence the cost of data insertion is O(log G).
TABLE III
Analysis of the system

Function                      Messages
Data insertion                O(log G)
Query processing              O(log G)
Node join/failure             Constant
Group insertion               O(log^2 G)
Intra-group load balancing    Constant
Inter-group load balancing    Avg: O(log G), Worst case: O(G)
3.10 Free-Riding
Since Normal nodes join the network by connecting to the nearest Super node, at joining time the Super node imposes a minimum constraint on the amount of sharing space the node has to provide to the network. Thus, nodes that do not offer sufficient sharing space are not connected to the network. Nodes that do not provide enough sharing cannot be used for handling the data space of the Super node. In this way free-riding is controlled at the lower level of the network.
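The admission rule can be sketched as follows; the threshold value and function name are illustrative assumptions, not values from the paper.

```python
MIN_SHARE_MB = 100  # minimum sharing space a joining node must offer (illustrative)

def admit(node_share_mb):
    """A Super node admits a joining Normal node only if it offers at least
    the minimum sharing space, which controls free-riding at the lower level."""
    return node_share_mb >= MIN_SHARE_MB
```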
4 Conclusion
In this paper we present efficient load balancing mechanisms that consider locality in a 2-tier hierarchical overlay network. In the proposed architecture, nodes are classified as Super nodes and Normal nodes. The number of Super nodes is initially limited to log N, where N is the size of the data space. Super nodes balance the load by moving it across Normal nodes that are physically near each other. Since the number of Super nodes is limited initially and the rate at which Super nodes join the network is very low, this architecture copes with the churn rate. Constraining the minimum amount of sharing by the Normal nodes regulates free-riding.
References
[1] I. Stoica, R. Morris, D. Karger, M. Kaashoek, and H. Balakrishnan, "Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications," in Proceedings of ACM SIGCOMM, 2001.
[2] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, "A Scalable Content-Addressable Network," in Proceedings of ACM SIGCOMM, 2001.
[3] D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin, and R. Panigrahy, "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web," in Proceedings of the Symposium on Theory of Computing (STOC '97), pp. 654-663, 1997.
[4] A. Rowstron and P. Druschel, "Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-to-Peer Systems," in IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), Heidelberg, Germany, pp. 329-350, November 2001.
[5] B. Y. Zhao, L. Huang, J. Stribling, S. C. Rhea, A. D. Joseph, and J. D. Kubiatowicz, "Tapestry: A Resilient Global-Scale Overlay for Service Deployment," IEEE Journal on Selected Areas in Communications, vol. 22, 2004.
[6] S. Saroiu, P. K. Gummadi, and S. D. Gribble, "A Measurement Study of Peer-to-Peer File Sharing Systems," in Proceedings of Multimedia Computing and Networking (MMCN), 2002.
[7] A. Rao, K. Lakshminarayanan, S. Surana, R. Karp, and I. Stoica, "Load Balancing in Structured P2P Systems," in Proceedings of the Second International Workshop on P2P Systems (IPTPS), pp. 68-79, February 2003.
[8] J. W. Byers, J. Considine, and M. Mitzenmacher, "Simple Load Balancing for Distributed Hash Tables," in Proceedings of the Second International Workshop on P2P Systems (IPTPS), pp. 80-87, February 2003.
[9] D. R. Karger and M. Ruhl, "Simple Efficient Load Balancing Algorithms for Peer-to-Peer Systems," in Proceedings of the Third International Workshop on P2P Systems (IPTPS), February 2004.
[10] S. Ratnasamy, M. Handley, R. M. Karp, and S. Shenker, "Topologically-Aware Overlay Construction and Server Selection," in Proceedings of IEEE INFOCOM, vol. 3, pp. 1190-1199, June 2002.
[11] Z. Xu, C. Tang, and Z. Zhang, "Building Topology-Aware Overlays Using Global Soft-State," in Proceedings of the 23rd International Conference on Distributed Computing Systems (ICDCS), pp. 500-508, May 2003.
[12] Z. Xu, M. Mahalingam, and M. Karlsson, "Turning Heterogeneity into an Advantage in Overlay Routing," in Proceedings of IEEE INFOCOM, vol. 2, pp. 1499-1509, April 2003.
[13] H. Shen and C.-Z. Xu, "Locality-Aware and Churn-Resilient Load Balancing Algorithms in Structured Peer-to-Peer Networks," IEEE Transactions on Parallel and Distributed Systems, vol. 18, no. 6, June 2007.
[14] Z. Li and G. Xie, "A Distributed Load Balancing Algorithm for Structured P2P Systems," in Proceedings of the 11th IEEE Symposium on Computers and Communications (ISCC '06), 2006.
Grid Performance Analysis in a Parallel Environment
Y.S. Atmaca1, O. Kandara1, and A. Tan2
1Computer Science Department, Southern University and A&M College, Baton Rouge, Louisiana, USA 2 Center for Energy and Environmental Services, Southern University and A&M College, Baton Rouge,
Louisiana, USA
Abstract - High performance computing is a relatively new technology gaining more and more importance every day. Supercomputing, clustering, and grid computing are the existing technologies used in high performance computing.
In this paper, we first study the three types of emerging grid technologies, namely the cluster grid, the campus grid, and the global grid. We use a cluster grid in our implementation, with 11 nodes, where one node is dedicated to submitting jobs and the remaining 10 nodes are used as computational units. We then carry out a performance analysis based on the number of nodes in the grid. In the analysis, we observed that peak performance can be achieved with a certain number of nodes in the grid; that is to say, more nodes do not always mean more performance.
Our study shows that we can create a grid for high performance computing with the optimal number of nodes for jobs with the same or similar characteristics. This leads us to implement parallel grids with the nodes that would not be cost-efficient additions to the existing grid.
Keywords: High performance computing, grid computing, grid performance analysis, cluster grid.
1 Introduction
One way of categorizing a computational problem in computer science is by its degree of "parallelism". If a problem can be split into many smaller sub-problems that can be worked on by different processors in parallel, computation can be sped up considerably by using many computers. Fine-grained calculations are better suited to big, monolithic supercomputers, or at least very tightly coupled clusters of computers, which have many identical processors with an extremely fast, reliable network between them to ensure there are no bottlenecks in communications. This type of computing is often referred to as high-performance computing (HPC). In its current stage of evolution,
most applications of the Grid fall into the HPC classification. This is because Grid computing arose out of the need for more cost-effective HPC solutions to address critical problems in science and engineering. The initial adoption of the Grid by commercial enterprises has continued to focus on HPC because of the high return on investment and the competitive advantage realized by solving compute-intensive problems that were previously unsolvable in a reasonable period of time or at a reasonable cost.
A simplified basic architecture of a Grid is shown in Figure 1, with Grid middleware providing the location transparency that allows applications to run over a virtualized layer of networked resources. The key aspect of middleware is that it gives the Grid the semblance of a single computer system, providing the coordination among all the computing resources that comprise the Grid. These functions usually include tools for resource discovery and monitoring, resource allocation and management, security, performance monitoring, and accounting.
Figure 1: Basic Architecture of a Grid
Key to the success of Grid computing is the development of the 'middleware', the software that organizes and integrates the disparate computational facilities belonging to a Grid. Its main role is to automate all the "machine to machine" negotiations required to interlace the computing and storage resources and the network into a single, seamless computational "fabric".
There is a considerable amount of debate as to whether a local computational cluster of computers should be classified as a Grid. There is no doubt that clusters are conceptually quite similar to Grids. Most importantly, both depend on middleware to provide the virtualization needed to make a multiplicity of networked computer systems appear to the user as a single system. Therefore, the middleware for clusters and Grids addresses the same basic issues, including message passing for parallel applications. As a result, the high-level architecture of a cluster is essentially the same as that of a Grid.
1.1 Types of Grids
Grid computing vendors have adopted various nomenclatures to explain and define the different types of grids. Some define grids based on the structure of the organization that is served by the grid, while others define them by the principal resources used in the grid. We classify grids into three groups according to their regional service capability.
Cluster grids are the simplest. Cluster grids are made up of a set of computer hosts that work together. A cluster grid provides a single point of access to users in a single project or a single department.
Campus grids enable multiple projects or departments within an organization to share computing resources. Organizations can use campus grids to handle a variety of tasks, from cyclical business processes to rendering, data mining, and more.
Global grids are a collection of campus grids that cross organizational boundaries to create very large virtual systems. Users have access to compute power that far exceeds resources that are available within their own organization.
In a cluster grid, a user's job is handled by only one of the systems within the cluster. However, the user's cluster grid might be part of a more complex campus grid, and the campus grid might be part of a larger global grid. In such cases, the user's job can be handled by any member execution host located anywhere in the world [1].
2 Grid System Configuration Methodology
To implement a Grid Computing System, we followed three steps: 1) asking questions before implementing the Grid, as an elicitation process; 2) planning; 3) verification.
The questions were:
I) What kind of grid will be implemented: cluster, departmental, or global?
II) Do we have a room for the grid that meets our needs?
III) Which vendor's software will be chosen as the grid engine?
IV) How can we obtain the required software, and what is the available budget?
V) What documentation should be used to implement the grid successfully?
VI) Is the grid software a proper choice for the grid that we are about to implement?
VII) Will the grid be composed of heterogeneous or homogeneous systems?
VIII) What are the requirements for the networking among the computers?
IX) What kind of naming service do we need?
X) Are we doing load balancing or parallel processing?
XI) What kind of parallel environment is proper for the grid?
XII) Will the grid system be available from outside the network?
After we found proper answers to the questions in step 1, we continued to step 2. In planning, we went through the following steps:
I) Deciding whether our Grid Engine environment will be a single cluster or a collection of sub-clusters called cells.
II) Selecting the machines that will be Grid Engine hosts, and determining what kind(s) of host each machine will be: master host, shadow master host, administration host, submit host, execution host, or a combination.
III) Making sure that all Grid Engine users have the same user ids on all submit and execution hosts.
IV) Deciding what the Grid Engine directory organization will be (for example, a complete directory tree on each workstation, cross-mounted directories, or a partial directory tree on some workstations) and where each Grid Engine root directory will be located.
V) Deciding on the site's queue structure.
VI) Deciding whether network services will be defined in an NIS file or locally on each workstation in /etc/services.
VII) Completing the installation plan.
The last step in implementing a Grid Computing System is verification. After finishing steps 1 and 2, we review the questions and the solutions we came up with to check whether they are really feasible and logical to implement. This step makes sure everything is in place and we are ready to start. If we skip step 3 or do not pay attention to it, we might have to set up the whole system again and again. A grid system also needs several pieces of software to run, and they have to support each other.
3 Experimental Data of Grid Implementation
To carry out the performance test, we implemented a cluster grid with 11 nodes. The grid system is composed of 1 submit host, 1 administration and master host, and 10 execution hosts. Networking among the grid system
220 Int'l Conf. Grid Computing and Applications | GCA'08 |
is provided by a 12-port AT&T hub with Category 5 cables.
The Sun Solaris operating system is installed on each host. We then decided how the hosts would find each other, and set up a Domain Name Server (DNS) on the master host. Every host has to be in the DNS list; otherwise the installation of the daemons will fail with errors.
Before the daemons are installed, all systems must be able to resolve one another. To ensure this, the hosts file under /etc is created and all hosts are entered into it. If this step is skipped, the mount utility will not work and the Network File System (NFS) cannot be used.
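The host resolution step can be sketched as follows; the IP addresses and host names are illustrative assumptions, not the actual lab values, and the file is written to a temporary location so the sketch is harmless to run:

```shell
# Hypothetical /etc/hosts entries for the lab grid.
cat > /tmp/hosts.example <<'EOF'
192.168.1.1   master    # administration/master host, runs DNS
192.168.1.2   submit01  # submit host
192.168.1.10  exec01    # first of the execution hosts
EOF
# Every host must carry equivalent entries; a quick sanity check that
# an expected entry is present:
grep -c 'exec01' /tmp/hosts.example
```

In a real deployment the same entries would go into /etc/hosts on every machine in the grid.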
To have a shared folder, NFS is set up. The Network File System allows you to work locally with files that are stored on a remote computer's disk but appear as if they were present on your own computer. The remote system acts as a file server, while the local system acts as the client and queries the file server.
Another important step is defining SGE_ROOT, the root folder of the grid. SGE_ROOT has to be defined on every host; if you try to install the daemons before defining it, the installation will fail. In this implementation, SGE_ROOT is /opt/grid5.3. The grid services also have to be defined under /etc/services before the daemons are installed.
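These two prerequisites can be expressed as shell steps. The sge_commd entry uses TCP port 536, the port reported later for the submit host; the sge_execd port shown is an invented placeholder:

```shell
# Define the grid root on every host before installing any daemon.
SGE_ROOT=/opt/grid5.3
export SGE_ROOT

# Entries of the kind added to /etc/services (shown here as comments
# only; the sge_execd port number is a hypothetical example):
#   sge_commd   536/tcp
#   sge_execd   537/tcp
echo "SGE_ROOT is $SGE_ROOT"
```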
Figure 2 shows our Grid Network Architecture.
3.1 Host Daemons
Master Host Daemons: The daemon for the master host is sge_qmaster, which is the center of the cluster's management and scheduling activities. sge_qmaster maintains tables about hosts, queues, jobs, system load, and user permissions. It receives scheduling decisions from sge_schedd and requests actions from sge_execd on the appropriate execution hosts.
The master computer also runs the sge_schedd daemon which, as mentioned above, decides which jobs are dispatched to which queues and then forwards these decisions to sge_qmaster, which initiates the required actions.
Another daemon the master runs is sge_commd, which provides the communication between hosts; it therefore has to run on every host, not only on the master.
Figure 2: Lab Grid Network
Submit Host Daemons: The only daemon that runs on submit hosts is sge_commd, which provides TCP communication. This service has to be added to the services file over a well-known TCP port on the submit host. In this implementation, TCP port 536 is used.
Execution Host Daemons: Execution hosts run sge_execd, which is responsible for the queues on its host and for the execution of jobs in these queues. Periodically, it forwards information such as job status or the load on its host to sge_qmaster.
We installed the execution hosts after the master and submit hosts were ready to run. Before sge_execd is installed, we need to specify which services will be used for the grid: in the services file we defined sge_commd and sge_execd as grid services, and then installed the sge_execd daemon.
If permissions are not set as needed, the grid system will not operate and cannot finish the submitted jobs. The shared folders must have write permission as well as read permission.
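A minimal sketch of the permission requirement, using a throw-away directory in place of the actual shared NFS folder (the path is an assumption):

```shell
# Create a stand-in for the shared folder and grant group write access
# in addition to read; grid jobs fail if the share is read-only.
mkdir -p /tmp/gridshare
chmod 775 /tmp/gridshare
# Show the resulting permission bits:
ls -ld /tmp/gridshare | cut -c1-10
```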
Since the test objective is to compare the performance of an individual computer with that of the Sun grid system, a parallel environment has to be set up for the grid. The test is carried out by sending a job to
the grid, which processes the job in a parallel environment, and to an individual system, which processes the job by itself. As soon as a job is sent to the grid, the master computer checks its tables, which show the load on each host, which host is running which job, and which hosts are most available to handle the job. Based on this information, the execution hosts take over the job, run it to completion, and create the output.
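The master's table-driven choice can be illustrated with a toy example: given per-host load figures (invented here), pick the least-loaded execution host. This mimics the dispatch decision in a much simplified form:

```shell
# Each line is "host load"; sorting numerically on the load column and
# taking the first line selects the least-loaded host.
printf 'exec01 0.80\nexec02 0.15\nexec03 0.40\n' \
  | sort -k2 -n | head -1 | awk '{print $1}'
# -> exec02
```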
The first thing to do is to choose the parallel environment we will work with. There are two choices: PVM (Parallel Virtual Machine) and MPI (Message Passing Interface). PVM is a software package that permits a heterogeneous collection of UNIX and/or Windows computers connected by a network to be used as a single large parallel computer. Large computational problems can thus be solved more cost-effectively by using the aggregate power and memory of many computers. The software is very portable, and PVM enables users to exploit their existing computer hardware to solve much larger problems at minimal additional cost. Hundreds of sites around the world use PVM to solve important scientific, industrial, and medical problems, in addition to its use as an educational tool for teaching parallel programming. With tens of thousands of users, PVM has become a de facto standard for distributed computing worldwide [2].
MPI, on the other hand, was designed for high performance on both massively parallel machines and workstation clusters. MPI is widely available, with both free and vendor-supplied implementations. It is the only message-passing library that can be considered a standard, and it is supported on virtually all HPC platforms. There is no need to modify your source code when you port your application to a different platform that supports (and is compliant with) the MPI standard [3].
Because the main logic of grid computing is to use idle resources and to achieve high-performance computing with them, it is more appropriate to use PVM in our grid. PVM is the better fit because our applications will run over heterogeneous networks: it has good interoperability between different hosts, and it allows the development of fault-tolerant applications that can survive host or task failures. Because the PVM model is built around the virtual machine concept (not present in the MPI model), it provides a powerful set of dynamic resource management and process control functions [4].
4 Discussion and Conclusion
A grid configuration methodology has been given as a step-by-step procedure. Before implementing a grid, system administrators should follow the given procedures and understand the core requirements in order to implement the grid successfully. Section 3 provides a detailed explanation of the procedures system administrators are supposed to go through.
If the grid is set up appropriately, with the core requirements understood, the resulting system can be expected to be robust. In line with the goal of this paper, we now present a performance analysis of the grid.
Figure 3: Grid Performance Table
Figure 3 presents the grid performance analysis that we carried out using ten nodes. These computers have the same hardware characteristics and the same operating system, Sun Solaris 9. The performance results show that, up to a point, more nodes mean more performance; from one node to two nodes the performance doubles.
Figure 4: Grid Performance Analysis
Figure 4 is a graphical presentation of the same analysis. Job duration is more than 12 minutes on a single node; when we run the same job on two nodes, the duration decreases to 6 minutes. This is a large performance increase, and as we add further nodes for the same job, performance keeps improving noticeably. When the tenth node is added, however, performance decreases compared with the nine-node result: with nine nodes the job finishes in 2.10 minutes, whereas with ten nodes it takes 2.20 minutes. This shows that performance depends not only on the number of nodes but also on the job's characteristics. For this particular job, peak performance is reached with nine nodes; more nodes therefore do not always mean more performance, and the peak-performance node count can differ for each particular job.
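The reported timings imply the following speedups over the single-node run (computed with awk; we take the printed times at face value as decimal minutes):

```shell
# Speedup = time on one node / time on N nodes, using the durations
# quoted in the text: 12 min (1 node), 6 min (2 nodes),
# 2.10 min (9 nodes), 2.20 min (10 nodes).
awk 'BEGIN {
  t1 = 12.0
  printf "2 nodes:  %.1fx\n", t1 / 6.0
  printf "9 nodes:  %.1fx\n", t1 / 2.10
  printf "10 nodes: %.1fx\n", t1 / 2.20
}'
```

The nine-node speedup (5.7x) exceeding the ten-node one (5.5x) is exactly the peak discussed above.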
5 References
[1] http://docs.sun.com/app/docs/doc/817-6117, as of January 9, 2008.
[2] http://www.csm.ornl.gov/pvm/, as of January 15, 2008.
[3] http://www.llnl.gov/computing/tutorials/mpi/#What, as of February 12, 2008.
[4] http://www.rrsg.uct.ac.za/mti/pvmvsmpi.pdf, as of January 15, 2005.
[5] V. Getov et al., Performance Analysis and Grid Computing, Kluwer Academic Publishers, 2004. ISBN-10: 1402076932.
[6] G. Liu et al., "The HPCVL Working Template: A Tool for High-Performance Programming," 19th International Symposium on High Performance Computing Systems and Applications (HPCS'05), pp. 110-116.
[7] H. Truong and T. Fahringer, "Online Performance Monitoring and Analysis of Grid Scientific Workflows," Advances in Grid Computing - EGC 2005: European Grid Conference, Amsterdam, pp. 1154-1164.
Implementation of Enhanced Reflective Meta Object to Address Issues Related to Contractual
Resource Sharing within Ad-hoc Grid Environments
Syed Alam, Norlaily Yaacob and Anthony Godwin
Faculty of Engineering and Computing
Coventry University, Priory Street, Coventry CV1 5FB, UK {aa117, n.yaacob, a.n.godwin}@coventry.ac.uk
Abstract ⎯ Grid technologies allow for policy-based resource sharing. The sharing of these resources can be subject to constraints such as a committed consumption limit or a timing basis. Ad-hoc grid environments, in which non-dedicated resources are shared, introduce several interesting challenges in terms of scheduling, monitoring, auditing resource consumption, and task migration. This work proposes a possible framework for non-dedicated ad-hoc grid environments based on the Meta object programming paradigm of Computational Reflection. The framework uses a Grid Meta Object with enhanced properties to aid in monitoring and migration related tasks.
Index Terms- Grid Computing, Middleware, Meta Object, Monitoring, Migration
1. INTRODUCTION
Sharing of resources within a grid environment [1, 2] can be based on various computing factors such as CPU cycles, memory consumption, and disk usage. With the recent adoption of grid technology in various non-dedicated computing environments, such as mobile computing devices, resource sharing can also be based on non-computing factors such as time. An example is an enterprise making its Local Area Network computers available to a grid environment during non-business hours or some other dedicated hours. Sharing of this nature introduces several interesting challenges in terms of controlling and managing resource allocation and consumption. If a grid job requests further resource consumption once the node has reached its committed grid resource level, a suitable strategy needs to be in place.
In the above scenario, one possible strategy is to abort the grid job immediately once the committed resource level has been reached. However, an immediate abort leaves the grid job incomplete. A fair policy in such a scenario needs to consider whether the resource-sharing contract is hard or soft. We refer to hard resource sharing as a fixed resource commitment and consumption model in which no further resource access is possible beyond the committed level. Soft resource sharing, on the other hand, refers to a less restrictive variant that allows consumption beyond the committed level. A usable grid computing model for both hard and soft resource sharing requires monitoring and auditing of committed resources. Apart from monitoring, such models also need to support job migration, so that after consuming its committed resources on the execution node a job may migrate to another node if possible. Migrating a job to another node requires identifying a suitable grid node and validating the job's prerequisites on that node; these prerequisites may include the availability of a specific runtime or a specific version of an operating system. The status of the job also needs to be archived so that the job can be restarted from the last point before migration.
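The hard/soft distinction amounts to a simple admission test at the committed level. A schematic check (the function name, units, and thresholds are invented for illustration):

```shell
# check USAGE COMMITTED POLICY -> prints "grant" or "deny".
# Under a hard contract, a request at or beyond the committed level is
# denied; under a soft contract it may still be admitted.
check() {
  if [ "$1" -lt "$2" ] || [ "$3" = "soft" ]; then
    echo grant
  else
    echo deny
  fi
}

check 100 100 hard   # -> deny  (committed level reached, hard contract)
check 100 100 soft   # -> grant (soft contract tolerates the overrun)
```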
The rest of the paper discusses how computational reflection aids in addressing the contractual resource sharing issues within ad-hoc grid environments. Section 2 provides a discussion on ad-hoc resource sharing. Section 3 addresses the properties of the Grid Meta object and how they can be used in a larger framework. Section 4 comments on possible costs and overheads of the proposed framework, and finally the conclusions and future work are presented in Section 5.
2. RESOURCE SHARING ISSUES IN AD-HOC GRID ENVIRONMENTS
A major issue within ad-hoc grid environments is that resources are shared on a temporal basis. For example, a non-dedicated grid user may become a member of the grid community for a specific duration of time, willing to commit a limited quantity of resources. This raises issues of controlling resource consumption and monitoring availability during the ad-hoc grid membership. It also incurs the overhead of monitoring and auditing these temporal nodes for their availability and for the status of any jobs submitted to them. Because of the ad-hoc nature of grid membership, a binding contract for resource sharing needs to be established in order to secure the availability of the temporal node for the duration of the contract. The grid infrastructure should cater for the contractual needs of the negotiating nodes, and these needs can be of a diverse nature: for example, a node may wish to commit a specific amount of disk space, or a specific amount of CPU cycles during specific hours. These resource-sharing constraints need to be addressed by the grid framework.
3. GRID META OBJECT
Computational Reflection [3] and its implementation techniques [4] support the monitoring of objects by allowing a base object to be monitored by a connected Meta object. The Meta object maintains a causal connection with the base object and uses a data structure to store the monitoring information. Our implementation of the Grid Meta Object [13] supports the JAVA [6] platform with built-in monitoring support (see Figure 1).
Fig 1. Base and Meta Object with the Implementation Framework
The major components [7, 8, 9] of the implementation framework are based on JAVA technology to support heterogeneous grid environments. Our work has proposed several enhancements to the traditional Meta object structure
to favor grid environments. We have proposed the following enhanced set of properties:
• Ability to maintain base object status information: This is the same as the classic base and Meta object model, where the ability to maintain base object status information is acquired through a suitable structure and a causal connection with the base object.
• Ability to acquire node identification/environment information: An interesting property is the base object's access to its execution environment. This is implicitly known to the base object and is not modeled explicitly. For a grid Meta object, however, the Meta object must have some property that identifies its execution node by a unique identifier such as the node's IP address. This assists migration by identifying the source and destination nodes of the Meta object.
• Ability to send and receive through a broadcast channel: The Meta object should be able to send and receive data through some kind of broadcast channel, so that exact replicas of the Meta object can be maintained on other grid nodes for specific purposes.
• Ability to maintain migration and special action flag methods: A Meta object should be able to store specific action flag methods and their associated actions, so that, for example, if an exception occurs in the base object or a specific function is called on it, the Meta object can perform some special action, such as broadcasting a message to an associated remote Meta object or node, or starting to negotiate the migration process to another node.
• Ability to maintain a specific call and action structure: This is the same as the case above, with the only difference that there may be a stack of actions and an alternate action defined.
• Ability to serialize: A Meta object should have the ability to be serialized on streams, so that it can persist and its life span is not restricted to the in-memory execution life cycle.
Persisting a Meta object helps further analysis of Meta objects after the execution life cycle has finished, and broadcasting a Meta object through streams such as disk or socket becomes a possibility.
3.1 ADVANTAGES OF GRID META OBJECT
The proposed Meta object properties offer the following advantages to the grid environment:
232 Int'l Conf. Grid Computing and Applications | GCA'08 |
Fault Tolerance
The base and Meta object reflective model allows full fault tolerance. A Meta object constantly maintains the base object's state information and can be used as a checkpoint mechanism in the event of a base object crash. This model can be extended beyond a single physical machine if the Meta object is allowed to broadcast its status information to another machine.
Dynamic Adaptability
Reflective systems are considered self-aware and self-healing. Systems based on computational reflection can adapt to changes as they happen within the execution boundaries. An example of self-aware and self-healing systems can be reviewed in [5].
Persistence of State-Specific Monitoring Data
Meta object serialization allows archiving of state information for various analytical purposes. Serialization on streams can be used for broadcast and remote monitoring of Meta objects within grid environments.
Backup Objects
Utilizing computational reflection, backup Meta objects can be created, bringing reliability and fault tolerance to classes of application for which execution reliability is a mandatory property.
3.2 USAGE OF GRID META OBJECT
The implemented Meta object supports the Java platform and, with the incorporated properties, can be used within a larger distributed computing framework. The proposed usage model allows the Meta object to be used by grid job management components such as the scheduler and its sub-components, such as the monitor (see Figure 2).
Fig. 2. Grid Meta Object utilization within a larger framework
In this model a grid Meta object is allocated to a base object representing a grid job. The grid Meta object maintains the base job object's status-related information so that it can be utilized later. The grid Meta object can be serialized to a suitable object store should the job be aborted due to a resource constraint. The broadcast feature of the Meta object can assist in keeping a remote replica of the executing job's Meta object. Both serialization and broadcast can employ encryption to address security concerns. The serialized Meta object can later be used to re-create the base object and set its status to a specific point of interest, thus allowing the job to continue from the last check point at a different node (see Figure 3). Depending on the nature of the submitted grid job and the capabilities of the Meta object bound to that job, several other usages of these Meta objects are possible as well.
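The checkpoint-and-continue step can be sketched as follows (our own illustration, not the paper's implementation; `BaseJob`, `checkpoint`, and `restore_from` are hypothetical names):

```python
# Hypothetical checkpoint/restore flow for a base job object.
class BaseJob:
    def __init__(self, job_id):
        self.job_id = job_id
        self.progress = 0

    def step(self):
        self.progress += 1

def checkpoint(job):
    # The meta object captures the state needed to re-create the base object.
    return {"job_id": job.job_id, "progress": job.progress}

def restore_from(state):
    # Re-create the base object (e.g. on a different node) and set its
    # status to the last check point, so the job continues from there.
    job = BaseJob(state["job_id"])
    job.progress = state["progress"]
    return job

job = BaseJob("job-7")
job.step(); job.step()
migrated = restore_from(checkpoint(job))   # e.g. on the destination node
assert migrated.progress == 2
```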
Fig. 3. Serialized Grid Meta Object utilization within grid environment
4. COST AND OVERHEADS
The Grid Meta object adds some CPU and memory overheads. These overheads are attributable to monitoring the base object by intercepting its method calls and maintaining this information in a suitable data structure. For certain types of grid jobs the Meta object can be configured to maintain only the last call rather than the complete call trace; this results in a smaller memory overhead, but such usage is only suitable for applications which can continue from the last method call. We are currently investigating these overheads for different types of data- and computation-intensive grid applications.
5. CONCLUSION AND FUTURE WORK
This paper has presented a proposed model for efficient Grid Meta object usage as part of a larger framework addressing issues related to contractual resource sharing within an ad-hoc Grid platform. The work is a continuation of our progress as reported in [10, 11, 12, 13]. It attempts to identify Meta object properties suited to grid computing environments and argues that Meta objects should be granted their own specific properties beyond their conventional usage limitations. We are currently investigating the overheads and possibilities of Grid Meta object usage for various kinds of data- and computation-intensive grid applications.
REFERENCES
[1] I. Foster, C. Kesselman, and S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations", International Journal of High Performance Computing Applications, vol. 15, pp. 200-222, 2001.
[2] I. Foster and C. Kesselman, "Globus: A Metacomputing Infrastructure Toolkit", International Journal of Supercomputer Applications, vol. 11, no. 2, pp. 115-128, 1997.
[3] P. Maes, "Concepts and Experiments in Computational Reflection", OOPSLA Conference on Object-Oriented Programming: Systems and Applications, pp. 147-154, 1987.
[4] N. Yaacob, "Reflective Computation in Concurrent Object-based Languages: A Transformational Approach", PhD Thesis, University of Exeter, UK, 1999.
[5] G. S. Blair, G. Coulson, L. Blair, H. Duran-Limon, P. Grace, R. Moreira and N. Parlavantzas, "Reflection, Self-Awareness and Self-Healing", Proceedings of the Workshop on Self-Healing Systems '02, Charleston, SC, 2002.
[6] J. Gosling, B. Joy, G. Steele and G. Bracha, "The Java Language Specification, Third Edition", Addison-Wesley, Reading, Massachusetts, 2004.
[7] S. Chiba, "Load-time Structural Reflection in Java", ECOOP 2000 - Object-Oriented Programming, LNCS 1850, Springer Verlag, pp. 313-336, 2000.
[8] A. W. Keen, T. Ge, J. T. Marin, R. Olsson, "JR: Flexible Distributed Programming in Extended Java", ACM Transactions on Programming Languages and Systems, vol. 26, no. 3, pp. 578-608, 2004.
[9] www.hyperic.com/products/sigar.html
[10] N. Yaacob, A. N. Godwin and S. Alam, "Reflective Approach to Concurrent Resource Sharing for Grid Computing", Proceedings of the 2005 International Conference on Grid Computing and Applications, GCA '05, Las Vegas, USA, ISBN: 1-932415-57-2.
[11] N. Yaacob, A. Godwin and S. Alam, "Meta Level Architecture for Management of Resources in Grid Computing", International Conference on Computational Science and Engineering (ICCSE 2005), Istanbul, June 2005, pp. 299-304, ISBN: 975-561-266-1.
[12] N. Yaacob, A. Godwin and S. Alam, "Resource Monitoring in Grid Computing Environment Using Computational Reflection and Extended JAVA - JR", 2nd International Computer Engineering Conference: Engineering the Information Society, ICENCO 2006, Faculty of Engineering, Cairo University, Cairo, Egypt, December 26-28, 2006.
[13] N. Yaacob, A. N. Godwin and S. Alam, "Developing a Reflective Framework for Resource and Process Monitoring within Grid Environments", Proceedings of the 2007 International Conference on Grid Computing and Applications, GCA '07, Las Vegas, USA.
Enhanced File Sharing Service among Registered Nodes in Advanced Collaborating Environment
for Efficient User Collaboration
Mohammad Rezwanul Huq, Prof. Dr. M. A. Mottalib,
Md. Ali Ashik Khan, Md. Rezwanur Rahman, Tamkin Khan Avi
Department of Computer Science and Information Technology (CIT), Islamic University of Technology (IUT),
Gazipur, Dhaka, Bangladesh.
[email protected], [email protected], [email protected], [email protected], [email protected],
Abstract – Nowadays, it is extremely necessary to provide file sharing and adaptation in a collaborative environment. Through file sharing and adaptation among users at different nodes in a collaborative environment, the highest degree of collaboration can be achieved. A file sharing and adaptation framework has already been proposed, in which users share adapted files among themselves by invoking a file sharing and adaptation service built on top of an advanced collaborating environment. The goal of this adaptation approach is to provide the best possible adaptation scheme and convert the file according to this scheme, keeping in mind the user's preferences and device capabilities.
In this paper, we propose some new features for the file sharing and adaptation framework to achieve faster and more efficient collaboration among users in an advanced collaborating environment. We propose a mechanism that enables other slave ACE nodes, along with the requesting node, to share the adapted files, provided those nodes have device capabilities similar to the requesting slave ACE node and are registered with the master ACE node. This approach radically reduces not only the chance of redundant requests from different slave ACE nodes for sharing the same file in adapted form, but also the adaptation overhead of adapting and sharing the same file for different slave ACE nodes. The registered nodes also have the privilege of uploading files to the server via the master node. We distinguish each file according to its hit ratio, derived from historical data, so that only frequently accessed files are shared automatically among all other authenticated slave ACE nodes. This approach leads to better collaboration among the users of ACE.
Keywords: File Sharing, Registered Node, Hit Ratio, User Collaboration
1 Introduction
The concept of an advanced collaborating environment is essential for providing interactive communication among a group of users. Traditional video conferencing has become obsolete due to advancements in networking and multimedia technology. The 3R factor (Right People, Right Data, and Right Time) is the major concern of ACE, in order to perform a task, solve a problem, or simply discuss something of common interest [1]. Figure 1 depicts the concept of an Advanced Collaborating Environment (ACE), where media, data, and applications are shared among participants joining a collaboration session via multi-party networking [2]. Some early prototypes of ACE have mainly been applied to large-scale distributed meetings, seminars or lectures, collaborative work sessions, tutorials, training, etc. [3], [4]. ACE has been realized on top of the Access Grid, a group-to-group collaboration environment with an ensemble of resources including multimedia, large-format displays, and interactive conferencing tools, and it has very effectively envisioned the implementation of ACE in real-life scenarios. The terms venue server and venue come from the Access Grid multi-party collaboration system [4], [5]. In [6], a file sharing and adaptation framework has been proposed for ACE. Through this framework, users at slave ACE nodes can share adapted files through the master ACE node. The master ACE node has the capability to communicate directly with a venue through the venue client as well as the venue server [6]. A slave ACE node has less capability than a master ACE node in terms of device configuration, and it cannot communicate with the venue and venue server directly [6]. Figure 2 shows the connectivity among master ACE
nodes, slave ACE nodes, and the venue server. For file adaptation, a hybrid approach has been described that considers the user's preferences as well as the user's device capabilities. In our work, we emphasize building extended features on top of [6] so that the degree of user collaboration can be radically increased. Our extended features allow any slave ACE node to share adapted files depending on its own device capabilities, even when the request originated from another slave ACE node. Moreover, we give registered slave ACE nodes the privilege of uploading files to the venue server. To provide intelligent collaboration, only frequently accessed files are automatically shared among slave ACE nodes. To serve this purpose, we use the hit ratio of file accesses to classify files into two categories: hot and cold. The rest of the paper is organized as follows. Section 2 specifies the problem statement. Section 3 discusses related work, followed by the contribution of our work in Section 4. Section 5 describes our proposed mechanism, followed by the current implementation status and implementation issues in Section 6. Finally, we draw conclusions and discuss future work in Section 7.
2 Problem Statement
In this paper, we enhance the efficiency of the file sharing and adaptation framework described in [6] by introducing slave node registration to facilitate file uploading to the venue server, reducing the redundancy of file adaptation requests, and applying hit ratio analysis for better collaboration. These new features allow users at slave ACE nodes to share files in an efficient and faster way, which increases user collaboration. Our problem statement may thus be summarized as follows: to provide a file uploading privilege as well as efficient and faster file sharing capabilities to users at slave ACE nodes by reducing not only the chance of redundant requests from other slave ACE nodes for sharing the same file, but also the adaptation overhead of adapting the same file at the master node; moreover, to provide better and intelligent collaboration among users by advertising frequently accessed files through multicast messages to registered slave ACE nodes.
3 Related Work
Much research has been initiated in the area of context-aware computing in the past few years. Many projects have been initiated for developing interactive collaboration, enabling users to collaborate with each other for sharing files and other media types. Gaia [7], [8], [9] is a distributed middleware infrastructure that manages seamless interaction and coordination among software entities and heterogeneous networked devices. A Gaia component is a software module that can be executed on any device within an Active Space. Gaia provides a number of services, including a context service, an event manager, a presence service, a repository, and a context file system. On top of these basic services, Gaia's application framework provides mobility, adaptation, and dynamic binding of components. Aura [10], [11] allows a user to migrate an application from one environment to another such that the execution of its tasks maximizes the use of available resources and minimizes user distraction. Two middleware building blocks of Aura are Coda and Odyssey. Coda is an experimental file system that offers seamless access to data [10] by relying heavily on caching. Odyssey includes application-aware adaptations that offer energy-awareness and bandwidth-awareness to extend battery life and improve multimedia data access on mobile devices. The work on user-centric content adaptation [12] proposed a decision engine with QoS awareness, which can automatically negotiate the appropriate adaptation decision to use in the synthesis of an optimal adapted version. The decision engine will look for the best
Figure 1: Advanced Collaborating Environment (ACE)
Figure 2: Master and Slave ACE nodes
trade-off among various parameters in order to reduce the loss of quality across various domains. The decision engine has been designed for content adaptation in mobile computing environments. The work described in [6] proposed a file sharing and adaptation framework in which users share adapted files among themselves by invoking a file sharing and adaptation service built on top of an advanced collaborating environment. The goal of this adaptation approach is to provide the best possible adaptation scheme and convert the file according to this scheme, keeping in mind the user's preferences and device capabilities. However, the proposed framework does not allow slave ACE nodes to upload files. Moreover, only the requesting slave ACE node receives the adapted file, while other ACE nodes may initiate further requests to share the same file, which incurs considerable processing overhead at the master ACE node. We have tried to extend the service provided by [6] in our work. In the following sections, we discuss in detail our proposed mechanism to enhance the file sharing service of [6].
4 Our Contribution
To the best of our knowledge there is not much work on issues like data adaptation and file sharing in advanced collaborating environments, though related fields have been explored, as described in the previous section. The file sharing and adaptation framework illustrated in [6] includes a file sharing service and a data adaptation service; the file sharing service is demonstrated by the realization of the data adaptation service. These two services are necessary to provide effective collaboration among users in an advanced collaborating environment, but there are some problems associated with this approach. For example, in the existing framework, the slave ACE nodes cannot upload any data or files to the venue server, which hinders maximum collaboration among ACE nodes. Another drawback is that, if there are multiple requests from different nodes of compatible device capabilities, the data adaptation technique must be repeated several times, which is a clear overhead for the overall system. Furthermore, no automated features for enhancing user satisfaction are found in the existing framework. Our target is to provide some extended features on top of the existing framework so that the highest degree of collaboration among users can be realized. In this paper, we identify these problems and provide effective solutions to address them. To enhance user collaboration, we propose node registration, which ensures a minimal privilege for the slave ACE nodes to upload files to the venue server via the master node. We also propose a data analysis approach based on user preferences along with device capability records, which drastically reduces the chance of redundant file adaptation
requests, the execution of the adaptation process, and the communication overhead. Finally, we devise a mechanism that gives users an intelligent and easy collaboration experience based on the hit ratio of files. The comparison shows that our work encompasses a meaningful advancement over the aforementioned work in increasing slave node activities to an acceptable extent, reducing the redundant complexity of data adaptation requests, decreasing the communication overhead at the master ACE node with both the venue server and slave nodes, and enhancing overall collaboration by considering the hit ratio. We believe that our effort will play a leading role in overcoming the deficiencies of this framework and also break new ground for further advancement in this field of research.
5 Proposed Mechanism
In this section, we describe our proposed mechanism in detail. Before going into the details, we define some related terminology.
• User Registration: In ACE, the user registration is normally done by providing e-mail address along with other necessary information.
• Node registration: Node registration is a proposed feature of our paper, where the network admin of a venue server will authenticate some of the slave ACE nodes as registered node. Thus these nodes would have the privilege of uploading files to the venue server.
• Requesting node: The slave ACE node that requests a file from the master ACE node is termed the requesting node.
• Hit ratio: The hit ratio is the ratio between the number of times a file is requested and the total number of file requests. Mathematically, for any particular file f1: Hit ratio Hf1 = (number of times f1 is requested) / (total number of file requests).
• File counter: A process that counts the number of times a file is requested.
• Threshold Value: It is a dynamic value that is used in our framework to determine the ranking of files according to their hit ratio.
• Hot listed files: Files for which the number of requests exceeds the threshold value are termed hot listed files.
• Cold listed files: Files for which the number of requests remains below the threshold value are termed cold listed files.
Figure 3: Overall Block Diagram of Proposed File Sharing Service
• File Cache: It is a temporary storage where the files are stored for a limited period of time after successful adaptation.
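The hit-ratio and file-counter definitions above can be sketched as follows (a minimal Python illustration under our own naming, not the paper's code):

```python
# Sketch of the hit-ratio bookkeeping defined above.
class FileCounter:
    def __init__(self):
        self.counts = {}          # per-file request counts
        self.total_requests = 0   # total number of file requests

    def record_request(self, filename):
        self.counts[filename] = self.counts.get(filename, 0) + 1
        self.total_requests += 1

    def hit_ratio(self, filename):
        # H_f = (number of times f is requested) / (total file requests)
        if self.total_requests == 0:
            return 0.0
        return self.counts.get(filename, 0) / self.total_requests

fc = FileCounter()
for f in ["a.pdf", "a.pdf", "b.mp4", "a.pdf"]:
    fc.record_request(f)
assert fc.hit_ratio("a.pdf") == 0.75
```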
Figure 3 depicts the overall block diagram of the proposed file sharing service. We explain the overall mechanism by decomposing it into several parts; the following subsections describe the complete mechanism of our proposed enhanced file sharing service.
5.1 File Uploading Mechanism
At first, the user connects from a slave ACE node to the master ACE node by providing an e-mail address and the master ACE node address. The system then checks whether the user is already registered. If not, the system checks whether the node fulfills a minimum node status for being registered. If it is eligible, the network admin registers the node for uploading files; otherwise, the node is not permitted to upload. If the node has already been registered, it may request the master ACE node to upload one or more files to the venue server. Lastly, the master ACE node takes the file(s) from the slave node and uploads them to the server. Figure 4 depicts the file uploading mechanism of the system.
Figure 4: Flow Chart of File Uploading Mechanism
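The uploading flow can be sketched in Python (the paper's implementation language) as follows. All class and function names here (`Registry`, `Admin`, `VenueServer`, `handle_upload`) are our own illustrations, not the actual implementation:

```python
class Registry:
    """Tracks which slave ACE nodes are registered."""
    def __init__(self):
        self._nodes = set()
    def add(self, node):
        self._nodes.add(node)
    def __contains__(self, node):
        return node in self._nodes

class Admin:
    # Hypothetical eligibility rule standing in for the network admin's check.
    def meets_minimum_node_status(self, node):
        return node.startswith("slave")

class VenueServer:
    def __init__(self):
        self.files = []
    def store(self, f):
        self.files.append(f)

def handle_upload(node, files, registry, admin, venue_server):
    if node not in registry:                       # node not yet registered
        if admin.meets_minimum_node_status(node):
            registry.add(node)                     # network admin registers it
        else:
            return "upload not permitted"
    for f in files:                                # master ACE node uploads file(s)
        venue_server.store(f)
    return "uploaded"

registry, admin, server = Registry(), Admin(), VenueServer()
assert handle_upload("slave-3", ["doc.pdf"], registry, admin, server) == "uploaded"
assert handle_upload("guest-1", ["x.txt"], registry, admin, server) == "upload not permitted"
```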
5.2 File Sharing Mechanism
First, the user connects from a slave ACE node to the master ACE node by providing an e-mail address, user preference, and the master ACE node address. After being connected, the slave ACE node requests a file, and the master ACE node increases the hit counter for that particular file. Two checks are then performed. First, the system checks whether the hit counter exceeds the threshold value; if so, an advertisement for sharing the adapted file is multicast to all slave nodes of compatible device capabilities. Second, the system checks whether an adapted version of the requested file with the same preference already exists in the cache; if it is available, the file is simply retrieved from the cache and sent, and if not, the normal file adaptation approach is followed. Figure 5 depicts the file sharing mechanism of the system.
Figure 5: Flow Chart of File Sharing Mechanism
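The request-handling decision flow can be sketched as follows (a minimal illustration; the function, state layout, and action labels are our own, not taken from the implementation):

```python
# Sketch of the sharing flow: bump the hit counter, multicast an advertisement
# when the threshold is exceeded, and serve from cache when an adapted version
# with the same preference already exists.
def handle_request(filename, preference, state):
    state["counter"][filename] = state["counter"].get(filename, 0) + 1
    actions = []
    if state["counter"][filename] > state["threshold"]:
        actions.append("multicast-advertisement")    # to compatible slave nodes
    key = (filename, preference)
    if key in state["cache"]:
        actions.append("serve-from-cache")
    else:
        state["cache"][key] = "adapted:" + filename  # normal adaptation path
        actions.append("adapt-and-send")
    return actions

state = {"counter": {}, "cache": {}, "threshold": 2}
handle_request("talk.ppt", "pdf", state)   # adapted and cached
handle_request("talk.ppt", "pdf", state)   # served from cache
assert handle_request("talk.ppt", "pdf", state) == [
    "multicast-advertisement", "serve-from-cache"]
```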
5.3 Cache Updating Mechanism
At the beginning of this process the system checks whether the periodic time interval for updating the cache has elapsed. If so, the master ACE node analyzes the hit list data by comparing each file counter with the threshold value. If the file counter of a file exceeds the threshold value, that file is treated as a hot file and remains in the cache. If the file counter does not exceed the threshold, the file is treated as a cold item and is deleted from the cache. Figure 6 depicts the cache updating mechanism of the system.
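The periodic update step can be sketched as follows (an illustrative Python fragment under our own naming, not the implemented module):

```python
# Sketch of the periodic cache update: files whose counter exceeds the
# threshold stay in the cache as hot files, the rest are evicted as cold.
def update_cache(cache, counter, threshold):
    for filename in list(cache):
        if counter.get(filename, 0) > threshold:
            continue            # hot file: remains in the cache
        del cache[filename]     # cold item: deleted from the cache

cache = {"a.pdf": "adapted", "b.mp4": "adapted"}
update_cache(cache, {"a.pdf": 9, "b.mp4": 1}, threshold=5)
assert list(cache) == ["a.pdf"]
```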
Figure 6: Flow Chart of Cache Updating Mechanism
6 Current Implementation Status and Future Issues
The master file sharing service has been implemented as an AG shared application and was therefore written in Python. This module retrieves files from the data store at the venue server and sends them for appropriate adaptation. Figure 7 shows the user interface for connecting to the master ACE node and choosing a specific file.
Figure 7: Connecting Master ACE Node and Choosing File
The slave file sharing service has been implemented as a stand-alone application, also written in Python. It provides the user interface for entering the venue URL and selecting the desired file, and it shows the confirmation of successful reception of the adapted file to users at slave ACE nodes (see Figure 8).
Figure 8: Confirmation Window of Successful Reception
The data adaptation service is a stand-alone application. This module has a decision engine which provides the appropriate adaptation scheme for converting the original file. At the beginning, the user enters his/her e-mail address and preferred file type for sharing, then presses the appropriate button for connecting to the master ACE node (see Figure 7). After the connect button is pressed, the master file sharing service connects to the venue data store and retrieves files of the user-specified type. The user then selects one of the files. The specified file is downloaded through the master file sharing service, the decision engine selects the appropriate adaptation method, and the file is passed to the data adaptation service. The function DataAdapter() takes the file and converts it based on the decision provided by the decision engine. The adapted file is then sent to the slave ACE node by the master file sharing service and stored in the local storage of the slave ACE node. Figure 8 shows the confirmation screen indicating that the adapted file has been successfully received by the slave ACE node. Several future implementation issues remain, which are discussed in the rest of this section; we aim to implement each of these features soon and incorporate them into our current prototype. The basic requirement of the file upload feature is node registration. The network admin would register an eligible slave ACE node to provide it with the capability of uploading files to the venue server. As depicted in the
figure 9, the registered_node table would have node_id as the primary key and device_id as a foreign key referencing the device_profile table. The device_MAC field of this table is the main factor used to distinguish each machine. For the enhanced file sharing feature, there are two main concerns: whether any adapted version of the file exists in the cache, and whether the device capability of the requesting node matches one of the compatible nodes listed in the table. We explain each of these concerns in turn. The existence of an adapted version of a file in the cache mainly depends on the hit ratio based analysis. A dynamic threshold value is used to rank files according to the number of times each file has been requested: if the file counter reaches the threshold value, the file is labeled a hot file, and otherwise a cold file. The formula for calculating the hit ratio is devised as follows:
• Hit Ratio, T = (file counter value / total number of requests arrived) × 100%
If T > 30% for any particular file, we term that file a hot file; hot files are not deleted from the file cache. Files whose T value is below this level are termed cold files. To update the file cache and obtain free space we need to drain out some cold files: if T < 5% for a particular file, that cold file is deleted from the cache. Note that the deletion of files is done periodically, and this period is a variable measure; for our system we may consider its value equal to one week. The second concern for the file sharing feature is device capability compatibility, where we emphasize the lowest possible capability. Whenever a particular adapted file is in the file cache, the system searches for the slave nodes that are compatible with the device capability used for adapting that file. A node with capability vector D = (d1, d2, ..., dn) is acceptable if and only if
di ≥ vi (i = 1, ..., n)
where di is the device configuration in the ith dimension and vi is the lowest possible device capability in the ith dimension.
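The compatibility test above reduces to a per-dimension comparison, sketched here (the dimension ordering and example values are illustrative):

```python
# Sketch of the compatibility test: a node with capability vector
# D = (d1, ..., dn) is acceptable iff di >= vi in every dimension i,
# where v is the lowest capability the adapted file requires.
def is_compatible(d, v):
    return len(d) == len(v) and all(di >= vi for di, vi in zip(d, v))

# e.g. dimensions (CPU MHz, RAM MB, screen width) -- illustrative values
assert is_compatible((1800, 512, 1024), (1000, 256, 800))
assert not is_compatible((800, 512, 1024), (1000, 256, 800))
```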
We will try to implement the aforementioned features as soon as possible, considering all implementation alternatives. Our proposed enhanced features can be useful for efficient file sharing. We will implement our modules in Python so that they can be easily plugged into ACE, which has been built on top of Access Grid.
Figure 9: Schema Design for Backend Database
7 Conclusion and Future Work
In this paper, we have presented enhanced features for the file sharing and data adaptation framework in ACE. Our proposed features enable users at slave ACE nodes to share adapted files faster and more intelligently. Moreover, the file uploading mechanism allows users at slave ACE nodes to upload files to the venue server via the master ACE node. Together, these features realize the improved file sharing service. There is much interesting work to be done for efficient file sharing and adaptation. We plan to implement P2P file sharing among the slave ACE nodes. We may also explore issues like automated device identification for user and node registration as well as user requirement prediction, which will ensure a higher degree of user collaboration. We believe our proposal for an enhanced file sharing service and the prototype implementation realizing it will be considered a leading work in this domain.
References
[1] B. Corri, S. Marsh and S. Noel, "Towards Quality of Experience in Advanced Collaborative Environments", Proc. of the 3rd Annual Workshop on Advanced Collaborative Environments, 2003.
[2] S. Han, N. Kim, and J. Kim, "Design of Smart Meeting Space based on AG Service Composition", AG Retreat 2007, Chicago, USA, May 2007.
[3] R. Stevens, M. E. Papka and T. Disz, "Prototyping the Workspaces of the Future", IEEE Internet Computing, pp. 51-58, 2003.
[4] L. Childers, T. Disz, R. Olson, M. E. Papka, R. Stevens and T. Udeshi, "Access Grid: Immersive Group-to-Group Collaborative Visualization", Proc. of the Immersive Projection Technology Workshop, 2000.
[5] Access Grid, http://www.accessgrid.org/
[6] M. R. Huq, Y.-K. Lee, B.-S. Jeong, S. Lee, "Towards Building File Sharing and Adaptation Service for Advanced Collaborating Environment", International Conference on Information Networking (ICOIN 2008), Busan, Korea, January 23-25, 2008.
[7] A. Ranganathan and R. H. Campbell, "A Middleware for Context-Aware Agents in Ubiquitous Computing Environments", ACM/IFIP/USENIX International Middleware Conference, Brazil, June 2003.
[8] The Gaia Project, University of Illinois at Urbana-Champaign, http://choices.cs.uiuc.edu/gaia/, 2003.
[9] M. Roman, C. Hess, R. Cerqueira, A. Ranganathan, R. Campbell and K. Nahrstedt, "A Middleware Infrastructure for Active Spaces", IEEE Pervasive Computing, vol. 1, no. 4, 2002.
[10] M. Satyanarayanan, Project Aura, http://www-2.cs.cmu.edu/~aura/, 2000.
[11] M. Satyanarayanan, "Mobile Information Access", IEEE Personal Communications, http://www-2.cs.cmu.edu/~odyssey/docdir/ieeepcs95.pdf, Feb. 1996.
[12] W. Y. Lum and F. C. M. Lau, "User-Centric Content Negotiation for Effective Adaptation Service in Mobile Computing", IEEE Transactions on Software Engineering, vol. 29, no. 12, Dec. 2003.
Overbooking in Planning Based Scheduling Systems ∗
Georg Birkenheuer #1, Matthias Hovestadt ∗2, Odej Kao ∗3, Kerstin Voss #4
#Paderborn Center for Parallel Computing, Universität Paderborn
Fürstenallee 11, 33102 Paderborn
[email protected] [email protected]
∗Technische Universität Berlin, Einsteinufer 17, 10587 Berlin, Germany
[email protected] [email protected]
Abstract
Nowadays cluster Grids encompass many cluster systems with possibly thousands of nodes and processors, offering compute power that was inconceivable only a few years ago. For attracting commercial users to use these environments, the resource management systems (RMS) have to be able to negotiate on Service Level Agreements (SLAs), which define all service quality requirements of such a job, e.g. deadlines for job completion. Planning-based scheduling seems to be well suited to guarantee the SLA adherence of these jobs, since it builds up a schedule for the entire future resource usage. However, it demands the user to give runtime estimates for his job. Since many users are not able to give exact runtime estimates, it is common practice to overestimate, thus reducing the number of jobs that the system is able to accept. In this paper we describe the potential of overbooking mechanisms for coping with this effect.
Keywords: Grid-Scheduling, Overbooking, Resource Management, SLA
1 Introduction
Grid Computing is providing computing power for scientific and commercial users. Following the common evolution in computer technology, the system and network performance have constantly increased. The latest step in this
∗The authors would like to thank the EU for partially supporting this work within the 6th Framework Programme under contract IST-031772 Advanced Risk Assessment and Management for Trustable Grids (AssessGrid).
process was the introduction of multiple cores per processor, making Grid nodes even more powerful. This evolutionary process particularly affects the scheduling components of resource management systems that are used for managing cluster systems. On the one hand, the increasing number of nodes, processors, and cores results in an increased degree of freedom for the scheduler, since the scheduler has more options of placing jobs on the nodes (and cores) of the system. On the other hand, also the requirements of users have changed. Commercial users ask for contractually fixed service quality levels, e.g. the adherence of deadlines. Hence, the scheduler has to respect additional constraints at scheduling time.
Queuing is a technique used in many currently available resource management systems, e.g. PBS [1], LoadLeveler [2], Grid Engine [3], LSF [4], or Condor [5]. Since queuing-based RMS only plan for the present, it is hard to provide guarantees on future QoS aspects.
Planning-based RMS make functionalities like advance reservations easy to implement. If a new job enters the system, the scheduling component of the RMS tries to place the new job into the current system schedule, taking aspects like project-specific resource usage limits, priorities, or administrative reservations into account. In planning-based systems it is mandatory for the user to specify the runtime of his job. When negotiating Service Level Agreements, this capability is essential for the provider's decision-making process. As an example, using fixed reservations, specific resources can be reserved in a fixed time interval. In addition to plain queuing, the Maui [6] scheduler also provides planning capabilities. A few other RMS like OpenCCS [7] have been developed as planning-based systems from scratch.
However, (fixed) reservations in planning-based systems potentially result in a high level of fragmentation of the system schedule, preventing the achievement of optimal resource utilization and workload. Moreover, users tend to overestimate the runtime of their jobs, since a planning-based RMS would terminate jobs once their user-specified runtime has expired. This termination is mandatory if succeeding reservations are scheduled to be executed on these resources. Overestimation of the job runtime results in an earlier availability of the assigned resources than expected by the scheduler, i.e. at time tr instead of tp. Currently, mechanisms like backfilling with new jobs or rescheduling (starting, if possible, an already planned job earlier) are initiated to fill the gap between tr and the planned start time ts ≥ tp of the succeeding reservation. Due to conflicts with the earliest possible execution time, moving arbitrary succeeding jobs to an earlier starting time might not be possible. In particular, the probability of executing a job earlier might be low if users have strict time intervals for the job execution, since a planning-based scheduler rejects job requests which cannot be planned according to the time and resource constraints in the schedule. An analysis of cluster logfiles revealed that users overestimated the runtime of their jobs by a factor of two to three [8]. For the provider this might result in a bad workload and throughput, since computing power is wasted if backfilling or rescheduling cannot start other jobs earlier than initially planned. To prevent poor utilization and throughput, overbooking has proven its potential in various fields of application for increasing the system utilization and the provider's profit. As a matter of fact, overbooking results in resource usage conflicts if the user-specified runtime turns out to be realistic, or if the scheduler is working with an overestimation presumption that is too high.
To compensate for such situations, the suspension and later restart of jobs are important instruments of the RMS. To suspend a running job without losing already performed computation steps, the RMS makes a snapshot of the job, i.e. it stores the job's process environment (including memory, messages, registers, and program counter), to be able to migrate the job to another machine or to restart it at a later point in time. In the EC-funded project HPC4U [9] the necessary mechanisms have been integrated into the planning-based RMS OpenCCS to generate checkpoints and migrate jobs. These fault-tolerance mechanisms are the basis for a profitable overbooking approach as presented in this paper, since stopping a job does not imply losing computation steps already performed. Hence, gaps between advance reservations can be used by jobs which will be finished before the next reservation has to be started. In the next section we discuss the related work, followed by our ideas for overbooking, which are described in Section 3. In Section 4 we conclude the paper with a summary of our ideas and present plans for future work.
2 Related Work
The idea of overbooking resources is a standard approach in many fields of application, like flight, hospital, or hotel reservations. Overbooking beds, flights, etc. is a consequence of the fact that a specific percentage of reservations are not occupied, i.e. usually more people reserve hotel rooms [10] or buy flight tickets [11, 12] than actually appear to use their reservation. The examples of hotels and airlines illustrate the idea we are also following in the provisioning of compute resources. Overbooking in the context of computing Grids differs slightly from those fields of application, since therein the assumption is made that fewer customers utilize their reservations than booked. In Grid computing all jobs that have been negotiated will be executed; however, in planning-based systems users significantly overestimate the job duration. Comparing the usage of a compute resource and a seat in an aircraft is not meaningful, since generally in computing Grids no fixed intervals for the resource utilization exist, whereas a seat in an aircraft cannot be occupied after the aircraft has taken off. As a consequence, results and observations from overbooking in the classical fields of application cannot be reused in Grid scheduling. As a non-classical field of application, [13] presents an overbooking approach for web platforms; however, its challenges also differ from the Grid environment.
In the Grid context, considering overbooking approaches is most sensible for planning-based scheduling, since in queuing-based systems even the runtime would have to be estimated and thereby an additional uncertainty would have to be taken into account. Other work concerning overbooking in Grid or HPC environments is rare. In the context of Grid or HPC scheduling the benefits of using overbooking are pointed out, but no solutions are provided [14, 15]. Overbooking is also foreseen in a three-layered protocol for negotiation in Grids [16]. Here, the restriction is made that overbooking is only used for multiple reservations for workflow sub-jobs which were made by the negotiation protocol for optimal workflow planning.
2.1 Planning Approaches
Some work has been done in the scope of planning algorithms. Before showing in Section 3 how overbooking can be integrated, different approaches already developed are described in the following.
2.1.1 Geometry-based approaches
Theoretical approaches for planning-based systems identify scheduling as a special case of bin packing: the width of a bin is defined as the number of nodes generally available and its height equals the time the resource can be
used. As the total usage time for an arbitrary number of jobs does not end, the height of the bin is infinite. Consequently it is not a bin, but rather a strip. Jobs are considered as rectangles having a width equal to the number of required nodes and a height equal to the execution time determined by the user. The rectangles have to be positioned in the strip in such a way that the distances between rectangles in the strip are minimal and the jobs do not overlap each other. Since strip packing is an NP-hard problem, several algorithms have been developed which work with heuristics and are applicable in practice. Reference [17] gives a good overview of strip packing algorithms. Two kinds of strip packing algorithms are distinguished: online and offline. An offline algorithm has information about all jobs to be scheduled a priori, whereas an online algorithm cannot estimate which jobs will arrive in the future. It is obvious that offline algorithms can achieve better utilization results, since all jobs are known and can be scheduled in comparison with each other. The approaches can be divided into several main areas: bottom-left algorithms, which try to put a new job as far to the bottom of the strip and as far left as possible; level-oriented algorithms [18]; split algorithms [18]; shelf algorithms [19]; and hybrid algorithms, which are combinations of the above-mentioned ones.
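As an illustration of the bottom-left heuristic mentioned above, the following sketch (hypothetical code, not from the paper) places jobs, modeled as rectangles of (nodes, runtime), at the lowest and then leftmost feasible position in a strip whose width is the total number of nodes:

```python
from itertools import product

def overlaps(a, b):
    """True if two axis-aligned rectangles (x, y, w, h) intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def bottom_left(jobs, strip_width):
    """Online bottom-left strip packing: width = required nodes, height = runtime.
    Candidate positions are the corners of already placed rectangles; the lowest
    (then leftmost) non-overlapping position inside the strip is chosen."""
    placed = []
    for w, h in jobs:
        xs = sorted({0} | {px + pw for px, py, pw, ph in placed})
        ys = sorted({0} | {py + ph for px, py, pw, ph in placed})
        for y, x in product(ys, xs):          # lowest y first, then leftmost x
            cand = (x, y, w, h)
            if x + w <= strip_width and not any(overlaps(cand, p) for p in placed):
                placed.append(cand)
                break
    return placed
```

For example, `bottom_left([(2, 3), (2, 2), (4, 1)], strip_width=4)` packs the third (full-width) job on top of the first two, giving a makespan of 4.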
2.1.2 Planning for Clusters
In practice, most planning-based systems use first-come first-serve (FCFS) approaches. Grid scheduling has to use an online algorithm, and consequently planning optimal schedules is not possible. A job is scheduled as soon as possible according to the current schedule (containing all jobs previously scheduled) as well as its resource and time constraints. The FCFS approach might lead to gaps which could be prevented if jobs had been scheduled in a different order. To increase the system utilization, backfilling [20] has been introduced to avoid such problems. Conservative backfilling follows the objective of filling free gaps in the schedule produced by FCFS planning without delaying any previously planned job. Simulations show that the overall utilization of systems is increased using backfilling strategies. If delays of not yet started jobs are acceptable, the more aggressive EASY backfilling [21] can further improve the system utilization. However, [22] shows that for systems with high load the EASY approach is not better than conservative backfilling. Furthermore, EASY backfilling has to be used with caution in systems guaranteeing QoS aspects, since delays of SLA-bound jobs might lead to SLA violations implying penalties.
Concluding, much work has been done in the scope of planning-based scheduling. Good resource utilization in Grid systems can be achieved by using backfilling. However, applying conservative backfilling does not result in a 100% workload, since only gaps can be assigned to jobs whose duration is less than the gap length. The more aggressive EASY backfilling strategy does not necessarily provide a better utilization of the system and implies hazards for SLA provisioning. Combining conservative backfilling with overbooking should further increase the system utilization and does not affect the planned schedule. Consequently, using these strategies combined has no disadvantages for non-overbooked jobs and offers the possibility to schedule more jobs than with a simple FCFS approach.
3 Planning-Based Scheduling and Overbooking
This section explains the basic ideas of using overbooking in planning-based HPC systems. A user sends an SLA-bound job request to the system. The SLA defines the job type (batch or interactive), the number r of resources required, the estimated runtime d, as well as an execution window [tstart, tend], i.e. the earliest start time and latest completion time. The planning-based system can determine before agreeing to the SLA whether it is possible to execute the job according to its time and resource constraints.
Fixed Reservations Planning-based scheduling is especially beneficial if users are allowed to define time slots (r, h) in the SLA for interactive sessions, i.e. reserving compute nodes and manually starting (multiple) jobs during the valid reservation time. The difference between an interactive session and a usual advance reservation is that the reservation duration equals tend − tstart. For example, r = 32 nodes should be reserved from 9:00 to 14:00 on 11.12.08, i.e. for a duration of h = 5 hours. Such so-called interactive or fixed reservations increase the difficulty of the planning mechanism, as they are fixed rectangles in the plan and cannot be moved. This has a worse effect on the system utilization than planning only less strictly timed advance reservations. Consequently, supporting fixed reservations steps up the demand for additional approaches like overbooking to ensure a good system utilization. However, such fixed reservations increase the value of using Grid computing for end-users who either have interactive applications or need to run simulations exactly on time, for example for presentations.
For example, a resource management system (RMS) operates a cluster with 32 nodes, and the typical jobs scheduled need 32 nodes and run 5 hours. During the day researchers make two fixed reservations, from 9am to 2pm and from 2pm to 7pm. All other jobs are scheduled as batch jobs. In this scenario, during the night, in the 14 hours between the fixed reservations, only two 5-hour batch jobs could be scheduled, since only these could be completed in full. Consequently, the cluster would be idle for 4 hours. To achieve
a better system utilization, the user would have to shift the fixed reservations one hour every day. Since this is not feasible because of working hours, assuming instead that the batch jobs finish after 4 hours and 30 minutes makes it possible to overbook the resources and execute three batch jobs.
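The arithmetic behind this scenario can be sketched as follows (illustrative code; the window and runtimes are the figures from the example above):

```python
def jobs_fitting(window_hours, job_hours):
    """How many back-to-back batch jobs of a given length fit in the free window?"""
    return int(window_hours // job_hours)

night = 14.0                 # 7pm .. 9am between the two fixed reservations
jobs_fitting(night, 5.0)     # 2 jobs with the user-estimated 5 h runtime
jobs_fitting(night, 4.5)     # 3 jobs if they actually finish after 4 h 30 min
```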
3.1 Overbooking
Overbooking benefits from the fact that users overestimate their jobs' runtime. Consequently their jobs finish before their planned completion time. Taking advantage of this observation will increase the system utilization and thereby the provider's profit.
This section shows the process of integrating overbooking in planning-based systems following conservative backfilling as the basic scheduling strategy. At first, aspects are highlighted which have to be considered when declaring jobs as usable for overbooking. Afterwards the overbooking algorithm is described, followed by remarks concerning fault-tolerance mechanisms, which should prevent job losses in case of wrong estimations of the actual runtime. An example concludes this section.
3.1.1 Runtime-Estimations for Overbooking
The prevention of job losses caused by overbooking is one important task of the scheduler. Furthermore, good predictions of the overestimated runtime form the key factor for profitable overbooking.
At first glance, users overestimate the job duration on average by two to three times the actual runtime [8]. Unfortunately, job traces show that the distribution of the overestimation seems to be uniform [22] and, depending on the trace, 15% to nearly 30% of jobs are underestimated and have to be killed in planning-based systems after the planned completion time. Obviously, even more uncompleted jobs could be killed when using overbooking. For instance, using the average value of overestimation from the statistical measure (which is 150% up to 500%) for overbooking would lead to conflicts, since half of the jobs would be killed. Instead of exhausting overestimations to their full extent, it is more profitable to balance the risk of a too high overestimation against the opportunity to schedule an additional job. Hence, it is often beneficial not to subtract the full average overestimated runtime from the estimated one, but to use only part of this additional time for overbooking. In many cases using only 10% of the overestimation can be sufficient. Given a uniform distribution, this would force 10% of the overbooked jobs to be lost, but 90% would be finished and increase the utilization. These jobs, executed in addition to the default strategy, increase the provider's profit. To obtain good predictions of the overestimated runtime, historical observations on the cluster and of the users are necessary. A detailed analysis has to be performed of the functional behavior of the runtime, since an average or median value alone is not meaningful enough to reduce the provider's risk of causing job losses. If enough monitoring data is available, the following question arises: how can statistical information about actual job runtimes be used to effectively overbook machines? The answer is to analyze several different aspects. First of all, a user-oriented analysis has to be performed, since users often utilize computing Grids for tasks in their main business/working area, which results in submitting the same applications with similar input again and again [23, 24]. Consequently, analyzing estimated and actual runtimes should reveal whether and by how much overestimations are made. If the results show that a user usually overestimates the runtime by a factor x, the scheduler can use xo < x of the overestimated time as a time frame for overbooking. If the statistical analysis shows that the user makes accurate estimations, the scheduler should not use her jobs for overbooking. If the user underestimates the runtime, the scheduler might even plan more time to avoid job kills at the end of the planned execution time. An application-oriented statistical analysis of monitoring data should also be performed in order to identify correlations between overestimations and a specific application. Studies show that automatically determined runtime estimations based on historical information (job traces) can be better than the user's estimation [25, 26, 27]. The condition for their applicability is that enough data is available. In addition to these separate foci, a third analysis should combine the user-oriented and application-oriented approaches in order to identify whether specific users over- or underestimate the runtime when using a specific application. This analysis should result in valuable predictions.
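As a sketch of such a user-oriented analysis, the following hypothetical helper derives a per-user overestimation factor from historical (estimated, actual) runtime pairs and then uses only a small fraction of the observed slack, as suggested above. All names and the 10% default are illustrative assumptions, not the authors' implementation:

```python
from statistics import median

def overestimation_factor(history):
    """history: list of (estimated_runtime, actual_runtime) pairs of one user.
    Returns the median ratio estimated/actual as a robust per-user factor x."""
    return median(est / act for est, act in history)

def overbooking_time(estimate, history, use_fraction=0.1):
    """Time usable for overbooking a new job of the given estimate.
    Only a small fraction of the expected overestimation is used, to balance
    the job-loss risk against the chance to schedule an additional job."""
    x = overestimation_factor(history)
    if x <= 1.0:                       # user estimates accurately or underestimates
        return 0.0                     # do not overbook this user's jobs
    overrun = estimate * (1 - 1 / x)   # expected slack if the factor holds
    return use_fraction * overrun
```

For a user who historically overestimates by a factor of 2, a 300-minute estimate yields 150 minutes of expected slack, of which only 15 minutes would be used for overbooking.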
3.1.2 Algorithmic Approach
This paragraph provides the algorithmic definition of the scheduling strategy for conservative backfilling with overbooking. When a new job j with estimated duration d, number of nodes r, and execution window [tstart, tend] arrives in the system, the following algorithm is used to insert the request into the schedule, which has anchor points ts where resources become available and points tendslot where such slots end:
1. Select, if available, statistical information about the runtime of the application and the runtime estimations of the user. Compare them with the given runtime for the new job j. If
• the estimated runtime d is significantly longer than the standard runtime of the application
• or the user tends to overestimate the runtime of jobs
– then mark the application as promising for overbooking. Assuming a uniform distribution, the duration of the job d can be adjusted to d′ = d/(1 + maxPoF), where maxPoF is the maximum acceptable probability of failure. The time interval oj = d − d′ can be used for overbooking.
• else the job should not be used for overbooking: d′ = d, oj = 0.
2. Find starting point ts for job j, set ts as anchor point:
• Scan the current schedule and find the first point ts ≥ tstart where enough processors are available to run this job.
• Starting from this point, check whether ts + d ≤ tend and, if this is valid, continue scanning the schedule to ascertain that these processors remain available until the job's expected termination: ts + d ≤ tendslot.
• If not,
– check the validity of ts + d′ ≤ tend and whether the processors remain available until the job's expected termination reduced by the time usable for overbooking: ts + d′ ≤ tendslot. If successful, mark the job as overbooked and set the job duration d′ = tendslot − ts.
– If not, check if there are direct predecessors in the plan which end at ts and are usable for overbooking. Then reduce ts by the time a = mink{ok} of those jobs k and try again: ts − a + d′ ≤ tendslot. (In this case, other jobs are also overbooked; nevertheless their runtime is not reduced. If they do not finish earlier than expected, they can still finish, and the overbooked job will be started after their initially planned completion.) If successful, mark the job as overbooked and set the job duration d′ = tendslot − (ts − a).
• If overbooking was not possible, return and continue the scan to find the next possible anchor point.
3. Update the schedule to reflect the allocation of r processors by this job j with the duration d′ for the reservation h = [ts, min(tendslot, ts + d)], starting from its anchor point ts, or earlier at ts − a.
4. If the job's anchor is the current time, start it immediately.
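The steps above can be sketched in simplified form as follows. This is an illustrative model, not the authors' implementation: the schedule is reduced to a sorted list of gaps (ts, tendslot) in which enough processors are assumed free, the statistical check of step 1 is abstracted into a boolean, and rescanning for further anchor points is omitted:

```python
def adjusted_duration(d, overestimates, max_pof):
    """Step 1: if the user/application is known to overestimate, shrink the
    planned duration assuming a uniform overestimation distribution."""
    if overestimates:
        d_prime = d / (1.0 + max_pof)
        return d_prime, d - d_prime      # (planned duration, overbooking slack o_j)
    return d, 0.0

def insert_job(gaps, d, t_start, t_end, overestimates, max_pof, pred_slack=0.0):
    """Steps 2-3, simplified: gaps is a sorted list of (ts, t_endslot) intervals
    with enough free processors; pred_slack is the slack a = min_k{o_k} of the
    direct predecessors. Returns (ts, planned duration, overbooked) or None."""
    d_prime, o = adjusted_duration(d, overestimates, max_pof)
    for ts, t_endslot in gaps:
        ts = max(ts, t_start)
        if ts + d <= t_end and ts + d <= t_endslot:
            return ts, d, False                       # fits with the full estimate
        if ts + d_prime <= t_end and ts + d_prime <= t_endslot:
            return ts, t_endslot - ts, True           # overbook job j itself
        if pred_slack and ts - pred_slack + d_prime <= t_endslot:
            return ts - pred_slack, t_endslot - (ts - pred_slack), True
    return None                                       # reject the request
```

With the figures of the example in Section 3.1.4 (gap 3:00 to 7:00 in minutes past midnight, d = 300, maxPoF = 0.13, predecessor slack of 34 minutes), the job is accepted at 2:26 only when the predecessor is overbooked as well.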
The algorithm defines that a job k which was overbooked by a job j should be run until its completion, or until its planned completion time if it has not finished at time ts − a. Considering SLA-bound jobs, this might be doubtful if fulfilling the SLA of job j were more profitable than that of job k. However, the reservation duration of job k is only reduced after the duration of job j has already been curtailed. Hence, the provider has no guarantee that the SLA of job j would not be violated even if the execution of job k were stopped. Consequently, the scheduler should act conservatively, provide for job k the resources as required, and prevent an SLA violation of job k.
3.1.3 Checkpointing and Migration
By using overbooking the likelihood of conflicts increases, and consequently the need to prevent job losses becomes more important. Which contractor (end-user or provider) has to pay the penalty in case of an uncompleted job depends on the responsibilities. In conservative planning-based systems, the user is responsible for an underestimated runtime of a job. Hence, if the job is killed after resources were provided for the defined runtime, the provider does not care about saving the results. The provider is responsible if the requested resources have not been available for the requested time. Hence, violating an SLA because of overbooking means that the provider has to pay the penalty fee. If the scheduler overbooked a schedule with a job which is planned for less than the user's estimated runtime and has to be killed or displaced for another job, the provider is responsible, since resources have not been available as agreed. The RMS can prevent such conflicts by using the fault-tolerance mechanisms checkpointing and migration [9]. If the execution time has been shortened by the RMS, a checkpoint of the job can be generated at the end of the reservation, i.e. a snapshot of its memory, messages, registers, and program counter. The checkpoint can be stored in a file system available in the network. This allows the uncompleted job to be restarted in the next free gap before the job's latest completion time. To be efficient, the free gap should be at least as long as the remaining estimated runtime. Note that filling gaps by partly executing jobs should not be a general strategy, since checkpointing and migration require resources and result in additional costs for the job execution. As a result, planning with checkpointing and migration allows pre-emptive scheduling of HPC systems.
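The gap condition mentioned above can be captured in a small check (a hypothetical helper; the restore-overhead parameter is an assumption reflecting the cost of checkpointing and migration):

```python
def can_resume_in_gap(remaining_runtime, gap_start, gap_end, latest_completion,
                      restore_overhead=0.0):
    """A checkpointed job should only be resumed in a gap that is long enough
    for its remaining estimated runtime (plus the overhead of restoring the
    checkpoint) and that lets it finish before its latest completion time."""
    finish = gap_start + restore_overhead + remaining_runtime
    return finish <= gap_end and finish <= latest_completion
```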
3.1.4 Example
To exemplify the approach, a possible overbooking procedure is explained in the following. Assume a Grid cluster with X nodes; every day the cluster hosts the same two fixed reservations with a duration of 5 hours each [7:00−12:00, 12:00−17:00]. Assume further that these fixed and some usual advance reservations are already scheduled,
directly beneath each other. Thus, the resources are occupied for 20 hours of the schedule: [7:00−12:00, 12:00−17:00, 17:00−21:00, 21:00−3:00]. This schedule is the same every day of the week considered. Then another job j with an h = 5-hour reservation for X nodes should be inserted into the schedule within the next two days. However, the resources are only free for 4 hours [3:00−7:00]! Consequently, the scheduler has to reject the job request if it cannot overbook the schedule (3:00 + 5 hours = 8:00 > 7:00). We assume that the scheduler has statistics about the estimated and actual runtimes of applications and users, which indicate an overestimation of 40%. Assume the scheduler can take o = d − d′ of the statistically overestimated runtime of a job for overbooking, and let the maximum PoF be maxPoF = 13%. For a five-hour job this is 34 minutes (as d = 5 hours = 300 minutes and maxPoF = 0.13 → d′ = d/(1 + maxPoF) = 300/1.13 ≈ 265.5 minutes → o = d − d′ = 300 minutes − 266 minutes = 34 minutes). If we overbook the advance reservation h by 34 minutes, the schedule is still not feasible (3:00 + 4:26 hours = 7:26 > 7:00), since the gap is only 4 hours and j would be given a runtime of 4 hours and 26 minutes. If the predecessor is also overbooked by 34 minutes, the pair is reduced by about an hour in total and reservation h can be accepted ((3:00 − 0:34) + 4:26 hours = 6:52 ≤ 7:00). Thus the job j with a user-estimated runtime d = 5 hours has a duration d′ = 4:26 hours and an estimated earliest start time ts = 2:26 in the overbooked time slot [3:00−7:00]. The complete schedule is [7:00−12:00, 12:00−17:00, 17:00−21:00, 21:00−3:00, 3:00−7:00].
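The arithmetic of this example can be checked directly (a sketch; times are minutes past midnight, and the planned duration is rounded up to 266 minutes to match the figures above):

```python
import math

d, max_pof = 300, 0.13                      # 5 h job, maximum PoF of 13%
d_prime = math.ceil(d / (1 + max_pof))      # 300 / 1.13 ≈ 265.5 → 266 minutes
o = d - d_prime                             # 34 minutes usable for overbooking
assert (d_prime, o) == (266, 34)

gap_start, gap_end = 180, 420               # the free slot [3:00, 7:00]
assert gap_start + d > gap_end              # 8:00 > 7:00: plain request rejected
assert gap_start + d_prime > gap_end        # 7:26 > 7:00: overbooking j alone fails
assert gap_start - o + d_prime <= gap_end   # 6:52 <= 7:00: predecessor overbooked too
```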
Note that overbooking is only possible if neither j itself nor the predecessor (in case of its overbooking) is a fixed reservation. Hence, in our example the idle time of 4 hours can only be avoided by overbooking if job j and the reservation before it [21:00−3:00] are not fixed. For all reservations which are not fixed, greater flexibility exists if the start time ts of all those jobs can be moved forward to their earliest start time tstart whenever the preceding jobs end earlier than planned. In this case, if the other execution times are dynamically shifted, any overbooked reservation in the schedule can be straightened out before execution. This approach has the big advantage that a reservation h overbooked by 1 hour can still finish even if it uses the full estimated time, provided that the preceding reservations in the schedule require 1 hour less time in total than estimated.
4 Conclusion and Future Work
The paper first motivated the need for Grid systems and, in common with the management of supercomputers, the advantages of planning-based scheduling for SLA provisioning and fixed reservations. However, advance reservations have the disadvantage of decreasing the utilization of the computer system. Overbooking can be a powerful instrument to re-increase the system utilization and the provider's profit. The paper presented the concept and an algorithm for using overbooking. Since overbooking might lead to conflicts because resources are provided for a shorter time than required, fault-tolerance mechanisms are crucial. Checkpointing and migration can be used to prevent job losses and SLA violations and support the applicability of overbooking in Grid systems.
Future work focuses on the statistical analysis of runtimes and on implementing the overbooking algorithm in the backfilling scheduling algorithms. Finally, we will develop a simulation process for testing and evaluating how much overbooking increases the system utilization and the provider's profit.
References
[1] Cluster Resources, "TORQUE Resource Manager," 2008. [Online]. Available: http://www.clusterresources.com/pages/products/torque-resource-manager.php
[2] IBM, "LoadLeveler," 2008. [Online]. Available: http://www-03.ibm.com/systems/clusters/software/loadleveler/index.html
[3] Sun Microsystems, "Grid Engine," 2008. [Online]. Available: http://gridengine.sunsource.net/
[4] Platform Computing, "LSF (Load Sharing Facility)," 2008. [Online]. Available: http://www.platform.com/Products/Platform.LSF.Family
[5] “Condor,” 2008. [Online]. Available:http://www.cs.wisc.edu/condor/
[6] D. Jackson, Q. Snell, and M. Clement, "Core Algorithms of the Maui Scheduler," Job Scheduling Strategies for Parallel Processing: 7th International Workshop, JSSPP 2001, Cambridge, MA, USA, June 16, 2001: Revised Papers, 2001.
[7] "OpenCCS: Computing Center Software," 2008. [Online]. Available: https://www.openccs.eu/core/
[8] A. Streit, "Self-tuning job scheduling strategies for the resource management of HPC systems and computational grids," Ph.D. dissertation, Paderborn Center for Parallel Computing, 2003. [Online]. Available: http://wwwcs.uni-paderborn.de/pc2/papers/files/422.pdf
[9] "HPC4U: Introducing Quality of Service for Grids," 2008. [Online]. Available: http://www.hpc4u.org/
[10] V. Liberman and U. Yechiali, “On the hotel overbook-ing problem-an inventory system with stochastic can-cellations,” Management Science, vol. 24, no. 11, pp.1117–1126, 1978.
[11] J. Subramanian, S. Stidham Jr, and C. Lautenbacher,“Airline yield management with overbooking, can-cellations, and no-shows,” Transportation Science,vol. 33, no. 2, pp. 147–167, 1999.
[12] M. Rothstein, “Or and the airline overbooking prob-lem,” Operations Research, vol. 33, no. 2, pp. 237–248, 1985.
[13] B. Urgaonkar, P. Shenoy, and T. Roscoe, "Resource overbooking and application profiling in shared hosting platforms," ACM SIGOPS Operating Systems Review, vol. 36, no. SI, p. 239, 2002.
[14] M. Hovestadt, O. Kao, A. Keller, and A. Streit, "Scheduling in HPC resource management systems: Queuing vs. planning," Job Scheduling Strategies for Parallel Processing: 9th International Workshop, JSSPP 2003, Seattle, WA, USA, June 24, 2003: Revised Papers, 2003.
[15] A. Andrieux, D. Berry, J. Garibaldi, S. Jarvis, J. Ma-cLaren, D. Ouelhadj, and D. Snelling, “Open Issues inGrid Scheduling,” UK e-Science Report UKeS-2004-03, April 2004.
[16] M. Siddiqui, A. Villazon, and T. Fahringer, “Grid al-location and reservation—Grid capacity planning withnegotiation-based advance reservation for optimizedQoS,” Proceedings of the 2006 ACM/IEEE conferenceon Supercomputing, 2006.
[17] N. Ntene, “An algorithmic approach to the 2d orientedstrip packing problem,” Ph.D. dissertation.
[18] E. Coffman Jr, M. Garey, D. Johnson, and R. Tar-jan, “Performance bounds for level-oriented two-dimensional packing algorithms,” SIAM Journal onComputing, vol. 9, p. 808, 1980.
[19] B. Baker and J. Schwarz, “Shelf algorithms for two-dimensional packing problems,” SIAM Journal onComputing, vol. 12, p. 508, 1983.
[20] D. Feitelson and M. Jette, "Improved Utilization and Responsiveness with Gang Scheduling," Job Scheduling Strategies for Parallel Processing: IPPS'97 Workshop, Geneva, Switzerland, April 5, 1997: Proceedings, 1997.
[21] D. Lifka, "The ANL/IBM SP Scheduling System," Job Scheduling Strategies for Parallel Processing: IPPS'95 Workshop, Santa Barbara, CA, USA, April 25, 1995: Proceedings, 1995.
[22] A. Mu'alem and D. Feitelson, "Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling," IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 6, pp. 529–543, 2001.
[23] A. Downey and D. Feitelson, "The elusive goal of workload characterization," ACM SIGMETRICS Performance Evaluation Review, vol. 26, no. 4, pp. 14–29, 1999.
[24] D. Feitelson and B. Nitzberg, "Job Characteristics of a Production Parallel Scientific Workload on the NASA Ames iPSC/860," Job Scheduling Strategies for Parallel Processing: IPPS'95 Workshop, Santa Barbara, CA, USA, April 25, 1995: Proceedings, 1995.
[25] R. Gibbons, "A Historical Application Profiler for Use by Parallel Schedulers," Job Scheduling Strategies for Parallel Processing: IPPS'97 Workshop, Geneva, Switzerland, April 5, 1997: Proceedings, 1997.
[26] A. Downey, "Using Queue Time Predictions for Processor Allocation," Job Scheduling Strategies for Parallel Processing: IPPS'97 Workshop, Geneva, Switzerland, April 5, 1997: Proceedings, 1997.
[27] W. Smith, I. Foster, and V. Taylor, "Predicting Application Run Times Using Historical Information," Job Scheduling Strategies for Parallel Processing: IPPS/SPDP'98 Workshop, Orlando, Florida, USA, March 30, 1998: Proceedings, 1998.
Germany, Belgium, France, and Back Again: Job Migration using Globus
D. Battre1, M. Hovestadt1, O. Kao1, A. Keller2, K. Voss2
1Technische Universität Berlin, Germany; 2Paderborn Center for Parallel Computing, University of Paderborn, Germany
Abstract
The EC-funded project HPC4U developed a Grid fabric that provides not only SLA-awareness but also a software-only system for checkpointing sequential and MPI-parallel jobs. This allows job completion and SLA compliance even in case of resource outages. Checkpoints are generated transparently to the user in the background. There is no need to modify the applications in any way or to execute them in a special manner. Checkpoint data sets can be migrated to other cluster systems to resume the job execution. This paper focuses on job migration over the Grid, utilizing the WS-Agreement protocol for SLA negotiation and mechanisms provided by the Globus Toolkit.
Keywords: Checkpointing, Globus, Job Migration, Resource Management, SLA Negotiation, WS-Agreement
I. Introduction
The developments of recent years in the scope of Grid technologies have formed the basis for an invisible Grid whose distributed services can be easily accessed and used. Using Grids to execute compute-intensive simulations and applications is very common in the scientific community nowadays. Funded by national and international organizations, large Grids like ChinaGrid, D-Grid, EGEE, NAREGI, TeraGrid, or the UK e-Science initiative have been established and are successfully used in the academic world. Additionally, companies like IBM or Microsoft have identified the commercial potential of providing Grid services, as reflected in their participation in and funding of Grid initiatives.
However, while the technological basis for Grid utilization is mostly available, some obstacles still have to
This work has been partially supported by the EC within the 6th Framework Programme under contract IST-031772 “Advanced Risk Assessment and Management for Trustable Grids” (AssessGrid) and IST-511531 “Highly Predictable Cluster for Internet-Grids” (HPC4U).
be solved in order to gain wide commercial uptake. One of the most challenging issues is to provide Grid services in a manner that fulfils the consumer’s demands of reliability and performance. This implies that providers have to agree on Quality of Service (QoS) aspects which are important to the end-user. QoS in the scope of Grid services primarily addresses resource and time constraints, like the number and speed of compute nodes as well as the earliest execution time or latest completion time. To contractually fix such QoS requirements between service providers and consumers, Service Level Agreements (SLAs) [1] are commonly used. They allow the formulation of all expectations and obligations in the business relationship between the customer and the provider.
The EC-funded project HPC4U (Highly Predictable Cluster for Internet Grids) [2] developed a Grid fabric that provides not only SLA-awareness but also a software-only system for checkpointing sequential and MPI parallel jobs. This allows job completion and SLA compliance even in case of resource outages. Checkpoints are generated in the background, transparently to the user (i.e. the application does not need to be modified in any way), and are used to resume the execution of an interrupted application on healthy resources. Normally, the job is resumed on the same system, since spare resources have been provided to enhance the reliability of the system. However, in case there are not enough spare resources available to resume the execution, the HPC4U system is able to migrate the job to other clusters within the same administrative domain (intra-domain migration) or even over the Grid (inter-domain migration) by using WS-Agreement [3] and mechanisms provided by the Globus Toolkit [4]. This paper focuses on inter-domain migration. In the next section we highlight the architecture of the HPC4U software stack. Section III discusses some issues related to the negotiation protocol. Section IV describes the phases of job migration over the Grid in more detail. An overview of related work and a short conclusion complete this paper.
Fig. 1. The HPC4U software stack
II. The Software Stack Architecture
The software stack depicted in Figure 1 consists of a planning-based [5] Resource Management System (RMS) called OpenCCS [6], a Negotiation Manager (NegMgr), and the Globus Toolkit (GT).
OpenCCS: The depicted modules (related to job migration) are the Planning Manager (PM), the Sub-System Controller (SSC), and dedicated subsystems for checkpointing of processes (CP), network (NW), and storage (ST). The PM computes the schedule, whereas the SSC orchestrates the mentioned subsystems for periodically creating checkpoints. OpenCCS communicates with the NegMgr through a so-called ProtocolProxy (PP), which is responsible for translating between the SOAP- and Java-based NegMgr and the TCP-stream- and C-based OpenCCS.
The Negotiation Manager: The NegMgr facilitates the negotiation of SLAs between an external entity and the RMS by implementing the WS-Agreement specification version 1.0. The NegMgr has been developed in close collaboration between the EC-funded projects HPC4U and AssessGrid [7]. The Negotiation Manager is implemented as a Globus Toolkit service because this allows it to use and support several mechanisms offered by the toolkit, such as authentication, authorization, GridFTP, and monitoring services. Another major reason for implementing the Negotiation Manager as a Globus Toolkit service was the potential impact of this move, as the GT is among the de-facto standard Grid middlewares used throughout the world.
The consumer of the negotiation service can either be an end-user interface (e.g. a portal or command line tools) or even a broker service. The consumer accesses the NegMgr for creating or negotiating new SLAs and for checking the status of existing SLAs. WS-Notification is used to monitor changes in the status of an agreement, and the WS-ResourceFramework provides general read access to created agreements as well as write access to the database of agreement templates.
III. The Negotiation Protocol
The WS-Agreement protocol is limited in terms of negotiation, basically resembling a one-phase commit protocol. This is insufficient for scenarios with more than two parties, e.g. an end-user who is interested in the consumption of compute resources, a resource broker, and a resource provider. Three different usage scenarios are the main subjects:
• Direct SLA negotiation between an end-user and a provider.
• The end-user submits an SLA request to the broker, which then looks for suitable resources and forwards the request to suitable providers. The broker returns the SLA offers to the end-user, who is then free to select and commit to an SLA offer by interacting directly with the corresponding provider.
• The broker acts as a virtual provider. The end-user agrees an SLA with the broker, which in turn agrees SLAs with all providers involved in executing the end-user’s application.
From these scenarios we can conclude that the current negotiation interface of WS-Agreement does not satisfy our needs. According to WS-Agreement, by the act of issuing an SLA request to a resource provider, a resource consumer is already committed to that request. The provider can only accept or reject the request. This has certain shortcomings.
It is common real-world practice that a customer specifies a job and asks several companies for offers. The companies state their price for the job and the customer can pick the cheapest offer. This is not possible in the current WS-Agreement specification: by submitting an SLA request, the user is committed to that request. We drop that assumption. A user can submit a non-binding SLA request, and the provider is allowed to modify the request by answering with an SLA offer that has a price tag. The provider is bound to this offer, and the user can either commit to the offer or let it expire. Therefore, the interface and semantics of the WS-Agreement implementation were slightly modified: a createAgreement call is not binding for the issuer, and the WS-Agreement interface is extended by a commit method. As the WS-GRAM implementation of Globus is designed for queuing-based schedulers, a new state machine was implemented that supports external events from the RMS (over the PP), GridFTP, and time events to trigger state transitions.
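The extended, event-driven negotiation flow can be sketched as a small state machine. The state and event names below are illustrative assumptions, not the actual NegMgr implementation:

```python
# Sketch of an agreement state machine of the kind described: external
# events from the RMS, GridFTP, or timers trigger state transitions.
# States/events are hypothetical, for illustration only.

OFFERED, COMMITTED, STAGING_IN, RUNNING, DONE, EXPIRED = (
    "offered", "committed", "staging-in", "running", "done", "expired")

# (state, event) -> next state
TRANSITIONS = {
    (OFFERED, "commit"): COMMITTED,          # user commits to the offer
    (OFFERED, "timeout"): EXPIRED,           # non-binding offer lapses
    (COMMITTED, "stage_in_started"): STAGING_IN,
    (STAGING_IN, "stage_in_done"): RUNNING,  # GridFTP event
    (RUNNING, "job_finished"): DONE,         # RMS event over the PP
}

def step(state, event):
    """Apply one external event; reject transitions the protocol forbids."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state!r}")
```

Note that an offer that is never committed simply expires, mirroring the non-binding createAgreement semantics described above.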
For standardizing the execution requirements of a computational job, the Job Submission Description Language (JSDL) [8] has been introduced by the JSDL working group of the Global Grid Forum. By means of JSDL, all parameters for job submission can be specified, e.g. the name of the executable, required application parameters, or file transfer filters for stage-in and stage-out. Our current WS-Agreement implementation supports only a subset of JSDL (namely POSIX jobs).
The Globus Resource Allocation Manager (GRAM) forms the glue between the RMS and the Globus Toolkit. It is responsible for parsing requests coming from upper Grid layers and transforming them into uniform service requests for the underlying RMS. Unfortunately, GRAM is not able to handle (advance) reservations. Therefore we bypass the GRAM layer.
If an SLA request has been found to be template-compliant, it is passed to the resource management system, which decides whether it is able to provide the requested service level. The job is integrated into a tentative schedule, which forms the decision base. In case the job fits into the schedule and no previously agreed SLAs need to be violated, an offer is generated.
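The admission test just described can be illustrated with a toy planner: try candidate start times and check that the new job never exceeds node capacity alongside already-agreed reservations. This is a sketch under simplifying assumptions (single cluster, fixed node counts), not OpenCCS's actual planning algorithm:

```python
def fits(reservations, total_nodes, need, duration, earliest, deadline):
    """Return a feasible start time for the new job, or None.
    `reservations` is a list of (start, end, nodes) already agreed via
    SLAs; none of them may be moved. Illustrative planner sketch."""
    # Demand only changes at reservation boundaries, so it suffices to
    # try starting at `earliest` or right after a reservation ends.
    starts = sorted({earliest, *(e for _, e, _ in reservations)})
    for s in starts:
        e = s + duration
        if s < earliest or e > deadline:
            continue
        # Check node usage at every demand-change point inside [s, e).
        points = {s, *(r[0] for r in reservations if s <= r[0] < e)}
        if all(sum(n for rs, re, n in reservations if rs <= t < re) + need
               <= total_nodes for t in points):
            return s
    return None
```

For example, on an 8-node cluster with an existing 4-node reservation over [0, 10), a second 4-node job fits immediately, while a 5-node job must wait until time 10.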
First, the SLA describes the time by which the user guarantees to provide files for stage-in. At this time, the negotiation manager’s state machine triggers the file staging via GridFTP. Next, the SLA specifies the earliest job start and finish time as well as the requested execution time. The RMS scheduler is free to allocate the resources for the requested time anywhere between the earliest start and latest execution time. It is the user’s task to estimate the duration of the stage-in phase and set the earliest job start accordingly. In order to compensate for small underestimations, a synchronization point is introduced: in case the stage-in is not completed when the resources are allocated, the actual execution pauses until the stage-in is completed. Note that this idle time is part of the guaranteed CPU time and may prevent the successful execution of the program.
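The synchronization-point accounting described above reduces to simple arithmetic; this sketch (function and parameter names are illustrative) charges any stage-in overrun against the guaranteed CPU time:

```python
def remaining_cpu_time(alloc_start, alloc_end, stage_in_done):
    """If stage-in finishes after the resources are allocated, the idle
    wait is part of the guaranteed CPU time (the synchronization point
    described in the text). Times are abstract units; sketch only."""
    idle = max(0, stage_in_done - alloc_start)
    return (alloc_end - alloc_start) - idle
```

A job allocated over [10, 20) whose stage-in finishes at time 12 thus has only 8 units of guaranteed CPU time left.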
IV. Grid Migration Phases
Initializing the Migration Process: The first step of the migration process is triggered by the detection of a failure (e.g. a node crash) by the monitoring functionalities of OpenCCS. The PM is signaled about this outage and tries to create a new schedule that accounts for the changed system situation. If it is not possible to generate a schedule where the SLAs of all jobs are respected, the PM initiates the job migration.
Packing the Migration Data Set: Before any migration
process can be initiated, it has to be ensured that the migration data set is available. This is signaled to the SSC, which then generates such a data set. Beside the checkpoint itself, it includes information provided by the RMS, like the number of performed checkpoints or the layout of MPI processes on the nodes. Once the migration data set has been generated, the PP transfers all necessary data to the NegMgr to start the query for potential migration targets. This also includes the path to a transfer directory where the SSC has saved the migration data and where the results should be written to.
Finding the Migration Target: The NegMgr is now in charge of finding appropriate migration targets. For this, it uses, for instance, resource information catalogues in the Grid. This query is the static part of the retrieval process. In a second step, the NegMgr contacts all other NegMgr services of the returned Grid fabrics and initiates the negotiation process with them.
This SLA negotiation is started with a getQuote message to the remote NegMgr, asking for an SLA offer. The remote system can either reject this message or answer with OK. It has to be noted that this OK is non-binding. The source NegMgr now has to choose among the offerings where the remote Negotiation Managers replied with an OK message. An important aspect of this ranking could be the price. At the end, the initiating NegMgr sends one of these remote NegMgrs a createAgreement message. If the remote site still agrees, it replies with OK, which makes the SLA binding for both the local and the remote site. Otherwise the remote site will terminate the process (e.g. because free resources have meanwhile been allocated to other jobs).
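The target-selection round above can be sketched as follows. `get_quote` and `create_agreement` stand in for the remote NegMgr calls; the names and the price-only ranking are illustrative assumptions, not the actual NegMgr API:

```python
def choose_migration_target(candidates, get_quote, create_agreement):
    """Illustrative selection loop: gather non-binding quotes, rank them
    (here: by price), and try to make the best one binding. A remote
    site may still back out at createAgreement time, e.g. because its
    free resources were allocated to other jobs meanwhile."""
    # get_quote(site) returns a price, or None if the site rejects.
    offers = [(get_quote(site), site) for site in candidates]
    offers = sorted((p, s) for p, s in offers if p is not None)
    for price, site in offers:
        if create_agreement(site):   # binding for both sides on OK
            return site, price
    return None, None                # every remote site backed out
```

In a run where site "b" quotes the lowest price but later refuses the createAgreement, the loop falls back to the next-cheapest site.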
Transferring the Migration Data Set: The transfer process is driven by the remote site. This is possible since the SSC has generated the migration data set in a transfer directory accessible by the Globus Toolkit. Since the data transfer is secured by proxy credentials, the source site only provides these rights to the respective remote site, so that no other party may access the data. After transferring the data set to the remote site, both RMSs are notified by their associated NegMgr. This is triggered by a WS-Notification event. The source PM then changes the state of the migrated job to remote execution. Similarly, the remote NegMgr signals the remote PM that the migration data set is available, so that the job execution can be resumed.
Resuming the Job Execution: Once the migration data has been transferred and the PM decides to restart the job from the migrated checkpoint, the SSC is first signaled to re-establish the process environment, taking into account the situation of the storage working container. During the runtime, the SSC creates checkpoints at regular intervals.
Transferring the Result Data Set: If everything goes right, the job finally finishes its computation and the job results can be transferred back to the source. The results are packed by the remote SSC component and saved to a transfer directory. The PP then signals its NegMgr that the result data set is available. The remote NegMgr initiates the data transfer to the location originally provided by the source site at negotiation time. Once the data transfer has been completed successfully, both NegMgrs inform their PM components. On the remote site, all used resources are now released. On the source site, the PM initiates the stage-out of the result data to the user-specified target. This way, the user retrieves the result data without noticing any difference from a purely local computation.
Job Forwarding: In case the job also fails on the remote site, it may be migrated again to another site to resume the execution. The steps are the same as described above. However, transferring the result data back to the originating site is done along the whole migration chain. All intermediate sites then act as proxies, forwarding the incoming result data. This also simplifies the clean-up.
V. Related Work
The worldwide research in Grid computing has resulted in numerous different Grid packages. Beside many commodity Grid systems, general purpose toolkits exist, such as Unicore [9] or the Globus Toolkit [4]. Although the Globus Toolkit represents the de-facto standard for Grid toolkits, all these systems have proprietary designs and interfaces. To ensure future interoperability of Grid systems as well as the opportunity to customize installations, the OGSA (Open Grid Services Architecture) working group within the Open Grid Forum (OGF) aims to develop the architecture for an open Grid infrastructure [10].
In [11], important requirements for the Next Generation Grid (NGG) were described. An architecture that supports the co-allocation of multiple resource types, such as processors and network bandwidth, was presented in [12]. The Grid community has identified the need for a standard for SLA description and negotiation. This led to the development of WS-Agreement/-Negotiation [3]. The Globus Architecture for Reservation and Allocation (GARA) provides “wrapper” functions to enhance a local RMS not capable of supporting advance reservations with this functionality. This is an important step towards an integrated QoS-aware resource management. However, the GARA component of Globus currently supports neither the definition of SLAs nor malleable reservations, nor does it provide resilience mechanisms to handle resource outages or failures.
VI. Conclusion
In this paper we described the mechanisms provided by the HPC4U software stack for migrating checkpoint data sets of sequential or MPI parallel applications over the Grid. We use the Globus Toolkit for finding appropriate resources and for migrating the checkpoint and result data sets. We also developed an implementation of the WS-Agreement protocol to be able to negotiate with the local RMSs.
Thanks to the transparent checkpointing capabilities, these mechanisms also apply to the execution of commercial applications, where no source code is available and recompiling or relinking is not possible. The user does not even have to modify the way of executing the job.
We have shown the practicability of the presented mechanisms by migrating jobs from a site in Germany to Belgium and from there to a site in France. The results were automatically transferred back to Germany. The developed software can be found on the project pages.
References
[1] A. Sahai, S. Graupner, V. Machiraju, and A. van Moorsel, “Specifying and Monitoring Guarantees in Commercial Grids through SLA,” Internet Systems and Storage Laboratory, HP Laboratories Palo Alto, Tech. Rep. HPL-2002-324, 2002.
[2] “Highly Predictable Cluster for Internet-Grids (HPC4U), EU-funded project IST-511531,” http://www.hpc4u.eu.
[3] A. Andrieux, K. Czajkowski, A. Dan, K. Keahey, H. Ludwig, T. Nakata, J. Pruyne, J. Rofrano, S. Tuecke, and M. Xu, “Web Services Agreement Specification (WS-Agreement),” www.gridforum.org/Public Comment Docs/Documents/Oct-2005/WS-AgreementSpecificationDraft050920.pdf, 2004.
[4] “Globus Alliance: Globus Toolkit,” http://www.globus.org.
[5] M. Hovestadt, O. Kao, A. Keller, and A. Streit, “Scheduling in HPC Resource Management Systems: Queuing vs. Planning,” in Job Scheduling Strategies for Parallel Processing: 9th International Workshop, JSSPP, Seattle, WA, USA, 2003.
[6] “OpenCCS,” http://www.openccs.eu.
[7] “Advanced Risk Assessment and Management for Trustable Grids (AssessGrid), EU-funded project IST-031772,” http://www.assessgrid.eu.
[8] “Job Submission Description Language (JSDL),” www.gridforum.org/documents/GFD.56.pdf.
[9] “UNICORE Forum e.V.,” http://www.unicore.org.
[10] GGF Open Grid Services Architecture Working Group (OGSA-WG), “Open Grid Services Architecture: A Roadmap,” http://www.ggf.org/ogsa-wg, 2003.
[11] K. Jeffery (ed.), “Next Generation Grids 2: Requirements and Options for European Grids Research 2005-2010 and Beyond,” ftp://ftp.cordis.lu/pub/ist/docs/ngg2 eg final.pdf, 2004.
[12] I. Foster, C. Kesselman, C. Lee, B. Lindell, K. Nahrstedt, and A. Roy, “A Distributed Resource Management Architecture that Supports Advance Reservations and Co-Allocation,” in 7th International Workshop on Quality of Service (IWQoS), London, UK, 1999.
Development of Grid Service Based Molecular Docking Application
HwaMin Lee1, DooSoon Park1, and HeonChang Yu2
1Division of Computer Science & Engineering, Soonchunhyang University, Asan, Korea  2Department of Computer Science Education, Korea University, Seoul, Korea
Abstract - Molecular docking is the process of reducing an unmanageable number of compounds to a limited number of compounds for the target of interest by means of computational simulation. It is a large-scale scientific application that requires large computing power and data storage capability. Previous applications or software for molecular docking were developed to run on a supercomputer, a workstation, or a cluster computer. However, molecular docking using a supercomputer suffers from the problem that a supercomputer is very expensive, while molecular docking using a workstation or a cluster computer requires a long execution time. Thus we propose a Grid service based molecular docking application. We designed a resource broker and a data broker for supporting an efficient molecular docking service and developed various services for molecular docking. Our application can reduce the timeline and cost of drug or new material design.
Keywords: Molecular docking, grid computing, grid services.
1 Introduction
Drug discovery is an extended process that may take heavy cost and a long timeline from the first compound synthesis in the laboratory until the therapeutic agent, or drug, is brought to market [1, 2]. Molecular docking, as shown in Figure 1, is the search for feasible binding geometries of a putative ligand with a target receptor whose 3D structure is known. Molecular modeling methodology combines computational chemistry and computer graphics, and molecular docking has emerged as a popular methodology for drug discovery [3]. Docking methods in virtual screening can dock a large number of small molecules into the binding site of a receptor, allowing for a rank ordering in terms of strength of interaction with a particular receptor [4]. Docking each molecule in the target chemical database is both a compute- and data-intensive task [3]. Tremendous efforts are underway to improve programs aimed at the automated process of docking, or positioning compounds in a binding site, and scoring, or rating the complementarity of small molecules. The challenge in applications of molecular docking is that they are very compute intensive and require a very fast computer to run.
Figure 1. Molecular docking
In the mid 1990s, Grid computing emerged as an important new field, distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and high-performance orientation [5, 6]. A Grid computing system [6] consists of large sets of diverse, geographically distributed resources that are grouped into virtual computers for executing specific applications. Today, Grid computing offers the strongest low-cost and high-throughput solutions [2] and is spotlighted as a key technology of the next generation Internet. Grid computing is used in fields as diverse as astronomy, biology, drug discovery, engineering, weather forecasting, and high-energy physics.
Molecular docking is one of the large-scale scientific applications that require large computing power and data storage capability. Thus we developed a Grid service based molecular docking application using Grid computing technology, which supports large data-intensive operations. In this paper, we construct a 3-dimensional chemical molecular database for molecular docking. We also design a resource broker and a data broker for supporting an efficient molecular docking service and propose various services for molecular docking. We implemented the Grid service based molecular docking application with DOCK 5.0 and Globus Toolkit 3.2. Our application can reduce the timeline and cost of drug discovery or new material design.
This paper is organized as follows: Section 2 presents related work, and Section 3 explains the construction of the chemical molecular database. In Section 4, we present the architecture of the Grid service based molecular docking system. Section 5 explains the services for molecular docking. Section 6 describes the results of the implementation. Finally, the paper concludes in Section 7.
2 Related Works
Previous applications or software for molecular docking, such as AutoDock, FlexX, DOCK, LigandFit, and Hex, were developed to run on a supercomputer, a workstation, or a cluster computer. However, molecular docking using a supercomputer suffers from the problem that a supercomputer is very expensive, while molecular docking using a workstation or a cluster computer requires a long execution time. Recently, several research efforts have addressed molecular modeling in Grid computing, such as the Virtual Laboratory [3], BioSimGrid [7], protein structure comparison [8], and APST [9].
The Virtual Laboratory [3] provided an environment for grid-enabling the molecular docking process by composing it as a parameter sweep application using the Nimrod-G tools. The Virtual Laboratory proposed the Nimrod Parameter Modeling Tools for enabling DOCK as a parameter sweep application, the Nimrod-G Grid Resource Broker for scheduling DOCK jobs on the Grid, and Chemical Database (CDB) Management and Intelligent Access Tools. But the Virtual Laboratory did not support web services, because it was implemented using Globus Toolkit 2.2, and it provided neither integration of heterogeneous CDBs nor retrieval functionality in distributed CDBs.
The BioSimGrid [7] project provides a generic database for the comparative analysis of simulations of biomolecules of biological and pharmaceutical interest. The project is building an open software framework based on OGSA and OGSA-DAI. The system has a service-oriented computing model and consists of the GUI, services, Grid middleware, and database/data.
A grid-aware approach to protein structure comparison [8] implemented software tools for the comparison of protein structures. The authors proposed comparison algorithms based on indexing techniques that store transformation-invariant properties of the 3D protein structures in tables. Because the method required large memory and intensive computing power, they used a Grid framework.
Parameter sweeps on the Grid with APST [9] designed and implemented APST (AppLeS Parameter Sweep Template). The APST project investigated adaptive scheduling of parameter sweep applications on the Grid and evolved into a usable application execution environment. Parameter sweep applications are structured as sets of computational tasks that are mostly independent. The APST software consists of a scheduler, a data manager, a compute manager, and a metadata manager.
However, [3], [8], and [9] did not support web services, and [8] and [9] did not provide an integrated database service. Although [3] provided a basic database service, it did not provide integrated database management of heterogeneous and distributed databases/data in a Grid environment. Therefore, we construct an integrated 3-dimensional chemical database and propose a Grid service based molecular docking system.
3 Constructing 3D chemical database for molecular docking
In Grid computing, many applications use large-scale databases or data. In existing chemical databases, the order, kinds, and degrees of fields are heterogeneous. As the kinds of chemical compounds become more various and the size of the databases increases, data insertion, data deletion, data retrieval, and integration of data in chemical databases become more difficult. Thus we construct a database that integrates the various existing chemical databases using MySQL. Table 1 shows the Protomer table in our chemical database. Our chemical database contains 32,889 chemical molecules. Our Grid service based molecular docking system retrieves data fields for virtual screening, and the retrieved data automatically compose a mol2 file. Our database service also provides a Query Evaluation Service, which queries information of data nodes for selecting the optimal data node.
Table 1. Protomer table in our chemical database
Attribute            Description
Prot_ID              Molecular ID
Type                 Subset type, e.g. Fragment-like, Drug-like
Mole_name            The name of the molecule
LogP                 Log of the octanol/water partition coefficient
Apolar_desolvation   Apolar desolvation
Polar_desolvation    Polar desolvation
H_bond_donors        The number of H-bond donors
H_bond_acceptors     The number of H-bond acceptors
Charge               Total charge of the molecule
Molecular_weight     Molecular weight with atomic weights taken from
Rotable_bond         The number of rotatable bonds
Content              Whole contents of the Mol2 file
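Since the Content column stores the whole Mol2 text, a screening run can pull ligands by their Table 1 fields and recover the Mol2 files directly. The SQL below follows the Table 1 column names, but the query shape and the in-memory row format are illustrative assumptions, not the system's actual code:

```python
# Hypothetical query against the Protomer table (column names from
# Table 1); %s placeholders follow the MySQL/DB-API parameter style.
PROTOMER_QUERY = """
SELECT Mole_name, Charge, Content
FROM Protomer
WHERE Type = %s AND H_bond_donors <= %s
"""

def write_mol2(rows, out):
    """Compose a screening input from query rows: each ligand's stored
    Mol2 content is appended after a small comment header. `rows` are
    (Mole_name, Charge, Content) tuples; `out` is a list of lines."""
    for name, charge, content in rows:
        out.append(f"# ligand {name} (charge {charge})")
        out.append(content)
    return out
```

In the real system the retrieved fields "automatically compose a file mol2"; this sketch just makes that composition step concrete.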
4 Architecture of Grid service based molecular docking system
The current methodology in Grid computing is service-oriented architecture. In this section, we explain the components of the Grid service based molecular docking system and the steps involved in molecular docking execution. Figure 2 shows the architecture of the Grid service based molecular docking system. The system consists of a broker, computation resources, and data resources, and it is composed of multiple individual services located on different heterogeneous machines and administered by different organizations.
Figure 2. Architecture of Grid service based molecular docking system
Resource broker
The resource broker is responsible for scheduling docking jobs. For scheduling, it uses information about CPU utilization and available memory capacity and dispatches docking jobs to selected computation nodes. It also monitors the execution state of jobs and gathers computation results. Our resource broker uses MDS (Metacomputing Directory Service) for resource selection. It provides the Dock Service Group Registry and the Dock Service Factory.
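The broker's selection rule, choosing among nodes by CPU utilization and free memory, can be sketched as follows. The data shape (name mapped to utilization and free memory) is an assumption for illustration; the real broker obtains this information via MDS:

```python
def pick_compute_node(nodes, min_free_mem):
    """Illustrative resource selection: among nodes with enough free
    memory, prefer the lowest CPU utilization. `nodes` maps a node name
    to (cpu_utilization, free_memory_mb); this shape is hypothetical."""
    eligible = [(util, name) for name, (util, mem) in nodes.items()
                if mem >= min_free_mem]
    return min(eligible)[1] if eligible else None
```

Returning None when no node qualifies lets the broker retry later or reject the docking job.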
Data broker
In Grid computing, data-intensive applications produce data sets of terabytes or petabytes, and these data sets are replicated within the Grid environment for reliability and performance. The data broker is responsible for selecting suitable CDB services and replicas for efficient docking execution. It uses the replica catalogue and the database information service, and it uses information about network bandwidth to select the optimal data resource providing chemical data for molecular docking execution. Our data broker provides the CDB Query Service Registry and the CDB Query Service Factory.
Replica catalogue
The replica catalogue manages replica information for CDB resource discovery. It maintains a mapping between logical names and target addresses. A logical name is a unique identifier for replicated data contents, and a target address is the physical location of a replica.
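The logical-to-physical mapping, combined with the data broker's bandwidth-based choice, can be sketched in a few lines. The class and method names are illustrative, not an OGSA replica management API:

```python
class ReplicaCatalogue:
    """Maps a logical name to the physical locations of its replicas,
    as described above. `best_replica` mirrors the data broker's rule
    of picking the replica reachable over the widest pipe; the bandwidth
    table passed in is a hypothetical stand-in for measured values."""

    def __init__(self):
        self._map = {}   # logical name -> list of target addresses

    def register(self, logical, address):
        self._map.setdefault(logical, []).append(address)

    def best_replica(self, logical, bandwidth):
        replicas = self._map.get(logical, [])
        return max(replicas, key=lambda a: bandwidth.get(a, 0),
                   default=None)
```

A missing logical name yields None, which the data broker would treat as a discovery failure.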
Database information service
The database information service integrates and manages the information required for selecting and accessing data resources. When the data broker sends a query about data resources, it returns the results of the query to the data broker. The resource selection algorithms of the data broker can select suitable data resources using the replica catalogues and the database information service.
Computation resources
A computation resource provides the Dock Evaluation Service Factory, which performs docking with a receptor and a ligand.
Data resources
Data resources consist of various file systems, databases, XML databases, and hierarchical repository systems. A data resource provides the Query Evaluation Service Factory. In our Grid service based molecular docking system, services that want to use data resources can access heterogeneous data resources in a uniform manner using the data broker. Dock Services can access data resources using OGSA-DAI and OGSA-DQP.
5 Grid service specification for molecular docking
In this section, we explain the service specification for molecular docking. The service specification is orthogonal to the Grid authentication and authorization mechanisms.
Dock Service Group Registry
Dock Service Group Registry (DSGR) is a persistent service which registers the Dock Evaluation Service Factory (DESF) for executing docking on computation resources. DSGR provides the DockServiceGroupRegistry PortType, derived from the GridService PortType, NotificationSource PortType, ServiceGroup PortType, and ServiceGroupRegistration PortType in OGSI. DESF registers and deletes services in DSGR using the ServiceGroupRegistration PortType. DSGR creates a ServiceGroupEntry service and manages the duration time of registered services. The Dock Service can query information about computation resources for resource scheduling using DSGR.
Dock Service Factory
Dock Service Factory (DSF) is a persistent service which provides the DockServiceFactory PortType derived from the GridService PortType of OGSI. The primary function of DSF is to create a Dock Service instance at the request of a client. Any client wishing to execute molecular docking first connects to DSF and creates an instance. The Dock Service instance discovers suitable computation resources through DSGR and requests the Dock Evaluation Service.
Dock Evaluation Service Factory
Dock Evaluation Service Factory (DESF) executes the docking application with a given receptor and ligand. DESF creates a Dock Evaluation Service (DES) instance at the request of a Dock Service instance. The DES instance queries ligands through the CDB Query Service Factory (CDBQSF). DESF provides the DockEvaluationServiceFactory PortType derived from the GridService PortType of OGSI.
CDB Service Group Registry
The CDB Service Group Registry (CDBSGR) is a persistent service which provides the CDBServiceGroupRegistry PortType derived from the GridService PortType, NotificationSource PortType, ServiceGroup PortType, and ServiceGroupRegistration PortType of OGSI. CDBSGR registers the Query Evaluation Service Factory (QESF), which provides the inquiry functionality for molecules in the CDB. CDBSGR creates a ServiceGroupEntry service and manages a lifespan using the ServiceGroupEntry service. It also queries information about registered services using the GridService PortType and provides functionality that notifies about updated services using the NotificationSource PortType.
CDB Query Service Factory
The CDB Query Service Factory (CDBQSF) is a persistent service which queries information about the ligand that a DES instance requests. When the CDB is available from more than one source, the data broker selects one of them using CDBQSF. CDBQSF provides the CDBQueryServiceFactory PortType derived from the GridService PortType of OGSI. CDBQSF creates a CDB Query Service (CDBQS) instance at the request of the DES instance. The CDBQS instance queries information about the ligand through QESF.
Query Evaluation Service Factory
The Query Evaluation Service Factory (QESF) is a persistent service which looks up, in the CDB, the ligand that a CDBQS instance requests. QESF provides the QueryEvaluationServiceFactory PortType derived from the GridService PortType of OGSI and the GDS PortType of OGSA-DAI. QESF creates a Query Evaluation Service (QES) instance at the request of a CDBQS instance. The QES instance queries the database using OGSA-DAI.
6 Implementation
We implemented the Grid service based molecular docking system using Globus Toolkit 3.2 and DOCK 5.0, developed by researchers at UCSF. We constructed the CDB using MySQL and defined the services with XML and WSDL.
Figure 3 shows a screenshot of running the globus-start-container command. The list shown in Figure 3 is the list of Grid services that are started along with the container. The DockServiceFactory, DockEvalServiceFactory, DockServiceGroupRegistry, CDBQueryServiceGroupRegistry, CDBQueryServiceFactory, and QueryEvaluationServiceFactory that we defined and implemented are shown in the red box in Figure 3.
Figure 3. Screenshot of running the globus-start-container command
We implemented the Dock Client for ease of use. Figure 4 shows the interface of the Dock Client and the ligand search window. The interface of the Dock Client is divided into a toolbar, a process state display section, and a screened molecules section. The toolbar consists of Login, Search Receptor, Search Ligand, and Docking. If more than two molecules are found, the client can select some of them in the table.
Figure 4. Interface of the Dock Client and the ligand search window
Figure 5 shows a screenshot of executing docking with the selected receptor and ligand. The energy score resulting from the docking execution is shown in the red box in Figure 5. If the client clicks the energy score column, the energy scores are sorted in ascending order.
Figure 5. Screenshot of executing docking with the selected receptor and ligand
7 Conclusion
In drug discovery, molecular docking is an important technique because it reduces an unmanageable number of compounds to a limited number of compounds for the target of interest. Previous applications and software for molecular docking were developed to run on a supercomputer, a workstation, or a cluster. However, molecular docking using a supercomputer has the problem that a supercomputer is very expensive, while molecular docking using a workstation or a cluster requires a long execution time. Thus we proposed a Grid service based molecular docking system using Grid computing technology. To develop this system, we constructed a 3-dimensional chemical molecular database and defined six services for molecular docking. We implemented the Grid service based molecular docking system with DOCK 5.0 and Globus Toolkit 3.2. Our system can reduce the timeline and cost of drug discovery or new material design, and provides the client with ease of use. In the future, we plan to carry out various experiments with data sets of terabytes or petabytes to measure the efficiency of our molecular docking system.
A Reliability Model in Wireless Sensor Powered Grid
Saeed Parsa, Azam Rahimi, Fereshteh-Azadi Parand Faculty of Computer Engineering, Iran University of Science and Technology, Narmak, Tehran, Iran
Abstract - The widespread use of sensor networks, along with growing interest in grid computing and data grid infrastructures, has led to their integration as a single system. To measure the reliability of such an integrated system, named the Sensor Powered Grid (SPG), a reliability model is proposed. Considering a semi-centralized architecture comprising grid clients, a resource manager, computing resources and sensor networks, a time distribution model has been developed as a reliability measure. Using this distribution model and the universal moment generating function (UMGF), the performance of each element, such as the sensors, the resources, and their communication paths to the resource management system (RMS), is formulated distinctly, resulting in the system's total performance.
Keywords: Grid Computing, Sensor Networks, Fault Tolerance, Reliability, UMGF
1 Introduction
With the advent of technology in micro-electromechanical systems (MEMS) and the creation of small sensor nodes which integrate several kinds of sensors, sensor networks are used to observe natural phenomena such as seismological activity, weather conditions and traffic. These tiny sensor nodes can easily be deployed into a designated area to form a wireless network and perform a specific function. On the other hand, the availability of increasingly inexpensive embedded computing and wireless communication technologies, such as IEEE 802.11 and Bluetooth, is now making mobile computing more affordable. The increasing reliance on wireless networking for information exchange makes it critical to maintain reliable and fault-tolerant communication, even in the instance of a component failure or security breach. In such wireless networks, mobile application systems continuously join and leave the network and change location, with the resulting mobility impacting the degree of survivability, reliability and fault-tolerance of the communication. Grid computing, in turn, provides widespread dynamic, flexible and coordinated sharing of geographically distributed heterogeneous network resources among dynamic user groups. By spreading workload across a large number of computers, the grid computing user can take advantage of enormous computational, storage and bandwidth resources that would otherwise be prohibitively expensive to attain within traditional multiprocessor supercomputers.
The combination of sensor networks and grid computing enables the complementary strengths and characteristics of the two to be realized on a single integrated platform. This integrated platform provides real-time data about the environment to computational resources, which enables the construction of real-time models and databases of the environment and physical processes as they unfold, from which high-value computations like decision making, analytics, data mining, optimization and prediction can be carried out. Given the unreliable nature of the sensor network and grid environment, the investigation of the reliability of a sensor powered grid is of great importance. Due to the low cost and the deployment of a large number of sensor nodes in an uncontrolled environment, it is not uncommon for the sensor nodes to become faulty and unreliable. On the other hand, using sensor networks as the eyes and ears of the grid necessitates the deployment of a more reliable and fault-tolerant resource management system; in other words, employing sensor networks in the grid means inserting more potentially faulty decisions into the grid environment, so fault tolerance in the sensor powered grid is one of the most important factors in resource management. Although some fault-tolerant models have been proposed for grids and sensor networks distinctly, a comprehensive model for an integrated grid and sensor network has not been proposed yet. The importance of a complementary fault tolerance model becomes clearer when the lack of a middle layer between the fault tolerance models in the grid and the sensor network is observed.
2 Related Works
In providing fault tolerance, four phases can be identified: error detection, damage confinement, error recovery, and fault treatment and continued system service [1]. Some fault tolerance models are involved in error detection. Globus HBM [2] provides a generic failure detection service designed to be incorporated into distributed systems and grids. The applications are notified of the failure and take appropriate recovery action. MDS-2 [3] in theory can support task crash failure detection functionality via its GRRP notification protocol and GRIS/GIIS framework. Legion [4] uses a pinging and timeout mechanism to detect task failures. Condor-G [5] adopts ad-hoc failure detection mechanisms, since the underlying grid protocols ignore fault tolerance issues. In the case of error recovery, most research is based on checkpointing and replication. Job replication is a common method that aims to provide fault tolerance in distributed environments by scheduling redundant copies of
each job, so as to increase the probability of having at least a simple job executed. In [6] a very interesting analysis of the requirements for fault tolerance in the grid is presented, along with a failure detection service and a flexible failure handling framework. In this case most researches have as their main focus the provision of a single fault tolerance mechanism targeting their system-specific domains. In Globus [2] there is an application mechanism to take appropriate recovery action. [3, 4] provide mechanisms to support fault tolerance such as check-pointing. The remaining grid systems like NetSolve [7], Mentat [8] and Condor-G [9] have their own failure recovery mechanisms. They provide a single user-transparent recovery mechanism (e.g. retrying in NetSolve and in Condor-G, and replication in Mentat).
In the field of sensor networks, there are many proposed fault tolerance models. The problem was first studied by Marzullo [10], who proposed a model that tolerates individual sensor failures. Prasad et al. [11] extended Marzullo's model by considerably reducing the output interval estimate. Another extension of Marzullo's approach is presented in [12]; the proposed solution relaxes the assumption on the number of faulty nodes and uses statistics theory to obtain a fault tolerant integrated estimate of the parameter being measured by the sensors. Many other FT schemes for sensor networks have been suggested. In [13] an algorithm is developed that guarantees reliable and fairly accurate output from a number of different types of sensors. A fault tolerance technique is proposed in [14] where a single type of resource backs up different types of resources. The work in [15] considers fault tolerance in sensor networks from the point of view of node placement. The Geographic Hash Table (GHT) [16] for data dissemination uses data replication for tolerating faults in sensor networks. Saleh et al. [17] introduce the concept of, and present a schema for, energy-aware fault-tolerance. Neogy [18] presents a fault tolerance model based on (i) a Triple Modular Redundancy (TMR) technique and (ii) a check-pointing and recovery technique, to provide a wireless TMR (WTMR) check-pointing technique. Integrating the grid with wireless sensor networks has made it necessary to propose a capable model for fault detection and recovery. The proposed fault tolerance models are concentrated on grids and sensor networks distinctly, and no comprehensive model has been proposed to address fault tolerance in a wireless sensor integrated grid. In this paper we address this problem.
3 Arrangement Model
Some arrangement models have been proposed for integrating grid computing and sensor networks, which are not comprehensive and need to be modified to avoid some drawbacks. Tham and Buyya [19] have considered two types of sensor-grid computing, centralized and decentralized. In the former, to achieve sensor-grid computing, sensors and sensor networks connect and interface to the grid and all computations take place in the grid; no data fusion is executed outside the grid. In the latter, sensor-grid computing is executed on a distributed architecture, in a manner that involves processing and decision making within the sensor network and at other levels of the sensor-grid architecture. This method is designed for sensor grids where decisions are made on raw data and, based on those decisions, actuators are activated. They have considered the decentralized method to be the better choice. There are some shortcomings in their model. Their decentralized model is more suitable for sensor-actuator systems, where decision making is simple and it is possible to perform it in the sensor network. In grid computing systems such as weather forecasting, on the other hand, the collected sensor data is the initial feed for complex computing in the grid, which is initiated after the raw data has been collected by the sensor network. So it is not possible for all computations to take place in the sensor network, and planning a comprehensive architecture considering all the required factors is very important. We consider a semi-centralized model comprising grid clients, a resource manager, computing resources and sensor networks, named the sensor powered grid (SPG), where some levels of information filtering and the delivery of reliable data are conducted in the sensor network and the other, complex computations take place in the grid. Fig. 1 illustrates the general architecture of the SPG.
Fig. 1. Utilized architecture for SPG
4 Fault Tolerance Model
To explain the fault tolerance model we propose a series of failure definitions which will be used in our fault management system. It is a failure if and only if one of the following conditions is satisfied:
1. The resource processing or sensor data collection stops due to a resource or sensor crash.
2. The availability of a resource or sensor does not meet the minimum levels of Quality of Service (QoS).
The sensor powered grid shall support varying degrees of QoS for sensor data delivery. For example, certain sensor data might require low-latency, highly reliable delivery, while other data can tolerate certain degrees of network loss or delay. On the other hand, failures include three general types: process failures, resource or sensor failures, and network failures. Fig. 2 illustrates the general fault types. The reliability of the SPG is equal to the reliability of the computing resources plus the reliability of the data resources, in which the data resources are sensor networks. The complementary FT model shall cover the connections between the grid and the sensor network.
Fault type examples in the complementary FT model:
1. Resource processor failure.
2. Lack of availability of required resource accuracy.
3. Sensor network proxy stops.
4. Unavailability of sensor network proxy.
5. Disconnect between resource manager and sensor network, or disconnect between resource manager and resource.
6. Slow communication between resource manager and sensor network.
Fig. 2. Classification of fault types in SPG: Sensor Powered Grid failures divide into process failures (process stop failure 1, process QoS failure 2), resource or sensor failures (resource or sensor stop 3, resource or sensor QoS failure 4), and network failures (network disconnect failure 5, network QoS failure 6)
5 The Model
Our proposed reliability model consists of two sectors: the data sector and the computational sector. Thus the SPG reliability consists of:
SPG reliability = reliability of sensor network ∧ reliability of computational grid
According to our assumptions, the entire task is divided into m subtasks, which should be executed on the resources [20], and n data vectors, which should be provided by the sensor network, in such a way that:
Σ_{j=1..m} c_j = C ,  Σ_{i=1..n} d_i = D     (1)
where C is the entire computational task complexity, c_j is the complexity of each subtask j, D is the entire sensor detection complexity, and d_i is the detection rate for data vector i. The subtask processing time and the data vector detection time are as follows:
t_kj = c_j / x_k   (if the resource does not fail)
and
τ_li = d_i / v_l   (if the sensor does not fail)     (2)
and T_kj = ∞ or T_li = ∞ if the resource or sensor fails. In the case of resources with a constant failure rate λ_k, the probability that resource k does not fail during the processing of subtask j is:
p_kj = exp(−λ_k · t_kj)     (3)
An amount of data a_j should be transmitted between the RMS and the resource k that is processing subtask j (input data from the RMS to the resource and output data from the resource to the RMS). Therefore the time of communication between the RMS and the resource k that processes subtask j takes the value:
δ_kj = a_j / s_k     (4)
For a constant failure rate λ̂_k, the probability that communication channel k does not fail during delivering the initial data and sending back the results of subtask j can be obtained as:
q_kj = exp(−λ̂_k · (t_kj + δ_kj))     (5)
These give the distribution of the random subtask processing time in the computational resources:
Pr{T_kj = t_kj + δ_kj} = p_kj · q_kj  and  Pr{T_kj = ∞} = 1 − p_kj · q_kj     (6)
In the case of a sensor network with a variable failure rate, the probability that sensor l does not fail during the detection of data vector i is:
p_li = exp(−m̄_li)     (7)
in which m̄_li is the average number of failures of sensor l during the detection of data vector i. An amount of data a_i should be transmitted between the sensor l that collects data vector i and the RMS (input data from the sensor network to the RMS). The time of communication between the sensor network and the RMS is:
δ_li = a_i / s_l     (8)
For a constant failure rate λ̂_l, the probability that communication channel l does not fail during the delivery of data vector i can be obtained as:
q_li = exp(−λ̂_l · (τ_li + δ_li))     (9)
These give the distribution of the random data detection time:
Pr{T_li = τ_li + δ_li} = p_li · q_li  and  Pr{T_li = ∞} = 1 − p_li · q_li     (10)
Since the subtask is completed when its output reaches the RMS, the random completion time θ_ijk for the detected data vector i subjected to subtask j assigned to resource k is equal to T_li + T_kj. It can be easily seen that the distribution of this time is:
Pr{θ_ijk = τ_li + δ_li + t_kj + δ_kj} = p_li · q_li · p_kj · q_kj  and  Pr{θ_ijk = ∞} = 1 − p_li · q_li · p_kj · q_kj     (11)
Notation:
R(t): reliability function of time t
τ_li: random time of data vector i detection by sensor l
p_kj: probability that subtask j is correctly completed by resource k
p_li: probability that data vector i is correctly detected by sensor l
n: number of detected data vectors
C: task complexity
D: data vector (detection) complexity
c_j: computational complexity of subtask j
d_i: detection rate of data vector i
a_j: data quantity to be transmitted between the RMS and resource k
a_i: data quantity to be transmitted between sensor l and the RMS
s_k: communication channel speed for resource k
s_l: communication channel speed for sensor l
x_k: processing speed of resource k
v_l: detection speed of sensor l
δ_kj: communication time of resource k for subtask j
δ_li: communication time of sensor l for data vector i
T_kj: random time of subtask j processing by resource k
q_kj: probability that data quantity a_j is transported from resource k without failure
q_li: probability that data quantity a_i is transported from sensor l without failure
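For concreteness, the per-resource terms above can be computed directly. The sketch below is our own illustration with made-up numbers; it follows Eqs. (2)-(6): processing time t = c/x, communication time δ = a/s, survival probabilities p = e^(−λ·t) and q = e^(−λ̂·(t+δ)), and a completion time that is t+δ with probability p·q and infinite otherwise.

```python
import math

def completion_distribution(c_j, x_k, a_j, s_k, lam_res, lam_ch):
    """Distribution of the random completion time of subtask j on
    resource k: finite value t + delta with prob p*q, infinity otherwise."""
    t = c_j / x_k                          # processing time, Eq. (2)
    delta = a_j / s_k                      # communication time, Eq. (4)
    p = math.exp(-lam_res * t)             # resource survives processing, Eq. (3)
    q = math.exp(-lam_ch * (t + delta))    # channel survives the exchange, Eq. (5)
    return {t + delta: p * q, math.inf: 1 - p * q}   # Eq. (6)

# Hypothetical subtask: complexity 12 on a speed-3 resource, 2 data units
# over a speed-1 channel, failure rates 0.01 and 0.005 per second.
dist = completion_distribution(12, 3, 2, 1, 0.01, 0.005)
print(dist)
```

The same function, with detection rate and detection speed in place of complexity and processing speed, gives the sensor-side distribution of Eq. (10).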
We assume that each sensor detects a data vector i and sends it to the RMS, and then the RMS divides it into m subtasks. Each subtask j is assigned to the resources comprising a set S_j. In this case the random time of subtask j completion is:
T_j = min{ θ_ijk : k ∈ S_j }     (12)
The entire task is completed when all of the subtasks (including the slowest one) are completed. Therefore the random task execution time takes the form:
Θ = max{ T_j : 1 ≤ j ≤ m }     (13)
5.1 Reliability and Performance
In order to estimate both the service reliability and its performance, different measures can be used depending on the application. The system reliability is defined as the probability that the correct output is produced in a time less than θ*. This index can be obtained as:
R(θ*) = Pr(Θ < θ*) = Σ_{θ_s < θ*} Pr(Θ = θ_s)     (14)
where the θ_s are the realizations of Θ and θ* is the critical time. The service performance (the number of executed tasks over a fixed time) is another point of interest. The probability that the service produces a correct output, without respect to the total task execution time, can be referred to as R(∞). The conditional expected task execution time W (given that the system produces a correct output) is considered to be a measure of its performance. This index determines the expected service execution time given that the system does not fail. It can be obtained as:
W = ( Σ_{θ_s < ∞} θ_s · Pr(Θ = θ_s) ) / R(∞)     (15)
The procedure used in this paper for the evaluation of the service time distribution is based on the universal moment generating function (UMGF) technique, a modern mathematical technique introduced in [21]. This method, convenient for numerical implementation, has proved to be very effective for high-dimension combinatorial problems. In the literature, the universal z-transform is also called the UMGF or simply the u-transform. The UMGF extends the widely known ordinary moment generating function [22]. The UMGF of a discrete random variable y is defined as a polynomial:
u(z) = Σ_{j=1..J} P_j · z^{y_j}     (16)
where the variable y has J possible values and P_j is the probability that y is equal to y_j. To obtain the u-transform representing the performance of a function of two independent random variables φ(y1, y2), composition operators are introduced. These operators determine the u-transform for φ(y1, y2) using simple algebraic operations on the individual u-transforms of the variables. All of the composition operators take the form:
u1(z) ⊗_φ u2(z) = ( Σ_i P1_i · z^{y1_i} ) ⊗_φ ( Σ_j P2_j · z^{y2_j} ) = Σ_i Σ_j P1_i · P2_j · z^{φ(y1_i, y2_j)}     (17)
In the case of the grid system, the u-transform can define the performance of the total completion time θ_ijk for data vector i resulting in subtask j assigned to resource k. This u-transform takes the form:
u_kj(z) = p_kj · q_kj · z^{t_kj + δ_kj} + (1 − p_kj · q_kj) · z^∞  for resources, and
u_li(z) = p_li · q_li · z^{τ_li + δ_li} + (1 − p_li · q_li) · z^∞  for sensors     (18)
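A u-transform can be represented as a mapping from values to probabilities, and the composition operator of Eq. (17) is then a double loop over the two polynomials. The sketch below is our own minimal implementation (with math.inf standing in for z^∞); min and max compositions correspond to the redundancy and series structures of Eqs. (12) and (13).

```python
import math
from collections import defaultdict

def compose(u1, u2, phi):
    """Composition operator of Eq. (17): combine two u-transforms
    {value: probability} through the structure function phi."""
    out = defaultdict(float)
    for y1, p1 in u1.items():
        for y2, p2 in u2.items():
            out[phi(y1, y2)] += p1 * p2
    return dict(out)

# u-transforms of two components: finish times 5 and 7 with success
# probabilities 0.9 and 0.8, failure represented by an infinite time.
u_a = {5: 0.9, math.inf: 0.1}
u_b = {7: 0.8, math.inf: 0.2}

parallel = compose(u_a, u_b, min)   # redundant copies: the fastest wins, Eq. (12)
series = compose(u_a, u_b, max)     # both subtasks must finish, Eq. (13)
print(parallel)
print(series)
```

Chaining `compose` calls over all sensors and resources yields the u-transform of Θ, from which Eqs. (14) and (15) follow by summing coefficients.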
6 An Analytical Example
Consider a sensor powered grid service that uses the data vectors of two sensors and four computational resources. Assume that each data vector's computational process is divided into two subtasks by the RMS. The first subtask of each data vector is assigned to resources 1 and 2; the second subtask is assigned to resources 3 and 4. The reliability of the sensors and resources, and the assigned subtask completion times for the available sensors and resources, are presented in Tables 1 and 2. Regarding the communication times in Tables 1 and 2, it is assumed that the resources start the computation process as soon as the sensor networks start data detection. Also, it is assumed that the communication time of the sensor networks includes the start-up time before data transmittal. The estimated arrangement of the possible network is illustrated in Fig. 3.
Fig. 3. Schematic arrangement of possible network, S: Sensors, R: Resources & P: Paths
Table 1. Reliability p_kj, q_kj of SPG resources and resources' communication paths, and subtask completion time T_kj + δ_kj including computation and communication times (s)

Resource k | Reliability, subtask 1 | Reliability, subtask 2 | Completion time, subtask 1 | Completion time, subtask 2
1          | 0.6, 0.7               | -                      | 7+1                        | -
2          | 0.5, 0.9               | -                      | 8+2                        | -
3          | -                      | 0.4, 0.7               | -                          | 9+1
4          | -                      | 0.7, 0.5               | -                          | 10+3

Table 2. Reliability p_li, q_li of SPG sensors and sensors' communication paths, and data detection completion time τ_li + δ_li including detection and communication times (s)

Sensor i | Reliability p_li, q_li | Completion time
1        | 0.7, 0.3               | 3+6
2        | 0.5, 0.6               | 2+7
In order to determine the completion time distribution for both sensors and both subtasks, the u-transforms u and u′ are defined:
u_1(Z) = 0.7Z^3 + 0.3Z^∞ ,  u′_1(Z) = 0.3Z^6 + 0.7Z^∞   (S1 and P1)
u_2(Z) = 0.5Z^2 + 0.5Z^∞ ,  u′_2(Z) = 0.6Z^7 + 0.4Z^∞   (S2 and P2)
u_{1,1}(Z) = 0.6Z^7 + 0.4Z^∞ ,  u′_{1,1}(Z) = 0.7Z^1 + 0.3Z^∞   (R1 and P3 in s1)
u_{1,2}(Z) = 0.5Z^8 + 0.5Z^∞ ,  u′_{1,2}(Z) = 0.9Z^2 + 0.1Z^∞   (R2 and P4 in s1)
u_{2,3}(Z) = 0.4Z^9 + 0.6Z^∞ ,  u′_{2,3}(Z) = 0.7Z^1 + 0.3Z^∞   (R3 and P5 in s2)
u_{2,4}(Z) = 0.7Z^10 + 0.3Z^∞ ,  u′_{2,4}(Z) = 0.5Z^3 + 0.5Z^∞   (R4 and P6 in s2)
The u-transforms representing the performance of the completion times θ_ijk are obtained as follows:
Ω1{u_1, u′_1}(Z) = 0.21Z^6 + 0.79Z^∞   (S1 & P1)
Ω2{u_2, u′_2}(Z) = 0.3Z^7 + 0.7Z^∞   (S2 & P2)
Ω3{u_{1,1}, u′_{1,1}}(Z) = 0.42Z^7 + 0.58Z^∞   (R1 & P3)
Ω4{u_{1,2}, u′_{1,2}}(Z) = 0.45Z^8 + 0.55Z^∞   (R2 & P4)
Ω5{u_{2,3}, u′_{2,3}}(Z) = 0.28Z^9 + 0.72Z^∞   (R3 & P5)
Ω6{u_{2,4}, u′_{2,4}}(Z) = 0.35Z^10 + 0.65Z^∞   (R4 & P6)
Then:
Γ1{Ω1, Ω2}(Z) = 0.21Z^6 + 0.237Z^7 + 0.553Z^∞   (for the data vector)
Γ2{Ω3, Ω4}(Z) = 0.42Z^7 + 0.261Z^8 + 0.319Z^∞   (for subtask 1)
Γ3{Ω5, Ω6}(Z) = 0.28Z^9 + 0.252Z^10 + 0.468Z^∞   (for subtask 2)
And finally:
Ω7{Γ1, Γ2}(Z) = 0.187Z^7 + 0.115Z^8 + 0.698Z^∞
Ω8{Ω7, Γ3}(Z) = 0.089Z^9 + 0.075Z^10 + 0.836Z^∞
Now this u-transform represents the performance of Θ:
Pr(Θ = 9) = 0.089 ,  Pr(Θ = 10) = 0.075 ,  Pr(Θ = ∞) = 0.836
From the obtained performance we can calculate the service reliability as follows:
R(θ*) = 0.089 for 9 < θ* ≤ 10 ,  R(∞) = 0.164
W = (0.089 × 9 + 0.075 × 10) / 0.164 = 9.457
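The arithmetic of the example can be rechecked mechanically. The sketch below is our own verification code, not part of the original paper: it rebuilds Γ3 from the Table 1 values for resources 3 and 4 (with their paths) using a min composition, then computes R(∞) and W from the final distribution of Θ given above.

```python
import math
from collections import defaultdict

def compose(u1, u2, phi):
    """Composition operator: combine two u-transforms {value: prob}."""
    out = defaultdict(float)
    for y1, p1 in u1.items():
        for y2, p2 in u2.items():
            out[phi(y1, y2)] += p1 * p2
    return dict(out)

# Resources 3 and 4 with their paths (Table 1): success prob = p*q,
# finite completion exponents as in Omega5 and Omega6 of the example.
omega5 = {9: 0.4 * 0.7, math.inf: 1 - 0.4 * 0.7}    # R3 & P5
omega6 = {10: 0.7 * 0.5, math.inf: 1 - 0.7 * 0.5}   # R4 & P6
gamma3 = compose(omega5, omega6, min)  # subtask 2: either resource suffices
print(gamma3)  # coefficients ~ 0.28 at 9, 0.252 at 10, 0.468 at infinity

# Reliability and conditional expected time from the final distribution of Theta.
theta = {9: 0.089, 10: 0.075, math.inf: 0.836}
r_inf = sum(p for t, p in theta.items() if t < math.inf)          # R(inf) ~ 0.164
w = sum(t * p for t, p in theta.items() if t < math.inf) / r_inf  # Eq. (15), W ~ 9.457
print(r_inf, round(w, 3))
```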
7 Conclusions
The use of sensor networks is spreading greatly in computational grids. The requirement for online data in computational grids has mandated the use of an integrated sensor network powered computational grid. Although some fault tolerance models have been developed for grids, little investigation has been carried out on FT models in sensor powered grids. In this paper, using the UMGF technique, a reliability model for the SPG has been developed. Other results are outlined hereunder:
- A diagram showing the classified SPG failures has been presented. On one hand, all failures are classified into two groups of crash-related and QoS-level-related failures; on the other hand, failures can be classified into three sectors of process, resource or sensor, and network failures.
- Using the distributions of the random processing, detection and communication times, a performance measure for different SPG arrangements has been developed.
- Using the UMGF technique, a model for measuring the total time of task execution has been presented.
- In the presented model, the total time has been divided into the processing time in resources, the detection time in sensors, and the communication times on the sensor-RMS and resource-RMS paths.
- The performance of sensors, resources and paths is measured distinctly.
8 References
[1] Jalote, P. Fault Tolerance in Distributed Systems. Prentice-Hall, Inc., 1994.
[2] Stelling, P., Foster, I., Kesselman, C., Lee, C., von Laszewski, G. "A Fault Detection Service for Wide Area Distributed Computations". In: Proceedings of the Seventh IEEE Symposium on High Performance Distributed Computing, 268—278, 1998.
[3] Czajkowski, K., Fitzgerald, S., Foster, I., Kesselman, C. "Grid Information Services for Distributed Resource Sharing". In: Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing, 2001.
[4] Grimshaw, A., Wulf, W., the Legion Team. "The Legion Vision of a Worldwide Virtual Computer". Communications of the ACM, 1997.
[5] Frey, J., Tannenbaum, T., Foster, I., Livny, M., Tuecke, S. "Condor-G: A Computation Management Agent for Multi-Institutional Grids". Cluster Computing, Vol. 5, No. 3, 2002.
[6] Hwang, S., Kesselman, C. "A flexible framework for fault tolerance in the grid". J. Grid Comput. 1, 251—272, 2003.
[7] The Globus project, http://www-fp.globus.org/hbm
[8] Grimshaw, A.S., Ferrari, A., West, E.A. "Mentat". In: Wilson, G.V., Lu, P. (Eds.), Parallel Programming Using C++, 382—427 (Chapter 10), 1996.
[9] Gartner, F.C. "Fundamentals of fault-tolerant distributed computing in asynchronous environments". ACM Comput. Surv. 31 (1), 1999.
[10] Marzullo, K. "Tolerating Failures of Continuous-Valued Sensors". ACM Transactions on Computer Systems, Vol. 8, 284—304, Nov 1990.
[11] Prasad, L., Iyengar, S.S., Kashyap, R.L., Madan, R.N. "Functional Characterization of Fault Tolerant Integration in Distributed Sensor Networks". IEEE Transactions on Systems, Man and Cybernetics, Vol. 21, 1082—1087, Sep/Oct 1991.
[12] Liu, H.W., Mu, C.D. "Efficient Algorithm for Fault Tolerance in Multi-sensor Networks". In: International Conference on Machine Learning and Cybernetics, Vol. 2, 1258—1262, August 2004.
[13] Jayasimha, D.N. "Fault tolerance in multi-sensor networks". IEEE Transactions on Reliability, June 1996.
[14] Koushanfar, F., Potkonjak, M., Sangiovanni-Vincentelli, A. "Fault Tolerance in Wireless Ad-Hoc Sensor Networks". In: IEEE Sensors, 1491—1496, June 2002.
[15] Ishizuka, M., Aida, M. "Performance Study of Node Placement in Sensor Networks". In: Proceedings of the 24th International Conference on Distributed Computing Systems Workshops, 598—603, 2004.
[16] Ratnasamy, S., Karp, B., Shenker, S., Estrin, D., Govindan, R., Yin, L., Yu, F. "Data-Centric Storage in Sensornets with GHT, a Geographic Hash Table". Mob. Netw. Appl., 8(4): 427—442, August 2003.
[17] Saleh, I., Agbaria, A., Eltoweissy, M. "In-Network Fault Tolerance in Networked Sensor Systems". DIWANS'06, Sept 2006.
[18] Neogy, S. "WTMR – A New Fault Tolerance Technique for Wireless and Mobile Computing Systems". Department of Computer Science & Engineering, Jadavpur University, India, 2007.
[19] Tham, C.K., Buyya, R. "SensorGrid: Integrating Sensor Networks and Grid Computing". CSI Communications, 24, July 2005.
[20] Levitin, G., Dai, Y.S. "Service reliability and performance in grid system with star topology". Reliability Engineering & System Safety, 2005.
[21] El-Neweihi, E., Proschan, F. "Degradable systems: a survey of multistate system theory". Commun. Statist. Theory Meth., 13(4), 405—432, 1984.
[22] Ushakov, I.A. "Universal generating function". Sov. J. Computing System Science, 24(5), 118—129, 1986.
G-BLAST: A Grid Service for BLAST
Purushotham Bangalore, Enis Afgan
Department of Computer and Information Sciences
University of Alabama at Birmingham
1300 University Blvd., Birmingham AL 35294-1170
{puri, afgane}@cis.uab.edu
Abstract – This paper describes the design and implementation of G-BLAST, a Grid Service for one of the most widely used bioinformatics applications, the Basic Local Alignment Search Tool (BLAST). G-BLAST uses the factory design pattern to provide application developers a common interface for incorporating multiple implementations of BLAST. The process of application selection, resource selection, scheduling, and monitoring is completely hidden from the end-user through web-based user interfaces, and programmatic interfaces enable users to employ G-BLAST as part of a bioinformatics pipeline. G-BLAST uses an adaptive scheduler to select the best application and the best set of available resources that will provide the shortest turnaround time when executed in a grid environment. G-BLAST has been successfully deployed on a campus and a regional grid, and several BLAST applications were tested for different combinations of input parameters and computational resources. Experimental results illustrate the overall performance improvements obtained with G-BLAST.
Keywords: grid, BLAST, scheduling, usability
1 Introduction
Basic Local Alignment Search Tool (BLAST) is a
sequence analysis tool that performs similarity searches
between a short query sequence and a large database of
infrequently changing information such as DNA and amino
acid sequences [8, 9]. With the rapid development of
sequencing technology of large genomes for several species,
the sequence databases have been growing at exponential
rates [11]. Faced with rapidly expanding target databases and growing query lengths and counts, the BLAST programs take significant time to find a match. Parallel computing techniques have helped BLAST gain speedup on searches by distributing search jobs over a cluster of computers. Several parallel BLAST search tools [13, 19] have been demonstrated to be effective at improving BLAST's performance. mpiBLAST [19] and TurboBLAST [13] use database segmentation to distribute a portion of the sequence database to each cluster node; thus, each cluster node only needs to search a query against its portion of the sequence database. Other researchers apply query segmentation to alleviate the burden of search jobs [16, 18]. In query segmentation, a subset of queries, instead of the database, is distributed to each cluster node, each of which has access to the whole database. As far as the end-user of the BLAST application is concerned, only the final outcome and turnaround time are of interest; typical users do not really care which of the above techniques were used to generate the final results.
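The difference between the two segmentation strategies can be illustrated with a small sketch (the types and method names here are illustrative only, not taken from mpiBLAST or TurboBLAST):

```java
import java.util.ArrayList;
import java.util.List;

// Contrast of the two parallelization strategies described above.
// Types and method names are illustrative, not taken from mpiBLAST
// or TurboBLAST.
class Segmentation {
    // Query segmentation: each node receives a subset of the queries
    // and searches them against the full database.
    static <T> List<List<T>> querySplit(List<T> queries, int nodes) {
        List<List<T>> shards = new ArrayList<>();
        for (int i = 0; i < nodes; i++) shards.add(new ArrayList<>());
        for (int i = 0; i < queries.size(); i++)
            shards.get(i % nodes).add(queries.get(i)); // round-robin assignment
        return shards;
    }

    // Database segmentation: each node receives a contiguous slice of the
    // database and searches every query against its slice.
    static <T> List<List<T>> dbSplit(List<T> database, int nodes) {
        List<List<T>> slices = new ArrayList<>();
        int chunk = (database.size() + nodes - 1) / nodes; // ceiling division
        for (int start = 0; start < database.size(); start += chunk)
            slices.add(database.subList(start,
                    Math.min(start + chunk, database.size())));
        return slices;
    }
}
```

Either way, each node runs an ordinary BLAST search on its share; the difference is only whether the queries or the database are partitioned, which drives the memory and I/O trade-off discussed later in Section 3.2.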
The majority of parallel BLAST applications, however, cannot cross the boundary of a computer cluster, i.e., the communication among parallel instances of the BLAST algorithm is limited to computing nodes with homogeneous system architectures and operating systems. This limitation heavily encumbers the development of cooperative BLAST applications across heterogeneous computing environments, particularly now that many universities and research institutes have started to build grids to take advantage of the various computational resources distributed across their organizations.
The emerging Grid computing technology [12] based on
Service Oriented Architecture (SOA) [20] and Web Services
provides an ideal development platform to take advantage of
the distributed computational resources. Grid computing not only makes the maximum available data and computing resources accessible to BLAST searches, but also addresses critical issues such as security, load balancing, and fault tolerance. Grid services [20] provide several unique features such as stateful services, notification, and uniform authentication and authorization across different
administrative domains. The focus of this paper is to develop
a grid service for BLAST by exploiting these unique features
of grid services and provide ubiquitous access to distributed
computational resources as well as hide the various details
about application and resource selection, job scheduling,
execution, and monitoring. One of the goals of this work is to
provide a web-based interface through which users can
submit queries, monitor their job status, and access results.
The portal could then dispatch the queries to all the available
computing resources according to a well-planned scheduling
scheme that takes into account the heterogeneity of the
resources and performance characteristics of BLAST on
these resources based on the query length, number of queries,
and the database used. An additional goal of this effort is to
provide application developers a common interface so that different implementations of BLAST can be easily
incorporated within the G-BLAST service. A scheduler can
then select the best application (multithreaded, query split, or
database split) for the available resources and dispatch the
job to the appropriate resource(s). There is no need for the end-user to be concerned about which version of the BLAST application was used, nor about the computational resource(s) that were used to execute it.
The rest of this paper is organized as follows. Section 2 presents the overall architecture of G-BLAST and describes its various components. The experimental setup and deployment details used to test G-BLAST are provided in Section 3. Related work is described in Section 4, and a summary and conclusions are provided in Section 5.
2 G-BLAST Architecture
The overall architecture of G-BLAST is illustrated in
Figure 1. G-BLAST has the following four key components:
(a) G-BLAST Core Service: Provides a uniform interface
through which a specific version of BLAST could be
instantiated. This enables application developers to
extend the core interface and incorporate newer
versions of BLAST applications.
(b) User Interfaces: Provides web and programmatic
interface for file transfer, job submission, job
monitoring and notification. These interfaces support
user interactions without exposing any of the details
about the grid environment and the application
selection process.
(c) Scheduler: Selects the best available resource and
application based on user request using a two-level
adaptive scheduling scheme.
(d) BLAST Grid Services: Individual grid services for
each of the BLAST variations that are deployed on
each of the computational resources.
Fig 1. Overall architecture of G-BLAST
The rest of this section describes each of the key
components of G-BLAST in detail.
2.1 G-BLAST Core Service
A BLAST Grid Service with a uniform Grid service
interface is deployed on each of the computing resources. It
is located between the Invoker and each implementation of the BLAST programs. Regardless of which BLAST programs are deployed on a resource, the BLAST Grid service hides their differences and provides the same fundamental features. To help developers integrate individual BLAST instances into the G-BLAST framework, the BLAST Grid service defines the following methods for each instance:
1. UploadFile: Upload query sequences to a compute
node.
2. DownloadFile: Download query results from the
compute node.
3. RunBlast: Invoke corresponding BLAST programs on
the compute node(s).
4. GetStatus: Return current status of the job (i.e., pending,
running, done).
5. NotifyUser: Notify the user once the job is complete and
the results are available.
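The five methods above amount to a small service interface. A sketch of what it might look like, with an in-memory stub to show the call sequence (all signatures here are assumptions; the paper only names the methods):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the uniform BLAST Grid service interface described above.
// Signatures are assumptions; the paper only names the five methods.
interface BLASTService {
    void uploadFile(String localPath, String remotePath);
    void downloadFile(String remotePath, String localPath);
    String runBlast(String queryFile, String database); // returns a job id
    String getStatus(String jobId); // "pending", "running", or "done"
    void notifyUser(String jobId, String userAddress);
}

// Minimal in-memory stub, just to show the call sequence a client would use.
class StubBLASTService implements BLASTService {
    private final Map<String, String> status = new HashMap<>();
    private int next = 0;

    public void uploadFile(String l, String r) { /* would transfer the file */ }
    public void downloadFile(String r, String l) { /* would fetch results */ }

    public String runBlast(String queryFile, String database) {
        String id = "job-" + (next++);
        status.put(id, "pending");
        return id;
    }

    public String getStatus(String jobId) {
        return status.getOrDefault(jobId, "unknown");
    }

    public void notifyUser(String jobId, String user) {
        status.put(jobId, "done");
    }
}
```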
With G-BLAST, developers can easily add new BLAST services (corresponding to the BLAST programs and the computing resources supporting them) without modifying any G-BLAST core source code. In addition, developers can add new BLAST services on the fly, without interrupting any of the other G-BLAST services. To accommodate such functionality, G-BLAST employs the creational design pattern "factory method" [22] to enable the invoker to call
newly-built BLAST services without changing its source
code. To integrate their corresponding BLAST programs into
this framework, developers should create and deploy Grid
services on each of the computing resources in the Grid.
As shown in Figure 2, Invoker and BLASTService are two abstract classes representing, respectively, the invoker in the G-BLAST service core and the BLAST services on computing resources. When a new BLAST service (e.g., mpiBLAST) is added to the system, the relevant invoker (mpiInvoker) for that service must be integrated as a subclass of the class Invoker. When the invoker wants to call the new BLAST service, it first creates an instance of mpiInvoker, then lets the new invoker generate an instance of mpiBLAST by calling the member function CreateService(). Thus, the invoker does not need to hard-code the instantiation of each type of BLAST service.
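The arrangement described above and in Figure 2 can be sketched in plain Java as follows (class names follow the figure; the method bodies are illustrative assumptions):

```java
// Factory-method sketch following Figure 2: the abstract Invoker defers
// service creation to subclasses, so adding a new BLAST variant needs no
// change to core code. Method bodies are illustrative assumptions.
abstract class BLASTService {
    abstract String name();
}

abstract class Invoker {
    // The "factory method": each concrete invoker knows which service to build.
    abstract BLASTService createService();

    // Core logic works against the abstract type only.
    String describe() {
        return "invoking " + createService().name();
    }
}

class mpiBLAST extends BLASTService {
    String name() { return "mpiBLAST"; }
    void sendQuery(String query) { /* would forward the query to the cluster */ }
}

class mpiInvoker extends Invoker {
    BLASTService createService() { return new mpiBLAST(); } // as in Figure 2
}
```

Adding, say, a TurboBLAST service would then mean writing one new BLASTService subclass and one new Invoker subclass, leaving the core untouched.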
Fig 2. Factory method for BLAST service
[Figure 2 is a class diagram: abstract classes BLASTService (UploadFile(), DownloadFile(), ...) and Invoker; mpiBLAST (SendQuery(), ...) extends BLASTService; mpiInvoker extends Invoker and overrides CreateService() to return new mpiBLAST. Further labels spilled here from Figure 1: Users, Web Interface, Client Program, BLAST services, Grid Service Interface, Invoker, Scheduler, AIS (Application Information), GIS (Resource Information), Query/Response, Dispatch/Result, Notify.]
This design pattern encapsulates the knowledge of which BLAST services to create and delegates the responsibility of choosing the appropriate BLAST service(s) to the scheduler (described in Section 2.3). The Invoker can invoke more than one BLAST service, based on the availability of resources, to satisfy user requirements.
2.2 User Interfaces
The G-BLAST framework provides unified, integrated interfaces for users to invoke BLAST services over a heterogeneous, distributed Grid computing environment. The interfaces expose the general functionality provided by each individual BLAST service while hiding the implementation details from the end users. Two user interfaces are currently implemented to satisfy different users' requirements. For users who want to submit queries as part of a workflow, a programmable interface is furnished through a Grid Service. The service data and notification mechanisms supported by Grid Services are integrated into the BLAST Grid service to provide stateful services with better job monitoring and notification. For users who want to submit each query with individual parameter settings and are familiar with a traditional BLAST interface, such as NCBI BLAST [26], a web interface is implemented for job submission, monitoring, and file management.
G-BLAST exploits the notification mechanism [20]
provided by grid services in two aspects. One aspect is the
notification of changes by BLAST services to the scheduler.
The other aspect is the notification of job completion to the
end users. Both strictly follow the notification protocol. For notification of service changes, the BLAST services are the notification source, and the scheduler is the notification subscriber. Whenever the BLAST service on a computing node changes, the service automatically notifies the scheduler with up-to-date information. This mechanism keeps the scheduler updated with the most recent status of each BLAST service, and therefore helps the scheduler make informed decisions on the selection of computing resources. Notification of job completion has a similar implementation, except that the notification sink is the registered client program.
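The source/sink pattern described above, with the BLAST service as source and the scheduler or client program as sink, can be sketched as follows (the real system uses the Grid service notification mechanism; a plain listener list stands in for it here):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the notification pattern described above: the BLAST service is
// the notification source; the scheduler and client programs are sinks.
// A plain listener list stands in for Grid service notification.
interface NotificationSink {
    void onChange(String event);
}

class NotifyingBLASTService {
    private final List<NotificationSink> sinks = new ArrayList<>();

    // A scheduler or client program registers itself as a subscriber.
    void subscribe(NotificationSink sink) {
        sinks.add(sink);
    }

    // Any status change is pushed to every subscriber.
    void statusChanged(String event) {
        for (NotificationSink s : sinks) s.onChange(event);
    }
}
```

The scheduler would subscribe once per BLAST service to track resource status; a client program would subscribe only to the completion events of its own job.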
To facilitate users’ using G-BLAST, a programming
template is also provided to guide users’ code their own
client program for G-BLAST service invocation. Figure 3
demonstrates the major part of a client program that invokes
G-BLAST service by creating Grid service handler,
uploading query sequence(s) to the back-end server,
submitting a query job, checking the job status, and finally
retrieving the query results.
In addition to providing a programmatic interface for the
end user, the framework also provides a web workspace that
supports the needs of a general, non-technical Grid user who
prefers graphical user interface to writing code. The most
common needs of a general user are file management, job
submission, and job monitoring. File management is
supported through a web file browser allowing users to
upload new query files or download search result files. It is a
simplified version of an FTP client that is developed in PHP.
The job submission module is made as simple as possible to use: after naming the job for easy reference, the user only provides or selects a search query file and chooses the database to search against. Application selection, resource selection, file
mapping, and data transfer are handled automatically by the
system. Finally, the job monitoring module presents the user
with the list of his or her jobs. It includes a date range
allowing the user to view not only the currently running jobs,
but the completed jobs as well. When viewing the jobs, the
user is given the name of the job, current status (running,
done, pending, or failed) and job execution start/end time.
Upon clicking on the job name, the user can view more
detailed information about the query file, database used, and
start and end time. The user is also given the option to re-
submit a job with the same set of parameters or after
changing one of the parameters.
// Get command-line argument as Grid Service Handler
URL GSH = new java.net.URL(args[0]);

// Get a reference to the Grid Service instance
GBLASTServiceGridLocator gblastServiceLocator =
    new GBLASTServiceGridLocator();
GBLASTPortType gblast =
    gblastServiceLocator.getGBLASTServicePort(GSH);
// ...
// Upload the query sequence(s)
gblast.FileTransfer(inputFile, src, remote);
// ...
// Submit the query as a job
gblast.BLASTRequest(blastRequest);
jobid = gblast.JobSubmit();
// ...
// Check the query (job) status
gblast.JobStatus(jobid);
// ...
// Retrieve the query result
gblast.ResultRetrive(jobid);

Fig 3. Client program to invoke the G-BLAST service
2.3 Two-level Adaptive Scheduler
Due to the heterogeneity of the resources available in the Grid, system usability as well as application performance can be drastically affected without an efficient scheduler service. Rather than developing a general-purpose meta-scheduler that tries to schedule every application the same way, using the same set of deciding factors, we have created a two-level application-specific scheduler that uses application- and resource-specific information to provide a high-level service for the end user (whether shorter turnaround time, better resource utilization, or improved system usability).
The scheduler collects application specific information in
the Application Information Services (AIS) [3], initially from
the developer through Application Specification Language
(ASL) [3] and later from application runs. ASL assists
developers to describe the application requirements and the
AIS acts as a repository of application descriptors that can be used by a resource broker to select the appropriate application. ASL is much like RSL, but from the application
point of view. It is a language that provides a way for the
application developer to specify the requirements imposed by
the application. It specifies deployment parameters that have
to be fulfilled during runtime such as required libraries,
specific operating system, minimum/maximum number of
processors required to run, specific input file(s) required,
specific input file format, minimum/maximum amount of
memory and disk space, type of interconnection network and
so on. Unlike RSL, where only the end user specifies the requirements for their job, ASL allows the application developer or owner to specify requirements for allowing the application to be run (e.g., licensing, subscription fees), thus creating a contract between the user and the developer.
The scheduler uses this information in each subsequent decision when selecting the best available resource (say, the resource yielding the shortest turnaround time, the cheapest resource, or the most reliable resource). Once the user provides the necessary job information in the Job Description File (JDF), the scheduler obtains a snapshot of the available resources in the Grid and, based on the information obtained from the AIS, automatically performs a matching between the JDF and the ASL descriptors to determine which of the available algorithms and resources will yield the desired performance. For more details on the inner workings of the scheduler, please refer to [5, 6].
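The matching step can be sketched as a scoring loop over (application variant, resource) pairs. The requirement fields and the linear cost model below are placeholders, not the actual ASL contents or the prediction model from [5, 6]:

```java
import java.util.List;

// Sketch of the matching step described above: enumerate (application
// variant, resource) pairs that satisfy the ASL-style requirements and keep
// the pair with the lowest predicted turnaround time. The fields and the
// linear cost model are placeholders, not the actual model from [5, 6].
class AdaptiveScheduler {
    static class Resource {
        final String name; final int cpus; final int memoryGb;
        Resource(String name, int cpus, int memoryGb) {
            this.name = name; this.cpus = cpus; this.memoryGb = memoryGb;
        }
    }

    static class Variant {
        final String name; final int minCpus; final int minMemoryGb;
        final double secPerQuery;
        Variant(String name, int minCpus, int minMemoryGb, double secPerQuery) {
            this.name = name; this.minCpus = minCpus;
            this.minMemoryGb = minMemoryGb; this.secPerQuery = secPerQuery;
        }
    }

    // Returns "variant@resource" for the best pair, or null if none qualifies.
    static String select(List<Variant> variants, List<Resource> resources,
                         int queries) {
        String best = null;
        double bestTime = Double.MAX_VALUE;
        for (Variant v : variants) {
            for (Resource r : resources) {
                // Requirement check, standing in for the JDF/ASL matching.
                if (r.cpus < v.minCpus || r.memoryGb < v.minMemoryGb) continue;
                // Placeholder prediction: per-query cost divided across CPUs.
                double t = queries * v.secPerQuery / r.cpus;
                if (t < bestTime) { bestTime = t; best = v.name + "@" + r.name; }
            }
        }
        return best;
    }
}
```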
3 Deployment and Results
3.1 UABGrid
UABGrid is a campus grid that includes computational
resources belonging to several administrative and academic
units at the University of Alabama at Birmingham (UAB).
The campus-wide security infrastructure required to access the various resources on UABGrid is provided by Weblogin [28] using the campus-wide Enterprise Identity Management System [27]. There is a diverse pool of available machines in UABGrid, ranging from mini-clusters based on Intel Pentium IV processors and an Intel Xeon based Condor pool to several clusters made up of 64-bit AMD Opteron and Intel Xeon CPUs. Each of the participating departments has complete
autonomy over local resource administration, resulting in a
true grid setup. Access to individual resources was
traditionally made through SSH or command line GRAM
tools, but more recently we have added a general purpose
resource broker [7, 23] and a portal that facilitates resource
selection based on user’s job requirements. Since we are
focusing our work to BLAST, a common feature of all the
resources in UABGrid is that they have BLAST and/or
mpiBLAST installed and available for use. The sequence
databases on local resources are updated daily by a cron job
and formatted appropriately to speed up user query searches.
3.2 Experimental Setup and Results
The scheduler has to select not only the best resource among a set of available resources but also the best version of the BLAST application, so as to deliver the shortest turnaround time for any given user request. In order to develop the knowledge base required to test the capabilities of the scheduler in delivering these goals, we executed these applications on diverse computer architectures that are representative of actual resources on UABGrid [4]. On each of these architectures, three different versions of the BLAST algorithm (multithreaded, query split, and database split) were executed. Each version of the BLAST application was in turn tested with three protein databases of varying sizes and three different query file sizes (varying the number of queries and query lengths). In this section we provide some of the key performance results to illustrate the impact of these various parameters on the overall performance of BLAST queries, and describe how the adaptive scheduler uses these performance results to select the appropriate BLAST application and computational resources.
Three protein databases available from NCBI’s website
(ftp://ftp.ncbi.nlm.nih.gov/blast/db/) are used as part of these
experiments. The smallest database selected was yeast.nt, a 13 MB database representing protein translations from the yeast genome. As the medium-size database, we selected the non-redundant protein database (nr), an 821 MB database with entries from GenPept, Swissprot, PIR, PDF, PDB, and RefSeq [2]. Finally, the largest database selected was the 5 GB est (Expressed Sequence Tags) database, a division of GenBank which contains sequence data and other information on "single-pass" cDNA sequences from a number of organisms [1]. These databases represent a wide range of possible sizes and were selected to reflect the databases most commonly used by scientists, in order to provide a solid basis for the scheduling policies developed as part of this application. 10,000 protein queries were run against the three NCBI protein databases. Query input files are grouped into three classes based on the number of queries: small (fewer than 100 queries), medium (between 100 and 1,000), and large (over 1,000 queries).
For any given computer architecture and BLAST version,
the performance depends on the following input parameters:
individual query lengths, total number of queries, and the
size of the database against which the search is performed.
Results indicate that the execution time increases linearly as the length of individual queries increases. Figure 4 provides
BLAST execution time for queries of varying lengths using
nr database on three different architectures. Similar
experiments with other databases indicate that the execution
time increases correspondingly when the database size
increases. These experiments also highlight the importance
of CPU clock frequency on the overall execution time of a
BLAST query as the query length increases.
The performance of the query-splitting and database-splitting approaches was also compared on different architectures
with different query files and databases. Figure 5 provides
the comparison between query split and database split
approaches for the nr database using 10,000 query input file.
From the diagram, we can observe that BLAST maintains a nearly linear speedup up to 16 CPUs regardless of the algorithm used. Overall performance results indicate that the query-splitting algorithm outperforms the database-splitting algorithm by almost a factor of two. Using different databases shows similar results as long as the size of the database is less than the amount of main memory available. In the latter case, the database-splitting algorithm outperforms the query-splitting algorithm due to the reduced I/O involved in keeping only a portion of the original database in memory.
Fig 4. BLAST application performance as a function of query length: execution time (seconds, 0–120) vs. query length (0–2500) with the nr database and 1000 queries, with linear fits, on three architectures (Xeon EM64T 3.2 GHz, Xeon 32-bit 2.66 GHz, Opteron 1.6 GHz).
Testing the validity of resource selection involved
submitting a number of equal jobs and varying resource
availability. We varied the number of available CPUs as well
as availability of resources of different architectures and
capabilities. Table 1 shows the performance of multithreaded
BLAST on different architectures for 10 queries of varying
lengths with the nr database. These results indicate that
CPUs with higher clock frequencies and larger caches, along
with more memory, outperform their slower counterparts. In
addition, hyper-threading seems to offer significant
performance improvements on the Intel Xeon EM64T
architecture.
Fig 5. Direct comparison of execution time of the query-splitting and database-splitting versions of BLAST with varying number of processors (2, 4, 8, 10, and 16; 10,000 queries; run time in seconds).
Table 1. Performance comparison (execution times, in seconds) on different processor types against the nr database (1.1 GB) with a 10-query input file.

Processor Type                                                        1 thread  2 threads  3 threads
Intel Xeon (2.66 GHz, 32-bit, 512 KB L2, 2 GB RAM), dual processor       508       265        266
Intel Xeon (3.2 GHz, 64-bit, 2 MB L2, 4 GB RAM), dual processor          426       231        180
AMD Opteron (1.6 GHz, 64-bit, 1 MB L2, 2 GB RAM), dual processor         471       243        242
Macintosh G5 (2.5 GHz, 64-bit, 512 KB L2, 2 GB RAM), dual processor      382       198        ---
Sun Sparc E450 (400 MHz, 64-bit, 4 MB L2, 4 GB RAM), quad processor     2318      1183        590
Sun Sparc V880 (750 MHz, 64-bit, 8 MB L2, 8 GB RAM), quad processor     1211       615        318
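The hyper-threading observation can be checked arithmetically from Table 1: for the Xeon EM64T dual-processor machine, 426 s with one thread drops to 180 s with three threads, a speedup of about 2.4 on only two physical CPUs. A trivial helper makes the computation explicit:

```java
// Parallel speedup read directly off Table 1: the single-thread time
// divided by the n-thread time. The timings used below come from the
// table itself (Intel Xeon EM64T row).
class Speedup {
    static double speedup(double t1, double tn) {
        return t1 / tn;
    }
}
```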
The use of grid technologies and the described scheduler has enabled G-BLAST to move beyond executing on any single resource and to execute users' jobs on multiple resources simultaneously, thus realizing shorter job turnaround times. The aim of the scheduler is to minimize a job's overall turnaround time. This is achieved through the selection of resources to execute the job from the pool of available resources and the selection of which algorithm to employ on each resource. These two main directives are further complicated by the need to minimize load imbalance across the selected resources. Details on the scheduler implementation can be found in [6], while the results in this paper focus on showing the 'value added' for a user employing G-BLAST. G-BLAST enables execution of a job across multiple resources simultaneously, and because this capability is typically not available elsewhere, it is hard to provide a direct comparison of the obtained results. As such, the results provided here focus on the internal functionality of the scheduler, while the overall job runtimes of G-BLAST jobs and standard BLAST jobs can be derived from Figure 4.
Figure 6 shows execution of a G-BLAST job across
multiple resources using 100 queries and nr database.
Following a job submission, G-BLAST selects resources for
execution and the input data is automatically distributed and
submitted to those resources. The figure shows execution of
the same job using 16, 8, 4, and 2 processors among selected
machines. As can be seen, the load imbalance across resources is minimized, but not eliminated. This is generally due to performance variations of individual resources that were not predicted by the scheduler, as well as contention between any one fragment and other jobs currently submitted to the given resource and thus competing with one another.
All of the results presented above indicate the different
intricacies that a typical end-user has to handle while
executing BLAST in a grid environment. The scheduler
encapsulates all these details and makes it easier for the end-
user to take advantage of a grid environment. By analyzing
these experiments, we were able to confirm that the choices the scheduler was making during algorithm selection were indeed accurate. Under the constraints of resource availability, the overall time saved by a user performing searches with G-BLAST can be inferred from the above figures. We observed that with an average resource availability of 8 CPUs on UABGrid, the maximum time saved by a user was around 75%, compared to executing the same job on a scientist's local, single-processor workstation.
Fig. 6. Individual fragments across resources using 16, 8, 4, and 2
CPUs.
4 Related Work
4.1 BLAST on Grid
Several Grid-based BLAST systems have been proposed
to provide flexible BLAST systems that could harness
distributed computational resources. GridBLAST [24] is a set
of Perl scripts that distribute work over computing nodes on
a grid using a simple client/server model. Grid-BLAST [32]
employs a Grid Portal User Interface to collect query
requests and dispatch those requests to a set of NCSA
clusters. Each cluster in the system is added and tuned to
accept jobs in an ad hoc way. The major disadvantage of a non-service-based system is that computing resources cannot be integrated into the system automatically; human intervention is required to adopt a new version of BLAST or new computational resources. The GT3-based BLAST system [10], in contrast, is based on the Web services programming model. A meta-scheduler is also used to farm out query requests onto remote clusters. Nevertheless, job submission is still performed through traditional batch submission tools and does not exploit the benefits of SOA and Grid Services.
4.2 Scheduling
Due to the heterogeneity of resources as well as different
application choices in the Grid, resource selection is a hard
task to perform correctly. Unlike the local schedulers [29, 33]
which have much of the necessary information readily
available to them, grid meta-schedulers are dependent on the
underlying infrastructure. The general meta-schedulers such
as Nimrod-G [15], AppLeS [21], the Resource Broker from
CrossGrid [17], Condor [25], and MARS [14] help the general user by alleviating some of the intricacies of resource selection, automating it across the Grid through application and resource parameter pooling.
Due to the mentioned heterogeneity of applications and
resources, a general meta-scheduler simply does not have
enough information and support from the middleware to
perform optimal resource selection. To accommodate this need, an application-specific meta-scheduler based on application runtime characteristics is used in the G-BLAST framework to adaptively schedule applications on the grid. Runtime information is also used by other software packages (ATLAS [31] and STAPL [30]) to determine the best application-specific parameters on a given architecture.
5 Summary and Conclusions
The overall architecture of G-BLAST, a Grid Service for the Basic Local Alignment Search Tool (BLAST), was presented in this paper. G-BLAST not only enables the execution of the BLAST application in a grid environment but also abstracts the various details of selecting a specific application and computational resource, providing simple interfaces for the end-user to use the service. Using the factory design pattern, multiple implementations of BLAST were incorporated into G-BLAST without requiring any change to the core interface. The two-level adaptive scheduler and the user interfaces used by G-BLAST enable application selection, resource selection, scheduling, and monitoring without requiring extensive user intervention. G-BLAST was successfully deployed on UABGrid, and
different BLAST applications were tested for various
combinations of input parameters and computational
resources. The performance results obtained by executing
various BLAST applications (multithreaded, query split,
database split) on different architectures with different
databases and query lengths illustrated the role of the
adaptive scheduler in improving the overall performance of
BLAST applications in a Grid environment. In this paper, we
have used BLAST as an example for performing local
alignment search. We also plan to extend this architecture to
other bioinformatics applications.
REFERENCES
[1] (2000, July 11). "Expressed Sequence Tags database,"
Retrieved June 6, 2005, from
http://www.ncbi.nlm.nih.gov/dbEST/
[2] (2004, December 22). "GenBank Overview," Retrieved
4/21, 2005, from http://www.ncbi.nlm.nih.gov/Genbank/
[3] Afgan, E. and P. Bangalore, "Application Specification
Language (ASL) – A Language for Describing
Applications in Grid Computing," In the Proceedings of
The 4th International Conference on Grid Services
Engineering and Management - GSEM 2007, Leipzig,
Germany, 2007, pp. 24-38.
[4] Afgan, E. and P. Bangalore, "Performance
Characterization of BLAST for the Grid," In the
Proceedings of IEEE 7th International Symposium on
Bioinformatics & Bioengineering (IEEE BIBE 2007),
Boston, MA, 2007, pp. 1394-1398.
[5] Afgan, E., P. V. Bangalore, and S. V. Peechakara,
"Effective Utilization of the Grid with the Grid
Application Deployment Environment (GADE),"
University of Alabama at Birmingham, Birmingham, AL, UABCIS-TR-2005-0601-1, June 2005.
[6] Afgan, E., V. Velusamy, and P. Bangalore, "Grid
Resource Broker with Application Profiling and
Benchmarking," In the Proceedings of European Grid
Conference 2005 (EGC '05), Amsterdam, The
Netherlands, 2005, pp. 691-701.
[7] Afgan, E., V. Velusamy, and P. V. Bangalore, "Grid
Resource Broker using Application Benchmarking," In
the Proceedings of European Grid Conference,
Amsterdam, Netherlands, 2005, pp. 10.
[8] Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D.
J. Lipman, "Basic local alignment search tool," J Mol
Biol, vol. 215, pp. 403-10, 1990.
[9] Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang,
Z. Zhang, W. Miller, and D. J. Lipman, "Gapped BLAST
and PSI-BLAST: a new generation of protein database
search programs," Nucleic Acids Res., vol. 25, pp. 3389-
3402, 1997.
[10] Bayer, M., A. Campbell, and D. Virdee, "A GT3 based
BLAST grid service for biomedical research," In the
Proceedings of UK e-Science All Hands Meeting,
Nottingham, UK, 2004
[11] Bergeron, B., "Bioinformatics Computing," 1st ed.
Upper Saddle River, New Jersey: Pearson Education,
2002.
[12] Berman, F., A. Hey, and G. Fox, "Grid Computing:
Making The Global Infrastructure a Reality." New York:
John Wiley & Sons, 2003, pp. 1080.
[13] Bjomson, R. D., A. H. Sherman, S. B. Weston, N.
Willard, and J. Wing, "TurboBLAST: A Parallel
Implementation of BLAST Built on the TurboHub," In the
Proceedings of Proceedings of International Parallel and
Distributed Processing Symposium: IPDPS 2002
Workshops, Ft. Lauderdale, FL, 2002
[14] Bose, A., B. Wickman, and C. Wood, "MARS: A
Metascheduler for Distributed Resources in Campus
Grids," In the Proceedings of Fifth IEEE/ACM
International Workshop on Grid Computing, Pittsburgh,
PA, 2004, pp. 10.
[15] Buyya, R., D. Abramson, and J. Giddy, "Nimrod-G: An
Architecture for a Resource Management and Scheduling
in a Global Computational Grid," In the Proceedings of
4th International Conference and Exhibition on High
Performance Computing in Asia-Pacific Region (HPC
ASIA 2000), Beijing, China, 2000
[16] Camp, N., H. Cofer, and R. Gomperts, "Hight-
Throughput BLAST," SGI September 1998.
[17] CrossGrid. (2004). "CrossGrid Production resource
broker," Retrieved 4/15, 2004, from
http://www.lip.pt/computing/projects/crossgrid/crossgrid-
services/resource-broker.htm
[18] Czajkowski, K., S. Fitzgerald, I. Foster, and C.
Kesselman, "Grid Information Services for Distributed
Resource Sharing," In the Proceedings of 10 th IEEE
Symp. On High Performance Distributed Computing,
2001
[19] Darling, A. E., L. Carey, and W.-c. Feng, "The Design,
Implementation, and Evaluation of mpiBLAST," In the
Proceedings of ClusterWorld Conference & Expo in
conjunction with the 4th International Conference on
Linux Clusters: The HPC Revolution 2003, San Jose, CA,
2003
[20] Foster, I., C. Kesselman, J. Nick, and S. Tuecke, "The
Physiology of the Grid: An Open Grid Services
Architecture for Distributed Systems Integration," Global
Grid Forum June 22 2002.
[21] Fran, B., W. Rich, F. Silvia, S. Jennifer, and S.Gary,
"Application-Level Scheduling on Distributed
Heterogeneous Networks," In the Proceedings of
Supercomputing '96, Pittsburgh, PA, 1996, pp. 28.
[22] Gamma, E., R. Helm, R. Johnson, and J. Vlissides,
"Design Patterns," 1st ed: Addison-Wesley Professional,
1995.
[23] Huedo, E., R. S. Montero, and I. M. Llorente, "A
Framework for Adaptive Execution on Grids," Journal of
Software - Practice and Experience, vol. 34, pp. 631-651,
2004.
[24] Krishnan, A., "GridBLAST: A High Throughput Task
Farming GRID Application for BLAST," In the
Proceedings of BII, Singapore, 2002
[25] Litzkow, M., M. Livny, and M. Mutka, "Condor - A
Hunter of Idle Workstations," In the Proceedings of 8th
International Conference of Distributed Computing
Systems, June 1988, pp. 104-111.
[26] NCBI. (2004, November 15). "NCBI BLAST,"
Retrieved 4/21, 2005, from
http://www.ncbi.nlm.nih.gov/BLAST/
[27] Puljala, R., R. Sadasivam, J.-P. Robinson, and J.
Gemmill, "Middleware: Single Sign On Authentication
and Authorization for Groups," In the Proceedings of the
ACM Southeastern Conference, Savannah, GA, 2003
[28] Robinson, J.-P., J. Gemmill, P. Joshi, P. Bangalore, Y.
Chen, S. Peechakara, S. Zhou, and P. Achutharao, "Web-
Enabled Grid Authentication in a Non-Kerberos
Environment," In the Proceedings of Grid 2005 - 6th
IEEE/ACM International Workshop on Grid Computing,
Seattle, WA, 2005
[29] Systems, V., "OpenPBS v2.3: The Portable Batch
System Software," 2004.
[30] Thomas, N., G. Tanase, O. Tkachyshyn, J. Perdue, N.
M. Amato, and L. Rauchwerger, "A Framework for
Adaptive Algorithm Selection in STAPL," In the
Proceedings of ACM SIGPLAN Symp. Prin. Prac. Par.
Prog. (PPOPP), Chicago, IL, 2005
[31] Whaley, R. C., A. Petitet, and J. Dongarra, "Automated
empirical optimizations of software and the ATLAS
project," Parallel Computing, vol. 27, pp. 3-35, 2001.
[32] Yong, L., "Grid-BLAST: Building A Cyberinfrastructure
for Large-scale Comparative Genomics Research," In the
Proceedings of 2003 Virtual Conference on Genomics
and Bioinformatics, 2003
[33] Zhou, S., "LSF: Load Sharing in Large-scale
Heterogeneous Distributed Systems," In the Proceedings
of Workshop on Cluster Computing, 1992
270 Int'l Conf. Grid Computing and Applications | GCA'08 |
Top Related