
Towards Optimal Fault Tolerant Scheduling in Computational Grid

Muhammad Imran 1, Iftikhar Azim Niaz 1, Sajjad Haider 2, Naveed Hussain 2, M. A. Ansari 3

1 Faculty of Computing, Riphah International University, Islamabad
2 Information Technology Department, National University of Modern Languages, Islamabad
3 Computer Science Department, Federal Urdu University of Arts, Science & Technology, Islamabad
{mimran, ianiaz}@riphah.edu.pk, {sajjad, naveedhussain}@numl.edu.pk, drmaansari@fuuast.edu.pk

Abstract - The grid environment faces significant challenges due to the diverse failures encountered during job execution. Computational grids provide the main execution platform for long running jobs. Such jobs require a long commitment of grid resources. Therefore fault tolerance in such an environment cannot be ignored. Most grid middleware have either ignored failure issues or have developed ad hoc solutions. Most of the existing fault tolerance techniques are application dependent and cause the cognitive problem.

This paper examines existing fault detection and tolerance techniques in various middleware. We propose a fault tolerant layered grid architecture with a cross-layered design. In our approach the Hybrid Particle Swarm Optimization (HPSO) algorithm and the Anycast technique are used in conjunction with the Globus middleware. We adopt a proactive and reactive fault management strategy for centralized and distributed environments.

The proposed strategy is helpful in identifying the root cause of failures and resolving the cognitive problem. Our strategy minimizes computation and communication, thus achieving higher reliability. Anycast limits the effect of Denial of Service/Distributed Denial of Service ((D)DoS) attacks nearest to the source of the attack, thus achieving better security. Significant performance improvement is achieved by using Anycast before HPSO. The selection of more reliable nodes results in less checkpointing overhead.

I. INTRODUCTION

Grid computing is a type of parallel and distributed system that consists of resources with heterogeneous architectures, geographically distributed and interconnected via unreliable network media. It enables the sharing of geographically distributed autonomous resources owned and managed dynamically by multiple organizations. The computational grid, popular for constructing large scale meta-computing systems, provides dependable, consistent, pervasive and inexpensive access to high performance computational capabilities. These grids are well suited to long running applications that require a long commitment of grid resources. For example, a job that takes weeks of execution on a single system can be executed on a computational grid in minutes, depending on how many computational resources are available on the grid.

Grid middleware is a software suite deployed on each participating machine so that it can take part in a grid environment. Each middleware provides some essential functionality in order to execute user jobs successfully.

Due to the dynamic, heterogeneous and geographically distributed nature of the grid, a user job is always prone to different kinds of errors, failures and faults [1]. The grid environment faces significant challenges due to the diverse failures encountered during job execution. A survey of grid users on fault tolerance has revealed how difficult it is to run applications on grid environments susceptible to a wide range of failures [4]. Each middleware should have a fault tolerance mechanism in order to execute user jobs reliably.

Fault tolerance is to preserve the delivery of expected services despite the presence of fault-caused errors within the system itself. Errors are detected and corrected, and permanent faults are located and removed while the system continues to deliver acceptable service [2]. As resource failure is the rule rather than the exception in the grid [3], fault tolerance mechanisms, both proactive and reactive, must be an essential part of each grid middleware. Most of the fault tolerance techniques developed for grid computing are reactive in nature. For example, replication and checkpointing are used in the grid, but they are only able to deal with crash failures.

Grid middleware have either ignored failure issues or have implemented solutions on an ad hoc basis. Most of the fault detection and recovery techniques built so far are application dependent, resulting in problems like overwhelming a single layer, i.e. the cognitive problem [4], and increased inter-layer communication. In the traditional layered grid architecture [8], cross-layer communication is not possible, i.e. an intermediate layer cannot communicate with both its upper and lower layers. The concept proposed in [1] multicasts the address of the execute machine in order to select a backup, which generates a huge amount of network traffic. This increases network communication, which leads to increased unreliability, reduced safety and network delays.

A grid resource broker performs resource discovery, scheduling, and processing of application jobs on distributed grid resources. It uses a grid information service that maintains the status of grid resources. An artificial intelligence based metaheuristic algorithm, Hybrid Particle Swarm Optimization (HPSO), for task allocation has already been proposed in [9]. Evolutionary algorithms like HPSO perform two steps, i.e. exploration and exploitation, and can take considerable time on large problem instances.


We discuss the proposed scheme with respect to the Globus toolkit [19]. A noticeable flaw in Globus is its lack of support for fault tolerance. It does not provide checkpointing for saving long running computation. Although it uses resource brokers like Condor [20], Condor's checkpointing feature is not supported in Globus [1].

We have proposed that Anycast should be used instead of multicast, since with Anycast data is routed to the nearest or best destination as seen by the network topology or the routing protocol's measure of distance. Using HPSO after Anycast not only minimizes the reliability calculation time but, even in the worst case, keeps it within reasonable bounds. The proposed strategy manages faults in both a proactive and a reactive manner. It can work in both centralized and distributed environments. We believe that the proposed solution results in increased reliability and safety.

The rest of the paper is organized as follows. In Section II, existing fault detection and tolerance techniques are analyzed. In Section III, the proposed system is discussed, consisting of a fault tolerant layered grid architecture and a cross-layered grid design using Anycast and HPSO. In Section IV, the proposed architecture for fault tolerance is discussed with respect to centralized and distributed environments. Section V presents the evaluation of the proposed solution. In Sections VI and VII we discuss future work and conclude.

II. EXISTING FAULT TOLERANCE TECHNIQUES

Most of the existing fault management mechanisms are reactive, incomplete and application dependent. For example, if a job execution machine fails during execution, the job is submitted to another machine and restarted from the beginning. This technique is known as Retry. We cannot afford such techniques for compute intensive jobs that require huge computational resources. By overburdening a single layer, communication among layers is increased. Another problem is the cognitive problem: it becomes very difficult to detect, identify, isolate and recover from failures. An extension to the classification [5, 6, 7] of errors, failures and faults, with their expected occurrence at the appropriate grid layer, has already been presented in [11].
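To make the cost of Retry concrete, here is a minimal Python sketch; submit() is a hypothetical stand-in for a middleware submit call, the machine list and failure rates are illustrative, and every retry visibly discards all partial work:

    import random

    def submit(job, machine):
        # Hypothetical stand-in for a middleware submit call: the job always
        # runs from the very start, so a crash loses all partial computation.
        if random.random() < machine["failure_rate"]:
            raise RuntimeError(machine["name"] + " crashed")
        return job + " completed on " + machine["name"]

    def run_with_retry(job, machines, max_retries=3):
        for _ in range(max_retries + 1):
            try:
                return submit(job, random.choice(machines))
            except RuntimeError:
                continue            # retry elsewhere, discarding all progress
        raise RuntimeError("job failed on every retry")

    machines = [{"name": "n1", "failure_rate": 0.3},
                {"name": "n2", "failure_rate": 0.1}]
    print(run_with_retry("long-running-job", machines))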

A proactive approach to job scheduling in the computational grid has been proposed in [10]. It schedules jobs based on a history of grid resources maintained in the grid information service.

Another agent oriented proactive fault tolerant framework has been proposed in [6], where agents deal with individual faults proactively. Agents maintain information about hardware conditions, executing process memory consumption, available resources, network conditions and component mean time to failure.

To our knowledge, none of the existing techniques or frameworks deals with failures in both a proactive and a reactive way. None of the available solutions applies fault tolerance at the corresponding grid layer; they just follow the traditional layered approach in which a layer can only communicate with its upper or lower layer.

Some of the existing fault detection and tolerance techniques used in various middleware are summarized in Table I. It clearly shows that each middleware uses its own technique for fault detection and tolerance.

TABLE I
FAULT DETECTION AND TOLERANCE TECHNIQUES USED IN PARALLEL AND GRID COMPUTING

System     Fault Detection            Fault Tolerance
Globus     Heartbeat monitor          Retry
Legion     Pinging and timeout        Checkpoint-recovery
Condor-G   Polling machine            Condor checkpoint-recovery
NetSolve   Generic heartbeat monitor  Retry on another available machine
Mentat     Polling                    Replication
CoG Kits   N/A                        N/A

Software fault tolerance is provided by software diversity. Diversity can be introduced into software systems by constructing diverse replicas that solve the same problem in different ways (different algorithms, different programming languages, etc.) [4].

A traditional layered grid design is depicted in Fig. 1. It shows that a layer cannot communicate with its upper and lower layers simultaneously.

Fig. 1. Traditional layered grid design

This makes it difficult to implement a fault tolerant layered grid design, which requires fault tolerance techniques at each corresponding layer.

III. PROPOSED SOLUTION

We believe that fault management should be deployed on each participating machine in the computational grid. In this way a single point of failure can be avoided. Our proposed concept consists of the following components.

A. Fault Tolerant Layered Grid Architecture

We believe that all fault detection and tolerance mechanisms should be implemented at each corresponding grid layer. One way is to implement them through agents/components for each grid resource. Agents keep monitoring and updating the reliability of a particular resource at the corresponding grid layer. A combined reliability factor for a particular machine can then be calculated from the reliability factors of its layers. Furthermore, the checkpointing intensity overhead can be minimized based on system reliability. This scheme can only work as proactive fault management.
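As a minimal sketch of this idea, assume hypothetical agents that each report a reliability factor in [0, 1] for their layer; combining by product treats the machine as a series system that works only if every layer works (one of several plausible aggregations):

    # Hypothetical per-layer reliability reports from monitoring agents.
    layer_reliability = {
        "grid_fabric": 0.98,
        "core_middleware": 0.95,
        "user_level_middleware": 0.97,
        "application": 0.90,
    }

    def combined_reliability(layers: dict) -> float:
        # Series combination: the machine delivers service only if every
        # layer does, so the factors multiply.
        r = 1.0
        for factor in layers.values():
            r *= factor
        return r

    print(f"combined machine reliability: {combined_reliability(layer_reliability):.3f}")
    # a scheduler could then lower the checkpointing intensity for machines
    # with a high combined factor (see Table II later in the paper)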

B. Cross Layer Design

The idea of cross layer design in the grid with respect to Quality of Service (QoS) has already been proposed in [12]. Fig. 2 shows a cross layer design in the computational grid where each layer can communicate with the layers below and above it, unlike the traditional layered approach. Cross layer design provides flexibility in implementing different mechanisms in the grid; a toy illustration follows Fig. 2.

Fig. 2. Cross Layered Design in Computational Grid (layers: Applications; Third Party User Level Middleware; Core Middleware; Grid Fabric)
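One toy way to picture cross-layer signalling is a publish/subscribe bus that any layer can use, so a fabric-level fault can reach the application layer directly. This is purely illustrative and not the mechanism of [12]; all names are hypothetical:

    from collections import defaultdict

    class CrossLayerBus:
        # Any layer may publish an event that non-adjacent layers subscribe
        # to, unlike the strict neighbour-to-neighbour flow of Fig. 1.
        def __init__(self):
            self.subscribers = defaultdict(list)

        def subscribe(self, topic, handler):
            self.subscribers[topic].append(handler)

        def publish(self, topic, payload):
            for handler in self.subscribers[topic]:
                handler(payload)

    bus = CrossLayerBus()
    # the application layer reacts directly to a fabric-layer fault event
    bus.subscribe("fabric.fault", lambda e: print("application sees:", e))
    bus.publish("fabric.fault", {"node": "10.0.0.3", "kind": "link-failure"})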

C. Anycast

Anycast is a network addressing and routing scheme whereby data is routed to the "nearest" or "best" destination [13] based on the following:

- as viewed by the routing topology
- according to the routing protocol's measure of distance

An extension of Anycast has been proposed in [18], which delivers a message to k nodes from a group. Applications of Anycast were largely unexplored in the past, but recent work [14] shows it is emerging as a new communication paradigm. Anycast has two important characteristics from the fault tolerance and safety point of view:

1. Reliability. Anycast applications typically feature external "heartbeat" monitoring that helps to avoid black holes. It provides automatic failover.

2. The property of sinking (D)DoS attacks. It can limit the scope of Denial of Service/Distributed Denial of Service attacks nearest to the point of origination.

Deploying Anycast does not require any special software or firmware; it just leverages existing infrastructure.
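The selection idea can be sketched in a few lines. The candidate addresses and hop counts below are hypothetical, and in practice the routing infrastructure, not application code, performs this choice:

    # Minimal sketch of the anycast idea: among servers sharing one service
    # address, route to the topologically nearest one.
    candidates = {
        "10.0.0.2": 2,    # routing-protocol distance (e.g. hop count)
        "10.0.0.3": 5,
        "10.0.0.5": 1,
    }

    def anycast_select(candidates: dict) -> str:
        # Return the "nearest" destination by the routing metric.
        return min(candidates, key=candidates.get)

    print(anycast_select(candidates))   # -> 10.0.0.5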

D. Hybrid Particle Swarm Optimization (HPSO)

A metaheuristic evolutionary algorithm for maximizing the reliability of distributed systems has been proposed in [9]. It finds near-optimal resources for task execution based on their reliability, which is calculated from certain factors. Like all modern metaheuristic algorithms, it performs two steps, i.e. exploration and exploitation. Looking for new candidate solutions in the solution space is called exploration, while refining good specific solutions toward local optima is called exploitation. It continuously explores the solution space and in the worst case might run for a long time, which is not affordable.
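For intuition, the sketch below runs a plain discrete particle swarm over task-to-node assignments. It is not the exact hybrid algorithm of [9]; the node reliabilities and the product fitness model are illustrative assumptions:

    import random

    NODE_RELIABILITY = [0.90, 0.97, 0.85, 0.99, 0.93]   # hypothetical factors
    NUM_TASKS, SWARM, ITERATIONS = 4, 20, 100

    def fitness(assignment):
        # System reliability modelled as the product of the reliabilities
        # of the nodes each task is assigned to (a simplification).
        r = 1.0
        for node in assignment:
            r *= NODE_RELIABILITY[node]
        return r

    def hpso():
        n = len(NODE_RELIABILITY)
        particles = [[random.randrange(n) for _ in range(NUM_TASKS)]
                     for _ in range(SWARM)]
        best = max(particles, key=fitness)[:]
        for _ in range(ITERATIONS):
            for p in particles:
                for t in range(NUM_TASKS):
                    if random.random() < 0.5:     # exploitation: move toward best
                        p[t] = best[t]
                    elif random.random() < 0.2:   # exploration: try a random node
                        p[t] = random.randrange(n)
                if fitness(p) > fitness(best):
                    best = p[:]
        return best, fitness(best)

    print(hpso())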

IV. ARCHITECTURE OVERVIEW

This section presents the architectural view of the proposed system with respect to the cross-layered grid design. Fig. 3 shows our proposed architecture with respect to Globus.

Fig. 3. Anycast & HPSO in Cross Layered Grid design (Applications; Third Party User Level Middleware; Globus core middleware with Grid Information Service (MDS), HPSO and other grid services; Grid Security Layer; Grid Fabric with Anycast)

We propose that HPSO can be incorporated at any of the three higher layers, i.e. core grid middleware, user level middleware or application. It would be most useful to have HPSO at the core grid middleware in order to minimize cross layer communication. As the grid information service, i.e. the Monitoring and Discovery Service (MDS), has all the information about the resources in the grid, HPSO can use that information at the same layer to find better machines in terms of reliability. HPSO can be used with resource brokers like Condor at the user level middleware, but that requires cross layer communication with MDS. Similarly, it can be implemented at the application layer, which would also require cross layer communication with MDS and the user level middleware.

We also propose that Anycast can be incorporated at the grid fabric layer in the cross-layered grid design. Using Anycast before HPSO not only reduces network communication but also reduces the reliability calculation time of HPSO. Anycast effectively limits the solution space for HPSO, which in the worst case takes a long time. Although this may yield a local optimum, the proposed solution targets problems of huge size. Fig. 4 shows the effect of Anycast before running HPSO; a small sketch of the combined selection follows the figure.

Fig. 4. Effect of Anycast before running HPSO
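The combined selection can be sketched as two stages, reusing the idea from the snippets above; the node data and the stand-in for the HPSO search are hypothetical:

    # Sketch of the proposed two-stage selection: Anycast first prunes the
    # candidate set to the nearest nodes, then HPSO searches only that
    # reduced space, cutting reliability-calculation time at the cost of a
    # possibly local (rather than global) optimum.
    def select_execute_machine(all_nodes, k, hpso_search):
        nearest = sorted(all_nodes, key=lambda n: n["distance"])[:k]
        return hpso_search(nearest)     # HPSO explores/exploits |k| << |N| nodes

    nodes = [{"addr": "10.0.0.%d" % i, "distance": d, "reliability": r}
             for i, (d, r) in enumerate([(3, 0.9), (1, 0.95), (7, 0.99), (2, 0.85)], 1)]
    best = select_execute_machine(
        nodes, k=2,
        hpso_search=lambda ns: max(ns, key=lambda n: n["reliability"]))
    print(best["addr"])   # -> 10.0.0.2 (nearest two are .2 and .4; .2 more reliable)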

The proposed concept can work in both centralized and distributed environments. Both scenarios are discussed in the next sections.

A. Centralized environment

The centralized environment has been discussed in detail in [15] along with its proposed architecture, but [15] does not describe how to select a reliable primary and backup server for sending checkpoints. Fig. 5 depicts the steps involved in selecting a near-optimal server for sending checkpoints. The execute machine would Anycast to select a primary server as backup. The candidate servers reply with their machine information, and HPSO finds the optimal server for sending checkpoints. The primary server performs similar steps in order to select a secondary backup server.

Fig. 5. Selecting optimal primary server for sending checkpoints

Fig. 6 depicts the execute machine sending checkpoints to the primary server, which forwards them to the secondary backup server. If the primary server fails, the backup becomes primary and selects its own backup following similar steps; a toy model of this chain is sketched below.

Fig. 6. Sending checkpoints to the primary and secondary backup servers
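The following toy model (hypothetical classes, not the paper's implementation) captures the forwarding chain and the failover step; select_backup() stands in for the Anycast + HPSO selection described above:

    class Server:
        # Trivial stand-in for a checkpoint server.
        def __init__(self, addr):
            self.addr, self.saved = addr, None
        def store(self, state):
            self.saved = state

    class CheckpointChain:
        # Checkpoints flow execute -> primary -> secondary; on primary
        # failure the secondary is promoted and a fresh backup is chosen.
        def __init__(self, primary, secondary, select_backup):
            self.primary, self.secondary = primary, secondary
            self.select_backup = select_backup

        def checkpoint(self, state):
            try:
                self.primary.store(state)        # execute machine -> primary
                self.secondary.store(state)      # primary -> secondary backup
            except ConnectionError:
                self.primary = self.secondary    # promote the secondary
                self.secondary = self.select_backup()
                self.primary.store(state)

    chain = CheckpointChain(Server("10.0.0.5"), Server("10.0.0.6"),
                            select_backup=lambda: Server("10.0.0.7"))
    chain.checkpoint({"progress": 0.5})
    print(chain.primary.saved, chain.secondary.saved)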

B. Distributed environment

Distributed fault management for the computational grid has been proposed in [1], but the question of how to select an optimal or near-optimal node as the execute or backup machine remains unanswered. The concept proposed there was that the submit machine would multicast the address of the execute machine in order to request a backup, and all candidate backups would reply to the execute machine consenting to serve as backup. Two problems are identified: first, increased communication; second, a (D)DoS attack can affect network performance severely.


Our proposed approach is depicted in Fig. 7. In order to select machines for job execution, the submit machine would Anycast. The nearest candidate machines would reply with their machine information, and the submit machine would run HPSO based on the information received. In this way near-optimal machines in terms of reliability are selected for task execution. Another benefit is that constraints involving limited resources are also satisfied.

Fig. 7. Selecting execute machine for task execution

The execute machine would perform similar steps in order to select a backup machine. Fig. 8 presents this scenario.

Fig. 8. Selecting backup machine for sending checkpoints

In case of execute machine failure, the backup machine takes over execution from the most recent checkpoint and selects its own backup following similar steps.

V. EVALUATION

A. Reliability Calculation Time

Using the Stirling interpolation formula [21], we can estimate the time required to calculate system reliability through HPSO from [9]. Let N be the population size of the entire solution space and k the population size of the localized solution space, and let f(k) denote the reliability calculation time for k nodes, so that f(1) is the time required for one node and f(N) the time required for all N nodes. Stirling's formula gives

f(k) = y_k = y_0 + u (Δy_0 + Δy_{-1})/2 + (u^2/2!) Δ^2 y_{-1} + (u(u^2 - 1)/3!) (Δ^3 y_{-1} + Δ^3 y_{-2})/2 + ...

where u = (x - x_0)/h, x is the point of interpolation for a given node count and h is the node difference (step size).

The more nodes there are, the higher the reliability calculation time of HPSO. The results from this mathematical model and from [9] show the relationship presented in Fig. 9.

Fig. 9. No. of nodes vs. reliability calculation time (sec)
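A small numeric sketch of that estimate follows; the measured times are hypothetical, and only differences up to third order are used:

    def stirling_interpolate(xs, ys, x):
        # Interpolate y(x) with Stirling's formula around the middle point.
        # xs must be equally spaced; uses terms up to the third difference.
        h = xs[1] - xs[0]
        m = len(xs) // 2                  # index of the central point x0
        u = (x - xs[m]) / h
        diff = [ys[:]]                    # forward difference table
        for order in range(1, 4):
            prev = diff[-1]
            diff.append([prev[i + 1] - prev[i] for i in range(len(prev) - 1)])
        d1 = (diff[1][m] + diff[1][m - 1]) / 2      # mean first difference
        d2 = diff[2][m - 1]                          # central second difference
        d3 = (diff[3][m - 1] + diff[3][m - 2]) / 2   # mean third difference
        return ys[m] + u * d1 + u**2 / 2 * d2 + u * (u**2 - 1) / 6 * d3

    # hypothetical measurements: node counts vs. reliability calc time (sec)
    nodes = [2, 4, 6, 8, 10]
    times = [0.4, 1.1, 2.3, 4.0, 6.2]
    print(stirling_interpolate(nodes, times, 5))    # estimated f(5)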

B. Performance Comparison of Multicast and Anycast

Let N be the total number of nodes, g ⊆ N the group of nodes used for multicast, and k ⊆ g the group of nodes in g that are nearest (Anycast). The comparative analysis is based on distance and delay.

∀n ∈ N, P(n): ∃g ⊆ N, where g is a group in N, i.e. g_i for i = 1, ..., n, and the average distance of the group is

Avg(g) = (Σ_{i=1}^{n} g_i) / n,  n_g > 0.

For Anycast, ∃k ⊆ g ⊆ N, so the delay factor satisfies |k| < |g| < |N|. Each candidate in k corresponds to a distance (k_1 = d_1, k_2 = d_2, ...), and the selected node is k = min{d_1, d_2, ..., d_n}, so Avg(g) > k. Since k < Avg(g) < N, the delay factor of Anycast is less than that of multicast.

As the process of checkpointing is itself an overhead, in order to make the system efficient we need to minimize the overhead of checkpointing. The more reliable the system is, the less checkpointing overhead it requires. Table II shows that a system having 95% reliability requires checkpointing after every 50% of task execution, whereas a system having 55% reliability requires a checkpoint after every 10% of task execution.

TABLE II
RELIABILITY VS CHECKPOINTING INTENSITY REQUIRED

Reliability   Checkpointing Intensity Required
55%           10%
65%           20%
75%           30%
85%           40%
95%           50%

The lower the checkpointing intensity, the more time is required for checkpointing. Table III shows the relationship between checkpoint intensity and the time required to save checkpoints: a system with 10% checkpoint intensity takes 50 seconds for checkpointing, whereas a system with 50% checkpoint intensity takes only 10 seconds.

TABLE III
CHECKPOINTING INTENSITY VS TIME REQUIRED FOR CHECKPOINTING

Checkpointing Intensity   Time required for checkpointing (seconds)
10%                       50
20%                       25
30%                       16.65
40%                       12.5
50%                       10
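Tables II and III together suggest a simple monotone mapping. The linear fit and the fixed per-checkpoint cost in the sketch below are assumptions chosen only to reproduce the table values:

    def checkpoint_intensity(reliability: float) -> float:
        # Linear fit of Table II: 55% reliability -> 10% intensity,
        # 95% reliability -> 50% intensity, clamped to that range.
        return max(0.1, min(0.5, reliability - 0.45))

    def total_checkpoint_time(reliability: float, per_checkpoint_sec: float = 5.0) -> float:
        # Number of checkpoints = 1 / intensity, each at an assumed fixed
        # cost, reproducing Table III (e.g. 10% -> 50 s, 50% -> 10 s).
        return per_checkpoint_sec / checkpoint_intensity(reliability)

    for r in (0.55, 0.75, 0.95):
        print(r, checkpoint_intensity(r), total_checkpoint_time(r))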

VI. FUTURE WORK

We have proposed the fault tolerant architecture with respect to Condor and Globus only. A detailed empirical evaluation is required for other grid middleware, although the approach looks theoretically feasible.

We also plan to implement the proposed concept. The implementation consists of three parts: communication, HPSO and checkpointing. Communication uses Anycast; details of deploying IP Anycast have been presented in [16, 17]. We prefer kernel level checkpointing so that taking a process snapshot is application independent. An algorithm for implementing HPSO has been described in detail in [9]; it can be implemented at the core grid middleware (in the open source case), at the user level middleware, or by embedding it in the grid application.

VII. CONCLUSION

In this paper we have proposed that fault management mechanisms should be implemented at the appropriate grid layer. We have proposed a cross-layered grid design for fault management. We also propose that Anycast should be used instead of multicast.

The proposed mechanism has the ability to handle diverse failures both proactively and reactively. It is proactive in the sense that it finds near-optimal machines for job execution in terms of reliability. It is reactive in that if a machine, server or link fails, job execution continues. The proposed mechanism can work in both centralized and distributed environments.

The proposed approach provides increased reliability by reducing communication and computation. Increased safety is achieved through Anycast, which limits the effect of (D)DoS attacks nearest to the source of the attack. Significant performance improvement is achieved by using Anycast before HPSO. The proposed architecture is also helpful in identifying the root cause of problems.

ACKNOWLEDGEMENTS

We are thankful to all of our friends and colleagues, including Mr. Imran Baig, Mr. Saeed Ullah, Mr. Muhammad Affaan and Mr. Muhammad Aqeel, who have been very helpful in giving their comments and valuable suggestions for the improvement of the proposed architecture. We are especially thankful to Mr. Nadeem Talib for helping in developing the mathematical models.

REFERENCES

[1] M. Affan and M. A. Ansari, "Distributed Fault Management for Computational Grids", Proc. of the Fifth International Conference on Grid and Cooperative Computing (GCC 2006), Changsha, Hunan, China, 2006, pp. 363-368.
[2] A. Avizienis, "The N-version Approach to Fault-Tolerant Software", IEEE Transactions on Software Engineering, vol. 11, no. 12, 1985, pp. 1491-1501.
[3] M. Baker, R. Buyya and D. Laforenza, "Grids and Grid Technologies for Wide-Area Distributed Computing", Software - Practice and Experience (SPE), vol. 32, issue 15, 2002, pp. 1437-1466.
[4] R. Medeiros, W. Cirne, F. Brasileiro and J. Sauve, "Faults in Grids: Why are They so Bad and What Can Be Done about It?", Proc. of the 4th International Workshop on Grid Computing, Phoenix, Arizona, USA, 2003, pp. 18-24.
[5] Y. Derbal, "A New Fault-Tolerance Framework for Grid Computing", Journal of Multiagent and Grid Systems, vol. 2, no. 2, 2006, pp. 115-133.
[6] M. T. Huda, H. W. Schmidt and I. D. Peake, "An Agent Oriented Fault-tolerant Framework for Grid Computing", Proc. of the 1st International Conference on e-Science and Grid Computing (e-Science'05), Melbourne, Australia, 2005, pp. 304-311.
[7] K. Vaidyanathan and K. S. Trivedi, "Extended Classification of Software Faults based on Aging", Proc. of the 12th International Symposium on Software Reliability Engineering, Hong Kong, 2001, p. 99.
[8] P. Asadzadeh, R. Buyya, C. L. Kei, D. Nayyar and S. Venugopal, "Global Grids and Software Toolkits: A Study of Four Grid Middleware Technologies", High Performance Computing: Paradigm and Infrastructure, Wiley Press, USA, June 2005.
[9] P. Y. Yin, S. S. Yu, P. P. Wang and Y. T. Wang, "Task allocation for maximizing reliability of a distributed system using hybrid particle swarm optimization", Journal of Systems and Software, vol. 80, issue 5, 2007, pp. 724-735.
[10] B. Nazir and T. Khan, "Fault Tolerant Job Scheduling in Computational Grid", Proc. of the IEEE 2nd International Conference on Emerging Technologies (ICET 2006), Peshawar, Pakistan, 2006, pp. 708-713.
[11] S. Haider, M. Imran, I. A. Niaz, S. Ullah and M. A. Ansari, "Component based Proactive Fault Tolerant Scheduling in Computational Grid", Proc. of the IEEE 3rd International Conference on Emerging Technologies (ICET 2007), Rawalpindi, Pakistan, 2007.
[12] L. Chunlin and L. Layuan, "Joint QoS optimization for layered computational grid", Information Sciences, vol. 177, issue 15, August 2007, pp. 3038-3059.
[13] Wikipedia, "Anycast", http://en.wikipedia.org/wiki/Anycast [accessed 15 June 2007].
[14] M. Szymaniak, G. Pierre, M. S. Nikolova and M. van Steen, "Enabling Service Adaptability with Versatile Anycast", Concurrency and Computation: Practice and Experience, vol. 19, issue 13, 2007, pp. 1837-1863.
[15] N. Hussain, M. A. Ansari, M. M. Yasin, A. Rauf and S. Haider, "Fault Tolerance using Parallel Shadow Image Servers (PSIS) in Grid Based Computing Environment", Proc. of the IEEE 2nd International Conference on Emerging Technologies (ICET 2006), Peshawar, Pakistan, 2006, pp. 703-707.
[16] M. Oe and S. Yamaguchi, "Implementation and Evaluation of IPv6 Anycast", Proc. of INET 2000, Yokohama, Japan, 2000, p. 323, http://www.isoc.org/inet2000/cdproceedings/
[17] K. Miller, "Deploying IP Anycast", http://www.net.cmu.edu/pres/anycast
[18] B. Wu and J. Wu, "K-Anycast Routing Schemes for Mobile Ad Hoc Networks", Proc. of the IEEE 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), Rhodes Island, Greece, 2006.
[19] Globus Alliance, "Globus Toolkit", http://www.globus.org
[20] M. Litzkow, M. Livny and M. W. Mutka, "Condor - A Hunter of Idle Workstations", Proc. of the IEEE 8th International Conference on Distributed Computing Systems, San Jose, CA, USA, 1988, pp. 104-111.
[21] Springer Link, "Stirling Interpolation Formula", Encyclopedia of Mathematics, http://eom.springer.de/s/s087840.htm