Using group replication for resilience on exascale systems

ISSN 0249-6399   ISRN INRIA/RR--7876--FR+ENG
Research Report N° 7876 — February 2012
Project-Team ROMA
Marin Bougeret, Henri Casanova, Yves Robert, Frédéric Vivien, Dounia Zaidouni


RESEARCH CENTRE GRENOBLE – RHÔNE-ALPES
Inovallée
655 avenue de l'Europe, Montbonnot
38334 Saint Ismier Cedex

Using group replication for resilience on exascale systems

Marin Bougeret*, Henri Casanova†, Yves Robert‡§¶, Frédéric Vivien‖, Dounia Zaidouni‖

Project-Team ROMA

Research Report n° 7876 — February 2012 — 41 pages

* LIRMM, Montpellier, France
† University of Hawai‘i at Manoa, Honolulu, USA
‡ LIP, École Normale Supérieure de Lyon, France
§ University of Tennessee Knoxville, USA
¶ Institut Universitaire de France
‖ INRIA, Lyon, France

Abstract: High performance computing applications must be resilient to faults, which are common occurrences especially in post-petascale settings. The traditional fault-tolerance solution is checkpoint-recovery, by which the application saves its state to secondary storage throughout execution and recovers from the latest saved state in case of a failure. An oft-studied research question is that of the optimal checkpointing strategy: when should state be saved? Unfortunately, even using an optimal checkpointing strategy, the checkpointing frequency must increase as platform scale increases, leading to higher checkpointing overhead. This overhead precludes high parallel efficiency for large-scale platforms, thus mandating other, more scalable fault-tolerance mechanisms. One such mechanism is replication, which can be used in addition to checkpoint-recovery. Using replication, multiple processors perform the same computation so that a processor failure does not necessarily imply application failure. While at first glance replication may seem wasteful, it may be significantly more efficient than using solely checkpoint-recovery at large scale. In this work we investigate a simple approach where entire application instances are replicated. We provide a theoretical study of checkpoint-recovery with replication in terms of expected application execution time, under an exponential distribution of failures. We design dynamic-programming-based algorithms to define checkpointing dates that work under any failure distribution. We also conduct simulation experiments assuming that failures follow Exponential or Weibull distributions, the latter being more representative of real-world systems, and using failure logs from production clusters. Our results show that replication is useful in a variety of realistic application and checkpointing cost scenarios for future exascale platforms.

Key-words: Fault-tolerance, replication, checkpointing, parallel job, Weibull, exascale

Résumé: High-performance computing applications must be resilient to failures, since failures will not be rare events on post-petascale platforms. Fault tolerance is traditionally achieved through a checkpoint-and-restart mechanism, by which the application saves its state to secondary storage and, in case of a failure, restarts from the last saved state. An often-studied question is that of the optimal checkpointing strategy: when should state be saved? Unfortunately, even when an optimal checkpointing strategy is used, the checkpointing frequency must increase with the size of the platform, mechanically increasing the cost of checkpointing. This cost precludes very good efficiency on very large-scale platforms, and requires the use of other, more scalable fault-tolerance mechanisms. One potential mechanism is replication, which can be used together with a checkpoint-and-restart solution. With replication, several processors execute the same computation so that the failure of one of them does not necessarily imply a failure of the application. While at first sight such an approach wastes resources, replication can be significantly more efficient than checkpoint-and-restart techniques alone on very large-scale platforms. In the present study we consider a simple approach where an entire application is replicated. We provide a theoretical study of an execution scheme with replication when the failure distribution follows an Exponential law. We propose algorithms to determine checkpointing dates when failures follow an arbitrary distribution. We also conduct an experimental study, by means of simulations, based on failure distributions following Exponential or Weibull laws (the latter being more representative of real systems), or drawn from the logs of production clusters. Our results show that replication is beneficial for a set of realistic application models and checkpointing costs, in the context of future exascale platforms.

Mots-clés: Fault tolerance, replication, checkpointing, parallel job, Weibull, exascale


1 Introduction

As plans are made for deploying post-petascale high performance computing (HPC) systems [10, 22], solutions need to be developed to ensure resilience to failures that occur because not all faults can be automatically detected and corrected in hardware. For instance, the 224,162-core Jaguar platform is reported to experience on the order of 1 failure per day [20, 2], and its scale is modest compared to platforms in the plans for the next decade. For applications that enroll large numbers of, or perhaps all, processors, a failure is the common case rather than the exception. One can recover from a failure by resuming execution from a previously saved fault-free execution state, or checkpoint. Checkpoints are saved to resilient storage throughout execution (usually periodically). More frequent checkpoints lead to less loss when a failure occurs, but to higher overhead during fault-free execution. A checkpointing strategy specifies when checkpoints should be taken.

A large literature is devoted to developing efficient checkpointing strategies, i.e., ones that minimize expected job execution time, including both theoretical and practical efforts. The former typically rely on assumptions regarding the probability distributions of times to failure of the processors (e.g., Exponential, Weibull), while the latter rely on simulations driven by failure datasets obtained on real-world platforms. In a previous paper [4], we have made several contributions in this context, including optimal solutions for Exponential failures and dynamic programming solutions in the general case.

A major issue with checkpoint-recovery is scalability: the necessary checkpoint frequency for tolerating failures in large-scale platforms is so large that processors spend more time saving state than computing. It is thus expected that future platforms will lead to unacceptably low parallel efficiency if only checkpoint-recovery is used, no matter how good the checkpointing strategy. Consequently, additional mechanisms must be used. In this work we focus on replication: several processors perform the same computation synchronously, so that a fault on one of these processors does not lead to an application failure. Replication is an age-old fault-tolerance technique, but it has gained traction in the HPC context only relatively recently. While replication wastes compute resources in fault-free executions, it can alleviate the poor scalability of checkpoint-recovery.

Consider a parallel application that is moldable, meaning that it can be executed on an arbitrary number of processors, with each processor running one application process. In our group replication approach, multiple application instances are executed. One could, for instance, execute 2 distinct n-process application instances on a 2n-processor platform. Each instance runs at a smaller scale, meaning that it has better parallel efficiency than a single 2n-process instance due to a lower checkpointing frequency. Furthermore, once an instance saves a checkpoint, the other instance can use this checkpoint immediately. Given the above, our contributions in this work are:

• A theoretical analysis of the optimal number of processors to use for a checkpoint-recovery execution of a parallel application, for various parallel workload models, under Exponential failure distributions;

• An effective approach for group replication, with a theoretical analysis bounding the expected execution time for Exponential failure distributions, and several dynamic programming solutions working for general failure distributions;

• Extensive simulations showing that group replication can indeed lower application running times, and that some of our proposed strategies deliver good performance.

The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 defines the theoretical framework and states key assumptions. Section 4 discusses the optimal number of processors for a checkpoint-recovery execution of a parallel application under an Exponential distribution of failures. Section 5 provides several approaches for group replication, and the theoretical analysis of one of them. Section 6 describes the experimental methodology and Section 7 presents simulation results. Finally, Section 8 concludes with a summary of our findings and with future perspectives.

2 Related work

Checkpointing policies have been widely studied in the literature. In [9], Daly studies periodic checkpointing policies for Exponentially distributed failures, generalizing the well-known bound obtained by Young [28]. Daly extended his work in [15] to study the impact of sub-optimal checkpointing periods. In [25], the authors develop an "optimal" checkpointing policy, based on the popular assumption that optimal checkpointing must be periodic. In [6], Bouguerra et al. prove that the optimal checkpointing policy is periodic when checkpointing and recovery overheads are constant, for either Exponential or Weibull failures. But their results rely on the unstated assumption that all processors are rejuvenated after each failure and after each checkpoint. In our recent work [4], we have shown that this assumption is unreasonable for Weibull failures. We have developed optimal solutions for Exponential failures and dynamic programming solutions for any failure distribution, demonstrating performance improvements over checkpointing approaches proposed in the literature in the case of Weibull and log-based failures. The Weibull distribution is recognized as a reasonable approximation of failures in real-world systems [14, 24]. The work in this paper relates to checkpointing policies in the sense that we study a replication mechanism that is used as an addition to checkpointing. Part of our results build on the algorithms and results in [4].

In spite of all the above advances, several studies have questioned the feasibility of pure checkpoint-recovery for large-scale systems (see [12] for a discussion of this issue and for references to such studies). In this work, we study the use of replication as a mechanism complementary to checkpoint-recovery. Replication has long been used as a fault-tolerance mechanism in distributed systems [13] and more recently in the context of volunteer computing [17]. The idea of using replication together with checkpoint-recovery has been studied in the context of grid computing [27]. One concern about replication in HPC is the induced resource waste. However, given the scalability limitations of pure checkpoint-recovery, replication has recently received more attention in the HPC literature [23, 29, 11].

In this work we study "group replication," by which multiple application instances are executed on different groups of processors. An orthogonal approach, "process replication," was recently studied by Ferreira et al. [12] in the context of MPI applications. To achieve fault-tolerance, each MPI process is replicated


in a way that is transparent to the application developer. While this approach can lead to good fault-tolerance, one of its drawbacks is that it increases the number and the volume of communications. Let V_tot be the total volume of inter-processor communications for a traditional execution. With process replication using g replicas per replica-group, each original communication now involves g sources and g destinations, hence the total communication volume becomes V_tot × g². Instead, with group replication using g groups, each original communication takes place g times, hence the total communication volume increases only to V_tot × g. Another drawback of process replication is that it requires the use of a customized MPI library (such as the prototype developed by the authors in [12]). By contrast, group replication is completely agnostic to the parallel runtime system and thus does not even require MPI. Nevertheless, even for MPI applications, group replication provides an out-of-the-box fault-tolerance solution that can be used until process replication possibly becomes a mainstream feature in MPI implementations.
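The communication-volume comparison above is easy to make concrete. The sketch below (illustrative only; `v_tot` and `g` are free parameters, not values from this report) encodes the g² vs. g growth of the two schemes:

```python
def comm_volume(v_tot: float, g: int, scheme: str) -> float:
    """Total inter-processor communication volume under replication.

    "process": every original message now has g sources and g destinations,
               so the volume grows by a factor g^2.
    "group":   every original message is simply repeated once per group,
               so the volume grows by a factor g.
    """
    if scheme == "process":
        return v_tot * g * g
    if scheme == "group":
        return v_tot * g
    raise ValueError(f"unknown scheme: {scheme}")
```

For g = 2 groups, process replication already doubles the volume relative to group replication.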

3 Framework

We consider the execution of a tightly-coupled parallel application, or job, on a platform composed of p processors. We use the term processor to indicate any individually scheduled compute resource (a core, a multi-core processor, a cluster node), so that our work applies regardless of the granularity of the platform. We assume that system-level checkpoint-recovery is enabled.

The job must complete W units of (divisible) work, which can be split arbitrarily into separate chunks. The job can execute on any number q ≤ p of processors. Letting W(q) be the time required for a failure-free execution on q processors, we use three models:

• Perfectly parallel jobs: W(q) = W/q.

• Generic parallel jobs: W(q) = (1 − γ)W/q + γW. As in Amdahl's law [1], γ < 1 is the fraction of the work that is inherently sequential.

• Numerical kernels: W(q) = W/q + γW^(2/3)/√q. This is representative of a matrix product or an LU/QR factorization of size N on a 2D processor grid, where W = O(N³). In the algorithm in [3], q = r² and each processor receives 2r blocks of size N²/r² during the execution. Here γ is the communication-to-computation ratio of the platform.
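The three models above can be sketched in a few lines of Python (the model names and the sample values below are illustrative, not from the report; `gamma` plays the role of γ):

```python
import math

def time_ff(model: str, W: float, q: int, gamma: float = 0.0) -> float:
    """Failure-free execution time W(q) for the three job models of Section 3.

    gamma is the sequential fraction for "generic" jobs, and the
    communication-to-computation ratio for "numerical" kernels.
    """
    if model == "perfect":      # perfectly parallel: W(q) = W/q
        return W / q
    if model == "generic":      # Amdahl's law: W(q) = (1 - gamma)*W/q + gamma*W
        return (1.0 - gamma) * W / q + gamma * W
    if model == "numerical":    # W(q) = W/q + gamma * W^(2/3) / sqrt(q)
        return W / q + gamma * W ** (2.0 / 3.0) / math.sqrt(q)
    raise ValueError(f"unknown model: {model}")
```

Note that for "generic" jobs, time_ff approaches γW as q grows, which is the Amdahl limit on achievable speedup.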

Each participating processor is subject to failures. A failure causes a downtime period of the failing processor, of duration D. When a processor fails, the whole execution is stopped, and all processors must recover from the previous checkpoint. We let C(q) denote the time needed to perform a checkpoint, and R(q) the time to perform a recovery. The downtime accounts for software rejuvenation (i.e., rebooting [16, 8]) or for the replacement of the failed processor by a spare. Regardless, we assume that after a downtime the processor is fault-free and begins a new lifetime at the beginning of the recovery period. This recovery period corresponds to the time needed to restore the last checkpoint. Assuming that the application's memory footprint is V bytes, with each processor holding V/q bytes, we consider two scenarios:

• Proportional overhead: C(q) = R(q) = αV/q = C/q with α some constant, for cases where the bandwidth of the network card/link at each processor is the I/O bottleneck.

• Constant overhead: C(q) = R(q) = αV = C with α some constant, for cases where the bandwidth to/from the resilient storage system is the I/O bottleneck.
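The two cost scenarios, and the failure-free cost of executing a chunk followed by its checkpoint, can be sketched as follows (a sketch with a hypothetical constant C of 600 seconds; the function names are ours, not the report's):

```python
C_CONST = 600.0  # hypothetical value of the constant C, in seconds

def ckpt_proportional(q: int) -> float:
    """C(q) = R(q) = C/q: each processor's network link is the I/O bottleneck."""
    return C_CONST / q

def ckpt_constant(q: int) -> float:
    """C(q) = R(q) = C: the shared resilient storage is the I/O bottleneck."""
    return C_CONST

def chunk_time(omega: float, q: int, ckpt) -> float:
    """Failure-free time to execute one chunk of size omega, then checkpoint it."""
    return omega + ckpt(q)
```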

We assume coordinated checkpointing [26], meaning that no message logging/replay is needed when recovering from failures. We assume that failures can happen during recovery or checkpointing, but not during a downtime (otherwise, the downtime period could be considered part of the recovery period). We assume that the parallel job is tightly coupled, meaning that all q processors operate synchronously throughout the job execution. These processors execute the same amount of work W(q) in parallel, chunk by chunk. The total time (on one processor) to execute a chunk of size ω and then checkpoint it is ω + C(q). Finally, we assume that failure arrivals at all processors are independent and identically distributed (iid).

4 Optimal number of processors for execution

Let E(q) be the expectation of the execution time, or makespan, when using q processors, and q_opt the value of q that minimizes E(q). Is it true that the optimal solution is to use all processors, i.e., q_opt = p? If not, what can we say about the value of q_opt? This question was partially and empirically addressed in [25], via experiments for 4 MPI applications for up to 35 processors. Our approach here is radically different since we target large-scale platforms and seek theoretical results in the form of optimal solutions. The main objective of this section is to show analytically that, for Exponential failures, E(q) may reach its minimum for some finite value of q (implying that q_opt is not necessarily equal to p).

Assume that failure inter-arrival times follow an Exponential distribution with parameter λ. In our recent work [4], we have shown that the optimal strategy to minimize the expected makespan E(q) is to split W into K* = max(1, ⌊K0(q)⌋) or K* = ⌈K0(q)⌉ same-size chunks, whichever leads to the smaller value, where

    K0(q) = qλW(q) / (1 + L(−e^{−qλC(q)−1}))

is the optimal (non-integer) number of chunks. L denotes the Lambert function, defined by L(z)e^{L(z)} = z. This result shows that the optimal strategy is periodic and that the optimal expectation of the makespan is:

    E*(q) = K*(q) (1/(qλ) + E(X_D(q))) e^{qλR(q)} (e^{qλW(q)/K*(q) + qλC(q)} − 1)    (1)

where E(X_D(q)) denotes the expectation of the downtime. It turns out that, although we can compute the optimal number of chunks (and thus the chunk size), we cannot compute E*(q) analytically because E(X_D(q)) is difficult to compute. This is because a processor can fail while another one is down, thus prolonging the downtime. With a single processor (q = 1), X_D(q) has constant value D, but with several processors there could be cascading downtimes. It turns out that we can compute the following lower and upper bounds for E(X_D(q)):

Proposition 1. Let X_D(q) denote the downtime of a group of q processors. Then

    D ≤ E(X_D(q)) ≤ (e^{(q−1)λD} − 1) / ((q−1)λ)    (2)


Proof. In [4], we have shown that the optimal expectation of the makespan is computed as:

    E*(q) = K*(q) (1/(qλ) + E(T_rec(q))) (e^{qλW(q)/K*(q) + qλC(q)} − 1)    (3)

where E(T_rec(q)) denotes the expectation of the recovery time, i.e., the time spent recovering from failures during the computation of a chunk. All chunks have the same recovery time because they all have the same size and because of the memoryless property of the Exponential distribution. It turns out that although we can compute the optimal number of chunks (and thus the chunk size), we cannot compute E*(q) analytically because E(T_rec(q)) is difficult to compute. We write the following recursion:

    T_rec(q) = { X_D(q) + R(q)                       if no processor fails during R(q) units of time,
               { X_D(q) + T_lost(R(q)) + T_rec(q)    otherwise.    (4)

X_D(q) is the downtime of a group of q processors, that is, the time between the first failure of one of the processors and the first time at which all of them are available (accounting for the fact that a processor can fail while another one is down, thus prolonging the downtime). T_lost(R(q)) is the amount of time spent computing by these processors before a first failure, knowing that the next failure occurs within the next R(q) units of time. In other terms, it is the compute time that is wasted because checkpoint recovery was not completed. The time until the next failure of a group of q processors is the minimum of q iid Exponentially distributed variables, and is thus Exponential with parameter qλ. We can compute E(T_lost(R(q))) = 1/(qλ) − R(q)/(e^{qλR(q)} − 1) (see [4] for details). Plugging this value into Equation (4) leads to:

    E(T_rec(q)) = e^{−qλR(q)} (E(X_D(q)) + R(q))
                + (1 − e^{−qλR(q)}) (E(X_D(q)) + 1/(qλ) − R(q)/(e^{qλR(q)} − 1) + E(T_rec(q)))    (5)

Equation (5) reads as follows: after the downtime X_D(q), either the recovery succeeds for everybody, or there is a failure during the recovery and another attempt must be made. Both events are weighted by their respective probabilities. Simplifying the above expression we get:

    E(T_rec(q)) = E(X_D(q)) e^{qλR(q)} + (1/(qλ)) (e^{qλR(q)} − 1)    (6)

Plugging this expression back into Equation (3), we obtain the value given in Equation (1).

Now we establish the desired bounds on E(X_D(q)). We always have X_D(q) ≥ X_D(1) ≥ D, hence the lower bound. For the upper bound, consider a date at which one of the q processors, say processor i0, just had a failure and initiates its downtime period of D time units. Some other processors might be in the middle of their downtime period: for each processor i, 1 ≤ i ≤ q, let t_i denote the remaining duration of the downtime of processor i. We have 0 ≤ t_i ≤ D for 1 ≤ i ≤ q, t_{i0} = D, and t_i = 0 means that processor i is up and running. Let X_D^{t_1,..,t_q}(q) be the remaining downtime of a group of q processors, knowing that processor i, 1 ≤ i ≤ q, will still be down for a duration of t_i, and that a failure just happened (i.e., there exists i0 such that t_{i0} = D). Given the values of the t_i's, we have the following equation for the random variable X_D^{t_1,..,t_q}(q):

    X_D^{t_1,..,t_q}(q) = { D                                            if none of the processors of the group fails during the next D units of time,
                          { T_lost^{t_1,..,t_q}(D) + X_D^{t'_1,..,t'_q}(q)    otherwise.

In the second case of the equation, consider the next D time units. Processor i can only fail in the last D − t_i of these time units. Here the values of the t'_i's depend on the t_i's and on T_lost^{t_1,..,t_q}(D). Indeed, except for the last processor to fail, say i1, for which t'_{i1} = D, we have t'_i = max{t_i − T_lost^{t_1,..,t_q}(D), 0}. More importantly, we always have T_lost^{t_1,..,t_q}(D) ≤ T_lost^{D,0,..,0}(D) and X_D^{t_1,..,t_q}(q) ≤ X_D^{D,0,..,0}(q), because the probability for a processor to fail during D time units is always larger than that to fail during D − t_i time units. Thus, E(X_D^{t_1,..,t_q}(q)) ≤ E(X_D^{D,0,..,0}(q)). Following the same line of reasoning, we derive an upper bound for X_D^{D,0,..,0}(q):

    X_D^{D,0,..,0}(q) ≤ { D                                       if none of the q − 1 running processors of the group fails during the downtime D,
                        { T_lost^{D,0,..,0}(D) + X_D^{D,0,..,0}(q)    otherwise.

Weighting both cases by their probability and taking expectations, we obtain

    E(X_D^{D,0,..,0}(q)) ≤ e^{−(q−1)λD} D + (1 − e^{−(q−1)λD}) (E(T_lost^{D,0,..,0}(D)) + E(X_D^{D,0,..,0}(q)))

hence E(X_D^{D,0,..,0}(q)) ≤ D + (e^{(q−1)λD} − 1) E(T_lost^{D,0,..,0}(D)), with

    E(T_lost^{D,0,..,0}(D)) = 1/((q−1)λ) − D/(e^{(q−1)λD} − 1).

We derive

    E(X_D^{t_1,..,t_q}(q)) ≤ E(X_D^{D,0,..,0}(q)) ≤ (e^{(q−1)λD} − 1) / ((q−1)λ),

which concludes the proof. As a sanity check, we observe that the upper bound is at least D, using the inequality e^x ≥ 1 + x for x ≥ 0.
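The quantities above are straightforward to evaluate numerically. The sketch below (with illustrative parameter values, not figures from this report) computes K0(q) via a small Newton iteration for the Lambert function, picks the better of the two integer candidates for K*, and evaluates Equation (1) with E(X_D(q)) replaced by its lower bound D from Proposition 1:

```python
import math

def lambert_w0(z: float, tol: float = 1e-12) -> float:
    """Principal branch of the Lambert function, L(z)e^{L(z)} = z, by Newton
    iteration. Here z = -e^{-q*lam*C(q)-1} always lies in (-1/e, 0)."""
    w = z  # reasonable starting point for small |z|
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w

def k0(q: int, lam: float, W_q: float, C_q: float) -> float:
    """Optimal (non-integer) number of chunks K0(q)."""
    return q * lam * W_q / (1.0 + lambert_w0(-math.exp(-q * lam * C_q - 1.0)))

def makespan_eq1(K: float, q: int, lam: float, W_q: float,
                 C_q: float, R_q: float, E_XD: float) -> float:
    """Equation (1): expected makespan with K same-size chunks."""
    return (K * (1.0 / (q * lam) + E_XD) * math.exp(q * lam * R_q)
            * (math.exp(q * lam * W_q / K + q * lam * C_q) - 1.0))

# Illustrative parameters: 2^16 processors, per-processor MTBF of 10^7 s,
# a 10,000-hour perfectly parallel job, constant checkpointing overhead.
q, lam, D = 2**16, 1e-7, 60.0
W_q, C_q, R_q = 10_000.0 * 3600 / q, 600.0, 600.0

K_0 = k0(q, lam, W_q, C_q)
candidates = [max(1, math.floor(K_0)), math.ceil(K_0)]
# E(X_D(q)) is approximated by its lower bound D (Proposition 1)
K_star = min(candidates, key=lambda K: makespan_eq1(K, q, lam, W_q, C_q, R_q, D))
```

Replacing D by the upper bound of Equation (2) in the last line yields the corresponding pessimistic estimate of E*(q).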

While in a failure-free environment E*(q) would always decrease as q increases, using the above lower bound on E(X_D(q)) we obtain the following results:


Theorem 1. When the failure distribution follows an Exponential law, E*(q) reaches its minimum for some finite value of q in the following scenarios: all job types (perfectly parallel, generic, and numerical) with constant overhead, and generic or numerical jobs with proportional overhead.

Note that the only open scenario is that of perfectly parallel jobs with proportional overhead. In this case the lower bound on E*(q) decreases to some constant value while the upper bound goes to +∞ as q increases.

Proof. We show that lim_{q→+∞} E*(q) = +∞ for the relevant scenarios. We first plug the lower bound of Equation (2) into Equation (6) and obtain:

    E(T_rec(q)) ≥ D e^{qλR(q)} + (1/(qλ)) (e^{qλR(q)} − 1).

From Equation (1) we then derive the lower bound:

    E*(q) ≥ K0(q) (1/(qλ) + D) e^{qλR(q)} (e^{qλW(q)/K0(q) + qλC(q)} − 1)

using the fact that, by definition, the expression on the right-hand side of Equation (1) is minimized by K0, where K0(q) = qλW(q) / (1 + L(−e^{−qλC(q)−1})).

With constant overhead. Let us consider the case of perfectly parallel jobs (W(q) = W/q) with constant checkpointing overhead (C(q) = R(q) = C). We get the lower bound:

    E*(q) ≥ K0(q) (1/(qλ) + D) e^{qλC} (e^{λW/K0(q) + qλC} − 1)

where K0(q) = λW / (1 + L(−e^{−qλC−1})). When q tends to +∞, K0(q) goes to λW, while (1/(qλ) + D) e^{qλC} (e^{λW/K0(q) + qλC} − 1) goes to +∞. Consequently, E*(q) is bounded below by a quantity that goes to +∞, which concludes the proof. This result also implies that E*(q) reaches a minimum for a finite q value for the other job types (generic, numerical) with constant overhead, simply because the execution time is larger in those cases than with perfectly parallel jobs.

Generic parallel jobs with proportional overhead. Here we assume that W(q) = W/q + γW, and use proportional overhead: C(q) = R(q) = C/q. We get the lower bound:

    E*(q) ≥ K0(q) (1/(qλ) + D) e^{λC} (e^{(λW + qλγW)/K0(q) + λC} − 1)

where K0(q) = (λW + qλγW) / (1 + L(−e^{−λC−1})). As before, we show that lim_{q→+∞} E*(q) = +∞ to get the result. When q tends to +∞, K0(q) tends to +∞, while (1/(qλ) + D) e^{λC} (e^{(λW + qλγW)/K0(q) + λC} − 1) tends to some positive constant. This concludes the proof. Note that this proof also serves for generic parallel jobs with constant overhead, simply because the execution time is larger in that case than with proportional overhead.


Numerical kernels with proportional overhead. Here we assume that W(q) = W/q + γW^(2/3)/√q, and use proportional overhead: C(q) = R(q) = C/q. We get the lower bound:

    E*(q) ≥ K0(q) (1/(qλ) + D) e^{λC} (e^{(λW + λγW^(2/3)√q)/K0(q) + λC} − 1)

where K0(q) = (λW + λγW^(2/3)√q) / (1 + L(−e^{−λC−1})). As before, we show that lim_{q→+∞} E*(q) = +∞ to get the result. When q tends to +∞, K0(q) tends to +∞, while (1/(qλ) + D) e^{λC} (e^{(λW + λγW^(2/3)√q)/K0(q) + λC} − 1) tends to some positive constant. This concludes the proof.

5 Group replication

Since using all processors to run a single application instance may not make sense in light of Theorem 1, the group replication approach consists in executing multiple application instances on different processor groups, where the number of processors in a group is closer to q_opt. All groups compute the same chunk simultaneously, and do so until one of them succeeds, potentially after several failed trials. Then all other groups stop executing that chunk and recover from the checkpoint stored by the successful group. All groups then attempt to compute the next chunk. Group replication can be implemented easily with no modification to the application, provided that the recovery implementation allows a group to recover immediately from a checkpoint produced by another group. In this section we formalize group replication as an execution protocol called ASAP (As Soon As Possible), and analyze its performance for Exponential failures. We then introduce dynamic programming solutions that work with general failure distributions.

5.1 The ASAP execution protocol

We consider g groups, where each group has q processors, with g × q ≤ p. A group is available for execution if and only if all its q processors are available. In case of a failure, the downtime of a group is a random variable X_D(q) ≥ D, whose expectation is bounded in Proposition 1. If a group encounters a first processor failure at time t, the group is down between t and t + X_D(q).

The ASAP algorithm proceeds in k macro-steps. During macro-step j, 1 ≤ j ≤ k, each group independently attempts to execute the j-th chunk of size ω_j and to checkpoint, restarting as soon as possible in case of a failure. As soon as one of the groups succeeds, say at time t_j^end, all the other groups are immediately stopped, macro-step j is over, and macro-step (j+1) starts (if j < k). Note that the value of k, the total number of chunks, as well as the chunk sizes ω_j, are inputs to the algorithm (we always have Σ_{j=1}^{k} ω_j = W(q)). We provide an analytical evaluation of ASAP for Exponential failure laws, and discuss how to choose these values, in Section 5.2.

Two important aspects must be mentioned. First, before being able to start macro-step (j + 1), a group that has been stopped must execute a recovery, in order to restart from the checkpoint of a successful group. Second, this recovery may start later than time t_j^end, in the case where the group is down at time t_j^end. An example is shown in Figure 1, in which Group 1 cannot start the recovery at time t_1^end. The only groups that do not need to recover at the beginning of the next step are the groups that were successful for the previous step, except during the first step at which all groups can start computing right away.

[Figure 1: Execution of chunks ω_1 and ω_2 (macro-steps 1 and 2) using the ASAP protocol. At time t_1^end, Group 1 is not ready, and Group 2 is the only one that does not need to recover.]

We now provide an analytical evaluation of ASAP for Exponential failure laws, and show how to compute the number of macro-steps k and the values of the chunk sizes ω_j.

5.2 Exponential failures

Let us assume that the failure rate of each processor obeys an Exponential law of parameter λ. For the sake of the theoretical analysis, we introduce a slightly modified version of the ASAP protocol in which all groups, including the successful ones, execute a recovery at the beginning of all macro-steps, including the first one. This new version of ASAP is described in Algorithm 1. It is completely symmetric, which renders its analysis easier: for macro-step j to be successful, one of the groups must be up and running for a duration of R(q) + ω_j + C(q).

Algorithm 1: ASAP(ω_1, . . . , ω_k)
1  for j = 1 to k do
2    for each group do in parallel
3      repeat
4        Finish current downtime (if any)
5        Try to perform a recovery, then a chunk of size ω_j, and finally to checkpoint
6        if execution successful then
7          Signal other groups to immediately stop their attempts
8      until one of the groups has a successful attempt
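The behavior of Algorithm 1 under Exponential failures is easy to simulate. The sketch below makes two simplifying assumptions not made in the report: a group's downtime is taken to be exactly D (Proposition 1 only bounds its expectation from below by D), and groups retry independently without the synchronization details of Figure 1:

```python
import random

def asap_macro_step(g: int, q: int, lam: float, D: float,
                    omega: float, R: float, C: float,
                    rng: random.Random) -> float:
    """Simulate one ASAP macro-step: g groups of q processors each retry
    'recovery + chunk of size omega + checkpoint' until one group succeeds;
    return the time at which the earliest group succeeds.

    Simplification: the group downtime is taken to be exactly D.
    """
    attempt = R + omega + C            # a group must run this long uninterrupted
    finish_times = []
    for _ in range(g):
        t = 0.0
        while True:
            x = rng.expovariate(q * lam)   # the group fails at rate q*lam
            if x >= attempt:               # attempt completes before next failure
                finish_times.append(t + attempt)
                break
            t += x + D                     # wasted attempt, then group downtime
    return min(finish_times)               # the first successful group ends the step

rng = random.Random(42)
steps = [asap_macro_step(g=3, q=64, lam=1e-4, D=10.0,
                         omega=50.0, R=2.0, C=2.0, rng=rng) for _ in range(100)]
```

Averaging `steps` over many runs gives an empirical estimate of the expected macro-step length, which can be compared against the analytical upper bound derived below.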

Consider the j-th macro step, number the attempts of all groups by theirstart time, and let N

j

be the index of the earliest started attempt that success-fully computes chunk Ê

j

. In Figure 2, we have j = 2, the successful chunk ofsize R+Ê2 +C is the fourth attempt, so N2 = 4. To represent each attempt, wesample random variables Xj

i

and Y j

i

, 1 Æ i Æ Nj

, that correspond respectively

RR n° 7876

Using group replication for resilience on exascale systems 13

Figure 2: Zoom on macro-step 2 of the execution depicted in Figure 1, using the (X, Y) notation of Algorithm 2, for three groups. Attempt i of step 2 has size X^2_i and is followed by a downtime of size Y^2_i. Recall that Job_i has size X^2_i + Y^2_i for 1 ≤ i ≤ 3, and that Job_4, the successful attempt, has size R(q) + ω_2 + C(q).

to the i-th tentative execution of the chunk and to the i-th downtime that follows it (if i ≠ N_j). Note that X^j_i < R + ω_j + C for i < N_j, and X^j_{N_j} ≥ R + ω_j + C. All the X^j_i's follow the same distribution D_X, namely an Exponential law of parameter qλ. And all the Y^j_i's follow the same distribution D_{X_D(q)}, that of the random variable X_D(q) corresponding to the downtime of a group of q processors.

The main idea here is to view the N_j execution attempts as jobs, where the size of job i is X^j_i + Y^j_i, and to distribute them across the g groups using the classical online list scheduling algorithm for independent jobs [21, section 5.6]. This formulation (see Proposition 2) allows us to provide an upper bound for the starting time of job N_j, and hence for the length of macro-step j, using a well-known scheduling argument (see Proposition 3). We then derive an upper bound for the expected execution time of ASAP (see Theorem 2).

Algorithm 2: Step j of ASAP(ω_1, ..., ω_k)
1 i ← 1 /* i represents the number of attempts for the job */
2 L ← ∅ /* L represents the list of attempts for the job */
3 Sample X^j_i and Y^j_i using D_X and D_{X_D(q)} respectively
4 while X^j_i < R(q) + ω_j + C(q) do
5   Add Job_i, with processing time X^j_i + Y^j_i, to L
6   i ← i + 1
7   Sample X^j_i and Y^j_i using D_X and D_{X_D(q)} respectively
8 N_j ← i
9 Add Job_{N_j}, with processing time R(q) + ω_j + C(q), to L /* the first successful job has size R(q) + ω_j + C(q), not X^j_{N_j} + Y^j_{N_j} */
10 From time t^end_{j−1} on, execute a List Scheduling algorithm to distribute the jobs of L to the different groups (recall that some groups may not be ready at time t^end_{j−1})

Proposition 2. The j-th macro-step of the ASAP protocol can be simulated using Algorithm 2: the last job scheduled by Algorithm 2 ends exactly at time t^end_j.

Proof. The List Scheduling algorithm distributes the next job to the first available group. Because of the memoryless property of Exponential laws, it is


Figure 3: Notations used in Proposition 3: T^truestart_{R(q)+ω_j+C(q)} is the time elapsed between t^end_{j−1} and the start of the successful attempt Job_{N_j}, which has size R(q) + ω_j + C(q) and ends at t^end_j.

equivalent (i) to generate the attempts a priori and greedily schedule them, or(ii) to generate them independently within each group.
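The equivalence stated in Proposition 2 can be checked by direct simulation. The sketch below is a simplified model, not the paper's implementation: all parameter values are illustrative, and a constant downtime stands in for the distribution D_{X_D(q)}. It samples the attempts of Algorithm 2 a priori and assigns them greedily to the earliest-available group:

```python
import heapq
import random

def simulate_macro_step(g, q, lam, R, C, omega, downtime, rng):
    """Simulate one ASAP macro-step as in Algorithm 2 (illustrative sketch).

    Attempts are sampled a priori, which is valid for Exponential laws by
    memorylessness, and distributed with online list scheduling: each job
    goes to the earliest-available group.  Returns the completion time of
    the last scheduled job, i.e. t_end of the macro-step."""
    threshold = R + omega + C          # a successful attempt must last this long
    jobs = []
    while True:
        x = rng.expovariate(q * lam)   # failure-free run of a group of q processors
        if x >= threshold:
            jobs.append(threshold)     # successful job: exactly R + omega + C
            break
        jobs.append(x + downtime)      # failed attempt followed by a downtime
    ready = [0.0] * g                  # availability time of each group
    heapq.heapify(ready)
    end = 0.0
    for size in jobs:                  # greedy list scheduling
        start = heapq.heappop(ready)
        end = start + size
        heapq.heappush(ready, end)
    return end
```

With g = 1, the schedule degenerates to a sequential execution of all attempts, which matches single-group behavior.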

Proposition 3. Let T^truestart_{R(q)+ω_j+C(q)} be the time elapsed between t^end_{j−1} and the beginning of Job_{N_j} (see Figure 3). We have

E(T^truestart_{R(q)+ω_j+C(q)}) ≤ E(Y) + (E(N_j)E(X) − E(X^j_{N_j}) + (E(N_j) − 1)E(Y)) / g,

where X and Y are random variables corresponding to an attempt (sampled using D_X and D_{X_D(q)} respectively). Moreover, we have E(N_j) = e^{λq(R(q)+ω_j+C(q))} and E(X^j_{N_j}) = 1/(qλ) + R(q) + ω_j + C(q).

Proof. For group x, 1 ≤ x ≤ g, let Y_x denote the time elapsed before it is ready for macro-step j. For example, in Figure 2 we have Y_1 > 0 (group 1 is down at time t^end_{j−1}), while Y_2 = Y_3 = 0 (groups 2 and 3 are ready to compute at time t^end_{j−1}). Proposition 2 has shown that executing macro-step j can be simulated by executing a List Schedule on a job list L (see Algorithm 2). We now consider g "jobs" ˜Job_x, x = 1, ..., g, such that ˜Job_x has duration Y_x, and the augmented job list L′ = L ∪ ⋃_{x=1}^g {˜Job_x}. Note that L′ may contain more jobs than macro-step j: the jobs that start after the successful job Job_{N_j} are discarded from the list L′. However, both schedules have the same makespan, and jobs common to both have the same start and completion dates. Thus, we have

T^truestart_{R(q)+ω_j+C(q)} ≤ (∑_{x=1}^g Y_x + ∑_{i=1}^{N_j−1} (X^j_i + Y^j_i)) / g.

This key inequality is due to the property of list scheduling: the group which is assigned the last job is the least loaded when this assignment is decided, hence its load does not exceed the average load (the total load divided by the number of groups). Given that E(Y_x) ≤ E(Y), we derive

E(T^truestart_{R(q)+ω_j+C(q)}) ≤ E(Y) + (E(∑_{i=1}^{N_j−1} X^j_i) + E(∑_{i=1}^{N_j−1} Y^j_i)) / g.

But N_j is the stopping criterion of the (X^j_i) sequence, hence using Wald's theorem we have E(∑_{i=1}^{N_j} X^j_i) = E(N_j)E(X), which leads to E(∑_{i=1}^{N_j−1} X^j_i) = E(N_j)E(X) − E(X^j_{N_j}). Moreover, as N_j and Y^j_i are independent variables,


we have E(∑_{i=1}^{N_j−1} Y^j_i) = (E(N_j) − 1)E(Y), and we get the desired bound for E(T^truestart_{R(q)+ω_j+C(q)}).

Finally, as the expected number of attempts when independently repeating until success an event of probability α is 1/α (geometric law), we get E(N_j) = e^{λq(R(q)+ω_j+C(q))}. The value of E(X^j_{N_j}) can be directly computed from the definition, recalling that X^j_{N_j} ≥ R(q) + ω_j + C(q) and that each X^j_i follows an Exponential distribution of parameter qλ.

Theorem 2. The expected execution time of ASAP has the following upper bound:

((g−1)/g) W(q) + (1/g) (1/(qλ) + E(Y)) e^{λq(R(q)+C(q))} k* e^{λqW(q)/k*} + k* (((g−1)/g) (E(Y) + R(q) + C(q)) − (1/g) (1/(qλ))),

which is obtained when using k* = max(1, ⌊k_0⌋) or k* = ⌈k_0⌉ same-size chunks, whichever leads to the smaller value, where

k_0 = λqW(q) / (1 + L((g − 1 + ((g−1)qλ(R(q)+C(q)) − g)/(1 + qλE(Y))) e^{−(1+λq(R(q)+C(q)))})),

L denoting the Lambert function.

Proof. From Proposition 3, the expected execution time of ASAP has upper bound T_ASAP = ∑_{j=1}^k α_j, where

α_j = E(Y) + (E(N_j)E(X) − E(X^j_{N_j}) + (E(N_j) − 1)E(Y)) / g + (R(q) + ω_j + C(q)).

Our objective now is to find the inputs to the ASAP algorithm, namely the number k of macro-steps together with the chunk sizes (ω_1, ..., ω_k), that minimize this T_ASAP bound.

We first have to prove that any optimal (in expectation) policy uses only a finite number of chunks. Let α be the expectation of the ASAP makespan using a unique chunk of size W(q). According to Proposition 3,

α = E(T^truestart_{R(q)+W(q)+C(q)}) + C(q) + W(q) + R(q),

and is finite. Thus, if an optimal policy uses k* chunks, we must have k*C(q) ≤ α, and thus k* is bounded.

In the proof of Theorem 1 in [4], we have shown that any deterministic strategy uses the same sequence of chunk sizes, whatever the failure scenario, thanks to the memoryless property of the Exponential distribution. We cannot prove such a result in the current context. For instance, the number of groups performing a downtime at time t^end_1 depends on the scenario. There is thus no reason a priori for the size of the second chunk to be independent of the scenario. To overcome this difficulty, we restrict our analysis to strategies that use the same sequence of chunk sizes whatever the failure scenario. We optimize T_ASAP in that context, at the possible cost of finding a larger upper bound.

We thus suppose that we have a fixed number of chunks, k, and a sequence of chunk sizes (ω_1, ..., ω_k), and we look for the values of (ω_1, ..., ω_k) that minimize


T_ASAP = ∑_{j=1}^k α_j. Let us first compute one of the α_j terms. Replacing E(N_j) and E(X^j_{N_j}) by the values given in Proposition 3, and E(X) by 1/(qλ), we get

α_j = ((g−1)/g) ω_j + (1/g) e^{λq(R(q)+ω_j+C(q))} (1/(qλ) + E(Y)) + ((g−1)/g) (E(Y) + R(q) + C(q)) − (1/g) (1/(qλ)),

T_ASAP = ((g−1)/g) W(q) + (1/g) (1/(qλ) + E(Y)) e^{λq(R(q)+C(q))} ∑_{j=1}^k e^{λqω_j} + k (((g−1)/g) (E(Y) + R(q) + C(q)) − (1/g) (1/(qλ))).

By convexity, the expression ∑_{j=1}^k e^{λqω_j} is minimal when all ω_j's are equal (to W(q)/k). Hence all the chunks should be equal for T_ASAP to be minimal. We obtain:

T_ASAP = ((g−1)/g) W(q) + (1/g) (1/(qλ) + E(Y)) e^{λq(R(q)+C(q))} k e^{λqW(q)/k} + k (((g−1)/g) (E(Y) + R(q) + C(q)) − (1/g) (1/(qλ))).

Let f(x) = τ_1 x e^{λqW(q)/x} + τ_2 x, where

τ_1 = (1/g) (1/(qλ) + E(Y)) e^{λq(R(q)+C(q))} and
τ_2 = ((g−1)/g) (E(Y) + R(q) + C(q)) − (1/g) (1/(qλ)).

A simple analysis using differentiation shows that f has a unique minimum, and solving f′(x) = 0 leads to τ_1 e^{λqW(q)/k} (1 − λqW(q)/k) + τ_2 = 0, and thus to k = λqW(q) / (1 + L(τ_2/(τ_1 e))) = k*, where L denotes the Lambert function, which concludes the proof.

Using the upper bound on E(Y) = E(X_D(q)) given in Proposition 1, we can numerically compute the number of chunks and the value of the upper bound given by Theorem 2.

5.3 Group replication heuristics for general failure distributions

The results of the previous section are limited to Exponential failures. We now turn to the general case. In Section 5.3.1 we recall the dynamic program designed in [4] to define checkpointing dates in a context without replication. We then discuss how to use this dynamic program in the context of ASAP (Section 5.3.2). Finally, in Section 5.3.3 we propose heuristics that correspond to more general execution protocols than ASAP.


Algorithm 3: DPNextFailure(W, τ_1, ..., τ_p)
1 if W = 0 then return 0
2 best ← 0; chunksize ← 0
3 for ω = quantum to W step quantum do
4   (expected_work, 1st_chunk) ← DPNextFailure(W − ω, τ_1 + ω + C(p), ..., τ_p + ω + C(p))
5   cur_exp_work ← P_suc(τ_1 + ω + C(p), ..., τ_p + ω + C(p) | τ_1, ..., τ_p) × (ω + expected_work)
6   if cur_exp_work > best then best ← cur_exp_work; chunksize ← ω
7 return (best, chunksize)

5.3.1 Solution without replication

According to [4], the most efficient algorithm to define checkpointing dates, for general failure distributions and when no replication is used, is the dynamic programming approach called DPNextFailure, shown in Algorithm 3. This algorithm works on a failure-by-failure basis, maximizing the expectation of the amount of work completed (and checkpointed) before the next failure occurs. It only provides an approximation of the optimal solution, as it relies on a time discretization. (For the sake of simplicity, we present all dynamic programs as recursive algorithms.)

5.3.2 Implementing the ASAP execution protocol

A key question when implementing ASAP is that of the chunk size. A naive approach would be to use DPNextFailure: each group would call DPNextFailure to compute what would be the optimal chunk size for itself, as if there were no other groups, and these individual chunk sizes would then be merged into a common chunk size. This heuristic, DPNextFailureAsap, is shown in Algorithm 4 (the Alive function returns, for a list of q processors, the amount of time each has been up and running since its last downtime). For the Merge operator, one could be pessimistic and take the minimum of the chunk sizes, be optimistic and take the maximum, or attempt a trade-off by taking the average. This heuristic has two important limitations: 1) the Merge operator has no theoretical justification; and 2) the chunk sizes are defined using an obsolete failure history. The latter limitation becomes apparent after a group is the victim of a failure: the failure history has changed significantly, but the chunk size is not recomputed (this has no consequence with an Exponential distribution, which is memoryless).

5.3.3 Other execution protocols

In order to circumvent both limitations of DPNextFailureAsap, we relax the constraint that all groups work with the same chunk size. Each group now works with its own chunk size, which is recomputed each time the group fails and each time one of the groups successfully completes its own chunk. This leads to the heuristic DPNextFailureSynchro in Algorithm 5. Each time a group successfully works for the duration of its chunk size and checkpoints its work, it signals its success to all groups, which then interrupt their own work. This


Algorithm 4: DPNextFailureAsap(W)
1 while W ≠ 0 do
2   for each group x = 1..g do in parallel
3     (τ_{(x−1)q+1}, ..., τ_{(x−1)q+q}) ← Alive((x−1)q+1, ..., (x−1)q+q)
4     (exp_work_x, ω_x) ← DPNextFailure(W, τ_{(x−1)q+1}, ..., τ_{(x−1)q+q})
5   ω ← Merge(ω_1, ..., ω_g)
6   for each group do in parallel
7     repeat
8       Try to execute a chunk of size ω and then checkpoint
9       if successful then
10        Signal other groups to immediately stop their attempts
11      else if failure then Complete downtime and perform recovery
12    until one of the groups signals its success
13  W ← W − ω
14  for each group do in parallel
15    if not successful on last chunk then Perform recovery from last successfully completed checkpoint

behavior is clearly sub-optimal if the successful completion is for a very small chunk while an interrupted group that was computing a large chunk was close to completion. The problems are that: 1) chunk sizes are defined as if each group were alone; and 2) the definition of the chunk sizes does not account for the fact that the first group to complete its checkpoint defines a mandatory checkpointing date for all groups.

Algorithm 5: DPNextFailureSynchro(W)
1 for each group x = 1..g do in parallel
2   while W ≠ 0 do
3     (τ_{(x−1)q+1}, ..., τ_{(x−1)q+q}) ← Alive((x−1)q+1, ..., (x−1)q+q)
4     (exp_work_x, ω_x) ← DPNextFailure(W, τ_{(x−1)q+1}, ..., τ_{(x−1)q+q})
5     Try to execute a chunk of size ω_x and then checkpoint
6     if successful then
7       Signal other groups to immediately stop their attempts
8       W ← W − ω_x
9     if failure then Complete downtime
10    if failure or signal then
11      Perform recovery from last successfully completed checkpoint

To address these problems, rather than making another attempt at reusing DPNextFailure, we design a brand new dynamic program. In [4], without replication, we found that a dynamic program minimizing the expectation of the makespan would require an intractable, exponential number of states to record which processors fail and when. Hence we aimed instead at maximizing the expectation of the work completed before the next failure. We now extend this approach to the context of replication. The first failure will only interrupt a single group. Therefore, the objective should be to maximize the expectation of the work completed before all groups have failed. This approach ignores that once a group has failed, it will eventually restart and resume computing. However, keeping track of such restarts would require recording which processors


have failed and when, thus once again leading to an exponential number ofstates.

To avoid having the first completed checkpoint force a checkpoint for all other groups, we design a new dynamic program, DPNextCheckpoint (Algorithm 6). DPNextCheckpoint does not define chunk sizes, i.e., amounts of work to be processed before a checkpoint is taken; instead, it defines checkpointing dates. The rationale is that one checkpointing date can correspond to different amounts of work for each group, depending on when the group started to process its current chunk: after either its last failure and recovery, its last checkpoint, or its last recovery based on another group's checkpoint. The function WorkAlreadyDone (Line 3) returns, for each group, the time since it started processing its current chunk.

DPNextCheckpoint proceeds as follows. At the checkpointing date, the amount of work completed is the maximum of the amounts of work done by the different groups that successfully complete the checkpoint. Therefore, we consider all the different cases (Line 8), that is, which group x, among the successful groups, has done the most work. We compute the probability of each case (Line 11): all groups that started to work earlier than group x have failed (i.e., at least one processor in each of them has failed) but not group x (i.e., none of its processors has failed). We compute the expectation of the amount of work completed in each case (Lines 12 and 13). We then sum the contributions of all the cases (Line 14) and record the checkpointing date leading to the largest expectation (Line 15). Note that the probability computed at Line 11 explicitly states which groups have successfully completed the checkpoint and which groups have not. We choose not to take this information into account when computing the expectation (recursive call at Line 13), so as to avoid keeping track of which groups have failed, thereby lowering the complexity of the dynamic program. This explains why the conditions do not evolve in the conditional probability at Line 11.
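The case decomposition at Line 11 can be written compactly. In the hypothetical helper below, the per-group conditional probabilities P_fail and P_suc for the horizon δ are assumed to have been precomputed, with groups already sorted by non-increasing work done:

```python
def case_probability(x, p_fail, p_suc):
    """Probability of the case considered at Line 11: every group more
    advanced than x fails (p_fail[y] for y < x) while group x itself
    completes the checkpoint (p_suc[x])."""
    proba = p_suc[x]
    for y in range(x):
        proba *= p_fail[y]
    return proba
```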

Finally, Algorithm 7 shows the algorithm, called DPNextFailureGlobal, that uses DPNextCheckpoint. Each time a group is affected by an event (a failure, or a successful checkpoint by itself or by another group), it computes the next checkpoint date and signals the result of its computation to the other groups (e.g., by broadcasting it to the g group leaders). Hence, a group may have computed the next checkpoint date to be t, and that date can be either unmodified, postponed, or advanced by events occurring on other groups and by their re-computation of the best next checkpoint date. In practice, as these algorithms rely on a time discretization, at each time quantum a group can check whether the current time is a checkpoint date or not.

6 Simulation framework

In this section we detail our simulation methodology. We use both synthetic and real-world failure distributions. The source code and all simulation results are publicly available at: http://perso.ens-lyon.fr/frederic.vivien/Data/Resilience/SPAA2012/.


Algorithm 6: DPNextCheckpoint(W, T, T_0, τ_1, ..., τ_gq)
1 if W = 0 then return 0
2 best_work ← 0; next_chkpt ← date
3 (W_1, ..., W_g) ← WorkAlreadyDone(T) /* Time since last recovery or checkpoint */
4 Reorder groups by non-increasing availabilities (W_1 is maximum)
5 for t = T to T + W − W_g step quantum /* Loop on checkpointing date */
6 do
7   cur_work ← 0
8   for x = 1 to g /* Loop on the first group to successfully work until t + C(q) */
9   do
10    δ ← (t + C(q)) − T_0 /* Total time elapsed until the checkpoint completion */
11    proba ← ∏_{y=1}^{x−1} P_fail(τ_{(y−1)q+1} + δ, ..., τ_{(y−1)q+q} + δ | τ_{(y−1)q+1}, ..., τ_{(y−1)q+q}) × P_suc(τ_{(x−1)q+1} + δ, ..., τ_{(x−1)q+q} + δ | τ_{(x−1)q+1}, ..., τ_{(x−1)q+q})
12    ω ← min{W − W_x, t − T} /* Work done between T and t by group x */
13    (rec_ω, rec_t) ← DPNextCheckpoint(W − W_x − ω, T + ω + C(q) + R(q), T_0, τ_1, ..., τ_gq)
14    cur_work ← cur_work + proba × (W_x + ω + rec_ω)
15  if cur_work > best_work then best_work ← cur_work; next_chkpt ← t
16 return (best_work, next_chkpt)

Algorithm 7: DPNextFailureGlobal(W)
1 for each group x = 1..g do in parallel
2   while W ≠ 0 do
3     (τ_1, ..., τ_gq) ← Alive(1, ..., gq)
4     T_0 ← Time() /* Current time */
5     date ← DPNextCheckpoint(W, T_0, T_0, τ_1, ..., τ_gq)
6     Signal all processors that the next checkpoint date is now date
7     Try to work until date and then checkpoint
8     if successful work until date and checkpoint then
9       Let y be the longest-running group without failure among the successful groups
10      Let ω be the work performed by y since its last recovery or checkpoint
11      W ← W − ω
12      if group x's last recovery or checkpoint was strictly later than that of y then
13        Perform a recovery
14    if failure then Complete downtime
15    if failure or signal then Perform recovery from last successfully completed checkpoint


6.1 Heuristics

Our simulator implements the following eight checkpointing policies:

• Two versions of the ASAP protocol: OptExp, which uses the optimal periodic policy established in [4] for Exponential failure distributions and no replication; and OptExpGroup, which uses the periodic policy defined by Theorem 2 for Exponential distributions.

• The six dynamic programming approaches of Section 5.3: DPNextFailure, the three variants of DPNextFailureAsap, DPNextFailureSynchro, and DPNextFailureGlobal.

Our simulator also implements PeriodLB, a numerical search for the optimal period for ASAP. We evaluate each candidate period on 50 randomly generated scenarios. To build the candidate periods, the period computed for OptExp is multiplied and divided by 1 + 0.05 × i with i ∈ {1, ..., 180}, and by 1.1^j with j ∈ {1, ..., 60}. PeriodLB corresponds to the periodic policy that uses the best period found by the search. The evaluation of PeriodLB on a configuration requires running 24,000 simulations (which would be prohibitive in practice), but we include it for reference. Based on the results in [4], we do not consider any additional checkpointing policy, such as those defined by Young [28] or Daly [9]. We point out that OptExp and OptExpGroup compute the checkpointing period based solely on the MTBF, implicitly assuming that failures are exponentially distributed. For the sake of completeness we nevertheless include them in all our simulations, simply using the MTBF value even when failures are not exponentially distributed. To use OptExp with g groups, we use the period from [4] computed with ⌊p/g⌋ processors.

6.2 Platforms

We target two types of platforms, depending on the type of failure distribution. For synthetic distributions, we consider platforms containing from 32,768 to 1,048,576 processors. For platforms with failures based on failure logs from production clusters, because of the limited scale of those clusters, we restrict the platform sizes to a maximum of 131,072 processors, starting with 4,096 processors. For both platform types, we determine the job size W so that a job using the whole platform would use it for a significant amount of time in the absence of failures, namely ≈ 3.5 days on the largest platforms for synthetic failures (W = 10,000 years) and ≈ 2.8 days on those for log-based failures (W = 1,000 years). Otherwise, we use the same parameters as in [4]: C = R = 600 s, D = 60 s, γ = 10^−6 for generic parallel jobs, and γ = 0.1 for numerical kernels. Note that the checkpointing overheads come from the scenarios in [7] and are smaller than those used in [12].

6.3 Failure scenarios

Synthetic failure distributions – To choose failure distribution parameters that are representative of realistic systems, we use failure statistics from the Jaguar platform. Jaguar contains 45,208 processors and is said to experience on the order of 1 failure per day [20, 2]. Assuming a 1-day platform MTBF gives us a processor MTBF equal to 45,208/365 ≈ 125 years. For the Exponential distribution, we then have λ = 1/MTBF; for Weibull, which requires two


parameters k and λ, we have λ = MTBF/Γ(1 + 1/k), and we fix k to 0.7 based on the results of [24]. (We have shown in [4] that the general trends are not influenced by the exact values used for k or for the MTBF.)

Log-based failure distributions – We also consider failure distributions based on failure logs from production clusters. We used logs from the largest clusters among the preprocessed logs in the Failure trace archive [18], i.e., from clusters at the Los Alamos National Laboratory [24]. In these logs, each failure is tagged by the node (and not just the processor) on which the failure occurred. Among the 26 possible clusters, we opted for the only two clusters with more than 1,000 nodes, as we needed a sample history sufficiently large to simulate platforms with more than 10,000 nodes. The two chosen logs are for clusters 18 and 19 in the archive (referred to as 7 and 8 in [24]). For each log, we record the set S of availability intervals. A discrete failure distribution for the simulation is then generated as follows: the conditional probability P(X ≥ t | X ≥ τ) that a node stays up for a duration t, knowing that it has been up for a duration τ, is set to the ratio of the number of availability durations in S greater than or equal to t, over the number of availability durations in S greater than or equal to τ.

Scenario generation – Given a p-processor job, a failure trace is a set of failure dates for each processor over a fixed time horizon h (set to 2 years). The job start time is set to 1 year for synthetic distribution platforms, and to 0.25 years for log-based distribution platforms. We use a non-zero start time to avoid side-effects related to the synchronous initialization of all processors. Given the distribution of inter-arrival times at a processor, for each processor we generate a trace via independent sampling until the target time horizon is reached. For simulations where the only varying parameter is the number of processors a ≤ p ≤ b, we first generate traces for b processors; for experiments with p processors we then simply select the first p traces. This ensures that simulation results are coherent when varying p. Finally, the two clusters used for computing our log-based failure distributions consist of 4-processor nodes. Hence, to simulate a 131,072-processor platform we generate 32,768 failure traces, one for each four-processor node.
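The discrete distribution described above can be implemented directly from the set S of availability durations. A sketch (the interval values in the example are placeholders, not values from the logs):

```python
from bisect import bisect_left

def make_survival(durations):
    """Empirical conditional survival probability P(X >= t | X >= tau),
    computed as the ratio of availability durations in S that are >= t
    to those that are >= tau."""
    S = sorted(durations)
    n = len(S)
    def count_geq(x):
        # number of recorded durations >= x, via binary search
        return n - bisect_left(S, x)
    def p(t, tau=0.0):
        denom = count_geq(tau)
        return count_geq(t) / denom if denom else 0.0
    return p
```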

7 Simulation results

In this section we discuss simulation results, but only show graphs for perfectly parallel applications under the constant overhead scenario. All results can be found in the appendix. All simulations with replication are for two groups (using three groups never leads to improvements in our experiments).

Optimal number of processors for execution – For Weibull and log-based failures with the constant overhead scenario, using all the available processors to run a single application instance leads to significantly larger makespans. This is seen in Figures 5-7, where the curves for OptExp, PeriodLB, and DPNextFailure, the no-replication strategies, shoot upward when p reaches a large enough value. For instance, with traces based on the logs of LANL cluster 18, the increase in makespan is more than 37% when going from p = 2^16 to p = 2^17. This behavior is in line with the result in Theorem 1. However, in our experiments


Figure 4: Average makespan (in days) vs. number of processors (2^14 to 2^20), for Exponential failures with MTBF = 125 years. Curves: OptExp, PeriodLB, and DPNextFailure without replication, and OptExp, PeriodLB, OptExpGroup, DPNextFailureAsap, DPNextFailureSynchro, and DPNextFailureGlobal with 2 groups.

Figure 5: Average makespan (in days) vs. number of processors (2^14 to 2^20), for Weibull failures with MTBF = 125 years and k = 0.70 (same heuristics as in Figure 4).

Figure 6: Average makespan (in days) vs. number of processors (2^12 to 2^17), for failures based on the failure log of LANL cluster 18 (same heuristics as in Figure 4).

Figure 7: Average makespan (in days) vs. number of processors (2^12 to 2^17), for failures based on the failure log of LANL cluster 19 (same heuristics as in Figure 4).

with Exponential failures, the lowest makespan when running a single application instance is obtained when using all the processors, regardless of the application model. This is seen plainly in Figure 4, at least up to 2^20 processors. This likely means that our experimental scenarios have not reached the scale at which the makespan would increase. Furthermore, running a single application instance is not always best in our results, even for Exponential failures. For instance, for generic parallel jobs, OptExp (with g = 1) begins being outperformed by strategies that use replication at 2^20 processors (see the full results in [5]). We conclude that on exascale platforms the optimal makespan may not be obtained by running a single application instance. This is similar to the result obtained in [12] in the context of process replication.

The ASAP protocol with Exponential failures – Considering only those results obtained for Exponential failures and using group replication, we find that OptExpGroup with g = 2 delivers performance very close to that of the numerical lower bound on any periodic policy under ASAP (PeriodLB with g = 2), and slightly better than the performance of OptExp with g = 2. These results corroborate our theoretical study. Therefore, determining chunk sizes according to Theorem 2 is more efficient than using two instances that each use the chunk size determined by OptExp.


The dynamic programming approaches – The performances of the three variants of DPNextFailureAsap (with Merge being the minimum, the average, or the maximum) are indistinguishable for Exponential and Weibull failure distributions. For log-based failure distributions, their performances are still very close, with the average and the maximum seemingly achieving better performance (the differences, however, may not be significant in light of the standard deviations).

DPNextFailureSynchro achieves worse performance than DPNextFailureAsap for Exponential and Weibull failure distributions. For log-based failure distributions, in some rare settings, the opposite is true. Overall, this heuristic provides disappointing results. By contrast, depending on the scenario, DPNextFailureGlobal is either equivalent to the other dynamic programming approaches or outperforms them significantly. Furthermore, its performance is relatively close to that of PeriodLB with g = 2.

The very perplexing result here is that, regardless of the underlying failure distribution, DPNextFailureGlobal is outperformed by the periodic policy OptExp with g = 2. Recall that this policy erroneously assumes Exponential failures and computes the checkpointing period ignoring that replication is used! By contrast, DPNextFailureGlobal is based on strong theoretical foundations and was designed specifically to handle multiple groups efficiently. And yet, in non-Exponential failure cases, OptExp with g = 2 outperforms DPNextFailureGlobal (see Figures 5-7). In the case of log-based experiments (Figures 6 and 7), it may be that our datasets are not sufficiently large. We use failure data from the largest production clusters in the Failure trace archive [18], but these clusters are orders of magnitude smaller than our simulated platforms. Consequently, we are forced to "replay" the same failure data many times for many of our simulated nodes, which biases the experiments. In the case of Weibull failures with k = 0.7 (Figure 5), it turns out that the failure distribution is sufficiently close to an Exponential, i.e., k is sufficiently close to 1, for OptExp to do well. However, smaller values of k are reasonable, as seen for instance in [19] (k ≈ 0.5) and in [24] (0.33 ≤ k ≤ 0.49). In Table 1 we show results for smaller values of k for a platform with 2^20 processors. We see that, as k decreases, the performance of OptExp with g = 2 degrades when compared to PeriodLB, reaching more than 100% degradation for k = 0.5. However, DPNextFailureGlobal still achieves close-to-optimal makespans in such cases, as can be seen in Figure 8 (for k = 0.5). Overall, DPNextFailureGlobal achieves good to very good performance across the board. Finally, remark in Figure 8 that PeriodLB achieves better performance using a replication factor of 3 (g = 3) than a replication factor of 2.

8 Conclusion

In this paper we have presented a study of replication techniques for large-scale platforms. These platforms are subject to failures, the frequency of which increases dramatically with platform scale. For a variety of job types (perfectly parallel, generic, or numerical) and checkpoint cost models (constant or proportional overhead), we have shown that using the largest possible number of processors does not always lead to the smallest execution time. This is because

RR n° 7876

Using group replication for resilience on exascale systems 25

[Figure 8: average makespan (in days) vs. number of processors (2^14 to 2^20), for DPNextFailureGlobal (2 groups), OptExpGroup (2 groups), PeriodLB (1, 2, and 3 groups), and OptExp (1, 2, and 3 groups). Plot data not recoverable from the extraction.]

Figure 8: Weibull failures with MTBF = 125 years and k = 0.50.

using more resources implies facing more failures during execution, hence wasting more time tolerating them (via an increase in checkpointing frequency) and recovering from them. This waste results in a slowdown despite the additional hardware resources.
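The argument above can be made concrete with the classical first-order (Young/Daly) estimate: with individual MTBF μ and p processors, the platform MTBF is μ/p, the optimal checkpointing period is T = sqrt(2(μ/p)C), and the resulting fraction of wasted time is C/T + T/(2μ/p) = sqrt(2Cp/μ), i.e., it grows with the square root of the platform size. A back-of-the-envelope sketch (the 10-minute checkpoint cost is an assumed value, not taken from the paper):

```python
import math

def optimal_waste(p, mtbf_ind_s, C_s):
    """First-order (Young/Daly) waste fraction on p processors:
    platform MTBF is mu/p, optimal period T = sqrt(2*(mu/p)*C),
    and waste ~ C/T + T/(2*(mu/p)) = sqrt(2*C*p/mu)."""
    mu_p = mtbf_ind_s / p
    return math.sqrt(2.0 * C_s / mu_p)

mu = 125 * 365 * 86400   # individual MTBF: 125 years, in seconds
C = 600                  # checkpoint cost: 10 minutes (assumed)
for p in (2**14, 2**16, 2**18, 2**20):
    w = optimal_waste(p, mu, C)
    print(f"p=2^{int(math.log2(p))}: waste ~ {100 * w:.0f}%")
```

With these numbers the waste grows from a few percent at 2^14 processors to more than half the execution time at 2^20, which is the overhead that replication seeks to avoid.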

This observation led us to investigate replication as a technique to make better use of all the resources provided by the platform. While others have studied process replication [12], in this work we have focused on group replication, a more general and less intrusive approach. Group replication consists in partitioning the platform into several groups, each of which executes an instance of the application concurrently. All groups coordinate to take advantage of the most advanced application state available. We have conducted an analytical study of group replication for Exponential failures when using the ASAP execution protocol, using an analogy with a list schedule to bound the number of attempts and the execution time of each group. This analogy makes it possible to derive an upper bound on the expected execution time, which can be computed numerically. We have also proposed several dynamic programming algorithms to minimize application makespan, whatever the failure distribution. We have compared all these approaches, along with no-replication approaches proposed in previous work, in simulation for both synthetic failure distributions and failure data from production clusters. Our main findings are 1) that replication can significantly lower the execution time of applications on very large-scale platforms, for failure and checkpointing characteristics corresponding to today's platforms; and 2) that our DPNextFailureGlobal dynamic programming approach is close to the optimal periodic solution (in fact close to PeriodLB, which uses a prohibitively expensive numerical search for the best period).

A clear direction for future work, which we are currently exploring and which could lead to results in the final version of this article, is to understand the reasons for the unexpectedly good results of OptExp with g = 2. In terms of longer-term objectives, it would be interesting to generalize this work beyond the case of coordinated checkpointing. For instance, we plan to study hierarchical checkpointing schemes with message logging. Another interesting direction would be to study process replication and compare it to group replication, both theoretically and through simulations.


Acknowledgements

This work was supported in part by the ANR RESCUE project, and by the INRIA-Illinois Joint Laboratory for Petascale Computing.

References

[1] G. Amdahl. The validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings, volume 30, pages 483–485. AFIPS Press, 1967.

[2] L. Bautista Gomez, A. Nukada, N. Maruyama, F. Cappello, and S. Matsuoka. Transparent low-overhead checkpoint for GPU-accelerated clusters. https://wiki.ncsa.illinois.edu/download/attachments/17630761/INRIA-UIUC-WS4-lbautista.pdf?version=1&modificationDate=1290470402000.

[3] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. SIAM, 1997.

[4] Marin Bougeret, Henri Casanova, Mikael Rabie, Yves Robert, and Frédéric Vivien. Checkpointing strategies for parallel jobs. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 33:1–33:11, New York, NY, USA, 2011. ACM.

[5] Marin Bougeret, Henri Casanova, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. Using group replication for resilience on exascale systems. Research Report RR-xxxx, INRIA, ENS Lyon, France, 2011. Available at graal.ens-lyon.fr/~yrobert/.

[6] Mohamed-Slim Bouguerra, Thierry Gautier, Denis Trystram, and Jean-Marc Vincent. A flexible checkpoint/restart model in distributed systems. In PPAM, volume 6067 of LNCS, pages 206–215, 2010.

[7] Franck Cappello, Henri Casanova, and Yves Robert. Checkpointing vs. migration for post-petascale supercomputers. In ICPP'2010. IEEE Computer Society Press, 2010.

[8] V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan, and W. P. Zeggert. Proactive management of software aging. IBM J. Res. Dev., 45(2):311–332, 2001.

[9] J. T. Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems, 22(3):303–312, 2006.

[10] Jack Dongarra, Pete Beckman, Patrick Aerts, Frank Cappello, Thomas Lippert, Satoshi Matsuoka, Paul Messina, Terry Moore, Rick Stevens, Anne Trefethen, and Mateo Valero. The international exascale software project: a call to cooperative action by the global high-performance community. Int. J. High Perform. Comput. Appl., 23(4):309–322, 2009.


[11] C. Engelmann, H. H. Ong, and S. L. Scott. The case for modular redundancy in large-scale high performance computing systems. In Proc. of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), pages 189–194, 2009.

[12] K. Ferreira, J. Stearley, J. H. Laros III, R. Oldfield, K. Pedretti, R. Brightwell, R. Riesen, P. G. Bridges, and D. Arnold. Evaluating the viability of process replication reliability for exascale systems. In Proceedings of the 2011 ACM/IEEE Conference on Supercomputing, 2011.

[13] F. Gärtner. Fundamentals of fault-tolerant distributed computing in asynchronous environments. ACM Computing Surveys, 31(1), 1999.

[14] T. Heath, R. P. Martin, and T. D. Nguyen. Improving cluster availability using workstation validation. SIGMETRICS Perf. Eval. Rev., 30(1):217–227, 2002.

[15] W. M. Jones, J. T. Daly, and N. DeBardeleben. Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters. In HPDC'10, pages 276–279. ACM, 2010.

[16] Nick Kolettis and N. Dudley Fulton. Software rejuvenation: Analysis, module and applications. In FTCS '95, page 381, Washington, DC, USA, 1995. IEEE CS.

[17] D. Kondo, A. Chien, and H. Casanova. Scheduling task parallel applications for rapid application turnaround on enterprise desktop grids. Journal of Grid Computing, 5(4):379–405, 2007.

[18] Derrick Kondo, Bahman Javadi, Alexandru Iosup, and Dick Epema. The Failure Trace Archive: Enabling comparative analysis of failures in diverse distributed systems. In IEEE International Symposium on Cluster Computing and the Grid, pages 398–407, 2010.

[19] Y. Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, M. Paun, and S. L. Scott. An optimal checkpoint/restart model for a large scale high performance computing system. In IPDPS 2008, pages 1–9. IEEE, 2008.

[20] E. Meneses. Clustering parallel applications to enhance message logging protocols. https://wiki.ncsa.illinois.edu/download/attachments/17630761/INRIA-UIUC-WS4-emenese.pdf?version=1&modificationDate=1290466786000.

[21] Michael Pinedo. Scheduling: theory, algorithms, and systems (3rd edition).Springer, 2008.

[22] Vivek Sarkar et al. Exascale software study: Software challenges in extreme scale systems, 2009. White paper available at: http://users.ece.gatech.edu/mrichard/ExascaleComputingStudyReports/ECSS%20report%20101909.pdf.

[23] B. Schroeder and G. Gibson. Understanding failures in petascale computers. Journal of Physics: Conference Series, 78(1), 2007.


k     PeriodLB (g = 2)   OptExp (g = 2)   Relative degradation
0.5   35.57              81.20            128.30%
0.6   21.50              30.10            39.99%
0.7   15.38              17.14            11.44%
0.8   12.27              12.45            1.46%
0.9   10.45              10.47            0.21%

Table 1: Makespans (in days) obtained by PeriodLB with g = 2 and OptExp with g = 2, and relative degradation of OptExp, as a function of k, the shape parameter of the Weibull distribution of failures, on platforms with 1,048,576 processors.
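As a sanity check, the last column of Table 1 can be recomputed from the first two: the relative degradation is OptExp/PeriodLB − 1. The recomputed values agree with the reported ones to within the rounding of the table entries:

```python
# Recompute Table 1's relative degradation of OptExp (g=2)
# with respect to PeriodLB (g=2): OptExp/PeriodLB - 1.
rows = [  # (k, PeriodLB, OptExp, reported degradation in %)
    (0.5, 35.57, 81.20, 128.30),
    (0.6, 21.50, 30.10, 39.99),
    (0.7, 15.38, 17.14, 11.44),
    (0.8, 12.27, 12.45, 1.46),
    (0.9, 10.45, 10.47, 0.21),
]
for k, plb, oe, reported in rows:
    degr = 100.0 * (oe / plb - 1.0)
    print(f"k={k}: computed {degr:.2f}% (reported {reported}%)")
```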

[24] B. Schroeder and G. A. Gibson. A large-scale study of failures in high-performance computing systems. In Proc. of DSN, pages 249–258, 2006.

[25] K. Venkatesh. Analysis of dependencies of checkpoint cost and checkpoint interval of fault tolerant MPI applications. Analysis, 2(08):2690–2697, 2010.

[26] L. Wang, P. Karthik, Z. Kalbarczyk, R. K. Iyer, L. Votta, C. Vick, and A. Wood. Modeling coordinated checkpointing for large-scale supercomputers. In Proc. of the International Conference on Dependable Systems and Networks, pages 812–821, June 2005.

[27] S. Yi, D. Kondo, B. Kim, G. Park, and Y. Cho. Using replication and checkpointing for reliable task management in computational grids. In Proc. of the International Conference on High Performance Computing & Simulation, 2010.

[28] John W. Young. A first order approximation to the optimum checkpoint interval. Communications of the ACM, 17(9):530–531, 1974.

[29] Z. Zheng and Z. Lan. Reliability-aware scalability models for high performance computing. In Proc. of the IEEE Conference on Cluster Computing, 2009.


A Detailed simulation results


[Table 2, panels (a) Perfectly parallel jobs, (b) Generic parallel jobs, (c) Numerical kernels: average makespans (in days) for the g = 1 heuristics (OptExp, PeriodLB, DPNextFailure) and the g = 2 heuristics (OptExp, OptExpGroup, PeriodLB, DPNextFailureAsap, DPNextFailureSynchro, DPNextFailureGlobal) as a function of the number of processors p. Numerical entries not recoverable from the extraction.]

Table 2: Evaluation of the different heuristics on a platform with Exponential failures (MTBF = 125 years) under the constant overhead model.


[Table 3, panels (a) Perfectly parallel jobs, (b) Generic parallel jobs, (c) Numerical kernels: same layout as Table 2. Numerical entries not recoverable from the extraction.]

Table 3: Evaluation of the different heuristics on a platform with Exponential failures (MTBF = 125 years) under the proportional overhead model.


[Table 4, panels (a) Perfectly parallel jobs, (b) Generic parallel jobs, (c) Numerical kernels: same layout as Table 2. Numerical entries not recoverable from the extraction.]

Table 4: Evaluation of the different heuristics on a platform with Weibull failures (MTBF = 125 years and k = 0.70) under the constant overhead model.


[Table 5, panels (a) Perfectly parallel jobs, (b) Generic parallel jobs, (c) Numerical kernels: same layout as Table 2. Numerical entries not recoverable from the extraction.]

Table 5: Evaluation of the different heuristics on a platform with Weibull failures (MTBF = 125 years and k = 0.70) under the proportional overhead model.


[Table 6, panels (a) Perfectly parallel jobs, (b) Generic parallel jobs, (c) Numerical kernels: same layout as Table 2. Numerical entries not recoverable from the extraction.]

Table 6: Evaluation of the different heuristics on a platform with failures based on the failure log of LANL cluster 18 and under the constant overhead model.


[Table 7, panels (a) Perfectly parallel jobs, (b) Generic parallel jobs, (c) Numerical kernels: same layout as Table 2. Numerical entries not recoverable from the extraction.]

Table 7: Evaluation of the different heuristics on a platform with failures based on the failure log of LANL cluster 18 and under the proportional overhead model.


Using group replication for resilience on exascale systems 36g

Table 8: Evaluation of the different heuristics on a platform with failures based on the failure log of LANL cluster 19 and under the constant overhead model. Panels: (a) perfectly parallel jobs; (b) generic parallel jobs; (c) numerical kernels. [The numeric entries, for each processor count p from 4096 to 131072 and each heuristic run with one group (g = 1) and two groups (g = 2), with min/avg/max columns, are garbled beyond recovery in the extracted text.]


Table 9: Evaluation of the different heuristics on a platform with failures based on the failure log of LANL cluster 19 and under the proportional overhead model. Panels: (a) perfectly parallel jobs; (b) generic parallel jobs; (c) numerical kernels. [The numeric entries, for each processor count p and each heuristic run with one group (g = 1) and two groups (g = 2), with min/avg/max columns, are garbled beyond recovery in the extracted text.]


Figure 9: Evaluation of the different heuristics on a platform with Exponential failures (MTBF = 125 years). Each panel plots the average makespan (in days) against the number of processors (2^14 to 2^20), under a) the constant overhead model and b) the proportional overhead model, for (1) perfectly parallel jobs, (2) generic parallel jobs, and (3) numerical kernels. Heuristics compared: DPNextFailureGlobal, DPNextFailureSynchro, DPNextFailureAsap, DPNextFailureOptExpGroup, PeriodLB, and PeriodLBOptExp (each with 2 groups), and OptExp.


Figure 10: Evaluation of the different heuristics on a platform with Weibull failures (MTBF = 125 years, and k = 0.70). Each panel plots the average makespan (in days) against the number of processors (2^14 to 2^20), under a) the constant overhead model and b) the proportional overhead model, for (1) perfectly parallel jobs, (2) generic parallel jobs, and (3) numerical kernels. Heuristics compared: DPNextFailureGlobal, DPNextFailureSynchro, DPNextFailureAsap, DPNextFailureOptExpGroup, PeriodLB, and PeriodLBOptExp (each with 2 groups), and OptExp.
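The Weibull failure model used in these experiments is parameterized by a shape k = 0.70 and a per-processor MTBF of 125 years; since the mean of a Weibull distribution is scale x Gamma(1 + 1/k), the scale must be derived from the target MTBF. The sketch below (our own illustration, not code from the report; the function name is ours) shows how such failure inter-arrival times could be sampled:

```python
import math
import random

def weibull_failure_times(mtbf_years, shape, n, seed=42):
    """Draw n failure inter-arrival times (in years) from a Weibull
    distribution with shape k, scaled so the mean equals the MTBF
    (mean = scale * Gamma(1 + 1/k))."""
    scale = mtbf_years / math.gamma(1.0 + 1.0 / shape)
    rng = random.Random(seed)
    # weibullvariate(alpha, beta) takes the scale alpha and shape beta
    return [rng.weibullvariate(scale, shape) for _ in range(n)]

# Parameters used in the figure: per-processor MTBF = 125 years, k = 0.70.
samples = weibull_failure_times(125.0, 0.70, 100_000)
mean = sum(samples) / len(samples)
print(f"empirical mean: {mean:.1f} years")  # should be close to the 125-year MTBF
```

With k < 1 the hazard rate is decreasing, which is why Weibull distributions with shapes around 0.7 are considered more representative of real-world cluster failures than the memoryless Exponential model.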


Figure 11: Evaluation of the different heuristics on a platform with failures based on the failure log of LANL cluster 18. Each panel plots the average makespan (in days) against the number of processors (2^12 to 2^17), under a) the constant overhead model and b) the proportional overhead model, for (1) perfectly parallel jobs, (2) generic parallel jobs, and (3) numerical kernels. Heuristics compared: DPNextFailureGlobal, DPNextFailureSynchro, DPNextFailureAsap, DPNextFailureOptExpGroup, PeriodLB, and PeriodLBOptExp (each with 2 groups), and OptExp.


Figure 12: Evaluation of the different heuristics on a platform with failures based on the failure log of LANL cluster 19. Each panel plots the average makespan (in days) against the number of processors (2^12 to 2^17), under a) the constant overhead model and b) the proportional overhead model, for (1) perfectly parallel jobs, (2) generic parallel jobs, and (3) numerical kernels. Heuristics compared: DPNextFailureGlobal, DPNextFailureSynchro, DPNextFailureAsap, DPNextFailureOptExpGroup, PeriodLB, and PeriodLBOptExp (each with 2 groups), and OptExp.


Publisher
Inria
Domaine de Voluceau - Rocquencourt
BP 105 - 78153 Le Chesnay Cedex
inria.fr
ISSN 0249-6399