Using group replication for resilience on exascale systems
ISSN 0249-6399   ISRN INRIA/RR--7876--FR+ENG
RESEARCH REPORT N° 7876 — February 2012
Project-Team ROMA
Using group replication for resilience on exascale systems
Marin Bougeret, Henri Casanova, Yves Robert, Frédéric Vivien, Dounia Zaidouni
RESEARCH CENTRE GRENOBLE – RHÔNE-ALPES
Inovallée, 655 avenue de l'Europe Montbonnot, 38334 Saint Ismier Cedex
Using group replication for resilience on exascale systems

Marin Bougeret*, Henri Casanova†, Yves Robert‡§¶, Frédéric Vivien‖, Dounia Zaidouni‖

Project-Team ROMA

Research Report n° 7876 — February 2012 — 41 pages

* LIRMM Montpellier, France
† University of Hawai‘i at Manoa, Honolulu, USA
‡ LIP, École Normale Supérieure de Lyon, France
§ University of Tennessee Knoxville, USA
¶ Institut Universitaire de France
‖ INRIA, Lyon, France
Abstract: High performance computing applications must be resilient to faults, which are common occurrences, especially in post-petascale settings. The traditional fault-tolerance solution is checkpoint-recovery, by which the application saves its state to secondary storage throughout execution and recovers from the latest saved state in case of a failure. An oft-studied research question is that of the optimal checkpointing strategy: when should state be saved? Unfortunately, even using an optimal checkpointing strategy, the checkpointing frequency must increase as platform scale increases, leading to higher checkpointing overhead. This overhead precludes high parallel efficiency for large-scale platforms, thus mandating other, more scalable fault-tolerance mechanisms. One such mechanism is replication, which can be used in addition to checkpoint-recovery. Using replication, multiple processors perform the same computation so that a processor failure does not necessarily imply application failure. While at first glance replication may seem wasteful, it may be significantly more efficient than using solely checkpoint-recovery at large scale. In this work we investigate a simple approach where entire application instances are replicated. We provide a theoretical study of checkpoint-recovery with replication in terms of expected application execution time, under an exponential distribution of failures. We design dynamic-programming based algorithms to define checkpointing dates that work under any failure distribution. We also conduct simulation experiments assuming that failures follow Exponential or Weibull distributions, the latter being more representative of real-world systems, and using failure logs from production clusters. Our results show that replication is useful in a variety of realistic application and checkpointing cost scenarios for future exascale platforms.

Key-words: Fault-tolerance, replication, checkpointing, parallel job, Weibull, exascale
Replication for improving the resilience of applications on exascale systems

Résumé: High-performance computing applications must be resilient to failures, as failures will not be rare events on post-petascale platforms. Fault-tolerance is traditionally achieved through a checkpoint-and-restart mechanism, by which the application saves its state to a secondary storage system and, in case of a failure, restarts from the last saved state. An oft-studied question is that of the optimal checkpointing strategy: when should the state be saved? Unfortunately, even when an optimal checkpointing strategy is used, the checkpointing frequency must increase with the platform size, mechanically increasing the cost of checkpointing. This cost precludes very high efficiency on very large-scale platforms, and requires the use of other, more scalable, fault-tolerance mechanisms. One potential mechanism is replication, which can be used jointly with a checkpoint-and-restart solution. With replication, several processors execute the same computation so that the failure of one of them does not necessarily imply a failure for the application. While at first glance such an approach wastes resources, replication can be significantly more efficient than checkpoint-and-restart techniques alone on very large-scale platforms. In the present study we consider a simple approach in which an entire application is replicated. We provide a theoretical study of an execution scheme with replication when the failure distribution follows an Exponential law. We propose algorithms for determining checkpointing dates when the failure distribution follows an arbitrary law. We also conduct an experimental study, by means of simulations, based on failure distributions following an Exponential law, a Weibull law (which is more representative of real systems), or drawn from logs of production clusters. Our results show that replication is beneficial for a set of realistic application models and checkpointing costs, in the context of future exascale platforms.

Mots-clés: Fault-tolerance, replication, checkpointing, parallel job, Weibull, exascale
1 Introduction

As plans are made for deploying post-petascale high performance computing (HPC) systems [10, 22], solutions need to be developed to ensure resilience to failures, which occur because not all faults can be automatically detected and corrected in hardware. For instance, the 224,162-core Jaguar platform is reported to experience on the order of 1 failure per day [20, 2], and its scale is modest compared to platforms in the plans for the next decade. For applications that enroll large numbers of processors, or perhaps all of them, a failure is the common case rather than the exception. One can recover from a failure by resuming execution from a previously saved fault-free execution state, or checkpoint. Checkpoints are saved to resilient storage throughout execution (usually periodically). More frequent checkpoints lead to less loss when a failure occurs but to higher overhead during fault-free execution. A checkpointing strategy specifies when checkpoints should be taken.
A large literature is devoted to developing efficient checkpointing strategies, i.e., ones that minimize expected job execution time, including both theoretical and practical efforts. The former typically rely on assumptions regarding the probability distributions of times to failure of the processors (e.g., Exponential, Weibull), while the latter rely on simulations driven by failure datasets obtained on real-world platforms. In a previous paper [4], we made several contributions in this context, including optimal solutions for Exponential failures and dynamic programming solutions in the general case.

A major issue with checkpoint-recovery is scalability: the necessary checkpoint frequency for tolerating failures in large-scale platforms is so large that processors spend more time saving state than computing. It is thus expected that future platforms will lead to unacceptably low parallel efficiency if only checkpoint-recovery is used, no matter how good the checkpointing strategy. Consequently, additional mechanisms must be used. In this work we focus on replication: several processors perform the same computation synchronously, so that a fault on one of these processors does not lead to an application failure. Replication is an age-old fault-tolerance technique, but it has gained traction in the HPC context only relatively recently. While replication wastes compute resources in fault-free executions, it can alleviate the poor scalability of checkpoint-recovery.

Consider a parallel application that is moldable, meaning that it can be executed on an arbitrary number of processors, with each processor running one application process. In our group replication approach, multiple application instances are executed. One could, for instance, execute 2 distinct n-process application instances on a 2n-processor platform. Each instance runs at a smaller scale, meaning that it has better parallel efficiency than a single 2n-process instance due to a lower checkpointing frequency. Furthermore, once an instance saves a checkpoint, the other instance can use this checkpoint immediately. Given the above, our contributions in this work are:
• A theoretical analysis of the optimal number of processors to use for a checkpoint-recovery execution of a parallel application, for various parallel workload models, under Exponential failure distributions;
• An effective approach for group replication, with a theoretical analysis bounding expected execution time for Exponential failure distributions, and several dynamic programming solutions working for general failure distributions;
• Extensive simulations showing that group replication can indeed lower application running times, and that some of our proposed strategies deliver good performance.

The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 defines the theoretical framework and states key assumptions. Section 4 discusses the optimal number of processors for a checkpoint-recovery execution of a parallel application under an Exponential distribution of failures. Section 5 provides several approaches for group replication, and the theoretical analysis of one of them. Section 6 describes the experimental methodology and Section 7 presents simulation results. Finally, Section 8 concludes with a summary of our findings and with future perspectives.
2 Related work

Checkpointing policies have been widely studied in the literature. In [9], Daly studies periodic checkpointing policies for Exponentially distributed failures, generalizing the well-known bound obtained by Young [28]. Daly extended his work in [15] to study the impact of sub-optimal checkpointing periods. In [25], the authors develop an "optimal" checkpointing policy, based on the popular assumption that optimal checkpointing must be periodic. In [6], Bouguerra et al. prove that the optimal checkpointing policy is periodic when checkpointing and recovery overheads are constant, for either Exponential or Weibull failures. But their results rely on the unstated assumption that all processors are rejuvenated after each failure and after each checkpoint. In our recent work [4], we have shown that this assumption is unreasonable for Weibull failures. We have developed optimal solutions for Exponential failures and dynamic programming solutions for any failure distribution, demonstrating performance improvements over checkpointing approaches proposed in the literature in the case of Weibull and log-based failures. The Weibull distribution is recognized as a reasonable approximation of failures in real-world systems [14, 24]. The work in this paper relates to checkpointing policies in the sense that we study a replication mechanism that is used as an addition to checkpointing. Part of our results build on the algorithms and results in [4].

In spite of all the above advances, several studies have questioned the feasibility of pure checkpoint-recovery for large-scale systems (see [12] for a discussion of this issue and for references to such studies). In this work, we study the use of replication as a mechanism complementary to checkpoint-recovery. Replication has long been used as a fault-tolerance mechanism in distributed systems [13] and more recently in the context of volunteer computing [17]. The idea of using replication together with checkpoint-recovery has been studied in the context of grid computing [27]. One concern about replication in HPC is the induced resource waste. However, given the scalability limitations of pure checkpoint-recovery, replication has recently received more attention in the HPC literature [23, 29, 11].

In this work we study "group replication," by which multiple application instances are executed on different groups of processors. An orthogonal approach, "process replication," was recently studied by Ferreira et al. [12] in the context of MPI applications. To achieve fault-tolerance, each MPI process is replicated
in a way that is transparent to the application developer. While this approach can lead to good fault-tolerance, one of its drawbacks is that it increases the number and the volume of communications. Let $V_{tot}$ be the total volume of inter-processor communications for a traditional execution. With process replication using $g$ replicas per replica-group, each original communication now involves $g$ sources and $g$ destinations, hence the total communication volume becomes $V_{tot} \times g^2$. Instead, with group replication using $g$ groups, each original communication takes place $g$ times, hence the total communication volume increases only to $V_{tot} \times g$. Another drawback of process replication is that it requires the use of a customized MPI library (such as the prototype developed by the authors in [12]). By contrast, group replication is completely agnostic to the parallel runtime system and thus does not even require MPI. Nevertheless, even for MPI applications, group replication provides an out-of-the-box fault-tolerance solution that can be used until process replication possibly becomes a mainstream feature in MPI implementations.
3 Framework

We consider the execution of a tightly-coupled parallel application, or job, on a platform composed of $p$ processors. We use the term processor to indicate any individually scheduled compute resource (a core, a multi-core processor, a cluster node) so that our work applies regardless of the granularity of the platform. We assume that system-level checkpoint-recovery is enabled.
The job must complete $W$ units of (divisible) work, which can be split arbitrarily into separate chunks. The job can execute on any number $q \le p$ of processors. Letting $W(q)$ be the time required for a failure-free execution on $q$ processors, we use three models:

• Perfectly parallel jobs: $W(q) = W/q$.
• Generic parallel jobs: $W(q) = (1-\gamma)W/q + \gamma W$. As in Amdahl's law [1], $\gamma < 1$ is the fraction of the work that is inherently sequential.
• Numerical kernels: $W(q) = W/q + \gamma W^{2/3}/\sqrt{q}$. This is representative of a matrix product or a LU/QR factorization of size $N$ on a 2D-processor grid, where $W = O(N^3)$. In the algorithm in [3], $q = r^2$ and each processor receives $2r$ blocks of size $N^2/r^2$ during the execution. Here $\gamma$ is the communication-to-computation ratio of the platform.

Each participating processor is subject to failures. A failure causes a downtime period of the failing processor, of duration $D$. When a processor fails, the whole execution is stopped, and all processors must recover from the previous checkpoint. We let $C(q)$ denote the time needed to perform a checkpoint, and $R(q)$ the time to perform a recovery. The downtime accounts for software rejuvenation (i.e., rebooting [16, 8]) or for the replacement of the failed processor by a spare. Regardless, we assume that after a downtime the processor is fault-free and begins a new lifetime at the beginning of the recovery period. This recovery period corresponds to the time needed to restore the last checkpoint. Assuming that the application's memory footprint is $V$ bytes, with each processor holding $V/q$ bytes, we consider two scenarios:

• Proportional overhead: $C(q) = R(q) = \alpha V/q = C/q$ with $\alpha$ some constant, for cases where the bandwidth of the network card/link at each processor is the I/O bottleneck.
• Constant overhead: $C(q) = R(q) = \alpha V = C$ with $\alpha$ some constant, for cases where the bandwidth to/from the resilient storage system is the I/O bottleneck.

We assume coordinated checkpointing [26], meaning that no message logging/replay is needed when recovering from failures. We assume that failures can happen during recovery or checkpointing, but not during a downtime (otherwise, the downtime period could be considered part of the recovery period). We assume that the parallel job is tightly coupled, meaning that all $q$ processors operate synchronously throughout the job execution. These processors execute the same amount of work $W(q)$ in parallel, chunk by chunk. The total time (on one processor) to execute a chunk of size $\omega$ and then checkpoint it is $\omega + C(q)$. Finally, we assume that failure arrivals at all processors are independent and identically distributed (iid).
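For concreteness, the cost models above can be written down directly. The sketch below is illustrative only: the function names and all numerical values are ours, not taken from the paper.

```python
# Sketch of the framework's cost models (illustrative helper names).

def work_time(W, q, model="perfect", gamma=0.0):
    """Failure-free execution time W(q) on q processors."""
    if model == "perfect":        # perfectly parallel: W/q
        return W / q
    if model == "generic":        # Amdahl-style: (1-gamma)*W/q + gamma*W
        return (1 - gamma) * W / q + gamma * W
    if model == "numerical":      # numerical kernels: W/q + gamma*W^(2/3)/sqrt(q)
        return W / q + gamma * W ** (2 / 3) / q ** 0.5
    raise ValueError(model)

def checkpoint_time(C, q, scenario="constant"):
    """Checkpoint (and recovery) cost C(q) under the two I/O scenarios."""
    return C / q if scenario == "proportional" else C

def chunk_time(omega, C, q, scenario="constant"):
    """Total time to execute a chunk of size omega and then checkpoint it."""
    return omega + checkpoint_time(C, q, scenario)
```

For instance, with the proportional-overhead scenario, doubling $q$ halves both the per-chunk checkpoint cost and the per-processor work.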
4 Optimal number of processors for execution

Let $E(q)$ be the expectation of the execution time, or makespan, when using $q$ processors, and $q_{opt}$ the value of $q$ that minimizes $E(q)$. Is it true that the optimal solution is to use all processors, i.e., $q_{opt} = p$? If not, what can we say about the value of $q_{opt}$? This question was partially and empirically addressed in [25], via experiments for 4 MPI applications for up to 35 processors. Our approach here is radically different since we target large-scale platforms and seek theoretical results in the form of optimal solutions. The main objective of this section is to show analytically that, for Exponential failures, $E(q)$ may reach its minimum for some finite value of $q$ (implying that $q_{opt}$ is not necessarily equal to $p$).
Assume that failure inter-arrival times follow an Exponential distribution with parameter $\lambda$. In our recent work [4], we have shown that the optimal strategy to minimize the expected makespan $E(q)$ is to split $W$ into $K^* = \max(1, \lfloor K_0(q) \rfloor)$ or $K^* = \lceil K_0(q) \rceil$ same-size chunks, whichever leads to the smaller value, where
$$K_0(q) = \frac{q\lambda W(q)}{1 + L(-e^{-q\lambda C(q)-1})}$$
is the optimal (non-integer) number of chunks. $L$ denotes the Lambert function, defined by $L(z)e^{L(z)} = z$. This result shows that the optimal strategy is periodic and that the optimal expectation of the makespan is:
$$E^*(q) = K^*(q)\left(\frac{1}{q\lambda} + E(X_D(q))\right)e^{q\lambda R(q)}\left(e^{\frac{q\lambda W(q)}{K^*(q)} + q\lambda C(q)} - 1\right) \quad (1)$$
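As a numerical illustration of this chunking rule, the sketch below computes $K_0(q)$ and the two integer candidates for $K^*$, using a pure-Python Newton iteration for the Lambert function. The helper names and all platform values are placeholders of ours, not taken from the paper.

```python
import math

def lambert_w0(z, tol=1e-12):
    """Principal branch of the Lambert function, L(z)e^{L(z)} = z, via Newton
    iteration; adequate here since z = -e^{-q*lam*C-1} lies in (-1/e, 0)."""
    w = 0.0
    for _ in range(200):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w

def optimal_chunks(q, lam, Wq, Cq):
    """K0(q) = q*lam*W(q) / (1 + L(-e^{-q*lam*C(q)-1})) and the integer
    candidates for K* (the better of floor and ceil, both at least 1)."""
    K0 = q * lam * Wq / (1.0 + lambert_w0(-math.exp(-q * lam * Cq - 1.0)))
    return K0, {max(1, math.floor(K0)), max(1, math.ceil(K0))}

# Illustrative platform: q = 1024 processors, lam = 1e-7 failures/s per
# processor, W(q) = 1e6 s of work, C(q) = 600 s.
K0, candidates = optimal_chunks(1024, 1e-7, 1e6, 600)
```

Between the two candidates, one would keep whichever yields the smaller value of Equation (1).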
where $E(X_D(q))$ denotes the expectation of the downtime. It turns out that, although we can compute the optimal number of chunks (and thus the chunk size), we cannot compute $E^*(q)$ analytically, because $E(X_D(q))$ is difficult to compute. This is because a processor can fail while another one is down, thus prolonging the downtime. With a single processor ($q = 1$), $X_D(q)$ has constant value $D$, but with several processors there could be cascading downtimes. It turns out that we can compute the following lower and upper bounds for $E(X_D(q))$:

Proposition 1. Let $X_D(q)$ denote the downtime of a group of $q$ processors. Then
$$D \le E(X_D(q)) \le \frac{e^{(q-1)\lambda D} - 1}{(q-1)\lambda} \quad (2)$$
Proof. In [4], we have shown that the optimal expectation of the makespan is computed as:
$$E^*(q) = K^*(q)\left(\frac{1}{q\lambda} + E(T_{rec}(q))\right)\left(e^{\frac{q\lambda W(q)}{K^*(q)} + q\lambda C(q)} - 1\right) \quad (3)$$
where $E(T_{rec}(q))$ denotes the expectation of the recovery time, i.e., the time spent recovering from failures during the computation of a chunk. All chunks have the same recovery time because they all have the same size and because of the memoryless property of the Exponential distribution. It turns out that although we can compute the optimal number of chunks (and thus the chunk size), we cannot compute $E^*(q)$ analytically because $E(T_{rec}(q))$ is difficult to compute. We write the following recursion:
$$T_{rec}(q) = \begin{cases} X_D(q) + R(q) & \text{if no processor fails during } R(q) \text{ units of time,} \\ X_D(q) + T_{lost}(R(q)) + T_{rec}(q) & \text{otherwise.} \end{cases} \quad (4)$$
$X_D(q)$ is the downtime of a group of $q$ processors, that is, the time between the first failure of one of the processors and the first time at which all of them are available (accounting for the fact that a processor can fail while another one is down, thus prolonging the downtime). $T_{lost}(R(q))$ is the amount of time spent computing by these processors before the first failure, knowing that the next failure occurs within the next $R(q)$ units of time. In other terms, it is the compute time that is wasted because checkpoint recovery was not completed. The time until the next failure of a group of $q$ processors is the minimum of $q$ iid Exponentially distributed variables, and is thus Exponential with parameter $q\lambda$. We can compute
$$E(T_{lost}(R(q))) = \frac{1}{q\lambda} - \frac{R(q)}{e^{q\lambda R(q)} - 1}$$
(see [4] for details). Plugging this value into Equation (4) leads to:
$$E(T_{rec}(q)) = e^{-q\lambda R(q)}\left(E(X_D(q)) + R(q)\right) + \left(1 - e^{-q\lambda R(q)}\right)\left(E(X_D(q)) + \frac{1}{q\lambda} - \frac{R(q)}{e^{q\lambda R(q)} - 1} + E(T_{rec}(q))\right) \quad (5)$$
Equation (5) reads as follows: after the downtime $X_D(q)$, either the recovery succeeds for everybody, or there is a failure during the recovery and another attempt must be made. Both events are weighted by their respective probabilities. Simplifying the above expression we get:
$$E(T_{rec}(q)) = E(X_D(q))\,e^{q\lambda R(q)} + \frac{1}{q\lambda}\left(e^{q\lambda R(q)} - 1\right) \quad (6)$$
Plugging this expression back into Equation (3), we obtain the value given in Equation (1).
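As a numerical sanity check (ours, not the paper's), plugging the closed form of Equation (6) back into the right-hand side of Equation (5) must reproduce the same value; the parameter values below are illustrative placeholders.

```python
import math

# Illustrative placeholder values for q, lambda, R(q), and E(X_D(q)).
q, lam, R, EXD = 64, 1e-5, 300.0, 60.0
qlR = q * lam * R

# Closed form of Equation (6)
ETrec = EXD * math.exp(qlR) + (math.exp(qlR) - 1) / (q * lam)

# Right-hand side of Equation (5), evaluated with the closed form above
ETlost = 1 / (q * lam) - R / (math.exp(qlR) - 1)
rhs = math.exp(-qlR) * (EXD + R) + \
      (1 - math.exp(-qlR)) * (EXD + ETlost + ETrec)

assert abs(rhs - ETrec) < 1e-6 * ETrec   # fixed point: (5) and (6) agree
```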
Now we establish the desired bounds on $E(X_D(q))$. We always have $X_D(q) \ge X_D(1) \ge D$, hence the lower bound. For the upper bound, consider a date at which one of the $q$ processors, say processor $i_0$, just had a failure and initiates its downtime period of $D$ time units. Some other processors might be in the middle of their downtime period: for each processor $i$, $1 \le i \le q$, let $t_i$ denote the remaining duration of the downtime of processor $i$. We have $0 \le t_i \le D$ for $1 \le i \le q$, $t_{i_0} = D$, and $t_i = 0$ means that processor $i$ is up and running. Let $X_D^{t_1,..,t_q}(q)$ be the remaining downtime of a group of $q$ processors, knowing that processor $i$, $1 \le i \le q$, will still be down for a duration of $t_i$, and that a failure just happened (i.e., there exists $i_0$ such that $t_{i_0} = D$). Given the values of the $t_i$'s, we have the following equation for the random variable $X_D^{t_1,..,t_q}(q)$:
$$X_D^{t_1,..,t_q}(q) = \begin{cases} D & \text{if none of the processors of the group fails during the next } D \text{ units of time,} \\ T_{lost}^{t_1,..,t_q}(D) + X_D^{t'_1,..,t'_q}(q) & \text{otherwise.} \end{cases}$$
In the second case of the equation, consider the next $D$ time-units. Processor $i$ can only fail in the last $D - t_i$ of these time-units. Here the values of the $t'_i$'s depend on the $t_i$'s and on $T_{lost}^{t_1,..,t_q}(D)$. Indeed, except for the last processor to fail, say $i_1$, for which $t'_{i_1} = D$, we have $t'_i = \max\{t_i - T_{lost}^{t_1,..,t_q}(D),\, 0\}$. More importantly, we always have $T_{lost}^{t_1,..,t_q}(D) \le T_{lost}^{D,0,..,0}(D)$ and $X_D^{t_1,..,t_q}(q) \le X_D^{D,0,..,0}(q)$, because the probability for a processor to fail during $D$ time units is always larger than that to fail during $D - t_i$ time-units. Thus, $E(X_D^{t_1,..,t_q}(q)) \le E(X_D^{D,0,..,0}(q))$. Following the same line of reasoning, we derive an upper bound for $X_D^{D,0,..,0}(q)$:
$$X_D^{D,0,..,0}(q) \le \begin{cases} D & \text{if none of the } q-1 \text{ running processors of the group fails during the downtime } D, \\ T_{lost}^{D,0,..,0}(D) + X_D^{D,0,..,0}(q) & \text{otherwise.} \end{cases}$$
Weighting both cases by their probability and taking expectations, we obtain
$$E\!\left(X_D^{D,0,..,0}(q)\right) \le e^{-(q-1)\lambda D} D + \left(1 - e^{-(q-1)\lambda D}\right)\left(E\!\left(T_{lost}^{D,0,..,0}(D)\right) + E\!\left(X_D^{D,0,..,0}(q)\right)\right)$$
hence $E\!\left(X_D^{D,0,..,0}(q)\right) \le D + \left(e^{(q-1)\lambda D} - 1\right) E\!\left(T_{lost}^{D,0,..,0}(D)\right)$, with
$$E\!\left(T_{lost}^{D,0,..,0}(D)\right) = \frac{1}{(q-1)\lambda} - \frac{D}{e^{(q-1)\lambda D} - 1}.$$
We derive
$$E\!\left(X_D^{t_1,..,t_q}(q)\right) \le E\!\left(X_D^{D,0,..,0}(q)\right) \le \frac{e^{(q-1)\lambda D} - 1}{(q-1)\lambda},$$
which concludes the proof. As a sanity check, we observe that the upper bound is at least $D$, using the identity $e^x \ge 1 + x$ for $x \ge 0$.
While in a failure-free environment $E^*(q)$ would always decrease as $q$ increases, using the above lower bound on $E(X_D(q))$ we obtain the following results:
Theorem 1. When the failure distribution follows an Exponential law, $E^*(q)$ reaches its minimum for some finite value of $q$ in the following scenarios: all job types (perfectly parallel, generic, and numerical) with constant overhead, and generic or numerical jobs with proportional overhead.

Note that the only open scenario is with perfectly parallel jobs and proportional overhead. In this case the lower bound on $E^*(q)$ decreases to some constant value while the upper bound goes to $+\infty$ as $q$ increases.
Proof. We show that $\lim_{q \to +\infty} E^*(q) = +\infty$ for the relevant scenarios. We first plug the lower bound of Equation (2) into Equation (6) and obtain:
$$E(T_{rec}(q)) \ge D e^{q\lambda R(q)} + \frac{1}{q\lambda}\left(e^{q\lambda R(q)} - 1\right).$$
From Equation (1) we then derive the lower bound:
$$E^*(q) \ge K_0(q)\left(\frac{1}{q\lambda} + D\right)e^{q\lambda R(q)}\left(e^{\frac{q\lambda W(q)}{K_0(q)} + q\lambda C(q)} - 1\right)$$
using the fact that, by definition, the expression on the right-hand side of Equation (1) is minimized by $K_0$, where $K_0(q) = \frac{q\lambda W(q)}{1 + L(-e^{-q\lambda C(q)-1})}$.

With constant overhead. Let us consider the case of perfectly parallel jobs ($W(q) = W/q$) with constant checkpointing overhead ($C(q) = R(q) = C$). We get the lower bound:
$$E^*(q) \ge K_0(q)\left(\frac{1}{q\lambda} + D\right)e^{q\lambda C}\left(e^{\frac{\lambda W}{K_0(q)} + q\lambda C} - 1\right)$$
where $K_0(q) = \frac{\lambda W}{1 + L(-e^{-q\lambda C - 1})}$. When $q$ tends to $+\infty$, $K_0(q)$ goes to $\lambda W$, while $\left(\frac{1}{q\lambda} + D\right)e^{q\lambda C}\left(e^{\frac{\lambda W}{K_0(q)} + q\lambda C} - 1\right)$ goes to $+\infty$. Consequently, $E^*(q)$ is bounded below by a quantity that goes to $+\infty$, which concludes the proof. This result also implies that $E^*(q)$ reaches a minimum for a finite value of $q$ for the other job types (generic, numerical) with constant overhead, simply because the execution time is larger in those cases than with perfectly parallel jobs.

Generic parallel jobs with proportional overhead. Here we assume that $W(q) = W/q + \gamma W$, and use proportional overhead: $C(q) = R(q) = C/q$. We get the lower bound:
$$E^*(q) \ge K_0(q)\left(\frac{1}{q\lambda} + D\right)e^{\lambda C}\left(e^{\frac{\lambda W + q\lambda\gamma W}{K_0(q)} + \lambda C} - 1\right)$$
where $K_0(q) = \frac{\lambda W + q\lambda\gamma W}{1 + L(-e^{-\lambda C - 1})}$. As before, we show that $\lim_{q \to +\infty} E^*(q) = +\infty$ to get the result. When $q$ tends to $+\infty$, $K_0(q)$ tends to $+\infty$, while $\left(\frac{1}{q\lambda} + D\right)e^{\lambda C}\left(e^{\frac{\lambda W + q\lambda\gamma W}{K_0(q)} + \lambda C} - 1\right)$ tends to some positive constant. This concludes the proof. Note that this proof also serves for generic parallel jobs with constant overhead, simply because the execution time is larger in that case than with proportional overhead.
Numerical kernels with proportional overhead. Here we assume that $W(q) = W/q + \gamma W^{2/3}/\sqrt{q}$, and use proportional overhead: $C(q) = R(q) = C/q$. We get the lower bound:
$$E^*(q) \ge K_0(q)\left(\frac{1}{q\lambda} + D\right)e^{\lambda C}\left(e^{\frac{\lambda W + \lambda\gamma W^{2/3}\sqrt{q}}{K_0(q)} + \lambda C} - 1\right)$$
where $K_0(q) = \frac{\lambda W + \lambda\gamma W^{2/3}\sqrt{q}}{1 + L(-e^{-\lambda C - 1})}$. As before, we show that $\lim_{q \to +\infty} E^*(q) = +\infty$ to get the result. When $q$ tends to $+\infty$, $K_0(q)$ tends to $+\infty$, while $\left(\frac{1}{q\lambda} + D\right)e^{\lambda C}\left(e^{\frac{\lambda W + \lambda\gamma W^{2/3}\sqrt{q}}{K_0(q)} + \lambda C} - 1\right)$ tends to some positive constant. This concludes the proof.
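To illustrate Theorem 1, the sketch below evaluates the lower bound on $E^*(q)$ for perfectly parallel jobs with constant overhead over a range of $q$ and locates the minimum. The pure-Python Lambert iteration and all platform parameters are our own illustrative choices, not values from the paper.

```python
import math

def lambert_w0(z, tol=1e-12):
    # Principal branch L(z) via Newton iteration; z lies in (-1/e, 0) here.
    w = 0.0
    for _ in range(200):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w

def Estar_lower(q, lam, W, C, D):
    """Lower bound on E*(q) for a perfectly parallel job, constant overhead:
    K0(q) * (1/(q*lam) + D) * e^{q*lam*C} * (e^{lam*W/K0(q) + q*lam*C} - 1)."""
    K0 = lam * W / (1.0 + lambert_w0(-math.exp(-q * lam * C - 1.0)))
    return K0 * (1 / (q * lam) + D) * math.exp(q * lam * C) * \
           (math.exp(lam * W / K0 + q * lam * C) - 1.0)

# Illustrative platform parameters.
lam, W, C, D = 1e-7, 1e6, 1.0, 1.0
qs = [10 ** k for k in range(3, 9)]
vals = [Estar_lower(q, lam, W, C, D) for q in qs]
best = qs[vals.index(min(vals))]
# The minimum is attained at an interior q, not at the largest scale.
```

With these numbers, the bound first decreases (more parallelism) and then blows up (too-frequent checkpoints and failures), so the best $q$ is finite, as the theorem asserts.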
5 Group replication

Since using all processors to run a single application instance may not make sense in light of Theorem 1, the group replication approach consists in executing multiple application instances on different processor groups, where the number of processors in a group is closer to $q_{opt}$. All groups compute the same chunk simultaneously, and do so until one of them succeeds, potentially after several failed trials. Then all other groups stop executing that chunk and recover from the checkpoint stored by the successful group. All groups then attempt to compute the next chunk. Group replication can be implemented easily with no modification to the application, provided that the recovery implementation allows a group to recover immediately from a checkpoint produced by another group. In this section we formalize group replication as an execution protocol called ASAP (As Soon As Possible), and analyze its performance for Exponential failures. We then introduce dynamic programming solutions that work with general failure distributions.

5.1 The ASAP execution protocol

We consider $g$ groups, where each group has $q$ processors, with $g \times q \le p$. A group is available for execution if and only if all its $q$ processors are available. In case of a failure, the downtime of a group is a random variable $X_D(q) \ge D$, whose expectation is bounded in Proposition 1. If a group encounters a first processor failure at time $t$, the group is down between $t$ and $t + X_D(q)$.

The ASAP algorithm proceeds in $k$ macro-steps. During macro-step $j$, $1 \le j \le k$, each group independently attempts to execute the $j$-th chunk of size $\omega_j$ and to checkpoint, restarting as soon as possible in case of a failure. As soon as one of the groups succeeds, say at time $t_j^{end}$, all the other groups are immediately stopped, macro-step $j$ is over, and macro-step $(j+1)$ starts (if $j < k$). Note that the value of $k$, the total number of chunks, as well as the chunk sizes $\omega_j$, are inputs to the algorithm (we always have $\sum_{j=1}^{k} \omega_j = W(q)$). We provide an analytical evaluation of ASAP for Exponential failure laws, and discuss how to choose these values, in Section 5.2.

Two important aspects must be mentioned. First, before being able to start macro-step $(j+1)$, a group that has been stopped must execute a recovery, in order to restart from the checkpoint of a successful group. Second, this recovery may start later than time $t_j^{end}$, in the case where the group is down at time $t_j^{end}$. An example is shown in Figure 1, in which Group 1 cannot start the recovery at time $t_1^{end}$. The only groups that do not need to recover at the beginning of the next step are the groups that were successful for the previous step, except during the first step, at which all groups can start computing right away.

Figure 1: Execution of chunks $\omega_1$ and $\omega_2$ (macro-steps 1 and 2) using the ASAP protocol. At time $t_1^{end}$, Group 1 is not ready, and Group 2 is the only one that does not need to recover. (The figure shows the timelines of Groups 1–3, with failures, processor and group downtimes, recoveries $R(q)$, and checkpoints $C(q)$.)

We now provide an analytical evaluation of ASAP for Exponential failure laws, and show how to compute the number of macro-steps $k$ and the values of the chunk sizes $\omega_j$.
5.2 Exponential failures

Let us assume that the failure rate of each processor obeys an Exponential law of parameter $\lambda$. For the sake of the theoretical analysis, we introduce a slightly modified version of the ASAP protocol in which all groups, including the successful ones, execute a recovery at the beginning of all macro-steps, including the first one. This new version of ASAP is described in Algorithm 1. It is completely symmetric, which renders its analysis easier: for macro-step $j$ to be successful, one of the groups must be up and running for a duration of $R(q) + \omega_j + C(q)$.
Algorithm 1: ASAP($\omega_1, \ldots, \omega_k$)
1  for $j = 1$ to $k$ do
2      for each group do in parallel
3          repeat
4              Finish current downtime (if any)
5              Try to perform a recovery, then a chunk of size $\omega_j$, and finally to checkpoint
6              if execution successful then
7                  Signal other groups to immediately stop their attempts
8          until one of the groups has a successful attempt
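For Exponential failures, a macro-step of this symmetric protocol is easy to simulate: a group's attempt succeeds when an Exponential($q\lambda$) failure-free interval covers $R(q) + \omega_j + C(q)$, and the step ends when the fastest group succeeds. The sketch below is a simplified illustration of ours (the group downtime is approximated by its lower bound $D$, and all numerical values are placeholders):

```python
import random

def group_step_time(q, lam, R, omega, C, D, rng):
    """Time for ONE group to complete recovery + chunk + checkpoint,
    restarting after each failure (downtime simplified to D)."""
    need = R + omega + C
    t = 0.0
    while True:
        x = rng.expovariate(q * lam)   # failure-free interval of the group
        if x >= need:
            return t + need            # success: attempt completed
        t += x + D                     # failed attempt, then a downtime

def asap_macro_step(g, q, lam, R, omega, C, D, rng):
    """The macro-step ends as soon as the fastest of the g groups succeeds."""
    return min(group_step_time(q, lam, R, omega, C, D, rng) for _ in range(g))

rng = random.Random(0)
trials = [asap_macro_step(g=3, q=64, lam=1e-4, R=5.0, omega=50.0, C=5.0,
                          D=2.0, rng=rng) for _ in range(10000)]
mean_step = sum(trials) / len(trials)
# mean_step >= R + omega + C, and shrinks as the number of groups g grows
```

Comparing `g=1` with `g=3` on the same parameters shows the benefit of redundant groups: the minimum over several independent attempts finishes sooner on average than a single one.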
Consider the $j$-th macro-step, number the attempts of all groups by their start time, and let $N_j$ be the index of the earliest-started attempt that successfully computes chunk $\omega_j$. In Figure 2, we have $j = 2$; the successful chunk of size $R + \omega_2 + C$ is the fourth attempt, so $N_2 = 4$. To represent each attempt, we sample random variables $X_i^j$ and $Y_i^j$, $1 \le i \le N_j$, that correspond respectively to the $i$-th tentative execution of the chunk and to the $i$-th downtime that follows it (if $i \ne N_j$). Note that $X_i^j < R + \omega_j + C$ for $i < N_j$, and $X_{N_j}^j \ge R + \omega_j + C$. All the $X_i^j$'s follow the same distribution $\mathcal{D}_X$, namely an Exponential law of parameter $q\lambda$. And all the $Y_i^j$'s follow the same distribution $\mathcal{D}_{X_D(q)}$, that of the random variable $X_D(q)$ corresponding to the downtime of a group of $q$ processors.

Figure 2: Zoom on macro-step 2 of the execution depicted in Figure 1, using the $(X, Y)$ notation of Algorithm 2. Attempt $i$ (of step 2) has size $X_i^2$ and is followed by a downtime of size $Y_i^2$. Recall that Job$_i$ has size $X_i^2 + Y_i^2$ for $1 \le i \le 3$, and Job$_4$ has size $R(q) + \omega_2 + C(q)$.
The main idea here is to view the $N_j$ execution attempts as jobs, where the size of job $i$ is $X_i^j + Y_i^j$, and to distribute them across the $g$ groups using the classical online list scheduling algorithm for independent jobs [21, section 5.6]. This formulation (see Proposition 2) allows us to provide an upper bound for the starting time of job $N_j$, and hence for the length of macro-step $j$, using a well-known scheduling argument (see Proposition 3). We then derive an upper bound for the expected execution time of ASAP (see Theorem 2).
Algorithm 2: Step $j$ of ASAP($\omega_1, \ldots, \omega_k$)
1 $i \leftarrow 1$ /* i represents the number of attempts for the job */
2 $L \leftarrow \emptyset$ /* L represents the list of attempts for the job */
3 Sample $X_i^j$ and $Y_i^j$ using $D_X$ and $D_{X_D(q)}$ respectively
4 while $X_i^j < R(q) + \omega_j + C(q)$ do
5   Add Job$_i$, with processing time $X_i^j + Y_i^j$, to $L$
6   $i \leftarrow i + 1$
7   Sample $X_i^j$ and $Y_i^j$ using $D_X$ and $D_{X_D(q)}$ respectively
8 $N_j \leftarrow i$
9 Add Job$_{N_j}$, with processing time $R(q) + \omega_j + C(q)$, to $L$ /* the first successful job has size $R(q) + \omega_j + C(q)$, not $X_{N_j}^j + Y_{N_j}^j$ */
10 From time $t_{j-1}^{end}$ on, execute a List Scheduling algorithm to distribute jobs of $L$ to the different groups (recall that some groups may not be ready at time $t_{j-1}^{end}$)
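The sampling loop of Algorithm 2 (lines 1–9) can be sketched as follows. The helper name and the Exponential model used for the downtime $Y$ (a stand-in for $D_{X_D(q)}$, whose exact form is given elsewhere in the report) are our own assumptions.

```python
import random

def build_job_list(q, lam, R, C, omega, mean_downtime, rng):
    """Sample the job list L of Algorithm 2 for one macro-step: X ~ Exp(q*lam)
    is the uptime before a group-level failure; failed attempts become jobs of
    size X + Y, and the first X reaching R + omega + C becomes the last job,
    of size exactly R + omega + C."""
    target = R + omega + C
    jobs = []
    while True:
        x = rng.expovariate(q * lam)
        if x >= target:
            jobs.append(target)   # Job_{N_j}: the successful attempt
            return jobs
        y = rng.expovariate(1.0 / mean_downtime)  # downtime after the failure
        jobs.append(x + y)        # failed attempt followed by its downtime
```

The length of the returned list is $N_j$, whose expectation is $e^{\lambda q (R + \omega_j + C)}$ by Proposition 3 below.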
Proposition 2. The $j$-th macro-step of the ASAP protocol can be simulated using Algorithm 2: the last job scheduled by Algorithm 2 ends exactly at time $t_j^{end}$.

Proof. The List Scheduling algorithm distributes the next job to the first available group. Because of the memoryless property of Exponential laws, it is
[Figure: between $t_{j-1}^{end}$ and $t_j^{end}$, the interval of length $T_{(R(q)+\omega_j+C(q))}^{truestart}$ is followed by the successful attempt $X_{N_j}^j$, of size $R(q) + \omega_j + C(q)$.]

Figure 3: Notations used in Proposition 3.
equivalent (i) to generate the attempts a priori and greedily schedule them, or(ii) to generate them independently within each group.
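The greedy distribution step (line 10 of Algorithm 2) is ordinary list scheduling. A minimal sketch (our own helper names), with `ready_times` playing the role of the $Y_x$ offsets used in the proof of Proposition 3:

```python
import heapq

def list_schedule_last_start(jobs, ready_times):
    """Assign each job, in order, to the earliest-available group and return
    the start time of the last job (i.e. the start of Job_{N_j}, which is
    T^{truestart} for the macro-step).  ready_times[x] is the time Y_x at
    which group x becomes ready."""
    avail = list(ready_times)
    heapq.heapify(avail)
    start = 0.0
    for size in jobs:
        start = heapq.heappop(avail)      # earliest available group
        heapq.heappush(avail, start + size)
    return start
```

On jobs [3, 1, 2] with two groups ready at time 0, the last job starts at time 1, below the average-load bound $(0 + 0 + 3 + 1)/2 = 2$ used in the proof of Proposition 3.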
Proposition 3. Let $T_{(R(q)+\omega_j+C(q))}^{truestart}$ be the time elapsed between $t_{j-1}^{end}$ and the beginning of Job$_{N_j}$ (see Figure 3). We have
$$E\Big(T_{(R(q)+\omega_j+C(q))}^{truestart}\Big) \le E(Y) + \frac{E(N_j)E(X) - E(X_{N_j}^j) + (E(N_j) - 1)E(Y)}{g}$$
where $X$ and $Y$ are random variables corresponding to an attempt (sampled using $D_X$ and $D_{X_D(q)}$ respectively). Moreover, we have $E(N_j) = e^{\lambda q (R(q)+\omega_j+C(q))}$ and $E(X_{N_j}^j) = \frac{1}{q\lambda} + R(q) + \omega_j + C(q)$.
Proof. For group $x$, $1 \le x \le g$, let $Y_x$ denote the time elapsed before it is ready for macro-step $j$. For example in Figure 2, we have $Y_1 > 0$ (group 1 is down at time $t_{j-1}^{end}$), while $Y_2 = Y_3 = 0$ (groups 2 and 3 are ready to compute at time $t_{j-1}^{end}$). Proposition 2 has shown that executing macro-step $j$ can be simulated by executing a List Schedule on a job list $L$ (see Algorithm 2). We now consider $g$ "jobs" $\widetilde{Job}_x$, $x = 1, \ldots, g$, so that $\widetilde{Job}_x$ has duration $Y_x$. We now consider the augmented job list $L' = L \cup \bigcup_{x=1}^{g} \widetilde{Job}_x$. Note that $L'$ may contain more jobs than macro-step $j$: the jobs that start after the successful job Job$_{N_j}$ are discarded from the list $L'$. However, both schedules have the same makespan, and jobs common to both systems have the same start and completion dates. Thus, we have
$$T_{(R(q)+\omega_j+C(q))}^{truestart} \le \frac{\sum_{x=1}^{g} Y_x + \sum_{i=1}^{N_j-1} (X_i^j + Y_i^j)}{g}.$$
This key inequality is due to the property of list scheduling: the group which is assigned the last job is the least loaded when this assignment is decided, hence its load does not exceed the average load (which is the total load divided by the number of groups). Given that $E(Y_x) \le E(Y)$, we derive
$$E\Big(T_{(R(q)+\omega_j+C(q))}^{truestart}\Big) \le E(Y) + \frac{E\Big(\sum_{i=1}^{N_j-1} X_i^j\Big) + E\Big(\sum_{i=1}^{N_j-1} Y_i^j\Big)}{g}$$
But $N_j$ is the stopping criterion of the $(X_i^j)$ sequence, hence using Wald's theorem we have $E\big(\sum_{i=1}^{N_j} X_i^j\big) = E(N_j)E(X)$, which leads to $E\big(\sum_{i=1}^{N_j-1} X_i^j\big) = E(N_j)E(X) - E(X_{N_j}^j)$. Moreover, as $N_j$ and $Y_i^j$ are independent variables,
we have $E\big(\sum_{i=1}^{N_j-1} Y_i^j\big) = (E(N_j) - 1)E(Y)$, and we get the desired bound for $E\big(T_{(R(q)+\omega_j+C(q))}^{truestart}\big)$.

Finally, as the expected number of attempts when repeating independently until success an event of probability $\alpha$ is $\frac{1}{\alpha}$ (geometric law), we get $E(N_j) = e^{\lambda q (R(q)+\omega_j+C(q))}$. The value of $E(X_{N_j}^j)$ can be directly computed from the definition, recalling that $X_{N_j}^j \ge R(q) + \omega_j + C(q)$ and each $X_i^j$ follows an Exponential distribution of parameter $q\lambda$.
Theorem 2. The expected execution time of ASAP has the following upper bound:
$$\frac{g-1}{g}\,W(q) + \frac{1}{g}\left(\frac{1}{q\lambda} + E(Y)\right) e^{\lambda q (R(q)+C(q))}\, k^{*} e^{\lambda q \frac{W(q)}{k^{*}}} + k^{*}\left(\frac{g-1}{g}\,(E(Y) + R(q) + C(q)) - \frac{1}{g}\,\frac{1}{q\lambda}\right)$$
which is obtained when using $k^{*} = \max(1, \lfloor k_0 \rfloor)$ or $k^{*} = \lceil k_0 \rceil$ same-size chunks, whichever leads to the smaller value, where
$$k_0 = \frac{\lambda q W(q)}{1 + \mathbb{L}\left(\left(g - 1 + \frac{(g-1)q\lambda(R(q)+C(q)) - g}{1 + q\lambda E(Y)}\right) e^{-(1+\lambda q (R(q)+C(q)))}\right)}$$
($\mathbb{L}$ denotes the Lambert function, defined by $\mathbb{L}(z)e^{\mathbb{L}(z)} = z$).
Proof. From Proposition 3, the expected execution time of ASAP has upper bound $T_{ASAP} = \sum_{j=1}^{k} \alpha_j$, where
$$\alpha_j = E(Y) + \frac{E(N_j)E(X) - E(X_{N_j}^j) + (E(N_j) - 1)E(Y)}{g} + (R(q) + \omega_j + C(q)).$$
Our objective now is to find the inputs to the ASAP algorithm, namely the number $k$ of macro-steps together with the chunk sizes $(\omega_1, \ldots, \omega_k)$, that minimize this $T_{ASAP}$ bound.

We first have to prove that any optimal (in expectation) policy uses only a finite number of chunks. Let $\alpha$ be the expectation of the ASAP makespan using a unique chunk of size $W(q)$. According to Proposition 3,
$$\alpha = E\big(T_{(R(q)+W(q)+C(q))}^{truestart}\big) + C(q) + W(q) + R(q),$$
and is finite. Thus, if an optimal policy uses $k^{*}$ chunks, we must have $k^{*} C(q) \le \alpha$, and thus $k^{*}$ is bounded.
In the proof of Theorem 1 in [4], we have shown that any deterministic strategy uses the same sequence of chunk sizes, whatever the failure scenario, thanks to the memoryless property of the Exponential distribution. We cannot prove such a result in the current context. For instance, the number of groups performing a downtime at time $t_1^{end}$ depends on the scenario. There is thus no reason a priori for the size of the second chunk to be independent of the scenario. To overcome this difficulty, we restrict our analysis to strategies that use the same sequence of chunk sizes whatever the failure scenario. We optimize $T_{ASAP}$ in that context, at the possible cost of finding a larger upper bound.

We thus suppose that we have a fixed number of chunks, $k$, and a sequence of chunk sizes $(\omega_1, \ldots, \omega_k)$, and we look for the values of $(\omega_1, \ldots, \omega_k)$ that minimize
$T_{ASAP} = \sum_{j=1}^{k} \alpha_j$. Let us first compute one of the $\alpha_j$ terms. Replacing $E(N_j)$ and $E(X_{N_j}^j)$ by the values given in Proposition 3, and $E(X)$ by $\frac{1}{q\lambda}$, we get
$$\alpha_j = \frac{g-1}{g}\,\omega_j + \frac{1}{g}\, e^{\lambda q (R(q)+\omega_j+C(q))} \left(\frac{1}{q\lambda} + E(Y)\right) + \frac{g-1}{g}\,(E(Y) + R(q) + C(q)) - \frac{1}{g}\,\frac{1}{q\lambda}$$
$$T_{ASAP} = \frac{g-1}{g}\,W(q) + \frac{1}{g}\left(\frac{1}{q\lambda} + E(Y)\right) e^{\lambda q (R(q)+C(q))} \sum_{j=1}^{k} e^{\lambda q \omega_j} + k\left(\frac{g-1}{g}\,(E(Y) + R(q) + C(q)) - \frac{1}{g}\,\frac{1}{q\lambda}\right)$$
By convexity, the expression $\sum_{j=1}^{k} e^{\lambda q \omega_j}$ is minimal when all $\omega_j$'s are equal (to $W(q)/k$). Hence all the chunks should be equal for $T_{ASAP}$ to be minimal. We obtain:
$$T_{ASAP} = \frac{g-1}{g}\,W(q) + \frac{1}{g}\left(\frac{1}{q\lambda} + E(Y)\right) e^{\lambda q (R(q)+C(q))}\, k\, e^{\lambda q \frac{W(q)}{k}} + k\left(\frac{g-1}{g}\,(E(Y) + R(q) + C(q)) - \frac{1}{g}\,\frac{1}{q\lambda}\right).$$
Let $f(x) = \tau_1 x e^{\lambda q \frac{W(q)}{x}} + \tau_2 x$, where
$$\tau_1 = \frac{1}{g}\left(\frac{1}{q\lambda} + E(Y)\right) e^{\lambda q (R(q)+C(q))} \quad \text{and} \quad \tau_2 = \frac{g-1}{g}\,(E(Y) + R(q) + C(q)) - \frac{1}{g}\,\frac{1}{q\lambda}.$$
A simple analysis using differentiation shows that $f$ has a unique minimum, and solving $f'(x) = 0$ leads to $\tau_1 e^{\lambda q \frac{W(q)}{k}}\big(1 - \frac{\lambda q W(q)}{k}\big) + \tau_2 = 0$, and thus to $k = \frac{\lambda q W(q)}{1 + \mathbb{L}\big(\frac{\tau_2}{\tau_1 e}\big)} = k_0$, which concludes the proof.
Using the upper bound on $E(Y) = E(X_D(q))$ given in Proposition 1, we can numerically compute the number of chunks and the expectation of the upper bound given by Theorem 2.
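Theorem 2 lends itself directly to such a numerical evaluation. The sketch below is our own code (the Newton-based `lambert_w` and all parameter names are illustrative): it computes $k_0$, rounds it to the better of $\lfloor k_0 \rfloor$ and $\lceil k_0 \rceil$, and evaluates the bound.

```python
import math

def lambert_w(z, tol=1e-12):
    """Principal branch of the Lambert function, w*e^w = z, computed by
    Newton iteration (valid for z > -1/e)."""
    w = math.log1p(z) if z > -0.3 else 0.0
    for _ in range(200):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w

def asap_bound(W, lam, q, g, R, C, EY):
    """Upper bound of Theorem 2 on the expected ASAP execution time, with
    total work W = W(q), per-processor rate lam, group size q, g groups,
    and EY = E(Y) the expected group downtime; returns (k_star, bound)."""
    tau1 = (1.0 / g) * (1.0 / (q * lam) + EY) * math.exp(lam * q * (R + C))
    tau2 = ((g - 1.0) / g) * (EY + R + C) - 1.0 / (g * q * lam)
    k0 = lam * q * W / (1.0 + lambert_w(tau2 / (tau1 * math.e)))

    def bound(k):  # the Theorem 2 expression for k same-size chunks
        return ((g - 1.0) / g) * W + tau1 * k * math.exp(lam * q * W / k) + tau2 * k

    candidates = {max(1, math.floor(k0)), max(1, math.ceil(k0))}
    k_star = min(candidates, key=bound)
    return k_star, bound(k_star)
```

For parameter ranges where $\tau_2/(\tau_1 e)$ approaches $-1/e$ the Newton iteration becomes ill-conditioned, and a library implementation of the Lambert function (e.g., `scipy.special.lambertw`) is preferable.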
5.3 Group replication heuristics for general failure distributions

The results of the previous section are limited to Exponential failures. We now turn to the general case. In Section 5.3.1 we recall the dynamic program designed in [4] to define checkpointing dates in a context without replication. We then discuss how to use this dynamic program in the context of ASAP (Section 5.3.2). Finally, in Section 5.3.3 we propose heuristics that correspond to more general execution protocols than ASAP.
Algorithm 3: DPNextFailure($W, \tau_1, \ldots, \tau_p$)
1 if $W = 0$ then return 0
2 best ← 0; chunksize ← 0
3 for $\omega$ = quantum to $W$ step quantum do
4   (expected_work, 1st_chunk) ← DPNextFailure($W - \omega$, $\tau_1 + \omega + C(p), \ldots, \tau_p + \omega + C(p)$)
5   cur_exp_work ← $P_{suc}(\tau_1 + \omega + C(p), \ldots, \tau_p + \omega + C(p) \mid \tau_1, \ldots, \tau_p) \times (\omega + \text{expected\_work})$
6   if cur_exp_work > best then best ← cur_exp_work; chunksize ← $\omega$
7 return (best, chunksize)
5.3.1 Solution without replication

According to [4], the most efficient algorithm to define checkpointing dates, for general failure distributions and when no replication is used, is the dynamic programming approach called DPNextFailure, shown in Algorithm 3. This algorithm works on a failure-by-failure basis, maximizing the expectation of the amount of work completed (and checkpointed) before the next failure occurs. This algorithm only provides an approximation of the optimal solution as it relies on a time discretization. (For the sake of simplicity, we present all dynamic programs as recursive algorithms.)
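A sketch of Algorithm 3, specialized (our own simplification) to the memoryless Exponential case: there the success probability of a chunk depends only on its length, so the $\tau_i$ histories can be dropped and the recursion memoized on the remaining work alone. Function and parameter names are illustrative.

```python
import math
from functools import lru_cache

def dp_next_failure(W, p, lam, C, quantum):
    """Sketch of DPNextFailure (Algorithm 3) for the memoryless
    (Exponential) case: the probability that p processors survive a chunk
    of size omega plus its checkpoint is e^{-lam*p*(omega + C)},
    independently of the tau_i histories, so the state reduces to the
    remaining work.  Returns (expected completed work, first chunk size)."""
    steps = round(W / quantum)

    @lru_cache(maxsize=None)
    def solve(remaining):  # remaining work, counted in quanta
        if remaining == 0:
            return 0.0, 0.0
        best, best_chunk = 0.0, 0.0
        for w in range(1, remaining + 1):
            omega = w * quantum
            p_suc = math.exp(-lam * p * (omega + C))
            exp_rest, _ = solve(remaining - w)
            cur = p_suc * (omega + exp_rest)  # line 5 of Algorithm 3
            if cur > best:
                best, best_chunk = cur, omega
        return best, best_chunk

    return solve(steps)
```

In the general (non-memoryless) case the recursion must carry the updated $\tau_i$'s, as in Algorithm 3, and cannot be memoized this way.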
5.3.2 Implementing the ASAP execution protocol

A key question when implementing ASAP is that of the chunk size. A naive approach would be to use DPNextFailure. Each group would call DPNextFailure to compute what would be the optimal chunk size for itself, as if there were no other groups. Then we have to merge these individual chunk sizes to obtain a common chunk size. This heuristic, DPNextFailureAsap, is shown in Algorithm 4 (the Alive function returns, for a list of $q$ processors, the amount of time each has been up and running since its last downtime). For the Merge operator, one could be pessimistic and take the minimum of the chunk sizes, or be optimistic and take the maximum, or attempt a trade-off by taking the average. Two important limitations of this heuristic are: 1) the Merge operator, which has no theoretical justification; and 2) the use of chunk sizes defined from an obsolete failure history. The latter limitation shows after a group is the victim of a failure: the failure history has changed significantly but the chunk size is not recomputed (this has no consequences with an Exponential distribution, as the Exponential is memoryless).
5.3.3 Other execution protocols

In order to circumvent both limitations of DPNextFailureAsap, we relax the constraint that all groups work with the same chunk size. Each group now works with its own chunk size, which is recomputed each time the group fails and each time one of the groups successfully completes its own chunk. This leads to the heuristic DPNextFailureSynchro in Algorithm 5. Each time a group successfully works for the duration of its chunk size and checkpoints its work, it signals its success to all groups, which then interrupt their own work. This
Algorithm 4: DPNextFailureAsap($W$).
1 while $W \ne 0$ do
2   for each group $x = 1..g$ do in parallel
3     $(\tau_{(x-1)q+1}, \ldots, \tau_{(x-1)q+q})$ ← Alive($(x-1)q + 1, \ldots, (x-1)q + q$)
4     (exp_work$_x$, $\omega_x$) ← DPNextFailure($W, \tau_{(x-1)q+1}, \ldots, \tau_{(x-1)q+q}$)
5   $\omega$ ← Merge($\omega_1, \ldots, \omega_g$)
6   for each group do in parallel
7     repeat
8       Try to execute a chunk of size $\omega$ and then checkpoint
9       if successful then
10        Signal other groups to immediately stop their attempts
11      else if failure then Complete downtime and perform recovery
12    until one of the groups signals its success
13  $W \leftarrow W - \omega$
14  for each group do in parallel
15    if not successful on last chunk then Perform recovery from last successfully completed checkpoint
behavior is clearly sub-optimal if the successful completion is for a very small chunk while an interrupted group that was computing a large chunk was close to completion. The problems are that: 1) chunk sizes are defined as if each group were alone; and 2) the definition of the chunk sizes does not account for the fact that the first group to complete its checkpoint defines a mandatory checkpointing date for all groups.
Algorithm 5: DPNextFailureSynchro($W$).
1 for each group $x = 1..g$ do in parallel
2   while $W \ne 0$ do
3     $(\tau_{(x-1)q+1}, \ldots, \tau_{(x-1)q+q})$ ← Alive($(x-1)q + 1, \ldots, (x-1)q + q$)
4     (exp_work$_x$, $\omega_x$) ← DPNextFailure($W, \tau_{(x-1)q+1}, \ldots, \tau_{(x-1)q+q}$)
5     Try to execute a chunk of size $\omega_x$ and then checkpoint
6     if successful then
7       Signal other groups to immediately stop their attempts
8       $W \leftarrow W - \omega_x$
9     if failure then Complete downtime
10    if failure or signal then
11      Perform recovery from last successfully completed checkpoint
To address these problems, rather than making another attempt at reusing DPNextFailure, we design a brand-new dynamic program. In [4], without replication, we found that a dynamic program minimizing the expectation of the makespan would require an intractable exponential number of states to record which processors fail and when. Hence we aimed instead at maximizing the expectation of the work completed before the next failure. We now extend this approach to the context of replication. The first failure will only interrupt a single group. Therefore, the objective should be to maximize the expectation of the work completed before all groups have failed. This approach ignores the fact that once a group has failed, it will eventually restart and resume computing. However, keeping track of such restarts would require recording which processors
have failed and when, thus once again leading to an exponential number ofstates.
To avoid having the first completed checkpoint force a checkpoint for all other groups, we design a new dynamic program, DPNextCheckpoint (Algorithm 6). DPNextCheckpoint does not define chunk sizes, i.e., amounts of work to be processed before a checkpoint is taken; instead it defines checkpoint dates. The rationale is that one checkpointing date can correspond to different amounts of work for each group, depending on when the group started to process its chunk, after either its last failure and recovery, or its last checkpoint, or its last recovery based on another group's checkpoint. The function WorkAlreadyDone (Line 3) returns, for each group, the time since it started processing its current chunk.
DPNextCheckpoint proceeds as follows. At the checkpointing date, the amount of work completed is the maximum of the amounts of work done by the different groups that successfully complete the checkpoint. Therefore, we consider all the different cases (Line 8), that is, which group $x$, among the successful groups, has done the most work. We compute the probability of each case (Line 11). All groups that started to work earlier than group $x$ have failed (i.e., at least one processor in each of them has failed) but not group $x$ (i.e., none of its processors have failed). We compute the expectation of the amount of work completed in each case (Lines 12 and 13). We then sum the contributions of all the cases (Line 14) and record the checkpointing date leading to the largest expectation (Line 15). Note that the probability computed at Line 11 explicitly states which groups have successfully completed the checkpoint and which groups have not. We choose not to take this information into account when computing the expectation (recursive call at Line 13). This is to avoid keeping track of which groups have failed, thereby lowering the complexity of the dynamic program. This explains why the conditions do not evolve in the conditional probability at Line 11.
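The probability computed at Line 11 can be written compactly given a conditional survival function. In the sketch below (our own helper; `survival(t, tau)` standing for $P(X \ge t \mid X \ge \tau)$ is an assumed callback), group $x$ is 1-indexed as in Algorithm 6 and groups are ordered by non-increasing work:

```python
def case_probability(x, survival, taus, q, delta):
    """Probability term of Line 11 of Algorithm 6: groups 1..x-1 (those
    that have done more work than group x) all fail before the checkpoint
    completes, while group x succeeds.  survival(t, tau) must return
    P(X >= t | X >= tau) for one processor; a group of q processors
    succeeds iff all q of its processors survive delta more units of time."""
    def p_suc(group):          # group is 0-indexed here
        prob = 1.0
        for tau in taus[group * q:(group + 1) * q]:
            prob *= survival(tau + delta, tau)
        return prob
    prob = p_suc(x - 1)                      # group x survives
    for y in range(x - 1):                   # all earlier groups fail
        prob *= 1.0 - p_suc(y)
    return prob
```

Any distribution fits: an Exponential gives `survival` independent of `tau` (memoryless), while a Weibull or a log-based empirical distribution makes it genuinely history-dependent.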
Finally, Algorithm 7 shows the algorithm, called DPNextFailureGlobal, that uses DPNextCheckpoint. Each time a group is affected by an event (a failure, a successful checkpoint by itself or by another group), it computes the next checkpoint date and signals the result of its computation to the other groups (e.g., by broadcasting it to the $g$ group leaders). Hence, a group may have computed the next checkpoint date to be $t$, and that date can be either unmodified, or postponed, or advanced by events occurring on other groups and by their re-computation of the best next checkpoint date. In practice, as these algorithms rely on a time discretization, at each time quantum a group can check whether the current time is a checkpoint date or not.
6 Simulation framework

In this section we detail our simulation methodology. We use both synthetic and real-world failure distributions. The source code and all simulation results are publicly available at: http://perso.ens-lyon.fr/frederic.vivien/Data/Resilience/SPAA2012/.
Algorithm 6: DPNextCheckpoint($W$, $T$, $T_0$, $\tau_1, \ldots, \tau_{gq}$)
1 if $W = 0$ then return 0
2 best_work ← 0; next_chkpt ← date
3 $(W_1, \ldots, W_g)$ ← WorkAlreadyDone($T$) /* Time since last recovery or checkpoint */
4 Reorder groups in non-increasing availabilities ($W_1$ is maximum)
5 for $t = T$ to $T + W - W_g$ step quantum do /* Loop on checkpointing date */
6   cur_work ← 0
7   for $x = 1$ to $g$ do /* Loop on the first group to successfully work until $t + C(q)$ */
8     $\delta \leftarrow (t + C(q)) - T_0$ /* Total time elapsed until the checkpoint completion */
9     proba ← $\prod_{y=1}^{x-1} P_{fail}(\tau_{(y-1)q+1} + \delta, \ldots, \tau_{(y-1)q+q} + \delta \mid \tau_{(y-1)q+1}, \ldots, \tau_{(y-1)q+q}) \times P_{suc}(\tau_{(x-1)q+1} + \delta, \ldots, \tau_{(x-1)q+q} + \delta \mid \tau_{(x-1)q+1}, \ldots, \tau_{(x-1)q+q})$
10    $\omega \leftarrow \min\{W - W_x, t - T\}$ /* Work done between $T$ and $t$ by group $x$ */
11    (rec_$\omega$, rec_$t$) ← DPNextCheckpoint($W - W_x - \omega$, $T + \omega + C(q) + R(q)$, $T_0$, $\tau_1, \ldots, \tau_{gq}$)
12    cur_work ← cur_work + proba $\times$ ($W_x + \omega$ + rec_$\omega$)
13  if cur_work > best_work then best_work ← cur_work; next_chkpt ← $t$
14 return (best_work, next_chkpt)
Algorithm 7: DPNextFailureGlobal($W$).
1 for each group $x = 1..g$ do in parallel
2   while $W \ne 0$ do
3     $(\tau_1, \ldots, \tau_{gq})$ ← Alive($1, \ldots, gq$)
4     $T_0$ ← Time() /* Current time */
5     date ← DPNextCheckpoint($W, T_0, T_0, \tau_1, \ldots, \tau_{gq}$)
6     Signal all processors that the next checkpoint date is now date
7     Try to work until date and then checkpoint
8     if successful work until date and checkpoint then
9       Let $y$ be the longest-running group without failure among the successful groups
10      Let $\omega$ be the work performed by $y$ since its last recovery or checkpoint
11      $W \leftarrow W - \omega$
12      if group $x$'s last recovery or checkpoint was strictly later than that of $y$ then
13        Perform a recovery
14    if failure then Complete downtime
15    if failure or signal then Perform recovery from last successfully completed checkpoint
6.1 Heuristics

Our simulator implements the following eight checkpointing policies:
• Two versions of the ASAP protocol: OptExp, which uses the optimal periodic policy established in [4] for Exponential failure distributions and no replication; and OptExpGroup, which uses the periodic policy defined by Theorem 2 for Exponential distributions.
• The six dynamic programming approaches of Section 5.3: DPNextFailure, the 3 variants of DPNextFailureAsap, DPNextFailureSynchro, and DPNextFailureGlobal.

Our simulator also implements PeriodLB, which is a numerical search for the optimal period for ASAP. We evaluate each candidate period on 50 randomly generated scenarios. To build the candidate periods, the period computed for OptExp is multiplied and divided by $1 + 0.05 \times i$ with $i \in \{1, \ldots, 180\}$, and by $1.1^j$ with $j \in \{1, \ldots, 60\}$. PeriodLB corresponds to the periodic policy that uses the best period found by the search. The evaluation of PeriodLB on a configuration requires running 24,000 simulations (which would be prohibitive in practice), but we include it for reference. Based on the results in [4], we do not consider any additional checkpointing policies, such as those defined by Young [28] or Daly [9] for instance. We point out that OptExp and OptExpGroup compute the checkpointing period based solely on the MTBF, implicitly assuming that failures are exponentially distributed. For the sake of completeness we nevertheless include them in all our simulations, simply using the MTBF value even when failures are not exponentially distributed. To use OptExp with $g$ groups we use the period from [4] computed with $\lfloor p/g \rfloor$ processors.
6.2 Platforms

We target two types of platforms, depending on the type of the failure distribution. For synthetic distributions, we consider platforms containing from 32,768 to 1,048,576 processors. For platforms with failures based on failure logs from production clusters, because of the limited scale of those clusters, we restrict the size of the platforms to a maximum of 131,072 processors, starting with 4,096 processors. For both platform types, we determine the job size $W$ so that a job using the whole platform would use it for a significant amount of time in the absence of failures, namely $\approx 3.5$ days on the largest platforms for synthetic failures ($W = 10{,}000$ years), and $\approx 2.8$ days on those for log-based failures ($W = 1{,}000$ years). Otherwise, we use the same parameters as in [4]: $C = R = 600$ s, $D = 60$ s, $\gamma = 10^{-6}$ for generic parallel jobs, and $\gamma = 0.1$ for numerical kernels. Note that the checkpointing overheads come from the scenarios in [7] and are smaller than those used in [12].
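The stated failure-free durations can be checked with one line of arithmetic (our own helper; perfectly parallel jobs and 365-day years assumed):

```python
def failure_free_days(work_years, processors):
    """Failure-free duration, in days, of a perfectly parallel job of
    `work_years` years of sequential work spread over `processors`."""
    return work_years * 365.0 / processors

# W = 10,000 years on 1,048,576 processors -> about 3.5 days;
# W = 1,000 years on 131,072 processors  -> about 2.8 days.
```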
6.3 Failure scenarios

Synthetic failure distributions – To choose failure distribution parameters that are representative of realistic systems, we use failure statistics from the Jaguar platform. Jaguar contains 45,208 processors and is said to experience on the order of 1 failure per day [20, 2]. Assuming a 1-day platform MTBF gives us a processor MTBF equal to $\frac{45{,}208}{365} \approx 125$ years. For the Exponential distribution, we then have $\lambda = \frac{1}{MTBF}$, and for Weibull, which requires two parameters $k$ and $\lambda$, we have $\lambda = MTBF/\Gamma(1 + 1/k)$ and we fix $k$ to 0.7 based on the results of [24]. (We have shown in [4] that the general trends were not influenced by the exact values used for $k$ or for the MTBF.)

Log-based failure distributions – We also consider failure distributions based on failure logs from production clusters. We used logs from the largest clusters among the preprocessed logs in the Failure trace archive [18], i.e., from clusters at the Los Alamos National Laboratory [24]. In these logs, each failure is tagged by the node —and not just the processor— on which the failure occurred. Among the 26 possible clusters, we opted for the only two clusters with more than 1,000 nodes, as we needed a sample history sufficiently large to simulate platforms with more than 10,000 nodes. The two chosen logs are for clusters 18 and 19 in the archive (referred to as 7 and 8 in [24]). For each log, we record the set $S$ of availability intervals. A discrete failure distribution for the simulation is then generated as follows: the conditional probability $P(X \ge t \mid X \ge \tau)$ that a node stays up for a duration $t$, knowing that it has been up for a duration $\tau$, is set to the ratio of the number of availability durations in $S$ greater than or equal to $t$, over the number of availability durations in $S$ greater than or equal to $\tau$.

Scenario generation – Given a $p$-processor job, a failure trace is a set of failure dates for each processor over a fixed time horizon $h$ (set to 2 years). The job start time is assumed to be 1 year for synthetic distribution platforms, and 0.25 years for log-based distribution platforms. We use a non-null start time to avoid side-effects related to the synchronous initialization of all processors. Given the distribution of inter-arrival times at a processor, for each processor we generate a trace via independent sampling until the target time horizon is reached. For simulations where the only varying parameter is the number of processors $a \le p \le b$, we first generate traces for $b$ processors. For experiments with $p$ processors we then simply select the first $p$ traces. This ensures that simulation results are coherent when varying $p$. Finally, the two clusters used for computing our log-based failure distributions consist of 4-processor nodes. Hence, to simulate a 131,072-processor platform we generate 32,768 failure traces, one for each four-processor node.
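The empirical conditional distribution described above can be implemented directly from the set $S$ of availability intervals. A sketch (the helper names are ours; sampling draws uniformly among the recorded durations at least as long as the current uptime $\tau$):

```python
import bisect, random

def conditional_sampler(availabilities):
    """Discrete failure distribution built from a set S of availability
    durations: P(X >= t | X >= tau) is the ratio of the number of durations
    in S that are >= t over the number that are >= tau."""
    S = sorted(availabilities)
    n = len(S)

    def survival(t, tau):
        at_least_t = n - bisect.bisect_left(S, t)
        at_least_tau = n - bisect.bisect_left(S, tau)
        return at_least_t / at_least_tau if at_least_tau else 0.0

    def sample(tau, rng):
        """Draw a residual uptime given the node has been up for tau:
        pick uniformly among the recorded durations that are >= tau."""
        i = bisect.bisect_left(S, tau)
        return rng.choice(S[i:]) - tau if i < n else 0.0

    return survival, sample
```

With $S = \{1, 2, 3, 4\}$ (hours, say), $P(X \ge 3 \mid X \ge 0) = 2/4$, and a node already up for 2 hours draws its residual uptime uniformly from $\{0, 1, 2\}$.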
7 Simulation results

In this section we discuss simulation results, but only show graphs for perfectly parallel applications under the constant overhead scenario. All results can be found in the appendix. All simulations with replication are for two groups (using three groups never leads to improvements in our experiments).
Optimal number of processors for execution – For Weibull and log-based failures with the constant overhead scenario, using all the available processors to run a single application instance leads to significantly larger makespans. This is seen in Figures 5-7, where the curves for OptExp, PeriodLB, and DPNextFailure, the no-replication strategies, shoot upward when $p$ reaches a large enough value. For instance, with traces based on the logs of LANL cluster 18, the increase in makespan is more than 37% when going from $p = 2^{16}$ to $p = 2^{17}$. This behavior is in line with the result in Theorem 1. However, in our experiments
[Figure: average makespan (in days) vs. number of processors ($2^{14}$ to $2^{20}$), for OptExp, PeriodLB, DPNextFailure, and the 2-group policies OptExp, OptExpGroup, PeriodLB, DPNextFailureAsap, DPNextFailureSynchro, and DPNextFailureGlobal.]

Figure 4: Exponential failures with MTBF = 125 years.
[Figure: average makespan (in days) vs. number of processors ($2^{14}$ to $2^{20}$), for OptExp, PeriodLB, DPNextFailure, and the 2-group policies OptExp, OptExpGroup, PeriodLB, DPNextFailureAsap, DPNextFailureSynchro, and DPNextFailureGlobal.]

Figure 5: Weibull failures with MTBF = 125 years and $k = 0.70$.
[Figure: average makespan (in days) vs. number of processors ($2^{12}$ to $2^{17}$), for OptExp, PeriodLB, DPNextFailure, and the 2-group policies OptExp, OptExpGroup, PeriodLB, DPNextFailureAsap, DPNextFailureSynchro, and DPNextFailureGlobal.]

Figure 6: Failures based on the failure log of LANL cluster 18.
[Figure: average makespan (in days) vs. number of processors ($2^{12}$ to $2^{17}$), for OptExp, PeriodLB, DPNextFailure, and the 2-group policies OptExp, OptExpGroup, PeriodLB, DPNextFailureAsap, DPNextFailureSynchro, and DPNextFailureGlobal.]

Figure 7: Failures based on the failure log of LANL cluster 19.
with Exponential failures the lowest makespan when running a single application instance is obtained when using all the processors, regardless of the application model. This is seen plainly in Figure 4, at least up to $2^{20}$ processors. This likely means that our experimental scenarios have not reached the scale at which the makespan would increase. Furthermore, running a single application instance is not always best in our results, even for Exponential failures. For instance, for generic parallel jobs OptExp (with $g = 1$) begins being outperformed by strategies that use replication at $2^{20}$ processors (see the full results in [5]). We conclude that on exascale platforms the optimal makespan may not be obtained by running a single application instance. This is similar to the result obtained in [12] in the context of process replication.
The ASAP protocol with Exponential failures – Considering only those results obtained for Exponential failures and using group replication, we find that OptExpGroup with $g = 2$ delivers performance very close to that of the numerical lower bound on any periodic policy under ASAP (PeriodLB with $g = 2$) and slightly better than the performance of OptExp with $g = 2$. These results corroborate our theoretical study. Therefore, determining chunk sizes according to Theorem 2 is more efficient than using two instances that each use the chunk size determined by OptExp.
The dynamic programming approaches – The performances of the three variants of DPNextFailureAsap (with Merge being the minimum, the average, or the maximum) are indistinguishable for Exponential and Weibull failure distributions. For log-based failure distributions, their performances are still very close, with the average and the maximum seemingly achieving better performance (the performance differences, however, may not be significant in light of the standard deviations).
DPNextFailureSynchro achieves worse performance than DPNextFailureAsap for Exponential and Weibull failure distributions. For log-based failure distributions, in some rare settings, the opposite is true. Overall, this heuristic provides disappointing results. By contrast, depending on the scenario, DPNextFailureGlobal is equivalent to the other dynamic programming approaches or outperforms them significantly. Furthermore, its performance is relatively close to that of PeriodLB with $g = 2$.
The very perplexing result here is that, regardless of the underlying failure distribution, DPNextFailureGlobal is outperformed by the periodic policy OptExp with $g = 2$. Recall that this policy erroneously assumes Exponential failures and computes the checkpointing period ignoring that replication is used! By contrast, DPNextFailureGlobal is based on strong theoretical foundations and was designed specifically to handle multiple groups efficiently. And yet, in non-Exponential failure cases, OptExp with $g = 2$ outperforms DPNextFailureGlobal (see Figures 5-7). In the case of log-based experiments (Figures 6 and 7), it may be that our datasets are not sufficiently large. We use failure data from the largest production clusters in the Failure trace archive [18], but these clusters are orders of magnitude smaller than our simulated platforms. Consequently, we are forced to "replay" the same failure data many times for many of our simulated nodes, which biases experiments. In the case of Weibull failures with $k = 0.7$ (Figure 5), it turns out that the failure distribution is sufficiently close to an Exponential, i.e., that $k$ is sufficiently close to 1, so that OptExp does well. However, smaller values of $k$ are reasonable, as seen for instance in [19] ($k \approx 0.5$) and in [24] ($0.33 \le k \le 0.49$). In Table 1 we show results for smaller values of $k$ for a platform with $2^{20}$ processors. We see that, as $k$ decreases, the performance of OptExp with $g = 2$ degrades when compared to PeriodLB, reaching more than 100% degradation for $k = 0.5$. However, DPNextFailureGlobal still achieves close-to-optimal makespans in such cases, as can be seen in Figure 8 (for $k = 0.5$). Overall, DPNextFailureGlobal achieves good to very good performance across the board. Finally, note in Figure 8 that PeriodLB achieves better performance using a replication factor of 3 ($g = 3$) than with a replication factor of 2.
8 Conclusion

In this paper we have presented a study of replication techniques for large-scale platforms. These platforms are subject to failures, the frequency of which increases dramatically with platform scale. For a variety of job types (perfectly parallel, generic, or numerical) and checkpoint cost models (constant or proportional overhead), we have shown that using the largest possible number of processors does not always lead to the smallest execution time. This is because
[Figure: average makespan (in days) vs. number of processors ($2^{14}$ to $2^{20}$), for OptExp and PeriodLB with 1, 2, and 3 groups, OptExpGroup with 2 groups, and DPNextFailureGlobal with 2 groups.]

Figure 8: Weibull failures with MTBF = 125 years and $k = 0.50$.
using more resources implies facing more failures during execution, hence wasting more time tolerating them (via an increase in checkpointing frequency) and recovering from them. This waste results in a slow-down despite the additional hardware resources.
This observation led us to investigate replication as a technique to better use all the resources provided by the platform. While others have studied process replication [12], in this work we have focused on group replication, a more general and less intrusive approach. Group replication consists in partitioning the platform into several groups, each of which executes an instance of the application concurrently. All groups coordinate to take advantage of the most advanced application state available. We have conducted an analytical study of group replication for Exponential failures when using the ASAP execution protocol, using an analogy with a list schedule to bound the number of attempts and the execution time of each group. This analogy makes it possible to derive an upper bound on the expected execution time, which can be computed numerically. We have also proposed several dynamic programming algorithms to minimize application makespan, whatever the failure distribution. We have compared all these approaches, along with no-replication approaches proposed in previous work, in simulation for both synthetic failure distributions and failure data from production clusters. Our main findings are 1) that replication can significantly lower the execution time of applications on very large scale platforms, for failure and checkpointing characteristics corresponding to today's platforms; and 2) that our DPNextFailureGlobal dynamic programming approach is close to the optimal periodic solution (in fact close to PeriodLB, which uses a prohibitively expensive numerical search for the best period).
A clear direction for future work, which we are currently exploring and which could lead to results in the final version of the article, is understanding the reasons for the unexpectedly good results of OptExp with g = 2. In terms of longer-term objectives, it would be interesting to generalize this work beyond the case of coordinated checkpointing. For instance, we plan to study hierarchical checkpointing schemes with message logging. Another interesting direction would be to study process replication and compare it to group replication, both theoretically and through simulations.
RR n° 7876
Using group replication for resilience on exascale systems 26
Acknowledgements

This work was supported in part by the ANR RESCUE project, and by the INRIA-Illinois Joint Laboratory for Petascale Computing.
References

[1] G. Amdahl. The validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings, volume 30, pages 483–485. AFIPS Press, 1967.

[2] L. Bautista Gomez, A. Nukada, N. Maruyama, F. Cappello, and S. Matsuoka. Transparent low-overhead checkpoint for GPU-accelerated clusters. https://wiki.ncsa.illinois.edu/download/attachments/17630761/INRIA-UIUC-WS4-lbautista.pdf?version=1&modificationDate=1290470402000.

[3] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. SIAM, 1997.

[4] Marin Bougeret, Henri Casanova, Mikael Rabie, Yves Robert, and Frédéric Vivien. Checkpointing strategies for parallel jobs. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 33:1–33:11, New York, NY, USA, 2011. ACM.

[5] Marin Bougeret, Henri Casanova, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. Using group replication for resilience on exascale systems. Research Report RR-xxxx, INRIA, ENS Lyon, France, 2011. Available at graal.ens-lyon.fr/~yrobert/.

[6] Mohamed-Slim Bouguerra, Thierry Gautier, Denis Trystram, and Jean-Marc Vincent. A flexible checkpoint/restart model in distributed systems. In PPAM, volume 6067 of LNCS, pages 206–215, 2010.

[7] Franck Cappello, Henri Casanova, and Yves Robert. Checkpointing vs. migration for post-petascale supercomputers. In ICPP'2010. IEEE Computer Society Press, 2010.

[8] V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan, and W. P. Zeggert. Proactive management of software aging. IBM J. Res. Dev., 45(2):311–332, 2001.

[9] J. T. Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems, 22(3):303–312, 2004.

[10] Jack Dongarra, Pete Beckman, Patrick Aerts, Franck Cappello, Thomas Lippert, Satoshi Matsuoka, Paul Messina, Terry Moore, Rick Stevens, Anne Trefethen, and Mateo Valero. The international exascale software project: a call to cooperative action by the global high-performance community. Int. J. High Perform. Comput. Appl., 23(4):309–322, 2009.
[11] C. Engelmann, H. H. Ong, and S. L. Scott. The case for modular redundancy in large-scale high performance computing systems. In Proc. of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), pages 189–194, 2009.

[12] K. Ferreira, J. Stearley, J. H. Laros III, R. Oldfield, K. Pedretti, R. Brightwell, R. Riesen, P. G. Bridges, and D. Arnold. Evaluating the Viability of Process Replication Reliability for Exascale Systems. In Proceedings of the 2011 ACM/IEEE Conference on Supercomputing, 2011.

[13] F. Gärtner. Fundamentals of fault-tolerant distributed computing in asynchronous environments. ACM Computing Surveys, 31(1), 1999.

[14] T. Heath, R. P. Martin, and T. D. Nguyen. Improving cluster availability using workstation validation. SIGMETRICS Perf. Eval. Rev., 30(1):217–227, 2002.

[15] W. M. Jones, J. T. Daly, and N. DeBardeleben. Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters. In HPDC'10, pages 276–279. ACM, 2010.

[16] Nick Kolettis and N. Dudley Fulton. Software rejuvenation: Analysis, module and applications. In FTCS '95, page 381, Washington, DC, USA, 1995. IEEE CS.

[17] D. Kondo, A. Chien, and H. Casanova. Scheduling Task Parallel Applications for Rapid Application Turnaround on Enterprise Desktop Grids. Journal of Grid Computing, 5(4):379–405, 2007.

[18] Derrick Kondo, Bahman Javadi, Alexandru Iosup, and Dick Epema. The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems. In Cluster Computing and the Grid, IEEE International Symposium on, pages 398–407, 2010.

[19] Y. Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, M. Paun, and S. L. Scott. An optimal checkpoint/restart model for a large scale high performance computing system. In IPDPS 2008, pages 1–9. IEEE, 2008.

[20] E. Meneses. Clustering Parallel Applications to Enhance Message Logging Protocols. https://wiki.ncsa.illinois.edu/download/attachments/17630761/INRIA-UIUC-WS4-emenese.pdf?version=1&modificationDate=1290466786000.

[21] Michael Pinedo. Scheduling: Theory, Algorithms, and Systems (3rd edition). Springer, 2008.

[22] Vivek Sarkar et al. Exascale software study: Software challenges in extreme scale systems, 2009. White paper available at: http://users.ece.gatech.edu/mrichard/ExascaleComputingStudyReports/ECSS%20report%20101909.pdf.

[23] B. Schroeder and G. Gibson. Understanding failures in petascale computers. Journal of Physics: Conference Series, 78(1), 2007.
k     PeriodLB (g = 2)   OptExp (g = 2)   Relative degradation
0.5   35.57              81.20            128.30%
0.6   21.50              30.10             39.99%
0.7   15.38              17.14             11.44%
0.8   12.27              12.45              1.46%
0.9   10.45              10.47              0.21%

Table 1: Makespans (in days) obtained by PeriodLB with g = 2 and OptExp with g = 2, and relative degradation of OptExp, as a function of k, the shape parameter of the Weibull distribution of failures, on platforms with 1,048,576 processors.
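The "relative degradation" column is simply the percentage increase of OptExp's makespan over PeriodLB's; recomputing it from the rounded makespans in the table reproduces the reported percentages up to rounding:

```python
# Makespans (in days) from Table 1, keyed by the Weibull shape parameter k:
# (PeriodLB with g = 2, OptExp with g = 2).
rows = {0.5: (35.57, 81.20), 0.6: (21.50, 30.10),
        0.7: (15.38, 17.14), 0.8: (12.27, 12.45), 0.9: (10.45, 10.47)}

def degradation(period_lb, opt_exp):
    """Percentage increase of OptExp's makespan over PeriodLB's."""
    return 100.0 * (opt_exp - period_lb) / period_lb

for k, (lb, oe) in sorted(rows.items()):
    print(f"k = {k}: {degradation(lb, oe):.2f}%")
```

The small discrepancies (a few hundredths of a percent) come from the table reporting makespans rounded to two decimals while the degradation was presumably computed on the unrounded values.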
[24] B. Schroeder and G. A. Gibson. A large-scale study of failures in high-performance computing systems. In Proc. of DSN, pages 249–258, 2006.

[25] K. Venkatesh. Analysis of Dependencies of Checkpoint Cost and Checkpoint Interval of Fault Tolerant MPI Applications. Analysis, 2(08):2690–2697, 2010.

[26] L. Wang, P. Karthik, Z. Kalbarczyk, R. K. Iyer, L. Votta, C. Vick, and A. Wood. Modeling Coordinated Checkpointing for Large-Scale Supercomputers. In Proc. of the International Conference on Dependable Systems and Networks, pages 812–821, June 2005.

[27] S. Yi, D. Kondo, B. Kim, G. Park, and Y. Cho. Using Replication and Checkpointing for Reliable Task Management in Computational Grids. In Proc. of the International Conference on High Performance Computing & Simulation, 2010.

[28] John W. Young. A first order approximation to the optimum checkpoint interval. Communications of the ACM, 17(9):530–531, 1974.

[29] Z. Zheng and Z. Lan. Reliability-aware scalability models for high performance computing. In Proc. of the IEEE Conference on Cluster Computing, 2009.
A Detailed simulation results
[Table: makespans (in days) of OptExp, PeriodLB, and DPNextFailure for g = 1, and of OptExp, OptExpGroup, PeriodLB, DPNextFailureAsap, DPNextFailureSynchro, and DPNextFailureGlobal for g = 2, for several platform sizes p; sub-tables (a) perfectly parallel jobs, (b) generic parallel jobs, (c) numerical kernels.]

Table 2: Evaluation of the different heuristics on a platform with Exponential failures (MTBF = 125 years) under the constant overhead model.
[Table: makespans (in days) of OptExp, PeriodLB, and DPNextFailure for g = 1, and of OptExp, OptExpGroup, PeriodLB, DPNextFailureAsap, DPNextFailureSynchro, and DPNextFailureGlobal for g = 2, for several platform sizes p; sub-tables (a) perfectly parallel jobs, (b) generic parallel jobs, (c) numerical kernels.]

Table 3: Evaluation of the different heuristics on a platform with Exponential failures (MTBF = 125 years) under the proportional overhead model.
[Table: makespans (in days) of OptExp, PeriodLB, and DPNextFailure for g = 1, and of OptExp, OptExpGroup, PeriodLB, DPNextFailureAsap, DPNextFailureSynchro, and DPNextFailureGlobal for g = 2, for several platform sizes p; sub-tables (a) perfectly parallel jobs, (b) generic parallel jobs, (c) numerical kernels.]

Table 4: Evaluation of the different heuristics on a platform with Weibull failures (MTBF = 125 years and k = 0.70) under the constant overhead model.
[Table: makespans (in days) of OptExp, PeriodLB, and DPNextFailure for g = 1, and of OptExp, OptExpGroup, PeriodLB, DPNextFailureAsap, DPNextFailureSynchro, and DPNextFailureGlobal for g = 2, for several platform sizes p; sub-tables (a) perfectly parallel jobs, (b) generic parallel jobs, (c) numerical kernels.]

Table 5: Evaluation of the different heuristics on a platform with Weibull failures (MTBF = 125 years and k = 0.70) under the proportional overhead model.
[Table: makespans (in days) of OptExp, PeriodLB, and DPNextFailure for g = 1, and of OptExp, OptExpGroup, PeriodLB, DPNextFailureAsap, DPNextFailureSynchro, and DPNextFailureGlobal for g = 2, for several platform sizes p; sub-tables (a) perfectly parallel jobs, (b) generic parallel jobs, (c) numerical kernels.]

Table 6: Evaluation of the different heuristics on a platform with failures based on the failure log of LANL cluster 18 and under the constant overhead model.
[Table: makespans (in days) of OptExp, PeriodLB, and DPNextFailure for g = 1, and of OptExp, OptExpGroup, PeriodLB, DPNextFailureAsap, DPNextFailureSynchro, and DPNextFailureGlobal for g = 2, for several platform sizes p; sub-tables (a) perfectly parallel jobs, (b) generic parallel jobs, (c) numerical kernels.]

Table 7: Evaluation of the different heuristics on a platform with failures based on the failure log of LANL cluster 18 and under the proportional overhead model.
[Table: makespans (in days) of OptExp, PeriodLB, and DPNextFailure for g = 1, and of OptExp, OptExpGroup, PeriodLB, DPNextFailureAsap, DPNextFailureSynchro, and DPNextFailureGlobal for g = 2, for several platform sizes p; sub-tables (a) perfectly parallel jobs, (b) generic parallel jobs, (c) numerical kernels.]

Table 8: Evaluation of the different heuristics on a platform with failures based on the failure log of LANL cluster 19 and under the constant overhead model.
[Table: makespans (in days) of OptExp, PeriodLB, and DPNextFailure for g = 1, and of OptExp, OptExpGroup, PeriodLB, DPNextFailureAsap, DPNextFailureSynchro, and DPNextFailureGlobal for g = 2, for several platform sizes p; sub-tables (a) perfectly parallel jobs, (b) generic parallel jobs, (c) numerical kernels.]

Table 9: Evaluation of the different heuristics on a platform with failures based on the failure log of LANL cluster 19 and under the proportional overhead model.
Using group replication for resilience on exascale systems 38
[Figure 9: the plots did not survive text extraction; only the axes, legend, and panel layout are recoverable. Three groups of panels — (1) Perfectly parallel jobs, (2) Generic parallel jobs, (3) Numerical kernels — each shown under a) the constant overhead model and b) the proportional overhead model. Every panel plots the average makespan (in days, 0 to 60) against the number of processors (2^14 to 2^20) for OptExp and, with 2 groups, DPNextFailureGlobal, DPNextFailureSynchro, DPNextFailureAsap, DPNextFailureOptExpGroup, PeriodLB, and PeriodLBOptExp.]

Figure 9: Evaluation of the different heuristics on a platform with Exponential failures (MTBF = 125 years).
[Figure 10: the plots did not survive text extraction; only the axes, legend, and panel layout are recoverable. Three groups of panels — (1) Perfectly parallel jobs, (2) Generic parallel jobs, (3) Numerical kernels — each shown under a) the constant overhead model and b) the proportional overhead model. Every panel plots the average makespan (in days, 0 to 60) against the number of processors (2^14 to 2^20) for OptExp and, with 2 groups, DPNextFailureGlobal, DPNextFailureSynchro, DPNextFailureAsap, DPNextFailureOptExpGroup, PeriodLB, and PeriodLBOptExp.]

Figure 10: Evaluation of the different heuristics on a platform with Weibull failures (MTBF = 125 years, and k = 0.70).
[Figure 11: the plots did not survive text extraction; only the axes, legend, and panel layout are recoverable. Three groups of panels — (1) Perfectly parallel jobs, (2) Generic parallel jobs, (3) Numerical kernels — each shown under a) the constant overhead model and b) the proportional overhead model. Every panel plots the average makespan (in days, 0 to 80) against the number of processors (2^12 to 2^17) for OptExp and, with 2 groups, DPNextFailureGlobal, DPNextFailureSynchro, DPNextFailureAsap, DPNextFailureOptExpGroup, PeriodLB, and PeriodLBOptExp.]

Figure 11: Evaluation of the different heuristics on a platform with failures based on the failure log of LANL cluster 18.
[Figure 12: the plots did not survive text extraction; only the axes, legend, and panel layout are recoverable. Three groups of panels — (1) Perfectly parallel jobs, (2) Generic parallel jobs, (3) Numerical kernels — each shown under a) the constant overhead model and b) the proportional overhead model. Every panel plots the average makespan (in days, 0 to 80) against the number of processors (2^12 to 2^17) for OptExp and, with 2 groups, DPNextFailureGlobal, DPNextFailureSynchro, DPNextFailureAsap, DPNextFailureOptExpGroup, PeriodLB, and PeriodLBOptExp.]

Figure 12: Evaluation of the different heuristics on a platform with failures based on the failure log of LANL cluster 19.