INRIA - Institut National de Recherche en Informatique et en Automatique
Journal article, Parallel Computing, Year: 2019

Comparing the performance of rigid, moldable and grid-shaped applications on failure-prone HPC platforms

Abstract

This paper compares the performance of different approaches to tolerating failures for applications executing on large-scale failure-prone platforms. We study (i) Rigid applications, which use a constant number of processors throughout execution; (ii) Moldable applications, which can use a different number of processors after each restart following a fail-stop error; and (iii) GridShaped applications, which are moldable applications restricted to using rectangular processor grids (as in many dense linear algebra kernels). We start with checkpoint/restart, the de facto standard approach. For each application type, we compute the optimal number of failures to tolerate (i.e., the number that maximizes the yield of the application) before relinquishing the current allocation and waiting until a new resource can be allocated, and we determine the optimal yield that can be achieved. For GridShaped applications, we also investigate Algorithm-Based Fault Tolerance (ABFT) techniques and perform the same analysis, computing the optimal number of failures to tolerate and the associated yield. We instantiate our performance model with realistic application scenarios and make it publicly available for further use. We show that using spare nodes grants a much better yield than currently used strategies that restart after each failure. Moreover, the yield is similar for Rigid, Moldable and GridShaped applications, while the optimal number of failures to tolerate is very high, even for a short wait time between allocations. Finally, Moldable applications have the advantage of restarting less frequently than Rigid applications.
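The trade-off described in the abstract (tolerating more failures per allocation versus paying for idle spare nodes) can be illustrated with a small toy model. The sketch below is not the paper's performance model: it assumes a Rigid application on `n_work` nodes plus `spares` spare nodes, independent exponentially distributed node failures, an allocation that is relinquished at failure number `spares + 1`, and a first-order Young/Daly estimate of checkpoint waste. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import math


def expected_allocation_time(n_work, spares, mtbf_ind):
    """Expected time until failure number (spares + 1) strikes the allocation.

    Assumption: every allocated node (working or spare) fails independently
    with exponential inter-arrival times of mean mtbf_ind, and failed nodes
    are not repaired within the allocation.
    """
    total = n_work + spares
    # With (total - i) nodes still alive, the mean time to the next failure
    # is mtbf_ind / (total - i).
    return sum(mtbf_ind / (total - i) for i in range(spares + 1))


def toy_yield(n_work, spares, mtbf_ind, ckpt_cost, wait_time):
    """Toy yield: fraction of allocated node-time spent on useful work."""
    alloc_time = expected_allocation_time(n_work, spares, mtbf_ind)
    platform_mtbf = mtbf_ind / n_work
    # First-order Young/Daly waste at the optimal checkpoint period.
    ckpt_efficiency = max(0.0, 1.0 - math.sqrt(2.0 * ckpt_cost / platform_mtbf))
    # Share of wall-clock time spent inside an allocation rather than waiting.
    busy_fraction = alloc_time / (alloc_time + wait_time)
    # Share of allocated nodes that actually compute (spares sit idle).
    node_fraction = n_work / (n_work + spares)
    return ckpt_efficiency * busy_fraction * node_fraction


def best_failure_budget(n_work, mtbf_ind, ckpt_cost, wait_time, max_spares=200):
    """Brute-force search for the number of spares maximizing the toy yield."""
    return max(
        range(max_spares + 1),
        key=lambda f: toy_yield(n_work, f, mtbf_ind, ckpt_cost, wait_time),
    )


if __name__ == "__main__":
    # Hypothetical scenario: 10,000 nodes, 10-year node MTBF,
    # 60 s checkpoints, 1 h wait between allocations.
    n, mu = 10_000, 10 * 365 * 24 * 3600.0
    f = best_failure_budget(n, mu, ckpt_cost=60.0, wait_time=3600.0)
    print("spares:", f, "yield:", toy_yield(n, f, mu, 60.0, 3600.0))
    print("restart after each failure:", toy_yield(n, 0, mu, 60.0, 3600.0))
```

Even in this simplified setting, the yield with a well-chosen number of spares exceeds the yield obtained by restarting after every failure, which is the qualitative behavior the abstract reports. The numbers it produces should not be read as results from the paper.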
Main file
S0167819118302230.pdf (816.89 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03360189, version 1 (22-10-2021)

License

Attribution - NonCommercial

Cite

Valentin Le Fèvre, Thomas Herault, Yves Robert, Aurelien Bouteiller, Atsushi Hori, et al. Comparing the performance of rigid, moldable and grid-shaped applications on failure-prone HPC platforms. Parallel Computing, 2019, 85, pp.1-12. ⟨10.1016/j.parco.2019.02.002⟩. ⟨hal-03360189⟩