Skip to Main content Skip to Navigation
Reports

Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs

Abstract : This paper focuses on the resilient scheduling of parallel jobs on highperformance computing (HPC) platforms to minimize the overall completion time, or makespan. We revisit the problem by assuming that jobs are subject to transient or silent errors, and hence may need to be re-executed each time they fail to complete successfully. This work generalizes the classical framework where jobs are known offline and do not fail: in this classical framework, list scheduling that gives priority to longest jobs is known to be a 3-approximation when imposing to use shelves, and a 2-approximation without this restriction. We show that when jobs can fail, using shelves can be arbitrarily bad, but unrestricted list scheduling remains a 2-approximation. The paper focuses on the design of several heuristics, some list-based and some shelf-based, along with different priority rules and backfflling options. We assess and compare their performance through an extensive set of simulations, using both synthetic jobs and log traces from the Mira supercomputer.
Complete list of metadatas

Cited literature [49 references]  Display  Hide  Download

https://hal.inria.fr/hal-02317464
Contributor : Equipe Roma <>
Submitted on : Friday, February 21, 2020 - 10:55:06 PM
Last modification on : Wednesday, February 26, 2020 - 11:14:02 AM

File

rr9296.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-02317464, version 2

Collections

Citation

Anne Benoit, Valentin Le Fèvre, Padma Raghavan, Yves Robert, Hongyang Sun. Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs. [Research Report] RR-9296, Inria - Research Centre Grenoble – Rhône-Alpes. 2019, pp.1-29. ⟨hal-02317464v2⟩

Share

Metrics

Record views

38

Files downloads

81