AllScale Resilience Manager
In the AllScale project, we are designing and implementing a Resilience Manager from scratch. This lets us draw on best practices from state-of-the-art resilience techniques in various runtimes, including:
- User-Level Failure Mitigation (ULFM) in MPI
- Resilience strategies implemented in other task-based runtimes, such as X10, Chapel, and Charm++
We have prototyped our resilience strategy in a simulator, which is available as open source at https://hpdc-gitlab.eeecs.qub.ac.uk/kdichev/resilience-simulator.
We have completed the design of our resilience strategy. Following discussions with the developers of our pilot applications, we have established that handling node failures is their highest priority. This matters most for long-running applications, where a single node failure could otherwise discard a large amount of completed work.
Our node failure recovery is based on classic checkpoint/restart (C/R) mechanisms. However, it follows different principles from MPI-based recovery: it is informed by the task dependencies of an application kernel.
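To illustrate what recovery guided by task dependencies can mean, the following is a minimal Python sketch. It is not the AllScale implementation: the function name `tasks_to_reexecute`, the task names, and the data layout (`deps`, `owner`, `checkpointed`) are all hypothetical. The sketch walks the dependency graph backwards from the tasks whose results are needed and stops at any result that is still held by a surviving node or can be restored from a checkpoint.

```python
from typing import Dict, List, Set


def tasks_to_reexecute(
    deps: Dict[str, List[str]],   # task -> tasks it depends on (hypothetical layout)
    owner: Dict[str, int],        # task -> node that computed it
    checkpointed: Set[str],       # tasks whose results have a checkpoint
    failed_node: int,
    targets: List[str],           # tasks whose results are needed to continue
) -> List[str]:
    """Return the tasks that must be rerun, in dependency order."""
    order: List[str] = []
    visited: Set[str] = set()

    def visit(task: str) -> None:
        if task in visited:
            return
        visited.add(task)
        lost = owner[task] == failed_node
        if not lost or task in checkpointed:
            # Result is available: live on a surviving node, or restorable.
            return
        for dep in deps.get(task, []):
            visit(dep)
        order.append(task)  # appended after its dependencies

    for target in targets:
        visit(target)
    return order


# Illustrative example: "a" tasks ran on node 0, "b" tasks on node 1.
deps = {"t2_a": ["t1_a", "t1_b"], "t1_a": ["t0_a"], "t1_b": ["t0_b"]}
owner = {"t0_a": 0, "t1_a": 0, "t2_a": 0, "t0_b": 1, "t1_b": 1}
rerun = tasks_to_reexecute(deps, owner, {"t0_a"}, failed_node=0, targets=["t2_a"])
print(rerun)
```

In this example only the two tasks downstream of the checkpointed `t0_a` on the failed node are rerun; the tasks on the surviving node are left untouched, in contrast to a global rollback of all processes.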
The most important aspects of our C/R strategy are:
- A guard-protectee scheme for checkpointing and restoring checkpoints
- Application-specific checkpoints (see figure for 1D stencil-specific checkpoints)
- Recovery informed by task dependencies
Figure: A recursively decomposed 1D stencil, with dimensions T(<time>,<space>), with checkpoints outlined in red per stencil task of a specific granularity.
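A guard-protectee scheme with per-task checkpoints could look roughly like the Python sketch below. This is an illustration under stated assumptions, not the AllScale design: it assumes each node's guard is simply its successor in a ring, that a checkpoint is a deep copy of one stencil task's data keyed by a task label such as "T(4,0)", and that the guard takes over the protectee's checkpointed data after a failure. The names `Node`, `guard_of`, `checkpoint`, and `recover` are hypothetical.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    rank: int
    alive: bool = True
    # Live data of the stencil tasks this node is working on.
    data: Dict[str, List[float]] = field(default_factory=dict)
    # Checkpoints this node holds on behalf of its protectee.
    guarded: Dict[str, List[float]] = field(default_factory=dict)


def guard_of(rank: int, num_nodes: int) -> int:
    """Assumed ring layout: the guard of a node is its successor."""
    return (rank + 1) % num_nodes


def checkpoint(nodes: List[Node], rank: int, task_id: str) -> None:
    """Store a copy of one task's data on the guard of node `rank`."""
    guard = nodes[guard_of(rank, len(nodes))]
    guard.guarded[task_id] = copy.deepcopy(nodes[rank].data[task_id])


def recover(nodes: List[Node], failed_rank: int) -> List[str]:
    """Mark a node as failed and restore its checkpoints on its guard."""
    failed = nodes[failed_rank]
    failed.alive = False
    failed.data.clear()  # everything on the failed node is lost
    guard = nodes[guard_of(failed_rank, len(nodes))]
    for task_id, saved in guard.guarded.items():
        guard.data[task_id] = copy.deepcopy(saved)
    return sorted(guard.guarded)  # task ids that were restored


nodes = [Node(rank) for rank in range(3)]
nodes[0].data["T(4,0)"] = [1.0, 2.0, 3.0]
checkpoint(nodes, 0, "T(4,0)")
nodes[0].data["T(4,0)"][0] = 99.0  # progress made after the checkpoint
restored = recover(nodes, 0)
```

After `recover`, node 1 (the assumed guard of node 0) holds the checkpointed state of "T(4,0)", and the update made after the checkpoint is lost and would have to be recomputed.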