Resilience

Efficient HPC Resilience through Rollback Avoidance

Contact: Scott Levy (slevy@sandia.gov)

Rollback avoidance techniques are an important means of reducing the need for costly checkpoint/restart phases. Our research has demonstrated multiple approaches to minimizing reliance on rollback recovery. In our most recent work, to be presented this year at Supercomputing, we demonstrate that lightweight memory compression can effectively protect against uncorrectable memory errors, improving the performance of HPC applications.

SMURFS: Simulation and Modeling for Understanding Resilience and Faults at Scale

Contact: Dorian Arnold (darnold@cs.unm.edu); Kurt Ferreira (kurt@cs.unm.edu)

Exaflop computational power will enable important new discoveries across all basic science domains. Application resilience is a major challenge to the realization of extreme-scale computing systems. SMURFS addresses this challenge with new simulation and modeling capabilities that improve our predictive understanding of the complex interactions, at extreme scale, among a given application, a given real or hypothetical hardware and software system environment, and a given fault-tolerance strategy.