Simulating Resilience at Scale

Evaluating fault-tolerance mechanisms for exascale systems is challenging: current systems are significantly smaller than next-generation systems and typically have different architectural features (e.g., interconnect, persistent storage). Additionally, accurate analytical models do not yet exist for many emerging resilience techniques. We have designed and built a simulation framework for efficiently evaluating the performance of resilience techniques on future systems. Our framework exploits three key observations: (1) faults and fault-tolerance events can be modeled as CPU detours; (2) only coarse-grained application events and system features appear relevant to fault-tolerance performance; and (3) studying fault tolerance does not require cycle-accurate simulation. Using these observations, we convert faults and fault-tolerance activity into CPU detours and build upon LogGOPSIM, an existing coarse-grained simulation framework that supports CPU detours. As an example, we can simulate a 10-hour production run of LAMMPS on 128K nodes with a speed-up of over 10,000x compared to the real execution.
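To make the first observation concrete, the following is a minimal single-process sketch of what treating fault-tolerance activity as CPU detours can look like. It is an illustration only, not the framework's implementation and not the LogGOPSIM interface: the function names, the (start, duration) detour representation, and the rule that a detour arriving during another one is queued are all assumptions made for this example.

```python
def periodic_detours(period, duration, horizon):
    """Hypothetical helper: detours (e.g. checkpoints) that begin `period`
    seconds after the previous detour ends, up to wall-clock time `horizon`."""
    detours = []
    start = period
    while start < horizon:
        detours.append((start, duration))
        start += period + duration
    return detours


def run_with_detours(work, detours):
    """Return the wall-clock time at which `work` seconds of computation
    finish when the CPU is taken away during each (start, duration) detour.

    Assumption: a detour that arrives while another is still running is
    queued and starts as soon as the earlier one ends.
    """
    t = 0.0            # current wall-clock time
    remaining = work   # computation still to be done
    for start, duration in sorted(detours):
        start = max(start, t)        # queue behind any detour in progress
        available = start - t        # CPU time usable before this detour
        if remaining <= available:
            break                    # work finishes before the detour begins
        remaining -= available
        t = start + duration         # CPU returns after the detour
    return t + remaining


if __name__ == "__main__":
    # 10 s of work, a 1 s checkpoint after every 4 s of computation.
    checkpoints = periodic_detours(period=4, duration=1, horizon=20)
    print(run_with_detours(10, checkpoints))
```

In this sketch a failure could be represented the same way, as one longer detour covering restart and rework time, which is why a coarse-grained simulator that already supports detours needs no fault-specific machinery.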

This project is part of a UNM/SNL collaboration.

