Rollback Recovery Avoidance

Because rollback-based recovery is expensive, we have been actively researching alternatives that avoid rollback. Most recently, we have examined the use of memory similarity to avoid crashes caused by uncorrectable DRAM errors. Our studies show that HPC applications contain large numbers of duplicate or similar memory pages, and that information about these pages can be detected and maintained with very low overhead. This allows applications to use similar pages to recover from uncorrectable ECC errors without costly rollbacks. We have also examined rollback avoidance techniques more generally, using modeling to explore the scenarios in which they can augment or replace checkpoint/restart mechanisms. Specifically, we developed a mathematical model that can be used to analyze the performance of a number of recent HPC resilience techniques, including replication, failure prediction and proactive migration, and software ECC systems.
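
As a rough illustration of the duplicate-page idea, the sketch below indexes memory pages by content hash and restores a failed page from an identical copy when one exists. It is a minimal sketch under stated assumptions, not the project's implementation: the 4096-byte page size, the use of SHA-256, and the function names `build_index`, `find_duplicate`, and `recover_page` are all hypothetical. It handles exact duplicates only, and it treats the index as a snapshot rather than keeping it current as pages are written.

```python
import hashlib
from collections import defaultdict

PAGE_SIZE = 4096  # assumed page size in bytes


def build_index(memory, page_size=PAGE_SIZE):
    """Return (digests, groups): per-page digests and digest -> page numbers."""
    digests = []
    groups = defaultdict(list)
    for offset in range(0, len(memory), page_size):
        digest = hashlib.sha256(memory[offset:offset + page_size]).digest()
        groups[digest].append(len(digests))
        digests.append(digest)
    return digests, groups


def find_duplicate(digests, groups, failed_page):
    """Return a different page with the same recorded digest, or None."""
    for candidate in groups[digests[failed_page]]:
        if candidate != failed_page:
            return candidate
    return None


def recover_page(memory, digests, groups, failed_page, page_size=PAGE_SIZE):
    """Overwrite the failed page from a duplicate; return True on success."""
    source = find_duplicate(digests, groups, failed_page)
    if source is None:
        return False  # no duplicate recorded: the caller must fall back to rollback
    start, src = failed_page * page_size, source * page_size
    memory[start:start + page_size] = memory[src:src + page_size]
    return True
```

Here `memory` is expected to be a mutable `bytearray`. A page with no recorded duplicate cannot be repaired this way, which is the case where a checkpoint/restart mechanism would still be needed.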

This project is part of a UNM/SNL collaboration.
