Enhancing Checkpoint/Restart

To keep checkpoint/restart (CR) viable on future extreme-scale systems, we study the performance of CR protocols and optimizations to them, including:
  • Compression: an application-independent way to decrease the sizes of checkpoints and message logs;
  • Incremental Checkpointing: a low-overhead, hash-based approach that saves only the changes made since the last checkpoint;
  • Uncoordinated Checkpointing: understanding how collective communication patterns impact protocol performance;
  • Task Replication: studying how replication can help to lower overheads on future systems.
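
As a rough illustration of how the first two ideas can combine, the sketch below hashes fixed-size blocks of application state, saves only the blocks whose hash differs from the previous checkpoint, and compresses each saved block. This is a minimal sketch under stated assumptions, not the project's implementation: the 4096-byte block size, the choice of SHA-256 and zlib, and the function names `block_hashes`, `incremental_checkpoint`, and `apply_checkpoint` are all hypothetical.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # hypothetical block granularity, chosen for illustration


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Hash each fixed-size block of the state (the last block may be short)."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def incremental_checkpoint(data: bytes, prev_hashes: list,
                           block_size: int = BLOCK_SIZE):
    """Return (delta, new_hashes).

    delta maps block index -> zlib-compressed block, and contains only
    blocks that are new or whose hash changed since the previous checkpoint.
    """
    new_hashes = block_hashes(data, block_size)
    delta = {}
    for idx, digest in enumerate(new_hashes):
        if idx >= len(prev_hashes) or prev_hashes[idx] != digest:
            start = idx * block_size
            delta[idx] = zlib.compress(data[start:start + block_size])
    return delta, new_hashes


def apply_checkpoint(base: bytes, delta: dict, total_len: int,
                     block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the state from the previous state plus an incremental delta."""
    # Truncate or zero-pad the previous state to the new length first.
    buf = bytearray(base[:total_len].ljust(total_len, b"\0"))
    for idx, blob in delta.items():
        start = idx * block_size
        block = zlib.decompress(blob)
        buf[start:start + len(block)] = block
    return bytes(buf)


# Usage: the first checkpoint saves every block, the second only the changed one.
state = bytearray(b"a" * 10000)
first_delta, first_hashes = incremental_checkpoint(bytes(state), [])
previous_state = bytes(state)
state[5000] = ord("b")  # modify one byte, which falls in block 1
second_delta, second_hashes = incremental_checkpoint(bytes(state), first_hashes)
restored = apply_checkpoint(previous_state, second_delta, len(state))
```

In this sketch the hashing cost is paid on every checkpoint in exchange for writing less data; a real system would also need to persist the hash table and handle restart from a chain of incremental checkpoints, which is omitted here.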

This project is part of a UNM/SNL collaboration.
