Enhancing Checkpoint/Restart

To keep checkpoint/restart (CR) viable on future extreme-scale systems, we study the performance of CR protocols and a set of optimizations, including:

Compression: an application-independent way to reduce the size of checkpoints and message logs;
Incremental Checkpointing: a low-overhead, hash-based approach that saves only the data changed since the last checkpoint;
Uncoordinated Checkpointing: understanding how collective communication patterns affect protocol performance;
Task Replication: studying how replication can help lower overheads on future systems.
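
To make the incremental checkpointing and compression items above concrete, the following is a minimal sketch of one way a hash-based incremental checkpoint could work. It is not the project's implementation. The function names, the 4096-byte block size, the use of SHA-256 for change detection, and the use of zlib for compression are all assumptions chosen for illustration.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # hypothetical block granularity, chosen only for illustration


def block_hashes(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of `data`."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def incremental_checkpoint(data, prev_hashes, block_size=BLOCK_SIZE):
    """Save only the blocks whose hash differs from the previous checkpoint.

    Returns (delta, hashes), where `delta` maps a block index to the
    zlib-compressed contents of that block, and `hashes` is the digest
    list to pass as `prev_hashes` at the next checkpoint.
    """
    hashes = block_hashes(data, block_size)
    delta = {}
    for idx, digest in enumerate(hashes):
        # A block is saved if it is new or if its digest changed.
        if idx >= len(prev_hashes) or prev_hashes[idx] != digest:
            start = idx * block_size
            delta[idx] = zlib.compress(data[start:start + block_size])
    return delta, hashes


def restore(base, delta, total_len, block_size=BLOCK_SIZE):
    """Rebuild a state of `total_len` bytes from `base` plus one delta."""
    # Truncate or zero-pad the base state to the target length first.
    buf = bytearray(base[:total_len].ljust(total_len, b"\0"))
    for idx, blob in delta.items():
        block = zlib.decompress(blob)
        start = idx * block_size
        buf[start:start + len(block)] = block
    return bytes(buf)
```

In this sketch the first checkpoint is taken with an empty `prev_hashes` list, so every block is saved. Later checkpoints store only the changed blocks, at the cost of hashing the full state each time.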

This project is part of a UNM/SNL collaboration.
