Resilience Benchmarking

Fault tolerance is rapidly becoming a first-class design objective for sustainable high-performance computing. Realistic evaluation of systems and applications at extreme scale requires consideration of their resilience characteristics. Accurate, flexible, and sophisticated fault-aware benchmarking will only become more important as energy budgets for future machines shrink; as data-centric computing must account for deep and complex memory hierarchies; and as coupled analytics and workflows integrate traditional HPC resources into solutions for new problem domains. Classical HPC benchmarks, conceived in an era of infrequent failures and an emphasis on floating-point performance, will not provide the information needed to evaluate resilience strategies for future exascale applications and systems. Our research reconsiders benchmarking of large-scale computations for the coming era, in which the impact of fault-tolerance strategies will be critical.

We pursue two interrelated research vectors:

  1. Evaluation of platform, application, resilience strategy, and failure characteristics to determine the insight they can provide into the performance of fault-tolerant applications at extreme scale. This activity will help to inform the evolution of resilience-aware benchmarks. It will also improve our analytical understanding of the tradeoffs involved in emphasizing particular characteristics or in combining them for benchmarking purposes.
  2. Simulation-based evaluation, under realistic resilience scenarios, of 1) evolving underlying software/hardware layers and 2) existing benchmarking approaches. Here we develop tools and approaches to measure resilience impacts on applications and to assist in integrating new system and runtime capabilities, providing a basis for future fault-aware benchmarking suites (a minimal example of this kind of simulation is sketched after this list).
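
To make these tradeoffs concrete, the following is a minimal, self-contained Python sketch, illustrative only and not project code, of a Monte Carlo simulation of a checkpoint/restart application running under exponentially distributed failures. All parameter values (total work, checkpoint and restart costs, MTBF) are assumptions chosen for illustration; the comparison includes the well-known Young/Daly first-order approximation sqrt(2*C*M) for the checkpoint interval, where C is the checkpoint cost and M the mean time between failures.

```python
"""Minimal sketch (illustrative assumptions, not project code):
Monte Carlo estimate of wall-clock time for a checkpoint/restart
application under exponentially distributed failures."""
import math
import random


def simulate_run(work, interval, ckpt_cost, restart_cost, mtbf, rng):
    """One run: `work` seconds of computation, checkpointing after every
    `interval` seconds of progress; failures arrive ~ Exp(1/mtbf).
    Failures during recovery are ignored for simplicity."""
    wall = 0.0                                   # elapsed wall-clock time
    done = 0.0                                   # checkpointed progress
    next_failure = rng.expovariate(1.0 / mtbf)   # time of next failure
    while done < work:
        step = min(interval, work - done)        # compute up to next checkpoint
        segment = step + ckpt_cost
        if wall + segment <= next_failure:
            wall += segment                      # segment and checkpoint complete
            done += step
        else:
            # Failure mid-segment: uncheckpointed work is lost; pay the
            # restart cost and draw the next failure from the restart point.
            wall = next_failure + restart_cost
            next_failure = wall + rng.expovariate(1.0 / mtbf)
    return wall


if __name__ == "__main__":
    rng = random.Random(2024)
    work = 24 * 3600.0       # 24 h of useful computation (assumed)
    ckpt = 60.0              # checkpoint cost in seconds (assumed)
    restart = 120.0          # restart/recovery cost in seconds (assumed)
    mtbf = 4 * 3600.0        # mean time between failures (assumed)
    daly = math.sqrt(2.0 * ckpt * mtbf)   # Young/Daly first-order optimum
    for interval in (600.0, daly, 7200.0):
        runs = [simulate_run(work, interval, ckpt, restart, mtbf, rng)
                for _ in range(200)]
        print(f"checkpoint interval {interval:7.0f} s -> "
              f"mean wall-clock {sum(runs) / len(runs) / 3600.0:.2f} h")
```

Even this toy model exhibits the balance a fault-aware benchmark must capture: long checkpoint intervals lose more work per failure, short intervals pay the checkpoint overhead more often, and the observed cost of either choice depends jointly on the platform's failure characteristics and the application's resilience strategy.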

This project is part of a UNM/SNL collaboration.
