HPC System Services

LIBI: The Lightweight Infrastructure-Bootstrapping Infrastructure

The lightweight infrastructure-bootstrapping infrastructure (LIBI) project targets a uniform and scalable bootstrapping process for extreme-scale software systems. Bootstrapping involves launching processes on a requested set of nodes and propagating the relevant initialization information to the launched processes. The LIBI API presents a consistent interface to the programmer while leveraging native HPC services (such as SLURM, ALPS, or OpenRTE) when they are available. This enables application portability while maintaining the speed of the native services. We also developed a novel algorithm, based on our performance model, that determines an optimal bootstrapping strategy. Our algorithm can decrease bootstrap time by up to 50%.
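To illustrate what choosing a bootstrapping strategy from a performance model can look like, here is a minimal Python sketch. It is not LIBI's API, performance model, or algorithm. It assumes a deliberately simple cost model in which every already-running process spawns `fanout` children one after another, all parents work in parallel, and each spawn costs a fixed `t_spawn`. The names `tree_depth`, `bootstrap_time`, and `best_fanout` are hypothetical.

```python
def tree_depth(n, fanout):
    """Smallest number of levels below the launcher needed to hold n processes."""
    depth, capacity, level_width = 0, 0, 1
    while capacity < n:
        level_width *= fanout      # processes started at this level
        capacity += level_width    # processes started so far
        depth += 1
    return depth


def bootstrap_time(n, fanout, t_spawn=1.0):
    """Toy model: each level costs fanout sequential spawns of t_spawn each."""
    return tree_depth(n, fanout) * fanout * t_spawn


def best_fanout(n, t_spawn=1.0):
    """Pick the fanout that minimizes the modeled bootstrap time."""
    return min(range(1, n + 1), key=lambda k: bootstrap_time(n, k, t_spawn))


if __name__ == "__main__":
    n = 1000
    flat = bootstrap_time(n, n)    # launcher spawns every process itself
    k = best_fanout(n)
    print(f"flat launch: {flat}, tree launch with fanout {k}: {bootstrap_time(n, k)}")
```

Under this toy model a flat launch grows linearly with the number of processes, while a tree launch grows roughly logarithmically. A realistic model would also have to account for the cost of the native launcher, network topology, and the time to propagate initialization data.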

This project is part of a UNM/LLNL collaboration.

LIBI website: LIBI

Extreme Scale Services

Owing to the high rate of component failures at extreme scales, system services will need to be failure-resistant, adaptive, and self-healing. A majority of HPC services are still designed around a centralized paradigm and hence are susceptible to scaling issues. Peer-to-peer services have proved themselves at scale for wide-area internet workloads. Distributed key-value stores (KVS) are widely used as a building block for these services, but are not prevalent in HPC services. In this paper, we simulate KVS for various service architectures and examine the design trade-offs as applied to HPC service workloads to support extreme-scale systems. The simulator is validated against existing distributed KVS-based services. Via simulation, we demonstrate how failure, replication, and consistency models affect performance at scale. Finally, we emphasize the general applicability of KVS to HPC services by feeding real HPC service workloads into the simulator and presenting a KVS-based distributed job launch prototype.
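As a rough illustration of how replication lets a distributed KVS tolerate node failures, the following Python sketch places each key on a hash ring and writes it to several successive nodes. It is not the simulator or the job launch prototype described above. The class `ToyKVS`, its methods, and the node names are hypothetical, failures are modeled as a simple set of unavailable nodes, and the sketch assumes at least one node.

```python
import hashlib
from bisect import bisect_right


class ToyKVS:
    """In-memory stand-in for a replicated, hash-partitioned key-value store."""

    def __init__(self, nodes, replicas=2):
        # Sort nodes by hash to form a ring.
        self.ring = sorted((self._hash(n), n) for n in nodes)
        self.replicas = min(replicas, len(nodes))
        self.store = {n: {} for n in nodes}
        self.failed = set()  # nodes currently treated as down

    @staticmethod
    def _hash(text):
        return int(hashlib.sha256(text.encode()).hexdigest(), 16)

    def owners(self, key):
        """Return the nodes holding the key: its ring successor and the next ones."""
        start = bisect_right(self.ring, (self._hash(key), ""))
        size = len(self.ring)
        return [self.ring[(start + i) % size][1] for i in range(self.replicas)]

    def put(self, key, value):
        # Write to every live replica; failed replicas simply miss the update.
        for node in self.owners(key):
            if node not in self.failed:
                self.store[node][key] = value

    def get(self, key):
        # Read from the first live replica that has the key.
        for node in self.owners(key):
            if node not in self.failed and key in self.store[node]:
                return self.store[node][key]
        raise KeyError(key)


if __name__ == "__main__":
    kvs = ToyKVS(["n0", "n1", "n2", "n3"], replicas=2)
    kvs.put("job42", "running")
    kvs.failed.add(kvs.owners("job42")[0])  # lose the primary replica
    print(kvs.get("job42"))                 # still served by the second replica
```

The sketch ignores consistency entirely: a replica that was down during a `put` would return stale or missing data once it recovers, which is the kind of trade-off among failure, replication, and consistency models that the simulation study examines.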

This project is part of a LANL/IIT/UNM collaboration.
