System Software and Services

HPC System Performance Modeling

Contact: Patrick G. Bridges (patrickb@unm.edu); Oscar Mondragon (oscar.mondragon@gmail.com)

System software for next-generation HPC systems must handle complex resource allocation decisions, but the performance impact of these decisions is difficult to predict because of complex interactions between the system software and the distributed application. We are researching new techniques to characterize and optimize system software interactions with applications on forthcoming HPC systems. These techniques leverage novel extreme value models to predict the impact of resource allocation decisions on HPC application performance. Our initial results, presented at Supercomputing 2016, demonstrate that this approach can accurately predict the impact of a wide range of system software actions on the performance of modern HPC applications.
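
The intuition behind the extreme value approach can be sketched as follows (a simplified formulation for illustration, not the exact model from the Supercomputing 2016 work). In a bulk-synchronous application, each synchronized step finishes only when its slowest process does, so system software activity that delays any one process delays the whole step; extreme value theory says that the distribution of such maxima is well described by a generalized extreme value (GEV) form whose parameters can be fit from measured per-process delays.

    % Step time is the maximum over the P participating processes of the
    % per-process work time plus any system-software-induced delay d_i:
    T_{\mathrm{step}} = \max_{1 \le i \le P} \bigl( t_{\mathrm{work},i} + d_i \bigr)

    % Extreme value theory: the distribution of such maxima converges to
    % the generalized extreme value (GEV) family,
    G(x) = \exp\Bigl\{ -\bigl[ 1 + \xi \tfrac{x - \mu}{\sigma} \bigr]^{-1/\xi} \Bigr\}

    % with location \mu, scale \sigma, and shape \xi estimated from
    % measurements and used to predict the resulting application slowdown.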

Multi-core HPC Communication Systems

Contact: Matthew Dosanjh (mdosanjh@cs.unm.edu); Ryan Grant (regrant@sandia.gov); Nathan Hjelm (hjelmn@lanl.gov)

High-performance communication is essential to efficient parallel computation, but it is difficult to achieve on modern many-core systems. We are researching the performance of, and optimizations for, multi-core HPC communication systems, including one-sided and two-sided operations in Open MPI and threading extensions for OpenSHMEM.
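
To make the distinction concrete, the sketch below writes the same neighbor exchange twice, once with two-sided and once with one-sided MPI operations; it is a minimal standalone illustration, not code drawn from our benchmarks.

    /* Same neighbor exchange, two-sided vs. one-sided (illustrative sketch).
     * Compile with: mpicc exchange.c -o exchange */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;
        double send = (double)rank, recv_two = -1.0, recv_one = -1.0;

        /* Two-sided: sender and receiver both participate explicitly. */
        MPI_Sendrecv(&send, 1, MPI_DOUBLE, right, 0,
                     &recv_two, 1, MPI_DOUBLE, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* One-sided: the target exposes a window and the origin puts data
         * into it without the target posting a matching receive. */
        MPI_Win win;
        MPI_Win_create(&recv_one, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        MPI_Win_fence(0, win);
        MPI_Put(&send, 1, MPI_DOUBLE, right, 0, 1, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);
        MPI_Win_free(&win);

        printf("rank %d: two-sided recv %.0f, one-sided recv %.0f\n",
               rank, recv_two, recv_one);
        MPI_Finalize();
        return 0;
    }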

OS Support for Application Composition

Contact: Noah Evans (nevans@sandia.gov); Patrick G. Bridges (patrickb@unm.edu)

Emerging applications increasingly rely on multiple cooperating components to model and analyze complex phenomena. We are researching novel OS mechanisms for supporting such composed applications, for example, mechanisms for efficiently handling data movement and control transfer between co-located application components. Our research leverages features of the Hobbes Exascale Operating System to effectively support emerging composed applications.
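
Hobbes provides its own cross-enclave mechanisms for this; as a generic illustration of the kind of zero-copy data movement involved, the sketch below uses ordinary POSIX shared memory so that an analysis component can read a simulation component's data in place (the "/composed_demo" segment name and the single-process structure are purely for illustration).

    /* Generic zero-copy sharing between co-located components via POSIX
     * shared memory (illustrative only). Compile with: cc share.c -lrt */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define REGION   "/composed_demo"
    #define NDOUBLES 1024

    int main(void)
    {
        /* Simulation-side component: create and fill the shared region. */
        int fd = shm_open(REGION, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        if (ftruncate(fd, NDOUBLES * sizeof(double)) != 0) { perror("ftruncate"); return 1; }

        double *data = mmap(NULL, NDOUBLES * sizeof(double),
                            PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }
        for (int i = 0; i < NDOUBLES; i++)
            data[i] = (double)i;

        /* Analysis-side component (same process here for brevity) reads the
         * data in place: no copy through the file system or network. */
        double sum = 0.0;
        for (int i = 0; i < NDOUBLES; i++)
            sum += data[i];
        printf("analysis component read %d values, sum = %.0f\n", NDOUBLES, sum);

        munmap(data, NDOUBLES * sizeof(double));
        close(fd);
        shm_unlink(REGION);
        return 0;
    }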

Supporting Thread-Level Heterogeneity in Coupled Applications

Contact: Sam Gutierrez (samuel@cs.unm.edu); Dorian C. Arnold (darnold@cs.unm.edu)

Hybrid parallel programming models that combine message passing and multithreading (MP+MT) are becoming more popular, extending the basic message passing (MP) model that uses single-threaded processes for parallelism. A consequence is that coupled parallel applications increasingly combine MP libraries with MP+MT libraries that have differing preferred degrees of threading, resulting in thread-level heterogeneity. Our approach enables full utilization of all available compute resources throughout an application's execution by providing programmable facilities to dynamically reconfigure runtime environments for compute phases with differing threading factors and affinities.
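
The following minimal MPI+OpenMP sketch shows the kind of phase-to-phase reconfiguration involved; it is a hand-rolled illustration of the problem, not the programmable runtime facilities described above, and real reconfiguration would also adjust thread and process affinities.

    /* Phase-based thread-level reconfiguration (illustrative sketch).
     * Compile with: mpicc -fopenmp phases.c -o phases */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Phase 1: an MP-only library phase, one thread per process. */
        omp_set_num_threads(1);
        #pragma omp parallel
        printf("rank %d phase 1: %d thread(s)\n", rank, omp_get_num_threads());

        MPI_Barrier(MPI_COMM_WORLD);

        /* Phase 2: an MP+MT library phase, widened to the cores available
         * to this process. */
        omp_set_num_threads(omp_get_num_procs());
        #pragma omp parallel
        #pragma omp single
        printf("rank %d phase 2: %d thread(s)\n", rank, omp_get_num_threads());

        MPI_Finalize();
        return 0;
    }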

Virtualization of HPC Storage Systems

Contact: Hussein Al-Azzawi (azzawi@carc.unm.edu); Damion Terrell (evil42@unm.edu); Shuang Yang (yangs@cs.unm.edu); Patrick G. Bridges (patrickb@unm.edu)

Both hardware and software for HPC storage systems are complex, difficult to administer, unreliable, and performance-sensitive. Virtualizing HPC storage systems would increase the reliability, manageability, and flexibility of these systems, allowing them to be converged with more general cloud, big data, and high-end computing systems. This project is examining a wide range of HPC storage virtualization architectures, focusing on the performance costs of different approaches to running the Lustre parallel file system inside VMware virtual machines on commodity Dell hardware. As part of this work, we are examining the costs and benefits of pass-through and full virtualization of the InfiniBand fabric in the context of storage system workloads. We are also researching the potential reliability gains of VM-based replication of key system software components such as the Lustre metadata server.
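
As a rough illustration of the kind of measurement used to compare storage paths (for example, pass-through versus fully virtualized I/O), the sketch below is a simple sequential-write microbenchmark; it is not our actual test harness, and the file path and sizes are arbitrary.

    /* Sequential write bandwidth microbenchmark (illustrative sketch).
     * Compile with: cc -O2 wbench.c -o wbench */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define BLOCK   (1 << 20)   /* 1 MiB per write() call */
    #define NBLOCKS 256         /* 256 MiB total */

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "bench.dat";
        char *buf = malloc(BLOCK);
        memset(buf, 0xAB, BLOCK);

        int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NBLOCKS; i++) {
            if (write(fd, buf, BLOCK) != BLOCK) { perror("write"); return 1; }
        }
        fsync(fd);              /* include the flush in the measurement */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        close(fd);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d MiB in %.3f s = %.1f MiB/s\n", NBLOCKS, secs, NBLOCKS / secs);
        free(buf);
        return 0;
    }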