Large-Scale Debugging

The Stack Trace Analysis Tool (STAT), a 2011 R&D 100 Award winner, is a highly scalable, lightweight debugging tool that identifies groups of processes in a parallel application that exhibit similar behavior. STAT gathers stack traces from the application's processes and merges them. From a series of snapshots taken over time, the tool builds 3D spatial-temporal call graph prefix trees that profile the application's behavior across processes and across time.
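The merging idea can be illustrated with a small sketch. This is not STAT's code or API: the `Node` class, the `merge_traces` and `equivalence_classes` functions, and the sample traces are all hypothetical, and the sketch assumes each trace is a list of frame names ordered from outermost to innermost. It merges per-rank traces into one prefix tree, labels every frame with the ranks that reached it, and groups ranks by the call path where their trace ends.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One frame in the merged call graph prefix tree (hypothetical type)."""
    ranks: set = field(default_factory=set)       # ranks whose trace passes through this frame
    children: dict = field(default_factory=dict)  # callee frame name -> Node


def merge_traces(traces):
    """Merge per-rank stack traces (outermost frame first) into one prefix tree."""
    root = Node()
    for rank, frames in traces.items():
        node = root
        node.ranks.add(rank)
        for frame in frames:
            # Traces that share a prefix share the same nodes.
            node = node.children.setdefault(frame, Node())
            node.ranks.add(rank)
    return root


def equivalence_classes(root):
    """Group ranks by the full call path at which their trace ends."""
    classes = {}
    stack = [((), root)]
    while stack:
        path, node = stack.pop()
        in_children = set()
        for name, child in node.children.items():
            in_children |= child.ranks
            stack.append((path + (name,), child))
        # Ranks present here but in no child have their innermost frame here.
        ending_here = node.ranks - in_children
        if ending_here:
            classes[path] = sorted(ending_here)
    return classes


# Made-up traces from a four-rank job; rank 2 is in a different call.
traces = {
    0: ["main", "solve", "MPI_Barrier"],
    1: ["main", "solve", "MPI_Barrier"],
    2: ["main", "solve", "compute", "MPI_Recv"],
    3: ["main", "solve", "MPI_Barrier"],
}
tree = merge_traces(traces)
groups = equivalence_classes(tree)
```

In this made-up example, ranks 0, 1, and 3 fall into one group and rank 2 into another, so a developer would need to attach a full debugger to only one representative of each group rather than to every process.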

In 2012, STAT successfully debugged a program running on more than one million MPI processes on the IBM Blue Gene/Q (BGQ)-based Sequoia supercomputer. STAT has helped both early-access users and system integrators quickly isolate a wide range of errors, including particularly perplexing issues that manifested only at extremely large scales of up to 1,179,648 compute cores. The STAT team continues to pursue new research and convenience features intended to make the tool more useful.

This project is part of a UNM/LLNL/Wisconsin collaboration.

STAT website: STAT
