Keywords: High performance computing, Large scale distributed systems, Autonomous systems, Fault-tolerance, HPC Tools.

My research interests fall under the broad areas of high performance computing and large scale distributed systems. In particular, I am interested in abstractions, mechanisms and tools that allow system non-experts to harness the power of high-performance systems in scalable, efficient, reliable ways.

I collaborate with researchers at the University of Wisconsin and the Lawrence Livermore National Laboratory; collaborations with the Sandia National Laboratory and Los Alamos National Laboratory are in the formative stages. These collaborations have placed us in an exclusive and privileged position to work with world-class scientists on the largest systems in the world.

Autonomous Systems

Currently, we are studying autonomous (aka self-adaptive, aka self-managing) overlay networks that support in-network data analyses and aggregation. Such networks use non-intrusive mechanisms for monitoring health, performance and offered loads, and use online performance modeling to reconfigure and optimize overlay topologies dynamically. We are also studying the broader applicability of the tree-based computational model (described below) for scientific applications, information analytics, data mining and enterprise computing.

Our other interests include alternatives to contemporary fault-tolerance mechanisms and new programming models and paradigms for future large scale systems.

Tree-based Overlay Networks

At the University of Wisconsin with Bart Miller, I studied the use of hierarchical or tree-based overlay networks (TBONs) for efficient, reliable data communication and analyses for scalable tools and applications. A major outcome of this work is a scalable method for using the inherent data redundancies of certain (broad) classes of data aggregation computations to make them robust to node and process failures while avoiding the non-scalable overhead of explicit state replication (e.g. checkpoints). MRNet, the multicast/reduction network, is the TBON prototype we developed and continue to use to evaluate most of our TBON-related research.
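The tree-based aggregation idea can be illustrated with a minimal sketch. This is not MRNet's API: the names `Node`, `build_tree` and `reduce_tree` are hypothetical, and the whole tree lives in one process purely for illustration. Leaves stand in for application back-ends, internal nodes apply an aggregation function to their children's partial results, and the root returns the final value.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical overlay node: a leaf holds data, an internal node holds children."""
    value: int = 0
    children: List["Node"] = field(default_factory=list)


def build_tree(leaf_values: List[int], fanout: int) -> Node:
    """Build a tree bottom-up over a non-empty list of leaf values."""
    level = [Node(value=v) for v in leaf_values]
    while len(level) > 1:
        # Group up to `fanout` nodes under a new parent at each level.
        level = [Node(children=level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]


def reduce_tree(node: Node, combine: Callable[[List[int]], int]) -> int:
    """Aggregate upward: each internal node combines its children's partial results."""
    if not node.children:
        return node.value
    return combine([reduce_tree(child, combine) for child in node.children])


root = build_tree(list(range(16)), fanout=4)
total = reduce_tree(root, sum)    # sum over all leaves
largest = reduce_tree(root, max)  # maximum over all leaves
```

With a fan-out of k, no node handles more than k inputs, and the number of levels grows logarithmically with the number of leaves, which is where the scalability comes from.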

Scalable Application Debugging

STAT, the Stack Trace Analysis Tool, is being developed as a collaboration of researchers from the Lawrence Livermore National Laboratory, the University of Wisconsin, and the University of New Mexico. STAT was developed to explore lightweight debugging techniques for extremely large applications (thousands to millions of processes). STAT identifies process equivalence classes, groups of processes exhibiting similar behavior, so that single class representatives can then be examined in depth with full-featured (less scalable) tools like TotalView or DDT. The initial version of STAT used only stack traces to determine process equivalence classes; our current work explores lightweight program analyses to classify application processes based on various notions of progress.
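A rough sketch of the equivalence-class idea follows, assuming each process's stack trace has already been collected as a tuple of function names. The function names, the sample traces and the choice of the lowest rank as representative are illustrative assumptions, not STAT's actual implementation or interface.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Trace = Tuple[str, ...]


def equivalence_classes(traces: Dict[int, Trace]) -> Dict[Trace, List[int]]:
    """Group process ranks whose stack traces are identical."""
    classes: Dict[Trace, List[int]] = defaultdict(list)
    for rank in sorted(traces):
        classes[tuple(traces[rank])].append(rank)
    return dict(classes)


def representatives(classes: Dict[Trace, List[int]]) -> Dict[Trace, int]:
    """Pick one rank per class (here, the lowest) for in-depth inspection."""
    return {trace: ranks[0] for trace, ranks in classes.items()}


# Hypothetical traces, outermost frame first.
sample = {
    0: ("main", "solve", "MPI_Barrier"),
    1: ("main", "solve", "MPI_Barrier"),
    2: ("main", "solve", "compute"),
    3: ("main", "solve", "MPI_Barrier"),
}
classes = equivalence_classes(sample)
reps = representatives(classes)
```

Here four processes collapse into two classes, so only two representatives would need to be attached to a full-featured debugger.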

Prior Research

At the University of Tennessee, as a post-master's research associate with Jack Dongarra, I worked on NetSolve, a framework embodying a set of mechanisms that allowed application scientists to use simple programming environments and interfaces to access parallel solvers and high-performance hardware, with a focus on fault-tolerance, task scheduling and data logistics.

Also at the University of Tennessee with Jim Plank, I developed CLUBS (Checkpoint Library for Unix Based Systems), a user-level, transparent checkpointing library that succeeded the earlier libckpt. I studied checkpointing optimizations including file formats for low-overhead roll-backs and copy-on-write process forking to decrease checkpointing overhead. The psncLibCkpt library from the Poznań Supercomputing and Networking Center in Poland is based on CLUBS.
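The forking optimization can be sketched as follows. This is only an illustration of the general idea on a Unix system, not CLUBS code: the `checkpoint` and `demo` functions are hypothetical, and the state is serialized with Python's `pickle` rather than by saving a process image. The forked child sees a snapshot of the parent's memory as of the fork, so it can write the checkpoint to disk while the parent resumes computing immediately.

```python
import os
import pickle
import tempfile


def checkpoint(state, path):
    """Fork a child that writes `state` to `path`; return the child's pid."""
    pid = os.fork()
    if pid == 0:
        # Child: works on the copy-on-write snapshot taken at fork time.
        status = 1
        try:
            with open(path, "wb") as f:
                pickle.dump(state, f)
            status = 0
        finally:
            os._exit(status)  # leave without running the parent's cleanup handlers
    return pid  # Parent: returns at once and keeps computing.


def demo():
    state = {"step": 1, "data": [1, 2, 3]}
    fd, path = tempfile.mkstemp(suffix=".ckpt")
    os.close(fd)
    pid = checkpoint(state, path)
    state["step"] = 2  # parent moves on; the child's snapshot is unaffected
    _, status = os.waitpid(pid, 0)
    with open(path, "rb") as f:
        restored = pickle.load(f)
    os.remove(path)
    return status, restored
```

The parent is blocked only for the duration of the fork itself rather than for the whole write to disk, which is the source of the reduced overhead.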