News Archives

[Colloquium] An Overview of HPC Resilience and an Approach to Soft Error Fault Injection

September 23, 2011

Watch Colloquium: 

M4V file (682 MB)

  • Date: Friday, September 23, 2011 
  • Time: 12:00 pm to 12:50 pm
  • Place: Centennial Engineering Center 1041

Nathan DeBardeleben
Los Alamos National Laboratory

Over the next decade, the field of high-performance computing (supercomputing) will undoubtedly see major changes in the ways leadership-class machines are built, used, and maintained. The challenges include operating systems, programming models and languages, power, and file systems, to name but a few. This talk will focus on one of those challenges: the cross-cutting goal of providing reliable computation on fundamentally unreliable components. Nathan will provide an overview of the field of resilience, point to the obstacles expected over the coming decade, look at potential solutions that appear promising, and discuss areas that appear to need more emphasis. He will also present his own new research on a soft error fault injection (SEFI) framework, along with some early results. SEFI is intended as a framework for determining the resilience of a target application to soft errors. The talk will cover the initial implementation, which uses a processor emulator virtual machine, and the reasons SEFI might move away from emulation toward a dynamic instrumentation approach.

Bio: Nathan DeBardeleben is a research scientist at Los Alamos National Laboratory, where he leads the HPC Resilience effort in the Ultrascale Systems Research Center (USRC). He joined LANL in 2004 after receiving his PhD, Master’s, and Bachelor’s degrees in computer engineering from Clemson University. At LANL, Nathan was an early developer and designer of the Eclipse Parallel Tools Platform (PTP) project, spent several years optimizing application codes, and has since turned his focus to resilient computation. Nathan is active in the resilience community and spent 2010 on an IPA assignment at the U.S. Department of Defense, where he led the Resilience Thrust of the Advanced Computing Systems Research Program. He serves on several program committees and leads the Fault-Tolerance at Extreme Scale Workshop. His research interests are in the field of reliable computation, particularly HPC resilience. This includes, but is not limited to, fault tolerance, resilient programming models, resilient application design, and soft errors (particularly those transient in nature).

(Students interested in Dr. DeBardeleben’s research who would like to meet with him over lunch should contact Dorian Arnold (darnold@cs.unm.edu).)