EECS Colloquium: Compiler directed resilience techniques for HPC applications — Dr. Chao Chen, Staff Researcher, CoCoPie


About the event

Abstract: Transient faults are becoming a significant concern for emerging extreme-scale high-performance computing (HPC) systems. This problem is exacerbated by technology trends toward smaller transistor sizes, higher circuit density, and the use of near-threshold voltage techniques to save power. Transient faults can corrupt the execution of long-running scientific applications, leading to either silent data corruptions (SDCs, i.e., incorrect values in outputs) or soft failures (abnormal termination, e.g., process crashes). While SDCs undermine confidence in computations and can lead to inaccurate and untrustworthy scientific insights, soft failures degrade system efficiency and performance, since the impacted jobs must be restarted from their checkpoints and must re-execute the lost computations before continuing normal operation. As a consequence, transient fault detection and recovery must be addressed in HPC system design, for both usability (trust in the output results) and efficiency (speedup and energy efficiency). In particular, solutions must have very low overhead during regular execution and must be able to detect (and potentially recover from) a large set of faults with negligible downtime.

Bio: Chao Chen is a Staff Researcher of AI Compilers at CoCoPie. Before joining CoCoPie, he was a software engineer at Amazon Science. He received his Ph.D. from the School of Computer Science at Georgia Tech, advised by Santosh Pande and Greg Eisenhauer. His research interests are broadly in the areas of compilers and systems, with thesis research on lightweight resilience techniques for HPC applications that exploit applications’ properties. His work appears in top-tier HPC venues and was nominated for Best Student Paper at SC ’19.