
EECS Colloquium: AI in HPC Systems Stack: Storage Systems and Deep Learning Systems by Bing Xie, Oak Ridge National Laboratory

Online
Zoom link

About the event

Abstract
In high-performance computing (HPC), scientific codes have been evolving continuously. As they move from numerical simulations and analyses to AI/ML-based applications, scientific codes execute at larger computational scales and periodically issue massive data movements for network communication and I/O at runtime. In this talk, I will discuss two of my recent projects: understanding and improving the performance of HPC I/O subsystems by leveraging AI/ML algorithms, and optimizing network communication in large-scale deep learning systems. For the work on HPC I/O, I will talk about the challenges of benchmarking, modeling, and tuning the performance of supercomputer I/O systems based on system design, deployment, and configuration, along with our ML-based solutions. For the work on AI systems, I will discuss our proposals for Horovod, a popular collective communication library for deep learning frameworks, which introduce a decentralized coordination scheme and a grouping mechanism in Horovod's control plane and data plane, respectively.

Bio
Dr. Bing Xie is an HPC research scientist at the Oak Ridge Leadership Computing Facility (OLCF) of Oak Ridge National Laboratory (ORNL). Bing received a Ph.D. in Computer Science from Duke University in 2017 and joined ORNL in the same year. She conducts computer systems research, with a strong publication record spanning multiple areas, including large-scale parallel file systems, deep learning systems, and resource management. Her work has been presented at major conferences and in journals such as SC, ACM TOS, NSDI, HPDC, and IPDPS. Bing won the IEEE-CS TCHPC Early Career Researchers Award in 2021. Her study of parallel file system performance was nominated for both best paper and best student paper at SC in 2012. Her improvements to HDF5, a widely used HPC I/O library, were adopted by OLCF. Her work on Horovod was incorporated into Horovod v0.20.1.

Contact