Deep Learning Training on GPU Clusters
In this talk, I will introduce both completed work and work in progress at the intersection of systems and machine learning. After (very) briefly discussing projects on reducing tail latency in interactive services and on intelligent real-time data analytics, I will dive into the main body of the talk, which focuses on building GPU clusters for deep learning training workloads.
I will describe Project Philly, a service for training machine learning models that performs GPU cluster management for deep learning training jobs. Philly factors in key workload characteristics that matter for building more efficient GPU clusters. For example, to improve training performance, Philly has to take intra-server communication locality into account; ignoring it slows down training because of interference arising in various communication channels. Philly has been used at Microsoft to serve a number of production groups for years. Based on our experience running it at large scale, we have observed issues that affect cluster utilization for DNN training workloads. I will provide design guidelines for next-generation cluster schedulers for DNN training workloads.
I will also (very) briefly talk about Tiresias, a GPU cluster resource manager tailored for distributed deep learning training jobs, which schedules these jobs to reduce their job completion times. The execution time of a deep learning job is often unpredictable. Tiresias therefore incorporates two complementary scheduling algorithms that assume either no execution time information or only partial information.
Myeongjae Jeon has been an assistant professor in the Department of Computer Science and Engineering at Ulsan National Institute of Science and Technology (UNIST) since fall 2018. Prior to joining UNIST, he was with the Systems Research Group at Microsoft Research Redmond (2015-2018), and before that he worked at ARM R&D (2014-2015). He received his Ph.D. in computer science from Rice University in 2014, his M.S. degree in computer science from KAIST in 2009, and his B.E. degree in computer engineering from Kwangwoon University in 2005. His recent research interests span parallel/distributed processing of deep learning workloads, real-time stream data analytics at cloud/IoT scale, and public/private blockchains.