OMNIA’s paper accepted to NSDI 2019
Professor Myeongjae Jeon’s paper, “Tiresias: A GPU Cluster Manager for Distributed Deep Learning,” has been accepted to NSDI 2019. This work was done in collaboration with researchers at the University of Michigan, Microsoft Research, Alibaba, and ByteDance.
This paper presents Tiresias, a GPU cluster resource manager tailored for distributed deep learning training. Tiresias schedules and places deep learning jobs so as to reduce their job completion times. It proposes (1) a scheduling algorithm called 2DAS, which generalizes the least-attained-service (LAS) and Gittins index policies by accounting for both the spatial and temporal aspects of deep learning jobs, and (2) a job placement policy based on profiling skew in tensor distributions.
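To make the idea of combining spatial and temporal aspects concrete, here is a minimal sketch of a least-attained-service choice in two dimensions. It is an illustration only, not the Tiresias implementation: it assumes that a job's attained service is measured as the number of GPUs it holds multiplied by the time it has run, which the summary above does not spell out. The names `Job`, `attained_service`, and `pick_next` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """A hypothetical deep learning training job (illustrative only)."""
    name: str
    num_gpus: int             # spatial aspect: GPUs the job occupies
    seconds_run: float = 0.0  # temporal aspect: time executed so far

    @property
    def attained_service(self) -> float:
        # Assumption: service received so far is GPUs x running time.
        return self.num_gpus * self.seconds_run


def pick_next(jobs):
    """Return the job that has received the least service so far."""
    return min(jobs, key=lambda job: job.attained_service)


jobs = [
    Job("a", num_gpus=4, seconds_run=100.0),  # 400 GPU-seconds
    Job("b", num_gpus=1, seconds_run=300.0),  # 300 GPU-seconds
    Job("c", num_gpus=2, seconds_run=120.0),  # 240 GPU-seconds
]
chosen = pick_next(jobs)
```

In this example a policy that looked only at running time would favor job "a", since it has run for the shortest time, whereas counting GPUs as well favors job "c", which has consumed the fewest GPU-seconds.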
NSDI, the USENIX Symposium on Networked Systems Design and Implementation, is one of the top conferences in networked and distributed systems.