Optimus (paper)

Year: 2018. Citations: 397

Topics: Cloud Computing and Resource Management; IoT and Edge/Fog Computing; Distributed and Parallel Computing Systems

Abstract

Deep learning workloads are common in today's production clusters due to the proliferation of deep learning driven AI services (e.g., speech recognition, machine translation). A deep learning training job is resource-intensive and time-consuming. Efficient resource scheduling is key to maximizing the performance of a deep learning cluster. Existing cluster schedulers are largely not tailored to deep learning jobs and typically specify a fixed amount of resources for each job, which prevents high resource efficiency and job performance. This paper proposes Optimus, a customized job scheduler for deep learning clusters, which minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training, and sets up performance models to accurately estimate training speed as a function of the resources allocated to each job. Based on these models, a simple yet effective method is designed and used for dynamically allocating resources and placing deep learning tasks to minimize job completion time. We implement Optimus on top of Kubernetes, a cluster manager for container orchestration, and experiment on a deep learning cluster with 7 CPU servers and 6 GPU servers, running 9 training jobs using the MXNet framework. Results show that Optimus outperforms representative cluster schedulers by about 139% and 63% in terms of job completion time and makespan, respectively.
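
The abstract gives no algorithmic detail, so the following is only a minimal sketch of how a model-driven allocator of this kind might work, not the paper's actual method. It assumes each job has an estimate of its remaining training steps (what a convergence model would supply) and a speed function mapping worker count to steps per second (what a performance model would supply). The names `saturating_speed` and `greedy_allocate`, the saturating speed formula, and all numbers are hypothetical.

```python
from typing import Callable, Dict


def saturating_speed(a: float, b: float) -> Callable[[int], float]:
    """Hypothetical performance model: steps/sec with diminishing returns in workers."""
    return lambda workers: workers / (a + b * workers)


def greedy_allocate(
    remaining_steps: Dict[str, float],
    speed: Dict[str, Callable[[int], float]],
    capacity: int,
) -> Dict[str, int]:
    """Give out workers one at a time to the job whose estimated
    completion time shrinks the most from one extra worker."""
    alloc = {job: 1 for job in remaining_steps}  # every job starts with one worker
    free = capacity - len(alloc)
    if free < 0:
        raise ValueError("capacity is smaller than the number of jobs")

    def eta(job: str, workers: int) -> float:
        # Estimated time to finish = remaining steps / predicted speed.
        return remaining_steps[job] / speed[job](workers)

    while free > 0:
        gains = {j: eta(j, alloc[j]) - eta(j, alloc[j] + 1) for j in alloc}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:  # no job benefits from another worker
            break
        alloc[best] += 1
        free -= 1
    return alloc


jobs = {"A": 1000.0, "B": 200.0}  # assumed remaining steps per job
models = {j: saturating_speed(1.0, 0.1) for j in jobs}
print(greedy_allocate(jobs, models, capacity=6))
```

In this toy setting the longer job receives more workers until its marginal gain falls below that of the shorter job. A real scheduler would refit both models as training progresses and rerun the allocation periodically.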

Authors (4 of 5 listed)

Chuanxiong Guo

Chuan Wu

Yangrui Chen

Yixin Bao
