Issue |
ITM Web Conf.
Volume 12, 2017
The 4th Annual International Conference on Information Technology and Applications (ITA 2017)
|
|
---|---|---|
Article Number | 03030 | |
Number of page(s) | 5 | |
Section | Session 3: Computer | |
DOI | https://doi.org/10.1051/itmconf/20171203030 | |
Published online | 05 September 2017 |
DLTAP: A Network-efficient Scheduling Method for Distributed Deep Learning Workload in Containerized Cluster Environment
1 School of Software and Microelectronics, Peking University, Beijing, China
2 School of Software and Microelectronics, Peking University, Beijing, China
3 School of Software and Microelectronics, Peking University, Beijing, China
a qiaowei@pku.edu.cn
b li.ying@pku.edu.cn
* Corresponding author: wuzh@ss.pku.edu.cn
Deep neural networks (DNNs) have recently yielded strong results on a range of applications. Training these DNNs using a cluster of commodity machines is a promising approach since training is time consuming and compute-intensive. Furthermore, putting DNN tasks into containers of clusters would enable broader and easier deployment of DNN-based algorithms. Toward this end, this paper addresses the problem of scheduling DNN tasks in the containerized cluster environment. Efficiently scheduling data-parallel computation jobs like DNN over containerized clusters is critical for job performance, system throughput, and resource utilization. It becomes even more challenging with the complex workloads. We propose a scheduling method called Deep Learning Task Allocation Priority (DLTAP) which performs scheduling decisions in a distributed manner, and each of scheduling decisions takes aggregation degree of parameter sever task and worker task into account, in particularly, to reduce cross-node network transmission traffic and, correspondingly, decrease the DNN training time. We evaluate the DLTAP scheduling method using a state-of-the-art distributed DNN training framework on 3 benchmarks. The results show that the proposed method can averagely reduce 12% cross-node network traffic, and decrease the DNN training time even with the cluster of low-end servers.
© The Authors, published by EDP Sciences, 2017
This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Current usage metrics show cumulative count of Article Views (full-text article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.
Data correspond to usage on the plateform after 2015. The current usage metrics is available 48-96 hours after online publication and is updated daily on week days.
Initial download of the metrics may take a while.