DLTAP: A Network-efficient Scheduling Method for Distributed Deep Learning Workload in Containerized Cluster Environment

Wei Qiao; Ying Li; Zhong-Hai Wu

doi:10.1051/itmconf/20171203030

ITM Web Conf., 12 (2017) 03030

Open Access

Issue		ITM Web Conf. Volume 12, 2017: The 4th Annual International Conference on Information Technology and Applications (ITA 2017)
Article Number		03030
Number of page(s)		5
Section		Session 3: Computer
DOI		https://doi.org/10.1051/itmconf/20171203030
Published online		05 September 2017

ITM Web of Conferences 12, 03030 (2017)

Wei Qiao¹^a, Ying Li²^b and Zhong-Hai Wu³^*

¹ School of Software and Microelectronics, Peking University, Beijing, China
² School of Software and Microelectronics, Peking University, Beijing, China
³ School of Software and Microelectronics, Peking University, Beijing, China

^a qiaowei@pku.edu.cn
^b li.ying@pku.edu.cn
^* Corresponding author: wuzh@ss.pku.edu.cn

Abstract

Deep neural networks (DNNs) have recently yielded strong results on a range of applications. Because training is time-consuming and compute-intensive, training DNNs on a cluster of commodity machines is a promising approach. Furthermore, running DNN tasks in containers on such clusters would enable broader and easier deployment of DNN-based algorithms. To this end, this paper addresses the problem of scheduling DNN tasks in a containerized cluster environment. Efficiently scheduling data-parallel computation jobs such as DNN training over containerized clusters is critical for job performance, system throughput, and resource utilization, and it becomes even more challenging with complex workloads. We propose a scheduling method called Deep Learning Task Allocation Priority (DLTAP), which makes scheduling decisions in a distributed manner. Each scheduling decision takes into account the aggregation degree of parameter server tasks and worker tasks, in particular to reduce cross-node network traffic and, correspondingly, to decrease DNN training time. We evaluate DLTAP using a state-of-the-art distributed DNN training framework on three benchmarks. The results show that the proposed method reduces cross-node network traffic by 12% on average and decreases DNN training time, even on a cluster of low-end servers.
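
The abstract does not define the aggregation degree, so the following is only an illustrative sketch of the general idea and not the authors' DLTAP algorithm. It assumes a hypothetical score: for each task, the number of already-placed tasks of the opposite role (parameter server versus worker) on a candidate node. The function names, the tie-breaking rule, and the traffic proxy (counting parameter-server/worker pairs that sit on different nodes) are all assumptions made for illustration.

```python
def place_job(tasks, capacity):
    """Greedily place one job's tasks, favoring co-location.

    tasks:    list of (task_name, role), role is "ps" or "worker"
    capacity: dict mapping node name -> number of free container slots
    Returns a dict mapping task_name -> node name.

    Hypothetical scoring rule (an assumption, not taken from the paper):
    prefer the node that already hosts the most tasks of the opposite role.
    """
    free = dict(capacity)
    roles = dict(tasks)
    placement = {}
    for name, role in tasks:
        candidates = [n for n in free if free[n] > 0]
        if not candidates:
            raise RuntimeError("no free slot left for task " + name)

        def score(node):
            # Count already-placed tasks of the opposite role on this node.
            return sum(
                1 for t, n in placement.items()
                if n == node and roles[t] != role
            )

        # Ties are broken in favor of the node with more free slots.
        best = max(candidates, key=lambda n: (score(n), free[n]))
        placement[name] = best
        free[best] -= 1
    return placement


def cross_node_pairs(placement, tasks):
    """Proxy for cross-node traffic: ps/worker pairs on different nodes."""
    roles = dict(tasks)
    ps = [t for t in placement if roles[t] == "ps"]
    workers = [t for t in placement if roles[t] == "worker"]
    return sum(1 for p in ps for w in workers if placement[p] != placement[w])


if __name__ == "__main__":
    job = [("ps0", "ps"), ("w0", "worker"), ("w1", "worker"), ("w2", "worker")]
    nodes = {"node-a": 2, "node-b": 4}
    result = place_job(job, nodes)
    print(result, cross_node_pairs(result, job))
```

In this toy setting, a node with enough free slots ends up hosting the parameter server together with its workers, so no parameter-server/worker pair crosses a node boundary. When no single node can hold the whole job, some pairs necessarily remain cross-node.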

© The Authors, published by EDP Sciences, 2017

This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
