Deep learning and AI have grabbed tremendous attention in the last decade. The substantial accuracy improvement by neural networks in common tasks such as image classiﬁcation and speech recognition has made deep learning as a replacement for many conventional machine learning techniques. Training Deep Neural networks require a lot of data, and therefore vast of amounts of computing resources to process the data and train the model for the neural network. The most obvious solution to solving this problem is to speed up the time it takes to train Deep Neural networks.
AI and deep learning workloads are diﬀerent from the conventional cloud and mobile workloads, with respect to: (1) Computational Intensity, (2) I/O characteristics, and (3) communication pattern. While there is a considerable amount of research activity on the theoretical aspects of AI and Deep Learning algorithms that run with greater eﬃciency, there are only a few studies on the infrastructural impact of Deep Learning workloads on computing and storage resources in distributed systems.
It is typical to utilize a heterogeneous mixture of CPU and GPU devices to perform training on a neural network. Google Brain has a developed a reinforcement model that can place training operations across a heterogeneous cluster. Though it has only been tested with local devices in a single cluster. This study will explore the method’s capabilities and attempt to apply this method on a cluster with nodes across a network.