Distributed Deep Learning
In a world with an abundance of data, processing it quickly and at a reasonable storage cost becomes quite challenging. There are many distributed frameworks that come in handy, starting with MapReduce, Spark, Dask and so on. Here I would like to focus on distributed ML training on GPUs, and more specifically on neural networks.
Before diving into distributed deep learning, here is a quick recap of how training a simple neural network works. Let's imagine that we have a simple neural network with two layers and three nodes in each layer. Each node has its own weights and biases, our trainable parameters. A training step begins with preprocessing our data. We then feed it into our network, which predicts an output (forward pass). We then compare the prediction with the desired label by computing the loss, and in the backward pass we compute the gradients and update the weights based on them. And repeat.
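To make this concrete, here is a minimal sketch of one such training step written in PyTorch (chosen only for illustration); the layer sizes and learning rate are arbitrary placeholders, not values from any particular model:

```python
import torch
import torch.nn as nn

# A toy two-layer network, roughly matching the description above.
model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def training_step(inputs, labels):
    optimizer.zero_grad()
    outputs = model(inputs)          # forward pass: predict the output
    loss = loss_fn(outputs, labels)  # compare prediction with the desired label
    loss.backward()                  # backward pass: compute the gradients
    optimizer.step()                 # update the weights based on the gradients
    return loss.item()               # ...and repeat for the next batch
```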
In a significant number of use cases, deep learning training can be performed on a single machine with a single GPU at relatively high performance and speed. However, there are times when we need even more speed, for example when our data is too large to fit on one machine, or when our hardware is simply not capable enough to handle the training. As a result, we may need to scale out.
Scaling out means adding more GPUs to our system or perhaps using multiple machines inside a cluster. Therefore we need some way to distribute our training efficiently, but that is not so easy in real life. The two main distributed training strategies are data parallelism and model parallelism; the choice depends on your network architecture, but the most commonly used one is data parallelism.
Data Parallelism:
One way of intuitively understanding how data parallelism works is this: the gradient is calculated for a small batch of data (say 30 images at once) on each GPU node, and at the end of one round of forward-backward passes through the network, the updated weights are sent back to the initiating node. The weighted average of the weights from each node is applied to the model parameters, and the updated model parameters are sent back to the nodes for the next round of iteration.
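As an illustration (not taken from the article itself), this is roughly how the gradient-averaging step can be written with PyTorch's torch.distributed API, assuming the process group has already been initialized and each worker has just finished its backward pass on its own mini-batch:

```python
import torch.distributed as dist

def average_gradients(model):
    """Average each parameter's gradient across all workers."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        # Sum this gradient over every worker, then divide by the number of workers.
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size
```

Calling average_gradients(model) between loss.backward() and optimizer.step() makes every worker apply the same averaged update, so all replicas of the model stay in sync.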
There are two main ways of updating the weight parameters:
Synchronous: Let’s say we are dealing with 10k images and 10 nodes; each node is given a subset of 1k images. Once they finish an iteration, the updated model parameters are sent to the parameter server (a node or a group of nodes responsible for synchronizing the model parameters). This approach greatly helps the accuracy of the model, but the downside, of course, is that the server must wait for all the nodes to complete the iteration; a single very slow node can drag down the speed of the whole training run. The risk of the parameter server becoming a bottleneck and a single point of failure can be minimized to some extent by introducing multiple parallel servers and ensuring proper storage redundancy. (A toy sketch of both update styles follows the asynchronous case below.)
Asynchronous: In this case, instead of waiting for all the nodes to send their weight updates, updates are applied as soon as they are available. This increases cluster utilization and improves training speed, but of course it leads to the stale-gradients problem. Most frameworks that implement asynchronous updates apply some strategy to reduce this impact in favor of higher cluster utilization. A related, bandwidth-efficient technique is the ring-allreduce method (http://research.baidu.com/bringing-hpc-techniques-deep-learning/).
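As a toy NumPy illustration of the difference between the two update styles (the blending factor mix below is just a made-up knob for this sketch, not a parameter of any real framework):

```python
import numpy as np

def synchronous_round(worker_weights):
    """Synchronous: the parameter server waits until ALL workers have sent their
    locally updated weights, then averages them into the new global model."""
    return np.mean(np.stack(worker_weights), axis=0)

def asynchronous_update(global_weights, worker_weights, mix=0.1):
    """Asynchronous: a single worker's weights are blended in as soon as they
    arrive, without waiting for the others, so some updates are based on a
    stale copy of the model."""
    return (1.0 - mix) * global_weights + mix * worker_weights
```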
In the ring-allreduce algorithm, each of N nodes communicates with two of its peers 2*(N-1) times. During this communication, a node sends and receives chunks of the data buffer. In the first N-1 iterations, received values are added to the values in the node’s buffer. In the second N-1 iterations, received values replace the values held in the node’s buffer. Baidu’s paper suggests that this algorithm is bandwidth-optimal, meaning that if the buffer is large enough, it will optimally utilize the available network.
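The chunked, two-phase structure is easier to see in a toy single-process simulation (this is only an illustration of the idea, not how Horovod or NCCL actually implement it):

```python
import numpy as np

def ring_allreduce(buffers):
    """Toy simulation of ring-allreduce: N nodes, each holding an equal-length
    buffer, end up with the element-wise sum using 2*(N-1) neighbour exchanges."""
    n = len(buffers)
    # Split every node's buffer into N chunks.
    chunks = [np.array_split(np.asarray(buf, dtype=float), n) for buf in buffers]

    # Phase 1 (scatter-reduce): received chunks are ADDED to the local copy.
    for step in range(n - 1):
        for node in range(n):
            c = (node - step) % n                 # chunk this node sends
            dest = (node + 1) % n                 # its next neighbour on the ring
            chunks[dest][c] = chunks[dest][c] + chunks[node][c]

    # Phase 2 (all-gather): received chunks REPLACE the local copy.
    for step in range(n - 1):
        for node in range(n):
            c = (node + 1 - step) % n             # fully reduced chunk to pass on
            dest = (node + 1) % n
            chunks[dest][c] = chunks[node][c]

    return [np.concatenate(chunks[i]) for i in range(n)]

# Every node ends up with the same summed buffer, e.g.
# ring_allreduce([np.ones(8), 2 * np.ones(8), 3 * np.ones(8)]) -> three arrays of 6s
```

After the 2*(N-1) exchanges, every node holds the full element-wise sum, and each node has only ever talked to its two ring neighbours, which is why the algorithm scales well with cluster size.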
Uber's Horovod package (https://github.com/horovod/) comes in handy for distributing the training process across a cluster of machines. It is available for TensorFlow, PyTorch and Spark.
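A minimal sketch of how Horovod is typically wired into a PyTorch training script looks roughly like this (build_model() is a hypothetical placeholder for your own model code):

```python
import horovod.torch as hvd
import torch

hvd.init()                                # start Horovod (one process per GPU)
torch.cuda.set_device(hvd.local_rank())   # pin each process to its own GPU

model = build_model().cuda()              # build_model() is a placeholder
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers with allreduce.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Make sure every worker starts from the same initial state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```

The script is then launched with something like horovodrun -np 4 python train.py, while the training loop itself stays essentially the same as in the single-GPU case.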
Model Parallelism:
In Model Parallelism, one model is divided into N parts (where N is the number of GPUs; in the figure above, N = 4). Each part of the model is placed on a separate GPU, and the batch is then calculated sequentially on GPU#0, GPU#1, …, GPU#N, which is where forward propagation ends. Backward propagation is done in reverse order, starting with GPU#N and ending with GPU#0.
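In PyTorch this can be as simple as placing different pieces of the model on different devices; a two-GPU toy version (the layer sizes are arbitrary) might look like this:

```python
import torch
import torch.nn as nn

class ModelParallelNet(nn.Module):
    """Toy model split across two GPUs instead of four, for brevity."""
    def __init__(self):
        super().__init__()
        self.part0 = nn.Sequential(nn.Linear(1024, 512), nn.ReLU()).to('cuda:0')
        self.part1 = nn.Linear(512, 10).to('cuda:1')

    def forward(self, x):
        x = self.part0(x.to('cuda:0'))      # forward runs on GPU#0 first...
        return self.part1(x.to('cuda:1'))   # ...then on GPU#1; backward reverses this
```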
The obvious advantage of this approach is that we can train a model that does not fit onto one GPU. However, there is also a drawback: while computation on GPU#i is in progress, all the other GPUs sit idle. This problem is addressed by switching to asynchronous (pipelined) GPU operation, for which a split_size parameter is needed that splits each batch into several smaller parts.
The split_size value has to be tuned by trying several candidates to find the one at which the model trains fastest. Given that there are usually several options for splitting a model, you will have to select split_size for each option separately, which makes using Model Parallelism more complex than using Data Parallelism. A pipelining sketch along these lines is shown below.
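Building on the ModelParallelNet sketch above, a pipelined variant in the spirit of the official PyTorch model-parallel tutorial might look like this (again just a sketch; the default split_size of 20 is arbitrary and should be tuned as described):

```python
class PipelinedNet(ModelParallelNet):
    """Split each batch into micro-batches of split_size samples so that GPU#0
    and GPU#1 can work on different micro-batches at the same time."""
    def __init__(self, split_size=20):
        super().__init__()
        self.split_size = split_size

    def forward(self, x):
        splits = iter(x.split(self.split_size, dim=0))
        s_prev = self.part0(next(splits).to('cuda:0')).to('cuda:1')
        outputs = []
        for s_next in splits:
            outputs.append(self.part1(s_prev))                      # GPU#1 finishes one micro-batch...
            s_prev = self.part0(s_next.to('cuda:0')).to('cuda:1')   # ...while GPU#0 starts the next
        outputs.append(self.part1(s_prev))
        return torch.cat(outputs)
```

Timing a few forward-backward passes for each candidate split_size is the simplest way to pick a good value for a given model split.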
Hope you enjoyed the article, please feel free to share your comments.