Distributed Deep Learning
In a world with an abundance of data, processing it quickly and at a reasonable storage cost becomes quite challenging. There are many distributed frameworks that come in handy, starting with MapReduce, Spark, Dask and so on. Here I would like to focus on distributed ML training on GPUs, and more specifically on neural networks.
Before diving into distributed deep learning, here is a quick recap of how training a simple neural network works. Let's imagine that we have a simple neural network with two layers and three nodes in each layer. Each node has its own weights and biases, our trainable parameters. A training step begins with preprocessing our data. We then feed it into our network, which predicts an output (forward pass). We then compare the prediction with the desired label by computing the loss, and in the backward pass we compute the gradients and update the weights based on them. And repeat.
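To make this concrete, here is a minimal sketch of one such training step written in PyTorch (chosen only for illustration); the layer sizes and learning rate are arbitrary placeholders, not values from any particular model:

```python
import torch
import torch.nn as nn

# A toy two-layer network, roughly matching the description above.
model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def training_step(inputs, labels):
    optimizer.zero_grad()
    outputs = model(inputs)          # forward pass: predict the output
    loss = loss_fn(outputs, labels)  # compare prediction with the desired label
    loss.backward()                  # backward pass: compute the gradients
    optimizer.step()                 # update the weights based on the gradients
    return loss.item()               # ...and repeat for the next batch
```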
In a significant number of use cases, deep learning training can be performed on a single machine with a single GPU at relatively high performance and speed. However, there are times when we need even more speed, for example when our data is too large to fit on one machine, or when our hardware is simply not capable enough to handle the training. As a result, we may need to scale out.
Scaling out means adding more GPUs to our system or perhaps using multiple machines inside a cluster. Therefore we need some way to distribute our training efficiently, but that is not so easy in real life. The two main distributed training strategies are data parallelism and model parallelism; the choice depends on your network architecture, but the most commonly used one is data parallelism.
Data Parallelism:
One way of intuitively understanding how data parallelism works is this: the gradient is calculated for a small batch of data (say 30 images at once) on each GPU node, and at the end of one round of forward-backward passes through the network, the updated weights are sent back to the initiating node. The weighted average of the weights from each node is applied to the model parameters, and the updated model parameters are sent back to the nodes for the next round of iteration.
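As an illustration (not taken from the article itself), this is roughly how the gradient-averaging step can be written with PyTorch's torch.distributed API, assuming the process group has already been initialized and each worker has just finished its backward pass on its own mini-batch:

```python
import torch.distributed as dist

def average_gradients(model):
    """Average each parameter's gradient across all workers."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        # Sum this gradient over every worker, then divide by the number of workers.
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size
```

Calling average_gradients(model) between loss.backward() and optimizer.step() makes every worker apply the same averaged update, so all replicas of the model stay in sync.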
There are two main ways of updating the weight parameters:
Synchronous: Let’s say we are dealing with 10k images and 10 nodes; each node is given a subset of 1k images. Once they finish an iteration, the updated model parameters are sent to the parameter server (a node or a group of nodes responsible for synchronizing the model parameters). This approach greatly helps the accuracy of the model, but the downside, of course, is that the server must wait for all the nodes to complete the iteration; a single very slow node can drag down the speed of the whole training run. The risk of the parameter server becoming a bottleneck and a single point of failure can be minimized to some extent by introducing multiple parallel servers and ensuring proper storage redundancy. (A toy sketch of both update styles follows the asynchronous case below.)
Asynchronous: In this case, instead of waiting for all the nodes to send their weight updates, updates are applied as soon as they are available. This increases cluster utilization and improves training speed, but of course it leads to the stale-gradients problem. Most frameworks that implement asynchronous updates apply some strategy to reduce this impact in favor of higher cluster utilization. A related, bandwidth-efficient technique is the ring-allreduce method (http://research.baidu.com/bringing-hpc-techniques-deep-learning/).
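As a toy NumPy illustration of the difference between the two update styles (the blending factor mix below is just a made-up knob for this sketch, not a parameter of any real framework):

```python
import numpy as np

def synchronous_round(worker_weights):
    """Synchronous: the parameter server waits until ALL workers have sent their
    locally updated weights, then averages them into the new global model."""
    return np.mean(np.stack(worker_weights), axis=0)

def asynchronous_update(global_weights, worker_weights, mix=0.1):
    """Asynchronous: a single worker's weights are blended in as soon as they
    arrive, without waiting for the others, so some updates are based on a
    stale copy of the model."""
    return (1.0 - mix) * global_weights + mix * worker_weights
```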
In the ring-allreduce algorithm, each of N nodes communicates with two of its peers 2*(N-1) times. During this communication, a node sends and receives chunks of the data buffer. In the first N-1 iterations, received values are added to the values in the node’s buffer. In the second N-1 iterations, received values replace the values held in the node’s buffer. Baidu’s paper suggests that this algorithm is bandwidth-optimal, meaning that if the buffer is large enough, it will optimally utilize the available network.
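The chunked, two-phase structure is easier to see in a toy single-process simulation (this is only an illustration of the idea, not how Horovod or NCCL actually implement it):

```python
import numpy as np

def ring_allreduce(buffers):
    """Toy simulation of ring-allreduce: N nodes, each holding an equal-length
    buffer, end up with the element-wise sum using 2*(N-1) neighbour exchanges."""
    n = len(buffers)
    # Split every node's buffer into N chunks.
    chunks = [np.array_split(np.asarray(buf, dtype=float), n) for buf in buffers]

    # Phase 1 (scatter-reduce): received chunks are ADDED to the local copy.
    for step in range(n - 1):
        for node in range(n):
            c = (node - step) % n                 # chunk this node sends
            dest = (node + 1) % n                 # its next neighbour on the ring
            chunks[dest][c] = chunks[dest][c] + chunks[node][c]

    # Phase 2 (all-gather): received chunks REPLACE the local copy.
    for step in range(n - 1):
        for node in range(n):
            c = (node + 1 - step) % n             # fully reduced chunk to pass on
            dest = (node + 1) % n
            chunks[dest][c] = chunks[node][c]

    return [np.concatenate(chunks[i]) for i in range(n)]

# Every node ends up with the same summed buffer, e.g.
# ring_allreduce([np.ones(8), 2 * np.ones(8), 3 * np.ones(8)]) -> three arrays of 6s
```

After the 2*(N-1) exchanges, every node holds the full element-wise sum, and each node has only ever talked to its two ring neighbours, which is why the algorithm scales well with cluster size.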
Uber's Horovod package (https://github.com/horovod/) comes in handy for distributing the training process across a cluster of machines. It is available for TensorFlow, PyTorch and Spark.
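A minimal sketch of how Horovod is typically wired into a PyTorch training script looks roughly like this (build_model() is a hypothetical placeholder for your own model code):

```python
import horovod.torch as hvd
import torch

hvd.init()                                # start Horovod (one process per GPU)
torch.cuda.set_device(hvd.local_rank())   # pin each process to its own GPU

model = build_model().cuda()              # build_model() is a placeholder
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers with allreduce.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Make sure every worker starts from the same initial state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```

The script is then launched with something like horovodrun -np 4 python train.py, while the training loop itself stays essentially the same as in the single-GPU case.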
Model Parallelism:
In Model Parallelism, one model is divided into N parts (where N is the number of GPUs; in the figure above, N = 4). Each part of the model is placed on a separate GPU, and the batch is then calculated sequentially on GPU#0, GPU#1, …, GPU#N, which is where forward propagation ends. Backward propagation is done in reverse order, starting with GPU#N and ending with GPU#0.
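In PyTorch this can be as simple as placing different pieces of the model on different devices; a two-GPU toy version (the layer sizes are arbitrary) might look like this:

```python
import torch
import torch.nn as nn

class ModelParallelNet(nn.Module):
    """Toy model split across two GPUs instead of four, for brevity."""
    def __init__(self):
        super().__init__()
        self.part0 = nn.Sequential(nn.Linear(1024, 512), nn.ReLU()).to('cuda:0')
        self.part1 = nn.Linear(512, 10).to('cuda:1')

    def forward(self, x):
        x = self.part0(x.to('cuda:0'))      # forward runs on GPU#0 first...
        return self.part1(x.to('cuda:1'))   # ...then on GPU#1; backward reverses this
```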
The obvious advantage of this approach is that we can train a model that does not fit onto one GPU. However, there is also a drawback: while computation on GPU#i is in progress, all the other GPUs sit idle. This problem is addressed by switching to asynchronous (pipelined) GPU operation, for which a split_size parameter is needed that splits each batch into several smaller parts.
The split_size value has to be tuned by trying several candidates to find the one at which the model trains fastest. Given that there are usually several options for splitting a model, you will have to select split_size for each option separately, which makes using Model Parallelism more complex than using Data Parallelism. A pipelining sketch along these lines is shown below.
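Building on the ModelParallelNet sketch above, a pipelined variant in the spirit of the official PyTorch model-parallel tutorial might look like this (again just a sketch; the default split_size of 20 is arbitrary and should be tuned as described):

```python
class PipelinedNet(ModelParallelNet):
    """Split each batch into micro-batches of split_size samples so that GPU#0
    and GPU#1 can work on different micro-batches at the same time."""
    def __init__(self, split_size=20):
        super().__init__()
        self.split_size = split_size

    def forward(self, x):
        splits = iter(x.split(self.split_size, dim=0))
        s_prev = self.part0(next(splits).to('cuda:0')).to('cuda:1')
        outputs = []
        for s_next in splits:
            outputs.append(self.part1(s_prev))                      # GPU#1 finishes one micro-batch...
            s_prev = self.part0(s_next.to('cuda:0')).to('cuda:1')   # ...while GPU#0 starts the next
        outputs.append(self.part1(s_prev))
        return torch.cat(outputs)
```

Timing a few forward-backward passes for each candidate split_size is the simplest way to pick a good value for a given model split.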
Hope you enjoyed the article, please feel free to share your comments.