Effect of Batch Size on the Training Process and Results, with Gradient Accumulation
This blog post shares our experiment on the MNIST dataset: we investigate the effect of batch size on training and test accuracy, and how a gradient-accumulation strategy can be used to reach an optimal effective batch size.
We investigate batch size in the context of image classification, taking the MNIST dataset as our testbed. It is well known in the machine learning community that it is difficult to make general statements about the effects of hyperparameters, as behaviour often varies from dataset to dataset and model to model. Therefore, our conclusions can only serve as signposts rather than general statements about batch size.
Batch size is one of the most important hyperparameters to tune in modern deep learning systems. Practitioners often want to use a larger batch size to train their models, as it allows computational speedups from the parallelism of GPUs. However, it is well known that too large a batch size can lead to poor generalization. On the other hand, smaller batch sizes have been empirically shown to converge faster to good solutions, as they allow the model to start learning before it has seen all the data. The downside is that the model is not guaranteed to converge to a global optimum.
It is generally accepted that there is a sweet spot for batch size, somewhere between one and the size of the entire training set, that provides the best generalization and hence the highest accuracy; where it lies usually depends on the dataset and the model in question.
Experiment
Our experiment uses a Convolutional Neural Network (CNN) to classify the images in the MNIST dataset (containing images of handwritten digits 0 to 9) into the corresponding digit labels "0" to "9". The figure below shows sample images of the handwritten digits.
The best-known MNIST classifier found on the internet achieves 99.8% accuracy! That's amazing. The best public Kaggle kernel for MNIST achieves 99.75% accuracy using an ensemble of 15 models. This experiment demonstrates how to determine the optimal batch size for training such a classifier.
Architectural highlights:
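The original post does not reproduce the exact layer list, so the following is only a plausible sketch of a small CNN for 28×28 MNIST images, written in PyTorch (framework assumed); the class name MnistCnn and all layer sizes are illustrative, not the authors' actual architecture.

```python
import torch.nn as nn

# Illustrative small CNN for 1x28x28 MNIST images (hypothetical
# architecture; the original post does not list its exact layers).
class MnistCnn(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # 1x28x28 -> 32x28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 32x14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # -> 64x14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 64x7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128),
            nn.ReLU(),
            nn.Linear(128, 10),  # 10 output classes: digits 0-9
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```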
We trained six models, one for each batch size of 32, 64, 128, 256, 512, and 1024 samples, keeping all other hyperparameters the same across models. We then analysed the validation accuracy of each model; a sketch of the sweep follows below.
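A minimal sketch of that batch-size sweep, assuming PyTorch and reusing the hypothetical MnistCnn from the previous snippet; the Adam optimizer and learning rate are assumptions, since the post fixes the remaining hyperparameters but does not state them:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())

# One model per batch size; everything else held fixed.
for batch_size in [32, 64, 128, 256, 512, 1024]:
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    model = MnistCnn()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(20):  # 20 epochs, as in the experiment
        for images, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
```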
The notebook is available on GitHub.
Result:
The plot shows the training-accuracy curves for the different batch sizes, in different colours as per the legend. The x-axis shows the number of epochs (20 in this experiment), and the y-axis shows the training accuracy.
It is evident from the experiment that accuracy improves as the batch size increases; however, the batch size of 512, the second largest in the experiment, trails the largest by only a small margin. Hence 512 can be taken as the best batch size, as it also accounts for optimal utilization of computing resources and lower complexity.
Gradient Accumulation:
Gradient accumulation is a strategy that can overcome the constraint of low GPU memory, which otherwise forces smaller batch sizes when training a model.
It works by splitting the batch of samples used for training a neural network into several mini-batches that are run sequentially.
During backpropagation, the parameters are not updated at every mini-batch step; instead, the gradients are accumulated. Once all the mini-batch steps are complete, the accumulated gradients update the model parameters in a single step. This is effectively the same as training with the larger batch size, because the sum of the mini-batch gradients matches the gradient computed over the combined batch, and the parameters are updated just as often as they would be with the larger batch.
In the experiment code (sketched below), the optimizer is stepped after accumulating gradients from 8 mini-batches of batch size 128, which gives the same net effect as using a batch size of 128 × 8 = 1024.
One thing to keep in mind is the behaviour of BatchNorm layers, which still compute their statistics per mini-batch. For gradient accumulation to be fully effective, you need to replace them with GroupNorm layers, whose statistics do not depend on the batch size.
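As a minimal illustration (assuming PyTorch), the swap amounts to replacing BatchNorm modules with GroupNorm ones; the group count of 8 here is an arbitrary illustrative choice:

```python
import torch.nn as nn

# BatchNorm statistics are computed per mini-batch, so under gradient
# accumulation they reflect the small mini-batch, not the effective
# large batch. GroupNorm normalizes over channel groups within each
# sample, independently of batch size.
bn = nn.BatchNorm2d(64)                            # batch-size dependent
gn = nn.GroupNorm(num_groups=8, num_channels=64)   # batch-size independent
```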
Experiment Code:
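The notebook's actual code is linked rather than reproduced here, so below is a minimal sketch of the accumulation loop described above (mini-batches of 128, optimizer stepped every 8 batches for an effective batch size of 1024), assuming PyTorch and reusing train_set and MnistCnn from the earlier sketches:

```python
import torch
from torch.utils.data import DataLoader

accumulation_steps = 8  # 8 mini-batches of 128 -> effective batch size 1024
loader = DataLoader(train_set, batch_size=128, shuffle=True)
model = MnistCnn()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer
loss_fn = torch.nn.CrossEntropyLoss()

optimizer.zero_grad()
for step, (images, labels) in enumerate(loader):
    loss = loss_fn(model(images), labels)
    # Scale the loss so the accumulated gradient averages over the
    # effective batch rather than summing 8 mini-batch averages.
    (loss / accumulation_steps).backward()  # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()        # one update per 8 mini-batches
        optimizer.zero_grad()   # reset accumulated gradients
# Note: leftover mini-batches at the end of the epoch are dropped
# from the final update in this simplified sketch.
```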
Result:
On submitting to Kaggle's Digit Recognizer competition, we got a score of 99.257% with a batch size of 128. Upon employing gradient accumulation with a step size of 8, the score improved to 99.442%. You can click on the links to view the respective notebooks.
Cheers!