Gaussian distribution and Batch Normalization
Ever wondered why we so often push our data towards a Gaussian distribution before training a model on it? You may be surprised by how much a technique as simple as batch normalization can affect your model’s performance.
The famed bell-shaped curve, which depicts the normal distribution, shows that the values of a variable are more concentrated around its mean than away from it, giving rise to the bell shape. The normal distribution matters because many psychological and educational variables are distributed approximately normally: measures of reading ability, introversion, job satisfaction, and memory are among them. Since the end goal of most ML models is to work on such real-life data, we make our models compatible with these data conditions by exposing them to normally distributed data during training, using techniques like batch normalization.
Batch normalization addresses a problem called internal covariate shift: the change in the distribution of inputs to a layer as the parameters of the preceding layers change during training. Similar distribution shifts also show up in the data itself, for example in medical applications (where you train on samples from one age group but want to classify samples from another) or in finance (due to changing market conditions). Such shifts force the learning algorithm to chase a constantly moving target. Batch normalization counteracts this by standardizing each layer’s inputs for every mini-batch to zero mean and unit variance, followed by a learnable scale and shift.
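As a rough sketch of what a single batch-normalization layer computes (the function and variable names here are illustrative, not taken from any particular library), the per-feature statistics of the current mini-batch are used to standardize the activations before a learnable scale and shift restore the layer’s flexibility:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Standardize a mini-batch per feature, then apply a learnable scale and shift.

    x:     (batch_size, num_features) activations for one mini-batch
    gamma: (num_features,) learnable scale
    beta:  (num_features,) learnable shift
    """
    mean = x.mean(axis=0)                      # per-feature mean over the mini-batch
    var = x.var(axis=0)                        # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)    # zero mean, unit variance per feature
    return gamma * x_hat + beta                # scale and shift learned during training

# Toy example: 4 samples, 3 features, deliberately far from zero mean / unit variance
x = np.random.randn(4, 3) * 10 + 5
out = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0), out.var(axis=0))       # roughly 0 and roughly 1 per feature
```

With gamma fixed at 1 and beta at 0 the output is simply the standardized mini-batch; in a real network both are trained along with the other parameters.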
Apart from standardizing the data distribution, batch normalization improves our model’s performance in several other ways (a short usage sketch follows this list):
1. Networks train faster, as normalization converts elongated loss contours into rounder ones, letting gradient descent make quicker progress.
2. It improves gradient flow through the network by preventing vanishing gradients, which in turn allows higher learning rates.
3. It reduces the model’s sensitivity to weight initialization.
4. It adds a form of regularization to our model, since the mini-batch statistics inject a small amount of noise into each layer’s activations.
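To make this concrete, here is a minimal PyTorch sketch (the layer sizes and learning rate are illustrative assumptions, not a recommendation) showing the common placement of a batch-normalization layer between a linear layer and its activation, paired with a fairly aggressive learning rate:

```python
import torch
import torch.nn as nn

# A small fully connected classifier with batch normalization inserted
# after the linear layer and before the activation (a common placement).
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # normalizes each of the 256 features over the mini-batch
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Because activations start from well-behaved statistics, a relatively
# large learning rate is often usable without destabilizing training.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(64, 784)   # a mini-batch of 64 flattened 28x28 inputs
logits = model(x)          # in train mode, BatchNorm1d uses the batch statistics
print(logits.shape)        # torch.Size([64, 10])
```

Note that at inference time the layer switches (via `model.eval()`) to running estimates of the mean and variance accumulated during training, so predictions do not depend on the composition of the test batch.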