Normalization or Not
During any kind of model training the weights are updated towards an optimum: the values at which passing the input data through each layer's operations produces outputs whose difference from the targets, measured by the loss function, is as small as possible. This, in essence, is what model training is.
Plotting the loss value against each epoch gives a curve that rises and falls over time but eventually settles near its lowest point, ideally the global minimum of the composed functions. The updates come from partially differentiating the loss with respect to each parameter, given the corresponding input features.
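As a minimal sketch (the loss values here are made up, not from the experiment), recording and plotting the loss per epoch looks something like this:

import matplotlib.pyplot as plt

# Hypothetical per-epoch loss values; in practice these come from the training loop
epoch_losses = [2.31, 1.87, 1.95, 1.42, 1.10, 0.98, 1.05, 0.76]

plt.plot(range(1, len(epoch_losses) + 1), epoch_losses, marker="o")
plt.xlabel("Epoch")
plt.ylabel("Training loss")
plt.show()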
The following experiment is grounded in computer vision, but the principles of model training it touches on carry over more or less to other fields of deep learning.
Intuition
When we define a new model, all the weights are initialized with random values. A pretrained model, on the other hand, already has a set of values tuned to give the lowest possible loss on the kind of data it was trained on.
During training the model is basically doing pattern recognition. Taking a convolutional neural network as an example, it takes in the input images and tries to extract features from them. The early layers focus on low-level features such as edges and contours, while the deeper layers capture more abstract, class-specific features. All of this together gives us an array that can be considered the best representation of the image for that class.
Now, to the computer an image is just a numerical array, 2D or 3D depending on the number of channels. Say we have an image of size 100x100: that is 10,000 values per channel to find the best representation from. A standard colour image has three channels with values ranging from 0-255, so we are already at 100x100x3 = 30,000 numbers per image. Multiply that by the number of images and training classes and you see where the question comes from: how can the neural network look at this enormous space of values and still find a combination of weights that separates images which may have very similar pixel patterns but belong to different classes?
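To put a number on it, here is a quick sketch with a random 100x100 RGB array standing in for a real image:

import numpy as np

# Hypothetical 100x100 RGB image with raw pixel values in the 0-255 range
image = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)

print(image.shape)               # (100, 100, 3)
print(image.size)                # 30000 individual values per image
print(image.min(), image.max())  # values spread anywhere between 0 and 255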
Thankfully, neural networks combined with backpropagation are a powerful pairing that lets the layers converge towards good parameter values for our images. But that doesn't mean we cannot help the network learn better.
Normalization helps the network approximate those values and reduce the loss by standardizing the input features and shrinking the range of numbers it has to work with. It improves accuracy and training stability. Together with preprocessing steps like resizing images to a standard size and basic augmentation, it helps the network learn better than simply feeding in raw images.
If the loss landscape during training can be pictured as a distribution of points spread between maximum and minimum values, can the weights similarly be pictured through how they shift during the training process?
Weight Distribution - Chaos or Pattern ?
There were two different thoughts I had and wanted to test through this experiment.
The different kinds of training, i.e. preprocessing pipelines for the input data, would be:
import numpy as np
import torch
from torchvision import transforms

# Baseline: convert to tensor only (ToTensor still rescales PIL pixels to [0, 1])
train_transforms = transforms.Compose([
    transforms.ToPILImage(),
    #transforms.CenterCrop()
    transforms.ToTensor()
])

# Manual min/max scaling: 0-255 pixels to [0, 1] and channel-first layout
def manual_transform(image):
    image = image.astype(np.float32) / 255.0
    image = np.transpose(image, (2, 0, 1))
    return torch.from_numpy(image)

# Resize plus mean/std normalization with the ImageNet statistics
average_scaling_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    #transforms.CenterCrop()
    transforms.ToTensor(),
    transforms.Normalize(  # ImageNet mean and std dev values
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])
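As a rough usage sketch (with a dummy array standing in for a real image from the dataset), the three pipelines can be compared like this:

# Dummy HxWxC uint8 image standing in for a real sample
dummy = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

raw = train_transforms(dummy)                    # ToTensor still maps pixels to [0, 1]
minmax = manual_transform(dummy)                 # explicit 0-255 -> 0-1 scaling
standardized = average_scaling_transform(dummy)  # centred with ImageNet statistics

for name, t in [("baseline", raw), ("min/max", minmax), ("mean/std", standardized)]:
    print(name, tuple(t.shape), float(t.mean()), float(t.std()))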
So the weights would either be randomly initialized (new model) or already carry some inherent distribution (pre-trained model), and during training they would be updated for this new dataset, letting me check and reason about my hypothesis.
The Landscape
Without normalization the distribution should have high variance and look jagged or somewhat irregular. The input features would vary heavily and the optimizer would be working overtime to compensate and pull the loss towards the global minimum.
Min/max scaling might make it very sensitive to outliers and again show only a gradual difference in the distribution.
Mean/standard deviation normalization should be the sweet spot. The weight distribution should be narrower and smoother, and the centered inputs should ensure that the weights do not need massive updates to trigger the activations.
I took two different models, VGG16 and MobileNetV2, from PyTorch's torchvision module. One set of experiments tracked the weights of the first conv2d layer; the other tracked the weights of a deeper conv2d layer. I kept the criterion and optimizer the same, with a basic dynamic learning-rate scheduler, and ran the experiment for 35 epochs. The initial results were honestly not that good.
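A rough sketch of how the first conv2d layer's weights could be snapshotted each epoch (the actual dataset, criterion, optimizer and scheduler from the experiment are not reproduced here):

import torch
from torchvision import models

model = models.vgg16(weights="IMAGENET1K_V1")  # pretrained; weights=None for from-scratch
first_conv = model.features[0]                 # first Conv2d layer in VGG16

weight_history = []
for epoch in range(35):
    # ... one full training pass over the data would go here ...
    # flatten the layer's weights into a 1D array for the distribution plots
    weight_history.append(first_conv.weight.detach().cpu().flatten().numpy().copy())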
The KDE graph of the weight probability distribution for an epoch. A good indicator of how the values in the layer compare across the different training types.
Percentile ribbon, showing both the centre and the spread of the distribution across all epochs.
Violin plot - a good side-by-side comparison of the distribution across all epochs.
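These plots can be produced with seaborn from the flattened per-epoch weights; a sketch, assuming the weight_history list from the snippet above:

import seaborn as sns
import matplotlib.pyplot as plt

# KDE of the weight distribution for a single (here the final) epoch
sns.kdeplot(weight_history[-1], fill=True)
plt.title("Weight distribution, final epoch")
plt.show()

# Violin plot comparing the distribution at the first, middle and final epoch
sns.violinplot(data=[weight_history[0], weight_history[len(weight_history) // 2], weight_history[-1]])
plt.xticks([0, 1, 2], ["first", "middle", "final"])
plt.show()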
The dataset I was using was very different from ImageNet, so I did expect to see a lot of variation in shape and high standard deviation values on the graphs, but as you can see it is more or less the same for all the different training types. VGG16 and MobileNetV2 looked so similar that I have included only one of them. Even my loss values were quite high and changing drastically, so this was quite unexpected.
I can assume that since the model was already pretrained on a very large dataset over many epochs, finetuning only nudges the weights slightly in one particular direction. Maybe it would have been different if I were training all the layers, but as we can see, the different normalization techniques didn't make any difference since the gradient updates were very minimal.
What if I finetune but do not load any pretrained weights, essentially training from scratch?
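In torchvision that is just a matter of not requesting the pretrained weights when building the model, roughly:

from torchvision import models

pretrained_model = models.mobilenet_v2(weights="IMAGENET1K_V1")  # ImageNet-pretrained start
scratch_model = models.mobilenet_v2(weights=None)                # randomly initialized start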
Now here we can see that the distributions are affected; not dramatically, but the differences compared to the previous graphs are visible.
A non-pretrained model has completely randomly initialized weights, and the gradient updates must be significant to steer the network in the direction of the training data. So the crucial hypothesis being supported here is: it does matter how you normalize the data.
Looking at the KDE distribution we can see that the curves are separated and have different shapes. The slight bumps show how the different input feature formats push the same function of the same model to approximate its optimal values differently. For one, with no normalization the weight changes start a bit higher up than for the other types.
The standard deviation of the weights per epoch shows a difference in values between the training types, but within each type it stays essentially fixed across all epochs. Why? The shape barely changes; it is constant. So I would say that while each normalization type defines how the weights are updated, once they lock into their initial characteristics they do not deviate much overall, and the updates end up being more or less the same. Is this because of finetuning, or is it also an indicator of a healthy training methodology? I don't think the standard deviation would behave this way if training were irregular. Mean/std normalization has the highest value and min/max the lowest, meaning the former produced the most exploratory/diverse filters and the latter the most conservative ones.
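For reference, the per-epoch standard deviation is simply the spread of the flattened weights; reusing the weight_history sketch from earlier:

# Per-epoch spread of the flattened layer weights
epoch_stds = [float(w.std()) for w in weight_history]
print(epoch_stds)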
Similar intuition for violin plots, although I do not see much difference here.
Inference
Very interesting observations. In the case of the pretrained models it looks like the updates were very insignificant, and it was the complete opposite for the ones without any inherent training knowledge.
It is also interesting when I compare this to the loss values. While the loss was decreasing, and by significant amounts even for the pretrained model, the same cannot be said for the distribution of the weights, no matter what kind of normalization was applied to the input features.
Also, if you look at the KDE for VGG16 versus MobileNetV2, the former was comparatively wider in the centre. Does the larger model have more values spread around the mean than the smaller one? I am not sure, but it is something interesting to ponder.
I think the most significant thing I have seen with this experiment is that every distribution resembles a normal/Gaussian one, no matter which model, training paradigm or normalization was applied.
This is another example of the Central Limit Theorem at work: take a large number of independent random variables and their sum or average tends towards a normal distribution, regardless of what the individual distributions look like. (Here I flattened the weights of a layer rather than taking a sum or average, but those layer weights are the result of many sequential gradient updates across layers and sums over batch activations, so in a way they are still the outcome of accumulated sums.)
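A quick numerical check of the theorem (uniform variables here, nothing from the experiment):

import numpy as np

# Sum of 50 clearly non-Gaussian (uniform) variables, repeated 100,000 times
samples = np.random.uniform(0, 1, size=(100_000, 50)).sum(axis=1)
print(samples.mean(), samples.std())  # close to 25 and ~2.04; a histogram looks bell-shaped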
We also have BatchNorm layers standardizing the activations to zero mean and unit variance at each layer, which is another reason we see a Gaussian-like distribution for each model's conv2d block.
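A small sanity check of that behaviour, assuming a standalone BatchNorm2d layer in training mode:

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=16)
x = torch.rand(8, 16, 32, 32) * 255     # wildly scaled activations
y = bn(x)                               # standardized per channel during training
print(y.mean().item(), y.std().item())  # close to 0 and 1 respectively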
This matters for neural networks since the Gaussian is a stable, maximum-entropy distribution, and if the weights are Gaussian there is a good chance the outputs follow a similar distribution. That makes training inherently easier since the signal is preserved and the variance stays low. If the distribution were heavily skewed, the network would struggle to learn general features, since it would be biased towards producing only particular ones.
So while the updates to the weights might not be as significant as the changes in the loss or even the learning rate, this was a fun weekend thought experiment, and a nice reminder of how, ultimately, a normal distribution sits at the heart of pattern recognition.