AUTOENCODERS

Let’s say that we want to extract the feeling or emotion of a person in a photograph. Suppose the photograph is a small image of 256 by 256 pixels. Even at that size there are 65,536 pixels, each of which defines one dimension of the input. As we increase the dimensionality, the time needed to train a neural network on the raw data and fit it to detect the emotion grows dramatically. So, we need a way to extract the most important features of a face and represent each image with those lower-dimensional features. An autoencoder works well for this type of problem.

An autoencoder is a type of unsupervised learning algorithm that will find patterns in a dataset by detecting key features.

It is a type of neural net that analyzes all of the images in our dataset and automatically extracts useful features, in such a way that images can be represented and recognized using those features. Autoencoders excel in tasks that require

  1. Feature learning or extraction
  2. Data compression
  3. Learning generative models of data
  4. Dimensionality reduction
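To make the idea concrete, here is a minimal sketch of an autoencoder in NumPy. It is a deliberately simple *linear* autoencoder (no hidden activations), trained by gradient descent on hypothetical toy data that secretly lives on a 2-D plane inside a 10-D space; real autoencoders stack nonlinear layers, but the encode-compress-decode loop is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples that really live on a 2-D plane inside 10-D space.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing

# A minimal linear autoencoder: encode 10 -> 2, decode 2 -> 10.
W_enc = rng.normal(scale=0.1, size=(10, 2))
W_dec = rng.normal(scale=0.1, size=(2, 10))

def loss(X, W_enc, W_dec):
    """Mean squared reconstruction error."""
    return np.mean((X @ W_enc @ W_dec - X) ** 2)

lr = 0.01
initial_loss = loss(X, W_enc, W_dec)
for _ in range(500):
    code = X @ W_enc               # compressed 2-D representation
    recon = code @ W_dec           # reconstruction back in 10-D
    err = recon - X                # reconstruction error
    # Gradients of the reconstruction loss w.r.t. each weight matrix.
    grad_dec = code.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_loss = loss(X, W_enc, W_dec)
```

Because the network is forced to reconstruct the input through the narrow 2-D bottleneck, the encoder weights end up capturing the data's key directions of variation, which is exactly the feature-extraction behavior described above.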

High-dimensional data is a significant problem for machine learning tasks. In fact, data scientists commonly refer to it as the “CURSE OF DIMENSIONALITY”.

CURSE OF DIMENSIONALITY

A high-resolution image from a smartphone would be far larger than our 256-by-256 example. Training time is a function of the number of data points, the dimensionality, and the number of model parameters. So, as we increase the dimensionality, the time to fit our model increases exponentially.

So, what happens when we increase or reduce the dimension of data?

If we have a huge number of dimensions, our data will start to get sparse, which results in an over-allocation of memory and slow training time.

If we have too few dimensions, our data could overlap, resulting in a loss of distinguishing features. Picture the same set of points squeezed into one dimension versus spread across three dimensions. Overlap and sparsity make it difficult to determine the underlying patterns. However, with the proper number of dimensions, the patterns become much clearer.
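The sparsity effect can be demonstrated numerically. In the NumPy sketch below (a hypothetical experiment, not from the article), we draw the same number of points uniformly in a 1-dimensional and a 100-dimensional unit cube and measure the average nearest-neighbour distance: in high dimensions the points end up far apart, which is exactly the sparsity that slows training.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_nn_distance(n_points, n_dims):
    """Average distance from each point to its nearest neighbour,
    for points drawn uniformly in the unit hypercube."""
    pts = rng.uniform(size=(n_points, n_dims))
    # Pairwise Euclidean distances, with self-distance masked out.
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()

low_dim = mean_nn_distance(100, 1)     # 100 points on a line: crowded
high_dim = mean_nn_distance(100, 100)  # same 100 points in 100-D: sparse
```

With a fixed budget of samples, every extra dimension spreads the same data over exponentially more volume, so the neighbourhood structure that learning algorithms rely on thins out.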

It’s important to know that an autoencoder is not the only dimensionality reduction method in machine learning. Principal Component Analysis (or PCA) has been around for a long time and is a classic algorithm for dimensionality reduction. Comparing the two on the same dataset, it is easier to identify the classes in the autoencoder’s output: its separability is far better than PCA’s in this case. Since clean separation is important for applying clustering algorithms, this difference in quality is vital.
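For reference, PCA itself fits in a few lines of NumPy. This sketch implements the classic recipe (centre the data, take the singular value decomposition, project onto the top components); the data here is a random stand-in. PCA is a *linear* projection, which is why a nonlinear autoencoder can separate curved class boundaries that PCA cannot.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 50))  # stand-in high-dimensional dataset

def pca(X, k):
    """Classic PCA: project centred data onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                       # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # (n_samples, k) reduction

X2 = pca(X, 2)  # 50 dimensions down to 2
```

The components come out ordered by explained variance, so the first output column always varies at least as much as the second.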

An autoencoder can extract key image features, improve the training times of other networks, and improve the separability of reduced datasets compared to other methods. For these reasons, the autoencoder was a breakthrough in the unsupervised learning research field.

APPLYING RECURRENT NEURAL NETWORKS TO LANGUAGE MODELING

Language modeling is a gateway into many exciting deep learning applications like speech recognition, machine translation, and image captioning. At its simplest, language modeling is the process of assigning probabilities to sequences of words.

For example, a language model could analyze a sequence of words and predict which word is most likely to follow. Given the sequence “This is an”, there are clearly many options for the next word, but a trained model might predict, with an 80 percent probability, that the word “example” comes next. This reduces to a sequential data analysis problem: the sequence of words forms the context, and the most recent word is the input data.
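The simplest possible language model that assigns such probabilities is a bigram count model. The sketch below (a hypothetical toy corpus, and far simpler than the recurrent networks discussed next) counts how often each word follows each context word and turns the counts into probabilities.

```python
from collections import Counter, defaultdict

# A tiny hypothetical corpus, tokenized by whitespace.
corpus = ("this is an example . this is an apple . "
          "this is a test .").split()

# Count bigrams: how often each word follows each context word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word_probs(word):
    """Probability distribution over the words that can follow `word`."""
    counts = follows[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

probs = next_word_probs("an")  # e.g. {"example": 0.5, "apple": 0.5}
```

A neural language model does the same job, assigning a probability to each candidate next word, but conditions on a much longer context than a single preceding word.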

Using these two pieces of information, we need to output both a predicted word and a new context that includes the input word. Recurrent neural networks are a great fit for this type of problem. At the first time step, a recurrent net receives a word as input along with the initial context and generates an output. The output word, together with the current sequence of words as the context, is then fed back into the network at the second time step, and a new word is predicted. These steps are repeated until the sentence is complete.
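That feed-the-output-back loop can be sketched with a toy bigram model standing in for the recurrent net (same hypothetical corpus idea as before; a real RNN would carry a richer hidden-state context rather than just the last word):

```python
from collections import Counter, defaultdict

corpus = ("this is an example . this is an apple . "
          "this is a test .").split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def generate(start, steps):
    """Greedy generation: predict a word, then feed it back as the new context."""
    word, out = start, []
    for _ in range(steps):
        counts = follows[word]
        if not counts:
            break                               # no known continuation
        word = max(counts, key=counts.get)      # most likely next word
        out.append(word)
    return out

generated = generate("this", 4)
```

Each iteration of the loop mirrors one time step of the recurrent network: the previous output becomes part of the next input.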

A closer look at an LSTM (Long Short-Term Memory) network for language modeling.

In this network, we will use an RNN with two stacked LSTM units. LSTMs are a special kind of recurrent neural network, capable of learning long-term dependencies; remembering information for long periods is their default behavior. To train such a network, we pass each word of the sentence to the network and let it generate an output. For example, after passing the words “this” and “is”, if we pass the word “an” at the third time step, we expect the network to generate the word “example” as output. But we cannot pass a word to the network directly; we have to transform it into a vector of numbers somehow. We can use “word embedding” for this purpose.
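A single LSTM time step can be written out directly. The NumPy sketch below implements the standard LSTM cell equations with randomly initialized weights (an illustration of the mechanics, not a trained model): four gates computed from the input and previous hidden state, a cell state that carries long-term memory, and a hidden state exposed as output.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM time step: gates decide what to forget, write, and expose."""
    n = h.shape[0]
    z = W @ np.concatenate([x, h]) + b   # all four gate pre-activations at once
    i = sigmoid(z[0:n])                  # input gate: what to write
    f = sigmoid(z[n:2*n])                # forget gate: what to keep
    o = sigmoid(z[2*n:3*n])              # output gate: what to expose
    g = np.tanh(z[3*n:4*n])              # candidate cell values
    c_new = f * c + i * g                # update long-term memory
    h_new = o * np.tanh(c_new)           # expose part of it as the output
    return h_new, c_new

embed_dim, hidden = 200, 256
W = rng.normal(scale=0.1, size=(4 * hidden, embed_dim + hidden))
b = np.zeros(4 * hidden)

x = rng.normal(size=embed_dim)               # one embedded word
h = np.zeros(hidden); c = np.zeros(hidden)   # initial (empty) context
h, c = lstm_step(x, h, c, W, b)
```

The forget gate `f` is what lets the cell state carry information across many time steps, which is the "remembering by default" behavior described above.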

What happens in a word embedding?

An interesting way to process words is to map them into a structure known as a word embedding: an n-dimensional vector of real numbers for each word. The vector is typically large, for example, of length 200. The word “example”, then, becomes a list of 200 real numbers.
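In code, an embedding is just a table mapping each vocabulary word to its vector. This sketch builds a randomly initialized one for a hypothetical six-word vocabulary; the random values are only a starting point, since training is what later moves the vectors into meaningful positions.

```python
import numpy as np

rng = np.random.default_rng(4)

vocab = ["this", "is", "an", "example", "zero", "none"]
embed_dim = 200

# One 200-dimensional real vector per word, randomly initialized.
# Training would later adjust these values based on each word's contexts.
embedding = {word: rng.normal(size=embed_dim) for word in vocab}

vec = embedding["example"]  # the 200 numbers that stand in for "example"
```

Looking a word up in this table is the first operation the language model performs at every time step.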

So, how do we find the proper values for these vectors?

In our RNN model, the vectors (collectively, the embedding matrix for the vocabulary) are initialized randomly for all the words that we are going to use in training. Then, during the recurrent network’s training, the vector values are updated based on the contexts in which each word appears. So, words that are used in similar contexts end up in similar positions in the vector space. This can be visualized with a dimensionality-reduction algorithm: after training the RNN, if we plot the words by their embedding vectors, words are grouped together either because they are synonyms or because they are used in similar places within a sentence. For example, the words “zero” and “none” are close semantically, so it is natural for them to be placed close together. And while “Italy” and “Germany” are not synonyms, they can be interchanged in many sentences without distorting the grammar.

Imagine that the input is a batch containing a single sequence of 20 words. Assume the vocabulary size is 10,000 words and the length of each embedding vector is 200. We look up those 20 words in the randomly initialized embedding matrix and feed them into the first LSTM unit. Note that only one word is fed into the network at each time step, and one word is produced as output; over 20 time steps, the output is 20 words.

In our network, we have 2 LSTM units, with arbitrary hidden sizes of 256 and 128, so the output of the second LSTM unit is a matrix of size 20-by-128. We then need a softmax layer to calculate the probability of the output words. It "squashes" each 128-dimensional vector of real values into a 10,000-dimensional vector, which is our vocabulary size. This means that the output of the network at each time step is a probability vector of length 10,000, and the output word is the one with the maximum probability in that vector. Finally, we compare the sequence of 20 output words with the ground-truth words, calculate the error as a quantitative value, the so-called loss value, and back-propagate the errors into the network.
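These tensor shapes can be checked with a short NumPy sketch. Plain matrices with a tanh stand in for the two LSTM units here (an assumption made purely to trace the shapes; real LSTMs have gates and recurrent state), but the dimensions at each stage match the description: 20×200 embeddings, 20×256, 20×128, then a 20×10,000 softmax.

```python
import numpy as np

rng = np.random.default_rng(5)

seq_len, vocab_size, embed_dim = 20, 10_000, 200
hidden1, hidden2 = 256, 128

# Random stand-ins for the trained layers, used only to trace shapes.
embedding = rng.normal(size=(vocab_size, embed_dim))
W1 = rng.normal(scale=0.01, size=(embed_dim, hidden1))
W2 = rng.normal(scale=0.01, size=(hidden1, hidden2))
W_out = rng.normal(scale=0.01, size=(hidden2, vocab_size))

word_ids = rng.integers(0, vocab_size, size=seq_len)  # the 20 input words
x = embedding[word_ids]                               # (20, 200) lookup
h1 = np.tanh(x @ W1)                                  # (20, 256) first unit
h2 = np.tanh(h1 @ W2)                                 # (20, 128) second unit
logits = h2 @ W_out                                   # (20, 10000) scores

# Softmax squashes each 128-d output into a probability vector over the vocab.
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
predicted = probs.argmax(axis=1)                      # most likely word ids
```

Every row of `probs` is a valid distribution over the 10,000-word vocabulary, and taking the argmax of each row recovers the 20 predicted words that get compared against the ground truth.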

We will not train the model using only one sequence. We will use a group of sequences to train it and calculate the error. So, instead of feeding one sequence at a time, we can feed the network a batch of sentences in each iteration, perhaps 60 sentences at once.

What does the network learn when the error is propagated back, in each iteration?

As previously noted, the weights keep updating based on the error of the network in training. First, the embedding matrix is updated in each iteration. Second, the weight matrices attached to the gates in the LSTM units are changed.

And finally, the weights of the softmax layer, which plays the decoding role for the words encoded in the embedding layer, are updated. This is how we use an LSTM for language modeling.
