Word2Vec in a Nutshell: Teaching a computer how to read
In today’s article of our 'in a nutshell' series we cover the process of transforming text into a computer-readable format and how to extract meaningful information from it.
Word embeddings
We as humans can easily take in, process and understand sentences in any language we speak. A computer, on the other hand, has no concept of languages, sentences or words. Even though we can tell a program that a certain input is text, we cannot gain any information from it beyond statistical measures, e.g. word frequencies. For the longest time, natural language processing (NLP) tasks like machine translation were based on statistics and if-then logic, which required a huge corpus of words for each language to be processed.
Example transformation from word into vector.
To make words interpretable by a computer, we need to transform them into a format it can understand. For this purpose, we use so-called word embeddings: we assign each word a numerical vector that represents its meaning. This creates a high-dimensional space of word vectors in which we can measure the similarity or sentiment of words and sentences.
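To make this concrete, here is a minimal sketch of such a mapping from words to vectors. The words, the dimensionality and all numbers are made up for illustration; real embeddings are learned from large text corpora, as described below.

```python
import numpy as np

# Illustrative 3-dimensional embeddings with made-up values; real
# models learn vectors with dozens to hundreds of dimensions.
embeddings = {
    "cat": np.array([0.8, 0.3, 0.1]),
    "dog": np.array([0.7, 0.4, 0.1]),
    "car": np.array([0.1, 0.2, 0.9]),
}

print(embeddings["cat"])  # the numerical representation of "cat"
```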
Neural Networks
But how do we achieve such a numerical representation? For this purpose, we need machine learning models that take in an arbitrary initial word vector and gradually shift it towards an accurate representation of the underlying word. The most promising model for this type of task is the neural network.
Sketch of a neural network with 3 layers. One input row corresponds to one output row, marked by colour.
Neural networks in general are a generic sequence of nonlinear transformations that, in theory, can be used to approximate any data-target relationship. They are built out of many layers consisting of nodes, where each node takes as input the outputs of all nodes in the previous layer. Inside a node, the input is multiplied with a weight vector and the result is put into a nonlinear activation function. The nodes of a layer can also be understood as one large matrix that is multiplied with the input, with each node forming a column of this layer matrix. This creates a complex network of connections between all nodes that can be used to reproduce the desired relationship between input and output.
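As a minimal sketch of this idea, the following code chains two such layer transformations in numpy. The layer sizes, the random weights and the tanh activation are arbitrary choices for illustration:

```python
import numpy as np

def layer(x, W, b):
    """One layer: multiply the input with the weight matrix, add a
    bias and put the result into a nonlinear activation (here: tanh)."""
    return np.tanh(x @ W + b)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))                    # one input row with 4 features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)  # layer matrix with 3 nodes
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)  # layer matrix with 2 nodes

output = layer(layer(x, W1, b1), W2, b2)       # chain the transformations
```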
Before constructing a network, we have to bring the words into a form that the layer matrices can operate on. For this purpose, we define our input for the network as an excerpt of size C from a word vocabulary of size V. These words are transformed into vectors through one-hot encoding: each word is given an index number that identifies it inside the vocabulary, and its encoded vector consists of all zeros except for a single 1 at that index. The vector now uniquely represents its respective word in a numerical way we can work with. The encoded vectors are stacked into a C x V input matrix that will later be fed into the network.
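In code, the one-hot encoding could look as follows. The vocabulary is an illustrative stand-in of size V = 6, chosen to match the running example:

```python
import numpy as np

vocabulary = ["a", "cat", "is", "my", "pet", "dog"]     # illustrative, V = 6
V = len(vocabulary)
index = {word: i for i, word in enumerate(vocabulary)}  # word -> index number

def one_hot(word):
    """All zeros except for a single 1 at the word's index."""
    v = np.zeros(V)
    v[index[word]] = 1.0
    return v

context = ["a", "cat", "is", "my"]           # excerpt of size C = 4
X = np.stack([one_hot(w) for w in context])  # C x V input matrix
```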
Example vocabulary of size V=6. Coloured entries form a context set.
With our input ready, we have to define the task the network should solve. There are two main approaches for obtaining word vectors: the continuous bag of words (CBOW) and the skip-gram method. In the CBOW approach we feed a set of related words, so-called context words, into the network, from which we remove exactly one word. The network should then predict the word that is missing from the set. The skip-gram method can be seen as the exact reversal of the CBOW approach: here we input one word and try to predict its contextually related words. For now, we will focus on the former method.
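The following sketch shows one simple way to turn context sets into training examples for both approaches; in practice, context sets are usually gathered by sliding a window over running text:

```python
def cbow_pairs(context_sets):
    """CBOW: hold out each word once; the remaining words are the
    input set, the held-out word is the prediction target."""
    for words in context_sets:
        for i, target in enumerate(words):
            yield words[:i] + words[i + 1:], target

def skipgram_pairs(context_sets):
    """Skip-gram, the reversal: one input word, and each of its
    context words becomes a prediction target."""
    for words in context_sets:
        for i, center in enumerate(words):
            for target in words[:i] + words[i + 1:]:
                yield center, target

pairs = list(cbow_pairs([["a", "cat", "is", "my", "pet"]]))
# e.g. (["a", "cat", "is", "my"], "pet") is one of the training pairs
```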
CBOW network example to predict the word 'pet'.
A network for the CBOW approach takes in our previously defined word matrix of size C x V and multiplies it with a weight matrix W of size V x N in the input layer, i.e. the input is fed into a layer consisting of N nodes, where N is the size we want the word embeddings to have at the end. The result is then put into a layer consisting of V nodes to regain a column entry for each word of the vocabulary. Averaging this matrix over its rows gives a 1 x V matrix that we feed into a softmax function, which produces values between zero and one in each entry. These values can be interpreted as probabilities for each word to be the word missing from the set. We want the network to assign the highest probability to the entry at the index of the missing word.
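Put into code, this forward pass could look like the following sketch with V = 6, N = 2 and C = 4 as in the running example; the weight values are random, as they would be before training:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, C = 6, 2, 4                           # vocabulary, embedding, context sizes
W = rng.normal(scale=0.1, size=(V, N))      # input weights, the later word vectors
W_out = rng.normal(scale=0.1, size=(N, V))  # output layer with V nodes

def softmax(z):
    e = np.exp(z - z.max())                 # numerically stable softmax
    return e / e.sum()

X = np.eye(V)[[0, 1, 2, 3]]                 # C x V one-hot rows ("a cat is my")
hidden = X @ W                              # C x N
scores = hidden @ W_out                     # C x V, one entry per vocabulary word
probs = softmax(scores.mean(axis=0))        # average over rows, then softmax
# probs[i]: predicted probability that vocabulary word i is the missing word
```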
As we know which word is missing from the set, we can use a loss function to compute an error value; a common choice is the cross-entropy loss. We can then compute the gradient of the loss function and use optimization algorithms to train the weight matrices. The process of presenting the correct output to the network to train the weight matrices is called supervised learning. We repeat this training process multiple times over all possible context sets of words from our vocabulary until the resulting loss is sufficiently low.
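A sketch of one such supervised training step, continuing the forward pass above, could look as follows. The gradients are written out by hand for this small network; real implementations rely on automatic differentiation and additional tricks such as negative sampling:

```python
def train_step(X, target, W, W_out, lr=0.1):
    """One gradient descent step on the cross-entropy loss."""
    # Averaging the hidden rows first is equivalent to averaging the
    # output rows, because the output layer is linear.
    h = (X @ W).mean(axis=0)                 # averaged hidden vector, size N
    probs = softmax(h @ W_out)               # predicted word distribution
    loss = -np.log(probs[target])            # cross-entropy for the true word
    grad = probs.copy()
    grad[target] -= 1.0                      # d(loss)/d(scores) for softmax + CE
    dW_out = np.outer(h, grad)               # gradient for the output weights
    dW = np.outer(X.mean(axis=0), W_out @ grad)  # gradient for the input weights
    W_out -= lr * dW_out                     # update both matrices in place
    W -= lr * dW
    return loss

loss = train_step(X, 4, W, W_out)  # index 4 = "pet" in our illustrative vocabulary
```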
Once we have trained our network, we can extract the initially wanted word vectors from the weight matrix W. This matrix now contains an N-dimensional representation of all V words of the vocabulary. Predicting contextually related words was just a constructed task to train the entries inside the matrix W.
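Extracting an embedding then reduces to reading a row of W, continuing the sketches above:

```python
def word_vector(word):
    """Row of the trained input matrix W belonging to the word."""
    return W[index[word]]  # equivalent to one_hot(word) @ W

vec = word_vector("pet")   # the N-dimensional embedding of "pet"
```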
In the example depicted in the picture we select 4 words out of our vocabulary of size 6. The network we apply to predict the word “pet” creates 2-dimensional word vectors inside the matrix W. If we were to plot these vectors, the result could look like the following.
Visualization of word vectors gained from a CBOW approach. Similarity between words is given by the cosine value of the angle between them.
Application
With the word embeddings created by the neural network we can now compute the similarity between words by calculating the cosine similarity of their corresponding vectors. This measure is based on the angle between two vectors, with an angle of 0° resulting in a perfect similarity of 1. If we construct word vectors in a 2-dimensional space, we can directly read off the similarity of two words by looking at the angle between them. Similar words will group together, while very different words will point in very different directions.
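The cosine similarity itself is a one-liner; the example vectors below are made up to mimic the 2-dimensional picture above:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 for the same
    direction, 0 for orthogonal, -1 for opposite vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([0.8, 0.3])
dog = np.array([0.7, 0.4])
car = np.array([-0.5, 0.6])

print(cosine_similarity(cat, dog))  # close to 1: similar words
print(cosine_similarity(cat, car))  # negative: very different words
```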
We can also use the word embeddings to extract further information from the text. One example is sentiment analysis. Here we again use neural networks, this time to assign sentence embeddings to a sentiment class that indicates whether a sentence is negative (class 0) or positive (class 1). For this process we first remove all stop words from the sentence, i.e. words like “the”, “is”, “a” and “and”. These words carry little information as they appear in nearly all sentences. Then we compute a sentence embedding as the average of all remaining word embeddings. This embedding is fed into a neural network to classify the sentence. We can also determine the sentiment of documents or short texts, e.g. tweets, by averaging the sentiment of all sentences inside.
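A minimal sketch of the averaging step could look like this; the stop word list is a tiny illustrative stand-in, and the classifier that consumes the sentence embedding is not shown:

```python
import numpy as np

STOP_WORDS = {"the", "is", "a", "and"}  # tiny illustrative list

def sentence_embedding(sentence, embeddings):
    """Average of the embeddings of all non-stop words in the sentence."""
    words = [w for w in sentence.lower().split() if w not in STOP_WORDS]
    return np.mean([embeddings[w] for w in words], axis=0)

# e.g. with the trained vectors from above: {w: W[index[w]] for w in vocabulary}
vec = sentence_embedding("my cat is a pet", {
    "my": np.array([0.1, 0.9]),
    "cat": np.array([0.8, 0.3]),
    "pet": np.array([0.7, 0.5]),
})
```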
Sketch of a sentiment analysis process.
The transformation of words into vectors enables a multitude of applications and has tremendously increased the accuracy of existing approaches like machine translation. Platforms like DeepL, Apple’s Siri or Amazon’s Alexa are great achievements in the field of natural language processing, and the research is only progressing further. Through machine learning we are able to provide Q&A bots and nearly perfect machine translations. In combination with other techniques like speech recognition, we can establish broad service lines that can communicate with people without the intervention of another person.