Natural Language Processing with TensorFlow - Part II (How to visualize word embeddings generated by the word2vec model in TensorBoard)
Neural networks accept only numeric inputs. So when we have textual data, we convert it into a numeric or vector representation before feeding it to the network. There are various methods for converting input text to numeric form; popular ones include term frequency-inverse document frequency (tf-idf) and bag of words (BOW). However, these methods do not capture the semantics of words: they have no notion of what a word actually means.
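As a quick illustration of this limitation, here is a minimal sketch using scikit-learn's CountVectorizer (scikit-learn is an assumption here; it is not used elsewhere in this post) showing that a bag-of-words vector only counts words and ignores their order and meaning:

# A minimal sketch: bag-of-words counts word occurrences and discards word order.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog barks at strangers", "strangers bark at the dog"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))  # the learned vocabulary, in feature order
print(X.toarray())                     # two different sentences, near-identical count vectors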
What is Word2Vec?
Word2Vec learns the meaning of a given word by looking at its context and representing it numerically. By context, we refer to a fixed number of words in front of or behind the word of interest. Word2Vec techniques use the context of a given word to learn its semantics.
Let's look at two different types of Word2Vec model: the continuous bag-of-words (CBOW) model and the skip-gram model.
Understanding the CBOW model
Let's say we have a neural network with an input layer, a hidden layer, and an output layer. The goal of the network is to predict a word given its surrounding words. The word that we are trying to predict is called the target word and the words surrounding the target word are called the context words.
How many context words do we use to predict the target word? We use a window size to choose the context words. If the window size is 2, then we use the two words before and the two words after the target word as the context words.
Let's consider the sentence "The dog barks at strangers" with the word barks as the target word. If we set the window size to 2, then we take the words the and dog (the two words before) and at and strangers (the two words after the target word barks) as context words, as shown in the following figure:
So the input to the network is the context words and the output is the target word. How do we feed these inputs to the network? The neural network accepts only numeric input, so we cannot feed the raw context words directly. Hence, we convert all the words in the given sentence into numeric form using the one-hot encoding technique, as shown below:
The = [1 0 0 0 0]
dog = [0 1 0 0 0]
barks = [0 0 1 0 0]
at = [0 0 0 1 0]
strangers = [0 0 0 0 1]
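As a quick illustration, here is a minimal sketch of building such one-hot vectors in Python (the vocabulary list and helper function are just for illustration):

# Build one-hot vectors for a toy vocabulary (illustrative helper, not part of the model code).
vocabulary = ['the', 'dog', 'barks', 'at', 'strangers']

def one_hot(word, vocabulary):
    # a vector of zeros with a single 1 at the word's index
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

print(one_hot('barks', vocabulary))  # [0, 0, 1, 0, 0]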
The architecture of the CBOW model is shown in the following figure. As you can see, we feed the context words, the, dog, at, and strangers, as inputs to the network and it predicts the target word barks as an output:
In the initial iteration, the network cannot predict the target word correctly. But over a series of iterations, it learns to predict the correct target word using gradient descent. With gradient descent, we update the weights of the network and find the optimal weights with which we can predict the correct target word.
As we have one input, one hidden, and one output layer, as shown in the preceding figure, we will have two weight matrices:
- Input layer to hidden layer weights - W
- Hidden layer to output layer weights - W'
After training, the input-to-hidden weight matrix W is what we are after: each row of W is the embedding (vector representation) of the corresponding word in the vocabulary.
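To make these matrices concrete, here is a minimal NumPy sketch of a single CBOW forward pass for the example above (the dimensions, random initialization, and indices are purely illustrative; a real implementation would also compute the loss and update W and W' with gradient descent):

import numpy as np

V, N = 5, 3                      # vocabulary size, embedding (hidden layer) size
W = np.random.rand(V, N)         # input-to-hidden weights; row i is word i's embedding
W_prime = np.random.rand(N, V)   # hidden-to-output weights

# one-hot context words for 'the', 'dog', 'at', 'strangers' (indices 0, 1, 3, 4)
context = np.zeros((4, V))
context[[0, 1, 2, 3], [0, 1, 3, 4]] = 1

h = context.dot(W).mean(axis=0)                # hidden layer: average of context embeddings
scores = h.dot(W_prime)                        # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary
print(probs.argmax())                          # the model's current guess for the target word's index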
Understanding the skip-gram model
Now, let's look at another interesting type of the word2vec model, called skip-gram. Skip-gram is just the reverse of the CBOW model: in a skip-gram model, we try to predict the context words given the target word as input. As shown in the following figure, we have the target word barks and we need to predict the context words the, dog, at, and strangers:
Similar to the CBOW model, we use the window size to determine how many context words we need to predict. The architecture of the skip-gram model is shown in the following figure.
As we can see, it takes a single target word as input and tries to predict multiple context words:
So the skip-gram model takes one target word as input and returns the context words as output, as shown in the preceding figure. After training it to predict the context words, the weights between the input and hidden layers become the vector representations of the words, just as we saw in the CBOW model. Now we have an understanding of both word2vec models.
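To make the training setup concrete, here is a small illustrative sketch of how (target, context) pairs can be generated for skip-gram with a window size of 2 (the sentence and variable names are assumptions for illustration):

# Generate (target, context) training pairs for skip-gram with a window of 2.
sentence = "the dog barks at strangers".split()
window = 2
pairs = []
for i, target in enumerate(sentence):
    # every word within the window around the target becomes a context word
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs[:4])  # [('the', 'dog'), ('the', 'barks'), ('dog', 'the'), ('dog', 'barks')]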
Word2Vec model using gensim
(textAnalysis) Somanaths-MacBook-Pro:~ conda install nltk -y
Downloading and Extracting Packages
nltk-3.4.5 | 2.1 MB | ######################### | 100%
openssl-1.0.2u | 3.0 MB | ######################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(textAnalysis) Somanaths-MacBook-Pro:~ somanathnanda$ python
Python 3.6.2 |Anaconda, Inc.| (default, Oct 5 2017, 03:00:07)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/somanathnanda/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
True
Read the data:
Data can be downloaded here.
"""
Created on Thu Jan 9 12:42:47 2020
@author: somanathnanda
"""
import re
import pandas as pd
from nltk.corpus import stopwords
stopWords = stopwords.words('english')
# modelling
from gensim.models import Word2Vec
from gensim.models import Phrases
from gensim.models.phrases import Phraser
import tensorflow as tf
#Load the data
data = pd.read_csv('text_data.txt',header=None,sep='\n')
print(data)
######## output: #######
runfile('/Users/somanathnanda/baba/DL-NN/projects/textAnalysis/word2vec_using_gensim.py', wdir='/Users/somanathnanda/baba/DL-NN/projects/textAnalysis')
0
0 The quick brown fox jumps over the lazy dog.
1 As the use of typewriters grew in the late 19t...
2 In the age of computers, this pangram is commo...
3 Elizabeth Angela Marguerite Bowes-Lyon (4 Augu...
4 After a successful visit to Northern Ireland i...
5 On 20 January 1936, King George V died and Alb...
6 During the Second World War, the King and Quee...
7 Napoleon Bonaparte (15 August 1769 – 5 May 182...
Let's preprocess the data and see how it looks:
#Load the data
data = pd.read_csv('text_data.txt',header=None,sep='\n')
print('----------------------')
print('Before preprocessing:')
print('----------------------')
print(data[0][2])
# preprocess data
def pre_process(text):
    # convert to lowercase
    text = str(text).lower()
    # remove special characters, keeping only alphanumeric characters, whitespace, and periods
    text = re.sub(r'[^A-Za-z0-9\s.]', r'', text)
    # replace newlines with spaces
    text = re.sub(r'\n', r' ', text)
    # remove stop words
    text = " ".join([word for word in text.split() if word not in stopWords])
    return text
data[0] = data[0].map(lambda x: pre_process(x))
print('---------------------')
print('After Preprocessing:')
print('---------------------')
print(data[0][2])
######### OUTPUT #########
----------------------
Before preprocessing:
----------------------
In the age of computers, this pangram is commonly used to display font samples and for testing computer keyboards. In cryptography, it is commonly used as a test vector for hash and encryption algorithms to verify their implementation, as well as to ensure alphabetic character set compatibility.
---------------------
After Preprocessing:
---------------------
age computers pangram commonly used display font samples testing computer keyboards. cryptography commonly used test vector hash encryption algorithms verify implementation well ensure alphabetic character set compatibility.
The gensim library requires input in the form of a list of lists, i.e.:
text = [ [word1, word2, word3], [word1, word2, word3] ]
We know that each row in our data contains a set of sentences. So we first split each row on '.' to get a list of sentences, and then split each sentence on spaces ' ' to get a list of words. This gives us our data as a list of lists:
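For instance, on a toy string (purely illustrative), the two-step split looks like this:

text = "the quick brown fox jumps. the dog barks at strangers"
sentences = text.split('.')                  # first split into sentences
toy_corpus = [s.split() for s in sentences]  # then split each sentence into words
print(toy_corpus)
# [['the', 'quick', 'brown', 'fox', 'jumps'], ['the', 'dog', 'barks', 'at', 'strangers']]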
Now the problem is that our corpus contains only unigrams, so it will not return results when we give a bigram as input, for example 'san francisco'.
So we use gensim's Phrases function, which collects words that frequently occur together and joins them with an underscore: 'san francisco' becomes 'san_francisco'. We set the min_count parameter to 25, which means we ignore all words and bigrams that appear fewer than 25 times.
Build the Model
Now let us build the model. First, let us define some of the important hyperparameters it needs:
- size represents the dimensionality of the vector used to represent a word. The size can be chosen according to our data: for a very small dataset a small value is enough, while for a significantly large dataset a size of 300 is common. In our case, we set size to 100.
- window represents the maximum distance between the target word and its neighboring words; words beyond the window size from the target word are not considered for learning. Typically, a small window size is preferred.
- min_count represents the minimum frequency of a word: if a particular word occurs fewer than min_count times, we simply ignore it.
- workers specifies the number of worker threads used to train the model.
- sg=1 means we use the skip-gram method for training; sg=0 means we use CBOW.
Continuing from the preprocessed data above, we split each row into sentences and words to build the corpus, detect the bigram phrases, and train the model:
data = data[0].map(lambda x: x.split('.'))
print(data)

corpus = []
for i in range(len(data)):
    for line in data[i]:
        words = [x for x in line.split()]
        corpus.append(words)
print(corpus[2])

# detect frequent bigrams and join their words with an underscore
phrases = Phrases(sentences=corpus, min_count=25, threshold=50)
bigram = Phraser(phrases)
for index, sentence in enumerate(corpus):
    corpus[index] = bigram[sentence]
size = 100
window_size = 2
epochs = 100
min_count = 2
workers = 4
sg = 1

model = Word2Vec(corpus, sg=sg, window=window_size, size=size, min_count=min_count, workers=workers, iter=epochs)
model.save('word2vec.model')
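Since we saved the model, it can be loaded back later instead of retraining. A small sketch (in gensim 3.x, the trained vectors are accessed through model.wv):

from gensim.models import Word2Vec

# load the previously saved model from disk
model = Word2Vec.load('word2vec.model')
print(model.wv['king'].shape)  # a 100-dimensional vector, matching our size hyperparameter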
Evaluate the Embeddings
After training the model, we evaluate it. Let us see what the model has learned and how well it has understood the semantics of the words. Gensim provides a most_similar function, which gives us the top similar words for a given word.
print(model.most_similar('king'))
###################
[('became', 0.9993757009506226), ('elizabeth', 0.999358057975769), ('tour', 0.9992202520370483), ('area', 0.9992145299911499), ('ii', 0.9992124438285828), ('spanish', 0.9992071390151978), ('second', 0.999194860458374), ('military', 0.9991846680641174), ('wars', 0.9991650581359863), ('queen', 0.999160885810852)]
###################
print(model.most_similar('war'))
[('became', 0.9993777871131897), ('service', 0.9993032813072205), ('city', 0.9992703199386597), ('elizabeth', 0.9992671012878418), ('military', 0.9992603063583374), ('french', 0.9992287158966064), ('august', 0.9992210865020752), ('northern', 0.9992187023162842), ('end', 0.9992092847824097), ('5', 0.9992049336433411)]
#############
print(model.most_similar('spain'))
#############
File "/Users/somanathnanda/baba/DL-NN/envs/textAnalysis/lib/python3.6/site-packages/gensim/models/keyedvectors.py", line 468, in word_vec
raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'spain' not in vocabulary"
############
The KeyError occurs because 'spain' does not appear at least min_count times in our small corpus after preprocessing, so it was never added to the model's vocabulary.
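To guard against this, we can check the model's vocabulary before querying; a small defensive sketch (in gensim 3.x the vocabulary is exposed as the model.wv.vocab dictionary):

word = 'spain'
if word in model.wv.vocab:
    # the word was seen at least min_count times during training
    print(model.most_similar(word))
else:
    print("'%s' is not in the vocabulary" % word)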
to be continued.....
Hi. I am planning to integrate this NLP module with a live video stream for real-time conversation and interaction. Ping me back if you would like to collaborate.