Natural Language Processing with TensorFlow - Part I
I am learning this material myself and thought I would share it along the way.
NLP preprocessing in TensorFlow/Keras can be broken down into a few building blocks, which are covered one by one below.
text_to_word_sequence:
It converts a text into a sequence of words (or tokens).
import tensorflow as tf
from tensorflow.keras.preprocessing.text import text_to_word_sequence
sentence='Would you like to go for a coffee?'
tokens=text_to_word_sequence(sentence)
print(tokens)
print('Vocab Size : '+str(len(set(tokens))))
# ======= OUTPUT ========
['would', 'you', 'like', 'to', 'go', 'for', 'a', 'coffee']
Vocab Size : 8
By default this function automatically does three things (you can pass custom values to these arguments for your use case; a short sketch follows the list below):
- Splits words by space (split=" ")
- Filters out punctuation (filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
- Converts text to lower case (default lower=True)
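A minimal sketch (assuming a TensorFlow 2.x install) of overriding those defaults; the capitalised 'COFFEE' is added here only to show the effect of lower=False:
from tensorflow.keras.preprocessing.text import text_to_word_sequence
sentence = 'Would you like to go for a COFFEE?'
# keep the original case, keep punctuation, and still split on spaces
tokens = text_to_word_sequence(sentence, filters='', lower=False, split=' ')
print(tokens)  # ['Would', 'you', 'like', 'to', 'go', 'for', 'a', 'COFFEE?']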
hashing_trick:
- Converts a text to a sequence of indexes in a fixed-size hashing space. Two or more words may be assigned to the same index due to collisions in the hashing function.
- The probability of a collision depends on the size of the hashing space relative to the number of distinct words (the vocabulary size); increasing the hashing-space size reduces collisions.
- Internally it calls text_to_word_sequence() to generate the token sequence and then hashes each token to return a list of integer indexes.
import tensorflow as tf
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.text import hashing_trick
sentence=('Because many writers abuse long sentences, cramming too many '
          'thoughts into each sentence, muddling up their message and '
          'leaving readers confused.')
tokens=text_to_word_sequence(sentence)
vocab_size=len(set(tokens))
print(tokens)
print('vocab size: '+str(vocab_size))
one_hot_result =one_hot(sentence,round(vocab_size*1.1))
print('---- one hot encoding result ----')
print(one_hot_result)
hashing_trick_result=hashing_trick(sentence,round(vocab_size*1.1),hash_function='md5')
print('---- hashing trick result md5 : 1.1 times large vocab size ----')
print(hashing_trick_result)
hashing_trick_result=hashing_trick(sentence,round(vocab_size*1.1),hash_function=hash)
print('---- hashing trick result hash ----')
print(hashing_trick_result)
hashing_trick_result=hashing_trick(sentence,round(vocab_size*1.5),hash_function='md5')
print('---- hashing trick result md5 : 1.5 times large vocab size ----')
print(hashing_trick_result)
# ======= OUTPUT ========
['because', 'many', 'writers', 'abuse', 'long', 'sentences', 'cramming', 'too', 'many', 'thoughts', 'into', 'each', 'sentence', 'muddling', 'up', 'their', 'message', 'and', 'leaving', 'readers', 'confused']
vocab size: 20
---- one hot encoding result ----
[9, 6, 5, 13, 20, 16, 18, 2, 6, 4, 20, 1, 20, 10, 12, 13, 13, 14, 14, 17, 1]
---- hashing trick result md5 : 1.1 times large vocab size ----
[8, 19, 16, 3, 7, 1, 7, 15, 19, 3, 3, 17, 18, 2, 18, 4, 15, 21, 15, 17, 19]
---- hashing trick result hash ----
[9, 6, 5, 13, 20, 16, 18, 2, 6, 4, 20, 1, 20, 10, 12, 13, 13, 14, 14, 17, 1]
---- hashing trick result md5 : 1.5 times large vocab size ----
[28, 27, 20, 2, 1, 29, 1, 26, 27, 21, 4, 20, 11, 25, 2, 16, 5, 13, 22, 28, 19]
Observation:
- one_hot produces the same output as hashing_trick with hash_function=hash: one_hot is simply a wrapper around hashing_trick that uses Python's built-in hash, so uniqueness of the word-to-index mapping is not guaranteed.
- With a larger vocab_size (hashing-space size), collisions are reduced.
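Note also that Python's built-in hash is salted per process for strings (unless PYTHONHASHSEED is fixed), so one_hot / hash_function=hash can give different indexes across runs, whereas 'md5' is deterministic. Below is a minimal sketch (not part of the original example) for counting collisions at different hashing-space sizes:
from tensorflow.keras.preprocessing.text import text_to_word_sequence, hashing_trick
sentence = ('Because many writers abuse long sentences, cramming too many '
            'thoughts into each sentence, muddling up their message and '
            'leaving readers confused.')
distinct_words = set(text_to_word_sequence(sentence))  # 20 distinct words
for n in (22, 30, 100):  # candidate hashing-space sizes
    # hash each distinct word individually and count how many indexes collide
    indexes = [hashing_trick(word, n, hash_function='md5')[0] for word in distinct_words]
    print('hashing space ' + str(n) + ': ' + str(len(distinct_words) - len(set(indexes))) + ' collisions')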
Tokenization:
Source Code
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences=['I love swimming','I love watching series!',
           'I love my cat',
           'Bob is working from home & he will sleep later.',
           '@Lee: I want to ask you a question',
           'John loves pizza, burger and olives.',
           'I am working on editing a new series']
tokenizer=Tokenizer(num_words=200)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
The output looks like:
{'i': 1, 'love': 2, 'series': 3, 'working': 4, 'a': 5, 'swimming': 6, 'watching': 7, 'my': 8, 'cat': 9, 'bob': 10, 'is': 11, 'from': 12, 'home': 13, 'he': 14, 'will': 15, 'sleep': 16, 'later': 17, 'lee': 18, 'want': 19, 'to': 20, 'ask': 21, 'you': 22, 'question': 23, 'john': 24, 'loves': 25, 'pizza': 26, 'burger': 27, 'and': 28, 'olives': 29, 'am': 30, 'on': 31, 'editing': 32, 'new': 33}
word_index returns the above dictionary. Observe the following:
- 'series!' and 'series' are not treated as separate words.
- Punctuation is stripped.
- Uppercase letters are converted to lowercase.
- num_words is a hyperparameter: word_index always contains every word, but texts_to_sequences() keeps only the words whose index is below num_words, i.e. roughly the 200 most common words here (a short illustration follows after the next example).
- Why not combine fit_on_texts() and texts_to_sequences()? Because you almost always fit once and convert to sequences many times. You fit on your training corpus once and then reuse that exact same word_index dictionary at train / eval / test / prediction time to convert text into sequences to feed to the network, so it makes sense to keep those methods separate. Example below:
sentences=['I love swimming','I love watching series!',
           'I love my cat',
           'Bob is working from home & he will sleep later.',
           '@Lee: I want to ask you a question',
           'John loves pizza, burger and olives.',
           'I am working on editing a new series']
tokenizer=Tokenizer()
tokenizer.fit_on_texts(sentences)
test_seq = tokenizer.texts_to_sequences(sentences)
print(test_seq)
output:
[[1, 2, 6], [1, 2, 7, 3], [1, 2, 8, 9], [10, 11, 4, 12, 13, 14, 15, 16, 17], [18, 1, 19, 20, 21, 22, 5, 23], [24, 25, 26, 27, 28, 29], [1, 30, 4, 31, 32, 5, 33, 3]]
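Before moving on, here is a quick illustration of the num_words point from the observations above (a small sketch that reuses the same sentences list):
# word_index always keeps every word; num_words only limits texts_to_sequences(),
# which keeps words whose index is below num_words and silently drops the rest.
small_tokenizer = Tokenizer(num_words=5)
small_tokenizer.fit_on_texts(sentences)
print(len(small_tokenizer.word_index))                        # 33 -- every word is still indexed
print(small_tokenizer.texts_to_sequences(['I love my cat']))  # [[1, 2]] -- 'my' (8) and 'cat' (9) are dropped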
Now consider a new sentence such as 'My friend is flying to Boston tonight'. The words friend, flying, Boston and tonight are missing from the corpus, so the sentence is encoded merely as "my is to":
sentences=['I love swimming','I love watching series!',
'I love my cat',
'Bob is working from home & he will sleep later.',
'@Lee: I want to ask you a question',
'John loves pizza, burger and olives.',
'I am working on editing a new series']
tokenizer=Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
test_seq = tokenizer.texts_to_sequences(sentences)
print(test_seq)
#=========== passing test_data to check ==========
print('What does this new sentence look like as a sequence? =>')
print(tokenizer.texts_to_sequences(['My friend is flying to Boston tonight']))
#=================================================================
OUTPUT :
------------
{'i': 1, 'love': 2, 'series': 3, 'working': 4, 'a': 5, 'swimming': 6, 'watching': 7, 'my': 8, 'cat': 9, 'bob': 10, 'is': 11, 'from': 12, 'home': 13, 'he': 14, 'will': 15, 'sleep': 16, 'later': 17, 'lee': 18, 'want': 19, 'to': 20, 'ask': 21, 'you': 22, 'question': 23, 'john': 24, 'loves': 25, 'pizza': 26, 'burger': 27, 'and': 28, 'olives': 29, 'am': 30, 'on': 31, 'editing': 32, 'new': 33}
[[1, 2, 6], [1, 2, 7, 3], [1, 2, 8, 9], [10, 11, 4, 12, 13, 14, 15, 16, 17], [18, 1, 19, 20, 21, 22, 5, 23], [24, 25, 26, 27, 28, 29], [1, 30, 4, 31, 32, 5, 33, 3]]
What does this new sentence look like as a sequence? =>
[[8, 11, 20]]
To avoid this problem, you can use the oov_token hyperparameter. An example is shown below:
sentences=['I love swimming','I love watching series!',
'I love my cat',
'Bob is working from home & he will sleep later.',
'@Lee: I want to ask you a question',
'John loves pizza, burger and olives.',
'I am working on editing a new series']
# oov_token: if given, it will be added to word_index and used to
# replace out-of-vocabulary words during texts_to_sequences calls
tokenizer=Tokenizer(num_words=100,oov_token='boston')
tokenizer.fit_on_texts(sentences)
# summarize what was learned
print(tokenizer.word_counts)     # how many times each word appears across the corpus
print('---------------------')
print(tokenizer.document_count)  # number of documents (sentences) the tokenizer was fit on
print('---------------------')
print(tokenizer.word_index)      # word -> integer index mapping ('boston' gets index 1 because it is the oov_token)
print('---------------------')
print(tokenizer.word_docs)       # number of documents each word appears in
print('---------------------')
# integer encode documents: one row per document, one column per word index
encoded_docs = tokenizer.texts_to_matrix(sentences, mode='count')
print(encoded_docs)
print('---------------------')
word_index = tokenizer.word_index
print(word_index)
print('---------------------')
test_seq = tokenizer.texts_to_sequences(sentences)
print(test_seq)
print('---------------------')
print('What does this new sentence look like as a sequence? =>')
print(tokenizer.texts_to_sequences(['My friend is flying to Boston tonight']))
============================ OUTPUT ========================
OrderedDict([('i', 5), ('love', 3), ('swimming', 1), ('watching', 1), ('series', 2), ('my', 1), ('cat', 1), ('bob', 1), ('is', 1), ('working', 2), ('from', 1), ('home', 1), ('he', 1), ('will', 1), ('sleep', 1), ('later', 1), ('lee', 1), ('want', 1), ('to', 1), ('ask', 1), ('you', 1), ('a', 2), ('question', 1), ('john', 1), ('loves', 1), ('pizza', 1), ('burger', 1), ('and', 1), ('olives', 1), ('am', 1), ('on', 1), ('editing', 1), ('new', 1)])
---------------------
7
---------------------
{'boston': 1, 'i': 2, 'love': 3, 'series': 4, 'working': 5, 'a': 6, 'swimming': 7, 'watching': 8, 'my': 9, 'cat': 10, 'bob': 11, 'is': 12, 'from': 13, 'home': 14, 'he': 15, 'will': 16, 'sleep': 17, 'later': 18, 'lee': 19, 'want': 20, 'to': 21, 'ask': 22, 'you': 23, 'question': 24, 'john': 25, 'loves': 26, 'pizza': 27, 'burger': 28, 'and': 29, 'olives': 30, 'am': 31, 'on': 32, 'editing': 33, 'new': 34}
---------------------
defaultdict(<class 'int'>, {'love': 3, 'i': 5, 'swimming': 1, 'series': 2, 'watching': 1, 'my': 1, 'cat': 1, 'he': 1, 'from': 1, 'bob': 1, 'later': 1, 'home': 1, 'is': 1, 'working': 2, 'will': 1, 'sleep': 1, 'question': 1, 'to': 1, 'ask': 1, 'a': 2, 'you': 1, 'lee': 1, 'want': 1, 'pizza': 1, 'loves': 1, 'and': 1, 'burger': 1, 'john': 1, 'olives': 1, 'editing': 1, 'am': 1, 'new': 1, 'on': 1})
---------------------
[[0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0.]
[0. 0. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0.]
[0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1.
1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0.]
[0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0.]]
---------------------
{'boston': 1, 'i': 2, 'love': 3, 'series': 4, 'working': 5, 'a': 6, 'swimming': 7, 'watching': 8, 'my': 9, 'cat': 10, 'bob': 11, 'is': 12, 'from': 13, 'home': 14, 'he': 15, 'will': 16, 'sleep': 17, 'later': 18, 'lee': 19, 'want': 20, 'to': 21, 'ask': 22, 'you': 23, 'question': 24, 'john': 25, 'loves': 26, 'pizza': 27, 'burger': 28, 'and': 29, 'olives': 30, 'am': 31, 'on': 32, 'editing': 33, 'new': 34}
---------------------
[[2, 3, 7], [2, 3, 8, 4], [2, 3, 9, 10], [11, 12, 5, 13, 14, 15, 16, 17, 18], [19, 2, 20, 21, 22, 23, 6, 24], [25, 26, 27, 28, 29, 30], [2, 31, 5, 32, 33, 6, 34, 4]]
---------------------
What does this new sentence look like as a sequence? =>
[[9, 1, 12, 1, 21, 1, 1]]
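Two side notes on the example above. First, 'boston' was used as the oov_token purely for illustration; a token that cannot occur in real text, such as the '<oov>' used in the padding examples below, is a safer choice. Second, texts_to_matrix(), used above with mode='count', also supports other modes. A minimal sketch, assuming the same tokenizer and sentences as above:
# Each mode returns a (num_documents, num_words) matrix -- here (7, 100).
for mode in ('binary', 'count', 'freq', 'tfidf'):
    matrix = tokenizer.texts_to_matrix(sentences, mode=mode)
    print(mode, matrix.shape)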
Padding:
Deep learning libraries take a vectorized representation of the data as input for modelling, so when sentences are fed to a neural network every sequence in a batch must have the same length. Padding is what provides that uniformity. It can be applied in the following ways:
1- Pad variable-length sequences with dummy values up to the length of the longest sequence (the default).
2- Pad variable-length sequences up to a longer desired length.
3- Truncate variable-length sequences down to a shorter desired length.
The following arguments of pad_sequences() control this behaviour:
- maxlen: Int, maximum length of all sequences.
- padding: String, 'pre' or 'post': pad either before or after each sequence.
- truncating: String, 'pre' or 'post': remove values from sequences larger than maxlen, either at the beginning or at the end of the sequences.
Please refer to the documentation and the source code to explore more.
Example - 1:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Dec 26 15:42:18 2019
@author: somanathnanda
"""
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
docs=['Bob and Alice are playing football','Will you go for the game tonight?',
'Johnny expects a good weather for the game tomorrow','Hello World!',
'Do you fancy a glass of wine?','How are you doing?',
'Would you like to go for a coffee?'
]
tokenizer=Tokenizer(num_words=200,oov_token='<oov>')
tokenizer.fit_on_texts(docs)
word_index=tokenizer.word_index
print(word_index)
print('--------------------------')
sequences=tokenizer.texts_to_sequences(docs)
print(sequences)
print('--------------------------')
padded=pad_sequences(sequences)
print(padded)
print('--------------------------')
===== OUTPUT =====
{'<oov>': 1, 'you': 2, 'for': 3, 'a': 4, 'are': 5, 'go': 6, 'the': 7, 'game': 8, 'bob': 9, 'and': 10, 'alice': 11, 'playing': 12, 'football': 13, 'will': 14, 'tonight': 15, 'johnny': 16, 'expects': 17, 'good': 18, 'weather': 19, 'tomorrow': 20, 'hello': 21, 'world': 22, 'do': 23, 'fancy': 24, 'glass': 25, 'of': 26, 'wine': 27, 'how': 28, 'doing': 29, 'would': 30, 'like': 31, 'to': 32, 'coffee': 33}
--------------------------
[[9, 10, 11, 5, 12, 13],
[14, 2, 6, 3, 7, 8, 15],
[16, 17, 4, 18, 19, 3, 7, 8, 20],
[21, 22],
[23, 2, 24, 4, 25, 26, 27],
[28, 5, 2, 29],
[30, 2, 31, 32, 6, 3, 4, 33]]
--------------------------
[[ 0 0 0 9 10 11 5 12 13]
[ 0 0 14 2 6 3 7 8 15]
[16 17 4 18 19 3 7 8 20]
[ 0 0 0 0 0 0 0 21 22]
[ 0 0 23 2 24 4 25 26 27]
[ 0 0 0 0 0 28 5 2 29]
[ 0 30 2 31 32 6 3 4 33]]
--------------------------
Example - 2: (Default padding is 'pre', explicit 'post' padding example)
print('-----------DEFAULT PRE PADDING---------------')
padded=pad_sequences(sequences)
print(padded)
print('------------POST PADDING--------------')
padded=pad_sequences(sequences,padding='post')
print(padded)
print('--------------------------')
Output:
========
-----------DEFAULT PRE PADDING---------------
[[ 0 0 0 9 10 11 5 12 13]
[ 0 0 14 2 6 3 7 8 15]
[16 17 4 18 19 3 7 8 20]
[ 0 0 0 0 0 0 0 21 22]
[ 0 0 23 2 24 4 25 26 27]
[ 0 0 0 0 0 28 5 2 29]
[ 0 30 2 31 32 6 3 4 33]]
------------POST PADDING--------------
[[ 9 10 11 5 12 13 0 0 0]
[14 2 6 3 7 8 15 0 0]
[16 17 4 18 19 3 7 8 20]
[21 22 0 0 0 0 0 0 0]
[23 2 24 4 25 26 27 0 0]
[28 5 2 29 0 0 0 0 0]
[30 2 31 32 6 3 4 33 0]]
--------------------------
Example - 3: (Sequences can be truncated too.)
The pad_sequences() function can also pad sequences to a preferred length that may be longer than any observed sequence. The desired length is specified as a number of timesteps with the maxlen argument. Sequences longer than maxlen can be truncated in two ways: by removing timesteps from the beginning (truncating='pre', the default) or from the end (truncating='post').
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Dec 26 15:42:18 2019
@author: somanathnanda
"""
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
docs=['Bob and Alice are playing football','Will you go for the game tonight?',
'Johnny expects a good weather for the game tomorrow','Hello World!',
'Do you fancy a glass of wine?','How are you doing?',
'Would you like to go for a coffee?'
]
tokenizer=Tokenizer(num_words=200,oov_token='<oov>')
tokenizer.fit_on_texts(docs)
word_index=tokenizer.word_index
print(word_index)
print('--------------------------')
sequences=tokenizer.texts_to_sequences(docs)
print(sequences)
print('-----------DEFAULT PRE PADDING---------------')
padded=pad_sequences(sequences)
print(padded)
print('-----------Pad Sequences to a desired length---------------')
padded=pad_sequences(sequences,maxlen=3)
print(padded)
print('------------POST PADDING--------------')
padded=pad_sequences(sequences,padding='post')
print(padded)
# padding='post' affects only the added zeros; truncation still defaults to 'pre',
# so each sequence keeps its last maxlen (=4) values
print('---POST PADDING, PRE TRUNCATING, Padding Sequence to a desired length-----')
padded=pad_sequences(sequences,padding='post',maxlen=4)
print(padded)
print('------------POST PADDING, POST TRUNCATING--------------')
padded=pad_sequences(sequences,padding='post',truncating='post',maxlen=7)
print(padded)
OUTPUT:
{'<oov>': 1, 'you': 2, 'for': 3, 'a': 4, 'are': 5, 'go': 6, 'the': 7, 'game': 8, 'bob': 9, 'and': 10, 'alice': 11, 'playing': 12, 'football': 13, 'will': 14, 'tonight': 15, 'johnny': 16, 'expects': 17, 'good': 18, 'weather': 19, 'tomorrow': 20, 'hello': 21, 'world': 22, 'do': 23, 'fancy': 24, 'glass': 25, 'of': 26, 'wine': 27, 'how': 28, 'doing': 29, 'would': 30, 'like': 31, 'to': 32, 'coffee': 33}
--------------------------
[[9, 10, 11, 5, 12, 13],
[14, 2, 6, 3, 7, 8, 15],
[16, 17, 4, 18, 19, 3, 7, 8, 20],
[21, 22],
[23, 2, 24, 4, 25, 26, 27],
[28, 5, 2, 29],
[30, 2, 31, 32, 6, 3, 4, 33]]
-----------DEFAULT PRE PADDING---------------
[[ 0 0 0 9 10 11 5 12 13]
[ 0 0 14 2 6 3 7 8 15]
[16 17 4 18 19 3 7 8 20]
[ 0 0 0 0 0 0 0 21 22]
[ 0 0 23 2 24 4 25 26 27]
[ 0 0 0 0 0 28 5 2 29]
[ 0 30 2 31 32 6 3 4 33]]
-----------Pad Sequences to a desired length---------------
[[ 5 12 13]
[ 7 8 15]
[ 7 8 20]
[ 0 21 22]
[25 26 27]
[ 5 2 29]
[ 3 4 33]]
------------POST PADDING--------------
[[ 9 10 11 5 12 13 0 0 0]
[14 2 6 3 7 8 15 0 0]
[16 17 4 18 19 3 7 8 20]
[21 22 0 0 0 0 0 0 0]
[23 2 24 4 25 26 27 0 0]
[28 5 2 29 0 0 0 0 0]
[30 2 31 32 6 3 4 33 0]]
---POST PADDING, PRE TRUNCATING, Padding Sequence to a desired length-----
[[11 5 12 13]
[ 3 7 8 15]
[ 3 7 8 20]
[21 22 0 0]
[ 4 25 26 27]
[28 5 2 29]
[ 6 3 4 33]]
------------POST PADDING, POST TRUNCATING--------------
[[ 9 10 11 5 12 13 0]
[14 2 6 3 7 8 15]
[16 17 4 18 19 3 7]
[21 22 0 0 0 0 0]
[23 2 24 4 25 26 27]
[28 5 2 29 0 0 0]
[30 2 31 32 6 3 4]]
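Finally, here is a minimal sketch that puts the pieces above together in the usual train/test flow (the sentence lists are made up for illustration):
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

train_sentences = ['I love swimming', 'I love watching series!', 'I love my cat']
test_sentences  = ['My friend is flying to Boston tonight']

# fit once, on the training corpus only
tokenizer = Tokenizer(num_words=100, oov_token='<oov>')
tokenizer.fit_on_texts(train_sentences)

# reuse the same tokenizer for every later conversion
train_padded = pad_sequences(tokenizer.texts_to_sequences(train_sentences),
                             maxlen=5, padding='post')
test_padded  = pad_sequences(tokenizer.texts_to_sequences(test_sentences),
                             maxlen=5, padding='post', truncating='post')
print(train_padded)
print(test_padded)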
I hope you liked Part I. Part II will follow next week.