Tokenization
One of the most common tasks in NLP is tokenization: breaking a piece of text down into individual units, typically words. These units are called tokens. The objective of tokenization is to isolate each word so that it can be identified and processed on its own, and it plays a major role in the lexical analysis stage of NLP.
The approach to tokenization is quite simple. Depending on what you need, you break the text up either into sentences or into words; in other words, there are two different types of tokenization you can do. For breaking text into sentences, NLTK provides a method in its tokenize module named sent_tokenize. For tokenizing text into words, NLTK likewise provides a method called word_tokenize, also part of the tokenize module.
Beyond these, there are different types of tokenization that can be done with NLTK. Some of the specific tokenizer classes are PunktWordTokenizer, RegexpTokenizer, and TreebankWordTokenizer, all of which derive from the common TokenizerI base interface. PunktWordTokenizer divides a given string into a list of substrings (note that this class is only exposed in older NLTK releases).
The regular expression tokenizer, RegexpTokenizer, builds tokens according to a regular expression that you supply. Two ready-made variants are WordPunctTokenizer and WhitespaceTokenizer. WordPunctTokenizer tokenizes text into sequences of alphabetic and non-alphabetic characters, while WhitespaceTokenizer simply splits the text on whitespace to build the tokens. Finally, we have TreebankWordTokenizer, which treats most punctuation characters as separate tokens, splits off commas and single quotes when they are followed by whitespace, and separates periods that appear at the end of a sentence.
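To make the differences concrete, here is a minimal sketch comparing these tokenizers on the same sentence; the expected outputs are shown as comments and reflect NLTK's standard behavior.

from nltk.tokenize import (RegexpTokenizer, WordPunctTokenizer,
                           WhitespaceTokenizer, TreebankWordTokenizer)

text = "Good Morning! It's a beautiful day!"

# RegexpTokenizer with a custom pattern: keep only runs of word characters
print(RegexpTokenizer(r"\w+").tokenize(text))
# ['Good', 'Morning', 'It', 's', 'a', 'beautiful', 'day']

# WordPunctTokenizer: alternating alphabetic / non-alphabetic runs
print(WordPunctTokenizer().tokenize(text))
# ['Good', 'Morning', '!', 'It', "'", 's', 'a', 'beautiful', 'day', '!']

# WhitespaceTokenizer: split on whitespace only, punctuation stays attached
print(WhitespaceTokenizer().tokenize(text))
# ['Good', 'Morning!', "It's", 'a', 'beautiful', 'day!']

# TreebankWordTokenizer: punctuation and contractions become separate tokens
print(TreebankWordTokenizer().tokenize(text))
# ['Good', 'Morning', '!', 'It', "'s", 'a', 'beautiful', 'day', '!']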
Now, let's talk about the NLTK tokenize module. It provides the capability to tokenize text at two different levels: word tokenization, done by calling the word_tokenize function, and sentence tokenization, done by calling the sent_tokenize function. Let's start with a simple example of word tokenization. First, we import word_tokenize from nltk.tokenize.
from nltk.tokenize import word_tokenize
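Note that word_tokenize relies on NLTK's pretrained Punkt models; if they are not installed yet, the call raises a LookupError. In that case you can download them once like this (depending on your NLTK version, the resource is named 'punkt' or 'punkt_tab'):

import nltk
nltk.download('punkt')  # one-time download of the Punkt tokenizer models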
Then we declare a text: Good Morning! It's a beautiful day!
text = "Good Morning! It's a beautiful day!"
Finally, we call word_tokenize on the text, which returns the tokens as a Python list.
print(word_tokenize(text))
That is, Good, Morning, !, It, 's, a, beautiful, day, and ! are all independent tokens. Notice that word_tokenize follows the Treebank conventions described above, so the contraction It's is split into It and 's.
Output: ['Good', 'Morning', '!', 'It', "'s", 'a', 'beautiful', 'day', '!']
Let's take another example, this time of sentence tokenization. Again, we import sent_tokenize and use the same text: Good Morning! It's a beautiful day!
from nltk.tokenize import sent_tokenize
text = "Good Morning! It's a beautiful day!"
But now, when you print the result of calling sent_tokenize, the text is broken down into two sentences: the first is Good Morning!, and the second is It's a beautiful day!
Output: ['Good Morning!', "It's a beautiful day!"]
In other words, the exclamation mark plays an essential role in tokenizing your text into sentences: sent_tokenize treats sentence-ending punctuation such as ! and . as sentence boundaries.
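It is worth noting that sent_tokenize is backed by the pretrained Punkt model, which does more than split on punctuation: it has been trained to recognize common abbreviations, so a period after, say, Mr. does not end a sentence. Here is a small sketch; the expected output, shown as a comment, assumes the standard English Punkt model.

from nltk.tokenize import sent_tokenize

text = "Mr. Brown loves NLP. Don't you?"
print(sent_tokenize(text))
# ['Mr. Brown loves NLP.', "Don't you?"]
# The period after "Mr." is recognized as part of an abbreviation,
# not as a sentence boundary.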