Tokenization
One of the most common tasks in NLP is tokenization: breaking a piece of text down into individual units, typically words. These units are called tokens. The objective of tokenization is to isolate each word so that it can be identified and processed on its own, and it plays a major role in the lexical analysis stage of NLP.
The approach to tokenization is quite simple. Depending on what you need, you break the text up either into sentences or into words; in other words, there are two different types of tokenization you can do. For breaking text into sentences, NLTK provides a method in its tokenize module named sent_tokenize. For tokenizing text into words, NLTK likewise provides a method called word_tokenize, also part of the tokenize module.
Beyond these, there are different types of tokenization that can be done with NLTK. Some of the specific tokenizer classes are PunktWordTokenizer, RegexpTokenizer, and TreebankWordTokenizer, all of which derive from the common TokenizerI base interface. PunktWordTokenizer divides a given string into a list of substrings (note that this class is only exposed in older NLTK releases).
The regular expression tokenizer, RegexpTokenizer, builds tokens according to a regular expression that you supply. Two ready-made variants are WordPunctTokenizer and WhitespaceTokenizer. WordPunctTokenizer tokenizes text into sequences of alphabetic and non-alphabetic characters, while WhitespaceTokenizer simply splits the text on whitespace to build the tokens. Finally, we have TreebankWordTokenizer, which treats most punctuation characters as separate tokens, splits off commas and single quotes when they are followed by whitespace, and separates periods that appear at the end of a sentence.
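To make the differences concrete, here is a minimal sketch comparing these tokenizers on the same sentence; the expected outputs are shown as comments and reflect NLTK's standard behavior.

from nltk.tokenize import (RegexpTokenizer, WordPunctTokenizer,
                           WhitespaceTokenizer, TreebankWordTokenizer)

text = "Good Morning! It's a beautiful day!"

# RegexpTokenizer with a custom pattern: keep only runs of word characters
print(RegexpTokenizer(r"\w+").tokenize(text))
# ['Good', 'Morning', 'It', 's', 'a', 'beautiful', 'day']

# WordPunctTokenizer: alternating alphabetic / non-alphabetic runs
print(WordPunctTokenizer().tokenize(text))
# ['Good', 'Morning', '!', 'It', "'", 's', 'a', 'beautiful', 'day', '!']

# WhitespaceTokenizer: split on whitespace only, punctuation stays attached
print(WhitespaceTokenizer().tokenize(text))
# ['Good', 'Morning!', "It's", 'a', 'beautiful', 'day!']

# TreebankWordTokenizer: punctuation and contractions become separate tokens
print(TreebankWordTokenizer().tokenize(text))
# ['Good', 'Morning', '!', 'It', "'s", 'a', 'beautiful', 'day', '!']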
Now, let's talk about the NLTK tokenize module. It provides the capability to tokenize text at two different levels: word tokenization, done by calling the word_tokenize function, and sentence tokenization, done by calling the sent_tokenize function. Let's start with a simple example of word tokenization. First, we import word_tokenize from nltk.tokenize.
from nltk.tokenize import word_tokenize
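Note that word_tokenize relies on NLTK's pretrained Punkt models; if they are not installed yet, the call raises a LookupError. In that case you can download them once like this (depending on your NLTK version, the resource is named 'punkt' or 'punkt_tab'):

import nltk
nltk.download('punkt')  # one-time download of the Punkt tokenizer models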
Then we declare a text: Good Morning! It's a beautiful day!
text = "Good Morning! It's a beautiful day!"
Finally, we call word_tokenize on the text, which returns the tokens as a Python list.
print(word_tokenize(text))
That is, Good, Morning, !, It, 's, a, beautiful, day, and ! are all independent tokens. Notice that word_tokenize follows the Treebank conventions described above, so the contraction It's is split into It and 's.
Output: ['Good', 'Morning', '!', 'It', "'s", 'a', 'beautiful', 'day', '!']
Let's take another example, this time of sentence tokenization. Again, we import sent_tokenize and use the same text: Good Morning! It's a beautiful day!
from nltk.tokenize import sent_tokenize
text = "Good Morning! It's a beautiful day!"
But now, when you print the result of calling sent_tokenize, the text is broken down into two sentences: the first is Good Morning!, and the second is It's a beautiful day!
Output: ['Good Morning!', "It's a beautiful day!"]
In other words, the exclamation mark plays an essential role in tokenizing your text into sentences: sent_tokenize treats sentence-ending punctuation such as ! and . as sentence boundaries.
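It is worth noting that sent_tokenize is backed by the pretrained Punkt model, which does more than split on punctuation: it has been trained to recognize common abbreviations, so a period after, say, Mr. does not end a sentence. Here is a small sketch; the expected output, shown as a comment, assumes the standard English Punkt model.

from nltk.tokenize import sent_tokenize

text = "Mr. Brown loves NLP. Don't you?"
print(sent_tokenize(text))
# ['Mr. Brown loves NLP.', "Don't you?"]
# The period after "Mr." is recognized as part of an abbreviation,
# not as a sentence boundary.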