Tokenization

One of the most common tasks in NLP is tokenization: breaking a body of text down into individual units, popularly called tokens. The objective of tokenization is to isolate each word so that it can be identified on its own, and it plays a major role in the process of lexical analysis.

The approach to tokenization is straightforward: depending on the language, the text is broken up either into sentences or into words. In other words, there are two types of tokenization you can perform. For breaking text into sentences, NLTK provides a method in its tokenize module.

That method is called sent_tokenize. For tokenizing text into words, NLTK likewise provides a method called word_tokenize, which is also part of the tokenize module.

NLTK also offers several specialized tokenizers for specific use cases, such as PunktWordTokenizer, RegexpTokenizer, and TreebankWordTokenizer, all of which implement the base tokenizer interface (TokenizerI). PunktWordTokenizer divides the given string into a list of substrings.

The regular expression tokenizer, RegexpTokenizer, builds tokens according to a regular expression that you supply. Two ready-made variants are WordPunctTokenizer and WhitespaceTokenizer. WordPunctTokenizer splits a text into a sequence of alphabetic and non-alphabetic character runs.
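To see this behavior concretely, here is a minimal sketch using only Python's standard library, assuming the pattern r"\w+|[^\w\s]+" that WordPunctTokenizer is built on (the helper name word_punct_tokenize is our own, for illustration):

```python
import re

# Mimics WordPunctTokenizer: match runs of word characters, or runs of
# punctuation, discarding whitespace between them.
def word_punct_tokenize(text):
    return re.findall(r"\w+|[^\w\s]+", text)

print(word_punct_tokenize("It's a beautiful day!"))
# ['It', "'", 's', 'a', 'beautiful', 'day', '!']
```

Notice how the apostrophe in "It's" becomes its own token, since it separates two alphabetic runs.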

WhitespaceTokenizer, by contrast, uses whitespace alone to delimit tokens. Finally, we have TreebankWordTokenizer, which treats most punctuation characters as separate tokens in their own right. It splits off commas and single quotes when they are followed by whitespace, and it separates periods that appear at the end of a line.
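The contrast with WordPunctTokenizer is easy to see in a sketch: splitting on whitespace never separates punctuation from the word it touches. This is a standard-library illustration of the behavior, equivalent to matching the pattern r"\S+" (the helper name whitespace_tokenize is our own):

```python
import re

# Mimics WhitespaceTokenizer: split only on runs of whitespace, so
# punctuation stays attached to adjacent words.
def whitespace_tokenize(text):
    return re.findall(r"\S+", text)

print(whitespace_tokenize("It's a beautiful day!"))
# ["It's", 'a', 'beautiful', 'day!']
```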

Now, let's look at the NLTK tokenize module itself. It gives us both capabilities: word tokenization through the word_tokenize function, and sentence tokenization through the sent_tokenize function.

Let's take a simple example of word tokenization. We start by importing word_tokenize from nltk.tokenize.

from nltk.tokenize import word_tokenize

Then we declare a text: "Good Morning! It's a beautiful day!"

text = "Good Morning! It's a beautiful day!"

Finally, we call word_tokenize on the text, which returns the tokens as a list.

print(word_tokenize(text))

That is Good, Morning, !, It, 's, a, beautiful, day, ! — all independent tokens. Note that word_tokenize splits the contraction "It's" into "It" and "'s".

Output: ['Good', 'Morning', '!', 'It', "'s", 'a', 'beautiful', 'day', '!']

Let's take another example, this time of sentence tokenization. We import sent_tokenize and use the same text: "Good Morning! It's a beautiful day!"

from nltk.tokenize import sent_tokenize

text = "Good Morning! It's a beautiful day!"

But now, when we print the result of calling sent_tokenize on the text, it is broken down into two sentences: one is "Good Morning!" and the second is "It's a beautiful day!"

Output: ['Good Morning!', "It's a beautiful day!"]

In other words, the exclamation mark plays an essential role in segmenting the text into sentences.
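A crude way to see this rule in action, without the trained Punkt model that sent_tokenize actually uses, is a regular expression that splits on whitespace following sentence-ending punctuation. This is a deliberate simplification (the helper name naive_sent_tokenize is our own):

```python
import re

# A naive sentence splitter: break on whitespace that follows '.', '!'
# or '?'. NLTK's sent_tokenize uses a trained Punkt model that also
# handles abbreviations, decimal numbers, and similar edge cases.
def naive_sent_tokenize(text):
    return re.split(r"(?<=[.!?])\s+", text)

print(naive_sent_tokenize("Good Morning! It's a beautiful day!"))
# ['Good Morning!', "It's a beautiful day!"]
```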
