What is Tokenization in Natural Language Processing (NLP)?
Tokenization is the process of breaking a piece of text down into small units called tokens. A token may be a word, part of a word, or just characters like punctuation.
It is one of the most foundational NLP tasks, and a difficult one, because every language has its own grammatical constructs, which are often hard to write down as rules.
(Note that the word has a second, unrelated meaning in data security, where tokenization replaces sensitive data with an unrelated value of the same length and format; the token is used in an organization’s internal systems while the original data is stored in a secure token vault. In this article, we only care about tokenization in the NLP sense.)
Why Is It Called Tokenization?
Let’s look at the history of tokenization before we dive deep into its everyday use. The first thing we want to know is why it’s called tokenization anyway.
Natural language processing goes hand in hand with “formal languages,” a field between linguistics and computer science that essentially studies the language-like aspects of programming languages.
Just like natural languages, formal languages have distinct strings that carry meaning; we often call them words, but to avoid confusion, people working on formal languages call them tokens.
In other words, a token is a string with a known meaning.
The first place to see this is in your code editor.
When you write def in Python, it gets colored because the code editor recognizes def as a token with special meaning. On the other hand, if you write “def” in quotes, it gets colored differently because the editor recognizes it as a token whose meaning is “arbitrary string.”
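We can see the same distinction without a code editor by asking Python’s built-in tokenize module to tokenize a tiny snippet (the snippet itself is just an example):

```python
import io
import tokenize

source = 'def greet():\n    return "def"\n'

# Python's own tokenizer labels every lexeme with a token type:
# the keyword def comes back as a NAME token, while "def" in quotes
# comes back as a STRING token.
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
```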
Which Tokenization Should You Use?
Tokenization can be performed at the word, character, or subword level. A common question is: which tokenization should we use when solving an NLP task? Let’s address that question here.
- Word Tokenization
Word tokenization is the most commonly used tokenization algorithm. It splits a piece of text into individual words based on a certain delimiter; depending on the delimiter, different word-level tokens are formed.
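Here is a minimal sketch of delimiter-based word tokenization in plain Python (the example sentence and the regex are just illustrations; libraries such as NLTK or spaCy apply much more careful rules):

```python
import re

text = "Tokenization isn't as simple as it looks."

# The simplest word tokenizer: split on whitespace.
whitespace_tokens = text.split()

# A regex-based tokenizer that also separates punctuation from words.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)  # ['Tokenization', "isn't", 'as', 'simple', 'as', 'it', 'looks.']
print(regex_tokens)       # ['Tokenization', 'isn', "'", 't', 'as', 'simple', 'as', 'it', 'looks', '.']
```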
Drawbacks of Word Tokenization
One of the major issues with word tokens is dealing with Out Of Vocabulary (OOV) words. OOV words are new words encountered at test time that do not exist in the vocabulary built during training, so purely word-level methods fail to handle them.
Another issue with word tokens is the size of the vocabulary. Pre-trained models are generally trained on a large text corpus, so building a vocabulary out of every unique word in such a corpus makes the vocabulary explode.
- Character Tokenization
Character tokenization splits a piece of text into a set of characters. It overcomes the drawbacks of word tokenization we saw above.
Character tokenizers handle OOV words coherently by preserving the information in the word: an OOV word is broken down into characters and represented in terms of those characters. Character tokenization also keeps the vocabulary small, since it only needs to contain the unique characters of the language (for English, the 26 letters plus digits and punctuation).
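A quick sketch of the idea (the word and corpus below are just examples): even a word the model has never seen decomposes into characters drawn from a small, fixed vocabulary.

```python
word = "tokenization"

# Character-level tokenization: split the word into individual characters.
char_tokens = list(word)
print(char_tokens)
# ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']

# The character vocabulary of a whole corpus stays tiny.
corpus = ["the cat sat on the mat", "tokenization is fun"]
char_vocab = sorted(set("".join(corpus)))
print(len(char_vocab), char_vocab)
```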
Drawbacks of Character Tokenization
Character tokens solve the OOV problem, but the length of the input and output sequences grows rapidly, since we are now representing a sentence as a sequence of characters. As a result, it becomes challenging to learn the relationships between the characters that form meaningful words.
This brings us to another approach, known as subword tokenization, which sits between word and character tokenization.
- Subword Tokenization
Subword tokenization splits a piece of text into subwords (or n-gram characters). For example, a word like lower can be segmented as low-er, smartest as smart-est, and so on.
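As a rough illustration of how such splits can be produced, here is a minimal greedy longest-match sketch over a toy vocabulary (the vocabulary is invented for this example; real subword tokenizers such as BPE or WordPiece learn their vocabulary from data):

```python
# Toy subword vocabulary, invented purely for illustration.
vocab = {"low", "smart", "er", "est", "token", "iza", "tion"}

def subword_tokenize(word, vocab):
    """Greedily match the longest vocabulary entry from the left."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:            # no match: fall back to a single character
            tokens.append(word[start])
            start += 1
        else:
            tokens.append(word[start:end])
            start = end
    return tokens

print(subword_tokenize("lower", vocab))     # ['low', 'er']
print(subword_tokenize("smartest", vocab))  # ['smart', 'est']
```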
Why Do We Need to Tokenize?
Suppose we have a bunch of text and we want a computer to work with all of it. Why do we need to break the text into small tokens?
Programming languages work by breaking raw code up into tokens and then combining them by some logic (the program’s grammar), and natural language processing follows the same idea.
By breaking up the text into small, known fragments, we can apply a small set of rules to combine them into some larger meaning.
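To make that concrete, here is a minimal sketch (the corpus and helper names are invented for illustration) of the usual next step after tokenization: mapping each token to an integer ID so that a model can work with numbers instead of raw strings.

```python
corpus = ["the cat sat", "the dog sat"]

# Build a vocabulary from word tokens: each new token gets the next ID.
vocab = {}
for sentence in corpus:
    for token in sentence.split():
        vocab.setdefault(token, len(vocab))

def encode(sentence, vocab):
    """Replace each known token with its ID; unknown (OOV) tokens get -1."""
    return [vocab.get(token, -1) for token in sentence.split()]

print(vocab)                          # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3}
print(encode("the dog sat", vocab))   # [0, 3, 2]
print(encode("the bird sat", vocab))  # [0, -1, 2] -- 'bird' is out of vocabulary
```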