What is Tokenization in Natural Language Processing (NLP)?
Tokenization is the process of breaking a piece of text down into small units called tokens. A token may be a word, part of a word, or just characters like punctuation.
It is one of the most foundational NLP tasks, and a difficult one, because every language has its own grammatical constructs, which are often hard to write down as rules.
(Note that the word has a second, unrelated meaning in data security, where tokenization replaces sensitive data with an unrelated value of the same length and format; the token is used in an organization’s internal systems while the original data is stored in a secure token vault. In this article, we only care about tokenization in the NLP sense.)
Why Is It Called Tokenization?
Let’s look at the history of tokenization before we dive deep into its everyday use. The first thing we want to know is why it’s called tokenization anyway.
Natural language processing goes hand in hand with “formal languages,” a field between linguistics and computer science that essentially studies the language-like aspects of programming languages.
Just like natural languages, formal languages have distinct strings that carry meaning; we often call them words, but to avoid confusion, people working on formal languages call them tokens.
In other words, a token is a string with a known meaning.
The first place to see this is in your code editor.
When you write def in Python, it gets colored because the code editor recognizes def as a token with special meaning. On the other hand, if you write “def” in quotes, it gets colored differently because the editor recognizes it as a token whose meaning is “arbitrary string.”
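We can see the same distinction without a code editor by asking Python’s built-in tokenize module to tokenize a tiny snippet (the snippet itself is just an example):

```python
import io
import tokenize

source = 'def greet():\n    return "def"\n'

# Python's own tokenizer labels every lexeme with a token type:
# the keyword def comes back as a NAME token, while "def" in quotes
# comes back as a STRING token.
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
```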
Which Tokenization Should You Use?
Tokenization can be performed at the word, character, or subword level. A common question is: which tokenization should we use when solving an NLP task? Let’s address that question here.
- Word Tokenization
Word tokenization is the most commonly used tokenization algorithm. It splits a piece of text into individual words based on a certain delimiter; depending on the delimiter, different word-level tokens are formed.
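Here is a minimal sketch of delimiter-based word tokenization in plain Python (the example sentence and the regex are just illustrations; libraries such as NLTK or spaCy apply much more careful rules):

```python
import re

text = "Tokenization isn't as simple as it looks."

# The simplest word tokenizer: split on whitespace.
whitespace_tokens = text.split()

# A regex-based tokenizer that also separates punctuation from words.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)  # ['Tokenization', "isn't", 'as', 'simple', 'as', 'it', 'looks.']
print(regex_tokens)       # ['Tokenization', 'isn', "'", 't', 'as', 'simple', 'as', 'it', 'looks', '.']
```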
Drawbacks of Word Tokenization
One of the major issues with word tokens is dealing with Out Of Vocabulary (OOV) words. OOV words are new words encountered at test time that do not exist in the vocabulary built during training, so purely word-level methods fail to handle them.
Another issue with word tokens is the size of the vocabulary. Pre-trained models are generally trained on a large text corpus, so building a vocabulary out of every unique word in such a corpus makes the vocabulary explode.
- Character Tokenization
Character tokenization splits a piece of text into a set of characters. It overcomes the drawbacks of word tokenization we saw above.
Character tokenizers handle OOV words coherently by preserving the information in the word: an OOV word is broken down into characters and represented in terms of those characters. Character tokenization also keeps the vocabulary small, since it only needs to contain the unique characters of the language (for English, the 26 letters plus digits and punctuation).
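A quick sketch of the idea (the word and corpus below are just examples): even a word the model has never seen decomposes into characters drawn from a small, fixed vocabulary.

```python
word = "tokenization"

# Character-level tokenization: split the word into individual characters.
char_tokens = list(word)
print(char_tokens)
# ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']

# The character vocabulary of a whole corpus stays tiny.
corpus = ["the cat sat on the mat", "tokenization is fun"]
char_vocab = sorted(set("".join(corpus)))
print(len(char_vocab), char_vocab)
```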
Drawbacks of Character Tokenization
Character tokens solve the OOV problem, but the length of the input and output sequences grows rapidly, since we are now representing a sentence as a sequence of characters. As a result, it becomes challenging to learn the relationships between the characters that form meaningful words.
This brings us to another approach, known as subword tokenization, which sits between word and character tokenization.
- Subword Tokenization
Subword tokenization splits a piece of text into subwords (or n-gram characters). For example, a word like lower can be segmented as low-er, smartest as smart-est, and so on.
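As a rough illustration of how such splits can be produced, here is a minimal greedy longest-match sketch over a toy vocabulary (the vocabulary is invented for this example; real subword tokenizers such as BPE or WordPiece learn their vocabulary from data):

```python
# Toy subword vocabulary, invented purely for illustration.
vocab = {"low", "smart", "er", "est", "token", "iza", "tion"}

def subword_tokenize(word, vocab):
    """Greedily match the longest vocabulary entry from the left."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:            # no match: fall back to a single character
            tokens.append(word[start])
            start += 1
        else:
            tokens.append(word[start:end])
            start = end
    return tokens

print(subword_tokenize("lower", vocab))     # ['low', 'er']
print(subword_tokenize("smartest", vocab))  # ['smart', 'est']
```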
Why Do We Need to Tokenize?
Suppose we have a bunch of text and we want a computer to work with all of it. Why do we need to break the text into small tokens?
Programming languages work by breaking raw code up into tokens and then combining them by some logic (the program’s grammar), and natural language processing follows the same idea.
By breaking up the text into small, known fragments, we can apply a small set of rules to combine them into some larger meaning.
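To make that concrete, here is a minimal sketch (the corpus and helper names are invented for illustration) of the usual next step after tokenization: mapping each token to an integer ID so that a model can work with numbers instead of raw strings.

```python
corpus = ["the cat sat", "the dog sat"]

# Build a vocabulary from word tokens: each new token gets the next ID.
vocab = {}
for sentence in corpus:
    for token in sentence.split():
        vocab.setdefault(token, len(vocab))

def encode(sentence, vocab):
    """Replace each known token with its ID; unknown (OOV) tokens get -1."""
    return [vocab.get(token, -1) for token in sentence.split()]

print(vocab)                          # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3}
print(encode("the dog sat", vocab))   # [0, 3, 2]
print(encode("the bird sat", vocab))  # [0, -1, 2] -- 'bird' is out of vocabulary
```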