Tokenization and Embeddings
What is Tokenization?
In AI/ML, especially in Natural Language Processing (NLP), tokenization is the process of splitting text into smaller units (tokens) that a model can process.
Why?
Computers don’t understand text directly; they work with numbers. Tokenization is the first step in converting raw text into something structured that a model can use.
Types of tokens:
Word-level: "Learn ML Basics" -> ["Learn", "ML", "Basics"]
Subword-level (common in modern LLMs): "unhappiness" -> ["un", "happiness"] (prevents rare words from being completely unknown)
Character-level: "AI" -> ["A", "I"]
Note: Most modern models (like GPT, BERT, etc.) use subword tokenization (e.g., Byte-Pair Encoding, WordPiece, SentencePiece).
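To make the subword idea concrete, here is a toy greedy longest-match tokenizer (in the spirit of WordPiece, though real tokenizers are trained on large corpora). The tiny vocabulary below is invented for illustration only:

```python
# Toy vocabulary -- a made-up example, not any real model's vocab.
VOCAB = {"un", "happi", "ness", "happiness", "learn", "ml", "basics"}

def subword_tokenize(word, vocab=VOCAB):
    """Split a word into the longest known pieces, left to right."""
    tokens, start = [], 0
    word = word.lower()
    while start < len(word):
        # Try the longest possible piece first.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No known piece: fall back to a single unknown character.
            tokens.append(word[start])
            start += 1
    return tokens

print(subword_tokenize("unhappiness"))  # ['un', 'happiness']
```

Because unknown words fall back to smaller pieces (ultimately characters), no input is ever completely out of vocabulary, which is exactly why modern LLMs prefer subword schemes.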
What are Embeddings?
After tokenization, we still have symbols (tokens), and a model needs numbers. That’s where embeddings come in: each token is mapped to a dense vector of real numbers, learned so that tokens with similar meanings end up close together in the vector space.
Putting it Together
Now the model can process numbers instead of raw text.
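The full text-to-numbers pipeline can be sketched in a few lines. The vocabulary and the 3-dimensional vectors below are invented for illustration; real models learn much larger embedding tables during training:

```python
# Minimal sketch of the token -> id -> embedding pipeline.
vocab = {"learn": 0, "ml": 1, "basics": 2}
embedding_table = [
    [0.2, 0.8, 0.1],  # vector for "learn"
    [0.9, 0.3, 0.5],  # vector for "ml"
    [0.1, 0.7, 0.6],  # vector for "basics"
]

text = "Learn ML Basics"
tokens = text.lower().split()                # word-level tokenization
ids = [vocab[t] for t in tokens]             # token -> integer id
vectors = [embedding_table[i] for i in ids]  # id -> embedding vector

print(ids)         # [0, 1, 2]
print(vectors[0])  # [0.2, 0.8, 0.1]
```

The lookup step is exactly what an embedding layer does inside a neural network, except that there the table entries are trainable parameters.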
Beyond Words: Embeddings Everywhere
Embeddings aren’t just for words: images, audio, code, and even entire documents can be mapped into vector spaces of their own.
This allows AI models to compare, cluster, and reason across different data types.
How are Tokenization and Embeddings used in ML?
Tokenization and embeddings appear at both ends of the model lifecycle: when training the model, and when running inference or retrieving information from the trained model.
Training workflow
1) The training corpus is tokenized.
2) Each token id is assigned an embedding vector, typically initialized randomly.
3) During training, the embeddings are updated along with the model’s other weights, so they gradually come to capture meaning.
Inference and Information Retrieval with a Trained Model
Inference with a trained model
1) Input text is tokenized.
2) Tokens are mapped to their pre-trained embeddings.
3) The model uses these embeddings to make predictions (e.g., next token, class label).
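The steps above can be sketched as follows. A real model passes the embeddings through many layers before predicting; here, as a simplification, candidate next tokens are scored directly against a made-up context vector with a dot product:

```python
# Hedged sketch of inference: score candidates with a dot product.
# All vectors below are invented for illustration.
embeddings = {
    "automobile": [0.9, 0.1, 0.2],
    "car":        [0.85, 0.15, 0.25],
    "banana":     [0.1, 0.9, 0.7],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Pretend hidden state produced after processing the prompt.
context = [0.8, 0.2, 0.3]
scores = {tok: dot(context, vec) for tok, vec in embeddings.items()}
best = max(scores, key=scores.get)
print(best)  # 'automobile'
```

The key point is that prediction happens in vector space: tokens whose embeddings align with the context score higher.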
Information Retrieval (IR) / Search
In retrieval tasks, embeddings are used differently: documents are embedded ahead of time and stored in a vector index, and at query time the query is embedded with the same model and matched against the stored vectors by similarity.
Example: a query like "vehicle maintenance" can retrieve a document about car repair even when the two share no keywords.
This is the basis for semantic search, vector databases, and RAG (retrieval-augmented generation).
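A minimal version of this semantic matching fits in a few lines: embed the documents and the query (the vectors below are invented stand-ins for a real embedding model) and rank by cosine similarity, which is the core operation inside vector databases:

```python
import math

# Toy semantic search over invented document vectors.
docs = {
    "How to fix a car engine":  [0.9, 0.1, 0.3],
    "Best banana bread recipe": [0.1, 0.8, 0.6],
    "Automobile repair basics": [0.85, 0.2, 0.35],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 unrelated."""
    num = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return num / (nu * nv)

# Pretend embedding of the query "vehicle maintenance".
query_vec = [0.88, 0.15, 0.3]
ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
print(ranked)  # car-related docs first, banana bread last
```

In a RAG system, the top-ranked documents would then be passed to the LLM as context for generation.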
An Example Vector Space (Embedding Vectors)
A three-dimensional vector space where points A and B are close to each other indicates that their corresponding embeddings represent strong semantic similarity (e.g., car and automobile). In contrast, point C lies farther away in the vector space, suggesting it is less semantically related to points A and B.
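The A/B/C picture can be checked with a quick calculation. The coordinates below are illustrative only, chosen so that A and B sit close together while C lies farther away:

```python
import math

# Illustrative 3-D points for the figure above.
A = (0.9, 0.1, 0.2)    # "car"
B = (0.85, 0.15, 0.25) # "automobile"
C = (0.1, 0.9, 0.8)    # an unrelated concept

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean(A, B))  # small distance -> strong semantic similarity
print(euclidean(A, C))  # large distance -> less related
```

In practice, cosine similarity is often preferred over raw Euclidean distance because it ignores vector magnitude and compares direction only, but the intuition is the same: close means similar.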
How does semantic search using embeddings improve over traditional keyword-based search methods like TF-IDF or BM25, and in what scenarios might we still prefer the traditional methods?