Tokenization and Embeddings

What is Tokenization?

In AI/ML, especially in Natural Language Processing (NLP), tokenization is the process of splitting text into smaller units (tokens) that a model can process.

Why?

Computers don’t understand text directly; they work with numbers. Tokenization is the first step in converting raw text into something structured.

Types of tokens:

Word-level: "Learn ML Basics" -> ["Learn", "ML", "Basics"]

Subword-level (common in modern LLMs): "unhappiness" -> ["un", "happiness"] (prevents rare words from being completely unknown)

Character-level: "AI" -> ["A", "I"]

Note: Most modern models (like GPT, BERT, etc.) use subword tokenization (e.g., Byte-Pair Encoding, WordPiece, SentencePiece).
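To make the subword idea concrete, here is a minimal sketch of greedy longest-match splitting (the same idea WordPiece uses), with a small made-up vocabulary rather than a real model's learned one:

```python
# Toy greedy longest-match subword tokenizer (hypothetical mini-vocabulary).
# Real tokenizers (BPE, WordPiece, SentencePiece) learn their vocabularies
# from data; this sketch only illustrates the splitting step.
VOCAB = {"un", "happi", "ness", "happiness", "learn", "ml", "basics", "a", "i"}

def subword_tokenize(word: str) -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        # Take the longest vocabulary entry that matches at this position.
        for end in range(len(word), start, -1):
            piece = word[start:end].lower()
            if piece in VOCAB:
                pieces.append(piece)
                start = end
                break
        else:
            # Unknown character: emit it as its own token.
            pieces.append(word[start])
            start += 1
    return pieces

print(subword_tokenize("unhappiness"))  # ['un', 'happiness']
```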

What are Embeddings?

After tokenization, we still have symbols (tokens), and we need to turn them into numbers. That’s where embeddings come in.

  • Definition: An embedding is a vector (list of numbers) that represents a token in a high-dimensional space.
  • Why embeddings? They capture the semantic meaning of words: similar words have vectors that are close together in the embedding space (e.g., car and automobile are close, and king ≈ queen in meaning). Relations can be captured as well: king - man + woman ≈ queen (see the sketch after this list).
  • Dimensionality: Typical embedding vectors have hundreds of dimensions (e.g., 300, 768, 1536). Example: "cat" -> [0.21, -0.35, 0.87, ...]
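As a toy illustration of both points, the sketch below uses made-up 3-dimensional vectors (real embeddings are learned and have hundreds of dimensions) and compares them with cosine similarity:

```python
# Toy 3-D vectors, made up for illustration; real embeddings are learned.
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.1]),
    "man":   np.array([0.1, 0.7, 0.0]),
    "woman": np.array([0.1, 0.1, 0.0]),
    "car":   np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related words are close; unrelated words are not.
print(cosine(emb["king"], emb["queen"]))  # high
print(cosine(emb["king"], emb["car"]))    # low

# The analogy: king - man + woman lands near queen.
target = emb["king"] - emb["man"] + emb["woman"]
print(cosine(target, emb["queen"]))       # highest among these words
```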

 Putting it Together

  1. Raw text: "Learn ML Basics"
  2. Tokenization: ["Learn", "ML", "Basics"]
  3. Convert tokens to IDs: [101, 2023, 30522] (model’s vocabulary IDs)
  4. Embedding lookup: "Learn" -> [0.2, -0.1, 0.7, ...], "ML" -> [0.5, 0.8, -0.3, ...], "Basics" -> [0.9, -0.4, 0.2, ...]

Now the model can process numbers instead of raw text.
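Here is a minimal sketch of steps 1-4 above, using a hypothetical vocabulary and a randomly initialized embedding matrix (the IDs and numbers are illustrative only, not a real model's):

```python
import numpy as np

# Hypothetical vocabulary: token -> ID (real models have tens of thousands of entries).
vocab = {"Learn": 101, "ML": 2023, "Basics": 30522}
embed_dim = 4
rng = np.random.default_rng(0)

# Embedding matrix: one row of `embed_dim` numbers per vocabulary ID.
embedding_matrix = {tid: rng.standard_normal(embed_dim) for tid in vocab.values()}

text = "Learn ML Basics"                       # 1. raw text
tokens = text.split()                          # 2. tokenization (word-level here)
ids = [vocab[t] for t in tokens]               # 3. tokens -> IDs
vectors = [embedding_matrix[i] for i in ids]   # 4. embedding lookup

for tok, vec in zip(tokens, vectors):
    print(tok, "->", np.round(vec, 2))
```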

Beyond Words: Embeddings Everywhere

Embeddings aren’t just for words:

  • Images: Represented as vectors (via CNNs, Vision Transformers).
  • Audio: Converted into spectrogram embeddings.
  • Graph data: Nodes/edges can be embedded.

This allows AI models to compare, cluster, and reason across different data types.

How are Tokenization and Embeddings used in ML?

Tokenization and embeddings are used both to train a model and to retrieve information from a trained model.

Training workflow

  1. Input text to tokens. Example: "ML is cool" -> ["ML", "is", "cool"]
  2. Tokens to embedding vectors. Each token is mapped to a vector (e.g., [0.12, -0.43, 0.77, …]); initially, the embedding vectors can be initialized with random n-dimensional values.
  3. Embedding vectors to model layers. These vectors are fed into the model (e.g., a Transformer), which processes them through layers of attention and feed-forward networks.
  4. Prediction + loss computation. The model produces a prediction, and a loss is computed against the expected output.
  5. Backpropagation updates embeddings. Gradients flow back not only through the model weights but also through the embedding matrix, so token representations improve over time.
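A minimal training sketch of steps 1-5, assuming a toy PyTorch setup in which a single linear layer stands in for the full model:

```python
# Minimal sketch (toy PyTorch setup): a trainable embedding matrix
# updated by backpropagation alongside the rest of the model.
import torch
import torch.nn as nn

vocab_size, embed_dim, num_classes = 10, 8, 2

embedding = nn.Embedding(vocab_size, embed_dim)   # randomly initialized embedding matrix
classifier = nn.Linear(embed_dim, num_classes)    # stand-in for the model layers

optimizer = torch.optim.SGD(
    list(embedding.parameters()) + list(classifier.parameters()), lr=0.1
)

token_ids = torch.tensor([[1, 4, 7]])   # step 1: tokenized input (toy IDs)
label = torch.tensor([1])               # toy target class

vectors = embedding(token_ids)          # step 2: tokens -> embedding vectors
pooled = vectors.mean(dim=1)            # step 3: feed into model layers
logits = classifier(pooled)
loss = nn.functional.cross_entropy(logits, label)  # step 4: prediction + loss

loss.backward()                         # step 5: gradients reach the embedding matrix
optimizer.step()                        # rows for IDs 1, 4, 7 are updated
```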

Inference and Information Retrieval with a Trained Model

Inference with a trained model

  • At inference (generation, classification, QA, etc.), the process looks almost the same as training but without weight updates:

  1. Input text is tokenized.
  2. Tokens are mapped to their pre-trained embeddings.
  3. The model uses these embeddings to make predictions (e.g., next token, class label).

  • Key point: embeddings here are fixed, learned during training, and act as the "input language" of the model.
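A short sketch of the same flow at inference time, again with a toy PyTorch setup; in practice the embedding and model weights would be loaded from the trained checkpoint rather than created fresh:

```python
# Minimal inference sketch (toy setup): the trained embedding matrix is
# reused as-is; no gradients, no weight updates.
import torch
import torch.nn as nn

vocab_size, embed_dim, num_classes = 10, 8, 2
embedding = nn.Embedding(vocab_size, embed_dim)   # weights loaded from training in practice
classifier = nn.Linear(embed_dim, num_classes)

token_ids = torch.tensor([[1, 4, 7]])             # 1. tokenized input (toy IDs)
with torch.no_grad():                             # no backpropagation at inference time
    vectors = embedding(token_ids)                # 2. fixed, pre-trained embeddings
    logits = classifier(vectors.mean(dim=1))      # 3. model makes a prediction
    predicted_class = logits.argmax(dim=-1)
print(predicted_class.item())
```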

Information Retrieval (IR) / Search

In retrieval tasks, embeddings are used differently:

  • Each document or query is encoded into an embedding vector
  • Similarity between embeddings (using cosine similarity or dot product) is computed to find the most relevant documents

Example:

  • Query: "Best Indian restaurants in Redmond" -> embedding vector q
  • Candidate docs (restaurant reviews, menus, etc.) -> embeddings [d1, d2, d3, …]
  • Compute similarity: similarity(q, di) for each doc using cosine or dot products of embedding vectors
  • Rank documents by similarity

This is the basis for semantic search, vector databases, and RAG (retrieval-augmented generation).
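A small sketch of the ranking step, assuming the query and candidate documents have already been encoded into vectors by some embedding model (the vectors and document names below are made up):

```python
import numpy as np

# Assume an embedding model already produced these vectors (made up here).
query = np.array([0.8, 0.1, 0.3])
docs = {
    "indian_restaurant_review": np.array([0.7, 0.2, 0.4]),
    "car_repair_guide":         np.array([0.1, 0.9, 0.0]),
    "redmond_dining_list":      np.array([0.9, 0.0, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score every document against the query and rank by similarity.
ranked = sorted(docs.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
for name, vec in ranked:
    print(f"{name}: {cosine(query, vec):.3f}")
```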

An Example Vector Space (Embedding Vectors)

A three-dimensional vector space where points A and B are close to each other indicates that their corresponding embeddings represent strong semantic similarity (e.g., car and automobile). In contrast, point C lies farther away in the vector space, suggesting it is less semantically related to points A and B.


[Figure: 3-D vector space]


How does semantic search using embeddings improve over traditional keyword-based search methods like TF-IDF or BM25, and in what scenarios might we still prefer traditional methods over semantic search?

