Tokenization and Embeddings

What is Tokenization?

In AI/ML, especially in Natural Language Processing (NLP), tokenization is the process of splitting text into smaller units (tokens) that a model can process.

Why?

Computers don’t understand text directly; they work with numbers. Tokenization is the first step in converting raw text into something structured.

Types of tokens:

Word-level: "Learn ML Basics" -> ["Learn", "ML", "Basics"]

Subword-level (common in modern LLMs): "unhappiness" -> ["un", "happiness"] (prevents rare words from being completely unknown)

Character-level: "AI" -> ["A", "I"]

Note: Most modern models (like GPT, BERT, etc.) use subword tokenization (e.g., Byte-Pair Encoding, WordPiece, SentencePiece).
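To make the subword idea concrete, here is a minimal sketch of greedy longest-match splitting (the same idea WordPiece uses), with a small made-up vocabulary rather than a real model's learned one:

```python
# Toy greedy longest-match subword tokenizer (hypothetical mini-vocabulary).
# Real tokenizers (BPE, WordPiece, SentencePiece) learn their vocabularies
# from data; this sketch only illustrates the splitting step.
VOCAB = {"un", "happi", "ness", "happiness", "learn", "ml", "basics", "a", "i"}

def subword_tokenize(word: str) -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        # Take the longest vocabulary entry that matches at this position.
        for end in range(len(word), start, -1):
            piece = word[start:end].lower()
            if piece in VOCAB:
                pieces.append(piece)
                start = end
                break
        else:
            # Unknown character: emit it as its own token.
            pieces.append(word[start])
            start += 1
    return pieces

print(subword_tokenize("unhappiness"))  # ['un', 'happiness']
```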

What are Embeddings?

After tokenization, we still have symbols (tokens), and we need to turn them into numbers. That’s where embeddings come in.

  • Definition: An embedding is a vector (list of numbers) that represents a token in a high-dimensional space.
  • Why embeddings? They capture the semantic meaning of words: similar words have vectors that are close together in the embedding space (e.g., car and automobile are close, and king ≈ queen in meaning). Relations can be captured as well: king - man + woman ≈ queen (see the sketch after this list).
  • Dimensionality: Typical embedding vectors have hundreds of dimensions (e.g., 300, 768, 1536). Example: "cat" -> [0.21, -0.35, 0.87, ...]
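As a toy illustration of both points, the sketch below uses made-up 3-dimensional vectors (real embeddings are learned and have hundreds of dimensions) and compares them with cosine similarity:

```python
# Toy 3-D vectors, made up for illustration; real embeddings are learned.
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.1]),
    "man":   np.array([0.1, 0.7, 0.0]),
    "woman": np.array([0.1, 0.1, 0.0]),
    "car":   np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related words are close; unrelated words are not.
print(cosine(emb["king"], emb["queen"]))  # high
print(cosine(emb["king"], emb["car"]))    # low

# The analogy: king - man + woman lands near queen.
target = emb["king"] - emb["man"] + emb["woman"]
print(cosine(target, emb["queen"]))       # highest among these words
```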

 Putting it Together

  1. Raw text: "Learn ML Basics"
  2. Tokenization: ["Learn", "ML", "Basics"]
  3. Convert tokens to IDs: [101, 2023, 30522] (model’s vocabulary IDs)
  4. Embedding lookup: "Learn" -> [0.2, -0.1, 0.7, ...], "ML" -> [0.5, 0.8, -0.3, ...], "Basics" -> [0.9, -0.4, 0.2, ...]

Now the model can process numbers instead of raw text.
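Here is a minimal sketch of steps 1-4 above, using a hypothetical vocabulary and a randomly initialized embedding matrix (the IDs and numbers are illustrative only, not a real model's):

```python
import numpy as np

# Hypothetical vocabulary: token -> ID (real models have tens of thousands of entries).
vocab = {"Learn": 101, "ML": 2023, "Basics": 30522}
embed_dim = 4
rng = np.random.default_rng(0)

# Embedding matrix: one row of `embed_dim` numbers per vocabulary ID.
embedding_matrix = {tid: rng.standard_normal(embed_dim) for tid in vocab.values()}

text = "Learn ML Basics"                       # 1. raw text
tokens = text.split()                          # 2. tokenization (word-level here)
ids = [vocab[t] for t in tokens]               # 3. tokens -> IDs
vectors = [embedding_matrix[i] for i in ids]   # 4. embedding lookup

for tok, vec in zip(tokens, vectors):
    print(tok, "->", np.round(vec, 2))
```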

Beyond Words: Embeddings Everywhere

Embeddings aren’t just for words:

  • Images: Represented as vectors (via CNNs, Vision Transformers).
  • Audio: Converted into spectrogram embeddings.
  • Graph data: Nodes/edges can be embedded.

This allows AI models to compare, cluster, and reason across different data types.

How are Tokenization and Embeddings used in ML?

Tokenization and embeddings are used both to train a model and to retrieve information from a trained model.

Training workflow

  1. Input text to tokens. Example: "ML is cool" -> ["ML", "is", "cool"]
  2. Tokens to embedding vectors. Each token is mapped to a vector (e.g., [0.12, -0.43, 0.77, …]); initially, the embedding vectors can be initialized with random n-dimensional values.
  3. Embedding vectors to model layers. These vectors are fed into the model (e.g., a Transformer), which processes them through layers of attention and feed-forward networks.
  4. Prediction + loss computation. The model produces a prediction, and a loss is computed against the expected output.
  5. Backpropagation updates embeddings. Gradients flow back not only through the model weights but also through the embedding matrix, so token representations improve over time.
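A minimal training sketch of steps 1-5, assuming a toy PyTorch setup in which a single linear layer stands in for the full model:

```python
# Minimal sketch (toy PyTorch setup): a trainable embedding matrix
# updated by backpropagation alongside the rest of the model.
import torch
import torch.nn as nn

vocab_size, embed_dim, num_classes = 10, 8, 2

embedding = nn.Embedding(vocab_size, embed_dim)   # randomly initialized embedding matrix
classifier = nn.Linear(embed_dim, num_classes)    # stand-in for the model layers

optimizer = torch.optim.SGD(
    list(embedding.parameters()) + list(classifier.parameters()), lr=0.1
)

token_ids = torch.tensor([[1, 4, 7]])   # step 1: tokenized input (toy IDs)
label = torch.tensor([1])               # toy target class

vectors = embedding(token_ids)          # step 2: tokens -> embedding vectors
pooled = vectors.mean(dim=1)            # step 3: feed into model layers
logits = classifier(pooled)
loss = nn.functional.cross_entropy(logits, label)  # step 4: prediction + loss

loss.backward()                         # step 5: gradients reach the embedding matrix
optimizer.step()                        # rows for IDs 1, 4, 7 are updated
```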

Inference and Information Retrieval with a Trained Model

Inference with a trained model

  • At inference (generation, classification, QA, etc.), the process looks almost the same as training but without weight updates:

  1. Input text is tokenized.
  2. Tokens are mapped to their pre-trained embeddings.
  3. The model uses these embeddings to make predictions (e.g., next token, class label).

  • Key point: embeddings here are fixed, learned during training, and act as the "input language" of the model.
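A short sketch of the same flow at inference time, again with a toy PyTorch setup; in practice the embedding and model weights would be loaded from the trained checkpoint rather than created fresh:

```python
# Minimal inference sketch (toy setup): the trained embedding matrix is
# reused as-is; no gradients, no weight updates.
import torch
import torch.nn as nn

vocab_size, embed_dim, num_classes = 10, 8, 2
embedding = nn.Embedding(vocab_size, embed_dim)   # weights loaded from training in practice
classifier = nn.Linear(embed_dim, num_classes)

token_ids = torch.tensor([[1, 4, 7]])             # 1. tokenized input (toy IDs)
with torch.no_grad():                             # no backpropagation at inference time
    vectors = embedding(token_ids)                # 2. fixed, pre-trained embeddings
    logits = classifier(vectors.mean(dim=1))      # 3. model makes a prediction
    predicted_class = logits.argmax(dim=-1)
print(predicted_class.item())
```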

Information Retrieval (IR) / Search

In retrieval tasks, embeddings are used differently:

  • Each document or query is encoded into an embedding vector
  • Similarity between embeddings (using cosine similarity or dot product) is computed to find the most relevant documents

Example:

  • Query: "Best Indian restaurants in Redmond" -> embedding vector q
  • Candidate docs (restaurant reviews, menus, etc.) -> embeddings [d1, d2, d3, …]
  • Compute similarity: similarity(q, di) for each doc using cosine or dot products of embedding vectors
  • Rank documents by similarity

This is the basis for semantic search, vector databases, and RAG (retrieval-augmented generation).
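A small sketch of the ranking step, assuming the query and candidate documents have already been encoded into vectors by some embedding model (the vectors and document names below are made up):

```python
import numpy as np

# Assume an embedding model already produced these vectors (made up here).
query = np.array([0.8, 0.1, 0.3])
docs = {
    "indian_restaurant_review": np.array([0.7, 0.2, 0.4]),
    "car_repair_guide":         np.array([0.1, 0.9, 0.0]),
    "redmond_dining_list":      np.array([0.9, 0.0, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score every document against the query and rank by similarity.
ranked = sorted(docs.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
for name, vec in ranked:
    print(f"{name}: {cosine(query, vec):.3f}")
```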

An Example Vector Space (Embedding Vectors)

A three-dimensional vector space where points A and B are close to each other indicates that their corresponding embeddings represent strong semantic similarity (e.g., car and automobile). In contrast, point C lies farther away in the vector space, suggesting it is less semantically related to points A and B.


[Figure: 3-D vector space]


How does semantic search using embeddings improve over traditional keyword-based search methods like TF-IDF or BM25, and in what scenarios might we still prefer traditional methods over semantic search?

