Understanding Embeddings with Visualization + Code (Beginner Friendly Guide)
What Are Embeddings?
In the world of Artificial Intelligence and Natural Language Processing (NLP), embeddings are one of the most powerful concepts.
Simply put, an embedding is a dense numerical representation of text, words, sentences, or even entire documents in a high-dimensional vector space. Instead of treating words as isolated symbols, embeddings convert them into vectors (lists of numbers) such that similar words or sentences have vectors that are close to each other.
For example, the vectors for "cat" and "kitten" end up close together, while "cat" and "car" stay far apart despite the similar spelling.
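To make "close to each other" concrete, here's a minimal sketch with tiny made-up 3-dimensional vectors (real models like all-mpnet-base-v2 produce 768 dimensions); cosine similarity is the standard way to measure that closeness:

```python
import numpy as np

def cosine_similarity(a, b):
    # Ranges from -1 to 1; higher means the vectors point in similar directions
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" -- invented numbers purely for illustration
cat    = np.array([0.90, 0.80, 0.10])
kitten = np.array([0.85, 0.75, 0.20])
car    = np.array([0.10, 0.20, 0.95])

print(cosine_similarity(cat, kitten))  # close to 1 -> semantically similar
print(cosine_similarity(cat, car))     # much lower -> semantically different
```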
Visualization of embeddings
I took a list of 50 animal/insect/fish names (mammals 🦁, birds 🦅, reptiles 🐍, fish 🐟, insects 🐜) and turned them into dense vector representations using the powerful all-mpnet-base-v2 model from Sentence Transformers.
Then I reduced the high-dimensional embeddings to 2D using PCA, applied K-Means clustering (k=5), and visualized everything with convex hulls to clearly show the natural semantic groups.
Why this matters: if nearby vectors really do mean similar things, you can cluster, search, and recommend using nothing but geometry.
Here's the core code walkthrough:
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from scipy.spatial import ConvexHull
# 1. Load the embedding model (all-mpnet-base-v2 is excellent for semantic similarity)
model = SentenceTransformer('all-mpnet-base-v2')
# Sample texts (50 animal-related words across 5 natural categories)
texts = [
"lion", "tiger", "elephant", "giraffe", "zebra", "kangaroo", "panda", "wolf", "dolphin", "bat",
"eagle", "sparrow", "parrot", "penguin", "owl", "peacock", "flamingo", "pigeon", "hawk", "crow",
"snake", "lizard", "crocodile", "alligator", "turtle", "chameleon", "gecko", "iguana", "komodo_dragon", "cobra",
"salmon", "tuna", "shark", "goldfish", "catfish", "trout", "sardine", "eel", "stingray", "clownfish",
"ant", "bee", "butterfly", "beetle", "mosquito", "fly", "grasshopper", "dragonfly", "termite", "ladybug"
]
# 2. Generate embeddings (768-dimensional vectors)
embeddings = model.encode(texts)
# 3. Reduce dimensionality to 2D for visualization using PCA
pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)
# 4. Apply K-Means clustering (we expect ~5 semantic groups)
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
clusters = kmeans.fit_predict(reduced)
# 5. Plot with annotations + convex hulls to show cluster boundaries
plt.figure(figsize=(30, 10))
plt.scatter(reduced[:, 0], reduced[:, 1], c=clusters, cmap='tab10', s=50)
for i, label in enumerate(texts):
    plt.annotate(label, (reduced[i, 0], reduced[i, 1]),
                 textcoords="offset points", xytext=(5, 5), ha='center')
# Draw convex hulls around each cluster
for i in range(5):
    pts = reduced[clusters == i]
    if len(pts) > 2:  # a 2D convex hull needs at least 3 points
        hull = ConvexHull(pts)
        plt.plot(pts[hull.vertices, 0], pts[hull.vertices, 1], 'k-', alpha=0.5)
        plt.fill(pts[hull.vertices, 0], pts[hull.vertices, 1], alpha=0.1)
plt.title("2D PCA Visualization of Animal Embeddings with K-Means Clusters")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.grid(True)
plt.show()
The clusters that emerge are surprisingly clean — mammals, birds, reptiles, fish, and insects naturally group together even though the model never saw explicit category labels.
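One way to sanity-check that claim is the adjusted Rand index, which scores how well the K-Means assignments agree with known category labels (1.0 = perfect match). Here's a sketch on synthetic 2D points standing in for the PCA-reduced embeddings, so no model download is needed:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(42)

# Synthetic stand-in for the 2D reduced embeddings: 5 groups of 10 points each
centers = np.array([[0, 0], [5, 0], [0, 5], [5, 5], [2.5, 8]])
true_labels = np.repeat(np.arange(5), 10)          # the "real" categories
points = centers[true_labels] + rng.normal(scale=0.4, size=(50, 2))

clusters = KMeans(n_clusters=5, random_state=42, n_init=10).fit_predict(points)

# High ARI means the unsupervised clusters recovered the true categories
print(adjusted_rand_score(true_labels, clusters))
```

In the real pipeline you'd pass the animal words' true categories as `true_labels` and the K-Means output as `clusters`.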
This is a great starter project if you're getting into vector embeddings, semantic similarity, or building retrieval systems.
Would love to hear your thoughts! Drop your experiences or suggestions below 👇
#AI #MachineLearning #NLP #Embeddings #SentenceTransformers #DataVisualization #Python #Clustering #PCA #KMeans
This is exactly why embeddings power things like semantic search and RAG. Once you see this visually, it’s easier to trust similarity scores in real systems.
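The semantic-search connection can be sketched in a few lines: embed the documents once, embed the query, and return the document with the highest cosine similarity. The vectors below are invented toy numbers (a real system would get them from model.encode and likely store them in a vector database):

```python
import numpy as np

# Toy document "index" -- three documents with made-up 3D embeddings
docs = ["lion facts", "eagle habitats", "salmon recipes"]
doc_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.1, 0.9],
])
query_vec = np.array([0.8, 0.2, 0.1])  # imagine: the embedding of "big cats"

# Cosine similarity of the query against every document
scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
best = int(np.argmax(scores))
print(docs[best])  # the semantically closest document
```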