Understanding Embeddings with Visualization + Code (Beginner Friendly Guide)
What Are Embeddings?
In the world of Artificial Intelligence and Natural Language Processing (NLP), embeddings are one of the most powerful concepts.
Simply put, an embedding is a dense numerical representation of text, words, sentences, or even entire documents in a high-dimensional vector space. Instead of treating words as isolated symbols, embeddings convert them into vectors (lists of numbers) such that similar words or sentences have vectors that are close to each other.
For example, the vectors for "cat" and "kitten" end up close together, while "cat" and "car" stay far apart despite the similar spelling.
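To make "close to each other" concrete, here's a minimal sketch with tiny made-up 3-dimensional vectors (real models like all-mpnet-base-v2 produce 768 dimensions); cosine similarity is the standard way to measure that closeness:

```python
import numpy as np

def cosine_similarity(a, b):
    # Ranges from -1 to 1; higher means the vectors point in similar directions
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" -- invented numbers purely for illustration
cat    = np.array([0.90, 0.80, 0.10])
kitten = np.array([0.85, 0.75, 0.20])
car    = np.array([0.10, 0.20, 0.95])

print(cosine_similarity(cat, kitten))  # close to 1 -> semantically similar
print(cosine_similarity(cat, car))     # much lower -> semantically different
```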
Visualization of embeddings
I took a list of 50 animal/insect/fish names (mammals 🦁, birds 🦅, reptiles 🐍, fish 🐟, insects 🐜) and turned them into dense vector representations using the powerful all-mpnet-base-v2 model from Sentence Transformers.
Then I reduced the high-dimensional embeddings to 2D using PCA, applied K-Means clustering (k=5), and visualized everything with convex hulls to clearly show the natural semantic groups.
Why this matters: if nearby vectors really do mean similar things, you can cluster, search, and recommend using nothing but geometry.
Here's the core code walkthrough:
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from scipy.spatial import ConvexHull
# 1. Load the embedding model (all-mpnet-base-v2 is excellent for semantic similarity)
model = SentenceTransformer('all-mpnet-base-v2')
# Sample texts (50 animal-related words across 5 natural categories)
texts = [
"lion", "tiger", "elephant", "giraffe", "zebra", "kangaroo", "panda", "wolf", "dolphin", "bat",
"eagle", "sparrow", "parrot", "penguin", "owl", "peacock", "flamingo", "pigeon", "hawk", "crow",
"snake", "lizard", "crocodile", "alligator", "turtle", "chameleon", "gecko", "iguana", "komodo_dragon", "cobra",
"salmon", "tuna", "shark", "goldfish", "catfish", "trout", "sardine", "eel", "stingray", "clownfish",
"ant", "bee", "butterfly", "beetle", "mosquito", "fly", "grasshopper", "dragonfly", "termite", "ladybug"
]
# 2. Generate embeddings (768-dimensional vectors)
embeddings = model.encode(texts)
# 3. Reduce dimensionality to 2D for visualization using PCA
pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)
# 4. Apply K-Means clustering (we expect ~5 semantic groups)
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
clusters = kmeans.fit_predict(reduced)
# 5. Plot with annotations + convex hulls to show cluster boundaries
plt.figure(figsize=(30, 10))
plt.scatter(reduced[:, 0], reduced[:, 1], c=clusters, cmap='tab10', s=50)
for i, label in enumerate(texts):
    plt.annotate(label, (reduced[i, 0], reduced[i, 1]),
                 textcoords="offset points", xytext=(5, 5), ha='center')
# Draw convex hulls around each cluster
for i in range(5):
    pts = reduced[clusters == i]
    if len(pts) > 2:  # a 2D convex hull needs at least 3 points
        hull = ConvexHull(pts)
        plt.plot(pts[hull.vertices, 0], pts[hull.vertices, 1], 'k-', alpha=0.5)
        plt.fill(pts[hull.vertices, 0], pts[hull.vertices, 1], alpha=0.1)
plt.title("2D PCA Visualization of Animal Embeddings with K-Means Clusters")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.grid(True)
plt.show()
The clusters that emerge are surprisingly clean — mammals, birds, reptiles, fish, and insects naturally group together even though the model never saw explicit category labels.
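One way to sanity-check that claim is the adjusted Rand index, which scores how well the K-Means assignments agree with known category labels (1.0 = perfect match). Here's a sketch on synthetic 2D points standing in for the PCA-reduced embeddings, so no model download is needed:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(42)

# Synthetic stand-in for the 2D reduced embeddings: 5 groups of 10 points each
centers = np.array([[0, 0], [5, 0], [0, 5], [5, 5], [2.5, 8]])
true_labels = np.repeat(np.arange(5), 10)          # the "real" categories
points = centers[true_labels] + rng.normal(scale=0.4, size=(50, 2))

clusters = KMeans(n_clusters=5, random_state=42, n_init=10).fit_predict(points)

# High ARI means the unsupervised clusters recovered the true categories
print(adjusted_rand_score(true_labels, clusters))
```

In the real pipeline you'd pass the animal words' true categories as `true_labels` and the K-Means output as `clusters`.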
This is a great starter project if you're getting into vector embeddings, semantic similarity, or building retrieval systems.
Would love to hear your thoughts! Drop your experiences or suggestions below 👇
#AI #MachineLearning #NLP #Embeddings #SentenceTransformers #DataVisualization #Python #Clustering #PCA #KMeans
This is exactly why embeddings power things like semantic search and RAG. Once you see this visually, it’s easier to trust similarity scores in real systems.
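The semantic-search connection can be sketched in a few lines: embed the documents once, embed the query, and return the document with the highest cosine similarity. The vectors below are invented toy numbers (a real system would get them from model.encode and likely store them in a vector database):

```python
import numpy as np

# Toy document "index" -- three documents with made-up 3D embeddings
docs = ["lion facts", "eagle habitats", "salmon recipes"]
doc_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.1, 0.9],
])
query_vec = np.array([0.8, 0.2, 0.1])  # imagine: the embedding of "big cats"

# Cosine similarity of the query against every document
scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
best = int(np.argmax(scores))
print(docs[best])  # the semantically closest document
```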