Vector embeddings as I understand them
The team at PlanetScale (where I work) recently announced that vector search is coming to our databases.
As a technical writer and researcher on the Developer Education team, I became responsible for understanding what vector embeddings are and why this announcement is important, so I set out to do just that for future content and announcements. I’m no AI expert, but I think I’ve built a pretty good mental model of what they are.
Thus the headline, because this is how (I think) I understand them.
What are vector embeddings?
To start, a vector embedding in AI workloads is a representation of some "thing" using an array of numbers.
On the surface, it’s kinda weird to think that anything can be boiled down to such a simple concept, but let's reframe it just a bit into something a little easier to comprehend for developers. When I was originally learning OOP in C#, much of the educational material would use the concept of cars and motorcycles to teach inheritance. They are different objects, but share attributes like having wheels, passengers, a color, etc.
So a Car class and a Motorcycle class can inherit from a Vehicle class.
Now, while a car and a motorcycle differ in the number of wheels they have, let’s throw a semi truck into the mix. A semi often has 10 wheels (assuming two dual-wheel rear axles on the cab), and when you compare all three vehicles on wheel count alone, the motorcycle (2 wheels) and car (4 wheels) are technically more similar to each other than either is to the semi (10 wheels). “Wheel count” is one attribute that can be used to gauge the similarity between these three objects.
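To make the wheel-count comparison concrete, here is a tiny Python sketch. The wheel counts come from the paragraph above; the `wheel_gap` helper and the idea of using a plain difference as the "similarity" measure are my own illustration, not how a real vector database does it.

```python
# Wheel counts taken from the example above.
wheel_counts = {"motorcycle": 2, "car": 4, "semi": 10}


def wheel_gap(a: str, b: str) -> int:
    """Hypothetical helper: absolute difference in wheel count.

    A smaller gap means the two vehicles are more similar
    on this one attribute.
    """
    return abs(wheel_counts[a] - wheel_counts[b])


# Compare every pair of vehicles on wheel count alone.
gaps = {
    ("motorcycle", "car"): wheel_gap("motorcycle", "car"),
    ("car", "semi"): wheel_gap("car", "semi"),
    ("motorcycle", "semi"): wheel_gap("motorcycle", "semi"),
}

# The pair with the smallest gap is the most similar pair.
closest_pair = min(gaps, key=gaps.get)
print(closest_pair, gaps[closest_pair])
```

With only one attribute, "similarity" is just a gap on a number line. Each extra attribute adds another number to compare.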
In reality, these objects have THOUSANDS of attributes that can be used to determine how similar they are to each other.
Measuring similarity
The numbers in a vector embedding (or vector, for short) are coordinates in a highly complex, multidimensional space, and you can think of each dimension as tracking a single attribute.
To determine how similar two objects are, a mathematical formula called cosine similarity is used, and I won't even pretend to fully understand it. When two vectors are run through this formula, it returns a decimal number between -1 and 1 that states how similar the two things are, with 1 meaning the vectors point in exactly the same direction. Now do this over an entire dataset in a SQL table.
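For the curious, the formula is less scary in code than it sounds: multiply the two vectors together element by element, add up the results, then divide by the product of the two vectors' lengths. Below is a minimal Python sketch. The two-number vehicle vectors (wheel count, passenger seats) are values I made up for illustration, and looping over a dictionary is only a stand-in for a real table. This is not how PlanetScale implements it.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical two-dimensional "embeddings": [wheel count, passenger seats].
# Real embeddings have hundreds or thousands of dimensions.
vehicles = {
    "car": [4, 5],
    "semi": [10, 2],
}
motorcycle = [2, 2]

# Score every stored vehicle against the motorcycle, most similar first.
ranked = sorted(
    vehicles,
    key=lambda name: cosine_similarity(motorcycle, vehicles[name]),
    reverse=True,
)
print(ranked)
```

Because none of these made-up vectors contain negative numbers, every score here lands between 0 and 1. The car ranks above the semi, which matches the wheel-count intuition from earlier.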
That’s what PlanetScale Engineering has pulled off.