LLMs, RAG, Vector Databases. What are they?
I use AI every day. It has genuinely changed how I work: code, emails, reports, mockups, data analysis, and documentation. And yet, for a long time, I didn't really know what actually happened when I typed something and got an answer back. I didn't know what a vector database was. I'd heard "RAG" and could expand the acronym, but I didn't know what it actually did. I knew LLM stood for Large Language Model the same way I know JPEG stands for Joint Photographic Experts Group: a fact that tells you nothing about how the thing actually works.
The gap caught up with me eventually. I built an integration that kept returning stale data and couldn't figure out why. I watched a chatbot give a customer wrong instructions because it didn't know about a policy change from three months earlier. I started asking the questions I should have asked sooner: where is this thing actually getting its answers from?
That question is the one that unlocks most of the confusing behavior. This is my attempt to answer it.
Who this is for: developers getting into AI who want to understand the infrastructure, and anyone who already uses AI tools daily and wants to understand why they sometimes work brilliantly and sometimes make things up with total confidence.
The Terms That Keep Coming Up
Four terms come up constantly in AI conversations. I'll define them here because the rest of this article assumes you know them. If you're already solid on all four, feel free to skip ahead. If you've been quietly fuzzy on any of them, same here.
LLM: Large Language Model
An LLM is the model doing the reading and writing. It's trained on a massive amount of text: books, websites, code, Wikipedia, forum posts, documentation. It learned patterns not by memorizing text, but by repeatedly predicting what comes next, until those patterns stuck. The result is a system that can write coherently, reason through problems, and generate code.
Think of it this way: Someone who spent years reading every book in a library. They didn't memorize them; they absorbed enough to reason, write, and explain. The catch: the day they walked out, they stopped learning. Anything published after that is news to them.
That cutoff matters. An LLM's knowledge is frozen at training time. Ask it about something that happened after its cutoff and it either admits ignorance or, worse, generates something that sounds plausible but is wrong. This is called hallucination. It's not really a bug: the model is doing exactly what it was trained to do. It just doesn't know what it doesn't know.
GPT-4, Claude, Gemini, Llama: all LLMs. Different training data, different architectures, different strengths. Same fundamental limitation.
Vector Database
A vector database stores information based on meaning rather than exact words. Here's what that looks like in practice: you take a piece of content, run it through a model that converts it into a list of numbers, and store those numbers. That list is called a vector, and it represents what the content means, not what it literally says.
Here's why it matters. "The return window is 30 days" and "Customers have a month to send items back" mean the same thing, and a vector search will find both. A keyword search would miss one of them. The system isn't matching words; it's matching meaning.
A helpful way to picture it: Instead of organizing a library by title or author, imagine assigning every book GPS coordinates based purely on what it's about. Related books cluster together on the map. You search by dropping a pin where your question lands, and the closest books surface, regardless of what they're called.
At scale, scanning every stored entry to find the closest match would be too slow, so vector databases use approximate nearest-neighbor (ANN) indexes to navigate toward the best results quickly. The tradeoff is a small loss of accuracy, but in practice the results are more than good enough.
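You can watch this happen with a few lines of Python. A minimal sketch using the open-source sentence-transformers library (the model name is just one common default; any embedding model behaves the same way):

```python
# A quick demonstration using the open-source sentence-transformers
# library; the model name is one common default, and any embedding
# model behaves the same way.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

vectors = model.encode([
    "The return window is 30 days",
    "Customers have a month to send items back",
    "Our office is closed on public holidays",
])

# Cosine similarity: close to 1 means similar meaning
print(util.cos_sim(vectors[0], vectors[1]))  # high: same meaning, different words
print(util.cos_sim(vectors[0], vectors[2]))  # low: unrelated content
```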
RAG: Retrieval-Augmented Generation
RAG is a technique, not a product. It's a pattern for solving the training cutoff problem without retraining the model. The name breaks down simply: retrieval (find relevant documents), augmented (add them to the prompt), generation (let the model write an answer based on what it found).
Here's the flow: a question comes in, you search a knowledge base for the most relevant content, drop that content into the model's prompt alongside the original question, and the model writes an answer based on what it just read. The model doesn't need to have seen your documentation during training. It reads it in the moment, the same way you'd hand someone a document before asking them to summarize it.
Another way to think about it: Instead of quizzing someone from memory, you hand them the relevant pages first. Their answer is based on what's in front of them, not on what they might or might not remember.
The retrieval step almost always uses a vector database, because you want documents that are meaningfully related to the question, not just ones that share a few words. Someone asking "how long do I have to return something?" should surface the return policy even if the policy never uses the word "long."
RAG is the most widely used pattern in production AI today. It's why vector databases went from obscure to mainstream in about two years.
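Here's the whole loop in miniature, using Chroma, an open-source vector database that runs in-process. The call_llm function is a placeholder for whichever model API you use; everything else runs as written:

```python
# A minimal end-to-end RAG sketch using Chroma. call_llm is a
# placeholder for your model provider's API.
import chromadb

def call_llm(prompt: str) -> str:
    raise NotImplementedError("your model API call goes here")

client = chromadb.Client()  # in-memory instance, fine for experiments
collection = client.create_collection(name="support_docs")

# Indexing: Chroma converts each document to a vector automatically
collection.add(
    ids=["returns", "refunds"],
    documents=[
        "Returns are accepted within 30 days of delivery.",
        "Refunds go back to the original payment method.",
    ],
)

def answer(question: str) -> str:
    # Retrieval: the stored entries whose meaning is closest to the question
    results = collection.query(query_texts=[question], n_results=2)
    context = "\n".join(results["documents"][0])
    # Augmentation + generation: the model answers from what it just read
    return call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```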
Tool Calls
On its own, an LLM can only do one thing: generate text. Tool calls change that. You give the model a defined list of actions it's allowed to take: search the web, look up a record, send an email, check a calendar, call an API. When it determines one of those would help answer the question, it requests it. The application actually makes the call, gets the result, and feeds it back. The model reads the result and continues.
Think of it like this: A dispatcher who can't leave the building but has a team in the field. They describe what they need, someone goes and gets it, reports back. The dispatcher incorporates the information and moves forward. They never leave the building, but they're not limited to what they already know.
This is what separates an AI that answers questions from one that actually gets things done. Booking meetings, creating tickets, updating records, pulling live data: all tool calls. Combine that with RAG for grounded knowledge and you have most of what people mean when they say "AI agent."
Worth noting: the database type still matters. Looking up a customer's exact order count hits a traditional SQL database. Finding support tickets similar to a new complaint hits a vector database. Both can happen in the same conversation.
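In code, the loop looks roughly like this. It's a sketch, not any particular provider's API: call_llm and parse_tool_request are placeholders for your model call and for however your framework surfaces the model's tool requests:

```python
# A sketch of the tool-call loop; not any particular provider's API.
# call_llm and parse_tool_request stand in for your model call and
# for your framework's tool-request format.
import json

def call_llm(conversation: list) -> str:
    raise NotImplementedError("your model API call goes here")

def parse_tool_request(reply: str) -> dict | None:
    raise NotImplementedError('return e.g. {"tool": ..., "args": ...} or None')

def get_order_count(customer_id: str) -> int:
    return 7  # stand-in for a real SQL lookup

TOOLS = {"get_order_count": get_order_count}

def run_turn(conversation: list) -> str:
    while True:
        reply = call_llm(conversation)       # the model only generates text
        request = parse_tool_request(reply)
        if request is None:
            return reply                     # no tool needed: final answer
        # The application executes the action, never the model itself
        result = TOOLS[request["tool"]](**request["args"])
        conversation.append({"role": "tool", "content": json.dumps(result)})
```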
Seeing the Full Picture
Architecture diagrams are either immediately helpful or immediately confusing, with not much in between. This one I've tried to keep honest: it shows what actually happens between a user typing a question and getting a response back.
Read it top to bottom: the user sends a question -> the question is converted to a searchable format and run against the knowledge base -> relevant docs are retrieved (RAG) -> external tools are called if needed -> everything is assembled into a grounded answer.
Embeddings and Semantic Search
This is the part of the system people find hardest to picture, and I get why: there's no familiar mental model to map it onto. But once it clicks, it makes a lot of sense.
A vector database doesn't store text as text. It stores everything as a list of numbers, called an embedding, that represents what the content means. You run your content through an embedding model first, which outputs that list of numbers. You store it alongside a reference to the original content and any metadata you care about: document ID, date, category, whatever you'll want to filter on later.
When a query comes in, it goes through the same conversion and becomes a numerical representation. The database finds stored entries that are mathematically closest to it. What comes back is a ranked list of content that means something similar to the query, even if the words are completely different.
The search uses approximate nearest-neighbor indexing rather than scanning everything, so at a million entries it still returns results in milliseconds. It's approximate, not perfect, but accurate enough for production use.
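To make "mathematically closest" concrete, here's the ranking step in miniature. The three-number vectors are toy stand-ins (real embeddings have hundreds of dimensions), and a real database would use an index instead of comparing against every entry:

```python
# Toy illustration of similarity ranking. Real systems use learned
# embeddings and ANN indexes; the ranking logic is the same.
import numpy as np

stored = np.array([[0.9, 0.1, 0.0],   # doc-1: return policy
                   [0.8, 0.2, 0.1],   # doc-2: refund FAQ
                   [0.0, 0.1, 0.9]])  # doc-3: holiday schedule
metadata = ["doc-1", "doc-2", "doc-3"]

query = np.array([0.85, 0.15, 0.05])  # the converted question

# Cosine similarity between the query and every stored vector
scores = stored @ query / (np.linalg.norm(stored, axis=1) * np.linalg.norm(query))
top_k = np.argsort(scores)[::-1][:2]  # indices of the 2 closest entries
print([(metadata[i], round(float(scores[i]), 3)) for i in top_k])
```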
Where it fits in AI systems
Vector databases are the retrieval engine in RAG pipelines. You split your documents into smaller chunks, convert every chunk into its numerical representation, and store those. At query time, you search for the most relevant chunks and drop them into the model's prompt. That's what lets an AI assistant answer from your actual documentation instead of making something up. The same pattern also applies to recommendation systems, duplicate detection, and image similarity search.
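For illustration, here's a minimal fixed-size chunker. Real pipelines often split on sentences or headings instead, and the size and overlap values are arbitrary defaults, not recommendations:

```python
# A minimal fixed-size chunker with overlap (values are illustrative).
# Overlap means an idea split at a chunk boundary still appears whole
# in at least one chunk.
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```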
What it does well
- Matching on meaning: paraphrases and synonyms surface even when no words overlap
- Vague, conversational queries that keyword search would fumble
- Similarity tasks: related support tickets, near-duplicate documents, comparable products
Where it struggles
- Exact-match lookups: order numbers, SKUs, and error codes are keyword or SQL territory
- Precise filtering, counting, and aggregation: still a job for a traditional database
- Guarantees: results are approximate and ranked by closeness, so the top hit is "most similar," not necessarily "correct"
What You’re Actually Working With
A quick reference for the key technical properties, useful when evaluating tools or explaining the system to someone else:
- Stored unit: a vector (the numerical representation) plus a reference to the original content and any metadata you want to filter on
- Search method: mathematical similarity between the query's vector and stored vectors, typically cosine similarity
- Indexing: approximate nearest-neighbor, trading a little accuracy for a lot of speed
- Typical performance: millisecond results at a million or more entries
- Updating: re-convert and re-index the changed entry; the model itself never changes
A vector database doesn't store what your data says. It stores what your data means. That distinction is what makes RAG possible.
Why RAG Changed How We Think About AI in Production
Before RAG, if you wanted a model to know your company's specific information, you had two options. You could retrain it on your data, which is expensive, slow, and requires a dedicated ML team. Or you could accept that it would sometimes make things up when asked about anything outside its training, and try to paper over it with careful prompting. Neither felt like a real solution.
RAG is the third option. Keep the model as-is. Maintain a knowledge base of your content. When a question comes in, search the knowledge base, pull the relevant pieces, drop them into the prompt with the question, and let the model reason over what it just read. The model doesn't change. The knowledge base is just text you keep updated. When something changes, you update the entry and re-index it: no retraining, no deployment, done.
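Continuing the Chroma sketch from earlier, "update the entry and re-index it" is a one-call operation; upsert replaces the stored entry and re-converts it:

```python
# Continuing the earlier Chroma sketch: when the policy changes,
# overwrite the stored entry. upsert re-embeds and re-indexes it.
collection.upsert(
    ids=["returns"],
    documents=["Returns are accepted within 45 days of delivery."],
)
```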
A concrete example
Your team builds an AI assistant for software support. Someone types: "How do I configure SSO with Okta?" The system searches your documentation and pulls back the three most relevant pieces: maybe the SSO setup guide, a troubleshooting FAQ, and a recent changelog entry. All three go into the prompt alongside the original question. The model reads them and writes an answer grounded in your actual docs, not in something it half-remembered from training data that might be two years old.
That's the whole pattern. The knowledge base does the retrieval. The model does the reasoning and writing. Neither can do the other's job, and neither is trying to.
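For the curious, the assembled prompt is just text. Something roughly like this, where the three context variables are illustrative names for the retrieved chunks:

```python
# Illustrative retrieved chunks (in a real system these come from the
# vector database, not hardcoded strings)
sso_setup_guide = "To enable SSO, open Settings > Security > SSO..."
troubleshooting_faq = "If the Okta redirect fails, check the callback URL..."
changelog_entry = "v4.2: SAML metadata upload is now supported..."

prompt = f"""You are a support assistant. Answer only from the context below.

Context:
{sso_setup_guide}

{troubleshooting_faq}

{changelog_entry}

Question: How do I configure SSO with Okta?"""
```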
Where tool calls fit into this
RAG gives the model knowledge. Tool calls give it reach. In production they often run together: the model retrieves documentation via RAG to answer a question, then uses a tool call to create a ticket, check an account, or send a follow-up. The model coordinates both, deciding at each step what it needs. That combination of grounded knowledge plus live actions is what separates genuinely useful AI systems from demos that only work when the question is exactly right.
On the blurring of lines
Worth mentioning: the line between "vector database" and "regular database" is blurring. PostgreSQL has pgvector, an extension that adds vector storage and search directly to Postgres. Elasticsearch has supported vector search for a while. If you're starting out and don't want to add a dedicated vector database to your stack, pgvector on an existing Postgres instance is a reasonable starting point: solid performance, familiar tooling, and you can always migrate later if scale demands it.
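As a rough sketch of what that looks like from Python, using the psycopg driver. The table layout and the 384-dimension vector size are assumptions; match the dimension to whichever embedding model you pick:

```python
# A hedged sketch of pgvector from Python via psycopg (Postgres driver).
# Table layout and the 384 dimensions are assumptions for illustration.
import psycopg

conn = psycopg.connect("dbname=app")  # your connection string here
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.execute(
    "CREATE TABLE IF NOT EXISTS docs "
    "(id text PRIMARY KEY, body text, embedding vector(384))"
)

query_vector = [0.0] * 384  # in practice: the embedded question
literal = "[" + ",".join(str(x) for x in query_vector) + "]"

# <=> is pgvector's cosine-distance operator: smaller means closer
rows = conn.execute(
    "SELECT body FROM docs ORDER BY embedding <=> %s::vector LIMIT 3",
    (literal,),
).fetchall()
```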
Where to go from here
If you're a developer: build a small RAG pipeline end-to-end. Grab a handful of documents, split them into chunks, convert them to their numerical representations using an API or a local model, store them in Chroma or pgvector, and write a retrieval function. Run a few queries. See what comes back and why. One hour of doing that will teach you more than another article about it, including this one.
If you're not a developer but you evaluate or buy AI tools: the most useful question you can ask is "where does this get its information from, and how often is that updated?" If the answer is "from the model's training data," anything that changed in the past year or two might be wrong or missing. If the answer is "from a knowledge base over our documentation," ask how often that knowledge base is updated when content changes. The gap between what's stored and what's actually current is where quiet failures live: no errors, just wrong answers.
I'm still learning this too. It moves fast. But the core mechanics have been stable long enough to be worth understanding properly. Everything built on top of this eventually traces back to it.