CODE CONTEXT GRAPH
You’ve been there: a big repo, a vague question (“where do we handle auth?”), and a search that either drowns you in results or misses the one function that actually matters.
Grep finds strings. Semantic search finds similar text. But code has structure—calls, types, files, classes—and that structure is exactly what you want when you’re answering “what calls this?” or “what does this depend on?”
We wanted something that could do both: understand meaning (semantic search) and follow structure (calls, types, siblings). So we built a Code Context Graph: an index that represents your code as a graph and retrieves context by combining vector search, reranking, and graph traversal.
1. The problem with “just embed the repo”
If you embed every file (or every chunk) and search by cosine similarity, you get semantically close text. You don’t get “this function calls that one.”
We needed retrieval that could say: “Here are the most relevant pieces and here’s what they call, what types they use, and what sits next to them.”
🖼️ The Limitations of Traditional Code Search
Before jumping into the architecture, it helps to visualize the problem. Standard search tools treat a codebase like a flat stack of documents, missing the vital architectural connections that define logic.
Here is what that looks like conceptually: a developer isolated by a wall of files, unable to see the connections they need.
Figure 1: The Status Quo. A developer overwhelmed by a disconnected codebase. Simple grep and semantic search only see surface-level text (the abstract vector cloud), missing the deep structural connections (the code structure diagram on the glass) necessary for understanding.
2. What we actually built
We didn't just build a better search; we built a structured map of the codebase. Our Code Context Graph (CCG) treats code as a deeply interconnected web of logic, not a flat directory of text files.
This section breaks down how we turn raw files into a graph-aware index, and how that index powers a robust three-stage retrieval pipeline.
I. Scope-aware indexing (not just “files”)
We don’t index raw files. We parse them using Tree-sitter into a rigid scope tree: file → class → function. Each node is a real unit of code, not just a chunk of bytes.
Instead of embedding the whole function body, we embed a short "ghost text" containing the file path, the scope, and a slice of the code. Vectors are stored (e.g., in Qdrant) next to a shadow index (SQLite) holding the full content and the graph edges.
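As a sketch, assembling that "ghost text" might look like the following. The `ScopeNode` shape, field names, and separator format here are illustrative, not the actual schema:

```python
from dataclasses import dataclass

@dataclass
class ScopeNode:
    """One unit of the scope tree: a file, class, or function."""
    path: str          # file path, e.g. "src/auth/session.py"
    scope: list[str]   # enclosing scopes, e.g. ["UserAuth", "login"]
    code: str          # full source, stored in the shadow index (SQLite)

def ghost_text(node: ScopeNode, max_code_chars: int = 400) -> str:
    """Build the short text that gets embedded instead of the full body."""
    scope_path = " > ".join(node.scope) if node.scope else "<module>"
    snippet = node.code[:max_code_chars]  # a slice of the code, not all of it
    return f"{node.path} | {scope_path}\n{snippet}"

node = ScopeNode(
    path="src/auth/session.py",
    scope=["UserAuth", "login"],
    code="def login(self, user, password):\n    ...",
)
print(ghost_text(node))
```

The vector computed from this string goes to the vector store; the full `code` field stays in the shadow index, keyed by the same node id.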
II. A graph over the same nodes
While we build the scope tree, we also extract the connections: who calls whom, and which types each function uses. These become explicit CALLS and USES_TYPE edges in the graph.
This gives us one index with two views: vectors for “what’s similar?” and a graph for “what’s connected?”
🖼️ Visualizing the Indexing and Graph Construction
How does this graph look in reality? This diagram shows the relationship between the physical storage (vectors and relational data) and the conceptual model (the logical graph). You can see the distinct nodes (File, Class, Function) and the critical edges (Calls, Uses Type) connecting them.
Figure 2: Turning Code into a Graph. Raw code is parsed (Tree-sitter) into a structured hierarchy of nodes. Instead of one big vector for a whole file, we generate embeddings for specific scopes (like class UserAuth). Relationships become explicit 'CALLS' or 'USES_TYPE' edges, merging semantic similarity with structural context.
III. Three-stage retrieval + graph expansion
This structured graph changes how we retrieve context. Rather than running a single cosine-similarity search and dumping the results, we use a three-stage pipeline—vector search, reranking, then graph expansion—so we find not only the most relevant code but also the code needed to understand it.
The result is one context string, containing the most relevant code plus the structural dependencies needed to understand it.
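The pipeline can be sketched end to end as below. The word-overlap scorer is a toy stand-in for real embeddings and a cross-encoder reranker, and the parameter names (`top_k`, `rerank_k`, `max_hops`) are illustrative:

```python
def retrieve_context(query, nodes, graph, top_k=5, rerank_k=2, max_hops=1):
    """Three-stage retrieval: vector search -> rerank -> graph expansion.
    `nodes` maps node id -> source text; `graph` maps node id -> labeled edges."""
    # Stage 1: coarse "vector" search (toy score: words shared with the query).
    def score(nid):
        return len(set(query.split()) & set(nodes[nid].split()))
    candidates = sorted(nodes, key=score, reverse=True)[:top_k]

    # Stage 2: rerank candidates (a real system would use a stronger model here).
    reranked = sorted(candidates, key=score, reverse=True)[:rerank_k]

    # Stage 3: expand along graph edges, up to max_hops away from the hits.
    selected, frontier = set(reranked), list(reranked)
    for _ in range(max_hops):
        frontier = [dst for nid in frontier for _, dst in graph.get(nid, [])]
        selected.update(frontier)

    # One context string: reranked hits first, then structural dependencies.
    ordered = reranked + sorted(selected - set(reranked))
    return "\n\n".join(f"# {nid}\n{nodes[nid]}" for nid in ordered)

nodes = {
    "auth.login": "def login(user, password): check password then create session",
    "auth.hash_password": "def hash_password(pw): return pbkdf2 digest",
    "models.User": "class User: name, password_hash",
}
graph = {"auth.login": [("CALLS", "auth.hash_password"),
                        ("USES_TYPE", "models.User")]}
ctx = retrieve_context("where do we check the password", nodes, graph)
print(ctx)
```

Note that `models.User` lands in the context not because it matched the query but because the graph says `auth.login` depends on it—that is the part plain vector search misses.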
3. Maintaining the Graph (Incremental Ingest)
A perfect index is useless if it’s obsolete. Re-parsing the whole repo on every change doesn’t scale. We needed a system that updates gracefully as the code evolves.
We use a manifest system: for each file, we store a content hash (Merkle-style). On the next ingest, we only re-parse new or changed files, merging updates into the graph and shadow index. We re-embed only those specific nodes.
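A minimal sketch of the manifest diff, assuming a flat JSON-style mapping of relative path to content hash (the `*.py` filter and function names are illustrative):

```python
import hashlib
import tempfile
from pathlib import Path

def file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def diff_manifest(root: Path, manifest: dict[str, str]):
    """Compare the tree against the stored manifest. Returns
    (changed_or_new, deleted, new_manifest); only changed_or_new
    files get re-parsed and re-embedded."""
    current = {str(p.relative_to(root)): file_hash(p)
               for p in root.rglob("*.py")}
    changed = [f for f, h in current.items() if manifest.get(f) != h]
    deleted = [f for f in manifest if f not in current]
    return changed, deleted, current

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "a.py").write_text("def a(): pass")
    (root / "b.py").write_text("def b(): pass")
    changed, deleted, manifest = diff_manifest(root, {})
    print(sorted(changed))   # first ingest: every file is new
    (root / "a.py").write_text("def a(): return 1")
    changed, deleted, manifest = diff_manifest(root, manifest)
    print(changed)           # only the edited file needs re-embedding
```

Deleted files matter too: their nodes and edges must be pruned from both the graph and the shadow index, or stale context will leak into results.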
We also use a background watcher that triggers this incremental ingest after a short debounce when files change. The index stays up to date without full rebuilds.
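The debounce itself is small. This sketch omits the filesystem-event wiring (e.g., a library like watchdog) and keeps only the coalescing logic; the class name and delay are illustrative:

```python
import threading
import time

class DebouncedIngest:
    """Coalesce a burst of file-change events into one ingest call."""
    def __init__(self, ingest, delay: float = 0.5):
        self._ingest = ingest
        self._delay = delay
        self._timer: threading.Timer | None = None
        self._lock = threading.Lock()
        self._pending: set[str] = set()

    def on_change(self, path: str) -> None:
        with self._lock:
            self._pending.add(path)
            if self._timer is not None:
                self._timer.cancel()   # restart the countdown on each event
            self._timer = threading.Timer(self._delay, self._flush)
            self._timer.start()

    def _flush(self) -> None:
        with self._lock:
            paths, self._pending = self._pending, set()
        self._ingest(paths)            # one incremental ingest for the batch

calls = []
d = DebouncedIngest(calls.append, delay=0.05)
for p in ["a.py", "a.py", "b.py"]:     # a rapid burst of save events
    d.on_change(p)
time.sleep(0.2)
print(calls)                           # a single ingest covering both files
```

Restarting the timer on every event means a long editing session only triggers ingest once things go quiet, which is exactly the behavior you want from a background indexer.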
🖼️ The Graph in Action (Retrieval and Ingest)
The following diagram is a conceptual overview of the system architecture. On the left, it shows how an agent queries the index using the three-stage retrieval pipeline. On the right, it shows how the background "watcher" handles incremental ingests, keeping the graph fresh as files change.
Figure 3: Operating the Graph. Left: The Retrieval Pipeline. A user query is vector searched, reranked, and then structurally expanded (traversing edges defined in Figure 2) to build rich context. Right: Incremental Ingest. A background watcher detects file changes and uses a Merkle-style manifest to update only the modified nodes and edges in the indexes, keeping the system efficient.
4. Why it’s “context” and not just “search”
The goal isn’t only to return a list of matches. It’s to return context: the right code and the code it calls, the types it uses, and the siblings in the same class.
The graph exists to expand the semantic result set into a coherent context. Expansion trades a bit more noise for a lot more relevant context, so we cap how far it spreads (e.g., max_hops=3) to keep the result focused.
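Capping the spread amounts to a breadth-first walk with a hop budget. A sketch, using a plain adjacency dict (the graph shape here is illustrative):

```python
from collections import deque

def expand(graph: dict[str, list[str]], seeds: list[str],
           max_hops: int = 3) -> set[str]:
    """Breadth-first expansion of the semantic hits, stopping after
    max_hops edges so added context stays relevant instead of noisy."""
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:          # hop budget exhausted on this path
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen

chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"]}
print(expand(chain, ["a"], max_hops=3))   # reaches d, never e
```

With `max_hops=3`, a dependency four calls away is deliberately left out: past a few hops, code is usually background machinery rather than context the reader needs.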
The takeaway
A Code Context Graph is an index that treats your code as both meaning (vectors) and structure (a graph of calls and types).
If you’re building RAG over code, agents that need to “look it up in the repo,” or internal tools that need “relevant code + its call graph,” this pattern—semantic retrieval plus a small, scope-level graph—is a solid base. We’ve open-sourced our implementation so you can try it on your own codebase.
If you’ve built something similar, I’d love to hear how you’re doing it—drop a comment or DM.