CODE CONTEXT GRAPH
You’ve been there: a big repo, a vague question (“where do we handle auth?”), and a search that either drowns you in results or misses the one function that actually matters.
Grep finds strings. Semantic search finds similar text. But code has structure—calls, types, files, classes—and that structure is exactly what you want when you’re answering “what calls this?” or “what does this depend on?”
We wanted something that could do both: understand meaning (semantic search) and follow structure (calls, types, siblings). So we built a Code Context Graph: an index that represents your code as a graph and retrieves context by combining vector search, reranking, and graph traversal.
1. The problem with “just embed the repo”
If you embed every file (or every chunk) and search by cosine similarity, you get semantically close text. You don’t get “this function calls that one.”
We needed retrieval that could say: “Here are the most relevant pieces and here’s what they call, what types they use, and what sits next to them.”
🖼️ The Limitations of Traditional Code Search
Before jumping into the architecture, it helps to visualize the problem. Standard search tools treat a codebase like a flat stack of documents, missing the vital architectural connections that define logic.
Here is what that looks like conceptually: a developer isolated by a wall of files, unable to see the connections they need.
Figure 1: The Status Quo. A developer overwhelmed by a disconnected codebase. Simple grep and semantic search only see surface-level text (the abstract vector cloud), missing the deep structural connections (the code structure diagram on the glass) necessary for understanding.
2. What we actually built
We didn't just build a better search; we built a structured map of the codebase. Our Code Context Graph (CCG) treats code as a deeply interconnected web of logic, not a flat directory of text files.
This section breaks down how we turn raw files into a graph-aware index, and how that index powers a robust three-stage retrieval pipeline.
I. Scope-aware indexing (not just “files”)
We don’t index raw files. We parse them using Tree-sitter into a rigid scope tree: file → class → function. Each node is a real unit of code, not just a chunk of bytes.
Instead of embedding the whole function body, we embed a short "ghost text" containing the file path, the scope, and a slice of the code. Vectors are stored (e.g., in Qdrant) next to a shadow index (SQLite) holding the full content and the graph edges.
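As a sketch, assembling that "ghost text" might look like the following. The `ScopeNode` shape, field names, and separator format here are illustrative, not the actual schema:

```python
from dataclasses import dataclass

@dataclass
class ScopeNode:
    """One unit of the scope tree: a file, class, or function."""
    path: str          # file path, e.g. "src/auth/session.py"
    scope: list[str]   # enclosing scopes, e.g. ["UserAuth", "login"]
    code: str          # full source, stored in the shadow index (SQLite)

def ghost_text(node: ScopeNode, max_code_chars: int = 400) -> str:
    """Build the short text that gets embedded instead of the full body."""
    scope_path = " > ".join(node.scope) if node.scope else "<module>"
    snippet = node.code[:max_code_chars]  # a slice of the code, not all of it
    return f"{node.path} | {scope_path}\n{snippet}"

node = ScopeNode(
    path="src/auth/session.py",
    scope=["UserAuth", "login"],
    code="def login(self, user, password):\n    ...",
)
print(ghost_text(node))
```

The vector computed from this string goes to the vector store; the full `code` field stays in the shadow index, keyed by the same node id.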
II. A graph over the same nodes
While we build the scope tree, we also extract the connections: who calls whom, and which types each function uses. These become explicit CALLS and USES_TYPE edges in the graph.
This gives us one index with two views: vectors for “what’s similar?” and a graph for “what’s connected?”
🖼️ Visualizing the Indexing and Graph Construction
How does this graph look in reality? This diagram shows the relationship between the physical storage (vectors and relational data) and the conceptual model (the logical graph). You can see the distinct nodes (File, Class, Function) and the critical edges (Calls, Uses Type) connecting them.
Figure 2: Turning Code into a Graph. Raw code is parsed (Tree-sitter) into a structured hierarchy of nodes. Instead of one big vector for a whole file, we generate embeddings for specific scopes (like class UserAuth). Relationships become explicit 'CALLS' or 'USES_TYPE' edges, merging semantic similarity with structural context.
III. Three-stage retrieval + graph expansion
This structured graph changes how we retrieve context. Rather than running a single cosine-similarity search and dumping the results, we use a three-stage pipeline—vector search, reranking, then graph expansion—so we find not only the most relevant code but also the code needed to understand it.
The result is one context string, containing the most relevant code plus the structural dependencies needed to understand it.
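The pipeline can be sketched end to end as below. The word-overlap scorer is a toy stand-in for real embeddings and a cross-encoder reranker, and the parameter names (`top_k`, `rerank_k`, `max_hops`) are illustrative:

```python
def retrieve_context(query, nodes, graph, top_k=5, rerank_k=2, max_hops=1):
    """Three-stage retrieval: vector search -> rerank -> graph expansion.
    `nodes` maps node id -> source text; `graph` maps node id -> labeled edges."""
    # Stage 1: coarse "vector" search (toy score: words shared with the query).
    def score(nid):
        return len(set(query.split()) & set(nodes[nid].split()))
    candidates = sorted(nodes, key=score, reverse=True)[:top_k]

    # Stage 2: rerank candidates (a real system would use a stronger model here).
    reranked = sorted(candidates, key=score, reverse=True)[:rerank_k]

    # Stage 3: expand along graph edges, up to max_hops away from the hits.
    selected, frontier = set(reranked), list(reranked)
    for _ in range(max_hops):
        frontier = [dst for nid in frontier for _, dst in graph.get(nid, [])]
        selected.update(frontier)

    # One context string: reranked hits first, then structural dependencies.
    ordered = reranked + sorted(selected - set(reranked))
    return "\n\n".join(f"# {nid}\n{nodes[nid]}" for nid in ordered)

nodes = {
    "auth.login": "def login(user, password): check password then create session",
    "auth.hash_password": "def hash_password(pw): return pbkdf2 digest",
    "models.User": "class User: name, password_hash",
}
graph = {"auth.login": [("CALLS", "auth.hash_password"),
                        ("USES_TYPE", "models.User")]}
ctx = retrieve_context("where do we check the password", nodes, graph)
print(ctx)
```

Note that `models.User` lands in the context not because it matched the query but because the graph says `auth.login` depends on it—that is the part plain vector search misses.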
3. Maintaining the Graph (Incremental Ingest)
A perfect index is useless if it’s obsolete. Re-parsing the whole repo on every change doesn’t scale. We needed a system that updates gracefully as the code evolves.
We use a manifest system: for each file, we store a content hash (Merkle-style). On the next ingest, we only re-parse new or changed files, merging updates into the graph and shadow index. We re-embed only those specific nodes.
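A minimal sketch of the manifest diff, assuming a flat JSON-style mapping of relative path to content hash (the `*.py` filter and function names are illustrative):

```python
import hashlib
import tempfile
from pathlib import Path

def file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def diff_manifest(root: Path, manifest: dict[str, str]):
    """Compare the tree against the stored manifest. Returns
    (changed_or_new, deleted, new_manifest); only changed_or_new
    files get re-parsed and re-embedded."""
    current = {str(p.relative_to(root)): file_hash(p)
               for p in root.rglob("*.py")}
    changed = [f for f, h in current.items() if manifest.get(f) != h]
    deleted = [f for f in manifest if f not in current]
    return changed, deleted, current

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "a.py").write_text("def a(): pass")
    (root / "b.py").write_text("def b(): pass")
    changed, deleted, manifest = diff_manifest(root, {})
    print(sorted(changed))   # first ingest: every file is new
    (root / "a.py").write_text("def a(): return 1")
    changed, deleted, manifest = diff_manifest(root, manifest)
    print(changed)           # only the edited file needs re-embedding
```

Deleted files matter too: their nodes and edges must be pruned from both the graph and the shadow index, or stale context will leak into results.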
We also use a background watcher that triggers this incremental ingest after a short debounce when files change. The index stays up to date without full rebuilds.
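The debounce itself is small. This sketch omits the filesystem-event wiring (e.g., a library like watchdog) and keeps only the coalescing logic; the class name and delay are illustrative:

```python
import threading
import time

class DebouncedIngest:
    """Coalesce a burst of file-change events into one ingest call."""
    def __init__(self, ingest, delay: float = 0.5):
        self._ingest = ingest
        self._delay = delay
        self._timer: threading.Timer | None = None
        self._lock = threading.Lock()
        self._pending: set[str] = set()

    def on_change(self, path: str) -> None:
        with self._lock:
            self._pending.add(path)
            if self._timer is not None:
                self._timer.cancel()   # restart the countdown on each event
            self._timer = threading.Timer(self._delay, self._flush)
            self._timer.start()

    def _flush(self) -> None:
        with self._lock:
            paths, self._pending = self._pending, set()
        self._ingest(paths)            # one incremental ingest for the batch

calls = []
d = DebouncedIngest(calls.append, delay=0.05)
for p in ["a.py", "a.py", "b.py"]:     # a rapid burst of save events
    d.on_change(p)
time.sleep(0.2)
print(calls)                           # a single ingest covering both files
```

Restarting the timer on every event means a long editing session only triggers ingest once things go quiet, which is exactly the behavior you want from a background indexer.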
🖼️ The Graph in Action (Retrieval and Ingest)
The following diagram is a conceptual overview of the system architecture. On the left, it shows how an agent queries the index using the three-stage retrieval pipeline. On the right, it shows how the background "watcher" handles incremental ingests, keeping the graph fresh as files change.
Figure 3: Operating the Graph. Left: The Retrieval Pipeline. A user query is vector searched, reranked, and then structurally expanded (traversing edges defined in Figure 2) to build rich context. Right: Incremental Ingest. A background watcher detects file changes and uses a Merkle-style manifest to update only the modified nodes and edges in the indexes, keeping the system efficient.
4. Why it’s “context” and not just “search”
The goal isn’t only to return a list of matches. It’s to return context: the right code and the code it calls, the types it uses, and the siblings in the same class.
The graph exists to expand the semantic result set into a coherent context. Expansion trades a bit more noise for a lot more relevant context, so we cap how far it spreads (e.g., max_hops=3) to keep the result focused.
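Capping the spread amounts to a breadth-first walk with a hop budget. A sketch, using a plain adjacency dict (the graph shape here is illustrative):

```python
from collections import deque

def expand(graph: dict[str, list[str]], seeds: list[str],
           max_hops: int = 3) -> set[str]:
    """Breadth-first expansion of the semantic hits, stopping after
    max_hops edges so added context stays relevant instead of noisy."""
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:          # hop budget exhausted on this path
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen

chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"]}
print(expand(chain, ["a"], max_hops=3))   # reaches d, never e
```

With `max_hops=3`, a dependency four calls away is deliberately left out: past a few hops, code is usually background machinery rather than context the reader needs.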
The takeaway
A Code Context Graph is an index that treats your code as both meaning (vectors) and structure (a graph of calls and types).
If you’re building RAG over code, agents that need to “look it up in the repo,” or internal tools that need “relevant code + its call graph,” this pattern—semantic retrieval plus a small, scope-level graph—is a solid base. We’ve open-sourced our implementation so you can try it on your own codebase.
If you’ve built something similar, I’d love to hear how you’re doing it—drop a comment or DM.