Code Intelligence Through Structure: A Vectorless RAG-Inspired Approach for Python, R, and SQL Codebases
Using a codebase's own structure (files, functions, dependencies, embedded SQL, and so on) as a precise retrieval layer for AI-assisted code understanding.
Codebases already contain a remarkable amount of structure. Files import each other. Functions call functions. Scripts embed SQL that touches specific tables. Configurations flow into business logic. These relationships are precise, deterministic, and rich with meaning.
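To make the point concrete, here is a minimal sketch of how imports and function calls can be read straight from a Python file with the standard-library `ast` module. It illustrates the general idea rather than the system's actual parser; `SAMPLE` and `extract_structure` are hypothetical names chosen for this example.

```python
import ast

# Hypothetical sample file content, used only for illustration.
SAMPLE = '''
import os
from pathlib import Path

def load(path):
    return Path(path).read_text()

def main():
    data = load(os.getcwd())
    print(data)
'''

def extract_structure(source: str) -> dict:
    """Return imported modules and, per function, the names it calls."""
    tree = ast.parse(source)
    imports, calls = [], {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            imports.append(node.module or "")
        elif isinstance(node, ast.FunctionDef):
            names = set()
            for child in ast.walk(node):
                if isinstance(child, ast.Call):
                    func = child.func
                    if isinstance(func, ast.Name):         # e.g. load(...)
                        names.add(func.id)
                    elif isinstance(func, ast.Attribute):  # e.g. os.getcwd()
                        names.add(func.attr)
            calls[node.name] = sorted(names)
    return {"imports": sorted(imports), "calls": calls}

structure = extract_structure(SAMPLE)
print(structure)
```

The same source always yields the same result, which is the property the rest of the approach builds on. A real parser would also need to handle methods, nested functions, and dynamic calls, which this sketch ignores.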
Inspired by the idea of Vectorless RAG, I have been working on a project with the goal of making codebases easy to understand, review, document, and validate.
This article walks through what the system does, the kinds of questions it answers well, and where it's been most useful in practice.
Core Idea: Code has an inherent deterministic structure. The system treats it that way.
Functionality Overview:
At a high level, the system parses code written in Python, R, and SQL to extract and organize details such as:
The system follows a bottom-up approach: it summarizes functions and classes first, then rolls those summaries up into files and, finally, into the project as a whole. These details are stored in a structured layer in JSON format. The structured layer can be reviewed deterministically or serve as grounded context for LLM evaluation.
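A rough sketch of what a bottom-up build might look like, assuming each level's summary is composed from its children. The field names (`kind`, `summary`, `functions`, `files`) and the template-based summaries are assumptions made for this example; they are not the system's actual schema, and in practice a summary could come from an LLM rather than a format string.

```python
import json

def summarize_function(name: str, calls: list) -> dict:
    """Leaf level: describe one function from its extracted call list."""
    called = ", ".join(calls) if calls else "nothing"
    return {"kind": "function", "name": name, "calls": calls,
            "summary": f"{name} calls {called}"}

def summarize_file(path: str, functions: dict) -> dict:
    """File level: built from the function summaries beneath it."""
    children = [summarize_function(n, c) for n, c in functions.items()]
    names = ", ".join(child["name"] for child in children)
    return {"kind": "file", "path": path, "functions": children,
            "summary": f"{path} defines {len(children)} function(s): {names}"}

def summarize_project(name: str, files: dict) -> dict:
    """Project level: built from the file summaries beneath it."""
    children = [summarize_file(p, fns) for p, fns in files.items()]
    return {"kind": "project", "name": name, "files": children,
            "summary": f"{name} contains {len(children)} file(s)"}

# Hypothetical extracted structure for a tiny project.
layer = summarize_project("demo", {
    "etl.py": {"load": ["read_csv"], "main": ["load"]},
    "report.py": {"render": []},
})
print(json.dumps(layer, indent=2))
```

Because every node carries both its raw facts and a summary, a reviewer can inspect the JSON directly, and an LLM can be handed only the nodes relevant to a question.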
This structured code intelligence layer can then help answer practical questions such as:
These questions come up constantly during onboarding, code review, model validation, and impact analysis. A structural map makes each one a traversal rather than a search.
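As an illustration of "traversal rather than search", the sketch below answers an impact-analysis question ("what is affected if this function changes?") by walking a call graph backwards. The `CALLS` graph and the `impacted_by` helper are hypothetical and stand in for whatever the structured layer actually stores.

```python
from collections import deque

# Hypothetical call graph: each function maps to the functions it calls.
CALLS = {
    "main": ["load", "transform"],
    "report": ["transform"],
    "transform": ["clean"],
    "load": [],
    "clean": [],
}

def impacted_by(target: str, calls: dict) -> list:
    """Return every function that directly or indirectly calls `target`."""
    # Invert the graph so each function maps to its direct callers.
    callers = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    # Breadth-first walk up the caller chain.
    seen, queue = set(), deque([target])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return sorted(seen)

print(impacted_by("clean", CALLS))
```

The answer is exact for the graph it is given: no similarity threshold, no ranking, and the same result on every run.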
The system has a flexible AI layer. It can work with hosted foundation models, such as those from OpenAI and Anthropic, through their APIs, as well as with locally hosted models through Ollama, depending on privacy requirements, deployment constraints, and organizational policies.
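One way such a swappable layer could be structured is a small interface that every backend implements. This is only a sketch under that assumption: `LLMBackend`, `StubBackend`, and `answer` are hypothetical names, and the vendor-specific backends (which would wrap each provider's own client) are deliberately left out.

```python
from typing import Protocol

class LLMBackend(Protocol):
    """Anything that can turn a prompt into text (hosted or local)."""
    def complete(self, prompt: str) -> str: ...

class StubBackend:
    """Offline stand-in that records the prompt and returns a canned reply."""
    def __init__(self) -> None:
        self.last_prompt = ""

    def complete(self, prompt: str) -> str:
        self.last_prompt = prompt
        return "[stub answer]"

def answer(question: str, context: str, backend: LLMBackend) -> str:
    """Combine grounded context with the question and delegate to a backend."""
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return backend.complete(prompt)

backend = StubBackend()
reply = answer("What does load call?", "load calls read_csv", backend)
print(reply)
```

With this shape, choosing between a hosted API and a local model becomes a configuration decision rather than a code change in the rest of the pipeline.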
The approach is inspired by Vectorless RAG, where retrieval does not have to rely entirely on embeddings or vector databases. Instead, the system can use structured metadata, relationships, lineage, and deterministic traversal to locate relevant information. This is especially useful where access to vector databases is limited, or where teams have access to foundation models but not the full RAG infrastructure stack.
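A minimal sketch of what retrieval without embeddings can look like: an exact metadata lookup selects the relevant entries, and their summaries become the context passed to a model. The `LAYER` records, their field names, and the `retrieve_by_table` and `build_context` helpers are all hypothetical.

```python
# Hypothetical structured-layer entries; the field names are assumptions.
LAYER = [
    {"name": "load_orders", "file": "etl.py", "tables": ["orders"],
     "summary": "Reads raw orders from the orders table."},
    {"name": "build_report", "file": "report.py",
     "tables": ["orders", "customers"],
     "summary": "Joins orders with customers for the monthly report."},
    {"name": "send_email", "file": "notify.py", "tables": [],
     "summary": "Sends the finished report by email."},
]

def retrieve_by_table(layer: list, table: str) -> list:
    """Exact metadata match: no embeddings, no similarity scores."""
    return [entry for entry in layer if table in entry["tables"]]

def build_context(entries: list) -> str:
    """Format the matched entries as grounded context for a prompt."""
    return "\n".join(
        f"{e['file']}::{e['name']}: {e['summary']}" for e in entries
    )

hits = retrieve_by_table(LAYER, "customers")
context = build_context(retrieve_by_table(LAYER, "orders"))
print(context)
```

The trade-off is that the lookup only finds what the metadata names explicitly, so a question phrased in terms the structure does not capture would need another route.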
Some of the key advantages of this approach include:
Limitations
A few areas where the approach is still maturing:
The broader goal is to help teams reduce the time required for codebase understanding, onboarding, documentation, review, impact analysis, and validation.
Is this project a silver bullet? No 😊, but it does offer clear advantages in helping users get familiar with a new codebase, and it can serve as a practical guide during onboarding.
If you've worked on something similar, or if your team is exploring how to bring AI-assisted understanding to your codebase, I'd genuinely like to hear how you've approached it. And if you're curious about the tool itself, feel free to reach out.
Tested on Python and SQL codebases using Ollama with Qwen2.5-Coder.
Coding Assistance Used: Codex & Claude