Building a Text-to-SQL RAG System with GPT-4o and ChromaDB

I built a Text-to-SQL RAG system from scratch, and it genuinely surprised me how much the retrieval step matters.

The idea: type a plain English question, get back the right SQL query and the actual results. No schema memorisation, no manual query writing.

Here's how it works under the hood:

→ Schema indexing (offline)
I extract every table, column, data type, and foreign key from MySQL's INFORMATION_SCHEMA, plus a few sample rows from each table. Each table becomes a rich text document that gets embedded and stored in ChromaDB.

→ Query time (online)
When you ask a question, it gets embedded with the same model, and cosine similarity retrieves the most relevant tables. Those schema docs go into a structured prompt alongside the question, and GPT-4o generates the SQL at temperature=0 (as close to deterministic as the model gets, which is crucial for SQL).

→ Two safety layers
A keyword blocklist catches dangerous operations (DROP, DELETE, etc.) before execution. A read-only MySQL user enforces it at the database level, so even a prompt injection can't modify or delete data.

Stack: Python · OpenAI GPT-4o · ChromaDB · MySQL · text-embedding-3-small

Key insight I didn't expect: the quality of your schema document matters more than the LLM. A table description with column types + foreign keys + 3 sample rows retrieves dramatically better than just a list of column names.

Full code on GitHub (link in comments). Happy to answer questions about the design.

#MachineLearning #Python #SQL #RAG #LLM #DataEngineering #OpenAI #PortfolioProject
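For anyone who wants to picture two of the pieces above, here is a minimal sketch. It is not the code from the repo. The function names (`build_schema_doc`, `is_safe_sql`), the document layout, and every blocklist entry beyond DROP and DELETE are assumptions made for illustration. The check is also stricter than a plain blocklist: it only accepts a single SELECT statement.

```python
import re

# Hypothetical blocklist: the post names DROP and DELETE; the rest are assumed.
BLOCKED_KEYWORDS = ("DROP", "DELETE", "TRUNCATE", "ALTER", "UPDATE", "INSERT", "GRANT")
_BLOCKED_RE = re.compile(r"\b(" + "|".join(BLOCKED_KEYWORDS) + r")\b", re.IGNORECASE)


def build_schema_doc(table, columns, foreign_keys, sample_rows):
    """Render one table as a text document ready to be embedded.

    columns:      list of (column_name, data_type) pairs
    foreign_keys: list of (column_name, "other_table.column") pairs
    sample_rows:  list of row tuples; only the first three are kept
    """
    lines = [f"Table: {table}", "Columns:"]
    lines += [f"  - {name} ({dtype})" for name, dtype in columns]
    if foreign_keys:
        lines.append("Foreign keys:")
        lines += [f"  - {col} -> {ref}" for col, ref in foreign_keys]
    if sample_rows:
        lines.append("Sample rows:")
        lines += [f"  - {row}" for row in sample_rows[:3]]
    return "\n".join(lines)


def is_safe_sql(sql):
    """Return True only for a single SELECT statement with no blocked keyword."""
    statement = sql.strip().rstrip(";")
    if ";" in statement:  # reject stacked statements such as "SELECT 1; DROP ..."
        return False
    if not statement.upper().startswith("SELECT"):
        return False
    # Whole-word match, so a column like updated_at is not mistaken for UPDATE.
    return _BLOCKED_RE.search(statement) is None
```

The keyword match is deliberately conservative: a harmless query that mentions a blocked word inside a string literal gets rejected too. That trade-off is tolerable because the read-only database user remains the real enforcement layer.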

GitHub repo: https://github.com/AyushSingh-916/RAG-SQL-project Built as part of my portfolio targeting Data/Quant Analyst roles. Stack: Python · GPT-4o · ChromaDB · MySQL. Open to feedback!

Hey brother, how do you implement all this? Do you use vibe coding tools like Google AI Studio or Antigravity? Or do you have to write the whole code line by line yourself?

Nice one! We have been working on a similar application as well. Introducing the concept of "golden queries" (i.e. a question/SQL pair) seems to have helped a lot. It allows you to force a ground truth of sorts!
