I built a Text-to-SQL RAG system from scratch, and I was genuinely surprised by how much the retrieval step matters.

The idea: type a plain-English question, get back the right SQL query and the actual results. No schema memorisation, no manual query writing.

Here's how it works under the hood:

→ Schema indexing (offline)
I extract every table, column, data type, and foreign key from MySQL's INFORMATION_SCHEMA, plus a few sample rows from each table. Each table becomes a rich text document that gets embedded and stored in ChromaDB.

→ Query time (online)
When you ask a question, it gets embedded with the same model, and cosine similarity retrieves the most relevant tables. Those schema docs go into a structured prompt alongside the question, and GPT-4o generates the SQL at temperature=0 (as close to deterministic as the API gets, which is crucial for SQL).

→ Two safety layers
A keyword blocklist catches dangerous operations (DROP, DELETE, etc.) before execution. A read-only MySQL user enforces it at the database level, so even a prompt injection that slips past the blocklist can't modify or delete data.

Stack: Python · OpenAI GPT-4o · ChromaDB · MySQL · text-embedding-3-small

Key insight I didn't expect: the quality of your schema document matters more than the LLM. A table description with column types + foreign keys + 3 sample rows retrieves dramatically better than just a list of column names.

Full code on GitHub (link in comments). Happy to answer questions about the design.

#MachineLearning #Python #SQL #RAG #LLM #DataEngineering #OpenAI #PortfolioProject
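For anyone who wants a feel for two of the pieces above without opening the repo, here is a minimal sketch of what a per-table schema document and the keyword blocklist could look like. It is not the code from the repo. The `ORDERS` metadata, the function names `build_schema_doc` and `is_safe_sql`, the extra blocked keywords beyond DROP and DELETE, and the SELECT-only rule are all assumptions made for illustration.

```python
import re

# Hypothetical table metadata, shaped like what one might collect from
# INFORMATION_SCHEMA plus a small "SELECT * ... LIMIT 3" per table.
ORDERS = {
    "table": "orders",
    "columns": [("id", "int"), ("customer_id", "int"), ("total", "decimal(10,2)")],
    "foreign_keys": [("customer_id", "customers", "id")],
    "sample_rows": [(1, 42, "19.99"), (2, 7, "250.00"), (3, 42, "5.50")],
}


def build_schema_doc(meta):
    """Render one table as the text document that would be embedded."""
    lines = [f"Table: {meta['table']}", "Columns:"]
    lines += [f"  - {name} ({dtype})" for name, dtype in meta["columns"]]
    if meta["foreign_keys"]:
        lines.append("Foreign keys:")
        lines += [
            f"  - {col} -> {ref_table}.{ref_col}"
            for col, ref_table, ref_col in meta["foreign_keys"]
        ]
    lines.append("Sample rows:")
    lines += [f"  {row}" for row in meta["sample_rows"][:3]]
    return "\n".join(lines)


# Assumed blocklist: the post only names DROP and DELETE, the rest is a guess.
BLOCKED = ("DROP", "DELETE", "UPDATE", "INSERT", "ALTER", "TRUNCATE", "GRANT", "CREATE")
_BLOCKED_RE = re.compile(r"\b(" + "|".join(BLOCKED) + r")\b", re.IGNORECASE)


def is_safe_sql(sql):
    """Return True only for a single SELECT statement with no blocked keyword.

    Deliberately strict: a blocked word inside a string literal is rejected
    too, which can produce false positives.
    """
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # more than one statement
        return False
    if not stripped.upper().startswith("SELECT"):
        return False
    return _BLOCKED_RE.search(stripped) is None


if __name__ == "__main__":
    print(build_schema_doc(ORDERS))
    print(is_safe_sql("SELECT * FROM orders WHERE total > 100;"))
    print(is_safe_sql("SELECT 1; DELETE FROM orders"))
```

The word-boundary match means a column like `created_at` does not trip the `CREATE` rule. A check like this is only the first layer; the read-only database user is what actually prevents writes.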
Hey brother, how do you implement all this? Do you use vibe-coding tools like Google AI Studio or Antigravity, or do you have to write all the code line by line yourself?
Nice one! We have been working on a similar application as well. Introducing the concept of "golden queries" (i.e. question/SQL pairs) seems to have helped a lot. It lets you enforce a ground truth of sorts!
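A rough sketch of how golden queries might be wired in as few-shot examples, purely for illustration. The example pairs and the names `GOLDEN_QUERIES`, `top_golden` and `few_shot_block` are hypothetical, and `difflib` string similarity is used here only as a dependency-free stand-in for the embedding similarity a real system would likely use.

```python
from difflib import SequenceMatcher

# Hypothetical golden queries: vetted (question, SQL) pairs.
GOLDEN_QUERIES = [
    ("How many orders did each customer place?",
     "SELECT customer_id, COUNT(*) AS n_orders FROM orders GROUP BY customer_id"),
    ("What is the total revenue?",
     "SELECT SUM(total) AS revenue FROM orders"),
]


def top_golden(question, k=1):
    """Return the k golden pairs whose question text is most similar."""
    def score(pair):
        return SequenceMatcher(None, question.lower(), pair[0].lower()).ratio()
    return sorted(GOLDEN_QUERIES, key=score, reverse=True)[:k]


def few_shot_block(question, k=1):
    """Format the closest golden pairs as few-shot examples for the prompt."""
    return "\n\n".join(f"Q: {q}\nSQL: {sql}" for q, sql in top_golden(question, k))


if __name__ == "__main__":
    print(few_shot_block("What is the total revenue this year?"))
```

The formatted block would sit in the prompt next to the retrieved schema documents, so the model sees a known-good query for a similar question.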
That's awesome
GitHub repo: https://github.com/AyushSingh-916/RAG-SQL-project Built as part of my portfolio targeting Data/Quant Analyst roles. Stack: Python · GPT-4o · ChromaDB · MySQL. Open to feedback!