Building a Text-to-SQL RAG System with GPT-4o and ChromaDB

I built a Text-to-SQL RAG system from scratch, and it genuinely surprised me how much the retrieval step matters.

The idea: type a plain English question, get back the right SQL query and the actual results. No schema memorisation, no manual query writing.

Here's how it works under the hood:

→ Schema indexing (offline)
I extract every table, column, data type, and foreign key from MySQL's INFORMATION_SCHEMA, plus a few sample rows from each table. Each table becomes a rich text document that gets embedded and stored in ChromaDB.

→ Query time (online)
When you ask a question, it gets embedded with the same model, and cosine similarity retrieves the most relevant tables. Those schema docs go into a structured prompt alongside the question, and GPT-4o generates the SQL at temperature=0 (as close to deterministic as the API gets, which is crucial for SQL).

→ Two safety layers
A keyword blocklist catches dangerous operations (DROP, DELETE, etc.) before execution. A read-only MySQL user enforces it at the database level, so even a prompt injection that slips past the blocklist can't modify or delete data.

Stack: Python · OpenAI GPT-4o · ChromaDB · MySQL · text-embedding-3-small

Key insight I didn't expect: the quality of your schema document matters more than the LLM. A table description with column types + foreign keys + 3 sample rows retrieves dramatically better than just a list of column names.

Full code on GitHub (link in comments). Happy to answer questions about the design.

#MachineLearning #Python #SQL #RAG #LLM #DataEngineering #OpenAI #PortfolioProject
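For anyone who wants to see the shape of the pipeline without a database or API key, here is a minimal sketch of the three ideas in the post: one schema document per table, similarity-based retrieval, and the keyword blocklist. Everything in it is an assumption made for illustration and is not taken from the repo: the `SCHEMA` dictionary stands in for what would come out of INFORMATION_SCHEMA, `toy_embed` is a bag-of-words stand-in for text-embedding-3-small and ChromaDB, and the blocked keyword list is just one plausible choice.

```python
import math
import re
from collections import Counter

# Hypothetical metadata, standing in for what the real system reads from
# MySQL's INFORMATION_SCHEMA plus a few sample rows per table.
SCHEMA = {
    "customers": {
        "columns": {"id": "INT", "name": "VARCHAR(100)", "country": "VARCHAR(50)"},
        "foreign_keys": [],
        "sample_rows": [(1, "Asha", "IN"), (2, "Tom", "UK")],
    },
    "orders": {
        "columns": {"id": "INT", "customer_id": "INT", "total": "DECIMAL(10,2)"},
        "foreign_keys": ["customer_id -> customers.id"],
        "sample_rows": [(10, 1, "99.50"), (11, 2, "15.00")],
    },
}


def build_schema_doc(table, meta):
    """Render one table as a text document: columns, types, FKs, up to 3 sample rows."""
    cols = ", ".join(f"{col} {typ}" for col, typ in meta["columns"].items())
    fks = "; ".join(meta["foreign_keys"]) or "none"
    rows = "; ".join(str(row) for row in meta["sample_rows"][:3])
    return f"Table {table}. Columns: {cols}. Foreign keys: {fks}. Sample rows: {rows}."


def toy_embed(text):
    """Bag-of-words counts. A stand-in only; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, docs, k=1):
    """Return the names of the k tables whose schema docs best match the question."""
    q_vec = toy_embed(question)
    ranked = sorted(docs, key=lambda name: cosine(q_vec, toy_embed(docs[name])), reverse=True)
    return ranked[:k]


# First safety layer: reject statements containing write/DDL keywords.
BLOCKED = re.compile(r"\b(DROP|DELETE|UPDATE|INSERT|ALTER|TRUNCATE|GRANT)\b", re.IGNORECASE)


def is_safe(sql):
    """True if no blocked keyword appears as a whole word in the SQL text."""
    return BLOCKED.search(sql) is None


DOCS = {name: build_schema_doc(name, meta) for name, meta in SCHEMA.items()}
```

Calling `retrieve("total of orders per customer_id", DOCS)` picks the `orders` document, and `is_safe` rejects anything containing a blocked keyword. A keyword check like this is easy to get around, which is why the read-only database user remains the layer that actually holds.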

Ayush Singh 3w

GitHub repo: https://github.com/AyushSingh-916/RAG-SQL-project Built as part of my portfolio targeting Data/Quant Analyst roles. Stack: Python · GPT-4o · ChromaDB · MySQL. Open to feedback!

Shishir Mohan Nigam 2w

Hey brother, how do you implement all this? Do you use Vibe Coding tools like Google AIStudio or Antigravity? Or like you have to write the whole code line by line yourself?

Swetank Poddar 3w

Nice one! We have been working on a similar application as well. Introducing the concept of "golden queries" (i.e. a question/SQL pair) seems to have helped a lot. It allows you to force a ground truth of sorts!

Mohit Sen 2w

That's awesome

More Relevant Posts

Sayan Dutta
3w Edited
TerSQL v0.0.2 (beta) is live, and this is where things start getting serious. 🚀

What began as a better MySQL terminal is now evolving into something bigger:
👉 A SQL interface built for *humans*, not just developers.

🧠 The problem hasn’t changed:
Databases are powerful, but interacting with them is still painful.
• Beginners struggle with syntax
• Developers waste time debugging queries
• One mistake can still break things

⚡ What’s new in v0.0.2 (beta)
This update focuses on **making databases more intuitive, not just more powerful**:

✨ Natural-language style queries
→ Type: *“show top 5 users”*
→ TerSQL auto-corrects to real SQL

🧩 Modular architecture
→ Clean pipeline: NLP → Core → Plugin Router → DB
→ Designed for extensibility across multiple databases

🌐 Multi-database support
→ MySQL · PostgreSQL · MongoDB

🛡️ Improved safety layer
→ Query validation + guardrails before execution

🎯 Interactive demo + full landing page
→ Visualise how queries transform and execute

🧠 What makes TerSQL different?
This is NOT:
❌ Another database
❌ Another GUI client
It’s an **interaction layer** on top of your existing database. No migration. No complexity. Just a better way to work with data.

🔮 Where this is going
TerSQL is moving toward:
→ AI-assisted query generation
→ Query explanation (human-readable)
→ Smarter error correction
→ Developer + beginner unified experience

💡 Why I’m building this
I don’t think databases should feel intimidating. If you can *think it*, you should be able to *query it*.

🌐 Try it out
Live: https://lnkd.in/gxbpNz5j
GitHub: https://lnkd.in/g2x5sSTp
If you find it interesting, a ⭐ would mean a lot.

💬 I’d love your thoughts: Would you actually use natural language for querying databases? Or do you still prefer raw SQL?

#opensource #ai #sql #python #developerexperience #devtools #databases #buildinpublic #systemdesign #machinelearning #backend #programming #techinnovation
Manikandan Palanisamy
1mo
🚀 Built a MySQL MCP Server with Natural Language Querying

I recently built a MySQL MCP (Model Context Protocol) server using Python that allows AI to interact with databases using plain English.

💡 What this means:
You can ask questions like:
👉 "Show last 10 orders"
👉 "Get top customers by revenue"
…and the system automatically converts it into SQL and fetches results.

🔧 Key Features:
• Natural Language → SQL
• Secure Read-only Query Mode
• Schema Exploration (tables, columns)
• Plug & Play with Claude Desktop
• Configurable via .env
• Fully tested end-to-end

🧠 Architecture:
AI Client → MCP Server → MySQL Database

⚙️ Tech Stack:
Python | MCP | MySQL | mysql-connector | dotenv | Claude Desktop

🔥 Why this matters:
This bridges the gap between AI + Databases, making data accessible even for non-technical users.

📌 Next steps:
Planning to extend this with:
• Query optimization
• Role-based access
• Multi-database support

🔗 GitHub: https://lnkd.in/g7RgQrdd

#MCP #Python #MySQL #AI #LLM #OpenSource #BuildInPublic #DatabaseAI
SHREYASHI SHARMA
5d Edited
9 ways you can read data in Pandas (and instantly level up your data workflow):

Most people focus on models and algorithms, but the real edge often comes from how efficiently you can bring data in. Here are 9 essential formats you should be comfortable with:

🔹 CSV (.csv)
The most common format: simple, fast, and everywhere.
Use: pd.read_csv()

🔹 Excel (.xlsx, .xls)
Widely used in business for reports and multi-sheet data.
Use: pd.read_excel()

🔹 JSON (.json)
Perfect for API responses and semi-structured data.
Use: pd.read_json()

🔹 SQL Databases
Pull data directly from databases like MySQL or PostgreSQL.
Use: pd.read_sql()

🔹 Parquet (.parquet)
Efficient, compressed, and built for big data workflows.
Use: pd.read_parquet()

🔹 Feather (.feather)
Optimized for fast read/write between Python environments.
Use: pd.read_feather()

🔹 HTML Tables
Extract tables directly from websites.
Use: pd.read_html()

🔹 Pickle (.pkl)
Quickly store and load Python objects.
Use: pd.read_pickle()

🔹 Text Files (.txt)
Flexible format with custom delimiters (tabs, pipes, etc.).
Use: pd.read_csv(sep='\t')

Why this matters: The faster you can load data, the faster you can analyze, model, and deliver impact. Strong data professionals don’t just analyze data; they know exactly how to access it.

#DataScience #Pandas #Python #DataAnalytics #MachineLearning #DataEngineering #IT #Growth #SQLDATABASE #HTML #TABLE #DataPreprocessing
Saurabh Gupta
2w
If you’re still building SQL queries using string concatenation… you’re making your life harder than it needs to be.

Not because SQL is bad, but because treating queries like strings is an engineering liability. It works in dev. It breaks in production.

Developers are still duct-taping raw queries together like this:
"SELECT * FROM users WHERE age > " + str(user_input)

If your queries depend on + str(user_input), you’re not just writing brittle code; you’re opening the door to bugs and injection risks.

On the flip side, bringing in a massive ORM just to handle a few complex joins is severe overkill.

I’ve been there:
• Debugging messy query strings
• Chasing silent bugs
• Rewriting the same logic again and again

You need a middle ground. That’s where PyPika comes in.

PyPika is a pure SQL query builder that sits in the sweet spot and gives you structure without losing control:
✅ Writes in pure Python
✅ Supports parameterized inputs (safer, avoids injection issues)
✅ Makes queries highly composable (letting FastAPI and Pydantic handle the rest)

I broke down exactly why this tool is a massive upgrade over raw strings and when you should (and shouldn't) use it. Breakdown in the carousel 👇

Curious: how are you handling dynamic SQL today?

#Python #SQL #DataScience #DataEngineering #BackendEngineering #SoftwareArchitecture #TechTips
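To make the concatenation risk concrete, here is a small sketch that runs the same filter two ways. It is not PyPika: it uses Python's built-in sqlite3 module with a `?` placeholder, so it runs without installing anything. The `users` table, its rows, and both function names are made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("ana", 17), ("bo", 30), ("cy", 45)],
)


def users_older_than_unsafe(conn, user_input):
    """Concatenates the input into the SQL text, so the input can rewrite the query."""
    sql = "SELECT name FROM users WHERE age > " + str(user_input) + " ORDER BY name"
    return [name for (name,) in conn.execute(sql)]


def users_older_than(conn, user_input):
    """Binds the input as a parameter, so it is only ever treated as a value."""
    sql = "SELECT name FROM users WHERE age > ? ORDER BY name"
    return [name for (name,) in conn.execute(sql, (user_input,))]
```

With the input `"18 OR 1=1"`, the concatenated version returns every user because the input becomes part of the WHERE clause, while the bound version compares `age` against that text as a single value. A query builder adds composability on top of this, but binding values is the part that closes the injection hole.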

Maxwell Hiamatsu
3w
A few months ago, I barely knew what an ORM was. Today I'm designing relational databases from scratch and querying them with raw SQL like it's second nature. Here's what I've been building 👇

🛠️ The Project
A full data modelling and SQL project built in Python: designing schemas, seeding realistic test data, and running analytical queries against a live PostgreSQL database.

📐 The Stack
→ SQLAlchemy ORM to define clean, Pythonic relational models
→ PostgreSQL as the database engine (running locally via Docker)
→ pgcli for a smoother terminal querying experience with syntax highlighting and autocomplete
→ Claude Code inside VS Code as my AI pair programmer

🗂️ The Schema
I modelled four core entities and their relationships:
• Users → with emails, names, and timestamps
• Addresses → linked to users via foreign key, with a default flag
• Products → with categories, pricing, stock, and unique SKUs
• Orders → tying it all together

The thing nobody tells you about data engineering: the modelling decisions you make early ripple through everything downstream. Get the foreign keys wrong and your joins become a nightmare. I learned that the hard way, which is honestly the best way.

💡 Key Takeaways
→ SQLAlchemy keeps your schema readable and maintainable without writing raw DDL
→ pgcli makes working in the terminal genuinely enjoyable
→ Thinking carefully about entity relationships before writing a single line of code saves you hours of refactoring later
→ Seeding realistic synthetic data early forces you to stress-test your schema assumptions

📍 What's Next
Layering in complex analytical queries, exploring how this data model feeds into a broader pipeline, and eventually connecting it to a transformation layer with dbt.

Always building. The fundamentals matter more than the frameworks. 🚀

#DataEngineering #SQL #Python #SQLAlchemy #PostgreSQL #LearningInPublic #BuildInPublic #MachineLearning #DataScience #CareerJourney
Odos Matthews
4w
In my last post, I introduced PydanTable: Pydantic-native tables, lazy transforms, and Rust-backed execution. Now, let's explore the next layer: SQL.

Many data tools follow a familiar pattern for SQL sources: pulling rows into Python, transforming them, and then writing them somewhere else. While this approach works, it becomes cumbersome when dealing with large datasets or when your write target is the same database from which you read. The process of “extracting everything locally” can feel more like a burden than a benefit.

PydanTable now offers an optional SQL execution path, allowing you to keep transformations within the database as long as the engine supports them. You only materialize data when you actually need it on the Python side. This shifts the paradigm from classic ETL (Extract, Transform locally, Load) to a more efficient TEL: Transform in SQL, extract locally when needed, then load.

The primary advantage is operational efficiency. When your load target is on the same SQL server, you can often bypass the costly step of transferring the entire result set through the application, enabling a direct transition from transformation to loading, with the server handling the heavy lifting.

This approach also indicates our future direction: a more intelligent execution strategy for PydanTable. The planner will optimize work on the read side when it is safe and efficient, selecting the best compute resources rather than defaulting to local resources or a single engine that may not be ideal for the task.

On the roadmap, we have plans for a MongoDB engine to allow aggregation to remain on the server before extraction or writing back, as well as a PySpark engine that introduces strong typing to traditional Spark-style operations.

I am excited to continue advancing PydanTable beyond merely “strongly typed dataframes” toward strong typing where the data already resides.

#DataEngineering #Python #OpenSource #SQL #ETL

Chandrika Tadikonda
1mo
BLOG 11: SQL JOINS Explained with Examples

In this blog, I explained how to combine data from multiple tables using SQL JOINS.

Topics covered:
✔ INNER JOIN
✔ LEFT JOIN
✔ RIGHT JOIN
✔ FULL OUTER JOIN
✔ Difference between JOIN types
✔ Practical examples using tables

SQL JOINS are one of the most important concepts in databases and are widely used in real-world data analysis.

Read here: https://lnkd.in/gYbAvrT9

Grateful to Innomatics Research Labs for providing practical exposure and structured learning. Excited to continue building strong foundations in SQL, Data Analytics, and Data Science.

Special thanks to the team for their guidance and support:
Co-Founder & CEO – Kalpana Katiki Reddy
Regional Head – VAMSI KRISHNA KANAGALA
Trainer – Swathi Reddy Thatikonda Abhilash Manikanta
Mentors: Gogula Vinay Koduri Srihari Dinesh Bodigadla Rahul Janjirala
Program Manager – Raghu Ram Aduri
Placement Team: Sigilipelli Yeshwanth Sravani Burma Rishita Bhargavi K Eswarkarthic M

SQL | Python | Pandas | Data Analytics | Statistics

#SQL #DataAnalytics #Database #LearningJourney #InnomaticsResearchLabs #CareerGrowth #Beginner #Portfolio #100DaysOfLearning

SQL JOINS Explained with Examples (INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN) medium.com
Jayaraman R
5d
Every data beginner hits this wall: “Should I learn SQL or Pandas?”

I wasted a week thinking it was a choice. Until one conversation changed everything.

Here’s the mental model that made it click. Think of it like a kitchen:

SQL = Storage room
→ Everything lives here
→ Structured, organized, built for scale

Pandas = Prep table
→ Bring what you need
→ Slice, transform, experiment freely

A chef doesn’t choose between them. They use both, at the right moment.

Reach for SQL when:
✔ Data lives in a database
✔ You’re joining multiple tables
✔ Working with millions of rows
✔ Need automated, repeatable queries

Reach for Pandas when:
✔ Data is CSV / Excel
✔ You’re exploring & experimenting
✔ Quick transformations / EDA
✔ Building logic on top of Python

My workflow now:
→ SQL to extract & prepare
→ Pandas to analyze & explore

Same problems. Different strengths. Zero conflict.

The real skill nobody teaches: Not perfect SQL syntax. Not memorizing Pandas functions. Knowing which tool to use, and why. That’s what separates beginners from analysts.

Share this with someone stuck in the “SQL vs Python” debate.

#SQL #Python #Pandas #DataAnalytics #SqlVsPython #LearningInPublic #AspiringDataAnalyst #TechCareer
Harshit Raj
2w
I recently completed a project focused on Data Cleaning and Validation using a hybrid SQL and Python stack. Validating critical identifiers like the Indian PAN (Permanent Account Number) is essential for maintaining data integrity in financial and identity management systems.

🛠️ The Architecture
The project follows a structured pipeline:
• Extraction: Loading raw datasets from Excel using Pandas.
• Integration: Moving data into a MySQL environment via mysql-connector-python.
• Cleaning: Using SQL to handle NULL values, remove duplicates, and standardize formatting (TRIM/UPPER).
• Validation: Implementing complex logic through Regular Expressions (Regex) and Custom SQL Functions.

💡 Key Technical Highlights
• Pattern Matching: Used REGEXP '^[A-Z]{5}[0-9]{4}[A-Z]$' to enforce the 10-character legal format.
• Custom Logic: Developed SQL functions to detect adjacent repeating characters and sequential patterns (like 'ABCDE' or '1234') that often indicate placeholder or fraudulent data.
• Efficient Querying: Leveraged Common Table Expressions (CTEs) to create a clean, distinct dataset before running final validation checks.

📈 Final Output
The pipeline categorizes each entry as 'valid' or 'invalid' based on the rigorous rule set, providing a ready-to-use dataset for real-world analytics workflows.

This project was a fantastic way to bridge the gap between Python's automation capabilities and SQL's powerful data manipulation strengths.

#SQL #Python #DataAnalytics #DataCleaning #MySQL #DataValidation #DataEngineering
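The validation rules above can be sketched in a few lines of Python. This is an illustration only, not the project's SQL functions: the function names are hypothetical, the sample PAN values are made up, and applying the repeat and sequence checks separately to the five-letter block and the four-digit block is an assumption about how the rules are scoped.

```python
import re

# Same pattern as the post's REGEXP: 5 letters, 4 digits, 1 letter.
PAN_PATTERN = re.compile(r"^[A-Z]{5}[0-9]{4}[A-Z]$")


def has_adjacent_repeat(chunk):
    """True if any character is immediately followed by the same character."""
    return any(a == b for a, b in zip(chunk, chunk[1:]))


def is_sequential(chunk):
    """True if every character is exactly one step after the previous (ABCDE, 1234)."""
    return all(ord(b) - ord(a) == 1 for a, b in zip(chunk, chunk[1:]))


def classify_pan(raw):
    """Clean with strip/upper (mirroring TRIM/UPPER), then label 'valid' or 'invalid'."""
    pan = raw.strip().upper()
    if not PAN_PATTERN.fullmatch(pan):
        return "invalid"
    letters, digits = pan[:5], pan[5:9]  # assumed scoping of the extra checks
    if has_adjacent_repeat(letters) or has_adjacent_repeat(digits):
        return "invalid"
    if is_sequential(letters) or is_sequential(digits):
        return "invalid"
    return "valid"
```

For example, `classify_pan(" akzpd4827m ")` passes after cleaning, while `classify_pan("ABCDE1234F")` is rejected by the sequence check even though it matches the format.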
Ibrahim Fadhili
1w
Most data analysts on my team spent more time writing SQL than actually analysing data. So I built a fix, without touching our existing Superset setup.

It's called a Text-to-SQL Sidecar: a standalone FastAPI microservice that sits alongside Apache Superset and turns plain English into validated, safe SQL.

You ask: "which products had the highest return rate last quarter?" It generates, validates, and executes the SQL, then hands the results back.

A few things I was deliberate about:
→ AST-level SQL validation (not string matching, which is trivially bypassable)
→ Per-database table allowlists so the LLM can only touch what it's supposed to
→ Schema caching so we're not hammering the DB on every request
→ LLM-agnostic design: swap the endpoint URL, change the model
→ Reasoning traces returned alongside SQL so analysts can actually trust the output

Superset never needs to know it exists. It just receives SQL.

I wrote up the full implementation: architecture, code walkthrough, and the design decisions that make it production-ready. Link in the comments 👇

#DataEngineering #AI #SQL #FastAPI #ApacheSuperset #LLM #Python

Building a scalable Text-to-SQL Sidecar for Apache Superset medium.com
