Designing Relational Databases with SQLAlchemy and PostgreSQL

A few months ago, I barely knew what an ORM was. Today I'm designing relational databases from scratch and querying them with raw SQL like it's second nature. Here's what I've been building 👇

🛠️ The Project
A full data modelling and SQL project built in Python — designing schemas, seeding realistic test data, and running analytical queries against a live PostgreSQL database.

📐 The Stack
→ SQLAlchemy ORM to define clean, Pythonic relational models
→ PostgreSQL as the database engine (running locally via Docker)
→ pgcli for a smoother terminal querying experience with syntax highlighting and autocomplete
→ Claude Code inside VS Code as my AI pair programmer

🗂️ The Schema
I modelled four core entities and their relationships (a minimal code sketch follows below):
• Users → with emails, names, and timestamps
• Addresses → linked to users via foreign key, with a default flag
• Products → with categories, pricing, stock, and unique SKUs
• Orders → tying it all together

The thing nobody tells you about data engineering: the modelling decisions you make early ripple through everything downstream. Get the foreign keys wrong and your joins become a nightmare. I learned that the hard way — which is honestly the best way.

💡 Key Takeaways
→ SQLAlchemy keeps your schema readable and maintainable without writing raw DDL
→ pgcli makes working in the terminal genuinely enjoyable
→ Thinking carefully about entity relationships before writing a single line of code saves you hours of refactoring later
→ Seeding realistic synthetic data early forces you to stress-test your schema assumptions

📍 What's Next
Layering in complex analytical queries, exploring how this data model feeds into a broader pipeline, and eventually connecting it to a transformation layer with dbt.

Always building. The fundamentals matter more than the frameworks. 🚀

#DataEngineering #SQL #Python #SQLAlchemy #PostgreSQL #LearningInPublic #BuildInPublic #MachineLearning #DataScience #CareerJourney
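For anyone curious what that schema looks like in code, here's a minimal SQLAlchemy sketch. Column names, types, and the relationship wiring are illustrative assumptions, not the exact project models:

    from sqlalchemy import (Boolean, Column, DateTime, ForeignKey, Integer,
                            Numeric, String, func)
    from sqlalchemy.orm import declarative_base, relationship

    Base = declarative_base()

    class User(Base):
        __tablename__ = "users"
        id = Column(Integer, primary_key=True)
        email = Column(String, unique=True, nullable=False)
        name = Column(String, nullable=False)
        created_at = Column(DateTime, server_default=func.now())
        addresses = relationship("Address", back_populates="user")

    class Address(Base):
        __tablename__ = "addresses"
        id = Column(Integer, primary_key=True)
        user_id = Column(Integer, ForeignKey("users.id"), nullable=False)  # FK back to users
        is_default = Column(Boolean, default=False)                        # the "default flag"
        user = relationship("User", back_populates="addresses")

    class Product(Base):
        __tablename__ = "products"
        id = Column(Integer, primary_key=True)
        sku = Column(String, unique=True, nullable=False)   # unique SKU per product
        category = Column(String)
        price = Column(Numeric(10, 2), nullable=False)
        stock = Column(Integer, default=0)

    class Order(Base):
        __tablename__ = "orders"
        id = Column(Integer, primary_key=True)
        user_id = Column(Integer, ForeignKey("users.id"), nullable=False)
        product_id = Column(Integer, ForeignKey("products.id"), nullable=False)
        quantity = Column(Integer, default=1)
        created_at = Column(DateTime, server_default=func.now())

From there, Base.metadata.create_all(engine) is enough to materialise the tables in the Dockerised Postgres instance.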
More Relevant Posts
Built something new last weekend: dbt-vectorize

It turns dbt models into semantic search indexes in Postgres. The idea is simple: instead of running separate pipelines for embeddings, you define everything in dbt and run:

    dbt-vectorize build --select my_model

Then query it:

    dbt-vectorize search --select my_model --query "oauth callback issues"

Example:

    dbt-vectorize build --select vector_knowledge_base --profile default --target dev
    … embedded 30 rows into public.knowledge_base

Search:

    dbt-vectorize search --select vector_knowledge_base --query "oauth callback issues" --top-k 3 --include-distance

Example search output below ↓

Under the hood:
- dbt builds the dataset
- embeddings are generated (Rust, no Python required)
- vectors are stored in Postgres via pgvector
- search runs nearest-neighbor lookup directly in SQL

Why I built this: embedding workflows often live outside the data stack (Python scripts, jobs, etc.). This keeps it closer to dbt: define once → build → search

Still early, but already useful for:
- semantic search over internal data
- lightweight RAG pipelines
- similarity search

Curious if others are trying to bring embeddings into dbt workflows or keeping them separate.

And same energy for next week: continue volleyball practices, continue builds. 🏐

#Rust #DataEngineering #dbt #PostgreSQL #BuildInPublic
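Not part of dbt-vectorize itself, but for context: the "nearest-neighbor lookup directly in SQL" step typically looks like this with pgvector, called from Python. The connection string, the embedding literal, and the column names are placeholder assumptions:

    import psycopg2  # assumes Postgres has the pgvector extension installed

    conn = psycopg2.connect("dbname=analytics user=readonly")
    query_embedding = "[0.12, -0.03, 0.88]"  # placeholder vector from your embedding model

    with conn.cursor() as cur:
        # <-> is pgvector's distance operator; ORDER BY ... LIMIT k = top-k nearest neighbours
        cur.execute(
            """
            SELECT id, content, embedding <-> %s::vector AS distance
            FROM knowledge_base
            ORDER BY embedding <-> %s::vector
            LIMIT 3
            """,
            (query_embedding, query_embedding),
        )
        for row in cur.fetchall():
            print(row)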
An API endpoint was making 401 database queries. Every. Single. Request.

Day 15 of 30 -- SQLAlchemy ORM Internals | Phase 3 -- Backend and APIs

Nobody noticed for weeks. The dashboard just felt slow.

The fix: 4 lines of code. One selectinload. One joinedload. Down to 3 queries. 3.2 seconds to 31 milliseconds.

The culprit was the N+1 problem. When you fetch 50 orders and access order.customer in a loop, SQLAlchemy fires a separate SELECT for each customer. 50 orders = 51 queries. Add items and products and you get 401 queries per page load.

Today's topic covers:
- Identity Map and Unit of Work -- the two patterns that explain 90% of SQLAlchemy behavior
- 4 session states -- Transient, Pending, Persistent, Detached and what each means
- The N+1 problem with a visual query trace -- 401 queries vs 3
- 6 loading strategies -- lazy, joinedload, selectinload, subqueryload, raiseload, noload
- When to use joinedload vs selectinload -- the to-one vs to-many rule
- Full annotated syntax -- model definition, eager loading chains, unit-of-work commit
- Real order dashboard -- 401 queries reduced to 3 with 4 lines of options() (a minimal version is sketched below)
- 5 mistakes including DetachedInstanceError and joinedload on to-many
- 5 best practices including lazy="raise" as the production default

Key insight: Every ORM hides SQL from you. The good engineers know exactly what SQL their ORM is generating -- and why.

#Python #SQLAlchemy #ORM #BackendDevelopment #Database #100DaysOfCode #FastAPI #TechContent #BuildInPublic #TechIndia #SoftwareEngineering #LinkedInCreator #LearnPython #OpenToWork #DataEngineering #PythonTutorial
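A minimal, self-contained sketch of that kind of fix (the models, names, and connection string are illustrative, not the real dashboard code):

    from sqlalchemy import Column, ForeignKey, Integer, String, create_engine, select
    from sqlalchemy.orm import (Session, declarative_base, joinedload,
                                relationship, selectinload)

    Base = declarative_base()

    # Minimal stand-in models: Order.customer is to-one, Order.items is to-many
    class Customer(Base):
        __tablename__ = "customers"
        id = Column(Integer, primary_key=True)
        name = Column(String)

    class Order(Base):
        __tablename__ = "orders"
        id = Column(Integer, primary_key=True)
        customer_id = Column(Integer, ForeignKey("customers.id"))
        customer = relationship("Customer")
        items = relationship("OrderItem")

    class OrderItem(Base):
        __tablename__ = "order_items"
        id = Column(Integer, primary_key=True)
        order_id = Column(Integer, ForeignKey("orders.id"))
        product_name = Column(String)

    engine = create_engine("postgresql+psycopg2://user:pass@localhost/shop")  # placeholder DSN

    with Session(engine) as session:
        # Without options(), touching order.customer and order.items in a loop
        # fires one extra SELECT per order -- the N+1 problem.
        orders = session.execute(
            select(Order)
            .options(
                joinedload(Order.customer),   # to-one -> pulled in via JOIN, no extra query
                selectinload(Order.items),    # to-many -> one batched SELECT ... WHERE order_id IN (...)
            )
            .limit(50)
        ).scalars().all()

The rule of thumb from the post holds here: joinedload for to-one relationships, selectinload for to-many.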
I built a Text-to-SQL RAG system from scratch and it genuinely surprised me how much the retrieval step matters.

The idea: type a plain English question, get back the right SQL query and the actual results. No schema memorisation, no manual query writing.

Here's how it works under the hood:

→ Schema indexing (offline)
I extract every table, column, data type, foreign key, and sample row from MySQL's INFORMATION_SCHEMA. Each table becomes a rich text document that gets embedded and stored in ChromaDB.

→ Query time (online)
When you ask a question, it gets embedded with the same model, and cosine similarity retrieves the most relevant tables. Those schema docs go into a structured prompt alongside the question, and GPT-4o generates the SQL at temperature=0 (deterministic — crucial for SQL).

→ Two safety layers
A keyword blocklist catches dangerous operations (DROP, DELETE, etc.) before execution. A read-only MySQL user enforces it at the database level — so even a prompt injection can't cause damage.

Stack: Python · OpenAI GPT-4o · ChromaDB · MySQL · text-embedding-3-small

Key insight I didn't expect: the quality of your schema document matters more than the LLM. A table description with column types + foreign keys + 3 sample rows retrieves dramatically better than just a list of column names.

Full code on GitHub (link in comments). Happy to answer questions about the design.

#MachineLearning #Python #SQL #RAG #LLM #DataEngineering #OpenAI #PortfolioProject
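The retrieval step plus the first safety layer might look roughly like this (the collection name, index path, and blocklist contents are my assumptions, and the embedding wiring is simplified relative to the post's text-embedding-3-small setup):

    import chromadb

    client = chromadb.PersistentClient(path="./schema_index")       # assumes the index was built offline
    schema_docs = client.get_or_create_collection("table_schemas")  # one document per table

    question = "Which customers placed more than 5 orders last month?"
    hits = schema_docs.query(query_texts=[question], n_results=3)   # similarity search over embedded schema docs
    relevant_schemas = hits["documents"][0]                         # goes into the LLM prompt with the question

    BLOCKLIST = ("DROP", "DELETE", "TRUNCATE", "ALTER", "UPDATE", "INSERT")

    def is_safe(sql: str) -> bool:
        # First safety layer: crude keyword check before the query ever reaches MySQL.
        # The second layer (a read-only MySQL user) is enforced by the database itself.
        upper = sql.upper()
        return not any(keyword in upper for keyword in BLOCKLIST)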
🚀 Day 8 of My PySpark Learning Journey

DataFrames in PySpark:

📌 What is a DataFrame?
A DataFrame is a distributed table with rows, columns and a schema. If you've used Pandas before, the concept is identical. Named columns, data types, familiar operations like filter, group, sort. The difference is where it runs.

→ Pandas DataFrame - runs on your laptop, limited by your RAM
→ PySpark DataFrame - runs across a cluster, handles terabytes

Same idea. Completely different scale.

📌 How Spark reads your data
You can create a DataFrame from almost anything - a CSV file, a JSON file, a Parquet file, a database table, even a plain Python list. Spark reads the data, figures out the column names and data types, and distributes it across partitions automatically. You just point it at the data source and it handles the rest.

📌 What you can do with a DataFrame
Once you have a DataFrame, the operations feel very natural:
→ Select specific columns you care about
→ Filter rows based on conditions
→ Add new calculated columns
→ Group rows and aggregate - count, sum, average
→ Join two DataFrames together like a SQL join
→ Sort by any column

The syntax is close enough to Pandas that if you already know Pandas, PySpark DataFrames will feel familiar within an hour.

📌 One key difference from Pandas
In Pandas, you can modify a DataFrame in place. In PySpark, every operation returns a NEW DataFrame. The original is never touched. This is because of lazy evaluation - Spark is building a plan, not executing immediately. Nothing actually runs until you ask for results. This lets Spark optimize the full pipeline before doing any real work.

📌 Why DataFrames over RDDs?
RDDs give you low-level control but you have to handle everything manually. DataFrames sit on top of RDDs and add a massive advantage — the Catalyst optimizer. Spark automatically figures out the most efficient way to execute your query. You write what you want, Spark figures out how to do it fast.

For 90% of real-world tasks, DataFrames are what you'll use.

#PySpark #DataFrames #DataEngineering #Python #ApacheSpark #LearningInPublic
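A quick sketch of what that looks like in practice (the file name and columns are made up for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("day8-dataframes").getOrCreate()

    # Spark infers column names and types, and partitions the data automatically
    orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

    # Each step returns a NEW DataFrame; nothing executes yet (lazy evaluation)
    summary = (
        orders
        .filter(F.col("status") == "completed")
        .withColumn("revenue", F.col("price") * F.col("quantity"))
        .groupBy("country")
        .agg(F.sum("revenue").alias("total_revenue"), F.count("*").alias("order_count"))
        .orderBy(F.col("total_revenue").desc())
    )

    summary.show(10)  # an action: only now does Spark optimise and run the whole plan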
🐛 #PythonJourney | Day 148 — Debugging SQLAlchemy Models & Type Compatibility

Today was about learning through debugging. I encountered multiple SQLAlchemy type compatibility issues and learned valuable lessons about database design.

Key accomplishments:

✅ Fixed critical SQLAlchemy issues:
• JSONB and INET are PostgreSQL-specific types
• Must import from sqlalchemy.dialects.postgresql
• Resolved naming conflicts (metadata is reserved)

✅ Solved type mismatches:
• User.id must be UUID(as_uuid=True)
• URL.user_id must match User.id type exactly
• Foreign key constraints require compatible types
• PostgreSQL is strict about type casting

✅ Debugged relationship definitions:
• back_populates must reference correct class names
• Cascade deletes prevent orphaned data
• Bidirectional relationships need proper naming

✅ Created test user script:
• Generates database tables automatically
• Creates sample user with API key
• Tests database connectivity

✅ All 5 SQLAlchemy models are now production-ready:
• User (authentication)
• URL (shortened URLs)
• Click (event tracking)
• ClickAggregate (analytics summaries)
• AuditLog (compliance)

What I learned today:
→ Database type safety is critical
→ PostgreSQL has its own type system (JSONB, UUID, INET)
→ SQLAlchemy type imports matter - core vs dialect-specific
→ Debugging error messages contain the actual problem - read them carefully
→ Foreign key constraints are strict about type compatibility

The lesson: Sometimes the best learning comes from fixing errors. Each error message was an opportunity to understand the framework better.

#Python #SQLAlchemy #PostgreSQL #DatabaseDesign #Backend #Debugging #SoftwareDevelopment #TechLearning
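A condensed sketch of the pattern those fixes converge on (simplified from the five-model design above; column names are illustrative):

    import uuid
    from sqlalchemy import Column, ForeignKey, String, Text
    from sqlalchemy.dialects.postgresql import INET, JSONB, UUID  # dialect-specific types live here, not in sqlalchemy core
    from sqlalchemy.orm import declarative_base, relationship

    Base = declarative_base()

    class User(Base):
        __tablename__ = "users"
        id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
        api_key = Column(String, unique=True)
        urls = relationship("URL", back_populates="user", cascade="all, delete-orphan")

    class URL(Base):
        __tablename__ = "urls"
        id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
        # Foreign key type must match User.id exactly -- Postgres won't cast it for you
        user_id = Column(UUID(as_uuid=True), ForeignKey("users.id"), nullable=False)
        target = Column(Text, nullable=False)
        meta = Column(JSONB)           # "metadata" is reserved by SQLAlchemy's Declarative API
        created_from_ip = Column(INET)
        user = relationship("User", back_populates="urls")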
When I joined my current team, we ran ETL. Extract from source. Transform in Python. Load clean data to BigQuery.

Six months later, we switched to ELT. Load raw data to BigQuery first. Transform inside BigQuery using dbt.

Here's exactly why - and what we got wrong the first time.

─────────────────

The ETL problems we kept hitting:

Python transform scripts were getting complex fast. Business logic kept changing. Every new metric required updating Python, code review, redeploy, rerun.

Worse: no way to replay history with new logic. Raw data was already transformed and gone. Business rule changes meant we couldn't reprocess old data. We painted ourselves into corners every sprint.

─────────────────

What switching to ELT changed:

→ Analysts now change transformation logic themselves - in SQL, not Python
→ Business rule changes? Rerun dbt on historical raw data. Done in minutes.
→ Python pipeline went from 800 lines to ~100. The rest is dbt models.
→ dbt gave us automatic documentation and lineage for free

─────────────────

But - ELT is not always right.

If you handle sensitive personal data (healthcare, financial), you may not be allowed to land raw PII in your warehouse. ETL is correct here - mask or encrypt before data touches storage.

─────────────────

The honest decision rule:

Can your warehouse handle transformation compute? → ELT
Can you store raw data affordably? → ELT
Does your team prefer SQL over Python for transforms? → ELT
Is data sensitivity a hard constraint? → ETL

Which does your team use - and what drove that decision? 👇

#DataEngineering #ETL #ELT #dbt #BigQuery #LearningInPublic
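For anyone picturing the "load raw first" half of ELT, a minimal sketch with the BigQuery client (the bucket, project, dataset, and table names are placeholders, not the team's actual pipeline):

    from google.cloud import bigquery

    client = bigquery.Client()

    # ELT: land the raw file untouched; all business logic lives downstream in dbt models
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,                   # header row
        autodetect=True,                       # let BigQuery infer the raw schema
        write_disposition="WRITE_APPEND",      # keep history so logic changes can be replayed
    )

    load_job = client.load_table_from_uri(
        "gs://my-bucket/raw/orders/2024-06-01.csv",   # placeholder GCS path
        "my_project.raw.orders",                      # placeholder dataset.table
        job_config=job_config,
    )
    load_job.result()  # wait for completion; transformation happens later via dbt run

Everything downstream of that raw table is then just dbt models, which is what lets old data be reprocessed when the business logic changes.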
🚀 Day 56 – Data Collection in Data Science

Today I focused on one of the most important steps in any Data Science project — Data Collection 📊

💡 Without quality data, even the best model won’t perform well.

🔹 Ways to Collect Data:

✔️ APIs
Fetch structured and real-time data from servers using tools like Python requests

✔️ Web Scraping
Extract data from websites using BeautifulSoup and Selenium

✔️ Databases
Access stored data from SQL (MySQL, PostgreSQL) or NoSQL (MongoDB)

✔️ Open Datasets
Use platforms like Kaggle for ready-made datasets

✔️ Surveys & Forms
Collect custom data using tools like Google Forms

✔️ Logs & Tracking
Analyze user behavior from website/app logs

⚖️ Key Insight:
API → Clean & reliable data
Scraping → Useful when API is not available

🔥 What I Realized:
Data Collection is not just gathering data — it’s about collecting the right data for your problem

📈 Next Step: Moving towards Data Cleaning & Preprocessing

#Day56 #DataScience #DataCollection #Python #MachineLearning #WebScraping #API #LearningJourney #100DaysOfCode
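A tiny sketch of the first two collection methods above (the URLs and selectors are placeholders):

    import requests
    from bs4 import BeautifulSoup

    # 1. API: structured, reliable data straight from a server
    response = requests.get("https://api.example.com/v1/products", timeout=10)
    products = response.json()

    # 2. Web scraping: fall back to parsing HTML when no API exists
    page = requests.get("https://example.com/catalog", timeout=10)
    soup = BeautifulSoup(page.text, "html.parser")
    titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]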
𝟮𝟬 𝗚𝗶𝘁 𝗰𝗼𝗺𝗺𝗮𝗻𝗱𝘀 𝗲𝘃𝗲𝗿𝘆 𝗱𝗮𝘁𝗮 𝘀𝗰𝗶𝗲𝗻𝘁𝗶𝘀𝘁 𝘀𝗵𝗼𝘂𝗹𝗱 𝗸𝗻𝗼𝘄.

A lot of people in data science know Python. Many know SQL. Some know machine learning. But one skill quietly saves more work than all of them combined: 𝗚𝗶𝘁.

Because the truth is… Your model can be great. Your analysis can be solid. Your dashboard can look perfect. But if you overwrite your own work, lose an experiment, or mess up a shared notebook without version control, none of that helps.

I’ve seen people spend months learning tools… and still struggle with the basics of managing code properly. So here’s a simple breakdown of 𝟮𝟬 𝗚𝗶𝘁 𝗰𝗼𝗺𝗺𝗮𝗻𝗱𝘀 𝘁𝗵𝗮𝘁 𝗮𝗰𝘁𝘂𝗮𝗹𝗹𝘆 𝗺𝗮𝘁𝘁𝗲𝗿 for data scientists:

𝗨𝘀𝗲 𝘁𝗵𝗲𝘀𝗲 𝗱𝗮𝗶𝗹𝘆
→ git status
→ git add
→ git commit -m "message"
→ git push
→ git pull

𝗨𝘀𝗲 𝘁𝗵𝗲𝘀𝗲 𝗮𝗹𝗹 𝘁𝗵𝗲 𝘁𝗶𝗺𝗲 𝗶𝗻 𝗽𝗿𝗼𝗷𝗲𝗰𝘁𝘀
→ git clone
→ git init
→ git branch
→ git checkout
→ git checkout -b
→ git merge
→ git log
→ git log --oneline
→ git diff
→ git fetch

𝗨𝘀𝗲 𝘁𝗵𝗲𝘀𝗲 𝘄𝗵𝗲𝗻 𝘀𝗼𝗺𝗲𝘁𝗵𝗶𝗻𝗴 𝗴𝗼𝗲𝘀 𝘄𝗿𝗼𝗻𝗴
→ git stash
→ git stash pop
→ git reset
→ git revert
→ git restore

My honest advice? Do not try to memorize everything at once. Just master these 5 first:
→ git status
→ git add
→ git commit -m
→ git push
→ git pull

That alone covers most of your real day-to-day work. The rest you’ll learn naturally once you start building projects, collaborating with teams, and fixing mistakes.

And trust me… the day Git saves you from losing hours of work, you’ll wish you learned it earlier.

Save this post for later. ♻️ Repost if you know a data scientist who needs this.
NOBOT - Status Update: Docker, ORM, and DB Design 🚀

Building in public means showing the process, even when it’s not "pixel-perfect". Here’s what’s happening with NOBOT:

1. Why PostgreSQL? 🐘
After a long brainstorming session (SQLite vs. one central hub), I decided to bet on PostgreSQL. With all the relations I have planned, Postgres just felt like the right, solid choice for the backbone of the system.

2. The Docker Dopamine Hit 🐳
There’s nothing like the satisfaction of seeing docker-compose up working for the first time. I just pushed the Dockerfile and docker-compose.yml to GitHub. Seeing the DB and data services running in containers gave me a huge smile – a small win for the infrastructure!

3. From Pydantic Contracts to ORM 🧹
I’m using Contract-Driven Development, so I started with Pydantic models to define how data should flow. Now I’m bridging that with an ORM. It lets me map my contracts directly to the database, keeps the code clean, and saves me from writing manual SQL modules.

4. Thinking Out Loud 🧠
The attached diagram is my current "brain dump." It’s changing every hour as I review my early contracts and refine the logic. Visualizing the schema helps me find edge cases before they turn into annoying bugs.

I’m curious — what are your favorite tools for database design? I’m using dbdiagram.io right now, but sometimes I still think about the chaotic energy of Microsoft Paint! 😂

#BuildInPublic #Python #PostgreSQL #Docker #AI #RPA #SoftwareEngineering #NOBOT #IDP
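NOBOT's actual contracts aren't shown here, so this is a generic sketch of the Pydantic-contract-to-ORM bridge the post describes (the entity names are invented):

    from datetime import datetime
    from pydantic import BaseModel
    from sqlalchemy import Column, DateTime, Integer, String, Text
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class DocumentIn(BaseModel):
        """Pydantic contract: how document data is allowed to enter the system."""
        source: str
        content: str

    class Document(Base):
        """ORM model: how the same contract is persisted in PostgreSQL."""
        __tablename__ = "documents"
        id = Column(Integer, primary_key=True)
        source = Column(String(255), nullable=False)
        content = Column(Text, nullable=False)
        created_at = Column(DateTime, default=datetime.utcnow)

    def to_orm(contract: DocumentIn) -> Document:
        # The contract stays the single source of truth; the ORM row is derived from it.
        # model_dump() is Pydantic v2; use .dict() on Pydantic v1.
        return Document(**contract.model_dump())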
In my last post, I introduced PydanTable—Pydantic-native tables, lazy transforms, and Rust-backed execution. Now, let's explore the next layer: SQL.

Many data tools follow a familiar pattern for SQL sources: pulling rows into Python, transforming them, and then writing them somewhere else. While this approach works, it becomes cumbersome when dealing with large datasets or when your write target is the same database from which you read. The process of “extracting everything locally” can feel more like a burden than a benefit.

PydanTable now offers an optional SQL execution path, allowing you to keep transformations within the database as long as the engine supports them. You only materialize data when you actually need it on the Python side. This shifts the paradigm from classic ETL—Extract, Transform locally, Load—to a more efficient TEL: Transform in SQL, extract locally when needed, then load.

The primary advantage is operational efficiency. When your load target is on the same SQL server, you can often bypass the costly step of transferring the entire result set through the application, enabling a direct transition from transformation to loading, with the server handling the heavy lifting.

This approach also indicates our future direction: a more intelligent execution strategy for PydanTable. The planner will optimize work on the read side when it is safe and efficient, selecting the best compute resources rather than defaulting to local resources or a single engine that may not be ideal for the task.

On the roadmap, we have plans for a MongoDB engine to allow aggregation to remain on the server before extraction or writing back, as well as a PySpark engine that introduces strong typing to traditional Spark-style operations. I am excited to continue advancing PydanTable beyond merely “strongly typed dataframes” toward strong typing where the data already resides.

#DataEngineering #Python #OpenSource #SQL #ETL
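PydanTable's own API isn't shown in the post, so purely as an illustration of the TEL idea in plain SQLAlchemy: the transform and the load both stay on the server, and rows only cross into Python when explicitly fetched. The DSN and table names are placeholders, and the tables are assumed to already exist:

    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql+psycopg2://user:pass@localhost/analytics")  # placeholder DSN

    with engine.begin() as conn:
        # Transform + Load entirely server-side: no result set travels through Python
        conn.execute(text("""
            INSERT INTO daily_revenue (day, total)
            SELECT order_date, SUM(amount)
            FROM raw_orders
            GROUP BY order_date
        """))

        # Extract locally only when the result is actually needed on the Python side
        top_days = conn.execute(
            text("SELECT day, total FROM daily_revenue ORDER BY total DESC LIMIT 5")
        ).fetchall()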