I spent the last few weeks building something I'm genuinely proud of.

It started with a simple question: what does a production-style data pipeline actually look like when you build it from scratch? So I built one.

OpsPulse-NYC-Taxi-Pipeline: a modular ETL pipeline that pulls NYC Yellow Taxi trip data, cleans it, transforms it, and loads it into a SQL Server database for analysis.

Here's what I learned along the way:
→ Clean architecture isn't optional. When your pipeline breaks at 2am, you'll thank yourself for writing modular code.
→ The pipeline fails loudly, not silently. HTTP errors, missing values, duplicates: nothing slips through quietly, because bad data that goes unnoticed is worse than a pipeline that stops.
→ Logging is your best friend. If you can't observe it, you can't debug it.
→ A fail-fast strategy saves hours. If extract fails, nothing else runs. Simple. Brutal. Effective.

Tech I used: Python · Pandas · Parquet · MSSQL Server · Requests · Custom logging

The pipeline has 3 stages:
Extract → you enter a month and year, and the pipeline fetches the exact Parquet file for that period. No hardcoding, no manual downloads.
Transform → deduplicates, cleans nulls, engineers features, and aggregates revenue per day.
Load → writes structured, clean data directly into MSSQL Server, query-ready from day one.

GitHub link in the comments 👇

#DataEngineering #ETL #Datapipeline #Python #MSSQL #DataWarehouse #LearningInPublic
Building a Modular NYC Taxi Data Pipeline with Python
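For a sense of what the fail-fast, three-stage orchestration described above can look like, here is a minimal sketch. The stage functions below are stand-ins written for this post (with tiny in-memory data so it runs as-is), not the repository's actual code.

import logging
import sys
import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("opspulse")

def extract_month(year: int, month: int) -> pd.DataFrame:
    # Placeholder stage: the real pipeline would download that month's Parquet file here
    # and raise on HTTP errors so nothing downstream runs on bad input.
    return pd.DataFrame({"pickup_date": ["2024-01-01"] * 3, "fare_amount": [12.5, 7.0, 7.0]})

def transform_trips(raw: pd.DataFrame) -> pd.DataFrame:
    # Placeholder stage: dedupe, drop nulls, then aggregate revenue per day.
    clean = raw.drop_duplicates().dropna()
    return clean.groupby("pickup_date", as_index=False)["fare_amount"].sum()

def load_to_mssql(df: pd.DataFrame, table: str) -> None:
    # Placeholder stage: the real pipeline would bulk insert into SQL Server here.
    log.info("Would load %s rows into %s", len(df), table)

def run_pipeline(year: int, month: int) -> None:
    # Fail fast: any exception in a stage stops the run before the next stage starts.
    try:
        raw = extract_month(year, month)
        log.info("Extract OK: %s rows", len(raw))
        daily = transform_trips(raw)
        log.info("Transform OK: %s rows", len(daily))
        load_to_mssql(daily, table="daily_revenue")
        log.info("Load OK")
    except Exception:
        log.exception("Pipeline aborted; downstream stages were not run")
        sys.exit(1)

run_pipeline(2024, 1)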
More Relevant Posts
🚨 We had 12 database tables… for the SAME data. And it kept growing every month.

I was debugging a backend system where something felt off. Everything worked… but nothing felt right. Then I saw it.

❌ JanuaryData table
❌ FebruaryData table
❌ MarchData table
… and the pattern continued. Every. Single. Month.

At first glance, it looked organized. In reality? 💥 It was a scaling disaster.

As the system grew:
– Queries became messy and repetitive
– Business logic had to be duplicated
– Adding a feature meant touching multiple tables
– Maintenance became painful

💡 The real issue wasn't performance. It was bad data modeling.

🛠️ What I changed
Instead of creating a new table every month:
✅ Built a single main model
✅ Created a monthly records model
✅ Connected them using a Foreign Key (one-to-many relationship)

Now:
👉 One entity → multiple monthly entries
👉 A clean, scalable, maintainable structure

🎯 The impact
✔️ Removed duplication
✔️ Simplified queries
✔️ Faster development
✔️ Future-proof design

💡 Lesson: if your database keeps growing new tables for the same logic, you don't have a scaling system. You have a design problem.

Good backend engineers don't just fix bugs. They fix foundations.

Have you ever seen something like this in production? 😅 What was the worst DB design you've fixed? 👇 Drop it below

#Python #Django #DatabaseDesign #BackendDevelopment #SoftwareEngineering #PostgreSQL #WebDevelopment #FullStackDeveloper #DataModeling #CleanCode #TechTips #Programming #OpenToWork #RemoteWork #AWS
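Since the post tags Django, here is a minimal models.py sketch of the "after" design. The entity and field names (Account, MonthlyRecord, period, amount) are invented for illustration; the real schema isn't shown in the post.

from django.db import models

class Account(models.Model):
    # The single "main model": one row per real-world entity, regardless of month.
    name = models.CharField(max_length=100)

class MonthlyRecord(models.Model):
    # One table for all months, instead of JanuaryData, FebruaryData, MarchData, ...
    account = models.ForeignKey(
        Account,
        on_delete=models.CASCADE,
        related_name="monthly_records",  # one account -> many monthly entries
    )
    period = models.DateField()          # e.g. the first day of the month
    amount = models.DecimalField(max_digits=12, decimal_places=2)

    class Meta:
        # One entry per account per month; the old design "enforced" this by table name.
        constraints = [
            models.UniqueConstraint(fields=["account", "period"], name="one_row_per_account_per_month"),
        ]

A query that previously meant touching twelve tables becomes a single filter, something like MonthlyRecord.objects.filter(period__year=2025, account=some_account).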
Earlier this week, I was debugging a legacy SQL script. It was a mess of subqueries wrapped inside subqueries, like unwrapping an endless stack of boxes just to find one small item. It was hard to read, impossible to debug, and it slowed the whole team down.

The fix? I refactored the entire thing into clean CTEs (Common Table Expressions).

Here is why I've made the switch:

Readability: CTEs let you name your data blocks. You read the code from top to bottom, like a story, not from the inside out.

Easy debugging: you can test each "block" individually. No more untangling a web of parentheses.

Team speed: if a teammate can understand your query in 30 seconds instead of 30 minutes, you've just saved the company money.

In data engineering and data science, "clean" is often better than "clever".

#SQL #DataEngineering #DataScience #CleanCode #TechTips
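A minimal before/after sketch of that kind of refactor, on a made-up orders table, run through Python's sqlite3 purely so both versions are executable:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL, status TEXT);
    INSERT INTO orders VALUES
        (1, 'acme', 120.0, 'paid'), (2, 'acme', 80.0, 'refunded'),
        (3, 'globex', 200.0, 'paid'), (4, 'globex', 50.0, 'paid');
""")

# Before: the logic reads inside-out through nested subqueries.
nested = """
    SELECT customer, total
    FROM (SELECT customer, SUM(amount) AS total
          FROM (SELECT * FROM orders WHERE status = 'paid') AS paid
          GROUP BY customer) AS totals
    WHERE total > 100;
"""

# After: each step is a named block, read top to bottom.
with_ctes = """
    WITH paid_orders AS (
        SELECT * FROM orders WHERE status = 'paid'
    ),
    customer_totals AS (
        SELECT customer, SUM(amount) AS total
        FROM paid_orders
        GROUP BY customer
    )
    SELECT customer, total
    FROM customer_totals
    WHERE total > 100;
"""

# Both return the same rows; only the readability differs.
print(conn.execute(nested).fetchall())
print(conn.execute(with_ctes).fetchall())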
Your SQL should read like a story, not a puzzle.

In modern Data Engineering, "it works" isn't the only requirement. Readability is a production feature. Here is why the choice between subqueries and CTEs defines your code quality:

🔹 Subqueries (The Quick Fix)
A query nested inside another query.
Pros:
• Great for simple, one-off filtering or scalar checks.
• Quick to write for a "throwaway" query.
Cons:
• Creates "spaghetti code" that grows inward.
• Hard to debug when a join fails 4 levels deep.
• Makes it nearly impossible for a teammate to follow your logic.

🚀 CTEs, Common Table Expressions (The Systematic Way)
Defined using the WITH clause at the top of your script.
Pros:
• Modular: you "name" your steps (e.g., cleaned_sales).
• Top-to-bottom: anyone can read your logic like a book.
• Reusable: reference the same logic multiple times.

Performance note: in modern engines like Spark or SQL Server, the optimizer is smart enough to handle both efficiently. But your coworkers aren't compilers; they need to be able to read your work.

💡 Simple way to think about it:
Subqueries = a long paragraph with 12 commas.
CTEs = a clean, organized list of bullet points.

The Verdict: if you're building a Medallion architecture, especially in the Silver layer, default to CTEs. Your future self (and your teammates) will thank you.

#DataEngineering #SQL #Databricks #Azure #CleanCode #MedallionArchitecture #Subquery #CTE
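A small sketch of the "reusable" point: a cleaned_sales CTE defined once and referenced twice in the same statement. The table and columns are invented, and sqlite3 is used only to make it executable; with subqueries, the cleaning logic would have to be pasted in both places.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_sales (region TEXT, amount REAL);
    INSERT INTO raw_sales VALUES
        ('north', 100.0), ('north', NULL), ('south', 40.0), ('south', 60.0);
""")

query = """
    WITH cleaned_sales AS (
        SELECT region, amount
        FROM raw_sales
        WHERE amount IS NOT NULL
    ),
    region_totals AS (
        SELECT region, SUM(amount) AS total
        FROM cleaned_sales
        GROUP BY region
    )
    SELECT r.region,
           r.total,
           r.total * 100.0 / (SELECT SUM(amount) FROM cleaned_sales) AS pct_of_all
    FROM region_totals AS r
    ORDER BY pct_of_all DESC;
"""

for row in conn.execute(query):
    print(row)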
Hello everyone, I'm writing a multi-part series on fine-tuning Text-to-SQL models using the JAX ecosystem. Part 1 focuses on the initial setup and groundwork: I use Grain for data loading and prepare the foundation for fine-tuning.

I first wrote about JAX two years ago in a JAX vs NumPy performance comparison, and the ecosystem has grown a lot since then. This series is also inspired by Chris Achard's course "Build and Train an LLM with JAX", and I wanted to adapt the idea into a practical Text-to-SQL fine-tuning example.

Part 1 is here: https://lnkd.in/eKjbJNyY

More parts coming soon. I'll add my older JAX vs NumPy article in the comments.
In general, people with a programming background find the SQL CASE expression easy to pick up, but the real advantage is knowing how to look at a problem in SQL terms rather than trying to apply procedural logic.

CASE is not a control-flow statement. It is an expression: it evaluates conditions and returns a value for each row in the result set.

To understand how it behaves, it helps to be aware of the logical order of execution in SQL:
FROM
WHERE
GROUP BY
HAVING
SELECT
ORDER BY

This matters because CASE is evaluated wherever it is used as part of a clause. In most cases that is the SELECT phase or ORDER BY, but it can also appear in WHERE, HAVING, or even UPDATE statements. It does not affect how rows are retrieved, only how values are computed.

Example:

SELECT username,
       CASE WHEN score >= 90 THEN 'A' ELSE 'B' END AS grade
FROM users
WHERE active = 1;

What is happening here:
FROM selects the table
WHERE filters rows first
SELECT applies the CASE expression to each remaining row

A common mistake is trying to treat CASE as something that influences earlier steps in the query. It doesn't. It only transforms data at the point where it is evaluated.

A CASE expression in SQL always ends with END, which is mandatory because it marks the completion of the expression itself. After END, you may optionally assign an alias using AS, like END AS grade, which simply names the result column in the output. The key distinction is that END is part of the CASE syntax and must always exist, while AS is not part of CASE at all; it is only there for readability and reference.

Below is an example of using CASE in an ORDER BY clause. Notice that the expression needs no column alias, and its result does not appear in the final output; it only exists while the query runs.

#SQL #Database #DataEngineering #DataAnalysis #DataAnalytics #Analytics #BusinessIntelligence #BackendDevelopment #Programming #SoftwareDevelopment #LearnSQL #QueryOptimization #DataVisualization #ComputerScience #DevLife
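The ORDER BY example referenced above isn't included in the text, so here is a minimal stand-in: the same hypothetical users table, run through Python's sqlite3 so it can be executed as-is. The CASE expression carries no alias and never shows up in the output columns.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (username TEXT, score INTEGER, active INTEGER);
    INSERT INTO users VALUES
        ('amira', 95, 1), ('li', 72, 1), ('sam', 88, 1), ('noor', 91, 0);
""")

# CASE used only for ordering: top scorers first, then alphabetical within each group.
query = """
    SELECT username, score
    FROM users
    WHERE active = 1
    ORDER BY
        CASE WHEN score >= 90 THEN 0 ELSE 1 END,
        username;
"""

for row in conn.execute(query):
    print(row)   # ('amira', 95) first, then ('li', 72) and ('sam', 88)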
PySpark vs Spark SQL: I don't choose based on preference, I choose based on the job.

It's easy to turn this into a "which is better" debate. In practice, both are useful, just for different reasons.

And one thing is often misunderstood: Spark doesn't execute "Python" or "SQL" the way people think. It executes a logical plan -> optimised plan -> physical plan. So a lot of the time the real difference isn't performance, it's how clearly you express intent and how maintainable the pipeline is.

When Spark SQL wins
• The work is mostly select, join, filter, aggregate
• Logic needs to be readable by more people (analysts + engineers)
• I want quick iteration and clear intent
• Performance tuning is easier because the query shape is obvious

When PySpark wins
• I need custom logic that's awkward in SQL
• Complex parsing, nested structures, arrays/maps, JSON-heavy work
• Reusable functions and cleaner code structure (modules, unit tests)
• Integration steps around the transformation (validation, file handling, etc.)

The real trade-off
• SQL usually optimizes for clarity.
• PySpark usually optimizes for flexibility.

The best pattern I've seen
• Use SQL for the core transformations (joins/aggregations)
• Use PySpark for the edges (validation, enrichment, complex business rules)
• Keep one "source of truth" so business logic doesn't get duplicated

Takeaway: choosing PySpark vs Spark SQL isn't a style choice. It's a maintainability and delivery choice.

Drop your go-to rule for choosing between them in the comments.

#PySpark #SparkSQL #DataEngineering #Databricks #BigData #SQL #AnalyticsEngineering #DataPipelines
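To make the "same plan, different surface" point concrete, here is a small sketch of one aggregation written both ways on the same DataFrame. The data and column names are invented, and it assumes a local PySpark installation:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("sql-vs-pyspark").getOrCreate()

df = spark.createDataFrame(
    [("north", 100.0), ("north", 50.0), ("south", 80.0)],
    ["region", "amount"],
)
df.createOrReplaceTempView("sales")

# Spark SQL: the intent is obvious to anyone who reads SQL.
via_sql = spark.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
""")

# DataFrame API: the same logical plan, but easier to wrap in reusable, testable functions.
via_api = df.groupBy("region").agg(F.sum("amount").alias("total"))

# Both go through the same optimizer; compare the plans and the results yourself.
via_sql.explain()
via_api.explain()
via_sql.show()
via_api.show()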
I recently worked on an ETL pipeline built around UK regional carbon intensity data.

The pipeline extracts 24-hour regional data from the Carbon Intensity API, transforms the nested JSON response into a structured tabular format, aggregates the 30-minute interval readings into daily regional summaries, and loads the output into PostgreSQL for analysis.

On the transformation side, the workflow flattens both the carbon intensity values and the generation mix data across fuel sources, then uses Pandas to produce daily region-level metrics.

On the database side, the final output is stored in PostgreSQL tables designed for reporting, with date-based partitioning applied to the fact tables to support cleaner storage management and better scalability as the data grows.

The result is a query-ready pipeline that turns raw API data into structured daily carbon intensity and generation mix data for downstream analysis and reporting.

Tech used: Python, Pandas, PostgreSQL, SQLAlchemy, YAML

#DataEngineering #ETL #Python #PostgreSQL #SQL #DataPipeline #DatabaseDesign #AnalyticsEngineering
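A rough sketch of the flatten-and-aggregate step, operating on a sample payload with roughly the shape described above. The field names and nesting are assumptions made for this example, not the pipeline's actual code or the API's exact schema:

import pandas as pd

# Sample payload: half-hourly readings per region, each with an intensity forecast
# and a generation mix. Field names are assumed for illustration.
payload = {
    "regions": [
        {
            "shortname": "London",
            "data": [
                {"from": "2024-05-01T00:00Z", "intensity": {"forecast": 120},
                 "generationmix": [{"fuel": "wind", "perc": 40.0}, {"fuel": "gas", "perc": 60.0}]},
                {"from": "2024-05-01T00:30Z", "intensity": {"forecast": 100},
                 "generationmix": [{"fuel": "wind", "perc": 55.0}, {"fuel": "gas", "perc": 45.0}]},
            ],
        }
    ]
}

rows = []
for region in payload["regions"]:
    for reading in region["data"]:
        row = {
            "region": region["shortname"],
            "date": pd.to_datetime(reading["from"]).date(),
            "intensity_forecast": reading["intensity"]["forecast"],
        }
        # Flatten the generation mix into one column per fuel source.
        for mix in reading["generationmix"]:
            row[f"pct_{mix['fuel']}"] = mix["perc"]
        rows.append(row)

df = pd.DataFrame(rows)

# Aggregate the 30-minute readings into daily regional summaries.
daily = df.groupby(["region", "date"], as_index=False).mean(numeric_only=True)
print(daily)
# A load step would then write `daily` to PostgreSQL, e.g. via SQLAlchemy and to_sql.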
🚀 Built an end-to-end data pipeline using an API & SQL Server!

Excited to share my recent hands-on project where I built a complete data pipeline from scratch 👇

🔹 What I did:
1. Source database (SQL Server)
2. Create an API using FastAPI
3. Expose an endpoint (/data)
4. Call the API using Python (requests)
5. Get the data in JSON format
6. Connect to the target SQL Server
7. Auto-create the table (if it doesn't exist)
8. Insert the data into the target table
9. Verify the data in SSMS

🔹 Tech stack: Python | FastAPI | SQL Server | pyodbc | requests

🔹 Key learnings:
💡 How APIs act as a bridge between systems
💡 Converting JSON data into a structured format
💡 Building real-world ETL pipelines
💡 Automating data movement without manual intervention

This project helped me understand how real-world data engineering pipelines work, from data extraction to loading 🚀

Looking forward to building more such projects and improving my skills!

#DataEngineering #Python #FastAPI #SQLServer #ETL #DataPipeline #LearningInPublic #100DaysOfData #BuildingInPublic
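A condensed sketch of steps 4-8 on the consumer side: call the /data endpoint and load the result into SQL Server. The endpoint URL, connection string, table, and column names are placeholders, not the project's real values:

import pyodbc
import requests

# Steps 4-5: call the FastAPI endpoint and parse the JSON (assumed URL and payload shape).
records = requests.get("http://localhost:8000/data", timeout=30).json()

# Step 6: connect to the target SQL Server (placeholder connection string).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=TargetDB;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Step 7: auto-create the target table if it doesn't exist yet.
cursor.execute("""
    IF OBJECT_ID('dbo.customers', 'U') IS NULL
        CREATE TABLE dbo.customers (
            id INT PRIMARY KEY,
            name NVARCHAR(100),
            city NVARCHAR(100)
        );
""")

# Step 8: insert the JSON records into the target table.
cursor.executemany(
    "INSERT INTO dbo.customers (id, name, city) VALUES (?, ?, ?)",
    [(r["id"], r["name"], r["city"]) for r in records],
)
conn.commit()
conn.close()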
🗄️ SQL Subqueries Explained: Scalar, Correlated & Nested

Most SQL developers know how to write a basic query. But mastering subqueries? That's where you level up. 🚀

Here's a quick breakdown of the 3 types you need to know:

1️⃣ Scalar subquery
Returns exactly ONE value (one row, one column). Use it in SELECT, WHERE, or HAVING.
⚠️ Returns more than one value? It'll throw an error.

2️⃣ Correlated subquery
Runs once per row of the outer query; it depends on outer query values. Great for row-level comparisons, but can be slow on large datasets.
💡 Consider JOINs or window functions for better performance.

3️⃣ Nested queries
A query inside a query, and yes, they can be multi-level. Perfect for filtering with IN or matching aggregated values across departments.
⚠️ Avoid over-nesting; it hurts readability and makes debugging painful.

Which type of subquery do you use the most in your day-to-day work? Drop it in the comments! 👇

#SQL #DataEngineering #DataAnalytics #DatabaseDesign #LearnSQL #TechTips #DataScience #Programming #SoftwareDevelopment
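One quick sketch of each type, on a made-up employees table, run through Python's sqlite3 so the three shapes can be compared side by side:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT, salary REAL);
    INSERT INTO employees VALUES
        ('ana', 'eng', 120.0), ('bob', 'eng', 90.0),
        ('cara', 'sales', 70.0), ('dev', 'sales', 95.0);
""")

# 1. Scalar subquery: returns a single value (the company-wide average salary).
scalar = """
    SELECT name, salary
    FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees);
"""

# 2. Correlated subquery: re-evaluated for each outer row, using that row's dept.
correlated = """
    SELECT e.name, e.dept, e.salary
    FROM employees AS e
    WHERE e.salary > (SELECT AVG(salary) FROM employees WHERE dept = e.dept);
"""

# 3. Nested query with IN: filter by a set produced by an inner aggregate query.
nested = """
    SELECT name
    FROM employees
    WHERE dept IN (SELECT dept FROM employees GROUP BY dept HAVING AVG(salary) > 80);
"""

for label, q in [("scalar", scalar), ("correlated", correlated), ("nested", nested)]:
    print(label, conn.execute(q).fetchall())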
An Implementation Guide to Building a DuckDB-Python Analytics Pipeline with SQL, DataFrames, Parquet, UDFs, and Performance Profiling - MarkTechPost https://lnkd.in/eJuvwc_a
OpsPulse-NYC-Taxi-Pipeline on GitHub: https://github.com/ZohaibJavaid32/OpsPulse-NYC-Taxi-Pipeline