Building a Modular NYC Taxi Data Pipeline with Python

I spent the last few weeks building something I'm genuinely proud of. It started with a simple question: what does a production-style data pipeline actually look like when you build it from scratch? So I built one.

OpsPulse-NYC-Taxi-Pipeline: a modular ETL pipeline that pulls NYC Yellow Taxi trip data, cleans it, transforms it, and loads it into a SQL Server database for analysis.

Here's what I learned along the way:
→ Clean architecture isn't optional. When your pipeline breaks at 2am, you'll thank yourself for writing modular code.
→ The pipeline fails loudly, not silently. HTTP errors, missing values, duplicates: nothing slips through quietly. Bad data that goes unnoticed is worse than a pipeline that stops.
→ Logging is your best friend. If you can't observe it, you can't debug it.
→ A fail-fast strategy saves hours. If extract fails, nothing else runs. Simple. Brutal. Effective.

Tech I used: Python · Pandas · Parquet · MSSQL Server · Requests · Custom logging

The pipeline has 3 stages:
Extract → you enter a month and year, and the pipeline fetches the exact Parquet file for that period. No hardcoding, no manual downloads.
Transform → deduplicates, cleans nulls, engineers features, aggregates revenue per day.
Load → writes structured, clean data directly into MSSQL Server, query-ready from day one.

GitHub link in the comments 👇

#DataEngineering #ETL #Datapipeline #Python #MSSQL #DataWarehouse #LearningInPublic
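The extract stage described above (build a URL from a month and year, fetch that month's Parquet file, fail fast on HTTP errors) could be sketched roughly like this. The CDN base URL and file-naming pattern are my assumptions about where the TLC publishes the data, not details taken from the repo:

```python
import requests

# Assumed public CDN location of the TLC monthly Parquet files; the real
# pipeline's source URL may differ.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def build_url(year: int, month: int) -> str:
    """Build the Parquet file URL for a given year and month."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be 1-12, got {month}")
    return f"{BASE_URL}/yellow_tripdata_{year}-{month:02d}.parquet"


def extract(year: int, month: int, dest: str) -> str:
    """Download one month of trip data; fail loudly on HTTP errors."""
    resp = requests.get(build_url(year, month), timeout=60)
    resp.raise_for_status()  # fail-fast: a 4xx/5xx stops the whole pipeline here
    with open(dest, "wb") as f:
        f.write(resp.content)
    return dest
```

Raising on a bad status code (rather than logging and continuing) is what makes the fail-fast behavior cheap: transform and load simply never run on a partial or missing file.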

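For the transform stage (dedupe, clean nulls, engineer a date feature, aggregate revenue per day), a minimal Pandas sketch might look like the following. The column names come from the public Yellow Taxi schema (`tpep_pickup_datetime`, `total_amount`); treat them as assumptions about what the actual pipeline keys on:

```python
import pandas as pd


def transform(trips: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, drop rows missing critical fields, derive a pickup date."""
    df = trips.drop_duplicates()
    df = df.dropna(subset=["tpep_pickup_datetime", "total_amount"])
    # Feature engineering: a calendar-date column for daily aggregation.
    return df.assign(
        pickup_date=pd.to_datetime(df["tpep_pickup_datetime"]).dt.date
    )


def revenue_per_day(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate total revenue per pickup date."""
    return (
        df.groupby("pickup_date", as_index=False)["total_amount"]
        .sum()
        .rename(columns={"total_amount": "daily_revenue"})
    )
```

The load stage could then be a single call such as `revenue_per_day(clean).to_sql(...)` against a SQLAlchemy `mssql+pyodbc` engine, which is one common way to land a DataFrame in SQL Server.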