I spent the last few weeks building Pipekit, a distributed data pipeline orchestrator, from scratch in Python. Here's what I built and what I learned.

WHAT IS PIPEKIT?

A simplified Apache Airflow, built from first principles to deeply understand how pipeline orchestrators actually work under the hood.

You define tasks with a decorator:

    @task(retries=3)
    def fetch_data():
        return "raw_data"

Express dependencies in one line:

    merge.depends_on(fetch_users, fetch_orders, fetch_products)

And Pipekit handles the rest.

WHAT IT CAN DO

- DAG execution: Kahn's algorithm resolves dependencies automatically
- True parallelism: tasks in the same wave run across multiple Celery workers simultaneously
- State machine: every task tracked as pending → running → success / failed
- Persistent state: full audit trail in PostgreSQL
- Exponential backoff retry: 2s, 4s, 8s between attempts
- Artifact passing: task outputs flow automatically to downstream tasks
- REST API: trigger pipelines over HTTP (FastAPI)
- CLI tool: pipekit run, pipekit status
- Cron scheduler: pipelines run automatically on a schedule
- Cycle detection: raises an error immediately for circular dependencies

PROOF OF PARALLELISM

3 tasks × 2 seconds each:

Sequential → 6.2 seconds
Pipekit → 2.04 seconds

ForkPoolWorker-7, ForkPoolWorker-8, and ForkPoolWorker-1 all picked up tasks at the same timestamp. That's real parallelism: not threads, actual separate processes.

TECH STACK

- Python (core)
- FastAPI (REST API)
- PostgreSQL (persistent state)
- Redis (message broker)
- Celery (distributed workers)
- APScheduler (cron scheduling)
- Click (CLI)

THE BIGGEST THING I LEARNED

Reliability in distributed systems comes from state.

Without persistent state, a crash means you lose everything. You don't know what ran, what failed, where to restart from.

With state, you can observe, recover, and retry.

That's the insight behind every production orchestrator. Every database. Every message queue.
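To make the state idea concrete, here is a minimal sketch of a pending → running → success / failed state machine. It is not Pipekit's actual code: the names TaskState, ALLOWED, and TaskRecord are hypothetical, the in-memory history list only stands in for the PostgreSQL audit trail, and treating a retry as a failed → running transition is an assumption of this sketch.

```python
from enum import Enum


class TaskState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


# Legal transitions. FAILED -> RUNNING models a retry (assumption of this sketch).
ALLOWED = {
    TaskState.PENDING: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.SUCCESS, TaskState.FAILED},
    TaskState.FAILED: {TaskState.RUNNING},
    TaskState.SUCCESS: set(),
}


class TaskRecord:
    """Tracks one task's state and keeps every state it has been in."""

    def __init__(self, name):
        self.name = name
        self.state = TaskState.PENDING
        # Hypothetical stand-in for a persistent audit trail.
        self.history = [TaskState.PENDING]

    def transition(self, new_state):
        # Reject any move the table above does not allow.
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.name}: illegal transition "
                f"{self.state.value} -> {new_state.value}"
            )
        self.state = new_state
        self.history.append(new_state)
```

Because every transition is recorded, a restarted process could in principle read the last recorded state of each task and decide what still needs to run, which is the recovery property described above.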
It's all about managing state reliably.

Building this from scratch taught me more about distributed systems than any course ever did.

If you're learning backend or distributed systems, I highly recommend building something like this from scratch. You understand it on a completely different level when you've written every line yourself.

Project is on GitHub 👇
https://lnkd.in/gHx4Rtjp

Website: https://lnkd.in/g4xC5RsN

#Python #DistributedSystems #BackendEngineering #BuildInPublic #SoftwareEngineering #DataEngineering #OpenSource
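P.S. For anyone curious how Kahn's algorithm can group tasks into waves and catch circular dependencies, here is a minimal sketch. It is an illustration under stated assumptions, not Pipekit's implementation: the function name execution_waves and the dict-of-sets input format are hypothetical, and this version reports a cycle once the sort finishes with tasks left over.

```python
def execution_waves(deps):
    """Group tasks into waves using Kahn's algorithm.

    deps maps each task name to the names of the tasks it depends on.
    Every task in a wave depends only on tasks from earlier waves,
    so the tasks within one wave could run in parallel.
    """
    # Collect every task, including ones that only appear as dependencies.
    tasks = set(deps)
    for upstream in deps.values():
        tasks |= set(upstream)

    # Normalize to sets so duplicate entries cannot inflate the counts.
    graph = {t: set(deps.get(t, ())) for t in tasks}
    indegree = {t: len(graph[t]) for t in tasks}

    # Reverse edges: for each task, the tasks waiting on it.
    dependents = {t: [] for t in tasks}
    for task, upstream in graph.items():
        for u in upstream:
            dependents[u].append(task)

    waves = []
    scheduled = 0
    ready = sorted(t for t in tasks if indegree[t] == 0)
    while ready:
        waves.append(ready)
        scheduled += len(ready)
        next_ready = []
        for t in ready:
            for d in dependents[t]:
                indegree[d] -= 1
                if indegree[d] == 0:
                    next_ready.append(d)
        ready = sorted(next_ready)

    # Tasks that never reached indegree 0 are part of (or behind) a cycle.
    if scheduled != len(tasks):
        raise ValueError("cycle detected in task dependencies")
    return waves
```

With the dependency from the post (merge depending on fetch_users, fetch_orders, and fetch_products), the three fetch tasks land in the first wave and merge in the second.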

