Databricks Modular Framework
The Problem
Every new data pipeline meant writing boilerplate — imports, config parsing, error handling, logging. We needed a framework where pipelines are configuration, not code — but without sacrificing the flexibility that custom Python gives you.
The Architecture
It's a dependency injection-based orchestration framework running on Databricks. A notebook passes globals() — including spark, dbutils, and display — into the orchestrator. From there, the framework takes over entirely.
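From the notebook's side, that handoff is a single call — the entry-point name and parameters below are illustrative, not the framework's real API:

```python
# Illustrative notebook cell — the entry-point name and its parameters are made up,
# but the handoff of globals() (spark, dbutils, display, ...) is the key idea.
from orchestrator import run_orchestrator  # assumed entry point

run_orchestrator(
    notebook_globals=globals(),        # hands spark, dbutils and display to the framework
    pipeline_name="customer_ingest",   # which YAML pipeline folder to run
    environment="dev",                 # dev: fetch from a GitHub branch; prod: read the DBFS zip
    branch="feature/my-change",        # only used on the development path
)
```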
Python modules are pre-installed in a shared location, so at runtime the framework only has to add those paths for the modules to be found.
Two execution paths, same engine:
🔵 Development — The orchestrator fetches all code and config from any GitHub branch at runtime. Developers test full end-to-end pipelines on their feature branch without deploying anything. No shared environment collisions.
🟢 Production — CI/CD packages the repo into an immutable zip file, uploads it to DBFS, and the orchestrator reads from that artifact. No GitHub access at runtime. If the zip is there, the pipeline runs exactly as packaged.
Both paths converge on the same pipeline engine, which iterates through an ordered set of function calls defined in YAML.
The exec() and eval() Secret Sauce
Functions are standalone .py files — one file, one function, no class hierarchies. They're registered into globals() at runtime by the dependency framework. A function written once is reused across pipelines without copying or importing.
At runtime, the framework fetches each function file and does this:
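(A sketch — fetch_source() and the function name below are placeholders for the real branch/zip fetch and whatever function the step calls.)

```python
# Sketch of the dynamic loader — fetch_source() stands in for the real
# "read from the GitHub branch or from the DBFS zip" logic.
def load_function(file_name: str, func_name: str, exec_globals: dict):
    source = fetch_source(file_name)      # e.g. "dts_copy_table.py" from the active branch or zip
    exec(source, exec_globals)            # defines the function inside the shared globals()
    return eval(func_name, exec_globals)  # pull the callable back out by name

# The pipeline engine itself is bootstrapped the same way:
#   exec(fetch_source("ProcessPipeline.py"), exec_globals)
#   process_pipeline = eval("ProcessPipeline", exec_globals)
```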
This is what makes functions hot-loadable from any branch. There's no static import tree. A developer on a feature branch gets their version of every function without touching production. The same mechanism loads the pipeline execution engine itself — ProcessPipeline.py is fetched and executed via exec(), then the callable is extracted with eval().
eval() also handles parameter deserialization — after variable substitution converts dicts and lists to strings, eval() parses them back to their original Python types.
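A toy illustration of that round trip (the parameter value is made up):

```python
# After substitution, a parameter that was a dict arrives as its string form...
raw = "{'table': 'sales', 'partitions': ['2024-01', '2024-02']}"

# ...and eval() turns it back into the original Python object.
parsed = eval(raw)
assert isinstance(parsed, dict) and parsed["partitions"][1] == "2024-02"
```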
Pipelines Are Pure YAML
Each pipeline is a folder of numbered YAML files. The ordinal in the filename determines execution order:
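Something like this — the file names and YAML keys below are illustrative, not the framework's exact schema:

```yaml
# Illustrative pipeline folder:
#   customer_ingest/
#     1_extract_source.yaml
#     2_dedupe.yaml
#     3_write_target.yaml

# 2_dedupe.yaml
enabled: true
function: dataframe_dedupe
parameters:
  input_df: "pipeline_results[1]"   # reference the output of step 1
  keys: ["customer_id"]
```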
Outputs are stored in pipeline_results[ordinal]. Downstream steps reference upstream results by number — creating a lightweight DAG inside every pipeline. The enabled: false flag lets you skip steps without deleting config.
Five Layers of Variable Resolution
The variable substitution engine runs a multi-pass resolution across all parameters, so the same YAML works in dev and prod — only the environment config changes.
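For a feel of the mechanism, here is a rough sketch of multi-pass placeholder resolution — the {{name}} syntax, config shapes, and resolver are assumptions, not the framework's code:

```python
import re

def resolve(value: str, scopes: list[dict]) -> str:
    """Repeatedly substitute {{name}} placeholders from an ordered list of scopes
    (e.g. environment config, runtime parameters, upstream results) until stable."""
    pattern = re.compile(r"\{\{(\w+)\}\}")
    previous = None
    while value != previous:  # multi-pass: a resolved value may itself contain placeholders
        previous = value
        for scope in scopes:
            value = pattern.sub(lambda m: str(scope.get(m.group(1), m.group(0))), value)
    return value

dev_env = {"catalog": "dev_lake"}
prod_env = {"catalog": "prod_lake"}
step_param = "{{catalog}}.sales.orders"

print(resolve(step_param, [dev_env]))   # dev_lake.sales.orders
print(resolve(step_param, [prod_env]))  # prod_lake.sales.orders — same YAML, different env config
```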
Extending the Framework Without Changing It
Adding a new capability to the framework is three steps: write the function as a standalone .py file, add it to a function group, and declare that group for the pipelines that need it.
That's it. The framework itself doesn't change. The dependency declaration model (pipeline_dependencies.yaml) maps each pipeline to its required function groups. A pipeline that needs [dts, dataframe] won't load email, reporting, or database functions. Namespace stays lean, startup stays fast.
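A plausible shape for that mapping — the schema below is illustrative:

```yaml
# pipeline_dependencies.yaml — illustrative structure
customer_ingest:
  function_groups: [dts, dataframe]      # only these groups are loaded into globals()
monthly_report:
  function_groups: [dataframe, reporting, email]
```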
Built-In Observability
Every pipeline run gets a UUID correlation ID that flows through five specialized log channels.
Full observability: the orchestrator calls write_logs() at the end of execution — or on failure — persisting everything automatically.
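In spirit, it looks something like this — the channel names and the write_logs() body below are stand-ins, not the framework's actual implementation:

```python
import json
import uuid

def write_logs(channels: dict) -> None:
    # Stand-in for the framework's write_logs(); here it just dumps to stdout.
    print(json.dumps(channels, indent=2))

correlation_id = str(uuid.uuid4())  # one ID per pipeline run

# Illustrative channel names — the article's five channels aren't reproduced here.
channels = {name: [] for name in ("execution", "timing", "error", "audit", "data")}

def log(channel: str, message: str) -> None:
    channels[channel].append({"correlation_id": correlation_id, "message": message})

try:
    log("execution", "step 1 started")
    # ... pipeline steps run here ...
finally:
    write_logs(channels)  # called at the end of execution — or on failure
```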
Batch Mode and Restartability
For large-volume pipelines, batch orchestration splits work into configurable batch sizes, loops over pipeline steps per batch, and checkpoints state between iterations. On failure, rerun=True resumes from the last checkpoint — no reprocessing of completed batches.
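Conceptually it works like the sketch below — the checkpoint location and function shape are assumptions; rerun is the flag the framework actually exposes:

```python
import json
import os

CHECKPOINT_PATH = "/dbfs/tmp/pipeline_checkpoint.json"  # illustrative location

def run_in_batches(items, batch_size, run_steps, rerun=False):
    start = 0
    if rerun and os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            start = json.load(f)["next_index"]            # resume from the last checkpoint
    for i in range(start, len(items), batch_size):
        run_steps(items[i:i + batch_size])                # run the pipeline steps for this batch
        with open(CHECKPOINT_PATH, "w") as f:
            json.dump({"next_index": i + batch_size}, f)  # checkpoint state between iterations
```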
The Result
A framework where exec() and eval() aren't code smells — they're the architecture. Dynamic code loading from any branch, YAML-driven orchestration, immutable production deploys, and a function library shared across 80+ pipelines.
Sometimes the "dangerous" tools are exactly the right ones — when the design is intentional and the boundaries are clear.