Databricks Modular Framework

The Problem

Every new data pipeline meant writing boilerplate — imports, config parsing, error handling, logging. We needed a framework where pipelines are configuration, not code — but without sacrificing the flexibility that custom Python gives you.

The Architecture

It's a dependency injection-based orchestration framework running on Databricks. A notebook passes globals() — including spark, dbutils, and display — into the orchestrator. From there, the framework takes over entirely.
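In notebook terms, the hand-off can be sketched like this. The `Orchestrator` class and its methods are hypothetical stand-ins; on a real cluster the dictionary passed in would be the notebook's actual `globals()`, which already contains `spark`, `dbutils`, and `display`:

```python
class Orchestrator:
    """Hypothetical entry point -- names are illustrative, not the framework's API."""

    def __init__(self, notebook_globals):
        # Capture the notebook namespace; spark/dbutils/display live here
        # on a real Databricks cluster, so nothing needs to be imported.
        self.ctx = notebook_globals

    def get(self, name):
        return self.ctx.get(name)

# In a notebook this would be: orchestrator = Orchestrator(globals())
fake_globals = {"spark": "SparkSession", "dbutils": "DBUtils"}
orchestrator = Orchestrator(fake_globals)
print(orchestrator.get("spark"))
```

Because the framework only ever reaches Spark through this injected namespace, the same code runs unchanged whether the namespace came from a dev notebook or a production job.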

Two execution paths, same engine:

Third-party Python modules are pre-installed in a shared location; at runtime, the framework only needs to add that location to the module search path for them to be found.

🔵 Development — The orchestrator fetches all code and config from any GitHub branch at runtime. Developers test full end-to-end pipelines on their feature branch without deploying anything. No shared environment collisions.

🟢 Production — CI/CD packages the repo into an immutable zip file, uploads it to DBFS, and the orchestrator reads from that artifact. No GitHub access at runtime. If the zip is there, the pipeline runs exactly as packaged.

Both paths converge on the same pipeline engine, which iterates through an ordered set of function calls defined in YAML.

The exec() and eval() Secret Sauce

Functions are standalone .py files — one file, one function, no class hierarchies. They're registered into globals() at runtime by the dependency framework. A function written once is reused across pipelines without copying or importing.

At runtime, the framework fetches each function file, executes it with exec() to register the definition, and then extracts the callable by name with eval().
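A minimal sketch of that mechanism, with the GitHub/DBFS fetch mocked as a string (the function and variable names are illustrative):

```python
def load_function(source_code, func_name, namespace):
    # exec() compiles and runs the fetched .py file inside the shared namespace,
    # registering the function definition there.
    exec(source_code, namespace)
    # eval() then extracts the callable by name from that same namespace.
    return eval(func_name, namespace)

# Simulate a one-function .py file fetched from a branch or zip.
fetched = "def add_one(x):\n    return x + 1\n"
registry = {}
add_one = load_function(fetched, "add_one", registry)
print(add_one(41))  # → 42
```

Swapping the branch swaps `fetched`, and every pipeline downstream picks up the new definition without any import changes.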

This is what makes functions hot-loadable from any branch. There's no static import tree. A developer on a feature branch gets their version of every function without touching production. The same mechanism loads the pipeline execution engine itself — ProcessPipeline.py is fetched and executed via exec(), then the callable is extracted with eval().

eval() also handles parameter deserialization — after variable substitution converts dicts and lists to strings, eval() parses them back to their original Python types.
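A small illustration of that round-trip; the heuristic for deciding which strings get parsed is an assumption here, not the framework's actual rule:

```python
raw_params = {
    "columns": "['id', 'name']",   # list stringified by variable substitution
    "options": "{'merge': True}",  # dict stringified by variable substitution
    "table": "sales.orders",       # plain string, left untouched
}

# Assumed heuristic: only values that look like Python literals are eval()'d
# back into their original types.
parsed = {k: eval(v) if v.lstrip().startswith(("[", "{")) else v
          for k, v in raw_params.items()}
print(parsed["columns"])  # → ['id', 'name']
```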

Pipelines Are Pure YAML

Each pipeline is a folder of numbered YAML files; the ordinal in the filename determines execution order.

Outputs are stored in pipeline_results[ordinal]. Downstream steps reference upstream results by number — creating a lightweight DAG inside every pipeline. The enabled: false flag lets you skip steps without deleting config.
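A hypothetical folder and step file, with invented names and an assumed results-reference syntax (the article does not show the exact schema):

```yaml
# Hypothetical layout -- folder and file names are illustrative.
# pipelines/daily_orders/
#   010_extract.yaml
#   020_transform.yaml
#   030_write.yaml

# 020_transform.yaml -- one ordered function call per file
function: df_filter
enabled: true                       # set to false to skip without deleting config
parameters:
  input: $pipeline_results[010]     # output of the extract step (assumed syntax)
  condition: "order_date = '$date'"
```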

Five Layers of Variable Resolution

The variable substitution engine runs a multi-pass resolution across all parameters:

  1. Function expressions — $(date_tz_format:America/Toronto;%Y%m%d) calls a registered Python function with arguments. Supports date formatting, Spark SQL execution, and vault secrets.
  2. Pipeline variables — $source_path, $date — defined in per-pipeline, per-environment YAML.
  3. Global environment variables — $env.storage_account — shared across all pipelines for a given environment.
  4. Vault secrets — $(vault_secret:scope;secret_name) — resolved at runtime from a secure backend.
  5. Runtime overrides — JSON passed at job trigger time overwrites any pipeline variable on-the-fly.

Same YAML works in dev and prod. Only the environment config changes.
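A toy two-layer resolver showing why pass order matters (`$env.` must be handled before bare `$name`, or the bare pass would consume the `env` token); the real engine runs five passes:

```python
import re

def substitute(value, pipeline_vars, env_vars):
    """Toy resolver -- the real engine also handles functions, secrets, overrides."""
    # Pass 1: global environment variables -- $env.name (must run first).
    value = re.sub(r"\$env\.(\w+)", lambda m: env_vars[m.group(1)], value)
    # Pass 2: per-pipeline variables -- $name (unknown names left untouched).
    value = re.sub(r"\$(\w+)",
                   lambda m: pipeline_vars.get(m.group(1), m.group(0)), value)
    return value

path = substitute("abfss://data@$env.storage_account/$source_path",
                  {"source_path": "raw/orders"},
                  {"storage_account": "proddatalake"})
print(path)  # → abfss://data@proddatalake/raw/orders
```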

Extending the Framework Without Changing It

Adding a new capability to the framework takes three steps:

  1. Drop a .py file in Functions/
  2. Register it in repo.yaml
  3. Reference it from pipeline YAML

That's it. The framework itself doesn't change. The dependency declaration model (pipeline_dependencies.yaml) maps each pipeline to its required function groups. A pipeline that needs [dts, dataframe] won't load email, reporting, or database functions. Namespace stays lean, startup stays fast.
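The group-filtering idea can be sketched as follows (the pipeline, group, and function names here are invented):

```python
# Hypothetical contents of pipeline_dependencies.yaml and repo.yaml,
# inlined as dicts for illustration.
pipeline_dependencies = {
    "daily_orders": ["dts", "dataframe"],
    "weekly_report": ["dts", "email", "reporting"],
}
function_groups = {
    "dts": ["date_tz_format", "run_sql"],
    "dataframe": ["df_filter", "df_merge"],
    "email": ["send_email"],
    "reporting": ["build_report"],
}

def functions_for(pipeline):
    # Load only the groups the pipeline declares -- keeps the namespace lean.
    return [f for g in pipeline_dependencies[pipeline] for f in function_groups[g]]

print(functions_for("daily_orders"))
# → ['date_tz_format', 'run_sql', 'df_filter', 'df_merge']
```

Note that `email`, `reporting`, and their functions never enter the `daily_orders` namespace at all.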

Built-In Observability

Every pipeline run gets a UUID correlation ID that flows through five specialized log channels:

  • Stdout — real-time notebook output
  • Pipeline logs — all entries persisted to a Delta table
  • Business stats — pipeline-level metrics
  • Function timing — per-function duration and error tracking
  • Table operations — Delta table write/merge/optimize metrics

Full observability: the orchestrator calls write_logs() at the end of execution — or on failure — persisting everything automatically.
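A minimal sketch of a correlated multi-channel logger; channel names mirror the article, but the real framework persists to Delta tables rather than counting entries, and stdout is just ordinary notebook output:

```python
import time
import uuid

class RunLogger:
    """Hypothetical logger -- the real write_logs() persists to Delta tables."""

    def __init__(self):
        # One correlation ID per run, shared by every channel.
        self.correlation_id = str(uuid.uuid4())
        self.channels = {"pipeline": [], "stats": [], "timing": [], "tables": []}

    def log(self, channel, message):
        self.channels[channel].append(
            {"correlation_id": self.correlation_id, "ts": time.time(), "msg": message})

    def write_logs(self):
        # Stand-in for the Delta write: report how much each channel captured.
        return {ch: len(entries) for ch, entries in self.channels.items()}

logger = RunLogger()
logger.log("pipeline", "step 010 complete")
logger.log("timing", "df_filter took 1.2s")
print(logger.write_logs())  # → {'pipeline': 1, 'stats': 0, 'timing': 1, 'tables': 0}
```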

Batch Mode and Restartability

For large-volume pipelines, batch orchestration splits work into configurable batch sizes, loops over pipeline steps per batch, and checkpoints state between iterations. On failure, rerun=True resumes from the last checkpoint — no reprocessing of completed batches.
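A sketch of the checkpointed loop under assumed names, with a local JSON file standing in for the real checkpoint store:

```python
import json
import os
import tempfile

def run_batches(items, batch_size, checkpoint_path, process, rerun=False):
    """Hypothetical batch loop -- the real engine runs pipeline steps per batch."""
    start = 0
    if rerun and os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]  # resume past completed batches
    for i in range(start, len(items), batch_size):
        process(items[i:i + batch_size])        # run the work for this batch
        with open(checkpoint_path, "w") as f:
            json.dump({"next_index": i + batch_size}, f)  # checkpoint on success

ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
done = []
run_batches(list(range(10)), 3, ckpt, done.extend)
print(done)        # all ten items, in order
run_batches(list(range(10)), 3, ckpt, done.extend, rerun=True)
print(len(done))   # → 10 (checkpoint says everything is done; nothing reruns)
```

Because the checkpoint is written only after a batch succeeds, a failure mid-run leaves it pointing at the first unfinished batch, which is exactly where `rerun=True` picks up.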


The Result

A framework where exec() and eval() aren't code smells — they're the architecture. Dynamic code loading from any branch, YAML-driven orchestration, immutable production deploys, and a function library shared across 80+ pipelines.

Sometimes the "dangerous" tools are exactly the right ones — when the design is intentional and the boundaries are clear.
