Databricks Modular Framework

The Problem

Every new data pipeline meant writing boilerplate — imports, config parsing, error handling, logging. We needed a framework where pipelines are configuration, not code — but without sacrificing the flexibility that custom Python gives you.

The Architecture

It's a dependency injection-based orchestration framework running on Databricks. A notebook passes globals() — including spark, dbutils, and display — into the orchestrator. From there, the framework takes over entirely.
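In notebook terms, the hand-off can be sketched like this. The `Orchestrator` class and its methods are hypothetical stand-ins; on a real cluster the dictionary passed in would be the notebook's actual `globals()`, which already contains `spark`, `dbutils`, and `display`:

```python
class Orchestrator:
    """Hypothetical entry point -- names are illustrative, not the framework's API."""

    def __init__(self, notebook_globals):
        # Capture the notebook namespace; spark/dbutils/display live here
        # on a real Databricks cluster, so nothing needs to be imported.
        self.ctx = notebook_globals

    def get(self, name):
        return self.ctx.get(name)

# In a notebook this would be: orchestrator = Orchestrator(globals())
fake_globals = {"spark": "SparkSession", "dbutils": "DBUtils"}
orchestrator = Orchestrator(fake_globals)
print(orchestrator.get("spark"))
```

Because the framework only ever reaches Spark through this injected namespace, the same code runs unchanged whether the namespace came from a dev notebook or a production job.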

Two execution paths, same engine:

Third-party Python modules are pre-installed in a shared location; at runtime, the framework only needs to add that location to the module search path for them to be found.

🔵 Development — The orchestrator fetches all code and config from any GitHub branch at runtime. Developers test full end-to-end pipelines on their feature branch without deploying anything. No shared environment collisions.

🟢 Production — CI/CD packages the repo into an immutable zip file, uploads it to DBFS, and the orchestrator reads from that artifact. No GitHub access at runtime. If the zip is there, the pipeline runs exactly as packaged.

Both paths converge on the same pipeline engine, which iterates through an ordered set of function calls defined in YAML.

The exec() and eval() Secret Sauce

Functions are standalone .py files — one file, one function, no class hierarchies. They're registered into globals() at runtime by the dependency framework. A function written once is reused across pipelines without copying or importing.

At runtime, the framework fetches each function file, executes it with exec() to register the definition, and then extracts the callable by name with eval().
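A minimal sketch of that mechanism, with the GitHub/DBFS fetch mocked as a string (the function and variable names are illustrative):

```python
def load_function(source_code, func_name, namespace):
    # exec() compiles and runs the fetched .py file inside the shared namespace,
    # registering the function definition there.
    exec(source_code, namespace)
    # eval() then extracts the callable by name from that same namespace.
    return eval(func_name, namespace)

# Simulate a one-function .py file fetched from a branch or zip.
fetched = "def add_one(x):\n    return x + 1\n"
registry = {}
add_one = load_function(fetched, "add_one", registry)
print(add_one(41))  # → 42
```

Swapping the branch swaps `fetched`, and every pipeline downstream picks up the new definition without any import changes.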

This is what makes functions hot-loadable from any branch. There's no static import tree. A developer on a feature branch gets their version of every function without touching production. The same mechanism loads the pipeline execution engine itself — ProcessPipeline.py is fetched and executed via exec(), then the callable is extracted with eval().

eval() also handles parameter deserialization — after variable substitution converts dicts and lists to strings, eval() parses them back to their original Python types.
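A small illustration of that round-trip; the heuristic for deciding which strings get parsed is an assumption here, not the framework's actual rule:

```python
raw_params = {
    "columns": "['id', 'name']",   # list stringified by variable substitution
    "options": "{'merge': True}",  # dict stringified by variable substitution
    "table": "sales.orders",       # plain string, left untouched
}

# Assumed heuristic: only values that look like Python literals are eval()'d
# back into their original types.
parsed = {k: eval(v) if v.lstrip().startswith(("[", "{")) else v
          for k, v in raw_params.items()}
print(parsed["columns"])  # → ['id', 'name']
```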

Pipelines Are Pure YAML

Each pipeline is a folder of numbered YAML files; the ordinal in the filename determines execution order.

Outputs are stored in pipeline_results[ordinal]. Downstream steps reference upstream results by number — creating a lightweight DAG inside every pipeline. The enabled: false flag lets you skip steps without deleting config.
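A hypothetical folder and step file, with invented names and an assumed results-reference syntax (the article does not show the exact schema):

```yaml
# Hypothetical layout -- folder and file names are illustrative.
# pipelines/daily_orders/
#   010_extract.yaml
#   020_transform.yaml
#   030_write.yaml

# 020_transform.yaml -- one ordered function call per file
function: df_filter
enabled: true                       # set to false to skip without deleting config
parameters:
  input: $pipeline_results[010]     # output of the extract step (assumed syntax)
  condition: "order_date = '$date'"
```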

Five Layers of Variable Resolution

The variable substitution engine runs a multi-pass resolution across all parameters:

  1. Function expressions — $(date_tz_format:America/Toronto;%Y%m%d) calls a registered Python function with arguments. Supports date formatting, Spark SQL execution, and vault secrets.
  2. Pipeline variables — $source_path, $date — defined in per-pipeline, per-environment YAML.
  3. Global environment variables — $env.storage_account — shared across all pipelines for a given environment.
  4. Vault secrets — $(vault_secret:scope;secret_name) — resolved at runtime from a secure backend.
  5. Runtime overrides — JSON passed at job trigger time overwrites any pipeline variable on-the-fly.

Same YAML works in dev and prod. Only the environment config changes.
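A toy two-layer resolver showing why pass order matters (`$env.` must be handled before bare `$name`, or the bare pass would consume the `env` token); the real engine runs five passes:

```python
import re

def substitute(value, pipeline_vars, env_vars):
    """Toy resolver -- the real engine also handles functions, secrets, overrides."""
    # Pass 1: global environment variables -- $env.name (must run first).
    value = re.sub(r"\$env\.(\w+)", lambda m: env_vars[m.group(1)], value)
    # Pass 2: per-pipeline variables -- $name (unknown names left untouched).
    value = re.sub(r"\$(\w+)",
                   lambda m: pipeline_vars.get(m.group(1), m.group(0)), value)
    return value

path = substitute("abfss://data@$env.storage_account/$source_path",
                  {"source_path": "raw/orders"},
                  {"storage_account": "proddatalake"})
print(path)  # → abfss://data@proddatalake/raw/orders
```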

Extending the Framework Without Changing It

Adding a new capability to the framework takes three steps:

  1. Drop a .py file in Functions/
  2. Register it in repo.yaml
  3. Reference it from pipeline YAML

That's it. The framework itself doesn't change. The dependency declaration model (pipeline_dependencies.yaml) maps each pipeline to its required function groups. A pipeline that needs [dts, dataframe] won't load email, reporting, or database functions. Namespace stays lean, startup stays fast.
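The group-filtering idea can be sketched as follows (the pipeline, group, and function names here are invented):

```python
# Hypothetical contents of pipeline_dependencies.yaml and repo.yaml,
# inlined as dicts for illustration.
pipeline_dependencies = {
    "daily_orders": ["dts", "dataframe"],
    "weekly_report": ["dts", "email", "reporting"],
}
function_groups = {
    "dts": ["date_tz_format", "run_sql"],
    "dataframe": ["df_filter", "df_merge"],
    "email": ["send_email"],
    "reporting": ["build_report"],
}

def functions_for(pipeline):
    # Load only the groups the pipeline declares -- keeps the namespace lean.
    return [f for g in pipeline_dependencies[pipeline] for f in function_groups[g]]

print(functions_for("daily_orders"))
# → ['date_tz_format', 'run_sql', 'df_filter', 'df_merge']
```

Note that `email`, `reporting`, and their functions never enter the `daily_orders` namespace at all.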

Built-In Observability

Every pipeline run gets a UUID correlation ID that flows through five specialized log channels:

  • Stdout — real-time notebook output
  • Pipeline logs — all entries persisted to a Delta table
  • Business stats — pipeline-level metrics
  • Function timing — per-function duration and error tracking
  • Table operations — Delta table write/merge/optimize metrics

Full observability: the orchestrator calls write_logs() at the end of execution — or on failure — persisting everything automatically.
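A minimal sketch of a correlated multi-channel logger; channel names mirror the article, but the real framework persists to Delta tables rather than counting entries, and stdout is just ordinary notebook output:

```python
import time
import uuid

class RunLogger:
    """Hypothetical logger -- the real write_logs() persists to Delta tables."""

    def __init__(self):
        # One correlation ID per run, shared by every channel.
        self.correlation_id = str(uuid.uuid4())
        self.channels = {"pipeline": [], "stats": [], "timing": [], "tables": []}

    def log(self, channel, message):
        self.channels[channel].append(
            {"correlation_id": self.correlation_id, "ts": time.time(), "msg": message})

    def write_logs(self):
        # Stand-in for the Delta write: report how much each channel captured.
        return {ch: len(entries) for ch, entries in self.channels.items()}

logger = RunLogger()
logger.log("pipeline", "step 010 complete")
logger.log("timing", "df_filter took 1.2s")
print(logger.write_logs())  # → {'pipeline': 1, 'stats': 0, 'timing': 1, 'tables': 0}
```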

Batch Mode and Restartability

For large-volume pipelines, batch orchestration splits work into configurable batch sizes, loops over pipeline steps per batch, and checkpoints state between iterations. On failure, rerun=True resumes from the last checkpoint — no reprocessing of completed batches.
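A sketch of the checkpointed loop under assumed names, with a local JSON file standing in for the real checkpoint store:

```python
import json
import os
import tempfile

def run_batches(items, batch_size, checkpoint_path, process, rerun=False):
    """Hypothetical batch loop -- the real engine runs pipeline steps per batch."""
    start = 0
    if rerun and os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]  # resume past completed batches
    for i in range(start, len(items), batch_size):
        process(items[i:i + batch_size])        # run the work for this batch
        with open(checkpoint_path, "w") as f:
            json.dump({"next_index": i + batch_size}, f)  # checkpoint on success

ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
done = []
run_batches(list(range(10)), 3, ckpt, done.extend)
print(done)        # all ten items, in order
run_batches(list(range(10)), 3, ckpt, done.extend, rerun=True)
print(len(done))   # → 10 (checkpoint says everything is done; nothing reruns)
```

Because the checkpoint is written only after a batch succeeds, a failure mid-run leaves it pointing at the first unfinished batch, which is exactly where `rerun=True` picks up.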


The Result

A framework where exec() and eval() aren't code smells — they're the architecture. Dynamic code loading from any branch, YAML-driven orchestration, immutable production deploys, and a function library shared across 80+ pipelines.

Sometimes the "dangerous" tools are exactly the right ones — when the design is intentional and the boundaries are clear.
