Simplifying Config Management in Databricks: A JSON + Git + CI/CD Approach
Managing configuration in data engineering workflows is often overlooked, yet it is critical for reliability, scalability, and ease of maintenance. Many teams store configuration data in Delta Lake tables or shared notebooks, a practice that seems convenient at first but often leads to long-term challenges.
This article explores a more streamlined, maintainable approach: version-controlled JSON configuration files deployed through CI/CD pipelines.
Common Practices and Their Limitations
1. Using Shared Configuration Notebooks
Some teams centralize configs in a shared notebook that multiple jobs pull in with %run. For example:
# config_notebook.py
bronze_path = "/mnt/data/bronze"
batch_size = 1000
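A consuming job might then look like the sketch below. This is illustrative only: the notebook name and relative path are assumptions, and note that in Databricks %run must sit alone in its own cell.
# Cell 1 of a consuming job notebook (%run must be alone in its cell)
%run ./config_notebook
# Cell 2: variables defined in config_notebook are now in this notebook's scope
df = spark.read.format("parquet").load(bronze_path)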
While easy to implement, this approach is prone to:
- Hidden coupling: every job that %runs the notebook silently depends on its internal variables.
- No environment separation: dev, test, and prod values end up hard-coded in a single shared notebook.
- Weak change control: edits take effect immediately for all consumers, with no review gate before they land.
2. Storing Configurations in Delta Tables
A popular approach involves storing configurations such as paths, flags, or parameters in Delta tables:
# Read the key-value config table and collect it to the driver as a dict
config_df = spark.read.format("delta").load("/mnt/config_table")
config_dict = {row['key']: row['value'] for row in config_df.collect()}
bronze_path = config_dict["bronze_path"]
While functional, this method introduces several issues:
- Reading configuration requires a running cluster and a Spark job, adding latency and cost for what is essentially a key-value lookup.
- Changes are hard to review: an UPDATE against a Delta table produces no pull-request diff.
- Configuration is not versioned alongside the code that depends on it, making rollbacks and environment promotion error-prone.
A Better Alternative: JSON Configuration Files with Git and CICD
A simpler, more robust method is to manage configuration in structured JSON files stored in Git, with automated deployment via CI/CD pipelines (e.g., Azure DevOps).
Example JSON Configuration (config.json)
{
  "bronze_path": "/mnt/data/bronze",
  "silver_path": "/mnt/data/silver",
  "gold_path": "/mnt/data/gold",
  "batch_size": 500
}
How It Works
1. Store config.json in the same Git repository as your notebooks and pipeline code.
2. Propose configuration changes through pull requests, so every change is reviewed and recorded in Git history.
3. On merge, a CI/CD pipeline (e.g., Azure DevOps) syncs the repository into the Databricks workspace via Repos, making the approved values available to notebooks. A deployment step might look like the sketch below.
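As a minimal sketch of such a step (not part of the original setup), a pipeline job could call the Databricks Repos API to check out the latest main branch in the workspace. The environment variable names and the repo ID are assumptions supplied by the pipeline:
# deploy_config.py -- illustrative CI/CD sync step; names are assumptions
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-xxxx.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]  # PAT or service principal token from pipeline secrets
repo_id = os.environ["REPO_ID"]         # numeric ID of the workspace repo

# PATCH /api/2.0/repos/{repo_id} checks out the given branch in the workspace repo
resp = requests.patch(
    f"{host}/api/2.0/repos/{repo_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={"branch": "main"},
)
resp.raise_for_status()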
Reading JSON Config in PySpark Notebooks
Here is a sample code snippet to read and use configuration from a JSON file in a Databricks notebook:
import json

# Path to config file (synced into the workspace via the repo)
config_path = "/Workspace/Repos/team/project/config/config.json"

with open(config_path, 'r') as f:
    config = json.load(f)

bronze_path = config.get("bronze_path")
batch_size = config.get("batch_size", 1000)  # fall back to a default if the key is absent

df = spark.read.format("parquet").load(bronze_path)
df = df.limit(batch_size)
df.show()
This approach allows the logic in your notebooks to remain clean, readable, and decoupled from environment-specific values.
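To push the decoupling further, each environment can carry its own file, for example config.dev.json and config.prod.json (hypothetical names), selected at runtime via a job parameter. A sketch, assuming an "env" widget is passed by the job:
import json

# "env" is an assumed widget/job parameter, e.g. "dev" or "prod"
env = dbutils.widgets.get("env")
config_path = f"/Workspace/Repos/team/project/config/config.{env}.json"

with open(config_path) as f:
    config = json.load(f)

bronze_path = config.get("bronze_path")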
Benefits to Development Teams
By adopting JSON configuration files that are tracked in Git and deployed through CI/CD pipelines, teams can:
✅ Adopt configuration best practices
✅ Review and audit every configuration change through pull requests and Git history
✅ Promote the same code across environments by swapping only the config file
✅ Remove the dependency on a running cluster just to read settings
This method is simple, scalable, and aligned with modern DevOps best practices. In most cases, configuration does not belong in your data lake; it belongs in your version-controlled repository, managed like any other source asset.
Let Delta Lake do what it does best: managing data. Let Git, JSON, and CI/CD handle your configuration with confidence.