Simplifying Config Management in Databricks: A JSON + Git + CICD Approach

Managing configuration in data engineering workflows is often overlooked — yet it’s critical for reliability, scalability, and ease of maintenance. Many teams store configuration data in Delta Lake tables or shared notebooks, which may seem convenient at first but often leads to long-term challenges.

Here, we explore a more streamlined, maintainable approach: version-controlled JSON configuration files deployed through CICD pipelines.

Common Practices and Their Limitations

1. Using Shared Configuration Notebooks

Some teams centralize configs in shared notebooks, which are %run from multiple jobs. For example:

# config_notebook.py -- shared via %run from job notebooks
bronze_path = "/mnt/data/bronze"
batch_size = 1000

While easy to implement, this approach is prone to:

  • Untracked Changes: No version control unless tracked externally.
  • Tight Coupling: Changes affect all jobs unless carefully managed.
  • Poor Separation of Environments: Developers may accidentally run dev configurations in production.

2. Storing Configurations in Delta Tables

A popular approach involves storing configurations such as paths, flags, or parameters in Delta tables:

# Load the full config table and collapse it into a key/value dict
config_df = spark.read.format("delta").load("/mnt/config_table")
config_dict = {row["key"]: row["value"] for row in config_df.collect()}
bronze_path = config_dict["bronze_path"]

While functional, this method introduces several issues:

  • Overhead: Reading a Delta table for basic values like strings or integers is overkill.
  • Complexity: Requires schema enforcement, casting, and error handling.
  • Environment Confusion: Managing environment-specific configs demands additional logic or separate tables.
  • Lack of Git Tracking: You can’t easily see what changed or roll back.
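To make the casting and error-handling overhead concrete, here is a hypothetical illustration. In a generic key/value Delta layout, every value typically comes back as a string, so even a simple integer needs an explicit cast and guard:

```python
# Hypothetical key/value pairs read back from a Delta config table --
# values arrive as strings regardless of their real type.
config_dict = {"bronze_path": "/mnt/data/bronze", "batch_size": "1000"}

# Non-string values need a manual cast, plus error handling:
try:
    batch_size = int(config_dict["batch_size"])
except (KeyError, ValueError) as exc:
    raise ValueError("Invalid or missing 'batch_size' in config table") from exc
```

This boilerplate repeats for every typed value, which is part of why a structured format like JSON (below) is a better fit.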

A Better Alternative: JSON Configuration Files with Git and CICD

A simpler, more robust method is to manage configuration in structured JSON files stored in Git, with automated deployment via CICD pipelines (e.g., Azure DevOps).

Example JSON Configuration (config.json)

{
  "bronze_path": "/mnt/data/bronze",
  "silver_path": "/mnt/data/silver",
  "gold_path": "/mnt/data/gold",
  "batch_size": 500
}        

How It Works

  1. Store JSON files per environment in your Git repository.
  2. Manage changes via pull requests — track who changed what and why.
  3. Deploy the correct config to the appropriate Databricks environment (dev, QA, prod) using your CICD pipelines.
  4. Read config in notebooks using standard Python/Databricks utilities.

Reading JSON Config in PySpark Notebooks

Here is a sample code snippet to read and use configuration from a JSON file in a Databricks notebook:

import json

# Path to the config file (synced into the workspace via repo sync)
config_path = "/Workspace/Repos/team/project/config/config.json"

with open(config_path, "r") as f:
    config = json.load(f)

bronze_path = config.get("bronze_path")
batch_size = config.get("batch_size", 1000)  # fall back to a default if absent

# Use the config values in the pipeline
df = spark.read.parquet(bronze_path).limit(batch_size)
df.show()

This approach allows the logic in your notebooks to remain clean, readable, and decoupled from environment-specific values.
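One lightweight way to keep that decoupling reliable is to fail fast when a deployed config file is incomplete. A minimal sketch, where the required key names simply mirror the example config.json above:

```python
REQUIRED_KEYS = {"bronze_path", "silver_path", "gold_path", "batch_size"}

def validate_config(config: dict) -> dict:
    """Raise immediately if the config is missing any expected key."""
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"Config is missing required keys: {sorted(missing)}")
    return config
```

Calling this right after `json.load` turns a vague downstream failure into a clear error at job startup.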

Benefits to Development Teams

  • Improved Maintainability: Config values are cleanly separated and easily updated.
  • Reduced Runtime Overhead: No need to query Delta for simple values.
  • Traceability: Every change is tracked via Git history.
  • Environment Clarity: Each environment has its own JSON config.
  • CICD Compatibility: Seamlessly integrates with DevOps pipelines for automated deployment.

✅ Adopt configuration best practices

By adopting JSON configuration files that are tracked in Git and deployed through CICD pipelines, teams can:

  • Eliminate unnecessary dependencies on Delta tables for non-data configurations.
  • Improve the reliability of their environment-specific parameters.
  • Enhance collaboration and reduce deployment risks.

This method is simple, scalable, and adheres to modern DevOps best practices. In most cases, configuration does not belong in your data lake — it belongs in your version-controlled repository, managed like any other source asset.

Let Delta Lake do what it does best — managing data. Let Git, JSON, and CICD handle your configurations with confidence.

