Simplifying Config Management in Databricks: A JSON + Git + CI/CD Approach
Managing configuration in data engineering workflows is often overlooked, yet it is critical for reliability, scalability, and ease of maintenance. Many teams store configuration data in Delta Lake tables or shared notebooks, a practice that seems convenient at first but often leads to long-term challenges.
This article explores a more streamlined, maintainable approach: version-controlled JSON configuration files deployed through CI/CD pipelines.
Common Practices and Their Limitations
1. Using Shared Configuration Notebooks
Some teams centralize configs in a shared notebook that multiple jobs pull in with %run. For example:
# config_notebook.py
bronze_path = "/mnt/data/bronze"
batch_size = 1000
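A consuming job might then look like the sketch below. This is illustrative only: the notebook name and relative path are assumptions, and note that in Databricks %run must sit alone in its own cell.
# Cell 1 of a consuming job notebook (%run must be alone in its cell)
%run ./config_notebook
# Cell 2: variables defined in config_notebook are now in this notebook's scope
df = spark.read.format("parquet").load(bronze_path)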
While easy to implement, this approach is prone to:
- Hidden coupling: every job that %runs the notebook silently depends on its internal variables.
- No environment separation: dev, test, and prod values end up hard-coded in a single shared notebook.
- Weak change control: edits take effect immediately for all consumers, with no review gate before they land.
2. Storing Configurations in Delta Tables
A popular approach involves storing configurations such as paths, flags, or parameters in Delta tables:
# Read the key-value config table and collect it to the driver as a dict
config_df = spark.read.format("delta").load("/mnt/config_table")
config_dict = {row['key']: row['value'] for row in config_df.collect()}
bronze_path = config_dict["bronze_path"]
While functional, this method introduces several issues:
- Reading configuration requires a running cluster and a Spark job, adding latency and cost for what is essentially a key-value lookup.
- Changes are hard to review: an UPDATE against a Delta table produces no pull-request diff.
- Configuration is not versioned alongside the code that depends on it, making rollbacks and environment promotion error-prone.
A Better Alternative: JSON Configuration Files with Git and CICD
A simpler, more robust method is to manage configuration in structured JSON files stored in Git, with automated deployment via CI/CD pipelines (e.g., Azure DevOps).
Example JSON Configuration (config.json)
{
  "bronze_path": "/mnt/data/bronze",
  "silver_path": "/mnt/data/silver",
  "gold_path": "/mnt/data/gold",
  "batch_size": 500
}
How It Works
1. Store config.json in the same Git repository as your notebooks and pipeline code.
2. Propose configuration changes through pull requests, so every change is reviewed and recorded in Git history.
3. On merge, a CI/CD pipeline (e.g., Azure DevOps) syncs the repository into the Databricks workspace via Repos, making the approved values available to notebooks. A deployment step might look like the sketch below.
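As a minimal sketch of such a step (not part of the original setup), a pipeline job could call the Databricks Repos API to check out the latest main branch in the workspace. The environment variable names and the repo ID are assumptions supplied by the pipeline:
# deploy_config.py -- illustrative CI/CD sync step; names are assumptions
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-xxxx.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]  # PAT or service principal token from pipeline secrets
repo_id = os.environ["REPO_ID"]         # numeric ID of the workspace repo

# PATCH /api/2.0/repos/{repo_id} checks out the given branch in the workspace repo
resp = requests.patch(
    f"{host}/api/2.0/repos/{repo_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={"branch": "main"},
)
resp.raise_for_status()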
Reading JSON Config in PySpark Notebooks
Here is a sample code snippet to read and use configuration from a JSON file in a Databricks notebook:
import json

# Path to config file (synced into the workspace via the repo)
config_path = "/Workspace/Repos/team/project/config/config.json"

with open(config_path, 'r') as f:
    config = json.load(f)

bronze_path = config.get("bronze_path")
batch_size = config.get("batch_size", 1000)  # fall back to a default if the key is absent

df = spark.read.format("parquet").load(bronze_path)
df = df.limit(batch_size)
df.show()
This approach allows the logic in your notebooks to remain clean, readable, and decoupled from environment-specific values.
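To push the decoupling further, each environment can carry its own file, for example config.dev.json and config.prod.json (hypothetical names), selected at runtime via a job parameter. A sketch, assuming an "env" widget is passed by the job:
import json

# "env" is an assumed widget/job parameter, e.g. "dev" or "prod"
env = dbutils.widgets.get("env")
config_path = f"/Workspace/Repos/team/project/config/config.{env}.json"

with open(config_path) as f:
    config = json.load(f)

bronze_path = config.get("bronze_path")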
Benefits to Development Teams
By adopting JSON configuration files that are tracked in Git and deployed through CI/CD pipelines, teams can:
✅ Adopt configuration best practices
✅ Review and audit every configuration change through pull requests and Git history
✅ Promote the same code across environments by swapping only the config file
✅ Remove the dependency on a running cluster just to read settings
This method is simple, scalable, and aligned with modern DevOps best practices. In most cases, configuration does not belong in your data lake; it belongs in your version-controlled repository, managed like any other source asset.
Let Delta Lake do what it does best: managing data. Let Git, JSON, and CI/CD handle your configuration with confidence.