Data Mesh with Databricks

The Need: Let’s first understand the need for the Data Mesh paradigm. A Fortune 125 insurance client has several well-established data marts serving data to their different lines of business. These data marts were built on an on-premises Netezza platform, and data ownership follows data domains created by different analytics teams for their own purposes. The requisite object access is governed by DBAs in the on-premises environment.

However, as data volumes grew exponentially with social-profile capture introduced for ease of doing business, Netezza began to struggle. User complaints about the unavailability of streamlined data increased day by day. TCO kept rising because the servers needed regular upgrades, and at times they were not fully utilized even after those upgrades.

To overcome the scalability issues and reduce TCO, the Data and Analytics department decided to adopt Databricks and AWS cloud services after a successful POC. Since the data mart functionality was working well for consumers, the team chose a Hub & Spoke Data Mesh architecture on the Databricks Lakehouse.


Data Mesh Paradigm: Data Mesh is an architectural paradigm coined by Zhamak Dehghani in 2019, built on the following four principles:

  1. Data Domain Ownership – Data producers retain full ownership of their data domains.
  2. Data as a Product – Quality data is delivered to consumers as a product.
  3. Self-Service – Common, domain-agnostic tools and methods are provided for working with data.
  4. Federated Governance – An ecosystem that adheres to organizational rules across all domains.

The Data Mesh concept treats data as a product: it needs to be discoverable, trustworthy, self-describing, addressable, and interoperable. Besides data and metadata, a data product can contain code, dashboards, features, models, and other resources needed to create and maintain it. Two variants often seen in enterprises are the Harmonized Data Mesh and the Hub & Spoke Data Mesh.


[Figure: Harmonized vs. Hub & Spoke Data Mesh – image credit: Dremio]

Databricks Lakehouse: Databricks Lakehouse is a cloud-native data and analytics platform that combines the performance and features of a data warehouse with the low cost, flexibility, and scalability of a modern data lake. To implement a Data Mesh effectively, we need a flexible platform that ensures collaboration between data personas, delivers data quality, and facilitates interoperability and productivity across all data. The basic building block of a data mesh is the data domain, which usually comprises the four main components listed above.

Hub and Spoke Data Mesh Architecture using Databricks Lakehouse: The Infosys team implemented a Hub & Spoke Data Mesh on Databricks for this client. A centralized hub manages shareable data assets and data that does not sit logically within any domain, while separate data domains (e.g., Risk domain, Wealth domain) form the spokes. Data products are published to the hub using Databricks Unity Catalog, and Databricks Delta Sharing provides enterprise-grade, interoperable data sharing. The data hub provides generic platform services to the data domains, such as:

  • Self-service data publishing to managed locations.
  • Data cataloging, lineage, audit, and access control via Unity Catalog.

The data hub also acts as a data domain in its own right, hosting pipelines and tools for generic datasets such as field-representative information, client information, and market research.
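The publishing flow above can be sketched as the SQL a domain team would issue against Unity Catalog and Delta Sharing. The helper below only builds the statement strings; catalog, schema, table, share, and recipient names (hub, risk_gold, etc.) are hypothetical, and in a real workspace each statement would run via spark.sql() or a SQL warehouse.

```python
def publish_data_product(catalog: str, schema: str, table: str,
                         share: str, recipient: str) -> list:
    """Build the Unity Catalog / Delta Sharing SQL that exposes a
    curated domain table as a shareable data product in the hub.
    All object names are illustrative assumptions."""
    fq_table = f"{catalog}.{schema}.{table}"
    return [
        # Make the data product discoverable in the hub catalog.
        f"CREATE CATALOG IF NOT EXISTS {catalog}",
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}",
        # Governed, read-only access for consuming teams.
        f"GRANT SELECT ON TABLE {fq_table} TO `data_consumers`",
        # Delta Sharing: package the table into a share for a recipient.
        f"CREATE SHARE IF NOT EXISTS {share}",
        f"ALTER SHARE {share} ADD TABLE {fq_table}",
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}",
    ]

statements = publish_data_product(
    "hub", "risk_gold", "claims_summary", "risk_share", "wealth_domain")
for stmt in statements:
    print(stmt)
```

The point of centralizing these grants in Unity Catalog is that access control, lineage, and audit stay in one place even though each spoke domain publishes its own products.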

  • Unity Catalog provides federated governance, discovery, and lineage as a centralized service at the account level of the organization's Databricks deployment. The Databricks Lakehouse offers flexibility in how data is organized and structured, while providing a unified management infrastructure across all workloads. Databricks workspaces helped create the primary data domains, enabling data ownership and access control. A common self-service infrastructure helped automate environment provisioning and data-pipeline orchestration using built-in services such as Databricks Workflows, with deployment automation via the Databricks Terraform provider.
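As a sketch of the self-service orchestration idea, each spoke domain could receive a templated Databricks Workflows job for its pipeline. The payload below follows the general shape of a Jobs API job definition (name, tasks, dependencies, schedule); the notebook paths, domain names, and cron schedule are hypothetical illustrations, not the client's actual configuration.

```python
import json

def domain_pipeline_job(domain: str) -> dict:
    """Build an illustrative Workflows job definition for a data
    domain's daily pipeline: ingest, then publish to the hub."""
    return {
        "name": f"{domain}_daily_pipeline",
        "tasks": [
            {
                "task_key": "ingest",
                "notebook_task": {"notebook_path": f"/Domains/{domain}/ingest"},
            },
            {
                # Publishing runs only after ingestion succeeds.
                "task_key": "publish_to_hub",
                "depends_on": [{"task_key": "ingest"}],
                "notebook_task": {"notebook_path": f"/Domains/{domain}/publish"},
            },
        ],
        "schedule": {
            "quartz_cron_expression": "0 0 2 * * ?",  # daily at 02:00
            "timezone_id": "UTC",
        },
    }

print(json.dumps(domain_pipeline_job("risk"), indent=2))
```

Generating job definitions from a template like this is what makes the infrastructure self-service: a new domain gets a working, governed pipeline without hand-building orchestration each time.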

[Figure: Hub & Spoke Data Mesh on the Databricks Lakehouse – image credit: Databricks]



Concluding Remarks: Data Mesh and the Lakehouse both arose from common pain points and shortcomings of enterprise data warehouses and traditional data lakes. Data Mesh articulates the business vision and needs for improving productivity and value from data, whereas the Databricks Lakehouse provides an open and scalable foundation to meet those needs with maximum interoperability, cost-effectiveness, and simplicity.
