Data Mesh with Databricks

The Need: Let’s first understand the need for the Data Mesh paradigm. A Fortune 125 insurance client has several well-established data marts serving data to their different lines of business. These data marts were built on an on-premises Netezza platform, and data ownership follows data domains created by different analytics teams for their own purposes. The requisite object access is governed by DBAs in the on-premises environment.

However, as data volumes grew exponentially with social-profile capture introduced for ease of doing business, Netezza began to struggle. User complaints about the unavailability of streamlined data increased day by day. TCO kept rising because the servers needed regular upgrades, and at times they were not fully utilized even after those upgrades.

To overcome the scalability issues and reduce TCO, the Data and Analytics department decided to adopt Databricks and AWS cloud services after a successful POC. Since the data mart functionality was working well for consumers, the team chose a Hub & Spoke Data Mesh architecture on the Databricks Lakehouse.


Data Mesh Paradigm: Data Mesh is an architectural paradigm coined by Zhamak Dehghani in 2019, built on the following four principles:

  1. Data Domain Ownership – Data producers retain full ownership of their data domains.
  2. Data as a Product – Quality data is delivered to consumers as a product.
  3. Self-Service – Common, domain-agnostic tools and methods are provided for working with data.
  4. Federated Governance – An ecosystem that adheres to organizational rules across all domains.

The Data Mesh concept treats data as a product: it needs to be discoverable, trustworthy, self-describing, addressable, and interoperable. Besides data and metadata, a data product can contain code, dashboards, features, models, and other resources needed to create and maintain it. Two variants often seen in enterprises are the Harmonized Data Mesh and the Hub & Spoke Data Mesh.


[Figure: Harmonized vs. Hub & Spoke Data Mesh – image credit: Dremio]

Databricks Lakehouse: Databricks Lakehouse is a cloud-native data and analytics platform that combines the performance and features of a data warehouse with the low cost, flexibility, and scalability of a modern data lake. To implement a Data Mesh effectively, we need a flexible platform that ensures collaboration between data personas, delivers data quality, and facilitates interoperability and productivity across all data. The basic building block of a data mesh is the data domain, which usually comprises the four main components listed above.

Hub and Spoke Data Mesh Architecture using Databricks Lakehouse: The Infosys team implemented a Hub & Spoke Data Mesh on Databricks for this client. A centralized hub manages shareable data assets and data that does not sit logically within any domain, while separate data domains (e.g., Risk domain, Wealth domain) form the spokes. Data products are published to the hub using Databricks Unity Catalog, and Databricks Delta Sharing provides enterprise-grade, interoperable data sharing. The data hub provides generic platform services to the data domains, such as:

  • Self-service data publishing to managed locations.
  • Data cataloging, lineage, audit, and access control via Unity Catalog.

The data hub also acts as a data domain in its own right, hosting pipelines and tools for generic datasets such as field-representative information, client information, and market research.
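The publishing flow above can be sketched as the SQL a domain team would issue against Unity Catalog and Delta Sharing. The helper below only builds the statement strings; catalog, schema, table, share, and recipient names (hub, risk_gold, etc.) are hypothetical, and in a real workspace each statement would run via spark.sql() or a SQL warehouse.

```python
def publish_data_product(catalog: str, schema: str, table: str,
                         share: str, recipient: str) -> list:
    """Build the Unity Catalog / Delta Sharing SQL that exposes a
    curated domain table as a shareable data product in the hub.
    All object names are illustrative assumptions."""
    fq_table = f"{catalog}.{schema}.{table}"
    return [
        # Make the data product discoverable in the hub catalog.
        f"CREATE CATALOG IF NOT EXISTS {catalog}",
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}",
        # Governed, read-only access for consuming teams.
        f"GRANT SELECT ON TABLE {fq_table} TO `data_consumers`",
        # Delta Sharing: package the table into a share for a recipient.
        f"CREATE SHARE IF NOT EXISTS {share}",
        f"ALTER SHARE {share} ADD TABLE {fq_table}",
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}",
    ]

statements = publish_data_product(
    "hub", "risk_gold", "claims_summary", "risk_share", "wealth_domain")
for stmt in statements:
    print(stmt)
```

The point of centralizing these grants in Unity Catalog is that access control, lineage, and audit stay in one place even though each spoke domain publishes its own products.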

  • Unity Catalog provides federated governance, discovery, and lineage as a centralized service at the account level of the organization's Databricks deployment. The Databricks Lakehouse offers flexibility in how data is organized and structured, while providing a unified management infrastructure across all workloads. Databricks workspaces helped create the primary data domains, enabling data ownership and access control. A common self-service infrastructure helped automate environment provisioning and data-pipeline orchestration using built-in services such as Databricks Workflows, with deployment automation via the Databricks Terraform provider.
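As a sketch of the self-service orchestration idea, each spoke domain could receive a templated Databricks Workflows job for its pipeline. The payload below follows the general shape of a Jobs API job definition (name, tasks, dependencies, schedule); the notebook paths, domain names, and cron schedule are hypothetical illustrations, not the client's actual configuration.

```python
import json

def domain_pipeline_job(domain: str) -> dict:
    """Build an illustrative Workflows job definition for a data
    domain's daily pipeline: ingest, then publish to the hub."""
    return {
        "name": f"{domain}_daily_pipeline",
        "tasks": [
            {
                "task_key": "ingest",
                "notebook_task": {"notebook_path": f"/Domains/{domain}/ingest"},
            },
            {
                # Publishing runs only after ingestion succeeds.
                "task_key": "publish_to_hub",
                "depends_on": [{"task_key": "ingest"}],
                "notebook_task": {"notebook_path": f"/Domains/{domain}/publish"},
            },
        ],
        "schedule": {
            "quartz_cron_expression": "0 0 2 * * ?",  # daily at 02:00
            "timezone_id": "UTC",
        },
    }

print(json.dumps(domain_pipeline_job("risk"), indent=2))
```

Generating job definitions from a template like this is what makes the infrastructure self-service: a new domain gets a working, governed pipeline without hand-building orchestration each time.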

[Figure: Hub & Spoke Data Mesh on the Databricks Lakehouse – image credit: Databricks]



Concluding Remarks: Data Mesh and the Lakehouse both arose from common pain points and shortcomings of enterprise data warehouses and traditional data lakes. Data Mesh articulates the business vision and needs for improving productivity and value from data, whereas the Databricks Lakehouse provides an open and scalable foundation to meet those needs with maximum interoperability, cost-effectiveness, and simplicity.
