Data Mesh with Databricks
The Need: Let's first understand the need for the Data Mesh paradigm. A Fortune 125 insurance client has several well-established data marts catering to the data consumption needs of their different lines of business. These data marts were created on an on-prem Netezza platform, with data ownership organized around data domains created by different analytics teams for their own purposes. Object access is governed by DBAs in the on-prem environment.
However, as data volumes grew exponentially with social profile capture, Netezza started giving issues. User complaints increased day by day over the unavailability of streamlined data. TCO kept rising because the servers needed regular upgrades, and at times they were not fully utilized after each increment.
To eliminate the scalability issues and reduce TCO, the Data and Analytics department decided to adopt Databricks and AWS cloud services following a successful POC. Since the data mart functionality was working well for consumers, the team chose a Hub & Spoke Data Mesh architecture on the Databricks Lakehouse.
Data Mesh Paradigm: Data Mesh is an architectural paradigm coined by Zhamak Dehghani in 2019, built on four principles:
- Domain-oriented ownership of data
- Data as a product
- Self-serve data platform
- Federated computational governance
The Data Mesh concept treats data like a product: it needs to be discoverable, trustworthy, self-describing, addressable, and interoperable. Besides data and metadata, a data product can contain code, dashboards, features, models, and other resources needed to create and maintain it. Two popular patterns often seen in enterprises are the Harmonized Data Mesh and the Hub & Spoke Data Mesh.
Databricks Lakehouse: Databricks Lakehouse is a cloud-native data and analytics platform that combines the performance and features of a data warehouse with the low cost, flexibility, and scalability of a modern data lake. To implement a Data Mesh effectively, we need a flexible platform that enables collaboration between data personas, delivers data quality, and facilitates interoperability and productivity across all data. The basic building block of a data mesh is the data domain, which typically comprises the data products themselves together with the code, metadata, and infrastructure needed to build and govern them.
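As an illustration, a data domain on Databricks can be isolated as its own Unity Catalog catalog with governed access. This is a hedged sketch, not the client's actual setup; the names (risk_domain, curated, risk_analysts) are hypothetical, and it assumes a Unity Catalog-enabled workspace.

```sql
-- Hypothetical sketch: one Unity Catalog catalog per data domain (spoke).
CREATE CATALOG IF NOT EXISTS risk_domain
  COMMENT 'Spoke: data products owned by the Risk analytics team';

CREATE SCHEMA IF NOT EXISTS risk_domain.curated
  COMMENT 'Published, consumption-ready data products';

-- Unity Catalog privilege inheritance: SELECT granted at the schema level
-- applies to the tables within it.
GRANT USE CATALOG ON CATALOG risk_domain TO `risk_analysts`;
GRANT USE SCHEMA, SELECT ON SCHEMA risk_domain.curated TO `risk_analysts`;
```

Keeping each domain in its own catalog gives the owning team end-to-end control over its data products while the three-level namespace (catalog.schema.table) keeps them addressable across the mesh.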
Hub and Spoke Data Mesh Architecture using Databricks Lakehouse: The Infosys team implemented a Hub & Spoke Data Mesh on Databricks for this client. The hub is a centralized location for managing shareable data assets and data that does not sit logically within any single domain; separate data domains, e.g. the Risk domain and the Wealth domain, form the spokes. Data products are published to the hub through Databricks Unity Catalog, and Databricks Delta Sharing provides enterprise-grade, interoperable data sharing. The data hub provides generic platform services for the data domains, such as:
- Self-service data publishing to managed locations
- Data cataloging, lineage, audit, and access control via Unity Catalog
The data hub also acts as a data domain in its own right, hosting pipelines and tools for generic datasets such as field representative info, client info, and market research.
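The publishing step above can be sketched with Delta Sharing's SQL interface. This is an illustrative fragment under assumed names (risk_products, policy_risk_scores, wealth_domain), not the client's actual shares.

```sql
-- Hypothetical sketch: publish a domain table to the hub via Delta Sharing.
CREATE SHARE IF NOT EXISTS risk_products
  COMMENT 'Risk domain data products published to the hub';

-- Add a curated table from the owning domain to the share.
ALTER SHARE risk_products
  ADD TABLE risk_domain.curated.policy_risk_scores;

-- A recipient represents a consuming domain, e.g. the Wealth spoke.
CREATE RECIPIENT IF NOT EXISTS wealth_domain;

GRANT SELECT ON SHARE risk_products TO RECIPIENT wealth_domain;
```

Because Delta Sharing is an open protocol, the consuming domain does not have to be on Databricks; any compatible client can read the shared tables, which is what makes the sharing interoperable across the mesh.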
Concluding Remarks: Data Mesh and Lakehouse both arose due to common pain points and shortcomings of enterprise data warehouses and traditional data lakes. Data Mesh comprehensively articulates the business vision and needs for improving productivity and value from data, whereas the Databricks Lakehouse provided an open and scalable foundation to meet those needs with maximum interoperability, cost-effectiveness, and simplicity.