DATA LAKE

What is a data lake?

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.
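The "store as-is, structure later" idea can be sketched with plain files: raw records land in the lake exactly as received, and a schema is applied only when an analytic reads them (schema-on-read). Below is a minimal stdlib-only sketch; the directory layout, field names, and partition key are illustrative, not a prescribed format:

```python
import json
import tempfile
from pathlib import Path
from collections import Counter

# Land raw events in the lake exactly as received -- no upfront schema.
lake = Path(tempfile.mkdtemp()) / "raw" / "clickstream" / "dt=2024-01-01"
lake.mkdir(parents=True)

events = [
    {"user": "a", "page": "/home"},
    {"user": "b", "page": "/pricing", "referrer": "ad"},  # extra field is fine
    {"user": "a", "page": "/pricing"},
]
(lake / "events.json").write_text("\n".join(json.dumps(e) for e in events))

# Schema-on-read: structure is imposed only when the analytic runs.
rows = [json.loads(line) for line in (lake / "events.json").read_text().splitlines()]
page_views = Counter(r["page"] for r in rows)
print(page_views["/pricing"])  # 2
```

The same raw files could later feed a different analytic (say, per-user sessionization) without any re-ingestion, which is the practical payoff of storing data as-is.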

Why do you need a data lake?

Organizations that successfully generate business value from their data will outperform their peers. An Aberdeen survey found that organizations that implemented a data lake outperformed similar companies by 9% in organic revenue growth. These leaders were able to run new types of analytics, such as machine learning, over new sources stored in the data lake: log files, clickstream data, social media, and internet-connected devices. This helped them identify and act on opportunities for business growth faster by attracting and retaining customers, boosting productivity, proactively maintaining devices, and making informed decisions.

Data Lake Participants – Roles and Responsibilities

  • The Supplier has no direct knowledge of the Consumer’s needs or how they want the items presented – that is the role of the Aggregator
  • The Consumer is unaware of the Supplier and only knows what is available by interacting with the Aggregator
  • The Aggregator is driven by an understanding of the Consumer, both in knowing what they need (or may need in the future) and in how they need to see or access it; it is therefore the Aggregator that decides how to present items to the Consumer

Keeping these underlying principles in mind, the following set of responsibilities can be defined for each role (note that the embedded examples are for a Healthcare Insurance Provider):

Supplier

  • Provides a full description of what is being delivered to the Data Lake:
      • A Conceptual and Logical Model of the information in the “language” of the standard catalog that has been adopted by the Data Lake as representative of the enterprise’s business information – independent of any physical implementation
      • A set of any rules that have been placed upon the information (e.g. this source system only allows one Address per Person)
      • The set of “calculations” being provided, along with a formula of how that calculation is made – using the concepts as defined in the enterprise catalog (e.g. a count of Group Members is the sum of all Plan Members, both the Group Member, i.e. the Subscriber, as well as all the Plan Members identified on each Contract held by the Subscriber)
      • The set of “views” that are represented in the supplied information and the criteria used to generate the content of the view (e.g. all contracts of subscribers that are age 65 or over and are male)
  • Provides a full description of how the information is being delivered to the Data Lake:
      • The form (extract file, acquisition service, direct connection “pipe”, etc.)
      • The detailed format within the form that maps back to the “what” documentation presented above

Note that no transformation requirements are provided; specifying transformations is not the Supplier’s responsibility.
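One concrete way to capture the Supplier’s “what” and “how” is a machine-readable descriptor registered with the Data Lake. The structure below is a hypothetical sketch, not a standard format; the field names are invented, and the example values mirror the Healthcare Insurance Provider bullets above:

```python
# Hypothetical supplier descriptor -- field names are illustrative, not a standard.
supplier_descriptor = {
    "what": {
        "model": ["Person", "Address", "Contract", "Plan Member"],  # catalog concepts
        "rules": ["Only one Address per Person"],
        "calculations": {
            "Group Member Count": "Subscriber + all Plan Members on each Contract"
        },
        "views": {
            "Senior Male Contracts": "subscriber age >= 65 and gender == 'M'"
        },
    },
    "how": {
        "form": "extract file",  # could also be an acquisition service or direct pipe
        "format": "CSV, one row per Contract, columns mapped to catalog concepts",
    },
}

# Note what is deliberately absent: a Supplier states no transformation requirements.
assert "transformations" not in supplier_descriptor["what"]
assert "transformations" not in supplier_descriptor["how"]
```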

Consumer

  • Provides a full description of what is being requested of the Data Lake:
      • A Conceptual and Logical Model of the information in the “language” of the standard catalog that has been adopted by the Data Lake as representative of the enterprise’s business information – independent of any physical implementation
      • A set of any rules that are followed by the target, which the delivered information needs to abide by (e.g. this target system only allows one Benefit Package per Division)
      • The set of “calculations” needed by the target, along with a formula of how that calculation is made – using the concepts as defined in the enterprise catalog (e.g. a count of Group Members is the sum of all Plan Members, both the Group Member, i.e. the Subscriber, as well as all the Plan Members identified on each Contract held by the Subscriber)
      • The set of “views” that need to be provided in the supplied information and the criteria that define the view content (e.g. all contracts for an HMO product where the subscriber is female and resident in the state of Arkansas)
  • Provides a full description of how the information is desired from the Data Lake (this is highly negotiable, as the Data Lake may offer alternative delivery mechanisms or may reject the Consumer’s request):
      • The form (extract file, acquisition service, direct connection “pipe”, etc.)
      • The detailed format within the form that maps back to the “what” documentation presented above
      • If transformations are needed from what the Data Lake has agreed to make available, a description of the transformations desired

Note that in this model, even other “consolidators” (such as a Data Warehouse or Operational Data Store) are Consumers, and therefore carry the same responsibilities.
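The Consumer’s request mirrors the Supplier’s description, with two differences called out above: it is negotiable, and it may ask for transformations. A hypothetical sketch in the same invented descriptor style (field names and values are illustrative):

```python
# Hypothetical consumer request -- field names are illustrative, not a standard.
consumer_request = {
    "what": {
        "model": ["Contract", "Subscriber", "Product"],  # catalog concepts
        "rules": ["Only one Benefit Package per Division"],
        "views": {
            "AR HMO Female": "product == 'HMO' and subscriber.gender == 'F' "
                             "and subscriber.state == 'AR'"
        },
    },
    "how": {
        "form": "acquisition service",
        "format": "JSON, one document per Contract",
        # Unlike a Supplier, a Consumer may request transformations:
        "transformations": ["Roll monthly premiums up to an annual total"],
    },
    # The Data Lake may offer alternative delivery or reject the request outright.
    "negotiable": True,
}

assert "transformations" in consumer_request["how"]
```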

Aggregator

  • Ensures there are Suppliers with the items the Consumers need
  • Takes delivery from a Supplier, in whatever form that takes, and presents these items to the Consumer
  • Provides the common vocabulary (catalog) of the information currently or “aspirationally” resident in the Data Lake (this may expand as Suppliers come on board with new concepts or Consumers request new concepts):
      • A Conceptual and Logical Model
      • A set of any rules that have been placed upon the information
      • The set of “calculations” available
      • The set of “views” available
  • Provides a full description of how the information can be accessed by a Consumer and the physical mapping for where the information may be found
  • Determines the best approach for moving Supplier information to Consumer-accessible information (using its knowledge of the Consumer’s needs and how it wishes to serve the Consumer)
  • Assists both Suppliers and Consumers in representing their information using the common vocabulary
  • Provides guidance and assistance to Consumers in actually obtaining the information from the Data Lake
  • Governs all the information resident in the Data Lake

This last statement is key to the connection to Information Governance. In fact, all of these responsibility descriptions are aspects of the “decision rights” defined and controlled by a Governance Body.

The implication is that the “keepers” of the Data Lake must establish the Governance of the information housed in the lake – although it is recommended that the IG Program be created organizationally as a separate and distinct entity from the Data Lake solution owner.

You will also notice that the linchpin connecting all these roles is a Catalog that is used by all parties in their communications with the other roles. The creation and maintenance of this catalog is the responsibility of the IG Program, and I will talk more about this artifact, and its importance, in my next post.
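The Catalog’s linchpin role can be illustrated in miniature: the Aggregator can only broker between a Supplier and a Consumer when both sides describe their information in the catalog’s vocabulary, and any concept outside that vocabulary becomes a governance decision. A toy sketch, with invented concept names:

```python
# Toy catalog check: both parties must speak the catalog's vocabulary
# before the Aggregator will broker between them.
catalog = {"Person", "Address", "Contract", "Plan Member", "Product"}

def uncataloged(concepts, catalog=catalog):
    """Return the concepts a party uses that the catalog does not yet define."""
    return set(concepts) - catalog

supplied = ["Person", "Contract", "Plan Member"]
requested = ["Contract", "Product", "Claim"]   # "Claim" is not yet cataloged

assert uncataloged(supplied) == set()          # Supplier is fully cataloged
assert uncataloged(requested) == {"Claim"}     # triggers a catalog expansion request
# The Governance Body then decides whether "Claim" is added to the catalog
# or the Consumer's request is rejected.
```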
