DATA LAKE

What is a data lake?

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.
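The "store as-is, structure later" idea can be sketched with plain files: raw records land in the lake exactly as received, and a schema is applied only when an analytic reads them (schema-on-read). Below is a minimal stdlib-only sketch; the directory layout, field names, and partition key are illustrative, not a prescribed format:

```python
import json
import tempfile
from pathlib import Path
from collections import Counter

# Land raw events in the lake exactly as received -- no upfront schema.
lake = Path(tempfile.mkdtemp()) / "raw" / "clickstream" / "dt=2024-01-01"
lake.mkdir(parents=True)

events = [
    {"user": "a", "page": "/home"},
    {"user": "b", "page": "/pricing", "referrer": "ad"},  # extra field is fine
    {"user": "a", "page": "/pricing"},
]
(lake / "events.json").write_text("\n".join(json.dumps(e) for e in events))

# Schema-on-read: structure is imposed only when the analytic runs.
rows = [json.loads(line) for line in (lake / "events.json").read_text().splitlines()]
page_views = Counter(r["page"] for r in rows)
print(page_views["/pricing"])  # 2
```

The same raw files could later feed a different analytic (say, per-user sessionization) without any re-ingestion, which is the practical payoff of storing data as-is.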

Why do you need a data lake?

Organizations that successfully generate business value from their data will outperform their peers. An Aberdeen survey found that organizations that implemented a data lake outperformed similar companies by 9% in organic revenue growth. These leaders were able to run new types of analytics, such as machine learning, over new sources stored in the data lake: log files, clickstream data, social media, and internet-connected devices. This helped them identify and act on opportunities for business growth faster by attracting and retaining customers, boosting productivity, proactively maintaining devices, and making informed decisions.

Data Lake Participants – Roles and Responsibilities

  • The Supplier has no direct knowledge of the Consumer’s needs or how they want the items presented – that is the role of the Aggregator
  • The Consumer is unaware of the Supplier and only knows what is available by interacting with the Aggregator
  • The Aggregator is driven by an understanding of the Consumer, both in knowing what they need (or may need in the future) and in how they need to see or access it; it is therefore the Aggregator that decides how to present items to the Consumer

Keeping these underlying principles in mind, the following set of responsibilities can be defined for each role (note that the embedded examples are for a Healthcare Insurance Provider):

Supplier

  • Provides a full description of what is being delivered to the Data Lake:
      • A Conceptual and Logical Model of the information in the “language” of the standard catalog that has been adopted by the Data Lake as representative of the enterprise’s business information – independent of any physical implementation
      • A set of any rules that have been placed upon the information (e.g. this source system only allows one Address per Person)
      • The set of “calculations” being provided, along with a formula of how that calculation is made – using the concepts as defined in the enterprise catalog (e.g. a count of Group Members is the sum of all Plan Members, both the Group Member, i.e. the Subscriber, as well as all the Plan Members identified on each Contract held by the Subscriber)
      • The set of “views” that are represented in the supplied information and the criteria used to generate the content of the view (e.g. all contracts of subscribers that are age 65 or over and are male)
  • Provides a full description of how the information is being delivered to the Data Lake:
      • The form (extract file, acquisition service, direct connection “pipe”, etc.)
      • The detailed format within the form that maps back to the “what” documentation presented above

Note that no transformation requirements are provided; specifying transformations is not the Supplier’s responsibility.
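One concrete way to capture the Supplier’s “what” and “how” is a machine-readable descriptor registered with the Data Lake. The structure below is a hypothetical sketch, not a standard format; the field names are invented, and the example values mirror the Healthcare Insurance Provider bullets above:

```python
# Hypothetical supplier descriptor -- field names are illustrative, not a standard.
supplier_descriptor = {
    "what": {
        "model": ["Person", "Address", "Contract", "Plan Member"],  # catalog concepts
        "rules": ["Only one Address per Person"],
        "calculations": {
            "Group Member Count": "Subscriber + all Plan Members on each Contract"
        },
        "views": {
            "Senior Male Contracts": "subscriber age >= 65 and gender == 'M'"
        },
    },
    "how": {
        "form": "extract file",  # could also be an acquisition service or direct pipe
        "format": "CSV, one row per Contract, columns mapped to catalog concepts",
    },
}

# Note what is deliberately absent: a Supplier states no transformation requirements.
assert "transformations" not in supplier_descriptor["what"]
assert "transformations" not in supplier_descriptor["how"]
```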

Consumer

  • Provides a full description of what is being requested of the Data Lake:
      • A Conceptual and Logical Model of the information in the “language” of the standard catalog that has been adopted by the Data Lake as representative of the enterprise’s business information – independent of any physical implementation
      • A set of any rules that are followed by the target, which the delivered information needs to abide by (e.g. this target system only allows one Benefit Package per Division)
      • The set of “calculations” needed by the target, along with a formula of how that calculation is made – using the concepts as defined in the enterprise catalog (e.g. a count of Group Members is the sum of all Plan Members, both the Group Member, i.e. the Subscriber, as well as all the Plan Members identified on each Contract held by the Subscriber)
      • The set of “views” that need to be provided in the supplied information and the criteria that define the view content (e.g. all contracts for an HMO product where the subscriber is female and resident in the state of Arkansas)
  • Provides a full description of how the information is desired from the Data Lake (this is highly negotiable, as the Data Lake may offer alternative delivery mechanisms or may reject the Consumer’s request):
      • The form (extract file, acquisition service, direct connection “pipe”, etc.)
      • The detailed format within the form that maps back to the “what” documentation presented above
      • If transformations are needed from what the Data Lake has agreed to make available, a description of the transformations desired

Note that in this model, even other “consolidators” (such as a Data Warehouse or Operational Data Store) are Consumers, and therefore carry the same responsibilities.
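The Consumer’s request mirrors the Supplier’s description, with two differences called out above: it is negotiable, and it may ask for transformations. A hypothetical sketch in the same invented descriptor style (field names and values are illustrative):

```python
# Hypothetical consumer request -- field names are illustrative, not a standard.
consumer_request = {
    "what": {
        "model": ["Contract", "Subscriber", "Product"],  # catalog concepts
        "rules": ["Only one Benefit Package per Division"],
        "views": {
            "AR HMO Female": "product == 'HMO' and subscriber.gender == 'F' "
                             "and subscriber.state == 'AR'"
        },
    },
    "how": {
        "form": "acquisition service",
        "format": "JSON, one document per Contract",
        # Unlike a Supplier, a Consumer may request transformations:
        "transformations": ["Roll monthly premiums up to an annual total"],
    },
    # The Data Lake may offer alternative delivery or reject the request outright.
    "negotiable": True,
}

assert "transformations" in consumer_request["how"]
```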

Aggregator

  • Ensures there are Suppliers with the items the Consumers need
  • Takes delivery from a Supplier, in whatever form that takes, and presents these items to the Consumer
  • Provides the common vocabulary (catalog) of the information currently or “aspirationally” resident in the Data Lake (this may expand as Suppliers come on board with new concepts or Consumers request new concepts):
      • A Conceptual and Logical Model
      • A set of any rules that have been placed upon the information
      • The set of “calculations” available
      • The set of “views” available
  • Provides a full description of how the information can be accessed by a Consumer and the physical mapping for where the information may be found
  • Determines the best approach for moving Supplier information to Consumer-accessible information (using its knowledge of the Consumer’s needs and how it wishes to serve the Consumer)
  • Assists both Suppliers and Consumers in representing their information using the common vocabulary
  • Provides guidance and assistance to Consumers in actually obtaining the information from the Data Lake
  • Governs all the information resident in the Data Lake

This last statement is key to the connection to Information Governance. In fact, all of these responsibility descriptions are aspects of the “decision rights” defined and controlled by a Governance Body.

The implication is that the “keepers” of the Data Lake must establish the Governance of the information housed in the lake – although it is recommended that the IG Program be created organizationally as a separate and distinct entity from the Data Lake solution owner.

You will also notice that the linchpin connecting all these roles is a Catalog that is used by all parties in their communications with the other roles. The creation and maintenance of this catalog is the responsibility of the IG Program, and I will talk more about this artifact, and its importance, in my next post.
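The Catalog’s linchpin role can be illustrated in miniature: the Aggregator can only broker between a Supplier and a Consumer when both sides describe their information in the catalog’s vocabulary, and any concept outside that vocabulary becomes a governance decision. A toy sketch, with invented concept names:

```python
# Toy catalog check: both parties must speak the catalog's vocabulary
# before the Aggregator will broker between them.
catalog = {"Person", "Address", "Contract", "Plan Member", "Product"}

def uncataloged(concepts, catalog=catalog):
    """Return the concepts a party uses that the catalog does not yet define."""
    return set(concepts) - catalog

supplied = ["Person", "Contract", "Plan Member"]
requested = ["Contract", "Product", "Claim"]   # "Claim" is not yet cataloged

assert uncataloged(supplied) == set()          # Supplier is fully cataloged
assert uncataloged(requested) == {"Claim"}     # triggers a catalog expansion request
# The Governance Body then decides whether "Claim" is added to the catalog
# or the Consumer's request is rejected.
```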
