Architecting a Datalake
Grand Lake o' the Cherokees, Oklahoma USA

Hadoop vendors are promoting the Hadoop cluster as the datalake: the place where all types of data are brought together to provide broad access for discovering facts and patterns in data, and for ad hoc reporting. In response, traditional database vendors are coupling their database technology with Hadoop to create a datalake that combines the performance of the database with the flexibility of Hadoop. However, these solutions forget many of the lessons learned from building data warehouses.

To be useful, the datalake will include a large amount of data from many systems. That data is typically inconsistent and difficult to correlate. Yet it will include some of the most valuable and sensitive information the organization has, including personal information about customers, employees, and suppliers, as well as financial information. Most likely it will also include externally provided information. This raises a number of questions:

  • How is the right information located, covering the facts needed at the right level of confidence for the task at hand?
  • How is this information protected while still being open for sharing?
  • How does the organization encourage the sharing of insight derived from the data?

What is a datalake?

The datalake is a key enabler for an organization wishing to put data and analytics at the heart of their operation.

A data lake provides data to an organization for a variety of analytics processing including:

  • Discovery and exploration of data
  • Simple ad hoc analytics
  • Complex analysis for business decisions
  • Reporting
  • Real-time analytics

In summary, it supports deploying analytics into a managed environment to generate additional insight for the organization.

A datalake manages shared repositories of information for analytical purposes. Each data lake repository is optimized for a particular type of processing, such as real-time analytics, deep analytics (for example, data mining), exploratory analytics, OLAP, or reporting.

Data values may be replicated in multiple repositories in the data lake. However, the data lake ensures that the copying and updating of this data is managed and governed using well-defined information supply chains.
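To make the idea concrete, here is a minimal Python sketch of an information supply chain: every copy between repositories is recorded as a step, so any replica can be traced back to its source. The repository and transform names are illustrative, not part of any specific product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class SupplyChainStep:
    """One governed copy of data from one repository to another."""
    source: str
    target: str
    transform: str
    copied_at: str

@dataclass
class SupplyChain:
    """Records every managed copy so replicated data stays traceable."""
    steps: list = field(default_factory=list)

    def copy(self, source: str, target: str, transform: str = "none") -> SupplyChainStep:
        # A real implementation would move the data here; this sketch
        # only records the governance metadata for the copy.
        step = SupplyChainStep(source, target, transform,
                               datetime.now(timezone.utc).isoformat())
        self.steps.append(step)
        return step

    def lineage(self, repository: str) -> list:
        """All steps that fed data into the given repository."""
        return [s for s in self.steps if s.target == repository]

# Hypothetical flow: operational data is cleansed into the warehouse,
# then a subset is copied onward into a Hadoop sandbox.
chain = SupplyChain()
chain.copy("operational-crm", "warehouse", transform="cleanse+conform")
chain.copy("warehouse", "hadoop-sandbox", transform="subset")
```

Because every copy passes through `copy()`, the lake can answer "where did this replica come from?" rather than accumulating unmanaged duplicates.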

Information in the data lake can be accessed through different types of interfaces and provisioning mechanisms provided by the datalake services. The datalake services are supported by an information management and governance fabric such as the IBM InfoSphere Information and Integration portfolio.

 

How is the data lake used by the business?


The business sees the data lake as a single organized collection of information described in a catalog. A person or team who owns information sources they wish to share can advertise those sources in the catalog (1). Someone interested in data browses the catalog to discover the sources they need (2). The catalog lists all of the data inside the data lake along with potential information sources outside of it (3). If an information source from outside the lake is requested, the data lake may access it directly, if the information owner allows, or the data is copied into the data lake (4). Once data has been provisioned, a person can explore the data values in a sandbox (5) and build insight. If the analytical model they build has general use, it may be deployed into the datalake, or another system, to regularly deliver insight to the organization (6).
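The advertise/browse/provision steps above can be sketched as a tiny catalog API. This is a toy model under assumed names (`advertise`, `browse`, `provision`, the `external://` and `lake://` locations are invented), intended only to show how the numbered steps fit together.

```python
class Catalog:
    """Minimal data lake catalog: owners advertise sources,
    consumers browse them and request provisioning."""

    def __init__(self):
        self.entries = {}  # source name -> metadata

    def advertise(self, name, owner, location, inside_lake):
        # (1) an owner advertises an information source; (3) the catalog
        # holds sources both inside and outside the lake
        self.entries[name] = {"owner": owner, "location": location,
                              "inside_lake": inside_lake}

    def browse(self, keyword):
        # (2) a consumer discovers sources by searching the catalog
        return [name for name in self.entries if keyword in name]

    def provision(self, name, copy_into_lake=False):
        # (4) an external source is either accessed in place or,
        # with the owner's agreement, copied into the lake
        entry = self.entries[name]
        if not entry["inside_lake"] and copy_into_lake:
            entry["inside_lake"] = True
            entry["location"] = f"lake://{name}"
        return entry["location"]

# Hypothetical usage: an external feed is advertised, found, and copied in.
catalog = Catalog()
catalog.advertise("external-weather-feed", owner="ops-team",
                  location="external://weather", inside_lake=False)
matches = catalog.browse("weather")
location = catalog.provision("external-weather-feed", copy_into_lake=True)
```

Steps (5) and (6), sandbox exploration and deployment, would build on the location returned by `provision()`.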

The datalake architecture

The datalake must be architected for both big data and structured data analytics. Some use cases include processing that needs the speed and accuracy of a structured data warehouse or data mart running on a Netezza box to produce the required results in a timely and efficient manner, whereas other types of exploratory analytics need the flexibility of Hadoop. The result is that data needs to be stored on the platform most suited to it, and some of the data is required on both platforms.
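A placement rule like the one described can be sketched as a simple routing function. The tag names and the warehouse/Hadoop split are assumptions for illustration; the point is that a dataset tagged for both kinds of workload lands on both platforms, with the resulting copies managed by the governed supply chains discussed earlier.

```python
def target_repositories(dataset_tags: set) -> set:
    """Hypothetical placement rule: structured reporting and OLAP workloads
    go to the warehouse, exploratory and data-mining workloads to Hadoop,
    and data tagged for both is replicated onto both platforms."""
    targets = set()
    if dataset_tags & {"reporting", "olap"}:
        targets.add("warehouse")
    if dataset_tags & {"exploration", "data-mining"}:
        targets.add("hadoop")
    # Default to the flexible platform when no tag matches.
    return targets or {"hadoop"}
```

For example, a dataset tagged both `reporting` and `exploration` is placed on the warehouse and on Hadoop, which is exactly the "required on both platforms" case above.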

The datalake must provide the mechanisms to provision data into the repositories, manage any copies, and maintain sufficient metadata to enable people to locate, query, and extract the data they need. A properly architected datalake addresses the following challenges:

  • It creates synergies in the maintenance of the information in the data repository by using consistent mechanisms and standards for synchronizing data between the repositories and other systems.
  • It provides mechanisms for the analytics teams, and line-of-business teams, to locate and acquire the information they need in a convenient and cost-effective manner.
  • It ensures valuable or sensitive information, residing in the datalake, is properly protected.
  • It ensures all information in the data lake is governed; meaning it is traceable, has appropriate quality, is managed effectively, and is removed when no longer needed.
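The governance rules in the last bullet can be expressed as checks over an asset's metadata. The field names (`lineage`, `quality_score`, `retain_until`, and so on) are hypothetical; a real governance fabric would carry far richer metadata, but the shape of the check is the same.

```python
from datetime import date

def governance_issues(asset: dict, today: date) -> list:
    """Flag an asset that violates the governance rules above:
    traceability, quality, retention, and protection of sensitive data."""
    issues = []
    if not asset.get("lineage"):
        issues.append("no lineage recorded")
    if asset.get("quality_score") is None:
        issues.append("quality not assessed")
    retain_until = asset.get("retain_until")
    if retain_until and retain_until < today:
        issues.append("past retention date - schedule removal")
    if asset.get("sensitive") and not asset.get("access_policy"):
        issues.append("sensitive data without access policy")
    return issues
```

Running such checks across the catalog is one way to keep the lake from silently accumulating ungoverned data.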

A datalake reference architecture is a set of assets that guide an organization in the development of their big data service. Many have been developed. Good ones are deliberately provocative, to force conversation on key architectural principles and choices that an organization has to make. It is not a prepackaged solution: it documents the majority of the functions to be supported by the datalake; however, considerable implementation work remains to move data around and to support the specific governance requirements of the organization.

Further reading ...

Ten minutes with a web search engine will show you that many vendors are talking about a datalake. Some take a very restricted view: that it is simply a way of organizing a Hadoop-based repository. For a high-level architectural view, check out the following.

Gartner published an article called Gartner Says Beware of the Data Lake Fallacy, which is a very good critique of the datalake approach.

Forrester published an article in 2011 that introduced the concept of the data lake: Big Data Requires a Big New Architecture.

Booz Allen Hamilton has also published: The Data Lake: Taking Big Data Beyond the Cloud.

Conclusion

The datalake is the next logical expansion of the information needed by organizations, covering structured, unstructured, internally sourced, and externally sourced data. As such, governance and lifecycle management of the data are critical to avoid the algae clogging up the lake and turning it into a swamp.

The lessons learned from reporting systems, 4GLs, and data warehousing still apply.

More articles by Kathryn Redman MS-MIS
