Architecting a Datalake
Grand Lake o' the Cherokees, Oklahoma USA

Hadoop vendors are promoting the Hadoop cluster as the datalake: the place where all types of data are brought together to provide broad access for discovering facts and patterns in data, and for ad hoc reporting. In response, traditional database vendors are coupling their database technology with Hadoop to create a datalake that combines the performance of the database with the flexibility of Hadoop. However, these solutions forget many of the lessons learned from building data warehouses.

To be useful, the datalake will include a large amount of data from many systems. That data is typically inconsistent and difficult to correlate. Yet it will include some of the most valuable and sensitive information the organization has, including personal information about customers, employees, and suppliers, as well as financial information. Most likely it will also include externally provided information. This raises a number of questions:

  • How is the right information located, covering the facts needed at the right level of confidence for the task at hand?
  • How is this information protected while still being open for sharing?
  • How does the organization encourage the sharing of insight derived from the data?

What is a datalake?

The datalake is a key enabler for an organization wishing to put data and analytics at the heart of their operation.

A data lake provides data to an organization for a variety of analytics processing including:

  • Discovery and exploration of data
  • Simple ad hoc analytics
  • Complex analysis for business decisions
  • Reporting
  • Real-time analytics

In summary, it supports deploying analytics into a managed environment to generate additional insight for the organization.

A datalake manages shared repositories of information for analytical purposes. Each data lake repository is optimized for a particular type of processing, such as real-time analytics, deep analytics (for example, data mining), exploratory analytics, OLAP, or reporting.

Data values may be replicated in multiple repositories in the data lake. However, the data lake ensures that the copying and updating of this data is managed and governed using well-defined information supply chains.
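To make the idea concrete, here is a minimal Python sketch of an information supply chain: every copy between repositories is recorded as a step, so any replica can be traced back to its source. The repository and transform names are illustrative, not part of any specific product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class SupplyChainStep:
    """One governed copy of data from one repository to another."""
    source: str
    target: str
    transform: str
    copied_at: str

@dataclass
class SupplyChain:
    """Records every managed copy so replicated data stays traceable."""
    steps: list = field(default_factory=list)

    def copy(self, source: str, target: str, transform: str = "none") -> SupplyChainStep:
        # A real implementation would move the data here; this sketch
        # only records the governance metadata for the copy.
        step = SupplyChainStep(source, target, transform,
                               datetime.now(timezone.utc).isoformat())
        self.steps.append(step)
        return step

    def lineage(self, repository: str) -> list:
        """All steps that fed data into the given repository."""
        return [s for s in self.steps if s.target == repository]

# Hypothetical flow: operational data is cleansed into the warehouse,
# then a subset is copied onward into a Hadoop sandbox.
chain = SupplyChain()
chain.copy("operational-crm", "warehouse", transform="cleanse+conform")
chain.copy("warehouse", "hadoop-sandbox", transform="subset")
```

Because every copy passes through `copy()`, the lake can answer "where did this replica come from?" rather than accumulating unmanaged duplicates.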

Information in the data lake can be accessed through different types of interfaces and provisioning mechanisms provided by the datalake services. The datalake services are supported by an information management and governance fabric such as the IBM InfoSphere Information and Integration portfolio.

 

How is the data lake used by the business?


The business sees the data lake as a single organized collection of information described in a catalog. A person or team who owns information sources they wish to share can advertise those sources in the catalog (1). Someone interested in data browses the catalog to discover the sources they need (2). The catalog lists all of the data inside the data lake along with potential information sources outside of it (3). If an information source from outside the lake is requested, the data lake may access it directly, if the information owner allows, or the data is copied into the data lake (4). Once data has been provisioned, a person can explore the data values in a sandbox (5) and build insight. If the analytical model they build has general use, it may be deployed into the datalake, or another system, to regularly deliver insight to the organization (6).
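The advertise/browse/provision steps above can be sketched as a tiny catalog API. This is a toy model under assumed names (`advertise`, `browse`, `provision`, the `external://` and `lake://` locations are invented), intended only to show how the numbered steps fit together.

```python
class Catalog:
    """Minimal data lake catalog: owners advertise sources,
    consumers browse them and request provisioning."""

    def __init__(self):
        self.entries = {}  # source name -> metadata

    def advertise(self, name, owner, location, inside_lake):
        # (1) an owner advertises an information source; (3) the catalog
        # holds sources both inside and outside the lake
        self.entries[name] = {"owner": owner, "location": location,
                              "inside_lake": inside_lake}

    def browse(self, keyword):
        # (2) a consumer discovers sources by searching the catalog
        return [name for name in self.entries if keyword in name]

    def provision(self, name, copy_into_lake=False):
        # (4) an external source is either accessed in place or,
        # with the owner's agreement, copied into the lake
        entry = self.entries[name]
        if not entry["inside_lake"] and copy_into_lake:
            entry["inside_lake"] = True
            entry["location"] = f"lake://{name}"
        return entry["location"]

# Hypothetical usage: an external feed is advertised, found, and copied in.
catalog = Catalog()
catalog.advertise("external-weather-feed", owner="ops-team",
                  location="external://weather", inside_lake=False)
matches = catalog.browse("weather")
location = catalog.provision("external-weather-feed", copy_into_lake=True)
```

Steps (5) and (6), sandbox exploration and deployment, would build on the location returned by `provision()`.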

The datalake architecture

The datalake must be architected for both big data and structured data analytics. Some use cases include processing that needs the speed and accuracy of a structured data warehouse or data mart running on a Netezza box to produce the required results in a timely and efficient manner, whereas other types of exploratory analytics need the flexibility of Hadoop. The result is that data needs to be stored on the platform most suited to it, and some of the data is required on both platforms.
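A placement rule like the one described can be sketched as a simple routing function. The tag names and the warehouse/Hadoop split are assumptions for illustration; the point is that a dataset tagged for both kinds of workload lands on both platforms, with the resulting copies managed by the governed supply chains discussed earlier.

```python
def target_repositories(dataset_tags: set) -> set:
    """Hypothetical placement rule: structured reporting and OLAP workloads
    go to the warehouse, exploratory and data-mining workloads to Hadoop,
    and data tagged for both is replicated onto both platforms."""
    targets = set()
    if dataset_tags & {"reporting", "olap"}:
        targets.add("warehouse")
    if dataset_tags & {"exploration", "data-mining"}:
        targets.add("hadoop")
    # Default to the flexible platform when no tag matches.
    return targets or {"hadoop"}
```

For example, a dataset tagged both `reporting` and `exploration` is placed on the warehouse and on Hadoop, which is exactly the "required on both platforms" case above.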

The datalake must provide the mechanisms to provision data into the repositories, manage any copies, and maintain sufficient metadata to enable people to locate, query, and extract the data they need. A properly architected datalake addresses the following challenges:

  • It creates synergies in the maintenance of the information in the data repository by using consistent mechanisms and standards for synchronizing data between the repositories and other systems.
  • It provides mechanisms for the analytics teams, and line-of-business teams, to locate and acquire the information they need in a convenient and cost-effective manner.
  • It ensures valuable or sensitive information, residing in the datalake, is properly protected.
  • It ensures all information in the data lake is governed; meaning it is traceable, has appropriate quality, is managed effectively, and is removed when no longer needed.
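The governance rules in the last bullet can be expressed as checks over an asset's metadata. The field names (`lineage`, `quality_score`, `retain_until`, and so on) are hypothetical; a real governance fabric would carry far richer metadata, but the shape of the check is the same.

```python
from datetime import date

def governance_issues(asset: dict, today: date) -> list:
    """Flag an asset that violates the governance rules above:
    traceability, quality, retention, and protection of sensitive data."""
    issues = []
    if not asset.get("lineage"):
        issues.append("no lineage recorded")
    if asset.get("quality_score") is None:
        issues.append("quality not assessed")
    retain_until = asset.get("retain_until")
    if retain_until and retain_until < today:
        issues.append("past retention date - schedule removal")
    if asset.get("sensitive") and not asset.get("access_policy"):
        issues.append("sensitive data without access policy")
    return issues
```

Running such checks across the catalog is one way to keep the lake from silently accumulating ungoverned data.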

A datalake reference architecture is a set of assets that guide an organization in the development of their big data service. Many have been developed. Good ones are deliberately provocative, to force conversation on key architectural principles and choices that an organization has to make. It is not a prepackaged solution: it documents the majority of the functions to be supported by the datalake; however, considerable implementation work remains to move data around and to support the specific governance requirements of the organization.

Further reading ...

Ten minutes with a web search engine will show you that many vendors are talking about a datalake. Some take a very restricted view: that it is simply a way of organizing a Hadoop-based repository. For a high-level architectural view, check out the following.

Gartner published an article called Gartner Says Beware of the Data Lake Fallacy, which is a very good critique of the datalake approach.

Forrester published an article in 2011 that introduced the concept of the data lake: Big Data Requires a Big New Architecture.

Booz Allen Hamilton has also published: The Data Lake: Taking Big Data Beyond the Cloud.

Conclusion

The datalake is the next logical expansion of the information needed by organizations, covering structured, unstructured, internally sourced, and externally sourced data. As such, governance and lifecycle management of the data are critical to avoid the algae clogging up the lake and turning it into a swamp.

The lessons learned from reporting systems, 4GLs, and data warehousing still apply.

More articles by Kathryn Redman MS-MIS
