How to Avoid the Data Lake Disaster

A Guide to Keep Agile BI, Operational Analytics, Streaming Analytics and Big Data in Context

I’ve been looking into analytics and data lakes a little more, and thought I’d share my findings.

Data is like water in California. You go through a drought for years, forcing you to operate without it and work around your central source, whether that’s the water company or corporate IT. But when you get a sudden flood of it, in the form of atmospheric rivers or Big Data, you’re not prepared. Unlike floods in California, a data deluge isn’t something that happens every 200 years. Your data is growing faster than you can handle it with your current approach. So you need to plan ahead and think about your whole data architecture, and what you’re missing.

The problem is that many companies are now starting to believe that the answer to all their problems, their silver bullet, is … the data lake. I know it’s easy to be tempted by a Lady of the Data Lake rising up and handing you Excalibur, that shiny new tool that will unite all your data. But the reality is that most data lakes today are only about Big Data and answering questions you didn’t know you needed to answer. While the technologies are converging, you need to first understand your needs, and then know how to map them to the technologies, to avoid the place many companies are already heading today: their own data lake disaster.

If someone asks me to explain data and analytics to the business, I explain it this way. Behind every process, every decision, every event, is data. When different groups in the business, and in suppliers and customers, decided to use their own software, that shattered the single, real-time view of the business. Those weren’t bad decisions. It’s just reality.

When the business started asking for reports, IT decided to meet reporting needs with a data warehouse and reporting and analytics tools on top. They spent the time stitching the data back together, which usually meant doing it in batch, with data quality and other tools to clean up mistakes in data entry or outdated data, and to merge the data into a single view of the business. To get adequate performance when running many different reports against the same data, they structured the data, and often pre-computed calculations by the attributes analysts needed to drill down on, like region, product, customer, revenues or profits, to make sure they could slice and dice the data quickly. Adding new data typically took 1-3 weeks to gather requirements, and 1-3 months to change the data warehouse to add the right data behind a report. And since companies typically weren’t capturing data until they needed it, when you did add new data there usually wasn’t a historical view right away.

Over time, this led to four major challenges that were brought up roughly in this order.

  • Reports take too long: As companies tried to improve their businesses, they realized they couldn’t get new data fast enough. The reports were pretty easy to create … if you had the right data. But it takes months to get the data into the data warehouse.
  • Lack of real-time data: As companies try to improve the customer experience or operational efficiency, or meet an SLA, employees in the field need up-to-date information, or real-time visibility into the business. But the data warehouse is fundamentally batch. Some companies accelerated ETL and warehousing with specialized hardware to approach real-time, but doing so ended up costing $50,000-$100,000 per terabyte of data; in short, millions a year.
  • Lack of automation: some decisions need to happen immediately, and should be automated because if a person gets involved, it will take too long. So how do you automate decisions against live data?
  • Lack of data: storing data was expensive. Application data has been growing at 50-70% a year in many cases, and the applications often don’t store a complete history. They just store the current snapshot in time. Data warehouses by design only stored what they needed or might need. What about all the raw data you might need to use in the future? How do you build a historic view or answer questions against data you weren’t keeping? And how should you store and consume all the new types of unstructured or streaming “Big Data”?

Companies often solve these four problems differently in different groups using different technologies:

  • Agile Business Intelligence (BI): the ability to create reports in hours, not weeks or months.
  • Operational analytics (or Operational Intelligence): analytics to help improve operational decisions. Real-time visibility makes a big difference here because the sooner you see an issue, the faster you can fix it.
  • Streaming analytics: the ability to integrate and process data streams, identify events and patterns, and act in real-time either through alerting or automation.
  • Big Data: the ability to store and process massive amounts of unstructured data cost effectively.

If you want to go back in history, James Dixon, a seasoned BI and OLAP technologist whom I worked with and who is now the CTO of Pentaho, is often credited with coining the term data lake. A data lake in its pure form is a (massive) reservoir of any raw data. It helps you answer questions that you didn’t yet know you needed to ask. You can go back to all your raw data and generate new reports or analytics to get new insights that help improve your decisions, and these insights can drive automation using streaming data analytics. That’s really important, but that’s it.

Most data lakes you see today on top of Hadoop are not well suited for Agile BI, and especially poor for operational analytics. They solve the fourth challenge by storing massive amounts of data and allowing data scientists to look for new insights. Without structure of some sort, your average analyst can’t quickly find the data they need or get the performance to do any kind of real-time interactive analytics. Without real-time streaming made simple they won’t be able to perform operational analytics.

And that’s OK. Don’t expect your architecture to look like all data flowing into a massive lake and then coming out. It’s an interconnected network of different lakes, data warehouses, data marts, and whatever else you need, on top of a real-time network of data streams. So don’t force fit a data lake approach to solve all your data problems. You’ll fail.


A Guide to Keep Agile BI, Operational Analytics, Streaming Analytics and Big Data in Context

So with all that context in mind, here’s my advice on the four areas.

Agile BI

Any time you hear the need for faster reports, remember this. First, you need to replace the 1-3 weeks of requirements gathering with direct prototyping by an analyst in a tool. Second, you need to get your data directly from sources into a form that has enough structure to give you performance, not from a warehouse that takes 1-3 months to change. You can roll the changes into the warehouse later if needed. And third, Agile BI should not drive your technology decision, because there is no one tool for all “Agile BI”. It’s more of a methodology. Figure out WHY they’re asking for the report, and WHAT data they need. It’s either for longer-term analytics and planning, or it’s for operations, which requires real-time data. You will need to choose your technology based on those needs.
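
To make “direct prototyping” concrete, here’s a minimal sketch of what it can look like, assuming a raw CSV export from a source system. The file name and columns (order_date, region, revenue) are hypothetical placeholders; the same idea applies in whatever notebook or BI tool your analysts already use.

```python
# A minimal sketch of analyst-driven prototyping: pull data straight from a
# source export instead of waiting for a warehouse change. The file and
# column names are hypothetical stand-ins for your own source data.
import pandas as pd

# Load a raw export from the source system (e.g., a CSV dump from a SaaS app).
orders = pd.read_csv("orders_export.csv", parse_dates=["order_date"])

# Add just enough structure for the question at hand: a monthly grain.
orders["month"] = orders["order_date"].dt.to_period("M")

# Prototype the report directly: revenue by region and month, sliceable the
# same way an analyst would drill down in a BI tool.
report = (
    orders.groupby(["region", "month"], as_index=False)["revenue"]
          .sum()
          .sort_values(["region", "month"])
)
print(report.head())
```

The point isn’t the specific library; it’s that the analyst can iterate on the shape of the report in hours, and only then decide what belongs in the warehouse.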

Big Data (Lakes)

If you’re going to start implementing Hadoop as a data lake, first make sure you have a business reason and a sponsor to justify the project. As I’ve mentioned earlier, picking a use case to improve the customer experience (from service delivery to product satisfaction or results) will usually lead to a big win. Second, don’t be afraid to have more than one logical data lake. You might have some operational data stores that are more application-specific and structured, and they may feed into a larger (Hadoop-based) data lake over time. Third, think about how you’re going to feed the lake. There are several great new tools for integrating streaming and batch data from many of the newer sources into Hadoop, including Paxata, StreamSets, Talend and Trifacta (in alphabetical order). Fourth, you still need to think about data governance, to help people organize and find data. That space is still in its infancy, but outside of the tools from the Hadoop vendors there are tools like Waterline Data that make it much easier to catalog and search. Fifth, make sure you look at HOW you’re going to act on the findings. That means pairing it with operational analytics or streaming analytics technologies to automate decisions.
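
As a rough illustration of “feeding the lake,” here’s a minimal sketch that lands a raw event stream as Parquet using Spark Structured Streaming. The broker address, topic name, and lake paths are assumptions (and the Kafka source assumes the spark-sql-kafka package is available); the integration tools named above give you higher-level ways to do the same thing.

```python
# A minimal sketch of one way to feed a lake: continuously land raw events
# as Parquet files. All names and paths below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feed-the-lake").getOrCreate()

# Read the raw event stream (here, from a Kafka topic) with no up-front modeling.
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "customer-events")
         .load()
)

# Keep the raw payload and metadata; defer schema decisions until you know the questions.
raw = events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")

# Continuously append the raw data to the lake in columnar form.
query = (
    raw.writeStream.format("parquet")
       .option("path", "/datalake/raw/customer-events")
       .option("checkpointLocation", "/datalake/_checkpoints/customer-events")
       .start()
)
query.awaitTermination()
```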

Operational Analytics

Operational analytics does not work well on a traditional data warehouse, because operational analytics needs “fresh” up-to-date data. For some applications like procurement and inventory replenishment, it might mean visibility up to the hour. For customer service issues or industrial equipment failures it might require visibility within seconds.

Some have used specialized data warehouse appliances to “accelerate” the data warehouse architecture. As I mentioned before, that’s really expensive. It’s also not agile by default. But there are new approaches that combine the scale-out model that came with Hadoop with columnar data storage, which adds enough structure and is well suited for analytics.

Incorta is one great example for general operational analytics. There are others who have focused specifically on vertical solutions who are also worth a look if you’re interested. Incorta’s technology helps companies build analytic data lakes in a box for operational analytics that support Agile BI. Their offering combines Spark for scale-out stream processing, Parquet for columnar data lake storage, and Presto for standard SQL querying. They have connectors for various on-premises and SaaS offerings, and as they load data they auto-detect relationships and add the indexing needed to deliver performance. If you look at some of their customer implementations, they’ve been able to implement Incorta-based analytic data lakes in days, add new data in hours, stream updates in real-time, and deliver second-level performance against massive data sets.
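
To be clear, the sketch below is not Incorta’s product; it’s a minimal illustration of the general pattern that paragraph describes, assuming Spark for scale-out processing, Parquet for columnar storage, and plain SQL on top (a Presto engine could point at the same files). The paths, table, and column names are hypothetical.

```python
# A minimal sketch of the analytic-data-lake pattern: scale-out processing,
# columnar storage, standard SQL. Names and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analytic-lake").getOrCreate()

# Load operational data pulled from a source system (format is illustrative).
orders = spark.read.json("/staging/erp/orders")

# Store it in columnar form, partitioned by the attribute analysts filter on
# most, so interactive queries only scan the columns and partitions they need.
orders.write.mode("overwrite").partitionBy("region").parquet("/datalake/analytics/orders")

# Query with plain SQL; a Presto/Trino engine could sit over the same files.
spark.read.parquet("/datalake/analytics/orders").createOrReplaceTempView("orders")
top_customers = spark.sql("""
    SELECT region, customer_id, SUM(revenue) AS total_revenue
    FROM orders
    GROUP BY region, customer_id
    ORDER BY total_revenue DESC
    LIMIT 10
""")
top_customers.show()
```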

This is exactly what’s needed for operational analytics and Agile BI, and if you believe the biggest business improvements will come from improving the customer experience and operations, then you will be implementing this type of architecture. Here’s why:

  • First, companies increasingly no longer own the data format. That’s being defined by SaaS vendors, and by APIs whose requirements come from customers. These vendors and ecosystems have been making the data fit better together, which means it’s becoming easier to fit this data together without ETL.
  • Second, data is becoming more real-time. SaaS vendors and APIs are providing real-time interfaces, and so is your IT organization. So batch is no longer the only way to get at data.
  • Third, modern data technology has shown that it can scale and deliver real-time performance without star or snowflake schemas. Parquet, for example, combines years of experience with columnar stores and analytics with the Spark model of scale-out. You can now put your data in a massive public or private shared cloud and burst to get the real-time responsiveness you need. Incorta and others have proven it works.

It’s OK if you have an analytic lake and a larger Hadoop-based lake. If you’re going to onboard cloud apps, these types of analytic data lakes may be the best approach for feeding data into your larger lake, much like how an operational data store (ODS) fed into a warehouse or supported data synchronization.

Streaming Analytics

Eventually, you will want to automate people out of the decision making, because that will be the best approach. While there have been a number of streaming analytics and complex event processing vendors, Spark and a few other technologies have completely changed the market. But it’s still changing. It’s OK to invest in some of these technologies when the return on investment is clear, even for just a single application, and it will be. But you will need to future-proof your architecture in two ways.

  • Invest in a real-time integration architecture for big data that supports a host of sources and targets including your SaaS vendors and Spark.
  • Evaluate data lakes and streaming analytics technologies, not just one or the other, to support the business needs that are driving your project. Make sure the machine learning and statistics you use in your analytics on your data lake can be leveraged in your streaming analytics. The patterns you detect, like leading indicators for customer attrition or equipment failure, rely on exactly the same algorithms you’ll want to run against data/event streams to detect attrition or failure as it’s happening, as shown in the sketch after this list.
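
Here’s a minimal sketch of that idea, assuming Spark MLlib: train an attrition model on historical data in the lake, then apply the exact same pipeline to a live stream. The feature columns, label, and paths are hypothetical.

```python
# A minimal sketch of reusing one model on the lake and on the stream.
# Feature columns, label, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("attrition-model").getOrCreate()

# 1) Train on historical data already sitting in the lake.
history = spark.read.parquet("/datalake/analytics/customer_history")
assembler = VectorAssembler(
    inputCols=["logins_last_30d", "support_tickets", "days_since_order"],
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="churned")
model = Pipeline(stages=[assembler, lr]).fit(history)

# 2) Apply the exact same pipeline to the live event stream.
live = (
    spark.readStream
         .schema(history.drop("churned").schema)
         .parquet("/datalake/streaming/customer_activity")
)
scored = model.transform(live).filter("prediction = 1.0")

# Alert (or trigger automation) on customers predicted to churn right now.
alerts = scored.writeStream.format("console").outputMode("append").start()
alerts.awaitTermination()
```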
