Data Warehouse Triage

Too many data warehouses fail because they collapse under their own weight. As daily refresh processes and user demand put the warehouse under load, it becomes more fragile and less responsive, you find yourself constantly fighting performance issues, and users begin to lose confidence in the results being posted.

By far the most important thing is to work with a proven architecture and to follow the guiding principles of data warehousing to the letter.

You need a correctly sized server, a data retention plan, a backup plan, a disaster recovery plan, a job execution plan and, of course, trained admins for the server, DBMS and operations environment (job scheduler, network, FTP server, monitoring tools, etc.) before you even start thinking about designing data objects.

A data warehouse is wholly incremental. The minute you start updating facts you are adding unneeded complexity, and it is likely that your ETL won't scale as you ramp up the volume. There are good design techniques for handling any data warehouse scenario you encounter. Stay incremental, and you will sleep better at night.
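One common way to stay incremental is to drive each load from a high watermark, appending only rows newer than what the warehouse already holds. Here is a minimal sketch in Python; the function name, row shape, and `loaded_at` field are illustrative assumptions, not a specific implementation.

```python
from datetime import datetime

def incremental_load(source_rows, warehouse_rows, watermark_key="loaded_at"):
    """Append only source rows newer than the warehouse's current high watermark.

    Hypothetical sketch: rows are dicts carrying a load timestamp under
    watermark_key. Facts are never updated in place, only appended.
    """
    # Current high watermark: the latest load timestamp already in the warehouse.
    watermark = max((r[watermark_key] for r in warehouse_rows), default=datetime.min)
    # Select only rows strictly newer than the watermark and append them.
    new_rows = [r for r in source_rows if r[watermark_key] > watermark]
    warehouse_rows.extend(new_rows)
    return len(new_rows)
```

Because the load never revisits existing facts, re-running it is cheap and the work per run grows with the new data, not with the size of the warehouse.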

You need data controls and monitoring. If you have neither, you are taking the high-wire approach. Personally, I like to give my business users confidence in the quality and timeliness of my data, and I like to be transparent: the state of the data warehouse should be visible to everyone.

ETL is simple; analytics are complex. If your warehouse is designed correctly, everything is incremental and the ETL stays simple. The data controls are placed in the processes at the right points, so making decisions in your ETL is straightforward: you look to the controls you built to support correct decisions. At a minimum you need: a high watermark (used by analytics), contiguous data (gaps are embarrassing), duplicate detection (even more embarrassing), dependency checks (quality matters), required feeds (knowing that you received nothing is important), and data quality checks (required attributes, range checks, etc.). When you build good controls, they are available to the analytics too - you need to be sure you are not introducing data quality or availability issues into your analytics.
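Two of the controls named above, contiguity and duplicate detection, can be sketched in a few lines. This is a hypothetical illustration: the function names, daily grain, and dict-based rows are assumptions, not a prescribed design.

```python
from datetime import date, timedelta

def find_gaps(loaded_dates, start, end):
    """Return the dates in [start, end] with no data loaded (contiguity check)."""
    loaded = set(loaded_dates)
    gaps, day = [], start
    while day <= end:
        if day not in loaded:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps

def find_duplicates(rows, key_fields):
    """Return the fact keys that appear more than once (duplicate check)."""
    seen, dupes = set(), set()
    for r in rows:
        key = tuple(r[f] for f in key_fields)
        if key in seen:
            dupes.add(key)
        seen.add(key)
    return dupes
```

Both checks return the offending values rather than a pass/fail flag, so the same controls can feed an operator report and gate downstream analytics.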

Instrument your processes. You want to watch them execute on the monitoring tools, and you want to trend performance over time. The earlier you detect performance issues, the less pain they cause later.
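A lightweight way to instrument ETL steps is to wrap each one so its duration and row count are recorded for later trending. A minimal sketch, assuming an in-memory metrics list; in practice you would ship these records to your monitoring tool.

```python
import time
from functools import wraps

# Hypothetical metrics sink; a real pipeline would write to a monitoring system.
METRICS = []

def instrumented(step_name):
    """Decorator that records wall-clock duration and row count for an ETL step."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            METRICS.append({
                "step": step_name,
                "seconds": time.perf_counter() - start,
                "rows": len(result) if hasattr(result, "__len__") else None,
            })
            return result
        return wrapper
    return decorator
```

With every step emitting the same record shape, trending a step's runtime over weeks becomes a simple query rather than a log-spelunking exercise.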

Vendor scorecards. I keep coming back to this. You need to quantify the quality of data you receive from external vendors. These rules need to be defined as part of the SLA that you establish with the vendor. It is then your responsibility to capture these metrics and provide the vendor with a monthly scorecard. This is the only way to ensure that data quality issues are addressed. It actually helps both you and the vendor by eliminating the stream of emails you would normally send for every data quality issue. The scorecard is a shared document that you can both review, built on metrics you have already agreed upon.
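The scorecard itself can be as simple as comparing each SLA metric's pass rate against its agreed target. This sketch assumes pass/fail results per feed and percentage targets; the shapes and names are illustrative, not taken from any particular SLA.

```python
def scorecard(feed_results, sla_targets):
    """Build a monthly vendor scorecard.

    feed_results: {metric_name: [True/False per feed received]}
    sla_targets:  {metric_name: required pass percentage}
    Returns, per metric, the actual pass rate, the target, and whether it was met.
    """
    card = {}
    for metric, results in feed_results.items():
        pct = 100.0 * sum(results) / len(results) if results else 0.0
        target = sla_targets.get(metric, 100.0)
        card[metric] = {"actual": round(pct, 1), "target": target, "met": pct >= target}
    return card
```

Because both sides agreed on the metrics up front, the output needs no interpretation: a `met: False` row is a conversation starter, not an accusation.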

It is possible to have a successful data warehouse without all of this infrastructure, but personally I like to think that all of this is just common sense. I don't want to be the king of my data warehouse, where only I understand the complete picture. I want it to be well-reasoned and open for anyone to understand.

I certainly agree with defining your vendor data in your SLA! In many instances where the SLA was not clear and complete, with all the needed details, it spelled major disaster! Thanks for the post!
