Data Warehouse Triage
Too many data warehouses fail because they collapse under their own weight. As daily refresh processes and user demand put the warehouse under load, it becomes more fragile and less responsive, you find yourself constantly fighting performance issues, and users begin to lose confidence in the results being posted.
By far the most important thing is to work with a proven architecture and to follow the guiding principles of data warehousing to the letter.
You need a correctly sized server, a data retention plan, a backup plan, a disaster recovery plan, a job execution plan and, of course, trained admins for the server, DBMS, and operations environment (job scheduler, network, FTP server, monitoring tools, etc.) before you even start thinking about designing data objects.
A data warehouse is wholly incremental. The minute you start updating facts you are adding unneeded complexity, and it is likely that your ETL won't scale as you ramp up the volume. There are good design techniques for handling any data warehouse scenario you encounter. Stay incremental, and you will sleep better at night.
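To make that concrete, here is a minimal sketch of an insert-only incremental load driven by a high-watermark control table. The names (fact_sales, etl_control) are hypothetical, and SQLite stands in for whatever DBMS you actually run:

```python
import sqlite3

# A minimal sketch of an insert-only incremental load driven by a
# high-watermark control table. All names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (
        sale_id   INTEGER,
        sale_date TEXT NOT NULL,
        amount    REAL NOT NULL
    );
    CREATE TABLE etl_control (
        table_name     TEXT PRIMARY KEY,
        high_watermark TEXT NOT NULL
    );
""")

def load_increment(rows):
    """Insert only rows newer than the recorded watermark, then advance
    the watermark in the same transaction (no updates to facts, ever)."""
    cur = conn.execute(
        "SELECT high_watermark FROM etl_control WHERE table_name = 'fact_sales'")
    found = cur.fetchone()
    watermark = found[0] if found else "1900-01-01"

    new_rows = [r for r in rows if r[1] > watermark]  # ISO dates compare as text
    if not new_rows:
        return 0
    with conn:  # one transaction: facts and watermark move together
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", new_rows)
        conn.execute(
            "INSERT OR REPLACE INTO etl_control VALUES ('fact_sales', ?)",
            (max(r[1] for r in new_rows),))
    return len(new_rows)

feed = [(1, "2024-01-01", 10.0), (2, "2024-01-02", 12.5)]
print(load_increment(feed))  # 2
print(load_increment(feed))  # 0: re-running the same feed loads nothing
```

Note that re-running the same feed is harmless because the watermark only moves forward; that idempotence is a large part of why incremental loads let you sleep at night.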
You need data controls and monitoring. If you don't have either, you are taking the high-wire approach, working without a net. Personally, I like to give my business users confidence in the quality and timeliness of my data, and I like to be transparent: the state of the data warehouse should be visible to everyone.
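One inexpensive way to get that visibility is a run-status table that every load writes to and anyone can query. A minimal sketch, again with hypothetical names:

```python
import sqlite3
from datetime import datetime, timezone

# A sketch of a run-status table that every load writes to and anyone
# can query. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE load_status (
        feed_name   TEXT,
        run_started TEXT,
        run_ended   TEXT,
        rows_loaded INTEGER,
        status      TEXT    -- 'OK', 'FAILED', 'EMPTY'
    )
""")

def record_run(feed_name, started, rows_loaded, status):
    """Publish the outcome of a load so warehouse state is visible."""
    with conn:
        conn.execute(
            "INSERT INTO load_status VALUES (?, ?, ?, ?, ?)",
            (feed_name, started,
             datetime.now(timezone.utc).isoformat(), rows_loaded, status))

record_run("daily_sales_feed", "2024-01-02T06:00:00Z", 125_000, "OK")
print(conn.execute("SELECT * FROM load_status").fetchall())
```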
ETL is simple, analytics are complex. If your warehouse is designed correctly, everything is incremental and the ETL is simple. The data controls are placed at the correct points in your processes, so making decisions in your ETL is straightforward: you look to the controls you built to support correct decisions. At a minimum you need:

- High watermark (used by analytics)
- Contiguous data (gaps are embarrassing)
- Duplicate detection (even more embarrassing)
- Dependency checks (quality matters)
- Required feeds (knowing that you received nothing is important)
- Data quality rules (required attributes, range checks, etc.)

When you build good controls they are available to the analytics too, which means you can be sure you are not introducing data quality or availability issues into your analytics.
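As an illustration, the contiguity and duplicate controls can be ordinary SQL checks run as part of the load. A sketch under the same hypothetical schema (a fact table keyed by business date, a staging table keyed by the feed's natural key):

```python
import sqlite3

# Sketch of two ETL controls: contiguity (no missing business dates)
# and duplicates (no repeated natural keys in the staged feed).
# All table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (sale_date TEXT, amount REAL);
    CREATE TABLE stage_sales (natural_key TEXT, sale_date TEXT, amount REAL);
""")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [("2024-01-01", 10.0), ("2024-01-03", 9.5)])

def find_date_gaps(start_date, end_date):
    """Business dates in [start_date, end_date] with no fact rows at all."""
    cur = conn.execute("""
        WITH RECURSIVE calendar(d) AS (
            SELECT ?
            UNION ALL
            SELECT date(d, '+1 day') FROM calendar WHERE d < ?
        )
        SELECT d FROM calendar
        WHERE d NOT IN (SELECT DISTINCT sale_date FROM fact_sales)
    """, (start_date, end_date))
    return [r[0] for r in cur.fetchall()]

def find_duplicates():
    """Natural keys that appear more than once in the staged feed."""
    cur = conn.execute("""
        SELECT natural_key, COUNT(*)
        FROM stage_sales
        GROUP BY natural_key
        HAVING COUNT(*) > 1
    """)
    return cur.fetchall()

# A real load would halt or alert the moment either control trips.
print(find_date_gaps("2024-01-01", "2024-01-03"))  # ['2024-01-02']
print(find_duplicates())                           # []
```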
Instrument your processes. You want to watch them execute in your monitoring tools, and you want to trend performance over time. The earlier you can detect performance issues, the less pain they will cause later.
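A lightweight way to start, assuming nothing beyond the Python standard library, is to log the start, end, and duration of every step in a form your monitoring tools can parse and trend:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("etl")

def instrumented(step_name):
    """Log start, success/failure, and duration of an ETL step so a
    monitoring tool can alert on failures and trend run times."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            log.info("step=%s status=started", step_name)
            try:
                result = fn(*args, **kwargs)
            except Exception:
                log.exception("step=%s status=failed", step_name)
                raise
            log.info("step=%s status=ok seconds=%.2f",
                     step_name, time.perf_counter() - start)
            return result
        return inner
    return wrap

@instrumented("load_fact_sales")
def load_fact_sales():
    time.sleep(0.1)  # stand-in for the real load

load_fact_sales()
```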
Vendor scorecards. I keep coming back to this. You need to quantify the quality of the data you receive from external vendors. These rules need to be defined as part of the SLA you establish with the vendor. It is then your responsibility to capture the metrics and provide the vendor with a monthly scorecard. This is the only way to ensure that data quality issues actually get addressed, and it helps both you and the vendor by eliminating the stream of one-off emails you would otherwise send about every data quality issue. The scorecard is a shared document that you can both review, built on metrics you have already agreed upon.
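Capturing the metrics can be as simple as recording every check result during the daily loads and rolling them up monthly. A sketch with made-up vendor and check names; the real metrics would come from your SLA:

```python
from collections import defaultdict

# Hypothetical check results captured during daily loads:
# (vendor, month, check_name, passed)
check_results = [
    ("AcmeData", "2024-01", "file_on_time",              True),
    ("AcmeData", "2024-01", "file_on_time",              False),
    ("AcmeData", "2024-01", "row_count_in_range",        True),
    ("AcmeData", "2024-01", "required_fields_populated", True),
]

def monthly_scorecard(results, vendor, month):
    """Roll raw pass/fail events up into per-metric pass rates."""
    tally = defaultdict(lambda: [0, 0])  # check_name -> [passed, total]
    for v, m, check, passed in results:
        if v == vendor and m == month:
            tally[check][0] += int(passed)
            tally[check][1] += 1
    return {check: f"{p}/{t} ({100 * p / t:.0f}%)"
            for check, (p, t) in tally.items()}

print(monthly_scorecard(check_results, "AcmeData", "2024-01"))
# e.g. {'file_on_time': '1/2 (50%)', 'row_count_in_range': '1/1 (100%)', ...}
```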
It is possible to have a successful data warehouse without all of this infrastructure, but personally I like to think that all of this is just common sense. I don't want to be the king of my data warehouse, where only I understand the complete picture. I want it to be well-reasoned and open for anyone to understand.