Flexible Data Warehouse Design
Change is inevitable.
Your data warehouse should be able to rapidly adapt to changes.
In this blog post I will outline an architecture that lets you respond rapidly to changes in both the source systems and the data warehouse that your reports and cubes use as their data source.
Overview
The diagram below illustrates the architecture of a highly adaptable BI system built using SQL Server 2012 or later. It uses an incremental load process, which ensures minimal load times and optimal performance.
Data flows through this system from left to right and follows these steps: replicate, process changes, transform changes, and finally process/query.
Replicate
A real-time replication process maintains an exact copy of each source system in an operational data store (ODS). The ODS database has change data capture (CDC) enabled so that modifications made by the replication process are tracked and can later be picked up by the process that loads staging.
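Enabling CDC is done once per database and once per table. A minimal sketch, assuming an ODS database named ODS and a replicated table dbo.Customer (both names are placeholders):

```sql
-- Enable CDC on the ODS database, then on one replicated table.
USE ODS;
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Customer',
    @role_name     = NULL;  -- NULL = no gating role required to read changes
```

Each table enabled this way gets a change table and table-valued functions that the SSIS CDC components query in the next step.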
Process Changes
An ETL process built with SQL Server Integration Services (SSIS) picks up all of the changes that have occurred in the ODS since the last time it ran successfully. This is done by leveraging the change data capture (CDC) transforms in SSIS. The changes are processed into similarly structured tables in different schemas in the staging database. The only difference between the tables in staging and those in the ODS is a handful of additional audit fields and a CDC field indicating what kind of change occurred on the record (0 = no change, 1 = delete, 2 = insert, 3 = pre-update, 4 = post-update).
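A sketch of what one such staging table might look like; the schema, table, and audit column names here are illustrative, not prescribed:

```sql
-- Staging table mirroring an ODS table, plus audit and CDC columns.
CREATE TABLE stg.Customer
(
    CustomerID    int            NOT NULL,
    CustomerName  nvarchar(100)  NULL,
    -- audit fields
    LoadDateTime  datetime2      NOT NULL DEFAULT SYSDATETIME(),
    LoadBatchID   int            NULL,
    -- 0 = no change, 1 = delete, 2 = insert,
    -- 3 = pre-update, 4 = post-update
    CdcOperation  tinyint        NOT NULL
);
```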
On top of each table is a view. These views are the source for the downstream SSIS processes that further transform the data before loading it into the enterprise data warehouse (EDW). The views act as a layer of abstraction protecting those more complex SSIS processes. They allow you to adapt rapidly to source system changes: you modify the logical layer so that downstream processes do not break. Modifying a view is typically easier and faster than modifying a complex ETL process.
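For example, if the source system renames or splits a column, only the view needs to change; the SSIS packages reading from it are untouched. A sketch against the hypothetical staging table above:

```sql
-- Abstraction view over the staging table. Downstream SSIS packages
-- select from stg.vwCustomer, never from stg.Customer directly.
CREATE VIEW stg.vwCustomer
AS
SELECT
    CustomerID,
    CustomerName,
    CdcOperation
FROM stg.Customer;
```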
Transform Changes
Next, another ETL process grabs all of the changed records from the views in staging and loads them into the dimension and fact tables in the data warehouse. Because we are using CDC and SSIS, we can identify every record that has changed as well as what type of change occurred: insert, update, or even delete.
This is very powerful because it gives you the ability to process data in the appropriate manner, such as expiring records in the data warehouse that have been deleted in the source system, or tracking changes to transaction records that should not change. This would be very difficult with a traditional load control process.
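One way to act on each change type, sketched in T-SQL rather than SSIS; all object and column names (stg.vwCustomer, dw.DimCustomer, IsCurrent, ExpiredDate) are assumptions for illustration:

```sql
-- Inserts: new records in the source become new dimension rows.
INSERT INTO dw.DimCustomer (CustomerID, CustomerName, IsCurrent)
SELECT CustomerID, CustomerName, 1
FROM stg.vwCustomer
WHERE CdcOperation = 2;

-- Updates: apply the post-update image to the current row.
UPDATE d
SET    d.CustomerName = s.CustomerName
FROM   dw.DimCustomer d
JOIN   stg.vwCustomer s ON s.CustomerID = d.CustomerID
WHERE  s.CdcOperation = 4 AND d.IsCurrent = 1;

-- Deletes: expire the row instead of removing history.
UPDATE d
SET    d.IsCurrent = 0, d.ExpiredDate = SYSDATETIME()
FROM   dw.DimCustomer d
JOIN   stg.vwCustomer s ON s.CustomerID = d.CustomerID
WHERE  s.CdcOperation = 1 AND d.IsCurrent = 1;
```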
On top of each fact and dimension table in the data warehouse are views that let you adapt rapidly to changes requested by the users of the reports and cubes. For example, a business user believes the code value should be incorporated with the description for every record in a dimension table, but is not sure of the format: <code> : <description>, (<code>) <description>, etc. This logical layer lets you iterate rapidly until the best option is found, and then incorporate that logic into the ETL at a later point in time.
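Trying a candidate format is then a one-line change to the view. A sketch, with hypothetical column names:

```sql
-- Iterate on the display format in the view layer; swap the expression
-- until the business user is happy, then push the logic into the ETL.
CREATE VIEW dw.vwDimCustomer
AS
SELECT
    CustomerKey,
    CustomerCode + N' : ' + CustomerDescription AS CustomerDisplay
    -- alternative: N'(' + CustomerCode + N') ' + CustomerDescription
FROM dw.DimCustomer;
```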
Process/Query
All of the cubes and reports should be sourced from the views in the data warehouse. To maximize query performance, the views should be created with the SCHEMABINDING option so that indexes can be added to them.
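Indexing a view requires SCHEMABINDING and two-part object names, and the first index must be a unique clustered index. A sketch over a hypothetical fact table:

```sql
-- Schema-bound view over a fact table, then a unique clustered index
-- so SQL Server materializes and maintains the view's result set.
CREATE VIEW dw.vwFactSales
WITH SCHEMABINDING
AS
SELECT SalesKey, CustomerKey, SalesAmount
FROM dw.FactSales;
GO
CREATE UNIQUE CLUSTERED INDEX IX_vwFactSales
ON dw.vwFactSales (SalesKey);
```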
By leveraging two layers of abstraction (views on top of tables), one in staging and one in the data warehouse, you get a robust system that can easily be modified to accommodate changes in the source systems as well as in the reporting requirements.
In my next post I will talk about a great way to monitor the replication process to ensure changes are flowing into your ODS.
Until next time.
Anthony