How to document selections in a data warehouse build

How to document selections in a data warehouse build

How to select the transaction types you want?

You have gathered your requirements. You have designed your build (it does not matter if it’s via scripts, procedures, ELT/ETL or pipelines…)

You have translated them into a series of JOINs and decided to create selections in one or more tables to load only the transactions that were requested.

You may join additionally to lookup tables, to select the transaction types you require (which may add complexity and running time/cost to the SQL) or form the belief that the transaction codes will never change and select them directly without accessing the lookup tables.

You decide to hardcode.

Hardcoding, what is the issue?

If you write SQL code, you may have a very long query. You may produce thousands of lines. Similarly, ETL/ELT diagrams may be complex.

In both cases, the selection may not be obvious.

The log produced by the process may not be complete. When running procedures, some SQL systems will not log what SQL statements were run by the procedure, only the fact that the procedure was called.

Even if the log is complete, the code retrieved from the log may look even more complex than it did in the procedure.

But surely, I can put some comments

Indeed, and you should. The comments should be informative. They should not just say “We are selecting this transaction type or this transaction code.” They should explain why: link to external documentation, explanation of why this choice was made, who made the decision…

Comments may not be enough

Depending on how you coded, some comments may make their way to the log. This could be useful. The comments should guide updates to the code and hopefully, the external documentation will also be updated, but there are enemies of local documentation, be it in SQL code or ETL/ELT tools: migration.

Migration may be bad news

If code is converted or re-written in a migration, the comments may fall by the wayside. Conversion from one type to another (say SQL procedure to a pipeline or an ETL/ELT tool, move from one system to another) can lead to comments being left behind. This means explanations as to what transaction types or codes were selected could be lost.

A better way

Instead of using hardcoding, you may consider using a lookup table that will drive your warehouse build.

The table will list what transaction types and/or codes you need for a particular statement or ETL/ELT diagram. The advantage of such a table is that you can add columns to document why and when decisions were made. This lookup table may offer the possibility to evolve the selection over time, maybe because from a particular date, the transaction types or codes you need have changed.

Keeping track of the old selections is useful for several reasons: keep a log of changes and make it possible to reload your data if a severe problem is detected that requires a reload of the data warehouse.

The advantage of using a table is that all decisions are highlighted and documented in a single place that’s used for all processes. If one day you realise you forgot a transaction code/type, you don’t have to scour all your code base or ETL/ELT tool to find them and if necessary change them. It also means you may not have to make physical changes to ETL/ELT or code base to change your selections.

Do you need to select in the first place?

Maybe not.

Nowadays, data space is plentiful and reasonably cheap. Code written well can run quickly.

The way data is used changes a lot, sometimes in unpredictable ways.

You may decide today that for this build, you only need a certain type of transactions yet tomorrow, you may realise that other transactions you excluded are of interest. Maybe not for the current users of the data warehouse table you are building. Maybe as a source for AI.

How to avoid selection

There are 2 conflicting needs:

-          You want to build your data for all transaction types provided data is returned by your JOINs

-          You only want to expose to users those rows that correspond to the transaction types or codes identified by the requirements

A possibility is to build data for all transaction types but only expose to users the rows they requested. This can be done via a view. The view can be driven by the lookup table mentioned in “A better way.”

An advantage of this: if one day, someone realises “we forgot this transaction type”, you don’t have to rebuild anything. You don’t have to change code. The only thing you need to do: update the lookup table and the rows for the forgotten transaction types will appear just like magic.

Another advantage: let us say the same table is used by different departments in your company. You could build several views, all looking at the same table, but exposing different transaction types to your clients.  

To view or add a comment, sign in

Others also viewed

Explore content categories