High performance data warehousing rule #9

Much of the work we do in the data warehouse world is terribly repetitious and boring as hell. I can think of no greater torture than writing one more ETL/ELT pipeline. Beyond that, even schema design in ODS structures is pure, unadulterated pain and repetition.


So don't do it. We have machines for this.


The basis for aggressive DW automation is business rules. If I've taken a business rules approach to my design, from the operational systems out, the rest is easy. We can catalog our business rules and attach metadata to them (physical field name, relationship name, constraint construction, etc.), either in our host DBMS's data catalog or in a repository of our own built from user tables.
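As a minimal sketch of what one catalog entry might hold, here is a hypothetical rule record in Python. The field names (rule_id, physical_field, constraint_sql, and so on) are illustrative assumptions, not a standard; they stand in for whatever metadata your own repository tracks.

```python
from dataclasses import dataclass
from typing import Optional

# One business-rule catalog entry. In practice these rows would live in
# the DBMS data catalog or in user tables; a dataclass keeps the sketch simple.
@dataclass
class BusinessRule:
    rule_id: str                    # stable identifier for the rule
    entity: str                     # logical entity the rule governs
    physical_field: str             # physical column name in the target schema
    data_type: str                  # target DBMS data type
    relationship: Optional[str]     # FK relationship name, if any
    constraint_sql: Optional[str]   # constraint construction, e.g. a CHECK predicate

# Two illustrative catalog rows: a constrained key and a foreign-key relationship.
CATALOG = [
    BusinessRule("R001", "customer", "customer_id", "BIGINT", None, "customer_id > 0"),
    BusinessRule("R002", "orders", "customer_id", "BIGINT", "fk_orders_customer", None),
]
```

Whether you store this in vendor catalog extensions or plain user tables matters less than keeping it the single source of truth the generators read from.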


Once here, defining a source-system-to-ODS (or even DW) pipeline is simple: we generate the code from these catalog elements. As changes are requested by the business, we simply adjust the business rules in our catalog, run the generator, compare the output to prod, and push. All necessary artifacts can be automated this way. If your target platform doesn't support DRI, we can generate triggers from the business rules; if it doesn't support triggers, we can add constraint sanity-check code to the transform logic.
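The "run the generator" step can be sketched in a few lines. This is a hedged toy example, not a real product: rules are plain dicts with assumed keys (entity, physical_field, data_type, constraint_sql), and the DRI fallback simply collects constraints as transform-time checks instead of emitting them as DDL.

```python
from typing import Dict, List, Tuple

def generate_ddl(rules: List[dict], supports_dri: bool = True) -> Tuple[List[str], List[str]]:
    """Group catalog rules by entity and emit CREATE TABLE DDL.

    When the target platform lacks declarative referential integrity (DRI),
    constraints are returned as sanity checks for the transform logic instead.
    """
    tables: Dict[str, List[str]] = {}
    checks: List[str] = []  # transform-time checks when DRI is unavailable
    for r in rules:
        col = f"{r['physical_field']} {r['data_type']}"
        if r.get("constraint_sql"):
            if supports_dri:
                col += f" CHECK ({r['constraint_sql']})"
            else:
                # Fall back: enforce the rule in the transform step instead.
                checks.append(f"row check on {r['physical_field']}: {r['constraint_sql']}")
        tables.setdefault(r["entity"], []).append(col)
    ddl = [f"CREATE TABLE {name} (\n  " + ",\n  ".join(cols) + "\n);"
           for name, cols in tables.items()]
    return ddl, checks

# Usage: one rule, on a platform without DRI.
rules = [{"entity": "customer", "physical_field": "customer_id",
          "data_type": "BIGINT", "constraint_sql": "customer_id > 0"}]
ddl, checks = generate_ddl(rules, supports_dri=False)
```

A real generator would also emit indexes, triggers, and the ETL/ELT code itself, but the shape is the same: catalog in, artifacts out, then diff against prod before pushing.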


The benefits to this process are:

  • You'll save a LOT of manpower.
  • You'll have better buy-in from the business, as your business rules MUST accurately describe their requests.
  • You'll stop chasing bugs and anomalies. Computers either don't make mistakes, or they destroy the whole system and you'll notice.


There are full system generators available in the commercial market, and if they fit your need, I'd highly suggest giving them a try. If you have unique needs, your own can be built out in a few months. Start small: generate tables, fields, and relationships first. You can always do the rest by hand. Once there, start automating constraints and the like. Then it's off to generating ETL/ELT code.

#datawarehousing #dataengineering #dataarchitecture


Conceptually it looks fantastic: easy, scalable, and perfect. But I was involved in multiple such projects where we tried this approach for whole banking or telecom organisations (2-3 million files, 100-200k tables), and I would like to share some challenges: 1) ownership of such metadata becomes a huge challenge between IT and the business; 2) heavy dependency on the developers who initially wrote the code or wrapper; once it's developed, these folks leave for greater success and the project ends up on the edge of failure; 3) it took longer than expected due to too much non-technical politics, easily 1-2 years; 4) organisations believe it is cheaper to hire coders and design and write this in-house instead of relying on the market.


What resources do you recommend for capturing business rules?


My general house rule: if something gets done once, it's a one-off. Document it. If it will be done twice, it gets automated.

Do you have any examples of good generators?



More articles by Robert Harmon
