Data Vault modeling and Benefits (and SQL Scripts for illustration)
Source: https://www.databricks.com/glossary/data-vault

Data Vault modeling and Benefits (and SQL Scripts for illustration)

Data vault is a data modeling design pattern used to build a data warehouse for enterprise-scale analytics. The data vault has three types of entities: hubs, links, and satellites.

Hubs represent core business concepts, links represent relationships between hubs, and satellites store information about hubs and relationships between them.

-- 𝗦𝗤𝗟 𝘀𝗰𝗿𝗶𝗽𝘁𝘀 𝗳𝗼𝗿 𝗶𝗹𝗹𝘂𝘀𝘁𝗿𝗮𝘁𝗶𝗼𝗻
-----------in Silver layer----------- 
-- Hub for Customer
CREATE TABLE IF NOT EXISTS H_Customer
(
   customer_hk STRING NOT NULL COMMENT 'MD5(customer_id)',
   customer_id INT NOT NULL,
   load_timestamp TIMESTAMP NOT NULL,
   record_source STRING,
   CONSTRAINT pk_h_customer PRIMARY KEY (customer_hk)
);

-- Hub for Orders
CREATE TABLE IF NOT EXISTS H_Order
(
   order_hk STRING NOT NULL COMMENT 'MD5(order_id)',
   order_id INT NOT NULL,
   load_timestamp TIMESTAMP NOT NULL,
   record_source STRING,
   CONSTRAINT pk_h_order PRIMARY KEY (order_hk)
 );

-- Creates a link table to map customers to their orders with a hashed primary key
CREATE TABLE IF NOT EXISTS L_Customer_Order
(
   customer_order_hk STRING NOT NULL COMMENT 'MD5(customer_hk||order_hk)',
   customer_hk STRING NOT NULL,
   order_hk STRING NOT NULL,
   load_timestamp TIMESTAMP NOT NULL,
   record_source STRING,
   CONSTRAINT pk_l_customer_order PRIMARY KEY (customer_order_hk)
 );

-- Satellite for Customer Descriptive Info
CREATE TABLE IF NOT EXISTS S_Customer
(
   customer_hk STRING NOT NULL,
   hash_diff STRING NOT NULL COMMENT 'MD5 of all descriptive columns',
   name STRING,
   address STRING,
   nation_key INT,
   phone STRING,
   acct_bal DECIMAL(12,2),
   market_segment STRING,
   comment STRING,
   load_timestamp TIMESTAMP NOT NULL,
   record_source STRING,
   CONSTRAINT pk_s_customer PRIMARY KEY (customer_hk, load_timestamp)
);

-- Satellite for Order Descriptive Info
CREATE TABLE IF NOT EXISTS S_Order
(
   order_hk STRING NOT NULL,
   hash_diff STRING NOT NULL COMMENT 'MD5 of all descriptive columns',
   order_status STRING,
   total_price DECIMAL(12,2),
   order_date DATE,
   order_priority STRING,
   clerk STRING,
   ship_priority INT,
   comment STRING,
   load_timestamp TIMESTAMP NOT NULL,
  record_source STRING,
   CONSTRAINT pk_s_order PRIMARY KEY (order_hk, load_timestamp)
);

-----------in Gold layer----------- 
-- Combining Customer and Order details 
CREATE OR REPLACE MATERIALIZED VIEW gold.BV_Customer_Order AS
 SELECT 
     hc.customer_id,
     sc.name AS customer_name,
     sc.address AS customer_address,
     ho.order_id,
     so.order_date,
     so.total_price,
     so.order_status
 FROM 
     silver.H_Customer hc
 JOIN 
     silver.S_Customer sc ON hc.customer_hk = sc.customer_hk
 JOIN 
     silver.L_Customer_Order lco ON hc.customer_hk = lco.customer_hk
 JOIN 
     silver.H_Order ho ON lco.order_hk = ho.order_hk
 JOIN 
     silver.S_Order so ON ho.order_hk = so.order_hk;        

𝗞𝗲𝘆 𝗕𝗲𝗻𝗲𝗳𝗶𝘁𝘀:

* 𝗦𝗰𝗮𝗹𝗮𝗯𝗶𝗹𝗶𝘁𝘆: The architecture is built to handle massive data growth, from ingestion to management, and supports large and complex data environments. 

* 𝗙𝗹𝗲𝘅𝗶𝗯𝗶𝗹𝗶𝘁𝘆 & 𝗔𝗱𝗮𝗽𝘁𝗮𝗯𝗶𝗹𝗶𝘁𝘆: Data Vault can easily incorporate new data sources or change business requirements without requiring a complete overhaul of the existing structure. 

* 𝗔𝘂𝗱𝗶𝘁𝗮𝗯𝗶𝗹𝗶𝘁𝘆 & 𝗧𝗿𝗮𝗰𝗲𝗮𝗯𝗶𝗹𝗶𝘁𝘆: It provides a comprehensive historical record of all data changes, ensuring a strong audit trail and data lineage, which is vital for compliance. 

* 𝗔𝗴𝗶𝗹𝗶𝘁𝘆: The modular design allows for agile, iterative development, enabling faster integration of new datasets and quicker time-to-insight. 

* 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹 𝗣𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴: The separation of data into distinct components (hubs, links, satellites) allows for concurrent data loading and extraction, significantly improving performance. 

* 𝗗𝗮𝘁𝗮 𝗤𝘂𝗮𝗹𝗶𝘁𝘆 & 𝗚𝗼𝘃𝗲𝗿𝗻𝗮𝗻𝗰𝗲: By capturing raw data and implementing standardized patterns, it supports robust data quality and governance practices throughout the data lifecycle. 

* 𝗔𝘂𝘁𝗼𝗺𝗮𝘁𝗶𝗼𝗻 & 𝗦𝘁𝗮𝗻𝗱𝗮𝗿𝗱𝗶𝘇𝗮𝘁𝗶𝗼𝗻: Data Vault uses repeatable patterns and standardized components, which enables the automation of data warehouse engineering and reduces boilerplate code. 

* 𝗘𝗮𝘀𝗲 𝗼𝗳 𝗜𝗻𝘁𝗲𝗴𝗿𝗮𝘁𝗶𝗼𝗻: New systems can be integrated easily, as each can be modeled independently and then integrated into the existing Data Vault. 

* 𝗛𝗶𝘀𝘁𝗼𝗿𝗶𝗰𝗮𝗹 𝗗𝗮𝘁𝗮 𝗣𝗿𝗲𝘀𝗲𝗿𝘃𝗮𝘁𝗶𝗼𝗻: Satellite tables specifically store changes over time, providing a complete historical perspective on data. 

To view or add a comment, sign in

More articles by Quoc Nguyen

Explore content categories