Data Vault modeling and Benefits (and SQL Scripts for illustration)
Data vault is a data modeling design pattern used to build a data warehouse for enterprise-scale analytics. The data vault has three types of entities: hubs, links, and satellites.
Hubs represent core business concepts, links represent relationships between hubs, and satellites store information about hubs and relationships between them.
-- 𝗦𝗤𝗟 𝘀𝗰𝗿𝗶𝗽𝘁𝘀 𝗳𝗼𝗿 𝗶𝗹𝗹𝘂𝘀𝘁𝗿𝗮𝘁𝗶𝗼𝗻
-----------in Silver layer-----------
-- Hub for Customer
CREATE TABLE IF NOT EXISTS H_Customer
(
customer_hk STRING NOT NULL COMMENT 'MD5(customer_id)',
customer_id INT NOT NULL,
load_timestamp TIMESTAMP NOT NULL,
record_source STRING,
CONSTRAINT pk_h_customer PRIMARY KEY (customer_hk)
);
-- Hub for Orders
CREATE TABLE IF NOT EXISTS H_Order
(
order_hk STRING NOT NULL COMMENT 'MD5(order_id)',
order_id INT NOT NULL,
load_timestamp TIMESTAMP NOT NULL,
record_source STRING,
CONSTRAINT pk_h_order PRIMARY KEY (order_hk)
);
-- Creates a link table to map customers to their orders with a hashed primary key
CREATE TABLE IF NOT EXISTS L_Customer_Order
(
customer_order_hk STRING NOT NULL COMMENT 'MD5(customer_hk||order_hk)',
customer_hk STRING NOT NULL,
order_hk STRING NOT NULL,
load_timestamp TIMESTAMP NOT NULL,
record_source STRING,
CONSTRAINT pk_l_customer_order PRIMARY KEY (customer_order_hk)
);
-- Satellite for Customer Descriptive Info
CREATE TABLE IF NOT EXISTS S_Customer
(
customer_hk STRING NOT NULL,
hash_diff STRING NOT NULL COMMENT 'MD5 of all descriptive columns',
name STRING,
address STRING,
nation_key INT,
phone STRING,
acct_bal DECIMAL(12,2),
market_segment STRING,
comment STRING,
load_timestamp TIMESTAMP NOT NULL,
record_source STRING,
CONSTRAINT pk_s_customer PRIMARY KEY (customer_hk, load_timestamp)
);
-- Satellite for Order Descriptive Info
CREATE TABLE IF NOT EXISTS S_Order
(
order_hk STRING NOT NULL,
hash_diff STRING NOT NULL COMMENT 'MD5 of all descriptive columns',
order_status STRING,
total_price DECIMAL(12,2),
order_date DATE,
order_priority STRING,
clerk STRING,
ship_priority INT,
comment STRING,
load_timestamp TIMESTAMP NOT NULL,
record_source STRING,
CONSTRAINT pk_s_order PRIMARY KEY (order_hk, load_timestamp)
);
-----------in Gold layer-----------
-- Combining Customer and Order details
CREATE OR REPLACE MATERIALIZED VIEW gold.BV_Customer_Order AS
SELECT
hc.customer_id,
sc.name AS customer_name,
sc.address AS customer_address,
ho.order_id,
so.order_date,
so.total_price,
so.order_status
FROM
silver.H_Customer hc
JOIN
silver.S_Customer sc ON hc.customer_hk = sc.customer_hk
JOIN
silver.L_Customer_Order lco ON hc.customer_hk = lco.customer_hk
JOIN
silver.H_Order ho ON lco.order_hk = ho.order_hk
JOIN
silver.S_Order so ON ho.order_hk = so.order_hk;
𝗞𝗲𝘆 𝗕𝗲𝗻𝗲𝗳𝗶𝘁𝘀:
* 𝗦𝗰𝗮𝗹𝗮𝗯𝗶𝗹𝗶𝘁𝘆: The architecture is built to handle massive data growth, from ingestion to management, and supports large and complex data environments.
* 𝗙𝗹𝗲𝘅𝗶𝗯𝗶𝗹𝗶𝘁𝘆 & 𝗔𝗱𝗮𝗽𝘁𝗮𝗯𝗶𝗹𝗶𝘁𝘆: Data Vault can easily incorporate new data sources or change business requirements without requiring a complete overhaul of the existing structure.
* 𝗔𝘂𝗱𝗶𝘁𝗮𝗯𝗶𝗹𝗶𝘁𝘆 & 𝗧𝗿𝗮𝗰𝗲𝗮𝗯𝗶𝗹𝗶𝘁𝘆: It provides a comprehensive historical record of all data changes, ensuring a strong audit trail and data lineage, which is vital for compliance.
* 𝗔𝗴𝗶𝗹𝗶𝘁𝘆: The modular design allows for agile, iterative development, enabling faster integration of new datasets and quicker time-to-insight.
* 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹 𝗣𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴: The separation of data into distinct components (hubs, links, satellites) allows for concurrent data loading and extraction, significantly improving performance.
* 𝗗𝗮𝘁𝗮 𝗤𝘂𝗮𝗹𝗶𝘁𝘆 & 𝗚𝗼𝘃𝗲𝗿𝗻𝗮𝗻𝗰𝗲: By capturing raw data and implementing standardized patterns, it supports robust data quality and governance practices throughout the data lifecycle.
* 𝗔𝘂𝘁𝗼𝗺𝗮𝘁𝗶𝗼𝗻 & 𝗦𝘁𝗮𝗻𝗱𝗮𝗿𝗱𝗶𝘇𝗮𝘁𝗶𝗼𝗻: Data Vault uses repeatable patterns and standardized components, which enables the automation of data warehouse engineering and reduces boilerplate code.
* 𝗘𝗮𝘀𝗲 𝗼𝗳 𝗜𝗻𝘁𝗲𝗴𝗿𝗮𝘁𝗶𝗼𝗻: New systems can be integrated easily, as each can be modeled independently and then integrated into the existing Data Vault.
* 𝗛𝗶𝘀𝘁𝗼𝗿𝗶𝗰𝗮𝗹 𝗗𝗮𝘁𝗮 𝗣𝗿𝗲𝘀𝗲𝗿𝘃𝗮𝘁𝗶𝗼𝗻: Satellite tables specifically store changes over time, providing a complete historical perspective on data.